SIGIR-AP 2023 Tutorial: Recent Advances in Generative Information Retrieval

About this tutorial

Generative retrieval (GR) has become a highly active area of information retrieval (IR) and has grown rapidly in recent years. In contrast to the traditional "index-retrieve-then-rank" pipeline, the GR paradigm aims to consolidate all information within a corpus into a single model. Typically, a sequence-to-sequence model is trained to directly map a query to its relevant document identifiers (i.e., docids). This tutorial offers an introduction to the core concepts of the GR paradigm and a comprehensive overview of recent advances in its foundations and applications.
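
As a rough illustration of the query-to-docid mapping described above, one common way to write the autoregressive formulation is sketched below. The notation is an assumption made here for illustration and does not come from the tutorial: q is a query, z = (z_1, ..., z_T) is the token sequence of a relevant docid, and θ denotes the model parameters.

```latex
% Probability of generating docid z for query q, factorized token by token
P_\theta(z \mid q) = \prod_{t=1}^{T} P_\theta(z_t \mid z_{<t}, q)

% Retrieval loss over training pairs (q, z) of queries and relevant docids
\mathcal{L}(\theta) = -\sum_{(q, z)} \log P_\theta(z \mid q)
```

Many approaches add an analogous indexing term that maps document content to the same docid, so that the corpus itself is stored in the model parameters.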

We start by providing preliminary information covering foundational aspects and problem formulations of GR. Then, our focus shifts towards recent progress in docid design, training approaches, inference strategies, and the applications of GR. We end by outlining remaining challenges and issuing a call for future GR research. This tutorial is intended to be beneficial to both researchers and industry practitioners interested in developing novel GR solutions or applying them in real-world scenarios.

Schedule

Our tutorial is scheduled for November 26th from 13:00 to 16:30 (GMT+8). Please note that the presentation slides may still be revised. [Slides]

Time	Section	Presenter
13:00 – 13:10	Section 1: Introduction	Maarten de Rijke
13:10 – 13:30	Section 2: Definition & Preliminaries	Jiafeng Guo
13:30 – 14:30	Section 3: Docid design	Yubao Tang
14:30 – 14:45	Coffee break (15 min)
14:45 – 15:20	Section 4: Training approaches	Ruqing Zhang
15:20 – 15:40	Section 5: Inference strategies	Ruqing Zhang
15:40 – 16:00	Section 6: Applications	Yubao Tang
16:00 – 16:10	Section 7: Challenges & Opportunities	Maarten de Rijke
16:10 – 16:30	Q & A	All

Reading List

The tutorial extensively covers papers highlighted in bold.

Section 3: Docid design

3.1 Pre-defined docids

3.1.1 A single docid represents a document

3.1.1.1 Number-based docids

Unstructured atomic integers

Transformer Memory as a Differentiable Search Index (Tay et al. 2022)
DynamicRetriever: A Pre-trained Model-based IR System Without an Explicit Index (Zhou et al. 2023)
Generative Retrieval as Dense Retrieval (Nguyen and Yates 2023)
Ultron: An ultimate retriever on corpus with a model-based indexer (Zhou et al. 2022)
CodeDSI: Differentiable Code Search (Nadeem et al. 2022)
DSI++: Updating Transformer Memory with New Documents (Mehta et al. 2022)

Naively structured strings

Transformer Memory as a Differentiable Search Index (Tay et al. 2022)
Bridging the Gap between Indexing and Retrieval for Differentiable Search Index with Query Generation (Zhuang et al. 2023)
CodeDSI: Differentiable Code Search (Nadeem et al. 2022)

Semantically structured strings

Transformer Memory as a Differentiable Search Index (Tay et al. 2022)
A Neural Corpus Indexer for Document Retrieval (Wang et al. 2022)
Understanding Differential Search Index for Text Retrieval (Chen et al. 2023c)
CodeDSI: Differentiable Code Search (Nadeem et al. 2022)

Product quantization strings

Ultron: An ultimate retriever on corpus with a model-based indexer (Zhou et al. 2022)
Continual Learning for Generative Retrieval over Dynamic Corpora (Chen et al. 2023a)
Recommender Systems with Generative Retrieval (Rajput et al. 2023)

3.1.1.2 Word-based docids

Titles

Autoregressive Entity Retrieval (De Cao et al. 2021)
CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks (Chen et al. 2022b)
GERE: Generative evidence retrieval for fact verification (Chen et al. 2022a)
Data-efficient Autoregressive Document Retrieval for Fact Verification (Thorne et al. 2022)
Ultron: An ultimate retriever on corpus with a model-based indexer (Zhou et al. 2022)
Generative Multi-hop Retrieval (Lee et al. 2022)
Nonparametric Decoding for Generative Retrieval (Lee et al. 2023)
Multiview Identifiers Enhanced Generative Retrieval (Li et al. 2023)

URLs

Ultron: An ultimate retriever on corpus with a model-based indexer (Zhou et al. 2022)
TOME: A Two-stage Approach for Model-based Retrieval (Ren et al. 2023)
Data-efficient Autoregressive Document Retrieval for Fact Verification (Thorne et al. 2022)

Pseudo queries

Semantic-Enhanced Differentiable Search Index Inspired by Learning Strategies (Tang et al. 2023a)
Multiview Identifiers Enhanced Generative Retrieval (Li et al. 2023)

Important terms

Term-Sets Can Be Strong Document Identifiers For Auto-Regressive Search Engines (Zhang et al. 2023)

3.1.2 Multiple docids represent a document

Section 4: Training approaches

4.1 Stationary scenarios

4.1.1 Supervised learning

Transformer Memory as a Differentiable Search Index (Tay et al. 2022)
Semantic-Enhanced Differentiable Search Index Inspired by Learning Strategies (Tang et al. 2023a)
Bridging the Gap between Indexing and Retrieval for Differentiable Search Index with Query Generation (Zhuang et al. 2023)
A Neural Corpus Indexer for Document Retrieval (Wang et al. 2022)
How Does Generative Retrieval Scale to Millions of Passages? (Pradeep et al. 2023)

4.1.2 Pre-training

CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks (Chen et al. 2022b)

4.1.3 Listwise optimization

Listwise Generative Retrieval Models via a Sequential Learning Process (Major revision) (Tang et al. 2023b)

4.2 Dynamic scenarios

DSI++: Updating Transformer Memory with New Documents (Mehta et al. 2022)
Continual Learning for Generative Retrieval over Dynamic Corpora (Chen et al. 2023a)
IncDSI: Incrementally Updatable Document Retrieval (Kishore et al. 2023)

Section 5: Inference strategies

5.1 A single docid represents a document

Constrained beam search with prefix tree

Autoregressive Entity Retrieval (De Cao et al. 2021)
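
To make the idea behind this heading concrete, here is a minimal sketch of beam search constrained by a prefix tree, so that only docids that exist in the corpus can be generated. It is not the implementation from the paper listed above. The function names, the `<eos>` marker, the example docids, and the `toy_score` function (a hypothetical stand-in for a trained decoder's token log-probabilities) are all assumptions made for illustration.

```python
import math

EOS = "<eos>"  # assumed end-of-docid marker


def build_trie(docids):
    """Build a nested-dict prefix tree over docid token sequences."""
    root = {}
    for docid in docids:
        node = root
        for tok in list(docid) + [EOS]:
            node = node.setdefault(tok, {})
    return root


def allowed_next(trie, prefix):
    """Return the tokens that may follow `prefix` according to the trie."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return []
    return list(node.keys())


def constrained_beam_search(trie, score_fn, beam_size, max_len):
    """Beam search that only expands prefixes of valid docids.

    score_fn(prefix, tok) returns a log-probability for `tok` given `prefix`.
    Returns up to `beam_size` finished (docid_tokens, score) pairs, best first.
    """
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len + 1):  # +1 step to emit EOS after max_len tokens
        candidates = []
        for prefix, score in beams:
            for tok in allowed_next(trie, prefix):
                new_score = score + score_fn(prefix, tok)
                if tok == EOS:
                    finished.append((prefix, new_score))
                else:
                    candidates.append((prefix + (tok,), new_score))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_size]


def toy_score(prefix, tok):
    """Hypothetical decoder: favours spelling out '213', then stopping."""
    preferred = "213"
    if tok == EOS:
        return math.log(0.9)
    if len(prefix) < len(preferred) and tok == preferred[len(prefix)]:
        return math.log(0.6)
    return math.log(0.2)


docids = ["213", "217", "105", "21"]  # illustrative docids, one char per token
trie = build_trie(docids)
results = constrained_beam_search(trie, toy_score, beam_size=2, max_len=3)
print(["".join(tokens) for tokens, _ in results])  # ['21', '213']
```

Because every expansion is looked up in the trie, the search can never emit a docid that is absent from the corpus, whatever scores the model assigns. In a real system, the scoring function would be the decoder of the sequence-to-sequence model conditioned on the query.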

Constrained greedy search with inverted index

Term-Sets Can Be Strong Document Identifiers For Auto-Regressive Search Engines (Zhang et al. 2023)

5.2 Multiple docids represent a document

Constrained beam search with FM-index

Autoregressive Search Engines: Generating Substrings as Document Identifiers (Bevilacqua et al. 2022)

Aggregation functions

Autoregressive Search Engines: Generating Substrings as Document Identifiers (Bevilacqua et al. 2022)
Multiview Identifiers Enhanced Generative Retrieval (Li et al. 2023)

Section 6: Applications

6.1 Knowledge-intensive language tasks (KILT)

Autoregressive Entity Retrieval (De Cao et al. 2021)
GERE: Generative evidence retrieval for fact verification (Chen et al. 2022a)
Autoregressive Search Engines: Generating Substrings as Document Identifiers (Bevilacqua et al. 2022)
A Unified Generative Retriever for Knowledge-Intensive Language Tasks via Prompt Learning (Chen et al. 2023b)
CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks (Chen et al. 2022b)
Data-efficient Autoregressive Document Retrieval for Fact Verification (Thorne et al. 2022)

6.2 Multi-hop retrieval

Generative Multi-hop Retrieval (Lee et al. 2022)

6.3 Recommendation

Recommender Systems with Generative Retrieval (Rajput et al. 2023)
Generative Retrieval with Semantic Tree-Structured Item Identifiers via Contrastive Learning (Si et al. 2023)

6.4 Code retrieval

CodeDSI: Differentiable Code Search (Nadeem et al. 2022)

Available code

Autoregressive Entity Retrieval (De Cao et al. 2021)
CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks (Chen et al. 2022b)
GERE: Generative evidence retrieval for fact verification (Chen et al. 2022a)
Autoregressive Search Engines: Generating Substrings as Document Identifiers (Bevilacqua et al. 2022)
Multiview Identifiers Enhanced Generative Retrieval (Li et al. 2023)
Continual Learning for Generative Retrieval over Dynamic Corpora (Chen et al. 2023a)
Nonparametric Decoding for Generative Retrieval (Lee et al. 2023)
Bridging the Gap between Indexing and Retrieval for Differentiable Search Index with Query Generation (Zhuang et al. 2023)
A Neural Corpus Indexer for Document Retrieval (Wang et al. 2022)
A Unified Generative Retriever for Knowledge-Intensive Language Tasks via Prompt Learning (Chen et al. 2023b)
Understanding Differential Search Index for Text Retrieval (Chen et al. 2023c)
Generative Multi-hop Retrieval (Lee et al. 2022)

BibTeX

@inproceedings{tang-2023-recent,
      author = {Tang, Yubao and Zhang, Ruqing and Guo, Jiafeng and de Rijke, Maarten},
      booktitle = {SIGIR-AP 2023: 1st International ACM SIGIR Conference on Information Retrieval in the Asia Pacific},
      date-added = {2023-10-07 17:24:48 +0200},
      date-modified = {2023-10-07 17:26:24 +0200},
      month = {November},
      publisher = {ACM},
      title = {Recent Advances in Generative Information Retrieval},
      year = {2023}
}


Yubao Tang¹, Ruqing Zhang¹, Jiafeng Guo¹, Maarten de Rijke²