TY - GEN
T1 - Learning Passage Impacts for Inverted Indexes
AU - Mallia, Antonio
AU - Khattab, Omar
AU - Suel, Torsten
AU - Tonellotto, Nicola
N1 - Funding Information:
Acknowledgments: This research was partially supported by NSF Grant IIS-1718680 and CAREER grant CNS-1651570, affiliate members and other supporters of the Stanford DAWN project—Ant Financial, Facebook, Google, Infosys, NEC, and VMware—as well as Cisco and SAP, the Italian Ministry of Education and Research (MIUR) in the framework of the Cross-Lab project (Departments of Excellence), and by the University of Pisa in the framework of the AUTENS project (Sustainable Energy Autarky). We would like to thank Matei Zaharia for insightful discussions and feedback. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Publisher Copyright:
© 2021 ACM.
PY - 2021/7/11
Y1 - 2021/7/11
N2 - Neural information retrieval systems typically use a cascading pipeline, in which a first-stage model retrieves a candidate set of documents and one or more subsequent stages re-rank this set using contextualized language models such as BERT. In this paper, we propose DeepImpact, a new document term-weighting scheme suitable for efficient retrieval using a standard inverted index. Compared to existing methods, DeepImpact improves impact-score modeling and tackles the vocabulary-mismatch problem. In particular, DeepImpact leverages DocT5Query to enrich the document collection and, using a contextualized language model, directly estimates the semantic importance of tokens in a document, producing a single-value representation for each token in each document. Our experiments show that DeepImpact significantly outperforms prior first-stage retrieval approaches by up to 17% on effectiveness metrics w.r.t. DocT5Query, and, when deployed in a re-ranking scenario, can reach the same effectiveness as state-of-the-art approaches with up to a 5.1x speedup in efficiency.
AB - Neural information retrieval systems typically use a cascading pipeline, in which a first-stage model retrieves a candidate set of documents and one or more subsequent stages re-rank this set using contextualized language models such as BERT. In this paper, we propose DeepImpact, a new document term-weighting scheme suitable for efficient retrieval using a standard inverted index. Compared to existing methods, DeepImpact improves impact-score modeling and tackles the vocabulary-mismatch problem. In particular, DeepImpact leverages DocT5Query to enrich the document collection and, using a contextualized language model, directly estimates the semantic importance of tokens in a document, producing a single-value representation for each token in each document. Our experiments show that DeepImpact significantly outperforms prior first-stage retrieval approaches by up to 17% on effectiveness metrics w.r.t. DocT5Query, and, when deployed in a re-ranking scenario, can reach the same effectiveness as state-of-the-art approaches with up to a 5.1x speedup in efficiency.
KW - inverted index
KW - neural IR
KW - query processing
KW - term weighting
UR - http://www.scopus.com/inward/record.url?scp=85109405570&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85109405570&partnerID=8YFLogxK
U2 - 10.1145/3404835.3463030
DO - 10.1145/3404835.3463030
M3 - Conference contribution
AN - SCOPUS:85109405570
T3 - SIGIR 2021 - Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 1723
EP - 1727
BT - SIGIR 2021 - Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
PB - Association for Computing Machinery, Inc.
T2 - 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2021
Y2 - 11 July 2021 through 15 July 2021
ER -