TY - GEN
T1 - Compressing term positions in web indexes
AU - Yan, Hao
AU - Ding, Shuai
AU - Suel, Torsten
PY - 2009
Y1 - 2009
AB - Large search engines process thousands of queries per second on billions of pages, making query processing a major factor in their operating costs. This has led to a lot of research on how to improve query throughput, using techniques such as massive parallelism, caching, early termination, and inverted index compression. We focus on techniques for compressing term positions in web search engine indexes. Most previous work has focused on compressing docID and frequency data, or position information in other types of text collections. Compression of term positions in web pages is complicated by the fact that term occurrences tend to cluster within documents but not across document boundaries, making it harder to exploit clustering effects. Also, typical access patterns for position data are different from those for docID and frequency data. We perform a detailed study of a number of existing and new techniques for compressing position data in web indexes. We also study how to efficiently access position data for ranking functions that take proximity features into account.
KW - Index compression
KW - Inverted index
KW - Search engines
UR - http://www.scopus.com/inward/record.url?scp=72449208572&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=72449208572&partnerID=8YFLogxK
U2 - 10.1145/1571941.1571969
DO - 10.1145/1571941.1571969
M3 - Conference contribution
AN - SCOPUS:72449208572
SN - 9781605584836
T3 - Proceedings - 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009
SP - 147
EP - 154
BT - Proceedings - 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009
T2 - 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009
Y2 - 19 July 2009 through 23 July 2009
ER -