TY - GEN
T1 - Inverted index compression and query processing with optimized document ordering
AU - Yan, Hao
AU - Ding, Shuai
AU - Suel, Torsten
PY - 2009
Y1 - 2009
N2 - Web search engines use highly optimized compression schemes to decrease inverted index size and improve query through-put, and many index compression techniques have been studied in the literature. One approach taken by several recent studies [7, 23, 25, 6, 24] first performs a renumbering of the document IDs in the collection that groups similar documents together, and then applies standard compression techniques. It is known that this can significantly improve index compression compared to a random document ordering. We study index compression and query processing techniques for such reordered indexes. Previous work has focused on determining the best possible ordering of documents. In contrast, we assume that such an ordering is already given, and focus on how to optimize compression methods and query processing for this case. We perform an extensive study of compression techniques for document IDs and present new optimizations of existing techniques which can achieve significant improvement in both compression and decompression performances. We also propose and evaluate techniques for compressing frequency values for this case. Finally, we study the effect of this approach on query processing performance. Our experiments show very significant improvements in index size and query processing speed on the TREC GOV2 collection of 25.2 million web pages. Copyright is held by the International World Wide Web Conference Committee (IW3C2).
AB - Web search engines use highly optimized compression schemes to decrease inverted index size and improve query through-put, and many index compression techniques have been studied in the literature. One approach taken by several recent studies [7, 23, 25, 6, 24] first performs a renumbering of the document IDs in the collection that groups similar documents together, and then applies standard compression techniques. It is known that this can significantly improve index compression compared to a random document ordering. We study index compression and query processing techniques for such reordered indexes. Previous work has focused on determining the best possible ordering of documents. In contrast, we assume that such an ordering is already given, and focus on how to optimize compression methods and query processing for this case. We perform an extensive study of compression techniques for document IDs and present new optimizations of existing techniques which can achieve significant improvement in both compression and decompression performances. We also propose and evaluate techniques for compressing frequency values for this case. Finally, we study the effect of this approach on query processing performance. Our experiments show very significant improvements in index size and query processing speed on the TREC GOV2 collection of 25.2 million web pages. Copyright is held by the International World Wide Web Conference Committee (IW3C2).
KW - Document ordering
KW - IR query processing
KW - Index compression
KW - Inverted index
KW - Search engines
UR - http://www.scopus.com/inward/record.url?scp=84865646680&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84865646680&partnerID=8YFLogxK
U2 - 10.1145/1526709.1526764
DO - 10.1145/1526709.1526764
M3 - Conference contribution
AN - SCOPUS:84865646680
SN - 9781605584874
T3 - WWW'09 - Proceedings of the 18th International World Wide Web Conference
SP - 401
EP - 410
BT - WWW'09 - Proceedings of the 18th International World Wide Web Conference
T2 - 18th International World Wide Web Conference, WWW 2009
Y2 - 20 April 2009 through 24 April 2009
ER -