TY - GEN
T1 - Compact full-text indexing of versioned document collections
AU - He, Jinru
AU - Yan, Hao
AU - Suel, Torsten
PY - 2009
Y1 - 2009
N2 - We study the problem of creating highly compressed full-text index structures for versioned document collections, that is, collections that contain multiple versions of each document. Important examples of such collections are Wikipedia or the web page archive maintained by the Internet Archive. A straightforward indexing approach would simply treat each document version as a separate document, such that index size scales linearly with the number of versions. However, several authors have recently studied approaches that exploit the significant similarities between different versions of the same document to obtain much smaller index sizes. In this paper, we propose new techniques for organizing and compressing inverted index structures for such collections. We also perform a detailed experimental comparison of new techniques and the existing techniques in the literature. Our results on an archive of the English version of Wikipedia, and on a subset of the Internet Archive collection, show significant benefits over previous approaches.
AB - We study the problem of creating highly compressed full-text index structures for versioned document collections, that is, collections that contain multiple versions of each document. Important examples of such collections are Wikipedia or the web page archive maintained by the Internet Archive. A straightforward indexing approach would simply treat each document version as a separate document, such that index size scales linearly with the number of versions. However, several authors have recently studied approaches that exploit the significant similarities between different versions of the same document to obtain much smaller index sizes. In this paper, we propose new techniques for organizing and compressing inverted index structures for such collections. We also perform a detailed experimental comparison of new techniques and the existing techniques in the literature. Our results on an archive of the English version of Wikipedia, and on a subset of the Internet Archive collection, show significant benefits over previous approaches.
KW - Inverted index
KW - Inverted index compression
KW - Search engines
KW - Versioned documents
KW - Web archives
KW - Wikipedia
UR - http://www.scopus.com/inward/record.url?scp=74549161572&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=74549161572&partnerID=8YFLogxK
U2 - 10.1145/1645953.1646008
DO - 10.1145/1645953.1646008
M3 - Conference contribution
AN - SCOPUS:74549161572
SN - 9781605585123
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 415
EP - 424
BT - ACM 18th International Conference on Information and Knowledge Management, CIKM 2009
T2 - ACM 18th International Conference on Information and Knowledge Management, CIKM 2009
Y2 - 2 November 2009 through 6 November 2009
ER -