TY - GEN
T1 - Optimizing positional index structures for versioned document collections
AU - He, Jinru
AU - Suel, Torsten
PY - 2012
Y1 - 2012
N2 - Versioned document collections are collections that contain multiple versions of each document. Important examples are Web archives, Wikipedia and other wikis, or source code and documents maintained in revision control systems. Versioned document collections can become very large, due to the need to retain past versions, but there is also a lot of redundancy between versions that can be exploited. Thus, versioned document collections are usually stored using special differential (delta) compression techniques, and a number of researchers have recently studied how to exploit this redundancy to obtain more succinct full-text index structures. In this paper, we study index organization and compression techniques for such versioned full-text index structures. In particular, we focus on the case of positional index structures, while most previous work has focused on the non-positional case. Building on earlier work in [zs:redun], we propose a framework for indexing and querying in versioned document collections that integrates non-positional and positional indexes to enable fast top-k query processing. Within this framework, we define and study the problem of minimizing positional index size through optimal substring partitioning. Experiments on Wikipedia and web archive data show that our techniques achieve significant reductions in index size over previous work while supporting very fast query processing.
AB - Versioned document collections are collections that contain multiple versions of each document. Important examples are Web archives, Wikipedia and other wikis, or source code and documents maintained in revision control systems. Versioned document collections can become very large, due to the need to retain past versions, but there is also a lot of redundancy between versions that can be exploited. Thus, versioned document collections are usually stored using special differential (delta) compression techniques, and a number of researchers have recently studied how to exploit this redundancy to obtain more succinct full-text index structures. In this paper, we study index organization and compression techniques for such versioned full-text index structures. In particular, we focus on the case of positional index structures, while most previous work has focused on the non-positional case. Building on earlier work in [zs:redun], we propose a framework for indexing and querying in versioned document collections that integrates non-positional and positional indexes to enable fast top-k query processing. Within this framework, we define and study the problem of minimizing positional index size through optimal substring partitioning. Experiments on Wikipedia and web archive data show that our techniques achieve significant reductions in index size over previous work while supporting very fast query processing.
KW - Inverted index
KW - index compression
KW - positional index structures
KW - redundancy elimination
KW - versioned documents
UR - http://www.scopus.com/inward/record.url?scp=84866620381&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84866620381&partnerID=8YFLogxK
U2 - 10.1145/2348283.2348319
DO - 10.1145/2348283.2348319
M3 - Conference contribution
AN - SCOPUS:84866620381
SN - 9781450316583
T3 - SIGIR'12 - Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 245
EP - 254
BT - SIGIR'12 - Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval
T2 - 35th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2012
Y2 - 12 August 2012 through 16 August 2012
ER -