TY - GEN
T1 - Scalable techniques for document identifier assignment in inverted indexes
AU - Ding, Shuai
AU - Attenberg, Josh
AU - Suel, Torsten
PY - 2010
Y1 - 2010
N2 - Web search engines depend on the full-text inverted index data structure. Because the query processing performance is so dependent on the size of the inverted index, a plethora of research has focused on fast end effective techniques for compressing this structure. Recently, several authors have proposed techniques for improving index compression by optimizing the assignment of document identifiers to the documents in the collection, leading to significant reduction in overall index size. In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on the Traveling Salesman Problem or on graph partitioning. These techniques achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Traveling Salesman computation on a reduced sparse graph obtained through Locality Sensitive Hashing. This technique achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe a number of new algorithms, and perform a detailed evaluation on three large data sets showing improvements in index size.
AB - Web search engines depend on the full-text inverted index data structure. Because the query processing performance is so dependent on the size of the inverted index, a plethora of research has focused on fast end effective techniques for compressing this structure. Recently, several authors have proposed techniques for improving index compression by optimizing the assignment of document identifiers to the documents in the collection, leading to significant reduction in overall index size. In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on the Traveling Salesman Problem or on graph partitioning. These techniques achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Traveling Salesman computation on a reduced sparse graph obtained through Locality Sensitive Hashing. This technique achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe a number of new algorithms, and perform a detailed evaluation on three large data sets showing improvements in index size.
KW - documentID reassignment
KW - index compression
UR - http://www.scopus.com/inward/record.url?scp=77954565988&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77954565988&partnerID=8YFLogxK
U2 - 10.1145/1772690.1772723
DO - 10.1145/1772690.1772723
M3 - Conference contribution
AN - SCOPUS:77954565988
SN - 9781605587998
T3 - Proceedings of the 19th International Conference on World Wide Web, WWW '10
SP - 311
EP - 320
BT - Proceedings of the 19th International Conference on World Wide Web, WWW '10
T2 - 19th International World Wide Web Conference, WWW2010
Y2 - 26 April 2010 through 30 April 2010
ER -