TY - GEN
T1 - Scalable manipulation of archival web graphs
AU - Avcular, Yasemin
AU - Suel, Torsten
PY - 2011
Y1 - 2011
N2 - In this paper, we study efficient ways to construct, represent and analyze large-scale archival web graphs. We first discuss details of the distributed graph construction algorithm implemented in MapReduce and the design of a space-efficient layered graph representation. While designing this representation, we consider both offline and online algorithms for the graph analysis. The offline algorithms, such as PageRank, can use MapReduce and similar large-scale, distributed frameworks for computation. On the other side, online algorithms can be implemented by tapping into a scalable repository (similar to DEC's Connectivity Server or Scalable Hyperlink Store by Najork), in order to perform the computations. Moreover, we also consider updating the graph representation with the most recent information available and propose an efficient way to perform updates using MapReduce. We survey various storage options and outline essential API calls for the archival web graph specific real-time access repository. Finally, we conclude with a discussion of ideas for interesting archival web graph analysis that can lead us to discover novel patterns for designing state-of-art compression techniques.
AB - In this paper, we study efficient ways to construct, represent and analyze large-scale archival web graphs. We first discuss details of the distributed graph construction algorithm implemented in MapReduce and the design of a space-efficient layered graph representation. While designing this representation, we consider both offline and online algorithms for the graph analysis. The offline algorithms, such as PageRank, can use MapReduce and similar large-scale, distributed frameworks for computation. On the other side, online algorithms can be implemented by tapping into a scalable repository (similar to DEC's Connectivity Server or Scalable Hyperlink Store by Najork), in order to perform the computations. Moreover, we also consider updating the graph representation with the most recent information available and propose an efficient way to perform updates using MapReduce. We survey various storage options and outline essential API calls for the archival web graph specific real-time access repository. Finally, we conclude with a discussion of ideas for interesting archival web graph analysis that can lead us to discover novel patterns for designing state-of-art compression techniques.
KW - archival web graphs
KW - hadoop
KW - mapreduce
UR - http://www.scopus.com/inward/record.url?scp=83255193379&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=83255193379&partnerID=8YFLogxK
U2 - 10.1145/2064730.2064739
DO - 10.1145/2064730.2064739
M3 - Conference contribution
AN - SCOPUS:83255193379
SN - 9781450309592
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 27
EP - 32
BT - CIKM 2011 Glasgow
T2 - 9th Workshop on Large-Scale and Distributed Systems for Information Retrieval, LSDS-IR'11
Y2 - 28 October 2011 through 28 October 2011
ER -