Scalable manipulation of archival web graphs

Yasemin Avcular, Torsten Suel

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    In this paper, we study efficient ways to construct, represent, and analyze large-scale archival web graphs. We first discuss the details of a distributed graph construction algorithm implemented in MapReduce and the design of a space-efficient layered graph representation. In designing this representation, we consider both offline and online algorithms for graph analysis. Offline algorithms, such as PageRank, can use MapReduce and similar large-scale distributed frameworks for computation. Online algorithms, on the other hand, can perform their computations by tapping into a scalable repository (similar to DEC's Connectivity Server or the Scalable Hyperlink Store by Najork). We also consider updating the graph representation with the most recent information available and propose an efficient way to perform such updates using MapReduce. We survey various storage options and outline the essential API calls for a real-time access repository specific to archival web graphs. Finally, we conclude with a discussion of ideas for archival web graph analysis that may reveal novel patterns useful for designing state-of-the-art compression techniques.
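    As a rough illustration of the construction step mentioned in the abstract, the sketch below simulates a map/reduce pass in memory. It is not the authors' algorithm: the record format `(source, snapshot, targets)`, the function names, and the choice to store each snapshot as a delta (added and removed outlinks) against the previous one are all assumptions made here to show one plausible reading of a "layered" representation.

```python
from collections import defaultdict

def map_links(record):
    """Map step: emit (source, (snapshot, target)) pairs for one crawled page.

    The record layout (source, snapshot, targets) is a hypothetical format.
    A (snapshot, None) marker is emitted so that pages with no outlinks in a
    snapshot still register that snapshot in the reduce step.
    """
    source, snapshot, targets = record
    yield source, (snapshot, None)
    for target in targets:
        yield source, (snapshot, target)

def reduce_links(source, values):
    """Reduce step: turn all observations of one source into delta layers.

    Each layer is (snapshot, added_targets, removed_targets) relative to the
    previous snapshot. This delta layout is an assumption, not the paper's.
    """
    by_snapshot = defaultdict(set)
    for snapshot, target in values:
        targets = by_snapshot[snapshot]  # registers the snapshot even if empty
        if target is not None:
            targets.add(target)
    layers, previous = [], set()
    for snapshot in sorted(by_snapshot):
        current = by_snapshot[snapshot]
        layers.append((snapshot, sorted(current - previous), sorted(previous - current)))
        previous = current
    return source, layers

def run_job(records):
    """In-memory stand-in for the shuffle phase of a MapReduce job."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_links(record):
            grouped[key].append(value)
    return dict(reduce_links(key, values) for key, values in grouped.items())

def outlinks_at(layers, snapshot):
    """Replay the delta layers to recover a page's outlinks at a snapshot."""
    links = set()
    for snap, added, removed in layers:
        if snap > snapshot:
            break
        links.update(added)
        links.difference_update(removed)
    return links

records = [("a", 1, ["b", "c"]), ("a", 2, ["c", "d"]), ("b", 1, ["a"])]
graph = run_job(records)
```

    With these sample records, `outlinks_at(graph["a"], 2)` replays both layers for page `a`. Storing deltas rather than full adjacency lists is one way to save space when most links persist between crawls; whether the paper takes this approach is not stated in the abstract.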

    Original language: English (US)
    Title of host publication: CIKM 2011 Glasgow
    Subtitle of host publication: LSDS-IR'11 - Proceedings of the 9th Workshop on Large-Scale and Distributed Informational Retrieval
    Pages: 27-32
    Number of pages: 6
    State: Published - 2011
    Event: 9th Workshop on Large-Scale and Distributed Systems for Information Retrieval, LSDS-IR'11 - Glasgow, United Kingdom
    Duration: Oct 28 2011 → Oct 28 2011

    Publication series

    Name: International Conference on Information and Knowledge Management, Proceedings

    Other

    Other: 9th Workshop on Large-Scale and Distributed Systems for Information Retrieval, LSDS-IR'11
    Country/Territory: United Kingdom
    City: Glasgow
    Period: 10/28/11 → 10/28/11

    Keywords

    • archival web graphs
    • hadoop
    • mapreduce

    ASJC Scopus subject areas

    • General Decision Sciences
    • General Business, Management and Accounting
