Cleaning search results using term distance features

Josh Attenberg, Torsten Suel

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    Abstract

    The presence of Web spam in query results is one of the critical challenges facing search engines today. While search engines try to combat the impact of spam pages on their results, the incentive for spammers to use increasingly sophisticated techniques has never been higher, since the commercial success of a Web page is strongly correlated with the number of views that page receives. This paper describes a term-based technique for spam detection based on a simple new summary data structure called Term Distance Histograms, which tries to capture the topical structure of a page. We apply this technique as a post-filtering step to a major search engine. Our experiments show that we are able to detect many of the artificially generated spam pages that remained in the results of the engine. Specifically, our method is able to detect many Web pages generated using techniques such as dumping, weaving, or phrase stitching [11], which are spamming techniques designed to achieve high rankings while still exhibiting many of the individual word frequency (and even bi-gram) properties of natural human text.

    Original language: English (US)
    Title of host publication: AIRWeb 2008 - Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web
    Pages: 21-24
    Number of pages: 4
    State: Published - 2008
    Event: 4th International Workshop on Adversarial Information Retrieval on the Web, AIRWeb 2008 - Beijing, China
    Duration: Apr 22 2008 → Apr 22 2008

    Publication series

    Name: AIRWeb 2008 - Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web

    Other

    Other: 4th International Workshop on Adversarial Information Retrieval on the Web, AIRWeb 2008
    Country: China
    City: Beijing
    Period: 4/22/08 → 4/22/08

    ASJC Scopus subject areas

    • Computer Networks and Communications
    • Information Systems
