TY - GEN
T1 - Cleaning search results using term distance features
AU - Attenberg, Josh
AU - Suel, Torsten
PY - 2008
Y1 - 2008
N2 - The presence of Web spam in query results is one of the critical challenges facing search engines today. While search engines try to combat the impact of spam pages on their results, the incentive for spammers to use increasingly sophisticated techniques has never been higher, since the commercial success of a Web page is strongly correlated to the number of views that page receives. This paper describes a term-based technique for spam detection based on a simple new summary data structure called Term Distance Histograms that tries to capture the topical structure of a page. We apply this technique as a post-filtering step to a major search engine. Our experiments show that we are able to detect many of the artificially generated spam pages that remained in the results of the engine. Specifically, our method is able to detect many web pages generated by utilizing techniques such as dumping, weaving, or phrase stitching [11], which are spamming techniques designed to achieve high rankings while still exhibiting many of the individual word frequency (and even bi-gram) properties of natural human text.
AB - The presence of Web spam in query results is one of the critical challenges facing search engines today. While search engines try to combat the impact of spam pages on their results, the incentive for spammers to use increasingly sophisticated techniques has never been higher, since the commercial success of a Web page is strongly correlated to the number of views that page receives. This paper describes a term-based technique for spam detection based on a simple new summary data structure called Term Distance Histograms that tries to capture the topical structure of a page. We apply this technique as a post-filtering step to a major search engine. Our experiments show that we are able to detect many of the artificially generated spam pages that remained in the results of the engine. Specifically, our method is able to detect many web pages generated by utilizing techniques such as dumping, weaving, or phrase stitching [11], which are spamming techniques designed to achieve high rankings while still exhibiting many of the individual word frequency (and even bi-gram) properties of natural human text.
UR - http://www.scopus.com/inward/record.url?scp=63049116995&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=63049116995&partnerID=8YFLogxK
U2 - 10.1145/1451983.1451989
DO - 10.1145/1451983.1451989
M3 - Conference contribution
AN - SCOPUS:63049116995
SN - 9781605581590
T3 - AIRWeb 2008 - Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web
SP - 21
EP - 24
BT - AIRWeb 2008 - Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web
T2 - 4th International Workshop on Adversarial Information Retrieval on the Web, AIRWeb 2008
Y2 - 22 April 2008 through 22 April 2008
ER -