TY - GEN
T1 - A fast and robust method for web page template detection and removal
AU - Vieira, Karane
AU - Da Silva, Altigran S.
AU - Pinto, Nick
AU - De Moura, Edleno S.
AU - Cavalcanti, João M. B.
AU - Freire, Juliana
PY - 2006
Y1 - 2006
N2 - The widespread use of templates on the Web is considered harmful for two main reasons. Not only do they compromise the relevance judgment of many web IR and web mining methods such as clustering and classification, but they also negatively impact the performance and resource usage of tools that process web pages. In this paper we present a new method that efficiently and accurately removes templates found in collections of web pages. Our method works in two steps. First, the costly process of template detection is performed over a small set of sample pages. Then, the derived template is removed from the remaining pages in the collection. This leads to substantial performance gains when compared to previous approaches that combine template detection and removal. We show, through an experimental evaluation, that our approach is effective for identifying terms occurring in templates - obtaining F-measure values around 0.9, and that it also boosts the accuracy of web page clustering and classification methods.
KW - Web page noise removal
KW - Web template extraction
UR - http://www.scopus.com/inward/record.url?scp=34547631600&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=34547631600&partnerID=8YFLogxK
U2 - 10.1145/1183614.1183654
DO - 10.1145/1183614.1183654
M3 - Conference contribution
AN - SCOPUS:34547631600
SN - 1595934332
SN - 9781595934338
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 258
EP - 267
BT - Proceedings of the 15th ACM Conference on Information and Knowledge Management, CIKM 2006
T2 - 15th ACM Conference on Information and Knowledge Management, CIKM 2006
Y2 - 6 November 2006 through 11 November 2006
ER -