Organizing hidden-Web databases by clustering visible Web documents

Luciano Barbosa, Juliana Freire, Altigran Silva

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

In this paper we address the problem of organizing hidden-Web databases. Given a heterogeneous set of Web forms that serve as entry points to hidden-Web databases, our goal is to cluster the forms according to the database domains to which they belong. We propose a new clustering approach that models Web forms as a set of hyperlinked objects and considers visible information in the form context (both within and in the neighborhood of forms) as the basis for similarity comparison. Since the clustering is performed over features that can be automatically extracted, the process is scalable. In addition, because it uses a rich set of metadata, our approach is able to handle a wide range of forms, including content-rich forms that contain multiple attributes, as well as simple keyword-based search interfaces. An experimental evaluation over real Web data shows that our strategy generates high-quality clusters, measured both in terms of entropy and F-measure. This indicates that our approach provides an effective and general solution to the problem of organizing hidden-Web databases.

Original language: English (US)
Title of host publication: 23rd International Conference on Data Engineering, ICDE 2007
Pages: 326-335
Number of pages: 10
State: Published - 2007
Event: 23rd International Conference on Data Engineering, ICDE 2007 - Istanbul, Turkey
Duration: Apr 15 2007 → Apr 20 2007

Publication series

Name: Proceedings - International Conference on Data Engineering
ISSN (Print): 1084-4627

Other

Other: 23rd International Conference on Data Engineering, ICDE 2007
Country/Territory: Turkey
City: Istanbul
Period: 4/15/07 → 4/20/07

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Information Systems
