TY - GEN
T1 - Organizing hidden-Web databases by clustering visible Web documents
AU - Barbosa, Luciano
AU - Freire, Juliana
AU - Silva, Altigran
PY - 2007
Y1 - 2007
N2 - In this paper we address the problem of organizing hidden-Web databases. Given a heterogeneous set of Web forms that serve as entry points to hidden-Web databases, our goal is to cluster the forms according to the database domains to which they belong. We propose a new clustering approach that models Web forms as a set of hyperlinked objects and considers visible information in the form context, both within and in the neighborhood of forms, as the basis for similarity comparison. Since the clustering is performed over features that can be automatically extracted, the process is scalable. In addition, because it uses a rich set of metadata, our approach is able to handle a wide range of forms, including content-rich forms that contain multiple attributes, as well as simple keyword-based search interfaces. An experimental evaluation over real Web data shows that our strategy generates high-quality clusters, measured in terms of both entropy and F-measure. This indicates that our approach provides an effective and general solution to the problem of organizing hidden-Web databases.
AB - In this paper we address the problem of organizing hidden-Web databases. Given a heterogeneous set of Web forms that serve as entry points to hidden-Web databases, our goal is to cluster the forms according to the database domains to which they belong. We propose a new clustering approach that models Web forms as a set of hyperlinked objects and considers visible information in the form context, both within and in the neighborhood of forms, as the basis for similarity comparison. Since the clustering is performed over features that can be automatically extracted, the process is scalable. In addition, because it uses a rich set of metadata, our approach is able to handle a wide range of forms, including content-rich forms that contain multiple attributes, as well as simple keyword-based search interfaces. An experimental evaluation over real Web data shows that our strategy generates high-quality clusters, measured in terms of both entropy and F-measure. This indicates that our approach provides an effective and general solution to the problem of organizing hidden-Web databases.
UR - http://www.scopus.com/inward/record.url?scp=34548729668&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=34548729668&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2007.367878
DO - 10.1109/ICDE.2007.367878
M3 - Conference contribution
AN - SCOPUS:34548729668
SN - 1424408032
SN - 9781424408030
T3 - Proceedings - International Conference on Data Engineering
SP - 326
EP - 335
BT - 23rd International Conference on Data Engineering, ICDE 2007
T2 - 23rd International Conference on Data Engineering, ICDE 2007
Y2 - 15 April 2007 through 20 April 2007
ER -