TY - GEN
T1 - Combining classifiers to identify online databases
AU - Barbosa, Luciano
AU - Freire, Juliana
PY - 2007
Y1 - 2007
N2 - We address the problem of identifying the domain of onlinedatabases. More precisely, given a set F of Web forms automaticallygathered by a focused crawler and an online databasedomain D, our goal is to select from F only the formsthat are entry points to databases in D. Having a set ofWebforms that serve as entry points to similar online databasesis a requirement for many applications and techniques thataim to extract and integrate hidden-Web information, suchas meta-searchers, online database directories, hidden-Webcrawlers, and form-schema matching and merging.We propose a new strategy that automatically and accuratelyclassifies online databases based on features that canbe easily extracted from Web forms. By judiciously partitioningthe space of form features, this strategy allows theuse of simpler classifiers that can be constructed using learningtechniques that are better suited for the features of eachpartition. Experiments using real Web data in a representativeset of domains show that the use of different classifiersleads to high accuracy, precision and recall. This indicatesthat our modular classifier composition provides an effectiveand scalable solution for classifying online databases.
AB - We address the problem of identifying the domain of onlinedatabases. More precisely, given a set F of Web forms automaticallygathered by a focused crawler and an online databasedomain D, our goal is to select from F only the formsthat are entry points to databases in D. Having a set ofWebforms that serve as entry points to similar online databasesis a requirement for many applications and techniques thataim to extract and integrate hidden-Web information, suchas meta-searchers, online database directories, hidden-Webcrawlers, and form-schema matching and merging.We propose a new strategy that automatically and accuratelyclassifies online databases based on features that canbe easily extracted from Web forms. By judiciously partitioningthe space of form features, this strategy allows theuse of simpler classifiers that can be constructed using learningtechniques that are better suited for the features of eachpartition. Experiments using real Web data in a representativeset of domains show that the use of different classifiersleads to high accuracy, precision and recall. This indicatesthat our modular classifier composition provides an effectiveand scalable solution for classifying online databases.
KW - Hidden web
KW - Hierarchical classifiers
KW - Learning classifiers
KW - Online database directories
KW - Web crawlers
UR - http://www.scopus.com/inward/record.url?scp=35348849557&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=35348849557&partnerID=8YFLogxK
U2 - 10.1145/1242572.1242631
DO - 10.1145/1242572.1242631
M3 - Conference contribution
AN - SCOPUS:35348849557
SN - 1595936548
SN - 9781595936547
T3 - 16th International World Wide Web Conference, WWW2007
SP - 431
EP - 440
BT - 16th International World Wide Web Conference, WWW2007
T2 - 16th International World Wide Web Conference, WWW2007
Y2 - 8 May 2007 through 12 May 2007
ER -