TY - GEN
T1 - PruSM
T2 - 19th International Conference on Information and Knowledge Management and Co-located Workshops, CIKM'10
AU - Nguyen, Thanh
AU - Nguyen, Hoa
AU - Freire, Juliana
PY - 2010
Y1 - 2010
N2 - There has been a substantial increase in the number of Web data sources whose contents are hidden and can only be accessed through form interfaces. To leverage this data, several applications have emerged that aim to automate and simplify access to these data sources, from hidden-Web crawlers and meta-searchers to Web information integration systems. A requirement shared by these applications is the ability to understand these forms, so that they can automatically fill them out. In this paper, we address a key problem in form understanding: how to match elements across distinct forms. Although this problem has been studied in the literature, existing approaches have important limitations. Notably, they only handle small form collections and assume that form elements are clean and normalized, often through manual pre-processing. When a large number of forms is automatically gathered, matching form schemata presents new challenges: data heterogeneity is compounded by Web scale and by the noise introduced by automated processes. We propose PruSM, a prudent schema matching strategy that determines matches for form elements in a prudent fashion, with the goal of minimizing error propagation. An experimental evaluation of PruSM using widely available data sets shows that the approach is effective and able to accurately match a large number of form schemata without requiring any manual pre-processing.
AB - There has been a substantial increase in the number of Web data sources whose contents are hidden and can only be accessed through form interfaces. To leverage this data, several applications have emerged that aim to automate and simplify access to these data sources, from hidden-Web crawlers and meta-searchers to Web information integration systems. A requirement shared by these applications is the ability to understand these forms, so that they can automatically fill them out. In this paper, we address a key problem in form understanding: how to match elements across distinct forms. Although this problem has been studied in the literature, existing approaches have important limitations. Notably, they only handle small form collections and assume that form elements are clean and normalized, often through manual pre-processing. When a large number of forms is automatically gathered, matching form schemata presents new challenges: data heterogeneity is compounded by Web scale and by the noise introduced by automated processes. We propose PruSM, a prudent schema matching strategy that determines matches for form elements in a prudent fashion, with the goal of minimizing error propagation. An experimental evaluation of PruSM using widely available data sets shows that the approach is effective and able to accurately match a large number of form schemata without requiring any manual pre-processing.
KW - Hidden web
KW - Schema matching
KW - Web forms
UR - http://www.scopus.com/inward/record.url?scp=78651311226&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=78651311226&partnerID=8YFLogxK
U2 - 10.1145/1871437.1871627
DO - 10.1145/1871437.1871627
M3 - Conference contribution
AN - SCOPUS:78651311226
SN - 9781450300995
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 1385
EP - 1388
BT - CIKM'10 - Proceedings of the 19th International Conference on Information and Knowledge Management and Co-located Workshops
Y2 - 26 October 2010 through 30 October 2010
ER -