PruSM: A prudent schema matching approach for web forms

Thanh Nguyen, Hoa Nguyen, Juliana Freire

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

There has been a substantial increase in the number of Web data sources whose contents are hidden and can only be accessed through form interfaces. To leverage this data, several applications have emerged that aim to automate and simplify access to these data sources, from hidden-Web crawlers and meta-searchers to Web information integration systems. A requirement shared by these applications is the ability to understand these forms, so that they can automatically fill them out. In this paper, we address a key problem in form understanding: how to match elements across distinct forms. Although this problem has been studied in the literature, existing approaches have important limitations. Notably, they only handle small form collections and assume that form elements are clean and normalized, often through manual pre-processing. When a large number of forms is automatically gathered, matching form schemata presents new challenges: data heterogeneity is compounded by Web scale and by the noise introduced by automated processes. We propose PruSM, a prudent schema matching strategy that determines matches for form elements in a prudent fashion, with the goal of minimizing error propagation. An experimental evaluation of PruSM using widely available data sets shows that the approach is effective and able to accurately match a large number of form schemata without requiring any manual pre-processing.

Original language: English (US)
Title of host publication: CIKM'10 - Proceedings of the 19th International Conference on Information and Knowledge Management and Co-located Workshops
Pages: 1385-1388
Number of pages: 4
DOIs
State: Published - 2010
Event: 19th International Conference on Information and Knowledge Management and Co-located Workshops, CIKM'10 - Toronto, ON, Canada
Duration: Oct 26 2010 to Oct 30 2010

Publication series

Name: International Conference on Information and Knowledge Management, Proceedings

Other

Other: 19th International Conference on Information and Knowledge Management and Co-located Workshops, CIKM'10
Country/Territory: Canada
City: Toronto, ON
Period: 10/26/10 to 10/30/10

Keywords

  • Hidden web
  • Schema matching
  • Web forms

ASJC Scopus subject areas

  • General Business, Management and Accounting
  • General Decision Sciences
