Learning to extract form labels

Hoa Nguyen, Thanh Nguyen, Juliana Freire

Research output: Chapter in Book/Report/Conference proceeding (Chapter)

Abstract

In this paper, we describe a new approach to extracting element labels from Web form interfaces. These labels are required by several techniques that attempt to retrieve and integrate information hidden behind form interfaces, such as hidden-Web crawlers and metasearchers. However, given the wide variation in form layout, even within a well-defined domain, automatically extracting these labels is a challenging problem. Whereas previous approaches have relied on heuristics and manually specified extraction rules, our technique uses a learning classifier ensemble to identify element-label mappings, and it applies a reconciliation step that leverages the classifier-derived mappings to boost extraction accuracy. We present a detailed experimental evaluation using over three thousand Web forms. Our results show that our approach is effective: it obtains significantly higher accuracy and is more robust to variability in form layout than previous label extraction techniques.

Original language: English (US)
Title of host publication: Proceedings of the VLDB Endowment
Pages: 684-694
Number of pages: 11
Volume: 1
Edition: 1
State: Published - 2008

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • General Computer Science
