TY - GEN
T1 - Interactive wrapper generation with minimal user effort
AU - Irmak, Utku
AU - Suel, Torsten
PY - 2006
Y1 - 2006
N2 - While much of the data on the web is unstructured in nature, there is also a significant amount of embedded structured data, such as product information on e-commerce sites or stock data on financial sites. A large amount of research has focused on the problem of generating wrappers, i.e., software tools that allow easy and robust extraction of structured data from text and HTML sources. In many applications, such as comparison shopping, data has to be extracted from many different sources, making manual coding of a wrapper for each source impractical. On the other hand, fully automatic approaches are often not reliable enough, resulting in low quality of the extracted data. We describe a complete system for semi-automatic wrapper generation that can be trained on different data sources in a simple interactive manner. Our goal is to minimize the amount of user effort for training reliable wrappers through design of a suitable training interface that is implemented based on a powerful underlying extraction language and a set of training and ranking algorithms. Our experiments show that our system achieves reliable extraction with a very small amount of user effort.
AB - While much of the data on the web is unstructured in nature, there is also a significant amount of embedded structured data, such as product information on e-commerce sites or stock data on financial sites. A large amount of research has focused on the problem of generating wrappers, i.e., software tools that allow easy and robust extraction of structured data from text and HTML sources. In many applications, such as comparison shopping, data has to be extracted from many different sources, making manual coding of a wrapper for each source impractical. On the other hand, fully automatic approaches are often not reliable enough, resulting in low quality of the extracted data. We describe a complete system for semi-automatic wrapper generation that can be trained on different data sources in a simple interactive manner. Our goal is to minimize the amount of user effort for training reliable wrappers through design of a suitable training interface that is implemented based on a powerful underlying extraction language and a set of training and ranking algorithms. Our experiments show that our system achieves reliable extraction with a very small amount of user effort.
KW - Active learning
KW - Data extraction
KW - Wrapper generation
UR - http://www.scopus.com/inward/record.url?scp=34250750133&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=34250750133&partnerID=8YFLogxK
U2 - 10.1145/1135777.1135859
DO - 10.1145/1135777.1135859
M3 - Conference contribution
AN - SCOPUS:34250750133
SN - 1595933239
SN - 9781595933232
T3 - Proceedings of the 15th International Conference on World Wide Web
SP - 553
EP - 563
BT - Proceedings of the 15th International Conference on World Wide Web
T2 - 15th International Conference on World Wide Web
Y2 - 23 May 2006 through 26 May 2006
ER -