Abstract
The growing number of open datasets has created new opportunities to derive insights and address important societal problems. These data, however, often come with little or no metadata, in particular about the types of their attributes, thus greatly limiting their utility. In this paper, we address the problem of domain discovery: given a collection of tables, we aim to identify sets of terms that represent instances of a semantic concept or domain. Knowledge of attribute domains not only enables a richer set of queries over dataset collections, but it can also help in data integration. We propose a data-driven approach that leverages value co-occurrence information across a large number of dataset columns to derive robust context signatures and infer domains. We discuss the results of a detailed experimental evaluation, using real urban dataset collections, which show that our approach is robust, outperforms state-of-the-art methods in the presence of incomplete columns and heterogeneous or erroneous data, and scales to datasets with several million distinct terms.
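
As a rough illustration of what "value co-occurrence across columns" can mean, the sketch below ranks the terms that share columns with a given term. It is not the paper's algorithm: the helper names (`column_sets`, `jaccard`, `context_signature`), the use of Jaccard similarity, and the toy columns are all assumptions made for this example.

```python
from collections import defaultdict


def column_sets(columns):
    """Map each term to the set of column names in which it occurs."""
    occurrences = defaultdict(set)
    for name, values in columns.items():
        for term in set(values):
            occurrences[term].add(name)
    return dict(occurrences)


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def context_signature(term, occurrences):
    """Rank the other terms by how strongly they co-occur with `term`.

    Hypothetical helper: similarity is the Jaccard overlap of the column
    sets of the two terms; terms with no shared column are dropped.
    """
    own = occurrences.get(term, set())
    scored = [
        (other, jaccard(own, cols))
        for other, cols in occurrences.items()
        if other != term
    ]
    scored = [(other, score) for other, score in scored if score > 0]
    # Highest similarity first; ties broken alphabetically for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy columns (invented for illustration): "N/A" is a placeholder value
# that leaks into otherwise unrelated columns.
columns = {
    "borough_a": ["Bronx", "Brooklyn", "Queens"],
    "borough_b": ["Bronx", "Brooklyn", "N/A"],
    "color": ["Red", "Blue", "N/A"],
}
sig = context_signature("Bronx", column_sets(columns))
for other, score in sig:
    print(f"{other}: {score:.2f}")
```

In this toy input, "Brooklyn" shares every column with "Bronx" and ranks first, while the placeholder "N/A" ranks last among the co-occurring terms because it also appears in an unrelated column. A signature of this kind is one simple way to see why noisy or erroneous values can be separated from genuine domain members.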
Original language | English (US) |
---|---|
Pages (from-to) | 953-965 |
Number of pages | 13 |
Journal | Proceedings of the VLDB Endowment |
Volume | 13 |
Issue number | 7 |
State | Published - 2020 |
Event | 46th International Conference on Very Large Data Bases, VLDB 2020 - Virtual, Japan. Duration: Aug 31 2020 → Sep 4 2020 |
ASJC Scopus subject areas
- Computer Science (miscellaneous)
- General Computer Science