Data-driven domain discovery for structured datasets

Masayo Ota, Heiko Müller, Juliana Freire, Divesh Srivastava

Research output: Contribution to journal › Conference article › peer-review

Abstract

The growing number of open datasets has created new opportunities to derive insights and address important societal problems. These data, however, often come with little or no metadata, in particular about the types of their attributes, thus greatly limiting their utility. In this paper, we address the problem of domain discovery: given a collection of tables, we aim to identify sets of terms that represent instances of a semantic concept or domain. Knowledge of attribute domains not only enables a richer set of queries over dataset collections, but can also help in data integration. We propose a data-driven approach that leverages value co-occurrence information across a large number of dataset columns to derive robust context signatures and infer domains. We discuss the results of a detailed experimental evaluation, using real urban dataset collections, which show that our approach is robust, outperforms state-of-the-art methods in the presence of incomplete columns and heterogeneous or erroneous data, and scales to datasets with several million distinct terms.
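
The abstract does not spell out how context signatures are built, so the following is only a minimal sketch of the general idea of value co-occurrence across columns, not the authors' algorithm. It assumes that two terms are related when they appear in largely the same set of columns, and it measures that overlap with Jaccard similarity. The function names, the threshold, and the toy column data are all hypothetical and chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def column_sets(columns):
    """Map each term to the set of column ids in which it appears."""
    occurrences = defaultdict(set)
    for col_id, values in columns.items():
        for term in set(values):
            occurrences[term].add(col_id)
    return occurrences


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_term_pairs(columns, threshold=0.75):
    """Return (term, term, similarity) for pairs at or above the threshold.

    Similarity here is the overlap between the sets of columns the two
    terms occur in; this is an illustrative stand-in, not the paper's
    context-signature definition.
    """
    occ = column_sets(columns)
    pairs = []
    for t1, t2 in combinations(sorted(occ), 2):
        sim = jaccard(occ[t1], occ[t2])
        if sim >= threshold:
            pairs.append((t1, t2, sim))
    return pairs


# Hypothetical toy collection: column id -> values found in that column.
columns = {
    "taxi.borough": ["Bronx", "Brooklyn", "Queens"],
    "schools.boro": ["Bronx", "Brooklyn", "Queens", "N/A"],
    "permits.color": ["Red", "Blue", "N/A"],
}

for t1, t2, sim in similar_term_pairs(columns):
    print(f"{t1} ~ {t2}: {sim:.2f}")
```

In this toy collection the borough names pair up with one another, as do the two colors, while the placeholder "N/A" appears in columns of both kinds and does not reach the threshold with either group. That is the sort of heterogeneous or erroneous value the abstract says a robust approach has to tolerate.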

Original language: English (US)
Pages (from-to): 953-965
Number of pages: 13
Journal: Proceedings of the VLDB Endowment
Volume: 13
Issue number: 7
State: Published - 2020
Event: 46th International Conference on Very Large Data Bases, VLDB 2020 - Virtual, Japan
Duration: Aug 31 2020 to Sep 4 2020

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • General Computer Science
