From papers to practice: the openclean open-source data cleaning library

Heiko Müller, Sonia Castelo, Munaf Qazi, Juliana Freire

Research output: Contribution to journal › Conference article › peer-review

Abstract

Data preparation is still a major bottleneck for many data science projects. Even though many sophisticated algorithms and tools have been proposed in the research literature, it is difficult for practitioners to integrate them into their data wrangling efforts. We present openclean, an open-source Python library for data cleaning and profiling. openclean integrates data profiling and cleaning tools in a single environment that is easy and intuitive to use. We designed openclean to be extensible and make it easy to add new functionality. This design not only makes it easier for users to access state-of-the-art algorithms for their data wrangling efforts, but also allows researchers to integrate their work and evaluate its effectiveness in practice. We envision openclean as a first step to build a community of practitioners and researchers in the field. In our demo, we outline the main components and design decisions in the development of openclean and demonstrate the current functionality of the library on real-world use cases.

Original language: English (US)
Pages (from-to): 2763-2766
Number of pages: 4
Journal: Proceedings of the VLDB Endowment
Volume: 14
Issue number: 12
DOIs
State: Published - 2021
Event: 47th International Conference on Very Large Data Bases, VLDB 2021 - Virtual, Online
Duration: Aug 16 2021 to Aug 20 2021

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • General Computer Science
