MLINSPECT: A Data Distribution Debugger for Machine Learning Pipelines

Stefan Grafberger, Shubha Guha, Julia Stoyanovich, Sebastian Schelter

    Research output: Contribution to journal › Conference article › peer-review

    Abstract

    Machine Learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this widespread use are garnering attention from policymakers, scientists, and the media. ML applications are often very brittle with respect to their input data, which leads to concerns about their reliability, accountability, and fairness. While bias detection cannot be fully automated, computational tools can help pinpoint particular types of data issues. We recently proposed mlinspect, a library that enables lightweight lineage-based inspection of ML preprocessing pipelines. In this demonstration, we show how mlinspect can be used to detect data distribution bugs in a representative pipeline. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries like estimator/transformer pipelines, can handle both relational and matrix data, and does not require manual code instrumentation. The library is publicly available at https://github.com/stefan-grafberger/mlinspect.
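
    As a brief illustration of the usage pattern described in the abstract, the sketch below follows the example from the project README: a pipeline script is run unmodified under mlinspect's instrumentation, with checks and inspections declared up front. The script path "pipeline.py" and the column names are hypothetical placeholders, and exact class and attribute names may differ across library versions.

    from mlinspect import PipelineInspector
    from mlinspect.checks import NoBiasIntroducedFor, NoIllegalFeatures
    from mlinspect.inspections import MaterializeFirstOutputRows

    # Execute the unmodified pipeline script under mlinspect's
    # instrumentation; no manual code changes are required.
    inspector_result = (
        PipelineInspector
        .on_pipeline_from_py_file("pipeline.py")  # hypothetical path
        # Flag preprocessing operators (e.g., filters, joins) that change
        # the distribution of the listed demographic groups.
        .add_check(NoBiasIntroducedFor(["age_group", "race"]))
        .add_check(NoIllegalFeatures())
        # Materialize a few example output rows per operator for inspection.
        .add_required_inspection(MaterializeFirstOutputRows(5))
        .execute()
    )

    # The result exposes the extracted dataflow DAG and per-check outcomes.
    extracted_dag = inspector_result.dag
    check_results = inspector_result.check_to_check_results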

    Original language: English (US)
    Pages (from-to): 2736-2739
    Number of pages: 4
    Journal: Proceedings of the ACM SIGMOD International Conference on Management of Data
    State: Published - 2021
    Event: 2021 International Conference on Management of Data, SIGMOD 2021 - Virtual, Online, China
    Duration: Jun 20, 2021 - Jun 25, 2021

    Keywords

    • data distribution debugging
    • machine learning pipelines
    • responsible data science
    • technical bias

    ASJC Scopus subject areas

    • Software
    • Information Systems
