Clean or Annotate: How to Spend a Limited Data Collection Budget

Derek Chen, Zhou Yu, Samuel R. Bowman

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Crowdsourcing platforms are often used to collect datasets for training machine learning models, despite higher levels of inaccurate labeling compared to expert labeling. There are two common strategies for managing the impact of such noise: the first aggregates redundant annotations, but at the expense of labeling substantially fewer examples; the second spends the entire annotation budget labeling as many examples as possible and then applies denoising algorithms to implicitly clean the dataset. We find a middle ground and propose an approach that reserves a fraction of annotations to explicitly clean up highly probable error samples, thereby optimizing the annotation process. In particular, we allocate a large portion of the labeling budget to form an initial dataset used to train a model. This model is then used to identify the specific examples that appear most likely to be incorrect, which we spend the remaining budget to relabel. Experiments across three model variations and four natural language processing tasks show that our approach outperforms or matches both label aggregation and advanced denoising methods designed to handle noisy labels, when allocated the same finite annotation budget.
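    The budget-splitting procedure the abstract describes can be sketched in miniature as follows. This is an illustrative toy, not the paper's implementation: the mean-threshold "model", the margin-based suspicion score, and the function name `collect_with_cleaning` are all assumptions made for the sketch.

    ```python
    import random

    def collect_with_cleaning(features, true_labels, budget,
                              noise=0.3, clean_frac=0.2, seed=0):
        """Toy sketch: spend most of the budget on single noisy crowd labels,
        fit a simple model on them, then spend the remainder relabeling the
        examples the model most confidently disagrees with.

        The mean-threshold classifier and the suspicion score below are
        stand-ins, not the paper's actual training or ranking procedure.
        """
        rng = random.Random(seed)
        n_initial = int(budget * (1 - clean_frac))  # labels for the initial pool
        n_clean = budget - n_initial                # relabels reserved for cleaning

        # Phase 1: one noisy crowd label per example in the initial pool
        # (each label is flipped with probability `noise`).
        labels = [y if rng.random() > noise else 1 - y
                  for y in true_labels[:n_initial]]

        # Phase 2: "train" a model on the noisy data -- here just a
        # threshold at the mean feature value.
        threshold = sum(features[:n_initial]) / n_initial
        predict = lambda x: int(x > threshold)

        # Phase 3: rank examples by how confidently the model disagrees
        # with their current label; these are the likely annotation errors.
        suspicion = []
        for i in range(n_initial):
            disagrees = predict(features[i]) != labels[i]
            margin = abs(features[i] - threshold)
            suspicion.append((margin if disagrees else 0.0, i))
        suspicion.sort(reverse=True)

        # Phase 4: spend the remaining budget on expert relabels of the
        # most suspicious examples.
        for _, i in suspicion[:n_clean]:
            labels[i] = true_labels[i]
        return labels
    ```

    One property of this shape of cleaning: because a relabel only ever replaces a noisy label with the expert label, the cleaned dataset is never less accurate than the raw crowd-labeled one on the same examples, while using the same total number of annotations.
    
    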

    Original language: English (US)
    Title of host publication: DeepLo 2022 - 3rd Workshop on Deep Learning Approaches for Low-Resource NLP, Proceedings of the DeepLo Workshop
    Editors: Colin Cherry, Angela Fan, George Foster, Gholamreza Haffari, Shahram Khadivi, Nanyun Peng, Xiang Ren, Ehsan Shareghi, Swabha Swayamdipta
    Publisher: Association for Computational Linguistics (ACL)
    Pages: 152-168
    Number of pages: 17
    ISBN (Electronic): 9781955917971
    State: Published - 2022
    Event: 3rd Workshop on Deep Learning Approaches for Low-Resource NLP, DeepLo 2022 - Seattle, United States
    Duration: Jul 14 2022 → …

    Publication series

    Name: DeepLo 2022 - 3rd Workshop on Deep Learning Approaches for Low-Resource NLP, Proceedings of the DeepLo Workshop

    Conference

    Conference: 3rd Workshop on Deep Learning Approaches for Low-Resource NLP, DeepLo 2022
    Country/Territory: United States
    City: Seattle
    Period: 7/14/22 → …

    ASJC Scopus subject areas

    • Language and Linguistics
    • Software
    • Linguistics and Language
