Successes and critical failures of neural networks in capturing human-like speech recognition

Federico Adolfi, Jeffrey S. Bowers, David Poeppel

Research output: Contribution to journal › Article › peer-review

Abstract

Natural and artificial audition can in principle acquire different solutions to a given problem. The constraints of the task, however, can nudge the cognitive science and engineering of audition to qualitatively converge, suggesting that a closer mutual examination would potentially enrich artificial hearing systems and process models of the mind and brain. Speech recognition, an area ripe for such exploration, is inherently robust in humans to a number of transformations at various spectrotemporal granularities. To what extent are these robustness profiles accounted for by high-performing neural network systems? We bring together experiments in speech recognition under a single synthesis framework to evaluate state-of-the-art neural networks as stimulus-computable, optimized observers. In a series of experiments, we (1) clarify how influential speech manipulations in the literature relate to each other and to natural speech, (2) show the granularities at which machines exhibit out-of-distribution robustness, reproducing classical perceptual phenomena in humans, (3) identify the specific conditions where model predictions of human performance differ, and (4) demonstrate a crucial failure of all artificial systems to perceptually recover where humans do, suggesting alternative directions for theory and model building. These findings encourage a tighter synergy between the cognitive science and engineering of audition.

Original language: English (US)
Pages (from-to): 199-211
Number of pages: 13
Journal: Neural Networks
Volume: 162
State: Published - May 2023

Keywords

  • Audition
  • Human-like AI
  • Neural networks
  • Robustness
  • Speech

ASJC Scopus subject areas

  • Cognitive Neuroscience
  • Artificial Intelligence
