Causal inference of asynchronous audiovisual speech

John F. Magnotti, Wei Ji Ma, Michael S. Beauchamp

Research output: Contribution to journal › Article › peer-review


During speech perception, humans integrate auditory information from the voice with visual information from the face. This multisensory integration increases perceptual precision, but only if the two cues come from the same talker; this requirement has been largely ignored by current models of speech perception. We describe a generative model of multisensory speech perception that includes this critical step of determining the likelihood that the voice and face information have a common cause. A key feature of the model is that it is based on a principled analysis of how an observer should solve this causal inference problem using the asynchrony between two cues and the reliability of the cues. This allows the model to make predictions about the behavior of subjects performing a synchrony judgment task, predictive power that does not exist in other approaches, such as post-hoc fitting of Gaussian curves to behavioral data. We tested the model predictions against the performance of 37 subjects performing a synchrony judgment task viewing audiovisual speech under a variety of manipulations, including varying asynchronies, intelligibility, and visual cue reliability. The causal inference model outperformed the Gaussian model across two experiments, providing a better fit to the behavioral data with fewer parameters. Because the causal inference model is derived from a principled understanding of the task, model parameters are directly interpretable in terms of stimulus and subject properties.
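The causal inference step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes Gaussian distributions for the asynchrony under a common cause (C=1, narrow) and under separate causes (C=2, broad), Gaussian sensory noise with standard deviation `sigma`, and a rule that reports "synchronous" when the posterior probability of a common cause exceeds 0.5. All function names and parameter values (in milliseconds) are hypothetical and chosen only for illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_common_cause(x, sigma, p_c1=0.5,
                           mu_c1=0.0, sd_c1=100.0,
                           mu_c2=0.0, sd_c2=500.0):
    """P(C=1 | measured asynchrony x), by Bayes' rule.

    The measured asynchrony is the physical asynchrony plus sensory noise,
    so each likelihood is a Gaussian whose variance is the sum of the
    stimulus variance and the noise variance. Defaults are illustrative.
    """
    like_c1 = normal_pdf(x, mu_c1, math.hypot(sd_c1, sigma))
    like_c2 = normal_pdf(x, mu_c2, math.hypot(sd_c2, sigma))
    weighted_c1 = p_c1 * like_c1
    return weighted_c1 / (weighted_c1 + (1.0 - p_c1) * like_c2)


def prob_report_synchronous(asynchrony, sigma, steps=2001, **model):
    """Predicted proportion of 'synchronous' reports at one physical asynchrony.

    Sums the sensory-noise density over a grid of possible measurements,
    counting only measurements for which the observer would infer a
    common cause (posterior above 0.5).
    """
    lo = asynchrony - 6.0 * sigma
    dx = 12.0 * sigma / (steps - 1)
    total = 0.0
    for i in range(steps):
        x = lo + i * dx
        if posterior_common_cause(x, sigma, **model) > 0.5:
            total += normal_pdf(x, asynchrony, sigma) * dx
    return total


if __name__ == "__main__":
    for soa in (-400, -200, 0, 200, 400):
        print(soa, round(prob_report_synchronous(soa, sigma=50.0), 3))
```

In this sketch, raising `sigma` (a less reliable cue) broadens the range of asynchronies judged synchronous, which is the kind of qualitative prediction that a post-hoc Gaussian curve fit does not supply on its own.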

Original language: English (US)
Article number: 854
Journal: Frontiers in Psychology
Issue number: NOV
State: Published - 2013


Keywords

  • Bayesian observer
  • Causal inference
  • Multisensory integration
  • Speech perception
  • Synchrony judgments

ASJC Scopus subject areas

  • General Psychology

