Combining frame and segment level processing via temporal pooling for phonetic classification

Sumit Chopra, Patrick Haffner, Dimitrios Dimitriadis

Research output: Contribution to journal › Conference article › peer-review

Abstract

We propose a simple, yet novel, multi-layer model for the problem of phonetic classification. Our model combines a frame level transformation of the acoustic signal with a segment level phone classification. Our key contribution is the study of new temporal pooling strategies that interface these two levels, determining how frame scores are converted into segment scores. On the TIMIT benchmark, we match the best performance obtained using a single classifier. Diversity in pooling strategies is further used to generate candidate classifiers with complementary performance characteristics, which perform even better as an ensemble. Without the use of any phonetic knowledge, our ensemble model achieves a 16.96% phone classification error. While our data-driven approach is exhaustive, the combinatorial inflation is limited to the smaller segmental half of the system.
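The abstract does not say which pooling strategies the authors studied, so the following is only a minimal sketch of the general idea: a variable-length run of per-frame class scores is reduced to one fixed-length segment vector that a segment-level classifier could consume. The function names (`mean_pool`, `max_pool`, `thirds_pool`) and the three strategies themselves are illustrative assumptions, not the paper's method.

```python
from typing import List

Frame = List[float]  # hypothetical: one score per phone class for a single frame


def mean_pool(frames: List[Frame]) -> List[float]:
    """Average each class score over all frames (segment assumed non-empty)."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def max_pool(frames: List[Frame]) -> List[float]:
    """Keep the highest score each class reaches anywhere in the segment."""
    return [max(col) for col in zip(*frames)]


def thirds_pool(frames: List[Frame]) -> List[float]:
    """Mean-pool the start, middle, and end of the segment and concatenate.

    This keeps coarse temporal order that plain mean or max pooling discards.
    An empty part (very short segments) falls back to the whole segment.
    """
    n = len(frames)
    a, b = n // 3, (2 * n) // 3
    parts = [frames[:a] or frames, frames[a:b] or frames, frames[b:] or frames]
    return [score for part in parts for score in mean_pool(part)]


# Illustrative input: 3 frames, 2 classes.
frames = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
segment_mean = mean_pool(frames)
segment_max = max_pool(frames)
segment_thirds = thirds_pool(frames)
```

Each strategy yields a different fixed-length segment representation from the same frame scores, which is one way pooling diversity could produce the complementary classifiers that the abstract combines into an ensemble.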

Original language: English (US)
Pages (from-to): 233-236
Number of pages: 4
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
State: Published - 2011
Event: 12th Annual Conference of the International Speech Communication Association, INTERSPEECH 2011 - Florence, Italy
Duration: Aug 27 2011 → Aug 31 2011

Keywords

  • Deep network
  • Ensemble method
  • Multi-layer perceptron
  • Phonetic classification
  • TIMIT

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation
