TY - GEN
T1 - Textless Speech-to-Speech Translation With Limited Parallel Data
AU - Diwan, Anuj
AU - Srinivasan, Anirudh
AU - Harwath, David
AU - Choi, Eunsol
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
AB - Existing speech-to-speech translation (S2ST) models fall into two camps: they either leverage text as an intermediate step or require hundreds of hours of parallel speech data. Both approaches are incompatible with textless languages or language pairs with limited parallel data. We present PFB, a framework for training textless S2ST models that require just dozens of hours of parallel speech data. We first pretrain a model on large-scale monolingual speech data, finetune it with a small amount of parallel speech data (20–60 hours), and lastly train with an unsupervised backtranslation objective. We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains (European Parliament, Common Voice, and All India Radio) with single-speaker synthesized speech. Evaluated using the ASR-BLEU metric, our models achieve reasonable performance on all three domains, with some being within 1–2 points of our higher-resourced topline.
UR - http://www.scopus.com/inward/record.url?scp=85216252768&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85216252768&partnerID=8YFLogxK
U2 - 10.18653/v1/2024.findings-emnlp.951
DO - 10.18653/v1/2024.findings-emnlp.951
M3 - Conference contribution
AN - SCOPUS:85216252768
T3 - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024
SP - 16208
EP - 16224
BT - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024
A2 - Al-Onaizan, Yaser
A2 - Bansal, Mohit
A2 - Chen, Yun-Nung
PB - Association for Computational Linguistics (ACL)
T2 - 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024
Y2 - 12 November 2024 through 16 November 2024
ER -