Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation

Injy Hamed, Nizar Habash, Slim Abdennadher, Ngoc Thang Vu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Data sparsity is a major problem hindering the development of code-switching (CS) NLP systems. In this paper, we investigate data augmentation techniques for synthesizing dialectal Arabic-English CS text. We perform lexical replacements using word-aligned parallel corpora, where CS points are either randomly chosen or learnt using a sequence-to-sequence model. We compare these approaches against dictionary-based replacements. We assess the quality of the generated sentences through human evaluation and evaluate the effectiveness of data augmentation on machine translation (MT), automatic speech recognition (ASR), and speech translation (ST) tasks. Results show that using a predictive model produces more natural CS sentences than the random approach, as reported in human judgements. In the downstream tasks, despite the random approach generating more data, both approaches perform equally well (outperforming dictionary-based replacements). Overall, data augmentation achieves a 34% improvement in perplexity, a 5.2% relative improvement in WER on the ASR task, +4.0-5.1 BLEU points on the MT task, and +2.1-2.2 BLEU points on the ST task over a baseline trained on the available data without augmentation.
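The random variant of alignment-based lexical replacement described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the `switch_prob` parameter, the restriction to one-to-one alignments, and the toy tokens are all assumptions made for illustration, and the alignments are assumed to come from an external word aligner.

```python
import random
from collections import Counter


def random_lexical_replacement(src_tokens, tgt_tokens, alignments,
                               switch_prob=0.3, seed=0):
    """Replace randomly chosen source words with their aligned target words.

    alignments is a list of (source_index, target_index) pairs.
    Hypothetical simplification: only one-to-one alignments are
    considered as candidate code-switching points.
    """
    rng = random.Random(seed)
    src_counts = Counter(s for s, _ in alignments)
    tgt_counts = Counter(t for _, t in alignments)

    output = list(src_tokens)
    for s, t in alignments:
        # Skip words that take part in one-to-many or many-to-one links.
        if src_counts[s] != 1 or tgt_counts[t] != 1:
            continue
        # Randomly decide whether this position becomes a switch point.
        if rng.random() < switch_prob:
            output[s] = tgt_tokens[t]
    return output


# Toy example with placeholder tokens (not real corpus data).
source = ["s0", "s1", "s2", "s3"]
target = ["t0", "t1", "t2", "t3", "t4"]
links = [(0, 0), (1, 1), (2, 3), (3, 4)]
augmented = random_lexical_replacement(source, target, links, switch_prob=0.5)
print(" ".join(augmented))
```

A learnt variant would replace the random draw with the decision of a predictive model over candidate switch points; that model is not sketched here.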

Original language: English (US)
Title of host publication: 6th Workshop on Technologies for Machine Translation of Low-Resource Languages, LoResMT 2023 - Proceedings
Publisher: Association for Computational Linguistics
Pages: 86-100
Number of pages: 15
ISBN (Electronic): 9781959429555
State: Published - 2023
Event: 6th Workshop on Technologies for Machine Translation of Low-Resource Languages, LoResMT 2023 - Dubrovnik, Croatia
Duration: May 6 2023 → …

Publication series

Name: 6th Workshop on Technologies for Machine Translation of Low-Resource Languages, LoResMT 2023 - Proceedings

Conference

Conference: 6th Workshop on Technologies for Machine Translation of Low-Resource Languages, LoResMT 2023
Country/Territory: Croatia
City: Dubrovnik
Period: 5/6/23 → …

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Software
