TY - GEN
T1 - Exploring the Impact of Transliteration on NLP Performance
T2 - 2023 Workshop on Computation and Written Language, CAWL 2023
AU - Micallef, Kurt
AU - Eryani, Fadhl
AU - Habash, Nizar
AU - Bouamor, Houda
AU - Borg, Claudia
N1 - Publisher Copyright:
© 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
N2 - Multilingual models such as mBERT have been demonstrated to exhibit impressive cross-lingual transfer for a number of languages. Despite this, the performance drops for lower-resourced languages, especially when they are not part of the pre-training setup and when there are script differences. In this work we consider Maltese, a low-resource language of Arabic and Romance origins written in Latin script. Specifically, we investigate the impact of transliterating Maltese into Arabic script on a number of downstream tasks: Part-of-Speech Tagging, Dependency Parsing, and Sentiment Analysis. We compare multiple transliteration pipelines ranging from deterministic character maps to more sophisticated alternatives, including manually annotated word mappings and non-deterministic character mappings. For the latter, we show that selection techniques using n-gram language models of Tunisian Arabic, the dialect with the highest degree of mutual intelligibility to Maltese, yield better results on downstream tasks. Moreover, our experiments highlight that the use of an Arabic pre-trained model paired with transliteration outperforms mBERT. Overall, our results show that transliterating Maltese can be considered an option to improve cross-lingual transfer capabilities.
AB - Multilingual models such as mBERT have been demonstrated to exhibit impressive cross-lingual transfer for a number of languages. Despite this, the performance drops for lower-resourced languages, especially when they are not part of the pre-training setup and when there are script differences. In this work we consider Maltese, a low-resource language of Arabic and Romance origins written in Latin script. Specifically, we investigate the impact of transliterating Maltese into Arabic script on a number of downstream tasks: Part-of-Speech Tagging, Dependency Parsing, and Sentiment Analysis. We compare multiple transliteration pipelines ranging from deterministic character maps to more sophisticated alternatives, including manually annotated word mappings and non-deterministic character mappings. For the latter, we show that selection techniques using n-gram language models of Tunisian Arabic, the dialect with the highest degree of mutual intelligibility to Maltese, yield better results on downstream tasks. Moreover, our experiments highlight that the use of an Arabic pre-trained model paired with transliteration outperforms mBERT. Overall, our results show that transliterating Maltese can be considered an option to improve cross-lingual transfer capabilities.
UR - http://www.scopus.com/inward/record.url?scp=85174835018&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85174835018&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85174835018
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 22
EP - 32
BT - Workshop on Computation and Written Language, CAWL 2023 - Proceedings of the Workshop
A2 - Gorman, Kyle
A2 - Roark, Brian
A2 - Sproat, Richard
PB - Association for Computational Linguistics (ACL)
Y2 - 14 July 2023
ER -