TY - GEN
T1 - Cross-Lingual Transfer from Related Languages
T2 - 18th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2024
AU - Micallef, Kurt
AU - Habash, Nizar
AU - Borg, Claudia
AU - Eryani, Fadhl
AU - Bouamor, Houda
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - Although multilingual language models exhibit impressive cross-lingual transfer capabilities on unseen languages, the performance on downstream tasks is impacted when there is a script disparity with the languages used in the multilingual model's pre-training data. Using transliteration offers a straightforward yet effective means to align the script of a resource-rich language with a target language, thereby enhancing cross-lingual transfer capabilities. However, for mixed languages, this approach is suboptimal, since only a subset of the language benefits from the cross-lingual transfer while the remainder is impeded. In this work, we focus on Maltese, a Semitic language, with substantial influences from Arabic, Italian, and English, and notably written in Latin script. We present a novel dataset annotated with word-level etymology. We use this dataset to train a classifier that enables us to make informed decisions regarding the appropriate processing of each token in the Maltese language. We contrast indiscriminate transliteration or translation to mixing processing pipelines that only transliterate words of Arabic origin, thereby resulting in text with a mixture of scripts. We fine-tune the processed data on four downstream tasks and show that conditional transliteration based on word etymology yields the best results, surpassing finetuning with raw Maltese or Maltese processed with non-selective pipelines.
AB - Although multilingual language models exhibit impressive cross-lingual transfer capabilities on unseen languages, the performance on downstream tasks is impacted when there is a script disparity with the languages used in the multilingual model's pre-training data. Using transliteration offers a straightforward yet effective means to align the script of a resource-rich language with a target language, thereby enhancing cross-lingual transfer capabilities. However, for mixed languages, this approach is suboptimal, since only a subset of the language benefits from the cross-lingual transfer while the remainder is impeded. In this work, we focus on Maltese, a Semitic language, with substantial influences from Arabic, Italian, and English, and notably written in Latin script. We present a novel dataset annotated with word-level etymology. We use this dataset to train a classifier that enables us to make informed decisions regarding the appropriate processing of each token in the Maltese language. We contrast indiscriminate transliteration or translation to mixing processing pipelines that only transliterate words of Arabic origin, thereby resulting in text with a mixture of scripts. We fine-tune the processed data on four downstream tasks and show that conditional transliteration based on word etymology yields the best results, surpassing finetuning with raw Maltese or Maltese processed with non-selective pipelines.
UR - http://www.scopus.com/inward/record.url?scp=85189936129&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85189936129&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85189936129
T3 - EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference
SP - 1014
EP - 1025
BT - EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference
A2 - Graham, Yvette
A2 - Purver, Matthew
A2 - Purver, Matthew
PB - Association for Computational Linguistics (ACL)
Y2 - 17 March 2024 through 22 March 2024
ER -