Abstract
We present de-lexical segmentation, a linguistically motivated alternative to greedy or other unsupervised methods that requires only minimal language-specific input. Our technique involves creating a small grammar of closed-class affixes which can be written in a few hours. The grammar overgenerates analyses for word forms attested in a raw corpus; these analyses are then disambiguated based on features of the linguistic base proposed for each form. Extending the grammar to cover orthographic, morpho-syntactic, or lexical variation is simple, making it an ideal solution for challenging corpora with noisy, dialect-inconsistent, or otherwise non-standard content. In two evaluations, we consistently outperform competitive unsupervised baselines and approach the performance of state-of-the-art supervised models trained on large amounts of data, providing evidence for the value of linguistic input during preprocessing.
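To illustrate the overgenerate-then-disambiguate idea in the abstract, here is a minimal Python sketch. The suffix inventory, toy corpus, and frequency-based scoring heuristic are all hypothetical simplifications, not the paper's actual grammar or disambiguation features.

```python
# Toy sketch of de-lexical segmentation: a tiny hand-written closed-class
# suffix grammar overgenerates candidate analyses for each word form, and
# candidates are disambiguated using a feature of the proposed base
# (here, simply how often the base is attested in the raw corpus).
# The suffix list, corpus, and scoring rule are illustrative assumptions.

from collections import Counter

# Hypothetical closed-class suffix grammar (English-like, for illustration only).
SUFFIXES = ["s", "es", "ed", "ing"]


def overgenerate(word):
    """Propose every (base, suffix) split licensed by the suffix grammar,
    including the no-split analysis."""
    analyses = [(word, "")]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            analyses.append((word[: -len(suffix)], suffix))
    return analyses


def disambiguate(word, base_counts):
    """Choose the analysis whose proposed base is attested in the corpus,
    preferring more frequent (then longer) bases; fall back to no split."""
    def score(analysis):
        base, _ = analysis
        return (base_counts[base] > 0, base_counts[base], len(base))
    return max(overgenerate(word), key=score)


if __name__ == "__main__":
    raw_corpus = "the dog barks and the dog runs but the cat naps while the bird sings"
    base_counts = Counter(raw_corpus.split())
    for word in ["dogs", "cats", "birds", "running"]:
        base, suffix = disambiguate(word, base_counts)
        print(f"{word} -> {base} + -{suffix}" if suffix else f"{word} -> {word} (no split)")
```

Running the sketch splits "dogs", "cats", and "birds" because their bases are attested, while "running" is left unsplit since no candidate base occurs in the toy corpus; the paper's disambiguation relies on richer base features than raw frequency.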
| Original language | Undefined |
| --- | --- |
| Title of host publication | Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology |
| Place of Publication | Florence, Italy |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 113–124 |
| Number of pages | 12 |
| DOIs | |
| State | Published - Aug 1 2019 |