Abstract
In this paper, we conduct extensive experiments to study the impact of features from different categories, both in isolation and added incrementally, on Arabic Person name recognition. We present an integrated system that combines the rule-based approach with the machine learning (ML)-based approach into a consolidated hybrid system. Our feature space comprises language-independent and language-specific features. The explored features fall naturally into six categories: Person named entity tags predicted by the rule-based component, word-level features, POS features, morphological features, gazetteer features, and other contextual features. Because the decision tree algorithm has shown comparatively higher efficiency as a classifier in current state-of-the-art hybrid Named Entity Recognition for Arabic, it is adopted in this study as the ML technique used by the hybrid system. The experiments therefore focus on two dimensions: the standard dataset used and the set of selected features. Several standard datasets are used to train and test the hybrid system, including ACE (2003–2004) and ANERcorp. The experimental analysis indicates that both language-independent and language-specific features play an important role in overcoming the challenges posed by the Arabic language and have a critical impact on optimizing the performance of the hybrid system.
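To make the six feature categories concrete, the following is a minimal sketch of how a per-token feature record might be assembled for such a hybrid setup. It is not the authors' implementation: the gazetteer, the title trigger list, the toy rule in `rule_based_tag`, the feature names, and the romanized placeholder tokens are all assumptions introduced for illustration, and the POS tags are supplied by hand rather than produced by a real tagger.

```python
from typing import Dict, List, Set

# Hypothetical toy resources (romanized placeholders, not real lexicons).
PERSON_GAZETTEER: Set[str] = {"mHmd", "AHmd", "fATmp"}
TITLE_TRIGGERS: Set[str] = {"Aldktwr", "Alsyd", "Alr}ys"}


def rule_based_tag(tokens: List[str], i: int) -> str:
    """Toy stand-in for a rule-based component.

    Tags a token as PER if it is in the gazetteer or directly
    follows a title trigger word; otherwise tags it as O.
    """
    if tokens[i] in PERSON_GAZETTEER:
        return "PER"
    if i > 0 and tokens[i - 1] in TITLE_TRIGGERS:
        return "PER"
    return "O"


def token_features(tokens: List[str], pos_tags: List[str], i: int) -> Dict[str, object]:
    """Build one feature record covering the six categories in the abstract."""
    word = tokens[i]
    return {
        # 1. Tag predicted by the rule-based component
        "rule_tag": rule_based_tag(tokens, i),
        # 2. Word-level features
        "length": len(word),
        "prefix2": word[:2],
        "suffix2": word[-2:],
        # 3. POS feature (assumed to come from an external tagger)
        "pos": pos_tags[i],
        # 4. Morphological feature (crude proxy: definite-article prefix)
        "has_definite_prefix": word.startswith("Al"),
        # 5. Gazetteer feature
        "in_gazetteer": word in PERSON_GAZETTEER,
        # 6. Contextual features
        "prev_word": tokens[i - 1] if i > 0 else "<S>",
        "next_word": tokens[i + 1] if i + 1 < len(tokens) else "</S>",
        "prev_is_title": i > 0 and tokens[i - 1] in TITLE_TRIGGERS,
    }


if __name__ == "__main__":
    sentence = ["qAl", "Aldktwr", "sAmy", "Ams"]
    pos = ["VERB", "NOUN", "PROPN", "ADV"]
    for idx in range(len(sentence)):
        print(token_features(sentence, pos, idx))
```

Records of this shape could then be vectorized and passed to a decision tree learner, for example scikit-learn's `DictVectorizer` followed by `DecisionTreeClassifier`; that pairing is one possible choice and is not stated in the abstract.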
Original language | English (US) |
---|---|
Pages (from-to) | 351-378 |
Number of pages | 28 |
Journal | Language Resources and Evaluation |
Volume | 51 |
Issue number | 2 |
State | Published - Jun 1 2017 |
Keywords
- Hybrid approach
- Information extraction
- Machine learning
- Named entity recognition
- Natural language processing
- Rule-based approach
ASJC Scopus subject areas
- Language and Linguistics
- Education
- Linguistics and Language
- Library and Information Sciences