TY - JOUR
T1 - Studying the impact of language-independent and language-specific features on hybrid Arabic Person name recognition
AU - Oudah, Mai
AU - Shaalan, Khaled
N1 - Funding Information:
This research was funded by the British University in Dubai (Grant No. INF004-Using machine learning to improve Arabic named entity recognition).
Publisher Copyright:
© 2016, Springer Science+Business Media Dordrecht.
PY - 2017/6/1
Y1 - 2017/6/1
AB - In this paper, extensive experiments are conducted to study the impact of features of different categories, both in isolation and incrementally, on Arabic Person name recognition. We present an integrated system that combines a rule-based approach with a machine learning (ML)-based approach to form a consolidated hybrid system. Our feature space comprises language-independent and language-specific features. The explored features are grouped under six categories: Person named entity tags predicted by the rule-based component, word-level features, POS features, morphological features, gazetteer features, and other contextual features. As the decision tree algorithm has shown comparatively higher efficiency as a classifier in current state-of-the-art hybrid Named Entity Recognition for Arabic, it is adopted in this study as the ML technique used by the hybrid system. The experiments therefore focus on two dimensions: the standard dataset used and the set of selected features. Several standard datasets are used for training and testing the hybrid system, including ACE (2003–2004) and ANERcorp. The experimental analysis indicates that both language-independent and language-specific features play an important role in overcoming the challenges posed by the Arabic language and have a critical impact on optimizing the performance of the hybrid system.
KW - Hybrid approach
KW - Information extraction
KW - Machine learning
KW - Named entity recognition
KW - Natural language processing
KW - Rule-based approach
UR - http://www.scopus.com/inward/record.url?scp=84997236688&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84997236688&partnerID=8YFLogxK
U2 - 10.1007/s10579-016-9376-1
DO - 10.1007/s10579-016-9376-1
M3 - Article
AN - SCOPUS:84997236688
SN - 1574-020X
VL - 51
SP - 351
EP - 378
JO - Language Resources and Evaluation
JF - Language Resources and Evaluation
IS - 2
ER -