TY - JOUR
T1 - A hybrid approach to Arabic named entity recognition
AU - Shaalan, Khaled
AU - Oudah, Mai
N1 - Funding Information: This research was funded by the British University in Dubai (grant no. INF004 – Using machine learning to improve Arabic named entity recognition).
PY - 2014/2
Y1 - 2014/2
AB - In this paper, we propose a hybrid named entity recognition (NER) approach that combines the advantages of rule-based and machine learning-based approaches in order to improve the overall system performance and overcome the knowledge elicitation bottleneck and the lack of resources for underdeveloped languages that require deep language processing, such as Arabic. The complexity of Arabic poses special challenges to researchers of Arabic NER, which is essential for both monolingual and multilingual applications. We used the hybrid approach to develop an Arabic NER system that is capable of recognizing 11 types of Arabic named entities: Person, Location, Organization, Date, Time, Price, Measurement, Percent, Phone Number, ISBN and File Name. Extensive experiments were conducted using decision trees, Support Vector Machines and logistic regression classifiers to evaluate the system performance. The empirical results indicate that the hybrid approach outperforms both the rule-based and the ML-based approaches when they are applied independently. More importantly, our system outperforms the state of the art in Arabic NER in terms of accuracy when applied to the ANERcorp standard dataset, with F-measures of 0.94 for Person, 0.90 for Location and 0.88 for Organization.
KW - hybrid approach
KW - information extraction
KW - information retrieval
KW - machine learning approach
KW - named entity recognition
KW - natural language processing
KW - rule-based approach
UR - http://www.scopus.com/inward/record.url?scp=84892754368&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84892754368&partnerID=8YFLogxK
U2 - 10.1177/0165551513502417
DO - 10.1177/0165551513502417
M3 - Article
AN - SCOPUS:84892754368
SN - 0165-5515
VL - 40
SP - 67
EP - 87
JO - Journal of Information Science
JF - Journal of Information Science
IS - 1
ER -