TY - GEN
T1 - Document classification for focused topics
AU - Power, Russell
AU - Chen, Jay
AU - Karthik, Trishank
AU - Subramanian, Lakshminarayanan
PY - 2010
Y1 - 2010
N2 - Feature extraction is one of the fundamental challenges in improving the accuracy of document classification. While there has been a large body of research literature on document classification, most existing approaches either do not achieve high classification accuracy or require massive training sets. In this paper, we propose a simple feature extraction algorithm that can achieve high document classification accuracy in the context of development-centric topics. Our feature extraction algorithm exploits two distinct aspects of development-centric topics: (a) most of these topics tend to be very focused (unlike semantically hard classification topics such as chemistry or banks); (b) due to local language and cultural underpinnings in these topics, the authentic pages tend to use several region-specific features. Our algorithm uses a combination of popularity and rarity as two separate metrics to extract features that describe a topic. Given a topic, our output feature set comprises: (i) a list of popular keywords closely related to the topic; and (ii) a list of rare keywords closely related to the topic. We show that a simple joint classifier based on these two feature sets can achieve high classification accuracy, while each feature subset on its own is insufficient. We have tested our algorithm across a wide range of development-centric topics.
AB - Feature extraction is one of the fundamental challenges in improving the accuracy of document classification. While there has been a large body of research literature on document classification, most existing approaches either do not achieve high classification accuracy or require massive training sets. In this paper, we propose a simple feature extraction algorithm that can achieve high document classification accuracy in the context of development-centric topics. Our feature extraction algorithm exploits two distinct aspects of development-centric topics: (a) most of these topics tend to be very focused (unlike semantically hard classification topics such as chemistry or banks); (b) due to local language and cultural underpinnings in these topics, the authentic pages tend to use several region-specific features. Our algorithm uses a combination of popularity and rarity as two separate metrics to extract features that describe a topic. Given a topic, our output feature set comprises: (i) a list of popular keywords closely related to the topic; and (ii) a list of rare keywords closely related to the topic. We show that a simple joint classifier based on these two feature sets can achieve high classification accuracy, while each feature subset on its own is insufficient. We have tested our algorithm across a wide range of development-centric topics.
UR - http://www.scopus.com/inward/record.url?scp=77957940540&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77957940540&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:77957940540
SN - 9781577354550
T3 - AAAI Spring Symposium - Technical Report
SP - 67
EP - 72
BT - Artificial Intelligence for Development - Papers from the AAAI Spring Symposium, Technical Report
PB - AI Access Foundation
T2 - 2010 AAAI Spring Symposium
Y2 - 22 March 2010 through 24 March 2010
ER -