Document classification for focused topics

Russell Power, Jay Chen, Trishank Karthik, Lakshminarayanan Subramanian

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Feature extraction is one of the fundamental challenges in improving the accuracy of document classification. While there is a large body of research literature on document classification, most existing approaches either do not achieve high classification accuracy or require massive training sets. In this paper, we propose a simple feature extraction algorithm that can achieve high document classification accuracy in the context of development-centric topics. Our feature extraction algorithm exploits two distinct aspects of development-centric topics: (a) most of these topics tend to be very focused (unlike semantically hard classification topics such as chemistry or banks); (b) due to the local language and cultural underpinnings of these topics, authentic pages tend to use several region-specific features. Our algorithm uses popularity and rarity as two separate metrics to extract features that describe a topic. Given a topic, our output feature set comprises: (i) a list of popular keywords closely related to the topic; (ii) a list of rare keywords closely related to the topic. We show that a simple joint classifier based on these two feature sets can achieve high classification accuracy, while each feature subset on its own is insufficient. We have tested our algorithm across a wide range of development-centric topics.
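The abstract does not say how popularity and rarity are computed or how the joint classifier combines them, so the following Python is only an illustrative sketch under stated assumptions, not the authors' algorithm. It assumes "popular" means a term's document frequency in topic pages exceeds its frequency in a background corpus, "rare" means a term occurs in topic pages but never in the background corpus, and the joint classifier requires at least one hit from each list. The names `extract_features`, `is_on_topic`, and the toy corpora are all hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, keep only purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def doc_freq(docs):
    """For each term, count the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df


def extract_features(topic_docs, background_docs, n_popular=2):
    """Return (popular, rare) keyword sets. Both definitions are assumptions."""
    topic_df = doc_freq(topic_docs)
    back_df = doc_freq(background_docs)

    # Assumed popularity score: share of topic documents containing the term
    # minus share of background documents containing it.
    scores = {
        t: topic_df[t] / len(topic_docs) - back_df[t] / len(background_docs)
        for t in topic_df
    }
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    popular = {t for t in ranked[:n_popular] if scores[t] > 0}

    # Assumed rarity rule: seen in topic documents, never in the background,
    # and not already selected as popular.
    rare = {t for t in topic_df if back_df[t] == 0 and t not in popular}
    return popular, rare


def is_on_topic(doc, popular, rare, min_popular=1, min_rare=1):
    """Assumed joint rule: require hits from BOTH keyword lists."""
    tokens = set(tokenize(doc))
    return (len(tokens & popular) >= min_popular
            and len(tokens & rare) >= min_rare)


# Toy corpora, invented purely for illustration.
topic_docs = [
    "malaria is spread by the anopheles mosquito",
    "malaria treatment uses artemisinin and bed nets",
    "the mosquito spreads malaria in rural villages",
]
background_docs = [
    "the stock market is volatile and risky",
    "football is played by the team in the stadium",
    "treatment of the data uses statistics",
]

popular, rare = extract_features(topic_docs, background_docs)
print(sorted(popular))
print(is_on_topic("malaria cases fell after bed nets arrived", popular, rare))
print(is_on_topic("malaria is in the news", popular, rare))
```

In this toy setup, a page that mentions only a popular keyword, or only a rare one, is rejected. That mirrors the abstract's claim that neither feature subset is sufficient alone, although the thresholds here are arbitrary.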

Original language: English (US)
Title of host publication: Artificial Intelligence for Development - Papers from the AAAI Spring Symposium, Technical Report
Publisher: AI Access Foundation
Pages: 67-72
Number of pages: 6
ISBN (Print): 9781577354550
State: Published - 2010
Event: 2010 AAAI Spring Symposium - Stanford, CA, United States
Duration: Mar 22 2010 to Mar 24 2010

Publication series

Name: AAAI Spring Symposium - Technical Report
Volume: SS-10-01

Other

Other: 2010 AAAI Spring Symposium
Country/Territory: United States
City: Stanford, CA
Period: 3/22/10 to 3/24/10

ASJC Scopus subject areas

  • Artificial Intelligence
