TY - GEN
T1 - Extraction of (key,value) pairs from unstructured ads
AU - Chakraborty, Sunandan
AU - Subramanian, Lakshminarayanan
AU - Nyarko, Yaw
N1 - Publisher Copyright:
Copyright © 2014, Association for the Advancement of Artificial Intelligence.
PY - 2014
Y1 - 2014
N2 - In this paper, we focus on the problem of extracting structured labeled data from short unstructured ad-postings from online sources like Craigslist, where ads are posted on various topics, such as job postings, rentals, car sales, etc. A fundamental challenge in addressing this problem is that most ad-postings are highly unstructured, short-text postings written in an informal manner with no inherent grammar or well-defined dictionary. In this paper, we propose unsupervised and supervised algorithms for extracting structured data from unstructured ads in the form of (key, value) pairs where the keys naturally represent topic-specific features in the ads. The unsupervised algorithm is centered around building an affinity graph, using the words from a topic-specific corpus of such ads where the edge weights represent affinities between words; the (key, value) extraction algorithm identifies specific groups of words in the affinity graph corresponding to different classes of key attributes. The supervised algorithm uses a Conditional Random Field based training algorithm to identify specific structured (key, value) pairs based on pre-defined topic-specific structural data representations of ads. Based on a corpus of car and apartment ad-postings from Craigslist, the unsupervised algorithm reported an accuracy of 67.74% and 68.74% for car and apartment ads respectively. The supervised algorithm demonstrated an improved performance with accuracies of 74.07% and 72.59% respectively.
AB - In this paper, we focus on the problem of extracting structured labeled data from short unstructured ad-postings from online sources like Craigslist, where ads are posted on various topics, such as job postings, rentals, car sales, etc. A fundamental challenge in addressing this problem is that most ad-postings are highly unstructured, short-text postings written in an informal manner with no inherent grammar or well-defined dictionary. In this paper, we propose unsupervised and supervised algorithms for extracting structured data from unstructured ads in the form of (key, value) pairs where the keys naturally represent topic-specific features in the ads. The unsupervised algorithm is centered around building an affinity graph, using the words from a topic-specific corpus of such ads where the edge weights represent affinities between words; the (key, value) extraction algorithm identifies specific groups of words in the affinity graph corresponding to different classes of key attributes. The supervised algorithm uses a Conditional Random Field based training algorithm to identify specific structured (key, value) pairs based on pre-defined topic-specific structural data representations of ads. Based on a corpus of car and apartment ad-postings from Craigslist, the unsupervised algorithm reported an accuracy of 67.74% and 68.74% for car and apartment ads respectively. The supervised algorithm demonstrated an improved performance with accuracies of 74.07% and 72.59% respectively.
UR - http://www.scopus.com/inward/record.url?scp=84987657113&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84987657113&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84987657113
T3 - AAAI Fall Symposium - Technical Report
SP - 10
EP - 17
BT - Natural Language Access to Big Data - Papers from the AAAI Fall Symposium, Technical Report
PB - AI Access Foundation
T2 - 2014 AAAI Fall Symposium
Y2 - 13 November 2014 through 15 November 2014
ER -