TY - GEN
T1 - Non-parametric scan statistics for event detection and forecasting in heterogeneous social media graphs
AU - Chen, Feng
AU - Neill, Daniel B.
PY - 2014
Y1 - 2014
AB - Event detection in social media is an important but challenging problem. Most existing approaches are based on burst detection, topic modeling, or clustering techniques, which cannot naturally model the implicit heterogeneous network structure in social media. As a result, only limited information, such as terms and geographic locations, can be used. This paper presents Non-Parametric Heterogeneous Graph Scan (NPHGS), a new approach that considers the entire heterogeneous network for event detection: we first model the network as a "sensor" network, in which each node senses its "neighborhood environment" and reports an empirical p-value measuring its current level of anomalousness for each time interval (e.g., hour or day). Then, we efficiently maximize a nonparametric scan statistic over connected subgraphs to identify the most anomalous network clusters. Finally, the event represented by each cluster is summarized with information such as type of event, geographical locations, time, and participants. As case studies, we consider two applications using Twitter data (civil unrest event detection and rare disease outbreak detection) and present empirical evaluations illustrating the effectiveness and efficiency of our proposed approach.
KW - event detection and forecasting
KW - heterogeneous graphs
KW - non-parametric scan statistics
KW - social media
UR - http://www.scopus.com/inward/record.url?scp=84907030632&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84907030632&partnerID=8YFLogxK
U2 - 10.1145/2623330.2623619
DO - 10.1145/2623330.2623619
M3 - Conference contribution
AN - SCOPUS:84907030632
SN - 9781450329569
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 1166
EP - 1175
BT - KDD 2014 - Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
PB - Association for Computing Machinery
T2 - 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2014
Y2 - 24 August 2014 through 27 August 2014
ER -