TY - GEN
T1 - Identifying products in online cybercrime marketplaces
T2 - 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017
AU - Durrett, Greg
AU - Kummerfeld, Jonathan K.
AU - Berg-Kirkpatrick, Taylor
AU - Portnoff, Rebecca S.
AU - Afroz, Sadia
AU - McCoy, Damon
AU - Levchenko, Kirill
AU - Paxson, Vern
N1 - Funding Information:
This work was supported in part by the National Science Foundation under grants CNS-1237265 and CNS-1619620, by the Office of Naval Research under MURI grant N000140911081, by the Center for Long-Term Cybersecurity and by gifts from Google. We thank all the people that provided us with forum data for our analysis; in particular Scraping Hub and SRI for their assistance in collecting data for this study. Any opinions, findings, and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.
Publisher Copyright:
© 2017 Association for Computational Linguistics.
PY - 2017
Y1 - 2017
N2 - One weakness of machine-learned NLP models is that they typically perform poorly on out-of-domain data. In this work, we study the task of identifying products being bought and sold in online cybercrime forums, which exhibits particularly challenging cross-domain effects. We formulate a task that represents a hybrid of slot-filling information extraction and named entity recognition and annotate data from four different forums. Each of these forums constitutes its own “fine-grained domain” in that the forums cover different market sectors with different properties, even though all forums are in the broad domain of cybercrime. We characterize these domain differences in the context of a learning-based system: supervised models see decreased accuracy when applied to new forums, and standard techniques for semi-supervised learning and domain adaptation have limited effectiveness on this data, which suggests the need to improve these techniques. We release a dataset of 1,938 annotated posts from across the four forums.
AB - One weakness of machine-learned NLP models is that they typically perform poorly on out-of-domain data. In this work, we study the task of identifying products being bought and sold in online cybercrime forums, which exhibits particularly challenging cross-domain effects. We formulate a task that represents a hybrid of slot-filling information extraction and named entity recognition and annotate data from four different forums. Each of these forums constitutes its own “fine-grained domain” in that the forums cover different market sectors with different properties, even though all forums are in the broad domain of cybercrime. We characterize these domain differences in the context of a learning-based system: supervised models see decreased accuracy when applied to new forums, and standard techniques for semi-supervised learning and domain adaptation have limited effectiveness on this data, which suggests the need to improve these techniques. We release a dataset of 1,938 annotated posts from across the four forums.
UR - http://www.scopus.com/inward/record.url?scp=85057778043&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85057778043&partnerID=8YFLogxK
U2 - 10.18653/v1/d17-1275
DO - 10.18653/v1/d17-1275
M3 - Conference contribution
AN - SCOPUS:85057778043
T3 - EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings
SP - 2598
EP - 2607
BT - EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings
PB - Association for Computational Linguistics (ACL)
Y2 - 9 September 2017 through 11 September 2017
ER -