Source code authorship attribution using long short-term memory based networks

Bander Alsulami, Edwin Dauber, Richard Harang, Spiros Mancoridis, Rachel Greenstadt

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Machine learning approaches to source code authorship attribution attempt to find statistical regularities in human-generated source code that can identify the author or authors of that code. This has applications in plagiarism detection, intellectual property infringement, and post-incident forensics in computer security. The introduction of features derived from the Abstract Syntax Tree (AST) of source code has recently set new benchmarks in this area, significantly improving over previous work that relied on easily obfuscatable lexical and format features of program source code. However, these AST-based approaches rely on hand-constructed features derived from such trees, and often include ancillary information such as function and variable names that may be obfuscated or manipulated. In this work, we provide novel contributions to AST-based source code authorship attribution using deep neural networks. We implement Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BiLSTM) models to automatically extract relevant features from the AST representation of programmers’ source code. We show that our models can automatically learn efficient representations of AST-based features without needing the hand-constructed ancillary information used by previous methods. Our empirical study on multiple datasets with different programming languages shows that our proposed approach achieves state-of-the-art performance for source code authorship attribution on AST-based features, despite not leveraging information that was previously thought to be required for high-confidence classification.
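
As a rough illustration of the idea in the abstract, the sketch below turns a program into a sequence of AST node types while discarding function names, variable names, and literal values. It is not the authors' pipeline: the abstract does not say how the tree is traversed or encoded, so the depth-first pre-order walk, the use of Python's standard `ast` module, and the helper names `ast_node_type_sequence` and `encode` are all assumptions made for this example. The resulting integer sequence is the kind of input an LSTM or BiLSTM could consume.

```python
import ast


def ast_node_type_sequence(source):
    """Return AST node type names in depth-first pre-order.

    Identifiers and literal values are ignored (assumption: only the
    structural node types are kept, e.g. "FunctionDef", "Return").
    """
    tree = ast.parse(source)
    sequence = []

    def visit(node):
        # Record only the node's type, never its name or value.
        sequence.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return sequence


def encode(sequence, vocab):
    """Map node type names to integer ids, growing vocab as needed.

    Id 0 is left unused so it can serve as a padding value (assumption).
    """
    return [vocab.setdefault(name, len(vocab) + 1) for name in sequence]


if __name__ == "__main__":
    # Two snippets that differ only in identifiers and constants.
    sample_a = "def f(x):\n    return x + 1\n"
    sample_b = "def g(y):\n    return y + 2\n"

    seq_a = ast_node_type_sequence(sample_a)
    seq_b = ast_node_type_sequence(sample_b)
    print(seq_a == seq_b)

    vocab = {}
    print(encode(seq_a, vocab))
```

Because the two sample snippets differ only in names and constants, they map to the same node-type sequence, which is the sense in which such a representation does not depend on easily renamed identifiers.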

    Original language: English (US)
    Title of host publication: Computer Security – ESORICS 2017 - 22nd European Symposium on Research in Computer Security, Proceedings
    Editors: Einar Snekkenes, Simon N. Foley, Dieter Gollmann
    Publisher: Springer Verlag
    Pages: 65-82
    Number of pages: 18
    ISBN (Print): 9783319664019
    State: Published - 2017
    Event: 22nd European Symposium on Research in Computer Security, ESORICS 2017 - Oslo, Norway
    Duration: Sep 11 2017 → Sep 15 2017

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 10492 LNCS
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349

    Conference

    Conference: 22nd European Symposium on Research in Computer Security, ESORICS 2017
    Country: Norway
    City: Oslo
    Period: 9/11/17 → 9/15/17

    Keywords

    • Abstract syntax tree
    • Code stylometry
    • Long short-term memory
    • Privacy
    • Security
    • Source code authorship attribution

    ASJC Scopus subject areas

    • Theoretical Computer Science
    • Computer Science(all)
