Exploring paraphrasing techniques on formal language for generating semantics preserving source code transformations

Aviel J. Stein, Aviel Kapllani, Spiros Mancoridis, Rachel Greenstadt

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    Abstract

    Automatically identifying and generating content that is semantically equivalent to a word, phrase, or sentence is an important part of natural language processing (NLP). Research on paraphrases in NLP has so far focused exclusively on textual data, but it has significant potential when applied to formal languages such as source code. In this paper, we present a novel technique for generating source code transformations via the use of paraphrases. We explore how to extract and validate source code paraphrases. The transformations can be used for stylometry tasks and processes like refactoring. A machine learning method of identifying valid transformations avoids generating transformations by hand and is likely to yield more valid transformations. Our dataset comprises 27,300 C++ source code files, consisting of 273 topics each with 10 parallel files. This generates approximately 152,000 paraphrases. Of these paraphrases, 11% yield valid code transformations. We then train a random forest classifier that can identify valid transformations with 83% accuracy. In this paper we also discuss some of the observed relationships between linked paraphrase transformations. We depict the relationships that emerge between alternative equivalent code transformations in a graph formalism.

    Original language: English (US)
    Title of host publication: Proceedings - 14th IEEE International Conference on Semantic Computing, ICSC 2020
    Publisher: Institute of Electrical and Electronics Engineers Inc.
    Pages: 242-248
    Number of pages: 7
    ISBN (Electronic): 9781728163321
    State: Published - Feb 2020
    Event: 14th IEEE International Conference on Semantic Computing, ICSC 2020 - San Diego, United States
    Duration: Feb 3 2020 to Feb 5 2020

    Publication series

    Name: Proceedings - 14th IEEE International Conference on Semantic Computing, ICSC 2020

    Conference

    Conference: 14th IEEE International Conference on Semantic Computing, ICSC 2020
    Country/Territory: United States
    City: San Diego
    Period: 2/3/20 to 2/5/20

    Keywords

    • Code transformations
    • Natural language processing
    • Semantic computing
    • Source code paraphrasing

    ASJC Scopus subject areas

    • Artificial Intelligence
    • Computer Science Applications
    • Computer Vision and Pattern Recognition
