TY - GEN
T1 - Exploring paraphrasing techniques on formal language for generating semantics preserving source code transformations
AU - Stein, Aviel J.
AU - Kapllani, Levi
AU - Mancoridis, Spiros
AU - Greenstadt, Rachel
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/2
Y1 - 2020/2
N2 - Automatically identifying and generating equivalent semantic content to a word, phrase, or sentence is an important part of natural language processing (NLP). The research done so far in paraphrases in NLP has been focused exclusively on textual data, but has significant potential if it is applied to formal languages like source code. In this paper, we present a novel technique for generating source code transformations via the use of paraphrases. We explore how to extract and validate source code paraphrases. The transformations can be used for stylometry tasks and processes like refactoring. A machine learning method of identifying valid transformations has the advantage of avoiding the generation of transformations by hand and is more likely to have more valid transformations. Our dataset is comprised by 27,300 C++ source code files, consisting of 273 topics each with 10 parallel files. This generates approximately 152,000 paraphrases. Of these paraphrases, 11% yield valid code transformations. We then train a random forest classifier that can identify valid transformations with 83% accuracy. In this paper we also discuss some of the observed relationships between linked paraphrase transformations. We depict the relationships that emerge between alternative equivalent code transformations in a graph formalism.
AB - Automatically identifying and generating equivalent semantic content to a word, phrase, or sentence is an important part of natural language processing (NLP). The research done so far in paraphrases in NLP has been focused exclusively on textual data, but has significant potential if it is applied to formal languages like source code. In this paper, we present a novel technique for generating source code transformations via the use of paraphrases. We explore how to extract and validate source code paraphrases. The transformations can be used for stylometry tasks and processes like refactoring. A machine learning method of identifying valid transformations has the advantage of avoiding the generation of transformations by hand and is more likely to have more valid transformations. Our dataset is comprised by 27,300 C++ source code files, consisting of 273 topics each with 10 parallel files. This generates approximately 152,000 paraphrases. Of these paraphrases, 11% yield valid code transformations. We then train a random forest classifier that can identify valid transformations with 83% accuracy. In this paper we also discuss some of the observed relationships between linked paraphrase transformations. We depict the relationships that emerge between alternative equivalent code transformations in a graph formalism.
KW - Code transformations
KW - Natural language processing
KW - Semantic computing
KW - Source code paraphrasing
UR - http://www.scopus.com/inward/record.url?scp=85083451289&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85083451289&partnerID=8YFLogxK
U2 - 10.1109/ICSC.2020.00051
DO - 10.1109/ICSC.2020.00051
M3 - Conference contribution
AN - SCOPUS:85083451289
T3 - Proceedings - 14th IEEE International Conference on Semantic Computing, ICSC 2020
SP - 242
EP - 248
BT - Proceedings - 14th IEEE International Conference on Semantic Computing, ICSC 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 14th IEEE International Conference on Semantic Computing, ICSC 2020
Y2 - 3 February 2020 through 5 February 2020
ER -