TY - GEN
T1 - An Empirical Evaluation of GitHub Copilot's Code Suggestions
AU - Nguyen, Nhan
AU - Nadi, Sarah
N1 - Publisher Copyright: © 2022 ACM.
PY - 2022
Y1 - 2022
N2 - GitHub and OpenAI recently launched Copilot, an 'AI pair programmer' that utilizes the power of Natural Language Processing, Static Analysis, Code Synthesis, and Artificial Intelligence. Given a natural language description of the target functionality, Copilot can generate corresponding code in several programming languages. In this paper, we perform an empirical study to evaluate the correctness and understandability of Copilot's suggested code. We use 33 LeetCode questions to create queries for Copilot in four different programming languages. We evaluate the correctness of the corresponding 132 Copilot solutions by running LeetCode's provided tests, and evaluate understandability using SonarQube's cyclomatic complexity and cognitive complexity metrics. We find that Copilot's Java suggestions have the highest correctness score (57%) while JavaScript is the lowest (27%). Overall, Copilot's suggestions have low complexity with no notable differences between the programming languages. We also find some potential Copilot shortcomings, such as generating code that can be further simplified and code that relies on undefined helper methods.
KW - Codex
KW - Empirical Evaluation
KW - GitHub Copilot
KW - Program Synthesis
UR - http://www.scopus.com/inward/record.url?scp=85134062961&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85134062961&partnerID=8YFLogxK
U2 - 10.1145/3524842.3528470
DO - 10.1145/3524842.3528470
M3 - Conference contribution
AN - SCOPUS:85134062961
T3 - Proceedings - 2022 Mining Software Repositories Conference, MSR 2022
SP - 1
EP - 5
BT - Proceedings - 2022 Mining Software Repositories Conference, MSR 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2022 Mining Software Repositories Conference, MSR 2022
Y2 - 23 May 2022 through 24 May 2022
ER -