TY - GEN
T1 - Evaluating the Effectiveness of LLMs in Fixing Maintainability Issues in Real-World Projects
AU - Nunes, Henrique
AU - Figueiredo, Eduardo
AU - Rocha, Larissa
AU - Nadi, Sarah
AU - Ferreira, Fischer
AU - Esteves, Geanderson
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Large Language Models (LLMs) have gained attention for addressing coding problems, but their effectiveness in fixing code maintainability issues remains unclear. This study evaluates the capability of LLMs to resolve 127 maintainability issues from 10 GitHub repositories. We use zero-shot prompting with Copilot Chat and Llama 3.1, and few-shot prompting with Llama only. The LLM-generated solutions are assessed for compilation errors, test failures, and new maintainability problems. Llama with few-shot prompting successfully fixed 44.9% of the methods, while Copilot Chat and Llama zero-shot fixed 32.29% and 30%, respectively. However, most solutions introduced errors or new maintainability issues. We also conducted a human study with 45 participants to evaluate the readability of 51 LLM-generated solutions; 68.63% of participants observed improved readability. Overall, while LLMs show potential for fixing maintainability issues, their tendency to introduce errors highlights their current limitations.
AB - Large Language Models (LLMs) have gained attention for addressing coding problems, but their effectiveness in fixing code maintainability issues remains unclear. This study evaluates the capability of LLMs to resolve 127 maintainability issues from 10 GitHub repositories. We use zero-shot prompting with Copilot Chat and Llama 3.1, and few-shot prompting with Llama only. The LLM-generated solutions are assessed for compilation errors, test failures, and new maintainability problems. Llama with few-shot prompting successfully fixed 44.9% of the methods, while Copilot Chat and Llama zero-shot fixed 32.29% and 30%, respectively. However, most solutions introduced errors or new maintainability issues. We also conducted a human study with 45 participants to evaluate the readability of 51 LLM-generated solutions; 68.63% of participants observed improved readability. Overall, while LLMs show potential for fixing maintainability issues, their tendency to introduce errors highlights their current limitations.
KW - large language models
KW - maintainability
KW - refactoring
UR - http://www.scopus.com/inward/record.url?scp=105007296691&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105007296691&partnerID=8YFLogxK
U2 - 10.1109/SANER64311.2025.00069
DO - 10.1109/SANER64311.2025.00069
M3 - Conference contribution
AN - SCOPUS:105007296691
T3 - Proceedings - 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2025
SP - 669
EP - 680
BT - Proceedings - 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 32nd IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2025
Y2 - 4 March 2025 through 7 March 2025
ER -