TY - JOUR
T1 - GPT-4 in Education
T2 - Evaluating Aptness, Reliability, and Loss of Coherence in Solving Calculus Problems and Grading Submissions
AU - Gandolfi, Alberto
N1 - Publisher Copyright:
© The Author(s) 2024.
PY - 2024
Y1 - 2024
N2 - In this paper, we initially investigate the capabilities of GPT-3.5 and GPT-4 in solving college-level calculus problems, an essential segment of mathematics that has so far remained under-explored. Although improving upon earlier versions, GPT-4 attains approximately 65% accuracy for standard problems, which drops to 20% for competition-like scenarios. Overall, the models prove to be unreliable due to common arithmetic errors. Our primary contribution then lies in examining the use of ChatGPT for grading solutions to calculus exercises. Our objectives are to probe an in-context learning task with less emphasis on direct calculations; recognize positive applications of ChatGPT in educational contexts; highlight a potentially emerging facet of AI that could necessitate oversight; and introduce unconventional AI benchmarks, for which models like GPT are untrained. Pertaining to the latter, we uncover a tendency toward loss of coherence in extended contexts. Our findings suggest that while the current ChatGPT exhibits comprehension of the grading task and often provides relevant outputs, the consistency of grading is marred by occasional loss of coherence and hallucinations. Intriguingly, GPT-4's overall scores, delivered in mere moments, align closely with those of human graders, although its detailed accuracy remains suboptimal. This work suggests that, when appropriately orchestrated, collaboration between human graders and LLMs like GPT-4 might combine their unique strengths while mitigating their respective shortcomings. In this direction, it is imperative to consider implementing transparency, fairness, and appropriate regulations in the near future.
AB - In this paper, we initially investigate the capabilities of GPT-3.5 and GPT-4 in solving college-level calculus problems, an essential segment of mathematics that has so far remained under-explored. Although improving upon earlier versions, GPT-4 attains approximately 65% accuracy for standard problems, which drops to 20% for competition-like scenarios. Overall, the models prove to be unreliable due to common arithmetic errors. Our primary contribution then lies in examining the use of ChatGPT for grading solutions to calculus exercises. Our objectives are to probe an in-context learning task with less emphasis on direct calculations; recognize positive applications of ChatGPT in educational contexts; highlight a potentially emerging facet of AI that could necessitate oversight; and introduce unconventional AI benchmarks, for which models like GPT are untrained. Pertaining to the latter, we uncover a tendency toward loss of coherence in extended contexts. Our findings suggest that while the current ChatGPT exhibits comprehension of the grading task and often provides relevant outputs, the consistency of grading is marred by occasional loss of coherence and hallucinations. Intriguingly, GPT-4's overall scores, delivered in mere moments, align closely with those of human graders, although its detailed accuracy remains suboptimal. This work suggests that, when appropriately orchestrated, collaboration between human graders and LLMs like GPT-4 might combine their unique strengths while mitigating their respective shortcomings. In this direction, it is imperative to consider implementing transparency, fairness, and appropriate regulations in the near future.
KW - AI
KW - Automated Grading
KW - Calculus
KW - ChatGPT
KW - Large Language Models
KW - Natural Language Processing
UR - http://www.scopus.com/inward/record.url?scp=85192049196&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85192049196&partnerID=8YFLogxK
U2 - 10.1007/s40593-024-00403-3
DO - 10.1007/s40593-024-00403-3
M3 - Article
AN - SCOPUS:85192049196
SN - 1560-4292
JO - International Journal of Artificial Intelligence in Education
JF - International Journal of Artificial Intelligence in Education
ER -