GPT-4 in Education: Evaluating Aptness, Reliability, and Loss of Coherence in Solving Calculus Problems and Grading Submissions

Research output: Contribution to journalArticlepeer-review


In this paper, we initially investigate the capabilities of GPT-3 5 and GPT-4 in solving college-level calculus problems, an essential segment of mathematics that remains under-explored so far. Although improving upon earlier versions, GPT-4 attains approximately 65% accuracy for standard problems and decreases to 20% for competition-like scenarios. Overall, the models prove to be unreliable due to common arithmetic errors. Our primary contribution lies then in examining the use of ChatGPT for grading solutions to calculus exercises. Our objectives are to probe an in-context learning task with less emphasis over direct calculations; recognize positive applications of ChatGPT in educational contexts; highlight a potentially emerging facet of AI that could necessitate oversight; and introduce unconventional AI benchmarks, for which models like GPT are untrained. Pertaining to the latter, we uncover a tendency for loss of coherence in extended contexts. Our findings suggest that while the current ChatGPT exhibits comprehension of the grading task and often provides relevant outputs, the consistency of grading is marred by occasional loss of coherence and hallucinations. Intriguingly, GPT-4's overall scores, delivered in mere moments, align closely with human graders, although its detailed accuracy remains suboptimal. This work suggests that, when appropriately orchestrated, collaboration between human graders and LLMs like GPT-4 might combine their unique strengths while mitigating their respective shortcomings In this direction, it is imperative to consider implementing transparency, fairness, and appropriate regulations in the near future.

Original languageEnglish (US)
JournalInternational Journal of Artificial Intelligence in Education
StateAccepted/In press - 2024


  • AI
  • Automated Grading
  • Calculus
  • ChatGPT
  • Large Language Models
  • Natural Language Processes

ASJC Scopus subject areas

  • Education
  • Computational Theory and Mathematics


Dive into the research topics of 'GPT-4 in Education: Evaluating Aptness, Reliability, and Loss of Coherence in Solving Calculus Problems and Grading Submissions'. Together they form a unique fingerprint.

Cite this