Can LLMs evaluate items measuring collaborative problem-solving?

Ella Anghel, Yu Wang, Madhumitha Gopalakrishnan, Pranali Mansukhani, Yoav Bergner

Research output: Contribution to journal › Conference article › peer-review

Abstract

Collaborative problem-solving (CPS) is a vital skill for students to learn, but designing CPS assessments is challenging due to the construct’s complexity. Advances in the capabilities of large language models (LLMs) have the potential to aid the design and evaluation of CPS items. In this study, we tested whether six LLMs agree with human judges on the quality of items measuring CPS. We found that GPT-4 was consistently the best-performing model, with an overall accuracy of .77 (κ = .53). GPT-4 performed best with zero-shot prompts, and the other models benefited only marginally from more complex prompts (few-shot, chain-of-thought). This work highlights challenges in using LLMs for assessment and proposes future research directions on the utility of LLMs for assessment design.
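
A minimal sketch of how LLM-versus-human agreement on item quality could be scored with overall accuracy and Cohen's kappa, the two statistics reported above. The binary coding scheme and the labels are hypothetical placeholders, not data from the paper; scikit-learn is assumed to be available.

    # Hypothetical example: score an LLM judge against human ratings of CPS items.
    from sklearn.metrics import accuracy_score, cohen_kappa_score

    # 1 = "acceptable CPS item", 0 = "flawed item" (made-up binary coding)
    human_labels = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
    llm_labels   = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]

    accuracy = accuracy_score(human_labels, llm_labels)   # raw proportion of agreement
    kappa = cohen_kappa_score(human_labels, llm_labels)   # agreement corrected for chance

    print(f"accuracy = {accuracy:.2f}, kappa = {kappa:.2f}")

Kappa is lower than raw accuracy whenever some agreement would be expected by chance alone, which is why both figures are typically reported together.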

Original language: English (US)
Journal: CEUR Workshop Proceedings
Volume: 3772
State: Published - 2024
Event: 1st Workshop on Automated Evaluation of Learning and Assessment Content, EvalLAC 2024 - Recife, Brazil
Duration: Jul 8 2024 → …

Keywords

  • collaborative problem-solving
  • item evaluation
  • large language models
  • prompt engineering

ASJC Scopus subject areas

  • General Computer Science
