TY - JOUR
T1 - Can LLMs evaluate items measuring collaborative problem-solving?
AU - Anghel, Ella
AU - Wang, Yu
AU - Gopalakrishnan, Madhumitha
AU - Mansukhani, Pranali
AU - Bergner, Yoav
N1 - Publisher Copyright:
© 2024 Copyright for this paper by its authors.
PY - 2024
Y1 - 2024
AB - Collaborative problem-solving (CPS) is a vital skill for students to learn, but designing CPS assessments is challenging due to the construct’s complexity. Advances in the capabilities of large language models (LLMs) have the potential to aid the design and evaluation of CPS items. In this study, we tested whether six LLMs agree with human judges on the quality of items measuring CPS. We found that GPT-4 was consistently the best-performing model, with an overall accuracy of .77 (k = .53). GPT-4 performed best with zero-shot prompts; other models benefited only marginally from more complex prompting strategies (few-shot, chain-of-thought). This work highlights challenges in using LLMs for assessment and proposes future research directions on the utility of LLMs for assessment design.
KW - collaborative problem-solving
KW - item evaluation
KW - large language models
KW - prompt engineering
UR - http://www.scopus.com/inward/record.url?scp=85207071776&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85207071776&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85207071776
SN - 1613-0073
VL - 3772
JO - CEUR Workshop Proceedings
JF - CEUR Workshop Proceedings
T2 - 1st Workshop on Automated Evaluation of Learning and Assessment Content, EvalLAC 2024
Y2 - 8 July 2024
ER -