Interesting to Know

GPT-4 Can Evaluate Student Answers as Accurately as Human Examiners — Passau University Study

OpenAI's GPT-4 has shown that it can assess students' written responses at a level comparable to, and sometimes exceeding, that of human lecturers. This is the conclusion of a study conducted by researchers at the University of Passau, led by Professor Johann Graf Lambsdorff.

Published in Scientific Reports, the study aimed to determine whether AI could reliably grade open-ended macroeconomics responses. The team analyzed 300 student answers to six typical questions, comparing evaluations made by both human reviewers and GPT-4.

Key findings of the study:

  • Innovative comparison method: Rather than treating human scores as the gold standard, the researchers measured the consistency between evaluators. When GPT-4 replaced one of the three human reviewers and agreement within the resulting panel increased, this indicated a higher-quality assessment (a code sketch of this idea follows the list).
  • Accuracy of GPT-4: The AI system accurately ranked answers based on completeness and correctness. It frequently aligned with human judgments in identifying the best, middle, and weakest responses.
  • Tendency to over-score: GPT-4 occasionally assigned marks up to one point higher than human reviewers in numerical scoring.
  • Resilience to ambiguity: The technical part of the experiment, conducted by Abdullah Al Zubair under the guidance of Professor Michael Granitzer, showed that GPT-4 maintained consistent grading even when questions were vaguely formulated.
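
To make the swap-in comparison concrete, here is a minimal sketch in Python. It assumes mean pairwise Spearman correlation as the agreement metric and uses made-up scores; the article does not specify which statistic the Passau team computed, so both the metric and the numbers are illustrative only.

```python
import numpy as np
from scipy.stats import spearmanr

def panel_agreement(scores: np.ndarray) -> float:
    """Mean pairwise Spearman correlation across all rater pairs.

    scores has shape (n_raters, n_answers).
    """
    n = scores.shape[0]
    corrs = [spearmanr(scores[i], scores[j])[0]
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(corrs))

# Hypothetical 0-5 marks for ten answers from three human reviewers.
humans = np.array([
    [3, 5, 2, 4, 1, 5, 3, 2, 4, 1],  # reviewer A
    [2, 5, 2, 3, 1, 4, 3, 1, 4, 2],  # reviewer B
    [3, 4, 1, 4, 2, 5, 2, 2, 5, 1],  # reviewer C
])
# Hypothetical GPT-4 marks for the same ten answers.
gpt4 = np.array([3, 5, 2, 4, 1, 5, 3, 2, 4, 2])

baseline = panel_agreement(humans)
print(f"human-only panel agreement: {baseline:.3f}")
for k in range(humans.shape[0]):
    swapped = humans.copy()
    swapped[k] = gpt4  # GPT-4 takes this reviewer's seat
    print(f"GPT-4 replaces reviewer {k}: {panel_agreement(swapped):.3f}")
```

If agreement rises when GPT-4 takes a reviewer's seat, its grades are at least as consistent with the remaining humans as the replaced reviewer's were, which is the signal the study treats as evidence of assessment quality.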

Despite these promising results, the researchers emphasized that AI should not fully replace human evaluators. Humans remain essential for preparing model answers and making final grading decisions. GPT-4 can, however, serve as a secondary reviewer, improving both grading efficiency and objectivity.

The Passau study points to a new model of collaboration between humans and AI in higher education, with artificial intelligence functioning as a reliable assistant rather than a replacement.