Core Finding

When AI is uncertain about a student's response, humans are uncertain about exactly the same things. AI uncertainty mirrors human uncertainty—signaling where teacher judgment is most needed.

The Project: AI-Powered Elementary Science Assessment

U.S. Department of Education Research Initiative

A $10 million, 5-year project bringing together university, publisher, and evaluation partners to study how AI can support elementary science and literacy education through project-based learning.

This is a large-scale randomized controlled trial examining whether AI-assisted assessment improves student learning outcomes compared to traditional approaches.

  • $10M in DOE funding
  • 5 years
  • 160 first-grade teachers
  • 6,400 students

Partner Institutions:

  • Michigan State University – Lead institution, AI development
  • University of Alabama – Implementation site
  • New York University – Research collaboration
  • Excellent Learning – Curriculum publisher for dissemination
  • WestEd – Independent evaluator
From the Forum

Namsoo Shin: "This is the 10 million grant, it's a five years grant. We just started this year... We are going to conduct a randomized trial, whether the AI using or not, controlling treatment group. We are going to do the comparison study. So each year, we are going to deal with 3,200 students. And then the total, the impact of this project is 6,400 students in 80 elementary schools in Alabama."

The Innovation: Multi-Agent AI Scoring

Five Agents Instead of One

Rather than using a single AI model, the system deploys five different agent models to score each student response. When agents disagree, that disagreement signals uncertainty—and the need for human review.

The multi-agent approach fundamentally changes the goal of AI assessment. Instead of simply scoring, the system is designed to detect its own mistakes and identify where human judgment is needed.

How It Works:

  • Agent diversity: Each agent has different characteristics—one is generous, another is strict and conservative
  • Aggregation: All five agents score the same response, and results are aggregated
  • Uncertainty detection: When agents disagree (e.g., 3 say "A", 2 say "B"), the system flags the response as uncertain (see the sketch after this list)
  • Rationale provision: Each agent provides its reasoning, not just a score
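
To make the aggregation concrete, here is a minimal Python sketch of one way the majority vote and uncertainty flag could be computed. The function name, the data shapes, and the rule of flagging any disagreement are illustrative assumptions, not the project's actual implementation.

```python
from collections import Counter

def aggregate_agent_scores(scores, rationales):
    """Combine per-agent labels into a final score plus an uncertainty flag.

    `scores` is one label per agent, e.g. ["A", "B", "A", "A", "B"];
    `rationales` is a parallel list of each agent's explanation.
    (Illustrative sketch only; names and rules are assumptions.)
    """
    counts = Counter(scores)
    majority_label, majority_count = counts.most_common(1)[0]
    agreement = majority_count / len(scores)        # e.g. 3/5 = 0.6
    return {
        "score": majority_label,                    # the majority-vote label
        "agreement": agreement,
        "uncertain": agreement < 1.0,               # any disagreement -> flag for review
        "rationales": rationales,                   # reasoning travels with the score
    }

# Example: three agents say "A", two say "B" -> scored "A" but flagged as uncertain.
result = aggregate_agent_scores(
    scores=["A", "B", "A", "A", "B"],
    rationales=["rationale 1", "rationale 2", "rationale 3", "rationale 4", "rationale 5"],
)
print(result["score"], result["uncertain"])  # A True
```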
From the Forum

Namsoo Shin: "What we did is instead of using the one agent model, we're using the five different agent models scoring the same student responses. And what we did to kind of give the agent one score, the B, agent 2A, agent 3A, and agent 4AA, and we aggregated all of the student response, the agent output. And the major fault is A. However, because there was 2B, and we provided the rationale why we analyzed this data, and also the uncertainty level."

AI Uncertainty = Human Uncertainty

A remarkable finding: when AI agents are uncertain about a response, human experts are uncertain about the exact same responses. The points of confusion are identical.

This finding validates the multi-agent approach: if AI uncertainty reliably signals where humans would also struggle, then uncertainty detection becomes a tool for efficiently allocating human attention.

Why This Happens:

Elementary students' writing often contains ambiguous or unclear expressions—misspellings, incomplete thoughts, or creative interpretations. When a first-grader writes "stream" as "scream" or "mountain" as "wanting," both AI and human raters face the same interpretive challenge.

Practical Implication:

Rather than having teachers review every AI-scored response, they can focus on the flagged uncertain cases—dramatically reducing workload while maintaining accuracy where it matters most.
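
As a rough sketch of that triage step (building on the aggregation sketch above, and assuming each result carries the hypothetical `uncertain` flag), responses could be split into an auto-accepted set and a teacher-review queue:

```python
def triage_for_review(aggregated_results):
    """Split AI-scored responses into auto-accepted ones and a teacher-review queue.

    Each item is expected to carry the `uncertain` flag from the aggregation
    sketch above; field and function names are assumptions for illustration.
    """
    review_queue = [r for r in aggregated_results if r["uncertain"]]
    auto_accepted = [r for r in aggregated_results if not r["uncertain"]]
    return auto_accepted, review_queue

# Teachers then see only the flagged minority (with each agent's rationale),
# rather than re-reading every scored response.
```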

From the Forum

Namsoo Shin: "But where you're uncertain, between you and your coder, and where AI is uncertain, between you and the AI, are those the same kinds of errors? Yes. Yeah? Yeah. Yes. That's important. That is very amazingly, kind of because we're putting our data, human data, and so AI analyzed the student's written responses. It's very similar to the uncertainty. The point is very similar. For example, kind of elementary students writing it, stream, scream, and then some mountain and wanting. So it's the kind of we don't know is that that kind of wording is whether you mean this one or that one. We are certain, and the AI also uncertain. So it's very same points."

The Human-in-the-Loop Process

Two Levels of Human Review

The system involves humans at two critical stages: the research team iteratively refines the AI during development, and teachers review uncertain cases before feedback reaches students.

Level 1: Development Team

  • Iteratively review AI scoring results
  • Revise rubrics based on where AI makes mistakes
  • Continue until achieving at least 80% human-AI agreement (typically reaching 87-90%)

Level 2: Teachers in Classrooms

  • Receive AI scores along with uncertainty levels and rationales
  • Can modify AI scores based on their knowledge of individual students
  • Revise automated feedback before sending to students

This dual-layer approach means no AI decision reaches students without the possibility of human review—and the system actively identifies which decisions most need that review.
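
A minimal sketch of what the Level 2 hand-off could look like as a data structure; every field and method name here is an illustrative assumption, not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TeacherReviewItem:
    """One flagged response as a teacher might see it (hypothetical fields)."""
    student_id: str
    response_text: str
    ai_score: str                            # aggregated label, e.g. "A"
    uncertainty: float                       # e.g. 0.4 for a 3-2 agent split
    agent_rationales: List[str] = field(default_factory=list)
    teacher_score: Optional[str] = None      # set if the teacher overrides the AI
    revised_feedback: Optional[str] = None   # edited before it is sent to the student

    def final_score(self) -> str:
        # The teacher's judgment, when given, takes precedence over the AI score.
        return self.teacher_score or self.ai_score
```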

From the Forum

Namsoo Shin: "We gave that information to teacher. But teacher is not only final score. It's uncertainty level, and why this student response is very uncertain. So teacher can make a decision, and then they can change the AI score, and then they can revise the feedback, and then they can send it to the student. So it's kind of the, Jiliang mentions that, is we have two group of the human, the group, for developing and delivering the feedback to students."

The Surprising Bottleneck: Rubric Development

The main challenge isn't the AI, the data, or the prompting. It's rubric development. Humans intuitively understand rubrics, but AI needs explicit, step-by-step logical chains.

A critical insight for anyone implementing AI assessment: most problems trace back to how rubrics are specified, not to AI limitations or data quality.

The Translation Problem:

  • Human rubrics: Written for human interpretation, relying on implicit understanding and context
  • AI rubrics: Need explicit, step-by-step logical chains with no assumed knowledge
  • The gap: What's obvious to human experts must be made explicit for AI

The Iterative Solution:

The team revises rubrics iteratively, analyzing where AI makes mistakes and adding the specificity needed. This process continues until reaching the 80% agreement threshold—though most attempts achieve 87% or higher.
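
A sketch of that revision loop, assuming a hypothetical `score_fn` (the LLM scoring call) and `revise_fn` (the human rubric revision); only the 80% agreement threshold comes from the project description.

```python
def refine_rubric(rubric, validation_set, score_fn, revise_fn,
                  target_agreement=0.80, max_rounds=10):
    """Iteratively revise a rubric until human-AI agreement reaches the target.

    `validation_set` holds human-scored responses, e.g.
    {"response": "...", "human_score": "A"}. All names are illustrative.
    """
    agreement = 0.0
    for _ in range(max_rounds):
        predictions = [score_fn(rubric, item["response"]) for item in validation_set]
        matches = [p == item["human_score"] for p, item in zip(predictions, validation_set)]
        agreement = sum(matches) / len(validation_set)
        if agreement >= target_agreement:          # stop at 80%+ (often 87%+ in practice)
            break
        disagreements = [item for item, ok in zip(validation_set, matches) if not ok]
        rubric = revise_fn(rubric, disagreements)  # make implicit criteria explicit, step by step
    return rubric, agreement
```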

From the Forum

Namsoo Shin: "And then most of our problem is rubric development, not data, or AI, or prompting. It's more our rubric is a kind of human, intuitively understand the rubric content, but AI need very specific direction to how to analyze the student response step by step, logical chain. So we revised the rubric iteratively until we got at least 80% agreement. And so far, most of our attempt is 87% agreement."

Efficiency Gains

From 1,500 to Fewer Than 100 Training Samples

Previous machine learning approaches required 1,500 labeled student responses per item. The new approach achieves similar accuracy with fewer than 100 examples—dramatically reducing human annotation effort.

This efficiency gain comes from leveraging large language models' pre-existing language understanding combined with carefully specified rubrics.

Comparison:

  • Traditional ML: 1,500 labeled examples per assessment item
  • Multi-agent LLM: Fewer than 100 labeled examples per item, a little over 300 students in total
  • Reduction: ~93% less human labeling effort per item (see the quick calculation below)
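
For reference, the ~93% figure follows directly from the per-item counts above:

```python
traditional_ml, multi_agent_llm = 1500, 100           # labeled examples per item
reduction = (traditional_ml - multi_agent_llm) / traditional_ml
print(f"{reduction:.0%}")                              # 93%
```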

This makes it feasible to develop AI assessment for new items quickly—a critical factor for practical classroom implementation.

From the Forum

Namsoo Shin: "Before, kind of, we have two grant, and the one is the, we, first one is we're using the machine learning. That time, we're using the 1,500 student's data for item. But this time, we're using the only the... less than 100 students, and the total is a little bit more than 300 students. That means human effort to scoring, and annotating, and we reduced a lot."
