Let law students opt in to AI grading?

A student ponders whether to have his grade determined by AI.

Here's the blog post I so wanted to write. It would have cited a fascinating recent article by prominent law professors ("the Cope Study") showing that AI grading and human grading are closely correlated: a Pearson correlation coefficient of 0.93. My confidently stated insight was going to be to use the fact that random errors cancel out: if you used AI to grade a number of student submissions instead of just a single one, you could get the correlation of the average AI grade and the average human grade to be even closer, perhaps to over 0.98 if AI-human disagreement was random. I had lots of math to prove it and the discussions of the Spearman-Brown prophecy formula that would have permeated my hypothetical blog entry would surely have gone viral. Based on my analysis, I was going to argue to give students the choice: lots of prompt feedback and grades from AI that would be almost indistinguishable from those given by a human or the classic end of year final exam graded by a human. My bet was that almost all students would opt for AI grading, particularly if the professor were hypothetically to offer pizza as a reward for more than 90% AI selections.

What a happy result this would have been! Law faculty would be relieved from the need to grade submissions individually, a point that would have met with massive sympathy in late May. They could just write rubrics (or have AI do it for them) and hand the job of evaluating individual submissions over to Claude. We could feel like undergraduate faculty who have TAs. (Explain again to me why law school professors don't get grading TAs.) Students would get lots of feedback and essentially the same grades. A rare win-win.

Except then I recognized that at most law schools students do not receive numeric grades. Students receive discrete letter grades (A, A-) in which numeric values are discretized into bins, losing information in the process. I then worked in collaboration with AI to determine the effect of this decision to lose information: how frequently would the ultimate letter grade assigned by the human and the ultimate letter grade assigned by the AI differ? The results were surprising and, from my perspective, disappointing, and in two distinct ways.

The first: averaging cannot lift the correlation as far as I had hoped. A grade built from more graded assessments is a steadier measure of a student's ability than a grade built from fewer, and the AI can cheaply grade all five assessments in the course while I, realistically, personally grade only two of them: a midterm and a final. Because the correlation between two such averages is held down by whichever side rests on fewer gradings, my two-assessment side caps the effective correlation near 0.96, not the 0.98 I had computed when I neglected to think through that reaching it would require me to grade five submissions personally.

The second, and the real subject of this post, is discretization. Even with a Pearson correlation coefficient of 0.96 between the numeric mean grade assigned by human graders to a portfolio of student work and the numeric mean grade assigned by AI graders, the actual discrete letter grade the student was assigned by each grader would agree only about 65% of the time. Fortunately, the disagreement, when it occurs, is almost always small: roughly 33% of the time the grade differs by a single unit — A- versus B+, or B- versus B — and only about 2% of the time by two units.

How can this be? The reason is discretization. Letter grades are not the original measurements. They are bins imposed on a continuous score line. Two graders can assign different letters only when a student’s numeric score is near one of the cut lines between grades. A student in the middle of a grade band is stable; small differences between the professor’s score and the AI score do not matter. A student sitting near a boundary can flip from, say, B+ to B with almost no substantive disagreement at all. As discussed further in the draft academic paper attached at the bottom of this blog post, the actual probability of flipping depends on technical matters such as the spread of the professor's grades and the number of grading categories. But the central result remains unchanged. Even high degrees of agreement between professor and AI (or, actually, one professor and another) in numeric scores can result in different letter grades a higher fraction of the time than most would expect or prefer.
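For readers who want to watch the boundary effect happen, here is a minimal simulation sketch. It is not the model in the draft paper and should not be expected to reproduce its exact percentages. The assumptions are mine and purely illustrative: both graders' scores are standard normal, they correlate at rho, and the letter-grade cut lines sit 0.65 standard deviations apart. The names `CUTS` and `letter_gap_rates` are hypothetical.

```python
import bisect
import math
import random

# ASSUMPTION (illustrative only): eight grade bins whose cut lines are spaced
# 0.65 standard deviations apart. A real grading scale may look different.
CUTS = [-1.95, -1.30, -0.65, 0.0, 0.65, 1.30, 1.95]


def letter_gap_rates(rho, cuts=CUTS, trials=50_000, seed=0):
    """Estimate how often two graders land 0, 1, or 2+ letter bins apart."""
    rng = random.Random(seed)
    counts = [0, 0, 0]  # same letter, one unit apart, two or more units apart
    noise_scale = math.sqrt(1.0 - rho * rho)
    for _ in range(trials):
        human = rng.gauss(0.0, 1.0)
        # Build an AI score whose correlation with the human score is rho.
        ai = rho * human + noise_scale * rng.gauss(0.0, 1.0)
        # bisect_right returns the index of the bin a score falls into.
        gap = abs(bisect.bisect_right(cuts, human) - bisect.bisect_right(cuts, ai))
        counts[min(gap, 2)] += 1
    return tuple(c / trials for c in counts)


for rho in (0.93, 0.96, 0.985):
    same, one, two_plus = letter_gap_rates(rho)
    print(f"rho={rho}: same={same:.2f}, one unit={one:.2f}, two or more={two_plus:.2f}")
```

Narrowing the spacing in `CUTS` (more grade categories for the same spread of scores) lowers the agreement rate at any fixed correlation, which is the dependence on spread and number of categories mentioned above.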

This result does not mean that AI grading is a bad idea. I still think it is worth exploring giving students the discussed option. But the concrete results look a bit different. Here's the real question students would have to face. You can go for AI grading. If I give you five graded items over the semester, the AI grades all five and I grade two of them. There is roughly a 65% chance the grade you get from the AI will be the same as the grade you get from me. There is about a one-in-three chance it differs by a single unit — split roughly evenly between one unit higher (an A- instead of a B+) and one unit lower (a B instead of a B+) — and about a 2% chance it differs by two units. Or you can go the traditional route: a midterm and one end of year final exam graded by me. It's not clear in these instances which grade would be more "correct." The Cope study suggests that human law professor graders sometimes apply their rubrics quite wrongly.

It's not so clear to me which option students would take under these circumstances. There are real considerations on both sides. The AI option gives students feedback five times during the semester instead of whatever sparse comments arrive after a single final. For most students, this is the more valuable feature by far. Feedback is what lets you correct course while there is still time to correct course; yes, a midterm provides some information, but the graded final tells you what you got wrong only after the class is over. The AI option also spreads the assessment across five tests, which means a single bad day — illness, a misread question, the doctrine you happened not to review the night before — does less damage. Anyone who has bombed a final they should have aced will feel the weight of this. A midterm that counted for only a small fraction of their grade won't have helped them much.

On the other side, five tests is more work than one or two. Preparing five times across the semester is genuinely more demanding than preparing once in the middle and once at the end, even if the total hours are similar. Some students will pick the traditional route for that reason alone, and the choice is rational on its own terms.

A subtler factor: students who perceive themselves as strong may resist the AI option, but for a reason that does not survive much scrutiny. The intuition is that their own grade is the right grade and the AI's roughly one-in-six flip in either direction is a deviation from it. The Cope correlation gives no warrant for that assumption: either grader could be the one closer to whatever the student's work actually deserved. The Cope study pointed out that significant deviations between AI grades and professor grades were generally the result of professor error. The students most likely to opt out of AI grading, in other words, may be opting out on a premise the underlying data does not support.

On balance, I suspect many but not all students will opt in and that the result may be a function of group dynamics and attitudes about AI more than any advanced scheming. Students who have already used AI as a study companion in undergraduate life or elsewhere will find the AI option familiar and probably welcome it. They might also prefer it if they mistrust the professor! Students who distrust AI on principle, or who have absorbed the misguided view that being graded by a machine is somehow demeaning, will decline regardless of what the math says. And students often watch each other. If the first few classmates to speak up announce they are taking the traditional route, others will follow without thinking too hard about why; if the early movers go the other way, the same drift will carry the room toward AI. The choice will be sociological as much as it is strategic, and the equilibrium any given class lands in may say more about that class than about the merits of either option.

Framing might also matter. Here are two ways to present the choice. Option A: "I'm planning on grading using AI this semester. The great news is that you will get five opportunities to perform, along with prompt and thorough feedback. But, if you'd prefer just a midterm and a final and grading by me over the week or two that follows, you can elect that. The thing is, although there is a 65% chance your grade will be the same, there is a 33% chance it will deviate by one unit and even a 2% chance it could deviate by two." Option B: "I'm planning on grading the usual way this semester. You'll get a midterm and a final graded by me over a week or two. But, if you want, you can have AI grade five submissions and get prompt and thorough feedback. The thing is, although there is a 65% chance your grade will be the same, there is a 33% chance it will deviate by one unit and even a 2% chance it could deviate by two." Although Option A and Option B are functionally identical, student choice might well depend on which way the choice is described.

The presentation and the resulting choice matter. From the perspective of the school, there might be a benefit even if only a fraction of students took the AI option. That fraction of students would receive frequent feedback. And instead of slogging through all the exams, law faculty could review only the remaining share of the submissions, leaving them time to work on precious scholarship or improved teaching ideas in an era of AI or, perish the thought, to enjoy some leisure.

The work I did also suggests a more radical possibility. The happy world in which students receive tons of (AI) feedback and professors create rubrics rather than grade individual submissions could be restored if schools would just abandon discrete letter grades and give out raw numeric scores. Our misery is the consequence of our choice to deliberately lose information. Too heretical? Perhaps, but faculty should think hard about our many orthodoxies the next time they face a stack of student essays to evaluate or blithely accept that the amount of feedback their students receive is insufficient.

Notes

Note 1

Suppose the only submission I personally grade is the final — or the midterm counts so little that it serves only as feedback. Then my side of the comparison rests on a single grading, and the effective correlation can no longer climb as it did with two. It is capped near the square root of my own single-grading reliability, which here is essentially Cope's 0.93, and no amount of AI grading rescues it: five AI gradings or fifty, the bottleneck is my once-graded score. The effective correlation falls from about 0.96 back to roughly 0.93 — meaning the second human grading, the midterm, was the entire source of the lift. At 0.93 the AI and I assign the same letter about 57% of the time rather than 65%, and the chance of a two-grade gap rises from roughly 2% to about 5%.

The opposite case is a course like legal research and writing, where grading five submissions across the semester is already traditional. Now both sides of the comparison rest on five gradings, and something nice happens: the effective correlation stops depending on which grader is the noisier one. With both reliabilities averaged five times over, the bottleneck disappears and the correlation is pinned near 0.985. The 0.98 of my original daydream turns out to be reachable after all — but only because I, and not just the AI, am grading five times. At that level the AI and I assign the same letter about 78% of the time, and a two-grade gap becomes vanishingly rare — well under 1%. Discretization still takes its cut, but in a course built around multiple graded submissions the penalty is about as small as it gets.
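As a rough check on the three figures in this note (0.93, 0.96, and 0.985), here is a minimal sketch. It rests on a simplifying assumption that is one reading of the text, not necessarily the model in the draft paper: all AI-human disagreement is attributed to noise on the side with fewer gradings (the human side), so a single human grading has reliability 0.93 squared, and averaging more human gradings raises that reliability by the ordinary Spearman-Brown formula. The function names are hypothetical.

```python
import math


def spearman_brown(reliability, k):
    """Reliability of the average of k parallel gradings."""
    return k * reliability / (1 + (k - 1) * reliability)


def ceiling_correlation(single_corr, human_gradings):
    """Correlation ceiling set by the human side alone.

    ASSUMPTION: the AI composite is treated as noise-free, so the single
    human grading's reliability is single_corr ** 2 and the ceiling is the
    square root of the averaged human reliability.
    """
    return math.sqrt(spearman_brown(single_corr ** 2, human_gradings))


for m in (1, 2, 5):
    print(f"{m} human grading(s): ceiling about {ceiling_correlation(0.93, m):.3f}")
```

Under that assumption the ceiling is 0.93 with one human grading, about 0.963 with two, and about 0.985 with five, which lines up with the figures quoted in this note.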

Note 2

There are hybrid possibilities. Borrow the mechanic from baseball's automated strike zone. A team gets a small, fixed number of challenges; a player taps his cap, the system reviews the call, and (the key incentive) if the challenge succeeds, the team keeps it. Win, and it costs nothing; lose, and it is gone. Now invert that mechanism. In baseball, with a tap of the cap by the offended player, the machine reviews the human; here the human reviews the machine. A student in the AI regime gets m challenges a year (start with m = 1, perhaps m = 2) and may spend one (perhaps by raising their Bluebook or "filing a motion for reconsideration") to demand that a human professor re-grade a specific assessment. If the human, who promises not to be annoyed, finds the AI got it wrong, the student keeps the challenge. If the human sustains the AI, the challenge is spent. The keep-it-if-you-win rule rewards well-founded challenges and discourages nuisance ones, and it guarantees a human backstop on the grades a student cares most about. And it does one more thing: it creates a live audit. If students keep winning, the rubric or the model needs work.
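The keep-it-if-you-win rule is simple enough to write down as bookkeeping. The sketch below is illustrative only: the class name `ChallengeBudget` and its fields are hypothetical, and it assumes that any re-grade that differs from the AI's letter, higher or lower, counts as the AI having got it wrong.

```python
class ChallengeBudget:
    """Tracks one student's challenges under a keep-it-if-you-win rule."""

    def __init__(self, challenges=1):
        self.remaining = challenges  # the note's m
        self.attempts = 0            # audit: challenges filed
        self.overturned = 0          # audit: challenges the student won

    def challenge(self, ai_grade, human_regrade):
        """Return the grade that stands after a human re-grade."""
        if self.remaining <= 0:
            raise RuntimeError("no challenges left")
        self.attempts += 1
        if human_regrade != ai_grade:
            # ASSUMPTION: any difference means the AI was wrong, so the
            # student keeps the challenge and the human grade stands.
            self.overturned += 1
            return human_regrade
        # The human sustained the AI: the challenge is spent.
        self.remaining -= 1
        return ai_grade


budget = ChallengeBudget(challenges=1)
print(budget.challenge("B+", "A-"), budget.remaining)  # a win: challenge kept
print(budget.challenge("B", "B"), budget.remaining)    # a loss: challenge spent
```

Summing `overturned` and `attempts` across a class gives the live audit: a high overturn rate is the signal that the rubric or the model needs work.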

Note 3

An adversarial AI might make grading even better than in the Cope study. Picture a second model whose only job is to attack the first one's grade, arguing the rubric was misread or the points miscounted, the two iterating until they converge. This is becoming technically more feasible as features such as /goal in Claude Code or similar features in Codex become more prevalent. I actually ran an earlier draft of this blog post through /goal and it slightly improved the draft. Here was the prompt:

@/Users/Seth/Dropbox/Scholarship/ai-graded-exams/blog-post.md /goal Here is a draft essay. Create two items of data: 1. An evaluation of the draft and suggestions for how it could be improved; 2. A score from 0 (awful) to 10 (fantastic). Then revise the essay using data[[1]]. Keep iterating until the score (data[[2]]) does not improve over two consecutive turns or for 8 iterations, whichever comes first.
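The stopping rule in that prompt can be written out as an ordinary loop. This sketch does not call Claude Code and says nothing about how /goal works internally. `evaluate` and `revise` are hypothetical callables standing in for the model, and treating "does not improve" as "fails to beat the best score so far" is one reading of the prompt, not the only one.

```python
def refine(draft, evaluate, revise, patience=2, max_iters=8):
    """Evaluate and revise until the score stalls or the iteration cap is hit.

    evaluate(draft) -> (feedback, score); revise(draft, feedback) -> new draft.
    Both are placeholders for whatever model does the real work.
    """
    best_draft, best_score = draft, float("-inf")
    stalled = 0      # consecutive evaluations without a new best score
    history = []     # every score seen, in order
    for _ in range(max_iters):
        feedback, score = evaluate(draft)
        history.append(score)
        if score > best_score:
            best_draft, best_score, stalled = draft, score, 0
        else:
            stalled += 1
            if stalled >= patience:
                break
        draft = revise(draft, feedback)
    return best_draft, best_score, history


# Toy stand-ins so the loop can be run without any model at all.
def toy_evaluate(draft):
    return "add detail", min(len(draft), 5)


def toy_revise(draft, feedback):
    return draft + "!"


best, score, history = refine("ab", toy_evaluate, toy_revise)
print(best, score, history)
```

Returning the best-scoring draft rather than the last one is a design choice of this sketch: it guards against a final revision that made things worse.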

Note 4

Why repeated AI grading can move a correlation from 0.93 toward 0.98

Earlier I suggested that one might move from a single-pass Pearson correlation of about 0.93 to something closer to 0.98 by averaging several AI gradings of the same work. That claim is basically right, but the exact formula depends on what one means by the “human standard.”

The intuition is the Spearman-Brown idea: each grade contains a shared signal plus error. When we average repeated gradings, the shared signal remains, while independent error partly cancels. The more independent the errors, the more averaging helps. The more the same error repeats across passes, the less averaging helps.

Let ρ be the single-pass correlation. Let n be the number of grading passes being averaged. Let z measure how correlated the errors are across repeated passes:

z = 0: the repeated errors are independent; averaging helps as much as possible.
z = 1: the same error repeats perfectly; averaging does not help at all.
Intermediate values mean that some of the error is random and some is persistent.

Case 1: symmetric Spearman-Brown averaging

If both sides are treated symmetrically — for example, if both the AI grade and the human comparison are composites of n parallel measurements — the generalized Spearman-Brown formula is:

    ρₙ = nρ / ( nρ + (1 − ρ)[1 + (n − 1)z] )
  

When z = 0, this reduces to the ordinary Spearman-Brown formula:

    ρₙ = nρ / (1 + (n − 1)ρ)
  

When z = 1, the formula gives ρₙ = ρ. That is exactly what should happen: if every AI pass repeats the same error, averaging does not improve reliability.

Case 2: AI composite compared with a fixed true-score standard

If instead the human benchmark is treated as a stable true-score standard, and only the AI side is being averaged, the technically cleaner formula is:

    ρₙ = 1 / sqrt( 1 + (1/ρ² − 1)[1 + (n − 1)z]/n )
  

At high correlations, this formula gives numbers very close to the symmetric Spearman-Brown formula. So for the examples here, the substantive conclusion does not change. But this version is conceptually better if the “human standard” is a careful, stabilized benchmark rather than another noisy single grade.

Case 3: AI composite compared with one noisy human grade

There is a third possibility. If the AI is averaged several times but the comparison point is just one ordinary noisy human grade, then averaging helps less. In that case, assuming the human and AI single-pass errors are otherwise comparable, a useful formula is:

    ρₙ = ρ / sqrt( ρ + (1 − ρ)[1 + (n − 1)z]/n )
  

This produces a smaller gain because the human side still contains unreduced noise. Averaging the AI grades makes the AI estimate more stable, but it does not remove the noise in the single human comparison grade.
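To make the three cases easier to compare, here is a minimal sketch that transcribes each formula directly. The function names are hypothetical; the symbols rho, n, and z are the ones defined above.

```python
import math


def error_factor(n, z):
    """Share of single-pass error variance left after averaging n passes."""
    return (1 + (n - 1) * z) / n


def case1_symmetric(rho, n, z=0.0):
    """Both sides are composites of n parallel measurements."""
    return n * rho / (n * rho + (1 - rho) * (1 + (n - 1) * z))


def case2_true_score(rho, n, z=0.0):
    """AI composite of n passes against a stable true-score benchmark."""
    return 1 / math.sqrt(1 + (1 / rho**2 - 1) * error_factor(n, z))


def case3_one_noisy_human(rho, n, z=0.0):
    """AI composite of n passes against a single noisy human grade."""
    return rho / math.sqrt(rho + (1 - rho) * error_factor(n, z))


rho, n = 0.93, 5
for name, fn in [("Case 1", case1_symmetric),
                 ("Case 2", case2_true_score),
                 ("Case 3", case3_one_noisy_human)]:
    print(f"{name}: {fn(rho, n):.3f}")
```

With ρ = 0.93, n = 5, and z = 0, the first two functions both return roughly 0.985, while the third returns roughly 0.957. All three return ρ unchanged when n = 1 or z = 1.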

Concrete example

Start with ρ = 0.93. If the repeated AI errors are mostly independent, five passes can plausibly move the AI composite to about 0.985 under the symmetric or true-score-standard models. But if the repeated AI passes share the same bias (for example, if the model misunderstands the same part of the answer each time), averaging helps much less.

n passes	z = 0	z = 0.10	z = 0.25	z = 0.50	z = 1.00
1	0.930	0.930	0.930	0.930	0.930
2	0.964	0.960	0.955	0.947	0.930
3	0.976	0.971	0.964	0.952	0.930
5	0.985	0.979	0.971	0.957	0.930
10	0.993	0.986	0.976	0.960	0.930

The lesson is not that averaging magically fixes grading. It fixes only the portion of the error that varies across passes. If the model's mistakes are mostly independent from pass to pass, averaging can substantially improve the composite. If the model's mistakes are systematic, averaging mostly preserves the mistake.
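The entries in the table above match the symmetric (Case 1) formula with ρ = 0.93. A short sketch that regenerates them follows; the names `symmetric_composite` and `build_table` are hypothetical.

```python
def symmetric_composite(rho, n, z):
    """Generalized Spearman-Brown correlation for n averaged passes (Case 1)."""
    return n * rho / (n * rho + (1 - rho) * (1 + (n - 1) * z))


RHO = 0.93
Z_VALUES = (0.0, 0.10, 0.25, 0.50, 1.00)
PASSES = (1, 2, 3, 5, 10)


def build_table(rho=RHO):
    """Map each number of passes to its row of rounded correlations."""
    return {
        n: [round(symmetric_composite(rho, n, z), 3) for z in Z_VALUES]
        for n in PASSES
    }


print("n passes\t" + "\t".join(f"z = {z:g}" for z in Z_VALUES))
for n, row in build_table().items():
    print(f"{n}\t" + "\t".join(f"{value:.3f}" for value in row))
```

Changing `RHO` shows how the same pattern plays out for a weaker or stronger single-pass correlation: the z = 1 column never moves, and the z = 0 column climbs fastest.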

Note 5

The Cope study is: "Grading Machines: Can AI Exam-Grading Replace Law Professors?" (Journal of Law and Empirical Analysis, 2026), Kevin Cope, Jens Frankenreiter, Scott Hirst, Eric Posner, Daniel Schwarcz, and Dane Thorley. They took four real final exams — civil procedure, contracts, torts, corporate law — from four top-30 law schools, 205 students in all, and had GPT-5 grade them. Not a custom model tuned to minimize error. The model any professor could have opened in a browser. I suspect AI would do better under models that have developed since the time of their study. The study further found that the model's performance improved greatly if the professor gave it a rubric.

Note 6

People concerned about whether a 0.93 correlation is good enough to let AI grade given the discretization problem I have noted in this post should consider the far worse discrepancy in human-human grading. Test-retest studies of human essay grading generally land well short of perfect, often in the 0.7-to-0.8 range: the grade you give an exam today is a noisy predictor of the grade you would give it next month. Against that backdrop, a 0.93 correlation between the AI and the professor is not the AI falling short of a fixed, reliable benchmark. There is no fixed benchmark. The professor's grade is itself one noisy draw, and neither it nor the AI's grade has an automatic claim to being the true one.

Note 7

Here is the 27-page draft academic paper containing 153 mathematical expressions and 26 references. It was produced through a dialog among me, ChatGPT (which did most of the work) and Claude, which was used for checking and editing. I used a Consensus connector to situate the work in the literature. Frankly, some of the paper is above my head. I am not 100% certain it is right in every detail – thus the draft label. But the closed form expressions yield values very close to what I obtain in simulations of the system. So I don't think it is fundamentally wrong.

academic-paper.pdf (287 KB)

If someone wants the original .tex file from which the PDF was created, send me a note.