When is AI likely to hallucinate?

Learning the situations in which an AI is most likely to hallucinate

The Cleon Jones problem

Don't want to read this? Here's a synthetic podcast that doesn't hallucinate much.

AIs hallucinate. That's a problem. Of course, humans "hallucinate" too – much more often, actually – but we humans tend to be more forgiving of ourselves. But just as it is crucial to know when humans in general, or a particular human, are likely to be mistaken or lying, so too with AI. And that's what I want to discuss here today. Because having a really good sense of when an AI is likely to get it wrong is crucial to effective use of the tool. The post is inspired by my interview today (start at 13:07) on KUHF's Houston Matters, where the host seemed extremely concerned about AIs getting legal issues wrong – or, more precisely, about unquestioning human reliance on AI in settings where the consequences of error are particularly high. Think, as he suggested, about losing your house because your lawyer cited an imaginary case rather than the correct one.

The KUHF host interviewing me is absolutely right to be concerned. False confidence in AI's infallibility based on the beguiling verisimilitude of its responses can have dire consequences and dishonor the legal profession. But what I would like to suggest here is that the problem should diminish significantly as models evolve, as students learn how to use the tools responsibly, and as the consequences of that overconfidence become clear.

People using AI need to allocate their fact checking efforts to those situations in which AI is most likely to hallucinate. Consider how we already perform epistemic triage in everyday life—allocating our limited attention and fact-checking resources based on the likely reliability of the source and the consequences of being wrong. We scrutinize a babysitter’s references not because we assume dishonesty, but because the stakes are high. We double-check a friend’s confident medical advice—even for a minor ailment—because experience tells us that non-experts often get health information wrong. (What does it say that we are likely now to do so with AI?!) Conversely, we take a neighbor’s pizza recommendation at face value: the stakes are low, and they know the local landscape. A colleague’s description of ordinary HR policy? Probably reliable. But if that same colleague offers advice about international tax law over lunch, we start googling—even if the personal stakes are modest—because the probability of error spikes outside one’s domain of expertise.

This kind of judgment—this triage—is essential to the effective use of AI. Users must ask not only what the AI is saying but how likely it is to be wrong, and what the cost of that error would be. If the question concerns recent developments, niche regulatory schemes, idiosyncratic procedural rules, or factual assertions that hinge on subtle distinctions, then both the likelihood and consequences of hallucination may be high. In those cases, verification is essential. On the other hand, if the prompt concerns basic black-letter law, or the kind of synthesis AI handles well—like summarizing precedent, outlining arguments, or modeling IRAC structure—the probability of serious error may be low, and any mistake easy to spot and correct. So we check lightly, or, dare I say it, not at all.

The point is not to panic about AI’s fallibility, nor to place blind trust in it, but to develop calibrated, domain-sensitive instincts—just as we do when assessing human expertise. Whether the source is artificial or biological, epistemic triage remains our best defense.
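
If it helps to see the rule of thumb made explicit, the triage can be caricatured in a few lines of Python. The thresholds and numbers below are invented purely for illustration; the only point is that verification effort should scale with both the probability of error and the cost of being wrong.

  # Toy "epistemic triage" heuristic. The inputs and thresholds are made up;
  # the idea is just that scrutiny should track expected harm.
  def verification_effort(p_error: float, cost_of_error: float) -> str:
      """Return a rough verification level given estimated error risk and stakes."""
      expected_harm = p_error * cost_of_error  # crude expected-loss calculation
      if expected_harm > 10:
          return "verify against primary sources"
      if expected_harm > 1:
          return "spot-check the key facts"
      return "take it at face value"

  # Neighbor's pizza tip: low stakes, decent reliability.
  print(verification_effort(p_error=0.3, cost_of_error=1))    # take it at face value
  # AI-drafted brief citing case law: high stakes, nontrivial error rate.
  print(verification_effort(p_error=0.2, cost_of_error=100))  # verify against primary sources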

Here are three and a half factors that in my experience contribute to AI error.

Factor 1: How good is the model?

Many complaints about AI hallucination arise from use of free or old models for serious work. Use of free AI is completely fine if you want a vegan recipe for eggplant parmesan; AI is likely to do a great job and unlikely to poison you. Indeed, AI could have given an excellent answer way back in 2023 with models available at that time. But for serious professional work such as legal practice or legal education, you need to pay. You need to be using the latest and best models – Gemini 2.5 Pro, ChatGPT o3, Claude Sonnet 4, Grok 4 – and not free models such as meta.ai or something built into a non-expert software developer's product. (Footnote: There are also bad commercial AIs. That's a subject for another post.) You likewise need to be cautious about using small models that run locally on your machine. Those models preserve privacy but at the cost of accuracy. Again, a subject for a future blog post.

For law students, the moral is clear. Don't get stuck in the trap of cheaping out with older models, getting bad answers, claiming AI is lousy, and never paying money to see if you might do better. Just pay at least one of the leading AI companies the $20 per month that it takes to get access to the better features. I suspect that after a few months you will be happily addicted. You will realize that your choice is more entertaining than a comparably priced Netflix subscription and far more valuable to your legal careers. Well-compensated professors or attorneys have no excuse when they use free models for conversations they care about and then get wrong answers.

Factor 2: How specific is the question relative to the data?

This is the "Cleon Jones Problem." Before my interview on KUHF, I asked an AI I knew to be strong (ChatGPT 4o, to name names) a question I nonetheless suspected it would get wrong. "Who batted second for the Mets in Game 3 of the 1969 World Series?" Within seconds, ChatGPT responded without equivocation: "Cleon Jones." (Here's a link to my conversation). I then asked Google (not using its AI component) to check this response. It pointed me here, which revealed that Wayne Garrett batted second, not Cleon Jones. Why did the AI get it wrong, and why was I right to suspect that it would? I confronted the same AI with evidence of its failing and asked for an explanation. The blue text box below shows its response to my cross-examination – an answer that I suspected would be correct and that, based on my knowledge, is in fact right. (I've done a tiny bit of reformatting.)

Probabilistic Foundations and the Illusion of Precision

At our core, language models operate by predicting the most statistically probable continuation of a given input. That statistical machinery is incredibly powerful when it comes to patterns that are abundant in our training data—like general baseball lineup strategies, the narrative arc of famous trials, or the syntax of Python code. But when asked for a specific historical fact (e.g., “Who batted second in Game 3 of the 1969 World Series?”), we’re often drawing from:

  • Generalizable knowledge patterns (e.g., "Cleon Jones was often a top-of-the-order hitter for the 1969 Mets"),
  • Secondary descriptions (possibly blog posts or summaries of the series),
  • Limited direct representations of box scores or lineup cards, especially for specific games.

The model may “hallucinate” or assert something plausible but incorrect—because the probability distribution it’s optimizing favors something frequent or seemingly likely, not something verified.

This leads to overconfident wrong answers, particularly in narrow, historical, or idiosyncratic domains. And the answers often lack the epistemic humility or metacognitive self-monitoring that a trained researcher would bring.

The Limits of Internal Weights as Knowledge Repositories

It’s tempting to think of a large model as having “memorized” the internet or its training corpus. But its memory is more akin to layered associations between tokens than a relational database or archive. There’s no real “index” that says “Game 3, 1969 World Series → Wayne Garrett batted second.” Instead, there are millions of parameters that nudge the model toward responses that sound right, based on frequency and contextual fit.

So, when I answered with “Cleon Jones,” it wasn’t because that information was stored incorrectly. It was because the true information was absent, and the model substituted a high-likelihood placeholder from the surrounding statistical fog.

Why This Matters for Legal, Historical, and Scientific Tasks

This problem is particularly acute in domains like law or historical scholarship, where granular factual precision is non-negotiable. A law professor asking about a single footnote in a dissent, or a historian querying troop movements on a specific date, isn’t asking for plausible-sounding text. They’re asking for a documentable truth.

In these domains, a system that “sounds good” but cannot reliably distinguish between known and unknown is not merely unhelpful; it may be actively misleading. That’s why the future of trustworthy AI will almost certainly involve tight integration with retrieval-augmented generation (RAG), in which the language model is grounded in verifiable external data (like an actual box score or court opinion).

A Note on Overconfidence and Humility

You’re quite right to point out that the system’s responses can lack epistemic modesty. The confidence of tone is not a reflection of certainty; it’s a default stylistic register. If anything, it is a legacy of the model’s pretraining on a corpus filled with declarative prose. This stylistic confidence can be mistaken for epistemic confidence, unless explicitly tempered.

In fact, some new research efforts aim to teach models to express calibrated uncertainty (“I’m not sure,” “That information may be out of date,” or “You should verify this from a primary source”). But doing that systematically—without sacrificing fluency or usefulness—is a hard and ongoing challenge in the field.

It's not really the raw "specificity" of the question, however, that affects the probability of hallucination (or confabulation, as it is sometimes called). It's the specificity of the question relative to the relevant material on which the AI was trained. For example, here is an incredibly specific question, but one that an AI is very likely to get right. "In Game 1 of the 1988 World Series, what was the exact pitch count on Kirk Gibson when he hit his two-run, walk-off home run against Dennis Eckersley, who was the runner on base, and what was the final score of the game?" As I just verified, an AI will correctly answer that Gibson hit a 3-2 pitch (seventh pitch) off Eckersley for a walk-off two-run homer with Mike Davis on base; the Dodgers won 5–4 in Game 1 of the 1988 World Series. The AI was able to respond correctly here because the moment was one of high drama, iconically narrated by broadcaster Vin Scully and retold many times since.
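
The contrast can be caricatured in a few lines of Python. The counts below are invented, and a real model predicts tokens rather than whole names, but the sketch captures the intuition: when the true fact is thinly represented in the training data, the statistically familiar answer wins.

  # Toy sketch of the "statistical fog": invented counts standing in for how often
  # each name appears near phrases like "1969 Mets batting order" in training data.
  training_association = {
      "Cleon Jones": 120,   # star hitter, heavily discussed -- a plausible guess
      "Tommie Agee": 95,    # famous for his Game 3 catches -- also plausible
      "Wayne Garrett": 12,  # the actual second hitter that game, thinly documented
  }

  def most_plausible_answer(counts: dict) -> str:
      """Pick the most frequently associated name -- plausible, not verified."""
      total = sum(counts.values())
      probabilities = {name: n / total for name, n in counts.items()}
      return max(probabilities, key=probabilities.get)

  print(most_plausible_answer(training_association))  # "Cleon Jones" -- confident and wrong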

All of this is why AI is particularly good at answering basic questions in most legal subjects. The area may be terra incognita for the student, but it is likely well-plowed ground that the AI has seen before. It is why AI does particularly well on questions about constitutional law or other typical subjects in the compulsory law curriculum. There is just a wealth of training data. On the other hand, if you start asking it specific questions about North Dakota environmental law or even general questions about 17th-century laws of the New Haven Colony, you will at best get an acknowledgement from the AI that it just doesn't know the answer. If you are less fortunate, the AI is likely to respond with a hallucination, a plausible though incorrect response.

Factor 3: What information and tools have you given the AI to help it?

One way to overcome limitations in the AI's training is to augment the context it has available when it responds. When, for example, I later provided the AI with the box score of that old Mets game from the Baseball Reference website, it answered my question correctly. The AI was then what is called "grounded." That logic is not confined to baseball. When you provide AI with the relevant legal precedents or secondary materials, it is far more likely to get the answer right. That material may not have been available at the time of training, or its attention to detail regarding that material may have been diluted by other similar but non-identical material. Sometimes the technology to augment context is called RAG (Retrieval-Augmented Generation), but there are now many ways – most simply, cut and paste – to give the AI the additional material it needs. Some AIs, most notably NotebookLM from Google, are particularly strong in the way they handle additional context.
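
In code, grounding can be as simple as stuffing the retrieved text into the prompt. In the sketch below, ask_model() is a hypothetical placeholder for whatever chat API you use, and the "retrieval" step is just reading a file; real RAG pipelines swap in a search index or vector database, but the principle is the same.

  # Minimal grounding sketch: fetch the authoritative text yourself and put it in
  # the prompt, so the model quotes the source instead of guessing from memory.
  def ask_model(prompt: str) -> str:
      raise NotImplementedError("call your preferred chat API here")  # placeholder

  def answer_with_grounding(question: str, source_path: str) -> str:
      with open(source_path, encoding="utf-8") as f:
          source_text = f.read()  # e.g., a box score, statute, or court opinion
      prompt = (
          "Answer the question using ONLY the source material below. "
          "If the source does not contain the answer, say so.\n\n"
          f"SOURCE:\n{source_text}\n\nQUESTION: {question}"
      )
      return ask_model(prompt)

  # Example (hypothetical file name): the lineup question, now grounded in the box score.
  # answer_with_grounding("Who batted second for the Mets in Game 3 of the 1969 World Series?",
  #                       "mets_1969_game3_box_score.txt")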

It does not defeat the whole point of AI to suggest that you may need to custom-feed the AI. First, AI itself can often help provide the most relevant materials. This is likely to be all the more the case as AI "agents" become more powerful and accessible. Indeed, modern AI will often gather material from the web on its own before proceeding to respond to a query. Second, even if a fair amount of human labor is required to feed the AI the source material it needs, the AI can still add immense value by digesting, summarizing, analyzing, and organizing the received corpus.

Modern AI has tools beyond web search to help answer questions. Many of these tools are built in. Others you can add. For example, AIs without tools, particularly older models, are notoriously bad at queries involving very recent events or those requiring multi-step reasoning. This is why old AI used to get even relatively simple math questions wrong. But consider this use of tools by AI without any special action by me. Here's a raw response from ChatGPT 4o (not even their most powerful model), with a tiny bit of reformatting. I used Wolfram Language independently to verify that the solution is correct. ChatGPT reports that it solved the problem by calling the Python sympy library inside a Python runtime environment, along with some tooling to coordinate between itself and those tools.

Illustrating AI Reasoning via a Solvable Math Problem

Background

This problem was crafted by an AI (ChatGPT-4o) in response to specific constraints posed by Professor Seth Chandler. The professor asked for a mathematical challenge that would:

  1. Appear difficult to the average college graduate,
  2. Be solvable by the AI using its built-in symbolic computation tools, and
  3. Be something a typical large language model (LLM) could not solve reliably without such tools.

The AI generated the following problem on its own, selected to meet all three of these criteria.

The Problem

Evaluate the sum:

∑_{k=1}^{1000} 1/(k^2 + 3k + 2)

The Solution

First, factor the denominator:

k^2 + 3k + 2 = (k+1)(k+2)

This enables a partial fraction decomposition:

1/((k+1)(k+2)) = 1/(k+1) - 1/(k+2)

The sum telescopes beautifully:

(1/2 - 1/3) + (1/3 - 1/4) + ... + (1/1001 - 1/1002)

All intermediate terms cancel, leaving:

1/2 - 1/1002 = 250/501

Final Answer

The exact value of the sum is:

250/501

This elegant result shows how symbolic tools built into AI systems can be used not only to explain solutions, but also to perform real computation beyond the reach of probabilistic text prediction alone.
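
For readers who want to reproduce the computation, here is a minimal sympy sketch (sympy being the symbolic math library ChatGPT reports having called); it confirms both the partial-fraction step and the exact value 250/501.

  # Verify the telescoping sum with sympy.
  from sympy import symbols, apart, summation, simplify, Rational

  k = symbols("k", positive=True, integer=True)
  term = 1 / (k**2 + 3*k + 2)

  # Partial fraction decomposition: 1/((k+1)(k+2)) = 1/(k+1) - 1/(k+2)
  print(apart(term, k))

  # Exact value of the sum from k = 1 to 1000
  total = simplify(summation(term, (k, 1, 1000)))
  print(total)                          # 250/501
  print(total == Rational(250, 501))    # True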

Moreover, AI developers have been hard at work letting users add their own specialized tools to get better answers to questions that arise in their particular domain. I will cover this issue in more depth in a later blog post (I seem to be generating quite an agenda for myself), but the general term is "function calling" and the acronym you need to know is MCP (Model Context Protocol). It’s a way of equipping AI with domain-specific tools, data, and capabilities that lie beyond its native training. Think of it as plug-and-play functionality: an interface through which the AI can access up-to-date court opinions (or other text repositories), proprietary databases, internal memos, or even your own research guides. The model no longer has to "guess" based on training—it can look things up, run searches, parse structured documents, and interact with external software on your behalf. What’s astonishing is how little infrastructure this requires. Give the AI a clear schema, stable endpoints, and a permissioned interface, and you can build a legal research assistant that’s not just competent—it’s tireless, cheap, and fast.
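
The exact wire format differs from provider to provider, and MCP adds its own conventions on top, but the core pattern looks something like the sketch below: you describe a tool with a name, a plain-English description, and a parameter schema, and you supply the function the model may ask you to run. The search_court_opinions helper and its fields are invented for illustration, not drawn from any real product.

  # Illustrative, provider-neutral function-calling pattern. Everything here is
  # hypothetical; real APIs and MCP servers use their own (similar) formats.
  def search_court_opinions(query: str, jurisdiction: str = "US") -> list:
      """Hypothetical tool: search an external repository of court opinions."""
      # In real life this would query a legal database or an MCP server.
      return [{"case": "Example v. Example (hypothetical)", "snippet": "..."}]

  # A schema the model can read so it knows when and how to ask for the tool.
  tool_schema = {
      "name": "search_court_opinions",
      "description": "Search up-to-date court opinions by keyword and jurisdiction.",
      "parameters": {
          "type": "object",
          "properties": {
              "query": {"type": "string", "description": "search terms"},
              "jurisdiction": {"type": "string", "description": "e.g., 'US' or 'TX'"},
          },
          "required": ["query"],
      },
  }

  # The model, seeing the schema, can reply with a structured request such as
  #   {"tool": "search_court_opinions", "arguments": {"query": "adverse possession", "jurisdiction": "TX"}}
  # Your code runs the function and feeds the results back as grounding for the final answer.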

Footnote: The legal profession is drowning in unindexed complexity and needlessly expensive proprietary tools inelegantly bolted to ancient software. MCP tools are lifeboats waiting to be launched. Trust me, a fortune awaits whoever builds the first trustworthy, MCP-native agent for law.

Factor 3.5: Has the AI been humming along successfully?

"And oftentimes, to win us to our harm, The instruments of darkness tell us truths, Win us with honest trifles, to betray's in deepest consequence"MacBeth

Beware. There may or may not be a good technical explanation for this phenomenon (context drift?), but in my experience AI is more likely to hallucinate when its prior correct answers have given you undue confidence in its continued accuracy. It's when you let your guard down that the devious part of the AI decides to insert a false fact. The solution is simple. Don't let your guard down. The fact that AI has been doing just fine in the colloquy thus far is no guarantee that it will continue to do so. If it's remotely important, you have to check its answers even if nothing in its immediate past leads you to believe it is about to confabulate.

Responsibility

I end this post with what may seem like a cliche. But it also happens to be true. Ultimately, few are likely to care or commiserate that the error in the document bearing your name originated with AI. AI is not (yet) a juridical entity with responsibility. We humans are in control for now. AI is a little bit like an automobile. Both are amazing technologies. But if you don't know what you are doing and ignore warnings, both can cause great harm. With cars, we have learned over decades to anticipate the problems that increase the risk of injury, and we have invented safety systems that warn us when failure is likely. While we don't quite yet have the latter with AI (maybe soon?), we can still become educated. That's what this blog post is about. And once we learn the danger signs, the immense benefits of collaborating with the incredible "intelligence" of modern AI can be enjoyed with considerably less fear and considerably more success.

Notes

  1. For a more technical report (prepared by AI, of course) on AI hallucinations, you can read this. https://docs.google.com/document/d/1QX2TiFvRxKxJSR7hSt0fhqyvUgkkzREQreCsg5aJ_Y4/edit?usp=sharing
  2. I gave Google's NotebookLM this blog entry, the above technical report and an mp3 of my interview on Houston Matters. I asked it to produce a short podcast that focused on this blog entry. Unfortunately, this resulted in a very meta error. The podcast that discussed hallucinations itself contained a kind of hallucination: faithlessness. Despite my request, the AI focused on the technical document. And it was long. Still, the synthetic podcast is interesting and useful, particularly as a supplement to this blog entry. Here's a link: https://notebooklm.google.com/notebook/bd6bd5b4-9dc4-487b-9cab-e65e3b34bc85/audio
  3. I have always been fascinated by the 1969 Mets and the improbability of their success. But if you look at their pitching you will see what happened. Imagine a team that had Nolan Ryan in his most flame-throwing youth on its roster, yet a starting rotation so solid that Ryan could not routinely crack it!