The problem has been measured
In 2024, researchers at Stanford Law School published what became the largest empirical study of legal AI hallucination. They tested Westlaw's AI-Assisted Research and Lexis+ AI — the two tools used by nearly every major law firm — against a benchmark of legal research questions with known answers.
The results were stark: hallucination rates of roughly 17% for Lexis+ AI and 33% for Westlaw's AI-Assisted Research.

These aren't hypothetical failure modes. The Stanford study, published in the Journal of Empirical Legal Studies (Magesh et al., 2025), found that even tools specifically marketed as "hallucination-free" produced fabricated case citations, invented holdings, and misattributed legal rules at rates that would be unacceptable in any professional context.
Meanwhile, attorney sanctions for AI-fabricated citations have become a recognizable pattern. The most public cases — Mata v. Avianca in 2023, and multiple federal cases since — are just the visible surface. By early 2026, over a thousand proceedings involving AI-generated legal errors had been tracked across federal and state courts.
Two types of hallucination (and why the distinction matters)
The research literature distinguishes two failure modes that behave very differently:
Fabrication
The AI invents a citation, case, or legal rule that doesn't exist at all. "Smith v. Jones, 524 F.3d 112 (2d Cir. 2019)" — except there's no such case.
Easier to catch: you can check if the case exists.
Misgrounding
The AI cites a real case but misrepresents what it holds. It links a genuine citation to a proposition the case doesn't actually support.
Harder to catch: the citation looks real because it is real.
The Stanford study found that misgrounding — real citations supporting fake propositions — was the more common and more dangerous failure mode. A fabricated citation is easy to catch with a database lookup. A misgrounded one requires actually reading the case and understanding whether it supports the stated proposition. That's exactly the kind of multi-step legal reasoning that current models struggle with.
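The asymmetry between the two failure modes can be made concrete with a toy sketch. The citation index below is a stand-in for a real case-law database (Goza v. Commissioner, a real CDP case, is used only as an example entry):

```python
# Sketch: why fabrication is catchable with a lookup while misgrounding is not.
# KNOWN_CITATIONS is a toy stand-in for a real case-law index.
KNOWN_CITATIONS = {
    "Goza v. Commissioner, 114 T.C. 176 (2000)",
}

def citation_exists(citation: str) -> bool:
    """Fabrication check: a made-up citation fails a simple index lookup."""
    return citation in KNOWN_CITATIONS

# The fabricated case from the text fails the lookup...
print(citation_exists("Smith v. Jones, 524 F.3d 112 (2d Cir. 2019)"))  # False
# ...but a misgrounded citation would pass, because the case is real.
# Catching misgrounding means reading the opinion against the proposition.
```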
Why RAG doesn't solve this
Every major legal AI tool uses some version of Retrieval-Augmented Generation (RAG): embed documents, store them in a vector database, retrieve relevant chunks at query time, and hand them to a language model to synthesize an answer.
RAG is an improvement over pure generation — the model at least sees real legal text. But the architecture has structural limitations that no amount of prompt engineering can fix:
Multi-hop reasoning breaks down
Legal analysis is rarely single-hop. "Does the taxpayer have a right to a CDP hearing?" requires chaining: IRC § 6330 grants the right → the request must be timely under § 6330(a) → untimeliness limits judicial review to abuse of discretion → unless equitable tolling applies. RAG retrieves documents; it doesn't traverse a legal argument.
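As a rough illustration, the chain above behaves like a traversal over linked rules rather than a single retrieval. The node names and dependency structure below are illustrative only, not AgentLaw's actual schema:

```python
# Illustrative sketch: multi-hop legal reasoning as graph traversal,
# the thing a single-shot retriever cannot do. All names are hypothetical.
RULES = {
    "cdp_hearing_right": {"source": "IRC § 6330",    "next": "timely_request"},
    "timely_request":    {"source": "IRC § 6330(a)", "next": "review_standard"},
    "review_standard":   {"source": "case law",      "next": "equitable_tolling"},
    "equitable_tolling": {"source": "case law",      "next": None},
}

def trace_argument(start: str, rules: dict) -> list:
    """Walk the dependency chain that vector similarity alone would miss."""
    chain, node = [], start
    while node:
        chain.append(f"{node} ({rules[node]['source']})")
        node = rules[node]["next"]
    return chain

print(" -> ".join(trace_argument("cdp_hearing_right", RULES)))
```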
No authority hierarchy awareness
When a retrieved IRS manual excerpt and a Tax Court holding contradict each other, the model has to figure out which controls. That requires understanding that case law trumps agency guidance — a meta-rule the model must derive from training, not from the retrieved text. Get it wrong and you're citing the IRM as if it's binding law.
No temporal or jurisdictional awareness
An overruled case and the case that overruled it look equally relevant to a vector similarity search. A Ninth Circuit rule and a contrary Second Circuit rule both match the same query. RAG doesn't know that one superseded the other, or that jurisdiction matters.
Confidence is a hallucination itself
When RAG systems produce confidence indicators, they're typically the model's self-assessment — which research consistently shows is poorly calibrated. The model says "I'm 90% confident" based on how fluent its output sounds, not based on the weight of legal authority.
The fundamental issue: RAG treats legal research as an information retrieval problem. But legal analysis isn't retrieval — it's structured reasoning over a hierarchy of authorities. Retrieving the right document isn't the same as understanding what the law is.
What we built instead
AgentLaw takes a different architectural approach. Instead of retrieving documents and asking a model to extract the law, we pre-analyze legal authorities into structured propositions — each one a discrete legal assertion with provenance, scoring, and graph relationships.
Each proposition in our knowledge graph carries metadata that directly addresses the failure modes above:
Authority type
Every proposition is classified: holding, dicta, statutory_text, reg_interpretation, procedural_rule, or agency_guidance. The agent never has to guess what kind of authority it's looking at.
Confidence scoring
A deterministic 4-component score (authority strength, recency, consistency, novelty) computed from measurable signals — not model self-assessment. The score is a property of the data, not the model reading it.
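A minimal sketch of what a deterministic score can look like. The four component names come from the text; the equal weights, the strength values, and the linear recency decay are assumptions:

```python
# Sketch of a deterministic 4-component confidence score computed from
# measurable signals, not model self-assessment. Weights are assumptions.
from datetime import date

AUTHORITY_STRENGTH = {"statutory_text": 1.0, "holding": 0.8,
                      "reg_interpretation": 0.6, "agency_guidance": 0.3}

def confidence(authority_type: str, decided: date,
               consistency: float, novelty: float,
               today: date = date(2026, 1, 1)) -> float:
    strength = AUTHORITY_STRENGTH[authority_type]
    # Assumed: recency decays linearly over 30 years, floored at 0.
    recency = max(0.0, 1.0 - (today.year - decided.year) / 30)
    # Assumed: equal weighting of the four components.
    return round((strength + recency + consistency + novelty) / 4, 3)

print(confidence("holding", date(2024, 5, 1), consistency=0.9, novelty=0.5))
# → 0.783
```

Because the score is a pure function of the data, two agents reading the same proposition always see the same confidence.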
Authority hierarchy
Programmatic enforcement: IRC > Regulations > Case Law > IRM. When authorities conflict, resolve_authority returns them pre-sorted. The agent structurally cannot cite the IRM over a statute.
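A sketch of what programmatic enforcement means in practice. The resolve_authority name comes from the text, but this body and the data shapes are assumptions:

```python
# Sketch: hierarchy enforcement as a sort, not a model judgment call.
# Rank values and proposition shape are illustrative assumptions.
HIERARCHY = {"IRC": 0, "Regulation": 1, "Case Law": 2, "IRM": 3}

def resolve_authority(props: list) -> list:
    """Return propositions pre-sorted so controlling authority comes first."""
    return sorted(props, key=lambda p: HIERARCHY[p["authority"]])

conflict = [
    {"authority": "IRM", "text": "IRM collection procedure"},
    {"authority": "IRC", "text": "IRC § 6330 statutory right"},
]
print([p["authority"] for p in resolve_authority(conflict)])  # ['IRC', 'IRM']
```

The ordering lives in the data layer, so a consuming model would have to actively ignore it to cite agency guidance over a statute.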
Typed graph edges
Propositions are linked by relationships: supports, narrows, contradicts, creates_exception. Circuit splits are explicitly marked, not left for the model to discover (or miss).
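The typed edges can be pictured as a small labeled graph. The relationship names come from the text; the data layout is an assumed simplification:

```python
# Sketch: typed edges between propositions, with contradictions explicit.
# Proposition IDs and the tuple layout are illustrative assumptions.
EDGES = [
    ("P1", "supports", "P2"),
    ("P3", "contradicts", "P4"),  # e.g. an explicitly marked circuit split
    ("P5", "narrows", "P2"),
]

def conflicts_of(prop: str, edges) -> list:
    """Surface marked contradictions instead of hoping the model finds them."""
    return ([b for a, rel, b in edges if a == prop and rel == "contradicts"] +
            [a for a, rel, b in edges if b == prop and rel == "contradicts"])

print(conflicts_of("P3", EDGES))  # ['P4']
```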
Our anti-hallucination infrastructure
Research into legal AI failure modes didn't just inform our architecture — it drove us to add specific provenance and verification infrastructure to every proposition in our knowledge graph. If we're going to claim our propositions are more reliable than document retrieval, we need to prove it structurally, not just assert it.
Here's what we built:
Verification provenance
Every proposition carries a verification_status that tracks how it entered the system:
human_verified
Confirmed by a domain expert against primary sources
llm_extracted
Generated by an LLM from source material, not yet independently verified
llm_unreviewed
Not yet classified — new or unprocessed
needs_reverification
Previously verified, but new legal developments may have changed the analysis
Currently, 10 of our 248 CDP propositions are human_verified — confirmed by an attorney with Tax Court experience against the original authorities. The remaining 238 are honestly labeled llm_extracted. We don't pretend LLM-generated propositions have the same reliability as human-verified ones. The label is in the data, visible to every consuming agent.
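One way a consuming agent might use the label, sketched with a hypothetical phrasing map. The status values come from the text; the hedging templates and data shape are illustrative:

```python
# Sketch: turning verification_status into calibrated confidence language.
# The template wording is an assumption, not AgentLaw's actual output.
HEDGE = {
    "human_verified":       "Under {cite}, ",
    "llm_extracted":        "An unverified extraction of {cite} suggests ",
    "llm_unreviewed":       "An unreviewed source may indicate ",
    "needs_reverification": "Prior (possibly outdated) analysis of {cite} held ",
}

def frame(prop: dict) -> str:
    """Prefix a proposition with language matching its verification status."""
    return HEDGE[prop["verification_status"]].format(cite=prop["citation"]) \
        + prop["text"]

print(frame({"verification_status": "llm_extracted",
             "citation": "IRC § 6330(a)",
             "text": "the hearing request must be filed within 30 days."}))
```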
Source quotes
Citations can now carry source_quote — the original text from the authority that supports the proposition. When available, agents can verify that a proposition actually says what we claim it says, rather than trusting a paraphrase.

85 of our citations currently have source quotes (statutory text and IRM provisions). Holdings from court opinions don't — and we say so honestly with quote_verified: false rather than fabricating quotes we don't have.
Temporal validity
Every proposition tracks when its authority was established via temporal_valid_from. This feeds directly into confidence scoring: a 2024 Tax Court holding scores higher on recency than a 1998 statute that hasn't been amended. An agent can distinguish current law from historical holdings without parsing the dates embedded in citation strings.
We extracted dates for 222 of 248 propositions. The remaining 26 are tracked in a manifest for manual research — we'd rather have no date than a wrong one.
Honest uncertainty
Every proposition carries an uncertainty_type: settled, unsettled (active circuit split or open question), or undeveloped (no authority directly on point). This is the opposite of how hallucinating systems work — they present everything with equal confidence. We mark what we don't know.
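A consuming agent could translate uncertainty_type into explicit caveats. The mapping below is an illustrative sketch under the three values named above, not shipped behavior:

```python
# Sketch: surfacing honest uncertainty instead of uniform confidence.
# The caveat wording is an assumption; the three type values are from the text.
def caveat(uncertainty_type: str) -> str:
    """Append an explicit warning for anything that is not settled law."""
    return {
        "settled":     "",
        "unsettled":   " (Note: active circuit split or open question.)",
        "undeveloped": " (Note: no authority directly on point.)",
    }[uncertainty_type]

print("Equitable tolling may apply." + caveat("unsettled"))
```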
What this means for agent builders
If you're building a legal AI agent, the hallucination problem is your liability. When your agent gives wrong legal information, it's your user who suffers — whether that's a law firm facing sanctions or a pro se litigant who misses a deadline.
AgentLaw's proposition architecture gives you structural defenses:
- Check verification_status. Your agent can programmatically distinguish human-verified law from LLM-extracted assertions and adjust its confidence language accordingly.
- Call resolve_authority and get propositions pre-sorted into controlling, subordinate, and conflicting groups. Your agent can't accidentally cite an IRM procedure over a statute, because the hierarchy is in the data.
The honest picture
We're not claiming we've eliminated hallucination. No system can, because the consuming model can still misinterpret structured data. But we've moved the legal reasoning — the part that hallucinates — out of the model and into verified, structured data.
What we have today:
- 248 propositions covering CDP (Collection Due Process) tax hearings
- 10 human-verified by a domain expert; 238 honestly labeled as LLM-extracted
- 85 citations with source quotes; the rest marked as unverified
- Every proposition scored, typed, and placed in the authority hierarchy
- Every circuit split, open question, and uncertainty explicitly marked
What we're building toward:
- Automated ingestion from DAWSON (Tax Court case management) to keep propositions current
- Expanding source quote coverage as cases flow through the pipeline
- Increasing human-verified coverage through systematic expert review
- Broadening beyond CDP into other tax controversy domains
The thesis is simple: structured legal knowledge, honestly labeled, with provenance at every layer, is a fundamentally better substrate for legal AI than throwing documents at a language model and hoping the legal reasoning comes out right.
The research says 17-33% of the time, it doesn't.
Try the API
Query 248 CDP propositions with confidence scoring, authority hierarchy, and provenance metadata. Free tier available. Connect via REST API or MCP.
References
- Magesh, V., Surani, F., Dahl, M., et al. (2025). "Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools." Journal of Empirical Legal Studies.
- Dahl, M., Magesh, V., Suzgun, M., & Ho, D.E. (2024). "Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models." Stanford HAI.
- LegalTechSanctions.org. AI-Assisted Legal Research Sanctions Tracker (ongoing).