The problem has been measured
In 2024, researchers at Stanford Law School published what became the largest empirical study of legal AI hallucination. They tested Westlaw's AI-Assisted Research and Lexis+ AI — the two tools used by nearly every major law firm — against a benchmark of legal research questions with known answers.
The results were stark: hallucination rates of roughly 17% for Lexis+ AI and 33% for Westlaw's AI-Assisted Research.

These aren't hypothetical failure modes. The Stanford study, published in the Journal of Empirical Legal Studies (Magesh et al., 2025), found that even tools specifically marketed as "hallucination-free" produced fabricated case citations, invented holdings, and misattributed legal rules at rates that would be unacceptable in any professional context.
Meanwhile, attorney sanctions for AI-fabricated citations have become a recognizable pattern. The most public cases — Mata v. Avianca in 2023, and multiple federal cases since — are just the visible surface. By early 2026, over a thousand proceedings involving AI-generated legal errors had been tracked across federal and state courts.
Two types of hallucination (and why the distinction matters)
The research literature distinguishes two failure modes that behave very differently:
Fabrication
The AI invents a citation, case, or legal rule that doesn't exist at all. "Smith v. Jones, 524 F.3d 112 (2d Cir. 2019)" — except there's no such case.
Easier to catch: you can check if the case exists.
Misgrounding
The AI cites a real case but misrepresents what it holds. It links a genuine citation to a proposition the case doesn't actually support.
Harder to catch: the citation looks real because it is real.
The Stanford study found that misgrounding — real citations supporting fake propositions — was the more common and more dangerous failure mode. A fabricated citation is easy to catch with a database lookup. A misgrounded one requires actually reading the case and understanding whether it supports the stated proposition. That's exactly the kind of multi-step legal reasoning that current models struggle with.
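The asymmetry between the two failure modes can be made concrete with a toy sketch. The citation index below is a stand-in for a real case-law database (Goza v. Commissioner, a real CDP case, is used only as an example entry):

```python
# Sketch: why fabrication is catchable with a lookup while misgrounding is not.
# KNOWN_CITATIONS is a toy stand-in for a real case-law index.
KNOWN_CITATIONS = {
    "Goza v. Commissioner, 114 T.C. 176 (2000)",
}

def citation_exists(citation: str) -> bool:
    """Fabrication check: a made-up citation fails a simple index lookup."""
    return citation in KNOWN_CITATIONS

# The fabricated case from the text fails the lookup...
print(citation_exists("Smith v. Jones, 524 F.3d 112 (2d Cir. 2019)"))  # False
# ...but a misgrounded citation would pass, because the case is real.
# Catching misgrounding means reading the opinion against the proposition.
```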
Why RAG doesn't solve this
Every major legal AI tool uses some version of Retrieval-Augmented Generation (RAG): embed documents, store them in a vector database, retrieve relevant chunks at query time, and hand them to a language model to synthesize an answer.
RAG is an improvement over pure generation — the model at least sees real legal text. But the architecture has structural limitations that no amount of prompt engineering can fix:
Multi-hop reasoning breaks down
Legal analysis is rarely single-hop. "Does the taxpayer have a right to a CDP hearing?" requires chaining: IRC § 6330 grants the right → the request must be timely under § 6330(a) → untimeliness limits judicial review to abuse of discretion → unless equitable tolling applies. RAG retrieves documents; it doesn't traverse a legal argument.
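As a rough illustration, the chain above behaves like a traversal over linked rules rather than a single retrieval. The node names and dependency structure below are illustrative only, not AgentLaw's actual schema:

```python
# Illustrative sketch: multi-hop legal reasoning as graph traversal,
# the thing a single-shot retriever cannot do. All names are hypothetical.
RULES = {
    "cdp_hearing_right": {"source": "IRC § 6330",    "next": "timely_request"},
    "timely_request":    {"source": "IRC § 6330(a)", "next": "review_standard"},
    "review_standard":   {"source": "case law",      "next": "equitable_tolling"},
    "equitable_tolling": {"source": "case law",      "next": None},
}

def trace_argument(start: str, rules: dict) -> list:
    """Walk the dependency chain that vector similarity alone would miss."""
    chain, node = [], start
    while node:
        chain.append(f"{node} ({rules[node]['source']})")
        node = rules[node]["next"]
    return chain

print(" -> ".join(trace_argument("cdp_hearing_right", RULES)))
```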
No authority hierarchy awareness
When a retrieved IRS manual excerpt and a Tax Court holding contradict each other, the model has to figure out which controls. That requires understanding that case law trumps agency guidance — a meta-rule the model must derive from training, not from the retrieved text. Get it wrong and you're citing the IRM as if it's binding law.
No temporal or jurisdictional awareness
An overruled case and the case that overruled it look equally relevant to a vector similarity search. A Ninth Circuit rule and a contrary Second Circuit rule both match the same query. RAG doesn't know that one superseded the other, or that jurisdiction matters.
Confidence is a hallucination itself
When RAG systems produce confidence indicators, they're typically the model's self-assessment — which research consistently shows is poorly calibrated. The model says "I'm 90% confident" based on how fluent its output sounds, not based on the weight of legal authority.
The fundamental issue: RAG treats legal research as an information retrieval problem. But legal analysis isn't retrieval — it's structured reasoning over a hierarchy of authorities. Retrieving the right document isn't the same as understanding what the law is.
What we built instead
AgentLaw takes a different architectural approach. Instead of retrieving documents and asking a model to extract the law, we pre-analyze legal authorities into structured propositions — each one a discrete legal assertion with provenance, scoring, and graph relationships.
Each proposition in our knowledge graph carries metadata that directly addresses the failure modes above:
Authority type
Every proposition is classified: holding, dicta, statutory_text, reg_interpretation, procedural_rule, or agency_guidance. The agent never has to guess what kind of authority it's looking at.
Confidence scoring
A deterministic 4-component score (authority strength, recency, consistency, novelty) computed from measurable signals — not model self-assessment. The score is a property of the data, not the model reading it.
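A minimal sketch of what a deterministic score can look like. The four component names come from the text; the equal weights, the strength values, and the linear recency decay are assumptions:

```python
# Sketch of a deterministic 4-component confidence score computed from
# measurable signals, not model self-assessment. Weights are assumptions.
from datetime import date

AUTHORITY_STRENGTH = {"statutory_text": 1.0, "holding": 0.8,
                      "reg_interpretation": 0.6, "agency_guidance": 0.3}

def confidence(authority_type: str, decided: date,
               consistency: float, novelty: float,
               today: date = date(2026, 1, 1)) -> float:
    strength = AUTHORITY_STRENGTH[authority_type]
    # Assumed: recency decays linearly over 30 years, floored at 0.
    recency = max(0.0, 1.0 - (today.year - decided.year) / 30)
    # Assumed: equal weighting of the four components.
    return round((strength + recency + consistency + novelty) / 4, 3)

print(confidence("holding", date(2024, 5, 1), consistency=0.9, novelty=0.5))
# → 0.783
```

Because the score is a pure function of the data, two agents reading the same proposition always see the same confidence.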
Authority hierarchy
Programmatic enforcement: IRC > Regulations > Case Law > IRM. When authorities conflict, resolve_authority returns them pre-sorted. The agent structurally cannot cite the IRM over a statute.
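A sketch of what programmatic enforcement means in practice. The resolve_authority name comes from the text, but this body and the data shapes are assumptions:

```python
# Sketch: hierarchy enforcement as a sort, not a model judgment call.
# Rank values and proposition shape are illustrative assumptions.
HIERARCHY = {"IRC": 0, "Regulation": 1, "Case Law": 2, "IRM": 3}

def resolve_authority(props: list) -> list:
    """Return propositions pre-sorted so controlling authority comes first."""
    return sorted(props, key=lambda p: HIERARCHY[p["authority"]])

conflict = [
    {"authority": "IRM", "text": "IRM collection procedure"},
    {"authority": "IRC", "text": "IRC § 6330 statutory right"},
]
print([p["authority"] for p in resolve_authority(conflict)])  # ['IRC', 'IRM']
```

The ordering lives in the data layer, so a consuming model would have to actively ignore it to cite agency guidance over a statute.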
Typed graph edges
Propositions are linked by relationships: supports, narrows, contradicts, creates_exception. Circuit splits are explicitly marked, not left for the model to discover (or miss).
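The typed edges can be pictured as a small labeled graph. The relationship names come from the text; the data layout is an assumed simplification:

```python
# Sketch: typed edges between propositions, with contradictions explicit.
# Proposition IDs and the tuple layout are illustrative assumptions.
EDGES = [
    ("P1", "supports", "P2"),
    ("P3", "contradicts", "P4"),  # e.g. an explicitly marked circuit split
    ("P5", "narrows", "P2"),
]

def conflicts_of(prop: str, edges) -> list:
    """Surface marked contradictions instead of hoping the model finds them."""
    return ([b for a, rel, b in edges if a == prop and rel == "contradicts"] +
            [a for a, rel, b in edges if b == prop and rel == "contradicts"])

print(conflicts_of("P3", EDGES))  # ['P4']
```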
Our anti-hallucination infrastructure
Research into legal AI failure modes didn't just inform our architecture — it drove us to add specific provenance and verification infrastructure to every proposition in our knowledge graph. If we're going to claim our propositions are more reliable than document retrieval, we need to prove it structurally, not just assert it.
Here's what we built:
Verification provenance
Every proposition carries a verification_status that tracks how it entered the system:
human_verified
Confirmed by a domain expert against primary sources
llm_extracted
Generated by an LLM from source material, not yet independently verified
llm_unreviewed
Not yet classified — new or unprocessed
needs_reverification
Previously verified, but new legal developments may have changed the analysis
Currently, 10 of our 248 CDP propositions are human_verified — confirmed by an attorney with Tax Court experience against the original authorities. The remaining 238 are honestly labeled llm_extracted. We don't pretend LLM-generated propositions have the same reliability as human-verified ones. The label is in the data, visible to every consuming agent.
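One way a consuming agent might use the label, sketched with a hypothetical phrasing map. The status values come from the text; the hedging templates and data shape are illustrative:

```python
# Sketch: turning verification_status into calibrated confidence language.
# The template wording is an assumption, not AgentLaw's actual output.
HEDGE = {
    "human_verified":       "Under {cite}, ",
    "llm_extracted":        "An unverified extraction of {cite} suggests ",
    "llm_unreviewed":       "An unreviewed source may indicate ",
    "needs_reverification": "Prior (possibly outdated) analysis of {cite} held ",
}

def frame(prop: dict) -> str:
    """Prefix a proposition with language matching its verification status."""
    return HEDGE[prop["verification_status"]].format(cite=prop["citation"]) \
        + prop["text"]

print(frame({"verification_status": "llm_extracted",
             "citation": "IRC § 6330(a)",
             "text": "the hearing request must be filed within 30 days."}))
```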
Source quotes
Citations can now carry source_quote — the original text from the authority that supports the proposition. When available, agents can verify that a proposition actually says what we claim it says, rather than trusting a paraphrase.

85 of our citations currently have source quotes (statutory text and IRM provisions). Holdings from court opinions don't — and we say so honestly with quote_verified: false rather than fabricating quotes we don't have.
Temporal validity
Every proposition tracks when its authority was established via temporal_valid_from. This feeds directly into confidence scoring: a 2024 Tax Court holding scores higher on recency than a 1998 statute that hasn't been amended. An agent can distinguish current law from historical holdings without parsing the dates embedded in citation strings.
We extracted dates for 222 of 248 propositions. The remaining 26 are tracked in a manifest for manual research — we'd rather have no date than a wrong one.
Honest uncertainty
Every proposition carries an uncertainty_type: settled, unsettled (active circuit split or open question), or undeveloped (no authority directly on point). This is the opposite of how hallucinating systems work — they present everything with equal confidence. We mark what we don't know.
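A consuming agent could translate uncertainty_type into explicit caveats. The mapping below is an illustrative sketch under the three values named above, not shipped behavior:

```python
# Sketch: surfacing honest uncertainty instead of uniform confidence.
# The caveat wording is an assumption; the three type values are from the text.
def caveat(uncertainty_type: str) -> str:
    """Append an explicit warning for anything that is not settled law."""
    return {
        "settled":     "",
        "unsettled":   " (Note: active circuit split or open question.)",
        "undeveloped": " (Note: no authority directly on point.)",
    }[uncertainty_type]

print("Equitable tolling may apply." + caveat("unsettled"))
```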
What this means for agent builders
If you're building a legal AI agent, the hallucination problem is your liability. When your agent gives wrong legal information, it's your user who suffers — whether that's a law firm facing sanctions or a pro se litigant who misses a deadline.
AgentLaw's proposition architecture gives you structural defenses:
- Check verification_status. Your agent can programmatically distinguish human-verified law from LLM-extracted assertions and adjust its confidence language accordingly.
- Call resolve_authority and get propositions pre-sorted into controlling, subordinate, and conflicting groups. Your agent can't accidentally cite an IRM procedure over a statute, because the hierarchy is in the data.
The honest picture
We're not claiming we've eliminated hallucination. No system can, because the consuming model can still misinterpret structured data. But we've moved the legal reasoning — the part that hallucinates — out of the model and into verified, structured data.
What we have today:
- 248 propositions covering CDP (Collection Due Process) tax hearings
- 10 human-verified by a domain expert; 238 honestly labeled as LLM-extracted
- 85 citations with source quotes; the rest marked as unverified
- Every proposition scored, typed, and placed in the authority hierarchy
- Every circuit split, open question, and uncertainty explicitly marked
What we're building toward:
- Automated ingestion from DAWSON (Tax Court case management) to keep propositions current
- Expanding source quote coverage as cases flow through the pipeline
- Increasing human-verified coverage through systematic expert review
- Broadening beyond CDP into other tax controversy domains
The thesis is simple: structured legal knowledge, honestly labeled, with provenance at every layer, is a fundamentally better substrate for legal AI than throwing documents at a language model and hoping the legal reasoning comes out right.
The research says 17-33% of the time, it doesn't.
Try the API
Query 248 CDP propositions with confidence scoring, authority hierarchy, and provenance metadata. Free tier available. Connect via REST API or MCP.
References
- Magesh, V., Surani, F., Dahl, M., et al. (2025). "Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools." Journal of Empirical Legal Studies.
- Dahl, M., Magesh, V., Suzgun, M., & Ho, D.E. (2024). "Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models." Stanford HAI.
- LegalTechSanctions.org. AI-Assisted Legal Research Sanctions Tracker (ongoing).