LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI
Abstract
LegalHalluLens audits AI systems in legal workflows by identifying specific error patterns and directional biases in hallucinations across different claim types, enabling more reliable deployment through targeted diagnostic and mitigation approaches.
AI systems deployed in legal workflows hallucinate at rates that aggregate metrics report at ~52%, but this average conceals where errors concentrate and in which direction they run, leaving compliance officers without an actionable signal for trustworthy deployment. We present LegalHalluLens, an auditing framework with three components: typed hallucination profiles across four legally-motivated claim categories (numeric, temporal, obligation/entitlement, factual) over CUAD (Hendrycks et al., 2021); a Risk Direction Index (RDI) that reduces omission-versus-invention bias to a single deployment-comparable scalar; and a typed debate pipeline calibrated to both magnitudes and directions. Across 510 contracts and 249,252 clause-level instances we measure a within-model gap of approximately 38-40 pp between obligation/numeric and temporal claims that aggregate reporting hides, and show that two systems with matched 52% rates can carry opposite RDIs. The debate pipeline reduces fabricated detections by 45% with per-category gains tracking the diagnosis, matching commercial APIs with a substantially smaller backbone (4B active parameters). Typed profiles and RDI surface failure modes that aggregate metrics hide; we further show these diagnostics serve as calibration inputs for multi-agent debate pipelines, where Skeptic challenges and asymmetric gates targeted at measured failure modes outperform generically-tuned debate. The framework supports direction-aware procurement, accountability, and agent design for legal AI deployed in the wild.
Community
Hi Hugging Face community!
My co-author and I are excited to share LegalHalluLens, which was recently accepted at the ICML 2026 AIWILD workshop. We built this framework to address a massive blind spot in current LLM benchmarking: aggregate accuracy scores completely mask catastrophic domain-specific failures.
🔍 What we did:
- Massive Audit: We evaluated 249,252 clause-level instances across commercial and open-source models to map exact failure profiles on high-liability text.
- The "Average" Lie: We measured a within-model gap of roughly 40 percentage points hidden behind a blended 52% error rate. Models do comparatively well on easier questions (dates/terms, ~29% error) but fail far more often on high-liability clauses (liability caps/indemnities, 65%-74% error).
- The Risk Direction Index (RDI): We introduced a directional metric to quantify whether a model is an "Omitter" (silently dropping rules) or an "Inventor" (hallucinating fake rules).
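To make the two diagnostics above concrete, here is a minimal Python sketch of a typed error profile and a directional index. The post does not give the exact RDI formula, so the form used here, (inventions - omissions) / (inventions + omissions), is an assumption for illustration only, as are the names `Instance`, `typed_profile`, `risk_direction_index`, and `make_system`. The toy counts are synthetic, not results from the paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    category: str  # e.g. "numeric", "temporal", "obligation", "factual"
    outcome: str   # "correct", "omission", or "invention"


def typed_profile(instances):
    """Per-category error rate: (omissions + inventions) / total."""
    totals, errors = Counter(), Counter()
    for inst in instances:
        totals[inst.category] += 1
        if inst.outcome != "correct":
            errors[inst.category] += 1
    return {cat: errors[cat] / totals[cat] for cat in totals}


def risk_direction_index(instances):
    """ASSUMED form, not the paper's definition:
    (inventions - omissions) / (inventions + omissions), in [-1, 1].
    Negative leans "Omitter", positive leans "Inventor"."""
    counts = Counter(inst.outcome for inst in instances)
    inventions, omissions = counts["invention"], counts["omission"]
    if inventions + omissions == 0:
        return 0.0  # no errors, so no direction
    return (inventions - omissions) / (inventions + omissions)


def make_system(n_correct, n_omission, n_invention, category="obligation"):
    """Build a synthetic list of instances (toy data only)."""
    return (
        [Instance(category, "correct")] * n_correct
        + [Instance(category, "omission")] * n_omission
        + [Instance(category, "invention")] * n_invention
    )


# Two synthetic systems with the same 52% error rate but opposite direction.
system_a = make_system(n_correct=48, n_omission=40, n_invention=12)
system_b = make_system(n_correct=48, n_omission=12, n_invention=40)

rate_a = typed_profile(system_a)["obligation"]
rate_b = typed_profile(system_b)["obligation"]
rdi_a = risk_direction_index(system_a)
rdi_b = risk_direction_index(system_b)
```

Under this assumed definition, both toy systems have an identical error rate while `rdi_a` is negative and `rdi_b` is positive, which is the kind of difference an aggregate score cannot show.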
The Open-Source Fix:
Instead of relying on massive closed-source APIs, we used these empirical failure profiles to calibrate an asymmetric 6-role multi-agent debate pipeline.
By forcing the agents through targeted safety gates, we enabled a lightweight model with 4B active parameters (Gemma) to cut fabrications by 45%, effectively matching the composite performance of commercial frontier APIs while drastically lowering inference costs.
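One way to picture an asymmetric gate is a confidence threshold that tightens only on the side where a model is known to err. The sketch below is purely hypothetical: the function names, the `base` and `strength` parameters, the "present"/"absent" verdict labels, and the idea of driving the gate from a per-category RDI in [-1, 1] are assumptions for illustration, not the pipeline described in the paper.

```python
def asymmetric_thresholds(rdi, base=0.5, strength=0.3):
    """Hypothetical gate calibration.

    rdi > 0 (leans "Inventor"): demand more confidence before accepting
    a "present" detection. rdi < 0 (leans "Omitter"): demand more
    confidence before accepting an "absent" verdict.
    """
    rdi = max(-1.0, min(1.0, rdi))  # clamp to the assumed RDI range
    present_threshold = base + strength * max(rdi, 0.0)
    absent_threshold = base + strength * max(-rdi, 0.0)
    return present_threshold, absent_threshold


def gate(verdict, confidence, rdi):
    """Accept a verdict or escalate it to further debate rounds."""
    present_threshold, absent_threshold = asymmetric_thresholds(rdi)
    threshold = present_threshold if verdict == "present" else absent_threshold
    return "accept" if confidence >= threshold else "escalate"


# For a category where the model leans toward inventing clauses, a
# moderately confident detection is escalated, while an equally
# confident "absent" verdict passes.
inventor_detection = gate("present", confidence=0.6, rdi=0.8)
inventor_absence = gate("absent", confidence=0.6, rdi=0.8)
omitter_detection = gate("present", confidence=0.6, rdi=-0.8)
```

The design choice illustrated here is that the extra scrutiny is spent only where the measured profile says errors concentrate, rather than raising every threshold uniformly.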
Everything (the dataset processing scripts, the RDI evaluation suite, and the calibrated multi-agent pipeline) is fully open-source.
We would love to hear the community's thoughts on using directional metrics for agent alignment and routing!