Research Topic 2

Benchmarking

Legal reasoning spans rule extraction, statutory interpretation, analogical reasoning, judgment under ambiguity, and more. Most academic benchmarks measure only outcomes, not the reasoning behind them. This track covers our work building evaluation infrastructure that distinguishes reaching the right answer through sound reasoning from coincidental correctness. It includes domain-specific benchmarks, reliability frameworks, and studies of LLM capabilities in specialized legal contexts.


Frontier model performance on advanced legal reasoning

Evaluating 8 frontier models across 1,456 questions spanning rule-application and interpretation tasks

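[Figure: accuracy (%) by model, bar chart with axis from 0 to 100, grouped as Opus 4/4.6, GPT-5/5.4, Sonnet 4/4.6, and GPT-5/5.4 Mini. Legend order: Opus 4.6, Opus 4, GPT-5.4, GPT-5, Sonnet 4.6, Sonnet 4, GPT-5.4 Mini, GPT-5 Mini. Bar labels as extracted: 86%, 79%, 83%, 76%, 84%, 76%, 77%, 78%.]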

Error bars: ±1 standard deviation
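
One way to read the error bars, as a minimal sketch: assume each model answers the full question set once per run, accuracy is computed per run, and the bar shows the mean with ±1 standard deviation across runs. The source does not say what the deviation is taken over, so the per-run layout, the function name `accuracy_with_spread`, and the toy data below are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def accuracy_with_spread(runs, gold):
    """Return (mean accuracy %, standard deviation %) across runs.

    runs: list of runs, each a list of predicted answers (hypothetical layout).
    gold: list of reference answers, in the same order as each run.
    """
    per_run = []
    for preds in runs:
        if len(preds) != len(gold):
            raise ValueError("each run must answer every question")
        correct = sum(p == g for p, g in zip(preds, gold))
        per_run.append(100.0 * correct / len(gold))
    # Population standard deviation (an assumption); statistics.stdev
    # gives the sample version instead.
    return mean(per_run), pstdev(per_run)


# Toy data, not taken from the benchmark.
gold = ["A", "B", "C", "D"]
runs = [
    ["A", "B", "C", "D"],  # 4 of 4 correct
    ["A", "B", "C", "A"],  # 3 of 4 correct
    ["A", "B", "A", "A"],  # 2 of 4 correct
]
avg, sd = accuracy_with_spread(runs, gold)
```

Whether the population or sample standard deviation is the right choice depends on how the runs were collected, which the page does not state.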

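[Figure: answer consistency across turns (%) by model, bar chart with axis from 0 to 100, grouped as Opus 4/4.6, GPT-5/5.4, Sonnet 4/4.6, and GPT-5/5.4 Mini. Legend order: Opus 4.6, Opus 4, GPT-5.4, GPT-5, Sonnet 4.6, Sonnet 4, GPT-5.4 Mini, GPT-5 Mini. Bar labels as extracted: 94%, 88%, 89%, 90%, 90%, 80%, 84%, 88%.]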

Percentage of questions with the same answer across 10 turns
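
The consistency measure described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the lab's evaluation code: the function name `answer_consistency`, the dictionary layout, and the toy answers are hypothetical, and a question is counted as consistent only when all of its turns produce exactly the same answer.

```python
def answer_consistency(answers_by_question):
    """Percentage of questions whose answers are identical across all turns.

    answers_by_question: dict mapping a question id to the list of answers
    given across repeated turns (hypothetical layout).
    """
    if not answers_by_question:
        raise ValueError("no questions supplied")
    stable = sum(
        1
        for answers in answers_by_question.values()
        if len(set(answers)) == 1  # exactly one distinct answer
    )
    return 100.0 * stable / len(answers_by_question)


# Toy data, not taken from the benchmark: 10 turns per question.
toy = {
    "q1": ["yes"] * 10,
    "q2": ["no"] * 9 + ["yes"],  # one turn disagrees, so not consistent
    "q3": ["no"] * 10,
    "q4": ["yes"] * 10,
}
score = answer_consistency(toy)
```

Under this definition, consistency is independent of correctness: a model that gives the same wrong answer on every turn still counts as consistent for that question.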

Frontier models are getting measurably better at legal reasoning. But with top scores of 86% accuracy and 94% answer consistency, they remain far from production-ready in high-stakes corporate environments, where perfect accuracy and consistency are table stakes.


LegalBench: A collaboratively built benchmark for measuring legal reasoning in LLMs
arXiv · 2023
A holistic assessment of the reliability of AI
arXiv · 2023
Large Language Models as tax attorneys
Royal Society · 2024
Can LLMs Follow Simple Rules?
arXiv · 2023
Legal Engineering: A Paradigm Shift in Law
Stanford Law · 2025
How Regulators Can Use AI
Vanderbilt Law Review · 2024