Research Topic 2

Benchmarking

Legal reasoning spans rule extraction, statutory interpretation, analogical reasoning, judgment under ambiguity, and more. Most academic benchmarks measure only outcomes, not the reasoning behind them. This track covers our work building evaluation infrastructure that distinguishes reaching the right answer through sound reasoning from coincidental correctness. It includes domain-specific benchmarks, reliability frameworks, and studies of LLM capabilities in specialized legal contexts.

Frontier model performance on advanced legal reasoning

Evaluating 8 frontier models across 1,456 questions spanning rule-application and interpretation tasks

[Chart 1: per-model results for Opus 4.6, Opus 4, GPT-5.4, GPT-5, Sonnet 4.6, Sonnet 4, GPT-5.4 Mini, and GPT-5 Mini. Error bars: ±1 standard deviation.]

[Chart 2: % of questions with same answer across 10 turns, for the same eight models.]
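The consistency measure named in the chart label (% of questions with same answer across 10 turns) can be illustrated with a minimal sketch. The function names, the data layout, and the sample answers below are hypothetical and are not taken from the evaluation itself. The second helper assumes the error bars summarize the spread of a score across repeated runs, which the page does not confirm.

```python
from statistics import mean, stdev


def consistency_rate(answers_by_question):
    """Fraction of questions whose repeated runs all gave the same answer.

    answers_by_question maps a question id to the list of answers a model
    produced across repeated runs (hypothetical layout).
    """
    if not answers_by_question:
        raise ValueError("at least one question is required")
    consistent = sum(
        1 for runs in answers_by_question.values() if len(set(runs)) == 1
    )
    return consistent / len(answers_by_question)


def mean_and_spread(per_run_scores):
    """Mean and sample standard deviation of a score across runs.

    Assumption: error bars of +/- 1 standard deviation are computed over
    per-run scores; the actual methodology may differ.
    """
    return mean(per_run_scores), stdev(per_run_scores)


# Illustrative data only: four questions, ten runs each.
sample_answers = {
    "q1": ["A"] * 10,               # identical across all runs
    "q2": ["A"] * 9 + ["B"],        # one run disagrees
    "q3": ["yes"] * 10,             # identical across all runs
    "q4": ["no"] * 5 + ["yes"] * 5, # evenly split
}

rate = consistency_rate(sample_answers)
print(f"Consistent on {rate:.0%} of questions")
```

Under this definition a question counts as consistent only when every run agrees, so a single divergent answer out of ten marks the question as inconsistent. A looser variant, such as agreement with the majority answer, would give higher numbers and measure something different.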

Frontier models are getting measurably better at legal reasoning. But they are far from production-ready in high-stakes corporate environments, where perfect accuracy and consistency are table stakes.

LegalBench: A collaboratively built benchmark for measuring legal reasoning in LLMs

arXiv · 2023

A holistic assessment of the reliability of AI

arXiv · 2023

Large Language Models as tax attorneys

Royal Society · 2024

Can LLMs Follow Simple Rules?

arXiv · 2023

Legal Engineering: A Paradigm Shift in Law

Stanford Law · 2025

How Regulators Can Use AI

Vanderbilt Law Review · 2024