
Benchmarking

Legal reasoning spans rule extraction, statutory interpretation, analogical reasoning, judgment under ambiguity, and more. Most academic benchmarks measure outcomes rather than reasoning: a model can score well by reaching the right answer for the wrong reasons. This track covers our work building evaluation infrastructure that distinguishes answers reached through sound reasoning from coincidental correctness. It includes domain-specific benchmarks, reliability frameworks, and studies of LLM capabilities in specialized legal contexts.

LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models
arXiv · 2023
A Holistic Assessment of the Reliability of AI
arXiv · 2023
Large Language Models as Tax Attorneys
Royal Society · 2024
Can Large Language Models Follow Rules?
arXiv · 2023
Legal Engineering: A Paradigm Shift in Law
Stanford Law · 2025
How Regulators Can Use AI
Vanderbilt Law Review · 2024