Norm AI

The Legal AGI Lab

Bringing AI agents into the legal fold through integrated AI & legal research

AI agents are becoming autonomous actors.

We are building agentic law to align them with societal values and enable safe, high-stakes deployment.

01

Legal Turing Test

We develop Turing Tests grounded in real, high-stakes legal workflows. Norm will continue to build out this suite of benchmarks in a way that only a technology company powering a full-service law firm can.

02

As Intelligence Becomes Cheap, Trust Is the New Bottleneck

As AI drives the cost of intelligence toward zero, the bottleneck in the economy shifts to assurance of agentic legal systems. How do we ensure that AI systems act in ways that are legal, trustworthy, and enforceable?

03

Legal Infrastructure for the Agentic Economy

Legal systems were built for human actors. As AI agents become economic and societal actors, law is their real-time alignment infrastructure. We investigate the systems by which AI agents will transact, be governed, and be held liable.

04

Benchmarking Legal Reasoning

Legal reasoning spans rule extraction, statutory interpretation, analogical reasoning, judgment under ambiguity, and more. Most academic benchmarks score final answers rather than the reasoning behind them. We build the evaluation infrastructure for getting to the right answers the right way.

AI agents exhibit increasing intentionality across key dimensions

Domain experts award higher intentionality scores to newer models

[Chart: Intentionality score (1–10) by model release date — Claude 3 Haiku (Mar 7, 2024): 7.39; Claude Opus 4.6 (Feb 5, 2026): 9.48]
01

Foundations

Legal ontologies, reasoning taxonomies, and the formal structures that underpin evaluation.

02

Benchmarks

Expert-designed evaluations spanning multiple legal domains with blind review protocols.

03

Deployment

Battle-tested agent architectures operating under real compliance and advisory constraints.

04

Horizon

Exploratory research into reasoning capabilities for which no established measurement yet exists.

Frontier model performance on advanced legal reasoning

Evaluating 8 frontier models across 1,456 questions spanning rule-application and interpretation tasks

[Chart: Overall accuracy (%) by model]
Sonnet 4: 74%
Sonnet 4.6: 81%
Opus 4: 78%
Opus 4.6: 78%
GPT-5 Mini: 72%
GPT-5.4 Mini: 72%
GPT-5: 77%
GPT-5.4: 80%

Error bars: ±1 standard deviation

[Chart: Accuracy (%) by model and task category — Rule-Application (92 questions), Interpretation (20 questions)]
Sonnet 4: Rule-Application 77%, Interpretation 60%
Sonnet 4.6: Rule-Application 82%, Interpretation 74%
Opus 4: Rule-Application 83%, Interpretation 55%
Opus 4.6: Rule-Application 81%, Interpretation 65%
GPT-5 Mini: Rule-Application 75%, Interpretation 58%
GPT-5.4 Mini: Rule-Application 76%, Interpretation 56%
GPT-5: Rule-Application 79%, Interpretation 60%
GPT-5.4: Rule-Application 82%, Interpretation 70%

Error bars: ±1 standard deviation

[Chart: Answer consistency by model]
Sonnet 4: 96% (108/112)
Opus 4.6: 93% (104/112)
Sonnet 4.6: 93% (104/112)
GPT-5.4: 89% (100/112)
GPT-5.4 Mini: 83% (93/112)
GPT-5: 82% (92/112)
Opus 4: 82% (92/112)
GPT-5 Mini: 60% (67/112)

% of questions with exactly 1 distinct answer across 10 turns
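The consistency metric above can be computed directly from sampled answers. A minimal sketch, assuming a hypothetical input format (question id mapped to the list of answers the model gave across its 10 turns); this is illustrative, not Norm's actual evaluation harness:

```python
def consistency_rate(answers_by_question):
    """Fraction of questions whose repeated answers collapse to exactly
    one distinct answer. Input format is an assumption for illustration:
    {question_id: [answer per turn, ...]}.
    """
    consistent = sum(
        1 for turns in answers_by_question.values()
        if len(set(turns)) == 1  # exactly 1 distinct answer across turns
    )
    return consistent / len(answers_by_question)

# Toy example with 3 questions, 10 turns each (illustrative data only).
answers = {
    "q1": ["A"] * 10,         # fully consistent
    "q2": ["B"] * 9 + ["C"],  # 2 distinct answers -> inconsistent
    "q3": ["A"] * 10,         # fully consistent
}
print(f"{consistency_rate(answers):.0%}")  # → 67%
```

A question counts as consistent only if every turn produced the identical answer, so the metric penalizes even a single deviation, which is why it separates models more sharply than raw accuracy.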