LegacySWE Benchmark Release

COBOLBench

Measuring frontier coding agents on enterprise COBOL maintenance

COBOLBench evaluates whether coding systems can complete real maintenance work inside long-lived enterprise COBOL systems with automated correctness checks.

Key Findings

The strongest signal is how much remains unsolved

Finding 01

Best evaluated system reaches 11% Pass@4

Even the strongest reported result solves only a small share of the maintenance tasks in the benchmark.

Finding 02

88 of 100 tasks remain unsolved

No evaluated system solves 88 of the 100 tasks, so the release should stay useful as a measure of difficulty rather than a leaderboard that saturates quickly.

Finding 03

Tasks come from two real enterprise production systems

COBOLBench is grounded in real production environments rather than toy exercises or greenfield prompts.

Finding 04

Designed to expose failures beyond syntax recall

The benchmark targets hidden business logic, workflow state, and multi-file enterprise maintenance rather than simple syntax recall.

Results Snapshot

Public release summary for the first benchmark page.

release-v1

Best Pass@4: 11%

Tasks solved by at least one system: 12 / 100

Tasks unsolved by all systems: 88 / 100
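
The snapshot figures above can be read as two different aggregations over the same per-attempt outcomes. The sketch below is illustrative only: it assumes Pass@4 means a task counts as solved when any of four attempts passes the automated checks, which this page does not spell out. The data layout, the function names, and the toy results are hypothetical and are not part of the COBOLBench release.

```python
from typing import Dict, List, Set

# Hypothetical layout: system name -> task id -> pass/fail per attempt.
Results = Dict[str, Dict[str, List[bool]]]


def pass_at_k(task_attempts: Dict[str, List[bool]], k: int) -> float:
    """Share of tasks where at least one of the first k attempts passed.

    Assumed definition; the release page does not define Pass@k.
    """
    if not task_attempts:
        return 0.0
    solved = sum(1 for attempts in task_attempts.values() if any(attempts[:k]))
    return solved / len(task_attempts)


def solved_by_any(results: Results, k: int) -> Set[str]:
    """Task ids solved by at least one system within k attempts."""
    solved: Set[str] = set()
    for task_attempts in results.values():
        for task_id, attempts in task_attempts.items():
            if any(attempts[:k]):
                solved.add(task_id)
    return solved


# Toy data, invented for illustration (not benchmark results).
toy: Results = {
    "system_a": {
        "t1": [False, True, False, False],
        "t2": [False, False, False, False],
        "t3": [False, False, False, False],
    },
    "system_b": {
        "t1": [False, False, False, False],
        "t2": [False, False, False, False],
        "t3": [False, False, False, True],
    },
}

best = max(pass_at_k(attempts, 4) for attempts in toy.values())
union = solved_by_any(toy, 4)
print(f"best Pass@4: {best:.2f}, solved by any system: {len(union)} / 3")
```

Under this reading, the union across systems can exceed the best single system's score, which is consistent with a best Pass@4 of 11% alongside 12 of 100 tasks solved by at least one system.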

Task Count

100

Enterprise maintenance tasks in the first release.

Task Source

Real enterprise production systems.

Evaluation

Auto

Automated correctness checks for task completion.

Benchmark Construction

Built around enterprise maintenance, not isolated coding prompts

COBOLBench focuses on the maintenance reality of long-lived production systems. The public release page keeps the framing simple: real production systems, enterprise maintenance tasks, and automated correctness checks.

What the benchmark captures

Maintenance tasks drawn from two real enterprise production systems.
Existing COBOL system behavior rather than greenfield generation.
Automated correctness checks as the task completion standard.
Evaluation pressure on business logic, context, and system interaction.

What the benchmark is not

It is not a syntax quiz or a short code repair set.
It is not optimized for flashy leaderboard movement.
It is not limited to the language surface of COBOL.
It is not the whole program: it is the first release surface for a broader LegacySWE program.

Why The Tasks Are Hard

The challenge lives in enterprise behavior

Weakly declared behavior

Important assumptions often live in record layouts, statuses, workflow transitions, and surrounding data conventions rather than explicit type systems or comments.
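
As a rough illustration of weakly declared behavior, the sketch below parses a fixed-width record in which a one-character status silently changes what the amount means. Everything here is hypothetical: the layout, the field names, the status codes, and the "hold" rule are invented for the example and are not taken from the benchmark's systems. Python is used only to keep the example short.

```python
# Hypothetical fixed-width layout: field name -> (start, end) column offsets.
# 6-char account id, 1-char status, 9-digit amount in cents.
LAYOUT = {
    "account_id": (0, 6),
    "status": (6, 7),
    "amount_cents": (7, 16),
}


def parse_record(line: str) -> dict:
    """Slice one fixed-width line into named fields."""
    fields = {name: line[start:end] for name, (start, end) in LAYOUT.items()}
    fields["amount_cents"] = int(fields["amount_cents"])
    return fields


def is_payable(record: dict) -> bool:
    """Assumed convention: only status 'A' (active) records may be paid.

    A status of 'H' (hold) blocks payment even when the amount is positive.
    Nothing in the layout itself declares this rule.
    """
    return record["status"] == "A" and record["amount_cents"] > 0


held = parse_record("AC1001H000012500")
active = parse_record("AC1001A000012500")
print(is_payable(held), is_payable(active))
```

A patch that checks only the amount would pass a casual read of the layout while breaking the hold convention, which is the kind of failure the paragraph above describes.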

Historically layered systems

Long-lived production systems accumulate meaning over time. Correct edits depend on understanding why neighboring logic exists, not only what one file appears to do.

Multi-file consequences

A local-looking patch can affect workflow behavior, adjacent programs, or downstream file handling in ways that are invisible without system-level reasoning.

Behavioral Insights

COBOLBench is built to surface a deeper agent gap

The release highlights a practical research question: can coding agents preserve enterprise behavior when critical logic is distributed, historical, and only partially declared?

Why This Matters

COBOL is the first benchmark surface because it makes the challenge visible. The larger target is enterprise software engineering under hidden business logic, across languages and workflows.

Full Methodology

Use this page as the public release summary

This launch page is designed for outreach and linking. It covers the core findings and benchmark framing while LegacySWE expands into a fuller release and research package.

Research Writeup

For the broader LegacySWE thesis, return to the umbrella site. For publication and research conversations, route through Metaphi.
