Best evaluated system reaches 11% Pass@4
Even the strongest reported result solves only a small share of the maintenance tasks in the benchmark.
Measuring frontier coding agents on enterprise COBOL maintenance
COBOLBench evaluates whether coding systems can complete real maintenance work inside long-lived enterprise COBOL systems, using automated correctness checks to judge each result.
Most tasks are not solved by any evaluated system, which makes the release useful as a durable measure of difficulty rather than a leaderboard that saturates quickly.
COBOLBench is grounded in real production environments rather than toy exercises or greenfield prompts.
Finding 04
The benchmark is designed to expose failures on hidden business logic, workflow state, and multi-file enterprise maintenance, rather than on simple syntax recall.
Results Snapshot
Public release summary for the first benchmark page.
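The headline result is reported as Pass@4, but this page does not say how that figure is computed. As a rough illustration only, the sketch below uses the widely used unbiased pass@k estimator: for a task with n sampled attempts of which c pass the checks, pass@k = 1 - C(n - c, k) / C(n, k), averaged over tasks. The function names and the (n, c) input format are hypothetical and not taken from COBOLBench.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n attempts, c of which passed.

    Assumption: this is the standard unbiased estimator, not necessarily
    the exact procedure used for the COBOLBench numbers.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that k attempts drawn
    # without replacement from the n samples contain no passing attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-task pass@k over (n, c) pairs, one pair per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical data: three tasks, four attempts each, one task solved once.
    example = [(4, 0), (4, 1), (4, 0)]
    print(f"Pass@4 = {benchmark_pass_at_k(example, k=4):.2%}")
```

When exactly four attempts are sampled per task (n = k = 4), the estimator reduces to the fraction of tasks where at least one of the four attempts passes the automated checks.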
COBOLBench focuses on the maintenance reality of long-lived production systems. The public release page keeps the framing simple: real production systems, enterprise maintenance tasks, and automated correctness checks.
Important assumptions often live in record layouts, statuses, workflow transitions, and surrounding data conventions rather than explicit type systems or comments.
Long-lived production systems accumulate meaning over time. Correct edits depend on understanding why neighboring logic exists, not only what one file appears to do.
A local-looking patch can affect workflow behavior, adjacent programs, or downstream file handling in ways that are invisible without system-level reasoning.
The release highlights a practical research question: can coding agents preserve enterprise behavior when critical logic is distributed, historical, and only partially declared?
Why This Matters
COBOL is the first benchmark surface because it makes the challenge visible. The larger target is enterprise software engineering where business logic is hidden, across languages and workflows.
This launch page is designed for outreach and linking. It covers the core findings and benchmark framing while LegacySWE expands into a fuller release and research package.
Research Writeup
For the broader LegacySWE thesis, return to the umbrella site. For publication and research conversations, contact Metaphi.