New research makes the case that the $53.2 billion coding agent market has been solving the wrong problem. Generating code is not the hard part. Finding where to put it is. Cielara Code beats both Claude Code and OpenAI Codex at exactly that, across three independent benchmarks, with the data to back it up.
The autonomous coding agent market hit $8.29 billion in 2025. It is projected to reach $53.2 billion by 2030. That entire projection rests on one assumption almost nobody has stopped to question: that generating code is where agents fail. New research from Causal Dynamics Lab, submitted to NeurIPS 2026, argues that assumption is wrong. Agents do not fail because they cannot write code. They fail because they cannot find where to write it. And until Cielara Code, every major system in the market was navigating that problem blind.

The Problem Nobody Was Measuring
Before any coding agent fixes a bug or ships a feature, it must first locate exactly where in the codebase to make the change. Code localization is the foundational prerequisite of every automated repair. It is not glamorous. It does not show up in product demos. And for years, every major agent team has treated it as a solved problem. It is not.
Research shows that human developers spend up to 66% of their debugging time simply studying the system to understand where a modification should occur. Autonomous agents face the same bottleneck at machine scale and machine cost. The difference is that when a human developer wastes an afternoon reading the wrong files, it costs one person a day. When an agent does it across thousands of tasks a week, it becomes the largest line item in your engineering infrastructure budget.

Forty percent of corporate technology budgets are consumed by the fallout from technical debt. Developers spend 42% of their working week managing that debt, an estimated $85 billion in annual opportunity cost according to Stripe's Developer Coefficient report. Autonomous agents deployed on top of systems already carrying that debt do not reduce the cost. Without accurate localization, they compound it.
“The agent isn’t failing to think. It’s failing to find. The navigation is where the system breaks down.”

What Agents Actually Do With Their Time
Causal Dynamics Lab instrumented native coding agents to measure exactly how they spend their compute across 2,510 total actions. The data is damning. The read tool, which opens and reads files one by one, was invoked 1,425 times, accounting for 56.8% of every action the agent took. Grep accounted for another 24.2%. Actual edits, the thing the agent exists to do, made up 0.8% of all actions.
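The tally itself is easy to reproduce. A minimal sketch in Python, reconstructing an illustrative trace from the published percentages; the trace format and tool names are assumptions for illustration, not the paper's actual instrumentation:

```python
from collections import Counter

# Illustrative trace reconstructed from the reported distribution over
# 2,510 actions. The format is hypothetical; the study's instrumentation
# captured real tool invocations.
trace = (
    ["read"] * 1425    # file-by-file reads: 56.8% of all actions
    + ["grep"] * 608   # text search: 24.2% of 2,510 actions
    + ["edit"] * 20    # actual edits: 0.8% of 2,510 actions
    + ["other"] * 457  # everything else (shell commands, tests, ...)
)

counts = Counter(trace)
total = sum(counts.values())  # 2,510
for tool, n in counts.most_common():
    print(f"{tool:>5}: {n:4d}  ({n / total:.1%})")
# read: 1425  (56.8%)
# grep:  608  (24.2%)
# ...
```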

The pattern is consistent. Agents search for the answer by reading every file in sequence, hoping to stumble onto the right one. When a fix spans more than six files, recall collapses to 14%. The agent is not failing because the underlying model is weak. It is failing because it is navigating a city without a map, driving every street hoping to find the address by accident.

Failed trajectories make every metric worse. When an agent guesses wrong about localization, it consumes over four times more computational resources than a successful trajectory. At enterprise scale, with codebases exceeding 500,000 lines and teams running thousands of agent tasks a week, inaccurate localization is not a performance issue. It is a direct infrastructure cost that compounds every sprint.
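The compounding is simple expected-value arithmetic. A back-of-envelope sketch that assumes only the study's 4x figure for failed trajectories; the failure rates and the unit cost are illustrative parameters, not numbers from the research:

```python
# Expected compute cost per task when a failed trajectory costs 4x a
# successful one (the study's figure). Failure rates here are illustrative.
def expected_cost(unit_cost: float, failure_rate: float, overrun: float = 4.0) -> float:
    """Blend of successful trajectories (unit_cost) and failed ones (overrun * unit_cost)."""
    return (1 - failure_rate) * unit_cost + failure_rate * overrun * unit_cost

for p in (0.1, 0.3, 0.5):
    print(f"failure rate {p:.0%}: {expected_cost(1.0, p):.2f}x baseline compute")
# failure rate 10%: 1.30x baseline compute
# failure rate 30%: 1.90x baseline compute
# failure rate 50%: 2.50x baseline compute
```

Even a 30% mislocalization rate nearly doubles the compute bill before a single extra feature ships.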

The Fix: Map the Codebase Before Touching It
Causal Dynamics Lab built Cielara Code around a single architectural insight: the agent should map the entire codebase before it opens a single file. The core innovation is a Code Dependency Knowledge Graph, an automatically generated structural map of every file, function, class, and relationship in a codebase. The graph tracks four relationship types: which files import which, which functions call which, which classes extend which, and which files contain which definitions.
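The paper's schema is not public, but the shape of the idea is straightforward. A minimal sketch of what such a graph might look like; the class, names, and example paths are hypothetical, not Cielara's implementation:

```python
from collections import defaultdict

# Minimal code dependency knowledge graph with the four relationship
# types described above. Schema and names are hypothetical.
class CodeGraph:
    EDGE_TYPES = ("imports", "calls", "extends", "contains")

    def __init__(self):
        # forward[etype][source] -> targets, reverse[etype][target] -> sources
        self.forward = {t: defaultdict(set) for t in self.EDGE_TYPES}
        self.reverse = {t: defaultdict(set) for t in self.EDGE_TYPES}

    def add(self, etype: str, source: str, target: str) -> None:
        self.forward[etype][source].add(target)
        self.reverse[etype][target].add(source)

    def dependents_of(self, symbol: str) -> dict:
        """Everything that imports, calls, or extends `symbol`,
        answered from the graph without opening any file."""
        return {t: sorted(self.reverse[t].get(symbol, ()))
                for t in ("imports", "calls", "extends")}

g = CodeGraph()
g.add("contains", "billing/invoice.py", "Invoice.total")
g.add("calls", "api/checkout.py", "Invoice.total")
g.add("imports", "api/checkout.py", "billing/invoice.py")
g.add("extends", "billing/credit_note.py", "Invoice")

print(g.dependents_of("Invoice.total"))
# {'imports': [], 'calls': ['api/checkout.py'], 'extends': []}
```

The reverse index is the point: "what touches this symbol" becomes a dictionary lookup instead of a repository-wide search.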
The result is an agent that navigates like a senior engineer with deep knowledge of the system, not a new hire with a terminal and grep. Instead of reading blindly, it narrows at each level and jumps directly to relevant code. The difference is not incremental. It is architectural. The analogy the team uses is precise: finding an address by driving every road in a city versus opening a GPS. The GPS does not just reach the destination faster. It eliminates every wrong turn before the first one is taken.
Conventional agents like Claude Code operate with an effective context window of approximately 169,000 tokens, once system prompts consume part of the 200,000-token raw capacity. At enterprise codebase sizes of 500,000+ lines, roughly 1.75 million tokens, a 169K window cannot hold even 10% of the codebase at once. The agent reads piecemeal, losing structural context with every swap.
REASONARA, the memory architecture underneath Cielara Code, solves this with a 125M+ token context window that holds entire production codebases in structured memory. Rather than feeding the agent 3,000 tokens per raw file read, REASONARA delivers focused, structurally organized context in 100 to 400 tokens per query. The agent does not search for files. It already knows every file, every function, and every dependency before it begins. The roadmap targets 500M+ tokens.
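The arithmetic behind those claims is worth spelling out. A back-of-envelope using only the figures above; the ~3.5 tokens-per-line ratio is implied by the article's own 500,000-line, 1.75-million-token equivalence:

```python
# Context-window math using the figures cited above.
codebase_tokens = 1_750_000   # ~500,000 lines at ~3.5 tokens per line
effective_window = 169_000    # 200K raw capacity minus system prompts

print(f"codebase fraction in window: {effective_window / codebase_tokens:.1%}")
# codebase fraction in window: 9.7%

# Per-query cost: one raw file read vs. structured graph context.
raw_read_tokens = 3_000
for graph_query_tokens in (100, 400):
    saving = 1 - graph_query_tokens / raw_read_tokens
    print(f"{graph_query_tokens} tokens/query: {saving:.0%} fewer tokens")
# 100 tokens/query: 97% fewer tokens
# 400 tokens/query: 87% fewer tokens
```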

The Benchmarks: Three Independent Tests, One Direction
Causal Dynamics Lab validated Cielara Code across three complementary benchmarks to guard against overfitting to any single dataset: SWE-Bench, MULocBench (46 repositories, 1,033 instances), and LocBench. Overall localization accuracy reached 0.774 for Cielara Code, against 0.738 for Claude Code (Opus-4.6) and 0.707 for OpenAI Codex (GPT-5.2).

The research team is precise about where the margin matters and where it does not. The acc@1 gap between Cielara and Claude Code at 0.677 versus 0.676 falls within statistical noise on 1,033 instances. They do not lead with that number. The meaningful separation sits at acc@5 (0.619 vs 0.584) and recall@5 (0.752 vs 0.727), margins of 3.5 and 2.5 percentage points across 46 heterogeneous repositories. That reflects a structural pattern rather than single-instance variance. In production environments where a fix routinely spans multiple files, recall@5 is the metric that determines whether the agent finds the full scope of what needs to change.
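For readers who want the definitions behind those numbers: acc@k asks whether any ground-truth file appears in the top k predictions; recall@k asks what fraction of the ground-truth files do. A sketch under the standard definitions; the benchmarks' exact scoring rules, and whether they score at file, function, or line granularity, may differ:

```python
# Standard top-k localization metrics for a single instance. Exact
# benchmark scoring rules may differ; file names here are examples.
def acc_at_k(predicted: list[str], gold: set[str], k: int) -> float:
    """1.0 if any of the top-k predicted files is a gold file, else 0.0."""
    return float(any(f in gold for f in predicted[:k]))

def recall_at_k(predicted: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold files appearing in the top-k predictions."""
    return len(set(predicted[:k]) & gold) / len(gold)

gold = {"billing/invoice.py", "api/checkout.py"}  # files the fix must touch
predicted = ["api/checkout.py", "billing/models.py",
             "billing/invoice.py", "tests/test_invoice.py"]

print(acc_at_k(predicted, gold, k=1))     # 1.0: top file is a gold file
print(recall_at_k(predicted, gold, k=5))  # 1.0: both gold files in the top 5
```

Averaging these per-instance scores across all 1,033 instances yields the aggregate numbers reported above. The definitions also show why recall@5 is the metric for multi-file fixes: acc@k saturates once one correct file is found, while recall@k keeps penalizing every missed file.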

The efficiency gains compound the accuracy story. Cielara Code runs 10% faster per instance and consumes 30 to 40% fewer tokens per task. At enterprise compute costs, that reduction does not stay marginal. It compounds across every pull request, every CI pipeline run, and every agent-initiated fix across an engineering organization running at scale.

Why REASONARA Is the Architecture Worth Watching
The benchmark results validate the product. The architecture underneath is the larger editorial story. REASONARA sets new state-of-the-art results on three independent memory benchmarks that have nothing to do with code localization, which means the advantage is architectural and general, not benchmark-tuned for a specific task.

On LoCoMo, the conversational memory benchmark, REASONARA reaches 88.2% against 82.5% for full context. On LongMemEval at 115,000-token contexts, it reaches 87.4% against Nemori at 74.6% and Zep plus GPT-4o at 71.2%. On UltraDomain at 125M+ tokens, REASONARA reaches 94%, 20 percentage points above frontier LLMs and 10 above RAG. These are not narrow wins. They are structural separations across three different categories of memory challenge, which is exactly what you would expect from a system that genuinely holds and reasons over long-range context rather than simulating it.
This matters for the production software argument because the failure mode of AI coding agents in production is not inaccurate code generation. Changes look correct in review. They pass static checks. Then they trigger unpredictable failures once they interact with real dependencies, policy constraints, runtime state, and infrastructure topology. Agents cannot see those interactions because they cannot hold the full system context at once. REASONARA is built to hold it, and the memory benchmarks prove the system works at the scale where production codebases actually live.


The Team and Why the Research Holds
Causal Dynamics Lab was founded by former Uber platform engineers and AI researchers from Microsoft Research and Emory University. The team includes a Stanford Top 2% Scientist with more than 200 publications at NeurIPS, ICLR, and KDD. The LAGR paper, which formalizes Agentic Graph Retrieval as the mathematical framework behind Cielara's localization approach, has been submitted to the 40th Conference on Neural Information Processing Systems (NeurIPS 2026).
The NeurIPS submission matters for one specific reason. Research claims about benchmark performance in AI are easy to manufacture. Peer review at NeurIPS is not. The community will stress-test the benchmark methodology, the LAGR formalization, and the statistical claims about structural patterns versus single-instance variance. The team’s decision to lead with acc@5 and recall@5 over the noise-level acc@1 gap signals that they understand what the reviewers will look for. They are not hiding from the hard questions. They are leading with them.

What This Means for Engineering Teams Right Now
The practical implication for enterprise engineering teams is direct. If your current AI coding agent spends 56.8% of its compute on file reads and failed trajectories cost four times more than successful ones, you are not paying for intelligence. You are paying for navigation overhead that compounds every time the agent touches a multi-file fix, which is most production fixes in any mature codebase.
The deeper implication is about production safety. AI coding agents are shipping code faster than teams can verify what that code will do in production. Changes that look correct in isolation trigger failures once they hit real dependencies and runtime state. Cielara’s pre-deployment simulation layer addresses this directly, replacing post-deployment debugging with structural validation before the change ships. That is not a feature. It is the difference between an agent that accelerates your engineering team and one that creates a new category of production incident.
The $53.2 billion projection for autonomous coding agents by 2030 assumes the current architecture scales. Causal Dynamics Lab's research argues, with data, that it does not. An agent spending 56.8% of its compute reading files and losing recall once a fix spans more than six files is not a foundation for engineering at production scale. A system that maps the codebase first, holds it in structured memory, validates changes before they ship, and delivers focused context in 100 to 400 tokens per query instead of 3,000 is. That is what Cielara Code is. The benchmarks show it is ahead. The architecture shows why it stays ahead.