Introducing the agent performance loop: AgentCore Optimization now in preview

Generate recommendations from production traces, validate them with batch evaluation and A/B testing, and ship with confidence.

AI agents that perform well at launch don’t stay that way. Models evolve, user behavior shifts, and prompts get reused in contexts they were never designed for. Agent quality quietly degrades. In most teams, the improvement process still looks the same: without an automatic feedback loop, when a user complains, a developer reads through traces, forms a hypothesis, rewrites the prompt, tests a handful of cases, and ships the fix. Then the cycle repeats, often introducing a new issue for a different user. Until today, Amazon Bedrock AgentCore provided the pieces for you to debug this manually or build custom implementations: check the evaluation scores to detect a quality drop, dig into the traces to determine the root cause, and update the agent with an improved configuration. The developer is the performance engine, relying on intuition rather than on systematic, data-backed evidence. Dedicated science teams and large centralized benchmarks help, but they are neither a practical nor a timely solution for most product teams. Even when you have that machinery, it tends to move on weekly or monthly cycles, while agents drift in production every day.

AgentCore is the platform to build, connect, and optimize agents at scale, with security enforced at the infrastructure layer. Thousands of developers already use AgentCore to build agents that reason, plan, and act across complex workflows. Today we are announcing new capabilities in AgentCore that complete the observe, evaluate, improve loop for agent performance and quality: recommendations and two ways to validate them.

Recommendations analyze production traces and evaluation outputs to optimize your system prompt or tool descriptions for the evaluator you specify. Batch evaluation tests the recommendation against a predefined test dataset and reports aggregate scores, catching regressions on cases you know matter. When hand-authored scenarios aren’t enough, you can also simulate a dataset using an LLM-backed actor that plays the role of an end user. A/B testing runs a controlled comparison between versions of an agent through AgentCore Gateway, splitting live production traffic at the percentage you configure and reporting results with confidence intervals and statistical significance. Recommendations propose changes, batch evaluation and A/B testing validate them, and together they replace the manual cycle of reading traces, guessing at fixes, and deploying blind.

“Continuously evaluating and improving agents is essential for driving data-driven value creation. Processes that traditionally required weeks of manual prompt tuning have evolved into rapid, repeatable cycles through the use of AgentCore. By deriving improvement recommendations from production trace data and validating their impact through A/B testing, organizations can optimize performance while ensuring accuracy and effectiveness. This approach enables continuous, highly efficient improvement at scale.” Yoshiharu Okuda, Head of Generative AI Business Strategy Department, NTT DATA

How the loop runs in practice

Here is how the loop runs for a common scenario: a model upgrade. The pattern is the same for any change, whether a prompt refactor, a tool set update, or a framework upgrade.

End-to-end traceability in AgentCore captures every model call, tool invocation, and reasoning step as OpenTelemetry-compatible traces managed using AgentCore Observability. Evaluations score those traces automatically across dimensions like goal success rate, tool selection accuracy, helpfulness, and safety, using built-in evaluators, ground-truth comparisons, or custom LLM-as-judge scoring.
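
To make the shape of those traces concrete, here is a minimal OpenTelemetry sketch of an agent emitting spans for a tool call and a model call. The span names and attributes are placeholders rather than the AgentCore trace schema; in practice, the AgentCore SDK and most agent frameworks emit these spans for you.

```python
# Illustrative only: span names and attributes are placeholders,
# not the AgentCore trace schema.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
# Swap the console exporter for an OTLP exporter in a real deployment.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("market-trends-agent")

def answer(query: str) -> str:
    with tracer.start_as_current_span("agent.invocation") as span:
        span.set_attribute("agent.query", query)
        with tracer.start_as_current_span("tool.get_stock_quote") as tool_span:
            tool_span.set_attribute("tool.input.symbol", "AMZN")
            quote = {"symbol": "AMZN", "price": 0.0}  # stand-in for a real tool call
        with tracer.start_as_current_span("llm.generate") as llm_span:
            llm_span.set_attribute("llm.model_id", "your-model-id")
            response = f"Latest quote: {quote}"  # stand-in for a real model call
        return response
```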

Generate a recommendation. Point the Recommendations API at the CloudWatch log group where your agent writes traces. Pick the reward signal, the evaluator you want to optimize for: either a built-in evaluator from AgentCore or a custom evaluator you’ve built. Then choose what to optimize: the system prompt or the tool descriptions. AgentCore reflects on the traces in light of that reward signal and generates a recommendation aimed at improving performance on it. For tool description recommendations, it only sharpens the descriptions without touching the tool implementations. The service proposes, and you decide what to take forward into the validation steps.
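
As a rough sketch of what triggering this step could look like from code, the snippet below points a recommendation job at a log group and an evaluator. The boto3 client name, operation names, and parameters are assumptions for illustration only; consult the AgentCore documentation for the actual preview API.

```python
import boto3

# Assumed API shape for illustration only; the operation names and
# parameters below are not confirmed against the preview SDK.
agentcore = boto3.client("bedrock-agentcore-control")

response = agentcore.start_recommendation_job(            # hypothetical operation
    agentRuntimeArn="arn:aws:bedrock-agentcore:us-east-1:111122223333:runtime/market-trends-agent",
    traceSource={"cloudWatchLogGroup": "/aws/bedrock-agentcore/market-trends-agent"},
    evaluator="builtin.goal_success_rate",                 # or the ARN of a custom evaluator
    optimizationTarget="SYSTEM_PROMPT",                    # or "TOOL_DESCRIPTIONS"
)
job_id = response["jobId"]

# Poll for the proposed change, then review it before moving to validation.
result = agentcore.get_recommendation_job(jobId=job_id)    # hypothetical operation
print(result.get("recommendedSystemPrompt"))
```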

Package the change as a configuration bundle. Configurations ship as bundles: immutable, versioned snapshots of your agent’s configuration (model ID, system prompt, tool descriptions), keyed by runtime ARN. Your agent reads its active configuration dynamically at runtime through the AgentCore SDK, so swapping a prompt or a model is a configuration change, not a code change. Create one bundle for your current configuration and another for the recommendation. Bundles are optional; for changes that include code, deploy to a separate runtime endpoint instead.
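
Here is a minimal sketch of that workflow under the same caveat: the operation and field names are hypothetical stand-ins, but the shape is to create one immutable bundle version per configuration and let the SDK resolve the active one at runtime.

```python
import boto3

# Hypothetical operation and field names, for illustration only.
agentcore = boto3.client("bedrock-agentcore-control")
runtime_arn = "arn:aws:bedrock-agentcore:us-east-1:111122223333:runtime/market-trends-agent"

current_prompt = "You are a market intelligence assistant for investment brokers."
current_tools = {"get_stock_quote": "Returns the latest quote for a ticker symbol."}
recommended_prompt = "..."  # the system prompt proposed by the recommendation job

# Version 1: the configuration the agent runs today (the baseline).
baseline_bundle = agentcore.create_configuration_bundle(   # hypothetical operation
    agentRuntimeArn=runtime_arn,
    configuration={
        "modelId": "current-model-id",
        "systemPrompt": current_prompt,
        "toolDescriptions": current_tools,
    },
)

# Version 2: the same configuration with the recommended system prompt applied.
candidate_bundle = agentcore.create_configuration_bundle(  # hypothetical operation
    agentRuntimeArn=runtime_arn,
    configuration={
        "modelId": "current-model-id",
        "systemPrompt": recommended_prompt,
        "toolDescriptions": current_tools,
    },
)

# At runtime, the agent reads its active bundle through the AgentCore SDK,
# so promoting a bundle version changes behavior without a redeploy.
```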

Validate offline: batch evaluation. Run your agent against a curated data set using the new bundle, then evaluate the resulting sessions in batch and compare aggregate scores to your baseline. This catches regressions on use cases you have already defined. Teams typically wire batch evaluation into their CI/CD pipelines so no configuration change reaches production without passing their known-good cases.
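
The sketch below shows what that CI gate might look like. The batch evaluation calls use hypothetical operation names, while the pass/fail logic at the end is plain Python you could run in any pipeline.

```python
import boto3

agentcore = boto3.client("bedrock-agentcore")  # hypothetical client and operation names below

RUNTIME_ARN = "arn:aws:bedrock-agentcore:us-east-1:111122223333:runtime/market-trends-agent"
DATASET_ARN = "arn:aws:bedrock-agentcore:us-east-1:111122223333:dataset/broker-conversations"

def batch_scores(bundle_version: str) -> dict[str, float]:
    """Run the curated dataset against a bundle version and return aggregate scores."""
    job = agentcore.start_batch_evaluation(                 # hypothetical operation
        agentRuntimeArn=RUNTIME_ARN,
        configurationBundleVersion=bundle_version,
        datasetArn=DATASET_ARN,
        evaluators=["builtin.goal_success_rate", "builtin.tool_selection_accuracy"],
    )
    report = agentcore.get_batch_evaluation(jobId=job["jobId"])  # hypothetical operation
    return report["aggregateScores"]

baseline = batch_scores("1")
candidate = batch_scores("2")

# CI gate: fail the pipeline if the candidate regresses on any evaluator
# by more than a small tolerance.
TOLERANCE = 0.01
regressions = {
    name: (baseline[name], candidate[name])
    for name in baseline
    if candidate[name] < baseline[name] - TOLERANCE
}
assert not regressions, f"Candidate regressed on: {regressions}"
```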

Validate against live traffic: A/B testing. Configure AgentCore Gateway to split live production traffic between two variants, with the current version as the control and the candidate as the treatment. Variants can be different bundle versions on the same runtime for configuration-only changes, or different gateway targets pointing to separate runtime endpoints for changes that include code. Online evaluation scores every session with your specified evaluators. The A/B test results include confidence intervals and p-values. When you have adequate data to give you confidence in the new version’s performance, stop the test and promote the new variant by setting it as the default. To roll back, pause the test and the agent reverts to its existing configuration.
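
To make the statistics concrete, here is a small, self-contained example of the kind of comparison behind those p-values: a two-proportion z-test on goal-success counts from the control and treatment variants. It shows how to read the result and is not necessarily the exact methodology the service uses.

```python
from math import erf, sqrt

def two_proportion_z_test(successes_a: int, n_a: int, successes_b: int, n_b: int):
    """Compare success rates of control (a) and treatment (b); return (z, two-sided p)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided, normal approximation
    return z, p_value

# Example: control passes 412 of 500 sessions, treatment passes 451 of 500.
z, p = two_proportion_z_test(412, 500, 451, 500)
print(f"z = {z:.2f}, p = {p:.4f}")  # p < 0.05 suggests the difference is unlikely to be noise
```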

“What took weeks of manual prompt iteration is now a repeatable cycle with AgentCore: generate a recommendation from production traces, validate it against live traffic with statistical significance, and deploy the winning configuration. Each cycle produces the baseline data for the next — the improvement process compounds.” — Masashi Shimizu, Senior Managing Director, Nomura Research Institute, Ltd.

Where we’re headed

Today’s preview is developer-triggered by design. You choose when to generate a recommendation, which evaluator to target, and whether to promote the result. Our vision is a flywheel where traces feed evaluations, evaluations surface drift, recommendations turn that signal into a concrete change, and A/B testing proves it works. The winning configuration becomes the new baseline, and the traces it produces are the input for the next cycle.

Over time, the flywheel will spin with less effort. Recommendations will weigh multiple evaluators together, surfacing trade-offs with evidence. They will also expand the optimization surface to skills, proposing new ones or refining existing ones based on production usage. Trace analysis will cluster production failures into patterns you can address before they multiply. Monitor alarms will launch a recommendation and validation on their own when an evaluator drops below a threshold, landing the result in a review queue. You decide what ships, and the system does the heavy lifting to get there.

See it in action

The Market Trends Agent sample on GitHub is a market intelligence agent built for investment brokers, covering real-time stock data, sector analysis, news search, and personalized broker profiles. For an agent serving brokers with different risk profiles, sector interests, and conversational styles, quality degradation is hard to spot and harder to fix without the right tooling.

Walk through the full improvement loop: generate a recommendation that surfaces where the agent fails to personalize advice to a broker’s stated strategy or selects the wrong tool when a query spans multiple sectors. Package the change as a configuration bundle version. Validate the fix with batch evaluation across a curated set of broker conversations. Then A/B test the configuration against real broker sessions with statistical confidence before promoting it to production.

Get started

These capabilities are available in preview today through Amazon Bedrock AgentCore in AWS Regions where AgentCore Evaluations is available. During preview, AgentCore Optimization targets system prompts and tool descriptions for agents deployed on AgentCore Runtime and using AgentCore Observability and Evaluations.

Get started through the AgentCore console or the CLI. Read the documentation and follow the step-by-step tutorials.


About the authors

Amandeep Khurana

Amandeep Khurana is a Principal Product Manager, working on Amazon Bedrock AgentCore, focusing on agent operations and performance tooling. He’s passionate about building products at the cutting edge of technology and helping customers adopt them to solve their business problems.

Nikhil Kandoi

Nikhil Kandoi is a Principal Engineer on the AgentCore team. Nikhil brings deep expertise in building and scaling intelligent systems across multiple AI services, including Amazon Lex, AWS Panorama, and Amazon Q. Today, he focuses on the challenges of deploying and managing AI agents at enterprise scale, making large-scale agent deployments reliable and secure.

Bharathi Srinivasan

Bharathi Srinivasan is a Senior Generative AI Data Scientist at AWS. Bharathi works with enterprise customers on large‑scale generative AI challenges, including robustness and verification of non‑deterministic systems, governance of GenAI and agentic AI platforms, and the quality of dynamic agentic AI systems.
