Amazon Bedrock AgentCore vs Reality

Introduction

Amazon Bedrock AgentCore is a relatively new AWS service that aims to simplify building and deploying agentic AI solutions. In this article, we will explore the realities of building AI agents on the AWS tech stack, using an incident response agent as the example. I will walk you through the lessons learnt from building and deploying a production-grade incident response AI agent that you can use for your own deployed applications.

Of course, an agentic solution is never built on just one service. The article focuses on lessons learnt with AgentCore, but it also touches on other services, including Bedrock Guardrails, EventBridge, Lambda, CloudWatch, Elastic Beanstalk, Amplify, and DynamoDB.

Why an incident response agent?

One of the most stressful parts of my job as a Tech Lead with multiple production workloads is when something happens to the live applications, especially at night or outside working hours. You have to check logs and errors across multiple services and jump between different dashboards, trying to understand what failed and where.

This made me wonder: what if production incidents didn’t have to be so stressful? What if I built an agent, in a controlled way, that could investigate incidents for me, attempt safe remediation actions where appropriate, and send me a comprehensive incident report?

That’s how I decided to move in this direction, starting with a smaller-scoped but production-ready agent: one that is secure, scalable, and reliable enough.

Architecture Overview

The agent itself is built in TypeScript, using (among other libraries) @aws-sdk/client-bedrock-agentcore and @strands-agents/sdk.

The architecture is relatively straightforward. An Amazon EventBridge schedule triggers a monitoring Lambda function every N minutes (1 minute intervals in my case), which performs simple health checks against both the frontend and backend running services.
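As an illustration only, the decision the monitoring function has to make after probing both services could be sketched like this. The ProbeResult shape, the 2xx rule, and the MAX_LATENCY_MS threshold are assumptions made for the example, not the code from the project; the actual HTTP requests are left out so the logic stays easy to test.

```typescript
// Hypothetical shape of one health probe result (illustrative names).
interface ProbeResult {
  service: "frontend" | "backend";
  statusCode: number | null; // null when the request failed or timed out
  latencyMs: number;
}

interface HealthVerdict {
  healthy: boolean;
  failedServices: string[];
}

// Assumed threshold: a response slower than this counts as a failed check.
const MAX_LATENCY_MS = 5000;

// A probe passes only with a 2xx status inside the latency budget.
function isProbeHealthy(probe: ProbeResult): boolean {
  return (
    probe.statusCode !== null &&
    probe.statusCode >= 200 &&
    probe.statusCode < 300 &&
    probe.latencyMs <= MAX_LATENCY_MS
  );
}

// Collects every failing service so the incident can name all of them.
function evaluateHealth(probes: ProbeResult[]): HealthVerdict {
  const failedServices = probes
    .filter((probe) => !isProbeHealthy(probe))
    .map((probe) => probe.service);
  return { healthy: failedServices.length === 0, failedServices };
}
```

In a setup like this, the Lambda handler would only need to act when healthy is false, passing failedServices along as the starting context for the investigation.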

If either health check fails, Lambda invokes the incident response agent, which is deployed on Amazon Bedrock AgentCore and guarded by Bedrock Guardrails, and opens an incident case in DynamoDB. The agent then uses a collection of custom tools to investigate the incident. The tools I created include the following (the set can certainly be extended and will depend on the infrastructure of your application):

checking Elastic Beanstalk health and logs
checking CloudWatch logs for various services (Lambda, Elastic Beanstalk, Amplify, in my case)
restarting app servers, etc.

The agent decides which tools to use and when, depending on the live incident. Once the agent completes its work, an incident report is generated and sent to the required email.

For convenience, here is a high-level diagram of the architecture:

Picture 1. High-level architecture of the incident response agentic system

Now that we have an understanding of the system, we can dive into exploring the realities of building an AI agent using Amazon Bedrock AgentCore.

Reality #1 – Choosing a model is harder than it looks

The first reality check actually happened before I wrote a single line of agent logic. It started with something that initially looked like a simple task of choosing an LLM.

My first blocker was that model access isn’t always immediate. Some Anthropic models require submitting an access request before they can be used.

Then I discovered that not every model supports tool use, which is an important capability if you’re building an agentic solution.

I also learned about inference profiles. Some Bedrock models can only be invoked through an inference profile rather than directly. An inference profile acts as a Bedrock-managed endpoint that allows Bedrock to route requests, manage capacity, and optimise throughput behind the scenes.

Even after narrowing down the options, there were still practical trade-offs to consider: latency, cost, deployment complexity, and how well the models are optimised for agentic workflows.

For my use case, I went with Amazon Nova, as it met my requirements: tool use support, direct invocation, no additional approvals, and optimisation for AI agents.

Also note that model selection is becoming a challenging task in itself, given how many models already exist and how quickly new ones are released.

The biggest lesson was that model selection isn’t just about reasoning capability. It affects access, deployment, tooling, latency, cost, and operational complexity. Spend some time understanding these trade-offs before writing any agent code.

Reality #2 – AgentCore is still new technology

A little background for Reality #2:

Creating an agent with AgentCore is really simple: it takes a single AgentCore CLI command. You run agentcore create and follow the prompts. Adding custom tools is also straightforward, since tools are just functions configured for the agent’s use, so I won’t focus on how to build and add them.

An important note is that, at this point, I had a basic agent working locally, and the next logical step was to deploy it to AgentCore Runtime.

According to the documentation, deploying an agent with the AgentCore CLI should be as simple as running agentcore deploy. In reality, at least in my case (and probably for other developers as well), deployment turned out to be one of the more frustrating parts of the project.

The first issue was that the deployment tooling assumed a Python environment by default, while my project was written using the TypeScript SDK. This required installing additional tooling (such as brew install uv) before I could even begin deploying the agent.

After resolving that, the deployment failed again with a CloudFormation schema mismatch, despite using the latest version of the AgentCore CLI. After some investigation, it turned out that the CLI was generating one version of the CDK assets while the deployment tooling expected another.

With limited time available, I decided to bypass part of the AgentCore abstraction and deployed the runtime directly using AWS CDK, which worked without further issues.

To be clear, this isn’t a criticism of AgentCore itself. It’s a relatively new service, and these kinds of issues are to be expected while the ecosystem matures. The challenge is that when you’re working with new technology and something goes wrong, error messages are often vague, community support is still limited, and debugging can take considerably longer than with more established services.

Reality #3 – Agent intelligence is bounded by its tools

With the deployment issues resolved, I finally had an agent running on AgentCore Runtime and could start testing its performance.

As I expected, the investigation quality was rather disappointing.

Initially, my agent only had a few tools available, such as checking whether the frontend and backend were reachable. When something went wrong, the investigation reports were often generic, and it was clear that the agent behaved like a typical RAG-enhanced conversational bot, offering general guidelines on how to resolve an issue with unreachable services.

The problem wasn’t the model, though; it was that the agent simply didn’t have enough information to investigate further.

As I added more tools, the quality of the investigations improved significantly: the agent could inspect Elastic Beanstalk health and logs, read Lambda logs, analyse Lambda failures, and gather evidence before reaching a conclusion. It was the additional context, not a change of model, that made the difference.

However, there was another side to this. The more tools I added, the more carefully I had to think about which ones should actually be available, which ones should be guarded by policies, and which ones were excessive to the point that the agent started to hallucinate.

In fact, at some point, the logs showed the agent calling tools that I had never added in the first place, such as web_search_ext. Be aware that the bootstrapped AgentCore project comes with an MCP server that might already expose preconfigured tools out of the box.

The lesson here is that agent intelligence is bounded by its tools. Too few, and the agent doesn’t have enough evidence to investigate and behaves like a conversational chatbot; too many, and it may start calling them unnecessarily or reaching for capabilities it doesn’t actually need. Choosing the right set of tools is just as important as choosing the right model or writing a good prompt.
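One way to keep the tool set deliberate is to filter whatever tools are discovered against an explicit allow list before handing them to the agent. The sketch below is a hypothetical illustration: the ToolDefinition shape and the tool names in ALLOWED_TOOLS are invented for the example and are not part of any SDK.

```typescript
// Minimal, hypothetical description of a tool offered to the agent.
interface ToolDefinition {
  name: string;
  description: string;
}

interface FilteredTools {
  accepted: ToolDefinition[];
  rejectedNames: string[];
}

// Illustrative allow list; the names are assumptions for this example.
const ALLOWED_TOOLS: ReadonlySet<string> = new Set([
  "check_beanstalk_health",
  "get_cloudwatch_logs",
  "restart_app_server",
]);

// Keeps only allow-listed tools and reports everything that was dropped,
// so unexpected tools show up in the logs instead of in the agent.
function filterAllowedTools(
  discovered: ToolDefinition[],
  allowList: ReadonlySet<string>,
): FilteredTools {
  const accepted: ToolDefinition[] = [];
  const rejectedNames: string[] = [];
  for (const tool of discovered) {
    if (allowList.has(tool.name)) {
      accepted.push(tool);
    } else {
      rejectedNames.push(tool.name);
    }
  }
  return { accepted, rejectedNames };
}
```

Logging rejectedNames at startup makes surprises such as a preconfigured web search tool visible before the agent ever gets a chance to call them.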

Reality #4 – Premature Synthesis

At this point, my agent had enough tools to perform a meaningful investigation.

The next reality happened when I noticed this recurring pattern: if the frontend was unavailable, the agent would often check only the frontend, conclude that it had found a root cause, and immediately generate an incident report.

From the model’s perspective, this behavior seems reasonable: the agent found an issue and wanted to help (because deep inside, it is still our helpful AI chatbot that wants to provide a response). But the problem was that it stopped investigating and jumped to conclusions too early.

A human engineer wouldn’t normally stop after finding the first issue. They would continue checking other infrastructure components to determine the root cause, rule out additional failures, and build a complete picture of the incident before drawing conclusions.

This AI behavior is sometimes referred to as “premature synthesis” – a tendency of language models to produce an explanation before collecting enough evidence.

In this case, the solution has more to do with prompt engineering than with AgentCore. Once I explicitly instructed the agent to investigate all relevant components before reaching a conclusion, the quality and consistency of the investigations improved significantly.

The lesson here is that agents don’t naturally follow a systematic investigation process. If you want systematic reasoning, you need to either add elements of a deterministic workflow to your system or make the reasoning process explicit, instead of assuming the model will infer it on its own.
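The "elements of a deterministic workflow" option could look roughly like the coverage gate below: before a report is accepted, the application checks that every required component was actually inspected. This is a sketch under assumptions; the component names, the ToolCallRecord shape, and the REQUIRED_COMPONENTS list are hypothetical, and the fix described above was made in the prompt rather than with code like this.

```typescript
// Hypothetical components the investigation is expected to cover.
type InfraComponent = "frontend" | "backend" | "lambda";

// Hypothetical record of a tool call, tagged with the component it inspected.
interface ToolCallRecord {
  tool: string;
  component: InfraComponent;
}

const REQUIRED_COMPONENTS: InfraComponent[] = ["frontend", "backend", "lambda"];

// Returns the required components that no tool call has touched yet.
function missingComponents(
  calls: ToolCallRecord[],
  required: InfraComponent[],
): InfraComponent[] {
  const covered = new Set<InfraComponent>(calls.map((call) => call.component));
  return required.filter((component) => !covered.has(component));
}

// The report is only accepted once nothing is missing.
function canFinaliseReport(calls: ToolCallRecord[]): boolean {
  return missingComponents(calls, REQUIRED_COMPONENTS).length === 0;
}
```

If the gate fails, the list of missing components can be fed back to the agent as a follow-up instruction instead of accepting the premature report.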

Reality #5 – When prompt engineering isn’t enough

After improving the investigation process through some prompt engineering, I thought the agent was finally behaving as expected.

Then I noticed a risky behavior: the agent was using the tool to restart Elastic Beanstalk when the environment was already healthy, and there was no need for the restart. Even though the agent concluded that the environment was healthy, it still decided to restart the app servers.

Here, prompt engineering alone is not enough, and this is when we start to think about policies to control tool usage.

Instead of relying on the model to always make the correct decision, I moved the safety checks into the tool itself. Before restarting the environment, the tool now verifies whether the environment is healthy. If it is, the restart request is simply rejected, regardless of what the model decides.

This is an important design pattern for production agents. The model decides which tool to use and when, but the tool itself should be guarded by checks and policies that determine whether the call is allowed to proceed.
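A minimal sketch of this pattern, assuming the environment health is reduced to a colour loosely modelled on Elastic Beanstalk's Green/Yellow/Red/Grey statuses. The RestartDeps interface and the function names are hypothetical; the AWS calls are injected so the policy itself stays a plain, testable function.

```typescript
// Simplified health colour, loosely modelled on Elastic Beanstalk (assumption).
type HealthColour = "Green" | "Yellow" | "Red" | "Grey";

interface RestartDecision {
  allowed: boolean;
  reason: string;
}

// Deterministic policy: a healthy environment is never restarted.
function evaluateRestartRequest(health: HealthColour): RestartDecision {
  if (health === "Green") {
    return { allowed: false, reason: "Environment is healthy; restart rejected." };
  }
  return { allowed: true, reason: `Environment health is ${health}; restart permitted.` };
}

// Hypothetical dependencies that would wrap the real AWS SDK calls.
interface RestartDeps {
  getHealth: (environmentName: string) => Promise<HealthColour>;
  restartAppServer: (environmentName: string) => Promise<void>;
}

// The tool the agent calls: the policy runs first, whatever the model decided.
async function restartAppServerTool(
  environmentName: string,
  deps: RestartDeps,
): Promise<RestartDecision> {
  const health = await deps.getHealth(environmentName);
  const decision = evaluateRestartRequest(health);
  if (decision.allowed) {
    await deps.restartAppServer(environmentName);
  }
  return decision;
}
```

Returning the decision and its reason to the agent, rather than throwing, lets the model mention the rejected restart in its report without being able to override it.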

This approach proved to be much more reliable than trying to solve this through prompt engineering alone. And the lesson here is simple: don’t rely on prompts to enforce operational safety. Critical business rules, checks, and policies belong in code logic and not in the probabilistic flow.

Reality #6 – Hallucinations don’t just disappear

Up until this point, most of the issues I encountered were engineering problems. Deployment, tool selection, prompts, and remediation policies all had relatively straightforward solutions.

Then I started comparing the incident investigation reports with the actual execution traces and found out that some reports contained investigation steps that had never happened. For example, the agent claimed it had checked security groups, network ACLs, web server logs, and other infrastructure components, even though it didn’t have access to those systems and tools.

This is where hallucinations become dangerous, in my opinion, because the model wasn’t simply generating an incorrect answer; it was creating an audit trail that couldn’t be trusted.

The solution wasn’t to try to eliminate hallucinations completely; as we all know, that is impossible in a probabilistic system.

Instead, I changed the architecture so that the investigation timeline and audit trail were no longer generated by the model. Every tool invocation, timestamp, result, and execution status is now recorded deterministically by the application itself, while the agent is only responsible for interpreting those results and producing the final report.
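To make the idea concrete, here is a hedged sketch of an application-side recorder that wraps every tool execution. The AuditTrail class, its method names, and the AuditEntry fields are assumptions for illustration, not the project's actual implementation.

```typescript
type ExecutionStatus = "success" | "error";

// One deterministic record per tool invocation (hypothetical field names).
interface AuditEntry {
  tool: string;
  input: unknown;
  status: ExecutionStatus;
  output: unknown;
  startedAt: string; // ISO 8601 timestamp
  durationMs: number;
}

class AuditTrail {
  private readonly entries: AuditEntry[] = [];
  private readonly now: () => number;

  // The clock is injectable so the recorder can be tested deterministically.
  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  // Runs a tool and records the outcome, whether it succeeds or throws.
  async run<T>(tool: string, input: unknown, execute: () => Promise<T>): Promise<T> {
    const startedMs = this.now();
    try {
      const output = await execute();
      this.record(tool, input, "success", output, startedMs);
      return output;
    } catch (error) {
      this.record(tool, input, "error", String(error), startedMs);
      throw error;
    }
  }

  // Appends one entry; duration is measured from startedMs to the current time.
  record(
    tool: string,
    input: unknown,
    status: ExecutionStatus,
    output: unknown,
    startedMs: number,
  ): void {
    this.entries.push({
      tool,
      input,
      status,
      output,
      startedAt: new Date(startedMs).toISOString(),
      durationMs: this.now() - startedMs,
    });
  }

  // Returns copies, so report generation cannot alter the recorded entries.
  timeline(): AuditEntry[] {
    return this.entries.map((entry) => ({ ...entry }));
  }
}
```

With a recorder like this, the timeline section of the report is rendered from timeline() by ordinary code, and the model only receives the entries as input to interpret.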

The lesson here is to carefully evaluate the responsibilities within your system and decide which ones can safely remain agentic and which should be implemented deterministically. Functions that require correctness, auditability, or compliance should usually remain in traditional software, while the agent should focus on areas that benefit from reasoning under uncertainty and ambiguity.

Reality #7 – Observability is mandatory

After discovering dangerous hallucinations and an incorrect audit trail generated by the agent, I understood that observability is more important than I thought.

Unlike traditional software, agents don’t always follow the same execution path. Two investigations of the same incident may result in a different sequence of tool calls, different reasoning, and sometimes even different conclusions, which makes debugging significantly harder.

To understand the agent’s behavior, I started recording tool execution history, investigation timelines, timestamps, and Bedrock invocation logs. This made evaluating the agent’s performance much easier. Instead of simply asking whether the final answer looked correct, I could analyse the reasoning process, identify unnecessary tool calls, detect loops, and spot opportunities to improve prompts, policies, architecture, and other system components.
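As one example of what the recorded history makes possible, a simple loop check can count identical tool calls within a single investigation. This is an illustrative sketch: the TraceEntry shape and the default threshold of three are assumptions, and comparing inputs through JSON.stringify is a deliberately naive choice that treats objects with differently ordered keys as different calls.

```typescript
// Hypothetical minimal trace record: which tool ran and with what input.
interface TraceEntry {
  tool: string;
  input: unknown;
}

interface RepeatedCall {
  signature: string;
  count: number;
}

// Flags tool calls repeated with identical input at least `threshold` times.
function findRepeatedCalls(trace: TraceEntry[], threshold: number = 3): RepeatedCall[] {
  const counts = new Map<string, number>();
  for (const entry of trace) {
    // Naive signature: tool name plus serialised input.
    const signature = `${entry.tool}:${JSON.stringify(entry.input)}`;
    counts.set(signature, (counts.get(signature) ?? 0) + 1);
  }

  const repeated: RepeatedCall[] = [];
  counts.forEach((count, signature) => {
    if (count >= threshold) {
      repeated.push({ signature, count });
    }
  });
  return repeated;
}
```

A non-empty result is a hint that the agent is stuck re-running the same query and is worth investigating in the prompt or in the tool output.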

The lesson here is that observability isn’t a nice-to-have for agentic systems. It’s a fundamental requirement, because if you can’t see what your agent is doing, you can’t reliably debug, evaluate, or improve it.

Reality #8 – Security becomes a different problem

Building traditional software taught us to think about authentication, authorisation, encryption, and input validation. Agentic systems introduce a new set of security concerns.

The first security question to ask is: What is the agent actually allowed to do?

In my case, the agent was able to inspect logs, query infrastructure, and restart Elastic Beanstalk app servers. Every additional tool increased the agent’s capabilities and potential attack surface.

Another concern was prompt injection. We often think of prompt injection as something that happens through user input, but in operational systems, it can also come from unexpected places. If an attacker can influence application logs and the agent later reads those logs as part of its investigation, malicious instructions could become part of the agent’s context.

Cost also becomes a part of the security discussion. Agents don’t naturally stop after a fixed number of tool calls; they stop when they believe that the objective has been reached. Without execution and tool limits, a single agent invocation could consume far more resources than expected.
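Execution limits of this kind can be enforced outside the model with a small budget object that is consulted before every tool call. The sketch below is an assumption-laden illustration: ExecutionBudget, tryConsume, and the limit values used with it are hypothetical names and numbers, not part of AgentCore or the Strands SDK.

```typescript
interface ExecutionLimits {
  maxToolCalls: number;
  maxDurationMs: number;
}

interface BudgetCheck {
  allowed: boolean;
  reason?: string;
}

class ExecutionBudget {
  private toolCalls = 0;
  private readonly startedAt: number;
  private readonly limits: ExecutionLimits;
  private readonly now: () => number;

  // The clock is injectable so the time limit can be tested without waiting.
  constructor(limits: ExecutionLimits, now: () => number = Date.now) {
    this.limits = limits;
    this.now = now;
    this.startedAt = now();
  }

  // Call before every tool invocation; only permitted calls are counted.
  tryConsume(): BudgetCheck {
    const elapsedMs = this.now() - this.startedAt;
    if (elapsedMs > this.limits.maxDurationMs) {
      return { allowed: false, reason: `Time limit of ${this.limits.maxDurationMs} ms exceeded` };
    }
    if (this.toolCalls >= this.limits.maxToolCalls) {
      return { allowed: false, reason: `Tool call limit of ${this.limits.maxToolCalls} reached` };
    }
    this.toolCalls += 1;
    return { allowed: true };
  }
}
```

When tryConsume returns allowed: false, the surrounding code can stop the investigation and produce a report from the evidence gathered so far, instead of letting the agent keep spending.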

To mitigate these risks, I implemented multiple security layers, including, but not limited to:

least privilege IAM permissions
explicit tool allow list
deterministic remediation policies inside tools
execution limits (max number of tool calls allowed, max duration the agent can run)
Bedrock Guardrails (against prompt injection attacks)

You can think of different security mitigations depending on your use case, but the lesson here is that production agents should never rely solely on model behaviour for safety. Security comes from deterministic controls around the model, while the model provides the reasoning within those boundaries.

Conclusion

Thanks for reading about my experience of building an agent on the AWS tech stack.

Before starting this project, I expected to spend most of my time “working on AI”, i.e., testing models, refining prompts, and improving the agent’s reasoning. However, in reality, those were only a small part of the engineering effort.

Most of my time went into designing a reliable system around the model:

defining clear ownership of responsibilities
implementing deterministic workflows where correctness mattered most
building observability
introducing security controls
limiting tool access

and generally making the solution resilient, reliable, secure, and scalable for production use.

AgentCore did what it promised and provided a managed runtime for running and scaling an AI agent. But building a production-ready agent involved much more than deploying a model with a few tools.

If there is one takeaway from this project, it’s that building an agent is easy, but building a reliable agent is mostly software engineering.