Optimizing a Customer Support Agent on AgentCore
Three A/B experiments on a live AI agent using Amazon Bedrock AgentCore: testing prompts, tool descriptions, and model size.

1. Introduction
The AI agent stack has evolved quickly through a few distinct phases. First came the model: call an API, get a response. The intelligence is in the model. Your job is to write a good prompt. Then came the harness: frameworks like LangGraph, CrewAI, and Strands gave agents tools, memory, and multi-step loops. Orchestration became the product.
Now the question is: how do you make a deployed agent better over time without rebuilding it from scratch on every iteration?
AgentCore Optimization is designed for this. Now in public preview, it gives you the infrastructure to run controlled A/B experiments on a live agent. Split traffic across configurations, score every session automatically with LLM-as-a-judge evaluators, and read results in CloudWatch.
In this post, I'll walk through how I built a LangGraph-based customer support agent on Amazon Bedrock AgentCore, then ran three sequential A/B experiments to optimize it: a better prompt, better tool descriptions, and a bigger model.
2. What's AgentCore Optimization?
AgentCore Optimization is a set of integrated AWS services that lets you continuously improve an agent without rebuilding it. It's built on three primitives:
- Configuration Bundles are versioned JSON payloads containing whatever per-request config you want to test: system prompt, tool descriptions, model ID, or any arbitrary key. The bundle gets injected into every invocation by the gateway, so you can run two different agent configurations off the same container, with no redeployment.
```python
# Agent reads its config bundle on every request
bundle = BedrockAgentCoreContext.get_config_bundle()
model_id = bundle["model_id"]
system_prompt = bundle["system_prompt"]
tool_descriptions = bundle.get("tool_descriptions", {})
```
- The AgentCore Gateway sits in front of your runtime and handles traffic routing. You create an A/B test that maps two config bundles to traffic percentages (50/50, 80/20, etc.) and attach it to a gateway target. From that point, every invocation is probabilistically routed to one variant, so a 50/50 split is a target rather than a guarantee, and the variant assignment is recorded in OTel spans.
- Online Evaluators are LLM-as-a-judge scorers that run asynchronously after every session. You define evaluation criteria in natural language, choose a judge model and scoring scale, and register the evaluator with AgentCore. Once attached to an online evaluation config, it scores every session in the A/B test and writes the results to a CloudWatch log group. You can define custom evaluators tuned to your domain, or use AgentCore's built-in evaluators.
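To see why a 50/50 split is a target rather than a guarantee, here is a minimal simulation of per-invocation probabilistic routing. It is a sketch, not the gateway's actual algorithm: `assign_variant` and `simulate_split` are hypothetical names, and the only assumption is that each invocation is an independent random draw.

```python
import random
from collections import Counter


def assign_variant(rng: random.Random, treatment_share: float = 0.5) -> str:
    """Pick a variant for one invocation; each call is an independent draw."""
    return "T1" if rng.random() < treatment_share else "C"


def simulate_split(n_sessions: int, seed: int, treatment_share: float = 0.5) -> Counter:
    """Count how many of n_sessions land on each variant for a given seed."""
    rng = random.Random(seed)
    return Counter(assign_variant(rng, treatment_share) for _ in range(n_sessions))


if __name__ == "__main__":
    # With only 30 sessions, uneven splits such as 13/17 or 14/16 are common.
    for seed in range(5):
        counts = simulate_split(30, seed)
        print(f"seed={seed} C={counts['C']} T1={counts['T1']}")
```

At small sample sizes the per-variant counts drift away from an even split, which is why the experiments later in this post report different `n` for control and treatment.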
These three primitives compose into a four-step continuous improvement loop as described in the official docs:
1. Generate a recommendation. Point the Recommendations API at agent traces in CloudWatch and specify the evaluator you want to optimize for. It analyzes failure patterns and returns an improved system prompt or tool descriptions, along with an explanation of what changed and why.
2. Package as a configuration bundle. Version the recommended config as an immutable snapshot. This decouples agent behavior from code: you can change prompts, models, and tool descriptions without touching the container.
3. Validate with an A/B test. Split production traffic between current (control) and improved (treatment) through the gateway. Online evaluation scores every session and reports statistical significance.
4. Deploy the winner and repeat. Route 100% of traffic to the winning variant. The new baseline's traces seed the next iteration.
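Step 2 treats a bundle as an immutable, versioned snapshot. AgentCore assigns its own versions, so the following is only an illustration of the idea: a hypothetical `bundle_version` helper that derives a stable ID from the bundle content, and a `snapshot` helper that detaches the stored copy from later edits.

```python
import hashlib
import json


def bundle_version(payload: dict) -> str:
    """Derive a short, stable version id from the bundle's content."""
    # Sorting keys makes the id independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def snapshot(payload: dict) -> dict:
    """Return a detached copy of the payload tagged with its version id."""
    frozen = json.loads(json.dumps(payload))  # deep copy via JSON round trip
    return {"version": bundle_version(frozen), "payload": frozen}


if __name__ == "__main__":
    control = snapshot({"model_id": "example-model", "system_prompt": "Be concise."})
    treatment = snapshot({"model_id": "example-model", "system_prompt": "Classify intent first."})
    print(control["version"], treatment["version"])
```

Because the ID depends only on content, two bundles with the same prompt, tools, and model always get the same version, and any change produces a new one.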
3. What I Built
I built a customer support agent that handles three common ticket types:
- Account locked: "I can't log in, keep getting an error" → call `validate_account_identity`, confirm identity, explain unlock steps
- Billing duplicate: "I was charged twice" → call `fetch_billing_history`, identify duplicate, initiate refund via `check_refund_status`
- GDPR deletion: "Delete my data under Article 17" → verify identity, explain deletion process, escalate to privacy team
The agent has three tools:
```python
from langchain_core.tools import tool


@tool
def fetch_billing_history(user_id: str) -> dict:
    """Retrieve complete billing transaction history for a customer by user_id.
    Returns itemized charges, payment dates, amounts, and subscription details
    for the past 90 days."""
    ...


@tool
def check_refund_status(ticket_id: str) -> dict:
    """Check the current processing status of a refund request by ticket_id.
    Returns status (pending/approved/rejected), refund amount, and estimated
    completion timeline."""
    ...


@tool
def validate_account_identity(user_id: str) -> dict:
    """Verify a customer's account identity and retrieve their account status,
    access level, subscription tier, and any active restrictions or flags."""
    ...
```
4. Architecture

The agent is built with LangGraph's `StateGraph` and `ToolNode`. AgentCore is framework-agnostic, so plain Python, LangChain, CrewAI, or any other framework works.
```python
from langchain_core.messages import SystemMessage
from langgraph.graph import MessagesState, StateGraph
from langgraph.prebuilt import ToolNode, tools_condition


def create_agent():
    def chatbot(state: MessagesState):
        # Model and prompt come from the per-request config bundle
        model_id = _active_model_id.get()
        llm_with_tools = _get_llm_with_tools(model_id)
        system_prompt = _active_system_prompt.get()
        messages = [SystemMessage(content=system_prompt)] + state["messages"]
        return {"messages": state["messages"] + [llm_with_tools.invoke(messages)]}

    graph = StateGraph(MessagesState)
    graph.add_node("chatbot", chatbot)
    graph.add_node("tools", ToolNode(ALL_TOOLS))
    # Route to the tool node when the model requests a tool call, otherwise end
    graph.add_conditional_edges("chatbot", tools_condition)
    graph.add_edge("tools", "chatbot")
    graph.set_entry_point("chatbot")
    return graph.compile()
```
The per-request config injection happens in the `@app.entrypoint`:
```python
from langchain_core.messages import HumanMessage


@app.entrypoint
def customer_support_agent_runtime(payload: dict) -> str:
    # The gateway injects the bundle for whichever variant this request was routed to
    bundle = BedrockAgentCoreContext.get_config_bundle()
    if bundle:
        model_id = bundle.get("model_id")
        if not model_id:
            raise ValueError("Config bundle missing model_id")
        _active_model_id.set(model_id)
        _active_system_prompt.set(bundle.get("system_prompt", BASELINE_SYSTEM_PROMPT))
        _apply_tool_description_overrides(ALL_TOOLS, bundle.get("tool_descriptions", {}))

    response = agent.invoke({"messages": [HumanMessage(content=payload["prompt"])]})
    return response["messages"][-1].content
```
Storing the active config in `contextvars.ContextVar` scopes it to the current request, so concurrent invocations do not overwrite each other's settings.
Observability
OTel spans flow to `aws/spans` via the AWS Distro for OpenTelemetry (ADOT). The `LangchainInstrumentor` captures LangGraph node executions. The online evaluators read both `aws/spans` and the runtime log group. If either is missing, scoring fails silently.
One gotcha: the default Dockerfile from the AgentCore starter sets `OTEL_TRACES_EXPORTER=none`, which disables all span export. You have to remove that line and add the ADOT configurator:
```dockerfile
# Remove this line; it disables all span export:
# ENV OTEL_TRACES_EXPORTER=none

ENV AGENT_OBSERVABILITY_ENABLED=true
ENV OTEL_PYTHON_DISTRO=aws_distro
ENV OTEL_PYTHON_CONFIGURATOR=aws_configurator
ENV OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
```
Infrastructure Setup
The official way to set up an AgentCore project is:
```bash
agentcore create   # scaffold project
agentcore deploy   # build container, push to ECR, create runtime
```
In practice, I hit two bugs that kept this from working out of the box.
Bug 1 — CodeBuild project name mismatch.
The CLI creates a CodeBuild project named `AgentCore-<project>-default-container-builder`, but `deploy.py` looks for `bedrock-agentcore-<agent_name>-builder`. The build trigger silently does nothing because the project it expects doesn't exist.
Bug 2 — Wrong architecture.
AgentCore Runtime requires arm64 containers. The CLI-generated CodeBuild project uses x86, which fails at runtime with `ValidationException: Architecture incompatible`. You need `ARM_CONTAINER` compute type and the `amazonlinux2-aarch64-standard:3.0` image, neither of which the CLI sets.
I worked around this with `bootstrap_infra.py`, a one-time setup script that creates the ECR repo, S3 bucket, IAM role, and CodeBuild project with the correct name and architecture. It's idempotent, so safe to re-run if anything already exists.
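The CodeBuild part of that workaround can be sketched as a function that builds the arguments for the CodeBuild `create_project` API. This is not the actual `bootstrap_infra.py`: the function name, the S3 source layout (`source.zip`), the compute size, and privileged mode are assumptions. The project name pattern, the `ARM_CONTAINER` type, and the image come from the two bugs described above.

```python
def codebuild_project_params(agent_name: str, role_arn: str, source_bucket: str) -> dict:
    """Build create_project arguments with the name and architecture the deploy step expects."""
    return {
        # Name that deploy.py looks for (Bug 1)
        "name": f"bedrock-agentcore-{agent_name}-builder",
        # Assumed layout: zipped build context uploaded to S3
        "source": {"type": "S3", "location": f"{source_bucket}/source.zip"},
        "artifacts": {"type": "NO_ARTIFACTS"},
        "environment": {
            # arm64 build environment (Bug 2)
            "type": "ARM_CONTAINER",
            "image": "aws/codebuild/amazonlinux2-aarch64-standard:3.0",
            "computeType": "BUILD_GENERAL1_SMALL",  # assumed size
            "privilegedMode": True,  # assumed: needed to run Docker builds
        },
        "serviceRole": role_arn,
    }


if __name__ == "__main__":
    params = codebuild_project_params(
        "support_agent", "arn:aws:iam::123456789012:role/example", "example-bucket"
    )
    print(params["name"], params["environment"]["type"])
```

With AWS credentials configured, the result could be passed to `boto3.client("codebuild").create_project(**params)`. Keeping the parameter construction separate makes the name and architecture easy to check before anything is created.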
Pre-Built Evaluators
AgentCore ships with built-in evaluators out of the box. No setup, works immediately. Here's what each one actually measures:
- Builtin.GoalSuccessRate (session-level): Did the agent successfully complete all user goals across the entire conversation? The judge outputs Yes / No, which AgentCore maps to 1.0 / 0.0 before writing to CloudWatch. The aggregated scores you see (e.g. 0.154, 0.647) are the proportion of sessions that scored "Yes".
- Builtin.Helpfulness (trace-level): Did the response move the user closer to their goal, from the user's perspective? Scores on a 0–6 categorical scale. Explicitly ignores factual accuracy — it only evaluates whether the response felt helpful to the user.
- Builtin.Correctness (trace-level): Is the response factually accurate? Framed like a quiz: only content matters, not style or presentation. Scores Perfectly Correct / Partially Correct / Incorrect.
The full prompt templates for all built-in evaluators are published in the AWS docs.
They're domain-agnostic. For a customer support agent, that's not specific enough — but I'll show exactly what I mean once the results are in.
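The Yes/No to 1.0/0.0 mapping behind `Builtin.GoalSuccessRate` is simple enough to reproduce by hand. The sketch below is an illustration of that aggregation, not AgentCore code; the function name is hypothetical, and the verdict counts in the usage example are inferred from the reported proportions rather than taken from raw logs.

```python
def goal_success_rate(verdicts: list) -> float:
    """Map Yes/No verdicts to 1.0/0.0 and return the proportion of Yes."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    scores = [1.0 if v.strip().lower() == "yes" else 0.0 for v in verdicts]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Illustrative: 2 "Yes" out of 13 sessions, and 11 "Yes" out of 17 sessions.
    control = ["Yes"] * 2 + ["No"] * 11
    treatment = ["Yes"] * 11 + ["No"] * 6
    print(round(goal_success_rate(control), 3))    # 0.154
    print(round(goal_success_rate(treatment), 3))  # 0.647
```

Because each session contributes only 0 or 1, a single flipped verdict moves the aggregate by several points at these sample sizes.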
Custom Evaluators
I defined four domain-specific LLM-as-a-judge evaluators, each scoring on a 0.0–1.0 scale:
- cs_intent_resolution (Exp 1 — Prompt Strategy): Did the agent correctly identify the customer's underlying intent and fully address it, even when the request was ambiguous?
```text
You are evaluating a SaaS customer support agent.

Assess whether the agent correctly identified what the customer actually needed
and addressed it completely.

HIGH QUALITY:
- Correctly classifies the intent (billing, access, refund, privacy, etc.)
- Asks a targeted clarifying question when the request is genuinely ambiguous
- Does not ask for information it already has
- Resolves the stated problem or provides a clear path to resolution

LOW QUALITY:
- Misidentifies or ignores the customer's actual need
- Responds to a surface request while missing the underlying issue
- Asks unnecessary clarifying questions when intent is already clear
- Leaves the customer without a resolution or next step

Context (customer message and conversation): {context}
Agent response to evaluate: {assistant_turn}
```
Scoring scale:
- 1.00 (Perfect Resolution): Intent correctly identified; response fully addresses the customer's need with a concrete resolution or escalation path
- 0.75 (Mostly Resolved): Intent correctly identified and mostly addressed, but one minor gap (e.g. missing a follow-up step or detail)
- 0.50 (Partially Resolved): Intent recognized but only partially addressed, or a correct clarifying question was asked but no resolution yet
- 0.25 (Misaligned): Agent responded to the wrong intent or provided a solution that does not match the customer's actual problem
- 0.00 (Failed): Intent completely missed, customer redirected incorrectly, or no actionable response provided
- cs_tool_groundedness (Exp 2 — Tool Descriptions): Did the agent select the right tool, cite specific data from the tool result, and avoid making up facts that should have come from a tool call?
```text
You are evaluating a SaaS customer support agent that has access to three tools:
fetch_billing_history, check_refund_status, and validate_account_identity.

Assess whether the agent:
(a) selected the right tool for the customer's issue
(b) cited specific data from the tool result (amounts, dates, statuses, account details)
(c) avoided making up facts that should have come from a tool call

HIGH QUALITY:
- Calls the most appropriate tool for the stated issue
- Cites specific values: '$49.00 duplicate charge on May 1st',
  'account locked after 5 failed attempts', 'refund approved, ETA May 9th'
- Never invents billing amounts, account statuses, or ticket details

LOW QUALITY:
- Calls the wrong tool or skips tool calls entirely
- Responds with generic statements: 'your billing looks fine' without checking
- Fabricates specific data that should have been retrieved

Context (customer message and conversation): {context}
Agent response to evaluate: {assistant_turn}
```
Scoring scale:
- 1.00 (Fully Grounded): Correct tool selected; response cites specific retrieved data; no hallucinated facts
- 0.75 (Mostly Grounded): Correct tool used; most claims are data-backed but one minor detail is missing or slightly imprecise
- 0.50 (Partially Grounded): Tool was called but the response mixes real data with generic or inferred statements
- 0.25 (Wrong Tool / Mostly Generic): Wrong tool called, or the right tool was skipped and the response is largely generic with little specific data
- 0.00 (Hallucinated / No Tool): No tool called when one was clearly needed, or data cited in the response was fabricated
- cs_support_quality (Exp 3 — Model Comparison): Holistic quality scored equally across four dimensions: empathy, clarity, completeness, and tone.
```text
You are evaluating a SaaS customer support agent response.

Assess the response on four dimensions equally:
1. EMPATHY: does it acknowledge the customer's frustration or situation?
2. CLARITY: is the response easy to understand and act on?
3. COMPLETENESS: does it cover all aspects of the customer's issue?
4. TONE: is it professional, warm, and appropriate for support?

HIGH QUALITY:
- Opens with genuine acknowledgment of the customer's experience
- Explains what happened and why in plain language
- Provides concrete next steps with timelines where applicable
- Closes with an offer to help further
- Would not cause the customer to escalate or churn

LOW QUALITY:
- Robotic or dismissive tone
- Incomplete: addresses only part of the issue
- Unclear or filled with jargon
- Leaves the customer without a clear next step

Context (customer message and conversation): {context}
Agent response to evaluate: {assistant_turn}
```
Scoring scale:
- 1.00 (Excellent): Empathetic, clear, complete, and professional. Would fully satisfy the customer and prevent escalation
- 0.75 (Good): Strong on most dimensions with a minor gap, perhaps slightly terse or missing one follow-up detail
- 0.50 (Adequate): Technically correct but lacking empathy, clarity, or completeness in a noticeable way
- 0.25 (Poor): Multiple gaps, such as robotic tone, incomplete answer, or confusing language that would frustrate the customer
- 0.00 (Unacceptable): Response would cause the customer to escalate or churn because it is dismissive, wrong, incoherent, or entirely unhelpful
- cs_overall_customer_outcome ⭐ (North star — all experiments): Holistic score across all dimensions simultaneously: resolution, data accuracy, tone, and compliance process. A response that excels on one dimension but fails another (e.g. empathetic but factually wrong) should not score above 0.50.
```text
You are evaluating a SaaS customer support agent on its ultimate business outcome:
did the customer get a good result?

Score based on ALL of the following:
- Was the customer's issue RESOLVED or correctly ESCALATED?
- Did the agent use REAL DATA (no hallucinated amounts, statuses, dates)?
- Was the TONE empathetic enough that the customer would not churn?
- For compliance issues (GDPR, legal): was the correct process followed?

This is a holistic score: a response that excels on one dimension but fails another
(e.g. empathetic but factually wrong) should not score above 0.50.

Context (customer message and conversation): {context}
Agent response to evaluate: {assistant_turn}
```
Scoring scale:
- 1.00 (Outstanding Outcome): Issue fully resolved or correctly escalated; no hallucinated data; empathetic tone; customer would be satisfied
- 0.75 (Good Outcome): Issue substantially addressed with minor gaps; data accurate; tone acceptable; customer unlikely to escalate
- 0.50 (Neutral Outcome): Issue partially addressed, or data accurate but tone poor, or tone good but resolution incomplete
- 0.25 (Poor Outcome): Issue largely unresolved, or significant hallucinated data, or tone likely to frustrate the customer
- 0.00 (Failed Outcome): Issue not addressed, wrong advice given, compliance process ignored, or response would directly cause churn or harm
An important lesson on evaluator throughput: I originally used Claude Sonnet 4.5 as the judge model. With 4 evaluators firing asynchronously per session, concurrent Converse calls regularly exceeded Sonnet's throughput limit. About half of `cs_overall_customer_outcome` scores silently failed with `ThrottlingException`. No error in the logs; scores just didn't appear. The fix was switching to Claude Haiku 4.5, which has roughly 10x higher throughput limits:
```python
EVALUATOR_MODEL_ID = "us.anthropic.claude-haiku-4-5-20251001-v1:0"
```
Haiku is fast enough for async LLM-as-judge scoring at demo scale. Save Sonnet for the inference model, not the judge.
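Since throttled scores simply never appear, one way to catch the problem early is to compare the sessions you sent against the scores that actually landed. The sketch below assumes the score records have already been read from the CloudWatch log group into dictionaries; the record shape (`evaluator`, `session_id`) and the function name are hypothetical.

```python
def find_missing_scores(session_ids: list, score_records: list, evaluators: list) -> dict:
    """Return, per evaluator, the session ids that have no score record."""
    scored = {(r["evaluator"], r["session_id"]) for r in score_records}
    return {
        ev: sorted(s for s in session_ids if (ev, s) not in scored)
        for ev in evaluators
    }


if __name__ == "__main__":
    sessions = ["s1", "s2", "s3"]
    records = [
        {"evaluator": "cs_support_quality", "session_id": "s1", "score": 0.75},
        {"evaluator": "cs_support_quality", "session_id": "s2", "score": 1.0},
        {"evaluator": "cs_support_quality", "session_id": "s3", "score": 0.5},
        {"evaluator": "cs_overall_customer_outcome", "session_id": "s1", "score": 0.75},
    ]
    gaps = find_missing_scores(
        sessions, records, ["cs_support_quality", "cs_overall_customer_outcome"]
    )
    print(gaps)
```

A non-empty list for any evaluator is a signal to check throttling before trusting the aggregate, because a mean over half the sessions can look perfectly plausible.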
5. Test Overview
Three sequential experiments, each isolating one variable:
- Exp 1 — System prompt: C = Baseline ("Respond immediately with a solution"), T1 = Optimized (Classify → clarify if ambiguous → cite tool data)
- Exp 2 — Tool descriptions: C = Vague ("Get data for a user."), T1 = Precise (full typed signatures with return value descriptions)
- Exp 3 — Model: C = Claude Haiku 4.5, T1 = Claude Sonnet 4.6
Each experiment ran 30 sessions (10 repeats × 3 ticket types), routed 50/50 via the AgentCore Gateway. Each session was scored by all four evaluators asynchronously.
Experiment Pipeline
The three experiments ran sequentially because AgentCore allows only one active A/B test per gateway at a time. The full phase sequence:
- Phase 1: Generate baseline traffic. Invoke the runtime directly across all 3 ticket types.
- Phase 2: Baseline batch evaluation. Score the baseline sessions to establish a starting benchmark.
- Phase 3: AI prompt recommendation. Point the Recommendations API at baseline traces; get an optimized system prompt.
- Phase 4: AI tool description recommendation. Same API, optimized tool descriptions.
- Phase 5: Create Exp 1 config bundles, run A/B test (prompt strategy), promote winner.
- Phase 6: AI tool description recommendation. Generate improved tool descriptions based on Exp 1 traces.
- Phase 7: Create Exp 2 config bundles. Best prompt + vague tools (C) vs best prompt + precise tools (T1).
- Phase 8: Run Exp 2 A/B test. Stop Exp 1, create new A/B test, send 30 sessions; promote Exp 2 winner.
- Phase 9: Run Exp 3 A/B test. Best prompt + best tools, vary only `model_id`: Haiku (C) vs Sonnet (T1).
The traffic in this demo is synthetic, not real user activity. Two different mechanisms were used depending on the phase.
Phase 2 (baseline batch evaluation) uses AgentCore's `BatchEvaluationRunner` with a simulated customer actor — a Claude Haiku model playing the customer role. Given a character profile and a goal, the actor dynamically responds to whatever the support agent says, producing realistic multi-turn conversations up to 4 turns deep. For example, the GDPR ticket actor is briefed as an EU customer who understands their Article 17 rights and will push back if the agent seems evasive.
Phases 5, 8, and 9 (A/B experiments) use single-turn prompts sent directly to the gateway — one fixed message per ticket type, repeated 10 times each. There is no back-and-forth; each invocation is a complete self-contained session.
In production, you would replace synthetic traffic with real user interactions. The infrastructure — gateway routing, online evaluation, CloudWatch logging — works identically regardless of whether the traffic is real or simulated. The practical advantage of real traffic is that it captures the authentic distribution of how users phrase requests, including edge cases and ambiguous formulations that synthetic prompts don't cover.
How the Recommendations API works
Phases 3, 4, and 6 each call `StartRecommendation` with a `type` parameter that specifies what to optimize.
```python
dp.start_recommendation(
    type="SYSTEM_PROMPT_RECOMMENDATION",  # or "TOOL_DESCRIPTION_RECOMMENDATION"
    recommendationConfig={
        "systemPromptRecommendationConfig": {
            "systemPrompt": {"text": CURRENT_SYSTEM_PROMPT},
            "agentTraces": {"cloudwatchLogs": {...}},
            "evaluationConfig": {"evaluators": [{"evaluatorArn": "Builtin.GoalSuccessRate"}]},
        }
    },
)
```
6. Test Results and Analysis
I included `Builtin.GoalSuccessRate` as a reference signal alongside my custom evaluators. Across all three experiments, it frequently disagreed with `cs_overall_customer_outcome`, my north star. The pattern was consistent: any ticket that required escalation or a follow-up step — GDPR deletion, account unlock pending identity verification — scored "No" from GoalSuccessRate because the agent didn't complete the action in a single turn. That's the correct process, but GoalSuccessRate doesn't know that.
The gap matters because the Recommendations API only accepts session-level evaluators, which means it optimizes for GoalSuccessRate, not your custom north star. Worth knowing before you treat the API's output as ground truth.
All verdicts below are based on `cs_overall_customer_outcome` only. Builtin scores are shown for reference.
Experiment 1 — Prompt Strategy
Baseline prompt (C): Answer immediately, use tools when needed, keep responses concise.
Optimized prompt (T1): Classify intent first → ask one clarifying question if ambiguous → call the right tool → cite actual tool data → provide clear next steps.
The optimized prompt was generated by AgentCore's recommendation API after analyzing baseline session traces.
- Scores:
- cs_overall_customer_outcome ⭐: C = 0.708 (n=13), T1 = 0.835 (n=17), +18.0%, p=0.059, not significant
- cs_tool_groundedness: C = 0.865 (n=13), T1 = 0.985 (n=17), +13.9%, p=0.002, significant
- cs_intent_resolution: C = 0.923 (n=13), T1 = 0.971 (n=17), +5.1%, p=0.324, not significant
- cs_support_quality: C = 0.827 (n=13), T1 = 0.838 (n=17), +1.4%, p=0.800, not significant
- Builtin.GoalSuccessRate: C = 0.154 (n=13), T1 = 0.647 (n=17), +320.6%, p=0.002, significant
- Verdict: DIRECTIONAL. T1 leads on `cs_overall_customer_outcome` (+18.0%, p=0.059), just misses significance at n=30.
- Analysis: The most revealing number is `cs_tool_groundedness`, the only statistically significant cs_* result across all three experiments (p=0.002). The baseline prompt was partially answering from the model's own knowledge rather than grounding responses in what tools returned.
The billing sessions show where GoalSuccessRate is actually useful. The baseline called `fetch_billing_history`, confirmed the duplicate $49 charge, then stopped. cs_overall gave it 0.75 — correct tool, accurate data, reasonable response. GoalSuccessRate gave 0 — the overcharge wasn't resolved, so the user's goal wasn't met. GoalSuccessRate was right. Diagnosing a problem is not the same as fixing it. The optimized prompt's "provide clear next steps" step is what moved the agent from diagnosis to action, and GoalSuccessRate captured that clearly (0.0 → 0.833 on billing).
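With roughly 15 sessions per variant, it helps to have a feel for how easily a gap of this size arises by chance. The post does not say which statistical test AgentCore applies, so the following is a generic two-sided permutation test on the difference in mean scores, offered as a sketch. The function name is hypothetical and the scores in the usage example are made up for illustration; they are not the experiment's per-session data.

```python
import random
from statistics import mean


def permutation_p_value(control, treatment, n_resamples=5000, seed=0):
    """Two-sided Monte Carlo permutation test on the difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(treatment) - mean(control))
    pooled = list(control) + list(treatment)
    n_control = len(control)
    hits = 0
    for _ in range(n_resamples):
        # Shuffle the labels: under the null, variant assignment is arbitrary.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_control:]) - mean(pooled[:n_control]))
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Made-up rubric scores on the 0.0 to 1.0 scale, for illustration only.
    control = [0.5, 0.75, 0.75, 1.0, 0.5, 0.75, 0.5, 0.75, 1.0, 0.75]
    treatment = [0.75, 1.0, 0.75, 1.0, 0.75, 1.0, 0.75, 0.75, 1.0, 1.0]
    print(permutation_p_value(control, treatment))
```

The test asks how often a random relabeling of sessions produces a gap at least as large as the observed one. With coarse 0.25-step rubric scores and small groups, that happens often enough that even a double-digit percentage lift can fail to reach significance.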
Experiment 2 — Tool Descriptions
Baseline tool descriptions (C): Vague one-liners that give the LLM almost no signal.
```python
tool_descriptions = {
    "fetch_billing_history": "Get data for a user.",
    "check_refund_status": "Process a request.",
    "validate_account_identity": "Run a query.",
}
```
Optimized tool descriptions (T1): Precise typed signatures with return value descriptions.
```python
tool_descriptions = {
    "fetch_billing_history": (
        "Retrieve complete billing transaction history for a customer by user_id. "
        "Returns itemized charges, payment dates, amounts, and subscription details "
        "for the past 90 days."
    ),
    # (entries for the other two tools omitted here)
}
```
Both variants use the winning prompt from Exp 1, so only tool selection behavior changes.
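For readers wondering how a bundle's `tool_descriptions` could reach the tools, here is a minimal sketch of an override step. It is not the post's `_apply_tool_description_overrides`; `ToolStub` is a hypothetical stand-in for a framework tool object that exposes `name` and `description`, and the in-place update is an assumption based on how the entrypoint calls the helper.

```python
from dataclasses import dataclass


@dataclass
class ToolStub:
    """Stand-in for a framework tool object with a name and a description."""
    name: str
    description: str


def apply_tool_description_overrides(tools: list, overrides: dict) -> list:
    """Replace descriptions in place where an override exists; return the names changed."""
    changed = []
    for tool in tools:
        new_description = overrides.get(tool.name)
        if new_description and new_description != tool.description:
            tool.description = new_description
            changed.append(tool.name)
    return changed


if __name__ == "__main__":
    tools = [
        ToolStub("fetch_billing_history", "Get data for a user."),
        ToolStub("check_refund_status", "Process a request."),
    ]
    overrides = {"fetch_billing_history": "Retrieve complete billing transaction history."}
    print(apply_tool_description_overrides(tools, overrides))
    print(tools[0].description)
```

Tools without an entry keep their existing description, and override keys that match no tool are ignored, so a partial bundle cannot break the agent.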
- Scores:
- cs_overall_customer_outcome ⭐: C = 0.804 (n=14), T1 = 0.859 (n=16), +6.9%, p=0.193, not significant
- cs_tool_groundedness: C = 0.964 (n=14), T1 = 1.000 (n=16), +3.7%, p=0.141, not significant
- cs_intent_resolution: C = 0.964 (n=14), T1 = 0.922 (n=16), -4.4%, p=0.271, not significant
- cs_support_quality: C = 0.814 (n=14), T1 = 0.797 (n=16), -2.1%, p=0.650, not significant
- Builtin.GoalSuccessRate: C = 0.643 (n=14), T1 = 0.188 (n=16), -70.8%, p=0.006, significant
- Verdict: DIRECTIONAL. T1 leads (+6.9%, p=0.193), not significant.
- Analysis: Better descriptions pushed `cs_tool_groundedness` to a perfect 1.000. The GoalSuccessRate collapse (-70.8%) looks bad but reflects a judge inconsistency, not a real regression. On account_locked sessions, C scored 0.75 and T1 scored 0 — yet cs_overall was 0.875 for both. Both variants did the same thing: confirmed the account lock, requested identity verification. GoalSuccessRate sometimes counted that as success, sometimes as failure. cs_overall (+6.9%) is the more consistent signal here.
Experiment 3 — Model Comparison
Control (C): Claude Haiku 4.5 (fast, cost-efficient, ~$1/M input tokens).
Treatment (T1): Claude Sonnet 4.6 (more capable, ~$3/M input tokens).
Both variants use the best prompt and best tool descriptions from Exp 1 and 2. The only difference is `model_id` in the config bundle. No redeployment needed.
- Scores:
- cs_overall_customer_outcome ⭐: C = 0.875 (n=14), T1 = 0.812 (n=16), -7.1%, p=0.212, not significant
- cs_tool_groundedness: C = 0.982 (n=14), T1 = 1.000 (n=16), +1.8%, p=0.317, not significant
- cs_intent_resolution: C = 0.911 (n=14), T1 = 0.938 (n=16), +2.9%, p=0.537, not significant
- cs_support_quality: C = 0.786 (n=14), T1 = 0.844 (n=16), +7.4%, p=0.142, not significant
- Builtin.GoalSuccessRate: C = 0.214 (n=14), T1 = 0.438 (n=16), +104.2%, p=0.193, not significant
- Verdict: INCONCLUSIVE / C holds. Haiku 4.5 leads on `cs_overall_customer_outcome` across every ticket type. Not statistically significant (p=0.212), but the direction is consistent.
- Analysis: Sonnet scored higher on `cs_support_quality` (+7.4%) — richer responses, better formatted. But on one billing session it scored 0.5 because it required identity verification before committing to the refund, even though `fetch_billing_history` had already confirmed the duplicate and the user's eligibility. That extra step wasn't warranted by the data. Haiku saw the same tool output and offered the refund directly. On a structured task where the tool result already tells you what to do, Sonnet's tendency to add caution worked against it.
7. Key Learnings
What worked well
- Online evaluation is fully automatic once configured. After the eval config is set up, scores land in CloudWatch for every gateway session without any extra instrumentation on your end. The only thing you need to do is read the log group.
- Config bundles make iteration fast. Swapping model ID, system prompt, and tool descriptions across variants with no container rebuild changes the cost of an experiment from hours to minutes.
Gotchas
1. The Recommendations API optimizes for a metric you may not be using
The API only accepts session-level evaluators. If your north star is a custom trace-level evaluator (as mine was), the API silently falls back to `Builtin.GoalSuccessRate` instead. As the results show, those two metrics frequently disagree. Treat AI-generated recommendations as a strong starting point, not a guaranteed improvement. The A/B test is the actual verdict.
2. Each A/B test is limited to two variants
The gateway supports one control and one treatment per experiment. Testing three or more configurations requires running sequential experiments, which means more time and the risk of confounds between runs.
3. LLM-as-judge has variance
For outputs with a clear correct answer, a deterministic check (exact field match, regex, schema validation) is more reliable than asking a judge model. LLM-as-judge is necessary for open-ended quality, but if part of your rubric can be verified programmatically, that part should be.
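As an example of the programmatic part of a rubric, the check below verifies that every dollar amount cited in a response also appears in the tool result. It is a sketch under assumptions: the function names are hypothetical, the tool amounts are passed in as a plain list, and the regex only handles simple amounts such as `$49` or `$49.00` (no thousands separators).

```python
import re

# Matches simple dollar amounts like $49 or $49.00.
AMOUNT_PATTERN = re.compile(r"\$(\d+(?:\.\d{2})?)")


def cited_amounts(response: str) -> set:
    """Extract the dollar amounts mentioned in a response."""
    return {float(m) for m in AMOUNT_PATTERN.findall(response)}


def amounts_are_grounded(response: str, tool_amounts: list) -> bool:
    """True when the response cites at least one amount and all cited amounts came from the tool."""
    cited = cited_amounts(response)
    allowed = {round(a, 2) for a in tool_amounts}
    return bool(cited) and all(round(a, 2) in allowed for a in cited)


if __name__ == "__main__":
    tool_amounts = [49.0, 49.0]  # e.g. two identical charges returned by a billing tool
    print(amounts_are_grounded("You were charged $49.00 twice on May 1st.", tool_amounts))
    print(amounts_are_grounded("I see a $59.00 charge on your account.", tool_amounts))
```

A check like this returns the same answer on every run, so it can gate the factual part of a score while the judge model handles tone and completeness.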
4. Use a smaller, high-throughput model for LLM-as-judge
When multiple evaluators fire concurrently per session, a capable-but-limited-throughput model will silently drop scores under throttling — no errors, scores just don't appear. A faster, cheaper model handles the concurrency, and for judging structured rubrics the quality difference is negligible.
8. Conclusion
Building the agent was the easy part. The surprising difficulty was eval design. Writing a north star metric that genuinely reflects your business goal, not just something easy to score, takes real iteration. And once you have a north star, designing the supporting diagnostics that explain why it moves is just as hard. The built-in evaluators are a useful reference, but they're domain-agnostic by design. They will disagree with your north star at exactly the moments that matter most.
A few things I'd carry into the next project:
- Config bundles are the operational win: swapping prompts, tool descriptions, and model IDs in production behind a 50/50 split, with no container rebuild, turns each iteration into a configuration change rather than a deployment.
- The loop is the product: Baseline → recommend → A/B test → promote → repeat. Every step is already an API call returning structured data. There's nothing stopping an agent from evaluating itself, triggering a new recommendation when scores drop, and starting a test automatically. I'm not quite at fully self-driving agents yet, but the primitives are already here.
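The "trigger a new recommendation when scores drop" idea reduces to a small decision rule. The sketch below is one hypothetical version: the function name, the tolerance, and the minimum session count are all assumptions, and the recent scores would come from the online evaluation log group.

```python
from statistics import mean


def should_trigger_recommendation(
    recent_scores: list,
    baseline_mean: float,
    tolerance: float = 0.05,
    min_sessions: int = 10,
) -> bool:
    """True when enough recent sessions exist and their mean fell below baseline minus tolerance."""
    if len(recent_scores) < min_sessions:
        return False  # too few sessions to distinguish a drop from noise
    return mean(recent_scores) < baseline_mean - tolerance


if __name__ == "__main__":
    baseline = 0.835  # e.g. the promoted variant's north star score
    print(should_trigger_recommendation([0.5, 0.75] * 5, baseline))
    print(should_trigger_recommendation([0.75, 1.0] * 5, baseline))
```

Given the judge variance discussed above, the tolerance and minimum session count matter more than the comparison itself: set too tight, the loop would chase noise.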

