The Two Failure Modes That Break Your AI Data Agent

A Case Study in Agent Psychology

Posted by Craig Johnston on Thursday, February 26, 2026

What building an MCP-powered data platform taught me about why LLMs ignore the instructions you spent months writing.

You’ve built the architecture. A Model Context Protocol (MCP) server sits between the LLM and your data warehouse. You’ve written semantic descriptions for every table, tagged domains and glossary terms, and added column-level documentation. You’ve embedded routing instructions that tell the agent exactly which query engine to use for which workload.

You deploy it. You ask your first real question: a multi-year aggregation across a few hundred million rows. The agent reads your instructions, acknowledges them, explains them back to you accurately, and then does the exact opposite.

This is about why that happens and what actually fixes it.

The Setup

The data platform in question uses a dual-engine architecture common in modern analytics:

  • PostgreSQL holding the transactional source tables, views, and dimension tables. Good for joins, real-time lookups, and queries against small-to-medium tables.
  • OpenSearch holding time-partitioned analytical indices optimized for aggregations. Designed for the exact queries that bring PostgreSQL to its knees.

Both engines are exposed through a single query layer (Trino), so the agent can reach either one. The MCP server includes a data catalog with rich metadata. Every PostgreSQL fact table that has an OpenSearch equivalent is documented with instructions: “Use OpenSearch for analytics. Here’s the index name. Here’s the query pattern.”
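Concretely, both engines appear as catalogs behind the single Trino layer, so one SQL surface reaches either one. A hedged sketch with illustrative catalog, table, and index names (the exact raw_query signature depends on the connector version):

```sql
-- Transactional lookup: a good fit for the PostgreSQL side.
SELECT artist_name
FROM postgresql.public.dim_artist
WHERE artist_id = 42;

-- Analytical passthrough: raw OpenSearch query DSL sent through the
-- connector's raw_query table function.
SELECT *
FROM TABLE(opensearch.system.raw_query(
  schema => 'default',
  index  => 'streams-2025',
  query  => '{"query": {"match_all": {}}}'
))
LIMIT 10;
```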

The instructions were clear, and they were present at multiple layers: the platform documentation, the domain descriptions, the dataset descriptions, and the agent-level instructions.

It didn’t matter.

The Failure

The agent was asked to analyze multi-year streaming trends, a textbook OLAP workload. Aggregation across years of data, grouped by a handful of dimensions. The kind of query OpenSearch handles in under a second.

Here’s what happened:

  1. The agent called platform_info and read the full architecture documentation, including the OpenSearch-first routing policy.
  2. The agent searched the catalog and found both the PostgreSQL table and its OpenSearch equivalent.
  3. The agent explained the architecture to the user, accurately describing that the PostgreSQL table had 300M+ rows and that OpenSearch indices existed specifically for this kind of aggregation.
  4. The agent then wrote a SELECT ... GROUP BY query against the PostgreSQL table.
  5. The query timed out.
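The query the agent wrote had roughly the shape below (table and column names are illustrative, not from the actual deployment):

```sql
-- Perfectly reasonable SQL, pointed at the wrong engine: a full
-- aggregation scan over the 300M+ row PostgreSQL fact table.
SELECT platform,
       year(event_date)  AS yr,
       SUM(stream_count) AS total_streams
FROM postgresql.public.fact_streams
GROUP BY platform, year(event_date);
```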

The agent understood the instructions, could explain the reasoning back, and still wrote the wrong query. Comprehension and compliance turned out to be different problems.

The Diagnosis: Two Failure Modes

Digging into the logs and behavior patterns turned up two recurring issues.

Failure Mode 1: Recency Bias

When the agent inspected the PostgreSQL table schema, reading column names, types, and sample data, that inspection became the most recent thing in its working memory. The concrete schema details created a stronger pull than the abstract policy instruction it had read moments earlier.

The agent treated recency as authority. The last thing it saw became the most important thing it knew.

In data work, recency should only matter if new information actually overrides prior instructions through explicit reasoning. A table schema doesn’t override a routing instruction. The schema tells you how to execute the plan you already have. It doesn’t change the plan.

The distinction matters: there’s a difference between “new information that changes the plan” and “new context that supports execution of the existing plan.” A PostgreSQL table schema is the latter. The agent’s default behavior treats all new context as the former.

Failure Mode 2: Complexity Avoidance

The OLAP query pattern required:

  • A JSON aggregation DSL embedded in a SQL function call
  • CROSS JOIN UNNEST to extract nested aggregation buckets
  • .keyword suffix on text fields (an OpenSearch-specific requirement)
  • Double-casting of numeric results through DOUBLE before BIGINT (because OpenSearch returns scientific notation)

The PostgreSQL query required:

  • SELECT dimension, SUM(metric) FROM table GROUP BY dimension
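To make the friction gap concrete, here is what the two paths might look like side by side. This is a sketch: the index, field names, and the structure of the raw_query result are assumptions, and the exact shape depends on the connector.

```sql
-- PostgreSQL path: trivial to write, times out on 300M+ rows.
SELECT platform, SUM(stream_count) AS total_streams
FROM postgresql.public.fact_streams
GROUP BY platform;

-- OpenSearch path: every requirement from the list above in one query.
SELECT
  b.key AS platform,
  -- double-cast through DOUBLE to get past scientific notation
  CAST(CAST(b.total.value AS DOUBLE) AS BIGINT) AS total_streams
FROM TABLE(opensearch.system.raw_query(   -- JSON aggregation DSL in a SQL function call
  schema => 'default',
  index  => 'streams-*',
  -- note the .keyword suffix on the text field
  query  => '{"size":0,"aggs":{"by_platform":{"terms":{"field":"platform.keyword"},"aggs":{"total":{"sum":{"field":"stream_count"}}}}}}'
)) AS t
-- extract the nested aggregation buckets back into rows
CROSS JOIN UNNEST(t.aggregations.by_platform.buckets) AS x(b);
```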

The agent chose the path with less syntactic friction, even though it led to a timeout on 300 million rows. It optimized for fewer characters of SQL at the expense of correctness.

Complexity avoidance and simplicity are not the same thing. Simplicity in engineering comes from reducing a problem to its appropriate level of complexity, not from refusing to engage with that complexity. A query that runs in 900ms and returns correct results is simple, whatever it looks like. A query that times out after 60 seconds and returns nothing is not simple. It’s just wrong.

The agent wasn’t choosing simplicity. It was choosing familiarity. Standard SQL appears far more often in training data than OpenSearch aggregation patterns. Less common doesn’t mean more complex; it means less practiced.

Why Standard Fixes Didn’t Work

Several approaches failed before a solution landed.

Fix Attempt 1: Add “USE OPENSEARCH INSTEAD” to Table Descriptions

The first approach: append a note to each PostgreSQL table’s description: “USE OPENSEARCH INSTEAD: query os_index_name-YYYY via opensearch.system.raw_query().”

It didn’t work. The note appeared after the detailed table description. The agent read through the schema first: what the table contains, how it’s structured, what the columns mean. By the time it reached the routing note at the bottom, the PostgreSQL path was already the default. The routing instruction was metadata about metadata. It lost the salience competition.

Fix Attempt 2: Emphasize the Instruction in Agent-Level Documentation

Platform info and domain descriptions were updated to stress the OpenSearch-first policy. Instructions were present at three levels: platform, domain, and dataset.

Still didn’t work. More of the same information in more places doesn’t overcome recency bias. The agent read all of it and understood all of it. The problem was never comprehension. It was behavioral priority at the moment of query construction. When the agent shifts from “reading instructions” to “writing SQL,” the most recent concrete context wins.

Fix Attempt 3: Detailed Architecture Explanations

The agent was given documentation about why OpenSearch is faster: time-partitioned indices, columnar aggregation, no full table scans.

Also didn’t work. Knowing why a choice is correct doesn’t mean the agent will make that choice. It’s the same gap as knowing exercise is good for you versus actually going for a run. LLMs are pattern-completion systems strongly influenced by what’s most salient in their immediate context, not reasoning engines that re-derive behavior from first principles at every step.

What Actually Worked

The fix required changes at three layers simultaneously, all aimed at the same goal: the correct path needs to be the most salient thing in context at the moment the agent writes a query.

Layer 1: STOP-First Ordering in Table Descriptions

Every PostgreSQL fact table with an OpenSearch equivalent was rewritten so the very first line is:

STOP: DO NOT QUERY THIS TABLE FOR ANALYTICS.

Followed immediately by the exact alternative:

QUERY THIS INSTEAD: opensearch.default.os_index_name-YYYY

The original table description was pushed below a separator. When the agent inspects the table, the first thing it reads is the redirect. The STOP notice occupies the recency slot that previously belonged to the schema details.

If the agent anchors on whatever it read most recently, make sure the most recently read thing is the routing instruction.
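Where the catalog reads descriptions from database comments, the rewrite can be as mechanical as the following (table and index names illustrative; this assumes the catalog surfaces table comments to the agent):

```sql
-- STOP-first: the redirect occupies the first lines the agent reads,
-- and the original description is pushed below a separator.
COMMENT ON TABLE postgresql.public.fact_streams IS
'STOP: DO NOT QUERY THIS TABLE FOR ANALYTICS.
QUERY THIS INSTEAD: opensearch.default.streams-YYYY via opensearch.system.raw_query().
---
Streaming fact table, one row per play event. 300M+ rows. Columns: ...';
```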

Layer 2: Curated Queries Attached to PostgreSQL Tables

Pre-written, tested OpenSearch queries were attached directly to the PostgreSQL tables using the catalog’s curated query feature. These cover the common patterns: totals by dimension, year-over-year trends, segment comparisons.

This sidesteps complexity avoidance by removing the complexity. The agent doesn’t need to construct an OpenSearch raw_query() from scratch with all its syntactic requirements. It retrieves a working query and adapts it. The easy path is now the correct path.

Notably, the curated queries are attached to the PostgreSQL tables, not the OpenSearch indices. That’s because the agent’s discovery path starts at the PostgreSQL table. Attaching the queries there intercepts the agent at the point of decision.
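A curated entry for the year-over-year pattern might look like this (a sketch: the field names, index pattern, and date_histogram shape are assumptions):

```sql
-- Curated query: yearly totals, pre-tested, attached to the PostgreSQL
-- fact table's catalog entry so the agent finds it at the point of decision.
SELECT
  y.key_as_string AS year,
  CAST(CAST(y.total.value AS DOUBLE) AS BIGINT) AS total_streams
FROM TABLE(opensearch.system.raw_query(
  schema => 'default',
  index  => 'streams-*',
  query  => '{"size":0,"aggs":{"by_year":{"date_histogram":{"field":"event_date","calendar_interval":"year"},"aggs":{"total":{"sum":{"field":"stream_count"}}}}}}'
)) AS t
CROSS JOIN UNNEST(t.aggregations.by_year.buckets) AS u(y);
```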

Layer 3: A Decision Tree Instead of a Policy

The agent instructions were rewritten from prose to a procedural decision tree:

  1. Is this table in the PostgreSQL-to-OpenSearch routing table? YES: STOP. Use OpenSearch.
  2. Does a curated query exist for this pattern? YES: Use it directly.
  3. No curated query? Write OpenSearch SQL using the provided boilerplate template.

The routing table is an explicit lookup, not a policy to interpret. The agent’s job goes from “evaluate which engine is appropriate” to “check whether this table name is in the list.”

Policy interpretation is where both failure modes take hold. Recency and complexity avoidance operate during ambiguous decisions. A lookup table removes the ambiguity.
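The routing table itself can be an ordinary table the agent queries, rather than prose it interprets. A hypothetical schema:

```sql
-- One row per PostgreSQL table with an OpenSearch equivalent.
-- Step 1 of the decision tree becomes a literal lookup:
SELECT os_index, curated_query_id
FROM catalog.routing.pg_to_os
WHERE pg_table = 'postgresql.public.fact_streams';
-- A hit means STOP: use the listed index (and curated query, if any).
```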

The Token Budget: How Much Instruction Is Enough?

Agent instructions compete for context window space with conversation history, tool results, and catalog metadata. Beyond a certain point, more instructions make things worse: instruction density drops and the model has to do more work to find the relevant guidance.

Based on this deployment, here are the practical ranges:

  • Under 500 tokens: Too thin. Behavior will be underspecified and inconsistent.
  • 1,000-3,000 tokens: Sweet spot. Specific enough to constrain behavior, short enough to retain throughout a conversation.
  • 3,000-6,000 tokens: Acceptable for complex domains, but starts competing with conversation context in longer sessions.
  • Over 6,000 tokens: Diminishing returns. Instructions get lost, which ironically increases recency bias because the model can’t hold the full instruction set in working memory.

The rewritten instructions for this deployment landed at roughly 2,000 tokens.

Token count is only half the picture. Structure matters as much as length. A 2,000-token decision tree with a lookup table delivers more behavioral constraint per token than 2,000 tokens of prose. Decision trees give the model a procedure to follow rather than a policy to interpret. Procedures get followed more reliably.

Broader Implications

These patterns aren’t specific to data platforms. They show up anywhere agent instructions compete with live context.

Recency bias appears in any agent that inspects something before acting on it. Code review agents that read a file before applying style rules. Research agents that pull source material before applying synthesis instructions. Any workflow where “gather context” comes before “follow instructions” is vulnerable.

Complexity avoidance shows up wherever the correct action requires an unfamiliar pattern. Agents writing infrastructure code will reach for familiar syntax over correct-but-uncommon constructs. Agents generating API calls will default to REST over GraphQL even when GraphQL fits the query shape better.

The underlying issue: LLM instruction-following isn’t a comprehension problem. It’s a salience problem. The model understands your instructions and can explain them back to you. What matters is whether your instructions are the most salient thing in context at the moment a decision gets made. If something more concrete, more recent, or more familiar is occupying that slot, your instructions lose.

Principles for MCP Developers

If you’re building MCP servers or agent tooling for data platforms, here’s what carries forward:

1. Put the redirect where the agent looks, not where it makes sense to you. The STOP notice belongs on the PostgreSQL table description, not in a separate routing document. The agent inspects the PostgreSQL table, so that’s where it needs to be intercepted.

2. Attach solutions to the point of failure. Curated queries on the PostgreSQL table, not on the OpenSearch index, because the agent’s discovery path leads to the PostgreSQL table first.

3. Replace policy with procedure. “Prefer OpenSearch for analytical workloads” is a policy. A routing lookup table is a procedure. LLMs follow procedures more reliably than they interpret policies.

4. Fight recency with recency. If the agent anchors on whatever it read most recently, engineer the system so the most recently read thing is the routing instruction, not the schema detail.

5. Make the easy path the right path. If the correct path is syntactically harder, provide templates and curated examples that close the complexity gap. The agent will take the path of least resistance; make sure that path leads somewhere correct.

6. Test with the actual failure scenario. These instructions worked fine when I asked the agent to explain the architecture. They failed when I asked it to query data. The test needs to be the real workload, not a comprehension check.


This case study is based on a production MCP server development project for a media analytics data warehouse. Implementation details have been generalized to protect client confidentiality.

Note: This blog is a collection of personal notes. Making them public encourages me to think beyond the limited scope of the current problem I'm trying to solve or concept I'm implementing, and hopefully provides something useful to my team and others.

This blog post, titled: "The Two Failure Modes That Break Your AI Data Agent: A Case Study in Agent Psychology" by Craig Johnston, is licensed under a Creative Commons Attribution 4.0 International License.