The Hard Lessons Behind Production Agentic AI

I have spent a lot of time building agentic AI systems that actually have to work in production. Not demos. Not playgrounds. Real systems that take actions, touch real workflows, and are expected to keep going when something breaks.

That experience taught me something quickly. The hard part is not getting the model to sound intelligent. The hard part is building the system around it so it stays useful when reality gets in the way.

The gap between a demo that wows a room and a system a real person depends on is enormous, and almost none of that gap is about the model. It is about everything you have to put around the model so it behaves when the conditions are not clean.

Here are some of the biggest lessons.

I kept blaming the model. It was almost never the model

Early on, I had a rule that mattered. A hard limit that the system should never cross, because crossing it had real consequences for someone. I did the obvious thing. I wrote it clearly into the prompt, told the model never to go past it, and moved on. I could see the instruction sitting right there in the text, so I felt safe.

The system crossed it anyway.

The action went out, the limit was broken, and when I went back to look, the prompt was exactly as I had written it. The model had read the rule, considered it, and was free to ignore it, because nothing in the system actually stopped the action from happening. That was the whole problem. A rule that lives only in a prompt is not a rule. It is a suggestion, and the model gets a vote.

So I moved every rule that genuinely mattered out of the prompt and into the code, at a fixed checkpoint the action has to pass through, where the model gets no vote at all. A prompt can ask for something. The code is the only thing that actually stops it from happening.

The deeper lesson underneath that is this. The hardest part of building a good agent is not the AI, it is understanding the thing you are automating well enough to know which numbers are sacred and which are flexible. You cannot tell the difference unless you understand the domain like you will personally answer for it. Because in production, you will.

My retrieval got better when I stopped trying to be clever

My first serious retrieval system gave answers that were plausible and subtly wrong. Not wrong enough to obviously fail, which is worse. It would pull information that sounded related, reason confidently over it, and land somewhere slightly off. Chasing it was miserable, because the model was doing its job. The problem was upstream, in what I was feeding it.

Two things were broken. My chunks were too big and they were not self contained, so a chunk would only make sense alongside the chunks around it, but retrieval pulls things in isolation. And everything lived in one undifferentiated pile, so unrelated text that happened to look similar kept winning the small number of retrieval slots and crowding out the chunks that actually mattered.

The fixes were boring and they worked. I made each chunk stand on its own, so it means something even when it shows up alone, the way a good paragraph in a textbook makes sense even if you opened to a random page. And I tagged chunks by the domains they actually belong to, moderately, so I could narrow the field by topic before the similarity search even ran.

That second part is the one people miss. Retrieval is a competition for a few seats. Every irrelevant chunk you allow into the running is a seat stolen from a relevant one. Honest tagging keeps the wrong chunks out of the race entirely, so the right ones win. A smaller, disciplined knowledge base beats a giant one you dumped in and hoped over, every single time.

Machines are still bad at the quick glance

I wanted an agent to form the kind of quick read a person forms in a glance. Open someone's page, scroll for ten seconds, and you already have a real opinion. You know roughly what they sell, who they are talking to, whether it looks expensive or cheap, whether they are serious. You did not study anything. Your eye skimmed dozens of images at once and pulled signal out of all of them, almost for free, while half thinking about something else.

I could not give the agent that ten seconds without it getting expensive fast.

The system either worked from text alone, which throws away most of what a page actually communicates, or it had to look at the images one at a time. And looking, for a machine, is nothing like looking for you. Every image is a heavy call. There are real limits on how many you can pass at once, on cost, and on time. The casual scroll your eye did becomes, for an agent, a long line of expensive operations over a fraction of the data, and it still ends up understanding less than you did while standing in a queue.

What I changed was the pretending. I stopped designing as if the agent perceives the way a person perceives, and started deciding deliberately what is actually worth the cost of looking at, and what to infer from cheaper signals like text. Ignore that gap and you end up either broke from vision costs or blind to half of reality. The human eye is one of the most underrated pieces of infrastructure there is. You only notice how good it was when you try to replace it.

I thought resilience was a retry. It is not

When I started, I thought resilience meant wrapping a call in a try and except and adding a retry. Then production taught me, repeatedly, that resilience is not a thing you add in one place. It is a posture you have to hold across the whole system, and it fights you the entire way. A few of the scars.

An unbounded loop driven by a model will loop. I let an agent decide on its own when it was done with a step and did not put a hard limit around it. It got into a state where it kept doing the same thing, politely, forever, quietly burning money the whole time. The fix is unromantic. Put a counter on every loop and bound it. Not might loop. Will loop.

A reasoning model handed me nothing and called it a success. The newer models that think before they answer split their budget between the hidden thinking and the visible answer. Give them too little, the thinking eats all of it, and you get back an empty response with a perfectly healthy status code. My code treated empty but successful as a valid answer, so it shipped a silent failure straight to a user. Now empty, truncated, and refused responses are treated as real, typed errors, never as blank strings to shrug at.

The system crashed in the middle of an action that costs money. The danger was not the crash. It was the resume. A naive resume does the entire action again from the top and charges twice. The fix is to make every action that matters carry a key, so that resuming after a crash picks up the work already done instead of doing it a second time.

And around all of it, every call that leaves the system now has a timeout so a hanging dependency cannot freeze a whole run, backoff so I am not hammering a service that is already down, and a circuit breaker so that when something downstream is clearly dead, the system stops throwing requests at it instead of piling up a thousand timeouts. The first time a breaker held a run together through a dead dependency, I finally understood why people bother.

None of these are clever. That is the point. Resilience is a hundred unglamorous decisions like these, and you only find out you got one wrong when production shows you, in front of a real user.

Making it recompute everything was a mistake

My first version worked everything out from scratch on every single run. Fresh understanding, fresh analysis, fresh everything, every time. It was simple to reason about, and it was slow, expensive, and oddly less accurate, because it threw away everything it had just learned a moment earlier.

Then it did something worse. A routine refresh of its stored understanding overwrote knowledge that was backed by real evidence with a weaker, fresher guess. The system had been right, and the act of trying to keep it current made it wrong. That was the moment I understood the actual problem.

Some things change slowly. Who you are dealing with, what the situation around them looks like, the shape of the thing you are working with. There is no reason to rebuild that from zero on every run. You build it once, store it, and enrich it in place as new evidence arrives, so it compounds and gets sharper over time. Other things are specific to this exact moment and have to be computed fresh. The whole art is drawing that line correctly, then putting a freshness policy on the slow moving side so you only recompute it when it is genuinely stale, or enough new evidence has arrived to be worth it, or someone explicitly asks.

Two rules kept that from turning into a mess. Updates must never make the system dumber, so a refresh can never lower the confidence or coverage you already had, and something you actually observed always outranks something you merely guessed. And progress has to survive a crash, so the system saves its state after each meaningful step and resumes from the last good point instead of starting over. Work that is already finished stays finished, and what the system knows only moves forward, never backward.

A system that remembers, compounds what it learns, and only recalculates what truly needs it feels completely different to use. It gets faster and sharper over time instead of paying full price on every interaction. The longer I do this, the more I think the useful kind of intelligence here is less about how much you can compute and more about knowing what you do not need to compute again.

What this actually comes down to

When I look back at all of it, the thing that stands out is how little of it was about the model. Most of it was about control, judgment, and the architecture around the model.

Production agentic AI is not hard because the models are weak. The models are remarkable. It is hard because the moment you let a system act on its own in the real world, every shortcut you took comes back with interest. The rule you left in a prompt. The chunk you never cleaned up. The loop you never bounded. The knowledge you kept throwing away. None of it shows up in a demo. All of it shows up in production.

The people who do this well are not the ones with the cleverest prompts. They are the ones who took the unglamorous parts seriously, because they understood that the model was never the hard part. The system around it was.

That is most of what this work has taught me. If you are building in this space, I would genuinely like to know which of these you have hit yourself, and which ones you would argue with. Some of my strongest opinions here started as someone else's comment. So tell me where I am wrong.

The Hard Lessons Behind Production Agentic AI

I kept blaming the model. It was almost never the model

My retrieval got better when I stopped trying to be clever

Machines are still bad at the quick glance

I thought resilience was a retry. It is not

Making it recompute everything was a mistake

What this actually comes down to

Comments

The Hard Lessons Behind Production AI

More from this blog

Unable To Access Environment Variables in Firebase Application After Deploying

Passing But Not Accessing cookie in firebase?

Command Palette

I kept blaming the model. It was almost never the model

My retrieval got better when I stopped trying to be clever

Machines are still bad at the quick glance

I thought resilience was a retry. It is not

Making it recompute everything was a mistake

What this actually comes down to

Comments

The Hard Lessons Behind Production AI

More from this blog