Why Vibe Coding Falls Apart in Production (And What Agentic Engineering Does Differently)
The difference between author and orchestrator isn't workflow polish — it's a different job. A payment-webhook race condition taught me which one I'd actually been doing.
The production incident that made me stop vibe coding
It started with a payment webhook handler. I described the behavior I wanted to Claude, watched it write the code, skimmed it quickly, and pushed. Local tests passed. CI passed. Two days after deploy, a user reported they'd been charged twice for a subscription renewal.
The bug was subtle — a race condition in the idempotency check that only surfaced under specific timing conditions. The AI had written valid-looking code. It just hadn't modeled the edge case because I hadn't told it to, and I hadn't read carefully enough to notice the gap.
That's not a dramatic story. It's just the kind of thing that happens when you delegate critical logic without thinking clearly about what you're delegating. And apparently it's common: research shows AI co-authored code contains 1.7× more major issues than human-written code, and around 45% of AI-generated samples include OWASP Top-10 vulnerabilities [1]. I'd been treating a code generator like a senior engineer I could trust completely. Those aren't the same thing.
The knee-jerk response would have been to stop using AI coding tools entirely. I'd written about that temptation before. But stepping back further, I realized the problem wasn't the AI — it was my workflow.
I was vibe coding. And vibe coding, as it turns out, doesn't scale to production systems.
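For readers who haven't hit this class of bug: the sketch below shows one common shape an idempotency race can take, and one common way to close it. It is an illustration under assumptions, not the handler from the incident. The `processed_events` table, the `charge` callback, and both function names are hypothetical, and SQLite stands in for whatever database a real service would use.

```python
import sqlite3
from typing import Callable


def make_db() -> sqlite3.Connection:
    """In-memory database with a hypothetical table of processed webhook events."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    return conn


def handle_racy(conn: sqlite3.Connection, event_id: str,
                charge: Callable[[str], None]) -> bool:
    """Check-then-act: looks valid, but leaves a gap between the check and the write.

    Two concurrent deliveries of the same event can both run the SELECT
    before either one reaches the INSERT, so both go on to call charge().
    """
    seen = conn.execute(
        "SELECT 1 FROM processed_events WHERE event_id = ?", (event_id,)
    ).fetchone()
    if seen:
        return False
    charge(event_id)
    conn.execute("INSERT INTO processed_events (event_id) VALUES (?)", (event_id,))
    conn.commit()
    return True


def handle_atomic(conn: sqlite3.Connection, event_id: str,
                  charge: Callable[[str], None]) -> bool:
    """Claim first: the primary-key constraint makes the INSERT itself the check."""
    try:
        with conn:  # commits on success, rolls back if the INSERT fails
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
    except sqlite3.IntegrityError:
        return False  # another delivery already claimed this event
    charge(event_id)
    return True
```

The second version trades one problem for a smaller one: if `charge` fails after the claim is committed, the event is marked as processed without a charge, so a real handler would likely also need a status column or a retry path. That is exactly the kind of decision that belongs in the spec rather than being left for the model to guess.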
Agentic engineering is not "disciplined vibe coding"
When I first heard Karpathy's term "agentic engineering," I assumed it was just vibe coding with more careful review. Same process, tighter guardrails. Wrong.
The actual distinction is a different mental model for the developer's role. In vibe coding, you're the author — you describe what you want, the AI writes it, you check it and ship it. You're still in the code, just with AI assistance. In agentic engineering, you're the orchestrator. The AI is running the workflow. You're defining the spec, setting the guardrails, and reviewing at decision points — not watching every line get written.
Karpathy put it clearly at Sequoia Ascent earlier this year: "the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight" [2]. The unit of work shifts from "write this function" to "here's the requirements doc, here's the test suite, run until green and report back."
That shift sounds minor. It isn't. When you're the orchestrator, the things that matter change. Clear task decomposition matters more than prompt crafting. Context management becomes a skill in itself. And review gates at meaningful checkpoints replace the habit of watching every edit scroll by.
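To make the "run until green and report back" unit of work concrete, here is a minimal sketch of the control loop it implies. Everything in it is an assumption for illustration: `agent_step` and `run_tests` are hypothetical callables standing in for whatever agent and test runner you use, and the attempt cap is an arbitrary default.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RunReport:
    green: bool                 # True when no tests are failing
    attempts: int               # how many times the agent was asked to fix things
    failures: List[str] = field(default_factory=list)  # tests still failing at the end


def run_until_green(agent_step: Callable[[List[str]], None],
                    run_tests: Callable[[], List[str]],
                    max_attempts: int = 5) -> RunReport:
    """Let the agent iterate against the test suite, then report back.

    run_tests returns the names of failing tests (empty list means green).
    agent_step receives those names and is expected to change the code.
    The loop is bounded so a stuck agent surfaces as a report, not a hang.
    """
    failures = run_tests()
    attempts = 0
    while failures and attempts < max_attempts:
        agent_step(failures)
        attempts += 1
        failures = run_tests()
    return RunReport(green=not failures, attempts=attempts, failures=failures)
```

The orchestrator's work sits outside this loop: writing the requirements, choosing the tests that define "green", and reading the report at the end.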
I'd been building the wrong muscle. Vibe coding is fast but shallow. Agentic engineering is where the real leverage is — but it requires thinking differently about what the AI is actually doing for you.
Here's a concrete way to feel the difference: in vibe coding, you notice when something looks wrong and intervene mid-generation. In agentic engineering, you write a spec document before the agent even starts, then you review the output against that document at a checkpoint. One mode keeps you embedded in the code. The other puts you one level above it. Neither is wrong for all situations — but for anything that touches production, being one level above is safer and, eventually, faster.
Three workflow changes that actually shifted my output
CLAUDE.md as the source of truth
The first thing I did was create a proper CLAUDE.md file in every project. Not just a list of style preferences — an actual project document: architecture decisions, data model conventions, which patterns we use for API error handling, the naming conventions that matter, what tests need to pass before anything merges.
Previously, I was re-explaining context at the start of every session. "We use PostgreSQL, our auth is JWT-based, don't modify the schema directly, use the migration helper..." That repetition wasted tokens, and the explanation itself occasionally got truncated. More importantly, when I started a new session after a few days away, I'd sometimes forget to mention something critical and the agent would wander off in a direction I didn't intend.
Once CLAUDE.md had the full picture, the agents stayed on track without my constant correction. The file became the single source of truth for how the project works. It does require maintenance — when you make an architectural decision, you need to update the doc. But that overhead is smaller than the alternative.
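One way to keep that maintenance honest is a small check that fails when the file loses a section the project depends on. The sketch below is a suggestion, not part of any tool. The section names are hypothetical and simply mirror the topics listed above, so rename them to match your own file.

```python
import re
from typing import List

# Hypothetical section names, mirroring the topics a project file might cover.
REQUIRED_SECTIONS = [
    "Architecture",
    "Data model",
    "API error handling",
    "Naming",
    "Merge requirements",
]

# Matches Markdown ATX headings such as "## Architecture".
HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(markdown_text: str,
                     required: List[str] = REQUIRED_SECTIONS) -> List[str]:
    """Return the required section names that have no matching heading."""
    headings = {m.group(1).lower() for m in HEADING_RE.finditer(markdown_text)}
    return [name for name in required if name.lower() not in headings]
```

Reading `CLAUDE.md` from disk and passing its text to `missing_sections` in CI turns "remember to update the doc" into a failing check. It only confirms that the sections exist, not that their content is current.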
Context management: load less, delegate more
My old approach: dump the whole codebase into context and hope the AI would find what it needed. After hitting context window limits repeatedly — and reading some painful lessons about bigger context windows not always meaning better results — I shifted to just-in-time loading.
The principle: give the agent lightweight references (file paths, function names, test file locations) and let it pull what it needs via tools rather than pre-loading everything. When an agent reads a 5,000-line log file, don't leave the whole thing in context: extract the relevant parts and drop the original. Anthropic's engineering team has written about this approach, and the reduction in context volume can be dramatic, with reported reductions of up to roughly 90% at equivalent quality [3].
Practically, this means writing cleaner task descriptions and trusting the agent to navigate rather than doing the navigation for it. It felt like giving up control at first. The results were better.
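The "extract the relevant parts, drop the original" step can be as simple as a filter that keeps matching lines plus a little surrounding context. This is a minimal sketch with assumed names (`extract_relevant`, the `context` parameter), not a description of how any particular agent framework does it.

```python
import re
from typing import List


def extract_relevant(lines: List[str], pattern: str, context: int = 2) -> List[str]:
    """Keep lines matching `pattern` plus `context` lines on either side.

    Each kept line is prefixed with its 1-based line number so the agent can
    still point back into the original file if it needs more.
    """
    rx = re.compile(pattern)
    keep = set()
    for i, line in enumerate(lines):
        if rx.search(line):
            start = max(0, i - context)
            stop = min(len(lines), i + context + 1)
            keep.update(range(start, stop))
    return [f"{i + 1}: {lines[i]}" for i in sorted(keep)]
```

For a 5,000-line log with a single error, `extract_relevant(log_lines, r"ERROR", context=1)` hands back three lines instead of the whole file. How much this saves in practice depends entirely on the log and the pattern.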
Review gates instead of constant checking
This one changed my focus the most. I used to watch agent output as it came in — ready to intervene the moment something looked off. That's basically the vibe coding mindset applied to agentic tasks, and it's exhausting. It also defeats the purpose of delegation.
Now I define review checkpoints in advance: before any database migration runs, before a PR is opened, before any external API call gets implemented. Between checkpoints, the agent runs. I do something else. When it reaches a checkpoint, I review the state, decide whether to continue or correct, and then it keeps going.
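Defining the checkpoints in advance can be sketched as data plus a small runner. The action names, the `Step` type, and the `approve` callback below are all hypothetical. Real agent tools expose their own permission and approval mechanisms, so treat this as a picture of the idea rather than an integration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical action names for the three gates described above.
GATED_ACTIONS = {"db_migration", "open_pr", "external_api_call"}


@dataclass
class Step:
    action: str        # what kind of step the agent wants to take
    description: str   # human-readable summary shown at the gate


def run_with_gates(steps: Iterable[Step],
                   execute: Callable[[Step], None],
                   approve: Callable[[Step], bool]) -> List[str]:
    """Run steps freely, pausing for approval only at gated actions.

    Ungated steps run without interruption. A gated step runs only if
    approve() returns True; otherwise the run stops before that step.
    """
    log: List[str] = []
    for step in steps:
        if step.action in GATED_ACTIONS and not approve(step):
            log.append(f"stopped before {step.action}: {step.description}")
            break
        execute(step)
        log.append(f"ran {step.action}: {step.description}")
    return log
```

The point of the structure is that `approve` is the only place a human has to be present. Everything between gates runs unattended.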
- ✓ Set checkpoints before the agent starts (pre-migration, pre-PR, pre-external-API)
- ✓ Step away between gates — work on something else
- ✓ Review the state, not the line-by-line generation
- ✓ Treat CLAUDE.md updates as part of the change
- ✗ Watch every line and interrupt mid-generation (surveillance mode)
- ✗ Re-explain project context at every session start
- ✗ Pre-dump the whole codebase "just in case"
- ✗ Skip code review because the gates exist
The vibe coding fatigue I'd noticed before — that constant low-grade attention load of monitoring AI output — mostly came from this pattern of continuous supervision. I'd written about that fatigue without really identifying the cause. The cause was supervision mode, not agentic mode. Once I stopped watching and started reviewing at gates, the cognitive load dropped.
What I thought would improve — but didn't
I expected the speed gains to be dramatic. They weren't. My throughput went up, but not by some 10× multiplier. The gains were real and consistent — maybe 40% faster on well-defined tasks, less on anything requiring domain judgment. The 10× figures floating around assume you're working on greenfield projects with simple requirements. Production codebases with years of technical decisions baked in don't compress that neatly.
I also expected trust in AI-generated code to become a non-issue once I had better review gates. It didn't. I still read every piece of code before it ships. I'm more systematic about it, since the gates create structured review moments, but the review itself isn't shorter. The roughly 45% of AI-generated samples that contain OWASP Top-10 vulnerabilities [1] doesn't drop to zero because your workflow improved. It just stops being a surprise.
The biggest overhead I underestimated was CLAUDE.md maintenance. Every time a pattern changes, every time a dependency gets added, every time there's an architectural shift — the doc needs updating or the next agent session will drift. It's not complicated work, but it's work that didn't exist before. Think of it as writing for your future AI collaborators the same way you'd write good documentation for a new team member.
Developer trust in AI-generated code industry-wide has actually dropped, from 43% in 2024 to 33% this year [1]. Agentic engineering improves the structure around AI, not the fundamental reliability of AI output. The tools are getting better. The judgment still has to come from somewhere else.
The trade you actually want
All that said: the billing bug incident hasn't repeated. The review gates catch edge cases before they ship. The workflow takes more upfront thought, but the downstream cost of mistakes dropped significantly. That's the trade-off. Not faster, looser development — more structured development where the AI does more of the repetitive work and the engineer does more of the architectural thinking.
That's closer to what I wanted when I started using AI coding tools. It just took a production incident to get there.
If you're still in pure vibe coding mode and it's working — for prototypes, personal projects, throwaway scripts — that's fine. Vibe coding isn't bad. It's just the wrong tool when the stakes go up. Agentic engineering is what fills that gap, and it's worth building the habit before the incident rather than after.
Build the habit before the incident, not after. The cost of switching modes after a production failure is much higher than the cost of switching before one.
- [1] Stack Overflow Developer Survey 2026 and recent AI-coding quality research — AI co-authored code shows 1.7× more major issues; ~45% of generated samples contain OWASP Top-10 vulnerabilities; developer trust dropped from 43% (2024) to 33% (2026).
- [2] Andrej Karpathy, Sequoia Capital Ascent conference, 2026 — remarks on the orchestrator role and "99% not writing code directly."
- [3] Anthropic engineering posts on context management and just-in-time loading — reported volume reductions up to ~90% with equivalent task quality.
- [4] Related IX Works posts: AI coding tools and productivity · Context window lessons · Vibe coding fatigue.