Why Vibe Coding Falls Apart in Production (And What Agentic Engineering Does Differently)

    01 / THE INCIDENT
  

The production incident that made me stop vibe coding

It started with a payment webhook handler. I described the behavior I wanted to Claude, watched it write the code, skimmed it quickly, and pushed. Local tests passed. CI passed. Two days after deploy, a user reported they'd been charged twice for a subscription renewal.

The bug was subtle — a race condition in the idempotency check that only surfaced under specific timing conditions. The AI had written valid-looking code. It just hadn't modeled the edge case because I hadn't told it to, and I hadn't read carefully enough to notice the gap.
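The original handler isn't reproduced here, so what follows is only a minimal sketch of this class of bug and one common fix. It assumes a SQLite table of processed event IDs; the table name, `claim_event`, `handle_renewal`, and the `charge` callable are all hypothetical. The idea is to let a uniqueness constraint decide who gets to process an event, instead of checking first and writing afterwards.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    """In-memory table of processed webhook event IDs (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    return conn


def claim_event(conn: sqlite3.Connection, event_id: str) -> bool:
    """Atomically claim an event. Returns True only for the first caller.

    A check-then-act version (SELECT, then INSERT if missing) leaves a window
    in which two concurrent deliveries can both see "not processed yet".
    Letting the PRIMARY KEY constraint reject the second INSERT closes it.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def handle_renewal(conn: sqlite3.Connection, event_id: str, charge) -> bool:
    """Run `charge` at most once per event_id (`charge` is a hypothetical callable)."""
    if not claim_event(conn, event_id):
        return False  # duplicate delivery: skip the side effect
    charge()
    return True
```

Calling `handle_renewal` twice with the same event ID runs `charge` once and returns `False` the second time. One trade-off to be aware of: claiming before charging means a failed charge leaves the event marked as processed, so a real handler would also need a retry or status column, which this sketch leaves out.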

That's not a dramatic story. It's just the kind of thing that happens when you delegate critical logic without thinking clearly about what you're delegating. And apparently it's common: the research cited in [1] reports that AI co-authored code contains 1.7× more major issues than human-written code, and that around 45% of AI-generated samples include OWASP Top-10 vulnerabilities. I'd been treating a code generator like a senior engineer I could trust completely. Those aren't the same thing.

        FIG. 01 — THE NUMBERS THAT REFRAMED MY WORKFLOW
      

1.7× more major issues in AI co-authored code vs human-written
45% of AI-generated samples contain OWASP Top-10 vulnerabilities
33% of developers trust AI-generated code in 2026 (down from 43% in 2024)

      Quality and trust signals from recent AI-coding research — context for the gap I missed.
    

The knee-jerk response would have been to stop using AI coding tools entirely. I'd written about that temptation before. But stepping back further, I realized the problem wasn't the AI — it was my workflow.

I was vibe coding. And vibe coding, as it turns out, doesn't scale to production systems.

    02 / THE DISTINCTION
  

Agentic engineering is not "disciplined vibe coding"

When I first heard Karpathy's term "agentic engineering," I assumed it was just vibe coding with more careful review. Same process, tighter guardrails. Wrong.

The actual distinction is a different mental model for the developer's role. In vibe coding, you're the author — you describe what you want, the AI writes it, you check it and ship it. You're still in the code, just with AI assistance. In agentic engineering, you're the orchestrator. The AI is running the workflow. You're defining the spec, setting the guardrails, and reviewing at decision points — not watching every line get written.

      FIG. 02 — TWO ROLES, TWO LEVELS OF ABSTRACTION
    

MODE A: Vibe coding
You're the author. Still in the code, with AI helping.
UNIT OF WORK: "Write this function"
REVIEW PATTERN: Watch every line, intervene mid-generation
SKILL FOCUS: Prompt crafting
FITS WHEN: Prototypes · scripts · throwaway code

MODE B: Agentic engineering
You're the orchestrator. One level above the code.
UNIT OF WORK: "Run against this spec + tests, report"
REVIEW PATTERN: Checkpoints at decision gates
SKILL FOCUS: Task decomposition · context management
FITS WHEN: Production systems · multi-step changes

      Author vs orchestrator — the abstraction level that determines what your day looks like.
    

Karpathy put it clearly at Sequoia Ascent earlier this year: "the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight" [2]. The unit of work shifts from "write this function" to "here's the requirements doc, here's the test suite, run until green and report back."

That shift sounds minor. It isn't. When you're the orchestrator, the things that matter change. Clear task decomposition matters more than prompt crafting. Context management becomes a skill in itself. And review gates at meaningful checkpoints replace the habit of watching every edit scroll by.


I'd been building the wrong muscle. Vibe coding is fast but shallow. Agentic engineering is where the real leverage is — but it requires thinking differently about what the AI is actually doing for you.

Here's a concrete way to feel the difference: in vibe coding, you notice when something looks wrong and intervene mid-generation. In agentic engineering, you write a spec document before the agent even starts, then you review the output against that document at a checkpoint. One mode keeps you embedded in the code. The other puts you one level above it. Neither is wrong for all situations — but for anything that touches production, being one level above is safer and, eventually, faster.

    03 / WHAT CHANGED
  

Three workflow changes that actually shifted my output

      FIG. 04 — THE THREE SHIFTS, IN ORDER OF IMPACT
    

01 Single source of truth: CLAUDE.md holds architecture decisions, conventions, and must-pass tests, so agents stop drifting.
02 Just-in-time context: Lightweight references over preload. Agent fetches what it needs; up to ~90% less context volume.
03 Review gates, not surveillance: Defined checkpoints replace continuous monitoring. Cognitive load drops.

      Sequence matters — the doc precedes context strategy, which precedes review discipline.
    

CLAUDE.md as the source of truth

The first thing I did was create a proper CLAUDE.md file in every project. Not just a list of style preferences — an actual project document: architecture decisions, data model conventions, which patterns we use for API error handling, the naming conventions that matter, what tests need to pass before anything merges.

Previously, I was re-explaining context at the start of every session. "We use PostgreSQL, our auth is JWT-based, don't modify the schema directly, use the migration helper..." That repetition wasted tokens, and the explanation itself occasionally got truncated. More importantly, when I started a new session after a few days away, I'd sometimes forget to mention something critical and the agent would wander off in a direction I didn't intend.

Once CLAUDE.md had the full picture, the agents stayed on track without my constant correction. The file became the single source of truth for how the project works. It does require maintenance — when you make an architectural decision, you need to update the doc. But that overhead is smaller than the alternative.
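Since the doc only helps while it stays complete, one way to keep it honest is a small check that fails when an expected section goes missing. This is a sketch under assumptions: it treats CLAUDE.md as Markdown with `#`-style headings, and the section names in `REQUIRED_SECTIONS` are hypothetical stand-ins for whatever your project actually needs.

```python
import re

# Hypothetical section names, loosely mirroring the topics listed above.
REQUIRED_SECTIONS = (
    "Architecture",
    "Data model",
    "Error handling",
    "Naming",
    "Required tests",
)

# Matches Markdown ATX headings such as "## Architecture".
_HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", flags=re.MULTILINE)


def missing_sections(doc_text: str, required=REQUIRED_SECTIONS) -> list:
    """Return the required section names that have no matching heading.

    Comparison is case-insensitive and expects the heading text to equal
    the section name exactly.
    """
    headings = {m.group(1).lower() for m in _HEADING.finditer(doc_text)}
    return [name for name in required if name.lower() not in headings]
```

Run against the file's text in CI or a pre-commit hook, a non-empty result is a cheap signal that the doc has drifted. It says nothing about whether the content under each heading is still accurate; that part stays manual.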

Context management: load less, delegate more

My old approach: dump the whole codebase into context and hope the AI would find what it needed. After repeatedly hitting context window limits, and after reading some painful lessons about how bigger context windows don't always mean better results, I shifted to just-in-time loading.

The principle: give the agent lightweight references (file paths, function names, test file locations) and let it pull what it needs via tools rather than pre-loading everything. When an agent reads a 5,000-line log file, don't leave the whole thing in context; extract the relevant parts and drop the original. Anthropic's engineering team has written about this approach, and the reduction in context volume can be dramatic: sometimes 90% less with equivalent quality [3].
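The log-file case can be sketched in a few lines. This is an illustration of the idea, not Anthropic's implementation: `extract_relevant`, the default pattern, and the context size are all assumptions chosen for the example.

```python
import re


def extract_relevant(lines, pattern=r"ERROR|Traceback", context=2):
    """Keep only lines matching `pattern`, plus `context` lines on each side.

    The result is what stays in the agent's context; the full log is dropped.
    Overlapping windows are merged and original order is preserved.
    """
    matcher = re.compile(pattern)
    keep = set()
    for i, line in enumerate(lines):
        if matcher.search(line):
            start = max(0, i - context)
            stop = min(len(lines), i + context + 1)
            keep.update(range(start, stop))
    return [lines[i] for i in sorted(keep)]
```

For a 5,000-line log with a single error line, this returns five lines: the match and two neighbours on each side. How much context to keep is a judgment call; too little and the agent has to re-read the file anyway.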

Practically, this means writing cleaner task descriptions and trusting the agent to navigate rather than doing the navigation for it. It felt like giving up control at first. The results were better.

Review gates instead of constant checking

This one changed my focus the most. I used to watch agent output as it came in — ready to intervene the moment something looked off. That's basically the vibe coding mindset applied to agentic tasks, and it's exhausting. It also defeats the purpose of delegation.

Now I define review checkpoints in advance: before any database migration runs, before a PR is opened, before any external API call gets implemented. Between checkpoints, the agent runs. I do something else. When it reaches a checkpoint, I review the state, decide whether to continue or correct, and then it keeps going.
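The checkpoint idea can be expressed as a tiny runner that executes steps until it hits a gated kind of action, then stops and waits. This is a sketch only: `Step`, `RunLog`, `run_until_gate`, and the step kinds in `GATED_KINDS` are hypothetical names, not part of any agent framework.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical step kinds that require human review before they run,
# mirroring the three checkpoints named above.
GATED_KINDS = frozenset({"db_migration", "open_pr", "external_api"})


@dataclass
class Step:
    kind: str
    description: str


@dataclass
class RunLog:
    executed: list = field(default_factory=list)
    paused_at: Optional[Step] = None


def run_until_gate(steps, approved=frozenset(), gated=GATED_KINDS) -> RunLog:
    """Run steps in order, pausing at the first gated step not yet approved.

    `approved` holds descriptions of gated steps a human has signed off on.
    Execution is simulated here by recording the step description.
    """
    log = RunLog()
    for step in steps:
        if step.kind in gated and step.description not in approved:
            log.paused_at = step
            break
        log.executed.append(step.description)
    return log
```

Given a plan of edit, run tests, migrate, open PR, the first call stops at the migration with the first two steps done. After review, calling it again with the migration's description in `approved` continues up to the PR gate. The gates are declared before the run starts, which is the whole point: the agent doesn't decide when to ask.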

      FIG. 05 — REVIEW DISCIPLINE: WHAT WORKED, WHAT DIDN'T
    

✓ DO

✓ Set checkpoints before the agent starts (pre-migration, pre-PR, pre-external-API)
✓ Step away between gates — work on something else
✓ Review the state, not the line-by-line generation
✓ Treat CLAUDE.md updates as part of the change

✗ DON'T

✗ Watch every line and interrupt mid-generation (surveillance mode)
✗ Re-explain project context at every session start
✗ Pre-dump the whole codebase "just in case"
✗ Skip code review because the gates exist

      The line between delegation and abandonment runs through what you check, not how often.
    

The vibe coding fatigue I'd noticed before — that constant low-grade attention load of monitoring AI output — mostly came from this pattern of continuous supervision. I'd written about that fatigue without really identifying the cause. The cause was supervision mode, not agentic mode. Once I stopped watching and started reviewing at gates, the cognitive load dropped.

    04 / HONEST TRADE-OFFS
  

What I thought would improve — but didn't

I expected the speed gains to be dramatic. They weren't. My throughput went up, but not by some 10× multiplier. The gains were real and consistent — maybe 40% faster on well-defined tasks, less on anything requiring domain judgment. The 10× figures floating around assume you're working on greenfield projects with simple requirements. Production codebases with years of technical decisions baked in don't compress that neatly.

I also expected trust in AI-generated code to become a non-issue once I had better review gates. It didn't. I still read every piece of code before it ships. I'm more systematic about it — the gates create structured review moments — but the review itself isn't shorter. The 45% OWASP vulnerability rate doesn't drop to zero because your workflow improved. It just stops being a surprise.

The biggest overhead I underestimated was CLAUDE.md maintenance. Every time a pattern changes, every time a dependency gets added, every time there's an architectural shift — the doc needs updating or the next agent session will drift. It's not complicated work, but it's work that didn't exist before. Think of it as writing for your future AI collaborators the same way you'd write good documentation for a new team member.

Developer trust in AI-generated code industry-wide has actually dropped, from 43% in 2024 to 33% this year [1]. Agentic engineering improves the structure around AI, not the fundamental reliability of AI output. The tools are getting better. The judgment still has to come from somewhere else.

    05 / TAKEAWAY
  

The trade you actually want

All that said: the billing bug incident hasn't repeated. The review gates catch edge cases before they ship. The workflow takes more upfront thought, but the downstream cost of mistakes dropped significantly. That's the trade-off. Not faster, looser development — more structured development where the AI does more of the repetitive work and the engineer does more of the architectural thinking.

That's closer to what I wanted when I started using AI coding tools. It just took a production incident to get there.

If you're still in pure vibe coding mode and it's working — for prototypes, personal projects, throwaway scripts — that's fine. Vibe coding isn't bad. It's just the wrong tool when the stakes go up. Agentic engineering is what fills that gap, and it's worth building the habit before the incident rather than after.

THE BOTTOM LINE

Build the habit before the incident, not after. The cost of switching modes after a production failure is much higher than the cost of switching before one.

    REFERENCES & SOURCES
  

[1] Stack Overflow Developer Survey 2026 and recent AI-coding quality research — AI co-authored code shows 1.7× more major issues; ~45% of generated samples contain OWASP Top-10 vulnerabilities; developer trust dropped from 43% (2024) to 33% (2026).
[2] Andrej Karpathy, Sequoia Capital Ascent conference, 2026 — remarks on the orchestrator role and "99% not writing code directly."
[3] Anthropic engineering posts on context management and just-in-time loading — reported volume reductions up to ~90% with equivalent task quality.
[4] Related IX Works posts: AI coding tools and productivity · Context window lessons · Vibe coding fatigue.

IX Tech Insights
