TL;DR: I spent six months treating LLMs like a search engine that writes code. That was wrong. The moment I started treating them like a team — an architect, a developer, and a reviewer with different strengths — everything changed. This is the exact workflow I use today, why each piece exists, and what breaks without it.


The Honest Starting Point

I want to be direct about something before this article goes any further.
Most “LLM coding workflow” posts are written by people who have used AI tools for a few weeks and want to share the excitement. That is fine. But this is not that.
I have spent the better part of a year building real production software with LLMs as a primary collaborator — not a suggestions engine, not an autocomplete, but an active participant in architecture decisions, implementation, and code review. Some of it has worked exceptionally well. Some of it has produced the most quietly catastrophic codebases I have ever had to untangle.
Some 53% of senior developers now believe LLMs write code better than humans in certain domains. I do not fully agree with that framing, but I understand why the number is that high. When the workflow is right, the output is genuinely impressive. When it is wrong, it is confidently wrong at a scale that is hard to recover from.
This is the workflow that got me to the right side of that line.

What Changed My Thinking

For the first six months, I used LLMs exactly the way most developers do. I opened a chat window. I described what I needed. I got code. I tested it. I integrated it.
It worked — for a while. Scripts, utilities, isolated functions. The kind of work where the scope is small enough that the model’s blind spots never compound into anything catastrophic.
Then I tried to build something real. A full backend service. Multi-tenant. State-dependent. Integration-heavy.
Within three weeks, the codebase was in a state I recognized immediately from early-career mistakes: every new feature required touching six files in ways that were not obvious. Every bug fix introduced two new ones. The LLM kept saying “I know why this is breaking, let me fix it” — and kept making it worse.
That failure taught me the most important thing I know about writing software with LLMs: the model does not fail at writing code. It fails at architecture. And if you do not own the architecture, you will not own the outcome.
On tasks that demand multi-step reasoning, architecture decisions, and system-level thinking, LLMs show under a 10% improvement. The code generation gains are real. The architectural judgment is still yours to own.

The Shift: From Tool to Team

The mental model that changed everything was simple.
Stop thinking of an LLM as a tool you use. Start thinking of it as a team member with specific strengths, consistent blind spots, and a tendency to agree with whoever spoke last.
Once I started treating different models as different team members — each with a defined role, different access levels, and different responsibilities — the quality of what I built changed significantly.
At Anthropic, engineers adopted Claude Code so heavily that today roughly 90% of the code for Claude Code is written by Claude Code itself. But that level of reliability comes from structure, not from just pointing a powerful model at a problem.
Here is the structure I use.

My Full Workflow for Writing Software with LLMs

The Three Roles

Every session I run has three distinct roles. Sometimes one model plays multiple roles on small tasks. On anything that will live in production, each role is filled separately.
The Architect — handles planning only. No code. This is always the strongest model I have access to (currently Claude Opus). Its job is to understand what I want to build, ask every question it needs to ask, and produce a specific, low-level plan before a single line of code is written.
The Developer — implements only. It receives a plan it did not write, which limits how many structural decisions it can make independently. I use a faster, more token-efficient model here (Sonnet). Its job is execution, not judgment.
The Reviewer — reads the diff and the original plan independently and critiques what was built against what was asked. Critically, this is always a different model than the one that wrote the code. A model reviewing its own work tends to agree with itself. A different model finds different things.
This mirrors how experienced engineering teams work — the person who architects the system is not the same person who implements every component, and neither of them is the right person to review their own work.

Phase 1: The Architecture Conversation

This is the phase most developers skip. It is also the phase that determines whether the rest of the session produces something maintainable.
I never start with “write me a function that does X.” I start with the problem and let the conversation shape the solution.
A real example from a recent session:

“I want to add retry logic with exponential backoff to our external API calls. Let’s think through how we would do this.”

What happens next is the part that takes time — and that time is where the value is. The model reads the existing codebase, maps the relevant components, and comes back with specific questions:

Which codepaths make external calls?
Should retries be per-function or centralized in a wrapper?
What is the acceptable retry budget — attempts, total elapsed time, or both?
How do we handle non-retryable errors (400s vs 500s)?

I answer every question. We go back and forth — sometimes for 20–30 minutes — until the plan is specific enough that the developer agent has no structural decisions left to make.
This part is not prompt engineering. It is engineering. I still correct the model regularly — when it suggests patterns that would work in a different codebase but not mine, when it optimizes for something I do not need, when it misreads how a component is used.
The explicit approval gate: I never let the model start implementing until I say the word “approved.” Some models, given an inch of confidence that they understand, will sprint toward implementation. I want to be certain the plan is right before a line of code is written — because changing code is expensive. Changing a plan is not.
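By the time I say "approved," the retry plan above is specific enough to sketch directly. Here is a minimal sketch of the wrapper design we converged on, assuming retries are centralized and only transient (5xx-style) failures are signaled as retryable. Every name here is hypothetical, not lifted from a real codebase:

```python
import random
import time

class RetryableError(Exception):
    """A transient failure (e.g. a 5xx response) that is worth retrying."""

def with_backoff(fn, *, max_attempts=4, base_delay=0.5, max_total=30.0):
    """Run fn, retrying transient failures with exponential backoff and jitter.

    Gives up after max_attempts tries, or when the next sleep would push the
    total elapsed time past max_total, whichever comes first.
    """
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter: 0.5x to 1.5x of the nominal delay.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            if time.monotonic() - start + delay > max_total:
                raise  # retry budget exhausted
            time.sleep(delay)
```

Note how the 400-vs-500 question from the conversation becomes a design decision rather than an afterthought: callers raise RetryableError only for failures worth retrying, so non-retryable errors propagate immediately.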

Phase 2: Implementation

The developer agent receives the plan. That is its entire context. It has minimal leeway on structure, because the structure was already decided.
Its job is mechanical execution of a well-reasoned plan. And at this level, current models are genuinely excellent. Code generation time drops by 35–45% with LLM usage, and code documentation time drops by 45–50%. The gains at the implementation layer are real and consistent.
What the developer agent produces:

Code that matches the plan’s specified file locations and function signatures
Tests for the components it builds
A brief summary of what was implemented and what it chose not to cover

What it does not decide:

Where new abstractions live
Whether a new utility function should be shared or scoped
What the API contract looks like

Those decisions happened in Phase 1. They are not re-litigated here.
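To make that division of labor concrete, here is a minimal sketch of what a Phase 1 plan artifact can look like as structured data. Every path, signature, and constraint in it is hypothetical; the point is that the developer agent receives decisions, not questions:

```python
# A hypothetical Phase 1 plan: file locations, signatures, and constraints
# are already decided, so the developer agent executes rather than designs.
plan = {
    "goal": "Add retry logic with exponential backoff to external API calls",
    "files": {
        "app/http/retry.py": [
            "with_backoff(fn, *, max_attempts, base_delay, max_total)",
        ],
        "tests/test_retry.py": [
            "test_retries_transient_errors",
            "test_does_not_retry_4xx",
        ],
    },
    "constraints": [
        "Retry only transient (5xx-style) failures; 4xx errors fail immediately",
        "Budget: 4 attempts or 30 seconds total, whichever is hit first",
    ],
}
```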

Phase 3: The Multi-Model Review

This is the step that most LLM coding guides omit entirely. It is also the step that separates code I am confident shipping from code I am nervous about.
After implementation, I route the plan and the resulting diff to a reviewer — always a different model than the implementer.
Why different models? Because asking a model to review its own code is like asking a writer to proofread their own work an hour after finishing it. They see what they intended, not what is actually there. A different model sees the code without the implementation context and catches patterns the original model normalized.
In practice: Codex tends to be pedantic — it catches style and convention issues that are annoying during implementation but genuinely valuable before code ships. Gemini Flash occasionally suggests approaches that neither of the other models saw, which is a meaningful signal worth investigating.
The reviewer’s output goes back to the developer. If the feedback is clear and unambiguous, the developer integrates it. If reviewers disagree, it escalates to the architect for a judgment call.
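The routing itself can be mechanical. A minimal sketch, where call_model is a hypothetical client function (model name and prompt in, response text out) standing in for whatever harness you use:

```python
def review_diff(plan, diff, call_model):
    """Send the plan and the diff to reviewer models that did not write the
    code. call_model is a hypothetical client: (model, prompt) -> str.
    """
    prompt = (
        "Review this diff strictly against the plan. "
        "List deviations and risks, or reply LGTM.\n\n"
        f"PLAN:\n{plan}\n\nDIFF:\n{diff}"
    )
    # Always models other than the implementer; if their findings disagree,
    # the disagreement escalates to the architect for a judgment call.
    return {model: call_model(model, prompt) for model in ("codex", "gemini-flash")}
```

Because the function just maps over reviewer names, adding or dropping a second reviewer for less important projects is a one-line change.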

What Your Engineering Skills Are For Now

A concern I hear constantly: “Will LLMs make my engineering skills irrelevant?”
My honest answer after a year of building this way: no. But they will shift what matters.
If you come to the table with solid software engineering fundamentals, AI will amplify your productivity several times over.
What matters less now:

Remembering exact syntax
Writing boilerplate
Looking up standard library methods
Producing repetitive structural code

What matters more now:

Knowing when an LLM’s architectural suggestion is wrong for your specific system
Understanding tradeoffs the model presents without over-explaining them
Catching the quiet bad decisions — the ones the model makes confidently, in well-formed code, that will cost you three months later
Knowing when to override the reviewer’s feedback because it is technically correct but wrong for your codebase

The engineers I have seen struggle with LLMs are not the ones who cannot prompt well. They are the ones who do not have enough experience to recognize when the model is steering them into a wall.
This is not a warning — it is a clarification. Your engineering judgment is the quality gate. The LLM generates the code. You are still responsible for the architecture.

The Failure Modes Nobody Talks About

  1. The Confidence Spiral
    You tell the model the code is not working. It says “I see the issue — let me fix it.” It fixes something. Something else breaks. You tell it again. It says “ah, now I understand — here is the real problem.” It fixes that. Three more things break.
    This is a real failure mode and it is more common than any LLM coding guide admits. The root cause is almost always the same: a bad architectural decision made early that the model keeps building on top of, adding complexity with each fix rather than addressing the foundation.
    How to catch it early: When you notice the same component appearing in the third consecutive fix, stop. Go back to the architect. Describe the pattern you are seeing. Ask it to trace back to the original decision point. Nine times out of ten, there is a structural choice from Phase 1 that needs to be reconsidered — and the developer fixing symptoms will not find it.
  2. The Unfamiliar Stack Problem
    I have never built a bad LLM-assisted codebase in a stack I know deeply. I have built several in stacks I was learning.
    The reason is straightforward: when I do not know the technology well enough, I cannot catch the model’s bad decisions. It will suggest patterns that are technically valid but wrong for the context — an ORM query pattern that works but destroys performance at scale, a state management approach that is fine for small apps but unworkable at the size we are building to.
    When you are working in an unfamiliar stack, spend more time in Phase 1. Ask the architect to explain its choices. Ask it what the alternatives are and why it rejected them. If you still cannot evaluate the tradeoffs, that is a signal to slow down — not to trust the model more.
  3. The Single-Model Review Trap
    Using the same model to review its own code is nearly useless. It will find surface-level issues. It will miss the structural ones. It will recommend changes that are consistent with the choices it already made — because those choices feel correct to it.
    Different models have different blind spots. The review is valuable specifically because it is not performed by the model that built the thing.
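The confidence spiral in particular can be caught mechanically. A minimal sketch that counts which files keep reappearing in recent commits (hot_files assumes you are inside a git repository; the three-strikes threshold is my heuristic, not a rule):

```python
import subprocess
from collections import Counter

def count_touches(log_output):
    """Count how often each file appears in `git log --name-only` output."""
    return Counter(line.strip() for line in log_output.splitlines() if line.strip())

def hot_files(n_commits=5):
    """Files touched most often in the last n commits. A file that shows up
    in three consecutive fixes is the cue to go back to the architect."""
    out = subprocess.run(
        ["git", "log", f"-{n_commits}", "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout
    return count_touches(out).most_common()
```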

The Tools I Actually Use

I use a multi-model harness — the specific product matters less than the capability requirements. What I require:
Multi-model access: I need to route different tasks to different models. Single-provider tools limit this. The architect, developer, and reviewer should be able to be different models from different companies.
Agent-to-agent calls: The architect should be able to call the developer, which should be able to call the reviewer, without me ferrying outputs manually between windows. Manual ferrying is error-prone and slow.
Session context: The architect needs to maintain awareness of what the developer implemented so subsequent conversations are grounded in the actual state of the code.
For the models themselves in my current setup:

Architect: Claude Opus — strongest reasoning, best at asking the right questions
Developer: Claude Sonnet — fast, capable, cost-efficient for high-volume implementation
Reviewer 1: Codex — pedantic, catches convention and pattern issues
Reviewer 2 (important projects): Gemini Flash — surprisingly good at alternative approaches

I write the agent instruction files by hand. Asking the LLM to write its own instructions produces something that sounds right and produces mediocre results. Instructions are not prompts — they are behavioral contracts, and they need to be authored with the same care as any other specification.
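To give a sense of what such a contract looks like, here is a hypothetical excerpt from a developer-agent instruction file. The exact wording is illustrative, not a template I am claiming works everywhere:

```markdown
# Developer agent — behavioral contract (hypothetical excerpt)

- Implement only what the approved plan specifies. If a structural decision
  is missing from the plan, stop and ask; do not improvise.
- Match the plan's file locations and function signatures exactly.
- Write tests for every component you build.
- Do not begin until you have received an approved plan.
- End every task with a summary: what was implemented, and what was
  deliberately not covered.
```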

What I Have Built This Way

The most common criticism of LLM-assisted development is that it produces toy scripts — things that work in demos and break in production.
In the past year alone, using this workflow, I have shipped:

A multi-tenant SaaS backend handling production traffic with 40,000+ lines of code — maintained cleanly for four months without architectural regression
A real-time event processing pipeline integrating five external services, built in a stack I was learning from scratch
Internal tooling used daily by a team of twelve, zero critical bugs in three months of production use

None of these are toy projects. All of them were built with LLMs as the primary code generator and me as the architect, reviewer, and quality gate.
The workflow does not eliminate the need for engineering experience. It amplifies what you already know — and it exposes, ruthlessly, what you do not.

The Honest Productivity Numbers

I will not claim I am 10x more productive. I do not believe that number.
What I believe, because I have measured it: code generation time drops by 35–45% and code documentation time drops by 45–50% with structured LLM usage. For a senior engineer with a solid workflow, the overall productivity gain is real and consistent — somewhere in the 25–40% range on projects where I know the stack deeply.
On projects where I am learning the stack as I go, the gain is smaller — maybe 15–20% — because the time I save in implementation I spend in more careful architectural review.
GitHub’s own studies found that developers using AI tools completed tasks 55% faster on average than those without — but that is task-level speed, not project-level productivity. The difference matters. I can implement a feature 55% faster. Whether that feature was the right feature, designed correctly, placed in the right abstraction layer — that is still entirely on me.
The engineers who claim 10x gains are usually measuring the wrong thing. The engineers who say LLMs are useless are usually not using a structured workflow. The honest number is somewhere in the 25–40% range for developers with strong fundamentals and a disciplined process.
That is not magic. It is still genuinely significant — the equivalent of adding a quarter to a third of a senior engineer’s capacity without adding headcount.

FAQ

Can LLMs actually write production-quality software?

Yes — with the right workflow around them. LLMs produce production-quality code when the architecture is planned before implementation begins, the implementation is scoped tightly, and the output is reviewed by a different model than the one that wrote it. Without these constraints, code degrades into unmaintainability within a few weeks on any sufficiently complex project. The model quality matters less than the process quality.

What is the best LLM for coding in 2026?

No single model is best for every step. Claude Opus handles architecture and planning best — its reasoning depth and tendency to ask clarifying questions make it the right choice for the phase that determines everything else. Sonnet handles implementation efficiently at lower token cost. For review, a different model than the implementer — Codex or Gemini Flash — consistently catches issues the original model missed. The multi-model workflow outperforms any single-model approach on production-grade projects.

Do I need to be a senior engineer to use this workflow?

Not for simple tasks. But for production software, your engineering experience is the quality gate. LLMs make architectural decisions confidently and incorrectly. If you do not have enough experience to recognize a bad decision in well-written code, you will not catch it — and those quiet bad decisions compound. Junior developers who fully trust LLM output without review create unmaintainable codebases faster than they would writing everything by hand.

What is the biggest mistake developers make with LLM coding?

Skipping Phase 1. Most developers jump straight from problem to implementation. The LLM starts writing code. It makes structural decisions. Those decisions become the foundation everything else is built on. Three weeks later, every change requires touching six files and nothing is clear. The architecture conversation is not overhead — it is the work. Everything else is execution.

How do I know when the LLM workflow is failing?

Three clear signals: the same component keeps appearing in successive fixes, the model says “I see the issue” repeatedly but each fix introduces new problems, or you find yourself unable to explain to a colleague why a piece of the codebase is structured the way it is. Any of these means you have drifted from the architecture into symptom-chasing. Stop. Go back to the architect. Retrace the structural decision that produced the pattern you are fighting.

Conclusion: The Workflow Is the Work

LLMs have genuinely changed how I build software. Not by replacing the engineering — by compressing the mechanical parts of it until the intellectual parts are what remain.
The architecture conversation. The tradeoff evaluation. The judgment call when two technically valid approaches diverge. The decision to override a reviewer’s feedback because it is right in the abstract and wrong for this specific system.
These are the parts of software engineering that were always the most valuable. LLMs have not made them less important. They have made them the only part of the job that the model cannot do for you.
Treat every AI coding session as a learning opportunity — the more you know, the more the AI can help you, creating a virtuous cycle.
That is the most accurate framing of LLM-assisted development I can offer. The workflow does not deprecate your expertise. It requires it, more specifically than before, at exactly the moments that matter most.
Build the architecture first. Own every structural decision. Let the model write the code. Review with a different model. Ship what you understand.
That is the workflow. It works — not because the models are perfect, but because the process is designed around the places where they are not.

→ Building a SaaS product using this workflow? Read our complete guide — it covers the full technical stack, MVP scoping, and the architecture decisions that matter before your first line of code is written. Every principle in this article applies directly to how we approach SaaS development at SNTPL.
