Six thousand lines changed in a single refactor. The AI agent was fast, the architecture was correct, and the tests passed in four hours. The sceptic’s instinct is to call this foolhardy; six thousand lines is too large, too risky, too much to trust to a machine. I would put it differently. The industry has made velocity non-negotiable, and the answer to that tension is not to move slower. It is to build processes honest enough to deserve the speed. Most engineering teams have not yet done so, and most have not yet reckoned with what AI-assisted development has done to the ratio between velocity and verification.
A capability in our system had grown incrementally over several months, authored by three different engineers, each working within a different microservice. Each made locally reasonable decisions. The system functioned. Viewed from a distance, though, the separation of concerns had collapsed: the logic was coupled, scattered, and maintaining it required holding three mental models simultaneously. The practical forcing function was the AI coding agent itself; it was drowning, its context window spanning three services to reason about a single feature. That is a code smell measurable in tokens.
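The "measurable in tokens" claim can be made concrete with a back-of-envelope sketch. The file paths, sizes, and the roughly-four-characters-per-token ratio below are illustrative assumptions, not project specifics:

```python
# Crude estimate of the context an agent must hold to reason about one
# feature. The 4-characters-per-token ratio and file contents are
# illustrative assumptions.

def estimated_tokens(sources: dict[str, str]) -> int:
    """Roughly four characters per token for typical source code."""
    return sum(len(text) for text in sources.values()) // 4

# Scattered: one feature drags three services' worth of files into context.
scattered = {
    "svc_a/handler.py": "x" * 40_000,
    "svc_b/rules.py": "x" * 52_000,
    "svc_c/glue.py": "x" * 28_000,
}
# Consolidated: the same feature lives in one bounded module.
consolidated = {"feature/service.py": "x" * 44_000}

print(estimated_tokens(scattered), estimated_tokens(consolidated))  # → 30000 11000
```

When the agent must load three services to change one feature, the smell shows up directly in the context budget.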
The diagnosis was clear. Consolidate the scattered implementation into a single, purpose-built microservice. What was not clear was the risk surface. The only honest question worth asking: after 6,000 lines move, how does one actually know it works?
§ 01 — Phase 0: the test that earns the right to refactor
The instinct under schedule pressure is to begin. To start moving code. To trust that you understand the system well enough to know when it breaks. This instinct is wrong, and I have the scar tissue to prove it. Phase 0 was the non-negotiable prerequisite: build an exhaustive test harness before touching a single line of production code. Without it, a 6,000-line refactor is not bold engineering. It is gambling with someone else’s money.
This sounds obvious inside a test-driven framework. It is less obvious when the code was written by people other than you, scattered across services with their own environment assumptions, their own internal contracts, and their own undocumented edge cases. I examined portions of the codebase which I did not fully understand. My response to that uncertainty was deliberate: I over-tested. When I was not certain whether an edge case existed, I tested for it regardless.
The cost calculus is asymmetric. A redundant test case costs milliseconds at runtime and a few minutes to write. A missing test case for a real edge condition costs hours of debugging or, worse, a production incident during a major release. There is no version of this calculation where skipping Phase 0 makes sense.
Static unit tests were insufficient. The scattered services had live infrastructure dependencies: Docker containers, inter-service calls, and environment-specific behaviours which only surface during execution. Phase 0 required standing up the Docker services and running the full suite against a real environment. It added substantial friction to test authoring. The entire refactor then validated in four hours. That is the return on the investment.
Phase 0 is not a plan for testing. It is the tests themselves, running green against the old code. That is the baseline. Everything after is execution.
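One way to make that baseline concrete is characterization testing: record what the old code actually does, edge cases included, then hold the new code to exactly that record. A minimal sketch, with hypothetical `legacy_normalize` and `refactored_normalize` functions standing in for the real scattered logic:

```python
# Characterization-test sketch: capture the legacy code's observed behaviour
# as the baseline, then hold the refactored code to exactly that baseline.
# Both normalize functions are hypothetical stand-ins.

def legacy_normalize(record: dict) -> dict:
    # Stand-in for scattered legacy logic, quirks included.
    name = record.get("name", "").strip()
    return {"name": name.lower(), "active": bool(record.get("active"))}

def refactored_normalize(record: dict) -> dict:
    # Stand-in for the consolidated implementation under test.
    return {"name": record.get("name", "").strip().lower(),
            "active": bool(record.get("active"))}

def capture_baseline(fn, cases):
    """Phase 0: record what the old code actually does before touching it."""
    return {repr(case): fn(case) for case in cases}

# Over-test: include edge cases even when unsure they can occur in production.
CASES = [{"name": "  Ada "}, {"name": "", "active": 1}, {}, {"active": None}]

BASELINE = capture_baseline(legacy_normalize, CASES)

def test_refactor_matches_baseline():
    for case in CASES:
        assert refactored_normalize(case) == BASELINE[repr(case)]
```

The baseline is captured from the old code while it is still running green; the refactor is done when the new code reproduces every recorded behaviour.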
§ 02 — Execution: why phases 1–10 were actually fast
Once the test harness was green against the original scattered code, execution of the refactor was genuinely fast. The AI coding agent, now operating within a bounded and coherent context rather than spanning three services, performed well. The majority of failures during continuous test runs had unambiguous causes and unambiguous fixes: missing arguments, incomplete .env files. Not architectural failures. Finishing details.
I ran the test suite continuously. The acceptance bar was 100%. Not 98%, not “functionally equivalent for the happy path.” 100%. A failing test case that you wrote yourself, for behaviour you specifically examined, is telling you something. You do not get to decide it is unimportant without first understanding why it failed. The sceptic who fears the 6,000-line refactor is actually fearing the absence of this discipline. Install the discipline and the fear is no longer rational.
The test suite was the single source of truth for the entire refactor. Not code review. Not intuition. Not the AI agent’s confidence. The tests.
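The 100% bar can be enforced mechanically rather than by willpower. A minimal sketch of such a gate, with illustrative names:

```python
# A "100% or nothing" acceptance gate, the sketch version of the discipline
# described above. Result names are illustrative.

def acceptance_gate(results: dict[str, bool]) -> bool:
    """Return True only when every test passed; name the failures otherwise."""
    failures = sorted(name for name, passed in results.items() if not passed)
    if failures:
        # 98% is not "functionally equivalent"; each red test must be explained.
        rate = (len(results) - len(failures)) / len(results)
        print(f"pass rate {rate:.0%}; blocked by: {', '.join(failures)}")
        return False
    return True
```

Wiring this into the inner loop means no human judgement call ever downgrades a red test to "probably fine".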
§ 03 — The merge: where the difficulty actually lives
Here is the part most refactoring write-ups omit. The code was the manageable part. The merge was the nightmare, and it was a self-inflicted complexity which no test suite can fully protect against.
The refactor ran parallel to active development on the scattered services. Bug fixes were landing in the original codebase daily, as the system was approaching a major release. My refactored branch had to remain current with main throughout the entire process, absorbing upstream changes into a fundamentally reorganised codebase. I was not merging once; I was maintaining continuous integration with an actively moving target.
This required building a specialised agent with a narrow mandate: examine bug fixes committed to the scattered services and apply the semantic equivalent to the consolidated service. Not mechanical line-for-line translation; reasoning about intent. What invariant was the bug fix restoring, and where does that invariant live in the new architecture?
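A hedged sketch of what that mandate might look like as tooling. The service names, fields, and module mapping are assumptions; the real agent reasons over diffs, but the shape of the task is the point:

```python
# Hypothetical sketch of the merge agent's narrow mandate: turn an upstream
# bug-fix commit into a semantic porting task for the consolidated service,
# not a line-for-line patch. All names and fields are assumptions.

from dataclasses import dataclass

@dataclass
class UpstreamFix:
    commit: str
    service: str      # which scattered service the fix landed in
    summary: str      # human-written commit message
    invariant: str    # what property the fix restores

SERVICE_MAP = {       # old scattered service -> owning module in the new one
    "billing-svc": "consolidated.billing",
    "ledger-svc": "consolidated.ledger",
}

def porting_task(fix: UpstreamFix) -> str:
    """Phrase the task in terms of intent, not lines: where does the
    restored invariant live in the new architecture?"""
    target = SERVICE_MAP.get(fix.service, "consolidated (owner unknown: triage)")
    return (f"Upstream {fix.commit} in {fix.service}: {fix.summary}\n"
            f"Invariant restored: {fix.invariant}\n"
            f"Apply the semantic equivalent in {target}, then rerun the suite.")
```

The mapping table encodes the architectural reorganisation once, so every upstream fix is routed to its new owner rather than to the file that no longer exists.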
The agent also generated detailed commit notes, not for me, but for the peer reviewers. A 6,000-line changeset cannot be reviewed the way a 60-line PR is reviewed. Writing for the reviewer’s mental model, not your own, is a discipline which matters more in the era of AI-assisted development than it ever did before.
§ 04 — Context preservation: the unlock nobody talks about
One final operational lesson, unglamorous but important. The AI agent sessions supporting this refactor were long, stateful, and complex. VSCode crashed. Context windows filled and went stale. Sessions died mid-reasoning. This is the current operational reality of working with large coding agents on large problems, and pretending otherwise serves no one.
My solution was simple: I periodically copied the full chat context into a structured checkpoint document. When a session died, I restored state rapidly: architectural decisions already made, edge cases already discovered, test cases already passing. The checkpoint was not sophisticated. It was the engineering equivalent of saving your work before a long compile. It saved hours. Probably more than once.
Treat your agent context as a deliverable, not an afterthought. The session is ephemeral. The reasoning inside it is not.
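As a sketch, the checkpoint need be nothing more than a structured file, written often and read on restart. The schema below is an assumption for illustration, not any particular tool's format:

```python
# Minimal checkpoint sketch: persist the session's durable reasoning
# (decisions, edge cases, test status) so a dead session can be restored.
# The schema and filename are illustrative assumptions.

import json
from pathlib import Path

def save_checkpoint(path: Path, state: dict) -> None:
    """Overwrite the checkpoint; the latest snapshot is the one that matters."""
    path.write_text(json.dumps(state, indent=2))

def restore_checkpoint(path: Path) -> dict:
    """Start a new session from the last known-good state."""
    if path.exists():
        return json.loads(path.read_text())
    return {"decisions": [], "edge_cases": [], "tests_green": []}

ckpt = Path("refactor_checkpoint.json")
save_checkpoint(ckpt, {
    "decisions": ["consolidate into one service"],
    "edge_cases": ["empty .env in CI"],
    "tests_green": ["test_refactor_matches_baseline"],
})
```

The point is not the code; it is that the restore path exists before the crash does.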
§ 05 — What AI development is doing to code review, and what we need to do about it
The AI coding agent accelerated this refactor. It also made the refactor more ambitious than I would have attempted without it. But the agent did not provide confidence. The test harness provided confidence. The agent provided velocity. Conflating those two is how you end up with 6,000 changed lines and no honest answer to whether the system works. The sceptic who calls this reckless is not wrong about the risk; they are wrong about the remedy.
Here is the uncomfortable claim: traditional peer review was designed for incremental, human-authored changes. A reviewer reads a diff, traces the logic, and spots the anomaly. That model degrades badly when the changeset is large, semantically reorganised, and AI-generated. The reviewer’s cognitive load is not linear with diff size; it is superlinear. At 6,000 lines, line-by-line review is not merely slow.
It is theatre. You are performing review without actually achieving it.
Most engineering teams have not admitted this yet. I think they should, and I think the teams which admit it first will have a durable competitive advantage over those which do not.
My working model for reviewing AI-assisted refactors at this scale: the reviewer’s first artefact to examine is the test harness, not the code. If the tests are exhaustive and correct, they constitute the functional specification. The reviewer’s job becomes auditing the spec, not reading the implementation. Second, the commit notes structured by the merge agent serve as the architectural narrative; the reviewer follows the reasoning, not the lines. Third, code review at scale is sampling, not coverage: spot-check the high-risk surfaces, validate the invariants, verify the contracts at service boundaries. Stop pretending otherwise.
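The sampling step can itself be sketched as a crude risk ranking. The scoring heuristic below is an illustration of the idea, not the author's actual tooling, and the weights are assumptions:

```python
# A sketch of "review as sampling": rank changed files by a crude risk score
# and spot-check the top of the list rather than reading all 6,000 lines.
# The 5x boundary weight is an assumed heuristic, not measured.

def risk_score(path: str, lines_changed: int, touches_boundary: bool) -> float:
    """Weight service-boundary contracts far above internal reshuffling."""
    return lines_changed * (5.0 if touches_boundary else 1.0)

def review_sample(changes: list[tuple[str, int, bool]], k: int = 3) -> list[str]:
    """Return the k highest-risk files for human spot-checking."""
    ranked = sorted(changes, key=lambda c: risk_score(*c), reverse=True)
    return [path for path, _, _ in ranked[:k]]

changed = [
    ("consolidated/api/contract.py", 200, True),   # service boundary
    ("consolidated/internal/utils.py", 900, False),
    ("consolidated/billing/rules.py", 300, True),  # service boundary
]
print(review_sample(changed, k=2))
# → ['consolidated/billing/rules.py', 'consolidated/api/contract.py']
```

Even this toy version surfaces the honest trade: a 300-line change at a service boundary outranks a 900-line internal reshuffle.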
This is a shift in what we ask of peer reviewers. It requires that test harnesses be written with reviewers in mind, that commit notes be treated as first-class deliverables, and that engineering culture honestly redefine what “reviewed” means for large AI-assisted changesets. The teams which navigate this transition deliberately will ship faster and break less. The teams which keep applying 2018 review practices to 2026 development velocity will discover the gap the hard way: usually in production, usually during a release.
The industry has demanded the velocity. The processes must be honest enough to deserve it. That is the work.