
Reviewing AI Code Is a Different Job.

AI code looks clean. But whether it fits the system can't be seen in the diff — only when you know what to check against.

Clean code surface hiding misaligned architecture beneath — the hidden challenge of AI code review

Recently, I was working on a tool of my own for Entwicklerherz — nothing major, an internal service I'm building on the side. An AI agent delivered a pull request that, at first glance, was more polished than most of what I've seen in fifteen years of software development. Clean methods, sensible naming, tests included, documented. The code worked. Technically, there was nothing to complain about.

What I noticed the next day: the controller had brought its own error handling. Elegantly done. Except the project already had a global ExceptionHandler — born from a bug that cost me a Saturday. That wasn't written down as a rule anywhere. It was in my head. And because I didn't actively cross-check during the review, it slipped through.
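The conflict is easy to show in miniature. Here is a hedged sketch (Python stand-in; all names and the dispatch mechanism are hypothetical, not the actual project code): the project routes errors through one global handler, while the generated route catches its own exceptions — so the central handler, with the logging it guarantees, never runs.

```python
def global_exception_handler(exc):
    """Project-wide handler, added after a painful production bug."""
    return {"status": 500, "error": type(exc).__name__, "logged": True}

def handle_request(route, payload):
    """Framework-style dispatch: uncaught route errors flow to the global handler."""
    try:
        return route(payload)
    except Exception as exc:
        return global_exception_handler(exc)

def ai_generated_route(payload):
    """AI-generated route: clean and tested, but it swallows errors itself."""
    try:
        return {"status": 200, "value": 1 / payload["divisor"]}
    except Exception:
        # Looks diligent — yet it bypasses the central handler entirely.
        return {"status": 500, "error": "internal error"}

result = handle_request(ai_generated_route, {"divisor": 0})
print(result)  # no "logged" flag: the global handler never saw the error
```

Both versions return a tidy 500 response; only the undocumented convention distinguishes the right one from the wrong one.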

The mistake wasn't the AI's. The mistake was mine. I reviewed AI code like human code — line by line, top to bottom, looking for correctness. And missed that the real question was a different one entirely.

Correct isn't enough

There's a difference that sounds trivial but changes everything: A colleague writes code within your project. An AI writes code that gets placed into your project. The difference is context.

Of course you can give an AI context. You can provide architecture documents, coding guidelines, existing modules as reference. The better the context, the better the output — that's no longer news. But there's a kind of context that's hard to pack into a prompt: the lived history of a project.

Why service X communicates with service Y through a detour, even though a direct connection would be more obvious. Why a column is defined as VARCHAR where an ENUM would be cleaner. Why a specific module has its own error handling. These aren't technical decisions — they're scars. Scars from incidents, from compliance audits, from dependencies on third-party systems that can't be changed.

AI code that doesn't know this history can still be excellent. But it can also be excellently wrong. And the problem is: Both look the same from the outside. Clean code, green tests, proper structure. The difference only shows when you know what to check against.

Why the classic review workflow isn't enough

When reviewing human code, you read a colleague's thought process. You see where they weighed options, where they were uncertain, where they deliberately took a shortcut. You recognize this because you think the same way. The code tells a story.

AI code tells no story. It's result without derivation. Clean, often cleaner than human code, but without the traces of thinking. And that means you as a reviewer are missing something crucial: the running thought process you normally follow along to spot errors.

That's the core of the problem. Not that AI code is worse — it's often better. But that the way we've been doing reviews for years is built on an assumption that no longer holds for AI code: that the author and the reviewer work in the same context.

When a colleague opens a PR, they've attended the same meetings, heard the same retros, experienced the same incidents. When an AI generates a PR, it has none of that. And no prompt, however good, replaces the fact that it wasn't there when the team decided eighteen months ago to stop using synchronous calls between services.

Filter instead of read

What became clear to me after the ExceptionHandler mistake: I need to approach AI code differently. Not more thoroughly — that doesn't help when you're being thorough in the wrong places. But in a different order.

With human code, I start in the diff and read through. With AI code, I start one level higher. Not "Is the method correct?" but "Does this class even belong in this module?" Not "Are the tests green?" but "Do they test the scenarios that actually occur in our system?"

I now work in layers. First architecture — is the placement in the system correct? Then project conventions — does the code follow the patterns the team developed over months? Then a quick check whether the AI built something new that already exists. And only when all of that checks out do I go into the implementation.
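The layered order can be sketched as a short script — the layer names, checks, and PR fields below are illustrative, not a real tool. The point is the control flow: stop at the first failing layer instead of reading the diff line by line.

```python
def layered_review(pr, checks):
    """Run review layers in order; return the verdict of the first failing layer."""
    for layer, check in checks:
        if not check(pr):
            return f"reject at layer: {layer}"
    return "proceed to detail review"

# Hypothetical PR summary: well placed, convention-following,
# but it re-implements a helper the codebase already has.
pr = {
    "module": "billing",
    "follows_conventions": True,
    "duplicates_existing_code": True,
}

checks = [
    ("architecture", lambda p: p["module"] in {"billing", "invoicing"}),
    ("conventions",  lambda p: p["follows_conventions"]),
    ("duplication",  lambda p: not p["duplicates_existing_code"]),
]

print(layered_review(pr, checks))  # reject at layer: duplication
```

A rejection at the duplication layer costs a few minutes; the same finding after a full line-by-line read costs half an hour.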

That sounds like more work. It actually saves time. Because a PR that lands in the wrong place in the system doesn't get better when you read it line by line. The five minutes I invest in the upper layers save me the thirty minutes I'd otherwise spend in a detail review that ends up being a reject anyway.

What changes for everyone

This doesn't just affect the individual reviewer. It changes how teams work and how companies think about software quality.

For developers
Reviews become more demanding. The days when a review was mainly about syntax and style are over. Anyone who wants to meaningfully review AI code needs an understanding of the overall system — not just the language it's written in. This applies to juniors who need to build this understanding, just as much as to experienced developers who need to keep it current.
For companies
Documentation moves from side project to production factor. Architecture Decision Records, maintained coding guidelines, documented conventions — these were diligence tasks that were easy to postpone. In a world where AI agents generate code, every undocumented convention becomes a review bottleneck. Not because the AI couldn't follow it — but because no one can check whether it was followed when it isn't written down anywhere.
For team builders
The skills that truly matter in reviews are shifting. It's less about whether someone can find a bug on line 47. It's about whether someone understands why line 47 has to be this way and not another. That's a difference measured not in years of professional experience, but in years in the same context.

The actual point

AI makes code production cheaper and faster. That's a fact. But it doesn't make code evaluation easier. On the contrary — it makes it more demanding, because the surface becomes cleaner and the real problems lie deeper.

This isn't a criticism of AI. I use it daily, and the output gets better every month. But anyone who believes that good AI output automatically means good software is confusing production with quality. Good code is code that works. Good software is code that fits the system, respects the project's history, and is still maintainable in a year.


AI writes code that works. Whether it fits — that's your call.

Companion guide: AI Code Reviews: A Guide for Teams — practical approaches for AI code reviews, from the layered review model to documentation as a production factor.