A Real-World Evaluation System for AI-Augmented Coding
At Viabl, we believe the most effective development workflow combines the speed and scalability of AI with the judgment and experience of a human developer. This hybrid approach gives teams the best of both worlds: the efficiency of an AI model paired with the contextual awareness and decision-making skills of a professional engineer.
In this post, we introduce a structured way to evaluate how well that collaboration is working. This framework can be used both to help developers improve how they work with AI tools and to assess how different models perform in real-world coding scenarios.
Classification System
To better understand how and why AI-assisted software development can go wrong, we’ve outlined a classification system for common failure modes that occur when developers work with large language models. Each error class highlights a specific point where the collaboration breaks down. By identifying where responsibility typically lies, this framework helps teams provide more targeted feedback, improve collaboration, and ultimately produce more reliable software. What follows is a breakdown of five key error types, each illustrating a different kind of failure and who is responsible for it.
Class I: Bad Instructions from the Developer
In these types of errors, the developer fails to give the model enough information to complete the task successfully. An example would be “create a calculator”. The AI will dutifully build you its best guess of what you wanted, but if you’re really counting on it nailing trig functions, you probably should have specified that.
Failure primarily caused by: developer
Class II: Failure to Address Edge Cases
Edge cases are scenarios outside the expected or “happy” path. In Class II errors, the AI has failed to consider likely edge cases or handle them appropriately. In keeping with the calculator example, the AI may not anticipate and address floating-point rounding errors. This is where experienced developers are especially valuable: they can flag likely edge cases up front and check that the AI has actually handled them.
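To make that concrete, here is a minimal TypeScript sketch of the kind of edge case a naive calculator can miss; the evaluate and display helpers are hypothetical names used only for illustration. Because of how floating-point arithmetic works, 0.1 + 0.2 does not equal 0.3 exactly, so results need rounding (or a tolerance check) before they reach the user.

```typescript
// A minimal sketch of a Class II edge case: naive floating-point math.
// "evaluate" and "display" are hypothetical names used only for illustration.

function evaluate(a: number, b: number): number {
  return a + b; // naive addition, no handling of floating-point error
}

console.log(evaluate(0.1, 0.2));         // 0.30000000000000004 -- surprising to users
console.log(evaluate(0.1, 0.2) === 0.3); // false

// One common mitigation: round to a fixed number of decimal places for display.
function display(value: number, places = 10): string {
  return Number(value.toFixed(places)).toString();
}

console.log(display(evaluate(0.1, 0.2))); // "0.3"
```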
Failure primarily caused by: developer & AI*
Class III: Adding Unwanted “Features”
As more AI tools are optimized to generate full applications from a single prompt, many models have started inferring and implementing features that weren’t explicitly requested. At first glance, this might seem like a positive outcome: an AI that goes above and beyond. But in practice, it often introduces unintended consequences.
Good software is carefully scoped to balance usability, performance, and maintainability. When models add functionality that wasn’t asked for, it can lead to bloated interfaces, unclear user flows, and increased complexity. These additions not only make the product harder to use, but they also expand the application’s attack surface and raise the long-term cost of development and maintenance.
Failure primarily caused by: AI**
Class IV: Failure to Follow Instructions
In these types of errors, the AI implements some but not all of what the developer asked for. An example would be “create a scientific calculator. Make sure to include trig functions like sin.” When you go to test the AI’s code, you see that the trig function cos is missing.
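As a hypothetical sketch (the names below are ours, not output from a real session), a Class IV failure might look like this: the prompt asked for trig functions like sin, yet only sin gets wired up, and the gap only surfaces when the developer tests the output.

```typescript
// Hypothetical sketch of a Class IV failure: the request was
// "include trig functions like sin", but only sin was implemented.
type UnaryOp = (x: number) => number;

const operations: Record<string, UnaryOp> = {
  sin: Math.sin,
  // cos is missing -- the instruction covered trig functions generally,
  // and the gap only shows up when the output is tested.
};

function applyOperation(name: string, x: number): number {
  const op = operations[name];
  if (!op) throw new Error(`Unsupported operation: ${name}`);
  return op(x);
}

console.log(applyOperation("sin", Math.PI / 2)); // 1

try {
  applyOperation("cos", 0);
} catch (err) {
  console.error(err); // Error: Unsupported operation: cos
}
```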
Failure primarily caused by: AI**
Class V: Coding Errors
Due to the way modern coding tools work, it’s really rare to have outright syntax errors. However, it’s still essential to understand the limitations of the tools you're working with in order to test their outputs effectively.
For instance, Claude Code can’t load up a web browser and click a button out of the box (though you can always set up an MCP server to do so). As a result, a common Class V error involves interface elements that appear correctly in the code but don’t actually function. In our calculator example, this might look like a parenthesis button, such as “)”, that renders in the UI but doesn’t trigger any behavior when clicked.
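Here is a minimal sketch of that failure mode, using hypothetical code rather than output from an actual session: the button is created and added to the page, but no click handler is ever attached, so it looks correct in both the UI and the source yet does nothing.

```typescript
// Hypothetical sketch of a Class V failure: the ")" button is rendered
// but never wired to the calculator's input handling.

function addParenthesisButton(expressionInput: HTMLInputElement): void {
  const button = document.createElement("button");
  button.textContent = ")";
  button.className = "calc-key";
  document.querySelector("#keypad")?.appendChild(button);

  // The bug: this line was never generated, so clicking ")" does nothing.
  // button.addEventListener("click", () => { expressionInput.value += ")"; });
}
```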
Failure primarily caused by: AI**
Important Notes
* In the case of Class II errors, edge cases can be identified by either party. Identifying and handling edge cases is a core, though often underemphasized, responsibility of software developers. While modern state-of-the-art models attempt to anticipate edge conditions, their performance in this area remains inconsistent.
** Regardless of origin, responsibility for any code that gets committed ultimately rests with the developer. The goal of assigning failure categories is not to shift blame, but to provide clear, quantitative feedback that could help both AI and developers improve over time.
Utility of Error Classification
We use this classification system in two ways: as a tool for improving how we work with AI as developers, and as a benchmark for evaluating different coding assistants to find the model best suited to real-world development.
Developer Improvement
AI can accelerate development, but like any tool, it requires skill to use effectively. To sharpen that skill, we track three key metrics:
How many Class I errors did we introduce?
How many Class II errors did we fail to anticipate?
How many Class II–V errors did we let reach code review?
All of these are quantifiable, and your favorite AI tool can help you tally them. With consistent tracking, teams can build a clear picture of where their workflows are strong and where there’s room for improvement.
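As one way to make that tracking concrete, each task can be logged with a count per error class and rolled up over time. The sketch below is our illustration; the field names and record shape are assumptions, not a prescribed format.

```typescript
// Illustrative sketch of per-task error tracking; the field names and
// record shape are assumptions, not a required schema.
type ErrorClass = "I" | "II" | "III" | "IV" | "V";

interface TaskRecord {
  task: string;
  errors: Partial<Record<ErrorClass, number>>;            // counts by class
  reachedCodeReview: Partial<Record<ErrorClass, number>>; // Class II-V errors that slipped through
}

const log: TaskRecord[] = [
  { task: "calculator UI", errors: { I: 1, II: 2 }, reachedCodeReview: { II: 1 } },
  { task: "trig functions", errors: { IV: 1 }, reachedCodeReview: {} },
];

// Roll up totals per class across all tasks.
function totals(records: TaskRecord[]): Record<ErrorClass, number> {
  const sum: Record<ErrorClass, number> = { I: 0, II: 0, III: 0, IV: 0, V: 0 };
  for (const record of records) {
    for (const [cls, count] of Object.entries(record.errors)) {
      sum[cls as ErrorClass] += count ?? 0;
    }
  }
  return sum;
}

console.log(totals(log)); // { I: 1, II: 2, III: 0, IV: 1, V: 0 }
```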
Toward Real-World Benchmarking
Current AI-coding benchmarks mostly test function-level programming. This is, of course, important data. However, it’s also the lowest-hanging fruit: it’s easy to determine programmatically when a function fails to meet requirements. All you have to do is write a unit test. It’s significantly harder to verify programmatically that a full application feature meets its requirements.
Further, there’s much more to development than writing functions: UX, architecture, discovery, security, and so on. Tracking Class II–V errors measures how well an AI sticks to specifications, and thus gets a bit closer to quantifying some of these more impactful activities.
At Viabl, we use Claude Code for all of our development. We instruct Claude to copy our conversation word-for-word into a conversation.md file. By pasting that conversation alongside the Class I–V definitions into any capable AI model, you can ask it to annotate each user request and response with the relevant error class, if applicable.
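As an illustration, a small script along these lines can assemble the annotation request from the transcript and the definitions before you paste it into the model of your choice; the file names and prompt wording here are our own assumptions, not part of Claude Code.

```typescript
// Illustrative sketch: build an annotation prompt from conversation.md and
// the error-class definitions. File names and wording are assumptions.
import { readFileSync, writeFileSync } from "node:fs";

const definitions = readFileSync("error-classes.md", "utf8"); // the Class I-V definitions
const conversation = readFileSync("conversation.md", "utf8"); // verbatim transcript from Claude Code

const prompt = [
  "Using the error class definitions below, annotate each user request",
  "and each AI response in the transcript with the relevant error class,",
  "if any. Output a list of (turn, class, short justification).",
  "",
  "## Error class definitions",
  definitions,
  "",
  "## Transcript",
  conversation,
].join("\n");

writeFileSync("annotation-prompt.md", prompt); // paste this into any capable model
```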
Key Takeaways
As AI becomes a more integral part of modern software development, effective collaboration requires clear communication, thoughtful oversight, and a reliable framework for evaluation. The classification system presented in this post provides a practical way to identify common failure patterns, assign responsibility, and improve both human and AI performance.
By applying this system to our day-to-day workflows and model evaluations, we can go beyond simple correctness checks and begin to measure what actually matters. High-quality software depends not just on code that works, but on a process that consistently produces reliable, maintainable, and well-scoped solutions. This framework helps teams take a step in that direction.
For an example of this classification system in action, please see our blog post “We Vibe Coded a Calculator with Claude Code. Here’s How it Went.”