You Can't Tell What It Smells Like If You Haven't Been Stepping In It

Something stinks in the vicinity of agentic engineering. For decades, we’ve been able to walk into any software company and find a programmer – often one with a few greys in their hair – to complain that most software companies don’t do “real engineering”. They’ll probably make some reference to a civil engineering disaster, juxtapose it with the Therac-25 case study, then complain that there’s never the time or appetite to “do things right” before they go back to their interminable design document and code change review queue. Were we to ask around, we’d probably hear from their coworkers and managers that this person does good work, but they could stand to do it quicker and not hold up the design process by being so fussy. Maybe an especially honest coworker would admit that there are certain “high-velocity” projects that are kept out of sight of this person in hopes that they will be completed on time.

You or I have perhaps been this person at various times in our careers, and if we’re honest with ourselves, we’ve definitely been their hypothetical candid coworker. Often this is for good reason – there isn’t time to do it right! Features in customers’ hands close sales and produce valuable product insight at a fraction of the cost of months of focus groups and reams of product design specifications. When the company goes to do it a second time, there’s more appetite for engineering excellence because its value can finally be communicated in the language spoken by businesses.

But I digress from our agentic stench, if only to catch a whiff of the familiar scent of a traditional software development smell. When considering the persona of this putative “real engineer”, we thought about how their behavior impacts the projects they participate in and the teams they interact with. We can see that they produce code reviews, comments on designs, analyses of system operation, sometimes even their own design documents or code. What is it, though, that we would say that this engineer actually does?

I believe the core contribution of this engineer to the company is their practice of making decisions based on their comprehensive understanding of the systems they work with and their historical experience of creating and operating these systems. They might make these decisions in a design review, in producing an implementation of a tricky service, in a steering document for a team or project, in their seemingly clairvoyant ability to wring a diagnosis from metrics and logs during an incident – but, fundamentally, it’s all the same skill.

What distinguishes the decisions made by this class of engineer from the milieu of choice-and-action in a software development organization is the clarity of purpose and framing contained within them. These engineers work from complex theories of system operation that are continually refined by reading code and designs, testing, experimentation, and data analysis. Decisions are derived by asking questions of these theories, then pursuing answers to spur further questions until one can rationalize a course of action. Proposing an event-driven architecture for propagating changes to a database is easy enough that one could suggest it in a system design interview. Determining that queue processing and database replication can race in an event consumer requires a much more detailed model of system operation.
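To make the race concrete, here is a minimal, deterministic sketch of one way it can show up. Everything in it is an assumption for illustration: the `Primary`, `Replica`, and `consume` names are hypothetical, replication is modeled as an explicit step rather than a real database feature, and the consumer is assumed to read from the replica.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical read replica: sees a write only after replication runs."""
    rows: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)

    def apply_replication(self) -> None:
        # Apply every queued change, in order, then clear the queue.
        for key, value in self.pending:
            self.rows[key] = value
        self.pending.clear()


@dataclass
class Primary:
    """Hypothetical primary: writes locally, replicates asynchronously."""
    replica: Replica
    rows: dict = field(default_factory=dict)

    def write(self, key: str, value: str) -> dict:
        self.rows[key] = value
        # Replication is only queued here, not yet applied.
        self.replica.pending.append((key, value))
        # The change event, however, is available immediately.
        return {"key": key}


def consume(event: dict, replica: Replica):
    """Event consumer that looks up the changed row on the replica."""
    return replica.rows.get(event["key"])


replica = Replica()
primary = Primary(replica)
event = primary.write("user:1", "new-email")

stale = consume(event, replica)   # event wins the race: row not replicated yet
replica.apply_replication()
fresh = consume(event, replica)   # same event once replication has caught up

print(stale, fresh)
```

The event alone carries no guarantee about what the consumer will read. Whether to read from the primary, carry the new value in the event, or retry until the replica catches up is exactly the kind of decision that only a detailed model of the system reveals.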

At this point, I need to make a conjecture: every software project inherently requires a certain quantity of decisions to complete. One can outsource some of these decisions via abstraction (the ease of writing managed code often wins out over manually allocating and freeing heap memory), but one cannot shirk the decisions altogether; it is, after all, now possible to earn a living by tuning the JVM allocator. The advantage of an (even imperfect) abstraction is that it offers a point at which one intentionally delegates a decision to the other side of the abstraction. The quantity of decisions stays the same, but some can be re-used, complete with their underpinning theories. This required quantity of decision making is one potential bottleneck in the software development process.

Into this rosy briar enters agentic software development. It is marvelous that these language models can generate code of the scale and complexity they do; the extent to which they can integrate documentation, compiler output, and test failure logs into the generation process has heretofore been unachievable. These models can perform queries, read logs, identify patterns, and output text presenting this information. Often, the code works, the pattern holds, the presentation is believable.

However, to call the output of these probabilistic models engineering is to commit a category error. This output is, fundamentally, statistically likely text based on the contents of the context window; as anyone who has tried to talk an agent through the process of writing a failing test knows, it can be remarkably hard to get one to generate an unlikely output. Lacking a theory of operation and the question-answer-refinement process, these models cannot rationalize a decision. At best, they can generate a convincing explanation.

All software developers have some theory of the workings of the system they are developing and engage in decision making to some extent as part of the choice-and-action of day-to-day programming. For those lacking the clarity held by our hypothesized engineer, making decisions and performing choices and actions blur together; if it works, isn’t that good enough? To our hypothesized engineer, it is not – they want to be convinced of why a choice works.

In the era of handwritten, artisanal code and design documents, the labor of turning a document template into a design and of writing code and tests spurred developers through the process of making decisions, even if they remained implicit in the work. Observant programmers would identify points where abstractions could be introduced to make decisions re-usable. A skilled author would consider the audience for whom they write and work to align that audience’s question-answer-refinement process with the proposed decisions. Reviewers would check those decisions against their own theories and discuss where they saw discrepancies.

Agentic software generation introduces a subtle break from this development contract. With a short prompt, developers can produce copious amounts of code and documentation without having to consider the contents of this output nearly as much as if they were to type it all out themselves. It is formatted well, uses correct spelling and grammar, has tests that pass (though they may be incapable of failing), and comes with additional text suggesting a thought process leading up to the output – sufficient in the eyes of some to put up for review.

An engineer tasked to review this output will feel a dissonance, perhaps subconsciously, from the category error that has been committed. The prompter has performed choices and actions, but made only the decisions encoded in the prompt to the model. The reviewer finds themselves making decisions, rather than primarily verifying the decisions made by the author. Because there is no theory driving the output, there are no questions the output answers. The generated documentation lacks explanatory power because it has no theory to explain nor audience to receive the explanation. The result is a mental malodor for the reviewers.

This situation presents risk across the software development organization. Programmers miss the opportunity to refine the choice-action process into one producing clear decisions, stymieing their growth and avoiding refinement of their operational theories. Careful reviewers find themselves dividing their focus between more tasks and becoming more exhausted as they experience the effects of ego depletion.

Effectively, decision making is concentrated into specific members of the organization, exacerbating the developer productivity bottleneck by reducing the number of people participating in the decision-making process. Or decisions simply go unmade, either to keep velocity up or because no one involved realized that one was necessary. Deferred decisions are nucleation points for inconsistency and confusion; the resulting systems are harder to form theories about, as there was little theory behind their creation.

Our solution is twofold. First, use agents to drive the rote work, explore the decision space, and refine your theories. Keep the question-answer-refinement process in your hands. Second, decisions can be made reusable – if they are identified and appropriately factored out of the system. Observe points where the agent becomes confused or has to try multiple times, or simply where you can make use of type systems, library design, and explicit modeling of system functionality to reduce or contain the parts of the system requiring especially careful human review. Fundamentally, we are still engineering software, even if a language model is doing more of the labor for us.
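As one small illustration of factoring a decision out so it can be reused, consider the sketch below. The `Percentage` type and `apply_discount` function are hypothetical names invented for this example, not drawn from any particular system; the point is only that a range decision made once in a type no longer has to be remade, or re-reviewed, at every call site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Percentage:
    """Hypothetical value type: the range decision is made here, once."""
    value: float

    def __post_init__(self) -> None:
        # Reject anything outside [0, 100] at construction time.
        if not 0 <= self.value <= 100:
            raise ValueError(f"percentage out of range: {self.value}")


def apply_discount(price_cents: int, discount: Percentage) -> int:
    # No range check needed here: an out-of-range Percentage cannot
    # be constructed through the normal constructor.
    return round(price_cents * (100 - discount.value) / 100)


discounted = apply_discount(2000, Percentage(25))
print(discounted)
```

Whether the code calling `apply_discount` was typed by hand or generated by an agent, a reviewer can concentrate on `Percentage` itself rather than on every place a discount is computed.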

Further Reading

It is always gratifying to write something from the heart, then afterward find others saying the same things with more clarity and empirical validation: