Nick Hauenstein

AI evolution has two axes, and you're likely only tracking one

tl;dr: Evolution in AI is not linear; at least two dimensions of evolution are happening simultaneously. For the best outcomes, recognize that adding agents without process compounds slop, whereas adding process without scale compounds ceremony.

This has truly been an eventful year for Generative AI. We’ve seen step-function change in the sheer capability of frontier models from major vendors, while smaller players with open-weight models are still punching above their weight and keeping the larger players on their toes. With such incredible advancement, developers must be shipping at 50x pace and 10x quality, right? You may in fact find that the opposite is true.

Imagine if you were given the tools of 3.5 years ago – tools that were revolutionary at the time. Those tools harnessed the power of completion, and transformed it into conversation. In such tools, even with the most advanced models available, you could at best hope to copy/paste a single file at a time – and often folks would even request all of the code for a given application in a single file[1] as a result. But only a certain complexity was readily achievable.

The Evolution of Agent Orchestration

There are two parallel axes of evolution[2] happening in the world of AI-assisted Software Engineering right now that nobody is talking about. Instead, isolated pockets of the industry act and talk as if the current evolution were linear and confined to a single vector.

Some in the industry are focusing on and chasing the quality of outputs and outcomes through process optimizations. Others are focusing on and chasing more throughput through multiplication of agents, while ensuring they’re continuously running (to ensure that they can take proactive vs. reactive action – driving throughput even when we’re not watching or even conscious).

These pursuits have led to the two parallel axes of evolution:

  1. Process maturity - the process(es) AI worker(s) follow to get work done
  2. System architecture - the structure(s) of the organization(s) of those AI worker(s) doing the work

Because we are self-aware humans with free will, and have the ability to take a step back and examine the landscape, we’re in a unique position. We don’t have to chase what is popular just for the sake of it. Nor do we need to cling to a linear model of the ongoing transformation to predict where this all might lead. Instead, we can make an informed and intentional choice of where within this world we want to operate, irrespective of where things might lead (up and to the right won’t always lead to the best outcomes).

I suspect the choice will be different for different people/teams. However, I think there are some specific combinations that are especially problematic/painful and others that might be a newfound sweet spot.

To help guide that choice, I herein present the Hauenstein Model of Agentic Evolution[3].

A 0-to-4 grid with Process Maturity on the X-axis and System Architecture on the Y-axis, divided into four quadrants at the midpoint. Lower-left: Vibe Coding. Lower-right: Disciplined Prompting. Upper-left: Agentic Chaos. Upper-right: Governed Autonomy. A purple evolutionary fringe marks the level-4 edges on both axes.

Axis 1: Process Maturity – From Vibe Coding to Evolutionary Autonomy

Horizontal progression of five labeled cards from left to right along a gradient arrow. Level 0, Vibe Coding: raw prompting, no structure, human is driver. Level 1, Structured Patterns: named methodology like RPI or SDD, human is operator. Level 2, Automated Orchestration: declarative pipelines with quality gates, human is reviewer. Level 3, Governed Autonomy: mandatory human checkpoints and evidence requirements, human is arbitrator. Level 4, Evolutionary Autonomy: process evolves itself, human is objective-setter.

| Level | Process | Human Role | Quality Signal | Example |
| --- | --- | --- | --- | --- |
| 0. Vibe coding | Raw prompting against a single model. No structured process, just vibes. | Driver: writes every prompt, interprets every output | Phase outputs are code reviewed and manually tested by a human and/or agent. | “Build me a clone of Tetris” – consider that the word Tetris here is carrying all of the semantic weight. That single word encodes within itself the rules, gameplay mechanic, UX, controls, and even sounds of the game. Net-new concepts risk underspecification and low-quality outputs. |
| 1. Structured patterns | Named methodology with defined phases that a human invokes manually, typically with clean context | Operator: Human prompts agent to Research, then reviews. Human prompts fresh agent to Plan, then reviews. Human prompts fresh agent to Implement plan, then reviews. | Phase outputs are code reviewed and manually tested by a human and/or agent. | RPI, SDD |
| 2. Automated orchestration | A workflow engine drives a process – pipeline, state machine, or otherwise. You set it running and walk away. | Reviewer – reviews output, not each step | Automated scoring thresholds (e.g., ≥ 90) may be provided at each step. Process outputs are code reviewed and manually tested by a human and/or agent. | Conductor executing a workflow that implements SDD |
| 3. Governed autonomy | Agents have autonomy but must prove correctness through adversarial review, deterministic hooks, and agentic/human gates. You govern outcomes, not steps. | Arbitrator – reviews evidence (artifacts, test results, telemetry), not code. Attention is honored. | Evidence-based: test results traced to requirements, screenshots, adversarial audit, independent verification. | FORGED |
| 4. Evolutionary autonomy | Governed autonomy that also optimizes its own governance. The system measures outcomes, experiments with workflow changes, and updates its own process automatically. Changes are bounded, audited, and reversible. | Objective-setter – defines goals, constraints, and rollback rules; intervenes on regressions | Metric-driven: outcome tracking, A/B experiments, canary deployments of process changes, automatic rollback on regression | Autoresearch, Reflexion |

The human’s role evolves from driver to operator to reviewer to arbitrator to objective-setter, but it never disappears. The human provides intent (what to build and why) and judgment (design arbitration, test criteria approval, evidence review). At Level 4, the human additionally provides constraints (what the system may and may not change about itself) and objectives (what metrics to optimize).
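To make the Level 2 row more concrete, here is a minimal sketch of a pipeline with an automated scoring gate at each step. It assumes a Research, Plan, Implement sequence and the ≥ 90 threshold used as an example in the table. The `Phase`, `PipelineResult`, and `run_pipeline` names, the stub agents, and the scores are all hypothetical and are not taken from Conductor or any other tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Phase:
    name: str
    run: Callable[[str], str]      # hypothetical: turns the prior artifact into a new one
    score: Callable[[str], float]  # hypothetical: automated scorer returning 0-100


@dataclass
class PipelineResult:
    status: str                    # "ready_for_review" or "halted"
    failed_phase: Optional[str]
    scores: List[Tuple[str, float]]
    artifact: str


def run_pipeline(phases: List[Phase], brief: str, threshold: float = 90.0) -> PipelineResult:
    """Run phases in order; stop at the first phase whose score misses the threshold."""
    artifact = brief
    scores: List[Tuple[str, float]] = []
    for phase in phases:
        artifact = phase.run(artifact)
        value = phase.score(artifact)
        scores.append((phase.name, value))
        if value < threshold:
            # Gate failed: halt so a human or agent can look at this phase.
            return PipelineResult("halted", phase.name, scores, artifact)
    # Every gate passed: the human reviews the output, not each step.
    return PipelineResult("ready_for_review", None, scores, artifact)


# Stub phases standing in for real agents; the scores are made up.
demo = [
    Phase("research", lambda a: a + " +research", lambda a: 95.0),
    Phase("plan", lambda a: a + " +plan", lambda a: 92.0),
    Phase("implement", lambda a: a + " +code", lambda a: 71.0),
]
result = run_pipeline(demo, "build a Tetris clone")
print(result.status, result.failed_phase)
```

In this sketch the human only sees the final result or the failing phase, which is the shift from operator to reviewer that the table describes.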

“Watching a Ralph loop clone an entire sponsor project overnight, port Python to TypeScript, or generate 50,000 lines of working code was a visceral reminder of how much execution could be commoditized once the right workflows and evaluation signals were in place” - Andrew Zigler

Axis 2: System Architecture – From Single Agent to Evolutionary Teams

Horizontal progression of five labeled cards from left to right along a gradient arrow. Level 0, Single Agent: one LLM, one task, 1 agent to 1 human. Level 1, Multi-Agent: specialized agents with handoffs, N agents to 1 task. Level 2, Agent Organizations: simulated dev teams, N teams times M agents. Level 3, Continuous AI Organizations: always-on signal-driven orgs, continuous work stream. Level 4, Evolutionary Teams: dynamic self-organizing topology, adaptive structure.

| Level | Architecture | Coordination | Scale |
| --- | --- | --- | --- |
| 0. Single agent | One LLM, one context window, one task | Human is the sole coordinator | 1 agent : 1 task |
| 1. Multi-agent | Multiple specialized agents (e.g., planner, coder, reviewer, evaluator) with defined roles and handoffs | Intra-task – handoffs, routing, or parallel dispatch within a single task boundary | N agents : 1 feature |
| 2. Agent organizations | Entire simulated development teams – each “team member” is an agent with a charter, and teams coordinate across organizational boundaries | Inter-team – dispatchers or coordinators manage work across team boundaries | N teams × M agents : 1 portfolio of work |
| 3. Continuous AI organizations[5] | Agent organizations that are always running – not kicked off per task but standing by for whatever comes up. An orchestration loop continuously polls for signals, triages work, dispatches to agent teams, and has the ability to monitor outcomes. This is also usually where proactive work emerges | Org-wide – an always-on loop triages signals and dispatches work across all teams | Always-on teams × continuous work stream(s) |
| 4. Evolutionary teams[6] | Agent team structure is itself a variable the system optimizes. Teams are dynamically formed, expanded, contracted, reorganized, and dissolved based on measured outcomes and goal alignment. A meta-controller manages the team topology. | Self-organizing – the coordination topology itself adapts based on measured outcomes | Adaptive teams × self-organizing structure |
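The Level 3 orchestration loop (poll for signals, triage, dispatch to teams, monitor outcomes) can be sketched roughly as follows. This is an illustration under stated assumptions, not a description of any shipped system: the `Signal` fields, the team names in `route`, and the stubbed signal source are all invented for the example, and the loop is bounded by `max_ticks` so that it terminates, whereas an always-on organization would keep running.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Signal:
    source: str    # e.g. "ticket" or "build" (illustrative values)
    summary: str
    urgency: int   # higher means more urgent (illustrative scale)


def triage(signals: List[Signal]) -> List[Signal]:
    """Order signals so the most urgent work is dispatched first."""
    return sorted(signals, key=lambda s: s.urgency, reverse=True)


def route(signal: Signal) -> str:
    """Hypothetical routing rule: pick a team name from the signal source."""
    teams = {"build": "platform-team", "ticket": "feature-team"}
    return teams.get(signal.source, "triage-team")


def run_loop(poll: Callable[[], List[Signal]],
             dispatch: Callable[[str, Signal], str],
             max_ticks: int) -> List[Dict[str, str]]:
    """Poll, triage, dispatch, and record outcomes for a bounded number of ticks."""
    outcomes: List[Dict[str, str]] = []
    for _ in range(max_ticks):
        for signal in triage(poll()):
            team = route(signal)
            outcomes.append({"team": team, "work": signal.summary,
                             "outcome": dispatch(team, signal)})
    return outcomes


# Stubbed signal source and dispatcher, for demonstration only.
pending = [[Signal("ticket", "add dark mode", 2), Signal("build", "fix red main", 9)], []]
outcomes = run_loop(lambda: pending.pop(0) if pending else [],
                    lambda team, signal: "done",
                    max_ticks=3)
for entry in outcomes:
    print(entry["team"], "->", entry["work"])
```

The recorded `outcomes` list is the hook for the monitoring step; a real system would feed it back into triage rather than just printing it.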

The Pain Grid – Not Every Step Up-Right Is Progress

The natural instinct is to treat these axes like a maturity model and chase (4,4): maximum process evolution with maximum architectural complexity. However, pause and think critically about what development would look like if the organizational structure of your AI agents were in constant evolution, alongside a process that was also in constant evolution, with both evolutions driven by the AI itself. You might quickly realize that it becomes an unusable system that you cannot even hope to begin to reason over. Each intersection on the grid produces different outcomes, and many of the “advanced” positions are actively harmful:

| Position | What Happens | Why It Hurts |
| --- | --- | --- |
| (0, 2+) | Agent organizations with no process | Agent slop compounds multiplicatively across teams. No single human can review the output. The attention gap becomes unbridgeable – at 10× execution rate, how do you even hope to monitor what your agents are doing? |
| (3+, 0) | Governed autonomy for a single agent | The governance overhead crushes the developer. Human checkpoints, evidence requirements, and adversarial audit for a single agent are absurd. You’re driving with the parking brake on – every task takes 5× longer than it should; that level of process only makes sense at scale. |
| (4, 4) | Evolutionary everything | Both the process and the team structure are changing constantly. The system optimizes toward some goal, but Goodhart’s Law kicks in – it starts measuring things to death, changing constantly, and loses focus on what is actually important. Even with reflection and rollback, the system oscillates. Nothing stabilizes long enough to deliver value. This is the research frontier, not a production target. |
| (3, 3) | Governed autonomy + continuous AI organizations | This could be a sweet spot. Governance provides guardrails. Continuous operation provides throughput. The human provides intent and judgment. The system handles operational cadence. This is where output quality meets manageable cognitive load – enough structure to prevent agent slop, enough automation to be worth the overhead. |

The goal is not to go “up and to the right.” The goal is to find where your organization can thrive – where you grant agents enough autonomy to be useful without being harmful, and without being a burden to manage. For most teams today, that means progressing along the diagonal: increasing architecture complexity only as fast as process maturity can keep up. Scaling architecture without scaling governance puts you in the danger zone (upper-left triangle), where the failure modes compound multiplicatively.
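As a rough illustration, the positions called out above can be encoded as a small lookup. This is a sketch of my reading of the table and the diagonal rule, with coordinates given as (process maturity, system architecture). The function name and the label strings are invented here, and the "below diagonal" label in particular is not characterized in the table.

```python
def classify(process: int, architecture: int) -> str:
    """Label a (process maturity, system architecture) cell, both on the 0-4 scale."""
    if not (0 <= process <= 4 and 0 <= architecture <= 4):
        raise ValueError("levels must be between 0 and 4")
    # Specific cells named in the table are checked first.
    if (process, architecture) == (4, 4):
        return "research frontier"
    if (process, architecture) == (3, 3):
        return "possible sweet spot"
    if process == 0 and architecture >= 2:
        return "agent slop"
    if process >= 3 and architecture == 0:
        return "governance overhead"
    # General rule: architecture outpacing process is the upper-left triangle.
    if architecture > process:
        return "danger zone"
    if architecture == process:
        return "diagonal"
    # Process ahead of architecture; label chosen for this sketch only.
    return "below diagonal"


for cell in [(0, 2), (3, 0), (1, 3), (2, 2), (3, 3), (4, 4)]:
    print(cell, classify(*cell))
```

Your own pain scores may well disagree with these labels, which is the point of regenerating the grid for your own team.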

At a minimum, two parallel forces make this navigation hard, and they’re distinct challenges that must be solved together:

  1. Output quality: the AI’s work product isn’t where it needs to be. Model behavior issues (over-agency, speculation, drift, lack of follow-through, lack of honor), the understanding gap (can the human evaluate what the AI produced?), the attention gap (can the human afford to evaluate it at the volume AI produces?), and the verification gap (if AI writes both code and tests, green doesn’t mean good).

  2. Craft evolution: the discipline of software engineering itself is evolving, and engineers are at different stages. Some are unaware of the quality issues (they trust the AI because it’s confident). Some are aware but don’t know what options exist to address them. Some know the options but are overwhelmed by choice, and end up in analysis paralysis with a landscape that changes weekly.

You might find it a helpful thought experiment to re-generate the following grid for yourself using this starter prompt[7].

The Pain Grid. A 5 by 5 grid with 25 color-coded cells. Each cell contains a quote and citation describing outcomes at that intersection of process maturity and system architecture. Green cells cluster along the diagonal where process and architecture grow together -- the brightest green cell at position 3,3 is labeled the sweet spot. Red cells dominate the upper-left triangle where architecture outpaces process, with the worst at 0,4 and 4,4. Orange cells mark friction zones where governance overhead exceeds value, like 3,0. The subtitle reads: not every step up and to the right is progress.

From there, you might plot the current landscape of tools, frameworks, and platforms onto the model to produce something like the graph below (current placements are illustrative – this landscape changes weekly). You can then find what already exists that operates within your teams’ “sweet spot”. I encourage you to generate the latest version yourself by feeding the prompt in footnote [8] to an AI with access to this blog post and the SVG file.

Scatter plot of AI developer tools on a 0-to-4 grid. X-axis: Process Maturity (vibe coding to evolutionary autonomy). Y-axis: System Architecture (single agent to evolutionary teams). Most tools cluster in the lower-left between levels 0 and 2 on both axes. ChatGPT sits near the origin. Cursor, Aider, Google Jules, and LangChain cluster around (1, 0.5). LangGraph and Anthropic Pipeline sit near (2, 1). Claude Code, Copilot CLI, and OpenClaw cluster around (2, 1.5). Stripe Minions and Conductor plus SDD are near (2, 2). Squad SDK sits at roughly (2, 2.5). GasTown and Paperclip approach the governed autonomy sweet spot near (3, 3). Auto-GPT appears in the agentic chaos quadrant at (0.5, 2). Autoresearch sits alone at (4, 0) with evolutionary process but single-agent architecture. A purple evolutionary fringe marks the level-4 edges on both axes.

Pick Your Spot

I opened this post by saying we’re in a unique position – self-aware humans with free will who can step back and examine the landscape instead of being swept along by it. I want to close with the same thought. Don’t allow someone else to pick your spot on this grid for you. Every vendor will want you buying their platform, and the hype cycle of GitHub stars and LinkedIn likes doesn’t know (or frankly care) about your team, your codebase, or your risk tolerance.

Also, the grid isn’t a prescription. You can use it to figure out where you actually are today, and not where you think you are, or where you wish you were. Then look at the cells around you and ask yourself: which direction reduces pain for my team? Sometimes that’s up. Sometimes it’s to the right. Sometimes it’s staying exactly where you are and getting genuinely good at the level you’re at before reaching for the next one.

The worst thing you can do is move in a direction you didn’t choose – drifting into agentic chaos because a new tool made it easy to spin up agents, or bolting on governance theater because someone said you should, without understanding what problem it solves for you.

Every position on this grid is a tradeoff, and the only bad tradeoff is one you did not choose with intention.

Footnotes

[1] I now often wonder if such requests reinforce behaviors that we now treat as indicators of AI slop with so many lines of code being generated.

[2] One could argue that there are potentially 4 or more (e.g., another axis for model evolution, another for operation mode – single turn, multi-turn, continuous, another for continuous polling vs. continuous event-driven[4]). However, to simplify the analysis, I’ve left out model evolution given that even today you will find models of varying generations and types in use, and I’ve collapsed mode of operation into the system architecture despite the valid argument that these may be somewhat orthogonal. Many systems in fact support varying modes of operation.

[3] For pure vanity and maximum entertainment it is herein named the Hauenstein Model of Agentic Evolution – but really I just want to seed some shorthand in training data for future models so that I can in a few short tokens invoke this whole concept in discussion without burning through thousands of tokens of context.

[4] Like the distinction between high latency and offline, the distinction between event-driven systems and polling systems collapses at the time scales where humans begin to take notice. I won’t be making a distinction between event-driven and polling systems here – treating both as “continuous”. Polling has the added benefit of being able to use free cycles with no active work to “reflect” on the state of the system and identify potential proactive steps it might take to delight the users of the system; however, the same reflection could be scheduled as an event that fires based on a time trigger. Yes, this is indeed a footnote for a footnote.

[5] Continuous AI organizations can be event-driven, poll-driven, or both. Event-driven: Stripe’s Minions provision and warm dev boxes ahead of time – when a signal arrives (Slack message, internal ticket), work dispatches immediately with no cold start. Poll-driven: The system actively polls for conditions and schedules work outside of any external request – monitoring build health, identifying stale PRs, suggesting refactors, and proactively doing work it identifies without waiting to be asked. Hybrid: Most production systems combine both. Events trigger immediate work; a polling loop handles proactive identification, scheduling, and continuous improvement.

[6] Level 4 carries significant risks. Goodhart’s Law (“when a measure becomes a target, it ceases to be a good measure”) applies directly – an evolutionary system optimizing a metric may game that metric rather than genuinely improving. Specification gaming, alignment drift, and loss of controllability (corrigibility) are active research challenges. Another challenge is rapid evolution without convergence. I built an experimental system that was given a pre-read of this article and an instruction to fully evolve every aspect of itself, including a web-based UX it built for itself. It changed so much daily that it was near impossible to ever learn how to use it, and the churn more often than not broke functionality. This is why governed autonomy (Level 3) is the prerequisite – you need reliable oversight, evidence requirements, and rollback mechanisms before allowing the system to modify its own processes. One system might, for example, deliberately target Level 3 on both axes for its initial release, with Level 4 capabilities to be unlocked only after the governance layer is proven stable.

[7] Prompt for generating your own personalized pain grid: Read the blog post at https://nickhauenstein.com/blog/2026/04/09/ai-native-software-engineering/ -- specifically the two axis definition tables (Process Maturity levels 0-4 and System Architecture levels 0-4) and the pain grid section. You are going to walk me through every intersection of the two axes to build my own personalized pain grid. For each intersection, re-describe what both levels mean so I don't have to keep referring back to the article. For example: "Imagine your process is Level 0 (Vibe Coding) -- raw prompting, no structure, you write every prompt and interpret every output. And your system architecture is Level 2 (Agent Organizations) -- entire simulated dev teams where each team member is an agent with a charter, coordinating across organizational boundaries. What would that experience be like for your team? On a scale of 1-10 where 10 is maximum pain, how painful would this be? Give me the number and a single sentence explaining why." Start at (0,0) and proceed row by row to (4,4). Wait for my answer before moving to the next intersection. As you go, track my responses and look for patterns. If a region is clearly painful (scores of 7+), you can ask whether I'd like to skip the remaining intersections in that region and auto-fill them at my stated pain level. Conversely, when you find a low-pain zone, probe it -- ask a follow-up like "This seems like your sweet spot -- what about it works for you?" to help me articulate why. After all 25 cells have scores (whether answered directly or auto-filled), produce a modified version of the pain grid SVG. For each cell, set the fill color on a gradient from green (#3fb950 at pain=1) through yellow (#d29922 at pain=5) to red (#f85149 at pain=10). Place my pain score and my single-sentence response in each cell. 
Summarize the results: identify my sweet spot (lowest pain cluster), my danger zone (highest pain cluster), and the diagonal line where pain transitions from tolerable to intolerable. Finally, offer to research real-world systems, tools, or case studies on the internet that match the characteristics of my sweet spot and any intersection I found particularly interesting or surprising, so I can see what others have experienced at that same position on the grid.
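The cell coloring requested in the prompt above can be sketched as a three-stop gradient. The hex values and the pain anchors (1, 5, 10) come from the prompt; linear interpolation in RGB between the stops, clamping out-of-range scores, and the `pain_color` name are assumptions made for this example.

```python
# Color stops from the prompt: (pain score, (R, G, B)).
STOPS = [
    (1, (0x3F, 0xB9, 0x50)),   # green  #3fb950
    (5, (0xD2, 0x99, 0x22)),   # yellow #d29922
    (10, (0xF8, 0x51, 0x49)),  # red    #f85149
]


def pain_color(pain: float) -> str:
    """Map a pain score to a hex fill color, interpolating linearly between stops."""
    pain = max(1.0, min(10.0, float(pain)))  # assumption: clamp scores to the 1-10 range
    for (p0, c0), (p1, c1) in zip(STOPS, STOPS[1:]):
        if pain <= p1:
            t = (pain - p0) / (p1 - p0)
            rgb = [round(a + (b - a) * t) for a, b in zip(c0, c1)]
            return "#{:02x}{:02x}{:02x}".format(*rgb)
    raise AssertionError("unreachable: pain is clamped to the last stop")


for score in (1, 5, 10):
    print(score, pain_color(score))
```

A perceptual color space would give smoother midpoints than plain RGB, but for 25 cells the difference is unlikely to matter.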

[8] Prompt for generating an updated landscape plot: Read the blog post at https://nickhauenstein.com/blog/2026/04/09/ai-native-software-engineering/ -- specifically the axis definition tables and the pain grid section. Then update the SVG file at assets/images/ai-native-landscape/ai-landscape-blog.svg. COORDINATE FORMULA: x_svg = 140 + (process_level / 4) * 660, y_svg = 720 - (architecture_level / 4) * 660. PLACEMENT RULES: Plot each tool/framework/harness on BOTH axes using its most generous interpretation: if a capability is built-in and just needs to be enabled/configured, the tool has achieved that level. If achieving a level requires significant custom development effort by the user, it has NOT achieved that level. Only include real, shipped products. Add any new tools that have arrived since the last update. Remove any that have been discontinued. Update existing placements if a tool has shipped new features that change its position. COLLISION AVOIDANCE: When multiple items land at similar coordinates, slightly offset dots and labels so all text remains readable. Use dashed cluster circles to visually group items in the same zone. Position labels to the left (text-anchor="end") or right of dots as needed to prevent overlap. COLOR CODING by category (not vendor): #8b949e (gray) for model/chat interfaces, #58a6ff (blue) for agent harnesses, #f0883e (orange) for frameworks/SDKs, #d29922 (gold) for workflow engines, #00e5ff (cyan) for agent operating models (harness+workflow+methodology combined), #39d353 (green) for agent platforms, #bc8cff (purple) for company simulators/autonomous orgs, #8b5cf6 (deep purple) for research/evolutionary. SVG STRUCTURE: Keep all structural elements (axes, grid lines, gradient defs, evolutionary fringe, zone watermarks, magic glow border) unchanged. Only modify the TECHNOLOGY DOTS section.
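For reference, the coordinate formula in the prompt above works out as follows. This sketch assumes the vertical term uses the system architecture level (the Y-axis of the model); the `to_svg` name is made up for the example. Note that SVG's y coordinate grows downward, which is why higher architecture levels produce smaller y values.

```python
from typing import Tuple


def to_svg(process_level: float, architecture_level: float) -> Tuple[float, float]:
    """Convert grid levels (0-4 on each axis) to SVG pixel coordinates."""
    x_svg = 140 + (process_level / 4) * 660
    # SVG y grows downward, so higher architecture levels sit nearer the top.
    y_svg = 720 - (architecture_level / 4) * 660
    return x_svg, y_svg


print(to_svg(0, 0))    # origin of the grid, bottom-left of the plot area
print(to_svg(3, 3))    # the (3, 3) cell discussed in the pain grid
print(to_svg(4, 4))    # top-right corner of the plot area
```

Fractional levels such as (2, 1.5) work as well, which is how tools that sit between levels can be placed.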