Over the past two years, LLMs have moved from interesting experiments to everyday tools embedded deeply in the software development lifecycle. Developers use them to generate boilerplate, draft services, write tests, refactor code, explain logs, craft documentation, and debug tricky issues. These capabilities created a dramatic shift in how quickly individual contributors can produce code. Pull requests arrive faster. Cycle time shrinks. Story throughput rises. Teams that once struggled with backlog volume can now push changes at a pace that was previously unrealistic.
If you look only at traditional engineering dashboards, this appears to be a golden age of productivity. Nearly every surface metric suggests improvement. Yet many engineering leaders report a very different lived reality. Roadmaps are not accelerating at the pace the dashboards imply. Review queues feel heavier, not lighter. Senior engineers spend more time validating work than shaping the system. Incidents take longer to diagnose. And teams that felt energised by AI tools in the first few weeks begin reporting fatigue a few months later.
This mismatch is not anecdotal. It reflects a meaningful change in the nature of engineering work. Productivity did not get worse. It changed form. But most measurement models did not.
This blog unpacks what actually changed, why traditional metrics became misleading, and how engineering leaders can build a measurement approach that reflects the real dynamics of LLM-heavy development. It also explains how Typo provides the system-level signals leaders need to stay grounded as code generation accelerates and verification becomes the new bottleneck.
For most of software engineering history, productivity mapped reasonably well to how efficiently humans could move code from idea to production. Developers designed, wrote, tested, and reviewed code themselves. Their reasoning was embedded in the changes they made. Their choices were visible in commit messages and comments. Their architectural decisions were anchored in shared team context.
When developers wrote the majority of the code, it made sense to measure activity:
how quickly tasks moved through the pipeline, how many PRs shipped, how often deployments occurred, and how frequently defects surfaced. The work was deterministic, so the metrics describing that work were stable and fairly reliable.
This changed the moment LLMs began contributing even 30 to 40 percent of the average diff.
Now the output reflects a mixture of human intent and model-generated patterns.
Developers produce code much faster than they can fully validate.
Reasoning behind a change does not always originate from the person who submits the PR.
Architectural coherence emerges only if the prompts used to generate code happen to align with the team’s collective philosophy.
And complexity, duplication, and inconsistency accumulate in places that teams do not immediately see.
This shift does not mean that AI harms productivity. It means the system changed in ways the old metrics do not capture. The faster the code is generated, the more critical it becomes to understand the cost of verification, the quality of generated logic, and the long-term stability of the codebase.
Productivity is no longer about creation speed.
It is about how all contributors, human and model, shape the system together.
To build an accurate measurement model, leaders need a grounded understanding of how LLMs behave inside real engineering workflows. These patterns are consistent across orgs that adopt AI deeply.
Two developers can use nearly identical prompts yet receive different structural patterns depending on model version, context window, or subtle differences in phrasing. This introduces divergence in style, naming, and architecture.
Over time, these small inconsistencies accumulate and make the codebase harder to reason about.
This decreases onboarding speed and lengthens incident recovery.
Human-written code usually reflects a developer’s mental model.
AI-generated code reflects a statistical pattern.
It does not come with reasoning, context, or justification.
Reviewers are forced to infer why a particular logic path was chosen or why certain tradeoffs were made. This increases the cognitive load of every review.
When unsure, LLMs tend to hedge with extra validations, helper functions, or prematurely abstracted patterns. These choices look harmless in isolation because they show up as small diffs, but across many PRs they increase the complexity of the system. That complexity becomes visible during incident investigations, cross-service reasoning, or major refactoring efforts.
LLMs replicate logic instead of factoring it out.
They do not understand the true boundaries of a system, so they create near-duplicate code across files. Duplication multiplies maintenance cost and increases the amount of rework required later in the quarter.
Developers often use one model to generate code, another to refactor it, and yet another to write tests. Each agent draws from different training patterns and assumptions. The resulting PR may look cohesive but contain subtle inconsistencies in edge cases or error handling.
These behaviours are not failures. They are predictable outcomes of probabilistic models interacting with complex systems.
The question for leaders is not whether these behaviours exist.
It is how to measure and manage them.
Traditional metrics focus on throughput and activity.
Modern metrics must capture the deeper layers of the work.
Below are the three surfaces engineering leaders must instrument.
A PR with a high ratio of AI-generated changes carries different risks than a heavily human-authored PR.
Leaders need to evaluate the health of that AI-origin code: how much of each diff the model produced, how much complexity it adds to the files it touches, and how much duplication it leaves behind.
This surface determines long-term engineering cost.
Ignoring it leads to silent drift.
Developers now spend more time verifying and less time authoring.
This shift is subtle but significant.
Verification includes reading generated logic, reconstructing missing reasoning, clarifying assumptions, and rewriting partial work.
This work does not appear in cycle time.
But it deeply affects morale, reviewer health, and delivery predictability.
A team can appear fast but become unstable under the hood.
Stability shows up in rework rates on AI-origin code, duplication trends, review noise, and the shape of the cycle time distribution.
Stability is the real indicator of productivity in the AI era.
Stable teams ship predictably and learn quickly.
Unstable teams slip quietly, even when dashboards look good.
Below are the signals that reflect how modern teams truly work.
Understanding what portion of the diff was generated by AI reveals how much verification work is required and how likely rework becomes.
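As a concrete illustration, here is a minimal sketch of how that ratio might be computed, assuming your tooling already labels each hunk as AI-generated or not (for example via editor telemetry or a commit trailer). The `Hunk` type and the labels are hypothetical, not a Typo API.

```python
# Minimal sketch: compute the AI-origin ratio of a pull request,
# assuming each hunk is already labelled by tooling (editor telemetry,
# a commit trailer, or similar). The Hunk type is hypothetical.
from dataclasses import dataclass

@dataclass
class Hunk:
    lines_changed: int
    ai_generated: bool  # label supplied upstream, not inferred here

def ai_origin_ratio(hunks: list[Hunk]) -> float:
    """Fraction of changed lines in a PR that originated from an LLM."""
    total = sum(h.lines_changed for h in hunks)
    if total == 0:
        return 0.0
    return sum(h.lines_changed for h in hunks if h.ai_generated) / total

# Example: a PR where 120 of 200 changed lines came from a model.
pr = [Hunk(80, False), Hunk(90, True), Hunk(30, True)]
print(f"AI-origin ratio: {ai_origin_ratio(pr):.0%}")  # -> 60%
```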
Measuring complexity on entire repositories hides important signals.
Measuring complexity specifically on changed files shows the direct impact of each PR.
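One way to approximate this for Python code is to run a complexity tool such as radon only over the files a PR touches. The sketch below assumes the changed-file list comes from CI (for example `git diff --name-only`); radon is just one possible tool, not a prescribed toolchain.

```python
# Sketch: mean cyclomatic complexity of only the files touched by a PR,
# using radon (pip install radon). Works on Python source; the changed-file
# list is assumed to come from CI, e.g. `git diff --name-only`.
from radon.complexity import cc_visit

def changed_files_complexity(paths: list[str]) -> float:
    """Mean cyclomatic complexity across functions in the changed files."""
    scores: list[int] = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            source = f.read()
        scores.extend(block.complexity for block in cc_visit(source))
    return sum(scores) / len(scores) if scores else 0.0

# Run this on the pre- and post-merge versions of the same file list
# to see how much complexity the PR itself introduced.
```

Comparing the result before and after the change on the same file list turns this into a per-PR delta rather than a repository-wide average.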
Duplication increases future costs and is a common pattern in AI-generated diffs.
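Dedicated clone detectors (jscpd, PMD CPD, and similar tools) are the robust way to track this, but the underlying signal is simple. A rough sketch, hashing normalised windows of lines across a set of files, might look like this; the six-line window size is an arbitrary choice.

```python
# Rough sketch of a near-duplicate check: hash normalised 6-line windows
# across a set of files and count windows that appear in more than one place.
# Real duplication tools are far more robust; this only illustrates the signal.
from collections import defaultdict
from hashlib import sha1

def duplicate_windows(paths: list[str], window: int = 6) -> int:
    seen: dict[str, set[str]] = defaultdict(set)
    for path in paths:
        with open(path, encoding="utf-8") as f:
            lines = [ln.strip() for ln in f if ln.strip()]
        for i in range(len(lines) - window + 1):
            digest = sha1("\n".join(lines[i:i + window]).encode()).hexdigest()
            seen[digest].add(f"{path}:{i}")
    return sum(1 for locations in seen.values() if len(locations) > 1)
```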
Verification time includes time spent reading generated logic, clarifying assumptions, and rewriting partial work.
It is the dominant cost in LLM-heavy workflows.
If AI-origin code must be rewritten within two or three weeks, teams are gaining speed but losing quality.
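One way to operationalise this is an early-rework rate over a fixed horizon. The sketch below assumes you can already derive per-change records from git history plus whatever labels mark a change as AI-assisted; `ChangeRecord` and the 21-day horizon are illustrative, not a prescribed definition.

```python
# Sketch: share of AI-origin changes that get rewritten within three weeks.
# ChangeRecord is hypothetical; in practice it would be derived from git
# history combined with whatever labels mark a change as AI-assisted.
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class ChangeRecord:
    file: str
    merged_on: date
    ai_generated: bool
    rewritten_on: date | None  # first later commit touching the same lines

def early_rework_rate(changes: list[ChangeRecord], horizon_days: int = 21) -> float:
    ai_changes = [c for c in changes if c.ai_generated]
    if not ai_changes:
        return 0.0
    reworked = [
        c for c in ai_changes
        if c.rewritten_on is not None
        and c.rewritten_on - c.merged_on <= timedelta(days=horizon_days)
    ]
    return len(reworked) / len(ai_changes)
```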
Review noise reflects interruptions, irrelevant suggestions, and friction during the review process.
It strongly correlates with burnout and delays.
A widening cycle time tail signals instability even when median metrics improve.
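In practice that means watching high percentiles, not just the median. A minimal sketch with hypothetical cycle times in hours:

```python
# Sketch: compare the median and the tail of PR cycle times (hours).
# A p90 that grows while p50 stays flat is the "widening tail" signal.
from statistics import quantiles

def cycle_time_tail(cycle_times_hours: list[float]) -> tuple[float, float]:
    cuts = quantiles(cycle_times_hours, n=10)  # decile cut points
    return cuts[4], cuts[8]                    # p50, p90

before = [4, 5, 6, 6, 7, 8, 9, 10, 12, 14]
after  = [4, 5, 5, 6, 7, 8, 9, 16, 30, 48]   # similar median, much longer tail
print(cycle_time_tail(before), cycle_time_tail(after))
```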
These metrics create a reliable picture of productivity in a world where humans and AI co-create software.
Companies adopting LLMs see similar patterns across teams and product lines.
Speed of creation increases.
Speed of validation does not.
This imbalance pulls senior engineers into verification loops and slows architectural decisions.
They carry the responsibility of reviewing AI-generated diffs and preventing architectural drift.
The load is significant and often invisible in dashboards.
Small discrepancies from model-generated patterns compound.
Teams begin raising concerns about inconsistent structure, uneven abstractions, or unclear boundary lines.
Models can generate correct syntax with incorrect logic.
Without clear reasoning, mistakes slip through more easily.
Surface metrics show improvement, but deeper signals reveal instability and hidden friction.
These patterns highlight why leaders need a richer understanding of productivity.
Instrumentation must evolve to reflect how code is produced and validated today.
At the PR level, measure AI-origin ratio, complexity changes, duplication, review delays, merge delays, and rework loops.
This is the earliest layer where drift appears.
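To make that concrete, the sketch below shows the kind of per-PR record such instrumentation might emit, together with an illustrative triage rule. The field names and the threshold are assumptions, not a Typo schema.

```python
# Sketch: a per-PR health record for the earliest instrumentation layer.
# Field names and the triage threshold are illustrative only.
from dataclasses import dataclass

@dataclass
class PRHealthRecord:
    pr_id: str
    ai_origin_ratio: float    # share of changed lines generated by AI
    complexity_delta: float   # complexity added to the files it touches
    duplicated_lines: int     # near-duplicate lines introduced
    review_wait_hours: float  # ready-for-review to first review
    merge_wait_hours: float   # approval to merge
    rework_loops: int         # review rounds that required rewrites

def needs_deeper_review(pr: PRHealthRecord) -> bool:
    """Illustrative rule: a large AI share plus added complexity or duplication."""
    return pr.ai_origin_ratio > 0.5 and (pr.complexity_delta > 0 or pr.duplicated_lines > 0)
```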
Asking authors to add a brief explanation of the reasoning behind AI-assisted changes restores lost context and improves future debugging speed.
This is especially helpful during incidents.
Track how prompt iterations, model versions, and output variability influence code quality and workflow stability.
Sentiment combined with workflow signals shows where AI improves flow and where it introduces friction.
Reviewers, not contributors, now determine the pace of delivery.
Instrumentation that reflects these realities helps leaders manage the system, not the symptoms.
This shift in measurement should be calm, intentional, and grounded in real practice.
Fast code generation does not create fast teams unless the system stays coherent.
An LLM's behaviour changes with small variations in context, prompts, or model updates.
Leadership must plan for this variability.
Correctness can be fixed later.
Accumulating complexity cannot.
Developer performance cannot be inferred from PR counts or cycle time when AI produces much of the diff.
Complexity and duplication should be watched continuously.
They compound silently.
Teams that embrace this mindset avoid long-tail instability.
Teams that ignore it accumulate technical and organisational debt.
Below is a lightweight, realistic approach.
Labelling the AI-generated sections of each PR helps reviewers understand where deeper verification is needed.
Asking authors for short reasoning notes restores lost context that AI cannot provide.
Checking complexity and duplication on changed files before merge reduces future rework and stabilises the system over time.
Verification work is unevenly distributed across reviewers.
Managing this improves delivery pace and morale.
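A quick way to see the imbalance is to measure how concentrated review assignments are. The sketch below uses made-up reviewer names; in practice the assignment list would come from your Git host's API.

```python
# Sketch: what share of all reviews lands on the busiest few reviewers.
# Reviewer names and data are made up for illustration.
from collections import Counter

def review_load_share(assignments: list[str], top_n: int = 2) -> float:
    """Share of reviews handled by the `top_n` busiest reviewers."""
    if not assignments:
        return 0.0
    counts = Counter(assignments)
    busiest = sum(count for _, count in counts.most_common(top_n))
    return busiest / len(assignments)

reviews = ["asha", "asha", "lena", "asha", "marco", "lena", "asha", "asha"]
print(f"{review_load_share(reviews):.0%} of reviews land on two people")  # -> 88%
```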
Scheduled refactoring cycles remove duplicated code, reduce complexity, and restore architectural alignment.
New team members need to understand how AI-generated code behaves, not just how the system works.
Version, audit, and consolidate prompts to maintain consistent patterns.
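One lightweight way to do this is to keep prompts in a small versioned registry inside the repo, so that prompt changes are reviewed like any other code. The structure, names, and model pin below are assumptions, not a standard.

```python
# Sketch: a minimal in-repo prompt registry. Changing a template or the
# pinned model should bump the version and go through normal code review.
# All names and values here are illustrative.
PROMPTS = {
    "generate-service-tests": {
        "version": "2026-01-14",
        "model": "pinned-model-id",  # the model this prompt was tuned against
        "template": "Write table-driven tests for {module} following our error-handling conventions.",
    },
}

def render(name: str, **params: str) -> str:
    return PROMPTS[name]["template"].format(**params)

print(render("generate-service-tests", module="billing"))
```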
This framework supports sustainable delivery at scale.
Typo provides visibility into the signals that matter most in an LLM-heavy engineering organisation.
It focuses on system-level health, not individual scoring.
Typo identifies which parts of each PR were generated by AI and tracks how these sections relate to rework, defects, and review effort.
Typo highlights irrelevant or low-value suggestions and interactions, helping leaders reduce cognitive overhead.
Typo measures complexity and duplication at the file level, giving leaders early insight into architectural drift.
Typo surfaces rework loops, shifts in cycle time distribution, reviewer bottlenecks, and slowdowns caused by verification overhead.
Typo correlates developer sentiment with workflow data, helping leaders understand where friction originates and how to address it.
These capabilities help leaders measure what truly affects productivity in 2026 rather than relying on outdated metrics designed for a different era.
LLMs have transformed engineering work, but they have also created new challenges that teams cannot address with traditional metrics. Developers now play the role of validators and maintainers of probabilistic code. Reviewers spend more time reconstructing reasoning than evaluating syntax. Architectural drift accelerates. Teams generate more output yet experience more friction in converting that output into predictable delivery.
To understand productivity honestly, leaders must look beyond surface metrics and instrument the deeper drivers of system behaviour. This means tracking AI-origin code health, understanding verification load, and monitoring long-term stability.
Teams that adopt these measures early will gain clarity, predictability, and sustainable velocity.
Teams that do not will appear productive in dashboards while drifting into slow, compounding drag.
In the LLM era, productivity is no longer defined by how fast code is written.
It is defined by how well you control the system that produces it.