AI Users Merged 98% More PRs. Metrics Didn't Move | JAM

By Jordan Hauge — Published December 22, 2025 — Category: Engineering Velocity, AI Coding Productivity

AI is making individual developers look more productive. Faros AI found that AI users merged 98% more PRs and completed 21% more tasks, yet observed no significant correlation between AI adoption and company-level delivery improvements. The gains are real. They're just getting absorbed by review bottlenecks, coordination overhead, and the same weak decision-making that plagued teams before AI showed up. If your exec team expects "faster coding" to translate into better outcomes automatically, 2026 is going to be a rough year.

The part nobody wants to say out loud

AI is making individual developers look more productive. That does not mean the business is getting more productive.

Faros AI analyzed activity from 10,000+ developers across 1,255 teams and found that developers using AI tools merged 98% more pull requests and completed 21% more tasks.

If you stop the story there, you get a victory lap.

But the punchline is the one leaders should care about: Faros observed no significant correlation between AI adoption and improvements in company-level delivery outcomes. Not in throughput. Not in DORA metrics. Not in quality KPIs.

The gains were real. They just didn't show up where executives expect to see them.

So where did they go? Faros gives you a clue: PR review time increased 91%.

Translation: AI accelerates the "writing" part of delivery. But it also expands what the organization has to review, validate, coordinate, and absorb.

Your developers got faster. Your review bottleneck got fatter. Net effect? A wash. Sometimes worse.

That's not a tooling problem. That's a systems problem. And no number of Copilot licenses fixes a systems problem.
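To make that concrete, here's a back-of-envelope sketch in Python. The numbers are made up for illustration (they are not Faros's data); the shape of the argument is the point: end-to-end delivery is gated by the slowest stage, not the fastest one.

```python
# Back-of-envelope pipeline model. Capacities are invented, illustrative figures;
# the structural point is that delivery is capped by the review stage.

def delivered_per_week(authoring_capacity: float, review_capacity: float) -> float:
    """PRs that actually clear the pipeline each week: the minimum of the two stages."""
    return min(authoring_capacity, review_capacity)

# Hypothetical team, pre-AI: authoring and review are roughly balanced.
before = delivered_per_week(authoring_capacity=20, review_capacity=20)

# Same team with AI assistance: authors produce ~2x the PRs, but review capacity
# hasn't grown, and each review now takes longer, so it effectively shrinks.
after = delivered_per_week(authoring_capacity=40, review_capacity=18)

print(before, after)  # 20 -> 18: individual output doubled, delivered work didn't move
```

Swap in your own stage capacities; the conclusion only changes if review capacity grows along with authoring capacity.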
The productivity paradox is real. And measured.

It helps to separate three things that get conflated in board discussions. Output is what gets produced. Outcomes are what changes for the customer or business. Risk is what now must be verified, monitored, and supported.

The evidence increasingly shows output is rising. Stack Overflow's 2025 survey indicates AI tool usage is now baseline among professional developers. JetBrains' 2025 research reports meaningful time savings for developers using AI tools. This isn't early adopter territory anymore. It's table stakes.

But trust has not caught up. Stack Overflow's same survey shows only about 3% of developers report "high" trust in AI output. Distrust is common. And distrust isn't just an opinion. Distrust creates work: reviews, verification, testing, rework.

Here's the study that should keep every productivity evangelist honest. METR ran a randomized evaluation with experienced open-source developers using AI tools on real tasks in their own repositories. On average, developers using the tools took 19% longer.

But here's the part that's actually alarming: participants believed AI tools made them faster. By a lot. Perception != Reality.

That perception gap is exactly how organizations end up making investment decisions based on vibes instead of measurement. Which explains a lot of the pitch decks I've seen this year.

This doesn't mean AI tools are bad. It means AI shifts where work shows up in the system. And if you're only measuring part of the pipeline, you're probably telling yourself a nice story that isn't true.

Most features still don't matter. That's the real multiplier.

Here's the benchmark that should reframe every "we can ship twice as fast" celebration: Pendo reports that roughly 6.4% of features drive 80% of click volume. Userpilot's 2024 benchmark across 181 companies found a median core feature adoption rate of 16.5%. Median. Half of all products sit below that line.

So when a team celebrates "more shipped," the cold question product leaders should ask is: Is our hit rate improving, or are we scaling waste?

Because if your feature effectiveness is already low, increased shipping velocity does something unintuitive. It doesn't multiply impact. It multiplies the graveyard.

I've watched teams ship three features in the time it used to take to ship one, then sit in a quarterly review wondering why their numbers didn't move. The math isn't complicated. Three times zero is still zero.

That's not a moral judgment. It's arithmetic. And it's the calculation most organizations avoid because it's more fun to count deployments than to admit most of them didn't matter.
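If you want that arithmetic spelled out, here's a minimal sketch. The 16.5% hit rate is Userpilot's median benchmark cited above; every other number is an invented round figure for illustration, not data from any of the studies mentioned.

```python
# Expected outcome impact = features shipped x hit rate x impact per hit.
# Illustrative only: 16.5% is the median adoption benchmark cited above,
# the rest are made-up round numbers.

def expected_impact(features_shipped: int, hit_rate: float, impact_per_hit: float) -> float:
    """How much the dashboard should move, in arbitrary impact units."""
    return features_shipped * hit_rate * impact_per_hit

steady_team = expected_impact(features_shipped=4, hit_rate=0.165, impact_per_hit=1.0)

# Ship 3x as much, but skip validation so the hit rate sags to a third of the median...
fast_team = expected_impact(features_shipped=12, hit_rate=0.055, impact_per_hit=1.0)

# ...and a hit rate of zero is the literal "three times zero" case.
graveyard = expected_impact(features_shipped=12, hit_rate=0.0, impact_per_hit=1.0)

print(round(steady_team, 2), round(fast_team, 2), graveyard)  # 0.66 0.66 0.0
```

Velocity only multiplies impact if the hit rate holds. If validation discipline slips as shipping speeds up, the extra output feeds the graveyard, not the dashboard.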
The actual bottleneck isn't coding. It's decision quality under compressed cycles.

Andrew Ng made this point on the "No Priors" podcast earlier this year: "Things that used to take six engineers three months to build, my friends and I, we'll just build on a weekend. The bottleneck is deciding what do we actually want to build."

That's the right frame. And the implication for 2026 is significant.

When prototypes take a day to build and meaningful learning takes a week to gather, validation starts to feel "slow." The temptation is to treat shipping speed as proof of progress. That's how teams become demo-rich, shipping-heavy, and outcome-poor.

The failure mode isn't incompetence. It's economics. When building becomes cheaper, weak decisions become cheaper to execute. But they don't become cheaper to unwind. You still have to support that feature. Explain it to customers. Maintain the code. Carry it forward in every future decision. The bill arrives later. With interest.

The PM squeeze is real. That's why this matters now.

Here's what makes this more than a generic "PMs are important" argument: product roles are under scrutiny at the same time decision quality is becoming more consequential.

Microsoft has publicly discussed reducing management layers and increasing "builder ratios" as part of broader restructuring. More engineers relative to non-coding roles. LinkedIn reportedly discontinued its Associate Product Manager program and described a shift toward "full-stack builders."

These aren't isolated moves. They're a pattern.

So yes, the case for product judgment is stronger than it's been in years. But the tolerance for PM-as-process-overhead is lower than it's been in years.

That means Product Management isn't automatically more valuable in 2026. Good product judgment is more valuable. There's a difference. And if you're a PM who can't articulate that difference clearly, you're probably in someone's crosshairs right now.

"I manage the backlog" reads like overhead. "I reduce decision risk and increase outcome certainty" reads like leverage. One of those makes a CFO nervous. The other makes a CFO interested. Choose accordingly.

What changes in 2026

These are predictions. They could be wrong. But they follow directly from the incentives and evidence above.

Output inflation becomes normal. If output increases at the individual level, leadership will expect more shipped work. The baseline rises. Teams that don't upgrade decision quality will simply ship more low-impact work, faster. And they'll be genuinely confused when the dashboard doesn't move.

Verification becomes a real operating cost. When trust is low, someone pays for verification. Reviews, tests, rework, incident handling. That pressure doesn't vanish with better models. It changes shape. The organization still needs accountable humans who can say "this is ready" and mean it.

Product leadership differentiates by operating mechanisms, not artifacts. PRDs and clearly written tickets won't be the separator. Decision systems will. Prioritization under uncertainty. Validation proportional to risk. Measurement that actually changes future decisions. None of that is new. It's the original Product Management job. It's just under time compression now.

Five questions to pressure-test your velocity

Use these in your next leadership review. Or your next planning session. Or the next time someone presents a roadmap with 47 items and calls it "focused."

1. What specific customer problem are we solving this quarter? State it in one sentence without using the word "platform." I've watched executives struggle with this for twenty minutes. It's harder than it sounds.

2. Which segment changes behavior if we get this right? Not "users." Which users? If you can't name them specifically, you don't actually know them.

3. What metric moves, and what would we accept as "no effect"? If you can't define failure, you can't learn from it. And you're probably not measuring what you think you're measuring.

4. What are we not doing because we chose this? Real prioritization has casualties. Name them. If everything is a priority, nothing is. You've heard that before. It still holds true.

5. What's the smallest test that preserves learning quality? Not the smallest thing you can ship. The smallest thing that tells you whether you're right.

If you can't answer these quickly, that's your diagnosis. The hard part was never typing code. The hard part is knowing what's worth typing.

The bottom line

AI didn't create the "build the wrong thing" problem. Feature graveyards existed long before Copilot. I've seen plenty of them that were hand-crafted with artisanal, human-written code. Just check out https://killedbygoogle.com/ to see a ton of Google initiatives that ended up going six feet under.

What AI changes is the speed at which you can make expensive mistakes.

In 2026, the winners won't be the teams that ship the most. They'll be the teams that ship the right things, with discipline around learning, sequencing, and measurement. That requires something AI still can't automate: judgment about what matters, willingness to kill weak ideas early, and someone accountable when things don't work.

The job title hasn't changed. The tolerance for faking it has.

Where JAM fits

We work with teams where delivery capacity is increasing but decision clarity isn't keeping up. Usually AI is in the mix. Usually expectations have risen faster than the system that's supposed to guide them.

If you're shipping more than ever and company-level metrics are flat, that's not a talent problem. It's a decision system problem. We fix those.