PLATFORM & TOOLING · APRIL 23, 2026 · 10 MIN READ

Cursor vs Copilot vs Claude Code: a pilot framework that does not rely on vendor metrics

The acceptance-rate number every vendor cites is a flattering metric. Five different measurements during a 4-week pilot give you a real read on which tool fits your team.

The numbers vendors cite during AI coding tool sales conversations are flattering metrics. Cursor's Habits Report from May 28 2026 reported a 36.3% no-manual-review acceptance rate as a sign of tool maturity. Copilot's marketing emphasizes acceptance rate as the headline. Claude Code's pitch leans on benchmark scores. All three numbers are real and all three are also reported in ways that benefit the vendor.

The acceptance-rate metric is the clearest example. Acceptance rate is the percentage of suggestions a developer accepts. It does not measure whether the accepted suggestion was right, whether the developer reread it later and changed it, whether the code survived in main, or whether a different tool would have produced a faster or smaller fix. Acceptance rate measures the developer pressing tab. A skilled vendor can tune their suggestion model to optimize for tab-presses without optimizing for code that ships and stays shipped.

The same applies to lines-of-code-accepted and time-to-completion. Both metrics ignore the downstream cost. The Stack Overflow 2026 Developer Survey reported that 43% of AI users describe their team's review and rework load as having grown faster than their output, which suggests that a meaningful fraction of accepted suggestions carries a downstream cost the acceptance rate does not catch.

A pilot framework that measures what actually matters has to look at the full lifecycle of the code, not the moment of acceptance.

The public reference point

Adevinta published the most rigorous public enterprise pilot in March 2026. It covered 77 engineers across multiple teams over 4 weeks, with structured task tracking and a comparison across Cursor, Claude Code, and Copilot. The findings are worth reading directly. The summary is that each tool had a distinct strength profile and the overall winner depended heavily on the team's existing workflow. Claude Code was strongest on multi-file refactors. Cursor was strongest on intra-file completion in IDEs the team already used. Copilot was strongest on familiarity and breadth of language support.

The takeaway is not which tool to buy. The takeaway is that the pilot methodology produced a defensible answer in 4 weeks with measurable inputs. The framework below is a generalization of Adevinta's approach designed for teams that do not have a dedicated research function.

The five measurements

Each one is measurable without specialized tooling. Each one resists vendor optimization. Each one captures a part of the developer lifecycle that acceptance rate ignores.

1. Same-task time delta

Pick 5 to 8 representative tasks before the pilot starts. They should be tasks your engineers do every week: implement a CRUD endpoint, refactor an existing function, add a new field to a schema, write a test for an existing module, fix a known class of bug. During the pilot, randomly assign engineers to complete each task with each tool. Track time from task assignment to PR merge. The metric is the median time per task per tool, compared as a delta against a baseline you fix before the pilot starts (for example, the tool the team uses today). Eight to ten task completions per tool per condition gives you a defensible read.

This catches the case where a tool is fast at autocomplete but slow at the surrounding work (running tests, fixing the lint failure, reading the diff). The metric is full-task time, not edit time.
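As a rough illustration, the per-task medians and deltas could be computed from the tracking spreadsheet along these lines. The record fields (`task`, `tool`, `assigned_at`, `merged_at`), the tool names, and the choice of `tool_a` as the baseline are all assumptions made for this sketch, not part of any vendor's export format.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def median_hours_by_task_and_tool(records):
    """Return {(task, tool): median hours from task assignment to PR merge}."""
    durations = defaultdict(list)
    for rec in records:
        assigned = datetime.fromisoformat(rec["assigned_at"])
        merged = datetime.fromisoformat(rec["merged_at"])
        hours = (merged - assigned).total_seconds() / 3600
        durations[(rec["task"], rec["tool"])].append(hours)
    return {key: median(values) for key, values in durations.items()}


def delta_vs_baseline(medians, baseline_tool):
    """Return {(task, tool): median hours minus the baseline tool's median}."""
    deltas = {}
    for (task, tool), hours in medians.items():
        base = medians.get((task, baseline_tool))
        if tool != baseline_tool and base is not None:
            deltas[(task, tool)] = hours - base
    return deltas


# Hypothetical rows; in practice these come from the tracking spreadsheet.
records = [
    {"task": "crud-endpoint", "tool": "tool_a",
     "assigned_at": "2026-04-01T09:00", "merged_at": "2026-04-01T13:00"},
    {"task": "crud-endpoint", "tool": "tool_a",
     "assigned_at": "2026-04-02T09:00", "merged_at": "2026-04-02T15:00"},
    {"task": "crud-endpoint", "tool": "tool_b",
     "assigned_at": "2026-04-01T09:00", "merged_at": "2026-04-01T12:00"},
    {"task": "crud-endpoint", "tool": "tool_b",
     "assigned_at": "2026-04-03T09:00", "merged_at": "2026-04-03T11:00"},
]

medians = median_hours_by_task_and_tool(records)
deltas = delta_vs_baseline(medians, baseline_tool="tool_a")
print(deltas)
```

A negative delta means the tool finished the task faster than the baseline. Because the clock runs from assignment to merge, the number includes test runs, lint fixes, and review wait, which is the point of the measurement.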

2. Post-merge fix-forward rate per author

For every PR that merges during the pilot, track whether the same author opens a fix-forward PR within 7 days of the merge that touches at least one of the same files. The metric is the percentage of merged PRs that get a fix-forward, by tool used.

This catches the case where a tool produces code that passes review but does not survive contact with reality. A higher fix-forward rate means the tool's output is shifting work later in the cycle, not eliminating it.
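One way to approximate this from exported PR data is sketched below. It treats any follow-up PR by the same author, opened within 7 days of the merge and sharing at least one file, as a fix-forward. That proxy is an assumption of this sketch; a team that labels fix PRs explicitly would filter on the label instead. The field names (`author`, `tool`, `opened_at`, `merged_at`, `files`) are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

FIX_WINDOW = timedelta(days=7)


def fix_forward_rate_by_tool(prs):
    """Return {tool: share of merged PRs followed by a same-author fix-forward}."""
    merged = defaultdict(int)
    fixed = defaultdict(int)
    for pr in prs:
        merged_at = datetime.fromisoformat(pr["merged_at"])
        merged[pr["tool"]] += 1
        for other in prs:
            if other is pr or other["author"] != pr["author"]:
                continue
            opened_at = datetime.fromisoformat(other["opened_at"])
            in_window = merged_at < opened_at <= merged_at + FIX_WINDOW
            if in_window and set(other["files"]) & set(pr["files"]):
                fixed[pr["tool"]] += 1
                break  # count each merged PR at most once
    return {tool: fixed[tool] / merged[tool] for tool in merged}


# Hypothetical PR export.
prs = [
    {"author": "alice", "tool": "tool_a", "opened_at": "2026-04-01",
     "merged_at": "2026-04-02", "files": ["api/users.py"]},
    {"author": "alice", "tool": "tool_a", "opened_at": "2026-04-05",
     "merged_at": "2026-04-06", "files": ["api/users.py"]},
    {"author": "bob", "tool": "tool_b", "opened_at": "2026-04-01",
     "merged_at": "2026-04-02", "files": ["db/schema.py"]},
    {"author": "bob", "tool": "tool_b", "opened_at": "2026-04-20",
     "merged_at": "2026-04-21", "files": ["db/schema.py"]},
]

rates = fix_forward_rate_by_tool(prs)
print(rates)
```

In this sample, the second PR from `alice` lands inside the 7-day window and touches the same file, so one of the two `tool_a` PRs counts as fixed forward. The second PR from `bob` falls outside the window and does not count.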

3. Review-cycle multiplier

For every PR opened during the pilot, count the number of full review cycles before merge. A full cycle is one round of review comments followed by one author response. The metric is the median cycles per PR by tool.

This catches the case where a tool produces output that reviewers consistently push back on. A 3-cycle median is twice the review tax of a 1.5-cycle median. The cost shows up in reviewer time, not author time.
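A minimal sketch of the counting rule, assuming each PR's timeline has been reduced to an ordered list of `"review"` and `"response"` events (a hypothetical encoding; most code hosts expose review and commit timestamps that can be mapped to it). A cycle is counted each time an author response follows a review round, so a final approving review with no follow-up does not add a cycle.

```python
from collections import defaultdict
from statistics import median


def count_review_cycles(events):
    """Count review rounds that were followed by an author response."""
    cycles = 0
    awaiting_response = False
    for event in events:
        if event == "review":
            awaiting_response = True
        elif event == "response" and awaiting_response:
            cycles += 1
            awaiting_response = False
    return cycles


def median_cycles_by_tool(prs):
    """Return {tool: median review cycles per PR}."""
    cycles = defaultdict(list)
    for pr in prs:
        cycles[pr["tool"]].append(count_review_cycles(pr["events"]))
    return {tool: median(values) for tool, values in cycles.items()}


# Hypothetical PR timelines.
prs = [
    {"tool": "tool_a", "events": ["review", "response"]},
    {"tool": "tool_a", "events": ["review", "response", "review", "response"]},
    {"tool": "tool_b", "events": ["review", "response"] * 3},
    {"tool": "tool_b", "events": ["review", "review", "response",
                                  "review", "response", "review", "response"]},
]

cycle_medians = median_cycles_by_tool(prs)
print(cycle_medians)
```

Consecutive reviews from different reviewers before the author responds are folded into a single round here. Whether that matches how your team experiences review load is a judgment call worth settling before the pilot starts.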

4. Weekly engineer-reported context-switch cost

At the end of every pilot week, ask each engineer two questions:

How many times this week did you switch from coding to a non-coding task because the tool's output required you to investigate something unexpected? (number)
How many minutes did each switch cost on average? (rough estimate)

Aggregate at the team level. The metric is the median total switch-cost minutes per engineer per week by tool.

This catches the shadow tax of the tool. A tool that is fast at the moment of suggestion but produces code that requires the engineer to drop into the docs, the GitHub issues, or a Slack channel to understand is moving cost out of the IDE and into the calendar. The Pragmatic Engineer 2026 AI Tooling Survey found that engineers using AI heavily reported a 1.5 to 3 hour weekly shadow tax. Pilots that do not measure it underweight a real cost.
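The aggregation itself is small. The sketch below assumes one survey row per engineer per week, with the two answers stored as `switches` and `avg_minutes`; those field names and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_switch_minutes_by_tool(responses):
    """Return {tool: median of (switches * avg_minutes) per engineer-week}."""
    totals = defaultdict(list)
    for row in responses:
        totals[row["tool"]].append(row["switches"] * row["avg_minutes"])
    return {tool: median(values) for tool, values in totals.items()}


# Hypothetical weekly survey answers.
responses = [
    {"engineer": "alice", "week": 1, "tool": "tool_a", "switches": 4, "avg_minutes": 15},
    {"engineer": "bob", "week": 1, "tool": "tool_a", "switches": 6, "avg_minutes": 20},
    {"engineer": "carol", "week": 1, "tool": "tool_a", "switches": 2, "avg_minutes": 10},
    {"engineer": "dave", "week": 1, "tool": "tool_b", "switches": 8, "avg_minutes": 15},
    {"engineer": "erin", "week": 1, "tool": "tool_b", "switches": 5, "avg_minutes": 30},
]

switch_cost = median_switch_minutes_by_tool(responses)
print(switch_cost)
```

Self-reported estimates are noisy, which is one reason to use the median rather than the mean: a single engineer with one very bad week does not move the number much.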

5. Code-survival rate at 30 days

Tag every PR merged during the pilot with the tool used. At day 30 after merge, query for the percentage of lines from each PR that still exist unchanged in main. The metric is the survival rate by tool.

This catches the case where a tool produces code that ships but gets rewritten. A tool with high acceptance and low survival is producing slop. A tool with lower acceptance and high survival is producing more durable code per accepted suggestion.

Tracking this requires a git-history query. The tooling for it is one small script; jscpd-history, `git blame`, or a Python script using GitPython all work.
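As one possible shape for that script, the sketch below uses `git blame --line-porcelain` and counts the lines still attributed to a PR's commits. It rests on several assumptions: you know which commit SHAs landed each PR (for example, the squash commit), the repository uses 40-character SHA-1 hashes, and the count of lines the PR originally added comes from the PR diff and is passed in. Blame attributes a line to the last commit that changed it, so a moved or reformatted line counts as not surviving, which makes this an approximation of "unchanged".

```python
import subprocess

HEX_DIGITS = set("0123456789abcdef")


def count_surviving_lines(blame_porcelain, pr_commits):
    """Count lines in `git blame --line-porcelain` output attributed to pr_commits."""
    surviving = 0
    current_sha = None
    for line in blame_porcelain.split("\n"):
        if line.startswith("\t"):
            # Content line: belongs to the commit named in the preceding header.
            if current_sha in pr_commits:
                surviving += 1
            continue
        first = line.split(" ", 1)[0]
        if len(first) == 40 and set(first) <= HEX_DIGITS:
            current_sha = first  # header line: "<sha> <orig_line> <final_line> ..."
    return surviving


def blame_file(repo_dir, path, rev="main"):
    """Run git blame on one file at `rev` and return the porcelain text."""
    result = subprocess.run(
        ["git", "blame", "--line-porcelain", rev, "--", path],
        cwd=repo_dir, capture_output=True, check=True,
        encoding="utf-8", errors="replace",
    )
    return result.stdout


def survival_rate(surviving_lines, lines_added):
    """Share of a PR's added lines that blame still attributes to the PR."""
    return surviving_lines / lines_added if lines_added else 0.0


# Hand-written porcelain sample so the parser can be checked without a repo.
pr_sha = "a" * 40
other_sha = "b" * 40
sample = (
    f"{pr_sha} 1 1 2\nauthor Alice\nsummary Add endpoint\n"
    "filename api/users.py\n\tdef get_user():\n"
    f"{pr_sha} 2 2\nauthor Alice\nsummary Add endpoint\n"
    "filename api/users.py\n\t    return None\n"
    f"{other_sha} 3 3 1\nauthor Bob\nsummary Rewrite lookup\n"
    "filename api/users.py\n\t    return lookup()\n"
)

surviving = count_surviving_lines(sample, {pr_sha})
rate = survival_rate(surviving, lines_added=4)  # assume the PR added 4 lines
print(surviving, rate)
```

For a real run, call `blame_file` for every file the PR touched, sum the surviving lines, and divide by the PR's total added lines. Files deleted since the merge make `git blame` fail, so a production version would need to catch that and treat those lines as not surviving.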

The combined score

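After 4 weeks, plus the 30-day survival window for the last PRs merged, you have five numbers per tool. Normalize each to a 0-100 scale (best observed is 100, worst is 0) and average. Four of the five are lower-is-better (time, fix-forward rate, review cycles, switch cost) and survival is higher-is-better, so flip the direction where needed before scaling. The tool with the highest combined score wins your pilot.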
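The normalization and averaging can be sketched as follows. The metric names, tool names, and numbers are invented for illustration. Treating a metric on which every tool ties as 100 for all of them is a choice made for this sketch, and the equal weighting follows the plain average described above.

```python
def normalize(values, higher_is_better):
    """Min-max scale {tool: value} to 0-100, where 100 is the best observed."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {tool: 100.0 for tool in values}  # all tools tied on this metric
    scores = {}
    for tool, value in values.items():
        frac = (value - lo) / (hi - lo)
        scores[tool] = 100 * (frac if higher_is_better else 1 - frac)
    return scores


def combined_scores(metrics, higher_is_better):
    """Average the normalized per-metric scores with equal weights."""
    per_metric = [normalize(metrics[name], higher_is_better[name]) for name in metrics]
    tools = per_metric[0].keys()
    return {tool: sum(m[tool] for m in per_metric) / len(per_metric) for tool in tools}


# Hypothetical pilot results for three tools.
metrics = {
    "median_task_hours": {"tool_a": 6, "tool_b": 2, "tool_c": 4},
    "fix_forward_pct": {"tool_a": 10, "tool_b": 30, "tool_c": 20},
    "median_review_cycles": {"tool_a": 1.5, "tool_b": 3.0, "tool_c": 1.5},
    "switch_minutes": {"tool_a": 60, "tool_b": 120, "tool_c": 90},
    "survival_pct": {"tool_a": 90, "tool_b": 70, "tool_c": 80},
}
higher_is_better = {
    "median_task_hours": False,
    "fix_forward_pct": False,
    "median_review_cycles": False,
    "switch_minutes": False,
    "survival_pct": True,
}

scores = combined_scores(metrics, higher_is_better)
print(scores)
```

In this made-up data, `tool_b` is the fastest on task time but worst on every downstream measure, so it ends up with the lowest combined score. With only two or three tools, min-max scaling stretches small gaps to the full 0-100 range, so it is worth reading the raw numbers next to the score.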

The vendor will not lead with these numbers. Some of them, the vendor cannot produce because they require data only your team has (your file paths, your reviewer comments, your PR history). The combined score is defensible because every input is something your engineering leadership team can verify independently. It is also defensible because the inputs are tied to the work the team actually does, not to a benchmark task that may or may not look like your codebase.

What to do this week

If you are evaluating an AI coding tool in the next 90 days, do not run a free trial without a measurement framework. Pick the 5-8 representative tasks. Decide which engineers will participate. Set up the tracking spreadsheet. Then start the trial with the metrics in place.

If you are already inside a pilot and have not been measuring these, you can start the measurement framework today and run it for 2 more weeks. Two weeks of data on the full framework is more useful than 8 weeks of acceptance-rate data.

The vendors all want a fast decision based on the easiest metric. The framework above forces a slower decision based on the metrics that matter. The slower decision saves you from the cost of switching tools 9 months from now, when the acceptance-rate number turns out to have been the wrong thing to optimize.