In late 2024, I sat down to prototype a Chrome extension. I expected the task to take days of tedious boilerplate. Instead, I used Cursor, an AI-native editor we had just started testing after trialing GitHub Copilot. I gave it a prompt, and in two minutes it generated an end-to-end extension that worked on the first try.
That was my "Innovation Trigger". I spent the next 10 days building 15 different iterations of browser automation prototypes. I became a "drum-beater" for AI, annoying my colleagues with my excitement. But as Keshav and I rolled out AI tools like Cursor, Windsurf, BugBot, and CodeRabbit across our 250+ person engineering team, we learned that the path from simple adoption to true engineering mastery was filled with surprising and counter-intuitive lessons.
Here is exactly how we moved past the "vanity metrics" to find the real truth about AI productivity at Apollo, whose 10-year-old monolith repository sees daily contributions from 250+ engineers. This post covers three specific things:
- The specific metrics we tracked 🧮
- What worked and what didn't 📚
- The actual productivity gains we measured 📈
We understand that these findings may not generalize beyond our specific setup; smaller and less restrictive codebases and teams may have different experiences. But our AI tooling journey offers useful data points for similar organizations trying to figure out their path forward in AI tooling adoption.
The Reality Check: 15% Gains, Not 10x Hype
The industry promised us 10x Engineers. After a year of measurement across 250+ engineers, the data showed us something different: we didn't get 10x engineers; we unlocked a 1.15x engineering organization.
Let's start with the bottom line. After a full year of measurement and 92% weekly active usage of Cursor, here's what we found:
- Perceived Velocity: Engineers reported ~15% productivity improvements (with some power users hitting 2-3x)
- Actual Throughput: Cycle time (first commit to deployment) remained flat
- Sentiment: The average "Self-reported AI Speed Impact" score (engineers rate each PR from 0 to 5 on how much of a speed boost they got from AI tools) was 2.23 out of 5
Why the discrepancy?
We found that AI is incredible at collapsing the time required for discrete tasks, but it cannot yet accelerate the entire software delivery lifecycle. It can write a test in seconds, but it cannot speed up our CI pipeline, bypass necessary code reviews, or resolve architectural debates on a 10-year-old monolith.
When we isolated specific coding activities, the gains were undeniable:
| Task | Before AI | After AI | Measured Improvement |
|---|---|---|---|
| Test generation | 30 minutes | 5 minutes | 6x faster |
| Boilerplate code | Hours | Seconds | Instant |
| Technical unblocking | Days | Single sessions | Qualitative shift |
Key finding: AI tools provided measurable speed improvements for discrete coding tasks, but these gains didn't compound into proportional increases in overall engineering velocity.
Why Measuring Adoption Masked the Truth
We kicked off 2025 with a clear target: 80%+ adoption of Cursor. To achieve this, we launched the Cursor Champions Committee, led by Keshav. We asked for volunteers from across engineering to drive adoption on their teams, identifying engineers with high curiosity and ownership to lead the charge.
The strategy worked. By the end of Q1, we hit 85% Weekly Active Users (106% of our goal), with Monthly Active Users exceeding 90%.
Initially, this felt like a massive victory. But as we dug into the data, we realized that "Active User" was a vanity metric. It counted an engineer pressing Tab for a single autocomplete the same as an engineer using AI Agents to architect a complex solution. We were measuring activity, not value.
To get to the truth, we bypassed the standard vendor dashboards. We exported Cursor's raw usage logs into Snowflake and built a Weighted Effectiveness Score to differentiate between passive usage and active problem solving:
USAGE_SCORE = (AGENT_REQUESTS × 2.0) + (APPLIED_SUGGESTIONS × 1.0) + (TABS_SHOWN × 0.05)
The multipliers reflected intent:
- Agent requests (2.0x): High intent, active problem-solving
- Applied suggestions (1.0x): Standard acceptance behavior
- Tabs shown (0.05x): Passive exposure
Crucially, we applied logarithmic smoothing (ln(1 + n)) over a rolling 4-month window for each input. This ensured that daily usage spikes were dampened, making the score a measure of sustained habit formation rather than sporadic bursts.
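As a concrete illustration, here is a minimal Ruby sketch of one way to compute such a score from daily event counts. It is a simplified stand-in rather than our actual Snowflake implementation; the input shape and the choice to sum smoothed daily values across the window are assumptions based on the description above.

```ruby
# Simplified sketch of the weighted effectiveness score, not the actual
# Snowflake implementation. `daily_counts` is assumed to be one hash of
# event counts per day over the rolling 4-month window, e.g.
#   { agent_requests: 12, applied_suggestions: 40, tabs_shown: 600 }
WEIGHTS = { agent_requests: 2.0, applied_suggestions: 1.0, tabs_shown: 0.05 }.freeze

def usage_score(daily_counts)
  daily_counts.sum do |day|
    WEIGHTS.sum do |event, weight|
      # ln(1 + n) dampens single-day spikes, so the score rewards sustained
      # habit formation rather than sporadic bursts of usage.
      weight * Math.log(1 + day.fetch(event, 0))
    end
  end
end
```

Because ln(1 + n) is concave, summing smoothed daily values rewards consistency: an engineer who uses Agents every day scores higher than one with a single large burst, even if their raw totals match.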
This allowed us to segment our engineers into five cohorts:
| Cohort | % of Team | Usage Score Range |
|---|---|---|
| Super Power Users | 2% | >10,000 |
| Power Users | 8% | 6,000-10,000 |
| Active Users | 41% | Moderate |
| Moderate Users | 29% | Low |
| Passive Users | 20% | Minimal |
The data showed that only 2% of our engineers were using AI tools to their full potential. The remaining 98% of our "adopted" user base was using the tools at a much more basic level.
Key takeaway: Standard adoption metrics (WAU, MAU) masked significant variation in usage depth. Custom effectiveness scoring was necessary to understand actual value delivery.
Frontend and Backend Showed Significantly Different Results
When we segmented our data by engineering discipline, we found different patterns.
Frontend Engineers (Super Power Users):
- PR velocity increased from ~5 PRs/month to 16-20 PRs/month
- This represented a 3-4x improvement in output
- Self-reported AI impact scores (submitted via a PR description template) showed strong correlation with usage intensity
- Cycle time and lines of code changed were inconclusive
Backend Engineers:
- PR velocity showed no consistent correlation with AI usage
- Super Power Users didn't significantly outperform others in quantitative metrics
- Lines of code changed and cycle time were highly volatile
- However, Power Users did report higher perceived benefits in self-reported metrics
We investigated why Frontend showed clear gains while Backend didn't. The primary factor was context availability and how engineers were using context in their prompts.
Frontend teams had:
- Comprehensive `.cursorrules` files
- Well-documented component libraries
- Standardized patterns and conventions
- Domain-specific prompts
- LLMs trained on more JavaScript code
- Super Power Users who had mastered the art of providing the right context in the right scenario
Backend teams working in Ruby/Rails and infrastructure had:
- One basic `.cursorrules` file (vs. multiple in Frontend)
- Sparse documentation
- Less standardized patterns (although more modularity via controllers, models, and a proper class hierarchy)
- Infrastructure-specific complexity that wasn't well captured
- LLMs trained on less Ruby than JavaScript
- No Super Power Users
This led to one of our core principles: AI tools are context amplifiers. They can help bootstrap documentation, but they require grounding to be effective.
Critical insight: Productivity gains from AI tools correlated strongly with available context. Teams with better documentation and standardization saw significantly better results.
Speed Improvements Introduced Quality Concerns
By Q3, we had achieved high adoption and real speed improvements, but our surveys surfaced quality concerns. We observed specific patterns:
- Auto-generated tests sometimes validated incorrect behavior
- Unit tests used random mocks to increase coverage without testing meaningful functionality
- Engineers merged AI-generated code without adequate review
- Technical debt accumulated faster than anticipated
To address this, we shifted our focus from "Speed" to "Guardrails." We implemented four specific measures:
- Custom lint rules: Deployed custom RuboCop rules to catch AI-specific patterns such as redundant T::Sig usage (a minimal sketch follows this list). These block CI and log violations to our dashboard, ensuring we track and resolve "AI noise" before it merges.
- Quality gates: Wired GitHub Actions to our AI code-quality rubric to fail PRs when AI-generated diffs add known issues (weak error handling, missing tests, or violations of .cursorrules standards).
- Violation tracking: Introduced a Cumulative Improvement Score (CIS) to track RuboCop/ESLint violations in AI-generated code and drive team-level cleanup.
- PR review automation: Scaled BugBot coverage from 50% to 92% so AI review catches subtle logic and runtime bugs, freeing human reviewers to focus on architecture.
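To make the lint-rule idea concrete, here is a minimal sketch of a RuboCop cop in this spirit: it flags a second `extend T::Sig` inside the same class or module body, one simple form of redundant T::Sig usage. The cop name and the exact pattern are illustrative assumptions, not the actual cops we deployed.

```ruby
# Illustrative only: flags a duplicate `extend T::Sig` within one class or
# module body. Shows the general shape of an "AI noise" cop, nothing more.
require 'rubocop'

module RuboCop
  module Cop
    module Apollo
      class RedundantSigExtend < Base
        MSG = 'Redundant `extend T::Sig`; this scope already extends T::Sig.'

        # Matches `extend T::Sig` (and `extend ::T::Sig`).
        def_node_matcher :sig_extend?, <<~PATTERN
          (send nil? :extend (const (const {nil? cbase} :T) :Sig))
        PATTERN

        def on_class(node)
          check_body(node.body)
        end
        alias on_module on_class

        private

        def check_body(body)
          return unless body

          extends = body.each_child_node(:send).select { |send_node| sig_extend?(send_node) }
          # Keep the first `extend T::Sig`; flag every repeat after it.
          extends.drop(1).each { |send_node| add_offense(send_node) }
        end
      end
    end
  end
end
```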
Important realization: Speed gains without quality guardrails created technical debt. Systematic quality measures were necessary to make AI adoption sustainable.
Our Metrics Evolution: From Vanity to Effectiveness
We deliberately used simple metrics early, then evolved them as we matured. Here's the progression:
| Stage | Duration | Primary Goal | Metric | Why This Metric |
|---|---|---|---|---|
| Adoption | Months 1-3 | Build habits | WAU, Lines of Code | Needed to establish consistent usage |
| Quality | Months 3-6 | Build trust | Survey Sentiment, Revert Rate | Validate that gains weren't creating problems |
| Effectiveness | Months 6+ | Optimize value | Custom Usage Score | Distinguish high-value from low-value usage |
In Q2, we tracked "AI-generated lines accepted" and set a goal to increase it by 70%. We knew this metric could be gamed (a formatter run, for example, could artificially inflate the count), but at that stage we prioritized habit formation over precision.
By Q3, we had transitioned to the weighted effectiveness score that differentiated between agent requests (high intent) and tab completions (passive). This gave us actionable data on who was using AI tools effectively versus who was just using them.
The Integration Breakthrough
However, generating this score and correlating it with actual productivity required data that wasn't available in any single dashboard. We realized that vendor analytics were optimized for showcasing usage, not ROI.
To solve this, we built a custom analytics pipeline in Snowflake that joined three distinct datasets:
- Usage: Raw Cursor logs (identifying Agent vs. Tab usage)
- Impact: Cycle time and revert rates from GitHub
- Trust: Per-PR impact scores and quarterly trust surveys
This triangulation was critical. It allowed us to validate claims, checking if engineers who "felt faster" (high sentiment) were actually "shipping faster" (low cycle time).
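To illustrate what that join buys us, here is a simplified, in-memory Ruby sketch. The real pipeline is a Snowflake join across the three exports, and every field and key name below is an assumption rather than our actual schema.

```ruby
# Simplified, in-memory stand-in for the Snowflake join. Keys and field
# names are illustrative, not our actual schema.
# usage:  { "alice" => 8_200, ... }                                          # weighted effectiveness score
# impact: { "alice" => { cycle_time_hours: 36.0, revert_rate: 0.02 }, ... }  # from GitHub
# trust:  { "alice" => { ai_speed_boost_impact: 3.4 }, ... }                 # from PR tags / surveys
def triangulate(usage, impact, trust)
  usage.keys.filter_map do |engineer|
    next unless impact.key?(engineer) && trust.key?(engineer)

    {
      engineer: engineer,
      usage_score: usage[engineer],
      cycle_time_hours: impact[engineer][:cycle_time_hours],
      revert_rate: impact[engineer][:revert_rate],
      perceived_speed_boost: trust[engineer][:ai_speed_boost_impact]
    }
  end
end

# "Felt faster" vs. "shipped faster": surface engineers whose high sentiment
# isn't reflected in their cycle time, so we know where to dig deeper.
def sentiment_throughput_gaps(rows, min_sentiment: 3.0, max_cycle_time_hours: 48.0)
  rows.select do |row|
    row[:perceived_speed_boost] >= min_sentiment &&
      row[:cycle_time_hours] > max_cycle_time_hours
  end
end
```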
We also added mandatory PR template fields:
- `ai_speed_boost_impact` (0-5 scale)
- `ai_ideation_help_impact` (0-5 scale)
By the end of Q3, we had tagged 196 PRs. While the manual tagging had friction, it provided the "ground truth" we needed to calibrate our automated scores.
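And here is a tiny sketch of how those per-PR fields can be pulled out of a PR description for that calibration step; the exact template format (one `field: value` line per metric) is an assumption.

```ruby
# Sketch of extracting the self-reported scores from a PR description.
# Assumes one `field: value` line per metric; the real template may differ.
AI_FIELDS = %w[ai_speed_boost_impact ai_ideation_help_impact].freeze

def extract_ai_scores(pr_description)
  AI_FIELDS.to_h do |field|
    match = pr_description.match(/^#{field}\s*:\s*([0-5])\b/i)
    [field, match && Integer(match[1])]
  end
end

extract_ai_scores("Fixes the flaky spec.\nai_speed_boost_impact: 3\nai_ideation_help_impact: 2")
# => { "ai_speed_boost_impact" => 3, "ai_ideation_help_impact" => 2 }
```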
Key takeaway: Vendor analytics are optimized for showcasing tool adoption, not ROI. To measure true impact, you must build infrastructure that correlates usage (logs) with outcomes (Git) and sentiment (surveys).
Squad Champions Outperformed Top-Down Initiatives
We tested multiple adoption strategies throughout the year. The most effective was the Cursor Champions Committee, which Keshav led.
The model:
- Asked for volunteers across engineering to drive adoption on their teams
- Champions met weekly for 1 hour to align on strategy
- Each Champion spent 1-2 hours per week coaching their squad
- Every Champion owned a squad-specific adoption plan
Sample Champion initiatives:
| Initiative Type | Examples |
|---|---|
| Quick Wins | Starter prompt libraries, pair programming sessions, squad-specific documentation |
| Coaching | Weekly pair programming with focus group members |
| Custom Rules | Test generation guidelines, PR description templates |
| Measurement | Team leaderboards, lines of code accepted tracking |
| Community Sharing | "Aha!" moments in #eng-cursor Slack channel |
The data showed clear patterns:
- Squads with engaged Cursor Champions saw consistent adoption
- Squads without Cursor Champions, or with an Engineering Manager (EM) who was not bought into AI tooling, saw slower adoption
- By Q3, only 17% of squads (5 out of 29) achieved 75% or greater WAU, even though the overall adoption on the Engineering team was 85%+
We found that successful adoption required both peer champions and manager alignment. Champions provided technical expertise and enthusiasm; managers provided accountability and strategic prioritization.
Key finding: AI adoption is primarily an organizational change challenge, not a technical one. Peer-led initiatives with management support significantly outperformed top-down mandates.
The Gartner Hype Cycle Matched Our Experience
Our year-long journey closely followed the Gartner Hype Cycle pattern.
Innovation Trigger (December 2024)
- Cursor generated working Chrome extension in 2 minutes
- Initial excitement about capabilities
Peak of Inflated Expectations (Q1 2025)
- 85% adoption achieved
- Celebrating "lines of code" metrics
- Expecting major productivity improvements
Trough of Disillusionment (Q2 2025)
- Cycle time showed no significant change
- Quality concerns emerged (60% didn't trust AI-generated tests)
- Self-reported speed boost median was only 2/5
- Senior engineers reported 15% gains, not 10x
Slope of Enlightenment (Q3 2025)
- Cohort analysis revealed usage patterns
- Understood why Frontend succeeded while Backend didn't
- Identified that context engineering was the key variable
- Built measurement frameworks to track effectiveness
Plateau of Productivity (Q4 2025)
- Implemented quality gates and custom lint rules
- Created sustainable training infrastructure via interactive courses and documentation
- Established repeatable processes for new tool adoption
- Converted learnings into Cursor Commands for easy reuse
Our key decision in Q1 was to focus on adoption before measuring productivity. We believed that if engineers found genuine value, productivity gains would follow. This proved correct, but it took the full year to reach sustainable productivity improvements.
Key finding: Organizations cannot skip phases of the hype cycle. Attempting to jump directly to productivity measurement without establishing adoption, trust, and effectiveness will likely fail.
What We'd Do Differently
Looking back at our year-long journey, here's what we learned:
- Invest in context infrastructure: Frontend's success came from having good `.cursorrules`, documentation, and best practices. We should have prioritized this for Backend from the start rather than treating it as an optimization.
- Build custom analytics from day one: We spent months relying on vendor dashboards before building our own. The custom analytics were far more valuable and should have been prioritized earlier.
- Start quality measurement earlier: We waited until Q3 to systematically address quality concerns. By then, we had accumulated technical debt. Starting quality gates in Q2 would have prevented problems and helped us gain more trust with engineers.
- Set realistic expectations: The "10x productivity" hype created unrealistic expectations. Setting more modest goals (20-30% task-level improvements) would have managed expectations better.
- Survey more frequently: We ran quarterly surveys, but more frequent pulse checks (monthly) would have identified concerns earlier and allowed faster iteration.
The Playbook: A Phased Approach to Adoption
For engineering leaders considering similar initiatives, we recommend a three-phase organizational change program:
Phase 1: Adoption (Months 1-3)
- Primary Goal: Habit formation and cultural buy-in. Do not worry about ROI or complex analytics yet.
- Key Actions:
- Identify Champions: Find the 2-3 engineers on every squad who are naturally curious.
- Secure EM Buy-in: Ensure Managers are unblocking their teams to experiment.
- Celebrate "Quick Wins": Share small victories publicly (e.g., "Ryan generated this entire test suite in 30 seconds").
- Metrics to Watch: Weekly Active Users (WAU), % of Squads with at least 1 Power User.
Phase 2: Quality (Months 3-6)
- Primary Goal: Risk mitigation and trust building. As usage spikes, quality will become a concern.
- Key Actions:
- Deploy Guardrails: Implement custom lint rules (e.g., RuboCop, ESLint) specifically targeting common AI hallucinations or bad patterns.
- Standardize Context: Begin documenting your `.cursorrules`. Treat documentation as "infrastructure for AI."
- Pulse Surveys: Run monthly surveys specifically asking: "Do you trust the code generated by the tool?"
- Metrics to Watch: Revert rates on AI-generated PRs, Survey Sentiment ("Trust Score").
Phase 3: Effectiveness (Months 6+)
- Primary Goal: Value optimization and deep workflow integration. Move beyond "better autocomplete." Use data to drive behavior change from passive usage (Tab) to active problem solving (Agents).
- Key Actions:
- Build Custom Analytics: Implement the "Weighted Effectiveness Score" (Logs + Git data) to identify who is actually getting faster.
- Segment & Train: Identify the "Passive Users" and pair them with "Super Power Users" for specific workflow training.
- Invest in "Context Infrastructure": deeply engineer your codebases (types, modularity) to be LLM-friendly.
- Metrics to Watch: Custom Usage Score, Cycle Time (correlated with usage).
The Scorecard: Q4 2025 Status
As of Q4 2025, here's where we stand:
| Metric | Current State |
|---|---|
| Weekly Active Users using Coding Agents | 92% |
| Super Power Users | 9% (up from 2%) |
| AI Speed Impact Score | 2.29/5 (up from 2.23) |
| Code quality infrastructure | Quality Gates, Custom Lint Rules deployed |
| Training modules | 3 modules drafted, AI Learning Hub created |
The bottom line: AI coding tools can deliver measurable productivity improvements, but realizing those gains requires systematic measurement, quality guardrails, and sustained organizational focus.
Our actual results (15% productivity gains for senior engineers and 3-4x PR velocity for Frontend Super Power Users) are valuable, even if they don't match the initial 10x hype.
💡 We no longer optimize for ‘more AI'; we optimize for ‘better AI'.
The question isn't whether AI tools work - they do. The question is whether your organization is prepared to do the hard work of measurement, iteration, and infrastructure-building required to make them work effectively.
Acknowledgments
This initiative and analysis would not have been possible without the dedicated work of the entire Apollo Engineering team.
Major Contributions:
Special thanks to Ryan Alexander for spearheading the adoption of multiple AI tools beyond just Cursor.
AI Tooling Program Contributors:
Debanjan Choudhury, Apurv Garg, Adam Kusmierz, Shraey Chikker, Brandon Renfrow, Rahul Gautam, Ahmed Hamdy, Griffin Brodman, Deeksha Bilochi, Joshua Sullivan, Raja SK, Matt Welk, Tejas H N, Pete Kincaid, Przemyslaw Suchodolski, Hardik Badola, Nilarjun Das, Saurav Keshri, Hardik Bansal, Venkat Ram, Ralph Silaya, Akhil Chennareddy, Alisha Gupta, Arunkumar Nachimuthu, Edgar Harris, Hana Ito, Harshit Pandey, Kevin Yang, Marcin Mazurek, Mohamed Djadoun, Mohit Kumath, Neda Davis, Piotr Dyba, Prashant Yadav, Priya Surana, Rachel Vilceus, Rahul Singh, Siddharth Goswami, Siddharth Malik, Uwais Zaki, Vidya Nethi, Zachary Rogerson, Aditya Keri
We're continuing to iterate on our approach and measure results. If you're working on similar initiatives, we're happy to share more details about our measurement frameworks and learnings.