The Email Analyzer is built into Apollo's email composer to help sales development representatives (SDRs) write better cold emails. Users get instant feedback on their draft, with specific suggestions across multiple criteria like subject line, clarity, sales structure, and engagement. This feature originated from ApolloHacks, Apollo's internal hackathon, where engineers tackle creative product challenges. What started as a hackathon project evolved into a production feature that now helps thousands of SDRs improve their email outreach every day.
At Apollo, SDRs send millions of cold emails every month. The difference between a 2% and 5% positive reply rate is the difference between hitting quota and missing it. Yet most SDRs don't know if their email is any good until after they've sent it and watched the replies—or lack thereof—trickle in. We built a tool that gives users instant feedback on their cold emails before they hit send. But building it taught us something counterintuitive about ML in production: sometimes the best thing you can do with a trained model is throw it away.
🎯 The Goal: Actionable Feedback, Not Just a Score
When we started, the obvious approach was to build a classifier. We curated a dataset of ~2,000 emails with engagement stats to train on.
After that, it was textbook ML: train a model on email data with open/reply rates, predict the probability of success, scale it to 0-100, done! Straightforward, right? Well, not really. We quickly realized this created a frustrating user experience.
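Sketched out with stand-in data (the real features and engagement labels are internal), that first approach looked something like this:

```python
# A minimal sketch of the "textbook" approach we started with: train a classifier
# on reply outcomes, then rescale the predicted reply probability to a 0-100 score.
# The dataset here is synthetic stand-in data, not Apollo's email corpus.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=12, random_state=0)  # stand-in for email features / "got a reply"
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

def naive_score(email_features) -> int:
    """Reply probability rescaled to a 0-100 'quality' score (the approach we later abandoned)."""
    return round(clf.predict_proba([email_features])[0, 1] * 100)

print(naive_score(X_test[0]))  # a single opaque number, with no hint of *why*
```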
Imagine you’re an SDR. You draft your email in the composer, hit the analyze button and get back: “Score: 62/100”.
Now what?
You have no idea why it’s 62. You tweak the subject line, re-run the analyzer, and get… 62. You shorten the body, try again… still 62. The score feels arbitrary, like a slot machine that won’t pay out.
We didn’t just want to tell users if their email was good or bad. We wanted to tell them how to improve it.
🔬 The Experiments: Finding What Actually Matters
We started with a systematic exploration of what features correlate with email success. We ran experiments across multiple model architectures:
| Model | Key Finding | Correlation |
|---|---|---|
| OLS Regression | Features had tiny coefficients, reply rates too small to learn from | 0.44 |
| Lasso | Only selected 2 features: body_word_count, opener_you_i_ratio | 0.34 |
| ElasticNet | Added opener_word_count to the mix | 0.28 |
| Random Forest | Revealed 9 important features with optimal ranges | 0.43 |
Random Forest was the winner — not because it had the highest correlation, but because of what it revealed through Partial Dependence Plots (PDP).
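For a sense of how these comparisons ran, here’s the rough shape of the experiment loop (stand-in data again, so the printed numbers won’t match the table):

```python
# Each candidate model predicts a continuous reply rate; we compare predictions
# to actuals on a holdout set using Pearson correlation, as in the table above.
from scipy.stats import pearsonr
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=12, noise=10.0, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "OLS": LinearRegression(),
    "Lasso": Lasso(alpha=0.1),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),
    "RandomForest": RandomForestRegressor(n_estimators=300, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    corr, _ = pearsonr(model.predict(X_test), y_test)
    print(f"{name}: correlation = {corr:.2f}")
```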
💡 The Breakthrough: Interpretability Over Prediction
Here’s where it gets interesting. PDP is an ML interpretability technique that shows how the model’s average prediction changes as a single feature varies across its range. When we applied it to our Random Forest, we got optimal ranges for each feature.
For example, we discovered:
- Subject line length: 1-5 words performs best (not “shorter is better” — there’s a floor)
- Body word count: 20-100 words is the sweet spot
- Reading grade: 5th grade or lower maximizes engagement
Linear models gave us thresholds (“subject < 6 words”). Random Forest gave us ranges (“1-5 words”), which is far more actionable.
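Here’s a sketch of how ranges can be read off a PDP curve with scikit-learn. The 90% cutoff and the stand-in data are illustrative choices, not our production values:

```python
# partial_dependence sweeps one feature across its range and averages the model's
# prediction; the values where that curve stays near its peak become the
# recommended range for that feature.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

X, y = make_regression(n_samples=2000, n_features=9, noise=5.0, random_state=0)  # stand-in data
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)

def optimal_range(model, X, feature_idx, keep=0.9):
    """Values of one feature whose partial dependence is within `keep` of the peak."""
    pdp = partial_dependence(model, X, features=[feature_idx], kind="average")
    grid = pdp["grid_values"][0]   # key is "values" in older scikit-learn releases
    curve = pdp["average"][0]
    near_peak = grid[curve >= curve.min() + keep * (curve.max() - curve.min())]
    return float(near_peak.min()), float(near_peak.max())

print(optimal_range(rf, X, feature_idx=0))  # e.g. the body_word_count column in the real feature matrix
```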
Then we made a decision that felt wrong at first: we threw away the model entirely.
Instead of deploying the Random Forest and using its probability scores, we extracted the rules and built a purely rule-based scoring system in which each feature is evaluated independently. Land inside the optimal range? No issue. Fall outside it? One issue. The final score is simply the percentage of criteria you’re meeting.
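A toy version of that scorer, using the example ranges above (the production rule set covers more criteria):

```python
# Each criterion is checked independently against its optimal range, and the
# score is simply the percentage of criteria the email passes. Every failed
# check comes with a concrete suggestion.
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    passed: bool
    suggestion: str

def analyze(subject_words: int, body_words: int, reading_grade: float) -> dict:
    rules = [
        Rule("subject_length", 1 <= subject_words <= 5, "Keep the subject to 1-5 words."),
        Rule("body_length", 20 <= body_words <= 100, "Aim for 20-100 words in the body."),
        Rule("reading_grade", reading_grade <= 5, "Simplify wording to a 5th-grade reading level."),
    ]
    score = round(100 * sum(r.passed for r in rules) / len(rules))
    return {"score": score, "issues": [r.suggestion for r in rules if not r.passed]}

print(analyze(subject_words=7, body_words=85, reading_grade=6.2))
# -> score 33, with two concrete suggestions the user can act on
```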
🤔 Why Rule-Based Beats ML Prediction
This might sound like a step backwards, but it solved a critical UX problem.
With an ML probability score, fixing one thing didn’t guarantee improvement. You might fix your subject line, but if the model has learned complex feature interactions, your score could stay flat or even drop. Users feel like they’re playing whack-a-mole with an invisible opponent.
With our rule-based system:
- Fix the subject line → score improves by exactly X points
- Fix the reading level → score improves by exactly Y points
- Every fix has a predictable, visible impact
One of our users captured this perfectly:
“More important than the overall score, is that the tool provides feedback on each criteria point needed to make appropriate changes.”
That’s exactly what we were going for.
🚧 The Problem: Good Scores for Bad Emails
Our rule-based system worked great until it didn’t.
We discovered that emails could score well on all our syntactic features (word count, reading level, sentence length) and still be terrible sales emails. They were readable, sure, but they had no pain point, no value proposition, and no call-to-action: just well-constructed sentences that wouldn’t win customers.
Syntactic features measure how an email is written. But they can’t measure what it says.
This led us to add a second layer: semantic analysis powered by LLMs. We now evaluate emails on:
- Pain Point: Does it acknowledge the recipient’s challenges?
- Value Proposition: Does it explain what you’re offering?
- Personalization: Is it about them, not just you?
- Call-to-Action: Is there a clear next step?
- Social Proof: Any credibility signals?
- Tone Detection: Does it sound confident or desperate?
Each category uses carefully crafted prompts that ask yes/no questions about the email content. We aggregate the responses into the same rule-based scoring framework — keeping the predictable, actionable UX intact.
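In outline, the semantic layer looks something like this. The real prompts and LLM provider are internal; `ask_llm` here is a hypothetical stand-in for whatever chat-completion client you use:

```python
# Each category is a narrow yes/no question about the email's content. The
# answers feed the same percentage-based scoring used for the syntactic rules.
SEMANTIC_CHECKS = {
    "pain_point": "Does this email acknowledge a specific challenge the recipient faces? Answer yes or no.",
    "value_prop": "Does this email explain what is being offered and why it helps? Answer yes or no.",
    "personalization": "Is this email about the recipient rather than only the sender? Answer yes or no.",
    "call_to_action": "Does this email propose a clear next step? Answer yes or no.",
    "social_proof": "Does this email include any credibility signal? Answer yes or no.",
    "tone": "Does this email sound confident rather than desperate? Answer yes or no.",
}

def ask_llm(question: str, email: str) -> str:
    """Hypothetical stand-in: wire this to your chat-completion provider of choice."""
    return "yes"  # placeholder so the sketch runs end-to-end

def semantic_issues(email: str) -> list[str]:
    """Return the categories where the LLM answers 'no'."""
    return [
        category
        for category, question in SEMANTIC_CHECKS.items()
        if ask_llm(question, email).strip().lower().startswith("no")
    ]
```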
⚡ The Latency Challenge: From 25 Seconds to 2
Adding LLMs to the mix created a new problem: latency. Our initial implementation took 25 seconds to analyze an email. That’s an eternity when you’re trying to help users iterate quickly.
We attacked this from two angles:
1. Parallelization: Instead of calling the LLM sequentially for each category, we fire all the prompts concurrently.
2. Faster Models: Because each category is analyzed independently and within a narrow scope, we found that smaller, faster models performed extremely well. The accuracy tradeoff was negligible, while the latency gains were substantial.
Result: ~2 second analysis time, fast enough that users actually use it iteratively while writing.
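The concurrency piece is simple in outline: because the category checks are independent, total latency is bounded by the slowest call rather than the sum. A sketch, with `ask_llm` again standing in for the actual client:

```python
# Fire every category prompt concurrently and collect the failing categories.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def semantic_issues_parallel(email: str, checks: dict[str, str],
                             ask_llm: Callable[[str, str], str]) -> list[str]:
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = {category: pool.submit(ask_llm, question, email)
                   for category, question in checks.items()}
        return [category for category, fut in futures.items()
                if fut.result().strip().lower().startswith("no")]
```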
📐 The Architecture Today
The final system combines both approaches: a fast, deterministic layer of syntactic rules extracted from the Random Forest via PDP, and an LLM-powered semantic layer that checks what the email actually says. Both feed into the same per-criterion, percentage-based score, so every issue the user sees comes with a concrete fix.
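In code terms, the aggregation step is deliberately boring (a sketch; the equal weighting of criteria is an illustrative assumption):

```python
def email_score(syntactic_passed: dict[str, bool], semantic_passed: dict[str, bool]) -> dict:
    """Every criterion, syntactic or semantic, counts toward one percentage (equal weights are illustrative)."""
    checks = {**syntactic_passed, **semantic_passed}
    score = round(100 * sum(checks.values()) / len(checks))
    return {"score": score, "issues": [name for name, ok in checks.items() if not ok]}

print(email_score(
    {"subject_length": True, "body_length": True, "reading_grade": False},
    {"pain_point": True, "call_to_action": False},
))
# -> {'score': 60, 'issues': ['reading_grade', 'call_to_action']}
```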
🎓 What We Learned
Building the Email Analyzer taught us some lessons that apply beyond this specific feature:
1. Explainability often beats accuracy. A 95% accurate black box that users don’t trust is less useful than an 85% accurate system they can understand and act on.
2. Use ML to discover rules, not just make predictions. PDP and other interpretability techniques let us extract human-understandable insights from our model; in this case, those insights became the actual product.
3. Predictable > Precise. Users don’t need a perfect score. They need a score that moves in the direction they expect when they make changes.
4. Latency is a feature. Our semantic analysis is only valuable because we got it under 2 seconds. At 25 seconds, users wouldn’t iterate, they’d just send their first draft and abandon the analyzer.
🔮 Looking Ahead — Join Us at Apollo!
The Email Analyzer started as a classic ML problem: predict email success. It became something more interesting: an interpretable, actionable feedback system built on top of ML insights. We trained a model not to deploy it, but to learn from it.
This is the kind of creative problem-solving we do every day at Apollo — using AI not just as a hammer, but as a tool that genuinely improves how our users work.
We’d love for smart engineers like you to join our fully remote, globally distributed team. Click here to apply now!