Most teams measure the outcome of support — was the customer satisfied, did the ticket close fast — and almost never measure the work itself. That gap is where quality assurance lives. QA is the discipline of reading actual replies, scoring them against a shared standard, and feeding what you learn back into coaching. Done well, it is the single fastest way to make an entire team's writing better. Done badly, it becomes a surveillance ritual everyone resents. The difference is almost entirely in how you design the scorecard and what you do with the results.

Why CSAT can't do this job

A CSAT survey tells you a customer was dissatisfied. It does not tell you that the agent answered the wrong question, buried the fix under three paragraphs of apology, or promised a refund the policy doesn't allow. Worse, CSAT is biased toward outcome: a customer who got bad news politely delivered will often score low, and a customer who got a wrong answer they liked will score high. You cannot coach an agent on "the customer was sad." You can coach them on "you skipped the diagnostic question and guessed."

QA closes that loop. It looks at the conversation independent of the customer's mood and asks: given what this agent knew, was this a good reply? That is a question only another human reading the transcript can answer.

What a scorecard should actually measure

Resist the urge to score twenty things. A scorecard that takes fifteen minutes per ticket will never get used at volume. The strongest scorecards cluster into three buckets:

Resolution — did we actually solve it?

Was the customer's real question identified (not just the literal one they typed)?
Was the answer correct and complete?
Did the agent confirm the fix or leave it dangling?

Communication — was it clear and human?

Did the opening prove the agent read the message, or was it a generic canned greeting?
Is the reply scannable, or a wall of text?
Was the tone right for the situation: not chirpy at an angry customer, not cold at a confused one?

Process — did we follow the rules that matter?

Correct tags, priority, and status set?
Policy followed on refunds, security, and account changes?
Escalated at the right moment, or held too long?

Each item is a simple yes / no / partial. Avoid 1–10 sliders; reviewers won't agree on what a 7 means, and the false precision hides disagreement instead of surfacing it.
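As a rough sketch, the three buckets and the yes / no / partial answers could be represented like this. The item names, the 0.5 credit for "partial," and the equal weighting of items are illustrative assumptions, not a prescribed scheme; adjust them to whatever your team agrees on.

```python
from enum import Enum


class Answer(Enum):
    YES = 1.0
    PARTIAL = 0.5  # assumption: a partial earns half credit
    NO = 0.0


# Hypothetical item names that mirror the three buckets described above.
SCORECARD = {
    "resolution": [
        "real_question_identified",
        "answer_correct_and_complete",
        "fix_confirmed",
    ],
    "communication": [
        "opening_shows_message_was_read",
        "reply_scannable",
        "tone_fits_situation",
    ],
    "process": [
        "tags_priority_status_set",
        "policy_followed",
        "escalated_at_right_moment",
    ],
}


def score_review(answers):
    """Return the share of credit earned per bucket, plus an overall share.

    `answers` maps every item name in SCORECARD to an Answer.
    """
    result = {}
    for bucket, items in SCORECARD.items():
        result[bucket] = sum(answers[item].value for item in items) / len(items)
    all_items = [item for items in SCORECARD.values() for item in items]
    result["overall"] = sum(answers[item].value for item in all_items) / len(all_items)
    return result
```

Keeping the per-bucket shares next to the overall figure makes it easier to see whether a weak review was a resolution problem or a communication problem, which is the part that is actually coachable.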

Sample, don't audit everything

You cannot review every ticket, and you shouldn't try. A representative sample tells you what you need to know. A workable cadence for a small team:

2–4 tickets per agent per week, chosen to cover a range of ticket types — not just the easy ones, not just the disasters.
Every reopened ticket and every low CSAT. These are your highest-signal cases; something went wrong and the transcript will tell you what.
A rotating focus. One week look hard at tone, the next at technical accuracy. You'll see patterns a flat sample misses.

The goal is a stable read on quality over time, not a complete census. If your sampled QA score, your CSAT, and your reopen rate all point in the same direction, your sample is big enough.
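One way to turn that cadence into a weekly pull is sketched below. The ticket fields (`id`, `agent`, `type`, `reopened`, `csat`), the `LOW_CSAT` threshold, and the function name are all assumptions made for illustration; they do not come from any particular help desk's export format. The rotating focus is left out, since it changes what reviewers look for rather than which tickets get pulled.

```python
import random
from collections import defaultdict

LOW_CSAT = 2  # assumption: "low" means 1 or 2 on a 5-point survey


def pick_weekly_sample(tickets, per_agent=3, seed=None):
    """Split one week's tickets into (must_review, random_sample).

    must_review: every reopened ticket and every low-CSAT ticket.
    random_sample: up to `per_agent` other tickets per agent, preferring
    one ticket of each type before repeating a type.
    """
    rng = random.Random(seed)

    def is_high_signal(ticket):
        low_csat = ticket["csat"] is not None and ticket["csat"] <= LOW_CSAT
        return ticket["reopened"] or low_csat

    must_review = [t for t in tickets if is_high_signal(t)]

    # Group the remaining tickets by agent.
    pools = defaultdict(list)
    for t in tickets:
        if not is_high_signal(t):
            pools[t["agent"]].append(t)

    sample = []
    for pool in pools.values():
        rng.shuffle(pool)
        first_of_type, rest, seen_types = [], [], set()
        for t in pool:
            if t["type"] in seen_types:
                rest.append(t)
            else:
                seen_types.add(t["type"])
                first_of_type.append(t)
        # Distinct ticket types come first, so the cut covers a range.
        sample.extend((first_of_type + rest)[:per_agent])
    return must_review, sample
```

Passing a fixed `seed` makes a given week's pull reproducible, which helps if an agent asks why a particular ticket was chosen.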

Calibrate, or your scores are noise

The fastest way to kill a QA program is to have two reviewers score the same ticket at 60% and 95%. Agents lose all faith the moment they realize the grade depends on who graded it.

Fix this with calibration sessions: every few weeks, have all reviewers independently score the same two or three tickets, then compare and argue out the differences. You are not trying to make everyone score identically — you are trying to make sure you all mean the same thing by "incomplete answer" or "wrong tone." Write the resolved definitions into the scorecard as one-line examples. A rubric without examples is just a list of adjectives.
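To decide what to argue about in a calibration session, it can help to list only the items where reviewers diverged on a shared ticket. A minimal sketch, assuming each reviewer's answers are stored as a dictionary of item name to "yes", "partial", or "no" (the names and the data shape are hypothetical):

```python
from collections import Counter


def disagreement_by_item(scores):
    """Find scorecard items where reviewers did not all give the same answer.

    `scores` holds the reviews of ONE ticket:
        {reviewer_name: {item_name: "yes" | "partial" | "no"}}
    Returns {item_name: {answer: count}} for the contested items only.
    """
    items = set().union(*(review.keys() for review in scores.values()))
    contested = {}
    for item in sorted(items):
        counts = Counter(
            review[item] for review in scores.values() if item in review
        )
        if len(counts) > 1:  # more than one distinct answer was given
            contested[item] = dict(counts)
    return contested
```

The output is an agenda, not a verdict: each contested item is a definition the reviewers still need to settle and write into the scorecard as a one-line example.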

The score is the conversation starter, not the verdict

Here is the part teams get wrong: they treat the QA number as a grade and staple it to performance reviews. The instant QA scores feed compensation or rankings, agents start gaming them and stop taking risks — and the honest coaching conversation dies.

Use QA the other way around. The score exists to start a conversation:

Share the review with the agent, with specifics. "This reply was accurate but you opened with a macro — the customer described a very specific situation and it read like a form letter. Here's a one-line opener that would have landed."
Celebrate the great ones. A reviewer who only ever flags problems trains the team to see QA as punishment. Pull out an excellent reply and show the whole team why it worked.
Watch for systemic misses. If six agents all skip the same diagnostic step, that is not six coaching problems — it is a missing KB article or a broken macro. Fix the system, not the people.
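Spotting a systemic miss amounts to counting how many different agents missed the same item. The sketch below assumes reviews are flattened into `(agent, item, answer)` tuples and treats anything other than "yes" as a miss; both the data shape and the `min_agents` threshold are illustrative assumptions.

```python
from collections import defaultdict


def systemic_misses(reviews, min_agents=3):
    """Return items missed by at least `min_agents` distinct agents.

    `reviews` is an iterable of (agent, item, answer) tuples, where answer
    is "yes", "partial", or "no". Assumption: "partial" counts as a miss.
    """
    agents_missing = defaultdict(set)
    for agent, item, answer in reviews:
        if answer != "yes":
            agents_missing[item].add(agent)
    return {
        item: sorted(agents)
        for item, agents in agents_missing.items()
        if len(agents) >= min_agents
    }
```

Counting distinct agents rather than total misses keeps one struggling agent from looking like a team-wide pattern.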

Wire QA back into the rest of the operation

A QA program that lives in a spreadsheet nobody links to is wasted effort. The findings should flow outward:

Repeated wrong answers → update the KB and the saved replies.
Repeated tone misses on a topic → a coaching theme for the week, or a macro rewrite.
Repeated late escalations → revisit your escalation criteria; they may be too vague.

QA is the feedback loop that connects what customers experience to what your team actually does at the keyboard. CSAT tells you something is wrong; QA tells you exactly what, and exactly whom to help.

The honest test

A healthy QA program occasionally tells a manager something uncomfortable: that the team's "best" agent by volume is fast because they close tickets half-answered, or that a quiet agent writes the clearest replies on the team. If your QA scores only ever confirm what you already believed, your scorecard is too easy or your sample is too kind. The point is not to generate a number for a dashboard — it is to read the work honestly and make the next reply better than the last.

Support Quality Assurance: Conversation Review and Scorecards That Improve Replies

