Resource — Quality

Call quality for AI voice agents isn’t the same problem as traditional VoIP.

Traditional voice quality monitoring was built to measure human-to-human conversation clarity. AI voice agents have a different failure surface — one where responsiveness, turn accuracy, and streaming audio continuity matter as much as MOS score.

Why AI voice is different

An AI voice agent doesn’t just listen and respond. It streams audio to a transcription system, runs inference, synthesizes speech, and streams it back — all in real time. Each stage has its own failure mode.

What degrades it

Packet loss disrupts streaming ASR. Jitter causes audio discontinuities that confuse turn detection. High TTFS (time to first speech) makes the agent feel unresponsive even on otherwise good-quality calls. These failures compound differently than they do in human conversations.

What to actually measure

TTFS, interruption capture rate, MOS, packet loss percentage, and jitter variance — measured per call, not as fleet averages. Individual call quality matters because callers experience individual calls, not averages.

A quality framework built for AI voice

The metrics that matter for AI voice agent quality fall into two categories: transport-layer metrics (packet quality, jitter, signaling timing) and experience-layer metrics (TTFS, interruption accuracy, conversation naturalness). Neither alone tells the full story. Together they give you enough to diagnose most production quality issues.
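As a sketch of what a combined per-call record might look like, here is a minimal structure holding both layers. The field names, the thresholds, and the `looks_fine_but_feels_slow` check are illustrative assumptions, not Telepath's actual schema.

```python
from dataclasses import dataclass

@dataclass
class CallQuality:
    """One call's quality record across both layers (illustrative fields)."""
    # Transport layer
    mos: float              # perceptual audio quality, 1-5 scale
    packet_loss_pct: float  # lost RTP packets as a percentage
    jitter_ms: float        # mean interarrival jitter in milliseconds
    # Experience layer
    ttfs_ms: float          # time to first agent speech in milliseconds

def looks_fine_but_feels_slow(call: CallQuality) -> bool:
    """Flags calls where transport metrics look healthy but the agent is
    sluggish: exactly the case neither layer catches on its own."""
    return call.mos >= 4.0 and call.ttfs_ms > 1000
```

Holding both layers in one record is what makes the cross-layer check a one-liner instead of a join across monitoring systems.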

MOS (Mean Opinion Score)

MOS is a standardized measure of perceptual audio quality on a 1–5 scale. A MOS above 4.0 indicates good to excellent clarity; below 3.5 is generally noticeable to callers. For AI voice, MOS is a useful signal for detecting codec degradation, transcoding artifacts, and severe packet loss, but it measures the audio signal quality independently of whether the agent is performing well conversationally. A call with a MOS of 4.2 can still feel frustrating if TTFS is 2.5 seconds.
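Those thresholds can be captured in a small bucketing helper. The cut-offs below come from this section's rough guidance, not from any standard mapping, and the bucket names are invented.

```python
def mos_bucket(mos: float) -> str:
    """Buckets a 1-5 MOS value using the rough thresholds described above."""
    if mos > 4.0:
        return "good"             # good to excellent clarity
    if mos >= 3.5:
        return "acceptable"       # usable, not yet obvious to callers
    return "noticeably degraded"  # generally noticeable to callers
```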

TTFS (Time to First Speech)

TTFS is the experience metric most directly correlated with whether a caller thinks the AI agent is “fast” or “slow.” Research on voice UX consistently shows that response delays above 800–1000ms begin to feel like the system is unresponsive, even when call audio quality is excellent. Monitoring TTFS per call, segmented by carrier and agent response type, lets you identify whether slowness is systemic or situational. Tail latency (P95, P99 TTFS) is particularly important, since a bad tail creates disproportionate user frustration.
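Tail TTFS can be tracked with a simple nearest-rank percentile over per-call samples. A minimal sketch; the sample values are invented for illustration.

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: small, dependency-free, good enough
    for tracking tail latency over per-call samples."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

# Per-call TTFS in ms; two slow outliers dominate the tail
ttfs_ms = [420, 510, 480, 650, 2300, 470, 530, 495, 610, 2900]
```

On this sample the median is 510 ms while the P95 is 2900 ms, which is why averages alone understate tail frustration.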

Interruption capture rate

AI voice agents need to detect when a caller speaks while the agent is playing audio, stop the agent’s speech, and respond to the interruption. The accuracy of this detection depends on the quality of the audio stream feeding the VAD (Voice Activity Detector). If jitter or packet loss causes the audio stream to be discontinuous, VAD performance degrades and interruption capture becomes unreliable. Callers who try to interrupt and can’t will often hang up. Monitoring interruption events per call — how many occurred, how many were captured cleanly — surfaces this failure mode directly.

Packet loss percentage

RTP runs over UDP, which provides no delivery guarantee. Packet loss on a traditional call is concealed by the codec’s packet loss concealment (PLC) algorithm, which interpolates plausible audio for the missing packets. For AI voice, lost packets create gaps in the audio delivered to the streaming ASR system. ASR models are generally robust to occasional losses below ~2%, but sustained losses above 5% cause transcription accuracy to degrade noticeably. Lost packets on the return path (agent audio to caller) create audible glitches that reduce clarity below what MOS alone captures.
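Loss can be estimated from the RTP sequence numbers actually observed. A simplified sketch that assumes in-order arrival; real receivers also have to handle reordering and duplicates.

```python
def packet_loss_pct(seq_numbers: list[int]) -> float:
    """Estimates loss from observed RTP sequence numbers (16-bit, may wrap).
    expected counts the sequence-number distance covered; anything not
    received within that span is treated as lost."""
    if len(seq_numbers) < 2:
        return 0.0
    expected = 0
    for prev, cur in zip(seq_numbers, seq_numbers[1:]):
        expected += (cur - prev) % 65536  # gap, accounting for wraparound
    if expected == 0:
        return 0.0
    received = len(seq_numbers) - 1       # packets after the first
    return 100.0 * (expected - received) / expected
```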

Jitter variance

Jitter is the variation in packet arrival timing. A jitter buffer absorbs this variation, but at the cost of latency. High jitter forces the buffer to grow to avoid underruns, which adds systematic delay. Periodic jitter bursts — where a burst of packets arrives clustered together following a gap — can exceed the buffer size and cause a run of dropped packets even when average packet loss is low. Monitoring jitter variance (not just average jitter) reveals these burst events that average metrics miss entirely.
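The burst signature shows up in the variance of interarrival gaps even when the mean looks healthy. A sketch with invented arrival timestamps:

```python
import statistics

def gap_stats(arrivals_ms: list[float]) -> tuple[float, float]:
    """Mean and population variance of interarrival gaps."""
    gaps = [b - a for a, b in zip(arrivals_ms, arrivals_ms[1:])]
    return statistics.mean(gaps), statistics.pvariance(gaps)

smooth = [i * 20.0 for i in range(10)]                  # steady 20 ms pacing
bursty = [0, 20, 40, 60, 140, 142, 144, 146, 148, 180]  # gap, then a clustered burst
```

Both streams average 20 ms between packets, but only the bursty one has the high gap variance that predicts jitter-buffer overruns.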

Disconnect attribution

When a call ends, knowing whether it was a clean user hangup, a carrier-side disconnect, a network timeout, or an agent-side error is operationally valuable. SIP 4xx and 5xx final responses, BYE reason headers, and RTP stream termination patterns all contribute to disconnect attribution. Without this, high call drop rates look the same regardless of whether the cause is a carrier routing issue or a Vapi agent crash.

Per-call, not sampled

Telepath instruments every call with quality metrics — not a statistical sample. Individual call quality is accessible for any session, which matters when a specific caller complaint doesn't fit the aggregate pattern.

Transport and experience layers

Packet loss, jitter, and MOS come from the transport layer. TTFS and interruption metrics come from the experience layer. Telepath surfaces both in the same call-level view so you can correlate them without joining separate data sources.

Comparable across carriers

Because Telepath sits at the carrier boundary for all your calls, quality metrics are measured consistently regardless of which carrier originated the call. Carrier-level quality comparisons are apples-to-apples, not skewed by inconsistent measurement.

Monitor call quality where it actually affects callers.

Telepath gives you per-call quality evidence across transport and experience layers so you can distinguish carrier issues from gateway issues from AI runtime issues — and fix the right thing.