Jitter and packet loss hit AI voice agents harder than traditional calls.

A 200ms jitter burst that a human caller barely notices can break streaming ASR, confuse turn detection, and cause an AI voice agent to mishear or miss the caller entirely. The difference lies in how human brains versus machine learning models handle discontinuous audio.

What jitter is

Jitter is variation in packet arrival timing. With a typical 20ms packetization interval, RTP packets should arrive at a steady 20ms cadence; jitter causes them to arrive at 18ms, then 35ms, then 12ms intervals instead. A jitter buffer smooths this out, but at the cost of added latency.
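The standard running estimate of interarrival jitter comes from RFC 3550 (section 6.4.1), which smooths the change in relative transit time with a gain of 1/16. A minimal Python sketch, using hypothetical packet timestamps that match the 18/35/12ms spacings above:

```python
# Sketch of the interarrival jitter estimator from RFC 3550, section 6.4.1.
# Timestamps are in milliseconds; the packet trace below is hypothetical.

def update_jitter(jitter, prev_transit, transit):
    """One step of the RFC 3550 running jitter estimate."""
    d = abs(transit - prev_transit)      # change in relative transit time
    return jitter + (d - jitter) / 16.0  # smoothed with gain 1/16

def estimate_jitter(send_times_ms, recv_times_ms):
    """Run the estimator over a whole packet trace."""
    jitter = 0.0
    prev_transit = recv_times_ms[0] - send_times_ms[0]
    for s, r in zip(send_times_ms[1:], recv_times_ms[1:]):
        transit = r - s
        jitter = update_jitter(jitter, prev_transit, transit)
        prev_transit = transit
    return jitter

# Packets sent every 20ms but arriving with 18ms, 35ms, 12ms gaps.
send = [0, 20, 40, 60]
recv = [10, 28, 63, 75]
print(round(estimate_jitter(send, recv), 2))  # → 1.49
```

The 1/16 gain makes the estimate a slow-moving average, which is why a single late packet barely moves reported jitter while a sustained burst does.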

What packet loss is

RTP uses UDP, which provides no delivery guarantee. Lost packets create gaps in the audio stream. Packet loss concealment (PLC) interpolates the missing audio for human listeners, but the gap still reaches the streaming ASR unchanged.
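A loss shows up as a jump in RTP sequence numbers. A minimal sketch (illustrative function and data, not a real API) that counts missing packets from a received trace, including 16-bit sequence wraparound:

```python
# Minimal sketch: count audio gaps from RTP sequence numbers.
# RTP sequence numbers are 16-bit and wrap; this toy ignores reordering.

def count_lost(seq_numbers):
    """Count packets missing between consecutive received sequence numbers."""
    lost = 0
    for prev, cur in zip(seq_numbers, seq_numbers[1:]):
        delta = (cur - prev) % 65536   # handle 16-bit wraparound
        if delta > 1:
            lost += delta - 1          # packets skipped = gap in the stream
    return lost

# Three packets lost between 102 and 106; at 20ms per packet,
# that is a 60ms hole handed to the ASR.
received = [100, 101, 102, 106, 107]
print(count_lost(received))  # → 3
```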

Why AI voice is more sensitive

Human brains reconstruct audio from context. Streaming ASR models do not. A gap in the audio signal that human listeners fill in automatically becomes a transcription error, a missed word, or a false end-of-utterance signal for an AI agent.

How jitter and packet loss manifest in AI voice calls

Understanding the failure modes helps you diagnose which metric to look at when something goes wrong. Jitter and packet loss often co-occur but have different primary effects on AI voice quality.

Jitter: the latency vs. stability tradeoff

Jitter buffers absorb arrival-time variation by holding packets briefly before forwarding them to the media pipeline. The buffer depth is a tradeoff: a deeper buffer handles more jitter without dropping packets, but adds systematic latency to every call. A 60ms jitter buffer means every packet is delivered 60ms later than it arrived. For an AI voice call, this 60ms is added directly to the TTFS budget on top of model inference time.

High-jitter conditions can also force the jitter buffer to adapt dynamically, which introduces brief audio discontinuities during adaptation. These adaptation events are subtle in human-to-human calls but can cause false VAD end-of-utterance triggers in AI voice agents, making the agent conclude the caller has stopped speaking when they haven't and cut in prematurely.

Packet loss: the ASR accuracy problem

Streaming ASR systems (Deepgram, AssemblyAI, Whisper serving, carrier-native ASR) are generally robust to occasional packet loss below ~2%. At 2–5% sustained loss, transcription accuracy begins degrading noticeably on longer utterances. Above 5%, word error rate increases significantly and the agent starts mishearing callers frequently enough to impact conversation quality.

Bursty packet loss is more damaging than uniform loss at the same percentage. A burst of 5 consecutive lost packets (100ms of audio) destroys a word, which is far worse than 5 uniformly distributed lost packets across the same window. Burst patterns are distinguishable from uniform loss in per-call packet metrics.
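One way to make burst patterns visible is to track the longest run of consecutive losses alongside the loss percentage. A sketch with illustrative field names and thresholds (the 3-packet burst cutoff is an assumption, not a standard):

```python
# Sketch: distinguish bursty from uniform loss in a per-call loss bitmap.
# loss_map[i] is truthy if packet i was lost; names are illustrative.

def max_burst(loss_map):
    """Longest run of consecutive lost packets."""
    longest = run = 0
    for lost in loss_map:
        run = run + 1 if lost else 0
        longest = max(longest, run)
    return longest

def loss_profile(loss_map, ms_per_packet=20):
    pct = 100.0 * sum(loss_map) / len(loss_map)
    burst = max_burst(loss_map)
    return {
        "loss_pct": round(pct, 1),
        "max_burst_ms": burst * ms_per_packet,
        "bursty": burst >= 3,  # 3+ packets (60ms+) can wipe out a word
    }

# Same 5% loss rate, very different audio damage:
uniform = [i % 20 == 0 for i in range(100)]  # one loss every 20 packets
bursty = [0 <= i < 5 for i in range(100)]    # five consecutive losses
print(loss_profile(uniform))  # {'loss_pct': 5.0, 'max_burst_ms': 20, 'bursty': False}
print(loss_profile(bursty))   # {'loss_pct': 5.0, 'max_burst_ms': 100, 'bursty': True}
```

Both traces report 5% loss; only the burst profile reveals the 100ms hole that destroys a word.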

The jitter buffer size dilemma

A static jitter buffer sized for worst-case network conditions will add unnecessary latency on good-network calls. A dynamic jitter buffer that shrinks on good-network calls and grows during jitter events provides better average-case latency, but the adaptation events can cause transient audio quality issues. For AI voice deployments, understanding which jitter buffer configuration your media gateway uses — and how it performs on your specific carrier paths — is a meaningful optimization lever that most teams haven’t measured.
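The adaptation behavior can be sketched as a toy dynamic buffer that grows quickly when a packet arrives later than the current depth and decays slowly when the network is calm. All constants here are illustrative, not taken from any real media gateway:

```python
# Toy dynamic jitter buffer: grows fast on late packets, shrinks slowly
# when the network is calm. Constants are illustrative only.

class AdaptiveJitterBuffer:
    def __init__(self, depth_ms=40, min_ms=20, max_ms=120):
        self.depth_ms = depth_ms
        self.min_ms, self.max_ms = min_ms, max_ms

    def observe(self, lateness_ms):
        """Update target depth from how late a packet arrived."""
        if lateness_ms > self.depth_ms:
            # Packet would have underrun the buffer: grow aggressively.
            self.depth_ms = min(self.max_ms, lateness_ms + 10)
        else:
            # Calm network: decay toward the minimum, reclaiming latency.
            self.depth_ms = max(self.min_ms, self.depth_ms - 1)
        return self.depth_ms

buf = AdaptiveJitterBuffer()
for late in [5, 5, 5, 80, 5, 5]:  # a single 80ms jitter spike
    print(buf.observe(late))      # prints 39, 38, 37, 90, 89, 88
```

The asymmetry (grow fast, shrink slow) is the source of both the benefit and the problem: latency recovers gradually after a spike, and each depth change is a potential audio discontinuity.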

Correlation with carrier and network path

Jitter and packet loss are often carrier-specific or route-specific, not random. Calls that transit a particular network path may consistently show higher jitter than calls on other paths. Calls from mobile networks in certain regions may show systematic packet loss. Without per-call RTP metrics grouped by carrier and time-of-day, this pattern is invisible. Once visible, it’s often addressable by changing carrier routing or SIP trunk configuration.
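The grouping itself is simple once per-call metrics exist. A sketch assuming hypothetical call records with carrier, loss, and jitter fields (not a real data model):

```python
# Sketch: group per-call RTP metrics by carrier to surface route-specific
# problems. The call records and field names are hypothetical.
from collections import defaultdict
from statistics import mean

calls = [
    {"carrier": "carrier-a", "loss_pct": 0.1, "max_jitter_ms": 12},
    {"carrier": "carrier-a", "loss_pct": 0.3, "max_jitter_ms": 18},
    {"carrier": "carrier-b", "loss_pct": 3.8, "max_jitter_ms": 95},
    {"carrier": "carrier-b", "loss_pct": 4.4, "max_jitter_ms": 109},
]

by_carrier = defaultdict(list)
for call in calls:
    by_carrier[call["carrier"]].append(call)

for carrier, group in sorted(by_carrier.items()):
    print(carrier,
          f"avg_loss={mean(c['loss_pct'] for c in group):.1f}%",
          f"avg_max_jitter={mean(c['max_jitter_ms'] for c in group):.0f}ms")
```

Adding a time-of-day dimension to the same grouping is what exposes patterns like "carrier-b degrades every evening", which aggregate stats average away.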

What to look at in diagnostics

For any call with quality complaints: start with packet loss percentage and max jitter observed during the call. If packet loss is above 2% or max jitter exceeded 80ms, the network path contributed to the quality issue. Correlate with TTFS — if TTFS was also high, the jitter buffer depth may have been a compounding factor. If packet loss is near zero and jitter was low, the issue is likely upstream (model inference, TTS latency, or carrier setup timing) rather than network path quality.
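The triage order above can be captured as a small decision function. The 2% loss and 80ms jitter thresholds come from the text; the record fields and the TTFS budget are illustrative assumptions:

```python
# Sketch of the triage order described above: check the network path first,
# then look upstream. Field names and the 1000ms TTFS budget are illustrative.

def triage(call):
    findings = []
    network_suspect = call["loss_pct"] > 2.0 or call["max_jitter_ms"] > 80
    if network_suspect:
        findings.append("network path contributed to the quality issue")
        if call.get("ttfs_ms", 0) > 1000:  # illustrative TTFS budget
            findings.append("jitter buffer depth may be compounding TTFS")
    else:
        findings.append("look upstream: model inference, TTS, or carrier setup")
    return findings

print(triage({"loss_pct": 3.1, "max_jitter_ms": 95, "ttfs_ms": 1400}))
print(triage({"loss_pct": 0.0, "max_jitter_ms": 15, "ttfs_ms": 600}))
```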

Per-call packet metrics

Telepath captures packet loss, jitter, and RTCP data per call. You can inspect any call’s network quality directly rather than guessing from aggregate stats.

Carrier-level aggregation

Group packet quality by carrier to identify whether jitter or loss is carrier-specific. This makes carrier escalations evidence-based rather than anecdotal.

Correlated with experience metrics

Packet quality metrics appear alongside TTFS and MOS in the same per-call view. You can see directly whether network-layer issues correlated with experience-layer degradation on a specific call.

See the network layer on every call, not just the ones you already know are bad.

Telepath instruments RTP quality metrics per call so you can correlate network behavior with user experience and stop debugging jitter issues by feel.