Resource — Latency

Why your AI voice agent feels slow, and what to actually fix.

Most teams blame the language model when callers complain about lag. In practice, a substantial share of first-response latency lives in the transport layer — between the carrier and the agent runtime — and it’s addressable without touching your model or prompts.

What TTFS means

Time to First Speech (TTFS) is the elapsed time from when the caller finishes speaking to when the AI agent’s first audio reaches the caller’s ear. It’s the single metric most correlated with caller-perceived responsiveness.

Where the budget goes

TTFS is the sum of VAD end-of-utterance detection, ASR transcription, LLM first-token generation, TTS first-audio synthesis, and transport delivery back to the caller. Each stage has a floor and a ceiling.
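Since the stages run sequentially, the budget is a straight sum. A minimal sketch with illustrative stage timings (the names and numbers here are assumptions, not benchmarks):

```python
# Hypothetical per-stage timings in ms for one call; values are illustrative.
STAGES = {
    "vad_end_of_utterance": 180,
    "asr_final_transcript": 90,
    "llm_first_token": 450,
    "tts_first_audio": 120,
    "transport_delivery": 25,
}

def ttfs_ms(stages: dict[str, int]) -> int:
    """TTFS is the straight sum of the sequential stage latencies."""
    return sum(stages.values())

print(ttfs_ms(STAGES))  # 865
```

Shaving any single stage lowers the total one-for-one, which is why per-stage measurement matters more than optimizing the stage you happen to control.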

What you can control

Model selection and prompt length affect LLM timing. Transport path and gateway processing affect delivery overhead. Codec transcoding and buffering affect how quickly the agent starts hearing the caller.

Breaking down the latency budget

A typical AI voice call that “feels slow” is experiencing cumulative latency across at least four distinct stages. Understanding which stage is contributing the most is the first step toward targeted optimization — and it requires per-call measurement, not guessing.

SIP setup and carrier signaling: 100–400ms

Before any audio flows, the call must be signaled and answered. A SIP INVITE travels from the PSTN gateway to your carrier, through any SIP proxy chain, to your termination endpoint. Each hop adds round-trip time. Carriers vary significantly here: a carrier with a geographically distant SIP proxy or a slow 183 Session Progress response adds visible latency before the first RTP packet even arrives. Monitoring SIP setup timing per-carrier lets you compare actual performance across providers rather than relying on their published SLAs.
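Per-carrier comparison only requires logging the INVITE-to-first-response delta and aggregating. A minimal sketch with invented samples (carrier names and timings are illustrative):

```python
from collections import defaultdict
from statistics import median

# Hypothetical samples: ms from INVITE sent to first provisional response,
# keyed by carrier. In practice these come from your SIP stack's logs.
setup_samples = [
    ("carrier_a", 140), ("carrier_a", 160), ("carrier_a", 390),
    ("carrier_b", 110), ("carrier_b", 120), ("carrier_b", 130),
]

by_carrier: dict[str, list[int]] = defaultdict(list)
for carrier, ms in setup_samples:
    by_carrier[carrier].append(ms)

for carrier, samples in sorted(by_carrier.items()):
    print(carrier, median(samples))
```

Even this crude aggregation reveals a carrier whose median looks fine but whose worst calls (the 390ms sample above) would be invisible in a published SLA.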

Gateway processing and media bridging: 10–30ms

Once RTP media flows, it needs to be terminated, transcoded, and forwarded to the AI runtime over WebSocket. This is the transport boundary Telepath owns. At ~12ms of added gateway processing, this stage should be negligible in a well-run deployment. Where teams get into trouble is when they use unoptimized bridging approaches that buffer aggressively, introduce codec transcoding chains (G.711 → G.722 → PCM), or add unnecessary relay hops. Each adds latency that compounds with everything downstream.
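As an illustration of why short transcoding chains matter: G.711 μ-law can be expanded straight to linear 16-bit PCM in a single step, with no intermediate codec. A sketch of the standard G.711 μ-law expansion (the textbook algorithm, not any particular gateway's implementation):

```python
def ulaw_to_pcm16(byte: int) -> int:
    """Expand one G.711 mu-law byte to a linear 16-bit PCM sample."""
    b = ~byte & 0xFF                       # mu-law bytes are stored inverted
    sign = b & 0x80
    exponent = (b >> 4) & 0x07
    mantissa = b & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

print(ulaw_to_pcm16(0xFF))  # 0 (mu-law "silence")
print(ulaw_to_pcm16(0x80))  # 32124 (positive full scale)
```

Each extra codec in a chain adds its own frame buffering on top of the arithmetic, which is where the compounding latency actually comes from.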

VAD, ASR, and turn detection: 50–300ms

Voice Activity Detection determines when the caller has stopped speaking. If your VAD end-of-speech timeout is set too conservatively (e.g., 800ms silence before triggering end-of-turn), the agent waits unnecessarily before even starting to transcribe. Streaming ASR systems begin transcription in parallel with the caller speaking, but they still need a complete utterance signal before the LLM can begin. This stage is tunable, but tuning too aggressively causes the agent to cut callers off mid-sentence.

LLM first token: 200–800ms

Time-to-first-token (TTFT) from the language model is heavily influenced by model size, prompt length, and inference infrastructure geography. A large frontier model with a 10,000-token system prompt running in a distant data center will have a materially higher floor than a fine-tuned smaller model co-located with the agent runtime. Many teams overinvest in prompt optimization before measuring whether the LLM is actually the bottleneck.
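Measuring TTFT is cheap and should come before any prompt surgery. A sketch using a stand-in stream; `fake_stream` is a placeholder you would swap for your actual streaming model client:

```python
import time

def measure_ttft(make_stream) -> float:
    """Ms from request start to first streamed token.

    `make_stream` is a zero-arg callable that issues the request and
    returns a token iterator, so the timer covers request dispatch too.
    """
    t0 = time.perf_counter()
    stream = make_stream()
    next(iter(stream))  # block until the first token arrives
    return (time.perf_counter() - t0) * 1000.0

def fake_stream():
    """Stand-in for a streaming LLM call: prefill delay, then tokens."""
    time.sleep(0.05)
    yield "Hello"
    yield " world"

print(f"TTFT: {measure_ttft(fake_stream):.0f} ms")
```

Logged per call alongside the other stage timings, this number settles the "is the model the bottleneck?" question with data instead of intuition.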

TTS first audio: 80–250ms

Text-to-speech systems stream audio back sentence-by-sentence or chunk-by-chunk. The time from “LLM generates first token” to “TTS produces first audible frame” is non-trivial. Streaming TTS APIs help, but they still require a minimum buffering threshold before synthesizing. TTS provider and voice quality both affect this floor.
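The buffering floor can be sketched as follows. `MIN_CHARS` and the token stream are illustrative assumptions; real TTS engines use their own, often phrase-aware, thresholds:

```python
# Assumed minimum text the TTS engine buffers before synthesizing audio.
MIN_CHARS = 12

def first_synthesis_point(tokens: list[str]) -> int:
    """Return how many LLM tokens must arrive before the TTS buffer
    threshold is met and first-audio synthesis can begin."""
    buffered = ""
    for i, tok in enumerate(tokens, start=1):
        buffered += tok
        if len(buffered) >= MIN_CHARS:
            return i
    return len(tokens)

# With these tokens, synthesis can start after the 5th token.
print(first_synthesis_point(["Sure", ",", " I", " can", " help", " with", " that", "."]))
```

The takeaway: TTS first-audio latency is partly a function of how fast the LLM emits its opening tokens, so the two stages are not fully independent.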

Transport delivery back to caller: 10–80ms

Once TTS audio is available at the agent runtime, it travels back through the WebSocket stream and the media bridge, is transcoded back to G.711 PCMU/PCMA if required, and is forwarded via RTP to the carrier for PSTN delivery. Network round-trip between your cloud infra and the carrier's media server matters here. Callers on mobile networks or in geographically distant regions may experience an additional 40–80ms on this final leg.
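One fixed cost on this leg is packetization: the gateway must buffer a full RTP packet's worth of samples before anything leaves. A back-of-envelope sketch using standard G.711 narrowband parameters (the 20ms ptime is a common default, not a universal one):

```python
SAMPLE_RATE_HZ = 8000   # G.711 narrowband sample rate
PTIME_MS = 20           # audio duration carried per RTP packet

# Samples buffered before each packet can be sent; for PCMU this is
# also the payload size in bytes (1 byte per sample).
samples_per_packet = SAMPLE_RATE_HZ * PTIME_MS // 1000
print(samples_per_packet)  # 160
```

That 20ms of framing delay is unavoidable; the recoverable part of this stage is extra relay hops and oversized jitter buffers stacked on top of it.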

Measure first

Without per-call TTFS telemetry broken down by stage, you’re optimizing by feel. Attribution across carrier, gateway, and model timing is the only way to know which stage is actually the bottleneck on any given call.
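With per-stage timings in hand, attribution is trivial; without them, it is guesswork. A minimal sketch (stage names mirror the budget breakdown above; the numbers are invented for illustration):

```python
# Hypothetical per-stage timings (ms) for a single slow call.
call = {
    "sip_setup": 220,
    "gateway": 12,
    "vad_asr": 160,
    "llm_first_token": 610,
    "tts_first_audio": 140,
    "transport": 30,
}

# The stage contributing the most latency on this particular call.
bottleneck = max(call, key=call.get)
print(bottleneck)  # llm_first_token
```

On a different call the answer might be sip_setup on one carrier and llm_first_token on another, which is exactly why the attribution has to be per-call.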

Minimize transport overhead

A clean, single-hop bridge from carrier RTP to agent WebSocket eliminates transcoding chains and intermediate relay latency. The gateway stage should cost <15ms; anything higher is recoverable overhead.

Track P95, not averages

Median TTFS hides the tail. Callers notice the slow calls, not the average one. Tracking P95 TTFS by carrier, region, and time-of-day surfaces the degradations that average metrics mask entirely.
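A worked example of why the tail matters, using nearest-rank P95 and invented TTFS samples: a fleet that looks healthy on average can still be badly degraded at P95:

```python
from math import ceil

def p95(samples: list[float]) -> float:
    """Nearest-rank P95: the value at or below which 95% of samples fall."""
    ranked = sorted(samples)
    return ranked[ceil(0.95 * len(ranked)) - 1]

# 18 good calls and 2 terrible ones (illustrative TTFS values in ms).
ttfs = [900] * 18 + [2400, 2600]
print(sum(ttfs) / len(ttfs))  # mean: 1060.0 ms — looks acceptable
print(p95(ttfs))              # P95: 2400 ms — the calls people complain about
```

Computing this per carrier, region, and hour is what turns "the agent feels slow sometimes" into a specific, fixable degradation.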

Get per-call TTFS attribution across carrier, gateway, and model.

Telepath instruments each call with granular timing across the transport boundary so you can see exactly where latency is coming from before you change anything in your stack.