What TTFS means
Time to First Speech (TTFS) is the elapsed time from when the caller finishes speaking to when the AI agent’s first audio reaches the caller’s ear. It’s the single metric most correlated with caller-perceived responsiveness.
Most teams blame the language model when callers complain about lag. In practice, a substantial share of first-response latency lives in the transport layer — between the carrier and the agent runtime — and it’s addressable without touching your model or prompts.
TTFS is the sum of VAD end-of-utterance detection, ASR transcription, LLM first-token generation, TTS first-audio synthesis, and transport delivery back to the caller. Each stage has a floor and a ceiling.
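The composition above can be sketched as a simple per-call record. The stage names, field names, and timings here are illustrative assumptions, not an API from any particular product:

```python
from dataclasses import dataclass

@dataclass
class TtfsBreakdown:
    """One call's TTFS, split into the five stages described above (ms)."""
    vad_eou_ms: float          # VAD end-of-utterance detection
    asr_ms: float              # ASR transcription (post-utterance portion)
    llm_ttft_ms: float         # LLM time-to-first-token
    tts_first_audio_ms: float  # TTS first-audio synthesis
    transport_ms: float        # delivery back to the caller

    def total_ms(self) -> float:
        return (self.vad_eou_ms + self.asr_ms + self.llm_ttft_ms
                + self.tts_first_audio_ms + self.transport_ms)

    def bottleneck(self) -> str:
        """Name the stage contributing the most latency on this call."""
        stages = {
            "vad": self.vad_eou_ms,
            "asr": self.asr_ms,
            "llm": self.llm_ttft_ms,
            "tts": self.tts_first_audio_ms,
            "transport": self.transport_ms,
        }
        return max(stages, key=stages.get)

call = TtfsBreakdown(vad_eou_ms=500, asr_ms=120, llm_ttft_ms=350,
                     tts_first_audio_ms=180, transport_ms=60)
print(call.total_ms())    # 1210.0
print(call.bottleneck())  # vad
```

Note that in the example numbers, the VAD silence timeout, not the model, dominates the total.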
Model selection and prompt length affect LLM timing. Transport path and gateway processing affect delivery overhead. Codec transcoding and buffering affect how quickly the agent starts hearing the caller.
A typical AI voice call that “feels slow” is experiencing cumulative latency across at least four distinct stages. Understanding which stage is contributing the most is the first step toward targeted optimization — and it requires per-call measurement, not guessing.
Before any audio flows, the call must be signaled and answered. A SIP INVITE travels from the PSTN gateway to your carrier, through any SIP proxy chain, to your termination endpoint. Each hop adds round-trip time. Carriers vary significantly here: a carrier with a geographically distant SIP proxy or a slow 183 Session Progress response adds visible latency before the first RTP packet even arrives. Monitoring SIP setup timing per carrier lets you compare actual performance across providers rather than relying on their published SLAs.
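A minimal sketch of that per-carrier comparison, assuming you log two timestamps per call (INVITE sent, first provisional response received); the record shape and carrier names are made up for illustration:

```python
from collections import defaultdict
from statistics import median

# Hypothetical call records: (carrier, invite_sent_ts, first_provisional_ts),
# timestamps in seconds.
calls = [
    ("carrier_a", 0.000, 0.180),
    ("carrier_a", 0.000, 0.220),
    ("carrier_b", 0.000, 0.450),
    ("carrier_b", 0.000, 0.520),
]

setup_ms = defaultdict(list)
for carrier, sent, answered in calls:
    setup_ms[carrier].append((answered - sent) * 1000)

# Median SIP setup time per carrier, in milliseconds.
for carrier, samples in sorted(setup_ms.items()):
    print(carrier, round(median(samples), 1))
```

With real traffic, the same aggregation over a week of calls makes carrier-to-carrier differences directly comparable.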
Once RTP media flows, it needs to be terminated, transcoded, and forwarded to the AI runtime over WebSocket. This is the transport boundary Telepath owns. At ~12ms of added gateway processing, this stage should be negligible in a well-run deployment. Where teams get into trouble is when they use unoptimized bridging approaches that buffer aggressively, introduce codec transcoding chains (G.711 → G.722 → PCM), or add unnecessary relay hops. Each adds latency that compounds with everything downstream.
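The compounding effect is easy to see as a latency budget. The per-hop costs below are illustrative assumptions, not measurements:

```python
# Single-hop bridge: terminate RTP, decode u-law to PCM, forward over WebSocket.
direct_bridge = {"rtp_terminate": 4, "ulaw_to_pcm": 1, "ws_forward": 5}

# Unoptimized bridge: an intermediate transcode chain plus an extra relay hop.
chained_bridge = {
    "rtp_terminate": 4,
    "ulaw_to_g722": 8,   # unnecessary intermediate transcode
    "g722_to_pcm": 8,
    "relay_hop": 15,     # extra media relay
    "ws_forward": 5,
}

print(sum(direct_bridge.values()))   # 10
print(sum(chained_bridge.values()))  # 40
```

The chained path costs 4x the direct one before any downstream stage has run, and that overhead is paid on every turn of the conversation.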
Voice Activity Detection determines when the caller has stopped speaking. If your VAD end-of-speech timeout is set too conservatively (e.g., 800ms silence before triggering end-of-turn), the agent waits unnecessarily before even starting to transcribe. Streaming ASR systems begin transcription in parallel with the caller speaking, but they still need a complete utterance signal before the LLM can begin. This stage is tunable, but tuning too aggressively causes the agent to cut callers off mid-sentence.
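The timeout trade-off can be sketched with a toy silence-based end-of-utterance detector over 20ms frames. The frame size, timeout values, and function name are illustrative assumptions, not a real VAD implementation:

```python
FRAME_MS = 20  # one VAD decision per 20ms audio frame

def end_of_utterance(frames_are_speech, silence_timeout_ms=800):
    """Return the frame index at which end-of-turn triggers, or None."""
    silent_ms = 0
    heard_speech = False
    for i, is_speech in enumerate(frames_are_speech):
        if is_speech:
            heard_speech = True
            silent_ms = 0  # any speech resets the silence clock
        elif heard_speech:
            silent_ms += FRAME_MS
            if silent_ms >= silence_timeout_ms:
                return i
    return None

# 10 frames of speech followed by silence: an 800ms timeout waits through
# 40 silent frames before triggering; a 300ms timeout triggers after 15.
frames = [True] * 10 + [False] * 60
print(end_of_utterance(frames, 800))  # 49
print(end_of_utterance(frames, 300))  # 24
```

The 500ms difference between those two settings lands directly on TTFS; the cost of the aggressive setting is that a caller who pauses mid-sentence for 300ms gets interrupted.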
Time-to-first-token (TTFT) from the language model is heavily influenced by model size, prompt length, and inference infrastructure geography. A large frontier model with a 10,000-token system prompt running in a distant data center will have a materially higher floor than a fine-tuned smaller model co-located with the agent runtime. Many teams overinvest in prompt optimization before measuring whether the LLM is actually the bottleneck.
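Measuring TTFT rather than guessing is straightforward if the model streams tokens. This is a generic sketch around any token iterator, not a specific LLM SDK's API; the simulated stream stands in for a real model call:

```python
import time

def measure_ttft(token_stream):
    """Consume a token iterator; return (seconds_to_first_token, tokens)."""
    start = time.monotonic()
    ttft = None
    tokens = []
    for tok in token_stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token observed
        tokens.append(tok)
    return ttft, tokens

def fake_stream():
    """Simulated model: ~50ms of 'prefill' before the first token."""
    time.sleep(0.05)
    yield "Hello"
    yield ","
    yield " caller"

ttft, toks = measure_ttft(fake_stream())
print(ttft >= 0.05, toks)
```

Logging this number per call, alongside the other stage timings, answers whether the LLM is actually the bottleneck before any prompt surgery begins.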
Text-to-speech systems stream audio back sentence-by-sentence or chunk-by-chunk. The time from “LLM generates first token” to “TTS produces first audible frame” is non-trivial. Streaming TTS APIs help, but they still require a minimum buffering threshold before synthesizing. TTS provider and voice quality both affect this floor.
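The buffering threshold can be sketched as a chunker that sits between the LLM token stream and the TTS API: it flushes once enough text has accumulated, or early at sentence boundaries. The threshold value and function name are illustrative assumptions:

```python
def chunk_for_tts(token_iter, min_chars=40):
    """Yield text chunks for streaming TTS once a buffering threshold is met,
    flushing early on sentence-ending punctuation."""
    buf = ""
    for tok in token_iter:
        buf += tok
        if len(buf) >= min_chars or buf.rstrip().endswith((".", "?", "!")):
            yield buf
            buf = ""
    if buf:
        yield buf  # flush any trailing partial sentence

tokens = ["Sure, ", "I can ", "help with ", "that. ", "What is ", "your ",
          "account ", "number?"]
print(list(chunk_for_tts(tokens, min_chars=40)))
```

A smaller threshold lowers time-to-first-audio but gives the voice less context per synthesis call, which can hurt prosody; the right setting is provider-specific.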
Once TTS audio is available at the agent runtime, it travels back through the WebSocket stream and the media bridge, is transcoded back to G.711 PCMU/PCMA if required, and is forwarded via RTP to the carrier for PSTN delivery. Network round-trip between your cloud infra and the carrier’s media server matters here. Callers on mobile networks or in geographically distant regions may experience an additional 40–80ms on this final leg.
Without per-call TTFS telemetry broken down by stage, you’re optimizing by feel. Attribution across carrier, gateway, and model timing is the only way to know which stage is actually the bottleneck on any given call.
A clean, single-hop bridge from carrier RTP to agent WebSocket eliminates transcoding chains and intermediate relay latency. The gateway stage should cost <15ms; anything higher is recoverable overhead.
Median TTFS hides the tail. Callers notice the slow calls, not the average one. Tracking P95 TTFS by carrier, region, and time-of-day surfaces the degradations that average metrics mask entirely.
Telepath instruments each call with granular timing across the transport boundary so you can see exactly where latency is coming from before you change anything in your stack.