
Resource — Architecture

The protocol gap between your carrier and your AI voice agent.

SIP carriers deliver audio as G.711 PCMU or PCMA over RTP. AI voice runtimes expect streaming PCM16 audio over WebSocket. The two sides differ fundamentally in protocol, framing, and timing model, and getting from one to the other cleanly is a real engineering boundary.

What carriers send

SIP signaling (INVITE, ACK, BYE) over UDP or TCP, with RTP audio streams carrying G.711 PCMU or PCMA at 8 kHz in 20ms packets. No concept of a persistent bidirectional channel.

What AI agents expect

A persistent WebSocket connection with streaming PCM16 audio, typically at 8 kHz or 16 kHz. The agent sends audio back over the same connection. Signaling events arrive as JSON messages on the same channel.
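As a sketch of what "signaling as JSON on the same channel" can look like, here are a few hypothetical event messages. Every runtime defines its own schema, so the field names below are illustrative assumptions, not any particular platform's API:

```python
import json

# Hypothetical signaling events as an AI runtime might receive them on the
# same WebSocket that carries audio. All field names are illustrative.
call_started = {"type": "call.started", "call_id": "abc123",
                "from": "+15550100", "to": "+15550199"}
dtmf_event = {"type": "call.dtmf", "call_id": "abc123", "digit": "5"}
call_ended = {"type": "call.ended", "call_id": "abc123", "reason": "bye"}

for event in (call_started, dtmf_event, call_ended):
    print(json.dumps(event))
```

Binary WebSocket frames carry the audio; text frames like these carry the control plane.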

What the bridge must do

Terminate SIP signaling, terminate RTP, transcode codecs, reframe audio for WebSocket delivery, handle clock drift and jitter, and translate call lifecycle events (answer, hold, DTMF, disconnect) in both directions.

What the bridge actually needs to handle

Building a SIP-to-WebSocket bridge that works in a demo is relatively straightforward. Building one that performs reliably under production load, across carriers, is a different problem. Here are the specific responsibilities that teams underestimate.

SIP signaling termination

The bridge must act as a SIP User Agent Server (UAS), responding correctly to INVITE with 100 Trying, 180 Ringing, and 200 OK. It must negotiate media parameters via SDP, handle authentication challenges (407 Proxy Authentication Required), and process mid-call re-INVITEs for hold, transfer, and codec renegotiation. BYE and CANCEL must trigger proper cleanup on the WebSocket side. Carriers differ in their SIP implementation details — some require specific SDP attributes, others vary in how they signal DTMF (RFC 2833 vs. SIP INFO).
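To make the SDP negotiation step concrete, here is a minimal sketch of the media description a bridge's 200 OK might carry, offering PCMU with RFC 2833 telephone-events. The function name and values are illustrative; a real UAS must mirror the offer it actually received:

```python
def sdp_answer(session_id: int, media_ip: str, rtp_port: int) -> str:
    """Build a minimal SDP answer: G.711 PCMU (payload type 0) plus
    RFC 2833 telephone-events (payload type 101) for DTMF."""
    return "\r\n".join([
        "v=0",
        f"o=bridge {session_id} {session_id} IN IP4 {media_ip}",
        "s=-",
        f"c=IN IP4 {media_ip}",
        "t=0 0",
        f"m=audio {rtp_port} RTP/AVP 0 101",
        "a=rtpmap:0 PCMU/8000",
        "a=rtpmap:101 telephone-event/8000",
        "a=fmtp:101 0-16",
        "a=sendrecv",
    ]) + "\r\n"

print(sdp_answer(1, "203.0.113.10", 10000))
```

The `telephone-event` line is exactly the kind of carrier-sensitive detail mentioned above: carriers expecting SIP INFO DTMF will ignore it, while RFC 2833 carriers require it.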

RTP media termination

RTP streams arrive as discrete UDP datagrams. Each 20ms G.711 packet contains 160 bytes of audio samples. The bridge must maintain an RTP receive buffer, handle out-of-order packets, and detect packet loss without stalling the media pipeline. RTCP packets carrying quality metrics (packet loss, jitter) arrive on the adjacent port (conventionally the RTP port + 1) and provide the raw data for transport diagnostics.
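The per-datagram work starts with the fixed 12-byte RTP header defined in RFC 3550. A minimal parser, ignoring CSRC lists and header extensions for brevity, might look like this:

```python
import struct

def parse_rtp(datagram: bytes):
    """Parse the fixed 12-byte RTP header (RFC 3550).
    Returns (payload_type, sequence_number, timestamp, ssrc, payload).
    CSRC lists and header extensions are ignored for brevity."""
    if len(datagram) < 12:
        raise ValueError("datagram too short for RTP")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", datagram[:12])
    if b0 >> 6 != 2:
        raise ValueError("not RTP version 2")
    payload_type = b1 & 0x7F  # low 7 bits; the high bit is the marker flag
    return payload_type, seq, ts, ssrc, datagram[12:]

# A synthetic 20 ms G.711 PCMU packet: payload type 0, 160 payload bytes.
pkt = struct.pack("!BBHII", 0x80, 0, 1234, 160, 0xDEADBEEF) + b"\xff" * 160
pt, seq, ts, ssrc, payload = parse_rtp(pkt)
print(pt, seq, ts, len(payload))  # 0 1234 160 160
```

The sequence number drives reordering and loss detection; the timestamp (which advances by 160 per 20 ms packet at 8 kHz) drives the jitter buffer discussed below.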

Codec transcoding

G.711 PCMU (mu-law) and PCMA (a-law) are companded codecs operating at 8 kHz. Most AI voice runtimes want linear PCM16 at 8 kHz or 16 kHz. The transcoding step is computationally cheap but must happen with minimal buffering to avoid adding latency. Delivering 16 kHz from an 8 kHz source additionally requires upsampling, which adds a small but non-zero processing cost. Getting this wrong introduces audio artifacts that confuse streaming ASR.
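For concreteness, here is a pure-Python reference decode of G.711 mu-law to linear PCM16 (production bridges use a 256-entry lookup table or SIMD, but the expansion itself is this simple):

```python
def mulaw_to_pcm16(data: bytes) -> list[int]:
    """Decode G.711 mu-law bytes to linear PCM16 samples.
    Pure-Python reference; real bridges use a table lookup."""
    out = []
    for b in data:
        b = ~b & 0xFF                  # mu-law is stored inverted
        sign = b & 0x80
        exponent = (b >> 4) & 0x07     # 3-bit segment number
        mantissa = b & 0x0F            # 4-bit step within the segment
        sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
        out.append(-sample if sign else sample)
    return out

print(mulaw_to_pcm16(b"\xff\x80\x00"))  # [0, 32124, -32124]
```

Note the asymmetry with resampling: this decode is a per-byte table operation, while 8 kHz to 16 kHz upsampling needs an interpolation filter, which is where buffering and latency decisions creep in.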

Jitter buffer management

RTP packets arrive with variable timing due to network jitter. A jitter buffer absorbs this variability and emits audio at a consistent rate. The buffer size is a tradeoff: too small, and packet reordering causes gaps in the audio stream delivered to the AI agent; too large, and you add systematic latency to every call. Dynamic jitter buffers adjust based on observed network conditions, but they require careful implementation to avoid artifacts during adaptation. The audio delivered to streaming ASR must be contiguous and correctly timestamped.
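The core reordering mechanic can be sketched in a few lines. This toy fixed-depth buffer releases packets in sequence order once a minimum depth is queued; it is illustrative only, since a production buffer also tracks arrival time, adapts its depth, handles sequence-number wrap, and conceals losses:

```python
import heapq

class JitterBuffer:
    """Toy fixed-depth jitter buffer: reorders packets by RTP sequence
    number and releases them once `depth` packets are queued."""
    def __init__(self, depth: int = 3):
        self.depth = depth
        self.heap: list[tuple[int, bytes]] = []

    def push(self, seq: int, payload: bytes) -> None:
        heapq.heappush(self.heap, (seq, payload))

    def pop(self):
        """Return the next in-order packet, or None while still filling."""
        if len(self.heap) < self.depth:
            return None
        return heapq.heappop(self.heap)

jb = JitterBuffer(depth=3)
for seq in (2, 1, 3):        # packets arrive out of order
    jb.push(seq, b"...")
print(jb.pop()[0])           # 1 -- emitted in sequence order
```

The `depth` parameter is the tradeoff described above made explicit: each unit of depth is one packet interval (20 ms here) of added latency bought in exchange for reordering headroom.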

WebSocket framing and backpressure

WebSocket is a message-framed protocol, not a byte-stream. The bridge must chunk audio into appropriately sized frames (typically 20–160ms of PCM16) for delivery to the AI runtime. It must also handle backpressure if the agent runtime is slow to consume audio, and process outbound audio from the agent (TTS playback) for re-encoding to G.711 and transmission back via RTP to the carrier.
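A minimal sketch of the framing and backpressure halves, assuming 20 ms frames of 8 kHz PCM16 and using a bounded asyncio queue as the backpressure point (`ws_send` stands in for whatever WebSocket client the bridge uses):

```python
import asyncio

FRAME_MS = 20
SAMPLE_RATE = 8000
FRAME_BYTES = SAMPLE_RATE * 2 * FRAME_MS // 1000  # PCM16 mono: 320 bytes

def frame_pcm16(pcm: bytes):
    """Split a PCM16 byte stream into fixed 20 ms WebSocket messages."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield pcm[i:i + FRAME_BYTES]

async def sender(ws_send, frames: asyncio.Queue):
    """Drain frames toward the agent. The bounded queue is the
    backpressure point: if the runtime is slow, put() blocks upstream
    instead of buffering unboundedly."""
    while True:
        frame = await frames.get()
        if frame is None:  # sentinel: call ended
            break
        await ws_send(frame)

async def demo():
    q = asyncio.Queue(maxsize=8)  # bound = backpressure threshold
    sent = []
    async def fake_ws_send(frame):
        sent.append(frame)
    task = asyncio.create_task(sender(fake_ws_send, q))
    for frame in frame_pcm16(b"\x00" * (FRAME_BYTES * 3)):
        await q.put(frame)        # blocks here if the sender falls behind
    await q.put(None)
    await task
    print(len(sent), len(sent[0]))  # 3 320

asyncio.run(demo())
```

The reverse path (agent TTS back to G.711/RTP) is the mirror image, with its own pacing requirement: RTP must go out at a steady 20 ms cadence regardless of how burstily the agent delivers audio.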

Clock drift

RTP timestamps are derived from the sender's media clock. Over a long call, even a small drift between the carrier's clock and the bridge's clock accumulates. Without compensation, this manifests as audio gradually going out of sync, or as the jitter buffer slowly draining or overflowing as the two clocks diverge. Handling clock drift is a long-call reliability concern that is easy to miss in short-duration testing.
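Back-of-the-envelope arithmetic shows why short tests miss this. Assuming an illustrative 50 ppm offset between the carrier's clock and the bridge's 8 kHz playout clock:

```python
# How much excess (or missing) audio accumulates over a one-hour call
# if the carrier's media clock runs 50 ppm fast relative to the bridge.
ppm = 50                      # illustrative clock offset
sample_rate = 8000
call_seconds = 3600

extra_samples = sample_rate * call_seconds * ppm / 1_000_000
extra_ms = extra_samples / sample_rate * 1000
print(extra_samples, extra_ms)  # 1440.0 180.0
```

180 ms of accumulated skew per hour is invisible in a 2-minute test call but guarantees eventual buffer underrun or overrun, which is why bridges periodically insert or drop samples (or resample) to stay locked to the carrier's clock.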

One boundary, clearly owned

Telepath owns the SIP/RTP-to-WebSocket translation entirely. Your carrier sends SIP. Your AI agent gets WebSocket. The boundary is instrumented so you can see what’s happening at each layer.

Carrier agnostic

Telepath has tested against Twilio, Telnyx, Bandwidth, Vonage, Plivo, SignalWire, and Flowroute. Carrier-specific SIP quirks are handled at the gateway layer, not pushed onto your integration team.

Agent runtime agnostic

Any AI voice runtime that accepts streaming PCM16 over WebSocket works with Telepath. You keep your choice of agent platform — Vapi, Bland, Retell, LiveKit Agents, or a custom stack.

Stop building and maintaining the bridge. Use Telepath instead.

Telepath handles SIP termination, RTP termination, codec transcoding, jitter buffering, and WebSocket delivery out of the box. Connect your carrier and your AI agent, and get per-call diagnostics across the full transport path.