The Voice AI Agent Stack: What's Working in 2025

A technical breakdown of every layer in the voice AI stack — and which components to choose for latency, quality, and cost in production.

9 min read · April 2025

Voice AI agents are the fastest-growing category in enterprise automation right now. The combination of near-human speech quality, sub-second latency, and dramatically better language understanding has moved voice agents from "interesting toy" to "production SDR, support rep, and scheduling assistant."

But the stack is complex. Here is every layer, what the real choices are, and what we've found works in production at scale.

Voice AI agents handle thousands of concurrent conversations — the stack is what makes quality consistent.

The Full Stack, Layer by Layer

Layer 1: Telephony / Audio Transport

How the audio gets in and out. For outbound phone calls: Twilio or SignalWire. For web/app voice: WebRTC. For existing call centers: SIP trunk integration. Latency here matters — choose providers with low-latency WebSocket streaming.

Layer 2: Speech-to-Text (STT)

Converts audio to text in real time. Leaders in 2025: Deepgram Nova-3 (fastest, ~150ms), OpenAI Whisper v3 (most accurate), AssemblyAI (best for diarization/speaker tracking). For production voice agents, Deepgram wins on latency.
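Streaming STT APIs (Deepgram, AssemblyAI, and others) typically emit interim hypotheses that are later replaced by a final transcript for the same audio segment, and only the finals should be fed to the LLM. The buffer below is an illustrative sketch of that pattern, not any vendor's SDK:

```python
from dataclasses import dataclass, field

@dataclass
class TranscriptBuffer:
    """Accumulates streaming STT results.

    Interim results overwrite each other; final results are
    appended and become part of the stable transcript.
    """
    finals: list[str] = field(default_factory=list)
    interim: str = ""

    def on_result(self, text: str, is_final: bool) -> None:
        if is_final:
            self.finals.append(text)
            self.interim = ""  # superseded by the final
        else:
            self.interim = text  # overwrite previous hypothesis

    @property
    def stable_text(self) -> str:
        return " ".join(self.finals)

buf = TranscriptBuffer()
buf.on_result("hel", False)
buf.on_result("hello th", False)
buf.on_result("hello there", True)
print(buf.stable_text)  # hello there
```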

Layer 3: LLM (The Brain)

Takes the transcript, maintains conversation context, decides what to say and what actions to take. For voice, response latency is critical. GPT-4o mini or Claude Haiku provide the best latency/quality balance. Use streaming mode so TTS can start before the full response is generated.
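One common way to exploit streaming is to cut the token stream at sentence boundaries and hand each complete sentence to TTS the moment it finishes, rather than waiting for the full response. A minimal sketch (the token list and regex are assumptions for illustration, not any provider's API):

```python
import re

SENTENCE_END = re.compile(r"([.!?])\s")

def sentence_chunks(token_stream):
    """Yield complete sentences as soon as they finish streaming,
    so TTS synthesis can begin before the LLM response is done."""
    buf = ""
    for token in token_stream:
        buf += token
        while (m := SENTENCE_END.search(buf)):
            yield buf[: m.end(1)].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # flush whatever remains at end of stream

tokens = ["Sure", ", I", " can", " help. ",
          "What's", " your", " order", " number?"]
print(list(sentence_chunks(tokens)))
# ['Sure, I can help.', "What's your order number?"]
```

This is what makes the "LLM first token" line in the latency budget meaningful: TTS starts on the first finished sentence, not the last.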

Layer 4: Text-to-Speech (TTS)

Converts LLM output to audio. Leaders: ElevenLabs (most human-sounding, ~250ms), OpenAI TTS (fast, good quality), Cartesia Sonic (fastest streaming, ~90ms). ElevenLabs for quality; Cartesia for lowest latency.

Layer 5: Orchestration & Tools

The agent logic layer. Handles turn-taking, interruption detection, tool calls (CRM lookup, calendar booking, knowledge base search), and conversation state. LiveKit + LangGraph is the combination we use most in production.
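At its core, turn-taking is a small state machine. The sketch below is our own simplification, not LiveKit's or LangGraph's API; real orchestrators add endpointing timers and tool-call states on top of the same transitions:

```python
from enum import Enum, auto

class Turn(Enum):
    LISTENING = auto()  # waiting for / receiving user speech
    THINKING = auto()   # LLM generating a response
    SPEAKING = auto()   # agent TTS playing

class TurnManager:
    """Minimal turn-taking state machine for a voice agent."""

    def __init__(self):
        self.state = Turn.LISTENING

    def on_user_speech_end(self):
        if self.state is Turn.LISTENING:
            self.state = Turn.THINKING

    def on_llm_first_token(self):
        if self.state is Turn.THINKING:
            self.state = Turn.SPEAKING

    def on_user_speech_start(self):
        # Barge-in: the user interrupts generation or playback
        if self.state in (Turn.SPEAKING, Turn.THINKING):
            self.state = Turn.LISTENING

    def on_playback_done(self):
        if self.state is Turn.SPEAKING:
            self.state = Turn.LISTENING

tm = TurnManager()
tm.on_user_speech_end()
tm.on_llm_first_token()
tm.on_user_speech_start()  # barge-in cancels playback
print(tm.state)  # Turn.LISTENING
```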

End-to-End Latency Targets

Humans notice delays over 700ms as "unnatural." Here is the per-layer latency budget for a production voice agent targeting 600ms total — the layers sum to roughly 520ms, leaving about 80ms of headroom for jitter:

STT (Deepgram): 150ms
LLM (first token, streaming): 200ms
TTS (Cartesia, first audio): 90ms
Network overhead: 80ms
Total: ~520ms
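The budget arithmetic is worth automating as a sanity check whenever a component is swapped:

```python
# Per-layer latency budget (ms), matching the table above
budget_ms = {
    "STT (Deepgram)": 150,
    "LLM first token (streaming)": 200,
    "TTS first audio (Cartesia)": 90,
    "Network overhead": 80,
}

TARGET_MS = 600  # the perceptual ceiling we design against

total = sum(budget_ms.values())
headroom = TARGET_MS - total
print(f"total={total}ms headroom={headroom}ms")  # total=520ms headroom=80ms
```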

The Interruption Problem

The hardest technical challenge in voice AI is natural interruption handling. When a human starts speaking while the agent is speaking, the agent must stop, process the interruption, and respond naturally — without the jarring "robot finishing its sentence" effect.

Our approach: Voice Activity Detection (VAD) runs in parallel to TTS playback. When VAD detects speech start, playback stops within one audio chunk (~20ms), the new input is transcribed, and a new LLM call begins. The key is keeping the conversation context window updated in real time so the response to the interruption is coherent.

Ready to Build a Voice AI Agent?

We design and deploy voice AI agents for sales and support teams. Start with a free audit of your current phone workflows.

Get a Free AI Audit
Devin Mallonee

Founder & AI Agent Architect · CodeStaff

Devin has been building software products and remote teams since 2017. He founded CodeStaff to deploy purpose-built AI agents and workstations that replace repetitive work and scale operations for businesses of every size. He writes about AI strategy, agent architecture, and the practical reality of deploying AI in production.