Voice AI agents are the fastest-growing category in enterprise automation right now. The combination of near-human speech quality, sub-second latency, and dramatically better language understanding has moved voice agents from "interesting toy" to "production SDR, support rep, and scheduling assistant."
But the stack is complex. Here is every layer, what the real choices are, and what we've found works in production at scale.
The Full Stack, Layer by Layer
Telephony / Audio Transport
How the audio gets in and out. For outbound phone calls: Twilio or SignalWire. For web/app voice: WebRTC. For existing call centers: SIP trunk integration. Latency here matters — choose providers with low-latency WebSocket streaming.
Speech-to-Text (STT)
Converts audio to text in real time. Leaders in 2025: Deepgram Nova-3 (fastest, ~150ms), OpenAI Whisper v3 (most accurate), AssemblyAI (best for diarization/speaker tracking). For production voice agents, Deepgram wins on latency.
LLM (The Brain)
Takes the transcript, maintains conversation context, decides what to say and what actions to take. For voice, response latency is critical. GPT-4o mini or Claude Haiku provide the best latency/quality balance. Use streaming mode so TTS can start before the full response is generated.
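The streaming trick works by buffering LLM tokens and flushing to TTS at sentence boundaries, so the agent starts speaking the first sentence while the rest of the reply is still generating. A simplified sketch (a real implementation also needs to handle abbreviations and decimal numbers, which this naive boundary check would split):

```python
from typing import Iterable, Iterator

SENTENCE_ENDINGS = (".", "!", "?")

def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer streamed LLM tokens; yield each complete sentence.

    Each yielded sentence can be sent to TTS immediately instead of
    waiting for the full LLM response.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush any trailing partial sentence
```

With OpenAI's streaming API, for example, you would feed each `chunk.choices[0].delta.content` fragment into this generator and hand every yielded sentence straight to the TTS layer.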
Text-to-Speech (TTS)
Converts LLM output to audio. Leaders: ElevenLabs (most human-sounding, ~250ms), OpenAI TTS (fast, good quality), Cartesia Sonic (fastest streaming, ~90ms). ElevenLabs for quality; Cartesia for lowest latency.
Orchestration & Tools
The agent logic layer. Handles turn-taking, interruption detection, tool calls (CRM lookup, calendar booking, knowledge base search), and conversation state. LiveKit + LangGraph is the combination we use most in production.
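At the core of the tool-call path is a registry that maps the tool names the LLM emits to real functions. The sketch below is illustrative only; the registry pattern, tool names, and stub implementations are ours, not LiveKit's or LangGraph's actual API.

```python
from typing import Any, Callable

# Hypothetical tool registry: the LLM requests a tool by name with
# JSON arguments, and the orchestrator dispatches the call.
TOOLS: dict[str, Callable[..., Any]] = {}

def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function the LLM is allowed to call by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def lookup_crm(email: str) -> dict:
    # Stub: a real agent would query the CRM here.
    return {"email": email, "status": "qualified"}

@tool
def book_meeting(email: str, slot: str) -> dict:
    # Stub: a real agent would call the calendar API here.
    return {"email": email, "slot": slot, "booked": True}

def dispatch(name: str, arguments: dict) -> Any:
    """Run the tool the LLM requested with its decoded arguments."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**arguments)
```

Keeping dispatch behind one function also gives you a single place to add timeouts and logging, which matters when a slow CRM lookup would otherwise blow the latency budget.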
End-to-End Latency Targets
Humans notice delays over roughly 700ms as "unnatural," so a production voice agent should target about 600ms total from the end of the caller's speech to the first audio out. That budget has to be split across transport, STT, LLM time-to-first-token, and TTS.
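One illustrative way to split a 600ms budget is below. The STT and TTS figures come from the vendor latencies quoted above; the transport, LLM, and orchestration numbers are assumptions that vary by deployment and model choice:

```python
# Illustrative 600 ms end-to-end budget. STT and TTS figures are the
# vendor latencies cited above; the other lines are assumptions.
LATENCY_BUDGET_MS = {
    "audio transport (in + out)": 100,
    "speech-to-text (Deepgram Nova-3)": 150,
    "LLM time-to-first-token": 200,
    "TTS first audio (Cartesia Sonic)": 90,
    "orchestration overhead": 60,
}

def headroom_ms(budget: dict[str, int], target_ms: int = 600) -> int:
    """Milliseconds left under the end-to-end target (negative = over)."""
    return target_ms - sum(budget.values())
```

Note that swapping Cartesia for ElevenLabs (~250ms) alone would push this budget about 160ms over target, which is why TTS choice is usually a latency-versus-quality trade, not a pure quality call.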
The Interruption Problem
The hardest technical challenge in voice AI is natural interruption handling. When a human starts speaking while the agent is speaking, the agent must stop, process the interruption, and respond naturally — without the jarring "robot finishing its sentence" effect.
Our approach: Voice Activity Detection (VAD) runs in parallel to TTS playback. When VAD detects speech start, playback stops within one audio chunk (~20ms), the new input is transcribed, and a new LLM call begins. The key is keeping the conversation context window updated in real time so the response to the interruption is coherent.
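The playback side of that approach can be sketched as a controller that checks an interrupt flag between ~20ms audio chunks, so a VAD speech-start event halts output within one chunk. This is a minimal illustration of the pattern, not our production implementation:

```python
import threading

class PlaybackController:
    """Barge-in sketch: TTS playback that a parallel VAD can cut off.

    Playback emits short audio chunks and checks a stop flag between
    them, so an interruption halts output within one chunk (~20 ms).
    """

    def __init__(self) -> None:
        self._interrupted = threading.Event()
        self.chunks_played = 0

    def on_vad_speech_start(self) -> None:
        # Called from the VAD thread the moment the caller speaks.
        self._interrupted.set()

    def play(self, chunks: list[bytes]) -> bool:
        """Play chunks; return True only if playback finished."""
        self._interrupted.clear()
        for chunk in chunks:
            if self._interrupted.is_set():
                return False  # stop mid-utterance; caller's turn now
            self._send_to_caller(chunk)
            self.chunks_played += 1
        return True

    def _send_to_caller(self, chunk: bytes) -> None:
        pass  # stub: write to the telephony/WebRTC output stream
```

When `play` returns False, the orchestrator records how much of the reply was actually spoken, so the context window reflects what the caller heard rather than the full generated response.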
Top Voice AI Use Cases in Production
- Inbound sales qualification — answers calls, qualifies, books meetings with reps 24/7
- Outbound prospecting — calls qualified leads list, delivers personalized pitch, handles objections
- Appointment reminders — calls patients/customers before appointments, handles reschedules
- Customer support Tier 1 — handles FAQ, password resets, billing questions by phone
- Post-meeting follow-up — calls after product demos to handle questions and advance the deal
Ready to Build a Voice AI Agent?
We design and deploy voice AI agents for sales and support teams. Start with a free audit of your current phone workflows.
Get a Free AI Audit