Voice Agent

An AI agent that communicates through spoken language, combining speech recognition, language understanding, reasoning, and speech synthesis to conduct natural voice conversations. Voice agents enable hands-free AI interaction for phone, IoT, and accessibility use cases.

Voice agents add a spoken-language interface to AI capabilities. They convert speech to text, pass the text through a language model for understanding and reasoning, generate a response, and synthesize that response back into speech. Modern voice agents handle interruptions, manage turn-taking, keep track of conversational context, and can invoke tools mid-conversation, all without breaking the natural flow of dialogue.
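The speech-to-text, language model, and text-to-speech stages described above can be sketched as a single turn handler. This is a minimal illustration, not any vendor's API: the `VoiceAgent` class and the three stub functions are hypothetical stand-ins, and the "audio" is just UTF-8 bytes so the example runs without a speech engine.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text)


@dataclass
class VoiceAgent:
    """Hypothetical wrapper that chains the three pipeline stages."""
    transcribe: Callable[[bytes], str]      # speech recognition
    respond: Callable[[List[Turn]], str]    # language model
    synthesize: Callable[[str], bytes]      # speech synthesis
    history: List[Turn] = field(default_factory=list)

    def handle_turn(self, audio_in: bytes) -> bytes:
        # 1. Convert the caller's speech to text.
        user_text = self.transcribe(audio_in)
        self.history.append(("user", user_text))
        # 2. Generate a reply using the full conversation so far.
        reply_text = self.respond(self.history)
        self.history.append(("agent", reply_text))
        # 3. Convert the reply back to audio.
        return self.synthesize(reply_text)


# Stub stages (assumptions for illustration only).
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_respond(history: List[Turn]) -> str:
    return f"You said: {history[-1][1]}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


agent = VoiceAgent(fake_transcribe, fake_respond, fake_synthesize)
audio_out = agent.handle_turn(b"book a table for two")
```

Keeping the stages as injected callables means each one can be swapped or timed independently. A real system would also stream audio and handle interruptions, which this sketch leaves out.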

For customer-facing businesses, voice agents are transforming phone-based interactions such as customer support, appointment scheduling, order management, and lead qualification. The latency requirements are strict: pauses longer than roughly 500 milliseconds tend to feel unnatural, so the entire pipeline, from speech recognition through response generation to synthesis, must be optimized for speed. Solutions like OpenAI's Realtime API, LiveKit, and Vapi provide infrastructure for real-time voice interactions. Key considerations include handling accents and background noise, managing multi-speaker conversations, supporting multiple languages, and handing off gracefully to human agents when the voice agent reaches its limits.
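One way to reason about the 500 millisecond target is as a budget shared by the pipeline stages. The sketch below is only illustrative: the function name and the per-stage timings are made-up assumptions, not measurements from any real system.

```python
from typing import Dict

BUDGET_MS = 500  # target from the text above


def check_latency_budget(stage_ms: Dict[str, int], budget_ms: int = BUDGET_MS) -> dict:
    """Sum per-stage latencies and report whether the turn fits the budget."""
    total = sum(stage_ms.values())
    slowest = max(stage_ms, key=stage_ms.get)
    return {
        "total_ms": total,
        "within_budget": total <= budget_ms,
        "headroom_ms": budget_ms - total,  # negative means over budget
        "slowest_stage": slowest,          # first candidate for optimization
    }


# Hypothetical timings for one conversational turn.
example_turn = {
    "speech_recognition": 150,
    "response_generation": 250,
    "speech_synthesis": 120,
}
report = check_latency_budget(example_turn)
```

With these invented numbers the turn totals 520 ms, which is 20 ms over budget, and response generation is the stage to look at first.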

Related Terms

Model Context Protocol (MCP)

An open standard that defines how AI models connect to external tools, data sources, and services through a unified interface. MCP enables agents to dynamically discover and invoke capabilities without hardcoded integrations.

Tool Use

The ability of an AI model to invoke external functions, APIs, or services during a conversation to perform actions beyond text generation. Tool use transforms language models from passive responders into active problem solvers.

Function Calling

A model capability where the AI generates structured JSON arguments for predefined functions rather than free-form text. Function calling provides a reliable bridge between natural language understanding and programmatic execution.
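As a rough sketch of what this looks like in practice, the code below parses a JSON function call and routes it to a registered Python function. The `{"name": ..., "arguments": ...}` shape, the `schedule_appointment` function, and the hard-coded model output are assumptions for illustration and do not follow any specific provider's format.

```python
import json
from typing import Callable, Dict


def schedule_appointment(date: str, time: str) -> str:
    """Hypothetical predefined function the model is allowed to call."""
    return f"Booked for {date} at {time}"


# Registry of functions exposed to the model.
REGISTRY: Dict[str, Callable[..., str]] = {
    "schedule_appointment": schedule_appointment,
}


def dispatch(model_output: str) -> str:
    """Parse the model's structured output and invoke the named function."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in REGISTRY:
        raise ValueError(f"Unknown function: {name}")
    return REGISTRY[name](**call["arguments"])


# Stand-in for what a model might emit instead of free-form text.
model_output = (
    '{"name": "schedule_appointment", '
    '"arguments": {"date": "2025-03-14", "time": "10:30"}}'
)
result = dispatch(model_output)
```

Because the arguments arrive as structured JSON rather than prose, the calling code can validate them before executing anything.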

Agentic Workflow

A multi-step process where an AI agent autonomously plans, executes, and iterates on tasks using tools, reasoning, and feedback loops. Agentic workflows go beyond single-turn interactions to accomplish complex goals.

ReAct Pattern

An agent architecture that interleaves Reasoning and Acting steps, where the model thinks about what to do next, takes an action, observes the result, and repeats. ReAct combines chain-of-thought reasoning with tool use in a unified loop.
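The think, act, observe cycle can be sketched as a short loop. In this illustration the language model is replaced by a scripted policy so the example runs on its own. The names `react_loop`, `scripted_policy`, and `lookup_order` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str, str]  # (thought, action, argument, observation)


def react_loop(
    policy: Callable[[str, List[Step]], Tuple[str, str, str]],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 5,
) -> Tuple[str, List[Step]]:
    """Alternate reasoning and acting until the policy chooses to finish."""
    trace: List[Step] = []
    for _ in range(max_steps):
        # Reason: decide what to do next given the trace so far.
        thought, action, arg = policy(question, trace)
        if action == "finish":
            return arg, trace
        # Act, then observe the tool's result.
        observation = tools[action](arg)
        trace.append((thought, action, arg, observation))
    raise RuntimeError("No answer within max_steps")


def scripted_policy(question: str, trace: List[Step]) -> Tuple[str, str, str]:
    """Stand-in for a model: look up the order once, then answer."""
    if not trace:
        return ("I need the order status.", "lookup_order", "A123")
    status = trace[-1][3]
    return ("I have the status.", "finish", f"Order A123 is {status}.")


tools = {"lookup_order": lambda order_id: "shipped"}
answer, trace = react_loop(scripted_policy, tools, "Where is order A123?")
```

The `max_steps` cap is a design choice to keep a misbehaving policy from looping indefinitely.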

Chain of Thought

A prompting technique that instructs the model to break down complex problems into sequential reasoning steps before producing a final answer. Chain of thought can significantly improve accuracy on math, logic, and multi-step tasks.