Agent Safety

The discipline of ensuring AI agents behave predictably, respect boundaries, and do not cause harm through their actions. Agent safety encompasses prompt injection defense, action validation, scope limitation, and impact assessment.

Agent safety is the practice of preventing agents from causing unintended harm. Traditional software follows code paths fixed by its authors, whereas an agent chooses its actions at run time from model output, so the same request can lead to different behavior. This makes safety a multi-dimensional challenge. Safety concerns include the agent being manipulated through prompt injection, executing harmful tool calls, leaking sensitive data, or making decisions that disproportionately impact certain user groups.

For teams deploying agents in production, safety must be designed in from the start rather than bolted on afterward. Implement input validation to detect and block prompt injection attempts, and treat it as one layer of defense rather than a complete one. Use tool-level guardrails that validate parameters against allowlists before execution. Apply output filtering to prevent data leakage and keep responses brand-safe. Conduct adversarial testing in which red teams try to make the agent misbehave. Establish incident response procedures for when safety failures occur. The agent safety landscape is evolving rapidly, so staying current with research and best practices from organizations like NIST, Anthropic, and OpenAI is essential for responsible deployment.
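
As an illustration of two of the controls above, the following minimal Python sketch shows a tool-level guardrail that checks parameters against an allowlist before execution, and a simple output filter that redacts secret-looking strings. Every name in it (`TOOL_ALLOWLIST`, `validate_tool_call`, `redact_output`, the tool names, and the regex patterns) is a hypothetical chosen for the example and does not come from any particular agent framework. A real deployment would need far broader policies and patterns.

```python
import re

# Hypothetical allowlist: tool name -> {parameter name -> permitted values}.
# Any tool that is not listed here is denied by default.
TOOL_ALLOWLIST = {
    "read_file": {"directory": {"/srv/docs", "/srv/public"}},
    "send_email": {"recipient_domain": {"example.com"}},
}

# Hypothetical patterns for data that should never leave the system.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{16,}"),    # API-key-like token
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN-like number
]


class GuardrailViolation(Exception):
    """Raised when a tool call falls outside the allowlist."""


def validate_tool_call(tool: str, params: dict) -> None:
    """Raise GuardrailViolation unless the call matches the allowlist."""
    policy = TOOL_ALLOWLIST.get(tool)
    if policy is None:
        raise GuardrailViolation(f"tool not allowlisted: {tool}")
    for name, allowed in policy.items():
        if name not in params:
            raise GuardrailViolation(f"missing parameter: {name}")
        if params[name] not in allowed:
            raise GuardrailViolation(
                f"value not allowed for {name}: {params[name]!r}"
            )


def redact_output(text: str) -> str:
    """Replace anything matching a secret pattern with a placeholder."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


if __name__ == "__main__":
    validate_tool_call("read_file", {"directory": "/srv/docs"})  # passes
    try:
        validate_tool_call("read_file", {"directory": "/etc"})
    except GuardrailViolation as err:
        print("blocked:", err)
    print(redact_output("token sk-abcdefghijklmnop1234 issued"))
```

The sketch denies by default: an unknown tool or an unlisted parameter value raises an error instead of being passed through. In this arrangement `validate_tool_call` would run immediately before the tool executes, and `redact_output` would run on the agent's response before it reaches the user. Pattern-based redaction only catches formats it has been told about, so it complements access controls rather than replacing them.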

Related Terms

Model Context Protocol (MCP)

An open standard that defines how AI models connect to external tools, data sources, and services through a unified interface. MCP enables agents to dynamically discover and invoke capabilities without hardcoded integrations.

Tool Use

The ability of an AI model to invoke external functions, APIs, or services during a conversation to perform actions beyond text generation. Tool use transforms language models from passive responders into active problem solvers.

Function Calling

A model capability where the AI generates structured JSON arguments for predefined functions rather than free-form text. Function calling provides a machine-parseable bridge between natural language understanding and programmatic execution, though the generated arguments should still be validated before use.

Agentic Workflow

A multi-step process where an AI agent autonomously plans, executes, and iterates on tasks using tools, reasoning, and feedback loops. Agentic workflows go beyond single-turn interactions to accomplish complex goals.

ReAct Pattern

An agent architecture that interleaves Reasoning and Acting steps, where the model thinks about what to do next, takes an action, observes the result, and repeats. ReAct combines chain-of-thought reasoning with tool use in a unified loop.

Chain of Thought

A prompting technique that instructs the model to break down complex problems into sequential reasoning steps before producing a final answer. Chain of thought often improves accuracy on math, logic, and multi-step tasks.