
Rate Limiting

A technique for controlling the number of requests a client can make to an API within a given time window, protecting services from abuse and overload and ensuring fair resource allocation.

Rate limiting prevents any single client from consuming a disproportionate share of resources. Common algorithms include fixed window (100 requests per minute), sliding window (smoother distribution), token bucket (allows bursts up to a limit), and leaky bucket (smooths request flow to a constant rate). Each algorithm makes different trade-offs between simplicity, fairness, and burst tolerance.
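The token bucket mentioned above can be sketched in a few lines. This is a minimal illustration, not production code: the class name and interface are made up for this example, and a real deployment would need per-client buckets and thread safety.

```python
import time

class TokenBucket:
    """Token bucket rate limiter: tokens refill at a steady rate, and each
    request spends one. Bursts are allowed up to the bucket's capacity."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # maximum tokens (burst size)
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `capacity=5` and `refill_rate=1`, a client can burst five requests at once, then proceed at one request per second — the burst tolerance that distinguishes token bucket from leaky bucket, which would drain the queue at a constant rate instead.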

Implementation typically uses a fast data store like Redis to track request counts per client. Responses include headers indicating the rate limit, remaining requests, and reset time, allowing well-behaved clients to self-throttle. When limits are exceeded, the API returns 429 Too Many Requests with a Retry-After header.

For AI-powered APIs, rate limiting is especially important because inference requests are computationally expensive. A single unthrottled client could consume GPU resources that should serve hundreds of other users. Tiered rate limits based on subscription plan are common, and AI-specific limits might include tokens per minute, concurrent requests, or model-specific quotas.
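Tiered, token-aware quotas of this kind can be expressed as a simple check against per-plan budgets. The tier names and numbers below are hypothetical; real providers publish their own limits, and the usage counters would be tracked per client in a shared store.

```python
# Hypothetical per-tier quotas for an AI inference API.
TIERS = {
    "free": {"requests_per_min": 20,  "tokens_per_min": 10_000},
    "pro":  {"requests_per_min": 600, "tokens_per_min": 500_000},
}

def check_quota(tier: str, used_requests: int, used_tokens: int,
                request_tokens: int) -> bool:
    """Allow a request only if both the request count and the token
    budget for the current minute can absorb it."""
    quota = TIERS[tier]
    return (used_requests + 1 <= quota["requests_per_min"]
            and used_tokens + request_tokens <= quota["tokens_per_min"])
```

Checking both dimensions matters for inference workloads: a client can stay under its request count yet still exhaust its GPU share with a few very long prompts, which the token budget catches.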
