
Rate Limiting

A technique for controlling the number of requests a client can make to an API within a given time window, protecting services from abuse and overload and ensuring fair resource allocation.

Rate limiting prevents any single client from consuming a disproportionate share of resources. Common algorithms include fixed window (100 requests per minute), sliding window (smoother distribution), token bucket (allows bursts up to a limit), and leaky bucket (smooths request flow to a constant rate). Each algorithm makes different trade-offs between simplicity, fairness, and burst tolerance.
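The token bucket mentioned above can be sketched in a few lines. This is a minimal illustration, not production code: the class name and interface are made up for this example, and a real deployment would need per-client buckets and thread safety.

```python
import time

class TokenBucket:
    """Token bucket rate limiter: tokens refill at a steady rate, and each
    request spends one. Bursts are allowed up to the bucket's capacity."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # maximum tokens (burst size)
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `capacity=5` and `refill_rate=1`, a client can burst five requests at once, then proceed at one request per second — the burst tolerance that distinguishes token bucket from leaky bucket, which would drain the queue at a constant rate instead.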

Implementation typically uses a fast data store like Redis to track request counts per client. Responses include headers indicating the rate limit, remaining requests, and reset time, allowing well-behaved clients to self-throttle. When limits are exceeded, the API returns 429 Too Many Requests with a Retry-After header.

For AI-powered APIs, rate limiting is especially important because inference requests are computationally expensive. A single unthrottled client could consume GPU resources that should serve hundreds of other users. Tiered rate limits based on subscription plan are common, and AI-specific limits might include tokens per minute, concurrent requests, or model-specific quotas.
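Tiered, token-aware quotas of this kind can be expressed as a simple check against per-plan budgets. The tier names and numbers below are hypothetical; real providers publish their own limits, and the usage counters would be tracked per client in a shared store.

```python
# Hypothetical per-tier quotas for an AI inference API.
TIERS = {
    "free": {"requests_per_min": 20,  "tokens_per_min": 10_000},
    "pro":  {"requests_per_min": 600, "tokens_per_min": 500_000},
}

def check_quota(tier: str, used_requests: int, used_tokens: int,
                request_tokens: int) -> bool:
    """Allow a request only if both the request count and the token
    budget for the current minute can absorb it."""
    quota = TIERS[tier]
    return (used_requests + 1 <= quota["requests_per_min"]
            and used_tokens + request_tokens <= quota["tokens_per_min"])
```

Checking both dimensions matters for inference workloads: a client can stay under its request count yet still exhaust its GPU share with a few very long prompts, which the token budget catches.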
