Rate Limiting
A technique for controlling the number of requests a client can make to an API within a given time window, protecting services from abuse and overload and ensuring fair resource allocation.
Rate limiting prevents any single client from consuming a disproportionate share of resources. Common algorithms include fixed window (a counter that resets each interval, for example 100 requests per minute), sliding window (smooths out the bursts a fixed window permits at interval boundaries), token bucket (allows bursts up to a limit), and leaky bucket (smooths request flow to a constant rate). Each algorithm makes different trade-offs between simplicity, fairness, and burst tolerance.
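As one illustration of the algorithms above, a token bucket can be sketched in a few lines. This is a minimal, single-process sketch, not a production implementation: the class name TokenBucket, its parameters (rate, capacity, cost), and the injectable clock are assumptions made for this example.

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable so the bucket can be tested
        self.tokens = float(capacity)  # start full, so an initial burst is allowed
        self.updated = clock()

    def allow(self, cost=1):
        """Return True and spend `cost` tokens if enough are available."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill for the time that has passed, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A bucket created with rate=1 and capacity=3 accepts three back-to-back requests, rejects the fourth, and accepts roughly one more per second after that. The capacity sets the burst tolerance and the rate sets the long-run average.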
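Implementation typically uses a fast data store like Redis to track request counts per client. Responses usually include headers indicating the rate limit, remaining requests, and reset time (commonly named X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset, though names vary by provider), allowing well-behaved clients to self-throttle. When limits are exceeded, the API returns 429 Too Many Requests, typically with a Retry-After header telling the client how long to wait.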
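The request-counting and header behavior described above might look like the following fixed-window sketch. Several things here are assumptions for illustration: a plain dictionary stands in for Redis, the class and method names are hypothetical, and the X-RateLimit-* header names follow a common convention rather than a standard.

```python
import math
import time


class FixedWindowLimiter:
    """Counts requests per client per window; in-memory stand-in for Redis."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.counts = {}  # (client_id, window_index) -> request count

    def check(self, client_id):
        """Record one request and return (status_code, headers)."""
        now = self.clock()
        window_index = int(now // self.window)
        reset_at = (window_index + 1) * self.window  # start of the next window
        key = (client_id, window_index)
        count = self.counts.get(key, 0) + 1
        self.counts[key] = count

        headers = {
            "X-RateLimit-Limit": str(self.limit),
            "X-RateLimit-Remaining": str(max(0, self.limit - count)),
            "X-RateLimit-Reset": str(int(reset_at)),
        }
        if count > self.limit:
            # Seconds until the window resets, rounded up, at least 1.
            headers["Retry-After"] = str(max(1, math.ceil(reset_at - now)))
            return 429, headers
        return 200, headers
```

This sketch never deletes old window keys. With Redis, the same pattern is usually an atomic increment on a per-window key that is given an expiry, so stale counters disappear on their own.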
For AI-powered APIs, rate limiting is especially important because inference requests are computationally expensive. A single unthrottled client could consume GPU resources that should serve hundreds of other users. Tiered rate limits based on subscription plan are common, and AI-specific limits might include tokens per minute, concurrent requests, or model-specific quotas.
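A tiered, AI-specific limit that combines tokens per minute with a cap on concurrent requests could be sketched as follows. The plan names, the numeric limits, and the TokenQuota class are all hypothetical, chosen only to show the shape of the check. The sketch is single-process and not thread-safe.

```python
import time

# Hypothetical plans and numbers, for illustration only.
PLAN_LIMITS = {
    "free": {"tokens_per_minute": 1_000, "max_concurrent": 1},
    "pro": {"tokens_per_minute": 10_000, "max_concurrent": 4},
}


class TokenQuota:
    """Per-client quota on model tokens per minute and in-flight requests."""

    def __init__(self, plan, clock=time.time):
        self.limits = PLAN_LIMITS[plan]
        self.clock = clock
        self.minute = None   # index of the minute currently being counted
        self.used = 0        # tokens consumed in that minute
        self.in_flight = 0   # requests started but not yet finished

    def try_start(self, requested_tokens):
        """Admit a request if both the concurrency and token budgets allow it."""
        minute = int(self.clock() // 60)
        if minute != self.minute:
            # A new minute has begun: reset the token counter.
            self.minute, self.used = minute, 0
        if self.in_flight >= self.limits["max_concurrent"]:
            return False
        if self.used + requested_tokens > self.limits["tokens_per_minute"]:
            return False
        self.used += requested_tokens
        self.in_flight += 1
        return True

    def finish(self):
        """Mark one admitted request as complete."""
        self.in_flight = max(0, self.in_flight - 1)
```

A caller would invoke try_start with the estimated token count before running inference and finish afterwards. Charging by tokens rather than by request keeps one very large prompt from counting the same as a short one.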
Related Terms
A/B Testing
A controlled experiment comparing two or more variants to determine which performs better on a defined metric, using statistical methods to ensure reliable results.
Feature Flag
A software mechanism that enables or disables features at runtime without deploying new code, used for gradual rollouts, A/B testing, and targeting specific user segments.
MLOps
The set of practices combining machine learning, DevOps, and data engineering to reliably deploy, monitor, and maintain ML models in production.
Model Serving
The infrastructure and systems that host trained ML models and handle inference requests in production, optimizing for latency, throughput, and cost.
Semantic Search
Search that understands the meaning and intent behind a query rather than just matching keywords, typically powered by embedding-based similarity comparison.
CI/CD (Continuous Integration / Continuous Deployment)
An automated software practice where code changes are continuously integrated into a shared repository, tested, and deployed to production, reducing manual intervention and accelerating delivery cycles.