Model Serving
The infrastructure and systems that host trained ML models and handle inference requests in production, optimizing for latency, throughput, and cost.
Model serving is the bridge between a trained model and user-facing features. It handles receiving requests, running inference, returning results, and managing the operational concerns of production systems: scaling, load balancing, batching, caching, and failover.
For teams using LLM APIs (OpenAI, Anthropic), model serving is largely handled by the provider. Your engineering focus shifts to API management: request routing between models based on task complexity, response caching for common queries, rate limit management, and fallback chains when primary models are unavailable. A typical production setup routes 70-80% of requests to cheaper models, escalating only complex cases to premium models.
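The routing, caching, and fallback pattern above can be sketched as follows. This is a minimal illustration, not a provider SDK: `call_model` is an injected stand-in for your actual API client, and the model names and complexity heuristic are hypothetical.

```python
import hashlib

# Hypothetical model IDs -- substitute your provider's cheap/premium tiers.
CHEAP_MODEL = "small-model"
PREMIUM_MODEL = "large-model"

_cache: dict[str, str] = {}  # response cache for repeated queries

def classify_complexity(prompt: str) -> str:
    # Toy heuristic: long prompts or explicit reasoning requests escalate.
    if len(prompt) > 500 or "step by step" in prompt.lower():
        return "complex"
    return "simple"

def route_request(prompt: str, call_model) -> str:
    """Route to the cheap model by default; escalate complex prompts.

    `call_model(model, prompt)` wraps the provider API, keeping this
    sketch provider-agnostic.
    """
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _cache:  # serve common queries from cache, no API call
        return _cache[key]
    model = PREMIUM_MODEL if classify_complexity(prompt) == "complex" else CHEAP_MODEL
    try:
        result = call_model(model, prompt)
    except Exception:
        # Fallback chain: if the primary model fails, retry on the other tier.
        fallback = CHEAP_MODEL if model == PREMIUM_MODEL else PREMIUM_MODEL
        result = call_model(fallback, prompt)
    _cache[key] = result
    return result
```

In production the complexity classifier is usually a small model or rules tuned on real traffic, and the cache would have TTLs and size bounds; the control flow, though, stays this simple.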
For teams running self-hosted models (fine-tuned models, embedding models, custom classifiers), serving infrastructure matters more. Solutions like vLLM, TGI (Text Generation Inference), and BentoML handle GPU utilization, request batching, and scaling. The key optimization is batching: processing multiple requests together on the GPU dramatically improves throughput and reduces per-request cost, at the expense of slightly higher latency for individual requests.
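The throughput-versus-latency trade-off at the heart of batching can be seen in a toy batch collector: wait for the first request, then gather more until the batch is full or a deadline passes. This is a simplified sketch (serving frameworks like vLLM do this internally, with continuous batching on the GPU); `run_model` is an assumed stand-in for one batched forward pass.

```python
import queue
import time

MAX_BATCH = 8       # cap on requests processed in one GPU pass
MAX_WAIT_S = 0.005  # max extra latency a request accrues while waiting

def collect_batch(requests: queue.Queue) -> list:
    """Block for the first request, then gather more until the batch
    is full or the wait deadline passes -- the size/latency knob
    behind request batching."""
    batch = [requests.get()]
    deadline = time.monotonic() + MAX_WAIT_S
    while len(batch) < MAX_BATCH:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

def serve_batch(batch, run_model):
    # One call amortizes GPU overhead across every request in the batch,
    # which is where the throughput and cost wins come from.
    return run_model(batch)
```

Raising `MAX_BATCH` or `MAX_WAIT_S` trades individual-request latency for higher GPU utilization; production servers tune both against their latency SLOs.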
Related Terms
MLOps
The set of practices combining machine learning, DevOps, and data engineering to reliably deploy, monitor, and maintain ML models in production.
Batch Inference
Processing multiple ML predictions as a group at scheduled intervals rather than one at a time on demand, optimizing for throughput and cost over latency.
Real-Time Inference
Generating ML predictions on-demand as requests arrive, typically with latency requirements under 200ms for user-facing features.
A/B Testing
A controlled experiment comparing two or more variants to determine which performs better on a defined metric, using statistical methods to ensure reliable results.
Feature Flag
A software mechanism that enables or disables features at runtime without deploying new code, used for gradual rollouts, A/B testing, and targeting specific user segments.
Semantic Search
Search that understands the meaning and intent behind a query rather than just matching keywords, typically powered by embedding-based similarity comparison.
Further Reading
LLM Cost Optimization: Cut Your API Bill by 80%
Spending $10K+/month on OpenAI or Anthropic? Here are the exact tactics that reduced our LLM costs from $15K to $3K/month without sacrificing quality.
Fine-tuning vs Prompting: The Real Trade-offs
An honest look at when each approach makes sense, with real cost comparisons and performance data.