
Load Balancer

A network component that distributes incoming traffic across multiple backend servers to maximize throughput, minimize response time, and ensure no single server is overwhelmed.

Load balancers sit between clients and servers, routing each request to the most appropriate backend instance. Common algorithms include round-robin (sequential distribution), least connections (route to the server handling the fewest active requests), and weighted routing (distribute based on server capacity).
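The three algorithms above can be sketched in a few lines. This is a minimal illustration, not a production implementation; the server addresses and weights are hypothetical.

```python
import itertools

# Hypothetical backend pool; addresses are illustrative.
servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

# Round-robin: hand out servers in sequential order, wrapping around.
_rr = itertools.cycle(servers)

def round_robin():
    return next(_rr)

# Least connections: track active requests per server, pick the minimum.
active = {s: 0 for s in servers}

def least_connections():
    server = min(active, key=active.get)
    active[server] += 1  # the caller decrements this when the request completes
    return server

# Weighted routing: repeat each server in the rotation according to its
# capacity weight, so 10.0.0.1 receives 3 of every 5 requests here.
weights = {"10.0.0.1": 3, "10.0.0.2": 1, "10.0.0.3": 1}
_weighted = itertools.cycle(
    [s for s, w in weights.items() for _ in range(w)]
)

def weighted():
    return next(_weighted)
```

Real load balancers add details omitted here, such as connection draining and per-backend failure counters, but the selection logic is essentially the same.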

Modern load balancers operate at different network layers. Layer 4 (transport-level, TCP/UDP) load balancers route based on IP address and port, offering high throughput with minimal processing overhead. Layer 7 (application-level, HTTP) load balancers inspect request content, enabling path-based routing, header-based decisions, and sticky sessions. Cloud providers offer managed load balancers (ALB and NLB on AWS; Cloud Load Balancing on GCP) that integrate with auto-scaling groups.

For AI serving infrastructure, load balancers are critical. They distribute inference requests across GPU servers, route traffic during canary deployments of new models, and perform health checks to remove unhealthy instances from the pool. Intelligent load balancing can also route requests based on model version, request complexity, or available GPU memory.
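Health-check-driven removal can be sketched as a round-robin pool that skips backends whose checks have failed. The class, backend names, and health-check endpoint mentioned in the comments are illustrative assumptions, not a specific product's API.

```python
class HealthCheckedPool:
    """Round-robin over backends, skipping any that failed a health check."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(backends)
        self._i = 0

    def mark_unhealthy(self, backend):
        # Called when a periodic probe (e.g. GET /healthz) times out or errors.
        self.healthy.discard(backend)

    def mark_healthy(self, backend):
        # Called when the backend passes its probes again.
        self.healthy.add(backend)

    def pick(self):
        # Advance the round-robin cursor, skipping unhealthy backends.
        for _ in range(len(self.backends)):
            backend = self.backends[self._i % len(self.backends)]
            self._i += 1
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends in pool")


pool = HealthCheckedPool(["gpu-1:8000", "gpu-2:8000", "gpu-3:8000"])
```

The same structure extends to the routing criteria mentioned above: instead of a single healthy set, the pool can keep per-backend metadata (model version, free GPU memory) and filter on it inside `pick`.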
