ML models can serve predictions in two modes: batch (process a dataset at once) and real-time (respond to individual requests on demand). The choice affects infrastructure, cost, latency, and system architecture. Many production systems use both modes for different parts of their prediction pipeline.

Overview

AspectBatch InferenceReal-Time Inference
LatencyMinutes to hoursMilliseconds to seconds
ThroughputVery highLimited by endpoint capacity
Cost EfficiencyHigh (optimized compute)Lower (always-on endpoints)
FreshnessStale (until next batch)Current
InfrastructureJob-based (ephemeral)Endpoint-based (persistent)
Error HandlingRetry full batch or itemsPer-request retries
ScalingScale to dataset sizeScale to request rate

Batch Inference

Batch inference processes a dataset through a model in a single job. You provide input data (typically in S3, a database, or a data lake), the job applies the model to every record, and results are written back to storage. SageMaker Batch Transform, Spark ML inference, and custom scripts on EMR are common batch inference patterns on AWS.

Batch is most cost-effective because compute resources run only during processing. GPU instances process data at maximum throughput without idle time. You can use spot instances for further savings. Processing 10 million predictions overnight in batch costs a fraction of maintaining a real-time endpoint to serve those same predictions over the course of a day.

Real-Time Inference

Real-time inference deploys a model behind an endpoint that responds to individual prediction requests. SageMaker endpoints, custom APIs on ECS/EKS, and serverless inference (Lambda with model artifacts) are common patterns. The model is loaded in memory and responds to requests with low latency.

Real-time inference is necessary when predictions must reflect current state, when predictions are triggered by user actions, or when the input data is not known in advance. Product recommendations on a website, fraud detection for transactions, and chatbot responses all require real-time inference.

Near Real-Time: The Middle Ground

Many use cases do not need sub-second latency but cannot tolerate hours of staleness. Near real-time patterns process micro-batches every few minutes. Streaming inference with Kinesis or Kafka processes events in small batches with seconds to minutes of latency. This provides a cost-effective middle ground.

Architecture Patterns

Pre-compute and cache: Run batch inference on all likely inputs and cache results. Serve predictions from cache at request time. This gives real-time latency with batch economics but only works when the input space is bounded and predictable.

Batch + real-time hybrid: Use batch inference for baseline predictions (updated nightly) and real-time inference for adjustments based on current context. Recommendation systems often use this pattern - batch-computed candidate lists refined by real-time scoring.

Asynchronous inference: Accept prediction requests, queue them, and return results when ready. SageMaker Async Inference handles this pattern, scaling from zero when there are no requests. This works when callers can tolerate seconds to minutes of latency.

Cost Comparison

For a model serving 1 million predictions per day, batch inference on a spot GPU instance might cost a few dollars. The same volume through a real-time endpoint with an always-on GPU instance costs significantly more. The real-time premium is the cost of low latency and immediate availability.

SageMaker Serverless Inference and auto-scaling endpoints reduce real-time costs for variable traffic but still cost more than batch for predictable workloads.

When to Choose Batch

Choose batch inference when predictions are needed on a schedule rather than on demand, when the input dataset is known in advance, when latency of minutes to hours is acceptable, when cost optimization is a priority, or when you are scoring large datasets for analytics or reporting.

When to Choose Real-Time

Choose real-time inference when predictions must be immediate (user-facing interactions), when inputs are not known until request time, when predictions depend on current state (fraud detection, pricing), or when the application architecture is request-response.

Practical Recommendation

Default to batch unless you have a clear latency requirement. Many teams deploy real-time endpoints for workloads that could be served more efficiently with batch inference and caching. Start with batch, measure whether the prediction staleness is acceptable, and move to real-time only for the predictions that genuinely need it.