Not all AI workloads need real-time responses. Processing a backlog of documents, analyzing historical data, or generating reports for all customers are batch workloads where throughput and cost matter more than latency. Batch inference patterns optimize for these priorities.

Queue-Based Batch Processing

The foundational pattern for batch inference. Work items are placed in a queue, and workers pull items, process them through the model, and write results to storage.

Queue design - Use a managed queue service (SQS, RabbitMQ) with visibility timeout set to the maximum expected processing time per item. Items that are not acknowledged within the timeout are automatically retried. Dead letter queues capture items that fail repeatedly for investigation.

Worker scaling - Scale workers based on queue depth. When the backlog is large, run more workers to increase throughput. When the queue drains, scale down to reduce cost. For cloud deployments, auto-scaling groups or container orchestration (ECS, Kubernetes) handle this automatically.

Rate limiting - Model APIs have rate limits (requests per minute, tokens per minute). Workers must respect these limits collectively. Use a shared rate limiter (token bucket in Redis, API Gateway throttling) to prevent workers from exceeding the collective limit and getting throttled.

Batch API Patterns

Some model providers offer dedicated batch APIs with lower per-request pricing in exchange for higher latency (results returned hours later rather than seconds).

When to use - For workloads where results are not needed immediately: nightly report generation, weekly data enrichment, historical data reprocessing. The cost savings (often 50% or more) justify the delayed results.

Implementation - Submit a batch of requests as a single job. The provider processes them asynchronously and returns results when complete. Your system checks job status periodically and processes results when available.

Hybrid approach - Use real-time inference for urgent items and batch inference for everything else. A document processing system might process new documents in real-time and reprocess the historical archive in batch mode at lower cost.

Throughput Optimization

Parallelism - Process multiple items simultaneously up to the rate limit. For API-based models, this means concurrent HTTP requests. For self-hosted models, this means batching inputs into a single forward pass.

Request batching - Group multiple items into a single model call when the task allows it. Classifying 10 documents in a single prompt is often cheaper and faster than 10 separate calls, as long as the combined input fits within the context window.

Priority queuing - Not all batch items are equally urgent. Use priority queues to process time-sensitive items first while background items wait. This blends batch efficiency with partial responsiveness.

Error Handling

Batch workloads need robust error handling because failures at scale are certain.

Retry strategy - Transient errors (rate limits, timeouts, server errors) should be retried with exponential backoff. Permanent errors (malformed input, unsupported format) should not be retried. Classify errors to avoid wasting retries on items that will never succeed.

Partial failure handling - A batch job processing 10,000 items will have some failures. Design the system to process successful items and separately track failed items for investigation. Never let a few failures block the entire batch.

Idempotency - Workers may process the same item twice due to retry logic or visibility timeout expiration. Ensure that processing an item twice produces the same result and does not create duplicates in the output store. Use unique item IDs and check-before-write patterns.

Cost Management

Off-peak scheduling - Schedule non-urgent batch jobs during off-peak hours when rate limits are less contested and some providers offer lower pricing.

Progress tracking - For large batches, track completion percentage, estimated time remaining, items processed per hour, and cost incurred. This enables informed decisions about scaling up workers (faster but more expensive) or waiting (slower but cheaper).