What is batch processing?
Batch processing in the context of AI APIs means sending a large number of requests to be processed together as a group rather than one at a time in real time. Instead of making individual API calls and waiting for each response, you submit a batch of requests, the provider processes them asynchronously, and you retrieve the results when they are ready. This approach is significantly cheaper and allows higher throughput for workloads that do not require immediate responses.
Why Batch Processing Exists
Real-time API calls are optimized for speed. When you send a request, the provider allocates compute resources immediately and returns a response as fast as possible. This is ideal for interactive applications where a user is waiting for a reply.
But many AI workloads do not need instant responses. If you are processing 50,000 product descriptions, evaluating a test suite of 10,000 prompts, or generating labels for a dataset, you do not care whether each individual response comes back in 2 seconds or 2 hours. What you care about is that all 50,000 results come back correctly and at a reasonable cost.
Batch processing is designed for exactly these scenarios. Providers can schedule batch work during off-peak periods, optimize resource allocation across many requests, and pass the resulting savings on to you.
Provider Offerings
The major AI providers offer batch processing with significant cost advantages:
[OpenAI Batch API] allows you to upload a JSONL file containing up to 50,000 requests. Each request follows the same format as a standard chat completion call. Batch requests are processed within a 24-hour window and cost [50% less] than real-time requests. You receive a batch ID that you can poll for status, and results are returned as a downloadable JSONL file.
[Google Batch Prediction] for Gemini models supports batch processing through the Vertex AI platform. You submit requests through BigQuery or Cloud Storage, and results are written back to your specified destination. Pricing is roughly half the corresponding online prediction rate.
[Anthropic] offers a Message Batches API that allows you to send up to 10,000 message requests per batch. Each batch processes within 24 hours, with results available for 29 days. Batched requests are priced at 50% of standard API rates.
[Open-source and self-hosted models] can implement batch processing through inference servers like vLLM, which supports continuous batching at the inference level, processing multiple requests simultaneously to maximize GPU utilization.
How Batch Processing Works
A typical batch processing workflow looks like this:
[1. Prepare your requests]: Create a file (usually JSONL format) where each line is a complete API request with its parameters, including the model, messages, temperature, and any other settings.
[2. Upload and submit]: Send the file to the batch endpoint. The provider validates the requests and returns a batch identifier.
[3. Wait for processing]: The provider processes your requests asynchronously. This typically takes anywhere from minutes to 24 hours depending on batch size and provider load.
[4. Check status]: Poll the batch status endpoint periodically or set up a webhook to be notified when processing completes.
[5. Retrieve results]: Download the results file, which contains the response for each request along with identifiers that let you match responses to their original requests.
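The prepare-and-submit steps above can be sketched in Python. The per-line shape (`custom_id`, `method`, `url`, `body`) follows OpenAI's documented batch input format; the model name, prompts, and the commented-out SDK calls are illustrative and would need a configured client and API key:

```python
import json

def build_batch_file(tasks, path, model="gpt-4o-mini"):
    """Write one JSON request per line in the Batch API input format.

    Each line carries a custom_id so responses can be matched back
    to their originating request after the batch completes.
    """
    with open(path, "w") as f:
        for task_id, prompt in tasks:
            request = {
                "custom_id": task_id,           # your identifier, echoed in the results
                "method": "POST",
                "url": "/v1/chat/completions",  # same endpoint shape as real-time calls
                "body": {
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                    "temperature": 0,
                },
            }
            f.write(json.dumps(request) + "\n")

tasks = [
    ("product-001", "Categorize: wireless ergonomic mouse"),
    ("product-002", "Categorize: stainless steel water bottle"),
]
build_batch_file(tasks, "batch_input.jsonl")

# The file is then uploaded and submitted, e.g. with the openai SDK:
#   batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
#   batch = client.batches.create(input_file_id=batch_file.id,
#                                 endpoint="/v1/chat/completions",
#                                 completion_window="24h")
```

From there you poll the batch status and, once it completes, download the output file keyed by the same `custom_id` values.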
Cost Savings
The economics of batch processing are compelling. At 50% off standard pricing, high-volume workloads see dramatic cost reductions. Here are some concrete examples:
Processing 100,000 product categorization requests using GPT-4o at standard pricing might cost $500. The same work through the Batch API would cost approximately $250.
Running a monthly evaluation pipeline of 20,000 test prompts across multiple models can quickly become expensive at real-time rates. Batch processing cuts that evaluation budget in half.
For organizations processing millions of requests, batch savings can amount to thousands of dollars per month.
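The arithmetic behind such estimates is easy to sketch. The token counts and per-million-token prices below are illustrative assumptions, not quoted rates; check your provider's current pricing page before budgeting:

```python
def batch_savings(num_requests, avg_input_tokens, avg_output_tokens,
                  input_price_per_m, output_price_per_m, discount=0.5):
    """Estimate real-time cost vs. batch cost for a workload.

    Prices are dollars per million tokens; discount is the batch
    reduction (0.5 = half price, as with the OpenAI and Anthropic
    offerings described above).
    """
    per_request = (avg_input_tokens * input_price_per_m +
                   avg_output_tokens * output_price_per_m) / 1_000_000
    realtime = num_requests * per_request
    batch = realtime * (1 - discount)
    return realtime, batch

# Illustrative numbers chosen to mirror the 100,000-request example above.
realtime, batch = batch_savings(
    num_requests=100_000, avg_input_tokens=1_600, avg_output_tokens=100,
    input_price_per_m=2.50, output_price_per_m=10.00)
print(f"real-time ${realtime:.0f} vs batch ${batch:.0f}")
```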
Use Cases
Batch processing is ideal for workloads that share certain characteristics: high volume, tolerance for delay, and sensitivity to cost.
[Data labeling and classification]: Categorize products, tag content, classify support tickets, or label training data. These tasks involve running the same type of request thousands of times.
[Bulk content generation]: Generate product descriptions, email variations, social media posts, or translations for large catalogs. The content is needed eventually, not immediately.
[Evaluation pipelines]: Test prompts against models, run benchmark suites, or evaluate model performance across test sets. AI teams regularly run large-scale evaluations where batch processing is a natural fit.
[Data extraction and transformation]: Process large datasets through AI models to extract structured data, standardize formats, or enrich records.
[Synthetic data generation]: Generate training data, test fixtures, or simulated scenarios at scale.
[Embedding generation]: Compute embeddings for large document collections when building or updating vector search indexes.
When to Use Batch vs Real-Time
The decision between batch and real-time processing comes down to latency requirements:
[Use real-time when]:
- A user is waiting for the response
- You need results in seconds
- The request is part of an interactive conversation
- You are building a live feature like autocomplete or chat
[Use batch when]:
- Results are needed within hours, not seconds
- You are processing hundreds or thousands of similar requests
- Cost matters more than speed
- The workload is part of a scheduled pipeline
- No human is waiting for an individual response
Building Batch Workflows
Effective batch processing requires some engineering around the batch APIs:
[Idempotency and tracking]: Assign unique IDs to each request so you can match results to inputs and handle retries without duplicate processing.
[Error handling]: Some requests in a batch may fail while others succeed. Build your workflow to identify failures, extract them, and optionally resubmit them in a new batch.
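A sketch of that failure-extraction step, assuming the OpenAI-style result format in which each output line carries a `custom_id` plus either a `response` or an `error`; the sample lines are simulated, not real API output:

```python
import json

def split_results(result_lines):
    """Separate succeeded and failed requests in a batch output file.

    Assumes each line has a custom_id, a response (null on failure),
    and an error (null on success).
    """
    succeeded, failed = {}, []
    for line in result_lines:
        record = json.loads(line)
        if record.get("error") is not None:
            failed.append(record["custom_id"])   # candidates for a retry batch
        else:
            succeeded[record["custom_id"]] = record["response"]["body"]
    return succeeded, failed

# Two simulated result lines: one success, one failure.
sample = [
    json.dumps({"custom_id": "req-1", "error": None,
                "response": {"status_code": 200,
                             "body": {"choices": [{"message": {"content": "Electronics"}}]}}}),
    json.dumps({"custom_id": "req-2", "error": {"code": "invalid_request"},
                "response": None}),
]
ok, retry = split_results(sample)
print(retry)  # failed custom_ids to resubmit in a fresh batch
```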
[Rate and size limits]: Providers impose limits on batch sizes and concurrent batches. Design your pipeline to split large workloads into appropriately sized batches.
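A minimal chunking helper for that splitting step, assuming the per-batch limits quoted earlier (these change over time, so treat the numbers as placeholders and check current documentation):

```python
def chunk_requests(requests, max_per_batch=50_000):
    """Split a workload into provider-sized batches.

    max_per_batch is an assumed limit -- e.g. 50,000 requests for the
    OpenAI Batch API or 10,000 for Anthropic's Message Batches, per
    the figures described above.
    """
    return [requests[i:i + max_per_batch]
            for i in range(0, len(requests), max_per_batch)]

batches = chunk_requests(list(range(120_000)), max_per_batch=50_000)
print([len(b) for b in batches])  # [50000, 50000, 20000]
```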
[Result processing]: Batch results need to be parsed, validated, and routed to their destination, whether that is a database, a file, or another system.
[Scheduling]: For recurring workloads, set up scheduled jobs that prepare, submit, and process batches on a regular cadence, such as nightly content generation or weekly evaluation runs.
[Monitoring]: Track batch completion rates, error rates, processing times, and costs. This helps you identify issues early and optimize your pipeline over time.
Batch processing is one of the simplest ways to reduce AI API costs. If your workload does not require real-time responses, you are likely leaving money on the table by not using it.