What is streaming?
6 min read
Streaming is the technique of delivering a model's response token by token in real time, instead of waiting for the entire response to be generated before showing anything. It is what makes AI chat interfaces feel responsive, with text appearing word by word rather than all at once after a long pause.
Why Streaming Matters for User Experience
Language models generate text sequentially, one token at a time. For a long response, the model might take 10-30 seconds to generate the full output. Without streaming, the user stares at a blank screen for that entire time, wondering if anything is happening.
With streaming, the first tokens appear within milliseconds. The user immediately sees the response taking shape, reads along as it generates, and perceives the system as much faster, even though the total generation time is the same.
This is not just a nice-to-have. Studies on perceived latency show that users start losing patience after about 1 second of waiting. Streaming converts a 15-second wait into a 15-second reading experience. It is the difference between "this is broken" and "this is fast."
How Streaming Works
Under the hood, streaming typically uses Server-Sent Events (SSE). Your application makes an HTTP request to the model API and receives a stream of events, each containing one or a few tokens. The connection stays open until the response is complete.
The flow looks like this:
- Your application sends a request to the model API with a streaming flag enabled.
- The server starts generating tokens and sends each one back immediately as an SSE event.
- Your application receives these events and renders them in real time.
- When generation is complete, the server sends a final event and closes the stream.
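The receiving side of this flow can be sketched as a small SSE parser. This is a minimal illustration assuming OpenAI-style `data:` lines with a `[DONE]` sentinel; the exact wire format and chunk shape vary by provider.

```python
import json

def parse_sse_tokens(raw: str) -> list[str]:
    """Extract token text from a raw SSE payload (OpenAI-style 'data:' lines)."""
    tokens = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # sentinel some providers send to end the stream
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            tokens.append(delta)
    return tokens

raw = (
    'data: {"choices":[{"delta":{"content":"Hel"}}]}\n\n'
    'data: {"choices":[{"delta":{"content":"lo"}}]}\n\n'
    "data: [DONE]\n\n"
)
print(parse_sse_tokens(raw))  # ['Hel', 'lo']
```

In practice a provider SDK does this parsing for you, but seeing the raw events clarifies what "a stream of events, each containing one or a few tokens" actually means on the wire.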
WebSockets are another transport mechanism used by some providers and real-time APIs. Unlike SSE, WebSockets support bidirectional communication, which matters for voice interfaces or interactive sessions where you might want to interrupt the model.
Each streamed chunk typically contains the token text, and the final chunk includes usage metadata like total token counts.
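Consuming those chunks means doing two things at once: concatenating the text deltas and watching for the usage metadata at the end. A minimal sketch, using a simplified OpenAI-style chunk shape as an assumption:

```python
def consume_stream(chunks):
    """Accumulate streamed text deltas; capture usage from the final chunk."""
    text, usage = [], None
    for chunk in chunks:
        if chunk.get("usage"):  # usually only present on the last chunk
            usage = chunk["usage"]
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {}).get("content")
            if delta:
                text.append(delta)
    return "".join(text), usage

chunks = [
    {"choices": [{"delta": {"content": "Hi "}}]},
    {"choices": [{"delta": {"content": "there"}}]},
    {"choices": [], "usage": {"prompt_tokens": 5, "completion_tokens": 2, "total_tokens": 7}},
]
print(consume_stream(chunks))
```

Note that the final chunk often carries no text at all, only metadata, so the consumer must not assume every chunk contains a delta.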
Provider Support
Streaming is supported by every major provider:
OpenAI supports streaming in all GPT models through their chat completions API. Set stream: true in your request. Their SDK provides convenient async iterators for processing chunks.
Anthropic supports streaming in all Claude models. Their Messages API offers both basic SSE streaming and a higher-level streaming interface with typed events for text deltas, tool use, and message lifecycle events.
Google supports streaming in Gemini models through both the Gemini API and Vertex AI. Their SDK provides stream methods alongside standard generation methods.
Mistral supports streaming through their chat completions API, following the OpenAI-compatible format.
Cohere supports streaming in their Command models through their chat API.
Open source models running on frameworks like Ollama, vLLM, or Hugging Face TGI all support streaming, typically using the OpenAI-compatible API format.
When to Stream vs When to Wait
Streaming is the right choice when:
- Users are reading the response in a chat or assistant interface
- Responses are long and would otherwise create noticeable wait times
- You want perceived speed without actually changing generation speed
- You are building interactive applications where the feeling of real-time matters
Waiting for the complete response is better when:
- You need the full response for processing before showing anything (like structured JSON that needs validation)
- You are making batch API calls where no human is waiting
- You are doing function calling and need to know all the functions the model wants to call before executing any
- The response is short enough that streaming adds no perceived benefit
Implementing Streaming in Applications
On the frontend, you typically process the stream and append each token to the display. Most frameworks handle this well. React applications often use state updates with each chunk. A common pattern is to accumulate the text in a buffer and update the UI at a throttled rate to avoid excessive re-renders.
On the backend, you need to forward the stream from the model API to your client. This means your backend server must support streaming responses as well. In Node.js, you can pipe the stream directly. In Python, you can use async generators.
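The backend side of this, in Python, amounts to re-yielding tokens as they arrive. A minimal sketch with a stand-in for the provider SDK's stream; in a real app the `forward_stream` generator would back an SSE or chunked HTTP response in your web framework:

```python
import asyncio

async def model_stream():
    # Stand-in for the provider SDK's async token stream (hypothetical tokens).
    for token in ["Streaming ", "works"]:
        await asyncio.sleep(0)  # yield control, as real network reads would
        yield token

async def forward_stream():
    """Backend handler: re-yield model tokens to the client as they arrive."""
    async for token in model_stream():
        yield token.encode()  # frameworks typically expect bytes on the wire

async def main():
    received = b""
    async for chunk in forward_stream():
        received += chunk
    return received.decode()

print(asyncio.run(main()))  # Streaming works
```

The key point is that nothing in the chain buffers the whole response: each layer passes tokens through as soon as it has them.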
One important consideration is error handling. With streaming, errors can occur mid-response. Your application needs to handle cases where the stream is interrupted, the connection drops, or the model returns an error partway through.
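A common pattern is to keep whatever text arrived before the failure and surface the error alongside it, rather than discarding the partial response. A sketch, with a simulated mid-stream failure:

```python
def tokens_then_failure():
    # Simulated stream that dies partway through (hypothetical failure mode).
    yield "Partial "
    yield "answer"
    raise ConnectionError("stream interrupted")

def read_with_recovery(stream):
    """Keep whatever arrived before the stream broke, and report the error."""
    parts, error = [], None
    try:
        for token in stream:
            parts.append(token)
    except ConnectionError as exc:
        error = str(exc)  # surface to the UI or logs; optionally retry from here
    return "".join(parts), error

text, error = read_with_recovery(tokens_then_failure())
print(text, "|", error)  # Partial answer | stream interrupted
```

Whether to show the partial text, retry silently, or report the error depends on the product, but the consumer must be written so that an interruption is an expected branch, not an unhandled exception.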
Streaming with Function Calls
Streaming and function calling interact in an interesting way. When a model decides to call a function, the function call arguments are also streamed token by token. You accumulate the argument tokens until the function call is complete, then execute the function.
Some providers stream the text response and function calls as separate events, making it easier to distinguish between content for the user and function call arguments.
With parallel function calls, the model might stream multiple function call requests in sequence. You collect all of them, execute them (potentially in parallel), and then send the results back to continue the conversation.
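The accumulate-then-execute pattern above can be sketched as follows. The fragment shape (an `index` plus partial `arguments` strings, OpenAI-style) is an assumption; providers differ in how they delimit parallel calls:

```python
import json

def collect_tool_calls(chunks):
    """Merge streamed tool-call fragments into complete calls with parsed
    JSON arguments. Assumes indexed fragments, OpenAI-style."""
    calls = {}  # index -> {"name": str, "arguments": str}
    for chunk in chunks:
        for frag in chunk:
            slot = calls.setdefault(frag["index"], {"name": "", "arguments": ""})
            if frag.get("name"):
                slot["name"] = frag["name"]
            slot["arguments"] += frag.get("arguments", "")
    # Arguments are only valid JSON once the call is complete, so parse last.
    return [
        {"name": c["name"], "arguments": json.loads(c["arguments"])}
        for c in calls.values()
    ]

chunks = [
    [{"index": 0, "name": "get_weather", "arguments": '{"ci'}],
    [{"index": 0, "arguments": 'ty": "Paris"}'}],
]
print(collect_tool_calls(chunks))
```

The important detail is that the argument string is not parseable JSON until the last fragment arrives, which is why execution must wait for the complete call.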
Best Practices
Always provide visual feedback: Show a typing indicator or cursor while streaming. Users should see that generation is happening.
Handle partial markdown: If your response contains markdown formatting, you will receive it incrementally. A bold marker might arrive before the closing marker. Your renderer should handle incomplete markdown gracefully.
Buffer for smoother display: Rather than updating the UI with every single token, consider buffering a few tokens and updating in small batches. This reduces visual jitter and re-render costs.
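The buffering idea is simple enough to show directly; a sketch that groups tokens into fixed-size batches (in a real UI you would more likely flush on a timer, e.g. every 50 ms):

```python
def batch_tokens(tokens, batch_size=3):
    """Group tokens into small batches so the UI updates per batch, not per token."""
    buffer, batches = [], []
    for token in tokens:
        buffer.append(token)
        if len(buffer) >= batch_size:
            batches.append("".join(buffer))
            buffer.clear()
    if buffer:  # flush whatever is left when the stream ends
        batches.append("".join(buffer))
    return batches

print(batch_tokens(["The", " quick", " brown", " fox", " jumps"]))
# ['The quick brown', ' fox jumps']
```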
Implement cancellation: Let users stop generation mid-stream. This is both a UX feature and a cost saver, since you stop consuming tokens when the user has seen enough.
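One way to wire this up is a shared flag the render loop checks between tokens; a minimal sketch using `threading.Event` as the cancellation signal:

```python
import threading

def render_stream(tokens, cancel: threading.Event):
    """Consume tokens until the stream ends or the user sets the cancel flag."""
    out = []
    for token in tokens:
        if cancel.is_set():  # user pressed "stop": abandon the stream here
            break
        out.append(token)
    return "".join(out)

cancel = threading.Event()

def stream():
    yield "First "
    yield "part"
    cancel.set()  # simulate the user cancelling mid-stream
    yield " never shown"

print(render_stream(stream(), cancel))  # First part
```

When the consumer abandons the stream, remember to also close the underlying HTTP connection so the provider stops generating (and billing) tokens.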
Track token usage: With streaming, the total token count typically arrives in the final chunk. Make sure you capture this for cost tracking and logging.
Set appropriate timeouts: Streaming connections can hang if the server becomes unresponsive. Set connection timeouts and implement heartbeat detection to catch stale connections.
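A per-chunk timeout is one way to detect a stalled stream: instead of timing the whole response, you bound the gap between consecutive chunks. A sketch using `asyncio.wait_for` around each read, with a simulated server that goes silent:

```python
import asyncio

async def slow_stream():
    yield "ok"
    await asyncio.sleep(10)  # simulate a server that goes silent mid-stream
    yield "never arrives"

async def read_with_timeout(stream, per_chunk_timeout=0.1):
    """Abort if no chunk arrives within the per-chunk timeout."""
    out, timed_out = [], False
    it = stream.__aiter__()
    while True:
        try:
            token = await asyncio.wait_for(it.__anext__(), per_chunk_timeout)
            out.append(token)
        except asyncio.TimeoutError:
            timed_out = True  # stale connection: surface an error or retry
            break
        except StopAsyncIteration:
            break
    return "".join(out), timed_out

print(asyncio.run(read_with_timeout(slow_stream())))  # ('ok', True)
```

A per-chunk bound is usually more useful than a total-response bound, because a healthy stream can legitimately run for tens of seconds while a stalled one stops producing chunks almost immediately.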
Streaming is one of those features that seems simple but transforms the user experience. Any application that shows AI-generated text to users should implement streaming by default.