LLM API Streaming: Building Responsive AI Interfaces
Beyond basic streaming: how to build AI interfaces that feel responsive using streaming APIs, progressive rendering, and smart UX patterns.
View all llm api integration depths βDepth ladder for this topic:
Streaming LLM responses is table stakes in 2026. The difference between good and great AI interfaces is what you do with the stream β how you handle structured output, tool calls, errors, and the UX around waiting. This guide goes beyond βpipe tokens to the screenβ into production-ready streaming architecture.
Why Streaming Matters More Than You Think
The time-to-first-token (TTFT) for a typical LLM API call is 200-800ms. Total generation time for a 500-token response is 3-8 seconds. Without streaming, users stare at a blank screen for the entire duration. With streaming, they see the first word in under a second.
But the UX impact goes beyond perceived speed:
- Users can start reading immediately, processing the response as it arrives
- Users can interrupt if the model is going in the wrong direction, saving time and tokens
- Progressive disclosure reduces cognitive load β content appears in readable chunks
- Streaming signals liveness β the user knows the system is working, not stuck
Streaming Protocols
Server-Sent Events (SSE)
The most common protocol for LLM streaming. SSE is a one-directional protocol where the server pushes events to the client over a standard HTTP connection.
const response = await fetch('/api/chat', {
method: 'POST',
body: JSON.stringify({ messages }),
headers: { 'Content-Type': 'application/json' }
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
const { done, value } = await reader.read();
if (done) break;
const chunk = decoder.decode(value);
const lines = chunk.split('\n');
for (const line of lines) {
if (line.startsWith('data: ')) {
const data = JSON.parse(line.slice(6));
handleToken(data);
}
}
}
Pros: Simple, works over standard HTTP, supported everywhere, easy to proxy through CDNs. Cons: Unidirectional (client canβt send messages back on the same connection), reconnection handling needs manual implementation.
WebSockets
Bidirectional communication. Useful when you need to send signals back to the server during generation (cancel, update context, redirect).
const ws = new WebSocket('wss://api.example.com/chat');
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
if (data.type === 'token') handleToken(data);
if (data.type === 'done') handleComplete(data);
if (data.type === 'error') handleError(data);
};
// Cancel mid-stream
function cancelGeneration() {
ws.send(JSON.stringify({ type: 'cancel' }));
}
Pros: Bidirectional, lower overhead for frequent messages, native cancellation. Cons: More complex server infrastructure, harder to scale horizontally, doesnβt work through all proxies.
SDK Streaming
Most LLM SDKs abstract away the protocol details:
# OpenAI
stream = client.chat.completions.create(
model="gpt-4",
messages=messages,
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
yield chunk.choices[0].delta.content
# Anthropic
with client.messages.stream(
model="claude-sonnet-4-20250514",
messages=messages,
max_tokens=1024
) as stream:
for text in stream.text_stream:
yield text
Building the Frontend
Token Buffering
Donβt render every single token individually. Batch tokens and render at a readable pace:
class TokenBuffer {
constructor(renderFn, intervalMs = 30) {
this.buffer = '';
this.renderFn = renderFn;
this.interval = setInterval(() => this.flush(), intervalMs);
}
add(token) {
this.buffer += token;
}
flush() {
if (this.buffer) {
this.renderFn(this.buffer);
this.buffer = '';
}
}
stop() {
this.flush();
clearInterval(this.interval);
}
}
This produces smoother visual output than rendering each token the instant it arrives, which can look jittery.
Markdown Rendering During Streaming
LLM outputs often contain Markdown. Rendering Markdown mid-stream is tricky because incomplete syntax creates visual artifacts:
- An unclosed
**boldlooks broken until the closing**arrives - Code blocks without closing backticks expand indefinitely
- Lists and headers can shift layout as more content arrives
Solutions:
- Buffer until complete blocks β detect Markdown boundaries and only render complete elements
- Incremental Markdown parsing β libraries like
markedcan handle partial input with some configuration - Two-pass rendering β show raw text during streaming, render Markdown when complete (simpler but less polished)
Handling Structured Output
When streaming JSON or structured responses (function calls, tool use), you canβt parse partial JSON:
// This will fail mid-stream
JSON.parse('{"name": "John", "age":') // SyntaxError
// Solutions:
// 1. Buffer until complete JSON
// 2. Use a streaming JSON parser
// 3. Use partial JSON reconstruction libraries
For tool calls, most APIs send the function name first, then stream the arguments. Build your UI to show βCalling searchβ¦β as soon as the function name arrives, then update with parameters as they stream in.
Progressive UI Patterns
Skeleton loading β streaming text: Show a message skeleton (avatar, timestamp, empty bubble) immediately, then fill with streaming text.
Typing indicator β content: Display a typing indicator for TTFT, then transition to streaming text.
Section-by-section rendering: For long responses with headers, render each section as a complete block rather than token-by-token within sections.
Code block streaming: Show a code block container immediately when backticks are detected, stream code into it with syntax highlighting updating incrementally.
Error Handling
Mid-Stream Failures
Streams can fail partway through. Your UI needs to handle:
try {
for await (const chunk of stream) {
appendContent(chunk);
}
} catch (error) {
if (partialContent.length > 0) {
// Show what we have + error indicator
showPartialWithError(partialContent, error);
} else {
showErrorState(error);
}
}
Design decision: Do you show the partial response? Usually yes, with a clear indicator that the response was cut short and an option to retry.
Rate Limiting
Handle 429 responses gracefully:
if (response.status === 429) {
const retryAfter = response.headers.get('Retry-After');
showMessage(`High demand β retrying in ${retryAfter}s`);
await delay(retryAfter * 1000);
return retry(request);
}
Timeout Handling
Set reasonable timeouts for both TTFT and total generation:
- TTFT timeout: 10-30 seconds (if no tokens arrive, something is wrong)
- Total timeout: varies by expected response length
- Idle timeout: if no tokens arrive for 5+ seconds mid-stream, alert the user
Cancellation
Users should be able to stop generation at any time:
const controller = new AbortController();
function cancelGeneration() {
controller.abort();
showStoppedIndicator();
enableNewMessage();
}
const response = await fetch('/api/chat', {
signal: controller.signal,
// ...
});
On the server side, detecting client disconnection and stopping the LLM API call saves tokens and money. Most SDKs support this via abort signals or stream cancellation.
Performance Optimization
Edge Streaming
Deploy your streaming proxy at the edge (Cloudflare Workers, Vercel Edge Functions) to minimize latency between your server and the user. The LLM API call goes from your edge node to the provider; the stream flows from edge to user with minimal round-trip time.
Connection Reuse
Keep connections warm for frequent users. Opening a new HTTPS connection for every message adds 100-300ms of overhead. Connection pooling on the server side and keep-alive on the client side help.
Token Counting
Track token usage during streaming for cost monitoring and context window management. Most streaming APIs include usage information in the final chunk.
What to Read Next
- Streaming LLM Responses β fundamentals of SSE and streaming
- LLM API Integration Guide β the complete integration picture
- LLM API Function Calling Guide β streaming with tool use
Simplify
β Managing LLM API Rate Limits and Quotas in Production
Go deeper
Streaming LLM Responses: Why It Matters and How to Build It β
Related reads
Stay ahead of the AI curve
Weekly insights on AI β explained at the level that's right for you. No hype, no jargon, just what matters.
No spam. Unsubscribe anytime. We respect your inbox.