LLM API Streaming: Building Responsive AI Interfaces

Streaming LLM responses is table stakes in 2026. The difference between good and great AI interfaces is what you do with the stream — how you handle structured output, tool calls, errors, and the UX around waiting. This guide goes beyond “pipe tokens to the screen” into production-ready streaming architecture.

Why Streaming Matters More Than You Think

The time-to-first-token (TTFT) for a typical LLM API call is 200-800ms. Total generation time for a 500-token response is 3-8 seconds. Without streaming, users stare at a blank screen for the entire duration. With streaming, they see the first word in under a second.

But the UX impact goes beyond perceived speed:

Users can start reading immediately, processing the response as it arrives
Users can interrupt if the model is going in the wrong direction, saving time and tokens
Progressive disclosure reduces cognitive load — content appears in readable chunks
Streaming signals liveness — the user knows the system is working, not stuck

Streaming Protocols

Server-Sent Events (SSE)

The most common protocol for LLM streaming. SSE is a one-directional protocol where the server pushes events to the client over a standard HTTP connection.

const response = await fetch('/api/chat', {
  method: 'POST',
  body: JSON.stringify({ messages }),
  headers: { 'Content-Type': 'application/json' }
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  
  const chunk = decoder.decode(value);
  const lines = chunk.split('\n');
  
  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      handleToken(data);
    }
  }
}

Pros: Simple, works over standard HTTP, supported everywhere, easy to proxy through CDNs. Cons: Unidirectional (client can’t send messages back on the same connection), reconnection handling needs manual implementation.

WebSockets

Bidirectional communication. Useful when you need to send signals back to the server during generation (cancel, update context, redirect).

const ws = new WebSocket('wss://api.example.com/chat');

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'token') handleToken(data);
  if (data.type === 'done') handleComplete(data);
  if (data.type === 'error') handleError(data);
};

// Cancel mid-stream
function cancelGeneration() {
  ws.send(JSON.stringify({ type: 'cancel' }));
}

Pros: Bidirectional, lower overhead for frequent messages, native cancellation. Cons: More complex server infrastructure, harder to scale horizontally, doesn’t work through all proxies.

SDK Streaming

Most LLM SDKs abstract away the protocol details:

# OpenAI
stream = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        yield chunk.choices[0].delta.content

# Anthropic
with client.messages.stream(
    model="claude-sonnet-4-20250514",
    messages=messages,
    max_tokens=1024
) as stream:
    for text in stream.text_stream:
        yield text

Building the Frontend

Token Buffering

Don’t render every single token individually. Batch tokens and render at a readable pace:

class TokenBuffer {
  constructor(renderFn, intervalMs = 30) {
    this.buffer = '';
    this.renderFn = renderFn;
    this.interval = setInterval(() => this.flush(), intervalMs);
  }
  
  add(token) {
    this.buffer += token;
  }
  
  flush() {
    if (this.buffer) {
      this.renderFn(this.buffer);
      this.buffer = '';
    }
  }
  
  stop() {
    this.flush();
    clearInterval(this.interval);
  }
}

This produces smoother visual output than rendering each token the instant it arrives, which can look jittery.

Markdown Rendering During Streaming

LLM outputs often contain Markdown. Rendering Markdown mid-stream is tricky because incomplete syntax creates visual artifacts:

An unclosed **bold looks broken until the closing ** arrives
Code blocks without closing backticks expand indefinitely
Lists and headers can shift layout as more content arrives

Solutions:

Buffer until complete blocks — detect Markdown boundaries and only render complete elements
Incremental Markdown parsing — libraries like marked can handle partial input with some configuration
Two-pass rendering — show raw text during streaming, render Markdown when complete (simpler but less polished)

Handling Structured Output

When streaming JSON or structured responses (function calls, tool use), you can’t parse partial JSON:

// This will fail mid-stream
JSON.parse('{"name": "John", "age":')  // SyntaxError

// Solutions:
// 1. Buffer until complete JSON
// 2. Use a streaming JSON parser
// 3. Use partial JSON reconstruction libraries

For tool calls, most APIs send the function name first, then stream the arguments. Build your UI to show “Calling search…” as soon as the function name arrives, then update with parameters as they stream in.

Progressive UI Patterns

Skeleton loading → streaming text: Show a message skeleton (avatar, timestamp, empty bubble) immediately, then fill with streaming text.

Typing indicator → content: Display a typing indicator for TTFT, then transition to streaming text.

Section-by-section rendering: For long responses with headers, render each section as a complete block rather than token-by-token within sections.

Code block streaming: Show a code block container immediately when backticks are detected, stream code into it with syntax highlighting updating incrementally.

Error Handling

Mid-Stream Failures

Streams can fail partway through. Your UI needs to handle:

try {
  for await (const chunk of stream) {
    appendContent(chunk);
  }
} catch (error) {
  if (partialContent.length > 0) {
    // Show what we have + error indicator
    showPartialWithError(partialContent, error);
  } else {
    showErrorState(error);
  }
}

Design decision: Do you show the partial response? Usually yes, with a clear indicator that the response was cut short and an option to retry.

Rate Limiting

Handle 429 responses gracefully:

if (response.status === 429) {
  const retryAfter = response.headers.get('Retry-After');
  showMessage(`High demand — retrying in ${retryAfter}s`);
  await delay(retryAfter * 1000);
  return retry(request);
}

Timeout Handling

Set reasonable timeouts for both TTFT and total generation:

TTFT timeout: 10-30 seconds (if no tokens arrive, something is wrong)
Total timeout: varies by expected response length
Idle timeout: if no tokens arrive for 5+ seconds mid-stream, alert the user

Cancellation

Users should be able to stop generation at any time:

const controller = new AbortController();

function cancelGeneration() {
  controller.abort();
  showStoppedIndicator();
  enableNewMessage();
}

const response = await fetch('/api/chat', {
  signal: controller.signal,
  // ...
});

On the server side, detecting client disconnection and stopping the LLM API call saves tokens and money. Most SDKs support this via abort signals or stream cancellation.

Performance Optimization

Edge Streaming

Deploy your streaming proxy at the edge (Cloudflare Workers, Vercel Edge Functions) to minimize latency between your server and the user. The LLM API call goes from your edge node to the provider; the stream flows from edge to user with minimal round-trip time.

Connection Reuse

Keep connections warm for frequent users. Opening a new HTTPS connection for every message adds 100-300ms of overhead. Connection pooling on the server side and keep-alive on the client side help.

Token Counting

Track token usage during streaming for cost monitoring and context window management. Most streaming APIs include usage information in the final chunk.