Error Handling and Retry Patterns for LLM APIs
Production-grade error handling for LLM API integrations — retry strategies, fallback patterns, and graceful degradation.
LLM APIs fail. They rate-limit you, time out on long generations, return malformed responses, and occasionally go down entirely. If your application treats these as exceptional events, your users will have a bad time.
Good error handling for LLM APIs isn’t about preventing failures — it’s about handling them gracefully so your application stays useful.
The Error Taxonomy
LLM APIs produce a predictable set of errors. Handle each category differently.
Transient Errors (Retry)
- 429 Too Many Requests: Rate limited. Almost always worth retrying with backoff.
- 500 Internal Server Error: Provider-side issue. Retry a few times.
- 502/503 Bad Gateway / Service Unavailable: Infrastructure issue. Retry with longer backoff.
- Timeout: Generation took too long. Retry with a lower max_tokens or a faster model.
Client Errors (Fix, Don’t Retry)
- 400 Bad Request: Malformed request. Fix your payload.
- 401 Unauthorized: Invalid API key. Don’t retry — fix authentication.
- 403 Forbidden: Access denied. Check permissions.
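- 413 Payload Too Large: Request body exceeds the provider's size limit. Truncate or chunk. Note that exceeding the context window often surfaces as a 400 instead (OpenAI reports it that way), so check the error code, not just the status.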
- 422 Unprocessable Entity: Invalid parameters. Fix the request.
Content Errors (Validate and Retry)
- Empty response: Model returned nothing useful
- Malformed JSON: Asked for structured output, got invalid JSON
- Refusal: Model declined to answer (content policy)
- Truncated output: Hit max_tokens before completing
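The taxonomy above can be turned into a small dispatch helper. This is a minimal sketch, not a definitive mapping: the `Action` enum, the `classify_status` function, and the rule for unlisted codes (other 5xx treated as transient, other 4xx as client errors) are assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"              # transient: back off and try again
    FIX_REQUEST = "fix_request"  # client error: retrying will not help
    VALIDATE = "validate"        # call succeeded: check the content


# Status codes taken from the taxonomy above
RETRYABLE_STATUS = {429, 500, 502, 503}
CLIENT_ERROR_STATUS = {400, 401, 403, 413, 422}


def classify_status(status_code):
    """Map an HTTP status code to a handling category."""
    if status_code in RETRYABLE_STATUS:
        return Action.RETRY
    if status_code in CLIENT_ERROR_STATUS:
        return Action.FIX_REQUEST
    if 200 <= status_code < 300:
        # A 2xx can still carry empty, truncated, or refused content
        return Action.VALIDATE
    # Assumption: unlisted 5xx codes are transient, everything else is ours to fix
    return Action.RETRY if status_code >= 500 else Action.FIX_REQUEST
```

Centralizing the decision in one function keeps retry loops, fallback chains, and logging in agreement about what counts as retryable.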
Retry Strategy: Exponential Backoff with Jitter
The standard approach: wait longer between each retry, and add randomness so that clients that failed together don't retry together (the thundering herd problem).
import time
import random
from openai import (
    OpenAI,
    RateLimitError,
    APITimeoutError,
    APIConnectionError,
    InternalServerError,
)

# The SDK retries on its own by default; disable that so this loop is the only retry layer
client = OpenAI(max_retries=0)

def call_llm_with_retry(messages, max_retries=3, base_delay=1.0, max_delay=60.0):
    for attempt in range(max_retries + 1):
        try:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                timeout=30,
            )
            return response.choices[0].message.content
        except RateLimitError as e:
            if attempt == max_retries:
                raise
            # Use the Retry-After header if the provider sent a usable value
            retry_after = e.response.headers.get("retry-after")
            try:
                delay = float(retry_after)
            except (TypeError, ValueError):
                delay = base_delay * (2 ** attempt)
            delay = min(delay, max_delay)
            delay += random.uniform(0, delay * 0.1)  # jitter
            time.sleep(delay)
        except (APITimeoutError, APIConnectionError, InternalServerError):
            # Timeouts, connection failures, and 5xx responses are transient
            if attempt == max_retries:
                raise
            delay = min(base_delay * (2 ** attempt), max_delay) + random.uniform(0, 1)
            time.sleep(delay)
        # Any other exception (400, 401, 403, ...) is non-retryable and propagates
Key Parameters
- Max retries: 3 is usually enough. More than 5 is rarely helpful.
- Base delay: 1 second for rate limits, 2-5 seconds for server errors.
- Max delay: Cap at 60 seconds. If you’re waiting longer, something is seriously wrong.
- Jitter: Always add randomness. Without it, every client that failed at the same moment retries at the same moment and hits the API in a synchronized burst.
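The parameters above combine into a single delay calculation. The sketch below uses the "full jitter" variant (a uniform draw between zero and the capped exponential delay), which differs from the small additive jitter in the retry loop; the function name `backoff_delay` and the `rng` parameter are assumptions added for illustration and testability.

```python
import random


def backoff_delay(attempt, base_delay=1.0, max_delay=60.0, rng=random):
    """Return the seconds to wait before retry number `attempt` (0-based)."""
    # Exponential growth, capped so a long outage never produces multi-minute sleeps
    capped = min(max_delay, base_delay * (2 ** attempt))
    # Full jitter: pick uniformly between 0 and the capped delay
    return rng.uniform(0, capped)


if __name__ == "__main__":
    for attempt in range(4):
        print(f"attempt {attempt}: sleep {backoff_delay(attempt):.2f}s")
```

Full jitter spreads retries across the whole interval rather than clustering them near the exponential value, at the cost of occasionally retrying almost immediately.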
Fallback Patterns
When retries fail, fall back gracefully.
Model Fallback Chain
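FALLBACK_CHAIN = [
    {"provider": "openai", "model": "gpt-4o"},
    {"provider": "anthropic", "model": "claude-sonnet-4-20250514"},
    {"provider": "openai", "model": "gpt-4o-mini"},
]

class AllProvidersFailedError(Exception):
    """Raised when every entry in the fallback chain has failed."""

async def call_with_fallback(messages):
    errors = []
    for config in FALLBACK_CHAIN:
        try:
            # call_provider is your own dispatch function wrapping each provider's SDK
            return await call_provider(config, messages)
        except Exception as e:
            # Record the failure and move on to the next entry in the chain
            errors.append(f"{config['provider']}/{config['model']}: {e}")
    raise AllProvidersFailedError(errors)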
Design considerations:
- Cost ordering: Put cheaper models later in the chain (they serve as fallbacks, not primaries)
- Capability matching: Ensure fallback models can handle the task. Don’t fall back from GPT-4o to a model that can’t do function calling if you need function calling.
- Provider diversity: Use multiple providers. If OpenAI is down, Anthropic probably isn’t.
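Capability matching can be enforced in code by filtering the chain before walking it. The following is a sketch under stated assumptions: the provider names, model names, and `capabilities` sets are placeholders rather than claims about real models, and `call_provider` is a hypothetical callable that you supply.

```python
# Hypothetical chain: names and capability sets are placeholders, not real products
CHAIN = [
    {"provider": "provider_a", "model": "large", "capabilities": {"tools", "json"}},
    {"provider": "provider_b", "model": "large", "capabilities": {"tools", "json"}},
    {"provider": "provider_a", "model": "small", "capabilities": {"json"}},
]


def eligible_models(chain, required):
    """Keep chain order, dropping models that lack any required capability."""
    required = set(required)
    return [c for c in chain if required <= c["capabilities"]]


def call_with_fallback(chain, required, call_provider, messages):
    """Try each eligible model in order; raise if all of them fail."""
    errors = []
    for config in eligible_models(chain, required):
        try:
            return call_provider(config, messages)
        except Exception as e:
            errors.append(f"{config['provider']}/{config['model']}: {e}")
    raise RuntimeError("all eligible providers failed: " + "; ".join(errors))
```

Filtering first means a request that needs tool calling fails loudly when no capable model is available, instead of quietly falling through to a model that cannot do the job.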
Cached Response Fallback
For queries that repeat, serve a cached response when live calls fail:
import hashlib
import json

def get_cache_key(messages):
    # Serialize deterministically so identical conversations map to the same key
    content = json.dumps(messages, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()

async def call_with_cache_fallback(messages):
    # call_llm and cache are your own async LLM client and cache wrappers
    cache_key = get_cache_key(messages)
    try:
        response = await call_llm(messages)
        await cache.set(cache_key, response, ttl=3600)
        return response
    except Exception:
        cached = await cache.get(cache_key)
        if cached is not None:
            return cached  # Stale is better than nothing
        raise
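The cache object in the snippet above is left undefined. A minimal in-memory stand-in with the same async get/set shape might look like the sketch below; the class name `InMemoryTTLCache` and the injectable `clock` argument are assumptions for illustration, and a production system would more likely use a shared store such as Redis.

```python
import asyncio
import time


class InMemoryTTLCache:
    """Tiny async cache with per-entry expiry. Single-process only."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    async def set(self, key, value, ttl):
        self._store[key] = (self._clock() + ttl, value)

    async def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries are removed lazily on read
            del self._store[key]
            return None
        return value


async def demo():
    cache = InMemoryTTLCache()
    await cache.set("greeting", "hello", ttl=3600)
    return await cache.get("greeting")


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

One design point to weigh: the TTL bounds how long an outage the cache can cover. With a one-hour TTL, an outage that outlasts the hour leaves nothing to serve.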
Graceful Degradation
When AI features fail, the app should still work:
async def get_product_description(product):
    try:
        return await generate_ai_description(product)
    except Exception:
        # Fall back to template-based description
        return f"{product.name} - {product.category}. {product.basic_description}"
Handling Content-Level Errors
API calls succeed, but the response isn’t what you wanted.
JSON Validation
When you need structured output:
import json
from pydantic import BaseModel, ValidationError

class ProductReview(BaseModel):
    sentiment: str  # positive, negative, neutral
    score: float  # 0.0 to 1.0
    summary: str

def parse_llm_json(response_text, max_retries=2):
    for attempt in range(max_retries + 1):
        try:
            # Strip a Markdown code fence if the model wrapped its JSON in one
            text = response_text.strip()
            if text.startswith("```"):
                text = text.split("```")[1]
                if text.startswith("json"):
                    text = text[4:]
            data = json.loads(text)
            return ProductReview(**data)
        except (json.JSONDecodeError, ValidationError, TypeError) as e:
            # TypeError covers valid JSON that is not an object (e.g. a list)
            if attempt == max_retries:
                raise
            # Ask the model to fix its output; call_llm is assumed to return the reply text
            response_text = call_llm([
                {"role": "user", "content": f"Fix this JSON to match the schema. Error: {e}\n\nBroken JSON:\n{response_text}"}
            ])
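Better approach: use structured output features when available (OpenAI’s response_format, Anthropic’s tool use for structured data). These make valid JSON far more likely, and OpenAI’s strict structured outputs constrain the response to your schema. A truncated or refused response can still fail to parse, so keep the validation step.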
Truncation Detection
def check_truncation(response):
    if response.choices[0].finish_reason == "length":
        # Response was cut off: hit max_tokens
        # Options: increase max_tokens, ask for continuation, or accept partial
        return True
    return False
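The same finish_reason check extends naturally to the other content errors from the taxonomy. The sketch below assumes an OpenAI-style chat completion shape (`choices[0].finish_reason` and `choices[0].message.content`); the function name `inspect_choice` and the problem labels are hypothetical, and the demo uses `SimpleNamespace` stand-ins rather than real API responses.

```python
from types import SimpleNamespace


def inspect_choice(response):
    """Return a list of content-level problems found in the first choice."""
    choice = response.choices[0]
    problems = []
    if choice.finish_reason == "length":
        problems.append("truncated")  # hit max_tokens before finishing
    elif choice.finish_reason == "content_filter":
        problems.append("filtered")   # provider-side content filter fired
    content = choice.message.content
    # Note: tool-call responses legitimately have no text content;
    # skip this check for them in real code
    if not content or not content.strip():
        problems.append("empty")
    return problems


def fake_response(content, finish_reason):
    """Build a stand-in object with the same attribute shape as a completion."""
    message = SimpleNamespace(content=content)
    choice = SimpleNamespace(message=message, finish_reason=finish_reason)
    return SimpleNamespace(choices=[choice])


if __name__ == "__main__":
    print(inspect_choice(fake_response("Partial answ", "length")))
```

Returning a list rather than a boolean lets the caller pick a different recovery for each problem, such as raising max_tokens for truncation and switching models for a filtered response.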
Refusal Handling
REFUSAL_INDICATORS = [
    "I can't help with",
    "I'm not able to",
    "I apologize, but I cannot",
    "As an AI, I",
]

def is_refusal(response_text):
    return any(indicator.lower() in response_text.lower()
               for indicator in REFUSAL_INDICATORS)
When you detect a refusal, options include: rephrasing the prompt, using a different model, or escalating to a human.
Circuit Breaker Pattern
When a provider is consistently failing, stop hammering it:
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = 0
        self.state = "closed"  # closed = normal, open = blocking, half-open = testing

    def can_proceed(self):
        if self.state == "closed":
            return True
        if self.state == "open" and time.time() - self.last_failure_time > self.recovery_timeout:
            self.state = "half-open"
            return True  # Allow one test request
        # Still open, or a test request is already in flight
        return False

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        self.last_failure_time = time.time()
        # A failed test request reopens the circuit immediately
        if self.state == "half-open" or self.failures >= self.threshold:
            self.state = "open"
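A breaker only helps if every call goes through it and reports its outcome. One way to wire that up is a small wrapper, sketched here under the assumption that the breaker exposes the `can_proceed`, `record_success`, and `record_failure` methods shown above; `CircuitOpenError` and `guarded_call` are hypothetical names.

```python
class CircuitOpenError(RuntimeError):
    """Raised when the breaker is blocking calls to a provider."""


def guarded_call(breaker, fn, *args, **kwargs):
    """Run fn through the breaker, recording the outcome either way."""
    if not breaker.can_proceed():
        # Fail fast instead of waiting on a provider that is known to be down
        raise CircuitOpenError("provider circuit is open; skipping call")
    try:
        result = fn(*args, **kwargs)
    except Exception:
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
```

In a fallback chain, catching `CircuitOpenError` and moving to the next provider lets the chain skip a failing provider without paying its timeout on every request.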
Monitoring and Alerting
Track these metrics:
- Error rate by type (429, 500, timeout, content errors)
- Retry rate (high retry rate = something’s wrong even if requests eventually succeed)
- Fallback activation rate (how often you’re using backup models)
- P95/P99 latency (including retries)
- Cost per successful request (retries cost money)
Alert when:
- Error rate exceeds 5% over 5 minutes
- A provider’s circuit breaker opens
- Fallback chain reaches the last option
- Average latency doubles
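The first alert rule (error rate above 5% over 5 minutes) can be computed with a sliding window. This is a minimal in-process sketch: the class name `ErrorRateMonitor`, the `min_requests` guard against alerting on a handful of calls, and the injectable `clock` are all assumptions. Most teams would compute this in their metrics system instead.

```python
import time
from collections import deque


class ErrorRateMonitor:
    """Track request outcomes over a sliding time window."""

    def __init__(self, window_seconds=300, threshold=0.05,
                 min_requests=20, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_requests = min_requests
        self._clock = clock
        self._events = deque()  # (timestamp, is_error), oldest first

    def _evict(self):
        # Drop events that have aged out of the window
        cutoff = self._clock() - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def record(self, is_error):
        self._events.append((self._clock(), bool(is_error)))
        self._evict()

    def error_rate(self):
        self._evict()
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)

    def should_alert(self):
        self._evict()
        enough_traffic = len(self._events) >= self.min_requests
        return enough_traffic and self.error_rate() > self.threshold
```

Call `record(is_error=...)` once per completed request, after retries have resolved, and poll `should_alert()` from whatever drives your alerting.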
Production Checklist
Before deploying an LLM-powered feature:
- All API calls have timeouts set
- Retries with exponential backoff for transient errors
- Fallback chain with at least one alternative
- Graceful degradation when all AI calls fail
- Structured output validation
- Truncation detection
- Rate limit awareness (respect Retry-After headers)
- Circuit breakers for each provider
- Error logging with request/response context
- Cost monitoring and alerts
- User-facing error messages that make sense
LLM APIs are unreliable by nature — variable latency, rate limits, content filtering, occasional outages. Build your application assuming every call might fail, and it’ll work well when they don’t.