Parent Document Retrieval: Solving RAG's Context Window Problem
Small chunks retrieve better but provide less context. Large chunks provide context but retrieve worse. Parent document retrieval resolves this tradeoff: search on small chunks, then return the larger parent passage each match came from.
RAG systems face a fundamental tension: small chunks (100-200 tokens) match queries more precisely, but they lose the surrounding context the LLM needs to generate good answers. Large chunks (1000+ tokens) provide context but dilute the relevant information, making retrieval less accurate.
Parent document retrieval is the elegant solution: index small chunks for retrieval, but return their parent documents for generation.
The Problem in Detail
Consider a technical document about configuring a database:
Small chunk (200 tokens): “Set max_connections to 100 for databases with fewer than 50 concurrent users. For higher concurrency, use the formula: max_connections = num_users × 1.5 + 20.”
Large chunk (1000 tokens): The entire “Connection Configuration” section, including related settings, dependencies, and examples.
If a user asks “How should I set max_connections?”, the small chunk matches perfectly. But the LLM answering might need the surrounding context about connection pooling, timeout settings, and memory implications to give a complete answer.
How Parent Document Retrieval Works
Indexing:
Document → Split into parents (sections/pages)
         → Split parents into children (small chunks)
         → Embed and index children
         → Store mapping: child_id → parent_id
Retrieval:
Query → Search children → Get matching child_ids
      → Look up parent_ids → Return parent documents
      → Send parents + query to LLM
The key insight: you’re using small chunks as an index into larger documents. The small chunks are search targets; the parent documents are what the LLM actually reads.
Implementation
Basic Implementation with LangChain
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Note: chunk_size counts characters by default, not tokens.

# Parent splitter: larger chunks (the context the LLM sees)
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200
)

# Child splitter: smaller chunks (what gets embedded and searched)
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50
)

# Store for parent documents
docstore = InMemoryStore()  # Use Redis/Postgres in production

# vectorstore and documents are assumed to be defined earlier:
# any LangChain vector store and a list of Document objects.
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Index documents
retriever.add_documents(documents)

# Retrieve: searches children, returns parents
results = retriever.invoke("How do I configure max_connections?")
Custom Implementation
import uuid
from dataclasses import dataclass

# Assumed to exist elsewhere: split_text(text, size, overlap), returning pieces
# with a .text attribute; embed(texts), returning one vector per text; and a
# module-level vector_store with add() and search().

@dataclass
class Chunk:
    id: str
    text: str
    parent_id: str
    metadata: dict

def index_with_parents(documents, parent_size=2000, child_size=400):
    all_children = []
    parent_store = {}
    for doc in documents:
        # Create parent chunks
        parents = split_text(doc.text, parent_size, overlap=200)
        for parent in parents:
            parent_id = str(uuid.uuid4())
            parent_store[parent_id] = parent
            # Create child chunks from each parent
            children = split_text(parent.text, child_size, overlap=50)
            for child in children:
                all_children.append(Chunk(
                    id=str(uuid.uuid4()),
                    text=child.text,
                    parent_id=parent_id,
                    metadata={**doc.metadata, "parent_id": parent_id}
                ))
    # Index children in vector store
    embeddings = embed([c.text for c in all_children])
    vector_store.add(all_children, embeddings)
    return parent_store

def retrieve(query, parent_store, k=5, parent_k=3):
    # Search children
    child_results = vector_store.search(query, k=k)
    # Map to unique parents, preserving the order of the best-matching children
    seen_parents = set()
    parents = []
    for child in child_results:
        pid = child.metadata["parent_id"]
        if pid not in seen_parents:
            seen_parents.add(pid)
            parents.append(parent_store[pid])
        if len(parents) >= parent_k:
            break
    return parents
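The custom implementation relies on a text splitter, an embedding function, and a vector store that are not shown. To try the idea end to end without any external services, here is a minimal, self-contained sketch. Everything in it is a stand-in chosen for illustration: `split_text` splits on whitespace and returns plain strings, `embed` is a bag-of-words counter rather than a real embedding model, and the "vector store" is a Python list scanned with cosine similarity.

```python
import math
import re
import uuid
from collections import Counter


def split_text(text, size, overlap=0):
    """Toy splitter: windows of `size` words sharing `overlap` words."""
    words = text.split()
    step = max(size - overlap, 1)
    pieces = []
    for start in range(0, len(words), step):
        pieces.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return pieces


def embed(text):
    """Toy embedding: lowercase word counts (not a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(texts, parent_size=40, child_size=10):
    """Return (parent_store, children); each child is (vector, parent_id)."""
    parent_store, children = {}, []
    for text in texts:
        for parent in split_text(text, parent_size):
            parent_id = str(uuid.uuid4())
            parent_store[parent_id] = parent
            for child in split_text(parent, child_size):
                children.append((embed(child), parent_id))
    return parent_store, children


def retrieve(query, parent_store, children, k=5, parent_k=2):
    """Rank children against the query, then return their unique parents."""
    q = embed(query)
    ranked = sorted(children, key=lambda c: cosine(q, c[0]), reverse=True)[:k]
    parents, seen = [], set()
    for _, parent_id in ranked:
        if parent_id not in seen:
            seen.add(parent_id)
            parents.append(parent_store[parent_id])
        if len(parents) >= parent_k:
            break
    return parents


docs = [
    "Set max_connections to 100 for small deployments. Connection pooling "
    "reduces memory use. Timeouts should be tuned together with pool size.",
    "Backups run nightly. Store snapshots off site. Test restores every quarter.",
]
parent_store, children = build_index(docs, parent_size=40, child_size=6)
results = retrieve("How should I set max_connections?", parent_store, children,
                   k=3, parent_k=1)
print(results[0])
```

The query matches the six-word child that mentions max_connections, but the returned text is the whole parent, including the sentences about pooling and timeouts that the child alone would have dropped.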
Choosing Parent and Child Sizes
| Use Case | Child Size | Parent Size | Reasoning |
|---|---|---|---|
| Technical docs | 300-500 tokens | 1500-2500 tokens | Sections are self-contained |
| Legal contracts | 200-400 tokens | 2000-3000 tokens | Clauses need surrounding context |
| Chat/email | 200-300 tokens | Full message/thread | Messages are natural units |
| Code | 100-300 tokens | Full function/class | Code needs complete units |
| Research papers | 400-600 tokens | Full section | Sections are logical units |
Rule of thumb: Child chunks should be small enough to precisely match queries. Parent chunks should be large enough to provide complete context for answering.
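Parent size also has to fit the model's context window, and child size determines how many vectors you index. The arithmetic can be sketched as below. The function names and the example numbers are hypothetical, and the child count assumes a fixed sliding window; splitters that break on separators will produce somewhat different counts.

```python
import math


def max_parents(context_window, parent_tokens, reserved_tokens):
    """Parents that fit after reserving room for instructions, query, and answer."""
    available = context_window - reserved_tokens
    if available <= 0 or parent_tokens <= 0:
        return 0
    return available // parent_tokens


def children_per_parent(parent_tokens, child_tokens, overlap):
    """Estimated number of sliding-window children cut from one parent."""
    step = child_tokens - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than child_tokens")
    extra = max(parent_tokens - child_tokens, 0)
    return 1 + math.ceil(extra / step)


# Hypothetical budget: 8,000-token window, 1,500 tokens reserved, 2,500-token parents.
print(max_parents(8000, 2500, 1500))       # 6500 // 2500 = 2
# 2,000-token parents cut into 400-token children with 50 tokens of overlap.
print(children_per_parent(2000, 400, 50))  # 1 + ceil(1600 / 350) = 6
```

If the first number comes out below the number of parents you want to return, shrink the parents or return fewer of them.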
Advanced Patterns
Multi-Level Hierarchy
Instead of two levels, split the document into a deeper hierarchy:
Document → Section → Paragraph → Sentence
                                 ↑ search here
           ↑ return this
Search on sentences (most precise matching), but return entire sections. This works well for long documents where sections are 2000-5000 tokens.
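A minimal sketch of that hierarchy follows. It assumes sections arrive as strings with paragraphs separated by blank lines, splits sentences with a simple regular expression, and uses keyword overlap in place of vector search. All function names are hypothetical. The `level` argument shows that the same sentence match can be mapped up to either ancestor.

```python
import re


def build_hierarchy(sections):
    """Return (paragraphs, sentences) where each sentence knows its ancestors."""
    paragraphs = {}   # paragraph_id -> paragraph text
    sentences = []    # (sentence text, paragraph_id, section_id)
    for section_id, section in enumerate(sections):
        blocks = [p.strip() for p in section.split("\n\n") if p.strip()]
        for paragraph_no, paragraph in enumerate(blocks):
            paragraph_id = (section_id, paragraph_no)
            paragraphs[paragraph_id] = paragraph
            for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
                if sentence:
                    sentences.append((sentence, paragraph_id, section_id))
    return paragraphs, sentences


def retrieve(query, sections, paragraphs, sentences, level="section"):
    """Match on sentences, return the unique ancestors at the requested level."""
    terms = set(re.findall(r"\w+", query.lower()))
    hits = []
    for text, paragraph_id, section_id in sentences:
        overlap = len(terms & set(re.findall(r"\w+", text.lower())))
        if overlap:
            hits.append((overlap, paragraph_id, section_id))
    hits.sort(key=lambda h: h[0], reverse=True)

    results, seen = [], set()
    for _, paragraph_id, section_id in hits:
        key = section_id if level == "section" else paragraph_id
        if key not in seen:
            seen.add(key)
            if level == "section":
                results.append(sections[section_id])
            else:
                results.append(paragraphs[paragraph_id])
    return results


sections = [
    "Connection limits.\n\nSet max_connections to 100. Pooling lowers memory use.",
    "Backups.\n\nSnapshots run nightly. Restores are tested quarterly.",
]
paragraphs, sentences = build_hierarchy(sections)
print(retrieve("max_connections setting", sections, paragraphs, sentences, "paragraph"))
print(retrieve("max_connections setting", sections, paragraphs, sentences, "section"))
```

Storing both ancestor IDs on each sentence keeps the choice of return level a query-time decision, so you can return paragraphs when the context budget is tight and sections when it is not.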
Deduplication
Multiple child chunks from the same parent might match a query. Without deduplication, you waste context window space returning the same parent multiple times:
def deduplicated_parent_retrieval(query, k_children=10, k_parents=3):
    # vector_store and parent_store are assumed to be module-level objects
    children = vector_store.search(query, k=k_children)
    parent_scores = {}
    for child in children:
        pid = child.metadata["parent_id"]
        # Keep each parent's best child score (assumes higher score = better match)
        if pid not in parent_scores:
            parent_scores[pid] = child.score
        else:
            parent_scores[pid] = max(parent_scores[pid], child.score)
    # Return top parents by best child score
    top_parents = sorted(parent_scores.items(), key=lambda x: x[1], reverse=True)
    return [parent_store[pid] for pid, score in top_parents[:k_parents]]
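Taking the maximum ranks each parent by its single best child, so a parent with several good matches gets no credit for them. If you want those parents to rank higher, one possible variant adds a small bonus per extra matching child. The sketch below is an illustration, not a tuned scoring rule: the `bonus` value is an arbitrary assumption, and it presumes that higher scores mean better matches.

```python
from collections import defaultdict


def rank_parents(child_hits, k_parents=3, bonus=0.05):
    """child_hits: iterable of (parent_id, score), higher score = better match."""
    best = {}
    count = defaultdict(int)
    for parent_id, score in child_hits:
        best[parent_id] = max(best.get(parent_id, score), score)
        count[parent_id] += 1
    # Best child score plus a fixed bonus for every additional matching child
    combined = {pid: best[pid] + bonus * (count[pid] - 1) for pid in best}
    return sorted(combined, key=combined.get, reverse=True)[:k_parents]


hits = [("A", 0.80), ("B", 0.78), ("B", 0.77), ("B", 0.75), ("C", 0.40)]
print(rank_parents(hits, k_parents=2))             # ['B', 'A']
print(rank_parents(hits, k_parents=2, bonus=0.0))  # ['A', 'B']
```

With the bonus, parent B (three matching children) overtakes parent A (one slightly stronger match). Setting `bonus=0.0` reproduces plain best-child ranking.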
Dynamic Parent Sizing
Use natural document boundaries instead of fixed sizes:
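def split_by_structure(document):
    """Use headings, paragraph breaks, or other structural markers as parent boundaries"""
    # parse_document and join are assumed helpers: a parser that yields elements
    # with an is_heading flag, and a function that merges elements into one parent.
    parents = []
    current_parent = []
    for element in parse_document(document):
        if element.is_heading and current_parent:
            # A new heading closes the parent collected so far
            parents.append(join(current_parent))
            current_parent = [element]
        else:
            current_parent.append(element)
    if current_parent:
        parents.append(join(current_parent))
    return parents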
Natural boundaries produce more coherent parent chunks than arbitrary size-based splitting.
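For Markdown sources, the same loop can run directly over lines. The following is a minimal sketch under a simplifying assumption: any line starting with `#` is a heading. A real parser would also need to skip `#` lines inside code blocks, which this version would wrongly treat as section breaks.

```python
def split_markdown_by_heading(text):
    """Split Markdown text into parents, one per heading-delimited section."""
    parents, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            # A new heading closes the section collected so far
            parents.append("\n".join(current).strip())
            current = [line]
        else:
            current.append(line)
    if current:
        parents.append("\n".join(current).strip())
    return [p for p in parents if p]


doc = (
    "# Setup\nInstall the package.\n\n"
    "## Connections\nSet max_connections.\n\n"
    "## Backups\nRun nightly."
)
for parent in split_markdown_by_heading(doc):
    print(repr(parent))
```

Each parent keeps its heading line, which gives the LLM a label for the section it is reading.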
When to Use Parent Document Retrieval
Use it when:
- Documents have clear hierarchical structure
- Answers require context beyond the matching passage
- Your chunks are too small for the LLM to work with
- You’re getting precise but incomplete answers
Don’t use it when:
- Documents are very short (the whole document fits in a chunk)
- Each chunk is self-contained (FAQ entries, glossary definitions)
- Context window is very limited (parent documents might be too large)
- You need maximum retrieval precision without context (keyword search use cases)
Parent document retrieval adds minimal complexity to a RAG system but often produces significantly better answers. If your RAG system is returning relevant but incomplete information, this is the first optimization to try.