
Parent Document Retrieval: Solving RAG's Context Window Problem

Small chunks retrieve better but provide less context. Large chunks provide context but retrieve worse. Parent document retrieval resolves this tradeoff: search on small chunks, then return the larger parent each match came from.


RAG systems face a fundamental tension: small chunks (100-200 tokens) match queries more precisely, but they lose the surrounding context the LLM needs to generate good answers. Large chunks (1000+ tokens) provide context but dilute the relevant information, making retrieval less accurate.

Parent document retrieval is the elegant solution: index small chunks for retrieval, but return their parent documents for generation.

The Problem in Detail

Consider a technical document about configuring a database:

Small chunk (200 tokens): “Set max_connections to 100 for databases with fewer than 50 concurrent users. For higher concurrency, use the formula: max_connections = num_users × 1.5 + 20.”

Large chunk (1000 tokens): The entire “Connection Configuration” section, including related settings, dependencies, and examples.

If a user asks “How should I set max_connections?”, the small chunk matches perfectly. But the LLM answering might need the surrounding context about connection pooling, timeout settings, and memory implications to give a complete answer.

How Parent Document Retrieval Works

Indexing:
  Document → Split into parents (sections/pages)
           → Split parents into children (small chunks)
           → Embed and index children
           → Store mapping: child_id → parent_id

Retrieval:
  Query → Search children → Get matching child_ids
        → Look up parent_ids → Return parent documents
        → Send parents + query to LLM

The key insight: you’re using small chunks as an index into larger documents. The small chunks are search targets; the parent documents are what the LLM actually reads.
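
The flow above can be sketched end to end without any external services. This is a toy illustration only: word overlap stands in for embedding similarity, and `tokens`, `build_index`, and `retrieve_parents` are hypothetical names, not part of any library.

```python
import re

def tokens(text):
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z_]+", text.lower()))

def build_index(parents, child_size=8):
    """Split each parent into children of `child_size` words and record child -> parent."""
    children = []  # list of (child_text, parent_id)
    for parent_id, parent in enumerate(parents):
        words = parent.split()
        for start in range(0, len(words), child_size):
            children.append((" ".join(words[start:start + child_size]), parent_id))
    return children

def retrieve_parents(query, children, parents, k=2):
    """Rank children by word overlap, then return their de-duplicated parents."""
    q = tokens(query)
    ranked = sorted(children, key=lambda c: len(q & tokens(c[0])), reverse=True)
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[pid] for pid in parent_ids]

parents = [
    "Set max_connections to 100 for small deployments. Connection pooling reduces "
    "the number of open connections and lowers memory use per client.",
    "Backups run nightly by default. Configure the retention window to control "
    "how long snapshots are kept on disk.",
]
children = build_index(parents)
result = retrieve_parents("How should I set max_connections?", children, parents, k=1)
```

The query matches only the first eight-word child, yet the returned text is the whole parent, including the sentence about connection pooling that the child alone would have dropped.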

Implementation

Basic Implementation with LangChain

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Parent splitter: larger chunks (the context the LLM sees)
# Note: chunk_size is measured in characters by default, not tokens
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200
)

# Child splitter: smaller chunks (what gets embedded and searched)
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50
)

# Store for parent documents
docstore = InMemoryStore()  # Use Redis/Postgres in production

# vectorstore must already exist: any LangChain vector store built with your embedding model
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Index documents (a list of LangChain Document objects loaded beforehand)
retriever.add_documents(documents)

# Retrieve: searches children, returns parents
results = retriever.invoke("How do I configure max_connections?")

Custom Implementation

import uuid
from dataclasses import dataclass

@dataclass
class Chunk:
    id: str
    text: str
    parent_id: str
    metadata: dict

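def index_with_parents(documents, parent_size=2000, child_size=400):
    # Assumes helpers defined elsewhere: split_text(text, size, overlap), which returns
    # objects with a .text attribute; embed(texts), which returns one vector per text;
    # and a module-level vector_store exposing add() and search().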
    all_children = []
    parent_store = {}
    
    for doc in documents:
        # Create parent chunks
        parents = split_text(doc.text, parent_size, overlap=200)
        
        for parent in parents:
            parent_id = str(uuid.uuid4())
            parent_store[parent_id] = parent
            
            # Create child chunks from each parent
            children = split_text(parent.text, child_size, overlap=50)
            
            for child in children:
                all_children.append(Chunk(
                    id=str(uuid.uuid4()),
                    text=child.text,
                    parent_id=parent_id,
                    metadata={**doc.metadata, "parent_id": parent_id}
                ))
    
    # Index children in vector store (pass a list, not a generator, so embed() can batch it)
    embeddings = embed([c.text for c in all_children])
    vector_store.add(all_children, embeddings)
    
    return parent_store

def retrieve(query, parent_store, k=5, parent_k=3):
    # Search children
    child_results = vector_store.search(query, k=k)
    
    # Map to unique parents
    seen_parents = set()
    parents = []
    for child in child_results:
        pid = child.metadata["parent_id"]
        if pid not in seen_parents:
            seen_parents.add(pid)
            parents.append(parent_store[pid])
            if len(parents) >= parent_k:
                break
    
    return parents
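
The custom implementation calls a `split_text` helper that it never defines. One possible version is sketched below, under two assumptions that are not from the original: sizes are counted in characters rather than tokens, and each piece is wrapped in a hypothetical `Piece` class so that `.text` is available as the code above expects.

```python
from dataclasses import dataclass

@dataclass
class Piece:
    text: str

def split_text(text, size, overlap=0):
    """Split `text` into windows of at most `size` characters.

    Consecutive windows share `overlap` characters.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(Piece(text[start:start + size]))
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return pieces

pieces = split_text("abcdefghij", size=4, overlap=1)
```

A production splitter would more likely count tokens and break on sentence or paragraph boundaries; this version only shows the windowing and overlap logic.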

Choosing Parent and Child Sizes

| Use Case | Child Size | Parent Size | Reasoning |
|---|---|---|---|
| Technical docs | 300-500 tokens | 1500-2500 tokens | Sections are self-contained |
| Legal contracts | 200-400 tokens | 2000-3000 tokens | Clauses need surrounding context |
| Chat/email | 200-300 tokens | Full message/thread | Messages are natural units |
| Code | 100-300 tokens | Full function/class | Code needs complete units |
| Research papers | 400-600 tokens | Full section | Sections are logical units |

Rule of thumb: Child chunks should be small enough to precisely match queries. Parent chunks should be large enough to provide complete context for answering.
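
Parent size also bounds how many parents you can send to the LLM. A rough budget check might look like the sketch below; the function name and the reserved prompt and answer sizes are illustrative assumptions, not recommendations.

```python
def max_parents_that_fit(context_window, parent_tokens, prompt_tokens=500, answer_tokens=1000):
    """Number of parent chunks that fit after reserving room for the prompt and the answer."""
    available = context_window - prompt_tokens - answer_tokens
    return max(available // parent_tokens, 0)

# Hypothetical numbers: 2500-token parents in an 8k and a 32k context window
fits_small_window = max_parents_that_fit(context_window=8000, parent_tokens=2500)
fits_large_window = max_parents_that_fit(context_window=32000, parent_tokens=2500)
```

With these numbers, an 8k window holds two parents and a 32k window holds twelve, so the parent count you can return depends heavily on the model you generate with.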

Advanced Patterns

Multi-Level Hierarchy

Instead of two chunk levels, split each document into three (section, paragraph, sentence):

Document → Section → Paragraph → Sentence
                                    ↑ search here
              ↑ return this

Search on sentences (most precise matching), but return entire sections. This works well for long documents where sections are 2000-5000 tokens.
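
A minimal sketch of the sentence-to-section mapping follows. It assumes sections arrive as a dict of title to text with blank lines between paragraphs, uses a naive regex sentence splitter, and scores by word overlap instead of embeddings. All names are hypothetical.

```python
import re

def build_hierarchy(sections):
    """Map every sentence to the title of the section it came from."""
    sentence_index = []  # list of (sentence, section_title)
    for title, body in sections.items():
        for paragraph in body.split("\n\n"):
            # Naive splitter: break after ., ! or ? followed by whitespace
            for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
                if sentence:
                    sentence_index.append((sentence, title))
    return sentence_index

def search_sentences(query, sentence_index, sections):
    """Return the full section containing the best-matching sentence."""
    q = set(query.lower().split())
    _, best_title = max(
        sentence_index, key=lambda item: len(q & set(item[0].lower().split()))
    )
    return best_title, sections[best_title]

sections = {
    "Connections": "Set the connection limit per node. Pooling lowers memory use.\n\n"
                   "Timeouts default to 30 seconds.",
    "Backups": "Backups run nightly. Retention is configurable.",
}
index = build_hierarchy(sections)
title, section_text = search_sentences("what do timeouts default to", index, sections)
```

The match happens on a single sentence in the second paragraph, but the caller receives the entire "Connections" section.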

Deduplication

Multiple child chunks from the same parent might match a query. Without deduplication, you waste context window space returning the same parent multiple times:

def deduplicated_parent_retrieval(query, k_children=10, k_parents=3):
    children = vector_store.search(query, k=k_children)
    
    parent_scores = {}
    for child in children:
        pid = child.metadata["parent_id"]
        if pid not in parent_scores:
            parent_scores[pid] = child.score
        else:
            # Several children matched this parent: keep only the best child score
            parent_scores[pid] = max(parent_scores[pid], child.score)
    
    # Return top parents by best child score
    top_parents = sorted(parent_scores.items(), key=lambda x: x[1], reverse=True)
    return [parent_store[pid] for pid, score in top_parents[:k_parents]]
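
The function above ranks each parent by its single best child. If you would rather favor parents with several matching children, one alternative is to sum the child scores. The sketch below assumes scores are similarities (higher is better) and takes plain `(parent_id, score)` pairs to stay self-contained; the function name is hypothetical.

```python
from collections import defaultdict

def rank_parents_by_total_score(children, k_parents=3):
    """Rank parents by the sum of their matching children's scores."""
    totals = defaultdict(float)
    for parent_id, score in children:
        totals[parent_id] += score
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [parent_id for parent_id, _ in ranked[:k_parents]]

# p1 has the single best child, but p2 has two reasonably good ones
hits = [("p1", 0.80), ("p2", 0.60), ("p2", 0.55), ("p3", 0.40)]
ranked = rank_parents_by_total_score(hits, k_parents=2)
```

Here `p2` outranks `p1` even though `p1` holds the highest individual score. Summing can over-reward long parents that simply contain more children, so it is a judgment call, and distance-based scores would need to be converted to similarities first.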

Dynamic Parent Sizing

Use natural document boundaries instead of fixed sizes:

def split_by_structure(document):
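    """Use headings as parent boundaries (paragraph breaks or other markers work the same way).

    Assumes parse_document() yields elements with an is_heading flag and
    join() merges a list of elements into one parent chunk.
    """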
    parents = []
    current_parent = []
    
    for element in parse_document(document):
        if element.is_heading and current_parent:
            parents.append(join(current_parent))
            current_parent = [element]
        else:
            current_parent.append(element)
    
    if current_parent:
        parents.append(join(current_parent))
    
    return parents

Natural boundaries produce more coherent parent chunks than arbitrary size-based splitting.
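
For a concrete case, here is a sketch of the same idea applied to Markdown, where a line starting with `#` is treated as a heading. That heading test is a simplifying assumption (it would misfire on `#` comments inside code blocks), and the function name is hypothetical.

```python
def split_markdown_by_headings(text):
    """Group lines into parent chunks, starting a new chunk at every Markdown heading."""
    parents = []
    current = []
    for line in text.splitlines():
        if line.startswith("#") and current:
            # A new heading closes the chunk collected so far
            parents.append("\n".join(current).strip())
            current = [line]
        else:
            current.append(line)
    if current:
        parents.append("\n".join(current).strip())
    return [p for p in parents if p]

doc = "# Connections\nSet the limit.\n\n# Backups\nRun nightly.\nKeep seven days."
parents = split_markdown_by_headings(doc)
```

Each heading stays attached to the text beneath it, so every parent is a complete section regardless of its length.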

When to Use Parent Document Retrieval

Use it when:

  • Documents have clear hierarchical structure
  • Answers require context beyond the matching passage
  • Your chunks are too small for the LLM to work with
  • You’re getting precise but incomplete answers

Don’t use it when:

  • Documents are very short (the whole document fits in a chunk)
  • Each chunk is self-contained (FAQ entries, glossary definitions)
  • Context window is very limited (parent documents might be too large)
  • You need maximum retrieval precision without context (keyword search use cases)

Parent document retrieval adds little complexity to a RAG system (a document store and a child-to-parent mapping) but often produces noticeably more complete answers. If your RAG system is returning relevant but incomplete information, this is a good first optimization to try.
