🟣 Technical 9 min read

Hybrid Search for RAG: Combining Dense and Sparse Retrieval

Pure semantic search often underperforms in production RAG systems. Hybrid search — combining dense embeddings with sparse retrieval — is the more reliable approach.

View all rag depths →

Most RAG tutorials start with a simple approach: embed your documents, embed the query, find the nearest neighbors by cosine similarity, feed the results to the LLM. It’s a reasonable prototype, but production RAG systems almost universally move to hybrid search — combining dense (embedding-based) retrieval with sparse (keyword-based) retrieval.

This post explains why pure semantic search underperforms, how hybrid search works, and how to implement it.

Dense retrieval using embeddings is powerful for semantic similarity — matching queries to documents that mean the same thing even if they use different words. “What’s the capital of France?” will find a document that says “Paris is France’s capital city” even without exact word overlap.

But semantic search has real failure modes:

Keyword precision: A user searching for “model XR-7 installation” needs the exact product model number. Semantic search may return results about similar products, or about installation generally, rather than the specific XR-7 documentation. The embedding of “XR-7” isn’t meaningfully different from “XR-8” in most embedding spaces.

Technical terms: Acronyms, technical jargon, error codes, part numbers, drug names — these often don’t have strong semantic embeddings because they appear rarely in training data. ENOENT as a query won’t reliably surface documentation about Linux file system errors just from embeddings.

Names: Person names, company names, product names often embed similarly to other names. Searching for “John Martinez pricing proposal” may not reliably surface documents mentioning that specific person.

Exact phrase requirements: Sometimes users want exact match, not semantic similarity. Legal documents, contract clauses, quoted text.

In short: semantic search excels at conceptual similarity; keyword search excels at lexical precision. Real queries often need both.

Sparse retrieval: BM25

The dominant sparse retrieval algorithm is BM25 (Best Match 25) — a refined version of TF-IDF that has dominated information retrieval for decades and remains competitive with modern neural retrievers on many benchmarks.

BM25 scores documents based on:

  • Term frequency (TF): How often query terms appear in a document — but with diminishing returns (saturation)
  • Inverse document frequency (IDF): How rare the term is across all documents — rare terms get more weight
  • Document length normalization: Longer documents don’t get unfairly rewarded just for containing more words
from rank_bm25 import BM25Okapi
import numpy as np

# Prepare your corpus
documents = [
    "XR-7 installation guide for industrial systems",
    "Model XR-8 user manual and setup instructions",
    "General installation best practices for machinery"
]

# Tokenize
tokenized_corpus = [doc.lower().split() for doc in documents]

# Create BM25 index
bm25 = BM25Okapi(tokenized_corpus)

# Search
query = "XR-7 installation"
tokenized_query = query.lower().split()
scores = bm25.get_scores(tokenized_query)

# Scores: XR-7 doc >> XR-8 doc > general doc
# Semantic search might have put general doc higher

BM25 naturally handles the keyword precision cases that semantic search misses. “XR-7” gets high weight because it appears in only one document (high IDF) and appears in the query.

Hybrid search: reciprocal rank fusion

The most common approach to combining dense and sparse retrieval is Reciprocal Rank Fusion (RRF).

The idea: take the ranked lists from dense and sparse retrieval, and combine them by reciprocal rank:

RRF_score(doc) = Σ 1 / (k + rank_in_list)

where k is a constant (typically 60) that reduces the influence of very high ranks.

def reciprocal_rank_fusion(dense_results: list, sparse_results: list, k: int = 60) -> list:
    """
    dense_results: list of (doc_id, score) from embedding search
    sparse_results: list of (doc_id, score) from BM25 search
    Returns: reranked list of (doc_id, combined_score)
    """
    rrf_scores = {}
    
    # Score from dense results
    for rank, (doc_id, _) in enumerate(dense_results):
        if doc_id not in rrf_scores:
            rrf_scores[doc_id] = 0
        rrf_scores[doc_id] += 1 / (k + rank + 1)
    
    # Score from sparse results
    for rank, (doc_id, _) in enumerate(sparse_results):
        if doc_id not in rrf_scores:
            rrf_scores[doc_id] = 0
        rrf_scores[doc_id] += 1 / (k + rank + 1)
    
    # Sort by combined score
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

RRF works well in practice because:

  • It’s rank-based (not score-based), so you don’t need to normalize different score scales
  • It handles the case where one retriever returns a highly relevant document the other missed
  • It’s simple and parameter-light

Full implementation with Qdrant

Modern vector databases like Qdrant have native sparse vector support, enabling efficient hybrid search without maintaining separate indices:

from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, SparseVectorParams, Distance,
    SparseIndexParams, PointStruct, SparseVector
)
import numpy as np
from openai import OpenAI

openai_client = OpenAI()
qdrant = QdrantClient(host="localhost", port=6333)

# Create collection with both dense and sparse vectors
qdrant.create_collection(
    collection_name="documents",
    vectors_config={
        "dense": VectorParams(size=1536, distance=Distance.COSINE)
    },
    sparse_vectors_config={
        "sparse": SparseVectorParams(index=SparseIndexParams())
    }
)

def get_dense_embedding(text: str) -> list:
    response = openai_client.embeddings.create(
        model="text-embedding-3-large",
        input=text
    )
    return response.data[0].embedding

def get_sparse_embedding(text: str) -> dict:
    """Create sparse TF-IDF-like representation"""
    from sklearn.feature_extraction.text import TfidfVectorizer
    # In practice, use a shared vectorizer fit on your full corpus
    vectorizer = TfidfVectorizer()
    # Simplified — in production, use a pre-fit vectorizer
    matrix = vectorizer.fit_transform([text])
    cx = matrix.tocoo()
    return {"indices": cx.col.tolist(), "values": cx.data.tolist()}

def index_document(doc_id: str, text: str):
    dense_vector = get_dense_embedding(text)
    sparse_vector = get_sparse_embedding(text)
    
    qdrant.upsert(
        collection_name="documents",
        points=[
            PointStruct(
                id=doc_id,
                vector={
                    "dense": dense_vector,
                    "sparse": SparseVector(
                        indices=sparse_vector["indices"],
                        values=sparse_vector["values"]
                    )
                },
                payload={"text": text}
            )
        ]
    )

def hybrid_search(query: str, top_k: int = 10) -> list:
    query_dense = get_dense_embedding(query)
    query_sparse = get_sparse_embedding(query)
    
    from qdrant_client.models import Prefetch, FusionQuery, Fusion
    
    results = qdrant.query_points(
        collection_name="documents",
        prefetch=[
            Prefetch(query=query_dense, using="dense", limit=top_k * 2),
            Prefetch(
                query=SparseVector(
                    indices=query_sparse["indices"],
                    values=query_sparse["values"]
                ),
                using="sparse",
                limit=top_k * 2
            )
        ],
        query=FusionQuery(fusion=Fusion.RRF),
        limit=top_k
    )
    
    return [(r.id, r.payload["text"], r.score) for r in results.points]

Adding a reranker

Hybrid search typically retrieves 2-4x more candidates than you need, then applies a cross-encoder reranker to reorder them by deeper relevance assessment.

Cross-encoders (models like Cohere Rerank, BGE Reranker, or ColBERT) process query and document jointly, capturing deeper semantic relationships than bi-encoder embeddings. They’re slower (can’t pre-index), so you use them on the smaller candidate set after initial retrieval.

import cohere

co = cohere.Client("your-api-key")

def rerank(query: str, documents: list, top_n: int = 5) -> list:
    results = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=[doc["text"] for doc in documents],
        top_n=top_n
    )
    
    return [documents[r.index] for r in results.results]

# Pipeline
candidates = hybrid_search(query, top_k=20)  # Retrieve 20
doc_objects = [{"text": text, "id": id_} for id_, text, _ in candidates]
final_results = rerank(query, doc_objects, top_n=5)  # Rerank to 5
# Feed final_results to LLM

The three-stage pipeline (dense retrieval → BM25 → reranking) is the current production standard for RAG retrieval quality.

When to add each component

Always worth having: BM25 alongside dense retrieval. The implementation cost is low; the reliability improvement for keyword-sensitive queries is high.

Add reranking when: You have 100k+ documents, complex queries, or quality is genuinely critical. Reranking adds latency (100-300ms) and cost; measure whether the quality gain justifies it for your use case.

Add query expansion when: Your users use short queries that under-specify their needs. LLM-based query expansion (generate multiple phrasings of the query, search with all of them) can significantly improve recall.

Measuring if it’s working

Before and after hybrid search implementation, measure:

  • Recall@K: Of the relevant documents for a set of test queries, what fraction appear in the top K results?
  • Precision@K: Of the top K results returned, what fraction are actually relevant?
  • MRR (Mean Reciprocal Rank): Average of 1/rank_of_first_relevant_document across queries

Create a test set of 50-100 queries with known relevant documents. Measure these metrics for dense-only, sparse-only, and hybrid. The improvement from hybrid is usually clearest in Recall@K.

The bottom line

Pure semantic search is a starting point, not a destination. For production RAG systems handling diverse query types, hybrid search — dense + BM25 + optional reranking — consistently outperforms either approach alone. The implementation complexity is manageable; the quality improvement is real.

If your RAG system sometimes gives confident answers based on the wrong sources: retrieval quality is where to look first.

Simplify

← RAG for Code: Building Documentation-Aware Developer Tools

Go deeper

Metadata Filtering in RAG: The Most Underrated Retrieval Technique →

Related reads

raghybrid-searchbm25dense-retrievalembeddingsinformation-retrievalreranking

Stay ahead of the AI curve

Weekly insights on AI — explained at the level that's right for you. No hype, no jargon, just what matters.

No spam. Unsubscribe anytime. We respect your inbox.