Multimodal Search: Finding Content Across Text, Images, and Audio

Traditional search works within a modality: you search text with text queries, find images by their text metadata, locate audio by filename or tag. Multimodal search breaks this constraint — finding images by describing them in natural language, finding audio clips by describing their content, or using an image to find related images without any text at all.

This is increasingly practical to build, and the applications are commercially significant: product search by image, video content discovery, medical image retrieval, music discovery by mood description.

The foundational idea: a shared embedding space

Multimodal search works by projecting different modalities into a shared embedding space where semantic similarity is meaningful across modalities.

Concretely: if you encode “a golden retriever playing fetch on a beach” (text) and an actual photo of that scene (image), a good multimodal model will produce embeddings that are close to each other in vector space. You can then search images using a text query — find images whose embeddings are nearest to your query embedding.

The key insight: semantic meaning can be shared across modalities even though the raw data looks completely different. The word “beach” and a photograph of a beach share something, and that something can be represented in a common vector space.

CLIP: the model that made this practical

CLIP (Contrastive Language-Image Pre-training), from OpenAI (2021), is the most important development in multimodal search. CLIP is trained on hundreds of millions of image-text pairs from the internet using a contrastive objective:

Push together the embeddings of matching image-text pairs
Push apart the embeddings of non-matching pairs
Do this simultaneously for a batch of many pairs

The result: an image encoder and a text encoder that share a representation space. Text and image embeddings from CLIP are genuinely comparable — you can compute cosine similarity between a text embedding and an image embedding to measure semantic relevance.

CLIP’s practical impact was immediate: suddenly text-to-image search was achievable with reasonable quality, without labeling millions of images with text.

Building text-to-image search

Here’s the basic architecture:

Step 1: Embed your image library

from transformers import CLIPProcessor, CLIPModel
import torch
from PIL import Image
import numpy as np

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def embed_image(image_path: str) -> np.ndarray:
    image = Image.open(image_path)
    inputs = processor(images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        embedding = model.get_image_features(**inputs)
    # Normalize to unit sphere (important for cosine similarity)
    embedding = embedding / embedding.norm(dim=-1, keepdim=True)
    return embedding.numpy().squeeze()

# Embed all images in your library
embeddings = {}
for image_path in your_image_paths:
    embeddings[image_path] = embed_image(image_path)

Step 2: Index in a vector store

import qdrant_client
from qdrant_client.models import VectorParams, Distance

client = qdrant_client.QdrantClient(host="localhost", port=6333)

# Create collection
client.create_collection(
    collection_name="images",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE)
)

# Insert embeddings
client.upsert(
    collection_name="images",
    points=[
        {"id": i, "vector": emb.tolist(), "payload": {"path": path}}
        for i, (path, emb) in enumerate(embeddings.items())
    ]
)

Step 3: Search with text queries

def search_images(query: str, top_k: int = 10) -> list:
    # Embed the text query
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_embedding = model.get_text_features(**inputs)
    text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
    
    # Search vector store
    results = client.search(
        collection_name="images",
        query_vector=text_embedding.numpy().squeeze().tolist(),
        limit=top_k
    )
    
    return [r.payload["path"] for r in results]

# Use it
results = search_images("sunset over ocean with dramatic clouds")

This is the complete core of a text-to-image search system. Production systems add caching, batching, filtering by metadata, and UI — but the core is this simple.

Image-to-image search

Since CLIP produces comparable embeddings for images, you can also search by image: “find me images similar to this one.”

def find_similar_images(reference_image_path: str, top_k: int = 10) -> list:
    query_embedding = embed_image(reference_image_path)
    
    results = client.search(
        collection_name="images",
        query_vector=query_embedding.tolist(),
        limit=top_k + 1  # +1 to exclude the reference itself
    )
    
    # Filter out the exact match
    return [r.payload["path"] for r in results if r.payload["path"] != reference_image_path]

Applications: product search by photo, reverse image search, finding near-duplicates in a large library, style matching.

Audio multimodal search

Audio multimodal search is less standardized than image-text search, but the pattern is the same: you need audio and text (or audio and audio) in a shared embedding space.

Approach 1: Transcription-based Transcribe audio to text, then use standard text embeddings. Simple, works well for spoken content (podcasts, meetings, lectures). Fails for music or non-speech audio.

Approach 2: Audio encoders with cross-modal training Models like CLAP (Contrastive Language-Audio Pre-training) apply the CLIP approach to audio: train an audio encoder and text encoder jointly so that audio clips and text descriptions share an embedding space.

# Using CLAP for audio-text search
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

def embed_audio(audio_path: str) -> np.ndarray:
    import torchaudio
    waveform, sample_rate = torchaudio.load(audio_path)
    inputs = processor(audios=waveform, return_tensors="pt", sampling_rate=sample_rate)
    with torch.no_grad():
        embedding = model.get_audio_features(**inputs)
    return (embedding / embedding.norm(dim=-1, keepdim=True)).numpy().squeeze()

def embed_text_for_audio(text: str) -> np.ndarray:
    inputs = processor(text=[text], return_tensors="pt")
    with torch.no_grad():
        embedding = model.get_text_features(**inputs)
    return (embedding / embedding.norm(dim=-1, keepdim=True)).numpy().squeeze()

With CLAP embeddings, you can search audio by text description: “upbeat jazz piano,” “dog barking,” “heavy rain on a rooftop.”

Video search

Video search combines the challenges of both images and audio. Approaches:

Frame-level search: Embed individual frames (or sampled frames) with an image encoder. Search returns specific frames. This finds visual moments but misses temporal context.

Transcript-based: Transcribe audio, chunk with timestamps, embed and index. Search returns timestamped transcript segments. Works well for spoken content, misses visual information.

Multimodal per-segment: Divide video into short segments, embed the combination of frames + transcript for each segment, search over segments. Best quality but compute-intensive.

For most production use cases, combining transcript search (high recall for spoken content) with frame-level visual search (for visual queries) and merging results gives the best practical result.

The commercial embedding APIs

Building with raw CLIP or CLAP gives you maximum flexibility. For many applications, cloud APIs are faster to production:

OpenAI embeddings (text-only): text-embedding-3-large is strong for text search. Not multimodal.

Google’s Vertex AI multimodal embeddings: Produces aligned text + image (and video) embeddings from a single API. Drop-in for CLIP-style search with no model hosting.

Amazon Titan Multimodal Embeddings: Similar to Google’s, integrated with AWS ecosystems.

Cohere Embed: Strong text embeddings with multilingual support; not natively multimodal.

For prototyping, hosted APIs reduce infrastructure burden significantly. At scale, self-hosting open models becomes economical.

What to watch for in production

Embedding drift: If your embedding model is updated, embeddings in your index become incompatible with new queries. Plan for re-indexing when models change.

Domain mismatch: CLIP was trained on internet images. For specialized domains (medical imaging, satellite imagery, industrial inspection), domain-specific fine-tuning or specialized encoders perform substantially better.

Scale and cost: Embedding a library of 10M images takes meaningful compute. Batch efficiently; use GPU acceleration; cache embeddings.

Hybrid search: For most real applications, pure semantic search underperforms a hybrid approach combining dense (embedding) and sparse (keyword/metadata) search. A product with “Sony WH-1000XM5” in the query should match on that exact model name, not just semantic proximity.

The practical upshot

Multimodal search is now buildable in a weekend with open-source tools. CLIP for text-to-image is mature, well-documented, and produces genuinely useful search quality on general-domain content. CLAP for audio-text is less mature but functional.

The most impactful applications in 2026: e-commerce product search by photo, enterprise media library search, medical image retrieval, and video content navigation. Each is commercially significant and now technically accessible without research-team resources.