Multimodal Search: Finding Content Across Text, Images, and Audio
Multimodal search lets you find images with text queries, match audio to descriptions, and bridge modalities. Here's how it works and how to build it.
View all multimodal ai depths →Depth ladder for this topic:
Traditional search works within a modality: you search text with text queries, find images by their text metadata, locate audio by filename or tag. Multimodal search breaks this constraint — finding images by describing them in natural language, finding audio clips by describing their content, or using an image to find related images without any text at all.
This is increasingly practical to build, and the applications are commercially significant: product search by image, video content discovery, medical image retrieval, music discovery by mood description.
The foundational idea: a shared embedding space
Multimodal search works by projecting different modalities into a shared embedding space where semantic similarity is meaningful across modalities.
Concretely: if you encode “a golden retriever playing fetch on a beach” (text) and an actual photo of that scene (image), a good multimodal model will produce embeddings that are close to each other in vector space. You can then search images using a text query — find images whose embeddings are nearest to your query embedding.
The key insight: semantic meaning can be shared across modalities even though the raw data looks completely different. The word “beach” and a photograph of a beach share something, and that something can be represented in a common vector space.
CLIP: the model that made this practical
CLIP (Contrastive Language-Image Pre-training), from OpenAI (2021), is the most important development in multimodal search. CLIP is trained on hundreds of millions of image-text pairs from the internet using a contrastive objective:
- Push together the embeddings of matching image-text pairs
- Push apart the embeddings of non-matching pairs
- Do this simultaneously for a batch of many pairs
The result: an image encoder and a text encoder that share a representation space. Text and image embeddings from CLIP are genuinely comparable — you can compute cosine similarity between a text embedding and an image embedding to measure semantic relevance.
CLIP’s practical impact was immediate: suddenly text-to-image search was achievable with reasonable quality, without labeling millions of images with text.
Building text-to-image search
Here’s the basic architecture:
Step 1: Embed your image library
from transformers import CLIPProcessor, CLIPModel
import torch
from PIL import Image
import numpy as np
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
def embed_image(image_path: str) -> np.ndarray:
image = Image.open(image_path)
inputs = processor(images=image, return_tensors="pt", padding=True)
with torch.no_grad():
embedding = model.get_image_features(**inputs)
# Normalize to unit sphere (important for cosine similarity)
embedding = embedding / embedding.norm(dim=-1, keepdim=True)
return embedding.numpy().squeeze()
# Embed all images in your library
embeddings = {}
for image_path in your_image_paths:
embeddings[image_path] = embed_image(image_path)
Step 2: Index in a vector store
import qdrant_client
from qdrant_client.models import VectorParams, Distance
client = qdrant_client.QdrantClient(host="localhost", port=6333)
# Create collection
client.create_collection(
collection_name="images",
vectors_config=VectorParams(size=768, distance=Distance.COSINE)
)
# Insert embeddings
client.upsert(
collection_name="images",
points=[
{"id": i, "vector": emb.tolist(), "payload": {"path": path}}
for i, (path, emb) in enumerate(embeddings.items())
]
)
Step 3: Search with text queries
def search_images(query: str, top_k: int = 10) -> list:
# Embed the text query
inputs = processor(text=[query], return_tensors="pt", padding=True)
with torch.no_grad():
text_embedding = model.get_text_features(**inputs)
text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
# Search vector store
results = client.search(
collection_name="images",
query_vector=text_embedding.numpy().squeeze().tolist(),
limit=top_k
)
return [r.payload["path"] for r in results]
# Use it
results = search_images("sunset over ocean with dramatic clouds")
This is the complete core of a text-to-image search system. Production systems add caching, batching, filtering by metadata, and UI — but the core is this simple.
Image-to-image search
Since CLIP produces comparable embeddings for images, you can also search by image: “find me images similar to this one.”
def find_similar_images(reference_image_path: str, top_k: int = 10) -> list:
query_embedding = embed_image(reference_image_path)
results = client.search(
collection_name="images",
query_vector=query_embedding.tolist(),
limit=top_k + 1 # +1 to exclude the reference itself
)
# Filter out the exact match
return [r.payload["path"] for r in results if r.payload["path"] != reference_image_path]
Applications: product search by photo, reverse image search, finding near-duplicates in a large library, style matching.
Audio multimodal search
Audio multimodal search is less standardized than image-text search, but the pattern is the same: you need audio and text (or audio and audio) in a shared embedding space.
Approach 1: Transcription-based Transcribe audio to text, then use standard text embeddings. Simple, works well for spoken content (podcasts, meetings, lectures). Fails for music or non-speech audio.
Approach 2: Audio encoders with cross-modal training Models like CLAP (Contrastive Language-Audio Pre-training) apply the CLIP approach to audio: train an audio encoder and text encoder jointly so that audio clips and text descriptions share an embedding space.
# Using CLAP for audio-text search
from transformers import ClapModel, ClapProcessor
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
def embed_audio(audio_path: str) -> np.ndarray:
import torchaudio
waveform, sample_rate = torchaudio.load(audio_path)
inputs = processor(audios=waveform, return_tensors="pt", sampling_rate=sample_rate)
with torch.no_grad():
embedding = model.get_audio_features(**inputs)
return (embedding / embedding.norm(dim=-1, keepdim=True)).numpy().squeeze()
def embed_text_for_audio(text: str) -> np.ndarray:
inputs = processor(text=[text], return_tensors="pt")
with torch.no_grad():
embedding = model.get_text_features(**inputs)
return (embedding / embedding.norm(dim=-1, keepdim=True)).numpy().squeeze()
With CLAP embeddings, you can search audio by text description: “upbeat jazz piano,” “dog barking,” “heavy rain on a rooftop.”
Video search
Video search combines the challenges of both images and audio. Approaches:
Frame-level search: Embed individual frames (or sampled frames) with an image encoder. Search returns specific frames. This finds visual moments but misses temporal context.
Transcript-based: Transcribe audio, chunk with timestamps, embed and index. Search returns timestamped transcript segments. Works well for spoken content, misses visual information.
Multimodal per-segment: Divide video into short segments, embed the combination of frames + transcript for each segment, search over segments. Best quality but compute-intensive.
For most production use cases, combining transcript search (high recall for spoken content) with frame-level visual search (for visual queries) and merging results gives the best practical result.
The commercial embedding APIs
Building with raw CLIP or CLAP gives you maximum flexibility. For many applications, cloud APIs are faster to production:
OpenAI embeddings (text-only): text-embedding-3-large is strong for text search. Not multimodal.
Google’s Vertex AI multimodal embeddings: Produces aligned text + image (and video) embeddings from a single API. Drop-in for CLIP-style search with no model hosting.
Amazon Titan Multimodal Embeddings: Similar to Google’s, integrated with AWS ecosystems.
Cohere Embed: Strong text embeddings with multilingual support; not natively multimodal.
For prototyping, hosted APIs reduce infrastructure burden significantly. At scale, self-hosting open models becomes economical.
What to watch for in production
Embedding drift: If your embedding model is updated, embeddings in your index become incompatible with new queries. Plan for re-indexing when models change.
Domain mismatch: CLIP was trained on internet images. For specialized domains (medical imaging, satellite imagery, industrial inspection), domain-specific fine-tuning or specialized encoders perform substantially better.
Scale and cost: Embedding a library of 10M images takes meaningful compute. Batch efficiently; use GPU acceleration; cache embeddings.
Hybrid search: For most real applications, pure semantic search underperforms a hybrid approach combining dense (embedding) and sparse (keyword/metadata) search. A product with “Sony WH-1000XM5” in the query should match on that exact model name, not just semantic proximity.
The practical upshot
Multimodal search is now buildable in a weekend with open-source tools. CLIP for text-to-image is mature, well-documented, and produces genuinely useful search quality on general-domain content. CLAP for audio-text is less mature but functional.
The most impactful applications in 2026: e-commerce product search by photo, enterprise media library search, medical image retrieval, and video content navigation. Each is commercially significant and now technically accessible without research-team resources.
Simplify
← Multimodal Search Systems: Finding Anything with AI
Go deeper
Multimodal AI for Sensor Fusion Products →
Related reads
Stay ahead of the AI curve
Weekly insights on AI — explained at the level that's right for you. No hype, no jargon, just what matters.
No spam. Unsubscribe anytime. We respect your inbox.