RAG Document Parsing: Getting Clean Text from Messy Documents
A practical guide to parsing documents for RAG systems — handling PDFs, slides, spreadsheets, and web pages, with strategies for preserving structure, tables, and images.
Your RAG system is only as good as its inputs. The most common failure mode isn’t the embedding model, the vector database, or the retrieval algorithm — it’s garbage in the document parsing layer.
Documents come in dozens of formats, each with its own parsing challenges. PDFs with scanned images. PowerPoints with text in shapes. Spreadsheets where structure IS the information. HTML pages with navigation chrome mixed into content.
Getting clean, structured text from these sources is the unglamorous foundation of every working RAG system.
PDF Parsing: The Hardest Easy Problem
PDFs are the most common document format and the hardest to parse well. A PDF is essentially a set of drawing instructions — “put this character at coordinates (x, y)” — not a structured document.
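To make the "drawing instructions" point concrete, here is a minimal sketch of what a parser has to do with positioned characters: group them into lines by vertical position, then order each line left to right. It assumes each character is a dict with `text`, `x0`, and `top` keys (the key names pdfplumber uses in `page.chars`); the function name and the `y_tolerance` default are illustrative assumptions, and the sketch ignores word spacing and multi-column layouts.

```python
def chars_to_lines(chars, y_tolerance=3.0):
    """Group positioned characters into text lines (top to bottom, left to right).

    `chars` is a list of dicts with "text", "x0" (left edge) and "top"
    (distance from the top of the page). The tolerance is a guess that
    would need tuning per document.
    """
    lines = []  # each entry: {"top": float, "chars": [...]}
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        # A character joins the current line if it sits at roughly the same height.
        if lines and abs(ch["top"] - lines[-1]["top"]) <= y_tolerance:
            lines[-1]["chars"].append(ch)
        else:
            lines.append({"top": ch["top"], "chars": [ch]})
    return [
        "".join(c["text"] for c in sorted(line["chars"], key=lambda c: c["x0"]))
        for line in lines
    ]
```

Real parsers also have to infer spaces from gaps between characters and detect columns, which is where most of the difficulty lies.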
Digital PDFs (Born Digital)
Created from word processors or design tools. Text is extractable but layout must be reconstructed.
Tools:
- PyMuPDF (fitz) — fast, good text extraction, handles most digital PDFs well
- pdfplumber — excellent for tables and structured layouts
- PDFMiner — detailed control over text extraction, good for complex layouts
import pdfplumber

def extract_with_structure(pdf_path):
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Extract text preserving layout
            text = page.extract_text(layout=True)
            # Extract tables separately
            tables = page.extract_tables()
            pages.append({
                "text": text,
                "tables": tables,
                "page_number": page.page_number
            })
    return pages
Scanned PDFs (Image-Based)
Contain images of text, not actual text. Require OCR.
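Many real PDFs mix both kinds of page, so it can help to decide per page whether OCR is needed. The sketch below is one possible heuristic, not a standard method: it flags pages whose extracted text layer is nearly empty. The function name and the `min_chars` threshold are assumptions; `page_texts` is assumed to be one string (or `None`) per page, for example the `"text"` values produced by a text extractor.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text layer looks empty.

    `min_chars` is an arbitrary cutoff chosen for illustration; a page with
    only a page number or a stray header would still fall below it.
    """
    return [
        page_number
        for page_number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]
```

Only the flagged pages then need to go through the slower OCR path.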
Pipeline:
- Render each PDF page to an image
- Run OCR (Tesseract, Google Document AI, Azure AI Document Intelligence, formerly Form Recognizer)
- Reconstruct document structure from OCR output
import pytesseract
from pdf2image import convert_from_path

def ocr_pdf(pdf_path):
    # Render each page at 300 DPI, then OCR the page image
    images = convert_from_path(pdf_path, dpi=300)
    pages = []
    for i, image in enumerate(images):
        text = pytesseract.image_to_string(image)
        pages.append({"text": text, "page": i + 1})
    return pages
Pro tip: use a multimodal LLM for complex scanned documents. Send the page image directly and ask for structured extraction. More expensive but handles messy layouts, handwriting, and mixed content better than traditional OCR.
The Table Problem
Tables are where most PDF parsers fail. The text extraction gives you cells in reading order, losing the row/column structure.
Strategies:
- pdfplumber — best open-source table extraction for well-formatted tables
- Camelot — specifically designed for PDF table extraction
- Multimodal LLM — send a screenshot of the table, ask for markdown or JSON output
- Specialized services: AWS Textract, Azure AI Document Intelligence (formerly Form Recognizer), Google Document AI
For RAG, convert tables to a format the LLM can reason about:
| Quarter | Revenue | Growth |
|---------|---------|--------|
| Q1 2025 | $4.2M | 12% |
| Q2 2025 | $4.8M | 14% |
Or natural language: “In Q1 2025, revenue was $4.2M with 12% growth. In Q2 2025, revenue was $4.8M with 14% growth.”
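The markdown conversion can be sketched as below, assuming the table arrives as a list of rows with the header first and `None` for empty cells, which is the shape pdfplumber's `extract_tables()` returns for each table. The function name is hypothetical, and treating the first row as the header is an assumption that will not hold for every table.

```python
def table_to_markdown(rows):
    """Render a table (list of rows, first row assumed to be the header) as markdown."""
    def clean(cell):
        # None becomes an empty cell; newlines and pipes would break the row.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows so every line has the same number of columns.
    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, body = padded[0], padded[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join(" --- " for _ in header) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

Keeping the header row attached to every chunk that contains part of the table matters later, because a row of numbers without column names is hard for the LLM to interpret.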
Office Documents
Word Documents (.docx)
Relatively easy. A .docx file is a ZIP archive of structured XML, so paragraphs and their styles can be read directly.
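import docx

def parse_docx(path):
    doc = docx.Document(path)
    sections = []
    current_section = {"heading": None, "content": []}
    for para in doc.paragraphs:
        # Built-in heading styles are named "Heading 1", "Heading 2", ...
        if para.style.name.startswith("Heading"):
            if current_section["content"]:
                sections.append(current_section)
            current_section = {"heading": para.text, "content": []}
        elif para.text.strip():
            # Skip empty paragraphs used only for spacing
            current_section["content"].append(para.text)
    # Flush the last section, using the same rule as inside the loop
    if current_section["content"]:
        sections.append(current_section)
    return sections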
Watch for: embedded images (need separate extraction), tracked changes (decide whether to include), headers/footers (often boilerplate), footnotes.
PowerPoint (.pptx)
Text lives in shapes scattered across slides. There’s no linear reading order.
from pptx import Presentation

def parse_pptx(path):
    prs = Presentation(path)
    slides = []
    for slide in prs.slides:
        slide_text = []
        for shape in slide.shapes:
            # Only shapes with a text frame carry paragraphs
            if shape.has_text_frame:
                for paragraph in shape.text_frame.paragraphs:
                    slide_text.append(paragraph.text)
        slides.append("\n".join(slide_text))
    return slides
Better approach for complex slides: render each slide as an image and use a multimodal LLM to extract content, preserving the visual relationships between text elements, charts, and diagrams.
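For simpler slides, a cheaper approximation of reading order is to sort shapes by position before collecting their text. The sketch below assumes each shape exposes `top` and `left` attributes in EMU, as python-pptx shapes do (these can be `None` for placeholders that inherit their position from the layout). The function name and the `row_tolerance` band height are assumptions, and the banding is deliberately crude.

```python
def shapes_in_reading_order(shapes, row_tolerance=200_000):
    """Sort shapes top-to-bottom, then left-to-right within a horizontal band.

    Positions are in EMU (914,400 EMU = 1 inch). Shapes whose `top` values
    fall into the same band of height `row_tolerance` are treated as one row.
    Two shapes just either side of a band boundary will be split, so this is
    an approximation rather than true layout analysis.
    """
    def position(shape):
        top = shape.top or 0    # inherited positions come back as None
        left = shape.left or 0
        return (top // row_tolerance, left)

    return sorted(shapes, key=position)
```

In `parse_pptx`, this would mean iterating over `shapes_in_reading_order(slide.shapes)` instead of `slide.shapes`.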
Spreadsheets (.xlsx)
The structure IS the data. Extracting just the text loses most of the information.
Strategy: convert each sheet (or meaningful range) to a table format that preserves headers and relationships:
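import openpyxl

def parse_xlsx(path):
    # data_only=True returns the values Excel last computed for formula cells;
    # without it, formula cells come back as strings like "=SUM(A1:A3)"
    wb = openpyxl.load_workbook(path, data_only=True)
    sheets = {}
    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
        rows = []
        for row in ws.iter_rows(values_only=True):
            rows.append([str(cell) if cell is not None else "" for cell in row])
        sheets[sheet_name] = rows
    return sheets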
For RAG ingestion, convert to markdown tables or natural language descriptions.
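One way to produce the natural-language form is to emit one self-describing record per data row, so each row still carries its column names after chunking. This sketch assumes rows in the shape `parse_xlsx` returns (lists of strings, `""` for empty cells) and that the first non-empty row is the header; the function name and the output format are illustrative choices.

```python
def rows_to_records(sheet_name, rows):
    """Turn sheet rows into one 'Header: value; ...' string per data row.

    Assumes the first non-empty row holds the column headers, which is a
    guess that fails for sheets with title rows or multiple tables.
    """
    # Drop rows that are entirely blank.
    rows = [row for row in rows if any(cell.strip() for cell in row)]
    if len(rows) < 2:
        return []
    header, *data = rows
    records = []
    for row in data:
        # Skip cells with no header or no value so records stay compact.
        pairs = [f"{name}: {value}" for name, value in zip(header, row) if name and value]
        records.append(f"[{sheet_name}] " + "; ".join(pairs))
    return records
```

Prefixing each record with the sheet name keeps rows from different sheets distinguishable once they are embedded separately.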
Web Pages
Basic Extraction
from bs4 import BeautifulSoup
import requests

def extract_web_content(url):
    # A timeout keeps a stalled server from hanging the pipeline
    response = requests.get(url, timeout=30)
    # Fail loudly on 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Remove navigation, headers, footers, scripts
    for tag in soup.find_all(["nav", "header", "footer", "script", "style", "aside"]):
        tag.decompose()
    # Get main content
    main = soup.find("main") or soup.find("article") or soup.find("body")
    return main.get_text(separator="\n", strip=True)
Better tools:
- Trafilatura — purpose-built for extracting article content from web pages
- Readability — Mozilla’s algorithm for identifying main content
- Jina Reader API — web page to clean markdown
Preserving Structure
For RAG, headings and sections matter. Preserve them:
def extract_with_headings(soup):
    sections = []
    current_heading = "Introduction"
    current_content = []
    for element in soup.find_all(["h1", "h2", "h3", "h4", "p", "li", "pre"]):
        # Of the tags searched for, only h1-h4 start with "h"
        if element.name.startswith("h"):
            if current_content:
                sections.append({"heading": current_heading, "content": "\n".join(current_content)})
            current_heading = element.get_text(strip=True)
            current_content = []
        else:
            current_content.append(element.get_text(strip=True))
    if current_content:
        sections.append({"heading": current_heading, "content": "\n".join(current_content)})
    return sections
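The function above keeps only the nearest heading. A possible extension, sketched here under assumptions, tracks the full heading hierarchy so each piece of content carries a path such as "Guide > Install > Linux". It takes plain `(tag_name, text)` pairs in document order rather than BeautifulSoup objects; the function name and the `" > "` separator are illustrative.

```python
def attach_heading_paths(elements):
    """Attach the full heading path to each non-heading element.

    `elements` is a list of (tag_name, text) pairs in document order,
    e.g. ("h2", "Install") or ("p", "Run the installer.").
    """
    stack = []   # (level, heading text) for the currently open headings
    output = []
    for tag, text in elements:
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            level = int(tag[1])
            # A new heading closes any open heading at the same or deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, text))
        else:
            path = " > ".join(heading for _, heading in stack)
            output.append({"path": path, "content": text})
    return output
```

With BeautifulSoup, the input pairs could be built as `[(el.name, el.get_text(strip=True)) for el in soup.find_all(["h1", "h2", "h3", "h4", "p", "li", "pre"])]`.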
Quality Checks
After parsing, validate your output:
- Empty content check — did the parser actually extract text?
- Character ratio: if fewer than half of the characters are alphanumeric, something probably went wrong (the check below counts whitespace and punctuation as non-alphanumeric)
- Language detection — is the extracted text in the expected language?
- Duplicate detection — headers/footers often repeat on every page
- Length sanity — a 50-page PDF should produce more than 100 characters
import collections

def quality_check(text, source_pages=None):
    issues = []
    if len(text.strip()) < 100:
        issues.append("extracted_text_too_short")
    # Share of alphanumeric characters; whitespace and punctuation count against it
    alnum_ratio = sum(c.isalnum() for c in text) / max(len(text), 1)
    if alnum_ratio < 0.5:
        issues.append("low_alphanumeric_ratio")
    # Check for repeated boilerplate
    lines = text.split('\n')
    line_counts = collections.Counter(lines)
    repeated = [line for line, count in line_counts.items() if count > 3 and len(line) > 20]
    if repeated:
        issues.append("repeated_boilerplate_detected")
    return issues
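Detecting repeated boilerplate is only half the job; removing it is the other half. A minimal sketch, assuming the text is available as one string per page: any line that appears on a large share of pages is treated as a header or footer and dropped. The function name and both thresholds are assumptions to tune, and lines that vary per page (such as "Page 3 of 10") will not be caught.

```python
from collections import Counter

def strip_repeated_lines(pages, min_fraction=0.6, min_pages=3):
    """Drop lines that repeat on most pages (likely headers or footers).

    `pages` is a list of page texts. With fewer than `min_pages` pages there
    is not enough evidence, so the input is returned unchanged.
    """
    if len(pages) < min_pages:
        return pages
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    boilerplate = {
        line for line, n in counts.items() if n / len(pages) >= min_fraction
    }
    return [
        "\n".join(line for line in page.splitlines() if line.strip() not in boilerplate)
        for page in pages
    ]
```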
The Pipeline
Document → Format Detection → Parser Selection → Raw Extraction
→ Structure Preservation → Table Handling → Quality Check
→ Metadata Attachment → Ready for Chunking
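The first two stages, format detection and parser selection, can be sketched as a small dispatcher. This is an illustration under assumptions rather than a complete implementation: `detect_format`, `parse_document`, and the `parsers` mapping are hypothetical names, detection covers only the formats discussed here, and it relies on PDFs normally starting with `%PDF-` and Office Open XML files being ZIP archives that start with `PK\x03\x04`.

```python
from pathlib import Path

OOXML_EXTENSIONS = {".docx": "docx", ".pptx": "pptx", ".xlsx": "xlsx"}

def detect_format(path):
    """Guess a document format from its leading bytes and file extension."""
    path = Path(path)
    with open(path, "rb") as f:
        head = f.read(8)
    suffix = path.suffix.lower()
    # Trust the file signature over the extension where one exists.
    if head.startswith(b"%PDF-"):
        return "pdf"
    # .docx/.pptx/.xlsx are all ZIP containers, so the extension disambiguates.
    if head.startswith(b"PK\x03\x04") and suffix in OOXML_EXTENSIONS:
        return OOXML_EXTENSIONS[suffix]
    if suffix in {".html", ".htm"}:
        return "html"
    return "unknown"

def parse_document(path, parsers):
    """Route a file to the parser registered for its detected format.

    `parsers` maps a format name to a callable taking the path.
    """
    fmt = detect_format(path)
    parser = parsers.get(fmt)
    if parser is None:
        # Surface unsupported files instead of silently skipping them.
        raise ValueError(f"no parser registered for format {fmt!r}: {path}")
    return {"format": fmt, "source": str(path), "content": parser(path)}
```

A call might look like `parse_document("report.pdf", {"pdf": extract_with_structure, "docx": parse_docx})`, with the raised `ValueError` logged as a parsing failure.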
Invest in this pipeline. Debug it thoroughly. Log failures. The difference between a RAG system that works and one that doesn’t is usually here — not in the vector database or the embedding model, but in whether the documents were parsed correctly in the first place.