🟣 Technical 10 min read

RAG Document Parsing: Getting Clean Text from Messy Documents

A practical guide to parsing documents for RAG systems — handling PDFs, slides, spreadsheets, and web pages, with strategies for preserving structure, tables, and images.

View all rag depths →

RAG Document Parsing: Getting Clean Text from Messy Documents

Your RAG system is only as good as its inputs. The most common failure mode isn’t the embedding model, the vector database, or the retrieval algorithm — it’s garbage in the document parsing layer.

Documents come in dozens of formats, each with its own parsing challenges. PDFs with scanned images. PowerPoints with text in shapes. Spreadsheets where structure IS the information. HTML pages with navigation chrome mixed into content.

Getting clean, structured text from these sources is the unglamorous foundation of every working RAG system.

PDF Parsing: The Hardest Easy Problem

PDFs are the most common document format and the hardest to parse well. A PDF is essentially a set of drawing instructions — “put this character at coordinates (x, y)” — not a structured document.

Digital PDFs (Born Digital)

Created from word processors or design tools. Text is extractable but layout must be reconstructed.

Tools:

  • PyMuPDF (fitz) — fast, good text extraction, handles most digital PDFs well
  • pdfplumber — excellent for tables and structured layouts
  • PDFMiner — detailed control over text extraction, good for complex layouts
import pdfplumber

def extract_with_structure(pdf_path):
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Extract text preserving layout
            text = page.extract_text(layout=True)
            
            # Extract tables separately
            tables = page.extract_tables()
            
            pages.append({
                "text": text,
                "tables": tables,
                "page_number": page.page_number
            })
    return pages

Scanned PDFs (Image-Based)

Contain images of text, not actual text. Require OCR.

Pipeline:

  1. Extract images from PDF pages
  2. Run OCR (Tesseract, Google Document AI, Azure Form Recognizer)
  3. Reconstruct document structure from OCR output
import pytesseract
from pdf2image import convert_from_path

def ocr_pdf(pdf_path):
    images = convert_from_path(pdf_path, dpi=300)
    pages = []
    for i, image in enumerate(images):
        text = pytesseract.image_to_string(image)
        pages.append({"text": text, "page": i + 1})
    return pages

Pro tip: use a multimodal LLM for complex scanned documents. Send the page image directly and ask for structured extraction. More expensive but handles messy layouts, handwriting, and mixed content better than traditional OCR.

The Table Problem

Tables are where most PDF parsers fail. The text extraction gives you cells in reading order, losing the row/column structure.

Strategies:

  • pdfplumber — best open-source table extraction for well-formatted tables
  • Camelot — specifically designed for PDF table extraction
  • Multimodal LLM — send a screenshot of the table, ask for markdown or JSON output
  • Specialized services — AWS Textract, Azure Form Recognizer, Google Document AI

For RAG, convert tables to a format the LLM can reason about:

| Quarter | Revenue | Growth |
|---------|---------|--------|
| Q1 2025 | $4.2M   | 12%    |
| Q2 2025 | $4.8M   | 14%    |

Or natural language: “In Q1 2025, revenue was $4.2M with 12% growth. In Q2 2025, revenue was $4.8M with 14% growth.”

Office Documents

Word Documents (.docx)

Relatively easy. The format is structured XML.

import docx

def parse_docx(path):
    doc = docx.Document(path)
    sections = []
    current_section = {"heading": None, "content": []}
    
    for para in doc.paragraphs:
        if para.style.name.startswith("Heading"):
            if current_section["content"]:
                sections.append(current_section)
            current_section = {"heading": para.text, "content": []}
        else:
            current_section["content"].append(para.text)
    
    sections.append(current_section)
    return sections

Watch for: embedded images (need separate extraction), tracked changes (decide whether to include), headers/footers (often boilerplate), footnotes.

PowerPoint (.pptx)

Text lives in shapes scattered across slides. There’s no linear reading order.

from pptx import Presentation

def parse_pptx(path):
    prs = Presentation(path)
    slides = []
    for slide in prs.slides:
        slide_text = []
        for shape in slide.shapes:
            if shape.has_text_frame:
                for paragraph in shape.text_frame.paragraphs:
                    slide_text.append(paragraph.text)
        slides.append("\n".join(slide_text))
    return slides

Better approach for complex slides: render each slide as an image and use a multimodal LLM to extract content, preserving the visual relationships between text elements, charts, and diagrams.

Spreadsheets (.xlsx)

The structure IS the data. Extracting just the text loses most of the information.

Strategy: convert each sheet (or meaningful range) to a table format that preserves headers and relationships:

import openpyxl

def parse_xlsx(path):
    wb = openpyxl.load_workbook(path)
    sheets = {}
    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
        rows = []
        for row in ws.iter_rows(values_only=True):
            rows.append([str(cell) if cell is not None else "" for cell in row])
        sheets[sheet_name] = rows
    return sheets

For RAG ingestion, convert to markdown tables or natural language descriptions.

Web Pages

Basic Extraction

from bs4 import BeautifulSoup
import requests

def extract_web_content(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Remove navigation, headers, footers, scripts
    for tag in soup.find_all(["nav", "header", "footer", "script", "style", "aside"]):
        tag.decompose()
    
    # Get main content
    main = soup.find("main") or soup.find("article") or soup.find("body")
    return main.get_text(separator="\n", strip=True)

Better tools:

  • Trafilatura — purpose-built for extracting article content from web pages
  • Readability — Mozilla’s algorithm for identifying main content
  • Jina Reader API — web page to clean markdown

Preserving Structure

For RAG, headings and sections matter. Preserve them:

def extract_with_headings(soup):
    sections = []
    current_heading = "Introduction"
    current_content = []
    
    for element in soup.find_all(["h1", "h2", "h3", "h4", "p", "li", "pre"]):
        if element.name.startswith("h"):
            if current_content:
                sections.append({"heading": current_heading, "content": "\n".join(current_content)})
            current_heading = element.get_text(strip=True)
            current_content = []
        else:
            current_content.append(element.get_text(strip=True))
    
    if current_content:
        sections.append({"heading": current_heading, "content": "\n".join(current_content)})
    return sections

Quality Checks

After parsing, validate your output:

  1. Empty content check — did the parser actually extract text?
  2. Character ratio — if >30% of characters are non-alphanumeric, something went wrong
  3. Language detection — is the extracted text in the expected language?
  4. Duplicate detection — headers/footers often repeat on every page
  5. Length sanity — a 50-page PDF should produce more than 100 characters
def quality_check(text, source_pages=None):
    issues = []
    if len(text.strip()) < 100:
        issues.append("extracted_text_too_short")
    
    alnum_ratio = sum(c.isalnum() for c in text) / max(len(text), 1)
    if alnum_ratio < 0.5:
        issues.append("low_alphanumeric_ratio")
    
    # Check for repeated boilerplate
    lines = text.split('\n')
    line_counts = collections.Counter(lines)
    repeated = [line for line, count in line_counts.items() if count > 3 and len(line) > 20]
    if repeated:
        issues.append("repeated_boilerplate_detected")
    
    return issues

The Pipeline

Document → Format Detection → Parser Selection → Raw Extraction 
  → Structure Preservation → Table Handling → Quality Check 
  → Metadata Attachment → Ready for Chunking

Invest in this pipeline. Debug it thoroughly. Log failures. The difference between a RAG system that works and one that doesn’t is usually here — not in the vector database or the embedding model, but in whether the documents were parsed correctly in the first place.

Simplify

← RAG for Code: Building Documentation-Aware Developer Tools

Go deeper

Evaluating RAG Systems: How to Know If Your Pipeline Is Actually Working →

Related reads

ragdocument-parsingingestionpreprocessing

Stay ahead of the AI curve

Weekly insights on AI — explained at the level that's right for you. No hype, no jargon, just what matters.

No spam. Unsubscribe anytime. We respect your inbox.