RAG Document Parsing: Getting Clean Text from Messy Documents

Your RAG system is only as good as its inputs. The most common failure mode isn’t the embedding model, the vector database, or the retrieval algorithm — it’s garbage in the document parsing layer.

Documents come in dozens of formats, each with its own parsing challenges. PDFs with scanned images. PowerPoints with text in shapes. Spreadsheets where structure IS the information. HTML pages with navigation chrome mixed into content.

Getting clean, structured text from these sources is the unglamorous foundation of every working RAG system.

PDF Parsing: The Hardest Easy Problem

PDFs are the most common document format and the hardest to parse well. A PDF is essentially a set of drawing instructions — “put this character at coordinates (x, y)” — not a structured document.

Digital PDFs (Born Digital)

Created from word processors or design tools. Text is extractable but layout must be reconstructed.

Tools:

PyMuPDF (fitz) — fast, good text extraction, handles most digital PDFs well
pdfplumber — excellent for tables and structured layouts
PDFMiner — detailed control over text extraction, good for complex layouts

import pdfplumber

def extract_with_structure(pdf_path):
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Extract text preserving layout
            text = page.extract_text(layout=True)
            
            # Extract tables separately
            tables = page.extract_tables()
            
            pages.append({
                "text": text,
                "tables": tables,
                "page_number": page.page_number
            })
    return pages

Scanned PDFs (Image-Based)

Contain images of text, not actual text. Require OCR.

Pipeline:

Extract images from PDF pages
Run OCR (Tesseract, Google Document AI, Azure Form Recognizer)
Reconstruct document structure from OCR output

import pytesseract
from pdf2image import convert_from_path

def ocr_pdf(pdf_path):
    images = convert_from_path(pdf_path, dpi=300)
    pages = []
    for i, image in enumerate(images):
        text = pytesseract.image_to_string(image)
        pages.append({"text": text, "page": i + 1})
    return pages

Pro tip: use a multimodal LLM for complex scanned documents. Send the page image directly and ask for structured extraction. More expensive but handles messy layouts, handwriting, and mixed content better than traditional OCR.

The Table Problem

Tables are where most PDF parsers fail. The text extraction gives you cells in reading order, losing the row/column structure.

Strategies:

pdfplumber — best open-source table extraction for well-formatted tables
Camelot — specifically designed for PDF table extraction
Multimodal LLM — send a screenshot of the table, ask for markdown or JSON output
Specialized services — AWS Textract, Azure Form Recognizer, Google Document AI

For RAG, convert tables to a format the LLM can reason about:

| Quarter | Revenue | Growth |
|---------|---------|--------|
| Q1 2025 | $4.2M   | 12%    |
| Q2 2025 | $4.8M   | 14%    |

Or natural language: “In Q1 2025, revenue was $4.2M with 12% growth. In Q2 2025, revenue was $4.8M with 14% growth.”

Office Documents

Word Documents (.docx)

Relatively easy. The format is structured XML.

import docx

def parse_docx(path):
    doc = docx.Document(path)
    sections = []
    current_section = {"heading": None, "content": []}
    
    for para in doc.paragraphs:
        if para.style.name.startswith("Heading"):
            if current_section["content"]:
                sections.append(current_section)
            current_section = {"heading": para.text, "content": []}
        else:
            current_section["content"].append(para.text)
    
    sections.append(current_section)
    return sections

Watch for: embedded images (need separate extraction), tracked changes (decide whether to include), headers/footers (often boilerplate), footnotes.

PowerPoint (.pptx)

Text lives in shapes scattered across slides. There’s no linear reading order.

from pptx import Presentation

def parse_pptx(path):
    prs = Presentation(path)
    slides = []
    for slide in prs.slides:
        slide_text = []
        for shape in slide.shapes:
            if shape.has_text_frame:
                for paragraph in shape.text_frame.paragraphs:
                    slide_text.append(paragraph.text)
        slides.append("\n".join(slide_text))
    return slides

Better approach for complex slides: render each slide as an image and use a multimodal LLM to extract content, preserving the visual relationships between text elements, charts, and diagrams.

Spreadsheets (.xlsx)

The structure IS the data. Extracting just the text loses most of the information.

Strategy: convert each sheet (or meaningful range) to a table format that preserves headers and relationships:

import openpyxl

def parse_xlsx(path):
    wb = openpyxl.load_workbook(path)
    sheets = {}
    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
        rows = []
        for row in ws.iter_rows(values_only=True):
            rows.append([str(cell) if cell is not None else "" for cell in row])
        sheets[sheet_name] = rows
    return sheets

For RAG ingestion, convert to markdown tables or natural language descriptions.

Web Pages

Basic Extraction

from bs4 import BeautifulSoup
import requests

def extract_web_content(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Remove navigation, headers, footers, scripts
    for tag in soup.find_all(["nav", "header", "footer", "script", "style", "aside"]):
        tag.decompose()
    
    # Get main content
    main = soup.find("main") or soup.find("article") or soup.find("body")
    return main.get_text(separator="\n", strip=True)

Better tools:

Trafilatura — purpose-built for extracting article content from web pages
Readability — Mozilla’s algorithm for identifying main content
Jina Reader API — web page to clean markdown

Preserving Structure

For RAG, headings and sections matter. Preserve them:

def extract_with_headings(soup):
    sections = []
    current_heading = "Introduction"
    current_content = []
    
    for element in soup.find_all(["h1", "h2", "h3", "h4", "p", "li", "pre"]):
        if element.name.startswith("h"):
            if current_content:
                sections.append({"heading": current_heading, "content": "\n".join(current_content)})
            current_heading = element.get_text(strip=True)
            current_content = []
        else:
            current_content.append(element.get_text(strip=True))
    
    if current_content:
        sections.append({"heading": current_heading, "content": "\n".join(current_content)})
    return sections

Quality Checks

After parsing, validate your output:

Empty content check — did the parser actually extract text?
Character ratio — if >30% of characters are non-alphanumeric, something went wrong
Language detection — is the extracted text in the expected language?
Duplicate detection — headers/footers often repeat on every page
Length sanity — a 50-page PDF should produce more than 100 characters

def quality_check(text, source_pages=None):
    issues = []
    if len(text.strip()) < 100:
        issues.append("extracted_text_too_short")
    
    alnum_ratio = sum(c.isalnum() for c in text) / max(len(text), 1)
    if alnum_ratio < 0.5:
        issues.append("low_alphanumeric_ratio")
    
    # Check for repeated boilerplate
    lines = text.split('\n')
    line_counts = collections.Counter(lines)
    repeated = [line for line, count in line_counts.items() if count > 3 and len(line) > 20]
    if repeated:
        issues.append("repeated_boilerplate_detected")
    
    return issues

The Pipeline

Document → Format Detection → Parser Selection → Raw Extraction 
  → Structure Preservation → Table Handling → Quality Check 
  → Metadata Attachment → Ready for Chunking

Invest in this pipeline. Debug it thoroughly. Log failures. The difference between a RAG system that works and one that doesn’t is usually here — not in the vector database or the embedding model, but in whether the documents were parsed correctly in the first place.

RAG Document Parsing: Getting Clean Text from Messy Documents

RAG Document Parsing: Getting Clean Text from Messy Documents

PDF Parsing: The Hardest Easy Problem

Digital PDFs (Born Digital)

Scanned PDFs (Image-Based)

The Table Problem

Office Documents

Word Documents (.docx)

PowerPoint (.pptx)

Spreadsheets (.xlsx)

Web Pages

Basic Extraction

Preserving Structure

Quality Checks

The Pipeline

Related reads

Stay ahead of the AI curve