PDF/DOCX Ingestion Pipelines

Ingestion is the first deterministic stage of Parsing & Extraction Workflows: the seam where an unstructured file dropped into a bucket becomes a typed, hashed, schema-validated text payload that every downstream engine can trust. Commercial lease portfolios never arrive in one shape. A single onboarding batch mixes natively-authored DOCX amendments, born-digital PDF originals, and flattened scans of decades-old retail leases, each carrying multi-column rent schedules, embedded CAM addendums, and inconsistent typographic hierarchy. The specific challenge this page solves is making that heterogeneity disappear before extraction runs — so that the clause parsers, field mappers, and billing engines downstream never have to special-case where a document came from.

Get ingestion wrong and the failure is expensive and quiet. A scanned lease that yields an empty string sails through as a “successful” zero-text record and silently drops a tenant from the rent roll; a duplicate S3 event re-processes the same amendment twice and double-posts an escalation. This page covers the engineering of doing it safely: MIME-gated routing, format-specific text and table extraction, content hashing for idempotency, and a pydantic v2 quality gate that diverts anything below a text-density threshold to OCR preprocessing instead of letting it pass as clean.

The Scoped Problem: One Text Contract From Many File Shapes

Downstream stages have no business knowing whether a clause came from a PDF or a DOCX. The job of this layer is to collapse every supported input into one stable contract — a RawLeasePayload carrying clean text, separated tables, a content hash, and an extraction verdict — and to do it idempotently so that storage retries, partial writes, and duplicate event notifications cannot corrupt the queue feeding regex and NLP clause extraction.

Three properties make ingestion safe at portfolio scale:

Format-awareness, not format-leakage. Routing decisions are made once, here, against the MIME type and magic bytes. The payload that leaves this stage is format-agnostic.
Idempotency by content. A SHA-256 of the file bytes is the deduplication key and the version anchor. Re-uploading the same lease is a no-op; uploading an amended copy produces a new hash and a new record, preserving an audit trail.
No silent success. An empty or near-empty extraction is a failure verdict, not a clean zero-text record. Scanned and image-only documents are detected and routed, never swallowed.

This page is the parent for file-format-specific deep dives such as using pdfplumber for commercial lease text extraction at scale, which handles the bounding-box logic for multi-column rent rolls that the router delegates to.

Prerequisites & Environment Setup

The examples assume Python 3.11+ and the following pinned dependencies. The pydantic v2 field_validator / model_validator syntax is the project convention and is used for the quality gate.

# requirements.txt
pdfplumber==0.11.0   # layout-aware PDF text + table extraction (coordinate model)
python-docx==1.1.2   # native DOCX paragraph + table traversal
pydantic==2.7.1      # typed RawLeasePayload boundary + model-level quality gate

Assumptions baked into the implementation below:

Input is a file path or readable stream. In production the worker pulls bytes from object storage (S3/GCS) into a temp file or BytesIO; the router does not care which, as long as it can compute a hash and open the document.
Supported formats are an explicit allowlist. application/pdf and the DOCX OOXML MIME only. Legacy .doc (OLE2), .rtf, and image formats are rejected at the gate rather than parsed on a best-effort basis — partial parses of unsupported formats are how garbage reaches the extractor.
Magic bytes are checked, not just extensions. A .pdf extension on a file whose header is not %PDF is rejected. Extension-only trust is a malicious-upload vector and a corruption source.
OCR is out of scope here. When a PDF has no extractable text layer, this stage’s job is to detect that and hand the document to the OCR preprocessing stage — not to run OCR inline and block the batch.
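For the stream case mentioned in the first assumption, hashing and header sniffing might look like the sketch below. The helper names `hash_stream` and `sniff_mime` are hypothetical and are not part of the pipeline class; the only assumption is a seekable binary stream such as `io.BytesIO`.

```python
import hashlib
import io
from typing import BinaryIO, Optional

# The two signatures the allowlist admits; everything else is rejected.
SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}


def hash_stream(stream: BinaryIO, chunk_size: int = 1 << 16) -> str:
    """SHA-256 over a seekable stream, read in chunks; the read position is restored."""
    start = stream.tell()
    stream.seek(0)
    sha = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        sha.update(chunk)
    stream.seek(start)
    return sha.hexdigest()


def sniff_mime(stream: BinaryIO) -> Optional[str]:
    """Return the allowlisted MIME type for the header, or None if unsupported."""
    start = stream.tell()
    stream.seek(0)
    head = stream.read(8)
    stream.seek(start)
    for signature, mime in SIGNATURES.items():
        if head.startswith(signature):
            return mime
    return None
```

Both helpers restore the stream position, so the same `BytesIO` can be handed to a parser afterwards without an explicit rewind.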

MIME Routing Decision Table

Routing is declarative: a fixed map from validated MIME type to handler. Making the table explicit keeps the supported surface auditable and forces every new format to be a reviewed addition rather than an implicit code path.

MIME type	Magic bytes	Extension	Handler	Tables	If no text layer
`application/pdf`	`%PDF-`	`.pdf`	`_parse_pdf` (pdfplumber)	coordinate extraction	route to OCR preprocessing
`…wordprocessingml.document`	`PK\x03\x04` (ZIP)	`.docx`	`_parse_docx` (python-docx)	native cell traversal	n/a (always has text layer)
`application/msword`	`\xD0\xCF\x11\xE0`	`.doc`	rejected	—	dead-letter: convert upstream
anything else	—	—	rejected at gate	—	dead-letter: unsupported

The two supported lanes converge on the same payload contract. The PDF lane delegates complex multi-column financial tables to the bounding-box logic covered in the pdfplumber extraction deep dive; the DOCX lane walks the OOXML tree directly because python-docx exposes paragraph and table structure natively.
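One way to keep the table literal in code is a lookup keyed by validated MIME type. This is a sketch only: the class further down routes with a plain if/else, and `Route`, `ROUTES`, `REJECTED`, and `resolve_route` are hypothetical names introduced here.

```python
from dataclasses import dataclass

DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


@dataclass(frozen=True)
class Route:
    handler: str       # name of the parser method to call
    on_no_text: str    # verdict when extraction yields no text


# Supported lanes, mirroring the first two rows of the table.
ROUTES = {
    "application/pdf": Route(handler="_parse_pdf", on_no_text="needs_ocr"),
    DOCX_MIME: Route(handler="_parse_docx", on_no_text="failed"),
}

# Known-but-rejected types get a specific dead-letter reason.
REJECTED = {
    "application/msword": "dead-letter: convert upstream",
}


def resolve_route(mime: str) -> Route:
    """Look up the handler for a validated MIME type; anything else is rejected."""
    if mime in ROUTES:
        return ROUTES[mime]
    reason = REJECTED.get(mime, "dead-letter: unsupported")
    raise ValueError(f"{mime}: {reason}")
```

With this shape, adding a format means adding a reviewed entry to `ROUTES` rather than another branch.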

Primary Implementation

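The ingestion class below is stateless and constructed once per worker. It validates, hashes, routes, extracts, and returns a validated RawLeasePayload. Expected failures (unsupported format, page-limit breach, empty DOCX) do not raise into the caller, because in a batch a single bad document must not halt the run; they become a failed verdict on the payload, and a PDF without a text layer becomes needs_ocr. Two caveats about the code as written: a missing input path still raises FileNotFoundError before hashing, and the broad except Exception in ingest also converts unexpected exceptions into failed verdicts, so narrow it if programmer errors should propagate.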

import hashlib
import logging
import unicodedata
from pathlib import Path
from typing import Any, Literal, Optional, Union

import pdfplumber
from docx import Document
from pydantic import BaseModel, Field, model_validator

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger("lease_ingestion")

# Magic-byte signatures — extension is advisory, the header is authoritative.
_SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}

ExtractionStatus = Literal["success", "needs_ocr", "failed"]


class RawLeasePayload(BaseModel):
    """The single contract every downstream stage consumes, regardless of source format."""

    source_file: str
    file_hash: str                       # SHA-256 of raw bytes: dedup key + version anchor
    raw_text: str = ""
    tables: list[list[list[str]]] = Field(default_factory=list)  # table/row/cell
    metadata: dict[str, Any] = Field(default_factory=dict)
    page_count: int = 0
    text_density: float = 0.0            # non-whitespace chars per page, set by the gate
    extraction_status: ExtractionStatus = "success"
    error_message: Optional[str] = None

    @model_validator(mode="after")
    def gate_text_density(self) -> "RawLeasePayload":
        """Quality gate: a near-empty extraction from a PDF is a scan, not a success."""
        if self.extraction_status == "failed":
            return self
        printable = sum(1 for c in self.raw_text if not c.isspace())
        pages = max(self.page_count, 1)
        self.text_density = printable / pages
        # A born-digital lease page carries hundreds of printable chars; a flattened
        # scan carries ~0. Below threshold we do NOT trust the empty text.
        if self.metadata.get("format") == "pdf" and self.text_density < 50:
            self.extraction_status = "needs_ocr"
            self.error_message = "No extractable text layer — likely scanned/image-based PDF."
        return self


class DocumentIngestionPipeline:
    """Stateless router + extractor for commercial real-estate documents."""

    def __init__(self, max_pages: int = 200):
        self.max_pages = max_pages

    # --- idempotency -------------------------------------------------------
    def compute_file_hash(self, path: Path) -> str:
        """SHA-256 over raw bytes — stable dedup key and lease version anchor."""
        sha = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 16), b""):
                sha.update(chunk)
        return sha.hexdigest()

    # --- gating ------------------------------------------------------------
    def detect_format(self, path: Path) -> str:
        """Trust magic bytes over the extension to block spoofed/corrupt uploads."""
        with open(path, "rb") as fh:
            head = fh.read(8)
        for sig, mime in _SIGNATURES.items():
            if head.startswith(sig):
                return mime
        raise ValueError(f"Unsupported or corrupt file (header={head!r}): {path.name}")

    # --- routing -----------------------------------------------------------
    def ingest(self, file_path: Union[str, Path]) -> RawLeasePayload:
        path = Path(file_path)
        if not path.exists():
            raise FileNotFoundError(f"Document not found: {path}")

        file_hash = self.compute_file_hash(path)
        try:
            mime = self.detect_format(path)
            if mime == "application/pdf":
                return self._parse_pdf(path, file_hash)
            return self._parse_docx(path, file_hash)
        except Exception as exc:  # expected failures become a verdict, not a crash
            logger.error("Ingestion failed for %s: %s", path.name, exc)
            return RawLeasePayload(
                source_file=str(path),
                file_hash=file_hash,
                extraction_status="failed",
                error_message=str(exc),
            )

    # --- PDF lane ----------------------------------------------------------
    def _parse_pdf(self, path: Path, file_hash: str) -> RawLeasePayload:
        with pdfplumber.open(path) as pdf:
            if len(pdf.pages) > self.max_pages:
                raise ValueError(
                    f"Document exceeds page limit ({len(pdf.pages)} > {self.max_pages})"
                )
            blocks, tables = [], []
            for page in pdf.pages:
                text = page.extract_text() or ""
                if text:
                    blocks.append(self._clean(text))
                # Keep rent-schedule tables structurally separate from prose so the
                # extractor never has to re-segment a flattened financial grid.
                for tbl in page.extract_tables():
                    tables.append([[(c or "").strip() for c in row] for row in tbl])
            return RawLeasePayload(
                source_file=str(path),
                file_hash=file_hash,
                raw_text="\n\n".join(blocks),
                tables=tables,
                metadata={"format": "pdf"},
                page_count=len(pdf.pages),
            )

    # --- DOCX lane ---------------------------------------------------------
    def _parse_docx(self, path: Path, file_hash: str) -> RawLeasePayload:
        doc = Document(path)
        paragraphs = [self._clean(p.text) for p in doc.paragraphs if p.text.strip()]
        tables = [
            [[cell.text.strip() for cell in row.cells] for row in table.rows]
            for table in doc.tables
        ]
        raw_text = "\n\n".join(paragraphs)
        if not raw_text.strip() and not tables:
            raise ValueError("DOCX contains no extractable text or tables.")
        return RawLeasePayload(
            source_file=str(path),
            file_hash=file_hash,
            raw_text=raw_text,
            tables=tables,
            metadata={
                "format": "docx",
                "paragraph_count": len(paragraphs),
                "table_count": len(doc.tables),
            },
            # python-docx exposes no page boundaries — page_count stays 0 by design.
            page_count=0,
        )

    @staticmethod
    def _clean(text: str) -> str:
        """NFKC-normalize, then strip zero-width spaces and soft hyphens."""
        text = unicodedata.normalize("NFKC", text)
        # NFKC folds U+00A0 into a plain space but leaves U+200B and U+00AD in place.
        return text.replace("\u200b", "").replace("\xad", "")


if __name__ == "__main__":
    pipeline = DocumentIngestionPipeline(max_pages=300)
    payload = pipeline.ingest("/data/incoming/commercial_lease_v3.pdf")
    logger.info(
        "status=%s pages=%d density=%.1f hash=%s",
        payload.extraction_status, payload.page_count,
        payload.text_density, payload.file_hash[:12],
    )

Two lease-specific decisions are worth calling out. First, tables are carried as a separate structured field rather than flattened into the prose. A rent schedule or CAM breakdown loses its meaning once its columns are linearized, and keeping the grid intact is what lets the downstream field mapping layer line Base Rent and CAM Provisions up against the right periods. Second, _clean runs at the ingestion boundary: NFKC normalization folds non-breaking spaces into ordinary spaces, and an explicit strip removes the zero-width spaces and soft hyphens that NFKC leaves untouched. Both are endemic to copy-pasted legal templates and would otherwise defeat exact-match clause patterns downstream.

Validation & Quality Gates

The RawLeasePayload is the boundary where ingestion either certifies a document as clean or routes it elsewhere. The model_validator runs on every payload and computes text density — printable characters per page — as the single signal that separates a born-digital lease from a flattened scan. A genuine lease page carries hundreds of printable characters; a scanned image yields near zero. Below the threshold the verdict flips to needs_ocr, and the empty text is never trusted as a real extraction.

Three verdicts leave this stage, and each maps to a distinct queue:

success: text density clears the threshold and the schema validates. The payload proceeds to clause extraction.
needs_ocr: a PDF with no usable text layer. The orchestrator diverts it to OCR preprocessing, then re-submits the recognized text through the same gate.
failed: unsupported format, corrupt header, page-limit breach, or an empty DOCX. The document is dead-lettered for review and the batch continues uninterrupted.
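The verdict-to-queue mapping can be kept as data so the orchestrator never branches on status strings in more than one place. The queue names below are placeholders, not names used by this project, and `queue_for` is a hypothetical helper.

```python
# Queue names are illustrative placeholders; substitute the orchestrator's own.
QUEUE_FOR_VERDICT = {
    "success": "clause-extraction",
    "needs_ocr": "ocr-preprocessing",
    "failed": "dead-letter",
}


def queue_for(extraction_status: str) -> str:
    """Map a payload verdict to its destination queue; unknown verdicts are an error."""
    try:
        return QUEUE_FOR_VERDICT[extraction_status]
    except KeyError:
        raise ValueError(f"Unknown extraction status: {extraction_status!r}") from None
```

Raising on an unknown status keeps a typo or a newly added verdict from silently falling into a default queue.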

Idempotency is enforced by the content hash. Before persisting a payload, the worker checks the file_hash against a seen-set (a Redis set, a unique index, or a manifest table). A matching hash means the identical bytes were already processed — the event is a no-op. A new hash on a known lease_id means an amendment, which is stored as its own versioned record so precedence can be resolved later by the lease data models layer rather than overwriting the base lease in place. Aligning the date and currency fields surfaced here with metadata normalization standards at the ingestion boundary keeps every downstream comparison unambiguous.
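A minimal sketch of the seen-set, assuming a manifest table with a unique key on the hash. SQLite stands in here for whatever store the worker really uses, and the table and function names (`ingested`, `record_if_new`, `versions_for`) are hypothetical.

```python
import sqlite3


def open_manifest(db_path: str = ":memory:") -> sqlite3.Connection:
    """Create the manifest table; the primary key on file_hash is the dedup guard."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ingested ("
        " file_hash TEXT PRIMARY KEY,"
        " lease_id TEXT NOT NULL,"
        " source_file TEXT NOT NULL)"
    )
    return conn


def record_if_new(conn: sqlite3.Connection, file_hash: str, lease_id: str, source_file: str) -> bool:
    """Insert the hash; return False when identical bytes were already processed."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO ingested (file_hash, lease_id, source_file)"
            " VALUES (?, ?, ?)",
            (file_hash, lease_id, source_file),
        )
    return cur.rowcount == 1


def versions_for(conn: sqlite3.Connection, lease_id: str) -> list[str]:
    """All content hashes recorded for a lease, in insertion order."""
    rows = conn.execute(
        "SELECT file_hash FROM ingested WHERE lease_id = ? ORDER BY rowid",
        (lease_id,),
    )
    return [row[0] for row in rows]
```

Letting the database enforce uniqueness avoids the check-then-insert race that two workers handling the same duplicate event would otherwise hit.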

When a verdict is ambiguous — a marginal density score, a partial table extraction — the document should be routed through fallback routing logic to a human-review queue rather than force-committed. Ingestion’s contract is to be certain; uncertainty is escalated, not guessed.
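A marginal density score can be made explicit with a review band around the threshold. In this sketch the threshold of 50 is the value used in the gate above, while the band width (`margin`) and the `human_review` label are assumptions for illustration and are not part of ExtractionStatus.

```python
def classify_density(text_density: float, threshold: float = 50.0, margin: float = 0.5) -> str:
    """Three-way call: clearly a scan, clearly born-digital, or too close to tell.

    `margin` is a fraction of the threshold; 0.5 makes the review band 25..75
    for a threshold of 50. Both numbers should be tuned on real data.
    """
    if threshold <= 0 or not 0 <= margin < 1:
        raise ValueError("threshold must be positive and margin in [0, 1)")
    low = threshold * (1 - margin)
    high = threshold * (1 + margin)
    if text_density < low:
        return "needs_ocr"
    if text_density > high:
        return "success"
    return "human_review"
```

Setting `margin=0` collapses the band to the single threshold value, which approximates the two-way gate in the validator.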

Troubleshooting

Concrete failure scenarios that show up against real lease portfolios, each with its diagnostic signal and fix.

A “successful” PDF carries empty text. A flattened scan extracts with no error but yields a blank string. Diagnostic: extraction_status="success" records with raw_text length near zero. Fix: this is exactly what the density gate exists for — confirm the model_validator is computing text_density and flipping PDFs to needs_ocr below threshold; never treat a zero-text PDF as clean.

Multi-column rent rolls bleed into scrambled prose. A two-column rent schedule extracts as interleaved text where amounts no longer align with periods. Diagnostic: downstream rent values mapped to the wrong dates. Fix: do not rely on extract_text() for financial grids — capture them via extract_tables() into the structured tables field, and apply the bounding-box strategy from the pdfplumber deep dive to set explicit column boundaries.

Zero-width and non-breaking spaces defeat clause patterns. Text that visibly reads base rent fails an exact match because it contains a non-breaking space (U+00A0) or a zero-width space (U+200B). Diagnostic: a clause that is plainly present in the source reports as missing downstream. Fix: run unicodedata.normalize("NFKC", text) and strip zero-width/soft-hyphen characters at the ingestion boundary, as _clean does, so the artifacts never propagate.
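The reason normalization alone is not enough can be shown in a few lines. This standalone `clean` is a sketch rather than the pipeline's `_clean`: it strips a slightly wider set of invisible characters (zero-width joiners and the BOM), which is an assumption about what a portfolio may contain.

```python
import unicodedata

# Invisible characters that NFKC does not remove.
ZERO_WIDTH = ("\u200b", "\u200c", "\u200d", "\ufeff")
SOFT_HYPHEN = "\u00ad"


def clean(text: str) -> str:
    """NFKC-normalize, then drop zero-width characters and soft hyphens."""
    text = unicodedata.normalize("NFKC", text)
    for ch in (*ZERO_WIDTH, SOFT_HYPHEN):
        text = text.replace(ch, "")
    return text


dirty = "base\u00a0re\u200bnt"                 # renders as "base rent"
nfkc_only = unicodedata.normalize("NFKC", dirty)

assert "base rent" not in dirty                # exact match fails on the raw text
assert nfkc_only == "base re\u200bnt"          # NBSP folded, zero-width space survives
assert clean(dirty) == "base rent"             # both artifacts gone
```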

A .docx upload is actually legacy .doc. An OLE2 binary .doc renamed to .docx makes python-docx raise an opaque package error. Diagnostic: failed verdicts citing a “not a Word file” / bad zip error. Fix: the magic-byte check (PK\x03\x04 for the OOXML ZIP container) rejects it at the gate; dead-letter with a clear “convert .doc upstream” message rather than letting the parser stack-trace.
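A gate that produces the clearer message might look like the following sketch. It also covers a gap in a signature-only check: every ZIP archive starts with PK\x03\x04, so the sketch looks for the conventional `word/document.xml` part as well. Strictly, the main document part is located through the package relationships, so treat the name check as a heuristic. `check_docx_container` is a hypothetical helper.

```python
import io
import zipfile

OLE2_MAGIC = b"\xd0\xcf\x11\xe0"   # legacy .doc/.xls/.ppt compound file
ZIP_MAGIC = b"PK\x03\x04"          # any ZIP archive, including OOXML


def check_docx_container(data: bytes) -> None:
    """Raise ValueError with an actionable message unless the bytes look like a DOCX."""
    if data.startswith(OLE2_MAGIC):
        raise ValueError("Legacy .doc (OLE2) detected: convert to .docx upstream.")
    if not data.startswith(ZIP_MAGIC):
        raise ValueError("Not a ZIP container: unsupported or corrupt upload.")
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile as exc:
        raise ValueError(f"Corrupt ZIP container: {exc}") from exc
    # Heuristic: Word writes its main part under this conventional name.
    if "word/document.xml" not in names:
        raise ValueError("ZIP container is not a Word document (no word/document.xml).")
```

Run before python-docx opens the file, this turns a renamed .doc, an .xlsx, or a truncated upload into three distinct dead-letter reasons instead of one opaque parser error.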

Duplicate S3 events double-process a lease. Object-storage notifications deliver at-least-once, so the same upload fires twice. Diagnostic: duplicate records, double-posted escalations. Fix: dedupe on file_hash against a seen-set before persisting; identical bytes are a no-op. This is the idempotency contract the whole async batch processing layer depends on.

A 600-page omnibus lease exhausts worker memory. A scanned master agreement with hundreds of exhibits balloons the pdfplumber object graph. Diagnostic: OOM kills on a small subset of files. Fix: enforce max_pages as an explicit failed verdict, process pages lazily inside the with block (never materialize every page object at once), and split oversized documents upstream before ingestion.

Performance & Scale Notes

Ingestion is I/O- and CPU-bound per document and runs on every file, so its per-record cost compounds across a portfolio onboarding of tens of thousands of leases.

Construct the pipeline once per worker. DocumentIngestionPipeline is stateless and cheap, but instantiate it at worker startup, not per task, and keep the signature map module-level.
Stream the hash and the pages. Hash in 64 KB chunks and iterate pdf.pages lazily inside the with block so a large lease never loads fully into memory. pdfplumber holds per-page layout objects; let them be garbage-collected page by page.
Parallelize across documents, not within. A single PDF is best parsed in one process; throughput comes from fanning documents across the worker pool described in async batch processing. Results cross process boundaries cleanly because RawLeasePayload is plain data and pickles without custom handling.
Keep OCR off the hot path. Inline OCR is orders of magnitude slower than text extraction and will starve the pool. Emit a needs_ocr verdict and let a separate, independently-scaled stage handle rasterization so a batch of scans cannot block a batch of born-digital leases.
Bound the page limit to the portfolio. Standard office leases rarely exceed 100 pages; retail and mixed-use omnibus agreements run longer. Set max_pages from the portfolio’s real distribution so the guard rejects pathological files without rejecting legitimate ones.
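Deriving the page limit from observed page counts could be sketched as below. The 99th percentile and the 25% headroom are illustrative defaults rather than recommendations from this page, and `suggest_max_pages` is a hypothetical helper.

```python
import math
import statistics


def suggest_max_pages(page_counts: list[int], percentile: int = 99, headroom: float = 1.25) -> int:
    """Suggest a max_pages guard from a sample of real page counts.

    Takes the given percentile of the sample and adds headroom so that
    legitimate long leases pass while pathological outliers are rejected.
    """
    if len(page_counts) < 2:
        raise ValueError("need at least two page counts")
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(n=100) returns the 99 cut points between percentiles.
    cut_points = statistics.quantiles(page_counts, n=100, method="inclusive")
    return math.ceil(cut_points[percentile - 1] * headroom)
```

The result would be passed as `DocumentIngestionPipeline(max_pages=...)` and recomputed when a new portfolio with a different document mix is onboarded.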

Frequently Asked Questions

What confidence threshold should trigger manual review? For ingestion the signal is text density, not a model score. Tune the per-page printable-character threshold against a labeled holdout of known born-digital and scanned leases from your own portfolio. Marginal scores near the boundary should divert through fallback routing to human review rather than committing, because a misclassified scan silently drops a lease from the pipeline.
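Tuning against a labeled holdout can be as simple as counting misclassifications at each candidate threshold. The sketch below assumes two lists of per-document text densities, one per label, and uses the same rule as the gate (density below threshold means OCR). The function names and the sample numbers are made up for illustration.

```python
def misclassified(threshold: float, scanned: list[float], born_digital: list[float]) -> int:
    """Count documents the rule `density < threshold -> needs_ocr` gets wrong."""
    missed_scans = sum(1 for d in scanned if d >= threshold)        # scans passed as clean
    false_ocr = sum(1 for d in born_digital if d < threshold)       # clean docs sent to OCR
    return missed_scans + false_ocr


def tune_threshold(scanned: list[float], born_digital: list[float]) -> float:
    """Pick the observed density that misclassifies the fewest holdout documents."""
    if not scanned or not born_digital:
        raise ValueError("both labeled groups must be non-empty")
    candidates = sorted(set(scanned) | set(born_digital))
    # Ties resolve to the lowest candidate threshold.
    return min(candidates, key=lambda t: (misclassified(t, scanned, born_digital), t))


# Hypothetical holdout densities (non-whitespace characters per page).
scanned_sample = [0.0, 2.0, 5.0, 31.0]
digital_sample = [44.0, 180.0, 420.0, 960.0]
best = tune_threshold(scanned_sample, digital_sample)
```

If the best threshold still misclassifies some holdout documents, the two groups overlap, and the overlapping range is a natural candidate for the human-review band.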

How do I handle lease amendments that override base clauses? Ingest each amendment as its own document with its own content hash and store it as a separate versioned record — never overwrite the base lease in place. Precedence is resolved later by effective date in the lease data models layer, which preserves the audit trail that an in-place overwrite would destroy.

Why hash the file bytes instead of trusting the filename or event ID? Object-storage events deliver at-least-once and filenames are not unique across re-uploads, so neither is a safe dedup key. A SHA-256 of the raw bytes is identical for identical content and different for any edit, which makes it both the idempotency key and the version anchor in one value.

Should ingestion run OCR when a PDF has no text layer? No. Ingestion should detect the missing text layer via the density gate and emit a needs_ocr verdict, then hand the document to a separate OCR preprocessing stage. Running OCR inline is far slower than text extraction and lets a batch of scans block born-digital documents in the same pool.

Using pdfplumber for Commercial Lease Text Extraction at Scale — the bounding-box strategy the PDF lane delegates to for multi-column rent rolls and side-by-side amendment tables.
OCR Preprocessing Workflows — where needs_ocr documents are rasterized and re-submitted through the same quality gate.
Regex & NLP Clause Extraction — the immediate downstream consumer of the clean text contract this stage produces.
Field Mapping Strategies — reconciles the separated tables and text into a canonical property schema.
Async Batch Processing — the worker pool and event-trigger layer that fans documents through this pipeline idempotently.
