Scaling Async Lease Parsing Pipelines with Celery and Redis

A single-node asyncio controller has a hard ceiling: one machine’s CPU and memory. The precise decision this page resolves is when a bounded-concurrency async batch pipeline stops being enough for a lease portfolio, and how to migrate its semantics — the per-task concurrency cap, the confidence gate, and idempotent commits — onto Celery workers backed by a Redis broker without losing data during a worker restart. The answer is not “add more asyncio.gather”; it is a distributed task queue that survives crashes, retries deterministically, and never double-counts a base rent.

Commercial and residential abstraction pipelines ingest heterogeneous documents: multi-page PDFs with embedded financial tables, scanned addenda with inconsistent headers, and DOCX files carrying tracked changes. Clause extraction, OCR fallbacks, and field normalization each introduce unpredictable latency, so a synchronous architecture backs up during peak leasing season. Celery and Redis decouple ingestion from extraction, so the pipeline scales horizontally across worker nodes while SLAs and retry behavior stay predictable.

Architectural context

This technique sits at the distributed end of Async Batch Processing inside the broader Parsing & Extraction Workflows domain. Upstream, the PDF/DOCX ingestion pipelines and OCR preprocessing workflows stage and clean raw files; this layer fans those files across workers; downstream, validated abstractions land in the canonical lease data models, while low-confidence results divert through fallback routing logic to a human review queue. Celery owns exactly one question per task — “given this object key, did extraction produce a valid abstraction, and where does the result go?” — and the broker guarantees the task is not lost if the worker dies mid-parse.

Single-node asyncio vs distributed Celery + Redis

The migration is a tradeoff, not a strict upgrade. Below the threshold, a single asyncio process is simpler to reason about and deploy; above it, the broker’s durability and cross-node fan-out justify the operational weight.

Concern	Single-node `asyncio`	Distributed Celery + Redis
Concurrency model	`Semaphore` cap in one process	Worker concurrency × N nodes, broker-mediated
Survives worker restart	No — in-flight tasks lost	Yes — `acks_late` re-delivers unacked tasks
Throughput ceiling	One machine’s CPU/RAM	Horizontal; add worker nodes
Retry semantics	Manual, in-process	Broker-backed `self.retry` with backoff
Failure isolation	Per-task `try/except`	Per-task + dedicated queues per stage
Operational cost	Low — one process	Higher — broker, workers, monitoring
Best fit	Up to ~tens of thousands of docs/run	Large portfolios, multi-tenant, strict SLAs

Migrate when any of these is true: a run exceeds a single node’s memory budget, work must survive a restart, or you need per-stage scaling (OCR is slow and memory-hungry; normalization is cheap and fast). Until then, the async batch pipeline is the right tool.
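For contrast, here is a minimal sketch of the single-node pattern being migrated away from: a semaphore-bounded asyncio controller. The names run_batch and fake_parse and the limit of 4 are illustrative assumptions, not taken from the async batch pipeline itself.

```python
import asyncio


async def run_batch(doc_keys, parse, limit=8):
    """Parse every key with at most `limit` coroutines in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(key):
        async with semaphore:  # the per-task concurrency cap
            return await parse(key)

    # return_exceptions=True keeps one bad lease from cancelling the whole batch
    return await asyncio.gather(*(bounded(k) for k in doc_keys), return_exceptions=True)


async def fake_parse(key):
    # Hypothetical stand-in for real extraction work
    await asyncio.sleep(0.01)
    return {"doc_key": key, "status": "completed"}


results = asyncio.run(
    run_batch([f"leases/{i}.pdf" for i in range(20)], fake_parse, limit=4)
)
```

Everything here lives in one process, which is why none of it survives a restart: the semaphore, the pending coroutines, and the results all vanish with the interpreter.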

Redis broker configuration for lease ingestion

Redis serves as both message broker and result backend, but its defaults fail under sustained lease loads. When tasks carry large payloads (base64 documents, extracted text chunks, OCR confidence scores), they exhaust default memory limits and starve the connection pool. The noeviction policy is non-negotiable: under an eviction policy, a memory spike can silently drop pending tasks, which corrupts abstraction timelines and loses leases.

# redis.conf
maxmemory 4gb
maxmemory-policy noeviction
timeout 300
tcp-keepalive 60
client-output-buffer-limit normal 256mb 64mb 60

Lease documents are never stored directly in Redis; only cloud-storage object keys and document hashes travel through the queue, keeping message payloads strictly under 50KB. The broker (db 0) and result backend (db 1) use separate logical databases, so queue keys and result keys never collide and can be inspected or flushed independently. Both still share one Redis process, so run separate instances if result traffic must never contend with task dispatch. Connection pooling on the Python side is sized to cover worker concurrency.
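A minimal sketch of building such a message follows. SHA-256 as the content hash, the helper name build_task_payload, and the example object key are assumptions for illustration; the source does not name a hash algorithm.

```python
import hashlib
import json

MAX_PAYLOAD_BYTES = 50 * 1024  # the 50KB ceiling described above


def build_task_payload(doc_key: str, content: bytes) -> dict:
    """Return the small message that travels through the broker."""
    payload = {
        "doc_key": doc_key,
        # Assumption: SHA-256 of the raw bytes serves as the document hash
        "doc_hash": hashlib.sha256(content).hexdigest(),
    }
    encoded = json.dumps(payload).encode("utf-8")
    if len(encoded) > MAX_PAYLOAD_BYTES:
        raise ValueError(f"payload is {len(encoded)} bytes; limit is {MAX_PAYLOAD_BYTES}")
    return payload


payload = build_task_payload("leases/2024/unit-12.pdf", b"%PDF-1.7 example bytes")
```

The document bytes are hashed where they are staged and then left in object storage; only the key and the hash are enqueued.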

import redis
from celery import Celery

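# Application-side pool for direct Redis calls, sized to worker concurrency.
# Celery does not use this object; it manages broker connections via broker_pool_limit.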
redis_pool = redis.ConnectionPool(
    host="redis.internal",
    port=6379,
    db=0,
    max_connections=120,
    socket_keepalive=True,
    socket_timeout=10,
    retry_on_timeout=True,
    decode_responses=False,
)

app = Celery(
    "lease_parser",
    broker="redis://redis.internal:6379/0",
    backend="redis://redis.internal:6379/1",
)

app.conf.update(
    broker_pool_limit=120,
    result_backend_transport_options={
        "retry_policy": {"max_retries": 3, "interval_start": 1, "interval_step": 1}
    },
    task_serializer="json",
    accept_content=["json"],
    result_serializer="json",
    timezone="UTC",
    enable_utc=True,
)

Key parameters at a glance

Parameter	Recommended	Why it matters for leases
`maxmemory-policy`	`noeviction`	A dropped task is a lost lease; never evict pending work
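`worker_prefetch_multiplier`	`1`	A lease that takes 15 seconds to parse must not hold prefetched tasks that idle workers could run
`task_acks_late`	`True`	Acknowledge after the task finishes, so a worker crash mid-parse re-delivers the lease
`task_reject_on_worker_lost`	`True`	Requeue, rather than acknowledge, a task whose child process was OOM-killed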
`worker_max_tasks_per_child`	`500`	Recycle workers to cap OCR memory leaks
message payload	< 50KB	Pass object keys + hashes, never document bytes

Worker topology and idempotent task routing

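Lease parsing needs strict task isolation. A single malformed commercial lease with embedded vector graphics can block a worker for 15+ seconds. Dedicated queues with explicit routing keep slow stages from starving fast ones, and late acknowledgment gives at-least-once delivery, which idempotent handlers make safe. With acks_late, a task is acknowledged after it finishes rather than when it starts, so a worker that dies mid-parse leaves the message unacknowledged for re-delivery. One caveat: in the prefork pool, a task whose child process is killed (for example by the OOM killer) is still acknowledged by default. Setting task_reject_on_worker_lost=True requeues it instead, which is only safe when handlers are idempotent.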

# celery_config.py
app.conf.update(
    task_acks_late=True,
    task_reject_on_worker_lost=True,  # requeue the task if its child process is killed mid-parse
    worker_prefetch_multiplier=1,
    task_routes={
        "lease_parser.tasks.extract_clauses": {"queue": "lease_extraction"},
        "lease_parser.tasks.run_ocr_fallback": {"queue": "lease_ocr"},
        "lease_parser.tasks.normalize_financials": {"queue": "lease_normalization"},
    },
    worker_max_tasks_per_child=500,
    worker_concurrency=8,
    broker_connection_retry_on_startup=True,
)

Idempotency is enforced at the application layer with document hashes. Property management systems frequently re-ingest the same lease after leasing-broker negotiations or legal revisions; checking a Redis-backed deduplication set before execution prevents redundant OCR calls and clause re-extraction.

import logging
from functools import wraps

logger = logging.getLogger(__name__)


def idempotent_task(func):
    """Skip already-processed leases via a Redis deduplication set."""
    @wraps(func)
    def wrapper(self, doc_key: str, doc_hash: str, *args, **kwargs):
        cache_key = f"lease:processed:{doc_hash}"
        if self.app.backend.client.sismember(cache_key, "1"):
            logger.info("Skipping already processed lease %s | hash %s", doc_key, doc_hash)
            return {"status": "skipped", "doc_key": doc_key}
        return func(self, doc_key, doc_hash, *args, **kwargs)
    return wrapper
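The decorator above checks the set before the task runs and the task marks it only after success, so two deliveries of the same lease that arrive close together can both pass the check. One way to narrow that window is an atomic claim with SET NX EX, sketched below. The key prefix lease:claim:, the 900-second TTL, and the InMemoryRedis class are illustrative assumptions; the stand-in exists only so the sketch runs without a server, and a real redis-py client exposes the same set(name, value, nx=..., ex=...) call.

```python
import time


class InMemoryRedis:
    """Hypothetical stand-in for a redis-py client, for illustration only.

    It mimics just the set(name, value, nx=..., ex=...) behavior used below.
    """

    def __init__(self):
        self._store = {}

    def set(self, name, value, nx=False, ex=None):
        now = time.monotonic()
        current = self._store.get(name)
        if current is not None and current[1] is not None and current[1] <= now:
            current = None  # the key has expired
        if nx and current is not None:
            return None  # redis-py returns None when NX blocks the write
        expires_at = now + ex if ex is not None else None
        self._store[name] = (value, expires_at)
        return True


def claim_document(client, doc_hash: str, ttl_seconds: int = 900) -> bool:
    """Atomically claim a document; True means this caller owns the work."""
    key = f"lease:claim:{doc_hash}"  # assumed key naming
    return bool(client.set(key, "1", nx=True, ex=ttl_seconds))


client = InMemoryRedis()
first = claim_document(client, "abc123")   # wins the claim
second = claim_document(client, "abc123")  # duplicate delivery is refused
```

The TTL is a tradeoff: it should exceed worst-case parse time, yet a claim left behind by a crashed worker blocks re-delivery of that lease until it expires.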

Recommended task: validation and deterministic retry

Commercial leases carry highly structured financial obligations — base rent, CAM charges, CPI adjustments — that demand strict validation before they reach the books. Parsing failures stem from malformed PDFs, password-protected files, or missing page boundaries, so the task must distinguish transient infrastructure errors (retry) from permanent document corruption (dead-letter). Typed-decimal coercion here mirrors the metadata normalization standards enforced at the canonical boundary, applied early so bad data never travels.

from decimal import Decimal
from typing import Dict, Any
import logging
import boto3
from botocore.exceptions import ClientError
from pydantic import BaseModel, Field, ValidationError

logger = logging.getLogger(__name__)
s3_client = boto3.client("s3", endpoint_url="https://storage.internal")


class LeaseAbstraction(BaseModel):
    # Money is Decimal, not float, so rent never picks up binary rounding error
    base_rent: Decimal = Field(gt=0)
    cam_charges: Decimal = Field(ge=0)
    term_months: int = Field(gt=0)
    extraction_confidence: float = Field(ge=0.0, le=1.0)


@app.task(
    bind=True,
    name="lease_parser.tasks.extract_clauses",
    max_retries=4,
    default_retry_delay=30,
    acks_late=True,
)
@idempotent_task
def extract_clauses(self, doc_key: str, doc_hash: str, ocr_fallback: bool = False) -> Dict[str, Any]:
    try:
        # 1. Retrieve document metadata from object storage
        head = s3_client.head_object(Bucket="lease-docs", Key=doc_key)
        if head["ContentLength"] > 50_000_000:
            raise ValueError("Document exceeds 50MB size limit")

        # 2. Run clause extraction (Textract, Azure Document Intelligence, or custom NLP)
        extracted = _run_parser_engine(doc_key, ocr_fallback=ocr_fallback)

        # 3. Validate at the boundary — coercion + range checks reject bad financials
        abstraction = LeaseAbstraction(**extracted)

        # 4. Mark processed for idempotency (30-day retention)
        cache_key = f"lease:processed:{doc_hash}"
        self.app.backend.client.sadd(cache_key, "1")
        self.app.backend.client.expire(cache_key, 86400 * 30)

        # mode="json" keeps every field JSON-serializable for the result backend
        return {"status": "completed", "doc_key": doc_key, "fields": abstraction.model_dump(mode="json")}

    except ClientError as exc:
        # Transient: storage hiccup — retry with exponential backoff
        logger.error("S3 retrieval failed for %s: %s", doc_key, exc)
        raise self.retry(exc=exc, countdown=60 * (2 ** self.request.retries))
    except (ValueError, ValidationError) as exc:
        # Permanent: corrupt or schema-invalid — dead-letter, do not retry
        logger.warning("Permanent validation error for %s: %s", doc_key, exc)
        return {"status": "failed_permanent", "doc_key": doc_key, "error": str(exc)}
    except Exception as exc:
        logger.exception("Unexpected parsing failure for %s", doc_key)
        raise self.retry(exc=exc, countdown=60 * (2 ** self.request.retries))


def _run_parser_engine(doc_key: str, ocr_fallback: bool) -> Dict[str, Any]:
    # Placeholder for the real PDF/DOCX parsing + OCR pipeline
    return {"base_rent": 4500.00, "cam_charges": 12.50, "term_months": 60, "extraction_confidence": 0.93}

The split is the whole point: a ClientError retries because storage faults usually clear, while a ValidationError dead-letters because re-running the same corrupt bytes will fail identically. Not every ClientError is transient, though: a 404 or 403 from head_object will not fix itself, so inspect the error code before retrying if those occur in practice. Exponential backoff, countdown=60 * (2 ** self.request.retries), scales retries predictably (60s, 120s, 240s, 480s) without overwhelming downstream OCR services during rate limits. The full taxonomy of transient-vs-permanent classification and dead-letter handling lives in error handling & retry logic.
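The retry schedule can be checked with a few lines of arithmetic. The helper name backoff_schedule is an assumption; the formula and the max_retries=4 default come from the task above.

```python
def backoff_schedule(max_retries: int = 4, base_seconds: int = 60) -> list[int]:
    """Countdown before each retry, using base * 2**attempt."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_retries)]


schedule = backoff_schedule()
total_wait = sum(schedule)  # seconds spent waiting, excluding execution time
```

With the defaults, a lease that fails every attempt waits 900 seconds in total before it is given up on, which is worth comparing against the SLA for that document class.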

Edge cases specific to commercial leases

Re-ingested amendments. A landlord re-uploads an amended lease with the same filename but different content. Key the dedup set on the content hash, never the object key, so the amendment is treated as a distinct document and resolved by effective date downstream — never overwritten in place.
Password-protected addenda. Encrypted PDFs raise on open and look transient but are permanent until a credential is supplied. Classify them as failed_permanent with a needs_credential reason so they route to review rather than burning four retries.
OCR-bleed on multi-column rent schedules. When a triple-net rent table bleeds across columns, confidence drops below threshold; the task should branch to run_ocr_fallback on the lease_ocr queue rather than committing a misaligned step-up that corrupts the escalation formula mapping.
Non-breaking spaces in monetary fields. Word auto-formatting injects \xa0 and zero-width spaces into $15,000, so the pydantic coercion fails on otherwise valid rent. Normalize whitespace in _run_parser_engine before the validation gate sees it.
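Visibility-timeout starvation. With the Redis transport, an unacknowledged task is re-delivered once the visibility timeout elapses, which is one hour by default. Under acks_late, a long OCR run on a 90-page scanned lease, or a retry scheduled with a long countdown, can outlive that window and be processed twice. Raise visibility_timeout in broker_transport_options above worst-case extraction latency plus the longest retry countdown, and lean on idempotency as the safety net.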
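The whitespace edge case above can be handled with a small normalizer that runs before validation. This is a sketch assuming US-style formatting (comma as thousands separator, dot as decimal point); the function name normalize_money and the exact character list are illustrative choices.

```python
import re
from decimal import Decimal, InvalidOperation

# Non-breaking, narrow non-breaking, and zero-width characters often left by Word
_INVISIBLE = re.compile("[\u00a0\u202f\u200b\u200c\u200d\ufeff]")


def normalize_money(raw: str) -> Decimal:
    """Strip invisible whitespace, the dollar sign, and thousands separators."""
    cleaned = _INVISIBLE.sub("", raw)
    cleaned = cleaned.replace("$", "").replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation as exc:
        # Raise ValueError so the caller can treat it as a permanent failure
        raise ValueError(f"not a monetary amount: {raw!r}") from exc


rent = normalize_money("$15,\u00a0000.00\u200b")
```

Raising ValueError on unparseable input keeps the failure on the permanent side of the retry split, since re-running the same text cannot succeed.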

When to escalate

This distributed pattern is the right tool for high volume, but it is not the terminal answer for every document. Escalate out of the auto-commit path when:

Aggregate extraction_confidence stays below ~0.80 after the OCR fallback — divert to the human review queue through fallback routing logic rather than posting money on a guess.
A document exhausts all four retries: it belongs in an ops triage queue with its payload, stack trace, and a diagnostic summary, rather than being silently dropped.
Queue depth stays above worker_concurrency * 10 for a sustained period: trigger autoscaling (Kubernetes HPA or an autoscaling worker group) before SLAs slip, and deploy Flower and Prometheus metrics so the signal is visible.
Amendment-rider precedence is ambiguous — resolution belongs downstream in the lease data models using effective dates, not inside the extraction task.
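The document-level rules above can be sketched as one routing function. The Route names and the retries_exhausted flag are hypothetical; the status strings and the 0.80 threshold come from this page.

```python
from enum import Enum


class Route(str, Enum):
    AUTO_COMMIT = "auto_commit"
    HUMAN_REVIEW = "human_review"
    OPS_TRIAGE = "ops_triage"


def route_result(status: str, confidence: float, retries_exhausted: bool = False,
                 threshold: float = 0.80) -> Route:
    """Decide where a finished extraction goes next."""
    if retries_exhausted:
        return Route.OPS_TRIAGE       # transient failures that never cleared
    if status == "completed" and confidence >= threshold:
        return Route.AUTO_COMMIT      # valid schema and a trusted score
    # Low confidence, failed_permanent, or any unrecognized status
    return Route.HUMAN_REVIEW
```

Defaulting to human review means an unrecognized status can never reach the books by accident.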

Implement graceful recycling via worker_max_tasks_per_child to cap memory leaks in long-running parsing workers, and configure dead-letter queues so permanently failed leases reach ops review instead of vanishing.
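A dead-letter record only helps if it carries enough context to triage. The sketch below bundles the payload, stack trace, and summary mentioned above; the field names and the helper build_dead_letter_record are assumptions, and where the record is written (a dedicated queue or a table) is left open.

```python
import traceback
from datetime import datetime, timezone


def build_dead_letter_record(doc_key: str, doc_hash: str,
                             exc: BaseException, retries: int) -> dict:
    """Bundle what an ops reviewer needs to triage a failed lease."""
    return {
        "doc_key": doc_key,
        "doc_hash": doc_hash,
        "error_type": type(exc).__name__,
        "summary": str(exc),
        "stack_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "retries": retries,
        "failed_at": datetime.now(timezone.utc).isoformat(),
    }


try:
    raise ValueError("Document exceeds 50MB size limit")
except ValueError as err:
    record = build_dead_letter_record("leases/big.pdf", "abc123", err, retries=0)
```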

Frequently asked questions

When should I move from asyncio to Celery + Redis? When a run exceeds one node’s memory budget, work must survive a worker restart, or you need to scale OCR independently of cheap normalization. Below that, a semaphore-bounded asyncio controller is simpler and sufficient.

Why is noeviction the only safe Redis memory policy here? Any eviction policy can silently drop a pending task during a memory spike, which means a lease is lost with no error. noeviction forces backpressure instead, so you discover the pressure before you lose data.

How do I stop the same lease being processed twice? task_acks_late=True gives at-least-once delivery, so duplicate deliveries are expected rather than prevented. Gate every task on a Redis deduplication set keyed on the document content hash, so that re-deliveries and re-uploads of identical content become no-ops rather than double-counted rent records.

What confidence threshold should trigger manual review? Start at 0.80 for the aggregate score and tune against a labeled holdout. Because a wrong base rent posts real money, any document that fails schema validation should divert to review regardless of its score.

Async Batch Processing — the single-node bounded-concurrency controller this page distributes; start here before reaching for a broker.
Error Handling & Retry Logic — transient-vs-permanent classification, exponential backoff, and dead-letter routing for the failures these tasks emit.
Fallback Routing Logic — where low-confidence and invalid results divert for human review instead of auto-committing.
PDF/DOCX Ingestion Pipelines — the upstream stage that stages and normalizes raw lease files before they enter the queue.
Metadata Normalization Standards — the typed-decimal canonical contract this task enforces early at the extraction boundary.
