title: "Architecture"
description: "codexPDF boundaries, extraction + render pipeline, and the three deployed services that share one cache key namespace."
group: "Getting started"
order: 2
Architecture
codexPDF is a contract-first facts engine for PDF documents.
Boundary
- Read-only extraction + render. Codex never produces new PDF bytes; `scripts/produce_surface_audit.py` enforces this on every CI run.
- No customer policy / rule adjudication. Codex emits detection signals; pass/fail belongs to Lint.
- No display / viewer presentation. PNG renders are byte-accurate source-of-truth for Lens to display, not a viewer in themselves.
- Consumer-agnostic output: same JSON contract regardless of caller.
Pipeline
1. Input PDF bytes are loaded by the extractor layer (PyMuPDF for the fast path, pikepdf for slower per-object inspection).
2. Domain extractors populate `CodexDocument` fields: pages, boxes, fonts, images, color spaces (with Separation tint transforms evaluated at `t=1.0` so spot inks land on the right swatch), OCG / layers, annotations, transparency, trapping, form XObjects. Optional AI signal extractors (1.10.0+) populate `detected_language`, `detected_logos`, `detected_symbols`, `detected_barcodes`, and `document_classification`; see `policies.md`.
3. Output is serialized as JSON against the published schemas in `schemas/v1/`. Each section (document, color, geom) versions independently and reports its `schema_version` inline.
4. Render endpoints rasterize pages, separations, TAC heatmaps, and OCG-isolated layers via Ghostscript + PyMuPDF.
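
The Separation tint evaluation mentioned above can be illustrated with a PDF Type 2 (exponential interpolation) function, which maps a tint `t` to `C0 + t^N * (C1 - C0)` in the alternate color space. This is a minimal sketch that assumes the tint transform is a Type 2 function; the helper name `eval_type2_tint` and the sample `c0` / `c1` values are hypothetical and not taken from the codexPDF source.

```python
from typing import List, Sequence


def eval_type2_tint(
    c0: Sequence[float], c1: Sequence[float], n: float, t: float
) -> List[float]:
    """Evaluate a PDF Type 2 function per component: C0 + t**N * (C1 - C0)."""
    t = min(max(t, 0.0), 1.0)  # clamp the tint to the function domain [0, 1]
    return [a + (t ** n) * (b - a) for a, b in zip(c0, c1)]


# Hypothetical spot ink: C0 is "no ink", C1 is the full-strength CMYK equivalent.
c0 = [0.0, 0.0, 0.0, 0.0]
c1 = [0.0, 0.5, 1.0, 0.0]

# Evaluating at t=1.0 yields the full-strength color, i.e. the swatch value.
swatch = eval_type2_tint(c0, c1, n=1.0, t=1.0)
```

Evaluating at full tint returns `C1` itself, which is why `t=1.0` is the natural choice for a swatch. Real tint transforms may also be sampled (Type 0) or PostScript calculator (Type 4) functions, which this sketch does not cover.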
Sparse extraction (1.18.0+)
When the caller sends X-Codex-Fields, the pipeline runs in sparse
mode: only the extractors needed for the requested fields execute.
The PyMuPDF structure pass (step 1) always runs. Heavier pikepdf
passes (color world, OCGs, forms, content-stream signals) and the AI
signal lane are each skipped unless a requested field depends on them.
See docs/contract.md
and docs/unified-extraction.md
for the full field→extractor mapping and HTTP example.
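
The sparse-mode selection described above amounts to a dependency lookup from requested fields to extractor passes. The sketch below is illustrative only: the pass names, the `FIELD_DEPENDENCIES` table, and the comma-separated header format are assumptions, and the authoritative field→extractor mapping is the one in the linked docs.

```python
# Hypothetical field -> heavy-pass mapping, for illustration only.
FIELD_DEPENDENCIES = {
    "pages": set(),
    "fonts": set(),
    "color_spaces": {"color_world"},
    "layers": {"ocgs"},
    "form_xobjects": {"forms"},
    "detected_language": {"ai_signals"},
}

# The PyMuPDF structure pass always runs, even in sparse mode.
ALWAYS_RUN = ("pymupdf_structure",)


def parse_fields_header(value):
    """Split an X-Codex-Fields value (assumed comma-separated) into field names."""
    return [part.strip() for part in value.split(",") if part.strip()]


def plan_passes(requested_fields):
    """Return the passes to run: the mandatory pass plus any needed heavy passes."""
    needed = set()
    for field in requested_fields:
        # Unknown fields contribute no extra passes in this sketch.
        needed |= FIELD_DEPENDENCIES.get(field, set())
    return list(ALWAYS_RUN) + sorted(needed)


passes = plan_passes(parse_fields_header("pages, layers"))
```

Requesting only `pages` and `layers` here schedules the structure pass and the OCG pass, while the color, forms, and AI signal passes are skipped.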
Primary contract
- Runtime model: `codex_pdf.models.v1.CodexDocument`
- Document schema: `schemas/v1/codex-document.schema.json`
- Section versions: `codex_pdf.color.COLOR_SCHEMA_VERSION`, `codex_pdf.geom.GEOM_SCHEMA_VERSION`
- Live manifest: `GET /v1/contract`
Deployed surface
In production, codex runs as three services sharing one
content-addressed cache namespace
(codex:{VERSION}:{kind}:{pdf_sha}:{args_sha}), so a VERSION
bump invalidates every tier atomically. The full deployed map —
URLs, account / service IDs, and the version-bump checklist —
lives in CLAUDE.md.
- codex-pdf API (Railway): FastAPI under gunicorn + uvicorn workers. Bearer + internal token auth. Backed by Redis for cache and blob storage.
- codex-speculator (Railway sidecar): a Redis-Streams consumer. `POST /v1/probe` and the blob-store put both XADD a sha onto the `codex:speculate` stream; the speculator runs Phase 1 + Phase 2 ahead of the next request so `/v1/extract` lands warm. Idempotent: the cache-hit short-circuit collapses replays to a single Redis GET.
- codex-edge (Cloudflare Worker + KV): drop-in DNS-level replacement that captures every probe / extract SSE frame and replays from KV on the next hash-keyed request. Multipart uploads bypass to origin. `ctx.waitUntil` keeps the Worker alive long enough to persist every frame before the response stream closes.
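
The shared key namespace and the speculator's idempotent short-circuit can be sketched as follows. This is a hedged illustration, not the deployed code: the use of SHA-256, the canonical-JSON hashing of arguments, the sample `version` value, and the helper names are all assumptions, and a plain `dict` stands in for Redis.

```python
import hashlib
import json


def cache_key(kind, pdf_bytes, args, version="1.18.0"):
    """Build a key shaped like codex:{VERSION}:{kind}:{pdf_sha}:{args_sha}."""
    pdf_sha = hashlib.sha256(pdf_bytes).hexdigest()
    # Canonical JSON (sorted keys) so argument order does not change the key.
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    args_sha = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"codex:{version}:{kind}:{pdf_sha}:{args_sha}"


def get_or_compute(cache, kind, pdf_bytes, args, compute):
    """Return the cached value if present; otherwise compute and store it."""
    key = cache_key(kind, pdf_bytes, args)
    if key in cache:  # cache hit: a replayed job costs one lookup
        return cache[key]
    cache[key] = compute(pdf_bytes, args)
    return cache[key]


calls = []


def fake_extract(pdf_bytes, args):
    """Stand-in for the real extraction work; records each invocation."""
    calls.append(1)
    return {"pages": 1}


cache = {}
first = get_or_compute(cache, "extract", b"%PDF-1.7", {"dpi": 72}, fake_extract)
second = get_or_compute(cache, "extract", b"%PDF-1.7", {"dpi": 72}, fake_extract)
```

Because the version is part of every key, changing it makes all previously written entries unreachable at once, which is the atomic invalidation the section describes.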
Optional retention layer
Codex 1.8+ adds an opt-in persistence branch on POST /v1/extract.
When the caller sends retain_for_training=true and the deploy is
wired to an S3-compatible bucket, the PDF + extract + a small
metadata object land under a hive-partitioned key
({prefix}/tenant=…/dt=…/sha256=…/). Default behaviour is
unchanged — bytes leave memory the moment the response ships. The
production deployment uses Cloudflare R2 with a 90-day bucket
lifecycle; see docs/deploy.md for the env contract
and CLAUDE.md for the live bucket layout.
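
A hive-partitioned prefix of the shape shown above could be assembled like this. The sketch is an assumption-laden illustration: the helper name, the ISO date format for `dt`, the sample tenant and prefix, and hashing the PDF bytes with SHA-256 are all hypothetical choices; the real env contract is in docs/deploy.md.

```python
import hashlib
from datetime import date


def retention_prefix(prefix, tenant, day, pdf_bytes):
    """Build {prefix}/tenant=.../dt=.../sha256=.../ for one retained document."""
    sha = hashlib.sha256(pdf_bytes).hexdigest()
    # dt uses ISO format (YYYY-MM-DD) here, which is an assumption.
    return f"{prefix.rstrip('/')}/tenant={tenant}/dt={day.isoformat()}/sha256={sha}/"


key_prefix = retention_prefix("training/", "acme", date(2024, 1, 2), b"%PDF-1.7")
```

Partitioning by tenant and date keeps per-tenant and per-day listings cheap, and keying the leaf on the content hash means re-uploading the same PDF on the same day lands on the same prefix.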
Consumer relationship
Downstream engines (lint-pdf, lens-pdf, marketing demos)
treat codex output as the source of truth for document facts and
keep any product-specific behaviour in adapter layers. New
products map to one owner per capability — see the "Service
boundary" and "Offshoot rule" sections of
CLAUDE.md.