Reference
Unified Extraction — integration guide
This guide is for consumer services (preflight engines, viewers, batch import pipelines) wiring against the codex-pdf unified extraction API. It covers the surface, the cache-key contract, tenancy, rate-limiting, error shapes, and the per-stage telemetry that ships with every response.
Consumer-agnostic: nothing in this surface presumes a specific caller. The same endpoints serve lint-pdf, lens-pdf, compile-pdf, and any future consumer.
Endpoints in scope
| Verb / Path | Purpose | Cache key |
|---|---|---|
POST /v1/assets | Standard data request (1.20.0+). Ingest bytes or {sha256}; cache hit returns the document inline, miss enqueues a background pull. See data-requests.md. | (tenant, pdf_hash) |
GET /v1/assets/{pdf_hash} | Poll an ingested asset. | (tenant, pdf_hash) |
GET /v1/assets/{pdf_hash}/signals/{kind} | Cached AI signal by hash (alias of the documents signals endpoint). | (tenant, pdf_hash, kind) |
POST /v1/extract | First-stop. Returns the full CodexDocument. Add X-Codex-Fields header for sparse projection (1.18.0+). | (tenant, pdf_hash) — full only; sparse bypasses cache |
GET /v1/documents/{pdf_hash}/text-regions?page_index=N&dpi=N | Second-stop. One page's detected regions, in PDF user-space points. | (tenant, pdf_hash, page_index, dpi) |
POST /v1/documents/{document_id}/conformance/{profile} | Compute (or fetch from cache) a conformance verdict. | (tenant, pdf_hash, profile) |
GET /v1/documents/{pdf_hash}/renders | List (page_index, dpi, color_space) tuples already in the render cache. | n/a (it's the index) |
The first-stop / second-stop split is intentional. /v1/extract
returns everything codex knows; consumers cherry-pick. Per-resource
endpoints let consumers that already have the codex doc ask for
exactly the slice they need without an extract-then-discard round
trip.
New consumers should start with /v1/assets (requestAsset in
both clients) rather than calling /v1/extract directly: it adds the
cache-hit-inline / miss-pulls-in-background contract on top of the same
idempotent extract. /v1/extract remains the lower-level primitive.
The canonical pattern, response shapes, and the viewer backfill flow
are documented in data-requests.md.
Cache-key contract
Cache keys are part of the public contract — stable across releases:
- text-regions:
(pdf_hash, page_index, dpi) - conformance:
(pdf_hash, profile) - render:
(pdf_hash, page_index, dpi, color_space)
The codex implementation also scopes by tenant (see below) but the tenant component is transparent to most consumers and isn't part of the contract the caller cares about — it's a server-side isolation knob.
Tenancy
Every request can carry an X-Codex-Tenant header. The server:
- Normalises the value (
[a-z0-9][a-z0-9-]{0,62}; falls back to"default"for missing/invalid). - Scopes the cache lookup, the blob store, and the renders index by tenant.
A hash uploaded by Tenant A is invisible to Tenant B even if B learns the hash. The 412 message on a blob miss is intentionally identical for "wrong tenant" and "expired" — probing isn't informative.
# Python client
from codex_pdf.client import HttpClient
client = HttpClient(
base_url="https://codex.example.com",
bearer_token="…",
tenant="acme-corp", # surfaces as X-Codex-Tenant on every request
)
// TypeScript client
import { HttpClient } from "@printwithsynergy/codex-client";
const client = new HttpClient({
baseUrl: "https://codex.example.com",
bearerToken: "…",
tenant: "acme-corp", // surfaces as X-Codex-Tenant on every request
});
Both clients also read the tenant from the CODEX_TENANT env when
the option is omitted.
Rate limiting
Compute-and-cache POSTs (/v1/extract, render, sample, walk,
conformance) consult an in-process token bucket per
(tenant, endpoint). Bucket exhausted → 429 Too Many Requests
with a Retry-After header in seconds.
Both bundled clients honour Retry-After and back off
automatically; consumers using raw HTTP should do the same.
Operator knobs (env, codex-pdf service):
| Variable | Default | Purpose |
|---|---|---|
CODEX_RATE_LIMIT_RPM | 120 | Refills per minute |
CODEX_RATE_LIMIT_BURST | 30 | Bucket size |
CODEX_RATE_LIMIT_DISABLED | false | Off-switch |
The limiter is in-process and per-replica. Multi-replica fleets
see effective limit N × rpm.
Error shapes
Every 4xx/5xx response uses the shared envelope:
{ "detail": "human-readable message" }
The new endpoints document their per-status shapes in OpenAPI
under responses=:
400 Bad Request— invalidpdf_hash,page_index,dpi, or unknown conformance profile.404 Not Found— no PDF cached for(tenant, document_id). Upload via/v1/extractfirst.429 Too Many Requests— rate limit exceeded.Retry-Afterheader carries the wait in seconds.
Stage telemetry
Every response carries per-stage wall-clock timing in two places:
- Response envelope:
stage_durations_ms: { stage: int_ms }. - Response header:
X-Codex-Stage-Durations-Ms(same dict serialised as JSON).
The header is there for transports that strip envelope bodies (in-process clients, mocks). Both clients back-fill the envelope from the header when present.
Initial stage names:
extract— full CodexDocument parse.render— page render.text_regions— detected text regions per page.conformance— verdict compute for one profile.
New stage names are non-breaking: consumers must treat unknown keys as opaque.
Observability
Prometheus metrics on the codex-pdf service (/metrics):
| Metric | Type | Labels |
|---|---|---|
codex_api_requests_total | Counter | endpoint, status |
codex_api_request_seconds | Histogram | endpoint |
codex_api_cache_lookups_total | Counter | endpoint, outcome (hit/miss) |
codex_api_stage_seconds | Histogram | stage |
The stage histogram observes the same numbers consumers see in
stage_durations_ms. Cache hit rate per endpoint = rate(codex_api_cache_lookups_total{outcome="hit"}[5m]) / rate(codex_api_cache_lookups_total[5m]).
Conformance — supported profiles
| Profile | Notes |
|---|---|
pdfx4 | OutputIntent + Trapped + PDF ≥1.4 + XMP pdfxid |
pdfx1a | OutputIntent + Trapped + PDF=1.3 |
pdfx3 | OutputIntent + Trapped + PDF ≥1.3 |
pdfa1b / pdfa2b / pdfa3b | XMP present + not encrypted + correct pdfaid:part |
pdfua1 | XMP present + pdfuaid + non-empty Title |
The profile enum is forward-compatible. Consumers must treat
unknown profile strings (e.g. a future pdfx6, pdfa4) as
opaque so an older client doesn't break against a newer server.
Clause coverage is the minimum-viable set in the rc.x series. Full ISO coverage lands in later phases; the framework is registry- driven, so new clauses are additive only.
AI signals (1.11.0 – 1.15.0)
Codex 1.11.0 lit up the AI Signal contract frozen in 1.10.0; subsequent 1.x releases iterated on it:
| Release | Change |
|---|---|
| 1.11.0 | Six extractors wired behind CODEX_AI_ENABLED. |
| 1.12.0 | codex-vision-sidecar (CODEX_VISION_URL) — optional CPU CV lane. |
| 1.13.0 | ai_model_versions on /v1/contract + codex_ai_signal_calls_total Prometheus metric. |
| 1.14.0 | Per-tenant entitlements (CODEX_AI_TENANTS_ALLOWLIST / DENYLIST) + ai_tenant_excluded warning. |
| 1.15.0 | Dieline-candidate / dieline-size reconciliation: bbox-based geometry detection now synthesises a candidate so dieline.count agrees with dieline.size. |
The extracted CodexDocument carries six AI signal surfaces:
| Field | Scope | Backend | Purpose |
|---|---|---|---|
detected_language | per page | Claude Haiku (text) | BCP-47 tag + confidence. |
detected_logos | per page | Claude Sonnet (vision) | Brand identity + bbox in PDF user-space points. |
detected_symbols | per page | Claude Sonnet (vision) | Regulatory / safety / sustainability symbols (GHS, recycling, FDA, CE, ™, ©, etc.). |
detected_barcodes | per page | pyzbar + pylibdmtx (CPU) | Decoded value + format + bbox. No Claude cost. |
spell_candidates | per page | Claude Haiku (text) | Suspect words for lint-pdf's tenant spell rule. |
document_classification | document | Claude Haiku (text) | Probability map ({"label": 0.7, "folding_carton": 0.2}). |
The dedicated endpoint GET /v1/documents/{pdf_hash}/signals/{kind}
returns the same shapes scoped to one signal kind, so consumers can
re-fetch a single signal without re-running the full extract. Pass
?page_index=N for page-scoped kinds (language, logos,
symbols, barcodes, spell); classification is document-scoped
so the parameter is ignored.
Codex emits a structured CodexWarning on every /v1/extract
response describing the AI lane's state:
Warning code | When |
|---|---|
ai_disabled | Operator gate (CODEX_AI_ENABLED) is off. |
ai_skipped | Caller sent X-Codex-Skip-AI: true. |
ai_tenant_excluded | Operator opted in but the requesting tenant is gated out by CODEX_AI_TENANTS_ALLOWLIST / DENYLIST (1.14.0 +). |
ai_missing_credentials | Operator opted in but anthropic SDK isn't importable or ANTHROPIC_API_KEY is unset. |
ai_tier | Advisory — AI ran. message carries cpu+claude or gpu plus the realised dollar spend. |
ai_budget_exceeded | Per-request cost cap (CODEX_AI_COST_CAP_USD_PER_REQUEST, default $0.10) was hit mid-extract. |
See policies.md for the full
warning catalogue, cache-key contract, and the two-backend
(CPU + Claude default vs optional GPU) policy.
Sparse field projection (1.18.0+)
Pass X-Codex-Fields: <comma-separated fields> on POST /v1/extract
to run only the extractors needed for the requested fields and receive
only those fields in the response. Both latency and payload size shrink
proportionally to the number of extractors skipped.
The fitz structure pass (page count, boxes, fonts summary) always runs; only the heavier pikepdf passes and the AI signal lane are gated.
Field → extractor mapping
| Requested field | Extractors skipped when absent |
|---|---|
color_spaces / spot_colors | pikepdf colour-world pass |
detected_barcodes | pyzbar + pylibdmtx AI lane |
detected_language | Claude Haiku language AI lane |
detected_logos | Claude Sonnet vision AI lane |
detected_symbols | Claude Sonnet vision AI lane |
document_classification | Claude Haiku classification AI lane |
spell_candidates | Claude Haiku spell AI lane |
ocgs | pikepdf OCG pass |
form_xobjects | pikepdf forms pass |
analysis | pikepdf content-stream signals pass |
fonts | PyMuPDF fonts sub-pass |
images | PyMuPDF images sub-pass |
annotations | PyMuPDF annotations sub-pass |
Omitting X-Codex-Fields returns the full document (unchanged
default behaviour — no breaking change).
Caching
Sparse responses are not cached — field sets vary too much for content-addressed cache keys to be useful. Full-extract responses remain cached as before.
Example
POST /v1/extract HTTP/1.1
Content-Type: application/pdf
Authorization: Bearer <token>
X-Codex-Fields: detected_barcodes, color_spaces
<pdf bytes>
// TypeScript client (1.17.0+)
import { HttpClient } from "@printwithsynergy/codex-client";
const client = new HttpClient({ baseUrl, bearerToken });
const doc = await client.extract(pdfBytes, {
fields: ["detected_barcodes", "color_spaces"],
});
// doc contains only color_spaces + detected_barcodes + core metadata
# Python — raw header
import httpx
r = httpx.post(
"https://codex.example.com/v1/extract",
content=pdf_bytes,
headers={
"Content-Type": "application/pdf",
"Authorization": f"Bearer {token}",
"X-Codex-Fields": "detected_barcodes,color_spaces",
},
)
doc = r.json()
End-to-end example
from codex_pdf.client import HttpClient
client = HttpClient(
base_url="https://codex.example.com",
bearer_token="…",
tenant="acme-corp",
)
# First stop — full payload, includes detected text regions per page.
doc = client.extract(pdf_bytes)
sha = doc["pdf_sha256"]
# Second-stop re-fetch — one page only, cache-hit on second call.
regions_page_0 = client.text_regions(sha, page_index=0, dpi=150)
print(len(regions_page_0["regions"]))
# Compute a verdict; cached on the server.
verdict = client.conformance(sha, "pdfx4")
print(verdict["passed"], verdict["clauses"])
# What renders already exist in the cache?
print(client.list_renders(sha)["renders"])
import { HttpClient } from "@printwithsynergy/codex-client";
const client = new HttpClient({
baseUrl: "https://codex.example.com",
bearerToken: "…",
tenant: "acme-corp",
});
const doc = await client.extract(pdfBytes);
const sha = doc.pdf_sha256;
const regions = await client.getTextRegions(sha, { pageIndex: 0, dpi: 150 });
const verdict = await client.computeConformance(sha, "pdfx4");
const renders = await client.listRenders(sha);
Versioning
Schema version (the codex-document contract) and package version move on different cadences:
- Schema version (
schema_versionin the payload) — only bumped when the CodexDocument contract changes. - Package version (
pyproject.toml/package.json) — bumped on every release. Pre-release tags (rcN) signal in-flight phases.
The cache-key version segment ({VERSION} in
codex:{VERSION}:{kind}:{tenant}:{pdf_sha}:{args_sha}) tracks the
package version so a deploy that bumps either dimension invalidates
the cache atomically.