Delivery standards

This page defines named readiness levels and the concrete obligations that Processor and Designer must meet before data is consumed downstream. The levels are not always sequential: a project may target one or several, depending on the task.

| Level | Guarantee | Typical consumer |
| --- | --- | --- |
| AI-Ready | Text is usable by LLM/embedding models without heavy format noise; numbers, units, and inline math remain legible | Generative and embedding workflows |
| Structure-Aware Training Ready | Stable, closed Poros tags supply explicit semantic boundaries | Long-document, structure-aware training |
| Plain-Text Training Ready | Tag-free natural language stream | Pre-training, embeddings, retrieval without tag awareness |
| Data Mining Ready | Entities and attribute–value relations stay recoverable | Extraction, retrieval, knowledge graphs |

Shared rule: quality precedes structure; tags elaborate stable text rather than compensating for broken text.

Processor vs Designer: Processor ships the quality package (clean, stable, computable text, without final business records). Designer ships the structure package (views, indexes, multimodal linkage). Designer quality is bounded by Processor output, so Processor is expected to lower Designer's disambiguation cost rather than pass ambiguous attribute–value pairings downstream.

Processor obligations

Processor output should meet AI-Ready and Data Mining Ready minima across six dimensions:

| Dimension | Requirement |
| --- | --- |
| Text continuity | Repair OCR fragmentation in numbers, decimals, units, and element symbols |
| Entity clarity | Protect and stabilise materials, methods, instruments, and scientific names within a document |
| Numbers and units | Preserve magnitudes, decimals, superscripts, and unit notation |
| Formula safety | Repair only when structurally safe; never destroy math or chemistry boundaries |
| Citations | Normalise in-text numeric citations; do not rewrite reference lists or fuse citations into words |
| Metadata fields | Apply the same rigour to captions, table titles, and footnotes as to body text |

Implementation detail lives on the Processor page.
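
To make the text-continuity row concrete, here is a minimal sketch of one deliberately conservative repair: rejoining a decimal that OCR has split into `7 . 4`. The function name and the rule itself are illustrative assumptions, not the Processor's actual implementation. The pattern requires a space on both sides of the period so that ordinary sentence boundaries are left alone, in the spirit of "repair only when structurally safe".

```python
import re

# Hypothetical rule: digit, space, period, space, digit (e.g. "7 . 4").
# Requiring a space on BOTH sides avoids touching sentence boundaries
# such as "in 2019. 5 samples", where only one side has a space.
SPLIT_DECIMAL_RE = re.compile(r"(?<=\d) \. (?=\d)")


def rejoin_split_decimals(text: str) -> str:
    """Collapse 'digit . digit' into 'digit.digit'; leave everything else as-is."""
    return SPLIT_DECIMAL_RE.sub(".", text)
```

A rule this narrow will miss many fragmentation patterns (split units, element symbols, thousands separators); it only illustrates preferring a missed repair over a destructive one.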

Designer obligations

Designer writes per-document directories under the configured output root (default segment `Designed Database` under `data/`). The on-disk contract is defined in Dataset layout and expanded field-by-field in Output artefacts.

Deliverable groups (single folder per doc_id)

| View | Files | Role |
| --- | --- | --- |
| Full-text | `{doc_id}.content.json` (+ optional `{doc_id}.content.txt`) | `content` (Poros-tagged lines) and `pure_text_stream` (tag-free lines) |
| Datamining | `{doc_id}.structure.json` | Sections plus `formulas`, `chemical_formulas`, `asset_refs` (with generated `link` fields) |
| Multimodal | `{doc_id}.assets.index.json`, `images/*`, figure Markdown cards | Index rows, copied assets, mention/caption context |

For any given `doc_id`, consumers must be able to choose `content`, `pure_text_stream`, or `structure.json` and get views that do not contradict one another.
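
A quick presence check for the deliverables can be sketched as follows. The file-name patterns come from the table above; treating all three JSON files as required, and assuming the per-document folder is named after `doc_id` directly under the output root, are assumptions made for illustration. The authoritative contract is in Dataset layout.

```python
from pathlib import Path

# File-name patterns from the deliverables table. Treating all three as
# required is an assumption of this sketch.
REQUIRED_PATTERNS = (
    "{doc_id}.content.json",
    "{doc_id}.structure.json",
    "{doc_id}.assets.index.json",
)


def missing_deliverables(output_root: Path, doc_id: str) -> list[str]:
    """Return the expected file names that are absent for one document."""
    doc_dir = output_root / doc_id  # assumed folder naming
    expected = [pattern.format(doc_id=doc_id) for pattern in REQUIRED_PATTERNS]
    return [name for name in expected if not (doc_dir / name).is_file()]
```

An empty list means every expected file is present; it says nothing about whether the fields inside those files are complete.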

Poros tagging: coarse-grained and stability-first. The root `<poros_doc>…</poros_doc>` is mandatory. Section tags follow the vocabulary and fallbacks described in Poros tags (header, typed sections, other, subtitles by level). Inline tags stay focused (`poros_paragraph`, `poros_equ`, `poros_chem`, `poros_asset`, `poros_keywords`). Every tag must be properly nested and closed; section closers are strong boundaries but do not replace document EOS tokens.
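
The nest-and-close rule lends itself to a stack-based check. The sketch below is a hypothetical validator, not the project's own: it assumes every tag name starts with `poros_`, tolerates attributes without interpreting them, and does not handle self-closing forms, since this page does not say whether Poros uses them.

```python
import re

# Opening or closing tag whose name starts with "poros_" (assumed prefix).
TAG_RE = re.compile(r"<(/?)(poros_[a-z0-9_]+)[^>]*>")


def poros_errors(text: str) -> list[str]:
    """Return a list of nesting problems; an empty list means the tree is balanced."""
    errors: list[str] = []
    stack: list[str] = []
    tags = [(m.group(1) == "/", m.group(2)) for m in TAG_RE.finditer(text)]

    # The root element is mandatory and must be the first tag.
    if not tags or tags[0] != (False, "poros_doc"):
        errors.append("document does not open with <poros_doc>")

    for index, (is_close, name) in enumerate(tags):
        if not is_close:
            # After the first tag, an empty stack means we are outside the root.
            if index > 0 and not stack:
                errors.append(f"<{name}> opened outside the root")
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            errors.append(f"unexpected </{name}>")

    # Anything left on the stack was never closed.
    errors.extend(f"unclosed <{name}>" for name in reversed(stack))
    return errors
```

For example, `poros_errors("<poros_doc><poros_paragraph>a</poros_doc>")` reports the stray closer and both unclosed tags, while a well-formed document yields an empty list.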

View separation: `content` may contain `poros_*` tags; `pure_text_stream` (and the optional `.content.txt`) must contain none; `structure.json` encodes structure as fields, not as flattened pseudo-text; inline math delimiters stay balanced and structurally valid.
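
The no-leakage rule for the plain stream can be checked with a single pattern. This is a minimal sketch that assumes `pure_text_stream` is loaded as a list of strings (the table describes it as tag-free lines) and that every Poros tag name starts with `poros_`; the function name is hypothetical.

```python
import re

# Start of any opening or closing tag whose name begins with "poros_".
POROS_TAG_RE = re.compile(r"</?poros_[A-Za-z0-9_]*")


def leaked_lines(pure_text_stream: list[str]) -> list[int]:
    """Return zero-based indexes of lines that still contain a Poros tag."""
    return [
        index
        for index, line in enumerate(pure_text_stream)
        if POROS_TAG_RE.search(line)
    ]
```

Matching on the tag prefix rather than on any `<` keeps ordinary inequalities such as `x < y` from being reported as leakage.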

Acceptance and rejection

Accept when Processor passes the six dimensions without regression; Designer emits required files and fields per `doc_id`; Poros trees validate; `pure_text_stream` contains no tag leakage; `chemical_formulas` holds only chemical or material expressions; multimodal indexes reference files that exist on disk; body and metadata fields are treated consistently within a document.

Reject on persistent OCR damage or broken formulas/citations in Processor output; missing required files or fields; invalid or unbalanced Poros trees; tag leakage into plain streams; mismatched multimodal assets; inconsistent treatment across body vs captions vs footnotes.

Further reading: Dataset layout · Processor · Designer overview · Validation and audit · Data governance · Glossary