Delivery standards

This page defines named readiness levels and the concrete obligations that Processor and Designer must meet before data is consumed downstream. The levels are not strictly sequential: a project may target one or several, depending on the task.

| Level | Guarantee | Typical consumer |
| --- | --- | --- |
| AI-Ready | Text is usable by LLM/embedding models without heavy format noise; numbers, units, and inline math remain legible | Generative and embedding workflows |
| Structure-Aware Training Ready | Stable, closed Poros tags supply explicit semantic boundaries | Long-document, structure-aware training |
| Plain-Text Training Ready | Tag-free natural language stream | Pre-training, embeddings, retrieval without tag awareness |
| Data Mining Ready | Entities and attribute–value relations stay recoverable | Extraction, retrieval, knowledge graphs |

Shared rule: quality precedes structure; tags elaborate stable text rather than compensating for broken text.

Processor vs Designer — Processor ships the quality package (clean, stable, computable text without final business records). Designer ships the structure package (views, indexes, multimodal linkage). Designer quality is bounded by Processor output; Processor is expected to lower disambiguation cost for Designer rather than leaving ambiguous attribute pairing.

Processor obligations

Processor output should meet AI-Ready and Data Mining Ready minima across six dimensions:

| Dimension | Requirement |
| --- | --- |
| Text continuity | Repair OCR fragmentation in numbers, decimals, units, and element symbols |
| Entity clarity | Protect and stabilise materials, methods, instruments, and scientific names within a document |
| Numbers and units | Preserve magnitudes, decimals, superscripts, and unit notation |
| Formula safety | Repair only when structurally safe; never destroy math or chemistry boundaries |
| Citations | Normalise in-text numeric citations; do not rewrite reference lists or fuse citations into words |
| Metadata fields | Apply the same rigour to captions, table titles, and footnotes as to body text |

Implementation detail lives on the Processor page.

Designer obligations

Designer writes per-document directories under the configured output root (default segment Designed Database under data/). The on-disk contract is defined in Dataset layout and expanded field-by-field in Output artefacts.

Deliverable groups (single folder per doc_id)

| View | Files | Role |
| --- | --- | --- |
| Full-text | {doc_id}.content.json (+ optional {doc_id}.content.txt) | content (Poros-tagged lines) and pure_text_stream (tag-free lines) |
| Datamining | {doc_id}.structure.json | Sections plus formulas, chemical_formulas, asset_refs (with generated link fields) |
| Multimodal | {doc_id}.assets.index.json, images/*, figure Markdown cards | Index rows, copied assets, mention/caption context |

Consumers must be able to choose content, pure_text_stream, or structure.json without contradicting one another for the same doc_id.
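
The deliverable groups above suggest a simple presence check per doc_id. The following is a minimal sketch, not the project's own validator: the function name `missing_deliverables` is hypothetical, and treating exactly these three JSON files as required is an assumption read off the table (the optional .content.txt, images/*, and figure cards are not checked).

```python
from pathlib import Path

# File-name suffixes taken from the deliverable table. Treating exactly
# these three as "required" is an assumption of this sketch.
REQUIRED_SUFFIXES = (".content.json", ".structure.json", ".assets.index.json")


def missing_deliverables(doc_dir: Path, doc_id: str) -> list[str]:
    """Return the required file names that are absent from doc_dir."""
    return [
        f"{doc_id}{suffix}"
        for suffix in REQUIRED_SUFFIXES
        if not (doc_dir / f"{doc_id}{suffix}").is_file()
    ]
```

An empty list means all three files are present; any entry names a file that would trigger the "missing required files" rejection.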

Poros tagging — coarse-grained and stability-first. Root <poros_doc>…</poros_doc> is mandatory. Section tags follow the vocabulary and fallbacks described in Poros tags (header, typed sections, other, subtitles by level). Inline tags stay focused (poros_paragraph, poros_equ, poros_chem, poros_asset, poros_keywords). Tags must nest and close; section closers are strong boundaries but do not replace document EOS tokens.
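
The "must nest and close" rule can be illustrated with a stack-based check. This is a sketch under stated assumptions: it treats every Poros tag name as starting with `poros_` (true of the inline tags listed above, assumed for section tags), assumes no self-closing tags, and uses the hypothetical name `poros_tree_problems`. The real tag grammar is the one described in Poros tags.

```python
import re

# Matches <poros_x ...> and </poros_x>. The poros_ prefix on every tag
# name is an assumption of this sketch.
TAG_RE = re.compile(r"<(/?)(poros_\w+)[^<>]*>")


def poros_tree_problems(text: str) -> list[str]:
    """Return nesting problems; an empty list means tags nest and close."""
    problems: list[str] = []
    stack: list[str] = []
    saw_root = False
    for match in TAG_RE.finditer(text):
        is_closer, name = match.group(1) == "/", match.group(2)
        if is_closer:
            if not stack:
                problems.append(f"</{name}> has no opener")
            else:
                if stack[-1] != name:
                    problems.append(f"</{name}> closes <{stack[-1]}>")
                stack.pop()
        else:
            if not stack:
                # Only a single <poros_doc> may open at the top level.
                if name == "poros_doc" and not saw_root:
                    saw_root = True
                else:
                    problems.append(f"<{name}> opened outside <poros_doc>")
            stack.append(name)
    problems.extend(f"<{name}> never closed" for name in stack)
    if not saw_root:
        problems.append("missing <poros_doc> root")
    return problems
```

For example, `poros_tree_problems("<poros_doc><poros_paragraph>text</poros_paragraph></poros_doc>")` returns an empty list, while dropping the closing `</poros_paragraph>` produces at least one reported problem.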

View separation: content may contain poros_* tags; pure_text_stream (and the optional .content.txt) must contain none; structure.json encodes structure as fields, not flattened pseudo-text; inline math delimiters stay balanced and structurally valid.
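
Two of these rules lend themselves to a quick mechanical check. The sketch below is illustrative only: it assumes pure_text_stream is a list of strings, that Poros tags all carry a `poros_` prefix, and that inline math is delimited by `$` (the page does not name the delimiter). The function names are hypothetical.

```python
import re

# Assumption: every Poros tag name starts with "poros_".
POROS_TAG_RE = re.compile(r"</?poros_\w+[^<>]*>")
# Assumption: inline math uses "$"; a backslash-escaped "\$" is literal.
MATH_DELIM_RE = re.compile(r"(?<!\\)\$")


def leaked_tags(pure_text_stream: list[str]) -> list[tuple[int, str]]:
    """Return (line_index, tag) for each Poros tag found in the plain stream."""
    return [
        (index, match.group(0))
        for index, line in enumerate(pure_text_stream)
        for match in POROS_TAG_RE.finditer(line)
    ]


def has_unbalanced_math(line: str) -> bool:
    """True when the line holds an odd number of unescaped $ delimiters."""
    return len(MATH_DELIM_RE.findall(line)) % 2 == 1
```

An empty result from `leaked_tags` is the condition the plain stream has to satisfy; `has_unbalanced_math` only counts delimiters and does not verify that the enclosed math is structurally valid.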

Acceptance and rejection

Accept when Processor passes the six dimensions without regression; Designer emits required files and fields per doc_id; Poros trees validate; pure_text_stream contains no tag leakage; chemical_formulas holds only chemical or material expressions; multimodal indexes reference files that exist on disk; body and metadata fields are treated consistently within a document.

Reject on persistent OCR damage or broken formulas/citations in Processor output; missing required files or fields; invalid or unbalanced Poros trees; tag leakage into plain streams; mismatched multimodal assets; inconsistent treatment across body vs captions vs footnotes.
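
The accept/reject rules amount to "accept only if no check reports a problem". A minimal sketch of that aggregation follows; the `Verdict` type, the `decide` function, and the check names used as dictionary keys are all hypothetical, and the individual checks are assumed to be run elsewhere.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    accepted: bool
    reasons: list[str] = field(default_factory=list)


def decide(check_results: dict[str, list[str]]) -> Verdict:
    """Accept only when every check returned an empty problem list."""
    reasons = [
        f"{check}: {problem}"
        for check, problems in check_results.items()
        for problem in problems
    ]
    return Verdict(accepted=not reasons, reasons=reasons)
```

Keeping the reasons alongside the verdict lets a rejected doc_id be traced back to the specific rule it failed.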

Further reading: Dataset layout · Processor · Designer overview · Validation and audit · Data governance · Glossary