Dataset layout
File-level contract for the three-tier data/ model: what each directory holds, how doc_id is used, and how artefacts line up between stages. For the narrative path from install to handoff, see End-to-end workflow.
Top-level shape — flow is normally Raw Database → Processed Database → Designed Database (final delivery):
data/
├── Raw Database/ # Parser bundles, traceability anchor
├── Processed Database/ # Processor output
└── Designed Database/ # Designer output (default name; configurable via --output_dir)
    └── {doc_id}/
The default Designed Database directory name matches DEFAULT_DESIGNED_OUTPUT_DIR_NAME in the Designer package. You may point --output_dir at any writable root; documentation still refers to this default for clarity.
Raw Database/{doc_id}/ — {doc_id} is a five-digit zero-padded id (e.g. 00001). Expect the PDF, state.json, Parser output such as {doc_id}.md, {doc_id}_content_list.json, layout PDFs, and images/{sha256}.jpg. {doc_id}_content_list.json is the primary handoff artefact; images are keyed by hash and referenced from list items.
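The Raw Database contract above can be spot-checked with a short script. The following is a minimal sketch, not part of the Parser: the helper names (`format_doc_id`, `missing_raw_files`, `badly_named_images`) are hypothetical, only the files named explicitly above are checked, and the image pattern assumes lowercase hexadecimal SHA-256 digests.

```python
import re
from pathlib import Path

DOC_ID_RE = re.compile(r"\d{5}")                   # five-digit, zero-padded id
SHA256_JPG_RE = re.compile(r"[0-9a-f]{64}\.jpg")   # assumption: lowercase hex digest


def format_doc_id(n: int) -> str:
    """Render an integer as a five-digit zero-padded doc_id."""
    if not 0 <= n <= 99999:
        raise ValueError(f"doc_id out of range: {n}")
    return f"{n:05d}"


def missing_raw_files(raw_root: Path, doc_id: str) -> list[str]:
    """Return the expected file names that are absent from Raw Database/{doc_id}/."""
    if not DOC_ID_RE.fullmatch(doc_id):
        raise ValueError(f"not a five-digit doc_id: {doc_id!r}")
    bundle = raw_root / doc_id
    expected = ["state.json", f"{doc_id}.md", f"{doc_id}_content_list.json"]
    return [name for name in expected if not (bundle / name).is_file()]


def badly_named_images(raw_root: Path, doc_id: str) -> list[str]:
    """Return file names under images/ that do not look like {sha256}.jpg."""
    images = raw_root / doc_id / "images"
    if not images.is_dir():
        return []
    return sorted(
        p.name for p in images.iterdir() if not SHA256_JPG_RE.fullmatch(p.name)
    )
```

Calling `missing_raw_files(Path("data/Raw Database"), format_doc_id(1))` returns an empty list when the bundle for 00001 contains all three named files.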
Processed Database/ — processing_report.json plus per-document cleaned *_content_list.json and usually mirrored images/. Designer discovers lists recursively; point --input_dir at Processed Database (or any tree that contains the lists you intend to ship). Image paths inside each JSON are resolved relative to that list file’s directory.
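The discovery and path-resolution rules above can be sketched as follows. This is an illustration under stated assumptions rather than Designer's actual code: each content list is assumed to be a JSON array of objects, and the image field name `img_path` is hypothetical, so check it against your own lists.

```python
import json
from pathlib import Path

IMAGE_KEY = "img_path"  # hypothetical field name; verify against your content lists


def find_content_lists(input_dir: Path) -> list[Path]:
    """Recursively collect every *_content_list.json under input_dir."""
    return sorted(input_dir.rglob("*_content_list.json"))


def resolve_image_paths(list_file: Path) -> list[Path]:
    """Resolve each item's image path against the list file's own directory."""
    items = json.loads(list_file.read_text(encoding="utf-8"))
    base = list_file.parent
    return [
        base / item[IMAGE_KEY]
        for item in items
        if isinstance(item, dict) and item.get(IMAGE_KEY)
    ]
```

Because paths are resolved against the list file's directory, moving a list without its images/ folder leaves every resolved path dangling; a quick `Path.is_file()` pass over the result catches that before a Designer run.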
Designer output (Designed Database/ by default) — one folder per document:
data/Designed Database/{doc_id}/
├── {doc_id}.content.json
├── {doc_id}.structure.json
├── {doc_id}.assets.index.json
├── {doc_id}.content.txt # optional; used by audit when present
└── images/
    ├── fig_1.jpg # or image_{md5prefix}.jpg for long ids
    ├── fig_1.md # figure card (naming follows fig_id rules)
    └── …
content.json: line arrays content (Poros-tagged) and pure_text_stream (plain).
structure.json: datamining JSON (sections, formulas, chemistry, asset_refs with link).
assets.index.json: JSON array of multimodal rows (paths, captions, mentions, metadata).
See Designer output artefacts for field detail.
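A per-document folder can be checked against the layout above with a few lines of Python. This is a minimal sketch: `check_designed_folder` is a hypothetical helper, not a Designer or audit API, and treating images/ as always expected is an assumption (a document without figures may legitimately lack it).

```python
from pathlib import Path

REQUIRED_SUFFIXES = (".content.json", ".structure.json", ".assets.index.json")
OPTIONAL_SUFFIXES = (".content.txt",)  # used by audit when present


def check_designed_folder(output_root: Path, doc_id: str) -> dict[str, list[str]]:
    """Report missing required artefacts and optional ones that are present."""
    folder = output_root / doc_id
    missing = [
        f"{doc_id}{suffix}"
        for suffix in REQUIRED_SUFFIXES
        if not (folder / f"{doc_id}{suffix}").is_file()
    ]
    # Assumption: images/ is expected for every document.
    if not (folder / "images").is_dir():
        missing.append("images/")
    optional_present = [
        f"{doc_id}{suffix}"
        for suffix in OPTIONAL_SUFFIXES
        if (folder / f"{doc_id}{suffix}").is_file()
    ]
    return {"missing": missing, "optional_present": optional_present}
```

Pass whichever root was given to --output_dir as `output_root`; with the default layout that is `Path("data/Designed Database")`.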
Naming conventions
| Item | Rule | Example |
|---|---|---|
| doc_id | five-digit numeric | 00001 |
| Raw list | {doc_id}_content_list.json | 00001_content_list.json |
| Raw image | SHA-256 + .jpg | 07922a29…35e.jpg |
| Full-text JSON | {doc_id}.content.json | 00001.content.json |
| Datamining JSON | {doc_id}.structure.json | 00001.structure.json |
| Multimodal index | {doc_id}.assets.index.json | 00001.assets.index.json |
| Figure card / sidecar | under images/, often fig_{fig_id}.md | images/fig_1.md |
| Copied asset | under images/, fig_{fig_id}.* or hashed name | images/fig_1.jpg |
Onboarding: allocate the next doc_id and populate Raw Database/{doc_id}/. Run Processor and confirm that processing_report.json and Processed Database/{doc_id}/ exist. Run Designer and verify the per-doc_id folder under the chosen output root. Never rename a doc_id mid-pipeline.
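The allocation step can be sketched as below. The source does not say how the next doc_id is chosen, so this assumes it is one more than the highest existing five-digit folder under Raw Database/, starting at 00001; `next_doc_id` is a hypothetical helper, not a pipeline command.

```python
import re
from pathlib import Path


def next_doc_id(raw_root: Path) -> str:
    """Return the next unused doc_id, assuming 'next' means current maximum + 1."""
    used = [
        int(p.name)
        for p in raw_root.iterdir()
        if p.is_dir() and re.fullmatch(r"\d{5}", p.name)
    ]
    candidate = max(used, default=0) + 1
    if candidate > 99999:
        raise RuntimeError("five-digit doc_id space exhausted")
    return f"{candidate:05d}"
```

Taking the maximum rather than filling gaps means an id is never reused after a bundle is removed, which fits the rule that a doc_id must stay stable through the pipeline.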
See also: End-to-end workflow · Processor · Designer · Delivery standards · Glossary