End-to-end workflow

This page traces the delivery path from literature inputs to packages you can use for training, structured extraction, retrieval, and review. It stays at the level of stages, directories, and what to inspect; exact commands live on the installation and reference pages.

Pipeline in one line — sources land in the Raw Database via Parser, are stabilised into the Processed Database by Processor, then shaped into delivery views in the Designed Database by Designer:

PDF literature -> Parser -> Raw Database -> Processor -> Processed Database -> Designer -> Designed Database

Stage	Primary work	What you get
Parser	Extract blocks, figures, captions, light metadata	Raw Database: reusable raw text and image assets
Processor	Remove noise, repair fragmentation, normalise expressions	Processed Database: intermediate results and batch reports
Designer	Organise sections and export views	Designed Database: per-`doc_id` folders by default (override with `--output_dir`)

Directory contract — the usual flow is Raw Database → Processed Database → Designed Database:

data/
├── Raw Database/         # Parser / upstream extraction packages
├── Processed Database/   # Processor output
└── Designed Database/    # Designer output (default folder name)
    └── {doc_id}/
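
A quick way to confirm the directory contract before a run is a small layout check. The sketch below is illustrative only: the helper name `missing_layers` and the `data_root` argument are assumptions made for this example, not part of the toolkit, and the folder names are taken from the tree above.

```python
from pathlib import Path

# Layer folder names as listed in the directory contract above.
LAYERS = ("Raw Database", "Processed Database", "Designed Database")


def missing_layers(data_root):
    """Return the layer folder names that do not exist under data_root.

    Hypothetical helper for illustration; not a toolkit API.
    """
    root = Path(data_root)
    return [name for name in LAYERS if not (root / name).is_dir()]
```

Calling `missing_layers("data")` returns an empty list when all three layers are present. If Designer was run with a custom `--output_dir`, the default `Designed Database` folder may legitimately be absent.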

Raw Database holds the upstream bundle per document: PDFs, Parser-generated Markdown or lists, page-level intermediates, and extracted images. This layer exists for traceability and re-runs, not as a final deliverable.

Processed Database is where Processor writes cleaned *_content_list.json, mirrored images where configured, and a batch summary such as processing_report.json. Typical repairs address broken numerals and units, corrupted terms, noisy captions or footnotes, and unstable citation or formula boundaries.

Designed Database (or your custom --output_dir) holds one subdirectory per doc_id with the JSON views, optional plain text, and an images/ subtree. Typical files:

data/Designed Database/00001/
├── 00001.content.json
├── 00001.structure.json
├── 00001.assets.index.json
└── images/

Files	Role
`{doc_id}.content.json`	Poros-tagged `content` plus `pure_text_stream` line arrays
`{doc_id}.structure.json`	Data-mining JSON (`sections`, formulas, asset refs)
`{doc_id}.assets.index.json`	Multimodal index (JSON array) and pointers into `images/`
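
The three views in the table above can be loaded and sanity-checked with a few lines of standard-library Python. This is a minimal sketch under stated assumptions: it assumes `content` and `pure_text_stream` are top-level keys of the content view, that `sections` is a top-level key of the structure view, and that the folder name equals the `doc_id`. The function names are hypothetical and the real schemas may carry more fields than are checked here.

```python
import json
from pathlib import Path


def load_designed_views(doc_dir):
    """Load the three JSON views from one per-document folder.

    Assumes the folder name is the doc_id, as in the example tree.
    """
    doc_dir = Path(doc_dir)
    doc_id = doc_dir.name
    views = {}
    for suffix in ("content", "structure", "assets.index"):
        path = doc_dir / f"{doc_id}.{suffix}.json"
        with path.open(encoding="utf-8") as fh:
            views[suffix] = json.load(fh)
    return views


def check_views(views):
    """Return a list of problems, using only the keys named in the table.

    Key placement (top level) is an assumption of this sketch.
    """
    problems = []
    content = views.get("content")
    if not isinstance(content, dict):
        problems.append("content.json: expected a JSON object")
    else:
        for key in ("content", "pure_text_stream"):
            if key not in content:
                problems.append(f"content.json: missing key {key!r}")
    structure = views.get("structure")
    if not isinstance(structure, dict) or "sections" not in structure:
        problems.append("structure.json: missing key 'sections'")
    if not isinstance(views.get("assets.index"), list):
        problems.append("assets.index.json: expected a JSON array")
    return problems
```

For example, `check_views(load_designed_views("data/Designed Database/00001"))` returns an empty list when all three files match the shapes assumed above.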

What a completed run should provide — per-document folders under the chosen output root, plain-text streams where configured, structured JSON for mining, multimodal indexes that resolve figure mentions to assets, and reports that support batch review and traceability.

How to review results efficiently — (1) confirm Raw Database is complete for each doc_id, (2) read processing_report.json and spot-check Processed Database for regressions, (3) open the matching {doc_id} folder under Designed Database for your downstream task (reading vs mining vs multimodal).
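
Steps (2) and (3) of the review can be partly scripted. The sketch below is a hedged illustration, not a toolkit feature: it searches for `processing_report.json` anywhere under the Processed Database folder (its exact location is an assumption), lists only the report's top-level keys because the report schema is not described here, and flags per-document folders that lack any of the typical files shown earlier. All function names are hypothetical.

```python
import json
from pathlib import Path

# Typical per-document files, taken from the example tree on this page.
EXPECTED_SUFFIXES = ("content.json", "structure.json", "assets.index.json")


def find_reports(processed_root, name="processing_report.json"):
    """Locate batch reports anywhere under the Processed Database folder."""
    return sorted(Path(processed_root).rglob(name))


def report_top_level_keys(report_path):
    """Return the report's top-level keys without assuming its schema."""
    with Path(report_path).open(encoding="utf-8") as fh:
        data = json.load(fh)
    return sorted(data) if isinstance(data, dict) else []


def missing_designed_files(designed_root):
    """Map each doc_id folder to the expected files or folders it lacks."""
    gaps = {}
    for doc_dir in sorted(Path(designed_root).iterdir()):
        if not doc_dir.is_dir():
            continue
        doc_id = doc_dir.name
        missing = [
            f"{doc_id}.{suffix}"
            for suffix in EXPECTED_SUFFIXES
            if not (doc_dir / f"{doc_id}.{suffix}").is_file()
        ]
        if not (doc_dir / "images").is_dir():
            missing.append("images/")
        if missing:
            gaps[doc_id] = missing
    return gaps
```

An empty dictionary from `missing_designed_files("data/Designed Database")` means every document folder has the three JSON views and an `images/` subtree. It says nothing about content quality, so spot-checking remains necessary.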

Typical motivations — high-quality training text, stable inputs for rule-based or learned extraction, explicit links between body text and figures, and a single package shape for storage or knowledge organisation.

See also: Installation · Quick Start · Examples · Parser · Processor · Designer · Dataset layout