Designer

Designer is the structured-delivery stage. It reads Processor-stabilised lists from the Processed Database (sourced from Parser output via Processor) and writes per-document JSON, plain text where applicable, figure assets, and small Markdown cards into the Designed Database. It is not a substitute for large-scale OCR repair; it assumes upstream text is already usable.

What lands on disk

The reference CLI defaults to a single tree under the project data/ directory. Each document is one folder named after doc_id:

data/Designed Database/{doc_id}/
├── {doc_id}.content.json      # structure-aware lines + plain stream
├── {doc_id}.structure.json    # datamining / section view
├── {doc_id}.assets.index.json # multimodal index (JSON array)
├── {doc_id}.content.txt       # optional plain export; used by audit when present
└── images/                    # copied figures; Markdown sidecars may sit here too

Override the base path with --output_dir on run / validate commands, or --root_dir on audit structured / validate delivery. The default directory name Designed Database comes from the package config constant DEFAULT_DESIGNED_OUTPUT_DIR_NAME.
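The layout above can be expressed as a small path helper. This is a minimal sketch, not part of the designer package: `designed_paths` and `DEFAULT_BASE` are hypothetical names, and the default base simply mirrors the directory shown in the tree.

```python
from pathlib import Path

# Assumption: mirrors the default tree shown above; not read from the package config.
DEFAULT_BASE = Path("data") / "Designed Database"


def designed_paths(doc_id: str, base_dir: Path = DEFAULT_BASE) -> dict:
    """Return the expected output paths for one document (hypothetical helper)."""
    doc_dir = Path(base_dir) / doc_id
    return {
        "dir": doc_dir,
        "content_json": doc_dir / f"{doc_id}.content.json",
        "structure_json": doc_dir / f"{doc_id}.structure.json",
        "assets_index": doc_dir / f"{doc_id}.assets.index.json",
        "content_txt": doc_dir / f"{doc_id}.content.txt",  # optional export
        "images": doc_dir / "images",
    }


if __name__ == "__main__":
    for name, path in designed_paths("00001").items():
        print(f"{name}: {path}")
```

Passing a different `base_dir` corresponds to overriding the base path on the command line.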

| View | Primary files | Typical use |
| --- | --- | --- |
| Full-text (tagged) | `{doc_id}.content.json` → `content` | Structure-aware training, review |
| Plain stream | `pure_text_stream` in the same JSON (and optional `.content.txt`) | Tag-free models and retrieval |
| Datamining | `{doc_id}.structure.json` | Sections, formulas, chemistry, asset refs |
| Multimodal | `{doc_id}.assets.index.json`, `images/`, `.md` cards | Figure–text linkage, layout metadata |
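As an illustration of how the four views map onto files, the sketch below loads them for one document folder. `load_views` is a hypothetical helper, not a designer API. It relies only on the key names `content` and `pure_text_stream` and on the file names listed above. The inner shape of each JSON value is not assumed, and files that a partial run did not produce come back as `None`.

```python
import json
from pathlib import Path


def _read_json(path: Path):
    """Return parsed JSON, or None when the file does not exist."""
    if not path.is_file():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


def load_views(doc_dir, doc_id: str) -> dict:
    """Collect the four views for one document folder (hypothetical helper)."""
    doc_dir = Path(doc_dir)
    content_doc = _read_json(doc_dir / f"{doc_id}.content.json") or {}
    return {
        "full_text": content_doc.get("content"),      # tagged, structure-aware view
        "plain": content_doc.get("pure_text_stream"),  # tag-free stream
        "datamining": _read_json(doc_dir / f"{doc_id}.structure.json"),
        "multimodal": _read_json(doc_dir / f"{doc_id}.assets.index.json"),
    }
```

For field-level guarantees, the Output artefacts page remains the reference; this sketch only shows where each view lives.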

Upstream expectations: numerics, units, terms, and material names should already be consistent; captions, table titles, and footnotes should be readable; and figure references should not have been broken earlier in the chain.

Design rules: structure must serve delivery rather than merely mimic print layout; section typing should stay stable within a document; assets must remain traceable to their source context; and a single pass should support reading, extraction, and review together.

How runs are wired

| Command | What it runs |
| --- | --- |
| `designer run all` | Text standardisation, then multimodal extraction |
| `designer run text` | Text path only (`content` / `structure` outputs) |
| `designer run multimodal` | Multimodal path only (expects lists and image paths the extractor can resolve) |

Discovery is driven by *_content_list.json anywhere under --input_dir; doc_id is the file stem with _content_list removed (for example 00001_content_list.json → 00001).
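The discovery rule can be sketched in a few lines. This is an approximation of the behaviour described above, not the package's own code: `discover` is a hypothetical name, and how the real CLI treats two files that resolve to the same doc_id is not specified here (in this sketch the later path wins).

```python
from pathlib import Path

SUFFIX = "_content_list"


def discover(input_dir) -> dict:
    """Map doc_id -> content-list path for every match under input_dir."""
    found = {}
    # rglob searches recursively, matching "anywhere under --input_dir".
    for path in sorted(Path(input_dir).rglob(f"*{SUFFIX}.json")):
        # The stem drops ".json"; removesuffix (Python 3.9+) drops "_content_list".
        doc_id = path.stem.removesuffix(SUFFIX)
        found[doc_id] = path
    return found
```

With a file at `some/nested/dir/00001_content_list.json`, the returned mapping contains the key `00001`.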

For exact flags, defaults, exit codes, and logging file patterns, see CLI reference. For field-level contracts, see Output artefacts. Poros tag vocabulary and nesting behaviour are in Poros tags. Operational checks are covered under Validation and audit.

Package entry points

| Entry | Notes |
| --- | --- |
| `designer` | Console script → `designer.cli:main` (`[project.scripts]` in the package) |
| `python -m designer` | Delegates to the same CLI via `designer/__main__.py` |

The install name and the import name are both designer, matching the Python package (not porosdata_designer).
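When driving Designer from another Python process, the module entry point can be invoked through the current interpreter. The sketch below only builds the argument list. `designer_command` is a hypothetical helper, and placing `--input_dir` and `--output_dir` after the `run` subcommand is an assumption to check against the CLI reference.

```python
import sys
from typing import List, Optional

RUN_MODES = {"all", "text", "multimodal"}


def designer_command(mode: str, input_dir: str,
                     output_dir: Optional[str] = None) -> List[str]:
    """Build an argv list for `python -m designer run <mode>` (hypothetical helper)."""
    if mode not in RUN_MODES:
        raise ValueError(f"unknown run mode: {mode!r}")
    cmd = [sys.executable, "-m", "designer", "run", mode, "--input_dir", input_dir]
    if output_dir is not None:
        # Overrides the default base path for the Designed Database tree.
        cmd += ["--output_dir", output_dir]
    return cmd


if __name__ == "__main__":
    # To execute for real: subprocess.run(cmd, check=True)
    print(designer_command("all", "data/Processed Database"))
```

Using `sys.executable` keeps the call inside the environment where the designer package is installed, which avoids depending on the console script being on PATH.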

Documentation map

Delivery standards — readiness levels and cross-stage obligations
CLI reference — commands, flags, logs, exit codes
Output artefacts — JSON and Markdown file contracts
Poros tags — tag set, section vocabulary, EOS
Configuration — code-level defaults (runtime/config.py)
Validation and audit — audit / validate subcommands
Multimodal — filenames, paths, known edge cases
Python API — public constructors and helpers
Troubleshooting — common failures and log messages

Limits and relationship with Processor

Designer cannot repair wholesale OCR failure. Where semantics stay ambiguous, it may prefer a usable export over an over-labelled one. Quality is bounded by Processor output: this stage is the organisation and export layer on top of that gate.