Examples
Copyable starting points for common delivery scenarios. Each block states when it applies, gives a minimal command, sketches layout, notes expected artefacts, and lists first checks if something fails.
These scenarios build on Quick Start: they add batch sizing, Designer-side validation hooks, and explicit ties to the export contract in Delivery standards. They assume you are already comfortable with the single-doc_id layout.
Example 1: Single-document trial
When to use — you need a first-pass quality check on one paper before scaling a batch.
Command
porosdata-processor cleaning \
  --input-dir "data/Raw Database" \
  --output-dir "data/Processed Database" \
  --max-workers 1
Input layout
Expected output
data/
└── Processed Database/
    ├── processing_report.json
    └── 00001/
        └── ... cleaned intermediates ...
You should see — one cleaned bundle per source folder plus a processing report suitable for quick review.
If it fails — verify Raw Database is complete, the process can write to Processed Database, and logs or processing_report.json do not report missing inputs.
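The first two failure checks above can be run before the command is launched. The following is a minimal sketch, not part of porosdata-processor: the function name `preflight` is hypothetical, and it assumes the raw database holds one sub-folder per source document. It does not parse processing_report.json, because the report's schema is not described on this page.

```python
import os
from pathlib import Path


def preflight(input_dir: str, output_dir: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    src = Path(input_dir)
    dst = Path(output_dir)

    # ASSUMPTION: the raw database contains one sub-folder per source document.
    if not src.is_dir():
        problems.append(f"input directory not found: {src}")
    elif not any(p.is_dir() for p in src.iterdir()):
        problems.append(f"no source folders under: {src}")

    # The output directory may not exist yet, so test write access on the
    # nearest ancestor that does exist.
    probe = dst
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    if not os.access(probe, os.W_OK):
        problems.append(f"cannot write to: {probe}")

    return problems
```

Calling `preflight("data/Raw Database", "data/Processed Database")` and printing the returned list gives a quick answer to "is the input there, and can the output be written" before any processing time is spent.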
Example 2: Batch processing
When to use — many papers must be normalised to a single Processed Database batch before design or handoff.
Command
porosdata-processor cleaning \
  --input-dir "data/Raw Database" \
  --output-dir "data/Processed Database" \
  --max-workers 4
Input layout
Expected output
You should see — one processed directory per document and a batch-level report.
If it fails — check folder naming consistency, whether --max-workers exceeds what the host tolerates, and whether the report marks skipped or failed items.
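To spot skipped documents and inconsistent folder names without reading the report, the input and output roots can be compared directly. This is a sketch under two assumptions that this page does not confirm: that each processed directory carries the same name as its source folder, and that document folders are expected to be purely numeric (as in 00001). The helper name `compare_batch` is hypothetical.

```python
from pathlib import Path


def compare_batch(input_dir: str, output_dir: str) -> dict[str, list[str]]:
    """Compare sub-folder names between the raw and processed roots."""
    raw = {p.name for p in Path(input_dir).iterdir() if p.is_dir()}
    done = {p.name for p in Path(output_dir).iterdir() if p.is_dir()}
    return {
        # Source folders with no processed counterpart (skipped or failed).
        "missing": sorted(raw - done),
        # Processed folders that do not correspond to any source folder.
        "unexpected": sorted(done - raw),
        # ASSUMPTION: folder names should be all digits, e.g. "00001".
        "non_numeric": sorted(name for name in raw if not name.isdigit()),
    }
```

An empty `missing` list suggests every source folder produced output; anything listed there is a good first place to look in the batch-level report.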
Example 3: Review-oriented delivery package
When to use — recipients need readable text, structured JSON, and multimodal linkage in one pass.
Commands
porosdata-processor cleaning \
  --input-dir "data/Raw Database" \
  --output-dir "data/Processed Database" \
  --max-workers 4
Input layout
Expected output
data/
└── Designed Database/
    ├── 00001/
    │   ├── 00001.content.json
    │   ├── 00001.structure.json
    │   ├── 00001.assets.index.json
    │   └── images/
    └── 00002/
        └── …
Deliverables — full-text views, plain streams where configured, structured JSON for mining, multimodal indexes, and traceability-oriented reports.
If it fails — confirm Processed Database is complete before invoking Designer; inspect whether each doc_id folder exists under the output root; resolve any quality flags in Processor output before treating the package as final.
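The per-folder inspection can be automated against the layout shown under Expected output. The sketch below checks each doc_id folder for the three JSON views and the images/ directory. The function name `missing_artefacts` is hypothetical, and treating images/ as mandatory for every document is an assumption; documents without figures may legitimately lack it.

```python
from pathlib import Path

# File suffixes taken from the expected-output listing for one doc_id.
EXPECTED_SUFFIXES = (".content.json", ".structure.json", ".assets.index.json")


def missing_artefacts(designed_root: str) -> dict[str, list[str]]:
    """Map each incomplete doc_id to the artefacts it lacks."""
    report = {}
    for doc_dir in sorted(Path(designed_root).iterdir()):
        if not doc_dir.is_dir():
            continue
        doc_id = doc_dir.name
        missing = [
            doc_id + suffix
            for suffix in EXPECTED_SUFFIXES
            if not (doc_dir / (doc_id + suffix)).is_file()
        ]
        # ASSUMPTION: every document is expected to ship an images/ folder.
        if not (doc_dir / "images").is_dir():
            missing.append("images/")
        if missing:
            report[doc_id] = missing
    return report
```

An empty dictionary from `missing_artefacts("data/Designed Database")` means every folder matches the listing; it says nothing about quality flags, which still need to be resolved in Processor output.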
Fine-tuning formats: orientation
There is no single canonical fine-tuning file format. Layouts differ with model family (encoder-only, causal decoder, encoder–decoder), task shape (single-turn chat, tool use, classification), training-framework habits, and the templates that surround a given base checkpoint.
A typical delivery ships three parallel views per document. Most training stacks need only a thin adapter—field renaming, chunking, or turn packing—to align with local conventions:
| Need | Suggested view |
|---|---|
| Plain pre-training, embeddings, retrieval | pure_text_stream in {doc_id}.content.json, or optional {doc_id}.content.txt |
| Structure-aware long-context training | content in {doc_id}.content.json (Poros tags intact) |
| Extraction, KG-style assembly, instruction data prep | {doc_id}.structure.json |
| Multimodal / figure-grounded training | {doc_id}.assets.index.json plus images/ and Markdown figure cards |
View contracts are specified in Delivery standards. A fuller catalogue of task-specific templates may appear in a later revision of this page.
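As an illustration of a thin adapter for the plain-text row of the table, the sketch below reads pure_text_stream from a {doc_id}.content.json file, splits it into overlapping character windows, and writes one JSON object per line. It assumes pure_text_stream is a top-level string field, which should be confirmed against Delivery standards. The function names, the output record keys (`doc_id`, `chunk`, `text`), and the default window sizes are hypothetical choices, not part of the delivery contract.

```python
import json
from pathlib import Path


def chunk_text(text: str, max_chars: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into windows of at most max_chars, sharing `overlap` chars."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    step = max_chars - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        # Stop once the current window reaches the end of the text, so no
        # trailing chunk is emitted that only repeats the overlap.
        if start + max_chars >= len(text):
            break
        start += step
    return chunks


def content_to_jsonl(content_path: str, out_path: str,
                     max_chars: int = 2000, overlap: int = 200) -> int:
    """Write chunked plain text as JSONL; return the number of records."""
    doc = json.loads(Path(content_path).read_text(encoding="utf-8"))
    # ASSUMPTION: pure_text_stream is a top-level string in the content view.
    text = doc["pure_text_stream"]
    doc_id = Path(content_path).name.removesuffix(".content.json")
    chunks = chunk_text(text, max_chars, overlap)
    with open(out_path, "w", encoding="utf-8") as fh:
        for index, chunk in enumerate(chunks):
            record = {"doc_id": doc_id, "chunk": index, "text": chunk}
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(chunks)
```

Character windows are used here only to keep the example dependency-free; a real training stack would more likely chunk by tokens with its own tokenizer and rename the fields to match local conventions.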
Related: Installation · Quick Start · End-to-end workflow · Processor · Designer · Home