Glossary
Shared vocabulary used across workflow, tool, and delivery pages so you can move between sections without reconciling conflicting terms.
Pipeline modules and data layers
| Term | Meaning |
|---|---|
| Parser | Extraction stage (implementation repo: gen-sci-data) that turns papers into reusable text blocks, figures, captions, and metadata under the Raw Database. |
| Processor | Quality stage that cleans Parser output into the Processed Database so downstream steps see stable text and fields. |
| Designer | Delivery stage that organises stable input into full-text, structured, and multimodal outputs under the Designed Database. |
| Raw Database | Layer retaining originals and Parser bundles for traceability (default root often data/Raw Database/; legacy docs may show a raw/ segment). |
| Processed Database | Layer holding cleaned intermediates and processing reports (default root often data/Processed Database/). |
| Designed Database | Designer output root: per-doc_id folders with *.content.json, *.structure.json, *.assets.index.json, and images/. Override the path with --output_dir. |
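
As a rough illustration of the Designed Database layout described above, the sketch below checks whether one per-doc_id folder contains the three JSON files and the images/ directory. The function name `missing_outputs` is hypothetical, and the sketch assumes the JSON files sit directly inside the folder; neither detail is taken from the Designer implementation.

```python
from pathlib import Path

# Glob patterns named in the glossary entry for the Designed Database.
EXPECTED_PATTERNS = ("*.content.json", "*.structure.json", "*.assets.index.json")


def missing_outputs(doc_dir: Path) -> list[str]:
    """Return the expected entries that are absent from one per-doc_id folder.

    Assumption: the JSON files sit directly inside the folder, next to images/.
    """
    # A pattern counts as present if at least one file matches it.
    missing = [p for p in EXPECTED_PATTERNS if not any(doc_dir.glob(p))]
    if not (doc_dir / "images").is_dir():
        missing.append("images/")
    return missing
```

An empty list means the folder has every entry the glossary names; anything else lists what a reviewer should look for.
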
Output and field concepts
| Term | Meaning |
|---|---|
| full_text | Human- and machine-readable full-document delivery (JSON + parallel plain text as documented). |
| datamining | Section-oriented structured JSON aimed at extraction, retrieval, and storage. |
| multimodal | Indexes and assets linking in-text figure references to image files and captions. |
| doc_id | Stable per-document identifier used across all layers. |
| Caption | Figure or table descriptive text, often critical for review and mining. |
| Asset anchor | Stable link between a mention in text and a concrete file in the delivery tree. |
| processing_report | Batch-level summary with status and quality signals for review. |
| Academic atomicity | Design rule: preserve scientific meaning, structure, and context during cleaning unless an explicit, reviewed rule says otherwise. |
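
To make the asset anchor idea more concrete, here is a minimal sketch of how such a link could be modelled. The class `AssetAnchor`, its field names, and the `resolve` helper are all hypothetical and chosen for illustration; the actual schema of *.assets.index.json is not specified in this glossary.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class AssetAnchor:
    """Hypothetical record; field names are illustrative, not the real schema."""

    doc_id: str      # stable per-document identifier
    mention: str     # the in-text reference, e.g. "Fig. 2"
    image_path: str  # path relative to the per-doc_id folder
    caption: str     # descriptive text for the figure or table


def resolve(anchor: AssetAnchor, designed_root: Path) -> Path:
    """Map an anchor to a file location under <designed_root>/<doc_id>/."""
    return designed_root / anchor.doc_id / anchor.image_path
```

Making the record immutable (`frozen=True`) mirrors the "stable link" wording: once a mention is tied to a file, the pairing is not expected to change within a delivery.
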
See also: Home · Quick Start