Configuration and runtime
Runtime behaviour matches porosdata-processor 0.4.1. There is no checked-in processing_config.yaml or token_config.yaml; behaviour comes from Python types, environment variables, and CLI flags.
Where settings live
- `Config` dataclass: encoding, chunk sizes, extensions, Unicode form, `default_pipeline`, `debug_mode`, and related fields.
- `RuntimeConfig` / paths: environment variables `POROS_QUARANTINE_PATH`, `POROS_LOGS_PATH`, `POROS_TEMP_DATA_PATH`, `POROS_CACHE_PATH`.
- `DataProcessor` constructor: input/output directories, worker count, memory limit, evaluation switches, and related batch controls.
- `configure_runtime_env`: optional Hugging Face client defaults such as `HF_ENDPOINT`, `HF_HUB_TIMEOUT`, `HF_HUB_DISABLE_TELEMETRY`.
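
As a rough illustration of the environment-variable layer, the sketch below resolves the four `POROS_*` paths from a mapping. The function name `resolve_runtime_paths` and the fallback directory names are hypothetical; the package's real defaults and the shape of `RuntimeConfig` are not described on this page.

```python
import os
from pathlib import Path

# Hypothetical fallbacks, for illustration only. The real defaults used by
# the package are not documented here.
_FALLBACKS = {
    "POROS_QUARANTINE_PATH": "quarantine",
    "POROS_LOGS_PATH": "logs",
    "POROS_TEMP_DATA_PATH": "temp_data",
    "POROS_CACHE_PATH": "cache",
}


def resolve_runtime_paths(env=None):
    """Return a {variable name: Path} mapping, preferring values set in env."""
    env = os.environ if env is None else env
    # An unset or empty variable falls back to the illustrative default.
    return {name: Path(env.get(name) or fallback) for name, fallback in _FALLBACKS.items()}


if __name__ == "__main__":
    for name, path in resolve_runtime_paths().items():
        print(f"{name} -> {path}")
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the lookup easy to test.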
Config fields (schema-level summary)
| Field | Default | Role |
|---|---|---|
| `encoding` | `utf-8` | Text decoding |
| `chunk_size` | `4096` | Read chunk (bytes) |
| `max_file_size` | `100 * 1024 * 1024` | Per-file ceiling (bytes) |
| `supported_extensions` | `{.txt, .md, .tex, .latex, .json}` | Accepted suffixes |
| `unicode_normalization_form` | `NFC` | `unicodedata.normalize` target |
| `regex_precompilation` | `True` | Regex precompilation toggle |
| `debug_mode` | `False` | Per-step debug logging in `Pipeline` |
| `default_pipeline` | ordered list in code | Default cleaning step names |
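
The table can be read as a dataclass along the following lines. This is a stand-in named `ConfigSketch`, not the package's `Config` class: field names and defaults are copied from the table, `default_pipeline` is left out, and the type annotations are assumptions.

```python
import unicodedata
from dataclasses import dataclass, field


@dataclass
class ConfigSketch:
    """Illustrative mirror of the documented fields; not the real Config."""

    encoding: str = "utf-8"
    chunk_size: int = 4096                    # read chunk, in bytes
    max_file_size: int = 100 * 1024 * 1024    # per-file ceiling, in bytes
    supported_extensions: set = field(
        default_factory=lambda: {".txt", ".md", ".tex", ".latex", ".json"}
    )
    unicode_normalization_form: str = "NFC"   # passed to unicodedata.normalize
    regex_precompilation: bool = True
    debug_mode: bool = False


cfg = ConfigSketch()
# "e" followed by a combining acute accent composes to one code point under NFC.
composed = unicodedata.normalize(cfg.unicode_normalization_form, "e\u0301")
```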
Authoritative TOML rule-pack fields live with the package in docs/rules/rule_pack_schema.md (schema, version, defaults, per-rule id, priority, pattern, optional kind, phase, target, replacements, flags, and metadata).
Pipeline and rule packs
- The cleaner resolves step names to callables on the `Pipeline` object (`globals().get(name)` or an instance method); there is no separate plugin registry file.
- Declarative packs apply through helpers such as `apply_rule_pack` / `get_filtered_transform_rules`, filtered by `kind`, `phase`, `target`, and related columns.
- Pre-shield extras: `local_text_compression` runs before `Shield.protect` so raw `$...$` stays available.
- Post-shield extras: `citation_to_ref` and `term_consistency_mapping` also have post-shield passes where the implementation needs restored text.
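
The name-to-callable lookup described in the first bullet can be sketched as below. `PipelineSketch` and the two toy step bodies are hypothetical stand-ins that only borrow documented step names; the lookup order (module globals first, then instance methods) is an assumption.

```python
import re


def remove_extra_spaces(text: str) -> str:
    """Module-level step: collapse runs of two or more spaces."""
    return re.sub(r" {2,}", " ", text)


class PipelineSketch:
    """Toy resolver: a step name maps to a module-level function or a method."""

    def __init__(self, steps):
        self.steps = list(steps)

    def normalize_whitespace(self, text: str) -> str:
        """Instance-method step: tabs become spaces, trailing spaces are dropped."""
        lines = text.replace("\t", " ").split("\n")
        return "\n".join(line.rstrip(" ") for line in lines)

    def _resolve(self, name: str):
        func = globals().get(name)          # module-level callable, if any
        if callable(func):
            return func
        method = getattr(self, name, None)  # otherwise an instance method
        if callable(method):
            return method
        raise ValueError(f"Unknown pipeline step: {name!r}")

    def run(self, text: str) -> str:
        for name in self.steps:
            text = self._resolve(name)(text)
        return text


pipeline = PipelineSketch(["normalize_whitespace", "remove_extra_spaces"])
result = pipeline.run("a\tb   c  \n")
```

Because resolution is by name only, a misspelled step surfaces as a `ValueError` when the step is resolved, not as a silent no-op.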
Default ordered steps (1-based) follow Config.default_pipeline:
| # | Step | Role |
|---|---|---|
| 1 | `unicode_normalization` | NFC normalisation |
| 2 | `scientific_ocr_repair` | Science OCR repairs (`repair_ocr.toml` and friends); skipped when `apply_heavy_ocr_repair=False` |
| 3 | `term_consistency_mapping` | Terminology mapping (`normalize_terms.toml`, priorities P0–P2) |
| 4 | `patterns_cleaning` | Pattern passes plus full-width digit/letter folding |
| 5 | `metadata_signal_cleanup` | Metadata cleanup (`normalize_metadata.toml`) |
| 6 | `reference_symbol_cleanup` | Citation symbol hygiene |
| 7 | `citation_rules` | Inline citation rules (`normalize_citations.toml`) |
| 8 | `citation_to_ref` | Prose citation protocol (`ref[…]` when configured) |
| 9 | `document_numbering_rules` | Numbering and circled numerals |
| 10 | `greek_to_latex` | Greek Unicode → LaTeX macros |
| 11 | `normalize_whitespace` | Tabs and trailing spaces |
| 12 | `remove_extra_spaces` | Collapse repeated interior spaces |
Other callables (for example inline_citation_removal) exist for custom pipeline= lists.
Shield coverage
Shield.protect wraps, in order: fenced Markdown code; inline code; LaTeX environments; display math ($$…$$, \[…\], \(…\)); inline math $…$ with $$ disambiguation; and common \cmd{…} / \cmd[…] / \cmd forms. Citations are not a separate shield class—they are handled inside the pipeline and post-shield logic.
Placeholders use internal boundary prefixes/suffixes; restore walks placeholders in reverse with single-shot replace. Each clean() ends with clear() so workers can reuse one Shield instance safely.
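
The protect / restore / clear cycle can be illustrated with a much smaller shield. `ShieldSketch`, its placeholder format, and its four patterns are assumptions made for this example; the real `Shield` covers more constructs and uses its own internal boundaries.

```python
import re


class ShieldSketch:
    """Minimal placeholder shield: code and math spans only."""

    PREFIX, SUFFIX = "\x00SH", "\x00"  # hypothetical boundary markers
    PATTERNS = [
        re.compile(r"```.*?```", re.DOTALL),                 # fenced code
        re.compile(r"`[^`\n]+`"),                            # inline code
        re.compile(r"\$\$.*?\$\$", re.DOTALL),               # display math
        re.compile(r"(?<!\$)\$(?!\$)[^$\n]+?\$(?!\$)"),      # inline math
    ]

    def __init__(self):
        self._store = []

    def _stash(self, match):
        self._store.append(match.group(0))
        return f"{self.PREFIX}{len(self._store) - 1}{self.SUFFIX}"

    def protect(self, text: str) -> str:
        for pattern in self.PATTERNS:
            text = pattern.sub(self._stash, text)
        return text

    def restore(self, text: str) -> str:
        # Reverse order: a span stashed later may itself contain an earlier
        # placeholder, so the later span has to be put back first.
        for idx in range(len(self._store) - 1, -1, -1):
            token = f"{self.PREFIX}{idx}{self.SUFFIX}"
            text = text.replace(token, self._store[idx], 1)  # single-shot
        return text

    def clear(self) -> None:
        self._store.clear()


shield = ShieldSketch()
shielded = shield.protect("Use `x_1`  and $a  b$ and $$c  d$$.")
cleaned = re.sub(r" {2,}", " ", shielded)   # stand-in for the cleaning steps
restored = shield.restore(cleaned)
shield.clear()                               # leave the instance reusable
```

Only the unprotected prose has its double space collapsed; the spaces inside the math spans survive because the cleaning step never sees them.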
TextCleaner (high level)
Constructor parameters include pipeline, config, sentinel, strict_mode, and apply_heavy_ocr_repair (when False, the heavy pre-shield OCR chain is skipped, including streaming shortcuts).
- `clean(text)`: `None` becomes `""`; non-`str` input raises `TypeError`. The codebase does not ship documented `ProcessingError` / `ConfigurationError` types.
- `clean_stream(iterable, chunk_size=8192)`: lightweight chunk cleaning without cross-chunk shielding; treat it as a narrow tool, not a drop-in for full documents.
- `clean_file`: removed; read and decode the file yourself, then call `clean` with the resulting `str`.
- Class method `with_quality_assurance(...)`: builds an instance wired to `DataSentinel` (thresholds, quarantine path, strictness).
- Pipeline errors: unknown steps raise `ValueError`; `strict_mode=True` with a triggered sentinel may raise `RuntimeError` ("Quality circuit breaker…"). Non-strict failures roll back the failing step to its input and log a warning.
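
The input contract and the non-strict rollback can be sketched with a free function. `clean_sketch` is hypothetical and takes the steps as plain callables; re-raising the original exception in strict mode is an assumption, since this page only documents the sentinel-triggered `RuntimeError`.

```python
import logging

logger = logging.getLogger("cleaner_sketch")


def clean_sketch(text, steps, strict_mode=False):
    """Apply steps in order; a failing step is rolled back to its input."""
    if text is None:
        return ""
    if not isinstance(text, str):
        raise TypeError(f"expected str or None, got {type(text).__name__}")

    for step in steps:
        before = text
        try:
            text = step(text)
        except Exception as exc:
            if strict_mode:
                raise  # assumption: strict mode surfaces the failure as-is
            name = getattr(step, "__name__", repr(step))
            logger.warning("step %s failed (%s); keeping its input", name, exc)
            text = before
    return text
```

With `clean_file` gone, a caller would decode first, for example `clean_sketch(path.read_text(encoding="utf-8"), steps)`, so that the function receives a `str` and not raw bytes.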
In batch mode, Pipeline.audit_processing_result currently passes shield_instance=None, so sentinel branches that expect a live Shield for integrity stats do not run even when a sentinel is attached.
Batch I/O and limits
- Large JSON: files above `256 * 1024` bytes use `ijson` streaming when available; otherwise the loader falls back to `json.load` (install the `[batch]` extra if streaming matters).
- Atomic writes: outputs are written to `*.tmp` and renamed into place.
- Worker count: when unset, the runtime picks `min(cpu_count, floor(available_memory_mb / 512), 32)`.
- Timeouts: per-file default `600` s; overall budget scales with file count. Plain text blocks use `45` s without evaluation and `90` s with evaluation. Streamed JSON items call workers with `enable_evaluation=False`; metrics still flow through `MetricsEngine.consume_block` in-process. High-risk blocks (length ≥ 1200 or markers such as `$$`, `\begin{`, `\cfrac`) run isolated with a `120` s cap; on timeout the worker tries a light fallback, then the original text if that fails.
- Memory: RSS is tracked with `psutil` when present; crossing `memory_limit_mb` emits GC guidance rather than killing workers.
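
Three of the limits above are simple enough to restate as code. The function names are hypothetical, the marker list contains only the three markers named in the text, and the exact `.tmp` naming is an assumption; the arithmetic and the timeout values are taken from the bullets.

```python
import math
import os
from pathlib import Path

HIGH_RISK_MARKERS = ("$$", "\\begin{", "\\cfrac")  # only the markers listed above


def pick_worker_count(cpu_count: int, available_memory_mb: float) -> int:
    """min(cpu_count, floor(available_memory_mb / 512), 32), as documented."""
    return min(cpu_count, math.floor(available_memory_mb / 512), 32)


def block_timeout_seconds(text: str, enable_evaluation: bool) -> int:
    """120 s for high-risk blocks, else 90 s with evaluation or 45 s without."""
    if len(text) >= 1200 or any(marker in text for marker in HIGH_RISK_MARKERS):
        return 120
    return 90 if enable_evaluation else 45


def atomic_write_text(path: Path, data: str, encoding: str = "utf-8") -> None:
    """Write to a sibling .tmp file, then rename it over the target."""
    tmp_path = path.with_name(path.name + ".tmp")
    tmp_path.write_text(data, encoding=encoding)
    os.replace(tmp_path, path)  # replaces the target in a single step
```

Read literally, the worker formula yields 0 when less than 512 MB is available; whether the runtime clamps that to 1 is not stated here.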
processing_report.json and statuses
Per-file states include processing (initial), success, skipped (fresh output without --force-reprocess), and error (missing inputs, stream failures, timeouts, or hard exceptions).
When something fails, the runtime often keeps the original text (subprocess timeout path, direct-clean retry path, streamed-item fallback). Non-strict pipeline steps roll back to the previous step’s output; sentinel rollbacks can return the untouched source string.
Formula repair (naming)
Only the aggressive region (for example, the brace arguments of \mathrm, \mathbf, \mathsf, \mathit, \boldsymbol, \ce, and \unit) runs the recursive collapse helper. The sources do not use a "Zone-O" label; treat everything else as more conservative structural and numeric passes. Because iterations are capped, repairs are not guaranteed to reach a fixed point in pathological cases.
See also: CLI reference · Processor overview · Data governance