Quick Start
This page walks from installation to a configurable cleaner and file-oriented runs. The code examples after the install step assume porosdata-processor is importable in the active Python environment.
Install — from PyPI:
From a source tree:
git clone https://github.com/KiventYip/porosdata-processor.git
cd porosdata-processor
pip install -e .
Confirm the import:
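One way to check the import without raising on failure is to ask the import system whether the module can be located. The sketch below uses only the standard library. The module name `porosdata_processor` is taken from the examples on this page, and `is_importable` is a hypothetical helper name introduced for illustration.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the import system can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    name = "porosdata_processor"
    status = "found" if is_importable(name) else "not found in this environment"
    print(f"{name}: {status}")
```

If the module is reported as not found, the usual cause is that the install ran in a different environment from the one running the check.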
Default cleaning with TextCleaner — the class applies the default plugin pipeline and preserves LaTeX-style segments where configured:
from porosdata_processor import TextCleaner
cleaner = TextCleaner()
raw_text = "The energy equation is $E = mc^2$, involving α particles and β rays."
cleaned_text = cleaner.clean(raw_text)
print(cleaned_text)
# Example output: Greek letters normalised to LaTeX forms where rules apply
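To make "preserves LaTeX-style segments" concrete, here is a minimal stand-alone sketch of the idea: split the text on `$...$` spans, leave those spans alone, and convert Greek letters only in the text between them. This is not the library's implementation. The mapping table, the function name `greek_outside_math`, and the choice to wrap each converted letter in inline math are all assumptions made for illustration.

```python
import re

# Hypothetical mapping for illustration; the library's own table may differ.
GREEK_TO_LATEX = {"α": r"\alpha", "β": r"\beta", "γ": r"\gamma"}

# The capturing group makes re.split keep the $...$ segments in its result.
MATH_SEGMENT = re.compile(r"(\$[^$]*\$)")


def greek_outside_math(text: str) -> str:
    """Wrap Greek letters in inline math, leaving existing $...$ spans untouched."""
    parts = MATH_SEGMENT.split(text)
    for i, part in enumerate(parts):
        if i % 2 == 1:  # odd indices are the captured math segments
            continue
        for letter, command in GREEK_TO_LATEX.items():
            part = part.replace(letter, f"${command}$")
        parts[i] = part
    return "".join(parts)


raw_text = "The energy equation is $E = mc^2$, involving α particles and β rays."
print(greek_outside_math(raw_text))
```

With this sketch the `$E = mc^2$` span passes through unchanged, while α and β in the surrounding prose become `$\alpha$` and `$\beta$`. The actual output of `TextCleaner` depends on its configured rules.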
Custom plugin order — pass an explicit pipeline when you need citation normalisation, Greek-to-LaTeX conversion, or whitespace rules in a particular order:
from porosdata_processor import TextCleaner
cleaner = TextCleaner(pipeline=[
    "citation_rules",
    "greek_to_latex",
    "normalize_whitespace",
])
text = "See reference 【1】 for α particle data."
result = cleaner.clean(text)
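The notion of an ordered pipeline of named steps can be sketched as a registry of functions applied one after another. The step functions below are hypothetical stand-ins that borrow two of the plugin names used above. They are not the library's plugins, and `REGISTRY` and `run_pipeline` are illustrative names only.

```python
import re
from typing import Callable


# Hypothetical stand-ins for the named plugins; not the library's implementations.
def citation_rules(text: str) -> str:
    """Rewrite full-width citation brackets such as 【1】 as [1]."""
    return re.sub(r"【\s*(\d+)\s*】", r"[\1]", text)


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace to a single space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


REGISTRY: dict[str, Callable[[str], str]] = {
    "citation_rules": citation_rules,
    "normalize_whitespace": normalize_whitespace,
}


def run_pipeline(text: str, pipeline: list[str]) -> str:
    """Apply each named step in the order given."""
    for name in pipeline:
        text = REGISTRY[name](text)
    return text


text = "See reference 【 1 】   for α particle data."
print(run_pipeline(text, ["citation_rules", "normalize_whitespace"]))
```

Because each step receives the previous step's output, a later rule only ever sees text that earlier rules have already rewritten, which is why the order passed to `pipeline` can matter.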
LaTeX spacing options — optional clean_options tighten math spacing when safe:
from porosdata_processor import TextCleaner
cleaner = TextCleaner(clean_options={
    "clean_latex_math_spaces": True,
})
text = "Formula: $\\mathbf { X } + \\frac { a }{ b }$"
result = cleaner.clean(text)
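As a rough picture of what tightening math spacing can mean, the sketch below removes whitespace around braces, but only inside `$...$` segments. It is an assumption about the kind of transformation involved, not the behaviour of `clean_latex_math_spaces` itself. It makes no attempt to handle the cases that "when safe" implies, such as escaped braces (`\{`). The function names are hypothetical.

```python
import re

MATH_SEGMENT = re.compile(r"\$[^$]*\$")


def tighten_braces(match: re.Match) -> str:
    """Remove whitespace around an opening brace and before a closing brace."""
    segment = match.group(0)
    segment = re.sub(r"\s*\{\s*", "{", segment)  # "\mathbf { X" becomes "\mathbf{X"
    segment = re.sub(r"\s+\}", "}", segment)     # "X }" becomes "X}"
    return segment


def clean_math_spaces(text: str) -> str:
    """Apply brace tightening only inside $...$ segments."""
    return MATH_SEGMENT.sub(tighten_braces, text)


text = "Formula: $\\mathbf { X } + \\frac { a }{ b }$"
print(clean_math_spaces(text))
```

Whitespace after a closing brace is left alone, so the spaces around `+` survive. Text outside math segments is never touched.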
Single file on disk — read the file, clean the string, then write:
from pathlib import Path
from porosdata_processor import TextCleaner
cleaner = TextCleaner()
path = Path("document.md")
cleaned = cleaner.clean(path.read_text(encoding="utf-8"))
path.with_name("clean_document.md").write_text(cleaned, encoding="utf-8")
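The read, clean, write steps can be wrapped in a small helper that accepts any string-to-string callable, so the same function works with `cleaner.clean` or with a stand-in during testing. `clean_file` and its `prefix` parameter are hypothetical names, not part of the library.

```python
from pathlib import Path
from typing import Callable


def clean_file(src: Path, clean: Callable[[str], str], prefix: str = "clean_") -> Path:
    """Read src as UTF-8, apply clean, write a prefixed sibling file, return its path."""
    dst = src.with_name(prefix + src.name)
    dst.write_text(clean(src.read_text(encoding="utf-8")), encoding="utf-8")
    return dst


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "document.md"
        src.write_text("  some text  \n", encoding="utf-8")
        # str.strip stands in for cleaner.clean so the demo runs without the package.
        out = clean_file(src, str.strip)
        print(out.name, repr(out.read_text(encoding="utf-8")))
```

With the package installed, the call would look like `clean_file(Path("document.md"), TextCleaner().clean)`. The source file is left unmodified.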
Batch processing — for a full Raw Database tree, the CLI is the supported entry point (see Examples and CLI reference):
porosdata-processor cleaning \
--input-dir "data/Raw Database" \
--output-dir "data/Processed Database" \
--max-workers 4
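When the batch run is launched from a Python script instead of a shell, building the command as an argument list avoids quoting problems with directory names that contain spaces. The sketch below uses only the subcommand and flags shown above. `build_cleaning_command` is a hypothetical helper, and the actual launch is left as a comment so the block runs without the CLI installed.

```python
import shlex


def build_cleaning_command(input_dir: str, output_dir: str, max_workers: int) -> list[str]:
    """Assemble the CLI invocation as an argument list (no shell quoting needed)."""
    return [
        "porosdata-processor", "cleaning",
        "--input-dir", input_dir,
        "--output-dir", output_dir,
        "--max-workers", str(max_workers),
    ]


cmd = build_cleaning_command("data/Raw Database", "data/Processed Database", 4)
# shlex.join renders a copy-pasteable shell line with the paths quoted.
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```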
YAML configuration (optional) — project layouts often keep defaults under config/processing_config.yaml:
cleaning:
  convert_greek: true
  apply_rules: true
  apply_patterns: true
  normalize_whitespace: true
  remove_extra_spaces: true
processing:
  concurrency: 0  # 0: sequential
  verbose: true
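A quick shape check on the loaded configuration can catch indentation slips before a long run. The sketch below works on the plain mapping that a YAML loader would produce and assumes the keys nest under `cleaning` and `processing` as the section names suggest. The key names are copied from the sample, but the validation rules and the name `validate_config` are assumptions, not something the processor is known to enforce.

```python
# Keys copied from the sample config; the checks themselves are assumptions.
CLEANING_FLAGS = (
    "convert_greek",
    "apply_rules",
    "apply_patterns",
    "normalize_whitespace",
    "remove_extra_spaces",
)


def validate_config(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means the shape looks right."""
    problems = []
    cleaning = config.get("cleaning") or {}
    for key in CLEANING_FLAGS:
        if not isinstance(cleaning.get(key), bool):
            problems.append(f"cleaning.{key} should be true or false")
    processing = config.get("processing") or {}
    concurrency = processing.get("concurrency")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(concurrency, bool) or not isinstance(concurrency, int) or concurrency < 0:
        problems.append("processing.concurrency should be a non-negative integer")
    if not isinstance(processing.get("verbose"), bool):
        problems.append("processing.verbose should be true or false")
    return problems


# The mapping a YAML loader would be expected to produce for the sample config.
sample = {
    "cleaning": {key: True for key in CLEANING_FLAGS},
    "processing": {"concurrency": 0, "verbose": True},
}
print(validate_config(sample))
```

If PyYAML is available, `yaml.safe_load` on the file's text yields a mapping of this form. A file whose nested keys have lost their indentation loads with `cleaning` set to nothing, which this check reports as missing flags.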
Tokenisation or chunking settings, when used, may live in a separate config/token_config.yaml (model family, BOS/EOS markers, chunk size, overlap). Keep both files in step with the processor version you run, because field names can change between releases.
Where to read next — API Reference · Examples · Contributing (process for doc changes in this site)