(OCR / Data Extraction / Regex)

What the role asks: improve a rule-based data extraction system built on OCR output. That means tightening the parsing logic, handling messy, unstructured text, and shipping reliable regex-based rules that hold up on real-world edge cases.

Scope: design parsing, write and maintain regex extraction rules, handle noisy OCR output, improve robustness on edge cases.
Deliverables shown here: parsing and validation logic; fewer extraction errors (false positives and missing fields); clean, testable Python (see the repo tests); and configurable rules in rules/sample_receipt.yaml for a receipt-style document.
Stack: Python backend with OCR text from Tesseract. You can also paste text directly to test the rules without running OCR.
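
To make the scope above concrete, here is a minimal sketch of one regex rule that tolerates typical OCR noise. It is not the rule set from rules/sample_receipt.yaml, whose schema is not shown on this page; the pattern, the TOTAL_RULE and DIGIT_FIXES names, and the extract_total helper are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical rule: match a "total" label even when OCR swaps o/0 or l/1/I,
# then capture an amount that may itself contain letter-for-digit confusions.
# The leading \b keeps "SUBTOTAL" from matching (a cheap false-positive guard).
TOTAL_RULE = re.compile(
    r"(?i)\bt[o0]ta[l1I]\b\s*[:\-]?\s*[$€£]?\s*"
    r"([0-9OoSlI]{1,6}[.,][0-9OoSlI]{2})\b"
)

# Common OCR confusions, applied only to the captured numeric span so that
# ordinary words elsewhere in the text are left alone.
DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})


def extract_total(text: str) -> Optional[float]:
    """Return the receipt total as a float, or None if no rule matches."""
    match = TOTAL_RULE.search(text)
    if match is None:
        return None
    # Repair digits, then normalize a decimal comma to a decimal point.
    raw = match.group(1).translate(DIGIT_FIXES).replace(",", ".")
    return float(raw)


if __name__ == "__main__":
    noisy = "SUBTOTAL 8.00\nT0TAL: 1O.5O"
    print(extract_total(noisy))
```

Returning None instead of raising keeps a missing field distinguishable from a wrong one, which is useful when counting false positives and missing fields separately.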

Try the pipeline below: paste noisy text or upload a receipt-like image, then read the JSON result.

Interactive demo

Result JSON

{}
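
The empty object above is what the page shows before any input is processed. As a rough sketch of how a populated result could be assembled and validated, assuming a flat set of required fields, the following helper reports which fields are missing. The field names (merchant, date, total), the output keys, and build_result itself are hypothetical and are not taken from the repo.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical required fields for a receipt-style document.
REQUIRED_FIELDS = ("merchant", "date", "total")


def build_result(fields: Dict[str, Optional[Any]]) -> str:
    """Serialize extracted fields to JSON and flag any that are missing."""
    # A field counts as missing if it is absent or was extracted as None.
    missing = [name for name in REQUIRED_FIELDS if fields.get(name) is None]
    result = {
        "fields": {name: fields.get(name) for name in REQUIRED_FIELDS},
        "missing": missing,
        "valid": not missing,
    }
    return json.dumps(result, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(build_result({"total": 10.5, "date": "2024-01-31"}))
```

Listing missing fields explicitly, rather than silently omitting them, makes extraction gaps visible in the output and easy to assert on in tests.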