(OCR / Data Extraction / Regex)
What the role asks: improve a rule-based data extraction system built on top of OCR. That means tightening parsing logic, working with messy, unstructured text, and shipping reliable regex-based rules that hold up on real-world edge cases.
- Scope: design parsing, write and maintain regex extraction rules, handle noisy OCR output, improve robustness on edge cases.
- Deliverables shown here: parsing and validation logic, fewer extraction errors (false positives and missing fields), clean, testable Python (see repo tests), and configurable rules in rules/sample_receipt.yaml (receipt-style document).
- Stack: Python backend, OCR text from Tesseract (or paste text to test rules alone).
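To make the scope above concrete, here is a minimal sketch of the regex-plus-validation idea applied to noisy OCR text. It is not the repo's actual code: the `RULES` dict, the field names (`date`, `total`), the `DIGIT_FIXES` table, and the day-first date assumption are all hypothetical stand-ins for whatever rules/sample_receipt.yaml really defines.

```python
import json
import re
from datetime import date

# Hypothetical rules; a real setup would likely load these from a YAML file.
RULES = {
    # Day-first numeric date such as 03/11/2023 (assumed convention).
    "date": re.compile(r"\b(\d{1,2})[/.-](\d{1,2})[/.-](\d{4})\b"),
    # "Total" with a possibly misread final letter. The leading \b keeps
    # "Subtotal" from matching, which avoids one common false positive.
    "total": re.compile(
        r"\btota[l1|]\s*[:.]?\s*[$€£]?\s*"
        r"([0-9OoSsIl|]+(?:[.,][0-9OoSsIl|]+)*)",
        re.IGNORECASE,
    ),
}

# Common OCR letter-for-digit confusions, applied only to numeric captures.
DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "I": "1", "l": "1", "|": "1", "S": "5", "s": "5"}
)


def normalize_amount(raw):
    """Repair digit confusions; return a float, or None if validation fails."""
    fixed = raw.translate(DIGIT_FIXES).replace(",", ".")
    # Validation: digits, one decimal separator, exactly two decimals.
    if not re.fullmatch(r"\d+\.\d{2}", fixed):
        return None
    return float(fixed)


def normalize_date(day, month, year):
    """Return an ISO date string, or None if the date is impossible."""
    try:
        return date(int(year), int(month), int(day)).isoformat()
    except ValueError:
        return None


def extract(text):
    """Apply each rule once and return validated fields (None when missing)."""
    result = {"date": None, "total": None}
    match = RULES["date"].search(text)
    if match:
        result["date"] = normalize_date(*match.groups())
    match = RULES["total"].search(text)
    if match:
        result["total"] = normalize_amount(match.group(1))
    return result


if __name__ == "__main__":
    noisy = "ACME MARKET\nDate: 03/11/2023\nSubtotal 41.00\nTOTA1: $4S.l0\n"
    print(json.dumps(extract(noisy)))
```

Returning `None` for a field that fails validation, instead of a best guess, keeps false positives out of the JSON result and makes missing fields easy to count in tests. This sketch deliberately rejects amounts with thousands separators (for example `1,234.56`); a real rule set would need an explicit decision on those.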
Try the pipeline below: paste noisy text or upload a receipt-like image, then read the JSON result.