DFPY-78: Integrate German regex support #146

Open
sidmohan0 wants to merge 8 commits into dev from codex/dfpy-78-german-regex-support

@sidmohan0
Contributor

Summary

  • Adapt external PR #138 (feat(regex): add German structured PII detection) for the 4.5 lightweight regex path without adding dependencies.
  • Add German VAT ID and German IBAN detection to the default regex set.
  • Add broader German structured identifiers behind explicit locales=["de"] or explicit entity_types, with context guards to avoid ordinary ticket/SKU/order ID false positives.
  • Propagate locale support through scan, redact, guardrail helpers, DataFog, TextService, and the core text CLI commands.
  • Document German locale behavior in README and user docs.
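
The context guards mentioned above can be pictured with a small standalone sketch. It does not reproduce the patterns in this PR: the 11-digit pattern, the keyword list, and the 40-character window are assumptions chosen for illustration, and `find_guarded_tax_ids` is a hypothetical helper rather than a DataFog function.

```python
import re

# Hypothetical, simplified pattern: any standalone 11-digit number.
TAX_ID = re.compile(r"\b\d{11}\b")

# Hypothetical German context keywords that must appear shortly before a match.
CONTEXT = re.compile(r"steuer-?id|steueridentifikationsnummer|idnr", re.IGNORECASE)

# Assumed look-behind window, in characters.
WINDOW = 40


def find_guarded_tax_ids(text: str) -> list[str]:
    """Return 11-digit matches only when a context keyword precedes them."""
    hits = []
    for match in TAX_ID.finditer(text):
        preceding = text[max(0, match.start() - WINDOW):match.start()]
        if CONTEXT.search(preceding):
            hits.append(match.group())
    return hits


print(find_guarded_tax_ids("Steuer-ID: 12345678901"))      # keyword present
print(find_guarded_tax_ids("Order 12345678901 shipped"))   # no keyword
```

Under these assumptions, an order or ticket number with the same shape is ignored unless German tax-related wording appears just before it.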

Review notes

  • This follows the DFPY-77 review decision: proceed by adapting #138 (feat(regex): add German structured PII detection), but avoid merging the original PR as-is because several broad German identifiers were too noisy when default-on.
  • Regex overlap suppression now prefers the longer/more specific German VAT match over an inner generic SSN-shaped substring, preventing bad redaction output.
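
As a rough illustration of the overlap rule in the last note, the sketch below keeps the longest match among overlapping spans and drops anything that intersects it. The two patterns, the labels, and the helper names are hypothetical simplifications, not the code in `regex_annotator.py`.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    label: str
    start: int
    end: int


# Hypothetical, simplified patterns for illustration only.
PATTERNS = {
    "DE_VAT_ID": re.compile(r"\bDE\d{9}\b"),
    "SSN_LIKE": re.compile(r"\d{9}"),
}


def find_spans(text: str) -> list[Span]:
    """Collect every match from every pattern, overlaps included."""
    return [
        Span(label, m.start(), m.end())
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]


def suppress_overlaps(spans: list[Span]) -> list[Span]:
    """Keep the longest spans first; drop any span overlapping a kept one."""
    kept: list[Span] = []
    for span in sorted(spans, key=lambda s: (-(s.end - s.start), s.start)):
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


def redact(text: str) -> str:
    """Replace surviving spans right to left so earlier offsets stay valid."""
    for span in reversed(suppress_overlaps(find_spans(text))):
        text = text[:span.start] + f"[{span.label}]" + text[span.end:]
    return text


print(redact("USt-IdNr: DE123456789"))  # USt-IdNr: [DE_VAT_ID]
```

Without the suppression step, the inner nine digits would also be redacted as a separate entity, which is the kind of bad redaction output the note describes.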

Verification

  • DATAFOG_NO_TELEMETRY=1 DO_NOT_TRACK=1 .venv312/bin/python -m pytest tests/test_de_pii_regex.py tests/test_regex_annotator.py -q
  • DATAFOG_NO_TELEMETRY=1 DO_NOT_TRACK=1 .venv312/bin/python -m pytest tests/test_detection_accuracy.py::test_structured_pii_detection_fast tests/test_detection_accuracy.py::test_negative_cases_fast tests/test_main.py::test_lean_datafog_detect tests/test_main.py::test_lean_datafog_process tests/test_client.py::test_scan_text_success tests/test_cli_smoke.py::test_redact_text_command -q
  • .venv312/bin/pre-commit run --files README.md datafog/__init__.py datafog/agent.py datafog/client.py datafog/core.py datafog/engine.py datafog/main.py datafog/processing/text_processing/regex_annotator/regex_annotator.py datafog/services/text_service.py docs/cli.rst docs/getting-started.rst docs/python-sdk.rst tests/corpus/structured_pii.json tests/test_detection_accuracy.py tests/test_regex_annotator.py tests/test_de_pii_regex.py --show-diff-on-failure
  • DATAFOG_NO_TELEMETRY=1 DO_NOT_TRACK=1 .venv312/bin/python -m sphinx -b html docs docs/_build/html
  • DATAFOG_NO_TELEMETRY=1 DO_NOT_TRACK=1 .venv312/bin/python -m pytest tests/test_runtime_dependency_safety.py tests/test_no_network_core.py -q
  • git diff --check
  • DATAFOG_NO_TELEMETRY=1 DO_NOT_TRACK=1 .venv312/bin/python -m pytest -m "not slow" -q -> 583 passed, 4 skipped, 295 deselected, 19 xfailed

Refs DFPY-78.

sidmohan0 changed the title from "[codex] Integrate German regex support" to "DFPY-78: Integrate German regex support" on May 27, 2026
@sidmohan0
Contributor Author

Update: this PR now adopts the locale-gated German PII behavior proposed and iterated in #138 by @pranjalparmar. All German DE_* patterns require explicit locale selection (locales=["de"] / CLI --locale de) or explicit entity-type selection, so the default regex path stays country-agnostic while still giving users a clear German opt-in path.
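
The opt-in rule described in this update can be sketched as a small selection function. The label names, the `LOCALE_LABELS` registry, and `active_labels` are hypothetical; the sketch also assumes that explicit entity types override locale selection, which the PR text does not spell out.

```python
# Hypothetical label registry; the real entity names may differ.
DEFAULT_LABELS = ("EMAIL", "PHONE")
LOCALE_LABELS = {"de": ("DE_VAT_ID", "DE_IBAN")}


def active_labels(locales=None, entity_types=None):
    """Return the labels that would run for a given call (illustrative only)."""
    known = set(DEFAULT_LABELS)
    for labels in LOCALE_LABELS.values():
        known.update(labels)

    # Assumption: explicit entity types enable exactly those known labels,
    # regardless of locale.
    if entity_types is not None:
        return sorted(set(entity_types) & known)

    # Otherwise start from the country-agnostic defaults and add locale packs.
    selected = set(DEFAULT_LABELS)
    for locale in locales or ():
        selected.update(LOCALE_LABELS.get(locale, ()))
    return sorted(selected)


print(active_labels())                            # ['EMAIL', 'PHONE']
print(active_labels(locales=["de"]))              # ['DE_IBAN', 'DE_VAT_ID', 'EMAIL', 'PHONE']
print(active_labels(entity_types=["DE_VAT_ID"]))  # ['DE_VAT_ID']
```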

sidmohan0 force-pushed the codex/dfpy-78-german-regex-support branch from 3bbff71 to a0ab963 on May 27, 2026 23:01
@pranjalparmar commented May 28, 2026

Hey @sidmohan0 👋

Really glad to see the locale-gated approach from #138 being adopted here!

One thing I wanted to mention is that this is actually my first-ever open-source contribution, and I'm genuinely really excited to see the German PII patterns being integrated! Since the design and patterns from #138 are being adapted here, would it be possible to add me as a co-author on the relevant commits? GitHub supports this with:

Co-authored-by: Pranjal Parmar <[email protected]>

I'm also looking forward to extending this further: I'm planning to contribute multi-country VAT/IBAN patterns and would love to get involved with the next version work too!

Thanks again for all the guidance throughout this — it's been a great learning experience! 🙌

sidmohan0 force-pushed the codex/dfpy-78-german-regex-support branch from a0ab963 to f1a6fce on May 28, 2026 14:14