Right-size pure-Python CI test runtime (SyntheticControl regression + SDID safe trims) by igerber · Pull Request #505 · igerber/diff-diff

igerber · 2026-05-31T12:11:22Z

Summary

Fix the pure-Python CI regression from PR Add SyntheticControl estimator (classic SCM, Abadie-Diamond-Hainmueller 2010) — PR-1 core #501. The new SyntheticControl test file ran ~15 full nested V-optimization fits at production defaults (n_starts=4, inner_max_iter=10000, inner_min_decrease=1e-5), costing 30–150s+ each in pure-Python mode (no Rust kernel) because the inner Frank-Wolfe solve grinds its slow sublinear tail to the tight tolerance on every objective eval. This inflated the single-runner python-fallback job from ~1.5h → ~4.5h. Added _FAST/_FAST_CHURN cheap settings (loosened inner tolerance + single start + small outer cap) to the fit-completing behavior tests → the file now runs in ~1.3s pure-Python (was tens of minutes).
Preserve pure-Python coverage of the default nested path. Added a non-slow test_nested_production_defaults_smoke (2-donor panel at full production defaults — the inner FW simplex is ~1-D with J=2, so it stays <0.1s) so the pure-Python job still exercises n_starts=4 and the _v_starts heuristic candidates that the _FAST (n_starts=1) tests skip.
Safe baseline trim (SDID). Routed n_bootstrap through ci_params on three tests whose assertions are independent of the draw count: TestScaleEquivariance.test_scale_equivariance (pure invariance, baseline captured at runtime) and two self-consistency p-value tests. Cuts ~290s of pure-Python CPU-work.

Coverage & wall-clock notes

No production defaults changed — only test-harness settings. ci_params returns counts unchanged under Rust, so the Rust matrix (-m '') keeps full fidelity (e.g. the SDID tests still run at 200 reps; the SCM @slow Tier-2 Basque test still runs at full defaults). Only the single-runner pure-Python fallback (-m 'not slow') is scaled. The backend-dependent SCM kernel _sc_weight_fw is independently equivalence-tested by test_rust_backend.py::test_sc_weight_fw_matches_numpy.
Scope of the win. The SCM fix removes the +3h regression (restores ~1.5h). The pre-existing ~1.5h baseline is bound by the TROP critical path (single non-slow tests run 250s+ each), so non-TROP trims don't reduce wall-clock. Pushing to ~1h would require @slow/data-shrinking the heavy TROP tests, trading their pure-Python coverage — intentionally not done here.

Methodology references

Method name(s): N/A — tests-only change. No estimator math, weighting, variance formulas, identification checks, or documented defaults were modified.
Paper / source link(s): N/A
Intentional deviations: None. Verified against docs/methodology/REGISTRY.md (SyntheticDiD §, SyntheticControl §): the changed tests remain consistent with documented placebo/bootstrap semantics and the ADH 2010 nested-V notes.

Validation

Tests added/updated: tests/test_methodology_synthetic_control.py (cheap settings + new test_nested_production_defaults_smoke), tests/test_methodology_sdid.py (3 tests via ci_params).
Pure-Python (DIFF_DIFF_BACKEND=python): SCM file 47 passed in 1.45s (was tens of minutes); SDID affected classes 12 passed. Rust backend: SCM 47 passed incl. @slow Tier-2; edited SDID tests pass at full 200 reps.
Local /ai-review-local --backend codex: ✅ clean (one P2 raised on default-path coverage, resolved by the smoke test).

Security / privacy

Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

… speed PR #501's SyntheticControl test file ran ~15 full nested V-optimization fits at production defaults (n_starts=4, inner_max_iter=10000, inner_min_decrease=1e-5), costing 30-150s+ EACH in pure-Python mode (no Rust kernel) and inflating the pure-Python fallback CI job from ~1.5h to ~4.5h. The cost is the inner Frank-Wolfe solve grinding its slow sublinear tail to hit the tight tolerance on every objective evaluation. Add _FAST / _FAST_CHURN config constants and apply them to the fit-completing behavior tests: loosen the inner tolerance (inner_min_decrease=1e-3), use a single start, and cap the outer optimizer. Each fit drops to ~0.1s; the whole file runs in 1.3s pure-Python (was tens of minutes). The churn tests keep inner_max_iter=1 (only the outer optimizer is capped) so their non-convergence warnings still fire. No coverage loss: the only backend-dependent kernel (_sc_weight_fw) is already equivalence-tested by test_rust_backend.py::test_sc_weight_fw_matches_numpy, the rest of the SCM pipeline is identical pure-Python in both backends, and the full production defaults are still exercised by the @slow Tier-2 Basque test in the Rust matrix. Production defaults unchanged. Verified: 46 passed in 1.30s (pure-Python), 47 passed incl. @slow Tier-2 (Rust). Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

…stency tests Part 2 (safe baseline trim) of the pure-Python CI right-sizing. The SDID TestScaleEquivariance.test_scale_equivariance tests were the heaviest non-TROP pure-Python tests (~137s + ~109s each for placebo/bootstrap) because their _fit helper hardcodes n_bootstrap=200. Route n_bootstrap through ci_params.bootstrap(200) on three tests whose assertions are independent of the absolute draw count: - test_scale_equivariance: pure invariance (att/a, se/|a|, p, count compared between scaled and unscaled fits at the SAME draw count; baseline captured at runtime, not vs the hardcoded _BASELINE literals). - test_bootstrap_p_value_matches_analytical / test_placebo_p_value_uses_empirical_formula: self-consistency (reported p vs the formula applied to the fit's own outputs). ci_params returns n unchanged under Rust (full 200 reps in the Rust matrix) and scales to ~22 in pure-Python, cutting ~290s of pure-Python call-time. Left at full reps: test_baseline_parity (needs the 200-rep _BASELINE literals) and the detection/null-dispersion tests (p-value floor / SE-stability margins). TROP (the ~55% bucket) left intact — its cheap levers are already applied; further cuts would trade pure-Python coverage. Verified: 12 passed pure-Python (affected classes), 5 passed Rust. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

…d defaults Address local codex review P2: after the _FAST rollout, no non-slow test exercised the production-default SyntheticControl(v_method="nested") path in the pure-Python CI job (the @slow Tier-2 Basque test runs only in the Rust matrix), and the cited _sc_weight_fw kernel equivalence test is a low-level solver check, not an estimator-level SCM test. In particular the _v_starts heuristic candidates (inverse-variance / univariate-fit / Dirichlet starts) only generate at n_starts>1, so the all-n_starts=1 _FAST tests skipped them in pure-Python. Add test_nested_production_defaults_smoke: a 2-donor panel run at FULL production defaults (n_starts=4, inner_min_decrease=1e-5). With J=2 the inner Frank-Wolfe simplex is effectively 1-D, so defaults still complete in <0.1s (non-slow) while exercising the heuristic-start generation and the numpy FW kernel inside the real nested pipeline. Tighten the _FAST comment to state coverage accurately. Verified: 47 passed in 1.45s pure-Python (smoke test <0.1s); 1 passed Rust. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

github-actions · 2026-05-31T12:15:33Z

Overall Assessment

✅ Looks good

Executive Summary

Tests-only diff. I did not find any production estimator, weighting, variance, inference, or default-behavior change.
The affected methodology surfaces are SyntheticDiD inference semantics and SyntheticControl nested-V optimizer coverage.
The SDID changes remain aligned with the Methodology Registry: placebo p-values are still checked against the empirical formula, bootstrap p-values against safe_inference(att, se), and the scale-equivariance test remains consistent with the documented Y-normalization contract (tests/test_methodology_sdid.py:L3078-L3217, docs/methodology/REGISTRY.md:L1759-L1767, docs/methodology/REGISTRY.md:L1898-L1926, diff_diff/synthetic_did.py:L1190-L1204).
The SCM changes only right-size test harness settings. The new smoke test still exercises the default nested path (n_starts=4, _v_starts, default inner_min_decrease) rather than removing coverage of it (tests/test_methodology_synthetic_control.py:L50-L72, tests/test_methodology_synthetic_control.py:L423-L436, diff_diff/synthetic_control.py:L916-L1125).
No unmitigated P0/P1 findings.
Review limitation: static analysis only. This environment does not have pytest or numpy, so I could not execute the edited tests.

Methodology

No findings. The PR stays within documented methodology. SyntheticDiD test edits match the registry’s variance-method-specific p-value contract and scale-equivariance notes, and SyntheticControl test edits do not alter estimator methodology or documented defaults (docs/methodology/REGISTRY.md:L1813-L1814, docs/methodology/REGISTRY.md:L1898-L1926, docs/methodology/REGISTRY.md:L1974-L2000).

Code Quality

No findings. The _FAST / _FAST_CHURN helpers reduce duplication while preserving the distinct warning-path assertions in the file (tests/test_methodology_synthetic_control.py:L50-L72, tests/test_methodology_synthetic_control.py:L339-L366, tests/test_methodology_synthetic_control.py:L592-L609).

Performance

No findings. The runtime reduction is confined to tests, and production optimizer defaults remain unchanged. The added non-slow smoke test keeps one default-path nested-V execution in the default pytest selection (pyproject.toml:L109-L117, tests/test_methodology_synthetic_control.py:L423-L436).

Maintainability

No findings. Centralizing the cheaper SCM knobs and routing SDID draw-count scaling through ci_params is a net maintainability improvement over repeating ad hoc per-test overrides (tests/conftest.py:L181-L220, tests/test_methodology_synthetic_control.py:L50-L72).

Tech Debt

No findings. The diff does not introduce new untracked debt. Existing TODO items for future SCM diagnostics and SDID cross-language bootstrap parity are unchanged and unrelated to this tests-only PR.

Security

No findings. Tests-only diff; no new secret-handling, deserialization, subprocess, or external-input surface.

Documentation/Tests

No findings. The added comments accurately describe why the cheaper settings are safe for the preserved assertions. Residual risk: I could not run the edited tests locally because the review environment is missing pytest and numpy.

igerber and others added 3 commits May 31, 2026 07:13

igerber added the ready-for-ci Triggers CI test workflows label May 31, 2026

igerber merged commit d59039e into main May 31, 2026
25 of 26 checks passed

igerber deleted the fix/ci-pure-python-runtime branch May 31, 2026 13:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Right-size pure-Python CI test runtime (SyntheticControl regression + SDID safe trims)#505

Right-size pure-Python CI test runtime (SyntheticControl regression + SDID safe trims)#505
igerber merged 3 commits into
mainfrom
fix/ci-pure-python-runtime

igerber commented May 31, 2026

Uh oh!

github-actions Bot commented May 31, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented May 31, 2026

Summary

Coverage & wall-clock notes

Methodology references

Validation

Security / privacy

Uh oh!

github-actions Bot commented May 31, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant