Skip to content

Right-size pure-Python CI test runtime (SyntheticControl regression + SDID safe trims)#505

Merged
igerber merged 3 commits into
mainfrom
fix/ci-pure-python-runtime
May 31, 2026
Merged

Right-size pure-Python CI test runtime (SyntheticControl regression + SDID safe trims)#505
igerber merged 3 commits into
mainfrom
fix/ci-pure-python-runtime

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 31, 2026

Summary

  • Fix the pure-Python CI regression from PR Add SyntheticControl estimator (classic SCM, Abadie-Diamond-Hainmueller 2010) — PR-1 core #501. The new SyntheticControl test file ran ~15 full nested V-optimization fits at production defaults (n_starts=4, inner_max_iter=10000, inner_min_decrease=1e-5), costing 30–150s+ each in pure-Python mode (no Rust kernel) because the inner Frank-Wolfe solve grinds its slow sublinear tail to the tight tolerance on every objective eval. This inflated the single-runner python-fallback job from ~1.5h → ~4.5h. Added _FAST/_FAST_CHURN cheap settings (loosened inner tolerance + single start + small outer cap) to the fit-completing behavior tests → the file now runs in ~1.3s pure-Python (was tens of minutes).
  • Preserve pure-Python coverage of the default nested path. Added a non-slow test_nested_production_defaults_smoke (2-donor panel at full production defaults — the inner FW simplex is ~1-D with J=2, so it stays <0.1s) so the pure-Python job still exercises n_starts=4 and the _v_starts heuristic candidates that the _FAST (n_starts=1) tests skip.
  • Safe baseline trim (SDID). Routed n_bootstrap through ci_params on three tests whose assertions are independent of the draw count: TestScaleEquivariance.test_scale_equivariance (pure invariance, baseline captured at runtime) and two self-consistency p-value tests. Cuts ~290s of pure-Python CPU-work.

Coverage & wall-clock notes

  • No production defaults changed — only test-harness settings. ci_params returns counts unchanged under Rust, so the Rust matrix (-m '') keeps full fidelity (e.g. the SDID tests still run at 200 reps; the SCM @slow Tier-2 Basque test still runs at full defaults). Only the single-runner pure-Python fallback (-m 'not slow') is scaled. The backend-dependent SCM kernel _sc_weight_fw is independently equivalence-tested by test_rust_backend.py::test_sc_weight_fw_matches_numpy.
  • Scope of the win. The SCM fix removes the +3h regression (restores ~1.5h). The pre-existing ~1.5h baseline is bound by the TROP critical path (single non-slow tests run 250s+ each), so non-TROP trims don't reduce wall-clock. Pushing to ~1h would require @slow/data-shrinking the heavy TROP tests, trading their pure-Python coverage — intentionally not done here.

Methodology references

  • Method name(s): N/A — tests-only change. No estimator math, weighting, variance formulas, identification checks, or documented defaults were modified.
  • Paper / source link(s): N/A
  • Intentional deviations: None. Verified against docs/methodology/REGISTRY.md (SyntheticDiD §, SyntheticControl §): the changed tests remain consistent with documented placebo/bootstrap semantics and the ADH 2010 nested-V notes.

Validation

  • Tests added/updated: tests/test_methodology_synthetic_control.py (cheap settings + new test_nested_production_defaults_smoke), tests/test_methodology_sdid.py (3 tests via ci_params).
  • Pure-Python (DIFF_DIFF_BACKEND=python): SCM file 47 passed in 1.45s (was tens of minutes); SDID affected classes 12 passed. Rust backend: SCM 47 passed incl. @slow Tier-2; edited SDID tests pass at full 200 reps.
  • Local /ai-review-local --backend codex: ✅ clean (one P2 raised on default-path coverage, resolved by the smoke test).

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

igerber and others added 3 commits May 31, 2026 07:13
… speed

PR #501's SyntheticControl test file ran ~15 full nested V-optimization fits at
production defaults (n_starts=4, inner_max_iter=10000, inner_min_decrease=1e-5),
costing 30-150s+ EACH in pure-Python mode (no Rust kernel) and inflating the
pure-Python fallback CI job from ~1.5h to ~4.5h. The cost is the inner Frank-Wolfe
solve grinding its slow sublinear tail to hit the tight tolerance on every
objective evaluation.

Add _FAST / _FAST_CHURN config constants and apply them to the fit-completing
behavior tests: loosen the inner tolerance (inner_min_decrease=1e-3), use a single
start, and cap the outer optimizer. Each fit drops to ~0.1s; the whole file runs in
1.3s pure-Python (was tens of minutes). The churn tests keep inner_max_iter=1 (only
the outer optimizer is capped) so their non-convergence warnings still fire.

No coverage loss: the only backend-dependent kernel (_sc_weight_fw) is already
equivalence-tested by test_rust_backend.py::test_sc_weight_fw_matches_numpy, the
rest of the SCM pipeline is identical pure-Python in both backends, and the full
production defaults are still exercised by the @slow Tier-2 Basque test in the Rust
matrix. Production defaults unchanged.

Verified: 46 passed in 1.30s (pure-Python), 47 passed incl. @slow Tier-2 (Rust).

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
…stency tests

Part 2 (safe baseline trim) of the pure-Python CI right-sizing. The SDID
TestScaleEquivariance.test_scale_equivariance tests were the heaviest non-TROP
pure-Python tests (~137s + ~109s each for placebo/bootstrap) because their _fit
helper hardcodes n_bootstrap=200.

Route n_bootstrap through ci_params.bootstrap(200) on three tests whose assertions
are independent of the absolute draw count:
  - test_scale_equivariance: pure invariance (att/a, se/|a|, p, count compared
    between scaled and unscaled fits at the SAME draw count; baseline captured at
    runtime, not vs the hardcoded _BASELINE literals).
  - test_bootstrap_p_value_matches_analytical / test_placebo_p_value_uses_empirical_formula:
    self-consistency (reported p vs the formula applied to the fit's own outputs).

ci_params returns n unchanged under Rust (full 200 reps in the Rust matrix) and
scales to ~22 in pure-Python, cutting ~290s of pure-Python call-time. Left at full
reps: test_baseline_parity (needs the 200-rep _BASELINE literals) and the
detection/null-dispersion tests (p-value floor / SE-stability margins). TROP (the
~55% bucket) left intact — its cheap levers are already applied; further cuts would
trade pure-Python coverage.

Verified: 12 passed pure-Python (affected classes), 5 passed Rust.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
…d defaults

Address local codex review P2: after the _FAST rollout, no non-slow test exercised
the production-default SyntheticControl(v_method="nested") path in the pure-Python CI
job (the @slow Tier-2 Basque test runs only in the Rust matrix), and the cited
_sc_weight_fw kernel equivalence test is a low-level solver check, not an
estimator-level SCM test. In particular the _v_starts heuristic candidates
(inverse-variance / univariate-fit / Dirichlet starts) only generate at n_starts>1,
so the all-n_starts=1 _FAST tests skipped them in pure-Python.

Add test_nested_production_defaults_smoke: a 2-donor panel run at FULL production
defaults (n_starts=4, inner_min_decrease=1e-5). With J=2 the inner Frank-Wolfe
simplex is effectively 1-D, so defaults still complete in <0.1s (non-slow) while
exercising the heuristic-start generation and the numpy FW kernel inside the real
nested pipeline. Tighten the _FAST comment to state coverage accurately.

Verified: 47 passed in 1.45s pure-Python (smoke test <0.1s); 1 passed Rust.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
@github-actions
Copy link
Copy Markdown

Overall Assessment

✅ Looks good

Executive Summary

  • Tests-only diff. I did not find any production estimator, weighting, variance, inference, or default-behavior change.
  • The affected methodology surfaces are SyntheticDiD inference semantics and SyntheticControl nested-V optimizer coverage.
  • The SDID changes remain aligned with the Methodology Registry: placebo p-values are still checked against the empirical formula, bootstrap p-values against safe_inference(att, se), and the scale-equivariance test remains consistent with the documented Y-normalization contract (tests/test_methodology_sdid.py:L3078-L3217, docs/methodology/REGISTRY.md:L1759-L1767, docs/methodology/REGISTRY.md:L1898-L1926, diff_diff/synthetic_did.py:L1190-L1204).
  • The SCM changes only right-size test harness settings. The new smoke test still exercises the default nested path (n_starts=4, _v_starts, default inner_min_decrease) rather than removing coverage of it (tests/test_methodology_synthetic_control.py:L50-L72, tests/test_methodology_synthetic_control.py:L423-L436, diff_diff/synthetic_control.py:L916-L1125).
  • No unmitigated P0/P1 findings.
  • Review limitation: static analysis only. This environment does not have pytest or numpy, so I could not execute the edited tests.

Methodology

  • No findings. The PR stays within documented methodology. SyntheticDiD test edits match the registry’s variance-method-specific p-value contract and scale-equivariance notes, and SyntheticControl test edits do not alter estimator methodology or documented defaults (docs/methodology/REGISTRY.md:L1813-L1814, docs/methodology/REGISTRY.md:L1898-L1926, docs/methodology/REGISTRY.md:L1974-L2000).

Code Quality

  • No findings. The _FAST / _FAST_CHURN helpers reduce duplication while preserving the distinct warning-path assertions in the file (tests/test_methodology_synthetic_control.py:L50-L72, tests/test_methodology_synthetic_control.py:L339-L366, tests/test_methodology_synthetic_control.py:L592-L609).

Performance

  • No findings. The runtime reduction is confined to tests, and production optimizer defaults remain unchanged. The added non-slow smoke test keeps one default-path nested-V execution in the default pytest selection (pyproject.toml:L109-L117, tests/test_methodology_synthetic_control.py:L423-L436).

Maintainability

  • No findings. Centralizing the cheaper SCM knobs and routing SDID draw-count scaling through ci_params is a net maintainability improvement over repeating ad hoc per-test overrides (tests/conftest.py:L181-L220, tests/test_methodology_synthetic_control.py:L50-L72).

Tech Debt

  • No findings. The diff does not introduce new untracked debt. Existing TODO items for future SCM diagnostics and SDID cross-language bootstrap parity are unchanged and unrelated to this tests-only PR.

Security

  • No findings. Tests-only diff; no new secret-handling, deserialization, subprocess, or external-input surface.

Documentation/Tests

  • No findings. The added comments accurately describe why the cheaper settings are safe for the preserved assertions. Residual risk: I could not run the edited tests locally because the review environment is missing pytest and numpy.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 31, 2026
@igerber igerber merged commit d59039e into main May 31, 2026
25 of 26 checks passed
@igerber igerber deleted the fix/ci-pure-python-runtime branch May 31, 2026 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant