Right-size pure-Python CI test runtime (SyntheticControl regression + SDID safe trims)#505
Merged
Conversation
… speed PR #501's SyntheticControl test file ran ~15 full nested V-optimization fits at production defaults (n_starts=4, inner_max_iter=10000, inner_min_decrease=1e-5), costing 30-150s+ EACH in pure-Python mode (no Rust kernel) and inflating the pure-Python fallback CI job from ~1.5h to ~4.5h. The cost is the inner Frank-Wolfe solve grinding its slow sublinear tail to hit the tight tolerance on every objective evaluation. Add _FAST / _FAST_CHURN config constants and apply them to the fit-completing behavior tests: loosen the inner tolerance (inner_min_decrease=1e-3), use a single start, and cap the outer optimizer. Each fit drops to ~0.1s; the whole file runs in 1.3s pure-Python (was tens of minutes). The churn tests keep inner_max_iter=1 (only the outer optimizer is capped) so their non-convergence warnings still fire. No coverage loss: the only backend-dependent kernel (_sc_weight_fw) is already equivalence-tested by test_rust_backend.py::test_sc_weight_fw_matches_numpy, the rest of the SCM pipeline is identical pure-Python in both backends, and the full production defaults are still exercised by the @slow Tier-2 Basque test in the Rust matrix. Production defaults unchanged. Verified: 46 passed in 1.30s (pure-Python), 47 passed incl. @slow Tier-2 (Rust). Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
…stency tests
Part 2 (safe baseline trim) of the pure-Python CI right-sizing. The SDID
TestScaleEquivariance.test_scale_equivariance tests were the heaviest non-TROP
pure-Python tests (~137s + ~109s each for placebo/bootstrap) because their _fit
helper hardcodes n_bootstrap=200.
Route n_bootstrap through ci_params.bootstrap(200) on three tests whose assertions
are independent of the absolute draw count:
- test_scale_equivariance: pure invariance (att/a, se/|a|, p, count compared
between scaled and unscaled fits at the SAME draw count; baseline captured at
runtime, not vs the hardcoded _BASELINE literals).
- test_bootstrap_p_value_matches_analytical / test_placebo_p_value_uses_empirical_formula:
self-consistency (reported p vs the formula applied to the fit's own outputs).
ci_params returns n unchanged under Rust (full 200 reps in the Rust matrix) and
scales to ~22 in pure-Python, cutting ~290s of pure-Python call-time. Left at full
reps: test_baseline_parity (needs the 200-rep _BASELINE literals) and the
detection/null-dispersion tests (p-value floor / SE-stability margins). TROP (the
~55% bucket) left intact — its cheap levers are already applied; further cuts would
trade pure-Python coverage.
Verified: 12 passed pure-Python (affected classes), 5 passed Rust.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
…d defaults Address local codex review P2: after the _FAST rollout, no non-slow test exercised the production-default SyntheticControl(v_method="nested") path in the pure-Python CI job (the @slow Tier-2 Basque test runs only in the Rust matrix), and the cited _sc_weight_fw kernel equivalence test is a low-level solver check, not an estimator-level SCM test. In particular the _v_starts heuristic candidates (inverse-variance / univariate-fit / Dirichlet starts) only generate at n_starts>1, so the all-n_starts=1 _FAST tests skipped them in pure-Python. Add test_nested_production_defaults_smoke: a 2-donor panel run at FULL production defaults (n_starts=4, inner_min_decrease=1e-5). With J=2 the inner Frank-Wolfe simplex is effectively 1-D, so defaults still complete in <0.1s (non-slow) while exercising the heuristic-start generation and the numpy FW kernel inside the real nested pipeline. Tighten the _FAST comment to state coverage accurately. Verified: 47 passed in 1.45s pure-Python (smoke test <0.1s); 1 passed Rust. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
|
Overall Assessment ✅ Looks good Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
SyntheticControltest file ran ~15 full nested V-optimization fits at production defaults (n_starts=4,inner_max_iter=10000,inner_min_decrease=1e-5), costing 30–150s+ each in pure-Python mode (no Rust kernel) because the inner Frank-Wolfe solve grinds its slow sublinear tail to the tight tolerance on every objective eval. This inflated the single-runnerpython-fallbackjob from ~1.5h → ~4.5h. Added_FAST/_FAST_CHURNcheap settings (loosened inner tolerance + single start + small outer cap) to the fit-completing behavior tests → the file now runs in ~1.3s pure-Python (was tens of minutes).test_nested_production_defaults_smoke(2-donor panel at full production defaults — the inner FW simplex is ~1-D with J=2, so it stays <0.1s) so the pure-Python job still exercisesn_starts=4and the_v_startsheuristic candidates that the_FAST(n_starts=1) tests skip.n_bootstrapthroughci_paramson three tests whose assertions are independent of the draw count:TestScaleEquivariance.test_scale_equivariance(pure invariance, baseline captured at runtime) and two self-consistency p-value tests. Cuts ~290s of pure-Python CPU-work.Coverage & wall-clock notes
ci_paramsreturns counts unchanged under Rust, so the Rust matrix (-m '') keeps full fidelity (e.g. the SDID tests still run at 200 reps; the SCM@slowTier-2 Basque test still runs at full defaults). Only the single-runner pure-Python fallback (-m 'not slow') is scaled. The backend-dependent SCM kernel_sc_weight_fwis independently equivalence-tested bytest_rust_backend.py::test_sc_weight_fw_matches_numpy.@slow/data-shrinking the heavy TROP tests, trading their pure-Python coverage — intentionally not done here.Methodology references
docs/methodology/REGISTRY.md(SyntheticDiD §, SyntheticControl §): the changed tests remain consistent with documented placebo/bootstrap semantics and the ADH 2010 nested-V notes.Validation
tests/test_methodology_synthetic_control.py(cheap settings + newtest_nested_production_defaults_smoke),tests/test_methodology_sdid.py(3 tests viaci_params).DIFF_DIFF_BACKEND=python): SCM file 47 passed in 1.45s (was tens of minutes); SDID affected classes 12 passed. Rust backend: SCM 47 passed incl.@slowTier-2; edited SDID tests pass at full 200 reps./ai-review-local --backend codex: ✅ clean (one P2 raised on default-path coverage, resolved by the smoke test).Security / privacy
🤖 Generated with Claude Code