From 315fa27eba51f0e292c9dd63c1d08e5421141e8a Mon Sep 17 00:00:00 2001 From: igerber Date: Sun, 31 May 2026 15:00:53 -0400 Subject: [PATCH 1/3] SyntheticControl: in-space placebo permutation inference + reporting-stack integration MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Implements ADH 2010 §2.4 in-space placebo inference for the classic SyntheticControl estimator (PR-1 / #501 shipped the estimator with no analytical SE), and wires SyntheticControlResults into the practitioner / DiagnosticReport / BusinessReport reporting stack. Methodology (diff_diff/synthetic_control.py, synthetic_control_results.py): - SyntheticControlResults.in_space_placebo(): reassigns treatment to each donor, refits a synthetic control for that pseudo-treated donor against the other J-1 donors (the real treated unit is excluded from every placebo pool -- its post-period is treatment-contaminated; matches SCtools::generate.placebos), and ranks the treated unit's post/pre RMSPE ratio among the J+1 units. - New fields placebo_p_value (= rank/(n_placebos+1); an upper-tail rank test on the unsigned RMSPE-ratio statistic, ties via >=), rmspe_ratio (treated statistic, set at fit), n_placebos/n_failed (effective reference-set sizes; non-converged placebos excluded from BOTH numerator and denominator). - placebo_p_value is a SEPARATE field from the always-NaN analytical p_value; it carries no SE/t-stat and does not flow through safe_inference; is_significant stays bound to p_value. - Fail-closed contract: a valid fit requires BOTH inner Frank-Wolfe AND outer-V convergence of the SELECTED incumbent; treated fit fails closed and placebos are excluded on either failure. The Powell polish validates the incumbent only when it converges back AT the incumbent's objective level (np.isclose), never on a success at a strictly-worse point. - Edge cases: scale-aware RMSPE-ratio floor (perfect pre-fit -> finite ratio), J<2 -> NaN+warn, J==2 -> degenerate+coarse warn, deterministic given seed. - get_placebo_df() returns the per-unit RMSPE-ratio table (incl. treated row + failed donors). Placebo compute is opt-in; each fit retains a _SyntheticControlFitSnapshot of the pivoted panel (excluded from pickling). Reporting (diagnostic_report.py, practitioner.py, business_report.py, _reporting_helpers.py): SCM routed like TROP (parallel-trends-free, fit-based). New scm_fit PT analogue (design_enforced_pt verdict reading pre_rmspe), a _scm_native diagnostic block surfacing pre_rmspe + donor-weight concentration + the placebo p-value when already computed (never triggering the refit implicitly), a practitioner _handle_synthetic_control with the placebo as the headline significance step, and a BusinessReport fit-based assumption block with ADH 2010 attribution. Also fixes a latent BR bug where headline is_significant was a non-JSON-serializable numpy bool_ when p_value is numpy NaN. Docs: REGISTRY.md §SyntheticControl (new **Note:** labels for donor-pool construction, failure handling, RMSPE-ratio floor, non-analytical-p-value split), REPORTING.md, docs/api/synthetic_control.rst, LLM guides, README, CHANGELOG. TODO.md tracks the ADH-2015 follow-up (in-time placebo / LOO) and a compact/lazy snapshot representation. Tests: ~19 placebo + convergence tests (infeasible donor counts, treated-fit non-convergence, Powell-succeeds-at-worse-point, failed-placebo exclusion, deterministic reruns, pickle behavior) + SCM reporting-routing tests across all three tools. Co-Authored-By: Claude Opus 4.8 (1M context) --- CHANGELOG.md | 3 +- README.md | 2 +- TODO.md | 3 +- diff_diff/_reporting_helpers.py | 17 + diff_diff/business_report.py | 76 +++- diff_diff/diagnostic_report.py | 224 +++++++++- diff_diff/guides/llms-full.txt | 9 +- diff_diff/guides/llms.txt | 2 +- diff_diff/practitioner.py | 71 +++- diff_diff/synthetic_control.py | 213 +++++++++- diff_diff/synthetic_control_results.py | 425 ++++++++++++++++++- docs/api/synthetic_control.rst | 17 +- docs/methodology/REGISTRY.md | 10 +- docs/methodology/REPORTING.md | 13 +- tests/test_business_report.py | 99 +++++ tests/test_diagnostic_report.py | 176 ++++++++ tests/test_methodology_synthetic_control.py | 448 +++++++++++++++++++- tests/test_practitioner.py | 25 ++ 18 files changed, 1741 insertions(+), 92 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 02de12ab..023904b5 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,7 +9,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added - **New tutorial: `docs/tutorials/24_staggered_vs_collapsed_power.ipynb` — "Staggered Rollout or a Simple 2×2? A Power-Analysis Decision Guide".** A practitioner walkthrough for geo experiments (framed on a 50-state staggered rollout) on when to reach for Callaway-Sant'Anna vs collapsing to a familiar pre/post 2×2. Shows, with live paired Monte Carlo on `generate_staggered_data`, that the collapsed 2×2 silently targets a *diluted* estimand (reports ~60–94% of the true effect-on-treated as the rollout staggers, with near-zero CI coverage of the truth under a slow rollout), and that CS's minimum-detectable-lift penalty is a *fast-rollout* phenomenon that shrinks to parity as the rollout becomes more staggered. Fully self-contained (runs live, no committed data files); ends with a CS-vs-2×2 decision guide. -- **New estimator: `SyntheticControl` — classic Synthetic Control Method (Abadie, Diamond & Hainmueller 2010; Abadie & Gardeazabal 2003).** Standalone estimator (`diff_diff/synthetic_control.py`) + `SyntheticControlResults` (`diff_diff/synthetic_control_results.py`) + `synthetic_control()` convenience function, exported from `diff_diff`. Builds a single treated unit's counterfactual as a convex combination of never-treated donor units — **donor (unit) weights only**, no time weights or ridge, distinct from `SyntheticDiD`. The inner simplex-constrained weighted-LS solve `W*(V)` reuses `utils._sc_weight_fw` (folding `V^½` into the predictor matrix, `intercept=False`, `zeta=0`); the diagonal predictor-importance matrix `V` is selected data-driven by minimizing pre-period outcome MSPE (`v_method="nested"`, softmax-on-simplex multistart Nelder-Mead + Powell polish) or supplied by the user (`v_method="custom"`). Predictors are built from `predictors`/`predictor_window`/`predictors_op`, `special_predictors`, and per-period outcome lags (`pre_period_outcomes`), in the R `Synth::dataprep` row order; per-row standardization (SD over donors+treated, ddof=1) matches the R `Synth::synth` source. Reports the gap path (`α̂_1t = Y_1t − Σ_j w_j Y_jt`), `att` (mean post-period gap), `pre_rmspe`, donor weights, `v_weights`, and a predictor-balance table. **No analytical standard error** — `se`/`t_stat`/`p_value`/`conf_int` are NaN (in-space placebo permutation inference with the post/pre RMSPE-ratio statistic is planned for a follow-up release; `_placebo_gaps`/`_rmspe_ratio`/`_fit_snapshot` are reserved on the results object). Ten validation gates baked in: predictor-period leakage, absorbing post-period suffix + no-anticipation cross-check against the treatment column, post-period canonicalization, donor-pool filtering before period derivation, empty-window rejection, poor-pre-fit `UserWarning` (RMSPE > SD of treated pre-outcomes), duplicate-predictor-label rejection, inner-solve non-convergence warning, order-independent gap-path rebuild, and the `standardize="none"` deviation; plus fail-closed `custom_v` cross-field rules and degenerate single-donor / single-pre-period handling. **R-`Synth` parity** (`tests/test_methodology_synthetic_control.py`, fixtures generated by `benchmarks/R/generate_synth_basque_golden.R` into `tests/data/`): two-tier on the Basque Country study — Tier-1 feeds R's `solution.v` via `custom_v` and reproduces the published donor weights (region 10 Cataluña 0.851 + region 14 Madrid 0.149) to `atol=1e-3` deterministically; Tier-2 (`@pytest.mark.slow`) checks the data-driven nested fit lands in a tolerance band (the nested `V` legitimately differs because the outer objective uses all pre periods, not R's `time.optimize.ssr` window). Documented in `docs/methodology/REGISTRY.md` §SyntheticControl (with `**Deviation from R:** standardize="none"` and `**Note:**` labels for the standardization formula, objective window, softmax `V` parametrization, and 1×SD poor-fit threshold), `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`. +- **`SyntheticControl` in-space placebo permutation inference + reporting-stack integration (ADH 2010 §2.4).** New `SyntheticControlResults.in_space_placebo()` provides the significance test classic SCM lacks an analytical SE for: it reassigns treatment to each donor, refits a synthetic control for that pseudo-treated donor against the **other `J−1` donors** (the real treated unit is excluded from every placebo pool — its post-period is treatment-contaminated; matches `SCtools::generate.placebos`), and ranks the treated unit's post/pre **RMSPE ratio** among the `J+1` units. New fields `placebo_p_value` (`= rank/(n_placebos+1)`, an upper-tail rank test on the unsigned RMSPE ratio — direction-agnostic, so it detects an effect of *either* sign rather than a signed/one-directional hypothesis; ties counted via `≥`), `rmspe_ratio` (the treated statistic, set at fit), and `n_placebos`/`n_failed` (effective reference-set sizes; non-converged placebos are excluded from BOTH numerator and denominator, never penalized into the rank). `placebo_p_value` is a **separate field** from the (always-NaN) `p_value` — it is a permutation p-value with no SE/t-stat and does not flow through `safe_inference`; `is_significant` stays bound to `p_value`. Edge cases fail closed: scale-aware RMSPE-ratio floor (a perfect pre-fit gives a finite ratio, not `inf`), `J<2` → NaN+warn, `J==2` → degenerate+coarse warn, deterministic given `seed`. New `get_placebo_df()` returns the per-unit RMSPE-ratio summary table (incl. the treated row and any failed donors) used for the rank. The design keeps the placebo *compute* opt-in — the per-donor refit loop runs only on the explicit `in_space_placebo()` call. To support that opt-in call, every fit retains a `_SyntheticControlFitSnapshot` of the pivoted panel (memory O(units × periods × predictor-vars), like `SyntheticDiD`'s snapshot for `in_time_placebo`; excluded from pickling). A compact/lazy snapshot representation is tracked as a follow-up in `TODO.md`. **Reporting-stack integration:** `SyntheticControlResults` is now routed through `DiagnosticReport` (fit-based `scm_fit` parallel-trends analogue → verdict `design_enforced_pt` reading `pre_rmspe`; `_scm_native` surfaces `pre_rmspe` + donor-weight concentration + the placebo p-value when already computed — never triggering the refit loop implicitly), `practitioner_next_steps` (`_handle_synthetic_control` with the placebo as the headline significance step), and `BusinessReport` (fit-based assumption block, ADH 2010 attribution, robustness via `estimator_native_diagnostics`; HonestDiD passthrough rejected like SDiD/TROP). Also fixes a latent BR bug where the headline `is_significant` was a non-JSON-serializable numpy `bool_` when `p_value` is a numpy `NaN`. Documented in `docs/methodology/REGISTRY.md` §SyntheticControl (new `**Note:**` labels for the donor-pool construction, failure handling, RMSPE-ratio floor, and the non-analytical-p-value split), `docs/methodology/REPORTING.md`, `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`. +- **New estimator: `SyntheticControl` — classic Synthetic Control Method (Abadie, Diamond & Hainmueller 2010; Abadie & Gardeazabal 2003).** Standalone estimator (`diff_diff/synthetic_control.py`) + `SyntheticControlResults` (`diff_diff/synthetic_control_results.py`) + `synthetic_control()` convenience function, exported from `diff_diff`. Builds a single treated unit's counterfactual as a convex combination of never-treated donor units — **donor (unit) weights only**, no time weights or ridge, distinct from `SyntheticDiD`. The inner simplex-constrained weighted-LS solve `W*(V)` reuses `utils._sc_weight_fw` (folding `V^½` into the predictor matrix, `intercept=False`, `zeta=0`); the diagonal predictor-importance matrix `V` is selected data-driven by minimizing pre-period outcome MSPE (`v_method="nested"`, softmax-on-simplex multistart Nelder-Mead + Powell polish) or supplied by the user (`v_method="custom"`). Predictors are built from `predictors`/`predictor_window`/`predictors_op`, `special_predictors`, and per-period outcome lags (`pre_period_outcomes`), in the R `Synth::dataprep` row order; per-row standardization (SD over donors+treated, ddof=1) matches the R `Synth::synth` source. Reports the gap path (`α̂_1t = Y_1t − Σ_j w_j Y_jt`), `att` (mean post-period gap), `pre_rmspe`, donor weights, `v_weights`, and a predictor-balance table. **No analytical standard error** — `se`/`t_stat`/`p_value`/`conf_int` are NaN; significance comes from in-space placebo permutation inference via `in_space_placebo()` (see the dedicated entry below). Ten validation gates baked in: predictor-period leakage, absorbing post-period suffix + no-anticipation cross-check against the treatment column, post-period canonicalization, donor-pool filtering before period derivation, empty-window rejection, poor-pre-fit `UserWarning` (RMSPE > SD of treated pre-outcomes), duplicate-predictor-label rejection, inner-solve non-convergence warning, order-independent gap-path rebuild, and the `standardize="none"` deviation; plus fail-closed `custom_v` cross-field rules and degenerate single-donor / single-pre-period handling. **R-`Synth` parity** (`tests/test_methodology_synthetic_control.py`, fixtures generated by `benchmarks/R/generate_synth_basque_golden.R` into `tests/data/`): two-tier on the Basque Country study — Tier-1 feeds R's `solution.v` via `custom_v` and reproduces the published donor weights (region 10 Cataluña 0.851 + region 14 Madrid 0.149) to `atol=1e-3` deterministically; Tier-2 (`@pytest.mark.slow`) checks the data-driven nested fit lands in a tolerance band (the nested `V` legitimately differs because the outer objective uses all pre periods, not R's `time.optimize.ssr` window). Documented in `docs/methodology/REGISTRY.md` §SyntheticControl (with `**Deviation from R:** standardize="none"` and `**Note:**` labels for the standardization formula, objective window, softmax `V` parametrization, and 1×SD poor-fit threshold), `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`. - **StaggeredTripleDifference methodology-review-tracker promotion: In Progress → Complete**, plus a new opt-in Eq-4.14 overall ATT. Closes the Ortiz-Villavicencio & Sant'Anna (2025, arXiv:2505.09942v3) primary-source review on the tracker (PR-A #499 added the paper review on file; this PR validates the source against it). New paper-equation-anchored Verified Components in `tests/test_methodology_staggered_triple_diff.py` (Theorem 4.1 / Eq. 4.5 RA=IPW=DR identification; Eq. 4.1 three-term DDD decomposition; Eqs. 4.11-4.12 optimal-GMM weight normalization + single-group reduction; Eq. 4.13 event-study cohort-share weighting; Eq. 4.14 / Cor. 4.2 overall) alongside the existing R cross-validation against `triplediff::ddd(panel=TRUE)` + `agg_ddd()`. **New feature — opt-in `overall_att_es` (paper Eq. 4.14 overall):** the unweighted mean of the post-treatment event-study effects ES(e), exposed on `StaggeredTripleDiffResults` (with `overall_se_es` / `overall_t_stat_es` / `overall_p_value_es` / `overall_conf_int_es`) and populated only when `aggregate="event_study"` / `"all"`. The default `overall_att` is unchanged (the Callaway-Sant'Anna simple post-treatment (g,t) average — the library-wide convention). Its analytical SE is the influence function of that mean (the average of the per-event-time combined IFs, routed through the same survey-aware variance estimator as the per-e effects via a new `_se_from_psi` helper); a multiplier-bootstrap SE replaces it under `n_bootstrap>0`. Computed via a side-channel stash on the shared `CallawaySantAnnaAggregationMixin._aggregate_event_study` (no return-signature change; CallawaySantAnna unaffected), over post-treatment `e >= -anticipation` (the library convention, matching `overall_att`). Cross-validated against R `agg_ddd(type="eventstudy")$overall.att` / `overall.se` (SE matches to ~0.1%). REGISTRY `## StaggeredTripleDifference`: the previously-unlabeled overall-aggregation prose is formalized under a `**Note:**` documenting both overalls, and the duplicate aggregation-weight deviation is consolidated (fixing a `P(G=g)` vs R `P(S=g)` mislabel). `METHODOLOGY_REVIEW.md` row L69 promoted to **Complete** (`Last Review = 2026-05-30`) with a Verified Components / R Comparison Results detail block; priority queue pruned. `docs/references.rst` Ortiz-Villavicencio entry pinned to arXiv:2505.09942v3. - **SunAbraham + WooldridgeDiD-OLS `vcov_type="conley"` (Conley 1999 spatial-HAC) threading.** Both estimators now accept `vcov_type="conley"` with the five `conley_*` constructor params (`conley_coords`, `conley_cutoff_km`, `conley_metric`, `conley_kernel`, `conley_lag_cutoff`), reusing the already-`conleyreg`-validated `solve_ols` / `conley.py` machinery — within-period spatial HAC at `conley_lag_cutoff=0`, plus the within-unit Bartlett serial term at `conley_lag_cutoff>0` (the panel-aware path, since `conley_time`/`conley_unit` are always supplied — not pooled cross-sectional), no new variance code. Conley routes through each estimator's within-transform path; the unit auto-cluster is dropped on the conley path (an explicit `cluster=` enables the spatial+cluster product kernel); `survey_design=` / `weights` / `n_bootstrap>0` are rejected, and WooldridgeDiD conley is OLS-path-only (`method ∈ {logit, poisson}` + conley still rejected via the `method != "ols"` guard). `SunAbrahamResults` / `WooldridgeDiDResults` gain a `conley_lag_cutoff` field plus a Conley variance-label line in `summary()` (`SunAbrahamResults` also gains `cluster_name`). FWL-composability — the within-transform conley SE equals the full-dummy conley SE — is pinned in `tests/test_conley_vcov.py` (`TestConleySunAbraham` / `TestConleyWooldridge`). **`StackedDiD` conley remains deferred for a methodology reason** (the stacked design replicates units across sub-experiments, so Conley would see same-unit copies at distance 0; no `conleyreg` anchor; paper-gated) — its prior "same shape as the SunAbraham follow-up" framing is corrected in REGISTRY / TODO / the rejection message. - **ConleySpatialHAC methodology-review-tracker promotion: In Progress → Complete.** Closes the Conley (1999) *Journal of Econometrics* 92(1) primary-source review on the methodology-review tracker. The paper review on file at `docs/methodology/papers/conley-1999-review.md` was previously merged (2026-05-09); this PR is the F.L.I.P. consolidation — new `tests/test_methodology_conley.py` with paper-equation-numbered Verified Components walk-through (~1600 LoC; 10 classes; 60 tests, 5 of them `@pytest.mark.slow`). Coverage: Eq. 4.2 cross-sectional sandwich (pairwise-distance specialization; the project's paper review identifies Eq. 4.2 page 18 as the real-valued/pairwise form, with Eq. 3.13 reserved for the lattice-indexed form), Eq. 4.2 HC0 + rank-1 limits, Andrews (1991) HAC lag truncation matching `conleyreg::time_dist.cpp`, haversine convention with Earth radius 6371.01 km, Phase 2 panel block-decomposed sandwich at `atol=1e-12`, sparse k-d-tree dense-vs-sparse bit-identity (Wave A #120 numerical correctness), and R `conleyreg` v0.1.9 parity at `atol=1e-6` on 6 fixtures (3 cross-sectional + 3 panel) plus the sparse-forced and time-asymmetric kernel parity contracts. Three dedicated deviations-area classes: `TestConleyLibraryExtensions` (Wave A library extensions — combined spatial+cluster product kernel #119, callable conley_metric validation #123, sparse k-d-tree activation #120, indefiniteness guard), `TestConleyDeviationsFromR` (1-D radial Bartlett vs paper's 2-D separable Eq. 3.14, time-label normalization via `np.unique`, independent temporal kernel deferred), and `TestConleyDeferrals` (5 fail-closed `NotImplementedError`/`TypeError` contracts: LinearRegression + survey_design, DiD/MPD/TWFE + survey_design, Conley + weights, SyntheticDiD + Conley, wild_bootstrap + Conley). Methodology-anchored tests extracted from `tests/test_conley_vcov.py`: full classes `TestConleyDirectHelper`, `TestConleyReductions`, `TestConleyReductionsAddendum`, `TestConleyParityR`, `TestConleyParitySpacetime`, `TestConleyPanelHelper`, `TestConleySparseRParityForced`; plus methodology-anchored tests from `TestConleyKernels`, `TestConleyDistanceMetrics`, `TestConleySparse`. File drops 4248 → 3113 lines after extraction. Defensive surface preserved: input validation, NaN/inf guards, dispatch-level validity, estimator-level integration smoke tests, set_params atomicity, sparse-path activation thresholds + density-gate fallback. `METHODOLOGY_REVIEW.md` row L91 promoted to **Complete** with `Last Review = 2026-05-26`; detail block rewritten with Verified Components / Test Coverage / R Comparison Results inline table / Corrections Made / Deviations / Outstanding Concerns. Priority queue at L1386 pruned: PreTrendsPower removed (already Complete since 2026-05-19) and ConleySpatialHAC removed (this PR); substantive-review-blocked renumbered #2-#5 → #1-#4 and consolidation-pass-blocked renumbered #6-#8 → #5-#6. diff --git a/README.md b/README.md index a4ce9993..1e738591 100644 --- a/README.md +++ b/README.md @@ -108,7 +108,7 @@ Full guide: `diff_diff.get_llm_guide("practitioner")`. - [TwoStageDiD](https://diff-diff.readthedocs.io/en/stable/api/two_stage.html) - Gardner (2022) two-stage estimator with GMM sandwich variance - [SpilloverDiD](https://diff-diff.readthedocs.io/en/stable/api/spillover.html) - Butts (2021) ring-indicator spillover-aware DiD identifying direct effect on treated + per-ring spillover on near-control units; handles non-staggered and staggered timing; supports survey-design variance under `survey_design=` for HC1 / CR1 (Wave E.1 Binder TSL) and Conley (Wave E.2 panel-aware stratified-Conley sandwich on per-period PSU totals; extended in Wave E.2 follow-up to `conley_lag_cutoff > 0` via panel-block composition with within-PSU serial Bartlett HAC — `lag>0` requires an effective PSU via explicit `survey_design.psu` or injected `cluster=`); `SurveyDesign.subpopulation()` preserves full-design `n_psu` / `df_survey` via zero-padded scores (Wave E.3, R `svyrecvar(subset())` form) - [SyntheticDiD](https://diff-diff.readthedocs.io/en/stable/api/estimators.html) - Synthetic DiD combining standard DiD and synthetic control for few treated units -- [SyntheticControl](https://diff-diff.readthedocs.io/en/stable/api/synthetic_control.html) - Abadie, Diamond & Hainmueller (2010) classic synthetic control for a single treated unit (donor-weight counterfactual, nested/custom V; no inference in this release — permutation/placebo planned) +- [SyntheticControl](https://diff-diff.readthedocs.io/en/stable/api/synthetic_control.html) - Abadie, Diamond & Hainmueller (2010) classic synthetic control for a single treated unit (donor-weight counterfactual, nested/custom V; in-space placebo permutation inference via `in_space_placebo()`) - [TripleDifference](https://diff-diff.readthedocs.io/en/stable/api/triple_diff.html) - triple difference (DDD) estimator for designs requiring two criteria for treatment eligibility - [ContinuousDiD](https://diff-diff.readthedocs.io/en/stable/api/continuous_did.html) - Callaway, Goodman-Bacon & Sant'Anna (2024) continuous treatment DiD with dose-response curves - [HeterogeneousAdoptionDiD](https://diff-diff.readthedocs.io/en/stable/api/had.html) - de Chaisemartin, Ciccia, D'Haultfœuille & Knau (2026) for designs where **no unit remains untreated**; local-linear estimator at the dose support boundary returning Weighted Average Slope (WAS) on Design 1' (`d̲ = 0` / QUG) or `WAS_{d̲}` on Design 1 (`d̲ > 0`, continuous-near-d̲ or mass-point), with a multi-period event-study extension (last-treatment cohort, pointwise CIs). **Panel-only** in this release - repeated cross-sections rejected by the validator. Alias `HAD`. diff --git a/TODO.md b/TODO.md index 5d20a4f9..5236208d 100644 --- a/TODO.md +++ b/TODO.md @@ -85,7 +85,7 @@ Deferred items from PR reviews that were not addressed before merge. | ImputationDiD dense `(A0'A0).toarray()` scales O((U+T+K)^2), OOM risk on large panels | `imputation.py` | #141 | Medium (deferred — only triggers when sparse solver fails) | | Multi-absorb weighted demeaning needs iterative alternating projections for N > 1 absorbed FE with survey weights; unweighted multi-absorb also uses single-pass (pre-existing, exact only for balanced panels) | `estimators.py` | #218 | Medium | | Survey design resolution/collapse patterns are inconsistent across panel estimators — ContinuousDiD rebuilds unit-level design in SE code, EfficientDiD builds once in fit(), StackedDiD re-resolves on stacked data; extract shared helpers for panel-to-unit collapse, post-filter re-resolution, and metadata recomputation | `continuous_did.py`, `efficient_did.py`, `stacked_did.py` | #226 | Low | -| SyntheticControl: `SyntheticControlResults` not wired into the practitioner / DiagnosticReport / BusinessReport routing, so routing SCM results through those tools yields generic parallel-trends/HonestDiD guidance that doesn't fit SCM. Add SCM to the native-routed rejection sets (mirror SDiD/TROP) and surface SCM-native diagnostics (pre-fit / in-space placebo / in-time placebo / leave-one-out). Deferred to PR-2, where it pairs with the placebo-inference layer those reports would surface. | `practitioner.py`, `diagnostic_report.py`, `business_report.py` | SCM PR-1 → PR-2 | Medium | +| SyntheticControl: in-time placebo + leave-one-out donor-robustness diagnostics are not implemented (ADH 2015, not the ADH 2010 scope of the current estimator), so `_scm_native` surfaces only pre-fit + in-space placebo. The practitioner / DiagnosticReport / BusinessReport routing and the in-space placebo permutation layer landed in PR-2; this remaining row covers adding the two ADH-2015 diagnostics (and surfacing them under `estimator_native_diagnostics`) in a later 2015-sourced PR. | `synthetic_control.py`, `diagnostic_report.py` | ADH-2015 follow-up | Low | | ContinuousDiD deferred CGBS 2024 extensions: (a) `covariates=` kwarg not implemented (matches R `contdid` v0.1.0); (b) discrete-treatment saturated regression deferred (integer-valued dose currently warned, not routed to per-level coefficients); (c) lowest-dose-as-control per CGBS 2024 Remark 3.1 (when `P(D=0) = 0`) not implemented — estimator requires never-treated controls. REGISTRY `## ContinuousDiD` → Implementation Checklist marks these as deferred `[ ]` items. | `diff_diff/continuous_did.py` | — | Low | | Survey-weighted Silverman bandwidth in EfficientDiD conditional Omega* — `_silverman_bandwidth()` uses unweighted mean/std for bandwidth selection; survey-weighted statistics would better reflect the population distribution but is a second-order refinement | `efficient_did_covariates.py` | — | Low | | TROP: extend Wave 4's `_setup_trop_data` helper to also cover the duplicated bootstrap resampling loop in `_bootstrap_variance` / `_bootstrap_variance_global` (~40 LoC dedup; mirrors the data-setup helper pattern with a `fit_callable` parameter for the per-draw refit step). | `trop_local.py`, `trop_global.py` | follow-up | Low | @@ -163,6 +163,7 @@ Deferred items from PR reviews that were not addressed before merge. | MPD cluster+hc2_bm path computes CR2 precomputes twice — once via `solve_ols` → `_compute_cr2_bm` for vcov + per-coefficient DOF, then again via `_compute_cr2_bm_contrast_dof` from `MultiPeriodDiD.fit()` for the post-period-average contrast DOF. Both rebuild `H = X bread_inv X'`, the residual-maker `M`, and the per-cluster `A_g = (I - H_gg)^{-1/2}` matrices. O(n²k) redundant work; acceptable for typical cluster-robust DiD panel sizes (n ≤ a few thousand). Fix would plumb the contrast DOF through the existing CR2 vcov path (intrusive API change) or share the precomputes via a cached helper. | `linalg.py::_compute_cr2_bm_contrast_dof`, `estimators.py::MultiPeriodDiD.fit` | follow-up | Low | | Rust-backend HC2 implementation. Current Rust path only supports HC1; HC2 and CR2 Bell-McCaffrey fall through to the NumPy backend. For large-n fits this is noticeable. | `rust/src/linalg.rs` | Phase 1a | Low | | CR2 Bell-McCaffrey DOF uses a naive `O(n² k)` per-coefficient loop over cluster pairs. Pustejovsky-Tipton (2018) Appendix B has a scores-based formulation that avoids the full `n × n` `M` matrix. Switch when a user hits a large-`n` cluster-robust design. | `linalg.py::_compute_cr2_bm` | Phase 1a | Low | +| `SyntheticControl` retains a full `_SyntheticControlFitSnapshot` (pivoted outcome/predictor panels) on EVERY fit to support the opt-in `in_space_placebo()`, so callers who never run the placebo still pay O(units × periods × predictor-vars) memory (same as `SyntheticDiD`'s always-on snapshot for `in_time_placebo`). Store a compact array/index representation instead of per-variable DataFrames, or build the snapshot lazily on first placebo call (would need to retain the source data, ~same cost). | `synthetic_control.py` snapshot build, `synthetic_control_results.py::_SyntheticControlFitSnapshot` | follow-up | Low | #### Testing/Docs diff --git a/diff_diff/_reporting_helpers.py b/diff_diff/_reporting_helpers.py index f5711344..6cc70905 100644 --- a/diff_diff/_reporting_helpers.py +++ b/diff_diff/_reporting_helpers.py @@ -618,6 +618,23 @@ def describe_target_parameter(results: Any) -> Dict[str, Any]: "reference": "REGISTRY.md Sec. TROP", } + if name == "SyntheticControlResults": + return { + "name": "SCM ATT (mean post-treatment gap for the single treated unit)", + "definition": ( + "The average over the post-treatment periods of the gap " + "``alpha_hat_{1t} = Y_{1t} - sum_j w_j Y_{jt}`` between the single " + "treated unit and its donor-weighted synthetic control (Abadie, " + "Diamond & Hainmueller 2010). There is no population-averaging or " + "sampling estimand — it is the effect on the one treated unit; " + "significance is assessed by in-space placebo permutation inference " + "(no analytical standard error)." + ), + "aggregation": "single_unit_gap", + "headline_attribute": "att", + "reference": "REGISTRY.md Sec. SyntheticControl", + } + # Default: unrecognized result class. Fall through with a neutral # block — agents / downstream consumers can still dispatch on # ``aggregation="unknown"`` and fall back to generic ATT narration. diff --git a/diff_diff/business_report.py b/diff_diff/business_report.py index 6ceeb4a5..f6edb275 100644 --- a/diff_diff/business_report.py +++ b/diff_diff/business_report.py @@ -203,6 +203,7 @@ def __init__( if honest_did_results is not None and type(results).__name__ in { "SyntheticDiDResults", "TROPResults", + "SyntheticControlResults", }: raise ValueError( f"{type(results).__name__} routes robustness to " @@ -213,8 +214,9 @@ def __init__( "object's native diagnostics " "(SDiD: ``in_time_placebo()``, ``sensitivity_to_zeta_omega()``, " "``pre_treatment_fit``; TROP: ``effective_rank``, " - "``loocv_score``) — BusinessReport surfaces these " - "automatically under ``estimator_native_diagnostics``." + "``loocv_score``; SyntheticControl: ``in_space_placebo()``, " + "``pre_rmspe``, ``get_placebo_df()``) — BusinessReport surfaces " + "these automatically under ``estimator_native_diagnostics``." ) # Round-44 P1 CI review on PR #318: mirror the SDiD/TROP @@ -646,10 +648,13 @@ def _extract_headline(self, dr_schema: Optional[Dict[str, Any]]) -> Dict[str, An if att is None or not np.isfinite(att): sign = "undefined" ci_level = int(round((1.0 - display_alpha) * 100)) - is_significant = ( + # bool(...) coerces away numpy bool_ — when ``p`` is a numpy NaN (e.g. + # SyntheticControl, whose analytical p_value is always NaN), ``np.isfinite`` + # yields a numpy bool that is NOT JSON-serializable in the schema. + is_significant = bool( p is not None and np.isfinite(p) and p < phrasing_alpha if p is not None else False ) - near_threshold = ( + near_threshold = bool( p is not None and np.isfinite(p) and (phrasing_alpha - 0.01) < p < (phrasing_alpha + 0.001) @@ -1002,16 +1007,25 @@ def _lift_robustness(dr: Optional[Dict[str, Any]]) -> Dict[str, Any]: return {"status": "skipped", "reason": "auto_diagnostics=False"} bacon = dr.get("bacon") or {} native = dr.get("estimator_native_diagnostics") or {} + native_block = { + "status": native.get("status"), + "estimator": native.get("estimator"), + "pre_treatment_fit": native.get("pre_treatment_fit"), + } + # Classic SCM exposes pre_rmspe + donor-weight concentration + the (opt-in) + # in-space placebo rather than SDiD's pre_treatment_fit; surface those so the + # top-level robustness block is not empty for SyntheticControl. + if native.get("estimator") == "SyntheticControl": + native_block["pre_rmspe"] = native.get("pre_rmspe") + native_block["weight_concentration"] = native.get("weight_concentration") + native_block["in_space_placebo"] = native.get("in_space_placebo") return { "bacon": { "status": bacon.get("status"), "forbidden_weight": bacon.get("forbidden_weight"), "verdict": bacon.get("verdict"), }, - "estimator_native": { - "status": native.get("status"), - "pre_treatment_fit": native.get("pre_treatment_fit"), - }, + "estimator_native": native_block, } @@ -1153,6 +1167,20 @@ def _describe_assumption(estimator_name: str, results: Any = None) -> Dict[str, "captured through latent factor loadings." ), } + if estimator_name in {"SyntheticControlResults"}: + return { + # Distinct from SDiD's "synthetic_fit" weighted-PT analogue: classic + # SCM is a donor-weighted level match (matches the DR "scm_fit" method). + "parallel_trends_variant": "scm_fit", + "no_anticipation": True, + "description": ( + "Classic synthetic control identifies the single treated unit's " + "counterfactual via a donor-weighted match to its pre-treatment " + "trajectory (a design-enforced fit, not a parallel-trends test); " + "significance comes from in-space placebo permutation inference " + "rather than an analytical standard error." + ), + } if estimator_name == "ContinuousDiDResults": # Callaway, Goodman-Bacon & Sant'Anna (2024), two-level PT: # REGISTRY.md §ContinuousDiD > Identification. @@ -1780,6 +1808,8 @@ def _pt_method_subject(method: Optional[str]) -> str: return "Pre-treatment event-study coefficients" if method == "synthetic_fit": return "The synthetic-control pre-treatment fit" + if method == "scm_fit": + return "The synthetic-control donor-weighted pre-treatment fit" if method == "factor": return "The factor-model pre-treatment fit" return "Pre-treatment data" @@ -1806,7 +1836,9 @@ def _pt_method_stat_label(method: Optional[str]) -> Optional[str]: return "joint p" if method in {"slope_difference", "hausman"}: return "p" - if method in {"synthetic_fit", "factor"}: + if method in {"synthetic_fit", "scm_fit", "factor"}: + # Design-enforced fit-based paths have no p-value label (SCM's significance + # is the in-space placebo, not a PT joint test). return None return "joint p" @@ -1846,6 +1878,13 @@ def _references_for(estimator_name: str) -> List[Dict[str, str]]: "& Wager, S. (2021). Synthetic Difference in Differences." ), }, + "SyntheticControlResults": { + "role": "estimator", + "citation": ( + "Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic " + "Control Methods for Comparative Case Studies. JASA, 105(490)." + ), + }, "SunAbrahamResults": { "role": "estimator", "citation": ( @@ -2181,11 +2220,20 @@ def _render_summary(schema: Dict[str, Any]) -> str: "assumption." + sens_tail_see_reliable ) elif verdict == "design_enforced_pt": - sentences.append( - "The synthetic control is designed to match the treated " - "group's pre-period trajectory (SDiD's weighted-parallel-" - "trends analogue)." - ) + if method == "scm_fit": + sentences.append( + "The synthetic control is designed to reproduce the treated " + "unit's pre-period trajectory via donor weights (classic SCM's " + "design-enforced analogue of parallel trends); significance " + "comes from in-space placebo permutation inference, not a " + "parallel-trends test." + ) + else: + sentences.append( + "The synthetic control is designed to match the treated " + "group's pre-period trajectory (SDiD's weighted-parallel-" + "trends analogue)." + ) elif verdict == "inconclusive": # Round-35 P1 CI review on PR #318: a ``verdict=="inconclusive"`` # state means one or more pre-period coefficients had diff --git a/diff_diff/diagnostic_report.py b/diff_diff/diagnostic_report.py index cdfdf21a..9173f3c6 100644 --- a/diff_diff/diagnostic_report.py +++ b/diff_diff/diagnostic_report.py @@ -157,6 +157,21 @@ "estimator_native", } ), + "SyntheticControlResults": frozenset( + # Classic SCM expresses its identifying assumption as design-enforced + # pre-treatment fit (the donor weights match the treated unit's + # pre-period trajectory), so — like SDiD — ``parallel_trends`` routes to + # the fit-based ``_pt_scm_fit`` (verdict ``design_enforced_pt``, NOT a PT + # hypothesis test; see prose). It also exposes ``estimator_native`` (the + # in-space placebo permutation inference + weight concentration). + # ``sensitivity`` is omitted (no HonestDiD analogue — significance testing + # is the native placebo); ``heterogeneity`` is omitted (one treated unit). + { + "parallel_trends", + "design_effect", + "estimator_native", + } + ), "EfficientDiDResults": frozenset( { "parallel_trends", @@ -195,6 +210,7 @@ # "hausman" — EfficientDiD.hausman_pretest (native PT-All vs PT-Post) # "synthetic_fit" — SDiD weighted pre-treatment fit (surfaces pre_treatment_fit) # "factor" — TROP factor-model identification (no PT; renders "N/A" prose) +# "scm_fit" — classic SCM donor-weighted pre-treatment fit (surfaces pre_rmspe) _PT_METHOD: Dict[str, str] = { "DiDResults": "two_x_two", "MultiPeriodDiDResults": "event_study", @@ -210,6 +226,7 @@ "ChaisemartinDHaultfoeuilleResults": "event_study", "SyntheticDiDResults": "synthetic_fit", "TROPResults": "factor", + "SyntheticControlResults": "scm_fit", } @@ -257,11 +274,14 @@ class DiagnosticReport: pre_periods, post_periods : list, optional Explicit pre- and post-treatment period labels. run_parallel_trends, run_sensitivity, run_placebo, run_bacon, run_design_effect, run_heterogeneity, run_epv, run_pretrends_power : bool - Per-check opt-in flags. ``run_placebo`` defaults to ``False`` (opt-in, - expensive, currently not implemented - placebo key remains reserved - as ``skipped`` in the schema). All other checks default to ``True`` - and are further gated by estimator-type and instance-level - applicability (see ``docs/methodology/REPORTING.md``). + Per-check opt-in flags. ``run_placebo`` defaults to ``False`` — the generic + placebo battery is not implemented in MVP, so the ``placebo`` key remains + reserved as ``skipped`` in the schema. (Exception: ``SyntheticControl``'s + in-space placebo permutation test IS implemented — run it via + ``results.in_space_placebo()``; its result is surfaced under + ``estimator_native_diagnostics.in_space_placebo``, not this generic section.) + All other checks default to ``True`` and are further gated by estimator-type + and instance-level applicability (see ``docs/methodology/REPORTING.md``). sensitivity_M_grid : tuple of float, default (0.5, 1.0, 1.5, 2.0) Grid of M values passed to ``HonestDiD.sensitivity``. Yields a ``SensitivityResults`` object with ``breakdown_M`` populated. @@ -298,7 +318,9 @@ class DiagnosticReport: Other sections (``design_effect``, ``heterogeneity``, ``epv``) are read directly from the fitted result object and do not currently accept precomputed values — there is no expensive call to bypass. - ``placebo`` is reserved in the schema but opt-in / deferred in MVP. + ``placebo`` is reserved in the schema but opt-in / deferred in MVP for the + generic battery; ``SyntheticControl`` surfaces its in-space placebo under + ``estimator_native_diagnostics`` (run ``results.in_space_placebo()``). outcome_label, treatment_label : str, optional Plain-English labels used in prose rendering. """ @@ -390,11 +412,18 @@ def __init__( # native-routing contract documented in REPORTING.md. # Round-21 P1 CI review on PR #318 flagged this bypass. _result_name = type(self._results).__name__ - _native_routed_names = {"SyntheticDiDResults", "TROPResults"} + _native_routed_names = {"SyntheticDiDResults", "TROPResults", "SyntheticControlResults"} if _result_name in _native_routed_names: _incompatible_keys = [] if "sensitivity" in self._precomputed: _incompatible_keys.append("sensitivity") + # All native-routed estimators reject a precomputed ``parallel_trends`` + # payload — their PT verdict is computed INTERNALLY (SDiD/SCM: + # design-enforced pre-treatment fit; TROP: factor-model identification), + # not supplied as a user p-value. SyntheticControl is no exception: its + # ``scm_fit`` route reads ``pre_rmspe`` from the fit, and the generic + # precomputed-PT adapter only understands p-value-style payloads, which + # are methodology-incompatible with SCM (it has no PT p-value test). if "parallel_trends" in self._precomputed: _incompatible_keys.append("parallel_trends") # Round-32 P1 CI review on PR #318: ``pretrends_power`` is a @@ -420,7 +449,9 @@ def __init__( "Use the native diagnostics on the result object " "(SDiD: ``in_time_placebo``, ``sensitivity_to_zeta_omega``, " "``pre_treatment_fit``; TROP: ``effective_rank``, " - "``loocv_score``) — DR surfaces these automatically." + "``loocv_score``; SyntheticControl: ``in_space_placebo()``, " + "``pre_rmspe``, ``get_placebo_df()``) — DR surfaces these " + "automatically under ``estimator_native_diagnostics``." ) # Round-44 P1 CI review on PR #318: mirror the SDiD/TROP @@ -572,6 +603,19 @@ def _compute_applicable_checks(self) -> Tuple[set, Dict[str, str]]: # shape is stable: ``schema["placebo"]["status"] == "skipped"`` # always holds regardless of estimator. The opt-in execution path # is deferred to a follow-up; ``REPORTING.md`` documents this. + # SyntheticControl is the exception — its in-space placebo permutation test + # IS implemented (run via results.in_space_placebo()) and surfaced under + # estimator_native_diagnostics, so its generic-section reason points there + # rather than claiming placebo is unimplemented (``setdefault`` wins over + # the generic message below). + if type(self._results).__name__ == "SyntheticControlResults": + skipped.setdefault( + "placebo", + "SyntheticControl's placebo battery is the in-space placebo " + "permutation test — run results.in_space_placebo(); the result is " + "surfaced under estimator_native_diagnostics.in_space_placebo, not " + "this generic section.", + ) skipped.setdefault( "placebo", "Placebo battery runs on opt-in only; not yet implemented in MVP. " @@ -877,7 +921,7 @@ def _instance_skip_reason(self, check: str) -> Optional[str]: continue return "No group/event-study effects available to compute heterogeneity." if check == "estimator_native": - if name not in {"SyntheticDiDResults", "TROPResults"}: + if name not in {"SyntheticDiDResults", "TROPResults", "SyntheticControlResults"}: return f"{name} does not expose native validation methods." return None return None @@ -1073,11 +1117,29 @@ def _ran(key: str) -> bool: # has effectively been performed; treating the sensitivity # section as not-run would have ``next_steps`` redundantly # recommend a check the report already executed (round-19 - # CI review on PR #318). + # CI review on PR #318). SyntheticControl is deliberately NOT + # in this set: its significance procedure is the OPT-IN in-space + # placebo (``in_space_placebo()``), which ``_scm_native`` reports + # but does not run, so the work is not complete merely because the + # native block ran — the practitioner ``placebo`` step must persist. result_name = type(self._results).__name__ - if result_name in {"SyntheticDiDResults", "TROPResults"} and _ran("estimator_native"): + if result_name in { + "SyntheticDiDResults", + "TROPResults", + } and _ran("estimator_native"): if "sensitivity" not in completed: completed.append("sensitivity") + # SCM's significance step is the opt-in in-space placebo. Mark it + # complete only when the caller has run it AND it produced a VALID + # reference set (>=1 placebo + a finite p-value) — an attempted-but- + # infeasible run (J<2, treated-fit failure, or all donors failed) leaves + # placebo_p_value NaN and is NOT complete, so the recommendation persists. + if ( + result_name == "SyntheticControlResults" + and int(getattr(self._results, "n_placebos", 0) or 0) > 0 + and np.isfinite(getattr(self._results, "placebo_p_value", np.nan)) + ): + completed.append("placebo") if _ran("heterogeneity"): completed.append("heterogeneity") ns = practitioner_next_steps( @@ -1114,6 +1176,8 @@ def _check_parallel_trends(self) -> Dict[str, Any]: return self._pt_hausman() if method == "synthetic_fit": return self._pt_synthetic_fit() + if method == "scm_fit": + return self._pt_scm_fit() if method == "factor": return self._pt_factor() return { @@ -2086,6 +2150,10 @@ def _check_estimator_native(self) -> Dict[str, Any]: TROP: factor-model fit metrics (``effective_rank``, ``loocv_score``, selected ``lambda_*``). + + SyntheticControl: pre-treatment fit (``pre_rmspe``), donor-weight + concentration, and — when already computed — the in-space placebo + permutation p-value (``in_space_placebo``). """ r = self._results name = type(r).__name__ @@ -2093,6 +2161,8 @@ def _check_estimator_native(self) -> Dict[str, Any]: return self._sdid_native(r) if name == "TROPResults": return self._trop_native(r) + if name == "SyntheticControlResults": + return self._scm_native(r) return { "status": "not_applicable", "reason": f"{name} does not expose native validation methods.", @@ -2172,6 +2242,71 @@ def _trop_native(self, r: Any) -> Dict[str, Any]: }, } + def _scm_native(self, r: Any) -> Dict[str, Any]: + """Populate classic-SCM-native diagnostics section. + + Always surfaces the pre-treatment fit (``pre_rmspe``) and donor-weight + concentration. The in-space placebo permutation inference (ADH 2010 §2.4) + is reported only when the user has ALREADY run ``in_space_placebo()`` — + DR never triggers it implicitly, because it refits one synthetic control + per donor (potentially many nested V searches) and the placebo layer is + opt-in by design. (This differs from SDiD's cheaper in-time-placebo sweep, + which ``_sdid_native`` runs inline.) Only the in-space placebo is exposed; + in-time placebo and leave-one-out are ADH 2015 (not implemented). + """ + out: Dict[str, Any] = {"status": "ran", "estimator": "SyntheticControl"} + out["pre_rmspe"] = _to_python_float(getattr(r, "pre_rmspe", None)) + out["n_donors"] = _to_python_scalar(getattr(r, "n_donors", None)) + out["v_method"] = getattr(r, "v_method", None) + # Donor-weight concentration (Herfindahl + top weight): SCM places all mass + # on the donor simplex, so concentration is the natural "how few donors + # drive the synthetic" diagnostic. + try: + weights = [float(w) for w in r.donor_weights.values()] + out["weight_concentration"] = { + "herfindahl": _to_python_float(sum(w * w for w in weights)), + "top_weight": _to_python_float(max(weights) if weights else None), + "n_donors_with_weight": len(weights), + } + except Exception as exc: # noqa: BLE001 + out["weight_concentration"] = { + "status": "error", + "reason": f"donor_weights unavailable: {type(exc).__name__}: {exc}", + } + # In-space placebo: surface only if already computed (opt-in; see docstring). + if getattr(r, "_placebo_df", None) is not None: + n_placebos = int(getattr(r, "n_placebos", 0) or 0) + placebo_p = getattr(r, "placebo_p_value", np.nan) + block = { + "placebo_p_value": _to_python_float(placebo_p), + "rmspe_ratio": _to_python_float(getattr(r, "rmspe_ratio", None)), + "n_placebos": _to_python_scalar(n_placebos), + "n_failed": _to_python_scalar(getattr(r, "n_failed", None)), + } + # Distinguish a valid run from an attempted-but-infeasible one (J<2, + # treated-fit non-convergence, or zero successful donor refits) so BR/DR + # consumers see an explicit status/reason rather than a bare NaN p-value. + if n_placebos > 0 and np.isfinite(placebo_p): + block["status"] = "ran" + else: + block["status"] = "infeasible" + block["reason"] = ( + "in_space_placebo() was run but produced no valid reference set " + "(fewer than 2 donors, a non-converged treated fit, or all donor " + "refits failed); placebo_p_value is NaN." + ) + out["in_space_placebo"] = block + else: + out["in_space_placebo"] = { + "status": "not_run", + "reason": ( + "Call results.in_space_placebo() to run in-space placebo " + "permutation inference (opt-in; refits one synthetic control " + "per donor)." + ), + } + return out + # -- Heterogeneity helpers -------------------------------------------- def _collect_effect_scalars(self) -> List[float]: @@ -2504,6 +2639,31 @@ def _pt_synthetic_fit(self) -> Dict[str, Any]: "verdict": "design_enforced_pt", } + def _pt_scm_fit(self) -> Dict[str, Any]: + """Classic SCM donor-weighted pre-treatment-fit PT analogue. + + Classic synthetic control (Abadie-Diamond-Hainmueller 2010) is not + parallel-trends-based: the donor weights are chosen to match the treated + unit's pre-period trajectory, so a small ``pre_rmspe`` is the + design-enforced substitute for a PT test. Unlike SDiD this reads + ``pre_rmspe`` (always populated on a successful fit). The in-space placebo + permutation inference lives in ``estimator_native_diagnostics``. + """ + r = self._results + fit = _to_python_float(getattr(r, "pre_rmspe", None)) + if fit is None: + return { + "status": "skipped", + "method": "scm_fit", + "reason": "SyntheticControlResults.pre_rmspe is not populated on this fit.", + } + return { + "status": "ran", + "method": "scm_fit", + "pre_treatment_fit_rmse": fit, + "verdict": "design_enforced_pt", + } + def _pt_factor(self) -> Dict[str, Any]: """TROP has no PT concept — its identification is factor-model-based.""" return { @@ -3077,7 +3237,10 @@ def _check_headline(check: str, section: Dict[str, Any]) -> Optional[Any]: if check == "epv": return section.get("min_epv") if check == "estimator_native": - return section.get("pre_treatment_fit") + # SDiD reports ``pre_treatment_fit``; classic SCM reports ``pre_rmspe`` (its + # design-enforced pre-fit) — fall back so SCM's tabular headline is not None. + fit = section.get("pre_treatment_fit") + return fit if fit is not None else section.get("pre_rmspe") return None @@ -3112,6 +3275,8 @@ def _pt_subject_phrase(method: Optional[str]) -> str: return "Pre-treatment event-study coefficients" if method == "synthetic_fit": return "The synthetic-control pre-treatment fit" + if method == "scm_fit": + return "The synthetic-control donor-weighted pre-treatment fit" if method == "factor": return "The factor-model pre-treatment fit" return "Pre-treatment data" @@ -3139,7 +3304,7 @@ def _pt_stat_label(method: Optional[str]) -> Optional[str]: return "joint p" if method in {"slope_difference", "hausman"}: return "p" - if method in {"synthetic_fit", "factor"}: + if method in {"synthetic_fit", "scm_fit", "factor"}: return None return "joint p" @@ -3299,15 +3464,30 @@ def _render_overall_interpretation(schema: Dict[str, Any], labels: Dict[str, str ) elif verdict == "design_enforced_pt": rmse = pt.get("pre_treatment_fit_rmse") - sentences.append( - f"The synthetic control matches the treated group's " - f"pre-period trajectory with RMSE = " - f"{rmse:.3g} (SDiD's design-enforced analogue of parallel " - f"trends)." - if isinstance(rmse, (int, float)) - else "SDiD's synthetic control is designed to satisfy the " - "weighted parallel-trends analogue." - ) + if pt.get("method") == "scm_fit": + # Classic SCM: a single treated UNIT, donor-weighted level match; + # significance is the in-space placebo, not SDiD's weighted PT. + sentences.append( + f"The synthetic control reproduces the treated unit's " + f"pre-period trajectory with pre-RMSPE = {rmse:.3g} (classic " + f"SCM's donor-weighted, design-enforced analogue of parallel " + f"trends; significance is assessed via in-space placebo " + f"permutation, not a parallel-trends test)." + if isinstance(rmse, (int, float)) + else "Classic SCM's donor-weighted synthetic control is designed " + "to match the treated unit's pre-period trajectory; significance " + "comes from in-space placebo permutation inference." + ) + else: + sentences.append( + f"The synthetic control matches the treated group's " + f"pre-period trajectory with RMSE = " + f"{rmse:.3g} (SDiD's design-enforced analogue of parallel " + f"trends)." + if isinstance(rmse, (int, float)) + else "SDiD's synthetic control is designed to satisfy the " + "weighted parallel-trends analogue." + ) elif verdict == "inconclusive": # Round-35 P1 CI review on PR #318: DR summary / overall # interpretation must surface the inconclusive state diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt index e12801c4..b9a54e21 100644 --- a/diff_diff/guides/llms-full.txt +++ b/diff_diff/guides/llms-full.txt @@ -616,7 +616,7 @@ scm.fit( ) -> SyntheticControlResults ``` -**Inference:** NONE analytical — `se`/`t_stat`/`p_value`/`conf_int` are NaN. `att` is the mean post-period gap; use placebo/permutation inference for significance (planned follow-up). Predictor periods must lie within the pre window; `post_periods` must be a contiguous suffix cross-checked against `D` (no anticipation). +**Inference:** NONE analytical — `se`/`t_stat`/`p_value`/`conf_int` are always NaN. `att` is the mean post-period gap. Significance via in-space placebo permutation inference: `results.in_space_placebo()` reassigns treatment to each donor, refits against the other J-1 donors (the real treated unit is excluded from every placebo pool), and sets `placebo_p_value = rank/(n_placebos+1)` from the post/pre RMSPE-ratio. The permutation `placebo_p_value` is a SEPARATE field from the (NaN) `p_value`; `is_significant` stays bound to `p_value`. Predictor periods must lie within the pre window; `post_periods` must be a contiguous suffix cross-checked against `D` (no anticipation). **Usage:** @@ -1278,7 +1278,10 @@ Returned by `SyntheticControl.fit()`. | Attribute | Type | Description | |-----------|------|-------------| | `att` | `float` | Mean post-period gap (reported point estimate) | -| `se`, `t_stat`, `p_value`, `conf_int` | `float` / tuple | Always NaN — no analytical SE (permutation inference planned) | +| `se`, `t_stat`, `p_value`, `conf_int` | `float` / tuple | Always NaN — no analytical SE (use `in_space_placebo()`) | +| `placebo_p_value` | `float` | In-space placebo permutation p-value, `rank/(n_placebos+1)` (NaN until `in_space_placebo()` runs) | +| `rmspe_ratio` | `float` | Treated unit's post/pre RMSPE ratio = sqrt(post-MSPE/pre-MSPE) (the placebo test statistic; set at fit) | +| `n_placebos`, `n_failed` | `int` | Placebos in the reference set / excluded for non-convergence | | `n_obs` | `int` | Treated + donor rows over all periods | | `n_donors` | `int` | Donor units in the (post-filter) pool | | `n_pre_periods`, `n_post_periods` | `int` | Period counts | @@ -1292,7 +1295,7 @@ Returned by `SyntheticControl.fit()`. | `pre_periods`, `post_periods` | `list` | Calendar-sorted periods | | `v_method`, `standardize` | `str` | Echoed configuration | -**Methods:** `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()`, `get_gap_df()`, `get_weights_df()` +**Methods:** `in_space_placebo()` (opt-in permutation inference; refits one synthetic control per donor), `get_placebo_df()` (per-unit RMSPE-ratio table incl. the treated row), `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()`, `get_gap_df()`, `get_weights_df()` ### TripleDifferenceResults diff --git a/diff_diff/guides/llms.txt b/diff_diff/guides/llms.txt index f56e63cd..5261b046 100644 --- a/diff_diff/guides/llms.txt +++ b/diff_diff/guides/llms.txt @@ -60,7 +60,7 @@ Full practitioner guide: call `diff_diff.get_llm_guide("practitioner")` - [TwoStageDiD](https://diff-diff.readthedocs.io/en/stable/api/two_stage.html): Gardner (2022) two-stage estimator with GMM sandwich variance - [SpilloverDiD](https://diff-diff.readthedocs.io/en/stable/api/spillover.html): Butts (2021) ring-indicator spillover-aware DiD identifying direct effect on treated + per-ring spillover-on-control; reuses `conley_coords` for ring construction; handles non-staggered and staggered timing; supports `SurveyDesign(weights, strata, psu, fpc)` under `vcov_type="hc1"` with optional `cluster=` for CR1 via Gerber (2026) Binder TSL (Wave E.1) and under `vcov_type="conley"` via a panel-aware stratified-Conley sandwich on per-period PSU totals (Wave E.2 cross-sectional `conley_lag_cutoff=0`) extended in Wave E.2 follow-up to `conley_lag_cutoff > 0` via panel-block composition with within-PSU serial Bartlett HAC (Newey-West 1987 separable form; `lag>0` requires an effective PSU via explicit `survey_design.psu` or injected `cluster=`), both composed with the Wave D Gardner GMM correction; `SurveyDesign.subpopulation()` preserves full-design `n_psu` / `df_survey` via zero-padded scores at the meat-helper boundary (Wave E.3, R `svyrecvar(subset())` form) (replicate weights queued as follow-up) - [SyntheticDiD](https://diff-diff.readthedocs.io/en/stable/api/estimators.html): Synthetic DiD combining standard DiD and synthetic control methods for few treated units -- [SyntheticControl](https://diff-diff.readthedocs.io/en/stable/api/synthetic_control.html): Abadie, Diamond & Hainmueller (2010) classic synthetic control for ONE treated unit — donor-weight counterfactual, nested or custom predictor-importance V, gap path + pre-RMSPE; no inference in this release (inference fields are NaN — permutation/placebo planned) +- [SyntheticControl](https://diff-diff.readthedocs.io/en/stable/api/synthetic_control.html): Abadie, Diamond & Hainmueller (2010) classic synthetic control for ONE treated unit — donor-weight counterfactual, nested or custom predictor-importance V, gap path + pre-RMSPE; no analytical SE (inference fields NaN), significance via in-space placebo permutation inference (`in_space_placebo()`, post/pre RMSPE-ratio, p = rank/(n_placebos+1)) - [TripleDifference](https://diff-diff.readthedocs.io/en/stable/api/triple_diff.html): Triple difference (DDD) estimator for designs requiring two criteria for treatment eligibility - [ContinuousDiD](https://diff-diff.readthedocs.io/en/stable/api/continuous_did.html): Callaway, Goodman-Bacon & Sant'Anna (2024) continuous treatment DiD with dose-response curves - [HeterogeneousAdoptionDiD](https://diff-diff.readthedocs.io/en/stable/api/had.html): de Chaisemartin, Ciccia, D'Haultfœuille & Knau (2026) for designs where **no unit remains untreated**; local-linear estimator at the dose support boundary returning Weighted Average Slope (WAS) on Design 1' (`d̲=0` / QUG) or `WAS_{d̲}` on Design 1 (`d̲>0`, continuous-near-d̲ or mass-point), with multi-period event-study extension (last-treatment cohort, pointwise CIs). **Panel-only** in this release (repeated cross-sections rejected by the validator). Alias `HAD`. diff --git a/diff_diff/practitioner.py b/diff_diff/practitioner.py index 128f378a..b0d6811f 100644 --- a/diff_diff/practitioner.py +++ b/diff_diff/practitioner.py @@ -20,6 +20,7 @@ "estimator_selection", "estimation", "sensitivity", + "placebo", "heterogeneity", "robustness", } @@ -37,6 +38,7 @@ "StackedDiDResults": "StackedDiD", "SyntheticDiDResults": "SyntheticDiD", "TROPResults": "TROP", + "SyntheticControlResults": "SyntheticControl", "EfficientDiDResults": "EfficientDiD", "ContinuousDiDResults": "ContinuousDiD", "TripleDifferenceResults": "TripleDifference (DDD)", @@ -71,7 +73,7 @@ def practitioner_next_steps( Steps the caller has already completed. Valid names: ``"target_parameter"``, ``"assumptions"``, ``"parallel_trends"``, ``"estimator_selection"``, ``"estimation"``, ``"sensitivity"``, - ``"heterogeneity"``, ``"robustness"``. + ``"placebo"``, ``"heterogeneity"``, ``"robustness"``. verbose : bool, default True If True, print a human-readable summary to stdout. @@ -656,6 +658,72 @@ def _handle_trop(results: Any): return steps, warnings +def _handle_synthetic_control(results: Any): + steps = [ + _step( + baker_step=6, + label="In-space placebo permutation inference", + why=( + "Classic SCM has no analytical standard error. Significance " + "comes from the in-space placebo test (Abadie-Diamond-Hainmueller " + "2010, Section 2.4): reassign treatment to each donor, refit, and " + "rank the treated unit's post/pre RMSPE ratio " + "(p = rank/(n_placebos+1), excluding non-converged placebos)." + ), + code=( + "placebo_df = results.in_space_placebo()\n" + "print(f'placebo p-value: {results.placebo_p_value:.3f} " + "(n={results.n_placebos})')\n" + "print(placebo_df) # per-unit RMSPE-ratio table used for the rank" + ), + priority="high", + # SCM's significance test IS the placebo; tag it "placebo" (not + # "sensitivity") so it survives once the native diagnostics block runs, + # mirroring _handle_trop. + step_name="placebo", + ), + _step( + baker_step=3, + label="Demonstrate pre-treatment fit (SCM identification)", + why=( + "SCM's identifying assumption is design-enforced fit, not a " + "parallel-trends test: it is only credible when the synthetic " + "control reproduces the treated unit's pre-period path. Report " + "the pre-RMSPE and predictor-balance table; a poor fit means do " + "not use SCM (ADH 2010 p. 495)." + ), + code=( + "print(f'pre-treatment RMSPE: {results.pre_rmspe:.4f}')\n" + "print(results.predictor_balance)\n" + "print(results.get_weights_df()) # donor weight concentration" + ), + priority="high", + # Design-enforced fit IS SCM's parallel-trends analogue (mirrors the + # DiagnosticReport ``scm_fit`` PT routing); tagging it "parallel_trends" + # keeps it from being auto-suppressed as the completed estimation step. + step_name="parallel_trends", + ), + _step( + baker_step=4, + label="Curate the donor pool", + why=( + "Donors exposed to the same/similar intervention or to large " + "confounding shocks contaminate the comparison (ADH 2010 " + "pp. 498-499). Restrict the donor pool to clean, comparable units." + ), + code=( + "# Exclude contaminated donors explicitly:\n" + "# synthetic_control(..., donor_pool=[clean, comparable, units])" + ), + priority="medium", + step_name="estimator_selection", + ), + _robustness_compare_step("SyntheticDiD or CS"), + ] + warnings = _check_nan_att(results) + return steps, warnings + + def _handle_efficient(results: Any): steps = [ _parallel_trends_step(staggered=True), @@ -1301,6 +1369,7 @@ def _handle_generic(results: Any): "StackedDiDResults": _handle_stacked, "SyntheticDiDResults": _handle_synthetic, "TROPResults": _handle_trop, + "SyntheticControlResults": _handle_synthetic_control, "EfficientDiDResults": _handle_efficient, "ContinuousDiDResults": _handle_continuous, "TripleDifferenceResults": _handle_triple, diff --git a/diff_diff/synthetic_control.py b/diff_diff/synthetic_control.py index 8f349124..164568a2 100644 --- a/diff_diff/synthetic_control.py +++ b/diff_diff/synthetic_control.py @@ -17,9 +17,11 @@ time weights and ridge regularization: classic SCM uses **donor weights only** and a level-matching estimator, plus the outer ``V`` search SyntheticDiD has no analog for. -Inference: classic SCM has **no analytical standard error** — the paper proposes -permutation/placebo inference (a later PR). ``se``/``t_stat``/``p_value``/``conf_int`` -are always NaN here; ``att`` (mean post-period gap) is the reported estimate. +Inference: classic SCM has **no analytical standard error**, so +``se``/``t_stat``/``p_value``/``conf_int`` are always NaN; ``att`` (mean post-period +gap) is the reported estimate. Significance comes from in-space placebo permutation +inference (ADH 2010 §2.4) via ``SyntheticControlResults.in_space_placebo()`` — a +separate ``placebo_p_value`` field, distinct from the NaN analytical ``p_value``. Numerics provenance: the standardization divisor and the inner/outer optimization scheme are NOT specified in ADH (2010) — they are pinned from the R ``Synth`` package @@ -35,7 +37,10 @@ import pandas as pd from scipy.optimize import minimize -from diff_diff.synthetic_control_results import SyntheticControlResults +from diff_diff.synthetic_control_results import ( + SyntheticControlResults, + _SyntheticControlFitSnapshot, +) from diff_diff.utils import _sc_weight_fw, safe_inference, warn_if_not_converged __all__ = ["SyntheticControl", "synthetic_control"] @@ -389,6 +394,9 @@ def fit( # mspe_v is the OUTER-objective value; it is populated only when a nested V # search actually ran (None on the custom and single-donor paths). mspe_v: Optional[float] = None + # ``outer_converged`` tracks whether the nested V search reached an optimum + # (trivially True when there is no outer search: custom V or a single donor). + outer_converged = True if self.v_method == "custom": v = self._prepare_custom_v(k) w, converged = _inner_solve_W(X1s, X0s, v, self.inner_max_iter, self.inner_min_decrease) @@ -405,7 +413,7 @@ def fit( w = np.array([1.0]) converged = True else: - v, w, converged, mspe_v = _outer_solve_V( + v, w, converged, mspe_v, outer_converged = _outer_solve_V( X1, X0, X1s, @@ -426,6 +434,12 @@ def fit( pre_rmspe = float(np.sqrt(np.mean(pre_gaps**2))) att = float(np.mean(post_gaps)) + # Treated unit's post/pre RMSPE ratio — the in-space placebo test statistic + # (ADH 2010 §2.4). Cheap to compute now; the placebo reference distribution + # is built on demand by SyntheticControlResults.in_space_placebo(). + treated_scale = float(np.max(np.abs(Z1))) if Z1.size else 0.0 + rmspe_ratio = _rmspe_ratio(pre_gaps, post_gaps, treated_scale) + # Poor-fit warning (REGISTRY contract: warn when pre-RMSPE exceeds the SD of # the treated unit's pre-period outcomes). This includes a FLAT treated pre-path # (pre_sd == 0): any non-trivial RMSPE then means the synthetic cannot reproduce @@ -490,12 +504,45 @@ def fit( standardize=self.standardize, alpha=self.alpha, mspe_v=mspe_v, + rmspe_ratio=rmspe_ratio, + ) + # Retain the panel state needed to refit each donor as a pseudo-treated + # unit for in-space placebo inference (ADH 2010 §2.4). Stored as a plain + # attribute (not a dataclass field) and excluded from pickling via + # SyntheticControlResults.__getstate__ (it holds the full panel). + # COPY all caller-owned mutable inputs (the custom_v array, the + # optimizer_options dict, the specs list) so post-fit mutation of the + # estimator's inputs cannot silently change in_space_placebo() output on an + # already-returned results object. (pivots are freshly pivoted here, and the + # period/id lists are re-listed below, so those are not caller-aliased.) + results._fit_snapshot = _SyntheticControlFitSnapshot( + pivots=pivots, + specs=list(specs), + outcome=outcome, + all_periods=list(all_periods), + pre_periods=list(pre_periods), + post_periods=list(post_periods), + donor_ids=list(donor_ids), + treated_id=treated_id, + standardize=self.standardize, + v_method=self.v_method, + custom_v=( + None if self.custom_v is None else np.array(self.custom_v, dtype=float, copy=True) + ), + n_starts=self.n_starts, + seed=self.seed, + optimizer_options=( + None if self.optimizer_options is None else dict(self.optimizer_options) + ), + inner_max_iter=self.inner_max_iter, + inner_min_decrease=self.inner_min_decrease, ) - # Reserved for PR-2 (placebo inference) / PR-3 (conformal). Set as plain - # attributes so dataclasses.asdict/fields cannot reach them. - results._placebo_gaps = None - results._rmspe_ratio = None - results._fit_snapshot = None + # Persist whether the treated unit's own fit reached a valid optimum — both + # the inner Frank-Wolfe weight solve AND (on the nested path) the outer V + # search. in_space_placebo() fails closed otherwise: ranking a statistic from + # a truncated / under-optimized treated fit would not be a valid ADH 2010 + # §2.4 permutation (and an under-optimized V is anti-conservative). + results._fit_converged = bool(converged and outer_converged) self.results_ = results self.is_fitted_ = True @@ -1022,17 +1069,20 @@ def _outer_solve_V( optimizer_options: Optional[Dict[str, Any]], inner_max_iter: int, inner_min_decrease: float, -) -> Tuple[np.ndarray, np.ndarray, bool, float]: +) -> Tuple[np.ndarray, np.ndarray, bool, float, bool]: """Data-driven V selection: minimize pre-period outcome MSPE of W*(V). Multistart Nelder-Mead over ``theta`` (V = softmax(theta)) plus a derivative-free - Powell polish from the best point. Returns ``(v_star, w_star, converged, mspe)``. + Powell polish from the best point. Returns + ``(v_star, w_star, inner_converged, mspe, outer_converged)``. """ k = X1s.shape[0] if k == 1: + # Single predictor: V is fixed (no outer search), so outer convergence is + # trivially satisfied. v = np.array([1.0]) w, converged = _inner_solve_W(X1s, X0s, v, inner_max_iter, inner_min_decrease) - return v, w, converged, float(np.mean((Z1 - Z0 @ w) ** 2)) + return v, w, converged, float(np.mean((Z1 - Z0 @ w) ** 2)), True # Track inner Frank-Wolfe non-convergence across ALL intermediate evaluations so # the outer search cannot silently rank truncated W*(V) solves (codex). `_inner_solve_W` @@ -1077,26 +1127,41 @@ def objective(theta: np.ndarray) -> float: _st["total"] += start_total _st["nonconv"] += start_nonconv + # Track convergence of the SELECTED (lowest-objective) incumbent, NOT "any start + # converged" — a non-winning start's success says nothing about whether the V we + # actually return is a valid optimum. ``best_success`` follows ``best_x``. best_x: np.ndarray = starts[0] best_fun = np.inf - outer_converged = False + best_success = False for theta0 in starts: res = minimize(objective, theta0, method="Nelder-Mead", options=nm_options) - outer_converged = outer_converged or bool(res.success) if res.fun < best_fun: best_fun = float(res.fun) best_x = res.x + best_success = bool(res.success) - # Derivative-free polish from the incumbent (best-of, mirrors R optimx). + # Derivative-free polish from the incumbent (best-of, mirrors R optimx). If the + # polish improves on the incumbent it becomes the selected solution (carry its + # success); if it does not improve but converged AT the incumbent's objective + # level, that validates the incumbent as an optimum. res_p = minimize(objective, best_x, method="Powell", options=powell_options) - outer_converged = outer_converged or bool(res_p.success) if res_p.fun < best_fun: best_fun = float(res_p.fun) best_x = res_p.x - - # Surface a silent under-optimized V: if neither the multistart Nelder-Mead nor - # the Powell polish reported success (e.g. optimizer_options={"maxiter": 1}), the - # selected V / donor weights / ATT may be sub-optimal. + best_success = bool(res_p.success) + elif bool(res_p.success) and np.isclose(res_p.fun, best_fun, rtol=1e-5, atol=1e-8): + # Powell did not improve but converged back AT the incumbent (same objective + # within tolerance) -> the incumbent is a validated optimum. A Powell run that + # "succeeds" at a STRICTLY WORSE objective ended elsewhere and says nothing + # about the selected incumbent's validity, so it must NOT flip best_success + # (which would silently admit an under-optimized V from a success=False start). + best_success = True + outer_converged = best_success + + # Surface a silent under-optimized V: if the SELECTED outer solution did not + # converge (e.g. optimizer_options={"maxiter": 1}, or the lowest-objective + # incumbent came from a truncated run), the selected V / donor weights / ATT may + # be sub-optimal. if not outer_converged: warnings.warn( "Outer V-search (Nelder-Mead / Powell) did not converge; the selected " @@ -1125,7 +1190,7 @@ def objective(theta: np.ndarray) -> float: v_star = _softmax(best_x) w_star, converged = _inner_solve_W(X1s, X0s, v_star, inner_max_iter, inner_min_decrease) mspe = float(np.mean((Z1 - Z0 @ w_star) ** 2)) - return v_star, w_star, converged, mspe + return v_star, w_star, converged, mspe, outer_converged def _compute_gap_path( @@ -1141,3 +1206,107 @@ def _compute_gap_path( synthetic = donor_block @ w gaps = treated_series - synthetic return {period: float(g) for period, g in zip(all_periods, gaps)} + + +# ============================================================================= +# in-space placebo permutation inference (ADH 2010 §2.4) — used by +# SyntheticControlResults.in_space_placebo() via function-level import +# ============================================================================= + + +def _mspe(gap_path: Dict[Any, float], periods: List[Any]) -> float: + """Mean squared prediction error of ``gap_path`` over ``periods``.""" + if not periods: + return float("nan") + g = np.array([gap_path[p] for p in periods], dtype=float) + return float(np.mean(g**2)) + + +def _rmspe_ratio(pre_gaps: np.ndarray, post_gaps: np.ndarray, scale: float) -> float: + """Post/pre RMSPE ratio — the in-space placebo test statistic (ADH 2010 §2.4). + + Returns ``RMSPE_post / RMSPE_pre = sqrt(MSPE_post / MSPE_pre)`` (the root-scale + ratio, matching the ``rmspe_ratio`` name). ADH 2010 §3.4 reports the *MSPE* + ratio (the square of this); the two are monotone-equivalent on nonnegative + values, so the permutation rank and p-value are identical either way — only the + reported statistic's scale differs. + + The pre-period MSPE denominator is floored at a scale-aware + ``1e-8 * max(scale, 1)**2`` (squared-outcome units) BEFORE the square root, so a + (near-)perfect pre-treatment fit (pre-MSPE → 0) yields a large-but-FINITE ratio + rather than ``inf``/``nan`` (which would corrupt the permutation rank). + ``scale`` is the magnitude of the unit's pre-period outcomes. Mirrors the + ``_fit_tol`` poor-fit guard in ``fit()``. + """ + pre_mspe = float(np.mean(pre_gaps**2)) if pre_gaps.size else float("nan") + post_mspe = float(np.mean(post_gaps**2)) if post_gaps.size else float("nan") + floor = 1e-8 * max(float(scale), 1.0) ** 2 + return float(np.sqrt(post_mspe / max(pre_mspe, floor))) + + +def _placebo_fit_unit( + snap: _SyntheticControlFitSnapshot, + unit: Any, + donor_pool: List[Any], + n_starts: int, +) -> Optional[Tuple[Dict[Any, float], float]]: + """Refit a synthetic control for one (pseudo-)treated ``unit`` vs ``donor_pool``. + + Reuses the exact predictor build / standardization / weight solve / gap-path + path as ``fit()`` — none of those helpers reads estimator (``self``) state, so + the refit is driven entirely by the snapshot. The real treated unit is never in + ``donor_pool`` (the caller passes the other ``J−1`` donors). Returns + ``(gap_path, rmspe_ratio)``, or ``None`` when the weight solve does not converge + (the caller excludes such placebos from the permutation reference set). + Per-placebo ``UserWarning``s (poor fit, zero-variance row, non-convergence) are + suppressed here; the caller surfaces an aggregate count. + """ + X1, X0, _ = _build_predictor_matrix(snap.pivots, snap.specs, unit, donor_pool) + # Belt-and-suspenders: fit() already gated non-finite outcomes over the full + # treated+donor panel, so a donor reassigned as pseudo-treated has finite cells. + if not (np.all(np.isfinite(X1)) and np.all(np.isfinite(X0))): + return None + X1s, X0s, _ = _standardize(X1, X0, snap.standardize) + Y = snap.pivots[snap.outcome] + Z1 = Y.loc[snap.pre_periods, unit].to_numpy(dtype=float) + Z0 = Y.loc[snap.pre_periods, donor_pool].to_numpy(dtype=float) + # ``outer_converged`` is trivially True when there is no outer V search (custom + # V or a single-donor degenerate pool). + outer_converged = True + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + if snap.v_method == "custom": + # custom_v length == k (= len(specs)) is unit-agnostic, so it stays + # valid for every placebo; it was finiteness/non-negativity-checked at + # construction. Just trace-normalize (matches _prepare_custom_v). + v = np.asarray(snap.custom_v, dtype=float).ravel() + v = v / v.sum() + w, converged = _inner_solve_W(X1s, X0s, v, snap.inner_max_iter, snap.inner_min_decrease) + elif len(donor_pool) == 1: + # Degenerate: a single donor forces w = [1.0]; V is irrelevant. + w, converged = np.array([1.0]), True + else: + _, w, converged, _, outer_converged = _outer_solve_V( + X1, + X0, + X1s, + X0s, + Z1, + Z0, + n_starts, + snap.seed, + snap.optimizer_options, + snap.inner_max_iter, + snap.inner_min_decrease, + ) + # Exclude a placebo whose fit is not a valid optimum — a truncated inner W OR an + # under-optimized outer V search. An under-optimized placebo V fits the pre-period + # worse, shrinking its RMSPE ratio and biasing the permutation p-value + # anti-conservatively, so such placebos must not silently enter the rank. + if not (converged and outer_converged): + return None + gap_path = _compute_gap_path(Y, w, unit, donor_pool, snap.all_periods) + pre_gaps = np.array([gap_path[p] for p in snap.pre_periods], dtype=float) + post_gaps = np.array([gap_path[p] for p in snap.post_periods], dtype=float) + scale = float(np.max(np.abs(Z1))) if Z1.size else 0.0 + return gap_path, _rmspe_ratio(pre_gaps, post_gaps, scale) diff --git a/diff_diff/synthetic_control_results.py b/diff_diff/synthetic_control_results.py index e36c8352..dff829f0 100644 --- a/diff_diff/synthetic_control_results.py +++ b/diff_diff/synthetic_control_results.py @@ -5,12 +5,15 @@ ``synthetic_control.py`` to mirror the TROP estimator/results split. The classic synthetic control of Abadie, Diamond & Hainmueller (2010) produces a -gap path and donor/predictor weights but **no analytical standard error** — the -paper proposes permutation/placebo inference instead (a later PR). Accordingly -``se``/``t_stat``/``p_value``/``conf_int`` are always NaN on this object; the -point estimate ``att`` (average post-period gap) is the reported quantity. +gap path and donor/predictor weights but **no analytical standard error**. +Accordingly ``se``/``t_stat``/``p_value``/``conf_int`` are always NaN on this +object; the point estimate ``att`` (average post-period gap) is the reported +quantity. Significance comes from in-space placebo permutation inference via +:meth:`SyntheticControlResults.in_space_placebo` (a separate ``placebo_p_value`` +field, not the NaN ``p_value``). """ +import warnings from dataclasses import dataclass, field from typing import Any, Dict, List, Optional, Tuple @@ -22,6 +25,40 @@ __all__ = ["SyntheticControlResults"] +@dataclass +class _SyntheticControlFitSnapshot: + """Panel state retained for post-hoc in-space placebo refits. + + Holds everything ``SyntheticControlResults.in_space_placebo()`` needs to + refit ANY donor as the pseudo-treated unit without re-reading the original + DataFrame. Built in ``SyntheticControl.fit()`` and excluded from pickling by + ``SyntheticControlResults.__getstate__`` (it retains the full treated+donor + outcome/predictor panel — a privacy/size hazard if serialized). + + ``specs`` is annotated ``List[Any]`` rather than ``List[_PredictorSpec]`` to + avoid an import cycle (``_PredictorSpec`` lives in ``synthetic_control.py``, + which imports this module). ``donor_ids`` is an ORDERED list so the placebo + iteration order — and therefore the rank / p-value — is deterministic. + """ + + pivots: Dict[str, pd.DataFrame] + specs: List[Any] + outcome: str + all_periods: List[Any] + pre_periods: List[Any] + post_periods: List[Any] + donor_ids: List[Any] + treated_id: Any + standardize: str + v_method: str + custom_v: Optional[Any] + n_starts: int + seed: Optional[int] + optimizer_options: Optional[Dict[str, Any]] + inner_max_iter: int + inner_min_decrease: float + + @dataclass class SyntheticControlResults: """ @@ -82,8 +119,22 @@ class SyntheticControlResults: ``"std"`` (per-row SD scaling) or ``"none"``. alpha : float Significance level recorded for downstream (placebo) inference. + rmspe_ratio : float + The treated unit's post/pre RMSPE ratio = ``sqrt(MSPE_post / MSPE_pre)`` — + the in-space placebo test statistic (ADH 2010 §2.4), computed at fit time. + placebo_p_value : float + In-space placebo permutation p-value (``rank / (n_placebos + 1)``), NaN + until :meth:`in_space_placebo` is run. SEPARATE from the (always-NaN) + analytical ``p_value``; ``is_significant`` stays bound to ``p_value``. + n_placebos, n_failed : int + Donor placebos that entered the permutation reference set / were excluded + for non-convergence. Both 0 until :meth:`in_space_placebo` is run. survey_metadata : Any, optional Reserved; always None in this release. + + Significance for classic SCM comes from :meth:`in_space_placebo` (opt-in + in-space placebo permutation inference); :meth:`get_placebo_df` returns the + per-unit RMSPE-ratio table used for the rank. """ att: float @@ -108,13 +159,49 @@ class SyntheticControlResults: alpha: float = 0.05 mspe_v: Optional[float] = None survey_metadata: Optional[Any] = field(default=None) - - # Reserved for PR-2 (placebo inference) / PR-3 (conformal). These are plain - # (un-annotated) class attributes, NOT dataclass fields, so dataclasses.fields() - # and dataclasses.asdict() cannot reach them; fit() sets them per instance. - _placebo_gaps = None - _rmspe_ratio = None - _fit_snapshot = None + # In-space placebo permutation inference (Abadie-Diamond-Hainmueller 2010 + # Section 2.4), populated by ``in_space_placebo()``. ``rmspe_ratio`` (the + # treated unit's post/pre RMSPE ratio) is computed at fit time; the rest stay + # at their no-inference defaults until a placebo run. NOTE: the permutation + # ``placebo_p_value`` is deliberately SEPARATE from ``p_value`` (which stays + # NaN) — it is not an analytical p-value, has no SE / t-stat, and does not + # flow through ``safe_inference``. ``is_significant`` likewise stays bound to + # the (NaN) ``p_value``, NOT ``placebo_p_value``. + placebo_p_value: float = np.nan + rmspe_ratio: float = np.nan + n_placebos: int = 0 + n_failed: int = 0 + + def __post_init__(self) -> None: + # Internal state set per instance by ``fit()`` / ``in_space_placebo()``. + # Declared here (not as dataclass fields) so ``dataclasses.fields()`` / + # ``dataclasses.asdict()`` cannot reach the retained panel state. + # ``_fit_snapshot`` (full panel) and ``_placebo_gaps`` (per-unit gap paths) + # are panel-derived and nulled on pickle by ``__getstate__``; ``_placebo_df`` + # holds the small per-unit aggregate table returned by ``get_placebo_df()``. + self._fit_snapshot: Optional[_SyntheticControlFitSnapshot] = None + self._placebo_gaps: Optional[Dict[Any, Dict[Any, float]]] = None + self._placebo_df: Optional[pd.DataFrame] = None + # Whether the treated unit's own inner Frank-Wolfe weight solve converged. + # in_space_placebo() fails closed when this is False: a truncated treated + # fit makes the ranked statistic (rmspe_ratio) not a valid SCM optimum. + self._fit_converged: bool = True + + def __getstate__(self) -> Dict[str, Any]: + """Exclude panel-derived internal state from pickling. + + ``_fit_snapshot`` retains the full treated+donor panel and ``_placebo_gaps`` + the per-unit gap paths — both panel-derived, a privacy/size hazard if the + pickle is sent elsewhere. The scalar placebo fields (``placebo_p_value``, + ``rmspe_ratio``, ``n_placebos``, ``n_failed``) and the small ``_placebo_df`` + aggregate table survive. An unpickled result keeps all public fields; a + diagnostic call that needs the snapshot (``in_space_placebo``) then raises a + ValueError directing the user to re-fit. Mirrors ``SyntheticDiDResults``. + """ + state = self.__dict__.copy() + state["_fit_snapshot"] = None + state["_placebo_gaps"] = None + return state def __repr__(self) -> str: """Concise string representation.""" @@ -209,12 +296,51 @@ def summary(self, alpha: Optional[float] = None) -> str: f"{'ATT (avg gap)':<15} {self.att:>12.4f} {'n/a':>12} " f"{'n/a':>10} {'n/a':>10}", "-" * 75, "", - "Inference: classic SCM has no analytical standard error.", - "Use permutation / placebo inference for significance testing", - "(Abadie-Diamond-Hainmueller 2010, Section 2.4).", - "=" * 75, ] ) + # Three states: (1) placebo never run -> point to in_space_placebo(); + # (2) run with a valid reference set -> show the permutation p-value; + # (3) run but infeasible (no placebo entered the rank, e.g. J<2 or all + # donors failed) -> say so explicitly rather than implying it was not run. + # ``_placebo_df is not None`` is the "attempted" signal (survives pickling). + placebo_attempted = self._placebo_df is not None + if placebo_attempted and np.isfinite(self.placebo_p_value): + # The classic analytical fields above stay n/a (no SE); this is the + # permutation p-value of the post/pre RMSPE ratio, p = rank/(n_placebos+1). + lines.extend( + [ + "In-space placebo permutation inference " + "(Abadie-Diamond-Hainmueller 2010, Section 2.4):", + f"{' RMSPE ratio (post/pre):':<34} {self.rmspe_ratio:>10.4f}", + f"{' Permutation p-value:':<34} {self.placebo_p_value:>10.4f}", + f"{' Placebos in reference set:':<34} {self.n_placebos:>10d}" + + (f" ({self.n_failed} excluded)" if self.n_failed else ""), + "", + "(Analytical SE is still undefined for classic SCM; the " + "p-value above is permutation-based.)", + "=" * 75, + ] + ) + elif placebo_attempted: + lines.extend( + [ + "In-space placebo permutation inference was attempted but " + "produced no valid reference set", + f"(0 placebos entered the rank; {self.n_failed} failed to " + "converge). placebo_p_value is undefined — too few donors or " + "all donor refits failed. Inspect get_placebo_df().", + "=" * 75, + ] + ) + else: + lines.extend( + [ + "Inference: classic SCM has no analytical standard error.", + "Run in_space_placebo() for in-space permutation inference", + "(Abadie-Diamond-Hainmueller 2010, Section 2.4).", + "=" * 75, + ] + ) return "\n".join(lines) @@ -248,6 +374,13 @@ def to_dict(self) -> Dict[str, Any]: "treated_unit": self.treated_unit, "v_method": self.v_method, "standardize": self.standardize, + # In-space placebo permutation inference. rmspe_ratio is set at fit; + # placebo_p_value / n_placebos / n_failed stay at their no-inference + # defaults (NaN / 0) until in_space_placebo() runs. + "rmspe_ratio": self.rmspe_ratio, + "placebo_p_value": self.placebo_p_value, + "n_placebos": self.n_placebos, + "n_failed": self.n_failed, } if self.survey_metadata is not None: sm = self.survey_metadata @@ -293,3 +426,265 @@ def get_weights_df(self) -> pd.DataFrame: [{"unit": unit, "weight": w} for unit, w in items], columns=["unit", "weight"], ) + + _PLACEBO_COLS = ["unit", "pre_mspe", "post_mspe", "rmspe_ratio", "is_treated", "status"] + + def get_placebo_df(self) -> pd.DataFrame: + """ + Get the in-space placebo distribution as a DataFrame (one row per unit). + + This is a per-unit SUMMARY table (one row per unit), enough to reproduce + the permutation rank and a ratio-distribution plot — NOT the per-period + placebo gap paths needed for the classic "spaghetti" plot (those are + retained internally on ``_placebo_gaps`` for the successful placebos). + Columns: ``unit``, ``pre_mspe``, ``post_mspe``, ``rmspe_ratio``, + ``is_treated``, ``status`` (``"treated"`` / ``"placebo"`` / ``"failed"``). + The treated unit is always present as a single ``is_treated=True, + status="treated"`` row (its ratio is the original J-donor fit). After a + placebo run **that produced a reference set** (``>= 2`` donors AND a + converged treated fit), the table has ``n_donors + 1`` rows — every donor + appears, including those whose refit did not converge (``status="failed"`` + with NaN metrics, excluded from the rank). In the degenerate / fail-closed + cases (fewer than 2 donors, or a treated fit that did not converge) the + placebo loop does not run, so only the treated row is returned. + + Populated by :meth:`in_space_placebo`; the summary table is retained on + pickling, so it is still returned after a round-trip. Before any placebo + run — including on an unpickled result that never ran one — only the + treated row is returned. + + Returns + ------- + pandas.DataFrame + """ + if self._placebo_df is not None: + return self._placebo_df.copy() + from diff_diff.synthetic_control import _mspe + + pre = _mspe(self.gap_path, self.pre_periods) + post = _mspe(self.gap_path, self.post_periods) + return pd.DataFrame( + [ + { + "unit": self.treated_unit, + "pre_mspe": pre, + "post_mspe": post, + "rmspe_ratio": self.rmspe_ratio, + "is_treated": True, + "status": "treated", + } + ], + columns=self._PLACEBO_COLS, + ) + + def in_space_placebo( + self, + n_starts: Optional[int] = None, + ) -> pd.DataFrame: + """ + In-space placebo permutation inference (Abadie-Diamond-Hainmueller 2010, + Section 2.4). + + Reassigns the treatment to each donor in turn, re-estimates a synthetic + control for that pseudo-treated donor against the OTHER donors, and ranks + the real treated unit's post/pre RMSPE ratio among all units. Populates + ``placebo_p_value``, ``n_placebos`` and ``n_failed`` on this object + (``rmspe_ratio`` — the treated unit's own ratio — is set at fit time) and + returns the placebo distribution via :meth:`get_placebo_df`. + + The real treated unit is **excluded from every placebo's donor pool**: its + post-period outcome is treatment-contaminated, so allowing a placebo to + load weight on it would bias the placebo gap. The ranking set is therefore + the ``J+1`` units ``{treated} ∪ {J placebos}``, with each placebo fit + against the other ``J-1`` donors (this matches the standard + ``SCtools::generate.placebos`` construction). The post/pre RMSPE ratio + normalizes by pre-treatment fit, which obviates the pre-fit-cutoff + filtering of ADH Figures 5-7 (journal p. 502), so no pre-fit filter is + offered — every converged placebo enters the rank. + + The permutation ``placebo_p_value`` is intentionally distinct from + ``p_value`` (which stays NaN — classic SCM has no analytical SE) and from + ``is_significant`` (which also stays bound to the NaN ``p_value``). + + A placebo is **excluded** from the reference set (counted in ``n_failed``) + when its fit is not a valid optimum — EITHER its inner Frank-Wolfe weight + solve did not converge (a truncated ``W`` is unusable) OR its outer ``V`` + search did not converge (an under-optimized ``V`` fits the pre-period worse, + shrinking its RMSPE ratio and biasing the permutation p-value + anti-conservatively). Each placebo refit **inherits the original fit's + ``optimizer_options`` / ``n_starts``**, so valid inference requires settings + adequate for the outer ``V`` search to converge: production defaults do; + with cheap settings, raise ``n_starts`` here or re-fit with a larger + ``optimizer_options['maxiter']`` (otherwise placebos are dropped as failed). + The treated unit's own fit is held to the same standard — if its inner OR + outer search did not converge, the whole run fails closed (see below). + + Parameters + ---------- + n_starts : int, optional + Override the multistart count for each placebo's nested V search. + Default None inherits the original fit's ``n_starts``. The placebo + loop is the cost driver (one outer V search per donor); lower it for a + faster, coarser scan. + + Returns + ------- + pandas.DataFrame + The placebo distribution (see :meth:`get_placebo_df`). + + Raises + ------ + ValueError + If the fit snapshot is unavailable (e.g. this result was unpickled). + """ + if self._fit_snapshot is None: + raise ValueError( + "in_space_placebo() requires the fit snapshot on the results " + "object. This result appears to have been loaded from " + "serialization (which excludes the snapshot) or produced by an " + "older estimator version. Re-fit to enable in-space placebo " + "inference." + ) + from diff_diff.synthetic_control import _mspe, _placebo_fit_unit + + snap = self._fit_snapshot + donors = list(snap.donor_ids) + n_donors = len(donors) + n_starts_eff = snap.n_starts if n_starts is None else int(n_starts) + + treated_pre = _mspe(self.gap_path, snap.pre_periods) + treated_post = _mspe(self.gap_path, snap.post_periods) + treated_ratio = self.rmspe_ratio + + rows: List[Dict[str, Any]] = [ + { + "unit": snap.treated_id, + "pre_mspe": treated_pre, + "post_mspe": treated_post, + "rmspe_ratio": treated_ratio, + "is_treated": True, + "status": "treated", + } + ] + + # Fail closed when the treated unit's OWN fit did not converge at fit time + # (inner Frank-Wolfe weight solve OR outer V search): ranking a statistic + # from a truncated / under-optimized treated fit would not be a valid ADH + # 2010 §2.4 permutation (placebos already fail-closed on non-convergence, so + # the treated unit must too). ``_fit_converged`` folds both failure modes, so + # the remediation names the knobs for each. + if not self._fit_converged: + warnings.warn( + "In-space placebo skipped: the treated unit's own SCM fit did not " + "converge at fit time (inner Frank-Wolfe weight solve and/or outer V " + "search), so its RMSPE ratio is not a valid optimum to rank against " + "placebos. placebo_p_value is NaN — re-fit with a larger " + "inner_max_iter / looser inner_min_decrease (inner) and/or a larger " + "optimizer_options['maxiter'] / more n_starts (outer V search).", + UserWarning, + stacklevel=2, + ) + self.placebo_p_value = np.nan + self.n_placebos = 0 + self.n_failed = 0 + self._placebo_gaps = {} + self._placebo_df = pd.DataFrame(rows, columns=self._PLACEBO_COLS) + return self._placebo_df.copy() + + if n_donors < 2: + warnings.warn( + "In-space placebo inference requires at least 2 donors (each " + f"placebo is fit against the other donors); only {n_donors} " + "available. placebo_p_value is NaN.", + UserWarning, + stacklevel=2, + ) + self.placebo_p_value = np.nan + self.n_placebos = 0 + self.n_failed = 0 + self._placebo_gaps = {} + self._placebo_df = pd.DataFrame(rows, columns=self._PLACEBO_COLS) + return self._placebo_df.copy() + + if n_donors == 2: + warnings.warn( + "In-space placebo with 2 donors: each placebo is fit against a " + "single donor (degenerate weight w=[1]) with no V search, so the " + "permutation p-value is coarse (only 2 placebos enter the " + "reference set; the smallest attainable p-value is 1/3).", + UserWarning, + stacklevel=2, + ) + + placebo_gaps: Dict[Any, Dict[Any, float]] = {} + ranked_ratios: List[float] = [] + n_failed = 0 + + for j in donors: + pool = [d for d in donors if d != j] + fitted = _placebo_fit_unit(snap, j, pool, n_starts_eff) + if fitted is None: + # Non-converged inner Frank-Wolfe weight solve (a truncated W is + # unusable for ranking): exclude from BOTH the numerator and the + # denominator (never penalize a truncated solve into the rank). + # Still record the donor with NaN metrics so get_placebo_df() + # returns the full treated + every-donor unit set. + n_failed += 1 + rows.append( + { + "unit": j, + "pre_mspe": np.nan, + "post_mspe": np.nan, + "rmspe_ratio": np.nan, + "is_treated": False, + "status": "failed", + } + ) + continue + gap_path_j, ratio_j = fitted + placebo_gaps[j] = gap_path_j + pre_j = _mspe(gap_path_j, snap.pre_periods) + post_j = _mspe(gap_path_j, snap.post_periods) + ranked_ratios.append(ratio_j) + rows.append( + { + "unit": j, + "pre_mspe": pre_j, + "post_mspe": post_j, + "rmspe_ratio": ratio_j, + "is_treated": False, + "status": "placebo", + } + ) + + n_placebos = len(ranked_ratios) + if n_placebos == 0: + warnings.warn( + "No in-space placebo entered the reference set (all donors " + f"failed to converge or were filtered out of {n_donors}); " + "placebo_p_value is NaN.", + UserWarning, + stacklevel=2, + ) + p_value = np.nan + else: + # One-sided rank, treated unit included as the "+1". Ties counted via + # ``>=`` so the p-value is conservative. + rank = 1 + sum(1 for r in ranked_ratios if r >= treated_ratio) + p_value = rank / (n_placebos + 1) + + if n_failed > 0: + warnings.warn( + f"{n_failed} of {n_donors} in-space placebos failed to converge " + "and were excluded from the permutation distribution; " + f"placebo_p_value uses the remaining {n_placebos}.", + UserWarning, + stacklevel=2, + ) + + self.placebo_p_value = float(p_value) + self.n_placebos = int(n_placebos) + self.n_failed = int(n_failed) + self._placebo_gaps = placebo_gaps + self._placebo_df = pd.DataFrame(rows, columns=self._PLACEBO_COLS) + return self._placebo_df.copy() diff --git a/docs/api/synthetic_control.rst b/docs/api/synthetic_control.rst index a9377898..506c7422 100644 --- a/docs/api/synthetic_control.rst +++ b/docs/api/synthetic_control.rst @@ -19,9 +19,11 @@ over the post periods. - Aggregate / few-unit comparative case studies (states, regions, countries). **Inference:** classic SCM has **no analytical standard error**. ``se``, ``t_stat``, -``p_value`` and ``conf_int`` are NaN; ``att`` (the mean post-period gap) is the reported -estimate. The paper's in-space placebo permutation inference (post/pre RMSPE-ratio, -``rank/(J+1)`` p-value) is planned for a follow-up release. +``p_value`` and ``conf_int`` are always NaN; ``att`` (the mean post-period gap) is the +reported estimate. Significance comes from **in-space placebo permutation inference** via +:meth:`~diff_diff.SyntheticControlResults.in_space_placebo` (post/pre RMSPE-ratio statistic, +``placebo_p_value = rank/(n_placebos+1)``). This permutation p-value is a separate field +from the (NaN) ``p_value``; ``is_significant`` stays bound to ``p_value``. **Distinct from** :class:`~diff_diff.SyntheticDiD` (Arkhangelsky et al. 2021), which adds time weights and ridge regularization; classic SCM uses **donor weights only** plus the @@ -67,6 +69,8 @@ Results container for synthetic control estimation. .. autosummary:: + ~SyntheticControlResults.in_space_placebo + ~SyntheticControlResults.get_placebo_df ~SyntheticControlResults.summary ~SyntheticControlResults.print_summary ~SyntheticControlResults.to_dict @@ -155,6 +159,13 @@ Quick estimation with the convenience function:: ) print(f"ATT: {results.att:.3f}, pre-RMSPE: {results.pre_rmspe:.3f}") +In-space placebo permutation inference (opt-in; refits one synthetic control per donor):: + + placebo_df = results.in_space_placebo() # reassigns treatment to each donor + print(f"placebo p-value: {results.placebo_p_value:.3f} " + f"(n_placebos={results.n_placebos})") # p = rank/(n_placebos+1) + print(placebo_df) # per-unit RMSPE-ratio table used for the permutation rank + Supplying a fixed predictor-importance matrix (skips the outer V search):: import numpy as np diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 736cbd92..26420281 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -1984,7 +1984,7 @@ Classic synthetic control (donor/unit weights only) for a single treated unit, d - **Outer solve (`v_method="nested"`):** choose diagonal PSD `V` minimizing the pre-period **outcome** MSPE `mean((Z1 − Z0·W*(V))²)`. `v_method="custom"` skips the outer search and uses a user-supplied `custom_v` (trace-normalized). - **Effect:** gap path `α̂_1t = Y_1t − Σ_j w_j·Y_jt`; `att` = mean post-period gap; `pre_rmspe` = pre-period fit diagnostic. -**Inference:** **No analytical standard error** (Section 2.4) — `se`/`t_stat`/`p_value`/`conf_int` are NaN. The paper proposes in-space placebo permutation inference with the post/pre RMSPE-ratio statistic and `rank/(J+1)` p-value (deferred to a follow-up PR; `_placebo_gaps`/`_rmspe_ratio` are reserved on the results object). +**Inference:** **No analytical standard error** (Section 2.4) — `se`/`t_stat`/`p_value`/`conf_int` are always NaN. Significance comes from **in-space placebo permutation inference** via `SyntheticControlResults.in_space_placebo()`: reassign treatment to each donor, refit a synthetic control for it, and rank the treated unit's post/pre RMSPE ratio (`rmspe_ratio` = `RMSPE_post / RMSPE_pre` = `sqrt(MSPE_post / MSPE_pre)`) among all units; `placebo_p_value = rank / (n_placebos + 1)`, where `rank = 1 + #{placebos with ratio ≥ treated ratio}` — an **upper-tail rank test on the (unsigned) RMSPE-ratio statistic**, ties counted conservatively via `≥`. Because the ratio squares the gaps it is direction-agnostic: a large ratio signals an effect of *either* sign, so this is NOT a signed/one-directional ("one-sided") hypothesis test on the treatment-effect direction. ADH 2010 §3.4 reports the *MSPE* ratio (the square of `rmspe_ratio`); the two are monotone-equivalent, so the rank and p-value are identical — only the reported statistic's scale differs. `rmspe_ratio` (the treated statistic) is computed at fit time; `placebo_p_value` / `n_placebos` / `n_failed` are populated by the opt-in `in_space_placebo()` call. `get_placebo_df()` returns the per-unit ratio table used for the rank (the per-period placebo gap paths for a "spaghetti" plot are retained internally, not in this summary). **Notes / deviations:** - **Note:** The standardization divisor `divisor = sqrt(apply(cbind(X0,X1), 1, var))` (per-predictor SD over donors+treated, ddof=1) and the inner/outer optimizer are **not specified in ADH 2010** (which defers these numerics to Abadie & Gardeazabal 2003 App. B / the `Synth` software). The divisor is pinned from the R `Synth::synth` source; `solution.v` lives in this scaled predictor space, so the deterministic R-parity test feeds `custom_v` in the same scaled space. @@ -1994,8 +1994,12 @@ Classic synthetic control (donor/unit weights only) for a single treated unit, d - **Deviation from R:** `standardize="none"` disables predictor standardization entirely; R `Synth` always scales by the predictor SD. Provided for diagnostics; changes the geometry of the `V` objective. - **Note:** predictor rows support only **equal-weight** linear combinations of pre-period values — `mean` (`k_s = 1/T0`), `sum` (`k_s = 1`), and per-period outcome lags (identity, a single `k_s = 1`). ADH (2010) §2.3 defines the general form `Ȳ_i^{K_m} = Σ_s k_s Y_is` with *arbitrary* weights `k_s`; this release does NOT accept user-supplied non-uniform `K_m` weight vectors (and `median` and other non-linear aggregations are intentionally excluded). The supported set still spans the standard `Synth::dataprep` `predictors.op` + `special.predictors` usage; arbitrary-weight `K_m` is a deferred extension. - **Deviation from R:** predictor/outcome **aggregation fails closed on any non-finite (NaN/inf) cell**, whereas R `Synth::dataprep` hardwires `na.rm=TRUE` (aggregating over the observed cells of a partially-missing window). The fail-closed contract is deliberate: na-dropping silently aggregates different period subsets across units, yielding incomparable predictors with no warning. The analyst must restrict `predictor_window` / `special_predictors` / `pre_period_outcomes` periods (and the outcome panel) to where each variable is observed; both partially- and fully-missing windows raise `ValueError`. Only the row *ordering* matches `dataprep`, not the missing-data handling. +- **Note (in-space placebo donor pool):** the real treated unit is **excluded from every placebo's donor pool** — when donor `j` is pseudo-treated it is fit against the other `J−1` donors, never the real treated unit (whose post-period is treatment-contaminated). The ranking set is still the `J+1` units {treated} ∪ {J placebos}. ADH 2010 §2.4 does not spell out placebo donor-pool composition; this matches the standard `SCtools::generate.placebos` construction (rotate the pseudo-treated identity through the donor pool; the original treated unit is never re-added as a donor). +- **Note (placebo failure handling):** a placebo is **excluded from both the numerator and the denominator** of the rank (never penalized into it) and tallied in `n_failed` when its fit is not a valid optimum — EITHER its **inner Frank-Wolfe weight solve** did not converge (a truncated `W` is unusable) OR its **outer `V` search** did not converge (an under-optimized `V` fits the pre-period worse, shrinking the RMSPE ratio and biasing the p-value anti-conservatively, so it must not silently enter the rank). The reported p-value uses the **effective** count `rank / (n_placebos + 1)`, where `n_placebos` is the number of placebos that entered the reference set. Failed donors still appear in `get_placebo_df()` (`status="failed"`, NaN metrics), so once a reference set is produced the table is the full treated + every-donor unit set (`n_donors + 1` rows). In the fail-closed cases the placebo loop does not run and only the treated row is returned: `J < 2` → `placebo_p_value` is NaN with a warning (no placebo distribution; `J == 2` warns the distribution is coarse), and a treated fit whose own **inner OR outer** search did not converge also fails closed (ranking a truncated / under-optimized treated statistic would not be a valid permutation). **Caveat:** each placebo refit inherits the original fit's `optimizer_options` / `n_starts`, so valid inference requires settings adequate for the outer `V` search to converge to a comparable-quality synthetic (production defaults do; cheap settings under-optimize placebo `V` and those placebos are dropped as failed — raise `n_starts` on `in_space_placebo()` or re-fit with a larger `optimizer_options['maxiter']`). +- **Note (RMSPE-ratio floor):** the reported `rmspe_ratio = sqrt(MSPE_post / MSPE_pre)` floors the pre-period MSPE denominator at a scale-aware `1e-8 · max(|pre-outcomes|, 1)²` (before the square root) so a (near-)perfect pre-fit (`pre-MSPE → 0`) yields a large-but-FINITE ratio rather than `inf`/`nan` (which would corrupt the rank). Ties (`ratio_j ≥ treated_ratio`) are counted, making the p-value conservative. Mirrors the `_fit_tol` poor-fit guard. +- **Note (placebo p-value is non-analytical):** `placebo_p_value` is deliberately a SEPARATE field from `p_value` (which stays NaN) — it is a permutation p-value with no SE / t-stat, so it does not flow through `safe_inference`. `is_significant` likewise stays bound to the (NaN) `p_value`, NOT `placebo_p_value`; a tool gating on `is_significant` will see `False` even when `placebo_p_value` is small. The reporting stack surfaces the placebo p-value through `estimator_native_diagnostics`, never the analytical headline. -**Reference implementation:** authors' `Synth` package for R/MATLAB/Stata (`Synth::synth`). **R-parity anchor:** the Basque Country study (Abadie-Gardeazabal 2003, `data("basque")`) — published synthetic = region 10 (Cataluña) 0.851 + region 14 (Madrid) 0.149, `loss.v` 0.0089. Two-tier test (`tests/test_methodology_synthetic_control.py`): Tier-1 feeds R's `solution.v` via `custom_v` → donor weights match to atol 1e-3 (deterministic); Tier-2 checks the nested fit in a band. +**Reference implementation:** authors' `Synth` package for R/MATLAB/Stata (`Synth::synth`); in-space placebo construction follows `SCtools::generate.placebos`. **R-parity anchor:** the Basque Country study (Abadie-Gardeazabal 2003, `data("basque")`) — published synthetic = region 10 (Cataluña) 0.851 + region 14 (Madrid) 0.149, `loss.v` 0.0089. Two-tier test (`tests/test_methodology_synthetic_control.py`): Tier-1 feeds R's `solution.v` via `custom_v` → donor weights match to atol 1e-3 (deterministic); Tier-2 checks the nested fit in a band. **Requirements checklist:** - [x] Donor weights on the unit simplex; exactly one treated unit, block assignment. @@ -2003,7 +2007,7 @@ Classic synthetic control (donor/unit weights only) for a single treated unit, d - [x] Inner simplex-constrained weighted LS via `_sc_weight_fw` with diagonal PSD `V`. - [x] Outer nested `V` (pre-period outcome MSPE) + user-supplied `custom_v`. - [x] Gap path + pre-period RMSPE + predictor-balance table. -- [x] No analytical SE (NaN inference); placebo permutation inference deferred to a follow-up PR. +- [x] No analytical SE (NaN inference); in-space placebo permutation inference (`in_space_placebo()`, `rank/(n_placebos+1)`) with the real treated unit excluded from every placebo pool, effective-count denominator, and a scale-aware RMSPE-ratio floor. - [x] Predictor-leakage, absorbing-suffix/no-anticipation, empty-window, duplicate-label, and inner-non-convergence validation gates. --- diff --git a/docs/methodology/REPORTING.md b/docs/methodology/REPORTING.md index ea16530f..58058312 100644 --- a/docs/methodology/REPORTING.md +++ b/docs/methodology/REPORTING.md @@ -263,9 +263,16 @@ a library setting. pre-period), and routes sensitivity to `in_time_placebo()` + `sensitivity_to_zeta_omega()`. `TROPResults` surfaces factor-model diagnostics (`effective_rank`, `loocv_score`, selected `lambda_*`) - under `estimator_native_diagnostics`. `EfficientDiDResults` PT runs - through `EfficientDiD.hausman_pretest` (the estimator's native - PT-All vs PT-Post check). + under `estimator_native_diagnostics`. `SyntheticControlResults` + routes parallel-trends to the `scm_fit` analogue (`pre_rmspe`, + verdict `design_enforced_pt`) and surfaces `pre_rmspe`, donor-weight + concentration, and the in-space placebo permutation p-value under + `estimator_native_diagnostics` — the placebo block is populated only + when the caller has already run `in_space_placebo()` (opt-in; DR never + triggers the per-donor refit loop implicitly), and it omits + HonestDiD-style `sensitivity` (significance IS the placebo). + `EfficientDiDResults` PT runs through `EfficientDiD.hausman_pretest` + (the estimator's native PT-All vs PT-Post check). - **Note:** Pre-trends verdict is a three-bin heuristic, not a field convention. DR maps the joint p-value as follows: diff --git a/tests/test_business_report.py b/tests/test_business_report.py index 940ca1ef..d5668e95 100644 --- a/tests/test_business_report.py +++ b/tests/test_business_report.py @@ -24,6 +24,7 @@ from unittest.mock import patch import numpy as np +import pandas as pd import pytest import diff_diff as dd @@ -39,6 +40,7 @@ generate_did_data, generate_factor_data, generate_staggered_data, + synthetic_control, ) from diff_diff.business_report import BUSINESS_REPORT_SCHEMA_VERSION @@ -111,6 +113,38 @@ def sdid_fit(): return sdid, fdf +@pytest.fixture # function-scoped: some tests run in_space_placebo(), which mutates the result +def scm_fit(): + rng = np.random.default_rng(0) + T, T0, n = 8, 6, 4 + years = list(range(2000, 2000 + T)) + donors = { + j: rng.normal(10, 2) + rng.normal(0, 0.3) * np.arange(T) + rng.normal(0, 0.15, T) + for j in range(n) + } + treated = 0.6 * donors[0] + 0.4 * donors[1] + rng.normal(0, 0.08, T) + treated[T0:] += 3.0 + rows = [] + for j in range(n): + for t in range(T): + rows.append({"unit": f"d{j}", "year": years[t], "y": donors[j][t], "treated": 0}) + for t in range(T): + rows.append({"unit": "treated", "year": years[t], "y": treated[t], "treated": int(t >= T0)}) + df = pd.DataFrame(rows) + res = synthetic_control( + df, + "y", + "treated", + "unit", + "year", + seed=0, + n_starts=1, + optimizer_options={"maxiter": 50}, + inner_min_decrease=1e-3, + ) + return res, df + + @pytest.fixture(scope="module") def edid_fit(): from diff_diff import EfficientDiD @@ -726,6 +760,71 @@ class TripleDifferenceResults: assert "DDD" in desc assert "Ortiz-Villavicencio" in desc or "2025" in desc + +class TestSCMBusinessReport: + """Classic SCM routes through BusinessReport with the fit-based assumption + block, native-diagnostics robustness, and an ADH 2010 attribution.""" + + def test_scm_assumption_block_is_fit_based(self, scm_fit): + res, _ = scm_fit + assumption = BusinessReport(res, auto_diagnostics=False).to_dict()["assumption"] + # Distinct from SDiD's "synthetic_fit" — classic SCM is a donor-weighted match. + assert assumption["parallel_trends_variant"] == "scm_fit" + desc = assumption["description"].lower() + assert "placebo" in desc # significance is placebo-based, not analytical SE + assert "donor-weighted" in desc or "donor" in desc + + def test_scm_report_is_json_serializable(self, scm_fit): + # SCM's analytical p_value is NaN; the headline is_significant must be a + # plain bool (not numpy bool_) so the AI-legible schema serializes. + res, _ = scm_fit + schema = BusinessReport(res, auto_diagnostics=True).to_dict() + json.dumps(schema) # must not raise + assert isinstance(schema["headline"]["is_significant"], bool) + + def test_scm_robustness_uses_native_diagnostics(self, scm_fit): + # SCM omits HonestDiD "sensitivity", so robustness must come from the + # estimator-native diagnostics block, which must be populated and ran. + res, _ = scm_fit + diag = BusinessReport(res, auto_diagnostics=True).to_dict()["diagnostics"] + assert diag["status"] == "ran" + native = diag["schema"]["estimator_native_diagnostics"] + assert native["estimator"] == "SyntheticControl" + assert isinstance(native["pre_rmspe"], float) + + def test_scm_appendix_attributes_adh_2010(self, scm_fit): + res, _ = scm_fit + blob = json.dumps(BusinessReport(res, auto_diagnostics=False).to_dict()).lower() + assert "abadie" in blob and "2010" in blob + + def test_scm_rejects_honest_did_passthrough(self, scm_fit): + res, _ = scm_fit + with pytest.raises(ValueError, match="estimator_native_diagnostics"): + BusinessReport(res, honest_did_results=object(), auto_diagnostics=False) + + def test_scm_target_parameter_is_named(self, scm_fit): + # BR must name SCM's headline estimand (single-unit mean post-treatment gap), + # not fall back to "ATT (unrecognized result class)". + res, _ = scm_fit + tp = BusinessReport(res, auto_diagnostics=False).to_dict()["target_parameter"] + assert "unrecognized" not in tp["name"].lower() + assert "scm" in tp["name"].lower() or "synthetic control" in tp["name"].lower() + + def test_scm_robustness_block_surfaces_native_fields(self, scm_fit): + # The top-level robustness block must surface SCM-native content (pre_rmspe, + # weight concentration, the placebo result) rather than the SDiD-shaped + # pre_treatment_fit=None left over from the generic lift. + res, _ = scm_fit + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res.in_space_placebo() + rob = BusinessReport(res, auto_diagnostics=True).to_dict()["robustness"] + native = rob["estimator_native"] + assert native["estimator"] == "SyntheticControl" + assert isinstance(native["pre_rmspe"], float) + assert "weight_concentration" in native + assert native["in_space_placebo"]["n_placebos"] == res.n_placebos + def test_staggered_triple_diff_assumption_uses_ddd_not_generic_pt(self): class StaggeredTripleDiffResults: pass diff --git a/tests/test_diagnostic_report.py b/tests/test_diagnostic_report.py index 32e7a50d..af1ef4c3 100644 --- a/tests/test_diagnostic_report.py +++ b/tests/test_diagnostic_report.py @@ -20,6 +20,7 @@ from unittest.mock import patch import numpy as np +import pandas as pd import pytest import diff_diff as dd @@ -34,6 +35,7 @@ generate_did_data, generate_factor_data, generate_staggered_data, + synthetic_control, ) from diff_diff.diagnostic_report import ( DIAGNOSTIC_REPORT_SCHEMA_VERSION, @@ -135,6 +137,45 @@ def sdid_fit(): return sdid, fdf +def _scm_panel(n_donors=4, T=8, T0=6, effect=3.0, seed=0): + """Small balanced SCM panel: treated = convex mix of two donors + post effect.""" + rng = np.random.default_rng(seed) + years = list(range(2000, 2000 + T)) + donors = { + j: rng.normal(10, 2) + rng.normal(0, 0.3) * np.arange(T) + rng.normal(0, 0.15, T) + for j in range(n_donors) + } + treated = 0.6 * donors[0] + 0.4 * donors[1] + rng.normal(0, 0.08, T) + treated[T0:] += effect + rows = [] + for j in range(n_donors): + for t in range(T): + rows.append({"unit": f"d{j}", "year": years[t], "y": donors[j][t], "treated": 0}) + for t in range(T): + rows.append({"unit": "treated", "year": years[t], "y": treated[t], "treated": int(t >= T0)}) + return pd.DataFrame(rows) + + +@pytest.fixture # function-scoped: some tests run in_space_placebo(), which mutates the result +def scm_fit(): + warnings.filterwarnings("ignore") + df = _scm_panel(n_donors=4) + # Cheap optimizer settings keep the pure-Python fixture fast; the report path, + # not solver accuracy, is what these tests exercise. + res = synthetic_control( + df, + "y", + "treated", + "unit", + "year", + seed=0, + n_starts=1, + optimizer_options={"maxiter": 50}, + inner_min_decrease=1e-3, + ) + return res, df + + # --------------------------------------------------------------------------- # Schema contract # --------------------------------------------------------------------------- @@ -2005,6 +2046,141 @@ def test_sdid_does_not_call_honest_did(self, sdid_fit): mock.assert_not_called() +class TestSCMNative: + """Classic SCM routes to the fit-based PT analogue + native diagnostics.""" + + def test_scm_applicable_checks(self, scm_fit): + res, _ = scm_fit + applicable = DiagnosticReport(res).applicable_checks + assert "estimator_native" in applicable + assert "parallel_trends" in applicable # design-enforced fit analogue + # SCM is not HonestDiD/heterogeneity territory (one treated unit). + assert "sensitivity" not in applicable + assert "heterogeneity" not in applicable + + def test_scm_pt_uses_scm_fit_method(self, scm_fit): + res, _ = scm_fit + pt = DiagnosticReport(res).to_dict()["parallel_trends"] + assert pt["method"] == "scm_fit" + assert pt["verdict"] == "design_enforced_pt" + assert isinstance(pt.get("pre_treatment_fit_rmse"), float) + + def test_scm_native_section_populated(self, scm_fit): + res, _ = scm_fit + native = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"] + assert native["status"] == "ran" + assert native["estimator"] == "SyntheticControl" + assert isinstance(native["pre_rmspe"], float) + assert "herfindahl" in native["weight_concentration"] + # Placebo is opt-in: NOT auto-run inside the report. + assert native["in_space_placebo"]["status"] == "not_run" + # In-time placebo / leave-one-out are ADH 2015 (not implemented here). + assert "in_time_placebo" not in native and "leave_one_out" not in native + + def test_scm_native_surfaces_placebo_after_optin_run(self, scm_fit): + res, _ = scm_fit + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res.in_space_placebo() + native = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"] + block = native["in_space_placebo"] + assert block["n_placebos"] == res.n_placebos + assert block["placebo_p_value"] == pytest.approx(res.placebo_p_value) + + def test_scm_does_not_call_honest_did(self, scm_fit): + """HonestDiD sensitivity should NOT run on SCM (fit-based / native path).""" + res, _ = scm_fit + with patch("diff_diff.honest_did.HonestDiD.sensitivity_analysis") as mock: + DiagnosticReport(res).to_dict() + mock.assert_not_called() + + def test_scm_significance_not_marked_done_until_placebo_run(self, scm_fit): + """SCM significance is the OPT-IN in-space placebo, which DR never runs. + + Unlike SDiD/TROP (whose native block IS the sensitivity work), SCM's + native block only reports the placebo as ``not_run``, so the in-space + placebo next-step must persist — significance is not complete merely + because the native diagnostics ran. + """ + res, _ = scm_fit # the fixture does NOT run the placebo + schema = DiagnosticReport(res).to_dict() + labels = " ".join(s.get("label", "") for s in schema.get("next_steps", [])).lower() + assert "placebo" in labels # the significance recommendation still surfaces + + def test_scm_placebo_step_completes_after_run(self, scm_fit): + """Once the opt-in placebo has been run, DR stops recommending it.""" + res, _ = scm_fit + before = DiagnosticReport(res).to_dict()["next_steps"] + assert any("placebo" in s.get("label", "").lower() for s in before) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res.in_space_placebo() # opt-in significance procedure now done + after = DiagnosticReport(res).to_dict()["next_steps"] + assert not any("placebo" in s.get("label", "").lower() for s in after) + + def test_scm_rejects_precomputed_parallel_trends_and_sensitivity(self, scm_fit): + # Like SDiD/TROP, SCM computes its PT verdict internally (the scm_fit + # design-enforced pre-fit), so a precomputed parallel_trends p-value payload + # is methodology-incompatible and rejected — as is sensitivity (no HonestDiD + # analogue). SCM's PT still runs natively (scm_fit) without a precomputed input. + res, _ = scm_fit + with pytest.raises(ValueError, match="estimator_native_diagnostics"): + DiagnosticReport(res, precomputed={"parallel_trends": {"joint_p_value": 0.5}}) + with pytest.raises(ValueError, match="estimator_native_diagnostics"): + DiagnosticReport(res, precomputed={"sensitivity": {"breakdown_M": 1.0}}) + # Without a precomputed input, the native scm_fit PT verdict still renders. + assert DiagnosticReport(res).to_dict()["parallel_trends"]["method"] == "scm_fit" + + def test_scm_pt_prose_is_scm_specific_not_sdid(self, scm_fit): + # The design_enforced_pt prose must describe SCM's donor-weighted, single- + # treated-unit, placebo-based identification — NOT SDiD's weighted-PT analogue. + res, _ = scm_fit + interp = DiagnosticReport(res).to_dict()["overall_interpretation"].lower() + assert "in-space placebo" in interp or "donor-weighted" in interp + assert "sdid" not in interp + + def test_scm_target_parameter_is_named(self, scm_fit): + res, _ = scm_fit + tp = DiagnosticReport(res).to_dict()["target_parameter"] + assert "unrecognized" not in tp["name"].lower() + assert "scm" in tp["name"].lower() or "synthetic control" in tp["name"].lower() + + def test_scm_generic_placebo_section_points_to_native(self, scm_fit): + # The generic ``placebo`` section must not claim placebo is "not yet + # implemented" for SCM — its in-space placebo IS implemented and surfaced + # under estimator_native_diagnostics (avoids contradictory report state). + res, _ = scm_fit + reason = (DiagnosticReport(res).to_dict()["placebo"].get("reason") or "").lower() + assert "not yet implemented" not in reason + assert "in_space_placebo" in reason or "estimator_native" in reason + + def test_scm_native_marks_infeasible_placebo(self): + # An attempted-but-infeasible placebo (here J<2) surfaces an explicit + # status="infeasible" + reason in _scm_native, not a bare NaN p-value. + rng = np.random.default_rng(0) + T, T0 = 8, 6 + years = list(range(2000, 2000 + T)) + d0 = rng.normal(10, 2, T) + treated = d0 + rng.normal(0, 0.1, T) + treated[T0:] += 3.0 + rows = [{"unit": "d0", "year": years[t], "y": d0[t], "treated": 0} for t in range(T)] + rows += [ + {"unit": "treated", "year": years[t], "y": treated[t], "treated": int(t >= T0)} + for t in range(T) + ] + df = pd.DataFrame(rows) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control( + df, "y", "treated", "unit", "year", seed=0, n_starts=1, inner_min_decrease=1e-3 + ) + res.in_space_placebo() # only 1 donor -> infeasible + native = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"] + block = native["in_space_placebo"] + assert block["status"] == "infeasible" + assert np.isnan(block["placebo_p_value"]) and "reason" in block + + # --------------------------------------------------------------------------- # Error handling # --------------------------------------------------------------------------- diff --git a/tests/test_methodology_synthetic_control.py b/tests/test_methodology_synthetic_control.py index 00507bd8..074ad896 100644 --- a/tests/test_methodology_synthetic_control.py +++ b/tests/test_methodology_synthetic_control.py @@ -24,6 +24,8 @@ from __future__ import annotations import json +import pickle +import warnings from pathlib import Path import numpy as np @@ -742,8 +744,13 @@ def test_result_accessors_render(): assert set(gdf.columns) == {"period", "gap", "phase"} wdf = res.get_weights_df() assert list(wdf.columns) == ["unit", "weight"] - # Reserved PR-2/3 attributes present and None. - assert res._placebo_gaps is None and res._rmspe_ratio is None and res._fit_snapshot is None + # PR-2: fit() populates the placebo refit snapshot and the treated unit's + # RMSPE ratio; the placebo reference distribution is not computed until + # in_space_placebo() runs (placebo_p_value stays NaN, gaps/df unset). + assert res._fit_snapshot is not None + assert res._placebo_gaps is None and res._placebo_df is None + assert np.isfinite(res.rmspe_ratio) + assert np.isnan(res.placebo_p_value) and res.n_placebos == 0 def test_inferred_post_matches_explicit(): @@ -836,3 +843,440 @@ def test_basque_tier2_nested_band(): top2 = [u for u, _ in sorted(res.donor_weights.items(), key=lambda kv: -kv[1])[:2]] assert set(top2) == {10, 14} assert res.donor_weights.get(10, 0) + res.donor_weights.get(14, 0) > 0.7 + + +# --------------------------------------------------------------------------- +# In-space placebo permutation inference (Abadie-Diamond-Hainmueller 2010 §2.4) +# --------------------------------------------------------------------------- + + +def _fit_for_placebo(n_donors=4, effect=3.0, **kw): + """Fit with cheap settings on a panel carrying a strong post-treatment effect.""" + df, _, _ = _make_panel(n_donors=n_donors, effect=effect) + opts = dict(_FAST) + opts.update(kw) + with warnings.catch_warnings(): # single-donor / poor-fit fit warnings are not under test + warnings.simplefilter("ignore") + return synthetic_control(df, "y", "treated", "unit", "year", seed=0, **opts) + + +def test_in_space_placebo_strong_effect_ranks_treated_first(): + # A 3.0-unit post effect on a treated unit that is a clean convex mix of two + # donors -> treated RMSPE ratio is the most extreme -> rank 1 -> p = 1/(J+1). + res = _fit_for_placebo(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + pdf = res.in_space_placebo() + assert res.n_placebos == 4 and res.n_failed == 0 + treated_ratio = pdf.loc[pdf["is_treated"], "rmspe_ratio"].iloc[0] + placebo_ratios = pdf.loc[~pdf["is_treated"], "rmspe_ratio"] + assert (treated_ratio > placebo_ratios).all() # treated is the most extreme unit + assert res.placebo_p_value == pytest.approx(1 / (res.n_placebos + 1)) + # Exactly one treated row; the placebo rows are exactly the donor units. + assert int(pdf["is_treated"].sum()) == 1 + assert pdf.loc[pdf["is_treated"], "unit"].iloc[0] == "treated" + assert set(pdf.loc[~pdf["is_treated"], "unit"]) == {"d0", "d1", "d2", "d3"} + + +def test_in_space_placebo_excludes_real_treated_from_donor_pools(): + # The real treated unit is never in the donor universe, so it cannot serve as + # a donor for any placebo (ADH 2010 contamination guard; SCtools convention). + res = _fit_for_placebo(n_donors=4) + snap = res._fit_snapshot + assert snap.treated_id == "treated" and "treated" not in snap.donor_ids + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res.in_space_placebo() + # Each donor became a placebo exactly once; the treated unit is not a placebo. + assert "treated" not in res._placebo_gaps + assert set(res._placebo_gaps) == set(snap.donor_ids) + + +def test_in_space_placebo_p_in_valid_discrete_set(): + res = _fit_for_placebo(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res.in_space_placebo() + valid = {(k + 1) / (res.n_placebos + 1) for k in range(res.n_placebos + 1)} + assert any(res.placebo_p_value == pytest.approx(v) for v in valid) + + +def test_in_space_placebo_does_not_touch_analytical_inference(): + # The permutation p-value is SEPARATE from the analytical fields, which stay + # NaN; is_significant stays bound to the (NaN) p_value, not placebo_p_value. + res = _fit_for_placebo(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res.in_space_placebo() + assert np.isfinite(res.placebo_p_value) + assert_nan_inference( + {"se": res.se, "t_stat": res.t_stat, "p_value": res.p_value, "conf_int": res.conf_int} + ) + assert res.is_significant is False + + +def test_in_space_placebo_deterministic(): + res = _fit_for_placebo(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + p1 = res.in_space_placebo() + first_p = res.placebo_p_value + p2 = res.in_space_placebo() + assert res.placebo_p_value == first_p # bit-equal p-value across runs + pd.testing.assert_frame_equal(p1, p2) # identical rows AND row order + + +def test_in_space_placebo_requires_two_donors(): + res = _fit_for_placebo(n_donors=1) + with pytest.warns(UserWarning, match="at least 2 donors"): + pdf = res.in_space_placebo() + assert np.isnan(res.placebo_p_value) and res.n_placebos == 0 + assert len(pdf) == 1 and bool(pdf["is_treated"].iloc[0]) + + +def test_in_space_placebo_two_donors_warns_coarse(): + res = _fit_for_placebo(n_donors=2) + with pytest.warns(UserWarning, match="coarse"): + res.in_space_placebo() + # 2 placebos -> reference set of 3 -> p in {1/3, 2/3, 1}. + assert res.n_placebos == 2 + assert any(res.placebo_p_value == pytest.approx(v) for v in (1 / 3, 2 / 3, 1.0)) + + +def test_in_space_placebo_fails_closed_on_nonconverged_treated_fit(): + # inner_max_iter=1 truncates the treated unit's own Frank-Wolfe solve, so its + # RMSPE ratio is not a valid optimum. in_space_placebo() must fail closed + # (NaN p-value + warning), NOT rank a truncated treated statistic. + df, _, _ = _make_panel(n_donors=4, effect=3.0) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control( + df, + "y", + "treated", + "unit", + "year", + seed=0, + n_starts=1, + inner_max_iter=1, + optimizer_options={"maxiter": 5}, + ) + assert res._fit_converged is False # treated fit was truncated + with pytest.warns(UserWarning, match="did not converge at fit time"): + pdf = res.in_space_placebo() + assert np.isnan(res.placebo_p_value) + assert res.n_placebos == 0 and res.n_failed == 0 # the placebo loop never ran + assert len(pdf) == 1 and bool(pdf["is_treated"].iloc[0]) # treated row only + + +def test_in_space_placebo_pickle_drops_snapshot_keeps_scalars(): + res = _fit_for_placebo(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res.in_space_placebo() + restored = pickle.loads(pickle.dumps(res)) + # Scalars survive; panel-derived state is dropped. + assert restored.placebo_p_value == res.placebo_p_value + assert restored.rmspe_ratio == res.rmspe_ratio + assert restored.n_placebos == res.n_placebos and restored.n_failed == res.n_failed + assert restored._fit_snapshot is None and restored._placebo_gaps is None + # The small aggregate table survives, so get_placebo_df still works... + assert len(restored.get_placebo_df()) == len(res.get_placebo_df()) + # ...but a re-run of the refit raises (the snapshot is gone). + with pytest.raises(ValueError, match="requires the fit snapshot"): + restored.in_space_placebo() + + +def test_in_space_placebo_custom_v_path(): + df, _, _ = _make_panel(n_donors=4) + # Default predictors = all pre-period outcomes -> k = number of pre periods (T0). + k = 6 + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control( + df, + "y", + "treated", + "unit", + "year", + v_method="custom", + custom_v=np.ones(k), + inner_min_decrease=1e-3, + ) + pdf = res.in_space_placebo() + assert res.n_placebos == 4 and np.isfinite(res.placebo_p_value) + assert len(pdf) == 5 + + +def test_get_placebo_df_before_run_returns_treated_row_only(): + res = _fit_for_placebo(n_donors=4) + pdf = res.get_placebo_df() + assert len(pdf) == 1 + assert bool(pdf["is_treated"].iloc[0]) and pdf["status"].iloc[0] == "treated" + assert set(pdf.columns) == { + "unit", + "pre_mspe", + "post_mspe", + "rmspe_ratio", + "is_treated", + "status", + } + + +def test_rmspe_ratio_floors_zero_pre_mspe(): + # Perfect pre-fit (pre-MSPE == 0) must yield a large-but-finite ratio, not + # inf/nan (which would corrupt the permutation rank). + from diff_diff.synthetic_control import _rmspe_ratio + + pre = np.zeros(5) + assert np.isfinite(_rmspe_ratio(pre, np.array([1.0, 2.0, 3.0]), scale=10.0)) + # A zero-effect (post all zero) placebo has ratio 0 — the least extreme. + assert _rmspe_ratio(pre, np.zeros(3), scale=10.0) == 0.0 + + +def test_in_space_placebo_perfect_treated_fit_finite_ratio(): + # 2-donor panel where the treated unit EQUALS d0 in the pre-period -> the inner + # FW solve lands on w=[1, 0], so the treated pre-MSPE is (bit-)exactly 0. The + # RMSPE ratio must stay FINITE (scale-aware floor), never inf/nan. + rng = np.random.default_rng(2) + T, T0 = 8, 6 + years = list(range(2000, 2000 + T)) + d0 = rng.normal(10, 2, T) + d1 = rng.normal(5, 2, T) + treated = d0.copy() + treated[T0:] += 5.0 # identical to d0 in the pre-period, clean post effect + rows = [] + for name, series in (("d0", d0), ("d1", d1)): + for t in range(T): + rows.append({"unit": name, "year": years[t], "y": series[t], "treated": 0}) + for t in range(T): + rows.append({"unit": "treated", "year": years[t], "y": treated[t], "treated": int(t >= T0)}) + df = pd.DataFrame(rows) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control( + df, + "y", + "treated", + "unit", + "year", + v_method="custom", + custom_v=np.ones(T0), + inner_min_decrease=1e-3, + ) + assert res.pre_rmspe == pytest.approx(0.0, abs=1e-9) + assert np.isfinite(res.rmspe_ratio) and res.rmspe_ratio > 0 + + +def test_in_space_placebo_immune_to_post_fit_mutation(): + # The fit snapshot must COPY caller-owned mutable inputs (custom_v, + # optimizer_options), so mutating them after fit() cannot silently change + # in_space_placebo() output on an already-returned results object. + df, _, _ = _make_panel(n_donors=4) + cv = np.ones(6) # k = 6 default pre-period-outcome predictors + opts = {"maxiter": 50} + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control( + df, + "y", + "treated", + "unit", + "year", + v_method="custom", + custom_v=cv, + optimizer_options=opts, + inner_min_decrease=1e-3, + ) + p1 = res.in_space_placebo().copy() + pval1 = res.placebo_p_value + snap = res._fit_snapshot + assert snap.custom_v is not cv and snap.optimizer_options is not opts + # Mutate the caller-owned originals AFTER fit -> placebo output must not change. + cv[:] = [1e6, 1.0, 1.0, 1.0, 1.0, 1.0] + opts["maxiter"] = 1 + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + p2 = res.in_space_placebo().copy() + assert res.placebo_p_value == pval1 + pd.testing.assert_frame_equal(p1, p2) + + +def test_get_placebo_df_includes_failed_donors(monkeypatch): + # When the treated fit IS valid but some per-donor placebo refits fail to + # converge, get_placebo_df() must still list EVERY unit (treated + each donor) + # so callers can tell which donors failed -> exactly n_donors + 1 rows. + # (A truncated treated fit instead fails the whole placebo run closed, tested + # separately; here we simulate isolated donor failures with a converged treated + # fit by monkeypatching the per-donor refit to fail for the first two donors.) + import importlib + + # diff_diff.synthetic_control the SUBMODULE is shadowed by the re-exported + # synthetic_control FUNCTION on the package, so import the module explicitly. + sc = importlib.import_module("diff_diff.synthetic_control") + + res = _fit_for_placebo(n_donors=4) # treated fit converges (normal settings) + real_fit_unit = sc._placebo_fit_unit + calls = {"n": 0} + + def flaky_fit_unit(snap, unit, donor_pool, n_starts): + calls["n"] += 1 + if calls["n"] <= 2: # first two donor refits "fail" + return None + return real_fit_unit(snap, unit, donor_pool, n_starts) + + monkeypatch.setattr(sc, "_placebo_fit_unit", flaky_fit_unit) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + pdf = res.in_space_placebo() + assert len(pdf) == res.n_donors + 1 # treated + every donor, regardless of failures + assert res.n_failed == 2 and res.n_placebos == res.n_donors - 2 + failed = pdf[pdf["status"] == "failed"] + assert len(failed) == 2 and failed["rmspe_ratio"].isna().all() # NaN metrics + + +def test_in_space_placebo_fails_closed_on_underoptimized_outer_v(): + # An under-optimized OUTER V search (optimizer maxiter=1) leaves the treated + # fit's V non-optimal even though the inner solve converges. Its RMSPE ratio is + # therefore not a valid optimum, so in_space_placebo() must FAIL CLOSED rather + # than silently rank an anti-conservatively under-optimized statistic. + df, _, _ = _make_panel(n_donors=4, effect=3.0) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control( + df, + "y", + "treated", + "unit", + "year", + seed=0, + n_starts=1, + optimizer_options={"maxiter": 1}, # outer V search cannot converge + inner_min_decrease=1e-3, # inner still converges -> isolates the outer path + ) + assert res._fit_converged is False # outer V non-convergence -> invalid fit + with pytest.warns(UserWarning, match="did not converge at fit time"): + res.in_space_placebo() + assert np.isnan(res.placebo_p_value) + assert res.n_placebos == 0 and res.n_failed == 0 # placebo loop never ran + + +def test_outer_v_convergence_tracks_selected_incumbent(monkeypatch): + # _outer_solve_V must report convergence of the SELECTED (lowest-objective) + # incumbent, NOT "any start converged". Here the first multistart succeeds with a + # HIGH objective while the winning (lowest-objective) start reports success=False; + # the fit must be flagged non-converged so in_space_placebo() fails closed. + import importlib + + from scipy.optimize import OptimizeResult + + sc = importlib.import_module("diff_diff.synthetic_control") + calls = {"n": 0} + + def fake_minimize(fun, x0, **kwargs): + calls["n"] += 1 + x0 = np.asarray(x0, dtype=float) + if kwargs.get("method") == "Nelder-Mead": + # 1st start: high objective but converged; later: low objective (wins) but NOT. + if calls["n"] == 1: + return OptimizeResult(x=x0, fun=10.0, success=True) + return OptimizeResult(x=x0, fun=1.0, success=False) + # Powell polish: neither improves on nor converges at the incumbent. + return OptimizeResult(x=x0, fun=5.0, success=False) + + monkeypatch.setattr(sc, "minimize", fake_minimize) + df, _, _ = _make_panel(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control( + df, "y", "treated", "unit", "year", seed=0, n_starts=2, inner_min_decrease=1e-3 + ) + # The winning incumbent came from a success=False run -> selected V is not a + # validated optimum, so the fit must not be marked converged. + assert res._fit_converged is False + with pytest.warns(UserWarning, match="did not converge at fit time"): + res.in_space_placebo() + assert np.isnan(res.placebo_p_value) + + +def test_outer_v_powell_success_at_worse_point_does_not_validate(monkeypatch): + # The Powell polish must validate the SELECTED incumbent only when it converges + # back AT the incumbent's objective level. Here the winning (lowest-objective) + # Nelder-Mead start reports success=False, and Powell "succeeds" but at a STRICTLY + # WORSE objective (it ended elsewhere). Powell's success says nothing about the + # selected incumbent, so the fit must stay non-converged and fail closed -- a flag + # of "converged" here would silently admit an under-optimized V into the placebo + # ranking and produce wrong permutation inference. + import importlib + + from scipy.optimize import OptimizeResult + + sc = importlib.import_module("diff_diff.synthetic_control") + calls = {"n": 0} + + def fake_minimize(fun, x0, **kwargs): + calls["n"] += 1 + x0 = np.asarray(x0, dtype=float) + if kwargs.get("method") == "Nelder-Mead": + # Single start: lowest objective wins but reports success=False. + return OptimizeResult(x=x0, fun=1.0, success=False) + # Powell polish: SUCCEEDS, but at a strictly worse objective than the incumbent. + return OptimizeResult(x=x0, fun=5.0, success=True) + + monkeypatch.setattr(sc, "minimize", fake_minimize) + df, _, _ = _make_panel(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control( + df, "y", "treated", "unit", "year", seed=0, n_starts=1, inner_min_decrease=1e-3 + ) + # Powell's success at a worse point must NOT flip the selected incumbent to converged. + assert res._fit_converged is False + with pytest.warns(UserWarning, match="did not converge at fit time"): + res.in_space_placebo() + assert np.isnan(res.placebo_p_value) + + +def test_to_dict_includes_placebo_scalars(): + res = _fit_for_placebo(n_donors=4) + d = res.to_dict() + for key in ("placebo_p_value", "rmspe_ratio", "n_placebos", "n_failed"): + assert key in d + # Before the placebo run: rmspe_ratio is finite (fit-time), placebo_p_value NaN. + assert np.isfinite(d["rmspe_ratio"]) and np.isnan(d["placebo_p_value"]) + assert d["n_placebos"] == 0 and d["n_failed"] == 0 + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res.in_space_placebo() + d2 = res.to_dict() + assert np.isfinite(d2["placebo_p_value"]) and d2["n_placebos"] == 4 + + +def test_summary_distinguishes_infeasible_placebo_from_not_run(): + # summary() must tell "placebo never run" apart from "placebo run but produced + # no valid reference set" (J<2 here -> placebo_p_value NaN but it WAS attempted). + df, _, _ = _make_panel(n_donors=1) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control(df, "y", "treated", "unit", "year", seed=0, **_FAST) + before = res.summary() + res.in_space_placebo() # infeasible: single donor -> no placebo distribution + after = res.summary() + assert "Run in_space_placebo()" in before # never run + assert np.isnan(res.placebo_p_value) and res._placebo_df is not None # attempted + assert "attempted but" in after and "no valid reference set" in after + assert "Run in_space_placebo()" not in after # not mislabeled as "not run" + + +def test_rmspe_ratio_is_root_scale(): + # The reported statistic is the ROOT-scale ratio RMSPE_post/RMSPE_pre = + # sqrt(MSPE_post/MSPE_pre), NOT the MSPE ratio. Hand-worked: pre-MSPE = 4, + # post-MSPE = 9 -> sqrt(9/4) = 1.5 (the MSPE ratio would be 9/4 = 2.25). + from diff_diff.synthetic_control import _rmspe_ratio + + pre = np.array([2.0, 2.0]) # MSPE = 4 + post = np.array([3.0, 3.0]) # MSPE = 9 + assert _rmspe_ratio(pre, post, scale=10.0) == pytest.approx(1.5) + # Zero post-effect -> ratio 0; perfect pre-fit -> finite (floored), not inf. + assert _rmspe_ratio(pre, np.zeros(2), scale=10.0) == pytest.approx(0.0) + assert np.isfinite(_rmspe_ratio(np.zeros(2), post, scale=10.0)) diff --git a/tests/test_practitioner.py b/tests/test_practitioner.py index 99ccec5d..4153d840 100644 --- a/tests/test_practitioner.py +++ b/tests/test_practitioner.py @@ -20,6 +20,7 @@ from diff_diff.results import DiDResults, SyntheticDiDResults from diff_diff.stacked_did_results import StackedDiDResults from diff_diff.sun_abraham import SunAbrahamResults +from diff_diff.synthetic_control_results import SyntheticControlResults from diff_diff.triple_diff import TripleDifferenceResults from diff_diff.trop_results import TROPResults from diff_diff.two_stage_results import TwoStageDiDResults @@ -114,6 +115,15 @@ def mock_trop_results(): return r +@pytest.fixture +def mock_scm_results(): + r = SyntheticControlResults.__new__(SyntheticControlResults) + r.att = 1.2 + r.se = np.nan + r.placebo_p_value = np.nan + return r + + @pytest.fixture def mock_efficient_results(): r = EfficientDiDResults.__new__(EfficientDiDResults) @@ -336,6 +346,21 @@ def test_trop_results(self, mock_trop_results): assert "control_group" not in all_code assert "anticipation" not in all_code + def test_synthetic_control_results(self, mock_scm_results): + output = practitioner_next_steps(mock_scm_results, verbose=False) + assert output["estimator"] == "SyntheticControl" + assert len(output["next_steps"]) > 0 + # The in-space placebo step must surface (it is SCM's significance test) + # and must not be auto-suppressed as the completed estimation step. + all_code = " ".join(s.get("code", "") for s in output["next_steps"]) + all_labels = " ".join(s.get("label", "") for s in output["next_steps"]).lower() + assert "in_space_placebo" in all_code + assert "placebo" in all_labels + # SCM is not a staggered DiD: no control-group / anticipation knobs. + handler_steps = [s for s in output["next_steps"] if s["baker_step"] > 2] + handler_code = " ".join(s.get("code", "") for s in handler_steps) + assert "control_group" not in handler_code and "anticipation" not in handler_code + def test_efficient_results(self, mock_efficient_results): output = practitioner_next_steps(mock_efficient_results, verbose=False) assert len(output["next_steps"]) > 0 From 1d188638a204737796ab1d21a378da393d3841c9 Mon Sep 17 00:00:00 2001 From: igerber Date: Sun, 31 May 2026 15:26:39 -0400 Subject: [PATCH 2/3] =?UTF-8?q?synthetic=5Fcontrol:=20address=20CI=20codex?= =?UTF-8?q?=20P2s=20=E2=80=94=20precise=20placebo-infeasibility=20reason?= =?UTF-8?q?=20+=20n=5Fstarts=20validation?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CI review (PR #511) flagged two P2s: 1. summary() misdiagnosed the infeasible-placebo reason. The treated-fit fail-closed branch and the J<2 branch both leave n_placebos=0, n_failed=0, so reconstructing the cause from counts narrated a non-converged treated fit as "too few donors or all donor refits failed" (false for that path). Add an explicit `_placebo_status` recorded by in_space_placebo() ({treated_fit_nonconverged, too_few_donors, all_placebos_failed, ran}) and render the specific reason in both summary() and DiagnosticReport._scm_native instead of inferring it from counts. The status is a small string, so it survives pickling alongside _placebo_df / the scalar fields. 2. in_space_placebo(n_starts=...) skipped the positive-integer validation the estimator constructor enforces — int(0)/int(-1)/int(2.5) silently coerced into a degenerate or invalid permutation procedure. Mirror the constructor's check and fail fast with a ValueError. Also relabels the lingering "One-sided rank" code comment to "upper-tail rank on the unsigned RMSPE ratio" (matching the REGISTRY/CHANGELOG wording fix). New regressions: summary() names the treated-fit-failure reason (not the donor-side reason) under identical n_placebos/n_failed counts; invalid n_starts overrides raise and leave placebo state untouched. Co-Authored-By: Claude Opus 4.8 (1M context) --- diff_diff/diagnostic_report.py | 30 +++++++++-- diff_diff/synthetic_control_results.py | 57 +++++++++++++++++---- tests/test_methodology_synthetic_control.py | 53 ++++++++++++++++++- 3 files changed, 125 insertions(+), 15 deletions(-) diff --git a/diff_diff/diagnostic_report.py b/diff_diff/diagnostic_report.py index 9173f3c6..f474e91b 100644 --- a/diff_diff/diagnostic_report.py +++ b/diff_diff/diagnostic_report.py @@ -2283,17 +2283,39 @@ def _scm_native(self, r: Any) -> Dict[str, Any]: "n_placebos": _to_python_scalar(n_placebos), "n_failed": _to_python_scalar(getattr(r, "n_failed", None)), } - # Distinguish a valid run from an attempted-but-infeasible one (J<2, - # treated-fit non-convergence, or zero successful donor refits) so BR/DR + # Distinguish a valid run from an attempted-but-infeasible one so BR/DR # consumers see an explicit status/reason rather than a bare NaN p-value. + # The SPECIFIC cause comes from the results' recorded ``_placebo_status`` + # (n_placebos/n_failed alone cannot tell a non-converged treated fit + # apart from too-few-donors). if n_placebos > 0 and np.isfinite(placebo_p): block["status"] = "ran" else: + placebo_status = getattr(r, "_placebo_status", None) + _reasons = { + "treated_fit_nonconverged": ( + "in_space_placebo() was run but the treated unit's own SCM " + "fit did not converge at fit time, so its RMSPE ratio is not " + "a valid optimum to rank against placebos; placebo_p_value " + "is NaN." + ), + "too_few_donors": ( + "in_space_placebo() was run but fewer than 2 donors are " + "available (each placebo is fit against the other donors); " + "placebo_p_value is NaN." + ), + "all_placebos_failed": ( + "in_space_placebo() was run but every donor refit failed to " + "converge, so no placebo entered the reference set; " + "placebo_p_value is NaN." + ), + } block["status"] = "infeasible" - block["reason"] = ( + block["reason"] = _reasons.get( + placebo_status, "in_space_placebo() was run but produced no valid reference set " "(fewer than 2 donors, a non-converged treated fit, or all donor " - "refits failed); placebo_p_value is NaN." + "refits failed); placebo_p_value is NaN.", ) out["in_space_placebo"] = block else: diff --git a/diff_diff/synthetic_control_results.py b/diff_diff/synthetic_control_results.py index dff829f0..8e412057 100644 --- a/diff_diff/synthetic_control_results.py +++ b/diff_diff/synthetic_control_results.py @@ -186,6 +186,14 @@ def __post_init__(self) -> None: # in_space_placebo() fails closed when this is False: a truncated treated # fit makes the ranked statistic (rmspe_ratio) not a valid SCM optimum. self._fit_converged: bool = True + # Explicit reason an in-space placebo run was infeasible/absent, set by + # in_space_placebo(). summary() / _scm_native render THIS instead of + # reconstructing the cause from counts — n_placebos/n_failed alone cannot + # tell a non-converged treated fit ("treated_fit_nonconverged", n_failed=0) + # apart from too few donors ("too_few_donors", also n_failed=0). Values: + # None (not run), "ran", "treated_fit_nonconverged", "too_few_donors", + # "all_placebos_failed". A small string, so it survives pickling. + self._placebo_status: Optional[str] = None def __getstate__(self) -> Dict[str, Any]: """Exclude panel-derived internal state from pickling. @@ -322,16 +330,35 @@ def summary(self, alpha: Optional[float] = None) -> str: ] ) elif placebo_attempted: - lines.extend( - [ + # Render the SPECIFIC reason recorded by in_space_placebo(); the count + # fields (n_placebos=0, n_failed=0) cannot tell a non-converged treated + # fit apart from too-few-donors, so do not reconstruct it from counts. + status = getattr(self, "_placebo_status", None) + if status == "treated_fit_nonconverged": + reason = [ + "In-space placebo was skipped: the treated unit's own SCM fit " + "did not converge at fit time (inner Frank-Wolfe weight solve", + "and/or outer V search), so its RMSPE ratio is not a valid " + "optimum to rank against placebos. placebo_p_value is undefined", + "— re-fit with a larger inner_max_iter / looser " + "inner_min_decrease and/or a larger optimizer_options['maxiter']", + "/ more n_starts.", + ] + elif status == "too_few_donors": + reason = [ + "In-space placebo inference requires at least 2 donors (each " + "placebo is fit against the other donors); too few were", + "available. placebo_p_value is undefined. Inspect " "get_placebo_df().", + ] + else: # "all_placebos_failed" (or a legacy unpickle without the status) + reason = [ "In-space placebo permutation inference was attempted but " "produced no valid reference set", f"(0 placebos entered the rank; {self.n_failed} failed to " - "converge). placebo_p_value is undefined — too few donors or " - "all donor refits failed. Inspect get_placebo_df().", - "=" * 75, + "converge). placebo_p_value is undefined — all donor refits", + "failed. Inspect get_placebo_df().", ] - ) + lines.extend([*reason, "=" * 75]) else: lines.extend( [ @@ -550,7 +577,15 @@ def in_space_placebo( snap = self._fit_snapshot donors = list(snap.donor_ids) n_donors = len(donors) - n_starts_eff = snap.n_starts if n_starts is None else int(n_starts) + if n_starts is None: + n_starts_eff = snap.n_starts + else: + # Mirror the estimator constructor's validation (synthetic_control.py) + # so a bad override fails fast instead of silently coercing (e.g. via + # int(0)/int(-1)) into a degenerate or invalid permutation procedure. + if not isinstance(n_starts, (int, np.integer)) or n_starts < 1: + raise ValueError(f"n_starts override must be a positive integer, got {n_starts!r}") + n_starts_eff = int(n_starts) treated_pre = _mspe(self.gap_path, snap.pre_periods) treated_post = _mspe(self.gap_path, snap.post_periods) @@ -588,6 +623,7 @@ def in_space_placebo( self.n_placebos = 0 self.n_failed = 0 self._placebo_gaps = {} + self._placebo_status = "treated_fit_nonconverged" self._placebo_df = pd.DataFrame(rows, columns=self._PLACEBO_COLS) return self._placebo_df.copy() @@ -603,6 +639,7 @@ def in_space_placebo( self.n_placebos = 0 self.n_failed = 0 self._placebo_gaps = {} + self._placebo_status = "too_few_donors" self._placebo_df = pd.DataFrame(rows, columns=self._PLACEBO_COLS) return self._placebo_df.copy() @@ -668,8 +705,9 @@ def in_space_placebo( ) p_value = np.nan else: - # One-sided rank, treated unit included as the "+1". Ties counted via - # ``>=`` so the p-value is conservative. + # Upper-tail rank on the (unsigned) RMSPE ratio, treated unit included + # as the "+1". Ties counted via ``>=`` so the p-value is conservative. + # (The ratio squares the gaps -> direction-agnostic, NOT a signed test.) rank = 1 + sum(1 for r in ranked_ratios if r >= treated_ratio) p_value = rank / (n_placebos + 1) @@ -686,5 +724,6 @@ def in_space_placebo( self.n_placebos = int(n_placebos) self.n_failed = int(n_failed) self._placebo_gaps = placebo_gaps + self._placebo_status = "ran" if n_placebos > 0 else "all_placebos_failed" self._placebo_df = pd.DataFrame(rows, columns=self._PLACEBO_COLS) return self._placebo_df.copy() diff --git a/tests/test_methodology_synthetic_control.py b/tests/test_methodology_synthetic_control.py index 074ad896..ae563d01 100644 --- a/tests/test_methodology_synthetic_control.py +++ b/tests/test_methodology_synthetic_control.py @@ -1254,7 +1254,8 @@ def test_to_dict_includes_placebo_scalars(): def test_summary_distinguishes_infeasible_placebo_from_not_run(): # summary() must tell "placebo never run" apart from "placebo run but produced - # no valid reference set" (J<2 here -> placebo_p_value NaN but it WAS attempted). + # no valid reference set" (J<2 here -> placebo_p_value NaN but it WAS attempted), + # and name the SPECIFIC infeasibility reason (too few donors). df, _, _ = _make_panel(n_donors=1) with warnings.catch_warnings(): warnings.simplefilter("ignore") @@ -1264,10 +1265,58 @@ def test_summary_distinguishes_infeasible_placebo_from_not_run(): after = res.summary() assert "Run in_space_placebo()" in before # never run assert np.isnan(res.placebo_p_value) and res._placebo_df is not None # attempted - assert "attempted but" in after and "no valid reference set" in after + assert res._placebo_status == "too_few_donors" + assert "requires at least 2 donors" in after # specific reason, not "not run" assert "Run in_space_placebo()" not in after # not mislabeled as "not run" +def test_summary_treated_fit_failure_names_specific_reason(): + # When the treated unit's OWN fit fails to converge, in_space_placebo() fails + # closed with n_placebos=0, n_failed=0 -- the SAME counts as the J<2 case. The + # CI codex P2: summary() must not reconstruct the reason from those counts and + # narrate "too few donors or all donor refits failed" (false here); it must name + # the treated-fit non-convergence recorded in _placebo_status. + df, _, _ = _make_panel(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control( + df, + "y", + "treated", + "unit", + "year", + seed=0, + n_starts=1, + optimizer_options={"maxiter": 1}, # outer V cannot converge -> fail closed + inner_min_decrease=1e-3, + ) + assert res._fit_converged is False + with pytest.warns(UserWarning, match="did not converge at fit time"): + res.in_space_placebo() + after = res.summary() + assert res._placebo_status == "treated_fit_nonconverged" + assert res.n_placebos == 0 and res.n_failed == 0 # same counts as J<2 + assert "treated unit's own SCM fit" in after and "did not converge" in after + # Must NOT misdiagnose as the donor-side reason. + assert "too few" not in after.lower() + assert "all donor refits" not in after.lower() + + +def test_in_space_placebo_rejects_invalid_n_starts(): + # CI codex P2: the n_starts override must fail fast on non-positive / non-integer + # values (mirroring the estimator constructor) rather than silently coercing via + # int(...) into a degenerate one-start (or invalid) permutation procedure. + res = _fit_for_placebo(n_donors=4) + for bad in (0, -1, -5): + with pytest.raises(ValueError, match="n_starts override must be a positive integer"): + res.in_space_placebo(n_starts=bad) + for bad in (2.5, "3"): + with pytest.raises(ValueError, match="n_starts override must be a positive integer"): + res.in_space_placebo(n_starts=bad) # type: ignore[arg-type] + # The placebo state must be untouched by a rejected override. + assert res._placebo_status is None and res._placebo_df is None + + def test_rmspe_ratio_is_root_scale(): # The reported statistic is the ROOT-scale ratio RMSPE_post/RMSPE_pre = # sqrt(MSPE_post/MSPE_pre), NOT the MSPE ratio. Hand-worked: pre-MSPE = 4, From 44ea0b409ca9bd96b33e503edfc5bdb7a39b1e50 Mon Sep 17 00:00:00 2001 From: igerber Date: Sun, 31 May 2026 15:45:16 -0400 Subject: [PATCH 3/3] test(diagnostic_report): scope scm_fit warning suppression to avoid global filter mutation CI review (PR #511) P3: the new scm_fit fixture called warnings.filterwarnings("ignore"), which permanently mutates the process-wide warning filters and can silently weaken warning-regression coverage in later tests in the same worker. Wrap only the synthetic_control(...) call in warnings.catch_warnings() + simplefilter("ignore") so the suppression is scoped to the fixture's solver call. (Pre-existing sibling fixtures in this file use the same legacy global-filter pattern but are out of scope for this PR.) Co-Authored-By: Claude Opus 4.8 (1M context) --- tests/test_diagnostic_report.py | 29 ++++++++++++++++------------- 1 file changed, 16 insertions(+), 13 deletions(-) diff --git a/tests/test_diagnostic_report.py b/tests/test_diagnostic_report.py index af1ef4c3..66e594a5 100644 --- a/tests/test_diagnostic_report.py +++ b/tests/test_diagnostic_report.py @@ -158,21 +158,24 @@ def _scm_panel(n_donors=4, T=8, T0=6, effect=3.0, seed=0): @pytest.fixture # function-scoped: some tests run in_space_placebo(), which mutates the result def scm_fit(): - warnings.filterwarnings("ignore") df = _scm_panel(n_donors=4) # Cheap optimizer settings keep the pure-Python fixture fast; the report path, - # not solver accuracy, is what these tests exercise. - res = synthetic_control( - df, - "y", - "treated", - "unit", - "year", - seed=0, - n_starts=1, - optimizer_options={"maxiter": 50}, - inner_min_decrease=1e-3, - ) + # not solver accuracy, is what these tests exercise. Scope the warning + # suppression to this call via catch_warnings() so it does not mutate the + # global filter state for later tests in the same worker. + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control( + df, + "y", + "treated", + "unit", + "year", + seed=0, + n_starts=1, + optimizer_options={"maxiter": 50}, + inner_min_decrease=1e-3, + ) return res, df