diff --git a/CHANGELOG.md b/CHANGELOG.md index 5a28e1bc..f73eeef7 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added - **New estimator: `SyntheticControl` — classic Synthetic Control Method (Abadie, Diamond & Hainmueller 2010; Abadie & Gardeazabal 2003).** Standalone estimator (`diff_diff/synthetic_control.py`) + `SyntheticControlResults` (`diff_diff/synthetic_control_results.py`) + `synthetic_control()` convenience function, exported from `diff_diff`. Builds a single treated unit's counterfactual as a convex combination of never-treated donor units — **donor (unit) weights only**, no time weights or ridge, distinct from `SyntheticDiD`. The inner simplex-constrained weighted-LS solve `W*(V)` reuses `utils._sc_weight_fw` (folding `V^½` into the predictor matrix, `intercept=False`, `zeta=0`); the diagonal predictor-importance matrix `V` is selected data-driven by minimizing pre-period outcome MSPE (`v_method="nested"`, softmax-on-simplex multistart Nelder-Mead + Powell polish) or supplied by the user (`v_method="custom"`). Predictors are built from `predictors`/`predictor_window`/`predictors_op`, `special_predictors`, and per-period outcome lags (`pre_period_outcomes`), in the R `Synth::dataprep` row order; per-row standardization (SD over donors+treated, ddof=1) matches the R `Synth::synth` source. Reports the gap path (`α̂_1t = Y_1t − Σ_j w_j Y_jt`), `att` (mean post-period gap), `pre_rmspe`, donor weights, `v_weights`, and a predictor-balance table. **No analytical standard error** — `se`/`t_stat`/`p_value`/`conf_int` are NaN (in-space placebo permutation inference with the post/pre RMSPE-ratio statistic is planned for a follow-up release; `_placebo_gaps`/`_rmspe_ratio`/`_fit_snapshot` are reserved on the results object). Ten validation gates baked in: predictor-period leakage, absorbing post-period suffix + no-anticipation cross-check against the treatment column, post-period canonicalization, donor-pool filtering before period derivation, empty-window rejection, poor-pre-fit `UserWarning` (RMSPE > SD of treated pre-outcomes), duplicate-predictor-label rejection, inner-solve non-convergence warning, order-independent gap-path rebuild, and the `standardize="none"` deviation; plus fail-closed `custom_v` cross-field rules and degenerate single-donor / single-pre-period handling. **R-`Synth` parity** (`tests/test_methodology_synthetic_control.py`, fixtures generated by `benchmarks/R/generate_synth_basque_golden.R` into `tests/data/`): two-tier on the Basque Country study — Tier-1 feeds R's `solution.v` via `custom_v` and reproduces the published donor weights (region 10 Cataluña 0.851 + region 14 Madrid 0.149) to `atol=1e-3` deterministically; Tier-2 (`@pytest.mark.slow`) checks the data-driven nested fit lands in a tolerance band (the nested `V` legitimately differs because the outer objective uses all pre periods, not R's `time.optimize.ssr` window). Documented in `docs/methodology/REGISTRY.md` §SyntheticControl (with `**Deviation from R:** standardize="none"` and `**Note:**` labels for the standardization formula, objective window, softmax `V` parametrization, and 1×SD poor-fit threshold), `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`. +- **StaggeredTripleDifference methodology-review-tracker promotion: In Progress → Complete**, plus a new opt-in Eq-4.14 overall ATT. Closes the Ortiz-Villavicencio & Sant'Anna (2025, arXiv:2505.09942v3) primary-source review on the tracker (PR-A #499 added the paper review on file; this PR validates the source against it). New paper-equation-anchored Verified Components in `tests/test_methodology_staggered_triple_diff.py` (Theorem 4.1 / Eq. 4.5 RA=IPW=DR identification; Eq. 4.1 three-term DDD decomposition; Eqs. 4.11-4.12 optimal-GMM weight normalization + single-group reduction; Eq. 4.13 event-study cohort-share weighting; Eq. 4.14 / Cor. 4.2 overall) alongside the existing R cross-validation against `triplediff::ddd(panel=TRUE)` + `agg_ddd()`. **New feature — opt-in `overall_att_es` (paper Eq. 4.14 overall):** the unweighted mean of the post-treatment event-study effects ES(e), exposed on `StaggeredTripleDiffResults` (with `overall_se_es` / `overall_t_stat_es` / `overall_p_value_es` / `overall_conf_int_es`) and populated only when `aggregate="event_study"` / `"all"`. The default `overall_att` is unchanged (the Callaway-Sant'Anna simple post-treatment (g,t) average — the library-wide convention). Its analytical SE is the influence function of that mean (the average of the per-event-time combined IFs, routed through the same survey-aware variance estimator as the per-e effects via a new `_se_from_psi` helper); a multiplier-bootstrap SE replaces it under `n_bootstrap>0`. Computed via a side-channel stash on the shared `CallawaySantAnnaAggregationMixin._aggregate_event_study` (no return-signature change; CallawaySantAnna unaffected), over post-treatment `e >= -anticipation` (the library convention, matching `overall_att`). Cross-validated against R `agg_ddd(type="eventstudy")$overall.att` / `overall.se` (SE matches to ~0.1%). REGISTRY `## StaggeredTripleDifference`: the previously-unlabeled overall-aggregation prose is formalized under a `**Note:**` documenting both overalls, and the duplicate aggregation-weight deviation is consolidated (fixing a `P(G=g)` vs R `P(S=g)` mislabel). `METHODOLOGY_REVIEW.md` row L69 promoted to **Complete** (`Last Review = 2026-05-30`) with a Verified Components / R Comparison Results detail block; priority queue pruned. `docs/references.rst` Ortiz-Villavicencio entry pinned to arXiv:2505.09942v3. - **ConleySpatialHAC methodology-review-tracker promotion: In Progress → Complete.** Closes the Conley (1999) *Journal of Econometrics* 92(1) primary-source review on the methodology-review tracker. The paper review on file at `docs/methodology/papers/conley-1999-review.md` was previously merged (2026-05-09); this PR is the F.L.I.P. consolidation — new `tests/test_methodology_conley.py` with paper-equation-numbered Verified Components walk-through (~1600 LoC; 10 classes; 60 tests, 5 of them `@pytest.mark.slow`). Coverage: Eq. 4.2 cross-sectional sandwich (pairwise-distance specialization; the project's paper review identifies Eq. 4.2 page 18 as the real-valued/pairwise form, with Eq. 3.13 reserved for the lattice-indexed form), Eq. 4.2 HC0 + rank-1 limits, Andrews (1991) HAC lag truncation matching `conleyreg::time_dist.cpp`, haversine convention with Earth radius 6371.01 km, Phase 2 panel block-decomposed sandwich at `atol=1e-12`, sparse k-d-tree dense-vs-sparse bit-identity (Wave A #120 numerical correctness), and R `conleyreg` v0.1.9 parity at `atol=1e-6` on 6 fixtures (3 cross-sectional + 3 panel) plus the sparse-forced and time-asymmetric kernel parity contracts. Three dedicated deviations-area classes: `TestConleyLibraryExtensions` (Wave A library extensions — combined spatial+cluster product kernel #119, callable conley_metric validation #123, sparse k-d-tree activation #120, indefiniteness guard), `TestConleyDeviationsFromR` (1-D radial Bartlett vs paper's 2-D separable Eq. 3.14, time-label normalization via `np.unique`, independent temporal kernel deferred), and `TestConleyDeferrals` (5 fail-closed `NotImplementedError`/`TypeError` contracts: LinearRegression + survey_design, DiD/MPD/TWFE + survey_design, Conley + weights, SyntheticDiD + Conley, wild_bootstrap + Conley). Methodology-anchored tests extracted from `tests/test_conley_vcov.py`: full classes `TestConleyDirectHelper`, `TestConleyReductions`, `TestConleyReductionsAddendum`, `TestConleyParityR`, `TestConleyParitySpacetime`, `TestConleyPanelHelper`, `TestConleySparseRParityForced`; plus methodology-anchored tests from `TestConleyKernels`, `TestConleyDistanceMetrics`, `TestConleySparse`. File drops 4248 → 3113 lines after extraction. Defensive surface preserved: input validation, NaN/inf guards, dispatch-level validity, estimator-level integration smoke tests, set_params atomicity, sparse-path activation thresholds + density-gate fallback. `METHODOLOGY_REVIEW.md` row L91 promoted to **Complete** with `Last Review = 2026-05-26`; detail block rewritten with Verified Components / Test Coverage / R Comparison Results inline table / Corrections Made / Deviations / Outstanding Concerns. Priority queue at L1386 pruned: PreTrendsPower removed (already Complete since 2026-05-19) and ConleySpatialHAC removed (this PR); substantive-review-blocked renumbered #2-#5 → #1-#4 and consolidation-pass-blocked renumbered #6-#8 → #5-#6. ### Added / Changed diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md index 88678d52..fd0a8b5d 100644 --- a/METHODOLOGY_REVIEW.md +++ b/METHODOLOGY_REVIEW.md @@ -66,7 +66,7 @@ The catalog grew incrementally over several quarters, so formats vary across the | Estimator | Module | R Reference | Status | Last Review | |-----------|--------|-------------|--------|-------------| | TripleDifference | `triple_diff.py` | `triplediff::ddd()` | **Complete** | 2026-02-18 | -| StaggeredTripleDifference | `staggered_triple_diff.py` | `triplediff::ddd(panel=TRUE)` + `agg_ddd()` | **In Progress** | — | +| StaggeredTripleDifference | `staggered_triple_diff.py` | `triplediff::ddd(panel=TRUE)` + `agg_ddd()` | **Complete** | 2026-05-30 | ### Counterfactual / Synthetic Estimators @@ -940,21 +940,39 @@ These three are feature deferrals (paper-supported extensions that the library h | Module | `staggered_triple_diff.py`, `staggered_triple_diff_results.py` | | Primary Reference | Ortiz-Villavicencio & Sant'Anna (2025) — same paper as TripleDifference, staggered case | | R Reference | `triplediff::ddd(panel=TRUE)` + `agg_ddd()` (per `benchmarks/R/benchmark_staggered_triplediff.R`) | -| Status | **In Progress** | -| Last Review | — | +| Status | **Complete** | +| Last Review | 2026-05-30 | **Documentation in place:** +- Paper review: `docs/methodology/papers/ortiz-villavicencio-santanna-2025-review.md` (full-paper, equal-depth, arXiv:2505.09942v3; shared primary source with TripleDifference) — PR #499 - REGISTRY.md section: `## StaggeredTripleDifference` (per-cohort comparisons against three sub-groups, DR/RA/IPW per component, GMM-optimal closed-form inverse-variance weighting, event-study via CS mixin, IF-based SEs, multiplier bootstrap for simultaneous bands, survey support) -- `tests/test_methodology_staggered_triple_diff.py`: 6 tests across 3 classes (never-treated comparison, not-yet-treated comparison, aggregation) -- Dedicated unit-test suite: `tests/test_staggered_triple_diff.py` (~680 lines, full coverage of DR/RA/IPW paths, both control-group modes, GMM weighting, event-study aggregation, edge cases) -- Survey-specific: `tests/test_survey_staggered_ddd.py` - -**Outstanding for promotion:** -- Paper review under `docs/methodology/papers/` covering Ortiz-Villavicencio & Sant'Anna (2025) for the staggered case (the primary paper is shared with TripleDifference, but no dedicated review file exists on disk yet) -- R parity validation against `triplediff::ddd(panel=TRUE)` + `agg_ddd()` (per `benchmarks/R/benchmark_staggered_triplediff.R`) — CSV fixtures not committed (gitignored); tests skip without local R + `triplediff` (tracked in TODO.md row, PR #245) -- Per-cohort group-effect SE convention: implementation includes WIF (conservative vs R's `wif=NULL`); documented in REGISTRY, deferred decision on whether to add an opt-in WIF-disable path (tracked in TODO.md row, PR #245) -- Formal Verified Components walk-through here -- Cluster-robust analytical SEs accepted but not wired (deferred per REGISTRY) +- `tests/test_methodology_staggered_triple_diff.py`: R cross-validation (group-time ATT/SE, both control groups) + paper-equation-anchored Verified Components (below) +- Dedicated unit-test suite: `tests/test_staggered_triple_diff.py` (full coverage of DR/RA/IPW paths, both control-group modes, GMM weighting, event-study aggregation, edge cases) +- Survey-specific: `tests/test_survey_staggered_ddd.py` (incl. the Eq-4.14 overall under survey weighting) + +**Verified Components (validated against the paper + R):** +- **Identification (Theorem 4.1 / Eq. 4.5):** RA = IPW = DR coincide without covariates. +- **Three-term DDD decomposition (Eq. 4.1):** post-treatment ATT(g,t) recover a known constant effect. +- **GMM combination (Eqs. 4.11-4.12):** optimal weights sum to one; a single comparison group reduces to `w=[1]`. +- **Event study (Eq. 4.13):** ES(e) equals the eligible-treated cohort-share-weighted average of ATT(g, g+e). +- **Overall (Eq. 4.14 / Cor. 4.2):** opt-in `overall_att_es` = unweighted mean of post-treatment ES(e), cross-validated against R `agg_ddd(type="eventstudy")$overall.att`/`overall.se`. + +**R Comparison Results** (`benchmarks/R/benchmark_staggered_triplediff.R`; `triplediff::ddd(panel=TRUE)` + `agg_ddd()`; CSV fixtures gitignored / regenerated on-the-fly, JSON golden committed): + +| Quantity | Tolerance | Observed agreement | +|----------|-----------|--------------------| +| Group-time ATT(g,t) | rtol 0.1% | exact | +| Group-time SE(g,t) | rtol 1% | matches | +| Event-study ES(e) | rtol 25% | within (per-e eligible-treated weighting deviation) | +| Overall ATT, Eq. 4.14 (`overall_att_es`) | rtol 10% | ≤5% (weighting deviation averages out in the mean) | +| Overall SE, Eq. 4.14 (`overall_se_es`) | rtol 3% | ≤0.5% | + +The paper-equation-anchored Verified Components above are deterministic and run without R. +The R cross-validation in this table runs only when local `R` + `triplediff` are available +(it skips otherwise — the fixtures are gitignored); making those fixtures deterministic in +CI and extending covariate-adjusted R parity are tracked follow-ups in `TODO.md`. + +**Documented deviations (verified non-masking; REGISTRY `## StaggeredTripleDifference`):** comparison-cohort admissibility (matches R `triplediff`, base-period/anticipation-aware; paper uses `g_c > max(g,t)`); aggregation weights `P(S=g,Q=1)` (matches paper Eq. 4.13 where `G_i` is defined only for `Q=1`, not R's `P(S=g)`) — drives the 25% aggregation tolerance; per-cohort group-effect WIF (conservative vs R `wif=NULL`); default `overall_att` is the CS-simple post-treatment average (paper Eq. 4.14 available opt-in as `overall_att_es`); cluster-robust analytical SEs accepted-but-deferred (multiplier bootstrap provides unit-level clustering). --- @@ -1396,8 +1414,7 @@ Promotion priority for the **In Progress** entries, ordered by what's blocked on **Consolidation-pass-blocked (already has paper review or methodology file or R parity; mostly Verified Components walk-through):** -5. **StaggeredTripleDifference** — shares the primary paper (Ortiz-Villavicencio & Sant'Anna 2025) with TripleDifference, but no dedicated paper review on file yet; needs R parity (R fixtures gitignored — tracked in TODO.md, PR #245). -6. **Survey Data Support** — cross-cutting feature; promotion requires the per-estimator integration paths to be locked down first. +5. **Survey Data Support** — cross-cutting feature; promotion requires the per-estimator integration paths to be locked down first. --- diff --git a/diff_diff/staggered_aggregation.py b/diff_diff/staggered_aggregation.py index d90139d4..dee32c34 100644 --- a/diff_diff/staggered_aggregation.py +++ b/diff_diff/staggered_aggregation.py @@ -443,7 +443,7 @@ def _compute_aggregated_se_with_wif( unit: str, precomputed: Optional["PrecomputedData"] = None, return_psi: bool = False, - ) -> "Union[float, Tuple[float, np.ndarray]]": + ) -> "Union[Tuple[float, Optional[int]], Tuple[float, np.ndarray, Optional[int]]]": """ Compute SE with weight influence function (wif) adjustment. @@ -458,6 +458,15 @@ def _compute_aggregated_se_with_wif( Formula (matching R's did::aggte): agg_inf_i = Σ_k w_k × inf_i_k + wif_i × ATT_k se = sqrt(mean(agg_inf^2) / n) + + Returns + ------- + ``(se, effective_df)`` when ``return_psi=False``; ``(se, psi_total, + effective_df)`` when ``return_psi=True``. This 2-tuple / 3-tuple arity is + held on EVERY branch — including the empty-IF (``se=0.0``) and non-finite-IF + (``se=NaN``) early returns — so callers that unpack two or three values fail + soft instead of raising on degenerate influence functions. ``effective_df`` + is non-None only for replicate designs that dropped replicates. """ # Extract global unit info for correct pg = n_g / N_total scaling. # Without this, the local path builds the unit set from only units in @@ -486,15 +495,37 @@ def _compute_aggregated_se_with_wif( n_global_units=n_global_units, ) + # Consistent return arity across ALL branches: return_psi=True -> 3-tuple + # (se, psi, effective_df); return_psi=False -> 2-tuple (se, effective_df). + # The empty / non-finite-IF branches must match so callers that unpack three + # values (``_aggregate_event_study``) or two (``_aggregate_simple``) fail soft + # (NaN SE) instead of raising on degenerate IF/WIF edge cases. if len(psi_total) == 0: - return (0.0, psi_total) if return_psi else 0.0 + return (0.0, psi_total, None) if return_psi else (0.0, None) # Check for NaN propagation from non-finite WIF if not np.all(np.isfinite(psi_total)): - return (np.nan, psi_total) if return_psi else np.nan + return (np.nan, psi_total, None) if return_psi else (np.nan, None) - # Use design-based variance when full survey design is available - # Use unit-level resolved survey (panel IF is indexed by unit, not obs) + se, effective_df = self._se_from_psi(psi_total, precomputed) + if return_psi: + return (se, psi_total, effective_df) + return (se, effective_df) + + def _se_from_psi( + self, + psi_total: np.ndarray, + precomputed: Optional["PrecomputedData"] = None, + ) -> "Tuple[float, Optional[int]]": + """Standard error (and per-statistic effective df) from a combined IF vector. + + Routes a finite, non-empty influence-function vector through the same + variance estimator the per-event-time and simple-aggregation SE paths use: + replicate-weight variance, full survey-design variance, or the simple + ``sqrt(sum(psi^2))``. Callers must guard emptiness/finiteness first. + Returns ``(se, effective_df)``; ``effective_df`` is non-None only for + replicate designs that dropped replicates. + """ resolved_survey = ( precomputed.get("resolved_survey_unit") if precomputed is not None else None ) @@ -514,9 +545,7 @@ def _compute_aggregated_se_with_wif( se = np.nan else: se = np.sqrt(max(variance, 0.0)) - if return_psi: - return (se, psi_total, effective_df) - return (se, effective_df) + return se, effective_df if resolved_survey is not None and ( resolved_survey.strata is not None @@ -530,15 +559,10 @@ def _compute_aggregated_se_with_wif( se = np.nan else: se = np.sqrt(max(variance, 0.0)) - if return_psi: - return (se, psi_total, None) - return (se, None) + return se, None variance = np.sum(psi_total**2) - se = np.sqrt(variance) - if return_psi: - return (se, psi_total, None) - return (se, None) + return np.sqrt(variance), None def _aggregate_event_study( self, @@ -671,6 +695,10 @@ def _aggregate_event_study( _psi_vectors.append(psi_e) _psi_event_times.append(e) + # Reset the Eq. (4.14) overall before any early return so a reused estimator + # instance never reads a stale value from a prior fit. + self._event_study_overall = None + # Batch inference for all relative periods if not agg_effects_list: return {} @@ -769,6 +797,44 @@ def _aggregate_event_study( # Attach VCV to self for CallawaySantAnna to pick up self._event_study_vcov = event_study_vcov + # Eq. (4.14) overall ATT: the unweighted mean of the post-treatment + # event-study effects ES(e). Stashed on self (mirroring _event_study_vcov) + # so the StaggeredTripleDifference estimator can expose it as overall_att_es; + # CallawaySantAnna leaves it unread. Post-treatment is the library predicate + # e >= -anticipation (matching _aggregate_simple and the default overall_att), + # NOT a hardcoded e >= 0. + # + # The POINT ESTIMATE averages EVERY finite post-treatment ES(e) effect (read + # from event_study_effects by event-time key), so it is always the true Eq. + # 4.14 average -- it must NOT be silently restricted to horizons with a finite + # influence function. The SE is the influence function of that mean (the + # average of the per-event-time combined IFs, via the same survey-aware + # variance routine as the per-e effects). If any contributing horizon lacks a + # finite, well-formed combined IF (a finite ES(e) can have a non-finite + # WIF/IF, which the per-e path already surfaces as a NaN SE), the combined IF + # for the mean is undefined: the SE is NaN while the point estimate is + # retained, and the consumer (fit) warns and NaN-propagates the inference. + post_e = [ + e + for e in event_study_effects + if e >= -self.anticipation and np.isfinite(event_study_effects[e]["effect"]) + ] + if post_e: + att_es = float(np.mean([event_study_effects[e]["effect"] for e in post_e])) + psi_by_e = {e: psi for e, psi in zip(_psi_event_times, _psi_vectors)} + psis = [psi_by_e.get(e) for e in post_e] + se_es: float = np.nan + eff_df_es: Optional[int] = None + if all(p is not None and len(p) > 0 and np.all(np.isfinite(p)) for p in psis): + if len({len(p) for p in psis}) == 1: + psi_es = np.column_stack(psis).mean(axis=1) + se_es, eff_df_es = self._se_from_psi(psi_es, precomputed) + self._event_study_overall = { + "att": att_es, + "se": float(se_es), + "effective_df": eff_df_es, + } + return event_study_effects def _aggregate_by_group( diff --git a/diff_diff/staggered_bootstrap.py b/diff_diff/staggered_bootstrap.py index a97d6ed4..ba057b22 100644 --- a/diff_diff/staggered_bootstrap.py +++ b/diff_diff/staggered_bootstrap.py @@ -81,6 +81,12 @@ class CSBootstrapResults: Bootstrap p-values for group effects. bootstrap_distribution : Optional[np.ndarray] Full bootstrap distribution of overall ATT (if requested). + overall_att_es_se : Optional[float] + Bootstrap standard error for the paper Eq. 4.14 overall (event-study average). + overall_att_es_ci : Optional[Tuple[float, float]] + Bootstrap confidence interval for the Eq. 4.14 overall. + overall_att_es_p_value : Optional[float] + Bootstrap p-value for the Eq. 4.14 overall. """ n_bootstrap: int @@ -100,6 +106,10 @@ class CSBootstrapResults: group_effect_p_values: Optional[Dict[Any, float]] = None bootstrap_distribution: Optional[np.ndarray] = field(default=None, repr=False) cband_crit_value: Optional[float] = None + # Paper Eq. (4.14) overall (event-study average) bootstrap inference. + overall_att_es_se: Optional[float] = None + overall_att_es_ci: Optional[Tuple[float, float]] = None + overall_att_es_p_value: Optional[float] = None # ============================================================================= @@ -485,8 +495,18 @@ def _run_multiplier_bootstrap( event_study_ses = None event_study_cis = None event_study_p_values = None - - if bootstrap_event_study is not None and event_study_info is not None: + # Paper Eq. (4.14) overall (event-study average) bootstrap inference; stays + # NaN unless event-study draws exist. Mirrors the analytical overall_att_es. + overall_att_es_se = np.nan + overall_att_es_ci = (np.nan, np.nan) + overall_att_es_p_value = np.nan + + # ``rel_periods`` can be empty when balance_e (or an empty event study) leaves + # no relative periods; guard the column_stack so the bootstrap mirrors the + # analytical empty/NaN surface instead of raising "need at least one array to + # concatenate". event_study_ses stays None and the Eq. 4.14 overall bootstrap + # stays NaN (the analytical fit path emits the requested-but-undefined warning). + if bootstrap_event_study is not None and event_study_info is not None and rel_periods: es_effects = np.array([event_study_info[e]["effect"] for e in rel_periods]) es_boot_matrix = np.column_stack([bootstrap_event_study[e] for e in rel_periods]) es_ses, es_ci_lo, es_ci_hi, es_pv = _compute_effect_bootstrap_stats_batch_func( @@ -500,6 +520,34 @@ def _run_multiplier_bootstrap( } event_study_p_values = {e: float(es_pv[i]) for i, e in enumerate(rel_periods)} + # Eq. (4.14) overall = unweighted mean of post-treatment ES(e) (e >= + # -anticipation, matching post_treatment_mask above). The per-draw mean + # over those event times is the bootstrap distribution of the overall. + es_post = [ + e + for e in rel_periods + if e >= -self.anticipation + and e in event_study_info + and np.isfinite(event_study_info[e]["effect"]) + ] + if es_post: + original_es_overall = float( + np.mean([event_study_info[e]["effect"] for e in es_post]) + ) + boot_es_overall = np.column_stack([bootstrap_event_study[e] for e in es_post]).mean( + axis=1 + ) + ( + overall_att_es_se, + overall_att_es_ci, + overall_att_es_p_value, + ) = _compute_effect_bootstrap_stats_func( + original_es_overall, + boot_es_overall, + alpha=self.alpha, + context="overall ATT (event-study average)", + ) + # Batch compute bootstrap statistics for group effects group_effect_ses = None group_effect_cis = None @@ -564,10 +612,11 @@ def _run_multiplier_bootstrap( overall_se = np.nan overall_ci = (np.nan, np.nan) overall_p_value = np.nan + overall_att_es_se = np.nan + overall_att_es_ci = (np.nan, np.nan) + overall_att_es_p_value = np.nan gt_ses = {gt: np.nan for gt in gt_ses} if gt_ses else gt_ses - gt_cis = ( - {gt: (np.nan, np.nan) for gt in gt_cis} if gt_cis else gt_cis - ) + gt_cis = {gt: (np.nan, np.nan) for gt in gt_cis} if gt_cis else gt_cis gt_p_values = {gt: np.nan for gt in gt_p_values} if gt_p_values else gt_p_values if event_study_ses: event_study_ses = {k: np.nan for k in event_study_ses} @@ -597,6 +646,9 @@ def _run_multiplier_bootstrap( group_effect_p_values=group_effect_p_values, bootstrap_distribution=bootstrap_overall, cband_crit_value=cband_crit_value, + overall_att_es_se=overall_att_es_se, + overall_att_es_ci=overall_att_es_ci, + overall_att_es_p_value=overall_att_es_p_value, ) def _prepare_event_study_aggregation( diff --git a/diff_diff/staggered_triple_diff.py b/diff_diff/staggered_triple_diff.py index d2c7339d..893b4a3b 100644 --- a/diff_diff/staggered_triple_diff.py +++ b/diff_diff/staggered_triple_diff.py @@ -87,7 +87,7 @@ class StaggeredTripleDifference( References ---------- Ortiz-Villavicencio, M. & Sant'Anna, P.H.C. (2025). "Better Understanding - Triple Differences Estimators." arXiv:2505.09942. + Triple Differences Estimators." arXiv:2505.09942v3. """ def __init__( @@ -599,6 +599,12 @@ def fit( overall_att, overall_se, overall_effective_df = self._aggregate_simple( group_time_effects, influence_func_info, df_agg, unit, precomputed_agg ) + # Preserve the ORIGINAL survey df before the simple-overall statistic mutates + # it below. The Eq. 4.14 overall (overall_att_es) must fall back to this + # original df, never to the simple overall's per-statistic replicate df — the + # two statistics can drop different replicate subsets, so reusing the simple + # overall's df would silently give overall_att_es the wrong p-value/CI. + df_survey_original = df_survey # Use per-statistic effective df from replicate aggregation if available; # otherwise fall back to the original df from the survey design. if overall_effective_df is not None: @@ -633,6 +639,55 @@ def fit( unit, ) + # Paper Eq. (4.14) overall ATT (event-study average): an opt-in summary + # alongside the default CS-simple ``overall_att``. ``_aggregate_event_study`` + # stashes it on ``self._event_study_overall``; populated only when the + # event-study aggregation ran. Analytical inference here; the bootstrap block + # below overrides the SE when ``n_bootstrap > 0`` (mirroring ``overall_se``). + overall_att_es = None + overall_se_es = None + overall_t_stat_es = None + overall_p_value_es = None + overall_conf_int_es = None + # Whether the ANALYTICAL Eq. 4.14 SE was non-finite while its point estimate + # was finite (i.e. a contributing horizon's influence function was non-finite). + # Captured before any bootstrap override so the terminal warning below does not + # misdiagnose a bootstrap-side NaN SE (e.g. cluster-unidentified) as an + # analytical-IF failure (the bootstrap path emits its own warning). + analytical_overall_es_se_nonfinite = False + if aggregate in ("event_study", "all"): + es_overall = getattr(self, "_event_study_overall", None) + if es_overall is not None: + overall_att_es = es_overall["att"] + overall_se_es = es_overall["se"] + analytical_overall_es_se_nonfinite = bool( + np.isfinite(overall_att_es) and not np.isfinite(overall_se_es) + ) + es_eff_df = es_overall.get("effective_df") + # Fall back to the ORIGINAL survey df, not the simple-overall's mutated + # per-statistic df (P1 fix): overall_att_es has its own replicate df. + df_for_es = es_eff_df if es_eff_df is not None else df_survey_original + overall_t_stat_es, overall_p_value_es, overall_conf_int_es = safe_inference( + overall_att_es, overall_se_es, alpha=self.alpha, df=df_for_es + ) + else: + # Event-study aggregation was requested but yielded no post-treatment + # horizon: this is "requested but undefined" -> NaN + warning (the + # library's overall-aggregation contract, matching _aggregate_simple), + # distinct from "not requested" which leaves the fields None. + warnings.warn( + "Event-study aggregation was requested but no post-treatment " + "horizons are available for the Eq. 4.14 overall (overall_att_es); " + "returning NaN.", + UserWarning, + stacklevel=2, + ) + overall_att_es = np.nan + overall_se_es = np.nan + overall_t_stat_es, overall_p_value_es, overall_conf_int_es = safe_inference( + np.nan, np.nan, alpha=self.alpha, df=df_survey + ) + # Reject replicate-weight designs for bootstrap — replicate variance # is an analytical alternative, not compatible with bootstrap if ( @@ -670,6 +725,17 @@ def fit( ) overall_conf_int = bootstrap_results.overall_att_ci overall_p_value = bootstrap_results.overall_att_p_value + + # Mirror the override for the Eq. (4.14) event-study-average overall + # (only when event-study aggregation produced it). A NaN bootstrap SE + # (e.g. cluster-unidentified) correctly NaNs the inference. + if overall_att_es is not None and bootstrap_results.overall_att_es_se is not None: + overall_se_es = bootstrap_results.overall_att_es_se + overall_t_stat_es, overall_p_value_es, overall_conf_int_es = safe_inference( + overall_att_es, overall_se_es, alpha=self.alpha, df=df_survey + ) + overall_conf_int_es = bootstrap_results.overall_att_es_ci + overall_p_value_es = bootstrap_results.overall_att_es_p_value if bootstrap_results.cband_crit_value is not None: cband_crit_value = bootstrap_results.cband_crit_value @@ -743,6 +809,29 @@ def fit( ) group_effects[g_key]["t_stat"] = t_val + # Eq. 4.14 overall: an ANALYTICAL non-finite SE under a finite point estimate + # (a contributing horizon's influence function is non-finite, or the variance is + # unidentified — e.g. a single-PSU/cluster design). Surface it — never NaN the SE + # silently. Gated on the analytical-origin flag (captured before the bootstrap + # override) and the final state still being non-finite: a bootstrap that supplies a + # finite SE rescues it (no warning), and a bootstrap that NaNs the SE for unrelated + # reasons (e.g. cluster-unidentified) is reported by the bootstrap's own warning, + # not misdiagnosed here as an analytical-IF failure. + if ( + analytical_overall_es_se_nonfinite + and overall_se_es is not None + and not np.isfinite(overall_se_es) + ): + warnings.warn( + "Eq. 4.14 overall (overall_att_es) point estimate is defined but its " + "standard error is undefined (NaN): either a contributing post-treatment " + "event-study horizon has a non-finite influence function, or the variance " + "is unidentified (e.g. a single-PSU/cluster survey design). overall_se_es " + "and its inference fields are NaN.", + UserWarning, + stacklevel=2, + ) + n_treated_units = int(np.sum((unit_cohorts > 0) & (eligibility_per_unit == 1))) n_control_units = n_units - n_treated_units n_never_enabled = int(np.sum(unit_cohorts == 0)) @@ -780,6 +869,11 @@ def fit( epv_diagnostics=epv_diagnostics if epv_diagnostics else None, epv_threshold=self.epv_threshold, pscore_fallback=self.pscore_fallback, + overall_att_es=overall_att_es, + overall_se_es=overall_se_es, + overall_t_stat_es=overall_t_stat_es, + overall_p_value_es=overall_p_value_es, + overall_conf_int_es=overall_conf_int_es, ) self.is_fitted_ = True return self.results_ diff --git a/diff_diff/staggered_triple_diff_results.py b/diff_diff/staggered_triple_diff_results.py index be653e8b..2c7c434f 100644 --- a/diff_diff/staggered_triple_diff_results.py +++ b/diff_diff/staggered_triple_diff_results.py @@ -55,6 +55,15 @@ class StaggeredTripleDiffResults: Number of eligible units (Q = 1). n_ineligible : int Number of ineligible units (Q = 0). + overall_att_es : float, optional + Paper Eq. (4.14) "overall" ATT: the unweighted mean of the post-treatment + event-study effects ES(e). Populated only when ``aggregate`` is + ``"event_study"`` or ``"all"``; ``None`` otherwise. Distinct from + ``overall_att`` (the Callaway-Sant'Anna simple post-treatment (g,t) average, + which is the default headline ATT). + overall_se_es, overall_t_stat_es, overall_p_value_es, overall_conf_int_es : optional + Standard error, t-statistic, p-value, and confidence interval for + ``overall_att_es``; ``None`` when ``overall_att_es`` is ``None``. """ group_time_effects: Dict[Tuple[Any, Any], Dict[str, Any]] @@ -94,6 +103,16 @@ class StaggeredTripleDiffResults: ) epv_threshold: float = 10 pscore_fallback: str = "error" + # Paper Eq. (4.14) "overall" ATT: the unweighted mean of the post-treatment + # event-study effects ES(e). Populated only when aggregate in {"event_study", + # "all"}; None otherwise. Distinct from the default ``overall_att`` (the + # Callaway-Sant'Anna simple post-treatment (g,t) average). See REGISTRY + # ## StaggeredTripleDifference "Aggregation". + overall_att_es: Optional[float] = None + overall_se_es: Optional[float] = None + overall_t_stat_es: Optional[float] = None + overall_p_value_es: Optional[float] = None + overall_conf_int_es: Optional[Tuple[float, float]] = None # --- Inference-field aliases (balance/external-adapter compatibility) --- @property @@ -194,6 +213,28 @@ def summary(self, alpha: Optional[float] = None) -> str: ] ) + # Paper Eq. (4.14) overall (event-study average), when computed + # (aggregate in {"event_study", "all"}). The headline ATT above remains the + # Callaway-Sant'Anna simple post-treatment (g,t) average. + if ( + self.overall_att_es is not None + and self.overall_se_es is not None + and self.overall_conf_int_es is not None + ): + p_es = self.overall_p_value_es if self.overall_p_value_es is not None else float("nan") + t_es = self.overall_t_stat_es if self.overall_t_stat_es is not None else float("nan") + sig_es = _get_significance_stars(p_es) + lines.extend( + [ + "", + "Overall ATT (event-study average, paper Eq. 4.14):", + f"{'ATT':<15} {self.overall_att_es:>12.4f} {self.overall_se_es:>12.4f} " + f"{t_es:>10.3f} {p_es:>10.4f} {sig_es:>6}", + f"{conf_level}% Confidence Interval: " + f"[{self.overall_conf_int_es[0]:.4f}, {self.overall_conf_int_es[1]:.4f}]", + ] + ) + cv = self.coef_var if np.isfinite(cv): lines.append(f"{'CV (SE/abs(ATT)):':<25} {cv:>10.4f}") @@ -429,6 +470,12 @@ def to_dict(self) -> Dict[str, Any]: d["group_effects"] = self.group_effects if self.comparison_group_counts is not None: d["comparison_group_counts"] = self.comparison_group_counts + if self.overall_att_es is not None: + d["overall_att_es"] = self.overall_att_es + d["overall_se_es"] = self.overall_se_es + d["overall_t_stat_es"] = self.overall_t_stat_es + d["overall_p_value_es"] = self.overall_p_value_es + d["overall_conf_int_es"] = self.overall_conf_int_es return d @property diff --git a/docs/api/_autosummary/diff_diff.CSBootstrapResults.rst b/docs/api/_autosummary/diff_diff.CSBootstrapResults.rst index e7a16b54..60b3ca1a 100644 --- a/docs/api/_autosummary/diff_diff.CSBootstrapResults.rst +++ b/docs/api/_autosummary/diff_diff.CSBootstrapResults.rst @@ -34,6 +34,9 @@ ~CSBootstrapResults.overall_att_se ~CSBootstrapResults.overall_att_ci ~CSBootstrapResults.overall_att_p_value + ~CSBootstrapResults.overall_att_es_se + ~CSBootstrapResults.overall_att_es_ci + ~CSBootstrapResults.overall_att_es_p_value ~CSBootstrapResults.group_time_ses ~CSBootstrapResults.group_time_cis ~CSBootstrapResults.group_time_p_values diff --git a/docs/api/_autosummary/diff_diff.StaggeredTripleDiffResults.rst b/docs/api/_autosummary/diff_diff.StaggeredTripleDiffResults.rst index 70bbf478..dbcdbb5a 100644 --- a/docs/api/_autosummary/diff_diff.StaggeredTripleDiffResults.rst +++ b/docs/api/_autosummary/diff_diff.StaggeredTripleDiffResults.rst @@ -56,6 +56,11 @@ ~StaggeredTripleDiffResults.overall_t_stat ~StaggeredTripleDiffResults.overall_p_value ~StaggeredTripleDiffResults.overall_conf_int + ~StaggeredTripleDiffResults.overall_att_es + ~StaggeredTripleDiffResults.overall_se_es + ~StaggeredTripleDiffResults.overall_t_stat_es + ~StaggeredTripleDiffResults.overall_p_value_es + ~StaggeredTripleDiffResults.overall_conf_int_es ~StaggeredTripleDiffResults.groups ~StaggeredTripleDiffResults.time_periods ~StaggeredTripleDiffResults.n_obs diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 926d0acb..1e81a7dc 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2205,10 +2205,13 @@ functions across comparison groups. Minimizes asymptotic variance subject to `su Event study (Equation 4.13): cohort-share-weighted average across cohorts for each relative time `e = t - g`. Reuses `CallawaySantAnnaAggregationMixin._aggregate_event_study()`. -Overall ATT: cohort-size-weighted average across post-treatment (g,t) pairs. -Reuses `CallawaySantAnnaAggregationMixin._aggregate_simple()`. Note: this is the -simple post-treatment aggregation, not the paper's Equation 4.14 (which averages -over event-study effects). +Overall ATT: two summaries are available. (1) The default `overall_att` is the +cohort-size-weighted average across post-treatment (g,t) pairs (reuses +`CallawaySantAnnaAggregationMixin._aggregate_simple()`) — the library-wide +Callaway-Sant'Anna "simple" convention, matching R `agg_ddd(type="simple")`. +(2) The opt-in `overall_att_es` is the paper's Equation 4.14 overall — the unweighted +mean of the post-treatment event-study effects ES(e) — populated when +`aggregate="event_study"` or `"all"` (see the labeled Note below). Group effects: average across post-treatment time periods for each cohort. Reuses `CallawaySantAnnaAggregationMixin._aggregate_by_group()`. @@ -2216,15 +2219,29 @@ Reuses `CallawaySantAnnaAggregationMixin._aggregate_by_group()`. All aggregation SEs include the WIF (Weight Influence Function) adjustment for uncertainty in cohort-share weights, inherited from the CallawaySantAnna mixin. -- **Deviation from R:** Aggregation weights and WIF use the eligible-treated - population `P(S=g, Q=1)` (matching the paper's Eq 4.13, where `G_i` is defined - only for `Q=1` units). R's `agg_ddd()` uses `P(S=g)` (all units in the enabling - group, including ineligible). This is implemented by setting `unit_cohorts=0` for - ineligible units before calling the aggregation mixin. +- **Deviation from R:** Aggregation weights (and the WIF) use the eligible-treated + population `P(S=g, Q=1)` — matching the paper's Eq 4.13, where `G_i` is defined + only for `Q=1` units (`G_i = g` iff `S_i = g` and `Q_i = 1`), so the paper's + `P(G=g)` *is* `P(S=g, Q=1)`. R's `agg_ddd()` instead weights by `P(S=g)` (all units + in the enabling group, including ineligible). Implemented by setting `unit_cohorts=0` + for ineligible units before calling the aggregation mixin. Group-time `ATT(g,t)` + values are identical to R; only the weighted average across (g,t) pairs differs — + this is the source of the larger tolerance in the R cross-validation tests. - **Note:** Per-cohort group-effect SEs include WIF via the inherited mixin. R's `agg_ddd(type="group")` uses `wif=NULL` for per-cohort aggregation since within-cohort weights are fixed. This makes our per-cohort group-effect SEs slightly conservative relative to R. +- **Note:** The default overall ATT (`overall_att`) is the Callaway-Sant'Anna simple + post-treatment (g,t) average — the library-wide convention across staggered + estimators — and is NOT the paper's Equation 4.14 overall (which averages the + event-study effects). The paper's Eq 4.14 form is exposed opt-in as `overall_att_es` + (populated only when `aggregate` is `"event_study"`/`"all"`), computed as the + unweighted mean of the post-treatment ES(e) over `e >= -anticipation`. Its analytical + SE is the influence function of that mean (the average of the per-event-time combined + IFs, routed through the same survey-aware variance estimator as the per-e effects); + a multiplier-bootstrap SE replaces it when `n_bootstrap > 0`. The two summaries answer + different questions and generally differ; `overall_att_es` is cross-validated against + R `agg_ddd(type="eventstudy")$overall.att`/`overall.se`. *Standard errors:* @@ -2267,11 +2284,6 @@ confidence bands (sup-t) for event study. (TSL) or `compute_replicate_if_variance()` (replicate weights). Bootstrap uses PSU-level multiplier weights. The R `triplediff` package does not support survey weights. -- **Deviation from R:** Event-study and simple aggregation reuse - `CallawaySantAnnaAggregationMixin` cohort-size weights (`n_treated` per cohort) - instead of R's `agg_ddd()` group-probability weights (`pg = P(G=g)` over all - units including ineligible). Group-time ATT(g,t) values are identical; only the - weighted average across (g,t) pairs differs. *Edge cases:* - Single comparison group: GMM reduces to w=[1], no matrix inversion @@ -2283,7 +2295,11 @@ confidence bands (sup-t) for event study. - No valid (g,t) pairs at all: raises `ValueError` **Reference implementation(s):** -- R `triplediff` (companion package by paper authors) — not yet validated against +- R `triplediff` (companion package by paper authors) — cross-validated in + `tests/test_methodology_staggered_triple_diff.py` (group-time ATT/SE for both control + groups, plus the Eq. 4.14 overall `overall_att_es` against `agg_ddd(type="eventstudy")`). + CSV fixtures are gitignored and regenerated on-the-fly from + `benchmarks/R/benchmark_staggered_triplediff.R`; the JSON golden is committed. **Requirements checklist:** - [x] Panel data with (unit, time, enabling-group S, eligibility Q, outcome Y) diff --git a/docs/references.rst b/docs/references.rst index fa54071d..2d5036c1 100644 --- a/docs/references.rst +++ b/docs/references.rst @@ -129,7 +129,7 @@ Triply Robust Panel (TROP) Triple Difference (DDD) ----------------------- -- **Ortiz-Villavicencio, M., & Sant'Anna, P. H. C. (2025).** "Better Understanding Triple Differences Estimators." *Working Paper*. https://arxiv.org/abs/2505.09942 +- **Ortiz-Villavicencio, M., & Sant'Anna, P. H. C. (2025).** "Better Understanding Triple Differences Estimators." *arXiv:2505.09942v3*. https://arxiv.org/abs/2505.09942v3 This paper shows that common DDD implementations (taking the difference between two DiDs, or applying three-way fixed effects regressions) are generally invalid when identification requires conditioning on covariates. The ``TripleDifference`` class implements their regression adjustment, inverse probability weighting, and doubly robust estimators. diff --git a/tests/test_methodology_staggered_triple_diff.py b/tests/test_methodology_staggered_triple_diff.py index 3412358d..6f16fd1d 100644 --- a/tests/test_methodology_staggered_triple_diff.py +++ b/tests/test_methodology_staggered_triple_diff.py @@ -15,7 +15,7 @@ import pandas as pd import pytest -from diff_diff import StaggeredTripleDifference +from diff_diff import StaggeredTripleDifference, generate_staggered_ddd_data BENCHMARK_DIR = Path(__file__).parent.parent / "benchmarks" / "data" / "synthetic" RESULTS_FILE = BENCHMARK_DIR / "staggered_ddd_r_results.json" @@ -46,7 +46,9 @@ def _r_triplediff_available() -> bool: try: result = subprocess.run( ["Rscript", "-e", "library(triplediff); library(jsonlite); cat('OK')"], - capture_output=True, text=True, timeout=30, + capture_output=True, + text=True, + timeout=30, ) return result.returncode == 0 and "OK" in result.stdout except (subprocess.TimeoutExpired, FileNotFoundError, OSError): @@ -57,8 +59,9 @@ def _ensure_csv_fixtures(): """Generate CSV fixtures via R if they don't exist and R is available.""" # Check if any CSV is missing needed_keys = ["s42_dgp1", "s123_dgp1", "s99_dgp1"] - missing = [k for k in needed_keys - if not (BENCHMARK_DIR / f"staggered_ddd_data_{k}.csv").exists()] + missing = [ + k for k in needed_keys if not (BENCHMARK_DIR / f"staggered_ddd_data_{k}.csv").exists() + ] if not missing: return True # All present @@ -69,7 +72,9 @@ def _ensure_csv_fixtures(): try: result = subprocess.run( ["Rscript", str(R_SCRIPT)], - capture_output=True, text=True, timeout=120, + capture_output=True, + text=True, + timeout=120, cwd=str(Path(__file__).parent.parent), ) return result.returncode == 0 @@ -95,10 +100,7 @@ def _load_r_data(seed: int, dgp: int) -> pd.DataFrame: """Load the CSV data that R used for a given scenario.""" csv_path = BENCHMARK_DIR / f"staggered_ddd_data_s{seed}_dgp{dgp}.csv" if not csv_path.exists(): - pytest.skip( - f"Data file not found: {csv_path}. " - "Requires R + triplediff to generate." - ) + pytest.skip(f"Data file not found: {csv_path}. " "Requires R + triplediff to generate.") data = pd.read_csv(csv_path) return data.rename(columns=R_TO_PY_COLS) @@ -111,7 +113,12 @@ def _run_python(data, method, control_group): base_period="varying", ) return est.fit( - data, "outcome", "unit", "period", "first_treat", "eligibility", + data, + "outcome", + "unit", + "period", + "first_treat", + "eligibility", aggregate="event_study", ) @@ -153,16 +160,19 @@ def test_gt_att_matches_r(self, r_results, key): py_gt = sorted(res.group_time_effects.items()) r_gt = list(zip(r["gt_groups"], r["gt_periods"])) - assert len(py_gt) == len(r["gt_att"]), ( - f"{key}: Python has {len(py_gt)} GT cells, R has {len(r['gt_att'])}" - ) + assert len(py_gt) == len( + r["gt_att"] + ), f"{key}: Python has {len(py_gt)} GT cells, R has {len(r['gt_att'])}" for i, ((g, t), eff) in enumerate(py_gt): - assert (g, t) == (r_gt[i][0], r_gt[i][1]), ( - f"{key}: GT cell mismatch at index {i}: Python=({g},{t}), R={r_gt[i]}" - ) + assert (g, t) == ( + r_gt[i][0], + r_gt[i][1], + ), f"{key}: GT cell mismatch at index {i}: Python=({g},{t}), R={r_gt[i]}" _assert_close( - eff["effect"], r["gt_att"][i], - ATT_RTOL, ATT_ATOL, + eff["effect"], + r["gt_att"][i], + ATT_RTOL, + ATT_ATOL, f"{key} ATT(g={g},t={t})", ) @@ -173,13 +183,15 @@ def test_gt_se_matches_r(self, r_results, key): res = _run_python(data, r["est_method"], r["control_group"]) py_gt = sorted(res.group_time_effects.items()) - assert len(py_gt) == len(r["gt_se"]), ( - f"{key}: Python has {len(py_gt)} GT cells, R has {len(r['gt_se'])}" - ) + assert len(py_gt) == len( + r["gt_se"] + ), f"{key}: Python has {len(py_gt)} GT cells, R has {len(r['gt_se'])}" for i, ((g, t), eff) in enumerate(py_gt): _assert_close( - eff["se"], r["gt_se"][i], - SE_RTOL, SE_ATOL, + eff["se"], + r["gt_se"][i], + SE_RTOL, + SE_ATOL, f"{key} SE(g={g},t={t})", ) @@ -208,16 +220,19 @@ def test_gt_att_matches_r(self, r_results, key): py_gt = sorted(res.group_time_effects.items()) r_gt = list(zip(r["gt_groups"], r["gt_periods"])) - assert len(py_gt) == len(r["gt_att"]), ( - f"{key}: Python has {len(py_gt)} GT cells, R has {len(r['gt_att'])}" - ) + assert len(py_gt) == len( + r["gt_att"] + ), f"{key}: Python has {len(py_gt)} GT cells, R has {len(r['gt_att'])}" for i, ((g, t), eff) in enumerate(py_gt): - assert (g, t) == (r_gt[i][0], r_gt[i][1]), ( - f"{key}: GT cell mismatch at index {i}: Python=({g},{t}), R={r_gt[i]}" - ) + assert (g, t) == ( + r_gt[i][0], + r_gt[i][1], + ), f"{key}: GT cell mismatch at index {i}: Python=({g},{t}), R={r_gt[i]}" _assert_close( - eff["effect"], r["gt_att"][i], - ATT_RTOL, ATT_ATOL, + eff["effect"], + r["gt_att"][i], + ATT_RTOL, + ATT_ATOL, f"{key} ATT(g={g},t={t})", ) @@ -228,13 +243,15 @@ def test_gt_se_matches_r(self, r_results, key): res = _run_python(data, r["est_method"], r["control_group"]) py_gt = sorted(res.group_time_effects.items()) - assert len(py_gt) == len(r["gt_se"]), ( - f"{key}: Python has {len(py_gt)} GT cells, R has {len(r['gt_se'])}" - ) + assert len(py_gt) == len( + r["gt_se"] + ), f"{key}: Python has {len(py_gt)} GT cells, R has {len(r['gt_se'])}" for i, ((g, t), eff) in enumerate(py_gt): _assert_close( - eff["se"], r["gt_se"][i], - SE_RTOL, SE_ATOL, + eff["se"], + r["gt_se"][i], + SE_RTOL, + SE_ATOL, f"{key} SE(g={g},t={t})", ) @@ -247,16 +264,24 @@ def test_gt_se_matches_r(self, r_results, key): class TestStaggeredDDDAggregation: - """Cross-validate event study and overall ATT aggregation. - - Note: Aggregation weights differ between R's agg_ddd() and the - CallawaySantAnna mixin we reuse. The group-time ATT(g,t) values - match R exactly (0.00%), but aggregation introduces small differences - due to the weighting scheme. We use 10% tolerance here. + """Cross-validate event study and overall ATT aggregation against R. + + The eligible-treated aggregation weighting (REGISTRY deviation) propagates into + the individual per-relative-time ES(e) values and the simple overall, so those + are checked at a loose 25% envelope (the per-(g,t) ATT(g,t) themselves match R + exactly, 0.00%). The Eq. 4.14 overall (``overall_att_es``) is the unweighted MEAN + of the post-treatment ES(e), where the weighting deviation largely averages out + -- empirically <=5% on ATT and <=0.5% on SE across all benchmark scenarios -- so + it is checked at a much tighter, dedicated envelope. """ - AGG_RTOL = 0.25 # 25% — aggregation weighting differs from R's agg_ddd + AGG_RTOL = 0.25 # per-e ES + simple overall: weighting deviation on individual values AGG_ATOL = 3.0 + # Eq. 4.14 overall_att_es: the weighting deviation averages out in the mean + # (max observed across all scenarios: ATT 5.0%, SE 0.4%); tight-but-safe envelope. + OVERALL_ES_ATT_RTOL = 0.10 + OVERALL_ES_SE_RTOL = 0.03 + OVERALL_ES_ATOL = 0.5 @pytest.mark.parametrize("key", ALL_SCENARIOS) def test_event_study_att_close_to_r(self, r_results, key): @@ -271,8 +296,10 @@ def test_event_study_att_close_to_r(self, r_results, key): for e, eff in res.event_study_effects.items(): if e in r_es: _assert_close( - eff["effect"], r_es[e], - self.AGG_RTOL, self.AGG_ATOL, + eff["effect"], + r_es[e], + self.AGG_RTOL, + self.AGG_ATOL, f"{key} ES ATT(e={e})", ) @@ -288,10 +315,325 @@ def test_overall_att_close_to_r(self, r_results, key): base_period="varying", ) res = est.fit( - data, "outcome", "unit", "period", "first_treat", "eligibility", + data, + "outcome", + "unit", + "period", + "first_treat", + "eligibility", ) _assert_close( - res.overall_att, r["overall_att_simple"], - self.AGG_RTOL, self.AGG_ATOL, + res.overall_att, + r["overall_att_simple"], + self.AGG_RTOL, + self.AGG_ATOL, f"{key} overall ATT (simple)", ) + + @pytest.mark.parametrize("key", ALL_SCENARIOS) + def test_overall_att_es_close_to_r(self, r_results, key): + """Paper Eq. 4.14 overall (event-study average) vs R agg_ddd(type='eventstudy'). + + R's ``agg_ddd(type="eventstudy")$overall.att`` / ``overall.se`` ARE the + paper's Eq. 4.14 overall and its analytical SE; the library exposes them + opt-in as ``overall_att_es`` / ``overall_se_es``. The AGG_RTOL tolerance + reflects the documented eligible-treated weighting (REGISTRY deviation) + propagating from ES(e) into the overall -- NOT a new deviation; per-(g,t) + ATT(g,t) match R exactly, and empirically the overall agreement is far + tighter than AGG_RTOL. + """ + r = r_results[key] + r_att = r.get("overall_att_es") + if r_att is None or (isinstance(r_att, float) and np.isnan(r_att)): + pytest.skip("No event-study overall in R") + data = _load_r_data(r["seed"], r["dgp_type"]) + res = _run_python(data, r["est_method"], r["control_group"]) + assert res.overall_att_es is not None + _assert_close( + res.overall_att_es, + r_att, + self.OVERALL_ES_ATT_RTOL, + self.OVERALL_ES_ATOL, + f"{key} overall ATT (Eq 4.14, event-study avg)", + ) + r_se = r.get("overall_se_es") + if r_se is not None and not (isinstance(r_se, float) and np.isnan(r_se)): + _assert_close( + res.overall_se_es, + r_se, + self.OVERALL_ES_SE_RTOL, + self.OVERALL_ES_ATOL, + f"{key} overall SE (Eq 4.14, event-study avg)", + ) + + +# --------------------------------------------------------------------------- +# Paper-equation-anchored Verified Components +# +# These validate the implementation against properties the paper STATES +# (Ortiz-Villavicencio & Sant'Anna 2025, arXiv:2505.09942v3), using synthetic +# panels from ``generate_staggered_ddd_data`` (DDD-CPT holds by construction). +# They need no R and run in the pure-Python CI fallback. Paper assumptions: +# S (random sampling), SO (strong overlap), NA (no anticipation), DDD-CPT +# (conditional parallel trends). +# --------------------------------------------------------------------------- + +_FIT_COLS = ("outcome", "unit", "period", "first_treat", "eligibility") + + +class TestTheorem41Identification: + """Theorem 4.1 / Eq. (4.5): nonparametric ATT(g,t) identification, with the + regression-adjustment (RA), IPW, and doubly-robust (DR) estimands all equal. + + Observable implication tested: with no covariates the three ``est_method`` + paths reduce to the same DiD building block, so ATT(g,t) must coincide. + """ + + def test_ra_ipw_dr_agree_without_covariates(self): + data = generate_staggered_ddd_data(n_units=600, treatment_effect=3.0, seed=11) + fits = { + m: StaggeredTripleDifference(estimation_method=m, control_group="nevertreated").fit( + data, *_FIT_COLS + ) + for m in ("dr", "ipw", "reg") + } + for k in fits["dr"].group_time_effects: + a = fits["dr"].group_time_effects[k]["effect"] + b = fits["ipw"].group_time_effects[k]["effect"] + c = fits["reg"].group_time_effects[k]["effect"] + if np.isfinite(a) and np.isfinite(b) and np.isfinite(c): + assert abs(a - b) < 1e-6 and abs(a - c) < 1e-6, ( + f"RA=IPW=DR identification (Thm 4.1) violated at {k}: " + f"dr={a:.6f}, ipw={b:.6f}, reg={c:.6f}" + ) + + +class TestEquation41Decomposition: + """Eq. (4.1): the three-term DR DDD estimand (DiD_A + DiD_B - DiD_C). + + Observable implication tested: under a constant treatment effect with + DDD-CPT holding (the DGP's eligibility trend is common across enabling + groups), post-treatment ATT(g,t) recover the true effect. + """ + + def test_recovers_constant_effect(self): + tau = 2.5 + data = generate_staggered_ddd_data(n_units=800, treatment_effect=tau, noise_sd=0.3, seed=21) + res = StaggeredTripleDifference(estimation_method="dr", control_group="notyettreated").fit( + data, *_FIT_COLS, aggregate="event_study" + ) + post = [ + v["effect"] + for (g, t), v in res.group_time_effects.items() + if t >= g and np.isfinite(v["effect"]) + ] + assert len(post) > 0 + assert ( + abs(float(np.mean(post)) - tau) < 0.3 + ), f"mean post ATT(g,t)={np.mean(post):.4f} vs true tau={tau}" + + +class TestEquations411412GMM: + """Eqs. (4.11)-(4.12): optimal GMM combination across comparison cohorts. + + Verified components: the optimal weights sum to one (Eq. 4.12 normalization + ``w = Omega^{-1} 1 / (1' Omega^{-1} 1)``); a single comparison group reduces + to ``w=[1]``; multiple not-yet-treated cohorts yield a multi-weight + combination. (Variance-minimization optimality is by construction via the + ``Omega^{-1}`` weights; ``Omega`` is not exposed for a separate inequality + test, so it is not asserted here.) + """ + + def test_weights_sum_to_one_multi_group(self): + data = generate_staggered_ddd_data( + n_units=600, cohort_periods=[3, 5, 7], treatment_effect=3.0, seed=31 + ) + res = StaggeredTripleDifference(estimation_method="dr", control_group="notyettreated").fit( + data, *_FIT_COLS + ) + assert res.gmm_weights is not None + multi = {k: w for k, w in res.gmm_weights.items() if len(w) > 1} + assert multi, "expected at least one (g,t) with multiple comparison cohorts" + for k, w in multi.items(): + assert ( + abs(sum(w.values()) - 1.0) < 1e-9 + ), f"GMM weights at {k} sum to {sum(w.values())}, expected 1.0" + + def test_single_group_reduces_to_unit_weight(self): + data = generate_staggered_ddd_data(n_units=400, treatment_effect=3.0, seed=32) + res = StaggeredTripleDifference(estimation_method="dr", control_group="nevertreated").fit( + data, *_FIT_COLS + ) + assert res.gmm_weights is not None + for k, w in res.gmm_weights.items(): + assert len(w) == 1, f"nevertreated should use one comparison group at {k}, got {w}" + assert abs(next(iter(w.values())) - 1.0) < 1e-12 + + +class TestEquation413EventStudy: + """Eq. (4.13): event-study ES(e) is the eligible-treated-cohort-share-weighted + average of ATT(g, g+e) across cohorts at each relative time e. + """ + + def test_es_is_cohort_share_weighted_average(self): + data = generate_staggered_ddd_data( + n_units=600, + cohort_periods=[3, 5], + treatment_effect=3.0, + dynamic_effects=True, + effect_growth=0.4, + seed=41, + ) + res = StaggeredTripleDifference(estimation_method="dr", control_group="nevertreated").fit( + data, *_FIT_COLS, aggregate="event_study" + ) + assert res.event_study_effects is not None + gt = res.group_time_effects + checked_multi = False + for e, es in res.event_study_effects.items(): + contrib = [ + (g, t) for (g, t) in gt if (t - g) == e and np.isfinite(gt[(g, t)]["effect"]) + ] + if len(contrib) < 2: + continue + checked_multi = True + w = np.array([gt[(g, t)]["n_treated"] for (g, t) in contrib], dtype=float) + eff = np.array([gt[(g, t)]["effect"] for (g, t) in contrib]) + expected = float(np.sum(w / w.sum() * eff)) + assert ( + abs(es["effect"] - expected) < 1e-9 + ), f"ES(e={e})={es['effect']:.6f} != cohort-share-weighted avg {expected:.6f}" + assert checked_multi, "DGP did not produce a multi-cohort relative time to check" + + +class TestEquation414Overall: + """Eq. (4.14) / Corollary 4.2: ``overall_att_es`` is the UNWEIGHTED mean of the + post-treatment event-study effects ES(e); opt-in via ``aggregate='event_study'``; + ``None`` otherwise; the default ``overall_att`` is undisturbed. + """ + + def _fit(self, aggregate, anticipation=0, seed=51): + data = generate_staggered_ddd_data( + n_units=600, + cohort_periods=[3, 5], + treatment_effect=3.0, + dynamic_effects=True, + effect_growth=0.3, + seed=seed, + ) + est = StaggeredTripleDifference( + estimation_method="dr", control_group="nevertreated", anticipation=anticipation + ) + return est.fit(data, *_FIT_COLS, aggregate=aggregate) + + def test_overall_es_is_unweighted_mean_of_post_es(self): + res = self._fit("event_study") + assert res.overall_att_es is not None + assert res.event_study_effects is not None + post = [ + es["effect"] + for e, es in res.event_study_effects.items() + if e >= 0 and np.isfinite(es["effect"]) + ] + assert abs(res.overall_att_es - float(np.mean(post))) < 1e-10 + + def test_overall_es_none_without_event_study(self): + assert self._fit("simple").overall_att_es is None + assert self._fit(None).overall_att_es is None + + def test_overall_es_respects_anticipation_window(self): + # anticipation=1 => post-treatment is e >= -1 (matches the default overall_att set) + res = self._fit("event_study", anticipation=1) + assert res.overall_att_es is not None + assert res.event_study_effects is not None + post = [ + es["effect"] + for e, es in res.event_study_effects.items() + if e >= -1 and np.isfinite(es["effect"]) + ] + assert abs(res.overall_att_es - float(np.mean(post))) < 1e-10 + + def test_overall_es_does_not_disturb_default_overall(self): + res_es = self._fit("event_study") + res_simple = self._fit("simple") + # Default overall_att stays the CS-simple aggregation regardless of ES opt-in + assert abs(res_es.overall_att - res_simple.overall_att) < 1e-9 + assert res_es.overall_se_es is not None + assert np.isfinite(res_es.overall_se_es) and res_es.overall_se_es > 0 + + +class TestEquation414OverallBootstrap: + """Cross-surface: the Eq. 4.14 overall is finite under the multiplier bootstrap + (its SE is overridden by the per-draw mean of post-treatment ES(e) draws). + """ + + def test_bootstrap_overall_es_finite(self): + data = generate_staggered_ddd_data( + n_units=500, cohort_periods=[3, 5], treatment_effect=3.0, seed=61 + ) + res = StaggeredTripleDifference( + estimation_method="dr", + control_group="nevertreated", + n_bootstrap=199, + seed=7, + ).fit(data, *_FIT_COLS, aggregate="event_study") + assert res.overall_att_es is not None + assert np.isfinite(res.overall_att_es) + assert res.overall_se_es is not None + assert np.isfinite(res.overall_se_es) and res.overall_se_es > 0 + + def test_balance_e_empties_event_study_fails_soft(self): + """A balance_e that removes every event-study cohort must fail SOFT (NaN + + warning), not raise, on BOTH the analytical and multiplier-bootstrap paths. + + Regression for the empty-rel_periods ``np.column_stack`` crash in the shared + bootstrap mixin: ``balance_e=100`` (no cohort reaches relative time 100) + empties the event study, so the Eq. 4.14 overall is "requested but + undefined" -> NaN. The default ``overall_att`` (CS-simple, balance_e- + independent) stays finite. + """ + data = generate_staggered_ddd_data( + n_units=400, cohort_periods=[3, 5], treatment_effect=3.0, seed=71 + ) + for nb in (0, 10): + est = StaggeredTripleDifference( + estimation_method="dr", control_group="nevertreated", n_bootstrap=nb, seed=3 + ) + with pytest.warns(UserWarning): + res = est.fit(data, *_FIT_COLS, aggregate="event_study", balance_e=100) + # requested-but-undefined -> NaN (not a stale value, not a crash) + assert res.overall_att_es is not None and np.isnan(res.overall_att_es) + assert res.overall_se_es is None or np.isnan(res.overall_se_es) + # default overall_att is balance_e-independent and stays finite + assert np.isfinite(res.overall_att) + + +class TestAggregationReturnContract: + """`CallawaySantAnnaAggregationMixin._compute_aggregated_se_with_wif` must keep a + consistent return arity on its empty/non-finite-IF branches: `return_psi=True` -> + 3-tuple `(se, psi, effective_df)`, `return_psi=False` -> 2-tuple `(se, effective_df)`. + + Regression for the refactor where the early-return branches still yielded a + 2-tuple / bare float while `_aggregate_event_study` unpacks three values -- a + degenerate (empty / non-finite) combined IF would then raise a tuple-unpack + ValueError instead of failing soft (NaN SE), which is exactly the edge case the + Eq. 4.14 overall's NaN-SE contract relies on. + """ + + def test_degenerate_if_branch_returns_consistent_arity(self): + est = StaggeredTripleDifference() + est.anticipation = 0 # mixin reads self.anticipation + empty = np.array([]) + df0 = pd.DataFrame({"unit": []}) # empty design -> empty combined IF + r3 = est._compute_aggregated_se_with_wif( + [], empty, empty, empty, {}, df0, "unit", return_psi=True + ) + assert ( + isinstance(r3, tuple) and len(r3) == 3 + ), f"return_psi=True must be 3-tuple, got {r3!r}" + r2 = est._compute_aggregated_se_with_wif( + [], empty, empty, empty, {}, df0, "unit", return_psi=False + ) + assert ( + isinstance(r2, tuple) and len(r2) == 2 + ), f"return_psi=False must be 2-tuple, got {r2!r}" diff --git a/tests/test_staggered_triple_diff.py b/tests/test_staggered_triple_diff.py index 197c5803..bcd772aa 100644 --- a/tests/test_staggered_triple_diff.py +++ b/tests/test_staggered_triple_diff.py @@ -143,6 +143,74 @@ def test_summary_runs(self, simple_data): assert "Staggered Triple Difference" in summary assert "ATT" in summary + def test_overall_att_es_result_surface(self, simple_data): + """Opt-in Eq. 4.14 overall (overall_att_es) is exposed on the result surface + (attribute, summary(), to_dict()) under aggregate='event_study', and is None + for the default (no event-study) fit.""" + res = StaggeredTripleDifference().fit( + simple_data, + "outcome", + "unit", + "period", + "first_treat", + "eligibility", + aggregate="event_study", + ) + es_fields = ( + "overall_att_es", + "overall_se_es", + "overall_t_stat_es", + "overall_p_value_es", + "overall_conf_int_es", + ) + assert res.overall_att_es is not None and np.isfinite(res.overall_att_es) + for f in es_fields: + assert getattr(res, f) is not None + assert "Eq. 4.14" in res.summary() + d = res.to_dict() + for f in es_fields: + assert f in d + # Default fit (no event-study aggregation): fields are None and absent from to_dict. + res_default = StaggeredTripleDifference().fit( + simple_data, "outcome", "unit", "period", "first_treat", "eligibility" + ) + assert res_default.overall_att_es is None + assert "overall_att_es" not in res_default.to_dict() + + def test_overall_att_es_aggregate_all_matches_event_study(self, simple_data): + """Eq. 4.14 overall (overall_att_es) is populated under aggregate='all' (not only + 'event_study') and matches the 'event_study' path bit-for-bit on the same data. + + Regression guard for the documented public contract that BOTH surfaces expose + overall_att_es: aggregate='all' additionally computes the per-cohort group + aggregation, and must neither drop nor perturb the event-study-average overall. + """ + res_all = StaggeredTripleDifference().fit( + simple_data, "outcome", "unit", "period", "first_treat", "eligibility", aggregate="all" + ) + res_es = StaggeredTripleDifference().fit( + simple_data, + "outcome", + "unit", + "period", + "first_treat", + "eligibility", + aggregate="event_study", + ) + # Populated (not None / finite) on the 'all' surface. + assert res_all.overall_att_es is not None and np.isfinite(res_all.overall_att_es) + assert res_all.overall_se_es is not None and np.isfinite(res_all.overall_se_es) + # Identical point estimate AND inference vs the event_study path: same data, same + # analytical influence-function SE (no bootstrap), so bit-for-bit close. + assert res_all.overall_att_es == pytest.approx(res_es.overall_att_es, abs=1e-10) + assert res_all.overall_se_es == pytest.approx(res_es.overall_se_es, abs=1e-10) + assert res_all.overall_t_stat_es == pytest.approx(res_es.overall_t_stat_es, abs=1e-10) + assert res_all.overall_p_value_es == pytest.approx(res_es.overall_p_value_es, abs=1e-10) + ci_all, ci_es = res_all.overall_conf_int_es, res_es.overall_conf_int_es + assert ci_all is not None and ci_es is not None + assert ci_all[0] == pytest.approx(ci_es[0], abs=1e-10) + assert ci_all[1] == pytest.approx(ci_es[1], abs=1e-10) + def test_to_dataframe_group_time(self, simple_data): est = StaggeredTripleDifference() res = est.fit(simple_data, "outcome", "unit", "period", "first_treat", "eligibility") @@ -411,9 +479,7 @@ def test_inf_first_treat_works(self): UserWarning, match=rf"{n_inf_rows} row\(s\) have first_treat=inf; recoding to 0", ): - res = est.fit( - data, "outcome", "unit", "period", "first_treat", "eligibility" - ) + res = est.fit(data, "outcome", "unit", "period", "first_treat", "eligibility") assert np.isfinite(res.overall_att) def test_survey_design_invalid_type_raises(self, simple_data): @@ -574,12 +640,10 @@ def test_collinear_covariates_emit_lstsq_fallback_warning(self): covariates=["x1", "x2", "x3"], ) or_warnings = [ - w for w in caught - if "outcome-regression influence-function step" in str(w.message) + w for w in caught if "outcome-regression influence-function step" in str(w.message) ] assert len(or_warnings) == 1, ( - f"Expected exactly one aggregate OR lstsq-fallback warning, " - f"got {len(or_warnings)}." + f"Expected exactly one aggregate OR lstsq-fallback warning, " f"got {len(or_warnings)}." ) msg = str(or_warnings[0].message) assert "(g, g_c, t) pair(s)" in msg @@ -616,10 +680,7 @@ def test_collinear_covariates_emit_ps_hessian_warning(self): "eligibility", covariates=["x1", "x2", "x3"], ) - ps_warnings = [ - w for w in caught - if "propensity-score Hessian" in str(w.message) - ] + ps_warnings = [w for w in caught if "propensity-score Hessian" in str(w.message)] assert len(ps_warnings) == 1, ( f"Expected exactly one aggregate PS-Hessian lstsq-fallback " f"warning under IPW, got {len(ps_warnings)}: " @@ -654,9 +715,7 @@ def test_well_conditioned_covariates_emit_no_lstsq_warning(self): "eligibility", covariates=["x1", "x2"], ) - lstsq_warnings = [ - w for w in caught if "Rank-deficient X'WX" in str(w.message) - ] + lstsq_warnings = [w for w in caught if "Rank-deficient X'WX" in str(w.message)] assert lstsq_warnings == [], ( f"Unexpected lstsq-fallback warning on clean covariates: " f"{[str(w.message) for w in lstsq_warnings]}" @@ -670,10 +729,6 @@ def test_no_covariates_no_warning(self, simple_data): with _w.catch_warnings(record=True) as caught: _w.simplefilter("always") - est.fit( - simple_data, "outcome", "unit", "period", "first_treat", "eligibility" - ) - lstsq_warnings = [ - w for w in caught if "Rank-deficient X'WX" in str(w.message) - ] + est.fit(simple_data, "outcome", "unit", "period", "first_treat", "eligibility") + lstsq_warnings = [w for w in caught if "Rank-deficient X'WX" in str(w.message)] assert lstsq_warnings == [] diff --git a/tests/test_survey_staggered_ddd.py b/tests/test_survey_staggered_ddd.py index 264c602d..dfccde40 100644 --- a/tests/test_survey_staggered_ddd.py +++ b/tests/test_survey_staggered_ddd.py @@ -5,6 +5,8 @@ full design (strata/PSU/FPC), replicate weights, and bootstrap + survey. """ +import warnings + import numpy as np import pandas as pd import pytest @@ -843,9 +845,7 @@ def test_fallback_uses_weighted_mean(self, sddd_data): # logit failure in at least some subgroups data["collinear_x"] = data["eligibility"].astype(float) * 100.0 sd = SurveyDesign(weights="weight") - est = StaggeredTripleDifference( - estimation_method="ipw", pscore_fallback="unconditional" - ) + est = StaggeredTripleDifference(estimation_method="ipw", pscore_fallback="unconditional") import warnings with warnings.catch_warnings(): @@ -900,3 +900,166 @@ def test_zero_mass_subgroup_warns(self): assert len(mass_warnings) > 0, "Expected zero-mass subgroup warning" # Result should still be finite (other cohorts contribute) assert np.isfinite(res.overall_att) + + +# --------------------------------------------------------------------------- +# Paper Eq. 4.14 overall (event-study average) under survey weighting +# --------------------------------------------------------------------------- + + +class TestSurveyOverallAttEs: + """The opt-in ``overall_att_es`` (paper Eq. 4.14) honours survey weights: its + analytical SE routes through the same survey-aware variance estimator as the + per-event-time effects, so uniform weights reproduce the unweighted result, + nontrivial weights move the SE, and a full strata/PSU design stays finite. + """ + + def test_uniform_weight_equivalence(self): + data = _make_staggered_ddd_data(n_units=200, seed=42) + data["weight"] = 1.0 + sd = SurveyDesign(weights="weight") + res_w = StaggeredTripleDifference(estimation_method="reg").fit( + data, + "outcome", + "unit", + "period", + "first_treat", + "eligibility", + aggregate="event_study", + survey_design=sd, + ) + res_u = StaggeredTripleDifference(estimation_method="reg").fit( + data, + "outcome", + "unit", + "period", + "first_treat", + "eligibility", + aggregate="event_study", + ) + assert res_w.overall_att_es is not None and res_u.overall_att_es is not None + assert np.isclose(res_w.overall_att_es, res_u.overall_att_es, atol=1e-8) + assert res_w.overall_se_es is not None and res_u.overall_se_es is not None + assert np.isclose(res_w.overall_se_es, res_u.overall_se_es, atol=1e-8) + + def test_nontrivial_weights_change_se(self): + data = _make_staggered_ddd_data(n_units=200, seed=42) # variable weights + sd = SurveyDesign(weights="weight") + res_w = StaggeredTripleDifference(estimation_method="reg").fit( + data, + "outcome", + "unit", + "period", + "first_treat", + "eligibility", + aggregate="event_study", + survey_design=sd, + ) + res_u = StaggeredTripleDifference(estimation_method="reg").fit( + data, + "outcome", + "unit", + "period", + "first_treat", + "eligibility", + aggregate="event_study", + ) + assert res_w.overall_se_es is not None and res_u.overall_se_es is not None + # Variable survey weights change the Eq-4.14 overall SE + assert not np.isclose(res_w.overall_se_es, res_u.overall_se_es, atol=1e-6) + + def test_full_design_overall_es_finite(self): + data = _make_staggered_ddd_data(n_units=200, seed=42) + sd = SurveyDesign(weights="weight", strata="stratum", psu="psu") + res = StaggeredTripleDifference(estimation_method="reg").fit( + data, + "outcome", + "unit", + "period", + "first_treat", + "eligibility", + aggregate="event_study", + survey_design=sd, + ) + assert res.overall_att_es is not None and np.isfinite(res.overall_att_es) + assert res.overall_se_es is not None and np.isfinite(res.overall_se_es) + assert res.overall_se_es > 0 + + def test_replicate_weight_overall_es_finite(self): + """The Eq. 4.14 overall (overall_att_es) is computed and finite under a + replicate-weight (BRR) survey design, routing its SE through the replicate + variance path. Regression for the P1 fix that overall_att_es uses its OWN + per-statistic effective df (compute_replicate_if_variance), not the simple + overall's mutated df. + """ + data = _make_staggered_ddd_data(n_units=200, seed=42) + rng = np.random.default_rng(99) + unit_ids = sorted(data["unit"].unique()) + base_w = data.groupby("unit")["weight"].first().reindex(unit_ids).values + n_rep = 20 + for r in range(n_rep): + factor = np.abs(1.0 + rng.standard_normal(len(unit_ids)) * 0.1) + data[f"rep_{r}"] = data["unit"].map(dict(zip(unit_ids, base_w * factor))) + sd = SurveyDesign( + weights="weight", + replicate_weights=[f"rep_{r}" for r in range(n_rep)], + replicate_method="BRR", + ) + res = StaggeredTripleDifference(estimation_method="reg").fit( + data, + "outcome", + "unit", + "period", + "first_treat", + "eligibility", + aggregate="event_study", + survey_design=sd, + ) + assert res.overall_att_es is not None and np.isfinite(res.overall_att_es) + assert res.overall_se_es is not None and np.isfinite(res.overall_se_es) + assert res.overall_se_es > 0 + # default overall_att (simple) also computed under the replicate design + assert np.isfinite(res.overall_att) and np.isfinite(res.overall_se) + + def test_single_psu_bootstrap_cluster_unidentified(self): + """Under a single-PSU survey design with n_bootstrap>0, the Eq. 4.14 overall + keeps its (analytical) point estimate but its SE is unidentified -> NaN, and + the inference fields are NaN-consistent. + + Regression for the P2 warning-misdiagnosis fix: when overall_se_es is NaN the + terminal warning must not attribute it *solely* to a non-finite influence + function -- a single-PSU/cluster design is the actual cause here, and the + authoritative "variance unidentified" warning is raised by the bootstrap. The + Eq. 4.14 warning, when emitted, must name the unidentified-variance cause. + """ + data = _make_staggered_ddd_data(n_units=120, seed=42).copy() + data["psu"] = 0 # collapse to a single PSU -> variance/cluster unidentified + data["fpc_col"] = 10**6 # design-based variance so the PSU is retained + sd = SurveyDesign(weights="weight", psu="psu", fpc="fpc_col") + with warnings.catch_warnings(record=True) as caught: + warnings.simplefilter("always") + res = StaggeredTripleDifference(estimation_method="reg", n_bootstrap=50, seed=7).fit( + data, + "outcome", + "unit", + "period", + "first_treat", + "eligibility", + aggregate="event_study", + survey_design=sd, + ) + # Point estimate survives; SE + inference are NaN-consistent (never a stale value). + assert res.overall_att_es is not None and np.isfinite(res.overall_att_es) + assert res.overall_se_es is not None and np.isnan(res.overall_se_es) + assert res.overall_t_stat_es is not None and np.isnan(res.overall_t_stat_es) + assert res.overall_p_value_es is not None and np.isnan(res.overall_p_value_es) + assert res.overall_conf_int_es is not None + assert np.isnan(res.overall_conf_int_es[0]) and np.isnan(res.overall_conf_int_es[1]) + msgs = [str(w.message) for w in caught] + # The bootstrap raises the authoritative "variance unidentified" warning. + assert any("unidentified" in m for m in msgs) + # The Eq. 4.14 overall warning is cause-accurate (would fail on the old wording + # that attributed the NaN solely to a non-finite influence function). + es_warns = [m for m in msgs if "overall_att_es)" in m] + assert es_warns, "expected the Eq. 4.14 overall warning under single-PSU bootstrap" + assert all("unidentified" in m for m in es_warns)