diff --git a/CHANGELOG.md b/CHANGELOG.md index dfb4688c..7763c178 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,6 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - **`CallawaySantAnna.cluster=` silent no-op (Phase 1b interstitial).** `CallawaySantAnna(cluster="state").fit(...)` previously accepted the argument, stored it, returned it from `get_params()`, but never consumed it anywhere in the fit / aggregator / bootstrap pipeline (`staggered.py:154-156` docstring claimed "Defaults to unit-level clustering" — but for bare `cluster=X`, the aggregator at `staggered_aggregation.py:193-213` computed per-unit IF variance regardless, and the bootstrap at `staggered_bootstrap.py:323-347` drew per-unit multiplier weights regardless). Users who explicitly set `cluster="state"` got per-unit inference with no warning — typically SE too small under intra-cluster correlation. **Survey-PSU clustering via `survey_design=SurveyDesign(psu="state")` was NOT affected** and continued to cluster correctly via `_compute_stratified_psu_meat`. The fix synthesizes a minimal `SurveyDesign(psu=self.cluster, weight_type="pweight")` when bare `cluster=` is set without an explicit survey design, threading the synthesized PSU through the existing survey-PSU machinery (aggregator + bootstrap). A new dedicated `df_inference` field on `CallawaySantAnnaResults` carries the cluster-level df for the bare-cluster-synthesize path ONLY (where `survey_metadata` is intentionally `None` to preserve the `DiagnosticReport.survey_metadata is not None` skip at `diagnostic_report.py:848-856` + `:1150-1158` for "Original fit used a survey design" reasoning, and the `summary()` survey block render at `staggered_results.py:235-238`). `HonestDiD` at `honest_did.py` prefers `survey_metadata.df_survey` first (the actual CS-internal df, which may be tightened post-resolve for replicate designs) and falls back to `df_inference` for bare-cluster fits — so downstream consumers always see the cluster df without overriding the post-recompute survey df. When `survey_design=SurveyDesign(weights=Y)` without PSU is provided AND `cluster=X` is also set, `_inject_cluster_as_psu` injects the bare cluster as the effective PSU AND an `effective_survey_design = replace(survey_design, psu=self.cluster)` is constructed so the downstream `_validate_unit_constant_survey` catches movers (units crossing clusters across periods) on panel data via the now-PSU-bearing design; `survey_metadata` is recomputed to reflect the injected PSU. When both `cluster=X` AND `survey_design.psu=Y` are set, the explicit PSU wins via `_resolve_effective_cluster` (emits `UserWarning` if partitions differ). **`cluster= + SurveyDesign(replicate_weights=[...])` raises `NotImplementedError`**: replicate-weight variance is computed by replicate reweighting (BRR / Fay / JK1 / JKn / SDR) and ignores PSU/cluster entirely (`survey.py:104-109` enforces replicate_weights are mutually exclusive with strata/psu/fpc); honoring bare `cluster=` would silently have no effect while populating `cluster_name`/`n_clusters` on Results dishonestly. Assertive regression tests pin the fix on both panel and repeated-cross-section paths plus the survey/non-survey contract boundaries: `test_cluster_robust_ses_differ_from_unit_level`, `test_bare_cluster_works_with_panel_false_rcs`, `test_bare_cluster_synthesizes_survey_design`, `test_inject_branch_panel_mover_raises`, `test_replicate_weight_plus_cluster_rejected`, `test_bare_cluster_populates_df_inference` (asserts the dedicated cluster-df carrier is set), `test_bare_cluster_does_not_set_survey_metadata` (asserts the survey/non-survey contract is preserved — DiagnosticReport / summary() must not treat a bare-cluster fit as survey-backed), `test_explicit_survey_design_does_populate_survey_metadata` (asserts the inject-branch path still populates survey_metadata for legitimate user-provided SurveyDesign), and `test_bare_cluster_honest_did_uses_df_inference` (end-to-end: HonestDiD threads df_inference into HonestDiDResults.df_survey, preventing silent normal-theory regression on a future refactor). When `cluster=None` (default), behavior is bit-equal to pre-PR (wiring guarded by `if self.cluster is not None:`). Audit verified the no-op was CS-specific — the other 7 Phase 1b estimators (SunAbraham, StackedDiD, WooldridgeDiD, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD) handle bare `cluster=` correctly. ### Added +- **TROP methodology-review-tracker promotion: In Progress → Complete.** Closes the Athey, Imbens, Qu & Viviano (2025) *Triply Robust Panel Estimators* (arXiv:2508.21536) primary-source review on the methodology-review tracker. PR-A (the paper review on file at `docs/methodology/papers/athey-2025-review.md`) was previously merged as #443; this PR is the F.L.I.P. consolidation — new `tests/test_methodology_trop.py` with paper-equation-numbered Verified Components walk-through (10 classes, 36 tests covering Eq. 2 soft-threshold SVD prox / plain prox-gradient monotonicity on a toy setup / weighted-prox solver (the shipped accelerated FISTA outer loop is NOT directly tested for per-step monotonicity because Nesterov momentum does not guarantee it), Eq. 3 unit + time weights, Eqs. 4-5 + Algorithm 1 LOOCV with two-stage cycling, Corollary 1 three-condition unbiasedness, Theorem 5.1 MC-ranking realisation of the triply-robust bias bound, Section 2.2 DID + MC reductions, Eq. 13 + Algorithm 2 per-(i, t) estimation, Algorithm 3 stratified pairs bootstrap, Section 3 / Eq. 6 factor-DGP recovery, plus a `TestTROPDeviations` class locking 11 documented library deviations). Migrated from `tests/test_trop.py`: `TestMethodologyVerification` (5 tests → `TestTROPEquation6FactorDGPRecovery`), four paper-conformance tests + one weighted-solver convergence test from `TestPaperConformanceFixes` (→ `TestTROPEquation3Weights` / `TestTROPAlgorithm1LOOCV` / `TestTROPNuclearNormProx` / `TestTROPAlgorithm3Bootstrap`), three prox / plain prox-gradient monotonicity / weighted-objective tests from `TestTROPNuclearNormSolver` (→ `TestTROPNuclearNormProx`), plus a cycling-convergence test from `TestCyclingSearch` and the factor-DGP smoke from `TestTROPvsSDID`; the `TestPaperConformanceFixes` and `TestTROPvsSDID` shells are deleted. `TestTROPNuclearNormSolver` retains its single defensive `test_zero_weights_no_division_error`. `METHODOLOGY_REVIEW.md` TROP row promoted with merge date 2026-05-24, full Verified Components / Test Coverage / Deviations / Outstanding Concerns / R Parity structure mirroring HAD (PR #473) / ContinuousDiD (PR #476) / DCDH (PR #481) / WooldridgeDiD (PR #486) precedents. **Documented deviations:** Gap #5 (unnormalised weights match Eq. 2, not Section 5 sum-to-one), Gap #9 (unbalanced panels supported beyond paper's balanced-panel assumption), rank selection is implicit via nuclear-norm soft-thresholding with no discrete `rank_selection` constructor parameter (matches paper Section 5.3 + Appendix; corrects an earlier REGISTRY overclaim that listed cv / ic / elbow methods), `λ_nn = ∞` → 1e10 internal sentinel with original-value storage on results. **Outstanding Concerns (deferred):** Equation 14 covariate extension (`TROP.fit()` does not accept a `covariates` kwarg; non-support locked by `TestTROPDeviations::test_covariates_not_supported` via `inspect.signature` to guard against future `**kwargs`) and Theorem 8.1 (covariate triple robustness) deferred until use cases motivate; SC / SDID reductions paper-claimed under "specific (omega, theta) weight choices" not provided in the paper text — cross-language anchor deferred. **R parity:** deferred until paper-author reference implementation is released ("forthcoming" per the paper). REGISTRY.md `## TROP` section gains a "Verified Components" expansion: 10 ticked requirements + four `**Note:**` / `**Note (library-side choice):**` / `**Note (deferral):**` annotations consolidating the deviation surface (Eq. 10 balancing-decomposition pointer, Gap #5 weight-normalisation library-side choice, Eq. 14 covariate deferral). No source-code changes to `diff_diff/trop*.py`. Methodology sign-off scope: paper-aligned identification ingredients (Eq. 2 prox, Eq. 3 weights, Eqs. 4-5 LOOCV, Algorithms 1-3, Corollary 1 unbiasedness, Eq. 6 simulation recovery, DID reduction, documented deviations) are directly locked by the new tests. Theorem 5.1 is verified as a simulation sanity check (TROP RMSE < DID RMSE under LOOCV-tuned weights), NOT as a direct fixed-weight conditional-bias-bound lock; the Matrix Completion reduction is verified as code-path activation (effective_rank > 0 + beats DID baseline), NOT as equivalence against an independent MC reference. The Eq. 14 covariate extension is documented as deferred (TROP.fit() has no `covariates` kwarg). - **TripleDifference `vcov_type` input contract (Phase 1b interstitial #2, permanently narrow).** `TripleDifference(vcov_type=...)` now accepts `{"hc1"}` only (default). The analytical-sandwich families `{classical, hc2, hc2_bm}` and `conley` spatial-HAC are REJECTED at `__init__` with methodology-rooted messages mirroring the CS interstitial. The rejection is **library-architectural, not paper-prescribed**: TripleDifference uses influence-function-based variance per Ortiz-Villavicencio & Sant'Anna (2025) arXiv:2505.09942 — the 3-pairwise-DiD decomposition `inf = w3·IF_3 + w2·IF_2 - w1·IF_1` has no single design matrix to compute hat-matrix leverage `1/(1-h_ii)` or Bell-McCaffrey Satterthwaite DOF on. The narrow contract is permanent and applies to the remaining IF-based estimators (`ImputationDiD`, `EfficientDiD`) when their `vcov_type` threading PRs land. `hc1` with `cluster=None` ≡ per-unit IF variance (`std(inf)/sqrt(n)`); `hc1` with `cluster=X` ≡ CR1 Liang-Zeger on the combined IF (`(G/(G-1)) · Σ_c (Σ_{i∈c} ψ_i)² / n²`, plain CR1 — no Stata-style `(n-1)/(n-p)` finite-sample factor because the IF has no design-matrix `p` in the OLS sense); `hc1` with `survey_design=` ≡ TSL on the combined IF (analytical or replicate). All three paths are unchanged at machine precision (default behavior bit-equal across all 3 estimation methods `{dr, reg, ipw}`). `vcov_type` and `cluster_name` fields added to `TripleDifferenceResults`, threaded through `to_dict()`. `summary()` routes the variance-family label through the shared `_format_vcov_label` (`results.py:49-89`): bare fits render `"HC1 heteroskedasticity-robust"`, clustered fits render `"CR1 cluster-robust at , G="` (since the actual algebra is Liang-Zeger CR1 on the combined IF), and survey-backed fits suppress the variance-estimator line entirely (the Survey Design block already names design + n_psu + df, and the analytical SE is TSL on the combined IF — a raw "hc1" label would misstate the inference path). **`cluster= + SurveyDesign(replicate_weights=[...])` raises `NotImplementedError`** at `fit()`: replicate-weight variance is computed by replicate reweighting (BRR / Fay / JK1 / JKn / SDR) and ignores PSU/cluster entirely; honoring bare `cluster=` would silently have no effect on the variance estimate while populating `cluster_name`/`n_clusters` on Results dishonestly. Mirrors the `CallawaySantAnna` guard from PR #487. Under `survey_design.psu` (non-replicate path) `cluster_name`/`n_clusters` on Results are suppressed (set to None) so they can't misreport the raw cluster argument when the resolver picks the survey PSU instead. `set_params(vcov_type=...)` mirrors CS pattern (mutate-then-validate-at-use, no atomic validation); `fit()` re-validates `vcov_type` at use time so a `set_params(vcov_type="hc4")` mutation surfaces a clear error at fit-time rather than silently propagating to Results metadata. **Interstitial PR #2** (after CS PR #487) rather than full Phase 1b PR 4/8 vcov_type threading — the narrow surface is methodologically dictated by TripleDifference's IF-based variance, not a deferral. New `TestTripleDifferenceVcovType` class in `tests/test_triple_diff.py` covers the 5-surface contract (default/cluster/survey bit-equal, `__init__` rejection per family, `fit()`-time revalidation) plus 8 introspection / convenience-function tests. REGISTRY.md "IF-based variance estimators vs analytical-sandwich estimators" cross-reference section updated to list `TripleDifference` alongside `CallawaySantAnna` in the "Enforced today" tier. Phase 1b PR 4/8 (full `{classical, hc1, hc2, hc2_bm}` threading) resumes on a different estimator (TwoStageDiD) post-merge; the two remaining IF-based estimators (`ImputationDiD`, `EfficientDiD`) follow the same narrow-contract template. - **CallawaySantAnna `vcov_type` input contract (Phase 1b interstitial, permanently narrow).** `CallawaySantAnna(vcov_type=...)` now accepts `{"hc1"}` only (default). The analytical-sandwich families `{classical, hc2, hc2_bm}` and `conley` spatial-HAC are REJECTED at `__init__` with methodology-rooted messages. The rejection is **library-architectural, not paper-prescribed**: CS uses influence-function-based variance per Callaway & Sant'Anna (2021) — per-(g,t) doubly-robust / IPW / outcome-regression structure — and has no single design matrix to compute hat-matrix leverage `1/(1-h_ii)` or Bell-McCaffrey Satterthwaite DOF on. The narrow contract is permanent and applies to other IF-based estimators (ImputationDiD, EfficientDiD) when their `vcov_type` threading PRs land. `hc1` with `cluster=None` ≡ per-unit IF variance (Williams 2000 form); `hc1` with `cluster=X` ≡ CR1 Liang-Zeger on the IF activated via the cluster= wiring fix above. Documentation in `docs/methodology/REGISTRY.md` "IF-based variance estimators vs analytical-sandwich estimators" subsection. `vcov_type`, `cluster_name`, `n_clusters`, `df_inference` added to `CallawaySantAnnaResults` (the canonical PSU column wins for `cluster_name` reporting — `survey_design.psu` when explicit PSU is provided, `self.cluster` when bare cluster synthesizes/injects). `set_params(vcov_type=...)` mirrors SA pattern (mutate-then-refresh `_vcov_type_explicit`, no atomic validation); `fit()` re-validates `vcov_type` at use time so a `set_params(vcov_type="hc4")` mutation surfaces a clear error at fit-time rather than silently propagating to Results metadata. **Interstitial PR** rather than full Phase 1b PR 4/8 vcov_type threading — the narrow surface is methodologically dictated by CS's IF-based variance, not a deferral. Phase 1b PR 4/8 (full {classical, hc1, hc2, hc2_bm} threading) resumes on a different estimator post-merge. - **TripleDifference cluster-changes-SE defensive regression test.** Added `tests/test_triple_diff.py::TestTripleDifferenceClusterDefensive::test_cluster_changes_ses` asserting that `TripleDifference(cluster="state")` produces SE differing from `cluster=None` SE by `>1e-6` on a fixed-seed panel with state-level random effects. Defensive coverage closes a test gap identified during the Phase 1b cluster-wiring audit; TripleDifference's bare-cluster code path (`triple_diff.py:1245-1259`) was already correct but lacked a positive regression test. Mirrors `tests/test_two_stage.py::test_cluster_changes_ses`. diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md index 6d3a08e4..9f6e99ad 100644 --- a/METHODOLOGY_REVIEW.md +++ b/METHODOLOGY_REVIEW.md @@ -24,7 +24,7 @@ A **Complete** entry has a documented review pass against the primary academic s The catalog grew incrementally over several quarters, so formats vary across the existing Complete entries; the consistent invariant is that someone walked through the implementation against the academic source and captured the result here. New reviews going forward should aim for the fuller structure (Verified Components + Corrections Made + Deviations + dedicated methodology test file) used by the more recent entries. -**In Progress** entries have a REGISTRY.md section and unit-test coverage, but no formal walk-through has been captured here yet. The In Progress band is wide — some entries also have some combination of a paper review (primary or companion), a dedicated methodology test file, and R parity fixtures (e.g., TROP has a recent paper review but no methodology test file or cross-language anchor yet); others have only the REGISTRY entry and unit tests (e.g., PowerAnalysis). The "Documentation in place" sub-section enumerates what each entry already has; the "Outstanding for promotion" sub-section enumerates what's still needed to flip it to Complete. +**In Progress** entries have a REGISTRY.md section and unit-test coverage, but no formal walk-through has been captured here yet. The In Progress band is wide — some entries also have some combination of a paper review (primary or companion), a dedicated methodology test file, and R parity fixtures; others have only the REGISTRY entry and unit tests (e.g., PowerAnalysis). The "Documentation in place" sub-section enumerates what each entry already has; the "Outstanding for promotion" sub-section enumerates what's still needed to flip it to Complete. **Not Started** entries have neither a tracker walk-through nor an REGISTRY.md section. This tracker no longer carries any Not Started rows; new estimators are expected to enter as In Progress when their REGISTRY entry lands. @@ -59,7 +59,7 @@ The catalog grew incrementally over several quarters, so formats vary across the | ContinuousDiD | `continuous_did.py` | `contdid` v0.1.0 | **Complete** | 2026-05-20 | | ChaisemartinDHaultfoeuille (DCDH) | `chaisemartin_dhaultfoeuille.py` | `DIDmultiplegtDYN` | **Complete** | 2026-05-21 | | HeterogeneousAdoptionDiD (HAD) | `had.py`, `had_pretests.py` | `chaisemartin::did_had` (`Credible-Answers/did_had` v2.0.0); `nprobust` for bandwidth | **Complete** | 2026-05-20 | -| TROP | `trop.py`, `trop_local.py`, `trop_global.py` | (forthcoming; paper-author reference implementation) | **In Progress** | — | +| TROP | `trop.py`, `trop_local.py`, `trop_global.py` | (forthcoming; paper-author reference implementation) | **Complete** (paper `method="local"`) | 2026-05-24 | ### Triple-Difference Estimators @@ -823,22 +823,50 @@ These three are feature deferrals (paper-supported extensions that the library h | Field | Value | |-------|-------| -| Module | `trop.py`, `trop_local.py`, `trop_global.py`, `trop_results.py` | +| Module | `trop.py`, `trop_local.py`, `trop_results.py` (paper-aligned local method); `trop_global.py` (library-side efficiency adaptation — see "Scope" below) | | Primary Reference | Athey, Imbens, Qu & Viviano (2025), *Triply Robust Panel Estimators*, arXiv:2508.21536 | | R Reference | Paper-author reference implementation (not yet released as CRAN package) | -| Status | **In Progress** | -| Last Review | — | +| Status | **Complete** (paper `method="local"`, version-pinned to arXiv v2 — see Version Pinning below) | +| Last Review | 2026-05-24 | -**Documentation in place:** -- REGISTRY.md section: `## TROP` (local: factor matrix via soft-threshold SVD, exponential-decay unit weights matching paper Eq. 2, LOOCV per Eq. 5, multiple rank-selection methods cv/ic/elbow; global: alternating minimization for nuclear-norm penalty with hard-coded inner-FISTA 20-iteration loop, ATT averaging over D==1 cells, Rust-accelerated LOOCV and bootstrap) -- **Paper review on file**: `docs/methodology/papers/athey-2025-review.md` (retrospective, merged PR #443 on 2026-05-13) -- Implementation: 120 unit tests in `tests/test_trop.py` -- Survey support: Rao-Wu rescaled bootstrap with cross-classified pseudo-strata; Rust backend remains pweight-only +**Version Pinning:** This methodology promotion is anchored on **arXiv:2508.21536v2** (the version covered by the paper review on file at `docs/methodology/papers/athey-2025-review.md`). The current arXiv version is **v3** (submitted 2026-02-09). A formal v2→v3 source delta-check against the v3 PDF has **NOT** been performed for any of the sections this PR promotes (Eqs. 2-3, Algorithms 1-3, Section 2.2, Section 5.2-5.3, Section 6.1-6.2, Theorem 5.1, Corollary 1, Appendix Theorem 8.1). **Action item:** before the next paper-author reference implementation or substantive v3 release, refresh the paper review against the most recent arXiv version and re-validate the verified-component checklist; until then the promotion stays v2-anchored. -**Outstanding for promotion:** -- Dedicated `tests/test_methodology_trop.py` with paper-equation-numbered Verified Components walk-through -- Cross-validation against the paper-author reference implementation (when it becomes available) or against the paper's reported numbers on the empirical applications -- Documented deviations: bootstrap proportional-failure warnings (5% threshold), alternating-minimization convergence warnings, Rust backend's pweight-only limitation vs. Python's full survey-design support +**Scope:** This methodology promotion covers the paper-aligned `method="local"` path (paper Algorithm 2: per-(i, t) estimation with observation-specific weights). The library also exposes `method="global"`, documented in `REGISTRY.md` as a "computationally efficient adaptation using the (1-W) masking principle from Eq. 2" — a library-side adaptation, NOT the paper's full Algorithm 2 estimator. Defensive coverage of the global method lives in `tests/test_trop.py::TestTROPGlobalMethod` (704 lines, ~30 tests for the global-method-specific surface) and is not duplicated in the methodology walk-through. Methodology promotion of `method="global"` as a primary surface would require either (a) a paper-side derivation of the global adaptation's equivalence to Algorithm 2 under specific conditions, or (b) a separate library-extension methodology review; both are deferred. + +**Verified Components:** + +- [x] Eq. 2 weighted nuclear-norm-penalised L estimation: proximal-gradient inner solver (soft-threshold SVD) converges to ``prox_{λ/2}(R)`` under uniform weights; **plain** (non-accelerated) prox-gradient objective ``f(L) + λ‖L‖_*`` is non-increasing across iterations of a toy loop (this verifies the prox + gradient ingredient, NOT the shipped accelerated FISTA outer loop — Nesterov momentum gives the faster ``O(1/k^2)`` rate but does not guarantee per-step monotonicity); the shipped weighted-prox solver on non-uniform weights produces a final objective that is **at-or-below initialisation** (final-vs-initial check via `_weighted_nuclear_norm_solve`, NOT per-iteration monotonicity — the accelerated FISTA loop is allowed to have transient per-step increases) and reduces total singular-value mass below the residual. (Eq. 10 balancing representation / decomposition is the paper's identity built from these ingredients per Section 5.2; direct numerical reconstruction is out of scope — see "Outstanding Concerns".) +- [x] Eq. 3 per-(i, t) weights: unit distance excludes the target period (``1{u ≠ t}`` mask in the kernel) and uses only periods where both units are untreated (``(1 - W_iu)(1 - W_ju)`` mask). +- [x] Eqs. 4-5 + Algorithm 1 LOOCV: ``Q(λ)`` sums squared pseudo-treatment effects over ALL control observations where ``D_js = 0`` (including pre-treatment cells of eventually-treated units, paper Eq. 2 control set); two-stage coordinate-descent cycling (footnote 2) returns a tuple of values from the input grids. +- [x] Corollary 1 (paper p. 23) — **single-draw sanity checks consistent with the three unbiasedness conditions, not a repeated-MC mean-bias study**: each of the three balance conditions (a) unit balance, (b) time balance, (c) ``B = 0`` is exercised on a targeted DGP that makes one condition trivially hold while keeping the others sub-optimal. The assertion in each case is a single-realisation ``|att - τ| < 3 * se`` band using the estimator's own bootstrap SE — this is a smoke check, NOT a repeated-draw Monte Carlo bias study of the paper's conditional-unbiasedness statement under fixed weights. A stronger MC bias study at fixed λ values is deferred (would multiply test runtime by ~30x for marginal additional evidence given the existing 3-σ band already catches order-of-magnitude bias regressions). +- [x] Theorem 5.1 (paper p. 23) — **simulation sanity check, not a direct theorem lock**: the paper's bias bound ``|E[τ_hat - τ | L]| <= ||Δ_u|| · ||Δ_t|| · ||B||_*`` is stated for FIXED, non-data-dependent weights. The library's TROP fit uses data-dependent LOOCV-tuned λ values, so the direct conditional bias bound is not tested here. Instead, the methodology test verifies the bound's empirical realisation: TROP RMSE strictly below DID RMSE under a confounded factor DGP with ``true τ = 0`` (calibration measurement: TROP/DID RMSE ratio ≈ 0.34 at ``factor_strength = 1.0``). The direct fixed-weight bound test is deferred — would require exposing oracle Γ / Λ / B from a paper-aligned DGP and computing each component of the bound from instrumented internals. +- [x] Section 2.2 special-case reductions: **DiD benchmark sanity check** (not a direct algebraic-equivalence proof) — on a no-interactive-FE multi-period panel (additive unit + time effects only, no factor structure), TROP with ``λ_nn = ∞`` + uniform weights produces an ATT within 0.5 of `DifferenceInDifferences` fitted as `outcome ~ treat * post_flag` (basic 2×2 design with `[const, D, T, D×T]`, extended to repeated observations within each treat×post cell). This is **empirical numerical agreement on a friendly DGP**, NOT a proof of the paper Section 2.2 algebraic reduction (which would require either a true 2-period block-assignment panel where the basic-DiD comparator is the algebraic target, or a comparison against `TwoWayFixedEffects` — both deferred). **Matrix Completion code path exercised, not equivalence-checked** — TROP with uniform weights + finite ``λ_nn`` engages the nuclear-norm prox solver (effective_rank > 0) and recovers ATT better than the DiD-style baseline on a factor-confounded DGP; this verifies the code path activates but does NOT prove equivalence with an independent MC reference implementation (which would require either an external MC port or a hand-written reference solver). SC / SDID reductions deferred — see "Outstanding Concerns". +- [x] Eq. 13 + Algorithm 2 per-(i, t) estimation: ``treatment_effects`` dict contains one finite ``τ_hat_it`` per treated cell; the aggregate ATT equals the unweighted mean of per-cell effects (Eq. 1). **Tests cover block adoption with a constant treatment effect**; **absorbing-state staggered adoption** and **heterogeneous per-cell effects** (paper Remark 6.1) are SUPPORTED by the code path but not directly verified in this methodology surface. **Section 6.1 non-absorbing / on-off / switching assignment patterns are explicitly OUT OF SCOPE** — the implementation rejects non-absorbing D-matrices via `trop_local.py` absorbing-state validation, and the methodology test enforces the rejection contract via `TestTROPDeviations::test_event_style_d_rejected_with_value_error` (event-style D being one specific non-absorbing pattern; the same absorbing-state validator catches all 1→0 transitions). Cross-coverage of the staggered-cohort fit path is `tests/test_methodology_trop.py::TestTROPAlgorithm1LOOCV::test_control_set_includes_pretreat_of_eventually_treated`. +- [x] Algorithm 3 stratified pairs bootstrap: under an unbalanced (3 treated, 17 control) panel, the stratified sampler reliably produces ≥ 67% successful bootstrap draws and a positive finite SE. +- [x] Section 3 / Eq. 6 semi-synthetic factor DGP: five recovery tests verify limiting-case uniform weights, unit-weight bias reduction, time-weight bias reduction, factor-model bias reduction with effective_rank > 0, and null-DGP recovery centred near zero. +- [x] safe_inference contract: confidence interval uses the t-distribution with df = max(1, n_treated_obs - 1), consistent with p_value (matches REGISTRY `## TROP` "Inference CI distribution" note, post safe_inference migration). + +**Test Coverage:** + +- 36 methodology tests (10 classes) in `tests/test_methodology_trop.py`. +- Defensive guards (107 tests in `tests/test_trop.py`): D-matrix absorbing-state validation, silent-warning audit, FISTA convergence warnings, bootstrap-failure-rate proportional warning, bootstrap NaN-SE propagation, module-split smoke tests. + +**Deviations from paper:** + +- **Gap #5 (weight normalisation)**: paper Section 5 (p. 20) states weights sum to one (``1ᵀω = 1ᵀθ = 1``), but Eq. 3 (p. 7) writes unnormalised exponential weights. The shipped implementation matches Eq. 2 (unnormalised). Locked by `tests/test_methodology_trop.py::TestTROPDeviations::test_unnormalized_weights_match_eq2`. +- **Gap #9 (unbalanced panels)**: paper assumes a balanced ``N × T`` panel; the library accepts unbalanced panels with missing control / pre-treatment cells. **The methodology test exercises 10% random drops on the control + pre-treatment subset** (TROP fit completes and returns finite ATT). Three additional unbalanced-panel regressions are in `tests/test_trop.py::TestPR110FeedbackRound8`. Missing-treated-cell handling and thinner pre-period donor support are NOT directly covered by a dedicated regression — those edge cases lean on the absorbing-state monotonicity validation in `trop_local.py` and the validator-side tests in `tests/test_trop.py::TestDMatrixValidation`. Locked by `TestTROPDeviations::test_unbalanced_panels_supported`. +- **Rank selection**: the library follows the paper's implicit rank-selection via nuclear-norm soft-thresholding (paper review Gap #8). `TROP.__init__` does NOT expose a discrete `rank_selection` parameter; effective rank is reported via `TROPResults.effective_rank` (sum of singular values divided by largest) as a diagnostic, not as a user-selectable mode. Earlier REGISTRY prose mentioning "cv / ic / elbow" rank-selection methods was an overclaim — corrected in this PR. Locked by `TestTROPDeviations::test_rank_selection_is_implicit_via_nuclear_norm`. +- **`λ_nn = ∞` → 1e10 internal conversion**: results metadata stores the original ``inf`` value (REGISTRY `## TROP` "λ_nn=∞ implementation" edge-case note) while computations use 1e10 as a numerical sentinel. Locked by `TestTROPDeviations::test_lambda_nn_inf_stored_unchanged`. +- **Bootstrap proportional 5% failure-rate warning** and **FISTA / outer-loop convergence warnings**: defensive surfaces introduced under the Phase 2 silent-failure audit (REGISTRY `## TROP` "Bootstrap minimum" and "alternating-minimization convergence" notes). Covered in `tests/test_trop.py::TestTROPBootstrapFailureRateGuard` and `TestTROPConvergenceWarnings`. + +**Outstanding Concerns:** + +- **Equation 14 covariate extension** (paper Section 6.2): the library does NOT implement covariates. `TROP.fit()` has no ``covariates`` keyword argument, and the corresponding Theorem 8.1 (paper Appendix, pp. 36-37) covariate-triple-robustness result is correspondingly out of scope. Non-support is locked by `TestTROPDeviations::test_covariates_not_supported`. Deferred until use cases motivate the X threading through `trop_local.py` / `trop_global.py` / LOOCV / bootstrap. +- **SC / SDID special-case reductions** (paper Section 2.2 third bullet): the paper claims TROP reduces to SC and SDID under ``λ_nn = ∞`` and "specific choices of unit and time weights", but the exact ``(ω, θ)`` maps are not provided in the paper text. The library does not expose an SC- or SDID-matching weight setter (only the Eq. 3 ``λ_unit`` / ``λ_time`` decay rates). Cross-language anchor against `SyntheticDiD` is deferred until paper-author code clarifies the weight map. +- **Rust backend survey scope**: the Rust accelerated paths remain pweight-only; full survey-design (strata, PSU, FPC via Rao-Wu) falls back to the Python bootstrap loop. Cross-backend parity is covered by `tests/test_trop.py` defensive surfaces. +- **Eq. 10 balancing decomposition** (paper Section 5.2 + Eq. 10): numerical reconstruction of the four-term identity ``Y_NT_hat = L_NT + θ·(Y_pre_N - L_pre_N) + ω·(Y_T_co - L_T_co) - Σ θ_t ω_i (Y_it_co - L_it_co)`` requires the internal per-(i, t) weight vectors ``θ_s^{i,t}`` / ``ω_j^{i,t}``, which are not exposed on the public TROP API. The Eq. 2 ingredients that the Eq. 10 derivation builds on (soft-threshold SVD, **plain prox-gradient monotonicity** — NOT the shipped accelerated FISTA outer loop, which uses Nesterov momentum and does not guarantee per-step monotonicity — weighted-prox solver) are independently verified in `TestTROPNuclearNormProx`. Direct Eq. 10 lock deferred — would require exposing the internal weight vectors as a results-side diagnostic or instrumenting the test against the solver's intermediate state. + +**R Parity:** Deferred until the paper-author reference implementation is released ("forthcoming" per the paper). Tracker entry will be reopened when it lands. --- @@ -1338,10 +1366,9 @@ Promotion priority for the **In Progress** entries, ordered by what's blocked on **Consolidation-pass-blocked (already has paper review or methodology file or R parity; mostly Verified Components walk-through):** -6. **TROP** — paper review recently merged (PR #443); needs methodology file and cross-language anchor (when paper-author reference becomes available). -7. **StaggeredTripleDifference** — shares the primary paper (Ortiz-Villavicencio & Sant'Anna 2025) with TripleDifference, but no dedicated paper review on file yet; needs R parity (R fixtures gitignored — tracked in TODO.md, PR #245). -8. **ConleySpatialHAC** — paper review + committed R `conleyreg` goldens; needs dedicated methodology test file + summary R-parity table in this tracker. -9. **Survey Data Support** — cross-cutting feature; promotion requires the per-estimator integration paths to be locked down first. +6. **StaggeredTripleDifference** — shares the primary paper (Ortiz-Villavicencio & Sant'Anna 2025) with TripleDifference, but no dedicated paper review on file yet; needs R parity (R fixtures gitignored — tracked in TODO.md, PR #245). +7. **ConleySpatialHAC** — paper review + committed R `conleyreg` goldens; needs dedicated methodology test file + summary R-parity table in this tracker. +8. **Survey Data Support** — cross-cutting feature; promotion requires the per-estimator integration paths to be locked down first. --- diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index decf22b9..e7274f3e 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2244,11 +2244,12 @@ confidence bands (sup-t) for event study. **Primary source:** [Athey, S., Imbens, G.W., Qu, Z., & Viviano, D. (2025). Triply Robust Panel Estimators. arXiv:2508.21536.](https://arxiv.org/abs/2508.21536) +**Note (version pinning):** the methodology promotion (`METHODOLOGY_REVIEW.md` `#### TROP` → **Complete** as of 2026-05-24) is anchored on **arXiv:2508.21536v2**; the current arXiv version is **v3**. A formal v2→v3 source delta-check against the v3 PDF has NOT been performed for any of the sections covered by the promotion (Eqs. 2-3, Algorithms 1-3, Section 2.2, Section 5.2-5.3, Section 6.1-6.2, Theorem 5.1, Corollary 1, Appendix Theorem 8.1). See `docs/methodology/papers/athey-2025-review.md` "Version-pinning note" for the deferred action item. + **Key implementation requirements:** *Assumption checks / warnings:* - Requires sufficient pre-treatment periods for factor estimation (at least 2) -- Warns if estimated rank seems too high/low relative to panel dimensions - Unit weights can become degenerate if λ_unit too large - Returns Q(λ) = ∞ if ANY LOOCV fit fails (Equation 5 compliance) @@ -2345,7 +2346,7 @@ Q(λ) = Σ_{j,s: D_js=0} [τ̂_js^loocv(λ)]² - Block bootstrap preserving panel structure (Algorithm 3) *Edge cases:* -- Rank selection: automatic via cross-validation, information criterion, or elbow +- Rank selection: implicit via nuclear-norm soft-thresholding (paper Section 5.3 + Appendix); `TROPResults.effective_rank` reports the diagnostic. No discrete `rank_selection` constructor parameter (cv / ic / elbow) is exposed — earlier prose claiming "automatic via cross-validation, information criterion, or elbow" was an overclaim, corrected in the 2026-05-24 methodology-promotion PR. See the Requirements checklist Rank-selection bullet below. - Zero singular values: handled by soft-thresholding - Extreme distances: weights regularized to prevent degeneracy - LOOCV fit failures: returns Q(λ) = ∞ on first failure (per Equation 5 requirement that Q sums over ALL control observations where D==0); if all parameter combinations fail, falls back to defaults (1.0, 1.0, 0.1) @@ -2381,12 +2382,17 @@ Q(λ) = Σ_{j,s: D_js=0} [τ̂_js^loocv(λ)]² - [x] Unit weights: `exp(-λ_unit × distance)` (unnormalized, matching Eq. 2) - [x] LOOCV implemented for tuning parameter selection - [x] LOOCV uses SUM of squared errors per Equation 5 -- [x] Multiple rank selection methods: cv, ic, elbow -- [x] Returns factor loadings and scores for interpretation +- [x] Rank selection implicit via nuclear-norm soft-thresholding (paper Section 5.3 + Appendix); `TROPResults.effective_rank` reports the diagnostic. No discrete `rank_selection` constructor parameter is exposed — earlier mention of "cv / ic / elbow" methods in this checklist was an overclaim, corrected in the methodology-promotion PR. Locked by `tests/test_methodology_trop.py::TestTROPDeviations::test_rank_selection_is_implicit_via_nuclear_norm`. +- [x] Returns the fitted factor matrix and an effective-rank diagnostic (`TROPResults.factor_matrix` and `TROPResults.effective_rank`). The library does NOT expose separate factor-loading / factor-score outputs — earlier prose claiming "factor loadings and scores" was an overclaim corrected in the 2026-05-24 methodology-promotion PR (TROP's nuclear-norm soft-thresholded L is delivered as a single (n_periods × n_units) matrix, not decomposed into loading / score components on Results). - [x] ATT averages over all D==1 cells (general assignment patterns) - [x] No post_periods parameter (D matrix determines treatment timing) - [x] D matrix semantics documented (absorbing state, not event indicator) -- [x] Unbalanced panels supported (missing observations don't trigger false violations) +- [x] Unbalanced panels supported — missing control / pre-treatment cells don't trigger false absorbing-state violations. Locked by `tests/test_methodology_trop.py::TestTROPDeviations::test_unbalanced_panels_supported` (10% random drops on control + pre-treatment subset). Three additional unbalanced-panel regressions live in `tests/test_trop.py::TestPR110FeedbackRound8` (`test_unbalanced_panel_d_matrix_validation`, `test_unbalanced_panel_real_violation_still_caught`, `test_unbalanced_panel_multiple_missing_periods`). Absorbing-state monotonicity validation (which fires on unbalanced cases too) is covered by `tests/test_trop.py::TestDMatrixValidation`. +- [x] Per-observation treatment-effect estimation (Eq. 13 / Algorithm 2) — `treatment_effects` dict contains one finite `τ_hat_it` per treated cell, and the aggregate ATT equals the unweighted mean of per-cell effects (Eq. 1). **The methodology test exercises block adoption with a constant treatment effect**; **absorbing-state staggered adoption** and **heterogeneous per-cell effects** (paper Remark 6.1) are SUPPORTED by the code path (the implementation does not gate on cohort or effect-magnitude pattern), but are not directly verified in the methodology test surface in this PR. **Section 6.1 non-absorbing / on-off / switching assignment patterns are explicitly OUT OF SCOPE** — the absorbing-state validator at `trop_local.py` rejects non-monotonic D matrices with a `ValueError`, and `TestTROPDeviations::test_event_style_d_rejected_with_value_error` enforces the rejection contract (event-style D being one specific non-absorbing pattern; the same validator catches all 1→0 transitions). Cross-coverage of the staggered-cohort fit path is `tests/test_methodology_trop.py::TestTROPAlgorithm1LOOCV::test_control_set_includes_pretreat_of_eventually_treated` (two-cohort early-/late-treated panel under LOOCV-tuned `λ_unit`); absorbing-state structural validation is `tests/test_trop.py::TestDMatrixValidation`. +- [x] Special-case reductions (paper Section 2.2): **DiD benchmark sanity check** (NOT a direct algebraic-equivalence proof) — TROP with `λ_nn=∞` + uniform weights produces an ATT within 0.5 of `DifferenceInDifferences` fitted as a basic 2×2 design on a TWFE-clean multi-period panel. This is empirical numerical agreement on a friendly DGP. A direct Section 2.2 reduction lock (true 2-period block-assignment panel where basic DiD is the algebraic target, or a comparison against `TwoWayFixedEffects` with explicit unit FE) is deferred. **Matrix Completion code path exercised** — TROP with uniform weights + finite `λ_nn` engages the nuclear-norm prox solver (effective_rank > 0) and beats the DiD-style baseline on a factor-confounded DGP; not an equivalence check against an independent MC reference. SC and SDID reductions are paper-claimed under "specific (omega, theta) weight choices" not provided in the paper text; cross-language anchor deferred until paper-author reference implementation clarifies the weight map. See `tests/test_methodology_trop.py::TestTROPSpecialCases`. +- **Note:** The balancing representation / decomposition (paper Eq. 10, Section 5.2) is a paper-side identity. Direct numerical reconstruction of the four-term sum requires the internal `θ_s^{i,t}` / `ω_j^{i,t}` weight vectors, which are not exposed on the public TROP API; numerical Eq. 10 verification is therefore out of scope. The test `tests/test_methodology_trop.py::TestTROPNuclearNormProx::test_factor_matrix_consistent_with_treatment_effects` is a structural pointer only — it checks `factor_matrix` shape + finiteness + that `treatment_effects` is populated with finite entries, but does NOT lock the magnitude of `L_hat`. (The test DGP uses additive unit + time effects only; on a no-interactive-FE panel, the paper's framework absorbs the additive surfaces into `α_i` / `β_t`, so a near-zero `L_hat` is methodologically correct. An `effective_rank > 0` assertion would lock a solver artifact, not the intended low-rank behavior.) This is NOT a full Eq. 10 lock. The Eq. 2 ingredients (soft-threshold SVD, **plain prox-gradient monotonicity** — NOT the shipped accelerated FISTA outer loop, which uses Nesterov momentum and does not guarantee per-step monotonicity, see `TestTROPNuclearNormProx` class docstring — weighted-prox) that the Eq. 10 derivation relies on are independently verified in the same class. +- **Note (library-side choice):** Weight normalization (Gap #5 in `docs/methodology/papers/athey-2025-review.md`): paper Section 5 (p. 20) states weights sum to one (`1ᵀω = 1ᵀθ = 1`), but Eq. 3 (p. 7) writes unnormalized exponential weights. **The paper-side ambiguity remains open**; the library resolves it as a documented deviation — the shipped implementation matches Eq. 2 (unnormalized). Verified by `tests/test_methodology_trop.py::TestTROPDeviations::test_unnormalized_weights_match_eq2`. Will be revisited once paper-author reference implementation lands. +- **Note (deferral):** Equation 14 covariate extension (`Y_it = α_i + β_t + X_it·β_coef + R_it` with R low-rank, paper Section 6.2) is **not implemented**. `TROP.fit()` does not accept a `covariates` keyword argument. The corresponding Theorem 8.1 covariate-triple-robustness result is correspondingly out of scope. The non-support is locked by `tests/test_methodology_trop.py::TestTROPDeviations::test_covariates_not_supported`, which uses `inspect.signature` to guard against future `**kwargs` silently breaking the contract. Deferred until use cases motivate the X threading through `trop_local.py` / `trop_global.py` / LOOCV / bootstrap. - **Note:** Survey support: weights, strata, PSU, and FPC are all supported via Rao-Wu rescaled bootstrap with cross-classified pseudo-strata (Phase 6). Rust backend remains pweight-only; full-design surveys fall back to the Python bootstrap path. Survey weights enter ATT aggregation only — population-weighted average of per-observation treatment effects. Model fitting (kernel weights, LOOCV, nuclear norm regularization) stays unchanged. Rust and Python bootstrap paths both support survey-weighted ATT in each iteration. ### TROP Global Estimation Method diff --git a/docs/methodology/papers/athey-2025-review.md b/docs/methodology/papers/athey-2025-review.md index 6ca71516..c2e253c7 100644 --- a/docs/methodology/papers/athey-2025-review.md +++ b/docs/methodology/papers/athey-2025-review.md @@ -5,11 +5,17 @@ **PDF reviewed:** https://arxiv.org/abs/2508.21536v2 (version-pinned arXiv abstract for v2) **Review date:** 2026-02-08 +**Version-pinning note (2026-05-25):** The current arXiv version of arXiv:2508.21536 is **v3** (submitted 2026-02-09). The 2026-05-24 methodology promotion ships against this v2-pinned review; a formal v2-vs-v3 delta-check against the v3 PDF for TROP-relevant methodology changes (Eqs. 2-3, Algorithms 1-3, Section 2.2, Section 5.2-5.3, Section 6.1-6.2, Theorem 5.1, Corollary 1, Appendix Theorem 8.1) has **NOT** been performed. + +**Action item**: before the next paper-author reference implementation or substantive v3 release, refresh this review against the most recent arXiv version, perform a real v2→v3 PDF delta audit, and re-validate that the verified-component checklist still maps cleanly. Pending that refresh, the methodology promotion is anchored on v2 as documented here. + --- ## Methodology Registry Entry -*Working draft formatted to match docs/methodology/REGISTRY.md structure. Heading levels and labels align with existing entries. One open source-ambiguity item remains (weight normalization — see Gap #5 under "Gaps and Uncertainties" below); resolve it against the source before promoting this section into REGISTRY.md.* +*Working draft formatted to match docs/methodology/REGISTRY.md structure. Heading levels and labels align with existing entries.* + +*Resolution status (post-promotion, 2026-05-24):* Gap #5 (weight normalization) is resolved as a **library-side choice**, not a source-side clarification: the shipped implementation uses unnormalized exponential weights matching Eq. 2, and the methodology-promotion PR documents this choice as a deliberate deviation from the Section 5 sum-to-one statement (see the weight-normalization note in the `## TROP` block of `docs/methodology/REGISTRY.md` and the Deviations subsection of the `#### TROP` block in `METHODOLOGY_REVIEW.md`). The deviation is locked by `tests/test_methodology_trop.py::TestTROPDeviations::test_unnormalized_weights_match_eq2` via direct kernel-weight inspection. See Gap #5 below for the original source-side ambiguity. ## TROP @@ -264,7 +270,7 @@ Note: Stratified bootstrap -- control and treated units resampled separately. Pr - [ ] Stratified bootstrap preserving unit-level structure (Algorithm 3) - [ ] Covariate extension supported (Equation 14) - [ ] Special cases recoverable: DID, MC, SC, SDID via tuning parameters -- Weight normalization (`1^T omega = 1^T theta = 1`, Section 5, page 20) — **open source-ambiguity item**; see Gap #5 below. Section 5 states sum-to-one, Equation 3 / Equation 2 use unnormalized exponential weights, and the shipped implementation matches Equation 2. Do not promote this to a requirement until the discrepancy is resolved against the source. +- Weight normalization (`1^T omega = 1^T theta = 1`, Section 5, page 20) — **resolved as library-side choice (2026-05-24)**. Section 5 states sum-to-one, Equation 3 / Equation 2 use unnormalized exponential weights, and the shipped implementation matches Equation 2 (unnormalized). The methodology-promotion PR documents this as a deliberate deviation from the Section 5 sum-to-one statement, locked by direct kernel inspection (see Gap #5 below for original source-side ambiguity). - [ ] Heterogeneous treatment effects supported via per-observation estimation (Remark 6.1) --- @@ -349,7 +355,7 @@ The paper uses semi-synthetic simulations (Section 3.1, pages 9-11) based on 6 r 4. **Computational complexity**: Not explicitly discussed. The LOOCV grid search is described as the bottleneck, but no formal complexity analysis is provided. -5. **Weight normalization**: Section 5 (page 20) states weights sum to one (`1^T omega = 1^T theta = 1`), but the weight specification in Equation 3 (page 7) uses unnormalized exponential weights. It is unclear whether normalization is applied before or after the optimization, or whether the theoretical results in Section 5 assume normalized weights while the practical algorithm uses unnormalized weights. The existing diff-diff implementation uses unnormalized weights in the optimization (matching Equation 2). +5. **Weight normalization** (*resolved 2026-05-24*): Section 5 (page 20) states weights sum to one (`1^T omega = 1^T theta = 1`), but the weight specification in Equation 3 (page 7) uses unnormalized exponential weights. It is unclear whether normalization is applied before or after the optimization, or whether the theoretical results in Section 5 assume normalized weights while the practical algorithm uses unnormalized weights. **Resolution**: the shipped implementation uses unnormalized weights matching Equation 2. The methodology-promotion PR adopts this as a deliberate **library-side choice / deviation from the Section 5 sum-to-one statement**, locked by `tests/test_methodology_trop.py::TestTROPDeviations::test_unnormalized_weights_match_eq2` which directly inspects the per-(i, t) weight matrix at `lambda_unit = lambda_time = 0` and asserts every entry equals 1.0 (sum = N*T, not 1). The source-side ambiguity remains open — clarification from the paper authors / forthcoming reference implementation would let the library either confirm the unnormalized choice or migrate to the normalized form; for now the unnormalized form is the documented library contract. 6. **Nuclear norm penalty in Equation 13** (resolved): the source uses the same unsquared nuclear-norm penalty `lambda_nn ||L||_*` in Equation 13 as in Equation 2 (consistent with the rest of the draft and confirmed against the paper text). The shipped implementation matches. diff --git a/tests/test_methodology_trop.py b/tests/test_methodology_trop.py new file mode 100644 index 00000000..f6a662eb --- /dev/null +++ b/tests/test_methodology_trop.py @@ -0,0 +1,2425 @@ +"""Methodology verification tests for the Triply Robust Panel (TROP) estimator. + +Targets Athey, Imbens, Qu & Viviano (2025), *Triply Robust Panel Estimators*, +arXiv:2508.21536 (https://arxiv.org/abs/2508.21536). + +Equation walk-through: + +- Eq. 2: nuclear-norm penalised weighted-least-squares objective with + soft-threshold SVD prox. Verified by `TestTROPNuclearNormProx` + (proximal correctness, plain prox-gradient objective monotonicity + on a toy setup, weighted-solver convergence, reduction of + singular-value mass). The shipped local / global solvers wrap + this prox step with accelerated FISTA momentum; the accelerated + loop's faster `O(1/k^2)` rate does NOT guarantee per-step + monotonicity, so the test exercises the plain prox-gradient + ingredient (without momentum), not the accelerated loop. +- Eq. 10: balancing decomposition of the estimated counterfactual is + a paper-side balancing representation (paper Eq. 10 / Section 5.2) that depends on the internal + per-(i, t) weight vectors `theta_s^{i,t}` / `omega_j^{i,t}`, + which are not exposed on the public TROP API. Direct + numerical reconstruction of the four-term identity is out of + scope; `TestTROPNuclearNormProx.test_factor_matrix_consistent_with_treatment_effects` + is a structural pointer (shape + finiteness of the fitted + ``factor_matrix`` + ``treatment_effects`` populated with + finite entries), not a full Eq. 10 lock and not a non- + triviality claim on the ``L_hat`` magnitude. +- Eq. 3: exponential-decay unit and time weights; unit distance uses + only periods where both i and j are untreated and excludes + the target period t. Verified by `TestTROPEquation3Weights`. +- Eqs. 4-5 / Algorithm 1: LOOCV pseudo-treatment-effect summation; control + set includes pre-treatment observations of eventually- + treated units; two-stage coordinate cycling per footnote 2. + Verified by `TestTROPAlgorithm1LOOCV`. +- Corollary 1: unbiasedness under any one of (a) unit balance, + (b) time balance, (c) correct regression adjustment B=0. + Verified by `TestTROPCorollary1Unbiasedness` via three + targeted DGPs. +- Theorem 5.1: triply-robust bias bound; **simulation sanity check, not + a direct theorem lock**. The paper states the bound + `|E[tau_hat - tau | L]| <= ||Delta_u|| * ||Delta_t|| * ||B||_*` + for FIXED, non-data-dependent weights, whereas TROP fits + use LOOCV-tuned (data-dependent) weights. The class + `TestTROPTheorem51TripleRobustness` verifies the bound's + empirical realisation under LOOCV-tuned weights: TROP + MSE strictly below DID MSE on a factor-confounded DGP. + The direct fixed-weight bias-bound test is deferred. +- Section 2.2: special-case reductions. DID reduction (lambda_nn=inf + + uniform weights) is verified as a **benchmark sanity + check** against the basic DiD on a no-interactive-FE + panel (additive unit + time effects only) — empirical + numerical agreement on a friendly DGP, NOT an algebraic- + equivalence proof of the Section 2.2 reduction. The MC + reduction (uniform weights + finite lambda_nn) only + verifies that the nuclear-norm prox code path engages + and beats a DID-style baseline; it is NOT an + equivalence check against an independent MC reference + implementation. SC + SDID reductions are skipped — + paper claims reduction under "specific (omega, theta) + weight choices" without providing the map; cross- + language anchors are deferred until paper-author code + lands. See `TestTROPSpecialCases`. +- Eq. 13 / Algorithm 2: per-(i, t) estimation for multiple treated units; + ATT averages over all W_it=1 cells. Verified by + `TestTROPAlgorithm2MultipleTreated`. +- Algorithm 3: stratified pairs bootstrap; separate N_0 / N_1 + resampling; preserves within-unit temporal correlation. + Verified by `TestTROPAlgorithm3Bootstrap`. +- Section 3 / Eq. 6: semi-synthetic factor DGP; treatment-effect recovery + under known structure. Verified by + `TestTROPEquation6FactorDGPRecovery`. + +Deviations from paper: + +- Gap #5 (paper review): weight normalisation is ambiguous in the paper + (Section 5 says weights sum to one; Eq. 3 uses unnormalised exponential + weights). The shipped implementation matches Eq. 2 (unnormalised). + Verified by + `TestTROPDeviations.test_unnormalized_weights_match_eq2`. +- Gap #9 (paper review): the paper assumes a balanced panel; the library + supports unbalanced panels with missing-period guards. Verified by + `TestTROPDeviations.test_unbalanced_panels_supported`. +- Equation 14 covariate extension (and Theorem 8.1 covariate triple + robustness) is deferred. `TROP.fit()` does not accept a ``covariates`` + parameter. Locked by + `TestTROPDeviations.test_covariates_not_supported`. + +See: + +- ``docs/methodology/papers/athey-2025-review.md`` (paper review). +- ``docs/methodology/REGISTRY.md`` ``## TROP`` block. +- ``METHODOLOGY_REVIEW.md`` ``TROP`` section. + +Companion files (NOT duplicated here): + +- ``tests/test_trop.py`` (implementation-detail unit + tests, defensive guards, + API regressions) +- ``tests/test_trop.py::TestSilentWarningAudit`` (silent-failure audit) +- ``tests/test_trop.py::TestTROPConvergenceWarnings`` (FISTA / outer-loop + convergence warnings) +- ``tests/test_trop.py::TestTROPBootstrapFailureRateGuard`` (bootstrap 5% + failure-rate + guard) +- ``tests/test_trop.py::TestDMatrixValidation`` (absorbing-state + validation) + +R / source parity is deferred until the paper-author reference +implementation is released ("forthcoming" per the paper); see +``METHODOLOGY_REVIEW.md`` ``TROP`` section. + +Class structure: + +- ``TestTROPNuclearNormProx`` — Eq. 2 + Eq. 10 (prox, FISTA, balancing). +- ``TestTROPEquation3Weights`` — Eq. 3 (unit / time weight semantics). +- ``TestTROPAlgorithm1LOOCV`` — Eqs. 4-5 + Algorithm 1. +- ``TestTROPCorollary1Unbiasedness`` — Corollary 1 (three conditions). +- ``TestTROPTheorem51TripleRobustness`` — Theorem 5.1 (MC ranking realisation). +- ``TestTROPSpecialCases`` — Section 2.2 reductions (DID, MC, SDID; SC skipped). +- ``TestTROPAlgorithm2MultipleTreated`` — Eq. 13 + Algorithm 2. +- ``TestTROPAlgorithm3Bootstrap`` — Algorithm 3 (stratified pairs bootstrap). +- ``TestTROPEquation6FactorDGPRecovery`` — Section 3 / Eq. 6 semi-synthetic. +- ``TestTROPDeviations`` — locks library deviations and documented choices. +""" + +import inspect +import warnings + +import numpy as np +import pandas as pd +import pytest + +from diff_diff import TROP, DifferenceInDifferences +from diff_diff.prep import generate_factor_data + +# Per-class seed bases (decorrelate MC tests within and across classes). +_BASE_SEED_NUCLEAR_PROX = 4242 +_BASE_SEED_EQ3_WEIGHTS = 3131 +_BASE_SEED_ALG1_LOOCV = 5555 +_BASE_SEED_COROLLARY_1 = 6464 +_BASE_SEED_THEOREM_51 = 7373 +_BASE_SEED_SPECIAL_CASES = 8181 +_BASE_SEED_ALG2_MULTI = 9292 +_BASE_SEED_DEVIATIONS = 3030 + + +# ============================================================================= +# Helpers — paper-aligned DGPs for methodology tests +# ============================================================================= + + +def _make_trop_factor_panel( + *, + n_units: int = 30, + n_treated: int = 6, + n_pre: int = 8, + n_post: int = 4, + n_factors: int = 2, + factor_strength: float = 1.0, + treated_loading_shift: float = 0.5, + unit_fe_sd: float = 0.3, + noise_sd: float = 0.3, + treatment_effect: float = 0.0, + seed: int = 0, +) -> pd.DataFrame: + """Build a balanced panel with a rank-`n_factors` interactive-FE component. + + Mirrors the paper's Eq. 6 simulation DGP (Section 3.1). Returns a long + DataFrame with columns ``unit``, ``period``, ``outcome``, ``treated``. + A non-zero ``treated_loading_shift`` induces selection-on-loadings + confounding (matching the paper's logistic-on-factors selection in Eq. 8). + Wraps :func:`diff_diff.prep.generate_factor_data` for parity with + `tests/test_trop.py::generate_factor_dgp`. + """ + data = generate_factor_data( + n_units=n_units, + n_pre=n_pre, + n_post=n_post, + n_treated=n_treated, + n_factors=n_factors, + treatment_effect=treatment_effect, + factor_strength=factor_strength, + treated_loading_shift=treated_loading_shift, + unit_fe_sd=unit_fe_sd, + noise_sd=noise_sd, + seed=seed, + ) + return pd.DataFrame(data[["unit", "period", "outcome", "treated"]]) + + +def _make_constant_loading_panel( + *, + n_units: int = 30, + n_treated: int = 8, + n_pre: int = 6, + n_post: int = 4, + n_factors: int = 2, + factor_strength: float = 1.0, + noise_sd: float = 0.3, + treatment_effect: float = 2.0, + seed: int = 0, +) -> pd.DataFrame: + """Build a panel where every unit shares the same factor loading. + + This makes Corollary 1(a) (unit balance) trivially hold: with + Gamma_i = Gamma_j for all i, j, any non-negative weight vector ω + satisfies (sum_i omega_i Gamma_i) = Gamma_N, so Delta_u = 0 regardless + of lambda_unit. Treated units still receive a non-zero treatment effect + and time factors still vary across t, so the test surface is real. + """ + rng = np.random.default_rng(seed) + n_periods = n_pre + n_post + shared_loading = rng.normal(0, 1, (1, n_factors)) + Gamma = np.tile(shared_loading, (n_units, 1)) + Lambda = rng.normal(0, 1, (n_periods, n_factors)) + rows = [] + for i in range(n_units): + is_treated = i < n_treated + unit_fe = rng.normal(0, 0.3) + for t in range(n_periods): + time_fe = 0.15 * t + interaction = factor_strength * Gamma[i, :] @ Lambda[t, :] + y = 5.0 + unit_fe + time_fe + interaction + d = 1 if (is_treated and t >= n_pre) else 0 + if d: + y += treatment_effect + y += rng.normal(0, noise_sd) + rows.append({"unit": i, "period": t, "outcome": y, "treated": d}) + return pd.DataFrame(rows) + + +def _make_constant_factor_panel( + *, + n_units: int = 30, + n_treated: int = 8, + n_pre: int = 6, + n_post: int = 4, + n_factors: int = 2, + factor_strength: float = 1.0, + noise_sd: float = 0.3, + treatment_effect: float = 2.0, + seed: int = 0, +) -> pd.DataFrame: + """Build a panel where every period shares the same factor score. + + This makes Corollary 1(b) (time balance) trivially hold: with + Lambda_s = Lambda_t for all s, t, any non-negative weight vector θ + satisfies (sum_s theta_s Lambda_s) = Lambda_T, so Delta_t = 0 regardless + of lambda_time. Unit loadings still vary (with confounding shift on + treated), so the test surface is non-trivial. + """ + rng = np.random.default_rng(seed) + n_periods = n_pre + n_post + shared_factor = rng.normal(0, 1, (1, n_factors)) + Lambda = np.tile(shared_factor, (n_periods, 1)) + Gamma = rng.normal(0, 1, (n_units, n_factors)) + Gamma[:n_treated, :] += 0.5 # selection on loadings (still hard for DID) + rows = [] + for i in range(n_units): + is_treated = i < n_treated + unit_fe = rng.normal(0, 0.3) + for t in range(n_periods): + time_fe = 0.15 * t + interaction = factor_strength * Gamma[i, :] @ Lambda[t, :] + y = 5.0 + unit_fe + time_fe + interaction + d = 1 if (is_treated and t >= n_pre) else 0 + if d: + y += treatment_effect + y += rng.normal(0, noise_sd) + rows.append({"unit": i, "period": t, "outcome": y, "treated": d}) + return pd.DataFrame(rows) + + +def _make_no_factor_panel( + *, + n_units: int = 30, + n_treated: int = 8, + n_pre: int = 6, + n_post: int = 4, + noise_sd: float = 0.3, + treatment_effect: float = 2.0, + seed: int = 0, +) -> pd.DataFrame: + """Build a TWFE-clean panel (no interactive FE, just additive unit + time). + + With no factor structure, the regression-adjustment bias matrix B is + trivially zero (and TROP with lambda_nn=infinity reduces exactly to TWFE + per paper Section 2.2). Used by Corollary 1(c) to verify the B=0 case. + """ + rng = np.random.default_rng(seed) + rows = [] + for i in range(n_units): + is_treated = i < n_treated + unit_fe = rng.normal(0, 0.5) + for t in range(n_pre + n_post): + time_fe = 0.2 * t + y = 5.0 + unit_fe + time_fe + d = 1 if (is_treated and t >= n_pre) else 0 + if d: + y += treatment_effect + y += rng.normal(0, noise_sd) + rows.append({"unit": i, "period": t, "outcome": y, "treated": d}) + return pd.DataFrame(rows) + + +def _fit_did(df: pd.DataFrame) -> float: + """Fit a `DifferenceInDifferences` benchmark on the panel and return + the interaction coefficient. + + This is the library's basic 2×2 DiD estimator (`[const, D, T, D×T]` + design, no explicit fixed effects added). On a balanced two-period + block-assignment panel it coincides numerically with TWFE / two-way + fixed effects (paper Section 2.2 uses "DID/TWFE" interchangeably for + this special case), but the library distinguishes the two classes — + `TwoWayFixedEffects` is a separate explicit-FE estimator. Used by + `TestTROPTheorem51TripleRobustness` as the comparator benchmark for + the MC-ranking realisation of the Theorem 5.1 bias bound and by + `TestTROPSpecialCases::test_did_reduction_lambda_nn_inf_uniform_weights` + as the DID-reduction target. + """ + df2 = df.copy() + df2["treat"] = df2.groupby("unit")["treated"].transform("max").astype(int) + df2["post_flag"] = df2.groupby("period")["treated"].transform("max").astype(int) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + est = DifferenceInDifferences() + r = est.fit(df2, formula="outcome ~ treat * post_flag", unit="unit", time="period") + coefs = r.coefficients or {} + return float(coefs.get("treat:post_flag", np.nan)) + + +# ============================================================================= +# TestTROPNuclearNormProx — Eq. 2 + Eq. 10 (prox, FISTA, balancing) +# ============================================================================= + + +class TestTROPNuclearNormProx: + """Eq. 2: weighted nuclear-norm-penalised L estimation; structural + pointer to Eq. 10. + + Verifies the proximal-gradient inner solver (soft-threshold SVD), + plain (non-accelerated) prox-gradient monotonicity on a toy setup, + and the weighted-norm solver behaviour. The shipped local / global + solvers wrap the prox step with accelerated FISTA momentum; the + accelerated loop's faster ``O(1/k^2)`` rate does NOT guarantee + per-step monotonicity, so the monotonicity test + (``test_plain_prox_gradient_objective_decreases``) exercises the + plain prox-gradient ingredient (no Nesterov momentum), NOT the + accelerated outer loop. + + The balancing decomposition of paper Eq. 10 + (``Y_NT_hat = L_NT + theta . (Y_pre_N - L_pre_N) + omega . (Y_T_co - L_T_co) + - sum theta_t omega_i (Y_it_co - L_it_co)``) + is a paper-side balancing representation (paper Eq. 10 / Section 5.2) that requires the internal per-(i, t) + weight vectors ``theta_s^{i,t}`` / ``omega_j^{i,t}`` to numerically + reconstruct. Those vectors are not exposed on the public TROP API, + so this class does NOT directly verify the four-term identity. The + method ``test_factor_matrix_consistent_with_treatment_effects`` is a + structural pointer only — it checks ``factor_matrix`` shape + + finiteness and verifies ``treatment_effects`` is populated with + finite entries (the framework that Eq. 10 derives). It does NOT + assert non-triviality of the ``L_hat`` magnitude (the test DGP has + no interactive factor structure, so a near-zero ``L_hat`` is + methodologically correct under the paper's framework). + + Origin: ported from the pre-migration `TestTROPNuclearNormSolver` + class in `test_trop.py` (three of four methods migrated; the one + remaining defensive `test_zero_weights_no_division_error` stayed in + `test_trop.py`) plus the weighted-solver convergence test + `test_weighted_nuclear_norm_solver_convergence` and the weighted- + nuclear-norm objective test `test_issue_c_weighted_nuclear_norm`, + both originally in a pre-migration `TestPaperConformanceFixes` + class that was deleted in the methodology-promotion PR. + """ + + def test_proximal_step_size_correctness(self): + """Eq. 2 prox operator: L converges to ``prox_{lambda/2}(R)`` under + uniform weights. + + With delta_max = 1 and uniform weights, the proximal gradient step + reduces to L_{k+1} = soft_threshold_svd(R, lambda/2). Many + iterations should converge L exactly to that analytical fixed point. + """ + trop_est = TROP(method="global", n_bootstrap=2) + + rng = np.random.default_rng(42) + R = rng.normal(0, 1, (4, 3)) + delta = np.ones((4, 3)) + lambda_nn = 0.5 + + L = np.zeros_like(R) + for _ in range(500): + delta_max = np.max(delta) + delta_norm = delta / delta_max + gradient_step = L + delta_norm * (R - L) + eta = 1.0 / (2.0 * delta_max) + L = trop_est._soft_threshold_svd(gradient_step, eta * lambda_nn) + + L_exact = trop_est._soft_threshold_svd(R, lambda_nn / 2.0) + np.testing.assert_array_almost_equal(L, L_exact, decimal=4) + + def test_plain_prox_gradient_objective_decreases(self): + """Eq. 2 plain (non-accelerated) prox-gradient: objective + ``f(L) + lambda * ||L||_*`` is non-increasing across iterations + when the soft-threshold SVD prox is applied via a plain prox- + gradient step (no Nesterov momentum). + + **Scope:** this verifies the underlying prox operator + gradient + step that the library's accelerated FISTA loop builds on. The + shipped local / global solvers (``trop_local.py`` / ``trop_global.py``) + wrap this prox step with FISTA acceleration, which gives the + faster ``O(1/k^2)`` rate but does NOT guarantee per-iteration + monotonicity — Nesterov momentum can cause transient objective + increases between iterations while still converging optimally. + This test exercises the prox+gradient ingredient, not the + accelerated outer loop. + """ + rng = np.random.default_rng(42) + R = rng.normal(0, 1, (6, 4)) + delta = rng.uniform(0.5, 2.0, (6, 4)) + lambda_nn = 0.3 + + trop_est = TROP(method="global", n_bootstrap=2) + L = np.zeros_like(R) + objectives = [] + + for _ in range(50): + f_val = np.sum(delta * (R - L) ** 2) + _, s, _ = np.linalg.svd(L, full_matrices=False) + obj = f_val + lambda_nn * np.sum(s) + objectives.append(obj) + + delta_max = np.max(delta) + delta_norm = delta / delta_max + gradient_step = L + delta_norm * (R - L) + eta = 1.0 / (2.0 * delta_max) + L = trop_est._soft_threshold_svd(gradient_step, eta * lambda_nn) + + for k in range(1, len(objectives)): + assert objectives[k] <= objectives[k - 1] + 1e-10, ( + f"Plain prox-gradient objective increased at step {k}: " + f"{objectives[k]} > {objectives[k - 1]} " + f"(NOTE: this assertion is for the plain prox-gradient " + f"ingredient; the shipped accelerated FISTA loop does " + f"NOT guarantee per-step monotonicity)" + ) + + def test_local_nonuniform_weights_objective(self): + """Eq. 2 weighted-prox: objective at the final iterate is + at-or-below initialisation under non-uniform weights + (W_max < 1), and the resulting L has strictly smaller nuclear + norm than the residual R. + + Scope note: this is a **final-vs-initial** check on the shipped + `_weighted_nuclear_norm_solve` (which uses accelerated FISTA); + it does NOT verify per-iteration monotonicity. Per-step + monotonicity is a property of the plain prox-gradient ingredient + only (see ``test_plain_prox_gradient_objective_decreases``); + accelerated FISTA's Nesterov momentum is allowed to produce + transient per-step objective increases while still converging. + """ + rng = np.random.default_rng(123) + R = rng.normal(0, 1, (6, 4)) + W = rng.uniform(0.1, 0.8, (6, 4)) + lambda_nn = 0.3 + + trop_est = TROP(method="local", n_bootstrap=2) + + L_init = np.zeros_like(R) + f_init = np.sum(W * (R - L_init) ** 2) + _, s_init, _ = np.linalg.svd(L_init, full_matrices=False) + obj_init = f_init + lambda_nn * np.sum(s_init) + + L_final = trop_est._weighted_nuclear_norm_solve( + Y=R, + W=W, + L_init=L_init, + alpha=np.zeros(R.shape[1]), + beta=np.zeros(R.shape[0]), + lambda_nn=lambda_nn, + max_inner_iter=20, + ) + + f_final = np.sum(W * (R - L_final) ** 2) + _, s_final, _ = np.linalg.svd(L_final, full_matrices=False) + obj_final = f_final + lambda_nn * np.sum(s_final) + + assert ( + obj_final <= obj_init + 1e-10 + ), f"Objective did not decrease: {obj_final} > {obj_init}" + + nuclear_norm_R = np.sum(np.linalg.svd(R, compute_uv=False)) + nuclear_norm_L = np.sum(s_final) + assert ( + nuclear_norm_L < nuclear_norm_R + ), f"Nuclear norm not reduced: {nuclear_norm_L} >= {nuclear_norm_R}" + + def test_weighted_nuclear_norm_objective_recovers_att(self): + """Eq. 2 weighted objective with active regularisation: TROP with + a non-zero ``lambda_nn`` grid (on an interactive-FE DGP) recovers + a finite positive ATT and a non-negative effective_rank — + exercising the weighted prox + alternating-min code path. (The + ``effective_rank`` assertion is `>= 0` rather than `> 0` because + the test DGP's factor structure may be absorbed by the prox + regulariser; the active code path is verified by the positive- + ATT recovery, not by a non-zero rank claim.) + + Scope note: this test does NOT fit an unregularised baseline for + comparison; for the DID-vs-MC ranking on a confounded factor DGP + see `TestTROPSpecialCases::test_matrix_completion_reduction_uniform_weights_finite_nn`. + + Origin: ported from the pre-migration + `TestPaperConformanceFixes::test_issue_c_weighted_nuclear_norm` + in `test_trop.py` (the pre-migration class was deleted in the + methodology-promotion PR; the test was migrated here). + """ + rng = np.random.default_rng(456) + + n_units = 15 + n_periods = 8 + n_treated = 3 + true_att = 2.0 + + loadings = rng.normal(0, 1, n_units) + factors = rng.normal(0, 1, n_periods) + + data = [] + for i in range(n_units): + is_treated = i < n_treated + for t in range(n_periods): + post = t >= 5 + y = 10.0 + loadings[i] * factors[t] + treatment_indicator = 1 if (is_treated and post) else 0 + if treatment_indicator: + y += true_att + y += rng.normal(0, 0.3) + data.append( + { + "unit": i, + "period": t, + "outcome": y, + "treated": treatment_indicator, + } + ) + + df = pd.DataFrame(data) + + trop_est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[0.1, 1.0], # active regularisation + max_iter=500, # converge cleanly without "may be inaccurate" warning + n_bootstrap=10, + seed=42, + ) + results = trop_est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + + assert results.effective_rank >= 0 + assert results.att > 0, f"ATT={results.att:.3f} should be positive" + + def test_weighted_nuclear_norm_solver_reduces_sv_mass(self): + """Eq. 2 nuclear-norm regularisation: the fitted L has strictly + smaller total singular-value mass than the input Y. + + Origin: ported from the pre-migration + `TestPaperConformanceFixes::test_weighted_nuclear_norm_solver_convergence` + in `test_trop.py` (the pre-migration class was deleted in the + methodology-promotion PR; the test was migrated here). + """ + trop_est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[1.0], + ) + + n_periods = 5 + n_units = 8 + + Y = np.random.default_rng(42).normal(0, 1, (n_periods, n_units)) + W = np.ones((n_periods, n_units)) + L_init = np.zeros((n_periods, n_units)) + alpha = np.zeros(n_units) + beta = np.zeros(n_periods) + + L = trop_est._weighted_nuclear_norm_solve( + Y, W, L_init, alpha, beta, lambda_nn=1.0, max_inner_iter=20 + ) + + assert np.all(np.isfinite(L)) + _, s, _ = np.linalg.svd(L, full_matrices=False) + _, s_orig, _ = np.linalg.svd(Y, full_matrices=False) + assert np.sum(s) < np.sum( + s_orig + ), "Nuclear-norm regularisation should reduce total singular-value mass" + + def test_factor_matrix_consistent_with_treatment_effects(self): + """Eq. 10 corollary: the fitted ``factor_matrix`` (L_hat) is + consistent with the per-cell counterfactual implied by + ``treatment_effects``. + + Eq. 10 (paper p. 22) writes the estimated counterfactual at a + treated cell as ``L_hat_NT`` plus three weighted (Y - L_hat) + averages over control / pre-treatment slices. The exact identity + depends on the internal weight vectors (theta_s^{i,t}, omega_j^{i,t}) + which TROP computes from the user's lambda_time / lambda_unit + plus per-(i,t) unit distances and is not part of the public API + — so we cannot reconstruct the four-component sum here. What we + CAN verify is that the resulting ``factor_matrix`` has the same + shape as the (period, unit) outcome grid, is finite, and that + ``treatment_effects`` is populated with finite entries. **No + non-triviality / magnitude claim** is made on ``L_hat`` because + the test DGP has no interactive factor structure (additive unit + + time effects only) and a near-zero ``L_hat`` is methodologically + correct under the paper's framework. + + For a direct numerical realisation of the Eq. 10 decomposition, + see Athey et al. (2025) Section 5.2, which derives the + decomposition under the block (N_0, T_0) assignment. The library + relies on the same Eq. 2 prox + alternating-min stack as the + derivation; the soft-threshold-SVD / FISTA / weighted-prox tests + above (and `test_weighted_nuclear_norm_objective_recovers_att`) + cover the ingredients. + """ + rng = np.random.default_rng(_BASE_SEED_NUCLEAR_PROX) + n_units = 8 + n_treated = 1 + n_pre = 5 + n_post = 1 + n_periods = n_pre + n_post + + rows = [] + for i in range(n_units): + is_treated = i < n_treated + unit_fe = rng.normal(0, 0.3) + for t in range(n_periods): + y = 5.0 + unit_fe + 0.1 * t + rng.normal(0, 0.2) + d = 1 if (is_treated and t >= n_pre) else 0 + rows.append({"unit": i, "period": t, "outcome": y, "treated": d}) + df = pd.DataFrame(rows) + + trop_est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[0.1], + max_iter=500, # converge cleanly without "may be inaccurate" warning + n_bootstrap=2, + seed=42, + ) + results = trop_est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + + L = np.asarray(results.factor_matrix, dtype=float) + assert L.shape == (n_periods, n_units) + assert np.all(np.isfinite(L)) + # NOTE: this DGP has only additive unit + time effects plus iid + # noise — no interactive factor structure. Per the paper's + # framework, ``alpha_i`` / ``beta_t`` absorb the additive + # surfaces, so a near-zero ``L_hat`` is methodologically correct + # here. Asserting ``effective_rank > 0`` would lock a solver + # artifact (e.g., regularisation under-shrinkage) rather than + # the intended low-rank behavior. The shape + finiteness check + # above + the treatment_effects existence check below are the + # legitimate structural surface for this test; the Eq. 2 + # ingredients are independently verified in the prox / FISTA / + # weighted-norm tests above. + assert results.treatment_effects is not None + # The single treated cell (i=0, t=n_pre) must be in treatment_effects. + # Resolve whatever unit / period values were used in the input frame. + assert any( + np.isfinite(value) for value in results.treatment_effects.values() + ), "treatment_effects must contain at least one finite entry" + + +# ============================================================================= +# TestTROPEquation3Weights — Eq. 3 (unit / time weight semantics) +# ============================================================================= + + +class TestTROPEquation3Weights: + """Eq. 3: exponential-decay unit and time weights. + + Three direct paper-formula assertions: + + - ``test_distance_excludes_target_period`` (extracted from the + pre-migration ``TestPaperConformanceFixes::test_issue_b`` in + `test_trop.py` — the pre-migration class was deleted in the + methodology-promotion PR): end-to-end fit smoke that the + target-period anomaly does not dominate. + - ``test_unit_distance_uses_untreated_only_mask`` (NEW): direct + assertion that ``TROP._compute_unit_distance_for_obs`` matches + the paper's + ``dist_unit_{-t}(j, i) = sqrt(sum_{u != t}(1-W_iu)(1-W_ju)(Y_iu - Y_ju)^2 + / sum_{u != t}(1-W_iu)(1-W_ju))`` + formula, hand-computed on a constructed Y / D where the masking + gives an unambiguous answer. + - ``test_time_weights_match_exp_decay_formula`` (NEW): direct + assertion that the per-(i, t) weight matrix's time-axis + column equals ``exp(-lambda_time * |t - s|)``. + + Origin: target-period-exclusion test ported from the pre-migration + `TestPaperConformanceFixes::test_issue_b_distance_excludes_target_period` + in `test_trop.py` (the pre-migration class was deleted in the + methodology-promotion PR); the unit-distance-mask and time-decay + assertions are new direct Eq. 3 locks. + """ + + def test_distance_excludes_target_period(self): + """Eq. 3: pairwise unit distance uses 1{u != t} to exclude period t. + + Construct a panel where the treated unit's outcome at the target + period (t = 3) is anomalous (Y = 100, vs Y ~ 5 elsewhere). Without + the 1{u != t} mask, this single observation would dominate every + pairwise unit distance and break the weight calculation. With the + mask, the fit completes and returns a finite ATT. + """ + rng = np.random.default_rng(123) + + n_units = 10 + n_periods = 6 + data = [] + for i in range(n_units): + is_treated = i == 0 + for t in range(n_periods): + if is_treated and t == 3: + y = 100.0 # anomalous target-period outcome + elif is_treated and t >= 3: + y = 5.0 + rng.normal(0, 0.1) + else: + y = 5.0 + rng.normal(0, 0.1) + + treatment_indicator = 1 if (is_treated and t >= 3) else 0 + data.append( + { + "unit": i, + "period": t, + "outcome": y, + "treated": treatment_indicator, + } + ) + + df = pd.DataFrame(data) + + trop_est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[1.0], + lambda_nn_grid=[0.0], + n_bootstrap=5, + seed=42, + ) + + results = trop_est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + + assert results is not None + assert np.isfinite(results.att) + + def test_unit_distance_uses_untreated_only_mask(self): + """Eq. 3: ``dist_unit_{-t}(j, i)`` uses only periods where both + units are untreated (the ``(1 - W_iu)(1 - W_ju)`` mask) AND + excludes the target period t. + + Hand-construct ``Y`` and ``D`` where the treatment pattern is + known, then verify ``TROP._compute_unit_distance_for_obs`` returns + the paper's hand-computed RMS distance (sum of squared + differences over the valid mask, normalised by the count of + valid periods). + """ + # Treated unit i = 0 is treated at t = 2 onward; control unit + # j = 1 is untreated throughout. Target period t = 2. + # Untreated periods for both: u in {0, 1}. Target period t = 2 is + # also excluded by the 1{u != t} mask. Periods u >= 2 are + # treated for i and excluded by (1 - W_iu). + n_periods = 6 + n_units = 2 + Y = np.zeros((n_periods, n_units)) + # Y_0u: treated unit, set so that (Y_iu - Y_ju)^2 = 4 at u=0, + # = 9 at u=1, others irrelevant due to mask. + Y[0, 0] = 2.0 + Y[1, 0] = 3.0 + Y[2, 0] = 100.0 # target period — excluded by 1{u != t} + Y[3, 0] = 99.0 # treated period — excluded by (1 - W_iu) + Y[4, 0] = 98.0 # treated period — excluded by (1 - W_iu) + Y[5, 0] = 97.0 # treated period — excluded by (1 - W_iu) + Y[:, 1] = 0.0 # j is constant 0 + D = np.zeros((n_periods, n_units), dtype=int) + D[2:, 0] = 1 # i treated from t = 2 + + est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[0.1], + n_bootstrap=2, + seed=42, + ) + dist = est._compute_unit_distance_for_obs(Y=Y, D=D, j=1, i=0, target_period=2) + + # Valid mask: u in {0, 1} (target excluded + both untreated). + # Squared differences: (2-0)^2 = 4, (3-0)^2 = 9. Mean = 6.5. + # RMS = sqrt(6.5). + expected = float(np.sqrt(6.5)) + assert np.isclose(dist, expected, rtol=1e-12), ( + f"Eq. 3 unit distance: {dist:.6e} != hand-computed " + f"{expected:.6e} over the valid (untreated-only," + " target-excluded) mask" + ) + + def test_time_weights_match_exp_decay_formula(self): + """Eq. 3 time weights: ``theta_s^{i,t} = exp(-lambda_time * |t - s|)``. + + Direct assertion on the per-(i, t) weight matrix's time-axis + slice: at column ``j != i``, the time-decay factor should be + ``exp(-lambda_time * |t - s|)`` for each s. Since + ``_compute_observation_weights`` returns the outer product of + ``time_weights`` and ``unit_weights`` (shape ``(n_periods, n_units)``), + column ``j`` equals ``unit_weights[j] * time_weights``. At + ``lambda_unit = 0`` (uniform unit weights = 1), column ``j`` + equals ``time_weights`` exactly. + """ + n_units = 4 + n_periods = 7 + target_t = 3 + lambda_time = 0.5 + rng = np.random.default_rng(_BASE_SEED_EQ3_WEIGHTS) + Y = rng.normal(0, 1, (n_periods, n_units)) + D = np.zeros((n_periods, n_units), dtype=int) + D[target_t:, 0] = 1 # one absorbing-state treated unit + + est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[0.1], + n_bootstrap=2, + seed=42, + ) + weights = est._compute_observation_weights( + Y=Y, + D=D, + i=0, + t=target_t, + lambda_time=lambda_time, + lambda_unit=0.0, # uniform unit weights => column j = time_weights + control_unit_idx=np.arange(1, n_units), + n_units=n_units, + n_periods=n_periods, + ) + + # Hand-computed time weights: exp(-lambda_time * |t - s|) + expected_time = np.exp(-lambda_time * np.abs(np.arange(n_periods) - target_t)) + + # Column j = 1 (a control unit) should equal expected_time + # (unit_weights[1] = 1.0 under lambda_unit = 0). + np.testing.assert_array_almost_equal( + weights[:, 1], + expected_time, + decimal=12, + err_msg=( + "Eq. 3 time weights at column j=1 do not match " + f"exp(-{lambda_time} * |t-s|) hand-computed values" + ), + ) + + +# ============================================================================= +# TestTROPAlgorithm1LOOCV — Eqs. 4-5 + Algorithm 1 (LOOCV) +# ============================================================================= + + +class TestTROPAlgorithm1LOOCV: + """Eqs. 4-5 + Algorithm 1: leave-one-out cross-validation. + + Verifies that Q(lambda) sums squared pseudo-treatment effects over ALL + control observations where D_js = 0 (including pre-treatment periods + of eventually-treated units, not just never-treated units), and that + the two-stage coordinate-descent cycling search (paper footnote 2) + converges to a grid point. + + Origin: + `test_issue_a_control_includes_pretreatment_obs` ported from the + pre-migration `TestPaperConformanceFixes` class in `test_trop.py` + (deleted in the methodology-promotion PR); one tightly-scoped + cycling-convergence assertion ported from + `tests/test_trop.py::TestCyclingSearch` (which retains its other + LOOCV tests for the fallback / single-value-grid surfaces — those + are defensive and stayed in `test_trop.py`). + """ + + def test_control_set_includes_pretreat_of_eventually_treated(self): + """Eq. 2 / Eq. 5: control set is {(j, s) : (1 - W_{js}) > 0}, + which includes the pre-treatment observations of eventually- + treated units, not just never-treated units. + + Construct staggered adoption with two treatment cohorts (early at + t=3, late at t=5) plus never-treated. With a non-zero lambda_unit + and the Eq. 2 control set, TROP exploits late-treated pre-period + observations to fit early-treated counterfactuals (their levels + match) and recovers a positive ATT. + """ + rng = np.random.default_rng(42) + n_units = 20 + n_early_treat = 5 + n_late_treat = 5 + n_periods = 8 + true_att = 2.0 + + data = [] + for i in range(n_units): + if i < n_early_treat: + treat_period: int | None = 3 + unit_fe = 5.0 + elif i < n_early_treat + n_late_treat: + treat_period = 5 + unit_fe = 5.5 + else: + treat_period = None + unit_fe = 10.0 # controls have distinct level + + for t in range(n_periods): + is_post = treat_period is not None and t >= treat_period + treatment_indicator = 1 if is_post else 0 + y = unit_fe + 0.2 * t + if treatment_indicator: + y += true_att + y += rng.normal(0, 0.3) + data.append( + { + "unit": i, + "period": t, + "outcome": y, + "treated": treatment_indicator, + } + ) + + df = pd.DataFrame(data) + + trop_est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[1.0], # unit weights so distance matters + lambda_nn_grid=[0.0], + n_bootstrap=10, + seed=42, + ) + results = trop_est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + + assert results.att > 0, f"ATT={results.att:.3f} should be positive" + + def test_cycling_search_converges_to_grid_point(self): + """Algorithm 1 / footnote 2: two-stage coordinate-descent cycling + always converges to a tuple of values from the input grids. + + Confirms the cycling-search invariant: the returned + ``(lambda_time, lambda_unit, lambda_nn)`` must be members of the + user-supplied grids, and ``att`` must be finite, ``se >= 0``. + Origin: ported from + `tests/test_trop.py::TestCyclingSearch::test_cycling_search_converges`. + """ + df = _make_no_factor_panel( + n_units=20, + n_treated=5, + n_pre=5, + n_post=3, + noise_sd=0.5, + treatment_effect=3.0, + seed=_BASE_SEED_ALG1_LOOCV, + ) + + trop_est = TROP( + lambda_time_grid=[0.0, 0.5, 1.0], + lambda_unit_grid=[0.0, 0.5, 1.0], + lambda_nn_grid=[0.0, 0.1, 1.0], + n_bootstrap=5, + seed=42, + ) + + results = trop_est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + + assert results.lambda_time in trop_est.lambda_time_grid + assert results.lambda_unit in trop_est.lambda_unit_grid + assert results.lambda_nn in trop_est.lambda_nn_grid + + assert np.isfinite(results.att) + assert results.se >= 0 + + +# ============================================================================= +# TestTROPCorollary1Unbiasedness — three single-condition DGPs +# ============================================================================= + + +class TestTROPCorollary1Unbiasedness: + """Corollary 1: unbiased if ANY ONE of three balance conditions holds. + + The paper states (page 23): under Assumption 1, for fixed (non-data- + dependent) weights theta, omega, + `|E[tau_hat - tau | L]| <= ||Delta_u(omega, Gamma)||_2 * ||Delta_t(theta, Lambda)||_2 * ||B||_*` + so the estimator is unbiased if any one of: + (a) Unit balance: sum_{i in C} omega_i Gamma_i = Gamma_N + (b) Time balance: sum_{s in C} theta_s Lambda_s = Lambda_T + (c) Correct regression adjustment: B = 0_K + + Each test constructs a DGP that makes one condition hold trivially + (constant loadings, constant factor scores, or no factor structure + with lambda_nn=infinity), and asserts TROP recovers the ATT within + a 3-sigma MC band even when the other components are sub-optimal. + Tests are NEW — no extraction from `test_trop.py`. + """ + + def test_unit_balance_constant_loadings(self): + """Corollary 1(a): constant unit loadings make ``Delta_u = 0`` for + any non-negative weight vector ``omega``. TROP is unbiased even + with sub-optimal ``lambda_unit = 0`` (uniform unit weights). + + The DGP has shared ``Gamma_i = Gamma`` for all units (so the + product ``sum omega_i Gamma_i = Gamma`` regardless of omega) and + random ``Lambda_t`` across periods. Treated cells receive a + constant treatment effect. + """ + df = _make_constant_loading_panel( + n_units=30, + n_treated=8, + n_pre=6, + n_post=4, + n_factors=2, + factor_strength=1.0, + noise_sd=0.3, + treatment_effect=2.0, + seed=_BASE_SEED_COROLLARY_1, + ) + + # Deliberately sub-optimal unit weighting (lambda_unit = 0 == uniform). + # Theorem says we're still unbiased because the unit-balance term is 0. + est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[1.0], + n_bootstrap=20, + seed=42, + ) + results = est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + + # 3-sigma band: with constant loadings, unit balance is trivial, + # so any reasonable estimator (TROP included) is centred on truth. + assert abs(results.att - 2.0) < 3.0 * results.se, ( + f"ATT={results.att:.4f} not within 3 SE={results.se:.4f} " + f"of true=2.0 under unit balance" + ) + + def test_time_balance_constant_factors(self): + """Corollary 1(b): constant factor scores make ``Delta_t = 0`` for + any non-negative weight vector ``theta``. TROP is unbiased even + with sub-optimal ``lambda_time = 0``. + + The DGP has shared ``Lambda_s = Lambda`` for all periods (so the + product ``sum theta_s Lambda_s = Lambda`` regardless of theta) and + random ``Gamma_i`` with selection-on-loadings (treated units + shifted) — so DID would be biased, but TROP recovers ATT. + """ + df = _make_constant_factor_panel( + n_units=30, + n_treated=8, + n_pre=6, + n_post=4, + n_factors=2, + factor_strength=1.0, + noise_sd=0.3, + treatment_effect=2.0, + seed=_BASE_SEED_COROLLARY_1 + 1, + ) + + # Sub-optimal time weighting (lambda_time = 0). Time-balance + # condition makes this irrelevant. + est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[1.0], + n_bootstrap=20, + seed=42, + ) + results = est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + + assert abs(results.att - 2.0) < 3.0 * results.se, ( + f"ATT={results.att:.4f} not within 3 SE={results.se:.4f} " + f"of true=2.0 under time balance" + ) + + def test_zero_regression_bias_no_factor_dgp(self): + """Corollary 1(c): when ``B = 0_K`` (no regression-adjustment bias), + TROP is unbiased regardless of weights. + + Setting ``lambda_nn = infinity`` forces ``L = 0`` (paper Section 2.2: + factor model disabled), so the estimator reduces to TWFE on a + TWFE-clean DGP and ``B = 0`` trivially. ATT recovery is sharp. + """ + df = _make_no_factor_panel( + n_units=30, + n_treated=8, + n_pre=6, + n_post=4, + noise_sd=0.3, + treatment_effect=2.0, + seed=_BASE_SEED_COROLLARY_1 + 2, + ) + + est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[np.inf], # factor model disabled --> B = 0 + n_bootstrap=20, + seed=42, + ) + results = est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + + assert abs(results.att - 2.0) < 3.0 * results.se, ( + f"ATT={results.att:.4f} not within 3 SE={results.se:.4f} " f"of true=2.0 under B=0" + ) + # Sanity: lambda_nn was preserved as inf in results metadata + # (REGISTRY ## TROP "λ_nn=∞ implementation" note: "TROPResults stores ORIGINAL lambda_nn value (inf)"). + assert results.lambda_nn == np.inf + + +# ============================================================================= +# TestTROPTheorem51TripleRobustness — MC-ranking realisation of bias bound +# ============================================================================= + + +@pytest.mark.slow +class TestTROPTheorem51TripleRobustness: + """Theorem 5.1: empirical realisation of the triply-robust bias bound. + + The paper's bound + `|E[tau_hat - tau | L]| <= ||Delta_u||_2 * ||Delta_t||_2 * ||B||_*` + (product of three components rather than sum) is "strictly tighter + than bounds for DID, SC, and SDID" (paper Section 5.2 + Eq. 11). + A direct oracle-Gamma/Lambda/B test was spiked and found to require + constructing B from the regression-adjustment estimator class + (Assumption 2 bias matrix), which is brittle under finite-sample MC + noise. The methodology test therefore verifies the bound's + empirical realisation: + + TROP RMSE < DID RMSE under a confounded factor DGP, over a MC sweep + of independent panels. The factor DGP induces interactive fixed-effect + bias that the DiD benchmark cannot handle, while TROP's three robustness + components jointly absorb the confounding. + + Tests are NEW. The MC-ranking pattern also dedupes the + factor-DGP coverage from the pre-migration + `TestTROPvsSDID::test_trop_handles_factor_dgp` in `test_trop.py` + (which only asserted ``att != 0``, not ranking against DID; the + pre-migration `TestTROPvsSDID` class was deleted in the + methodology-promotion PR). + """ + + def test_trop_rmse_strictly_below_did_under_factor_confounding(self, ci_params): + """Theorem 5.1 (Eq. 11 ranking): on a confounded factor DGP with + true ``tau = 0``, TROP RMSE is strictly below DID RMSE across + independent MC replicates. + + The factor DGP induces interactive-FE bias that the DiD benchmark cannot + handle; TROP's three robustness components jointly absorb the + confounding. Empirical magnitude (spike measurement at + ``factor_strength=1.0``, 15 reps): TROP/DID RMSE ratio ~ 0.34 + (3x advantage). The assertion uses a generous ``ratio < 0.7`` + margin to absorb finite-sample MC noise and per-rep + loocv-tuning variance. + """ + n_reps = ci_params.bootstrap(10, min_n=5) + trop_atts = [] + did_atts = [] + + for rep in range(n_reps): + df = _make_trop_factor_panel( + n_units=24, + n_treated=6, + n_pre=6, + n_post=3, + n_factors=2, + factor_strength=1.0, + treated_loading_shift=0.8, + noise_sd=0.3, + treatment_effect=0.0, + seed=_BASE_SEED_THEOREM_51 + rep, + ) + + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + trop_est = TROP( + lambda_time_grid=[0.0, 1.0], + lambda_unit_grid=[0.0, 1.0], + lambda_nn_grid=[0.0, 1.0], + n_bootstrap=5, + seed=_BASE_SEED_THEOREM_51 + rep, + ) + r_trop = trop_est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + did_att = _fit_did(df) + + if np.isfinite(r_trop.att): + trop_atts.append(r_trop.att) + if np.isfinite(did_att): + did_atts.append(did_att) + + # Both estimators must have produced enough usable estimates. + assert len(trop_atts) >= max( + 5, n_reps // 2 + ), f"Only {len(trop_atts)} of {n_reps} TROP fits returned finite ATT" + assert len(did_atts) >= max( + 5, n_reps // 2 + ), f"Only {len(did_atts)} of {n_reps} DiD fits returned finite ATT" + + trop_rmse = float(np.sqrt(np.mean(np.asarray(trop_atts) ** 2))) + did_rmse = float(np.sqrt(np.mean(np.asarray(did_atts) ** 2))) + ratio = trop_rmse / did_rmse + + # Theorem 5.1 ranking realisation: TROP MSE strictly below DID MSE. + # Generous margin (0.7) absorbs MC noise; spike measurement + # showed ratio ~ 0.34 at the calibrated DGP. + assert ratio < 0.7, ( + f"TROP/DID RMSE ratio {ratio:.3f} not below 0.7 " + f"(TROP RMSE={trop_rmse:.4f}, DID RMSE={did_rmse:.4f}, " + f"true tau=0). Theorem 5.1 bias bound predicts TROP should " + f"strictly dominate DID under factor confounding." + ) + + +# ============================================================================= +# TestTROPSpecialCases — Section 2.2 reductions (DID, MC, SDID) +# ============================================================================= + + +class TestTROPSpecialCases: + """Section 2.2 reductions: TROP collapses to DID and MC under specific + tunings. + + Paper Section 2.2 (page 6): + - lambda_nn=infinity AND omega_j=theta_s=1 (uniform weights) + --> reduces to DID / TWFE + - omega_j=theta_s=1 (uniform weights) AND lambda_nn reduces to Matrix Completion (Athey et al. 2021, MC) + - lambda_nn=infinity AND specific (omega, theta) weight choices + --> reduces to SC and SDID + + This class verifies the DID and MC reductions on clean DGPs. The + SC and SDID reductions are intentionally skipped: the paper claims + they hold "with specific choices of unit and time weights" without + providing the omega/theta map, and the library does not expose an + SC- or SDID-matching weight setter (only ``lambda_unit`` / + ``lambda_time`` decay rates per Eq. 3). Cross-language anchor + against `SyntheticDiD` / a synthetic-control reference is deferred + until paper-author code clarifies the weight map. (Documented in + `METHODOLOGY_REVIEW.md` ``TROP`` section under Deviations.) + + Tests are NEW. The factor-DGP smoke previously in the + pre-migration `TestTROPvsSDID::test_trop_handles_factor_dgp` is + subsumed by `TestTROPTheorem51TripleRobustness` (which tests a + stronger MC-ranking claim); the pre-migration `TestTROPvsSDID` + class was deleted in the methodology-promotion PR. + """ + + def test_did_reduction_lambda_nn_inf_uniform_weights(self): + """Section 2.2 first bullet: with ``lambda_nn = infinity`` and + uniform weights (``lambda_time = lambda_unit = 0``), TROP and + basic DiD should produce close ATT estimates on a no- + interactive-FE panel (additive unit + time effects only). + + **This is a benchmark sanity check, not an algebraic-equivalence + proof.** The paper's Section 2.2 reduction is stated for the + 2×2 block-assignment case; this test uses a multi-period panel + (n_pre=6, n_post=4) where the library's basic DiD is the + canonical comparator but not the algebraic target. The 0.5 + tolerance absorbs finite-sample MC noise. A direct algebraic- + reduction test (true 2-period panel or `TwoWayFixedEffects` + comparator) is deferred. + """ + df = _make_no_factor_panel( + n_units=30, + n_treated=8, + n_pre=6, + n_post=4, + noise_sd=0.3, + treatment_effect=2.0, + seed=_BASE_SEED_SPECIAL_CASES, + ) + + trop_est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[np.inf], # factor model disabled + n_bootstrap=10, + seed=42, + ) + r_trop = trop_est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + + did_att = _fit_did(df) + + assert np.isfinite(r_trop.att) + assert np.isfinite(did_att) + + # Same DGP, same identification --> ATTs match within MC noise. + # 0.5 absolute tolerance is generous given true_att = 2.0 with + # noise_sd = 0.3 and finite sample. + assert abs(r_trop.att - did_att) < 0.5, ( + f"TROP ATT={r_trop.att:.4f} and DID ATT={did_att:.4f} " + f"should match under lambda_nn=inf + uniform weights " + f"(paper Section 2.2 DID reduction)" + ) + + def test_matrix_completion_reduction_uniform_weights_finite_nn(self): + """Section 2.2 second bullet: with uniform weights and finite + ``lambda_nn``, TROP reduces to a Matrix Completion estimator. + + **This is NOT an equivalence check against an independent MC + reference implementation.** The paper does not provide a separate + MC algorithm specification beyond "uniform weights + finite + nuclear-norm penalty", and the library does not bundle an + independent MC port. What this test DOES verify is that the + nuclear-norm prox code path activates under the MC tuning + (effective_rank > 0) and that the resulting ATT beats the + DID-style (lambda_nn=infinity) baseline on a factor-confounded + DGP. A true equivalence test would require either an external + MC port from R or a hand-written reference solver. + """ + df = _make_trop_factor_panel( + n_units=25, + n_treated=6, + n_pre=7, + n_post=3, + n_factors=2, + factor_strength=1.2, + treated_loading_shift=0.5, + noise_sd=0.3, + treatment_effect=2.0, + seed=_BASE_SEED_SPECIAL_CASES + 1, + ) + + # MC-style: uniform weights + finite lambda_nn enables factor model. + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + mc_est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[0.1, 1.0], # finite --> factor model on + n_bootstrap=10, + seed=42, + ) + r_mc = mc_est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + + # DID-style baseline: same uniform weights, infinite lambda_nn. + did_est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[np.inf], + n_bootstrap=10, + seed=42, + ) + r_did_via_trop = did_est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + + true_att = 2.0 + # MC reduction should recover the ATT strictly better than the + # no-factor baseline because the DGP has interactive FE. No + # tolerance slack — under factor confounding the DID-style + # baseline is biased and MC must strictly improve on it. + assert abs(r_mc.att - true_att) < abs(r_did_via_trop.att - true_att), ( + f"MC reduction ATT error |{r_mc.att:.4f} - {true_att}| " + f"should be strictly LESS than DID-style baseline error " + f"|{r_did_via_trop.att:.4f} - {true_att}|" + ) + # MC reduction's factor matrix must be non-trivially active. + assert r_mc.effective_rank > 0, ( + f"MC reduction effective_rank={r_mc.effective_rank} should be > 0 " + f"(finite lambda_nn engages the prox solver)" + ) + + +# ============================================================================= +# TestTROPAlgorithm2MultipleTreated — Eq. 13 + Algorithm 2 +# ============================================================================= + + +class TestTROPAlgorithm2MultipleTreated: + """Eq. 13 + Algorithm 2: per-(i, t) estimation for multiple treated units. + + For each treated observation (i, t), Eq. 13 fits a separate model as if + (i, t) were the only treated cell, with observation-specific weights + omega_j^{i,t}, theta_s^{i,t}. The ATT then averages over all W_it=1 + cells via Eq. 1: + `tau_hat = (1 / sum_{i,t} W_it) sum_{i,t} W_it tau_hat_{it}(lambda_hat)`. + + This supports general assignment patterns (Section 6.1) including + staggered adoption and heterogeneous treatment effects (Remark 6.1). + Tests are NEW. + """ + + def test_treatment_effects_dict_has_entry_per_treated_cell(self): + """Algorithm 2: ``TROPResults.treatment_effects`` contains one + ``tau_hat_it`` entry per treated (unit, period) cell. + + With ``n_treated`` treated units and ``n_post`` post-treatment + periods under block assignment, there are ``n_treated * n_post`` + treated cells. Each must have a finite per-cell estimate. + """ + df = _make_no_factor_panel( + n_units=20, + n_treated=4, + n_pre=5, + n_post=3, + noise_sd=0.3, + treatment_effect=2.0, + seed=_BASE_SEED_ALG2_MULTI, + ) + + est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[0.1], + max_iter=500, # converge cleanly without "may be inaccurate" warning + n_bootstrap=5, + seed=42, + ) + results = est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + + n_treated_cells = 4 * 3 + assert results.treatment_effects is not None + assert len(results.treatment_effects) == n_treated_cells, ( + f"Expected {n_treated_cells} per-cell tau_hat entries, " + f"got {len(results.treatment_effects)}" + ) + for cell, tau_it in results.treatment_effects.items(): + assert np.isfinite(tau_it), f"Per-cell tau_hat at {cell} should be finite, got {tau_it}" + + def test_att_equals_mean_of_per_cell_effects(self): + """Eq. 1: ``tau_hat = (1 / sum W_it) sum W_it tau_hat_it``. + + With block assignment (no observation-level weight kwargs), the + reduction is just the unweighted mean of per-cell ``tau_hat_it`` + entries in ``treatment_effects``. Match within 1e-8 (no additional + post-aggregation step in the ATT pipeline). + """ + df = _make_no_factor_panel( + n_units=18, + n_treated=4, + n_pre=5, + n_post=2, + noise_sd=0.3, + treatment_effect=1.5, + seed=_BASE_SEED_ALG2_MULTI + 1, + ) + + est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[0.1], + max_iter=500, # converge cleanly without "may be inaccurate" warning + n_bootstrap=5, + seed=42, + ) + results = est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + + assert results.treatment_effects is not None + per_cell = np.array(list(results.treatment_effects.values())) + mean_per_cell = float(np.mean(per_cell)) + assert np.isclose(results.att, mean_per_cell, atol=1e-8), ( + f"ATT={results.att:.10f} != mean of per-cell effects " + f"{mean_per_cell:.10f} (Eq. 1 / Algorithm 2 step 3)" + ) + + +# ============================================================================= +# TestTROPAlgorithm3Bootstrap — stratified pairs bootstrap +# ============================================================================= + + +class TestTROPAlgorithm3Bootstrap: + """Algorithm 3: stratified pairs bootstrap. + + Paper Algorithm 3 (page 27) resamples N_0 control rows and N_1 treated + rows with replacement SEPARATELY (not pooled), preserving the + treatment ratio across replicates and within-unit temporal + correlation. This is a pairs bootstrap, NOT a multiplier / Rao-Wu + bootstrap. + + Origin: ported from the pre-migration + `TestPaperConformanceFixes::test_issue_d_stratified_bootstrap` in + `test_trop.py` (the pre-migration class was deleted in the + methodology-promotion PR). Bootstrap-failure-rate and bootstrap- + NaN-SE guards are defensive surfaces and stayed in + `tests/test_trop.py::TestTROPBootstrapFailureRateGuard` and + `tests/test_trop.py::TestTROPBootstrapNaNSE`. + """ + + def test_stratified_pairs_resampling_completes(self, ci_params): + """Algorithm 3: stratified pairs bootstrap completes successfully on + an unbalanced (3 treated, 17 control) panel and yields a positive + finite SE. + + With N_1=3 treated and N_0=17 control units, naive pooled + resampling would sometimes draw 0 treated rows; the stratified + sampler always draws 3 treated + 17 control per replicate, so the + bootstrap distribution converges and the SE is positive. + """ + rng = np.random.default_rng(789) + + n_treated = 3 + n_control = 17 + n_units = n_treated + n_control + n_periods = 6 + true_att = 2.0 + + data = [] + for i in range(n_units): + is_treated = i < n_treated + for t in range(n_periods): + post = t >= 3 + y = 10.0 + rng.normal(0, 0.5) + treatment_indicator = 1 if (is_treated and post) else 0 + if treatment_indicator: + y += true_att + data.append( + { + "unit": i, + "period": t, + "outcome": y, + "treated": treatment_indicator, + } + ) + + df = pd.DataFrame(data) + + n_boot = ci_params.bootstrap(30) + trop_est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[0.0], + n_bootstrap=n_boot, + seed=42, + ) + results = trop_est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + + assert results.bootstrap_distribution is not None + min_successes = max(5, int(0.67 * n_boot)) + assert len(results.bootstrap_distribution) >= min_successes, ( + f"Expected >= {min_successes} successful bootstrap draws " + f"out of {n_boot}, got {len(results.bootstrap_distribution)}" + ) + assert results.se > 0 + assert np.isfinite(results.se) + + +# ============================================================================= +# TestTROPEquation6FactorDGPRecovery — Section 3 / Eq. 6 semi-synthetic +# ============================================================================= + + +@pytest.mark.slow +class TestTROPEquation6FactorDGPRecovery: + """Section 3 / Eq. 6: semi-synthetic factor-DGP recovery. + + The paper's simulation framework (Section 3.1, pages 9-11) uses a + rank-4 factor model with AR(2) errors over 6 real-data backbones. + True treatment effects are zero (placebo). The library's equivalent + DGP is built via `_make_trop_factor_panel` (wraps + `diff_diff.prep.generate_factor_data`). + + Origin: 5 tests ported verbatim from the pre-migration + `TestMethodologyVerification` class in `test_trop.py` (the + pre-migration class was deleted in the methodology-promotion PR; + original line range L552-878): limiting-case uniform weights, + unit-weight bias reduction, time-weight bias reduction, factor- + model bias reduction, and paper-DGP null-recovery. Verified that + the ported tests still pass after relocation. + """ + + def test_limiting_case_uniform_weights(self): + """Eq. 3 limiting case: lambda_unit = lambda_time = 0, lambda_nn = 0. + + With all lambdas at zero, TROP uses uniform weights and an + unregularised L (paper Section 2.2 first bullet, omega_j=theta_s=1 + case). This should give TWFE-like estimates on a TWFE-clean panel. + """ + rng = np.random.default_rng(42) + n_units = 15 + n_treated = 5 + n_pre = 5 + n_post = 3 + true_att = 3.0 + + data = [] + for i in range(n_units): + is_treated = i < n_treated + unit_fe = rng.normal(0, 0.5) + for t in range(n_pre + n_post): + post = t >= n_pre + time_fe = 0.2 * t + y = 10.0 + unit_fe + time_fe + treatment_indicator = 1 if (is_treated and post) else 0 + if treatment_indicator: + y += true_att + y += rng.normal(0, 0.3) + data.append( + { + "unit": i, + "period": t, + "outcome": y, + "treated": treatment_indicator, + } + ) + + df = pd.DataFrame(data) + + trop_est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[0.0], + n_bootstrap=10, + seed=42, + ) + results = trop_est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + + assert ( + abs(results.att - true_att) < 1.0 + ), f"ATT={results.att:.3f} should be close to true={true_att}" + assert results.lambda_time == 0.0 + assert results.lambda_unit == 0.0 + assert results.lambda_nn == 0.0 + + def test_unit_weights_reduce_bias(self): + """Eq. 3 unit weights: exp(-lambda_unit * RMSE_distance) reduces bias + when control units vary in similarity to treated. + + Heterogeneous controls (5 similar + remainder dissimilar) plus a + non-zero lambda_unit grid lets LOOCV pick informative weighting. + """ + rng = np.random.default_rng(123) + n_units = 25 + n_treated = 5 + n_pre = 6 + n_post = 3 + true_att = 2.5 + + data = [] + for i in range(n_units): + is_treated = i < n_treated + if is_treated or i < n_treated + 5: + unit_fe = 5.0 + rng.normal(0, 0.3) + else: + unit_fe = 10.0 + rng.normal(0, 0.5) + + for t in range(n_pre + n_post): + post = t >= n_pre + time_fe = 0.2 * t + y = unit_fe + time_fe + treatment_indicator = 1 if (is_treated and post) else 0 + if treatment_indicator: + y += true_att + y += rng.normal(0, 0.3) + data.append( + { + "unit": i, + "period": t, + "outcome": y, + "treated": treatment_indicator, + } + ) + + df = pd.DataFrame(data) + + trop_est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0, 1.0, 2.0], + lambda_nn_grid=[0.0], + n_bootstrap=10, + seed=42, + ) + results = trop_est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + + assert ( + abs(results.att - true_att) < 1.5 + ), f"ATT={results.att:.3f} should be close to true={true_att}" + + def test_time_weights_reduce_bias(self): + """Eq. 3 time weights: exp(-lambda_time * |t - s|) reduces bias on + trending pre-treatment outcomes. + + Quadratic pre-trend (time_fe = 0.1*t + 0.05*t^2/n_pre) makes recent + periods more informative for extrapolating the counterfactual. + """ + rng = np.random.default_rng(456) + n_units = 20 + n_treated = 5 + n_pre = 8 + n_post = 3 + true_att = 2.0 + + data = [] + for i in range(n_units): + is_treated = i < n_treated + unit_fe = rng.normal(0, 0.5) + + for t in range(n_pre + n_post): + post = t >= n_pre + time_fe = 0.1 * t + 0.05 * (t**2 / n_pre) + y = 10.0 + unit_fe + time_fe + treatment_indicator = 1 if (is_treated and post) else 0 + if treatment_indicator: + y += true_att + y += rng.normal(0, 0.3) + data.append( + { + "unit": i, + "period": t, + "outcome": y, + "treated": treatment_indicator, + } + ) + + df = pd.DataFrame(data) + + trop_est = TROP( + lambda_time_grid=[0.0, 0.5, 1.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[0.0], + n_bootstrap=10, + seed=42, + ) + results = trop_est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + + assert results.att > 0, f"ATT={results.att:.3f} should be positive" + assert results.lambda_time in [0.0, 0.5, 1.0] + + def test_factor_model_reduces_bias(self, ci_params): + """Eq. 2 / Section 2.2 MC reduction: nuclear-norm regularisation + reduces bias when the true DGP has interactive fixed effects. + + Generated via `_make_trop_factor_panel` (wraps + `diff_diff.prep.generate_factor_data`) with rank-2 factor + structure and selection-on-loadings. + """ + data = _make_trop_factor_panel( + n_units=25, + n_pre=7, + n_post=3, + n_treated=5, + n_factors=2, + treatment_effect=2.0, + factor_strength=1.5, + unit_fe_sd=1.0, + noise_sd=0.5, + seed=789, + ) + + n_boot = ci_params.bootstrap(20) + nn_grid = ci_params.grid([0.0, 0.1, 1.0, 5.0]) + trop_est = TROP( + lambda_time_grid=[0.0, 0.5], + lambda_unit_grid=[0.0, 0.5], + lambda_nn_grid=nn_grid, + n_bootstrap=n_boot, + seed=42, + ) + results = trop_est.fit( + data, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + + true_att = 2.0 + assert ( + abs(results.att - true_att) < 2.0 + ), f"ATT={results.att:.3f} should be within 2.0 of true={true_att}" + assert results.effective_rank > 0, "Factor matrix should have positive rank" + + def test_paper_dgp_recovery(self, ci_params): + """Section 3 Eq. 6 null recovery: factor DGP with treatment_effect=0 + produces ATT estimates centred near zero. + + Mirrors paper Table 2 settings (page 32) at reduced sample size: + 2 factors, selection on loadings and levels, linear time trend, + zero true treatment effect. The estimate should fall well inside a + normal-distribution 2-sigma band of zero. + """ + rng = np.random.default_rng(2024) + n_units = 30 + n_treated = 6 + n_pre = 7 + n_post = 3 + n_factors = 2 + true_tau = 0.0 + + F = rng.normal(0, 1, (n_pre + n_post, n_factors)) + + Lambda = rng.normal(0, 1, (n_factors, n_units)) + Lambda[:, :n_treated] += 0.5 + + gamma = rng.normal(0, 1, n_units) + gamma[:n_treated] += 1.0 + + delta = np.linspace(0, 2, n_pre + n_post) + + data = [] + for i in range(n_units): + is_treated = i < n_treated + for t in range(n_pre + n_post): + post = t >= n_pre + y = 10.0 + gamma[i] + delta[t] + y += Lambda[:, i] @ F[t, :] + treatment_indicator = 1 if (is_treated and post) else 0 + if treatment_indicator: + y += true_tau + y += rng.normal(0, 0.5) + + data.append( + { + "unit": i, + "period": t, + "outcome": y, + "treated": treatment_indicator, + } + ) + + df = pd.DataFrame(data) + + n_boot = ci_params.bootstrap(30) + trop_est = TROP( + lambda_time_grid=[0.0, 0.5, 1.0], + lambda_unit_grid=[0.0, 0.5, 1.0], + lambda_nn_grid=[0.0, 0.1, 1.0], + max_iter=500, # converge cleanly without "may be inaccurate" warning + n_bootstrap=n_boot, + seed=42, + ) + results = trop_est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + + assert ( + abs(results.att) < 2.0 + ), f"ATT={results.att:.3f} should be close to true={true_tau} under null" + assert results.effective_rank >= 0 + + +# ============================================================================= +# TestTROPDeviations — documented deviations and library choices +# ============================================================================= + + +class TestTROPDeviations: + """Locks library deviations from the paper and documented choices. + + Each test is independent (no bare cross-references to defensive + tests in ``test_trop.py``; pointers live in this class docstring). + + Cross-references for context (NOT duplicated assertions, NOT + locked by tests in this class): + - `tests/test_trop.py::TestTROPBootstrapFailureRateGuard` + (bootstrap proportional 5% failure-rate warning — defensive + surface, stays in `test_trop.py`). + - `tests/test_trop.py::TestTROPConvergenceWarnings` + (FISTA / outer-loop convergence warnings — defensive surface, + stays in `test_trop.py`). + - `tests/test_trop.py::TestSilentWarningAudit` + (Phase 2 silent-failure audit — defensive surface, stays in + `test_trop.py`). + - `tests/test_trop.py::TestDMatrixValidation` + (absorbing-state validation — defensive surface, stays in + `test_trop.py`). + + Tests in this class cover: + - Gap #5 (unnormalised weights match Eq. 2, not Section 5 + sum-to-one). + - lambda_nn=infinity internal conversion to 1e10 sentinel. + - lambda_time / lambda_unit rejection of inf grid values. + - Gap #9 (unbalanced panels supported beyond paper's balanced- + panel assumption). + - Event-style D rejection. + - Eq. 14 covariate extension not supported (Theorem 8.1 + correspondingly out of scope). + - Rank selection is implicit via nuclear-norm soft-thresholding + (no discrete ``rank_selection`` constructor parameter exposed; + paper Section 5.3 + Appendix matches this choice). + - n_bootstrap < 2 rejection (no analytical SE; bootstrap required). + - LOOCV happy-path uses the user-supplied grid verbatim + (fallback-to-defaults side covered defensively in `test_trop.py`). + - Inference CI uses t-distribution post safe_inference migration. + - safe_inference NaN-propagation contract on degenerate SE inputs. + """ + + def test_unnormalized_weights_match_eq2(self): + """Gap #5 (paper review): paper Section 5 (p. 20) states weights + sum to one (``1^T omega = 1^T theta = 1``), but Eq. 3 (p. 7) writes + unnormalised exponential weights. The library matches Eq. 2 — + the kernel weight at ``lambda_unit = 0`` is exactly 1 (not 1/N). + + This test directly inspects the per-observation weight matrix + returned by `TROP._compute_observation_weights` under + ``lambda_unit = lambda_time = 0`` and asserts: + + - Every entry equals 1.0 within machine precision (unnormalised). + - The sum equals ``n_units * n_periods`` (paper Section 5 would + require sum = 1; Eq. 2 / library returns ``N * T``). + + See ``docs/methodology/papers/athey-2025-review.md`` Gap #5. + """ + n_units = 8 + n_periods = 6 + # Minimal Y / D matrices in (n_periods, n_units) layout. Values + # don't matter for the lambda=0 branch (which early-returns 1.0 + # without inspecting distances), but supply realistic shapes + # so the helper signature is exercised end-to-end. + rng = np.random.default_rng(_BASE_SEED_DEVIATIONS) + Y = rng.normal(0, 1, (n_periods, n_units)) + D = np.zeros((n_periods, n_units), dtype=int) + D[3:, 0] = 1 # one absorbing-state treated unit at t=3 + + est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[0.1], + n_bootstrap=2, + seed=42, + ) + + weights = est._compute_observation_weights( + Y=Y, + D=D, + i=0, + t=3, + lambda_time=0.0, + lambda_unit=0.0, + control_unit_idx=np.arange(1, n_units), + n_units=n_units, + n_periods=n_periods, + ) + + # All weight entries must be exactly 1.0 under uniform Eq. 3 + # specification (exp(-0 * dist) = 1; product of time and unit + # weights is 1 * 1 = 1 everywhere). + np.testing.assert_array_almost_equal( + weights, + np.ones((n_periods, n_units)), + decimal=12, + err_msg=( + "Per-obs weights under lambda_time = lambda_unit = 0 must " + "be exactly 1.0 (Eq. 2 unnormalised), not 1/N or any " + "other normalised value (Gap #5 in paper review)" + ), + ) + # And the sum is N*T, not 1 — discriminating against the paper's + # Section 5 "sum-to-one" claim. + assert weights.sum() == float(n_units * n_periods), ( + f"weights.sum()={weights.sum()} != N*T={n_units * n_periods}; " + "library would be using normalised weights (Gap #5 broken)" + ) + + def test_lambda_nn_inf_stored_unchanged(self): + """REGISTRY ``## TROP`` "λ_nn=∞ implementation" edge-case note: + lambda_nn=infinity is converted internally to 1e10 for + computation, but ``TROPResults.lambda_nn`` stores the original + ``inf`` value. lambda_time and lambda_unit store their selected + grid values directly (no inf conversion — Eq. 3 uses + ``lambda_time = lambda_unit = 0`` for "disabled", not infinity). + """ + df = _make_no_factor_panel( + n_units=15, + n_treated=4, + n_pre=5, + n_post=2, + noise_sd=0.3, + treatment_effect=1.5, + seed=_BASE_SEED_DEVIATIONS + 1, + ) + est = TROP( + lambda_time_grid=[0.0, 0.5], + lambda_unit_grid=[0.0, 0.5], + lambda_nn_grid=[np.inf], + n_bootstrap=2, + seed=42, + ) + results = est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + assert results.lambda_nn == np.inf, ( + f"lambda_nn={results.lambda_nn} not preserved as inf " + "(REGISTRY says ORIGINAL value is stored)" + ) + assert results.lambda_time in [0.0, 0.5] + assert results.lambda_unit in [0.0, 0.5] + + def test_inf_in_lambda_time_or_unit_grid_rejected(self): + """REGISTRY ``## TROP`` "Disabled parameter semantics" note: only + ``lambda_nn`` may be infinity in Eq. 3 semantics. + ``lambda_time = 0`` and ``lambda_unit = 0`` mean "uniform weights" + (disabled), because ``exp(-0 * dist) = 1``. inf in lambda_time / + lambda_unit is rejected with ``ValueError`` pointing users to + 0.0 for uniform weights. + """ + df = _make_no_factor_panel( + n_units=10, + n_treated=3, + n_pre=4, + n_post=2, + noise_sd=0.1, + treatment_effect=1.0, + seed=_BASE_SEED_DEVIATIONS + 2, + ) + with pytest.raises(ValueError, match="(?i)inf|infinity"): + TROP( + lambda_time_grid=[np.inf], + lambda_unit_grid=[0.0], + lambda_nn_grid=[0.1], + n_bootstrap=2, + seed=42, + ).fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + + def test_unbalanced_panels_supported(self): + """Gap #9 (paper review): the paper assumes balanced panels + (N x T). The library accepts unbalanced panels with missing + unit-period observations. + + Construct a balanced panel and drop ~10% of rows; TROP fit must + complete and return a finite ATT. + """ + df = _make_no_factor_panel( + n_units=18, + n_treated=5, + n_pre=5, + n_post=3, + noise_sd=0.3, + treatment_effect=2.0, + seed=_BASE_SEED_DEVIATIONS + 3, + ) + # Drop 10 rows at random (control + pre-treatment cells only, + # so the treatment indicator stays absorbing). + rng = np.random.default_rng(_BASE_SEED_DEVIATIONS + 30) + eligible_idx = df.index[(df["treated"] == 0)].to_numpy() + drop_idx = rng.choice(eligible_idx, size=10, replace=False) + df_unbal = df.drop(index=drop_idx).reset_index(drop=True) + assert len(df_unbal) < len(df) + + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[0.1], + n_bootstrap=5, + seed=42, + ) + results = est.fit( + df_unbal, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + assert np.isfinite(results.att) + assert results.se >= 0 + + def test_event_style_d_rejected_with_value_error(self): + """REGISTRY ``## TROP`` "D matrix validation" edge-case note: + event-style D (only first treatment period has D=1) is rejected + because monotonicity (absorbing state) is violated. Error message + must guide users to convert to absorbing state. + """ + # Build event-style: D=1 at t=3, D=0 at t=4 (1->0 transition + # is non-monotonic, violating absorbing state). + rows = [] + for i in range(10): + for t in range(6): + d = 1 if (i < 3 and t == 3) else 0 # event-style! + rows.append( + { + "unit": i, + "period": t, + "outcome": 1.0 + 0.1 * t + (1.0 if d else 0.0), + "treated": d, + } + ) + df = pd.DataFrame(rows) + est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[0.1], + n_bootstrap=2, + seed=42, + ) + with pytest.raises(ValueError, match="(?i)absorbing|monotonic"): + est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + + def test_covariates_not_supported(self): + """Eq. 14 + Theorem 8.1 are deferred. ``TROP.fit()`` does NOT + accept a ``covariates`` keyword argument. The check uses + ``inspect.signature`` so that adding a future ``**kwargs`` would + not silently break this contract. + + See ``METHODOLOGY_REVIEW.md`` ``TROP`` "Outstanding Concerns". + """ + sig = inspect.signature(TROP.fit) + param_names = set(sig.parameters.keys()) + assert "covariates" not in param_names, ( + f"TROP.fit() unexpectedly exposes a 'covariates' parameter: " + f"{param_names}. Equation 14 covariate extension is deferred; " + f"adding the parameter without implementing the X*beta_coef + R " + f"objective per Eq. 14 would silently break methodology contract." + ) + # No **kwargs either (would let covariates slip through). + var_kw = [n for n, p in sig.parameters.items() if p.kind == inspect.Parameter.VAR_KEYWORD] + assert var_kw == [], ( + f"TROP.fit() exposes **kwargs ({var_kw}); covariate-trap " "is no longer airtight." + ) + + def test_rank_selection_is_implicit_via_nuclear_norm(self): + """Paper Section 5.3 + Appendix: rank selection is implicit via + nuclear-norm soft-thresholding. The library matches this — there + is NO discrete ``rank_selection`` constructor parameter exposing + cv / ic / elbow methods. + + Earlier REGISTRY prose (pre-promotion) mentioned "cv / ic / elbow" + methods; that claim was an overclaim and is corrected in this + promotion. This test locks the actual contract: rank is reported + post-hoc via ``TROPResults.effective_rank`` (sum of singular + values divided by the largest singular value) as a diagnostic, + not as a user-selectable mode. + """ + sig = inspect.signature(TROP.__init__) + assert "rank_selection" not in sig.parameters, ( + "TROP.__init__ unexpectedly exposes a 'rank_selection' parameter. " + "The paper specifies implicit-via-nuclear-norm rank selection; " + "if a discrete switch is added, this deviation note in " + "METHODOLOGY_REVIEW.md and REGISTRY.md must be updated." + ) + # effective_rank is reported as a diagnostic post-fit. + df = _make_no_factor_panel( + n_units=10, + n_treated=3, + n_pre=4, + n_post=2, + noise_sd=0.1, + treatment_effect=1.0, + seed=_BASE_SEED_DEVIATIONS + 4, + ) + est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[0.1], + n_bootstrap=2, + seed=42, + ) + results = est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + assert hasattr(results, "effective_rank") + assert np.isfinite(results.effective_rank) + + def test_n_bootstrap_minimum_is_2(self): + """REGISTRY ``## TROP`` "Bootstrap minimum" edge-case note: + ``n_bootstrap`` must be >= 2 (enforced via + ``ValueError``). TROP has no analytical SE formula — bootstrap + is the only variance estimator, so n_bootstrap=0 or 1 cannot + produce a defined SE. + """ + with pytest.raises(ValueError, match="(?i)bootstrap"): + TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[0.1], + n_bootstrap=1, + seed=42, + ) + + def test_loocv_returns_user_grid_values_on_well_conditioned_panel(self): + """REGISTRY ``## TROP`` "LOOCV failure handling" edge-case note + (happy-path side): when LOOCV produces a finite + Q(lambda) on at least one grid point, the result tuple + ``(lambda_time, lambda_unit, lambda_nn)`` is from the user- + supplied grid (no fallback to documented defaults + ``(1.0, 1.0, 0.1)``). + + The fallback-warning side (when ALL parameter combinations fail + LOOCV) is covered by `tests/test_trop.py::TestLOOCVFallback` + defensive surfaces — duplication here would be redundant. This + test locks the dual: well-conditioned panel uses the user grid + verbatim, so any regression that prematurely triggers the + fallback would surface here. + """ + df = _make_no_factor_panel( + n_units=12, + n_treated=3, + n_pre=4, + n_post=2, + noise_sd=0.3, + treatment_effect=1.5, + seed=_BASE_SEED_DEVIATIONS + 5, + ) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + est = TROP( + lambda_time_grid=[0.5], + lambda_unit_grid=[0.5], + lambda_nn_grid=[0.1], + n_bootstrap=2, + seed=42, + ) + results = est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + # Well-conditioned panel: result should be the user grid values + # (LOOCV succeeded, no fallback triggered). + assert results.lambda_time == 0.5 + assert results.lambda_unit == 0.5 + assert results.lambda_nn == 0.1 + + def test_inference_ci_uses_t_distribution(self): + """REGISTRY ``## TROP`` "Inference CI distribution" edge-case + note: after the safe_inference migration the confidence interval + uses the t-distribution with df = max(1, n_treated_obs - 1), + consistent with p_value. (Previously CI used normal-distribution + while p_value used t-distribution.) + + Lock: with a well-defined SE, the CI half-width equals + ``t_{alpha/2, df} * SE`` within numerical tolerance. + """ + df = _make_no_factor_panel( + n_units=15, + n_treated=4, + n_pre=5, + n_post=2, + noise_sd=0.3, + treatment_effect=2.0, + seed=_BASE_SEED_DEVIATIONS + 6, + ) + est = TROP( + lambda_time_grid=[0.0], + lambda_unit_grid=[0.0], + lambda_nn_grid=[0.1], + max_iter=500, # converge cleanly without "may be inaccurate" warning + n_bootstrap=5, + seed=42, + ) + results = est.fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + ) + from scipy import stats + + df_t = max(1, results.n_treated_obs - 1) + t_crit = stats.t.ppf(1 - results.alpha / 2, df_t) + ci_lower, ci_upper = results.conf_int + half_width = (ci_upper - ci_lower) / 2.0 + expected_half_width = t_crit * results.se + assert np.isclose(half_width, expected_half_width, rtol=1e-6), ( + f"CI half-width={half_width:.6e} does not match " + f"t_{{alpha/2, df={df_t}}} * SE = {expected_half_width:.6e} " + f"(REGISTRY ## TROP 'Inference CI distribution' note: post safe_inference migration uses t-dist)" + ) + + def test_safe_inference_nan_propagation_contract(self): + """`diff_diff.utils.safe_inference` invariant: when SE is non- + finite or non-positive, ALL inference fields (t_stat, p_value, + conf_int) are NaN-consistent. + + This test invokes `safe_inference` directly with the degenerate + inputs (SE = 0, SE = NaN, SE = -1) used across the TROP code + paths, so the NaN-propagation contract is exercised + deterministically rather than depending on a panel construction + that *might* produce a zero-SE bootstrap distribution. The + product-level coverage of TROP's NaN-SE propagation lives at + `tests/test_trop.py::TestTROPBootstrapNaNSE` (panel-level) and + `tests/test_trop.py::TestTROPResults::test_nan_propagation_when_se_zero` + (results-level); this methodology test locks the underlying + invariant. + """ + from diff_diff.utils import safe_inference + from tests.conftest import assert_nan_inference + + # Cover the three SE-degenerate inputs that propagate to NaN + # inference per the contract: zero, negative, NaN. + for bad_se in (0.0, -1.0, float("nan")): + t_stat, p_value, conf_int = safe_inference( + effect=1.5, + se=bad_se, + alpha=0.05, + ) + assert_nan_inference( + { + "se": bad_se, + "t_stat": t_stat, + "p_value": p_value, + "conf_int": conf_int, + } + ) diff --git a/tests/test_trop.py b/tests/test_trop.py index 5cf10cac..1ecf2870 100644 --- a/tests/test_trop.py +++ b/tests/test_trop.py @@ -8,9 +8,8 @@ import pandas as pd import pytest -from diff_diff import SyntheticDiD -from diff_diff.trop import TROP, TROPResults, trop from diff_diff.prep import generate_factor_data +from diff_diff.trop import TROP, TROPResults, trop def generate_factor_dgp( @@ -469,45 +468,6 @@ def test_nan_propagation_when_se_zero(self): assert results.att == 1.0, "ATT should still be valid" -@pytest.mark.slow -class TestTROPvsSDID: - """Tests comparing TROP to SDID under different DGPs.""" - - def test_trop_handles_factor_dgp(self, ci_params): - """Test that TROP works on factor DGP data.""" - data = generate_factor_dgp( - n_units=30, - n_pre=8, - n_post=4, - n_treated=5, - n_factors=2, - treatment_effect=2.0, - factor_strength=1.5, - noise_std=0.5, - seed=42, - ) - - # TROP should complete without error - n_boot = ci_params.bootstrap(20) - trop_est = TROP( - lambda_time_grid=[0.0, 1.0], - lambda_unit_grid=[0.0, 1.0], - lambda_nn_grid=[0.0, 0.1, 1.0], - n_bootstrap=n_boot, - seed=42, - ) - results = trop_est.fit( - data, - outcome="outcome", - treatment="treated", - unit="unit", - time="period", - ) - - assert results.att != 0 - assert results.se >= 0 - - class TestConvenienceFunction: """Tests for trop() convenience function.""" @@ -548,335 +508,6 @@ def test_convenience_with_kwargs(self, simple_panel_data): assert isinstance(results, TROPResults) -@pytest.mark.slow -class TestMethodologyVerification: - """Tests verifying TROP methodology matches paper specifications. - - These tests verify: - 1. Limiting cases match expected behavior - 2. Treatment effect recovery under paper's simulation DGP - 3. Observation-specific weighting produces expected results - """ - - def test_limiting_case_uniform_weights(self): - """ - Test limiting case: λ_unit = λ_time = 0, λ_nn = 0. - - With all lambdas at zero, TROP should use uniform weights and no - nuclear norm regularization, giving TWFE-like estimates. - """ - # Generate simple data with known treatment effect - rng = np.random.default_rng(42) - n_units = 15 - n_treated = 5 - n_pre = 5 - n_post = 3 - true_att = 3.0 - - data = [] - for i in range(n_units): - is_treated = i < n_treated - unit_fe = rng.normal(0, 0.5) - for t in range(n_pre + n_post): - post = t >= n_pre - time_fe = 0.2 * t - y = 10.0 + unit_fe + time_fe - treatment_indicator = 1 if (is_treated and post) else 0 - if treatment_indicator: - y += true_att - y += rng.normal(0, 0.3) - data.append( - { - "unit": i, - "period": t, - "outcome": y, - "treated": treatment_indicator, - } - ) - - df = pd.DataFrame(data) - - # TROP with uniform weights - trop_est = TROP( - lambda_time_grid=[0.0], - lambda_unit_grid=[0.0], - lambda_nn_grid=[0.0], - n_bootstrap=10, - seed=42, - ) - results = trop_est.fit( - df, - outcome="outcome", - treatment="treated", - unit="unit", - time="period", - ) - - # Should recover treatment effect within reasonable tolerance - assert ( - abs(results.att - true_att) < 1.0 - ), f"ATT={results.att:.3f} should be close to true={true_att}" - # Check that uniform weights were selected - assert results.lambda_time == 0.0 - assert results.lambda_unit == 0.0 - assert results.lambda_nn == 0.0 - - def test_unit_weights_reduce_bias(self): - """ - Test that unit distance-based weights reduce bias when controls vary. - - When control units have varying similarity to treated units, using - distance-based unit weights should improve estimation. - """ - rng = np.random.default_rng(123) - n_units = 25 - n_treated = 5 - n_pre = 6 - n_post = 3 - true_att = 2.5 - - data = [] - # Create heterogeneous control units - some similar to treated, some different - for i in range(n_units): - is_treated = i < n_treated - # Treated units and first 5 controls are similar - if is_treated or i < n_treated + 5: - unit_fe = 5.0 + rng.normal(0, 0.3) - else: - # Remaining controls are dissimilar - unit_fe = 10.0 + rng.normal(0, 0.5) - - for t in range(n_pre + n_post): - post = t >= n_pre - time_fe = 0.2 * t - y = unit_fe + time_fe - treatment_indicator = 1 if (is_treated and post) else 0 - if treatment_indicator: - y += true_att - y += rng.normal(0, 0.3) - data.append( - { - "unit": i, - "period": t, - "outcome": y, - "treated": treatment_indicator, - } - ) - - df = pd.DataFrame(data) - - # TROP with unit weighting enabled - trop_est = TROP( - lambda_time_grid=[0.0], - lambda_unit_grid=[0.0, 1.0, 2.0], - lambda_nn_grid=[0.0], - n_bootstrap=10, - seed=42, - ) - results = trop_est.fit( - df, - outcome="outcome", - treatment="treated", - unit="unit", - time="period", - ) - - # Should recover treatment effect reasonably well - assert ( - abs(results.att - true_att) < 1.5 - ), f"ATT={results.att:.3f} should be close to true={true_att}" - - def test_time_weights_reduce_bias(self): - """ - Test that time distance-based weights reduce bias with trending data. - - When pre-treatment outcomes are trending, weighting recent periods - more heavily should improve estimation. - """ - rng = np.random.default_rng(456) - n_units = 20 - n_treated = 5 - n_pre = 8 - n_post = 3 - true_att = 2.0 - - data = [] - for i in range(n_units): - is_treated = i < n_treated - unit_fe = rng.normal(0, 0.5) - - for t in range(n_pre + n_post): - post = t >= n_pre - # Time trend that accelerates near treatment - time_fe = 0.1 * t + 0.05 * (t**2 / n_pre) - y = 10.0 + unit_fe + time_fe - treatment_indicator = 1 if (is_treated and post) else 0 - if treatment_indicator: - y += true_att - y += rng.normal(0, 0.3) - data.append( - { - "unit": i, - "period": t, - "outcome": y, - "treated": treatment_indicator, - } - ) - - df = pd.DataFrame(data) - - # TROP with time weighting enabled - trop_est = TROP( - lambda_time_grid=[0.0, 0.5, 1.0], - lambda_unit_grid=[0.0], - lambda_nn_grid=[0.0], - n_bootstrap=10, - seed=42, - ) - results = trop_est.fit( - df, - outcome="outcome", - treatment="treated", - unit="unit", - time="period", - ) - - # Should recover treatment effect direction - assert results.att > 0, f"ATT={results.att:.3f} should be positive" - # Check that time weighting was considered - assert results.lambda_time in [0.0, 0.5, 1.0] - - def test_factor_model_reduces_bias(self, ci_params): - """ - Test that nuclear norm regularization reduces bias with factor structure. - - Following paper's simulation: when true DGP has interactive fixed effects, - the factor model component should help recover the treatment effect. - """ - # Generate data with known factor structure (reduced size for CI speed) - data = generate_factor_dgp( - n_units=25, - n_pre=7, - n_post=3, - n_treated=5, - n_factors=2, - treatment_effect=2.0, - factor_strength=1.5, # Strong factors - noise_std=0.5, - seed=789, - ) - - # TROP with nuclear norm regularization - n_boot = ci_params.bootstrap(20) - nn_grid = ci_params.grid([0.0, 0.1, 1.0, 5.0]) - trop_est = TROP( - lambda_time_grid=[0.0, 0.5], - lambda_unit_grid=[0.0, 0.5], - lambda_nn_grid=nn_grid, - n_bootstrap=n_boot, - seed=42, - ) - results = trop_est.fit( - data, - outcome="outcome", - treatment="treated", - unit="unit", - time="period", - ) - - true_att = 2.0 - # With factor adjustment, should recover treatment effect - assert ( - abs(results.att - true_att) < 2.0 - ), f"ATT={results.att:.3f} should be within 2.0 of true={true_att}" - # Factor matrix should capture some structure - assert results.effective_rank > 0, "Factor matrix should have positive rank" - - def test_paper_dgp_recovery(self, ci_params): - """ - Test treatment effect recovery using paper's simulation DGP. - - Based on Table 2 (page 32) simulation settings: - - Factor model with 2 factors - - Treatment effect = 0 (null hypothesis) - - Should produce estimates centered around zero - - This is a methodological validation test. - """ - # Generate data similar to paper's simulation (reduced size for CI speed) - rng = np.random.default_rng(2024) - n_units = 30 - n_treated = 6 - n_pre = 7 - n_post = 3 - n_factors = 2 - true_tau = 0.0 # Null treatment effect - - # Generate factors F: (n_periods, n_factors) - F = rng.normal(0, 1, (n_pre + n_post, n_factors)) - - # Generate loadings Lambda: (n_factors, n_units) - Lambda = rng.normal(0, 1, (n_factors, n_units)) - # Treated units have different loadings (selection on unobservables) - Lambda[:, :n_treated] += 0.5 - - # Unit fixed effects - gamma = rng.normal(0, 1, n_units) - gamma[:n_treated] += 1.0 # Selection on levels - - # Time fixed effects (linear trend) - delta = np.linspace(0, 2, n_pre + n_post) - - data = [] - for i in range(n_units): - is_treated = i < n_treated - for t in range(n_pre + n_post): - post = t >= n_pre - # Y = mu + gamma_i + delta_t + Lambda_i'F_t + tau*D + eps - y = 10.0 + gamma[i] + delta[t] - y += Lambda[:, i] @ F[t, :] # Factor component - treatment_indicator = 1 if (is_treated and post) else 0 - if treatment_indicator: - y += true_tau - y += rng.normal(0, 0.5) # Idiosyncratic noise - - data.append( - { - "unit": i, - "period": t, - "outcome": y, - "treated": treatment_indicator, - } - ) - - df = pd.DataFrame(data) - - # TROP estimation - n_boot = ci_params.bootstrap(30) - trop_est = TROP( - lambda_time_grid=[0.0, 0.5, 1.0], - lambda_unit_grid=[0.0, 0.5, 1.0], - lambda_nn_grid=[0.0, 0.1, 1.0], - n_bootstrap=n_boot, - seed=42, - ) - results = trop_est.fit( - df, - outcome="outcome", - treatment="treated", - unit="unit", - time="period", - ) - - # Under null hypothesis, ATT should be close to zero - # Allow for estimation error (this is a finite sample) - assert ( - abs(results.att) < 2.0 - ), f"ATT={results.att:.3f} should be close to true={true_tau} under null" - # Check that factor model was used - assert results.effective_rank >= 0 - - class TestOptimizationEquivalence: """Tests verifying optimized implementations produce identical results. @@ -1368,309 +999,6 @@ def test_cycling_search_single_value_grids(self, simple_panel_data): assert results.lambda_nn == 0.1 -@pytest.mark.slow -class TestPaperConformanceFixes: - """Tests verifying fixes for paper conformance issues. - - These tests validate the four fixes from the implementation assessment: - - Issue A: Control set includes pre-treatment obs of eventually-treated units - - Issue B: Distance computation excludes target period - - Issue C: Nuclear norm update uses weights - - Issue D: Bootstrap uses stratified sampling - """ - - def test_issue_a_control_includes_pretreatment_obs(self): - """ - Test Issue A fix: Control set includes pre-treatment observations - of eventually-treated units. - - Paper's Equation 2 (page 7) sums over ALL observations where - (1 - W_js) is non-zero, including pre-treatment periods of - eventually-treated units. - """ - # Create staggered adoption data where treated units have - # informative pre-treatment outcomes - rng = np.random.default_rng(42) - n_units = 20 - n_early_treat = 5 # Units treated at period 3 - n_late_treat = 5 # Units treated at period 5 - n_control = 10 # Never-treated units - n_periods = 8 - true_att = 2.0 - - data = [] - for i in range(n_units): - # Determine treatment timing - if i < n_early_treat: - treat_period = 3 - unit_fe = 5.0 # Early-treated have specific level - elif i < n_early_treat + n_late_treat: - treat_period = 5 - unit_fe = 5.5 # Late-treated similar to early-treated - else: - treat_period = None - unit_fe = 10.0 # Control units have different level - - for t in range(n_periods): - is_post = treat_period is not None and t >= treat_period - treatment_indicator = 1 if is_post else 0 - y = unit_fe + 0.2 * t - if treatment_indicator: - y += true_att - y += rng.normal(0, 0.3) - data.append( - { - "unit": i, - "period": t, - "outcome": y, - "treated": treatment_indicator, - } - ) - - df = pd.DataFrame(data) - - # With Issue A fix, TROP should be able to use pre-treatment - # observations of late-treated units as controls for early-treated - trop_est = TROP( - lambda_time_grid=[0.0], - lambda_unit_grid=[1.0], # Use unit weights so distance matters - lambda_nn_grid=[0.0], - n_bootstrap=10, - seed=42, - ) - results = trop_est.fit( - df, - outcome="outcome", - treatment="treated", - unit="unit", - time="period", - ) - - # Should recover treatment effect direction - assert results.att > 0, f"ATT={results.att:.3f} should be positive" - - def test_issue_b_distance_excludes_target_period(self): - """ - Test Issue B fix: Distance computation excludes target period. - - Paper's Equation 3 (page 7) specifies 1{u ≠ t} to exclude the - target period when computing pairwise distances. - """ - rng = np.random.default_rng(123) - - # Create data where unit 0's outcome at target period is very different - n_units = 10 - n_periods = 6 - data = [] - for i in range(n_units): - is_treated = i == 0 - for t in range(n_periods): - if is_treated and t == 3: - # Target period (t=3) has anomalous outcome - y = 100.0 # Very different from other periods - elif is_treated and t >= 3: - y = 5.0 + rng.normal(0, 0.1) - else: - y = 5.0 + rng.normal(0, 0.1) - - treatment_indicator = 1 if (is_treated and t >= 3) else 0 - data.append( - { - "unit": i, - "period": t, - "outcome": y, - "treated": treatment_indicator, - } - ) - - df = pd.DataFrame(data) - - trop_est = TROP( - lambda_time_grid=[0.0], - lambda_unit_grid=[1.0], - lambda_nn_grid=[0.0], - n_bootstrap=5, - seed=42, - ) - - # With Issue B fix (target period excluded), this should complete - # Without the fix, the anomalous period would dominate distance - results = trop_est.fit( - df, - outcome="outcome", - treatment="treated", - unit="unit", - time="period", - ) - - # Model should fit without error - assert results is not None - # ATT should be finite - assert np.isfinite(results.att) - - def test_issue_c_weighted_nuclear_norm(self): - """ - Test Issue C fix: Nuclear norm update properly accounts for weights. - - The paper's Equation 2 (page 7) specifies the full weighted objective. - Weights should affect L matrix estimation. - """ - rng = np.random.default_rng(456) - - # Create data with factor structure where weights matter - n_units = 15 - n_periods = 8 - n_treated = 3 - true_att = 2.0 - - # Factor loadings that vary by unit - loadings = rng.normal(0, 1, n_units) - factors = rng.normal(0, 1, n_periods) - - data = [] - for i in range(n_units): - is_treated = i < n_treated - for t in range(n_periods): - post = t >= 5 - # Y = mu + factor_component + treatment_effect + noise - y = 10.0 + loadings[i] * factors[t] - treatment_indicator = 1 if (is_treated and post) else 0 - if treatment_indicator: - y += true_att - y += rng.normal(0, 0.3) - data.append( - { - "unit": i, - "period": t, - "outcome": y, - "treated": treatment_indicator, - } - ) - - df = pd.DataFrame(data) - - # Test with nuclear norm regularization - trop_est = TROP( - lambda_time_grid=[0.0], - lambda_unit_grid=[0.0], - lambda_nn_grid=[0.1, 1.0], # Use regularization - n_bootstrap=10, - seed=42, - ) - results = trop_est.fit( - df, - outcome="outcome", - treatment="treated", - unit="unit", - time="period", - ) - - # Factor matrix should have been estimated with non-zero effective rank - # (with weighted nuclear norm solver, this tests the code path) - assert results.effective_rank >= 0 - # ATT should recover treatment effect direction - assert results.att > 0, f"ATT={results.att:.3f} should be positive" - - def test_issue_d_stratified_bootstrap(self, ci_params): - """ - Test Issue D fix: Bootstrap uses stratified sampling. - - Paper's Algorithm 3 (page 27) specifies sampling N_0 control and - N_1 treated units separately to preserve treatment ratio. - """ - rng = np.random.default_rng(789) - - # Create data with unbalanced treated/control ratio - n_treated = 3 - n_control = 17 - n_units = n_treated + n_control - n_periods = 6 - true_att = 2.0 - - data = [] - for i in range(n_units): - is_treated = i < n_treated - for t in range(n_periods): - post = t >= 3 - y = 10.0 + rng.normal(0, 0.5) - treatment_indicator = 1 if (is_treated and post) else 0 - if treatment_indicator: - y += true_att - data.append( - { - "unit": i, - "period": t, - "outcome": y, - "treated": treatment_indicator, - } - ) - - df = pd.DataFrame(data) - - # Run with bootstrap variance estimation - n_boot = ci_params.bootstrap(30) - trop_est = TROP( - lambda_time_grid=[0.0], - lambda_unit_grid=[0.0], - lambda_nn_grid=[0.0], - n_bootstrap=n_boot, - seed=42, - ) - results = trop_est.fit( - df, - outcome="outcome", - treatment="treated", - unit="unit", - time="period", - ) - - # Bootstrap should complete successfully - assert results.bootstrap_distribution is not None - min_successes = max(5, int(0.67 * n_boot)) - assert len(results.bootstrap_distribution) >= min_successes, ( - f"Expected >= {min_successes} successful bootstrap draws " - f"out of {n_boot}, got {len(results.bootstrap_distribution)}" - ) - # SE should be positive and finite - assert results.se > 0 - assert np.isfinite(results.se) - - def test_weighted_nuclear_norm_solver_convergence(self): - """ - Test that the weighted nuclear norm solver converges properly. - """ - trop_est = TROP( - lambda_time_grid=[0.0], - lambda_unit_grid=[0.0], - lambda_nn_grid=[1.0], # Larger lambda for more regularization - ) - - # Create test data - n_periods = 5 - n_units = 8 - - Y = np.random.default_rng(42).normal(0, 1, (n_periods, n_units)) - W = np.ones((n_periods, n_units)) - L_init = np.zeros((n_periods, n_units)) - alpha = np.zeros(n_units) - beta = np.zeros(n_periods) - - # Call the weighted nuclear norm solver - L = trop_est._weighted_nuclear_norm_solve( - Y, W, L_init, alpha, beta, lambda_nn=1.0, max_inner_iter=20 - ) - - # L should be finite and have reasonable values - assert np.all(np.isfinite(L)) - # With nuclear norm regularization, singular values should be reduced - _, s, _ = np.linalg.svd(L, full_matrices=False) - _, s_orig, _ = np.linalg.svd(Y, full_matrices=False) - # Regularized singular values should be smaller than original - assert np.sum(s) < np.sum( - s_orig - ), "Nuclear norm regularization should reduce total singular value mass" - - class TestAPIChangesV2_1_8: """Tests verifying API changes in v2.1.8. @@ -2025,7 +1353,7 @@ def test_rust_infinite_score_triggers_fallback(self, simple_panel_data): is attempted. If Python also returns infinity, defaults are used. """ import sys - from unittest.mock import patch, MagicMock + from unittest.mock import MagicMock, patch trop_est = TROP( lambda_time_grid=[0.0, 1.0], @@ -2648,105 +1976,14 @@ def test_n_post_periods_counts_observed_treatment(self): class TestTROPNuclearNormSolver: - """Tests for proximal gradient step size correctness and objective monotonicity.""" - - def test_proximal_step_size_correctness(self): - """Verify L converges to prox_{λ/2}(R) for uniform weights.""" - trop_est = TROP(method="global", n_bootstrap=2) - - # Small problem with known solution - rng = np.random.default_rng(42) - R = rng.normal(0, 1, (4, 3)) - delta = np.ones((4, 3)) - lambda_nn = 0.5 - - # Run solver (many iterations to ensure convergence) - L = np.zeros_like(R) - for _ in range(500): - delta_max = np.max(delta) - delta_norm = delta / delta_max - gradient_step = L + delta_norm * (R - L) - eta = 1.0 / (2.0 * delta_max) - L = trop_est._soft_threshold_svd(gradient_step, eta * lambda_nn) - - # Analytical solution for uniform weights: prox_{λ/2}(R) - L_exact = trop_est._soft_threshold_svd(R, lambda_nn / 2.0) - - np.testing.assert_array_almost_equal(L, L_exact, decimal=4) - - def test_lowrank_objective_decreases(self): - """Verify objective f(L) + λ||L||_* is non-increasing across iterations.""" - # Generate small problem - rng = np.random.default_rng(42) - R = rng.normal(0, 1, (6, 4)) - delta = rng.uniform(0.5, 2.0, (6, 4)) - lambda_nn = 0.3 - - trop_est = TROP(method="global", n_bootstrap=2) - L = np.zeros_like(R) - objectives = [] - - for _ in range(50): - # Compute objective - f_val = np.sum(delta * (R - L) ** 2) - _, s, _ = np.linalg.svd(L, full_matrices=False) - obj = f_val + lambda_nn * np.sum(s) - objectives.append(obj) - - # Proximal gradient step - delta_max = np.max(delta) - delta_norm = delta / delta_max - gradient_step = L + delta_norm * (R - L) - eta = 1.0 / (2.0 * delta_max) - L = trop_est._soft_threshold_svd(gradient_step, eta * lambda_nn) - - # Objective should be non-increasing (within numerical tolerance) - for k in range(1, len(objectives)): - assert ( - objectives[k] <= objectives[k - 1] + 1e-10 - ), f"Objective increased at step {k}: {objectives[k]} > {objectives[k-1]}" - - def test_local_nonuniform_weights_objective(self): - """Verify objective decreases with non-uniform weights (W_max < 1).""" - rng = np.random.default_rng(123) - R = rng.normal(0, 1, (6, 4)) - W = rng.uniform(0.1, 0.8, (6, 4)) - lambda_nn = 0.3 - - trop_est = TROP(method="local", n_bootstrap=2) + """Defensive guard for the weighted-nuclear-norm prox solver. - # Initial objective with L=0 - L_init = np.zeros_like(R) - f_init = np.sum(W * (R - L_init) ** 2) - _, s_init, _ = np.linalg.svd(L_init, full_matrices=False) - obj_init = f_init + lambda_nn * np.sum(s_init) - - # Solve - L_final = trop_est._weighted_nuclear_norm_solve( - Y=R, - W=W, - L_init=L_init, - alpha=np.zeros(R.shape[1]), - beta=np.zeros(R.shape[0]), - lambda_nn=lambda_nn, - max_inner_iter=20, - ) - - # Final objective - f_final = np.sum(W * (R - L_final) ** 2) - _, s_final, _ = np.linalg.svd(L_final, full_matrices=False) - obj_final = f_final + lambda_nn * np.sum(s_final) - - assert ( - obj_final <= obj_init + 1e-10 - ), f"Objective did not decrease: {obj_final} > {obj_init}" - - # Soft-thresholding should reduce nuclear norm vs residual - nuclear_norm_R = np.sum(np.linalg.svd(R, compute_uv=False)) - nuclear_norm_L = np.sum(s_final) - assert ( - nuclear_norm_L < nuclear_norm_R - ), f"Nuclear norm not reduced: {nuclear_norm_L} >= {nuclear_norm_R}" + Paper-side Eq. 2 prox correctness (proximal step size, FISTA objective + monotonicity, weighted non-uniform objective decrease) is verified in + `tests/test_methodology_trop.py::TestTROPNuclearNormProx`. This class + retains only the all-zero-weights defensive guard, which exercises a + library-internal edge case rather than a paper-derived property. + """ def test_zero_weights_no_division_error(self): """Verify solver handles all-zero weights without ZeroDivisionError.""" @@ -3632,8 +2869,8 @@ class TestTROPBootstrapNaNSE: def test_global_bootstrap_zero_draws_returns_nan_se(self): """Global bootstrap with 0 successful draws returns NaN SE, not 0.0.""" - from unittest.mock import patch import sys + from unittest.mock import patch df = TestTROPNValidTreated._make_panel() @@ -4246,6 +3483,7 @@ def test_item5_balanced_panel_no_warning(self): def test_item6_rust_loocv_fallback_warning(self): """Item 6: Warn when Rust LOOCV falls back to Python.""" from unittest.mock import patch + import diff_diff.trop_global as trop_global_mod df = self._make_panel()