Skip to content

StaggeredTripleDifference methodology validation + opt-in Eq-4.14 overall ATT#504

Merged
igerber merged 2 commits into
mainfrom
feature/staggered-triple-diff-validation
May 31, 2026
Merged

StaggeredTripleDifference methodology validation + opt-in Eq-4.14 overall ATT#504
igerber merged 2 commits into
mainfrom
feature/staggered-triple-diff-validation

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 31, 2026

Summary

  • StaggeredTripleDifference methodology-review-tracker: In Progress → Complete. Validates the diff_diff source against Ortiz-Villavicencio & Sant'Anna (2025, arXiv:2505.09942v3); PR-A (StaggeredTripleDifference PR-A: Ortiz-Villavicencio & Sant'Anna (2025) paper review #499) added the paper review on file, this PR validates the implementation against it.
  • New opt-in overall_att_es (paper Eq. 4.14 overall) — the unweighted mean of the post-treatment event-study effects ES(e) — on StaggeredTripleDiffResults (with overall_se_es / overall_t_stat_es / overall_p_value_es / overall_conf_int_es), populated only under aggregate="event_study"/"all". The default overall_att (Callaway-Sant'Anna simple post-treatment average, the library-wide convention) is unchanged. Computed via a side-channel stash on the shared CallawaySantAnnaAggregationMixin._aggregate_event_study (no return-signature change; CallawaySantAnna unaffected), over post-treatment e >= -anticipation. Analytical SE = the influence function of the mean (per-event-time combined IFs averaged, routed through the same survey-aware variance estimator as the per-e effects via a new _se_from_psi helper); a multiplier-bootstrap SE replaces it under n_bootstrap>0.
  • REGISTRY fidelity fixes: formalized the previously-unlabeled overall-aggregation prose under a **Note:** documenting both overalls; consolidated the duplicate aggregation-weight deviation and fixed a P(G=g) vs R P(S=g) mislabel.
  • Edge-case hardening surfaced during validation: empty event-study / degenerate-IF aggregation now fails soft (NaN + warning) instead of raising; the shared bootstrap mixin guards empty rel_periods (balance_e-emptied event studies); overall_att_es uses its own replicate-weight effective df.
  • Tracker row → Complete with a Verified Components / R Comparison Results block; docs/references.rst pinned to arXiv:2505.09942v3; autosummary stubs + CHANGELOG updated.

Methodology references

  • Method name(s): StaggeredTripleDifference (staggered triple-differences / DDD), built on the shared CallawaySantAnna aggregation + multiplier-bootstrap mixins.
  • Paper / source link(s): Ortiz-Villavicencio, M. & Sant'Anna, P. H. C. (2025). "Better Understanding Triple Differences Estimators." arXiv:2505.09942v3. R reference: triplediff::ddd(panel=TRUE) + agg_ddd().
  • Intentional deviations (and why) — documented in docs/methodology/REGISTRY.md ## StaggeredTripleDifference, all verified non-masking against the v3 paper:
    • Comparison-cohort admissibility g_c > max(t, base_period) + anticipation (matches the companion R triplediff; the paper states g_c > max(g,t)) — valid cell-by-cell and base-period/anticipation-aware.
    • Aggregation weights P(S=g, Q=1) (eligible-treated; matches the paper's Eq. 4.13 where G_i is finite only for Q=1) vs R's P(S=g) — drives the larger tolerance on aggregated quantities.
    • Per-cohort group-effect WIF (conservative vs R wif=NULL).
    • Default overall_att = CS-simple post-treatment average (library convention); the paper's Eq. 4.14 overall is available opt-in as overall_att_es.
    • Cluster-robust analytical SEs accepted-but-deferred (the multiplier bootstrap provides unit-level clustering).

Validation

  • Tests added/updated:
    • tests/test_methodology_staggered_triple_diff.py — paper-equation-anchored Verified Components (Theorem 4.1 / Eq. 4.5 RA=IPW=DR identification; Eq. 4.1 three-term DDD decomposition; Eqs. 4.11-4.12 optimal-GMM weight normalization + single-group reduction; Eq. 4.13 event-study cohort-share weighting; Eq. 4.14 / Cor. 4.2 overall), overall_att_es R cross-validation, balance_e-empties-event-study fail-soft, and the aggregation return-contract arity.
    • tests/test_survey_staggered_ddd.pyoverall_att_es under survey weighting (uniform-equivalence, nontrivial-weights-change-SE, full design, replicate weights).
    • tests/test_staggered_triple_diff.py — result-surface smoke (summary() / to_dict() expose the new fields; None for the default fit).
  • R parity: cross-validated against triplediff::ddd(panel=TRUE) + agg_ddd() — group-time ATT(g,t) exact, SE within 1%; Eq. 4.14 overall within 10% (ATT) / 3% (SE). CSV fixtures gitignored / regenerated on-the-fly from benchmarks/R/benchmark_staggered_triplediff.R; JSON golden committed. CallawaySantAnna + StaggeredDiD suites pass (shared-mixin regression guard).

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

…rall ATT

Validates the StaggeredTripleDifference source against Ortiz-Villavicencio &
Sant'Anna (2025, arXiv:2505.09942v3) and promotes the methodology-review row to
Complete. Adds an opt-in Eq-4.14 overall ATT (overall_att_es).

Source:
- New _se_from_psi helper factored from _compute_aggregated_se_with_wif
  (survey/replicate/simple variance dispatch), reused for the overall SE.
- _aggregate_event_study stashes self._event_study_overall (mirroring
  _event_study_vcov; no return-signature change, CallawaySantAnna unaffected):
  unweighted mean of post-treatment ES(e) over e >= -anticipation.
- StaggeredTripleDiffResults gains overall_att_es/_se_es/_t_stat_es/_p_value_es/
  _conf_int_es (populated only under aggregate in {event_study, all}); rendered
  in summary() and to_dict(). Default overall_att unchanged.
- Bootstrap parity: per-draw mean of post-treatment ES(e) draws; cluster-
  unidentified NaN guard mirrored for the new scalars.

Methodology docs (REGISTRY ## StaggeredTripleDifference):
- Formalize the previously-unlabeled overall-aggregation prose under a Note
  documenting both overalls (default CS-simple vs opt-in Eq-4.14).
- Consolidate the duplicate aggregation-weight deviation; fix the P(G=g) vs R
  P(S=g) mislabel.

Tests:
- Paper-equation-anchored Verified Components (Thm 4.1/Eq 4.5, Eq 4.1, Eqs
  4.11-4.12, Eq 4.13, Eq 4.14/Cor 4.2) + overall_att_es R cross-validation +
  bootstrap/survey cross-surface coverage.

Tracker/refs: METHODOLOGY_REVIEW.md row -> Complete with Verified Components / R
Comparison Results; priority queue pruned; references.rst pinned to v3; CHANGELOG.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
@github-actions
Copy link
Copy Markdown

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • The affected method is StaggeredTripleDifference aggregation/inference. The new overall_att_es path matches the paper’s Eq. 4.14 / Cor. 4.2 and preserves the documented library default overall_att.
  • I did not find an undocumented methodology deviation in weighting, identification, or SE construction for the new analytical or bootstrap Eq. 4.14 path.
  • P2: the new terminal warning for overall_att_es can misdiagnose bootstrap-side NaN SEs as analytical non-finite influence-function failures.
  • P3: remaining StaggeredTripleDifference R-parity limitations are already tracked in TODO.md:L92-L94, so they are informational only.
  • P3: the new result surface is tested for aggregate="event_study" and default omission, but not directly for the other supported public surface aggregate="all" exposing overall_att_es.
  • I could not run the test suite here because this environment lacks numpy/pytest; this is a static-source review.

Methodology

  • No findings. The new overall is implemented as the unweighted mean of post-treatment ES(e) in diff_diff/staggered_aggregation.py:L800-L836, which matches the paper review’s Eq. 4.14 summary in docs/methodology/papers/ortiz-villavicencio-santanna-2025-review.md:L151-L177. The retained default overall_att plus the anticipation-aware e >= -anticipation convention are explicitly documented in docs/methodology/REGISTRY.md:L2208-L2244, so this is a documented library choice, not an undocumented deviation.

Code Quality

  • P2 diff_diff/staggered_triple_diff.py:L803-L821, diff_diff/staggered_bootstrap.py:L611-L617. Impact: after bootstrap, any non-finite overall_se_es is warned as “a contributing post-treatment event-study horizon has a non-finite influence function,” but the bootstrap code can also force overall_att_es_se = NaN for unrelated reasons such as unidentified clustered bootstrap variance. That yields a false diagnosis even though the stored inference fields remain NaN-consistent. Concrete fix: track whether the analytical es_overall["se"] was already non-finite before the bootstrap override, and only emit the analytical-IF warning in that case; otherwise emit a bootstrap-specific warning or rely on the bootstrap warnings already produced.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • P3 TODO.md:L92-L94. Impact: R cross-validation still depends on local R/triplediff, covariate-adjusted parity is still untested, and the per-cohort WIF-vs-R difference remains documented. These items are already tracked in TODO.md, so they are informational only under the review rules. Concrete fix: none required in this PR beyond the existing TODO entries.

Security

  • No findings.

Documentation/Tests

  • P3 diff_diff/staggered_triple_diff.py:L652-L664, tests/test_staggered_triple_diff.py:L146-L178, tests/test_staggered_triple_diff.py:L319-L325, tests/test_methodology_staggered_triple_diff.py:L529-L562. Impact: the public contract says overall_att_es is populated for both aggregate="event_study" and aggregate="all", but the added assertions cover aggregate="event_study", default omission, and an aggregate="all" smoke test that only checks event_study_effects/group_effects. A regression could drop overall_att_es from the aggregate="all" path without tripping these tests. Concrete fix: add a direct regression test asserting that aggregate="all" populates overall_att_es and matches the aggregate="event_study" value/inference on the same data.

…te=all / cluster-unidentified coverage

CI Codex review findings on PR #504:

- P2: the Eq. 4.14 overall (overall_att_es) terminal warning re-derived its
  trigger from the post-bootstrap state, so a bootstrap that NaN'd overall_se_es
  for unrelated reasons (e.g. a single-PSU/cluster-unidentified survey design)
  was misdiagnosed as an analytical non-finite influence function. Gate the
  warning on whether the ANALYTICAL SE was non-finite (captured before the
  bootstrap override), and broaden the message to name both causes (non-finite
  IF or unidentified variance). The bootstrap path already emits its own
  authoritative "variance unidentified" warning.
- P3 (tests): add a direct regression that aggregate="all" populates
  overall_att_es and matches the aggregate="event_study" value/inference
  bit-for-bit on the same data.
- Add a single-PSU cluster-unidentified bootstrap regression: overall_att_es
  keeps its point estimate, SE+inference are NaN-consistent, and the Eq. 4.14
  warning (when emitted) is cause-accurate.

No numeric output changes; inference fields remain NaN-consistent in all paths.
Affected suites pass, incl. the shared-mixin CallawaySantAnna regression guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: fe53908ef5fba7bc50f34f675e0dd2a45858c25b


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings. Highest remaining item is P3 informational tech debt already tracked in TODO.md.

Executive Summary

  • Re-review result: the prior warning-diagnosis issue and the prior aggregate="all" test-surface gap are both addressed in the changed code/tests.
  • Affected method: StaggeredTripleDifference, specifically event-study aggregation/inference and the new opt-in Eq. 4.14 summary overall_att_es.
  • I did not find an undocumented methodology deviation in weighting, identification, variance/SE construction, or default behavior for the new Eq. 4.14 path.
  • The new inference path stays NaN-consistent and now has explicit edge-case coverage for empty event studies, replicate-design df handling, and single-PSU bootstrap variance failure.
  • Static-source review only: I could not execute the test suite here because pytest and numpy are unavailable in this environment.

Methodology

No findings. The new overall_att_es implementation matches the local paper review’s Eq. 4.14 / Cor. 4.2 description: it is the unweighted mean of post-treatment ES(e), with analytical SE obtained from the averaged combined influence function and bootstrap SE overriding it when n_bootstrap > 0. The retained default overall_att remains a documented library convention rather than an undocumented deviation. References: diff_diff/staggered_aggregation.py:L800-L836, diff_diff/staggered_triple_diff.py:L642-L738, docs/methodology/REGISTRY.md:L2208-L2244, docs/methodology/papers/ortiz-villavicencio-santanna-2025-review.md:L151-L176.

Code Quality

No findings. The prior P2 warning issue is fixed: the Eq. 4.14 warning is now keyed to an analytical non-finite SE captured before bootstrap override, so bootstrap-side NaNs are not mislabeled as influence-function failures. References: diff_diff/staggered_triple_diff.py:L652-L657, diff_diff/staggered_triple_diff.py:L812-L833, tests/test_survey_staggered_ddd.py:L1024-L1065.

Performance

No findings.

Maintainability

No findings.

Tech Debt

P3 — Impact: remaining StaggeredTripleDifference R-parity follow-ups are still tracked in TODO.md:L92-L94 (fixture determinism, covariate-adjusted parity, documented group-effect WIF-vs-R difference). Under the review rules these are informational only, not blockers. Concrete fix: none required in this PR beyond the existing TODO tracking.

Security

No findings.

Documentation/Tests

No findings. The prior aggregate="all" surface gap is now covered directly by tests/test_staggered_triple_diff.py:L180-L212, and the new methodology/survey tests cover the main Eq. 4.14 edge cases, including empty event studies and single-PSU bootstrap NaN propagation: tests/test_methodology_staggered_triple_diff.py:L509-L620, tests/test_survey_staggered_ddd.py:L910-L1065. I could not run these tests in this environment because pytest and numpy are missing.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 31, 2026
@igerber igerber merged commit d0a8752 into main May 31, 2026
33 of 34 checks passed
@igerber igerber deleted the feature/staggered-triple-diff-validation branch May 31, 2026 13:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant