Skip to content

TwoStageDiD Wave E.3 parity: always-treated full-design retention#485

Merged
igerber merged 2 commits into
mainfrom
spillover-conley-twostagedid-wave-e3-parity
May 22, 2026
Merged

TwoStageDiD Wave E.3 parity: always-treated full-design retention#485
igerber merged 2 commits into
mainfrom
spillover-conley-twostagedid-wave-e3-parity

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 22, 2026

Summary

  • Mechanical transfer of the Wave E.3 invariant from SpilloverDiD (PR SpilloverDiD Wave E.3: SurveyDesign.subpopulation() full-design retention #482) to TwoStageDiD. When the always-treated handler drops units from the OLS sample, the resolved survey design now retains its FULL-DOMAIN n_psu / n_strata / df_survey / strata / fpc / psu arrays. Per-cluster scores are zero-padded onto the full-domain unique-PSU list via new optional kwargs (score_pad_mask, cluster_ids_full) on _compute_gmm_variance; PSUs containing only always-treated rows get zero score rows but still count toward G_full.
  • Replicate-weight refit callback (_refit_ts) subsets each w_r via keep_mask.values before threading into stage-1/stage-2, mirroring the same subsetting the main fit applies to survey_weights (otherwise solve_ols rejects the length mismatch and compute_replicate_refit_variance swallows the ValueError so replicate inference NaNs out).
  • Cluster injection block sources cluster_ids_raw from FULL-DOMAIN data[cluster] (not post-drop df[cluster]) so _inject_cluster_as_psu's zip against resolved_survey.strata stays length-aligned; df["_survey_cluster"] aligned via resolved_survey.psu[keep_mask.values].
  • Closes the parity follow-up tracked at TODO.md since PR SpilloverDiD Wave E.3: SurveyDesign.subpopulation() full-design retention #482.

Methodology references (required if estimator / math changes)

  • Method name(s): TwoStageDiD (Gardner 2022) — survey variance under always-treated unit drop.
  • Paper / source link(s): Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R, §2.5 "Domains and subpopulations" (R survey::svyrecvar(subset()) zero-pad convention). Gardner, J. (2022). Two-stage differences in differences. arXiv:2207.05943.
  • Any intentional deviations from the source (and why): None new. The full-domain retention under always-treated drop is documented in REGISTRY.md TwoStageDiD section as "documented synthesis — Wave E.3 parity, full-domain survey design under always-treated drop" mirroring the SpilloverDiD Wave E.3 invariant (PR SpilloverDiD Wave E.3: SurveyDesign.subpopulation() full-design retention #482) and the in-library precedents at imputation.py:2175-2183 (PreTrendsImputation) and prep.py:1401-1432 (DCDH cell variance).

Validation

  • Tests added/updated:
    • New TestTwoStageDiDWaveE3ParityAlwaysTreated in tests/test_two_stage.py (8 tests: no-always-treated baseline, full-domain df_survey preservation, full-domain n_psu reporting, per-cluster zero-pad mock-spy, subpopulation+always-treated composition, cluster-as-PSU+always-treated, no-survey path unchanged, PSU entirely-always-treated).
    • Strengthened test_two_stage_always_treated in tests/test_replicate_weight_expansion.py to assert finite overall_se / overall_p_value / overall_conf_int under always-treated + replicate-weight design.
    • New test_two_stage_always_treated_event_study_and_group_replicate in the same file: exercises event-study + group replicate refit branches end-to-end with aggregate="all", asserts finite se + p_value + conf_int on non-reference horizons and cohorts.
  • Backtest / simulation / notebook evidence (if applicable): N/A — no tutorial changes.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…n survey design

Mechanical transfer of the Wave E.3 SpilloverDiD invariant (PR #482, merge
24de906) to TwoStageDiD. When the always-treated handler drops units from
the OLS sample, the resolved survey design retains its FULL-DOMAIN n_psu /
n_strata / df_survey / strata / fpc / psu arrays instead of being subsetted
via replace(...). Per-cluster scores aggregate at fit-length then zero-pad
onto the full-domain unique-PSU list via two new optional kwargs on
_compute_gmm_variance: score_pad_mask and cluster_ids_full. PSUs containing
only always-treated rows get zero score rows but still count toward G_full
for n_psu / df_survey accounting.

Documented synthesis (library-convention adoption): adopts the canonical
R survey::svyrecvar(subset()) convention (Lumley 2010 §2.5), already
established at imputation.py:2175-2183 (PreTrendsImputation),
prep.py:1401-1432 (DCDH cell variance), and spillover.py (PR #482).

Implementation:
- diff_diff/two_stage.py: delete L1485-1525 design-subset block; promote
  keep_mask to fit()-level scope (always defined; defaults all-True);
  cluster injection sources cluster_ids_raw from FULL-DOMAIN data[cluster]
  (not post-drop df[cluster]) so _inject_cluster_as_psu's zip against
  resolved_survey.strata stays length-aligned; df["_survey_cluster"]
  aligned to post-drop length via resolved_survey.psu[keep_mask.values];
  _compute_gmm_variance + 3 inner _stage2_* methods gain score_pad_mask /
  cluster_ids_full kwargs; zero-pad expansion after per-cluster
  aggregation; strata/fpc obs_idx lookups use cluster_ids_full under
  padding; G < 2 unidentified gate fires on G_full when padding active.
- diff_diff/two_stage.py: _refit_ts callback subsets each replicate
  weight w_r via keep_mask.values before threading into _fit_untreated_model
  and _stage2_*, matching the keep_mask-subsetting applied to survey_weights
  in the main fit (otherwise solve_ols rejects the length mismatch and
  compute_replicate_refit_variance swallows the ValueError so replicate
  inference NaNs out).
- Always-treated warning text updated to reflect the new contract: weights
  are subsetted for OLS, design retained for variance.
- Replicate variance + always-treated: existing path now works correctly
  (score_pad_mask_arg stays None on _uses_replicate_ts paths; per-replicate
  refit handles resampling separately).

Tests (tests/test_two_stage.py):
- New TestTwoStageDiDWaveE3ParityAlwaysTreated class with 8 tests:
  (a) no-always-treated baseline, (b) full-domain df_survey preservation
  under drop, (c) full-domain n_psu reporting, (d) per-cluster zero-pad
  mock-spy on _compute_stratified_meat_from_psu_scores, (e) subpopulation
  + always-treated composition, (f) cluster-as-PSU + always-treated,
  (g) no-survey path unchanged, (h) PSU entirely-always-treated.

Tests (tests/test_replicate_weight_expansion.py):
- Strengthened test_two_stage_always_treated to assert finite overall_se /
  overall_p_value / overall_conf_int (was only asserting finite ATT,
  missing the replicate-SE regression class).
- New test_two_stage_always_treated_event_study_and_group_replicate
  exercising the event-study + group replicate refit branches end-to-end
  under always-treated drop with aggregate="all"; asserts finite se +
  p_value + conf_int on non-reference horizons and cohorts.

Docs:
- docs/methodology/REGISTRY.md: TwoStageDiD section gains "documented
  synthesis — Wave E.3 parity" note; SpilloverDiD Wave E.3 follow-up note
  updated to mark TwoStageDiD parity as shipped.
- CHANGELOG.md: Unreleased Added entry leading with documented-synthesis
  framing.
- TODO.md: drop parity follow-up row.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
@github-actions
Copy link
Copy Markdown

Overall Assessment

✅ Looks good

Executive Summary

  • The estimator-side change is a documented Wave E.3 synthesis, not an undocumented methodology deviation: docs/methodology/REGISTRY.md:L1316-L1318 now explicitly records the full-domain survey-design retention rule, and the implementation in diff_diff/two_stage.py:L1460-L1704 and diff_diff/two_stage.py:L2958-L3084 matches that contract.
  • I did not find a P0/P1 issue in identification, weighting, variance/SE computation, or inference propagation for the changed TwoStageDiD paths.
  • The risky propagation points appear complete: the new padding kwargs are threaded through static/event-study/group stage-2 paths, and replicate refits now subset full-domain replicate weights before refitting in diff_diff/two_stage.py:L1788-L1897.
  • The main non-blocking issue is a small performance regression: the zero-pad branch now runs on every analytical survey fit, even when no always-treated rows were dropped.
  • The added coverage is good for the intended bugfix, but the new replicate tests use unseeded randomness.
  • I could not execute the test suite in this environment because runtime tooling/deps are missing; the test review below is static.

Methodology

No findings. The PR changes the survey-variance realization for TwoStageDiD, but that deviation is explicitly documented in docs/methodology/REGISTRY.md:L1317-L1317, so under the review rubric it is mitigated/P3-informational rather than a defect. The code preserves post-drop OLS estimation while zero-padding PSU scores onto the full-domain survey design, which is consistent with the documented Lumley-style domain-estimation convention implemented here at diff_diff/two_stage.py:L1498-L1512, diff_diff/two_stage.py:L1555-L1602, and diff_diff/two_stage.py:L2969-L3084.

Code Quality

No findings. The PR continues to use safe_inference() on the changed inference surfaces and does not introduce the banned inline t_stat = effect / se pattern; see diff_diff/two_stage.py:L1707-L1709, diff_diff/two_stage.py:L1907-L1929, diff_diff/two_stage.py:L2599-L2606, and diff_diff/two_stage.py:L2713-L2720.

Performance

  • Severity: P3
    Impact: score_pad_mask_arg is populated for every non-replicate survey fit because keep_mask defaults to all-True, so _compute_gmm_variance() always takes the zero-pad/copy branch even when no always-treated rows were removed. That adds avoidable np.unique/np.searchsorted work and full-size c_by_cluster / s2_by_cluster copies on the baseline survey path. See diff_diff/two_stage.py:L1464-L1467, diff_diff/two_stage.py:L1677-L1682, and diff_diff/two_stage.py:L2969-L3027.
    Concrete fix: Only pass score_pad_mask and cluster_ids_full when rows were actually dropped, e.g. gate on n_always_treated > 0 or not keep_mask.all().

Maintainability

No findings. Internal propagation of the new split-length contract looks complete across the changed call graph: diff_diff/two_stage.py:L1684-L1758, diff_diff/two_stage.py:L2377-L2388, diff_diff/two_stage.py:L2549-L2560, and diff_diff/two_stage.py:L2675-L2686.

Tech Debt

No findings. Removing the resolved TODO entry in TODO.md is consistent with the shipped code/docs/tests for this parity item.

Security

No findings.

Documentation/Tests

  • Severity: P3
    Impact: The new replicate tests add unseeded np.random.normal(...) draws at tests/test_replicate_weight_expansion.py:L395-L401 and tests/test_replicate_weight_expansion.py:L437-L440. The assertions are coarse enough that flakiness risk is low, but deterministic fixtures are preferable for regression coverage in inference code.
    Concrete fix: Seed a local generator (rng = np.random.default_rng(...)) and draw from it, or replace the perturbations with fixed values.

Execution note: I could not run pytest here because the environment lacks both pytest and core runtime deps such as numpy; I limited validation to static inspection and AST parsing of the changed Python files.

CI codex review P3 #1 (performance): _compute_gmm_variance was taking the
zero-pad branch (np.unique + np.searchsorted + full-size c_by_cluster /
s2_by_cluster copies) on EVERY analytical survey fit because keep_mask
defaults to all-True. Gate on `n_always_treated > 0` so baseline survey
fits with no drop pass `None / None` and take the bit-identical pre-PR
path.

CI codex review P3 #2 (test determinism): seed local np.random.default_rng
in the two new always-treated replicate-weight tests so always-treated
outcomes are deterministic across runs (was using global np.random.normal
which is non-deterministic at import time).

Co-Authored-By: Claude Opus 4.7 <[email protected]>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: db3f7fea1bec4cbeed0046ca8b6c73c1c1e9cc9b


Overall Assessment

✅ Looks good

Executive Summary

Methodology

Code Quality

Performance

  • No findings. The earlier always-on zero-pad/copy regression is resolved by activating the padding path only when always-treated rows were actually dropped in diff_diff/two_stage.py:L1667-L1687.

Maintainability

  • No findings. The registry note, in-code docstrings/comments, and the new targeted tests all describe the same Wave E.3 contract.

Tech Debt

  • No findings. Removing the shipped parity follow-up from TODO.md is consistent with the delivered code, docs, and tests.

Security

  • No findings.

Documentation/Tests

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 22, 2026
@igerber igerber merged commit fdf2ceb into main May 22, 2026
33 of 34 checks passed
@igerber igerber deleted the spillover-conley-twostagedid-wave-e3-parity branch May 22, 2026 19:12
TDL77 pushed a commit to TDL77/diff-diff that referenced this pull request Jun 2, 2026
…ns 2014 citation correction

Three bundled changes:

1. ConleySpatialHAC methodology-review-tracker promotion (In Progress → Complete).
   - METHODOLOGY_REVIEW.md L91 + L1262-1290 flipped Complete with full
     Verified Components / Test Coverage / R Comparison Results inline
     table / Corrections Made / Deviations / Outstanding Concerns
     structure, with Last Review = 2026-05-26.
   - New tests/test_methodology_conley.py (~1600 LoC; 10 classes;
     60 tests, 5 of them @pytest.mark.slow). Paper-equation-numbered
     Verified Components walk-through covering:
       * Eq. 4.2 cross-sectional sandwich (pairwise-distance
         specialization; Eq. 3.13 is the lattice-indexed form reserved
         for grid coordinates)
       * Eq. 4.2 HC0 + rank-1 limits + K(0) = 1 diagonal contribution
       * Andrews (1991) HAC lag truncation matching
         conleyreg::time_dist.cpp
       * Haversine convention with Earth radius 6371.01 km (matches
         conleyreg::haversine_dist)
       * Phase 2 panel block-decomposed sandwich at atol=1e-12 (internal
         cross-check vs hand-coded reference + cluster time-invariance
         contract)
       * Wave A igerber#120 sparse k-d-tree numerical correctness (sparse-vs-
         dense bit-identity at atol=1e-10 on cross-sectional, panel,
         and sparse-with-cluster paths)
       * R conleyreg v0.1.9 parity at atol=1e-6 on 6 fixtures
         (3 cross-sectional + 3 panel) + sparse-forced cross-sectional
         + sparse-forced panel + time-asymmetric kernel literal-matching
     Three dedicated deviations-area classes:
       * TestConleyLibraryExtensions (6 tests, no R correspondence):
         combined spatial+cluster product kernel Wave A igerber#119 (two
         limit-fixture anchors), callable conley_metric validation
         Wave A igerber#123, sparse k-d-tree activation Wave A igerber#120,
         indefiniteness guard on Bartlett + uniform kernels.
       * TestConleyDeviationsFromR (3 tests): 1-D radial Bartlett vs
         paper's 2-D separable Eq. 3.14, time-label normalization,
         independent temporal kernel deferred.
       * TestConleyDeferrals (5 tests): fail-closed NotImplementedError
         / TypeError contracts for LinearRegression + survey_design,
         DiD/MPD/TWFE + survey_design (estimator-level via
         conley.py::_validate_conley_estimator_inputs), Conley +
         weights (rejected for any weight_type — pweight, aweight,
         fweight), SyntheticDiD + Conley (TypeError), wild_bootstrap
         + Conley.
   - tests/test_conley_vcov.py extracted 1135+ LoC out (4248 → 3113);
     defensive surface preserved (input validation, NaN/inf guards,
     dispatch-level validity, estimator-level integration smoke tests,
     set_params atomicity, sparse-path activation thresholds + density-
     gate fallback). Module docstring refreshed to describe its current
     defensive-regression role.

2. Stale priority-queue cleanup at METHODOLOGY_REVIEW.md L1386:
   - PreTrendsPower removed (already Complete since 2026-05-19).
   - ConleySpatialHAC removed (this PR).
   - Substantive-review-blocked renumbered igerber#2-igerber#5igerber#1-igerber#4.
   - Consolidation-pass-blocked renumbered igerber#6-igerber#8igerber#5-igerber#6.

3. Bertanha-Imbens 2014 citation correction across 16 sites:
   - linalg.py × 8, conley.py × 1, llms-full.txt × 2, REGISTRY.md × 4,
     spillover.rst × 1.
   - NBER w20773 is on FRD external validity, NOT weighted spatial-HAC.
   - The boundary is now framed as a tri-part contract:
     * Shipped — SpilloverDiD + Conley + survey via Wave E.1/E.2/E.3
       (PR igerber#468/igerber#474/igerber#482, stratified-Conley sandwich on PSU totals
       with within-PSU serial Bartlett HAC for lag_cutoff > 0);
       TwoStageDiD + Conley + survey via Wave E.3 parity (PR igerber#485).
     * Deferred (generic linalg surface, any weight_type) — DiD/MPD/
       TWFE/LinearRegression generic path + Conley + survey_design;
       LinearRegression / compute_robust_vcov Conley + weights rejected
       for pweight, aweight, AND fweight (weighted Conley is not
       implemented on the generic linalg surface).
     * Open methodological question (subset) — the pweight /
       survey_design portion of the deferral additionally lacks a
       canonical methodological extension of Conley (1999) for weighted
       spatial-HAC under probability sampling.
   - REGISTRY sites use canonical `**Note (open methodological
     question):**` label wrapper per CLAUDE.md "Documenting Deviations".
   - Historical CHANGELOG entries (pre-[Unreleased]) intentionally
     retain Bertanha-Imbens 2014 attribution as accurate records of
     past release claims.

Verification:
- pytest tests/test_methodology_conley.py: 60 tests (55 unit + 5 slow)
  pass.
- pytest tests/test_methodology_conley.py + tests/test_conley_vcov.py:
  176 pass + 5 slow deselected (preserves pre-edit baseline coverage
  176 vs original 175 + 1 new uniform-kernel indefiniteness guard).
- grep -rn "Bertanha" diff_diff/ docs/ benchmarks/: 0 hits.
- black + ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant