TwoStageDiD Wave E.3 parity: always-treated full-design retention#485
Conversation
…n survey design Mechanical transfer of the Wave E.3 SpilloverDiD invariant (PR #482, merge 24de906) to TwoStageDiD. When the always-treated handler drops units from the OLS sample, the resolved survey design retains its FULL-DOMAIN n_psu / n_strata / df_survey / strata / fpc / psu arrays instead of being subsetted via replace(...). Per-cluster scores aggregate at fit-length then zero-pad onto the full-domain unique-PSU list via two new optional kwargs on _compute_gmm_variance: score_pad_mask and cluster_ids_full. PSUs containing only always-treated rows get zero score rows but still count toward G_full for n_psu / df_survey accounting. Documented synthesis (library-convention adoption): adopts the canonical R survey::svyrecvar(subset()) convention (Lumley 2010 §2.5), already established at imputation.py:2175-2183 (PreTrendsImputation), prep.py:1401-1432 (DCDH cell variance), and spillover.py (PR #482). Implementation: - diff_diff/two_stage.py: delete L1485-1525 design-subset block; promote keep_mask to fit()-level scope (always defined; defaults all-True); cluster injection sources cluster_ids_raw from FULL-DOMAIN data[cluster] (not post-drop df[cluster]) so _inject_cluster_as_psu's zip against resolved_survey.strata stays length-aligned; df["_survey_cluster"] aligned to post-drop length via resolved_survey.psu[keep_mask.values]; _compute_gmm_variance + 3 inner _stage2_* methods gain score_pad_mask / cluster_ids_full kwargs; zero-pad expansion after per-cluster aggregation; strata/fpc obs_idx lookups use cluster_ids_full under padding; G < 2 unidentified gate fires on G_full when padding active. - diff_diff/two_stage.py: _refit_ts callback subsets each replicate weight w_r via keep_mask.values before threading into _fit_untreated_model and _stage2_*, matching the keep_mask-subsetting applied to survey_weights in the main fit (otherwise solve_ols rejects the length mismatch and compute_replicate_refit_variance swallows the ValueError so replicate inference NaNs out). - Always-treated warning text updated to reflect the new contract: weights are subsetted for OLS, design retained for variance. - Replicate variance + always-treated: existing path now works correctly (score_pad_mask_arg stays None on _uses_replicate_ts paths; per-replicate refit handles resampling separately). Tests (tests/test_two_stage.py): - New TestTwoStageDiDWaveE3ParityAlwaysTreated class with 8 tests: (a) no-always-treated baseline, (b) full-domain df_survey preservation under drop, (c) full-domain n_psu reporting, (d) per-cluster zero-pad mock-spy on _compute_stratified_meat_from_psu_scores, (e) subpopulation + always-treated composition, (f) cluster-as-PSU + always-treated, (g) no-survey path unchanged, (h) PSU entirely-always-treated. Tests (tests/test_replicate_weight_expansion.py): - Strengthened test_two_stage_always_treated to assert finite overall_se / overall_p_value / overall_conf_int (was only asserting finite ATT, missing the replicate-SE regression class). - New test_two_stage_always_treated_event_study_and_group_replicate exercising the event-study + group replicate refit branches end-to-end under always-treated drop with aggregate="all"; asserts finite se + p_value + conf_int on non-reference horizons and cohorts. Docs: - docs/methodology/REGISTRY.md: TwoStageDiD section gains "documented synthesis — Wave E.3 parity" note; SpilloverDiD Wave E.3 follow-up note updated to mark TwoStageDiD parity as shipped. - CHANGELOG.md: Unreleased Added entry leading with documented-synthesis framing. - TODO.md: drop parity follow-up row. Co-Authored-By: Claude Opus 4.7 <[email protected]>
Overall Assessment✅ Looks good Executive Summary
MethodologyNo findings. The PR changes the survey-variance realization for Code QualityNo findings. The PR continues to use Performance
MaintainabilityNo findings. Internal propagation of the new split-length contract looks complete across the changed call graph: Tech DebtNo findings. Removing the resolved TODO entry in SecurityNo findings. Documentation/Tests
Execution note: I could not run |
CI codex review P3 #1 (performance): _compute_gmm_variance was taking the zero-pad branch (np.unique + np.searchsorted + full-size c_by_cluster / s2_by_cluster copies) on EVERY analytical survey fit because keep_mask defaults to all-True. Gate on `n_always_treated > 0` so baseline survey fits with no drop pass `None / None` and take the bit-identical pre-PR path. CI codex review P3 #2 (test determinism): seed local np.random.default_rng in the two new always-treated replicate-weight tests so always-treated outcomes are deterministic across runs (was using global np.random.normal which is non-deterministic at import time). Co-Authored-By: Claude Opus 4.7 <[email protected]>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment✅ Looks good Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
…ns 2014 citation correction
Three bundled changes:
1. ConleySpatialHAC methodology-review-tracker promotion (In Progress → Complete).
- METHODOLOGY_REVIEW.md L91 + L1262-1290 flipped Complete with full
Verified Components / Test Coverage / R Comparison Results inline
table / Corrections Made / Deviations / Outstanding Concerns
structure, with Last Review = 2026-05-26.
- New tests/test_methodology_conley.py (~1600 LoC; 10 classes;
60 tests, 5 of them @pytest.mark.slow). Paper-equation-numbered
Verified Components walk-through covering:
* Eq. 4.2 cross-sectional sandwich (pairwise-distance
specialization; Eq. 3.13 is the lattice-indexed form reserved
for grid coordinates)
* Eq. 4.2 HC0 + rank-1 limits + K(0) = 1 diagonal contribution
* Andrews (1991) HAC lag truncation matching
conleyreg::time_dist.cpp
* Haversine convention with Earth radius 6371.01 km (matches
conleyreg::haversine_dist)
* Phase 2 panel block-decomposed sandwich at atol=1e-12 (internal
cross-check vs hand-coded reference + cluster time-invariance
contract)
* Wave A igerber#120 sparse k-d-tree numerical correctness (sparse-vs-
dense bit-identity at atol=1e-10 on cross-sectional, panel,
and sparse-with-cluster paths)
* R conleyreg v0.1.9 parity at atol=1e-6 on 6 fixtures
(3 cross-sectional + 3 panel) + sparse-forced cross-sectional
+ sparse-forced panel + time-asymmetric kernel literal-matching
Three dedicated deviations-area classes:
* TestConleyLibraryExtensions (6 tests, no R correspondence):
combined spatial+cluster product kernel Wave A igerber#119 (two
limit-fixture anchors), callable conley_metric validation
Wave A igerber#123, sparse k-d-tree activation Wave A igerber#120,
indefiniteness guard on Bartlett + uniform kernels.
* TestConleyDeviationsFromR (3 tests): 1-D radial Bartlett vs
paper's 2-D separable Eq. 3.14, time-label normalization,
independent temporal kernel deferred.
* TestConleyDeferrals (5 tests): fail-closed NotImplementedError
/ TypeError contracts for LinearRegression + survey_design,
DiD/MPD/TWFE + survey_design (estimator-level via
conley.py::_validate_conley_estimator_inputs), Conley +
weights (rejected for any weight_type — pweight, aweight,
fweight), SyntheticDiD + Conley (TypeError), wild_bootstrap
+ Conley.
- tests/test_conley_vcov.py extracted 1135+ LoC out (4248 → 3113);
defensive surface preserved (input validation, NaN/inf guards,
dispatch-level validity, estimator-level integration smoke tests,
set_params atomicity, sparse-path activation thresholds + density-
gate fallback). Module docstring refreshed to describe its current
defensive-regression role.
2. Stale priority-queue cleanup at METHODOLOGY_REVIEW.md L1386:
- PreTrendsPower removed (already Complete since 2026-05-19).
- ConleySpatialHAC removed (this PR).
- Substantive-review-blocked renumbered igerber#2-igerber#5 → igerber#1-igerber#4.
- Consolidation-pass-blocked renumbered igerber#6-igerber#8 → igerber#5-igerber#6.
3. Bertanha-Imbens 2014 citation correction across 16 sites:
- linalg.py × 8, conley.py × 1, llms-full.txt × 2, REGISTRY.md × 4,
spillover.rst × 1.
- NBER w20773 is on FRD external validity, NOT weighted spatial-HAC.
- The boundary is now framed as a tri-part contract:
* Shipped — SpilloverDiD + Conley + survey via Wave E.1/E.2/E.3
(PR igerber#468/igerber#474/igerber#482, stratified-Conley sandwich on PSU totals
with within-PSU serial Bartlett HAC for lag_cutoff > 0);
TwoStageDiD + Conley + survey via Wave E.3 parity (PR igerber#485).
* Deferred (generic linalg surface, any weight_type) — DiD/MPD/
TWFE/LinearRegression generic path + Conley + survey_design;
LinearRegression / compute_robust_vcov Conley + weights rejected
for pweight, aweight, AND fweight (weighted Conley is not
implemented on the generic linalg surface).
* Open methodological question (subset) — the pweight /
survey_design portion of the deferral additionally lacks a
canonical methodological extension of Conley (1999) for weighted
spatial-HAC under probability sampling.
- REGISTRY sites use canonical `**Note (open methodological
question):**` label wrapper per CLAUDE.md "Documenting Deviations".
- Historical CHANGELOG entries (pre-[Unreleased]) intentionally
retain Bertanha-Imbens 2014 attribution as accurate records of
past release claims.
Verification:
- pytest tests/test_methodology_conley.py: 60 tests (55 unit + 5 slow)
pass.
- pytest tests/test_methodology_conley.py + tests/test_conley_vcov.py:
176 pass + 5 slow deselected (preserves pre-edit baseline coverage
176 vs original 175 + 1 new uniform-kernel indefiniteness guard).
- grep -rn "Bertanha" diff_diff/ docs/ benchmarks/: 0 hits.
- black + ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Summary
SpilloverDiD(PR SpilloverDiD Wave E.3: SurveyDesign.subpopulation() full-design retention #482) toTwoStageDiD. When the always-treated handler drops units from the OLS sample, the resolved survey design now retains its FULL-DOMAINn_psu/n_strata/df_survey/strata/fpc/psuarrays. Per-cluster scores are zero-padded onto the full-domain unique-PSU list via new optional kwargs (score_pad_mask,cluster_ids_full) on_compute_gmm_variance; PSUs containing only always-treated rows get zero score rows but still count towardG_full._refit_ts) subsets eachw_rviakeep_mask.valuesbefore threading into stage-1/stage-2, mirroring the same subsetting the main fit applies tosurvey_weights(otherwisesolve_olsrejects the length mismatch andcompute_replicate_refit_varianceswallows the ValueError so replicate inference NaNs out).cluster_ids_rawfrom FULL-DOMAINdata[cluster](not post-dropdf[cluster]) so_inject_cluster_as_psu's zip againstresolved_survey.stratastays length-aligned;df["_survey_cluster"]aligned viaresolved_survey.psu[keep_mask.values].TODO.mdsince PR SpilloverDiD Wave E.3: SurveyDesign.subpopulation() full-design retention #482.Methodology references (required if estimator / math changes)
TwoStageDiD(Gardner 2022) — survey variance under always-treated unit drop.survey::svyrecvar(subset())zero-pad convention). Gardner, J. (2022). Two-stage differences in differences. arXiv:2207.05943.imputation.py:2175-2183(PreTrendsImputation) andprep.py:1401-1432(DCDH cell variance).Validation
TestTwoStageDiDWaveE3ParityAlwaysTreatedintests/test_two_stage.py(8 tests: no-always-treated baseline, full-domaindf_surveypreservation, full-domainn_psureporting, per-cluster zero-pad mock-spy, subpopulation+always-treated composition, cluster-as-PSU+always-treated, no-survey path unchanged, PSU entirely-always-treated).test_two_stage_always_treatedintests/test_replicate_weight_expansion.pyto assert finiteoverall_se/overall_p_value/overall_conf_intunder always-treated + replicate-weight design.test_two_stage_always_treated_event_study_and_group_replicatein the same file: exercises event-study + group replicate refit branches end-to-end withaggregate="all", asserts finitese+p_value+conf_inton non-reference horizons and cohorts.Security / privacy
Generated with Claude Code