From ef6da37d2977b31bfeec1b7e8ad9bcc8f76ab0b1 Mon Sep 17 00:00:00 2001
From: igerber
Date: Wed, 20 May 2026 15:46:47 -0400
Subject: [PATCH 1/6] ContinuousDiD: methodology-review-tracker promotion (In Progress -> Complete)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Flips the ContinuousDiD tracker row to **Complete** with full Verified Components / Test Coverage / Corrections Made / Deviations / Outstanding Concerns structure mirroring the HAD precedent (PR #473). Consolidation only — no source code changes, no new tests, no new docstrings.

- METHODOLOGY_REVIEW.md L59 row flipped In Progress -> Complete with Last Review 2026-05-20. L634-655 detail section rewritten with the five-block tracker template: 12 Verified Components rows backed by 15 methodology tests + 80 unit tests + R parity at relative tolerance on 6 benchmark configurations.
- docs/methodology/REGISTRY.md ## ContinuousDiD gains a formal Deviations block (4 entries with framing header) before the Implementation Checklist: boundary-knots Deviation from R + three Phase 2 silent-failures audit fixes documented as library extensions with no R correspondence. Existing Edge Cases bullet and Note entries remain in place — Deviations is the canonical AI-review surface per CLAUDE.md "Documenting Deviations" labels.
- CHANGELOG.md [Unreleased] ### Added gains the ContinuousDiD tracker-promotion bullet at the top with per-benchmark tolerance language calling out the relative-tolerance scope caveat (NOT bit-exact like HAD) due to the boundary-knots deviation precluding algorithmic bit-equality.
- TODO.md gains one consolidated row tracking the three CGBS 2024 feature deferrals (covariates kwarg, discrete-treatment saturated regression, lowest-dose-as-control Remark 3.1) — these mirror R contdid v0.1.0's omissions and are explicitly marked deferred in the REGISTRY Implementation Checklist L755-757.
R parity scope: 1% overall ATT on all 6 benchmarks; 1% max ATT(d) curve and 2% max ACRT(d) curve on benchmarks 1-3 via _compare_with_r helper; 1% overall ACRT on benchmarks 4-5; benchmark 6 is event-study ATT-only. NOT bit-exact (atol=1e-8) like HAD — boundary-knots divergence precludes algorithmic bit-equality on aggregated dose-response curves. 89 regression tests pass (80 unit + 9 methodology, R benchmarks deselected without R/contdid installed).

Co-Authored-By: Claude Opus 4.7
---
 CHANGELOG.md                 |  1 +
 METHODOLOGY_REVIEW.md        | 57 ++++++++++++++++++++++++++++--------
 TODO.md                      |  1 +
 docs/methodology/REGISTRY.md | 19 ++++++++++++
 4 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 92f06d56..6d40a6ae 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
 ### Added
+- **ContinuousDiD methodology-review-tracker promotion.** Tracker row flipped **In Progress** → **Complete** with full Verified Components / Test Coverage / Corrections Made / Deviations / Outstanding Concerns structure mirroring the HAD precedent (PR #473). REGISTRY `## ContinuousDiD` gains a formal Deviations block consolidating the boundary-knots deviation from R `contdid` v0.1.0 (`range(dose)` vs `range(dvals)` — library avoids extrapolation), the `bspline_derivative` derivative-failure `UserWarning` (Phase 2 axis-C #12), the `+inf` → `0` never-treated recoding warning, and the zero-`first_treat`+nonzero-`dose` force-zeroing warning (both axis-E silent-coercion fixes) into a single AI-review-recognized labeled surface. 
R parity for ContinuousDiD remains at relative tolerance (1% overall ATT on all 6 benchmarks; 1% max ATT(d) curve and 2% max ACRT(d) curve on benchmarks 1-3 via the shared `_compare_with_r` helper at `tests/test_methodology_continuous_did.py:395-459`; 1% overall ACRT on benchmarks 4-5; benchmark 6 is event-study ATT-only), NOT bit-exact (`atol=1e-8`) like HAD — the boundary-knots deviation precludes algorithmic bit-equality on aggregated dose-response curves. No source code changes, no new tests, no new docstrings — consolidation only against the existing 15 methodology tests (`tests/test_methodology_continuous_did.py`), 80 unit tests (`tests/test_continuous_did.py`), and `docs/methodology/continuous-did.md` theory note. `METHODOLOGY_REVIEW.md` ContinuousDiD row promoted **In Progress** → **Complete**. - **`SpilloverDiD(vcov_type="conley", survey_design=...)` integration via stratified-Conley sandwich on PSU totals (Wave E.2).** Lifts the Wave E.1 `NotImplementedError` (`spillover.py:2201` upfront, `two_stage.py:217` helper-level) and adds spatial-HAC + design-based variance for the previously deferred composition. **Documented synthesis** of Conley (1999) spatial-HAC × Gerber (2026, arXiv:2605.04124) Proposition 1 Binder TSL (the Wave E.1 foundation) × Wave D Gardner GMM first-stage uncertainty correction (Butts 2021 §3.1 + Gardner 2022 §4) applied to SpilloverDiD's ring-indicator stage-2 design. No reference software combines all three ingredients on a two-stage influence function. **Mechanical composition (panel-aware):** preserves the library's existing `conley_lag_cutoff = 0` semantic at `diff_diff.conley._compute_conley_meat` ("within-period spatial only — exclude cross-period spatial pairs") by looping over periods. 
For each period `t`, SpilloverDiD's per-obs Hájek-weighted Wave D IF `psi_i` is aggregated to per-period PSU totals `S_psu_t[g] = sum_{i in PSU g, time t} psi_i` (via `np.add.at`); per-PSU spatial centroids are panel-constant (mean of per-observation `conley_coords` within each PSU, vectorized `np.add.at` sums / `np.bincount` counts); for each stratum the within-stratum sandwich is `M_h_t = (1 - f_h) * n_h/(n_h-1) * sum_{j,k in PSUs_h} K(d(centroid_j, centroid_k) / conley_cutoff_km) * (S_psu_t[j] - S_bar_h_t)(S_psu_t[k] - S_bar_h_t)'`, where K is the Bartlett kernel (SpilloverDiD currently exposes Bartlett only and hardcodes it; the survey helper accepts `"uniform"` too but exposing that on the SpilloverDiD constructor is a separate follow-up) and `d` is haversine / euclidean / callable per `ConleyMetric`. Cross-stratum kernel weights are exactly zero by sampling design (strata are independence partitions). Total meat is `sum_t sum_h M_h_t`. Cross-period spatial pairs are excluded by construction — the per-period loop matches the library's panel Conley contract exactly. **Reduction semantics (load-bearing for tests):** the orchestrator's panel-aware meat equals `sum_t` of per-period within-stratum stratified-Conley sandwiches on per-period PSU totals (pinned at `tests/test_spillover.py::TestSpilloverDiDWaveE2ConleySurveyDesign::test_b_panel_aware_per_period_sum_invariant`); single stratum (H = 1, FPC = inf) reduces to `sum_t` plain Conley sandwich on per-period PSU totals (NOT on time-collapsed totals). 
**Implementation:** new `_compute_stratified_conley_meat_from_psu_scores` helper in `diff_diff/survey.py` (parallel to existing `_compute_stratified_meat_from_psu_scores` 3-tuple `(meat, variance_computed, legitimate_zero_count)` contract; per-stratum loop replaces the inner `centered.T @ centered` with `_compute_conley_meat(scores=centered, coords=psu_coords_h, ...)` in cross-sectional mode); new dispatch wrapper `_compute_stratified_conley_meat` in `diff_diff/two_stage.py` (parallel to existing `_compute_binder_tsl_meat`, performs per-obs Psi → PSU aggregation + centroid derivation + dispatch to survey helper, intentionally drops `cluster_ids` at the dispatch boundary — see Restrictions). `_compute_gmm_corrected_meat` conley branch extended with `if resolved_survey is not None` routing to the new wrapper; the `resolved_survey is None` branch is bit-identical to Wave D. **Singleton-stratum `lonely_psu="adjust"` parity:** the survey helper mirrors the Binder helper's `continue` to skip the FPC scale on singleton strata (with `n_h = 1` the scale `n_h / (n_h - 1)` would divide by zero); the degenerate one-PSU kernel `K = [[K(0)]] = [[1.0]]` reduces to `centered.T @ centered`, matching Binder's singleton-adjust output. **Saturated `df_survey = 0` NaN-fail:** mirrors Wave E.1 (`_compute_stratified_conley_meat` returns NaN meat with `UserWarning` template "Wave E.2 stratified-Conley sandwich: df_survey = 0..." so callers can `pytest.warns(UserWarning, match="Wave E.2 stratified-Conley")`). 
**Public surface restrictions:** replicate-weight variance (BRR / Fay / JK1 / JKn / SDR) raises `NotImplementedError` (inherits Wave E.1 gate; per-replicate full refit is separate follow-up scope); `cluster= + survey_design.psu + vcov_type="conley"` coerces `cluster=` to PSU per Wave E.1's warn-and-use-PSU pattern (the Conley cluster product kernel becomes a no-op after PSU aggregation, so `cluster_ids` is intentionally not threaded into the inner Conley kernel call — every PSU is its own cluster post-aggregation, which would zero all cross-PSU pairs); LinearRegression-side `vcov_type="conley" + survey_design=` gate at `diff_diff/linalg.py:2853` remains (separate Bertanha-Imbens 2014 weighted-Conley "Phase 5" roadmap, not Wave E); DiagnosticReport routing for `SpilloverDiDResults(vcov_type="conley", survey_design=)` requires `_APPLICABILITY` / `_PT_METHOD` registration (separate Wave F PR). **Tests:** new `TestSpilloverDiDWaveE2ConleySurveyDesign` and `TestSpilloverDiDWaveE2ConleySurveyDesignEventStudy` classes in `tests/test_spillover.py` (bit-identical no-survey fallback; panel-aware per-period sum invariant on the orchestrator + helper composition; hand-computation methodology anchor; single-stratum ≡ plain Conley on PSU totals; cross-stratum independence as a unit test on the survey helper with interleaved cross-stratum centroids; Binder vs Conley singleton-adjust FPC skip parity; lonely-PSU sensitivity across three modes; FPC large ≡ no-FPC and FPC = n_h zeros stratum; saturated NaN-fail with `pytest.warns(match="Wave E.2 stratified-Conley")`; replicate-weight + non-pweight rejections; cluster warn-and-use-PSU; fit idempotency; `finite_mask` survey-array subsetting; no-PSU coverage — weights-only `SurveyDesign(weights=...)`, strata-only `SurveyDesign(weights=..., strata=...)`, and a per-period re-index unit invariant pinning that no cross-period spatial pairs leak into the meat on implicit-PSU layouts; event-study path on both `is_staggered=True`/`False` 
branches per `feedback_cohort_loop_trigger_cache_both_branches`; drift goldens at `rtol=1e-12 / atol=1e-14`). The pre-existing `tests/test_spillover.py::test_fit_conley_plus_survey_design_not_implemented` Wave E.1-era gate-assertion test is removed (replaced by the positive-path tests above). Wave E.1 entry's "Public surface restrictions" bullet updated to past-tense the conley+survey gate reference. - **HeterogeneousAdoptionDiD methodology-review-tracker promotion.** New `tests/test_methodology_had.py` (6 classes, 36 tests) with paper-equation-numbered Verified Components walk-through against de Chaisemartin, Ciccia, D'Haultfœuille & Knau (2026) arXiv:2405.04465v6 (Equations 3 / 7 / 11 / 18 / 29 and Theorems 1 / 3 / 4 / 7): Design 1' MC recovery on both the zero-boundary DGP AND a nonzero-boundary-intercept DGP (`ΔY = c + β·D + ε` with `c != 0`) so the `att = (mean(ΔY) − τ_bc) / mean(D)` subtraction term is verified explicitly, N(0,1) coverage at `n_replicates=200`, mass-point Wald-IV closed-form equivalence at `atol=1e-9`, QUG limit-law distributional match at KS-stat ≤ 0.05 (n_draws=5000), Yatchew-HR paper-literal `σ²_diff = 1/(2G)` normalization lock, joint Stute pre-trends + homogeneity H0 fail-to-reject on both surfaces and H1 reject for joint homogeneity under a nonlinear DGP, and library-deviation locks (equal-weighting via selective low-dose-region replication, sup-t bootstrap gating, staggered-timing fail-closed `ValueError`). Added "Non-testable assumptions (paper Section 3.1.2)" Notes block to `HeterogeneousAdoptionDiD` class docstring + "Scope (what this test does NOT cover)" clauses to `qug_test` / `stute_test` / `yatchew_hr_test` / `did_had_pretest_workflow` Notes sections explicitly stating that the pre-tests verify ADJACENT assumptions (Assumption 4 / 7 / 8) and CANNOT test Assumptions 5 or 6. 
Phase-4 validation-harness items (Pierce-Schott 2016 Figure 2 replication, Table 1 coverage-rate reproduction across 3 DGPs × G ∈ {100, 500, 2500}) waived with documented rationale: R parity at `atol=1e-8` in `tests/test_did_had_parity.py` (3 DGPs × 5 method combos, bit-exact via `rtol=0`) is a strictly stronger anchor than coverage-rate Monte Carlo, and the paper itself self-acknowledges (Section 5.2) that NP estimators are too noisy to be informative on the LBD-restricted PNTR panel. REGISTRY HAD section gains a consolidated Deviations block (5 entries with framing header) and closes 2 of 3 unchecked Implementation Checklist items — the staggered-timing fail-closed `ValueError` and the Assumption 5/6 non-testability documentation; the `covariates=` Theorem 6 follow-up and the extensive-margin / "consider running standard DiD" warning both remain explicitly tracked in `TODO.md` as Low-priority follow-ups rather than claimed-closed. `dechaisemartin-2026-review.md:182-194` requirements checklist boxes the Phase 1a/1b/1c implementation-status closures + the Assumption 5/6 documentation + the staggered-timing closures; the extensive-margin item is acknowledged as partial (zero-dose `UserWarning` exists in `qug_test`; main-`fit()` "consider standard DiD" recommendation is the TODO follow-up). `METHODOLOGY_REVIEW.md` HAD row promoted **In Progress** → **Complete**. - **SunAbraham `vcov_type` parameter (Phase 1b PR 1/8).** `SunAbraham(vcov_type=...)` now accepts `{"classical","hc1","hc2","hc2_bm"}` (defaults to `"hc1"`, which preserves prior behavior bit-equally - SA historically hard-coded HC1). Auto-cluster-at-unit dropped when the user opts into explicit `vcov_type="hc2"` or `vcov_type="classical"` (one-way only); preserved for `"hc1"` and `"hc2_bm"`. 
When `vcov_type in {"classical","hc2","hc2_bm"}`, `_fit_saturated_regression` auto-routes to a full-dummy saturated design (mirrors TWFE Gate 1 from PR #469): FWL preserves cohort coefficients but not the hat matrix, so HC2 leverage and Bell-McCaffrey Satterthwaite DOF must be computed on the full FE projection. Empirically matches R `lm()` summary classical SE, `sandwich::vcovHC(type="HC2")`, and `clubSandwich::vcovCR(..., type="CR2")` + `coef_test()$df_Satt` at atol=1e-10 (cohort SE and BM DOF pinned in `tests/test_methodology_sun_abraham.py`). For `vcov_type="hc2_bm"`, the user-facing aggregated inference (`event_study_effects[e]['p_value']`/`['conf_int']`, `overall_p_value`/`overall_conf_int`) uses CR2 Bell-McCaffrey contrast DOF — matches `clubSandwich::Wald_test(test="HTZ")$df_denom` at atol=1e-10 (mirrors PR #465's `_compute_cr2_bm_contrast_dof` pattern for MultiPeriodDiD's post-period-average ATT). `vcov_type` is now propagated to `SunAbrahamResults.vcov_type` for downstream introspection. `SurveyDesign` (any kind — analytical weights, stratified, PSU, or replicate-weight) combined with `vcov_type in {"classical","hc2","hc2_bm"}` raises `NotImplementedError`: the survey-design TSL (or replicate-weight refit) variance overrides the analytical sandwich family, and the auto-cluster guard for one-way families would silently downgrade unit-level PSUs to per-observation PSUs. Use `vcov_type="hc1"` (default) for survey designs. `conley` rejected at `__init__` with a deferral message (would require threading 6+ `conley_*` params through the saturated regression call). **Deviation from R:** SA's within-transform HC1 SE differs from `fixest::sunab()` by ~1-2% (~2e-3 absolute) on typical panel sizes due to a different `(n-k)` finite-sample correction (fixest counts absorbed FE in k_total; SA's `solve_ols` counts only within-transformed columns); the IW aggregation step is otherwise identical (pinned at atol=5e-3, tracked in TODO.md). 
First PR of the Phase 1b standalone-estimator threading initiative (7 PRs to follow: StackedDiD, WooldridgeDiD-OLS, CallawaySantAnna, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD).
diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md
index e83e2117..43fd0be4 100644
--- a/METHODOLOGY_REVIEW.md
+++ b/METHODOLOGY_REVIEW.md
@@ -56,7 +56,7 @@ The catalog grew incrementally over several quarters, so formats vary across the
 | Estimator | Module | R / Stata Reference | Status | Last Review |
 |-----------|--------|---------------------|--------|-------------|
-| ContinuousDiD | `continuous_did.py` | `contdid` v0.1.0 | **In Progress** | — |
+| ContinuousDiD | `continuous_did.py` | `contdid` v0.1.0 | **Complete** | 2026-05-20 |
 | ChaisemartinDHaultfoeuille (DCDH) | `chaisemartin_dhaultfoeuille.py` | `DIDmultiplegtDYN` | **In Progress** | — |
 | HeterogeneousAdoptionDiD (HAD) | `had.py`, `had_pretests.py` | `chaisemartin::did_had` (`Credible-Answers/did_had` v2.0.0); `nprobust` for bandwidth | **Complete** | 2026-05-20 |
 | TROP | `trop.py`, `trop_local.py`, `trop_global.py` | (forthcoming; paper-author reference implementation) | **In Progress** | — |
@@ -637,20 +637,51 @@ and covariate-adjusted specifications.)
 |-------|-------|
 | Module | `continuous_did.py`, `continuous_did_bspline.py`, `continuous_did_results.py` |
 | Primary Reference | Callaway, Goodman-Bacon & Sant'Anna (2024), *Difference-in-Differences with a Continuous Treatment*, NBER WP 32117 |
-| R Reference | `contdid` v0.1.0 (CRAN) |
-| Status | **In Progress** |
-| Last Review | — |
+| R Reference | `contdid` v0.1.0 (CRAN) — R parity at relative tolerance (1% overall ATT on all 6 benchmarks; 1% max ATT(d) curve and 2% max ACRT(d) curve on benchmarks 1-3 via the shared `_compare_with_r` helper; 1% overall ACRT on benchmarks 4-5; benchmark 6 is event-study ATT-only) via `tests/test_methodology_continuous_did.py::TestRBenchmark`. 
NOT bit-exact (`atol=1e-8`) like HAD because of the boundary-knots deviation documented below |
+| Status | **Complete** |
+| Last Review | 2026-05-20 |
-**Documentation in place:**
-- REGISTRY.md section: `## ContinuousDiD` plus dedicated theory note in `docs/methodology/continuous-did.md` (PT vs SPT identification, ATT(d|d) / ATT(d) / ACRT(d) / ATT^{loc} / ATT^{glob} / ACRT^{glob} estimands, B-spline OLS, multiplier bootstrap)
-- `tests/test_methodology_continuous_did.py`: 15 tests across 5 classes (linear dose response, quadratic with cubic basis, multi-period aggregation, edge cases, R benchmark)
-- Implementation: 80 unit tests in `tests/test_continuous_did.py`
-- Survey support: weighted B-spline OLS, TSL on influence functions, bootstrap+survey (Phase 6)
+**Verified Components:**
+- [x] **PT and SPT identification** (CGBS 2024 Assumptions 1-2) — two-level parallel trends with explicit untreated-and-doses conditioning; estimands `ATT(d|d)`, `ATT(d)`, `ACRT(d)`, `ATT^{loc}`, `ATT^{glob}`, `ACRT^{glob}` defined in `docs/methodology/continuous-did.md` § 4 + REGISTRY `## ContinuousDiD` Identification block. Hand-calc coverage: `tests/test_methodology_continuous_did.py::TestLinearDoseResponse` (4 tests at `atol=1e-10` / `atol=1e-6` on no-noise linear DGP — locks the `ATT^{glob}` binarization formula `E[ΔY | D > 0] − E[ΔY | D = 0]`, the `ACRT^{glob}` plug-in average, and the `ATT(d) = 2d`, `ACRT(d) = 2` closed forms).
+- [x] **B-spline basis matching `splines2::bSpline`** (cubic and linear degrees, `num_knots=0` default; global boundary knots from the training-dose range, NOT per-cell) — `tests/test_methodology_continuous_did.py::TestQuadraticWithCubicBasis::test_quadratic_recovery` recovers `ATT(d) = d²` at `atol=1e-6` via a degree-3 basis (cubic spline can represent quadratic exactly). The matching basis algorithm lives in `diff_diff/continuous_did_bspline.py` (216 LoC); the boundary-knots deviation from R `contdid` is documented in the Deviations block below. 
+- [x] **Multi-period (g,t) cell iteration with base period selection** — `TestMultiPeriodAggregation::test_multiple_groups` and `test_gt_cell_count` exercise the cohort iteration on 2-cohort staggered panels; cell counts agree with the R `ptetools`-style convention. R parity at 1% relative further locks the staggered-aggregation surface via `TestRBenchmark::test_benchmark_4_staggered_dose` and `test_benchmark_5_not_yet_treated`.
+- [x] **Dose-response (`aggregate="dose"`) and event-study (`aggregate="eventstudy"`) aggregation** with group-proportional weights (`n_treated/n_total` per group, divided among post-treatment cells; matches R `ptetools` convention) — R parity via `TestRBenchmark::test_benchmark_1_basic_cubic` / `_2_linear` / `_3_interior_knots` / `_4_staggered_dose` / `_5_not_yet_treated` (dose) and `_6_event_study` (event-study). Per-benchmark relative tolerances: all 6 assert overall ATT at `< 0.01` (1%); benchmarks 1-3 additionally assert max ATT(d) at `< 0.01` and max ACRT(d) at `< 0.02` via the shared `_compare_with_r` helper; benchmarks 4-5 assert overall ACRT at `< 0.01` inline; benchmark 6 is event-study mode (binarized ATT, no ACRT comparison). Skipped if R / `contdid` not installed via `_check_r_contdid()`; R parity uses R's `dvals` grid for exact knot alignment.
+- [x] **Multiplier bootstrap for inference** (PSU-level multiplier weights on the survey path per Phase 6) — implementation in `diff_diff/continuous_did.py`; bootstrap SE invariant on rank-deficient cells locked in `TestEdgeCasesMethodology::test_all_same_dose` (verifies `dose_response_att.se` is finite on a heterogeneous-outcome / identical-dose DGP); 80 unit tests in `tests/test_continuous_did.py` exercise the rest of the bootstrap path. 
+- [x] **Analytical SEs via influence functions** (NOT delta method; corrected post-v3.0.0, see Corrections Made) — IF-based variance with `safe_inference()` joint-NaN consistency on all six estimand fields (`overall_att`, `overall_acrt`, dose-response, event-study).
+- [x] **Survey support**: weighted B-spline OLS, two-stage linearization (TSL) on influence functions, bootstrap + survey via PSU-level multiplier weights (Phase 3 + Phase 6). REGISTRY `## ContinuousDiD` Implementation Checklist L758 boxed.
+- [x] **`+inf` → `0` never-treated recoding** with `UserWarning` reporting the affected row count (axis-E silent-coercion fix per Phase 2 audit) — the R-style convention of `first_treat = +inf` is normalized internally but no longer absorbed silently. **Any negative `first_treat` value (including `-inf`) raises `ValueError`** with the affected row count. Locked in `tests/test_continuous_did.py`.
+- [x] **Zero-`first_treat` rows with nonzero `dose` force-zeroed** with `UserWarning` reporting the affected row count (axis-E silent-coercion fix per Phase 2 audit) — never-treated cells must have `D=0` for internal consistency; the previous silent zeroing is now signaled. Locked in `tests/test_continuous_did.py`.
+- [x] **`bspline_derivative_design_matrix` derivative-construction failure warning** (Phase 2 axis-C #12 silent-failures audit fix) — aggregates failed basis indices into a single `UserWarning` naming them, instead of swallowing `scipy.interpolate.BSpline.ValueError` and leaving silently zeroed derivative columns. Both ACRT point estimates AND analytical/bootstrap inference read the same `dPsi` matrix (`continuous_did.py:1026-1046` and the bootstrap ACRT path at `continuous_did.py:1524-1561`), so both are biased on partial-derivative failure — the warning wording makes that explicit. The all-identical-knot degenerate case (single dose value) remains silently handled because derivatives are mathematically zero there. 
Locked in `tests/test_continuous_did.py::TestBSplineDerivativeDegenerateBasis` (3 tests: `test_single_dose_is_silent`, `test_valueerror_from_bspline_emits_aggregate_warning`, `test_clean_knots_emit_no_warning`); source-level aggregate-warning block at `diff_diff/continuous_did_bspline.py:150-187`.
+- [x] **Edge cases**: all-same-dose (rank-deficient design, recovers only intercept = `ATT^{glob}`, ACRT = 0 everywhere), single-treated-unit (insufficient for OLS, raises `ValueError` "No valid"), discrete-treatment (detected and warned, saturated regression deferred), rank-deficiency per cell (cell skipped under `rank_deficient_action="silent"` / `"warn"`), balanced-panel-required (matches R `contdid` v0.1.0). Locked in `TestEdgeCasesMethodology` (2 methodology tests) + rank-deficient unit tests in `test_continuous_did.py`.
+- [x] **Anticipation-aware not-yet-treated control mask**: when `anticipation > 0`, the not-yet-treated control mask uses `G > t + anticipation` (not just `G > t`) to exclude cohorts in the anticipation window from controls. When `anticipation=0` (default), behavior is unchanged. CHANGELOG `[3.0.x]`-era fix; locked in `test_continuous_did.py`.
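The anticipation-aware control mask in the last Verified Components row can be illustrated with a minimal NumPy sketch. This is not the code in `continuous_did.py`: the helper name `not_yet_treated_controls` is hypothetical, and treating never-treated units (coded `first_treat == 0`) as part of the same control pool is an assumption made only for illustration.

```python
import numpy as np


def not_yet_treated_controls(first_treat, t, anticipation=0):
    """Hypothetical helper: boolean control mask for calendar period t.

    Assumes never-treated units are coded first_treat == 0 and that they
    are pooled with not-yet-treated cohorts (an illustrative assumption).
    """
    g = np.asarray(first_treat)
    never_treated = g == 0
    # Cohorts inside the anticipation window (t < G <= t + anticipation)
    # are excluded, which is the `G > t + anticipation` rule quoted above.
    not_yet_treated = g > t + anticipation
    return never_treated | not_yet_treated


first_treat = np.array([0, 3, 4, 5, 6])
print(not_yet_treated_controls(first_treat, t=3))
print(not_yet_treated_controls(first_treat, t=3, anticipation=1))
```

With `anticipation=1` the cohort first treated in period 4 drops out of the controls for `t=3`, while `anticipation=0` reproduces the plain `G > t` rule.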
-**Outstanding for promotion:**
-- Detailed Verified Components block here mirroring REGISTRY's Implementation Checklist (B-spline basis matching `splines2::bSpline`, multi-period cell iteration, dose-response and event-study aggregation, multiplier bootstrap, analytical SE via influence functions)
-- Document the boundary-knots deviation from R `contdid` v0.1.0 (Python uses `range(dose)`; R uses `range(dvals)` which can produce extrapolation artifacts) in a formal Deviations block here
-- Formalize the `+inf` recoding and zero-dose silent-zeroing warnings (currently in REGISTRY) into a Verified Components row
+**Test Coverage:**
+- 15 methodology tests in `tests/test_methodology_continuous_did.py` (5 classes: 4 + 1 + 2 + 2 + 6); the R-benchmark class (6 tests) skips if R / `contdid` v0.1.0 is not installed via `_check_r_contdid()` guard.
+- 80 unit tests in `tests/test_continuous_did.py` (1,530 LoC) covering bootstrap, survey design, IF-based SE, anticipation, rank-deficient cells, and result-class field contracts.
+- R parity at relative tolerance (NOT bit-exact — see Deviations § "Boundary knots") on 6 benchmark configurations: all 6 assert overall ATT at `< 0.01` (1% relative); benchmarks 1-3 additionally assert max ATT(d) at `< 0.01` and max ACRT(d) at `< 0.02` via the shared `_compare_with_r` helper; benchmarks 4-5 assert overall ACRT at `< 0.01` inline; benchmark 6 is event-study mode (binarized ATT, no ACRT comparison).
+- Documentation: `docs/methodology/continuous-did.md` (14,885 bytes theory note covering PT vs SPT, estimands, B-spline OLS, multiplier bootstrap).
+
+**Corrections Made:**
+1. **SE method correction (early v3.0.x):** ContinuousDiD originally computed SEs via delta method; corrected to influence-function-based variance. See CHANGELOG entries "Fix ContinuousDiD SE method: influence function, not delta method" + "Fix methodology doc: influence functions, not delta method for ContinuousDiD SEs".
+2. 
**Anticipation-aware control mask** (CHANGELOG `[3.0.x]`-era): not-yet-treated control mask now uses `G > t + anticipation` instead of `G > t`, excluding cohorts in the anticipation window from controls.
+3. **Phase 2 silent-failures audit fixes** (axis-C + axis-E):
+   - **axis-C #12:** `bspline_derivative_design_matrix` no longer swallows `scipy.interpolate.BSpline.ValueError` silently; aggregates failed basis indices into a single `UserWarning`. Both ACRT point estimates AND analytical/bootstrap inference are affected when this fires.
+   - **axis-E (silent coercion):** `+inf` → `0` never-treated recoding now emits `UserWarning` with affected row count; negative `first_treat` (including `-inf`) raises `ValueError`. Zero-`first_treat` rows with nonzero `dose` force-zeroed now also emit `UserWarning`.
+4. **Bread normalization, fweight TSL scaling, weighted-mass IF linearization** (CHANGELOG): three pieces of the IF-based variance machinery on the survey path corrected to match the analytical identities. Replicate-IF variance score scaling also fixed for EfficientDiD / TripleDifference / ContinuousDiD as part of the same sweep.
+5. **Tracker-promotion consolidation (this PR, 2026-05-20):** formal Deviations block added to REGISTRY `## ContinuousDiD` consolidating the boundary-knots deviation, the `bspline_derivative` warning, and the two axis-E silent-coercion warnings into a single labeled surface. The original Edge Cases / Notes entries remain in place — Deviations is an additional canonical surface (per CLAUDE.md "Documenting Deviations (AI Review Compatibility)" labels).
+
+**Deviations from the paper / from R / library extensions:**
+1. **Deviation from R — boundary knots use `range(dose)` not `range(dvals)`** — knots are built once from all treated doses (global, not per-cell) to ensure a common basis across (g,t) cells for aggregation. The evaluation grid is clamped to training-dose boundary knots (`range(dose)`). 
R's `contdid` v0.1.0 has an inconsistency where `splines2::bSpline(dvals)` uses `range(dvals)` instead of `range(dose)`, which can produce extrapolation artifacts at dose-grid extremes. **Scope caveat:** R parity tests therefore run at **relative** tolerance bands (1% on overall ATT for all 6 benchmarks; 1% on max ATT(d) curve and 2% on max ACRT(d) curve for benchmarks 1-3 via the `_compare_with_r` helper; 1% on overall ACRT for benchmarks 4-5; benchmark 6 is event-study, ATT-only), NOT bit-exact (`atol=1e-8`) like HAD — `contdid` and ContinuousDiD cannot bit-match on aggregated dose-response or ACRT curves because they use different knot placement; the agreement band reflects the boundary-knot divergence rather than algorithmic drift. The slightly wider 2% ACRT(d)-curve tolerance on benchmarks 1-3 reflects the tighter coupling between basis derivative numerics and the boundary-knot choice; benchmarks 4-5 use overall scalars (`overall_acrt`) where the boundary effect averages down to 1%. Library extension toward methodological soundness (avoids extrapolation).
+2. **Library extension — `bspline_derivative_design_matrix` derivative-failure warning** — previously swallowed `scipy.interpolate.BSpline.ValueError` in the per-basis derivative loop, leaving affected derivative-matrix columns silently zero. Now aggregates the failed basis indices into a single `UserWarning` naming them. Both ACRT point estimates and analytical/bootstrap inference read the same `dPsi` matrix, so both are biased when this fires — the warning wording makes that explicit. The all-identical-knot degenerate case (single dose value) remains silently handled (mathematically-zero derivatives are correct there). Phase 2 axis-C #12 silent-failures audit fix. No R correspondence; `contdid` v0.1.0 does not implement an equivalent warning.
+3. 
**Library extension — `+inf` → `0` never-treated recoding warns** — the R-style convention of coding never-treated units as `first_treat=+inf` is still accepted and normalized to `first_treat=0` internally, but the estimator now emits a `UserWarning` reporting the row count so the silent recategorization is surfaced. Only `+inf` is recoded (matching the R convention). Any **negative** `first_treat` value (including `-inf`) raises `ValueError` with the row count, since such units would otherwise silently fall out of both the treated (`g > 0`) and never-treated (`g == 0`) masks. Pass `0` directly for never-treated units to avoid the warning. Library extension toward stricter safety; matches the broader Phase 2 axis-E silent-coercion convention. No R correspondence; `contdid` v0.1.0 silently absorbs `+inf` without a signal.
+4. **Library extension — zero-`first_treat` rows with nonzero `dose` force-zeroed with warning** — never-treated cells must have `D=0` for internal consistency in the dose-response. The estimator now emits a `UserWarning` with the affected row count before the zeroing, so unintended nonzero doses on never-treated rows are no longer absorbed without a signal. Library extension toward stricter safety with no R correspondence — `contdid` v0.1.0 has the same `first_treat = 0` → `D = 0` invariant requirement but silently coerces without a warning; same axis-E silent-coercion lineage as #3.
+
+**Outstanding Concerns:**
+- **Covariate support (deferred)** — `covariates=` kwarg is not implemented; matches R `contdid` v0.1.0 which also has no covariate support. Tracked as a future-work row in TODO.md (Low priority).
+- **Discrete-treatment saturated regression (deferred)** — when `dose` is detected as integer-valued, the estimator currently warns; the saturated regression approach (one coefficient per discrete dose level instead of B-spline basis) is not implemented. Tracked as a future-work row. 
+- **Lowest-dose-as-control (Remark 3.1, deferred)** — CGBS 2024 Remark 3.1 outlines using the lowest non-zero dose as the comparison group when `P(D=0) = 0`. Not implemented; the estimator requires `P(D=0) > 0` (never-treated controls present). Tracked as a future-work row. + +These three are feature deferrals (paper-supported extensions that the library has chosen not to implement yet), not tracker blockers — the Implementation Checklist at REGISTRY `:755-757` already marks them as `[ ]` deferred. They mirror the same "future work" status that the ChaisemartinDHaultfoeuille and TROP tracker rows carry for analogous optional extensions. --- diff --git a/TODO.md b/TODO.md index eedc57df..fc8434b7 100644 --- a/TODO.md +++ b/TODO.md @@ -82,6 +82,7 @@ Deferred items from PR reviews that were not addressed before merge. | ImputationDiD dense `(A0'A0).toarray()` scales O((U+T+K)^2), OOM risk on large panels | `imputation.py` | #141 | Medium (deferred — only triggers when sparse solver fails) | | Multi-absorb weighted demeaning needs iterative alternating projections for N > 1 absorbed FE with survey weights; unweighted multi-absorb also uses single-pass (pre-existing, exact only for balanced panels) | `estimators.py` | #218 | Medium | | Survey design resolution/collapse patterns are inconsistent across panel estimators — ContinuousDiD rebuilds unit-level design in SE code, EfficientDiD builds once in fit(), StackedDiD re-resolves on stacked data; extract shared helpers for panel-to-unit collapse, post-filter re-resolution, and metadata recomputation | `continuous_did.py`, `efficient_did.py`, `stacked_did.py` | #226 | Low | +| ContinuousDiD deferred CGBS 2024 extensions: (a) `covariates=` kwarg not implemented (matches R `contdid` v0.1.0); (b) discrete-treatment saturated regression deferred (integer-valued dose currently warned, not routed to per-level coefficients); (c) lowest-dose-as-control per CGBS 2024 Remark 3.1 (when `P(D=0) = 0`) not implemented — estimator 
requires never-treated controls. REGISTRY `## ContinuousDiD` Implementation Checklist L755-757 marks these `[ ]`. | `diff_diff/continuous_did.py` | — | Low | | Survey-weighted Silverman bandwidth in EfficientDiD conditional Omega* — `_silverman_bandwidth()` uses unweighted mean/std for bandwidth selection; survey-weighted statistics would better reflect the population distribution but is a second-order refinement | `efficient_did_covariates.py` | — | Low | | TROP: extend Wave 4's `_setup_trop_data` helper to also cover the duplicated bootstrap resampling loop in `_bootstrap_variance` / `_bootstrap_variance_global` (~40 LoC dedup; mirrors the data-setup helper pattern with a `fit_callable` parameter for the per-draw refit step). | `trop_local.py`, `trop_global.py` | follow-up | Low | | TripleDifference power auto-routing: `power.simulate_power` ignores `n_periods` for DDD because `_ddd_dgp_kwargs` is hard-coded to the cross-sectional `generate_ddd_data`. Now that `generate_ddd_panel_data` exists (Wave 4), add a new `_EstimatorProfile` registry entry (or extend the existing one) to route to the panel DGP when `n_periods > 2`. | `power.py`, `prep_dgp.py` | follow-up | Low | diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 14065416..915f0b39 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -744,6 +744,25 @@ See `docs/methodology/continuous-did.md` Section 4 for full details. - **Boundary knots**: Knots are built once from all treated doses (global, not per-cell) to ensure a common basis across (g,t) cells for aggregation. Evaluation grid is clamped to training-dose boundary knots (`range(dose)`). R's `contdid` v0.1.0 has an inconsistency where `splines2::bSpline(dvals)` uses `range(dvals)` instead of `range(dose)`, which can produce extrapolation artifacts at dose grid extremes. Our approach avoids extrapolation and is methodologically sound. 
- **Note:** `bspline_derivative_design_matrix` previously swallowed `ValueError` from `scipy.interpolate.BSpline` in the per-basis derivative loop, leaving affected columns of the derivative design matrix as zero with no user-facing signal. It now aggregates the failed basis indices and emits ONE `UserWarning` naming them. Both ACRT point estimates and analytical/bootstrap inference read the same `dPsi` matrix (see `continuous_did.py:1026-1046` and the bootstrap ACRT path at `continuous_did.py:1524-1561`), so both are biased on a partial derivative-construction failure — the warning wording makes that explicit. The all-identical-knot degenerate case (single dose value) remains silently handled — derivatives there are mathematically zero. Axis-C finding #12 in the Phase 2 silent-failures audit. +### Deviations from the paper / from R / library extensions + +*Note #1 codifies a deviation from R `contdid` v0.1.0's boundary-knot +choice (library extension toward methodological soundness — avoids +extrapolation that `contdid` exhibits). Notes #2-#4 codify library +extensions with NO R correspondence — Phase 2 silent-failures audit +fixes that surface previously silent behavior as `UserWarning` or +`ValueError`; `contdid` v0.1.0 absorbs the same conditions without a +signal. The original Edge Cases bullet (under § Edge Cases above) and +the two `**Note:**` entries (under § Implementation Checklist below) +remain in place — this Deviations block is the canonical AI-review +surface per CLAUDE.md "Documenting Deviations (AI Review Compatibility)" +labels.* + +1. **Deviation from R:** `range(dose)` vs `range(dvals)` boundary knots — the library uses `range(dose)` (training-dose range) for B-spline boundary knots; R's `contdid` v0.1.0 uses `range(dvals)` via `splines2::bSpline(dvals)`, which can produce extrapolation artifacts at dose-grid extremes. 
**Scope caveat:** R parity therefore runs at **relative** tolerance bands (1% on overall ATT for all 6 benchmarks; 1% on max ATT(d) curve and 2% on max ACRT(d) curve for benchmarks 1-3 via the `_compare_with_r` helper at `tests/test_methodology_continuous_did.py:395-459`; 1% on overall ACRT for benchmarks 4-5; benchmark 6 is event-study, ATT-only), NOT bit-exact (`atol=1e-8`) like HAD. Library extension toward methodological soundness (avoids extrapolation). Cross-references the § Edge Cases "Boundary knots" bullet above and `METHODOLOGY_REVIEW.md` § ContinuousDiD Deviations #1. +2. **Note:** `bspline_derivative_design_matrix` derivative-failure `UserWarning` — Phase 2 axis-C #12 silent-failures audit fix. No R correspondence; `contdid` v0.1.0 does not implement an equivalent warning. Cross-references the § Edge Cases `**Note:**` bullet above (L745) and `METHODOLOGY_REVIEW.md` § ContinuousDiD Deviations #2. Locked in `tests/test_continuous_did.py::TestBSplineDerivativeDegenerateBasis` (3 tests); source-level aggregate-warning block at `diff_diff/continuous_did_bspline.py:150-187`. +3. **Note:** `+inf` → `0` never-treated recoding emits `UserWarning` reporting the affected row count; negative `first_treat` (including `-inf`) raises `ValueError`. Axis-E silent-coercion fix per Phase 2 audit. No R correspondence; `contdid` v0.1.0 silently absorbs `+inf` without a signal. Cross-references the § Implementation Checklist `**Note:**` below and `METHODOLOGY_REVIEW.md` § ContinuousDiD Deviations #3. +4. **Note:** Zero-`first_treat` rows with nonzero `dose` are force-zeroed with `UserWarning` reporting the affected row count (axis-E silent-coercion). No R correspondence; `contdid` v0.1.0 has the same `first_treat = 0` → `D = 0` invariant but silently coerces without a warning. Cross-references the § Implementation Checklist `**Note:**` below and `METHODOLOGY_REVIEW.md` § ContinuousDiD Deviations #4. 
+ ### Implementation Checklist - [x] B-spline basis construction matching R's `splines2::bSpline` (global knots from all treated doses; boundary knots use training-dose range; see deviation note above) From a618f03cbb33dc0b76bae94d5cd7fd2e4d5e08ef Mon Sep 17 00:00:00 2001 From: igerber Date: Wed, 20 May 2026 15:55:31 -0400 Subject: [PATCH 2/6] Address codex R1 P3s on ContinuousDiD: parity wording + line-ref staleness + event-study scope MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Three informational P3s from local codex R1, all narrow text fixes: 1. **Methodology** — reword "R parity" claims to distinguish two surfaces: (a) scalar parity with raw R cont_did / pte_default output (overall ATT on all 6 benchmarks; overall ACRT on benchmarks 4-5; scalar overall_att on benchmark 6 event-study), and (b) harmonized boundary-knot-normalized curve parity (max ATT(d), max ACRT(d) on benchmarks 1-3 only — _compare_with_r helper rebuilds R-side basis under Boundary.knots = range(treated_doses) before comparison because raw contdid curves use range(dvals)). Applied to METHODOLOGY_REVIEW.md R-Reference row + Verified Components rows + Test Coverage block + long-form Deviations #1; REGISTRY Deviations #1; CHANGELOG bullet. 2. **Maintainability** — replace hard-coded REGISTRY line numbers (L755-757, L758) with stable section/item anchors: "REGISTRY ## ContinuousDiD -> Implementation Checklist -> Survey design support item" and "Implementation Checklist deferred items" instead of fragile :L755-757 refs that already drifted to L774-776 with this same PR's REGISTRY Deviations block insertion. Applied in METHODOLOGY_REVIEW.md (2 occurrences) + TODO.md (1). 3. 
**Documentation/Tests** — clarify that benchmark 6 validates the event-study code path through scalar overall_att only (binarized ATT, no per-horizon comparison); per-horizon event_study_effects estimates and inference are exercised by Python-side tests at tests/test_continuous_did.py:557-690 and :1500-1528 with no R cross-language comparison on the per-horizon surface. Co-Authored-By: Claude Opus 4.7 --- CHANGELOG.md | 2 +- METHODOLOGY_REVIEW.md | 14 +++++++------- TODO.md | 2 +- docs/methodology/REGISTRY.md | 2 +- 4 files changed, 10 insertions(+), 10 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 6d40a6ae..980e53d9 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,7 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] ### Added -- **ContinuousDiD methodology-review-tracker promotion.** Tracker row flipped **In Progress** → **Complete** with full Verified Components / Test Coverage / Corrections Made / Deviations / Outstanding Concerns structure mirroring the HAD precedent (PR #473). REGISTRY `## ContinuousDiD` gains a formal Deviations block consolidating the boundary-knots deviation from R `contdid` v0.1.0 (`range(dose)` vs `range(dvals)` — library avoids extrapolation), the `bspline_derivative` derivative-failure `UserWarning` (Phase 2 axis-C #12), the `+inf` → `0` never-treated recoding warning, and the zero-`first_treat`+nonzero-`dose` force-zeroing warning (both axis-E silent-coercion fixes) into a single AI-review-recognized labeled surface. 
R parity for ContinuousDiD remains at relative tolerance (1% overall ATT on all 6 benchmarks; 1% max ATT(d) curve and 2% max ACRT(d) curve on benchmarks 1-3 via the shared `_compare_with_r` helper at `tests/test_methodology_continuous_did.py:395-459`; 1% overall ACRT on benchmarks 4-5; benchmark 6 is event-study ATT-only), NOT bit-exact (`atol=1e-8`) like HAD — the boundary-knots deviation precludes algorithmic bit-equality on aggregated dose-response curves. No source code changes, no new tests, no new docstrings — consolidation only against the existing 15 methodology tests (`tests/test_methodology_continuous_did.py`), 80 unit tests (`tests/test_continuous_did.py`), and `docs/methodology/continuous-did.md` theory note. `METHODOLOGY_REVIEW.md` ContinuousDiD row promoted **In Progress** → **Complete**. +- **ContinuousDiD methodology-review-tracker promotion.** Tracker row flipped **In Progress** → **Complete** with full Verified Components / Test Coverage / Corrections Made / Deviations / Outstanding Concerns structure mirroring the HAD precedent (PR #473). REGISTRY `## ContinuousDiD` gains a formal Deviations block consolidating the boundary-knots deviation from R `contdid` v0.1.0 (`range(dose)` vs `range(dvals)` — library avoids extrapolation), the `bspline_derivative` derivative-failure `UserWarning` (Phase 2 axis-C #12), the `+inf` → `0` never-treated recoding warning, and the zero-`first_treat`+nonzero-`dose` force-zeroing warning (both axis-E silent-coercion fixes) into a single AI-review-recognized labeled surface. 
R cross-language coverage for ContinuousDiD runs at relative tolerance across two surfaces: (a) **scalar parity with raw R `cont_did` / `pte_default`** at 1% on overall ATT for all 6 benchmarks and on overall ACRT for benchmarks 4-5 (benchmark 6 is event-study, scalar `overall_att` only); (b) **harmonized boundary-knot-normalized curve parity** with R-side ATT(d) / ACRT(d) reconstructed under `Boundary.knots = range(treated_doses)` (matching the library) on benchmarks 1-3 via the shared `_compare_with_r` helper at `tests/test_methodology_continuous_did.py:395-459` — max ATT(d) at 1% and max ACRT(d) at 2%. NOT bit-exact (`atol=1e-8`) like HAD — the boundary-knots deviation precludes algorithmic bit-equality on aggregated dose-response curves. Surface (a) is direct raw-package parity; surface (b) is reconstructed-basis parity because raw `contdid` curves use `range(dvals)`. No source code changes, no new tests, no new docstrings — consolidation only against the existing 15 methodology tests (`tests/test_methodology_continuous_did.py`), 80 unit tests (`tests/test_continuous_did.py`), and `docs/methodology/continuous-did.md` theory note. `METHODOLOGY_REVIEW.md` ContinuousDiD row promoted **In Progress** → **Complete**. - **`SpilloverDiD(vcov_type="conley", survey_design=...)` integration via stratified-Conley sandwich on PSU totals (Wave E.2).** Lifts the Wave E.1 `NotImplementedError` (`spillover.py:2201` upfront, `two_stage.py:217` helper-level) and adds spatial-HAC + design-based variance for the previously deferred composition. **Documented synthesis** of Conley (1999) spatial-HAC × Gerber (2026, arXiv:2605.04124) Proposition 1 Binder TSL (the Wave E.1 foundation) × Wave D Gardner GMM first-stage uncertainty correction (Butts 2021 §3.1 + Gardner 2022 §4) applied to SpilloverDiD's ring-indicator stage-2 design. No reference software combines all three ingredients on a two-stage influence function. 
**Mechanical composition (panel-aware):** preserves the library's existing `conley_lag_cutoff = 0` semantic at `diff_diff.conley._compute_conley_meat` ("within-period spatial only — exclude cross-period spatial pairs") by looping over periods. For each period `t`, SpilloverDiD's per-obs Hájek-weighted Wave D IF `psi_i` is aggregated to per-period PSU totals `S_psu_t[g] = sum_{i in PSU g, time t} psi_i` (via `np.add.at`); per-PSU spatial centroids are panel-constant (mean of per-observation `conley_coords` within each PSU, vectorized `np.add.at` sums / `np.bincount` counts); for each stratum the within-stratum sandwich is `M_h_t = (1 - f_h) * n_h/(n_h-1) * sum_{j,k in PSUs_h} K(d(centroid_j, centroid_k) / conley_cutoff_km) * (S_psu_t[j] - S_bar_h_t)(S_psu_t[k] - S_bar_h_t)'`, where K is the Bartlett kernel (SpilloverDiD currently exposes Bartlett only and hardcodes it; the survey helper accepts `"uniform"` too but exposing that on the SpilloverDiD constructor is a separate follow-up) and `d` is haversine / euclidean / callable per `ConleyMetric`. Cross-stratum kernel weights are exactly zero by sampling design (strata are independence partitions). Total meat is `sum_t sum_h M_h_t`. Cross-period spatial pairs are excluded by construction — the per-period loop matches the library's panel Conley contract exactly. **Reduction semantics (load-bearing for tests):** the orchestrator's panel-aware meat equals `sum_t` of per-period within-stratum stratified-Conley sandwiches on per-period PSU totals (pinned at `tests/test_spillover.py::TestSpilloverDiDWaveE2ConleySurveyDesign::test_b_panel_aware_per_period_sum_invariant`); single stratum (H = 1, FPC = inf) reduces to `sum_t` plain Conley sandwich on per-period PSU totals (NOT on time-collapsed totals). 
**Implementation:** new `_compute_stratified_conley_meat_from_psu_scores` helper in `diff_diff/survey.py` (parallel to existing `_compute_stratified_meat_from_psu_scores` 3-tuple `(meat, variance_computed, legitimate_zero_count)` contract; per-stratum loop replaces the inner `centered.T @ centered` with `_compute_conley_meat(scores=centered, coords=psu_coords_h, ...)` in cross-sectional mode); new dispatch wrapper `_compute_stratified_conley_meat` in `diff_diff/two_stage.py` (parallel to existing `_compute_binder_tsl_meat`, performs per-obs Psi → PSU aggregation + centroid derivation + dispatch to survey helper, intentionally drops `cluster_ids` at the dispatch boundary — see Restrictions). `_compute_gmm_corrected_meat` conley branch extended with `if resolved_survey is not None` routing to the new wrapper; the `resolved_survey is None` branch is bit-identical to Wave D. **Singleton-stratum `lonely_psu="adjust"` parity:** the survey helper mirrors the Binder helper's `continue` to skip the FPC scale on singleton strata (with `n_h = 1` the scale `n_h / (n_h - 1)` would divide by zero); the degenerate one-PSU kernel `K = [[K(0)]] = [[1.0]]` reduces to `centered.T @ centered`, matching Binder's singleton-adjust output. **Saturated `df_survey = 0` NaN-fail:** mirrors Wave E.1 (`_compute_stratified_conley_meat` returns NaN meat with `UserWarning` template "Wave E.2 stratified-Conley sandwich: df_survey = 0..." so callers can `pytest.warns(UserWarning, match="Wave E.2 stratified-Conley")`). 
**Public surface restrictions:** replicate-weight variance (BRR / Fay / JK1 / JKn / SDR) raises `NotImplementedError` (inherits Wave E.1 gate; per-replicate full refit is separate follow-up scope); `cluster= + survey_design.psu + vcov_type="conley"` coerces `cluster=` to PSU per Wave E.1's warn-and-use-PSU pattern (the Conley cluster product kernel becomes a no-op after PSU aggregation, so `cluster_ids` is intentionally not threaded into the inner Conley kernel call — every PSU is its own cluster post-aggregation, which would zero all cross-PSU pairs); LinearRegression-side `vcov_type="conley" + survey_design=` gate at `diff_diff/linalg.py:2853` remains (separate Bertanha-Imbens 2014 weighted-Conley "Phase 5" roadmap, not Wave E); DiagnosticReport routing for `SpilloverDiDResults(vcov_type="conley", survey_design=)` requires `_APPLICABILITY` / `_PT_METHOD` registration (separate Wave F PR). **Tests:** new `TestSpilloverDiDWaveE2ConleySurveyDesign` and `TestSpilloverDiDWaveE2ConleySurveyDesignEventStudy` classes in `tests/test_spillover.py` (bit-identical no-survey fallback; panel-aware per-period sum invariant on the orchestrator + helper composition; hand-computation methodology anchor; single-stratum ≡ plain Conley on PSU totals; cross-stratum independence as a unit test on the survey helper with interleaved cross-stratum centroids; Binder vs Conley singleton-adjust FPC skip parity; lonely-PSU sensitivity across three modes; FPC large ≡ no-FPC and FPC = n_h zeros stratum; saturated NaN-fail with `pytest.warns(match="Wave E.2 stratified-Conley")`; replicate-weight + non-pweight rejections; cluster warn-and-use-PSU; fit idempotency; `finite_mask` survey-array subsetting; no-PSU coverage — weights-only `SurveyDesign(weights=...)`, strata-only `SurveyDesign(weights=..., strata=...)`, and a per-period re-index unit invariant pinning that no cross-period spatial pairs leak into the meat on implicit-PSU layouts; event-study path on both `is_staggered=True`/`False` 
branches per `feedback_cohort_loop_trigger_cache_both_branches`; drift goldens at `rtol=1e-12 / atol=1e-14`). The pre-existing `tests/test_spillover.py::test_fit_conley_plus_survey_design_not_implemented` Wave E.1-era gate-assertion test is removed (replaced by the positive-path tests above). Wave E.1 entry's "Public surface restrictions" bullet updated to past-tense the conley+survey gate reference. - **HeterogeneousAdoptionDiD methodology-review-tracker promotion.** New `tests/test_methodology_had.py` (6 classes, 36 tests) with paper-equation-numbered Verified Components walk-through against de Chaisemartin, Ciccia, D'Haultfœuille & Knau (2026) arXiv:2405.04465v6 (Equations 3 / 7 / 11 / 18 / 29 and Theorems 1 / 3 / 4 / 7): Design 1' MC recovery on both the zero-boundary DGP AND a nonzero-boundary-intercept DGP (`ΔY = c + β·D + ε` with `c != 0`) so the `att = (mean(ΔY) − τ_bc) / mean(D)` subtraction term is verified explicitly, N(0,1) coverage at `n_replicates=200`, mass-point Wald-IV closed-form equivalence at `atol=1e-9`, QUG limit-law distributional match at KS-stat ≤ 0.05 (n_draws=5000), Yatchew-HR paper-literal `σ²_diff = 1/(2G)` normalization lock, joint Stute pre-trends + homogeneity H0 fail-to-reject on both surfaces and H1 reject for joint homogeneity under a nonlinear DGP, and library-deviation locks (equal-weighting via selective low-dose-region replication, sup-t bootstrap gating, staggered-timing fail-closed `ValueError`). Added "Non-testable assumptions (paper Section 3.1.2)" Notes block to `HeterogeneousAdoptionDiD` class docstring + "Scope (what this test does NOT cover)" clauses to `qug_test` / `stute_test` / `yatchew_hr_test` / `did_had_pretest_workflow` Notes sections explicitly stating that the pre-tests verify ADJACENT assumptions (Assumption 4 / 7 / 8) and CANNOT test Assumptions 5 or 6. 
Phase-4 validation-harness items (Pierce-Schott 2016 Figure 2 replication, Table 1 coverage-rate reproduction across 3 DGPs × G ∈ {100, 500, 2500}) waived with documented rationale: R parity at `atol=1e-8` in `tests/test_did_had_parity.py` (3 DGPs × 5 method combos, bit-exact via `rtol=0`) is a strictly stronger anchor than coverage-rate Monte Carlo, and the paper itself self-acknowledges (Section 5.2) that NP estimators are too noisy to be informative on the LBD-restricted PNTR panel. REGISTRY HAD section gains a consolidated Deviations block (5 entries with framing header) and closes 2 of 3 unchecked Implementation Checklist items — the staggered-timing fail-closed `ValueError` and the Assumption 5/6 non-testability documentation; the `covariates=` Theorem 6 follow-up and the extensive-margin / "consider running standard DiD" warning both remain explicitly tracked in `TODO.md` as Low-priority follow-ups rather than claimed-closed. `dechaisemartin-2026-review.md:182-194` requirements checklist boxes the Phase 1a/1b/1c implementation-status closures + the Assumption 5/6 documentation + the staggered-timing closures; the extensive-margin item is acknowledged as partial (zero-dose `UserWarning` exists in `qug_test`; main-`fit()` "consider standard DiD" recommendation is the TODO follow-up). `METHODOLOGY_REVIEW.md` HAD row promoted **In Progress** → **Complete**. - **SunAbraham `vcov_type` parameter (Phase 1b PR 1/8).** `SunAbraham(vcov_type=...)` now accepts `{"classical","hc1","hc2","hc2_bm"}` (defaults to `"hc1"`, which preserves prior behavior bit-equally - SA historically hard-coded HC1). Auto-cluster-at-unit dropped when the user opts into explicit `vcov_type="hc2"` or `vcov_type="classical"` (one-way only); preserved for `"hc1"` and `"hc2_bm"`. 
When `vcov_type in {"classical","hc2","hc2_bm"}`, `_fit_saturated_regression` auto-routes to a full-dummy saturated design (mirrors TWFE Gate 1 from PR #469): FWL preserves cohort coefficients but not the hat matrix, so HC2 leverage and Bell-McCaffrey Satterthwaite DOF must be computed on the full FE projection. Empirically matches R `lm()` summary classical SE, `sandwich::vcovHC(type="HC2")`, and `clubSandwich::vcovCR(..., type="CR2")` + `coef_test()$df_Satt` at atol=1e-10 (cohort SE and BM DOF pinned in `tests/test_methodology_sun_abraham.py`). For `vcov_type="hc2_bm"`, the user-facing aggregated inference (`event_study_effects[e]['p_value']`/`['conf_int']`, `overall_p_value`/`overall_conf_int`) uses CR2 Bell-McCaffrey contrast DOF — matches `clubSandwich::Wald_test(test="HTZ")$df_denom` at atol=1e-10 (mirrors PR #465's `_compute_cr2_bm_contrast_dof` pattern for MultiPeriodDiD's post-period-average ATT). `vcov_type` is now propagated to `SunAbrahamResults.vcov_type` for downstream introspection. `SurveyDesign` (any kind — analytical weights, stratified, PSU, or replicate-weight) combined with `vcov_type in {"classical","hc2","hc2_bm"}` raises `NotImplementedError`: the survey-design TSL (or replicate-weight refit) variance overrides the analytical sandwich family, and the auto-cluster guard for one-way families would silently downgrade unit-level PSUs to per-observation PSUs. Use `vcov_type="hc1"` (default) for survey designs. `conley` rejected at `__init__` with a deferral message (would require threading 6+ `conley_*` params through the saturated regression call). **Deviation from R:** SA's within-transform HC1 SE differs from `fixest::sunab()` by ~1-2% (~2e-3 absolute) on typical panel sizes due to a different `(n-k)` finite-sample correction (fixest counts absorbed FE in k_total; SA's `solve_ols` counts only within-transformed columns); the IW aggregation step is otherwise identical (pinned at atol=5e-3, tracked in TODO.md). 
First PR of the Phase 1b standalone-estimator threading initiative (7 PRs to follow: StackedDiD, WooldridgeDiD-OLS, CallawaySantAnna, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD). diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md index 43fd0be4..cd576e4c 100644 --- a/METHODOLOGY_REVIEW.md +++ b/METHODOLOGY_REVIEW.md @@ -637,18 +637,18 @@ and covariate-adjusted specifications.) |-------|-------| | Module | `continuous_did.py`, `continuous_did_bspline.py`, `continuous_did_results.py` | | Primary Reference | Callaway, Goodman-Bacon & Sant'Anna (2024), *Difference-in-Differences with a Continuous Treatment*, NBER WP 32117 | -| R Reference | `contdid` v0.1.0 (CRAN) — R parity at relative tolerance (1% overall ATT on all 6 benchmarks; 1% max ATT(d) curve and 2% max ACRT(d) curve on benchmarks 1-3 via the shared `_compare_with_r` helper; 1% overall ACRT on benchmarks 4-5; benchmark 6 is event-study ATT-only) via `tests/test_methodology_continuous_did.py::TestRBenchmark`. NOT bit-exact (`atol=1e-8`) like HAD because of the boundary-knots deviation documented below | +| R Reference | `contdid` v0.1.0 (CRAN) — two parity surfaces at relative tolerance: (a) **scalar overall ATT parity** with raw R `cont_did` / `pte_default` output at `< 0.01` (1%) on all 6 benchmarks; **scalar overall ACRT parity** with raw R `cont_did` at `< 0.01` (1%) on benchmarks 4-5; (b) **harmonized boundary-knot-normalized curve parity** with R-side ATT(d)/ACRT(d) reconstructed under `Boundary.knots = range(treated_doses)` (matching the library) at `< 0.01` max ATT(d) and `< 0.02` max ACRT(d) on benchmarks 1-3 via the `_compare_with_r` helper at `tests/test_methodology_continuous_did.py:395-459`; benchmark 6 is event-study, scalar `overall_att` only (binarized ATT, no curve comparison and no ACRT in event-study mode). Surface (a) is direct raw-package parity; surface (b) is reconstructed-basis parity because raw `contdid` curves use `range(dvals)` instead of `range(dose)`. 
NOT bit-exact (`atol=1e-8`) like HAD because of the boundary-knots deviation documented below. See `tests/test_methodology_continuous_did.py::TestRBenchmark` | | Status | **Complete** | | Last Review | 2026-05-20 | **Verified Components:** - [x] **PT and SPT identification** (CGBS 2024 Assumptions 1-2) — two-level parallel trends with explicit untreated-and-doses conditioning; estimands `ATT(d|d)`, `ATT(d)`, `ACRT(d)`, `ATT^{loc}`, `ATT^{glob}`, `ACRT^{glob}` defined in `docs/methodology/continuous-did.md` § 4 + REGISTRY `## ContinuousDiD` Identification block. Hand-calc coverage: `tests/test_methodology_continuous_did.py::TestLinearDoseResponse` (4 tests at `atol=1e-10` / `atol=1e-6` on no-noise linear DGP — locks the `ATT^{glob}` binarization formula `E[ΔY | D > 0] − E[ΔY | D = 0]`, the `ACRT^{glob}` plug-in average, and the `ATT(d) = 2d`, `ACRT(d) = 2` closed forms). - [x] **B-spline basis matching `splines2::bSpline`** (cubic and linear degrees, `num_knots=0` default; global boundary knots from the training-dose range, NOT per-cell) — `tests/test_methodology_continuous_did.py::TestQuadraticWithCubicBasis::test_quadratic_recovery` recovers `ATT(d) = d²` at `atol=1e-6` via a degree-3 basis (cubic spline can represent quadratic exactly). The matching basis algorithm lives in `diff_diff/continuous_did_bspline.py` (216 LoC); the boundary-knots deviation from R `contdid` is documented in the Deviations block below. -- [x] **Multi-period (g,t) cell iteration with base period selection** — `TestMultiPeriodAggregation::test_multiple_groups` and `test_gt_cell_count` exercise the cohort iteration on 2-cohort staggered panels; cell counts agree with the R `ptetools`-style convention. R parity at 1% relative further locks the staggered-aggregation surface via `TestRBenchmark::test_benchmark_4_staggered_dose` and `test_benchmark_5_not_yet_treated`. 
-- [x] **Dose-response (`aggregate="dose"`) and event-study (`aggregate="eventstudy"`) aggregation** with group-proportional weights (`n_treated/n_total` per group, divided among post-treatment cells; matches R `ptetools` convention) — R parity via `TestRBenchmark::test_benchmark_1_basic_cubic` / `_2_linear` / `_3_interior_knots` / `_4_staggered_dose` / `_5_not_yet_treated` (dose) and `_6_event_study` (event-study). Per-benchmark relative tolerances: all 6 assert overall ATT at `< 0.01` (1%); benchmarks 1-3 additionally assert max ATT(d) at `< 0.01` and max ACRT(d) at `< 0.02` via the shared `_compare_with_r` helper; benchmarks 4-5 assert overall ACRT at `< 0.01` inline; benchmark 6 is event-study mode (binarized ATT, no ACRT comparison). Skipped if R / `contdid` not installed via `_check_r_contdid()`; R parity uses R's `dvals` grid for exact knot alignment. +- [x] **Multi-period (g,t) cell iteration with base period selection** — `TestMultiPeriodAggregation::test_multiple_groups` and `test_gt_cell_count` exercise the cohort iteration on 2-cohort staggered panels; cell counts agree with the R `ptetools`-style convention. Scalar parity with raw R `cont_did` at 1% relative further locks the staggered-aggregation surface via `TestRBenchmark::test_benchmark_4_staggered_dose` and `test_benchmark_5_not_yet_treated` (both assert overall ATT AND overall ACRT at `< 0.01`). +- [x] **Dose-response (`aggregate="dose"`) and event-study (`aggregate="eventstudy"`) aggregation** with group-proportional weights (`n_treated/n_total` per group, divided among post-treatment cells; matches R `ptetools` convention). 
Two R-side surfaces are exercised: (a) **scalar `overall_att`** via `TestRBenchmark::test_benchmark_1_basic_cubic` / `_2_linear` / `_3_interior_knots` / `_4_staggered_dose` / `_5_not_yet_treated` (dose mode) and `_6_event_study` (event-study mode — binarized ATT only; benchmark 6 validates the event-study code path through the scalar surface, NOT per-horizon `event_study_effects`); (b) **harmonized boundary-knot-normalized ATT(d) / ACRT(d) curves** on benchmarks 1-3 via `_compare_with_r` (helper at `tests/test_methodology_continuous_did.py:395-459` rebuilds the R-side basis under `Boundary.knots = range(treated_doses)` before comparison — raw `contdid` curves use `range(dvals)`, so this is reconstructed-basis parity not raw-package parity). Per-benchmark tolerances: all 6 assert overall ATT at `< 0.01` (1%); benchmarks 1-3 additionally assert max ATT(d) at `< 0.01` and max ACRT(d) at `< 0.02` via the helper; benchmarks 4-5 assert overall ACRT at `< 0.01` inline. Per-horizon `event_study_effects` estimates and inference are exercised by Python-side tests at `tests/test_continuous_did.py:557-690` and `:1500-1528` (no R cross-language comparison on the per-horizon surface). Skipped if R / `contdid` not installed via `_check_r_contdid()`; benchmarks use R's `dvals` grid for exact knot alignment. - [x] **Multiplier bootstrap for inference** (PSU-level multiplier weights on the survey path per Phase 6) — implementation in `diff_diff/continuous_did.py`; bootstrap SE invariant on rank-deficient cells locked in `TestEdgeCasesMethodology::test_all_same_dose` (verifies `dose_response_att.se` is finite on a heterogeneous-outcome / identical-dose DGP); 80 unit tests in `tests/test_continuous_did.py` exercise the rest of the bootstrap path. 
- [x] **Analytical SEs via influence functions** (NOT delta method; corrected post-v3.0.0, see Corrections Made) — IF-based variance with `safe_inference()` joint-NaN consistency on all six estimand fields (`overall_att`, `overall_acrt`, dose-response, event-study). -- [x] **Survey support**: weighted B-spline OLS, two-stage linearization (TSL) on influence functions, bootstrap + survey via PSU-level multiplier weights (Phase 3 + Phase 6). REGISTRY `## ContinuousDiD` Implementation Checklist L758 boxed. +- [x] **Survey support**: weighted B-spline OLS, two-stage linearization (TSL) on influence functions, bootstrap + survey via PSU-level multiplier weights (Phase 3 + Phase 6). Boxed in REGISTRY `## ContinuousDiD` → Implementation Checklist → "Survey design support (Phase 3)" item. - [x] **`+inf` → `0` never-treated recoding** with `UserWarning` reporting the affected row count (axis-E silent-coercion fix per Phase 2 audit) — the R-style convention of `first_treat = +inf` is normalized internally but no longer absorbed silently. **Any negative `first_treat` value (including `-inf`) raises `ValueError`** with the affected row count. Locked in `tests/test_continuous_did.py`. - [x] **Zero-`first_treat` rows with nonzero `dose` force-zeroed** with `UserWarning` reporting the affected row count (axis-E silent-coercion fix per Phase 2 audit) — never-treated cells must have `D=0` for internal consistency; the previous silent zeroing is now signaled. Locked in `tests/test_continuous_did.py`. - [x] **`bspline_derivative_design_matrix` derivative-construction failure warning** (Phase 2 axis-C #12 silent-failures audit fix) — aggregates failed basis indices into a single `UserWarning` naming them, instead of swallowing `scipy.interpolate.BSpline.ValueError` and leaving silently zeroed derivative columns. 
Both ACRT point estimates AND analytical/bootstrap inference read the same `dPsi` matrix (`continuous_did.py:1026-1046` and the bootstrap ACRT path at `continuous_did.py:1524-1561`), so both are biased on partial-derivative failure — the warning wording makes that explicit. The all-identical-knot degenerate case (single dose value) remains silently handled because derivatives are mathematically zero there. Locked in `tests/test_continuous_did.py::TestBSplineDerivativeDegenerateBasis` (3 tests: `test_single_dose_is_silent`, `test_valueerror_from_bspline_emits_aggregate_warning`, `test_clean_knots_emit_no_warning`); source-level aggregate-warning block at `diff_diff/continuous_did_bspline.py:150-187`. @@ -658,7 +658,7 @@ and covariate-adjusted specifications.) **Test Coverage:** - 15 methodology tests in `tests/test_methodology_continuous_did.py` (5 classes: 4 + 1 + 2 + 2 + 6); the R-benchmark class (6 tests) skips if R / `contdid` v0.1.0 is not installed via `_check_r_contdid()` guard. - 80 unit tests in `tests/test_continuous_did.py` (1,530 LoC) covering bootstrap, survey design, IF-based SE, anticipation, rank-deficient cells, and result-class field contracts. -- R parity at relative tolerance (NOT bit-exact — see Deviations § "Boundary knots") on 6 benchmark configurations: all 6 assert overall ATT at `< 0.01` (1% relative); benchmarks 1-3 additionally assert max ATT(d) at `< 0.01` and max ACRT(d) at `< 0.02` via the shared `_compare_with_r` helper; benchmarks 4-5 assert overall ACRT at `< 0.01` inline; benchmark 6 is event-study mode (binarized ATT, no ACRT comparison). 
+- R cross-language coverage at relative tolerance (NOT bit-exact — see Deviations § "Boundary knots") on 6 benchmark configurations across two surfaces: (a) **scalar parity with raw R `cont_did` / `pte_default`** — all 6 assert overall ATT at `< 0.01` (1%); benchmarks 4-5 also assert overall ACRT at `< 0.01` inline; benchmark 6 is event-study mode with scalar `overall_att` only (binarized ATT, no per-horizon and no ACRT comparison). Per-horizon `event_study_effects` is exercised by Python-side tests at `tests/test_continuous_did.py:557-690` and `:1500-1528`. (b) **harmonized boundary-knot-normalized curve parity** with R-side ATT(d) / ACRT(d) reconstructed under `Boundary.knots = range(treated_doses)` (matching the library) on benchmarks 1-3 via the `_compare_with_r` helper — max ATT(d) at `< 0.01` and max ACRT(d) at `< 0.02`. Surface (a) is direct raw-package parity; surface (b) is reconstructed-basis parity because raw `contdid` curves use `range(dvals)`. - Documentation: `docs/methodology/continuous-did.md` (14,885 bytes theory note covering PT vs SPT, estimands, B-spline OLS, multiplier bootstrap). **Corrections Made:** @@ -671,7 +671,7 @@ and covariate-adjusted specifications.) 5. **Tracker-promotion consolidation (this PR, 2026-05-20):** formal Deviations block added to REGISTRY `## ContinuousDiD` consolidating the boundary-knots deviation, the `bspline_derivative` warning, and the two axis-E silent-coercion warnings into a single labeled surface. The original Edge Cases / Notes entries remain in place — Deviations is an additional canonical surface (per CLAUDE.md "Documenting Deviations (AI Review Compatibility)" labels). **Deviations from the paper / from R / library extensions:** -1. **Deviation from R — boundary knots use `range(dose)` not `range(dvals)`** — knots are built once from all treated doses (global, not per-cell) to ensure a common basis across (g,t) cells for aggregation. 
The evaluation grid is clamped to training-dose boundary knots (`range(dose)`). R's `contdid` v0.1.0 has an inconsistency where `splines2::bSpline(dvals)` uses `range(dvals)` instead of `range(dose)`, which can produce extrapolation artifacts at dose-grid extremes. **Scope caveat:** R parity tests therefore run at **relative** tolerance bands (1% on overall ATT for all 6 benchmarks; 1% on max ATT(d) curve and 2% on max ACRT(d) curve for benchmarks 1-3 via the `_compare_with_r` helper; 1% on overall ACRT for benchmarks 4-5; benchmark 6 is event-study, ATT-only), NOT bit-exact (`atol=1e-8`) like HAD — `contdid` and ContinuousDiD cannot bit-match on aggregated dose-response or ACRT curves because they use different knot placement; the agreement band reflects the boundary-knot divergence rather than algorithmic drift. The slightly wider 2% ACRT(d)-curve tolerance on benchmarks 1-3 reflects the tighter coupling between basis derivative numerics and the boundary-knot choice; benchmarks 4-5 use overall scalars (`overall_acrt`) where the boundary effect averages down to 1%. Library extension toward methodological soundness (avoids extrapolation). +1. **Deviation from R — boundary knots use `range(dose)` not `range(dvals)`** — knots are built once from all treated doses (global, not per-cell) to ensure a common basis across (g,t) cells for aggregation. The evaluation grid is clamped to training-dose boundary knots (`range(dose)`). R's `contdid` v0.1.0 has an inconsistency where `splines2::bSpline(dvals)` uses `range(dvals)` instead of `range(dose)`, which can produce extrapolation artifacts at dose-grid extremes. 
**Scope caveat:** R cross-language coverage therefore runs at **relative** tolerance bands across two surfaces, NOT bit-exact (`atol=1e-8`) like HAD — `contdid` and ContinuousDiD cannot bit-match on aggregated dose-response or ACRT curves because they use different knot placement; the agreement band reflects the boundary-knot divergence rather than algorithmic drift. (a) **Scalar parity with raw R `cont_did` / `pte_default`** at 1% relative on overall ATT for all 6 benchmarks and on overall ACRT for benchmarks 4-5 (benchmark 6 is event-study, scalar `overall_att` only). (b) **Harmonized boundary-knot-normalized curve parity** with R-side ATT(d) / ACRT(d) reconstructed under `Boundary.knots = range(treated_doses)` (matching the library) on benchmarks 1-3 via the `_compare_with_r` helper — max ATT(d) at 1% and max ACRT(d) at 2%. The slightly wider 2% ACRT(d)-curve tolerance on benchmarks 1-3 reflects the tighter coupling between basis derivative numerics and the boundary-knot choice; benchmarks 4-5 use overall scalars (`overall_acrt`) where the boundary effect averages down to 1%. Library extension toward methodological soundness (avoids extrapolation). 2. **Library extension — `bspline_derivative_design_matrix` derivative-failure warning** — previously swallowed `scipy.interpolate.BSpline.ValueError` in the per-basis derivative loop, leaving affected derivative-matrix columns silently zero. Now aggregates the failed basis indices into a single `UserWarning` naming them. Both ACRT point estimates and analytical/bootstrap inference read the same `dPsi` matrix, so both are biased when this fires — the warning wording makes that explicit. The all-identical-knot degenerate case (single dose value) remains silently handled (mathematically-zero derivatives are correct there). Phase 2 axis-C #12 silent-failures audit fix. No R correspondence; `contdid` v0.1.0 does not implement an equivalent warning. 3. 
**Library extension — `+inf` → `0` never-treated recoding warns** — the R-style convention of coding never-treated units as `first_treat=+inf` is still accepted and normalized to `first_treat=0` internally, but the estimator now emits a `UserWarning` reporting the row count so the silent recategorization is surfaced. Only `+inf` is recoded (matching the R convention). Any **negative** `first_treat` value (including `-inf`) raises `ValueError` with the row count, since such units would otherwise silently fall out of both the treated (`g > 0`) and never-treated (`g == 0`) masks. Pass `0` directly for never-treated units to avoid the warning. Library extension toward stricter safety; matches the broader Phase 2 axis-E silent-coercion convention. No R correspondence; `contdid` v0.1.0 silently absorbs `+inf` without a signal. 4. **Library extension — zero-`first_treat` rows with nonzero `dose` force-zeroed with warning** — never-treated cells must have `D=0` for internal consistency in the dose-response. The estimator now emits a `UserWarning` with the affected row count before the zeroing, so unintended nonzero doses on never-treated rows are no longer absorbed without a signal. Library extension toward stricter safety with no R correspondence — `contdid` v0.1.0 has the same `first_treat = 0` → `D = 0` invariant requirement but silently coerces without a warning; same axis-E silent-coercion lineage as #3. @@ -681,7 +681,7 @@ and covariate-adjusted specifications.) - **Discrete-treatment saturated regression (deferred)** — when `dose` is detected as integer-valued, the estimator currently warns; the saturated regression approach (one coefficient per discrete dose level instead of B-spline basis) is not implemented. Tracked as a future-work row. - **Lowest-dose-as-control (Remark 3.1, deferred)** — CGBS 2024 Remark 3.1 outlines using the lowest non-zero dose as the comparison group when `P(D=0) = 0`. 
Not implemented; the estimator requires `P(D=0) > 0` (never-treated controls present). Tracked as a future-work row. -These three are feature deferrals (paper-supported extensions that the library has chosen not to implement yet), not tracker blockers — the Implementation Checklist at REGISTRY `:755-757` already marks them as `[ ]` deferred. They mirror the same "future work" status that the ChaisemartinDHaultfoeuille and TROP tracker rows carry for analogous optional extensions. +These three are feature deferrals (paper-supported extensions that the library has chosen not to implement yet), not tracker blockers — REGISTRY `## ContinuousDiD` → Implementation Checklist already marks them as `[ ]` deferred (the "Covariate support", "Discrete treatment saturated regression", and "Lowest-dose-as-control (Remark 3.1)" items). They mirror the same "future work" status that the ChaisemartinDHaultfoeuille and TROP tracker rows carry for analogous optional extensions. --- diff --git a/TODO.md b/TODO.md index fc8434b7..ec244fec 100644 --- a/TODO.md +++ b/TODO.md @@ -82,7 +82,7 @@ Deferred items from PR reviews that were not addressed before merge. 
| ImputationDiD dense `(A0'A0).toarray()` scales O((U+T+K)^2), OOM risk on large panels | `imputation.py` | #141 | Medium (deferred — only triggers when sparse solver fails) | | Multi-absorb weighted demeaning needs iterative alternating projections for N > 1 absorbed FE with survey weights; unweighted multi-absorb also uses single-pass (pre-existing, exact only for balanced panels) | `estimators.py` | #218 | Medium | | Survey design resolution/collapse patterns are inconsistent across panel estimators — ContinuousDiD rebuilds unit-level design in SE code, EfficientDiD builds once in fit(), StackedDiD re-resolves on stacked data; extract shared helpers for panel-to-unit collapse, post-filter re-resolution, and metadata recomputation | `continuous_did.py`, `efficient_did.py`, `stacked_did.py` | #226 | Low | -| ContinuousDiD deferred CGBS 2024 extensions: (a) `covariates=` kwarg not implemented (matches R `contdid` v0.1.0); (b) discrete-treatment saturated regression deferred (integer-valued dose currently warned, not routed to per-level coefficients); (c) lowest-dose-as-control per CGBS 2024 Remark 3.1 (when `P(D=0) = 0`) not implemented — estimator requires never-treated controls. REGISTRY `## ContinuousDiD` Implementation Checklist L755-757 marks these `[ ]`. | `diff_diff/continuous_did.py` | — | Low | +| ContinuousDiD deferred CGBS 2024 extensions: (a) `covariates=` kwarg not implemented (matches R `contdid` v0.1.0); (b) discrete-treatment saturated regression deferred (integer-valued dose currently warned, not routed to per-level coefficients); (c) lowest-dose-as-control per CGBS 2024 Remark 3.1 (when `P(D=0) = 0`) not implemented — estimator requires never-treated controls. REGISTRY `## ContinuousDiD` → Implementation Checklist marks these as deferred `[ ]` items. 
| `diff_diff/continuous_did.py` | — | Low | | Survey-weighted Silverman bandwidth in EfficientDiD conditional Omega* — `_silverman_bandwidth()` uses unweighted mean/std for bandwidth selection; survey-weighted statistics would better reflect the population distribution but is a second-order refinement | `efficient_did_covariates.py` | — | Low | | TROP: extend Wave 4's `_setup_trop_data` helper to also cover the duplicated bootstrap resampling loop in `_bootstrap_variance` / `_bootstrap_variance_global` (~40 LoC dedup; mirrors the data-setup helper pattern with a `fit_callable` parameter for the per-draw refit step). | `trop_local.py`, `trop_global.py` | follow-up | Low | | TripleDifference power auto-routing: `power.simulate_power` ignores `n_periods` for DDD because `_ddd_dgp_kwargs` is hard-coded to the cross-sectional `generate_ddd_data`. Now that `generate_ddd_panel_data` exists (Wave 4), add a new `_EstimatorProfile` registry entry (or extend the existing one) to route to the panel DGP when `n_periods > 2`. | `power.py`, `prep_dgp.py` | follow-up | Low | diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 915f0b39..1221db04 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -758,7 +758,7 @@ remain in place — this Deviations block is the canonical AI-review surface per CLAUDE.md "Documenting Deviations (AI Review Compatibility)" labels.* -1. **Deviation from R:** `range(dose)` vs `range(dvals)` boundary knots — the library uses `range(dose)` (training-dose range) for B-spline boundary knots; R's `contdid` v0.1.0 uses `range(dvals)` via `splines2::bSpline(dvals)`, which can produce extrapolation artifacts at dose-grid extremes. 
**Scope caveat:** R parity therefore runs at **relative** tolerance bands (1% on overall ATT for all 6 benchmarks; 1% on max ATT(d) curve and 2% on max ACRT(d) curve for benchmarks 1-3 via the `_compare_with_r` helper at `tests/test_methodology_continuous_did.py:395-459`; 1% on overall ACRT for benchmarks 4-5; benchmark 6 is event-study, ATT-only), NOT bit-exact (`atol=1e-8`) like HAD. Library extension toward methodological soundness (avoids extrapolation). Cross-references the § Edge Cases "Boundary knots" bullet above and `METHODOLOGY_REVIEW.md` § ContinuousDiD Deviations #1. +1. **Deviation from R:** `range(dose)` vs `range(dvals)` boundary knots — the library uses `range(dose)` (training-dose range) for B-spline boundary knots; R's `contdid` v0.1.0 uses `range(dvals)` via `splines2::bSpline(dvals)`, which can produce extrapolation artifacts at dose-grid extremes. **Scope caveat:** R cross-language coverage therefore runs at **relative** tolerance bands across two surfaces: (a) **scalar parity with raw R `cont_did` / `pte_default`** at 1% relative on overall ATT for all 6 benchmarks and on overall ACRT for benchmarks 4-5; (b) **harmonized boundary-knot-normalized curve parity** with R-side ATT(d) / ACRT(d) reconstructed under `Boundary.knots = range(treated_doses)` (matching the library) on benchmarks 1-3 via the `_compare_with_r` helper at `tests/test_methodology_continuous_did.py:395-459` — max ATT(d) at 1% and max ACRT(d) at 2%. Benchmark 6 is event-study, scalar `overall_att` only. NOT bit-exact (`atol=1e-8`) like HAD. Library extension toward methodological soundness (avoids extrapolation). Cross-references the § Edge Cases "Boundary knots" bullet above and `METHODOLOGY_REVIEW.md` § ContinuousDiD Deviations #1. 2. **Note:** `bspline_derivative_design_matrix` derivative-failure `UserWarning` — Phase 2 axis-C #12 silent-failures audit fix. No R correspondence; `contdid` v0.1.0 does not implement an equivalent warning. 
Cross-references the § Edge Cases `**Note:**` bullet above (L745) and `METHODOLOGY_REVIEW.md` § ContinuousDiD Deviations #2. Locked in `tests/test_continuous_did.py::TestBSplineDerivativeDegenerateBasis` (3 tests); source-level aggregate-warning block at `diff_diff/continuous_did_bspline.py:150-187`. 3. **Note:** `+inf` → `0` never-treated recoding emits `UserWarning` reporting the affected row count; negative `first_treat` (including `-inf`) raises `ValueError`. Axis-E silent-coercion fix per Phase 2 audit. No R correspondence; `contdid` v0.1.0 silently absorbs `+inf` without a signal. Cross-references the § Implementation Checklist `**Note:**` below and `METHODOLOGY_REVIEW.md` § ContinuousDiD Deviations #3. 4. **Note:** Zero-`first_treat` rows with nonzero `dose` are force-zeroed with `UserWarning` reporting the affected row count (axis-E silent-coercion). No R correspondence; `contdid` v0.1.0 has the same `first_treat = 0` → `D = 0` invariant but silently coerces without a warning. Cross-references the § Implementation Checklist `**Note:**` below and `METHODOLOGY_REVIEW.md` § ContinuousDiD Deviations #4. From 6c180cdc151b6189342de22cc3aecb991a451d03 Mon Sep 17 00:00:00 2001 From: igerber Date: Wed, 20 May 2026 15:58:48 -0400 Subject: [PATCH 3/6] Address codex R2 P3 on ContinuousDiD: knot-alignment wording MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit One remaining P3 from local codex R2 — "benchmarks use R's dvals grid for exact knot alignment" partially reintroduced the knot-parity ambiguity the surrounding text fixed. dvals aligns the evaluation grid between Python and R outputs; boundary knots are re-harmonized separately to range(treated_doses) inside the _compare_with_r helper. Reworded to "exact evaluation-grid alignment between Python and R outputs (boundary knots are harmonized separately under surface (b))" for clarity. 
Co-Authored-By: Claude Opus 4.7 --- METHODOLOGY_REVIEW.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md index cd576e4c..f44a0b1d 100644 --- a/METHODOLOGY_REVIEW.md +++ b/METHODOLOGY_REVIEW.md @@ -645,7 +645,7 @@ and covariate-adjusted specifications.) - [x] **PT and SPT identification** (CGBS 2024 Assumptions 1-2) — two-level parallel trends with explicit untreated-and-doses conditioning; estimands `ATT(d|d)`, `ATT(d)`, `ACRT(d)`, `ATT^{loc}`, `ATT^{glob}`, `ACRT^{glob}` defined in `docs/methodology/continuous-did.md` § 4 + REGISTRY `## ContinuousDiD` Identification block. Hand-calc coverage: `tests/test_methodology_continuous_did.py::TestLinearDoseResponse` (4 tests at `atol=1e-10` / `atol=1e-6` on no-noise linear DGP — locks the `ATT^{glob}` binarization formula `E[ΔY | D > 0] − E[ΔY | D = 0]`, the `ACRT^{glob}` plug-in average, and the `ATT(d) = 2d`, `ACRT(d) = 2` closed forms). - [x] **B-spline basis matching `splines2::bSpline`** (cubic and linear degrees, `num_knots=0` default; global boundary knots from the training-dose range, NOT per-cell) — `tests/test_methodology_continuous_did.py::TestQuadraticWithCubicBasis::test_quadratic_recovery` recovers `ATT(d) = d²` at `atol=1e-6` via a degree-3 basis (cubic spline can represent quadratic exactly). The matching basis algorithm lives in `diff_diff/continuous_did_bspline.py` (216 LoC); the boundary-knots deviation from R `contdid` is documented in the Deviations block below. - [x] **Multi-period (g,t) cell iteration with base period selection** — `TestMultiPeriodAggregation::test_multiple_groups` and `test_gt_cell_count` exercise the cohort iteration on 2-cohort staggered panels; cell counts agree with the R `ptetools`-style convention. 
Scalar parity with raw R `cont_did` at 1% relative further locks the staggered-aggregation surface via `TestRBenchmark::test_benchmark_4_staggered_dose` and `test_benchmark_5_not_yet_treated` (both assert overall ATT AND overall ACRT at `< 0.01`). -- [x] **Dose-response (`aggregate="dose"`) and event-study (`aggregate="eventstudy"`) aggregation** with group-proportional weights (`n_treated/n_total` per group, divided among post-treatment cells; matches R `ptetools` convention). Two R-side surfaces are exercised: (a) **scalar `overall_att`** via `TestRBenchmark::test_benchmark_1_basic_cubic` / `_2_linear` / `_3_interior_knots` / `_4_staggered_dose` / `_5_not_yet_treated` (dose mode) and `_6_event_study` (event-study mode — binarized ATT only; benchmark 6 validates the event-study code path through the scalar surface, NOT per-horizon `event_study_effects`); (b) **harmonized boundary-knot-normalized ATT(d) / ACRT(d) curves** on benchmarks 1-3 via `_compare_with_r` (helper at `tests/test_methodology_continuous_did.py:395-459` rebuilds the R-side basis under `Boundary.knots = range(treated_doses)` before comparison — raw `contdid` curves use `range(dvals)`, so this is reconstructed-basis parity not raw-package parity). Per-benchmark tolerances: all 6 assert overall ATT at `< 0.01` (1%); benchmarks 1-3 additionally assert max ATT(d) at `< 0.01` and max ACRT(d) at `< 0.02` via the helper; benchmarks 4-5 assert overall ACRT at `< 0.01` inline. Per-horizon `event_study_effects` estimates and inference are exercised by Python-side tests at `tests/test_continuous_did.py:557-690` and `:1500-1528` (no R cross-language comparison on the per-horizon surface). Skipped if R / `contdid` not installed via `_check_r_contdid()`; benchmarks use R's `dvals` grid for exact knot alignment. 
+- [x] **Dose-response (`aggregate="dose"`) and event-study (`aggregate="eventstudy"`) aggregation** with group-proportional weights (`n_treated/n_total` per group, divided among post-treatment cells; matches R `ptetools` convention). Two R-side surfaces are exercised: (a) **scalar `overall_att`** via `TestRBenchmark::test_benchmark_1_basic_cubic` / `_2_linear` / `_3_interior_knots` / `_4_staggered_dose` / `_5_not_yet_treated` (dose mode) and `_6_event_study` (event-study mode — binarized ATT only; benchmark 6 validates the event-study code path through the scalar surface, NOT per-horizon `event_study_effects`); (b) **harmonized boundary-knot-normalized ATT(d) / ACRT(d) curves** on benchmarks 1-3 via `_compare_with_r` (helper at `tests/test_methodology_continuous_did.py:395-459` rebuilds the R-side basis under `Boundary.knots = range(treated_doses)` before comparison — raw `contdid` curves use `range(dvals)`, so this is reconstructed-basis parity not raw-package parity). Per-benchmark tolerances: all 6 assert overall ATT at `< 0.01` (1%); benchmarks 1-3 additionally assert max ATT(d) at `< 0.01` and max ACRT(d) at `< 0.02` via the helper; benchmarks 4-5 assert overall ACRT at `< 0.01` inline. Per-horizon `event_study_effects` estimates and inference are exercised by Python-side tests at `tests/test_continuous_did.py:557-690` and `:1500-1528` (no R cross-language comparison on the per-horizon surface). Skipped if R / `contdid` not installed via `_check_r_contdid()`; benchmarks use R's `dvals` for exact evaluation-grid alignment between Python and R outputs (boundary knots are harmonized separately under surface (b) — see the `_compare_with_r` helper's `Boundary.knots = range(treated_doses)` block). 
- [x] **Multiplier bootstrap for inference** (PSU-level multiplier weights on the survey path per Phase 6) — implementation in `diff_diff/continuous_did.py`; bootstrap SE invariant on rank-deficient cells locked in `TestEdgeCasesMethodology::test_all_same_dose` (verifies `dose_response_att.se` is finite on a heterogeneous-outcome / identical-dose DGP); 80 unit tests in `tests/test_continuous_did.py` exercise the rest of the bootstrap path. - [x] **Analytical SEs via influence functions** (NOT delta method; corrected post-v3.0.0, see Corrections Made) — IF-based variance with `safe_inference()` joint-NaN consistency on all six estimand fields (`overall_att`, `overall_acrt`, dose-response, event-study). - [x] **Survey support**: weighted B-spline OLS, two-stage linearization (TSL) on influence functions, bootstrap + survey via PSU-level multiplier weights (Phase 3 + Phase 6). Boxed in REGISTRY `## ContinuousDiD` → Implementation Checklist → "Survey design support (Phase 3)" item. From c1a8f2fde252f2dcd865d5d32ede5a49043ac796 Mon Sep 17 00:00:00 2001 From: igerber Date: Wed, 20 May 2026 16:02:53 -0400 Subject: [PATCH 4/6] Address codex R3 P3 on ContinuousDiD: attribute R-side rebuild to _run_r_contdid MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit One remaining P3 from local codex R3 — the docs attributed the R-side basis rebuild to _compare_with_r when it actually lives in _run_r_contdid at tests/test_methodology_continuous_did.py:333-367; _compare_with_r only orchestrates the Python-vs-R comparison at :395-459. This sends future reviewers to the wrong code path when auditing the documented parity surface. 
Reworded 7 citations across METHODOLOGY_REVIEW.md (R Reference row + Verified Components dose-response row + Test Coverage block + long-form Deviations #1 + the in-line dvals-grid-alignment note), REGISTRY Deviations #1, and the CHANGELOG bullet to attribute the rebuild to _run_r_contdid at L333-367 explicitly, keeping _compare_with_r credited as the orchestrator at :395-459. Co-Authored-By: Claude Opus 4.7 --- CHANGELOG.md | 2 +- METHODOLOGY_REVIEW.md | 8 ++++---- docs/methodology/REGISTRY.md | 2 +- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 980e53d9..6bac2cce 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,7 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] ### Added -- **ContinuousDiD methodology-review-tracker promotion.** Tracker row flipped **In Progress** → **Complete** with full Verified Components / Test Coverage / Corrections Made / Deviations / Outstanding Concerns structure mirroring the HAD precedent (PR #473). REGISTRY `## ContinuousDiD` gains a formal Deviations block consolidating the boundary-knots deviation from R `contdid` v0.1.0 (`range(dose)` vs `range(dvals)` — library avoids extrapolation), the `bspline_derivative` derivative-failure `UserWarning` (Phase 2 axis-C #12), the `+inf` → `0` never-treated recoding warning, and the zero-`first_treat`+nonzero-`dose` force-zeroing warning (both axis-E silent-coercion fixes) into a single AI-review-recognized labeled surface. 
R cross-language coverage for ContinuousDiD runs at relative tolerance across two surfaces: (a) **scalar parity with raw R `cont_did` / `pte_default`** at 1% on overall ATT for all 6 benchmarks and on overall ACRT for benchmarks 4-5 (benchmark 6 is event-study, scalar `overall_att` only); (b) **harmonized boundary-knot-normalized curve parity** with R-side ATT(d) / ACRT(d) reconstructed under `Boundary.knots = range(treated_doses)` (matching the library) on benchmarks 1-3 via the shared `_compare_with_r` helper at `tests/test_methodology_continuous_did.py:395-459` — max ATT(d) at 1% and max ACRT(d) at 2%. NOT bit-exact (`atol=1e-8`) like HAD — the boundary-knots deviation precludes algorithmic bit-equality on aggregated dose-response curves. Surface (a) is direct raw-package parity; surface (b) is reconstructed-basis parity because raw `contdid` curves use `range(dvals)`. No source code changes, no new tests, no new docstrings — consolidation only against the existing 15 methodology tests (`tests/test_methodology_continuous_did.py`), 80 unit tests (`tests/test_continuous_did.py`), and `docs/methodology/continuous-did.md` theory note. `METHODOLOGY_REVIEW.md` ContinuousDiD row promoted **In Progress** → **Complete**. +- **ContinuousDiD methodology-review-tracker promotion.** Tracker row flipped **In Progress** → **Complete** with full Verified Components / Test Coverage / Corrections Made / Deviations / Outstanding Concerns structure mirroring the HAD precedent (PR #473). REGISTRY `## ContinuousDiD` gains a formal Deviations block consolidating the boundary-knots deviation from R `contdid` v0.1.0 (`range(dose)` vs `range(dvals)` — library avoids extrapolation), the `bspline_derivative` derivative-failure `UserWarning` (Phase 2 axis-C #12), the `+inf` → `0` never-treated recoding warning, and the zero-`first_treat`+nonzero-`dose` force-zeroing warning (both axis-E silent-coercion fixes) into a single AI-review-recognized labeled surface. 
R cross-language coverage for ContinuousDiD runs at relative tolerance across two surfaces: (a) **scalar parity with raw R `cont_did` / `pte_default`** at 1% on overall ATT for all 6 benchmarks and on overall ACRT for benchmarks 4-5 (benchmark 6 is event-study, scalar `overall_att` only); (b) **harmonized boundary-knot-normalized curve parity** with R-side ATT(d) / ACRT(d) reconstructed under `Boundary.knots = range(treated_doses)` (matching the library) on benchmarks 1-3 via the benchmark harness — `_run_r_contdid` does the R-side rebuild at `tests/test_methodology_continuous_did.py:333-367`, and `_compare_with_r` orchestrates the Python-vs-R comparison at `:395-459` — max ATT(d) at 1% and max ACRT(d) at 2%. NOT bit-exact (`atol=1e-8`) like HAD — the boundary-knots deviation precludes algorithmic bit-equality on aggregated dose-response curves. Surface (a) is direct raw-package parity; surface (b) is reconstructed-basis parity because raw `contdid` curves use `range(dvals)`. No source code changes, no new tests, no new docstrings — consolidation only against the existing 15 methodology tests (`tests/test_methodology_continuous_did.py`), 80 unit tests (`tests/test_continuous_did.py`), and `docs/methodology/continuous-did.md` theory note. `METHODOLOGY_REVIEW.md` ContinuousDiD row promoted **In Progress** → **Complete**. - **`SpilloverDiD(vcov_type="conley", survey_design=...)` integration via stratified-Conley sandwich on PSU totals (Wave E.2).** Lifts the Wave E.1 `NotImplementedError` (`spillover.py:2201` upfront, `two_stage.py:217` helper-level) and adds spatial-HAC + design-based variance for the previously deferred composition. **Documented synthesis** of Conley (1999) spatial-HAC × Gerber (2026, arXiv:2605.04124) Proposition 1 Binder TSL (the Wave E.1 foundation) × Wave D Gardner GMM first-stage uncertainty correction (Butts 2021 §3.1 + Gardner 2022 §4) applied to SpilloverDiD's ring-indicator stage-2 design. 
No reference software combines all three ingredients on a two-stage influence function. **Mechanical composition (panel-aware):** preserves the library's existing `conley_lag_cutoff = 0` semantic at `diff_diff.conley._compute_conley_meat` ("within-period spatial only — exclude cross-period spatial pairs") by looping over periods. For each period `t`, SpilloverDiD's per-obs Hájek-weighted Wave D IF `psi_i` is aggregated to per-period PSU totals `S_psu_t[g] = sum_{i in PSU g, time t} psi_i` (via `np.add.at`); per-PSU spatial centroids are panel-constant (mean of per-observation `conley_coords` within each PSU, vectorized `np.add.at` sums / `np.bincount` counts); for each stratum the within-stratum sandwich is `M_h_t = (1 - f_h) * n_h/(n_h-1) * sum_{j,k in PSUs_h} K(d(centroid_j, centroid_k) / conley_cutoff_km) * (S_psu_t[j] - S_bar_h_t)(S_psu_t[k] - S_bar_h_t)'`, where K is the Bartlett kernel (SpilloverDiD currently exposes Bartlett only and hardcodes it; the survey helper accepts `"uniform"` too but exposing that on the SpilloverDiD constructor is a separate follow-up) and `d` is haversine / euclidean / callable per `ConleyMetric`. Cross-stratum kernel weights are exactly zero by sampling design (strata are independence partitions). Total meat is `sum_t sum_h M_h_t`. Cross-period spatial pairs are excluded by construction — the per-period loop matches the library's panel Conley contract exactly. **Reduction semantics (load-bearing for tests):** the orchestrator's panel-aware meat equals `sum_t` of per-period within-stratum stratified-Conley sandwiches on per-period PSU totals (pinned at `tests/test_spillover.py::TestSpilloverDiDWaveE2ConleySurveyDesign::test_b_panel_aware_per_period_sum_invariant`); single stratum (H = 1, FPC = inf) reduces to `sum_t` plain Conley sandwich on per-period PSU totals (NOT on time-collapsed totals). 
**Implementation:** new `_compute_stratified_conley_meat_from_psu_scores` helper in `diff_diff/survey.py` (parallel to existing `_compute_stratified_meat_from_psu_scores` 3-tuple `(meat, variance_computed, legitimate_zero_count)` contract; per-stratum loop replaces the inner `centered.T @ centered` with `_compute_conley_meat(scores=centered, coords=psu_coords_h, ...)` in cross-sectional mode); new dispatch wrapper `_compute_stratified_conley_meat` in `diff_diff/two_stage.py` (parallel to existing `_compute_binder_tsl_meat`, performs per-obs Psi → PSU aggregation + centroid derivation + dispatch to survey helper, intentionally drops `cluster_ids` at the dispatch boundary — see Restrictions). `_compute_gmm_corrected_meat` conley branch extended with `if resolved_survey is not None` routing to the new wrapper; the `resolved_survey is None` branch is bit-identical to Wave D. **Singleton-stratum `lonely_psu="adjust"` parity:** the survey helper mirrors the Binder helper's `continue` to skip the FPC scale on singleton strata (with `n_h = 1` the scale `n_h / (n_h - 1)` would divide by zero); the degenerate one-PSU kernel `K = [[K(0)]] = [[1.0]]` reduces to `centered.T @ centered`, matching Binder's singleton-adjust output. **Saturated `df_survey = 0` NaN-fail:** mirrors Wave E.1 (`_compute_stratified_conley_meat` returns NaN meat with `UserWarning` template "Wave E.2 stratified-Conley sandwich: df_survey = 0..." so callers can `pytest.warns(UserWarning, match="Wave E.2 stratified-Conley")`). 
**Public surface restrictions:** replicate-weight variance (BRR / Fay / JK1 / JKn / SDR) raises `NotImplementedError` (inherits Wave E.1 gate; per-replicate full refit is separate follow-up scope); `cluster= + survey_design.psu + vcov_type="conley"` coerces `cluster=` to PSU per Wave E.1's warn-and-use-PSU pattern (the Conley cluster product kernel becomes a no-op after PSU aggregation, so `cluster_ids` is intentionally not threaded into the inner Conley kernel call — every PSU is its own cluster post-aggregation, which would zero all cross-PSU pairs); LinearRegression-side `vcov_type="conley" + survey_design=` gate at `diff_diff/linalg.py:2853` remains (separate Bertanha-Imbens 2014 weighted-Conley "Phase 5" roadmap, not Wave E); DiagnosticReport routing for `SpilloverDiDResults(vcov_type="conley", survey_design=)` requires `_APPLICABILITY` / `_PT_METHOD` registration (separate Wave F PR). **Tests:** new `TestSpilloverDiDWaveE2ConleySurveyDesign` and `TestSpilloverDiDWaveE2ConleySurveyDesignEventStudy` classes in `tests/test_spillover.py` (bit-identical no-survey fallback; panel-aware per-period sum invariant on the orchestrator + helper composition; hand-computation methodology anchor; single-stratum ≡ plain Conley on PSU totals; cross-stratum independence as a unit test on the survey helper with interleaved cross-stratum centroids; Binder vs Conley singleton-adjust FPC skip parity; lonely-PSU sensitivity across three modes; FPC large ≡ no-FPC and FPC = n_h zeros stratum; saturated NaN-fail with `pytest.warns(match="Wave E.2 stratified-Conley")`; replicate-weight + non-pweight rejections; cluster warn-and-use-PSU; fit idempotency; `finite_mask` survey-array subsetting; no-PSU coverage — weights-only `SurveyDesign(weights=...)`, strata-only `SurveyDesign(weights=..., strata=...)`, and a per-period re-index unit invariant pinning that no cross-period spatial pairs leak into the meat on implicit-PSU layouts; event-study path on both `is_staggered=True`/`False` 
branches per `feedback_cohort_loop_trigger_cache_both_branches`; drift goldens at `rtol=1e-12 / atol=1e-14`). The pre-existing `tests/test_spillover.py::test_fit_conley_plus_survey_design_not_implemented` Wave E.1-era gate-assertion test is removed (replaced by the positive-path tests above). Wave E.1 entry's "Public surface restrictions" bullet updated to past-tense the conley+survey gate reference. - **HeterogeneousAdoptionDiD methodology-review-tracker promotion.** New `tests/test_methodology_had.py` (6 classes, 36 tests) with paper-equation-numbered Verified Components walk-through against de Chaisemartin, Ciccia, D'Haultfœuille & Knau (2026) arXiv:2405.04465v6 (Equations 3 / 7 / 11 / 18 / 29 and Theorems 1 / 3 / 4 / 7): Design 1' MC recovery on both the zero-boundary DGP AND a nonzero-boundary-intercept DGP (`ΔY = c + β·D + ε` with `c != 0`) so the `att = (mean(ΔY) − τ_bc) / mean(D)` subtraction term is verified explicitly, N(0,1) coverage at `n_replicates=200`, mass-point Wald-IV closed-form equivalence at `atol=1e-9`, QUG limit-law distributional match at KS-stat ≤ 0.05 (n_draws=5000), Yatchew-HR paper-literal `σ²_diff = 1/(2G)` normalization lock, joint Stute pre-trends + homogeneity H0 fail-to-reject on both surfaces and H1 reject for joint homogeneity under a nonlinear DGP, and library-deviation locks (equal-weighting via selective low-dose-region replication, sup-t bootstrap gating, staggered-timing fail-closed `ValueError`). Added "Non-testable assumptions (paper Section 3.1.2)" Notes block to `HeterogeneousAdoptionDiD` class docstring + "Scope (what this test does NOT cover)" clauses to `qug_test` / `stute_test` / `yatchew_hr_test` / `did_had_pretest_workflow` Notes sections explicitly stating that the pre-tests verify ADJACENT assumptions (Assumption 4 / 7 / 8) and CANNOT test Assumptions 5 or 6. 
Phase-4 validation-harness items (Pierce-Schott 2016 Figure 2 replication, Table 1 coverage-rate reproduction across 3 DGPs × G ∈ {100, 500, 2500}) waived with documented rationale: R parity at `atol=1e-8` in `tests/test_did_had_parity.py` (3 DGPs × 5 method combos, bit-exact via `rtol=0`) is a strictly stronger anchor than coverage-rate Monte Carlo, and the paper itself self-acknowledges (Section 5.2) that NP estimators are too noisy to be informative on the LBD-restricted PNTR panel. REGISTRY HAD section gains a consolidated Deviations block (5 entries with framing header) and closes 2 of 3 unchecked Implementation Checklist items — the staggered-timing fail-closed `ValueError` and the Assumption 5/6 non-testability documentation; the `covariates=` Theorem 6 follow-up and the extensive-margin / "consider running standard DiD" warning both remain explicitly tracked in `TODO.md` as Low-priority follow-ups rather than claimed-closed. `dechaisemartin-2026-review.md:182-194` requirements checklist boxes the Phase 1a/1b/1c implementation-status closures + the Assumption 5/6 documentation + the staggered-timing closures; the extensive-margin item is acknowledged as partial (zero-dose `UserWarning` exists in `qug_test`; main-`fit()` "consider standard DiD" recommendation is the TODO follow-up). `METHODOLOGY_REVIEW.md` HAD row promoted **In Progress** → **Complete**. - **SunAbraham `vcov_type` parameter (Phase 1b PR 1/8).** `SunAbraham(vcov_type=...)` now accepts `{"classical","hc1","hc2","hc2_bm"}` (defaults to `"hc1"`, which preserves prior behavior bit-equally - SA historically hard-coded HC1). Auto-cluster-at-unit dropped when the user opts into explicit `vcov_type="hc2"` or `vcov_type="classical"` (one-way only); preserved for `"hc1"` and `"hc2_bm"`. 
When `vcov_type in {"classical","hc2","hc2_bm"}`, `_fit_saturated_regression` auto-routes to a full-dummy saturated design (mirrors TWFE Gate 1 from PR #469): FWL preserves cohort coefficients but not the hat matrix, so HC2 leverage and Bell-McCaffrey Satterthwaite DOF must be computed on the full FE projection. Empirically matches R `lm()` summary classical SE, `sandwich::vcovHC(type="HC2")`, and `clubSandwich::vcovCR(..., type="CR2")` + `coef_test()$df_Satt` at atol=1e-10 (cohort SE and BM DOF pinned in `tests/test_methodology_sun_abraham.py`). For `vcov_type="hc2_bm"`, the user-facing aggregated inference (`event_study_effects[e]['p_value']`/`['conf_int']`, `overall_p_value`/`overall_conf_int`) uses CR2 Bell-McCaffrey contrast DOF — matches `clubSandwich::Wald_test(test="HTZ")$df_denom` at atol=1e-10 (mirrors PR #465's `_compute_cr2_bm_contrast_dof` pattern for MultiPeriodDiD's post-period-average ATT). `vcov_type` is now propagated to `SunAbrahamResults.vcov_type` for downstream introspection. `SurveyDesign` (any kind — analytical weights, stratified, PSU, or replicate-weight) combined with `vcov_type in {"classical","hc2","hc2_bm"}` raises `NotImplementedError`: the survey-design TSL (or replicate-weight refit) variance overrides the analytical sandwich family, and the auto-cluster guard for one-way families would silently downgrade unit-level PSUs to per-observation PSUs. Use `vcov_type="hc1"` (default) for survey designs. `conley` rejected at `__init__` with a deferral message (would require threading 6+ `conley_*` params through the saturated regression call). **Deviation from R:** SA's within-transform HC1 SE differs from `fixest::sunab()` by ~1-2% (~2e-3 absolute) on typical panel sizes due to a different `(n-k)` finite-sample correction (fixest counts absorbed FE in k_total; SA's `solve_ols` counts only within-transformed columns); the IW aggregation step is otherwise identical (pinned at atol=5e-3, tracked in TODO.md). 
First PR of the Phase 1b standalone-estimator threading initiative (7 PRs to follow: StackedDiD, WooldridgeDiD-OLS, CallawaySantAnna, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD). diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md index f44a0b1d..62d4e71e 100644 --- a/METHODOLOGY_REVIEW.md +++ b/METHODOLOGY_REVIEW.md @@ -637,7 +637,7 @@ and covariate-adjusted specifications.) |-------|-------| | Module | `continuous_did.py`, `continuous_did_bspline.py`, `continuous_did_results.py` | | Primary Reference | Callaway, Goodman-Bacon & Sant'Anna (2024), *Difference-in-Differences with a Continuous Treatment*, NBER WP 32117 | -| R Reference | `contdid` v0.1.0 (CRAN) — two parity surfaces at relative tolerance: (a) **scalar overall ATT parity** with raw R `cont_did` / `pte_default` output at `< 0.01` (1%) on all 6 benchmarks; **scalar overall ACRT parity** with raw R `cont_did` at `< 0.01` (1%) on benchmarks 4-5; (b) **harmonized boundary-knot-normalized curve parity** with R-side ATT(d)/ACRT(d) reconstructed under `Boundary.knots = range(treated_doses)` (matching the library) at `< 0.01` max ATT(d) and `< 0.02` max ACRT(d) on benchmarks 1-3 via the `_compare_with_r` helper at `tests/test_methodology_continuous_did.py:395-459`; benchmark 6 is event-study, scalar `overall_att` only (binarized ATT, no curve comparison and no ACRT in event-study mode). Surface (a) is direct raw-package parity; surface (b) is reconstructed-basis parity because raw `contdid` curves use `range(dvals)` instead of `range(dose)`. NOT bit-exact (`atol=1e-8`) like HAD because of the boundary-knots deviation documented below. 
See `tests/test_methodology_continuous_did.py::TestRBenchmark` | +| R Reference | `contdid` v0.1.0 (CRAN) — two parity surfaces at relative tolerance: (a) **scalar overall ATT parity** with raw R `cont_did` / `pte_default` output at `< 0.01` (1%) on all 6 benchmarks; **scalar overall ACRT parity** with raw R `cont_did` at `< 0.01` (1%) on benchmarks 4-5; (b) **harmonized boundary-knot-normalized curve parity** with R-side ATT(d)/ACRT(d) reconstructed under `Boundary.knots = range(treated_doses)` (matching the library) at `< 0.01` max ATT(d) and `< 0.02` max ACRT(d) on benchmarks 1-3 via the benchmark harness (`_run_r_contdid` rebuilds the R-side basis under `Boundary.knots = range(treated_doses)` at `tests/test_methodology_continuous_did.py:333-367`; `_compare_with_r` orchestrates the Python-vs-R comparison at `:395-459`); benchmark 6 is event-study, scalar `overall_att` only (binarized ATT, no curve comparison and no ACRT in event-study mode). Surface (a) is direct raw-package parity; surface (b) is reconstructed-basis parity because raw `contdid` curves use `range(dvals)` instead of `range(dose)`. NOT bit-exact (`atol=1e-8`) like HAD because of the boundary-knots deviation documented below. See `tests/test_methodology_continuous_did.py::TestRBenchmark` | | Status | **Complete** | | Last Review | 2026-05-20 | @@ -645,7 +645,7 @@ and covariate-adjusted specifications.) - [x] **PT and SPT identification** (CGBS 2024 Assumptions 1-2) — two-level parallel trends with explicit untreated-and-doses conditioning; estimands `ATT(d|d)`, `ATT(d)`, `ACRT(d)`, `ATT^{loc}`, `ATT^{glob}`, `ACRT^{glob}` defined in `docs/methodology/continuous-did.md` § 4 + REGISTRY `## ContinuousDiD` Identification block. 
Hand-calc coverage: `tests/test_methodology_continuous_did.py::TestLinearDoseResponse` (4 tests at `atol=1e-10` / `atol=1e-6` on no-noise linear DGP — locks the `ATT^{glob}` binarization formula `E[ΔY | D > 0] − E[ΔY | D = 0]`, the `ACRT^{glob}` plug-in average, and the `ATT(d) = 2d`, `ACRT(d) = 2` closed forms). - [x] **B-spline basis matching `splines2::bSpline`** (cubic and linear degrees, `num_knots=0` default; global boundary knots from the training-dose range, NOT per-cell) — `tests/test_methodology_continuous_did.py::TestQuadraticWithCubicBasis::test_quadratic_recovery` recovers `ATT(d) = d²` at `atol=1e-6` via a degree-3 basis (cubic spline can represent quadratic exactly). The matching basis algorithm lives in `diff_diff/continuous_did_bspline.py` (216 LoC); the boundary-knots deviation from R `contdid` is documented in the Deviations block below. - [x] **Multi-period (g,t) cell iteration with base period selection** — `TestMultiPeriodAggregation::test_multiple_groups` and `test_gt_cell_count` exercise the cohort iteration on 2-cohort staggered panels; cell counts agree with the R `ptetools`-style convention. Scalar parity with raw R `cont_did` at 1% relative further locks the staggered-aggregation surface via `TestRBenchmark::test_benchmark_4_staggered_dose` and `test_benchmark_5_not_yet_treated` (both assert overall ATT AND overall ACRT at `< 0.01`). -- [x] **Dose-response (`aggregate="dose"`) and event-study (`aggregate="eventstudy"`) aggregation** with group-proportional weights (`n_treated/n_total` per group, divided among post-treatment cells; matches R `ptetools` convention). 
Two R-side surfaces are exercised: (a) **scalar `overall_att`** via `TestRBenchmark::test_benchmark_1_basic_cubic` / `_2_linear` / `_3_interior_knots` / `_4_staggered_dose` / `_5_not_yet_treated` (dose mode) and `_6_event_study` (event-study mode — binarized ATT only; benchmark 6 validates the event-study code path through the scalar surface, NOT per-horizon `event_study_effects`); (b) **harmonized boundary-knot-normalized ATT(d) / ACRT(d) curves** on benchmarks 1-3 via `_compare_with_r` (helper at `tests/test_methodology_continuous_did.py:395-459` rebuilds the R-side basis under `Boundary.knots = range(treated_doses)` before comparison — raw `contdid` curves use `range(dvals)`, so this is reconstructed-basis parity not raw-package parity). Per-benchmark tolerances: all 6 assert overall ATT at `< 0.01` (1%); benchmarks 1-3 additionally assert max ATT(d) at `< 0.01` and max ACRT(d) at `< 0.02` via the helper; benchmarks 4-5 assert overall ACRT at `< 0.01` inline. Per-horizon `event_study_effects` estimates and inference are exercised by Python-side tests at `tests/test_continuous_did.py:557-690` and `:1500-1528` (no R cross-language comparison on the per-horizon surface). Skipped if R / `contdid` not installed via `_check_r_contdid()`; benchmarks use R's `dvals` for exact evaluation-grid alignment between Python and R outputs (boundary knots are harmonized separately under surface (b) — see the `_compare_with_r` helper's `Boundary.knots = range(treated_doses)` block). +- [x] **Dose-response (`aggregate="dose"`) and event-study (`aggregate="eventstudy"`) aggregation** with group-proportional weights (`n_treated/n_total` per group, divided among post-treatment cells; matches R `ptetools` convention). 
Two R-side surfaces are exercised: (a) **scalar `overall_att`** via `TestRBenchmark::test_benchmark_1_basic_cubic` / `_2_linear` / `_3_interior_knots` / `_4_staggered_dose` / `_5_not_yet_treated` (dose mode) and `_6_event_study` (event-study mode — binarized ATT only; benchmark 6 validates the event-study code path through the scalar surface, NOT per-horizon `event_study_effects`); (b) **harmonized boundary-knot-normalized ATT(d) / ACRT(d) curves** on benchmarks 1-3 via the benchmark harness — `_run_r_contdid` at `tests/test_methodology_continuous_did.py:333-367` rebuilds the R-side basis under `Boundary.knots = range(treated_doses)` (raw `contdid` curves use `range(dvals)`, so this is reconstructed-basis parity not raw-package parity), and `_compare_with_r` orchestrates the comparison at `:395-459`. Per-benchmark tolerances: all 6 assert overall ATT at `< 0.01` (1%); benchmarks 1-3 additionally assert max ATT(d) at `< 0.01` and max ACRT(d) at `< 0.02` via the helper; benchmarks 4-5 assert overall ACRT at `< 0.01` inline. Per-horizon `event_study_effects` estimates and inference are exercised by Python-side tests at `tests/test_continuous_did.py:557-690` and `:1500-1528` (no R cross-language comparison on the per-horizon surface). Skipped if R / `contdid` not installed via `_check_r_contdid()`; benchmarks use R's `dvals` for exact evaluation-grid alignment between Python and R outputs (boundary knots are harmonized separately under surface (b) — see the `_run_r_contdid` helper's `Boundary.knots = range(treated_doses)` block at `tests/test_methodology_continuous_did.py:333-367`). 
- [x] **Multiplier bootstrap for inference** (PSU-level multiplier weights on the survey path per Phase 6) — implementation in `diff_diff/continuous_did.py`; bootstrap SE invariant on rank-deficient cells locked in `TestEdgeCasesMethodology::test_all_same_dose` (verifies `dose_response_att.se` is finite on a heterogeneous-outcome / identical-dose DGP); 80 unit tests in `tests/test_continuous_did.py` exercise the rest of the bootstrap path. - [x] **Analytical SEs via influence functions** (NOT delta method; corrected post-v3.0.0, see Corrections Made) — IF-based variance with `safe_inference()` joint-NaN consistency on all six estimand fields (`overall_att`, `overall_acrt`, dose-response, event-study). - [x] **Survey support**: weighted B-spline OLS, two-stage linearization (TSL) on influence functions, bootstrap + survey via PSU-level multiplier weights (Phase 3 + Phase 6). Boxed in REGISTRY `## ContinuousDiD` → Implementation Checklist → "Survey design support (Phase 3)" item. @@ -658,7 +658,7 @@ and covariate-adjusted specifications.) **Test Coverage:** - 15 methodology tests in `tests/test_methodology_continuous_did.py` (5 classes: 4 + 1 + 2 + 2 + 6); the R-benchmark class (6 tests) skips if R / `contdid` v0.1.0 is not installed via `_check_r_contdid()` guard. - 80 unit tests in `tests/test_continuous_did.py` (1,530 LoC) covering bootstrap, survey design, IF-based SE, anticipation, rank-deficient cells, and result-class field contracts. -- R cross-language coverage at relative tolerance (NOT bit-exact — see Deviations § "Boundary knots") on 6 benchmark configurations across two surfaces: (a) **scalar parity with raw R `cont_did` / `pte_default`** — all 6 assert overall ATT at `< 0.01` (1%); benchmarks 4-5 also assert overall ACRT at `< 0.01` inline; benchmark 6 is event-study mode with scalar `overall_att` only (binarized ATT, no per-horizon and no ACRT comparison). 
Per-horizon `event_study_effects` is exercised by Python-side tests at `tests/test_continuous_did.py:557-690` and `:1500-1528`. (b) **harmonized boundary-knot-normalized curve parity** with R-side ATT(d) / ACRT(d) reconstructed under `Boundary.knots = range(treated_doses)` (matching the library) on benchmarks 1-3 via the `_compare_with_r` helper — max ATT(d) at `< 0.01` and max ACRT(d) at `< 0.02`. Surface (a) is direct raw-package parity; surface (b) is reconstructed-basis parity because raw `contdid` curves use `range(dvals)`. +- R cross-language coverage at relative tolerance (NOT bit-exact — see Deviations § "Boundary knots") on 6 benchmark configurations across two surfaces: (a) **scalar parity with raw R `cont_did` / `pte_default`** — all 6 assert overall ATT at `< 0.01` (1%); benchmarks 4-5 also assert overall ACRT at `< 0.01` inline; benchmark 6 is event-study mode with scalar `overall_att` only (binarized ATT, no per-horizon and no ACRT comparison). Per-horizon `event_study_effects` is exercised by Python-side tests at `tests/test_continuous_did.py:557-690` and `:1500-1528`. (b) **harmonized boundary-knot-normalized curve parity** with R-side ATT(d) / ACRT(d) reconstructed under `Boundary.knots = range(treated_doses)` (matching the library) on benchmarks 1-3 via the benchmark harness (`_run_r_contdid` does the R-side rebuild at `tests/test_methodology_continuous_did.py:333-367`; `_compare_with_r` orchestrates at `:395-459`) — max ATT(d) at `< 0.01` and max ACRT(d) at `< 0.02`. Surface (a) is direct raw-package parity; surface (b) is reconstructed-basis parity because raw `contdid` curves use `range(dvals)`. - Documentation: `docs/methodology/continuous-did.md` (14,885 bytes theory note covering PT vs SPT, estimands, B-spline OLS, multiplier bootstrap). **Corrections Made:** @@ -671,7 +671,7 @@ and covariate-adjusted specifications.) 5. 
**Tracker-promotion consolidation (this PR, 2026-05-20):** formal Deviations block added to REGISTRY `## ContinuousDiD` consolidating the boundary-knots deviation, the `bspline_derivative` warning, and the two axis-E silent-coercion warnings into a single labeled surface. The original Edge Cases / Notes entries remain in place — Deviations is an additional canonical surface (per CLAUDE.md "Documenting Deviations (AI Review Compatibility)" labels). **Deviations from the paper / from R / library extensions:** -1. **Deviation from R — boundary knots use `range(dose)` not `range(dvals)`** — knots are built once from all treated doses (global, not per-cell) to ensure a common basis across (g,t) cells for aggregation. The evaluation grid is clamped to training-dose boundary knots (`range(dose)`). R's `contdid` v0.1.0 has an inconsistency where `splines2::bSpline(dvals)` uses `range(dvals)` instead of `range(dose)`, which can produce extrapolation artifacts at dose-grid extremes. **Scope caveat:** R cross-language coverage therefore runs at **relative** tolerance bands across two surfaces, NOT bit-exact (`atol=1e-8`) like HAD — `contdid` and ContinuousDiD cannot bit-match on aggregated dose-response or ACRT curves because they use different knot placement; the agreement band reflects the boundary-knot divergence rather than algorithmic drift. (a) **Scalar parity with raw R `cont_did` / `pte_default`** at 1% relative on overall ATT for all 6 benchmarks and on overall ACRT for benchmarks 4-5 (benchmark 6 is event-study, scalar `overall_att` only). (b) **Harmonized boundary-knot-normalized curve parity** with R-side ATT(d) / ACRT(d) reconstructed under `Boundary.knots = range(treated_doses)` (matching the library) on benchmarks 1-3 via the `_compare_with_r` helper — max ATT(d) at 1% and max ACRT(d) at 2%. 
The slightly wider 2% ACRT(d)-curve tolerance on benchmarks 1-3 reflects the tighter coupling between basis derivative numerics and the boundary-knot choice; benchmarks 4-5 use overall scalars (`overall_acrt`) where the boundary effect averages down to 1%. Library extension toward methodological soundness (avoids extrapolation). +1. **Deviation from R — boundary knots use `range(dose)` not `range(dvals)`** — knots are built once from all treated doses (global, not per-cell) to ensure a common basis across (g,t) cells for aggregation. The evaluation grid is clamped to training-dose boundary knots (`range(dose)`). R's `contdid` v0.1.0 has an inconsistency where `splines2::bSpline(dvals)` uses `range(dvals)` instead of `range(dose)`, which can produce extrapolation artifacts at dose-grid extremes. **Scope caveat:** R cross-language coverage therefore runs at **relative** tolerance bands across two surfaces, NOT bit-exact (`atol=1e-8`) like HAD — `contdid` and ContinuousDiD cannot bit-match on aggregated dose-response or ACRT curves because they use different knot placement; the agreement band reflects the boundary-knot divergence rather than algorithmic drift. (a) **Scalar parity with raw R `cont_did` / `pte_default`** at 1% relative on overall ATT for all 6 benchmarks and on overall ACRT for benchmarks 4-5 (benchmark 6 is event-study, scalar `overall_att` only). (b) **Harmonized boundary-knot-normalized curve parity** with R-side ATT(d) / ACRT(d) reconstructed under `Boundary.knots = range(treated_doses)` (matching the library) on benchmarks 1-3 via the benchmark harness (`_run_r_contdid` does the R-side rebuild at `tests/test_methodology_continuous_did.py:333-367`; `_compare_with_r` orchestrates at `:395-459`) — max ATT(d) at 1% and max ACRT(d) at 2%. 
The slightly wider 2% ACRT(d)-curve tolerance on benchmarks 1-3 reflects the tighter coupling between basis derivative numerics and the boundary-knot choice; benchmarks 4-5 use overall scalars (`overall_acrt`) where the boundary effect averages down to 1%. Library extension toward methodological soundness (avoids extrapolation). 2. **Library extension — `bspline_derivative_design_matrix` derivative-failure warning** — previously swallowed `scipy.interpolate.BSpline.ValueError` in the per-basis derivative loop, leaving affected derivative-matrix columns silently zero. Now aggregates the failed basis indices into a single `UserWarning` naming them. Both ACRT point estimates and analytical/bootstrap inference read the same `dPsi` matrix, so both are biased when this fires — the warning wording makes that explicit. The all-identical-knot degenerate case (single dose value) remains silently handled (mathematically-zero derivatives are correct there). Phase 2 axis-C #12 silent-failures audit fix. No R correspondence; `contdid` v0.1.0 does not implement an equivalent warning. 3. **Library extension — `+inf` → `0` never-treated recoding warns** — the R-style convention of coding never-treated units as `first_treat=+inf` is still accepted and normalized to `first_treat=0` internally, but the estimator now emits a `UserWarning` reporting the row count so the silent recategorization is surfaced. Only `+inf` is recoded (matching the R convention). Any **negative** `first_treat` value (including `-inf`) raises `ValueError` with the row count, since such units would otherwise silently fall out of both the treated (`g > 0`) and never-treated (`g == 0`) masks. Pass `0` directly for never-treated units to avoid the warning. Library extension toward stricter safety; matches the broader Phase 2 axis-E silent-coercion convention. No R correspondence; `contdid` v0.1.0 silently absorbs `+inf` without a signal. 4. 
**Library extension — zero-`first_treat` rows with nonzero `dose` force-zeroed with warning** — never-treated cells must have `D=0` for internal consistency in the dose-response. The estimator now emits a `UserWarning` with the affected row count before the zeroing, so unintended nonzero doses on never-treated rows are no longer absorbed without a signal. Library extension toward stricter safety with no R correspondence — `contdid` v0.1.0 has the same `first_treat = 0` → `D = 0` invariant requirement but silently coerces without a warning; same axis-E silent-coercion lineage as #3. diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 1221db04..012e75c2 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -758,7 +758,7 @@ remain in place — this Deviations block is the canonical AI-review surface per CLAUDE.md "Documenting Deviations (AI Review Compatibility)" labels.* -1. **Deviation from R:** `range(dose)` vs `range(dvals)` boundary knots — the library uses `range(dose)` (training-dose range) for B-spline boundary knots; R's `contdid` v0.1.0 uses `range(dvals)` via `splines2::bSpline(dvals)`, which can produce extrapolation artifacts at dose-grid extremes. **Scope caveat:** R cross-language coverage therefore runs at **relative** tolerance bands across two surfaces: (a) **scalar parity with raw R `cont_did` / `pte_default`** at 1% relative on overall ATT for all 6 benchmarks and on overall ACRT for benchmarks 4-5; (b) **harmonized boundary-knot-normalized curve parity** with R-side ATT(d) / ACRT(d) reconstructed under `Boundary.knots = range(treated_doses)` (matching the library) on benchmarks 1-3 via the `_compare_with_r` helper at `tests/test_methodology_continuous_did.py:395-459` — max ATT(d) at 1% and max ACRT(d) at 2%. Benchmark 6 is event-study, scalar `overall_att` only. NOT bit-exact (`atol=1e-8`) like HAD. Library extension toward methodological soundness (avoids extrapolation). 
Cross-references the § Edge Cases "Boundary knots" bullet above and `METHODOLOGY_REVIEW.md` § ContinuousDiD Deviations #1.
+1. **Deviation from R:** `range(dose)` vs `range(dvals)` boundary knots — the library uses `range(dose)` (training-dose range) for B-spline boundary knots; R's `contdid` v0.1.0 uses `range(dvals)` via `splines2::bSpline(dvals)`, which can produce extrapolation artifacts at dose-grid extremes. **Scope caveat:** R cross-language coverage therefore runs at **relative** tolerance bands across two surfaces: (a) **scalar parity with raw R `cont_did` / `pte_default`** at 1% relative on overall ATT for all 6 benchmarks and on overall ACRT for benchmarks 4-5; (b) **harmonized boundary-knot-normalized curve parity** with R-side ATT(d) / ACRT(d) reconstructed under `Boundary.knots = range(treated_doses)` (matching the library) on benchmarks 1-3 via the benchmark harness — `_run_r_contdid` does the R-side rebuild at `tests/test_methodology_continuous_did.py:333-367`, and `_compare_with_r` orchestrates the Python-vs-R comparison at `:395-459` — max ATT(d) at 1% and max ACRT(d) at 2%. Benchmark 6 is event-study, scalar `overall_att` only. NOT bit-exact (`atol=1e-8`) like HAD. Library extension toward methodological soundness (avoids extrapolation). Cross-references the § Edge Cases "Boundary knots" bullet above and `METHODOLOGY_REVIEW.md` § ContinuousDiD Deviations #1.
 2. **Note:** `bspline_derivative_design_matrix` derivative-failure `UserWarning` — Phase 2 axis-C #12 silent-failures audit fix. No R correspondence; `contdid` v0.1.0 does not implement an equivalent warning. Cross-references the § Edge Cases `**Note:**` bullet above (L745) and `METHODOLOGY_REVIEW.md` § ContinuousDiD Deviations #2. Locked in `tests/test_continuous_did.py::TestBSplineDerivativeDegenerateBasis` (3 tests); source-level aggregate-warning block at `diff_diff/continuous_did_bspline.py:150-187`.
 3. **Note:** `+inf` → `0` never-treated recoding emits `UserWarning` reporting the affected row count; negative `first_treat` (including `-inf`) raises `ValueError`. Axis-E silent-coercion fix per Phase 2 audit. No R correspondence; `contdid` v0.1.0 silently absorbs `+inf` without a signal. Cross-references the § Implementation Checklist `**Note:**` below and `METHODOLOGY_REVIEW.md` § ContinuousDiD Deviations #3.
 4. **Note:** Zero-`first_treat` rows with nonzero `dose` are force-zeroed with `UserWarning` reporting the affected row count (axis-E silent-coercion). No R correspondence; `contdid` v0.1.0 has the same `first_treat = 0` → `D = 0` invariant but silently coerces without a warning. Cross-references the § Implementation Checklist `**Note:**` below and `METHODOLOGY_REVIEW.md` § ContinuousDiD Deviations #4.

From 2685f32c544be004c554ea5a790b7e6061229337 Mon Sep 17 00:00:00 2001
From: igerber
Date: Wed, 20 May 2026 16:45:28 -0400
Subject: [PATCH 5/6] Address CI codex R1 P3s on PR-ContinuousDiD: survey-test attribution + stale (L745) ref
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two informational P3s from CI codex R1, both narrow doc fixes:

1. **Documentation/Tests** — METHODOLOGY_REVIEW.md Test Coverage block attributed survey-design coverage to tests/test_continuous_did.py, but the ContinuousDiD survey tests actually live in tests/test_survey_phase3.py::TestContinuousDiDSurvey (L653-705 analytical SE + bootstrap; L1368-1407 event-study + rejection paths) and tests/test_survey_phase6.py (L1230-1244 replicate + n_bootstrap rejection; L1548-1610 positive-weight-gate cell skipping). test_continuous_did.py has zero survey-flagged tests (grep confirmed). Split the coverage summary into core estimator tests vs. survey-specific tests and cited the correct files.

2. **Maintainability** — REGISTRY Deviations entry #2 hardcoded "(L745)" referring to the § Edge Cases bspline_derivative note. The L745 ref would drift on the next nearby edit, weakening the "canonical AI-review surface" claim. Replaced with a stable textual cross-reference ("the § Edge Cases **Note:** bullet above (bspline_derivative_design_matrix entry)").

Co-Authored-By: Claude Opus 4.7
---
 METHODOLOGY_REVIEW.md        | 3 ++-
 docs/methodology/REGISTRY.md | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md
index 62d4e71e..4ed2b872 100644
--- a/METHODOLOGY_REVIEW.md
+++ b/METHODOLOGY_REVIEW.md
@@ -657,7 +657,8 @@ and covariate-adjusted specifications.)
 **Test Coverage:**
 - 15 methodology tests in `tests/test_methodology_continuous_did.py` (5 classes: 4 + 1 + 2 + 2 + 6); the R-benchmark class (6 tests) skips if R / `contdid` v0.1.0 is not installed via `_check_r_contdid()` guard.
-- 80 unit tests in `tests/test_continuous_did.py` (1,530 LoC) covering bootstrap, survey design, IF-based SE, anticipation, rank-deficient cells, and result-class field contracts.
+- 80 core unit tests in `tests/test_continuous_did.py` (1,530 LoC) covering the B-spline basis (`TestBSplineBasis`, `TestBSplineDerivativeDegenerateBasis`), bootstrap, IF-based analytical SE, anticipation, rank-deficient cells, dose grid, dvals/grid validation, `+inf` recoding, zero-dose coercion, and result-class field contracts. **Survey-design coverage is NOT in this file** — it lives in the dedicated survey suites (next bullet).
+- ContinuousDiD survey-design tests: `tests/test_survey_phase3.py::TestContinuousDiDSurvey` (`tests/test_survey_phase3.py:653-705` analytical SE + bootstrap; `:1368-1407` event-study aggregation + survey-design rejection paths) and `tests/test_survey_phase6.py` (`:1230-1244` replicate-weight + n_bootstrap rejection; `:1548-1610` positive-weight-gate cell skipping).
 - R cross-language coverage at relative tolerance (NOT bit-exact — see Deviations § "Boundary knots") on 6 benchmark configurations across two surfaces: (a) **scalar parity with raw R `cont_did` / `pte_default`** — all 6 assert overall ATT at `< 0.01` (1%); benchmarks 4-5 also assert overall ACRT at `< 0.01` inline; benchmark 6 is event-study mode with scalar `overall_att` only (binarized ATT, no per-horizon and no ACRT comparison). Per-horizon `event_study_effects` is exercised by Python-side tests at `tests/test_continuous_did.py:557-690` and `:1500-1528`. (b) **harmonized boundary-knot-normalized curve parity** with R-side ATT(d) / ACRT(d) reconstructed under `Boundary.knots = range(treated_doses)` (matching the library) on benchmarks 1-3 via the benchmark harness (`_run_r_contdid` does the R-side rebuild at `tests/test_methodology_continuous_did.py:333-367`; `_compare_with_r` orchestrates at `:395-459`) — max ATT(d) at `< 0.01` and max ACRT(d) at `< 0.02`. Surface (a) is direct raw-package parity; surface (b) is reconstructed-basis parity because raw `contdid` curves use `range(dvals)`.
 - Documentation: `docs/methodology/continuous-did.md` (14,885 bytes theory note covering PT vs SPT, estimands, B-spline OLS, multiplier bootstrap).

diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
index 012e75c2..cf62fa27 100644
--- a/docs/methodology/REGISTRY.md
+++ b/docs/methodology/REGISTRY.md
@@ -759,7 +759,7 @@ surface per CLAUDE.md "Documenting Deviations (AI Review Compatibility)" labels.*
 1. **Deviation from R:** `range(dose)` vs `range(dvals)` boundary knots — the library uses `range(dose)` (training-dose range) for B-spline boundary knots; R's `contdid` v0.1.0 uses `range(dvals)` via `splines2::bSpline(dvals)`, which can produce extrapolation artifacts at dose-grid extremes. **Scope caveat:** R cross-language coverage therefore runs at **relative** tolerance bands across two surfaces: (a) **scalar parity with raw R `cont_did` / `pte_default`** at 1% relative on overall ATT for all 6 benchmarks and on overall ACRT for benchmarks 4-5; (b) **harmonized boundary-knot-normalized curve parity** with R-side ATT(d) / ACRT(d) reconstructed under `Boundary.knots = range(treated_doses)` (matching the library) on benchmarks 1-3 via the benchmark harness — `_run_r_contdid` does the R-side rebuild at `tests/test_methodology_continuous_did.py:333-367`, and `_compare_with_r` orchestrates the Python-vs-R comparison at `:395-459` — max ATT(d) at 1% and max ACRT(d) at 2%. Benchmark 6 is event-study, scalar `overall_att` only. NOT bit-exact (`atol=1e-8`) like HAD. Library extension toward methodological soundness (avoids extrapolation). Cross-references the § Edge Cases "Boundary knots" bullet above and `METHODOLOGY_REVIEW.md` § ContinuousDiD Deviations #1.
-2. **Note:** `bspline_derivative_design_matrix` derivative-failure `UserWarning` — Phase 2 axis-C #12 silent-failures audit fix. No R correspondence; `contdid` v0.1.0 does not implement an equivalent warning. Cross-references the § Edge Cases `**Note:**` bullet above (L745) and `METHODOLOGY_REVIEW.md` § ContinuousDiD Deviations #2. Locked in `tests/test_continuous_did.py::TestBSplineDerivativeDegenerateBasis` (3 tests); source-level aggregate-warning block at `diff_diff/continuous_did_bspline.py:150-187`.
+2. **Note:** `bspline_derivative_design_matrix` derivative-failure `UserWarning` — Phase 2 axis-C #12 silent-failures audit fix. No R correspondence; `contdid` v0.1.0 does not implement an equivalent warning. Cross-references the § Edge Cases `**Note:**` bullet above (`bspline_derivative_design_matrix` entry) and `METHODOLOGY_REVIEW.md` § ContinuousDiD Deviations #2. Locked in `tests/test_continuous_did.py::TestBSplineDerivativeDegenerateBasis` (3 tests); source-level aggregate-warning block at `diff_diff/continuous_did_bspline.py:150-187`.
 3. **Note:** `+inf` → `0` never-treated recoding emits `UserWarning` reporting the affected row count; negative `first_treat` (including `-inf`) raises `ValueError`. Axis-E silent-coercion fix per Phase 2 audit. No R correspondence; `contdid` v0.1.0 silently absorbs `+inf` without a signal. Cross-references the § Implementation Checklist `**Note:**` below and `METHODOLOGY_REVIEW.md` § ContinuousDiD Deviations #3.
 4. **Note:** Zero-`first_treat` rows with nonzero `dose` are force-zeroed with `UserWarning` reporting the affected row count (axis-E silent-coercion). No R correspondence; `contdid` v0.1.0 has the same `first_treat = 0` → `D = 0` invariant but silently coerces without a warning. Cross-references the § Implementation Checklist `**Note:**` below and `METHODOLOGY_REVIEW.md` § ContinuousDiD Deviations #4.

From 14ccc9290648ea080a29ba6a73950ee401a19b25 Mon Sep 17 00:00:00 2001
From: igerber
Date: Wed, 20 May 2026 16:52:11 -0400
Subject: [PATCH 6/6] Address CI codex R2 P3 on PR-ContinuousDiD: prune stale In-Progress queue + example refs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

One informational P3 from CI codex R2 — METHODOLOGY_REVIEW.md still described ContinuousDiD as "In Progress" in two surrounding surfaces even after the status-table flip, creating conflicting status signals. Fixed both sites:

1. L27 explanatory paragraph: removed the ContinuousDiD example from the In Progress band's "has methodology file but no paper review" illustration (it's now Complete).

2. L1289-1292 Priority Order queue: removed entry #9 (ContinuousDiD) and renumbered the remaining queue.

Retroactive fix per feedback_changelog_accuracy_fixes (CI review catching one factual error in the queue means scanning for the same mistake): PR #473 promoted HeterogeneousAdoptionDiD to Complete but left entry #6 (HAD) in the same In Progress queue. Removed HAD's entry too and renumbered, so the queue is now self-consistent with the status table for all Complete entries.

Co-Authored-By: Claude Opus 4.7
---
 METHODOLOGY_REVIEW.md | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md
index 4ed2b872..a1bec076 100644
--- a/METHODOLOGY_REVIEW.md
+++ b/METHODOLOGY_REVIEW.md
@@ -24,7 +24,7 @@ A **Complete** entry has a documented review pass against the primary academic s
 The catalog grew incrementally over several quarters, so formats vary across the existing Complete entries; the consistent invariant is that someone walked through the implementation against the academic source and captured the result here. New reviews going forward should aim for the fuller structure (Verified Components + Corrections Made + Deviations + dedicated methodology test file) used by the more recent entries.
-**In Progress** entries have a REGISTRY.md section and unit-test coverage, but no formal walk-through has been captured here yet. The In Progress band is wide — some entries also have some combination of a paper review (primary or companion), a dedicated methodology test file, and R parity fixtures (e.g., DCDH has a methodology file, R parity, and a companion-paper review for the 2026 universal-rollout extension; ContinuousDiD has the methodology file but no paper review); others have only the REGISTRY entry and unit tests (e.g., PowerAnalysis). The "Documentation in place" sub-section enumerates what each entry already has; the "Outstanding for promotion" sub-section enumerates what's still needed to flip it to Complete.
+**In Progress** entries have a REGISTRY.md section and unit-test coverage, but no formal walk-through has been captured here yet. The In Progress band is wide — some entries also have some combination of a paper review (primary or companion), a dedicated methodology test file, and R parity fixtures (e.g., DCDH has a methodology file, R parity, and a companion-paper review for the 2026 universal-rollout extension); others have only the REGISTRY entry and unit tests (e.g., PowerAnalysis). The "Documentation in place" sub-section enumerates what each entry already has; the "Outstanding for promotion" sub-section enumerates what's still needed to flip it to Complete.
 **Not Started** entries have neither a tracker walk-through nor an REGISTRY.md section. This tracker no longer carries any Not Started rows; new estimators are expected to enter as In Progress when their REGISTRY entry lands.
@@ -1286,14 +1286,12 @@ Promotion priority for the **In Progress** entries, ordered by what's blocked on
 **Consolidation-pass-blocked (already has paper review or methodology file or R parity; mostly Verified Components walk-through):**
-6. **HeterogeneousAdoptionDiD (HAD)** — largest current surface, Phase 4.5 just shipped; shares the de Chaisemartin (2026) paper review with DCDH; needs a dedicated Verified Components block.
-7. **ChaisemartinDHaultfoeuille (DCDH)** — methodology test file + 24 R parity tests + 347 unit tests + a companion-paper review for the 2026 universal-rollout extension. Primary-source reviews for the 2020 AER and 2022/2024 NBER WP 29873 papers are still outstanding alongside the Verified Components walk-through.
-8. **WooldridgeDiD (ETWFE)** — companion-paper review (Wooldridge 2023 nonlinear extension) merged in PR #443; primary-source review for Wooldridge (2025) ETWFE not yet on file, and no dedicated methodology test file.
-9. **ContinuousDiD** — 15 methodology tests already in place; mostly a consolidation pass with a documented boundary-knots deviation from R `contdid` v0.1.0.
-10. **TROP** — paper review recently merged (PR #443); needs methodology file and cross-language anchor (when paper-author reference becomes available).
-11. **StaggeredTripleDifference** — shares the primary paper (Ortiz-Villavicencio & Sant'Anna 2025) with TripleDifference, but no dedicated paper review on file yet; needs R parity (R fixtures gitignored — tracked in TODO.md, PR #245).
-12. **ConleySpatialHAC** — paper review + committed R `conleyreg` goldens; needs dedicated methodology test file + summary R-parity table in this tracker.
-13. **Survey Data Support** — cross-cutting feature; promotion requires the per-estimator integration paths to be locked down first.
+6. **ChaisemartinDHaultfoeuille (DCDH)** — methodology test file + 24 R parity tests + 347 unit tests + a companion-paper review for the 2026 universal-rollout extension. Primary-source reviews for the 2020 AER and 2022/2024 NBER WP 29873 papers are still outstanding alongside the Verified Components walk-through.
+7. **WooldridgeDiD (ETWFE)** — companion-paper review (Wooldridge 2023 nonlinear extension) merged in PR #443; primary-source review for Wooldridge (2025) ETWFE not yet on file, and no dedicated methodology test file.
+8. **TROP** — paper review recently merged (PR #443); needs methodology file and cross-language anchor (when paper-author reference becomes available).
+9. **StaggeredTripleDifference** — shares the primary paper (Ortiz-Villavicencio & Sant'Anna 2025) with TripleDifference, but no dedicated paper review on file yet; needs R parity (R fixtures gitignored — tracked in TODO.md, PR #245).
+10. **ConleySpatialHAC** — paper review + committed R `conleyreg` goldens; needs dedicated methodology test file + summary R-parity table in this tracker.
+11. **Survey Data Support** — cross-cutting feature; promotion requires the per-estimator integration paths to be locked down first.
 ---