Skip to content

PowerAnalysis methodology review (PR-A): Bloom 1995 + Burlig 2020 source audits#506

Merged
igerber merged 1 commit into
mainfrom
feature/power-analysis-paper-review
May 31, 2026
Merged

PowerAnalysis methodology review (PR-A): Bloom 1995 + Burlig 2020 source audits#506
igerber merged 1 commit into
mainfrom
feature/power-analysis-paper-review

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 31, 2026

Summary

  • PR-A (paper-review) of the PowerAnalysis methodology validation — the Step-1 fidelity artifact of the locked 2-PR sequence. Adds Bloom (1995) and Burlig, Preonas & Woerman (2020) source audits under docs/methodology/papers/, sourced only from the papers.
  • Documents (under review — does NOT yet fix) four discrepancies between the authoritative PowerAnalysis surfaces and the source material; reconciliation is deferred to PR-B and tracked in TODO.md.
  • No estimator logic, formula, or test changes. (One class-docstring panel-variance display was corrected from a self-canceling factor to the form _compute_variance already implements — docstring only.)

Methodology references (required if estimator / math changes)

  • Method name(s): PowerAnalysis (analytical MDE / power / sample size; simulation-based power)
  • Paper / source link(s): Bloom (1995) https://doi.org/10.1177/0193841X9501900504 ; Burlig, Preonas & Woerman (2020) https://doi.org/10.1016/j.jdeveco.2020.102458 (reviewed via the open NBER WP 26250)
  • Any intentional deviations from the source (and why): This PR documents (does not resolve) the deviations, via a reviewer-recognized REGISTRY **Note:** + power.py docstring notes + a references.rst clarification, with reconciliation tracked in TODO.md for PR-B:
    1. the analytical path uses the normal (z) approximation (faithful to Bloom, who derives the MDE multiplier from the normal), while Burlig's Eq. 1 uses the t distribution;
    2. the REGISTRY SE formula's 1/√(1−R²) and cluster-size m terms are not what power.py implements;
    3. the displayed sample-size formula omits the T(1−T) allocation factor the code applies;
    4. the panel (1+(T−1)ρ)/T factor is an equicorrelated/Moulton design effect, not Burlig's serial-correlation-robust (SCR) variance (Eq. 2) — an attribution overclaim pending PR-B re-attribution or implementation.

Validation

  • Tests added/updated: None — this is a documentation-only (paper-review) PR. Source validation, equation-anchored methodology tests, and R parity (pcpanel for the panel path; a normal-based reference for the analytical path) are deferred to PR-B.
  • Backtest / simulation / notebook evidence (if applicable): N/A

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

@github-actions
Copy link
Copy Markdown

Overall Assessment

✅ Looks good

Executive Summary

  • No executable estimator or inference logic changed; the diff is limited to documentation, docstrings, and a tracked follow-up in TODO.md.
  • The PR correctly turns four pre-existing PowerAnalysis mismatches into documented, tracked deviations in docs/methodology/REGISTRY.md:3207-3220, diff_diff/power.py:1232-1236, docs/references.rst:246-248, and TODO.md:77.
  • The corrected panel-variance docstring now matches the implemented formula in _compute_variance() at diff_diff/power.py:1226-1228 and diff_diff/power.py:1331-1343.
  • No new empty-result, NaN/Inf inference, control-group, or parameter-propagation risk is introduced because no estimator code path changed.
  • One residual documentation inconsistency remains: the REGISTRY R-equivalents table still points PowerAnalysis to pwr.t.test(), which conflicts with the new Bloom audit and the code’s normal-based critical values.

Methodology

  • Severity: P3-informational. Impact: The PR documents the existing PowerAnalysis deviations instead of leaving them implicit; under the review policy, these are mitigated because they are now explicitly noted in the Methodology Registry and tracked in TODO.md. Concrete fix: None required in this PR; reconcile or re-attribute in PR-B. References: docs/methodology/REGISTRY.md:3207-3220, diff_diff/power.py:1232-1236, docs/references.rst:246-248, TODO.md:77.

Code Quality

  • No findings. The only code-file change is documentation text, and the corrected panel formula matches the implementation at diff_diff/power.py:1226-1228 and diff_diff/power.py:1331-1343.

Performance

  • No findings. No runtime logic changed.

Maintainability

  • No findings beyond the documentation inconsistency noted below.

Tech Debt

  • No findings. The only new deferred work introduced by this PR is explicitly tracked at TODO.md:77.

Security

  • No findings. The diff adds only documentation/TODO content and no new execution or secret-handling surface.

Documentation/Tests

  • Severity: P2. Impact: docs/methodology/REGISTRY.md:3364 still maps PowerAnalysis to pwr.t.test(), but the new Bloom audit says the analytical path is normal-based and that pwr::pwr.t.test() is not the faithful parity target, consistent with _get_critical_values() using stats.norm.ppf. That leaves the methodology docs internally inconsistent on the recommended analytical reference. Concrete fix: Update the PowerAnalysis row to a normal-based analytical reference or annotate the row as under review until PR-B settles the contract. References: docs/methodology/REGISTRY.md:3364, docs/methodology/papers/bloom-1995-review.md:128-131, diff_diff/power.py:1273-1280.
  • Severity: P3-informational. Impact: No automated tests changed. That is acceptable for a docs-only PR, but the new audits do not by themselves add executable assurance for the eventual PR-B reconciliation. Concrete fix: In PR-B, add focused analytical tests for the chosen MDE/sample-size formulas and the selected panel-path parity target. I did not run tests for this review.

@igerber igerber force-pushed the feature/power-analysis-paper-review branch from f70b9d6 to aa64e26 Compare May 31, 2026 15:25
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: aa64e2675b635304e55d779da41ba6b782ba5e3b


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The previous re-review P2 on PowerAnalysis is addressed: the methodology docs now explicitly mark the analytical path as under review, document the four source mismatches, and track the follow-up in TODO.md (docs/methodology/REGISTRY.md:L3207-L3220, docs/methodology/REGISTRY.md:L3364, diff_diff/power.py:L1232-L1236, docs/references.rst:L246-L248, TODO.md:L77-L77).
  • The diff is not limited to PowerAnalysis documentation; it also rolls back vcov_type="conley" support on SunAbraham and WooldridgeDiD, and rewrites the StackedDiD Conley deferral (diff_diff/sun_abraham.py:L520-L555, diff_diff/wooldridge.py:L429-L441, diff_diff/stacked_did.py:L203-L216).
  • [Newly identified] SunAbraham.set_params(vcov_type="conley") still accepts the now-forbidden value, so sklearn-style mutation bypasses the new constructor guard and falls into a lower-level Conley code path that this PR removed (diff_diff/sun_abraham.py:L543-L555, diff_diff/sun_abraham.py:L1441-L1450, diff_diff/sun_abraham.py:L2150-L2160).
  • WooldridgeDiD propagates the rollback correctly by revalidating vcov_type in set_params(), so the unmitigated contract regression is specific to SunAbraham (diff_diff/wooldridge.py:L429-L441, diff_diff/wooldridge.py:L487-L521).
  • I did not run the test suite in this environment; numpy is unavailable, so this review is from static inspection.

Methodology

  • Severity: P1 [Newly identified]. Impact: the new SunAbraham variance-family contract is inconsistent across public entry points. __init__ now rejects vcov_type="conley", but set_params() still blindly mutates self.vcov_type without revalidation. Because the old estimator-level Conley front-door and Conley argument threading were removed, a caller can still put the estimator into the forbidden "conley" state via sklearn-style mutation and then reach _fit_saturated_regression() / LinearRegression(vcov_type="conley") without any conley_* inputs. That is exactly the changed-surface propagation problem this repo flags as a P1 anti-pattern. Concrete fix: factor the constructor validation into a shared helper and call it from set_params() before mutation, mirroring WooldridgeDiD.set_params(). Add a regression test that failed mutation leaves vcov_type unchanged. References: diff_diff/sun_abraham.py:L543-L555, diff_diff/sun_abraham.py:L1441-L1450, diff_diff/sun_abraham.py:L2150-L2160, diff_diff/wooldridge.py:L487-L521, tests/test_sun_abraham.py:L1656-L1661.
  • Severity: P3-informational. Impact: the PowerAnalysis source mismatches are now explicitly documented and tracked, so under the review policy they are mitigated rather than blocking. Concrete fix: none required in this PR; PR-B can reconcile or re-attribute the formulas. References: docs/methodology/REGISTRY.md:L3207-L3220, diff_diff/power.py:L1232-L1236, TODO.md:L77-L77.

Code Quality

  • No additional findings beyond the SunAbraham.set_params() contract regression above.

Performance

  • No findings. The estimator changes are removals/deferrals, and the PowerAnalysis changes are documentation-only.

Maintainability

  • Severity: P3-informational. Impact: the StackedDiD Conley deferral is now described as a simple threading follow-up and “same shape as the SunAbraham conley follow-up,” which drops the prior methodology caveat about duplicated units across stacked sub-experiments at distance 0. The feature is still rejected, so this does not change runtime numbers, but it does understate the unresolved design issue for the eventual follow-up PR. Concrete fix: restore a short methodology caveat in the rejection message / REGISTRY / TODO row instead of framing the work as pure plumbing. References: diff_diff/stacked_did.py:L107-L109, diff_diff/stacked_did.py:L208-L216, docs/methodology/REGISTRY.md:L1473-L1478, TODO.md:L107-L109.

Tech Debt

  • No new untracked tech debt. The PowerAnalysis discrepancies are properly tracked in TODO.md, which is the correct mitigation for this docs-first audit (TODO.md:L77-L77).

Security

  • No findings.

Documentation/Tests

  • No separate blocker beyond the missing SunAbraham.set_params() regression noted above. The changed tests only assert constructor-time rejection for the rolled-back Conley surfaces, not mutation-time rejection (tests/test_sun_abraham.py:L1625-L1661, tests/test_wooldridge.py:L1755-L1757).
  • I did not execute the test suite in this environment because numpy is unavailable.

Path to Approval

  1. Make SunAbraham.set_params() re-run the same vcov_type validation as __init__ before mutating state, so vcov_type="conley" is rejected consistently across both public entry points.
  2. Add regression tests that SunAbraham().set_params(vcov_type="conley") and SunAbraham().set_params(vcov_type="foo") both raise and leave self.vcov_type unchanged.

…rce audits

Step-1 fidelity artifact for the PowerAnalysis methodology validation (the paper-review
PR of the 2-PR sequence; source validation + code reconciliation + tests are deferred to
PR-B). No estimator logic, formula, or test changes.

New source audits under docs/methodology/papers/ (sourced only from the papers):
- bloom-1995-review.md: the MDE-multiplier framing — MDE = M*SE with M from the NORMAL
  distribution (Bloom builds 1.65 + 0.84 = 2.49 from z-quantiles, p.548-549), the
  cross-sectional impact-estimator SE sigma*sqrt((1-R^2)/(n*T*(1-T))) (Eq. 1/2), the
  T(1-T) allocation factor (optimal at 50/50), one- and two-sided multipliers; documents
  that Bloom explicitly excludes clustering/multi-site design effects (Note 1).
- burlig-preonas-woerman-2020-review.md: the serial-correlation-robust (SCR) panel-DD
  variance (Eq. 2, verbatim — three covariance terms psi^B/psi^A/psi^X over m pre- and r
  post-periods, psi^X entering negatively), the McKenzie special case (Eq. 3), the
  increase-MDE condition (Eq. 4); Eq. 1 uses the t-distribution; pcpanel is the panel
  parity reference.

The audits surfaced discrepancies between the authoritative PowerAnalysis surfaces and the
source material. Per the agreed approach these are DOCUMENTED as under-review now (not yet
fixed — reconciliation is deferred to PR-B and tracked in TODO.md):
- REGISTRY.md ## PowerAnalysis: umbrella **Note:** enumerating the four discrepancies
  (t-vs-normal-z multiplier; SE R^2/cluster-m terms; missing T(1-T) allocation factor in the
  displayed sample-size formula; panel (1+(T-1)rho)/T is an equicorrelated/Moulton design
  effect, NOT Burlig SCR — an attribution overclaim).
- REGISTRY.md R-equivalents table: annotate the PowerAnalysis row as under-review (analytical
  path is normal-based, so pwr.t.test is not the faithful parity target; panel parity ref is
  Stata pcpanel) — resolves the cross-reference inconsistency the audits introduced.
- power.py: docstring notes on PowerAnalysis and simulate_power flagging the panel attribution
  and normal-vs-t approximation as under review; the class docstring panel-variance display
  corrected from a self-canceling factor to the implemented
  (1/N_treated + 1/N_control) * (1+(T-1)rho)/T (docstring-only; no logic change).
- references.rst: clarify the analytical panel path uses an equicorrelated approximation,
  not Burlig's SCR formula.
- TODO.md: tracker row (Methodology/Correctness) for the PR-B reconciliation.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
@igerber igerber force-pushed the feature/power-analysis-paper-review branch from aa64e26 to 2470be6 Compare May 31, 2026 15:32
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 2470be69608d9cf018d21fc4aef17deabe444e0e


Overall Assessment

✅ Looks good

Executive Summary

  • The current diff is scoped to PowerAnalysis documentation, methodology registry text, source-audit notes, and TODO.md; I did not find any estimator logic, weighting, variance, inference, or default-behavior changes in the executable code paths. The only diff_diff/power.py edits are docstring updates at diff_diff/power.py:L1222-L1236 and diff_diff/power.py:L1902-L1915.
  • Cross-checking the implementation confirms the four documented discrepancies are real: the analytical path uses normal critical values via stats.norm.ppf, the basic DiD variance is 2*sigma^2*(1/n_T+1/n_C), required-N uses the allocation factor treat_frac*(1-treat_frac), and the panel path uses a Moulton-style ((1 + (T-1)*rho) / T) factor rather than Burlig’s SCR variance. See diff_diff/power.py:L1273-L1350 and diff_diff/power.py:L1530-L1579.
  • Those discrepancies are now explicitly marked with a **Note:** in the authoritative registry and tracked in TODO.md, so under the review policy they are mitigated P3-informational items rather than blockers. See docs/methodology/REGISTRY.md:L3214-L3227 and TODO.md:L75-L77.
  • The class docstring’s panel-variance display now matches the implementation, and the R-equivalence/reference text now clearly says the analytical panel path is under review rather than presenting Burlig SCR or pwr.t.test() as a faithful parity target. See diff_diff/power.py:L1226-L1228, docs/methodology/REGISTRY.md:L3371-L3371, and docs/references.rst:L246-L248.
  • The prior AI-review SunAbraham P1 does not carry forward on this head: current code and tests still treat vcov_type="conley" as supported, so that rollback-specific regression is not part of this diff. See diff_diff/sun_abraham.py:L581-L585, diff_diff/sun_abraham.py:L2251-L2262, tests/test_sun_abraham.py:L1657-L1667, diff_diff/wooldridge.py:L447-L450, and tests/test_wooldridge.py:L1755-L1762.

Methodology

  • Severity: P3-informational. Impact: The PR correctly documents four existing PowerAnalysis source-material mismatches in the authoritative REGISTRY and tracks the follow-up in TODO.md: Bloom/Burlig t vs implemented z, the registry SE expression vs the implemented basic-DiD variance, the missing allocation factor in the displayed sample-size formula, and the analytical panel path’s Moulton approximation vs Burlig SCR. Because these are now explicitly labeled with **Note:** and tracked for PR-B, they are mitigated under the stated review policy, not P1 blockers. Concrete fix: None required in PR-A; reconcile or re-attribute in PR-B. References: docs/methodology/REGISTRY.md:L3214-L3227, TODO.md:L75-L77, docs/methodology/papers/bloom-1995-review.md:L122-L150, docs/methodology/papers/burlig-preonas-woerman-2020-review.md:L125-L150, diff_diff/power.py:L1273-L1350, diff_diff/power.py:L1530-L1579.

Code Quality

No findings. The docstring correction in diff_diff/power.py aligns the displayed panel formula with _compute_variance() instead of introducing new behavioral changes. See diff_diff/power.py:L1226-L1228 and diff_diff/power.py:L1331-L1343.

Performance

No findings. The diff is documentation-only for runtime purposes.

Maintainability

No findings. The new paper-review files are explicitly marked non-authoritative, and the follow-up ownership is clear in TODO.md, which keeps this docs-first PR from silently becoming the final methodology contract. See docs/methodology/papers/bloom-1995-review.md:L13-L18, docs/methodology/papers/burlig-preonas-woerman-2020-review.md:L13-L18, and TODO.md:L75-L77.

Tech Debt

No findings. The deferred reconciliation is properly tracked under “Tech Debt from Code Reviews,” so the remaining methodology cleanup is visible and non-blocking for this PR. See TODO.md:L73-L77.

Security

No findings.

Documentation/Tests

No findings in the changed diff. I did not run tests because this environment is missing pytest and numpy, so this review is based on static inspection only.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 31, 2026
@igerber igerber merged commit fe20a9a into main May 31, 2026
33 of 34 checks passed
@igerber igerber deleted the feature/power-analysis-paper-review branch May 31, 2026 16:55
igerber added a commit that referenced this pull request May 31, 2026
… Eq.2 equicorrelated; tracker -> Complete

Reconciles diff_diff/power.py with the Bloom (1995) + Burlig, Preonas & Woerman
(2020) source audits (paper reviews added in PR-A #506). Behavior change: the
analytical panel-DiD variance was the Moulton (1+(T-1)rho)/T factor (wrong
period-scaling, ~4x too small at rho=0/m=r=5, AND opposite rho-sign). Replaced
with the within-unit equicorrelated special case of Burlig Eq. 2,
sigma^2 (1/n_T+1/n_C)(1/m+1/r)(1-rho), so within-unit correlation now LOWERS the
MDE. The MDE multiplier stays the normal-z Bloom multiplier (documented as a
deliberate large-sample approximation to Burlig's t).

- power.py: equicorrelated variance in _compute_variance + _compute_required_n;
  input validation for ALL designs (n_pre>=1, n_post>=1, rho in [-1/(T-1), 1))
  enforced BEFORE the 2x2-vs-panel router, so invalid two-period shapes no longer
  fall through silently; the (1-rho) factor applies at T=2 too (Burlig footnote
  11, the m=r=1 case), so rho is never silently ignored and rho=0 recovers Bloom's
  2*sigma^2; docstrings rewritten; PR-A under-review notes removed.
- REGISTRY ## PowerAnalysis equation block rewritten (z not t; unified
  equicorrelated SE with the 2x2 as the m=r=1 special case; cluster-m and
  inverted-R^2 terms removed; both reference surfaces; checklist ticked).
- New tests/test_methodology_power.py (Bloom Table 1; 2x2 + panel closed forms;
  literal-equicorrelated Monte-Carlo; sample_size<->mde round-trip; input-guard +
  rho-at-T=2 + compute_* wrapper validation; base-R qnorm parity).
- benchmarks/R/generate_power_golden.R + benchmarks/data/r_power_golden.json.
- tests/test_power.py: inverted test_icc_effect + test_extreme_icc to Burlig's sign.
- references.rst: + Frison & Pocock (1992), McKenzie (2012) lineage.
- docs/tutorials/06_power_analysis.ipynb: corrected rho cells + summary.
- METHODOLOGY_REVIEW.md row -> Complete; TODO row removed; CHANGELOG.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
igerber added a commit that referenced this pull request May 31, 2026
… Eq.2 equicorrelated; tracker -> Complete

Reconciles diff_diff/power.py with the Bloom (1995) + Burlig, Preonas & Woerman
(2020) source audits (paper reviews added in PR-A #506). Behavior change: the
analytical panel-DiD variance was the Moulton (1+(T-1)rho)/T factor (wrong
period-scaling, ~4x too small at rho=0/m=r=5, AND opposite rho-sign). Replaced
with the within-unit equicorrelated special case of Burlig Eq. 2,
sigma^2 (1/n_T+1/n_C)(1/m+1/r)(1-rho), so within-unit correlation now LOWERS the
MDE. The MDE multiplier stays the normal-z Bloom multiplier (documented as a
deliberate large-sample approximation to Burlig's t).

- power.py: equicorrelated variance in _compute_variance + _compute_required_n;
  input validation for ALL designs (n_pre>=1, n_post>=1, rho in [-1/(T-1), 1))
  enforced BEFORE the 2x2-vs-panel router, so invalid two-period shapes no longer
  fall through silently; the (1-rho) factor applies at T=2 too (Burlig footnote
  11, the m=r=1 case), so rho is never silently ignored and rho=0 recovers Bloom's
  2*sigma^2; docstrings rewritten; PR-A under-review notes removed.
- REGISTRY ## PowerAnalysis equation block rewritten (z not t; unified
  equicorrelated SE with the 2x2 as the m=r=1 special case; cluster-m and
  inverted-R^2 terms removed; both reference surfaces; checklist ticked).
- New tests/test_methodology_power.py (Bloom Table 1; 2x2 + panel closed forms;
  literal-equicorrelated Monte-Carlo; sample_size<->mde round-trip; input-guard +
  rho-at-T=2 + compute_* wrapper validation; base-R qnorm parity).
- benchmarks/R/generate_power_golden.R + benchmarks/data/r_power_golden.json.
- tests/test_power.py: inverted test_icc_effect + test_extreme_icc to Burlig's sign.
- references.rst: + Frison & Pocock (1992), McKenzie (2012) lineage.
- docs/tutorials/06_power_analysis.ipynb: corrected rho cells + summary.
- METHODOLOGY_REVIEW.md row -> Complete; TODO row removed; CHANGELOG.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
@igerber igerber mentioned this pull request Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant