diff --git a/CHANGELOG.md b/CHANGELOG.md
index 5a28e1bc..f73eeef7 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 - **New estimator: `SyntheticControl` — classic Synthetic Control Method (Abadie, Diamond & Hainmueller 2010; Abadie & Gardeazabal 2003).** Standalone estimator (`diff_diff/synthetic_control.py`) + `SyntheticControlResults` (`diff_diff/synthetic_control_results.py`) + `synthetic_control()` convenience function, exported from `diff_diff`. Builds a single treated unit's counterfactual as a convex combination of never-treated donor units — **donor (unit) weights only**, no time weights or ridge, distinct from `SyntheticDiD`. The inner simplex-constrained weighted-LS solve `W*(V)` reuses `utils._sc_weight_fw` (folding `V^½` into the predictor matrix, `intercept=False`, `zeta=0`); the diagonal predictor-importance matrix `V` is selected data-driven by minimizing pre-period outcome MSPE (`v_method="nested"`, softmax-on-simplex multistart Nelder-Mead + Powell polish) or supplied by the user (`v_method="custom"`). Predictors are built from `predictors`/`predictor_window`/`predictors_op`, `special_predictors`, and per-period outcome lags (`pre_period_outcomes`), in the R `Synth::dataprep` row order; per-row standardization (SD over donors+treated, ddof=1) matches the R `Synth::synth` source. Reports the gap path (`α̂_1t = Y_1t − Σ_j w_j Y_jt`), `att` (mean post-period gap), `pre_rmspe`, donor weights, `v_weights`, and a predictor-balance table. **No analytical standard error** — `se`/`t_stat`/`p_value`/`conf_int` are NaN (in-space placebo permutation inference with the post/pre RMSPE-ratio statistic is planned for a follow-up release; `_placebo_gaps`/`_rmspe_ratio`/`_fit_snapshot` are reserved on the results object). Ten validation gates baked in: predictor-period leakage, absorbing post-period suffix + no-anticipation cross-check against the treatment column, post-period canonicalization, donor-pool filtering before period derivation, empty-window rejection, poor-pre-fit `UserWarning` (RMSPE > SD of treated pre-outcomes), duplicate-predictor-label rejection, inner-solve non-convergence warning, order-independent gap-path rebuild, and the `standardize="none"` deviation; plus fail-closed `custom_v` cross-field rules and degenerate single-donor / single-pre-period handling. **R-`Synth` parity** (`tests/test_methodology_synthetic_control.py`, fixtures generated by `benchmarks/R/generate_synth_basque_golden.R` into `tests/data/`): two-tier on the Basque Country study — Tier-1 feeds R's `solution.v` via `custom_v` and reproduces the published donor weights (region 10 Cataluña 0.851 + region 14 Madrid 0.149) to `atol=1e-3` deterministically; Tier-2 (`@pytest.mark.slow`) checks the data-driven nested fit lands in a tolerance band (the nested `V` legitimately differs because the outer objective uses all pre periods, not R's `time.optimize.ssr` window). Documented in `docs/methodology/REGISTRY.md` §SyntheticControl (with `**Deviation from R:** standardize="none"` and `**Note:**` labels for the standardization formula, objective window, softmax `V` parametrization, and 1×SD poor-fit threshold), `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`.
+- **StaggeredTripleDifference methodology-review-tracker promotion: In Progress → Complete**, plus a new opt-in Eq-4.14 overall ATT. Closes the Ortiz-Villavicencio & Sant'Anna (2025, arXiv:2505.09942v3) primary-source review on the tracker (PR-A #499 added the paper review on file; this PR validates the source against it). New paper-equation-anchored Verified Components in `tests/test_methodology_staggered_triple_diff.py` (Theorem 4.1 / Eq. 4.5 RA=IPW=DR identification; Eq. 4.1 three-term DDD decomposition; Eqs. 4.11-4.12 optimal-GMM weight normalization + single-group reduction; Eq. 4.13 event-study cohort-share weighting; Eq. 4.14 / Cor. 4.2 overall) alongside the existing R cross-validation against `triplediff::ddd(panel=TRUE)` + `agg_ddd()`. **New feature — opt-in `overall_att_es` (paper Eq. 4.14 overall):** the unweighted mean of the post-treatment event-study effects ES(e), exposed on `StaggeredTripleDiffResults` (with `overall_se_es` / `overall_t_stat_es` / `overall_p_value_es` / `overall_conf_int_es`) and populated only when `aggregate="event_study"` / `"all"`. The default `overall_att` is unchanged (the Callaway-Sant'Anna simple post-treatment (g,t) average — the library-wide convention). Its analytical SE is the influence function of that mean (the average of the per-event-time combined IFs, routed through the same survey-aware variance estimator as the per-e effects via a new `_se_from_psi` helper); a multiplier-bootstrap SE replaces it under `n_bootstrap>0`. Computed via a side-channel stash on the shared `CallawaySantAnnaAggregationMixin._aggregate_event_study` (no return-signature change; CallawaySantAnna unaffected), over post-treatment `e >= -anticipation` (the library convention, matching `overall_att`). Cross-validated against R `agg_ddd(type="eventstudy")$overall.att` / `overall.se` (SE matches to ~0.1%). REGISTRY `## StaggeredTripleDifference`: the previously-unlabeled overall-aggregation prose is formalized under a `**Note:**` documenting both overalls, and the duplicate aggregation-weight deviation is consolidated (fixing a `P(G=g)` vs R `P(S=g)` mislabel). `METHODOLOGY_REVIEW.md` row L69 promoted to **Complete** (`Last Review = 2026-05-30`) with a Verified Components / R Comparison Results detail block; priority queue pruned. `docs/references.rst` Ortiz-Villavicencio entry pinned to arXiv:2505.09942v3.
 - **ConleySpatialHAC methodology-review-tracker promotion: In Progress → Complete.** Closes the Conley (1999) *Journal of Econometrics* 92(1) primary-source review on the methodology-review tracker. The paper review on file at `docs/methodology/papers/conley-1999-review.md` was previously merged (2026-05-09); this PR is the F.L.I.P. consolidation — new `tests/test_methodology_conley.py` with paper-equation-numbered Verified Components walk-through (~1600 LoC; 10 classes; 60 tests, 5 of them `@pytest.mark.slow`). Coverage: Eq. 4.2 cross-sectional sandwich (pairwise-distance specialization; the project's paper review identifies Eq. 4.2 page 18 as the real-valued/pairwise form, with Eq. 3.13 reserved for the lattice-indexed form), Eq. 4.2 HC0 + rank-1 limits, Andrews (1991) HAC lag truncation matching `conleyreg::time_dist.cpp`, haversine convention with Earth radius 6371.01 km, Phase 2 panel block-decomposed sandwich at `atol=1e-12`, sparse k-d-tree dense-vs-sparse bit-identity (Wave A #120 numerical correctness), and R `conleyreg` v0.1.9 parity at `atol=1e-6` on 6 fixtures (3 cross-sectional + 3 panel) plus the sparse-forced and time-asymmetric kernel parity contracts. Three dedicated deviations-area classes: `TestConleyLibraryExtensions` (Wave A library extensions — combined spatial+cluster product kernel #119, callable conley_metric validation #123, sparse k-d-tree activation #120, indefiniteness guard), `TestConleyDeviationsFromR` (1-D radial Bartlett vs paper's 2-D separable Eq. 3.14, time-label normalization via `np.unique`, independent temporal kernel deferred), and `TestConleyDeferrals` (5 fail-closed `NotImplementedError`/`TypeError` contracts: LinearRegression + survey_design, DiD/MPD/TWFE + survey_design, Conley + weights, SyntheticDiD + Conley, wild_bootstrap + Conley). Methodology-anchored tests extracted from `tests/test_conley_vcov.py`: full classes `TestConleyDirectHelper`, `TestConleyReductions`, `TestConleyReductionsAddendum`, `TestConleyParityR`, `TestConleyParitySpacetime`, `TestConleyPanelHelper`, `TestConleySparseRParityForced`; plus methodology-anchored tests from `TestConleyKernels`, `TestConleyDistanceMetrics`, `TestConleySparse`. File drops 4248 → 3113 lines after extraction. Defensive surface preserved: input validation, NaN/inf guards, dispatch-level validity, estimator-level integration smoke tests, set_params atomicity, sparse-path activation thresholds + density-gate fallback. `METHODOLOGY_REVIEW.md` row L91 promoted to **Complete** with `Last Review = 2026-05-26`; detail block rewritten with Verified Components / Test Coverage / R Comparison Results inline table / Corrections Made / Deviations / Outstanding Concerns. Priority queue at L1386 pruned: PreTrendsPower removed (already Complete since 2026-05-19) and ConleySpatialHAC removed (this PR); substantive-review-blocked renumbered #2-#5 → #1-#4 and consolidation-pass-blocked renumbered #6-#8 → #5-#6.
 
 ### Added / Changed
diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md
index 88678d52..fd0a8b5d 100644
--- a/METHODOLOGY_REVIEW.md
+++ b/METHODOLOGY_REVIEW.md
@@ -66,7 +66,7 @@ The catalog grew incrementally over several quarters, so formats vary across the
 | Estimator | Module | R Reference | Status | Last Review |
 |-----------|--------|-------------|--------|-------------|
 | TripleDifference | `triple_diff.py` | `triplediff::ddd()` | **Complete** | 2026-02-18 |
-| StaggeredTripleDifference | `staggered_triple_diff.py` | `triplediff::ddd(panel=TRUE)` + `agg_ddd()` | **In Progress** | — |
+| StaggeredTripleDifference | `staggered_triple_diff.py` | `triplediff::ddd(panel=TRUE)` + `agg_ddd()` | **Complete** | 2026-05-30 |
 
 ### Counterfactual / Synthetic Estimators
 
@@ -940,21 +940,39 @@ These three are feature deferrals (paper-supported extensions that the library h
 | Module | `staggered_triple_diff.py`, `staggered_triple_diff_results.py` |
 | Primary Reference | Ortiz-Villavicencio & Sant'Anna (2025) — same paper as TripleDifference, staggered case |
 | R Reference | `triplediff::ddd(panel=TRUE)` + `agg_ddd()` (per `benchmarks/R/benchmark_staggered_triplediff.R`) |
-| Status | **In Progress** |
-| Last Review | — |
+| Status | **Complete** |
+| Last Review | 2026-05-30 |
 
 **Documentation in place:**
+- Paper review: `docs/methodology/papers/ortiz-villavicencio-santanna-2025-review.md` (full-paper, equal-depth, arXiv:2505.09942v3; shared primary source with TripleDifference) — PR #499
 - REGISTRY.md section: `## StaggeredTripleDifference` (per-cohort comparisons against three sub-groups, DR/RA/IPW per component, GMM-optimal closed-form inverse-variance weighting, event-study via CS mixin, IF-based SEs, multiplier bootstrap for simultaneous bands, survey support)
-- `tests/test_methodology_staggered_triple_diff.py`: 6 tests across 3 classes (never-treated comparison, not-yet-treated comparison, aggregation)
-- Dedicated unit-test suite: `tests/test_staggered_triple_diff.py` (~680 lines, full coverage of DR/RA/IPW paths, both control-group modes, GMM weighting, event-study aggregation, edge cases)
-- Survey-specific: `tests/test_survey_staggered_ddd.py`
-
-**Outstanding for promotion:**
-- Paper review under `docs/methodology/papers/` covering Ortiz-Villavicencio & Sant'Anna (2025) for the staggered case (the primary paper is shared with TripleDifference, but no dedicated review file exists on disk yet)
-- R parity validation against `triplediff::ddd(panel=TRUE)` + `agg_ddd()` (per `benchmarks/R/benchmark_staggered_triplediff.R`) — CSV fixtures not committed (gitignored); tests skip without local R + `triplediff` (tracked in TODO.md row, PR #245)
-- Per-cohort group-effect SE convention: implementation includes WIF (conservative vs R's `wif=NULL`); documented in REGISTRY, deferred decision on whether to add an opt-in WIF-disable path (tracked in TODO.md row, PR #245)
-- Formal Verified Components walk-through here
-- Cluster-robust analytical SEs accepted but not wired (deferred per REGISTRY)
+- `tests/test_methodology_staggered_triple_diff.py`: R cross-validation (group-time ATT/SE, both control groups) + paper-equation-anchored Verified Components (below)
+- Dedicated unit-test suite: `tests/test_staggered_triple_diff.py` (full coverage of DR/RA/IPW paths, both control-group modes, GMM weighting, event-study aggregation, edge cases)
+- Survey-specific: `tests/test_survey_staggered_ddd.py` (incl. the Eq-4.14 overall under survey weighting)
+
+**Verified Components (validated against the paper + R):**
+- **Identification (Theorem 4.1 / Eq. 4.5):** RA = IPW = DR coincide without covariates.
+- **Three-term DDD decomposition (Eq. 4.1):** post-treatment ATT(g,t) recover a known constant effect.
+- **GMM combination (Eqs. 4.11-4.12):** optimal weights sum to one; a single comparison group reduces to `w=[1]`.
+- **Event study (Eq. 4.13):** ES(e) equals the eligible-treated cohort-share-weighted average of ATT(g, g+e).
+- **Overall (Eq. 4.14 / Cor. 4.2):** opt-in `overall_att_es` = unweighted mean of post-treatment ES(e), cross-validated against R `agg_ddd(type="eventstudy")$overall.att`/`overall.se`.
+
+**R Comparison Results** (`benchmarks/R/benchmark_staggered_triplediff.R`; `triplediff::ddd(panel=TRUE)` + `agg_ddd()`; CSV fixtures gitignored / regenerated on-the-fly, JSON golden committed):
+
+| Quantity | Tolerance | Observed agreement |
+|----------|-----------|--------------------|
+| Group-time ATT(g,t) | rtol 0.1% | exact |
+| Group-time SE(g,t) | rtol 1% | matches |
+| Event-study ES(e) | rtol 25% | within (per-e eligible-treated weighting deviation) |
+| Overall ATT, Eq. 4.14 (`overall_att_es`) | rtol 10% | ≤5% (weighting deviation averages out in the mean) |
+| Overall SE, Eq. 4.14 (`overall_se_es`) | rtol 3% | ≤0.5% |
+
+The paper-equation-anchored Verified Components above are deterministic and run without R.
+The R cross-validation in this table runs only when local `R` + `triplediff` are available
+(it skips otherwise — the fixtures are gitignored); making those fixtures deterministic in
+CI and extending covariate-adjusted R parity are tracked follow-ups in `TODO.md`.
+
+**Documented deviations (verified non-masking; REGISTRY `## StaggeredTripleDifference`):** comparison-cohort admissibility (matches R `triplediff`, base-period/anticipation-aware; paper uses `g_c > max(g,t)`); aggregation weights `P(S=g,Q=1)` (matches paper Eq. 4.13 where `G_i` is defined only for `Q=1`, not R's `P(S=g)`) — drives the 25% aggregation tolerance; per-cohort group-effect WIF (conservative vs R `wif=NULL`); default `overall_att` is the CS-simple post-treatment average (paper Eq. 4.14 available opt-in as `overall_att_es`); cluster-robust analytical SEs accepted-but-deferred (multiplier bootstrap provides unit-level clustering).
 
 ---
 
@@ -1396,8 +1414,7 @@ Promotion priority for the **In Progress** entries, ordered by what's blocked on
 
 **Consolidation-pass-blocked (already has paper review or methodology file or R parity; mostly Verified Components walk-through):**
 
-5. **StaggeredTripleDifference** — shares the primary paper (Ortiz-Villavicencio & Sant'Anna 2025) with TripleDifference, but no dedicated paper review on file yet; needs R parity (R fixtures gitignored — tracked in TODO.md, PR #245).
-6. **Survey Data Support** — cross-cutting feature; promotion requires the per-estimator integration paths to be locked down first.
+5. **Survey Data Support** — cross-cutting feature; promotion requires the per-estimator integration paths to be locked down first.
 
 ---
 
diff --git a/diff_diff/staggered_aggregation.py b/diff_diff/staggered_aggregation.py
index d90139d4..dee32c34 100644
--- a/diff_diff/staggered_aggregation.py
+++ b/diff_diff/staggered_aggregation.py
@@ -443,7 +443,7 @@ def _compute_aggregated_se_with_wif(
         unit: str,
         precomputed: Optional["PrecomputedData"] = None,
         return_psi: bool = False,
-    ) -> "Union[float, Tuple[float, np.ndarray]]":
+    ) -> "Union[Tuple[float, Optional[int]], Tuple[float, np.ndarray, Optional[int]]]":
         """
         Compute SE with weight influence function (wif) adjustment.
 
@@ -458,6 +458,15 @@ def _compute_aggregated_se_with_wif(
         Formula (matching R's did::aggte):
             agg_inf_i = Σ_k w_k × inf_i_k + wif_i × ATT_k
             se = sqrt(mean(agg_inf^2) / n)
+
+        Returns
+        -------
+        ``(se, effective_df)`` when ``return_psi=False``; ``(se, psi_total,
+        effective_df)`` when ``return_psi=True``. This 2-tuple / 3-tuple arity is
+        held on EVERY branch — including the empty-IF (``se=0.0``) and non-finite-IF
+        (``se=NaN``) early returns — so callers that unpack two or three values fail
+        soft instead of raising on degenerate influence functions. ``effective_df``
+        is non-None only for replicate designs that dropped replicates.
         """
         # Extract global unit info for correct pg = n_g / N_total scaling.
         # Without this, the local path builds the unit set from only units in
@@ -486,15 +495,37 @@ def _compute_aggregated_se_with_wif(
             n_global_units=n_global_units,
         )
 
+        # Consistent return arity across ALL branches: return_psi=True -> 3-tuple
+        # (se, psi, effective_df); return_psi=False -> 2-tuple (se, effective_df).
+        # The empty / non-finite-IF branches must match so callers that unpack three
+        # values (``_aggregate_event_study``) or two (``_aggregate_simple``) fail soft
+        # (NaN SE) instead of raising on degenerate IF/WIF edge cases.
         if len(psi_total) == 0:
-            return (0.0, psi_total) if return_psi else 0.0
+            return (0.0, psi_total, None) if return_psi else (0.0, None)
 
         # Check for NaN propagation from non-finite WIF
         if not np.all(np.isfinite(psi_total)):
-            return (np.nan, psi_total) if return_psi else np.nan
+            return (np.nan, psi_total, None) if return_psi else (np.nan, None)
 
-        # Use design-based variance when full survey design is available
-        # Use unit-level resolved survey (panel IF is indexed by unit, not obs)
+        se, effective_df = self._se_from_psi(psi_total, precomputed)
+        if return_psi:
+            return (se, psi_total, effective_df)
+        return (se, effective_df)
+
+    def _se_from_psi(
+        self,
+        psi_total: np.ndarray,
+        precomputed: Optional["PrecomputedData"] = None,
+    ) -> "Tuple[float, Optional[int]]":
+        """Standard error (and per-statistic effective df) from a combined IF vector.
+
+        Routes a finite, non-empty influence-function vector through the same
+        variance estimator the per-event-time and simple-aggregation SE paths use:
+        replicate-weight variance, full survey-design variance, or the simple
+        ``sqrt(sum(psi^2))``. Callers must guard emptiness/finiteness first.
+        Returns ``(se, effective_df)``; ``effective_df`` is non-None only for
+        replicate designs that dropped replicates.
+        """
         resolved_survey = (
             precomputed.get("resolved_survey_unit") if precomputed is not None else None
         )
@@ -514,9 +545,7 @@ def _compute_aggregated_se_with_wif(
                 se = np.nan
             else:
                 se = np.sqrt(max(variance, 0.0))
-            if return_psi:
-                return (se, psi_total, effective_df)
-            return (se, effective_df)
+            return se, effective_df
 
         if resolved_survey is not None and (
             resolved_survey.strata is not None
@@ -530,15 +559,10 @@ def _compute_aggregated_se_with_wif(
                 se = np.nan
             else:
                 se = np.sqrt(max(variance, 0.0))
-            if return_psi:
-                return (se, psi_total, None)
-            return (se, None)
+            return se, None
 
         variance = np.sum(psi_total**2)
-        se = np.sqrt(variance)
-        if return_psi:
-            return (se, psi_total, None)
-        return (se, None)
+        return np.sqrt(variance), None
 
     def _aggregate_event_study(
         self,
@@ -671,6 +695,10 @@ def _aggregate_event_study(
             _psi_vectors.append(psi_e)
             _psi_event_times.append(e)
 
+        # Reset the Eq. (4.14) overall before any early return so a reused estimator
+        # instance never reads a stale value from a prior fit.
+        self._event_study_overall = None
+
         # Batch inference for all relative periods
         if not agg_effects_list:
             return {}
@@ -769,6 +797,44 @@ def _aggregate_event_study(
         # Attach VCV to self for CallawaySantAnna to pick up
         self._event_study_vcov = event_study_vcov
 
+        # Eq. (4.14) overall ATT: the unweighted mean of the post-treatment
+        # event-study effects ES(e). Stashed on self (mirroring _event_study_vcov)
+        # so the StaggeredTripleDifference estimator can expose it as overall_att_es;
+        # CallawaySantAnna leaves it unread. Post-treatment is the library predicate
+        # e >= -anticipation (matching _aggregate_simple and the default overall_att),
+        # NOT a hardcoded e >= 0.
+        #
+        # The POINT ESTIMATE averages EVERY finite post-treatment ES(e) effect (read
+        # from event_study_effects by event-time key), so it is always the true Eq.
+        # 4.14 average -- it must NOT be silently restricted to horizons with a finite
+        # influence function. The SE is the influence function of that mean (the
+        # average of the per-event-time combined IFs, via the same survey-aware
+        # variance routine as the per-e effects). If any contributing horizon lacks a
+        # finite, well-formed combined IF (a finite ES(e) can have a non-finite
+        # WIF/IF, which the per-e path already surfaces as a NaN SE), the combined IF
+        # for the mean is undefined: the SE is NaN while the point estimate is
+        # retained, and the consumer (fit) warns and NaN-propagates the inference.
+        post_e = [
+            e
+            for e in event_study_effects
+            if e >= -self.anticipation and np.isfinite(event_study_effects[e]["effect"])
+        ]
+        if post_e:
+            att_es = float(np.mean([event_study_effects[e]["effect"] for e in post_e]))
+            psi_by_e = {e: psi for e, psi in zip(_psi_event_times, _psi_vectors)}
+            psis = [psi_by_e.get(e) for e in post_e]
+            se_es: float = np.nan
+            eff_df_es: Optional[int] = None
+            if all(p is not None and len(p) > 0 and np.all(np.isfinite(p)) for p in psis):
+                if len({len(p) for p in psis}) == 1:
+                    psi_es = np.column_stack(psis).mean(axis=1)
+                    se_es, eff_df_es = self._se_from_psi(psi_es, precomputed)
+            self._event_study_overall = {
+                "att": att_es,
+                "se": float(se_es),
+                "effective_df": eff_df_es,
+            }
+
         return event_study_effects
 
     def _aggregate_by_group(
diff --git a/diff_diff/staggered_bootstrap.py b/diff_diff/staggered_bootstrap.py
index a97d6ed4..ba057b22 100644
--- a/diff_diff/staggered_bootstrap.py
+++ b/diff_diff/staggered_bootstrap.py
@@ -81,6 +81,12 @@ class CSBootstrapResults:
         Bootstrap p-values for group effects.
     bootstrap_distribution : Optional[np.ndarray]
         Full bootstrap distribution of overall ATT (if requested).
+    overall_att_es_se : Optional[float]
+        Bootstrap standard error for the paper Eq. 4.14 overall (event-study average).
+    overall_att_es_ci : Optional[Tuple[float, float]]
+        Bootstrap confidence interval for the Eq. 4.14 overall.
+    overall_att_es_p_value : Optional[float]
+        Bootstrap p-value for the Eq. 4.14 overall.
     """
 
     n_bootstrap: int
@@ -100,6 +106,10 @@ class CSBootstrapResults:
     group_effect_p_values: Optional[Dict[Any, float]] = None
     bootstrap_distribution: Optional[np.ndarray] = field(default=None, repr=False)
     cband_crit_value: Optional[float] = None
+    # Paper Eq. (4.14) overall (event-study average) bootstrap inference.
+    overall_att_es_se: Optional[float] = None
+    overall_att_es_ci: Optional[Tuple[float, float]] = None
+    overall_att_es_p_value: Optional[float] = None
 
 
 # =============================================================================
@@ -485,8 +495,18 @@ def _run_multiplier_bootstrap(
         event_study_ses = None
         event_study_cis = None
         event_study_p_values = None
-
-        if bootstrap_event_study is not None and event_study_info is not None:
+        # Paper Eq. (4.14) overall (event-study average) bootstrap inference; stays
+        # NaN unless event-study draws exist. Mirrors the analytical overall_att_es.
+        overall_att_es_se = np.nan
+        overall_att_es_ci = (np.nan, np.nan)
+        overall_att_es_p_value = np.nan
+
+        # ``rel_periods`` can be empty when balance_e (or an empty event study) leaves
+        # no relative periods; guard the column_stack so the bootstrap mirrors the
+        # analytical empty/NaN surface instead of raising "need at least one array to
+        # concatenate". event_study_ses stays None and the Eq. 4.14 overall bootstrap
+        # stays NaN (the analytical fit path emits the requested-but-undefined warning).
+        if bootstrap_event_study is not None and event_study_info is not None and rel_periods:
             es_effects = np.array([event_study_info[e]["effect"] for e in rel_periods])
             es_boot_matrix = np.column_stack([bootstrap_event_study[e] for e in rel_periods])
             es_ses, es_ci_lo, es_ci_hi, es_pv = _compute_effect_bootstrap_stats_batch_func(
@@ -500,6 +520,34 @@ def _run_multiplier_bootstrap(
             }
             event_study_p_values = {e: float(es_pv[i]) for i, e in enumerate(rel_periods)}
 
+            # Eq. (4.14) overall = unweighted mean of post-treatment ES(e) (e >=
+            # -anticipation, matching post_treatment_mask above). The per-draw mean
+            # over those event times is the bootstrap distribution of the overall.
+            es_post = [
+                e
+                for e in rel_periods
+                if e >= -self.anticipation
+                and e in event_study_info
+                and np.isfinite(event_study_info[e]["effect"])
+            ]
+            if es_post:
+                original_es_overall = float(
+                    np.mean([event_study_info[e]["effect"] for e in es_post])
+                )
+                boot_es_overall = np.column_stack([bootstrap_event_study[e] for e in es_post]).mean(
+                    axis=1
+                )
+                (
+                    overall_att_es_se,
+                    overall_att_es_ci,
+                    overall_att_es_p_value,
+                ) = _compute_effect_bootstrap_stats_func(
+                    original_es_overall,
+                    boot_es_overall,
+                    alpha=self.alpha,
+                    context="overall ATT (event-study average)",
+                )
+
         # Batch compute bootstrap statistics for group effects
         group_effect_ses = None
         group_effect_cis = None
@@ -564,10 +612,11 @@ def _run_multiplier_bootstrap(
             overall_se = np.nan
             overall_ci = (np.nan, np.nan)
             overall_p_value = np.nan
+            overall_att_es_se = np.nan
+            overall_att_es_ci = (np.nan, np.nan)
+            overall_att_es_p_value = np.nan
             gt_ses = {gt: np.nan for gt in gt_ses} if gt_ses else gt_ses
-            gt_cis = (
-                {gt: (np.nan, np.nan) for gt in gt_cis} if gt_cis else gt_cis
-            )
+            gt_cis = {gt: (np.nan, np.nan) for gt in gt_cis} if gt_cis else gt_cis
             gt_p_values = {gt: np.nan for gt in gt_p_values} if gt_p_values else gt_p_values
             if event_study_ses:
                 event_study_ses = {k: np.nan for k in event_study_ses}
@@ -597,6 +646,9 @@ def _run_multiplier_bootstrap(
             group_effect_p_values=group_effect_p_values,
             bootstrap_distribution=bootstrap_overall,
             cband_crit_value=cband_crit_value,
+            overall_att_es_se=overall_att_es_se,
+            overall_att_es_ci=overall_att_es_ci,
+            overall_att_es_p_value=overall_att_es_p_value,
         )
 
     def _prepare_event_study_aggregation(
diff --git a/diff_diff/staggered_triple_diff.py b/diff_diff/staggered_triple_diff.py
index d2c7339d..893b4a3b 100644
--- a/diff_diff/staggered_triple_diff.py
+++ b/diff_diff/staggered_triple_diff.py
@@ -87,7 +87,7 @@ class StaggeredTripleDifference(
     References
     ----------
     Ortiz-Villavicencio, M. & Sant'Anna, P.H.C. (2025). "Better Understanding
-    Triple Differences Estimators." arXiv:2505.09942.
+    Triple Differences Estimators." arXiv:2505.09942v3.
     """
 
     def __init__(
@@ -599,6 +599,12 @@ def fit(
         overall_att, overall_se, overall_effective_df = self._aggregate_simple(
             group_time_effects, influence_func_info, df_agg, unit, precomputed_agg
         )
+        # Preserve the ORIGINAL survey df before the simple-overall statistic mutates
+        # it below. The Eq. 4.14 overall (overall_att_es) must fall back to this
+        # original df, never to the simple overall's per-statistic replicate df — the
+        # two statistics can drop different replicate subsets, so reusing the simple
+        # overall's df would silently give overall_att_es the wrong p-value/CI.
+        df_survey_original = df_survey
         # Use per-statistic effective df from replicate aggregation if available;
         # otherwise fall back to the original df from the survey design.
         if overall_effective_df is not None:
@@ -633,6 +639,55 @@ def fit(
                 unit,
             )
 
+        # Paper Eq. (4.14) overall ATT (event-study average): an opt-in summary
+        # alongside the default CS-simple ``overall_att``. ``_aggregate_event_study``
+        # stashes it on ``self._event_study_overall``; populated only when the
+        # event-study aggregation ran. Analytical inference here; the bootstrap block
+        # below overrides the SE when ``n_bootstrap > 0`` (mirroring ``overall_se``).
+        overall_att_es = None
+        overall_se_es = None
+        overall_t_stat_es = None
+        overall_p_value_es = None
+        overall_conf_int_es = None
+        # Whether the ANALYTICAL Eq. 4.14 SE was non-finite while its point estimate
+        # was finite (i.e. a contributing horizon's influence function was non-finite).
+        # Captured before any bootstrap override so the terminal warning below does not
+        # misdiagnose a bootstrap-side NaN SE (e.g. cluster-unidentified) as an
+        # analytical-IF failure (the bootstrap path emits its own warning).
+        analytical_overall_es_se_nonfinite = False
+        if aggregate in ("event_study", "all"):
+            es_overall = getattr(self, "_event_study_overall", None)
+            if es_overall is not None:
+                overall_att_es = es_overall["att"]
+                overall_se_es = es_overall["se"]
+                analytical_overall_es_se_nonfinite = bool(
+                    np.isfinite(overall_att_es) and not np.isfinite(overall_se_es)
+                )
+                es_eff_df = es_overall.get("effective_df")
+                # Fall back to the ORIGINAL survey df, not the simple-overall's mutated
+                # per-statistic df (P1 fix): overall_att_es has its own replicate df.
+                df_for_es = es_eff_df if es_eff_df is not None else df_survey_original
+                overall_t_stat_es, overall_p_value_es, overall_conf_int_es = safe_inference(
+                    overall_att_es, overall_se_es, alpha=self.alpha, df=df_for_es
+                )
+            else:
+                # Event-study aggregation was requested but yielded no post-treatment
+                # horizon: this is "requested but undefined" -> NaN + warning (the
+                # library's overall-aggregation contract, matching _aggregate_simple),
+                # distinct from "not requested" which leaves the fields None.
+                warnings.warn(
+                    "Event-study aggregation was requested but no post-treatment "
+                    "horizons are available for the Eq. 4.14 overall (overall_att_es); "
+                    "returning NaN.",
+                    UserWarning,
+                    stacklevel=2,
+                )
+                overall_att_es = np.nan
+                overall_se_es = np.nan
+                overall_t_stat_es, overall_p_value_es, overall_conf_int_es = safe_inference(
+                    np.nan, np.nan, alpha=self.alpha, df=df_survey
+                )
+
         # Reject replicate-weight designs for bootstrap — replicate variance
         # is an analytical alternative, not compatible with bootstrap
         if (
@@ -670,6 +725,17 @@ def fit(
                 )
                 overall_conf_int = bootstrap_results.overall_att_ci
                 overall_p_value = bootstrap_results.overall_att_p_value
+
+                # Mirror the override for the Eq. (4.14) event-study-average overall
+                # (only when event-study aggregation produced it). A NaN bootstrap SE
+                # (e.g. cluster-unidentified) correctly NaNs the inference.
+                if overall_att_es is not None and bootstrap_results.overall_att_es_se is not None:
+                    overall_se_es = bootstrap_results.overall_att_es_se
+                    overall_t_stat_es, overall_p_value_es, overall_conf_int_es = safe_inference(
+                        overall_att_es, overall_se_es, alpha=self.alpha, df=df_survey
+                    )
+                    overall_conf_int_es = bootstrap_results.overall_att_es_ci
+                    overall_p_value_es = bootstrap_results.overall_att_es_p_value
                 if bootstrap_results.cband_crit_value is not None:
                     cband_crit_value = bootstrap_results.cband_crit_value
 
@@ -743,6 +809,29 @@ def fit(
                         )
                         group_effects[g_key]["t_stat"] = t_val
 
+        # Eq. 4.14 overall: an ANALYTICAL non-finite SE under a finite point estimate
+        # (a contributing horizon's influence function is non-finite, or the variance is
+        # unidentified — e.g. a single-PSU/cluster design). Surface it — never NaN the SE
+        # silently. Gated on the analytical-origin flag (captured before the bootstrap
+        # override) and the final state still being non-finite: a bootstrap that supplies a
+        # finite SE rescues it (no warning), and a bootstrap that NaNs the SE for unrelated
+        # reasons (e.g. cluster-unidentified) is reported by the bootstrap's own warning,
+        # not misdiagnosed here as an analytical-IF failure.
+        if (
+            analytical_overall_es_se_nonfinite
+            and overall_se_es is not None
+            and not np.isfinite(overall_se_es)
+        ):
+            warnings.warn(
+                "Eq. 4.14 overall (overall_att_es) point estimate is defined but its "
+                "standard error is undefined (NaN): either a contributing post-treatment "
+                "event-study horizon has a non-finite influence function, or the variance "
+                "is unidentified (e.g. a single-PSU/cluster survey design). overall_se_es "
+                "and its inference fields are NaN.",
+                UserWarning,
+                stacklevel=2,
+            )
+
         n_treated_units = int(np.sum((unit_cohorts > 0) & (eligibility_per_unit == 1)))
         n_control_units = n_units - n_treated_units
         n_never_enabled = int(np.sum(unit_cohorts == 0))
@@ -780,6 +869,11 @@ def fit(
             epv_diagnostics=epv_diagnostics if epv_diagnostics else None,
             epv_threshold=self.epv_threshold,
             pscore_fallback=self.pscore_fallback,
+            overall_att_es=overall_att_es,
+            overall_se_es=overall_se_es,
+            overall_t_stat_es=overall_t_stat_es,
+            overall_p_value_es=overall_p_value_es,
+            overall_conf_int_es=overall_conf_int_es,
         )
         self.is_fitted_ = True
         return self.results_
diff --git a/diff_diff/staggered_triple_diff_results.py b/diff_diff/staggered_triple_diff_results.py
index be653e8b..2c7c434f 100644
--- a/diff_diff/staggered_triple_diff_results.py
+++ b/diff_diff/staggered_triple_diff_results.py
@@ -55,6 +55,15 @@ class StaggeredTripleDiffResults:
         Number of eligible units (Q = 1).
     n_ineligible : int
         Number of ineligible units (Q = 0).
+    overall_att_es : float, optional
+        Paper Eq. (4.14) "overall" ATT: the unweighted mean of the post-treatment
+        event-study effects ES(e). Populated only when ``aggregate`` is
+        ``"event_study"`` or ``"all"``; ``None`` otherwise. Distinct from
+        ``overall_att`` (the Callaway-Sant'Anna simple post-treatment (g,t) average,
+        which is the default headline ATT).
+    overall_se_es, overall_t_stat_es, overall_p_value_es, overall_conf_int_es : optional
+        Standard error, t-statistic, p-value, and confidence interval for
+        ``overall_att_es``; ``None`` when ``overall_att_es`` is ``None``.
     """
 
     group_time_effects: Dict[Tuple[Any, Any], Dict[str, Any]]
@@ -94,6 +103,16 @@ class StaggeredTripleDiffResults:
     )
     epv_threshold: float = 10
     pscore_fallback: str = "error"
+    # Paper Eq. (4.14) "overall" ATT: the unweighted mean of the post-treatment
+    # event-study effects ES(e). Populated only when aggregate in {"event_study",
+    # "all"}; None otherwise. Distinct from the default ``overall_att`` (the
+    # Callaway-Sant'Anna simple post-treatment (g,t) average). See REGISTRY
+    # ## StaggeredTripleDifference "Aggregation".
+    overall_att_es: Optional[float] = None
+    overall_se_es: Optional[float] = None
+    overall_t_stat_es: Optional[float] = None
+    overall_p_value_es: Optional[float] = None
+    overall_conf_int_es: Optional[Tuple[float, float]] = None
 
     # --- Inference-field aliases (balance/external-adapter compatibility) ---
     @property
@@ -194,6 +213,28 @@ def summary(self, alpha: Optional[float] = None) -> str:
             ]
         )
 
+        # Paper Eq. (4.14) overall (event-study average), when computed
+        # (aggregate in {"event_study", "all"}). The headline ATT above remains the
+        # Callaway-Sant'Anna simple post-treatment (g,t) average.
+        if (
+            self.overall_att_es is not None
+            and self.overall_se_es is not None
+            and self.overall_conf_int_es is not None
+        ):
+            p_es = self.overall_p_value_es if self.overall_p_value_es is not None else float("nan")
+            t_es = self.overall_t_stat_es if self.overall_t_stat_es is not None else float("nan")
+            sig_es = _get_significance_stars(p_es)
+            lines.extend(
+                [
+                    "",
+                    "Overall ATT (event-study average, paper Eq. 4.14):",
+                    f"{'ATT':<15} {self.overall_att_es:>12.4f} {self.overall_se_es:>12.4f} "
+                    f"{t_es:>10.3f} {p_es:>10.4f} {sig_es:>6}",
+                    f"{conf_level}% Confidence Interval: "
+                    f"[{self.overall_conf_int_es[0]:.4f}, {self.overall_conf_int_es[1]:.4f}]",
+                ]
+            )
+
         cv = self.coef_var
         if np.isfinite(cv):
             lines.append(f"{'CV (SE/abs(ATT)):':<25} {cv:>10.4f}")
@@ -429,6 +470,12 @@ def to_dict(self) -> Dict[str, Any]:
             d["group_effects"] = self.group_effects
         if self.comparison_group_counts is not None:
             d["comparison_group_counts"] = self.comparison_group_counts
+        if self.overall_att_es is not None:
+            d["overall_att_es"] = self.overall_att_es
+            d["overall_se_es"] = self.overall_se_es
+            d["overall_t_stat_es"] = self.overall_t_stat_es
+            d["overall_p_value_es"] = self.overall_p_value_es
+            d["overall_conf_int_es"] = self.overall_conf_int_es
         return d
 
     @property
diff --git a/docs/api/_autosummary/diff_diff.CSBootstrapResults.rst b/docs/api/_autosummary/diff_diff.CSBootstrapResults.rst
index e7a16b54..60b3ca1a 100644
--- a/docs/api/_autosummary/diff_diff.CSBootstrapResults.rst
+++ b/docs/api/_autosummary/diff_diff.CSBootstrapResults.rst
@@ -34,6 +34,9 @@
       ~CSBootstrapResults.overall_att_se
       ~CSBootstrapResults.overall_att_ci
       ~CSBootstrapResults.overall_att_p_value
+      ~CSBootstrapResults.overall_att_es_se
+      ~CSBootstrapResults.overall_att_es_ci
+      ~CSBootstrapResults.overall_att_es_p_value
       ~CSBootstrapResults.group_time_ses
       ~CSBootstrapResults.group_time_cis
       ~CSBootstrapResults.group_time_p_values
diff --git a/docs/api/_autosummary/diff_diff.StaggeredTripleDiffResults.rst b/docs/api/_autosummary/diff_diff.StaggeredTripleDiffResults.rst
index 70bbf478..dbcdbb5a 100644
--- a/docs/api/_autosummary/diff_diff.StaggeredTripleDiffResults.rst
+++ b/docs/api/_autosummary/diff_diff.StaggeredTripleDiffResults.rst
@@ -56,6 +56,11 @@
       ~StaggeredTripleDiffResults.overall_t_stat
       ~StaggeredTripleDiffResults.overall_p_value
       ~StaggeredTripleDiffResults.overall_conf_int
+      ~StaggeredTripleDiffResults.overall_att_es
+      ~StaggeredTripleDiffResults.overall_se_es
+      ~StaggeredTripleDiffResults.overall_t_stat_es
+      ~StaggeredTripleDiffResults.overall_p_value_es
+      ~StaggeredTripleDiffResults.overall_conf_int_es
       ~StaggeredTripleDiffResults.groups
       ~StaggeredTripleDiffResults.time_periods
       ~StaggeredTripleDiffResults.n_obs
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
index 926d0acb..1e81a7dc 100644
--- a/docs/methodology/REGISTRY.md
+++ b/docs/methodology/REGISTRY.md
@@ -2205,10 +2205,13 @@ functions across comparison groups. Minimizes asymptotic variance subject to `su
 Event study (Equation 4.13): cohort-share-weighted average across cohorts for each
 relative time `e = t - g`. Reuses `CallawaySantAnnaAggregationMixin._aggregate_event_study()`.
 
-Overall ATT: cohort-size-weighted average across post-treatment (g,t) pairs.
-Reuses `CallawaySantAnnaAggregationMixin._aggregate_simple()`. Note: this is the
-simple post-treatment aggregation, not the paper's Equation 4.14 (which averages
-over event-study effects).
+Overall ATT: two summaries are available. (1) The default `overall_att` is the
+cohort-size-weighted average across post-treatment (g,t) pairs (reuses
+`CallawaySantAnnaAggregationMixin._aggregate_simple()`) — the library-wide
+Callaway-Sant'Anna "simple" convention, matching R `agg_ddd(type="simple")`.
+(2) The opt-in `overall_att_es` is the paper's Equation 4.14 overall — the unweighted
+mean of the post-treatment event-study effects ES(e) — populated when
+`aggregate="event_study"` or `"all"` (see the labeled Note below).
 
 Group effects: average across post-treatment time periods for each cohort.
 Reuses `CallawaySantAnnaAggregationMixin._aggregate_by_group()`.
@@ -2216,15 +2219,29 @@ Reuses `CallawaySantAnnaAggregationMixin._aggregate_by_group()`.
 All aggregation SEs include the WIF (Weight Influence Function) adjustment for
 uncertainty in cohort-share weights, inherited from the CallawaySantAnna mixin.
 
-- **Deviation from R:** Aggregation weights and WIF use the eligible-treated
-  population `P(S=g, Q=1)` (matching the paper's Eq 4.13, where `G_i` is defined
-  only for `Q=1` units). R's `agg_ddd()` uses `P(S=g)` (all units in the enabling
-  group, including ineligible). This is implemented by setting `unit_cohorts=0` for
-  ineligible units before calling the aggregation mixin.
+- **Deviation from R:** Aggregation weights (and the WIF) use the eligible-treated
+  population `P(S=g, Q=1)` — matching the paper's Eq 4.13, where `G_i` is defined
+  only for `Q=1` units (`G_i = g` iff `S_i = g` and `Q_i = 1`), so the paper's
+  `P(G=g)` *is* `P(S=g, Q=1)`. R's `agg_ddd()` instead weights by `P(S=g)` (all units
+  in the enabling group, including ineligible). Implemented by setting `unit_cohorts=0`
+  for ineligible units before calling the aggregation mixin. Group-time `ATT(g,t)`
+  values are identical to R; only the weighted average across (g,t) pairs differs —
+  this is the source of the larger tolerance in the R cross-validation tests.
 - **Note:** Per-cohort group-effect SEs include WIF via the inherited mixin.
   R's `agg_ddd(type="group")` uses `wif=NULL` for per-cohort aggregation since
   within-cohort weights are fixed. This makes our per-cohort group-effect SEs
   slightly conservative relative to R.
+- **Note:** The default overall ATT (`overall_att`) is the Callaway-Sant'Anna simple
+  post-treatment (g,t) average — the library-wide convention across staggered
+  estimators — and is NOT the paper's Equation 4.14 overall (which averages the
+  event-study effects). The paper's Eq 4.14 form is exposed opt-in as `overall_att_es`
+  (populated only when `aggregate` is `"event_study"`/`"all"`), computed as the
+  unweighted mean of the post-treatment ES(e) over `e >= -anticipation`. Its analytical
+  SE is the influence function of that mean (the average of the per-event-time combined
+  IFs, routed through the same survey-aware variance estimator as the per-e effects);
+  a multiplier-bootstrap SE replaces it when `n_bootstrap > 0`. The two summaries answer
+  different questions and generally differ; `overall_att_es` is cross-validated against
+  R `agg_ddd(type="eventstudy")$overall.att`/`overall.se`.
 
 *Standard errors:*
 
@@ -2267,11 +2284,6 @@ confidence bands (sup-t) for event study.
   (TSL) or `compute_replicate_if_variance()` (replicate weights). Bootstrap
   uses PSU-level multiplier weights. The R `triplediff` package does not
   support survey weights.
-- **Deviation from R:** Event-study and simple aggregation reuse
-  `CallawaySantAnnaAggregationMixin` cohort-size weights (`n_treated` per cohort)
-  instead of R's `agg_ddd()` group-probability weights (`pg = P(G=g)` over all
-  units including ineligible). Group-time ATT(g,t) values are identical; only the
-  weighted average across (g,t) pairs differs.
 
 *Edge cases:*
 - Single comparison group: GMM reduces to w=[1], no matrix inversion
@@ -2283,7 +2295,11 @@ confidence bands (sup-t) for event study.
 - No valid (g,t) pairs at all: raises `ValueError`
 
 **Reference implementation(s):**
-- R `triplediff` (companion package by paper authors) — not yet validated against
+- R `triplediff` (companion package by paper authors) — cross-validated in
+  `tests/test_methodology_staggered_triple_diff.py` (group-time ATT/SE for both control
+  groups, plus the Eq. 4.14 overall `overall_att_es` against `agg_ddd(type="eventstudy")`).
+  CSV fixtures are gitignored and regenerated on-the-fly from
+  `benchmarks/R/benchmark_staggered_triplediff.R`; the JSON golden is committed.
 
 **Requirements checklist:**
 - [x] Panel data with (unit, time, enabling-group S, eligibility Q, outcome Y)
diff --git a/docs/references.rst b/docs/references.rst
index fa54071d..2d5036c1 100644
--- a/docs/references.rst
+++ b/docs/references.rst
@@ -129,7 +129,7 @@ Triply Robust Panel (TROP)
 Triple Difference (DDD)
 -----------------------
 
-- **Ortiz-Villavicencio, M., & Sant'Anna, P. H. C. (2025).** "Better Understanding Triple Differences Estimators." *Working Paper*. https://arxiv.org/abs/2505.09942
+- **Ortiz-Villavicencio, M., & Sant'Anna, P. H. C. (2025).** "Better Understanding Triple Differences Estimators." *arXiv:2505.09942v3*. https://arxiv.org/abs/2505.09942v3
 
   This paper shows that common DDD implementations (taking the difference between two DiDs, or applying three-way fixed effects regressions) are generally invalid when identification requires conditioning on covariates. The ``TripleDifference`` class implements their regression adjustment, inverse probability weighting, and doubly robust estimators.
 
diff --git a/tests/test_methodology_staggered_triple_diff.py b/tests/test_methodology_staggered_triple_diff.py
index 3412358d..6f16fd1d 100644
--- a/tests/test_methodology_staggered_triple_diff.py
+++ b/tests/test_methodology_staggered_triple_diff.py
@@ -15,7 +15,7 @@
 import pandas as pd
 import pytest
 
-from diff_diff import StaggeredTripleDifference
+from diff_diff import StaggeredTripleDifference, generate_staggered_ddd_data
 
 BENCHMARK_DIR = Path(__file__).parent.parent / "benchmarks" / "data" / "synthetic"
 RESULTS_FILE = BENCHMARK_DIR / "staggered_ddd_r_results.json"
@@ -46,7 +46,9 @@ def _r_triplediff_available() -> bool:
     try:
         result = subprocess.run(
             ["Rscript", "-e", "library(triplediff); library(jsonlite); cat('OK')"],
-            capture_output=True, text=True, timeout=30,
+            capture_output=True,
+            text=True,
+            timeout=30,
         )
         return result.returncode == 0 and "OK" in result.stdout
     except (subprocess.TimeoutExpired, FileNotFoundError, OSError):
@@ -57,8 +59,9 @@ def _ensure_csv_fixtures():
     """Generate CSV fixtures via R if they don't exist and R is available."""
     # Check if any CSV is missing
     needed_keys = ["s42_dgp1", "s123_dgp1", "s99_dgp1"]
-    missing = [k for k in needed_keys
-               if not (BENCHMARK_DIR / f"staggered_ddd_data_{k}.csv").exists()]
+    missing = [
+        k for k in needed_keys if not (BENCHMARK_DIR / f"staggered_ddd_data_{k}.csv").exists()
+    ]
     if not missing:
         return True  # All present
 
@@ -69,7 +72,9 @@ def _ensure_csv_fixtures():
     try:
         result = subprocess.run(
             ["Rscript", str(R_SCRIPT)],
-            capture_output=True, text=True, timeout=120,
+            capture_output=True,
+            text=True,
+            timeout=120,
             cwd=str(Path(__file__).parent.parent),
         )
         return result.returncode == 0
@@ -95,10 +100,7 @@ def _load_r_data(seed: int, dgp: int) -> pd.DataFrame:
     """Load the CSV data that R used for a given scenario."""
     csv_path = BENCHMARK_DIR / f"staggered_ddd_data_s{seed}_dgp{dgp}.csv"
     if not csv_path.exists():
-        pytest.skip(
-            f"Data file not found: {csv_path}. "
-            "Requires R + triplediff to generate."
-        )
+        pytest.skip(f"Data file not found: {csv_path}. " "Requires R + triplediff to generate.")
     data = pd.read_csv(csv_path)
     return data.rename(columns=R_TO_PY_COLS)
 
@@ -111,7 +113,12 @@ def _run_python(data, method, control_group):
         base_period="varying",
     )
     return est.fit(
-        data, "outcome", "unit", "period", "first_treat", "eligibility",
+        data,
+        "outcome",
+        "unit",
+        "period",
+        "first_treat",
+        "eligibility",
         aggregate="event_study",
     )
 
@@ -153,16 +160,19 @@ def test_gt_att_matches_r(self, r_results, key):
 
         py_gt = sorted(res.group_time_effects.items())
         r_gt = list(zip(r["gt_groups"], r["gt_periods"]))
-        assert len(py_gt) == len(r["gt_att"]), (
-            f"{key}: Python has {len(py_gt)} GT cells, R has {len(r['gt_att'])}"
-        )
+        assert len(py_gt) == len(
+            r["gt_att"]
+        ), f"{key}: Python has {len(py_gt)} GT cells, R has {len(r['gt_att'])}"
         for i, ((g, t), eff) in enumerate(py_gt):
-            assert (g, t) == (r_gt[i][0], r_gt[i][1]), (
-                f"{key}: GT cell mismatch at index {i}: Python=({g},{t}), R={r_gt[i]}"
-            )
+            assert (g, t) == (
+                r_gt[i][0],
+                r_gt[i][1],
+            ), f"{key}: GT cell mismatch at index {i}: Python=({g},{t}), R={r_gt[i]}"
             _assert_close(
-                eff["effect"], r["gt_att"][i],
-                ATT_RTOL, ATT_ATOL,
+                eff["effect"],
+                r["gt_att"][i],
+                ATT_RTOL,
+                ATT_ATOL,
                 f"{key} ATT(g={g},t={t})",
             )
 
@@ -173,13 +183,15 @@ def test_gt_se_matches_r(self, r_results, key):
         res = _run_python(data, r["est_method"], r["control_group"])
 
         py_gt = sorted(res.group_time_effects.items())
-        assert len(py_gt) == len(r["gt_se"]), (
-            f"{key}: Python has {len(py_gt)} GT cells, R has {len(r['gt_se'])}"
-        )
+        assert len(py_gt) == len(
+            r["gt_se"]
+        ), f"{key}: Python has {len(py_gt)} GT cells, R has {len(r['gt_se'])}"
         for i, ((g, t), eff) in enumerate(py_gt):
             _assert_close(
-                eff["se"], r["gt_se"][i],
-                SE_RTOL, SE_ATOL,
+                eff["se"],
+                r["gt_se"][i],
+                SE_RTOL,
+                SE_ATOL,
                 f"{key} SE(g={g},t={t})",
             )
 
@@ -208,16 +220,19 @@ def test_gt_att_matches_r(self, r_results, key):
 
         py_gt = sorted(res.group_time_effects.items())
         r_gt = list(zip(r["gt_groups"], r["gt_periods"]))
-        assert len(py_gt) == len(r["gt_att"]), (
-            f"{key}: Python has {len(py_gt)} GT cells, R has {len(r['gt_att'])}"
-        )
+        assert len(py_gt) == len(
+            r["gt_att"]
+        ), f"{key}: Python has {len(py_gt)} GT cells, R has {len(r['gt_att'])}"
         for i, ((g, t), eff) in enumerate(py_gt):
-            assert (g, t) == (r_gt[i][0], r_gt[i][1]), (
-                f"{key}: GT cell mismatch at index {i}: Python=({g},{t}), R={r_gt[i]}"
-            )
+            assert (g, t) == (
+                r_gt[i][0],
+                r_gt[i][1],
+            ), f"{key}: GT cell mismatch at index {i}: Python=({g},{t}), R={r_gt[i]}"
             _assert_close(
-                eff["effect"], r["gt_att"][i],
-                ATT_RTOL, ATT_ATOL,
+                eff["effect"],
+                r["gt_att"][i],
+                ATT_RTOL,
+                ATT_ATOL,
                 f"{key} ATT(g={g},t={t})",
             )
 
@@ -228,13 +243,15 @@ def test_gt_se_matches_r(self, r_results, key):
         res = _run_python(data, r["est_method"], r["control_group"])
 
         py_gt = sorted(res.group_time_effects.items())
-        assert len(py_gt) == len(r["gt_se"]), (
-            f"{key}: Python has {len(py_gt)} GT cells, R has {len(r['gt_se'])}"
-        )
+        assert len(py_gt) == len(
+            r["gt_se"]
+        ), f"{key}: Python has {len(py_gt)} GT cells, R has {len(r['gt_se'])}"
         for i, ((g, t), eff) in enumerate(py_gt):
             _assert_close(
-                eff["se"], r["gt_se"][i],
-                SE_RTOL, SE_ATOL,
+                eff["se"],
+                r["gt_se"][i],
+                SE_RTOL,
+                SE_ATOL,
                 f"{key} SE(g={g},t={t})",
             )
 
@@ -247,16 +264,24 @@ def test_gt_se_matches_r(self, r_results, key):
 
 
 class TestStaggeredDDDAggregation:
-    """Cross-validate event study and overall ATT aggregation.
-
-    Note: Aggregation weights differ between R's agg_ddd() and the
-    CallawaySantAnna mixin we reuse. The group-time ATT(g,t) values
-    match R exactly (0.00%), but aggregation introduces small differences
-    due to the weighting scheme. We use 10% tolerance here.
+    """Cross-validate event study and overall ATT aggregation against R.
+
+    The eligible-treated aggregation weighting (REGISTRY deviation) propagates into
+    the individual per-relative-time ES(e) values and the simple overall, so those
+    are checked at a loose 25% envelope (the per-(g,t) ATT(g,t) themselves match R
+    exactly, 0.00%). The Eq. 4.14 overall (``overall_att_es``) is the unweighted MEAN
+    of the post-treatment ES(e), where the weighting deviation largely averages out
+    -- empirically <=5% on ATT and <=0.5% on SE across all benchmark scenarios -- so
+    it is checked at a much tighter, dedicated envelope.
     """
 
-    AGG_RTOL = 0.25  # 25% — aggregation weighting differs from R's agg_ddd
+    AGG_RTOL = 0.25  # per-e ES + simple overall: weighting deviation on individual values
     AGG_ATOL = 3.0
+    # Eq. 4.14 overall_att_es: the weighting deviation averages out in the mean
+    # (max observed across all scenarios: ATT 5.0%, SE 0.4%); tight-but-safe envelope.
+    OVERALL_ES_ATT_RTOL = 0.10
+    OVERALL_ES_SE_RTOL = 0.03
+    OVERALL_ES_ATOL = 0.5
 
     @pytest.mark.parametrize("key", ALL_SCENARIOS)
     def test_event_study_att_close_to_r(self, r_results, key):
@@ -271,8 +296,10 @@ def test_event_study_att_close_to_r(self, r_results, key):
         for e, eff in res.event_study_effects.items():
             if e in r_es:
                 _assert_close(
-                    eff["effect"], r_es[e],
-                    self.AGG_RTOL, self.AGG_ATOL,
+                    eff["effect"],
+                    r_es[e],
+                    self.AGG_RTOL,
+                    self.AGG_ATOL,
                     f"{key} ES ATT(e={e})",
                 )
 
@@ -288,10 +315,325 @@ def test_overall_att_close_to_r(self, r_results, key):
             base_period="varying",
         )
         res = est.fit(
-            data, "outcome", "unit", "period", "first_treat", "eligibility",
+            data,
+            "outcome",
+            "unit",
+            "period",
+            "first_treat",
+            "eligibility",
         )
         _assert_close(
-            res.overall_att, r["overall_att_simple"],
-            self.AGG_RTOL, self.AGG_ATOL,
+            res.overall_att,
+            r["overall_att_simple"],
+            self.AGG_RTOL,
+            self.AGG_ATOL,
             f"{key} overall ATT (simple)",
         )
+
+    @pytest.mark.parametrize("key", ALL_SCENARIOS)
+    def test_overall_att_es_close_to_r(self, r_results, key):
+        """Paper Eq. 4.14 overall (event-study average) vs R agg_ddd(type='eventstudy').
+
+        R's ``agg_ddd(type="eventstudy")$overall.att`` / ``overall.se`` ARE the
+        paper's Eq. 4.14 overall and its analytical SE; the library exposes them
+        opt-in as ``overall_att_es`` / ``overall_se_es``. The AGG_RTOL tolerance
+        reflects the documented eligible-treated weighting (REGISTRY deviation)
+        propagating from ES(e) into the overall -- NOT a new deviation; per-(g,t)
+        ATT(g,t) match R exactly, and empirically the overall agreement is far
+        tighter than AGG_RTOL.
+        """
+        r = r_results[key]
+        r_att = r.get("overall_att_es")
+        if r_att is None or (isinstance(r_att, float) and np.isnan(r_att)):
+            pytest.skip("No event-study overall in R")
+        data = _load_r_data(r["seed"], r["dgp_type"])
+        res = _run_python(data, r["est_method"], r["control_group"])
+        assert res.overall_att_es is not None
+        _assert_close(
+            res.overall_att_es,
+            r_att,
+            self.OVERALL_ES_ATT_RTOL,
+            self.OVERALL_ES_ATOL,
+            f"{key} overall ATT (Eq 4.14, event-study avg)",
+        )
+        r_se = r.get("overall_se_es")
+        if r_se is not None and not (isinstance(r_se, float) and np.isnan(r_se)):
+            _assert_close(
+                res.overall_se_es,
+                r_se,
+                self.OVERALL_ES_SE_RTOL,
+                self.OVERALL_ES_ATOL,
+                f"{key} overall SE (Eq 4.14, event-study avg)",
+            )
+
+
+# ---------------------------------------------------------------------------
+# Paper-equation-anchored Verified Components
+#
+# These validate the implementation against properties the paper STATES
+# (Ortiz-Villavicencio & Sant'Anna 2025, arXiv:2505.09942v3), using synthetic
+# panels from ``generate_staggered_ddd_data`` (DDD-CPT holds by construction).
+# They need no R and run in the pure-Python CI fallback. Paper assumptions:
+# S (random sampling), SO (strong overlap), NA (no anticipation), DDD-CPT
+# (conditional parallel trends).
+# ---------------------------------------------------------------------------
+
+_FIT_COLS = ("outcome", "unit", "period", "first_treat", "eligibility")
+
+
+class TestTheorem41Identification:
+    """Theorem 4.1 / Eq. (4.5): nonparametric ATT(g,t) identification, with the
+    regression-adjustment (RA), IPW, and doubly-robust (DR) estimands all equal.
+
+    Observable implication tested: with no covariates the three ``est_method``
+    paths reduce to the same DiD building block, so ATT(g,t) must coincide.
+    """
+
+    def test_ra_ipw_dr_agree_without_covariates(self):
+        data = generate_staggered_ddd_data(n_units=600, treatment_effect=3.0, seed=11)
+        fits = {
+            m: StaggeredTripleDifference(estimation_method=m, control_group="nevertreated").fit(
+                data, *_FIT_COLS
+            )
+            for m in ("dr", "ipw", "reg")
+        }
+        for k in fits["dr"].group_time_effects:
+            a = fits["dr"].group_time_effects[k]["effect"]
+            b = fits["ipw"].group_time_effects[k]["effect"]
+            c = fits["reg"].group_time_effects[k]["effect"]
+            if np.isfinite(a) and np.isfinite(b) and np.isfinite(c):
+                assert abs(a - b) < 1e-6 and abs(a - c) < 1e-6, (
+                    f"RA=IPW=DR identification (Thm 4.1) violated at {k}: "
+                    f"dr={a:.6f}, ipw={b:.6f}, reg={c:.6f}"
+                )
+
+
+class TestEquation41Decomposition:
+    """Eq. (4.1): the three-term DR DDD estimand (DiD_A + DiD_B - DiD_C).
+
+    Observable implication tested: under a constant treatment effect with
+    DDD-CPT holding (the DGP's eligibility trend is common across enabling
+    groups), post-treatment ATT(g,t) recover the true effect.
+    """
+
+    def test_recovers_constant_effect(self):
+        tau = 2.5
+        data = generate_staggered_ddd_data(n_units=800, treatment_effect=tau, noise_sd=0.3, seed=21)
+        res = StaggeredTripleDifference(estimation_method="dr", control_group="notyettreated").fit(
+            data, *_FIT_COLS, aggregate="event_study"
+        )
+        post = [
+            v["effect"]
+            for (g, t), v in res.group_time_effects.items()
+            if t >= g and np.isfinite(v["effect"])
+        ]
+        assert len(post) > 0
+        assert (
+            abs(float(np.mean(post)) - tau) < 0.3
+        ), f"mean post ATT(g,t)={np.mean(post):.4f} vs true tau={tau}"
+
+
+class TestEquations411412GMM:
+    """Eqs. (4.11)-(4.12): optimal GMM combination across comparison cohorts.
+
+    Verified components: the optimal weights sum to one (Eq. 4.12 normalization
+    ``w = Omega^{-1} 1 / (1' Omega^{-1} 1)``); a single comparison group reduces
+    to ``w=[1]``; multiple not-yet-treated cohorts yield a multi-weight
+    combination. (Variance-minimization optimality is by construction via the
+    ``Omega^{-1}`` weights; ``Omega`` is not exposed for a separate inequality
+    test, so it is not asserted here.)
+    """
+
+    def test_weights_sum_to_one_multi_group(self):
+        data = generate_staggered_ddd_data(
+            n_units=600, cohort_periods=[3, 5, 7], treatment_effect=3.0, seed=31
+        )
+        res = StaggeredTripleDifference(estimation_method="dr", control_group="notyettreated").fit(
+            data, *_FIT_COLS
+        )
+        assert res.gmm_weights is not None
+        multi = {k: w for k, w in res.gmm_weights.items() if len(w) > 1}
+        assert multi, "expected at least one (g,t) with multiple comparison cohorts"
+        for k, w in multi.items():
+            assert (
+                abs(sum(w.values()) - 1.0) < 1e-9
+            ), f"GMM weights at {k} sum to {sum(w.values())}, expected 1.0"
+
+    def test_single_group_reduces_to_unit_weight(self):
+        data = generate_staggered_ddd_data(n_units=400, treatment_effect=3.0, seed=32)
+        res = StaggeredTripleDifference(estimation_method="dr", control_group="nevertreated").fit(
+            data, *_FIT_COLS
+        )
+        assert res.gmm_weights is not None
+        for k, w in res.gmm_weights.items():
+            assert len(w) == 1, f"nevertreated should use one comparison group at {k}, got {w}"
+            assert abs(next(iter(w.values())) - 1.0) < 1e-12
+
+
+class TestEquation413EventStudy:
+    """Eq. (4.13): event-study ES(e) is the eligible-treated-cohort-share-weighted
+    average of ATT(g, g+e) across cohorts at each relative time e.
+    """
+
+    def test_es_is_cohort_share_weighted_average(self):
+        data = generate_staggered_ddd_data(
+            n_units=600,
+            cohort_periods=[3, 5],
+            treatment_effect=3.0,
+            dynamic_effects=True,
+            effect_growth=0.4,
+            seed=41,
+        )
+        res = StaggeredTripleDifference(estimation_method="dr", control_group="nevertreated").fit(
+            data, *_FIT_COLS, aggregate="event_study"
+        )
+        assert res.event_study_effects is not None
+        gt = res.group_time_effects
+        checked_multi = False
+        for e, es in res.event_study_effects.items():
+            contrib = [
+                (g, t) for (g, t) in gt if (t - g) == e and np.isfinite(gt[(g, t)]["effect"])
+            ]
+            if len(contrib) < 2:
+                continue
+            checked_multi = True
+            w = np.array([gt[(g, t)]["n_treated"] for (g, t) in contrib], dtype=float)
+            eff = np.array([gt[(g, t)]["effect"] for (g, t) in contrib])
+            expected = float(np.sum(w / w.sum() * eff))
+            assert (
+                abs(es["effect"] - expected) < 1e-9
+            ), f"ES(e={e})={es['effect']:.6f} != cohort-share-weighted avg {expected:.6f}"
+        assert checked_multi, "DGP did not produce a multi-cohort relative time to check"
+
+
+class TestEquation414Overall:
+    """Eq. (4.14) / Corollary 4.2: ``overall_att_es`` is the UNWEIGHTED mean of the
+    post-treatment event-study effects ES(e); opt-in via ``aggregate='event_study'``;
+    ``None`` otherwise; the default ``overall_att`` is undisturbed.
+    """
+
+    def _fit(self, aggregate, anticipation=0, seed=51):
+        data = generate_staggered_ddd_data(
+            n_units=600,
+            cohort_periods=[3, 5],
+            treatment_effect=3.0,
+            dynamic_effects=True,
+            effect_growth=0.3,
+            seed=seed,
+        )
+        est = StaggeredTripleDifference(
+            estimation_method="dr", control_group="nevertreated", anticipation=anticipation
+        )
+        return est.fit(data, *_FIT_COLS, aggregate=aggregate)
+
+    def test_overall_es_is_unweighted_mean_of_post_es(self):
+        res = self._fit("event_study")
+        assert res.overall_att_es is not None
+        assert res.event_study_effects is not None
+        post = [
+            es["effect"]
+            for e, es in res.event_study_effects.items()
+            if e >= 0 and np.isfinite(es["effect"])
+        ]
+        assert abs(res.overall_att_es - float(np.mean(post))) < 1e-10
+
+    def test_overall_es_none_without_event_study(self):
+        assert self._fit("simple").overall_att_es is None
+        assert self._fit(None).overall_att_es is None
+
+    def test_overall_es_respects_anticipation_window(self):
+        # anticipation=1 => post-treatment is e >= -1 (matches the default overall_att set)
+        res = self._fit("event_study", anticipation=1)
+        assert res.overall_att_es is not None
+        assert res.event_study_effects is not None
+        post = [
+            es["effect"]
+            for e, es in res.event_study_effects.items()
+            if e >= -1 and np.isfinite(es["effect"])
+        ]
+        assert abs(res.overall_att_es - float(np.mean(post))) < 1e-10
+
+    def test_overall_es_does_not_disturb_default_overall(self):
+        res_es = self._fit("event_study")
+        res_simple = self._fit("simple")
+        # Default overall_att stays the CS-simple aggregation regardless of ES opt-in
+        assert abs(res_es.overall_att - res_simple.overall_att) < 1e-9
+        assert res_es.overall_se_es is not None
+        assert np.isfinite(res_es.overall_se_es) and res_es.overall_se_es > 0
+
+
+class TestEquation414OverallBootstrap:
+    """Cross-surface: the Eq. 4.14 overall is finite under the multiplier bootstrap
+    (its SE is overridden by the per-draw mean of post-treatment ES(e) draws).
+    """
+
+    def test_bootstrap_overall_es_finite(self):
+        data = generate_staggered_ddd_data(
+            n_units=500, cohort_periods=[3, 5], treatment_effect=3.0, seed=61
+        )
+        res = StaggeredTripleDifference(
+            estimation_method="dr",
+            control_group="nevertreated",
+            n_bootstrap=199,
+            seed=7,
+        ).fit(data, *_FIT_COLS, aggregate="event_study")
+        assert res.overall_att_es is not None
+        assert np.isfinite(res.overall_att_es)
+        assert res.overall_se_es is not None
+        assert np.isfinite(res.overall_se_es) and res.overall_se_es > 0
+
+    def test_balance_e_empties_event_study_fails_soft(self):
+        """A balance_e that removes every event-study cohort must fail SOFT (NaN +
+        warning), not raise, on BOTH the analytical and multiplier-bootstrap paths.
+
+        Regression for the empty-rel_periods ``np.column_stack`` crash in the shared
+        bootstrap mixin: ``balance_e=100`` (no cohort reaches relative time 100)
+        empties the event study, so the Eq. 4.14 overall is "requested but
+        undefined" -> NaN. The default ``overall_att`` (CS-simple, balance_e-
+        independent) stays finite.
+        """
+        data = generate_staggered_ddd_data(
+            n_units=400, cohort_periods=[3, 5], treatment_effect=3.0, seed=71
+        )
+        for nb in (0, 10):
+            est = StaggeredTripleDifference(
+                estimation_method="dr", control_group="nevertreated", n_bootstrap=nb, seed=3
+            )
+            with pytest.warns(UserWarning):
+                res = est.fit(data, *_FIT_COLS, aggregate="event_study", balance_e=100)
+            # requested-but-undefined -> NaN (not a stale value, not a crash)
+            assert res.overall_att_es is not None and np.isnan(res.overall_att_es)
+            assert res.overall_se_es is None or np.isnan(res.overall_se_es)
+            # default overall_att is balance_e-independent and stays finite
+            assert np.isfinite(res.overall_att)
+
+
+class TestAggregationReturnContract:
+    """`CallawaySantAnnaAggregationMixin._compute_aggregated_se_with_wif` must keep a
+    consistent return arity on its empty/non-finite-IF branches: `return_psi=True` ->
+    3-tuple `(se, psi, effective_df)`, `return_psi=False` -> 2-tuple `(se, effective_df)`.
+
+    Regression for the refactor where the early-return branches still yielded a
+    2-tuple / bare float while `_aggregate_event_study` unpacks three values -- a
+    degenerate (empty / non-finite) combined IF would then raise a tuple-unpack
+    ValueError instead of failing soft (NaN SE), which is exactly the edge case the
+    Eq. 4.14 overall's NaN-SE contract relies on.
+    """
+
+    def test_degenerate_if_branch_returns_consistent_arity(self):
+        est = StaggeredTripleDifference()
+        est.anticipation = 0  # mixin reads self.anticipation
+        empty = np.array([])
+        df0 = pd.DataFrame({"unit": []})  # empty design -> empty combined IF
+        r3 = est._compute_aggregated_se_with_wif(
+            [], empty, empty, empty, {}, df0, "unit", return_psi=True
+        )
+        assert (
+            isinstance(r3, tuple) and len(r3) == 3
+        ), f"return_psi=True must be 3-tuple, got {r3!r}"
+        r2 = est._compute_aggregated_se_with_wif(
+            [], empty, empty, empty, {}, df0, "unit", return_psi=False
+        )
+        assert (
+            isinstance(r2, tuple) and len(r2) == 2
+        ), f"return_psi=False must be 2-tuple, got {r2!r}"
diff --git a/tests/test_staggered_triple_diff.py b/tests/test_staggered_triple_diff.py
index 197c5803..bcd772aa 100644
--- a/tests/test_staggered_triple_diff.py
+++ b/tests/test_staggered_triple_diff.py
@@ -143,6 +143,74 @@ def test_summary_runs(self, simple_data):
         assert "Staggered Triple Difference" in summary
         assert "ATT" in summary
 
+    def test_overall_att_es_result_surface(self, simple_data):
+        """Opt-in Eq. 4.14 overall (overall_att_es) is exposed on the result surface
+        (attribute, summary(), to_dict()) under aggregate='event_study', and is None
+        for the default (no event-study) fit."""
+        res = StaggeredTripleDifference().fit(
+            simple_data,
+            "outcome",
+            "unit",
+            "period",
+            "first_treat",
+            "eligibility",
+            aggregate="event_study",
+        )
+        es_fields = (
+            "overall_att_es",
+            "overall_se_es",
+            "overall_t_stat_es",
+            "overall_p_value_es",
+            "overall_conf_int_es",
+        )
+        assert res.overall_att_es is not None and np.isfinite(res.overall_att_es)
+        for f in es_fields:
+            assert getattr(res, f) is not None
+        assert "Eq. 4.14" in res.summary()
+        d = res.to_dict()
+        for f in es_fields:
+            assert f in d
+        # Default fit (no event-study aggregation): fields are None and absent from to_dict.
+        res_default = StaggeredTripleDifference().fit(
+            simple_data, "outcome", "unit", "period", "first_treat", "eligibility"
+        )
+        assert res_default.overall_att_es is None
+        assert "overall_att_es" not in res_default.to_dict()
+
+    def test_overall_att_es_aggregate_all_matches_event_study(self, simple_data):
+        """Eq. 4.14 overall (overall_att_es) is populated under aggregate='all' (not only
+        'event_study') and matches the 'event_study' path bit-for-bit on the same data.
+
+        Regression guard for the documented public contract that BOTH surfaces expose
+        overall_att_es: aggregate='all' additionally computes the per-cohort group
+        aggregation, and must neither drop nor perturb the event-study-average overall.
+        """
+        res_all = StaggeredTripleDifference().fit(
+            simple_data, "outcome", "unit", "period", "first_treat", "eligibility", aggregate="all"
+        )
+        res_es = StaggeredTripleDifference().fit(
+            simple_data,
+            "outcome",
+            "unit",
+            "period",
+            "first_treat",
+            "eligibility",
+            aggregate="event_study",
+        )
+        # Populated (not None / finite) on the 'all' surface.
+        assert res_all.overall_att_es is not None and np.isfinite(res_all.overall_att_es)
+        assert res_all.overall_se_es is not None and np.isfinite(res_all.overall_se_es)
+        # Identical point estimate AND inference vs the event_study path: same data, same
+        # analytical influence-function SE (no bootstrap), so bit-for-bit close.
+        assert res_all.overall_att_es == pytest.approx(res_es.overall_att_es, abs=1e-10)
+        assert res_all.overall_se_es == pytest.approx(res_es.overall_se_es, abs=1e-10)
+        assert res_all.overall_t_stat_es == pytest.approx(res_es.overall_t_stat_es, abs=1e-10)
+        assert res_all.overall_p_value_es == pytest.approx(res_es.overall_p_value_es, abs=1e-10)
+        ci_all, ci_es = res_all.overall_conf_int_es, res_es.overall_conf_int_es
+        assert ci_all is not None and ci_es is not None
+        assert ci_all[0] == pytest.approx(ci_es[0], abs=1e-10)
+        assert ci_all[1] == pytest.approx(ci_es[1], abs=1e-10)
+
     def test_to_dataframe_group_time(self, simple_data):
         est = StaggeredTripleDifference()
         res = est.fit(simple_data, "outcome", "unit", "period", "first_treat", "eligibility")
@@ -411,9 +479,7 @@ def test_inf_first_treat_works(self):
             UserWarning,
             match=rf"{n_inf_rows} row\(s\) have first_treat=inf; recoding to 0",
         ):
-            res = est.fit(
-                data, "outcome", "unit", "period", "first_treat", "eligibility"
-            )
+            res = est.fit(data, "outcome", "unit", "period", "first_treat", "eligibility")
         assert np.isfinite(res.overall_att)
 
     def test_survey_design_invalid_type_raises(self, simple_data):
@@ -574,12 +640,10 @@ def test_collinear_covariates_emit_lstsq_fallback_warning(self):
                 covariates=["x1", "x2", "x3"],
             )
         or_warnings = [
-            w for w in caught
-            if "outcome-regression influence-function step" in str(w.message)
+            w for w in caught if "outcome-regression influence-function step" in str(w.message)
         ]
         assert len(or_warnings) == 1, (
-            f"Expected exactly one aggregate OR lstsq-fallback warning, "
-            f"got {len(or_warnings)}."
+            f"Expected exactly one aggregate OR lstsq-fallback warning, " f"got {len(or_warnings)}."
         )
         msg = str(or_warnings[0].message)
         assert "(g, g_c, t) pair(s)" in msg
@@ -616,10 +680,7 @@ def test_collinear_covariates_emit_ps_hessian_warning(self):
                 "eligibility",
                 covariates=["x1", "x2", "x3"],
             )
-        ps_warnings = [
-            w for w in caught
-            if "propensity-score Hessian" in str(w.message)
-        ]
+        ps_warnings = [w for w in caught if "propensity-score Hessian" in str(w.message)]
         assert len(ps_warnings) == 1, (
             f"Expected exactly one aggregate PS-Hessian lstsq-fallback "
             f"warning under IPW, got {len(ps_warnings)}: "
@@ -654,9 +715,7 @@ def test_well_conditioned_covariates_emit_no_lstsq_warning(self):
                 "eligibility",
                 covariates=["x1", "x2"],
             )
-        lstsq_warnings = [
-            w for w in caught if "Rank-deficient X'WX" in str(w.message)
-        ]
+        lstsq_warnings = [w for w in caught if "Rank-deficient X'WX" in str(w.message)]
         assert lstsq_warnings == [], (
             f"Unexpected lstsq-fallback warning on clean covariates: "
             f"{[str(w.message) for w in lstsq_warnings]}"
@@ -670,10 +729,6 @@ def test_no_covariates_no_warning(self, simple_data):
 
         with _w.catch_warnings(record=True) as caught:
             _w.simplefilter("always")
-            est.fit(
-                simple_data, "outcome", "unit", "period", "first_treat", "eligibility"
-            )
-        lstsq_warnings = [
-            w for w in caught if "Rank-deficient X'WX" in str(w.message)
-        ]
+            est.fit(simple_data, "outcome", "unit", "period", "first_treat", "eligibility")
+        lstsq_warnings = [w for w in caught if "Rank-deficient X'WX" in str(w.message)]
         assert lstsq_warnings == []
diff --git a/tests/test_survey_staggered_ddd.py b/tests/test_survey_staggered_ddd.py
index 264c602d..dfccde40 100644
--- a/tests/test_survey_staggered_ddd.py
+++ b/tests/test_survey_staggered_ddd.py
@@ -5,6 +5,8 @@
 full design (strata/PSU/FPC), replicate weights, and bootstrap + survey.
 """
 
+import warnings
+
 import numpy as np
 import pandas as pd
 import pytest
@@ -843,9 +845,7 @@ def test_fallback_uses_weighted_mean(self, sddd_data):
         # logit failure in at least some subgroups
         data["collinear_x"] = data["eligibility"].astype(float) * 100.0
         sd = SurveyDesign(weights="weight")
-        est = StaggeredTripleDifference(
-            estimation_method="ipw", pscore_fallback="unconditional"
-        )
+        est = StaggeredTripleDifference(estimation_method="ipw", pscore_fallback="unconditional")
         import warnings
 
         with warnings.catch_warnings():
@@ -900,3 +900,166 @@ def test_zero_mass_subgroup_warns(self):
         assert len(mass_warnings) > 0, "Expected zero-mass subgroup warning"
         # Result should still be finite (other cohorts contribute)
         assert np.isfinite(res.overall_att)
+
+
+# ---------------------------------------------------------------------------
+# Paper Eq. 4.14 overall (event-study average) under survey weighting
+# ---------------------------------------------------------------------------
+
+
+class TestSurveyOverallAttEs:
+    """The opt-in ``overall_att_es`` (paper Eq. 4.14) honours survey weights: its
+    analytical SE routes through the same survey-aware variance estimator as the
+    per-event-time effects, so uniform weights reproduce the unweighted result,
+    nontrivial weights move the SE, and a full strata/PSU design stays finite.
+    """
+
+    def test_uniform_weight_equivalence(self):
+        data = _make_staggered_ddd_data(n_units=200, seed=42)
+        data["weight"] = 1.0
+        sd = SurveyDesign(weights="weight")
+        res_w = StaggeredTripleDifference(estimation_method="reg").fit(
+            data,
+            "outcome",
+            "unit",
+            "period",
+            "first_treat",
+            "eligibility",
+            aggregate="event_study",
+            survey_design=sd,
+        )
+        res_u = StaggeredTripleDifference(estimation_method="reg").fit(
+            data,
+            "outcome",
+            "unit",
+            "period",
+            "first_treat",
+            "eligibility",
+            aggregate="event_study",
+        )
+        assert res_w.overall_att_es is not None and res_u.overall_att_es is not None
+        assert np.isclose(res_w.overall_att_es, res_u.overall_att_es, atol=1e-8)
+        assert res_w.overall_se_es is not None and res_u.overall_se_es is not None
+        assert np.isclose(res_w.overall_se_es, res_u.overall_se_es, atol=1e-8)
+
+    def test_nontrivial_weights_change_se(self):
+        data = _make_staggered_ddd_data(n_units=200, seed=42)  # variable weights
+        sd = SurveyDesign(weights="weight")
+        res_w = StaggeredTripleDifference(estimation_method="reg").fit(
+            data,
+            "outcome",
+            "unit",
+            "period",
+            "first_treat",
+            "eligibility",
+            aggregate="event_study",
+            survey_design=sd,
+        )
+        res_u = StaggeredTripleDifference(estimation_method="reg").fit(
+            data,
+            "outcome",
+            "unit",
+            "period",
+            "first_treat",
+            "eligibility",
+            aggregate="event_study",
+        )
+        assert res_w.overall_se_es is not None and res_u.overall_se_es is not None
+        # Variable survey weights change the Eq-4.14 overall SE
+        assert not np.isclose(res_w.overall_se_es, res_u.overall_se_es, atol=1e-6)
+
+    def test_full_design_overall_es_finite(self):
+        data = _make_staggered_ddd_data(n_units=200, seed=42)
+        sd = SurveyDesign(weights="weight", strata="stratum", psu="psu")
+        res = StaggeredTripleDifference(estimation_method="reg").fit(
+            data,
+            "outcome",
+            "unit",
+            "period",
+            "first_treat",
+            "eligibility",
+            aggregate="event_study",
+            survey_design=sd,
+        )
+        assert res.overall_att_es is not None and np.isfinite(res.overall_att_es)
+        assert res.overall_se_es is not None and np.isfinite(res.overall_se_es)
+        assert res.overall_se_es > 0
+
+    def test_replicate_weight_overall_es_finite(self):
+        """The Eq. 4.14 overall (overall_att_es) is computed and finite under a
+        replicate-weight (BRR) survey design, routing its SE through the replicate
+        variance path. Regression for the P1 fix that overall_att_es uses its OWN
+        per-statistic effective df (compute_replicate_if_variance), not the simple
+        overall's mutated df.
+        """
+        data = _make_staggered_ddd_data(n_units=200, seed=42)
+        rng = np.random.default_rng(99)
+        unit_ids = sorted(data["unit"].unique())
+        base_w = data.groupby("unit")["weight"].first().reindex(unit_ids).values
+        n_rep = 20
+        for r in range(n_rep):
+            factor = np.abs(1.0 + rng.standard_normal(len(unit_ids)) * 0.1)
+            data[f"rep_{r}"] = data["unit"].map(dict(zip(unit_ids, base_w * factor)))
+        sd = SurveyDesign(
+            weights="weight",
+            replicate_weights=[f"rep_{r}" for r in range(n_rep)],
+            replicate_method="BRR",
+        )
+        res = StaggeredTripleDifference(estimation_method="reg").fit(
+            data,
+            "outcome",
+            "unit",
+            "period",
+            "first_treat",
+            "eligibility",
+            aggregate="event_study",
+            survey_design=sd,
+        )
+        assert res.overall_att_es is not None and np.isfinite(res.overall_att_es)
+        assert res.overall_se_es is not None and np.isfinite(res.overall_se_es)
+        assert res.overall_se_es > 0
+        # default overall_att (simple) also computed under the replicate design
+        assert np.isfinite(res.overall_att) and np.isfinite(res.overall_se)
+
+    def test_single_psu_bootstrap_cluster_unidentified(self):
+        """Under a single-PSU survey design with n_bootstrap>0, the Eq. 4.14 overall
+        keeps its (analytical) point estimate but its SE is unidentified -> NaN, and
+        the inference fields are NaN-consistent.
+
+        Regression for the P2 warning-misdiagnosis fix: when overall_se_es is NaN the
+        terminal warning must not attribute it *solely* to a non-finite influence
+        function -- a single-PSU/cluster design is the actual cause here, and the
+        authoritative "variance unidentified" warning is raised by the bootstrap. The
+        Eq. 4.14 warning, when emitted, must name the unidentified-variance cause.
+        """
+        data = _make_staggered_ddd_data(n_units=120, seed=42).copy()
+        data["psu"] = 0  # collapse to a single PSU -> variance/cluster unidentified
+        data["fpc_col"] = 10**6  # design-based variance so the PSU is retained
+        sd = SurveyDesign(weights="weight", psu="psu", fpc="fpc_col")
+        with warnings.catch_warnings(record=True) as caught:
+            warnings.simplefilter("always")
+            res = StaggeredTripleDifference(estimation_method="reg", n_bootstrap=50, seed=7).fit(
+                data,
+                "outcome",
+                "unit",
+                "period",
+                "first_treat",
+                "eligibility",
+                aggregate="event_study",
+                survey_design=sd,
+            )
+        # Point estimate survives; SE + inference are NaN-consistent (never a stale value).
+        assert res.overall_att_es is not None and np.isfinite(res.overall_att_es)
+        assert res.overall_se_es is not None and np.isnan(res.overall_se_es)
+        assert res.overall_t_stat_es is not None and np.isnan(res.overall_t_stat_es)
+        assert res.overall_p_value_es is not None and np.isnan(res.overall_p_value_es)
+        assert res.overall_conf_int_es is not None
+        assert np.isnan(res.overall_conf_int_es[0]) and np.isnan(res.overall_conf_int_es[1])
+        msgs = [str(w.message) for w in caught]
+        # The bootstrap raises the authoritative "variance unidentified" warning.
+        assert any("unidentified" in m for m in msgs)
+        # The Eq. 4.14 overall warning is cause-accurate (would fail on the old wording
+        # that attributed the NaN solely to a non-finite influence function).
+        es_warns = [m for m in msgs if "overall_att_es)" in m]
+        assert es_warns, "expected the Eq. 4.14 overall warning under single-PSU bootstrap"
+        assert all("unidentified" in m for m in es_warns)