Skip to content

gh-115999: Specialize STORE_ATTR in free-threaded builds.#127838

Merged
nascheme merged 12 commits into
python:mainfrom
nascheme:gh-115999-specialize-store-attr
Dec 19, 2024
Merged

gh-115999: Specialize STORE_ATTR in free-threaded builds.#127838
nascheme merged 12 commits into
python:mainfrom
nascheme:gh-115999-specialize-store-attr

Conversation

@nascheme
Copy link
Copy Markdown
Member

@nascheme nascheme commented Dec 11, 2024

  • Fix thread safety issues with specialized opcodes (STORE_ATTR_INSTANCE_VALUE, STORE_ATTR_SLOT, STORE_ATTR_WITH_HINT). Need a combination of locks and atomics to be safe.
  • Fix thread safety issues with _Py_Specialize_StoreAttr . Avoid using borrowed references. Save and store the tp_version_tag from the beginning of the specialization process since it might change. Use helper functions to update opcode.
  • Add unit tests to ensure specialization is happening in free-threaded builds. Add addtional tests to try to trigger data races.

@nascheme nascheme changed the title gh-115999: Enable specialization of STORE_ATTR free-threaded builds. gh-115999: Specialize STORE_ATTR in free-threaded builds. Dec 11, 2024
@nascheme nascheme marked this pull request as ready for review December 11, 2024 21:54
@nascheme nascheme requested a review from mpage December 11, 2024 21:54
Comment thread Python/bytecodes.c
Comment thread Python/bytecodes.c Outdated
Copy link
Copy Markdown
Contributor

@mpage mpage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on! This looks like a regression on the default build. I haven't had a chance to dig into it, but I suspect it might either be due to the check that Sam flagged in _STORE_ATTR_INSTANCE_VALUE or the change to when we read the type version in _Py_Specialize_StoreAttr. It looks like the richards benchmark is the most heavily affected, so that might be a good isolated benchmark to use for debugging.

Comment thread Lib/test/test_opcache.py Outdated
Comment thread Python/specialize.c Outdated
Comment thread Python/specialize.c Outdated
Comment thread Python/specialize.c Outdated
Comment thread Python/specialize.c Outdated
@nascheme
Copy link
Copy Markdown
Member Author

nascheme commented Dec 13, 2024

Thanks for taking this on! This looks like a regression on the default build. I haven't had a chance to dig into it, but I suspect it might either be due to the check that Sam flagged in _STORE_ATTR_INSTANCE_VALUE or the change to when we read the type version in _Py_Specialize_StoreAttr. It looks like the richards benchmark is the most heavily affected, so that might be a good isolated benchmark to use for debugging.

Based on my benchmarking my second commit, the regression with richards is gone. Likely it was the fact that STORE_ATTR_INSTANCE_VALUE was not actually working.

Regarding the tp_version_tag not getting set due to type_get_version() being hoisted, fixing it as you suggest by using _PyType_LookupRefAndVersion is a bit complex to do. So, I deferred doing that for now. I'll look at it again.

@nascheme
Copy link
Copy Markdown
Member Author

Regarding the tp_version_tag not getting set due to type_get_version() being hoisted, fixing it as you suggest by using _PyType_LookupRefAndVersion is a bit complex to do. So, I deferred doing that for now. I'll look at it again.

This was actually not hard to fix. I was thinking that _PyType_LookupRefAndVersion would not set the tag in the case the lookup fails. However, that's not the case.

Comment thread Python/bytecodes.c Outdated
Comment thread Python/bytecodes.c Outdated
@nascheme nascheme force-pushed the gh-115999-specialize-store-attr branch from f920dcd to 4c484ab Compare December 13, 2024 19:17
@nascheme
Copy link
Copy Markdown
Member Author

I rebased on main to resolve merge conflicts and force pushed.

Copy link
Copy Markdown
Contributor

@mpage mpage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple of small comments inline, but LGTM overall.

Comment thread Python/specialize.c Outdated
Comment thread Python/specialize.c Outdated
@nascheme nascheme requested a review from methane as a code owner December 17, 2024 21:59
* Fix locking for `STORE_ATTR_INSTANCE_VALUE`.  Create
  `_GUARD_TYPE_VERSION_AND_LOCK` so that object stays locked and
  `tp_version_tag` cannot change.  Fix inverted logic bug that caused
  erroneous deopt.

* Fix locking for `_STORE_ATTR_WITH_HINT`.  Double check that
  `_PyObject_GetManagedDict()` hasn't disappeared since we locked the
  dict.

- Pass `tp_version_tag` to `specialize_dict_access()`, ensuring
  the version we store on the cache is the correct one (in case of
  it changing during the specalize analysis).

- Split `analyze_descriptor` into `analyze_descriptor_load` and
  `analyze_descriptor_store` since those don't share much logic.
  Add `descriptor_is_class` helper function.

- In `specialize_dict_access`, double check `_PyObject_GetManagedDict()`
  in case we race and dict was materialized before the lock.
If the type is new and a version tag hasn't yet been assigned, we would
fail to specialize it.  Use `_PyType_LookupRefAndVersion()` instead of
`type_get_version()`, which will assign a version.
Use provided value of `tp_version` to store in cache.
This also fixes the case if the dict is replaced with a different one.
The function is only used in with-GIL builds.
For `specialize_dict_access_inline()`, we need to lock the keys object.

* Add `_PyDictKeys_StringLookupSplit` which does required locking and
  use in place of `_PyDictKeys_StringLookup`.

* Change `_PyObject_TryGetInstanceAttribute` to use that function
  in the case of split keys.

* Add `unicodekeys_lookup_split` helper which allows code sharing
  between `_Py_dict_lookup` and `_PyDictKeys_StringLookupSplit`.
@nascheme nascheme force-pushed the gh-115999-specialize-store-attr branch from 699f4e9 to 6bf9016 Compare December 18, 2024 06:16
Copy link
Copy Markdown
Contributor

@mpage mpage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Just a couple of small comments inline.

Comment thread Objects/dictobject.c Outdated
Comment thread Objects/dictobject.c Outdated
Copy link
Copy Markdown
Contributor

@mpage mpage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Benchmark results:

@nascheme nascheme merged commit 1b15c89 into python:main Dec 19, 2024
srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this pull request Dec 23, 2024
…thongh-127838)

* Add `_PyDictKeys_StringLookupSplit` which does locking on dict keys and
  use in place of `_PyDictKeys_StringLookup`.

* Change `_PyObject_TryGetInstanceAttribute` to use that function
  in the case of split keys.

* Add `unicodekeys_lookup_split` helper which allows code sharing
  between `_Py_dict_lookup` and `_PyDictKeys_StringLookupSplit`.

* Fix locking for `STORE_ATTR_INSTANCE_VALUE`.  Create
  `_GUARD_TYPE_VERSION_AND_LOCK` uop so that object stays locked and
  `tp_version_tag` cannot change.

* Pass `tp_version_tag` to `specialize_dict_access()`, ensuring
  the version we store on the cache is the correct one (in case of
  it changing during the specalize analysis).

* Split `analyze_descriptor` into `analyze_descriptor_load` and
  `analyze_descriptor_store` since those don't share much logic.
  Add `descriptor_is_class` helper function.

* In `specialize_dict_access`, double check `_PyObject_GetManagedDict()`
  in case we race and dict was materialized before the lock.

* Avoid borrowed references in `_Py_Specialize_StoreAttr()`.

* Use `specialize()` and `unspecialize()` helpers.

* Add unit tests to ensure specializing happens as expected in FT builds.

* Add unit tests to attempt to trigger data races (useful for running under TSAN).

* Add `has_split_table` function to `_testinternalcapi`.
srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this pull request Jan 8, 2025
…thongh-127838)

* Add `_PyDictKeys_StringLookupSplit` which does locking on dict keys and
  use in place of `_PyDictKeys_StringLookup`.

* Change `_PyObject_TryGetInstanceAttribute` to use that function
  in the case of split keys.

* Add `unicodekeys_lookup_split` helper which allows code sharing
  between `_Py_dict_lookup` and `_PyDictKeys_StringLookupSplit`.

* Fix locking for `STORE_ATTR_INSTANCE_VALUE`.  Create
  `_GUARD_TYPE_VERSION_AND_LOCK` uop so that object stays locked and
  `tp_version_tag` cannot change.

* Pass `tp_version_tag` to `specialize_dict_access()`, ensuring
  the version we store on the cache is the correct one (in case of
  it changing during the specalize analysis).

* Split `analyze_descriptor` into `analyze_descriptor_load` and
  `analyze_descriptor_store` since those don't share much logic.
  Add `descriptor_is_class` helper function.

* In `specialize_dict_access`, double check `_PyObject_GetManagedDict()`
  in case we race and dict was materialized before the lock.

* Avoid borrowed references in `_Py_Specialize_StoreAttr()`.

* Use `specialize()` and `unspecialize()` helpers.

* Add unit tests to ensure specializing happens as expected in FT builds.

* Add unit tests to attempt to trigger data races (useful for running under TSAN).

* Add `has_split_table` function to `_testinternalcapi`.
eendebakpt added a commit to eendebakpt/cpython that referenced this pull request May 22, 2026
Both the bytecode handler (_STORE_ATTR_SLOT) and the C-API path
(PyMember_SetOne for Py_T_OBJECT_EX / _Py_T_OBJECT) currently take
per-object locks (ob_mutex via LOCK_OBJECT or Py_BEGIN_CRITICAL_SECTION)
to sequence the read-of-old-value, store-of-new-value, and decref-of-old.

Replace the lock with a single atomic exchange on the slot pointer.
Atomic exchange returns the previous value and stores the new one in one
atomic op, which is both sufficient for correctness on the write side
(each concurrent writer observes a unique old value, so no double-decref)
and consistent with how the read side already works: _LOAD_ATTR_SLOT and
PyMember_GetOne are lock-free for the fast path, using FT_ATOMIC_LOAD_PTR
plus _Py_TryIncrefCompare to cope with mid-flight pointer changes.

This is the WRITE-side counterpart to the existing lock-free LOAD_ATTR_SLOT
read path, and follows the pattern Dino Viehland established with
pythongh-130373 (Avoid locking in _LOAD_ATTR_WITH_HINT, merged Feb 2025) and
companions pythongh-130312 / pythongh-130313 (don't lock when populating sets in
bytecode / when clearing attribute-less objects). The primitive being
used here, _Py_atomic_exchange_ptr, is being formally proposed in
pythongh-150044 for an internal _Py_SETREF_ATOMIC macro.

Three coordinated changes are required:

  1. _STORE_ATTR_SLOT in Python/bytecodes.c: drop LOCK_OBJECT /
     UNLOCK_OBJECT and the DEOPT_IF on lock-acquire failure; use
     _Py_atomic_exchange_ptr on Py_GIL_DISABLED, plain read/write on
     the GIL build.

  2. PyMember_SetOne in Python/structmember.c, Py_T_OBJECT_EX and
     _Py_T_OBJECT cases: drop Py_BEGIN_CRITICAL_SECTION; use
     _Py_atomic_exchange_ptr to match the bytecode path. The two write
     paths must change together: if only the bytecode were lock-free,
     PyMember_SetOne's plain "oldv = *addr" inside the CS would race
     with concurrent atomic-exchange writers, returning a "stolen" old
     value that another writer is also about to DECREF (double-free).

  3. PyMember_GetOne in Python/structmember.c, Py_T_OBJECT_EX and
     _Py_T_OBJECT cases: replace the CS+Py_XINCREF fallback (taken on
     _Py_TryIncrefCompare failure) with _Py_XGetRef, which retries the
     atomic-load + try-incref loop. The previous fallback relied on
     writers also taking ob_mutex to keep the slot stable inside the
     CS; with lock-free writers, the CS no longer excludes them, and a
     plain Py_XINCREF on a re-loaded slot value could resurrect a
     refcount-0 (about-to-be-freed) object.

History (git blame on the LOCK_OBJECT line):
  * 22b0de2 (2024-06): original FT structure of _STORE_ATTR_SLOT.
  * 1b15c89 (2024-12, pythongh-127838 / pythongh-115999): added LOCK_OBJECT in
    the broader STORE_ATTR specialization commit. For
    _STORE_ATTR_INSTANCE_VALUE the lock is genuinely required (it also
    updates the PyDictValues insertion-order list, which has to be
    coordinated with the NULL->non-NULL transition). For
    _STORE_ATTR_SLOT it was added for symmetry but isn't required
    when atomic exchange is used.
  * bef63d2 (2025-12, pythonGH-134584 / pythongh-142729): removed a redundant
    refcount op from _STORE_ATTR_SLOT, confirming the slot fast path
    is still being actively simplified.

Why this is correct:
  * The slot store is a single pointer write at a fixed offset
    (determined by the type and guarded by _GUARD_TYPE_VERSION before
    _STORE_ATTR_SLOT).
  * Atomic exchange replaces the LOCK_OBJECT + plain-load-old +
    atomic-release-store-new + UNLOCK_OBJECT sequence with one
    lock-prefix xchg.
  * Each writer's exchange returns exactly one old pointer, so
    Py_XDECREF(old_value) cannot double-free across concurrent writers.
  * Readers (LOAD_ATTR_SLOT, PyMember_GetOne) atomically load + try-
    incref, then re-validate that the slot still points at the same
    object before claiming the reference. The TryIncrefCompare path
    correctly fails if ob_ref_shared has reached 0 (object being
    freed), so we never resurrect.
  * Type-change races (`__class__ = NewType`) are caught by the
    _GUARD_TYPE_VERSION uop that runs before _STORE_ATTR_SLOT, same as
    for the already-lock-free _LOAD_ATTR_SLOT.
  * The DEOPT_IF on lock-acquire failure is dropped: atomic exchange
    always succeeds, so the bytecode no longer thrashes between
    specialized and generic forms under contention.

Microbench (origin/main @ 28eac9a, --disable-gil
--enable-optimizations --with-lto=yes, taskset -c 4, best-of-9 ns/op):

  workload                       baseline   patched   ratio
  STORE_ATTR_SLOT (alone)         15.40      9.30     0.604   (-39.6%)
  attr_idiv (LOAD + / + STORE)    41.22     30.20     0.733   (-26.7%)
  Point.normalize (per Point)    417.4     398.0     0.953   (-4.6%)

pyperformance bm_float (direct timing, best-of-5, ms):

  baseline FT    53.2
  patched  FT    48.5    (-4.7 ms, -8.8%)
  GIL (ref)      46.6    (patched FT is within 4% of GIL on this bench)

bm_nbody is unchanged as a control (its STORE-heavy work is on
STORE_SUBSCR_LIST_INT, not STORE_ATTR).

Comprehensive regression sweep on PGO+LTO build: test_descr, test_class,
test_dataclasses, test_dict, test_typing, test_threading,
test_concurrent_futures, test_thread, test_capi, test_inspect,
test_subclassinit, test_long, test_float, test_complex, test_gc,
test_pickle, test_listcomps, test_exceptions, test_compile - all pass.
The full test_free_threading suite (192 tests) also passes.

Concurrent stress test (4 writer threads + 2 reader threads racing on
the same __slots__ object's slots, 200k writes each, 100k reads each):
no crash, no inconsistency, refcount stays in expected range.

Standalone benchmark script: bench_store_attr_slot.py (microbench +
Point.normalize pipeline; see PR comment).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
eendebakpt added a commit to eendebakpt/cpython that referenced this pull request May 22, 2026
Both the bytecode handler (_STORE_ATTR_SLOT) and the C-API path
(PyMember_SetOne for Py_T_OBJECT_EX / _Py_T_OBJECT) currently take
per-object locks (ob_mutex via LOCK_OBJECT or Py_BEGIN_CRITICAL_SECTION)
to sequence the read-of-old-value, store-of-new-value, and decref-of-old.

Replace the lock with a single atomic exchange on the slot pointer.
Atomic exchange returns the previous value and stores the new one in one
atomic op, which is both sufficient for correctness on the write side
(each concurrent writer observes a unique old value, so no double-decref)
and consistent with how the read side already works: _LOAD_ATTR_SLOT and
PyMember_GetOne are lock-free for the fast path, using FT_ATOMIC_LOAD_PTR
plus _Py_TryIncrefCompare to cope with mid-flight pointer changes.

This is the WRITE-side counterpart to the existing lock-free LOAD_ATTR_SLOT
read path, and follows the pattern Dino Viehland established with
pythongh-130373 (Avoid locking in _LOAD_ATTR_WITH_HINT, merged Feb 2025) and
companions pythongh-130312 / pythongh-130313 (don't lock when populating sets in
bytecode / when clearing attribute-less objects). The primitive being
used here, _Py_atomic_exchange_ptr, is being formally proposed in
pythongh-150044 for an internal _Py_SETREF_ATOMIC macro.

Three coordinated changes are required:

  1. _STORE_ATTR_SLOT in Python/bytecodes.c: drop LOCK_OBJECT /
     UNLOCK_OBJECT and the DEOPT_IF on lock-acquire failure; use
     _Py_atomic_exchange_ptr on Py_GIL_DISABLED, plain read/write on
     the GIL build.

  2. PyMember_SetOne in Python/structmember.c, Py_T_OBJECT_EX and
     _Py_T_OBJECT cases: drop Py_BEGIN_CRITICAL_SECTION; use
     _Py_atomic_exchange_ptr to match the bytecode path. The two write
     paths must change together: if only the bytecode were lock-free,
     PyMember_SetOne's plain "oldv = *addr" inside the CS would race
     with concurrent atomic-exchange writers, returning a "stolen" old
     value that another writer is also about to DECREF (double-free).

  3. PyMember_GetOne in Python/structmember.c, Py_T_OBJECT_EX and
     _Py_T_OBJECT cases: replace the CS+Py_XINCREF fallback (taken on
     _Py_TryIncrefCompare failure) with _Py_XGetRef, which retries the
     atomic-load + try-incref loop. The previous fallback relied on
     writers also taking ob_mutex to keep the slot stable inside the
     CS; with lock-free writers, the CS no longer excludes them, and a
     plain Py_XINCREF on a re-loaded slot value could resurrect a
     refcount-0 (about-to-be-freed) object.

History (git blame on the LOCK_OBJECT line):
  * 22b0de2 (2024-06): original FT structure of _STORE_ATTR_SLOT.
  * 1b15c89 (2024-12, pythongh-127838 / pythongh-115999): added LOCK_OBJECT in
    the broader STORE_ATTR specialization commit. For
    _STORE_ATTR_INSTANCE_VALUE the lock is genuinely required (it also
    updates the PyDictValues insertion-order list, which has to be
    coordinated with the NULL->non-NULL transition). For
    _STORE_ATTR_SLOT it was added for symmetry but isn't required
    when atomic exchange is used.
  * bef63d2 (2025-12, pythonGH-134584 / pythongh-142729): removed a redundant
    refcount op from _STORE_ATTR_SLOT, confirming the slot fast path
    is still being actively simplified.

Why this is correct:
  * The slot store is a single pointer write at a fixed offset
    (determined by the type and guarded by _GUARD_TYPE_VERSION before
    _STORE_ATTR_SLOT).
  * Atomic exchange replaces the LOCK_OBJECT + plain-load-old +
    atomic-release-store-new + UNLOCK_OBJECT sequence with one
    lock-prefix xchg.
  * Each writer's exchange returns exactly one old pointer, so
    Py_XDECREF(old_value) cannot double-free across concurrent writers.
  * Readers (LOAD_ATTR_SLOT, PyMember_GetOne) atomically load + try-
    incref, then re-validate that the slot still points at the same
    object before claiming the reference. The TryIncrefCompare path
    correctly fails if ob_ref_shared has reached 0 (object being
    freed), so we never resurrect.
  * Type-change races (`__class__ = NewType`) are caught by the
    _GUARD_TYPE_VERSION uop that runs before _STORE_ATTR_SLOT, same as
    for the already-lock-free _LOAD_ATTR_SLOT.
  * The DEOPT_IF on lock-acquire failure is dropped: atomic exchange
    always succeeds, so the bytecode no longer thrashes between
    specialized and generic forms under contention.

Microbench (origin/main @ 28eac9a, --disable-gil
--enable-optimizations --with-lto=yes, taskset -c 4, best-of-9 ns/op):

  workload                       baseline   patched   ratio
  STORE_ATTR_SLOT (alone)         15.40      9.30     0.604   (-39.6%)
  attr_idiv (LOAD + / + STORE)    41.22     30.20     0.733   (-26.7%)
  Point.normalize (per Point)    417.4     398.0     0.953   (-4.6%)

pyperformance bm_float (direct timing, best-of-5, ms):

  baseline FT    53.2
  patched  FT    48.5    (-4.7 ms, -8.8%)
  GIL (ref)      46.6    (patched FT is within 4% of GIL on this bench)

bm_nbody is unchanged as a control (its STORE-heavy work is on
STORE_SUBSCR_LIST_INT, not STORE_ATTR).

Comprehensive regression sweep on PGO+LTO build: test_descr, test_class,
test_dataclasses, test_dict, test_typing, test_threading,
test_concurrent_futures, test_thread, test_capi, test_inspect,
test_subclassinit, test_long, test_float, test_complex, test_gc,
test_pickle, test_listcomps, test_exceptions, test_compile - all pass.
The full test_free_threading suite (192 tests) also passes.

Concurrent stress test (4 writer threads + 2 reader threads racing on
the same __slots__ object's slots, 200k writes each, 100k reads each):
no crash, no inconsistency, refcount stays in expected range.

Standalone benchmark script: bench_store_attr_slot.py (microbench +
Point.normalize pipeline; see PR comment).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
eendebakpt added a commit to eendebakpt/cpython that referenced this pull request May 23, 2026
Both the bytecode handler (_STORE_ATTR_SLOT) and the C-API path
(PyMember_SetOne for Py_T_OBJECT_EX / _Py_T_OBJECT) currently take
per-object locks (ob_mutex via LOCK_OBJECT or Py_BEGIN_CRITICAL_SECTION)
to sequence the read-of-old-value, store-of-new-value, and decref-of-old.

Replace the lock with a single atomic exchange on the slot pointer.
Atomic exchange returns the previous value and stores the new one in one
atomic op, which is both sufficient for correctness on the write side
(each concurrent writer observes a unique old value, so no double-decref)
and consistent with how the read side already works: _LOAD_ATTR_SLOT and
PyMember_GetOne are lock-free for the fast path, using FT_ATOMIC_LOAD_PTR
plus _Py_TryIncrefCompare to cope with mid-flight pointer changes.

This is the WRITE-side counterpart to the existing lock-free LOAD_ATTR_SLOT
read path, and follows the pattern Dino Viehland established with
pythongh-130373 (Avoid locking in _LOAD_ATTR_WITH_HINT, merged Feb 2025) and
companions pythongh-130312 / pythongh-130313 (don't lock when populating sets in
bytecode / when clearing attribute-less objects). The primitive being
used here, _Py_atomic_exchange_ptr, is being formally proposed in
pythongh-150044 for an internal _Py_SETREF_ATOMIC macro.

Three coordinated changes are required:

  1. _STORE_ATTR_SLOT in Python/bytecodes.c: drop LOCK_OBJECT /
     UNLOCK_OBJECT and the DEOPT_IF on lock-acquire failure; use
     _Py_atomic_exchange_ptr on Py_GIL_DISABLED, plain read/write on
     the GIL build.

  2. PyMember_SetOne in Python/structmember.c, Py_T_OBJECT_EX and
     _Py_T_OBJECT cases: drop Py_BEGIN_CRITICAL_SECTION; use
     _Py_atomic_exchange_ptr to match the bytecode path. The two write
     paths must change together: if only the bytecode were lock-free,
     PyMember_SetOne's plain "oldv = *addr" inside the CS would race
     with concurrent atomic-exchange writers, returning a "stolen" old
     value that another writer is also about to DECREF (double-free).

  3. PyMember_GetOne in Python/structmember.c, Py_T_OBJECT_EX and
     _Py_T_OBJECT cases: replace the CS+Py_XINCREF fallback (taken on
     _Py_TryIncrefCompare failure) with _Py_XGetRef, which retries the
     atomic-load + try-incref loop. The previous fallback relied on
     writers also taking ob_mutex to keep the slot stable inside the
     CS; with lock-free writers, the CS no longer excludes them, and a
     plain Py_XINCREF on a re-loaded slot value could resurrect a
     refcount-0 (about-to-be-freed) object.

History (git blame on the LOCK_OBJECT line):
  * 22b0de2 (2024-06): original FT structure of _STORE_ATTR_SLOT.
  * 1b15c89 (2024-12, pythongh-127838 / pythongh-115999): added LOCK_OBJECT in
    the broader STORE_ATTR specialization commit. For
    _STORE_ATTR_INSTANCE_VALUE the lock is genuinely required (it also
    updates the PyDictValues insertion-order list, which has to be
    coordinated with the NULL->non-NULL transition). For
    _STORE_ATTR_SLOT it was added for symmetry but isn't required
    when atomic exchange is used.
  * bef63d2 (2025-12, pythonGH-134584 / pythongh-142729): removed a redundant
    refcount op from _STORE_ATTR_SLOT, confirming the slot fast path
    is still being actively simplified.

Why this is correct:
  * The slot store is a single pointer write at a fixed offset
    (determined by the type and guarded by _GUARD_TYPE_VERSION before
    _STORE_ATTR_SLOT).
  * Atomic exchange replaces the LOCK_OBJECT + plain-load-old +
    atomic-release-store-new + UNLOCK_OBJECT sequence with one
    lock-prefix xchg.
  * Each writer's exchange returns exactly one old pointer, so
    Py_XDECREF(old_value) cannot double-free across concurrent writers.
  * Readers (LOAD_ATTR_SLOT, PyMember_GetOne) atomically load + try-
    incref, then re-validate that the slot still points at the same
    object before claiming the reference. The TryIncrefCompare path
    correctly fails if ob_ref_shared has reached 0 (object being
    freed), so we never resurrect.
  * Type-change races (`__class__ = NewType`) are caught by the
    _GUARD_TYPE_VERSION uop that runs before _STORE_ATTR_SLOT, same as
    for the already-lock-free _LOAD_ATTR_SLOT.
  * The DEOPT_IF on lock-acquire failure is dropped: atomic exchange
    always succeeds, so the bytecode no longer thrashes between
    specialized and generic forms under contention.

Microbench (origin/main @ 28eac9a, --disable-gil
--enable-optimizations --with-lto=yes, taskset -c 4, best-of-9 ns/op):

  workload                       baseline   patched   ratio
  STORE_ATTR_SLOT (alone)         15.40      9.30     0.604   (-39.6%)
  attr_idiv (LOAD + / + STORE)    41.22     30.20     0.733   (-26.7%)
  Point.normalize (per Point)    417.4     398.0     0.953   (-4.6%)

pyperformance bm_float (direct timing, best-of-5, ms):

  baseline FT    53.2
  patched  FT    48.5    (-4.7 ms, -8.8%)
  GIL (ref)      46.6    (patched FT is within 4% of GIL on this bench)

bm_nbody is unchanged as a control (its STORE-heavy work is on
STORE_SUBSCR_LIST_INT, not STORE_ATTR).

Comprehensive regression sweep on PGO+LTO build: test_descr, test_class,
test_dataclasses, test_dict, test_typing, test_threading,
test_concurrent_futures, test_thread, test_capi, test_inspect,
test_subclassinit, test_long, test_float, test_complex, test_gc,
test_pickle, test_listcomps, test_exceptions, test_compile - all pass.
The full test_free_threading suite (192 tests) also passes.

Concurrent stress test (4 writer threads + 2 reader threads racing on
the same __slots__ object's slots, 200k writes each, 100k reads each):
no crash, no inconsistency, refcount stays in expected range.

Standalone benchmark script: bench_store_attr_slot.py (microbench +
Point.normalize pipeline; see PR comment).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
eendebakpt added a commit to eendebakpt/cpython that referenced this pull request May 23, 2026
Both the bytecode handler (_STORE_ATTR_SLOT) and the C-API path
(PyMember_SetOne for Py_T_OBJECT_EX / _Py_T_OBJECT) currently take
per-object locks (ob_mutex via LOCK_OBJECT or Py_BEGIN_CRITICAL_SECTION)
to sequence the read-of-old-value, store-of-new-value, and decref-of-old.

Replace the lock with a single atomic exchange on the slot pointer.
Atomic exchange returns the previous value and stores the new one in one
atomic op, which is both sufficient for correctness on the write side
(each concurrent writer observes a unique old value, so no double-decref)
and consistent with how the read side already works: _LOAD_ATTR_SLOT and
PyMember_GetOne are lock-free for the fast path, using FT_ATOMIC_LOAD_PTR
plus _Py_TryIncrefCompare to cope with mid-flight pointer changes.

This is the WRITE-side counterpart to the existing lock-free LOAD_ATTR_SLOT
read path, and follows the pattern Dino Viehland established with
pythongh-130373 (Avoid locking in _LOAD_ATTR_WITH_HINT, merged Feb 2025) and
companions pythongh-130312 / pythongh-130313 (don't lock when populating sets in
bytecode / when clearing attribute-less objects). The primitive being
used here, _Py_atomic_exchange_ptr, is being formally proposed in
pythongh-150044 for an internal _Py_SETREF_ATOMIC macro.

Three coordinated changes are required:

  1. _STORE_ATTR_SLOT in Python/bytecodes.c: drop LOCK_OBJECT /
     UNLOCK_OBJECT and the DEOPT_IF on lock-acquire failure; use
     _Py_atomic_exchange_ptr on Py_GIL_DISABLED, plain read/write on
     the GIL build.

  2. PyMember_SetOne in Python/structmember.c, Py_T_OBJECT_EX and
     _Py_T_OBJECT cases: drop Py_BEGIN_CRITICAL_SECTION; use
     _Py_atomic_exchange_ptr to match the bytecode path. The two write
     paths must change together: if only the bytecode were lock-free,
     PyMember_SetOne's plain "oldv = *addr" inside the CS would race
     with concurrent atomic-exchange writers, returning a "stolen" old
     value that another writer is also about to DECREF (double-free).

  3. PyMember_GetOne in Python/structmember.c, Py_T_OBJECT_EX and
     _Py_T_OBJECT cases: replace the CS+Py_XINCREF fallback (taken on
     _Py_TryIncrefCompare failure) with _Py_XGetRef, which retries the
     atomic-load + try-incref loop. The previous fallback relied on
     writers also taking ob_mutex to keep the slot stable inside the
     CS; with lock-free writers, the CS no longer excludes them, and a
     plain Py_XINCREF on a re-loaded slot value could resurrect a
     refcount-0 (about-to-be-freed) object.

History (git blame on the LOCK_OBJECT line):
  * 22b0de2 (2024-06): original FT structure of _STORE_ATTR_SLOT.
  * 1b15c89 (2024-12, pythongh-127838 / pythongh-115999): added LOCK_OBJECT in
    the broader STORE_ATTR specialization commit. For
    _STORE_ATTR_INSTANCE_VALUE the lock is genuinely required (it also
    updates the PyDictValues insertion-order list, which has to be
    coordinated with the NULL->non-NULL transition). For
    _STORE_ATTR_SLOT it was added for symmetry but isn't required
    when atomic exchange is used.
  * bef63d2 (2025-12, pythonGH-134584 / pythongh-142729): removed a redundant
    refcount op from _STORE_ATTR_SLOT, confirming the slot fast path
    is still being actively simplified.

Why this is correct:
  * The slot store is a single pointer write at a fixed offset
    (determined by the type and guarded by _GUARD_TYPE_VERSION before
    _STORE_ATTR_SLOT).
  * Atomic exchange replaces the LOCK_OBJECT + plain-load-old +
    atomic-release-store-new + UNLOCK_OBJECT sequence with one
    lock-prefix xchg.
  * Each writer's exchange returns exactly one old pointer, so
    Py_XDECREF(old_value) cannot double-free across concurrent writers.
  * Readers (LOAD_ATTR_SLOT, PyMember_GetOne) atomically load + try-
    incref, then re-validate that the slot still points at the same
    object before claiming the reference. The TryIncrefCompare path
    correctly fails if ob_ref_shared has reached 0 (object being
    freed), so we never resurrect.
  * Type-change races (`__class__ = NewType`) are caught by the
    _GUARD_TYPE_VERSION uop that runs before _STORE_ATTR_SLOT, same as
    for the already-lock-free _LOAD_ATTR_SLOT.
  * The DEOPT_IF on lock-acquire failure is dropped: atomic exchange
    always succeeds, so the bytecode no longer thrashes between
    specialized and generic forms under contention.

Microbench (origin/main @ 28eac9a, --disable-gil
--enable-optimizations --with-lto=yes, taskset -c 4, best-of-9 ns/op):

  workload                       baseline   patched   ratio
  STORE_ATTR_SLOT (alone)         15.40      9.30     0.604   (-39.6%)
  attr_idiv (LOAD + / + STORE)    41.22     30.20     0.733   (-26.7%)
  Point.normalize (per Point)    417.4     398.0     0.953   (-4.6%)

pyperformance bm_float (direct timing, best-of-5, ms):

  baseline FT    53.2
  patched  FT    48.5    (-4.7 ms, -8.8%)
  GIL (ref)      46.6    (patched FT is within 4% of GIL on this bench)

bm_nbody is unchanged as a control (its STORE-heavy work is on
STORE_SUBSCR_LIST_INT, not STORE_ATTR).

Comprehensive regression sweep on PGO+LTO build: test_descr, test_class,
test_dataclasses, test_dict, test_typing, test_threading,
test_concurrent_futures, test_thread, test_capi, test_inspect,
test_subclassinit, test_long, test_float, test_complex, test_gc,
test_pickle, test_listcomps, test_exceptions, test_compile - all pass.
The full test_free_threading suite (192 tests) also passes.

Concurrent stress test (4 writer threads + 2 reader threads racing on
the same __slots__ object's slots, 200k writes each, 100k reads each):
no crash, no inconsistency, refcount stays in expected range.

Standalone benchmark script: bench_store_attr_slot.py (microbench +
Point.normalize pipeline; see PR comment).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants