JIT: Improve codegen for Vector128/256.NarrowWithSaturation by saucecontrol · Pull Request #126226 · dotnet/runtime

saucecontrol · 2026-03-27T21:04:51Z

This adds some missing optimized paths for NarrowWithSaturation intrinsics in pre-AVX-512 environments.

Vector128.NarrowWithSaturation was fully accelerated for signed types but not unsigned:

static Vector128<byte> NarrowSaturate(Vector128<ushort> x, Vector128<ushort> y)
	=> Vector128.NarrowWithSaturation(x, y);

        vbroadcastss xmm0, dword ptr [reloc @RWD00]
        vpminuw  xmm1, xmm0, xmmword ptr [rdx]
-       vpand    xmm1, xmm1, xmm0
-       vpminuw  xmm2, xmm0, xmmword ptr [r8]
-       vpand    xmm0, xmm2, xmm0
+       vpminuw  xmm0, xmm0, xmmword ptr [r8]
        vpackuswb xmm0, xmm1, xmm0
        vmovups  xmmword ptr [rcx], xmm0
        mov      rax, rcx
        ret      

 RWD00  	dd	00FF00FFh		; 2.34184e-38
 
-; Total bytes of code 39
+; Total bytes of code 31
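
The removed vpand instructions were redundant because the unsigned minimum against the broadcast 0x00FF already bounds every 16-bit lane. The following is a minimal scalar sketch of one lane, written in C++ and based only on the disassembly above; the function names are made up for illustration and are not JIT or .NET APIs.

```cpp
#include <algorithm>
#include <cstdint>

// Models one lane of vpackuswb: the 16-bit input is interpreted as SIGNED
// and clamped to the unsigned byte range [0, 255].
inline uint8_t packUnsignedSaturate(uint16_t bits)
{
    // Reinterpret the bit pattern as a signed 16-bit value without relying on
    // implementation-defined narrowing conversions.
    int asSigned = (bits >= 0x8000) ? static_cast<int>(bits) - 0x10000 : static_cast<int>(bits);
    return static_cast<uint8_t>(std::max(0, std::min(255, asSigned)));
}

// Old sequence: vpminuw, vpand, vpackuswb.
inline uint8_t narrowOld(uint16_t x)
{
    uint16_t clamped = std::min<uint16_t>(x, 0x00FF); // vpminuw against the broadcast 0x00FF
    clamped &= 0x00FF;                                 // vpand: cannot change a value already <= 0x00FF
    return packUnsignedSaturate(clamped);
}

// New sequence: vpminuw, vpackuswb.
inline uint8_t narrowNew(uint16_t x)
{
    return packUnsignedSaturate(std::min<uint16_t>(x, 0x00FF));
}

// Compares both sequences over every possible 16-bit input.
inline bool sequencesAgree()
{
    for (uint32_t v = 0; v <= 0xFFFF; ++v)
    {
        if (narrowOld(static_cast<uint16_t>(v)) != narrowNew(static_cast<uint16_t>(v)))
            return false;
    }
    return true;
}
```

The minimum itself cannot be dropped in this model: packUnsignedSaturate(0xFFFF) on its own returns 0, because the pack step reads 0xFFFF as -1, whereas narrowNew(0xFFFF) returns 255.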

Vector256.NarrowWithSaturation was using the slow path for both signed and unsigned:

static Vector256<sbyte> NarrowSaturate(Vector256<short> x, Vector256<short> y)
	=> Vector256.NarrowWithSaturation(x, y);

-       vbroadcastss ymm0, dword ptr [reloc @RWD00]
-       vpmaxsw  ymm1, ymm0, ymmword ptr [rdx]
-       vbroadcastss ymm2, dword ptr [reloc @RWD04]
-       vpminsw  ymm1, ymm1, ymm2
-       vbroadcastss ymm3, dword ptr [reloc @RWD08]
-       vpand    ymm1, ymm1, ymm3
-       vpmaxsw  ymm0, ymm0, ymmword ptr [r8]
-       vpminsw  ymm0, ymm0, ymm2
-       vpand    ymm0, ymm0, ymm3
-       vpackuswb ymm0, ymm1, ymm0
+       vmovups  ymm0, ymmword ptr [rdx]
+       vpacksswb ymm0, ymm0, ymmword ptr [r8]
        vpermq   ymm0, ymm0, -40
        vmovups  ymmword ptr [rcx], ymm0
        mov      rax, rcx
        vzeroupper 
        ret      

-RWD00  	dd	FF80FF80h		;      -nan
-RWD04  	dd	007F007Fh		; 1.16633e-38
-RWD08  	dd	00FF00FFh		; 2.34184e-38
 
-; Total bytes of code 73
+; Total bytes of code 26
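
The vpermq that follows the pack is there because 256-bit vpacksswb works on each 128-bit lane independently, so the packed halves of x and y come out interleaved. The sketch below models that behavior in plain C++ under the assumption that the immediate -40 is read as the byte 0xD8; the type aliases and function names are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

using Short16 = std::array<int16_t, 16>; // stands in for Vector256<short>
using SByte32 = std::array<int8_t, 32>;  // stands in for Vector256<sbyte>

inline int8_t saturateToSByte(int16_t v)
{
    return static_cast<int8_t>(std::max<int>(-128, std::min<int>(127, v)));
}

// Models 256-bit vpacksswb: each 128-bit lane packs 8 elements of a, then 8 of b.
inline SByte32 packPerLane(const Short16& a, const Short16& b)
{
    SByte32 r{};
    for (std::size_t lane = 0; lane < 2; ++lane)
    {
        for (std::size_t i = 0; i < 8; ++i)
        {
            r[lane * 16 + i]     = saturateToSByte(a[lane * 8 + i]);
            r[lane * 16 + 8 + i] = saturateToSByte(b[lane * 8 + i]);
        }
    }
    return r;
}

// Models vpermq: output qword q is input qword ((imm >> 2q) & 3).
inline SByte32 permuteQwords(const SByte32& v, uint8_t imm)
{
    SByte32 r{};
    for (std::size_t q = 0; q < 4; ++q)
    {
        std::size_t src = (imm >> (2 * q)) & 3;
        std::copy_n(v.begin() + src * 8, 8, r.begin() + q * 8);
    }
    return r;
}

// Pack, then reorder qwords [0, 2, 1, 3] so all of x precedes all of y.
inline SByte32 narrowWithSaturation(const Short16& x, const Short16& y)
{
    return permuteQwords(packPerLane(x, y), static_cast<uint8_t>(-40)); // -40 == 0xD8
}
```

In this model the pack alone yields the qword order x.lo, y.lo, x.hi, y.hi, and the 0xD8 permute restores x.lo, x.hi, y.lo, y.hi.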

Full diffs

Copilot

Pull request overview

This PR refactors x86/x64 SIMD vector conversion intrinsic selection into a shared helper and adds missing fast paths for Vector128/256.NarrowWithSaturation in non-AVX512 environments, reducing instruction count and code size for several narrow-with-saturation cases.

Changes:

Introduce GenTreeHWIntrinsic::GetHWIntrinsicIdForVectorConvert(...) to centralize lookup of conversion-related intrinsics (including optional saturating preference).
Improve Vector128/256.NarrowWithSaturation codegen on pre-AVX512 machines by using pack-based sequences where applicable.
Refactor existing conversion/widen/narrow construction to use the shared lookup helper instead of duplicated switch logic.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File	Description
src/coreclr/jit/hwintrinsicxarch.cpp	Uses the new conversion lookup helper and adds optimized pack-based paths for `NarrowWithSaturation` on non-AVX512.
src/coreclr/jit/gentree.h	Declares the new shared vector-convert intrinsic lookup helper.
src/coreclr/jit/gentree.cpp	Implements the helper and refactors several SIMD convert/narrow/widen paths to use it.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

saucecontrol · 2026-03-29T02:55:09Z

cc @dotnet/jit-contrib this is ready for review.

Diffs

JulieLeeMSFT · 2026-05-04T17:45:13Z

@saucecontrol, please resolve comments.

tannergooding · 2026-05-26T20:26:32Z

The changes look correct to me, but I'm not really happy with NarrowWithSaturation going in the opposite direction from the one we want 😅

I think the better long-term approach here is to remove BaseTypeFromFirstArg and make that behavior the default. We would then have only BaseTypeFromSecondArg, and could implicitly take the base type from the return type when no arguments exist. That helps force correctness, but it will also involve a PR that finds all the intrinsics like Narrow that incorrectly use the return type (where it differs from the argument base type).

saucecontrol · 2026-05-26T20:43:48Z

That makes sense. For this PR, I was going for consistency, because NarrowWithSaturation falls back to Narrow, and the transition of base type between the two was confusing.

I'd be happy to take on switching all of the Narrow intrinsics for xarch and aarch to BaseTypeFromFirstArg in a follow-up since that's going to be bigger. Narrow is called from quite a few places (e.g. integral vector division) where adjustments will have to be made.

tannergooding

CC. @dotnet/jit-contrib, @EgorBo, @kg for secondary sign-off. Community PR improving some HWIntrinsic codegen

kg · 2026-05-27T01:18:44Z

+                if (varTypeIsFloating(simdBaseType))
                {
-                    // gtNewSimdNarrowNode uses the base type of the return for the simdBaseType
-                    retNode = gtNewSimdNarrowNode(retType, op1, op2, TYP_FLOAT, simdSize);
+                    retNode = gtNewSimdNarrowNode(retType, op1, op2, simdBaseType, simdSize);
                }


This previously always went DOUBLE -> FLOAT but now it seems like it would go DOUBLE -> DOUBLE or FLOAT -> FLOAT? I don't understand this change

saucecontrol · 2026-05-27

Yeah, in a nutshell, we're just delegating to the Narrow intrinsic for floating-point types because the two have the same documented behavior, which is a saturating conversion. The Narrow intrinsic was originally modeled on a C# narrowing cast, which truncates for integral types and saturates for floating-point types. NarrowWithSaturation was added later and didn't need to duplicate that floating-point logic.

The confusion comes from the fact that Narrow is wired up so that the simdBaseType is the return type, which is the narrower type, while NarrowWithSaturation took the base type from the first arg. So the base type previously changed from DOUBLE to FLOAT only because of the differing conventions.

Per Tanner's comment (#126226 (comment)), we want to change them both to take the type from the first arg (I'll do it in a follow-up), but that won't change the fact that we pass through the same type when delegating to the other intrinsic. With this it goes FLOAT -> FLOAT, and later it will go DOUBLE -> DOUBLE.
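
To make the integral half of that distinction concrete, here is a small scalar sketch in C++ contrasting a truncating narrow with a saturating one for unsigned 32-bit to 16-bit elements. The function names are hypothetical, and the floating-point case is deliberately not modeled.

```cpp
#include <algorithm>
#include <cstdint>

// Truncating narrow: keeps only the low 16 bits, as an integral narrowing cast does.
inline uint16_t narrowTruncate(uint32_t v)
{
    return static_cast<uint16_t>(v);
}

// Saturating narrow: clamps to the largest value the narrower type can hold.
inline uint16_t narrowSaturate(uint32_t v)
{
    return static_cast<uint16_t>(std::min<uint32_t>(v, 0xFFFF));
}
```

For an in-range input such as 0x1234 both functions agree; for 0x12345 the truncating version yields 0x2345 while the saturating version yields 0xFFFF.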

kg · 2026-05-27T01:24:59Z

                }
-                else if ((simdSize == 16) && ((simdBaseType == TYP_SHORT) || (simdBaseType == TYP_INT)))
+                else if (((simdSize == 16) || (simdSize == 32)) &&
+                         ((simdBaseType == TYP_BYTE) || (simdBaseType == TYP_SHORT)))


We previously would have used PackSignedSaturate here for a basetype of SHORT or INT with a size of 16; now the basetype has to be BYTE or SHORT instead of SHORT or INT. Is that intentional? Is it compensated for by the changes to the switch statement below?

saucecontrol · 2026-05-27

Yes, this is intentional, because simdBaseType now matches the return type. Again, it's just for consistency with Narrow for now. We can change them both at the same time.
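
One way to see that the 128-bit cases are unchanged is to map each return element type back to its source element type and compare the two conditions. The C++ sketch below does that with a hypothetical BaseType enum and a hypothetical widen helper standing in for the JIT's types; it mirrors only the two conditions quoted in the diff and is not the JIT's actual code.

```cpp
// Hypothetical stand-ins for the JIT's var_types values. Byte here is the
// signed 8-bit type, matching the JIT's TYP_BYTE.
enum class BaseType { Byte, Short, Int, Long };

// Illustrative one-step widening of an element type (not the JIT's helper).
inline BaseType widen(BaseType t)
{
    switch (t)
    {
        case BaseType::Byte:  return BaseType::Short;
        case BaseType::Short: return BaseType::Int;
        default:              return BaseType::Long;
    }
}

// Old check: simdBaseType was the SOURCE element type, 128-bit vectors only.
inline bool oldPackPath(unsigned simdSize, BaseType sourceType)
{
    return (simdSize == 16) && ((sourceType == BaseType::Short) || (sourceType == BaseType::Int));
}

// New check: simdBaseType is the RETURN element type, and 256-bit is accepted too.
inline bool newPackPath(unsigned simdSize, BaseType returnType)
{
    return ((simdSize == 16) || (simdSize == 32)) &&
           ((returnType == BaseType::Byte) || (returnType == BaseType::Short));
}
```

Under this model, oldPackPath(16, widen(t)) and newPackPath(16, t) agree for every t, and the only new coverage is the 32-byte case.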

kg · 2026-05-27T01:28:05Z

                else if (compOpportunisticallyDependsOn(InstructionSet_AVX512))
                {
-                    if ((simdSize == 32) || (simdSize == 64))
+                    switch (simdBaseType)


nit: The way the individual simd sizes (16/32/64) are being handled here is a little magical and hard to follow, a brief comment explaining the approach might be nice.

kg · 2026-05-27T01:29:52Z

-                                break;
-                            }
+                    var_types opBaseType  = getHWIntrinsicWidenType(simdBaseType);
+                    unsigned  tmpSimdSize = (simdSize == 64) ? (simdSize / 2) : (simdSize * 2);


This in particular, I don't get. I assume it would make sense to me with a high level explanation comment. We're conditionally widening or narrowing based on simdSize? But this is NarrowWithSaturation, shouldn't we always be generating smaller or equal-sized vectors?

saucecontrol · 2026-05-27

I'll have to revisit this as part of the planned cleanup. I can add more comments then if that's ok with you. The basic idea is that we are narrowing two vectors into one, so if the target isn't the max width supported by hardware, we can make the operation cheaper by building a single double-width vector and narrowing it with a single instruction. When that's not possible (because the sources are already the max supported vector width), we have to use two narrow instructions, which produce two half-size outputs that have to be rejoined. So we need a temp size that's either double or half the original size.
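
The two strategies described in that reply can be sketched with ordinary containers. In the C++ below, std::vector stands in for a SIMD register, narrowOne stands in for a single saturating down-convert instruction, and tempSimdSize repeats the expression from the diff; every name other than that expression is hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

inline int16_t saturate(int32_t v)
{
    return static_cast<int16_t>(std::max<int32_t>(INT16_MIN, std::min<int32_t>(INT16_MAX, v)));
}

// Models one down-converting instruction: same element count, half the bytes.
inline std::vector<int16_t> narrowOne(const std::vector<int32_t>& v)
{
    std::vector<int16_t> r(v.size());
    std::transform(v.begin(), v.end(), r.begin(), saturate);
    return r;
}

// Strategy A: join the inputs into one double-width temp, then narrow once.
inline std::vector<int16_t> narrowViaWiderTemp(const std::vector<int32_t>& x, const std::vector<int32_t>& y)
{
    std::vector<int32_t> wide(x);
    wide.insert(wide.end(), y.begin(), y.end());
    return narrowOne(wide);
}

// Strategy B: narrow each input to a half-width temp, then rejoin the halves.
inline std::vector<int16_t> narrowViaHalfTemps(const std::vector<int32_t>& x, const std::vector<int32_t>& y)
{
    std::vector<int16_t> lower = narrowOne(x);
    std::vector<int16_t> upper = narrowOne(y);
    lower.insert(lower.end(), upper.begin(), upper.end());
    return lower;
}

// Temp size in bytes: half for 64-byte inputs (strategy B), double otherwise (strategy A).
inline unsigned tempSimdSize(unsigned simdSize)
{
    return (simdSize == 64) ? (simdSize / 2) : (simdSize * 2);
}
```

Both strategies produce the same elements in the same order; they differ only in whether the temporary is wider or narrower than the operands.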

kg · 2026-05-27T01:32:30Z

Will leave the checkmark to Egor for now because I can't quite make sense of what's going on in this PR (will try again later though.) I didn't see any problems.

Copilot AI review requested due to automatic review settings March 27, 2026 21:04

github-actions Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 27, 2026

dotnet-policy-service Bot added the community-contribution Indicates that the PR has been added by a community member label Mar 27, 2026

Copilot started reviewing on behalf of saucecontrol March 27, 2026 21:05

Copilot AI reviewed Mar 27, 2026

Comment thread src/coreclr/jit/gentree.cpp Outdated

Comment thread src/coreclr/jit/gentree.cpp Outdated

Comment thread src/coreclr/jit/gentree.cpp Outdated

Comment thread src/coreclr/jit/gentree.cpp Outdated

Comment thread src/coreclr/jit/gentree.cpp Outdated

This was referenced Mar 27, 2026

Suboptimal codgen for Vector128.NarrowWithSaturation #116526

Open

JIT: Skip redundant AND masking in NarrowWithSaturation codegen #122898

Closed

Copilot AI review requested due to automatic review settings March 28, 2026 01:41

Copilot started reviewing on behalf of saucecontrol March 28, 2026 01:42

Copilot AI reviewed Mar 28, 2026

Comment thread src/coreclr/jit/hwintrinsicxarch.cpp Outdated

Comment thread src/coreclr/jit/hwintrinsicxarch.cpp

This was referenced Mar 28, 2026

XHarness package install failure on iOS due to devicectl NSPOSIXErrorDomain error 49 #123796

Open

modpowTest.FastReducer_AssertFailure_RegressionTest hanging/timing out #126212

Closed

Copilot AI review requested due to automatic review settings March 28, 2026 04:12

Copilot started reviewing on behalf of saucecontrol March 28, 2026 04:13

Copilot AI reviewed Mar 28, 2026

build-analysis Bot mentioned this pull request Mar 28, 2026

Unable to pull image from mcr.microsoft.com #117164

Open

tannergooding reviewed Apr 6, 2026

Comment thread src/coreclr/jit/gentree.h Outdated

JulieLeeMSFT added the needs-author-action An issue or pull request that requires more info or actions from the author. label May 4, 2026

improve codegen for vector NarrowWithSaturation

3aae856

saucecontrol force-pushed the vectorconvert branch from 73eca4b to 3aae856 Compare May 18, 2026 17:52

dotnet-policy-service Bot removed the needs-author-action An issue or pull request that requires more info or actions from the author. label May 18, 2026

formatting

9abaf49

Copilot AI review requested due to automatic review settings May 18, 2026 18:05

saucecontrol commented May 18, 2026

Comment thread src/coreclr/jit/hwintrinsic.cpp

saucecontrol commented May 18, 2026

Comment thread src/coreclr/jit/hwintrinsiclistxarch.h

saucecontrol commented May 18, 2026

Comment thread src/coreclr/jit/hwintrinsicxarch.cpp

build-analysis Bot mentioned this pull request May 18, 2026

[wasm] Tests failing with DirectoryNotFoundException trying to load test data #128293

Open

tannergooding requested a review from kg May 26, 2026 18:24

tannergooding requested a review from EgorBo May 26, 2026 18:24

tannergooding approved these changes May 26, 2026

kg reviewed May 27, 2026
