JIT: Improve codegen for Vector128/256.NarrowWithSaturation#126226
JIT: Improve codegen for Vector128/256.NarrowWithSaturation#126226saucecontrol wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR refactors x86/x64 SIMD vector conversion intrinsic selection into a shared helper and adds missing fast paths for Vector128/256.NarrowWithSaturation in non-AVX512 environments, reducing instruction count and code size for several narrow-with-saturation cases.
Changes:
- Introduce
GenTreeHWIntrinsic::GetHWIntrinsicIdForVectorConvert(...)to centralize lookup of conversion-related intrinsics (including optional saturating preference). - Improve
Vector128/256.NarrowWithSaturationcodegen on pre-AVX512 machines by using pack-based sequences where applicable. - Refactor existing conversion/widen/narrow construction to use the shared lookup helper instead of duplicated switch logic.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/coreclr/jit/hwintrinsicxarch.cpp | Uses the new conversion lookup helper and adds optimized pack-based paths for NarrowWithSaturation on non-AVX512. |
| src/coreclr/jit/gentree.h | Declares the new shared vector-convert intrinsic lookup helper. |
| src/coreclr/jit/gentree.cpp | Implements the helper and refactors several SIMD convert/narrow/widen paths to use it. |
|
cc @dotnet/jit-contrib this is ready for review. |
|
@saucecontrol, please resolve comments. |
73eca4b to
3aae856
Compare
|
The changes look correct to me, but I'm not really happy with I think the better long term thing here is to remove |
|
That makes sense. For this PR, I was going for consistency, because I'd be happy to take on switching all of the Narrow intrinsics for xarch and aarch to |
| if (varTypeIsFloating(simdBaseType)) | ||
| { | ||
| // gtNewSimdNarrowNode uses the base type of the return for the simdBaseType | ||
| retNode = gtNewSimdNarrowNode(retType, op1, op2, TYP_FLOAT, simdSize); | ||
| retNode = gtNewSimdNarrowNode(retType, op1, op2, simdBaseType, simdSize); | ||
| } |
There was a problem hiding this comment.
This previously always went DOUBLE -> FLOAT but now it seems like it would go DOUBLE -> DOUBLE or FLOAT -> FLOAT? I don't understand this change
There was a problem hiding this comment.
Yeah, in a nutshell, we're just delegating to the Narrow intrinsic for floating types because they have the same documented behavior, which is a saturating conversion. The Narrow intrinsic was originally modeled to have the same behavior as a C# narrowing cast, which truncates for integral types and saturates for floating. NarrowWithSaturation was added later, and just didn't need to duplicate that floating logic.
The confusion comes from the fact that Narrow is wired up so that the simdBaseType is the return type, which is the narrower type, while NarrowWithSaturation took the base type from the first arg. So the base type changed from DOUBLE to FLOAT before, only because of the differing conventions.
Per Tanner's comment (#126226 (comment)), we want to change them both to take the type from the first arg (I'll do it in a followup), but that won't change the fact that we pass through the same type when delegating to the other intrinsic. With this it goes FLOAT -> FLOAT, and later it will go DOUBLE -> DOUBLE.
| } | ||
| else if ((simdSize == 16) && ((simdBaseType == TYP_SHORT) || (simdBaseType == TYP_INT))) | ||
| else if (((simdSize == 16) || (simdSize == 32)) && | ||
| ((simdBaseType == TYP_BYTE) || (simdBaseType == TYP_SHORT))) |
There was a problem hiding this comment.
We previously would have used PackSignedSaturate here for a basetype of SHORT or INT with a size of 16; now the basetype has to be BYTE or SHORT instead of SHORT or INT. Is that intentional? Is it compensated for by the changes to the switch statement below?
There was a problem hiding this comment.
Yes, this is intentional because of the simdBaseType now matching the return type. Again, it's just for consistency with Narrow for now. We can change them both at the same time.
| else if (compOpportunisticallyDependsOn(InstructionSet_AVX512)) | ||
| { | ||
| if ((simdSize == 32) || (simdSize == 64)) | ||
| switch (simdBaseType) |
There was a problem hiding this comment.
nit: The way the individual simd sizes (16/32/64) are being handled here is a little magical and hard to follow, a brief comment explaining the approach might be nice.
| break; | ||
| } | ||
| var_types opBaseType = getHWIntrinsicWidenType(simdBaseType); | ||
| unsigned tmpSimdSize = (simdSize == 64) ? (simdSize / 2) : (simdSize * 2); |
There was a problem hiding this comment.
This in particular, I don't get. I assume it would make sense to me with a high level explanation comment. We're conditionally widening or narrowing based on simdSize? But this is NarrowWithSaturation, shouldn't we always be generating smaller or equal-sized vectors?
There was a problem hiding this comment.
I'll have to revisit this as part of the planned cleanup. I can add more comments then if that's ok with you. The basic idea is that we are narrowing two vectors into 1, so if the target isn't the max width supported by hardware, we can make the operation cheaper by building a single double-width vector and narrowing it with a single instruction. When that's not possible (because the sources are already the max supported vector width) we have to use two narrow instructions, which result in two half-size outputs that have to be rejoined. So we either need a temp size that's double or half the original size.
|
Will leave the checkmark to Egor for now because I can't quite make sense of what's going on in this PR (will try again later though.) I didn't see any problems. |
Resolves #116526
This adds some missing optimized paths for
NarrowWithSaturationintrinsics in pre-AVX-512 environments.Vector128.NarrowWithSaturationwas fully accelerated for signed types but not unsigned:vbroadcastss xmm0, dword ptr [reloc @RWD00] vpminuw xmm1, xmm0, xmmword ptr [rdx] - vpand xmm1, xmm1, xmm0 - vpminuw xmm2, xmm0, xmmword ptr [r8] - vpand xmm0, xmm2, xmm0 + vpminuw xmm0, xmm0, xmmword ptr [r8] vpackuswb xmm0, xmm1, xmm0 vmovups xmmword ptr [rcx], xmm0 mov rax, rcx ret RWD00 dd 00FF00FFh ; 2.34184e-38 -; Total bytes of code 39 +; Total bytes of code 31Vector256.NarrowWithSaturationwas using the slow path for both signed and unsigned:Full diffs