Add Arm64 AdvSimd implementation of Matrix4x4 Invert #128640
Code proposed by the Arm MCP server guided workflow. https://github.com/arm/mcp Testing using dotnet/performance InvertBenchmark shows a 17% improvement on Cobalt.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an Arm64 (AdvSimd/NEON) intrinsic implementation for Matrix4x4 inversion to improve performance on Arm64 platforms.
Changes:
- Add an AdvSimd.Arm64 fast-path in Invert.
- Introduce AdvSimdImpl, mirroring the existing DirectXMath/SSE-based inversion algorithm using NEON intrinsics.
Tagging subscribers to this area: @dotnet/area-system-numerics
    }

    [CompExactlyDependsOn(typeof(AdvSimd.Arm64))]
    static bool AdvSimdImpl(in Impl matrix, out Impl result)
As a general nit, we're looking at landing #127690 which adds a number of xplat helper APIs and will allow us to unify the Arm64 and x64 implementations to a single code path.
It provides helpers like Vector128.ConcateLowerLower(row1, row2) which avoids having to extract a Vector64<T> if that isn't viable (such as on x64 or for SVE where no "half width" vector exists for Vector<T>)
It also provides ones like Vector128.UnzipEven(vTemp1, vTemp2) which unifies the consideration of needing to use a shuffle on some platforms or for some base types vs having a dedicated instruction on others.
If we can hold off until that lands, we should be able to just update Matrix4x4 to no longer have any architecture specific code paths.
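For readers unfamiliar with the operations named above, here is a minimal sketch of what they plausibly compute, written only with portable `Vector128` APIs that already exist. The class `XplatSketch` and both method bodies are hypothetical stand-ins, not the helpers from #127690: the semantics are assumed from the names and from the existing `AdvSimd.Arm64.UnzipEven` instruction, and the real signatures may differ.

```csharp
using System.Runtime.Intrinsics;

// Hypothetical illustration only; not the API added by #127690.
public static class XplatSketch
{
    // Assumed semantics: lower half of 'left' followed by lower half of 'right'.
    // For left = <1, 2, 3, 4> and right = <5, 6, 7, 8> the elements are 1, 2, 5, 6.
    // This version extracts a Vector64<float>, which is the step a dedicated
    // helper would let platforms without a half-width vector avoid.
    public static Vector128<float> ConcatLowerLower(Vector128<float> left, Vector128<float> right)
        => Vector128.Create(left.GetLower(), right.GetLower());

    // Assumed semantics (matching Arm64 UZP1): the even-indexed elements of
    // 'left' followed by the even-indexed elements of 'right'.
    // For left = <1, 2, 3, 4> and right = <5, 6, 7, 8> the elements are 1, 3, 5, 7.
    // Written element by element for clarity, not for speed.
    public static Vector128<float> UnzipEven(Vector128<float> left, Vector128<float> right)
        => Vector128.Create(
            left.GetElement(0), left.GetElement(2),
            right.GetElement(0), right.GetElement(2));
}
```

A single xplat helper would let the JIT pick a dedicated instruction where one exists and fall back to a shuffle elsewhere, which is what would allow `Invert` to drop its per-architecture code paths.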
> If we can hold off until that lands, we should be able to just update Matrix4x4 to no longer have any architecture specific code paths.
That would be a much better solution. This PR is essentially duplicating the X86 code path.
What are the chances of landing #127690 and someone producing a combined version of Invert in time for .NET 11? If that's not likely to happen, would this PR be useful as a stopgap to help performance? Understood that you may not want it for code size and churn reasons.
#127690 should be merged in the next few days; it's just waiting on secondary sign-off. It's part of the planned work for .NET 11.
Once that's done, updating Matrix4x4 to be xplat should be trivial; I can get it done relatively quickly.
Closing this as it should be implemented with the new APIs once they are available.