For reasons, I have found myself writing a vector math library. Like Aras some time ago, I've found an unpleasant surprise when looking at the debug performance of some vector types.
There is no point repeating what Aras already wrote, but I have some additional data points. My concrete example uses `float3` (3D), where I would still like to use vector instructions, but none of this is actually `float3` specific.
I only care about Clang, or more specifically Clang 19, which you can get via Visual Studio 2022. My setup is X64 on Windows.
I ended up doing this investigation because I had written a little sample particle simulation and decided to improve it by adding a vector math library. The particle simulation just used scalar code, no `float3` struct in sight. Adding the vector math library did improve the performance of optimized builds, but performance completely fell off a cliff in non-optimized debug builds.
This was unexpected. Did I not avoid all of the obviously bad things? My code, for example, does not contain templates. Vector math isn't going to change, ever, and I need just `(float|int)(2|3|4)`. It's way more important to have readable, sensible code and duplicate it six times than to have a single template. But alas, that was not enough to save me from performance woes.
When I say `float3`, I mean a struct like this:
struct float3 {
    float x, y, z, pad;
};
It’s actually the size of 4 floats, and for storage we may want to use a packed format, but computations absolutely need to happen with a width of 4, as that is the most commonly supported vector length.
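As an aside, the packed storage variant might look like this minimal sketch (the `packed_float3` name and the load/store helpers are mine, not from the test code):

// 12-byte storage format; widen to the padded float3 before computing.
struct packed_float3 {
    float x, y, z;
};

inline float3 load(const packed_float3& p) {
    return float3{p.x, p.y, p.z, 0.0f};
}

inline packed_float3 store(const float3& v) {
    return packed_float3{v.x, v.y, v.z};
}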
My concrete test case is code like this, which I have taken verbatim from code I had written earlier with my brain intentionally switched off. It's important that the vector math library is decent even if you don't think about whether you first multiply all the scalars and then the vector, for example. The code loops over 64k particles and, for each particle, over 10 centers of gravity, and calculates some gravity-inspired velocity update.
float dx = cPos.x - pos.x;
float dy = cPos.y - pos.y;
float dz = cPos.z - pos.z;
float squared = dx * dx + dy * dy + dz * dz;
// float* centerMass
float f = centerMass[c] / squared;
float distance = sqrtf(squared);
if (distance < 10)
distance = 10;
dx *= f * dx / distance;
dy *= f * dy / distance;
dz *= f * dz / distance;
// This is assuming a time step of 1/60
vel.x += dx / 60;
vel.y += dy / 60;
vel.z += dz / 60;
There is some opportunity for vectorization, but also a bunch of scalar code.
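For reference, with the vector math library the same update collapses to something like this sketch (assuming the obvious component-wise operators; the squared-length line could equally be a `dot` helper):

float3 delta = cPos - pos;
float squared = delta.x * delta.x + delta.y * delta.y + delta.z * delta.z;
// float* centerMass
float f = centerMass[c] / squared;
float distance = sqrtf(squared);
if (distance < 10)
    distance = 10;
delta = f * delta / distance;
// This is assuming a time step of 1/60
vel += delta / 60;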
I have experimented with eleven different setups:
- purely scalar code where `float3` is used for storage, but for nothing else (`VEC_SCALAR`)
- purely scalar code with operators on `float3`, where the operators are free functions (`VEC_OP`)
- purely scalar code with operators on `float3`, where the operators are member functions (`VEC_OP_MEMBER`)
- raw SSE, where `float3` is defined to be `__m128` (`VEC_SSE_RAW`)
- SSE where `float3` is a struct that contains a union of `__m128` and `x, y, z, pad` (`VEC_SSE`), and we directly operate on the `__m128` with intrinsics
- as in the previous scenario, but with operators (`VEC_SSE_OP`)
- as in the previous scenario, but the operators use the `vectorcall` calling convention (`VEC_SSE_OP_VEC`)
- as in the previous scenario, but the operators are also member functions (`VEC_SSE_OP_VEC_MEM`)
- using Clang's built-in vector types, OpenCL flavored (`VEC_BUILTIN_OCL`)
- using Clang's built-in vector types, GCC flavored (`VEC_BUILTIN_GCC`)
- using Clang's built-in vector types, OpenCL flavored, in a struct (`VEC_BUILTIN_OCL_WRAP`)
In all cases, all the operators are marked as always-inline. You can find the test code here.
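For illustration, the `VEC_OP` operators look roughly like this (a sketch; the exact attribute spelling in the test code may differ):

#define VEC_INLINE inline __attribute__((always_inline))

// Free-function operators on the padded float3.
VEC_INLINE float3 operator-(const float3& a, const float3& b) {
    return float3{a.x - b.x, a.y - b.y, a.z - b.z, 0.0f};
}

VEC_INLINE float3 operator*(float s, const float3& v) {
    return float3{s * v.x, s * v.y, s * v.z, 0.0f};
}

VEC_INLINE float3 operator/(const float3& v, float s) {
    return float3{v.x / s, v.y / s, v.z / s, 0.0f};
}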
Some words about the unexpected entries:
- Clang has various different built-in vector types which all offer a slightly different interface, but they come with operators already. For example, OpenCL-like vectors are defined like this:

typedef float float3 __attribute__((ext_vector_type(3), __aligned__(16)));
static_assert(sizeof(float3) == 4 * sizeof(float), "Unexpected vector size");

- `vectorcall` is a custom calling convention on Windows for functions that take many vector register arguments. That is clearly not the case here, but I wanted to see whether anything happens anyway: the docs (MSDN) mention homogeneous vector aggregate (HVA) values as something that might benefit, and our `float3` should qualify. (A sketch of such an operator follows after this list.)
- I added member vs. non-member functions because I had no idea whether `vectorcall` on a member function was even valid in the first place.
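Here is a minimal sketch of what one of the `vectorcall` operators looks like, assuming the VEC_SSE-style struct and compilation with clang-cl (the `vec` member name matches the snippets later in this post; the anonymous struct in the union is a common compiler extension):

#include <xmmintrin.h>

struct float3 {
    union {
        __m128 vec;
        struct { float x, y, z, pad; };
    };
};

// vectorcall operator, marked always-inline like all the others.
inline __attribute__((always_inline)) float3 __vectorcall
operator*(float s, float3 v) {
    return float3{_mm_mul_ps(_mm_set1_ps(s), v.vec)};
}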
Let me lead with the results. We have five scenarios to compare: `-O0`, `-O1`, `-O2`, `-O0` without forced inlining of all the operators involved, and finally `-Og`. `-Og` is "optimize but keep debuggable". As of writing, `-Og` is identical to `-O1` in terms of optimization passes, but it additionally sets `-fextend-variable-liveness=all` – everything else is identical¹. I explicitly pass those to Clang via `-Xclang -O1`, since Visual Studio uses the `clang-cl` frontend. The reported value is the mean time taken for one loop of my benchmark, in microseconds. This is a microbenchmark; don't read too much into the specific numbers. Also note that this is of course measuring a whole bunch of things that are NOT about vector code (like the scalar code we always have, loops, etc.).
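For context, the `-O1` configuration boils down to an invocation along these lines (a sketch; the file name is made up):

clang-cl /c -Xclang -O1 particles.cpp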
Setup | O0 | O1 | O2 | O0, no inlining | Og | O0/O1 speedup |
---|---|---|---|---|---|---|
VEC_SCALAR | 4077 | 3001 | 2973 | 4159 | 3094 | 1.33 |
VEC_OP | 8315 | 2979 | 3014 | 8669 | 3061 | 2.79 |
VEC_OP_MEMBER | 8571 | 3003 | 2985 | 8476 | 3076 | 2.85 |
VEC_SSE_RAW | 5575 | 1527 | 1535 | 5564 | 1585 | 3.65 |
VEC_SSE | 6415 | 1553 | 1540 | 6459 | 1597 | 4.13 |
VEC_SSE_OP | 19694 | 1541 | 1532 | 26292 | 1609 | 12.82 |
VEC_SSE_OP_VEC | 19801 | 1536 | 1536 | 21269 | 1573 | 12.89 |
VEC_SSE_OP_VEC_MEM | 20226 | 1547 | 1547 | 21336 | 1564 | 13.07 |
VEC_BUILTIN_OCL | 2932 | 1534 | 1536 | 2892 | 1570 | 1.91 |
VEC_BUILTIN_GCC | 2910 | 1512 | 1521 | 2914 | 1594 | 1.92 |
VEC_BUILTIN_OCL_WRAP | 16477 | 1463 | 1485 | 19328 | 1570 | 11.26 |
What can we learn from this?
- There is no real difference between `-Og`, `-O1` and `-O2` for this code.
- There is a way of writing vector code where the debug version is just as fast as an optimized scalar version: builtin vector types.
- While using SSE intrinsics is not free, just using raw intrinsics is not the worst choice for debug performance. They are nowhere near their intended speed, unfortunately, and combining them with operators is a guarantee for pain in debug builds. The middle ground of putting the SSE type into a struct also comes at a cost.
- However, the real cost comes from using operators: as soon as we wrap the otherwise really decent builtin types in a struct and use operators, we hit a performance cliff.
- `vectorcall` does help, but only if you don't inline. That's expected. It has no practical value here, since in reality we will always want to inline these functions. It also works on member functions, and using member functions makes no difference in any scenario.
The builtin vector types are an interesting option. I have learned a couple of things about them:
- If you use intrinsics, odds are you are already using builtin vectors. This is how `_mm_add_ps` is defined for me (`__DEFAULT_FN_ATTRS` includes `__always_inline__`); it is already using builtin vector types of the GCC variety:
typedef float __v4sf __attribute__((__vector_size__(16)));
typedef float __m128 __attribute__((__vector_size__(16), __aligned__(16)));
static __inline__ __m128 __DEFAULT_FN_ATTRS _mm_add_ps(__m128 __a, __m128 __b)
{
    return (__m128)((__v4sf)__a + (__v4sf)__b);
}
- The main reason not to use builtin vector types is that you have now completely given up control of the interface of your vector types. This might be OK in your scenario.
- A secondary reason not to use builtin vector types is that none of the IDEs I have tried (VS2022, CLion) actually support them fully. VS2022 did a bit better than CLion: the latter just does not know about the operators that the builtin vector types support, while the former doesn't recognize `v.x` as a valid way to access the first component of an OpenCL vector. You can sort-of work around this by stubbing out an implementation like this, but it's not great:
#if defined(__INTELLISENSE__) || defined(__clang_analyzer__) || defined(__JETBRAINS_IDE__)
struct float4 {
    float x, y, z, w;
    float4 operator+(const float4& rhs);
};
#else
typedef float float4 __attribute__((ext_vector_type(4)));
#endif
- The OpenCL variety of built-in vectors does not enjoy great documentation, unfortunately. The GCC variety doesn't support using `v.x` to get the first field of the vector; you have to index instead (see the snippet below). Neither struck me as a lovingly crafted implementation.
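For completeness, this is how component access looks with the GCC variety (a minimal sketch; the `gccfloat4` name is mine):

typedef float gccfloat4 __attribute__((__vector_size__(16)));

float first_component(gccfloat4 v) {
    return v[0]; // subscripting works; v.x does not
}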
Next, I want to talk about why things are so much slower in unoptimized builds. Let us look at some codegen.
For all scenarios with operators, the biggest culprit is that we now tend to create a lot of temporary objects, and even with inlining these objects don’t completely disappear. Let’s pick on this line here specifically:
delta = f * delta / distance;
In `VEC_OP`, you can see that the results of `f * delta` and `(f * delta) / distance` do exist. `f * delta` lives at `rsp+138h`, and the result of the division is at `rsp+148h`. Then we copy that into `delta` at `rsp+158h`:
delta = f * delta / distance;
14CF movss xmm0, dword ptr [rsp+4Ch]
14D5 movss xmm1, dword ptr [rsp+50h]
14DB lea rax, [rsp+138h]
14E3 mov qword ptr [rsp+120h], rax
14EB lea rax, [rsp+158h]
14F3 mov qword ptr [rsp+118h], rax
14FB movss dword ptr [rsp+114h], xmm1
1504 movss xmm1, dword ptr [rsp+114h]
150D mov rax, qword ptr [rsp+118h]
1515 mulss xmm1, dword ptr [rax]
// store result of first multiplication in rsp+138h
1519 movss dword ptr [rsp+138h], xmm1
1522 movss xmm1, dword ptr [rsp+114h]
152B mov rax, qword ptr [rsp+118h]
1533 mulss xmm1, dword ptr [rax+4h]
1538 movss dword ptr [rsp+13Ch], xmm1
1541 movss xmm1, dword ptr [rsp+114h]
154A mov rax, qword ptr [rsp+118h]
1552 mulss xmm1, dword ptr [rax+8h]
1557 movss dword ptr [rsp+140h], xmm1
1560 xorps xmm1, xmm1
1563 movss dword ptr [rsp+144h], xmm1
156C lea rax, [rsp+148h]
1574 mov qword ptr [rsp+108h], rax
157C movss dword ptr [rsp+104h], xmm0
1585 lea rax, [rsp+138h]
158D mov qword ptr [rsp+F8h], rax
1595 mov rax, qword ptr [rsp+F8h]
159D movss xmm0, dword ptr [rax]
15A1 divss xmm0, dword ptr [rsp+104h]
// store result of first division in rsp+148h
15AA movss dword ptr [rsp+148h], xmm0
15B3 mov rax, qword ptr [rsp+F8h]
15BB movss xmm0, dword ptr [rax+4h]
15C0 divss xmm0, dword ptr [rsp+104h]
15C9 movss dword ptr [rsp+14Ch], xmm0
15D2 mov rax, qword ptr [rsp+F8h]
15DA movss xmm0, dword ptr [rax+8h]
15DF divss xmm0, dword ptr [rsp+104h]
15E8 movss dword ptr [rsp+150h], xmm0
15F1 xorps xmm0, xmm0
15F4 movss dword ptr [rsp+154h], xmm0
// copy from rsp+148h to rsp+158h
15FD mov rax, qword ptr [rsp+148h]
1605 mov qword ptr [rsp+158h], rax
160D mov rax, qword ptr [rsp+150h]
1615 mov qword ptr [rsp+160h], rax
161D lea rax, [rsp+128h]
1625 mov qword ptr [rsp+F0h], rax
162D movss xmm0, dword ptr [__real@42700000 (402ch)]
1635 movss dword ptr [rsp+ECh], xmm0
163E lea rax, [rsp+158h]
1646 mov qword ptr [rsp+E0h], rax
To be clear, it is neither surprising nor bad that these values exist. That is likely what you want from an unoptimized build. It just goes to show that the culprit here really is passing (or mostly returning) structs by value.
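Spelled out, even with inlining the line conceptually expands to something like this, with each temporary getting its own stack slot in a debug build (a sketch; the temporary names are mine):

// delta = f * delta / distance; with by-value operators:
float3 t0 = operator*(f, delta);     // lives at rsp+138h above
float3 t1 = operator/(t0, distance); // lives at rsp+148h
delta = t1;                          // copied to rsp+158h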
I also want to show the unoptimized codegen for the builtin vector types for that line:
delta = f * delta / distance;
1296 movss xmm0, dword ptr [rsp+4Ch]
129C shufps xmm0, xmm0, 0h
12A0 movaps xmm1, xmmword ptr [rsp+60h]
12A5 mulps xmm0, xmm1
12A8 movss xmm1, dword ptr [rsp+48h]
12AE shufps xmm1, xmm1, 0h
12B2 divps xmm0, xmm1
12B5 movaps xmmword ptr [rsp+60h], xmm0
This is better than using SSE intrinsics, because as noted above, these so-called SSE intrinsics are not intrinsic. They are functions that take and return things by value, and they create copies that we then pass around:
delta = _mm_div_ps(_mm_mul_ps(_mm_set1_ps(f), delta), _mm_set1_ps(distance));
// copy float argument f around
12EB movss xmm0, dword ptr [rsp+48h]
12F1 movss dword ptr [rsp+1ECh], xmm0
12FA movss xmm0, dword ptr [rsp+1ECh]
1303 shufps xmm0, xmm0, 0h
// copy result of _mm_set1_ps around
1307 movaps xmmword ptr [rsp+1D0h], xmm0
130F movaps xmm1, xmmword ptr [rsp+1D0h]
1317 movaps xmm2, xmmword ptr [rsp+60h]
// copy float argument distance around etc.
131C movss xmm0, dword ptr [rsp+4Ch]
1322 movss dword ptr [rsp+1CCh], xmm0
132B movss xmm0, dword ptr [rsp+1CCh]
1334 shufps xmm0, xmm0, 0h
1338 movaps xmmword ptr [rsp+1B0h], xmm0
1340 movaps xmm0, xmmword ptr [rsp+1B0h]
1348 movaps xmmword ptr [rsp+200h], xmm2
1350 movaps xmmword ptr [rsp+1F0h], xmm0
1358 movaps xmm0, xmmword ptr [rsp+1F0h]
1360 movaps xmm2, xmmword ptr [rsp+200h]
1368 mulps xmm0, xmm2
136B movaps xmmword ptr [rsp+180h], xmm1
1373 movaps xmmword ptr [rsp+170h], xmm0
137B movaps xmm0, xmmword ptr [rsp+170h]
1383 movaps xmm1, xmmword ptr [rsp+180h]
138B divps xmm0, xmm1
138E movaps xmmword ptr [rsp+60h], xmm0
One final question I'd like to answer is this: why is `VEC_SSE` so much faster than `VEC_BUILTIN_OCL_WRAP`? In both cases, we are essentially putting a vector type into a struct and keep copying 4 floats around. The main difference is that `VEC_SSE` directly operates on the contents of the struct (and copies that content), whereas `VEC_BUILTIN_OCL_WRAP` copies the struct itself.
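In source form, the difference is roughly this (a sketch; the wrapper struct and its member name `v` are assumptions, while `.vec` matches the snippets below):

// VEC_SSE: the intrinsic operates on the __m128 member directly.
vel.vec = _mm_add_ps(vel.vec, delta.vec);

// VEC_BUILTIN_OCL_WRAP: operators take and return the wrapper struct
// by value, so whole structs get copied around.
typedef float ocl3 __attribute__((ext_vector_type(3), __aligned__(16)));
struct wrapped3 { ocl3 v; };
inline wrapped3 operator+(wrapped3 a, wrapped3 b) {
    return wrapped3{a.v + b.v};
}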
Here is a sample from `VEC_BUILTIN_OCL_WRAP`:
vel += delta / 60;
1461 mov qword ptr [rsp+F8h], rcx
1469 mov dword ptr [rsp+F0h], __avx10_version (42700000h)
1474 mov qword ptr [rsp+E8h], rax
147C mov rax, qword ptr [rsp+E8h]
1484 movaps xmm0, xmmword ptr [rax]
1487 movaps xmm1, xmmword ptr [rsp+F0h]
148F shufps xmm1, xmm1, 0h
1493 divps xmm0, xmm1
1496 movaps xmmword ptr [rsp+130h], xmm0
149E lea rax, [rsp+130h]
14A6 mov qword ptr [rsp+C8h], rax
14AE lea rax, [rsp+190h]
14B6 mov qword ptr [rsp+C0h], rax
14BE mov rax, qword ptr [rsp+C8h]
14C6 movaps xmm0, xmmword ptr [rax]
14C9 mov rax, qword ptr [rsp+C0h]
14D1 addps xmm0, xmmword ptr [rax]
14D4 movaps xmmword ptr [rax], xmm0
Compare this to the same in `VEC_SSE`:
vel.vec = _mm_add_ps(vel.vec, _mm_div_ps(delta.vec, _mm_set1_ps(60)));
13A5 mov dword ptr [rsp+170h], __avx10_version (42700000h)
13B0 movaps xmm0, xmmword ptr [rsp+170h]
13B8 shufps xmm0, xmm0, 0h
13BC movaps xmmword ptr [rsp+160h], xmm0
13C4 movaps xmm1, xmmword ptr [rsp+160h]
13CC movaps xmm0, xmmword ptr [rsp+60h]
13D1 movaps xmmword ptr [rsp+130h], xmm1
13D9 movaps xmmword ptr [rsp+120h], xmm0
13E1 movaps xmm1, xmmword ptr [rsp+120h]
13E9 divps xmm1, xmmword ptr [rsp+130h]
13F1 movaps xmm0, xmmword ptr [rsp+220h]
13F9 movaps xmmword ptr [rsp+1F0h], xmm1
1401 movaps xmmword ptr [rsp+1E0h], xmm0
1409 movaps xmm0, xmmword ptr [rsp+1E0h]
1411 addps xmm0, xmmword ptr [rsp+1F0h]
1419 movaps xmmword ptr [rsp+220h], xmm0
Note how `VEC_SSE` consistently uses the XMM registers, whereas `VEC_BUILTIN_OCL_WRAP` constantly bounces back and forth: copies go through general purpose registers, math through XMM registers. That's likely where the slowdown comes from. I have not managed to avoid this (I have tried removing the union, for example, and that doesn't help at all).
For the particular use-case I needed, I ended up going with wrapping Clang's vector types in a struct. Yes, the debug performance is terrible. Yes, that means that debug builds need to run with `-Og`. But if you want to retain control over the interface of your `float3` type, then that's the best option I have found. I hope I can one day write a note about how `-Og` is actually great for debuggability, but as it stands I have not run those experiments yet.
1. You can see the difference in the command line by comparing the output of these two commands: `clang -Og -### input.cpp` vs. `clang -O1 -### input.cpp`. You can compare the enabled optimization passes and their arguments by likewise comparing `-Og -mllvm -debug-pass=Structure input.cpp` vs. `-O1 -mllvm -debug-pass=Structure input.cpp`. Both are very easy to do on Compiler Explorer. ↩