bool4 and an unexpected new perspective


This blog post is 20 paragraphs exposition and 2 paragraphs punchline. It is a follow-up to my previous post on vectorization. I mentioned there that Unity’s math package implements bool4 like this:

struct bool4
{
    [MarshalAs(UnmanagedType.U1)]
    public bool x;
    [MarshalAs(UnmanagedType.U1)]
    public bool y;
    [MarshalAs(UnmanagedType.U1)]
    public bool z;
    [MarshalAs(UnmanagedType.U1)]
    public bool w;
}

This is a 4 byte struct where every single byte represents a boolean. The C# spec mandates that false is represented by 0, and that true has some representation other than zero. The particular representation is thus implementation-defined. In Unity’s case, the most relevant implementation is IL2CPP, which represents this struct as 4 C++-bools in a trenchcoat in a struct, where true is represented by 1.

This is inconvenient, since the typical result of comparing two float4 in SSE is a 128-bit vector whose components are either 0xFFFFFFFF or 0x00000000. This is readily usable as a mask for _mm_and_ps and friends. Alas, bool4 does not fit[1]. But I can’t get rid of it, so I need to live with it.

Naively converting from the nice mask to the ugly bool4 yields code like this:

movmskps     r8d, xmm0
movzx        eax, r8b
and          al, 1h
mov          byte ptr [rsp+8h], al
mov          eax, r8d
shr          eax, 1h
and          al, 1h
mov          byte ptr [rsp+9h], al
mov          eax, r8d
shr          eax, 2h
and          al, 1h
shr          r8d, 3h
mov          byte ptr [rsp+Ah], al

This uses movmskps to collect the top bit of the four floats into an int32, and then checks the bits of that int one-by-one.
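Written out in C, the naive approach corresponds roughly to the sketch below. The bool4_bytes struct and the function name are hypothetical stand-ins I made up for illustration; they are not Unity's or IL2CPP's actual types.

```c
#include <stdint.h>
#include <xmmintrin.h>  // SSE: _mm_movemask_ps

// Hypothetical C stand-in for bool4: four one-byte booleans.
typedef struct { uint8_t x, y, z, w; } bool4_bytes;

// Naive conversion: gather the four sign bits into an int,
// then peel them off one at a time.
static inline bool4_bytes mask_to_bool4_naive(__m128 cmp)
{
    int bits = _mm_movemask_ps(cmp);  // bit i = top bit of lane i
    bool4_bytes r;
    r.x = (uint8_t)(bits & 1);
    r.y = (uint8_t)((bits >> 1) & 1);
    r.z = (uint8_t)((bits >> 2) & 1);
    r.w = (uint8_t)((bits >> 3) & 1);
    return r;
}
```

Each component costs its own shift, mask, and byte store, which is exactly the pattern in the disassembly.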

Can we do better? Yes, we can! First, we can note that bool4 can still be treated as a uint32_t, which makes a lot of the handling (like binary and, or, xor) at least somewhat nice. We could also choose to not use 1 as the value for true and just settle for non-zero, but this would make comparisons for equality a little bit uglier and also runs the risk of not being compatible with some existing assumptions.
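To make the uint32_t view concrete, here is a small sketch in C. It assumes every byte is exactly 0 or 1; the bool4_bytes struct and all function names are hypothetical and only serve the illustration.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical C stand-in for bool4: four one-byte booleans.
typedef struct { uint8_t x, y, z, w; } bool4_bytes;

// Reinterpret the four bytes as one 32-bit integer and back.
// memcpy avoids aliasing trouble and compiles to a plain move.
static inline uint32_t bool4_to_u32(bool4_bytes b)
{
    uint32_t u;
    memcpy(&u, &b, sizeof u);
    return u;
}

static inline bool4_bytes u32_to_bool4(uint32_t u)
{
    bool4_bytes b;
    memcpy(&b, &u, sizeof b);
    return b;
}

// With true == 1 in every byte, the logical operations are single
// integer instructions on the whole struct.
static inline bool4_bytes bool4_and(bool4_bytes a, bool4_bytes b)
{
    return u32_to_bool4(bool4_to_u32(a) & bool4_to_u32(b));
}

static inline bool4_bytes bool4_xor(bool4_bytes a, bool4_bytes b)
{
    return u32_to_bool4(bool4_to_u32(a) ^ bool4_to_u32(b));
}

static inline bool4_bytes bool4_not(bool4_bytes a)
{
    return u32_to_bool4(bool4_to_u32(a) ^ 0x01010101u);
}

static inline int bool4_any(bool4_bytes a) { return bool4_to_u32(a) != 0; }
static inline int bool4_all(bool4_bytes a) { return bool4_to_u32(a) == 0x01010101u; }
```

The constant 0x01010101 reads the same in either byte order, so nothing here depends on endianness. None of this changes how we get from the SSE mask into those four bytes, though.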

A nicer version might look like this:

__m128i v = _mm_castps_si128(_mm_cmpgt_ps(a, b)); // example
v = _mm_packs_epi16(v, _mm_undefined_si128());
v = _mm_packs_epi16(v, _mm_undefined_si128());
uint32_t mask = ((uint32_t)_mm_cvtsi128_si32(v)) & 0x01010101u;

We use a saturating pack, which turns 16-bit integers into 8-bit integers. The only two 16-bit values we care about are 0 and -1 (= 0xFFFF), and those just turn into 0 and -1 (= 0xFF). By doing this twice, we pack each four-byte lane of the mask into one byte, and then we can just mask it down to the bits we want and write the result to the bool4 directly. For example:

               00000000 FFFFFFFF 00000000 FFFFFFFF
packs_epi16 -> 0000FFFF 0000FFFF XXXXXXXX XXXXXXXX
packs_epi16 -> 00FF00FF XXXXXXXX XXXXXXXX XXXXXXXX
and         -> 00010001
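Wrapped into a function, the packing steps might look like the sketch below. The function name is made up, and I pass v as the second operand instead of _mm_undefined_si128() purely so that it compiles on toolchains lacking that intrinsic; the upper half of the result is ignored either way.

```c
#include <stdint.h>
#include <emmintrin.h>  // SSE2: _mm_packs_epi16, _mm_cvtsi128_si32

// Turn a 4 x 32-bit comparison mask (each lane 0 or 0xFFFFFFFF) into
// four bytes of 0 or 1, returned as the uint32_t view of a bool4.
static inline uint32_t mask_to_bool4_bits(__m128 cmp)
{
    __m128i v = _mm_castps_si128(cmp);
    // First pack: every 16-bit half (0 or -1) saturates to a byte (0 or -1),
    // so each 32-bit lane shrinks to 16 bits in the low 64 bits.
    v = _mm_packs_epi16(v, v);
    // Second pack: each lane is now a single byte in the low 32 bits.
    v = _mm_packs_epi16(v, v);
    // Keep only the lowest bit of every byte: 0xFF becomes 0x01.
    return (uint32_t)_mm_cvtsi128_si32(v) & 0x01010101u;
}
```

On x86, byte 0 of the result is lane 0 (x), so the value can be copied straight into the four bytes of the struct.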

So what about the inverse? float4 uses bool4 in select(float4, float4, bool4), for example, and there it would be useful to go from the 32-bit bool4 to the full 128-bit mask again. Let’s pretend that we have already gone from single bits to a mask of 00 and FF bytes, e.g. from 0x00010001 to 0x00FF00FF. Going from this to the full 128 bits again is just _mm_unpacklo_epi8 twice:

__m128i v = _mm_cvtsi32_si128(mask);
v = _mm_unpacklo_epi8(v, v);
v = _mm_unpacklo_epi8(v, v);

For example:

                 00FF00FF XXXXXXXX XXXXXXXX XXXXXXXX // these upper bits are 0, but they are irrelevant
unpacklo_epi8 -> 0000FFFF 0000FFFF XXXXXXXX XXXXXXXX
unpacklo_epi8 -> 00000000 FFFFFFFF 00000000 FFFFFFFF
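As a sketch of how the expanded mask could then feed a select: the helper names below are hypothetical, the input is assumed to already consist of 0x00 and 0xFF bytes, and I am assuming select(a, b, c) picks b where c is true and a otherwise, which is how I understand Unity's version.

```c
#include <stdint.h>
#include <emmintrin.h>  // SSE2: _mm_unpacklo_epi8, _mm_cvtsi32_si128

// Expand four bytes of 0x00 / 0xFF into four 32-bit lanes of
// 0x00000000 / 0xFFFFFFFF.
static inline __m128 byte_mask_to_lane_mask(uint32_t byte_mask)
{
    __m128i v = _mm_cvtsi32_si128((int)byte_mask);
    v = _mm_unpacklo_epi8(v, v);  // every byte doubled: 8 -> 16 bits per lane
    v = _mm_unpacklo_epi8(v, v);  // doubled again: 16 -> 32 bits per lane
    return _mm_castsi128_ps(v);
}

// Per-lane select: b where the mask is set, a where it is clear.
static inline __m128 select_ps(__m128 a, __m128 b, __m128 mask)
{
    return _mm_or_ps(_mm_andnot_ps(mask, a), _mm_and_ps(mask, b));
}
```

With SSE4.1 available, _mm_blendv_ps could replace the and/andnot/or combination in select_ps.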

OK, cute. But none of this is why I decided to write a blog post (…not that my threshold for blog post writing is especially high…).

No, the noteworthy thing is this: How do we take that first step and go from 0x00010001 to 0x00FF00FF? Well, we multiply by 0xFF. Every byte is either 0 or 1, so each product fits into its own byte and nothing carries over into the neighboring byte. Do you also see the convolution in the ‘digit domain’ here? It looks as if we are convolving 0x00010001 with 0xFF, neat! I had never considered this before, though I am certainly not the first to make this observation. But what a welcome new perspective on multiplication this is, even if it is very much not satisfying that it does not deal with carries: the Schönhage–Strassen algorithm I just linked to also does a convolution first and then deals with the carries as a separate step. For the purposes of bit manipulations, the convolution perspective still seems very useful.
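To convince myself of the convolution view, here is a toy sketch in C that multiplies two 32-bit numbers by convolving their base-256 digits and resolving the carries in a separate pass. This only illustrates the idea and is in no way Schönhage–Strassen itself (which computes the convolution with FFT-like transforms); both function names are made up.

```c
#include <stdint.h>

// The trick from the text: with every byte 0 or 1, multiplying by 0xFF
// turns each 0x01 into 0xFF without any carries between bytes.
static inline uint32_t bool_bytes_to_byte_mask(uint32_t bits)
{
    return bits * 0xFFu;
}

// Multiply two 32-bit numbers in two separate steps:
// (1) convolve their base-256 digit sequences, (2) propagate carries.
static inline uint64_t mul_by_digit_convolution(uint32_t a, uint32_t b)
{
    uint32_t conv[7] = {0};  // 4 digits convolved with 4 digits give 7 outputs
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            uint32_t da = (a >> (8 * i)) & 0xFFu;
            uint32_t db = (b >> (8 * j)) & 0xFFu;
            conv[i + j] += da * db;  // at most 4 * 255 * 255, no overflow
        }
    }

    // Carry pass: each convolution output may exceed one digit, so the
    // excess is pushed into the next digit.
    uint64_t result = 0;
    uint64_t carry = 0;
    for (int k = 0; k < 7; ++k) {
        uint64_t t = conv[k] + carry;
        result |= (t & 0xFFu) << (8 * k);
        carry = t >> 8;
    }
    result |= carry << 56;  // the final carry is the top digit
    return result;
}
```

For 0x00010001 times 0xFF, the convolution outputs are 0xFF, 0, 0xFF, 0, and so on: every output already fits in one digit, so the carry pass has nothing to do. That is the situation the bool4 trick lives in.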

Can I just say how utterly delighted I am that looking at something seemingly ugly (bool4) leads to a fresh new perspective, from a completely unexpected angle? Had I not also insisted on keeping single bits in bool4 (instead of 0xFF), this detour also would not have happened. While I’m still not convinced that bool4 is the right thing, I can now appreciate that it gave me something. Maybe I’ll keep that in mind before I laugh again that Unity Mathematics also goes the full length and defines not just bool vectors, but also bool matrices: and who knows, maybe I will also learn something new by looking at bool4x2 soon.


  1. A sophisticated compiler with knowledge of bool4 would just eliminate all intermediate bool4 instances in cases where we take a mask, then go to bool4, and then back to a mask. But that requires looking at multiple operations, and that’s not what we are doing today. 
