```
#include <cstdint>
#include <cstring>

inline uint8_t pack8bools(const bool* a)
{
    uint64_t t;
    memcpy(&t, a, sizeof t);  // strict-aliasing & alignment safe load
    return 0x8040201008040201ULL * t >> 56;
    // bit order: a[0]<<7 | a[1]<<6 | ... | a[7]<<0 on little-endian
    // for a[0] => LSB, use 0x0102040810204080ULL on little-endian
}

void unpack8bools(uint8_t b, bool* a)
{
    // on little-endian, a[0] = (b>>7) & 1, matching printing order
    auto MAGIC = 0x8040201008040201ULL;  // for the opposite order, byte-reverse this
    auto MASK  = 0x8080808080808080ULL;
    uint64_t t = ((MAGIC * b) & MASK) >> 7;
    memcpy(a, &t, sizeof t);  // store 8 bytes without UB
}
```

Assuming `sizeof(bool) == 1`.

To portably get LSB <-> `a[0]` order (like the `pext`/`pdep` version below) instead of using the opposite of host endianness, use `htole64(0x0102040810204080ULL)` as the magic multiplier in both versions. (`htole64` is from BSD / GNU `<endian.h>`.) `htobe64` with the same constant gives the other order, MSB-first, like you'd use for printing a number in base 2.
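A quick sketch of that portable LSB-first variant, assuming `<endian.h>` is available (a glibc / BSD extension, not ISO C++); the function name is made up for illustration:

```cpp
#include <cstdint>
#include <cstring>
#include <endian.h>   // htole64: glibc / BSD extension, not ISO C++

// a[0] maps to bit 0 of the result regardless of host endianness:
// htole64 byte-swaps the constant only on big-endian hosts.
inline uint8_t pack8bools_lsb_first(const bool* a)
{
    uint64_t t;
    memcpy(&t, a, sizeof t);
    return htole64(0x0102040810204080ULL) * t >> 56;
}
```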

You may want to make sure that the bool array is 8-byte aligned (`alignas(8)`) for performance, and that the compiler knows this. `memcpy` is always safe for any alignment, but on ISAs that require alignment, a compiler can only inline `memcpy` as a single load or store instruction if it knows the pointer is sufficiently aligned. `*(uint64_t*)a` would promise alignment, but would also violate the strict-aliasing rule. Even on ISAs that allow unaligned loads, they can be faster when naturally aligned; on those ISAs the compiler can still inline `memcpy` as one instruction without seeing an alignment guarantee at compile time.
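For instance (a minimal sketch; `flags` is a hypothetical name), declaring the array with `alignas(8)` lets the compiler prove the load is aligned and inline the `memcpy` as one 64-bit load even on strict-alignment ISAs:

```cpp
#include <cstdint>
#include <cstring>

inline uint8_t pack8bools(const bool* a)
{
    uint64_t t;
    memcpy(&t, a, sizeof t);
    return 0x8040201008040201ULL * t >> 56;   // a[0] ends up in the MSB
}

// alignas(8) guarantees the whole array fits one naturally aligned qword
alignas(8) bool flags[8] = {true, true, false, false, true, false, false, false};
uint8_t packed = pack8bools(flags);
```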

## How they work

Suppose we have 8 bools `b[0]` to `b[7]` whose least significant bits are named a to h respectively, and we want to pack them into a single byte. Treat those 8 consecutive `bool`s as one 64-bit word and load it: on a little-endian machine we get the bits in reversed order, with a in the lowest byte. Now we do a multiplication (here dots are zero bits):

```
   |  b7  ||  b6  ||  b5  ||  b4  ||  b3  ||  b2  ||  b1  ||  b0  |
   .......h.......g.......f.......e.......d.......c.......b.......a
 × 1000000001000000001000000001000000001000000001000000001000000001
 ────────────────────────────────────────────────────────────────
   ↑......h.↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
   ↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
   ↑....f...↑...e....↑..d.....↑.c......↑b.......a
 + ↑...e....↑..d.....↑.c......↑b.......a
   ↑..d.....↑.c......↑b.......a
   ↑.c......↑b.......a
   ↑b.......a
   a
 ────────────────────────────────────────────────────────────────
 = abcdefghxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```

The arrows are added so it's easier to see the positions of the set bits in the magic number. At this point the 8 least significant bits of the input bytes have all been collected in the top byte; the `>> 56` shifts that byte down, discarding the remaining bits.

So the magic number for packing is `0b1000000001000000001000000001000000001000000001000000001000000001`, i.e. `0x8040201008040201`. On a big-endian machine you'd instead use `0x0102040810204080`, which is derived in the same way.
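The structure of the constant (one set bit every 9 positions, so each shifted copy lands one bit further left within the next byte) can be sanity-checked by building it in a loop; a quick sketch:

```cpp
#include <cstdint>

// Build the pack multiplier: one set bit at positions 0, 9, 18, ..., 63.
inline uint64_t pack_magic()
{
    uint64_t magic = 0;
    for (int i = 0; i < 8; ++i)
        magic |= 1ULL << (9 * i);
    return magic;   // == 0x8040201008040201
}
```

Byte-reversing this value gives the big-endian counterpart `0x0102040810204080`.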

For unpacking we can do a similar multiplication

```
   |  b7  ||  b6  ||  b5  ||  b4  ||  b3  ||  b2  ||  b1  ||  b0  |
                                                           abcdefgh
 × 1000000001000000001000000001000000001000000001000000001000000001
 ────────────────────────────────────────────────────────────────
 = h0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh
 & 1000000010000000100000001000000010000000100000001000000010000000
 ────────────────────────────────────────────────────────────────
 = h0000000g0000000f0000000e0000000d0000000c0000000b0000000a0000000
```

After multiplying, the needed bits sit at the most significant position of each byte, so we mask out the irrelevant bits and shift the survivors down to the least significant position of each byte. The resulting bytes contain a to h, and a little-endian store writes them back to `a[0]`..`a[7]`.
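Putting the two halves together, here is a quick round-trip check of the multiply versions above (a sketch, assuming a little-endian target such as x86-64):

```cpp
#include <cstdint>
#include <cstring>

inline uint8_t pack8bools(const bool* a)
{
    uint64_t t;
    memcpy(&t, a, sizeof t);
    return 0x8040201008040201ULL * t >> 56;   // a[0] -> MSB on little-endian
}

inline void unpack8bools(uint8_t b, bool* a)
{
    uint64_t t = (0x8040201008040201ULL * b & 0x8080808080808080ULL) >> 7;
    memcpy(a, &t, sizeof t);                  // one 0/1 byte per bool
}
```

Unpacking the packed byte should reproduce the original array exactly.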

## The efficient way

On newer x86 CPUs with BMI2 there are PEXT and PDEP instructions for this purpose. The `pack8bools` function above can be replaced with

```
uint64_t t;
memcpy(&t, a, sizeof t);   // avoids the strict-aliasing UB of *(uint64_t*)a
return _pext_u64(t, 0x0101010101010101ULL);
```

And the `unpack8bools` function can be implemented as

```
uint64_t t = _pdep_u64(b, 0x0101010101010101ULL);
memcpy(a, &t, sizeof t);
```

(This maps LSB -> LSB, like a `0x0102040810204080ULL` multiplier constant and opposite of `0x8040201008040201ULL`. x86 is little-endian, so `a[0] = (b>>0) & 1;` after the store.)

Unfortunately those instructions are very slow on AMD CPUs before Zen 3, so you may need to benchmark against the multiplication method above to see which is better on your targets.

### The other fast way is SSE2

x86 SIMD has an operation that takes the high bit of every byte (or float or double) in a vector register and gives it to you as an integer. The instruction for bytes is `pmovmskb`. This can of course do 16 bytes at a time with the same number of instructions, so it gets better than the multiply trick if you have lots of this to do.

```
#include <immintrin.h>
inline uint8_t pack8bools_SSE2(const bool* a)
{
    __m128i v = _mm_loadl_epi64( (const __m128i*)a );  // 8-byte load, despite the pointer type
    // __m128i v = _mm_cvtsi64_si128( uint64 );        // alternative if you already have an 8-byte integer
    v = _mm_slli_epi32(v, 7);     // low bit of each byte becomes the highest
    return _mm_movemask_epi8(v);
}
```
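Since `pmovmskb` covers a full 16-byte vector, a 16-bool version is essentially the same code; a sketch (the function name `pack16bools_SSE2` is made up here, assuming an x86-64 target where SSE2 is baseline):

```cpp
#include <cstdint>
#include <immintrin.h>

// Pack 16 bools (one 0/1 byte each) into a 16-bit mask: a[i] -> bit i.
inline uint16_t pack16bools_SSE2(const bool* a)
{
    __m128i v = _mm_loadu_si128((const __m128i*)a);  // 16-byte unaligned load
    v = _mm_slli_epi32(v, 7);        // move bit 0 of each byte to bit 7
    return (uint16_t)_mm_movemask_epi8(v);
}
```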

There isn’t a single instruction to unpack until AVX-512, which has mask-to-vector instructions. It is doable with SIMD, but likely not as efficient as the multiply trick. See *Convert 16 bits mask to 16 bytes mask* and, more generally, *is there an inverse instruction to the movemask instruction in intel avx2?* for unpacking bitmaps to other element sizes.

*How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD* has some answers specifically for 8 bits -> 8 bytes, but if you can’t do 16 bits at a time for that direction, the multiply trick is probably better, and `pext` certainly is (except on CPUs where it’s disastrously slow, like AMD before Zen 3).