Replies: 4 comments 2 replies
- cc @oerling
- CC @Yuhta who might be interested in this idea.
- This is the issue for #2217. @Yuhta can you please review the PR?
- It's really weird that why
Parquet uses RLE and bit-packing encodings extensively. The usages in Parquet V1 include repetition levels, definition levels, and the dictionary indices in dictionary-encoded data pages.
It’s crucial to have a fast bit unpacking library for the Parquet reader. In this doc we present the design and lessons learnt from implementing this little lib.
Bulk Read Contiguous Values
SIMD Algorithms
AVX2 and BMI2 are available on almost all x86/amd64 CPUs nowadays. We are going to use the `_pdep_u64` intrinsic from BMI2 and the `_mm256_cvtepi[x]_epi[y]` intrinsics from AVX2. In Lemire's https://github.com/lemire/LittleIntPacker, bmipacking32.c uses these intrinsics for some bit widths but uses non-SIMD implementations for the others. I guess those were faster than the SIMD intrinsics when that library was written, but that was many years ago, and what was faster then might not be faster now: on Intel CPUs these SIMD intrinsics are faster than they were 10 years ago, while other instructions have not improved as much. In my benchmarks on Intel Coffee Lake, the algorithms using these SIMD intrinsics outperformed all the non-SIMD implementations with manually unrolled loops and plain bit operations like "and", "or", and shifts.

However, note that these intrinsics are slower on AMD CPUs, and M1 and other ARM processors may also have different costs, so the design for those CPUs would be different. This design is for Intel CPUs that support AVX2 only.
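To make the role of `_pdep_u64` concrete, here is a minimal, self-contained sketch; the 4-bit mask and values are illustrative and match the uint8_t example discussed below:

```cpp
#include <immintrin.h> // _pdep_u64 (BMI2); compile with -mbmi2
#include <cassert>
#include <cstdint>

int main() {
  // Deposit the low bits of the source into the positions selected by the
  // mask: with the low 4 bits of every byte set in the mask, each packed
  // 4-bit value lands in its own output byte.
  uint64_t packed = 0x12345678;           // eight 4-bit values
  uint64_t mask = 0x0f0f0f0f0f0f0f0fULL;  // low 4 bits of every byte
  uint64_t unpacked = _pdep_u64(packed, mask);
  assert(unpacked == 0x0102030405060708ULL);
  return 0;
}
```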
The actual design also depends on the input bit width and the output type. The output types we support now are:
- uint8_t
- uint16_t
- uint32_t
The input bit width for a certain output type is in [1, bits in output type]. In Parquet, all bit-packed runs contain a multiple of 8 values. In our implementation, we process a multiple of 8 values at a time. One advantage of this is that we always process full bytes in every iteration without having to shift bits, which is quite expensive.
Now we talk about the implementation for each output type.
uint8_t
When the output is uint8_t, we process 8 values at a time, and the input buffer pointer advances by bitWidth bytes per iteration. We load the bytes into an integer, then use `_pdep_u64` to deposit/scatter the bits of this integer into a `uint64_t` using a predefined mask. For example, if the input bitWidth is 4, the mask is 0x0f0f0f0f0f0f0f0f, and the bit-packed value 0x12345678 is unpacked to 0x0102030405060708.

Note that bitWidth <= 8 here. If we cast the input buffer to `uint64_t*` like this
```cpp
uint64_t val = *reinterpret_cast<const uint64_t*>(inputBuffer)
```
and don't specially handle the tail, the code might fail with a segmentation fault, because the last 8 values occupy fewer than 64 bits. In the real implementation, we could either dereference the input buffer directly after casting it to `uint64_t*` or `uint32_t*`, with special handling of the last 8 values, or we could use `std::memcpy` to copy exactly the number of bytes needed. However, I found that this sometimes affects performance a lot: the clang compiler SOMETIMES fails to optimize the loop that uses `std::memcpy`.
Let's look at the disassembly of two simple functions, one loading with a pointer cast and one with `std::memcpy`.
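A minimal sketch of the two load styles under comparison (the names are illustrative, not the actual benchmark code):

```cpp
#include <cstdint>
#include <cstring>

// Variant 1: cast and dereference. Always reads 8 bytes, so it assumes at
// least 8 readable bytes remain at 'p'; the last 8 values of a run need
// separate handling.
inline uint64_t loadByCast(const char* p) {
  return *reinterpret_cast<const uint64_t*>(p);
}

// Variant 2: copy exactly the bytes needed (bitWidth bytes hold 8 values
// when the output is uint8_t). Safe at the end of the buffer, but clang
// sometimes fails to optimize the loop around it.
inline uint64_t loadByMemcpy(const char* p, uint8_t bitWidth) {
  uint64_t v = 0;
  std::memcpy(&v, p, bitWidth);
  return v;
}
```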
We can see that the implementation using memcpy spends many more instructions calculating addresses. In my benchmarks the memcpy-based implementation took 1998us to decode 8M values, while the cast-based one only took 266us. This might be a clang issue and it does not always happen, but to avoid such a possible regression, I chose to use casting in the loop and fetch the last 8 values with `std::memcpy`. By the way, it seems Lemire's bmipacking implementation doesn't handle this and always assumes there are >= 8 bytes left in the buffer.
Note that even though we don't use all 64 bits read in one iteration, and the read might not be aligned, this doesn't seem to affect performance. This is likely a result of CPU caching: a cache line (usually 64 bytes) is much larger than 8 bytes, so subsequent reads are served from the cache.
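Putting the pieces together, here is a minimal sketch of the uint8_t path as described above; the mask table and function name are illustrative, not the actual Velox code (compile with -mbmi2):

```cpp
#include <immintrin.h> // _pdep_u64 (BMI2)
#include <cstdint>
#include <cstring>

// For each bitWidth, the mask has the low bitWidth bits of every output
// byte set (index 0 is unused).
static const uint64_t kPdepMask8[9] = {
    0,
    0x0101010101010101ULL, 0x0303030303030303ULL, 0x0707070707070707ULL,
    0x0f0f0f0f0f0f0f0fULL, 0x1f1f1f1f1f1f1f1fULL, 0x3f3f3f3f3f3f3f3fULL,
    0x7f7f7f7f7f7f7f7fULL, 0xffffffffffffffffULL};

// Unpack 'numValues' bitWidth-bit values into uint8_t. 'numValues' must be
// a positive multiple of 8, as in Parquet bit-packed runs.
void unpackToUint8(const char* input, uint8_t bitWidth, uint64_t numValues,
                   uint8_t* output) {
  const uint64_t mask = kPdepMask8[bitWidth];
  uint64_t i = 0;
  // Main loop: cast + dereference. Each 8-byte read may cover more than the
  // bitWidth bytes consumed in this iteration, but it stays inside the
  // buffer while more than 8 values remain.
  for (; i + 8 < numValues; i += 8) {
    uint64_t word = *reinterpret_cast<const uint64_t*>(input);
    uint64_t unpacked = _pdep_u64(word, mask);
    std::memcpy(output, &unpacked, 8);
    input += bitWidth;
    output += 8;
  }
  // Last 8 values: only bitWidth bytes are left, so copy exactly that many
  // to avoid reading past the end of the buffer.
  uint64_t word = 0;
  std::memcpy(&word, input, bitWidth);
  uint64_t unpacked = _pdep_u64(word, mask);
  std::memcpy(output, &unpacked, 8);
}
```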
When the input bitWidth is 8, we could just do a simple memcpy. However, clang does a good job here and using memcpy has the SAME performance as the algorithm using `_pdep_u64`, so to simplify the code I didn't special-case it.

Another takeaway is that manually unrolling the loop does NOT make any difference. There is no data dependency among the loop iterations, and a plain loop keeps the code shorter.
uint16_t
bitWidth in [1, 7]: Process 2 * bitWidth bytes (16 values) at a time.
Copy 2 * bitWidth bytes into 2 integers, 8 values each. Then call `_pdep_u64` on each of the integers to deposit the values from bitWidth bits to 8 bits wide, and store the 8 * 2 output values to a piece of memory aligned to 16 bytes. Now use `_mm256_cvtepu8_epi16` to widen these 16 values in the register and store them back to memory.

Note that using memcpy for `uint16_t` doesn't make the big difference that we saw for `uint8_t`. It is even slightly faster than dereferencing the memory to a `uint64_t` integer and shifting its bits before the second `_pdep_u64`.

Another observation is that manually unrolling the loop to process 32 or 64 values in one iteration does NOT make any difference.
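A minimal sketch of one iteration of this path (illustrative names; it assumes at least 8 readable bytes at each load, with the end of the buffer handled separately as in the uint8_t case):

```cpp
#include <immintrin.h> // _pdep_u64 (BMI2), _mm256_cvtepu8_epi16 (AVX2)
#include <cstdint>
#include <cstring>

// Unpack 16 bitWidth-bit values (bitWidth in [1, 7]) into uint16_t.
// 'mask' has the low bitWidth bits of every byte set, as in the uint8_t path.
void unpack16ToUint16(const char* input, uint8_t bitWidth, uint64_t mask,
                      uint16_t* output) {
  // Two groups of 8 values, bitWidth bytes each.
  uint64_t lo, hi;
  std::memcpy(&lo, input, sizeof(lo));
  std::memcpy(&hi, input + bitWidth, sizeof(hi));

  // Deposit each group so that every value occupies one byte.
  alignas(16) uint64_t bytes[2];
  bytes[0] = _pdep_u64(lo, mask);
  bytes[1] = _pdep_u64(hi, mask);

  // Widen the 16 bytes to 16 uint16_t lanes and store them.
  __m128i packed = _mm_load_si128(reinterpret_cast<const __m128i*>(bytes));
  __m256i widened = _mm256_cvtepu8_epi16(packed);
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(output), widened);
}
```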
bitWidth = 8: Process 2 * bitWidth bytes (16 values) at a time.
For this case we simply call `_mm256_cvtepu8_epi16` without `_pdep_u64`. This is about 10% faster than the above implementation for bitWidth 1-7.

bitWidth in [9, 15]: Process bitWidth bytes (2 * 4 values) at a time.
We read the memory twice in each iteration of the loop. The first time we read ceil(bitWidth / 2) bytes, and the second time the rest of the bytes that make up the 8 values. E.g. if bitWidth = 9, we read 5 bytes into the first integer and the next 4 bytes into the second integer. Then we use `_pdep_u64` to deposit the 4 values in the first integer to 16 bits each directly. Next we shift the first integer right by 36 bits and the second integer left by 4 bits, and OR the two results into a third integer; this uses up the remaining 4 unhandled bits of the first integer. Finally we call `_pdep_u64` on the third integer and store the second 4 values to memory.

When the bitWidth is even, we don't need to shift the integers. However, special-casing the even bitWidths doesn't change the performance, so to keep it simple we use the above implementation for both odd and even bitWidths.
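A minimal sketch of one iteration for this range (illustrative names; `mask` has the low bitWidth bits set in each 16-bit lane, e.g. 0x01FF01FF01FF01FF for bitWidth = 9):

```cpp
#include <immintrin.h> // _pdep_u64 (BMI2)
#include <cstdint>
#include <cstring>

// Unpack 8 bitWidth-bit values (bitWidth in [9, 15]) into uint16_t.
void unpack8ToUint16(const char* input, uint8_t bitWidth, uint64_t mask,
                     uint16_t* output) {
  const uint32_t firstBytes = (bitWidth + 1) / 2;        // ceil(bitWidth / 2)
  const uint32_t secondBytes = bitWidth - firstBytes;    // rest of the 8 values
  const uint32_t usedBits = 4 * bitWidth;                // bits of 4 values
  const uint32_t spareBits = 8 * firstBytes - usedBits;  // 4 if odd, 0 if even

  uint64_t first = 0, second = 0;
  std::memcpy(&first, input, firstBytes);
  std::memcpy(&second, input + firstBytes, secondBytes);

  // The first 4 values come straight from the first integer.
  uint64_t out0 = _pdep_u64(first, mask);
  // The next 4 values are the leftover bits of 'first' followed by 'second'.
  uint64_t third = (first >> usedBits) | (second << spareBits);
  uint64_t out1 = _pdep_u64(third, mask);

  std::memcpy(output, &out0, sizeof(out0));
  std::memcpy(output + 4, &out1, sizeof(out1));
}
```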
bitWidth = 16
Just memcpy. It's faster than the above.
uint32_t
bitWidth in [1, 7]: Process 2 * bitWidth bytes (16 values) at a time.
The same as uint16_t.
Manually unrolling does NOT make a difference.
bitWidth = 8: Process 8 bytes (8 values) at a time.
Directly use `_mm256_cvtepu8_epi32` to convert 8 bits to 32 bits.

bitWidth in [9, 15]: Process bitWidth bytes (2 * 4 values) at a time.
Same as uint16_t.
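For the uint32_t, bitWidth = 8 case above, a minimal sketch of the direct widening (illustrative, assuming AVX2):

```cpp
#include <immintrin.h> // _mm_loadl_epi64, _mm256_cvtepu8_epi32 (AVX2)
#include <cstdint>

// Widen 8 byte-aligned uint8_t values directly to uint32_t; no _pdep_u64 is
// needed because each value already occupies a full byte.
void widen8To32(const uint8_t* input, uint32_t* output) {
  // Load the 8 input bytes into the low half of an XMM register.
  __m128i bytes = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(input));
  // Zero-extend each byte to a 32-bit lane and store 8 uint32_t values.
  __m256i widened = _mm256_cvtepu8_epi32(bytes);
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(output), widened);
}
```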
bitWidth = 16
Directly use `_mm256_cvtepu16_epi32` to convert 16 bits to 32 bits.

Other Considerations
Function overloading vs. templates
We could use templated functions with the output type as a template parameter, or function overloading. There are two reasons I chose function overloading over templates:
Class vs. Global Functions In Utility Headers
We could wrap the input and output buffers in a class and provide public functions to unpack the data. However, benchmarks show that calling the functions on a class instance is slower than making them global functions. This may be related to function dispatch cost and/or inlining, so I chose to put the functions in a header file. The full implementation is about 500 lines of code and is pretty small.
Function Pointers vs. Switches
The performance of these two is about the same. Since the input bitWidth and output type are fixed for a given column, the CPU can predict the correct function or switch branch, and the cost of function dispatch is minimal.
Benchmark Results
The following benchmark results are from BitUnpackingBenchmark.cpp, unpacking 8M values. The code was compiled with Apple clang version 12.0.5 (clang-1205.0.22.11) on macOS, with an Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz (Coffee Lake). The results are in microseconds (us).
Read and Skip Values In a RowSet
Sometimes queries have filters. Velox reads the columns with filters first and skips the rows that didn't pass the filters when reading the later columns. Our bit unpacking library will support reading non-contiguous values.
(To Be Finished)
Push Down Filters To Bit Packed Data
(To Be Finished)