Replies: 4 comments 2 replies
- cc @oerling
- CC @Yuhta who might be interested in this idea.
- This is the issue for #2217. @Yuhta can you please review the PR?
- It's really weird that why
Parquet uses RLE and bit-packing encodings extensively. The usages in Parquet V1 include repetition levels, definition levels, and the dictionary indices in dictionary-encoded data pages.
It’s crucial to have a fast bit unpacking library for the Parquet reader. In this doc we present the design and lessons learnt from implementing this little lib.
Bulk Read Contiguous Values
SIMD Algorithms
AVX2 and BMI2 are available on almost all x86/amd64 CPUs nowadays. We are going to use the `_pdep_u64` intrinsic from BMI2 and the `_mm256_cvtepi[x]_epi[y]` intrinsics from AVX2. In Lemire's https://github.com/lemire/LittleIntPacker, bmipacking32.c uses these intrinsics for some bit widths but uses non-SIMD implementations for the others. I guess those were faster than the SIMD intrinsics when that library was written, but that was many years ago, and what was faster then might not be faster now: on Intel CPUs these SIMD intrinsics are faster than they were 10 years ago, while other instructions have not improved as much. In my benchmarks on Intel Coffee Lake, the algorithms using these SIMD intrinsics outperformed all the non-SIMD implementations with manually unrolled loops and plain bit operations like "and", "or", and shifts.

However, note that these intrinsics are slower on AMD CPUs, and M1 and other ARM processors may also have different costs, so the design for those CPUs would be different. This design is for Intel CPUs that support AVX2 only.
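To make the role of `_pdep_u64` concrete, here is a minimal, self-contained sketch; the 4-bit mask and values are illustrative and match the uint8_t example discussed below:

```cpp
#include <immintrin.h> // _pdep_u64 (BMI2); compile with -mbmi2
#include <cassert>
#include <cstdint>

int main() {
  // Deposit the low bits of the source into the positions selected by the
  // mask: with the low 4 bits of every byte set in the mask, each packed
  // 4-bit value lands in its own output byte.
  uint64_t packed = 0x12345678;           // eight 4-bit values
  uint64_t mask = 0x0f0f0f0f0f0f0f0fULL;  // low 4 bits of every byte
  uint64_t unpacked = _pdep_u64(packed, mask);
  assert(unpacked == 0x0102030405060708ULL);
  return 0;
}
```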
The actual design also depends on the input bit width and the output type. The output types we support now are:
- uint8_t
- uint16_t
- uint32_t
The input bit width for a certain output type is in [1, bits in output type]. In Parquet, all bit-packed runs contain a multiple of 8 values. In our implementation, we process a multiple of 8 values at a time. One advantage of this is that we always process full bytes in every iteration without having to shift bits, which is quite expensive.
Now we talk about the implementation for each output type.
uint8_t
When the output is uint8_t, we process 8 values at a time, and the input buffer pointer advances by bitWidth bytes per iteration. We load the bytes into an integer, then use `_pdep_u64` to deposit/scatter the bits of this integer into a `uint64_t` using a predefined mask. For example, if the input bitWidth is 4, the mask is 0x0f0f0f0f0f0f0f0f, and the bit-packed value 0x12345678 is unpacked to 0x0102030405060708.

Note that bitWidth <= 8 here. If we cast the input buffer to `uint64_t*` like this
```cpp
uint64_t val = *reinterpret_cast<const uint64_t*>(inputBuffer)
```
and don't specially handle the tail, the code might fail with a segmentation fault, because the last 8 values occupy fewer than 64 bits. In the real implementation, we could either dereference the input buffer directly after casting it to `uint64_t*` or `uint32_t*`, with special handling of the last 8 values, or we could use `std::memcpy` to copy exactly the number of bytes needed. However, I found that this sometimes affects performance a lot: the clang compiler SOMETIMES fails to optimize the loop that uses `std::memcpy`.
Let's look at the disassembly of two simple functions, one loading with a pointer cast and one with `std::memcpy`.
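A minimal sketch of the two load styles under comparison (the names are illustrative, not the actual benchmark code):

```cpp
#include <cstdint>
#include <cstring>

// Variant 1: cast and dereference. Always reads 8 bytes, so it assumes at
// least 8 readable bytes remain at 'p'; the last 8 values of a run need
// separate handling.
inline uint64_t loadByCast(const char* p) {
  return *reinterpret_cast<const uint64_t*>(p);
}

// Variant 2: copy exactly the bytes needed (bitWidth bytes hold 8 values
// when the output is uint8_t). Safe at the end of the buffer, but clang
// sometimes fails to optimize the loop around it.
inline uint64_t loadByMemcpy(const char* p, uint8_t bitWidth) {
  uint64_t v = 0;
  std::memcpy(&v, p, bitWidth);
  return v;
}
```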
We can see that the implementation using memcpy spends many more instructions calculating addresses. In my benchmarks the memcpy-based implementation took 1998us to decode 8M values, while the cast-based one only took 266us. This might be a clang issue and it does not always happen, but to avoid such a possible regression, I chose to use casting in the loop and fetch the last 8 values with `std::memcpy`. By the way, it seems Lemire's bmipacking implementation doesn't handle this and always assumes there are >= 8 bytes left in the buffer.
Note that even though we don't use all 64 bits read in one iteration, and the read might not be aligned, this doesn't seem to affect performance. This is likely a result of CPU caching: a cache line (usually 64 bytes) is much larger than 8 bytes, so subsequent reads are served from the cache.
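Putting the pieces together, here is a minimal sketch of the uint8_t path as described above; the mask table and function name are illustrative, not the actual Velox code (compile with -mbmi2):

```cpp
#include <immintrin.h> // _pdep_u64 (BMI2)
#include <cstdint>
#include <cstring>

// For each bitWidth, the mask has the low bitWidth bits of every output
// byte set (index 0 is unused).
static const uint64_t kPdepMask8[9] = {
    0,
    0x0101010101010101ULL, 0x0303030303030303ULL, 0x0707070707070707ULL,
    0x0f0f0f0f0f0f0f0fULL, 0x1f1f1f1f1f1f1f1fULL, 0x3f3f3f3f3f3f3f3fULL,
    0x7f7f7f7f7f7f7f7fULL, 0xffffffffffffffffULL};

// Unpack 'numValues' bitWidth-bit values into uint8_t. 'numValues' must be
// a positive multiple of 8, as in Parquet bit-packed runs.
void unpackToUint8(const char* input, uint8_t bitWidth, uint64_t numValues,
                   uint8_t* output) {
  const uint64_t mask = kPdepMask8[bitWidth];
  uint64_t i = 0;
  // Main loop: cast + dereference. Each 8-byte read may cover more than the
  // bitWidth bytes consumed in this iteration, but it stays inside the
  // buffer while more than 8 values remain.
  for (; i + 8 < numValues; i += 8) {
    uint64_t word = *reinterpret_cast<const uint64_t*>(input);
    uint64_t unpacked = _pdep_u64(word, mask);
    std::memcpy(output, &unpacked, 8);
    input += bitWidth;
    output += 8;
  }
  // Last 8 values: only bitWidth bytes are left, so copy exactly that many
  // to avoid reading past the end of the buffer.
  uint64_t word = 0;
  std::memcpy(&word, input, bitWidth);
  uint64_t unpacked = _pdep_u64(word, mask);
  std::memcpy(output, &unpacked, 8);
}
```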
When the input bitWidth is 8, we could just do a simple memcpy. However, clang does a good job here and using memcpy has the SAME performance as the algorithm using `_pdep_u64`, so to simplify the code I didn't special-case it.

Another takeaway is that manually unrolling the loop does NOT make any difference. There is no data dependency among the loop iterations, and a plain loop keeps the code shorter.
uint16_t
bitWidth in [1, 7]: Process 2 * bitWidth bytes (16 values) at a time.
Copy 2 * bitWidth bytes into 2 integers, 8 values each. Then call `_pdep_u64` on each of the integers to deposit the values from bitWidth bits to 8 bits wide, and store the 8 * 2 output values to a piece of memory aligned to 16 bytes. Now use `_mm256_cvtepu8_epi16` to widen these 16 values in the register and store them back to memory.

Note that using memcpy for `uint16_t` doesn't make the big difference that we saw for `uint8_t`. It is even slightly faster than dereferencing the memory to a `uint64_t` integer and shifting its bits before the second `_pdep_u64`.

Another observation is that manually unrolling the loop to process 32 or 64 values in one iteration does NOT make any difference.
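A minimal sketch of one iteration of this path (illustrative names; it assumes at least 8 readable bytes at each load, with the end of the buffer handled separately as in the uint8_t case):

```cpp
#include <immintrin.h> // _pdep_u64 (BMI2), _mm256_cvtepu8_epi16 (AVX2)
#include <cstdint>
#include <cstring>

// Unpack 16 bitWidth-bit values (bitWidth in [1, 7]) into uint16_t.
// 'mask' has the low bitWidth bits of every byte set, as in the uint8_t path.
void unpack16ToUint16(const char* input, uint8_t bitWidth, uint64_t mask,
                      uint16_t* output) {
  // Two groups of 8 values, bitWidth bytes each.
  uint64_t lo, hi;
  std::memcpy(&lo, input, sizeof(lo));
  std::memcpy(&hi, input + bitWidth, sizeof(hi));

  // Deposit each group so that every value occupies one byte.
  alignas(16) uint64_t bytes[2];
  bytes[0] = _pdep_u64(lo, mask);
  bytes[1] = _pdep_u64(hi, mask);

  // Widen the 16 bytes to 16 uint16_t lanes and store them.
  __m128i packed = _mm_load_si128(reinterpret_cast<const __m128i*>(bytes));
  __m256i widened = _mm256_cvtepu8_epi16(packed);
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(output), widened);
}
```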
bitWidth = 8: Process 2 * bitWidth bytes (16 values) at a time.
For this case we simply call `_mm256_cvtepu8_epi16` without `_pdep_u64`. This is about 10% faster than the above implementation for bitWidth 1-7.

bitWidth in [9, 15]: Process bitWidth bytes (2 * 4 values) at a time.
We read the memory twice in each iteration of the loop. The first time we read ceil(bitWidth / 2) bytes, and the second time the rest of the bytes that make up the 8 values. E.g. if bitWidth = 9, we read 5 bytes into the first integer and the next 4 bytes into the second integer. Then we use `_pdep_u64` to deposit the 4 values in the first integer to 16 bits each directly. Next we shift the first integer right by 36 bits and the second integer left by 4 bits, and OR the two results into a third integer; this uses up the remaining 4 unhandled bits of the first integer. Finally we call `_pdep_u64` on the third integer and store the second 4 values to memory.

When the bitWidth is even, we don't need to shift the integers. However, special-casing the even bitWidths doesn't change the performance, so to keep it simple we use the above implementation for both odd and even bitWidths.
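A minimal sketch of one iteration for this range (illustrative names; `mask` has the low bitWidth bits set in each 16-bit lane, e.g. 0x01FF01FF01FF01FF for bitWidth = 9):

```cpp
#include <immintrin.h> // _pdep_u64 (BMI2)
#include <cstdint>
#include <cstring>

// Unpack 8 bitWidth-bit values (bitWidth in [9, 15]) into uint16_t.
void unpack8ToUint16(const char* input, uint8_t bitWidth, uint64_t mask,
                     uint16_t* output) {
  const uint32_t firstBytes = (bitWidth + 1) / 2;        // ceil(bitWidth / 2)
  const uint32_t secondBytes = bitWidth - firstBytes;    // rest of the 8 values
  const uint32_t usedBits = 4 * bitWidth;                // bits of 4 values
  const uint32_t spareBits = 8 * firstBytes - usedBits;  // 4 if odd, 0 if even

  uint64_t first = 0, second = 0;
  std::memcpy(&first, input, firstBytes);
  std::memcpy(&second, input + firstBytes, secondBytes);

  // The first 4 values come straight from the first integer.
  uint64_t out0 = _pdep_u64(first, mask);
  // The next 4 values are the leftover bits of 'first' followed by 'second'.
  uint64_t third = (first >> usedBits) | (second << spareBits);
  uint64_t out1 = _pdep_u64(third, mask);

  std::memcpy(output, &out0, sizeof(out0));
  std::memcpy(output + 4, &out1, sizeof(out1));
}
```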
bitWidth = 16
Just memcpy. It's faster than the above.
uint32_t
bitWidth in [1, 7]: Process 2 * bitWidth bytes (16 values) at a time.
The same as uint16_t.
Manually unrolling does NOT make a difference.
bitWidth = 8: Process 8 bytes (8 values) at a time.
Directly use `_mm256_cvtepu8_epi32` to convert 8 bits to 32 bits.

bitWidth in [9, 15]: Process bitWidth bytes (2 * 4 values) at a time.
Same as uint16_t.
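For the uint32_t, bitWidth = 8 case above, a minimal sketch of the direct widening (illustrative, assuming AVX2):

```cpp
#include <immintrin.h> // _mm_loadl_epi64, _mm256_cvtepu8_epi32 (AVX2)
#include <cstdint>

// Widen 8 byte-aligned uint8_t values directly to uint32_t; no _pdep_u64 is
// needed because each value already occupies a full byte.
void widen8To32(const uint8_t* input, uint32_t* output) {
  // Load the 8 input bytes into the low half of an XMM register.
  __m128i bytes = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(input));
  // Zero-extend each byte to a 32-bit lane and store 8 uint32_t values.
  __m256i widened = _mm256_cvtepu8_epi32(bytes);
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(output), widened);
}
```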
bitWidth = 16
Directly use `_mm256_cvtepu16_epi32` to convert 16 bits to 32 bits.

Other Considerations
Function overloading vs. templates
We could use templated functions with the output type as a template parameter, or function overloading. There are two reasons I chose function overloading over templates:
Class vs. Global Functions In Utility Headers
We could wrap the input and output buffers in a class and provide public functions to unpack the data. However, benchmarks show that calling the functions on a class instance is slower than making them global functions. This may be related to function dispatch cost and/or inlining, so I chose to put the functions in a header file. The full implementation is about 500 lines of code and is pretty small.
Function Pointers vs. Switches
The performance of these two is about the same. Since the input bitWidth and output type are fixed for a given column, the CPU can predict the correct function or switch branch, and the cost of function dispatch is minimal.
Benchmark Results
The following benchmark results are from BitUnpackingBenchmark.cpp, unpacking 8M values. The code was compiled with Apple clang version 12.0.5 (clang-1205.0.22.11) on macOS, with an Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz (Coffee Lake). The results are in microseconds (us).
Read and Skip Values In a RowSet
Sometimes queries have filters. Velox reads the columns with filters first and skips the rows that didn't pass the filters when reading the later columns. Our bit unpacking library will support reading non-contiguous values.
(To Be Finished)
Push Down Filters To Bit Packed Data
(To Be Finished)