
Improve convert_hex_to_binary x86_64 codegen. #473

Closed
ttsugriy wants to merge 1 commit

Conversation

@ttsugriy (Contributor) commented Aug 19, 2023

This results in shorter and branchless x86_64 assembly: https://godbolt.org/z/5jqTbYWWh

Original:

```
convert_hex_to_binary(char):             # @convert_hex_to_binary(char)
        mov     eax, edi
        cmp     al, 57
        jg      .LBB0_2
        add     eax, -48
        ret
.LBB0_2:
        xor     ecx, ecx
        cmp     al, 97
        setb    cl
        shl     ecx, 5
        add     eax, ecx
        add     eax, -87
        ret
```

and new version:

```
convert_hex_to_binary(char):             # @convert_hex_to_binary(char)
        xor     ecx, ecx
        cmp     dil, 97
        setl    cl
        shl     ecx, 5
        add     ecx, -87
        cmp     dil, 58
        mov     eax, -48
        cmovge  eax, ecx
        add     eax, edi
        ret
```

It also results in one fewer instruction for ARM64 using GCC: https://godbolt.org/z/x9oYoG93h

```
convert_hex_to_binary(char):
        and     w0, w0, 255
        cmp     w0, 57
        bls     .L7
        cmp     w0, 97
        mov     w2, 97
        mov     w1, 65
        csel    w1, w1, w2, cc
        sub     w0, w0, w1
        add     w0, w0, 10
        ret
.L7:
        sub     w0, w0, #48
        ret
```

vs. the new version:

```
convert_hex_to_binary(char):
        and     w0, w0, 255
        cmp     w0, 57
        bls     .L7
        cmp     w0, 97
        mov     w2, 87
        mov     w1, 55
        csel    w1, w1, w2, cc
        sub     w0, w0, w1
        ret
.L7:
        sub     w0, w0, #48
        ret
```

Specifically, the first version has an extra `add w0, w0, 10`.
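The source change itself is not quoted in this thread; a minimal C++ sketch consistent with the new branchless codegen above (an illustration, not necessarily the exact diff) would be:

```
// Illustration only: a branchless formulation matching the new x86_64 assembly above.
// '0'..'9' -> subtract 48; 'A'..'F' -> subtract 55; 'a'..'f' -> subtract 87.
unsigned convert_hex_to_binary(const char c) noexcept {
  return c + (c >= ':' ? (c < 'a' ? -55 : -87) : -48);
}
```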

@anonrig requested a review from @lemire, August 19, 2023 11:31
@lemire (Member) commented Aug 20, 2023

When convert_hex_to_binary is called, we know that the character is a hexadecimal digit, so it is a nice function to optimize.

Your new code should sustain a throughput of roughly one call per slightly more than two cycles.

Can we do better?

What about...

```
unsigned convert_hex_to_binary(const char c)  {
  const static unsigned char table[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
                                         0, 0, 0, 0, 0, 0, 0, 10, 11,
                                         12, 13, 14, 15, 0, 0, 0, 0,
                                          0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                          0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                          0, 0, 10, 11, 12, 13, 14, 15};
  return table[c - '0'];
}
```

This should compile to something like...

```
        movsxd  rax, edi
        lea     rcx, [rip + table(char)::table]
        movzx   eax, byte ptr [rax + rcx - 48]
        ret
```

It is a fraction of the µops and should support a throughput of 1 per cycle, so it is unbeatable from that point of view.

The table fits in a cache line, so it is going to be efficient.
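As a sanity check on the index arithmetic (a small sketch assuming the table layout above, added for illustration):

```
// Sketch: spot-checks of the offsets used by the table above (indices are c - '0').
static_assert('A' - '0' == 17, "uppercase hex digits start at table index 17");
static_assert('a' - '0' == 49, "lowercase hex digits start at table index 49");
static_assert('f' - '0' == 54, "the table needs at least 55 entries");
```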

@lemire (Member) commented Aug 20, 2023

Note that we have, as homework, to greatly optimize this code via...

https://github.com/ada-url/ada/pull/459/files

We still don't have good benchmarks for this (at least none that are in place).

@ttsugriy (Contributor, Author)

Thanks for the suggestion, @lemire! I also considered your approach, but I'm not sure that trading a few CPU instructions for a memory lookup is worth it. That's also a reason why compilers don't always prefer lookup tables when implementing switches. Depending on the access pattern, the table may end up in cache, which would make its cost very low, but if the cache hit rate is poor, it may end up being slower.

@lemire (Member) commented Aug 21, 2023

@ttsugriy

The function you propose compiles to about 30 bytes (x64). The function I propose compiles to about 16 bytes, plus a table of 54 bytes... so about 70 bytes. Not counting alignment and function headers, it is a difference of about 40 bytes. Of course, it is a short function so it is likely to get inlined. The table will not be duplicated, but the code itself might be. So it is unclear to me which is going to generate a smaller binary. Let us try it out.

I use GCC 11.

Current main branch:

```
$ ls -al build/src/libada.a
-rw-r--r--  1 dlemire  staff  494544 build/src/libada.a
```

This PR:

```
$ ls -al buildpatch1/src/libada.a
-rw-r--r--  1 dlemire  staff  494544 buildpatch1/src/libada.a
```

PR with a table:

```
$ ls -al buildpatch2/src/libada.a
-rw-r--r--  1 dlemire  staff  494536 buildpatch2/src/libada.a
```

So the function with a table actually saves 8 bytes.

Does it matter? Well, it is 0.0016%, and current processors have megabytes of cache.

Now, we can, of course, measure this empirically, by running benchmarks. We don't yet have good benchmarks for this problem, but I hope to have some later today. So we will be able to measure the difference objectively.
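For a rough standalone comparison in the meantime, a hypothetical microbenchmark using Google Benchmark directly (not the repository's bench_search_params, and with a fixed digit mix rather than a representative workload) might look like:

```
// Hypothetical standalone harness comparing the branchless conversion
// against the table-based one with Google Benchmark.
#include <benchmark/benchmark.h>
#include <string_view>

static unsigned convert_branchless(char c) {
  return c + (c >= ':' ? (c < 'a' ? -55 : -87) : -48);
}

static unsigned convert_table(char c) {
  static const unsigned char table[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
                                        0, 0, 0, 0, 0, 0, 0, 10, 11,
                                        12, 13, 14, 15, 0, 0, 0, 0,
                                        0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                        0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                        0, 0, 10, 11, 12, 13, 14, 15};
  return table[c - '0'];
}

// A fixed digit mix only sanity-checks throughput; it is not a representative workload.
static constexpr std::string_view digits = "0123456789abcdefABCDEF";

static void BM_branchless(benchmark::State& state) {
  for (auto _ : state)
    for (char c : digits) benchmark::DoNotOptimize(convert_branchless(c));
}
BENCHMARK(BM_branchless);

static void BM_table(benchmark::State& state) {
  for (auto _ : state)
    for (char c : digits) benchmark::DoNotOptimize(convert_table(c));
}
BENCHMARK(BM_table);

BENCHMARK_MAIN();
```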

@ttsugriy (Contributor, Author)

@lemire

I wasn't making any claims about binary size. What I'm saying is that a lookup table requires an extra memory load, and in my experience memory is more often a bottleneck than the CPU, so in practice I usually observe a runtime reduction when a few extra CPU instructions can replace a memory access. Obviously, only a benchmark on a representative workload can answer this, but in any case this PR seems like an improvement over the original implementation, whereas the change to a lookup table, while possibly even better, is much harder to compare in terms of performance.

@lemire (Member) commented Aug 21, 2023

I think we need a benchmark before we can consider such a PR.

I have the following for consideration: #477

@lemire (Member) commented Aug 21, 2023

Now that we have a benchmark, we can run it on the three different options (current code, this PR and the table-based approach).

You can run benchmarks/bench_search_params to reproduce these results.

GCC 11, Ice Lake processor.

Main:

```
-----------------------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------
url_search_params_AdaURL      65900 ns        65735 ns        10671 GHz=3.19635 cycle/byte=11.6984 cycles/url=32.773k instructions/byte=40.3667 instructions/cycle=3.45062 instructions/ns=11.0294 instructions/url=113.087k ns/url=10.2532k speed=170.473M/s time/byte=5.86604ns time/url=16.4337us url/s=60.8505k/s
```

This PR:

```
-----------------------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------
url_search_params_AdaURL      65202 ns        65028 ns        10785 GHz=3.19652 cycle/byte=11.5858 cycles/url=32.4575k instructions/byte=40.3469 instructions/cycle=3.48245 instructions/ns=11.1317 instructions/url=113.032k ns/url=10.154k speed=172.325M/s time/byte=5.80298ns time/url=16.257us url/s=61.5118k/s
```

Table approach:

```
-----------------------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------
url_search_params_AdaURL      57930 ns        57759 ns        11989 GHz=1.99858 cycle/byte=9.17089 cycles/url=25.6923k instructions/byte=40.4051 instructions/cycle=4.40579 instructions/ns=8.80533 instructions/url=113.195k ns/url=12.8552k speed=194.012M/s time/byte=5.15432ns time/url=14.4398us url/s=69.253k/s
```

The PR is about 1% faster than the main branch, but the table-based approach is 13% faster than the PR.

@zingaburga

If you want to avoid both branches and a lookup table, this might be shorter:

```
unsigned convert_hex_to_binary(const char c) noexcept {
  char c2 = c | 32; // convert to lowercase
  return (c2>='a') ? c2-'a'+10 : c2-'0';
}
```

x86-64 codegen:

```
    or      dil, 32
    movsx   ecx, dil
    lea     edx, [rcx - 48]
    lea     eax, [rcx - 87]
    cmp     cl, 97
    cmovl   eax, edx
    ret
```

ARM64 codegen:

```
    orr     w9, w0, #0x20
    mov     w8, #-48
    and     w9, w9, #0xff
    mov     w10, #-87
    cmp     w9, #96
    csel    w8, w10, w8, hi
    add     w0, w8, w9
    ret
```

@zingaburga commented Aug 22, 2023

Gah, scrap that, found a shorter version:

```
unsigned convert_hex_to_binary(char c) noexcept {
  c &= ~48; // clear bits 0x30: uppercase the letters and map '0'..'9' to 0..9
  return (c>'A'-10) ? c-('A'-10) : c;
}
```

x86-64:

```
    and     dil, -49
    movsx   ecx, dil
    lea     eax, [rcx - 55]
    cmp     cl, 56
    cmovl   eax, ecx
    ret
```

ARM64:

```
    mov     w8, #207
    and     w8, w0, w8
    subs    w9, w8, #55
    csel    w0, w9, w8, hi
    ret
```
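A quick exhaustive check of the bit trick (a small sketch, assuming only the function above; added for illustration):

```
// Sketch: verify the AND-with-~48 trick against a straightforward conversion
// for every valid hexadecimal digit.
#include <cassert>

static unsigned convert_andnot(char c) noexcept {
  c &= ~48;  // uppercase the letters and map '0'..'9' to 0..9
  return (c > 'A' - 10) ? c - ('A' - 10) : c;
}

int main() {
  const char* digits = "0123456789abcdefABCDEF";
  for (const char* p = digits; *p; ++p) {
    unsigned expected = (*p <= '9')   ? *p - '0'
                        : (*p <= 'F') ? *p - 'A' + 10
                                      : *p - 'a' + 10;
    assert(convert_andnot(*p) == expected);
  }
  return 0;
}
```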

@lemire (Member) commented Aug 23, 2023

@zingaburga Very nice. Very slightly slower than a table in my tests, but seemingly faster than this PR.

@zingaburga commented Aug 23, 2023

Thanks!

On x86, it's +2 instructions to avoid a lookup, but on ARM64, it should roughly be the same.

Actually, it should be possible to eliminate the sign extension, but I can't seem to get the compiler to comply. This seems to work (unless I misunderstand something), but I don't know how to get the compiler to produce it naturally. This would reduce it to 4 instructions, or effectively 3 with move elimination, which should put it almost on par with the lookup table (but without a lookup).

@lemire (Member) commented Aug 24, 2023

@zingaburga

on ARM64, it should roughly be the same.

Let us try it out experimentally.

LLVM 14, Apple M2 (aarch64). Benchmark: bench_search_params.

You can run our benchmarks like so:

```
cmake -B build -D ADA_BENCHMARKS=ON
cmake --build build -j
./build/benchmarks/bench_search_params
```

Version with a table:

```
-----------------------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------
url_search_params_AdaURL      34558 ns        34558 ns        20581 GHz=3.60044 cycle/byte=10.8839 cycles/url=30.4913k instructions/byte=54.0965 instructions/cycle=4.97032 instructions/ns=17.8954 instructions/url=151.551k ns/url=8.46875k speed=324.264M/s time/byte=3.08391ns time/url=8.63957us url/s=115.746k/s
```

The ANDNOT version:

```
-----------------------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------
url_search_params_AdaURL      34617 ns        34612 ns        20458 GHz=3.60699 cycle/byte=10.8635 cycles/url=30.434k instructions/byte=54.4747 instructions/cycle=5.01448 instructions/ns=18.0872 instructions/url=152.611k ns/url=8.4375k speed=323.758M/s time/byte=3.08872ns time/url=8.65306us url/s=115.566k/s
```

Analysis

On the M2 with LLVM 14, the performance is indistinguishable, and so is the binary size of the produced library. However, the version with a table executes fewer instructions (54.1 instructions per byte vs. 54.5 instructions per byte). You end up with the same speed because the ANDNOT version retires very slightly more instructions per cycle.

Practically speaking, the difference is probably insignificant, meaning that one can go with your approach or the table, and it makes no difference, at least on Apple Silicon.

@zingaburga

Thanks for the benchmark - interesting result!

My statement was just based on the output of the compiler, not on any benchmark. Putting your code into Clang 14 gives me 4 instructions, the same number as my version, and all of them are reasonably fast. In terms of binary size, one would expect the tableless variant to be slightly smaller since it has no table.

I haven't looked at the benchmark code, but could it be inlining a bunch of stuff? If the adrp+add instruction pair is precomputed, the table version ends up being two instructions (add + load), whilst the tableless one is three (the mov of the constant can be hoisted out of the loop). I'm guessing that's where the fewer instructions come from?

@ttsugriy closed this Aug 28, 2023