Significantly improve performance by using chunks_exact(SIMD_LEN) and remainder()
#10
Comments
I was also looking at the performance yesterday (it was much slower than memchr etc.) and noticed that the if statements complicate unrolling, so we need to do a bit of loop unrolling ourselves. https://godbolt.org/z/88nM38hhn should give something like this:
Using this trick I was able to get close to memchr() performance on position_simd() with u8s.
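To make the unrolling idea concrete, here is a rough sketch on stable Rust. It is not the code from the godbolt link: `LANES`, `UNROLL`, `chunk_has` and `contains_unrolled` are names made up for illustration, and the plain byte loop only stands in for a real SIMD compare. The point is that each block does several branch-free checks and then takes a single branch, instead of one `if` per element.

```rust
// Hypothetical sizes, chosen only for this sketch.
const LANES: usize = 16; // bytes handled by one "vector" check
const UNROLL: usize = 4; // checks done per loop iteration

/// Branch-free check: OR every comparison together, no early return.
fn chunk_has(chunk: &[u8], needle: u8) -> bool {
    chunk.iter().fold(false, |acc, &b| acc | (b == needle))
}

/// Returns true if `needle` occurs anywhere in `haystack`.
fn contains_unrolled(haystack: &[u8], needle: u8) -> bool {
    let mut blocks = haystack.chunks_exact(LANES * UNROLL);
    for block in blocks.by_ref() {
        // Four independent checks, then one branch for the whole block.
        let a = chunk_has(&block[..LANES], needle);
        let b = chunk_has(&block[LANES..2 * LANES], needle);
        let c = chunk_has(&block[2 * LANES..3 * LANES], needle);
        let d = chunk_has(&block[3 * LANES..], needle);
        if a | b | c | d {
            return true;
        }
    }
    // Fewer than LANES * UNROLL bytes are left over; check them separately.
    chunk_has(blocks.remainder(), needle)
}
```

Whether the compiler actually vectorizes the inner fold depends on the target and optimization level, so it is worth checking the generated asm rather than assuming it.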
@smu160 I've also had some sus results with
Hi @LaihoE, I realized I had a bug in my implementation that I shared via godbolt. For some reason I put Anyhow, I'll finish the benchmarks to see how it does compared to the other versions we mentioned here.
The `as_simd` call is not guaranteed to give us the perfect output of `prefix`, `middle`, `suffix` (where we know `middle` will have everything we need). You'd probably have to guarantee some sort of alignment in order to make that work well. So, we'll always have to check the prefix/suffix, and there's no guarantee that the middle will even have any elements. We can get around this by going back to the slice API and just using `chunks_exact(SIMD_LEN)` followed by checking in `chunks_exact(SIMD_LEN).remainder()`. For example:
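A minimal sketch of that pattern on stable Rust, not the snippet from this issue: `SIMD_LEN = 32` and the name `position_chunked` are assumptions for illustration, and the scalar loop inside each chunk stands in for the actual SIMD comparison. What it shows is the control flow: every full chunk has exactly `SIMD_LEN` elements regardless of alignment, and only the tail returned by `remainder()` needs separate handling.

```rust
// Hypothetical chunk width for this sketch; the real value depends on the SIMD type used.
const SIMD_LEN: usize = 32;

/// Returns the index of the first occurrence of `needle`, if any.
fn position_chunked(haystack: &[u8], needle: u8) -> Option<usize> {
    let mut chunks = haystack.chunks_exact(SIMD_LEN);
    let mut offset = 0;

    for chunk in chunks.by_ref() {
        // Every chunk has exactly SIMD_LEN bytes, so this check has a fixed trip count.
        let hit = chunk.iter().fold(false, |acc, &b| acc | (b == needle));
        if hit {
            // Only now locate the exact index inside the matching chunk.
            return chunk.iter().position(|&b| b == needle).map(|i| offset + i);
        }
        offset += SIMD_LEN;
    }

    // Whatever did not fill a whole chunk (fewer than SIMD_LEN bytes) is checked here.
    chunks
        .remainder()
        .iter()
        .position(|&b| b == needle)
        .map(|i| offset + i)
}
```

Unlike the prefix/middle/suffix split, the full chunks here never come back empty just because the slice happens to be misaligned; the remainder is empty only when the length is a multiple of `SIMD_LEN`.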
You can see the generated asm for yourself on godbolt. You'll have to comment out my version and uncomment yours to see the difference in cycle count. Of course, the proof is in the pudding. Using the benchmarks in PR #8, I got great preliminary results!
Let me know your thoughts. Thank you!!