Don't fsync() in checksum #297

Open
stewartsmith wants to merge 1 commit into base: master


stewartsmith
Contributor

@stewartsmith stewartsmith commented Feb 9, 2024

This gives a major boost in librepo performance. For a reposync of an Amazon Linux 2023 x86-64 repository on an m5n.16xlarge EC2 instance with a 500MB/sec, 3000 IOPS EBS volume, this change alone cuts wall time by 30 seconds and lets reposync use nearly a whole core rather than only two thirds of one.


For reference, my benchmarking was done on an m5n.16xlarge EC2 instance, against both the in-region S3 buckets and the CDN repositories. That instance type has 256GB of memory, a 75Gbit network connection, and 64 Cascade Lake cores. The root volume is a 256GB gp3 EBS volume with 500MB/sec of throughput and 3000 IOPS.

The background is that many EC2 instances are relatively short-lived and only install RPMs at launch, so any time spent installing RPMs is scale-up time that would be better spent running the customer workload.

This pairs well with #294, #295, and #296.


What I'm not entirely sure of are the other implications of this change: what relies on this checksum being crash safe, and should we instead re-compute it sometimes?

I'm open to putting this behind an ifdef or something if that seems safer. I'd love input here.
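
To make the ifdef idea concrete, here is a minimal sketch of what a compile-time switch could look like. `CHECKSUM_FSYNC` and `checksum_fd` are hypothetical names used only for illustration, and the byte sum is a stand-in for the real digest; this is not librepo's actual code.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical build-time switch (an assumption, not an existing librepo
 * option). Build with -DCHECKSUM_FSYNC=1 to keep the flush; the default
 * here skips it. */
#ifndef CHECKSUM_FSYNC
#define CHECKSUM_FSYNC 0
#endif

/* Toy checksum: a byte sum standing in for a real digest.
 * Returns 0 on success and -1 on error. */
static int checksum_fd(int fd, unsigned long *sum_out)
{
    unsigned char buf[4096];
    unsigned long sum = 0;
    ssize_t n;

    /* Always checksum the whole file, starting from offset 0. */
    if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
        return -1;

    while ((n = read(fd, buf, sizeof buf)) > 0) {
        for (ssize_t i = 0; i < n; i++)
            sum += buf[i];
    }
    if (n < 0)
        return -1;

#if CHECKSUM_FSYNC
    /* Durable path: flush the file to stable storage before the result
     * is reported, at the cost of one fsync() per file. */
    if (fsync(fd) != 0)
        return -1;
#endif

    *sum_out = sum;
    return 0;
}
```

With the default build the function never calls `fsync()`, so callers that need the file to be on stable storage would have to flush it themselves or rebuild with the switch enabled.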

@stewartsmith
Contributor Author

To give an idea of what these four PRs do combined: on the same machine, a reposync of the AL2023 x86-64 repositories goes from 1min42s with the original librepo down to 1min08s.

@stewartsmith
Contributor Author

As an exercise, I tried removing the computation and writing of the checksum entirely, along with tweaking the maximum number of connections. That got me to a peak of around 1.1GB/sec when reposyncing Fedora (writing to /dev/shm, since disk IO was starting to become a limiting factor), syncing the latest packages from the Fedora 39 x86-64 repos in 1min30s.

It may be worth considering an alternative / option to the checksum.
