Loading big repodata (uncompressed other.xml that has 3GB) #493

Open
kontura opened this issue Mar 23, 2022 · 17 comments

@kontura
Contributor

kontura commented Mar 23, 2022

Apologies for so many issues lately, but I have another one.

In RHEL we have some big repodata, such as an other.xml that is 3GB uncompressed.

This fails to load because of an overflow. I think it happens e.g. here: https://github.com/openSUSE/libsolv/blob/master/src/repodata.c#L2528, where data->attrdatalen is an unsigned int but repodata_set() takes an Id (the same conversion happens in other places as well). The truncated value is then stored, and during internalization it is used as an int offset into extdata, which results in a crash.
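
For illustration, here is a minimal self-contained sketch (not libsolv code; it only assumes Id is a 32-bit signed int, as libsolv's pooltypes.h defines it) of how an offset past 2 GiB goes negative when narrowed from unsigned int to Id:

```c
/* Minimal sketch of the suspected overflow; standalone, not libsolv code.
 * Assumes Id is a 32-bit signed int, as in libsolv's pooltypes.h. */
#include <limits.h>
#include <stdio.h>

typedef int Id;

int main(void)
{
    /* Pretend the attribute/string data area has grown past 2 GiB. */
    unsigned int attrdatalen = (unsigned int)INT_MAX + 100u;

    /* Narrowing to Id wraps to a negative value on common platforms, so any
     * later use of it as an int offset into extdata points before the buffer. */
    Id offset = (Id)attrdatalen;

    printf("attrdatalen = %u stored as Id = %d\n", attrdatalen, offset);
    return 0;
}
```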

Is there something more we can do about this, apart from catching the overflow?
I tried converting some of the involved parameters and variables to Offsets but didn't manage to make it work. I am also not sure whether that is even a valid approach; if I understand correctly, we would need some way to differentiate Ids from Offsets, perhaps a new REPOKEY_TYPE_STR_OFFSET. Even then it doesn't seem scalable, since it would only buy time until the unsigned int overflows as well.
Maybe another possibility would be adding additional string data spaces?

A related issue is that in order to parse the metadata we need to load it into RAM all at once, even though the resulting solv file is around one third of the size. Could there be some streaming parsing model, where we process and internalize the metadata continually?

@mlschroe
Member

Oh no, making that work will not be much fun. And the next limit will be 4GB, of course.

Regarding a streaming implementation: you would need to do the parsing and solv file writing in one step. Certainly doable, but not trivial and quite a bit of code.

Once the solv file is written and loaded again, the memory usage is quite small as the changelog author/text is put into the "paged" part of the solv file ("vertical data") and thus not read in.
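
For context, here is a rough sketch (assumed usage, with error handling dropped and parser flags simplified; the real loaders pass extra flags such as REPO_EXTEND_SOLVABLES for filelists/other data) of the parse, write, reload cycle being described:

```c
/* Rough sketch of the parse -> write .solv -> reload cycle described above.
 * Assumed usage: no error handling, parser flags simplified. */
#include <stdio.h>
#include <solv/pool.h>
#include <solv/repo.h>
#include <solv/repo_rpmmd.h>
#include <solv/repo_solv.h>
#include <solv/repo_write.h>

static void cache_and_reload(Repo *repo, FILE *rpmmd_xml, const char *solv_path)
{
    /* Parse the rpm-md XML: everything that was parsed sits in memory at this
     * point (when other.xml is loaded, that includes the changelog text). */
    repo_add_rpmmd(repo, rpmmd_xml, 0, 0);

    /* Write the .solv cache file. */
    FILE *out = fopen(solv_path, "w");
    repo_write(repo, out);
    fclose(out);

    /* Drop the in-memory data and reload from the cache; the "vertical"
     * (paged) part, e.g. changelog author/text, now stays on disk and is
     * only read in pages on demand. */
    repo_empty(repo, 1);
    FILE *in = fopen(solv_path, "r");
    repo_add_solv(repo, in, 0);
    fclose(in);
}
```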

@mlschroe
Member

Is that >3GB repo public so that I can do some testing?

@lukash
Contributor

lukash commented Mar 23, 2022

Once the solv file is written and loaded again, the memory usage is quite small as the changelog author/text is put into the "paged" part of the solv file ("vertical data") and thus not read in.

Oh... would that be a reason to reload cache data immediately after writing? Does it somehow reduce the memory footprint (even though the data were already loaded to the same pool)?

I've removed the reloading in libdnf 5 because I saw no purpose to it and it had a lackluster comment that didn't really explain it.

What does the "paged" part of the solv file (vertical data) mean? Is it not read into memory?

@mlschroe
Member

The data is segmented into 32K pages and read on demand.

@lukash
Contributor

lukash commented Mar 23, 2022

From the file? Does libsolv store the path and reopen the file if needed? Because we are passing an open FILE * which we then close. It's really unexpected that libsolv does this; I would certainly assume I can delete the file after the FILE * is closed...

@etoddallen

Hi. I submitted one of the original bugs to RHEL. I'd just like to point out that the huge XML file is other.xml, not one of the primary ones. Specifically, it's the %changelog entries. For packages with huge, ancient %changelogs, if there are 20 copies of them, those ancient %changelog entries are replicated 20x, which greatly bloats this file.

Anyway, I would not expect this data to be crucial to solving package dependencies. So, for other.xml data and maybe also for filelists.xml data, a lazy loading approach might make sense.

@etoddallen

Clarification: *...if the repo contains 20 different versions of these packages...

@kontura
Contributor Author

kontura commented Mar 24, 2022

Is that >3GB repo public so that I can do some testing?

I don't think it is easily available, but I made a Python script that should suffice: https://github.com/kontura/createrepo_dummy
If you generate the dummy repo with ~20000 pkgs it should be big enough, though it can take quite a while.

@mlschroe
Member

Regarding the paging: repo_add_solv dup()s away the file descriptor and uses pread to read the pages.
You can delete the file; the data can still be accessed as long as the fd is open. (This is of course not true on Windows.)
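
That is standard POSIX behaviour; here is a tiny standalone illustration (not libsolv code) of why the caller may close its FILE * and even delete the file:

```c
/* Standalone POSIX illustration (not libsolv code): a dup()ed descriptor keeps
 * the file's data readable via pread() even after fclose() and unlink(). */
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    FILE *fp = fopen("demo.solv", "w+");
    fputs("page data", fp);
    fflush(fp);

    int fd = dup(fileno(fp));   /* roughly what repo_add_solv keeps for paging */

    fclose(fp);                 /* the caller closes its FILE *       */
    unlink("demo.solv");        /* ... and may even delete the file   */

    char buf[16] = {0};
    pread(fd, buf, sizeof(buf) - 1, 0);   /* the pages are still readable */
    printf("read back: %s\n", buf);

    close(fd);
    return 0;
}
```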

@mlschroe
Member

Regarding the lazy conversion to a solv file: doesn't dnf do that already?

@etoddallen

Regarding laziness: I was talking about a more fine-grained laziness, where dnf could request changelog or filelists data only for specific packages, so that it didn't need to load the entire other.xml or filelists.xml.

@mlschroe
Member

mlschroe commented Apr 7, 2022

That would indeed be nice, but then we'd run into weird cases if the remote repo is changed and the files cannot be accessed anymore. So I think the default should be to do the download and the solv file conversion on demand.

@dralley
Contributor

dralley commented Apr 12, 2022

Hi. I submitted one of the original bugs to RHEL.

@etoddallen Please point me to that Bugzilla because I will gladly raise the volume on it. This has caused a multitude of problems for us.

The root of the issue is that they have been keeping all of the changelogs, for every package, due to a bug. The packages that are updated the most frequently have the longest changelog lists, and also the most copies of those changelogs, because all the old packages are kept as well.

I'm not going to work out the big-O on that, but you can see why it scales poorly. If you take the same metadata and drop all but the last 10 changelogs per package, it ends up at around 10% of the size.
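
To put rough numbers on it (made-up but plausible figures, just for illustration): a package with 20 retained builds, each shipping its full accumulated history of, say, 500 %changelog entries, contributes 20 × 500 = 10,000 entries to other.xml, whereas capping at 10 entries per build would leave 20 × 10 = 200.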

@etoddallen

Sure. The one that I reported is here:
https://bugzilla.redhat.com/show_bug.cgi?id=2008233
But it was closed as a dup in favor of a newer one:
https://bugzilla.redhat.com/show_bug.cgi?id=2040170
The symptoms were slightly different (SIGSEGV vs. OOM), but the root problem was the same: a gigantic other.xml.

@Adrixop95

bump;
Any update on this? I also ran into this bug :v

@dralley
Contributor

dralley commented Jun 13, 2022

Some infra is being updated to restrict the # of changelogs kept per package, but I don't know precisely when it's all going to be in production. Most distros are using createrepo_c, I believe, so they already get this for "free".

@dralley
Contributor

dralley commented Sep 27, 2022

The gigantic other.xml issue is now resolved at the source. The issue was partially that, on top of keeping full copies of the metadata for every version of an RPM, some RPMs such as OpenSSL, Samba and so forth actually keep changelog history going back more than 20 years. So it really was a huge quantity of data.

Restricting to the 10 most recent changelogs (as most distributions do) shrank RHEL's other.xml.gz by 90-99%.

So this issue as written can probably be closed, though we may still want one open for the unsigned int overflow issue.
