Loading big repodata (uncompressed other.xml that has 3GB) #493

Open
kontura opened this issue Mar 23, 2022 · 17 comments

@kontura
Contributor

kontura commented Mar 23, 2022

Apologies for so many issues lately, but I have another one.

In RHEL we have some big repodata, such as an other.xml that is 3GB uncompressed.

This fails to load because of an overflow. I think it happens e.g. here: https://github.com/openSUSE/libsolv/blob/master/src/repodata.c#L2528, where data->attrdatalen is an unsigned int but repodata_set() takes an Id (the same conversion happens in other places as well). The truncated value is then stored, and during internalization it is used as an int offset into extdata, which results in a crash.
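
For illustration, here is a minimal self-contained sketch (not libsolv code; it only assumes Id is a 32-bit signed int, as libsolv's pooltypes.h defines it) of how an offset past 2 GiB goes negative when narrowed from unsigned int to Id:

```c
/* Minimal sketch of the suspected overflow; standalone, not libsolv code.
 * Assumes Id is a 32-bit signed int, as in libsolv's pooltypes.h. */
#include <limits.h>
#include <stdio.h>

typedef int Id;

int main(void)
{
    /* Pretend the attribute/string data area has grown past 2 GiB. */
    unsigned int attrdatalen = (unsigned int)INT_MAX + 100u;

    /* Narrowing to Id wraps to a negative value on common platforms, so any
     * later use of it as an int offset into extdata points before the buffer. */
    Id offset = (Id)attrdatalen;

    printf("attrdatalen = %u stored as Id = %d\n", attrdatalen, offset);
    return 0;
}
```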

Is there something more we can do about this, apart from catching the overflow?
I tried converting some of the involved parameters and variables to Offsets but didn't manage to make it work. I am also not sure whether that is even a valid approach; if I understand correctly, we would need some way to differentiate Ids from Offsets, perhaps a new REPOKEY_TYPE_STR_OFFSET. Even then it doesn't seem scalable, since it would only buy time until the unsigned int overflows as well.
Maybe another possibility would be adding additional string data spaces?

A related issue is that in order to parse the metadata we need to load it into RAM all at once, even though the resulting solv file is around one third of the size. Could there be some streaming parsing model, where we process and internalize the metadata continually?

@mlschroe
Member

Oh no, making that work will not be much fun. And the next limit will be 4GB, of course.

Regarding a streaming implementation: you would need to do the parsing and solv file writing in one step. Certainly doable, but not trivial and quite a bit of code.

Once the solv file is written and loaded again, the memory usage is quite small as the changelog author/text is put into the "paged" part of the solv file ("vertical data") and thus not read in.
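
For context, here is a rough sketch (assumed usage, with error handling dropped and parser flags simplified; the real loaders pass extra flags such as REPO_EXTEND_SOLVABLES for filelists/other data) of the parse, write, reload cycle being described:

```c
/* Rough sketch of the parse -> write .solv -> reload cycle described above.
 * Assumed usage: no error handling, parser flags simplified. */
#include <stdio.h>
#include <solv/pool.h>
#include <solv/repo.h>
#include <solv/repo_rpmmd.h>
#include <solv/repo_solv.h>
#include <solv/repo_write.h>

static void cache_and_reload(Repo *repo, FILE *rpmmd_xml, const char *solv_path)
{
    /* Parse the rpm-md XML: everything that was parsed sits in memory at this
     * point (when other.xml is loaded, that includes the changelog text). */
    repo_add_rpmmd(repo, rpmmd_xml, 0, 0);

    /* Write the .solv cache file. */
    FILE *out = fopen(solv_path, "w");
    repo_write(repo, out);
    fclose(out);

    /* Drop the in-memory data and reload from the cache; the "vertical"
     * (paged) part, e.g. changelog author/text, now stays on disk and is
     * only read in pages on demand. */
    repo_empty(repo, 1);
    FILE *in = fopen(solv_path, "r");
    repo_add_solv(repo, in, 0);
    fclose(in);
}
```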

@mlschroe
Member

Is that >3GB repo public so that I can do some testing?

@lukash
Contributor

lukash commented Mar 23, 2022

Once the solv file is written and loaded again, the memory usage is quite small as the changelog author/text is put into the "paged" part of the solv file ("vertical data") and thus not read in.

Oh... would that be a reason to reload cache data immediately after writing? Does it somehow reduce the memory footprint (even though the data were already loaded to the same pool)?

I've removed the reloading in libdnf 5 because I saw no purpose to it and it had a lackluster comment that didn't really explain it.

What does the "paged" part of the solv file (vertical data) mean? Is it not read into memory?

@mlschroe
Member

The data is segmented into 32K pages and read on demand.

@lukash
Contributor

lukash commented Mar 23, 2022

From the file? Does libsolv store the path and reopen the file if needed? Because we are passing an open FILE * which we then close. It's really unexpected that libsolv does this; I would certainly assume I can delete the file after the FILE * is closed...

@etoddallen

Hi. I submitted one of the original bugs to RHEL. I'd just like to point out that the huge XML file is other.xml, not one of the primary ones. Specifically, it's the %changelog entries. For packages with huge, ancient %changelogs, if there are 20 copies of them, those ancient %changelog entries are replicated 20x, which greatly bloats this file.

Anyway, I would not expect this data to be crucial to solving package dependencies. So, for other.xml data and maybe also for filelists.xml data, a lazy loading approach might make sense.

@etoddallen

Clarification: *...if the repo contains 20 different versions of these packages...

@kontura
Contributor Author

kontura commented Mar 24, 2022

Is that >3GB repo public so that I can do some testing?

I don't think it is easily available, but I made a Python script that should suffice: https://github.com/kontura/createrepo_dummy
If you generate the dummy repo with ~20000 pkgs it should be big enough, though it can take quite a while.

@mlschroe
Member

Regarding the paging: repo_add_solv dup()s away the file descriptor and uses pread to read the pages.
You can delete the file; the data can still be accessed as long as the fd is open. (This is of course not true on Windows.)
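
That is standard POSIX behaviour; here is a tiny standalone illustration (not libsolv code) of why the caller may close its FILE * and even delete the file:

```c
/* Standalone POSIX illustration (not libsolv code): a dup()ed descriptor keeps
 * the file's data readable via pread() even after fclose() and unlink(). */
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    FILE *fp = fopen("demo.solv", "w+");
    fputs("page data", fp);
    fflush(fp);

    int fd = dup(fileno(fp));   /* roughly what repo_add_solv keeps for paging */

    fclose(fp);                 /* the caller closes its FILE *       */
    unlink("demo.solv");        /* ... and may even delete the file   */

    char buf[16] = {0};
    pread(fd, buf, sizeof(buf) - 1, 0);   /* the pages are still readable */
    printf("read back: %s\n", buf);

    close(fd);
    return 0;
}
```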

@mlschroe
Member

Regarding the lazy conversion to a solv file: doesn't dnf do that already?

@etoddallen

Regarding laziness: I was talking about a more fine-grained laziness, where dnf could request changelog or filelists data only for specific packages, so that it didn't need to load the entire other.xml or filelists.xml.

@mlschroe
Member

mlschroe commented Apr 7, 2022

That would indeed be nice, but then we'd run into weird cases if the remote repo is changed and the files cannot be accessed anymore. So I think the default should be to do the download and the solv file conversion on demand.

@dralley
Contributor

dralley commented Apr 12, 2022

Hi. I submitted one of the original bugs to RHEL.

@etoddallen Please point me to that Bugzilla because I will gladly raise the volume on it. This has caused a multitude of problems for us.

The root of the issue is that they have been keeping all of the changelogs, for every package, due to a bug. The packages that are updated the most frequently have the longest changelog lists, and also the most copies of those changelogs, because all the old packages are kept as well.

I'm not going to work out the big-O on that, but you can see why it scales poorly. If you take the same metadata and drop all but the last 10 changelogs per package, it ends up at around 10% of the size.
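
To put rough numbers on it (made-up but plausible figures, just for illustration): a package with 20 retained builds, each shipping its full accumulated history of, say, 500 %changelog entries, contributes 20 × 500 = 10,000 entries to other.xml, whereas capping at 10 entries per build would leave 20 × 10 = 200.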

@etoddallen

Sure. The one that I reported is here:
https://bugzilla.redhat.com/show_bug.cgi?id=2008233
But it was closed as a dup in favor of a newer one:
https://bugzilla.redhat.com/show_bug.cgi?id=2040170
The symptoms were slightly different (SIGSEGV vs. OOM), but the root problem was the same: a gigantic other.xml.

@Adrixop95

bump;
Any update on this? I also ran into this bug :v

@dralley
Contributor

dralley commented Jun 13, 2022

Some infra is being updated to restrict the # of changelogs kept per package, but I don't know precisely when it's all going to be in production. Most distros are using createrepo_c, I believe, so they already get this for "free".

@dralley
Contributor

dralley commented Sep 27, 2022

The gigantic other.xml issue is now resolved at the source. The issue was partially that, on top of keeping full copies of the metadata for every version of an RPM, some RPMs such as OpenSSL, Samba and so forth actually keep changelog history going back more than 20 years. So it really was a huge quantity of data.

Restricting to the 10 most recent changelogs (as most distributions do) shrank RHEL's other.xml.gz by 90-99%.

So this issue as written can probably be closed, though we may still want one open for the unsigned int overflow issue.
