
Incremental Synchronization Issue with Bandersnatch #1663

Open
lxyeternal opened this issue Feb 5, 2024 · 5 comments

Comments

@lxyeternal

I am currently using bandersnatch to mirror PyPI and have encountered an issue with incremental synchronization. I want my bandersnatch mirror to only sync new packages added to pypi.org; packages that have been removed from pypi.org should not be deleted from the local mirror during synchronization. In short, I want to perform incremental backups only, without deleting any packages.

How can I configure bandersnatch.conf to achieve this?

@cooperlees
Contributor

cooperlees commented Feb 5, 2024

You're in luck. bandersnatch does not delete unless you run a bandersnatch verify. So you get that by default.
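For context, the default behaviour described here needs no special configuration; a minimal bandersnatch.conf sketch might look like the following (the directory path is a placeholder, and the comments are my reading of the behaviour above, not official documentation):

```ini
; minimal sketch -- adjust paths and worker count to your setup
[mirror]
directory = /srv/pypi
master = https://pypi.org
json = true
timeout = 10
workers = 3
; a plain `bandersnatch mirror` run only adds/updates packages;
; nothing is removed unless you explicitly run `bandersnatch verify`
```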

We do not have a feature to only take new packages created/added on PyPI today. But I am not sure you mean this. I would take a PR to do so, but I don't know the cleanest way. I guess pull down the full mirror list via the XMLRPC call we do, save all the package names, and use that as your starting point. Then from there, compare to the original list and make that an allow list, maybe?

This would need to be some sort of filter plugin to be accepted.
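The snapshot-then-diff idea above could be sketched roughly like this (the helper names are mine, not part of bandersnatch; `list_packages` is PyPI's documented XMLRPC call, which bandersnatch also uses for the full list):

```python
import json
import xmlrpc.client


def save_snapshot(packages, path):
    """Persist the package-name list taken at the first full sync
    (the 'starting point' mentioned above)."""
    with open(path, "w") as f:
        json.dump(sorted(packages), f)


def diff_against_snapshot(current, path):
    """Return names on PyPI now that were absent from the saved
    snapshot. The snapshot itself can serve as an allow list, or
    this diff can drive an 'only new packages' policy."""
    with open(path) as f:
        snapshot = set(json.load(f))
    return sorted(set(current) - snapshot)


if __name__ == "__main__":
    # Requires network access; list_packages returns every project
    # name currently registered on PyPI.
    client = xmlrpc.client.ServerProxy("https://pypi.org/pypi")
    save_snapshot(client.list_packages(), "snapshot.json")
```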

@lxyeternal
Author

Thank you very much. I only want to mirror all packages from pypi.org. My goal is to build a comprehensive dataset of the Python package registry for research.

@allamiro

allamiro commented Jul 15, 2024

@lxyeternal and @cooperlees,

Another approach to consider is using a local SQLite database to track package metadata. During each sync, compare PyPI's current metadata with the database to identify new or updated packages, download only those packages, and update the database without deleting any local packages. This method simplifies incremental synchronization and ensures no historical data is lost. Let me know what you both think.
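The tracking idea could be sketched roughly as follows (the table schema, serial-based comparison, and function names are assumptions for illustration, not an existing bandersnatch feature):

```python
import sqlite3


def open_tracker(path=":memory:"):
    """Open (or create) the local tracking database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS packages ("
        " name TEXT PRIMARY KEY,"
        " serial INTEGER NOT NULL)"
    )
    return db


def packages_to_fetch(db, upstream):
    """upstream: {package_name: serial} taken from PyPI.
    Return names that are new or whose serial has advanced.
    Rows are only ever inserted or updated, never deleted, so
    packages removed upstream stay in the database and on disk."""
    todo = []
    for name, serial in upstream.items():
        row = db.execute(
            "SELECT serial FROM packages WHERE name = ?", (name,)
        ).fetchone()
        if row is None or row[0] < serial:
            todo.append(name)
    return todo


def record_synced(db, name, serial):
    """Mark a package as downloaded at the given serial."""
    db.execute(
        "INSERT INTO packages (name, serial) VALUES (?, ?)"
        " ON CONFLICT(name) DO UPDATE SET serial = excluded.serial",
        (name, serial),
    )
    db.commit()
```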

@cooperlees
Contributor

I'd need more information here on the implementation and on the goals, with this being off by default, as most use cases would not benefit from this addition.

Also, how would you detect bad data from failed runs (crashes) etc. and be able to re-sync the SQLite Database if this did happen? This opens up a new data store to keep clean and up to date. State is hard.
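On the recovery question, one hedged sketch: rebuild the tracking database from the mirror directory itself, treating the files on disk as the source of truth. This assumes a bandersnatch-style layout with per-package metadata under `web/json/`, and resets serials to zero so the next run re-checks everything; both are illustrative assumptions, not a worked-out design.

```python
import os
import sqlite3


def rebuild_from_mirror(db_path, mirror_dir):
    """Discard possibly-corrupt tracking state after a crash and
    repopulate it from the packages actually present on disk."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS packages ("
        " name TEXT PRIMARY KEY,"
        " serial INTEGER NOT NULL)"
    )
    db.execute("DELETE FROM packages")
    # Assumed layout: one entry per package under <mirror>/web/json/
    json_dir = os.path.join(mirror_dir, "web", "json")
    for name in os.listdir(json_dir):
        # Serial is unknown after a crash; 0 forces a re-check of
        # this package on the next sync without re-downloading it
        # unconditionally.
        db.execute(
            "INSERT INTO packages (name, serial) VALUES (?, 0)",
            (name,),
        )
    db.commit()
    return db
```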

@allamiro

> I'd need more information here on the implementation and on the goals, with this being off by default, as most use cases would not benefit from this addition.
>
> Also, how would you detect bad data from failed runs (crashes) etc. and be able to re-sync the SQLite Database if this did happen? This opens up a new data store to keep clean and up to date. State is hard.

I appreciate the feedback and acknowledge the valid concerns raised regarding the implementation and goals for incremental synchronization with Bandersnatch. I must clarify that I misspoke earlier regarding dirsync: upon reviewing its documentation, it appears it may not be suitable for our needs.
Instead, there are other Python libraries that could potentially be leveraged, or we might consider developing a custom script. Specifically, pyrsync offers robust functionality for incremental file synchronization, which might be more appropriate for our use case.
I will continue researching to find the best possible solution and to ensure we address all these concerns effectively.
