Provide regular dumps of Trac database #231

Open
bmispelon opened this issue Nov 19, 2024 · 4 comments

@bmispelon
Member

A long time ago, database dumps of the Trac tables used to be provided for public consumption, but that practice stopped at some point.

I think we should start doing this again (inspired by a discussion I had on Discord with @ulgens today). It would be useful both for people trying to work on code.djangoproject.com locally and for those who'd like to extract statistics from Trac.

Some of the technical challenges to figure out:

  • Where to host the file (it used to be on the wiki, but maybe there's a better place)?
  • How to clean the data? The database has things like session ids which should not be shared under any circumstances, as well as user emails which should stay private.
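A rough sketch of what the export step could look like, assuming the database is PostgreSQL and that the sensitive rows live in Trac's standard `session`, `session_attribute` and `auth_cookie` tables (both assumptions would need checking against the production setup):

```python
import subprocess

# Assumption: the standard Trac tables that hold session ids and user
# emails. The real exclusion list would have to be audited against the
# production schema.
EXCLUDED_TABLE_DATA = ["session", "session_attribute", "auth_cookie"]


def dump_trac(database="trac", outfile="trac-dump.sql"):
    """Dump the database, keeping the excluded tables' schema but not their rows."""
    cmd = ["pg_dump", "--no-owner", "--no-privileges", "-f", outfile]
    for table in EXCLUDED_TABLE_DATA:
        cmd.append(f"--exclude-table-data={table}")
    cmd.append(database)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    dump_trac()
```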
@ulgens
Contributor

ulgens commented Nov 21, 2024

Hey @bmispelon , thanks for moving this to an issue.

I was looking for a way to get a dump from Trac to run data analyses on it. I wanted to measure a couple of points, but my big goal was to understand if something like https://clickpy.clickhouse.com/ is doable for Django issues and find ways to improve the dashboard.

  • I have no opinion about the hosting solution. Is the wiki still alive?
  • Did we have anything to process data before the export when those dumps were active? If so, is the script still accessible?
    • I guess we can start by deciding what we want to export/dump. My guess is that the data will be issue-based, so even without any user-related data, it may be useful to some degree. If we want to export state transitions, owner info, commenters, etc., what are the data points that we can share about the user?
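As an illustration of what I have in mind for the transitions part (purely a sketch: it assumes Trac's standard `ticket_change` table, a PostgreSQL backend and a placeholder connection string):

```python
import psycopg2  # assumption: PostgreSQL behind Trac

# Status transitions per ticket, exposing only the author username,
# which is already public on the Trac web UI.
QUERY = """
    SELECT ticket, time, author, oldvalue, newvalue
    FROM ticket_change
    WHERE field = 'status'
    ORDER BY ticket, time;
"""

with psycopg2.connect("dbname=trac") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for ticket, time, author, old, new in cur.fetchall():
            print(f"#{ticket}: {old or '(new)'} -> {new} by {author}")
```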

@bmispelon
Member Author

I have no opinion about the hosting solution. Is the wiki still alive?

Depends on what you mean by "alive" 😁. It is technically still working, but it's barely used (I think the fellows update the roadmap page, but that's all I know of).
It could be a good solution for hosting in terms of not having to recreate something that already exists, but it might be harder to automate (I'm not sure how easy it is to interact with the wiki part of Trac programmatically).
Another low-tech solution is to have a fixed location where we upload/save the dump, make sure that location is served by the webserver, and hardcode that link in the templates where we want it.
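If we go that route, the whole thing could probably be a small cron script that writes the dump into a directory the webserver already serves. Something along these lines (all paths and names below are made up, not the actual deployment):

```python
import datetime
import subprocess
from pathlib import Path

# Made-up location; the real one would be whatever directory the
# webserver is configured to serve.
DUMP_DIR = Path("/srv/trac-dumps")


def publish_dump(database="trac"):
    DUMP_DIR.mkdir(parents=True, exist_ok=True)
    outfile = DUMP_DIR / f"trac-{datetime.date.today().isoformat()}.sql"
    subprocess.run(
        ["pg_dump", "--no-owner", "--no-privileges", "-f", str(outfile), database],
        check=True,
    )
    # Stable filename to hardcode in the templates, always pointing at
    # the most recent dump.
    latest = DUMP_DIR / "trac-latest.sql"
    latest.unlink(missing_ok=True)
    latest.symlink_to(outfile.name)


if __name__ == "__main__":
    publish_dump()
```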

Did we have anything to process data before the export when those dumps were active? If so, is the script still accessible?

Unfortunately the script (if there ever was one) is not available anymore. I tried to reach out to the person who I thought was uploading the dumps, but never got a reply.

I guess we can start by deciding what we want to export/dump. My guess is that the data will be issue-based, so even without any user-related data, it may be useful to some degree. If we want to export state transitions, owner info, commenters, etc., what are the data points that we can share about the user?

Personally I think we should share as much of the data as possible, while preserving security (session ids for example) and users' privacy (email addresses). Basically, if the information is available publicly in some form, it should be included in the dump. I think that's where the hard part of this issue resides: figuring out which tables/columns are safe to share or not.

Not sure if you've seen it already, but we already share a (very limited) dump of the Trac database, with mostly just the tables and no data: https://github.com/django/djangoproject.com/blob/main/tracdb/trac.sql. If you come up with some scripts or queries you'd like to try out, don't hesitate to get in touch with me and I can run them on the live data.
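To give an idea of what's possible even with an anonymized dump, here's the kind of aggregate query I mean (a sketch only: it assumes PostgreSQL and that `ticket.time` is stored as microseconds since the epoch, which is what recent Trac versions do):

```python
import psycopg2  # assumption: PostgreSQL behind Trac

# Tickets opened per month: a clickpy-style aggregate that needs no
# private data at all.
QUERY = """
    SELECT date_trunc('month', to_timestamp(time / 1000000.0)) AS month,
           count(*) AS opened
    FROM ticket
    GROUP BY month
    ORDER BY month;
"""

with psycopg2.connect("dbname=trac") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for month, opened in cur.fetchall():
            print(f"{month:%Y-%m}: {opened} tickets opened")
```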

@thibaudcolas
Member

I’d find it useful personally, both for searching for things in Trac (I have lots of trouble with the default experience) and for checking people’s contribution history (for example as part of the Steering Council elections).

Re hosting – is it a question of file size / bandwidth, or automation, or discoverability? 🤔 Depending on the answer, it could be some of the DSF’s infrastructure and platforms, or perhaps something meant for analysts, like Kaggle.

For cleaning the data – I’d guess a table allowlist, and then anonymization steps for specific tables? See for example django-birdbath; I’m not sure it’s the right fit here, but it’s what my employer uses to help share database dumps without exposing the personal data that’s meant to stay in production only.
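Roughly what I’m picturing, independent of birdbath’s own API (which is model-based), with the table and column names here being guesses about the Trac schema rather than verified facts:

```python
# Tables I'd guess are safe to publish wholesale because everything in
# them is already visible through the Trac web UI (to be audited!).
ALLOWED_TABLES = {
    "ticket", "ticket_change", "ticket_custom",
    "component", "milestone", "version", "enum",
}

# Columns inside allowed tables that still need anonymizing before the
# dump leaves production; e.g. cc fields can contain raw email addresses.
SCRUB_COLUMNS = {
    "ticket": ["cc"],
}


def cleaning_statements():
    """Yield the SQL a cleaning step would run on a copy of the database."""
    for table, columns in SCRUB_COLUMNS.items():
        assignments = ", ".join(f"{column} = ''" for column in columns)
        yield f"UPDATE {table} SET {assignments};"


if __name__ == "__main__":
    for statement in cleaning_statements():
        print(statement)
```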

@bmispelon
Member Author

At this point it's purely a technical issue about how to clean the data in a reproducible and automated way. I'll take a look at birdbath, thanks 👍🏻
