Provide regular dumps of Trac database #231

bmispelon · 2024-11-19T19:04:50Z

A long time ago, database dumps of the Trac tables used to be provided for public consumption, but that practice stopped at some point.

I think we should start doing this again (inspired by a discussion I had on Discord with @ulgens today). It would be useful both for people trying to work on code.djangoproject.com locally and for those who'd like to extract statistics from Trac.

Some of the technical challenges to figure out:

- Where to host the file (it used to be on the wiki, but maybe there's a better place)?
- How to clean the data? The database has things like session ids which should not be shared under any circumstances, as well as user emails which should stay private.

ulgens · 2024-11-21T10:00:39Z

Hey @bmispelon , thanks for moving this to an issue.

I was looking for a way to get a dump from Trac to run data analyses on it. I wanted to measure a couple of points, but my big goal was to understand if something like https://clickpy.clickhouse.com/ is doable for Django issues and find ways to improve the dashboard.

- I have no opinion about the hosting solution. Is wiki still alive?
- Did we have anything to process data before the export when those dumps were active? If so, is the script still accessible?
- I guess we can start by deciding what we want to export/dump. My guess is that the data will be issue-based, so even without any user-related data, it may be useful to some degree. If we want to export state transitions, owner info, commenters, etc., what are the data points that we can share about the user?

bmispelon · 2024-11-21T11:36:11Z

> I have no opinion about the hosting solution. Is wiki still alive?

Depends what you mean by "alive" 😁 . It is technically still working, but it's barely used (I think the fellows update the roadmap page, but that's all I know of).
It could be a good solution for hosting in terms of not having to recreate something that already exists, but it might be harder to automate (I'm not sure how easy it is to interact with the wiki part of Trac programmatically).
Another low-tech solution is to have a fixed location where we upload/save the dump, make sure that location is served by the webserver, and hardcode that link in the templates where we want it.

> Did we have anything to process data before the export when those dumps were active? If so, is the script still accessible?

Unfortunately the script (if there ever was one) is not available anymore. I tried to reach out to the person who I thought was uploading the dumps, but never got a reply.

> I guess we can start by deciding what we want to export/dump. My guess is that the data will be issue-based, so even without any user-related data, it may be useful to some degree. If we want to export state transitions, owner info, commenters, etc., what are the data points that we can share about the user?

Personally I think we should share as much of the data as possible, while preserving security (session ids for example) and users' privacy (email addresses). Basically, if the information is available publicly in some form, it should be included in the dump. I think that's where the hard part of this issue resides: figuring out which tables/columns are safe to share and which are not.

Not sure if you've seen already, but we already share a (very limited) dump of the trac database with mostly just the tables and no data: https://github.com/django/djangoproject.com/blob/main/tracdb/trac.sql. If you come up with some scripts or queries you'd like to try out, don't hesitate to get in touch with me and I can run them on the live data.
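
As a starting point for the "which tables/columns are safe" review, here is a rough sketch of an inventory script. It is only an illustration: it uses an in-memory SQLite database so it can run anywhere, whereas the real data lives elsewhere and would need the equivalent catalog query for its database engine. The `SENSITIVE_HINTS` fragments, the `inventory` function, and the demo tables are all hypothetical, and a name-based flag is only a prompt for a human review, not a substitute for one.

```python
import sqlite3

# Hypothetical name fragments that suggest a column deserves a manual review.
SENSITIVE_HINTS = ("email", "sid", "cookie", "password")


def inventory(conn):
    """Return (table, column, needs_review) for every column in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    result = []
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        for row in conn.execute(f'PRAGMA table_info("{table}")'):
            column = row[1]
            flagged = any(hint in column.lower() for hint in SENSITIVE_HINTS)
            result.append((table, column, flagged))
    return result


def make_demo_db():
    """Build a tiny stand-in schema (made up for this example, no real data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ticket (id INTEGER, summary TEXT, reporter TEXT);
        CREATE TABLE session (sid TEXT, last_visit INTEGER);
        CREATE TABLE auth_cookie (cookie TEXT, name TEXT);
        """
    )
    return conn
```

Calling `inventory(make_demo_db())` gives one tuple per column, which could then be turned into a checklist where each table/column is explicitly marked as shareable or not.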

thibaudcolas · 2024-12-13T15:11:21Z

I’d find it useful personally as part of searching for things in Trac (have lots of trouble with the default experience), and checking people’s contribution history (for example as part of the Steering Council elections).

Re hosting – is it a question of file size / bandwidth, or automation, or discoverability? 🤔 Depending on the answer, could be some of the DSF’s infrastructure and platforms, or something meant for analysts like Kaggle perhaps.

For cleaning the data – I’d guess a table allowlist, and then anonymization steps for specific tables? See for example django-birdbath, not sure it’s the right fit here but that’s what my employer uses to help share database dumps without sharing the personal data that’s meant to stay in production only.
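
To make the allowlist idea concrete, here is a minimal sketch, assuming the public dump is built by copying only explicitly listed tables and columns into a fresh database. It uses in-memory SQLite purely so it is runnable; the `ALLOWLIST` contents, the function names, and the demo rows are hypothetical and are not a reviewed list of what is safe to publish. It does not use django-birdbath.

```python
import sqlite3

# Hypothetical allowlist: table name -> columns considered safe to publish.
# Tables and columns that are not listed never reach the public copy.
ALLOWLIST = {
    "ticket": ["id", "summary", "status", "reporter"],
}


def copy_allowlisted(src, dst, allowlist):
    """Copy only the allowlisted tables and columns from src into dst."""
    for table, columns in allowlist.items():
        cols = ", ".join(f'"{c}"' for c in columns)
        marks = ", ".join("?" for _ in columns)
        dst.execute(f'CREATE TABLE "{table}" ({cols})')
        rows = src.execute(f'SELECT {cols} FROM "{table}"').fetchall()
        dst.executemany(f'INSERT INTO "{table}" VALUES ({marks})', rows)
    dst.commit()


def make_demo_source():
    """Build a tiny in-memory stand-in for the real database (made-up rows)."""
    src = sqlite3.connect(":memory:")
    src.executescript(
        """
        CREATE TABLE ticket (id INTEGER, summary TEXT, status TEXT, reporter TEXT, cc TEXT);
        CREATE TABLE session (sid TEXT, authenticated INTEGER, last_visit INTEGER);
        INSERT INTO ticket VALUES (1, 'Example ticket', 'new', 'alice', 'alice@example.com');
        INSERT INTO session VALUES ('secret-session-id', 1, 0);
        """
    )
    return src
```

The point of the design is that the default is exclusion: a new table or column stays private until someone adds it to the allowlist. Anonymization steps for specific columns (for example replacing values before the insert) could be layered on top of the same loop.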

bmispelon · 2024-12-13T15:17:37Z

At this point it's purely a technical issue about how to clean the data in a reproducible and automated way. I'll take a look at birdbath, thanks 👍🏻

bmispelon added the question label Nov 19, 2024
