Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving database efficiency #44

Open
meteorologist15 opened this issue Jun 8, 2024 · 4 comments
Open

Improving database efficiency #44

meteorologist15 opened this issue Jun 8, 2024 · 4 comments

Comments

@meteorologist15
Copy link

My recent usage of esgpull has prompted me to officially raise an issue with regards to how efficiently (or inefficiently in this case) the database is utilized. Our carryover esgpull.db from the Synda days stands at around 2.3 GB in size. Starting this past Monday (2024-06-03), I have been attempting to download a fairly large grouping of data (~2.0 TiB; 60,000+ files), and after adding and tracking the query, the "update" process (esgpull update <QUERY_ID>) is STILL trudging along, with the database seemingly processing/validating/querying one file approximately every 60-90 seconds. If the pace continues, I'll be able to get through this "update" step in a little over a month from now, a very noticeable scaling issue. I'm not sure exactly how esgpull's "update" command utilizes/accesses the database, but I would presume that whatever algorithm is being used now should eventually be tuned to scale better with bigger requests and an already-packed esgpull.db.

@svenrdz
Copy link
Collaborator

svenrdz commented Jul 16, 2024

Hi @meteorologist15
Apologies for the wait on this issue.
I've just spent the day on the problem and found a few ways to improve the performance on updates.
The fix currently sits in this PR: #47
I'll make sure to push a new release as soon as possible, to both PyPI and conda.

@meteorologist15
Copy link
Author

Thank you for pushing this PR! I was wondering though, were there any changes made that could break my old esgpull.db? I just installed a brand new instance of esgpull, copied over my esgpull.db, and when I ran 'esgpull show', I got the following message:


DatabaseError: (sqlite3.DatabaseError) database disk image is malformed
[SQL: SELECT facet.name AS facet_name, facet.value AS facet_value, facet.sha AS facet_sha
FROM facet, selection_facet
WHERE ? = selection_facet.selection_sha AND facet.sha = selection_facet.facet_sha]

(Background on this error at: https://sqlalche.me/e/20/4xp6)

I then tried to transfer over a copy of an older esgpull.db than I had made when I had just translated my synda.db over to an esgpull.db. That also failed with a similar message:


DatabaseError: (sqlite3.DatabaseError) database disk image is malformed
[SQL: SELECT count(?) AS count_1
FROM query_file JOIN file ON file.sha = query_file.file_sha
WHERE query_file.query_sha = ?]

(Background on this error at: https://sqlalche.me/e/20/4xp6)

Any thoughts? Thanks!

@meteorologist15
Copy link
Author

Also, I see additional "esgpull.db-shm" and "esgpull.db-wal" files in my 'db' directory, in addition to the clean esgpull.db upon installation. Could you perhaps describe what they are for? Thanks!

@meteorologist15
Copy link
Author

Perhaps the 'LEGACY' parameter is tripping things up? (though it had worked before?)

From the log:

[2024-07-19 18:27:57] DEBUG root
Locals:
{
'self': <sqlalchemy.dialects.sqlite.pysqlite.SQLiteDialect_pysqlite object at 0x2af5617664b0>,
'cursor': <sqlite3.Cursor object at 0x2af561820bc0>,
'statement': 'SELECT count(?) AS count_1 \nFROM query_file JOIN file ON file.sha = query_file.file_sha \nWHERE query_file.query_sha = ?',
'parameters': ('*', 'LEGACY'),
'context': <sqlalchemy.dialects.sqlite.base.SQLiteExecutionContext object at 0x2af5617d5520>
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants