Improving database efficiency #44

meteorologist15 · 2024-06-08T22:26:08Z

My recent usage of esgpull has prompted me to officially raise an issue with regards to how efficiently (or inefficiently in this case) the database is utilized. Our carryover esgpull.db from the Synda days stands at around 2.3 GB in size. Starting this past Monday (2024-06-03), I have been attempting to download a fairly large grouping of data (~2.0 TiB; 60,000+ files), and after adding and tracking the query, the "update" process (esgpull update <QUERY_ID>) is STILL trudging along, with the database seemingly processing/validating/querying one file approximately every 60-90 seconds. If the pace continues, I'll be able to get through this "update" step in a little over a month from now, a very noticeable scaling issue. I'm not sure exactly how esgpull's "update" command utilizes/accesses the database, but I would presume that whatever algorithm is being used now should eventually be tuned to scale better with bigger requests and an already-packed esgpull.db.

svenrdz · 2024-07-16T17:06:26Z

Hi @meteorologist15
Apologies for the wait on this issue.
I've just spent the day on the problem and found a few ways to improve the performance on updates.
The fix currently sits in this PR: #47
I'll make sure to push a new release as soon as possible, to both PyPI and conda.

meteorologist15 · 2024-07-19T22:33:58Z

Thank you for pushing this PR! I was wondering though, were there any changes made that could break my old esgpull.db? I just installed a brand new instance of esgpull, copied over my esgpull.db, and when I ran 'esgpull show', I got the following message:

DatabaseError: (sqlite3.DatabaseError) database disk image is malformed
[SQL: SELECT facet.name AS facet_name, facet.value AS facet_value, facet.sha AS facet_sha
FROM facet, selection_facet
WHERE ? = selection_facet.selection_sha AND facet.sha = selection_facet.facet_sha]

(Background on this error at: https://sqlalche.me/e/20/4xp6)

I then tried to transfer over a copy of an older esgpull.db than I had made when I had just translated my synda.db over to an esgpull.db. That also failed with a similar message:

DatabaseError: (sqlite3.DatabaseError) database disk image is malformed
[SQL: SELECT count(?) AS count_1
FROM query_file JOIN file ON file.sha = query_file.file_sha
WHERE query_file.query_sha = ?]

(Background on this error at: https://sqlalche.me/e/20/4xp6)

Any thoughts? Thanks!

meteorologist15 · 2024-07-19T22:36:51Z

Also, I see additional "esgpull.db-shm" and "esgpull.db-wal" files in my 'db' directory, in addition to the clean esgpull.db upon installation. Could you perhaps describe what they are for? Thanks!

meteorologist15 · 2024-07-19T22:43:30Z

Perhaps the 'LEGACY' parameter is tripping things up? (though it had worked before?)

From the log:

[2024-07-19 18:27:57] DEBUG root
Locals:
{
'self': <sqlalchemy.dialects.sqlite.pysqlite.SQLiteDialect_pysqlite object at 0x2af5617664b0>,
'cursor': <sqlite3.Cursor object at 0x2af561820bc0>,
'statement': 'SELECT count(?) AS count_1 \nFROM query_file JOIN file ON file.sha = query_file.file_sha \nWHERE query_file.query_sha = ?',
'parameters': ('*', 'LEGACY'),
'context': <sqlalchemy.dialects.sqlite.base.SQLiteExecutionContext object at 0x2af5617d5520>
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improving database efficiency #44

Improving database efficiency #44

meteorologist15 commented Jun 8, 2024

svenrdz commented Jul 16, 2024

meteorologist15 commented Jul 19, 2024

meteorologist15 commented Jul 19, 2024

meteorologist15 commented Jul 19, 2024

Improving database efficiency #44

Improving database efficiency #44

Comments

meteorologist15 commented Jun 8, 2024

svenrdz commented Jul 16, 2024

meteorologist15 commented Jul 19, 2024

(Background on this error at: https://sqlalche.me/e/20/4xp6)

(Background on this error at: https://sqlalche.me/e/20/4xp6)

meteorologist15 commented Jul 19, 2024

meteorologist15 commented Jul 19, 2024