
Write out results schema to filesystem in summarize functions #1729

Merged · 27 commits · Mar 17, 2024

Conversation

tomdemeyere
Contributor

@tomdemeyere tomdemeyere commented Feb 19, 2024

Summary of Changes

This has two purposes:

  • Make sure the results are written safely somewhere until the file-based database issue is sorted out

  • For calculators where nothing is written to disk (MACE, EMT, ...), this results file gives a purpose to the currently empty directories

What do you think?

EDIT: Probably not going to be as easy as I was expecting

Checklist

  • I have read the "Guidelines" section of the contributing guide. Don't lie! 😉
  • My PR is on a custom branch and is not named main.
  • I have used black, isort, and ruff as described in the style guide.
  • I have added relevant, comprehensive unit tests.

Notes

  • Your PR will likely not be merged without proper and thorough tests.
  • If you are an external contributor, you will see a comment from @buildbot-princeton. This is solely for the maintainers.
  • When your code is ready for review, ping one of the active maintainers.

@buildbot-princeton
Collaborator

Can one of the admins verify this patch?

Member

@Andrew-S-Rosen Andrew-S-Rosen left a comment


Thanks!

EDIT: Probably not going to be as easy as I was expecting

How come? Because of the pickle error? See below.

As a general comment, it's unclear to me if pickle is the best approach here because the results are not easy to inspect outside of Python, and pickle is not guaranteed to work across Python versions. It's also not exactly how the results are stored in the database since it is the serialized results that are stored. All schemas are (de)serializable. As such, I would recommend the following:

from monty.serialization import dumpfn

dumpfn(task_doc, "results.json") # it will get auto-gzipped if `GZIP_FILES: bool = True`

You can read it back in a lossless way via

from monty.serialization import loadfn

loadfn("results.json")

Make sure the results are written safely somewhere until the file-based database issue is sorted out

Sure, that's fair. Most users of quacc are also MongoDB users and so we haven't had such issues, but I have long been aware that (seemingly convenient) filesystem-based stores are a large challenge for concurrent workflows in general. This is because databases that involve concurrent connections often need a client/server model, which is what most people using a filesystem-based data store are trying to avoid in the first place. This is a non-trivial issue but one I hope can be solved by others on the Materials Project team.

For calculators where nothing is written to disk (MACE, EMT, ...), this results file gives a purpose to the currently empty directories

They will only be empty for static calculations, as relaxations will dump out the log and restart files. In any case, it is perhaps worthwhile for us to deal with this directly if it is a nuisance. One option would be to check if CREATE_UNIQUE_DIR is True and if the directory is empty. If so, the directory can be removed. Additionally, we would replace the value of dir_name in the output with a value of None to reflect the fact that it doesn't exist on the filesystem anymore.

This is not something you need to implement (for the sake of simplicity, let's not deal with it in this PR regardless). I am just suggesting it as a viable option for later.
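
To illustrate the idea, a minimal sketch (not part of this PR; the helper name and arguments simply mirror the discussion above):

from pathlib import Path

def prune_empty_dir(output: dict, directory: Path, create_unique_dir: bool) -> dict:
    # If a unique directory was created but the job wrote nothing into it,
    # remove it and record in the schema that it no longer exists on disk
    if create_unique_dir and not any(directory.iterdir()):
        directory.rmdir()
        output["dir_name"] = None
    return output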

src/quacc/schemas/ase.py (inline review comment, outdated, resolved)
src/quacc/settings.py (inline review comment, outdated, resolved)
@Andrew-S-Rosen
Member

I'll take a look later to figure out what's going on here.

@tomdemeyere
Contributor Author

tomdemeyere commented Feb 20, 2024

@Andrew-S-Rosen Thanks, I thought the same. It seems that a string is finding its way into the Monty code when it should be a dict.

EDIT: Thanks for the comment, by the way; it's always nice to learn new things, and there are some aspects of Python I am not familiar with.

@Andrew-S-Rosen
Member

I am sorting this out. Seems to be related to materialsproject/emmet#914.

@Andrew-S-Rosen
Member

Andrew-S-Rosen commented Feb 22, 2024

I just fixed what I believe is the upstream issue. Just waiting on a new release of emmet-core.

Edit: Okay that worked, but now we have the following to deal with:

@Andrew-S-Rosen changed the title from "Proposal: Pickling results schema in summarize functions" to "Write out results schema to filesystem in summarize functions" on Feb 22, 2024
@Andrew-S-Rosen added the "enhancement" (New feature or request) label on Feb 29, 2024
@Andrew-S-Rosen
Member

Note to self for later:

We can circumvent these issues by jsanitize-ing the data beforehand.
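
For future reference, a minimal sketch of what that would look like (the task_doc here is just a stand-in, not the exact quacc code):

import numpy as np
from monty.json import jsanitize
from monty.serialization import dumpfn

# Stand-in for the results schema produced by the summarize functions
task_doc = {"energy": np.float64(-1.23), "parameters": {"kpts": np.array([2, 2, 2])}}

# jsanitize converts numpy scalars/arrays and MSONable objects to plain Python
# types, so the document can be dumped without serialization errors
dumpfn(jsanitize(task_doc, strict=True), "results.json")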

@Andrew-S-Rosen
Member

Andrew-S-Rosen commented Mar 4, 2024

Edited:

Alright, the JSON serialization approach does work for the most part. I just had to jsanitize the data beforehand. But ultimately, I think the original pickle approach might be worth considering instead simply because the data types will be exactly the same, since no sanitization needs to occur. I also like the thought of ensuring that all outputs are pickle-able because this is often important for several of the supported workflow engines.

I have sorted out the pickle errors. The only thing left now should be to uncomment the remaining SETTINGS.WRITE_PICKLE blocks, which need a directory to work properly. I can continue revisiting this bit by bit when I get some time.
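
For anyone following along, the gist of the pickle-based approach is roughly the following (a sketch only; the helper names and file name are illustrative, not the exact quacc implementation):

import pickle
from pathlib import Path

def write_pickle(schema: dict, directory: Path) -> None:
    # Dump the schema untouched, so the exact Python objects (Atoms, numpy
    # arrays, ...) come back unchanged when the file is read later
    with Path(directory, "quacc_results.pkl").open("wb") as fd:
        pickle.dump(schema, fd)

def read_pickle(directory: Path) -> dict:
    with Path(directory, "quacc_results.pkl").open("rb") as fd:
        return pickle.load(fd)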

Andrew-S-Rosen added a commit that referenced this pull request Mar 4, 2024
Fix pickle-ability of output schemas, first noticed in #1729.

codecov bot commented Mar 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.38%. Comparing base (5666eaf) to head (fb50d1f).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1729   +/-   ##
=======================================
  Coverage   99.37%   99.38%           
=======================================
  Files          81       81           
  Lines        3200     3227   +27     
=======================================
+ Hits         3180     3207   +27     
  Misses         20       20           


@Andrew-S-Rosen
Member

Andrew-S-Rosen commented Mar 4, 2024

The light at the end of the tunnel has emerged: the checks pass (after quite a bit of refactoring in main)!! Now we just need some tests 😄

@tomdemeyere
Contributor Author

tomdemeyere commented Mar 4, 2024

@Andrew-S-Rosen

Thanks for taking care of that; I think that was a little bit beyond my Python abilities. Tests will come this week, probably.

EDIT: it seems that one particular line takes a lot of time (a flat 30 seconds each time I run a calculation), so I will turn off get_metadata for now

metadata = MoleculeMetadata().from_molecule(mol).model_dump()

in quacc/schemas/atoms.py
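
For context, a quick way to time that call in isolation (illustrative only; the water molecule is just a stand-in, and this assumes emmet-core with pydantic v2):

import time
from emmet.core.structure import MoleculeMetadata
from pymatgen.core import Molecule

mol = Molecule(["O", "H", "H"], [[0.0, 0.0, 0.0], [0.95, 0.0, 0.0], [-0.24, 0.92, 0.0]])
start = time.perf_counter()
metadata = MoleculeMetadata.from_molecule(mol).model_dump()
print(f"metadata generation took {time.perf_counter() - start:.2f} s")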

@Andrew-S-Rosen
Member

Andrew-S-Rosen commented Mar 4, 2024

EDIT: it seems that one particular line takes a lot of time (a flat 30 seconds each time I run a calculation), so I will turn off get_metadata for now

Is this a result of the change in the present PR, or are you just commenting more generally? Could you also provide some insight about the Atoms object? I am curious if there is a lot of data in .info. If this is a new thing, perhaps it is related to #1829. I doubt it is related to the present PR specifically.

@tomdemeyere
Contributor Author

EDIT: it seems that one particular line takes a lot of time (a flat 30 seconds each time I run a calculation), so I will turn off get_metadata for now

Is this a result of the change in the present PR, or are you just commenting more generally? Could you also provide some insight about the Atoms object? I am curious if there is a lot of data in .info. If this is a new thing, perhaps it is related to #1829. I doubt it is related to the present PR specifically.

Ah yes! There is something inside atoms.info, although it's not a lot, just a simple short string. I assumed this was related to this commit, but you are right that it might not be.

@tomdemeyere
Contributor Author

Here are the tests! I did not create new ones but made sure to test all the summarize functions.

@Andrew-S-Rosen
Member

@tomdemeyere: Fantastic, thank you very much! I will proceed with the merge.

Later on, if you find that having JSON output is critical, let me know because I can likely make it work if needed.

@Andrew-S-Rosen Andrew-S-Rosen merged commit 1fcc33a into Quantum-Accelerators:main Mar 17, 2024
20 checks passed
@tomdemeyere
Contributor Author

Just thought about it:

Would you mind adding a reserved key name in "additional_fields" whose value would be used as the name for the pickle file?

{"pickle_label": "water_dimer_5.5"}

This would be a nice niche feature to help with visual inspection.
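
Something like the following sketch (the key name and the pop-based handling are only illustrative; nothing like this exists yet):

import pickle
from pathlib import Path

def write_labeled_pickle(schema: dict, additional_fields: dict, directory: Path) -> None:
    # Use the reserved "pickle_label" key, if present, as the filename stem;
    # otherwise fall back to a generic name
    label = additional_fields.pop("pickle_label", "quacc_results")
    with Path(directory, f"{label}.pkl").open("wb") as fd:
        pickle.dump(schema, fd)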

@Andrew-S-Rosen
Member

Personally, I'm a bit hesitant to add in "hidden" features where we overload a given method with multiple purposes (as you likely have observed from my comments about disliking the .pop business on user-supplied dictionaries).
