
Write out results schema to filesystem in summarize functions #1729

Merged · 27 commits · Mar 17, 2024

Conversation

tomdemeyere
Contributor

@tomdemeyere tomdemeyere commented Feb 19, 2024

Summary of Changes

This has two purposes:

  • Make sure the results are written safely somewhere until the file-based database issue is sorted out

  • For calculators where nothing is written to disk (MACE, EMT, ...), this results file gives a purpose to the currently empty directories

What do you think?

EDIT: Probably not going to be as easy as I was expecting

Checklist

  • I have read the "Guidelines" section of the contributing guide. Don't lie! 😉
  • My PR is on a custom branch and is not named main.
  • I have used black, isort, and ruff as described in the style guide.
  • I have added relevant, comprehensive unit tests.

Notes

  • Your PR will likely not be merged without proper and thorough tests.
  • If you are an external contributor, you will see a comment from @buildbot-princeton. This is solely for the maintainers.
  • When your code is ready for review, ping one of the active maintainers.

@buildbot-princeton
Collaborator

Can one of the admins verify this patch?

Member

@Andrew-S-Rosen Andrew-S-Rosen left a comment


Thanks!

EDIT: Probably not going to be as easy as I was expecting

How come? Because of the pickle error? See below.

As a general comment, it's unclear to me if pickle is the best approach here because the results are not easy to inspect outside of Python, and pickle is not guaranteed to work across Python versions. It's also not exactly how the results are stored in the database since it is the serialized results that are stored. All schemas are (de)serializable. As such, I would recommend the following:

from monty.serialization import dumpfn

dumpfn(task_doc, "results.json") # it will get auto-gzipped if `GZIP_FILES: bool = True`

You can read it back in a lossless way via

from monty.serialization import loadfn

loadfn("results.json")

Make sure the results are written safely somewhere until the file-based database issue is sorted out

Sure, that's fair. Most users of quacc are also MongoDB users and so we haven't had such issues, but I have long been aware that (seemingly convenient) filesystem-based stores are a large challenge for concurrent workflows in general. This is because databases that involve concurrent connections often need a client/server model, which is what most people using a filesystem-based data store are trying to avoid in the first place. This is a non-trivial issue but one I hope can be solved by others on the Materials Project team.

For calculators where nothing is written to disk (MACE, EMT, ...), this results file gives a purpose to the currently empty directories

They will only be empty for static calculations, as relaxations will dump out the log and restart files. In any case, it is perhaps worthwhile for us to deal with this directly if it is a nuisance. One option would be to check if CREATE_UNIQUE_DIR is True and if the directory is empty. If so, the directory can be removed. Additionally, we would replace the value of dir_name in the output with a value of None to reflect the fact that it doesn't exist on the filesystem anymore.

This is not something you need to implement (for the sake of simplicity, let's not deal with it in this PR regardless). I am just suggesting it as a viable option for later.
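
To illustrate the idea, a minimal sketch (not part of this PR; the helper name and arguments simply mirror the discussion above):

from pathlib import Path

def prune_empty_dir(output: dict, directory: Path, create_unique_dir: bool) -> dict:
    # If a unique directory was created but the job wrote nothing into it,
    # remove it and record in the schema that it no longer exists on disk
    if create_unique_dir and not any(directory.iterdir()):
        directory.rmdir()
        output["dir_name"] = None
    return output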

src/quacc/schemas/ase.py (inline review comment, outdated, resolved)
src/quacc/settings.py (inline review comment, outdated, resolved)
@Andrew-S-Rosen
Member

I'll take a look later to figure out what's going on here.

@tomdemeyere
Contributor Author

tomdemeyere commented Feb 20, 2024

@Andrew-S-Rosen Thanks, I thought the same. It seems that a string is finding its way into the Monty code when it should be a dict.

EDIT: Thanks for the comment, by the way; it's always nice to learn new things, and there are some aspects of Python I am not familiar with.

@Andrew-S-Rosen
Member

I am sorting this out. Seems to be related to materialsproject/emmet#914.

@Andrew-S-Rosen
Member

Andrew-S-Rosen commented Feb 22, 2024

I just fixed what I believe is the upstream issue. Just waiting on a new release of emmet-core.

Edit: Okay that worked, but now we have the following to deal with:

@Andrew-S-Rosen changed the title from "Proposal: Pickling results schema in summarize functions" to "Write out results schema to filesystem in summarize functions" on Feb 22, 2024
@Andrew-S-Rosen added the "enhancement" (New feature or request) label on Feb 29, 2024
@Andrew-S-Rosen
Member

Note to self for later:

We can circumvent these issues by jsanitize-ing the data beforehand.
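
For future reference, a minimal sketch of what that would look like (the task_doc here is just a stand-in, not the exact quacc code):

import numpy as np
from monty.json import jsanitize
from monty.serialization import dumpfn

# Stand-in for the results schema produced by the summarize functions
task_doc = {"energy": np.float64(-1.23), "parameters": {"kpts": np.array([2, 2, 2])}}

# jsanitize converts numpy scalars/arrays and MSONable objects to plain Python
# types, so the document can be dumped without serialization errors
dumpfn(jsanitize(task_doc, strict=True), "results.json")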

@Andrew-S-Rosen
Member

Andrew-S-Rosen commented Mar 4, 2024

Edited:

Alright, the JSON serialization approach does work for the most part. I just had to jsanitize the data beforehand. But ultimately, I think the original pickle approach might be worth considering instead simply because the data types will be exactly the same, since no sanitization needs to occur. I also like the thought of ensuring that all outputs are pickle-able because this is often important for several of the supported workflow engines.

I have sorted out the pickle errors. The only thing left now should be to uncomment the remaining SETTINGS.WRITE_PICKLE blocks, which need a directory to work properly. I can continue revisiting this bit by bit when I get some time.
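
For anyone following along, the gist of the pickle-based approach is roughly the following (a sketch only; the helper names and file name are illustrative, not the exact quacc implementation):

import pickle
from pathlib import Path

def write_pickle(schema: dict, directory: Path) -> None:
    # Dump the schema untouched, so the exact Python objects (Atoms, numpy
    # arrays, ...) come back unchanged when the file is read later
    with Path(directory, "quacc_results.pkl").open("wb") as fd:
        pickle.dump(schema, fd)

def read_pickle(directory: Path) -> dict:
    with Path(directory, "quacc_results.pkl").open("rb") as fd:
        return pickle.load(fd)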

Andrew-S-Rosen added a commit that referenced this pull request Mar 4, 2024
Fix pickle-ability of output schemas, first noticed in #1729.

codecov bot commented Mar 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.38%. Comparing base (5666eaf) to head (fb50d1f).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1729   +/-   ##
=======================================
  Coverage   99.37%   99.38%           
=======================================
  Files          81       81           
  Lines        3200     3227   +27     
=======================================
+ Hits         3180     3207   +27     
  Misses         20       20           


@Andrew-S-Rosen
Member

Andrew-S-Rosen commented Mar 4, 2024

The light at the end of the tunnel has emerged: the checks pass (after quite a bit of refactoring in main)!! Now we just need some tests 😄

@tomdemeyere
Contributor Author

tomdemeyere commented Mar 4, 2024

@Andrew-S-Rosen

Thanks for taking care of that; I think that was a little bit beyond my Python abilities. Tests will come this week, probably.

EDIT: it seems that one particular line takes a lot of time (a flat 30 seconds each time I run a calculation), so I will turn off get_metadata for now

metadata = MoleculeMetadata().from_molecule(mol).model_dump()

in quacc/schemas/atoms.py
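
For context, a quick way to time that call in isolation (illustrative only; the water molecule is just a stand-in, and this assumes emmet-core with pydantic v2):

import time
from emmet.core.structure import MoleculeMetadata
from pymatgen.core import Molecule

mol = Molecule(["O", "H", "H"], [[0.0, 0.0, 0.0], [0.95, 0.0, 0.0], [-0.24, 0.92, 0.0]])
start = time.perf_counter()
metadata = MoleculeMetadata.from_molecule(mol).model_dump()
print(f"metadata generation took {time.perf_counter() - start:.2f} s")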

@Andrew-S-Rosen
Member

Andrew-S-Rosen commented Mar 4, 2024

EDIT: it seems that one particular line takes a lot of time (a flat 30 seconds each time I run a calculation), so I will turn off get_metadata for now

Is this a result of the change in the present PR, or are you just commenting more generally? Could you also provide some insight about the Atoms object? I am curious if there is a lot of data in .info. If this is a new thing, perhaps it is related to #1829. I doubt it is related to the present PR specifically.

@tomdemeyere
Contributor Author

EDIT: it seems that one particular line takes a lot of time (a flat 30 seconds each time I run a calculation), so I will turn off get_metadata for now

Is this a result of the change in the present PR, or are you just commenting more generally? Could you also provide some insight about the Atoms object? I am curious if there is a lot of data in .info. If this is a new thing, perhaps it is related to #1829. I doubt it is related to the present PR specifically.

Ah yes! There is something inside atoms.info, although it's not a lot, just a simple short string. I assumed this was related to this commit, but you are right that it might not be.

@tomdemeyere
Contributor Author

Here are the tests! I did not create new ones but made sure to test all the summarize functions.

@Andrew-S-Rosen
Member

@tomdemeyere: Fantastic, thank you very much! I will proceed with the merge.

Later on, if you find that having JSON output is critical, let me know because I can likely make it work if needed.

@Andrew-S-Rosen Andrew-S-Rosen merged commit 1fcc33a into Quantum-Accelerators:main Mar 17, 2024
20 checks passed
@tomdemeyere
Contributor Author

Just thought about it:

Would you mind adding a reserved key name in "additional_fields" whose value would be used as the name for the pickle file?

{"pickle_label": "water_dimer_5.5"}

This would be a nice niche feature to help with visual inspection.
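
Something like the following sketch (the key name and the pop-based handling are only illustrative; nothing like this exists yet):

import pickle
from pathlib import Path

def write_labeled_pickle(schema: dict, additional_fields: dict, directory: Path) -> None:
    # Use the reserved "pickle_label" key, if present, as the filename stem;
    # otherwise fall back to a generic name
    label = additional_fields.pop("pickle_label", "quacc_results")
    with Path(directory, f"{label}.pkl").open("wb") as fd:
        pickle.dump(schema, fd)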

@Andrew-S-Rosen
Member

Personally, I'm a bit hesitant to add in "hidden" features where we overload a given method with multiple purposes (as you likely have observed from my comments about disliking the .pop business on user-supplied dictionaries).
