Landuse data tool (V1) fails to run on Derecho #1152

Closed
glemieux opened this issue Jan 25, 2024 · 2 comments
Labels
porting Code does not reproduce the expected outcomes across multiple machines type: tools This PR adds or updates support tools. No regression testing necessary.

Comments

@glemieux
Contributor

Testing a new conda environment on Derecho for ESCOMP/CTSM#2331 (comment) produced the following (unhelpful) error during regridding:

2:52 $ python luh2.py -l ~/scratch/luh2/LUH2_v2h_Historic/states.nc -s ~/scratch/luh2/staticData_quarterdeg.nc -r ~/scratch/luh2/surfdata_4x5_16pfts_Irrig_CMIP6_simyr2000_c170824.nc -w regridder.nc -o testout.nc
Input file dataset opened: /glade/u/home/glemieux/scratch/luh2/LUH2_v2h_Historic/states.nc
PrepDataset: LUH2
LUH2 dataset lat/lon boundary variables formatted and added as new variable for xESMF
data set updated for xESMF

Input file dataset opened: /glade/u/home/glemieux/scratch/luh2/surfdata_4x5_16pfts_Irrig_CMIP6_simyr2000_c170824.nc
PrepDataset: SurfData
Surface dataset dimensions renamed for xESMF
data set updated for xESMF

Input file dataset opened: /glade/u/home/glemieux/scratch/luh2/staticData_quarterdeg.nc

Defining regridder, method:  conservative
/glade/work/glemieux/conda-envs/ctsm_pylib/lib/python3.7/site-packages/xesmf/backend.py:56: UserWarning: Latitude is outside of [-90, 90]
  warnings.warn('Latitude is outside of [-90, 90]')
regridder saved to file:  regridder.nc

Regridding
skipping variable 1/21: time
skipping variable 2/21: lat
skipping variable 3/21: lon
regridding variable 4/21: primf
Killed

The exact same conda environment works on lobata, so I suspect this is a bandwidth issue of some sort?

glemieux added the type: tools, bug - unknown, and porting labels and removed the bug - unknown label on Jan 25, 2024
@glemieux
Contributor Author

glemieux commented Jan 25, 2024

This does not appear to be an issue on perlmutter.

It might be worth reviewing this documentation: https://ncar-hpc-docs.readthedocs.io/en/latest/pbs/checking-memory-use/?h=memory#interactive-monitoring

Issue 1. Your Derecho or Casper job shows signs of a memory issue (e.g., the job log ends with "Killed" or "Bus Error") despite qhist reporting a higher requested than used memory amount.
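One way to test the memory hypothesis from inside the script is to print the process's peak resident memory around each expensive step. The following is only a sketch using the Python standard library; `peak_rss_mib` and `run_with_memory_report` are hypothetical helper names and are not part of luh2.py.

```python
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_with_memory_report(label, func, *args, **kwargs):
    """Call func, printing peak RSS before and after the call.

    Output is flushed so the 'before' line survives if the process
    is killed partway through func.
    """
    print(f"{label}: peak RSS before {peak_rss_mib():.1f} MiB", flush=True)
    result = func(*args, **kwargs)
    print(f"{label}: peak RSS after {peak_rss_mib():.1f} MiB", flush=True)
    return result
```

Wrapping the per-variable regridding call (whatever it is named in the tool) in `run_with_memory_report` would show how much memory had been reached before the last variable started. If the process is killed mid-call, only the "before" line for that variable is printed, which still narrows down where the failure occurs.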

@glemieux
Contributor Author

glemieux commented Feb 29, 2024

Per discussions with @ekluzek and @ckoven a few weeks back, I ran this in the develop queue on Derecho to see whether the failure was caused by the resource constraints of the login node, where I had been testing earlier. Running in batch mode on one node, the script ran to completion.
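Since the failure only appeared on the login node, a small guard at the start of a script can flag when it is being run outside a batch job. This is a sketch, not something the tool currently does: it assumes the scheduler is PBS, which sets the `PBS_JOBID` environment variable inside jobs, and the function names are hypothetical.

```python
import os
import sys


def running_in_pbs_job(environ=None):
    """Return True if a PBS job ID is present in the environment."""
    env = os.environ if environ is None else environ
    return bool(env.get("PBS_JOBID"))


def warn_if_not_in_batch_job(environ=None):
    """Print a warning to stderr when not inside a PBS job.

    Returns True if the warning was printed, False otherwise.
    """
    if running_in_pbs_job(environ):
        return False
    print(
        "Warning: no PBS_JOBID found; this may be a login node, "
        "where memory-heavy regridding can be killed.",
        file=sys.stderr,
    )
    return True
```

The check is advisory only: it does not measure available memory, and it would also warn on machines that do not use PBS at all.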
