On perlmutter, when file not present on server inputdata, the auto-download command may claim success and simply write a 0-sized file #5899

Closed
ndkeen opened this issue Aug 28, 2023 · 5 comments · Fixed by #5900
Labels
inputdata: Changes affecting inputdata collection on blues; pm-cpu: Perlmutter at NERSC (CPU-only nodes); pm-gpu: Perlmutter machine at NERSC (GPU nodes)

Comments

Contributor

ndkeen commented Aug 28, 2023

I've noticed a few 0-size partition files at NERSC, and I can reproduce the problem by simply asking for an odd-numbered partition file.

I see this in submit output:

Model mpassi missing file graph127 = '/global/cfs/cdirs/e3sm/inputdata/ice/mpas-seaice/oEC60to30v3/partitions/mpas-seaice.graph.info.230424.part.127'
Trying to download file: 'ice/mpas-seaice/oEC60to30v3/partitions/mpas-seaice.graph.info.230424.part.127' to path '/global/cfs/cdirs/e3sm/inputdata/ice/mpas-seaice/oEC60to30v3/partitions/mpas-seaice.graph.info.230424.part.127' using WGET protocol.
SUCCESS

but then:

login38% lr /global/cfs/cdirs/e3sm/inputdata/ice/mpas-seaice/oEC60to30v3/partitions/mpas-seaice.graph.info.230424.part.127
-rw-rw-r--+ 1 ndk ndk 0 Aug 28 13:27 /global/cfs/cdirs/e3sm/inputdata/ice/mpas-seaice/oEC60to30v3/partitions/mpas-seaice.graph.info.230424.part.127
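
Since a few of these 0-size files were already sitting in the shared inputdata tree, a quick scan can help find them. The following is a minimal sketch, not part of CIME or E3SM; the function name `find_empty_files` is hypothetical, and it assumes you have read access to the directory being scanned.

```python
import os


def find_empty_files(root):
    """Return the sorted paths of regular files under root that are 0 bytes long."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # isfile() is False for broken symlinks, so those are skipped.
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)
```

Pointing it at a single partitions directory rather than the whole inputdata root keeps the walk short on a large shared filesystem.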

ndkeen added the pm-gpu, inputdata, and pm-cpu labels Aug 28, 2023
Contributor Author

ndkeen commented Aug 28, 2023

I think we've figured out the issue, or rather how to avoid it. I'm still not sure why this happens.

While LCRC was down, we added NERSC as a data server. That may not be a good idea to leave in place in general (it is probably better to have one true golden set), and it also seems to have this issue of writing 0-size files, which I'm sure we could figure out and correct if needed.

Easier reproducer:

login22% rm o
login22% rm /global/cfs/cdirs/e3sm/inputdata/ice/mpas-seaice/oEC60to30v3/partitions/mpas-seaice.graph.info.230424.part.127
rm: cannot remove '/global/cfs/cdirs/e3sm/inputdata/ice/mpas-seaice/oEC60to30v3/partitions/mpas-seaice.graph.info.230424.part.127': No such file or directory
login22% wget https://portal.nersc.gov/project/e3sm/inputdata/ice/mpas-seaice/oEC60to30v3/partitions/mpas-seaice.graph.info.230424.part.127 -nc --output-document o
--2023-08-28 13:51:55--  https://portal.nersc.gov/project/e3sm/inputdata/ice/mpas-seaice/oEC60to30v3/partitions/mpas-seaice.graph.info.230424.part.127
Resolving portal.nersc.gov (portal.nersc.gov)... 128.55.206.109, 128.55.206.106, 128.55.206.111, ...
Connecting to portal.nersc.gov (portal.nersc.gov)|128.55.206.109|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2023-08-28 13:51:55 ERROR 404: Not Found.

login22% ls -l o
-rw-rw-r-- 1 ndk ndk 0 Aug 28 13:51 o

Member

rljacob commented Aug 28, 2023

I don't understand the reproducer. Did it also create a 0-size file called mpas-seaice.graph.info.230424.part.127?

Contributor Author

ndkeen commented Aug 28, 2023

For the reproducer, I was trying to simplify by asking wget to write its output to a file named o:
wget https://portal.nersc.gov/project/e3sm/inputdata/ice/mpas-seaice/oEC60to30v3/partitions/mpas-seaice.graph.info.230424.part.127 -nc --output-document o

which it does, and I'm not sure why.
To reproduce, just ask for any file that is not on the server:

login22% rm o
rm: cannot remove 'o': No such file or directory
login22% wget https://portal.nersc.gov/project/e3sm/inputdata/foo -nc --output-document o
--2023-08-28 14:42:02--  https://portal.nersc.gov/project/e3sm/inputdata/foo
Resolving portal.nersc.gov (portal.nersc.gov)... 128.55.206.110, 128.55.206.111, 128.55.206.108, ...
Connecting to portal.nersc.gov (portal.nersc.gov)|128.55.206.110|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2023-08-28 14:42:02 ERROR 404: Not Found.

login22% ls -l o
-rw-rw-r-- 1 ndk e3sm 0 Aug 28 14:42 o

Member

rljacob commented Aug 28, 2023

Related CIME issue: ESMCI/cime#4480

Contributor Author

ndkeen commented Aug 28, 2023

Actually, I think the issue might be the --output-document argument. Without it, wget does not write a 0-size file for me. I think that is actually expected behavior. As noted in the CIME issue, the Python code is then supposed to detect and clean up that 0-size file.
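
The detect-and-clean-up step described above could look roughly like the sketch below. This is not CIME's actual code: `wget_to_file` and `remove_if_empty` are hypothetical names, and the `runner` parameter is an assumption added only so the logic can be exercised without network access. The wget flags are the ones from the reproducer. The sketch relies on wget returning a non-zero exit status after a 404, and it also treats an empty output file as a failure even when the exit status is 0.

```python
import os
import subprocess


def remove_if_empty(path):
    """Delete path if it is a 0-byte regular file; return True if it was deleted."""
    if os.path.isfile(path) and os.path.getsize(path) == 0:
        os.remove(path)
        return True
    return False


def wget_to_file(url, dest, runner=subprocess.run):
    """Run wget as in the reproducer; return True only for a non-empty download.

    runner is a hypothetical hook (defaults to subprocess.run) that receives
    the argument list and returns an object with a returncode attribute.
    """
    result = runner(["wget", url, "-nc", "--output-document", dest])
    succeeded = (
        result.returncode == 0
        and os.path.isfile(dest)
        and os.path.getsize(dest) > 0
    )
    if not succeeded:
        # wget may have created dest before it saw the error response.
        remove_if_empty(dest)
    return succeeded
```

With a check like this in place, the caller would only report SUCCESS when the function returns True, so a 404 would no longer leave a 0-size file behind in the inputdata directory.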

ndkeen added a commit that referenced this issue Aug 28, 2023
We were using NERSC as inputdata server while LCRC was down.
However, we were running into issues with 0-sized files being written (possibly related to the args of wget).
No need to have NERSC server as backup now.

Fixes #5899
[bfb]
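
For reference, the GNU Wget manual describes --output-document as analogous to shell redirection and says the target file is truncated immediately, which matches the 0-size files seen in this issue. One alternative to cleaning up afterwards is to download into a temporary file and rename it only on success. The sketch below illustrates that idea using Python's standard library instead of wget; it is an assumption about how one might structure this, not what CIME does, and `download_atomically` is a hypothetical name.

```python
import os
import shutil
import tempfile
import urllib.error
import urllib.request


def download_atomically(url, dest):
    """Fetch url into dest, leaving nothing at dest if the fetch fails.

    Returns True on success and False on any URL or I/O error.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest))
    try:
        # A 404 raises HTTPError (a URLError subclass) before any file exists.
        response = urllib.request.urlopen(url)
    except (urllib.error.URLError, OSError):
        return False
    # Write next to dest so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    try:
        with response, os.fdopen(fd, "wb") as out:
            shutil.copyfileobj(response, out)
        os.replace(tmp_path, dest)
    except OSError:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        return False
    return True
```

Note that `tempfile.mkstemp` creates the file readable and writable only by the owner, so a shared inputdata directory would likely need the permissions adjusted after the rename.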