downloadIsisData and rclone Problems #5245

Closed
KrisBecker opened this issue Jul 27, 2023 · 1 comment · Fixed by #5255
KrisBecker commented Jul 27, 2023

ISIS version(s) affected: all

Description

The ISISDATA downloadIsisData script needlessly re-copies existing files on every download of SPICE kernels for most or all missions. The problem stems largely from a combination of rclone behavior and duplicate files that exist both on the NAIF kernel website and in the USGS AWS ISISDATA contents. It occurs on every download of ISISDATA, including a clean download and any restart or update of an existing ISISDATA.

The current implementation of downloadIsisData first downloads from the NAIF SPICE kernel repository. (Note: using the http protocol rather than https is still an issue.) Once this download completes, it then runs a check against the copy in the USGS AWS ISISDATA repo. A significant number of SPICE kernel files exist in both repos even though their contents are exactly the same. The problem seems to be that the default command configuration of downloadIsisData causes rclone to compare file modification times and sizes, and the USGS AWS modification times differ from the NAIF modification times. On each restart or update of an existing local ISISDATA download, these files are needlessly downloaded again.
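
The cycle described above can be modeled in a few lines of Python. This is a simplified sketch, not rclone's implementation: it assumes the check is "skip only if size and modification time both match" and that a copy leaves the local file with the source's modification time. The names `FileInfo`, `needs_copy`, and `run_download` are hypothetical; the size and timestamps are the ones `rclone lsl` reports for the Juno CK file in this issue.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FileInfo:
    size: int
    mtime: datetime


def needs_copy(src, dst):
    """Simplified default check: copy unless size and mtime both match."""
    return dst is None or src.size != dst.size or src.mtime != dst.mtime


def run_download(local, sources):
    """Visit each source in order; return how many copies one run performs."""
    copies = 0
    for src in sources:
        if needs_copy(src, local.get("file")):
            local["file"] = src  # assumption: the local copy takes the source's mtime
            copies += 1
    return copies


# Same size on both sources, different modification times.
naif = FileInfo(38774784, datetime(2017, 6, 6, 12, 55, 56))
aws = FileInfo(38774784, datetime(2022, 8, 12, 15, 31, 28))

local = {}
first_run = run_download(local, [naif, aws])
second_run = run_download(local, [naif, aws])
print(first_run, second_run)  # 2 2
```

Under these assumptions the file is transferred twice on every run, although a single transfer on the first run would have been enough.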

This is evident in the log files produced by rclone (I modified downloadIsisData to provide an rclone log file). As an example, we can look at the Juno kernel dataset. The example works through a new download and then a restart of the same download on the new local Juno ISISDATA.

Command used for initial download:
./downloadIsisData juno $PWD/isisdatatest --config rclone.conf -vv --log=isisdatatest_juno_initial.log

rclone copies all the kernels from NAIF on the initial download. You can tell the source of a file because NAIF paths begin with the kernel type directory; in this case, it's a CK.

2023/07/27 11:31:28 INFO  : ck/juno_sc_rec_150104_150110_v03.bc: Copied (new)

Later in the same download session, it is now working on the USGS AWS Juno ISISDATA contents. Note the directory name starts with kernels. In this case, it detects the times are different and copies the file again.

2023/07/27 11:34:27 DEBUG : kernels/ck/juno_sc_rec_150104_150110_v03.bc: Modification times differ by -45434h35m32s: 2022-08-12 22:31:28 +0000 UTC, 2017-06-06 12:55:56 -0700 MST
2023/07/27 11:35:57 INFO  : kernels/ck/juno_sc_rec_150104_150110_v03.bc: Copied (replaced existing)
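
The offset in the DEBUG line can be checked from the two timestamps it prints. The sketch below only redoes that arithmetic with the standard library; the variable names are illustrative, and it prints the magnitude of the difference (rclone reports it with a sign).

```python
from datetime import datetime, timedelta, timezone

# Timestamps copied from the DEBUG line: one in UTC, one in MST (-0700).
mst = timezone(timedelta(hours=-7))
aws_mtime = datetime(2022, 8, 12, 22, 31, 28, tzinfo=timezone.utc)
local_mtime = datetime(2017, 6, 6, 12, 55, 56, tzinfo=mst)

difference = aws_mtime - local_mtime
hours, remainder = divmod(int(difference.total_seconds()), 3600)
minutes, seconds = divmod(remainder, 60)
print(f"{hours}h{minutes}m{seconds}s")  # 45434h35m32s
```

The gap is a little over five years, so the two timestamps cannot be reconciled by a time zone adjustment.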

Note in some cases, the behavior is different. It runs an md5 hash on AWS files and updates the mod time but does not download the file. However, on subsequent runs, the NAIF file is downloaded again due to different times and the lack of md5 hashes from http sites.

Run the same command with a new log file for an update:
./downloadIsisData juno $PWD/isisdatatest --config rclone.conf -vv --log=isisdatatest_juno_update.log

When encountering the same file on the NAIF site, it is recopied:

2023/07/27 11:42:58 DEBUG : ck/juno_sc_rec_150104_150110_v03.bc: Modification times differ by 45434h35m32s: 2017-06-06 19:55:56 +0000 UTC, 2022-08-12 15:31:28 -0700 MST
2023/07/27 11:44:22 INFO  : ck/juno_sc_rec_150104_150110_v03.bc: Copied (replaced existing)

And then the same behavior as in the initial download occurs when copying from the USGS AWS repo:

2023/07/27 11:46:50 DEBUG : kernels/ck/juno_sc_rec_150104_150110_v03.bc: Modification times differ by -45434h35m32s: 2022-08-12 22:31:28 +0000 UTC, 2017-06-06 12:55:56 -0700 MST
2023/07/27 11:48:08 INFO  : kernels/ck/juno_sc_rec_150104_150110_v03.bc: Copied (replaced existing)
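
To gauge how widespread the re-copying is, a log like the ones above could be summarized with a small script. This is a sketch that assumes the log line layout shown in these excerpts; `count_recopies` and the regular expression are hypothetical helpers, not part of downloadIsisData or rclone.

```python
import re
from collections import Counter

# Layout assumed from the excerpts: "<date> <time> <LEVEL> : <path>: <message>"
LOG_LINE = re.compile(
    r"^\S+ \S+ (?P<level>[A-Z]+)\s*: (?P<path>.+?): (?P<message>.*)$"
)


def count_recopies(log_text):
    """Count 'Copied (replaced existing)' events per kernel file name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match["message"].startswith("Copied (replaced existing)"):
            # NAIF paths start at "ck/", AWS paths at "kernels/ck/", so key on
            # the base name. Files sharing a base name would be merged.
            counts[match["path"].rsplit("/", 1)[-1]] += 1
    return counts


sample = """\
2023/07/27 11:42:58 DEBUG : ck/juno_sc_rec_150104_150110_v03.bc: Modification times differ by 45434h35m32s: 2017-06-06 19:55:56 +0000 UTC, 2022-08-12 15:31:28 -0700 MST
2023/07/27 11:44:22 INFO  : ck/juno_sc_rec_150104_150110_v03.bc: Copied (replaced existing)
2023/07/27 11:46:50 DEBUG : kernels/ck/juno_sc_rec_150104_150110_v03.bc: Modification times differ by -45434h35m32s: 2022-08-12 22:31:28 +0000 UTC, 2017-06-06 12:55:56 -0700 MST
2023/07/27 11:48:08 INFO  : kernels/ck/juno_sc_rec_150104_150110_v03.bc: Copied (replaced existing)
"""
recopies = count_recopies(sample)
print(recopies)
```

Any file with a count of two or more in a single update log was replaced from both sources in that run.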

I confirmed both copies in NAIF and AWS have the same md5 hash.
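
One way to make such a check on two downloaded copies is sketched below. It is an illustration using Python's `hashlib`, not necessarily how the comparison above was done, and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Compare sizes first (cheap), then MD5 digests."""
    if Path(path_a).stat().st_size != Path(path_b).stat().st_size:
        return False
    return md5sum(path_a) == md5sum(path_b)
```

Calling `same_content` on the copy fetched from NAIF and the copy fetched from AWS (saved under two different local paths) returns `True` only when sizes and digests agree.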

This is quite concerning as there is a significant amount of needless copying of data occurring. I have tried additional parameters but there are still problems that can occur. For example, using --checksum will help eliminate additional copies but you run the risk of replacing newer files from NAIF with old files from AWS if the sizes differ. Or you will not be able to detect changes on either copy since NAIF copies will still occur (or not!) since hashes are not computed until they are downloaded.

Here are the sizes and dates of each of the files on their sources:

# From NAIF
% rclone lsl --config rclone.conf juno_naifKernels:ck/juno_sc_rec_150104_150110_v03.bc
 38774784 2017-06-06 12:55:56.000000000 juno_sc_rec_150104_150110_v03.bc

# From USGS AWS
% rclone lsl --config rclone.conf juno_usgs:kernels/ck/juno_sc_rec_150104_150110_v03.bc
 38774784 2022-08-12 15:31:28.000000000 juno_sc_rec_150104_150110_v03.bc

How to reproduce
Run successive commands as shown above. This behavior is seen for every mission that has duplicate files on NAIF and AWS with different time stamps. I added the --log parameter to downloadIsisData but you could redirect output to a file, which is a bit messy.

./downloadIsisData juno $PWD/isisdatatest --config rclone.conf -vv

./downloadIsisData juno $PWD/isisdatatest --config rclone.conf -vv

Possible Solution
There do not seem to be any rclone command-line options that would completely resolve this problem. I tried --checksum, but the problem is that if a file changes on the NAIF server, it will always be replaced by the AWS copy because the NAIF version will always be downloaded (it does not have md5 hashes).

The only way I think this can be resolved is if all redundant files are removed from the USGS AWS repo.
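
Finding the redundant files could start from listings of both remotes. The sketch below assumes `rclone lsl` output in the size, date, time, path layout shown earlier and assumes that AWS paths carry an extra `kernels/` prefix relative to NAIF paths. `parse_lsl` and `redundant_candidates` are hypothetical helpers, and the sample listings are adapted from the Juno file above with directory prefixes added for illustration. Matching name and size only yields candidates; a hash comparison would still be needed to confirm identical contents.

```python
def parse_lsl(listing, strip_prefix=""):
    """Map path -> size from `rclone lsl` style lines (size, date, time, path)."""
    sizes = {}
    for line in listing.splitlines():
        parts = line.rstrip().split(None, 3)
        if len(parts) != 4 or not parts[0].isdigit():
            continue  # skip blank or unrecognized lines
        size, _date, _time, path = parts
        if strip_prefix and path.startswith(strip_prefix):
            path = path[len(strip_prefix):]
        sizes[path] = int(size)
    return sizes


def redundant_candidates(naif_listing, aws_listing, aws_prefix="kernels/"):
    """Return paths present in both listings with the same size."""
    naif = parse_lsl(naif_listing)
    aws = parse_lsl(aws_listing, strip_prefix=aws_prefix)
    return sorted(p for p in naif.keys() & aws.keys() if naif[p] == aws[p])


naif_listing = " 38774784 2017-06-06 12:55:56.000000000 ck/juno_sc_rec_150104_150110_v03.bc\n"
aws_listing = " 38774784 2022-08-12 15:31:28.000000000 kernels/ck/juno_sc_rec_150104_150110_v03.bc\n"
candidates = redundant_candidates(naif_listing, aws_listing)
print(candidates)
```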

I suppose you could write additional, more complicated scripts, but it is uncertain whether that would resolve all these issues.

@KrisBecker KrisBecker added the bug Something isn't working label Jul 27, 2023
@amystamile-usgs amystamile-usgs self-assigned this Aug 7, 2023
Kelvinrr commented Aug 7, 2023

I think a simple way to solve this is rclone's union feature, which would let rclone manage duplicate files from multiple sources. Tracking down redundant kernels is still important, but I think using unions would solve this problem and also allow the use of sync instead of copy to delete old files.
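
As a rough illustration of what such a setup might look like, the sketch below generates an rclone config section for a union remote. The `type = union` and `upstreams` keys come from rclone's union backend. The section name `juno_union`, the choice of upstreams (the `juno_naifKernels` and `juno_usgs` remotes used earlier in this issue, with the AWS side rooted at `kernels`), and the helper `union_section` are assumptions for illustration, not the configuration that was eventually adopted.

```python
import configparser
import io


def union_section(name, upstreams):
    """Render one rclone.conf section describing a union of existing remotes."""
    config = configparser.ConfigParser()
    config[name] = {"type": "union", "upstreams": " ".join(upstreams)}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Hypothetical union of the two Juno remotes referenced in this issue.
section_text = union_section("juno_union", ["juno_naifKernels:", "juno_usgs:kernels"])
print(section_text)
```

Which upstream wins when a file exists in both is governed by the union backend's policies, which this sketch leaves at their defaults and which would need to be verified against real downloads.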
