ISIS version(s) affected: all

Description
The downloadIsisData script needlessly re-copies existing files with every download of SPICE kernels for most, if not all, missions. The problem exists largely due to a combination of rclone behavior and duplicate files existing on both the NAIF kernel website and the USGS AWS ISISDATA repository. It occurs on every download of ISISDATA, including a clean download, and on restarts or updates of an existing ISISDATA area.
The current implementation of downloadIsisData first downloads from the NAIF SPICE kernel repository. (Note: using the http protocol rather than https is still an issue.) Once this download completes, it then runs a check against the copy on the USGS AWS ISISDATA repo. A significant number of SPICE kernel files exist on both repos even though they have exactly the same contents. The problem seems to be that the default command configuration of downloadIsisData causes rclone to compare only file modification times and sizes, and the USGS AWS modification times differ from the NAIF kernel modification times. As a result, on every restart or update of an existing local ISISDATA download, these files are downloaded again needlessly.
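Conceptually, the two passes amount to something like the rclone calls below. This is only an illustrative sketch: the remote names (naifKernels, usgsIsisData) and source paths are assumptions standing in for whatever the shipped rclone.conf actually defines, not the script's exact commands.

# Pass 1: mission kernels from the NAIF web site (http remote, no hashes available)
rclone copy naifKernels:JUNO/kernels $PWD/isisdatatest/juno/kernels --config rclone.conf -vv

# Pass 2: the mission area from the USGS AWS (S3) ISISDATA repo, which carries
# many of the same kernels under kernels/ck, kernels/spk, etc.
rclone copy usgsIsisData:juno $PWD/isisdatatest/juno --config rclone.conf -vv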
This is evident in the log files produced by rclone (I modified downloadIsisData to write an rclone log file). As an example, we can look at the Juno kernel dataset and work through a fresh download followed by a restart of the same download against the new local Juno ISISDATA.
Command used for initial download:

./downloadIsisData juno $PWD/isisdatatest --config rclone.conf -vv --log=isisdatatest_juno_initial.log
rclone copies all of the kernels from NAIF on the initial download. You can tell the source of a file from its path: NAIF sources begin with the kernel type directory - in this case, ck.
2023/07/27 11:31:28 INFO : ck/juno_sc_rec_150104_150110_v03.bc: Copied (new)
Later in the same download session, rclone moves on to the USGS AWS Juno ISISDATA contents. Note that the directory name now starts with kernels. In this case, rclone detects that the modification times differ and copies the file again.
2023/07/27 11:34:27 DEBUG : kernels/ck/juno_sc_rec_150104_150110_v03.bc: Modification times differ by -45434h35m32s: 2022-08-12 22:31:28 +0000 UTC, 2017-06-06 12:55:56 -0700 MST
2023/07/27 11:35:57 INFO : kernels/ck/juno_sc_rec_150104_150110_v03.bc: Copied (replaced existing)
Note that in some cases the behavior is different: rclone computes an md5 hash of the AWS file and updates the modification time of the local copy, but does not download the file. However, on subsequent runs the NAIF file is downloaded again, due to the differing times and the lack of md5 hashes on http sites.
Run the same command with a new log file for an update:

./downloadIsisData juno $PWD/isisdatatest --config rclone.conf -vv --log=isisdatatest_juno_update.log
When encountering the same file on the NAIF site, it is recopied:
2023/07/27 11:42:58 DEBUG : ck/juno_sc_rec_150104_150110_v03.bc: Modification times differ by 45434h35m32s: 2017-06-06 19:55:56 +0000 UTC, 2022-08-12 15:31:28 -0700 MST
2023/07/27 11:44:22 INFO : ck/juno_sc_rec_150104_150110_v03.bc: Copied (replaced existing)
And then the same behavior as the initial download occurs when copying from the USGS AWS repo:
2023/07/27 11:46:50 DEBUG : kernels/ck/juno_sc_rec_150104_150110_v03.bc: Modification times differ by -45434h35m32s: 2022-08-12 22:31:28 +0000 UTC, 2017-06-06 12:55:56 -0700 MST
2023/07/27 11:48:08 INFO : kernels/ck/juno_sc_rec_150104_150110_v03.bc: Copied (replaced existing)
I confirmed both copies in NAIF and AWS have the same md5 hash.
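For reference, here is one way such a check can be made. The remote name, bucket path, and NAIF URL below are placeholders for illustration, not necessarily the exact ones used:

# md5 of the S3 object (S3 generally exposes an MD5, so no download is needed)
rclone md5sum usgsIsisData:juno/kernels/ck/juno_sc_rec_150104_150110_v03.bc --config rclone.conf

# md5 of the NAIF copy, computed locally after fetching it
curl -sO https://naif.jpl.nasa.gov/pub/naif/JUNO/kernels/ck/juno_sc_rec_150104_150110_v03.bc
md5sum juno_sc_rec_150104_150110_v03.bc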
This is quite concerning, as a significant amount of needless copying of data is occurring. I have tried additional parameters, but problems remain. For example, using --checksum helps eliminate the extra copies, but you run the risk of replacing newer files from NAIF with old files from AWS if the sizes differ. And you still cannot reliably detect changes on either copy, since NAIF copies will still occur (or not!) because hashes cannot be computed until the files are downloaded.
Here are the sizes and dates of each of the files on their sources:
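Size and modification time on each source can be listed with rclone lsl; the remote names and paths here are placeholders for whatever rclone.conf defines:

rclone lsl naifKernels:JUNO/kernels/ck/juno_sc_rec_150104_150110_v03.bc --config rclone.conf
rclone lsl usgsIsisData:juno/kernels/ck/juno_sc_rec_150104_150110_v03.bc --config rclone.conf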
How to reproduce
Run the same command twice in succession, for example:

./downloadIsisData juno $PWD/isisdatatest --config rclone.conf -vv
./downloadIsisData juno $PWD/isisdatatest --config rclone.conf -vv

This behavior is seen for every mission that has duplicate files on NAIF and AWS with different time stamps. I added the --log parameter to downloadIsisData, but you could instead redirect the output to a file, which is a bit messy.
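For example, without patching the script, the log can be captured with ordinary shell redirection (capturing both streams, since the rclone messages may arrive on stdout or stderr):

./downloadIsisData juno $PWD/isisdatatest --config rclone.conf -vv > isisdatatest_juno.log 2>&1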
Possible Solution
There do not seem to be any rclone command line options that completely resolve this problem. I tried --checksum, but the problem is that if a file changes on the NAIF server it will always end up replaced by the AWS copy, because the NAIF version will always be downloaded first (the NAIF http site does not provide md5 hashes).
The only way I think this can be resolved is if all redundant files are removed from the USGS AWS repo.
I suppose you could write additional, more complicated scripts, but it is not certain that would resolve all of these issues.
Additional context
I think a simple way to solve this is to use rclone's union feature, which lets rclone manage duplicate files from multiple sources. Tracking down redundant kernels is still important, but I think using unions would not only solve this problem but would also allow the use of sync instead of copy, so that old files get deleted.
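A minimal sketch of what that could look like in rclone.conf, assuming remotes named naifKernels and usgsIsisData and upstream roots chosen so the relative kernel paths line up (all names and paths here are illustrative):

[junoUnion]
type = union
# With the default search policy, a path present in both upstreams appears once
# in the union listing, so it is only copied once.
upstreams = usgsIsisData:juno naifKernels:JUNO

# A single copy (or sync) then replaces the current two passes, for example:
# rclone copy junoUnion: $PWD/isisdatatest/juno --config rclone.conf -vv
# rclone sync junoUnion: $PWD/isisdatatest/juno --config rclone.conf -vv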