Add copy of release_files.path to file_registry for long-term storage #17750

Open
miketheman opened this issue Mar 11, 2025 · 3 comments

@miketheman
Member

Background

A Filename registry persists even after files have been removed, to prevent re-upload or re-use of that exact filename.

When a release's files are removed, the ability to surface their storage location is effectively lost: the path generator derives the placement from the file's hashes, which are also not preserved.

# Figure out what our filepath is going to be, we're going to use a
# directory structure based on the hash of the file contents. This
# will ensure that the contents of the file cannot change without
# it also changing the path that the file is saved to.
path="/".join(
    [
        file_hashes[PATH_HASHER][:2],
        file_hashes[PATH_HASHER][2:4],
        file_hashes[PATH_HASHER][4:],
        filename,
    ]
),
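
To make the excerpt above concrete, here is a minimal standalone sketch of the same layout. It assumes PATH_HASHER refers to a 256-bit BLAKE2b hex digest; `storage_path` and the sample filename are hypothetical names used only for illustration, not code from the project.

```python
import hashlib


def storage_path(content: bytes, filename: str) -> str:
    """Build a content-addressed path: aa/bb/<remaining digest>/<filename>."""
    # Assumption: the path hasher is BLAKE2b with a 32-byte (256-bit) digest.
    digest = hashlib.blake2b(content, digest_size=32).hexdigest()
    return "/".join([digest[:2], digest[2:4], digest[4:], filename])


# Hypothetical file contents and filename, for illustration only.
path = storage_path(b"example file contents", "example-1.0.tar.gz")
other = storage_path(b"different contents", "example-1.0.tar.gz")
print(path)
```

Because the digest is part of the path, once both the file row and its hashes are gone there is nothing left from which to recompute the location.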

The path data exists in the BigQuery Project Metadata Table so it's semi-available, but harder to get to during routine operations and investigations.

This could also feasibly be used via Inspector or something similar, if made available via some API.

Proposal

A few steps to tackle the problem; these can definitely change based on further findings or ideas.

  • Add Filename.path column (easy)
  • Populate Filename.path during file upload, around here (easy)
  • Backfill the column from existing data in File.path for filenames that match (easyish)
  • Backfill the column for remaining empty entries from BigQuery data (medium to hard)
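
The first and third steps could look roughly like the sketch below. It uses an in-memory SQLite database purely so it runs anywhere; the real database, migration tooling, and exact schema are not shown in this issue, so the table layouts here (`file_registry` and `release_files`, each keyed by `filename`) and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed, simplified schemas with made-up rows: one file still present,
# one filename whose file has already been removed.
conn.executescript(
    """
    CREATE TABLE release_files (filename TEXT PRIMARY KEY, path TEXT NOT NULL);
    CREATE TABLE file_registry (filename TEXT PRIMARY KEY);
    INSERT INTO release_files VALUES ('kept-1.0.tar.gz', 'ab/cd/ef01/kept-1.0.tar.gz');
    INSERT INTO file_registry VALUES ('kept-1.0.tar.gz'), ('removed-1.0.tar.gz');
    """
)

# Add the new column as nullable, so existing rows stay valid.
conn.execute("ALTER TABLE file_registry ADD COLUMN path TEXT")

# Backfill from files that still exist, matching on filename.
conn.execute(
    """
    UPDATE file_registry
    SET path = (
        SELECT rf.path FROM release_files AS rf
        WHERE rf.filename = file_registry.filename
    )
    WHERE path IS NULL
    """
)

# Whatever is still NULL would need the harder backfill from another source.
remaining = conn.execute(
    "SELECT filename FROM file_registry WHERE path IS NULL"
).fetchall()
print(remaining)
```

Rows left in `remaining` correspond to the last step in the list: filenames whose files are already gone and whose paths would have to come from the BigQuery data.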
@di
Member

di commented Mar 11, 2025

Definitely in favor of this.

I might suggest we store the blake2_256_digest from the release_files table instead (maybe the other *_digest columns too?) rather than the path, and just give ourselves a helper to convert the right digest to a path instead.

@miketheman
Member Author

> I might suggest we store the blake2_256_digest from the release_files table instead (maybe the other *_digest columns too?) rather than the path, and just give ourselves a helper to convert the right digest to a path instead.

Interesting - any reason we wouldn't want to use the already-computed path, since that's the reality of where it was stored? By storing a hash, we allow ourselves to change the computed value if the approach to creating the path ever changes in the future.

@di
Member

di commented Mar 11, 2025

My thought is that having the digests would be potentially useful for other things as well, and that since the path and the blake2_256_digest are basically the same, it would make more sense to store the digest (or multiple digests) and derive the path from them rather than vice versa.

Also, if we ever changed how files are stored, we'd just need to update the helper, not rewrite the path column for every filename.
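
A rough sketch of what such a helper pair might look like, assuming the path layout quoted earlier in the issue stays as-is. The names `digest_to_path` and `path_to_digest` are hypothetical, and the digest below is a placeholder value rather than a real hash.

```python
def digest_to_path(blake2_256_digest: str, filename: str) -> str:
    """Derive the storage path from a hex digest and a filename."""
    d = blake2_256_digest
    return "/".join([d[:2], d[2:4], d[4:], filename])


def path_to_digest(path: str) -> str:
    """Recover the hex digest from a path built by digest_to_path."""
    # Limit the split so only the three digest segments are separated out.
    first, second, rest, _filename = path.split("/", 3)
    return first + second + rest


# Placeholder 64-character hex string standing in for a real digest.
digest = "ab" * 32
path = digest_to_path(digest, "example-1.0-py3-none-any.whl")
recovered = path_to_digest(path)
print(path)
```

The round trip is lossless in both directions under this layout, which is why either value could be stored; keeping the digest means only the helper changes if the layout ever does.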
