Add copy of `release_files.path` to `file_registry` for long-term storage #17750

miketheman · 2025-03-11T16:16:24Z

Background

A Filename registry is maintained so that filenames persist even after their files have been removed, preventing re-upload or re-use of that exact filename.

When a release's files are removed, the ability to surface their storage location is effectively lost: the path generator derives the placement from the file's hashes, and those hashes are not preserved either.

warehouse/warehouse/forklift/legacy.py

Lines 1525 to 1536 in aeb9ccd

    
    # Figure out what our filepath is going to be, we're going to use a
    # directory structure based on the hash of the file contents. This
    # will ensure that the contents of the file cannot change without
    # it also changing the path that the file is saved too.
    path="/".join(
        [
            file_hashes[PATH_HASHER][:2],
            file_hashes[PATH_HASHER][2:4],
            file_hashes[PATH_HASHER][4:],
            filename,
        ]
    ),
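
A minimal, self-contained sketch of the layout in the excerpt, assuming PATH_HASHER refers to a 256-bit BLAKE2b digest (consistent with the blake2_256_digest column discussed in this thread). The function name `storage_path` and the sample inputs are hypothetical and not part of Warehouse:

```python
import hashlib


def storage_path(content: bytes, filename: str) -> str:
    """Build a content-addressed path: aa/bb/<remaining digest>/<filename>."""
    # Assumption: the path hasher is BLAKE2b with a 32-byte (256-bit) digest.
    digest = hashlib.blake2b(content, digest_size=32).hexdigest()
    # Same slicing as the excerpt: two 2-character prefixes, then the rest.
    return "/".join([digest[:2], digest[2:4], digest[4:], filename])


# Hypothetical inputs, for illustration only.
path = storage_path(b"example file contents", "example-1.0.tar.gz")
print(path)
```

Because the path depends only on the contents and the filename, it cannot be recomputed once both the file and its stored digests are gone, which is the gap described above.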

The path data exists in the BigQuery Project Metadata Table so it's semi-available, but harder to get to during routine operations and investigations.

This could also feasibly be used via Inspector or something similar, if made available via some API.

Proposal

A few steps to tackle the problem; these can definitely change based on further findings or ideas.

Add Filename.path column (easy)
Populate Filename.path during file upload, around here (easy)
Backfill the column from existing data in File.path for filenames that match (easyish)
Backfill the column for remaining empty entries from BigQuery data (medium to hard)
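
The third step could look roughly like the sketch below. It is only an illustration under stated assumptions: SQLite stands in for the production database so the example runs on its own, the table names are taken from the issue title, and the `filename` join column and the sample rows are hypothetical. A real backfill would more likely be a migration or batched job.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical, simplified schemas; the real tables have more columns.
    CREATE TABLE release_files (filename TEXT PRIMARY KEY, path TEXT NOT NULL);
    CREATE TABLE file_registry (filename TEXT PRIMARY KEY, path TEXT);

    INSERT INTO release_files VALUES
        ('kept-1.0.tar.gz', 'aa/bb/cc/kept-1.0.tar.gz');
    INSERT INTO file_registry (filename) VALUES
        ('kept-1.0.tar.gz'), ('removed-1.0.tar.gz');
    """
)

# Copy the path for every registry entry whose file still exists.
# Entries without a matching file stay NULL and would need another
# source (the BigQuery data in the fourth step).
conn.execute(
    """
    UPDATE file_registry
    SET path = (
        SELECT rf.path FROM release_files AS rf
        WHERE rf.filename = file_registry.filename
    )
    WHERE path IS NULL
    """
)

rows = dict(conn.execute("SELECT filename, path FROM file_registry"))
print(rows)
```

The entry for the already-removed file remains empty after this pass, which is why the last step is rated harder than the third.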


di · 2025-03-11T16:35:30Z

Definitely in favor of this.

I might suggest we store the blake2_256_digest from the release_files table instead (maybe the other *_digest columns too?) rather than the path, and just give ourselves a helper to convert the right digest to a path instead.

miketheman · 2025-03-11T18:01:25Z

> I might suggest we store the blake2_256_digest from the release_files table instead (maybe the other *_digest columns too?) rather than the path, and just give ourselves a helper to convert the right digest to a path instead.

Interesting - any reason we wouldn't want to use the already-computed path, since that's the reality of where it was stored? By storing a hash, we allow ourselves the ability to change the computed value if the approach to creating the path ever changes in the future.

di · 2025-03-11T18:44:55Z

My thought is that having the digests would be potentially useful for other things as well, and that since the path and the blake2_256_digest are basically the same, it would make more sense to store the digest (or multiple digests) and derive the path from them rather than vice versa.

Also, if we ever changed how files are stored, we'd just need to update the helper, not rewrite the path column for every filename.
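
A rough sketch of what such a helper pair might look like, assuming the current layout of two 2-character prefixes followed by the remaining digest and the filename. The function names are hypothetical and do not exist in Warehouse:

```python
def path_from_digest(blake2_256_digest: str, filename: str) -> str:
    """Derive the storage path from a stored hex digest (assumed layout)."""
    d = blake2_256_digest
    return "/".join([d[:2], d[2:4], d[4:], filename])


def digest_from_path(path: str) -> str:
    """Recover the hex digest from a path that uses the same layout."""
    # Split into at most four parts: prefix, prefix, rest of digest, filename.
    first, second, rest, _filename = path.split("/", 3)
    return first + second + rest


# Hypothetical digest, for illustration only (64 hex characters).
digest = "ab" * 32
path = path_from_digest(digest, "example-1.0.tar.gz")
print(path)
print(digest_from_path(path) == digest)
```

Under this layout the conversion is lossless in both directions, so storing either value preserves the same information; the difference is which one is treated as the source of truth if the layout ever changes.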

miketheman added the data quality label Mar 11, 2025
