
Transaction size and upload process #70

Closed
mnm678 opened this issue Nov 14, 2019 · 13 comments

@mnm678
Collaborator

mnm678 commented Nov 14, 2019

dstufft:

PyPI currently does not allow anything more than a single file per transaction, are we going to be required to change that for this PEP?

Currently on PyPI all of this is handled in-band during the upload, and each file uploaded is treated as a separate transaction. Does this PEP require that all files for a release be uploaded within the same transaction? Are we going to have to move to an out of band process for dealing with releases and have the upload API just queue a release?

trishankatdatadog added this to the Round 5 milestone Nov 14, 2019
@lukpueh
Member

lukpueh commented Nov 15, 2019

Are there even files in a PyPI release that depend on each other? We don't, for example, have to wait for a source distribution to be uploaded before making the wheel for that release available. I think the consistent snapshot section in the PEP mentions this.

So, if all files that can be uploaded to PyPI are self-contained, then a snapshot can be generated on each upload. We just have to make sure that snapshots are created one at a time.

cc @dstufft

@dstufft

dstufft commented Nov 16, 2019

Correct, nothing currently requires multiple files to be uploaded in one atomic transaction (indeed, these files can be uploaded minutes, hours, days, or weeks apart). As long as TUF is fine with that, it shouldn't be a problem.

@lukpueh
Member

lukpueh commented Nov 18, 2019

Thanks for clarifying, @dstufft! Are there any other files created on upload? @trishankatdatadog has mentioned simple indices in the past, but are those actually files, or are they created dynamically when querying the simple API?

@lukpueh
Member

lukpueh commented Nov 18, 2019

@trishankatdatadog, would it be conceivable to simply remove the transaction processes as we currently describe them in the PEP?

Instead, files continue to be uploaded independently, as they are now. When a file is uploaded, it is placed into the snapshot FIFO queue. The snapshot process reads from that queue and generates consistent snapshots one at a time.

I think all the TUF-related work can be done in the snapshot process, because it is already the bottleneck (I/O, crypto operations) even if there were concurrent coordinating transaction processes. This would have the advantage that upload processes don't have to coordinate with regard to the bins that the uploaded targets fall into.
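
To illustrate, here is a minimal sketch of what I have in mind (Python; `update_bin_n_metadata()` and `write_consistent_snapshot()` are placeholder names, not existing PyPI or TUF APIs):

```python
import queue
import threading

# FIFO queue shared by all upload handlers and the single snapshot worker.
snapshot_queue: "queue.Queue[str]" = queue.Queue()

def update_bin_n_metadata(target_path: str) -> None:
    """Placeholder: add the target's hash/length to its bin-n metadata."""

def write_consistent_snapshot() -> None:
    """Placeholder: sign and publish new snapshot and timestamp metadata."""

def handle_upload(target_path: str) -> None:
    # Called once per upload request; returns immediately, so uploads
    # never block on each other or on TUF crypto/I/O work.
    snapshot_queue.put(target_path)

def snapshot_worker() -> None:
    # The single consumer: all bin-n updates and snapshot generation
    # happen here, strictly one at a time, so upload handlers need no
    # coordination over which bin a target falls into.
    while True:
        target_path = snapshot_queue.get()
        update_bin_n_metadata(target_path)
        write_consistent_snapshot()
        snapshot_queue.task_done()

threading.Thread(target=snapshot_worker, daemon=True).start()
```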

@dstufft

dstufft commented Nov 18, 2019

All of the APIs are dynamically generated and then cached in the CDN (and re-generated when the cached object gets invalidated or ages out). I suppose part of implementing this is going to be ensuring that we have stable output of that dynamic generation OR moving to generating it on upload.

@trishankatdatadog
Collaborator

trishankatdatadog commented Nov 18, 2019

@lukpueh Sorry, I'm not quite clear on what you mean. Could you define transaction vs. upload processes?

@lukpueh
Member

lukpueh commented Nov 19, 2019

@dstufft, generating simple index pages on upload is, AFAICS, the only way to protect this information with TUF, because TUF can only protect files. The question is: should simple indices be protected? I think they should, because that's probably how clients learn about new releases.

@lukpueh
Member

lukpueh commented Nov 19, 2019

@trishankatdatadog, we defined transaction processes in the PEP as follows:

peps/pep-0458.txt, lines 879 to 888 in c16b3fb:

Each transaction process keeps track of a project upload, adds all new target
files to the most recent, relevant *bin-n* metadata and informs the
snapshot process to produce a consistent snapshot. Each project release SHOULD
be handled in an atomic transaction, so that a given consistent snapshot
contains all target files of a project release. However, transaction processes
MAY be parallelized under the following constraints:
- Pairs of transaction processes MUST NOT concurrently work on the same project.
- Pairs of transaction processes MUST NOT concurrently work on projects that
belong to the same *bin-n* role.

I was under the impression that transaction processes need to be daemonized, because they may have to block on each other in order to respect the two concurrency constraints mentioned in the PEP.

Uploads, on the other hand, are the HTTP (or whatever protocol) requests from clients that post the target files to PyPI. For the sake of user experience, these probably shouldn't block on each other.

What I tried to ask is whether transaction processes are necessary at all, or whether it wouldn't be enough to place each uploaded target file into a snapshot queue on upload (in-band and oblivious to TUF bins) and, in the background, run a single snapshot process that generates consistent snapshots for one target file (plus newly generated simple indices) at a time.
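
To make the bin constraint concrete, here is a rough sketch of hashed-bin assignment (the digest-prefix scheme and bin count here are illustrative, not the PEP's exact parameters):

```python
import hashlib

PREFIX_LEN = 2  # 16**2 = 256 bins; illustrative, not the PEP's exact count

def bin_for_target(target_path: str) -> str:
    # The bin is chosen by a prefix of the SHA-256 digest of the target
    # path, so two unrelated projects can land in the same bin-n role.
    digest = hashlib.sha256(target_path.encode("utf-8")).hexdigest()
    return "bin-" + digest[:PREFIX_LEN]

# Two unrelated uploads may map to the same bin-n metadata file; parallel
# transaction processes would have to coordinate on that, whereas a single
# sequential snapshot process never conflicts with itself.
print(bin_for_target("foo/foo-1.0-py3-none-any.whl"))
print(bin_for_target("bar/bar-2.1.tar.gz"))
```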

Is this clearer now? If not, we can discuss it on a chat/call.

@dstufft

dstufft commented Nov 19, 2019

@lukpueh I mean, at the end of the day, unless TUF is also serving the file itself, it doesn't know whether the response is being dynamically generated or generated once, right? It's just going to record the hash of what the response should be in its metadata.

@lukpueh
Member

lukpueh commented Nov 19, 2019

@dstufft, you're absolutely right, it does not have to be an actual file on the repository. It's enough if the contents are deterministic for a given snapshot and can be hashed and stored in targets or delegated targets metadata when the snapshot is created.
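
For instance, a minimal sketch (`render_simple_index()` is a hypothetical stand-in for whatever produces the response body; the fileinfo shape of length plus hashes follows TUF's targets metadata format):

```python
import hashlib

def render_simple_index(project_name: str) -> bytes:
    # Placeholder renderer; the only hard requirement is that the output
    # is deterministic for a given snapshot of the database.
    return f"<html><body>{project_name}</body></html>".encode("utf-8")

def target_info_for_index(project_name: str) -> dict:
    # What would go into the relevant bin-n targets metadata for the
    # simple index "file", whether or not it ever exists on disk.
    body = render_simple_index(project_name)
    return {
        f"{project_name}/index.html": {
            "length": len(body),
            "hashes": {"sha256": hashlib.sha256(body).hexdigest()},
        }
    }
```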

@dstufft

dstufft commented Nov 19, 2019

That being said, one argument in favor of pre-generating it, instead of generating it on the fly, is that if we continue to generate it on the fly, there is a much larger chance that we accidentally change something that modifies the output. Even something as simple as a project changing the capitalization of its name would currently change the hash of the simple page, which is likely non-obvious. So there is a reasonable argument that the safest thing to do is to move from dynamically generating the page on each request to generating it during the upload process and then storing it indefinitely. That way, changes would only affect newly generated versions of the file, unless we went out of our way to regenerate the stored pages.

I don't think the PEP needs to go into detail about exactly how that should be accomplished, but it would be great if it could at least call out the fact that, as things stand, the simple responses are dynamic, and that they need to be kept deterministic even if the underlying database changes.
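
A rough sketch of the "generate on upload, store indefinitely" option (illustrative paths and a placeholder renderer; nothing here is current PyPI code):

```python
from pathlib import Path

STORE = Path("/srv/pypi/simple")  # illustrative storage root

def render_simple_index(project_name: str) -> bytes:
    # Placeholder renderer; only runs once, at upload time.
    return f"<html><body>{project_name}</body></html>".encode("utf-8")

def publish_simple_index(project_name: str) -> None:
    # Render once during upload and store the bytes indefinitely. Later
    # database changes (e.g. a capitalization change to the project name)
    # can then no longer silently alter bytes that TUF metadata has
    # already committed to.
    out = STORE / project_name / "index.html"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_bytes(render_simple_index(project_name))
```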

@trishankatdatadog
Collaborator

@lukpueh That's fine, as long as we make it clear that although uploads may be parallel, snapshots are not (because two snapshots might write to the same bin). Are we on the same page?

lukpueh added a commit to lukpueh/peps that referenced this issue Nov 22, 2019
Following discussions with @dstufft and @trishankatdatadog
regarding file uploads and simple index generation on PyPI (see
secure-systems-lab#70) this commit once more refines the
"producing consistent snapshots" section.

It includes the following changes:

- Remove the notion of *transaction processes* and instead talk
  about *uploads*.
  Background: Transaction processes are only relevant if multiple
  files of a project release need to be handled in a single
  transaction, which is not the case on PyPI, where each upload of
  a distribution file is self-contained.
  With this change, upload processes just place files into a queue,
  without updating bin-n metadata (as transaction processes would
  have done in parallel), and all the metadata update/creation work
  is done by the snapshot process in a strictly sequential manner.

- Add a paragraph about simple index pages and how their hashes
  should be included in *bin-n* metadata, and how they need to
  remain stable if re-generated dynamically.
lukpueh added a commit to lukpueh/peps that referenced this issue Nov 25, 2019
@lukpueh
Member

lukpueh commented Dec 2, 2019

Fixed in #75

lukpueh closed this as completed Dec 2, 2019