Transaction size and upload process #70
Are there even files in a PyPI release that depend on each other? We don't, for example, have to wait for a source distribution to be uploaded in order to make the wheel for that release available. I think the consistent snapshot section in the PEP mentions this. So, if all files that can be uploaded to PyPI are self-contained, then a snapshot can be generated on each upload. We just have to make sure that snapshots are created one at a time. cc @dstufft
Correct, nothing currently requires multiple files to be uploaded in one atomic transaction (and indeed these files can be uploaded minutes, hours, days, weeks, etc. apart). So as long as TUF is fine with that, then that shouldn't be a problem.
Thanks for clarifying, @dstufft! Are there any other files created on upload? @trishankatdatadog has mentioned simple indices in the past, but are those actually files, or are they created dynamically when querying the simple API?
@trishankatdatadog, would it be conceivable to just remove the transaction processes as we describe them in the PEP at the moment? Instead, files continue to be uploaded independently as they are now. When a file is uploaded it is placed into the snapshot FIFO queue. The snapshot process reads from that queue and generates consistent snapshots one at a time. I think all the TUF-related work can be done in the snapshot process, because it is already the bottleneck (I/O, crypto operations) even if there were concurrent coordinating transaction processes. This would have the advantage that upload processes don't have to coordinate with regard to the bins that the uploaded targets fall into.
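For illustration, here is a minimal sketch of the queue-based flow described above, assuming a hypothetical generate_consistent_snapshot() helper; all names are illustrative, not Warehouse code:

```python
import queue
import threading

# FIFO of uploaded target paths, shared between upload handlers and the
# single snapshot worker.
snapshot_queue: "queue.Queue[str]" = queue.Queue()

def on_upload(target_path: str) -> None:
    """Called at the end of an upload request: enqueue and return immediately,
    so uploads never block on TUF metadata work."""
    snapshot_queue.put(target_path)

def generate_consistent_snapshot(target_path: str) -> None:
    # Placeholder for the actual metadata work: hash the target, update the
    # delegated bin-n role it falls into, then re-sign snapshot and timestamp.
    ...

def snapshot_worker() -> None:
    """Single background worker: consumes one target at a time, so consistent
    snapshots are produced strictly sequentially."""
    while True:
        target_path = snapshot_queue.get()
        generate_consistent_snapshot(target_path)
        snapshot_queue.task_done()

threading.Thread(target=snapshot_worker, daemon=True).start()
```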
All of the APIs are dynamically generated and then cached in the CDN (and re-generated when the cached object gets invalidated or ages out). I suppose part of implementing this is going to be ensuring that we have stable output of that dynamic generation OR moving to generating it on upload. |
@lukpueh Sorry, I'm not quite clear on what you mean. Could you define transaction vs. upload processes?
@dstufft, generating simple index pages on upload is, AFAICS, the only way to protect this information with TUF, because TUF can only protect files. The question is, should simple indices be protected? I think they should, because that's probably how clients learn about new releases.
@trishankatdatadog, we defined transaction processes in the PEP as follows: Lines 879 to 888 in c16b3fb
I was under the impression that transaction processes need to be daemonized, because they may have to block on each other in order to respect the two concurrency constraints mentioned in the PEP. Uploads, on the other hand, are the HTTP (or whatever protocol) requests from clients that post the target files to PyPI. For the sake of user experience, these probably shouldn't block on each other. What I tried to ask is whether transaction processes are necessary, and whether it wouldn't be enough to, on each upload, place the uploaded target file into a snapshot queue (in-band and oblivious to TUF bins), and in the background run the single snapshot process to generate consistent snapshots for one target file (plus newly generated simple indices) at a time. Is this clearer now? If not, we can discuss it on a chat/call.
@lukpueh I mean, at the end of the day, unless TUF is also serving the file itself, it doesn't know whether the response is being dynamically generated or generated once, right? It's just going to record the hash of what the response should be in its metadata.
@dstufft, you're absolutely right, it does not have to be an actual file on the repository. It's enough if the contents are deterministic for a given snapshot and can be hashed and stored in targets or delegated targets metadata when the snapshot is created.
That being said, one argument in favor of pre-generating it, instead of generating it on the fly, is that if we continue to generate it on the fly there is a much larger chance that we accidentally change something that modifies the output. Even something as simple as the project changing the capitalization of its name would currently change the hash of the simple page, which is likely non-obvious. So there is a reasonable argument that the safest thing to do is move from dynamically generating the page on each request to generating it during the upload process and then storing it indefinitely. That would ensure changes only affect new versions of that file, unless we go out of our way to regenerate the stored pages. I don't think the PEP needs to go into details about how exactly that needs to be accomplished, but it would be great if it could at least call out the fact that, as it is, the simple responses are dynamic, and they need to be ensured to be deterministic even if the underlying database changes.
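As a rough sketch of what "deterministic and hashable" could look like (not actual Warehouse code; the HTML layout and helper names are assumptions), the simple page can be rendered with a normalized project name and stable file ordering, and its hash recorded the same way a distribution file's hash would be recorded for a bin-n role:

```python
import hashlib
import re

def normalize_name(project_name: str) -> str:
    # PEP 503 normalization, so display-casing changes don't alter the output.
    return re.sub(r"[-_.]+", "-", project_name).lower()

def render_simple_index(project_name: str, filenames: list[str]) -> bytes:
    name = normalize_name(project_name)
    links = "\n".join(
        f'<a href="{fn}">{fn}</a><br/>' for fn in sorted(filenames)  # stable ordering
    )
    html = (
        f"<!DOCTYPE html><html><body><h1>Links for {name}</h1>\n"
        f"{links}\n</body></html>"
    )
    return html.encode("utf-8")

def simple_index_target_entry(project_name: str, filenames: list[str]) -> dict:
    # Length and hash in the shape TUF targets metadata expects for a target file.
    content = render_simple_index(project_name, filenames)
    return {
        f"simple/{normalize_name(project_name)}/index.html": {
            "length": len(content),
            "hashes": {"sha256": hashlib.sha256(content).hexdigest()},
        }
    }
```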
@lukpueh That's fine, as long as we make it clear that although uploads may be parallel, snapshots are not (because two snapshots might write to the same bin). Are we on the same page?
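To illustrate why two snapshots could collide on a bin: with hashed bin delegations, the bin a target falls into is derived from a hash of its path, so two unrelated uploads can map to the same bin-n role. The mapping below is a simplification (the bin count and the prefix-to-bin function are assumptions, not the exact scheme in the PEP):

```python
import hashlib

NUM_BINS = 16384  # assumed number of hashed bins

def bin_for_target(target_path: str) -> str:
    digest = hashlib.sha256(target_path.encode("utf-8")).hexdigest()
    index = int(digest[:8], 16) % NUM_BINS  # simplified prefix-based bin index
    return f"bins-{index:04x}"

# Two different distribution files may map to the same delegated role,
# which is why bin-n updates (and thus snapshots) must be serialized.
print(bin_for_target("packages/example/foo-1.0-py3-none-any.whl"))
print(bin_for_target("packages/example/bar-2.1.tar.gz"))
```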
Following discussions with @dstufft and @trishankatdatadog regarding file uploads and simple index generation on PyPI (see secure-systems-lab#70), this commit once more refines the "producing consistent snapshots" section. It includes the following changes: - Remove the notion of *transaction processes* and instead talk about *uploads*. Background: Transaction processes are only relevant if multiple files of a project release need to be handled in a single transaction, which is not the case on PyPI, where each upload of a distribution file is self-contained. With this change, upload processes just place files into a queue, without updating bin-n metadata (as transaction processes would have done in parallel), and all the metadata update/creation work is done by the snapshot process in a strictly sequential manner. - Add a paragraph about simple index pages, how their hashes should be included in *bin-n* metadata, and how they need to remain stable if re-generated dynamically.
Fixed in #75