
Transaction size and upload process #70

Closed
mnm678 opened this issue Nov 14, 2019 · 13 comments

@mnm678
Collaborator

mnm678 commented Nov 14, 2019

dstufft:

PyPI currently does not allow anything more than a single file per transaction, are we going to be required to change that for this PEP?

Currently on PyPI all of this is handled in-band during the upload, and each file uploaded is treated as a separate transaction. Does this PEP require that all files for a release be uploaded within the same transaction? Are we going to have to move to an out of band process for dealing with releases and have the upload API just queue a release?

trishankatdatadog added this to the Round 5 milestone Nov 14, 2019
@lukpueh
Member

lukpueh commented Nov 15, 2019

Are there even files in a PyPI release that depend on each other? We don't, for example, have to wait for a source distribution to be uploaded before making the wheel for that release available. I think the consistent snapshot section in the PEP mentions this.

So, if all files that can be uploaded to PyPI are self-contained, then a snapshot can be generated on each upload. We just have to make sure that snapshots are created one at a time.

cc @dstufft

@dstufft

dstufft commented Nov 16, 2019

Correct, nothing currently requires multiple files to be uploaded in one atomic transaction (indeed, these files can be uploaded minutes, hours, days, or weeks apart). As long as TUF is fine with that, it shouldn't be a problem.

@lukpueh
Member

lukpueh commented Nov 18, 2019

Thanks for clarifying, @dstufft! Are there any other files created on upload? @trishankatdatadog has mentioned simple indices in the past, but are those actually files, or are they created dynamically when querying the simple API?

@lukpueh
Member

lukpueh commented Nov 18, 2019

@trishankatdatadog, would it be conceivable to simply remove the transaction processes as we currently describe them in the PEP?

Instead, files continue to be uploaded independently, as they are now. When a file is uploaded, it is placed into the snapshot FIFO queue. The snapshot process reads from that queue and generates consistent snapshots one at a time.

I think all the TUF-related work can be done in the snapshot process, because it is already the bottleneck (I/O, crypto operations) even if there were concurrent coordinating transaction processes. This would have the advantage that upload processes don't have to coordinate with regard to the bins that the uploaded targets fall into.
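
To illustrate, here is a minimal sketch of what I have in mind (Python; `update_bin_n_metadata()` and `write_consistent_snapshot()` are placeholder names, not existing PyPI or TUF APIs):

```python
import queue
import threading

# FIFO queue shared by all upload handlers and the single snapshot worker.
snapshot_queue: "queue.Queue[str]" = queue.Queue()

def update_bin_n_metadata(target_path: str) -> None:
    """Placeholder: add the target's hash/length to its bin-n metadata."""

def write_consistent_snapshot() -> None:
    """Placeholder: sign and publish new snapshot and timestamp metadata."""

def handle_upload(target_path: str) -> None:
    # Called once per upload request; returns immediately, so uploads
    # never block on each other or on TUF crypto/I/O work.
    snapshot_queue.put(target_path)

def snapshot_worker() -> None:
    # The single consumer: all bin-n updates and snapshot generation
    # happen here, strictly one at a time, so upload handlers need no
    # coordination over which bin a target falls into.
    while True:
        target_path = snapshot_queue.get()
        update_bin_n_metadata(target_path)
        write_consistent_snapshot()
        snapshot_queue.task_done()

threading.Thread(target=snapshot_worker, daemon=True).start()
```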

@dstufft

dstufft commented Nov 18, 2019

All of the APIs are dynamically generated and then cached in the CDN (and re-generated when the cached object gets invalidated or ages out). I suppose part of implementing this is going to be ensuring that we have stable output of that dynamic generation OR moving to generating it on upload.

@trishankatdatadog
Collaborator

trishankatdatadog commented Nov 18, 2019

@lukpueh Sorry, I'm not quite clear on what you mean. Could you define transaction vs. upload processes?

@lukpueh
Member

lukpueh commented Nov 19, 2019

@dstufft, generating simple index pages on upload is, AFAICS, the only way to protect this information with TUF, because TUF can only protect files. The question is: should simple indices be protected? I think they should, because that's probably how clients learn about new releases.

@lukpueh
Member

lukpueh commented Nov 19, 2019

@trishankatdatadog, we defined transaction processes in the PEP as follows:

peps/pep-0458.txt, lines 879 to 888 in c16b3fb:

Each transaction process keeps track of a project upload, adds all new target
files to the most recent, relevant *bin-n* metadata and informs the
snapshot process to produce a consistent snapshot. Each project release SHOULD
be handled in an atomic transaction, so that a given consistent snapshot
contains all target files of a project release. However, transaction processes
MAY be parallelized under the following constraints:
- Pairs of transaction processes MUST NOT concurrently work on the same project.
- Pairs of transaction processes MUST NOT concurrently work on projects that
belong to the same *bin-n* role.

I was under the impression that transaction processes need to be daemonized, because they may have to block on each other in order to respect the two concurrency constraints mentioned in the PEP.

Uploads, on the other hand, are the HTTP (or whatever protocol) requests from clients that post the target files to PyPI. For the sake of user experience, these probably shouldn't block on each other.

What I tried to ask is whether transaction processes are necessary at all, or whether it wouldn't be enough to place each uploaded target file into a snapshot queue on upload (in-band and oblivious to TUF bins) and, in the background, run a single snapshot process that generates consistent snapshots for one target file (plus newly generated simple indices) at a time.
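
To make the bin constraint concrete, here is a rough sketch of hashed-bin assignment (the digest-prefix scheme and bin count here are illustrative, not the PEP's exact parameters):

```python
import hashlib

PREFIX_LEN = 2  # 16**2 = 256 bins; illustrative, not the PEP's exact count

def bin_for_target(target_path: str) -> str:
    # The bin is chosen by a prefix of the SHA-256 digest of the target
    # path, so two unrelated projects can land in the same bin-n role.
    digest = hashlib.sha256(target_path.encode("utf-8")).hexdigest()
    return "bin-" + digest[:PREFIX_LEN]

# Two unrelated uploads may map to the same bin-n metadata file; parallel
# transaction processes would have to coordinate on that, whereas a single
# sequential snapshot process never conflicts with itself.
print(bin_for_target("foo/foo-1.0-py3-none-any.whl"))
print(bin_for_target("bar/bar-2.1.tar.gz"))
```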

Is this clearer now? If not, we can discuss it on a chat/call.

@dstufft

dstufft commented Nov 19, 2019

@lukpueh I mean, at the end of the day, unless TUF is also serving the file itself, it doesn't know whether the response is being dynamically generated or generated once, right? It's just going to record the hash of what the response should be in its metadata.

@lukpueh
Member

lukpueh commented Nov 19, 2019

@dstufft, you're absolutely right, it does not have to be an actual file on the repository. It's enough if the contents are deterministic for a given snapshot and can be hashed and stored in targets or delegated targets metadata when the snapshot is created.
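
For instance, a minimal sketch (`render_simple_index()` is a hypothetical stand-in for whatever produces the response body; the fileinfo shape of length plus hashes follows TUF's targets metadata format):

```python
import hashlib

def render_simple_index(project_name: str) -> bytes:
    # Placeholder renderer; the only hard requirement is that the output
    # is deterministic for a given snapshot of the database.
    return f"<html><body>{project_name}</body></html>".encode("utf-8")

def target_info_for_index(project_name: str) -> dict:
    # What would go into the relevant bin-n targets metadata for the
    # simple index "file", whether or not it ever exists on disk.
    body = render_simple_index(project_name)
    return {
        f"{project_name}/index.html": {
            "length": len(body),
            "hashes": {"sha256": hashlib.sha256(body).hexdigest()},
        }
    }
```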

@dstufft

dstufft commented Nov 19, 2019

That being said, one argument in favor of pre-generating it, instead of generating it on the fly, is that if we continue to generate it on the fly, there is a much larger chance that we accidentally change something that modifies the output. Even something as simple as a project changing the capitalization of its name would currently change the hash of the simple page, which is likely non-obvious. So there is a reasonable argument that the safest thing to do is to move from dynamically generating the page on each request to generating it during the upload process and then storing it indefinitely. That way, changes would only affect newly generated versions of the file, unless we went out of our way to regenerate the stored pages.

I don't think the PEP needs to go into detail about exactly how that should be accomplished, but it would be great if it could at least call out the fact that, as things stand, the simple responses are dynamic, and that they need to be kept deterministic even if the underlying database changes.
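
A rough sketch of the "generate on upload, store indefinitely" option (illustrative paths and a placeholder renderer; nothing here is current PyPI code):

```python
from pathlib import Path

STORE = Path("/srv/pypi/simple")  # illustrative storage root

def render_simple_index(project_name: str) -> bytes:
    # Placeholder renderer; only runs once, at upload time.
    return f"<html><body>{project_name}</body></html>".encode("utf-8")

def publish_simple_index(project_name: str) -> None:
    # Render once during upload and store the bytes indefinitely. Later
    # database changes (e.g. a capitalization change to the project name)
    # can then no longer silently alter bytes that TUF metadata has
    # already committed to.
    out = STORE / project_name / "index.html"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_bytes(render_simple_index(project_name))
```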

@trishankatdatadog
Collaborator

@lukpueh That's fine, as long as we make it clear that although uploads may be parallel, snapshots are not (because two snapshots might write to the same bin). Are we on the same page?

lukpueh added a commit to lukpueh/peps that referenced this issue Nov 22, 2019
Following discussions with @dstufft and @trishankatdatadog
regarding file uploads and simple index generation on PyPI (see
secure-systems-lab#70) this commit once more refines the
"producing consistent snapshots" section.

It includes the following changes:

- Remove the notion of *transaction processes* and instead talk
  about *uploads*.
  Background: Transaction processes are only relevant if multiple
  files of a project release need to be handled in a single
  transaction, which is not the case on PyPI, where each upload of
  a distribution file is self-contained.
  With this change, upload processes just place files into a queue,
  without updating bin-n metadata (as transaction processes would
  have done in parallel), and all the metadata update/creation work
  is done by the snapshot process in a strictly sequential manner.

- Add a paragraph about simple index pages and how their hashes
  should be included in *bin-n* metadata, and how they need to
  remain stable if re-generated dynamically.
lukpueh added a commit to lukpueh/peps that referenced this issue Nov 25, 2019
@lukpueh
Member

lukpueh commented Dec 2, 2019

Fixed in #75

lukpueh closed this as completed Dec 2, 2019