Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: centralize Pants lockfile validation metadata in a single file #14281

Open
Eric-Arellano opened this issue Jan 27, 2022 · 15 comments
Open

Comments

@Eric-Arellano
Copy link
Contributor

Eric-Arellano commented Jan 27, 2022

Status quo

Currently, we store lockfile metadata in headers like this:

# This lockfile was autogenerated by Pants. To regenerate, run:
#
# build-support/bin/generate_all_lockfiles.sh
#
# --- BEGIN PANTS LOCKFILE METADATA: DO NOT EDIT OR REMOVE ---
# {
# "version": 2,
# "valid_for_interpreter_constraints": [
# "CPython<3.10,>=3.7"
# ],
# "generated_with_requirements": [
# "flake8-2020<1.7.0,>=1.6.0",
# "flake8-no-implicit-concat",
# "flake8-pantsbuild<3,>=2.0",
# "flake8<4.0,>=3.9.2"
# ]
# }
# --- END PANTS LOCKFILE METADATA ---

@jsirois points out that this will be an issue with PEX and package-lock.json, which both use JSON. We will have to strip the headers at consumption time.

There are two concerning downsides to this:

  • We break interoperability with non-Pants ecosystem, which is a project goal to support.
  • IDEs won't like it. You can call the file .jsonc to work around that, but that's invalid to call package-lock.json a .jsonc file.

While we switched our proprietary JVM lockfile format to TOML to accommodate comments (#14175), we cannot strongarm every lock we support to be able to do this.

Proposal

Instead, we would need to start storing Pants metadata in dedicated, first-class file(s).

We could store one metadata file per lock, like black.lock and black.lock.metadata...but that would double the number of files you have to generate, where the metadata ones would also be very small. I believe this is more boilerplate than we want.

Instead, I propose a single file used for lockfile validation, something like this:

[black]
valid_for_interpreter_constraints = ["==3.7.*"]
requirements = ["black==2.2"]
lockfile_hash = "abac1"

[scrooge]
artifacts = ["group:artifact:2.1"]

(Note that we already check for ambiguity amongst all resolve names, so the keys will be unique.)

@chrisjrn
Copy link
Contributor

chrisjrn commented Jan 27, 2022

tl;dr: I think comments are essential in human-editable configuration files and would prefer to do the work to allow that, but could live with this proposal with some modifications if we have to go down this route.

First, on the assumptions here:

  1. I don't think it's bad, per se, for Pants to preprocess a lockfile and provide something different to the tool that consumes the lockfile, especially if that preprocessing allows the serialized config file to have comments.
  2. YAML 1.2 is a superset of JSON, which allows comments, which, if there's insistence on JSON-like presentation, should be what we encourage for writable configuration files, specifically so that they can allow comments (including warnings that a human-editable file is not meant to be edited by humans). If not, TOML is a much better format for this, since it's somewhat more human-writable and no less machine-writable.
  3. Having metadata inside the lockfile is quite important, because the lockfile data and its metadata are an atomic unit, and our header comments say as much. Having the metadata separate means it's possible to drop in a new file, and Pants wouldn't validate that it's correct. Currently we don't do a whole lot to validate that the lockfiles haven't been hand-amended, but I don't feel great about making a step backwards there.

Generally speaking, we should allow for comments in our lockfiles, either by making the LockfileMetadata serializer configurable in how comments are rendered (in case we need to deal with, e.g. XML-style <!-- blocks),

So my summary of the problem is that PEX doesn't allow comments in lockfiles. I'd recommend point 2 (changing out the JSON parser in PEX to allow comments), but also I don't feel bad about implementing point 1 either as the solution to this, or as future work.

Secondly, re the proposal itself:

If we decide we need to separate the metadata and the lockfile itself, the metadata should also ensure that the lockfile was in fact generated from the input requirements specified in the separate config file: the status quo is that we validate the inputs are correct, this would be validating that the lockfile content based on the inputs.

Probably the most reliable way to do that is to ensure that the lockfile is the same one as Pants generated. So we'd want to store a SHA digest of the lockfile output in the new metadata file, and have Pants issue a warning or error about the dangers of manually editing lockfiles.

@jsirois
Copy link
Contributor

jsirois commented Jan 27, 2022

So my summary of the problem is that PEX doesn't allow comments in lockfiles.

How do you propose handling the js world and package.json?

@chrisjrn
Copy link
Contributor

@jsirois

I'd recommend point 2 (changing out the JSON parser in PEX to allow comments), but also I don't feel bad about implementing point 1 either as the solution to this, or as future work.

@Eric-Arellano
Copy link
Contributor Author

How do you propose handling the js world and package.json?

We will have to strip the header at consumption time, which smells wrong to me: integrity-wise, should Pants really be messing with lock contents?

My question: is this actually a big deal? And are there other downsides I'm missing to writing-then-stripping the header from the lockfile? ("wasted computation" seems irrelevant, this all should be fast.)

@chrisjrn
Copy link
Contributor

We're definitely already reading the file twice in Python-land: once in Pants to extract the metadata, and once when pip consumes the lockfile. It wouldn't be the worst thing in the world to output a pre-processed lockfile as an intermediate build step. There'd be no extra read operations, and only one extra write.

@jsirois
Copy link
Contributor

jsirois commented Jan 27, 2022

@chrisjrn if the answer for package.json is 1 then I'd need better justification to change the Pex format to a non-stdlib one since pants needs to support 1 or Eric's proposal to handle the js ecosystem anyhow.

@Eric-Arellano
Copy link
Contributor Author

if the answer for package.json is 1 then I'd need better justification to change the Pex format to a non-stdlib one since pants needs to support 1 or Eric's proposal to handle the js ecosystem anyhow.

That makes sense to me for Pex to not consider Pants in determining which lockfile format to use. I personally find TOML more readable (less nesting), and I think it's useful it allows comments. So I'd encourage its use. But I know vendoring toml or tomli has downsides, so it's reasonable to me if you think sticking with JSON is better.

What I'm more interested in is confirming that it is acceptable for Pants to pre-process the lockfile before passing it to Pex. It seems fine to me to do.

@jsirois
Copy link
Contributor

jsirois commented Jan 27, 2022

I really don't buy the comments / readability arguments since a lock file is not for editing. You do have to be able to read it, but you can honestly read the current Pex lock files just fine. N.B. tomli / tomli-w are not an option, Python 3 only for both. It would have to be toml.

@Eric-Arellano
Copy link
Contributor Author

I really don't buy the comments / readability arguments since a lock file is not for editing.

I think it's useful to have nice Git diffs with lockfiles. When you update a lockfile, you should be looking at what transitive deps have changed. TOML is particularly good at this imo because of how it handles nesting & whitespace.

If we stick with json for PEX, we could use json.dumps(indent=) to make it more readable at least.

It would have to be toml.

Which is still unmaintained, although might get new life soon: uiri/toml#361 (comment)

@jsirois
Copy link
Contributor

jsirois commented Jan 27, 2022

Yes, Pex already supports --indent and it goes to great lengths to supply a stable output (sorts keys and all order doesn't matter lists are sorted). So unless you show me a diff that's hard to read with Pex lockfiles I'm still viewing this as theoretical preaching.

@Eric-Arellano
Copy link
Contributor Author

I don't like this idea of a centralized metadata file due to Chris's concerns about bad coupling. It also makes total sense why Pex wants to stick with JSON.

Meaning that Pants will need to pre-process the lockfile. Not a very big deal.

@thejcannon
Copy link
Member

I don't feel comfortable using a file that's generateable and consumable by another tool without the right guardrails in place.

Scenario: Pants support JS (:tada:) and is managing the lockfile. Some Pants-ignorant user's runs an npm command that they read online, or their colleague told them to run, etc... which overwrites the lockfile with a newer, totally valid version. The metadata doesn't get updated. Pants doesn't know the lockfile wasn't generated from the metadata so is happy. But now the metadata is inconsistent with the lockfile.

@thejcannon
Copy link
Member

Ahh I missed this:

Probably the most reliable way to do that is to ensure that the lockfile is the same one as Pants generated. So we'd want to store a SHA digest of the lockfile output in the new metadata file, and have Pants issue a warning or error about the dangers of manually editing lockfiles.

That solves my concern. I'm +1 on this proposal. It ends all discussions about any particular tool's compatibility with Pants' needs.

@thejcannon
Copy link
Member

Here's a light brain dump on what this would entail:

  • All the obvious stuff: generate-lockfiles updates the metadata file on disk, also strips existing comment headers from lockfiles on disk. (Bikeshed on where this new file lives by default)
  • Both in-file metadata and extra-file metadata are supported for a time
  • invalid_lockfile_behavior config now is regarding the metadata file and is moved to [generate-lockfiles], and unless there's a push otherwise, would encompass the file hash as well (with a helpful message if warn/error). Additionally, this gives us a chance to rename it as well. (Bikeshed, but mismatched_lockfile_behavior or something similar)
  • The only interesting thing here is what to do for lockfile that have been removed from Pants' view. I think any operation that touches the metadata file does light validation that the entries == Pants worldview (and there are no extra entries). This wouldn't involve any extra IO, so should be fast. Probably bikeshed on a new option or invalid_lockfile_behavior on how to report discrepancies

@cognifloyd
Copy link
Member

I added a note about this to #18326.

The support in that PR bypasses the problems here by only supporting dependencies defined in package.json. I believe any node_third_party_package explicitly defined in a BUILD file is essentially ignored, which makes the package.json file(s) the only source of truth for the dependencies that go into generating the lockfile.

As soon as there are multiple inputs (or at least a pants-specific input beyond the ecosystem-native dep definition) that should be covered by the lockfile, that is when pants has to step in and start tracking metadata about the lockfile's generation to ensure manual regeneration does not exclude those pants-specific dependencies.

This is an aspect about lockfiles that I hadn't considered, so I'm posting it as food for thought for others. Hopefully it can help us reach a resolution on the ideal cross-language way to handle the lockfile metadata and verification of the lockfile based on that metadata before we can trust that the lockfile is in a sane state / ready for use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants