Proposal: centralize Pants lockfile validation metadata in a single file #14281

Eric-Arellano · 2022-01-27T17:00:50Z

Status quo

Currently, we store lockfile metadata in headers like this:

pants/3rdparty/python/lockfiles/flake8.txt

Lines 1 to 18 in 94cef33

    
           # This lockfile was autogenerated by Pants. To regenerate, run: 
        
           # 
        
           #    build-support/bin/generate_all_lockfiles.sh 
        
           # 
        
           # --- BEGIN PANTS LOCKFILE METADATA: DO NOT EDIT OR REMOVE --- 
        
           # { 
        
           #   "version": 2, 
        
           #   "valid_for_interpreter_constraints": [ 
        
           #     "CPython<3.10,>=3.7" 
        
           #   ], 
        
           #   "generated_with_requirements": [ 
        
           #     "flake8-2020<1.7.0,>=1.6.0", 
        
           #     "flake8-no-implicit-concat", 
        
           #     "flake8-pantsbuild<3,>=2.0", 
        
           #     "flake8<4.0,>=3.9.2" 
        
           #   ] 
        
           # } 
        
           # --- END PANTS LOCKFILE METADATA ---

@jsirois points out that this will be an issue with PEX and package-lock.json, which both use JSON. We will have to strip the headers at consumption time.

There are two concerning downsides to this:

We break interoperability with non-Pants ecosystem, which is a project goal to support.
IDEs won't like it. You can call the file .jsonc to work around that, but that's invalid to call package-lock.json a .jsonc file.

While we switched our proprietary JVM lockfile format to TOML to accommodate comments (#14175), we cannot strongarm every lock we support to be able to do this.

Proposal

Instead, we would need to start storing Pants metadata in dedicated, first-class file(s).

We could store one metadata file per lock, like black.lock and black.lock.metadata...but that would double the number of files you have to generate, where the metadata ones would also be very small. I believe this is more boilerplate than we want.

Instead, I propose a single file used for lockfile validation, something like this:

[black]
valid_for_interpreter_constraints = ["==3.7.*"]
requirements = ["black==2.2"]
lockfile_hash = "abac1"

[scrooge]
artifacts = ["group:artifact:2.1"]

(Note that we already check for ambiguity amongst all resolve names, so the keys will be unique.)

The text was updated successfully, but these errors were encountered:

chrisjrn · 2022-01-27T17:54:12Z

tl;dr: I think comments are essential in human-editable configuration files and would prefer to do the work to allow that, but could live with this proposal with some modifications if we have to go down this route.

First, on the assumptions here:

I don't think it's bad, per se, for Pants to preprocess a lockfile and provide something different to the tool that consumes the lockfile, especially if that preprocessing allows the serialized config file to have comments.
YAML 1.2 is a superset of JSON, which allows comments, which, if there's insistence on JSON-like presentation, should be what we encourage for writable configuration files, specifically so that they can allow comments (including warnings that a human-editable file is not meant to be edited by humans). If not, TOML is a much better format for this, since it's somewhat more human-writable and no less machine-writable.
Having metadata inside the lockfile is quite important, because the lockfile data and its metadata are an atomic unit, and our header comments say as much. Having the metadata separate means it's possible to drop in a new file, and Pants wouldn't validate that it's correct. Currently we don't do a whole lot to validate that the lockfiles haven't been hand-amended, but I don't feel great about making a step backwards there.

Generally speaking, we should allow for comments in our lockfiles, either by making the LockfileMetadata serializer configurable in how comments are rendered (in case we need to deal with, e.g. XML-style <!-- blocks),

So my summary of the problem is that PEX doesn't allow comments in lockfiles. I'd recommend point 2 (changing out the JSON parser in PEX to allow comments), but also I don't feel bad about implementing point 1 either as the solution to this, or as future work.

Secondly, re the proposal itself:

If we decide we need to separate the metadata and the lockfile itself, the metadata should also ensure that the lockfile was in fact generated from the input requirements specified in the separate config file: the status quo is that we validate the inputs are correct, this would be validating that the lockfile content based on the inputs.

Probably the most reliable way to do that is to ensure that the lockfile is the same one as Pants generated. So we'd want to store a SHA digest of the lockfile output in the new metadata file, and have Pants issue a warning or error about the dangers of manually editing lockfiles.

jsirois · 2022-01-27T18:07:17Z

So my summary of the problem is that PEX doesn't allow comments in lockfiles.

How do you propose handling the js world and package.json?

chrisjrn · 2022-01-27T18:08:39Z

@jsirois

I'd recommend point 2 (changing out the JSON parser in PEX to allow comments), but also I don't feel bad about implementing point 1 either as the solution to this, or as future work.

Eric-Arellano · 2022-01-27T18:22:38Z

How do you propose handling the js world and package.json?

We will have to strip the header at consumption time, which smells wrong to me: integrity-wise, should Pants really be messing with lock contents?

My question: is this actually a big deal? And are there other downsides I'm missing to writing-then-stripping the header from the lockfile? ("wasted computation" seems irrelevant, this all should be fast.)

chrisjrn · 2022-01-27T18:26:21Z

We're definitely already reading the file twice in Python-land: once in Pants to extract the metadata, and once when pip consumes the lockfile. It wouldn't be the worst thing in the world to output a pre-processed lockfile as an intermediate build step. There'd be no extra read operations, and only one extra write.

jsirois · 2022-01-27T18:43:02Z

@chrisjrn if the answer for package.json is 1 then I'd need better justification to change the Pex format to a non-stdlib one since pants needs to support 1 or Eric's proposal to handle the js ecosystem anyhow.

Eric-Arellano · 2022-01-27T21:11:12Z

if the answer for package.json is 1 then I'd need better justification to change the Pex format to a non-stdlib one since pants needs to support 1 or Eric's proposal to handle the js ecosystem anyhow.

That makes sense to me for Pex to not consider Pants in determining which lockfile format to use. I personally find TOML more readable (less nesting), and I think it's useful it allows comments. So I'd encourage its use. But I know vendoring toml or tomli has downsides, so it's reasonable to me if you think sticking with JSON is better.

What I'm more interested in is confirming that it is acceptable for Pants to pre-process the lockfile before passing it to Pex. It seems fine to me to do.

jsirois · 2022-01-27T21:33:14Z

I really don't buy the comments / readability arguments since a lock file is not for editing. You do have to be able to read it, but you can honestly read the current Pex lock files just fine. N.B. tomli / tomli-w are not an option, Python 3 only for both. It would have to be toml.

Eric-Arellano · 2022-01-27T21:48:21Z

I really don't buy the comments / readability arguments since a lock file is not for editing.

I think it's useful to have nice Git diffs with lockfiles. When you update a lockfile, you should be looking at what transitive deps have changed. TOML is particularly good at this imo because of how it handles nesting & whitespace.

If we stick with json for PEX, we could use json.dumps(indent=) to make it more readable at least.

It would have to be toml.

Which is still unmaintained, although might get new life soon: uiri/toml#361 (comment)

jsirois · 2022-01-27T21:52:24Z

Yes, Pex already supports --indent and it goes to great lengths to supply a stable output (sorts keys and all order doesn't matter lists are sorted). So unless you show me a diff that's hard to read with Pex lockfiles I'm still viewing this as theoretical preaching.

Eric-Arellano · 2022-03-08T21:38:32Z

I don't like this idea of a centralized metadata file due to Chris's concerns about bad coupling. It also makes total sense why Pex wants to stick with JSON.

Meaning that Pants will need to pre-process the lockfile. Not a very big deal.

thejcannon · 2022-03-09T22:34:13Z

I don't feel comfortable using a file that's generateable and consumable by another tool without the right guardrails in place.

Scenario: Pants support JS (:tada:) and is managing the lockfile. Some Pants-ignorant user's runs an npm command that they read online, or their colleague told them to run, etc... which overwrites the lockfile with a newer, totally valid version. The metadata doesn't get updated. Pants doesn't know the lockfile wasn't generated from the metadata so is happy. But now the metadata is inconsistent with the lockfile.

thejcannon · 2022-03-09T22:45:16Z

Ahh I missed this:

Probably the most reliable way to do that is to ensure that the lockfile is the same one as Pants generated. So we'd want to store a SHA digest of the lockfile output in the new metadata file, and have Pants issue a warning or error about the dangers of manually editing lockfiles.

That solves my concern. I'm +1 on this proposal. It ends all discussions about any particular tool's compatibility with Pants' needs.

thejcannon · 2022-03-10T17:44:04Z

Here's a light brain dump on what this would entail:

All the obvious stuff: generate-lockfiles updates the metadata file on disk, also strips existing comment headers from lockfiles on disk. (Bikeshed on where this new file lives by default)
Both in-file metadata and extra-file metadata are supported for a time
invalid_lockfile_behavior config now is regarding the metadata file and is moved to [generate-lockfiles], and unless there's a push otherwise, would encompass the file hash as well (with a helpful message if warn/error). Additionally, this gives us a chance to rename it as well. (Bikeshed, but mismatched_lockfile_behavior or something similar)
The only interesting thing here is what to do for lockfile that have been removed from Pants' view. I think any operation that touches the metadata file does light validation that the entries == Pants worldview (and there are no extra entries). This wouldn't involve any extra IO, so should be fast. Probably bikeshed on a new option or invalid_lockfile_behavior on how to report discrepancies

cognifloyd · 2023-03-13T23:28:29Z

I added a note about this to #18326.

The support in that PR bypasses the problems here by only supporting dependencies defined in package.json. I believe any node_third_party_package explicitly defined in a BUILD file is essentially ignored, which makes the package.json file(s) the only source of truth for the dependencies that go into generating the lockfile.

As soon as there are multiple inputs (or at least a pants-specific input beyond the ecosystem-native dep definition) that should be covered by the lockfile, that is when pants has to step in and start tracking metadata about the lockfile's generation to ensure manual regeneration does not exclude those pants-specific dependencies.

This is an aspect about lockfiles that I hadn't considered, so I'm posting it as food for thought for others. Hopefully it can help us reach a resolution on the ideal cross-language way to handle the lockfile metadata and verification of the lockfile based on that metadata before we can trust that the lockfile is in a sane state / ready for use.

Eric-Arellano mentioned this issue Jan 27, 2022

[internal] Add internal ability to generate lockfiles with PEX #14278

Merged

Eric-Arellano closed this as completed Mar 8, 2022

Eric-Arellano mentioned this issue Mar 8, 2022

Switch to PEX for lockfile generation #13964

Closed

Eric-Arellano reopened this Mar 9, 2022

asherf mentioned this issue Jul 7, 2022

Allow pex to read pants style pex lock files pex-tool/pex#1829

Closed

cognifloyd mentioned this issue Mar 13, 2023

javascript: Initial NPM support #18326

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: centralize Pants lockfile validation metadata in a single file #14281

Proposal: centralize Pants lockfile validation metadata in a single file #14281

Eric-Arellano commented Jan 27, 2022 •

edited

Loading

chrisjrn commented Jan 27, 2022 •

edited

Loading

jsirois commented Jan 27, 2022

chrisjrn commented Jan 27, 2022

Eric-Arellano commented Jan 27, 2022

chrisjrn commented Jan 27, 2022

jsirois commented Jan 27, 2022

Eric-Arellano commented Jan 27, 2022

jsirois commented Jan 27, 2022

Eric-Arellano commented Jan 27, 2022

jsirois commented Jan 27, 2022

Eric-Arellano commented Mar 8, 2022

thejcannon commented Mar 9, 2022

thejcannon commented Mar 9, 2022

thejcannon commented Mar 10, 2022

cognifloyd commented Mar 13, 2023

Proposal: centralize Pants lockfile validation metadata in a single file #14281

Proposal: centralize Pants lockfile validation metadata in a single file #14281

Comments

Eric-Arellano commented Jan 27, 2022 • edited Loading

Status quo

Proposal

chrisjrn commented Jan 27, 2022 • edited Loading

jsirois commented Jan 27, 2022

chrisjrn commented Jan 27, 2022

Eric-Arellano commented Jan 27, 2022

chrisjrn commented Jan 27, 2022

jsirois commented Jan 27, 2022

Eric-Arellano commented Jan 27, 2022

jsirois commented Jan 27, 2022

Eric-Arellano commented Jan 27, 2022

jsirois commented Jan 27, 2022

Eric-Arellano commented Mar 8, 2022

thejcannon commented Mar 9, 2022

thejcannon commented Mar 9, 2022

thejcannon commented Mar 10, 2022

cognifloyd commented Mar 13, 2023

Eric-Arellano commented Jan 27, 2022 •

edited

Loading

chrisjrn commented Jan 27, 2022 •

edited

Loading