Implements content addressing inside copyPromise #4586

arcanis · 2022-06-29T12:14:48Z

This PR is connected to #4533

What's the problem this PR addresses?

We want to support content-addressable cache when running installs. The implementation in #4533 is a great start, but implements this support inside a copyPromise function unique to the nm linker - subsequently importing it into the pnpm linker so that the two share the same code.

While avoiding code duplication is a good thing generally speaking, I strongly feel like we should leverage tools from the core (improving them if needed) rather than implement special-cases that we then reuse in multiple places. It doesn't give us as much freedom to implement behaviours tailored to specific use cases, but this cost is imo offset by the easier maintenance now that we have a clearly delimited core to rely on.

Note that this only covers the pnpm linker - the nm linker is a little special in that @larixer has definitely more experience there so I trust him to implement and maintain his own versions of our tools if really needed.

How did you fix it?

The main difference compared to #4533 is that I check whether the index is busted by just checking its mtime, rather than computing its checksum.

I have two further optimizations in mind, although they'll have to be implemented separately:

I'm thinking we could use the Zip extra fields to precompute sha512 checksums for all files inside zip archives; this might then speedup the checksumFilePromise calls, which wouldn't need to fetch the file contents anymore.
The copyFile implementation currently does a write(dest, read(src)) instead of a clone(dest, src). We should allow FakeFS implementation to know whether they are compatible for fast cloning (at minima by keeping track of whether the deepest underlying baseFs is xfs).

Performance-wise, using the Yarn repository as test subject, on my fairly slow machine:

This PR:

Install pnpm w/ cold index    2m 39s
Install pnpm w/ hot index     52s
Install pnpm w/o index        40s

PR #4533:

Install pnpm w/ cold index    1m 16s
Install pnpm w/ hot index     36s

The lockPromise is probably where a lot of perfs are lost - I may try to replace it with renamePromise instead (I initially used lockPromise because I'd like to preserve the inode when repairing a modified file, so that the fix affects all packages on the machine).

Checklist

I have read the Contributing Guide.

I have set the packages that need to be released for my changes to be effective.

I will check that all automated PR checks pass before the PR gets reviewed.

arcanis · 2022-07-12T16:08:50Z

I'm thinking about landing it soon, to let people experiment with it in RCs; I think it's fine if reviews are done post-merge if you don't have the bandwidth right now, I'll do a follow-up if necessary.

merceyz · 2022-07-12T17:32:09Z

packages/yarnpkg-fslib/sources/FakeFS.ts

+  async checksumFilePromise(path: P, {algorithm = `sha512`}: {algorithm?: string} = {}) {
+    const fd = await this.openPromise(path, `r`);
+
+    try {
+      const CHUNK_SIZE = 65536;
+      const chunk = Buffer.allocUnsafeSlow(CHUNK_SIZE);
+
+      const hash = createHash(algorithm);
+
+      let bytesRead = 0;
+      while ((bytesRead = await this.readPromise(fd, chunk, 0, CHUNK_SIZE)) !== 0)
+        hash.update(bytesRead === CHUNK_SIZE ? chunk : chunk.slice(0, bytesRead));
+
+      return hash.digest(`hex`);
+    } finally {
+      await this.closePromise(fd);
+    }
+  }


Why is this duplicated in here instead of using the one in hashUtils?

Because otherwise fslib would depend on the core, which isn't necessary. The best would be to change the core to use this new FakeFS function rather than the hashUtils one, but I wanted to avoid touching the core in this PR 🤔

merceyz · 2022-07-12T17:34:37Z

packages/plugin-pnpm/sources/PnpmLinker.ts

+    this.indexFolderPromise = setupCopyIndex(xfs, {
+      indexPath: ppath.join(opts.project.configuration.get(`globalFolder`), `index` as Filename),
+    });


I don't like that the constructor has side-effects, can't this happen later when it's actually needed?

I did it early to avoid having to either do IO calls for each package, or having to keep a boolean state (which I feel like would add complexity to avoid a situation - a pnpm install on an empty project without any other install - that doesn't happen much, and doesn't cost much even if it happens).

I would rather have that boolean state, this will do 257 mkdir calls every time PnpmInstaller is constructed, regardless of which linker is in use and if the directories already exists.

larixer

In general looks good to me. A couple of observations that might be optimised in the future. They both seem to eat around 30-35% of the overall performance.

lockPromise - it is generally a complex operation which creates the file by opening it in exclusive mode and then removes the file and it is done for EACH copied file. I think it eats majority of these 30%. I It can be optimised at least by doing lockPromise once per package instead or by switching to atomic IO operations or to daemon server as pnpm does.
Computing hashes per each file. I think this one doesn't actually contribute much, but can be optimised by storing precomputed hashes in the storage instead and addressing the storage via hash(package_contents | file_path_inside_package).

arcanis · 2022-07-18T15:00:39Z

It can be optimised at least by doing lockPromise once per package instead or by switching to atomic IO operations or to daemon server as pnpm does.

Locking by package wouldn't work, because multiple packages may contain the same file content (and thus point to the same index file) - I want to improve this part, I agree it's a low-hanging fruit, but I'll probably address that as a follow-up.

arcanis added 3 commits June 29, 2022 13:43

Implements content addressing inside copyPromise

4aa380e

Adds tests

23adcb5

Versions

ced5af0

arcanis marked this pull request as ready for review June 30, 2022 09:20

arcanis requested a review from larixer July 12, 2022 16:07

merceyz reviewed Jul 12, 2022

View reviewed changes

larixer approved these changes Jul 13, 2022

View reviewed changes

arcanis merged commit 3d710ef into master Jul 18, 2022

arcanis deleted the mael/copypromise-link-strategy-index branch July 18, 2022 15:00

larixer mentioned this pull request Oct 13, 2022

feat(plugin-pnpm): Content-addressable storage support #4533

Closed

3 tasks

7rulnik mentioned this pull request Mar 15, 2023

[Bug?]: Fail to install turbo when using nodeLinker: pnpm with v4 #5165

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implements content addressing inside copyPromise #4586

Implements content addressing inside copyPromise #4586

arcanis commented Jun 29, 2022 •

edited

Loading

arcanis commented Jul 12, 2022

merceyz Jul 12, 2022

arcanis Jul 12, 2022

merceyz Jul 12, 2022

arcanis Jul 12, 2022

merceyz Jul 20, 2022

larixer left a comment

arcanis commented Jul 18, 2022

Implements content addressing inside copyPromise #4586

Implements content addressing inside copyPromise #4586

Conversation

arcanis commented Jun 29, 2022 • edited Loading

arcanis commented Jul 12, 2022

merceyz Jul 12, 2022

Choose a reason for hiding this comment

arcanis Jul 12, 2022

Choose a reason for hiding this comment

merceyz Jul 12, 2022

Choose a reason for hiding this comment

arcanis Jul 12, 2022

Choose a reason for hiding this comment

merceyz Jul 20, 2022

Choose a reason for hiding this comment

larixer left a comment

Choose a reason for hiding this comment

arcanis commented Jul 18, 2022

arcanis commented Jun 29, 2022 •

edited

Loading