Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implements content addressing inside copyPromise #4586

Merged
merged 3 commits into from
Jul 18, 2022

Conversation

arcanis
Copy link
Member

@arcanis arcanis commented Jun 29, 2022

This PR is connected to #4533

What's the problem this PR addresses?

We want to support content-addressable cache when running installs. The implementation in #4533 is a great start, but implements this support inside a copyPromise function unique to the nm linker - subsequently importing it into the pnpm linker so that the two share the same code.

While avoiding code duplication is a good thing generally speaking, I strongly feel like we should leverage tools from the core (improving them if needed) rather than implement special-cases that we then reuse in multiple places. It doesn't give us as much freedom to implement behaviours tailored to specific use cases, but this cost is imo offset by the easier maintenance now that we have a clearly delimited core to rely on.

Note that this only covers the pnpm linker - the nm linker is a little special in that @larixer has definitely more experience there so I trust him to implement and maintain his own versions of our tools if really needed.

How did you fix it?

The main difference compared to #4533 is that I check whether the index is busted by just checking its mtime, rather than computing its checksum.

I have two further optimizations in mind, although they'll have to be implemented separately:

  • I'm thinking we could use the Zip extra fields to precompute sha512 checksums for all files inside zip archives; this might then speedup the checksumFilePromise calls, which wouldn't need to fetch the file contents anymore.

  • The copyFile implementation currently does a write(dest, read(src)) instead of a clone(dest, src). We should allow FakeFS implementation to know whether they are compatible for fast cloning (at minima by keeping track of whether the deepest underlying baseFs is xfs).

Performance-wise, using the Yarn repository as test subject, on my fairly slow machine:

This PR:

Install pnpm w/ cold index    2m 39s
Install pnpm w/ hot index     52s
Install pnpm w/o index        40s

PR #4533:

Install pnpm w/ cold index    1m 16s
Install pnpm w/ hot index     36s

The lockPromise is probably where a lot of perfs are lost - I may try to replace it with renamePromise instead (I initially used lockPromise because I'd like to preserve the inode when repairing a modified file, so that the fix affects all packages on the machine).

Checklist

  • I have set the packages that need to be released for my changes to be effective.
  • I will check that all automated PR checks pass before the PR gets reviewed.

@arcanis arcanis marked this pull request as ready for review June 30, 2022 09:20
@arcanis arcanis requested a review from larixer July 12, 2022 16:07
@arcanis
Copy link
Member Author

arcanis commented Jul 12, 2022

I'm thinking about landing it soon, to let people experiment with it in RCs; I think it's fine if reviews are done post-merge if you don't have the bandwidth right now, I'll do a follow-up if necessary.

Comment on lines +275 to +292
async checksumFilePromise(path: P, {algorithm = `sha512`}: {algorithm?: string} = {}) {
const fd = await this.openPromise(path, `r`);

try {
const CHUNK_SIZE = 65536;
const chunk = Buffer.allocUnsafeSlow(CHUNK_SIZE);

const hash = createHash(algorithm);

let bytesRead = 0;
while ((bytesRead = await this.readPromise(fd, chunk, 0, CHUNK_SIZE)) !== 0)
hash.update(bytesRead === CHUNK_SIZE ? chunk : chunk.slice(0, bytesRead));

return hash.digest(`hex`);
} finally {
await this.closePromise(fd);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this duplicated in here instead of using the one in hashUtils?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because otherwise fslib would depend on the core, which isn't necessary. The best would be to change the core to use this new FakeFS function rather than the hashUtils one, but I wanted to avoid touching the core in this PR 🤔

Comment on lines +85 to +87
this.indexFolderPromise = setupCopyIndex(xfs, {
indexPath: ppath.join(opts.project.configuration.get(`globalFolder`), `index` as Filename),
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like that the constructor has side-effects, can't this happen later when it's actually needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it early to avoid having to either do IO calls for each package, or having to keep a boolean state (which I feel like would add complexity to avoid a situation - a pnpm install on an empty project without any other install - that doesn't happen much, and doesn't cost much even if it happens).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather have that boolean state, this will do 257 mkdir calls every time PnpmInstaller is constructed, regardless of which linker is in use and if the directories already exists.

Copy link
Member

@larixer larixer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general looks good to me. A couple of observations that might be optimised in the future. They both seem to eat around 30-35% of the overall performance.

  1. lockPromise - it is generally a complex operation which creates the file by opening it in exclusive mode and then removes the file and it is done for EACH copied file. I think it eats majority of these 30%. I It can be optimised at least by doing lockPromise once per package instead or by switching to atomic IO operations or to daemon server as pnpm does.
  2. Computing hashes per each file. I think this one doesn't actually contribute much, but can be optimised by storing precomputed hashes in the storage instead and addressing the storage via hash(package_contents | file_path_inside_package).

@arcanis
Copy link
Member Author

arcanis commented Jul 18, 2022

It can be optimised at least by doing lockPromise once per package instead or by switching to atomic IO operations or to daemon server as pnpm does.

Locking by package wouldn't work, because multiple packages may contain the same file content (and thus point to the same index file) - I want to improve this part, I agree it's a low-hanging fruit, but I'll probably address that as a follow-up.

@arcanis arcanis merged commit 3d710ef into master Jul 18, 2022
@arcanis arcanis deleted the mael/copypromise-link-strategy-index branch July 18, 2022 15:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants