Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: mention zero-installs repo size #4839

Merged
merged 3 commits into from
Sep 15, 2022
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 20 additions & 0 deletions packages/gatsby/content/features/zero-installs.md
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,26 @@ Yes, very much. To give you an idea, a `node_modules` folder of 135k uncompresse

Another huge difference is the number of changes. Back in Yarn 1, when updating a package, a huge amount of files had to be recreated, or even simply moved. When the same happens in a Yarn 2 install, you get a very predictable result: exactly one changed file for each added/removed package. This in turn has beneficial side effects in terms of performance and security, since you can easily spot the invalid checksums on a per-package basis.

### What does this do to my repository size?

Every time you update a dependency and commit it, the repository will grow, the git history now containing both the new and old versions. This may lead to larger repository sizes, which makes it slower to clone a repository. Still, thanks to Git being fairly efficient at cloning large repositories, and clones being called far less than installs, the tradeoff isn't as damaging as it may seem. In case you still want to mitigate any potential problems, some mitigations are possible:

- [Partial clones](https://docs.gitlab.com/ee/topics/git/partial_clone.html) let you define a set of files that Git will only fetch when required. For instance, by running `git clone --filter=blob:limit=2m`, no outdated file larger than 2MB will be downloaded.

- Git will lazily download the missing files as needed. For example, if you run `git checkout` on an old commit, Git will fetch whatever files are needed before the command returns, so you won't see any difference.

- **This feature is supported by both GitHub and Gitlab, and is probably the best option at your disposal** if you can modify the way `git clone` is performed. Note however that the `actions/checkout` GitHub Action doesn't allow it yet (a PR is open [here](https://github.com/actions/checkout/pull/680)).

- [Sparse checkouts](https://github.blog/2020-01-17-bring-your-monorepo-down-to-size-with-sparse-checkout/) are the older cousin of partial clones. Instead of retrieving the whole Git history but putting aside the binary data, sparse checkouts instead define a cutoff commit which Git will treat as having no parents.

- Because sparse checkouts only let you see a slice of the history, some commands like `git log` or `git blame` may return truncated results.

- Branches can be [filtered](https://stackoverflow.com/questions/10067848/remove-folder-and-its-contents-from-git-githubs-history) every couple of years, to remove the cache folder from the history. In case you have audit requirements, you can fork your own repository to a secondary one to make sure the old commits are archived.

- [Git LFS](https://git-lfs.github.com/) is an extension allowing Git to store large files as small pointers, which Git will then dynamically convert into the real file content during clones by downloading them from an external server.

- While GitLab offers Git LFS storage for free, GitHub is limited to 1GB of storage and bandwidth.

### Does it have security implications?

Note that, by design, this setup requires that you trust people modifying your repository. In particular, projects accepting PRs from external users will have to be careful that the PRs affecting the package archives are legit (since it would otherwise be possible to a malicious user to send a PR for a new dependency after having altered its archive content). The best way to do this is to add a CI step (for untrusted PRs only) that uses the `--check-cache` flag:
Expand Down