go_repository caching #394

Closed
Globegitter opened this issue Dec 6, 2018 · 8 comments

@Globegitter
Contributor

I know Bazel only utilises the repository_cache right now for http_archive, as well as when a rule utilises repository_ctx.download_and_extract or repository_ctx.download and a SHA-256 is provided. Further, there has also been some progress on caching git_repository: bazelbuild/bazel#5928.

I also know that when using go_repository I can add

sha256 = "...",
strip_prefix = "...",
urls = ["...tar.gz"],

and the repository_cache will be utilised via one of the above-mentioned functions. But what happens right now if I do not specify a tar file to be downloaded but just want to use, e.g., git directly? I have seen that a fetch_repo binary is used rather than the 'native' git_repository or the methods that rule uses internally. What is the reason for this, and does it provide any way to cache the fetched repo to a specified location?
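For illustration, the full archive form of a go_repository rule looks something like this (the name, importpath, URL, and strip_prefix below are made-up placeholders, not from a real project):

go_repository(
    name = "com_example_mylib",
    importpath = "example.com/mylib",
    # A stable archive plus its hash lets Bazel serve this from the repository_cache.
    urls = ["https://example.com/mylib/archive/v1.2.3.tar.gz"],
    sha256 = "<sha256 of the archive>",
    strip_prefix = "mylib-1.2.3",
)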

As a bit of background: I actually do not mind adding the SHA and git URLs manually once, but the problem is that this does not play nicely with bazel run //:gazelle -- update-repos ... If I run that on a go_repository where I have added these attributes, it just removes them and replaces them with the latest commit, which leads to a not-so-nice developer experience for anyone who wants to utilise this tool. Maybe there is also some way to tell the update-repos command to prefer the archives? I suppose it should not even be too difficult to add, right? (Happy to give it a try.)

@jayconrod
Contributor

There's not really a good way to implement go_repository caching right now.

Bazel's download_and_extract is the only repository context API that provides caching, and that only works if a SHA-256 sum is provided. In most cases, there isn't any known stable SHA-256. I don't believe any of the .zip or .tar.gz archives prepared by GitHub are guaranteed to be stable. They are prepared on demand, and the file order / compression / alignment can vary depending on server software. This has broken us in the past, which is why rules_go and Gazelle have archives and SHA-256 sums attached to each release.
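To make that concrete, here is a minimal sketch of a repository rule (not Gazelle's actual implementation) showing the one path that does hit the repository cache, a download_and_extract call with a sha256 argument:

def _cached_archive_impl(ctx):
    # With sha256 set, Bazel can serve this fetch from the repository cache;
    # without it, the archive is re-downloaded on every clean fetch.
    ctx.download_and_extract(
        url = ctx.attr.urls,
        sha256 = ctx.attr.sha256,
        stripPrefix = ctx.attr.strip_prefix,
    )

cached_archive = repository_rule(
    implementation = _cached_archive_impl,
    attrs = {
        "urls": attr.string_list(),
        "sha256": attr.string(),
        "strip_prefix": attr.string(),
    },
)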

About git_repository: my understanding was that the native version kept one big internal git repository and fetched new commits from every dependency into it. That worked well for caching, but the native git_repository is being removed. The Starlark version of git_repository doesn't have access to this cache, and neither do we.

I filed a feature request some time ago for a general cache that we could use for repository rules (bazelbuild/bazel#5086). If that gets implemented, we could do something like this.

In the future, I expect go_repository will mostly be fetching code from Go module proxies via HTTPS. Ideally, we won't have to talk to git repositories at all anymore during a build. However, the Go toolchain hashes the content of files inside module zip files (these are the hashes in go.sum files), not the zip file itself, which is what Bazel hashes. So we're stuck again :(.

@Globegitter
Contributor Author

Thanks for the comment and the background information. We have found tar.gz hashes to be stable for us so far, but maybe that will start hitting us at some point. What @aehlig proposed as a first solution for caching in the Starlark git_repository rule is to pass in an env variable, e.g. BAZEL_GIT_REPOSITORY_CACHE, and have the rule use that path as a cache for all git dependencies via worktrees, which sounds very much like what you describe existed for the native version. You can find the code for that here: https://github.com/siedentop/bazel/tree/feature/git_repo_rule_cache

I wonder if we could adopt that same approach for the fetch_repo binary until there is a native API: it would just check whether a BAZEL_GO_REPOSITORY_CACHE env var is set and, if so, use that path as its cache location (rough sketch below). What do you think about that?
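Roughly, I am imagining something like this inside the go_repository implementation. Everything here is hypothetical: the BAZEL_GO_REPOSITORY_CACHE variable, the -cache flag, and the exact fetch_repo flags are illustrations, not existing APIs:

def _go_repository_impl(ctx):
    # Assumes a private label attribute pointing at the fetch_repo binary.
    fetch_repo = ctx.path(ctx.attr._fetch_repo)
    args = [fetch_repo, "-dest", str(ctx.path("")), "-remote", ctx.attr.remote]
    cache = ctx.os.environ.get("BAZEL_GO_REPOSITORY_CACHE")
    if cache:
        args += ["-cache", cache]  # hypothetical flag; fetch_repo has no such flag today
    result = ctx.execute(args)
    if result.return_code != 0:
        fail("fetch_repo failed: " + result.stderr)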

@jayconrod
Contributor

I'd love to have a general API for repository rule caching. BAZEL_GIT_REPOSITORY_CACHE will save some time in git_repository, but having a generic cache where we could store files would be really useful.

Aside from that though, I'd rather fetch_repo not have its own cache. In general, repository rules should not have side effects outside of the repository's checkout.

One possible exception though: Go has a module cache. So for go_repository rules that fetch modules, we might want to use go get to fetch a module, then pull zip files out of the module cache. We'd want to avoid creating a hard dependency on the host system's Go toolchain though.
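A rough sketch of that idea, assuming a go binary is available to the rule (ideally Bazel's own Go SDK rather than the host toolchain) and relying on the standard module cache layout under GOPATH; note that module paths containing uppercase letters would also need the cache's '!' escaping, which is skipped here:

def _fetch_module(ctx, importpath, version):
    gopath = str(ctx.path("gopath"))
    # Populate a private module cache, then pull the zip out of it.
    result = ctx.execute(
        ["go", "mod", "download", importpath + "@" + version],
        environment = {"GOPATH": gopath, "GO111MODULE": "on"},
    )
    if result.return_code != 0:
        fail("go mod download failed: " + result.stderr)
    zip_path = "%s/pkg/mod/cache/download/%s/@v/%s.zip" % (gopath, importpath, version)
    ctx.extract(zip_path)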

@Globegitter
Contributor Author

Fair enough about not wanting to add a specific cache. Regarding the Go module cache: could we not just use the Go SDK / CLI provided by rules_go to run a 'hermetically provided' go get command rather than relying on the host system's toolchain?

rules_nodejs is doing something very similar with npm packages, by providing yarn hermetically and relying on its global yarn cache.

@jayconrod
Contributor

That would work, but I think it would be pretty surprising for people who don't have a Go SDK installed to end up with a populated module cache after running a Bazel command. If we did that, I think users would need to opt in somehow. Setting GOPATH would probably be enough (the module cache is at $GOPATH/pkg/mod).

@Globegitter
Contributor Author

Yep, agreed that this should only happen when GOPATH is set. Not sure if I have any bandwidth to look at this before the end of the year, but if I do I will give it a try.

@blico
Contributor

blico commented May 16, 2019

My understanding is that with #400 we have a module cache (@bazel_gazelle_go_repository_cache//pkg/mod) shared by go_repository rules, and the rules use Bazel's Go SDK.

This means that right now, calls a user makes with their locally installed Go SDK will not hit go_repository's module cache, but will instead use a separate module cache at $GOPATH/pkg/mod.

What can we do to share/combine go_repository's module cache with the user's local module cache? Also, what can we do to ensure that all uses of Bazel's Go SDK use the go_repository module cache?

@jayconrod
Contributor

> What can we do to share/combine go_repository's module cache with the user's local module cache?

I wanted go_repository's build and module caches to be separate from the user's cache to avoid any dependence on the system configuration. This makes reproducible builds easier, and it avoids the need for a Go toolchain to be installed.

I'm not really opposed to sharing the user's module cache, but it should be opt-in, probably through an environment variable, something like GO_REPOSITORY_USE_HOST_CACHE=1. That would look at go env GOPATH and go env GOCACHE to find the locations for these.
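Sketching that opt-in (GO_REPOSITORY_USE_HOST_CACHE is just the proposed name from above; nothing like this exists in Gazelle yet):

def _host_cache_env(ctx):
    # Only reuse the host's caches when the user explicitly asks for it.
    if ctx.os.environ.get("GO_REPOSITORY_USE_HOST_CACHE") != "1":
        return None
    gopath = ctx.execute(["go", "env", "GOPATH"]).stdout.strip()
    gocache = ctx.execute(["go", "env", "GOCACHE"]).stdout.strip()
    return {"GOPATH": gopath, "GOCACHE": gocache}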

> Also, what can we do to ensure that all uses of Bazel's Go SDK use the go_repository module cache?

go_repository in module mode is the only rule that makes use of the module cache, since it's the only rule that downloads anything. So no change needed, I think? Just make sure every go_repository rule is in module mode (version and sum are set; urls, commit, tag, etc. are not). For example:
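(The sum value below is a placeholder; the real one comes from your go.sum file.)

go_repository(
    name = "org_golang_x_text",
    importpath = "golang.org/x/text",
    version = "v0.3.0",
    sum = "h1:...",  # copy the real h1: hash from go.sum
)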
