Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow plugins to efficiently reuse Gazelle's filesystem walk #1737

Closed
wants to merge 1 commit into from

Conversation

dzbarsky
Copy link
Contributor

@dzbarsky dzbarsky commented Feb 17, 2024

What type of PR is this?
Performance improvement

What package or component does this PR mostly affect?
Plugins

What does this PR do? Why is it needed?
Our gazelle plugin (for typescript) builds an in-memory trie of the filesystem. This is needed because a given import string can resolve to multiply physical files, depending what is on-disk. (For example foo/bar might resolve to foo/bar.ts, foo/bar/index.js, 'foo/bar.js, 'node_modules/foo/bar.js, etc.)

We build this trie during the Configure phase so that we are guaranteed it's complete before Resolve. However, it's wasteful to issue os.Readdir in our plugin when Gazelle has just done the same work. This patch is around 15% improvement in total Gazelle runtime for us. (4.2s -> 3.5s)

Other notes for review
Adding an optional interface like this was the most backwards-compatible way to make the change. Another option would be to add the directory contents to config.Config. That would keep the API surface area cleaner and mean plugins don't need to invoke a new API to benefit. However, that would require deciding what to do about Config.Clone(), and also might have some tricky interactions with potential future efforts to parallelize walk.Walk. But I'm happy to rework the PR to do that if yall prefer!

@fmeum
Copy link
Member

fmeum commented Feb 17, 2024

This is pretty lightweight and generic, which I like.

@linzhp @tyler-french What are your thoughts?

@linzhp
Copy link
Contributor

linzhp commented Feb 21, 2024

I don't think Configure() is the right place to build a trie. By design, Configure() is supposed to read flags and directives to control the behavior of Gazelle. Generate() is a better place to build data structures like trie for resolver, and it also has full access to all files in the directory.

@pyrocat101
Copy link

The problem with GenerateRules is that is is post-order, whereas Configure is pre-order. Building a trie in pre-order simplifies some uses cases. For example, a plugin that wants to be git-aware can collect .gitignore and correctly apply ignore rules in a single pass.

@linzhp
Copy link
Contributor

linzhp commented Feb 21, 2024

.gitignore is a configuration file that affect the subtree, which is similar to gazelle:exclude. Can you read the .gitignore and save it in the plugin's config object, so that you can build the trie in any order?

@dzbarsky
Copy link
Contributor Author

@pyrocat101 can speak to whether that would be correct, but I want to point out that such an approach would eat into some (most?) of the performance gains. Most directories don't have a .gitignore, but we wouldn't know that until we attempt to open the file in Configure.

If your concern is the semantic impurity, another option would be to add an extra (optional) pass between Configure and Resolve to feed the file list to the plugin, though that would be a bit more invasive change.

@pyrocat101
Copy link

.gitignore is a configuration file that affect the subtree, which is similar to gazelle:exclude. Can you read the .gitignore and save it in the plugin's config object, so that you can build the trie in any order?

But reading .gitignore requires stat syscalls on every folder during Configure, despite that walk.Walk already has the information.

To elaborate on the usage, the trie tree is used not only in Resolve, but also in GenerateRules. Our plugin generates rules that collects all source files in the currently visited Bazel package. Instead of using the expensive filepath.Walk to do so, we query that trie tree to avoid redundant syscalls. For trie tree to only contain entries that are not ignored by git, the trie is built in pre-order so each file can match against the .gitignore ancestry chain.

@achew22
Copy link
Member

achew22 commented Mar 9, 2024

This might be a good time for us to diverge from the classic plugin interface. There are a couple of tricky to deal with issues with the current interface. Most notably, no functions take a context, and arguments can not be added in a backwards compatible way in case of a deviation.

@linzhp and I have chatted about this in the past and I think we're reasonably aligned on it but I wanna jot down a couple of thoughts I've had.

As high level design objectives, we want:

  • Something much more forward compatible than the current interface
  • Respectful of contexts for managing lifecycle in plugins
  • As a nice bonus it'd be nice to be able to write a plugin in another language

Given that gRPC seems to be the bees knees these days and it meets the design criteria. Obviously there is no need to actually have it RPC for existing plugins, and a pretty straightforward plugin to write would be the one that conforms to the V2 interface, but it translates the calls into the old interface. That should mean no pain for any preexisting plugins.

You also get the ability to generate a gRPC client and use something like go-plugin to allow binaries that are external to your own to join in on the "fun". It's pretty straightforward to do in any language. Here is an example in python of a plugin that is launched by the wrapper process and gRPC communicated.

@dzbarsky
Copy link
Contributor Author

I think you can get most of what you want by just making the plugin functions take (ctx context.Context, params Params) arguments, so you can extend the structs in the future.

Writing plugins in other language is nice but it's not a serious problem, there are plenty of examples (both in open-source and internal to serious bazel users) of the Go plugin spawning a server written in another language if that makes the dependency analysis easier. Protobuf over stdio or UDS works pretty well. gRPC has some gotchas in the "less supported" languages (including Python. When we tried to use it for anything serious we hit so many issues that we ended up writing our own bindings to the C core)

Anyway, we ended up forking Gazelle to do what we describe above, it's a pretty self-contained solution so we're not too worried about maintaining it. If you're not interested in pursuing a more optimal solution, feel free to close out this issue.

@linzhp
Copy link
Contributor

linzhp commented Mar 23, 2024

Having a Configure2 Interface that takes (ctx context.Context, params Params) sounds good to me

jbedard added a commit to aspect-build/aspect-cli that referenced this pull request Nov 7, 2024
We should not be using `os.Stat` to check if files exist. It is very
expensive and gazelle has already read the full list of entries for the
directory.

The gazelle patch exposes that list. This is similar to
bazel-contrib/bazel-gazelle#1737 which was done
at airtable for similar reasons.

On one major repo this reduces the time by almost 2s which is ~15% and
well worth the patch imo.

---

### Changes are visible to end-users: no

- Searched for relevant documentation and updated as needed: yes
- Breaking change (forces users to change their own code or config): no
- Suggested release notes appear below: yes

Reduce fs io of `aspect configure` during fs traversal.

### Test plan

- Covered by existing test cases

GitOrigin-RevId: b17d52b2723d3b7e4d5ee33fc294f491062ded5a
jbedard added a commit to aspect-build/aspect-cli that referenced this pull request Nov 7, 2024
We should not be using `os.Stat` to check if files exist. It is very
expensive and gazelle has already read the full list of entries for the
directory.

The gazelle patch exposes that list. This is similar to
bazel-contrib/bazel-gazelle#1737 which was done
at airtable for similar reasons.

On one major repo this reduces the time by almost 2s which is ~15% and
well worth the patch imo.

---

### Changes are visible to end-users: no

- Searched for relevant documentation and updated as needed: yes
- Breaking change (forces users to change their own code or config): no
- Suggested release notes appear below: yes

Reduce fs io of `aspect configure` during fs traversal.

### Test plan

- Covered by existing test cases

GitOrigin-RevId: b17d52b2723d3b7e4d5ee33fc294f491062ded5a
jbedard added a commit to aspect-build/aspect-cli that referenced this pull request Nov 7, 2024
We should not be using `os.Stat` to check if files exist. It is very
expensive and gazelle has already read the full list of entries for the
directory.

The gazelle patch exposes that list. This is similar to
bazel-contrib/bazel-gazelle#1737 which was done
at airtable for similar reasons.

On one major repo this reduces the time by almost 2s which is ~15% and
well worth the patch imo.

---

### Changes are visible to end-users: no

- Searched for relevant documentation and updated as needed: yes
- Breaking change (forces users to change their own code or config): no
- Suggested release notes appear below: yes

Reduce fs io of `aspect configure` during fs traversal.

### Test plan

- Covered by existing test cases

GitOrigin-RevId: b17d52b2723d3b7e4d5ee33fc294f491062ded5a
@dzbarsky dzbarsky closed this Dec 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants