Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make IgnoreFile more efficient #2086

Closed
wants to merge 16 commits into from
Closed

Make IgnoreFile more efficient #2086

wants to merge 16 commits into from

Conversation

Smaug123
Copy link
Contributor

@Smaug123 Smaug123 commented Feb 11, 2022

(I'm locally developing on aarch64-darwin net6.0, which means my local environment is considerably different from the net5.0 that Paket etc are set up for; and there is at least one bug in Fake on my architecture. So there may be some iterating here while I satisfy the CI checks.)

The purpose of this PR is to dramatically improve the perf of the IgnoreFile checks.

Before: on every file, we traversed the filesystem, potentially up to the filesystem root, then read in the first .fantomasignore file we found.

After: we touch the filesystem only if we can't already know the answer to the question "is there a .fantomasignore file here?"; and we only ever parse a given .fantomasignore file once (modulo multithreading race conditions).

Even more efficient for the -r mode of Fantomas (which I know is deprecated, but I really like it :( ) would be if we could somehow parse the .fantomasignore files up front, discovering all the skipped files, and then simply not touch them. But I've tried to keep this PR very tightly contained.

I've also fixed a potential bug where the alternative directory separator character was being ignored in isIgnoredFile; an alternative fix would be to use fullPath instead of file, which I think would normalise slashes, but this way is more obviously correct.

Question: is this problematic for the daemon? A long-running Fantomas process might need to invalidate this currently-permanent cache at some point; otherwise the user could edit the .fantomasignore files but we'll never reread them.

@Smaug123
Copy link
Contributor Author

Smaug123 commented Feb 11, 2022

[EDIT: ignore, I found a Windows machine!]
@nojaf Would you mind please regenerating the paket.lock file and pushing it up on this branch? I simply can't do it without either doing the work to package the .NET 5 SDK for Nix on Apple Silicon, or installing the .NET SDK globally without Nix, or hoisting the entire repo into .NET 6 (which I am not sure I can do without diagnosing and fixing at least one bug in Fake), or moving to a different OS/architecture.

@Smaug123 Smaug123 requested a review from nojaf February 11, 2022 21:47
Copy link
Contributor

@nojaf nojaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Patrick, thank you for this PR!
I'm ok with improving the ignore file, as this isn't one of the strongest areas of the tool.

That being said, it is questionable whether this should be fixed at all.
Fantomas has support for multiple paths so any filtering of files could be solved outside of the tool. The thought has crossed my mind to remove the .fantomasignore file in the next version.
One reason to do this is just to get rid of another dependency (our current solution isn't without hiccups, for example, markashleybell/MAB.DotIgnore#9).

I haven't made up my mind though, there is an ease of use factor of course in having our current fantomasignore setup. So, what I'm trying to get at here, is the question: 'Is your actual problem solvable outside of Fantomas'?

Lastly, I would ask you to use your own fork of Fantomas when raising PRs. No real reason, other than having the same experience for other contributors. This isn't listed in the contribution guidelines as, well, you are one of the few people that has permission to do this.

Let me know what you think.

/// Store of the IgnoreFiles present in each folder discovered so far.
/// This is to save repeatedly hitting the disk for each file, and to save
/// loading the IgnoreLists from the disk repeatedly (which is nontrivially expensive!).
let private ignoreFiles: ConcurrentDictionary<string, IgnoreList option> =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we could not work with a mailbox processor instead?

/// loading the IgnoreLists from the disk repeatedly (which is nontrivially expensive!).
let private ignoreFiles: ConcurrentDictionary<string, IgnoreList option> =
ConcurrentDictionary()

let isIgnoredFile (file: string) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The daemon is using this file function so I guess any changes to the content of the ignore will not be picked up.
Perhaps we could add a timestamp of when this was parsed to the cache and re-parse when the file changed.
Thinking out loud here.

@Smaug123
Copy link
Contributor Author

Smaug123 commented Feb 13, 2022

Noted, thanks - I don't think GitHub lets me change the source branch of a PR, but future ones will come from my fork.

The problem is solvable outside Fantomas, but it's not all that easy. Specifically, GR's internal CI pipeline definitions make it a mighty faff to work out what files have changed in a given pull request or even a given commit, so I gave up on the idea "let's have Fantomas run on precisely the set of changed files". (This also makes a pre-push Git hook much less easy, since now you need to know somehow what merge base to use for comparison; most of the time it'll just be master and you can git diff --name-only HEAD origin/master to find out what files to format, but that doesn't work if the PR is into a branch other than master.) So my fallback idea was simply to keep a .fantomasignore in the root of the repository, and incrementally remove lines from the .fantomasignore once we were happy to start autoformatting a folder or file.

I think my intended uses of .fantomasignore are all either excluding entire directories or single files, so I certainly don't need the full globbing power of MAB.DotIgnore and I could roll my own. However, the boilerplate that's going to be required to roll our own does feel a bit sad.

Note that the converse strategy is very easy for rolling our own: keeping a list of files and directories that Fantomas should format. However, that approach means new files would be unformatted by default, which I really don't want to encourage!

@nojaf
Copy link
Contributor

nojaf commented Feb 13, 2022

Hey, thanks for elaborating. Sorry if I appear a bit too strong in my initial response.
I never had the strongest intentions of dropping .fantomasignore, I was merely curious if the problem could be tackled outside Fantomas. Perhaps you were unaware of Fantomas accepting multiple paths. This still is a lesser-known gimmick.

We can pursue this PR if the daemon usage is not impacted by it. That is my biggest concern at the moment.

@Smaug123
Copy link
Contributor Author

It would be easy enough to adjust so that e.g. the cache was invalidated on a 5sec or 10sec timer or something, although I don't know how I'd go about making it automatically pick up changes to the .fantomasignore files immediately. All I really care about is that we don't go checking the same files hundreds of times in quick succession; a fairly short invalidation policy would be fine for me.

By the way, I can't reproduce the CI failure. On the assumption that I've got my concurrent code wrong, I'll take your advice and switch to a mailbox processor.

@Smaug123
Copy link
Contributor Author

I guess the big question, if we go ahead with this PR, is what the cache invalidation policy should be. I'm inclined to say something simple like "invalidate the entire cache every five seconds", though the structure I've got here would be easily adapted to invalidating only specific files.

@Smaug123
Copy link
Contributor Author

I put up an example cache invalidation policy in bb01625.

let ignoreFileInvalidation =
let rec go () =
async {
do! Async.Sleep(TimeSpan.FromSeconds 5.0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need a time-based mechanism to clear the cache? This rubs me the wrong a bit.
Could we not parse the ignore file on demand and just keep track of the location of the ignore file.
Or we parse the ignore file and take the file's last modified timestamp into account.
We re-evaluate this timestamp and if the file is newer, we invalidate the cache.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not in general possible to keep track of the location of the ignore-file, because someone might put another ignore-file nearer to the file under consideration. If we even might invalidate the cache on every request, then we must traverse the entire filesystem up to the cached ignore-file on every request (unless the OS can call us when a relevant file has changed).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which is not per se a problem, but filesystem access is really quite slow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That scenario of someone adding a new .fantomasignore file somewhere closer feels like a quite theoretical one.
It is highly unlikely to begin with that there is a .fantomasignore file somewhere else than the repository root.
This is an assumption but I would really like to go with it. When we first introduce the concept of the ignore file, we only checked the working directory of the cli tool.
Because that didn't work for the daemon, we started traversing parent directories until we potentially found something.
Maybe that was a bad idea and we should split up the mechanism, only traverse for the daemon (and having no cache at all) and look in the pwd for the regular cli tool usage.

@Smaug123
Copy link
Contributor Author

Smaug123 commented Feb 14, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants