why our syntax highlighter is in bad shape / has a lot of "tech debt", and what we should do about it #21942
Comments
Hey @slimsag 👋 Thank you for so much detail and context here! 🙇 I had a question on the speed/scaling front: is there any caching of highlighted files that occurs? Thanks!
There is no caching in place, aside from of course whatever Cloudflare does with our regular GraphQL requests (I think nothing). It would be a mistake to think that adding caching will help substantially / resolve the issue, though. Cold cache states matter, especially so when we're talking about the vast number of files on a Sourcegraph instance. Importantly, most requests are quite reasonably fast (<500ms last I checked); it's those few odd-ball ones that are not, because e.g. the service is crashing and falling back to plaintext mode (we'd have the same issues with a cache, just in a different form). Not to say it couldn't help, I just do not think it would be the right place to start.
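For concreteness, "adding caching" here would mean something along the lines of the sketch below, keyed on the exact file contents plus the parameters that affect the output. This is a hypothetical sketch, not existing Sourcegraph code, and as noted above it only helps warm entries; it does nothing for cold caches or stuck requests.

```go
// Hypothetical sketch of an in-memory highlight cache (not existing Sourcegraph code).
package highlightcache

import (
	"crypto/sha256"
	"encoding/hex"
	"sync"
)

// Key identifies one highlighting result: the exact file contents plus the
// parameters that change the rendered output.
type Key struct {
	ContentHash string // sha256 of the file contents
	Filetype    string
	Theme       string
}

// NewKey derives a cache key from the file contents and highlight parameters.
func NewKey(content []byte, filetype, theme string) Key {
	sum := sha256.Sum256(content)
	return Key{ContentHash: hex.EncodeToString(sum[:]), Filetype: filetype, Theme: theme}
}

// Cache stores highlighted HTML keyed by Key. A real deployment would need
// eviction and size bounds; this only shows the shape of the idea.
type Cache struct {
	mu sync.RWMutex
	m  map[Key]string
}

func New() *Cache { return &Cache{m: make(map[Key]string)} }

func (c *Cache) Get(k Key) (html string, ok bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	html, ok = c.m[k]
	return html, ok
}

func (c *Cache) Put(k Key, html string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[k] = html
}
```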
I looked in Grafana but couldn't see any latency metrics; do you know where I'd be able to find those, @slimsag? Thanks for the fast follow-up.
I've seen a few people ask "Why use Syntect/Rust? What other options are there?" and I want to take the opportunity to address that here. First, you can read about the state of syntax highlighting as of late 2018 in my write-up here: https://news.ycombinator.com/item?id=17932872 For the most part, nothing has changed in the 3 years since I wrote that. I have kept up with syntax highlighting improvements since then, and to date none really surpass syntect in:
This includes the following:
If you want to compromise on some aspects, there are some avenues I consider interesting:
By far, Syntect is the absolute best value for engineering effort IMO.
@scalabilitysolved sure - here you go, from Sourcegraph.com (which, again, has more issues than most customer deployments): 99% of requests every hour complete in under ~5s on Sourcegraph.com.
[charts: hourly request duration in milliseconds, plus other important metrics]
I was looking into this a bit more closely with @olafurpg. I think there is something off with our usage of syntect in production. Two examples to showcase this (all numbers are for local runs on an M1 MacBook Pro):
The p90 and p99 numbers here are much better than the production numbers you've shown. Now, I understand that it's not an exact apples-to-apples comparison (M1 vs GCP virtual cores, no Docker vs with Docker, tree-sitter reading from memory while the synhtml example reads from disk (although that is probably in page cache)), but the large difference in latencies suggests that something else is going on here, apart from the syntax highlighting itself being slow. For example, one thing that seems off in the current code is that we create arbitrarily large JSON payloads in https://sourcegraph.com/github.com/sourcegraph/gosyntect/-/blob/gosyntect.go?L108; however, if the JSON payload is over 1MiB (this limit is configurable), Rocket will fail to parse it, so the round trip will have been wasted. We could instead skip the cost of marshaling to JSON + sending it + trying to unmarshal it. There are a few things we could do here, which don't have to be mutually exclusive.
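One option along those lines, sketched below, is a client-side guard that refuses to build a request body the server would reject anyway. The struct, helper, and constant names here are hypothetical (this is not the current gosyntect code); the 1 MiB figure mirrors the default Rocket limit mentioned above.

```go
// Hypothetical sketch of a client-side size guard (not the current gosyntect API).
package gosyntectsketch

import (
	"encoding/json"
	"errors"
	"fmt"
)

// maxPayloadBytes mirrors the server-side Rocket JSON limit mentioned above
// (1 MiB by default, configurable). A real client would keep the two in sync.
const maxPayloadBytes = 1 << 20

// highlightQuery is a stand-in for the request body the client sends.
type highlightQuery struct {
	Code     string `json:"code"`
	Filepath string `json:"filepath"`
	Theme    string `json:"theme"`
}

var errPayloadTooLarge = errors.New("highlight request larger than server limit; fall back to plaintext")

// marshalIfSmallEnough refuses to build a payload the server would reject,
// saving the wasted marshal + round trip described above.
func marshalIfSmallEnough(q highlightQuery) ([]byte, error) {
	// JSON encoding can only make the body larger than the raw code, so this
	// cheap pre-check catches hopeless cases without marshaling at all.
	if len(q.Code) > maxPayloadBytes {
		return nil, errPayloadTooLarge
	}
	body, err := json.Marshal(q)
	if err != nil {
		return nil, fmt.Errorf("marshal highlight request: %w", err)
	}
	if len(body) > maxPayloadBytes {
		return nil, errPayloadTooLarge
	}
	return body, nil
}
```

A caller could treat errPayloadTooLarge like any other highlighting failure and fall back to rendering the file as plaintext.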
Numbers from a gcloud instance with 16 vCPUs (CPU utilization seemed pretty low in testing, consistently staying below 25%) and 64 GB RAM (I was eyeballing usage and it didn't go over 5GB)
Once the Christmas branch freeze lifts, we should update our usage of syntect to point to the fork at https://github.com/sourcegraph/syntect/ (instead of https://github.com/slimsag/syntect).
NOTE: This was completed earlier. https://github.com/sourcegraph/sourcegraph/pull/40783
There were performance and accuracy improvements for range-based syntax highlighting and queries in Tree-sitter, well-tested through Neovim, now part of the 0.20.8 release: https://github.com/tree-sitter/tree-sitter/releases/tag/v0.20.8
inc-258
Closing this in favor of https://github.com/sourcegraph/sourcegraph/issues/59022 which has more up-to-date details + a bunch of narrower actionable points.
Ok here goes.. this is a bit ranty, but bear with me.
We haven't invested at all in `syntect_server`: it has received likely less than 1 full month of engineering effort in total, across all people at Sourcegraph, despite being core to Sourcegraph's functioning and existing for >4 years since I hacked together the initial version in 2 days back when Sourcegraph was an immature startup.
Despite this, it "is basically okay" for most users/customers today, except rarely having issues in very large Sourcegraph deployments like Sourcegraph.com, where it is too slow to respond and we fall back to plaintext highlighting.
The goal of this issue is not to say syntect_server is trash and we should throw it out (I think that would be a grave mistake, personally, for many reasons), but rather to explain why we're in bad shape today and what I hope someone else at Sourcegraph will pick up to improve the situation.
`syntect` (the library `syntect_server` relies on for syntax highlighting) supports a specific set of grammars out of the box. They only test with that specific set of grammars, and even among those there are some known issues.
So this means we have three problems that need solving (in a metaphorical sense):
`syntect` uses `panic!` and `.unwrap()` in a few places, which is not idiomatic Rust. There is an issue open since 2017, and nobody has worked on it: Replace calls to panic! with Result types trishume/syntect#98
To keep those panics from killing the whole process, `syntect_server` uses `panic::catch_unwind`, which is NOT idiomatic or a good idea in general - but hey, it does work: https://github.com/sourcegraph/syntect_server/blob/master/src/main.rs#L63-L66
On top of that, we wrap `syntect_server` in a dumb Go reverse proxy which (among other things) runs a configurable number of `WORKERS` - unfortunately each requiring 1.1GB of memory to cache the compiled language grammars separately.
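For readers who haven't looked at that wrapper, the sketch below shows the general shape of this kind of thing: N worker subprocesses, round-robin proxying, and a hard per-request timeout after which the presumed-stuck worker is killed and restarted. This is an illustrative sketch, not the actual http-server-stabilizer code; the binary name, ports, and ROCKET_PORT variable are assumptions.

```go
// Reduced sketch of a worker-wrapping reverse proxy (not the real http-server-stabilizer).
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
	"os/exec"
	"sync"
	"sync/atomic"
	"time"
)

const (
	numWorkers     = 4                // the WORKERS knob mentioned above
	requestTimeout = 10 * time.Second // the ">10s" hang threshold mentioned above
	basePort       = 9238
)

type worker struct {
	mu    sync.Mutex
	port  int
	cmd   *exec.Cmd
	proxy *httputil.ReverseProxy
}

// restart kills any existing subprocess (a stuck worker never frees its CPU
// otherwise) and starts a fresh one listening on this worker's port.
func (w *worker) restart() {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.cmd != nil && w.cmd.Process != nil {
		_ = w.cmd.Process.Kill()
	}
	cmd := exec.Command("syntect_server") // hypothetical worker binary name
	cmd.Env = append(os.Environ(), fmt.Sprintf("ROCKET_PORT=%d", w.port))
	if err := cmd.Start(); err != nil {
		log.Printf("worker on port %d failed to start: %v", w.port, err)
		return
	}
	w.cmd = cmd
	go cmd.Wait() // reap the child when it exits
}

func main() {
	workers := make([]*worker, numWorkers)
	for i := range workers {
		port := basePort + i
		u, _ := url.Parse(fmt.Sprintf("http://127.0.0.1:%d", port))
		workers[i] = &worker{port: port, proxy: httputil.NewSingleHostReverseProxy(u)}
		workers[i].restart() // first call just starts the subprocess
	}

	var next uint64
	handler := http.HandlerFunc(func(rw http.ResponseWriter, r *http.Request) {
		w := workers[atomic.AddUint64(&next, 1)%numWorkers] // round-robin
		ctx, cancel := context.WithTimeout(r.Context(), requestTimeout)
		defer cancel()
		w.proxy.ServeHTTP(rw, r.WithContext(ctx))
		if ctx.Err() == context.DeadlineExceeded {
			// The worker is presumed stuck at 100% CPU; restart it so later
			// requests hit a healthy process. This request returns an error,
			// and the caller can fall back to plaintext.
			w.restart()
		}
	})
	log.Fatal(http.ListenAndServe(":9237", handler))
}
```

Killing the subprocess is the only way to reclaim the CPU a stuck highlight is burning, which is exactly the unpleasant property the rest of this issue is about.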
Why can't I horizontally scale syntect_server?
The combination of issues we experience right now means it doesn't make sense to.
First, due to the legitimate case of "highlighting is sometimes slow", you really do want most requests to have access to a lot of CPU. Highlighting files does not require equal resources; some files require far more than others. In an ideal world, you want multiple beefy syntect_server instances.
With horizontal scaling, we could choose to either continue using `panic::catch_unwind` or rely on Kubernetes container restarts. However, with the rate of panics on large deployments like Sourcegraph.com, if we were to simply remove `panic::catch_unwind` it would effectively render the service dead entirely. How many panics are happening? I'm not sure; the monitoring for worker_deaths got removed from the Grafana dashboard, but I think the Prometheus metric may still be around.
Importantly, Kubernetes health checks to a `/healthz` endpoint on the container do NOT tell us if a request has gotten stuck at 100% CPU and will never return! That may be the case for multiple requests, and all CPUs may be consumed entirely, while the container still responds to `/healthz`. The only way we would know is if the `/healthz` endpoint could somehow monitor all currently pending requests on the server and verify whether any have been stuck for some period of time (>10s, like that nasty http-server-stabilizer hack does). syntect does not support a timeout parameter / context like Go does, so fixing this is not trivial: we don't know where in the syntect codebase it is getting stuck or why, and syntect is a relatively complex stack machine.
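To make that idea concrete, here is a small sketch (written in Go for brevity, with made-up names; the real server is Rust) of a /healthz handler that tracks in-flight requests and reports unhealthy once any of them has been running longer than a threshold.

```go
// Hypothetical sketch of a /healthz check that watches in-flight request age.
package healthsketch

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// stuckThreshold mirrors the ">10s" hang heuristic mentioned above.
const stuckThreshold = 10 * time.Second

// InflightTracker remembers when each in-flight request started.
type InflightTracker struct {
	mu     sync.Mutex
	nextID uint64
	starts map[uint64]time.Time
}

func NewInflightTracker() *InflightTracker {
	return &InflightTracker{starts: make(map[uint64]time.Time)}
}

// Wrap registers each request for as long as its handler is running.
func (t *InflightTracker) Wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		t.mu.Lock()
		id := t.nextID
		t.nextID++
		t.starts[id] = time.Now()
		t.mu.Unlock()
		defer func() {
			t.mu.Lock()
			delete(t.starts, id)
			t.mu.Unlock()
		}()
		next.ServeHTTP(w, r)
	})
}

// Healthz reports unhealthy if any in-flight request has been running longer
// than stuckThreshold, so the orchestrator restarts the container instead of
// sending it more traffic.
func (t *InflightTracker) Healthz(w http.ResponseWriter, r *http.Request) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, start := range t.starts {
		if d := time.Since(start); d > stuckThreshold {
			http.Error(w, fmt.Sprintf("a request has been in flight for %s", d), http.StatusServiceUnavailable)
			return
		}
	}
	fmt.Fprintln(w, "healthy")
}
```

Even with something like this, all you get is detection: the stuck request keeps burning CPU until the process is restarted, which is why real timeout support inside syntect matters.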
What should we do to fix this?
So far, nobody / no team has been interested in or had the time to even investigate this. I want to be clear: syntect_server is basically unmaintained because of a combination of two mentalities: "it works well enough" (generally true) and "stephen owns it" (no longer true).
In my view, the following would give us the best return on engineering effort in order of priority:
1. Figure out which files are causing `syntect` to either get entirely stuck or panic (i.e. https://github.com/sourcegraph/sourcegraph/issues/16085) and then debug/fix those upstream in `syntect`.
2. Add timeout support to `syntect` so that we can get rid of `http-server-stabilizer` entirely by using timeouts in the `syntect_server` Rust code itself. long-term feature request: ability to specify a timeout for highlighting trishume/syntect#202
3. Replace `syntect`'s usages of `panic!` and `.unwrap()` so we can get rid of the stack unwinding hack: Replace calls to panic! with Result types trishume/syntect#98
4. Remove `http-server-stabilizer` entirely, and add support in Sourcegraph for horizontally scaling `syntect_server`.