
Add performance measurement #9722

Open
sgraband opened this issue Jul 12, 2021 · 3 comments
Labels: metrics (issues related to metrics and logging) · performance (issues related to performance) · proposal (feature proposals, potential future features)

Comments

@sgraband
Contributor

Feature Description:

We would like to contribute a mechanism to measure/monitor system performance, more precisely the startup time of Theia, to avoid regressions and to provide a benchmark for possible improvements.

We currently have a script (using puppeteer) that uses the Google DevTools performance tracing to measure the largest contentful paint (LCP) metric. The script is parameterized and can run the measurement multiple times if necessary. The script starts a performance trace, opens Theia, and stops the trace again. This generates a profile file that contains all events captured during the recording. The file can also be imported into the Google DevTools to see a timeline of all events. The LCP metric is then parsed from the file and written to the console. If there is more than one run, the mean and standard deviation are calculated and logged as well.
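For readers who want a rough idea of the approach, here is a minimal sketch along those lines (this is not the contributed script itself; the URL, file name, and the exact trace event names are assumptions about what a Chrome trace typically contains):

```ts
// Minimal sketch, NOT the actual contributed script.
// URL, trace file name, and event names are assumptions.
import * as puppeteer from 'puppeteer';
import * as fs from 'fs';

async function measureLcp(url: string, runs: number): Promise<number[]> {
    const results: number[] = [];
    for (let i = 0; i < runs; i++) {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        // Start a DevTools performance trace, load Theia, then stop the trace.
        await page.tracing.start({ path: 'trace.json' });
        await page.goto(url, { waitUntil: 'networkidle0' });
        await page.tracing.stop();
        await browser.close();
        // The trace file can also be imported into the Chrome DevTools
        // Performance tab. Here we only extract the last LCP candidate.
        const trace = JSON.parse(fs.readFileSync('trace.json', 'utf-8'));
        const lcpCandidates = trace.traceEvents.filter(
            (e: any) => e.name === 'largestContentfulPaint::Candidate');
        const navStart = trace.traceEvents.find(
            (e: any) => e.name === 'navigationStart');
        const last = lcpCandidates[lcpCandidates.length - 1];
        // Trace timestamps are in microseconds; report LCP in milliseconds.
        results.push((last.ts - navStart.ts) / 1000);
    }
    return results;
}

// Mean and standard deviation over multiple runs.
function stats(values: number[]): { mean: number; stdDev: number } {
    const mean = values.reduce((a, b) => a + b, 0) / values.length;
    const variance = values.reduce((a, v) => a + (v - mean) ** 2, 0) / values.length;
    return { mean, stdDev: Math.sqrt(variance) };
}
```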

We believe this script is already useful as a stand-alone tool, as it allows measuring performance effects/improvements in a consistent way.

As useful extensions, we could also integrate this into the nightly build and in PR builds.
However, in our opinion, hardcoded limits for the measurements should be avoided, as they would lead to a lot of failed builds.
One possible solution to integrate the script into the build could be to run the startup measurement multiple times during the nightly build and keep a history of the results. These numbers can then be used to compare the results of PR builds to see if the performance is affected. For example, startup times that take 20% longer than the mean of the nightly builds could be flagged with a warning.
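For illustration, such a comparison step could look roughly like the following (the 20% threshold and the data shape are just the example values from above, not an agreed policy):

```ts
// Hypothetical PR-vs-nightly comparison; threshold and data shape are assumptions.
interface ComparisonResult {
    prMean: number;
    nightlyMean: number;
    regression: boolean;
}

function compareToNightly(prRuns: number[], nightlyHistory: number[], threshold = 0.2): ComparisonResult {
    const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
    const prMean = mean(prRuns);
    const nightlyMean = mean(nightlyHistory);
    // Flag a warning (not a build failure) if the PR is more than 20% slower
    // than the mean of the recorded nightly startup times.
    return { prMean, nightlyMean, regression: prMean > nightlyMean * (1 + threshold) };
}
```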

Any opinions or suggestions? We will first contribute the script and then potential integrations if wanted.

@JonasHelming

@vince-fugnitto added the metrics and proposal labels on Jul 12, 2021
@tsmaeder
Contributor

Having a single metric that tells you whether your PR makes Theia slower or faster sounds like a very good thing. However, I have a couple of questions:

  1. Is LCP a good metric? Can the user start doing work at the point of the LCP? Theia is not a traditional website and probably many things are started lazily or in the background.
  2. What state is the IDE started in? Are there editors open, etc.?
  3. Is back-end startup included in the number?
  4. While browser load time is important, why are we only measuring startup time? Is browser load time a good measure for most PRs?
  5. If running on CI, how do we ensure the results are not skewed by other tasks being run on the same shared infrastructure?
  6. Having a number is useful, but if we get a "yellow card" on our PR, what are the expectations on the developer and how does the developer find out what he needs to do? If we have bad numbers, we should at least provide the trace file.

I'm wondering where you folks are coming from: is startup time a problem in your work? Is VS Code considerably faster, for example?

One thing that would be really cool IMO is to have a suite of common tasks that we time for every release that could be part of the release process.

@sgraband
Contributor Author

Thanks for your feedback @tsmaeder! To answer your questions:

Having a single metric that tells you whether your PR makes Theia slower or faster sounds like a very good thing. However, I have a couple of questions:

  1. Is LCP a good metric? Can the user start doing work at the point of the LCP? Theia is not a traditional website and probably many things are started lazily or in the background.

The LCP roughly corresponds to the point in time at which the Theia loading screen disappears and the application is drawn. So it is basically the point where the user can start doing work, apart from the final drawing itself.

  2. What state is the IDE started in? Are there editors open, etc.?

Currently we start a basic IDE with an empty workspace and nothing opened. However, this can certainly be extended to cover more use cases, like large workspaces or a large number of VS Code extensions.

  3. Is back-end startup included in the number?

No, the back-end startup time is not included. The back-end is only measured indirectly, through the way it makes the front-end slower or faster.

  4. While browser load time is important, why are we only measuring startup time? Is browser load time a good measure for most PRs?

We measure startup time because it is relatively well defined and easy to measure. It does not take a lot of time and could therefore be executed with each PR. This way, startup time regressions can be detected early. Of course it makes sense to also do additional measurements, which can be added later.

  5. If running on CI, how do we ensure the results are not skewed by other tasks being run on the same shared infrastructure?

We can't really prevent the tests from being affected by the infrastructure. One mitigation is, for example, to run the tests many times, filter out outliers, and then take the average (a minimal sketch of such a filtering step follows below). However, this is certainly not fool-proof.
Therefore we would suggest not flagging builds as failed/unstable when the performance requirement is not met. We should rather just post the number on the PR as information, without causing failed builds. Ideally the number could also be collected by some dashboard so that it can be tracked over time.
Only after we have gained some experience with how the performance tests behave in practice, and some confidence in the numbers, would I start thinking about failing a build because of them.
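As a rough illustration of the "run many times, clean the results, take the average" mitigation, a simple trimmed mean could be used (the trimming ratio here is an arbitrary assumption, not a proposal for a specific value):

```ts
// Hypothetical outlier mitigation: drop the fastest/slowest runs, average the rest.
function trimmedMean(values: number[], trimRatio = 0.1): number {
    const sorted = [...values].sort((a, b) => a - b);
    const trim = Math.floor(sorted.length * trimRatio);
    const kept = sorted.slice(trim, sorted.length - trim);
    return kept.reduce((a, b) => a + b, 0) / kept.length;
}
```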

  6. Having a number is useful, but if we get a "yellow card" on our PR, what are the expectations on the developer and how does the developer find out what he needs to do? If we have bad numbers, we should at least provide the trace file.

We would like to suggest that, at first, the performance numbers are simply shown without any "call-to-action" and ideally collected somewhere so they can be tracked. Once there is enough confidence in their stability and the community decides that it is worth it, one can think about handing out "yellow cards". In that case we should offer an easy way to re-trigger the performance tests (so any fluke can easily be ruled out) and the log files so the problem can be analyzed.
In general we definitely want to avoid a poor signal-to-noise ratio. When every second PR is flagged without reason, the flag will just be ignored in practice.

I'm wondering where you folks are coming from: is startup time a problem in your work? Is VS Code considerably faster, for example?

Startup time is a very important metric for user experience and strongly influences the user's perception of the tool's quality. It is also easy to measure and therefore a good candidate for the first of hopefully many performance tests.

One thing that would be really cool IMO is to have a suite of common tasks that we time for every release that could be part of the release process.

Yes, absolutely. In an ideal world we would have:

  • a selection of performance tests which can be executed with each PR without increasing the build time too much, so any regression is hopefully caught early
  • a more complete collection of performance tests executed with each nightly build where build time is not that important
  • a full set of performance tests (maybe the same as nightly) which is checked at least for each release

To summarize:

  • As a first step we just want to provide the performance measurement script (only including startup time measurement) so anybody interested can run the test(s) themselves
  • In the future it definitely makes sense to also integrate them into the CI process; however, it is important to reduce the number of false positives as much as possible.
  • Additional performance tests covering much more complex scenarios can be added iteratively

@sdirix
Member

sdirix commented Sep 19, 2023

Current state of this issue:

  • A script was contributed to Theia with Add performance measurement #9777 using the LCP metric
  • Automatic measurement and logging of Theia builds is not yet integrated into the main repository
  • The e2e repository, however, captures the logs produced during the e2e tests, including the startup time logs. See here.
