Change detection improvements #38
When the test contains sequential data (e.g. per-second throughput), rather than just averaging the values for comparison with other runs we could treat the sequence as a multi-dimensional vector and compute the distance between those vectors. When the sequences are not of equal length we could pad the missing dimensions with the average value. Truncating the longer vector is another option, but that would lose information.
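A minimal sketch of that idea, assuming mean-padding and a Euclidean distance (the function names and the distance metric are illustrative, not part of any existing Horreum API):

```python
import numpy as np

def pad_with_mean(seq, length):
    """Pad a sequence to the target length using its own mean value."""
    seq = np.asarray(seq, dtype=float)
    if len(seq) >= length:
        return seq
    return np.concatenate([seq, np.full(length - len(seq), seq.mean())])

def sequence_distance(a, b):
    """Treat two per-second series as vectors and return their Euclidean
    distance, padding the shorter one with its mean so dimensions match."""
    n = max(len(a), len(b))
    return float(np.linalg.norm(pad_with_mean(a, n) - pad_with_mean(b, n)))

# Example: a baseline run vs. a run with a throughput dip in the middle
baseline = [1000, 1010, 995, 1005, 1002]
candidate = [1001, 990, 700, 710, 1003, 998]
print(sequence_distance(baseline, candidate))
```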
The critical problem with the current approach of running a t-test for each variable is that as the number of (non-independent) variables grows, the likelihood of a false positive in at least one of them grows as well. It's useful to incorporate more performance counters into the dataset, but clustering them by covariance could have better properties.
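To illustrate how fast that inflates, here is a rough sketch of the family-wise error rate, simplified by assuming the per-variable tests are independent at α = 0.05 (the real counters are correlated, so the exact numbers will differ):

```python
# Probability that at least one of m independent tests at level alpha
# rejects a true null hypothesis (family-wise error rate).
alpha = 0.05
for m in (1, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:2d} variables -> {fwer:.1%} chance of at least one false positive")
#  1 variables ->  5.0%
#  5 variables -> 22.6%
# 10 variables -> 40.1%
# 20 variables -> 64.2%
```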
I ran an experiment with the current algorithm: I generated 200 dummy runs using a ~normal distribution (10 + sum of 5
Actually, when using minWindow = 5 and trying the 2 * stddev test, I got 10 changes, and with the t-test I got 9 changes. These numbers roughly fit the expected share of the population outside mean ± 2*stddev (4.55%) or the confidence level (5% chance of rejecting the null hypothesis while it holds). We get what we ask for from the statistics, even though we wish for no false positives.
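A quick back-of-the-envelope check of those counts, assuming the 200 dummy runs above and treating every flag as a false positive since the runs come from one distribution:

```python
runs = 200
print(runs * 0.0455)  # ~9.1 runs expected outside mean +- 2*stddev
print(runs * 0.05)    # ~10 rejections expected at a 5% significance level
```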
@johnaohara Thinking about comparing histograms, I think it could be done, and the method could be used for any constant-size vector: we could average the values to obtain the baseline, then calculate the square root of the difference in each item and average these. Then the regular thresholds would apply. I am not sure how useful this would be in practice, but it is something that makes sense to try and is not possible currently (you could compare each vector item individually, but you'd probably need higher thresholds - it is not possible to diff each vector item first now). The UI would not need to be more complicated: this could be the default for any regression variable returning an array. If the vector size differs, though, the comparison would fail and a notification would be sent. Charting would be a bit more difficult: I can imagine an interactive time axis, using the whole chart to display just a single histogram, with a gray 'average histogram' in the background. Optional log scale? (with some primitive heuristic choosing the default). If users choose to plot data with completely different scales, such a chart wouldn't be too useful, but hey, they can normalize them in the calculation function (without affecting the regression algorithm at all).
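A rough sketch of what that reduction could look like. The per-item step below uses the square root of the absolute difference from the baseline (absolute value added so negative differences don't break the square root); all function names are hypothetical, not existing Horreum code:

```python
import numpy as np

def baseline_histogram(histograms):
    """Average a list of equal-size histograms item by item to get the baseline."""
    return np.mean(np.asarray(histograms, dtype=float), axis=0)

def histogram_distance(histogram, baseline):
    """Reduce a histogram to one scalar: square root of the absolute per-item
    difference from the baseline, averaged over all items. The scalar can then
    be checked against the regular thresholds."""
    histogram = np.asarray(histogram, dtype=float)
    if histogram.shape != baseline.shape:
        raise ValueError("vector sizes differ - comparison fails, notify instead")
    return float(np.mean(np.sqrt(np.abs(histogram - baseline))))

previous = [[10, 50, 30, 10], [12, 48, 31, 9], [11, 49, 29, 11]]
base = baseline_histogram(previous)
print(histogram_distance([11, 50, 30, 9], base))   # close to baseline -> small value
print(histogram_distance([30, 20, 20, 30], base))  # shifted shape -> larger value
```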
If we decide to adopt some form of statistical tests again, we should use https://en.wikipedia.org/wiki/Holm%E2%80%93Bonferroni_method to compensate for the multiple comparisons.
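For reference, a minimal sketch of the Holm–Bonferroni step-down procedure (if a library is preferred, statsmodels exposes the same correction via `multipletests(..., method='holm')`):

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected
    after Holm-Bonferroni correction for multiple comparisons."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Compare the k-th smallest p-value against alpha / (m - k)
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are accepted too
    return reject

# Example: five variables tested per run; a naive 0.05 cutoff would flag four
# of them, but only the first survives the correction.
print(holm_bonferroni([0.001, 0.02, 0.03, 0.04, 0.20]))
```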
When monitoring performance on a branch it might happen that a regression is introduced and later fixed. Horreum does not let us confirm whether the performance after the fix is equal to the performance before the regression.
A paper on change detection worth reading: https://arxiv.org/pdf/1101.1438.pdf
Hey @rvansa, hope you are doing well! Thanks for the link to the paper, will take a look.
Hi John, yep, except no AC in my home office :) I actually found the paper when I stumbled upon this Python library: https://centre-borelli.github.io/ruptures-docs/
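For anyone curious, here is a small example of offline change point detection with ruptures; the synthetic signal and the penalty value are made up for illustration:

```python
import numpy as np
import ruptures as rpt

# Synthetic per-build throughput: a stable period, a regression, then a fix.
rng = np.random.default_rng(42)
signal = np.concatenate([
    rng.normal(1000, 10, 50),   # baseline
    rng.normal(900, 10, 30),    # regression introduced
    rng.normal(1000, 10, 40),   # regression fixed
])

# PELT searches for an unknown number of change points; the penalty controls
# how eagerly new breakpoints are added.
algo = rpt.Pelt(model="l2", min_size=5).fit(signal)
breakpoints = algo.predict(pen=500)
print(breakpoints)  # indices where the signal's statistics change, e.g. [50, 80, 120]
```

The detected segments would also make it possible to compare the post-fix segment against the pre-regression one, which relates to the earlier point about confirming that a fix restored the original performance.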
A pseudo-issue tracking ideas for improvements in regression monitoring