Measuring experiments' impact on retention #2958
Comments
I think there may be a third problem: we've changed a bunch of extra stuff on top of what you're testing, so the impact of those changes could throw the experiment off, as you're introducing more variables. I think the correct way to do this would be:
Ultimately, this is why I think experimentation has to be part of deployment, and everyone is missing a trick apart from the FAANG companies that have built this in-house.
For sure, a principle behind the scientific method is controlled variables; we can't really attribute the result to a specific experiment unless everything else is equal. That said, at this point I think it's unfeasible to do this, and while we may get some noisy results or coincidences every now and then, we're also not talking about 0.X% changes but rather XXX% or even more, so I think it's worth continuing to run experiments. Perhaps a good takeaway is shortening the scope of experiments and measuring only proxy actions for retention. This would also reduce the noise from changing variables, particularly if we run an experiment within the 2-week time range. We can then run more experiments and be more confident in the results. Thoughts?
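To make "measuring only proxy actions for retention" concrete, here is a minimal sketch of that comparison over a fixed 2-week window. It assumes a flat export of raw events with hypothetical column names (event, distinct_id, timestamp, variant) and a hypothetical proxy event name ("action created"); none of these names come from the thread, and the same breakdown could equally be done in a trends insight.

```python
# Minimal sketch: unique users performing a proxy action, per week and per
# experiment variant, restricted to the window the experiment actually ran.
# Column and event names are hypothetical placeholders.
import pandas as pd

def weekly_proxy_metric(events: pd.DataFrame, proxy_event: str = "action created") -> pd.DataFrame:
    """Unique users performing the proxy event, per week and per experiment variant."""
    df = events[events["event"] == proxy_event].copy()
    df["week"] = pd.to_datetime(df["timestamp"]).dt.to_period("W").dt.start_time
    return (
        df.groupby(["variant", "week"])["distinct_id"]
        .nunique()
        .rename("unique_users")
        .reset_index()
    )

def experiment_window(events: pd.DataFrame, start: str, end: str) -> pd.DataFrame:
    """Keep only events inside the 2-week experiment window before comparing variants."""
    ts = pd.to_datetime(events["timestamp"])
    return events[(ts >= pd.Timestamp(start)) & (ts < pd.Timestamp(end))]
```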
@paolodamico Resurrecting this old issue, as it's something I've also been thinking about. To answer the question "Are users on X experiment retaining better or not?", isn't the most important thing to synchronize the period of time in which retention becomes meaningful with the period of time for which a feature flag is active? Put more simply, I often feel like we may only have a feature flag active for 4 weeks or so before deciding to roll out a feature to everyone or kill it. Yet querying for retention is something we may often want to look at over a longer timescale, right? 3-month retention, 6-month retention, etc.: degrees of retention which, when good, justify our marketing spend and are likely to add revenue. It then becomes really hard to attribute retention, or leading indicators of retention, to a feature flag that may have been active in the past. I think this might have more to do with how we start/stop experiments. Say, for example, we're tracking a leading indicator of retention, "Actions created by unique users, weekly". The process goes like this:
Of course, the logically robust solution is to run tests with higher sample sizes, over longer time periods, and to demand a higher delta between the control and experimental groups before making a decision. But, being a startup, these goals may not always be accessible to us. We might make bad decisions on incomplete data that take 2+ months to materialize. So wouldn't a lot of confidence be gained by automatically tracking when a feature flag was turned on or off, or when its release conditions were modified? I'm picturing this as a feature that would automatically add annotations to graphs (which could of course be edited for more context).
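On the idea of automatically annotating flag changes, here is a rough sketch of what an external stopgap could look like, assuming PostHog's REST annotations endpoint and its content / date_marker / scope fields. Treat the exact path, field names, and the PERSONAL_API_KEY / PROJECT_ID placeholders as assumptions to verify against the API docs rather than a confirmed integration.

```python
# Sketch of a stopgap: post an annotation whenever a feature flag is flipped,
# so graphs show when release conditions changed. Endpoint path and field
# names are assumptions based on the annotations API; verify before use.
import datetime
import requests

POSTHOG_HOST = "https://app.posthog.com"   # placeholder
PERSONAL_API_KEY = "phx_..."               # placeholder personal API key
PROJECT_ID = 1                             # placeholder project id

def annotate_flag_change(flag_key: str, change: str) -> None:
    """Create a project-scoped annotation marking a feature-flag change."""
    resp = requests.post(
        f"{POSTHOG_HOST}/api/projects/{PROJECT_ID}/annotations/",
        headers={"Authorization": f"Bearer {PERSONAL_API_KEY}"},
        json={
            "content": f"Feature flag '{flag_key}' {change}",
            "date_marker": datetime.datetime.utcnow().isoformat() + "Z",
            "scope": "project",
        },
        timeout=10,
    )
    resp.raise_for_status()

# e.g. call this from whatever code path toggles the flag:
# annotate_flag_change("new-actions-ux", "rolled out to 100%")
```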
This issue hasn't seen activity in two years! If you want to keep it open, post a comment or remove the stale label.
This issue was closed due to lack of activity. Feel free to reopen if it's still relevant.
Looking for some input on this conceptual issue I've been having.
Context
We've run a few experiments so far with our own feature flags, and they have already led to interesting results (see example). The thing about experiments like that one (the actions UX refactoring, #1841) or the new persons page (#2353) is that while we ultimately want to move the retention needle with them, we do have some intermediate goals that serve as proxies for what we're trying to achieve. For instance, with the actions experiment we can measure whether users in the experiment group create more or fewer actions than those in the control group, or whether they use actions more or less on insights graphs. The same is true for the persons page, where we can measure whether users are viewing more or fewer sessions or events for those persons, or changing their properties.
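As a sketch of what "more or fewer actions than the control group" could mean quantitatively, here is a plain two-proportion z-test on a hypothetical proxy conversion (e.g. "created at least one action during the experiment window"). The counts in the example are made-up placeholders; the point is only the shape of the comparison, not any real numbers.

```python
# Two-proportion z-test sketch: did a larger share of experiment-group users
# perform the proxy action than control-group users? Counts are placeholders.
import math

def two_proportion_z(successes_a: int, n_a: int, successes_b: int, n_b: int) -> tuple[float, float]:
    """Return (z statistic, two-sided p-value) for H0: p_a == p_b."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the normal approximation.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical example: 120/400 experiment users vs 90/410 control users created an action.
z, p = two_proportion_z(120, 400, 90, 410)
print(f"z={z:.2f}, p={p:.3f}")  # a small p plus a meaningful delta -> more confidence
```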
We'll inevitably have some experiments where there are no good intermediate metrics to measure, and what we actually want to see is whether users in the experiment group retain better long-term than those in the control group. Furthermore, we would want to measure this even when we do have a good intermediate/proxy metric.
The Problem
Currently we cannot answer that question ("Are users on X experiment retaining better or not?") for a couple of reasons:
We use the "user signed up" event to cohortize users, but as that's a server-side event, it doesn't contain the feature flag details, and therefore applying a filter to it leads to no results.
Thoughts on this?
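One possible workaround for the cohortization problem, sketched with the posthog Python library: evaluate the flag on the server at signup time and attach it as a property on the "user signed up" event, so retention can be filtered or cohortized on it. The flag key, property name, and signup hook below are hypothetical, and the exact client setup and call signatures should be checked against the library version in use.

```python
# Sketch: enrich the server-side "user signed up" event with the flag value so
# the event can be filtered/cohortized by experiment group. Assumes the posthog
# client is already configured elsewhere; flag key, property name, and the
# signup hook below are hypothetical placeholders.
import posthog

def on_user_signed_up(distinct_id: str) -> None:
    # Evaluate the experiment flag for this user on the server side.
    in_experiment = posthog.feature_enabled("new-actions-ux", distinct_id)
    posthog.capture(
        distinct_id=distinct_id,
        event="user signed up",
        properties={
            # Custom property so the signup event carries the experiment group
            # and retention/trends queries can filter on it.
            "experiment_new_actions_ux": "test" if in_experiment else "control",
        },
    )
```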