
Measuring experiments impact on retention #2958

Closed
paolodamico opened this issue Jan 15, 2021 · 5 comments
Labels: concept (Ideas that need some shaping up still), stale

Comments

@paolodamico
Contributor

Looking for some input on this conceptual issue I've been having.

Context

We've run a few experiments so far with our own feature flags and they have already led to interesting results (see example). The thing about experiments like that one (the actions UX refactoring, #1841) or the new persons page (#2353) is that while we ultimately want to move the retention needle with them, we do have some intermediate goals that serve as proxies for what we're trying to achieve. For instance, with the actions experiment we can measure whether users in the experiment group create more or fewer actions than those in the control group, or whether they use actions more or less in insights graphs. The same is true for the persons page, where we can measure whether users are viewing more or fewer sessions or events for those persons, or changing their properties.

We'll inevitably have some experiments where there are no good intermediate metrics to measure, and what we actually want to see is whether users in the experiment group retain better over the long term than those in the control group. Furthermore, we would want to measure this even if we do have a good intermediate/proxy metric.

The Problem

Currently we cannot answer that question ("Are users on X experiment retaining better or not?") for a couple of reasons:

  1. When filtering the (first-time) retention table by users who have a feature flag active, what actually happens is that cohorts are created based on when each user first had an event with the feature flag active. We're not forming the cohorts as if the feature flag filter weren't there and then filtering for users in the experiment group (more context in Bug with retention cohorts & filtering #2654; as you can see, there's even an edge case where cohorts get larger when the filter is applied). Fixing this is not trivial: aside from the technical complications, the UX for it is not straightforward.
  2. A secondary, smaller problem is that we use the custom event `user signed up` to cohortize users, but since that's a server-side event it doesn't carry the feature flag details, so applying the flag filter to it returns no results (see the sketch below this list).
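
For that second point, one possible workaround is to evaluate the flag server-side at signup time and attach its state to the event as a property. This is only a sketch: the flag key (`new-actions-ux`) and the `$feature/...` property name are made-up examples, not what our backend currently sends.

```python
# Sketch only: flag key and "$feature/..." property name are illustrative.
import posthog

posthog.project_api_key = "<ph_project_api_key>"
posthog.host = "https://app.posthog.com"

def track_signup(distinct_id: str) -> None:
    # Evaluate the flag server-side for this user...
    flag_on = posthog.feature_enabled("new-actions-ux", distinct_id)
    # ...and attach its state to the server-side event so a retention
    # filter on the flag has a property to match against.
    posthog.capture(
        distinct_id,
        "user signed up",
        {"$feature/new-actions-ux": flag_on},
    )
```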

Thoughts on this?

@paolodamico added the concept (Ideas that need some shaping up still) and discussion labels on Jan 15, 2021
@jamesefhawkins
Collaborator

I think there may be a third problem: we've changed a bunch of extra stuff on top of what you're testing, and the impact of those changes could throw the experiment off, since you're introducing more variables.

I think the correct way to do this would be:

  1. Create experiment. This means a feature flag is on for some, off for others.
  2. Somehow, we need to pin all the other functionality at the state it's in when the experiment launches, for the users affected by the experiment only, until the experiment is concluded.

Ultimately, this is why I think experimentation has to be part of deployment, and everyone is missing a trick apart from the FAANG companies that have built this in-house.

@paolodamico
Contributor Author

For sure: a core principle of the scientific method is controlling variables, and we can't really attribute a result to a specific experiment unless everything else is held equal. At this point, though, I think that's unfeasible, and while we may get some noisy results or coincidences every now and then, we're also not talking about 0.X% changes but rather XXX% or even more, so I think it's worth continuing to run experiments.

Perhaps a good takeaway is shortening the scope of experiments and measuring only proxy actions for retention. This would also reduce the noise from changing variables, particularly if we run an experiment within a two-week time range. We can then run more experiments and be more confident in the results.

Thoughts?

@samwinslow
Contributor

@paolodamico Resurrecting this old issue as it's something I've also been thinking about. To answer the question Are users on X experiment retaining better or not? — isn't the most important thing to synchronize the period of time in which retention becomes meaningful with the period of time for which a feature flag is active?

Put more simply: we often only have a feature flag active for 4 weeks or so before deciding to roll a feature out to everyone or kill it. Yet retention is something we often want to look at over a longer timescale, right? 3-month retention, 6-month retention, etc.: degrees of retention which, when good, justify our marketing spend and are likely to add revenue.

It then becomes really hard to attribute retention, or leading indicators of retention, to a feature flag that may have been active in the past. I think this might have more to do with how we start/stop experiments.

Say, for example, we're tracking a leading indicator of retention, "Actions created by unique users, weekly". The process goes like this:

  1. We run a feature flag to change the actions creation experience for 4 weeks.
  2. We get some noisy data with a small sample size and jump to a conclusion that we should release the feature to everyone.
  3. We remove the feature flag conditional code from our codebase, exposing the feature, and delete the feature flag.
  4. Much later, we see a clear drop in our key metric and wonder why.

Of course the logically robust solution is to run tests with higher sample sizes, over longer time periods, and to demand a higher delta between the control and experimental groups before making a decision. But, being a startup, those goals may not always be within our reach. We might make bad decisions on incomplete data whose consequences take 2+ months to materialize.
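
As an illustration of the "higher delta / higher sample size" point, here's roughly what a basic significance check between the two groups could look like. This is a plain-Python sketch with made-up numbers, using a standard two-proportion z-test rather than anything PostHog does today:

```python
# Two-proportion z-test sketch; the retention counts below are made up.
from math import sqrt, erf

def retention_z_test(retained_a, total_a, retained_b, total_b):
    """Compare retention rates of control (a) vs. experiment (b) groups."""
    p_a = retained_a / total_a
    p_b = retained_b / total_b
    # Pooled proportion under the null hypothesis of no difference.
    p = (retained_a + retained_b) / (total_a + total_b)
    se = sqrt(p * (1 - p) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_b - p_a, z, p_value

# e.g. 120/400 users retained in control vs. 150/420 in the experiment group
delta, z, p = retention_z_test(120, 400, 150, 420)
print(f"delta={delta:.3f} z={z:.2f} p={p:.3f}")  # small samples -> p often > 0.05
```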

So — wouldn't a lot of confidence be gained by automatically tracking when a feature flag was turned on or off, or when its release conditions were modified? I'm picturing this as a feature that would automatically add annotations to graphs (which could of course be edited for more context).
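
Until something like that exists, a rough interim version could be a small script that posts an annotation whenever a flag's release conditions change. This is only a sketch: the endpoint path and field names (`content`, `date_marker`) are assumptions about the annotations API, so double-check them against the current API reference.

```python
# Sketch only: endpoint path and field names are assumptions; verify against
# the PostHog annotations API docs before relying on this.
from datetime import datetime, timezone

import requests

POSTHOG_HOST = "https://app.posthog.com"
PERSONAL_API_KEY = "<personal_api_key>"
PROJECT_ID = "<project_id>"

def annotate_flag_change(flag_key: str, change: str) -> None:
    """Drop a project-wide annotation describing a feature flag change."""
    response = requests.post(
        f"{POSTHOG_HOST}/api/projects/{PROJECT_ID}/annotations/",
        headers={"Authorization": f"Bearer {PERSONAL_API_KEY}"},
        json={
            "content": f"Feature flag '{flag_key}' {change}",
            "date_marker": datetime.now(timezone.utc).isoformat(),
        },
        timeout=10,
    )
    response.raise_for_status()

# e.g. annotate_flag_change("new-actions-ux", "rolled out to 100%")
```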

@posthog-bot
Contributor

This issue hasn't seen activity in two years! If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in two weeks.

@posthog-bot
Contributor

This issue was closed due to lack of activity. Feel free to reopen if it's still relevant.

@posthog-bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Apr 18, 2024