
Measuring experiments impact on retention #2958

Closed
paolodamico opened this issue Jan 15, 2021 · 5 comments
Labels: concept (Ideas that need some shaping up still), stale

Comments

@paolodamico
Contributor

Looking for some input on this conceptual issue I've been having.

Context

We've run a few experiments so far with our own feature flags and they have already led to interesting results (see example). The thing about experiments like that one (the actions UX refactoring, #1841) or the new persons page (#2353) is that while we ultimately want to move the retention needle with them, we do have some intermediate goals that serve as proxies for what we're trying to achieve. For instance, with the actions experiment we can measure whether users in the experiment group create more or fewer actions than those in the control group, or whether they use actions more or less in insights graphs. The same is true for the persons page, where we can measure whether users are viewing more or fewer sessions or events for those persons, or changing their properties.

We'll inevitably have some experiments where there are no good intermediate metrics to measure, and what we actually want to see is whether users in the experiment group retain better over the long term than those in the control group. Furthermore, we would want to measure this even if we do have a good intermediate/proxy metric.

The Problem

Currently we cannot answer that question ("Are users on X experiment retaining better or not?") for a couple of reasons:

  1. When filtering the (first-time) retention table by users who have a feature flag active, what actually happens is that cohorts are created based on when each user first had an event with the feature flag active. We're not forming the cohorts as if the feature flag filter weren't there and then filtering for users in the experiment group (more context in Bug with retention cohorts & filtering #2654; as you can see, there's even an edge case where cohorts get larger when the filter is applied). Fixing this is not trivial: aside from the technical complications, the UX for it is not straightforward.
  2. A secondary, smaller problem is that we use the custom event `user signed up` to cohortize users, but since that's a server-side event it doesn't carry the feature flag details, so applying the flag filter to it returns no results (see the sketch below this list).
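
For that second point, one possible workaround is to evaluate the flag server-side at signup time and attach its state to the event as a property. This is only a sketch: the flag key (`new-actions-ux`) and the `$feature/...` property name are made-up examples, not what our backend currently sends.

```python
# Sketch only: flag key and "$feature/..." property name are illustrative.
import posthog

posthog.project_api_key = "<ph_project_api_key>"
posthog.host = "https://app.posthog.com"

def track_signup(distinct_id: str) -> None:
    # Evaluate the flag server-side for this user...
    flag_on = posthog.feature_enabled("new-actions-ux", distinct_id)
    # ...and attach its state to the server-side event so a retention
    # filter on the flag has a property to match against.
    posthog.capture(
        distinct_id,
        "user signed up",
        {"$feature/new-actions-ux": flag_on},
    )
```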

Thoughts on this?

@paolodamico added the concept (Ideas that need some shaping up still) and discussion labels on Jan 15, 2021
@jamesefhawkins
Collaborator

I think there may be a third problem: we've changed a bunch of extra stuff on top of what you're testing, and the impact of those changes could throw the experiment off, since you're introducing more variables.

I think the correct way to do this would be:

  1. Create experiment. This means a feature flag is on for some, off for others.
  2. Somehow, we need to pin all the other functionality at the state it's in when the experiment launches, for the users affected by the experiment only, until the experiment is concluded.

Ultimately, this is why I think experimentation has to be part of deployment, and everyone is missing a trick apart from the FAANG companies that have built this in-house.

@paolodamico
Contributor Author

For sure: a core principle of the scientific method is controlling variables, and we can't really attribute a result to a specific experiment unless everything else is held equal. At this point, though, I think that's unfeasible, and while we may get some noisy results or coincidences every now and then, we're also not talking about 0.X% changes but rather XXX% or even more, so I think it's worth continuing to run experiments.

Perhaps a good takeaway is shortening the scope of experiments and measuring only proxy actions for retention. This would also reduce the noise from changing variables, particularly if we run an experiment within a two-week time range. We can then run more experiments and be more confident in the results.

Thoughts?

@samwinslow
Contributor

@paolodamico Resurrecting this old issue as it's something I've also been thinking about. To answer the question Are users on X experiment retaining better or not? — isn't the most important thing to synchronize the period of time in which retention becomes meaningful with the period of time for which a feature flag is active?

Put more simply: we often only have a feature flag active for 4 weeks or so before deciding to roll a feature out to everyone or kill it. Yet retention is something we often want to look at over a longer timescale, right? 3-month retention, 6-month retention, etc.: degrees of retention which, when good, justify our marketing spend and are likely to add revenue.

It then becomes really hard to attribute retention, or leading indicators of retention, to a feature flag that may have been active in the past. I think this might have more to do with how we start/stop experiments.

Say, for example, we're tracking a leading indicator of retention, "Actions created by unique users, weekly". The process goes like this:

  1. We run a feature flag to change the actions creation experience for 4 weeks.
  2. We get some noisy data with a small sample size and jump to a conclusion that we should release the feature to everyone.
  3. We remove the feature flag conditional code from our codebase, exposing the feature, and delete the feature flag.
  4. Much later, we see a clear drop in our key metric and wonder why.

Of course the logically robust solution is to run tests with higher sample sizes, over longer time periods, and to demand a higher delta between the control and experimental groups before making a decision. But, being a startup, those goals may not always be within our reach. We might make bad decisions on incomplete data whose consequences take 2+ months to materialize.
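
As an illustration of the "higher delta / higher sample size" point, here's roughly what a basic significance check between the two groups could look like. This is a plain-Python sketch with made-up numbers, using a standard two-proportion z-test rather than anything PostHog does today:

```python
# Two-proportion z-test sketch; the retention counts below are made up.
from math import sqrt, erf

def retention_z_test(retained_a, total_a, retained_b, total_b):
    """Compare retention rates of control (a) vs. experiment (b) groups."""
    p_a = retained_a / total_a
    p_b = retained_b / total_b
    # Pooled proportion under the null hypothesis of no difference.
    p = (retained_a + retained_b) / (total_a + total_b)
    se = sqrt(p * (1 - p) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_b - p_a, z, p_value

# e.g. 120/400 users retained in control vs. 150/420 in the experiment group
delta, z, p = retention_z_test(120, 400, 150, 420)
print(f"delta={delta:.3f} z={z:.2f} p={p:.3f}")  # small samples -> p often > 0.05
```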

So — wouldn't a lot of confidence be gained by automatically tracking when a feature flag was turned on or off, or when its release conditions were modified? I'm picturing this as a feature that would automatically add annotations to graphs (which could of course be edited for more context).
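
Until something like that exists, a rough interim version could be a small script that posts an annotation whenever a flag's release conditions change. This is only a sketch: the endpoint path and field names (`content`, `date_marker`) are assumptions about the annotations API, so double-check them against the current API reference.

```python
# Sketch only: endpoint path and field names are assumptions; verify against
# the PostHog annotations API docs before relying on this.
from datetime import datetime, timezone

import requests

POSTHOG_HOST = "https://app.posthog.com"
PERSONAL_API_KEY = "<personal_api_key>"
PROJECT_ID = "<project_id>"

def annotate_flag_change(flag_key: str, change: str) -> None:
    """Drop a project-wide annotation describing a feature flag change."""
    response = requests.post(
        f"{POSTHOG_HOST}/api/projects/{PROJECT_ID}/annotations/",
        headers={"Authorization": f"Bearer {PERSONAL_API_KEY}"},
        json={
            "content": f"Feature flag '{flag_key}' {change}",
            "date_marker": datetime.now(timezone.utc).isoformat(),
        },
        timeout=10,
    )
    response.raise_for_status()

# e.g. annotate_flag_change("new-actions-ux", "rolled out to 100%")
```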

@posthog-bot
Contributor

This issue hasn't seen activity in two years! If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in two weeks.

@posthog-bot
Contributor

This issue was closed due to lack of activity. Feel free to reopen if it's still relevant.

@posthog-bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Apr 18, 2024