Is A/B Testing (Mis)informing Your Business?
I was cruising the Twittersphere (Twitterverse?) yesterday and, amidst the noise and vitriol, came across a real gem: a study posted to SSRN called p-Hacking and False Discovery in A/B Testing.
(It appeared in my feed thanks to Adam Alter, author of the delightful Drunk Tank Pink – And Other Unexpected Forces that Shape How We Think, Feel, and Behave.)
If you’re a member of the set of people who depend upon A/B testing and have an appetite for reading academic studies, then just leave now and go read the article – it’s OK, I’ll gladly take the hit to my bounce rate so that you can get the info straight from the source.
If, however, you prefer the TL;DR version (sorry, still pretty long), then here you go.
Using data from Optimizely, the study investigates the extent to which online A/B experimenters stop their experiments early based on the p-value (i.e., statistical significance) of the treatment effect (i.e., the difference in performance between A and B), and how such behavior impacts the value of the experimental results.
Say what? Basically, the study’s authors looked to see if A/B testers ran an experiment for a while, then stopped it prematurely when they liked or didn’t like what they were seeing (a misuse of data analysis called, among other things, p-hacking).
They found that, yes, “about 73% of experimenters stop the experiment just when a positive effect reaches 90% confidence”.
(They didn’t find any premature stopping at the negative end; the most likely explanation here is that if an experiment is going poorly, the tester lets it run its course just in case there’s a sudden comeback!)
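To see why peeking is such a problem, here's a quick simulation – my own sketch, not code from the study, and every parameter in it is made up. Both variants convert at the same 5% rate, so any "significant" result is, by construction, a false positive. One group of simulated experimenters checks after every batch of visitors and stops the moment the test hits 90% confidence; the other waits until the planned end.

```python
# Illustrative simulation of optional stopping in an A/A test.
# Both arms have the SAME true conversion rate, so every "win" is a false positive.
import math
import random

random.seed(42)

def z_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def run_experiment(peek, batches=10, batch_size=200, rate=0.05):
    """Return True if the experiment (falsely) declares significance at 90%."""
    conv_a = conv_b = n_a = n_b = 0
    for _ in range(batches):
        conv_a += sum(random.random() < rate for _ in range(batch_size))
        conv_b += sum(random.random() < rate for _ in range(batch_size))
        n_a += batch_size
        n_b += batch_size
        # The p-hacker stops the moment the test "reaches 90% confidence"
        if peek and z_pvalue(conv_a, n_a, conv_b, n_b) < 0.10:
            return True
    # The patient experimenter only looks once, at the planned end
    return z_pvalue(conv_a, n_a, conv_b, n_b) < 0.10

trials = 500
peeking = sum(run_experiment(peek=True) for _ in range(trials)) / trials
honest = sum(run_experiment(peek=False) for _ in range(trials)) / trials
print(f"False positive rate with peeking:    {peeking:.1%}")
print(f"False positive rate without peeking: {honest:.1%}")
```

The honest experimenter's false positive rate sits near the 10% you'd expect at 90% confidence; the peeker's rate climbs well above it, because ten looks at the data are ten chances for noise to cross the line.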
Why does that matter? Because it leads A/B testers to jump to false conclusions, and those false conclusions have real business impacts.
Additionally, the study’s authors estimate that “the proportion of all experiments that truly have no effect, regardless of whether the result was declared significant, to be about 73%-77%.”
Say what? Yeah. Three quarters of A/B tests have no statistically significant impact either way (i.e., good or bad). In the authors’ words, “This means that the large majority of A/B tests in our sample…will not identify more effective business practices.”
In other words, you’re very often wasting your time and money on A/B tests. Ouch.
But that doesn’t mean that A/B testing is flawed; rather, it means that the things often being A/B tested are genuinely pretty trivial, and have no real-world impact. Here’s how the authors put it: “the high prevalence of non-significant results in A/B tests stems from the interventions being tested rather than the method of A/B testing”.
Eureka! Oh wait, never mind.
All of these mistakes come at a cost.
Quoting extensively from the study, now:
Improper optional stopping increased the average False Discovery Rate among p-hacked experiments from 33% to 40%. This generates two possible costs for a company. The first is a cost of commission. Facing a false discovery, the company will needlessly switch to a new treatment and incur a switching cost.
For many experiments this cost may be low, like changing the background color of a webpage. But for some it may be quite substantial, like building and rolling out the infrastructure to enable a new shipping policy.
The second cost of a false discovery is a cost of omission. Erroneously believing to have found an improvement, the company stops further exploring for better treatments. Consequently the company will delay (or completely forego) finding and rolling out a more effective policy.
So, basically, there are two costs to false discovery:
- You start doing something that doesn’t work
- You don’t discover something that does work
The authors go on to estimate the real-world impact of this omission on lift: “we estimate the expected cost of omission in terms of lift at 1.95%. This corresponds to the 58th percentile of the positive observed lifts and the 76th percentile of all observed lifts. Hence the expected opportunity cost of omission following a false discovery is a fairly large forgone gain in lift.”
OK, that sounds crappy, but what can I do about it?
I think I’m starting to get a reputation as someone who hates on digital marketing…which is probably not the best thing for a marketer.
But here’s the thing: I believe digital marketing is a powerful toolset; in fact, I wish I’d had the opportunity to use it much more to this point in my career. However, I believe digital marketing needs to be used responsibly (don’t get distracted by tactics) and as part of a larger marketing strategy (digital channels + content are vastly superior to either alone).
And A/B testing is no different. Clearly, the technique has value. Equally clearly, many (most?) A/B testers are jumping to erroneous conclusions that have a negative impact on their businesses.
So what can we do to maximize the gains and minimize the risks? The authors provide four strategies to avoid improper premature stopping of A/B tests, which will help to lower the false discovery rate:
- Tighten the significance threshold, moving from the usual 95% to a much more stringent 99.5%
- Use proper sequential testing and FDR-control procedures (note that Optimizely implemented this strategy shortly after the data window used in the study)
- Use Bayesian hypothesis testing that’s robust to optional stopping
- Forgo null hypothesis testing altogether and approach the business decision that A/B testing is meant to inform as a decision-theoretic problem
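For the curious, the Bayesian option in that list is less scary than it sounds. Here's a minimal sketch – my own illustration with made-up conversion counts, not the study's method: with Beta-Binomial conjugacy, instead of a p-value you get a direct posterior probability that B beats A.

```python
# Bayesian A/B comparison via Beta-Binomial conjugacy.
# Conversion counts below are hypothetical, purely for illustration.
import random

random.seed(0)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000):
    """Monte Carlo estimate of P(rate_B > rate_A) under uniform Beta(1,1) priors."""
    wins = 0
    for _ in range(draws):
        # Posterior for each arm is Beta(conversions + 1, non-conversions + 1)
        pa = random.betavariate(conv_a + 1, n_a - conv_a + 1)
        pb = random.betavariate(conv_b + 1, n_b - conv_b + 1)
        wins += pb > pa
    return wins / draws

# Hypothetical results: A converted 120 of 2400 visitors, B converted 150 of 2400
p = prob_b_beats_a(120, 2400, 150, 2400)
print(f"P(B beats A) is approximately {p:.2f}")
```

A statement like "there's a 97% chance B is better than A" maps far more naturally onto a business decision than a p-value does. One caution, though: a naive posterior probability isn't automatically immune to peeking either – the study specifically points to Bayesian tests designed to be robust to optional stopping.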
OK, most folks are just gonna rely on their A/B platform to take care of all that scary-sounding Bayesian hypothesis decision theoretic stuff.
For those of us whose eyes glazed over while skimming that previous list, here’s what I’d do if I were overseeing A/B testing:
- Recognize that A/B testing has limitations – it’s not a panacea that can take the higher-level work out of your communications and design activities; instead, it’s there to help you optimize outcomes (but only when you let your experiments run to their full conclusion at a higher significance level!). But with that being said…
- Accept that many things you’re A/B testing just don’t matter – and that’s OK. Accepting this reality frees you up to spend your time on things that do matter (including testing big changes, rather than little tweaks here and there), which brings us to…
- Please, please, please, make sure you’ve taken care of the things that matter most – like thinking long-term, planning your marketing strategies in service to objectives, choosing tactics that support the strategies; optimizations are great, but only when you’ve sufficiently delegated and have the other stuff under control first! In other words: don’t mismanage the forest because you’re pruning a single tree.