Statistical Methods in Online A/B Testing by Georgi Z. Georgiev

Author: Georgi Z. Georgiev
Language: eng
Format: epub
Publisher: Publisher
Published: 2019-11-15T00:00:00+00:00


6.6 Testing the perfect shade of blue

I would like to offer a humble reexamination of a famous test of this nature, which Google ran sometime in 2008/2009. According to then Google executive Marissa Mayer and a designer involved in the test (Holson 2009) (Hern 2014), two Google teams had to decide on the blue background of the toolbar across Google pages. A designer proposed one shade of blue, which all the designers liked. An engineer, however, tested another shade, and the data suggested that his blue worked better in driving clicks on the search results. The dispute was temporarily resolved by Mayer selecting a shade between the two.

Afterwards, however, a test was performed for each of the 41 shades of blue between the one proposed by the designer and the one proposed by the engineer. One shade was selected based on these experiments. Accounts conflict on what benefit this brought the company, since reports mix two different ‘shades of blue’ experiments, but a figure of $200 million is often cited.

While it is not entirely clear whether Google ran 41 variants against a control in a single test or ran 41 separate A/B tests, assume for argument’s sake that they ran 41 variants against a control in a single test. Further assume that they applied correct adjustments for significance and sample size, and that they were able to run the test in a timely manner. Even so, if you, dear reader, had to design the test, would you test 41 variants, knowing that this increases the time required to run the test by about 12 times compared to a simple A/B test between the two shades?

What if we consider that blue lies on a scale, and that whatever effect it has on users is likely ‘dose-dependent’, with a peak around the ‘best’ shade of blue? With that in mind, it is more efficient to first run an A/B/n test with just three variants against a control that keeps the current color, whatever it is: one variant for each of the two shades proposed by the two camps, and one with a shade exactly between them.

This would have increased the sample size only by about 67% compared to a simple A/B test. Then another A/B/n test with several values around the winning shade could be run as well. This would have cut total execution time about four-fold, while having a high probability of homing in on the best shade of blue.
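The ‘about four-fold’ figure can be roughly checked from the numbers quoted above. The sketch below takes the 12x and 67% figures from the text and assumes, purely for illustration, that the follow-up test costs about as much as the first one; the text only says it would include ‘several’ values, so the second-stage figure is a guess rather than the author’s calculation.

```python
# Relative durations, in units of one simple A/B test (figures quoted in the text).
FULL_TEST = 12.0   # 41 variants vs. control: "about 12 times" an A/B test
STAGE_ONE = 1.67   # 3 variants vs. control: "about 67%" more than an A/B test

# Assumption (not from the text): the follow-up test costs the same as stage one.
STAGE_TWO = STAGE_ONE


def speedup(full: float, *stages: float) -> float:
    """Ratio of the single large test's duration to the staged tests' total."""
    return full / sum(stages)


two_stage_total = STAGE_ONE + STAGE_TWO
print(f"two-stage total: {two_stage_total:.2f}x an A/B test")
print(f"speed-up vs. the 41-variant test: "
      f"{speedup(FULL_TEST, STAGE_ONE, STAGE_TWO):.1f}x")
```

Under that assumption the two-stage plan takes a little over three A/B tests’ worth of time, which gives a speed-up of between three and four times. A follow-up test with more variants would take longer and reduce the saving.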

As an added benefit, during the second test the control and most variants would have an improved conversion rate relative to the baseline, meaning that we would be exposing significantly fewer users to suboptimal shades of blue. In other words, we would start gaining the benefits of the test much sooner than if we’d designed a large single test with 41 variants, only a handful of which would be close to optimal.
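The exposure argument can be illustrated with a toy model. In the sketch below, everything is hypothetical: the single-peaked `rate` curve, its peak position, and the spacing of the second-stage shades are invented for illustration, and picking a ‘winner’ is noise-free, whereas a real test would have to identify it statistically.

```python
from statistics import mean

N_SHADES = 41  # candidate shades, indexed 0..40 from one proposal to the other


def rate(shade: int, peak: int = 27, base: float = 0.100, lift: float = 0.010) -> float:
    """Hypothetical single-peaked conversion rate; all parameters are made up."""
    distance = abs(shade - peak) / (N_SHADES - 1)
    return base + lift * (1.0 - distance) ** 2


def best(arms: list[int]) -> int:
    """Noise-free stand-in for identifying the winning arm of a test."""
    return max(arms, key=rate)


# Stage 1: the two proposed shades plus the shade exactly between them.
stage_one = [0, 20, 40]
winner_one = best(stage_one)

# Stage 2: a few shades around the stage-1 winner, clipped to the valid range.
offsets = (-10, -5, 0, 5, 10)
stage_two = [winner_one + o for o in offsets if 0 <= winner_one + o < N_SHADES]
winner_two = best(stage_two)

all_shades = list(range(N_SHADES))
mean_all = mean(rate(s) for s in all_shades)
mean_stage_two = mean(rate(s) for s in stage_two)

print("stage 1 winner:", winner_one)
print("stage 2 winner:", winner_two)
print("mean rate across all 41 arms:", round(mean_all, 5))
print("mean rate across stage 2 arms:", round(mean_stage_two, 5))
```

With this made-up curve the second-stage arms cluster near the peak, so their average rate is higher than the average over all 41 shades. The staged search ends close to the assumed optimum without necessarily landing exactly on it.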

While most practitioners would either


