How to A/B Test
How to approach A/B testing in your Bloomreach Engagement project
This guide will help you understand:
- The importance of A/B testing,
- How to set up such a use case,
- How to evaluate such a use case.
What this Use Case does and why we developed it
Companies face various problems. To identify your company's current pain points, you can compare your performance on basic metrics with industry benchmarks or use the Online Retail Formula. The conclusions usually lead to use cases focused on increasing revenue per visitor (RPV), conversion rate, average order value (AOV), etc. An increase in these metrics is treated as evidence that the problem is being solved.
Setting Up the Use Case
What to do before launching the use case?
Use cases are most often based on an understanding of the conversion process. When you break the process down into granular steps, you can decide which part of the process to improve with a use case.
Have a clear hypothesis before launching use cases, such as: “By highlighting the stock availability on the product detail page, we believe we will improve the cart-to-detail ratio by 10%.”
You should be able to answer the following questions:
- What are you hoping to achieve?
- Why are you launching it
  - for this particular group of customers,
  - under these conditions,
  - in this form?
This thinking process during the setup can also help you later when tweaking the use case for another A/B test if the test does not perform as expected.
How long should the test run?
Determine how long the A/B test should run before launching it. An A/B test has to run for a particular period before you can look at the results, safely make conclusions, and base actions on the data.
The test should also run for at least one (ideally two) complete business cycles. A business cycle is a week in about 95% of cases; covering a full cycle captures all kinds of customers, behavior, traffic, regular campaigns, etc. Start and stop the test at the same point in the business cycle. Watch out for unusual events, too: for example, sending an exceptionally large or successful newsletter during a two-week test will skew the results.
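The rule above can be sketched as a small date calculation. This is only an illustration: the function name, its parameters, and the default 7-day cycle are assumptions made for the example, not part of Bloomreach Engagement.

```python
import math
from datetime import date, timedelta


def planned_end_date(start: date, min_days_needed: int,
                     cycle_days: int = 7, min_cycles: int = 1) -> date:
    """Round the required duration up to whole business cycles.

    Because the duration is a multiple of cycle_days, the test stops at the
    same point in the business cycle at which it started.
    """
    cycles = max(min_cycles, math.ceil(min_days_needed / cycle_days))
    return start + timedelta(days=cycles * cycle_days)


# A test that needs roughly 10 days of data, with at least two weekly cycles.
print(planned_end_date(date(2024, 3, 4), min_days_needed=10, min_cycles=2))  # 2024-03-18
```

Rounding up rather than down keeps the test from ending mid-cycle, for example after a weekend that only one part of the test period included.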
We use Bayesian statistics to evaluate the impact of a use case. Read more about it and the reasons for switching from frequentist to Bayesian statistics.
What is the right A/B test ratio?
Ideally, always start with a 50:50 test. Do not change the ratio until you are sure of the results. Iterate and change the Variant over time rather than starting with too many Variants, as they would prolong the duration of the test.
If you use a split other than 50:50 between the Control Group and the Variant, you need to segment the customers by their number of merge events during evaluation and evaluate the use case through those segments. Otherwise, the evaluation is unfair and skewed toward the group with the higher percentage.
Do not change the A/B test ratio of a running campaign. If you really need to change the ratio, set up a new campaign (new A/B split node, new weblayer, new experiment, ...) and evaluate only the new campaign.
Customers already assigned to a Variant will remain in that Variant, which can skew the evaluation. Imagine an A/A test (comparing two identical Variants) where A1 vs. A2 is 20:80 and, after two weeks, the ratio changes to 80:20 for another two weeks. The sample sizes will be almost the same, but the heavy visitors/purchasers, who are the most likely to be assigned early, will be concentrated in the Variant that had the larger share while they were being assigned. The evaluation will then declare that Variant the clear winner, even though both Variants are identical.
Things to keep in mind
Conditions before A/B test
You want your A/B test to be clean and random - so that the results are relevant and reflect reality. You can help by cleaning your customer base of irrelevant customers before sending them to the A/B test.
If you build your A/B test in a scenario, ensure your conditions are defined before the A/B split node.
If sending an email, the email node will automatically ensure that customers who do not have the email attribute, or the appropriate consent, do not receive the email. However, for the A/B test to be correct, you also need to eliminate such people from the Control Group. You therefore need explicit conditions, placed before the A/B split, that exclude customers without the email attribute or without the relevant consent.
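The idea of filtering before splitting can be illustrated outside the platform with a minimal sketch. The customer fields (`id`, `email`, `consent`), the experiment salt, and the hash-based 50:50 assignment are all assumptions made for the example; they do not describe how the A/B split node works internally.

```python
import hashlib


def is_eligible(customer: dict) -> bool:
    """Same condition for both groups: email attribute present and consent given."""
    return bool(customer.get("email")) and customer.get("consent") is True


def assign_group(customer_id: str, salt: str = "hypothetical-experiment") -> str:
    """Deterministic 50:50 assignment based on a hash of the customer id."""
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode()).hexdigest()
    return "Variant" if int(digest, 16) % 2 == 0 else "Control Group"


def split_eligible(customers: list) -> dict:
    """Filter first, then split, so ineligible customers enter neither group."""
    return {c["id"]: assign_group(c["id"]) for c in customers if is_eligible(c)}


customers = [
    {"id": "c1", "email": "a@example.com", "consent": True},
    {"id": "c2", "email": None, "consent": True},              # no email attribute
    {"id": "c3", "email": "c@example.com", "consent": False},  # no consent
    {"id": "c4", "email": "d@example.com", "consent": True},
]
print(split_eligible(customers))
```

If the filter ran only on the Variant branch, the Control Group would still contain customers who could never have received the email, and the two groups would no longer be comparable.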
Custom Control Group
Similarly, you want to ensure your A/B test groups are clean in weblayers/banners, too.
To do that, check whether your weblayer Variant A contains any special JavaScript conditions. These conditions have to be included in the Control Group, too.
For example, 'Viewcount' banners usually contain a condition that specifies the minimum number of views an item must have recently had for the banner to be displayed (and tracked). The Control Group has to apply the same condition so that both groups are tracked under the same circumstances.
Evaluating the A/B Test
Tracking and Other Checks
Make sure to create an evaluation dashboard for each use case before (or within a few days of) launching it.
Come back to it after having launched the use case to:
- check that everything is running and being tracked,
- polish the dashboard.
You can include basic information about the use case in the dashboard. Documentation (along with processes) becomes increasingly essential as organizations grow. Such information can include:
- brief description of the use case
- brief description of each Variant
- date when the use case was launched
- date when anything about the use case changed (this should also start a new evaluation period for the use case)
- attribution window used within the dashboard
Customers to Filter Out
In order to make sure that your use case provides you with valid information, you should filter out your employees, your agencies, Bloomreach Engagement, and outliers - customers with unusually high order value or frequency.
This should be done at the end of the evaluation, as the cut-off will vary per project/use case and will not be known before launching it. You can plot the distribution of the number of purchases and their total amount and decide on the cut-off based on it.
Time-Saving Note: You want to avoid adding the filters to every dashboard component separately. To do that, create the filter directly in the A/B test segmentation. If you have a set of filters that you always use to exclude customers from evaluations, create a global segmentation that includes all of these filters. You then add a single condition to each evaluation's customer filter instead of all of them.
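One possible way to turn the distribution into a cut-off is a percentile rule, sketched below. The choice of percentile is an assumption made purely for illustration; the right cut-off depends on your project and should be read off the actual distribution.

```python
import math


def percentile_cutoff(values, pct: float) -> float:
    """Nearest-rank percentile: the smallest value covering pct percent of the data."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def drop_outliers(total_by_customer: dict, pct: float = 99.0) -> dict:
    """Keep only customers whose total purchase amount is at or below the cut-off."""
    cutoff = percentile_cutoff(total_by_customer.values(), pct)
    return {cid: total for cid, total in total_by_customer.items() if total <= cutoff}


# Hypothetical totals: nine ordinary customers and one extreme purchaser.
totals = {f"c{i}": 10 * i for i in range(1, 10)}
totals["c10"] = 5000
print(drop_outliers(totals, pct=90.0))
```

The same helper can be applied to the number of purchases per customer to catch outliers in order frequency.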
Half-Way Check
Your use case may be hurting the website, with such a negative impact that the results will not recover by the end of the predetermined testing period.
Let us say you need to wait for 20 days to get significance. After 10 days, you should have a look at what is happening. Uplifts should be positive. If they are negative, use the Bayesian calculator to decide whether to stop the use case early (if the results are really bad) or let it run.
At the End
After waiting for the period you determined at the launch of the use case, you need to see whether your use case has achieved the hypothesized results. The Bayesian calculator gives you three numbers: the probability that the Variant is better than the Control Group, the expected uplift if the Variant is better, and the expected loss if the Variant is worse. Based on those numbers, you (with our advice) need to decide whether to keep the use case.
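To make the three numbers concrete, here is a minimal Monte Carlo sketch of a Bayesian comparison of two conversion rates. It assumes a Beta-Binomial model with a uniform Beta(1, 1) prior and measures uplift and loss as relative differences in conversion rate; these modeling choices and all names are assumptions, and the calculator in the platform may define the quantities differently.

```python
import random


def bayesian_ab(control_conversions: int, control_visitors: int,
                variant_conversions: int, variant_visitors: int,
                draws: int = 20000, seed: int = 0) -> dict:
    """Sample both posterior conversion rates and summarise the comparison."""
    rng = random.Random(seed)
    wins = 0
    uplift_sum = 0.0
    loss_sum = 0.0
    for _ in range(draws):
        # Posterior under a Beta(1, 1) prior: Beta(1 + successes, 1 + failures).
        c = rng.betavariate(1 + control_conversions,
                            1 + control_visitors - control_conversions)
        v = rng.betavariate(1 + variant_conversions,
                            1 + variant_visitors - variant_conversions)
        relative_diff = (v - c) / c
        if v > c:
            wins += 1
            uplift_sum += relative_diff
        else:
            loss_sum -= relative_diff
    losses = draws - wins
    return {
        "prob_variant_better": wins / draws,
        "expected_uplift_if_better": uplift_sum / wins if wins else 0.0,
        "expected_loss_if_worse": loss_sum / losses if losses else 0.0,
    }


# Hypothetical counts: 100/1000 conversions in Control, 150/1000 in the Variant.
print(bayesian_ab(100, 1000, 150, 1000))
```

A high probability combined with a small expected loss suggests low risk in keeping the Variant; a moderate probability with a large expected loss is a reason to keep testing or to stop.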
You should also check the conversion trend - it should be consistently higher for the Variant (i.e., not generally lower, with one significant spike that an unrelated circumstance may have caused). Keep in mind that the pattern is highly random for the first days/week, meaning no conclusions can be derived from it.
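A simple way to check consistency is to count on how many days the Variant converted better than the Control Group, ignoring the noisy first days. The sketch below assumes you have exported daily visitor and order counts per group; the tuple layout and the 7-day warm-up are assumptions made for the example.

```python
def variant_lead_share(daily_counts, warmup_days: int = 7):
    """Share of days (after the warm-up) on which the Variant converted better.

    daily_counts: one tuple per day in the form
    (control_visitors, control_orders, variant_visitors, variant_orders).
    Returns None when no day can be compared.
    """
    days_ahead = 0
    days_compared = 0
    for control_visitors, control_orders, variant_visitors, variant_orders in daily_counts[warmup_days:]:
        if control_visitors == 0 or variant_visitors == 0:
            continue  # no traffic in one group, nothing to compare
        days_compared += 1
        if variant_orders / variant_visitors > control_orders / control_visitors:
            days_ahead += 1
    return days_ahead / days_compared if days_compared else None


# Hypothetical export: two warm-up days followed by four evaluated days.
sample = [
    (100, 5, 100, 1), (100, 2, 100, 9),
    (100, 5, 100, 6), (100, 5, 100, 7), (100, 5, 100, 4), (100, 5, 100, 8),
]
print(variant_lead_share(sample, warmup_days=2))
```

A share close to 1 indicates a consistent lead, whereas a positive overall uplift with a share near 0.5 or below hints that a single spike may be driving the result.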
You should also check how the Variant influences new and existing users separately. It may be hurting the existing user base. If you find this to be the case, you may want to run this Variant only on new users (those who arrived since the start of the test).
If the results are not good enough to implement the use case but you still see a point in it (you believe the use case resolves a business problem that is still relevant), you need to tweak it and relaunch it.
Do you understand why the test did not ‘confirm’ your hypothesis? Diving deeper into the data might show that it did work for a specific segment, at a particular time, or under specific conditions (e.g., only on the first impression of a banner) and give you a new hypothesis to test. You can find lessons that lead to a better-informed test and customer insights that are valuable for you. Finding and communicating these insights can still make the test worth it.
Multiple Variant A/B Tests
Multiple Variants (NOT multivariate) are suited best for pretesting (choosing the best Variant), followed by testing the best-performing Variant with the Control Group. Remember that multiple Variant A/B tests are likely to take longer to gain significance than simple A/B tests.
When testing multiple Variants against a Control Group, compare each Variant that appears better than the Control Group against the Control Group itself to see the probability that it really is better.
If more than one Variant beats the Control Group, compare those Variants with each other and pick the winner based on the implementation cost and the uplift it brings.
For example, if the second-best Variant is almost as good as the best one but costs half as much to implement, you will pick the second Variant in most cases.
Updated 9 months ago