Tinkering with A/B Testing

Vincenzo Scalzi

User experience matters

Creating a service on the Internet requires understanding its target user base and how that base will perceive the experience it is presented with. While there are general rules of thumb that can be applied, notably regarding perceived loading times and user interface trends and patterns to follow or avoid, each domain has its own codes to follow to reach a larger audience.

Human factors is a set of disciplines centered around humans and their interaction with their environment. To put it simply, it focuses on ergonomics, safety, productivity, and so on. Each person is different, but some codes work universally or with broad audiences, including subtle ones that few people notice, like slight color palette variants or spacing differences.

In that sense, to build an application that works for everyone, you would need to follow the codes of its domain while including novel ideas that render the interaction unique. However, since each person is different, this would require personalizing the experience automatically, which is near impossible in today's landscape.

To make matters worse, we live in a shifting environment: what pleases a person can change over time, company brands evolve, and technology opens new design and experience opportunities that set trends yet to be discovered.

Following that line of thinking, one would choose to propose a reduced number of experiences and focus on them exclusively. A question remains, though: which set of experiences, among an infinite list of variations, maximizes the perceived value of the service?

A/B Testing: The way of the majority

Decisions are difficult and binding, and our biases often lead to sub-optimal results. Industries that rely on public perception face a substantial amount of apparently unimportant choices: words, colors, images, different ways to reach identical results. Delegating a decision to the concerned party can be a viable option. When in doubt, let the community take part in the decision.

Retail fits the need for experience personalization. Choosing the relevant product to advertise is key to the business and displaying it in the best conditions is favorable for everyone involved in the process.

A/B Testing has become the de facto way to make such decisions. It works by exposing variants corresponding to each option to a sparse but statistically significant subset of the user base and computing an objective score. After the experiment ends, scores are analyzed and the option with the best outcome is usually persisted and deployed to the general public.

There are caveats to account for: the choice of the scoring or heuristic function may sway the decision, as there are different ways to measure the success of an experiment. The same applies to the way user groups are created: depending on the number of participants, the population may be unknowingly biased.
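To make the scoring caveat concrete, here is a minimal JavaScript sketch with invented names and numbers: it ranks variants by conversion rate, one heuristic among many (revenue per user or time on page could rank the same variants differently).

```javascript
// Illustrative scoring heuristic: conversion rate per variant.
// All names and figures below are invented for the example.
function conversionRate(variant) {
  return variant.conversions / variant.participants;
}

function bestVariant(variants) {
  // Keeps the variant with the highest conversion rate.
  return variants.reduce((best, v) =>
    conversionRate(v) > conversionRate(best) ? v : best
  );
}

const results = [
  { name: 'A', participants: 1000, conversions: 52 },
  { name: 'B', participants: 1000, conversions: 61 },
];
console.log(bestVariant(results).name); // → 'B'
```

Swapping `conversionRate` for another heuristic is all it takes to change which option wins, which is why the choice deserves scrutiny before the experiment starts.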

The iconic example of A/B Testing is choosing the color of a button in a checkout form. The team in charge of this page wants a color that is easy to recognize and entices users to move forward with their purchase. They hesitate between the current standard color, the main color of the website palette, and a shade of a trendy color. Instead of deciding themselves, or polling social networks for an answer that might not be representative, they put their options to the test, collect data on the interactions with each option, and persist on the platform the one that performed best.

This explanation does not cover the whole process, though. A/B Testing starts with an analysis of the elements of an application that could be improved, prioritized thanks to analytics solutions and quantifiable goals. The rest is history: business intelligence analyzes the data based on the objectives and produces a prioritized list of elements to optimize; each element is then taken separately to brainstorm hypotheses that could yield better results.

Once hypotheses are ready, there are multiple options:

  • There are pertinent SaaS or in-house A/B Testing solutions with comprehensive offerings from the design of the scenarios to the validation of a hypothesis;
  • If a SaaS solution is not desirable or if you wish to experiment with this process in-house to understand it, you may start with a custom implementation.

Decisions as a service

It is possible to hypothesize that an A/B experiment connects options to scores. Users can participate in experiments and are designated by their user identifier, so that they will not notice changes from one session to the next.

Let us define the following terms and expressions that will guide us through this section:

  • A decision will represent the core of an A/B hypothesis: it is identified and possesses a status;
  • A decision value links the identifier of a decision, an input value and an output value;
  • A user is an identified being that browses the application. It is at best anonymized and exposes a set of decision values.

The simplest definition of the result of a test would be a data structure containing a user and a decision identifier, and the input and output of their relationship. This would be expressed as follows:

user_decision(did, uid, d_in, d_out)

Each decision appears once for each user and the relationship holds the input and output values of that decision for that user. In the relationship, the decision identifier will be queried more often than the user identifier so there should be an index covering it.
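Assuming the DynamoDB backend discussed below, that relation maps naturally onto a table whose partition key is the decision identifier, since it is queried most often, with the user identifier as the sort key to guarantee one row per user per decision. The sketch below is illustrative and may differ from the repository's actual schema.

```javascript
// Hypothetical table layout implied by user_decision(did, uid, d_in, d_out).
// Names mirror the relation above; the billing mode is an illustrative choice.
const tableDefinition = {
  TableName: 'user_decision',
  AttributeDefinitions: [
    { AttributeName: 'did', AttributeType: 'S' },
    { AttributeName: 'uid', AttributeType: 'S' },
  ],
  KeySchema: [
    { AttributeName: 'did', KeyType: 'HASH' },  // most frequently queried
    { AttributeName: 'uid', KeyType: 'RANGE' }, // one row per user per decision
  ],
  BillingMode: 'PAY_PER_REQUEST', // elastic capacity for unpredictable growth
};
console.log(tableDefinition.TableName);
```

With this layout, retrieving statistics for a decision is a single Query call on `did`, which plays the role of the index mentioned above.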

As for interactions, this service should retrieve statistics for decisions, and retrieve and update a collection of decision values for a given user. The growth of this service depends on the number of active users and decisions. Both will change over time, so the solution must be as elastic as possible.

This design would run smoothly on relational databases with row-locking updates, but scaling them is difficult. Key-value databases make statistics retrieval impractical. Document databases that behave like relational databases work best here, offering a reasonably sound structure and high elasticity in terms of compute and storage.

Settling on Function as a Service is a safe bet for a Proof of Concept, considering the low amount of computation required, the criticality of the service, and the operational needs; in this case, AWS Lambda is the designated platform. However, depending on the service's growth, dedicated computing capacity may quickly become a better alternative, with better throughput and performance at lower cost. One could even avoid the computing layer entirely by developing a library that integrates with existing systems. On the plus side, the code required for our basic features is minimal and will not significantly impact general website performance.

To avoid exposing sensitive information to the public, there are two major solutions:

  • Allowing each user to only query and update their statistics and using internal services to access global statistics;
  • Querying and updating it from the backends only and transferring the responsibility to the backends.

The second is the solution of choice for this post; it is leaner and leads to faster iteration. To ease access to these features within a set of AWS accounts, an Application Load Balancer with an internal domain name may front the compute. Do not forget to set the right permissions so that the load balancer is reachable from your accounts. An Amazon API Gateway would work similarly, and the two can be used interchangeably. Nevertheless, the presented infrastructure shall remain internal and would not use Amazon API Gateway features such as API keys, authorizers, API stages, throttling, rate limiting, request and response validation, and so on.

The solution

The demonstration code is publicly available and can be found in this repository.

It is not advisable to use this solution in your environments unless it perfectly fits your business. Nevertheless, digging into the code and extracting ideas may bring out unexpected insights for current and future projects, and is thus highly recommended.

Figure 1 – An overview of the service architecture

Code

This application is exposed through two endpoints on the Application Load Balancer, connecting to two different Lambda functions. Both Lambda functions share the same Layer, or function dependency, which in this case is the DynamoDB access layer. Input handling and business logic are both located in the handler function called by AWS Lambda.

For a given user, adding a decision value is equivalent to creating the data structure mentioned previously. At this point in time, the known information comprises the user identifier, the decision identifier, and the decision input. A void decision input would mean that the user is part of the experiment with an undefined status, which can be interpreted as a control participation or may hide an error. Be certain to always include a default input and output value.
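A hedged sketch of that creation step, with illustrative defaults (`'control'` and `'none'` are invented here, not the repository's values): the function builds the item to write, and the resulting object would be handed to the DynamoDB DocumentClient.

```javascript
// Hypothetical sketch: building the DynamoDB item for one decision value.
// Table and attribute names mirror user_decision(did, uid, d_in, d_out);
// the 'control'/'none' defaults are illustrative.
function buildDecisionItem(uid, decision) {
  return {
    TableName: 'user_decision',
    Item: {
      did: decision.id,
      uid: uid,
      // Defaults prevent a void input from being mistaken for an error.
      d_in: decision.input || 'control',
      d_out: decision.output || 'none',
    },
  };
}

// The resulting object could then be passed to the aws-sdk DocumentClient:
//   documentClient.put(buildDecisionItem(uid, decision)).promise();
console.log(buildDecisionItem('u1', { id: 'carrot', input: 'on' }));
```

Keeping the item construction in a pure function makes the defaults trivial to unit test, separately from the actual network call.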

Using Promise.all can be risky in this situation: if one of the wrapped asynchronous calls fails, the result is a partially created or updated collection of decisions. Promise.allSettled, as a drop-in replacement, solves this issue by not interrupting on the first promise rejection. This function is supported by the version of v8 packaged in the nodejs12.x AWS Lambda runtime. The creation or update would then only ignore failed writes and updates, which is better but not perfect.
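The difference is easy to demonstrate with stubbed writes (the promises below stand in for DynamoDB calls): Promise.all would reject as soon as the second write fails, while Promise.allSettled reports every outcome.

```javascript
// Stand-ins for three asynchronous decision writes, one of which fails.
const writes = [
  Promise.resolve('decision 1 written'),
  Promise.reject(new Error('decision 2 failed')),
  Promise.resolve('decision 3 written'),
];

Promise.allSettled(writes).then((results) => {
  // Every promise is reported, fulfilled or not.
  console.log(results.map((r) => r.status));
  // → [ 'fulfilled', 'rejected', 'fulfilled' ]
});
```

Each rejected entry also carries a `reason`, so the handler can log or retry the failed writes instead of silently dropping them.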

To retrieve the version of Node included in an AWS Lambda JavaScript runtime, you would need to create a function containing the following code:

exports.handler = async () => process.versions;

At the time of writing, invoking a Lambda function containing that code would output the following:

{
  "node": "12.14.1",
  "v8": "7.7.299.13-node.16",
  "uv": "1.33.1",
  "zlib": "1.2.11",
  "brotli": "1.0.7",
  "ares": "1.15.0",
  "modules": "72",
  "nghttp2": "1.40.0",
  "napi": "5",
  "llhttp": "2.0.1",
  "http_parser": "2.8.0",
  "openssl": "1.1.1d",
  "cldr": "36.0",
  "icu": "65.1",
  "tz": "2019c",
  "unicode": "12.1"
}

This informs us that the current version of Node in the nodejs12.x AWS Lambda runtime is 12.14.1, which embeds version 7.7.299.13-node.16 of the v8 JavaScript engine.

v8 versions follow a simple convention: every time a version of Chrome is released, the minor version of v8 is incremented by one. If the Google Chrome release number ends with “0”, as in Google Chrome 80, then v8 increments the major version and resets the minor one, leading to v8 v8.0 in this instance. The full process is explained in detail in v8’s documentation.

Promise.allSettled shipped with Google Chrome 76, which embeds v8 7.6, and the nodejs12.x runtime embeds v8 7.7. Since the runtime's v8 version is more recent than the one in which the function first appeared, the function is available in the nodejs12.x AWS Lambda runtime.

Infrastructure

It is recommended to edit the two Terraform data blocks in the ALB provisioning file so that they match your private subnets instead of all the subnets of your default VPC, and to change the internal Route 53 zone name to match your private zone name.

The infrastructure is straightforward and can be provisioned in a matter of minutes with Terraform. This time, simplicity is the way: all the files are located in the infrastructure/ directory and contain a collection of resources for a service.

Provisioning and deprovisioning the infrastructure is, after having prepared your environment, as simple as running terraform apply -auto-approve and terraform destroy -auto-approve.

Here is an excerpt of the Terraform provisioning execution trace:

aws_lb_target_group.tg-statistics: Creating...
aws_lb.lb: Creating...
aws_lb_target_group.tg-decision: Creating...
aws_dynamodb_table.table: Creating...
aws_iam_role.role: Creating...
aws_lambda_layer_version.lyr-crud: Creating...
aws_lb_target_group.tg-decision: Creation complete after 1s [id=arn:aws:elasticloadbalancing:eu-west-1:543055564181:targetgroup/tg-decision/fd68f5159e4dbbb9]
aws_lb_target_group.tg-statistics: Creation complete after 1s [id=arn:aws:elasticloadbalancing:eu-west-1:543055564181:targetgroup/tg-statistics/e7a331286e9a4dc0]
aws_iam_role.role: Creation complete after 1s [id=role-lambda]

[…]

aws_lb_listener_rule.lr-statistics: Creation complete after 3s [id=arn:aws:elasticloadbalancing:eu-west-1:543055564181:listener-rule/app/ab/a36b21fdf190ae06/be95bfb74d2c2266/1c1b21afdfd16ae4]
aws_route53_record.ab: Still creating... [10s elapsed]
aws_route53_record.ab: Still creating... [20s elapsed]
aws_route53_record.ab: Still creating... [30s elapsed]
aws_route53_record.ab: Still creating... [40s elapsed]
aws_route53_record.ab: Still creating... [50s elapsed]
aws_route53_record.ab: Still creating... [1m0s elapsed]
aws_route53_record.ab: Still creating... [1m10s elapsed]
aws_route53_record.ab: Creation complete after 1m10s [id=Z09162511D5GH30JW9GKF_ab.vcz.internal._A]

Apply complete! Resources: 18 added, 0 changed, 0 destroyed.

Everything is now ready for prime time. How can this service be tested and used? The ALB and the Route 53 zone are both private and only accessible from within the VPC where the service is provisioned. To access the service, you would need to call it from within that VPC, for example from a fresh EC2 instance accessed over SSH or EC2 Instance Connect, using curl. This is the manual equivalent of using the HTTP client of your favorite technology or framework.

Step one: creating the decisions for a given user with their input values.

$> curl -X POST -d '{
  "uid": "c16115df5860d4ca5ca6cf3d5c7068c2",
  "decisions": [{
    "id": "1",
    "input": "aaa"
  },{
    "id": "33A-72",
    "input": "245"
  },{
    "id": "carrot",
    "input": "on"
  }]
}' 'http://ab.vcz.internal/api/decision'

"Done!"

Step two: reporting the output values for some decisions.

$> curl -X PATCH -d '{
  "uid": "c16115df5860d4ca5ca6cf3d5c7068c2",
  "decisions": [{
    "id": "1",
    "output": "bbb"
  },{
    "id": "carrot",
    "output": "1.115"
  }]
}' 'http://ab.vcz.internal/api/decision'

"Done!"

Step three: verifying that the service registered the correct inputs and outputs.

$> curl -X GET 'http://ab.vcz.internal/api/statistics?did=carrot'
[{"d_out":"1.115","d_in":"on","uid":"c16115df5860d4ca5ca6cf3d5c7068c2","did":"carrot"}]

$> curl -X GET 'http://ab.vcz.internal/api/statistics?did=33A-72'
{"d_in":"245","uid":"c16115df5860d4ca5ca6cf3d5c7068c2","did":"33A-72"}

$> curl -X GET 'http://ab.vcz.internal/api/statistics?did=doesnotexist'
[]

Everything works as expected. The API resolves internally for each endpoint and returns the expected responses. It is now time to implement it in your application and watch it scale!
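From application code, the curl calls above translate to any HTTP client; here is a thin, hedged sketch that builds the statistics URL used in step three (the hostname only resolves inside the VPC, and the helper name is invented):

```javascript
// Hypothetical helper mirroring the statistics endpoint used with curl above.
function statisticsUrl(host, did) {
  return `http://${host}/api/statistics?did=${encodeURIComponent(did)}`;
}

// From inside the VPC, this URL could be fetched with Node's built-in
// http module:
//   const http = require('http');
//   http.get(statisticsUrl('ab.vcz.internal', 'carrot'), (res) => { /* ... */ });
console.log(statisticsUrl('ab.vcz.internal', '33A-72'));
// → 'http://ab.vcz.internal/api/statistics?did=33A-72'
```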

Start hacking!

There are ways A/B Testing can be generalized: it is possible to test combinations of hypotheses at the same time, which is called Multivariate Testing. In any case, well-tuned testing based on user feedback can provide valuable understanding of a user base and accelerate decisions.

When there is a need for a customized A/B testing experience, the decision is up to you: experiment, or settle on a SaaS. More than a primer on A/B testing, though, this article is an invitation to start hacking on new subjects using bleeding-edge ideas, tools and technologies. In other words, it is now time to try out new ideas and solutions!
