Is experimentation for everyone? A resounding yes, says Jonny Longden. All you need are two ingredients: A strong desire and tenacity to implement it.
There’s a dangerous myth lurking around, and it’s the idea that you have to be a large organization to practice experimentation. But it’s actually the smaller companies and start-ups that need experimentation the most, says Jonny Longden of performance marketing agency Journey Further.
With over a decade of experience in conversion optimization and personalization, Jonny co-founded Journey Further to help clients embed experimentation into the heart of what they do. He currently leads the conversion division of the agency, which also focuses on PPC, SEO, PR — among other marketing specializations.
Any company that wants to unearth any sort of discovery should be using experimentation, especially start-ups who are in the explorative phase of their development. “Experimentation requires no size: It’s all about how you approach it,” Jonny shared with AB Tasty’s VP Marketing Marylin Montoya.
Here are a few of our favorite takeaways from our wide-ranging chat with Jonny.
.
The democratization of experimentation
People tend to see more experimentation teams and programs built at large-scale companies, but that doesn’t necessarily mean other companies of different sizes can’t dip their toes in the experimentation pool. Smaller companies and start-ups can equally benefit from this as long as they have the tenacity and capabilities to implement it.
You need to truly believe that without experimentation, your ideas won’t work, says Jonny. There are things that you think are going to work and yet they don’t. Conversely, there are many things that don’t seem like they work but actually end up having a positive impact. The only way to arrive at this conclusion is through experimentation.
Ultimately, the greatest discoveries (for example, space, travel, medicine, etc.) have come from a scientific methodology, which is just observation, hypothesis, testing and refinement. Approach experimentation with this mindset, and it’s anyone’s game.
Building the right roadmaps with product teams
Embedding experimentation into the front of the product development process is important, but yet most people don’t do it, says Jonny. From a pure business perspective, it’s about trying to de-risk development and prove the value of a change or feature before investing any more time, money and bandwidth.
Luckily, the agile methodology employed by many modern teams is similar to experimentation. Both rely on iterative customer collaboration and a cycle of rigorous research, quantitative and qualitative data collection, validation and iteration. The sweet spot is the collection of both quantitative and qualitative data — a good balance of feedback and volume.
The success of building a roadmap for an experimentation program comes down to understanding the organizational structure of a company or industry. In SaaS companies, experimentation is embedded into the product teams; for e-commerce businesses, experimentation fits better into the marketing side. Once you’ve determined the owner and objectives of the experimentation, you’ll need to understand whether you can effectively roll out the testing and have the right processes in place to implement results of a test.
Experimentation is, ultimately, innovation
The more you experiment, the more you drive value. Experimentation at scale enables people to learn and build more tests based on these learnings. Don’t use testing to only identify winners because there’s much more knowledge to be gained from the failed tests. For example, you may only have 1 in 10 tests that work. The real value comes in the 9 lessons you’ve acquired, not just the 1 test that showed positive impact.
When you look at it through these lenses, you’ll realize that the post-test research and subsequent actions are vital: That’s where you’ll start to make more gains toward bigger innovation.
Jonny calls this the snowball effect of experimentation. Experimentation is innovation — when done right. At the root, it’s about exploring and seeing how your customers respond. And as long as you’re learning from the results of your tests, you’ll be able to innovate faster precisely because you are building upon these lessons. That’s how you drive innovation that actually works.
What else can you learn from our conversation with Jonny Longden?
Moving from experimentation to validation
How to maintain creativity during experimentation
Using CRO to identify the right issues to tackle
The required building blocks to successful experimentation
About Jonny Longden
Jonny Longden leads the conversion division of Journey Further, a performance marketing agency specializing in PPC, SEO, PR, etc. Based in the United Kingdom, the part-agency, part-consultancy helps businesses become data-driven and build experimentation into their programs. Prior to that, Jonny dedicated over a decade in conversion optimization, experimentation and personalization, working with Sky, Visa, Nike, O2, Mvideo, Principal Hotels and Nokia.
About 1,000 Experiments Club
The 1,000 Experiments Club is an AB Tasty-produced podcast hosted by Marylin Montoya, VP of Marketing at AB Tasty. Join Marylin and the Marketing team as they sit down with the most knowledgeable experts in the world of experimentation to uncover their insights on what it takes to build and run successful experimentation programs.
Are you on the brink of launching a new feature – one that will affect many of your high-value clients? You’ve worked hard to build it, you’re proud of it and you should be!
You can’t wait to release it for all your users, but wait! What if you’ve missed something? Something that would ruin all your engineering efforts?
There’s nothing worse than starting the day after a release by having to immediately deal with a number of alerts for production issues and spending the day checking a number of logging and monitoring systems for errors and, ultimately, having to rollback the feature you just launched. You would just feel frustrated and unmotivated.
In addition to sapping the morale of your technical teams, NIST has shown that the longer a bug takes to be detected, the more costly it is to fix. This is illustrated by the following graph:
This is explained by the fact that once the feature has been released and is in production, finding bugs is difficult and risky. In addition to preventing users from being affected by problems, it’s critical to ensure service availability.
Are you sure your feature is bug-free?
You might think that this won’t happen to you. That your feature is safe and ready to deploy.
History has shown that it can happen to the biggest companies. Let’s name a few examples.
Facebook, May 7, 2020. An update to Facebook’s SDK rolled out to all users, missed a bug: a server value that was supposed to provide a dictionary of things was changed to provide a simple YES/NO instead. This really tiny change was enough to break Facebook’s authentication system and affect tons of other apps like TikTok, Spotify, Pinterest, Venmo, and other apps that didn’t even use Facebook’s authentication system as it is extremely common for apps to connect to Facebook regardless of whether they use a Facebook-related feature, mainly for ad attribution. The result was unequivocal, the app simply crashed right after launch. Facebook fixed the problem in a hurry, with about two hours for things to get back to normal. But do you have the same resources as Facebook?
Apple, September 19, 2012. Another good example, even though it’s a bit older, would be the replacement of Google Maps with Apple Maps in iOS 6 in 2012 on iOS devices. For many customers and especially fans, Apple always handles the rollout of new features carefully, but this time they messed up. Apple didn’t want to be tied to Google’s app anymore, so they made their own version. However, in their rush to release their map system, Apple made some unforgivable navigational mistakes. Among the many failures of Apple Maps are erased cities, disappearing buildings, flattened landmarks, duplicate islands, distorted graphics, and erroneous location data. A large part of this mess could have been avoided if they had deployed their new map application progressively. They would have been able to spot the bugs and quickly fix them before massive deployment.
And now, thinking about this and seeing that even big companies are impacted, you’re stressed out and may not even want to release it anymore.
But don’t worry! At AB Tasty, we know that building a feature is only half of the story and that to be truly effective, that feature has to be well deployed.
Our feature management service has you covered. You’ll find a set of useful features, such as progressive rollout, to free you from the fear of a release catastrophe and erase feature management frictions, so that you can focus on value-added tasks to get high-quality features into production and apply your energy and innovation in the best way possible, thereby delivering maximum value to your customers.
What’s progressive rollout?
So now you’re curious: what’s progressive rollout? How will this help me monitor the release and make sure everything is okay?
A progressive rollout approach lets you test the waters of a new version with a restricted set of clients. You can set percentages of users to whom your feature will be released and gradually update the percentage to safely deploy your feature. You can also do a canary launch by manually targeting several groups of people at various stages of your rollout.
This is a practice already used by large companies that have realized the significant benefits of a progressive rollout.
Netflix, for example, is one of the most dynamic companies and its developers are constantly releasing updates and new software, but users rarely experience downtime and encounter very few bugs or issues. The company is able to deliver such a smooth experience thanks to sophisticated deployment strategies, such as Canary deployment and progressive deployment, multiple staging environments, blue/green deployments, traffic splitting, and easy rollbacks to help development teams release software changes with confidence that nothing will break.
Disney is another good example of a company that makes the most of progressive deployment. It has taken the phased deployment approach to a whole new level for its “Disney +” and “Star” streaming services by deploying them regionally rather than globally. This delivery method is driven by the needs of the business. The company is making sure that everything is ready at the regional level, in line with its focus on the most important markets. Prior to launching Disney+ in Europe, it spent a lot of time building the local infrastructure needed to deliver a high-quality experience to consumers when launching Disney+ in Europe, including establishing local colocation facilities and beefing up data centers to cache content regionally. After starting to roll it out in Europe, Disney was able to identify that, for some markets, the launch of Disney+ could actually create issues that would have resulted in latency and thus provide a poor experience for affected users. So they took proactive steps to reduce their overall bandwidth usage by at least 25% prior to their march 24 launch and delayed their launch in France by two weeks. Without progressive deployment, they wouldn’t have been able to identify these issues. And that’s why the launching of Disney + was remarkable.
What are the benefits of the progressive rollout?
There are three main benefits to the progressive rollout approach.
Avoiding bugs impacting everyone in production at once
First, by slowly ramping up the load, you can monitor and capture metrics about how the new feature impacts the production environment. If any unanticipated issues come to light, you can pause the full launch, fix the issues, and then smoothly move ahead. This data-driven approach guarantees a release with maximum safety and measurable KPIs.
Validating the “Viable part” in your MVP
You can effectively measure how well your feature is welcomed by your users. If you launch a new feature to 10% of your client base and notice revenue or engagement taking a dip, you can pause the release and investigate. The other major advantage? Anticipating costs. Since margin, profit and revenue are an important part of sustainability, unexpected costs that blow up your projected budgets at the end of the month are almost as bad as the night sweats that come from an unexpected bug! Monitoring your costs during a progressive rollout and immediately pausing the launch if those costs spike is a phenomenal level of control that you will absolutely want to get in on.
Progressively deploying services based upon business drivers
Finally, deploying a service or product progressively can also be seen as a way of prioritizing specific markets based on data-driven business plans. Disney, for example, decided not to launch the service in the U.S. when it launched “Star,” its new channel available in the Disney+ catalog for international audiences, which will feature more mature R-rated movies, FX TV shows, and other shows and movies that Disney owns the rights to but that do not fit the Disney+ family image. Ironically: U.S. customers will have to pay extra on their Disney+ subscription to access the same content on the other streaming service, Hulu.
The decision was made following a complex matrix of rights agreements and revenue streams. Disney found that subscribers are willing to pay for the separate Hulu and Disney+ libraries in the U.S., but that Star’s more limited lineup was enough to justify a standalone paid purchase for international customers, who will have to add $2 to their initial $6.99 subscription to access it. When the content library for Star is enough to justify not going through Hulu anymore, the U.S. customers will have access to it by paying just 1$ more. This progressive rollout approach has enabled Disney to make sure that once they launch Star in the U.S., everything will be ready and they will achieve good results.
In other words, the progressive rollout approach helps you ensure that your functionality meets the criteria of usability, viability, and desirability in accordance with your business plan.
How to act fast when you identify bugs while progressively deploying a feature?
Now that you know more about the progressive rollout of your features/products, you may be wondering how to take action if you identify bugs or if things aren’t going well. Lucky for you, we’ve thought of that part too. In addition to progressive rollout, you’ll also find automatic rollback on KPIs and feature flagging in the AB Tasty toolkit.
Feature flagging will let you set up flags on your feature, that work as simply as a switch on/off button. If for any reason you identify threats in your rollout or if the engagement of your users is not really convincing, you can simply toggle your feature off and take time to fix any issues.
This implies that you are aware and that someone from the product team is available to turn it off. But what if something happens overnight and no one can check on the progress of the deployment? Well, for that eventuality, you can set up automatic rollbacks (also called Rollback Threshold) linked to key performance indicators. Our algorithm will check the performance of your deployment and, based on the KPIs you set, if something goes wrong, it will automatically roll back the deployment and inform you that a problem has occurred. This way, in the morning, your engineers will be able to fix the problems without having to deal with the rollback themselves.
Conclusion
Downtime incidents are stressful for both you and your customers. To resolve them quickly and efficiently, you need to have access to the right tools and make the most of them. The progressive rollout, automatic rollback, and feature flagging are great levers to relieve your product teams of stress and let them focus on innovating your product to create a wonderful experience for your users. Highly effective organizations have already realized the importance of having the right approach to deployment with the right tools. What about your organization?
AB Tasty minimizes risk and maximizes results to make the lives of Product teams a whole lot easier. Create a free account today!
In the previous post in this series, we mentioned how important it is to adhere to naming conventions to keep track of your flags, such as who created them and for what purpose.
We also mentioned that in a worst case scenario without such conventions, someone could accidentally trigger the wrong flag inciting disruptions within your system.
This brings us to why it’s also essential to manage and control who has access to the flags in your system. Without access control, you might also risk someone switching on a faulty feature and going live to your user base.
Consequently, access control should be used to manage the level of control people across your organization have and restrict usage if necessary.
The general rule of thumb is to only give users the kind of access they require to do their job, not more and not less.
Let’s look at a practical example…
AB Tasty, for example, offers a feature flagging functionality that is designed to be used by product and marketing teams as well as developers while offering fine control over user access.
As you can see in the image above, there are different categories of access under ‘Role’ in the AB Tasty dashboard. There are 4 categories overall, which are divided as follows:
SuperAdmin- this is the category with the highest access privileges. The SuperAdmin can do anything from creating a project to toggling on/off a use case as well as access reporting.
Admin- the Admin can do almost anything a SuperAdmin can except create, delete or rename a project
User
Creator
Viewer
User, Creator and Viewer have varying but lower levels of access with Viewer having the lowest level of access.
It’s essential that as your use-cases grow and as you go further along the feature flag journey, you are able to manage access for users at a detailed level.
This is also useful when you’ve scheduled regular flag clean-ups to avoid technical debt. By implementing access control, you’ll minimize the risk of having someone on your team accidentally delete a permanent flag. Similarly, it allows teams to track short-lived flags they created so that they can delete them from your system to avoid accumulation of technical debt.
This way, each flag owner would be responsible for assessing the flag they own and keep tabs on it by grouping flags by owner instead of allocating the task of auditing the entire set of flags in your system at random.
Determining who gets access to what
As you’ve seen so far, different teams within your organization could and should have different levels of visibility and access to flags.
Once you’ve decided who gets access to which flags, you’ll need to consider how to control access.
Are technical teams the only ones allowed to make changes or will other teams such as your product and marketing teams get access to flags too? If so, you might need a sophisticated control panel to help them make changes.
Opting for feature flags as a service from a third party will make this process easier rather than building your own system as such services would provide you with a rich UI control panel without costing you too much time and money.
In contrast, building your own advanced feature flag service could drain your resources, especially as your use cases become more advanced and as more of your teams start using this service simultaneously.
Read more about whether you’re better off building or buying a feature flagging management system.
Audit logs
One way to track changes and control access to feature flags is by constructing an audit log. Such a log would help you keep track of all changes being made to each feature flag, such as who made the changes and when.
It allows for complete transparency, giving full visibility over the implementation of any feature flag changes in your system.
Because flag configuration carries risk, having an audit trail of change is crucial, especially when flags are being used in a highly regulated environment such as in finance or healthcare.
Thus, this makes the process more secure when you restrict access to the number of people allowed to make changes to sensitive flags.
Why does this matter?
At this point, you may be wondering why you should even be considering giving all your teams, especially your non-technical teams access to feature flags.
While your team of developers are the ones who create these flags, it is usually your marketing or product teams who will decide when to release a feature and for what purpose. They are the ones that are working closely with your customers and know their needs and preferences.
It is usually the case that your marketing team is the one forming these experiments in the first place to test out a new idea and this team would know who are the most relevant users to test on.
Therefore, the marketing team would often be the one who’s handling the experimental toggles, for example, to run A/B tests.
Meanwhile, operational toggles such as a kill switch are usually handled by the operations team.
When you have a clear-cut UI panel, you are restricting what those teams can do so each team focuses on managing the flags they’re actually responsible for without allowing any room for error.
A management system should allow you to keep track of your toggles, who created them and who can flip them on and off safely.
Conclusion
It’s important that you ensure you have full visibility over the flags in your system and that everyone across your teams know their level of responsibility when it comes to implementing feature flags.
Most sophisticated feature management solutions provide role-based access controls as well as an audit log for flags and contributors for increased transparency over your flags.
With such solutions, you would have an easy to use interface that allows you to expand feature management beyond your development team in a safe and controlled manner.
Feature flags are an indispensable tool to have when it comes to software development and release that can be used for a wide range of use-cases.
You know the saying ‘you can never have too much of a good thing’? In this case, the opposite is true. While feature flags can bring a lot of value to your processes, they can also pose some problems particularly if mismanaged. Such problems particularly come about when using an in-house feature flagging platform.
In this series of posts of ‘Fun with Flags’, we will be introducing some essential best practices when it comes to using and implementing feature flags.
In this first post, we will focus on an important best practice when it comes to keeping your flags organized: naming your flags.
What are feature flags?
First things first, let’s start with a brief overview of feature flags.
Feature flags are a software development practice that allows you to turn certain functionalities on and off to safely test new features without changing code.
Feature flags mitigate risks of new releases by decoupling code deployment from release. Therefore, any new changes that are ready can be released while features that are still a work-in-progress can be toggled off making them invisible to users.
There are many types of feature flags, categorized based on their dynamism and longevity that serve different purposes.
As the number of flags within your system grows, the harder it becomes to manage your flags. Therefore, you need to establish some practices in order to mitigate any negative impact on your system.
It is because of these different categories of flags that establishing a naming convention becomes especially important.
Click here for our definitive guide on feature flags.
What’s in a name?
Defining a name or naming convention for your flags is one essential way to keep all the different flags you have in your system organized.
Before creating your first flag, you will need to establish a naming convention that your teams could adhere to when the time comes to start your feature flag journey. Thus, you need to be organized from the very beginning so as not to lose sight of any flags you use now and in the future.
The most important rule when naming your flags is to try to be as detailed and descriptive as possible. This will make it easier for people in your organization to determine what this flag is for and what it does. The more detailed, the less the margin for confusion and error.
We suggest the following when coming up with a naming convention:
Include your team name– this is especially important if you come from a big organization with many different teams who will be using these flags.
Include the date of creating the flag– this helps when it comes to cleaning up flags, especially ones that have been in your system for so long that they’re no longer necessary and have out-served their purpose.
Include a description of the flag’s behavior.
Include the flag’s category– i.e. whether it’s permanent or temporary.
Include information about the test environment and scope as in which part of the system the flag affects.
Let’s consider an example…
Let’s say you want to create a kill switch that would allow you to roll back or turn off any buggy features.
Kill switches are usually long-lived so they may stay in a codebase for a long time, meaning it is essential the name is clear and detailed enough to keep track of this flag since its use can extend across many contexts over a long period of time.
The name of such a flag could look something like this:
algteam_20-07-2021_Ops_Configurator-killswitch
Following the naming convention mentioned above, the first part is the name of the team followed by the date of creation then the category and behavior of the flag. In this case, a kill switch comes under the category of operational (ops) toggles.
Using such a naming convention allows the rest of your teams across your organization to see exactly who created the flag, for what purpose and how long it has been in the codebase.
Why is this important?
Sticking with such naming conventions will help differentiate such long-lived flags from shorter-lived ones.
This is important to remember because an operational toggle such as a kill switch serves a vastly different purpose than, for example, an experiment toggle.
Experiment toggles are created in order to measure the impact of a feature flag on users. Over the course of your feature flag journey, you may create numerous flags such as these to test different features.
Without an established and well-thought out naming convention, you’ll end up with something like ‘new-flag’ and after some time and several flags later, this naming system becomes tedious and repetitive to the point when you cannot determine which flag is actually new.
In this case, you might want to opt for something more specific. For example, if you want to test out a new chat option within your application, you’ll want to name it something akin to:
Markteam_21-07-2021_chatbox-temp
Such a name will clue you into who created it and what this flag was created for, which is to test the new chat box option and of course, you will know it’s a temporary flag.
This way, there will be no more panic from trying to figure out what one flag out of hundreds is doing in your system or ending up with duplicate flags.
Or even worse, someone on your team could accidentally trigger the wrong flag, inciting chaos and disruption in your system and ending up with very angry users.
Don’t drown in debt
Making a clear distinction between the categories of flags is crucial as short-lived, temporary flags are not meant to stay in your system for long.
This will help you clean up those ‘stale’ flags, otherwise they may end up clogging your system giving rise to the dreaded technical debt.
To avoid an accumulation of this type of debt, you will need to frequently review the flags you have in your system, especially the short-lived ones.
This would be no easy task if your flags had names like ‘new flag 1’, ‘new flag 2’, and well, you get the gist.
A wise man once said…
…“the beginning of wisdom is to call things by their proper name” (Confucius). In the case of feature flags, this rings more true than ever.
Whatever naming convention you choose that suits your organization best, make sure that everyone across all your teams stick to them.
Feature flags can bring you a world of benefits but you need to use them with caution to harness their full potential.
With AB Tasty’s flagging functionality, you can remotely manage your flags and keep technical debt at bay by providing you with dedicated features to help you keep your feature flags under control.
For example, the Flag Tracking dashboard within the AB Tasty platform lists all the flags that are set up in your system so you can easily keep track of every single flag and its purpose.
One of the most critical metrics in DevOps is the speed with which you deliver new features. Aligning developers, ops teams, and support staff together, they quickly get new software into production that generates value sooner and can often be the deciding factor in whether your company gains an edge on the competition.
Quick delivery also shortens the time between software development and user feedback, which is essential for teams practicing CI/CD.
One practice you should consider adding to your CI/CD toolkit is the blue-green deployment. This process helps reduce both technical and business risks associated with software releases.
In this model, two identical production environments nicknamed “blue” and “green” are running side-by-side, but only one is live, receiving user transactions. The other is up but idle.
In this article, we’ll go over how blue-green deployments work. We’ll discuss the pros and cons of using this approach to release software. We’ll also compare how they stack up against other deployment methodologies and give you some of our recommended best practices for ensuring your blue-green deployments go smoothly.
[toc]
How do blue-green deployments work?
One of the most challenging steps in a deployment process is the cutover from testing to production. It must happen quickly and smoothly to minimize downtime.
A blue-green deployment methodology addresses this challenge by utilizing two parallel production environments. At any given time, only one of them is the live environment receiving user transactions. In the image below, that would be green. The blue idle system is a near-identical copy.
Your team will use the idle blue system as your test or staging environment to conduct the final round of testing when preparing to release a new feature. Once the new software is working correctly on blue, your ops team can switch routing to make blue the live system. You can then implement the feature on green, which is now idle, to get both systems resynchronized.
Generally speaking, that is all there is to a blue-green deployment. You have a great deal of flexibility in how the parallel systems and cut-overs are structured. For example, you might not want to maintain parallel databases, in which case all you will change is routing to web and app servers. For another project, you may use a blue-green deployment to release an untested feature on the live system, but set it behind a feature flag for A/B user testing.
Example
Let’s say you’re in charge of the DevOps team at a niche e-commerce company. You sell clothing and accessories popular in a small but high-value market. On your site, customers can customize and order products on-demand.
Your site’s backend consists of many microservices in a few different containers. You have microservices for inventory management, order management, customization apps, and a built-in social network to support your customers’ niche community.
Your team will release early and often as you credit your CI/CD model for your continued popularity. But this niche community is global, so your site sees fairly steady traffic throughout any given day. Finding a lull in which to update your production system is always tricky.
When one of your teams announces that their updated customization interface is ready for final testing in production, you decide to release it using a blue-green deployment so it can go out right away.
Animation of load balancer adjusting traffic from blue to green (Source)
The next day before lunch, your team decides they’re ready to launch the new customizer. At that moment, all traffic routes to your blue production system. You update the software on your idle green system and ask testers to put it through Q/A. Everything looks good, so your ops team uses a load balancer to redirect user sessions from blue to green.
Once traffic is completely filtered over to green, you make it the official production environment and set blue to idle. Your dev team pushes the updated customizer code to blue, puts in their lunch order, and takes a look at your backlog.
Pros: Benefits & use cases
One of the primary advantages of blue-green deployments over other software release strategies is how flexible they are. They can be beneficial in a wide range of environments and many use cases.
Rapid releasing
For product owners working within CI/CD frameworks, blue-green deployments are an excellent method to get your software into production. You can release software practically any time. You don’t need to schedule a weekend or off-hours release because, in most cases, all that is necessary to go live is a routing change. Because there is no associated downtime, these deployments have no negative impact on users.
They’re less disruptive for DevOps teams too. They don’t need to rush updates during a set outage window, leading to deployment errors and unnecessary stress. Executive teams will be happier too. They won’t have to watch the clock during downtime, tallying up lost revenue.
Simple rollbacks
The reverse process is equally fast. Because blue-green deployments utilize two parallel production environments, you can quickly flip back to the stable one should any issues arise in your live environment.
This reduces the risks inherent in experimenting in production. Your team can easily remove any issues with a simple routing change back to the stable production environment. There is a risk of losing user transactions cutting back—which we’ll get into a little further down—but many strategies for managing that situation are available.
You can temporarily set your app to be read-only during cutovers. Or you could do rolling cutovers with a load balancer while you wait for transactions to complete in the live environment.
Built-in disaster recovery
Because blue-green deployments use two production environments, they implicitly offer disaster recovery for your business systems. A dual production environment is its own hot backup.
Load balancing
Blue-green parallel production environments also make load balancing easy. When the two environments are functionally identical, you can use a load balancer or feature toggle in your software to route traffic to different environments as needed.
Easier A/B testing
Another use case for parallel production environments is A/B testing. You can load new features onto your idle environment and then split traffic with a feature toggle between your blue and green systems.
Collect data from those split user sessions, monitor your KPIs, and then, if analyses of the new feature look good in your management system, you can flip traffic over to the updated environment.
Cons: Challenges to be aware of
Blue-green deployments offer a great deal of value, but integrating the infrastructure and practices required to carry them out creates challenges for DevOps teams. Before integrating blue-green deployments into your CI/CD pipeline, it is worth understanding these challenges.
Resource-intensive
As is evident by now, to perform a blue-green deployment, you will need to resource and maintain two production environments. The costs of this, in money and sysadmin time, might be too high for some organizations.
For others, they may only be able to commit such resources for their highest value products. If that is the case, does the DevOps team release software in a CI/CD model for some products but not others? That may not be sustainable.
Extra database management
Managing your database—or multiple databases—when you have parallel production environments can be complicated. You need to account for anything downstream of the software update you’re making needs in both your blue and green environments, such as any external services you’re invoking.
For example, what if your feature change requires you to rename a database column? As soon as you change the name to blue, the green environment with old code won’t function with that database anymore.
Can your entire production environment even function with two separate databases? That’s often not the case if you’re using your blue and green systems for load balancing, testing, or any function other than as a hot backup.
A blue-green deployment diagram with a single database (Source)
Product management
Aside from system administration, managing a product that runs on two near-identical environments also requires more resources. Product Managers need reliable tools for tracking how their software is performing, which services different teams are updating, and ways to monitor the KPIs associated with each. A reliable product and feature management dashboard to monitor and coordinate all of these activities becomes essential.
Blue-green deployments vs. rolling deployments
Blue-green deployments are, of course, not the only option for performing rapid software releases. Another popular approach is to conduct a rolling deployment.
Rolling deployments also require a production environment that consists of multiple servers hosting an application, often, but not always, with a load balancer in front of them for routing traffic. When the DevOps team is ready to update their application, they configure a staggered release, pushing to one server after another.
While the release is rolling out, some live servers will be running the updated application, while others have the older version. This contrasts with a blue-green deployment, where the updated software is either live or not for all users.
As users initiate sessions with the application, they might either reach the old copy of the app or the new one, depending on how the load balancer routes them. When the rollout is complete, every new user session that comes in will reach the software’s updated version. If an error occurs during rollout, the DevOps team can halt updates and route all traffic to the remaining known-good servers until they resolve the error.
Rolling deployments are a viable option for organizations with the resources to host such a large production environment. For those organizations, they are an effective method for releasing small, gradual updates, as you would in agile development methodologies.
There are other use cases where blue-green deployments may be a better fit. For example, if you’re making a significant update where you don’t want any users to access the old version of your software, you would want to take an “all or nothing” approach, like a blue-green deployment.
Suppose your application requires a high degree of technical or customer support. In that case, the support burden is magnified during rolling deployment windows when support staff can’t tell which version of an application users are running.
Blue-green deployments vs. canary releasing
Rolling and blue-green deployments aren’t the only release strategies out there. Canary deployments are another alternative. At first, only a subset of all production environments receives a software update in a canary release. But instead of continuing to roll deploy to the rest, this partial release is held in place for testing purposes. A subset of users is then directed to the new software by a load balancer or a feature flag.
Canary releasing makes sense when you want to collect data and feedback from an identifiable set of users about updated software. Practicing canary releases dovetails nicely with broader rolling deployments, as you can gradually roll the updated software out to larger and larger segments of your user base until you’ve finished updating all production servers.
Best practices
You have many options for releasing software quickly. If you’re considering blue-green deployments as your new software release strategy, we recommend you adopt some of these best practices.
Automate as much as possible
Scripting and automating as much of the release process as possible has many benefits. Not only will the cutover happen faster, but there’s less room for human error. A dev can’t accidentally forget a checklist item if a script or a management platform handles the checklist. If everything is packaged in a script, then any developer or non-developer can carry out the deployment. You don’t need to wait for your system expert to get back to the office.
Monitor your systems
Always make sure to monitor both blue and green environments. For a blue-green deployment to go smoothly, you need to know what is going on in both your live and idle systems.
Both systems will likely need the same set ofmonitoring alerts, but set to different priorities. For example, you’ll want to know the second there is an error in your live system. But the same error in the idle system may need to be addressed sometime that business day.
In some cases, new and old versions of your software won’t be able to run simultaneously during a cutover. For example, if you need to alter your database schema, it would help if you structured your updates so that both blue and green systems will be functional throughout the cutover.
One way to handle these situations is to break your releases down into a series of even smaller release packages. Let’s say our e-commerce company is deepening its inventory and needs to update its database by changing a field name from “shirt” to “longsleeve_shirt” for clarity.
They might break this update down by:
Releasing a feature flag-enabled intermediary version of their code that can interpret results from both “shirt” and “longsleeve_shirt”;
Running a rename migration across their entire database to rename the field;
Releasing the final version of the code—or flip their feature flag—so the software only uses “longsleeve_shirt.”
Do more, smaller deployments
Smaller, more frequent updates are already an integral practice in agile development and CI/CD. It is even more important to follow this practice if you’re going to conduct blue-green deployments. Reducing deployment times shortens feedback loops, informing the next release, making each incremental upgrade more effective and more valuable for your organization.
Restructure your applications into microservices
This approach goes hand-in-hand with conducting smaller deployments. Restructuring application code into sets of microservices allows you to manage updates and changes more easily. Different features are compartmentalized in a way that makes them easier to update in isolation.
Use feature flags to reduce risk further
By themselves, blue-green deployments create a single, short window of risk. You’re updating everything, all-or-nothing, but you can cut back if needed should an issue arise.
Blue-green deployments also have a pretty consistent amount of administrative overhead that comes with each cutover. You can reduce this overhead through automation, but still, you’re going to follow the same process no matter whether you’re updating a single line of code or you’re overhauling your entire e-commerce suite.
Those conditions can be simple “yes/no” checks, or they can be complex decision trees. Feature flags help make software releases more manageable by controlling what is turned on or off at a feature-by-feature level.
For example, our e-commerce company can perform a blue-green deployment of their customizer microservice but leave the new code turned off behind a feature flag in the live system. Then, the DevOps team can turn on that feature according to whatever condition they wish, whenever it is convenient.
The team might want to do some further A/B testing in production. Or maybe they want to conduct some further fitness tests. Or it might make more sense for the team to do a canary release of the customizer for an identified set of early adopters.
Your feature flags can work in conjunction with a load balancer to manage which users see which application and feature subsets while performing a blue-green deployment. Instead of switching over entire applications all at once, you can cut over to the new application and then gradually turn individual features on and off on the live and idle systems until you’ve completely upgraded. This gradual process reduces risk and helps you track down any bugs as individual features go live one-by-one.
You can manually control feature flags in your codebase, or you can use feature flag services for more robust control. These platforms offer detailed reporting and KPI tracking along with a deep set of DevOps management tools.
We recommend using feature flags in any major application release when you’re doing a blue-green deployment. They’re valuable even in smaller deployments where you’re not necessarily switching environments. You can enable features gradually one at a time on blue, leaving green on standby as a hot backup if a major problem arises. Combining feature flags with blue-green deployments is an excellent way to perform continuous delivery at any scale.
Consider adding blue-green deployments to your DevOps arsenal
Blue-green deployments are an excellent method for managing software releases of any size, no matter whether they’re a whole application, major updates, a single microservice, or a small feature update.
It is essential to consider how well blue-green deployments will integrate into your existing delivery process before adopting them. This article detailed how blue-green deployments work, the pros and cons of using them in your delivery process, and how they stack up against other possible deployment methods. You should now have a better sense of whether blue-green deployments might be a viable option for your organization.
Picking an effective deployment strategy is an important decision for every DevOps team. Many options exist, and you want to find the strategy that best aligns with how you work. Today, we’ll go over canary deployments.
Are you an agile organization? Are you performing continuous integration and continuous delivery (CI/CD)? Are you developing a web app? Mobile app? Local desktop or cloud-based app? These factors, and many others, will determine how effective any given deployment strategy will be.
But no matter which strategy you use, remember that deployment issues will be inevitable. A merge may go wrong, bugs may appear, human error may cause a problem in production. The point is, don’t wear yourself out trying to find a deployment strategy that will be perfect. That strategy doesn’t exist.
Instead, try to find a strategy that is highly resilient and adaptive to the way you work. Instead of trying to prevent inevitable errors, deploy code in a way that minimizes errors and allows you to respond when they do occur quickly.
Canary deployments can help you put your best code into production as efficiently as possible. In this article, we’ll go over what they are and what they aren’t. We’ll go over the pros and cons, compare them to other deployment strategies, and show you how you can easily begin performing such deployments with your team.
In this article, we’ll go over:
[toc]
What is a canary deployment?
Canary deployments are a best practice for teams who’ve adopted a continuous delivery process. With this strategy, a new feature is first made available to a small subset of users. The new feature is monitored for several minutes to several hours, depending on the traffic volume, or just long enough to collect meaningful data. If the team identifies an issue, the new feature is quickly pulled. If no problems are found, the feature is made available to the entire user base.
The term “canary deployment” has a fascinating history. It comes from the phrase “canary in a coal mine,” which refers to the historical use of canaries and other small songbirds as living early-warning systems in mines. Miners would bring caged birds with them underground. If the birds fell ill or died, it was a warning that odorless toxic gases, like carbon monoxide, were present. While inhumane, it was an effective process used in Britain and the US until 1986, when electronic sensors replaced canaries.
A canary deployment turns a subset of your users —ideally a bug-tolerant subset— into your own early warning system. That user group identifies bugs, broken features, and unintuitive features before your software gets wider exposure.
Your canary users could be self-identified early adopters, a demographically targeted segment, or a random sampling. Whichever mix of users makes the most sense for verifying your new feature in production.
One helpful way to think about canary deployments is risk management. You are free to push new, exciting features more regularly without having to worry that any one new feature will harm the experience of your entire user base.
Canary releases vs. canary deployments
The phrases “canary release” and “canary deployment” are sometimes used interchangeably, but in DevOps, they really should be thought of as separate. A canary release is a test build of a complete application. It could be a nightly release or a beta, for example.
Teams will often distribute canary releases hoping that early adopters and power users, who are more familiar with development processes, will download the new application for real-world testing. The browser teams at Mozilla and Google, and many other open-source projects, are fond of this release strategy.
On the other hand, canary deployments are what we described earlier. A team will release new features into production with early adopters or different user subsets, routed to the new software by a load balancer or feature flag. Most of the user base still sees the current, stable software.
Canary deployment pros and cons
Canary deployments can be a powerful and effective release strategy. But they’re not the correct strategy in every possible scenario. Let’s run through some of the pros and cons so you can better determine whether they make sense for your DevOps team.
Pros
Support for CI/CD processes
Canary deployments shorten feedback loops on new features delivered to production. DevOps teams get real-world usage data faster, which allows them to refine and integrate the next round of features faster and more effectively. Shorter development loops like this are one of the hallmarks of continuous integration/continuous delivery processes.
Granular control over feature deployments
If your team conducts smaller, regular feature deployments, you reduce the risk of errors disrupting your workflow. If you catch a mistake in the deployment, you won’t have exposed many users to it, and it will be a minor matter to resolve. You won’t have exposed your entire user population and needed to pull colleagues off planned work to fix a major production issue.
Real-world testing
Internal testing has its place, but it is no substitute for putting your application in front of real-world users. Canary deployments are an excellent strategy for conducting small-scale real-world testing without imposing the significant risks of pushing an entirely new application to production.
Quickly improve engagement
Besides offering better technical testing, canary deployments allow you to quickly see how users engage with your new features. Are session lengths increasing? Are engagement metrics rising in the canary? If no bugs are found, get that feature in front of everyone.
There is no need to wait for a more extensive test deployment to complete. Engage those users and get iterating on your next feature.
More data to make business cases
Developers may see the value in their code, but DevOps teams still need to make business cases to leadership and the broader organization when they need more resources.
Canary deployments can quickly show you what demand might be for new features. Conduct a deployment for a compelling new feature on a small group of influencer users to get them talking. Use engagement and publicity metrics to make the case why you want to push a major new initiative tied to that feature.
Stronger risk management
Canary deployments are effectively a series of microtests. Rolling out new features incrementally and verifying them one at a time with canary testing can significantly reduce the total cost of errors or more significant system issues. You’ll never need to roll back a major release, suffer a PR hit, and need to rework a large and unwieldy codebase.
Cons
More overhead
Like any complex process, canary deployments come with some downsides. If you’re going to use a load balancer to partition users, you will need additional infrastructure and need to take on some additional administration.
In this scenario, you create a second production environment and backend that will run alongside your primary environment. You will have two codebases, two app servers, potentially two web servers, and networking infrastructure to maintain.
Alternatively, many DevOps teams use feature flags to manage their canary deployments on a single system. A feature flag can partition users into a canary test at runtime within a single code base. Canary users see the new feature, and everyone else runs the existing code.
Deploying local applications is hard
If you’re developing a locally installed application, you run the risk of users needing to initiate a manual update to get the latest version of your software. If your canary deployment sits in that latest update, your new feature may not get installed on as many client systems as you need to get good test results.
In other words, the more your software runs client-side, the less amenable it is to canary deployments. A full canary release might be a more suitable approach to get real-world test results in this scenario.
Users are still exposed to software issues
While the whole point of a canary deployment is to expose only a few users to a new feature to spare the broader user base, you will still expose end users to less-tested code. If the fallout from even a few users encountering a problem with a particular feature is too significant, then consider skipping this kind of deployment in favor of more rigorous internal testing.
How to perform a canary deployment
Planning out a canary deployment takes a few simple steps:
Identify your canary group
There are several different ways you can select a user group to be your canary.
Random subset
Pick a truly random sampling of different users. While you can do this with a load balancer, feature flag management software can easily route a certain percentage of total traffic to a canary test using a simple modulo.
Early adopters
If you run an early adopter program for highly engaged users, consider using them as your canary group. Make it a perk of their program. In exchange for tolerating bugs they might encounter in a canary deployment, you can offer them loyalty rewards.
By region
You might want to assign a specific region to be your canary. For example, you could set European IPs during late evening hours to go to your canary deployment. You would avoid exposing daytime users to your new features but still get a handful of off-hours user sessions to use as a test.
Internal testers
You can always configure sessions from your internal subnets to be the canary.
Decide on your canary metrics
The purpose of conducting a canary deployment is to get a firm “yes” or “no” answer to the question of whether your feature is safe to push into wider production. To answer that question, you first need to decide what metrics you’re going to use and install the means for monitoring performance.
For example, you may decide you want to monitor:
Internal error counts
CPU utilization
Memory utilization
Latency
You can customize feature management software quickly and easily to monitor performance analytics. These platforms can be excellent tools for encouraging a culture of experimentation.
Decide how to transition from canary to full deployment
As discussed, canary releases should only last on the order of several minutes to several hours. They are not intended to be overly long experiments. Because the timeframe is so short, your team should decide up front how many users or sessions you want in the canary and how you’re going to move to full deployment once your metrics hit positive benchmarks.
For example, you could go with a 5/95 random canary deployment. Configure a feature flag to move a random 5 percent of your users to the canary test while the remaining 95 percent stay on the stable production release. If you see positive results, remove the flag and deploy the feature completely.
Or you might want to take a more conservative approach. Another popular canary strategy is to deploy a canary test logarithmically, going from a 1 percent random sample to 10 percent to see how the new feature stands up to a larger load, then up to a full 100 percent.
Determine what infrastructure you need
Once your team is on the same page about the approach you’ll take, you’ll need to make sure you have all the proper infrastructure in place to make your canary deployment go off without a hitch.
You need a system for partitioning the user base and for monitoring performance. You can use a router or load balancer for the partitioning, but you can also do it right in your code with a feature flag. Feature flags are often more cost-effective and quick to set up, and they can be the more powerful solution.
Canary vs. blue/green deployments
Canary deployments are also sometimes confused with blue/green deployments. Both can use parallel production environments —managed with a load balancer or feature flag— to mitigate the risk of software issues.
In a blue/green deployment, those environments start identical, but only one receives traffic (the blue server). Your team releases a new feature onto the hot backup environment (the green server). Then the router, feature flag, or however you’re managing traffic, gradually shifts new user sessions from blue to green until 100 percent of all traffic goes to green. Once the cutover is complete, the team updates the now-old blue server with the new feature, and then it becomes the hot backup environment.
The way the switchover is handled in these two strategies differs because of the desired outcome. Blue/green deployments are used to eliminate downtime. Canary deployments are used to test a new feature in a production environment with minimal risk and are much more targeted.
Use feature flags for better deployments
When you boil it right down, a feature flag is nothing more than an “if” statement from which users take different code paths at runtime depending on a condition or conditions you set. In a canary deployment, that condition is whether the user is in the canary group or not.
Let’s say we’re running a fledgling social networking site for esports fans. Our DevOps team has been hard at work on a content recommender that gives users real-time recommendations based on livestreams they’re watching. The team has refined the recommendation feature to be significantly faster. It has performed well in internal testing, and now they want to see how it performs under real-world conditions.
The team doesn’t want to invest time and money into installing new physical infrastructure to conduct a canary deployment. Instead, the team decides to use a feature flag to expose the new recommendation engine to a random 5 percent sample of the user base.
The feature flag splits users into two groups with a simple modulo when users load a live stream. Within minutes your team gets results back from a few thousand user sessions with the new code. It does, in fact, load faster and improves user engagement, but there is an unanticipated spike in CPU utilization on the production server. Ops staff realize it is about to degrade performance, so they kill the canary flag.
The team agrees not to proceed with rollout until they can debug why the new code caused the unexpected server CPU spike. Thanks to the real-world test results provided by the canary deployment, they have a pretty good idea of what was going on and get back to work.
Features flags streamline and simplify canary deployments. They mitigate the need for a second production environment. Using feature flag management software like AB Tasty allows sophisticated testing and analysis.
Chad Sanderson breaks down the most successful types of experimentations based on company size and growth ambitions
For Chad Sanderson, head of product – data platform at Convoy, the role of data and experimentation are inextricably intertwined.
At Convoy, he oversees the end-to-end data platform team — which includes data engineering, machine learning, experimentation, data pipeline — among a multitude of other teams who are all in service of helping thousands of carriers ship freight more efficiently. The role has given him a broad overview of the process, from ideation, construction to execution.
As a result, Chad has had a front-row seat that most practitioners never do: The end-to-end process of experimentation from hypothesis, data definitions, analysis, reporting to year-end financials. Naturally, he had a few thoughts to share with AB Tasty’s VP Marketing Marylin Montoya in their conversation on the experimentation discipline and the complexities of identifying trustworthy metrics.
Introducing experimentation as a discipline
Experimentation, despite all of its accolades, is still relatively new. You’ll be hard pressed to find great collections of literature or an academic approach (although Ronny Kohavi has penned some thoughts on the subject matter). Furthermore, experimentation has not been considered a data science discipline, especially when compared to areas of machine learning or data warehousing.
While there are a few tips here and there available from blogs, you end up missing out on the deep technical knowledge and best practices of setting up a platform, building a metrics library and selecting the right metrics in a systematic way.
Chad attributes experimentation’s accessibility as a double-edged sword. A lot of companies have yet to apply the same rigor that they do to other data science-related fields because it’s easy to start from a marketing standpoint. But as the business grows, so does the maturity and the complexity of experimentation. That’s when the literature on platform creation and scaling is scant, leading to the field being undervalued and hard to recruit the right profiles.
When small-scale experimentation is your best bet
When you’re a massive-scale company — such as Microsoft or Google with different business units, data sources, technologies and operations — rolling out new features or changes is an incredibly risky endeavour, considering that fact that any mistake could impact millions of users. Imagine accidentally introducing a bug for Microsoft Word or PowerPoint: The impact on the bottom line would be detrimental.
The best way for these companies to experiment is with a cautious, small-scale approach. The aim is to focus on immediate action, catching things quickly in real time and rolling them back.
On the other hand, if you’re a startup in a hyper-growth stage, your approach will vastly differ. These smaller businesses typically have to show double-digit gains with every new feature rollout to their investors, meaning their actions are more so focused on proving the feature’s positive impact and the longevity of its success.
Make metrics your trustworthy allies
Every business will have very different metrics depending on what they’re looking for; it’s essential to define what you want before going down the path of experimentation and building your program.
One question you’ll need to ask yourself is: What do my decision-makers care about? What is leadership looking to achieve? This is the key to defining the right set of metrics that actually moves your business in the right direction. Chad recommends doing this by distinguishing your front-end and back-end metrics: the former is readily available, the latter not so much. Client-side metrics, what he refers to as front-end metrics, measure revenue per transaction. All metrics then lead to revenue, which in and of itself is not necessarily a bad thing, but that just means all your decisions are based on revenue growth and less on proving the scalability or winning impact of a feature.
Chad’s advice is to start with the measurement problems that you have, and from there, build out your experimentation culture, build out the system and lastly choose a platform.
What else can you learn from our conversation with Chad Sanderson?
Different experimentation needs for engineering and marketing
Building a culture of experimentation from top-down
The downside of scaling MVPs
Why marketers are flagbearers of experimentation
About Chad Sanderson
Chad Sanderson is an expert on digital experimentation and analysis at scale. He is a product manager, writer and public speaker, who has given lectures on topics such as advanced experimentation analysis, the statistics of digital experimentation, small-scale experimentation for small businesses and more. He previously worked as senior program manager for Microsoft’s AI platform. Prior to that, Chad worked for Subway’s experimentation team as a personalization manager.
About 1,000 Experiments Club
The 1,000 Experiments Club is an AB Tasty-produced podcast hosted by Marylin Montoya, VP of Marketing at AB Tasty. Join Marylin and the Marketing team as they sit down with the most knowledgeable experts in the world of experimentation to uncover their insights on what it takes to build and run successful experimentation programs.
Everyone hates tests. Ever since our school days, just hearing the word ‘test’ puts us on high alert and brings nothing but dread.
It seems we cannot escape the word even in software development. And it’s not just any test but a ‘test in production’.
Yes, it is the dreaded phrase that leaves you sweating and your heart pounding. Just reading the phrase may make you envision apocalyptic images of the inevitable disaster that could occur in its wake…
“Test in production” they said. “What could go wrong.” they said
We, too, hate tests but even we have to admit that testing in production is a pretty big deal now. Let us tell you why before you run away in horror…
I don’t always test my code. But when I do, I do it in production.
If it helps, think of it more as an essential part of your software development process and less as an actual ‘test’ where the only two options are pass or fail but for the sake of consistency and clarity, we’ll refer to it here as testing in production and who knows? Maybe by the end of this article, it won’t be so scary anymore!
There is no TEST. PRODUCTION only there is.
So here’s the low-down…
First things first, what is testing in production? Testing in production is when you test new code changes on live users rather than a staging or testing environment.
It may sound downright terrifying when you think about it. So what? You have a feature that is brand new and you’re supposed to unleash it to the wild just like that?
Let us break it down for you with the help of our finest selection of memes about test in production…
At this point, you’re probably vehemently shaking your head. The risks are simply too high for you to consider, especially in this day and age of fickle customers who might leave you at the drop of a hat if you make any simple mistake.
I see you test your code in production. I too like to live dangerously.
You may have a well-established product and you cannot risk upsetting your customers, especially your most loyal customers, and damaging your well-crafted reputation by releasing a potentially buggy feature.
Or you might even just be starting out and you simply cannot afford to make any amateur mistakes.
One does not simply test in production!
Why, oh why, should I test in production?
We’re here to tell you that you should absolutely test in production and here’s your answer as to why:
Testing in production allows you to generate feedback from your most relevant users so that you can adjust and improve your releases accordingly. This means that the end-result is a high-quality product that your customers are satisfied with.
There are no finer QA testers than the clients themselves
Additionally, when you test in production, you have the opportunity to test your ideas and even uncover new features that you had not considered before. Plus, it’s not just engineers who get to do this but your product teams can test out their ideas leading to increased productivity.
I’m just a project manager but sure, I’ll do QA
So now you’re thinking, great but there’s still the issue of it all leading to disaster and disgruntled customers.
But really, it’s not as terrifying as it sounds.
Stand back, we’re trying this in production
Wrap up in a feature flag
When you use feature flags while testing in production, you can expose your new features to a certain segment of your users. That way, not everyone will see your feature and in case anything goes wrong, you can roll back the feature with a kill switch.
What if I told you, you could have both speed and safety
Therefore, you have a quick, easy, and low-risk way to roll out your features and roll back any buggy features to fix them before releasing them to everybody else, lessening any negative impact on your user base if any issues arise.
Be the king (or queen) of your software development jungle
With feature flags, you are invincible. You are in complete control of your releases. All you need to do is wrap up your features in a feature flag and you can toggle them on and off like a light switch!
Gave that switch a flick. Switches love flicks
Still confused? Still feeling a bit wary? If you want to find out more about testing in production, read our blog article and let us show you why it’s very much a relevant process and a growing trend that you need to capitalize on today.
We’ll do it live
With AB Tasty’s flagging functionality, it’s easier than ever to manage testing in production. All you need to do is sit back and reap the benefits.
One of the pioneers of experimentation shares a humbling reality check: Most ideas will fail (and it’s a good thing)
Few people have accumulated as much experience as Ronny Kohavi when it comes to experimentation. His work at tech giants such as Amazon, Microsoft and Airbnb — just to name a few — has laid the foundation of modern online experimentation.
Before the idea of “build fast, deploy often” took hold across tech companies, developers followed a waterfall model that saw fewer releases (sometimes every 2-3 years). The shortening of development cycles in the early 2000s thanks to the Agile methodology and an uptick in online experimentation created the perfect storm for a software development revolution ― and Ronny was at the center of it all.
AB Tasty’s VP Marketing Marylin Montoya set out to uncover the early days of experimentation with Ronny and why failure is actually a good thing. Here are some of the key takeaways from their conversation.
.
Progressive deployments as a safety net
A typical cycle of experimentation involves exposing the test to 50% of the population for an average of two weeks before a gradual release. But Ronny suggests coming at it from a different vantage point: Starting with a small audience of just 2% before ramping up to 50%. The slower ramp-up gives you the time to detect any egregious issues or a degradation in metric values in near real time.
In an experiment, we may focus on just two features, but we have a large set of guardrails that suggest we shouldn’t be degrading X, Y or Z. Statistical data that you’re collecting could also suggest that you’re impacting something you didn’t mean to. Hence, the usage of progressive deployments in which you can identify external factors and easily rollback your test.
It’s like if you’re cooling water: You may realize you’re changing the temperature, but it’s not until you reach 0ºC (32ºF) that ice forms. You suddenly realize that when you get to a certain point, something very big happens. So, deploying at a safe velocity and monitoring the results can lead to huge improvements.
Your great idea? It will most likely fail.
Nothing gives you a better reality check than experimentation at scale. Everyone thinks they’re doing the best stuff in the world until it’s in the hands of their users. That’s when the real feedback kicks in.
Over two-thirds of ideas actually fail to move the metrics that they were designed to improve — a statistic Ronny shares from his time at Microsoft, where he founded the experimentation platform team of more than 100 data scientists, developers and program managers.
Don’t be deterred, however. In the world of experimentation, failing is a good thing. Fail fast, pivot fast. Being able to realize that the direction you’re going in isn’t as promising as previously thought enables you to use those new findings to enrich your next actions.
At Airbnb, Ronny’s experimentation team deployed a lot of machine learning algorithms to improve search. Out of 250 ideas tested in controlled experiments, only 20 of them proved to have a positive impact on the key metrics — meaning over 90% of ideas failed to move the needle. On the flip side, however, the 20 ideas that did succeed in some form? Those resulted in a 6% improvement in booking conversion, worth hundreds of millions of dollars.
The starter kit to experimentation
It’s easier today to convince leadership to invest in experimentation because there are plenty of successful use cases out there. Ronny’s advice is to start with a team that has iteration capital. If you’re able to run more experiments and a certain percentage are pass/fail, this ability to try ideas is key.
Pick a scenario where you can easily integrate the experimentation process into the development cycle and then work your way on to more complex scenarios. The value of experimentation is clearer because deployments are happening more often. If you’re working in a team that deploys every six months, there’s not a lot of wiggle room because everyone has already invested their efforts into this idea that the feature cannot fail. Which, as Ronny pointed out earlier, has a low probability of success.
Is experimentation for every company? The short answer is no. A company has to have certain ingredients in order to unlock the value of experimentation. One ingredient you need is being in a domain where it’s easy to make changes, such as website services or software. A second ingredient is you need enough users. Once you have tens of thousands of users, you can start experimenting and doing it at scale. And lastly, make sure you have trustworthy results from which you are taking your decisions.
What else can you learn from our conversation with Ronny Kohavi?
How experimentation becomes central to your product build
Why experimentation is at the root of top tech companies
The role leaders play in evangelizing an experimentation culture
How to build an environment for true experimentation and trustworthy results
About Ronny Kohavi
Ronny Kohavi is an authority in experimentation, having worked on controlled experiments, machine learning, search, personalization and AI for nearly three decades. Ronny previously was vice president and technical fellow at Airbnb. Prior to that, Ronny led the Analysis and Experimentation at Microsoft’s Cloud and AI group and was the director of data mining and personalization at Amazon. Ronny has also co-authored “Trustworthy Online Controlled Experiments : A Practical Guide to A/B Testing.,” which is currently the #1 best-selling data-mining book on Amazon.
About 1,000 Experiments Club
The 1,000 Experiments Club is an AB Tasty-produced podcast hosted by Marylin Montoya, VP of Marketing at AB Tasty. Join Marylin and the Marketing team as they sit down with the most knowledgeable experts in the world of experimentation to uncover their insights on what it takes to build and run successful experimentation programs.
Statistical significance is a powerful yet often underutilized digital marketing tool.
A concept that is theoretical and practical in equal measures, you can use statistical significance models to optimize many of your business’s core marketing activities (A/B testing included).
A/B testing is integral to improving the user experience (UX) of a consumer-facing touchpoint (a landing page, checkout process, mobile application, etc.) and increasing its performance while encouraging conversions.
By creating two versions of a particular marketing asset, both with slightly different functions or elements, and analyzing their performance, it’s possible to develop an optimized landing page, email, web app, etc. that yields the best results. This methodology is also referred to as two-sample hypothesis testing.
When it comes to success in A/B testing, statistical significance plays an important role. In this article, we will explore the concept in more detail and consider how statistical significance can enhance the A/B testing process.
But before we do that, let’s look at the meaning of statistical significance.
What is statistical significance and why does it matter?
According to Investopedia, statistical significance is defined as:
“The claim that a result from data generated by testing or experimentation is not likely to occur randomly or by chance but is instead likely to be attributable to a specific cause.”
In that sense, statistical significance will bestow you with the tools to drill down into a specific cause, thereby making informed decisions that are likely to benefit the business. In essence, it’s the opposite of shooting in the dark.
Make informed decisions with testing and experimentation
Calculating statistical significance
To calculate statistical significance accurately, most people use Pearson’s chi-squared test or distribution.
Invented by Karl Pearson, the chi (which represents ‘x’ in Greek)-squared test commands that users square their data to highlight possible variables.
This methodology is based on whole numbers. For instance, chi-squared is often used to test marketing conversions—a clear-cut scenario where users either take the desired action or they don’t.
In a digital marketing context, people apply Pearson’s chi-squared method using the following formula:
Statistically significant = Probability (p) < Threshold (ɑ)
Based on this notion, a test or experiment is viewed as statistically significant if the probability (p) turns out lower than the appointed threshold (a), also referred to as the alpha. In plainer terms, a test will prove statistically significant if there is a low probability that a result has happened by chance.
Statistical significance is important because applying it to your marketing efforts will give you confidence that the adjustments you make to a campaign, website, or application will have a positive impact on engagement, conversion rates, and other key metrics.
Essentially, statistically significant results aren’t based on chance and depend on two primary variables: sample size and effect size.
Statistical significance and digital marketing
At this point, it’s likely that you have a grasp of the role that statistical significance plays in digital marketing.
Without validating your data or giving your discoveries credibility, you will probably have to take promotional actions that offer very little value or return on investment (ROI), particularly when it comes to A/B testing.
Despite the wealth of data available in the digital age, many marketers are still making decisions based on their gut.
While the shooting in the dim light approach may yield positive results on occasion, to create campaigns or assets that resonate with your audience on a meaningful level, making intelligent decisions based on watertight insights is crucial.
That said, when conducting tests or experiments based on key elements of your digital marketing activities, taking a methodical approach will ensure that every move you make offers genuine value, and statistical significance will help you do so.
Using statistical significance for A/B testing
Now we move on to A/B testing, or more specifically, how you can use statistical significance techniques to enhance your A/B testing efforts.
Testing uses
Before we consider its practical applications, let’s consider what A/B tests you can run using statistical significance:
Emails clicks, open rates, and engagements
Landing page conversion rates
Notification responses
Push notification conversions
Customer reactions and browsing behaviors
Product launch reactions
Website calls to action (CTAs)
The statistical steps
To conduct successful A/B tests using statistical significance (the chi-squared test), you should follow these definitive steps:
1. Set a null hypothesis
The idea of the null hypothesis is that it won’t return any significant results. For example, a null hypothesis might be that there is no affirmative evidence to suggest that your audience prefers your new checkout journey to the original checkout journey. Such a hypothesis or statement will be used as an anchor or a benchmark.
2. Create an alternative theory or hypothesis
Once you’ve set your null hypothesis, you should create an alternative theory, one that you’re looking to prove, definitively. In this context, the alternative statement could be: our audience does favor our new checkout journey.
3. Set your testing threshold
With your hypotheses in place, you should set a percentage threshold (the (a) or alpha) that will dictate the validity of your theory. The lower you set the threshold—or (a)—the stricter the test will be. If your test is based on a wider asset such as an entire landing page, then you might set a higher threshold than if you’re analyzing a very specific metric or element like a CTA button, for instance.
For conclusive results, it’s imperative to set your threshold prior to running your A/B test or experiment.
4. Run your A/B test
With your theories and threshold in place, it’s time to run the A/B test. In this example, you would run two versions (A and B) of your checkout journey and document the results.
Here you might compare cart abandonment and conversion rates to see which version has performed better. If checkout journey B (the newer version) has outperformed the original (version A), then your alternative theory or hypothesis will be proved correct.
5. Apply the chi-squared method
Armed with your discoveries, you will be able to apply the chi-squared test to determine whether the actual results differ from the expected results.
To help you apply chi-squared calculations to your A/B test results, here’s a video tutorial for your reference:
By applying chi-squared calculations to your results, you will be able to determine if the outcome is statistically significant (if your (p) value is lower than your (a) value), thereby gaining confidence in your decisions, activities, or initiatives.
6. Put theory into action
If you’ve arrived at a statistically significant result, then you should feel confident transforming theory into practice.
In this particular example, if our checkout journey theory shows a statistically significant relationship, then you would make the informed decision to launch the new version (version B) to your entire consumer base or population, rather than certain segments of your audience.
If your results are not labelled as statistically significant, then you would run another A/B test using a bigger sample.
At first, running statistical significance experiments can prove challenging, but there are free online calculation tools that can help to simplify your efforts.
Statistical significance and A/B testing: what to avoid
While it’s important to understand how to apply statistical significance to your A/B tests effectively, knowing what to avoid is equally vital.
Here is a rundown of common A/B testing mistakes to ensure that you run your experiments and calculations successfully:
Unnecessary usage: If your marketing initiatives or activities are low cost or reversible, then you needn’t apply strategic significance to your A/B tests as this will ultimately cost you time. If you’re testing something irreversible or which requires a definitive answer, then you should apply chi-squared testing.
Lack of adjustments or comparisons: When applying statistical significance to A/B testing, you should allow for multiple variations or multiple comparisons. Failing to do so will either throw off or narrow your results, rendering them unusable in some instances.
Creating biases: When conducting A/B tests of this type, it’s common to apply biases to your experiments unwittingly—the kind of which that don’t consider the population or consumer base as a whole.
To avoid doing this, you must examine your test with a fine-tooth comb before launch to ensure that there aren’t any variables that could push or pull your results in the wrong direction. For example, is your test skewed towards a specific geographical region or narrow user demographic? If so, it might be time to make adjustments.
Statistical significance plays a pivotal role in A/B testing and, if handled correctly, will offer a level of insight that can help catalyze business success across industries.
While you shouldn’t rely on statistical significance for insight or validation, it’s certainly a tool that you should have in your digital marketing toolkit.
We hope that this guide has given you all you need to get started with statistical significance. If you have any wisdom to share, please do so by leaving a comment.