
How AB Tasty Delivers High-Quality Risk-Free Releases with Feature Flags

With its feature flagging functionality, AB Tasty was able to launch new changes to end users safely and quickly, using progressive delivery and continuous feedback loops to protect quality.

In the world of SaaS, velocity and quality are of utmost importance. This is an industry that is constantly evolving and companies must work diligently to keep pace with consumers’ fast-changing needs and to maintain competitive advantage.

AB Tasty has seen strong user growth around the world. To keep pace, the company had to accelerate its development processes, growing its development and feature teams so it could build more features and deliver them to market faster.

The challenges of CI/CD

However, with such rapid growth and scaling, the company faced growing pains and bottlenecks that significantly slowed down its CI/CD processes. This increased the risk of issues piling up, defeating the original purpose of accelerating time-to-market.

Even with mature CI/CD processes, developer and product teams are not immune to pitfalls that impact speed of delivery and go-to-market.

With these challenges in mind, the team at AB Tasty set four main objectives:

  • Accelerate time-to-market.
  • Increase speed of delivery without sacrificing quality.
  • Allow teams to remain autonomous to avoid delays.
  • Reduce risk by avoiding big bang releases.

The team essentially needed a stress-free way to push code into production, backed by an easy-to-use interface: development teams could release features as soon as they were ready, eliminating bottlenecks, while product teams gained more control over the release process by running safe experiments in production to gather useful feedback.

This is when the team at AB Tasty turned to its own feature flagging functionality.

Feature flags gave the team a way to safely test and deploy new changes to any users of their choice while keeping those changes turned off for everyone else.

The team at AB Tasty did this by first defining a flag in the feature management interface, whose value is controlled remotely through the tool’s API.

The team can then set targeting rules, that is, the specific conditions under which the flag is triggered, based on the user ID. Feature flags allow highly granular targeting, using any user attribute to which the team has access.

Then, in AB Tasty’s own codebase, teams simply condition the activation of the feature, or its behavior, on the value of that flag using a simple conditional branch.
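
For illustration, such a conditional branch might look like the sketch below; the client interface, flag name, and render helpers are hypothetical stand-ins rather than AB Tasty’s actual SDK.

```typescript
// Hypothetical flag client: the interface, flag name, and render helpers are
// illustrative, not AB Tasty's actual SDK.
interface FlagClient {
  getFlag(key: string, defaultValue: boolean): boolean;
}

declare function renderNewVerticalNavigation(): void; // new UI for targeted users
declare function renderLegacyNavigation(): void;      // current UI for everyone else

function renderNavigation(flags: FlagClient): void {
  // Condition the feature's activation on the remotely controlled flag value.
  if (flags.getFlag("new-vertical-navigation", false)) {
    renderNewVerticalNavigation();
  } else {
    renderLegacyNavigation();
  }
}
```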

At the time, the company was working on a key project to revamp a major part of the UI, namely the navigation system, including a new vertical navigation and new responsive grids for personalization campaigns, with the goal of making the product easier for users to understand.

For a project of this scope, there were big changes tied to many dependencies, such as the database, and so AB Tasty needed a way to progressively deploy these new changes to obtain initial feedback and avoid a big bang release.

Progressively deliver features

With such a large project, the goal was to mitigate risk by avoiding deploying major changes to all users at once. With feature flags, teams are able to reduce the number of users who can access the new changes.

In particular, the ON/OFF deployment logic of feature toggles in the feature management tool’s interface works like a switch: teams can progressively roll out features to users who match pre-set targeting rules while keeping them off for everyone else.
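
Under the hood, progressive rollouts typically bucket each user deterministically so the same user always gets the same decision as the percentage grows. The minimal sketch below assumes a hash-based bucketing scheme; it is not any specific tool’s implementation.

```typescript
import { createHash } from "crypto";

// Minimal sketch of progressive delivery: hash the user ID into a bucket from
// 0 to 99, and turn the feature ON only for buckets below the current rollout
// percentage. The same user always lands in the same bucket, so raising the
// percentage only ever adds users.
function isEnabledFor(userId: string, rolloutPercent: number): boolean {
  const digest = createHash("sha256").update(userId).digest();
  const bucket = digest.readUInt32BE(0) % 100;
  return bucket < rolloutPercent;
}

// Start small (e.g. 5%) and widen the rollout as confidence grows.
console.log(isEnabledFor("user-42", 5));
```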

Easily set up and manage beta and early adopter lists

After testing internally, the product team was looking for a way to easily manage its early adopter list before releasing to the rest of its users. This would enable the team to receive quicker feedback from the most relevant (and more forgiving) users.

With AB Tasty’s flagging functionality, teams can simply add these early adopters’ account IDs to the flag’s targeting rules, giving those accounts immediate and exclusive access to the new feature before anyone else.
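
Conceptually, this kind of targeting is just an allowlist evaluated per account; the account IDs and helper in this sketch are hypothetical, not the actual AB Tasty interface.

```typescript
// Sketch of an early-adopter allowlist evaluated per account. The account IDs
// and the shape of the rule are hypothetical.
const earlyAdopterAccountIds = new Set<string>(["acct_101", "acct_202", "acct_303"]);

// The flag resolves to ON only for accounts on the list; everyone else keeps
// the existing experience until the wider rollout.
function isFlagOnFor(accountId: string): boolean {
  return earlyAdopterAccountIds.has(accountId);
}

console.log(isFlagOnFor("acct_202")); // true: early adopter sees the new feature
console.log(isFlagOnFor("acct_999")); // false: everyone else does not
```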

Release without stress by ensuring that developers are ready to tackle any issues

Since most of the development team was based in France, the new changes were initially rolled out to that region so developers could confirm that everything worked and quickly fix any bugs before deploying to other regions (and time zones).

Should anything go wrong, teams can easily roll back the release with a kill switch by immediately toggling a flag off within the feature flagging platform interface so that the feature is no longer visible.
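
A kill switch works because the flag value is resolved at runtime rather than baked into the build, so flipping it off takes effect without a redeploy. Here is a rough sketch, with an illustrative flag name and fetch function rather than a real API.

```typescript
// Kill-switch sketch: the flag is resolved remotely at evaluation time, so
// toggling it off in the flag platform hides the feature without a redeploy.
// The flag name and fetch function are illustrative.
async function shouldShowNewNavigation(
  fetchFlag: (key: string) => Promise<boolean>
): Promise<boolean> {
  try {
    return await fetchFlag("new-vertical-navigation");
  } catch {
    // If the flag service is unreachable, fail safe to the old experience.
    return false;
  }
}
```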

Enable continuous feedback loops

Teams can now test in production on end users, optimize features, and iterate faster based on real production data. As a result, teams can launch the end product to all users with the reassurance that they have identified and fixed any issues.

This also empowers teams to become more innovative, as they now have a safe way to test and receive feedback on their ideas, and are no longer limited in their scope of work.

Accelerate go-to-market

Furthermore, the ON/OFF deployment logic allows teams to release at their own pace. This accelerates time-to-market, as developers no longer need to wait for all changes to be ready before releasing their own, resulting in fewer delays and disgruntled customers.

This speed does not come at the expense of quality: with continuous feedback loops, teams can iterate on releases, ensuring that only high-quality products reach users.

Teams can send features to production whenever they’re ready, make them visible to some users and officially launch to market once all deliverables are ready and good to go!


Chaos Engineering 101: How Chaos Brings Order

As we go deeper into digital transformation and companies move towards large-scale, globally distributed systems, the complexity that comes with those systems increases. This means that failures become much harder to predict than in traditional, monolithic systems.

Yet these failures can be costly for teams to repair, not to mention the very real risk of losing new and existing customers.

The question then is how can we build confidence in the systems that we put into production? How can teams make sure that they’re releasing stable and resilient software that can handle any unpredictable conditions that they’re put into?

This is when teams turn to what is aptly referred to as “chaos engineering”.

What is chaos engineering?

According to the Principles of Chaos, chaos engineering is “the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”

In other words, chaos engineering is the process of testing distributed systems to ensure that they can withstand turbulent conditions and unexpected disturbances, that is, the “chaos” of production.

Chaos engineering is particularly applicable to large-scale, distributed systems. Since such systems are now hosted on globally distributed infrastructures, there are many complex dependencies and moving parts with several points of failure. This makes it harder to predict when an unexpected error will occur.

Because failures of individual components are so unpredictable, they become harder to test for in a typical software development life cycle.

This is when the concept of chaos engineering came about as a way to predict and test for such failures and uncover hidden flaws within these systems. 

In other words, this approach gauges the resilience of these systems by carrying out controlled experiments that surface vulnerabilities and any unpredictable or unstable behavior.

This is done by breaking things on purpose: injecting failures and various types of faults into the system to see how it responds, which helps reveal potential outages and weaknesses.

The ultimate goal is to learn how to build more resilient systems.

Where does the term come from?

Before we delve any deeper into chaos engineering, it would be helpful to understand where this concept originated.

Chaos engineering started in 2010, when the engineering team at Netflix developed “Chaos Monkey” (later made open source) as the company migrated from a monolithic architecture to the cloud, deployed on AWS.

For Netflix, this migration to hundreds of microservices brought on a high amount of complexity; therefore, engineers at Netflix were seeking a better approach to prevent sudden outages in the system. 

These engineers were mainly looking for a way to disable instances and services within their architecture to ensure that their system could handle such failures with minimal impact on the user experience, allowing them to build a more resilient and reliable architecture.

The idea behind the Chaos Monkey tool was that they would unleash a “wild monkey” to break down individual components in a carefully monitored environment to make sure that a breakdown in this single component wouldn’t affect the entire system. 

This, in turn, helped them locate the weaknesses in the system and build automatic recovery plans to address them and alter the system if necessary so that it could easily tolerate unplanned failures in the future.

Afterwards, Chaos Monkey improved and evolved to allow Netflix engineers to more precisely determine failures by testing against more failure states, enhancing the resilience of their system.

From then on, the chaos journey began for Netflix and later on for many organizations dealing with similar distributed systems.

Principles of chaos engineering

We can deduce that chaos engineering involves running experiments to understand how a distributed system behaves when faced with failure.

Unlike other forms of testing, chaos engineering involves experimentation: learning new things about a system by forming a hypothesis and attempting to prove it. If the hypothesis turns out to be false, that is a chance to learn something new about the system.

Testing, on the other hand, involves making an assumption about a system based on existing knowledge and determining whether it’s true or not by running tests; in other words, the test is conducted based on knowledge of specific properties about the system. The results, therefore, don’t provide new insights or information.

Chaos engineering, for its part, explores scenarios that don’t usually occur during testing, in order to gain new knowledge about the system by considering factors that go beyond the obvious issues normally tested for.

The following principles provide a basis on which to run such experiments on your system:

  1. Plan an experiment

The first step involves planning an experiment, where you will need to pinpoint things that could go wrong. This will require gaining an understanding of your system’s normal behavior and determining what constitutes a normal state. Afterwards, you start off by forming a hypothesis of how you think the components of your system will behave in case something goes wrong and then create your control and experimental groups accordingly. 

Defining a metric to measure at this stage is useful to gauge the level of normalcy within your system. These could include metrics such as error rates or latency. 
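
As a concrete (and simplified) illustration, a steady-state hypothesis could be encoded as a check on those metrics; the metric names and thresholds below are assumptions for the sketch, not recommendations.

```typescript
// Sketch of a steady-state hypothesis encoded as a metric check. The metric
// names and thresholds are illustrative assumptions.
interface SteadyStateMetrics {
  errorRate: number;    // fraction of failed requests, e.g. 0.002
  p99LatencyMs: number; // 99th-percentile response time in milliseconds
}

// The hypothesis: even under injected failure, these bounds continue to hold.
function steadyStateHolds(m: SteadyStateMetrics): boolean {
  return m.errorRate < 0.01 && m.p99LatencyMs < 500;
}
```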

  2. Design real-world events

At this stage, you outline and introduce real-world events that could disrupt your system: hardware or server failures, a sudden spike in traffic, network latency, or any other event that could disturb its steady state.
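
For example, a simple latency fault can be injected by wrapping a downstream call; this generic sketch is not tied to any particular chaos tool.

```typescript
// Illustrative fault injection: wrap a downstream call and add artificial
// latency to a fraction of requests. Generic sketch, not a specific chaos tool.
async function withInjectedLatency<T>(
  call: () => Promise<T>,
  delayMs: number,
  injectionRate: number // e.g. 0.1 delays roughly 10% of calls
): Promise<T> {
  if (Math.random() < injectionRate) {
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }
  return call();
}
```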

  3. Run the experiment

After defining your system’s normal behavior and the events that could disrupt it, you can run experiments, preferably in a production environment, to measure the impact of the failure and gain a better understanding of your system’s real-world behavior.

This will also allow you to prove or disprove your hypothesis. The harder it is to cause an outage in the system, the more confident you can be in your system’s resilience.

However, keep in mind that since your experiments run in production, it’s important to minimize the blast radius in case something goes wrong. This keeps any adverse effects to a minimum; if things go smoothly, you can gradually increase the radius until it reaches full scale. It’s also wise to have a rollback plan in place in case something does go wrong.
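
Put together, an experiment run with a limited, gradually widening blast radius and an automatic rollback might look roughly like the following sketch, where all three helpers are hypothetical placeholders.

```typescript
// Sketch of an experiment run with a gradually widening blast radius and an
// automatic rollback. All three helpers are hypothetical placeholders.
async function runChaosExperiment(
  applyFault: (percentOfTraffic: number) => Promise<void>,
  removeFault: () => Promise<void>,
  steadyStateHolds: () => Promise<boolean>
): Promise<void> {
  for (const percent of [1, 5, 25, 100]) {
    await applyFault(percent);     // start small, widen step by step
    if (!(await steadyStateHolds())) {
      await removeFault();         // roll back immediately on regression
      return;
    }
  }
  await removeFault();             // clean up after a full-scale run
}
```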

  4. Monitor results

The experiment should give you a clear idea of what’s working and what needs to be improved by comparing the control and experimental groups. Teams can then make the necessary changes once they have identified what led to the outage or disruption to the service, if any.
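
A simplified way to express that comparison in code is sketched below; a real analysis would apply proper statistical tests, and the thresholds here are illustrative.

```typescript
// Sketch of comparing the control and experimental groups after the run.
// Thresholds are illustrative, not a substitute for statistical testing.
interface GroupResult {
  errorRate: number;
  p99LatencyMs: number;
}

function hypothesisDisproved(control: GroupResult, experiment: GroupResult): boolean {
  const errorDelta = experiment.errorRate - control.errorRate;
  const latencyDelta = experiment.p99LatencyMs - control.p99LatencyMs;
  return errorDelta > 0.01 || latencyDelta > 100;
}
```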

Why we should break things on purpose: Benefits of chaos engineering

We can look at chaos engineering as a safeguard that helps catch worst-case scenarios before they actually happen and impact the user experience.

Consequently, chaos engineering has a number of benefits.

Increased reliability and resilience

As we’ve already mentioned, running such controlled chaos experiments will help determine your system’s capabilities, thereby preparing the system against unexpected failures. 

Information gathered from these experiments can be used to strengthen your system and increase its resilience by locating potential weaknesses and finding ways to resolve them.

In other words, by learning what failure scenarios to prepare for, teams can improve and speed up their response to troubleshooting incidents. 

Enhanced user experience

A strengthened system is less likely to experience major outages and downtime that could negatively affect the user experience. Chaos engineering allows you to pinpoint issues and problems before they become customer pain points.

This will, in turn, result in improved user experience and increased customer satisfaction as you are now releasing high performing, more resilient software.

Reduced revenue loss

By running chaos experiments, companies can prevent lengthy disruptions and outages to the system, which otherwise could lead to losses in revenue as well as high maintenance costs.

Improved confidence in the system

The insights gathered from these experiments can help teams build more resilient and robust systems.

By predicting the unexpected, teams are prepared for worst-case scenarios and have recovery plans in place for them, which increases confidence in their systems.

Nonetheless, organizations should still carefully consider the challenges of chaos engineering before implementing it as, despite its benefits, it can also be costly and time-consuming.

Unleashing chaos for better digital experiences

As we’ve seen, chaos engineering is an essential practice when it comes to creating uninterrupted, seamless digital experiences for your customers.

It’s not just breaking things for the sake of breaking things; it’s a way to gain insight into how a system behaves and to gauge its resilience. In other words, chaos engineering is not only about breaking things but also about fixing weaknesses in a system, exposing hidden threats to build resilience and minimize risk.

It’s important to note that chaos engineering isn’t meant to replace the other types of testing that are carried out throughout the software development life cycle but instead to complement these tests to provide a high performing system.

Finally, chaos engineering has an important role in DevOps. At the heart of DevOps is the idea of continuous improvement, which is why integrating chaos engineering into a DevOps methodology is essential to mitigate security risks. It’s also a way for DevOps teams to deal with the rising complexity of applications nowadays.

Consequently, introducing chaos experiments into your DevOps CI/CD pipeline will help teams detect hidden issues more quickly, which builds confidence in the system and enables them to deploy to end users faster.