In the previous installment of our Fun with Flags series, we discussed how to store feature flags based on the category of flags they belong to.
We also discussed how flags are broken down into categories based on their dynamism and longevity.
In this post, we will explore the differences between short- and long-lived flags and which to choose depending on the use case. We touched upon this difference in the previous post, but here we will go into more detail.
Feature flag longevity
First, we will briefly define what we mean by longevity. This concept refers to how long the decision logic for a flag will remain in a codebase.
Feature flags, or feature toggles, can be broken down into the following categories. Each of these categories serves a different purpose and thus should be managed differently.
Release toggles – these are short-lived toggles that help developers write new features. They are usually on/off switches that control whether a feature is enabled, and they are typically removed after the feature is released.
Operational toggles – these are used to control operational aspects of your system, usually to turn features off. They can be short-lived, but it's not uncommon to have longer-lived ones that serve as kill switches.
Experiment toggles – these are used to perform A/B or multivariate tests. They usually stay in the system for as long as it takes to generate statistically significant results from the feature testing, so anywhere from a few weeks to months.
Permission toggles – these toggles are usually longer-lived than the other categories, sometimes remaining in your system for years. They are used for user segmenting or, in other words, to make features available to certain subsets of users.
As we've seen so far, there are different types of flags, each serving a different purpose, and they can stay in your system from days to months, or even years, depending on how they are deployed.
Being aware of the longevity of a flag is crucial for a few reasons.
First of all, it is important when it comes to implementing the flag. For short-lived flags, a simple if/else statement is usually sufficient, but flags that are meant to stay for a longer period of time will require a more sophisticated implementation to support their associated use cases.
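For a short-lived flag, the toggle point really can be that simple. Here is a minimal sketch in Python; the flag name, the in-memory store, and the helper are all hypothetical, and a real application would read the value from configuration or a flag service:

```python
# Minimal release toggle: a short-lived on/off switch guarding a new code path.
# The flag store here is just a dict; real systems read config or call a service.
FLAGS = {"new_checkout_flow": False}  # hypothetical flag name

def is_enabled(flag_name: str) -> bool:
    """Look up a flag, defaulting to off if it is unknown."""
    return FLAGS.get(flag_name, False)

def checkout() -> str:
    if is_enabled("new_checkout_flow"):
        return "new checkout"   # new feature code path
    else:
        return "old checkout"   # existing behavior; the branch disappears
                                # when the flag is retired
```

Once the feature is fully rolled out, both the flag and the else branch are deleted, which is exactly why this shape suits short-lived flags.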
The second reason is to avoid the accumulation of technical debt, which will be explained in further detail in the next section.
Short vs long-term feature flags
This brings us to the key differences between short- and long-lived flags.
From the above, we can conclude that short-lived flags have a limited time span, meaning that they should be removed from your system as soon as they have served their purpose.
Long-lived or, in some cases, permanent flags are used for an extended period of time, sometimes beyond the release of a feature.
Long-lived flags become part of your regular software operations and infrastructure, so you create them with the intention of keeping them in your system for an indefinite period of time.
It is crucial, therefore, to constantly review the flags you have in your system.
Why is this important?
Flags must be checked at regular intervals to avoid the accumulation of technical debt. This debt might take a toll on your system, causing major disruptions and even resulting in a total breakdown, if you don’t keep careful track of it. The more flags you add to your system, the higher this cost becomes.
Thus, the best way to minimize this cost is to conduct a regular clean-up of your flags. Remove any stale flags that have outlived their purpose; otherwise, you might end up with hundreds, or even thousands, of unused flags.
However, this may not be as simple as it sounds. As flag adoption increases, many organizations find it difficult to determine which flags are still active, so this process can become time-consuming and challenging the further you are in your feature flag journey.
Let’s take a look at how to overcome this challenge…
You will need to have a process in place in which feature flags are regularly reviewed and removed from your codebase.
For short-term flags, check whether the feature has been rolled out to all users or to none; for longer-lived flags, determine whether the flag is still needed for one reason or another.
You should also consider setting up access control, lest someone mistakenly deletes a permanent flag that you're still very much using! Consider setting up permissions that define who can delete which flags.
Additionally, adhering to naming conventions that indicate the flag's longevity is a great way to keep track of the many flags you have in your system.
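One way to apply such a convention is to encode the category, and therefore the expected longevity, in the flag name itself, so stale temporary flags are easy to grep for. The scheme below is purely illustrative, not a standard:

```python
# Hypothetical naming convention: a prefix encodes the flag's category and
# expected longevity, making clean-up candidates easy to find mechanically.
TEMP_RELEASE = "temp_release_new_checkout"      # short-lived: delete after rollout
TEMP_EXPERIMENT = "temp_exp_cta_color_ab"       # lives for the test's duration
PERM_OPS = "perm_ops_search_kill_switch"        # long-lived kill switch
PERM_PERMISSION = "perm_entitlement_beta_users" # long-lived user segmenting

def is_temporary(flag_name: str) -> bool:
    """Flags prefixed temp_ are clean-up candidates once their feature ships."""
    return flag_name.startswith("temp_")
```

A periodic clean-up job (or a simple code search) can then flag every `temp_` name older than a chosen threshold for review.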
An advanced feature flagging platform can give you full visibility over all the flags you have in your system.
In particular, the Flag Tracking Dashboard and enhanced Git integration give you the ability to manage tech debt and see where your flags are being used. It also gives you the ability to reference your codebase and all your AB Tasty Flags, allowing you to quickly locate the flags that you are no longer using.
To sum up…
To review what we've seen so far, the following table lists each category of flag and its typical longevity:

Release – from days to weeks (until the feature is rolled out); temporary
Experiment – from weeks to months, depending on how long it takes to gather statistically significant results; temporary
Operational – from months to years, depending on the use case
Permissioning – usually long-lasting; often permanent
Are you creating this flag for purposes of testing, rollout or feature releases? These flags are usually temporary (days to weeks) and should be removed once the feature is deployed.
Are you conducting an A/B test? This must remain long enough to gather sufficient data but no longer than that.
Are you using this flag to manage entitlements? This is when you give access to a feature to a certain segment of users. These are often long-lived or permanent flags.
As a final note: Review your flags often
In conclusion, make sure to review the flags you have in your system, particularly the short-lived ones, so that you can ensure they’re deleted as soon as possible.
The expected lifespan of a flag depends on the purpose it was created for, so everyone on the team needs a clear understanding of the purpose of each flag in your system.
With that said, the most important takeaway is to make sure to schedule regular clean-ups to keep technical debt in check. As already mentioned, a third-party service will make the process less daunting by giving you full visibility of all your flags so that you can identify which flags need to be removed.
So far, in our Fun with Flags series, we have looked at some important best practices when it comes to feature flags including naming conventions and access control.
In this part of our series, we will be looking at the different ways feature flags can be stored based on which category they belong to.
Let's start with a simple definition. Feature flagging is a software development practice that allows you to decouple code deployment from feature release, enabling quicker and safer releases.
They allow you to enable or disable functionality remotely without changing code. At its most basic, a feature flag is a file separate from your code file.
You could start off deploying feature flags using a configuration file. This is the simplest implementation method. A config file is usually used to store application settings and some flags just fall into this category.
However, this might be time-consuming as developers may need to redeploy the application after each value update to reflect the changed value. Redeploying can take some time and if your changes are mission critical you may lose some reactivity. Configuration files also don’t allow you to make context-specific decisions, meaning that flags will either be on or off for all users (only hold one specific value at a time).
This makes the most sense when you’re just starting out your feature flag journey and you wish to experiment with feature flags at a smaller scale. Additionally, this could be a viable option for static flags like a release toggle, which are feature flags that enable trunk-based development to support dev teams as they write new features.
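As a concrete illustration, a config-file deployment can be as simple as a JSON file read once at startup. The flag names below are made up, and in a real application the values would come from a file on disk rather than an inline string:

```python
import io
import json

# Hypothetical contents of a flags.json file shipped with the application.
# In practice this would sit alongside other application settings.
CONFIG = '{"enable_dark_mode": true, "enable_new_search": false}'

def load_flags(fp) -> dict:
    """Read all flag values once at application startup."""
    return json.load(fp)

flags = load_flags(io.StringIO(CONFIG))  # stand-in for open("flags.json")

# The value is the same for every user, and changing it means editing the
# file and redeploying or restarting the application.
search_backend = "new" if flags["enable_new_search"] else "old"
```

This illustrates both limitations mentioned above: the flag holds one value for everyone, and updates require a redeploy before they take effect.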
For example, a developer may need a kill switch to work instantaneously to kill off a buggy feature, while flipping a release toggle from off to on can take longer.
Such static flags, as their name implies, will be relatively static and will only change through code or configuration file changes.
To better understand how feature flag values should be stored, it’s best to start with the different categories of flags, which are based on longevity and dynamism.
Short vs long lived flags
Longevity of a feature flag, as the name suggests, refers to how long a flag will remain in your code.
Some types of flags are short-lived such as release toggles while others such as kill switches are long-lived; they will stay in your codebase often for years.
This distinction between short- and long-lived flags will influence how you go about implementing a flag's toggle point. The longer the lifetime of a flag, the more careful you'll need to be about choosing the toggle point location.
There are obviously different types of flags, which means they cannot all be managed and configured the same way. Dynamism refers to how often we need to modify a flag during its lifetime.
Therefore, feature flags can be broken down into two categories depending on how they are hosted:
Static flags – as already mentioned, these are usually hosted in a file that only changes when you want it to. These flags get hard-coded into the app at build time, so the flag configuration is stored in the actual code.
Dynamic flags – these are usually hosted by a service that will change the value of the flag depending on the values you send with the request. This type of flag can be changed at runtime, allowing for dynamic control of code execution on a user-by-user or session-by-session basis.
This means that the configuration for some types of flags needs to be more dynamic than for others; for example, when you want to change a value based on a shared attribute among groups of users.
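Such an attribute-based decision can be sketched as a per-request evaluation against a user context. The rule format and attribute names here are assumptions for illustration, not any particular product's API:

```python
# A dynamic flag evaluated at runtime against a user context, e.g. enabling
# a feature only for users sharing an attribute such as their country.
def evaluate(flag_rules: dict, user: dict) -> bool:
    """Return the flag state for this particular user and request."""
    allowed_countries = flag_rules.get("countries", [])
    return user.get("country") in allowed_countries

# Hypothetical targeting rule: the feature is on for users in France and Germany.
rules = {"countries": ["FR", "DE"]}
```

Unlike the static case, the same flag can be on for one request and off for the next, because the decision depends on who is asking.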
Alternative storage method: database
As an alternative to config files, feature flags can be stored in some sort of database. It is a convenient place to store your settings and they can often be easily updated.
So if the number of flags in your system grows and you need more granular control, you might opt to deploy your flags via a database. This would allow you to target features to individual users.
Moreover, having flags in a database means that not only your developers can change values assigned to flags but also your product teams, thereby reducing dependency. This assumes that some kind of dashboard or management system is in place for other people to easily interact with the database. You’ll also probably want to keep track of changes to this database.
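With a database, each flag becomes a row that can be flipped without a redeploy, whether by a developer or by a product team through a dashboard. A sketch using SQLite from Python's standard library; the schema is illustrative only:

```python
import sqlite3

# Illustrative schema: one row per flag, updatable at runtime without a redeploy.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feature_flags (name TEXT PRIMARY KEY, enabled INTEGER)")
conn.execute("INSERT INTO feature_flags VALUES ('new_dashboard', 0)")

def is_enabled(conn, name: str) -> bool:
    """Read the flag's current value; unknown flags default to off."""
    row = conn.execute(
        "SELECT enabled FROM feature_flags WHERE name = ?", (name,)
    ).fetchone()
    return bool(row and row[0])

# A product manager (via a management dashboard) flips the flag;
# the application picks up the change on its next read.
conn.execute("UPDATE feature_flags SET enabled = 1 WHERE name = 'new_dashboard'")
```

In a real deployment you would also record who changed what and when, since auditability is one of the reasons to move flags out of config files.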
However, you still wouldn't be able to do gradual rollouts, target users based on IDs or attributes, or monitor metrics. In other words, it limits the extent of your user segmenting.
Consequently, while config files and databases are quick solutions when you're just starting out, they are not good long-term options if you're looking to scale or use more dynamic flags.
You might then consider a feature flag open source solution, which allows for simple user segmentation and controlled rollouts. Nevertheless, such solutions wouldn’t allow you to track who made changes or limit access and they are generally language specific. This could be a problem if your application uses multiple languages so you would need to deploy multiple solutions.
Feature management services
The most dynamic flags are those whose state depends on the current user. Dynamic flags need to be hosted by a service rather than the static files mentioned previously, because feature flag services allow you to serve different values to different types of users at runtime.
In that sense, you would target values based on user IDs or according to percentage, i.e. a feature can be seen by x% of users, or according to a certain group the users belong to, using some user traits as targeting criteria.
This also works if you’re experimenting with features, with a traditional A/B testing approach, and you need to randomly put users in different groups so the user’s ID would be sent to the service to place in a group accordingly.
Such IDs (that could be stored in cookies or localStorage…) are also useful when testing internally where a feature is turned on for internal users as detected via a special ID value. Developers would then turn these features on for themselves to make sure it’s working as it should and test directly in production.
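Percentage and ID-based targeting is commonly implemented by hashing the user ID into a stable bucket, so a given user always lands on the same side of a rollout across sessions. This sketch shows the general technique, not any specific vendor's implementation:

```python
import hashlib

def bucket(user_id: str, flag_name: str) -> int:
    """Deterministically map a user to a bucket in [0, 100).

    Hashing flag name + user ID together keeps bucketing independent
    across flags, so the same users aren't always the guinea pigs.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def in_rollout(user_id: str, flag_name: str, percentage: int) -> bool:
    """True if this user falls inside the rolled-out percentage."""
    return bucket(user_id, flag_name) < percentage
```

Because the hash is deterministic, raising the percentage from 10 to 20 only adds users; nobody who already had the feature loses it mid-rollout.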
In such situations, you would opt for a third party service to fetch values of the feature flags.
Such feature flag services would, therefore, also allow you to choose values for each flag depending on specific attributes, such as the country users are from.
For example, AB Tasty’s flagging functionality allows you to assign specific flag values to different user segments. So, internal users would be served one value and would then see the new feature while external users will be served another value.
For example, in the image below from the AB Tasty dashboard, the value that is fetched will be IOS in Scenario 1 and ANDROID in Scenario 2 so only the users matching this value will be exposed to the feature.
This is extremely useful if you want to do gradual rollouts and testing in production and for generally more advanced use-cases.
Conclusion
For advanced use-cases, it would make sense to opt for a third party solution to help you manage and implement the different types of flags in your system.
These solutions offer features such as a UI for managing configuration with access control to limit who does what as well as audit logs to determine which changes were made by whom.
Therefore, sophisticated feature management solutions like AB Tasty are built to scale and include role-based and access controls to execute more complex use-cases.
What are the top feature flags projects on GitHub? Let’s find out.
If you are into open-source software, GitHub is probably the go-to place to find interesting projects you can use for free and contribute to.
Feature flags tools are no exception. We have listed below the top 10 feature toggle repositories on github.com ranked by popularity.
If you want to explore alternatives that scale better and are suitable for more use cases, read our article about feature flag implementation journey where we answer the question: should I build or buy a feature flag platform.
Unleash is an open-source feature management platform. It provides a great overview of all feature toggles/flags across all your applications and services. Unleash enables software teams all over the world to take full control of how they deploy new functionality to end users.
Flipper gives you control over who has access to features in your app.
Enable or disable features for everyone, specific actors, groups of actors, a percentage of actors, or a percentage of time.
Configure your feature flags from the console or a web UI.
Regardless of what data store you are using, Flipper can performantly store your feature flags.
Use Flipper Cloud to cascade features from multiple environments, share settings with your team, control permissions, keep an audit history, and rollback.
Piranha is a tool to automatically refactor code related to stale flags. At a high level, the input to the tool is the name of the flag and its expected behavior, after specifying a list of flag-related APIs in a properties file. Piranha uses these inputs to automatically refactor the code according to the expected behavior.
This repository contains four independent versions of Piranha, one for each of the four supported languages: Java, JavaScript, Objective-C and Swift.
Flagr is an open source Go service that delivers the right experience to the right entity and monitors the impact. It provides feature flags, experimentation (A/B testing), and dynamic configuration. It has clear swagger REST APIs for flags management and flag evaluation.
Flipt is an open source, on-prem feature flag application that allows you to run experiments across services in your environment. Flipt can be deployed within your existing infrastructure so that you don’t have to worry about your information being sent to a third party or the latency required to communicate across the internet.
Flipt supports use cases such as:
Simple on/off feature flags to toggle functionality in your applications
Rolling out features to a percentage of your customers
Using advanced segmentation to target and serve users based on custom properties that you define
Togglz is another implementation of the Feature Toggles pattern for Java.
Modular setup. Select exactly the components of the framework you want to use. Besides the main dependency, install specific integration modules if you are planning to integrate Togglz into a web application (Servlet environment) or if you are using CDI, Spring, Spring Boot, JSF.
Straightforward usage. Just call the isActive() method on the corresponding enum to check if a feature is active for the current user.
Admin console. Togglz comes with an embedded admin console that allows you to enable or disable features and edit the user list associated with every feature.
Activation strategies. They are responsible for deciding whether an enabled feature is active or not. Activation strategies can, for example, be used to activate features only for specific users, for specific client IPs or at a specified time.
Custom Strategies. Besides the built-in default strategies, it’s easy to add your own strategies. Togglz offers an extension point that allows you to implement a new strategy with only a single class.
Feature groups. To make sure you don't get lost in all the different feature flags, Togglz allows you to define groups for features that are used purely for visual grouping in the admin console.
FunWithFlags is an OTP application that provides a 2-level storage to save and retrieve feature flags, an Elixir API to toggle and query them, and a web dashboard as control panel.
It stores flag information in Redis or a relational DB (PostgreSQL or MySQL, with Ecto) for persistence and synchronization across different nodes, but it also maintains a local cache in an ETS table for fast lookups. When flags are added or toggled on a node, the other nodes are notified via PubSub and reload their local ETS caches.
Join VP Marketing Marylin Montoya as she takes a deep dive into all things experimentation
Today, we’re handing over the mic to AB Tasty’s VP Marketing Marylin Montoya to kick off our new podcast series, “1,000 Experiments Club.”
At AB Tasty, we’re a bunch of product designers, software engineers and marketers (aka Magic Makers), working to build a culture of experimentation. We wanted to move beyond the high-level rhetoric of experimentation and look into the nitty gritty building blocks that go into running experimentation programs and digital experiences.
Enter: “1,000 Experiments Club,” the podcast that examines how you can successfully do experimentation at scale. Our podcast brings together a selection of the best and brightest leaders to uncover their insights on how to experiment and how to fail … successfully.
In each episode, Marylin sits down to interview our guests from tech giants, hyper-growth startups and consulting agencies — each with their own unique view on how they’ve made experimentation the bedrock of their growth strategies.
You’ll learn about why failing is part of the process, how to turn metrics into your trustworthy allies, how to adapt experimentation to your company size, and how to get management buy-in if you’re just starting out. Our podcast is for CRO experts, product managers, software engineers; there’s something for everyone, no matter where you fall on the maturity model of experimentation!
We are kicking things off with three episodes, each guest documenting their journey of where they went wrong, but also the triumphs they’ve picked up from decades of experimentation, optimization and product development.
In the culture of experimentation, there’s no such thing as a “failed” experiment: Every test is an opportunity to learn and build toward newer and better ideas. So have a listen and subscribe to “1,000 Experiments Club” on Apple Podcasts, Spotify or wherever you get your podcasts.
Is experimentation for everyone? A resounding yes, says Jonny Longden. All you need are two ingredients: A strong desire and tenacity to implement it.
There’s a dangerous myth lurking around, and it’s the idea that you have to be a large organization to practice experimentation. But it’s actually the smaller companies and start-ups that need experimentation the most, says Jonny Longden of performance marketing agency Journey Further.
With over a decade of experience in conversion optimization and personalization, Jonny co-founded Journey Further to help clients embed experimentation into the heart of what they do. He currently leads the conversion division of the agency, which also focuses on PPC, SEO, PR — among other marketing specializations.
Any company that wants to unearth any sort of discovery should be using experimentation, especially start-ups who are in the explorative phase of their development. “Experimentation requires no size: It’s all about how you approach it,” Jonny shared with AB Tasty’s VP Marketing Marylin Montoya.
Here are a few of our favorite takeaways from our wide-ranging chat with Jonny.
The democratization of experimentation
People tend to see more experimentation teams and programs built at large-scale companies, but that doesn’t necessarily mean other companies of different sizes can’t dip their toes in the experimentation pool. Smaller companies and start-ups can equally benefit from this as long as they have the tenacity and capabilities to implement it.
You need to truly believe that without experimentation, your ideas won’t work, says Jonny. There are things that you think are going to work and yet they don’t. Conversely, there are many things that don’t seem like they work but actually end up having a positive impact. The only way to arrive at this conclusion is through experimentation.
Ultimately, the greatest discoveries (for example, space, travel, medicine, etc.) have come from a scientific methodology, which is just observation, hypothesis, testing and refinement. Approach experimentation with this mindset, and it’s anyone’s game.
Building the right roadmaps with product teams
Embedding experimentation into the front of the product development process is important, but yet most people don’t do it, says Jonny. From a pure business perspective, it’s about trying to de-risk development and prove the value of a change or feature before investing any more time, money and bandwidth.
Luckily, the agile methodology employed by many modern teams is similar to experimentation. Both rely on iterative customer collaboration and a cycle of rigorous research, quantitative and qualitative data collection, validation and iteration. The sweet spot is the collection of both quantitative and qualitative data — a good balance of feedback and volume.
The success of building a roadmap for an experimentation program comes down to understanding the organizational structure of a company or industry. In SaaS companies, experimentation is embedded into the product teams; for e-commerce businesses, experimentation fits better into the marketing side. Once you’ve determined the owner and objectives of the experimentation, you’ll need to understand whether you can effectively roll out the testing and have the right processes in place to implement results of a test.
Experimentation is, ultimately, innovation
The more you experiment, the more you drive value. Experimentation at scale enables people to learn and build more tests based on these learnings. Don’t use testing to only identify winners because there’s much more knowledge to be gained from the failed tests. For example, you may only have 1 in 10 tests that work. The real value comes in the 9 lessons you’ve acquired, not just the 1 test that showed positive impact.
When you look at it through these lenses, you’ll realize that the post-test research and subsequent actions are vital: That’s where you’ll start to make more gains toward bigger innovation.
Jonny calls this the snowball effect of experimentation. Experimentation is innovation — when done right. At the root, it’s about exploring and seeing how your customers respond. And as long as you’re learning from the results of your tests, you’ll be able to innovate faster precisely because you are building upon these lessons. That’s how you drive innovation that actually works.
What else can you learn from our conversation with Jonny Longden?
Moving from experimentation to validation
How to maintain creativity during experimentation
Using CRO to identify the right issues to tackle
The required building blocks to successful experimentation
About Jonny Longden
Jonny Longden leads the conversion division of Journey Further, a performance marketing agency specializing in PPC, SEO, PR, etc. Based in the United Kingdom, the part-agency, part-consultancy helps businesses become data-driven and build experimentation into their programs. Prior to that, Jonny dedicated over a decade in conversion optimization, experimentation and personalization, working with Sky, Visa, Nike, O2, Mvideo, Principal Hotels and Nokia.
About 1,000 Experiments Club
The 1,000 Experiments Club is an AB Tasty-produced podcast hosted by Marylin Montoya, VP of Marketing at AB Tasty. Join Marylin and the Marketing team as they sit down with the most knowledgeable experts in the world of experimentation to uncover their insights on what it takes to build and run successful experimentation programs.
Are you on the brink of launching a new feature – one that will affect many of your high-value clients? You’ve worked hard to build it, you’re proud of it and you should be!
You can’t wait to release it for all your users, but wait! What if you’ve missed something? Something that would ruin all your engineering efforts?
There's nothing worse than starting the day after a release by immediately dealing with alerts for production issues, spending the day combing through logging and monitoring systems for errors and, ultimately, having to roll back the feature you just launched. You would feel frustrated and unmotivated.
Beyond sapping the morale of your technical teams, late-caught bugs are expensive: NIST has shown that the longer a bug takes to be detected, the more costly it is to fix. This is illustrated by the following graph:
This is explained by the fact that once the feature has been released and is in production, finding bugs is difficult and risky. In addition to preventing users from being affected by problems, it’s critical to ensure service availability.
Are you sure your feature is bug-free?
You might think that this won’t happen to you. That your feature is safe and ready to deploy.
History has shown that it can happen to the biggest companies. Let’s name a few examples.
Facebook, May 7, 2020. An update to Facebook's SDK, rolled out to all users, contained a bug: a server value that was supposed to provide a dictionary of things was changed to provide a simple YES/NO instead. This tiny change was enough to break Facebook's authentication system and take down tons of other apps like TikTok, Spotify, Pinterest and Venmo, including apps that didn't even use Facebook's authentication system, since it is extremely common for apps to connect to Facebook regardless of whether they use a Facebook-related feature, mainly for ad attribution. The result was unequivocal: apps simply crashed right after launch. Facebook fixed the problem in a hurry, and it took about two hours for things to get back to normal. But do you have the same resources as Facebook?
Apple, September 19, 2012. Another good example, even though it's a bit older, is the replacement of Google Maps with Apple Maps in iOS 6 in 2012. In the eyes of many customers, and especially fans, Apple always handles the rollout of new features carefully, but this time they messed up. Apple didn't want to be tied to Google's app anymore, so they made their own version. However, in their rush to release their map system, Apple made some unforgivable navigational mistakes. Among the many failures of Apple Maps were erased cities, disappearing buildings, flattened landmarks, duplicate islands, distorted graphics, and erroneous location data. A large part of this mess could have been avoided if they had deployed their new map application progressively. They would have been able to spot the bugs and quickly fix them before massive deployment.
And now, thinking about all this and seeing that even the biggest companies are affected, you're stressed out and may not even want to release your feature anymore.
But don’t worry! At AB Tasty, we know that building a feature is only half of the story and that to be truly effective, that feature has to be well deployed.
Our feature management service has you covered. You'll find a set of useful features, such as progressive rollout, that free you from the fear of a release catastrophe and remove feature management friction, so that you can focus on value-added tasks, get high-quality features into production, and deliver maximum value to your customers.
What’s progressive rollout?
So now you’re curious: what’s progressive rollout? How will this help me monitor the release and make sure everything is okay?
A progressive rollout approach lets you test the waters of a new version with a restricted set of clients. You can set percentages of users to whom your feature will be released and gradually update the percentage to safely deploy your feature. You can also do a canary launch by manually targeting several groups of people at various stages of your rollout.
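Conceptually, a progressive rollout is a schedule of increasing percentages plus a rule for when to pause. A toy sketch of that control loop follows; the stages and error budget are hypothetical values, not recommendations:

```python
# Hypothetical staged rollout: ramp the audience up while metrics stay healthy.
STAGES = [1, 5, 25, 50, 100]  # percentage of users exposed at each stage
ERROR_BUDGET = 0.02           # pause the ramp if over 2% of requests fail

def next_stage(current: int, error_rate: float) -> int:
    """Advance to the next percentage, or hold if monitoring looks bad."""
    if error_rate > ERROR_BUDGET:
        return current        # pause the rollout: investigate, fix, maybe roll back
    later = [s for s in STAGES if s > current]
    return later[0] if later else current  # already at full rollout
```

Each stage gives monitoring a chance to catch problems while only a small slice of users is exposed, which is the whole point of ramping gradually.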
This is a practice already used by large companies that have realized the significant benefits of a progressive rollout.
Netflix, for example, is one of the most dynamic companies and its developers are constantly releasing updates and new software, but users rarely experience downtime and encounter very few bugs or issues. The company is able to deliver such a smooth experience thanks to sophisticated deployment strategies, such as Canary deployment and progressive deployment, multiple staging environments, blue/green deployments, traffic splitting, and easy rollbacks to help development teams release software changes with confidence that nothing will break.
Disney is another good example of a company that makes the most of progressive deployment. It has taken the phased deployment approach to a whole new level for its Disney+ and Star streaming services by deploying them regionally rather than globally. This delivery method is driven by the needs of the business: the company makes sure that everything is ready at the regional level, in line with its focus on the most important markets. Prior to launching Disney+ in Europe, it spent a lot of time building the local infrastructure needed to deliver a high-quality experience to consumers, including establishing local colocation facilities and beefing up data centers to cache content regionally. After starting the European rollout, Disney was able to identify that, in some markets, the launch of Disney+ could actually create issues that would have resulted in latency and thus a poor experience for affected users. So they took proactive steps to reduce their overall bandwidth usage by at least 25% prior to their March 24 launch and delayed their launch in France by two weeks. Without progressive deployment, they wouldn't have been able to identify these issues. And that's why the launch of Disney+ was remarkable.
What are the benefits of the progressive rollout?
There are three main benefits to the progressive rollout approach.
Avoiding bugs impacting everyone in production at once
First, by slowly ramping up the load, you can monitor and capture metrics about how the new feature impacts the production environment. If any unanticipated issues come to light, you can pause the full launch, fix the issues, and then smoothly move ahead. This data-driven approach guarantees a release with maximum safety and measurable KPIs.
Validating the “Viable part” in your MVP
You can effectively measure how well your feature is welcomed by your users. If you launch a new feature to 10% of your client base and notice revenue or engagement taking a dip, you can pause the release and investigate. The other major advantage? Anticipating costs. Since margin, profit and revenue are an important part of sustainability, unexpected costs that blow up your projected budgets at the end of the month are almost as bad as the night sweats that come from an unexpected bug! Monitoring your costs during a progressive rollout and immediately pausing the launch if those costs spike is a phenomenal level of control that you will absolutely want to get in on.
Progressively deploying services based upon business drivers
Finally, deploying a service or product progressively can also be a way of prioritizing specific markets based on data-driven business plans. When Disney launched "Star," its new channel in the Disney+ catalog for international audiences, it decided not to launch the service in the U.S. Star features more mature R-rated movies, FX TV shows, and other shows and movies that Disney owns the rights to but that do not fit the Disney+ family image. Ironically, U.S. customers have to pay extra on top of their Disney+ subscription to access the same content on another streaming service, Hulu.
The decision was made following a complex matrix of rights agreements and revenue streams. Disney found that subscribers are willing to pay for the separate Hulu and Disney+ libraries in the U.S., but that Star's more limited lineup was enough to justify a standalone paid purchase for international customers, who have to add $2 to their initial $6.99 subscription to access it. Once the Star content library is large enough to justify bypassing Hulu, U.S. customers will get access to it by paying just $1 more. This progressive rollout approach has enabled Disney to ensure that once it launches Star in the U.S., everything will be ready and it will achieve good results.
In other words, the progressive rollout approach helps you ensure that your functionality meets the criteria of usability, viability, and desirability in accordance with your business plan.
How to act fast when you identify bugs while progressively deploying a feature?
Now that you know more about the progressive rollout of your features/products, you may be wondering how to take action if you identify bugs or if things aren’t going well. Lucky for you, we’ve thought of that part too. In addition to progressive rollout, you’ll also find automatic rollback on KPIs and feature flagging in the AB Tasty toolkit.
Feature flagging lets you set up flags on your features that work as a simple on/off switch. If for any reason you identify threats in your rollout, or if user engagement is not really convincing, you can simply toggle your feature off and take the time to fix any issues.
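As a rough sketch of the on/off switch idea (not AB Tasty's actual API, where this is driven from a dashboard), a flag-guarded feature looks something like this in code:

```python
class FlagStore:
    """A minimal in-memory flag store illustrating the on/off switch idea."""

    def __init__(self):
        self._flags = {}

    def enable(self, name):
        self._flags[name] = True

    def disable(self, name):
        self._flags[name] = False

    def is_enabled(self, name, default=False):
        # Unknown flags fall back to a safe default: feature off.
        return self._flags.get(name, default)

flags = FlagStore()
flags.enable("new_chatbox")
if flags.is_enabled("new_chatbox"):
    pass  # serve the new chat box to this user
flags.disable("new_chatbox")  # instant rollback, no redeploy needed
```

The important detail is the safe default: a flag nobody has touched yet evaluates to "off," so a half-configured release can never leak to users.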
This assumes that you are aware of the problem and that someone from the product team is available to turn the flag off. But what if something happens overnight and no one can check on the progress of the deployment? For that eventuality, you can set up automatic rollbacks (also called Rollback Thresholds) linked to key performance indicators. Our algorithm checks the performance of your deployment against the KPIs you set and, if something goes wrong, automatically rolls back the deployment and informs you that a problem has occurred. This way, in the morning, your engineers can fix the problems without having to deal with the rollback themselves.
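To illustrate the idea of a rollback threshold, here is a hedged sketch of one monitoring iteration. The KPI names and threshold values are made up for the example; a real platform would collect these metrics and run this loop for you.

```python
def check_rollback(kpis: dict, thresholds: dict) -> list:
    """Return the list of KPIs that breached their threshold.

    `thresholds` maps a KPI name to (direction, limit), where direction
    is "min" (alert when the value drops below the limit) or "max"
    (alert when it rises above it).
    """
    breached = []
    for name, (direction, limit) in thresholds.items():
        value = kpis.get(name)
        if value is None:
            continue  # no data yet for this KPI
        if direction == "max" and value > limit:
            breached.append(name)
        elif direction == "min" and value < limit:
            breached.append(name)
    return breached

def monitor_step(flags: dict, flag_name: str, kpis: dict, thresholds: dict) -> list:
    """One polling iteration: disable the flag if any KPI breached."""
    breached = check_rollback(kpis, thresholds)
    if breached:
        flags[flag_name] = False  # automatic rollback of the feature
    return breached
```

In practice you would also alert the on-call engineer with the list of breached KPIs, so the morning fix starts with a diagnosis already in hand.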
Conclusion
Downtime incidents are stressful for both you and your customers. To resolve them quickly and efficiently, you need to have access to the right tools and make the most of them. The progressive rollout, automatic rollback, and feature flagging are great levers to relieve your product teams of stress and let them focus on innovating your product to create a wonderful experience for your users. Highly effective organizations have already realized the importance of having the right approach to deployment with the right tools. What about your organization?
AB Tasty minimizes risk and maximizes results to make the lives of Product teams a whole lot easier. Create a free account today!
In the previous post in this series, we mentioned how important it is to adhere to naming conventions to keep track of your flags, such as who created them and for what purpose.
We also mentioned that in a worst case scenario without such conventions, someone could accidentally trigger the wrong flag inciting disruptions within your system.
This brings us to why it’s also essential to manage and control who has access to the flags in your system. Without access control, you might also risk someone switching on a faulty feature and going live to your user base.
Consequently, access control should be used to manage the level of control people across your organization have and restrict usage if necessary.
The general rule of thumb is to only give users the kind of access they require to do their job, not more and not less.
Let’s look at a practical example…
AB Tasty, for example, offers a feature flagging functionality that is designed to be used by product and marketing teams as well as developers while offering fine control over user access.
As you can see in the image above, there are different categories of access under 'Role' in the AB Tasty dashboard. The categories are divided as follows:
SuperAdmin– this is the category with the highest access privileges. The SuperAdmin can do anything from creating a project to toggling a use case on/off, as well as access reporting.
Admin– the Admin can do almost anything a SuperAdmin can, except create, delete or rename a project.
User
Creator
Viewer
User, Creator and Viewer have varying but lower levels of access with Viewer having the lowest level of access.
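Under the hood, this kind of role-based access control often boils down to a mapping from roles to permission levels. The sketch below follows the role names above, but the exact permissions assigned to each role are illustrative assumptions, not AB Tasty's actual permission matrix:

```python
# Roles ordered from least to most privileged (ordering per the text;
# the relative order of User vs. Creator is an assumption).
ROLE_LEVEL = {"Viewer": 0, "Creator": 1, "User": 2, "Admin": 3, "SuperAdmin": 4}

# Minimum role level required for each action (illustrative values).
PERMISSION_MIN_LEVEL = {
    "view_flags": 0,
    "create_flag": 1,
    "toggle_flag": 2,
    "delete_flag": 3,
    "manage_project": 4,  # create/delete/rename projects: SuperAdmin only
}

def can(role: str, permission: str) -> bool:
    """Least-privilege check: allow an action only at or above its level."""
    return ROLE_LEVEL[role] >= PERMISSION_MIN_LEVEL[permission]
```

The rule of thumb from above maps directly onto this table: grant each person the lowest role that still covers every permission their job requires.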
It’s essential that as your use-cases grow and as you go further along the feature flag journey, you are able to manage access for users at a detailed level.
This is also useful when you’ve scheduled regular flag clean-ups to avoid technical debt. By implementing access control, you’ll minimize the risk of having someone on your team accidentally delete a permanent flag. Similarly, it allows teams to track short-lived flags they created so that they can delete them from your system to avoid accumulation of technical debt.
This way, flags can be grouped by owner, and each owner is responsible for assessing and keeping tabs on the flags they own, instead of the task of auditing the entire set of flags in your system being allocated at random.
Determining who gets access to what
As you’ve seen so far, different teams within your organization could and should have different levels of visibility and access to flags.
Once you’ve decided who gets access to which flags, you’ll need to consider how to control access.
Are technical teams the only ones allowed to make changes or will other teams such as your product and marketing teams get access to flags too? If so, you might need a sophisticated control panel to help them make changes.
Opting for feature flags as a service from a third party, rather than building your own system, will make this process easier, as such services provide you with a rich UI control panel without costing you too much time and money.
In contrast, building your own advanced feature flag service could drain your resources, especially as your use cases become more advanced and as more of your teams start using this service simultaneously.
Read more about whether you’re better off building or buying a feature flagging management system.
Audit logs
One way to track changes and control access to feature flags is by constructing an audit log. Such a log would help you keep track of all changes being made to each feature flag, such as who made the changes and when.
It allows for complete transparency, giving full visibility over the implementation of any feature flag changes in your system.
Because flag configuration carries risk, having an audit trail of change is crucial, especially when flags are being used in a highly regulated environment such as in finance or healthcare.
Restricting the number of people allowed to make changes to sensitive flags thus makes the process more secure.
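A minimal audit log only needs to capture who changed which flag, what they did, and when. As a sketch (a real trail would be append-only and stored durably):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditEntry:
    actor: str       # who made the change
    flag: str        # which flag was changed
    action: str      # e.g. "enabled", "disabled", "deleted"
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

class AuditLog:
    def __init__(self):
        self._entries = []

    def record(self, actor: str, flag: str, action: str) -> None:
        self._entries.append(AuditEntry(actor, flag, action))

    def history(self, flag: str) -> list:
        """All changes made to one flag, oldest first."""
        return [e for e in self._entries if e.flag == flag]
```

With this in place, "who toggled the kill switch at 3 a.m.?" is a one-line query rather than an archaeology project.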
Why does this matter?
At this point, you may be wondering why you should even consider giving all your teams, especially your non-technical teams, access to feature flags.
While your team of developers are the ones who create these flags, it is usually your marketing or product teams who will decide when to release a feature and for what purpose. They are the ones that are working closely with your customers and know their needs and preferences.
It is usually the case that your marketing team is the one forming these experiments in the first place to test out a new idea and this team would know who are the most relevant users to test on.
Therefore, the marketing team would often be the one who’s handling the experimental toggles, for example, to run A/B tests.
Meanwhile, operational toggles such as a kill switch are usually handled by the operations team.
A clear-cut UI panel restricts what those teams can do, so each team focuses on managing the flags it's actually responsible for, leaving no room for error.
A management system should allow you to keep track of your toggles, who created them and who can flip them on and off safely.
Conclusion
It's important to ensure that you have full visibility over the flags in your system and that everyone across your teams knows their level of responsibility when it comes to implementing feature flags.
Most sophisticated feature management solutions provide role-based access controls as well as an audit log for flags and contributors for increased transparency over your flags.
With such solutions, you would have an easy to use interface that allows you to expand feature management beyond your development team in a safe and controlled manner.
Feature flags are an indispensable tool to have when it comes to software development and release that can be used for a wide range of use-cases.
You know the saying ‘you can never have too much of a good thing’? In this case, the opposite is true. While feature flags can bring a lot of value to your processes, they can also pose some problems particularly if mismanaged. Such problems particularly come about when using an in-house feature flagging platform.
In this series of posts of ‘Fun with Flags’, we will be introducing some essential best practices when it comes to using and implementing feature flags.
In this first post, we will focus on an important best practice when it comes to keeping your flags organized: naming your flags.
What are feature flags?
First things first, let’s start with a brief overview of feature flags.
Feature flags are a software development practice that allows you to turn certain functionalities on and off to safely test new features without changing code.
Feature flags mitigate risks of new releases by decoupling code deployment from release. Therefore, any new changes that are ready can be released while features that are still a work-in-progress can be toggled off making them invisible to users.
There are many types of feature flags, categorized based on their dynamism and longevity that serve different purposes.
The more flags in your system, the harder they become to manage. Therefore, you need to establish some practices in order to mitigate any negative impact on your system.
It is because of these different categories of flags that establishing a naming convention becomes especially important.
Click here for our definitive guide on feature flags.
What’s in a name?
Defining a name or naming convention for your flags is one essential way to keep all the different flags you have in your system organized.
Before creating your first flag, you will need to establish a naming convention that your teams could adhere to when the time comes to start your feature flag journey. Thus, you need to be organized from the very beginning so as not to lose sight of any flags you use now and in the future.
The most important rule when naming your flags is to try to be as detailed and descriptive as possible. This will make it easier for people in your organization to determine what this flag is for and what it does. The more detailed, the less the margin for confusion and error.
We suggest the following when coming up with a naming convention:
Include your team name– this is especially important if you come from a big organization with many different teams who will be using these flags.
Include the date of creating the flag– this helps when it comes to cleaning up flags, especially ones that have been in your system for so long that they’re no longer necessary and have out-served their purpose.
Include a description of the flag’s behavior.
Include the flag’s category– i.e. whether it’s permanent or temporary.
Include information about the test environment and scope as in which part of the system the flag affects.
Let’s consider an example…
Let’s say you want to create a kill switch that would allow you to roll back or turn off any buggy features.
Kill switches are usually long-lived and may stay in a codebase for a long time, so it is essential that the name is clear and detailed enough to keep track of the flag, since its use can extend across many contexts over a long period of time.
The name of such a flag could look something like this:
algteam_20-07-2021_Ops_Configurator-killswitch
Following the naming convention mentioned above, the first part is the name of the team followed by the date of creation then the category and behavior of the flag. In this case, a kill switch comes under the category of operational (ops) toggles.
Using such a naming convention allows the rest of your teams across your organization to see exactly who created the flag, for what purpose and how long it has been in the codebase.
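If you formalize the convention, you can even enforce it in code or in CI. Below is a sketch of a validator for the four-part pattern of the kill-switch example above (team, date, category, description); the exact regular expression is an assumption about how strictly you want to encode the convention:

```python
import re

# team _ DD-MM-YYYY _ category _ description, per the example above
FLAG_NAME = re.compile(
    r"^(?P<team>[A-Za-z]+)_"
    r"(?P<date>\d{2}-\d{2}-\d{4})_"
    r"(?P<category>[A-Za-z]+)_"
    r"(?P<description>[A-Za-z0-9-]+)$"
)

def parse_flag_name(name: str):
    """Return the flag's parts as a dict, or None if it breaks the convention."""
    m = FLAG_NAME.match(name)
    return m.groupdict() if m else None
```

A check like this in your CI pipeline rejects a lazily named 'new-flag' before it ever reaches the codebase.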
Why is this important?
Sticking with such naming conventions will help differentiate such long-lived flags from shorter-lived ones.
This is important to remember because an operational toggle such as a kill switch serves a vastly different purpose than, for example, an experiment toggle.
Experiment toggles are created in order to measure the impact of a feature flag on users. Over the course of your feature flag journey, you may create numerous flags such as these to test different features.
Without an established and well-thought-out naming convention, you'll end up with something like 'new-flag', and several flags later this naming system becomes tedious and repetitive to the point where you cannot determine which flag is actually new.
In this case, you might want to opt for something more specific. For example, if you want to test out a new chat option within your application, you’ll want to name it something akin to:
Markteam_21-07-2021_chatbox-temp
Such a name will clue you into who created it and what this flag was created for, which is to test the new chat box option and of course, you will know it’s a temporary flag.
This way, there will be no more panic from trying to figure out what one flag out of hundreds is doing in your system or ending up with duplicate flags.
Or even worse, someone on your team could accidentally trigger the wrong flag, inciting chaos and disruption in your system and ending up with very angry users.
Don’t drown in debt
Making a clear distinction between the categories of flags is crucial as short-lived, temporary flags are not meant to stay in your system for long.
This will help you clean up those ‘stale’ flags, otherwise they may end up clogging your system giving rise to the dreaded technical debt.
To avoid an accumulation of this type of debt, you will need to frequently review the flags you have in your system, especially the short-lived ones.
This would be no easy task if your flags had names like ‘new flag 1’, ‘new flag 2’, and well, you get the gist.
A wise man once said…
…“the beginning of wisdom is to call things by their proper name” (Confucius). In the case of feature flags, this rings more true than ever.
Whichever naming convention suits your organization best, make sure that everyone across all your teams sticks to it.
Feature flags can bring you a world of benefits but you need to use them with caution to harness their full potential.
AB Tasty's flagging functionality lets you remotely manage your flags and keep technical debt at bay, providing dedicated features to help you keep your feature flags under control.
For example, the Flag Tracking dashboard within the AB Tasty platform lists all the flags that are set up in your system so you can easily keep track of every single flag and its purpose.
One of the most critical metrics in DevOps is the speed with which you deliver new features. When developers, ops teams, and support staff are aligned, they quickly get new software into production; that generates value sooner and can often be the deciding factor in whether your company gains an edge on the competition.
Quick delivery also shortens the time between software development and user feedback, which is essential for teams practicing CI/CD.
One practice you should consider adding to your CI/CD toolkit is the blue-green deployment. This process helps reduce both technical and business risks associated with software releases.
In this model, two identical production environments nicknamed “blue” and “green” are running side-by-side, but only one is live, receiving user transactions. The other is up but idle.
In this article, we’ll go over how blue-green deployments work. We’ll discuss the pros and cons of using this approach to release software. We’ll also compare how they stack up against other deployment methodologies and give you some of our recommended best practices for ensuring your blue-green deployments go smoothly.
How do blue-green deployments work?
One of the most challenging steps in a deployment process is the cutover from testing to production. It must happen quickly and smoothly to minimize downtime.
A blue-green deployment methodology addresses this challenge by utilizing two parallel production environments. At any given time, only one of them is the live environment receiving user transactions. In the image below, that would be green. The blue idle system is a near-identical copy.
Your team will use the idle blue system as your test or staging environment to conduct the final round of testing when preparing to release a new feature. Once the new software is working correctly on blue, your ops team can switch routing to make blue the live system. You can then implement the feature on green, which is now idle, to get both systems resynchronized.
Generally speaking, that is all there is to a blue-green deployment. You have a great deal of flexibility in how the parallel systems and cut-overs are structured. For example, you might not want to maintain parallel databases, in which case all you will change is routing to web and app servers. For another project, you may use a blue-green deployment to release an untested feature on the live system, but set it behind a feature flag for A/B user testing.
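Stripped to its essentials, the cutover is a single atomic swap of which environment is live. This toy sketch stands in for what a load balancer or DNS change does in practice; the class and method names are illustrative:

```python
class BlueGreenRouter:
    """Toy router state for a blue-green setup: one environment is live,
    the other idle, and a cutover is a single atomic swap."""

    def __init__(self, live="blue", idle="green"):
        self.live, self.idle = live, idle

    def route(self, request):
        # In production this job belongs to a load balancer or DNS;
        # here we simply tag the request with the live environment.
        return {"request": request, "served_by": self.live}

    def cutover(self):
        """Swap live and idle, e.g. after the idle side passes final QA."""
        self.live, self.idle = self.idle, self.live
```

Note that rollback is just the same swap in reverse, which is exactly why blue-green recoveries are so fast.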
Example
Let’s say you’re in charge of the DevOps team at a niche e-commerce company. You sell clothing and accessories popular in a small but high-value market. On your site, customers can customize and order products on-demand.
Your site’s backend consists of many microservices in a few different containers. You have microservices for inventory management, order management, customization apps, and a built-in social network to support your customers’ niche community.
Your team releases early and often, and you credit your CI/CD model for your continued popularity. But this niche community is global, so your site sees fairly steady traffic throughout any given day. Finding a lull in which to update your production system is always tricky.
When one of your teams announces that their updated customization interface is ready for final testing in production, you decide to release it using a blue-green deployment so it can go out right away.
Animation of load balancer adjusting traffic from blue to green (Source)
The next day before lunch, your team decides they're ready to launch the new customizer. At that moment, all traffic routes to your blue production system. You update the software on your idle green system and ask testers to put it through QA. Everything looks good, so your ops team uses a load balancer to redirect user sessions from blue to green.
Once traffic is completely filtered over to green, you make it the official production environment and set blue to idle. Your dev team pushes the updated customizer code to blue, puts in their lunch order, and takes a look at your backlog.
Pros: Benefits & use cases
One of the primary advantages of blue-green deployments over other software release strategies is how flexible they are. They can be beneficial in a wide range of environments and many use cases.
Rapid releasing
For product owners working within CI/CD frameworks, blue-green deployments are an excellent method to get your software into production. You can release software practically any time. You don’t need to schedule a weekend or off-hours release because, in most cases, all that is necessary to go live is a routing change. Because there is no associated downtime, these deployments have no negative impact on users.
They’re less disruptive for DevOps teams too. They don’t need to rush updates during a set outage window, leading to deployment errors and unnecessary stress. Executive teams will be happier too. They won’t have to watch the clock during downtime, tallying up lost revenue.
Simple rollbacks
The reverse process is equally fast. Because blue-green deployments utilize two parallel production environments, you can quickly flip back to the stable one should any issues arise in your live environment.
This reduces the risks inherent in experimenting in production. Your team can easily remove any issues with a simple routing change back to the stable production environment. There is a risk of losing user transactions cutting back—which we’ll get into a little further down—but many strategies for managing that situation are available.
You can temporarily set your app to be read-only during cutovers. Or you could do rolling cutovers with a load balancer while you wait for transactions to complete in the live environment.
Built-in disaster recovery
Because blue-green deployments use two production environments, they implicitly offer disaster recovery for your business systems. A dual production environment is its own hot backup.
Load balancing
Blue-green parallel production environments also make load balancing easy. When the two environments are functionally identical, you can use a load balancer or feature toggle in your software to route traffic to different environments as needed.
Easier A/B testing
Another use case for parallel production environments is A/B testing. You can load new features onto your idle environment and then split traffic with a feature toggle between your blue and green systems.
Collect data from those split user sessions, monitor your KPIs, and then, if analyses of the new feature look good in your management system, you can flip traffic over to the updated environment.
Cons: Challenges to be aware of
Blue-green deployments offer a great deal of value, but integrating the infrastructure and practices required to carry them out creates challenges for DevOps teams. Before integrating blue-green deployments into your CI/CD pipeline, it is worth understanding these challenges.
Resource-intensive
As is evident by now, to perform a blue-green deployment, you will need to resource and maintain two production environments. The costs of this, in money and sysadmin time, might be too high for some organizations.
Others may only be able to commit such resources for their highest-value products. If that is the case, does the DevOps team release software in a CI/CD model for some products but not others? That may not be sustainable.
Extra database management
Managing your database—or multiple databases—when you have parallel production environments can be complicated. You need to account for anything downstream of the software update in both your blue and green environments, such as any external services you're invoking.
For example, what if your feature change requires you to rename a database column? As soon as you change the name on blue, the green environment running the old code won't function with that database anymore.
Can your entire production environment even function with two separate databases? That’s often not the case if you’re using your blue and green systems for load balancing, testing, or any function other than as a hot backup.
A blue-green deployment diagram with a single database (Source)
Product management
Aside from system administration, managing a product that runs on two near-identical environments also requires more resources. Product Managers need reliable tools for tracking how their software is performing, which services different teams are updating, and ways to monitor the KPIs associated with each. A reliable product and feature management dashboard to monitor and coordinate all of these activities becomes essential.
Blue-green deployments vs. rolling deployments
Blue-green deployments are, of course, not the only option for performing rapid software releases. Another popular approach is to conduct a rolling deployment.
Rolling deployments also require a production environment that consists of multiple servers hosting an application, often, but not always, with a load balancer in front of them for routing traffic. When the DevOps team is ready to update their application, they configure a staggered release, pushing to one server after another.
While the release is rolling out, some live servers will be running the updated application, while others have the older version. This contrasts with a blue-green deployment, where the updated software is either live or not for all users.
As users initiate sessions with the application, they might either reach the old copy of the app or the new one, depending on how the load balancer routes them. When the rollout is complete, every new user session that comes in will reach the software’s updated version. If an error occurs during rollout, the DevOps team can halt updates and route all traffic to the remaining known-good servers until they resolve the error.
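The staggered, halt-on-failure behavior described above can be sketched as a simple loop. The `healthy` callback is an assumption standing in for real post-update health checks and monitoring:

```python
def rolling_deploy(servers: dict, new_version: str, healthy) -> list:
    """Update servers one at a time; halt on the first failed health check.

    `servers` maps hostname -> currently deployed version.
    `healthy(host)` returns True if the server passes its check after
    being updated. Returns the list of servers successfully updated.
    """
    updated = []
    for host, old_version in list(servers.items()):
        servers[host] = new_version
        if healthy(host):
            updated.append(host)
        else:
            servers[host] = old_version  # roll this server back and halt
            break
    return updated
```

Because the loop halts on the first failure, a bad build only ever touches one server, and the remaining known-good servers keep serving traffic.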
Rolling deployments are a viable option for organizations with the resources to host such a large production environment. For those organizations, they are an effective method for releasing small, gradual updates, as you would in agile development methodologies.
There are other use cases where blue-green deployments may be a better fit. For example, if you’re making a significant update where you don’t want any users to access the old version of your software, you would want to take an “all or nothing” approach, like a blue-green deployment.
Suppose your application requires a high degree of technical or customer support. In that case, the support burden is magnified during rolling deployment windows when support staff can’t tell which version of an application users are running.
Blue-green deployments vs. canary releasing
Rolling and blue-green deployments aren’t the only release strategies out there. Canary deployments are another alternative. At first, only a subset of all production environments receives a software update in a canary release. But instead of continuing to roll deploy to the rest, this partial release is held in place for testing purposes. A subset of users is then directed to the new software by a load balancer or a feature flag.
Canary releasing makes sense when you want to collect data and feedback from an identifiable set of users about updated software. Practicing canary releases dovetails nicely with broader rolling deployments, as you can gradually roll the updated software out to larger and larger segments of your user base until you’ve finished updating all production servers.
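Routing to the canary can combine an explicit list of identified early adopters with a small, stable percentage of everyone else. A sketch, with names and the hashing choice as illustrative assumptions:

```python
import zlib

def choose_environment(user_id: str, canary_users: set, canary_percent: float) -> str:
    """Send enrolled early adopters, plus a stable slice of everyone else,
    to the canary; the rest stay on the stable production environment."""
    if user_id in canary_users:
        return "canary"
    # CRC32 gives a cheap, deterministic bucket so a user's assignment
    # does not change between requests.
    bucket = zlib.crc32(user_id.encode()) % 100
    return "canary" if bucket < canary_percent else "stable"
```

The deterministic bucket matters for canaries even more than for plain rollouts: the users you are collecting feedback from must see the new version consistently.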
Best practices
You have many options for releasing software quickly. If you’re considering blue-green deployments as your new software release strategy, we recommend you adopt some of these best practices.
Automate as much as possible
Scripting and automating as much of the release process as possible has many benefits. Not only will the cutover happen faster, but there’s less room for human error. A dev can’t accidentally forget a checklist item if a script or a management platform handles the checklist. If everything is packaged in a script, then any developer or non-developer can carry out the deployment. You don’t need to wait for your system expert to get back to the office.
Monitor your systems
Always make sure to monitor both blue and green environments. For a blue-green deployment to go smoothly, you need to know what is going on in both your live and idle systems.
Both systems will likely need the same set of monitoring alerts, but set to different priorities. For example, you'll want to know the second there is an error in your live system, while the same error in the idle system may only need to be addressed sometime that business day.
In some cases, new and old versions of your software won't be able to run simultaneously during a cutover. For example, if you need to alter your database schema, you should structure your updates so that both blue and green systems remain functional throughout the cutover.
One way to handle these situations is to break your releases down into a series of even smaller release packages. Let’s say our e-commerce company is deepening its inventory and needs to update its database by changing a field name from “shirt” to “longsleeve_shirt” for clarity.
They might break this update down by:
Releasing a feature flag-enabled intermediary version of their code that can interpret results from both “shirt” and “longsleeve_shirt”;
Running a rename migration across their entire database to rename the field;
Releasing the final version of the code—or flip their feature flag—so the software only uses “longsleeve_shirt.”
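The intermediary step above can be sketched as code that understands both field names while the rename is in flight, plus the migration itself. Field names follow the example in the text; the function names are illustrative:

```python
def get_shirt_field(record: dict):
    """Intermediary read path: accept both the old and the new field name
    while the rename migration is in flight."""
    if "longsleeve_shirt" in record:
        return record["longsleeve_shirt"]
    return record.get("shirt")

def rename_migration(records: list) -> list:
    """Step 2: rename the field across every stored record."""
    for r in records:
        if "shirt" in r:
            r["longsleeve_shirt"] = r.pop("shirt")
    return records
```

Because the intermediary code tolerates both names, the migration can run at any time between the two releases without breaking either the blue or the green environment.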
Do more, smaller deployments
Smaller, more frequent updates are already an integral practice in agile development and CI/CD. It is even more important to follow this practice if you’re going to conduct blue-green deployments. Reducing deployment times shortens feedback loops, informing the next release, making each incremental upgrade more effective and more valuable for your organization.
Restructure your applications into microservices
This approach goes hand-in-hand with conducting smaller deployments. Restructuring application code into sets of microservices allows you to manage updates and changes more easily. Different features are compartmentalized in a way that makes them easier to update in isolation.
Use feature flags to reduce risk further
By themselves, blue-green deployments create a single, short window of risk. You're updating everything all at once, but you can cut back over to the old environment should an issue arise.
Blue-green deployments also have a pretty consistent amount of administrative overhead that comes with each cutover. You can reduce this overhead through automation, but still, you’re going to follow the same process no matter whether you’re updating a single line of code or you’re overhauling your entire e-commerce suite.
At its core, a feature flag is a conditional statement that determines at runtime whether a given feature is active. Those conditions can be simple "yes/no" checks, or they can be complex decision trees. Feature flags help make software releases more manageable by controlling what is turned on or off at a feature-by-feature level.
For example, our e-commerce company can perform a blue-green deployment of their customizer microservice but leave the new code turned off behind a feature flag in the live system. Then, the DevOps team can turn on that feature according to whatever condition they wish, whenever it is convenient.
The team might want to do some further A/B testing in production. Or maybe they want to conduct some further fitness tests. Or it might make more sense for the team to do a canary release of the customizer for an identified set of early adopters.
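This "deploy dark, then enable" pattern is easy to sketch: the new customizer code ships with the cutover but stays off behind a flag until the team flips it at runtime. The flag store and feature name below are illustrative assumptions, not a specific product's API.

```python
# Flag state as deployed: the new code is live but dark.
FLAGS = {"new_customizer": False}

def render_customizer() -> str:
    if FLAGS["new_customizer"]:
        return "new customizer"   # flag-gated code, already in production
    return "old customizer"       # stable path users see in the meantime

# Later, whenever it is convenient, the team enables the feature at runtime:
FLAGS["new_customizer"] = True
```

The key property is that enabling the feature requires no redeployment: it is a runtime state change, so it can be reversed just as quickly.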
Your feature flags can work in conjunction with a load balancer to manage which users see which application and feature subsets while performing a blue-green deployment. Instead of switching over entire applications all at once, you can cut over to the new application and then gradually turn individual features on and off on the live and idle systems until you’ve completely upgraded. This gradual process reduces risk and helps you track down any bugs as individual features go live one-by-one.
You can manually control feature flags in your codebase, or you can use feature flag services for more robust control. These platforms offer detailed reporting and KPI tracking along with a deep set of DevOps management tools.
We recommend using feature flags in any major application release when you’re doing a blue-green deployment. They’re valuable even in smaller deployments where you’re not necessarily switching environments. You can enable features gradually one at a time on blue, leaving green on standby as a hot backup if a major problem arises. Combining feature flags with blue-green deployments is an excellent way to perform continuous delivery at any scale.
Consider adding blue-green deployments to your DevOps arsenal
Blue-green deployments are an excellent method for managing software releases of any size, no matter whether they’re a whole application, major updates, a single microservice, or a small feature update.
It is essential to consider how well blue-green deployments will integrate into your existing delivery process before adopting them. This article detailed how blue-green deployments work, the pros and cons of using them in your delivery process, and how they stack up against other possible deployment methods. You should now have a better sense of whether blue-green deployments might be a viable option for your organization.
Picking an effective deployment strategy is an important decision for every DevOps team. Many options exist, and you want to find the strategy that best aligns with how you work. Today, we’ll go over canary deployments.
Are you an agile organization? Are you performing continuous integration and continuous delivery (CI/CD)? Are you developing a web app? Mobile app? Local desktop or cloud-based app? These factors, and many others, will determine how effective any given deployment strategy will be.
But no matter which strategy you use, remember that deployment issues will be inevitable. A merge may go wrong, bugs may appear, human error may cause a problem in production. The point is, don’t wear yourself out trying to find a deployment strategy that will be perfect. That strategy doesn’t exist.
Instead, try to find a strategy that is highly resilient and adaptive to the way you work. Rather than trying to prevent inevitable errors, deploy code in a way that minimizes their impact and allows you to respond quickly when they do occur.
Canary deployments can help you put your best code into production as efficiently as possible. In this article, we’ll go over what they are and what they aren’t. We’ll go over the pros and cons, compare them to other deployment strategies, and show you how you can easily begin performing such deployments with your team.
What is a canary deployment?
Canary deployments are a best practice for teams who’ve adopted a continuous delivery process. With this strategy, a new feature is first made available to a small subset of users. The new feature is monitored for several minutes to several hours, depending on the traffic volume, or just long enough to collect meaningful data. If the team identifies an issue, the new feature is quickly pulled. If no problems are found, the feature is made available to the entire user base.
The term “canary deployment” has a fascinating history. It comes from the phrase “canary in a coal mine,” which refers to the historical use of canaries and other small songbirds as living early-warning systems in mines. Miners would bring caged birds with them underground. If the birds fell ill or died, it was a warning that odorless toxic gases, like carbon monoxide, were present. While inhumane, it was an effective process used in Britain and the US until 1986, when electronic sensors replaced canaries.
A canary deployment turns a subset of your users (ideally a bug-tolerant subset) into your own early warning system. That user group identifies bugs, broken features, and unintuitive features before your software gets wider exposure.
Your canary users could be self-identified early adopters, a demographically targeted segment, or a random sampling: whichever mix of users makes the most sense for verifying your new feature in production.
One helpful way to think about canary deployments is as a form of risk management. You are free to push new, exciting features more regularly without having to worry that any one new feature will harm the experience of your entire user base.
Canary releases vs. canary deployments
The phrases “canary release” and “canary deployment” are sometimes used interchangeably, but in DevOps, they really should be thought of as separate. A canary release is a test build of a complete application. It could be a nightly release or a beta, for example.
Teams will often distribute canary releases hoping that early adopters and power users, who are more familiar with development processes, will download the new application for real-world testing. The browser teams at Mozilla and Google, and many other open-source projects, are fond of this release strategy.
On the other hand, canary deployments are what we described earlier. A team will release new features into production with early adopters or different user subsets, routed to the new software by a load balancer or feature flag. Most of the user base still sees the current, stable software.
Canary deployment pros and cons
Canary deployments can be a powerful and effective release strategy. But they’re not the correct strategy in every possible scenario. Let’s run through some of the pros and cons so you can better determine whether they make sense for your DevOps team.
Pros
Support for CI/CD processes
Canary deployments shorten feedback loops on new features delivered to production. DevOps teams get real-world usage data faster, which allows them to refine and integrate the next round of features faster and more effectively. Shorter development loops like this are one of the hallmarks of continuous integration/continuous delivery processes.
Granular control over feature deployments
If your team conducts smaller, regular feature deployments, you reduce the risk of errors disrupting your workflow. If you catch a mistake in the deployment, you won’t have exposed many users to it, and it will be a minor matter to resolve. You won’t have exposed your entire user population and needed to pull colleagues off planned work to fix a major production issue.
Real-world testing
Internal testing has its place, but it is no substitute for putting your application in front of real-world users. Canary deployments are an excellent strategy for conducting small-scale real-world testing without imposing the significant risks of pushing an entirely new application to production.
Quickly improve engagement
Besides offering better technical testing, canary deployments allow you to quickly see how users engage with your new features. Are session lengths increasing? Are engagement metrics rising in the canary? If no bugs are found, get that feature in front of everyone.
There is no need to wait for a more extensive test deployment to complete. Engage those users and get iterating on your next feature.
More data to make business cases
Developers may see the value in their code, but DevOps teams still need to make business cases to leadership and the broader organization when they need more resources.
Canary deployments can quickly show you what demand might be for new features. Conduct a deployment for a compelling new feature on a small group of influencer users to get them talking. Use engagement and publicity metrics to make the case why you want to push a major new initiative tied to that feature.
Stronger risk management
Canary deployments are effectively a series of microtests. Rolling out new features incrementally and verifying them one at a time with canary testing can significantly reduce the total cost of errors or larger system issues. You'll never need to roll back a major release, suffer a PR hit, or rework a large and unwieldy codebase.
Cons
More overhead
Like any complex process, canary deployments come with some downsides. If you're going to use a load balancer to partition users, you will need additional infrastructure and will have to take on extra administration.
In this scenario, you create a second production environment and backend that will run alongside your primary environment. You will have two codebases, two app servers, potentially two web servers, and networking infrastructure to maintain.
Alternatively, many DevOps teams use feature flags to manage their canary deployments on a single system. A feature flag can partition users into a canary test at runtime within a single code base. Canary users see the new feature, and everyone else runs the existing code.
Deploying local applications is hard
If you’re developing a locally installed application, you run the risk of users needing to initiate a manual update to get the latest version of your software. If your canary deployment sits in that latest update, your new feature may not get installed on as many client systems as you need to get good test results.
In other words, the more your software runs client-side, the less amenable it is to canary deployments. A full canary release might be a more suitable approach to get real-world test results in this scenario.
Users are still exposed to software issues
While the whole point of a canary deployment is to expose only a few users to a new feature to spare the broader user base, you will still expose end users to less-tested code. If the fallout from even a few users encountering a problem with a particular feature is too significant, then consider skipping this kind of deployment in favor of more rigorous internal testing.
How to perform a canary deployment
Planning out a canary deployment takes a few simple steps:
Identify your canary group
There are several different ways you can select a user group to be your canary.
Random subset
Pick a truly random sampling of different users. While you can do this with a load balancer, feature flag management software can easily route a certain percentage of total traffic to a canary test using a simple modulo.
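The modulo approach mentioned above can be sketched in a few lines. Hashing the user ID keeps each user's assignment stable across sessions; the function name and 5 percent default are illustrative assumptions.

```python
import zlib

def in_canary(user_id: str, percent: int = 5) -> bool:
    # crc32 hashes the ID to a stable integer; modulo 100 buckets users
    # into 0-99, so buckets below `percent` form roughly that share of traffic.
    return zlib.crc32(user_id.encode()) % 100 < percent
```

Because the hash is deterministic, a given user always lands in the same bucket, so canary membership doesn't flicker between requests.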
Early adopters
If you run an early adopter program for highly engaged users, consider using them as your canary group. Make it a perk of their program. In exchange for tolerating bugs they might encounter in a canary deployment, you can offer them loyalty rewards.
By region
You might want to assign a specific region to be your canary. For example, you could set European IPs during late evening hours to go to your canary deployment. You would avoid exposing daytime users to your new features but still get a handful of off-hours user sessions to use as a test.
Internal testers
You can always configure sessions from your internal subnets to be the canary.
Decide on your canary metrics
The purpose of conducting a canary deployment is to get a firm “yes” or “no” answer to the question of whether your feature is safe to push into wider production. To answer that question, you first need to decide what metrics you’re going to use and install the means for monitoring performance.
For example, you may decide you want to monitor:
Internal error counts
CPU utilization
Memory utilization
Latency
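Metrics like these can feed an automated go/no-go check. A sketch, assuming you collect the same metrics for the canary and a baseline group; the threshold ratios here are illustrative, not recommendations.

```python
def canary_healthy(canary: dict, baseline: dict,
                   max_error_ratio: float = 1.5,
                   max_latency_ratio: float = 1.2) -> bool:
    """True if the canary's error count and latency stay within acceptable
    multiples of the baseline's; otherwise the canary should be pulled."""
    if canary["errors"] > baseline["errors"] * max_error_ratio:
        return False
    if canary["latency_ms"] > baseline["latency_ms"] * max_latency_ratio:
        return False
    return True
```

Comparing against a live baseline rather than fixed limits helps filter out platform-wide noise, such as a slow upstream dependency affecting both groups equally.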
You can customize feature management software quickly and easily to monitor performance analytics. These platforms can be excellent tools for encouraging a culture of experimentation.
Decide how to transition from canary to full deployment
As discussed, canary deployments should only last on the order of several minutes to several hours. They are not intended to be overly long experiments. Because the timeframe is so short, your team should decide up front how many users or sessions you want in the canary and how you're going to move to full deployment once your metrics hit positive benchmarks.
For example, you could go with a 5/95 random canary deployment. Configure a feature flag to move a random 5 percent of your users to the canary test while the remaining 95 percent stay on the stable production release. If you see positive results, remove the flag and deploy the feature completely.
Or you might want to take a more conservative approach. Another popular canary strategy is to deploy a canary test logarithmically, going from a 1 percent random sample to 10 percent to see how the new feature stands up to a larger load, then up to a full 100 percent.
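The ramp described above amounts to a simple stage schedule: hold at each percentage, verify your metrics, then advance. The stage values and names below are illustrative.

```python
RAMP_STAGES = [1, 10, 100]  # percent of traffic routed to the canary

def next_stage(current: int):
    """Return the next ramp stage after `current`, or None once rollout
    is complete. Metrics should be verified before each advance."""
    for stage in RAMP_STAGES:
        if stage > current:
            return stage
    return None
```

Encoding the schedule as data makes it easy to adjust: a more cautious team might use `[1, 5, 25, 100]` without touching the advancing logic.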
Determine what infrastructure you need
Once your team is on the same page about the approach you’ll take, you’ll need to make sure you have all the proper infrastructure in place to make your canary deployment go off without a hitch.
You need a system for partitioning the user base and for monitoring performance. You can use a router or load balancer for the partitioning, but you can also do it right in your code with a feature flag. Feature flags are often more cost-effective and quick to set up, and they can be the more powerful solution.
Canary vs. blue/green deployments
Canary deployments are also sometimes confused with blue/green deployments. Both can use parallel production environments, managed with a load balancer or feature flag, to mitigate the risk of software issues.
In a blue/green deployment, those environments start identical, but only one receives traffic (the blue server). Your team releases a new feature onto the hot backup environment (the green server). Then the router, feature flag, or whatever mechanism manages your traffic gradually shifts new user sessions from blue to green until 100 percent of all traffic goes to green. Once the cutover is complete, the team updates the now-old blue server with the new feature, and it becomes the hot backup environment.
The way the switchover is handled in these two strategies differs because of the desired outcome. Blue/green deployments are used to eliminate downtime. Canary deployments are used to test a new feature in a production environment with minimal risk and are much more targeted.
Use feature flags for better deployments
When you boil it right down, a feature flag is nothing more than an “if” statement from which users take different code paths at runtime depending on a condition or conditions you set. In a canary deployment, that condition is whether the user is in the canary group or not.
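In code, that is literally just a conditional: one check, two code paths at runtime. A minimal sketch, with an illustrative membership test standing in for whatever condition you configure:

```python
CANARY_GROUP = {"user-17", "user-42"}  # illustrative canary membership

def recommend(user_id: str) -> str:
    # The feature flag: an "if" statement selecting a code path at runtime.
    if user_id in CANARY_GROUP:
        return "new recommender"    # canary users get the new feature
    return "stable recommender"     # everyone else runs the existing code
```

In practice the condition usually lives in a flag service rather than a hard-coded set, so the canary group can be changed without a deploy.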
Let’s say we’re running a fledgling social networking site for esports fans. Our DevOps team has been hard at work on a content recommender that gives users real-time recommendations based on livestreams they’re watching. The team has refined the recommendation feature to be significantly faster. It has performed well in internal testing, and now they want to see how it performs under real-world conditions.
The team doesn’t want to invest time and money into installing new physical infrastructure to conduct a canary deployment. Instead, the team decides to use a feature flag to expose the new recommendation engine to a random 5 percent sample of the user base.
The feature flag splits users into two groups with a simple modulo when they load a live stream. Within minutes, the team gets results back from a few thousand user sessions running the new code. It does, in fact, load faster and improve user engagement, but there is an unanticipated spike in CPU utilization on the production server. Ops staff realize it is about to degrade performance, so they kill the canary flag.
The team agrees not to proceed with rollout until they can debug why the new code caused the unexpected server CPU spike. Thanks to the real-world test results provided by the canary deployment, they have a pretty good idea of what was going on and get back to work.
Feature flags streamline and simplify canary deployments: they remove the need for a second production environment, and feature flag management software like AB Tasty enables sophisticated testing and analysis.