How Chaos Testing helps build and deliver resilient software
In the real world, things are rarely perfect.
Software fails. Outages happen. Cyberattacks occur.
Consequently, customers suffer when the application they need is unavailable for long stretches of downtime, and businesses suffer lost revenue and a bruised reputation.
Is it possible to develop and deploy software that never faces any issues? Probably not. But it is indeed possible to minimize the impact of those issues by designing resiliency into the software with the help of chaos testing, also known as chaos engineering.
“Principles of Chaos Engineering” defines the practice as the “discipline of experimenting on a distributed system in order to build confidence in the system’s capacity to withstand turbulent conditions in production.”
This involves the intentional introduction of failure into a software system to measure the system's ability to cope with it and to evaluate the failure's impact on the system's availability and durability.
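To make the idea concrete, here is a minimal, toy sketch of fault injection in Python. All names (`chaotic`, `fetch_profile`, the failure rate) are hypothetical illustrations; real chaos tools such as Chaos Monkey inject failures at the infrastructure level (terminating instances), not by wrapping application functions like this.

```python
import random

def chaotic(failure_rate=0.2, exception=ConnectionError):
    """Decorator that randomly injects a failure into a function call.

    A toy illustration of fault injection, not a real chaos tool.
    """
    def decorator(func):
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise exception("chaos: injected failure")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@chaotic(failure_rate=0.5)
def fetch_profile(user_id):
    # Stand-in for a call to some downstream service.
    return {"id": user_id, "name": "example"}

def fetch_profile_resilient(user_id, retries=3):
    # A resilient caller tolerates injected failures, e.g. by
    # retrying and then falling back to a safe default value.
    for _ in range(retries):
        try:
            return fetch_profile(user_id)
        except ConnectionError:
            continue
    return {"id": user_id, "name": "fallback"}
```

The experiment's question is exactly the one chaos engineering asks: with failures injected at random, does the calling code still return a usable answer every time?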
The concept of Chaos Engineering was introduced and developed by Netflix to test the resilience of their IT infrastructure and ensure a seamless experience for their customers. The tool they built for the purpose was called 'Chaos Monkey'.
Antonio García Martínez, the author of the book 'Chaos Monkeys', explains the concept as follows:
“Imagine a monkey entering a ‘data center’, these ‘farms’ of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices and returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy.”
The concept might seem counterintuitive, but it is a brilliant tool for preparing your software against potential issues. Instead of waiting for an error to occur before implementing a fix, chaos engineering or chaos testing takes a proactive approach.
By practicing chaos testing, organizations retain control over the introduced error. They can quickly identify hidden defects, vulnerabilities, and other issues that might never surface during traditional testing, and that could prove fatal if left lying around.
The impact of software failures
History has witnessed numerous incidents where the tiniest of bugs in a piece of software led to catastrophic outcomes.
Recently, London's Heathrow Airport was hit by technical issues that disrupted hundreds of flights. And the two Boeing 737 MAX incidents are unforgettable.
A few weeks ago, the video conferencing application Zoom went down for hours, disrupting meetings and classes across the United States.
A recent cyberattack on Brno University Hospital in the Czech Republic caused an immediate computer shutdown in the midst of the coronavirus outbreak.
These are just a few examples from an ocean of incidents that happen every year due to software glitches and unresolved defects.
These failures make an organization the talk of the town for all the wrong reasons. Not only do their business and revenue get impacted, but their hard-earned reputation also takes a heavy blow.
Mean Time To Recovery (MTTR) vs. Mean Time To Failure (MTTF)
Traditionally, organizations take pride in the longest stretch they have gone without an outage or technical issue. In other words, they rely on Mean Time To Failure (MTTF), the average time a system operates before failing.
But the irony is that customers won't remember all the times a software system worked perfectly, yet they will never forget the ONE time that it didn't.
Therefore, it is high time organizations shift their focus from Mean Time To Failure to Mean Time To Recovery.
Mean Time To Recovery (MTTR) is the average time a software system takes to recover from a failure. The objective for organizations should be to minimize MTTR to the point that customers do not even notice when an issue occurs. And that is made possible with the help of chaos testing.
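The two metrics fall directly out of an incident log. Here is a small sketch using made-up incident data (the numbers are purely illustrative): each entry records how long the system ran before a failure and how long it took to recover.

```python
# Hypothetical incident log: (uptime_hours_before_failure, hours_to_recover)
incidents = [(720, 0.5), (1100, 2.0), (460, 0.25)]

def mttf(incidents):
    """Mean Time To Failure: average uptime between failures."""
    return sum(up for up, _ in incidents) / len(incidents)

def mttr(incidents):
    """Mean Time To Recovery: average time to restore service."""
    return sum(down for _, down in incidents) / len(incidents)

print(f"MTTF: {mttf(incidents):.1f} h")   # → MTTF: 760.0 h
print(f"MTTR: {mttr(incidents):.2f} h")   # → MTTR: 0.92 h
```

The shift in focus the article describes is simply which of these two numbers you optimize: boasting a high MTTF, or driving MTTR so low that failures pass unnoticed.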
Chaos testing and DevOps
Let’s get one thing clear: systematically injecting an error into a software system must in no way impact the end customers. Whatever experimentation with deliberately introduced chaos takes place must remain behind the curtain of a well-functioning application for the end users.
This is not feasible with legacy software development and testing methodologies. Chaos engineering works well only in a DevOps setup, where automation is integral to the entire development and software testing process, a continuous monitoring and feedback loop is established, and there is scope for continuous improvement.
When a fault is injected into a system, bugs are uncovered and vulnerabilities identified. With a DevOps practice in place, these defects can be resolved in near real time, and automated safeguards can be put in place against future occurrences.
Chaos testing, combined with DevOps, is the ultimate way of developing and delivering highly fault-tolerant and resilient software.
How can we help
Achieving fast and continuous development & deployment of business-critical cloud-based applications across diverse platforms requires seamless collaboration among development, test automation, & operations teams.
At Cigniti, we standardize efforts, ascertain software resilience, and ensure accelerated time-to-market with DevOps testing solutions. We focus on delivering improved deployment quality with greater operational efficiency. Our DevOps testing specialists with their deep experience in Continuous Integration (CI) testing & Continuous Deployment (CD) help configure & execute popular CI/CD tools supporting your DevOps transformation & application testing efforts.
Schedule a discussion with us today.