Every time I read about another high-profile system outage, I wonder what was missed during development and testing.
For example, although an unusual natural disaster triggered the recent Amazon cloud services outage, the root cause was a lurking bug that could have been revealed with a testing strategy that I (and others) have advocated for over twenty years. Here’s Amazon’s explanation, as reported by CNET:
‘When key AWS [Amazon Web Services] components like EC2 [Elastic Compute Cloud] go down, the ELB [Elastic Load Balancing] system frantically tries to assign workloads to servers with space. However, as Amazon’s cloud rebooted, “a large number of ELBs came up in a state which triggered a bug we hadn’t seen before,” the company said. The bug meant that Amazon tried to rapidly scale the affected ELBs to ones of a larger size, flooding Amazon’s cloud with requests that caused a backlog in its control plane. This, combined with a rise in the number of new servers being provisioned by customers in unaffected availability zones, added even more requests to the control plane, increasing the backlog still further.’
‘A similar bug occurred in the recovery process for components of Amazon’s Relational Database Service (RDS). Due to changes made to how Amazon dealt with storage failures, a bug appeared that meant RDS databases sharded across multiple availability zones did not complete failover, rendering them useless. This bug is one which “only manifested when a certain sequence of communication failure is experienced,” [italics added] Amazon said, “situations we saw during this event as a variety of server shutdown sequences occurred.”’
I commend Amazon for saying the cause was a “bug,” instead of a weaselly “issue.”
But this is all too familiar.
There are many similar stories, including thousands that would never make the cable news crawl.
Why does this happen? I don’t know how testing was done in these cases, but I do know that commonly used “happy-path” testing routinely misses showstoppers. Happy-path testing is tester jargon for a test that follows a routine and simple interaction with basic features. But even more extensive functional testing will not find stress-related showstoppers (there’s no stress). Neither will performance testing that relies on large numbers of happy paths. Although this can reveal certain kinds of basic capacity problems, it almost never reveals stress-related showstoppers. Why? Bugs that are sensitive to high stress can hide from either kind of superficial testing. And it is exactly these kinds of bugs that cause catastrophic failures.
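To make the point concrete, here is a minimal, hypothetical sketch (my own illustration, not code from any of the systems discussed) of why a happy-path test can pass while a stress-sensitive bug stays hidden. The toy “control plane” below silently drops requests once its backlog is full; a single-request test never notices, but a flood of requests does:

```python
class ControlPlane:
    """Toy control plane with a bounded request backlog (illustrative only)."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.backlog = []
        self.dropped = 0

    def submit(self, request):
        # The lurking bug: requests over capacity are silently dropped.
        if len(self.backlog) >= self.capacity:
            self.dropped += 1
            return False
        self.backlog.append(request)
        return True


# Happy-path test: one request, plenty of capacity -- passes every time.
cp = ControlPlane()
assert cp.submit("provision-server")
assert cp.dropped == 0

# Stress scenario: a flood of scaling requests arrives faster than the
# backlog drains, and requests are silently lost -- the showstopper that
# the happy path never exercises.
cp = ControlPlane(capacity=100)
for i in range(500):
    cp.submit(f"scale-elb-{i}")
assert cp.dropped > 0
```

The bug is trivially visible once load is applied, yet no amount of repeating the single-request test would ever trigger it.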
Friend and mentor John Musa, partly in reaction to the AT&T outage, pioneered a realistic testing strategy to reveal failure modes in complex distributed systems. I learned a lot from John and extended his approach to automatically generate test suites with realistic variation in load, rate, and complexity of inputs. I call this multi-dimensional testing. The system under test is subjected to both realistic usage and varying stress levels, all at the same time. I’ve used this strategy many times to wring out hard-to-find showstoppers of all kinds. See No Glitches Allowed for a case study.
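A minimal sketch of what multi-dimensional test generation can look like. This is my own illustration under assumed dimension names (load, rate, input complexity), not John Musa’s or my actual tooling: each generated case combines a point from every dimension, so the system is exercised under realistic usage and varying stress at the same time rather than along one axis:

```python
import itertools
import random

# Hypothetical stress dimensions; a real suite would derive these from
# an operational profile of the system under test.
LOADS = [1, 10, 100, 1000]            # concurrent workloads
RATES = [1, 50, 500]                  # requests per second
COMPLEXITIES = ["simple", "nested", "malformed"]  # input complexity


def generate_suite(samples_per_cell=2, seed=42):
    """Generate test cases covering every combination of dimensions."""
    rng = random.Random(seed)         # seeded for reproducible suites
    suite = []
    for load, rate, complexity in itertools.product(LOADS, RATES, COMPLEXITIES):
        for _ in range(samples_per_cell):
            suite.append({
                "load": load,
                "rate": rate,
                "complexity": complexity,
                # per-case jitter gives realistic variation within a cell
                "jitter": rng.uniform(0.0, 1.0),
            })
    return suite


suite = generate_suite()
# 4 loads x 3 rates x 3 complexities x 2 samples = 72 test cases
assert len(suite) == 72
```

The point of the cross-product is that stress-sensitive bugs often only appear at particular combinations, such as high load with malformed inputs, that single-dimension testing never reaches.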
No testing approach can guarantee that it will find all bugs, but it is a certainty that the multi-dimensional strategy does a much better job of shining a light on the dark corners where high-stress showstopper bugs hide.