Paraphrasing what Netflix and many others have said before:
The only way to be comfortable with failures is to fail often
Why create chaos?
Over the past few years, Anders has focused heavily on DevOps, because we see it as one of the most critical shifts in both technology and company culture in recent memory. Like most companies out there, we have not always been comfortable with things breaking and failing. We want to change that.
An engineer waking up in the middle of the night due to a system failure is one of the most stressful work situations you can put a person in. You are sleep deprived, disoriented, and there is a fire under your ass. On top of that, the thing that has failed is something you have not touched in a long time. Your heart is going a million kilometers an hour, and you have no idea where to start. This is what chaos engineering wants you to experience, and often. Why is that?
Well, to no one’s surprise, it turns out that if you experience this often, you start figuring out ways to keep it from happening. The point is to build resilient systems where, when the notification comes in at night (if it even needs to), all you have to check is whether the system recovered on its own. If it did not, you probably already know what to do and can be back asleep shortly. Chaos engineering is the discipline of deliberately and continuously injecting failures into a system in production. It sounds terrifying at first, but once you get used to it, the stress associated with system failures decreases, and the systems become more stable.
In April, we started a new event type called a chaos event, where we break our own systems on purpose. The goal is to learn, to harden the systems, and to make them more resilient to failures. This time we focused on our Kubernetes infrastructure on Google Cloud and on our database providers’ ability to recover from failure. We wanted to see how fast we could recover if our entire Kubernetes cluster and all our databases disappeared. Seems like an unlikely scenario? Maybe, but it gave us a good way of checking that our infrastructure-as-code setup actually worked the way we anticipated. We of course test all our code as we create it, but testing in production, with everything failing at once, is usually not the same thing.
We started off with me (Frank) acting as the chaos-inducing monkey (not far from reality anyway) that would destroy the system. This being the first time we held the event, we focused on our staging environments and our CI infrastructure, both of which are in daily use by our developers and our customers. So, without our DevOps team knowing much more than “things will go down today,” I started deleting things. Poof went the Kubernetes cluster and the databases, the DNS magically changed to point somewhere entirely different from where it was supposed to, the NAT gateways got messed up, Google Cloud Storage buckets disappeared, and all of our GitLab CI runners were gone. In about 20 minutes, most of our systems were down.
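To give a flavor of what that kind of destruction looks like, here is a hypothetical sketch (not our actual commands; every resource name below is made up for illustration):

```shell
# Hypothetical examples of the kind of deletions involved.
# Cluster, instance, and bucket names are invented for illustration.
gcloud container clusters delete staging-cluster --region europe-north1 --quiet
gcloud sql instances delete staging-db --quiet
gsutil rm -r gs://staging-assets-bucket
```

A handful of commands like these, run with sufficient permissions, is all it takes to take down most of an environment.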
Recovering from chaos
The team jumped on the task straight away, staying in a Teams call the entire time. The first thing to do was to investigate what had happened. Quite quickly, most of what had happened was found, and the recovery process started. By using the infrastructure code already written for Terraform, recovery could begin immediately. Since Terraform, to a great extent, reports what it would recreate (and hence hints at what had been removed), the team could focus on exactly those things.
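This works because Terraform diffs its state file against what actually exists. A sketch of what that looks like in practice (resource names are hypothetical):

```shell
# terraform plan compares the state file with reality and lists everything
# it would have to create; anything marked "+ create" that used to exist
# is a hint at what was deleted.
terraform plan -out=recovery.tfplan

#   + google_container_cluster.staging        (hypothetical name)
#   + google_sql_database_instance.main       (hypothetical name)
#   Plan: <n> to add, 0 to change, 0 to destroy.
```

In other words, the plan output doubles as an inventory of the damage.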
The first thing that had to be recreated was the Kubernetes infrastructure; this was also what took the longest, due to the time it takes for a Google Kubernetes Engine (GKE) cluster to be recreated. Once it was up, all the main cluster components, such as the ingress, certificate managers, and monitoring stacks, were deployed automatically. At the same time, our database servers were recreated, with fresh backups restored to each database. With some minor changes to our infrastructure code to point at the new database server, the database information for all applications could be updated automatically; no manual editing of secrets was needed. Once the cluster was up and running, the databases restored, and the application secrets updated, it was time to redeploy our applications. Terraform took care of setting up the surrounding infrastructure, such as the deleted storage buckets, and restore scripts took care of restoring their contents. The application deployments themselves were done simply by clicking run on the CI/CD pipelines, all managed by our open-source tool Kólga. Going from nothing to fully working applications took about two hours, with CI runs starting to work about one hour in.
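As a rough sketch of what the infrastructure-as-code side of such a recovery can look like (simplified, assumed Terraform; names, regions, and tiers are illustrative and not our actual configuration):

```hcl
# Hypothetical sketch: a GKE cluster and a database instance defined in
# Terraform. After a full deletion, `terraform apply` recreates both.
resource "google_container_cluster" "staging" {
  name               = "staging-cluster" # illustrative name
  location           = "europe-north1"
  initial_node_count = 3
}

resource "google_sql_database_instance" "main" {
  name             = "staging-db" # illustrative name
  database_version = "POSTGRES_13"
  region           = "europe-north1"

  settings {
    tier = "db-custom-2-7680"
  }
}

# Exposing the connection details as an output means applications can be
# re-pointed at the recreated server without manually editing secrets.
output "db_connection_name" {
  value = google_sql_database_instance.main.connection_name
}
```

Because everything flows from definitions like these, recreating a deleted environment is mostly a matter of applying them again.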
While not fully automatic, most of the tasks did not involve a lot of manual intervention. All resources were created either by running a script or by clicking buttons in a UI.
While we felt quite proud that our recovery methods worked as well as they did, we still saw a lot of room for improvement. Most of the things we wanted to improve were related to automation: greater use of Vault for storing even more credentials, better alerting, and a more robust backup strategy that would survive even a full-scale Google outage. One thing we noticed, for instance, was that we had to make manual changes to our Terraform state file, because some providers could not handle such a massive change in one go and we had to apply only part of the Terraform configuration at a time.
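One common workaround when a provider chokes on a change set this large is Terraform's `-target` flag, which applies one slice of the configuration at a time (the module names here are hypothetical):

```shell
# Recreate the infrastructure in stages instead of one massive apply.
# (Module names are made up for illustration.)
terraform apply -target=module.gke_cluster
terraform apply -target=module.databases
terraform apply  # a final, untargeted pass to converge everything else
```

Terraform itself warns that `-target` is meant for exceptional situations, which a chaos event arguably is.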
In general, full-scale failures like these turned out to be quite “easy” to recover from, since you can simply recreate everything. The consensus was that we should make these events much harder 😅. Next time, smaller, harder-to-find errors will be introduced for sure.
One could also argue that there should not have been any downtime at all, even with the entire Kubernetes cluster gone, which is a fair point. A highly available multi-cloud setup would shield us somewhat from this, but it comes with fairly high costs. Trying out such a setup could be a goal for a future event.