Deliberately creating chaos

By Frank Wickström

Paraphrasing what Netflix and many others have said before:

The only way to be comfortable with failures is to fail often

Why create chaos?

Being woken up in the middle of the night by a system failure is one of the most stressful work situations you can put an engineer in. You are sleep deprived, disoriented, and there is a fire under your ass. On top of this, the thing that has failed is something you have not touched in a long time. Your heart is going a million kilometers an hour, and you have no idea where to start. This is what chaos engineering wants you to experience, and often. Why is that?

Well, to no one’s surprise, it turns out that if you experience this often, you start figuring out ways to keep it from happening. The point is to build resilient systems where, when the notification comes in at night (if it even needs to), all you need to check is whether the system recovered on its own. If it did not, you probably know what to do and can go back to sleep soon after. Chaos engineering is the discipline of experimenting on a system by deliberately and repeatedly injecting failures into it, in production. It sounds terrifying at first, but once you get used to it, the stress associated with system failures decreases and systems become more stable.

Creating chaos!

We started off with me (Frank) acting as the chaos-inducing monkey (not far from reality anyway) that would destroy the system. This being the first time we held the event, we focused on our staging environments and our CI infrastructure, both of which are in daily use by our developers and our customers. So without our DevOps team knowing much more than “things will go down today,” I started deleting things. Poof went the Kubernetes cluster and the databases, the DNS magically changed to point somewhere entirely different from where it was supposed to point, the NAT gateways became messed up, Google Cloud Storage buckets disappeared, and all of our GitLab CI runners were gone. In about 20 minutes, most of our systems were down.
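For anyone who wants to script the monkey instead of volunteering a colleague, here is a minimal sketch of how picking targets could look. It is not what we ran; the deleting above was done by hand. The `Target` class, the target names, and the `destroy` callback are hypothetical placeholders, and the sketch defaults to a dry run so that nothing is destroyed unless you explicitly plug in your own teardown function.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """A piece of infrastructure the monkey may take down (hypothetical model)."""
    name: str
    environment: str  # for example "staging" or "production"


def pick_targets(targets, allowed_envs, count, seed=None):
    """Randomly pick up to `count` targets, restricted to the allowed environments."""
    rng = random.Random(seed)
    eligible = [t for t in targets if t.environment in allowed_envs]
    return rng.sample(eligible, k=min(count, len(eligible)))


def run_experiment(targets, destroy=None, dry_run=True):
    """Destroy each target via the `destroy` callback, or only log it in a dry run."""
    log = []
    for target in targets:
        if dry_run or destroy is None:
            log.append(f"[dry-run] would destroy {target.name}")
        else:
            destroy(target)
            log.append(f"destroyed {target.name}")
    return log


# Hypothetical inventory, loosely modelled on the things deleted in this event.
INVENTORY = [
    Target("kubernetes-cluster", "staging"),
    Target("databases", "staging"),
    Target("dns-records", "staging"),
    Target("nat-gateways", "staging"),
    Target("storage-buckets", "staging"),
    Target("ci-runners", "ci"),
    Target("customer-facing-cluster", "production"),
]

if __name__ == "__main__":
    chosen = pick_targets(INVENTORY, allowed_envs={"staging", "ci"}, count=3)
    for line in run_experiment(chosen):
        print(line)
```

Restricting the pick to an explicit allow-list of environments mirrors the choice we made for this first event: staging and CI only.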

Recovering from chaos

The first thing that had to be recreated was the Kubernetes infrastructure; this was also what took the longest, due to the time it takes for a Google Kubernetes Engine (GKE) cluster to be recreated. When that was up, all main cluster components, such as ingress, certificate managers, and monitoring stacks, were automatically deployed. At the same time, our database servers were recreated, with fresh backups restored to each database. With some minor changes to our infrastructure code to point to the new database server, the database information for all applications could be updated automatically, with no manual updating of secrets needed. Once the cluster was up and running, the databases restored, and the application secrets updated, it was time to redeploy our applications. Terraform took care of setting up the surrounding infrastructure, such as the deleted storage buckets, and restore scripts took care of restoring their content. The application deployments themselves were done simply by clicking run on the CI/CD pipelines, all managed by our open-source tool Kólga. Going from nothing to fully working applications took about 2 hours, with CI runs starting to work about 1 hour in.

While not fully automatic, most of the tasks did not involve a lot of manual intervention. All resources were created either by running a script or clicking buttons in a UI.
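The recovery order described above can be written down as a small dependency graph, which makes it easier to see what can run in parallel. The sketch below uses Python's standard `graphlib` module. The step names and the exact dependencies are my own reading of the recovery described here, not something taken from our actual tooling, so treat them as assumptions.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it can start.
# Step names and dependencies are illustrative assumptions.
RECOVERY_STEPS = {
    "gke_cluster": set(),
    "cluster_components": {"gke_cluster"},       # ingress, certificates, monitoring
    "database_servers": set(),
    "database_restore": {"database_servers"},    # restore fresh backups
    "application_secrets": {"database_restore", "gke_cluster"},
    "storage_buckets": set(),                    # recreated by Terraform
    "bucket_contents": {"storage_buckets"},      # restore scripts
    "application_deploy": {
        "cluster_components",
        "application_secrets",
        "bucket_contents",
    },
}


def recovery_waves(steps):
    """Group steps into waves; all steps within one wave can run in parallel."""
    sorter = TopologicalSorter(steps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(recovery_waves(RECOVERY_STEPS), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

With these assumed dependencies, the cluster, the database servers, and the storage buckets all land in the first wave, which matches the "at the same time" nature of the recovery, and the application deployments come last.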

Future improvements

One general observation was that full-scale failures like these were quite “easy” to recover from, since one can simply recreate everything. The consensus was that we should make these events much harder 😅. Next time, smaller and harder-to-find errors will be created for sure.

One could also argue that there should not have been any downtime at all, even if the entire Kubernetes cluster disappeared, which is a fair point. Using a highly available multi-cloud setup would shield us somewhat from this, but it comes with fairly high costs. Building such a setup is something that could be tried out for a future event as well.

Anders is a Finnish IT company, whose mission is sustainable software development with the greatest colleagues of all time.