Deleting My First Production Cluster | by Prafful Mishra | Jul, 2022

Embracing failure to grow as a developer


We have all been there, one day or another, at least once in our career (or maybe more than once).

Either you were having a bad day, you underestimated how a small decision of yours could bring the whole world down, or maybe the bug was so convoluted that it needed all the stars to align for this misfortune to hit you.

But these experiences teach you a great deal, while also testing your disaster recovery resources, team strength, and ability to handle pressure. This is the story of how I caused a disaster, and more importantly, how we got everything back up.

We have an internal Data Science Platform (DSP) that runs workloads for machine learning and data science-related projects. We were in the process of launching a new version of the platform (call it v2).

All the existing users were on DSP v1, with some teams running a few experimental workloads on v2 while we were still adding, testing, and configuring services. From here on, we’ll be talking about DSP v2.

So, I guess you have figured out that even though this was supposed to be a prod cluster, the number of teams using it was very low, and the ones using it were expecting disturbances since everything was still in the dev/alpha phase (but this doesn’t make the situation any less critical).

Deployment Setup

We were using argoCD to make sure everything remained in sync, with self-healing and pruning enabled (this will be interesting later on).
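Self-healing and pruning are configured in an Application’s syncPolicy. Here is a minimal, hypothetical sketch (the name, repo URL, and paths are illustrative, not our actual config):

```yaml
# Hypothetical Application snippet showing the sync policy we relied on.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-project
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/dsp/manifests.git  # illustrative
    path: projects/example-project
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: example-project
  syncPolicy:
    automated:
      prune: true     # remove cluster resources no longer in Git
      selfHeal: true  # revert manual drift back to the Git state
```

With `prune: true`, argoCD deletes anything it previously managed that disappears from the rendered manifests, which is what makes automated sync both convenient and dangerous.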

We had automated onboarding for new teams who wish to use the platform already set up on the cluster (every project would get a dedicated namespace).

A diagram showing multiple argoCD applications generated from Application Sets.
Application sets to generate multiple applications

This was done with the help of the newly available ApplicationSets feature in argoCD, which generates Argo applications.
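As a rough sketch of what such an onboarding setup can look like (the repo URL, names, and paths here are hypothetical), an ApplicationSet with a Git directory generator stamps out one Application per project folder:

```yaml
# Hypothetical ApplicationSet: every directory under projects/
# in the repo becomes its own argoCD Application and namespace.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: dsp-projects
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://example.com/dsp/manifests.git
        revision: main
        directories:
          - path: projects/*
  template:
    metadata:
      name: "{{path.basename}}"
    spec:
      project: dsp-projects          # hypothetical AppProject name
      source:
        repoURL: https://example.com/dsp/manifests.git
        targetRevision: main
        path: "{{path}}"
      destination:
        server: https://kubernetes.default.svc
        namespace: "{{path.basename}}"  # one namespace per project
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

Adding a directory to the repo is then all a team needs to do to get onboarded.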

A diagram showing how each argoCD application was using kustomize-helm plugin to generate and sync manifests to the cluster.
ArgoCD generated manifest using the kustomize-helm plugin

These individual applications used kustomize to render a helm chart with a custom plugin described here. In essence, once the manifests from a helm chart are rendered using values.yaml, kustomize is applied on top to generate the final manifests, which are synced to the cluster by argoCD.
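For readers unfamiliar with this pattern, such a plugin can be registered in the argocd-cm ConfigMap. This is a minimal sketch modelled on the documented kustomized-helm example, not our exact configuration:

```yaml
# Hypothetical argocd-cm snippet registering a kustomize-helm plugin.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  configManagementPlugins: |
    - name: kustomize-helm
      generate:
        command: ["sh", "-c"]
        # render the chart, then let kustomize patch the output;
        # note that no namespace is passed to helm here
        args: ["helm template . -f values.yaml > all.yaml && kustomize build"]
```

The kustomization in each project directory then references the rendered `all.yaml` as a resource and applies its patches on top.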

Here’s the timeline that led to the disaster:

  • As the helm chart automatically renders manifests for each newly onboarded project, it includes a manifest for v1/Namespace.
  • I did not explicitly set the namespace parameter for the kustomize-helm plugin when rendering the helm chart, so the current namespace (i.e., argocd) was passed to the helm chart.
  • Disaster had not dropped yet, but everything was set in place for kaboom.
  • If you haven’t figured it out yet, the next time a project was onboarded, argoCD started managing its own namespace.
  • argoproj.io/v1alpha1/AppProject was supposed to have permissions set to not allow something like this, but as this AppProject had to create new namespaces, the permissions allowed all namespaces rather than explicitly denying access to critical ones at least. (As things were still in the dev/alpha stage, not everything was set up to be robust yet.)
  • At this moment, I realized what had happened. To add more context: I was working late at night because I was sleepless and had decided it would be good to get some work done (not the best decision of my life, though).
Old and new config for Kustomize-helm plugin in order
  • In the hurry to fix things, I updated the plugin command from helm template … to helm template -n project-namespace …, which, as far as I could tell, would fix the issue for sure. (Easy peasy, huh?)
  • Now that everything rendered correctly, argoCD was asked to update the namespace from argocd to project-namespace, and argoCD deleted the namespace argocd.
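To make the permissions piece concrete, here is a hypothetical sketch of the kind of permissive AppProject that makes this possible (all names are illustrative):

```yaml
# Hypothetical AppProject: broad enough to create new project
# namespaces, which also means it can touch the argocd namespace.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: dsp-projects
  namespace: argocd
spec:
  destinations:
    - server: https://kubernetes.default.svc
      namespace: "*"       # needed so new project namespaces can be
                           # targeted, but nothing is denied either
  clusterResourceWhitelist:
    - group: ""
      kind: Namespace      # lets applications create/delete namespaces
```

A safer configuration would explicitly block critical namespaces; if your argoCD version supports negated destinations, adding an entry like `namespace: "!argocd"` keeps the wildcard while fencing off the control plane.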

Well, now the disaster had struck. I was a bit confused at first, but then reality hit me, and I realized that I was going to have a big day tomorrow. I decided to sleep peacefully and handle the storm head-on the next morning.

All the project applications on this cluster were managed by argoCD, hence once argoCD went down, almost every project on the cluster went down with it.

Fortunately, there were other basic services that were not managed by argoCD; they were still up and available. But we had affected all the users on this cluster directly.

The next morning started with a long call with the central infrastructure team, explaining to them what had happened.

Day 1

As the incident was critical due to the cluster being a production one, the infrastructure team had to do a proper investigation (this helped me in understanding a few missing pieces), prepare an incident report, and mention the exact causes.

Once the investigation was done, they took a few hours to get argocd back up. (Everything after the basic cluster applications and argoCD was under my team’s ownership).

I had a short call with my team to update them about my late-night adventures and apologised for my carelessness. We decided to reevaluate the deadline for the DSP v2 production release once all the services were live again.

Day 2

By now, everything was in place from the infra team, and the ball was in my court to get everything back up as soon as possible to minimise the damage to our deadlines.

While we started syncing all the affected services from our repositories, a few issues surfaced that we had not been aware of. We got a chance to test our complete Infrastructure as Code implementation and our team collaboration.

We got stuck a few times but kept it rolling and got all the services back up by late that day, creating new issues along the way for bugs, disaster recovery scripts, better maintenance patterns, and so on.

I sent a message explaining everything once things were back on track.

A redacted version of the message to my team

Prepare for disaster as it’ll hit every day. Develop such that you never hit one.

A few major pointers that we learned from the incident

  • Coding when you are tired is a bad idea
  • Have better checks in place for everything (literally everything)
  • Have a plan for disaster
  • Test your disaster plan periodically
  • Simulate disasters frequently (big, small, and biggest)
  • Your team matters the most (positivity, support, and expertise)
