Once you’ve got Kubernetes in production, those predictable business continuity and disaster recovery (DR) exercises get a lot more interesting — and not necessarily in a good way. That’s why I’m focusing on the challenges of Kubernetes disaster recovery and business continuity in my recently published research.
As Kubernetes makes its way into stateful applications — that is, apps that save data to, or read from, persistent disk storage — infrastructure and operations (I&O) leaders will have to figure out whether apps running on that cloud-native infrastructure can meet their DR goals. And if they’re relying only on out-of-the-box Kubernetes, the answer is likely to be no. As Annette Clewett of Red Hat asked back in 2019 at KubeCon + CloudNativeCon, “How can you have a serious platform if you have no backup and recovery?”
In the two years since, the Cloud Native Computing Foundation (CNCF) community has worked to build out the Kubernetes ecosystem to provide enterprise-grade storage for Kubernetes — for example, with the frequent upgrades of Rook, which orchestrates storage operators used in Kubernetes, including Ceph, Cassandra, and NFS. But it is still largely up to the user to figure out how to make it all work — from basic storage to full DR — either on their own or with the support of various vendors. Notably, the CNCF-certified Kubernetes distributions don’t claim to have built-in disaster recovery capabilities, with the exception of Kublr.
Some users might look to the hyperscale cloud service providers to offer a Kubernetes DR solution to go along with their managed Kubernetes services. If so, they will likely be disappointed. Microsoft, for example, offers a set of best practices for business continuity and DR on Azure Kubernetes Service (AKS) but directs users to “common storage solutions [that] provide their own guidance about disaster recovery and replication.” AWS documentation for Elastic Kubernetes Service (EKS) describes the resiliency of the Kubernetes control plane but is silent on what this means for providing DR for the apps running on EKS.
Google takes a different tack, incorporating a more detailed discussion of disaster recovery for Google Kubernetes Engine, including storage as part of the “disaster recovery building blocks” for Google Cloud Platform. Those documents may be a fine starting point for engineers already steeped in Kubernetes, but Google’s guides require expertise that may be beyond systems operations teams who are still sorting out their transition to site reliability engineers. If Kubernetes is going to move into mainstream enterprise IT, basic DR will have to become more straightforward. Failover for high-availability applications on Kubernetes is an even bigger challenge, of course.
So if you need DR for apps running on Kubernetes, you’d better shop around. Several storage vendors and Kubernetes-based platforms provide DR themselves or support other vendors that do. Even with those tools, however, operations teams should expect some awkward moments in their regular DR exercises.
Some DR procedures won’t change. Project leaders convene a meeting to set DR goals. Application teams report additions and changes to the development. I&O teams update runbooks with recovery point objectives (RPO), the amount of data loss or data reentry that can be tolerated, and recovery time objectives (RTO), the acceptable time that systems will be unavailable. Application development teams and business users accustomed to aggressive RTOs, however, may not be aware of the complexities involved in hitting those same numbers with applications running on Kubernetes.
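To make those two objectives concrete, here is a minimal Python sketch of the kind of check a runbook review might apply to one application. It is an illustration under stated assumptions, not a method from the research: the `DrObjectives` class, the `meets_objectives` function, and all the numbers are hypothetical, and it assumes that worst-case data loss is roughly one backup interval and that restore time comes from the most recent DR exercise.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DrObjectives:
    """Hypothetical container for one application's DR targets."""
    rpo: timedelta  # recovery point objective: maximum tolerable data loss
    rto: timedelta  # recovery time objective: maximum tolerable downtime


def meets_objectives(objectives, backup_interval, measured_restore_time):
    """Return (rpo_ok, rto_ok) for one application."""
    # Assumption: a failure can hit just before the next backup, so the
    # worst-case data loss is about one full backup interval.
    rpo_ok = backup_interval <= objectives.rpo
    # Assumption: the restore time observed in the last DR exercise is
    # representative of a real recovery.
    rto_ok = measured_restore_time <= objectives.rto
    return rpo_ok, rto_ok


# Illustrative numbers only; they do not come from any real system.
orders_db = DrObjectives(rpo=timedelta(minutes=15), rto=timedelta(hours=1))
result = meets_objectives(
    orders_db,
    backup_interval=timedelta(hours=1),
    measured_restore_time=timedelta(minutes=40),
)
print(result)  # (False, True): hourly backups miss the 15-minute RPO
```

In this made-up case the restore fits inside the RTO, but hourly backups cannot satisfy a 15-minute RPO. That is the sort of gap a DR exercise is meant to surface before a real outage does.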
The Kubernetes DR picture should become clearer in coming months. As my colleagues Brent Ellis and Andras Cser and I observed at this year’s KubeCon + CloudNativeCon Europe, the CNCF community and various vendors are finally assembling the technologies and tools to ease Kubernetes adoption in the enterprise. But today, DR with Kubernetes remains a hurdle. For more on this topic, read my research or schedule an inquiry.