A Kubernetes cluster generally consists of two classes of nodes: workers, which run applications, and control plane nodes, which control the cluster - scheduling jobs on the workers, creating new replicas of pods when the load requires it, etc.

The control plane runs the components that allow the cluster to offer high availability, recover from worker node failures, respond to increased demand for a pod, and so on. Without a fully functioning control plane, a cluster cannot make any changes to its current state, meaning no new pods can be scheduled. As such, ensuring high availability for the control plane is critical in production environments.

(You can schedule pods on the control plane nodes, but it is not recommended for production clusters - you do not want workload requirements to take resources away from the components that keep Kubernetes highly available. Keeping workloads off the control plane also eliminates the chance that a vulnerability lets a workload access the secrets of the control plane - which would give it full access to the cluster.)

So how do you ensure the control plane is highly available? Kubernetes achieves high availability by replicating the control plane functions on multiple nodes.

May the Odds Be Ever in Your Favor

But how many nodes should you use? It’s not quite as simple as “more is better.” One of the functions of the control plane is to provide the datastore used for the configuration of Kubernetes itself. This information is stored as key-value pairs in the etcd database. Etcd uses a quorum system, requiring that more than half of the replicas be available before committing any updates to the database. So a 2-node control plane requires not just one node to be available, but both nodes (as the smallest integer that is “more than half” of 2 is… 2). That is, going from a single-node control plane to a 2-node control plane makes availability worse, not better.

In the case of a 2-node control plane, when one node can’t reach the other, it doesn’t know whether the other node is dead (in which case it might be fine for the surviving node to keep applying updates to the database) or merely unreachable. If both nodes are up but can’t reach each other, and both continue to accept writes, you end up with a split-brain situation: the two halves of the cluster hold inconsistent copies of the data, with no way to reconcile them. The safer option is therefore to lock the database and prevent further writes by any node. And because the chance of losing a node is higher with 2 nodes than with one (twice as high, in fact, assuming identical nodes), the reliability of a 2-node control plane is worse than that of a single node!
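To put illustrative numbers on that last point, here is a minimal Go sketch. It assumes each control plane node is independently available 99% of the time (a made-up figure, purely for illustration) and computes the probability that a write quorum is reachable for a 1-node and a 2-node cluster.

    package main

    import "fmt"

    func main() {
        // Assumed, illustrative per-node availability - not a measured value.
        const p = 0.99

        oneNode := p     // a 1-node control plane needs only its single node
        twoNode := p * p // a 2-node cluster needs both nodes for a quorum

        fmt.Printf("1-node control plane: %.4f\n", oneNode) // 0.9900
        fmt.Printf("2-node control plane: %.4f\n", twoNode) // 0.9801
    }

The 2-node figure is lower - that is the arithmetic behind “worse, not better.”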
This same logic applies as the control plane scales: etcd will always demand that more than half of the nodes be alive and reachable in order to have a quorum, so that the database can perform updates. Thus a 2-node control plane requires both nodes to be up, a 3-node control plane requires 2 nodes to be up, and a 4-node control plane requires 3 nodes to be up. So a 4-node control plane has worse availability than a 3-node one: while both can suffer a single node outage, and neither can survive a 2-node outage, the odds of losing two nodes are higher in a 4-node cluster. Adding a node to an odd-sized cluster appears better (since there are more machines), but the fault tolerance is worse: precisely the same number of nodes may fail without losing quorum, yet there are more nodes that can fail.

Thus a general rule is to always run an odd number of nodes in your control plane. As for an upper bound, the etcd docs say “…an etcd cluster probably should have no more than seven nodes…”
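The majority rule is easy to tabulate: a write quorum is n/2 + 1 nodes (integer division), and the cluster can tolerate n minus that many failures. The short Go sketch below is not etcd code, just that arithmetic for cluster sizes 1 through 7.

    package main

    import "fmt"

    func main() {
        fmt.Println("nodes  quorum  tolerable failures")
        for n := 1; n <= 7; n++ {
            quorum := n/2 + 1 // strict majority, via integer division
            fmt.Printf("%5d  %6d  %18d\n", n, quorum, n-quorum)
        }
    }

The tolerable-failures column comes out as 0, 0, 1, 1, 2, 2, 3: each even-sized cluster tolerates exactly as many failures as the odd-sized cluster one node smaller, which is why the odd-number rule holds.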