<!--
 This work is licensed under a Creative Commons Attribution 3.0
 Unported License.

 http://creativecommons.org/licenses/by/3.0/legalcode
-->

# HostClaim: multi-tenancy and hybrid clusters

## Status

provisional

## Summary

We introduce a new Custom Resource (named HostClaim) that facilitates
the creation of multiple clusters for different tenants. It also provides
a framework for building clusters with different kinds of compute resources:
baremetal servers, but also virtual machines hosted in private or public
clouds.

A HostClaim decouples the client's need from the actual implementation of the
compute resource: it establishes a security boundary and provides a way to
migrate nodes between different kinds of compute resources.

A HostClaim expresses that one wants to start a given
OS image with an initial configuration (typically cloud-init or ignition
configuration files) on a compute resource that meets a set of requirements
(host selectors). These requirements can be interpreted either as labels
for a recyclable resource (such as a bare-metal server) or as characteristics
for a disposable resource (such as a virtual machine created for the workload).
The status and metadata of the HostClaim provide the necessary information
for the end user to define and manage their workload on the compute resource,
but they do not grant full control over the resource (typically, the BMC
credentials of servers are not exposed to the tenant).

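
For illustration, a minimal HostClaim could look like the sketch below. The
field names mirror the examples given in the user stories of this proposal;
the image location and the selector label are hypothetical.

```yaml
apiVersion: metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: example-host
spec:
  online: true
  # OS image and initial configuration describing the workload.
  image:
    url: https://example.com/image.qcow2           # hypothetical image location
    checksum: https://example.com/image.qcow2.md5
    format: qcow2
  userData:
    name: example-user-data                        # cloud-init configuration secret
  # Requirements on the compute resource (labels of a recyclable resource or
  # characteristics of a disposable one).
  hostSelector:
    matchLabels:
      purpose: worker                              # hypothetical label
```
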
## Motivation

So far, the primary use case of cluster-api-provider-metal3 is the creation of
a single target cluster from a temporary management cluster. The pivot process
transfers the resources describing the target cluster from the management
cluster to the target cluster. Once the pivot process is complete, the target
cluster takes over all the servers. It can scale based on its workload but it
cannot share its servers with other clusters.

There is another model where a single management cluster is used to create and
manage several clusters across a set of bare-metal servers. This is the focus
of the [Sylva Project](https://sylvaproject.org/) of the Linux Foundation.
Another example is [Das Schiff](https://github.com/telekom/das-schiff).

One of the issues encountered today is that the compute resources
(BareMetalHost) and the cluster definition (Cluster, MachineDeployment,
Machines, Metal3Machines, etc.) must be in the same namespace. Since the goal
is to share the compute resources, this means that a single namespace is used
for all resources. Consequently, unless very complex access control
rules are defined, cluster administrators have visibility over all clusters
and full control over the servers, as the credentials are stored in the same
namespace.

The solution so far is to completely proxy the access to the Kubernetes
resources that define the clusters.

Another unrelated problem is that Cluster-API has been designed
to define clusters using homogeneous compute resources: it is challenging to
define a cluster with both bare-metal servers and virtual machines in a private
or public cloud.
This [blog post](https://metal3.io/blog/2022/07/08/One_cluster_multiple_providers.html)
proposes several approaches but none is entirely satisfactory.

On the other hand, workloads can be easily defined in terms of OS images and
initial configurations, and standards such as qcow or cloud-init have emerged
and are used by various infrastructure technologies.
Due to the unique challenges of managing bare-metal, the Metal3 project has
developed a set of abstractions and tools that could be used in different
settings. The main mechanism is the selection process that occurs between
the Metal3Machine and the BareMetalHost, which assigns a disposable workload
(a Kubernetes node) to a recyclable compute resource (a server).

This proposal introduces a new resource called HostClaim that solves
both problems by decoupling the definition of the workload performed by
the Metal3 machine controller from the actual compute resource.
This resource acts as both a security boundary and a way to hide the
implementation details of the compute resource.

### Goals

* Split responsibilities between infrastructure teams, who manage servers, and
  cluster administrators, who create/update/scale baremetal clusters deployed
  on those servers, using traditional Kubernetes RBAC to ensure isolation.
* Provide a framework where cluster administrators can consume compute
  resources that are not baremetal servers, as long as they offer similar APIs,
  using the cluster-api-provider-metal3 to manage the life-cycle of those
  resources.
* Define a resource where a user can request a compute resource to execute
  an arbitrary workload described by an OS image and an initial configuration.
  The user does not need to know exactly which resource is used and may not
  have full control over this resource (typically no BMC access).

### Non-Goals

* How to implement HostClaim for specific compute resources that are not
  BareMetalHosts.
* Discovery of the capabilities exposed by the cluster.
  Neither the kinds of compute resources available nor the semantics of the
  selectors are handled.
* Compute resource quotas. The HostClaim resource should make it possible to
  develop a framework to limit the number/size of compute resources allocated
  to a tenant, similar to how quotas work for pods. However, the specification
  of such a framework will be addressed in another design document.
* Pivoting client cluster resources (managed clusters that are not the
  initial cluster).

## Proposal

### User Stories

#### As a user I would like to execute a workload on an arbitrary server

The OS image is available in qcow format on a remote server at ``url_image``.
It supports cloud-init and a script can launch the workload at boot time
(e.g., a systemd service).

The cluster offers bare-metal as a service using the Metal3 baremetal-operator.
However, as a regular user, I am not allowed to directly access the definitions
of the servers. All servers are labeled with an ``infra-kind`` label whose
value depends on the characteristics of the server.

* I create a resource with the following content:
  ```yaml
  apiVersion: metal3.io/v1alpha1
  kind: HostClaim
  metadata:
    name: my-host
  spec:
    online: false
    kind: baremetal
    hostSelector:
      matchLabels:
        infra-kind: medium
  ```
* After a while, the system associates the claim with a real server, and
  the resource's status is populated with the following information:
  ```yaml
  status:
    addresses:
    - address: 192.168.133.33
      type: InternalIP
    - address: fe80::6be8:1f93:7f65:59cf%ens3
      type: InternalIP
    - address: localhost.localdomain
      type: Hostname
    - address: localhost.localdomain
      type: InternalDNS
    bootMACAddress: "52:54:00:01:00:05"
    conditions:
    - lastTransitionTime: "2024-03-29T14:33:19Z"
      status: "True"
      type: Ready
    - lastTransitionTime: "2024-03-29T14:33:19Z"
      status: "True"
      type: AssociateBMH
    lastUpdated: "2024-03-29T14:33:19Z"
    nics:
    - MAC: "52:54:00:01:00:05"
      ip: 192.168.133.33
      name: ens3
  ```
* I also examine the annotations and labels of the HostClaim resource. They
  have been enriched with information from the BareMetalHost resource.
* I create three secrets in the same namespace: ``my-user-data``,
  ``my-meta-data``, and ``my-network-data``. I use the information from the
  status and metadata to customize the scripts they contain (a sketch of such
  a secret is shown after this list).
* I modify the HostClaim to point to those secrets and start the server:
  ```yaml
  apiVersion: metal3.io/v1alpha1
  kind: HostClaim
  metadata:
    name: my-host
  spec:
    online: true
    image:
      checksum: https://url_image.qcow2.md5
      url: https://url_image.qcow2
      format: qcow2
    userData:
      name: my-user-data
    metaData:
      name: my-meta-data
    networkData:
      name: my-network-data
    kind: baremetal
    hostSelector:
      matchLabels:
        infra-kind: medium
  ```
* The workload is launched. When the machine is fully provisioned, the boolean
  ``ready`` field in the status becomes true. I can stop the server by setting
  the ``online`` field to false. I can also perform a reboot by setting
  specific annotations in the reserved ``host.metal3.io`` domain.
* When I delete the HostClaim, the association is broken and another user can
  take over the server.
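
As an illustration, the user data secret could look like the sketch below. It
assumes that the HostClaim consumes cloud-init secrets with the same layout as
BareMetalHost user data (a ``userData`` key); the secret content itself is
hypothetical.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-user-data
type: Opaque
stringData:
  # Assumed key name, mirroring the userData secrets consumed by BareMetalHost.
  userData: |
    #cloud-config
    runcmd:
      # Hypothetical command starting the workload as a systemd service.
      - systemctl enable --now my-workload.service
```
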

#### As an infrastructure administrator I would like to host several isolated clusters

All the servers in the data-center are registered as BareMetalHosts in one or
several namespaces under the control of the infrastructure manager. A namespace
is created for each tenant of the infrastructure, and each tenant creates
standard cluster definitions in its own namespace. The only difference from
standard baremetal cluster definitions is the presence of a ``kind`` field in
the Metal3Machine templates. The value of this field is set to ``baremetal``.
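
A sketch of such a template is shown below. The names are hypothetical, and the
layout follows the simplified Metal3MachineTemplate example given later in this
proposal; ``kind`` is the new field introduced here.

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3MachineTemplate
metadata:
  name: tenant-workers            # hypothetical name, in the tenant namespace
spec:
  dataTemplate:
    name: tenant-workers-data     # hypothetical Metal3DataTemplate
  # New field introduced by this proposal: the kind of compute resource.
  kind: baremetal
  hostSelector:
    matchLabels:
      infra-kind: medium          # label published by the infrastructure team
  image:
    checksum: https://url_image.qcow2.md5
    format: qcow2
    url: https://url_image.qcow2
```
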

When the cluster is started, a HostClaim is created for each Metal3Machine
associated with the cluster. The ``hostSelector`` and ``kind`` fields are
inherited from the Metal3Machine. They are used to select the BareMetalHost
associated with the claim. The associated BareMetalHost is not in the same
namespace as the HostClaim. The exact definition of the BareMetalHost remains
hidden from the cluster user, but parts of its status and metadata are copied
back to the HostClaim namespace. With this information,
the data template controller has enough details to compute the different
secrets (userData, metaData and networkData) associated with the Metal3Machine.
Those secrets are linked to the HostClaim and, ultimately, to the
BareMetalHost.
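
For illustration, the HostClaim created in a tenant namespace for one
Metal3Machine might look like the following sketch. Names are hypothetical and
the exact set of fields copied back from the BareMetalHost is left to the
design details.

```yaml
apiVersion: metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: my-cluster-workers-abcde              # hypothetical, derived from the Metal3Machine
  namespace: tenant-a                         # tenant namespace, not the BareMetalHost namespace
spec:
  online: true
  kind: baremetal                             # inherited from the Metal3Machine
  hostSelector:                               # inherited from the Metal3Machine
    matchLabels:
      infra-kind: medium
  image:
    checksum: https://url_image.qcow2.md5
    format: qcow2
    url: https://url_image.qcow2
  userData:
    name: my-cluster-workers-abcde-user-data  # secret computed for this machine
```
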

When the cluster is modified, new Machine and Metal3Machine resources replace
the previous ones. The HostClaims follow the life-cycle of the Metal3Machines
and are destroyed unless they are tagged for node reuse. The BareMetalHosts are
recycled and are bound to new HostClaims, potentially belonging to other
clusters.

#### As a cluster administrator I would like to build a cluster with different kinds of nodes

This scenario assumes that:

* the cloud technologies CT<sub>i</sub> use qcow images and cloud-init to
  define workloads;
* clouds C<sub>i</sub> implementing CT<sub>i</sub> are accessible through
  credentials and endpoints described in a resource Cr<sub>i</sub>;
* a HostClaim controller exists for each CT<sub>i</sub>. Compute resources can
  be parameterized through arguments arg<sub>ij</sub> in the HostClaim.

One can build a cluster where each machine deployment MD<sub>i</sub>
contains Metal3Machine templates referring to kind CT<sub>i</sub>.
The arguments identify the credentials to use (Cr<sub>i</sub>):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3MachineTemplate
metadata:
  name: md-i
spec:
  dataTemplate:
    name: dt-i
  kind: CT_i
  args:
    arg_i1: v1
    ...
    arg_ik: vk
  hostSelector:
    matchLabels:
      ...
  image:
    checksum: https://image_url.qcow2.md5
    format: qcow2
    url: https://image_url.qcow2
```

The Metal3Machine controllers will create HostClaims with different kinds
handled by different controllers creating the different compute resources.
Connectivity must be established between the different subnets where each
controller creates its compute resources.

The argument extension is not strictly necessary but is cleaner than using
matchLabels in specific domains to convey information to controllers.
Controllers for disposable resources such as virtual machines typically do not
use hostSelectors. Controllers for a "bare-metal as a service" offering
may use selectors.

#### As a cluster administrator I would like to install a new baremetal cluster from a transient cluster

The bootstrap process can be performed as usual from an ephemeral cluster
(e.g., a KinD cluster). The constraint that all resources must be in the same
namespace (Cluster and BareMetalHost resources) must be respected. The
BareMetalHost should be marked as movable.

The only difference from the standard behavior is the presence of an
intermediate HostClaim resource, but the chain of resources is kept during the
transfer and the pause annotation is used to stop Ironic.
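
For reference, the sketch below shows a fragment of a BareMetalHost carrying
the existing Metal3 pause annotation during the move; the host name is
hypothetical.

```yaml
# Fragment of a BareMetalHost manifest during the move.
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: server-0
  annotations:
    # Pauses reconciliation of the host (and the corresponding Ironic actions)
    # while the resources are transferred to the target cluster.
    baremetalhost.metal3.io/paused: ""
```
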

Because this operation is only performed by the administrator of a cluster
manager, the fact that the cluster definition and the BareMetalHosts are in
the same namespace should not be an issue.

The tenant clusters cannot be pivoted, which is expected from a security
point of view, as pivoting would hand the credentials of the bare-metal servers
to the tenants. Partial pivot can be achieved with the help of a controller
replicating HostClaims on other clusters, but the specification of such a
controller is beyond the scope of this document.

## Design Details

TBD.