Commit 0aaaccc

definition of the HostClaim resource

The HostClaim resource is introduced to address multi-tenancy in Metal3 and
the definition of hybrid clusters with cluster-api. This commit introduces
four scenarios covering the three main use cases and how pivot is performed.
The presentation of the design will be addressed in another merge request.

Co-authored-by: Pierre Crégut <[email protected]>
Co-authored-by: Laurent Roussarie <[email protected]>
Signed-off-by: Pierre Crégut <[email protected]>

1 parent fa98964 commit 0aaaccc

1 file changed: +307 -0 lines changed

<!--
This work is licensed under a Creative Commons Attribution 3.0
Unported License.

http://creativecommons.org/licenses/by/3.0/legalcode
-->

# HostClaim: multi-tenancy and hybrid clusters

## Status

provisional

## Summary

We introduce a new Custom Resource (named HostClaim) which will facilitate
the creation of multiple clusters for different tenants. It also provides
a framework for building clusters with different kinds of compute resources:
bare-metal servers, but also virtual machines hosted in private or public clouds.

A HostClaim decouples the client need from the actual implementation of the
compute resource: it establishes a security boundary and provides a way to
migrate nodes between different kinds of compute resources.

A HostClaim expresses that one wants to start a given
OS image with an initial configuration (typically cloud-init or ignition
configuration files) on a compute resource that meets a set of requirements
(host selectors). These requirements can be interpreted either as labels
for a recyclable resource (such as a bare-metal server) or as characteristics
for a disposable resource (such as a virtual machine created for the workload).
The status and metadata of the HostClaim provide the necessary information
for the end user to define and manage their workload on the compute resource,
but they do not grant full control over the resource (typically, BMC
credentials of servers are not exposed to the tenant).

## Motivation

So far, the primary use case of cluster-api-baremetal is the creation of a
single target cluster from a temporary management cluster. The pivot process
transfers the resources describing the target cluster from the management
cluster to the target cluster. Once the pivot process is complete, the target
cluster takes over all the servers. It can scale based on its workload but it
cannot share its servers with other clusters.

There is another model where a single management cluster is used to create and
manage several clusters across a set of bare-metal servers. This is the focus
<!-- cSpell:ignore Sylva Schiff -->
of the [Sylva Project](https://sylvaproject.org/) of the Linux Foundation.
Another example is [Das Schiff](https://github.com/telekom/das-schiff).

One of the issues encountered today is that the compute resources
(BareMetalHost) and the cluster definition (Cluster, MachineDeployment,
Machines, Metal3Machines, etc.) must be in the same namespace. Since the goal
is to share the compute resources, this means that a single namespace is used
for all resources. Consequently, unless very complex access control
rules are defined, cluster administrators have visibility over all clusters
and full control over the servers, as the credentials are stored in the same
namespace.

The solution so far is to completely proxy the access to the Kubernetes
resources that define the clusters.

Another, unrelated, problem is that Cluster-API has been designed
to define clusters using homogeneous compute resources: it is challenging to
define a cluster with both bare-metal servers and virtual machines in a private
or public cloud.
This [blog post](https://metal3.io/blog/2022/07/08/One_cluster_multiple_providers.html)
proposes several approaches but none is entirely satisfactory.

On the other hand, workloads can easily be defined in terms of OS images and
initial configurations, and standards such as qcow and cloud-init have emerged
and are used by various infrastructure technologies.
Due to the unique challenges of managing bare-metal servers, the Metal3 project
has developed a set of abstractions and tools that could be used in different
settings. The main mechanism is the selection process that occurs between
the Metal3Machine and the BareMetalHost, which assigns a disposable workload
(a Kubernetes node) to a recyclable compute resource (a server).

This proposal introduces a new resource called HostClaim that solves
both problems by decoupling the definition of the workload performed by
the Metal3 machine controller from the actual compute resource.
This resource acts as both a security boundary and a way to hide the
implementation details of the compute resource.

### Goals

* Split responsibilities between infrastructure teams, who manage servers, and
  cluster administrators, who create/update/scale baremetal clusters deployed
  on those servers, using traditional Kubernetes RBAC to ensure isolation
  (see the sketch after this list).
* Provide a framework where cluster administrators can consume compute
  resources that are not baremetal servers, as long as they offer similar APIs,
  using the cluster-api-provider-metal3 to manage the life-cycle of those
  resources.
* Define a resource where a user can request a compute resource to execute
  an arbitrary workload described by an OS image and an initial configuration.
  The user does not need to know exactly which resource is used and may not
  have full control over this resource (typically no BMC access).
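
The isolation relies on standard, namespace-scoped RBAC. The following is only
a sketch of what a tenant role could look like: the ``metal3.io`` API group and
the ``hostclaims`` resource name follow the examples in this proposal and are
assumptions, not a final API.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: hostclaim-user
  namespace: tenant-a            # tenant namespace, not the BareMetalHost namespace
rules:
  # Tenants manage HostClaims and the secrets they reference (userData,
  # metaData, networkData), but never see BareMetalHost or BMC credentials.
  - apiGroups: ["metal3.io"]
    resources: ["hostclaims", "hostclaims/status"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "create", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: hostclaim-user
  namespace: tenant-a
subjects:
  - kind: Group
    name: tenant-a-admins        # hypothetical tenant group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: hostclaim-user
  apiGroup: rbac.authorization.k8s.io
```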

### Non-Goals

* How to implement HostClaim for specific compute resources that are not
  BareMetalHost.
* Discovery of which capabilities are exposed by the cluster.
  Which kinds of compute resources are available and the semantics of the
  selectors are not handled.
* Compute resource quotas. The HostClaim resource should make it possible to
  develop a framework to limit the number/size of compute resources allocated
  to a tenant, similar to how quotas work for pods. However, the specification
  of such a framework will be addressed in another design document.
* Pivoting client cluster resources (managed clusters that are not the
  initial cluster).

## Proposal

### User Stories

#### As a user I would like to execute a workload on an arbitrary server

The OS image is available in qcow format on a remote server at ``url_image``.
It supports cloud-init, and a script can launch the workload at boot time
(e.g., a systemd service).

The cluster offers bare-metal as a service using the Metal3 baremetal-operator.
However, as a regular user, I am not allowed to directly access the definitions
of the servers. All servers are labeled with an ``infra-kind`` label whose
value depends on the characteristics of the server.

* I create a resource with the following content:

  ```yaml
  apiVersion: metal3.io/v1alpha1
  kind: HostClaim
  metadata:
    name: my-host
  spec:
    online: false
    kind: baremetal

    hostSelector:
      matchLabels:
        infra-kind: medium
  ```

* After a while, the system associates the claim with a real server, and
  the resource's status is populated with the following information:

  ```yaml
  status:
    addresses:
    - address: 192.168.133.33
      type: InternalIP
    - address: fe80::6be8:1f93:7f65:59cf%ens3
      type: InternalIP
    - address: localhost.localdomain
      type: Hostname
    - address: localhost.localdomain
      type: InternalDNS
    bootMACAddress: "52:54:00:01:00:05"
    conditions:
    - lastTransitionTime: "2024-03-29T14:33:19Z"
      status: "True"
      type: Ready
    - lastTransitionTime: "2024-03-29T14:33:19Z"
      status: "True"
      type: AssociateBMH
    lastUpdated: "2024-03-29T14:33:19Z"
    nics:
    - MAC: "52:54:00:01:00:05"
      ip: 192.168.133.33
      name: ens3
  ```

* I also examine the annotations and labels of the HostClaim resource. They
  have been enriched with information from the BareMetalHost resource.
* I create three secrets in the same namespace: ``my-user-data``,
  ``my-meta-data``, and ``my-network-data``. I use the information from the
  status and metadata to customize the scripts they contain (a sketch of such
  a secret is given after this list).
* I modify the HostClaim to point to those secrets and start the server:

  ```yaml
  apiVersion: metal3.io/v1alpha1
  kind: HostClaim
  metadata:
    name: my-host
  spec:
    online: true
    image:
      checksum: https://url_image.qcow2.md5
      url: https://url_image.qcow2
      format: qcow2
    userData:
      name: my-user-data
    networkData:
      name: my-network-data
    kind: baremetal
    hostSelector:
      matchLabels:
        infra-kind: medium
  ```

* The workload is launched. When the machine is fully provisioned, the boolean
  field ``ready`` in the status becomes true. I can stop the server by changing
  the ``online`` field. I can also perform a reboot by targeting specific
  annotations in the reserved ``host.metal3.io`` domain.
* When I destroy the host, the association is broken and another user can take
  over the server.
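
As an illustration, the ``my-user-data`` secret could look like the sketch
below. The key name and the cloud-init payload are assumptions for this
example, not part of the proposal.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-user-data           # referenced by the HostClaim userData field
type: Opaque
stringData:
  # Key name assumed for illustration; use whatever the provisioning stack expects.
  userData: |
    #cloud-config
    runcmd:
      # Hypothetical systemd service that launches the workload at boot time.
      - systemctl enable --now my-workload.service
```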

#### As an infrastructure administrator I would like to host several isolated clusters

All the servers in the data-center are registered as BareMetalHost in one or
several namespaces under the control of the infrastructure manager. A namespace
is created for each tenant of the infrastructure. Tenants create
standard cluster definitions in those namespaces. The only difference with
standard baremetal cluster definitions is the presence of a ``kind`` field in
the Metal3Machine templates. The value of this field is set to ``baremetal``,
as in the sketch below.
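
The following sketch of such a template follows the simplified layout used in
the next user story; the names, labels, and image URLs are placeholders, and
the ``kind`` field is the extension proposed here.

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3MachineTemplate
metadata:
  name: tenant-a-workers       # placeholder
  namespace: tenant-a          # tenant namespace
spec:
  dataTemplate:
    name: tenant-a-workers-template
  kind: baremetal              # the only addition to a standard template
  hostSelector:
    matchLabels:
      infra-kind: medium
  image:
    checksum: https://url_image.qcow2.md5
    format: qcow2
    url: https://url_image.qcow2
```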

When the cluster is started, a HostClaim is created for each Metal3Machine
associated with the cluster. The ``hostSelector`` and ``kind`` fields are
inherited from the Metal3Machine. They are used to select the BareMetalHost
associated with the cluster. The associated BareMetalHost is not in the same
namespace as the HostClaim. The exact definition of the BareMetalHost remains
hidden from the cluster user, but parts of its status and metadata are copied
back to the HostClaim namespace. With this information,
the data template controller has enough details to compute the different
secrets (userData, metaData and networkData) associated with the Metal3Machine.
Those secrets are linked to the HostClaim and, ultimately, to the
BareMetalHost.

When the cluster is modified, new Machine and Metal3Machine resources replace
the previous ones. The HostClaims follow the life-cycle of the Metal3Machines
and are destroyed unless they are tagged for node reuse. The BareMetalHosts are
recycled and are bound to new HostClaims, potentially belonging to other
clusters.

#### As a cluster administrator I would like to build a cluster with different kinds of nodes

This scenario assumes that:

* the cloud technologies CT_i use qcow images and cloud-init to
  define workloads.
* clouds C_i implementing CT_i are accessible through
  credentials and endpoints described in a resource Cr_i.
* a HostClaim controller exists for each CT_i. Compute resources can
  be parameterized through arguments arg_ij in the HostClaim.

One can build a cluster where each machine deployment MD_i
contains Metal3Machine templates referring to kind CT_i.
The arguments identify the credentials to use (Cr_i).

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3MachineTemplate
metadata:
  name: md-i
spec:
  dataTemplate:
    name: dt-i
  kind: CT_i
  args:
    arg_i1: v1
    ...
    arg_ik: vk
  hostSelector:
    matchLabels:
      ...
  image:
    checksum: https://image_url.qcow2.md5
    format: qcow2
    url: https://image_url.qcow2
```

The Metal3Machine controller will create HostClaims with different kinds,
handled by different controllers that create the corresponding compute
resources.
Connectivity must be established between the different subnets where each
controller creates its compute resources.

The argument extension is not strictly necessary but is cleaner than using
matchLabels in specific domains to convey information to the controllers.
Controllers for disposable resources such as virtual machines typically do not
use host selectors. Controllers for a bare-metal-as-a-service offering
may use selectors. A sketch of a HostClaim generated for such a machine
deployment is shown below.
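
As an illustration only, the HostClaim created for a machine of deployment MD_i
could look like the following sketch, assuming the ``kind`` and ``args`` fields
are copied verbatim from the Metal3MachineTemplate and interpreted by the
controller handling CT_i (names and values are placeholders):

```yaml
apiVersion: metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: md-i-abcde             # generated name, placeholder
spec:
  online: true
  kind: CT_i                   # selects the controller implementing CT_i
  args:                        # interpreted by that controller (e.g. flavor, network)
    arg_i1: v1
  hostSelector: {}             # disposable resources typically ignore selectors
  image:
    checksum: https://image_url.qcow2.md5
    format: qcow2
    url: https://image_url.qcow2
  userData:
    name: md-i-abcde-user-data # bootstrap secret generated by cluster-api
```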

#### As a cluster administrator I would like to install a new baremetal cluster from a transient cluster

The bootstrap process can be performed as usual from an ephemeral cluster
(e.g., a KinD cluster). The constraint that all resources must be in the same
namespace (Cluster and BareMetalHost resources) must be respected. The
BareMetalHost should be marked as movable.

The only difference with the behavior without HostClaim is the presence of an
intermediate HostClaim resource, but the chain of resources is kept during the
transfer and the pause annotation is used to stop Ironic, as in the sketch
below.
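
For instance, pausing a host before the move could look like the following
sketch. The ``baremetalhost.metal3.io/paused`` annotation is the existing
baremetal-operator pause mechanism; how it interacts with the HostClaim layer
is left to the design details.

```yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-0               # placeholder
  annotations:
    # Pauses reconciliation (and Ironic actions) while the resources are moved.
    baremetalhost.metal3.io/paused: ""
```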

Because this operation is only performed by the administrator of a cluster
manager, the fact that the cluster definition and the BareMetalHosts are in
the same namespace should not be an issue.

The tenant clusters cannot be pivoted, which can be expected from a security
point of view, as pivoting would give the bare-metal server credentials to the
tenants. Partial pivot can be achieved with the help of a controller
replicating HostClaims on other clusters, but the specification of such a
controller is beyond the scope of this document.

## Design Details

TBD.
