
Commit 40604f3

definition of the HostClaim resource

The HostClaim resource is introduced to address multi-tenancy in Metal3 and the definition of hybrid clusters with cluster-api. This commit introduces four scenarios representing the three main use cases and how pivoting is performed. The presentation of the design will be addressed in another merge request.

Co-authored-by: Pierre Crégut <[email protected]>
Co-authored-by: Laurent Roussarie <[email protected]>
Signed-off-by: Pierre Crégut <[email protected]>

1 parent fa98964 commit 40604f3

File tree: 1 file changed, +297 −0 lines changed

<!--
This work is licensed under a Creative Commons Attribution 3.0
Unported License.

http://creativecommons.org/licenses/by/3.0/legalcode
-->

# HostClaim: multi-tenancy and hybrid clusters

## Status

provisional

## Summary

We introduce a new Custom Resource (named HostClaim) that facilitates
the creation of multiple clusters for different tenants. It also provides
a framework for building clusters with different kinds of compute resources:
bare-metal servers, but also virtual machines hosted in private or public clouds.

A HostClaim decouples the client's need from the actual implementation of the
compute resource: it establishes a security boundary and provides a way to
migrate nodes between different kinds of compute resources.

A HostClaim expresses that one wants to start a given
OS image with an initial configuration (typically cloud-init or ignition
configuration files) on a compute resource that meets a set of requirements
(host selectors). These requirements can be interpreted either as labels
for a recyclable resource (such as a bare-metal server) or as characteristics
for a disposable resource (such as a virtual machine created for the workload).
The status and metadata of the HostClaim provide the necessary information
for the end user to define and manage their workload on the compute resource,
but they do not grant full control over the resource (typically, BMC
credentials of servers are not exposed to the tenant).

## Motivation

So far, the primary use case of cluster-api-baremetal is the creation of a
single target cluster from a temporary management cluster. The pivot process
transfers the resources describing the target cluster from the management
cluster to the target cluster. Once the pivot process is complete, the target
cluster takes over all the servers. It can scale based on its workload, but it
cannot share its servers with other clusters.

There is another model where a single management cluster is used to create and
manage several clusters across a set of bare-metal servers. This is the focus
of the [Sylva Project](https://sylvaproject.org/) of the Linux Foundation.
Another example is [Das Schiff](https://github.com/telekom/das-schiff).

One of the issues encountered today is that the compute resources
(BareMetalHost) and the cluster definition (Cluster, MachineDeployment,
Machines, Metal3Machines, etc.) must be in the same namespace. Since the goal
is to share the compute resources, this means that a single namespace is used
for all resources. Consequently, unless very complex access control
rules are defined, cluster administrators have visibility over all clusters
and full control over the servers, as the credentials are stored in the same
namespace.

The solution so far is to completely proxy the access to the Kubernetes
resources that define the clusters.

Another, unrelated, problem is that Cluster-API has been designed
to define clusters using homogeneous compute resources: it is challenging to
define a cluster with both bare-metal servers and virtual machines in a private
or public cloud.
This [blog post](https://metal3.io/blog/2022/07/08/One_cluster_multiple_providers.html)
proposes several approaches, but none is entirely satisfactory.

On the other hand, workloads can easily be defined in terms of OS images and
initial configurations, and standards such as qcow or cloud-init have emerged
and are used by various infrastructure technologies.
Due to the unique challenges of managing bare-metal servers, the Metal3 project
has developed a set of abstractions and tools that could be used in different
settings. The main mechanism is the selection process that occurs between
the Metal3Machine and the BareMetalHost, which assigns a disposable workload
(a Kubernetes node) to a recyclable compute resource (a server).

This proposal introduces a new resource called HostClaim that solves
both problems by decoupling the definition of the workload performed by
the Metal3 machine controller from the actual compute resource.
This resource acts as both a security boundary and a way to hide the
implementation details of the compute resource.

### Goals

* Split responsibilities between infrastructure teams, who manage servers, and
  cluster administrators, who create/update/scale baremetal clusters deployed
  on those servers, using traditional Kubernetes RBAC to ensure isolation
  (a sketch of such a role is given after this list).
* Provide a framework where cluster administrators can consume compute
  resources that are not baremetal servers, as long as they offer similar APIs,
  using the cluster-api-provider-metal3 to manage the life-cycle of those
  resources.
* Define a resource where a user can request a compute resource to execute
  an arbitrary workload described by an OS image and an initial configuration.
  The user does not need to know exactly which resource is used and may not
  have full control over this resource (typically no BMC access).

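For illustration, a minimal RBAC sketch of this split, assuming that HostClaims
are served under the ``metal3.io`` API group (as in the ``apiVersion`` used by
the examples below) and that each tenant works in its own namespace; the role
and namespace names are placeholders:

```yaml
# Tenant-side role: full access to HostClaims in the tenant namespace.
# BareMetalHosts and their BMC credentials live in infrastructure
# namespaces for which the tenant has no binding, so they stay hidden.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: hostclaim-user
  namespace: tenant-a
rules:
- apiGroups: ["metal3.io"]
  resources: ["hostclaims", "hostclaims/status"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```
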
### Non-Goals

* How to implement HostClaim for specific compute resources that are not
  BareMetalHosts.
* Discovery of the capabilities exposed by the cluster: which kinds of compute
  resources are available and the semantics of the selectors are not handled.
* Compute resource quotas. The HostClaim resource should make it possible to
  develop a framework to limit the number/size of compute resources allocated
  to a tenant, similar to how quotas work for pods. However, the specification
  of such a framework will be addressed in another design document.
* Pivoting client cluster resources (managed clusters that are not the
  initial cluster).

## Proposal

### User Stories

#### As a user I would like to execute a workload on an arbitrary server

The OS image is available in qcow format on a remote server at ``url_image``.
It supports cloud-init, and a script can launch the workload at boot time
(e.g., a systemd service).

The cluster offers bare-metal as a service using the Metal3 baremetal-operator.
However, as a regular user, I am not allowed to directly access the definitions
of the servers. All servers are labeled with an ``infra-kind`` label whose
value depends on the characteristics of the computer.

* I create a resource with the following content:

  ```yaml
  apiVersion: metal3.io/v1alpha1
  kind: HostClaim
  metadata:
    name: my-host
  spec:
    online: false
    kind: baremetal

    hostSelector:
      matchLabels:
        infra-kind: medium
  ```
* After a while, the system associates the claim with a real server, and
  the resource's status is populated with the following information:

  ```yaml
  status:
    addresses:
    - address: 192.168.133.33
      type: InternalIP
    - address: fe80::6be8:1f93:7f65:59cf%ens3
      type: InternalIP
    - address: localhost.localdomain
      type: Hostname
    - address: localhost.localdomain
      type: InternalDNS
    bootMACAddress: "52:54:00:01:00:05"
    conditions:
    - lastTransitionTime: "2024-03-29T14:33:19Z"
      status: "True"
      type: Ready
    - lastTransitionTime: "2024-03-29T14:33:19Z"
      status: "True"
      type: AssociateBMH
    lastUpdated: "2024-03-29T14:33:19Z"
    nics:
    - MAC: "52:54:00:01:00:05"
      ip: 192.168.133.33
      name: ens3
  ```
* I also examine the annotations and labels of the HostClaim resource. They
  have been enriched with information from the BareMetalHost resource.
* I create three secrets in the same namespace, ``my-user-data``,
  ``my-meta-data``, and ``my-network-data``. I use the information from the
  status and metadata to customize the scripts they contain (a sketch of such
  a secret is given after this list).
* I modify the HostClaim to point to those secrets and start the server:

  ```yaml
  apiVersion: metal3.io/v1alpha1
  kind: HostClaim
  metadata:
    name: my-host
  spec:
    online: true
    image:
      checksum: https://url_image.qcow2.md5
      url: https://url_image.qcow2
      format: qcow2
    userData:
      name: my-user-data
    networkData:
      name: my-network-data
    kind: baremetal
    hostSelector:
      matchLabels:
        infra-kind: medium
  ```
* The workload is launched. When the machine is fully provisioned, the boolean
  field ``ready`` in the status becomes true. I can stop the server by changing
  the ``online`` field. I can also perform a reboot by setting specific
  annotations in the reserved ``host.metal3.io`` domain.
* When I destroy the host, the association is broken and another user can take
  over the server.

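For illustration, a minimal sketch of the ``my-user-data`` secret, assuming a
cloud-init ``#cloud-config`` payload stored under a ``userData`` key (the key
name and the systemd unit are assumptions, not part of this proposal):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-user-data
type: Opaque
stringData:
  # Assumed key name; the provisioning stack defines the expected key.
  userData: |
    #cloud-config
    write_files:
    - path: /etc/systemd/system/workload.service
      content: |
        [Unit]
        Description=Example workload
        [Service]
        ExecStart=/usr/local/bin/workload
        [Install]
        WantedBy=multi-user.target
    runcmd:
    - systemctl enable --now workload.service
```
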
#### As an infrastructure administrator I would like to host several isolated clusters

All the servers in the data center are registered as BareMetalHosts in one or
several namespaces under the control of the infrastructure manager. Namespaces
are created for each tenant of the infrastructure. The tenants create
standard cluster definitions in those namespaces. The only difference with
standard baremetal cluster definitions is the presence of a ``kind`` field in
the Metal3Machine templates. The value of this field is set to ``baremetal``.

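For illustration, such a template could look like the following sketch, using
the abridged layout of the hybrid-cluster example later in this document; the
names and labels are placeholders:

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3MachineTemplate
metadata:
  name: tenant-md
  namespace: tenant-a
spec:
  # Field introduced by this proposal: the HostClaims created for the
  # Metal3Machines of this template will carry this kind.
  kind: baremetal
  hostSelector:
    matchLabels:
      infra-kind: medium
  image:
    checksum: https://url_image.qcow2.md5
    format: qcow2
    url: https://url_image.qcow2
```
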
When the cluster is started, a HostClaim is created for each Metal3Machine
associated with the cluster. The ``hostSelector`` and ``kind`` fields are
inherited from the Metal3Machine. They are used to select the BareMetalHost
associated with the cluster. The associated BareMetalHost is not in the same
namespace as the HostClaim. The exact definition of the BareMetalHost remains
hidden from the cluster user, but parts of its status and metadata are copied
back to the HostClaim namespace. With this information,
the data template controller has enough details to compute the different
secrets (userData, metaData and networkData) associated with the Metal3Machine.
Those secrets are linked to the HostClaim and, ultimately, to the
BareMetalHost.

When the cluster is modified, new Machine and Metal3Machine resources replace
the previous ones. The HostClaims follow the life-cycle of the Metal3Machines
and are destroyed unless they are tagged for node reuse. The BareMetalHosts are
recycled and are bound to new HostClaims, potentially belonging to other
clusters.

#### As a cluster administrator I would like to build a cluster with different kinds of nodes

This scenario assumes that:

* the cloud technologies CT<sub>i</sub> use qcow images and cloud-init to
  define workloads.
* clouds C<sub>i</sub> implementing CT<sub>i</sub> are accessible through
  credentials and endpoints described in a resource Cr<sub>i</sub>.
* a HostClaim controller exists for each CT<sub>i</sub>. Compute resources can
  be parameterized through arguments arg<sub>ij</sub> in the HostClaim.

One can build a cluster where each machine deployment MD<sub>i</sub>
contains Metal3Machine templates referring to kind CT<sub>i</sub>.
The arguments identify the credentials to use (Cr<sub>i</sub>).

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3MachineTemplate
metadata:
  name: md-i
spec:
  dataTemplate:
    name: dt-i
  kind: CT_i
  args:
    arg_i1: v1
    ...
    arg_ik: vk
  hostSelector:
    matchLabels:
      ...
  image:
    checksum: https://image_url.qcow2.md5
    format: qcow2
    url: https://image_url.qcow2
```

The Metal3Machine controllers will create HostClaims with different kinds,
handled by different controllers creating the different compute resources.
Connectivity must be established between the different subnets where each
controller creates its compute resources.

The argument extension is not strictly necessary but is cleaner than using
matchLabels in specific domains to convey information to the controllers.
Controllers for disposable resources such as virtual machines typically do not
use hostSelectors. Controllers for a "bare-metal as a service" offering
may use selectors.

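For illustration, a HostClaim created by the Metal3Machine controller for a
machine of such a machine deployment might look like the following sketch; the
resource name, the secret name, and the concrete arguments are placeholders:

```yaml
apiVersion: metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: md-i-machine-0
spec:
  online: true
  # Dispatches the claim to the HostClaim controller implementing CT_i.
  kind: CT_i
  # Arguments copied from the Metal3MachineTemplate; among other things,
  # they identify the credential resource Cr_i used to reach cloud C_i.
  args:
    arg_i1: v1
  image:
    checksum: https://image_url.qcow2.md5
    format: qcow2
    url: https://image_url.qcow2
  userData:
    name: md-i-machine-0-user-data
```
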
#### As a cluster administrator I would like to install a new baremetal cluster from a transient cluster

The bootstrap process can be performed as usual from an ephemeral cluster
(e.g., a KinD cluster). The constraint that all resources must be in the same
namespace (Cluster and BareMetalHost resources) must be respected. The
BareMetalHost should be marked as movable.

The only difference from the behavior without HostClaims is the presence of an
intermediate HostClaim resource, but the chain of resources is kept during the
transfer and the pause annotation is used to stop Ironic.

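For reference, a sketch of a paused BareMetalHost, assuming the pause
annotation used by the Metal3 baremetal-operator
(``baremetalhost.metal3.io/paused``); during an actual pivot the provisioning
tooling is expected to set and remove this annotation rather than the user:

```yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: my-server
  annotations:
    # While this annotation is present, the baremetal-operator stops
    # reconciling the host, so Ironic does not act on it during the move.
    baremetalhost.metal3.io/paused: ""
```
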
Because this operation is only performed by the administrator of a cluster
manager, the fact that the cluster definition and the BareMetalHosts are in
the same namespace should not be an issue.

The tenant clusters cannot be pivoted, which can be expected from a security
point of view, as it would give the credentials of the bare-metal servers to
the tenants. Partial pivot can be achieved with the help of a controller that
replicates HostClaims on other clusters, but the specification of such a
controller is beyond the scope of this document.

## Design Details

TBD.