http://creativecommons.org/licenses/by/3.0/legalcode
-->

<!-- cSpell:ignore Sylva Schiff Kanod argocd GitOps -->

# HostClaim: multi-tenancy and hybrid clusters

## Status

of such a framework will be addressed in another design document.

* Pivoting client cluster resources (managed clusters that are not the
  initial cluster).
* Using BareMetalHosts defined in other clusters. The HostClaim concept
  supports this use case with some extensions. The design is outlined in the
  alternative approach section but is beyond the scope of this document.
## Proposal

### User Stories

#### Deployment of Simple Workloads

As a user I would like to execute a workload on an arbitrary server.

The OS image is available in qcow format on a remote server at `url_image`.
It supports cloud-init and a script can launch the workload at boot time
* When I destroy the host, the association is broken and another user can take
  over the server.
#### Multi-tenancy

As an infrastructure administrator I would like to host several isolated clusters.

All the servers in the data-center are registered as BareMetalHost in one or
several namespaces under the control of the infrastructure manager. Namespaces
and are destroyed unless they are tagged for node reuse. The BareMetalHosts are
recycled and are bound to new HostClaims, potentially belonging to other
clusters.
#### Hybrid Clusters

As a cluster administrator I would like to build a cluster with different kinds of nodes.

This scenario assumes that:
Controllers for disposable resources such as virtual machines typically do not
use hostSelectors. Controllers for a "bare-metal as a service" service
may use selectors.
#### Manager Cluster Bootstrap

As a cluster administrator I would like to install a new baremetal cluster from a transient cluster.

The bootstrap process can be performed as usual from an ephemeral cluster
(e.g., a KinD cluster). The constraint that all resources must be in the same
## Design Details

### Implementation Details/Notes/Constraints

### Risks and Mitigations

#### Security Impact of Making BareMetalHost Selection Cluster-wide

The main difference between Metal3 machines and HostClaims is the
selection process, where a HostClaim can be bound to a BareMetalHost
in another namespace. We must make sure that this behavior is expected
by the owner of the BareMetalHost resources, especially when we upgrade the
Metal3 Cluster API provider to a version supporting HostClaim.

The solution is to enforce that BareMetalHosts that can be bound to a
HostClaim have a label (proposed name: `hosts.metal3.io/namespaces`)
restricting authorized HostClaims to specific namespaces. The value could be
either `*` for no constraint, or a comma-separated list of namespace names.
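For illustration, here is a minimal sketch of a BareMetalHost that can only be
bound to HostClaims from the `cluster-a` namespace. The label key is the one
proposed above; the host name, namespace, and BMC details are hypothetical.

```yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: server-42
  namespace: infra
  labels:
    # Only HostClaims from this namespace may be bound to this host.
    # The proposal also allows "*" (no constraint) or a comma-separated
    # list of namespace names.
    hosts.metal3.io/namespaces: cluster-a
spec:
  online: true
  bootMACAddress: "52:54:00:00:00:2a"
  bmc:
    address: redfish://10.0.0.42/redfish/v1/Systems/1
    credentialsName: server-42-bmc-secret
```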
331
+
332
+ #### Tenants Trying to Bypass the Selection Mechanism
333
+
334
+ The fact that a HostClaim is bound to a specific BareMetalHost will appear
335
+ as a label in the HostClaim and the HostClaim controller will use it to find
336
+ the associated BareMetalHost. This label could be modified by a malicious
337
+ tenant.
338
+
339
+ But the BareMetalHost has also a consumer reference. The label is only an
340
+ indication of the binding. If the consumer reference is invalid (different
341
+ from the HostClaim label), the label MUST be erased and the HostClaim
342
+ controller MUST NOT accept the binding.
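As an illustration, a consistent binding might look as follows. The
`hostclaims.metal3.io/host` label key and its value format are hypothetical
(this document does not fix them), and the HostClaim API group and version are
assumed to match those of BareMetalHost; only the consumer reference on the
BareMetalHost side is authoritative.

```yaml
apiVersion: metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: worker-1
  namespace: cluster-a
  labels:
    # Hint only: the BareMetalHost this claim believes it is bound to.
    hostclaims.metal3.io/host: infra.server-42
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: server-42
  namespace: infra
spec:
  # Authoritative binding: if this reference does not designate the HostClaim
  # above, the controller erases the label and rejects the binding.
  consumerRef:
    apiVersion: metal3.io/v1alpha1
    kind: HostClaim
    name: worker-1
    namespace: cluster-a
```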
#### Performance Impact

The proposal introduces a new resource with an associated controller between
the Metal3Machine and the BareMetalHost. There will be some duplication
of information between the BareMetalHost and the HostClaim status. The impact
for each node should still be limited, especially when compared to the cost of
each Ironic action.

Because we plan to have several controllers for different kinds of compute
resources, one can expect a few controllers working on the same custom resource.
This may create additional pressure on the Kubernetes API server. It is possible
to limit the amount of information exchanged with a specific controller using
server-side filters on watch/list requests. To use this feature on current
Kubernetes versions, the HostClaim kind field must be copied into a label.
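For example, a HostClaim might duplicate its kind in a label so that each
controller can restrict its watch with a server-side label selector. The
`hostclaims.metal3.io/kind` label key and the exact location of the `kind`
field are hypothetical here.

```yaml
apiVersion: metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: worker-1
  namespace: cluster-a
  labels:
    # Duplicates the kind field; a controller that only handles bare-metal
    # claims can watch with the selector hostclaims.metal3.io/kind=baremetal.
    hostclaims.metal3.io/kind: baremetal
spec:
  kind: baremetal
```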
#### Impact on Other Cluster API Components

There should be none: other components should mostly rely on Machine and Cluster
objects. Some tools may look at Metal3Machine conditions, where some condition
names may be modified, but the semantics of the Ready condition will be preserved.

### Work Items

### Dependencies

### Test Plan

### Upgrade / Downgrade Strategy

### Version Skew Strategy

## Drawbacks

## Alternatives

### Multi-Tenancy Without HostClaim
380
+
381
+ We assume that we have a Kubernetes cluster managing a set of clusters for
382
+ cluster administrators (referred to as tenants in the following). Multi-tenancy
383
+ is a way to ensure that tenants have only control over their clusters.
384
+
385
+ There are at least two other ways for implementing multi-tenancy without
386
+ HostClaim. These methods proxy the entire definition of the cluster
387
+ or proxy the BareMetalHost itself.
388
+
389
+ #### Isolation Through Overlays
390
+
391
+ A solution for multi-tenancy is to hide all cluster resources from the end
392
+ user. In this approach, clusters and BareMetalHosts are defined within a single
393
+ namespace, but the cluster creation process ensures that resources
394
+ from different clusters do not overlap.
395
+
396
+ This approach was explored in the initial versions of the Kanod project.
397
+ Clusters must be described by the tenant in a git repository and the
398
+ descriptions are imported by a GitOps framework (argocd). The definitions are
399
+ processed by an argocd plugin that translates the YAML expressing the user's
400
+ intent into Kubernetes resources, and the naming of resources created by
401
+ this plugin ensures isolation.
402
+
403
+ Instead of using a translation plugin, it would be better to use a set of
404
+ custom resources. However, it is important to ensure that they are defined in
405
+ separate namespaces.
406
+
407
+ This approach has several drawbacks:
408
+
409
+ * The plugin or the controllers for the abstract clusters are complex
410
+ applications if we want to support many options, and they become part of
411
+ the trusted computing base of the cluster manager.
412
+ * It introduces a new level of access control that is distinct from the
413
+ Kubernetes model. If we want tooling or observability around the created
414
+ resources, we would need custom tools that adhere to this new policy, or we
415
+ would need to reflect everything we want to observe in the new custom
416
+ resources.
417
+ * This approach does not solve the problem of hybrid clusters.
418
+
419
#### Ephemeral BareMetalHost

Another solution is to have separate namespaces for each cluster but
import BareMetalHosts in those namespaces on demand when new compute resources
are needed.

The cluster requires a resource that acts as a source of BareMetalHosts, which
can be parameterized on server requirements and the number of replicas. The
concept of
[BareMetalPool](https://gitlab.com/Orange-OpenSource/kanod/baremetalpool)
in Kanod is similar to ReplicaSets for pods. This concept is also used in
[this proposal](https://github.com/metal3-io/metal3-docs/pull/268) for a
Metal3Host resource. The number of replicas must be synchronized with the
requirements of the cluster. It may be updated by a
[separate controller](https://gitlab.com/Orange-OpenSource/kanod/kanod-poolscaler)
checking the requirements of machine deployments and control-planes.

The main security risk is that when a cluster releases a BareMetalHost, it may
keep the credentials that provide full control over the server.
This can be resolved if those credentials are temporary. In Kanod, BareMetalPools
obtain new servers from a REST API implemented by a
[BareMetalHost broker](https://gitlab.com/Orange-OpenSource/kanod/brokerdef).
The broker implementation uses either the fact that Redfish is an HTTP API,
to implement a proxy, or the capability of Redfish to create new users with a
Redfish `operator` role, to implement BareMetalHost resources with a limited
lifespan.

A pool is implemented as an API that is protected by a set of credentials that
identify the user.

The advantages of this approach are:

* Support for pivot operation, even for tenant clusters, as it provides a
  complete bare-metal-as-a-service solution.
* Cluster administrators have full access to the BMC and can configure servers
  according to their needs using custom procedures that are not exposed by
  standard Metal3 controllers.
* Network isolation can be established before the BareMetalHost is created in
  the scope of the cluster. There is no transfer of servers from one network
  configuration to another, which could invalidate parts of the introspection.

The disadvantages of the BareMetalPool approach are:

* The implementation of the broker with its dedicated server is quite complex.
* To have full dynamism over the pool of servers, a new type of autoscaler is
  needed.
* Unnecessary inspections of servers are performed when they are transferred
  from one cluster (tenant) to another.
* The current implementation of the proxy is limited to the Redfish protocol
  and would require significant work for IPMI.
#### HostClaims as a Right to Consume BareMetalHosts

This approach combines the concept of a remote endpoint from BareMetalPools with
the API-oriented approach of HostClaims, as described above.

In this variation, the HostClaim is an object in the BareMetalHost
namespace, defined with an API endpoint to drive the associated BareMetalHost.
The associated credentials are known to the Metal3 machine controller
because they are associated with the cluster. The Metal3 machine controller
will use this endpoint and the credentials to create the HostClaim. The
HostClaim controller will associate the HostClaim with a BareMetalHost.
Control actions and information about the status of the BareMetalHost will
be exchanged with the Metal3 machine controller through the endpoint.

The main advantage of this approach is that BareMetalHosts do not need to be on
the same cluster.

The main drawbacks are:

* It only addresses the multi-tenancy scenario. The hybrid scenario is not
  solved, and the usage of HostClaim outside Cluster API is not addressed
  either. The main reason is that there is no counterpart of the
  BareMetalHost in the namespace of the tenant.
* The end user will have a very limited direct view of the compute resources
  they are using, even when the BareMetalHosts are on the same cluster.

Extending HostClaims with a remote variant fulfills the same requirements
but keeps an explicit object in the namespace of the cluster definition
representing the API offered by this approach.

A remote HostClaim is a HostClaim with kind set to `remote` and at
least two arguments:

* One points to a URL and a set of credentials to access the endpoint on a
  remote cluster.
* The second is the kind of the copied HostClaim created on the remote
  cluster.

The two HostClaims are synchronized: the specification of the source HostClaim
is copied to the remote one (except the kind part). The status of the target
HostClaim is copied back to the source. Most of the metadata is copied
from the target to the source. The exact implementation of this extension is
beyond the scope of this proposal.
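For illustration only, a remote HostClaim could look like the following sketch;
every field name under `spec` besides `kind` is hypothetical and only mirrors
the two arguments described above.

```yaml
apiVersion: metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: worker-1
  namespace: cluster-a
spec:
  kind: remote
  # Hypothetical fields: endpoint and credentials of the remote cluster
  # hosting the BareMetalHosts.
  endpoint:
    url: https://infra.example.com/hostclaims
    credentialsName: infra-endpoint-credentials
  # Kind of the HostClaim that is created on the remote cluster.
  remoteKind: baremetal
```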
### Hybrid Clusters Without HostClaim

#### Control-Planes as a Service

The Kubernetes control-plane can be considered as an application with a
single endpoint. Some Cluster API control-plane providers implement a factory
for new control-planes directly, without relying on the infrastructure
provider. Usually, this control-plane is hosted in the management cluster
as a regular Kubernetes application. [Kamaji](https://kamaji.clastix.io/) and
[k0smotron](https://docs.k0smotron.io/stable/) implement this approach.

The cluster is hybrid because the control-plane pods are not hosted on standard
nodes, but the workers are usually all implemented by a single infrastructure
provider and are homogeneous.

The approach solves the problem of sharing resources for control-planes but
does not address the creation of clusters with distinct needs for workers.
Only one kind of worker is supported.
#### Many Infrastructure Cluster Resources per Cluster

It is possible to coerce Cluster API to create mixed clusters using the
fact that the different components are only loosely coupled. The approach is
presented in a [blog post](https://metal3.io/blog/2022/07/08/One_cluster_multiple_providers.html).

The goal is to define a cluster over technologies I_1, ..., I_n where I_1 is
the technology used for the control-plane.
One Cluster object is defined, but an infrastructure cluster I_iCluster
is defined for each technology I_i (for example a Metal3Cluster for Metal3).
These infrastructure cluster objects use the same control-plane endpoint. The
Cluster object references the I_1Cluster object as `infrastructureRef`.
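A sketch of this layout with two providers is shown below. The Cluster and
Metal3Cluster kinds and fields are real Cluster API and Metal3 resources;
`OtherCluster` stands for a hypothetical second infrastructure provider, and
names and addresses are illustrative. Only the Metal3Cluster (I_1) is referenced
by the Cluster object; the second infrastructure cluster simply advertises the
same control-plane endpoint.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: hybrid
spec:
  # I_1: the infrastructure provider backing the control-plane nodes.
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: Metal3Cluster
    name: hybrid-metal3
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3Cluster
metadata:
  name: hybrid-metal3
spec:
  controlPlaneEndpoint:
    host: 192.168.10.10
    port: 6443
---
# I_2: hypothetical second infrastructure cluster, not referenced by the
# Cluster object, but sharing the same control-plane endpoint.
apiVersion: infrastructure.example.org/v1beta1
kind: OtherCluster
metadata:
  name: hybrid-other
spec:
  controlPlaneEndpoint:
    host: 192.168.10.10
    port: 6443
```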
With some limited assumptions over the providers, the approach works even if
the cluster is unaware of technologies I_2, ..., I_n, and it requires no
modification to Cluster API.

There is no standardization in the definition of machine deployments across
different technologies. For example, Metal3 is the sole infrastructure
provider that employs DataTemplates to capture parameters specific to a
given node.

But the main issue is that many existing providers are opinionated about
networking. Unfortunately, mixing infrastructure providers requires custom
configuration to interconnect the different deployments. A framework that does
not handle networking is a better base for building working hybrid clusters.
#### Bring Your Own Hosts

Bring Your Own Host (BYOH) is a Cluster API provider that uses existing compute
resources running a specific agent used for registering the resource and
deploying Kubernetes on the server.

BYOH does not impose many constraints on the compute resource, but it must be
launched beforehand and it must know how to access the management cluster.
A solution is to implement, for each kind of targeted compute
resource, a concept of pool that launches an image with the agent activated and
the bootstrap credentials. An example for BareMetalHost could be the notion
of BareMetalPools presented above.

An autoscaler can be designed to keep the size of the pools synchronized with
the needs of the cluster (size of the machine deployments and control-plane with
additional machines for updates).

The main drawbacks of the approach are:

* The approach requires many new resources and controllers. Keeping all of them
  synchronized is complex. BareMetalPools are already a complex approach for
  BareMetalHost multi-tenancy.
* Performing updates or pivot with BYOH is not easy. The way agents are stopped
  or change their target cluster requires modifications of the BYOH controllers.

A prototype of this approach has been done in the Kanod project.

## References