Commit a5b0869

Alternative implementations of HostClaims and risks

* security risks that must be addressed
* alternatives for the different use cases (multi-tenancy or hybrid).

Co-authored-by: Pierre Crégut <[email protected]>
Co-authored-by: Laurent Roussarie <[email protected]>
Signed-off-by: Pierre Crégut <[email protected]>
1 parent dd1bb33 commit a5b0869

2 files changed: +292 −5

.cspell-config.json (+3)
@@ -36,6 +36,7 @@
"BMHs",
"BMO",
"BR",
+"BYOH",
"CABPK",
"CAPBM",
"CAPI",
@@ -98,7 +99,9 @@
"iso",
"Jern",
"js",
+"k0smotron",
"k8s",
+"Kamaji",
"Kashif",
"keepalived",
"Kind",

design/hostclaim-multitenancy-and-hybrid-clusters.md (+289 −5)
@@ -5,6 +5,7 @@
http://creativecommons.org/licenses/by/3.0/legalcode
-->

+<!-- cSpell:ignore Sylva Schiff Kanod argocd GitOps -->
# HostClaim: multi-tenancy and hybrid clusters

## Status
@@ -106,12 +107,17 @@ implementation details of the compute resource.
of such a framework will be addressed in another design document.
* Pivoting client clusters resources (managed clusters that are not the
  initial cluster).
+* Using BareMetalHosts defined in other clusters. The HostClaim concept
+  supports this use case with some extensions. The design is outlined in the
+  alternative approach section but is beyond the scope of this document.

## Proposal

### User Stories

-#### As a user I would like to execute a workload on an arbitrary server
+#### Deployment of Simple Workloads
+
+As a user I would like to execute a workload on an arbitrary server.

The OS image is available in qcow format on a remote server at ``url_image``.
It supports cloud-init and a script can launch the workload at boot time
@@ -202,7 +208,9 @@ value depends on the characteristics of the computer.
* When I destroy the host, the association is broken and another user can take
  over the server.

-#### As an infrastructure administrator I would like to host several isolated clusters
+#### Multi-tenancy
+
+As an infrastructure administrator I would like to host several isolated clusters.

All the servers in the data-center are registered as BareMetalHost in one or
several namespaces under the control of the infrastructure manager. Namespaces
@@ -229,7 +237,9 @@ and are destroyed unless they are tagged for node reuse. The BareMetalHosts are
recycled and are bound to new HostClaims, potentially belonging to other
clusters.

-#### As a cluster administrator I would like to build a cluster with different kind of nodes
+#### Hybrid Clusters
+
+As a cluster administrator I would like to build a cluster with different kinds of nodes.

This scenario assumes that:
@@ -277,7 +287,9 @@ Controllers for disposable resources such as virtual machine typically do not
use hostSelectors. Controllers for a "bare-metal as a service" service
may use selectors.

-#### As a cluster administrator I would like to install a new baremetal cluster from a transient cluster
+#### Manager Cluster Bootstrap
+
+As a cluster administrator I would like to install a new baremetal cluster from a transient cluster.

The bootstrap process can be performed as usual from an ephemeral cluster
(e.g., a KinD cluster). The constraint that all resources must be in the same
@@ -300,4 +312,276 @@ controller is beyond the scope of this specification.

## Design Details

-TBD.
### Implementation Details/Notes/Constraints

### Risks and Mitigations

#### Security Impact of Making BareMetalHost Selection Cluster-wide

The main difference between Metal3 machines and HostClaims is the
selection process, where a HostClaim can be bound to a BareMetalHost
in another namespace. We must make sure that this behavior is expected
by the owner of the BareMetalHost resources, especially when the Metal3
Cluster API provider is upgraded to a version supporting HostClaim.

The solution is to enforce that a BareMetalHost that can be bound to a
HostClaim carries a label (proposed name: ``hosts.metal3.io/namespaces``)
restricting the authorized HostClaims to specific namespaces. The value could
be either ``*`` for no constraint, or a comma-separated list of namespace
names.
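
A minimal sketch of what this could look like on a BareMetalHost (the label
name is the one proposed above; all other values are placeholders, and the
BareMetalHost fields shown are the usual Bare Metal Operator ones):

```yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: server-01
  namespace: infra
  labels:
    # Only HostClaims created in the "tenant-a" namespace may be bound to
    # this host. The proposal also allows "*" (no constraint) or a
    # comma-separated list of namespaces.
    hosts.metal3.io/namespaces: tenant-a
spec:
  online: true
  bootMACAddress: "00:11:22:33:44:55"
  bmc:
    address: redfish://bmc.example.com/redfish/v1/Systems/1
    credentialsName: server-01-bmc-secret
```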

#### Tenants Trying to Bypass the Selection Mechanism

The fact that a HostClaim is bound to a specific BareMetalHost will appear
as a label in the HostClaim, and the HostClaim controller will use it to find
the associated BareMetalHost. This label could be modified by a malicious
tenant.

But the BareMetalHost also has a consumer reference. The label is only an
indication of the binding. If the consumer reference is invalid (different
from the HostClaim label), the label MUST be erased and the HostClaim
controller MUST NOT accept the binding.
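
To make the check concrete, here is a sketch of a consistent binding (the
HostClaim kind and its binding label name are illustrative assumptions of
this proposal; ``spec.consumerRef`` is the existing BareMetalHost field that
remains the source of truth):

```yaml
apiVersion: metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: worker-0
  namespace: tenant-a
  labels:
    # Hint only: points to the selected BareMetalHost.
    hosts.metal3.io/baremetalhost: infra.server-01
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: server-01
  namespace: infra
spec:
  # Authoritative back-reference; if it does not match the label above,
  # the label is erased and the binding is rejected.
  consumerRef:
    apiVersion: metal3.io/v1alpha1
    kind: HostClaim
    name: worker-0
    namespace: tenant-a
```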

#### Performance Impact

The proposal introduces a new resource with an associated controller between
the Metal3Machine and the BareMetalHost. There will be some duplication
of information between the BareMetalHost and the HostClaim status. The impact
for each node should still be limited, especially when compared to the cost of
each Ironic action.

Because we plan to have several controllers for different kinds of compute
resources, one can expect a few controllers working on the same custom
resource. This may create additional pressure on the Kubernetes API server. It
is possible to limit the amount of information exchanged with a specific
controller by using server-side filters on the watch/list calls. To use this
feature on current Kubernetes versions, the HostClaim kind field must be
copied into a label.
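
For illustration, a HostClaim could mirror its kind into a label (the label
name and the ``spec.kind`` field shown here are assumptions of this sketch),
so that a controller dedicated to one compute technology can restrict its
watch with a plain label selector instead of receiving every HostClaim:

```yaml
apiVersion: metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: worker-0
  namespace: tenant-a
  labels:
    # Mirrors spec.kind so that list/watch requests can be filtered
    # server-side with a label selector such as
    # ``hostclaims.metal3.io/kind=baremetal``.
    hostclaims.metal3.io/kind: baremetal
spec:
  kind: baremetal
```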

#### Impact on Other Cluster API Components

There should be none: other components should mostly rely on Machine and
Cluster objects. Some tools may look at Metal3Machine conditions, where some
condition names may be modified, but the semantics of the Ready condition
will be preserved.

### Work Items

### Dependencies

### Test Plan

### Upgrade / Downgrade Strategy

### Version Skew Strategy

## Drawbacks

## Alternatives

### Multi-Tenancy Without HostClaim

We assume that we have a Kubernetes cluster managing a set of clusters for
cluster administrators (referred to as tenants in the following). Multi-tenancy
is a way to ensure that tenants only have control over their own clusters.

There are at least two other ways of implementing multi-tenancy without
HostClaim. These methods proxy either the entire definition of the cluster
or the BareMetalHost itself.

#### Isolation Through Overlays

A solution for multi-tenancy is to hide all cluster resources from the end
user. In this approach, clusters and BareMetalHosts are defined within a single
namespace, but the cluster creation process ensures that resources
from different clusters do not overlap.

This approach was explored in the initial versions of the Kanod project.
Clusters must be described by the tenant in a git repository and the
descriptions are imported by a GitOps framework (argocd). The definitions are
processed by an argocd plugin that translates the YAML expressing the user's
intent into Kubernetes resources, and the naming of resources created by
this plugin ensures isolation.

Instead of using a translation plugin, it would be better to use a set of
custom resources. However, it is important to ensure that they are defined in
separate namespaces.

This approach has several drawbacks:

* The plugin or the controllers for the abstract clusters are complex
  applications if we want to support many options, and they become part of
  the trusted computing base of the cluster manager.
* It introduces a new level of access control that is distinct from the
  Kubernetes model. If we want tooling or observability around the created
  resources, we would need custom tools that adhere to this new policy, or we
  would need to reflect everything we want to observe in the new custom
  resources.
* This approach does not solve the problem of hybrid clusters.

#### Ephemeral BareMetalHost

Another solution is to have separate namespaces for each cluster but to
import BareMetalHosts into those namespaces on demand when new compute
resources are needed.

The cluster requires a resource that acts as a source of BareMetalHosts, which
can be parameterized by server requirements and the number of replicas. The
concept of
[BareMetalPool](https://gitlab.com/Orange-OpenSource/kanod/baremetalpool)
in Kanod is similar to ReplicaSets for pods. This concept is also used in
[this proposal](https://github.com/metal3-io/metal3-docs/pull/268) for a
Metal3Host resource. The number of replicas must be synchronized with the
requirements of the cluster. It may be updated by a
[separate controller](https://gitlab.com/Orange-OpenSource/kanod/kanod-poolscaler)
checking the requirements of machine deployments and control-planes.
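
A purely hypothetical sketch of what such a pool resource could look like
(this is not the actual Kanod BareMetalPool schema; the group, kind and
fields are invented here for illustration only):

```yaml
apiVersion: pools.example.org/v1alpha1
kind: BareMetalPool
metadata:
  name: worker-pool
  namespace: tenant-a
spec:
  # Desired number of BareMetalHosts materialized in this namespace.
  replicas: 3
  # Requirements forwarded to the broker that owns the real servers.
  requirements:
    cpuCount: 16
    memoryGiB: 64
  # Credentials identifying the tenant to the broker endpoint.
  brokerSecretRef:
    name: broker-credentials
```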

The main security risk is that when a cluster releases a BareMetalHost, it may
keep the credentials that provide full control over the server.
This can be resolved if those credentials are temporary. In Kanod, a
BareMetalPool obtains new servers from a REST API implemented by a
[BareMetalHost broker](https://gitlab.com/Orange-OpenSource/kanod/brokerdef).
The broker implementation uses either the fact that Redfish is an HTTP API to
implement a proxy, or the capability of Redfish to create new users with an
``operator`` role, in order to implement BareMetalHost resources with a
limited lifespan.

A pool is implemented as an API that is protected by a set of credentials that
identify the user.

The advantages of this approach are:

* Support for the pivot operation, even for tenant clusters, as it provides a
  complete bare-metal-as-a-service solution.
* Cluster administrators have full access to the BMC and can configure servers
  according to their needs using custom procedures that are not exposed by
  standard Metal3 controllers.
* Network isolation can be established before the BareMetalHost is created in
  the scope of the cluster. There is no transfer of servers from one network
  configuration to another, which could invalidate parts of the introspection.

The disadvantages of the BareMetalPool approach are:

* The implementation of the broker with its dedicated server is quite complex.
* To have full dynamism over the pool of servers, a new type of autoscaler is
  needed.
* Unnecessary inspections of servers are performed when they are transferred
  from one cluster (tenant) to another.
* The current implementation of the proxy is limited to the Redfish protocol
  and would require significant work for IPMI.

#### HostClaims as a Right to Consume BareMetalHosts

This approach combines the remote endpoint concept of BareMetalPools with
the API-oriented approach of HostClaims described above.

In this variation, the HostClaim will be an object in the BareMetalHost
namespace, defined with an API endpoint to drive the associated BareMetalHost.
The associated credentials are known to the Metal3Machine controller
because they are associated with the cluster. The Metal3 machine controller
will use this endpoint and the credentials to create the HostClaim. The
HostClaim controller will associate the HostClaim with a BareMetalHost.
Control actions and information about the status of the BareMetalHost will
be exchanged with the Metal3 machine controller through the endpoint.

The main advantage of the approach is that BareMetalHosts do not need to be on
the same cluster.

The main drawbacks are:

* It only addresses the multi-tenancy scenario. The hybrid scenario is not
  solved, and the usage of HostClaims outside Cluster API is not addressed
  either. The main reason is that there is no counterpart of the
  BareMetalHost in the namespace of the tenant.
* The end user has a very limited direct view of the compute resources they
  are using, even when the BareMetalHosts are on the same cluster.

Extending HostClaims with a remote variant fulfills the same requirements
but keeps an explicit object in the namespace of the cluster definition
representing the API offered by this approach.

A remote HostClaim is a HostClaim with its kind set to ``remote`` and at
least two arguments:

* one points to a URL and a set of credentials to access the endpoint on a
  remote cluster,
* the second is the kind of the copied HostClaim created on the remote
  cluster.

The two HostClaims are synchronized: the specification of the source HostClaim
is copied to the remote one (except the kind part). The status of the target
HostClaim is copied back to the source. Most of the metadata is copied
from the target to the source. The exact implementation of this extension is
beyond the scope of this proposal.
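
A sketch of what such a remote HostClaim could look like (all field names
below are illustrative assumptions, since the exact schema is explicitly left
out of scope):

```yaml
apiVersion: metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: worker-0
  namespace: cluster-a
spec:
  # The local object is only a proxy: the real HostClaim is created remotely.
  kind: remote
  remote:
    # Endpoint of the cluster owning the BareMetalHosts.
    url: https://infra.example.com/hostclaims
    credentialsSecretRef:
      name: infra-endpoint-credentials
    # Kind of the HostClaim created on the remote cluster.
    remoteKind: baremetal
```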

### Hybrid Clusters Without HostClaim

#### Control-Planes as a Service

The Kubernetes control-plane can be considered as an application with a
single endpoint. Some Cluster API control-plane providers implement a factory
for new control-planes directly, without relying on the infrastructure
provider. Usually, this control-plane is hosted in the management cluster
as a regular Kubernetes application. [Kamaji](https://kamaji.clastix.io/) and
[k0smotron](https://docs.k0smotron.io/stable/) implement this approach.
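
As an illustration, with such a provider the Cluster object simply points its
``controlPlaneRef`` at the hosted control-plane while the workers come from a
regular infrastructure provider (a sketch only; the exact group/version of the
KamajiControlPlane resource may differ):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: hybrid-cp
  namespace: tenant-a
spec:
  controlPlaneRef:
    # Control-plane pods hosted in the management cluster.
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha1
    kind: KamajiControlPlane
    name: hybrid-cp
  infrastructureRef:
    # Workers still come from a single infrastructure provider.
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: Metal3Cluster
    name: hybrid-cp
```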

The cluster is hybrid because the control-plane pods are not hosted on
standard nodes, but the workers are usually all implemented by a single
infrastructure provider and are homogeneous.

The approach solves the problem of sharing resources for control-planes but
does not address the creation of clusters with distinct needs for workers.
Only one kind of worker is supported.

#### Many Infrastructure Cluster Resources per Cluster

It is possible to coerce Cluster API into creating mixed clusters using the
fact that the different components are only loosely coupled. The approach is
presented in a [blog post](https://metal3.io/blog/2022/07/08/One_cluster_multiple_providers.html).

The goal is to define a cluster over technologies I_1, ..., I_n where I_1 is
the technology used for the control-plane.
One Cluster object is defined, but an infrastructure cluster I_iCluster
is defined for each technology I_i (for example a Metal3Cluster for Metal3).
These infrastructure cluster objects use the same control-plane endpoint. The
Cluster object references the I_1Cluster object as its ``infrastructureRef``.

With some limited assumptions over the providers, the approach works even if
the Cluster object is unaware of technologies I_2, ..., I_n, and it requires
no modification to Cluster API.
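
A condensed sketch of the pattern (the second infrastructure kind is a
placeholder for whatever other provider is mixed in; only the Metal3 side is
real here):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: mixed
spec:
  # Cluster API only sees the I_1 technology (Metal3 here).
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: Metal3Cluster
    name: mixed-metal3
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3Cluster
metadata:
  name: mixed-metal3
spec:
  controlPlaneEndpoint:
    host: 192.168.10.10
    port: 6443
---
# Second infrastructure cluster (hypothetical provider), not referenced by
# the Cluster object but sharing the same control-plane endpoint.
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: OtherProviderCluster
metadata:
  name: mixed-other
spec:
  controlPlaneEndpoint:
    host: 192.168.10.10
    port: 6443
```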

There is no standardization in the definition of machine deployments across
different technologies. For example, Metal3 is the sole infrastructure
provider that employs DataTemplates to capture parameters specific to a
given node.
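
For reference, this is roughly what such a Metal3-specific template looks
like (heavily abbreviated; see the Metal3DataTemplate documentation for the
full metaData/networkData layout):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3DataTemplate
metadata:
  name: worker-template
  namespace: tenant-a
spec:
  clusterName: mixed
  metaData:
    strings:
      # Static key/value injected into the node metadata.
      - key: environment
        value: production
```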

But the main issue is that many existing providers are opinionated about
networking. Unfortunately, mixing infrastructure providers requires custom
configuration to interconnect the different deployments. A framework that does
not handle networking is a better base for building working hybrid clusters.

#### Bring Your Own Hosts

Bring Your Own Host (BYOH) is a Cluster API provider that uses existing
compute resources running a specific agent, which registers the resource and
deploys Kubernetes on the server.

BYOH does not impose many constraints on the compute resource, but the agent
must be launched beforehand and must know how to access the management
cluster.
A solution is to implement, for each kind of targeted compute resource, a
notion of pool that launches an image with the agent activated and the
bootstrap credentials. An example for BareMetalHost could be the notion of
BareMetalPools presented above.

An autoscaler can be designed to keep the size of the pools synchronized with
the needs of the cluster (the size of the machine deployments and
control-plane, with additional machines for updates).

The main drawbacks of the approach are:

* The approach requires many new resources and controllers. Keeping all of
  them synchronized is complex. BareMetalPools are already a complex approach
  for BareMetalHost multi-tenancy.
* Performing updates or pivots with BYOH is not easy. The way agents are
  stopped or change their target cluster requires modifications of the BYOH
  controllers.

A prototype of this approach has been done in the Kanod project.

## References
