Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

📖 Update autoscaling from zero enhancement proposal with support for platform-aware autoscale from zero #11962

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

aleskandro
Copy link

What this PR does / why we need it:

This PR updates the contract between the cluster-autoscaler Cluster API provider and the infrastructure provider's controllers that reconcile the Infrastructure Machine Template to support platform-aware autoscale from 0 in clusters consisting of nodes heterogeneous in CPU architecture and OS.

With this commit, the infrastructure providers implementing controllers to reconcile the status of their Infrastructure Machine Templates for supporting autoscale from 0 will be able to fill the status.nodeInfo stanza with additional information about the nodes.

The status.nodeInfo stanza has type corev1.NodeSystemInfo to reflect the same content, the rendered nodes' objects would store in their status field.

The cluster-autoscaler can use that information to build the node template labels kubernetes.io/arch and kubernetes.io/os if that information is present.

Suppose the pending pods that trigger the cluster autoscaler have a node selector or a requiredDuringSchedulingIgnoredDuringExecution node affinity concerning the architecture or operating system of the node where they can execute. In that case, the autoscaler will be able to filter the nodes groups options according to the architecture or operating system requested by the pod.

The users could already provide this information to the cluster autoscaler through the labels capacity annotation. However, there is no similar capability to support future labels/taints through information set by the reconcilers of the status of Infrastructure Machine Templates.

Which issue(s) this PR fixes

Related to #11961

/area provider/core

@k8s-ci-robot k8s-ci-robot added area/provider/core Issues or PRs related to the core provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 12, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign enxebre for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Contributor

Welcome @aleskandro!

It looks like this is your first PR to kubernetes-sigs/cluster-api 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

Hi @aleskandro. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 12, 2025
@aleskandro aleskandro force-pushed the update-kep-scale-from-0-multiarch branch from 4fbe927 to 666c952 Compare March 12, 2025 12:31
Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems very straightforward, no objection from me

Copy link
Contributor

@elmiko elmiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this generally makes sense to me, and i don't have any objections.

i would like to get some attention on the API field change though, just to make sure folks agree about the name and placement.

@@ -204,8 +209,16 @@ status:
memory: 500mb
cpu: "1"
nvidia.com/gpu: "1"
nodeInfo:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would just like to call attention to this, at it represents a growth in our API for the infrastructure templates.

@aleskandro
Copy link
Author

Thanks @elmiko. To give more context about the choice of naming and structure for everyone to discuss alternatives.

I couldn't consider the architecture as part of the capacity field because the values of keys in that stanza are expected to be a Quantity. I tried to trace back the reasoning the community did about the capacity field and I noticed that the Node objects' status field has the same field - capacity - with the same content we set in the Infrastructure Machine Template objects.

The Node objects' status field also has a nodeInfo field with the same name and type that I'm now proposing for the Infrastructure Machine Template objects.

Among the alternatives I was evaluating, this seemed the most reasonable to me.

@elmiko
Copy link
Contributor

elmiko commented Mar 14, 2025

makes sense to me @aleskandro , no objection from me.

@@ -175,6 +179,7 @@ const (
// DockerMachineTemplateStatus defines the observed state of a DockerMachineTemplate
type DockerMachineTemplateStatus struct {
Capacity corev1.ResourceList `json:"capacity,omitempty"`
NodeInfo corev1.NodeSystemInfo `json:"nodeInfo,omitempty"`
Copy link
Member

@chrischdi chrischdi Mar 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does embedding corev1.NodeSystemInfo make sense? We may only need a subset of it, xref:

https://github.com/kubernetes/api/blob/master/core/v1/types.go#L6165-L6190

// NodeSystemInfo is a set of ids/uuids to uniquely identify the node.
type NodeSystemInfo struct {
	// MachineID reported by the node. For unique machine identification
	// in the cluster this field is preferred. Learn more from man(5)
	// machine-id: http://man7.org/linux/man-pages/man5/machine-id.5.html
	MachineID string `json:"machineID" protobuf:"bytes,1,opt,name=machineID"`
	// SystemUUID reported by the node. For unique machine identification
	// MachineID is preferred. This field is specific to Red Hat hosts
	// https://access.redhat.com/documentation/en-us/red_hat_subscription_management/1/html/rhsm/uuid
	SystemUUID string `json:"systemUUID" protobuf:"bytes,2,opt,name=systemUUID"`
	// Boot ID reported by the node.
	BootID string `json:"bootID" protobuf:"bytes,3,opt,name=bootID"`
	// Kernel Version reported by the node from 'uname -r' (e.g. 3.16.0-0.bpo.4-amd64).
	KernelVersion string `json:"kernelVersion" protobuf:"bytes,4,opt,name=kernelVersion"`
	// OS Image reported by the node from /etc/os-release (e.g. Debian GNU/Linux 7 (wheezy)).
	OSImage string `json:"osImage" protobuf:"bytes,5,opt,name=osImage"`
	// ContainerRuntime Version reported by the node through runtime remote API (e.g. containerd://1.4.2).
	ContainerRuntimeVersion string `json:"containerRuntimeVersion" protobuf:"bytes,6,opt,name=containerRuntimeVersion"`
	// Kubelet Version reported by the node.
	KubeletVersion string `json:"kubeletVersion" protobuf:"bytes,7,opt,name=kubeletVersion"`
	// Deprecated: KubeProxy Version reported by the node.
	KubeProxyVersion string `json:"kubeProxyVersion" protobuf:"bytes,8,opt,name=kubeProxyVersion"`
	// The Operating System reported by the node
	OperatingSystem string `json:"operatingSystem" protobuf:"bytes,9,opt,name=operatingSystem"`
	// The Architecture reported by the node
	Architecture string `json:"architecture" protobuf:"bytes,10,opt,name=architecture"`
}

The fields in corev1.NodeSystemInfo don't specify omitEmpty, so it would get very verbous already when only requiring e.g. a single value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: at least should be a pointer when using the struct directly.

Copy link
Author

@aleskandro aleskandro Mar 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chrischdi, you're right, thanks. I'll change it to a pointer.

As for using a different struct, I am open to both the solutions. The advantage of using the NodeSystemInfo struct is to bind it with the pod status field using the same key name and type. On the other side, yes, we can have a dedicated struct, and only keep the subset of fields that the cluster API project needs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is if I'm right, whenever e.g. only setting architecture, kubectl get -o yaml or -o json will show:

status:
  nodeInfo:
    machineID: ""
    systemUUID: ""
    bootID: ""
    kernelVersion: ""
    osImage: ""
    containerRuntimeVersion: ""
    kubeletVersion: ""
    kubeProxyVersion: ""
    operatingSystem: ""
    architecture: "myvalue"

I guess no one will ever set machineID, systemUUID, bootId or kernelVersion. So would be very bad to always have them rendered (which is the case for strings without omitEmpty)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it's important to note that this information will get transformed into the node template for the core autoscaler. i'm not sure how, or which values, are used by the scheduler. that might be worth investigating.

…atform-aware autoscale from zero

This commit updates the contract between the cluster-autoscaler Cluster API provider and the infrastructure provider's controllers that reconcile the Infrastructure Machine Template to support platform-aware autoscale from 0 in clusters consisting of nodes heterogeneous in CPU architecture and OS.

With this commit, the infrastructure providers implementing controllers to reconcile the status of their Infrastructure Machine Templates for supporting autoscale from 0 will be able to fill the status.nodeInfo stanza with additional information about the nodes.

The status.nodeInfo stanza has type corev1.NodeSystemInfo to reflect the same content, the rendered nodes' objects would store in their status field.

The cluster-autoscaler can use that information to build the node template labels `kubernetes.io/arch` and `kubernetes.io/os` if that information is present.

Suppose the pending pods that trigger the cluster autoscaler have a node selector or a requiredDuringSchedulingIgnoredDuringExecution node affinity concerning the architecture or operating system of the node where they can execute. In that case, the autoscaler will be able to filter the nodes groups options according to the architecture or operating system requested by the pod.

The users could already provide this information to the cluster autoscaler through the labels capacity annotation. However, there is no similar capability to support future labels/taints through information set by the reconcilers of the status of Infrastructure Machine Templates.
@aleskandro aleskandro force-pushed the update-kep-scale-from-0-multiarch branch from 666c952 to 106bcbd Compare March 17, 2025 15:12
@chrischdi
Copy link
Member

/hold

Especially for getting consensus on and resolve #11962 (comment)

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/provider/core Issues or PRs related to the core provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants