
🐛 Early bind ProviderID #11985

Open · wants to merge 1 commit into base: main

Conversation

daimaxiaxie

What this PR does / why we need it:

The ProviderID can be set before the infrastructure is ready, at which point the node may already exist in the cluster. When such a machine is deleted, the corresponding node should be cleaned up.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #7412 aws/eks-anywhere#4708 #7237

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 18, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/needs-area PR is missing an area label label Mar 18, 2025
@k8s-ci-robot
Contributor

This PR is currently missing an area label, which is used to identify the modified component when generating release notes.

Area labels can be added by org members by writing /area ${COMPONENT} in a comment

Please see the labels list for possible areas.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign justinsb for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Welcome @daimaxiaxie!

It looks like this is your first PR to kubernetes-sigs/cluster-api 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 18, 2025
@k8s-ci-robot
Contributor

Hi @daimaxiaxie. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chrischdi
Member

chrischdi commented Mar 18, 2025

For understanding the use case better: I'd be interested to see the InfraMachine's status. What is blocking on the InfraMachine's conditions such that it's not Ready but the providerID is set?

The reasoning is that the proposed change might be a change in the documented behaviour of the controller, which is also part of the contract:

https://main.cluster-api.sigs.k8s.io/developer/providers/contracts/infra-machine#inframachine-initialization-completed

Note: The proposed change would not lead to a deletion of the node object in all cases.

@fabriziopandini
Member

I'm also curious to better understand the use case.

Note: if the issue is on delete, an option to consider is to improve the node cleanup logic in:

// If we don't have a node reference, but a provider id has been set,
// try to retrieve the node one more time.
//
// NOTE: The following is a best-effort attempt to retrieve the node,
// errors are logged but not returned to ensure machines are deleted
// even if the node cannot be retrieved.
remoteClient, err := r.ClusterCache.GetClient(ctx, util.ObjectKey(cluster))
if err != nil {
	log.Error(err, "Failed to get cluster client while deleting Machine and checking for nodes")
} else {
	node, err := r.getNode(ctx, remoteClient, *machine.Spec.ProviderID)
	if err != nil && err != ErrNodeNotFound {
		log.Error(err, "Failed to get node while deleting Machine")
	} else if err == nil {
		machine.Status.NodeRef = &corev1.ObjectReference{
			APIVersion: corev1.SchemeGroupVersion.String(),
			Kind:       "Node",
			Name:       node.Name,
			UID:        node.UID,
		}
	}
}

Caveat: CAPI will only ever do this cleanup on a best-effort basis, because CPI should take care of it.

@daimaxiaxie
Author

daimaxiaxie commented Mar 19, 2025

For understanding the use case better: I'd be interested to see the InfraMachine's status. What is blocking on the InfraMachine's conditions such that it's not Ready but the providerID is set?

By the time the providerID is set, the bootstrap process has already started (the node joins the cluster). The InfraMachine requires additional actions and must wait for the bootstrap process to complete. Only after these steps are completed will the InfraMachine transition to a Ready state. Since the entire process exceeded the maximum waiting time, the Autoscaler deleted the machine.

The providerID represents a machine, while 'Ready' signifies the machine's operational readiness. Logically, in the physical sequence of events, the machine must first exist before it can transition to a ready state.
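
For illustration, a minimal sketch of the state described above: a hypothetical InfraMachine (built here as an unstructured object; the AWSMachine kind and the providerID value are made up, only the spec.providerID and status.ready field paths come from the contract) where spec.providerID is already populated while status.ready is still false.

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

func main() {
	// Hypothetical InfraMachine in the state described above: providerID set, not yet Ready.
	infraMachine := &unstructured.Unstructured{Object: map[string]interface{}{}}
	infraMachine.SetAPIVersion("infrastructure.cluster.x-k8s.io/v1beta1")
	infraMachine.SetKind("AWSMachine")

	// The provider has already created the instance and published the providerID...
	_ = unstructured.SetNestedField(infraMachine.Object, "aws:///us-east-1a/i-0123456789abcdef0", "spec", "providerID")
	// ...but bootstrap has not finished yet, so the InfraMachine is not Ready.
	_ = unstructured.SetNestedField(infraMachine.Object, false, "status", "ready")

	providerID, found, _ := unstructured.NestedString(infraMachine.Object, "spec", "providerID")
	ready, _, _ := unstructured.NestedBool(infraMachine.Object, "status", "ready")
	fmt.Printf("providerID=%q found=%v ready=%v\n", providerID, found, ready)
}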

@daimaxiaxie
Author

Caveat: CAPI will only ever do this cleanup on a best-effort basis, because CPI should take care of it.

What is CPI?

@fabriziopandini
Member

What is CPI?

CPI stands for the cloud provider interface, formerly called cloud controller manager.

The InfraMachine requires additional actions and must wait for the bootstrap process to complete. Only after these steps are completed will the InfraMachine transition to a Ready state.

As of today, we use infra.status.ready as a signal that CAPI can start consuming info from the infra machine. TBH I don't think we should make an exception to this rule for specific fields; it might be confusing.

What I can accept is to improve the best-effort attempt to retrieve the node during deletion (the code I have linked above) by checking the infra machine too, but I would like to hear opinions from other maintainers about this as well.

@daimaxiaxie
Author

What I can accept is to improve the best-effort attempt to retrieve the node during deletion (the code I have linked above) by checking the infra machine too, but I would like to hear opinions from other maintainers about this as well.

Early binding ProviderID allows for more efficient node retrieval.

@chrischdi
Member

I agree with fabrizio's proposal. I think it would be a great improvement to try to get the provider id during deletion from the infraMachine as a fallback.

Could look like the following:

diff --git a/internal/controllers/machine/machine_controller.go b/internal/controllers/machine/machine_controller.go
index fbe00eb7a..8bbe25429 100644
--- a/internal/controllers/machine/machine_controller.go
+++ b/internal/controllers/machine/machine_controller.go
@@ -435,7 +435,7 @@ func (r *Reconciler) reconcileDelete(ctx context.Context, s *scope) (ctrl.Result
 	s.deletingReason = clusterv1.MachineDeletingV1Beta2Reason
 	s.deletingMessage = "Deletion started"
 
-	err := r.isDeleteNodeAllowed(ctx, cluster, m)
+	err := r.isDeleteNodeAllowed(ctx, cluster, m, s.infraMachine)
 	isDeleteNodeAllowed := err == nil
 	if err != nil {
 		switch err {
@@ -713,14 +713,20 @@ func (r *Reconciler) nodeVolumeDetachTimeoutExceeded(machine *clusterv1.Machine)
 
 // isDeleteNodeAllowed returns nil only if the Machine's NodeRef is not nil
 // and if the Machine is not the last control plane node in the cluster.
-func (r *Reconciler) isDeleteNodeAllowed(ctx context.Context, cluster *clusterv1.Cluster, machine *clusterv1.Machine) error {
+func (r *Reconciler) isDeleteNodeAllowed(ctx context.Context, cluster *clusterv1.Cluster, machine *clusterv1.Machine, infraMachine *unstructured.Unstructured) error {
 	log := ctrl.LoggerFrom(ctx)
 	// Return early if the cluster is being deleted.
 	if !cluster.DeletionTimestamp.IsZero() {
 		return errClusterIsBeingDeleted
 	}
 
-	if machine.Status.NodeRef == nil && machine.Spec.ProviderID != nil {
+	providerID := machine.Spec.ProviderID
+	if machine.Spec.ProviderID == nil && infraMachine != nil {
+		// If we don't have a provider id, try to retrieve it from the inframachine.
+		_ = util.UnstructuredUnmarshalField(infraMachine, &providerID, "spec", "providerID")
+	}
+
+	if machine.Status.NodeRef == nil && providerID != nil {
 		// If we don't have a node reference, but a provider id has been set,
 		// try to retrieve the node one more time.
 		//
@@ -731,7 +737,7 @@ func (r *Reconciler) isDeleteNodeAllowed(ctx context.Context, cluster *clusterv1
 		if err != nil {
 			log.Error(err, "Failed to get cluster client while deleting Machine and checking for nodes")
 		} else {
-			node, err := r.getNode(ctx, remoteClient, *machine.Spec.ProviderID)
+			node, err := r.getNode(ctx, remoteClient, *providerID)
 			if err != nil && err != ErrNodeNotFound {
 				log.Error(err, "Failed to get node while deleting Machine")
 			} else if err == nil {
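
For illustration, a minimal, self-contained sketch of the fallback the diff proposes: prefer the Machine's spec.providerID and fall back to the InfraMachine's spec.providerID. The helper name resolveProviderID and the sample values are made up, and apimachinery's unstructured.NestedString is used here in place of CAPI's util.UnstructuredUnmarshalField (which the diff uses for the same purpose) to keep the example self-contained.

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// resolveProviderID prefers the Machine's spec.providerID and falls back to the
// InfraMachine's spec.providerID, mirroring the fallback proposed in the diff above.
func resolveProviderID(machineProviderID *string, infraMachine *unstructured.Unstructured) *string {
	if machineProviderID != nil {
		return machineProviderID
	}
	if infraMachine == nil {
		return nil
	}
	id, found, err := unstructured.NestedString(infraMachine.Object, "spec", "providerID")
	if err != nil || !found || id == "" {
		return nil
	}
	return &id
}

func main() {
	// Machine.spec.providerID is not set yet; the InfraMachine already carries one.
	infraMachine := &unstructured.Unstructured{Object: map[string]interface{}{
		"spec": map[string]interface{}{
			"providerID": "aws:///us-east-1a/i-0123456789abcdef0",
		},
	}}
	if id := resolveProviderID(nil, infraMachine); id != nil {
		fmt.Println("resolved providerID:", *id)
	}
}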

@sbueringer
Member

Sounds like this PR is trying to redefine how Machine creation works in CAPI?

@daimaxiaxie
Author

I agree with fabrizio's proposal. I think it would be a great improvement to try to get the provider id during deletion from the infraMachine as a fallback.

Since everyone agrees with this approach, I will proceed with the modification.

Development

Successfully merging this pull request may close these issues.

CAPI should delete kubernetes node when InfraMachine with providerID but Machine without providerID
5 participants