CAPI should delete kubernetes node when InfraMachine with providerID but Machine without providerID #7412

Open
haijianyang opened this issue Oct 17, 2022 · 4 comments · May be fixed by #11985
Labels

  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@haijianyang
Contributor

haijianyang commented Oct 17, 2022

What steps did you take and what happened:
cluster-api-provider-elf (CAPE) is an infrastructure provider for cluster-api (CAPI).

While a CAPI Machine that did not yet have a providerID was being deleted, CAPE set the providerID on the ElfMachine at the same time (the Kubernetes worker node was up). CAPI then removed the CAPI Machine and the ElfMachine directly and did not delete the associated Kubernetes node, because the CAPI Machine had never synced the ElfMachine's providerID.

// cluster-api/internal/controllers/machine/machine_controller.go
func (r *Reconciler) reconcileDelete(ctx context.Context, cluster *clusterv1.Cluster, m *clusterv1.Machine) (ctrl.Result, error) {
	err := r.isDeleteNodeAllowed(ctx, cluster, m)
	isDeleteNodeAllowed := err == nil
	if err != nil {
		switch err {
		case errNoControlPlaneNodes, errLastControlPlaneNode, errNilNodeRef, errClusterIsBeingDeleted, errControlPlaneIsBeingDeleted:
			var nodeName = ""
			if m.Status.NodeRef != nil {
				nodeName = m.Status.NodeRef.Name
			}
			log.Info("Deleting Kubernetes Node associated with Machine is not allowed", "Node", klog.KRef("", nodeName), "cause", err.Error())
		default:
			return ctrl.Result{}, errors.Wrapf(err, "failed to check if Kubernetes Node deletion is allowed")
		}
	}

...

        //  This code will not be executed due to errNilNodeRef
	if isDeleteNodeAllowed {
		log.Info("Deleting node", "Node", klog.KRef("", m.Status.NodeRef.Name))

		var deleteNodeErr error
		waitErr := wait.PollImmediate(2*time.Second, r.nodeDeletionRetryTimeout, func() (bool, error) {
			if deleteNodeErr = r.deleteNode(ctx, cluster, m.Status.NodeRef.Name); deleteNodeErr != nil && !apierrors.IsNotFound(errors.Cause(deleteNodeErr)) {
				return false, nil
			}
			return true, nil
		})
	}
}

What did you expect to happen:

CAPI should delete the Kubernetes node when the ElfMachine has a providerID but the CAPI Machine does not.
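One possible direction, shown as a minimal sketch below (this is not existing CAPI code; the helper name deleteNodeByInfraProviderID and the mgmtClient/wlClient parameters are made up for illustration): when a Machine being deleted has no NodeRef/providerID, fall back to reading spec.providerID from the referenced InfraMachine and delete the workload-cluster Node that carries the same providerID.

// Hypothetical sketch, not part of cluster-api.
package nodecleanup

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteNodeByInfraProviderID handles the case where the Machine is deleted
// before it could copy the providerID from its InfraMachine: it reads
// spec.providerID from the referenced InfraMachine and deletes the
// workload-cluster Node whose spec.providerID matches. mgmtClient reads from
// the management cluster, wlClient from the workload cluster (names are
// illustrative).
func deleteNodeByInfraProviderID(ctx context.Context, mgmtClient, wlClient client.Client, machine *clusterv1.Machine) error {
	// Fetch the InfraMachine (e.g. ElfMachine) as unstructured, since the
	// Machine controller does not know its concrete type.
	infra := &unstructured.Unstructured{}
	infra.SetGroupVersionKind(schema.FromAPIVersionAndKind(
		machine.Spec.InfrastructureRef.APIVersion,
		machine.Spec.InfrastructureRef.Kind,
	))
	key := client.ObjectKey{Namespace: machine.Namespace, Name: machine.Spec.InfrastructureRef.Name}
	if err := mgmtClient.Get(ctx, key, infra); err != nil {
		// If the InfraMachine is already gone there is nothing left to read.
		return client.IgnoreNotFound(err)
	}

	providerID, found, err := unstructured.NestedString(infra.Object, "spec", "providerID")
	if err != nil || !found || providerID == "" {
		// No providerID on the InfraMachine either: the node never registered.
		return err
	}

	// Delete the workload-cluster Node whose spec.providerID matches.
	nodes := &corev1.NodeList{}
	if err := wlClient.List(ctx, nodes); err != nil {
		return err
	}
	for i := range nodes.Items {
		if nodes.Items[i].Spec.ProviderID == providerID {
			if err := wlClient.Delete(ctx, &nodes.Items[i]); err != nil && !apierrors.IsNotFound(err) {
				return err
			}
			break
		}
	}
	return nil
}

This keeps the fallback limited to the deletion path and does not require the Machine controller to know the concrete InfraMachine type.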

Logs:

# CAPE logs
I1014 03:21:11.555773       1 elfmachine_controller.go:639] cape-controller-manager/elfmachine-controller "msg"="Set node providerID success" "elfCluster"="mycluster" "elfMachine"="mycluster-worker1-49wkt" "namespace"="default" "cluster"="mycluster" "node"="mycluster-worker1-49wkt" "providerID"="elf://165d2fb5-2b7a-477c-a752-c777581738c5"
I1014 03:21:11.591151       1 elfmachine_controller.go:306] cape-controller-manager/elfmachine-controller "msg"="Reconciling ElfMachine delete" "elfCluster"="mycluster" "elfMachine"="mycluster-worker1-49wkt" "namespace"="default" 
E1014 03:21:39.672186       1 elfmachine_controller.go:209] cape-controller-manager/elfmachine-controller "msg"="patch failed" "error"="elfmachines.infrastructure.cluster.x-k8s.io \"mycluster-worker1-49wkt\" not found" "elfCluster"="mycluster" "namespace"="default" "elfMachine"="infrastructure.cluster.x-k8s.io/v1beta1, Kind=ElfMachine default/mycluster-worker1-49wkt"
E1014 03:21:39.672277       1 controller.go:326]  "msg"="Reconciler error" "error"="elfmachines.infrastructure.cluster.x-k8s.io \"mycluster-worker1-49wkt\" not found" "controller"="elfmachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="ElfMachine" "elfMachine"={"name":"mycluster-worker1-49wkt","namespace":"default"} "name"="mycluster-worker1-49wkt" "namespace"="default" "reconcileID"="8a21bd13-f4ce-4fab-a893-649d6d672020" 

# CAPI logs
I1014 03:20:48.780918       1 machine_controller_noderef.go:49] "Cannot reconcile Machine's Node, no valid ProviderID yet" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" name="mycluster-worker1-5cbdd99959-d4jnw" reconcileID=b97e47a2-58ae-41cb-8f5c-a1bf135af532 machine="mycluster-worker1-5cbdd99959-d4jnw" namespace="default" cluster="mycluster"
I1014 03:21:11.478341       1 machineset_controller.go:460] "Deleted machine" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" machineSet="default/mycluster-worker1-5cbdd99959" namespace="default" name="mycluster-worker1-5cbdd99959" reconcileID=8d787ed2-6db0-4c48-be9d-ed35b7243258 machine="mycluster-worker1-5cbdd99959-d4jnw"
I1014 03:21:11.479102       1 machine_controller.go:296] "Deleting Kubernetes Node associated with Machine is not allowed" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" machine="default/mycluster-worker1-5cbdd99959-d4jnw" namespace="default" name="mycluster-worker1-5cbdd99959-d4jnw" reconcileID=052c63c1-8fe9-4444-b204-fd608cc5a2a7 cluster="mycluster" node="nil" cause="noderef is nil"
E1014 03:21:39.755615       1 controller.go:326] "Reconciler error" err="machines.cluster.x-k8s.io \"mycluster-worker1-5cbdd99959-d4jnw\" not found" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" machine="default/mycluster-worker1-5cbdd99959-d4jnw" namespace="default" name="mycluster-worker1-5cbdd99959-d4jnw" reconcileID=6ceb8fa1-94ca-4cba-923c-35164d71e8d6

The node mycluster-worker1-49wkt should have been deleted, but it is still present:

[root@mycluster-control-plane-p2v2g ~]# kubectl get nodes
NAME                             STATUS     ROLES           AGE     VERSION
mycluster-control-plane-24twx    Ready      control-plane   3d19h   v1.24.0
mycluster-control-plane-8hzgz    Ready      control-plane   3d19h   v1.24.0
mycluster-control-plane-fjqnz    Ready      control-plane   3d18h   v1.24.0
mycluster-worker1-49wkt          NotReady   <none>          3d3h    v1.24.0
mycluster-worker1-hn5vq          Ready      <none>          3d3h    v1.24.0

Environment:

  • Cluster-api version: v1.2.2
  • minikube/kind version: v0.14.0
  • Kubernetes version: v1.24.0
  • OS: CentOS7

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 17, 2022
@fabriziopandini
Member

/triage accepted
This is an interesting use case. We can explore how to cover this condition (machine deleted before picking up the provider ID) so we can ensure proper node cleanup, but I would limit it to this specific scenario and not duplicate logic between reconcile and reconcileDelete.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 17, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 16, 2023
@jessehu
Contributor

jessehu commented Aug 25, 2023

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 25, 2023
@fabriziopandini
Member

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Apr 12, 2024
@fabriziopandini fabriziopandini added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label May 3, 2024
@daimaxiaxie daimaxiaxie linked a pull request Mar 18, 2025 that will close this issue