🐛 Early bind ProviderID #11985
base: main
Conversation
This PR is currently missing an area label, which is used to identify the modified component when generating release notes. Area labels can be added by org members by writing a /area comment; please see the labels list for possible areas. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Welcome @daimaxiaxie!
Hi @daimaxiaxie. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
For understanding the use case better: I'd be interested to see the InfraMachine's status. What is blocking in the InfraMachine's conditions so that it's not Ready even though the providerID is set? The reasoning is that the proposed change might be a change in the documented behaviour of the controller, which is also part of the contract. Note: the proposed change would not lead to a deletion of the node object in all cases.
I'm also curious to better understand the use case. Note, if the issue is on delete, an option to consider is to improve the node cleanup logic in cluster-api/internal/controllers/machine/machine_controller.go (lines 724 to 745 at 9a9b00a).
Caveat: CAPI will only ever do node cleanup at best effort, because CPI should take care of this.
After setting the providerID, the bootstrap process has already started (the node joins the cluster). The InfraMachine requires additional actions and must wait for the bootstrap process to complete. Only after these steps are completed will the InfraMachine become Ready. The providerID represents a machine, while Ready signifies the machine's operational readiness. Logically, in the physical sequence of events, the machine must first exist before it can transition to a ready state.
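To make the timing described above concrete: once the bootstrap has started and the kubelet has registered, the workload cluster already contains a Node whose spec.providerID matches the providerID reported by the InfraMachine, even though the InfraMachine is not yet Ready. The following is a minimal, illustrative Go sketch of looking such a Node up by providerID with a controller-runtime client; it is not the controller's actual code (which uses its own getNode helper), and the package and function names are hypothetical.

package example

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// nodeByProviderID returns the workload-cluster Node whose spec.providerID
// matches the given providerID, assuming the kubelet has already registered it.
// Illustrative sketch only; the name is hypothetical.
func nodeByProviderID(ctx context.Context, c client.Client, providerID string) (*corev1.Node, error) {
	nodes := &corev1.NodeList{}
	if err := c.List(ctx, nodes); err != nil {
		return nil, err
	}
	for i := range nodes.Items {
		if nodes.Items[i].Spec.ProviderID == providerID {
			return &nodes.Items[i], nil
		}
	}
	return nil, fmt.Errorf("no Node found for providerID %q", providerID)
}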
What is CPI? |
CPI stands for the cloud provider interface, formerly called cloud controller manager.
As of today, we use infra.status.ready as a signal that CAPI can start consuming info from the infra machine. TBH I don't think we should make an exception to this rule for specific fields; it might be confusing. What I can accept is to improve the best-effort attempt to retrieve the node during deletion (the code I have linked above) by checking the infra machine too, but I would like to hear opinions from other maintainers about this too.
Early binding ProviderID allows for more efficient node retrieval. |
I agree with fabrizio's proposal. I think it would be a great improvement to try to get the provider id during deletion from the infraMachine as a fallback. It could look like the following:
diff --git a/internal/controllers/machine/machine_controller.go b/internal/controllers/machine/machine_controller.go
index fbe00eb7a..8bbe25429 100644
--- a/internal/controllers/machine/machine_controller.go
+++ b/internal/controllers/machine/machine_controller.go
@@ -435,7 +435,7 @@ func (r *Reconciler) reconcileDelete(ctx context.Context, s *scope) (ctrl.Result
s.deletingReason = clusterv1.MachineDeletingV1Beta2Reason
s.deletingMessage = "Deletion started"
- err := r.isDeleteNodeAllowed(ctx, cluster, m)
+ err := r.isDeleteNodeAllowed(ctx, cluster, m, s.infraMachine)
isDeleteNodeAllowed := err == nil
if err != nil {
switch err {
@@ -713,14 +713,20 @@ func (r *Reconciler) nodeVolumeDetachTimeoutExceeded(machine *clusterv1.Machine)
// isDeleteNodeAllowed returns nil only if the Machine's NodeRef is not nil
// and if the Machine is not the last control plane node in the cluster.
-func (r *Reconciler) isDeleteNodeAllowed(ctx context.Context, cluster *clusterv1.Cluster, machine *clusterv1.Machine) error {
+func (r *Reconciler) isDeleteNodeAllowed(ctx context.Context, cluster *clusterv1.Cluster, machine *clusterv1.Machine, infraMachine *unstructured.Unstructured) error {
log := ctrl.LoggerFrom(ctx)
// Return early if the cluster is being deleted.
if !cluster.DeletionTimestamp.IsZero() {
return errClusterIsBeingDeleted
}
- if machine.Status.NodeRef == nil && machine.Spec.ProviderID != nil {
+ providerID := machine.Spec.ProviderID
+ if machine.Spec.ProviderID == nil && infraMachine != nil {
+ // If we don't have a provider id, try to retrieve it from the inframachine.
+ _ = util.UnstructuredUnmarshalField(infraMachine, &providerID, "spec", "providerID")
+ }
+
+ if machine.Status.NodeRef == nil && providerID != nil {
// If we don't have a node reference, but a provider id has been set,
// try to retrieve the node one more time.
//
@@ -731,7 +737,7 @@ func (r *Reconciler) isDeleteNodeAllowed(ctx context.Context, cluster *clusterv1
if err != nil {
log.Error(err, "Failed to get cluster client while deleting Machine and checking for nodes")
} else {
- node, err := r.getNode(ctx, remoteClient, *machine.Spec.ProviderID)
+ node, err := r.getNode(ctx, remoteClient, *providerID)
if err != nil && err != ErrNodeNotFound {
log.Error(err, "Failed to get node while deleting Machine")
} else if err == nil {
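For illustration, the key new piece of the suggested diff (falling back to the InfraMachine's spec.providerID when the Machine has none) can be sketched in isolation. The snippet below is only a sketch of the idea, not the actual controller change: it uses apimachinery's unstructured.NestedString instead of the util.UnstructuredUnmarshalField call shown in the diff, and the function name is hypothetical.

package example

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// providerIDFromInfraMachine returns spec.providerID from the unstructured
// InfraMachine if it has already been published, even while the InfraMachine
// is not yet Ready. Illustrative sketch; not the actual controller code.
func providerIDFromInfraMachine(infraMachine *unstructured.Unstructured) (string, bool) {
	if infraMachine == nil {
		return "", false
	}
	providerID, found, err := unstructured.NestedString(infraMachine.Object, "spec", "providerID")
	if err != nil || !found || providerID == "" {
		return "", false
	}
	return providerID, true
}

As in the diff, this fallback would only influence the best-effort node lookup during deletion; it would not change when CAPI starts consuming other fields from the InfraMachine.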
Sounds like this PR is trying to redefine how Machine creation works in CAPI? |
Since everyone agrees with this approach, I will proceed with the modification. |
What this PR does / why we need it:
The ProviderID appears before the infrastructure is ready, and the node already exists in the cluster. When deleting a machine, the corresponding node should be cleaned up.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #7412, aws/eks-anywhere#4708, #7237