
OTA-1427: Reconcile all nodes via a special event #1164

Merged

Conversation

hongkailiu
Member

@hongkailiu hongkailiu commented Feb 26, 2025

This PR addresses the remaining comments from #1144.

The 1st commit improves the error handling in reconcileAllNodes() (see [1] for details).

Prior to this commit, the MC/MCP event would be re-queued if reconcileAllNodes()
hit an error. However, syncMachineConfig() (or syncMachineConfigPool(), respectively)
is stateful, i.e., the result relies on the content of the caches, which might have
changed since the original event.

With this commit, a special event sits between an MC/MCP event and reconcileAllNodes().
An error from the latter re-queues the special event, which effectively triggers
another run of reconcileAllNodes().
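
To illustrate the idea, here is a minimal, self-contained Go sketch of the pattern; the identifiers (reconcileAllNodesKey, onMachineConfigEvent, sliceQueue) are made up for the example and are not the exact names used in this PR:

```go
package main

import "fmt"

// reconcileAllNodesKey stands in for the special/synthetic queue key.
const reconcileAllNodesKey = "ReconcileAllNodes"

// queue abstracts the controller work queue for this sketch.
type queue interface{ Add(key string) }

// onMachineConfigEvent shows what an MC/MCP informer handler does under this
// scheme: it enqueues the synthetic key instead of an MC/MCP-specific key.
func onMachineConfigEvent(q queue) {
    q.Add(reconcileAllNodesKey)
}

// sync handles the synthetic key. Returning the error re-queues only the
// synthetic key, so the retry re-reads the caches as they are now instead of
// replaying the original (possibly stale) MC/MCP event.
func sync(key string, reconcileAllNodes func() error) error {
    if key == reconcileAllNodesKey {
        return reconcileAllNodes()
    }
    return fmt.Errorf("unexpected key %q", key)
}

// sliceQueue is a tiny in-memory queue so the sketch runs end to end.
type sliceQueue struct{ keys []string }

func (s *sliceQueue) Add(key string) { s.keys = append(s.keys, key) }

func main() {
    q := &sliceQueue{}
    onMachineConfigEvent(q)
    for _, k := range q.keys {
        _ = sync(k, func() error { fmt.Println("reconciling all nodes"); return nil })
    }
}
```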

The 2nd commit improves the code, either for better readability or by using better APIs from the library;
sync.Map, with its functions for atomic operations on a map, is an example of the latter.
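
As a small, self-contained illustration of the sync.Map point (the cache below is hypothetical and not one of the PR's actual types):

```go
package main

import (
    "fmt"
    "sync"
)

func main() {
    // Hypothetical cache mapping node name -> machine config pool name.
    var poolByNode sync.Map

    // LoadOrStore atomically inserts the value only if the key is absent.
    if existing, loaded := poolByNode.LoadOrStore("master-0", "master"); loaded {
        fmt.Println("already cached:", existing)
    }
    poolByNode.Store("worker-1", "worker")

    // Range iterates the entries without any explicit locking.
    poolByNode.Range(func(node, pool any) bool {
        fmt.Printf("%v belongs to machine config pool %v\n", node, pool)
        return true
    })
}
```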

The 3rd commit moves sync.Once to the nodeInformerController level.
As a result, each nodeInformerController instance executes the
function once, which makes more sense than doing it once globally,
because what we do there is initialize the caches that are
associated with the instance.

For the core code, it is not a big deal since we have only ONE
instance of the controller. However, it matters for unit tests where
there are many controllers.
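
A minimal sketch of the idea, with illustrative field and cache names rather than the PR's actual ones:

```go
package main

import (
    "fmt"
    "sync"
)

// nodeInformerController is trimmed to the parts relevant to this sketch.
type nodeInformerController struct {
    initCachesOnce sync.Once         // per-instance, not a package-level once
    poolByNode     map[string]string // hypothetical instance-owned cache
}

func (c *nodeInformerController) ensureCachesInitialized() {
    c.initCachesOnce.Do(func() {
        // The initialization targets this instance's caches, so a global
        // once would wrongly skip it for every controller after the first,
        // which is exactly what unit tests with many controllers would hit.
        c.poolByNode = map[string]string{}
        fmt.Println("caches initialized")
    })
}

func main() {
    a, b := &nodeInformerController{}, &nodeInformerController{}
    a.ensureCachesInitialized() // runs the init for a
    a.ensureCachesInitialized() // no-op for a
    b.ensureCachesInitialized() // runs the init for b as well
}
```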

[1]. #1144 (comment)

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 26, 2025
@openshift-ci-robot
Contributor

openshift-ci-robot commented Feb 26, 2025

@hongkailiu: This pull request references OTA-1427 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

This PR addresses the remaining comments from #1144.

Besides the refactoring to improve readability in the first commit, it improves the error
handling in reconcileAllNodes() (see [1] for details) in the 2nd commit.

Prior to this commit, the MC/MCP event would be re-queued if reconcileAllNodes()
hit an error. However, syncMachineConfig() (or syncMachineConfigPool(), respectively)
is stateful, i.e., the result relies on the content of the caches, which might have
changed since the original event.

With this commit, a special event sits between an MC/MCP event and reconcileAllNodes().
An error from the latter re-queues the special event, which effectively triggers
another run of reconcileAllNodes().

[1]. #1144 (comment)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 26, 2025
@openshift-ci openshift-ci bot requested review from DavidHurta and wking February 26, 2025 19:55
@hongkailiu hongkailiu force-pushed the OTA-1427-comments-part-1 branch 3 times, most recently from ac49027 to 1a90c1a Compare February 27, 2025 17:14
@hongkailiu
Member Author

/retest-required

@hongkailiu
Member Author

/test e2e-agnostic-operator

@hongkailiu hongkailiu force-pushed the OTA-1427-comments-part-1 branch 3 times, most recently from 098d778 to 8a4d74a Compare March 3, 2025 18:58
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 3, 2025

@hongkailiu: This pull request references OTA-1427 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

This PR addresses the remaining comments from #1144.

The 1st commit improves the error handling in reconcileAllNodes() (see [1] for details).

Prior to this commit, the MC/MCP event would be re-queued if reconcileAllNodes()
hit an error. However, syncMachineConfig() (or syncMachineConfigPool(), respectively)
is stateful, i.e., the result relies on the content of the caches, which might have
changed since the original event.

With this commit, a special event sits between an MC/MCP event and reconcileAllNodes().
An error from the latter re-queues the special event, which effectively triggers
another run of reconcileAllNodes().

The 2nd commit improves the code, either for better readability or by using better APIs from the library;
sync.Map, with its functions for atomic operations on a map, is an example of the latter.

The 3rd commit moves sync.Once to the nodeInformerController level.
As a result, each nodeInformerController instance executes the
function once, which makes more sense than doing it once globally,
because what we do there is initialize the caches that are
associated with the instance.

For the core code, it is not a big deal since we have only ONE
instance of the controller. However, it matters for unit tests where
there are many controllers.

[1]. #1144 (comment)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@hongkailiu hongkailiu force-pushed the OTA-1427-comments-part-1 branch 2 times, most recently from 3206463 to 4294762 Compare March 3, 2025 19:26
This is to improve the error handling on `reconcileAllNodes()`.
See [1] for details.

Prior to this commit, the MC/MCP event would be re-queued if `reconcileAllNodes()`
hit an error. However, `syncMachineConfig()` (or `syncMachineConfigPool()`, respectively)
is stateful, i.e., the result relies on the content of the caches, which might have
changed since the original event.

With this commit, a special event sits between an MC/MCP event and `reconcileAllNodes()`.
An error from the latter re-queues the special event, which effectively triggers
another run of `reconcileAllNodes()`.

[1]. openshift#1144 (comment)
@hongkailiu hongkailiu force-pushed the OTA-1427-comments-part-1 branch 2 times, most recently from 7fd2815 to 46c75c2 Compare March 3, 2025 20:39
As a result, each nodeInformerController instance will execute
the function once, which makes more sense than doing it once
globally, because what we do there is initialization of the
caches that are associated with the instance.

For the core code, it is not a big deal since we have only ONE
instance of the controller. However, it matters for unit tests where
there are many controllers.
@hongkailiu hongkailiu force-pushed the OTA-1427-comments-part-1 branch from 46c75c2 to 998d96c Compare March 3, 2025 20:48
@hongkailiu
Member Author

/retest-required

@hongkailiu
Member Author

/test e2e-agnostic-usc-devpreview
/retest-required

@hongkailiu
Member Author

The build log from the devpreview job looks good to me:

curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1164/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1896695150575357952/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-59cf4d8767-4gskg_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong|Ingested|convert|all nodes'
I0304 00:17:24.787955       1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0304 00:17:24.787955       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0304 00:17:24.787996       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0304 00:17:24.808028       1 base_controller.go:82] Caches are synced for ControlPlaneInformer
I0304 00:17:24.808039       1 base_controller.go:82] Caches are synced for UpdateStatusController
I0304 00:17:24.907131       1 base_controller.go:82] Caches are synced for NodeInformer
I0304 00:17:24.907307       1 nodeinformer.go:157] Ingested 2 machineConfigPools in the cache
I0304 00:17:24.907353       1 nodeinformer.go:170] Ingested 19 machineConfig versions in the cache
I0304 00:17:24.907367       1 nodeinformer.go:111] NI :: Syncing with key Node/ci-op-y6m1h54b-8c750-9585h-master-0
I0304 00:17:24.907381       1 nodeinformer.go:335] Node ci-op-y6m1h54b-8c750-9585h-master-0 belongs to machine config pool master
...

@hongkailiu
Member Author

/test okd-scos-e2e-aws-ovn

@hongkailiu hongkailiu changed the title [wip]OTA-1427: Reconcile all nodes via a special event OTA-1427: Reconcile all nodes via a special event Mar 4, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 4, 2025
@petr-muller
Member

/cc

@openshift-ci openshift-ci bot requested a review from petr-muller March 4, 2025 14:01
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 4, 2025

@hongkailiu: This pull request references OTA-1427 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

This PR addresses the remaining comments from #1144.

The 1st commit improves the error handling in reconcileAllNodes() (see [1] for details).

Prior to this commit, the MC/MCP event would be re-queued if reconcileAllNodes()
hit an error. However, syncMachineConfig() (or syncMachineConfigPool(), respectively)
is stateful, i.e., the result relies on the content of the caches, which might have
changed since the original event.

With this commit, a special event sits between an MC/MCP event and reconcileAllNodes().
An error from the latter re-queues the special event, which effectively triggers
another run of reconcileAllNodes().

The 2nd commit improves the code, either for better readability or by using better APIs from the library;
sync.Map, with its functions for atomic operations on a map, is an example of the latter.

The 3rd commit moves sync.Once to the nodeInformerController level.
As a result, each nodeInformerController instance executes the
function once, which makes more sense than doing it once globally,
because what we do there is initialize the caches that are
associated with the instance.

For the core code, it is not a big deal since we have only ONE
instance of the controller. However, it matters for unit tests where
there are many controllers.

[1]. #1144 (comment)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@hongkailiu
Member Author

/test okd-scos-e2e-aws-ovn

1 similar comment
@hongkailiu
Member Author

/test okd-scos-e2e-aws-ovn

Member

@petr-muller petr-muller left a comment

/hold

This is really nice! I left some comments but I don't think they are that big. Feel free to address them here or in a followup.

@@ -107,6 +107,11 @@ func (c *nodeInformerController) sync(ctx context.Context, syncCtx factory.SyncC

var msg informerMsg
switch t {
case eventKindName:
if name != eventNameReconcileAllNodes {
return fmt.Errorf("invalid name in queue key %s with type %s", queueKey, t)
Member

Note that when controller-level sync() returns an error, it causes the base controller to retry the same key, just rate-limited. Which means you never want to return an error for cases where it does not make sense to retry (which an invalid key is).

The right thing to do for these cases is to log an error and return nil.
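
A rough sketch of that suggestion, with made-up helper names (klog is used only as an example logger):

```go
package main

import "k8s.io/klog/v2"

// handleQueueKey sketches the suggestion: a malformed key cannot be fixed by
// retrying, so log it and return nil; only failures from the reconcile itself
// should surface as an error and therefore be re-queued.
func handleQueueKey(queueKey, expectedKey string, reconcile func() error) error {
    if queueKey != expectedKey {
        klog.Errorf("Dropping invalid queue key %q, expected %q", queueKey, expectedKey)
        return nil
    }
    return reconcile()
}

func main() {
    _ = handleQueueKey("bogus", "ReconcileAllNodes", func() error { return nil })
}
```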

Member Author

Isn't that what we do in the other similar places? E.g.,

return fmt.Errorf("invalid queue key %s with unexpected type %s", queueKey, t)

I can change all of them with another PR if still desired.

Member

yeah we should migrate them all but not necessarily in this PR

@@ -107,89 +117,23 @@ func (c *nodeInformerController) sync(ctx context.Context, syncCtx factory.SyncC

var msg informerMsg
switch t {
case eventKindName:
Member

So now the controller reacts to basically two kinds of triggers:

  1. Syncs triggered for standard kube informer "events" on watched resources (nodes, mcp, mc)
  2. Synthetic triggers (right now we have just one)

I think this should be documented in the sync godoc, and ideally we should have a list of synthetic triggers and what they do (very briefly), even though we only have one now.

Ideally the two cases should be recognizable in the code, too - it would be great if the ...KindName was used only for the "we are reconciling a change on a watched resource" triggers, and we'd use something else (syntheticKeyX?) for the synthetic ones.

Member Author

Done.
Use syncSyntheticKey as the function name.
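
A rough sketch of the resulting shape; apart from syncSyntheticKey, the prefix, helpers, and key format below are illustrative, not the merged code:

```go
package main

import (
    "fmt"
    "strings"
)

const syntheticKeyPrefix = "synthetic/" // illustrative key format

type nodeInformerController struct{}

// syncSyntheticKey handles synthetic triggers such as reconcile-all-nodes.
func (c *nodeInformerController) syncSyntheticKey(key string) error {
    fmt.Println("synthetic trigger:", key)
    return nil
}

// syncWatchedResource handles standard informer events on Node/MC/MCP.
func (c *nodeInformerController) syncWatchedResource(kind, name string) error {
    fmt.Printf("watched resource event: %s/%s\n", kind, name)
    return nil
}

// sync keeps the two trigger families on visibly separate code paths.
func (c *nodeInformerController) sync(queueKey string) error {
    if strings.HasPrefix(queueKey, syntheticKeyPrefix) {
        return c.syncSyntheticKey(strings.TrimPrefix(queueKey, syntheticKeyPrefix))
    }
    kind, name, ok := strings.Cut(queueKey, "/")
    if !ok {
        fmt.Printf("dropping malformed queue key %q\n", queueKey)
        return nil
    }
    return c.syncWatchedResource(kind, name)
}

func main() {
    c := &nodeInformerController{}
    _ = c.sync("Node/example-master-0")
    _ = c.sync(syntheticKeyPrefix + "reconcile-all-nodes")
}
```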

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 10, 2025
@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed lgtm Indicates that a PR is ready to be merged. labels Mar 10, 2025
Member

@petr-muller petr-muller left a comment

Beautiful :shipit:

@petr-muller
Member

/hold cancel
/label no-qe

@openshift-ci openshift-ci bot added no-qe Allows PRs to merge without qe-approved label lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Mar 10, 2025
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 431ea6c and 2 for PR HEAD 3698cea in total

@hongkailiu hongkailiu force-pushed the OTA-1427-comments-part-1 branch from 3698cea to 9b92896 Compare March 10, 2025 17:33
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 10, 2025
@hongkailiu
Member Author

@petr-muller

Sorry for the sloppiness on the unit tests.

Our testing is not happy with AddRateLimited.
I have reverted it to Add.

After rethinking it, rate limiting may not even be desired, because it could cause stale insights.
Or did I misunderstand your intention in the first place?
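
For reference, a hedged illustration of the Add vs. AddRateLimited difference being discussed; the key and queue setup are placeholders, not this PR's code:

```go
package main

import "k8s.io/client-go/util/workqueue"

// requeue shows the trade-off: AddRateLimited retries the item only after the
// queue's backoff, while Add makes the key available again as soon as a worker
// is free, which keeps the reported node insights from going stale.
func requeue(q workqueue.RateLimitingInterface, rateLimited bool) {
    const key = "ReconcileAllNodes" // placeholder synthetic key
    if rateLimited {
        q.AddRateLimited(key)
    } else {
        q.Add(key)
    }
}

func main() {
    q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
    defer q.ShutDown()
    requeue(q, false)
}
```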

@hongkailiu
Member Author

/retest-required

@petr-muller
Member

/override ci/prow/e2e-agnostic-ovn

: [sig-network][Feature:EgressFirewall] when using openshift ovn-kubernetes should ensure egressfirewall is created [Suite:openshift/conformance/parallel] 

Contributor

openshift-ci bot commented Mar 11, 2025

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-ovn

In response to this:

/override ci/prow/e2e-agnostic-ovn

: [sig-network][Feature:EgressFirewall] when using openshift ovn-kubernetes should ensure egressfirewall is created [Suite:openshift/conformance/parallel] 

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

openshift-ci bot commented Mar 11, 2025

@hongkailiu: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 11, 2025
Contributor

openshift-ci bot commented Mar 11, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, petr-muller

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 36d66b0 into openshift:main Mar 11, 2025
15 checks passed
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: cluster-version-operator
This PR has been included in build cluster-version-operator-container-v4.19.0-202503111834.p0.g36d66b0.assembly.stream.el9.
All builds following this will include this PR.

@hongkailiu hongkailiu deleted the OTA-1427-comments-part-1 branch March 12, 2025 13:30