
OTA-1427: Reconcile all nodes via a special event #1164

Merged

Conversation

hongkailiu
Member

@hongkailiu hongkailiu commented Feb 26, 2025

This PR addresses the remaining comments from #1144.

The 1st commit improves the error handling in reconcileAllNodes() (see [1] for details).

Prior to this commit, the MC/MCP event would be re-queued if reconcileAllNodes()
hit an error. However, syncMachineConfig() (or syncMachineConfigPool(), respectively)
is stateful, i.e., the result relies on the content of the caches, which might have
changed since the original event.

With this commit, a special event sits between an MC/MCP event and reconcileAllNodes().
An error from the latter re-queues the special event, which effectively triggers
another run of reconcileAllNodes().
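
To illustrate the idea, here is a minimal, self-contained Go sketch of the pattern; the identifiers (reconcileAllNodesKey, onMachineConfigEvent, sliceQueue) are made up for the example and are not the exact names used in this PR:

```go
package main

import "fmt"

// reconcileAllNodesKey stands in for the special/synthetic queue key.
const reconcileAllNodesKey = "ReconcileAllNodes"

// queue abstracts the controller work queue for this sketch.
type queue interface{ Add(key string) }

// onMachineConfigEvent shows what an MC/MCP informer handler does under this
// scheme: it enqueues the synthetic key instead of an MC/MCP-specific key.
func onMachineConfigEvent(q queue) {
    q.Add(reconcileAllNodesKey)
}

// sync handles the synthetic key. Returning the error re-queues only the
// synthetic key, so the retry re-reads the caches as they are now instead of
// replaying the original (possibly stale) MC/MCP event.
func sync(key string, reconcileAllNodes func() error) error {
    if key == reconcileAllNodesKey {
        return reconcileAllNodes()
    }
    return fmt.Errorf("unexpected key %q", key)
}

// sliceQueue is a tiny in-memory queue so the sketch runs end to end.
type sliceQueue struct{ keys []string }

func (s *sliceQueue) Add(key string) { s.keys = append(s.keys, key) }

func main() {
    q := &sliceQueue{}
    onMachineConfigEvent(q)
    for _, k := range q.keys {
        _ = sync(k, func() error { fmt.Println("reconciling all nodes"); return nil })
    }
}
```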

The 2nd commit improves the code, either for better readability or by using better APIs from the library;
sync.Map, with its functions for atomic operations on a map, is an example of the latter.
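
As a small, self-contained illustration of the sync.Map point (the cache below is hypothetical and not one of the PR's actual types):

```go
package main

import (
    "fmt"
    "sync"
)

func main() {
    // Hypothetical cache mapping node name -> machine config pool name.
    var poolByNode sync.Map

    // LoadOrStore atomically inserts the value only if the key is absent.
    if existing, loaded := poolByNode.LoadOrStore("master-0", "master"); loaded {
        fmt.Println("already cached:", existing)
    }
    poolByNode.Store("worker-1", "worker")

    // Range iterates the entries without any explicit locking.
    poolByNode.Range(func(node, pool any) bool {
        fmt.Printf("%v belongs to machine config pool %v\n", node, pool)
        return true
    })
}
```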

The 3rd commit moves sync.Once to the nodeInformerController level.
As a result, each nodeInformerController instance executes the
function once, which makes more sense than doing it once globally,
because what we do there is initialize the caches that are
associated with the instance.

For the core code, it is not a big deal since we have only ONE
instance of the controller. However, it matters for unit tests where
there are many controllers.
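
A minimal sketch of the idea, with illustrative field and cache names rather than the PR's actual ones:

```go
package main

import (
    "fmt"
    "sync"
)

// nodeInformerController is trimmed to the parts relevant to this sketch.
type nodeInformerController struct {
    initCachesOnce sync.Once         // per-instance, not a package-level once
    poolByNode     map[string]string // hypothetical instance-owned cache
}

func (c *nodeInformerController) ensureCachesInitialized() {
    c.initCachesOnce.Do(func() {
        // The initialization targets this instance's caches, so a global
        // once would wrongly skip it for every controller after the first,
        // which is exactly what unit tests with many controllers would hit.
        c.poolByNode = map[string]string{}
        fmt.Println("caches initialized")
    })
}

func main() {
    a, b := &nodeInformerController{}, &nodeInformerController{}
    a.ensureCachesInitialized() // runs the init for a
    a.ensureCachesInitialized() // no-op for a
    b.ensureCachesInitialized() // runs the init for b as well
}
```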

[1]. #1144 (comment)

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 26, 2025
@openshift-ci-robot
Contributor

openshift-ci-robot commented Feb 26, 2025

@hongkailiu: This pull request references OTA-1427 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

This PR addresses the remaining comments from #1144.

Besides the refactoring to improve readability in the first commit, it improves the error
handling in reconcileAllNodes() (see [1] for details) in the 2nd commit.

Prior to this commit, the MC/MCP event would be re-queued if reconcileAllNodes()
hit an error. However, syncMachineConfig() (or syncMachineConfigPool(), respectively)
is stateful, i.e., the result relies on the content of the caches, which might have
changed since the original event.

With this commit, a special event sits between an MC/MCP event and reconcileAllNodes().
An error from the latter re-queues the special event, which effectively triggers
another run of reconcileAllNodes().

[1]. #1144 (comment)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 26, 2025
@openshift-ci openshift-ci bot requested review from DavidHurta and wking February 26, 2025 19:55
@hongkailiu hongkailiu force-pushed the OTA-1427-comments-part-1 branch 3 times, most recently from ac49027 to 1a90c1a Compare February 27, 2025 17:14
@hongkailiu
Member Author

/retest-required

@hongkailiu
Member Author

/test e2e-agnostic-operator

@hongkailiu hongkailiu force-pushed the OTA-1427-comments-part-1 branch 3 times, most recently from 098d778 to 8a4d74a Compare March 3, 2025 18:58
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 3, 2025

@hongkailiu: This pull request references OTA-1427 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

This PR addresses the remaining comments from #1144.

The 1st commit improves the error handling in reconcileAllNodes() (see [1] for details).

Prior to this commit, the MC/MCP event would be re-queued if reconcileAllNodes()
hit an error. However, syncMachineConfig() (or syncMachineConfigPool(), respectively)
is stateful, i.e., the result relies on the content of the caches, which might have
changed since the original event.

With this commit, a special event sits between an MC/MCP event and reconcileAllNodes().
An error from the latter re-queues the special event, which effectively triggers
another run of reconcileAllNodes().

The 2nd commit improves the code, either for better readability or by using better APIs from the library;
sync.Map, with its functions for atomic operations on a map, is an example of the latter.

The 3rd commit moves sync.Once to the nodeInformerController level.
As a result, each nodeInformerController instance executes the
function once, which makes more sense than doing it once globally,
because what we do there is initialize the caches that are
associated with the instance.

For the core code, it is not a big deal since we have only ONE
instance of the controller. However, it matters for unit tests where
there are many controllers.

[1]. #1144 (comment)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@hongkailiu hongkailiu force-pushed the OTA-1427-comments-part-1 branch 2 times, most recently from 3206463 to 4294762 Compare March 3, 2025 19:26
This is to improve the error handling on `reconcileAllNodes()`.
See [1] for details.

Prior to this commit, the MC/MCP event would be re-queued if `reconcileAllNodes()`
hit an error. However, `syncMachineConfig()` (or `syncMachineConfigPool()`, respectively)
is stateful, i.e., the result relies on the content of the caches, which might have
changed since the original event.

With this commit, a special event sits between an MC/MCP event and `reconcileAllNodes()`.
An error from the latter re-queues the special event, which effectively triggers
another run of `reconcileAllNodes()`.

[1]. openshift#1144 (comment)
@hongkailiu hongkailiu force-pushed the OTA-1427-comments-part-1 branch 2 times, most recently from 7fd2815 to 46c75c2 Compare March 3, 2025 20:39
As a result, each nodeInformerController instance will execute
the function once, which makes more sense than doing it once
globally, because what we do there is initialization of the
caches that are associated with the instance.

For the core code, it is not a big deal since we have only ONE
instance of the controller. However, it matters for unit tests where
there are many controllers.
@hongkailiu hongkailiu force-pushed the OTA-1427-comments-part-1 branch from 46c75c2 to 998d96c Compare March 3, 2025 20:48
@hongkailiu
Member Author

/retest-required

@hongkailiu
Member Author

/test e2e-agnostic-usc-devpreview
/retest-required

@hongkailiu
Member Author

The build log from the devpreview job looks good to me:

curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1164/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1896695150575357952/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-59cf4d8767-4gskg_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong|Ingested|convert|all nodes'
I0304 00:17:24.787955       1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0304 00:17:24.787955       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0304 00:17:24.787996       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0304 00:17:24.808028       1 base_controller.go:82] Caches are synced for ControlPlaneInformer
I0304 00:17:24.808039       1 base_controller.go:82] Caches are synced for UpdateStatusController
I0304 00:17:24.907131       1 base_controller.go:82] Caches are synced for NodeInformer
I0304 00:17:24.907307       1 nodeinformer.go:157] Ingested 2 machineConfigPools in the cache
I0304 00:17:24.907353       1 nodeinformer.go:170] Ingested 19 machineConfig versions in the cache
I0304 00:17:24.907367       1 nodeinformer.go:111] NI :: Syncing with key Node/ci-op-y6m1h54b-8c750-9585h-master-0
I0304 00:17:24.907381       1 nodeinformer.go:335] Node ci-op-y6m1h54b-8c750-9585h-master-0 belongs to machine config pool master
...

@hongkailiu
Member Author

/test okd-scos-e2e-aws-ovn

@hongkailiu hongkailiu changed the title [wip]OTA-1427: Reconcile all nodes via a special event OTA-1427: Reconcile all nodes via a special event Mar 4, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 4, 2025
@petr-muller
Member

/cc

@openshift-ci openshift-ci bot requested a review from petr-muller March 4, 2025 14:01
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 4, 2025

@hongkailiu: This pull request references OTA-1427 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

This PR addresses the remaining comments from #1144.

The 1st commit improves the error handling in reconcileAllNodes() (see [1] for details).

Prior to this commit, the MC/MCP event would be re-queued if reconcileAllNodes()
hit an error. However, syncMachineConfig() (or syncMachineConfigPool(), respectively)
is stateful, i.e., the result relies on the content of the caches, which might have
changed since the original event.

With this commit, a special event sits between an MC/MCP event and reconcileAllNodes().
An error from the latter re-queues the special event, which effectively triggers
another run of reconcileAllNodes().

The 2nd commit improves the code, either for better readability or by using better APIs from the library;
sync.Map, with its functions for atomic operations on a map, is an example of the latter.

The 3rd commit moves sync.Once to the nodeInformerController level.
As a result, each nodeInformerController instance executes the
function once, which makes more sense than doing it once globally,
because what we do there is initialize the caches that are
associated with the instance.

For the core code, it is not a big deal since we have only ONE
instance of the controller. However, it matters for unit tests where
there are many controllers.

[1]. #1144 (comment)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@hongkailiu
Member Author

/test okd-scos-e2e-aws-ovn

1 similar comment
@hongkailiu
Member Author

/test okd-scos-e2e-aws-ovn

Member

@petr-muller petr-muller left a comment

/hold

This is really nice! I left some comments but I don't think they are that big. Feel free to address them here or in a followup.

@@ -107,6 +107,11 @@ func (c *nodeInformerController) sync(ctx context.Context, syncCtx factory.SyncC

var msg informerMsg
switch t {
case eventKindName:
if name != eventNameReconcileAllNodes {
return fmt.Errorf("invalid name in queue key %s with type %s", queueKey, t)
Member

Note that when controller-level sync() returns an error, it causes the base controller to retry the same key, just rate-limited. Which means you never want to return an error for cases where it does not make sense to retry (which an invalid key is).

The right thing to do for these cases is to log an error and return nil.
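
A rough sketch of that suggestion, with made-up helper names (klog is used only as an example logger):

```go
package main

import "k8s.io/klog/v2"

// handleQueueKey sketches the suggestion: a malformed key cannot be fixed by
// retrying, so log it and return nil; only failures from the reconcile itself
// should surface as an error and therefore be re-queued.
func handleQueueKey(queueKey, expectedKey string, reconcile func() error) error {
    if queueKey != expectedKey {
        klog.Errorf("Dropping invalid queue key %q, expected %q", queueKey, expectedKey)
        return nil
    }
    return reconcile()
}

func main() {
    _ = handleQueueKey("bogus", "ReconcileAllNodes", func() error { return nil })
}
```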

Member Author

Isn't that what we do in the other similar places? E.g.,

return fmt.Errorf("invalid queue key %s with unexpected type %s", queueKey, t)

I can change all of them with another PR if still desired.

Member

yeah we should migrate them all but not necessarily in this PR

@@ -107,89 +117,23 @@ func (c *nodeInformerController) sync(ctx context.Context, syncCtx factory.SyncC

var msg informerMsg
switch t {
case eventKindName:
Member

So now the controller reacts to basically two kinds of triggers:

  1. Syncs triggered for standard kube informer "events" on watched resources (nodes, mcp, mc)
  2. Synthetic triggers (right now we have just one)

I think this should be documented in the sync godoc, and ideally we should have a list of synthetic triggers and what they do (very briefly), even though we only have one now.

Ideally the two cases should be recognizable in the code, too - it would be great if the ...KindName was used only for the "we are reconciling a change on a watched resource" triggers, and we'd use something else (syntheticKeyX?) for the synthetic ones.

Member Author

Done.
Use syncSyntheticKey as the function name.
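
A rough sketch of the resulting shape; apart from syncSyntheticKey, the prefix, helpers, and key format below are illustrative, not the merged code:

```go
package main

import (
    "fmt"
    "strings"
)

const syntheticKeyPrefix = "synthetic/" // illustrative key format

type nodeInformerController struct{}

// syncSyntheticKey handles synthetic triggers such as reconcile-all-nodes.
func (c *nodeInformerController) syncSyntheticKey(key string) error {
    fmt.Println("synthetic trigger:", key)
    return nil
}

// syncWatchedResource handles standard informer events on Node/MC/MCP.
func (c *nodeInformerController) syncWatchedResource(kind, name string) error {
    fmt.Printf("watched resource event: %s/%s\n", kind, name)
    return nil
}

// sync keeps the two trigger families on visibly separate code paths.
func (c *nodeInformerController) sync(queueKey string) error {
    if strings.HasPrefix(queueKey, syntheticKeyPrefix) {
        return c.syncSyntheticKey(strings.TrimPrefix(queueKey, syntheticKeyPrefix))
    }
    kind, name, ok := strings.Cut(queueKey, "/")
    if !ok {
        fmt.Printf("dropping malformed queue key %q\n", queueKey)
        return nil
    }
    return c.syncWatchedResource(kind, name)
}

func main() {
    c := &nodeInformerController{}
    _ = c.sync("Node/example-master-0")
    _ = c.sync(syntheticKeyPrefix + "reconcile-all-nodes")
}
```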

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 10, 2025
@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed lgtm Indicates that a PR is ready to be merged. labels Mar 10, 2025
Member

@petr-muller petr-muller left a comment

Beautiful :shipit:

@petr-muller
Member

/hold cancel
/label no-qe

@openshift-ci openshift-ci bot added no-qe Allows PRs to merge without qe-approved label lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Mar 10, 2025
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 431ea6c and 2 for PR HEAD 3698cea in total

@hongkailiu hongkailiu force-pushed the OTA-1427-comments-part-1 branch from 3698cea to 9b92896 Compare March 10, 2025 17:33
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 10, 2025
@hongkailiu
Member Author

@petr-muller

Sorry for the sloppiness on the unit tests.

Our testing is not happy with AddRateLimited.
I have reverted it to Add.

After rethinking it, rate limiting may not even be desired, because it could cause stale insights.
Or did I misunderstand your intention in the first place?
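
For reference, a hedged illustration of the Add vs. AddRateLimited difference being discussed; the key and queue setup are placeholders, not this PR's code:

```go
package main

import "k8s.io/client-go/util/workqueue"

// requeue shows the trade-off: AddRateLimited retries the item only after the
// queue's backoff, while Add makes the key available again as soon as a worker
// is free, which keeps the reported node insights from going stale.
func requeue(q workqueue.RateLimitingInterface, rateLimited bool) {
    const key = "ReconcileAllNodes" // placeholder synthetic key
    if rateLimited {
        q.AddRateLimited(key)
    } else {
        q.Add(key)
    }
}

func main() {
    q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
    defer q.ShutDown()
    requeue(q, false)
}
```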

@hongkailiu
Member Author

/retest-required

@petr-muller
Member

/override ci/prow/e2e-agnostic-ovn

: [sig-network][Feature:EgressFirewall] when using openshift ovn-kubernetes should ensure egressfirewall is created [Suite:openshift/conformance/parallel] 

Contributor

openshift-ci bot commented Mar 11, 2025

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-ovn

In response to this:

/override ci/prow/e2e-agnostic-ovn

: [sig-network][Feature:EgressFirewall] when using openshift ovn-kubernetes should ensure egressfirewall is created [Suite:openshift/conformance/parallel] 

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

openshift-ci bot commented Mar 11, 2025

@hongkailiu: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 11, 2025
Contributor

openshift-ci bot commented Mar 11, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, petr-muller

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 36d66b0 into openshift:main Mar 11, 2025
15 checks passed
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: cluster-version-operator
This PR has been included in build cluster-version-operator-container-v4.19.0-202503111834.p0.g36d66b0.assembly.stream.el9.
All builds following this will include this PR.

@hongkailiu hongkailiu deleted the OTA-1427-comments-part-1 branch March 12, 2025 13:30