OTA-1427: Reconcile all nodes via a special event #1164
Conversation
@hongkailiu: This pull request references OTA-1427 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Force-pushed from ac49027 to 1a90c1a
/retest-required
/test e2e-agnostic-operator
Force-pushed from 098d778 to 8a4d74a
@hongkailiu: This pull request references OTA-1427 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Force-pushed from 3206463 to 4294762
This is to improve the error handling on `reconcileAllNodes()`. See [1] for details. Prior to this commit, the event on an MC/MCP would be re-queued if `reconcileAllNodes()` hit an error. However, `syncMachineConfig()` (or `syncMachineConfigPool()` respectively) is stateful, i.e., the result relies on the content of the caches, which might have changed since the original event. With this commit, a special event sits between an MC/MCP event and `reconcileAllNodes()`. An error from the latter re-queues the special event, which effectively triggers another run of `reconcileAllNodes()`. [1]. openshift#1144 (comment)
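As an illustration of the pattern this commit message describes (not the actual controller code), here is a minimal self-contained sketch; the queue, the key constant, and the helper names are all assumptions:

```go
package main

import "fmt"

// syntheticKeyReconcileAllNodes is an assumed name for the "special event".
const syntheticKeyReconcileAllNodes = "synthetic/ReconcileAllNodes"

type controller struct {
	queue []string // stand-in for the controller's rate-limited workqueue
}

func (c *controller) enqueue(key string) { c.queue = append(c.queue, key) }

// sync dispatches on the queue key. An MC/MCP event only refreshes the caches
// and enqueues the special event; only the special event can fail, so a retry
// re-runs reconcileAllNodes against the current caches instead of replaying a
// possibly stale MC/MCP event.
func (c *controller) sync(key string) error {
	if key == syntheticKeyReconcileAllNodes {
		return c.reconcileAllNodes()
	}
	c.updateCachesFor(key)
	c.enqueue(syntheticKeyReconcileAllNodes)
	return nil
}

func (c *controller) updateCachesFor(key string) {
	fmt.Println("updating caches for", key)
}

func (c *controller) reconcileAllNodes() error {
	fmt.Println("reconciling all nodes")
	return nil
}

func main() {
	c := &controller{}
	_ = c.sync("MachineConfig/rendered-master") // enqueues the special event
	for len(c.queue) > 0 {
		key := c.queue[0]
		c.queue = c.queue[1:]
		if err := c.sync(key); err != nil {
			c.enqueue(key) // on error, only the special event is re-queued
		}
	}
}
```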
Force-pushed from 7fd2815 to 46c75c2
As a result each nodeInformerController instance will execute the function once, which makes more sense than doing it once globally, because what we do there is initialization of the caches that are associated with the instance. For the core code it is not a big deal, since we have only ONE instance of the controller. However, it matters for unit tests, where there are many controllers.
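To make "moving the `sync.Once` to the controller level" concrete, here is a hedged, self-contained sketch; the field and cache names are invented, not the actual nodeinformer.go code:

```go
package main

import (
	"fmt"
	"sync"
)

type nodeInformerController struct {
	initCachesOnce sync.Once         // per-instance, not package-global
	mcpCache       map[string]string // illustrative cache owned by this instance
}

// ensureCaches runs the one-time initialization for this instance only.
func (c *nodeInformerController) ensureCaches() {
	c.initCachesOnce.Do(func() {
		c.mcpCache = map[string]string{"master": "rendered-master-abc"}
		fmt.Println("caches initialized for this instance")
	})
}

func main() {
	// Two controllers, as in two unit tests: each initializes its own caches.
	a, b := &nodeInformerController{}, &nodeInformerController{}
	a.ensureCaches()
	a.ensureCaches() // no-op the second time
	b.ensureCaches() // still runs, because the Once belongs to b
}
```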
Force-pushed from 46c75c2 to 998d96c
/retest-required
/test e2e-agnostic-usc-devpreview
The build log from the devpreview job looks good to me: curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1164/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1896695150575357952/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-59cf4d8767-4gskg_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong|Ingested|convert|all nodes'
I0304 00:17:24.787955 1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0304 00:17:24.787955 1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0304 00:17:24.787996 1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0304 00:17:24.808028 1 base_controller.go:82] Caches are synced for ControlPlaneInformer
I0304 00:17:24.808039 1 base_controller.go:82] Caches are synced for UpdateStatusController
I0304 00:17:24.907131 1 base_controller.go:82] Caches are synced for NodeInformer
I0304 00:17:24.907307 1 nodeinformer.go:157] Ingested 2 machineConfigPools in the cache
I0304 00:17:24.907353 1 nodeinformer.go:170] Ingested 19 machineConfig versions in the cache
I0304 00:17:24.907367 1 nodeinformer.go:111] NI :: Syncing with key Node/ci-op-y6m1h54b-8c750-9585h-master-0
I0304 00:17:24.907381 1 nodeinformer.go:335] Node ci-op-y6m1h54b-8c750-9585h-master-0 belongs to machine config pool master
...
/test okd-scos-e2e-aws-ovn
/cc
@hongkailiu: This pull request references OTA-1427 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/test okd-scos-e2e-aws-ovn
1 similar comment
/test okd-scos-e2e-aws-ovn
/hold
This is really nice! I left some comments but I don't think they are that big. Feel free to address them here or in a followup.
pkg/updatestatus/nodeinformer.go (outdated)

```
@@ -107,6 +107,11 @@ func (c *nodeInformerController) sync(ctx context.Context, syncCtx factory.SyncC
	var msg informerMsg
	switch t {
	case eventKindName:
		if name != eventNameReconcileAllNodes {
			return fmt.Errorf("invalid name in queue key %s with type %s", queueKey, t)
```
Note that when controller-level `sync()` returns an error, it causes the base controller to retry the same key, just rate-limited. Which means you never want to return an error for cases where it does not make sense to retry (which an invalid key is). The right thing to do for these cases is to log an error and return `nil`.
Isn't that what we do in the other similar places? E.g.,

```go
return fmt.Errorf("invalid queue key %s with unexpected type %s", queueKey, t)
```
I can change all of them with another PR if still desired.
yeah we should migrate them all but not necessarily in this PR
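To make the retry semantics from this thread concrete, here is a minimal self-contained sketch with invented key parsing and helper names: invalid keys are logged and dropped, while transient failures are returned so the base controller retries them, rate-limited.

```go
package main

import (
	"fmt"
	"log"
	"strings"
)

type nodeInformerController struct{}

// parseQueueKey is an invented helper that splits a "Kind/name" queue key.
func parseQueueKey(key string) (kind, name string, ok bool) {
	parts := strings.SplitN(key, "/", 2)
	if len(parts) != 2 {
		return "", "", false
	}
	return parts[0], parts[1], true
}

// sync mirrors the retry semantics discussed above: returning an error makes
// the base controller re-queue the same key (rate-limited), so keys that can
// never succeed are logged and dropped by returning nil.
func (c *nodeInformerController) sync(queueKey string) error {
	kind, name, ok := parseQueueKey(queueKey)
	if !ok {
		// An invalid key stays invalid forever; retrying it is pointless.
		log.Printf("invalid queue key %q, dropping", queueKey)
		return nil
	}
	// A transient failure, by contrast, is worth the rate-limited retry.
	return c.process(kind, name)
}

func (c *nodeInformerController) process(kind, name string) error {
	fmt.Printf("processing %s %s\n", kind, name)
	return nil
}

func main() {
	c := &nodeInformerController{}
	_ = c.sync("garbage-key")   // logged, not retried
	_ = c.sync("Node/worker-0") // processed normally
}
```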
pkg/updatestatus/nodeinformer.go (outdated)

```
@@ -107,89 +117,23 @@ func (c *nodeInformerController) sync(ctx context.Context, syncCtx factory.SyncC
	var msg informerMsg
	switch t {
	case eventKindName:
```
So now the controller reacts to basically two kinds of triggers:

- Syncs triggered for standard kube informer "events" on watched resources (nodes, mcp, mc)
- Synthetic triggers (right now we have just one)

I think this should be documented in the `sync` godoc, and ideally we should have a list of synthetic triggers and what they do (very briefly), even when we only have just one now. Ideally the two cases should be recognizable in the code, too: it would be great if the `...KindName` was used only for the "we are reconciling a change on a watched resource" triggers, and we'd use something else (`syntheticKeyX`?) for the synthetic ones.
Done. Used `syncSyntheticKey` as the function name.
Beautiful
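For illustration, here is a sketch of how watched-resource keys and synthetic triggers could be kept recognizably separate, in the spirit of the thread above; every name is an invented example, not the merged code:

```go
package main

import (
	"fmt"
	"strings"
)

// Watched-resource keys keep their Kind/name shape, while synthetic triggers
// live in their own key namespace so the two cases are recognizable at a
// glance. The constants here are invented for the example.
const (
	syntheticKeyPrefix            = "synthetic:"
	syntheticKeyReconcileAllNodes = syntheticKeyPrefix + "ReconcileAllNodes"
)

// sync reacts to two kinds of triggers:
//   - standard informer events on watched resources (Node/MCP/MC keys)
//   - synthetic triggers, currently only ReconcileAllNodes
func sync(queueKey string) error {
	if strings.HasPrefix(queueKey, syntheticKeyPrefix) {
		return syncSyntheticKey(queueKey)
	}
	return syncWatchedResource(queueKey)
}

func syncSyntheticKey(key string) error {
	switch key {
	case syntheticKeyReconcileAllNodes:
		fmt.Println("reconciling all nodes")
	default:
		// Unknown synthetic keys can never succeed, so drop rather than retry.
		fmt.Printf("unknown synthetic key %q, dropping\n", key)
	}
	return nil
}

func syncWatchedResource(key string) error {
	fmt.Println("syncing watched resource", key)
	return nil
}

func main() {
	_ = sync("Node/worker-0")
	_ = sync(syntheticKeyReconcileAllNodes)
}
```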
/hold cancel
Force-pushed from 3698cea to 9b92896
Sorry for the sloppiness on the unit tests. Our testing is not happy with it. After rethinking it, the rate-limited retry may not even be desired, because it could cause stale insights.
/retest-required
/override ci/prow/e2e-agnostic-ovn
@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-ovn In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@hongkailiu: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: hongkailiu, petr-muller. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
[ART PR BUILD NOTIFIER] Distgit: cluster-version-operator
This PR addresses the remaining comments from #1144.

In the 1st commit it improves the error handling on `reconcileAllNodes()` (see [1] for details). Prior to this commit, the event on an MC/MCP would be re-queued if `reconcileAllNodes()` hit an error. However, `syncMachineConfig()` (or `syncMachineConfigPool()` respectively) is stateful, i.e., the result relies on the content of the caches, which might have changed since the original event. With this commit, a special event sits between an MC/MCP event and `reconcileAllNodes()`. An error from the latter re-queues the special event, which effectively triggers another run of `reconcileAllNodes()`.
The 2nd commit improves the code, either for better readability or to use better APIs from the library; `sync.Map`, with its functions for atomic operations on a map, is an example of the latter case.
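For example, here is a tiny sketch of the atomic operations `sync.Map` provides (the cache keys and values are invented for the example; `CompareAndSwap` requires Go 1.20+):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var mcVersions sync.Map // machine config name -> version, safe for concurrent use

	// LoadOrStore atomically stores a value only if the key is absent,
	// replacing the lock-check-insert dance a plain map would need.
	if v, loaded := mcVersions.LoadOrStore("rendered-master", "4.19.0-abc"); loaded {
		fmt.Println("already cached:", v)
	}

	// CompareAndSwap atomically replaces the value only if it still matches
	// the old one, so concurrent updaters cannot clobber each other.
	swapped := mcVersions.CompareAndSwap("rendered-master", "4.19.0-abc", "4.19.0-def")
	fmt.Println("swapped:", swapped)
}
```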
In the 3rd commit, it moves the `sync.Once` into the nodeInformerController level. As a result each nodeInformerController instance will execute the function once, which makes more sense than doing it once globally, because what we do there is initialization of the caches that are associated with the instance. For the core code it is not a big deal, since we have only ONE instance of the controller. However, it matters for unit tests, where there are many controllers.
[1]. #1144 (comment)