-
Notifications
You must be signed in to change notification settings - Fork 537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OTA-1339: UpdateStatus
: API to expose information about update progress & health
#2233
base: master
Are you sure you want to change the base?
OTA-1339: UpdateStatus
: API to expose information about update progress & health
#2233
Conversation
@petr-muller: This pull request references OTA-1339 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Hello @petr-muller! Some important instructions when contributing to openshift/api: |
// +kubebuilder:resource:path=clusterversionprogressinsights,scope=Cluster | ||
// +openshift:api-approved.openshift.io=https://github.com/openshift/api/pull/2012 | ||
// +openshift:file-pattern=cvoRunLevel=0000_00,operatorName=cluster-version-operator,operatorOrdering=02 | ||
// +openshift:enable:FeatureGate=UpgradeStatus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UpgradeStatus
is a TechPreview featuregate. At this point I think we'd need to have a new, DevPreview
featuregate if this is accepted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xref: #2246
@petr-muller: This pull request references OTA-1339 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
// Progressing is used with Updating=True when the ClusterOperator is updating | ||
ClusterOperatorUpdatingReasonProgressing ClusterOperatorUpdatingReason = "Progressing" | ||
// CannotDetermine is used with Updating=Unknown | ||
ClusterOperatorUpdatingCannotDetermine ClusterOperatorUpdatingReason = "CannotDetermine" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: there doesn't seem to be any enum-style validation on conditions[].reason
, so maybe we can leave these off and go free-form in the resource-creating controller? We'd need Updated
or Pending
for the Updating=False
case to allow clients to distinguish there, but I'm not clear on why clients would need to distinguish by well-known reason for Updating=True
or Updating=Unknown
, especially with us only declaring one reason
string for each of those status
values. But also, obviously not a big deal if we wanted central definitions of some default reasons, in case that helps make resource-creators more consistent in the "there's no more-granular information we want to share" cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be useful to have constants defined (for code to check against), but you're right in that this isn't an enum, so it isn't strictly necessary.
Sometimes the condition types/reasons live alongside the API, sometime they live alongside the controller. I don't have a strong opinion on where they should live
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm more inclined to add validations eventually rather than drop to free-form, because I think we should either be full "reasons are well-known and can matter" or full "reasons are free-form", but not half-way. I'll add the CEL rules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have started adding validations for conditions (to ensure condition presence and that they use the correct reasons etc) but then I realized tightening the API this much at this point makes it harder to experiment in the controller (for example, I'm not sure if I can get away with a single Healthy
condition on CV & MCP insights, or whether I would need to break it down. For now the API targets DevPreview and does not need to be this tight, so for now I only added TODOs everywhere.
// ControlPlaneUpdateVersions contains the original and target versions of the upgrade | ||
type ControlPlaneUpdateVersions struct { | ||
// previous is the version of the control plane before the update. When the cluster is being installed | ||
// for the first time, the version will have a placeholder value '<none>' and carry 'Installation' metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: if the cluster history is:
- Completed vA.
- Partial vB.
- Partial vC.
it's not clear to me from these docs whether the value would be vA or vB. The client-side code suggests vB, in which case maybe wording like:
previous is the version the control plane was attempting to reconcile before the update.
or some such that makes it clear that it's "the previous target, regardless of completion" and not "the most recent completed target".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improved the godoc, hopefully
|
||
// ScopeType is one of ControlPlane or WorkerPool | ||
// +kubebuilder:validation:Enum=ControlPlane;WorkerPool | ||
type ScopeType string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ScopeType
doesn't seemed to be consumed by ClusterVersionProgressInsight
. Maybe move it to types_health_insight.go
to live alongside the consuming structure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also used in MCP and Node insights. Introducing a "common.go" felt like overhead but maybe it's worth it (especially if we keep using ResourceRef
everywhere and not just in Health)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to types_common.go
// +than type safety for producers. | ||
// +required | ||
// +kubebuilder:validation:XValidation:rule="!has(self.group) && self.resource == 'nodes'",message="resource must be a nodes.core.k8s.io resource" | ||
Resource ResourceRef `json:"resource"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: if we have this, we don't need Name
, do we? But I see you have an XValidation
rule on NodeProgressInsightStatus
to enforce matching, so there must be a reason to have both forms? Are we just satisfying both the "By OpenShift API conventions..." from the Godocs with Name
, and then simultaneously "...consistency seems to be more valuable..." with Resource
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, technically Name
is enough because progress insight types are coupled with exactly one kind, at least for now (Nodes were a thing I suspected may benefit from having a reference to e.g. a Machine
and the new MCO nodeupdatestatus).
Consistency for clients is part of the attempted value, and each insight type using consistent reference structs whenever they refer to other resources felt useful, so that if clients want to do some "read insight then read actual resource" logic, they do not need to implement the translation themselves for each insight type but just read .resource
and feed it into some shared utility code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Joining from another thread, I also noted that this seems redundant.
If you think there's value in the resource (specifying group and resource type), I don't really see the value in the name since it's already in the resource reference.
You've mentioned the reason for having a reference, but not why name as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not entirely convinced, but it's not the hill I need to die on so in the spirit of simplification I can sacrifice the intended convenience, and will gut the .resource
fields 👍
// +required | ||
Scope ScopeType `json:"scopeType"` | ||
|
||
// version is the version of the node, when known |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this isn't making previous/target distinctions. Did we want to recycle something like ControlPlaneUpdateVersions
? But obviously not with that kind of name specificity, we'd want UpdateVersions
or NodeUpdateVersions
or some such, depending on whether we thought we could share with the control plane consumer or not.
If all we need is a single version string, can we expand the Godocs to be clear about whether it's previous or target version? I'm guessing it's target version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is the current version - previous before the node is updated, and target afterwards. Will improve the godoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarified in godoc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some nit comments inline, but nothing stood out to me as a structural issue (not surprising given the amount of time you've invested in polishing this API proposal so far). So approving, because Godocs and that level thing can all get sorted out in post-merge follow-up, if folks can find a way to land dev-preview APIs so we have time to build the tooling that writes and reads these while Godocs and things that the tooling won't care about get sorted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not doing a fine tooth comb review at the moment because I know we are keen to get something in and iterate, I'm working out a good way to make sure the API review is completed before we GA, and track that as part of the work, but for now, some initial thoughts
// conditions provide details about the operator. It contains at most 10 items. Known conditions are: | ||
// - Updating: whether the operator is updating; When Updating=False, the reason field can be Pending or Updated | ||
// - Healthy: whether the operator is considered healthy; When Healthy=False, the reason field can be Unavailable or Degraded, and Unavailable is "stronger" than Degraded | ||
// +listType=map | ||
// +listMapKey=type | ||
// +patchStrategy=merge | ||
// +patchMergeKey=type | ||
// +optional | ||
// +kubebuilder:validation:MaxItems=10 | ||
Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If each cluster operator could implement these two conditions, would this API be required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is hard for cluster operators to implement these conditions because they require awareness of state beyond the ClusterOperator spec->status
reconciliation (specifically the Updating
one). Updating=False Reason=Pending
is not a state you can detect if you only look at the CO, you need the knowledge that the CVO started updating the cluster, and therefore that the CO should be eventually bumped.
But yeah, this one is probably the thinnest of the progress insights.
// Progressing is used with Updating=True when the ClusterOperator is updating | ||
ClusterOperatorUpdatingReasonProgressing ClusterOperatorUpdatingReason = "Progressing" | ||
// CannotDetermine is used with Updating=Unknown | ||
ClusterOperatorUpdatingCannotDetermine ClusterOperatorUpdatingReason = "CannotDetermine" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be useful to have constants defined (for code to check against), but you're right in that this isn't an enum, so it isn't strictly necessary.
Sometimes the condition types/reasons live alongside the API, sometime they live alongside the controller. I don't have a strong opinion on where they should live
// +patchMergeKey=type | ||
// +optional | ||
// +kubebuilder:validation:MaxItems=10 | ||
Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a condition on the ClusterVersion today that this is mirroring? I'm trying to understand why I'd want this resource vs just looking at the cluster version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole purpose of Status API is for consumers to not "just look at the cluster version". ClusterVersion has multiple conditions that are essentially phases (ReleaseAccepted
, Upgradeable
, Progressing
). Right now the implementation mimics Progressing
(cvo actually loaded new payload, validated it, and started reconciling) but in the future I wanted to cover the states before Progressing=True
in this insight. So again the CV insight is a result of a logic of interpreting a potentially complicated CV state.
// +many places and this API is intended to be consumed by clients, not produced, consistency seems to be more valuable | ||
// +than type safety for producers. | ||
// +required | ||
// +kubebuilder:validation:XValidation:rule="self.group == 'config.openshift.io' && self.resource == 'clusterversions'",message="resource must be a clusterversions.config.openshift.io resource" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name would also have to be cluster
no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure whether the API needs to encode the knowledge of this convention. CV name is a business logic matter, not a data integrity one? Whatever produces the insights is responsible for putting correct data there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would an insight achieve if it pointed to ClusterVersion called foo? There would be no foo
ClusterVersion of that name, so, does it make sense to even allow that?
// +required | ||
Assessment PoolAssessment `json:"assessment"` | ||
|
||
// completion is a percentage of the update completion (0-100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, should include units on integer fields, so completionPercent
would be a better fit IMO
// +patchStrategy=merge | ||
// +patchMergeKey=type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Patch/merge strategy doesn't actually work on CRDs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That means I should remove this from everywhere? I think some linter complained when I did not have these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah they should be removed from everywhere, the linter is slightly out of date, I'll get that updated and let you know when that's done
// +kubebuilder:validation:Maximum=100 | ||
Completion int32 `json:"completion"` | ||
|
||
// summaries is a list of counts of nodes matching certain criteria (e.g. updated, degraded, etc.). Maximum 16 items can be listed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the type is an enum, the maximum should be tied to the number of different possible enum values
// +required | ||
Type NodeSummaryType `json:"type"` | ||
|
||
// count is the number of nodes matching the criteria, between 0 and 1024 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect we support more than 1024 nodes, perhaps make the limit 5000? Probably want to check with perf and scale, but I believe our limit is well below 5000
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we set the limit to 2k then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not
// +required | ||
Scope ScopeType `json:"scopeType"` | ||
|
||
// version is the version of the node, when known |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What version is this? And, does it make sense to tie this to the Kube/OpenShift semver (looks like semver-ish match?) when Nodes are actually tied to a rendered MC version, and that can change over time even when the cluster OCP version stays the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the user PoV it makes sense - if you start updating from 4.17 to 4.18 then seeing nodes bump from 4.17 to 4.18 is what you expect. "tied to a rendered MC version" is an implementation detail that experts know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack, context in the godoc would be helpful here
cf6551a
to
bb826a4
Compare
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: petr-muller, wking The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
bb826a4
to
42ac987
Compare
OK tests pass with this revision: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_api/2233/pull-ci-openshift-api-master-integration/1902789916102758400 Now I will push the version that adds a CEL validation for progress insights to enforce |
6dd17a5
to
b48e2b0
Compare
/test minor-images
Not me! |
@petr-muller: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Alternative, smaller-footprint version of the API proposed in #2012, discussed in openshift/enhancements#1701 , addressing feedback on the early Update Health API Controller Proposal and Update Health API Draft docs.
This alternative proposal completely removes the "structure" part of
UpdateStatus
singleton proposed in #2012, and extracts the "data" part of it ("insight" leaves of theUpdateStatus
tree) to a top-level, individual, smaller-footprint CRDs. This approach completely removes some of the concerns voiced in #2012. I have some concerns about the approach (mostly about losing functionality), but I am able to iterate and maybe the structure part can be addressed separately.Except for the high-level layer, the example walkthrough (why the data structures are shaped like this and what functionality they are supposed to support) is applicable here as well.
/cc @JoelSpeed @sdodson @wking