
OTA-1339: UpdateStatus: API to expose information about update progress & health #2233

Open · wants to merge 30 commits into master from update-status-api-minimal
Conversation

@petr-muller (Member) commented Mar 13, 2025

Alternative, smaller-footprint version of the API proposed in #2012, discussed in openshift/enhancements#1701, addressing feedback on the early Update Health API Controller Proposal and Update Health API Draft docs.

This alternative proposal completely removes the "structure" part of the UpdateStatus singleton proposed in #2012, and extracts its "data" part (the "insight" leaves of the UpdateStatus tree) into top-level, individual, smaller-footprint CRDs. This approach entirely removes some of the concerns voiced in #2012. I have some concerns about the approach (mostly about losing functionality), but I can iterate, and maybe the structure part can be addressed separately.

Except for the high-level layer, the example walkthrough (why the data structures are shaped like this and what functionality they are supposed to support) is applicable here as well.

/cc @JoelSpeed @sdodson @wking

@openshift-ci openshift-ci bot requested review from JoelSpeed, sdodson and wking March 13, 2025 19:03
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 13, 2025
@openshift-ci-robot commented Mar 13, 2025

@petr-muller: This pull request references OTA-1339 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set.

In response to this:

Alternative, smaller-footprint version of the API proposed in #2012, discussed in openshift/enhancements#1701 , addressing feedback on the early Update Health API Controller Proposal and Update Health API Draft docs.

This alternative proposal completely removes the "structure" part of UpdateStatus singleton proposed in #2012, and extracts the "data" part of it ("insight" leaves of the UpdateStatus tree) to a top-level, individual, smaller-footprint CRDs. This approach completely removes some of the concerns voiced in #2012. I have some concerns about the approach (mostly about losing functionality), but I am able to iterate and maybe the structure part can be addressed separately.

/cc @JoelSpeed @sdodson @wking

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci bot (Contributor) commented Mar 13, 2025

Hello @petr-muller! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 13, 2025
// +kubebuilder:resource:path=clusterversionprogressinsights,scope=Cluster
// +openshift:api-approved.openshift.io=https://github.com/openshift/api/pull/2012
// +openshift:file-pattern=cvoRunLevel=0000_00,operatorName=cluster-version-operator,operatorOrdering=02
// +openshift:enable:FeatureGate=UpgradeStatus
Member Author

UpgradeStatus is a TechPreview featuregate. At this point I think we'd need to have a new, DevPreview featuregate if this is accepted.

Member Author

xref: #2246

@openshift-ci-robot commented Mar 13, 2025

@petr-muller: This pull request references OTA-1339 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set.

In response to this:

Alternative, smaller-footprint version of the API proposed in #2012, discussed in openshift/enhancements#1701 , addressing feedback on the early Update Health API Controller Proposal and Update Health API Draft docs.

This alternative proposal completely removes the "structure" part of UpdateStatus singleton proposed in #2012, and extracts the "data" part of it ("insight" leaves of the UpdateStatus tree) to a top-level, individual, smaller-footprint CRDs. This approach completely removes some of the concerns voiced in #2012. I have some concerns about the approach (mostly about losing functionality), but I am able to iterate and maybe the structure part can be addressed separately.

Except for the high-level layer, the example walkthrough (why the data structures are shaped like this and what functionality they are supposed to support) is applicable here as well.

/cc @JoelSpeed @sdodson @wking


// Progressing is used with Updating=True when the ClusterOperator is updating
ClusterOperatorUpdatingReasonProgressing ClusterOperatorUpdatingReason = "Progressing"
// CannotDetermine is used with Updating=Unknown
ClusterOperatorUpdatingCannotDetermine ClusterOperatorUpdatingReason = "CannotDetermine"
Member

nit: there doesn't seem to be any enum-style validation on conditions[].reason, so maybe we can leave these off and go free-form in the resource-creating controller? We'd need Updated or Pending for the Updating=False case to allow clients to distinguish there, but I'm not clear on why clients would need to distinguish by well-known reason for Updating=True or Updating=Unknown, especially with us only declaring one reason string for each of those status values. But also, obviously not a big deal if we wanted central definitions of some default reasons, in case that helps make resource-creators more consistent in the "there's no more-granular information we want to share" cases.

Contributor

It can be useful to have constants defined (for code to check against), but you're right in that this isn't an enum, so it isn't strictly necessary.

Sometimes the condition types/reasons live alongside the API, sometimes they live alongside the controller. I don't have a strong opinion on where they should live.

@petr-muller (Member Author) Mar 17, 2025

I'm more inclined to add validations eventually rather than drop to free-form, because I think we should either be full "reasons are well-known and can matter" or full "reasons are free-form", but not half-way. I'll add the CEL rules.
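A possible shape for the CEL rules mentioned here, constraining conditions[].reason to the declared set; the rule text, field placement, and reason list are illustrative assumptions, not the final API:

```go
// Hypothetical sketch only: constrain reasons on the Updating condition via CEL
// instead of leaving conditions[].reason free-form. Reason names are assumed.
// +kubebuilder:validation:XValidation:rule="self.all(c, c.type != 'Updating' || c.reason in ['Progressing', 'Updated', 'Pending', 'CannotDetermine'])",message="Updating condition must use a known reason"
Conditions []metav1.Condition `json:"conditions,omitempty"`
```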

Member Author

I have started adding validations for conditions (to ensure condition presence and that they use the correct reasons etc.) but then I realized tightening the API this much at this point makes it harder to experiment in the controller (for example, I'm not sure if I can get away with a single Healthy condition on CV & MCP insights, or whether I would need to break it down). For now the API targets DevPreview and does not need to be this tight, so I only added TODOs everywhere.

// ControlPlaneUpdateVersions contains the original and target versions of the upgrade
type ControlPlaneUpdateVersions struct {
// previous is the version of the control plane before the update. When the cluster is being installed
// for the first time, the version will have a placeholder value '<none>' and carry 'Installation' metadata
Member

nit: if the cluster history is:

  1. Completed vA.
  2. Partial vB.
  3. Partial vC.

it's not clear to me from these docs whether the value would be vA or vB. The client-side code suggests vB, in which case maybe wording like:

previous is the version the control plane was attempting to reconcile before the update.

or some such that makes it clear that it's "the previous target, regardless of completion" and not "the most recent completed target".
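The "previous target, regardless of completion" reading can be shown with a small sketch (a hypothetical helper over a simplified history, newest entry first; the real types live in openshift/api):

```go
package main

import "fmt"

// historyEntry is a simplified stand-in for a ClusterVersion history entry.
type historyEntry struct {
	Version   string
	Completed bool
}

// previousVersion reads "previous" as the version the control plane was
// attempting to reconcile before the current update, regardless of whether
// that reconciliation completed.
func previousVersion(history []historyEntry) string {
	if len(history) < 2 {
		return "<none>" // installation: no prior target, placeholder per the godoc
	}
	return history[1].Version
}

func main() {
	// History: Completed vA, Partial vB, Partial vC (newest first).
	h := []historyEntry{{"vC", false}, {"vB", false}, {"vA", true}}
	fmt.Println(previousVersion(h)) // vB, not the most recent completed vA
}
```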

Member Author

Improved the godoc, hopefully


// ScopeType is one of ControlPlane or WorkerPool
// +kubebuilder:validation:Enum=ControlPlane;WorkerPool
type ScopeType string
Member

nit: ScopeType doesn't seem to be consumed by ClusterVersionProgressInsight. Maybe move it to types_health_insight.go to live alongside the consuming structure?

Member Author

It's also used in MCP and Node insights. Introducing a "common.go" felt like overhead but maybe it's worth it (especially if we keep using ResourceRef everywhere and not just in Health)

Member Author

Moved to types_common.go

// +than type safety for producers.
// +required
// +kubebuilder:validation:XValidation:rule="!has(self.group) && self.resource == 'nodes'",message="resource must be a nodes.core.k8s.io resource"
Resource ResourceRef `json:"resource"`
Member

nit: if we have this, we don't need Name, do we? But I see you have an XValidation rule on NodeProgressInsightStatus to enforce matching, so there must be a reason to have both forms? Are we just satisfying both the "By OpenShift API conventions..." from the Godocs with Name, and then simultaneously "...consistency seems to be more valuable..." with Resource?

@petr-muller (Member Author) Mar 17, 2025

Yeah, technically Name is enough, because progress insight types are coupled with exactly one kind, at least for now (Nodes were a thing I suspected might benefit from having a reference to e.g. a Machine and the new MCO nodeupdatestatus).

Consistency for clients is part of the attempted value, and each insight type using consistent reference structs whenever they refer to other resources felt useful, so that if clients want to do some "read insight then read actual resource" logic, they do not need to implement the translation themselves for each insight type but just read .resource and feed it into some shared utility code.
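The shared-utility idea described here could look roughly like the following sketch; ResourceRef is a simplified copy of the reference shape, and refKey is a hypothetical helper name:

```go
package main

import "fmt"

// ResourceRef mirrors the shared reference struct each insight carries in
// .resource (simplified; the real struct lives in this API's types).
type ResourceRef struct {
	Group    string // empty for core-group resources
	Resource string
	Name     string
}

// refKey is the kind of shared client utility any insight's .resource could be
// fed into, avoiding per-insight translation logic. Illustrative only.
func refKey(r ResourceRef) string {
	if r.Group == "" {
		return r.Resource + "/" + r.Name
	}
	return r.Group + "/" + r.Resource + "/" + r.Name
}

func main() {
	fmt.Println(refKey(ResourceRef{Resource: "nodes", Name: "worker-1"}))
	fmt.Println(refKey(ResourceRef{Group: "config.openshift.io", Resource: "clusterversions", Name: "version"}))
}
```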

Contributor

Joining from another thread, I also noted that this seems redundant.

If you think there's value in the resource (specifying group and resource type), I don't really see the value in the name since it's already in the resource reference.

You've mentioned the reason for having a reference, but not why name as well?

Member Author

I am not entirely convinced, but it's not the hill I need to die on so in the spirit of simplification I can sacrifice the intended convenience, and will gut the .resource fields 👍

// +required
Scope ScopeType `json:"scopeType"`

// version is the version of the node, when known
Member

nit: this isn't making previous/target distinctions. Did we want to recycle something like ControlPlaneUpdateVersions? But obviously not with that kind of name specificity, we'd want UpdateVersions or NodeUpdateVersions or some such, depending on whether we thought we could share with the control plane consumer or not.

If all we need is a single version string, can we expand the Godocs to be clear about whether it's previous or target version? I'm guessing it's target version?

@petr-muller (Member Author) Mar 17, 2025

It is the current version - previous before the node is updated, and target afterwards. Will improve the godoc.

Member Author
Clarified in godoc

@wking (Member) left a comment

I left some nit comments inline, but nothing stood out to me as a structural issue (not surprising given the amount of time you've invested in polishing this API proposal so far). So approving: Godocs and that level of thing can all be sorted out in post-merge follow-up, if folks can find a way to land dev-preview APIs so we have time to build the tooling that writes and reads these while the details the tooling won't care about get sorted.

@JoelSpeed (Contributor) left a comment

I'm not doing a fine-tooth-comb review at the moment because I know we are keen to get something in and iterate. I'm working out a good way to make sure the API review is completed (and tracked as part of the work) before we GA, but for now, some initial thoughts:

Comment on lines 49 to 58
// conditions provide details about the operator. It contains at most 10 items. Known conditions are:
// - Updating: whether the operator is updating; When Updating=False, the reason field can be Pending or Updated
// - Healthy: whether the operator is considered healthy; When Healthy=False, the reason field can be Unavailable or Degraded, and Unavailable is "stronger" than Degraded
// +listType=map
// +listMapKey=type
// +patchStrategy=merge
// +patchMergeKey=type
// +optional
// +kubebuilder:validation:MaxItems=10
Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`
Contributor

If each cluster operator could implement these two conditions, would this API be required?

@petr-muller (Member Author) Mar 17, 2025

It is hard for cluster operators to implement these conditions because they require awareness of state beyond the ClusterOperator spec->status reconciliation (specifically the Updating one). Updating=False Reason=Pending is not a state you can detect if you only look at the CO, you need the knowledge that the CVO started updating the cluster, and therefore that the CO should be eventually bumped.

But yeah, this one is probably the thinnest of the progress insights.
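The controller-side logic described here could be sketched as follows; inputs, condition statuses, and reason strings are simplified assumptions about the intended behavior, not the actual implementation:

```go
package main

import "fmt"

// updatingCondition sketches why the ClusterOperator alone cannot produce this
// condition: the Pending reason requires knowing the CVO has started an update,
// which is state outside the CO's own spec->status reconciliation.
func updatingCondition(cvoUpdating, coProgressing bool, operatorVersion, targetVersion string) (status, reason string) {
	switch {
	case operatorVersion == "":
		return "Unknown", "CannotDetermine"
	case !cvoUpdating:
		return "False", "Updated" // no update in flight
	case operatorVersion == targetVersion:
		return "False", "Updated" // already bumped to the target
	case coProgressing:
		return "True", "Progressing"
	default:
		return "False", "Pending" // update started, operator not yet bumped
	}
}

func main() {
	s, r := updatingCondition(true, false, "4.17.5", "4.18.0")
	fmt.Println(s, r) // False Pending: undetectable from the CO alone
}
```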


// +patchMergeKey=type
// +optional
// +kubebuilder:validation:MaxItems=10
Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`
Contributor

Is there a condition on the ClusterVersion today that this is mirroring? I'm trying to understand why I'd want this resource vs just looking at the cluster version?

Member Author

The whole purpose of the Status API is for consumers to not "just look at the cluster version". ClusterVersion has multiple conditions that are essentially phases (ReleaseAccepted, Upgradeable, Progressing). Right now the implementation mimics Progressing (CVO actually loaded the new payload, validated it, and started reconciling), but in the future I wanted to cover the states before Progressing=True in this insight. So again, the CV insight is the result of logic that interprets a potentially complicated CV state.
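The interpretation layer this describes could be sketched roughly like so; the condition inputs and the phase names (PendingReconcile, NotUpdating) are illustrative assumptions, not the actual insight API:

```go
package main

import "fmt"

// cvInsightPhase collapses phase-like ClusterVersion conditions into a single
// insight state, so clients do not reread and reinterpret ClusterVersion
// themselves. Simplified: real logic would consider more conditions.
func cvInsightPhase(releaseAccepted, progressing bool) string {
	switch {
	case progressing:
		return "Updating" // CVO loaded and validated the payload, reconciling
	case releaseAccepted:
		return "PendingReconcile" // payload accepted, reconciliation not started
	default:
		return "NotUpdating"
	}
}

func main() {
	fmt.Println(cvInsightPhase(true, false))
}
```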

// +many places and this API is intended to be consumed by clients, not produced, consistency seems to be more valuable
// +than type safety for producers.
// +required
// +kubebuilder:validation:XValidation:rule="self.group == 'config.openshift.io' && self.resource == 'clusterversions'",message="resource must be a clusterversions.config.openshift.io resource"
Contributor

Name would also have to be cluster, no?

@petr-muller (Member Author) Mar 17, 2025

I'm not sure whether the API needs to encode the knowledge of this convention. CV name is a business logic matter, not a data integrity one? Whatever produces the insights is responsible for putting correct data there.

Contributor

What would an insight achieve if it pointed to a ClusterVersion called foo? There would be no ClusterVersion of that name, so does it make sense to even allow that?

// +required
Assessment PoolAssessment `json:"assessment"`

// completion is a percentage of the update completion (0-100)
Contributor

Nit: integer fields should include units, so completionPercent would be a better fit IMO.

// +patchStrategy=merge
// +patchMergeKey=type
Contributor

Patch/merge strategy doesn't actually work on CRDs

@petr-muller (Member Author) Mar 17, 2025

That means I should remove this from everywhere? I think some linter complained when I did not have these.

Contributor

Yeah, they should be removed from everywhere; the linter is slightly out of date. I'll get that updated and let you know when that's done.
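For reference, the same conditions field with only the markers that take effect on CRDs, where listType=map/listMapKey drive server-side-apply merge semantics (a sketch of the cleanup discussed, not the final field):

```go
// +listType=map
// +listMapKey=type
// +optional
// +kubebuilder:validation:MaxItems=10
Conditions []metav1.Condition `json:"conditions,omitempty"`
```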

// +kubebuilder:validation:Maximum=100
Completion int32 `json:"completion"`

// summaries is a list of counts of nodes matching certain criteria (e.g. updated, degraded, etc.). Maximum 16 items can be listed.
Contributor

Since the type is an enum, the maximum should be tied to the number of different possible enum values

// +required
Type NodeSummaryType `json:"type"`

// count is the number of nodes matching the criteria, between 0 and 1024
Contributor

I suspect we support more than 1024 nodes; perhaps make the limit 5000? We probably want to check with perf and scale, but I believe our limit is well below 5000.

Contributor

Shall we set the limit to 2k then?

Member Author

Why not

// +required
Scope ScopeType `json:"scopeType"`

// version is the version of the node, when known
Contributor

What version is this? And, does it make sense to tie this to the Kube/OpenShift semver (looks like semver-ish match?) when Nodes are actually tied to a rendered MC version, and that can change over time even when the cluster OCP version stays the same?

Member Author

From the user PoV it makes sense - if you start updating from 4.17 to 4.18 then seeing nodes bump from 4.17 to 4.18 is what you expect. "tied to a rendered MC version" is an implementation detail that experts know.

Contributor

Ack, context in the godoc would be helpful here

@petr-muller force-pushed the update-status-api-minimal branch 2 times, most recently from cf6551a to bb826a4 on March 20, 2025 18:17
openshift-ci bot (Contributor) commented Mar 20, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: petr-muller, wking
Once this PR has been reviewed and has the lgtm label, please assign joelspeed for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.


Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@petr-muller force-pushed the update-status-api-minimal branch from bb826a4 to 42ac987 on March 20, 2025 18:30
@petr-muller (Member Author)

OK tests pass with this revision: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_api/2233/pull-ci-openshift-api-master-integration/1902789916102758400

Now I will push the version that adds a CEL validation for progress insights to enforce .metadata.name == .status.name, which makes all tests fail for me locally, and I cannot figure out why.
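One possible shape for that rule, anchored at the object root (an assumption about placement; at the root scope CRD CEL rules can see metadata.name, which is enough here):

```go
// Hypothetical root-level rule enforcing the metadata/status name match.
// +kubebuilder:validation:XValidation:rule="!has(self.status) || self.metadata.name == self.status.name",message="status.name must match metadata.name"
```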

@petr-muller force-pushed the update-status-api-minimal branch from 6dd17a5 to b48e2b0 on March 21, 2025 13:32
@petr-muller (Member Author)

/test minor-images

 INFO[2025-03-21T14:44:21Z] build didn't start running within 1h0m0s (phase: Pending):
* Container docker-build is not ready with reason PodInitializing 

Not me!

openshift-ci bot (Contributor) commented Mar 21, 2025

@petr-muller: all tests passed!


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
