OTA-1339: `UpdateStatus`: API to expose information about update progress & health #2012

petr-muller · 2024-08-26T16:59:18Z

Implements the API proposed in openshift/enhancements#1701, addressing feedback on the early Update Health API Controller Proposal and Update Health API Draft docs.

Introduce a new UpdateStatus API (CRD) that exposes the status and health of the OpenShift cluster update process. In a typical OpenShift cluster, the API will be a singleton with an empty .spec, and its purpose will be to expose information through its status subresource.

UpdateStatus API Overview

apiVersion: update.openshift.io/v1alpha1
kind: UpdateStatus
metadata:
  name: cluster
spec: { }
status:
  controlPlane:
    _: ...
    informers:
    - name: cvo-example-informer
      insights: <list of insights reported by the informer>
    - name: mco-example-informer
      insights:
      - type: ClusterVersion # CV status insight
        _: ...
      - type: ClusterOperator # CO status insight
        _: ...
      - type: UpdateHealth # General update health insight
        _: ...
    conditions: <list of standard kubernetes conditions>
  workerPools:
  - name: workers
    _: ...
    informers: <list of informers with reported insights>
    conditions: <list of standard kubernetes conditions>
  - name: infra
    _: ...
  conditions: <list of standard kubernetes conditions>

The API has three conceptual layers:

Through the innermost layer .status.{controlPlane,workerPools[]}.informers, the API exposes detailed information about individual concerns related to the update, called "Update Insights." The API is prepared to allow multiple external informers to contribute insights, but in this enhancement, the only informer is the USC itself.
The aggregation layer, .status.{controlPlane,workerPools[]}, reports higher-level information about the update and serves as the USC's interpretation of all insights.
The outermost layer, .status.conditions, is used to report operational matters related to the USC itself (the health of the controller and gathering the insights, not the health of the update process).
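To illustrate how tooling might walk the innermost layer, here is a minimal Go sketch. The type and field names below are hypothetical simplifications modeled on the YAML overview above; they are not the types defined in update/v1alpha1/types_update_status.go, and the helper `countInsights` is invented for illustration only.

```go
package main

import "fmt"

// Hypothetical, heavily simplified shapes modeled on the YAML overview.
// The real API types differ in detail.
type Insight struct {
	Type string // e.g. "ClusterVersion", "ClusterOperator", "UpdateHealth"
}

type Informer struct {
	Name     string
	Insights []Insight
}

// ControlPlane and WorkerPool form the aggregation layer; their Informers
// field is the innermost layer.
type ControlPlane struct {
	Informers []Informer
}

type WorkerPool struct {
	Name      string
	Informers []Informer
}

type Status struct {
	ControlPlane ControlPlane
	WorkerPools  []WorkerPool
}

// countInsights returns how many insights of the given type are reported
// across all the given informers.
func countInsights(informers []Informer, insightType string) int {
	n := 0
	for _, informer := range informers {
		for _, insight := range informer.Insights {
			if insight.Type == insightType {
				n++
			}
		}
	}
	return n
}

// exampleStatus builds a status resembling the YAML overview.
func exampleStatus() Status {
	return Status{
		ControlPlane: ControlPlane{
			Informers: []Informer{
				{Name: "cvo-example-informer"},
				{Name: "mco-example-informer", Insights: []Insight{
					{Type: "ClusterVersion"},
					{Type: "ClusterOperator"},
					{Type: "UpdateHealth"},
				}},
			},
		},
		WorkerPools: []WorkerPool{{Name: "workers"}, {Name: "infra"}},
	}
}

func main() {
	s := exampleStatus()
	fmt.Println(countInsights(s.ControlPlane.Informers, "ClusterOperator")) // prints 1
	fmt.Println(len(s.WorkerPools))                                         // prints 2
}
```

A consumer that only needs the summary would read the aggregation layer instead of iterating over insights like this.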

We do not expect users to interact with UpdateStatus resources directly; the API is intended to be used mainly by tooling. A typical UpdateStatus instance is likely to be quite verbose.

I'm keeping iterations as commits but will squash before (eventual) merge.

openshift-ci · 2024-08-26T16:59:32Z

Hello @petr-muller! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

openshift-ci · 2024-08-26T16:59:32Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci-robot · 2024-10-07T16:27:56Z

@petr-muller: This pull request references OTA-1339 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.18.0" version, but no target version was set.

In response to this:

~~Initial draft of the API, not expecting API reviewers to review yet, just OTA~~

Still kinda WIP but it starts being partially reviewable. I still need to incorporate some feedback from Update Health API Draft and also revise this paying more attention to OpenShift API Conventions.

I'm keeping iterations as commits but will squash before (eventual) merge.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

config/v1alpha1/register.go

update/v1alpha1/types_update_status.go

wking · 2024-10-07T22:28:54Z

update/v1alpha1/types_update_status.go

+
+const (
+	// installation denotes a boolean that indicates the update was initiated as an installation
+	InstallationMetadata VersionMetadataKey = "installation"


ControlPlaneUpdateVersions has separate previous and target. Do we need an explicit "I'm the install!" marker in version metadata? Or can that be inferred from "previous is empty"? I can't think of a situation where you'd be in the middle of a real A->B update but still care about whether A was an install version or not.

This basically mimics what we do in the oc prototype:

https://github.com/openshift/oc/blob/3692450b96d57ae3870e5f693e833a0701d0e2b0/pkg/cli/admin/upgrade/status/controlplane.go#L57-L61

If this were the only use case for version metadata, then maybe the implicit 'previous empty implies installation' would be good enough. But we have another good use case for metadata (partial), so I think it's worth being explicit; we already have the mechanism for it.

I have no problem with openshift/oc#4978e7d226e8cb2 deciding to use an explicit property internally to read more clearly. And I'm fine with you following that pattern here in the initial v1alpha1 approach. But the public-API form increases my personal threshold for whether the structure should allow inconsistent data, and "claims to be an installation but has a different previous" and "claims to be an update, but has no previous" are two possible inconsistent states for this structure. But either way, feel free to mark this thread resolved, to get v1alpha1 landed :)

I like that point. Not having more metadata is also simpler, which is always a better start. I'll change this.

I ended up keeping the metadata item and enforcing consistency with CEL validation. I did not like allowing an empty version (it is a struct that's also used in target, where it should never be empty), so I made <none> a special value that's only allowed in previous, and only if accompanied by Installation metadata.
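The consistency rule discussed in this thread can be sketched in plain Go. This is not the CEL rule from the PR: the types, the `validate` helper, and the choice to model metadata as a map on the previous version are assumptions made for illustration, and the version strings are example values.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	noneVersion     = "<none>"       // placeholder allowed only in previous
	installationKey = "installation" // metadata key value shown in the diff above
)

// Version is a hypothetical simplification of the API's version struct.
type Version struct {
	Version  string
	Metadata map[string]string
}

// ControlPlaneUpdateVersions mirrors the previous/target pair from the thread.
type ControlPlaneUpdateVersions struct {
	Previous Version
	Target   Version
}

// validate rejects the two inconsistent states named in the review:
// an "installation" with a real previous version, and an "update" with
// no previous version. It also rejects an empty or placeholder target.
func validate(v ControlPlaneUpdateVersions) error {
	if v.Target.Version == "" || v.Target.Version == noneVersion {
		return errors.New("target must be a real version")
	}
	_, marked := v.Previous.Metadata[installationKey]
	isNone := v.Previous.Version == noneVersion
	if isNone != marked {
		return errors.New("previous must be <none> exactly when marked as installation")
	}
	return nil
}

func main() {
	install := ControlPlaneUpdateVersions{
		Previous: Version{Version: noneVersion, Metadata: map[string]string{installationKey: ""}},
		Target:   Version{Version: "4.18.0"},
	}
	update := ControlPlaneUpdateVersions{
		Previous: Version{Version: "4.17.0"},
		Target:   Version{Version: "4.18.0"},
	}
	bogus := ControlPlaneUpdateVersions{
		Previous: Version{Version: noneVersion}, // no installation metadata
		Target:   Version{Version: "4.18.0"},
	}
	fmt.Println(validate(install) == nil) // prints true
	fmt.Println(validate(update) == nil)  // prints true
	fmt.Println(validate(bogus) == nil)   // prints false
}
```

Expressing the invariant as a single equality between the two conditions covers both inconsistent states at once.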

update/v1alpha1/types_update_status.go

Initially `UpdateStatus` was made namespaced to accommodate HCP more easily, but recently it seems the org has established a practice where HCP resources have specialized variants. We have also identified some differences in how the API will need to behave in HCP, so it makes sense to make the API cluster-scoped in standalone, and in the future we will have an HCP-specific namespaced variant.

Mostly just fixed:
- `kubebuilder:validation:Required` -> `required`
- `patchStrategy=merge, patchMergeKey=type` in conditions

The (identical) rule markers present on both the field and the type alias caused the generated schema to be non-structural

I was running into trouble with CRD validation because of CEL budgets. The way the API is structured makes cardinalities high, because nodes require us to allow large numbers of insights. The hypothetical worst case is then 32 worker pools, each with 16 informers reporting 1024 insights (32 × 16 × 1024 = 524,288 in total), all of which are `ClusterVersion` status insights carrying a CEL rule that needs validation. Such a scenario is purely hypothetical, but compliance with budgets is enforced by the tools. The cardinality is so high because of the top-level worker pool list (32), so the problem can be solved by separating which insights can be present in which section of the API, which is what this commit does. Alternatively, we could flatten the `informers` layer, which would get rid of the `16` factor and let us limit the number of insights reported by all informers together. The informer source is not that important for API consumers, but flattening would make the API harder to work with on the producer side.
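The worst-case arithmetic behind this commit message can be checked with a short Go sketch. The limits come from the message itself; the function names are hypothetical, and the second and third calculations assume (this is my reading, not something the commit states) that the CEL-validated insights end up bounded by a single section's informer list, or by one flat insight list.

```go
package main

import "fmt"

// Limits quoted in the commit message.
const (
	maxWorkerPools = 32
	maxInformers   = 16
	maxInsights    = 1024
)

// worstCaseNested is the number of CEL-validated insights if every worker
// pool, informer, and insight slot could hold a ClusterVersion insight.
func worstCaseNested() int {
	return maxWorkerPools * maxInformers * maxInsights
}

// worstCaseSingleSection assumes such insights are allowed in only one
// section, so the worker pool factor drops out.
func worstCaseSingleSection() int {
	return maxInformers * maxInsights
}

// worstCaseFlattened assumes the informers layer is flattened into one
// insight list per worker pool, so the informer factor drops out.
func worstCaseFlattened() int {
	return maxWorkerPools * maxInsights
}

func main() {
	fmt.Println(worstCaseNested())        // prints 524288
	fmt.Println(worstCaseSingleSection()) // prints 16384
	fmt.Println(worstCaseFlattened())     // prints 32768
}
```

Both alternatives remove one multiplicative factor; they differ in whether the worker pool factor or the informer factor disappears.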

The generated code in this repository does not need this, but the code in openshift/client-go does.

openshift-ci · 2025-03-11T17:47:06Z

@petr-muller: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/images | `8495ca6` | link | true | `/test images` |
| ci/prow/lint | `8495ca6` | link | true | `/test lint` |
| ci/prow/build | `8495ca6` | link | true | `/test build` |
| ci/prow/verify-client-go | `8495ca6` | link | true | `/test verify-client-go` |
| ci/prow/unit | `8495ca6` | link | true | `/test unit` |
| ci/prow/verify | `8495ca6` | link | true | `/test verify` |
| ci/prow/integration | `8495ca6` | link | true | `/test integration` |
| ci/prow/verify-deps | `8495ca6` | link | true | `/test verify-deps` |
| ci/prow/verify-feature-promotion | `8495ca6` | link | true | `/test verify-feature-promotion` |
| ci/prow/verify-crd-schema | `8495ca6` | link | true | `/test verify-crd-schema` |
| ci/prow/minor-images | `8495ca6` | link | true | `/test minor-images` |


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 26, 2024

openshift-ci bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Aug 26, 2024

petr-muller closed this Aug 29, 2024

petr-muller reopened this Aug 29, 2024

petr-muller force-pushed the update-status-api branch 2 times, most recently from 7c1d5de to 5d0db69 Compare September 3, 2024 15:26

petr-muller force-pushed the update-status-api branch 2 times, most recently from ddca9e1 to 41b0a0c Compare October 7, 2024 16:17

petr-muller marked this pull request as ready for review October 7, 2024 16:17

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 7, 2024

openshift-ci bot requested review from deads2k and knobunc October 7, 2024 16:19

petr-muller changed the title ~~UpdateStatus: Initial working draft~~ OTA-1339: UpdateStatus: Initial working draft Oct 7, 2024

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 7, 2024