OTA-1339: UpdateStatus: API to expose information about update progress & health #2012

Open
wants to merge 45 commits into
base: master

petr-muller
Member

@petr-muller petr-muller commented Aug 26, 2024

Implements the API proposed in openshift/enhancements#1701, addressing feedback on the early Update Health API Controller Proposal and Update Health API Draft docs.


Introduce a new UpdateStatus API (CRD) that exposes the status and health of the OpenShift cluster update process. In a typical OpenShift cluster, the API will be a singleton with an empty .spec, and its purpose will be to expose information through its status subresource.

UpdateStatus API Overview

apiVersion: update.openshift.io/v1alpha1
kind: UpdateStatus
metadata:
  name: cluster
spec: { }
status:
  controlPlane:
    _: ...
    informers:
    - name: cvo-example-informer
      insights: <list of insights reported by the informer>
    - name: mco-example-informer
      insights:
      - type: ClusterVersion # CV status insight
        _: ...
      - type: ClusterOperator # CO status insight
        _: ...
      - type: UpdateHealth # General update health insight
        _: ...
    conditions: <list of standard kubernetes conditions>
  workerPools:
  - name: workers
    _: ...
    informers: <list of informers with reported insights>
    conditions: <list of standard kubernetes conditions>
  - name: infra
    _: ...
  conditions: <list of standard kubernetes conditions>    

The API has three conceptual layers:

  1. Through the innermost layer, .status.{controlPlane,workerPools[]}.informers, the API exposes detailed information about individual concerns related to the update, called "Update Insights." The API is prepared to allow multiple external informers to contribute insights, but in this enhancement, the only informer is the Update Status Controller (USC) itself.
  2. The aggregation layer, .status.{controlPlane,workerPools[]}, reports higher-level information about the update and serves as the USC's interpretation of all insights.
  3. The outermost layer, .status.conditions, is used to report operational matters related to the USC itself (the health of the controller and gathering the insights, not the health of the update process).
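
As a rough illustration of the three layers, here is a minimal Go sketch. The type and field names are simplified stand-ins chosen for this example, not the types added by this PR, and `countInsights` and `exampleStatus` are hypothetical helpers that only show how tooling might walk the innermost layer.

```go
package main

import "fmt"

// Insight is the innermost layer: one concern reported by an informer.
// Hypothetical, simplified stand-in for the real insight types.
type Insight struct {
	Type string // e.g. "ClusterVersion", "ClusterOperator", "UpdateHealth"
}

// Informer groups the insights contributed by one producer.
type Informer struct {
	Name     string
	Insights []Insight
}

// Condition stands in for a standard Kubernetes condition.
type Condition struct {
	Type   string
	Status string
}

// ControlPlane and WorkerPool form the aggregation layer.
type ControlPlane struct {
	Informers  []Informer
	Conditions []Condition
}

type WorkerPool struct {
	Name       string
	Informers  []Informer
	Conditions []Condition
}

// Status is the outermost layer; its Conditions describe the
// controller itself, not the update.
type Status struct {
	ControlPlane ControlPlane
	WorkerPools  []WorkerPool
	Conditions   []Condition
}

// countInsights tallies insights by type across the control plane
// and all worker pools.
func countInsights(s Status) map[string]int {
	counts := map[string]int{}
	add := func(informers []Informer) {
		for _, inf := range informers {
			for _, in := range inf.Insights {
				counts[in.Type]++
			}
		}
	}
	add(s.ControlPlane.Informers)
	for _, pool := range s.WorkerPools {
		add(pool.Informers)
	}
	return counts
}

// exampleStatus builds a small instance shaped like the YAML overview.
func exampleStatus() Status {
	return Status{
		ControlPlane: ControlPlane{
			Informers: []Informer{{
				Name: "example-informer",
				Insights: []Insight{
					{Type: "ClusterVersion"},
					{Type: "ClusterOperator"},
					{Type: "UpdateHealth"},
				},
			}},
		},
		WorkerPools: []WorkerPool{{
			Name: "workers",
			Informers: []Informer{{
				Name:     "example-informer",
				Insights: []Insight{{Type: "UpdateHealth"}},
			}},
		}},
	}
}

func main() {
	fmt.Println(countInsights(exampleStatus()))
}
```
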

We do not expect users to interact with UpdateStatus resources directly; the API is intended to be used mainly by tooling. A typical UpdateStatus instance is likely to be quite verbose.

I'm keeping iterations as commits but will squash before (eventual) merge.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 26, 2024
Contributor

openshift-ci bot commented Aug 26, 2024

Hello @petr-muller! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

Contributor

openshift-ci bot commented Aug 26, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Aug 26, 2024
@petr-muller petr-muller reopened this Aug 29, 2024
@petr-muller petr-muller force-pushed the update-status-api branch 2 times, most recently from 7c1d5de to 5d0db69 Compare September 3, 2024 15:26
@petr-muller petr-muller force-pushed the update-status-api branch 2 times, most recently from ddca9e1 to 41b0a0c Compare October 7, 2024 16:17
@petr-muller petr-muller marked this pull request as ready for review October 7, 2024 16:17
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 7, 2024
@openshift-ci openshift-ci bot requested review from deads2k and knobunc October 7, 2024 16:19
@petr-muller petr-muller changed the title UpdateStatus: Initial working draft OTA-1339: UpdateStatus: Initial working draft Oct 7, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 7, 2024

openshift-ci-robot commented Oct 7, 2024

@petr-muller: This pull request references OTA-1339 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.18.0" version, but no target version was set.

In response to this:

Initial draft of the API, not expecting API reviewers to review yet, just OTA

Still kinda WIP but it starts being partially reviewable. I still need to incorporate some feedback from Update Health API Draft and also revise this paying more attention to OpenShift API Conventions.

I'm keeping iterations as commits but will squash before (eventual) merge.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


const (
// installation denotes a boolean that indicates the update was initiated as an installation
InstallationMetadata VersionMetadataKey = "installation"
Member

ControlPlaneUpdateVersions has separate previous and target. Do we need an explicit "I'm the install!" marker in version metadata? Or can that be inferred from "previous is empty"? I can't think of a situation where you'd be in the middle of a real A->B update but still care about whether A was an install version or not.

Member Author

This basically mimics what we do in the oc prototype:

https://github.com/openshift/oc/blob/3692450b96d57ae3870e5f693e833a0701d0e2b0/pkg/cli/admin/upgrade/status/controlplane.go#L57-L61

If this were the only use case for version metadata, then the implicit 'previous empty implies installation' might be good enough. But we have a good use case for metadata (partial), so I think it's worth being explicit; we have the mechanism for it.

Member

I have no problem with openshift/oc#4978e7d226e8cb2 deciding to use an explicit property internally to read more clearly. And I'm fine with you following that pattern here in the initial v1alpha1 approach. But the public-API form increases my personal threshold for whether the structure should allow inconsistent data, and "claims to be an installation but has a different previous" and "claims to be an update, but has no previous" are two possible inconsistent states for this structure. But either way, feel free to mark this thread resolved, to get v1alpha1 landed :)

Member Author

I like that point. Not having more metadata is also simpler, which is always a better start. I'll change this.

Member Author

I ended up keeping the metadata item and enforcing consistency with CEL validation. I did not like allowing an empty version (it is a struct that's also used in target, where it should never be empty), so I made <none> a special value that's only allowed in previous and only if accompanied by Installation metadata.
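
To make the consistency rule concrete, here is a minimal Go sketch of the same checks written as ordinary code rather than CEL. The `UpdateVersions` struct, the `Installation` boolean, and `validateVersions` are hypothetical names for this illustration; where the metadata actually lives in the API, and the reverse rule (installation implies previous is `<none>`), are assumptions drawn from the inconsistent states discussed in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// noVersion is the special value mentioned in the thread.
const noVersion = "<none>"

// UpdateVersions is a hypothetical, flattened stand-in for the
// previous/target pair plus the "installation" metadata item.
type UpdateVersions struct {
	Previous     string
	Target       string
	Installation bool
}

// validateVersions rejects the inconsistent states discussed above.
func validateVersions(v UpdateVersions) error {
	switch {
	case v.Target == "" || v.Target == noVersion:
		return errors.New("target must be a real version")
	case v.Previous == "":
		return errors.New("previous must not be empty")
	case v.Previous == noVersion && !v.Installation:
		return errors.New("previous <none> requires installation metadata")
	case v.Installation && v.Previous != noVersion:
		// Assumed reverse rule: an installation has no previous version.
		return errors.New("installation metadata requires previous <none>")
	}
	return nil
}

func main() {
	cases := []UpdateVersions{
		{Previous: noVersion, Target: "4.18.0", Installation: true},
		{Previous: "4.17.0", Target: "4.18.0"},
		{Previous: noVersion, Target: "4.18.0"},
		{Previous: "4.17.0", Target: "4.18.0", Installation: true},
	}
	for _, c := range cases {
		fmt.Printf("%+v -> %v\n", c, validateVersions(c))
	}
}
```
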

@petr-muller petr-muller force-pushed the update-status-api branch 2 times, most recently from e438453 to 40182b1 Compare October 9, 2024 12:27
Initially `UpdateStatus` was made namespaced to accommodate HCP more easily, but recently it seems the org has established a practice where HCP resources have specialized variants. We have also identified some differences in how the API will need to behave in HCP, so it makes sense to make the API cluster-scoped in standalone, and in the future we will have an HCP-specific namespaced variant.
Mostly just fixed:
- `kubebuilder:validation:Required` -> `required`
- `patchStrategy=merge, patchMergeKey=type` in conditions
The (identical) rule markers present on both the field and the type alias caused the generated schema to be non-structural
I was running into trouble with CRD validation because of CEL budgets.
The way the API is structured makes cardinalities high because we
need to allow large numbers of insights because of nodes. The
hypothetical worst case is then 32 worker pools, each with 16 informers
reporting 1024 insights, all of which are `ClusterVersion` status insights
that carry a CEL rule that needs validation. Such a scenario is purely
hypothetical, but compliance with budgets is enforced by the tools.

The cardinality is so high because of the top-level worker pool list (32)
so the problem can be solved by separating what insights can be present
in which section of the API, which is what this commit does.

Alternatively, we could flatten the `informers` layer and that way we
would get rid of the `16` factor, and we would be able to limit the number
of insights reported by all informers that way. The informer source is
not that important for API consumers, but flattening would make the API
harder to work with on the producer side.
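
The worst-case arithmetic described above can be checked with a short Go sketch. The limits 32, 16, and 1024 come from the commit message; `worstCaseInsights` is a hypothetical helper, and treating the flattened alternative as a single list of at most 1024 insights per pool is an assumption made only for the comparison.

```go
package main

import "fmt"

// worstCaseInsights multiplies the list limits along one path of the
// schema to get the maximum number of insights a validator must assume.
func worstCaseInsights(pools, informersPerPool, insightsPerInformer int) int {
	return pools * informersPerPool * insightsPerInformer
}

func main() {
	// Nested layout: pools -> informers -> insights.
	nested := worstCaseInsights(32, 16, 1024)

	// Flattened alternative (assumed limit): one insight list per pool,
	// which removes the factor of 16.
	flattened := worstCaseInsights(32, 1, 1024)

	fmt.Println("nested worst case:   ", nested)
	fmt.Println("flattened worst case:", flattened)
}
```

With these limits the nested layout allows up to 524288 insights under the worker pool list, versus 32768 if the informer layer were flattened.
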
The generated code in this repository does not need this, but the code in openshift/client-go does.
Contributor

openshift-ci bot commented Mar 11, 2025

@petr-muller: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/images | 8495ca6 | link | true | /test images |
| ci/prow/lint | 8495ca6 | link | true | /test lint |
| ci/prow/build | 8495ca6 | link | true | /test build |
| ci/prow/verify-client-go | 8495ca6 | link | true | /test verify-client-go |
| ci/prow/unit | 8495ca6 | link | true | /test unit |
| ci/prow/verify | 8495ca6 | link | true | /test verify |
| ci/prow/integration | 8495ca6 | link | true | /test integration |
| ci/prow/verify-deps | 8495ca6 | link | true | /test verify-deps |
| ci/prow/verify-feature-promotion | 8495ca6 | link | true | /test verify-feature-promotion |
| ci/prow/verify-crd-schema | 8495ca6 | link | true | /test verify-crd-schema |
| ci/prow/minor-images | 8495ca6 | link | true | /test minor-images |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
5 participants