DNM: OCPNODE-3023: Kueue operator api review #2222

Open: wants to merge 17 commits into base: master

@kannon92 kannon92 commented Mar 7, 2025

We are bringing Kueue into OCP in 4.19. This PR fulfills the requirements for an API review.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 7, 2025
openshift-ci-robot commented Mar 7, 2025

@kannon92: This pull request references OCPNODE-3023 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Contributor

openshift-ci bot commented Mar 7, 2025

Hello @kannon92! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 7, 2025
@openshift-ci openshift-ci bot requested review from JoelSpeed and sjenning March 7, 2025 19:03
Contributor

openshift-ci bot commented Mar 7, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kannon92
Once this PR has been reviewed and has the lgtm label, please assign knobunc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kannon92 kannon92 changed the title OCPNODE-3023: Kueue operator api review DNM: OCPNODE-3023: Kueue operator api review Mar 7, 2025
@kannon92 kannon92 force-pushed the kueue-operator-api branch from 243b816 to c2d2859 Compare March 11, 2025 18:17
// - "statefulset" (requires enabling pod integration)
// - "leaderworkerset.x-k8s.io/leaderworkerset" (requires enabling pod integration)
// +kubebuilder:validation:MaxItems=14
// +kubebuilder:validation:items:MaxLength=64
Contributor

I suspect this is too small. A valid qualified name has a DNS name as the prefix, which can be up to 253 chars alone, and I believe the part after the slash can be up to 63 chars.

So 253 + 1 + 63 = 317 in the worst case IIUC
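
The arithmetic above can be sketched in Go. This is only an illustration of the worst-case length calculation; the constant names are hypothetical and are not taken from the PR or from any Kubernetes package.

```go
package main

import "fmt"

// Limits as described in the review comment. The names are illustrative
// assumptions, not identifiers from the PR.
const (
	maxPrefixLen = 253 // DNS name used as the optional prefix
	separatorLen = 1   // the "/" between prefix and name
	maxNameLen   = 63  // the part after the slash
)

// maxQualifiedNameLen is the worst-case length of "<prefix>/<name>".
const maxQualifiedNameLen = maxPrefixLen + separatorLen + maxNameLen

func main() {
	// The longest framework name listed in the GoDoc under review.
	longestListed := "leaderworkerset.x-k8s.io/leaderworkerset"

	fmt.Println("worst-case qualified name length:", maxQualifiedNameLen)
	fmt.Println("longest listed name fits in 64:", len(longestListed) <= 64)
}
```

The currently listed names fit within 64 characters, but a limit derived from the qualified-name format would leave room for any future name that follows that format.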

Author

I don't think this is a valid qualified name.

The options are given above, and the longest value is leaderworkerset.x-k8s.io/leaderworkerset. That is only 40 characters, so I went with a nice power of 2 for some wiggle room.

Contributor

What happens if upstream kueue adds support for an integration that uses a fully qualified name longer than 64 characters? Are there any conventions that kueue defines for framework names?

If we only ever want to support a defined list of frameworks (that we will update as needed), should the allowed values be defined by an enum?

Contributor

If these names are arbitrary, we should push for a standard on them upstream. They look like qualified names, so perhaps we can push for that validation.

Author

I'll try!

kubernetes-sigs/kueue#4659

We could do a patch to make the integration take an allowed list of enums. But from a quick glance at Kueue, this is not a minor fix.

I was going to work on that.

Author

The validation is really a list of allowed frameworks so I don't think we need more validation here.

It should only be an allowed list which we already have.

Contributor

@everettraven left a comment

Only got about a quarter through on this pass. Going to circle back around tomorrow

Contributor

@everettraven left a comment

One thing I'd like to see is more descriptive GoDoc for users. I linked a good guideline for GoDocs in one of my comments. I'd recommend using that guideline and going one-by-one through all your fields and updating the GoDoc with those guidelines in mind.

@openshift-ci openshift-ci bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 14, 2025
// This is required and must have at least one element.
// The frameworks are jobs that Kueue will manage.
// +required
Frameworks []KueueIntegrations `json:"frameworks"`
Contributor

Why frameworks? I guess that's an upstream term? Do they use that in their product docs?

Author

There aren't really a lot of docs around that.

Author

The developer docs do mention JobFrameworks.

@kannon92
Author

Hello,

We decided to call a sync to discuss some options as we need to make progress on some kind of API.

Meeting Doc: https://docs.google.com/document/d/16kJXo3W8lCWgYOPpS5Jd8MTT2nrX68XAZKwH9QFO7sg/edit?usp=sharing

Generally, we know that the MVP for this is going to be integrations/external frameworks. The configuration options are quite complicated and we were worried that we would never really come to a decision trying to add every API field we want in the first round of this.

For our tech preview, we are going to allow only Integrations as a configuration option. Sensible defaults will be set at the controller level so we can reduce the scope of this API for tech preview. Feature gates that RHOAI needs to have enabled for GA will be toggled in the controller.

We still need to figure out a path forward with OLM on blocking upgrades. Until we have a clear idea of that, including all those configurations without being able to block upgrades was causing a lot of friction.

I pushed up a commit that at least calls out the basic API we want for TP.

Comment on lines 29 to 30
// kueue configuration must not be changed once the object exists
// to change the configuration, one can delete the object and create a new object.
Contributor

Does this sentence need to be removed?

Comment on lines 40 to 41
// config is the desired configuration
// for the kueue operator.
Contributor

Is it the kueue operator, or just kueue? The operator spec is for the operator right?

Author

It is both. The operator creates the configuration for Kueue and deploys Kueue.

// The frameworks are jobs that Kueue will manage.
// +kubebuilder:validation:MaxItems=14
// +kubebuilder:validation:MinItems=1
// kubebuilder:validation:UniqueItems=true
Contributor

This doesn't actually work, use CEL instead

Suggested change
// kubebuilder:validation:UniqueItems=true
// +kubebuilder:validation:XValidation:rule="self.all(x, self.exists_one(y, x == y))",message="each item in frameworks must be unique"
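
For readers less familiar with CEL, the rule `self.all(x, self.exists_one(y, x == y))` requires that every element of the list matches exactly one element of the list, namely itself. A minimal Go sketch of the same check follows; the function name `allUnique` is hypothetical and is not part of the PR.

```go
package main

import "fmt"

// allUnique reports whether every element of items occurs exactly once.
// It has the same outcome as the CEL rule
// self.all(x, self.exists_one(y, x == y)), including for an empty list.
func allUnique(items []string) bool {
	seen := make(map[string]struct{}, len(items))
	for _, item := range items {
		if _, dup := seen[item]; dup {
			return false
		}
		seen[item] = struct{}{}
	}
	return true
}

func main() {
	fmt.Println(allUnique([]string{"BatchJob", "RayJob"}))
	fmt.Println(allUnique([]string{"BatchJob", "BatchJob"}))
}
```

The CEL expression compares every pair of elements, which is acceptable here because the list is capped by MaxItems; the map-based version is simply the idiomatic Go equivalent.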

// frameworks are a list of names to be enabled.
// This is required and must have at least one element.
// The frameworks are jobs that Kueue will manage.
// +kubebuilder:validation:MaxItems=14
Contributor

Suggested change
// +kubebuilder:validation:MaxItems=14
// Entries in this list must be unique.
// +kubebuilder:validation:MaxItems=14

// +required
Frameworks []KueueIntegrations `json:"frameworks"`
// externalFrameworks are a list of GroupVersionResources
// that are managed for Kueue by external controllers;
Contributor

Suggested change
// that are managed for Kueue by external controllers;
// that are managed for Kueue by external controllers.

// externalFrameworks are a list of GroupVersionResources
// that are managed for Kueue by external controllers;
// These are optional and should only be used if you have an external controller
// that integrations with kueue.
Contributor

You're mixing the casing of kueue quite a lot throughout the godoc. I think it should be capitalised as a proper noun.

Suggested change
// that integrations with kueue.
// that integrates with Kueue.

// match or otherwise the workload creation would fail. The labels are copied only
// during the workload creation and are not updated even if the labels of the
// underlying job are changed.
// +kubebuilder:validation:items:MaxLength=317
Contributor

Suggested change
// +kubebuilder:validation:items:MaxLength=317
// Each entry in the list must be a valid qualified name consisting of a lower-case alphanumeric string,
// and hyphens of at most 63 characters in length. The name must start and end with an alphanumeric character.
// The name may be optionally prefixed with a subdomain consisting of lower-case alphanumeric characters,
// hyphens and periods, of at most 253 characters in length.
// Each period separated segment within the subdomain must start and end with an alphanumeric character.
// The optional prefix and the name are separated by a forward slash (/).
// +kubebuilder:validation:items:MaxLength=317
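
The format described in the suggested GoDoc can be sketched as a small Go check. This follows the wording of the GoDoc only; it is an assumption-laden illustration, not the Kubernetes validator, which may accept a different character set. The names `isQualifiedName`, `nameRE` and `prefixRE` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Name part: lower-case alphanumerics and hyphens, alphanumeric at both ends.
	nameRE = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)
	// Prefix: period-separated segments, each shaped like a name part.
	prefixRE = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)
)

// isQualifiedName checks "[<prefix>/]<name>" as described in the GoDoc:
// the prefix is at most 253 characters and the name at most 63.
func isQualifiedName(s string) bool {
	name := s
	if prefix, rest, found := strings.Cut(s, "/"); found {
		if len(prefix) > 253 || !prefixRE.MatchString(prefix) {
			return false
		}
		name = rest
	}
	return len(name) <= 63 && nameRE.MatchString(name)
}

func main() {
	for _, s := range []string{"statefulset", "leaderworkerset.x-k8s.io/leaderworkerset", "Bad_Name", ""} {
		fmt.Printf("%q valid=%t\n", s, isQualifiedName(s))
	}
}
```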

// underlying job are changed.
// +kubebuilder:validation:items:MaxLength=317
// +kubebuilder:validation:MaxItems=64
// +kubebuilder:validation:XValidation:rule="self.size() == 0 || !format.qualifiedName().validate(self).hasValue()",message="a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character."
Contributor

Suggested change
// +kubebuilder:validation:XValidation:rule="self.size() == 0 || !format.qualifiedName().validate(self).hasValue()",message="a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character."
// +kubebuilder:validation:items:XValidation:rule="!format.qualifiedName().validate(self).hasValue()",message="a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character."

@kannon92 kannon92 force-pushed the kueue-operator-api branch 2 times, most recently from 131d2f7 to 45893f3 Compare March 18, 2025 17:39
Comment on lines 45 to 49
// integrations are the workloads Kueue will manage
// Kueue has integrations in the codebase and it also allows
// for external frameworks
// Kueue are an important part to specify for the API as Kueue will
// only manage the workloads that are specfied in this list.
Contributor

Can you read through this and check that it says what you want to say? I think it may be missing some punctuation.

Line 48 especially isn't reading right to me.

Comment on lines 102 to 106
// group of externalFramework
// must be a valid qualified name consisting of a lower-case alphanumeric string,
// and hyphens of at most 63 characters in length. The name must start and end with an alphanumeric character.
// The name may be optionally prefixed with a subdomain consisting of lower-case alphanumeric characters,
// hyphens and periods, of at most 253 characters in length.
// Each period separated segment within the subdomain must start and end with an alphanumeric character.
// The optional prefix and the name are separate by a forward slash (/).
Contributor

Make sure to punctuate this documentation as if it were product documentation.

Suggested change
// group of externalFramework
// must be a valid qualified name consisting of a lower-case alphanumeric string,
// and hyphens of at most 63 characters in length. The name must start and end with an alphanumeric character.
// The name may be optionally prefixed with a subdomain consisting of lower-case alphanumeric characters,
// hyphens and periods, of at most 253 characters in length.
// Each period separated segment within the subdomain must start and end with an alphanumeric character.
// The optional prefix and the name are separate by a forward slash (/).
// group is the API group of the externalFramework.
// Must be a valid qualified name consisting of a lower-case alphanumeric string,
// and hyphens of at most 63 characters in length. The name must start and end with an alphanumeric character.
// The name may be optionally prefixed with a subdomain consisting of lower-case alphanumeric characters,
// hyphens and periods, of at most 253 characters in length.
// Each period separated segment within the subdomain must start and end with an alphanumeric character.
// The optional prefix and the name are separated by a forward slash (/).

Comment on lines 114 to 115
// resource of external framework
// must be a valid qualified name consisting of a lower-case alphanumeric string,
Contributor

Suggested change
// resource of external framework
// must be a valid qualified name consisting of a lower-case alphanumeric string,
// resource is the Resource type of the external framework.
// Resource types are lowercase and plural (e.g. pods, deployments).
// Must be a valid qualified name consisting of a lower-case alphanumeric string,

// +kubebuilder:validation:MinLength=1
// +required
Resource string `json:"resource"`
// version is the version of the api
Contributor

Suggested change
// version is the version of the api
// version is the version of the api (e.g. v1alpha1, v1beta1, v1).

Comment on lines 45 to 50
// integrations are the workloads Kueue will manage
// Kueue has integrations in the codebase and it also allows
// for external frameworks
// Kueue are an important part to specify for the API as Kueue will
// only manage the workloads that are specfied in this list.
// This is a required field.
Contributor

This feels more developer focused than user focused. What do you think of something like:

integrations is a required field that configures Kueue's workload integrations.
Kueue has both standard integrations, known as job frameworks, and external integrations known as external frameworks.
Kueue will only manage workloads that correspond to the specified integrations.

LeaderWorkerSet KueueIntegrations = "LeaderWorkerSet"
)

// This is the GVK for an external framework.
Contributor

Suggested change
// This is the GVK for an external framework.
// This is the GVR for an external framework.

}

// +kubebuilder:validation:Enum=BatchJob;RayJob;RayCluster;JobSet;MPIJob;PaddleJob;PytorchJob;TFJob;XGBoostJob;AppWrappers;Pod;Deployment;StatefulSet;LeaderWorkerSet
type KueueIntegrations string
Contributor

This is an alias to a string, which is a singular type

Suggested change
type KueueIntegrations string
type KueueIntegration string

type KueueIntegrations string

const (
BatchJob KueueIntegrations = "BatchJob"
Contributor

It is generally recommended that the name of the constant follow the pattern {typeAlias}{constantValue} to prevent potential conflicts with other constant names in the future. It also makes it easier for engineers who need to reference this API to "guess" the constants associated with an enum-constrained field.

Suggested change
BatchJob KueueIntegrations = "BatchJob"
KueueIntegrationsBatchJob KueueIntegrations = "BatchJob"
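
A short sketch of the {typeAlias}{constantValue} pattern follows. It assumes the singular type name `KueueIntegration` from the earlier suggestion has also been applied, and it shows only three of the enum values; the exact set of constants in the PR may differ.

```go
package main

import "fmt"

// KueueIntegration names a single workload integration. The singular type
// name is an assumption based on the earlier review suggestion.
type KueueIntegration string

// Each constant is named {typeAlias}{constantValue}, so the names cannot
// collide with unrelated constants in the same package and are easy to guess.
const (
	KueueIntegrationBatchJob        KueueIntegration = "BatchJob"
	KueueIntegrationRayJob          KueueIntegration = "RayJob"
	KueueIntegrationLeaderWorkerSet KueueIntegration = "LeaderWorkerSet"
)

func main() {
	frameworks := []KueueIntegration{
		KueueIntegrationBatchJob,
		KueueIntegrationLeaderWorkerSet,
	}
	fmt.Println(frameworks)
}
```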

// The optional prefix and the name are separate by a forward slash (/).
// +kubebuilder:validation:MaxLength=253
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:XValidation:rule="self.size() == 0 || !format.qualifiedName().validate(self).hasValue()",message="a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character."
Contributor

I think we probably want the dns1123Subdomain validation instead of the qualifiedName validation:

Suggested change
// +kubebuilder:validation:XValidation:rule="self.size() == 0 || !format.qualifiedName().validate(self).hasValue()",message="a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character."
// +kubebuilder:validation:XValidation:rule="self.size() == 0 || !format.dns1123Subdomain().validate(self).hasValue()",message="a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character."

// hyphens and periods, of at most 253 characters in length.
// Each period separated segment within the subdomain must start and end with an alphanumeric character.
// +kubebuilder:validation:MaxLength=256
// +kubebuilder:validation:XValidation:rule="self.size() == 0 || !format.qualifiedName().validate(self).hasValue()",message="a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character."
Contributor

I'm not certain, but I don't think that resources can be full RFC1123 subdomain names. My understanding is that they would end up being limited to a DNS label.

@JoelSpeed would probably know more clearly than I.

Contributor

I think that's correct yes, resource is a DNS label which is a slightly different format validation
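
To illustrate the difference discussed here: an RFC 1123 subdomain may contain periods between segments, while an RFC 1123 label is a single segment of at most 63 characters. The sketch below is a standalone approximation with hypothetical function names; it does not call the Kubernetes validation package.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// A label is one segment: lower-case alphanumerics and hyphens.
	labelRE = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)
	// A subdomain is one or more label-shaped segments joined by periods.
	subdomainRE = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)
)

// isDNS1123Label reports whether s is a single label of at most 63 characters.
func isDNS1123Label(s string) bool {
	return len(s) <= 63 && labelRE.MatchString(s)
}

// isDNS1123Subdomain reports whether s is a subdomain of at most 253 characters.
func isDNS1123Subdomain(s string) bool {
	return len(s) <= 253 && subdomainRE.MatchString(s)
}

func main() {
	for _, s := range []string{"leaderworkersets", "leaderworkerset.x-k8s.io"} {
		fmt.Printf("%q label=%t subdomain=%t\n", s, isDNS1123Label(s), isDNS1123Subdomain(s))
	}
}
```

A resource such as `leaderworkersets` passes both checks, while a group such as `leaderworkerset.x-k8s.io` passes only the subdomain check, which is why the two fields want different validations.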

// +kubebuilder:validation:MaxLength=256
// +kubebuilder:validation:MinLength=1
// +required
Version string `json:"version"`
Contributor

I do know that the version string should be a valid DNS label as outlined in the upstream Kube API conventions: https://github.com/kubernetes/community/blob/35444da79dff9a448e7ecf24b277e5f71373840a/contributors/devel/sig-architecture/api-conventions.md?plain=1#L116-L118

Contributor

Let's add that validation then.

@kannon92 kannon92 force-pushed the kueue-operator-api branch from 45893f3 to 0aba7b7 Compare March 19, 2025 15:10
// Controller runtime requires this in this format
// for api discoverability.
type ExternalFramework struct {
// group of externalFramework
Contributor

I'd drop this line and start on the next one

// Each period separated segment within the subdomain must start and end with an alphanumeric character.
// +kubebuilder:validation:MaxLength=256
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:XValidation:rule="self.size() == 0 || !format.dns1123Subdomain().validate(self).hasValue()",message="a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character."
Contributor

Per a comment @everettraven pointed out, this should be a label

Suggested change
// +kubebuilder:validation:XValidation:rule="self.size() == 0 || !format.dns1123Subdomain().validate(self).hasValue()",message="a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character."
// +kubebuilder:validation:XValidation:rule="self.size() == 0 || !format.dns1123Label().validate(self).hasValue()",message="a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character."

// version is the version of the api (e.g. v1alpha1, v1beta1, v1).
// +kubebuilder:validation:MaxLength=256
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:XValidation:rule="self.size() == 0 || !format.dns1123Subdomain().validate(self).hasValue()",message="a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character."
Contributor

Per https://github.com/kubernetes/apiextensions-apiserver/blob/7022eab6487354c438a96189ec0aeae898ea60bf/pkg/apis/apiextensions/validation/validation.go#L410C29-L410C43

Suggested change
// +kubebuilder:validation:XValidation:rule="self.size() == 0 || !format.dns1123Subdomain().validate(self).hasValue()",message="a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character."
// +kubebuilder:validation:XValidation:rule="self.size() == 0 || !format.dns1035Label().validate(self).hasValue()",message="a DNS-1035 label must consist of lower case alphanumeric characters or '-', start with an alphabetic character, and end with an alphanumeric character."
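
A DNS-1035 label differs from an RFC 1123 label in that it must start with a letter, which matches version strings such as v1alpha1. The following standalone sketch approximates that check; the function name `isDNS1035Label` is hypothetical and the code does not call the Kubernetes validation package.

```go
package main

import (
	"fmt"
	"regexp"
)

// dns1035LabelRE: starts with a lower-case letter, ends with a lower-case
// alphanumeric, and contains only lower-case alphanumerics and hyphens.
var dns1035LabelRE = regexp.MustCompile(`^[a-z]([-a-z0-9]*[a-z0-9])?$`)

// isDNS1035Label reports whether s is a DNS-1035 label of at most 63 characters.
func isDNS1035Label(s string) bool {
	return len(s) <= 63 && dns1035LabelRE.MatchString(s)
}

func main() {
	for _, v := range []string{"v1", "v1alpha1", "v1beta1", "1v", "v1-"} {
		fmt.Printf("%q valid=%t\n", v, isDNS1035Label(v))
	}
}
```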

Version string `json:"version"`
}

type Integrations struct {
Contributor

Missing struct godoc

// hyphens and periods, of at most 253 characters in length.
// Each period separated segment within the subdomain must start and end with an alphanumeric character.
// The optional prefix and the name are separate by a forward slash (/).
// +kubebuilder:validation:MaxLength=317
Contributor

Is an empty key valid? I suspect we want minLength 1 on this?

Contributor

openshift-ci bot commented Mar 19, 2025

@kannon92: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/verify-crd-schema | 30a20f1 | link | true | /test verify-crd-schema |
| ci/prow/verify | 30a20f1 | link | true | /test verify |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
6 participants