Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicitly specify the integrations that are allowed as the type. #4676

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

kannon92
Copy link
Contributor

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Introduce an enum for integrations and propagated that usage throughout the codebase.

This can reduce bugs and allow for people to know what kind of strings are expected for built in frameworks.

Which issue(s) this PR fixes:

Fixes #4659

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Mar 18, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kannon92
Once this PR has been reviewed and has the lgtm label, please assign mimowo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from mimowo and PBundyra March 18, 2025 15:30
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 18, 2025
Copy link

netlify bot commented Mar 18, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 3905888
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/67d9e062eac2d70008235db8

@kannon92
Copy link
Contributor Author

/hold

I think there is one deepcopy fix I gotta do and I want to make sure this passes in CI.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 18, 2025
@@ -245,7 +245,7 @@ func (m *integrationManager) isOwnerIntegrationEnabled(owner *metav1.OwnerRefere
// RegisterIntegration registers a new framework, returns an error when
// attempting to register multiple frameworks with the same name or if a
// mandatory callback is missing.
func RegisterIntegration(name string, cb IntegrationCallbacks) error {
func RegisterIntegration(name configapi.KueueIntegrations, cb IntegrationCallbacks) error {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I correct in thinking that every integration framework has to call this?

Could we add a check on the names here to check that the names are Qualified Names so that we can assert in the future that new integrations follow the correct pattern?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can.

This is not validation code though. I think putting your suggestion here would make more sense.

https://github.com/kubernetes-sigs/kueue/blob/main/pkg/config/validation.go#L171

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not users I want to find this error, its the developer adding a new integration

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think adding some sanity check for developers is fine.

Additionally, in the validation.go code we could verify that the values supplied by the user are among the registered integrations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit opague in the validation logic but we do check if the integrations are supported.

Comment on lines 392 to 407
const (
BatchJob KueueIntegrations = "batch/job"
RayJob KueueIntegrations = "ray.io/rayjob"
RayCluster KueueIntegrations = "ray.io/raycluster"
JobSet KueueIntegrations = "jobset.x-k8s.io/jobset"
MPIJob KueueIntegrations = "kubeflow.org/mpijob"
PaddleJob KueueIntegrations = "kubeflow.org/paddlejob"
PyTorchJob KueueIntegrations = "kubeflow.org/pytorchjob"
TFJob KueueIntegrations = "kubeflow.org/tfjob"
XGBoostJob KueueIntegrations = "kubeflow.org/xgboostjob"
AppWrappers KueueIntegrations = "workload.codeflare.dev/appwrapper"
Pod KueueIntegrations = "pod"
Deployment KueueIntegrations = "deployment"
Statefulset KueueIntegrations = "statefulset"
LeaderWorkerSet KueueIntegrations = "leaderworkerset.x-k8s.io/leaderworkerset"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the objective for these? As #4659 (comment), third-party operator does not want to use Kueue as a library, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use the configuration as a library.

So we want to be able to know on each release of Kueue what are the allowed list of integrations?

So how can we upgrade from v0.9 -> v0.10 safely and know what kind of integrations are new/or dropped.

So I think we need to import the kueue configuration library so we know what kind of frameworks are actually allowed.

Copy link
Member

@tenzen-y tenzen-y Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, do you want to use Kueue as a library, right? Because you import this into your project, a whole of dependencies is added to your project.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I see some benefit from strong typing like type IntegrationReference string or type BuiltInIntegrationReference string.

However, putting them all into one place has pros and cons. Pros is that you can see all integrations in one place (though we maintain them in comments too). Cons is that this is monolitic design. I don't have a strong view, but something feels off.

Still, I like the idea of strong typing as we do for other API types - ClusterQueueReference, TopologyReference, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that comment, it said that we want to import the kueue api. That is a standard process for building operators.

Kueue imports the k8s api so they can create k8s objects. I imagine users will import the Kueue API if they need to create CQ/RF.

They don't need to import anything other than the APIs which is what our operator does.

I have an operator that will maange the installation of Kueue and I want to create the necessary kueue configurations. We know that typically cluster admins will want to selectively enable/disable certain Kueue integrations. So we want to make it easy via documentation what are the allowed values of the integrations.

I am fine to relax restrictions for external frameworks but built in frameworks need to have some structure.

Right now the list of kueue configurations is a list of strings. We can best effort guess what Kueue will support for these integrations but

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but I'm not sure putting all integration names in the monolitic style is the best practice actually.

For example, in the core k8s, we don't have a file with all supported labels & annotations - they are scattered arcoss api packages.

Similarly, I think there is no single file listing all GVK resources supported by k8s, so we have corev1.SchemeGroupVersion, or batch.SchemeGroupVersion.

This approach is more modular.

Again, it is not a strong view, because for example in the core k8s you have a single file with all feature gates. So, I'm on the fence here.

I'm convinced about strong-typing the integration names.

For users, to know supported versions, there are alternatives, for example Kueue could log them all at startup, or we could expose a helper function which exposes the list of registered ones.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For users, to know supported versions, there are alternatives, for example Kueue could log them all at startup, or we could expose a helper function which exposes the list of registered ones.

Most users would probably not have access to this namespace. And most users would probably not be aware to check the kueue controller logs to know this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help by defining who the "user" is here?

I assume it is a developer of the external controller. So, I think for such a user it does not make much difference between checking Kueue logs or Kueue code for this purpose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My issue is that it isn’t really obvious in the code what are the supported integrations.

I rely on the comment in the configuration list as the canonical list. that seems error prone and can easily be not updated.

also we aren’t exactly sure what requirements there are on frameworks. Are they a qualified name? Is the domaain requirements for any specific reason?

what happens if someone species Pod instead of pod? It’s really a list of enums that must be those specific values.

a user is a cluster admin installing Kueue with an operator. But someone configur

@kannon92 kannon92 force-pushed the shift-left-integration-validation branch from bddad7c to 171875f Compare March 18, 2025 17:28
@kannon92 kannon92 force-pushed the shift-left-integration-validation branch from 171875f to 96064b4 Compare March 18, 2025 18:47
@@ -387,6 +387,25 @@ type Integrations struct {
LabelKeysToCopy []string `json:"labelKeysToCopy,omitempty"`
}

type IntegrationReference string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please extract this to a separate PR, as we agree on strong typing, but we are yet not sure about putting all references together.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can’t really decouple this.

I can give a PR with just this type but it won’t build because we use the configiguretion list all over the repo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just use in these places the new type instead of string?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be more precise, just replace

defaultJobFrameworkName                             = "batch/job"

with

defaultJobFrameworkName  kueue.IntegrationReference = "batch/job"

and so on in other places, not sure why this would not compile

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is what this PR does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reference to the framework must match the enum?

so if I use “batch/job” I would need to switch to use the strongly typed constant.

That was why I ended up changing all that I needed. I first added the configuration values and then I see that list is passed as a string slice in a few places.

we loop over those values and then I need to switch to use the reference.

The cycle continues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the "minimal" change. #4697

In any place like jobframework that takes a string or is passed in our config list, I did a transformation to make it fit into a string array.

This allows me to avoid changing it all places but it does require me to convert it in a few places so we don't propogate this change everywhere.

@mimowo
Copy link
Contributor

mimowo commented Mar 19, 2025

/hold
To gain consensus of the approach of exposing the list of supported controllers, as discussed in #4676 (comment).

In the meanwhile I'm happy to merge part of the PR which is to introduce strong typing by IntegrationReference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allowed values of integrations
6 participants