CHANGELOG/CHANGELOG-0.7.md
+1 -1
Original file line number
Diff line number
Diff line change
@@ -72,7 +72,7 @@ Changes since `v0.6.0`:
72
72
- Add configuration to register Kinds as being managed by an external Kueue-compatible controller (#2059, @dgrove-oss)
73
73
- Add fair sharing when borrowing unused resources from other ClusterQueues in a cohort.
74
74
75
-
Fair sharing is based on DRF for usage above nominal quotas.
75
+
Fair Sharing is based on DRF for usage above nominal quotas.
76
76
When fair sharing is enabled, Kueue prefers to admit workloads from ClusterQueues with the lowest share first.
77
77
Administrators can enable and configure fair sharing preemption using a combination of two policies: `LessThanOrEqualtoFinalShare`, `LessThanInitialShare`.
README.md
+1 -1
@@ -18,7 +18,7 @@ Read the [overview](https://kueue.sigs.k8s.io/docs/overview/) and watch the Kueu
18
18
## Features overview
19
19
20
20
-**Job management:** Support job queueing based on [priorities](https://kueue.sigs.k8s.io/docs/concepts/workload/#priority) with different [strategies](https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/#queueing-strategy): `StrictFIFO` and `BestEffortFIFO`.
21
-
-**Advanced Resource management:** Comprising: [resource flavor fungibility](https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/#flavorfungibility), [fair sharing](https://kueue.sigs.k8s.io/docs/concepts/preemption/#fair-sharing), [cohorts](https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/#cohort) and [preemption](https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/#preemption) with a variety of policies between different tenants.
21
+
-**Advanced Resource management:** Comprising: [resource flavor fungibility](https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/#flavorfungibility), [Fair Sharing](https://kueue.sigs.k8s.io/docs/concepts/preemption/#fair-sharing), [cohorts](https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/#cohort) and [preemption](https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/#preemption) with a variety of policies between different tenants.
22
22
-**Integrations:** Built-in support for popular jobs, e.g. [BatchJob](https://kueue.sigs.k8s.io/docs/tasks/run/jobs/), [Kubeflow training jobs](https://kueue.sigs.k8s.io/docs/tasks/run/kubeflow/), [RayJob](https://kueue.sigs.k8s.io/docs/tasks/run/rayjobs/), [RayCluster](https://kueue.sigs.k8s.io/docs/tasks/run/rayclusters/), [JobSet](https://kueue.sigs.k8s.io/docs/tasks/run/jobsets/), [plain Pod and Pod Groups](https://kueue.sigs.k8s.io/docs/tasks/run/plain_pods/).
23
23
-**System insight:** Build-in [prometheus metrics](https://kueue.sigs.k8s.io/docs/reference/metrics/) to help monitor the state of the system, and on-demand visibility endpoint for [monitoring of pending workloads](https://kueue.sigs.k8s.io/docs/tasks/manage/monitor_pending_workloads/pending_workloads_on_demand/).
24
24
-**AdmissionChecks:** A mechanism for internal or external components to influence whether a workload can be [admitted](https://kueue.sigs.k8s.io/docs/concepts/admission_check/).
keps/1714-fair-sharing/README.md
+21 -21
@@ -27,7 +27,7 @@
27
27
<!-- /toc -->
28
28
29
29
## Summary
30
-
This KEP introduces weight-based fair sharing of unused resources across
30
+
This KEP introduces weight-based Fair Sharing of unused resources across
31
31
needing ClusterQueues, respecting borrowing and lending limits, multi-level
32
32
hierarchy and preferences of cohorts to distribute unused resources
33
33
internally.
@@ -44,7 +44,7 @@ use the highest one, breaking the whole idea altogether). Thus a more comprehens
44
44
solution of distributing unused resources is needed.
45
45
46
46
### Goals
47
-
* Create a mechanism to enforce fair sharing of resources. Two equally important
47
+
* Create a mechanism to enforce Fair Sharing of resources. Two equally important
48
48
sub-organizations (Cohorts or ClusterQueues), placed in the similar spots in the
49
49
whole organization (hierarchy of Cohorts), actively competing for the same
50
50
resources (having workloads needing more than nominal quota), should
@@ -59,10 +59,10 @@ under the same parent.
59
59
sub-organization and only after they have been fulfilled, proceed with distribution
60
60
outside of the suborganization.
61
61
62
-
* When enforcing fair sharing, ignore workload priorities unless:
62
+
* When enforcing Fair Sharing, ignore workload priorities unless:
63
63
64
64
* The workload's priority is above admin-defined high priority. Super high
65
-
priority workloads overrule fair sharing and are treated according to KEP [#1337](https://github.com/kubernetes-sigs/kueue/tree/main/keps/1337-preempt-within-cohort-while-borrowing).
65
+
priority workloads overrule Fair Sharing and are treated according to KEP [#1337](https://github.com/kubernetes-sigs/kueue/tree/main/keps/1337-preempt-within-cohort-while-borrowing).
66
66
67
67
* There is a need to preempt some non top priority workload from a ClusterQueue.
68
68
Then the lowest priority workloads from a CQ that is over its fair share should
@@ -76,30 +76,30 @@ in particular:
76
76
* Guaranteed/nominal quota
77
77
* Hierarchical cohorts
78
78
79
-
* Fair sharing should not limit Kueue scalability. Kueue, with fair sharing enabled,
79
+
* Fair Sharing should not limit Kueue scalability. Kueue, with Fair Sharing enabled,
80
80
should be able to handle >1k ClusterQueues, >100 Cohorts and >10k workloads (that
81
81
are either running or queued) within a single hierarchical organization.
82
82
83
83
* The proposed system should be hard to game, for example by creating big workloads
84
84
that consume all capacity.
85
85
86
-
* Fair sharing enforcement should not significantly decrease overall utilization,
86
+
* Fair Sharing enforcement should not significantly decrease overall utilization,
87
87
however, pathological situations (like a single workload consuming all otherwise
88
-
unused capacity) should be resolved in favor of fair sharing than maximizing the
88
+
unused capacity) should be resolved in favor of Fair Sharing than maximizing the
89
89
utilization (the big greedy workload should be preempted to admit smaller
90
90
workloads from other CQ that consume only their fair share).
91
91
92
92
### Non-Goals
93
93
* Use historical data (for example CQ A used a lot of shared capacity for the last
94
94
week, so now it should get less because others, who didn't need anything then, have
95
-
pending workloads). Fair sharing should be based on point-in-time situation, although,
96
-
ideally it should be expandable to support history-based fair sharing(for example with #26)
95
+
pending workloads). Fair Sharing should be based on point-in-time situation, although,
96
+
ideally it should be expandable to support history-based Fair Sharing(for example with #26)
97
97
without major redesign.
98
98
99
-
* Enable fair sharing only for some part of the resources or Cohort hierarchy.
100
-
Fair sharing will be a global switch (at least initially).
99
+
* Enable Fair Sharing only for some part of the resources or Cohort hierarchy.
100
+
Fair Sharing will be a global switch (at least initially).
101
101
102
-
* Maximize utilization at the cost of fair sharing.
102
+
* Maximize utilization at the cost of Fair Sharing.
103
103
104
104
## Proposal
105
105
@@ -108,7 +108,7 @@ on top of the given nominal quota, that doesn't justify "complains" against
108
108
any other similar CQ about excessive extra resources that CQ was given. Basically
109
109
the sharing of unused resources is fair, if no CQ can say it is grossly unfair.
110
110
111
-
Introduce a global fair sharing mechanism that is based on preemptions. As long
111
+
Introduce a global Fair Sharing mechanism that is based on preemptions. As long
112
112
as there are some free and accessible resources in the cohort hierarchy, Kueue
113
113
will admit workloads without any limits. However, once the capacity is gone,
114
114
new workloads from Cohorts/CQ that have not received their fair share yet will
@@ -126,7 +126,7 @@ We will add an optional weight field to both Cohorts and CQ. The weight will
126
126
indicate how to fair share resources between sub-organizations (CQs or Cohorts)
127
127
under the same Cohort.
128
128
129
-
Fair sharing will be configured for the whole cluster, using the configuration
129
+
Fair Sharing will be configured for the whole cluster, using the configuration
130
130
file. In Alpha it will be just a feature gate.
131
131
132
132
### User Stories (Optional)
@@ -176,13 +176,13 @@ to execute. I want to distribute the resources from CS in the following way:
176
176
177
177
### Risks and Mitigations
178
178
179
-
* Fair sharing may increase the number of preemptions in the system vs current
179
+
* Fair Sharing may increase the number of preemptions in the system vs current
180
180
state where the first workload to acquire unused resources keeps them until it
181
181
finishes. Mitigations include:
182
-
* Introduce minimum execution time before workloads can be preempted for fair sharing.
182
+
* Introduce minimum execution time before workloads can be preempted for Fair Sharing.
183
183
* Introduce delayed fair share enforcement - new workloads have to wait a bit before preempting others to get their share.
184
184
185
-
* Fair sharing may decrease utilization of the unused resources while attempting to
185
+
* Fair Sharing may decrease utilization of the unused resources while attempting to
186
186
distribute them fairly, vs provide the tighties bin-packing. To avoid these scenarios,
187
187
users should prefer to run massive workloads under nominal quotas.
188
188
@@ -255,7 +255,7 @@ When a CQ x fails to admit a workload w, one of the following scenarios may occu
255
255
256
256
*[S2] A sub-organization to which CQ x belongs is borrowing, however it seems that it is
257
257
borrowing too little compared to other sub-organizations that are also borrowing, so some
258
-
action is needed to enforce fair sharing. We should compare how much the sub-orgs are
258
+
action is needed to enforce Fair Sharing. We should compare how much the sub-orgs are
259
259
borrowing compared to each other, and preempt some workloads up to the point when its fair
260
260
share would be smaller than the sub-org for which preemptions are executed.
261
261
@@ -291,7 +291,7 @@ For each workload z in y we check whether if:
291
291
292
292
[S2-a] value of AlmostLCA(y,x) **without z** is still higher (or equal) than value of AlmostLCA(x,y)
293
293
with admitted workload w. Y’s sub-orgs will still be better than X’s sub-org after we
294
-
preempt z and admit w, thus z is a reasonable candidate to re-balance fair sharing.
294
+
preempt z and admit w, thus z is a reasonable candidate to re-balance Fair Sharing.
295
295
296
296
297
297
[S2-b] value of AlmostLCA(y,x) (with z) is strictly higher than AlmostLCA(x,y) with admitted workload w.
@@ -355,8 +355,8 @@ This is a very complex feature so there will be lots of unit, integration and e2
355
355
#### Alpha
356
356
357
357
The alpha will be split among two releases.
358
-
Release v0.7 will only implement fair sharing in the existing flat structure (cohorts don’t have parents).
359
-
Release v0.8 will incorporate fair sharing with arbitrary hierarchies (KEP #79)
358
+
Release v0.7 will only implement Fair Sharing in the existing flat structure (cohorts don’t have parents).
359
+
Release v0.8 will incorporate Fair Sharing with arbitrary hierarchies (KEP #79)
site/content/en/docs/overview/_index.md
+2 -2
@@ -17,15 +17,15 @@ You can install Kueue on top of a vanilla Kubernetes cluster. Kueue does not rep
17
17
18
18
Kueue APIs allow you to express:
19
19
20
-
* Quotas and policies for fair sharing among tenants.
20
+
* Quotas and policies for Fair Sharing among tenants.
21
21
* Resource fungibility: if a resource flavor is fully utilized, Kueue can admit the job using a different flavor.
22
22
23
23
A core design principle for Kueue is to avoid duplicating mature functionality in Kubernetes components and well-established third-party controllers. Autoscaling, pod-to-node scheduling and job lifecycle management are the responsibility of cluster-autoscaler, kube-scheduler and kube-controller-manager, respectively. Advanced admission control can be delegated to controllers such as gatekeeper.
24
24
25
25
## Features overview
26
26
27
27
-**Job management:** Support job queueing based on [priorities](/docs/concepts/workload/#priority) with different [strategies](/docs/concepts/cluster_queue/#queueing-strategy): `StrictFIFO` and `BestEffortFIFO`.
28
-
-**Advanced Resource management:** Comprising: [resource flavor fungibility](/docs/concepts/cluster_queue/#flavorfungibility), [fair sharing](/docs/concepts/preemption/#fair-sharing), [cohorts](/docs/concepts/cluster_queue/#cohort) and [preemption](/docs/concepts/cluster_queue/#preemption) with a variety of policies between different tenants.
28
+
-**Advanced Resource management:** Comprising: [resource flavor fungibility](/docs/concepts/cluster_queue/#flavorfungibility), [Fair Sharing](/docs/concepts/preemption/#fair-sharing), [cohorts](/docs/concepts/cluster_queue/#cohort) and [preemption](/docs/concepts/cluster_queue/#preemption) with a variety of policies between different tenants.
29
29
-**Integrations:** Built-in support for popular jobs, e.g. [BatchJob](/docs/tasks/run/jobs/), [Kubeflow training jobs](/docs/tasks/run/kubeflow/), [RayJob](/docs/tasks/run/rayjobs/), [RayCluster](/docs/tasks/run/rayclusters/), [JobSet](/docs/tasks/run/jobsets/), [AppWrappers](/docs/tasks/run/appwrappers/), [plain Pod and Pod Groups](/docs/tasks/run/plain_pods/).
30
30
-**System insight:** Build-in [prometheus metrics](/docs/reference/metrics/) to help monitor the state of the system, and on-demand visibility endpoint for [monitoring of pending workloads](/docs/tasks/manage/monitor_pending_workloads/pending_workloads_on_demand/).
31
31
-**AdmissionChecks:** A mechanism for internal or external components to influence whether a workload can be [admitted](/docs/concepts/admission_check/).
site/content/en/docs/reference/metrics.md
+1 -1
@@ -37,7 +37,7 @@ Use the following metrics to monitor the status of your ClusterQueues:
37
37
|`kueue_cluster_queue_status`| Gauge | Reports the status of the ClusterQueue |`cluster_queue`: The name of the ClusterQueue<br> `status`: Possible values are `pending`, `active` or `terminated`. For a ClusterQueue, the metric only reports a value of 1 for one of the statuses. |
38
38
|`kueue_reserving_active_workloads`| Gauge | The number of Workloads that are reserving quota, per `cluster_queue`. |`cluster_queue`: the name of the ClusterQueue |
39
39
|`kueue_admission_cycle_preemption_skips`| Gauge | The number of Workloads in the ClusterQueue that got preemption candidates but had to be skipped because other ClusterQueues needed the same resources in the same cycle |`cluster_queue`: the name of the ClusterQueue |
40
-
|`kueue_preempted_workloads_total`| Counter | The number of preempted workloads per `preempting_cluster_queue`|`preempting_cluster_queue`: the name of the ClusterQueue<br> `reason`: possible values are `InClusterQueue` means that the workload was preempted by a workload in the same ClusterQueue; `InCohortReclamation` means that the workload was preempted by a workload in the same cohort due to reclamation of nominal quota; `InCohortFairSharing` means that the workload was preempted by a workload in the same cohort due to fair sharing; `InCohortReclaimWhileBorrowing` means that the workload was preempted by a workload in the same cohort due to reclamation of nominal quota while borrowing |
40
+
|`kueue_preempted_workloads_total`| Counter | The number of preempted workloads per `preempting_cluster_queue`|`preempting_cluster_queue`: the name of the ClusterQueue<br> `reason`: possible values are `InClusterQueue` means that the workload was preempted by a workload in the same ClusterQueue; `InCohortReclamation` means that the workload was preempted by a workload in the same cohort due to reclamation of nominal quota; `InCohortFairSharing` means that the workload was preempted by a workload in the same cohort due to Fair Sharing; `InCohortReclaimWhileBorrowing` means that the workload was preempted by a workload in the same cohort due to reclamation of nominal quota while borrowing |