[Flaky test] TopologyAwareScheduling when Creating a Pod requesting TAS should admit a single Pod via TAS #4140
Comments
cc @mbobrovskyi PTAL |
/assign |
I've tried to reproduce locally 30 times using the following script:
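A minimal sketch of such a repeat-run loop (the make target is an assumption; Ginkgo's --repeat / --until-it-fails flags, passed via GINKGO_ARGS, are an alternative):

```bash
#!/usr/bin/env bash
# Sketch only: re-run the e2e suite N times and stop on the first failure.
# The make target is an assumption; substitute whatever target/command you normally use.
set -euo pipefail
for i in $(seq 1 30); do
  echo "=== run $i ==="
  make test-e2e || { echo "failed on run $i"; exit 1; }
done
```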
and then again 100 times while running |
You can try to add --race to GINKGO_ARGS, but I'm not sure this will change much. Might be a tough one, as I haven't seen it repeat. I would also suggest exploring the logs related to the pods from the artifacts. First, identify the pod name, then grep the kube-apiserver, kube-scheduler and potentially kubelet logs (not sure if the pod was scheduled in this case at all). I did something similar here: #4495. EDIT: and grep the Kueue logs related to the pod. Feel free to post them in a comment. |
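For illustration, once the artifact logs are downloaded locally, that grep pass might look roughly like this (the file names are assumptions based on the artifact layout linked below; "test-pod" is the pod name used by the test):

```bash
# Sketch: look for the test pod in the downloaded control-plane logs.
# File names are assumptions; adjust them to match the downloaded artifacts.
grep -n "test-pod" kube-apiserver*.log kube-scheduler*.log kubelet.log
```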
Is there a way to download the logs from an artifact page like this one: https://gcsweb.k8s.io/gcs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/4131/pull-kueue-test-e2e-main-1-31/1886439313864921088/, instead of looking at them in the browser? |
Probably there is a way to download them as a package, but what I normally do is download the files I'm interested in one by one with curl, e.g.:
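A minimal sketch of such a one-off download, using one of the kube-apiserver log URLs from this thread:

```bash
# Download a single artifact log with curl; -L follows redirects, -o sets the local file name.
curl -L -o kube-apiserver.log \
  "https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/4131/pull-kueue-test-e2e-main-1-31/1886439313864921088/artifacts/run-test-e2e-1.31.0/kind-control-plane/containers/kube-apiserver-kind-control-plane_kube-system_kube-apiserver-8777d7573967876827dabd8bbdffee29b0d782120253f6c089291eec7d678a66.log"
```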
|
Grepping for "test-pod":

- no mention in the kube-apiserver container log: https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/4131/pull-kueue-test-e2e-main-1-31/1886439313864921088/artifacts/run-test-e2e-1.31.0/kind-control-plane/containers/kube-apiserver-kind-control-plane_kube-system_kube-apiserver-8777d7573967876827dabd8bbdffee29b0d782120253f6c089291eec7d678a66.log
- no mention in the kubelet log: https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/4131/pull-kueue-test-e2e-main-1-31/1886439313864921088/artifacts/run-test-e2e-1.31.0/kind-control-plane/kubelet.log
- 2 mentions in the kube-scheduler container log: https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/4131/pull-kueue-test-e2e-main-1-31/1886439313864921088/artifacts/run-test-e2e-1.31.0/kind-control-plane/containers/kube-scheduler-kind-control-plane_kube-system_kube-scheduler-811b661fb56c9f5b19a81637a5a27f49918041f9869a7f487a4d49963b4cf76c.log
- 48 mentions in the kube-apiserver pod log: https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/4131/pull-kueue-test-e2e-main-1-31/1886439313864921088/artifacts/run-test-e2e-1.31.0/kind-control-plane/pods/kube-system_kube-apiserver-kind-control-plane_279a78e3b8722226a896e74ee09ca078/kube-apiserver/0.log.20250203-155314
- 2 mentions in the kube-scheduler pod log: https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/4131/pull-kueue-test-e2e-main-1-31/1886439313864921088/artifacts/run-test-e2e-1.31.0/kind-control-plane/pods/kube-system_kube-scheduler-kind-control-plane_b9e437278b0333614df7d8d898b3fae0/kube-scheduler/0.log
|
Also, there is no mention in the kube-controller-manager log https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/4131/pull-kueue-test-e2e-main-1-31/1886439313864921088/artifacts/run-test-e2e-1.31.0/kind-control-plane/pods/kube-system_kube-controller-manager-kind-control-plane_a97c8ce14f20fa592e2406d83f45cd81/kube-controller-manager/0.log; I only found 1 namespace-related match (grep by
|
It seems the pod was never scheduled; wondering if this might be because there was no available node, or because the scheduling gate was not deleted. Can you also grep the Kueue logs? |
Can you clarify what kind of Kueue logs to grep? There are no log containers or pods that have "Kueue" in their names, and if I grep |
Actually, what is the difference between container and pod logs, in the context of the test? Here are the logs from worker1 — https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/4131/pull-kueue-test-e2e-main-1-31/1886439313864921088/artifacts/run-test-e2e-1.31.0/kind-worker/pods/kueue-system_kueue-controller-manager-778456579-czw45_2c3a6cbe-beb4-4eb5-9690-6c415fc03af7/manager/1.log
Here are the logs from worker2 — https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/4131/pull-kueue-test-e2e-main-1-31/1886439313864921088/artifacts/run-test-e2e-1.31.0/kind-worker2/pods/kueue-system_kueue-controller-manager-778456579-jxzvl_92460a63-f6e3-4984-807a-9e5fd6ebe0b9/manager/2.log
|
I think this is quite interesting and a couple of lines look suspicious. In particular the line saying the ClusterQueue is inactive. Maybe this is a race condition; in some other tests we wait explicitly for the CQ to be active. Compare whether this line is present in the analogous successful run. I believe this might be one of the first lines where the failed and successful runs diverge, but it is worth diffing them. |
Another thing drawing attention is the 10s gap in the logs. It looks like the test timeout; does it correlate with the test logs? In that case maybe we should just increase it to LongTimeout. This is what we mostly do in e2e tests, because lower down the logs suggest the workload was successfully admitted. |
I checked two tests so far with |
And yes, it correlates; the TAS test fails at 15:52:14:
and inadmissible workloads are 10s before that:
|
I can also see in the logs that it was admitted at the end; the difference between the last log line and the first is slightly less than 10 seconds (15:52:04.652 to 15:52:14.320). Does that mean it almost passed, but it was too late for the test check? |
Yeah, so it seems this ClusterQueue-inactive line is a sign of things going wrong. But it seems the CQ got active at some point, as the workload was admitted. I would speculate it got active due to the test starting to delete objects in AfterEach. So it seems there might be some very rare race condition where the CQ gets stuck in the inactive state. So I think the next step would be to grep the logs not for the Pod, because the issue started earlier, but for the controllers responsible for TAS, that is ClusterQueue, tas_resource_flavor and topology. |
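For instance, something along these lines over the downloaded manager log (the patterns are rough approximations of the controller names mentioned in this thread, and manager.log is a placeholder for the downloaded file):

```bash
# Sketch: pull the TAS-related controller lines out of the Kueue manager log.
# Patterns are approximations; adjust them to the logger names actually present.
grep -niE 'clusterqueue|resourceflavor|topology' manager.log
```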
tas-resource-flavor-controller logs:
Regarding the ClusterQueue and tas-topology-controller logs, there are quite a lot of matches; should I search for something in particular? Is it OK to discard logs before the
|
yeah, from when the test started till the 10s gap. |
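One simple way to restrict the manager log to that window, assuming the timestamps appear verbatim at the start of the matching lines:

```bash
# Sketch: keep only the lines between the test start and the beginning of the 10s gap.
# Replace <test-start-timestamp> with a timestamp that actually occurs in the log;
# 15:52:04 marks where the gap starts in this run.
sed -n '/<test-start-timestamp>/,/15:52:04/p' manager.log
```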
Here are the tas-topology-controller logs:
Here are ClusterQueue logs:
|
Thank you for digging through the logs! This one seems quite rare and tough to understand, but important. It seems that under some circumstances (order of events) a CQ using a TAS resource flavor may get stuck inactive.

So, the first thing to know about k8s is that, for our controllers in Kueue, the order of the Create statements in the test code does not really matter. What matters is the order in which the ADDED / Create events are delivered to the event handlers, like the ones for TAS RF, RF and Topology. Also, be aware that we have 2 controllers for RF (TAS and core); both receive the ADDED / Create events, and the order in which they receive them may differ. Also, on a loaded cluster the API server may severely delay (by up to seconds) delivering some of the events.

It seems in this case the order of processing the added events was indeed unusual:
Unfortunately we don't log the Create event for the TAS RF controller, but it seems it was delayed, because the reconcile happened 3s later. It would be great to also overlay the logs with the Create log for the core resource flavor controller. Maybe the actual source of the bug is that we have 2 controllers for RF rather than one. It seemed like a good idea from the decoupling point of view, but maybe they are racing occasionally. In any case, before we refactor I would like to understand the scenario and make it reproducible. To reproduce, one way is to inject time.Sleep(X seconds) into the first line of the Create handlers, and use different delays to simulate a different order of processing. For example, Sleep 1s in Topology, 2s in CQ, 3s in core RF, and 4s in TAS RF, to induce the observed order of processing. However, the suggestion is still speculative, so take it with caution :) |
resourceflavor_controller logs:
|
Notice that in the topology controller we didn't see this logged:
Which means that at the moment of processing the Topology event we didn't yet see the RF. The RF event was added later, but it is inside the Topology Create event that we trigger reconciling of the CQ:
This suggests that we called NotifyTopologyUpdateWatchers, but it didn't trigger activating the CQ, as the RF was not yet seen in the cache at that point. Wondering how to repro / confirm this reliably. |
I injected delays in the order they happened in the logs above (starting with 1s into Topology), and it failed: #4475 (comment). EDIT: but the other tests are also failing, including integration tests |
Right, which is a sign that some other tests are also timing-sensitive. However, this particular TAS test didn't fail, nor did any other TAS test. So, focusing on this test, it does not seem like a repro yet. I would probably focus on running this locally to have a shorter test cycle. TBH I'm not sure yet what the order of events needs to be to repro this. Something else that might be relevant here, and that is hard to control from our code, is the controller-runtime caches, which are responsible for serving client calls to List, e.g. in the topology controller when we list resource flavors. These caches have their own delays. |
I checked the logs from the successful run https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/4475/pull-kueue-test-e2e-main-1-31/1897218139498745856. Also ran more local tests with different combinations of delays; no failures so far. |
I also cannot find any lines that match |
Hm, grepping by |
I've realized it's better to compare with logs from a successful run from the same commit, like https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/4131/pull-kueue-test-e2e-main-1-31/1886444172156604416/artifacts/run-test-e2e-1.31.0/kind-worker/pods/kueue-system_kueue-controller-manager-778456579-nqrgl_1fa2fe13-4762-407e-b760-f5a31060fe20/manager/0.log. Then the same events are logged and the line numbers in the logs are the same. |
Here are the logs from the successful run, maybe they could be useful:
or with the lines related to workloads, expectations and local queues removed:
|
What could be useful is to restrict the successful logs to only the flaky test. Similarly, show the Kueue logs for the failed run, but only from those controllers. Then maybe diffing them would reveal something useful. |
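A rough sketch of such a comparison, assuming the timestamp is the first whitespace-separated field of each line and the two trimmed logs are saved locally (file names are placeholders):

```bash
# Sketch: strip the leading timestamps, then diff the two trimmed manager logs.
sed -E 's/^[^ ]+ //' success-manager.log > success.norm
sed -E 's/^[^ ]+ //' failed-manager.log  > failed.norm
diff -u success.norm failed.norm | less
```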
/kind flake
What happened:
The test failed on the related branch: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/4131/pull-kueue-test-e2e-main-1-31/1886439313864921088
What you expected to happen:
no failures
How to reproduce it (as minimally and precisely as possible):
run on CI
Anything else we need to know?: