Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor/stealing #8271

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Refactor/stealing #8271

wants to merge 3 commits into from

Conversation

Vesyrak
Copy link

@Vesyrak Vesyrak commented Oct 13, 2023

Refactors work stealing to facilitate external algorithms. Functionally it should be the same.

This Pull Request enables more readable code, reduces side-effects and further facilitates external algorithms. It allows me to re-define which workers should be considered thieves or victims, and when to steal from these without impacting too much code.

I do recommend a second refactor taking out the WorkStealing dependency from the WorkStealingBalancer.
I recommend going for a more functional design of the class, where it returns the (recommended) movements of the tasks rather than executing them whilst stealing.
However, subsequent "thefts" depend on the state of the cluster with the previously stolen tasks. This requires wrapping the occupancy/processing/... metric calculations so that the recommended-but-not-yet-stolen tasks' new locations are considered when stealing another task. I do not yet possess the knowledge of how this works on the scheduler itself, that is why it is missing from the current PR.

I am willing to add more tests, as soon as the pipelines for tests work properly.

  • [ 0 / ? ] Tests added / passed
    • I get a timeout on test_blocks_until_full, with a side dish of Too many open files
  • [ x ] Passes pre-commit run --all-files

@Vesyrak Vesyrak requested a review from fjetter as a code owner October 13, 2023 11:56
@GPUtester
Copy link
Collaborator

Can one of the admins verify this patch?

Admins can comment ok to test to allow this one PR to run or add to allowlist to allow all future PRs from the same author to run.

@github-actions
Copy link
Contributor

github-actions bot commented Oct 13, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    27 files  ± 0      27 suites  ±0   9h 57m 58s ⏱️ + 4m 6s
 3 995 tests ± 0   3 881 ✅  -  2    110 💤 ±0  4 ❌ +2 
50 241 runs  +18  47 939 ✅ +17  2 293 💤  - 6  9 ❌ +7 

For more details on these failures, see this check.

Results for commit 297eb88. ± Comparison against base commit 1211e79.

♻️ This comment has been updated with latest results.

@hendrikmakait hendrikmakait self-requested a review January 9, 2024 16:57
@hendrikmakait hendrikmakait added the needs review Needs review from a contributor. label Jan 9, 2024
@Vesyrak Vesyrak force-pushed the refactor/stealing branch 2 times, most recently from 49a52d1 to 9d80eab Compare February 14, 2024 09:45
@Vesyrak
Copy link
Author

Vesyrak commented Feb 14, 2024

Could I get some support regarding the test pipelines?
I'm not able to get the tests to succeed locally (neither on barebones OS level (macos) nor in an ubuntu container (arm64 architecture though).

Reproducing the github pipelines using act fails as well, and I can't even get the pipelines to work in my fork (https://github.com/Vesyrak/distributed/actions/runs/7902375846/job/21567935269).

I already merged the fix for the existing issue with the pipelines (#8504), so that's not the cause.

Thanks in advance!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs review Needs review from a contributor.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants