Jobs fail midway or are not picked up by runners #4408
For more context, here are the logs from one of the jobs that failed midway. Failed GitHub Actions job:
and that's it, I couldn't find any other logs, e.g. no mentions in the scale-down logs.
Two follow-ups:
The error
@guicaulada thanks, clear! Do I understand correctly that the spot termination watcher just produces logs and doesn't resubmit the terminated job? I also tried
It doesn't handle spot terminations for ephemeral runners, it just produces logs. I couldn't find a way to handle them properly: the runner has already picked up the job and is already running it, and I don't think there's a way to receive a spot termination notification and "pause" the job to transfer it to a different runner. Check https://aws.amazon.com/ec2/spot/instance-advisor/ to find the best region for your machine types, balancing spot termination frequency and cost. Another option is to create an "on-demand" runner type with
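A minimal sketch of what such an on-demand runner type could look like, assuming the module's multi-runner configuration and its documented `instance_target_capacity_type` setting; the runner key, labels, and instance types below are placeholders, not the author's actual values:

```hcl
# Hypothetical extra entry in multi_runner_config: same runner setup as the
# spot-backed one, but launched as on-demand capacity so jobs are not lost to
# spot interruptions. Adjust labels and instance types to your own setup.
multi_runner_config = {
  "linux-x64-ondemand" = {
    matcherConfig = {
      labelMatchers = [["self-hosted", "linux", "x64", "on-demand"]]
      exactMatch    = true
    }
    runner_config = {
      runner_os                     = "linux"
      runner_architecture           = "x64"
      instance_types                = ["m5.2xlarge", "m5a.2xlarge"]
      instance_target_capacity_type = "on-demand" # module default is "spot"
      enable_ephemeral_runners      = true
    }
  }
}
```

Jobs that must not be interrupted can then target the extra label (e.g. `runs-on: [self-hosted, linux, x64, on-demand]`), while everything else stays on the cheaper spot-backed runner type.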
We build ~50 Docker images using GitHub Actions matrix jobs and ephemeral runners, and around 30-50% of jobs consistently fail midway or end up not being picked up by runners. The jobs that fail usually have this error:
The self-hosted runner: i-X lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
First we thought it was due to high load, e.g. Docker builds taking up all cores and making an instance unresponsive, but after increasing instance sizes and limiting the cores used we still have this issue. Here's our module configuration, are there any config issues? How would I diagnose something like this?
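One way to start diagnosing this (a sketch, assuming the terraform-aws-github-runner module discussed in this thread; verify the option names against the module version in use) is to make sure the runner instance logs are shipped to CloudWatch, so they can still be read after the instance has disappeared:

```hcl
# Sketch: forward runner instance logs (user data, runner service output) to
# CloudWatch so a "lost communication" failure can be investigated even after
# the instance has been terminated or reclaimed as spot capacity.
enable_cloudwatch_agent = true

# Keep the log groups around long enough to investigate failed jobs.
logging_retention_in_days = 30
```

With those logs available, a "lost communication" error can usually be narrowed down to either a spot interruption (the instance disappears mid-job, as discussed above) or the runner process itself being killed or starved on the instance.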