
Jobs fail midway or not get picked up by runners #4408

Open

the-lay opened this issue Feb 6, 2025 · 5 comments

the-lay commented Feb 6, 2025

We build ~50 Docker images using GitHub Actions matrix jobs and ephemeral runners, and around 30-50% of the jobs consistently fail midway or never get picked up by runners. The jobs that fail usually show this error: The self-hosted runner: i-X lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

At first we thought it was due to high load, e.g. Docker builds taking up all cores and making an instance unresponsive, but after increasing instance sizes and limiting the cores used we still have this issue. Here's our module configuration; are there any config issues? How would I diagnose something like this?

module "github_actions_runners" {
  # https://github.com/github-aws-runners/terraform-aws-github-runner
  source  = "github-aws-runners/github-runner/aws"
  version = "6.2.1"

  aws_region = var.region
  vpc_id     = module.X.vpc_id
  subnet_ids = module.X.public_subnet_ids

  prefix                = "github-actions-runners"
  role_path             = "/internal/"
  instance_profile_path = "/internal/"

  enable_organization_runners                 = true                              # register runners at the organization level
  enable_runner_on_demand_failover_for_errors = ["InsufficientInstanceCapacity"]  # retry as on-demand when spot creation fails with this error

  instance_types        = ["c6i.4xlarge", "c6i.8xlarge"]
  runners_maximum_count = -1  # no upper limit on concurrently running runners

  block_device_mappings = [{
    ...
  }]

  github_app = {
    key_base64     = "X"
    id             = "X"
    webhook_secret = "X"
  }

  enable_ssm_on_runners           = true  # allow access to runner instances via SSM
  create_service_linked_role_spot = true  # create the service-linked role required for spot requests
  delay_webhook_event             = 0     # no delay before processing webhook events
  scale_down_schedule_expression  = "cron(*/5 * * * ? *)"  # run the scale-down check every 5 minutes
  minimum_running_time_in_minutes = 10    # minimum runtime before scale-down may terminate an instance
  enable_ephemeral_runners        = true  # one job per runner, instance terminated afterwards
  enable_job_queued_check         = true  # only scale up if the job is still queued on GitHub

  runner_binaries_s3_versioning            = "Enabled"
  lambda_s3_bucket                         = aws_s3_bucket.X.bucket
  ami_housekeeper_lambda_s3_key            = "github-runners/ami-housekeeper.zip"
  ami_housekeeper_lambda_s3_object_version = "X"
  runners_lambda_s3_key                    = "github-runners/runners.zip"
  runners_lambda_s3_object_version         = "X"
  syncer_lambda_s3_key                     = "github-runners/runner-binaries-syncer.zip"
  syncer_lambda_s3_object_version          = "X"
  webhook_lambda_s3_key                    = "github-runners/webhook.zip"
  webhook_lambda_s3_object_version         = "X"
}

the-lay commented Feb 6, 2025

For more context, here are the logs for one of the jobs that failed midway.

Failed github actions job:

The self-hosted runner: i-04b6e2bd413cd7c41 lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

/aws/lambda/github-actions-runners-scale-up:

{"level":"INFO","message":"Created instance(s): i-04b6e2bd413cd7c41","sampling_rate":0,"service":"runners-scale-up","timestamp":"2025-02-06T12:29:02.975Z","xray_trace_id":"1-67a4ab0b-ebc0b919f423d0b466c80283","region":"X","environment":"github-actions-runners","module":"runners","aws-request-id":"X","function-name":"github-actions-runners-scale-up","runner":{"type":"Org","owner":"X","namePrefix":""},"github":{"event":"workflow_job","workflow_job_id":"36784250582"}} 

/github-self-hosted-runners/github-actions-runners/user_data/i-04b6e2bd413cd7c41:

...
2025-02-06 12:29:53Z: Listening for Jobs
2025-02-06 12:29:57Z: Running job: 📦 Build (X)
2025-02-06 12:34:12Z: Terminated
2025-02-06 12:34:12Z: ERROR: runner-start-failed with exit code 143 occurred on 1

/github-self-hosted-runners/github-actions-runners/runner/i-04b6e2bd413cd7c41:

...

[2025-02-06 12:29:57Z INFO JobNotification] Entering StartMonitor
[2025-02-06 12:30:57Z INFO JobDispatcher] Successfully renew job request 47361, job is valid till 02/06/2025 12:40:57
[2025-02-06 12:31:57Z INFO JobDispatcher] Successfully renew job request 47361, job is valid till 02/06/2025 12:41:57
[2025-02-06 12:32:57Z INFO JobDispatcher] Successfully renew job request 47361, job is valid till 02/06/2025 12:42:57
[2025-02-06 12:33:57Z INFO JobDispatcher] Successfully renew job request 47361, job is valid till 02/06/2025 12:43:57

And that's it; I couldn't find any other logs, e.g. no mentions in the scale-down logs.


the-lay commented Feb 6, 2025

Two follow-ups:

  1. Setting enable_job_queued_check = false fixed jobs not getting picked up.
  2. The reason instances fail midway is spot instance capacity: the spot requests show that those instances get terminated with status instance-terminated-no-capacity after a minute or two. Is there a way to catch those events (see the sketch after this list)? It doesn't seem like enable_runner_on_demand_failover_for_errors can handle this for ephemeral runners.
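
For reference, spot interruption warnings can, I believe, be caught outside the module with an EventBridge rule; a minimal sketch (resource names are illustrative, and a very short-lived instance may not receive the full two-minute warning):

```hcl
# Sketch: route EC2 spot interruption warnings to an SNS topic so terminations
# of runner instances are at least visible. Names are illustrative; an SNS topic
# policy allowing events.amazonaws.com to publish is also needed (omitted).
resource "aws_cloudwatch_event_rule" "spot_interruption" {
  name = "github-runners-spot-interruption"
  event_pattern = jsonencode({
    source      = ["aws.ec2"]
    detail-type = ["EC2 Spot Instance Interruption Warning"]
  })
}

resource "aws_sns_topic" "spot_interruption" {
  name = "github-runners-spot-interruption"
}

resource "aws_cloudwatch_event_target" "spot_interruption" {
  rule = aws_cloudwatch_event_rule.spot_interruption.name
  arn  = aws_sns_topic.spot_interruption.arn
}
```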

@guicaulada

The error The self-hosted runner: i-04b6e2bd413cd7c41 lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error. probably means your runner got spot terminated; you can confirm this with the spot termination watcher.
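
If you want it wired in through the module, a minimal sketch of enabling it (assuming a module version that exposes the instance_termination_watcher variable; check the docs of your version for the exact attribute names):

```hcl
module "github_actions_runners" {
  source  = "github-aws-runners/github-runner/aws"
  version = "6.2.1"

  # ... existing configuration ...

  # Assumption: this (beta) feature logs spot interruptions/terminations of
  # runner instances; it does not re-queue the affected job.
  instance_termination_watcher = {
    enable = true
  }
}
```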


the-lay commented Mar 13, 2025

@guicaulada thanks, that's clear! Do I understand correctly that the spot termination watcher just produces logs and doesn't resubmit the terminated job? I also tried enable_runner_on_demand_failover_for_errors, but it doesn't seem like it can handle spot termination on ephemeral runners.

@guicaulada

It doesn't handle spot terminations for ephemeral runners; it just produces logs. I couldn't find a way to handle them properly: the runner has already picked up the job and is running it, and I don't think there's a way to receive a spot termination notification and "pause" the job to transfer it to a different runner.

Check https://aws.amazon.com/ec2/spot/instance-advisor/ to find the region and instance types with the best trade-off between spot interruption frequency and cost for your workloads. Another option is to create an "on-demand" runner type with instance_target_capacity_type = "on-demand" and assign critical workflows, or workflows that take a long time, to that runner type. Long-running workflows have a higher probability of being spot-interrupted, and they also cost the most when interrupted, since all the CI/infrastructure minutes already spent are wasted.
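
A minimal sketch of such an on-demand pool as a second module instance (prefix and label are illustrative; double-check variable names against your module version):

```hcl
module "github_actions_runners_ondemand" {
  source  = "github-aws-runners/github-runner/aws"
  version = "6.2.1"

  # ... same VPC, GitHub App and lambda settings as the spot pool ...

  prefix                        = "github-actions-runners-ondemand"
  enable_ephemeral_runners      = true
  instance_types                = ["c6i.4xlarge"]
  instance_target_capacity_type = "on-demand"    # never uses spot capacity
  runner_extra_labels           = ["on-demand"]  # target with runs-on in workflows
}
```

Critical or long-running jobs then select this pool with runs-on: [self-hosted, on-demand].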
