
Jobs fail midway or not get picked up by runners #4408

Open

the-lay opened this issue Feb 6, 2025 · 5 comments

the-lay commented Feb 6, 2025

We build ~50 Docker images using GitHub Actions matrix jobs and ephemeral runners, and around 30-50% of the jobs consistently fail midway or never get picked up by runners. The jobs that fail usually show this error: The self-hosted runner: i-X lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

At first we thought it was due to high load, e.g. Docker builds taking up all cores and making an instance unresponsive, but after increasing instance sizes and limiting the cores used we still have this issue. Here's our module configuration; are there any config issues? How would I diagnose something like this?

module "github_actions_runners" {
  # https://github.com/github-aws-runners/terraform-aws-github-runner
  source  = "github-aws-runners/github-runner/aws"
  version = "6.2.1"

  aws_region = var.region
  vpc_id     = module.X.vpc_id
  subnet_ids = module.X.public_subnet_ids

  prefix                = "github-actions-runners"
  role_path             = "/internal/"
  instance_profile_path = "/internal/"

  enable_organization_runners                 = true                              # register runners at the organization level
  enable_runner_on_demand_failover_for_errors = ["InsufficientInstanceCapacity"]  # retry as on-demand when spot creation fails with this error

  instance_types        = ["c6i.4xlarge", "c6i.8xlarge"]
  runners_maximum_count = -1  # no upper limit on concurrently running runners

  block_device_mappings = [{
    ...
  }]

  github_app = {
    key_base64     = "X"
    id             = "X"
    webhook_secret = "X"
  }

  enable_ssm_on_runners           = true  # allow access to runner instances via SSM
  create_service_linked_role_spot = true  # create the service-linked role required for spot requests
  delay_webhook_event             = 0     # no delay before processing webhook events
  scale_down_schedule_expression  = "cron(*/5 * * * ? *)"  # run the scale-down check every 5 minutes
  minimum_running_time_in_minutes = 10    # minimum runtime before scale-down may terminate an instance
  enable_ephemeral_runners        = true  # one job per runner, instance terminated afterwards
  enable_job_queued_check         = true  # only scale up if the job is still queued on GitHub

  runner_binaries_s3_versioning            = "Enabled"
  lambda_s3_bucket                         = aws_s3_bucket.X.bucket
  ami_housekeeper_lambda_s3_key            = "github-runners/ami-housekeeper.zip"
  ami_housekeeper_lambda_s3_object_version = "X"
  runners_lambda_s3_key                    = "github-runners/runners.zip"
  runners_lambda_s3_object_version         = "X"
  syncer_lambda_s3_key                     = "github-runners/runner-binaries-syncer.zip"
  syncer_lambda_s3_object_version          = "X"
  webhook_lambda_s3_key                    = "github-runners/webhook.zip"
  webhook_lambda_s3_object_version         = "X"
}

the-lay commented Feb 6, 2025

For more context, here are the logs for one of the jobs that failed midway.

Failed github actions job:

The self-hosted runner: i-04b6e2bd413cd7c41 lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

/aws/lambda/github-actions-runners-scale-up:

{"level":"INFO","message":"Created instance(s): i-04b6e2bd413cd7c41","sampling_rate":0,"service":"runners-scale-up","timestamp":"2025-02-06T12:29:02.975Z","xray_trace_id":"1-67a4ab0b-ebc0b919f423d0b466c80283","region":"X","environment":"github-actions-runners","module":"runners","aws-request-id":"X","function-name":"github-actions-runners-scale-up","runner":{"type":"Org","owner":"X","namePrefix":""},"github":{"event":"workflow_job","workflow_job_id":"36784250582"}} 

/github-self-hosted-runners/github-actions-runners/user_data/i-04b6e2bd413cd7c41:

...
2025-02-06 12:29:53Z: Listening for Jobs
2025-02-06 12:29:57Z: Running job: 📦 Build (X)
2025-02-06 12:34:12Z: Terminated
2025-02-06 12:34:12Z: ERROR: runner-start-failed with exit code 143 occurred on 1

/github-self-hosted-runners/github-actions-runners/runner/i-04b6e2bd413cd7c41:

...

[2025-02-06 12:29:57Z INFO JobNotification] Entering StartMonitor
[2025-02-06 12:30:57Z INFO JobDispatcher] Successfully renew job request 47361, job is valid till 02/06/2025 12:40:57
[2025-02-06 12:31:57Z INFO JobDispatcher] Successfully renew job request 47361, job is valid till 02/06/2025 12:41:57
[2025-02-06 12:32:57Z INFO JobDispatcher] Successfully renew job request 47361, job is valid till 02/06/2025 12:42:57
[2025-02-06 12:33:57Z INFO JobDispatcher] Successfully renew job request 47361, job is valid till 02/06/2025 12:43:57

And that's it; I couldn't find any other logs, e.g. no mentions in the scale-down logs.


the-lay commented Feb 6, 2025

Two follow-ups:

  1. Setting enable_job_queued_check = false fixed jobs not getting picked up.
  2. The reason instances fail midway is spot instance capacity: the spot requests show that those instances get terminated with status instance-terminated-no-capacity after a minute or two. Is there a way to catch those events (see the sketch after this list)? It doesn't seem like enable_runner_on_demand_failover_for_errors can handle this for ephemeral runners.
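
For reference, spot interruption warnings can, I believe, be caught outside the module with an EventBridge rule; a minimal sketch (resource names are illustrative, and a very short-lived instance may not receive the full two-minute warning):

```hcl
# Sketch: route EC2 spot interruption warnings to an SNS topic so terminations
# of runner instances are at least visible. Names are illustrative; an SNS topic
# policy allowing events.amazonaws.com to publish is also needed (omitted).
resource "aws_cloudwatch_event_rule" "spot_interruption" {
  name = "github-runners-spot-interruption"
  event_pattern = jsonencode({
    source      = ["aws.ec2"]
    detail-type = ["EC2 Spot Instance Interruption Warning"]
  })
}

resource "aws_sns_topic" "spot_interruption" {
  name = "github-runners-spot-interruption"
}

resource "aws_cloudwatch_event_target" "spot_interruption" {
  rule = aws_cloudwatch_event_rule.spot_interruption.name
  arn  = aws_sns_topic.spot_interruption.arn
}
```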

@guicaulada

The error The self-hosted runner: i-04b6e2bd413cd7c41 lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error. probably means your runner got spot terminated; you can confirm this with the spot termination watcher.
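
If you want it wired in through the module, a minimal sketch of enabling it (assuming a module version that exposes the instance_termination_watcher variable; check the docs of your version for the exact attribute names):

```hcl
module "github_actions_runners" {
  source  = "github-aws-runners/github-runner/aws"
  version = "6.2.1"

  # ... existing configuration ...

  # Assumption: this (beta) feature logs spot interruptions/terminations of
  # runner instances; it does not re-queue the affected job.
  instance_termination_watcher = {
    enable = true
  }
}
```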


the-lay commented Mar 13, 2025

@guicaulada thanks, that's clear! Do I understand correctly that the spot termination watcher just produces logs and doesn't resubmit the terminated job? I also tried enable_runner_on_demand_failover_for_errors, but it doesn't seem like it can handle spot termination on ephemeral runners.

@guicaulada

It doesn't handle spot terminations for ephemeral runners; it just produces logs. I couldn't find a way to handle them properly: the runner has already picked up the job and is running it, and I don't think there's a way to receive a spot termination notification and "pause" the job to transfer it to a different runner.

Check https://aws.amazon.com/ec2/spot/instance-advisor/ to find the region and instance types with the best trade-off between spot interruption frequency and cost for your workloads. Another option is to create an "on-demand" runner type with instance_target_capacity_type = "on-demand" and assign critical workflows, or workflows that take a long time, to that runner type. Long-running workflows have a higher probability of being spot-interrupted, and they also cost the most when interrupted, since all the CI/infrastructure minutes already spent are wasted.
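
A minimal sketch of such an on-demand pool as a second module instance (prefix and label are illustrative; double-check variable names against your module version):

```hcl
module "github_actions_runners_ondemand" {
  source  = "github-aws-runners/github-runner/aws"
  version = "6.2.1"

  # ... same VPC, GitHub App and lambda settings as the spot pool ...

  prefix                        = "github-actions-runners-ondemand"
  enable_ephemeral_runners      = true
  instance_types                = ["c6i.4xlarge"]
  instance_target_capacity_type = "on-demand"    # never uses spot capacity
  runner_extra_labels           = ["on-demand"]  # target with runs-on in workflows
}
```

Critical or long-running jobs then select this pool with runs-on: [self-hosted, on-demand].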
