
GH Runners Orphaned when Spot Instance is Interrupted #4376

Open
iNoahNothing opened this issue Jan 23, 2025 · 2 comments

Comments

@iNoahNothing

Many of my spot runners are getting orphaned in my organization's GitHub runner pool as offline runners. I believe this is because the scale-down Lambda derives the set of runners to scale down and remove from GitHub from the list of active EC2 instances for that runner pool.

This code snippet shows the scale-down logic, which iterates through the running EC2 instances, checks that each one is owned by this Lambda/runner pool, checks whether it should be spun down based on the defined constraints, and removes it if so.

    for (const ec2Runner of ec2RunnersFiltered) {
      const ghRunners = await listGitHubRunners(ec2Runner);
      const ghRunnersFiltered = ghRunners.filter((runner: { name: string }) =>
        runner.name.endsWith(ec2Runner.instanceId),
      );
      logger.debug(
        `Found: '${ghRunnersFiltered.length}' GitHub runners for AWS runner instance: '${ec2Runner.instanceId}'`,
      );
      logger.debug(
        `GitHub runners for AWS runner instance: '${ec2Runner.instanceId}': ${JSON.stringify(ghRunnersFiltered)}`,
      );
      if (ghRunnersFiltered.length) {
        if (runnerMinimumTimeExceeded(ec2Runner)) {
          if (idleCounter > 0) {
            idleCounter--;
            logger.info(`Runner '${ec2Runner.instanceId}' will be kept idle.`);
          } else {
            logger.info(`Terminating all non busy runners.`);
            await removeRunner(
              ec2Runner,
              ghRunnersFiltered.map((runner: { id: number }) => runner.id),
            );
          }
        }
      } else if (bootTimeExceeded(ec2Runner)) {
        await markOrphan(ec2Runner.instanceId);
      } else {
        logger.debug(`Runner ${ec2Runner.instanceId} has not yet booted.`);
      }
    }

Since this iteration is based on the live EC2 instances, if a spot instance is terminated while a job is active on it, the scale-down Lambda cannot remove the runner from GitHub when it tries to, and it never removes it from the GitHub runner pool after the instance has been terminated, because the instance no longer appears in the list.

This Lambda should remove offline runners from GitHub even when there is no corresponding active EC2 instance in the account. The only workaround today is to remove the runner manually in GitHub, which is not viable as AWS interrupts spot instances more frequently.
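To illustrate the missing direction of the check, here is a minimal sketch of detecting orphans from the GitHub side. The names (`GhRunner`, `findOrphanedGhRunners`) are illustrative, not this module's real API; it assumes, as in the snippet above, that runner names end with the backing EC2 instance id.

```typescript
// Sketch: find GitHub-registered runners whose backing EC2 instance is gone.
// Assumption (matching the scale-down filter above): runner names end with
// the EC2 instance id, e.g. "my-pool-runner-i-0abc123def456".
interface GhRunner {
  id: number;
  name: string;
}

function findOrphanedGhRunners(
  ghRunners: GhRunner[],
  liveInstanceIds: Set<string>,
): GhRunner[] {
  // A runner is orphaned when no live instance id is a suffix of its name.
  return ghRunners.filter(
    (runner) => ![...liveInstanceIds].some((id) => runner.name.endsWith(id)),
  );
}
```

The scale-down Lambda could run such a pass after its existing EC2-driven loop and deregister any runner this returns, instead of leaving offline entries behind.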

I would be happy to submit a PR to address this if it is agreed this is a bug.

Thanks!

@iNoahNothing
Author

@npalm This seems like a fairly important bug, as it pollutes our GHA runners page with offline runners. I would be happy to work on a PR to address it if you agree.

@npalm
Member

npalm commented Jan 27, 2025

Yes, the observation is correct, and indeed the problem has existed for quite a while. The scale-down indeed uses AWS as the source of truth. When the solution was created we did not even have a good way to link GitHub and AWS. Also, GitHub-registered runners may be offline due to the creation and registration process, or may not even be owned by this solution.

Recently we added a termination watcher. I think this is the way forward to handle spot termination: based on the AWS event, check for example whether the GitHub runner (registration) needs to be removed. I would propose to extend the lambda (https://github.com/github-aws-runners/terraform-aws-github-runner/tree/main/lambdas/functions/termination-watcher) in an extensible way (with a feature flag) to run the cleanup action.
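A minimal sketch of what that extension could look like, assuming the AWS "EC2 Spot Instance Interruption Warning" EventBridge event shape; `deregisterRunnerByInstanceId` and the flag wiring are hypothetical, not existing code in this repo.

```typescript
// Sketch: termination watcher handling a spot interruption warning and,
// behind a feature flag, deregistering the matching GitHub runner.
// Event shape follows AWS's "EC2 Spot Instance Interruption Warning"
// EventBridge event (detail-type plus detail['instance-id']).
interface SpotInterruptionEvent {
  'detail-type': string;
  detail: { 'instance-id': string; 'instance-action': string };
}

async function handleSpotInterruption(
  event: SpotInterruptionEvent,
  opts: {
    cleanupEnabled: boolean; // feature flag, e.g. set from an env var by Terraform
    deregisterRunnerByInstanceId: (instanceId: string) => Promise<void>; // hypothetical helper
  },
): Promise<void> {
  if (!opts.cleanupEnabled) return;
  if (event['detail-type'] !== 'EC2 Spot Instance Interruption Warning') return;
  // Remove the GitHub runner registration for the interrupted instance.
  await opts.deregisterRunnerByInstanceId(event.detail['instance-id']);
}
```

Gating the cleanup on `cleanupEnabled` keeps the watcher's current behavior unchanged unless the flag is explicitly turned on.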

We also want to move to a slightly different way of relying on the GitHub API. We prefer to have a service layer and use that layer to set up the GitHub auth as well as to make the calls. This will make mocking simpler and also allow sharing code.

So in short, a contribution would be great; the suggestion would be:

  • extend the termination watcher with a feature flag
  • introduce a service layer in the lambda (or better as util package) to call GitHub APIs.
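The service-layer idea can be sketched as an interface the Lambdas depend on, with the real implementation wrapping an authenticated GitHub client and tests substituting a fake. All names here (`GitHubRunnerService`, `InMemoryGitHubRunnerService`) are illustrative, not an existing API in this repo.

```typescript
// Sketch of a GitHub service layer (e.g. as a shared util package): callers
// program against this interface instead of the GitHub client directly, so
// auth setup lives in one place and mocking becomes trivial.
interface RunnerInfo {
  id: number;
  name: string;
  status: string; // e.g. "online" | "offline"
}

interface GitHubRunnerService {
  listOrgRunners(org: string): Promise<RunnerInfo[]>;
  deleteOrgRunner(org: string, runnerId: number): Promise<void>;
}

// In production this would wrap an authenticated Octokit client; in tests a
// simple in-memory fake satisfies the same interface.
class InMemoryGitHubRunnerService implements GitHubRunnerService {
  constructor(private runners: RunnerInfo[]) {}

  async listOrgRunners(_org: string): Promise<RunnerInfo[]> {
    return this.runners;
  }

  async deleteOrgRunner(_org: string, runnerId: number): Promise<void> {
    this.runners = this.runners.filter((r) => r.id !== runnerId);
  }
}
```

Both the scale-down Lambda and the termination watcher could then share this layer rather than each setting up GitHub auth on its own.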
