Many of my spot runners are being orphaned in my organization's GitHub runner pool, where they linger as offline runners. I believe this happens because the scale-down lambda builds its list of runners to scale down and remove from GitHub from the list of active EC2 instances for that runner pool.
This code snippet shows the scale-down logic, which iterates through the running EC2 instances, checks that each one is owned by this lambda/runner pool, checks whether it should be spun down based on the defined constraints, and removes it if so.
    for (const ec2Runner of ec2RunnersFiltered) {
      const ghRunners = await listGitHubRunners(ec2Runner);
      const ghRunnersFiltered = ghRunners.filter((runner: { name: string }) =>
        runner.name.endsWith(ec2Runner.instanceId),
      );
      logger.debug(
        `Found: '${ghRunnersFiltered.length}' GitHub runners for AWS runner instance: '${ec2Runner.instanceId}'`,
      );
      logger.debug(
        `GitHub runners for AWS runner instance: '${ec2Runner.instanceId}': ${JSON.stringify(ghRunnersFiltered)}`,
      );
      if (ghRunnersFiltered.length) {
        if (runnerMinimumTimeExceeded(ec2Runner)) {
          if (idleCounter > 0) {
            idleCounter--;
            logger.info(`Runner '${ec2Runner.instanceId}' will be kept idle.`);
          } else {
            logger.info(`Terminating all non busy runners.`);
            await removeRunner(
              ec2Runner,
              ghRunnersFiltered.map((runner: { id: number }) => runner.id),
            );
          }
        }
      } else if (bootTimeExceeded(ec2Runner)) {
        await markOrphan(ec2Runner.instanceId);
      } else {
        logger.debug(`Runner ${ec2Runner.instanceId} has not yet booted.`);
      }
    }
Because this iteration is based on the live EC2 instances, a spot instance that is terminated while a job is active on it is never cleaned up: the scale-down lambda cannot remove the runner from GitHub while the job is running, and once the instance has been terminated it no longer appears in the list, so the lambda never tries again.
This lambda should remove offline runners from GitHub even when there is no matching active EC2 instance in the account. The only workaround today is to remove the runner manually in GitHub, which is not viable when AWS interrupts spot instances more frequently.
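To make the expected behavior concrete, here is a minimal sketch of the missing check. It is not the project's code: `GhRunner`, `findStaleRunners`, and the `namePrefix` ownership test are hypothetical names and assumptions of mine. The `id`, `name`, `status`, and `busy` fields mirror what GitHub's self-hosted runner listing returns, and the instance ID match reuses the `endsWith` convention from the snippet above.

```typescript
// Minimal shape of a runner as returned by GitHub's self-hosted runner listing.
interface GhRunner {
  id: number;
  name: string;
  status: string; // "online" or "offline"
  busy: boolean;
}

// Hypothetical helper: returns the runners that are offline, look like they
// belong to this pool (assumed here to be identifiable by a name prefix), and
// have no live EC2 instance whose ID matches the end of the runner name.
function findStaleRunners(
  ghRunners: GhRunner[],
  liveInstanceIds: string[],
  namePrefix: string,
): GhRunner[] {
  return ghRunners.filter(
    (runner) =>
      runner.status === 'offline' &&
      runner.name.startsWith(namePrefix) &&
      !liveInstanceIds.some((instanceId) => runner.name.endsWith(instanceId)),
  );
}
```

An offline runner that still has a live instance is deliberately left alone, since it may simply be booting or registering.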
I would be happy to submit a PR to address this if it is agreed this is a bug.
Thanks!
@npalm This seems like a fairly important bug, as it pollutes our GHA runners page with offline runners. I would be happy to work on a PR to address it if you agree.
Yes, the observation is correct, and the problem has existed for quite a while. The scale-down lambda indeed uses AWS as the source of truth. When the solution was created we did not even have a good way to link GitHub runners to AWS instances. Also, runners registered in GitHub may be offline because creation and registration are still in progress, or may not be owned by this solution at all.
We also want to move to a slightly different way of relying on the GitHub API. We would prefer to have a service layer and use it both to set up the GitHub auth and to make the calls. This will make mocking simpler and also allow sharing code.
So in short, it would be great to get a contribution. My suggestions would be:
extend the termination watcher, behind a feature flag
introduce a service layer in the lambda (or, better, as a util package) to call the GitHub APIs.
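As a rough illustration of the service-layer suggestion, the sketch below shows one possible shape. Everything in it is hypothetical: `GitHubRunnerService`, `InMemoryRunnerService`, and `removeOfflineRunners` are names I made up, and a real implementation would wrap the GitHub client and its auth setup behind the same interface. It also skips the ownership and boot-time checks mentioned above, which a real cleanup would need.

```typescript
interface RunnerSummary {
  id: number;
  name: string;
  status: string; // "online" or "offline"
}

// Hypothetical service-layer contract: callers never touch the GitHub client
// or its authentication directly.
interface GitHubRunnerService {
  listRunners(): Promise<RunnerSummary[]>;
  removeRunner(runnerId: number): Promise<void>;
}

// In-memory fake, showing how the interface keeps mocking simple in tests.
class InMemoryRunnerService implements GitHubRunnerService {
  private runners: RunnerSummary[];

  constructor(runners: RunnerSummary[]) {
    this.runners = [...runners];
  }

  async listRunners(): Promise<RunnerSummary[]> {
    return [...this.runners];
  }

  async removeRunner(runnerId: number): Promise<void> {
    this.runners = this.runners.filter((runner) => runner.id !== runnerId);
  }
}

// Example caller that depends only on the interface. It removes every offline
// runner and returns the IDs it removed.
async function removeOfflineRunners(service: GitHubRunnerService): Promise<number[]> {
  const runners = await service.listRunners();
  const offline = runners.filter((runner) => runner.status === 'offline');
  for (const runner of offline) {
    await service.removeRunner(runner.id);
  }
  return offline.map((runner) => runner.id);
}
```

Because the lambda code would only see `GitHubRunnerService`, the same interface could be shared between the scale-down lambda and the termination watcher.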