Scale GitHub workflows with AWS ECS | by Shekhar Jha

using a self-hosted runner without access keys, complex lambda functions, or Kubernetes clusters

Architecture for GitHub self-hosted runner infrastructure on ECS for GitHub workflows

GitHub Actions automates CI/CD pipelines for build and deployment. It can run pipeline operations on either GitHub-hosted VMs or self-hosted runners.

Self-hosted runners let organizations enforce security controls such as control over the flow of code and build artifacts. They can also use AWS service roles scoped to the least privilege needed for cloud operations, instead of access keys, which carry the additional burden of secure storage and rotation.

GitHub has published two recommended auto-scaling approaches, one using Kubernetes clusters and one using AWS Lambda functions with webhook integration, to scale the build infrastructure for GitHub self-hosted runners.

This article introduces a simpler alternative that uses the relatively new OpenID Connect-based integration between AWS and GitHub, along with GitHub self-hosted runner containers configured as ECS tasks, to start ephemeral runners on demand.

The approach uses free-tier AWS EC2 instances as ECS container instances, but it can easily be extended to run the containers on Fargate (or Spot) capacity to optimize cost.

Architecture for self-hosted runner infrastructure for GitHub workflows

The sample GitHub workflow below has two jobs: the first starts the GitHub runner container on ECS, and the second builds the code on the newly started runner.

Output of GitHub workflow that activates self-hosted runner and builds code on the runner

The corresponding workflow is available here.

Let’s walk through the code to understand how we activate GitHub self-hosted runner using an ECS task:

name: infra-aws-core
on:
  workflow_dispatch:
    inputs:
      Env:
        ...
      LaunchMode:
        ...

The section above names the workflow and specifies that it can be invoked manually with Env (the environment) and LaunchMode as inputs.

jobs:
  activate-github-runner:
    runs-on: ubuntu-latest
    environment: ${{ github.event.inputs.Env }}
    permissions:
      id-token: write
      contents: read
    outputs:
      gitrunner_vm_id: ${{ steps.runner-label.outputs.vm_id }}

The activate-github-runner job is responsible for starting the GitHub self-hosted runner. Its runs-on value, ubuntu-latest, places the job on a GitHub-hosted runner with the latest Ubuntu image. The environment supplies the following input values:

  • AWS_REGION in which the GitHub self-hosted runner will be available.
  • RUNNER_PAT (Personal Access Token) with the ability to register self-hosted runners (see the considerations below for details).
  • ROLE_TO_ASSUME to connect to AWS (see the OpenID Connect reference for more details).
  • SECURITY_GROUP and SUBNET of the network that is attached to a container running on Fargate.

permissions sets the permissions of the GitHub token provided to the job; id-token: write is required to enable the OpenID Connect login to AWS. outputs defines the job outputs that dependent jobs can use. This job outputs gitrunner_vm_id, which identifies the GitHub self-hosted runner that the build job should run on.

    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@master
        with:
          role-to-assume: ${{ secrets.ROLE_TO_ASSUME }}
          aws-region: ${{ secrets.AWS_REGION }}
      - name: Check identity
        id: validate
        run: |
          aws sts get-caller-identity

Configure AWS credentials is the standard step that performs the OpenID Connect login from GitHub to AWS, and Check identity prints the caller identity for reference.

      - name: Generate Runner label
        id: runner-label
        run: |
          task_vm_id=$(uuidgen)
          # Write the step output via GITHUB_OUTPUT; the ::set-output command is deprecated.
          echo "vm_id=$task_vm_id" >> "$GITHUB_OUTPUT"
          echo "VM ID generated $task_vm_id"

This step generates a tag to uniquely identify the instance of GitHub runner that will be used to run the build. It uses a simple method of using the uuidgen command to generate a tag.

Depending on whether the LaunchMode is EC2 or Fargate, the next step sets NET_CONFIG and TASK_DEF_NAME. NET_CONFIG is needed for Fargate because an explicit ENI is attached to the container to provide the internet connectivity needed to reach GitHub infrastructure (along with other operations). The task definitions for EC2 and Fargate differ primarily in network mode, with values of bridge and awsvpc respectively. See the considerations below for more on this topic.
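The selection logic described above can be sketched as a small shell function. This is a minimal sketch rather than the author's actual step: the task definition names (github-runner-ec2, github-runner-fargate) and the accepted LaunchMode spellings are assumptions, and the network configuration uses the AWS CLI shorthand syntax for awsvpcConfiguration.

```shell
# Sketch: pick the task definition and network configuration by launch mode.
# The task definition names below are hypothetical placeholders.
select_launch_config() {
  launch_mode="$1"
  subnet="$2"
  security_group="$3"
  case "$launch_mode" in
    Fargate|FARGATE)
      TASK_DEF_NAME="github-runner-fargate"
      # awsvpc tasks get their own ENI; a public IP gives it a route to GitHub.
      NET_CONFIG="awsvpcConfiguration={subnets=[$subnet],securityGroups=[$security_group],assignPublicIp=ENABLED}"
      ;;
    EC2)
      TASK_DEF_NAME="github-runner-ec2"
      # bridge mode uses the host's networking, so nothing is passed here.
      NET_CONFIG=""
      ;;
    *)
      echo "Unsupported LaunchMode: $launch_mode" >&2
      return 1
      ;;
  esac
}
```

Calling select_launch_config "Fargate" "$SUBNET" "$SECURITY_GROUP" leaves both variables set for the step that starts the task; for EC2 the empty NET_CONFIG signals that no network configuration should be passed.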

This step runs a task using the task definition selected in the previous step. It passes RUNNER_NAME, GITHUB_PAT and RUNNER_LABELS to the container so that the entry point script can use them to configure the GitHub self-hosted runner container. See the considerations below for additional details about the image.
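As an illustration of how those three values can reach the container, the sketch below builds a containerOverrides document and hands it to aws ecs run-task. It is a sketch under stated assumptions, not the author's step: the function names and the container name argument are hypothetical, and the values are assumed to contain no quotes or backslashes (true for a UUID label and a PAT), so no JSON escaping is done.

```shell
# Sketch: build the --overrides JSON that injects the runner settings.
# Assumes the values need no JSON escaping.
build_overrides() {
  container_name="$1"
  runner_name="$2"
  github_pat="$3"
  runner_labels="$4"
  printf '{"containerOverrides":[{"name":"%s","environment":[' "$container_name"
  printf '{"name":"RUNNER_NAME","value":"%s"},' "$runner_name"
  printf '{"name":"GITHUB_PAT","value":"%s"},' "$github_pat"
  printf '{"name":"RUNNER_LABELS","value":"%s"}' "$runner_labels"
  printf ']}]}'
}

# Sketch: start the task; the network configuration is passed only when set.
# launch_type is EC2 or FARGATE, as accepted by the AWS CLI.
start_runner_task() {
  cluster="$1"
  task_def="$2"
  launch_type="$3"
  net_config="$4"
  overrides="$5"
  if [ -n "$net_config" ]; then
    aws ecs run-task --cluster "$cluster" --task-definition "$task_def" \
      --launch-type "$launch_type" --network-configuration "$net_config" \
      --overrides "$overrides"
  else
    aws ecs run-task --cluster "$cluster" --task-definition "$task_def" \
      --launch-type "$launch_type" --overrides "$overrides"
  fi
}
```

Passing the generated UUID as both RUNNER_NAME and RUNNER_LABELS is one simple way to make the runner addressable from the runs-on of a later job.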

After this job completes, the self-hosted runner task has been started, and the next job to run is build-code which can be customized as needed. The following example shows a build-code job that prints the working directory.

  build-code:
    needs: [activate-github-runner]
    runs-on: [self-hosted, "${{ needs.activate-github-runner.outputs.gitrunner_vm_id }}"]
    permissions:
      id-token: write
      contents: read
    steps:
      - name: Testing build
        run: |
          MY_WD=$(pwd)
          echo "Hello world ${MY_WD}"

needs sets the dependency on activate-github-runner so that the self-hosted runner is started before this job runs. runs-on specifies that this job must run on a runner that has both the self-hosted label and the generated gitrunner_vm_id label. This binds the build to the container started in the previous job.

In this implementation, because the self-hosted runner is marked as ephemeral, the container exits automatically after the build job completes.
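To make the ephemeral behavior concrete, here is a minimal sketch of what a container entry point could do with the values passed to it. It is not the script from the author's image: the function names are hypothetical, the repository is passed in as an "owner/name" argument, curl and jq are assumed to be installed, and the runner package is assumed to be unpacked in the working directory. The registration-token endpoint and the config.sh flags are the standard GitHub ones.

```shell
# Sketch of an entry point for an ephemeral self-hosted runner.
# Expects GITHUB_PAT, RUNNER_NAME and RUNNER_LABELS in the environment.

# URL of the REST endpoint that issues a short-lived registration token.
registration_token_url() {
  printf 'https://api.github.com/repos/%s/actions/runners/registration-token' "$1"
}

# Exchange the PAT for a registration token (assumes curl and jq exist).
fetch_registration_token() {
  curl -fsS -X POST \
    -H "Authorization: Bearer ${GITHUB_PAT}" \
    -H "Accept: application/vnd.github+json" \
    "$(registration_token_url "$1")" | jq -r '.token'
}

start_ephemeral_runner() {
  repo="$1"   # "owner/name"
  token="$(fetch_registration_token "$repo")"
  ./config.sh --unattended --ephemeral \
    --url "https://github.com/${repo}" \
    --token "$token" \
    --name "$RUNNER_NAME" \
    --labels "$RUNNER_LABELS"
  # With --ephemeral the runner accepts one job and is then deregistered,
  # so run.sh returns and the container (and the ECS task) stops.
  exec ./run.sh
}
```

Because the runner deregisters itself after a single job, no separate cleanup step is needed in the workflow for this mode.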

GitHub leverages the ECS infrastructure to run the workflow. The following components form the ECS infrastructure, which has been designed to fit within the AWS free tier. Other designs are possible that are more secure (e.g., using NAT, VPC endpoints, etc.) or cheaper and easier to maintain (e.g., using Fargate instead of EC2).

Network

The VPC network, defined using Terraform here, uses a simple public-private subnet model. This model can be enhanced with NAT gateways, to avoid a compute component being publicly accessible, and with PrivateLink, to reduce internet traffic.

Compute

The EC2 compute instance, described using Terraform here, is managed by an autoscaling group and launched in a public subnet with a security group that allows only egress traffic.

In addition, following the least-privilege principle, the IAM role assigned to the EC2 compute instance carries only the AmazonSSMManagedInstanceCore and AmazonEC2ContainerServiceforEC2Role managed policies, which enable the SSM agent and the ECS agent, respectively, on the instance.

The user data (specified in the template file setup-vm.sh.tpl) saves the configuration data and then pulls the build-runner.sh script from the CodeCommit repo to start the configuration process. This two-step approach reduces the size of the user-data configuration while maintaining flexibility.
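The article's infrastructure is defined in Terraform, but the same attachment can be illustrated with the AWS CLI. The sketch below is only an equivalent illustration: the function names and the role name passed in are hypothetical, while the two policy ARNs are the AWS-managed ones named above.

```shell
# AWS-managed policies for the SSM agent and the ECS agent.
managed_policy_arns() {
  printf '%s\n' \
    "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore" \
    "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
}

# Attach both policies to the instance role (role name is a placeholder).
attach_instance_policies() {
  role_name="$1"
  managed_policy_arns | while read -r policy_arn; do
    aws iam attach-role-policy --role-name "$role_name" --policy-arn "$policy_arn"
  done
}
```

Keeping the list to these two policies is what keeps the instance role close to least privilege; access to CodeCommit and ECR is granted separately.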

The build-runner.sh script does the following three things:

Code infrastructure

The tf-code-infra.tf file defines the infrastructure for CodeCommit and Elastic Container Registry, along with the associated policy that allows the EC2 role to access these components. The Terraform script for CodeCommit uses scripts and the Terraform provisioner capability to upload the Dockerfile, entrypoint script and build-runner files to CodeCommit.

ECS configuration

The ECS configuration in tf-runner-ecs.tf creates the CloudWatch log group, the ECS cluster, and the ECS task definitions. In addition, it defines the IAM service role needed to execute the ECS task (i.e., with the AmazonECSTaskExecutionRolePolicy permission). Two separate ECS task definitions, one for EC2 and one for Fargate, have been defined to simplify the launch process.

GitHub

The tf-code-github.tf describes the configuration needed for integration between AWS and GitHub. It contains a new IAM OpenID Connect provider that represents GitHub with validation of TLS certificate using its thumbprint. In addition to that, an IAM role is created with AssumeRoleWithWebIdentity permission with restrictions to specific users and repo. This role has the ability to run-task, stop-task and pass-role.
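For orientation, a trust policy of the kind described above typically looks like the document printed by the sketch below. This is an illustrative sketch, not the policy from tf-code-github.tf: the function name, the account ID and the repository are placeholders, and the wildcard on the sub claim is an assumption (a job that uses an environment presents a subject of the form repo:OWNER/REPO:environment:NAME, so a tighter pattern is possible).

```shell
# Sketch: print an IAM trust policy that lets GitHub's OIDC provider
# assume the role, restricted to one repository.
github_trust_policy() {
  account_id="$1"
  repo="$2"   # "owner/name"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${account_id}:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:${repo}:*"
        }
      }
    }
  ]
}
EOF
}
```

The aud condition matches the audience that the configure-aws-credentials action requests by default, and the sub condition is what limits the role to workflows from the named repository.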

In addition to that, it also defines and sets the environment variables used by GitHub workflow. Please note that due to the limitations of GitHub APIs the PAT token (RUNNER_PAT) needed for runner registration needs to be manually added to the environment.

The following considerations should be kept in mind while using this approach

  1. Personal Access Token (PAT): The PAT used to register the self-hosted runner should have limited access to reduce the impact if it is compromised. Two different PATs are needed: one to create the environment variables as part of Terraform, and one to create the self-hosted runner registration tokens.
  2. The GitHub portal's Actions tab shows only the workflows defined in the default branch of the repo.
  3. Ephemeral vs explicit stop: If the build/deployment process is performed over multiple jobs, the ephemeral container may not be appropriate. In such a scenario, the task ARN can be extracted in the first job by adding --query "tasks[0].taskArn" --output text to the run-task command, and then used to stop the task in a final job.
  4. ECS task network mode: The ECS task definition network mode is set to bridge instead of awsvpc for running the task on a free-tier EC2 instance. This is primarily because public IPs cannot be assigned to such containers (due to EC2 constraints). At the same time, it is possible to use awsvpc mode for a task definition that runs on EC2 if there is a NAT Gateway setup in the public subnet.
  5. GitHub runner image: A wide variety of image definitions is available. The image script used here is based on the one provided here, which creates an image with the latest version of the GitHub self-hosted runner installer, runs the process in non-root mode, installs ADDITIONAL_PACKAGES during container startup, supports both GITHUB_TOKEN and PAT, and removes the runner on SIGTERM and SIGINT signals. The version available in the repo does not include the removal logic, since it relies on the ephemeral flag.
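The explicit-stop variant from consideration 3 can be sketched as two small functions, one for the first job and one for the final job. The function names and the stop reason text are hypothetical; the --query and --output options and the stop-task call are standard AWS CLI usage.

```shell
# First job: start the task and print only its ARN.
start_task_capture_arn() {
  cluster="$1"
  task_def="$2"
  aws ecs run-task --cluster "$cluster" --task-definition "$task_def" \
    --query "tasks[0].taskArn" --output text
}

# Final job: stop the task that was started earlier.
stop_runner_task() {
  cluster="$1"
  task_arn="$2"
  aws ecs stop-task --cluster "$cluster" --task "$task_arn" \
    --reason "Workflow finished"
}
```

In a workflow, the ARN printed by the first function would be published as a job output, in the same way as gitrunner_vm_id, and the final job would list the build jobs in needs so that it runs after them.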

Autoscaling GitHub self-hosted runners: https://docs.github.com/en/actions/hosting-your-own-runners/autoscaling-with-self-hosted-runners

GitHub OpenID Connect integration with AWS: https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/configuring-openid-connect-in-amazon-web-services

Sample GitHub workflow: https://github.com/shekhar-jha/base-demo/blob/infra-core/.github/workflows/infra-aws-core.yml

All Terraform definitions are shown here: https://github.com/shekhar-jha/base-demo/tree/infra-core/infra/aws/core

Reference GitHub runner image: https://github.com/SanderKnape/github-runner

Thanks for reading! Stay tuned for more.