How We Reduced Our GitLab CI Pipeline Duration by 70% at Student Beans

At Student Beans, we recently created a new Developer Experience (DevEx) team whose primary goal is to make the working lives of our developers better. The main way we’re going to do this is by improving our tooling and development processes.

One of the first initiatives we decided to take is improving the speed and reliability of our GitLab CI pipelines. Below are a few of the things we changed and the improvements that came from them.

You can also find an example repository containing the GitLab setup & .gitlab-ci.yml config file which I used to test the changes here (somewhat ironically hosted on GitHub 😉).

Why?


The main reason for speeding up our CI/CD pipelines is quite obvious: to save developer time and toil.

Considering developer time is also one of our most expensive resources, allowing our devs to save time and make changes quicker will be of great benefit to the company. Additionally, if pipeline durations are reduced to within a couple of minutes (or seconds), developers will not need to start working on another task, reducing the mental fatigue associated with context switching and lowering the chance of burnout.

Since faster pipelines also lead to faster deployments, there are several other knock-on effects, such as:

  • Faster product delivery and outer feedback loops - we can get product changes out for user testing quicker.
  • Faster incident recovery - if a breaking change makes it past testing (no test suite is perfect 😅), developers can put a fix out quicker, reducing the impact on users and potential lost revenue.

What We Changed

Here are the five key changes we made to speed up our GitLab CI pipelines, along with an example of how they could be implemented:

  • Running Jobs in Kubernetes
  • Adding Distributed Caching
  • Container Image Layer Caching
  • Parallelizing Large Jobs
  • Taking Advantage of the needs keyword

Running Jobs in Kubernetes

Where the pipeline jobs run can make a massive difference to how quickly jobs start and the resources available to them.

GitLab uses the GitLab Runner application to run jobs in a pipeline. The environment the job is run in is defined by the runner’s executor, of which there are many choices.

The original executor setup we were using was docker+machine, which in our case spun up a new AWS EC2 instance for each job, autoscaling based on demand. While the docker+machine executor provided great isolation and resource control for jobs, the intermittent demand of CI pipelines and imperfect auto-scaling meant that spinning up a new EC2 instance for each job took ~3 minutes on average. These long spin-up times meant that pipelines with a large number of jobs accumulated a lot of idle time… not ideal when developers are waiting for critical deployments.

The solution we came up with to alleviate the ~3 minute delay per job was to move our jobs into a Kubernetes cluster using the GitLab Runner Kubernetes executor.

Moving jobs to run in Kubernetes came with several advantages, two of which were:

  • Less waiting time - The Kubernetes nodes are already running, so there is little waiting time to start a pod compared to spinning up a new AWS EC2 instance.
  • Better resource optimization - CI jobs have a wide range of resource requirements. Running in Kubernetes lets each job define its own resource requests and limits, allowing for easier, more refined resource optimization than creating different runners for different AWS EC2 instance sizes.
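
For illustration, default per-job resources can be set in the runner’s configuration under the [runners.kubernetes] section (the field names below come from the GitLab Runner Kubernetes executor docs; the values are placeholders you would tune for your own workloads):

[[runners]]
  [runners.kubernetes]
    image = "alpine:latest"
    # default resources for each CI job pod (plain Kubernetes quantities)
    cpu_request = "500m"
    cpu_limit = "1"
    memory_request = "512Mi"
    memory_limit = "1Gi"

Individual jobs can also overwrite these defaults through CI/CD variables such as KUBERNETES_CPU_REQUEST, provided the corresponding *_overwrite_max_allowed settings are configured on the runner.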

Creating a GitLab Runner with Kubernetes Executor

Since creating a Kubernetes cluster is a whole other topic, this blog post assumes you have already created a cluster to run the CI jobs in.

The GitLab Runner can be deployed to your Kubernetes cluster using the official Helm chart, either manually, as seen in the GitLab docs or via IaC such as Terraform like in the example repository.

To create a runner which runs CI jobs in Kubernetes, the GitLab Runner deployment must be configured with a runner using the Kubernetes executor. Relevant docs on how to do this can be found in the GitLab Runner Helm chart and Kubernetes executor documentation.

A minimal GitLab Runner Helm Chart values.yml configuration would look like this:

gitlabUrl: <url of your GitLab instance>
runnerRegistrationToken: <runner registration token> # strongly consider storing this in a secret
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        image = "alpine:latest"

Once deployed, to use the runner:

  • Find your new runner in the GitLab UI /admin/runners page
  • Assign it a tag to run specific jobs, e.g. kubernetes
  • Create and run a job that uses that tag (see the example below)
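
As a minimal (hypothetical) example, a job that targets the new runner via its tag would look something like this:

unit-tests:
  stage: test
  tags:
    - kubernetes   # routes the job to the Kubernetes runner created above
  script:
    - bundle exec rspec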

Adding Distributed Caching

One of the most effective ways to reduce pipeline duration is to reduce the amount of time spent downloading and building application dependencies. Caching can be used to store dependencies built in a previous pipeline run.

When CI jobs are running in containers (like when using the Kubernetes executor), local storage is ephemeral, so any build dependencies created in a job will be deleted on completion and therefore will not be available to subsequent jobs.

A common way to allow dependencies to be persistently stored and shared between jobs is by creating a distributed cache. This is where a compatible cloud storage service, such as AWS S3, is used to store the cached files which are then uploaded and downloaded throughout the job’s lifecycle.

We decided to go for AWS S3 for our runner cache due to its high availability and ease of setup and maintenance.

Creating a GitLab Runner Distributed Cache Using AWS S3

There are two components required to add an AWS S3 distributed cache to a GitLab Runner:

  • An S3 bucket for storing cache objects
  • An IAM user with permission to access the cache. Alternatively, if you are using a Kubernetes executor and the cluster is hosted on AWS EKS you can use an IAM role and service account, which makes credentials management easier.

These can be created manually or via IaC, as in the example repository.

The cache can then be added to the runner in the GitLab Runner config, similar to the example repository:

# …
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        image = "alpine:latest"
      [runners.cache]
        Type = "s3"
        Path = "runner-cache"
        Shared = true
        [runners.cache.s3]
          BucketName = "<your bucket name>"
          BucketLocation = "<your bucket region>"
          AccessKey = "access-key"
          SecretKey = "secret-key"
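
If you go down the IAM role and service account route mentioned earlier, the static keys can be dropped; recent GitLab Runner versions support an IAM-based authentication type for the S3 cache. A sketch, assuming the runner pods are annotated with an appropriate role:

# …
      [runners.cache]
        Type = "s3"
        Path = "runner-cache"
        Shared = true
        [runners.cache.s3]
          BucketName = "<your bucket name>"
          BucketLocation = "<your bucket region>"
          AuthenticationType = "iam" # credentials come from the pod's IAM role rather than static keys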

Using Caching in a GitLab CI Job

Ensure the job will be executed on a runner which has a cache configured by tagging both the runner and the job. Then use the cache by adding the cache configuration to the job. For example:

test-job:
  # …
  cache:
    - key:
        files:
          - Gemfile.lock
      paths:
        - vendor/ruby
  tags:
    - runner-with-cache

Notes:

  • It’s good practice to use a cache key that matches your dependencies and workflow. For example, the above key (Gemfile.lock) busts the cache when there have been changes to the application's Gemfile dependencies.
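
The same pattern applies to other ecosystems. For example, a hypothetical Node.js job could key its cache on yarn.lock and cache node_modules:

test-frontend-job:
  # …
  cache:
    - key:
        files:
          - yarn.lock
      paths:
        - node_modules/
  tags:
    - runner-with-cache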

Container Image Layer Caching

If you run your application in containers, it’s likely you’ll have to rebuild your container images on each deployment pipeline run.

When building container images locally, tools such as Docker cache container image layers, preventing the need to rebuild identical layers. The layer cache can save considerable time: when parts of the application don’t change, such as the dependencies, those image layers can be re-used instead of rebuilt.

As mentioned in the above section, when CI jobs run in containers, local storage is ephemeral, so any layer caches will be deleted on job completion.

To take advantage of layer caching in pipelines, the image layers must first be pulled from an external cache, such as the registry you push your images to for deployment, e.g. DockerHub or AWS ECR. For example, when using Docker to build images:

docker pull <image name>:latest || true  # the first run may have nothing to pull
docker build --cache-from <image name>:latest -t <image name>:latest .

Building Images with Cache in Kubernetes

As CI jobs using the Kubernetes executor run in containers, using docker to build images in Kubernetes is classed as Docker in Docker (DinD). Using DinD can lead to multiple issues, particularly with regard to security, since most DinD setups require running the parent container with the --privileged option, so it tends to be best avoided if possible.

Another tool that can be used to build container images is Kaniko, which is also recommended by GitLab. Kaniko uses a custom container image to build from a Dockerfile and source code and also comes with a bunch of other cool features, such as remote image layer caching.

GitLab has a great guide on how to build images in CI jobs using the Kaniko debug image. To add image layer caching to their example, add the --cache=true and --cache-repo <cache repo> options to the /kaniko/executor command. An example of this can be seen in the example repository here.

Another cool feature of Kaniko is the layer cache expiry. Using the --cache-ttl <duration> option, cached layers will expire after a set period. Layer expiry allows all steps of the Dockerfile to be run periodically, so RUN commands which update the image will not be cached forever, keeping your image up to date. This can be useful, particularly if you implement scheduled pipelines to rebuild images on older applications that are not in active development.
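
Putting those options together, a condensed sketch of a Kaniko build job might look like the following (the image tag, cache repository and TTL are illustrative; the Shared CI Config section later in this post shows the version we extracted into our shared configuration):

container-build:
  stage: Container Build
  image:
    name: gcr.io/kaniko-project/executor:v1.8.1-debug
    entrypoint: [""]
  script:
    # registry authentication is omitted here for brevity
    - /kaniko/executor
        --context "$CI_PROJECT_DIR"
        --dockerfile "$CI_PROJECT_DIR/Dockerfile"
        --destination "<image name>:$CI_COMMIT_REF_NAME"
        --cache=true
        --cache-repo "<cache repo>"
        --cache-ttl 168h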

Optimizing the Dockerfile for Layer Caching

To take full advantage of layer caching, the Dockerfile being built from will need to be optimized for caching. There are plenty of articles out there on how to do this for a specific language or framework, e.g. this one from Florin Lipan, but two key things to do are:

Firstly, follow the Dockerfile best practices.

Secondly, copy and build more stable parts of the application, such as dependencies, before the rest of the application. For example, when building a Ruby application that uses Bundler, the Gemfile & Gemfile.lock can be copied first to allow for the layers to be cached if the Gems are not changed in subsequent builds:

FROM ruby:3.1
# …
COPY Gemfile Gemfile.lock ./
RUN bundle install

COPY . ./
# …

The container layers will be cached as long as the hash of the copied files doesn’t change. If you only have one COPY command that copies in all the files with COPY . ./, some files will almost certainly have changed (that’s a common reason for a deployment pipeline to run in the first place), which busts the cache for all subsequent steps and causes the gems to be rebuilt.

Full example repository Dockerfile.

Parallelizing Large Jobs

If a task can’t be avoided through caching or other means, another way to speed up its overall execution is parallelization.

In CI, parallelization means splitting a task across multiple machines (or pods), balancing overall runtime against the amount of compute used and the per-job overhead (those unavoidable setup costs, such as downloading the source code). It’s commonly used for large test suites.

Luckily GitLab has parallel jobs built into its CI system which can be enabled by using the parallel keyword.

GitLab has a good guide on how to set up parallel jobs. An example would be splitting JavaScript Jest tests using the --shard option:

test:
  parallel: 5
  script:
    - yarn jest --shard "$CI_NODE_INDEX/$CI_NODE_TOTAL" --maxWorkers=1

Taking Advantage of the needs Keyword

When running pipelines with lots of dependent jobs, using the needs keyword instead of the dependencies keyword can lead to improvements in overall pipeline duration.

The needs keyword allows a job to start as soon as the jobs it depends on have completed, out of stage order. Not waiting for every job in a stage to complete can be useful in larger applications, such as those with multiple build stages. For example, in a full-stack application that builds and tests both the frontend UI and the backend API, each unit test job is unlikely to require the entirety of the application’s dependencies to be built. So, if the UI dependencies finish building before the API’s, the UI test jobs can start sooner, out of stage order, reducing the overall pipeline run time when the UI specs are long.
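
As an illustrative sketch (the job names are invented for this example), the UI tests only need the UI build, so they can start as soon as it finishes rather than waiting for the whole build stage:

build:ui:
  stage: build
  script:
    - yarn install --frozen-lockfile

build:api:
  stage: build
  script:
    - bundle install

test:ui:
  stage: test
  needs: ["build:ui"]  # starts as soon as build:ui finishes, without waiting for build:api
  script:
    - yarn test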

When optimizing a GitLab CI pipeline using needs, GitLab generates a Directed Acyclic Graph (DAG) to define the job order and aid visualization; the sample project’s DAG can be viewed on its pipeline page in the GitLab UI.

Other Refactors

Here are some of the other refactors we did along the way which aren’t strictly speed-related.

Shared CI Config

When you’re working with 100+ projects, including a shared configuration file can make global changes considerably easier and quicker.

By using the include keyword in GitLab CI, steps and stages from other configuration files located outside of the local repository can be merged into a project’s pipeline configuration.

For example, throughout our projects we extracted common CI steps for building, scanning, and pushing container images into a global shared CI configuration file. E.g. in the example repository shared.gitlab-ci.yml:

# shared.gitlab-ci.yml
.shared:container:build:
  stage: Container Build
  image: gcr.io/kaniko-project/executor:v1.8.1-debug
  variables:
    CONTAINER_REGISTRY_USER: "<your-docker-user>"
  script:
    - mkdir -p /kaniko/.docker/
    - echo "$DOCKER_AUTH" > /kaniko/.docker/config.json
    - /kaniko/executor
        --context "$CI_PROJECT_DIR"
        --dockerfile "$CI_PROJECT_DIR/Dockerfile"
        --cache=true
        --cache-repo "$CONTAINER_REGISTRY_USER/$CI_PROJECT_NAME-cache"
        --cache-ttl 168h
        --destination "$CONTAINER_REGISTRY_USER/$CI_PROJECT_NAME:$CI_COMMIT_REF_NAME"
  dependencies: []

Projects include the global CI configuration and then use the extends keyword to merge any required shared configuration into their jobs. E.g. in the example repository .gitlab-ci.yml:

# .gitlab-ci.yml
# …

include:
  - project: root/sample-project
    ref: main
    file: /shared.gitlab-ci.yml

container:build:
  stage: Container Build
  extends: .shared:container:build
  only:
    - master

The shared configuration gives a standardised process for container creation, and when changes need to be made there is only one place to update.

Cancelling Redundant Pipelines

When a lot of commits are being pushed, cancelling pipelines that have been superseded by a newer commit saves considerable compute. To take advantage of this feature, we have configured our projects to cancel redundant pipelines for everything except specific deployment pipelines that must not be interrupted part-way through.

While on some platforms cancelling redundant pipelines can be set up with a single toggle, e.g. CircleCI, in GitLab the setting has to be enabled per project and the CI configuration needs to explicitly mark which jobs can safely be cancelled.

To enable it per project, navigate in the GitLab UI to the project’s CI/CD settings “General Pipelines” section, then toggle “Auto-cancel redundant pipelines”:


In the .gitlab-ci.yml add interruptible: true to jobs that can be canceled mid-execution without negative side effects. E.g.

test:
  stage: Test
  # …
  interruptible: true

Results

Across the 90+ pipelines we worked on, there was an average 70% decrease in runtime, with the biggest gains coming from pipelines with lots of dependency builds (improved through distributed caching and layer caching) and a high number of jobs (improved by moving to Kubernetes).

Overall, we saved over 350 hours per month of pipeline run time, which will go a long way towards improving our developer productivity and reducing frustration.

Rolling Average Pipeline Duration Graph

Future Improvements

There are still a few improvements that we might make in the future to the pipeline configuration and infrastructure. Some of the notable ones are:

  • Create a local pull-through image cache using Docker Registry (or another caching application) - This will speed up image downloads and reduce internet traffic.
  • Create a local distributed cache using MinIO instead of AWS S3. This will provide improved cache speed and no ingress/egress fees from S3; however, it is harder to maintain and less highly available.
  • Parent-child pipelines. This will allow us to split the larger pipelines into independent, more manageable sections (sketched below).
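
For reference, a parent pipeline can trigger a child pipeline from a separate configuration file using the trigger keyword; a minimal sketch (the file path is hypothetical):

frontend:
  trigger:
    include: frontend/.gitlab-ci.yml
    strategy: depend  # the parent job waits for, and mirrors, the child pipeline's status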

Conclusion

Overall, by changing a few key aspects of where and how our CI/CD pipelines run and adding distributed caching, we’ve managed to decrease our average pipeline duration by 70% 😄

While the ~350 hours saved doesn’t translate directly into saved developer time (since developers will likely be working on other things while waiting for long pipelines), it will certainly help on the shorter pipelines, such as merge request tests, where they will no longer need to context switch to other work since the pipelines will only take a few seconds to minutes.

Along with saving developers time and making their experience better, the decrease in pipeline duration has greatly improved our merge-to-deployment time, which will help with incident recovery and the outer development feedback loop.