Creating a CI/CD pipeline employing an existing Makefile

In this post I’m going to highlight a problem that arises when one needs to create a CI/CD pipeline based on an already existing Makefile, and describe my solution to it.

GNU Make for reproducible research papers

We support the principle of reproducibility in science. All numerical results presented in our last paper are fully reproducible and hence can be verified by the reader. In order to build the paper a series of numerical experiments is executed, their results are analysed, and finally the corresponding tables and plots are generated.

One obviously needs a build system to get this right. We opted for GNU Make. The idea of using GNU Make for research papers is neither new nor surprising. Make is a general-purpose build system and is available on almost every Linux system.

Here is an example of a Makefile for a paper in numerical mathematics, which one could easily extend.

MANUSCRIPT_TEX = manuscript.tex
MANUSCRIPT_PDF = $(MANUSCRIPT_TEX:.tex=.pdf)

.PHONY: manuscript plots.all tables.all numericals.all

manuscript: $(MANUSCRIPT_PDF)

$(MANUSCRIPT_PDF): \
    $(MANUSCRIPT_TEX) \
    plots.all \
    tables.all
	latexmk -pdf -silent $(MANUSCRIPT_TEX)

artifacts/numericals/solution_1.npy: \
    numericals/solver_1.py
	mkdir -p $(@D)
	python3 numericals/solver_1.py --outfile=$@

artifacts/plots/plot_1.svg: \
    plots/plot_1.py \
    artifacts/numericals/solution_1.npy
	mkdir -p artifacts/plots
	python3 plots/plot_1.py --outfile=$@

artifacts/tables/table_1.tex: \
    tables/table_1.py \
    artifacts/numericals/solution_1.npy
	mkdir -p $(@D)
	python3 tables/table_1.py --outfile=$@

numericals.all: artifacts/numericals/solution_1.npy
plots.all: artifacts/plots/plot_1.svg
tables.all: artifacts/tables/table_1.tex

Notice that all the generated artifacts, i.e. computed numbers, plots and tables, are located in artifacts/, separately from the generation scripts and the authors’ content.
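The rules above assume that every generation script accepts an --outfile argument. As an illustration only, such a script might look like the following minimal sketch. The function names and the table content are hypothetical placeholders, not the code used for the paper.

```python
import argparse
from pathlib import Path


def render_table(rows):
    """Render (label, value) pairs as a small LaTeX tabular."""
    lines = [r"\begin{tabular}{lr}", r"quantity & value \\", r"\hline"]
    lines += [f"{label} & {value:.3e} \\\\" for label, value in rows]
    lines.append(r"\end{tabular}")
    return "\n".join(lines) + "\n"


def main(argv=None):
    parser = argparse.ArgumentParser(description="Generate a LaTeX table.")
    parser.add_argument("--outfile", type=Path, required=True)
    args = parser.parse_args(argv)

    # Placeholder values (assumption); a real script would load
    # artifacts/numericals/solution_1.npy and derive the rows from it.
    rows = [("placeholder error", 1.0e-3)]

    # Create the output directory so the script also works on a clean checkout.
    args.outfile.parent.mkdir(parents=True, exist_ok=True)
    args.outfile.write_text(render_table(rows))
    return args.outfile


if __name__ == "__main__":
    main()
```

Invoked as python3 tables/table_1.py --outfile=artifacts/tables/table_1.tex, the sketch writes the table to exactly the path that make passes in via $@.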

Creating a GitLab CI pipeline

Well, the ability to build the manuscript locally on a developer’s machine is nice, but automated builds running on a dedicated computing node for every new version are even better. How do we write a CI/CD pipeline based on the Makefile we have?

The key feature of make is dependency resolution and smart re-evaluation: computed artifacts whose dependencies didn’t change won’t be recomputed. However, the default strategy for a GitLab CI pipeline is to start all the computations from scratch every time. For a paper based on heavy computations this becomes extremely expensive, to the point that the overhead outweighs the benefits of automation: a single line of text added to the manuscript triggers re-evaluation of all the artifacts.

In order to resolve this issue we need to tell GitLab to cache our artifacts. Moreover, we need to change the GIT_STRATEGY from its default clone to fetch. With clone, every job checks out the repository afresh, so the scripts which generate the artifacts get more recent modification times than the previously computed artifacts, which makes the latter automatically obsolete for make.
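To see why the modification times matter, here is a minimal Python sketch of the timestamp comparison that make performs. It is an approximation for illustration, not make's actual implementation. The helper names, the file names in the temporary directory and the timestamps are hypothetical, and the sketch assumes that the cache restores an artifact with its original modification time.

```python
from __future__ import annotations

import os
import tempfile
from pathlib import Path


def needs_rebuild(target: Path, prerequisites: list[Path]) -> bool:
    """Approximate make's rule: rebuild if the target is missing or any
    prerequisite has a newer modification time than the target."""
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime
    return any(p.stat().st_mtime > target_mtime for p in prerequisites)


def simulate() -> dict[str, bool]:
    """Check a cached artifact against its script under both Git strategies."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "solver_1.py"
        artifact = Path(tmp) / "solution_1.npy"
        script.write_text("# placeholder for the real solver\n")
        artifact.write_bytes(b"placeholder for cached results")

        # Earlier pipeline run: script checked out at t=1000, artifact built at t=2000.
        os.utime(script, (1000, 1000))
        os.utime(artifact, (2000, 2000))
        # fetch: the unchanged script keeps its old timestamp.
        stale_after_fetch = needs_rebuild(artifact, [script])

        # clone: the script is written anew at t=3000, while the cached
        # artifact still carries its old timestamp.
        os.utime(script, (3000, 3000))
        stale_after_clone = needs_rebuild(artifact, [script])

    return {"fetch": stale_after_fetch, "clone": stale_after_clone}


if __name__ == "__main__":
    print(simulate())  # {'fetch': False, 'clone': True}
```

In the fetch case the cached artifact is still considered up to date, whereas after a fresh clone the very same artifact looks older than its script and would be recomputed.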

Here is a minimal example.

variables:
  GIT_STRATEGY: fetch

cache:
  key: $CI_COMMIT_REF_SLUG
  paths:
    - artifacts

manuscript:
  image: myproject/image
  script: make manuscript
  artifacts:
    paths:
      - artifacts

What is wrong with it? We declared a single high-level task manuscript and delegated all the small intermediate steps to make. Is it bad? Not always.

Going this way we need to create a big and dirty Docker image which contains all the relevant software. For small projects with simple computing environments this might be the best choice. However, if the computing environment is rather sophisticated and challenging to reproduce, it might be beneficial to split the image into a few smaller ones: one for each group of tasks, i.e. computing, processing, plotting, etc.

In this case our pipeline will also become more informative. Instead of one big task manuscript we will be able to see all the intermediate steps straight in GitLab’s interface.

So here is the improved version.

stages:
  - compute
  - process
  - tex

variables:
  GIT_STRATEGY: fetch

cache:
  key: $CI_COMMIT_REF_SLUG
  paths:
    - artifacts

numericals:
  stage: compute
  image: myproject/image_numericals
  script: make numericals.all
  artifacts:
    paths:
      - artifacts/numericals

plots:
  stage: process
  image: myproject/image_plots
  script: make plots.all
  artifacts:
    paths:
      - artifacts/plots

tables:
  stage: process
  image: myproject/image_tables
  script: make tables.all
  artifacts:
    paths:
      - artifacts/tables

tex:
  stage: tex
  image: myproject/image_tex
  script: make manuscript
  artifacts:
    paths:
      - manuscript.pdf

Caveats

After using a similar pipeline for some time we ran into the following issue. When some artifacts are actually re-evaluated (i.e. when their dependencies have been modified), the next task occasionally fails because the artifacts from the previous stage are still being uploaded from the corresponding GitLab runner. At the moment I have no easy fix in mind, so I simply restart the failed task whenever this happens.