Creating a CI/CD pipeline employing an existing Makefile
Abstract. In this post I highlight a problem that occurs when one needs to create a CI/CD pipeline based on an already existing Makefile, and describe my solution to it.
GNU Make for reproducible research papers
We support the principle of reproducibility in science. All numerical results presented in our last paper are fully reproducible and hence can be verified by the reader. To build the paper, a series of numerical experiments is executed, their results are analysed, and finally the corresponding tables and plots are generated.
One obviously needs a build system to get this right. We opted for GNU Make. The idea of using GNU Make for research papers is neither new nor surprising. Make is a general-purpose build system and is available on almost every Linux system.
Here is an example of a Makefile for a paper in numerical mathematics, which one could easily extend.
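MANUSCRIPT_TEX = manuscript.tex
MANUSCRIPT_PDF = manuscript.pdf

.PHONY: manuscript plots.all tables.all numericals.all

manuscript: $(MANUSCRIPT_PDF)

# The PDF is built from the TeX source, the plots and the tables.
$(MANUSCRIPT_PDF): $(MANUSCRIPT_TEX) \
    plots.all \
    tables.all
	latexmk -pdf -silent $(MANUSCRIPT_TEX)

artifacts/numericals/solution_1.npy: \
    numericals/
	mkdir -p artifacts/numericals
	python3 numericals/ --outfile=$@

artifacts/plots/plot_1.svg: \
    plots/ \
    artifacts/numericals/solution_1.npy
	mkdir -p artifacts/plots
	python3 plots/ --outfile=$@

artifacts/tables/table_1.tex: \
    tables/ \
    artifacts/numericals/solution_1.npy
	mkdir -p artifacts/tables
	python3 tables/ --outfile=$@

numericals.all: artifacts/numericals/solution_1.npy
plots.all: artifacts/plots/plot_1.svg
tables.all: artifacts/tables/table_1.tex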
Notice that all the generated artifacts, i.e. computed numbers, plots and tables, are located in artifacts/, separately from the generation scripts and the author’s content.
Creating a GitLab CI pipeline
Well, the ability to build the manuscript locally on a developer’s machine is nice, but automated builds running on a dedicated computing node for every new version are even better. How do we write a CI/CD pipeline based on the Makefile we have?
The key feature of make is dependency resolution and smart re-evaluation: artifacts whose dependencies didn’t change won’t be recomputed. However, the default strategy for a GitLab CI pipeline would be to start all the computations from scratch every time. For a paper based on heavy computations this becomes so expensive that the overhead outweighs the benefits of automation: a single line of text added to the manuscript triggers re-evaluation of all the artifacts.
To resolve this issue we need to tell GitLab to cache our artifacts. Moreover, we need to change GIT_STRATEGY from its default, clone, to fetch. Otherwise every job starts from a fresh clone, so the scripts that generate the artifacts get more recent modification times than the previously computed artifacts, which makes the latter automatically obsolete for make.
Here is a minimal example.
variables:
  GIT_STRATEGY: fetch

cache:
  paths:
    - artifacts

manuscript:
  image: myproject/image
  script: make manuscript
  artifacts:
    paths:
      - artifacts
What is wrong with it? We declared a single high-level task, manuscript, and delegated all the small intermediate steps to make. Is that bad? Not always.
Going this way we need to create a big, messy Docker image that contains all the relevant software. For small projects with simple computing environments this might be the best choice. However, if the computing environment is rather sophisticated and challenging to reproduce, it might be beneficial to split it into several images, one for each group of tasks: computing, processing, plotting, etc.
In this case our pipeline will also become more informative. Instead of one big task manuscript we will be able to see all the intermediate steps straight in GitLab’s interface.
So here is the improved version.
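stages:
  - compute
  - process
  - tex

variables:
  GIT_STRATEGY: fetch

cache:
  paths:
    - artifacts

# Run the numerical experiments.
numericals:
  stage: compute
  image: myproject/image_numericals
  script: make numericals.all
  artifacts:
    paths:
      - artifacts/numericals

# Generate the plots from the computed results.
plots:
  stage: process
  image: myproject/image_plots
  script: make plots.all
  artifacts:
    paths:
      - artifacts/plots

# Generate the tables from the computed results.
tables:
  stage: process
  image: myproject/image_tables
  script: make tables.all
  artifacts:
    paths:
      - artifacts/tables

# Compile the manuscript with the generated plots and tables.
manuscript:
  stage: tex
  image: myproject/image_tex
  script: make manuscript
  artifacts:
    paths:
      - manuscript.pdf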
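One way to check that the cache and the fetch strategy behave as described is a dry-run job. The fragment below is only a sketch under assumptions: the job name check-rebuild is hypothetical, the image name is borrowed from the examples in this post (any image that provides make should do), and the built-in .pre stage is used so that the job fits into either pipeline without touching the stage list.

```yaml
# Hypothetical job: prints what make would rebuild, without running anything.
check-rebuild:
  stage: .pre                    # built-in stage that runs before all other stages
  image: myproject/image_tex     # assumption: any image that provides make
  variables:
    GIT_STRATEGY: fetch          # reuse the working copy so unchanged files keep their timestamps
  cache:
    paths:
      - artifacts
    policy: pull                 # only read the cache, never upload it from this job
  script:
    - make -n manuscript         # -n (dry run): list the commands, do not execute them
```

If everything works as intended, a commit that only touches manuscript.tex should list the latexmk command in the job log and none of the python3 commands. A log that lists every python3 command suggests that either the cache was not restored or the file timestamps were reset.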
Caveats
After using a similar pipeline for some time we ran into the following issue. When the re-evaluation of some artifacts actually takes place (i.e. when their dependencies have been modified), the next task occasionally fails because the artifacts from the previous stage are still being uploaded from the corresponding GitLab runner. At the moment I have no easy fix in mind, so I simply restart the failed task whenever this happens.
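The manual restart can probably be automated with GitLab's retry keyword, possibly combined with the runner variable ARTIFACT_DOWNLOAD_ATTEMPTS. The fragment below is a sketch, not a tested fix: the job name manuscript and its image are taken from the examples in this post, the values are arbitrary, and I have not verified that either setting removes the race described above.

```yaml
# Sketch only: automates the manual restart described above.
# Merge the variables entry into an existing top-level variables block.
variables:
  ARTIFACT_DOWNLOAD_ATTEMPTS: "3"   # number of attempts the runner makes to download artifacts

manuscript:
  stage: tex
  image: myproject/image_tex
  script: make manuscript
  retry: 2                          # rerun the job up to two more times if it fails
```

Note that a plain retry also reruns jobs that fail for genuine reasons, such as a LaTeX error, so a real failure takes longer to surface.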