Abstract. In this post I highlight a problem that occurs when one needs to build a CI/CD pipeline on top of an already existing Makefile, and describe my solution to it.
GNU Make for reproducible research papers
We support the principle of reproducibility in science: all numerical results presented in our last paper are fully reproducible and can hence be verified by the reader. To build the paper, a series of numerical experiments is executed, their results are analysed, and finally the corresponding tables and plots are generated.
One obviously needs a build system to get this right. We opted for GNU Make. The idea of using GNU Make for research papers is neither new nor surprising: Make is a general-purpose build system and is available on almost every Linux system.
Here is an example of a Makefile for a paper in numerical mathematics, which one could easily extend.
```make
MANUSCRIPT_TEX = manuscript.tex
MANUSCRIPT_PDF = $(MANUSCRIPT_TEX:.tex=.pdf)

.PHONY: manuscript plots.all tables.all numericals.all

manuscript: $(MANUSCRIPT_PDF)

$(MANUSCRIPT_PDF): \
		$(MANUSCRIPT_TEX) \
		plots.all \
		tables.all
	latexmk -pdf -silent $(MANUSCRIPT_TEX)

artifacts/numericals/solution_1.npy: \
		numericals/solver_1.py
	mkdir -p artifacts/numericals
	python3 numericals/solver_1.py --outfile=$@

artifacts/plots/plot_1.svg: \
		plots/plot_1.py \
		artifacts/numericals/solution_1.npy
	mkdir -p artifacts/plots
	python3 plots/plot_1.py --outfile=$@

artifacts/tables/table_1.tex: \
		tables/table_1.py \
		artifacts/numericals/solution_1.npy
	mkdir -p artifacts/tables
	python3 tables/table_1.py --outfile=$@

numericals.all: artifacts/numericals/solution_1.npy
plots.all: artifacts/plots/plot_1.svg
tables.all: artifacts/tables/table_1.tex
```
Notice that all generated artifacts, i.e. the computed numbers, plots, and tables, live in artifacts/, separately from the generation scripts and the author’s content.
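For reference, this is the directory layout implied by the Makefile above (artifacts/ is generated by make and can safely be wiped or ignored by git):

```
.
├── Makefile
├── manuscript.tex
├── numericals/
│   └── solver_1.py
├── plots/
│   └── plot_1.py
├── tables/
│   └── table_1.py
└── artifacts/                      # generated, not committed
    ├── numericals/solution_1.npy
    ├── plots/plot_1.svg
    └── tables/table_1.tex
```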
Creating a GitLab CI pipeline
Well, the ability to build the manuscript locally on a developer’s machine is nice, but automated builds running on a dedicated computing node for every new version are even better. How do we write a CI/CD pipeline based on the Makefile we have?
The key feature of make is dependency resolution and smart re-evaluation: artifacts whose dependencies did not change are not recomputed. The default strategy for a GitLab CI pipeline, however, is to start all computations from scratch every time. For a paper based on heavy computations this becomes extremely expensive, and the overhead outweighs the benefits of automation: a single line of text added to the manuscript triggers re-evaluation of all the artifacts.
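This re-evaluation behaviour is easy to observe in isolation. Here is a minimal throwaway sketch (the file names and the /tmp location are arbitrary, assuming GNU Make is installed):

```shell
# A one-rule Makefile: out.txt is rebuilt only when in.txt changes.
mkdir -p /tmp/make-demo && cd /tmp/make-demo
printf 'out.txt: in.txt\n\tcp in.txt out.txt\n' > Makefile
echo data > in.txt

make   # first run: copies in.txt to out.txt
make   # second run: nothing to do, out.txt is already up to date
```

Running `make --question out.txt` afterwards exits with status 0, confirming that make considers the target up to date.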
To resolve this issue we need to tell GitLab to cache our artifacts. Moreover, we need to change the GIT_STRATEGY from its default clone to fetch. Otherwise the scripts that generate the artifacts receive more recent modification times than the previously computed artifacts, which makes the latter automatically obsolete for make.
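The effect of a fresh clone can be simulated with touch: like a fresh checkout, it gives a prerequisite a newer modification time than the cached artifact. Another throwaway sketch with arbitrary names:

```shell
# Build an artifact once, then refresh the prerequisite's mtime.
mkdir -p /tmp/clone-demo && cd /tmp/clone-demo
printf 'out.txt: in.txt\n\tcp in.txt out.txt\n' > Makefile
echo data > in.txt
make                 # out.txt is built and "cached"

sleep 1              # ensure a strictly newer timestamp on coarse filesystems
touch in.txt         # what GIT_STRATEGY: clone effectively does to every script
make --question out.txt || echo "out.txt is considered stale again"
```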
Here is a minimal example.
```yaml
variables:
  GIT_STRATEGY: fetch

cache:
  key: $CI_COMMIT_REF_SLUG
  paths:
    - artifacts

manuscript:
  image: myproject/image
  script: make manuscript
  artifacts:
    paths:
      - artifacts
```
What is wrong with it? We declared one high-level task, manuscript, and delegated all the small intermediate steps to make. Is that bad? Not always.
Going this way, we need to build one big and dirty Docker image containing all the relevant software. For small projects with simple computing environments this might be the best choice. However, if the computing environment is sophisticated and challenging to reproduce, it might be beneficial to split it into several images: one for each group of tasks, i.e. computing, processing, plotting, etc.
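As a sketch of what such a per-task image could look like — the base image and the package list here are assumptions, not something the project prescribes — a hypothetical myproject/image_plots might be as small as:

```dockerfile
# Hypothetical Dockerfile for myproject/image_plots: only make, Python,
# and the plotting libraries; no solver stack or TeX toolchain.
FROM python:3.11-slim
RUN apt-get update \
    && apt-get install -y --no-install-recommends make \
    && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir numpy matplotlib
```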
In this case our pipeline also becomes more informative: instead of one big manuscript task, we see all the intermediate steps directly in GitLab’s interface.
So here is the improved version.
```yaml
stages:
  - compute
  - process
  - tex

variables:
  GIT_STRATEGY: fetch

cache:
  key: $CI_COMMIT_REF_SLUG
  paths:
    - artifacts

numericals:
  stage: compute
  image: myproject/image_numericals
  script: make numericals.all
  artifacts:
    paths:
      - artifacts/numericals

plots:
  stage: process
  image: myproject/image_plots
  script: make plots.all
  artifacts:
    paths:
      - artifacts/plots

tables:
  stage: process
  image: myproject/image_tables
  script: make tables.all
  artifacts:
    paths:
      - artifacts/tables

tex:
  stage: tex
  image: myproject/image_tex
  script: make manuscript
  artifacts:
    paths:
      - manuscript.pdf
```
After using a similar pipeline for some time, we ran into the following issue: when some artifacts actually are re-evaluated (i.e. their dependencies were modified), the next task occasionally fails because the artifacts from the previous stage are still being uploaded by the corresponding GitLab runner. At the moment I have no easy fix in mind, so I simply restart the failed task when this happens.
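One way to at least automate that manual restart is GitLab’s built-in retry keyword, which re-runs a job after a script failure. This does not fix the underlying race, it only retries past it; shown here for the tables job as an example:

```yaml
tables:
  stage: process
  image: myproject/image_tables
  script: make tables.all
  retry:
    max: 2
    when: script_failure
```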