BLOG: Let’s build a workflow, part 2: encapsulating bespoke analyses
In our previous post in this series, we explored the idea of chaining multiple foundational workflows into a larger umbrella workflow. Some of these foundational workflows wrapped existing software packages distributed via official Docker images, highlighted in red in the figure below.
Example workflow to identify patients with similar patterns of genetic variation starting from genome sequences.
But what about the downstream steps (in teal) for our bespoke analyses? They obviously do not already exist as Docker images. We explored a strategy to package the software underlying these analyses as Docker images in a way that
Leverages our data scientists’ existing knowledge,
Reduces their technical barrier to entry, and
Allows us to quickly iterate over the design of these downstream analyses.
A Snakemake workflow as a software package
To package our downstream analyses, we designed a workflow using Snakemake that we can invoke externally, as this matches our process in other projects.
Designing a software package as a workflow has precedent. An example of this approach is Bactopia, which orchestrates individual Perl and Python scripts using Nextflow. The Bactopia authors provide an official WDL workflow that wraps Bactopia itself.
But why Snakemake and not Nextflow, Prefect, etc.?
It is based on Python. Our data science team operates almost exclusively in Python. Snakefiles have a syntax reminiscent of Python, greatly reducing the barrier to entry.
It can directly run notebooks. Internally, we prototype analyses as Jupyter notebooks. We eventually want to productionize code in these notebooks as Python modules, but an analysis may need to mature more before we consider doing so. Snakemake’s ability to parameterize and directly run notebooks leverages our data scientists’ existing knowledge and allows us to iterate quickly when designing these notebooks.
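To illustrate, a rule that runs a notebook directly can be sketched as follows — the rule, file, and notebook names here are hypothetical, not our actual analysis:

```snakemake
# Hypothetical rule: parameterize and execute a prototype notebook.
rule cluster_patients:
    input:
        variants="results/annotated_variants.tsv",
    output:
        clusters="results/patient_clusters.tsv",
    log:
        # Snakemake stores the executed copy of the notebook here
        notebook="logs/notebooks/cluster_patients.ipynb",
    notebook:
        "notebooks/cluster_patients.py.ipynb"
```

Within the notebook, Snakemake injects a `snakemake` object exposing `input`, `output`, and `params`, which is how the notebook is parameterized without any manual editing.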
Wrapping up the software package with a CLI
Users of our downstream analyses may not necessarily invoke them through a WDL workflow. Moreover, internally testing the Snakemake workflow via a command line interface (CLI) has less technical overhead than running an entire WDL workflow, which requires a separate workflow execution engine. Externally, this CLI is also a convenient entrypoint for our downstream analyses that relieves us of the need to manually create our own Snakemake configuration files.
We created a CLI that calls Snakemake from Python without invoking the snakemake CLI:
```python
import click
import snakemake


@click.command()
@click.option(...)
def main(...):
    config = {<configuration-options>}
    snakemake.snakemake(
        snakefile=<path-to-snakefile>,
        config=config,
        printshellcmds=True,
    )


if __name__ == "__main__":
    main()
```
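For reference, an equivalent invocation through the snakemake CLI would look roughly like the following (the config keys and paths are placeholders); our CLI spares users from assembling this by hand:

```shell
snakemake \
  --snakefile <path-to-snakefile> \
  --config key1=value1 key2=value2 \
  --printshellcmds \
  --cores 1
```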
Packaging our software into a Docker image and connecting it to a WDL task
To use our software package in a WDL workflow, we needed to package it in a Docker image. The choice of base image depends on whether you need to install non-Python dependencies. Normally, we would have used the micromamba Docker image as a base image, but one dependency of our downstream analyses required Java 8 or 11. The micromamba image is based on Debian, whose repositories provide only the latest Java release (22 at the time of writing).
We instead selected an Ubuntu LTS base image, which provided an installable Java 11 package. Starting with this image, we installed micromamba and then the Python project containing our downstream analyses.
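Abbreviated, the Dockerfile follows this shape — the package names, environment name, and paths are illustrative rather than our exact build, and the micromamba installation step is elided:

```dockerfile
FROM ubuntu:22.04

# Java 11 for the downstream dependency that requires it
RUN apt-get update \
 && apt-get install -y --no-install-recommends openjdk-11-jre-headless \
 && rm -rf /var/lib/apt/lists/*

# (micromamba installation elided; see the micromamba documentation)

# Create the environment for our package from a pinned spec
COPY env.yaml /tmp/env.yaml
RUN micromamba create -y -n analysis -f /tmp/env.yaml

# Install the Python project containing the downstream analyses
COPY . /opt/analysis
RUN micromamba run -n analysis pip install /opt/analysis
```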
As we started with an Ubuntu image, we had to prepend the path to the directory containing our CLI (often the bin folder of the Python environment) in the Dockerfile for our image:
```dockerfile
ENV PATH <path-to-cli-entrypoint>:$PATH
```
We then wrote a WDL task that calls the CLI and specified the Docker image in the runtime section of that task:
```wdl
task my_task {
  input {
    ...
  }

  command <<<
    set -euo pipefail

    my-cli <parameters>
  >>>

  output {
    ...
  }

  runtime {
    docker: "<my-organization>/<my-image-name>"
  }
}
```
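From there, the task slots into a workflow like any other WDL call. A minimal sketch, assuming a hypothetical `variants` input and `results` output for the task:

```wdl
workflow downstream_analyses {
  input {
    File variants
  }

  call my_task {
    input:
      variants = variants
  }

  output {
    File results = my_task.results
  }
}
```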
We have reached the end of our mini-series on constructing a WDL workflow in a modular manner. Follow us on LinkedIn to hear more about our journey in translating data into discoveries!