Debugging CWL Workflows and Tools in Arvados
Arvados 2.2 introduced a new feature that can help to debug your workflows: interactive ssh connectivity to running containers. This presents a good opportunity to provide an overview of debugging patterns and tools for running CWL workflows in Arvados.
This article describes techniques to help determine why an Arvados workflow isn’t running or is returning unexpected results, using examples of “broken” code and walking through how to debug them with these techniques. All the files used are available here for download on GitHub and can be run on the Arvados playground.
This article assumes knowledge of the basic technical components of Arvados. A video with a technical overview of Arvados is available on YouTube.
Debugging techniques
CWL workflow validation
One of the first steps before running a CWL workflow or tool on Arvados is to validate the CWL code.
arvados-cwl-runner has a --validate command line option which validates a CWL document without submitting it to Arvados. The validation ensures that the CWL can be parsed and is semantically correct. As of Arvados 2.3.1, workflow validation also detects circular dependencies in the CWL definition. The same check for circular dependencies also happens when the workflow is submitted to Arvados.
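For example, a quick check from the command line (the file name here is a placeholder):

```
arvados-cwl-runner --validate my-workflow.cwl
```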
As an example, we check the CWL command line tool fastqc_blog_broken.cwl, which runs fastqc on an input fastq. A typo has been introduced: the type of the input fastq1 was changed to Files, where it was supposed to be File.
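For reference, a minimal sketch of what the broken input section might look like (abbreviated and hypothetical; the full file is available on GitHub):

```
# fastqc_blog_broken.cwl (abbreviated sketch)
cwlVersion: v1.2
class: CommandLineTool
baseCommand: fastqc
inputs:
  fastq1:
    type: Files   # typo: the CWL type should be "File"
    inputBinding:
      position: 1
```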
The output of the validation indicates that the CWL code has an issue:
The fastqc_blog.cwl tool definition, which does not have the typo, validates properly:
CWL input validation
The workflow validation option to arvados-cwl-runner described in the previous section only validates the CWL. It does not check that the inputs, which are typically provided in a YML file, are of the correct type expected by the CWL workflow.
CWL input validation is done automatically when the workflow is submitted to Arvados.
For example, when a Directory is provided as the input for fastq1 in fastqc_blog_mismatch.yml where the workflow expects a File, the workflow submission will fail.
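Such a mismatched input file might look like this sketch (the keep: locator is a placeholder):

```
# fastqc_blog_mismatch.yml (sketch): a Directory where the tool expects a File
fastq1:
  class: Directory
  location: keep:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+123/fastq_dir
```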
Using the Arvados log files
If a workflow or tool submits without error to Arvados but does not complete successfully, the first step to find the issue is to look at the log files.
Arvados provides several log files for each container it executes. This section provides an overview of the more commonly used log files.
Unless arvados-cwl-runner is invoked with the --local option, it will start a workflow runner Arvados container that will, in turn, submit all the containers for the workflow to Arvados.
The workflow runner tracks which steps of the CWL workflow have already run, which still need to run and in what order, and which can run concurrently, and it creates the necessary container requests. The workflow runner logs can be found in the Logs tab of the main workflow in Workbench. Each step also has its own logs, generated by the container that ran that step of the workflow. In Workbench, those logs can be found by expanding the steps and clicking on the link to the log collection.
On the command line, the arv tool can be used to get the uuid of the log collection, which can then be accessed as a set of files via arv-mount. The user guide explains how to access collection contents from the command line. It is also possible to access collections via one of the SDKs.
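A minimal sketch of that workflow, assuming hypothetical UUIDs and the default JSON output of arv:

```
# Look up the container request; the "log_uuid" field in the JSON output
# identifies the log collection (UUIDs here are placeholders):
arv container_request get --uuid zzzzz-xvhdp-0123456789abcde

# Mount Keep and read the logs as ordinary files:
mkdir -p ~/keep
arv-mount ~/keep
ls ~/keep/by_id/zzzzz-4zz18-0123456789abcde/
less ~/keep/by_id/zzzzz-4zz18-0123456789abcde/stderr.txt
```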
Below is a snapshot of the log collection of a failed container request:
These are the log files generated by Arvados:
- stderr.txt: Captures everything written to standard error by the programs run by the executing container. This is a great place to start when debugging a failed run (or a run that gives unexpected results).
- stdout.txt: Captures everything written to standard out by the programs run by the executing container.
- node-info.txt: Capabilities and specifications of the virtual machine, as reported by Linux tools running on the virtual machine itself.
- node.json: A high-level overview of the instance type, such as name, price, and RAM. This information comes from the Arvados configuration, which is typically copied from the cloud provider’s price list. Some figures, notably RAM size, are typically close but not quite equal to the corresponding figures in node-info.txt.
- crunch-run.txt: Information about how the container’s execution environment was set up (e.g., time spent loading the Docker image) and timing/results of copying output data to Keep (if applicable).
- crunchstat.txt: Information about resource consumption (RAM, CPU, disk, network) by the container while it was running.
- hoststat.txt: Information about resource consumption (RAM, CPU, disk, network) on the node while the container was running. This differs from crunchstat.txt in that it includes resource consumption of Arvados components running on the node outside the container, such as crunch-run and other processes related to the Keep file system.
- container.json: The JSON description of the Arvados container object.
Arvados log files: a failure example
As an example, we submit a workflow that fails during execution, and resolve why it failed by using the logs. The CWL files are available here.
The failed workflow step can be found on the Arvados playground.
The stderr.txt log file indicates that something is wrong with the memory allocation:
The crunch-run.txt log file provides additional information about the status of the container execution environment, including the Docker container that ran this workflow step.
Arvados is configured to use Docker as the runtime engine, and the Docker container exited with code 139, which indicates the process was killed by signal 11 (139 - 128 = 11, SIGSEGV). A SIGSEGV is often caused by a code bug such as a null pointer dereference, e.g. after failing to notice that a memory allocation failed.
When using a cloud dispatcher, Arvados uses the cheapest node type that satisfies the resource requirements of a specific workflow step. The node type that was used and its specifications (number of cores, RAM, scratch space) are logged in the node.json file. RAM and scratch space are expressed in bytes:
In this case, a small cloud node type (t3.small) was used, with 2 gigabytes of RAM.
The CWL tool description in STAR-index-broken.cwl does not list resource requirements. This means that the default value for ramMin is used, which is 256 mebibytes (1 MiB = 2^20 bytes). Arvados restricts the container runtime environment to the amount of RAM defined in the workflow definition. In this case, with Docker as the container engine, this is implemented via a cgroup that restricts the amount of memory available to the Docker container.
The resource requirements can be changed in the CWL code so that the step is scheduled on a machine with more resources. Note that CWL specifies resource requirements like ramMin in mebibytes.
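A sketch of such a requirement block in the tool definition (the values are illustrative, not the exact ones from the fixed tool):

```
requirements:
  ResourceRequirement:
    ramMin: 8192    # in mebibytes, i.e. request 8 GiB instead of the 256 MiB default
    coresMin: 2     # illustrative; match what the tool actually needs
```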
With these changes, the step ran successfully.
The node.json log can be used to check that the node used matched the new requirements:
Getting performance and resource usage information
The crunchstat-summary tool can be used to analyze workflow and container performance. crunchstat-summary analyzes the crunchstat lines from the logs of a container or workflow and generates a report in text or html format. Among other information, crunchstat-summary reports the network traffic, memory usage, and CPU usage of tasks.
As an example, we run crunchstat-summary on the job from the previous section to see exactly how much RAM is necessary to run our job.
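The invocation might look like this (the container request UUID is hypothetical; flag names can vary slightly between versions, so check crunchstat-summary --help):

```
crunchstat-summary --container-request zzzzz-xvhdp-0123456789abcde
```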
The output shows usage both for the workflow runner job and for each step in the workflow; in this case, there is only one step.
The tail end of the output shows that the CWL requests more RAM than this step needs, and that the requirement can be reduced, which may permit the container to be scheduled on a smaller, cheaper instance from the cloud provider.
It is also possible to change the crunchstat-summary output format to HTML by giving it the --format html argument. Doing so will produce plots of memory and Keep utilization over the duration of the job.
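For instance, reusing the hypothetical UUID from above:

```
crunchstat-summary --container-request zzzzz-xvhdp-0123456789abcde \
    --format html > report.html
```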
Debugging Locally Using Docker and Mounted Keep Volumes
With arv-mount, Arvados collections can be accessed using traditional filesystem tools, allowing data in Keep to be browsed, opened, and read like regular files. By mounting a Keep volume, a running Docker container can access files in Keep as a shared filesystem. This is often useful when you want to debug the code that is being run by CWL.
Because Docker runs as a different user, be sure to use the --allow-other command line option when invoking arv-mount, so that Docker can access your mounted Keep volume.
This can be extremely useful because it allows debugging against the workflow’s data without having to download it all first. When working with large amounts of data, copying it all to a local machine may not be practical, or even possible.
Additionally, this method makes it possible to use the same docker container used on Arvados, which should help to reproduce the same runtime environment as where the error occurred while running on Arvados.
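A sketch of the pattern (the image name is a placeholder; note that FUSE also requires user_allow_other to be enabled in /etc/fuse.conf for --allow-other to work):

```
# Mount Keep so that other users, including the Docker daemon, can read it:
mkdir -p ~/keep
arv-mount --allow-other ~/keep

# Run the same image the workflow used, with Keep available read-only inside:
docker run -ti -v ~/keep:/keep:ro my-workflow-image:latest /bin/bash
```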
Connecting to a running container for interactive debugging
New in Arvados 2.2.0 is the ability to get access to a running container and interactively debug your workflow.
In order to get shell access to your container while it is running, you will need access to a machine with OpenSSH and arvados-client installed, and the usual Arvados environment variables set.
The cluster administrator will also need to enable container shell access. It is disabled in the default configuration because using container shell access can interfere with the reproducibility of workflows. To enable the feature cluster-wide:
- Set Containers / ShellAccess / User: true in the cluster config (see the container shell documentation and the sketch after this list).
- Make sure the Nginx configuration is up to date: see “new proxy parameters” in the release notes.
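In config.yml, that setting looks like this sketch (zzzzz stands for your cluster ID):

```
Clusters:
  zzzzz:
    Containers:
      ShellAccess:
        User: true   # allow users to shell into their own running containers
```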
Once the prerequisites are met, it is possible to run commands and start interactive sessions.
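For example (the container UUID is a hypothetical placeholder for a running container):

```
# Run a single command inside the running container:
arvados-client shell zzzzz-dz642-0123456789abcde ps ax

# Or start an interactive shell:
arvados-client shell zzzzz-dz642-0123456789abcde
```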
It is also possible to connect to services inside the container, e.g. debugging/profiling/metrics servers, by forwarding TCP ports.
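For example, arvados-client shell accepts OpenSSH-style options, so a sketch of forwarding local port 8888 to port 80 inside the container (the UUID is again a placeholder) could look like:

```
arvados-client shell zzzzz-dz642-0123456789abcde -L8888:localhost:80 -N
```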