Arvados 2.6.0 Release Notes

April 6, 2023

The Arvados team is pleased to announce Arvados 2.6.0. This release includes many improvements to Arvados’ performance and reliability under heavy compute load, along with a variety of other new features and bug fixes. We recommend that new and existing installations of 2.5.0 or earlier upgrade to 2.6.0. See Upgrading Arvados for upgrade instructions.

New features and enhancements

Workbench 2

Workbench 2 now provides a dedicated view for registered workflows, which lets you easily view and refer to their details. #19482

Workbench 2 screenshot that shows an example workflow page including workflow details

The Workbench 2 sharing dialog has numerous usability improvements (#19294, #20085):

  • “All users” has been added as a top-level sharing option, so you can easily share with every user on the cluster while excluding anonymous users.
  • All sharing changes are saved immediately.
  • The button to add a sharing permission runs parallel to the button to remove one.
  • If the user starts searching for a user or group to share with, then abandons the search, the input will be cleared to reflect that no change has been made.
  • Only users’ names are listed to provide a more friendly view. Additional details about the user are available through a tooltip when you need them.

Workbench 2 screenshot that shows the sharing dialog listing a user’s name with details in a tooltip

Workbench 2 now filters workflow intermediate and log collections out of the default project view to reduce clutter. You can browse these collections by selecting them from the Type filter pulldown. #19295

Workbench 2 screenshot that shows a project view and its new default type filter selections, with intermediate and log collections deselected

The container process action menu in Workbench 2 now includes “Copy and re-run process.” This creates a copy of the process in draft state so you can run it again. #15557

Workbench 2 screenshot that shows a process details pane with “Copy and re-run process” highlighted in the action menu

Workbench 2 now provides a “Cancel” button for container processes that are queued but have not yet started running. #20000

Workbench 2 screenshot that shows basics about a queued workflow with a “Cancel” button

Workbench 2 now provides a “Run” button to start container processes in the “draft” or “on hold” states. #20000

Workbench 2 screenshot that shows basics about a cancelled workflow with a “Run” button

Workbench 2 now reports container process status as “Cancelling” when the user has requested that a process be cancelled but Crunch has not yet shut it down. #19295

Workbench 2 now reports container process status as “Reused” when Arvados reuses results from a previous workflow run. #19295

Administrators can now configure Workbench 2 to display a banner message to users. This can be used to inform users about upcoming cluster maintenance or configuration changes. Refer to the Workbench configuration documentation for details about how to set this up on your cluster. #18368

Administrators can now configure Workbench 2 to display custom tooltips. These can provide users with tailored guidance about your site’s intended workflows and procedures. Refer to the Workbench configuration documentation for details about how to set this up on your cluster. #19836

Crunch

When running on AWS, Crunch now obtains instance pricing information from AWS APIs, and uses that to calculate container costs. This especially improves the accuracy of reported costs for containers run on spot instances. #19320

Crunch periodically updates running container records with their cost so far. This provides users with an estimate of the cost of containers that haven’t finished yet, as well as containers that abort unexpectedly. #19967
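As a rough illustration of a running cost estimate, the sketch below multiplies an hourly instance price by the elapsed run time. The function name and the flat-price assumption are illustrative only; these notes do not describe how Crunch handles prices that change during a run.

```python
def cost_so_far(price_per_hour, elapsed_seconds):
    """Estimate the cost accrued so far, assuming a flat hourly price."""
    if price_per_hour < 0 or elapsed_seconds < 0:
        raise ValueError("price and elapsed time must be non-negative")
    # Convert elapsed seconds to hours, then apply the hourly price.
    return price_per_hour * elapsed_seconds / 3600.0
```

For example, a container that has run for two hours on an instance priced at 0.50 per hour would be estimated at 1.00 so far.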

Shortly after Crunch starts a container, it will create a log collection that includes information about the compute node. Workbench 2 already uses this collection to display information about the compute node allocated for the container, so now this information will be available sooner. #19886

When Crunch dispatches a container to an AWS spot instance, if AWS announces a planned interruption of the instance, that information will be recorded in the container logs and runtime status. #19961

Crunch now logs the first time a running container uses more than 90%, 95%, or 99% of its requested memory. This can help users diagnose if a container likely failed because it ran out of memory. #19986
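The "first time" behavior can be sketched as follows. This is a minimal illustration, not Crunch's implementation: the function name and the shape of the input (a list of memory usage samples in bytes) are assumptions made for the example.

```python
def first_crossings(samples, requested_bytes, percents=(90, 95, 99)):
    """Return (sample_index, percent) pairs, one per threshold, recorded the
    first time memory usage exceeds that percentage of the request."""
    pending = sorted(percents)
    events = []
    for index, used_bytes in enumerate(samples):
        # Integer arithmetic avoids floating-point rounding at the boundary.
        while pending and used_bytes * 100 > pending[0] * requested_bytes:
            # Each threshold is removed once reported, so it is logged only once.
            events.append((index, pending.pop(0)))
    return events
```

A single sample can cross several thresholds at once, in which case each one is reported for that sample.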

Crunch now logs the maximum usage it recorded for each resource after a container finishes running. This should help users hone their resource requests for a workflow. #19986

Documentation

The Python SDK cookbook has been expanded with organization by subject, background discussion for each recipe, and clearer examples. #19792

The Python SDK install instructions have been reorganized so they’re easier to follow. #19926

The container request API documentation now includes much more detail about how container cancellation works. #19624

SDKs

The R SDK includes a new writeFile function that can write to an existing collection, rather than creating a new one every time. #20214

API server and controller

All of the API server’s internal id columns have been migrated from 32-bit to 64-bit integers to provide more room for growth. Note that running this migration on a large production instance may take several hours. #19890 #20074

The API server automatically deduplicates permission links as they are created and updated. As a consequence, these API operations may now return an existing link. There is also a migration to deduplicate existing links. This migration could take a while to run if you have many duplicate links already, but this shouldn’t be common. #18693 #19954

LDAP configuration now includes a MinTLSVersion setting. You can set this to allow all Arvados systems to negotiate LDAP connections that use a version of TLS older than what’s recommended (currently TLS 1.2) if that’s the most your server supports. #19896

When Arvados retries a container, it now synthesizes the new container’s scheduling parameters from all outstanding container requests, choosing values that provide the maximum resources any of them requested. This means Arvados acts as expected when a user sees that a container is likely to fail and submits a new request with more generous scheduling parameters before it actually fails. #19917
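One way to picture this merge is the sketch below. The merge rules are assumptions for illustration only (a max_run_time of 0 is treated as "no limit", and a preemptible instance is used only if every request allows it); the actual rules are not spelled out in these notes.

```python
def merge_scheduling_parameters(requests):
    """Combine scheduling parameters so the result satisfies every request."""
    if not requests:
        raise ValueError("need at least one outstanding request")
    run_times = [r.get("max_run_time", 0) for r in requests]
    return {
        # Assumption: 0 means "no limit", so any 0 wins over a finite limit.
        "max_run_time": 0 if 0 in run_times else max(run_times),
        # Assumption: use a preemptible instance only if every request allows it.
        "preemptible": all(r.get("preemptible", False) for r in requests),
    }
```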

Containers now have a log method that provides WebDAV access to running container logs. Future releases will include client tools that use this endpoint for more performant log viewing. #19889

Salt installer

The Salt installer now provides cluster monitoring by integrating with Salt’s Prometheus and Grafana formulas. Arvados nodes and services will be configured to publish metrics to a local Prometheus server, and those can be browsed with Grafana. #16379

The multi-node Salt installer now supports deploying to AWS with TLS private keys encrypted with a passphrase. nginx retrieves the passphrase securely from AWS Secrets Manager. #20035

The default deployment strategy used by the Terraform + Salt install has been adjusted to require fewer public IPv4 addresses. In particular, this means Arvados can now be installed in a fresh AWS account without modifying the installer or needing to request additional public IPv4 address quota. #20270

Scalability and reliability improvements

arvados-cwl-runner now uploads each workflow to a collection that includes its dependencies, rather than as a single JSON document. This is faster, and the uploaded workflow stays much closer to the original source, which simplifies debugging. #19385

arvados-cwl-runner supports a new workflow extension arv:OutOfMemoryRetry. If a workflow step has this hint defined, and fails because the tool ran out of memory, arvados-cwl-runner will automatically retry the step once with a request for more RAM in its runtime constraints. The extension can define how much additional memory to request and how to detect out-of-memory errors from the tool. See our CWL extensions documentation for full details. #19975

The Go SDK can now automatically retry requests that encounter temporary failures. Retries are delayed with exponential backoff, limited by the duration of Client.Timeout. #19972
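The general technique (exponential backoff bounded by an overall timeout) looks roughly like the Python sketch below. It is not the Go SDK's code: TemporaryError, call_with_retry, and the delay values are hypothetical names and numbers chosen for the example.

```python
import time


class TemporaryError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(func, timeout, initial_delay=1.0, max_delay=30.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call func until it succeeds or the overall timeout has elapsed."""
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        try:
            return func()
        except TemporaryError:
            remaining = deadline - clock()
            if remaining <= 0:
                raise  # Out of time: surface the last temporary failure.
            # Never sleep past the deadline.
            sleep(min(delay, remaining))
            # Double the delay after each failure, up to max_delay.
            delay = min(delay * 2, max_delay)
```

The sleep and clock parameters exist only so the behavior can be exercised without real waiting.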

If the Crunch dispatcher receives a 503 error response from the API server, it reduces the number of API requests it puts in flight at one time to allow the API server time to recover. This limit gradually increases over time without an error. #19973
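The pattern described here resembles an adaptive concurrency limit. The sketch below is a simplified illustration under stated assumptions: the class name, the halving on each 503, and the one-step recovery per quiet interval are all choices made for the example, not the dispatcher's actual policy.

```python
class AdaptiveLimit:
    """Track how many API requests may be in flight at once."""

    def __init__(self, maximum, minimum=1):
        if not 1 <= minimum <= maximum:
            raise ValueError("require 1 <= minimum <= maximum")
        self.maximum = maximum
        self.minimum = minimum
        self.limit = maximum

    def record_503(self):
        # Assumption: back off sharply (halve) when the server is overloaded.
        self.limit = max(self.minimum, self.limit // 2)
        return self.limit

    def record_quiet_interval(self):
        # Assumption: recover one slot per interval that passes without errors.
        self.limit = min(self.maximum, self.limit + 1)
        return self.limit
```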

If the Crunch dispatcher receives InsufficientFreeAddressesInSubnet or InsufficientVolumeCapacity errors from EC2 when it tries to create new compute nodes, it treats those like hitting other quota limits, and will pause trying to create new nodes. #20188

CloudVMs configuration now includes a MaxInstances setting. This limits the number of compute nodes created by arvados-dispatch-cloud to ensure your compute capacity does not grow beyond what your API server can support. #18075

CloudVMs configuration now includes a SupervisorFraction setting. This limits the fraction of MaxInstances that can be used to run workflow supervisor processes like arvados-cwl-runner, so that supervisors cannot take up so many compute nodes that they collectively bottleneck each other. #20182

API configuration now includes a LogCreateRequestFraction setting. This limits the share of MaxConcurrentRequests that can be taken up by concurrent log create requests. Log create requests that arrive when the server is at this limit receive a 503 Service Unavailable response, and the Crunch log entries they carried are discarded. This ensures capacity is available for cluster administration even when the API server is under heavy log load from running containers. #20200
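To make the arithmetic concrete, the sketch below shows how a fractional setting could translate into a whole-number cap and an admission decision. The function names and the choice to round down are assumptions for illustration; the notes do not say how the server rounds.

```python
import math


def fraction_cap(total, fraction):
    """Largest whole number of slots allowed by a fractional setting."""
    if total < 0 or not 0 <= fraction <= 1:
        raise ValueError("total must be >= 0 and fraction within [0, 1]")
    # Assumption: round down, so the cap never exceeds the stated fraction.
    return math.floor(total * fraction)


def admit_log_create(in_flight_log_creates, max_concurrent, log_create_fraction):
    """Return an HTTP-style status code for a new log create request."""
    cap = fraction_cap(max_concurrent, log_create_fraction)
    # At or above the cap, the request is refused with 503 Service Unavailable.
    return 200 if in_flight_log_creates < cap else 503
```

With MaxConcurrentRequests at its new default of 64 and a hypothetical fraction of 0.5, the thirty-third concurrent log create request would be refused.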

The default API configuration for MaxConcurrentRequests has been changed from 0 (unlimited) to 64. Based on further deployment experience, we believe this limit is appropriate for most new installs, and it is easy to increase as clusters grow. #20200

The default Collections configuration for BalancePeriod has been changed from 10 minutes to 6 hours. With more deployment experience, we believe this default will still provide sufficient block balancing and cleanup for most clusters, while leaving more resources available for other Arvados work. #20227

The API server had background logic to keep priority consistent across related containers. For example, if you cancelled an entire workflow by setting the priority of your original container request to 0, this logic would set priority 0 on all the container requests it spawned as well. We diagnosed several performance problems in this code, so Arvados 2.6.0 includes a more performant implementation in the controller. #20183 #20240

Several large database queries throughout the API server have been optimized to work in small batches and/or select the specific data fields they need to reduce memory requirements. #20223

The controller now caches the API server discovery document and serves it directly to clients. #20187

Workbench 2 now copies and updates collections using the replace_files API option. This provides better performance when modifying large collections. #20029

Bug fixes and smaller changes

Workbench 2

When you search a project in Workbench 2 and view the details of one of the result items, Workbench 2 now retains your search and view. #19865

Workbench 2 screenshot that shows search results loaded alongside details for one item

The advanced search dialog in Workbench 2 no longer requires you to select a project to search. #19908 #19969

Workbench 2 screenshot that shows the advanced search dialog with various search criteria entered without a project selected

When you advance through pages of subprocesses on a process page, then reload the process page in your browser, Workbench 2 now remembers and displays the page of subprocesses you were viewing. #20252

Fixed a bug in Workbench 2 where sorting a project listing did not work correctly for some data columns. #19988

Workbench 2 now revalidates caches when displaying collection contents to avoid showing users an out-of-date listing. #19899

If your Workbench 2 session expires and you log back in from that page, you will be returned to the page you were previously viewing. This is true whether your session expired due to inactivity or because your underlying authorization token is no longer valid. #19715

Workbench 2 no longer displays a “Not Found” error when it fails to load resources associated with a container process. #19900

Workbench 2 now displays process status as “Unknown” when it does not have this information available. Previously it would show “Cancelled” in this case. #19273

Fixed a bug where Workbench 2 would construct invalid WebDAV URLs for collections when the cluster was not configured with wildcard certificates. #20089

When a user edits an object’s description to be empty, Workbench 2 will now explicitly update the API object’s description field to null. Previously it would update the description with a contentless HTML skeleton, which prevented the API.FreezeProjectRequiresDescription setting from being enforced as intended. #19930

Fixed a bug where Workbench 2 would show status incorrectly for some container processes in a very long list. #20251

API server and controller

The controller now reads requests sent as multipart form data. Workbench 2 sometimes sent requests encoded this way, so those requests are now handled properly. #19597

If the controller encounters an error when it tries to validate an OIDC token, it now returns a 5xx error so the client knows it can retry. Previously it returned a 401 Unauthorized response, which was indistinguishable from an invalid token. #19907

Fixed several bugs in the API server’s configuration reload thread that made it unreliable. #20137 #20198

The API server should now recognize all system properties, so you can define a strict vocabulary without having to redefine any system properties. System properties are documented in the Metadata vocabulary API documentation. #19980

Improved the trusted clients match detection to support configured URLs that explicitly specify their scheme’s default port number. #20264
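The underlying idea is that a URL with its scheme's default port spelled out names the same origin as one without it. A minimal sketch of that comparison follows; the helper names are hypothetical and this is not the API server's code.

```python
from urllib.parse import urlsplit

# Default ports for the schemes considered in this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return (scheme, host, port) with the scheme's default port filled in."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, (parts.hostname or "").lower(), port)


def same_origin(a, b):
    """True if both URLs refer to the same scheme, host, and effective port."""
    return origin(a) == origin(b)
```

Under this comparison, https://workbench.example.com:443/ and https://workbench.example.com match, while a URL on port 8443 does not.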

A database migration included in Arvados 2.5.0 has been adjusted so it can run on PostgreSQL 11. #19993

SDKs

Fixed a bug in the R SDK that could prevent you from fetching multiple files from a collection in succession. Thanks to Konrad Rudolph for this fix. #20295

Keep

Fixed a bug in keep-web where it would try to update a collection just because another client sorted the manifest differently than keep-web would. This issue could cause users to receive Unauthorized error responses if they had read-only permission to a collection when keep-web sent the update request. #20083

Fixed a bug in keep-web where it did not redirect unauthenticated users to Workbench 2 if the cluster was not configured with an anonymous token. #19963

In previous releases arv-mount reported a generic “I/O error” if you tried to create a file directly inside a project directory. Now it reports “Operation not supported” to clarify that the problem is with the request and not the system. #19897

When the same volume is available from multiple Keep stores, and keep-balance wants to trash blocks on that volume, it will now select one Keep store at random to receive the trash request. It previously sent the request to all Keep stores, which could cause S3-backed stores to detect a race condition and discard the trash request. #20242

keep-balance now has an option that limits it to working on blocks with a specific checksum prefix. For now this is intended primarily as a way for us to instrument potential keep-balance scaling strategies. #19923

Crunch

Fixed a bug in arvados-dispatch-cloud where it did not properly install crunch-run after an upgrade, causing it to abandon running containers. #20235

Dependencies

The Python SDK’s dependency on google-api-python-client has been upgraded to version 2.1.0+. This makes it easier to install alongside other libraries that use that package. #19895

The Prometheus client library used by Arvados has been upgraded to version 1.14.0 to address a security vulnerability in earlier versions. #20121

Fixed a bug in the build scripts that would generate inconsistent version numbers on old commits from release branches where their nearest tag is older than their base merge commit. #19937