Introduction
High-throughput medical image computing, which includes storage, processing, and analysis, is essential in exploring hidden regularities of large-scale medical image data. Efficient data storage and exchange are increasingly important for subsequent image processing. As reviewed by Izzo [
1], prevalent large-scale medical imaging archiving solutions include, but are not limited, to the Extensible Neuroimaging Archive Toolkit (XNAT) [
2], the eXTENsible platform for biomedical Science (XTENS) [
3], and the Collaborative Informatics and Neuroimaging Suite (COINS) project [
4]. Given the XNAT’s high flexibility and availability of community support, the Vanderbilt University Institute for Imaging Science (VUIIS) Center for Computational Imaging (CCI) has built a large-scale image database VUIIS XNAT [
5,
6] that extends XNAT to house 379 IRB approved projects at the Vanderbilt University and Vanderbilt University Medical Center (VUMC) with 78,512 magnetic resonance imaging (MRI) or computed tomography (CT) sessions, totaling 480,574 medical imaging scans for 48,556 subjects (as of December 2017).
Large-scale medical image computing requires an efficient and stable imaging database and additionally demands efficient job management and distribution schemes to handle heterogeneous images and processing tasks (e.g., segmentation, registration, and modeling). To address such challenges, we have developed the Distributed Automation for XNAT toolkit (DAX,
http://dax.readthedocs.io/en/latest/) [
5,
7,
8], which is a Python-based middleware layer to control the processing algorithms for the individual scan, scan group, or session. For each image computing task, one or more image processing algorithms are encapsulated as a pipeline item (“spider”), which defines the order of executions, dependencies, and the running environment. For instance, an end-to-end multi-atlas MRI brain segmentation pipeline [
9] has been wrapped as a “Multi_Atlas” spider, which contains preprocessing, N4 bias field correction [
10], registrations [
11], atlas selection [
12], label fusion [
13], and post-processing [
14]. The spider defines the sequence of the processing as well as the processing environments (e.g., binary files, package dependencies, and source code). Once the spider is developed, it can be launched, managed, and terminated by DAX.
The VUIIS CCI medical image data storage and processing infrastructure (VUIIS XNAT, DAX, and spiders) have been developed in close relationship with the hardware and computing environment of the Vanderbilt Advanced Computing Center for Research and Education (ACCRE,
www.accre.vanderbilt.edu) and other facilities at the Vanderbilt University. This process has ensured the effectiveness of the toolkit at the Vanderbilt University but has also exposed certain limitations. First, in 2016/2017, the Vanderbilt high-performance computing (HPC) facility decided to upgrade the underlying operating system from CENTOS 6 to CENTOS 7. Many of the VUIIS CCI image processing tools and libraries were not supported by the new platform. For instance, changes in supported versions of CUDA libraries between operating systems (OS) caused unforeseen gaps in application coverage, requiring specialized installations for various applications. This upgrade process led us to revisit the framework of the VUIIS CCI medical image storage and processing infrastructure. More generally, to deploy or migrate this infrastructure to another institution or HPC center could be laborious. Significant efforts and resources could be required to migrate the DAX platforms and spiders to a platform with differenthardware, OS, and installed packages might be very different between users and Vanderbilt University.
A second issue arose concerning spider portability. Since the initial development of DAX, individual spiders were designed to be executed across platforms (e.g., laptops, workstations, servers, and HPC nodes). The deployment worked well for several years as end users would deploy a common library of tools and then be able to use all available spiders. However, with the growth of the library, multiple releases of third-party tools, varying system level dependencies, and library incompatibilities between spiders, the processes of deploying the underlying architecture had become increasingly cumbersome and hardware/software dependent. In revisiting the framework for sustainability, we focused on improving the ease of use for the user scenario where a single spider is executed on a local laptop/workstation/server or a different HPC platform.
In this paper, we describe new innovations that create a portable solution based on the VUIIS CCI medical image data storage and processing infrastructure using virtualized VUIIS XNAT, containerized DAX, and containerized spiders to isolate the image processing platform from the underlying OS and hardware. The portable implementation is an alternative solution that enables hardware independent encapsulation and provides a convenient way to migrate our platforms to another institute or center. The containerized spiders are able to run on local workstations or alternative facilities as easily as at our local HPC center. To achieve this, the VUIIS XNAT has been deployed on a virtual machine (VM), while the containerized DAX and spiders use the Docker [
15] and Singularity [
16] container format. More specifically, the portable version has the following new features: (1) multi-level portability is supported, which not only allows for the deployment of the entire infrastructure (VUIIS XNAT + DAX + spiders) but also enables deploying one component (VUIIS XNAT or DAX or spiders) independently at another HPC center. (2) The containerized design allows future updates to DAX and spiders to take place without concern regarding the underlying hardware. (3) The portable feature leverages the scalability of the spiders, which are runnable on HPC clusters, workstations, and personal computers.
The remainder of the paper is organized as follows. First, XNAT and the existing VUIIS CCI medical image storage and processing infrastructure (VUIIS XNAT + DAX + spiders) are reviewed. Next, the portable solution with containerized DAX and spiders is introduced. Finally, we discuss the benefits and limitations of our implementation. This paper extends our previous descriptions [
7,
8]with a more detailed analysis of the framework as well as an exposition of the new containerized spiders.
DAX and Spiders
DAX
The large-scale image database is organized as different projects based on the definition by PIs. Within each project, all imaging scans and the corresponding processing results from the same patient are saved as one subject on XNAT, which might contain one session (time point) or multiple sessions. DAX was developed to provide an efficient database and processing management tool for large-scale medical image processing scenarios. It is a Python-based middleware layer used to specify the processing algorithms to apply to each image session on XNAT. First, DAX communicates with XNAT to identify which sessions match the running requirements (e.g., input files, queue permits). Then, the image processing tasks are launched for all requirement satisfied sessions based on the definition in the REDCap dashboard. To launch the processing tasks, DAX provides the scripts of image processing pipelines called spider, which defines not only the software requirements but also handles the image processing pipelines. The spiders define a list of image processing tasks and the corresponding software environments. Then, spiders are launched on ACCRE (a high-performance computing cluster housing over 6000 CPU cores) using batch processing technique. To enable the batch processing of the spiders, DAX has been developed to support the customized shell scripts using Simple Linux Utility for Resource Manangement (SLURM).
Each time a process runs, the precise version and parameters of the algorithm that ran are logged along with the processed data. Manual interaction with the data flows is possible and widely used. Once each processing job is complete, a final report is generated in Portable Document Format (PDF). The reports are tailored for each processing task and can be quickly reviewed by a human for quality assurance (QA) purposes. For security reasons, users are required to do registration on the VUIIS XNAT website (
http://xnat.vanderbilt.edu/index.php/Main_Page) to use DAX. Manual interaction with the data flows is possible and widely used; in these cases, both the original automated results and the manual correction/interaction are saved.
In the currently deployed state-of-the-art DAX (version 0.7.1), the complete source code, runtime binaries libraries, and configuration files are archived in version controlled repositories linked to the high-performance computing center. Authorized users can retrieve the exact data, execution environment, and runtime arguments to recreate any analysis at a later time.
DAX has been tested on the SUN Grid Engine [
20,
21], the Simple Linux Utility for Resource Management (SLURM) platform [
22], and the Moab scheduler on the Terascale Open-source Resource and QUEue Manager (TORQUE) [
23].
Spiders
A spider is a processing pipeline package containing a controlling Python script, source code, and binary executable files. The controlling python script defines the running environments (e.g., input format, output format, software paths, related packages, etc.), and the execution order when more than one image processing algorithm is included in the spider. DAX, to date, supports 303 different spiders for efficient large-scale image processing. For instance, FreeSurfer brain segmentation and cortical surface extraction [
24], multi-atlas segmentation [
9,
13], voxel-based morphometry and other structural analysis using the SPM [
25,
26], and FSL [
26,
27] software suites are implemented as spiders. Also, the diffusion tensor imaging (DTI) processing and analysis spiders have been developed for quality analysis [
28,
29] plus various approaches for diffusion modeling and tractography: the FSL’s BEDPOSTX [
30] and TBSS [
31], the Freesurfer’s TRACULA [
32], and the gray matter diffusion analysis GSBSS [
33]. Available cortical surface reconstruction and analysis spiders include CRUISE [
34], MaCRUISE [
35,
36], surface parcellation [
37], and quantitative analyses on surfaces (e.g., thickness, curvatures, sulcal depth, etc.) [
38,
39]. Other spiders perform lesion segmentation like TOADS [
40] and Lesion Segmentation Tools (LST) [
41], fMRI quality analysis [
42], or abdomen segmentation using multi-atlas segmentation [
14] or deep neural networks [
43].
Since each update for a spider relies on different versions of source code or software packages, it is appealing to develop a spider to contain its full computing environment and be isolated from the underlying software and hardware environments.
Discussion
In summary, all human research scans taken at VUIIS today are automatically routed to a long-term PACS archive and mirrored in an XNAT server to provide secure, multi-site access to data resources. Our VUIIS XNAT/DAX systems have enabled large-scale automated analysis and quality assurance/control with a reasonable level of human oversight. Since 2010, DAX/XNAT has used 650+ CPU-years (5.6M+ CPU-hours) of image analysis at ACCRE and 200+ CPU-years (1.8M+ CPU-hours) on the MASI grid. To improve portability of the VUIIS CCI medical image storage and processing infrastructure (VUIIS XNAT + DAX + spiders), the new innovations provide an additional implementation that isolates the infrastructure from the underlying hardware and OS. The portable infrastructure may be deployed to another HPC center conveniently. Moreover, the containerized spiders are ready to run on HPC clusters or workstations, which enables efficient algorithm deployment and version management.
Although large-scale image processing has become essential in scientific research and clinical investigation, very few modern approaches are routinely used in clinical practice. A key barrier to clinical translation of image processing techniques is the evaluation and characterization of techniques at scale. Existing medical imaging automation and data interaction approaches have largely been designed for traditional well-controlled studies, and it is difficult to apply these frameworks to the more diverse data of clinical practice. With “big” imaging data, acquisitions are not uniform (e.g., multi-site issues, hardware/software changes, acquisition differences, personnel changes, etc.), and it is not clear which of the myriad of potential factors will be most challenging for image processors to generalize across. Furthermore, as many medical imaging applications begin to utilize deep learning approaches, accessible and translatable access to GPU interfaces and hardware is paramount for broad usage. Hence, creating algorithms to robustly function with big imaging data becomes a Herculean effort, in particular in the context of GPU programming.
Another interesting direction is to further improve the efficiency of large-scale data storage and image processing. We have previously investigated the theory, algorithm, and implementation of using Hadoop [
45] and a data colocation grid framework [
46]. We also presented a “medical image processing-as-a-service” grid framework that offers promise in utilizing the Apache Hadoop ecosystem and HBase distributed data store for data colocation by moving computation close to medical image storage [
47]. It is appealing to integrate such techniques with the Vanderbilt University HPC clusters to improve computational efficiency under Big Data scenarios.