
1 Introduction

Anomalies in the shape and texture of the liver and visible lesions in CT are important biomarkers for disease progression in primary and secondary hepatic tumor disease [9]. In clinical routine, manual or semi-manual segmentation techniques are applied, but these are subjective, operator-dependent and very time-consuming. To improve the productivity of radiologists, computer-aided methods have been developed in the past, yet the automatic combined segmentation of liver and lesions remains challenging: low contrast between liver and lesions, different contrast-enhancement levels (hyper-/hypo-intense tumors), tissue abnormalities (e.g. after metastasectomy), and a varying size and number of lesions.

Nevertheless, several interactive and automatic methods have been developed to segment the liver and liver lesions in CT volumes. In 2007 and 2008, two grand challenge benchmarks on liver and liver lesion segmentation were conducted [4, 9]. The methods presented at these challenges were mostly based on statistical shape models. Furthermore, grey-level and texture-based methods have been developed [9]. Recent work on liver and lesion segmentation employs graph cut and level set techniques [15-17], sigmoid edge modeling [5] or manifold and machine learning [6, 11]. However, these methods are not widely applied in clinics, due to insufficient speed and robustness on heterogeneous, low-contrast real-life CT data. Hence, interactive methods [1, 7] are still being developed to overcome these weaknesses, at the cost of user interaction.

Deep convolutional neural networks (CNN) have gained new attention in the scientific community for solving computer vision tasks such as object recognition, classification and segmentation [14, 18], often outperforming state-of-the-art methods. Most importantly, CNN methods have proven to be highly robust to varying image appearance, which motivates us to apply them to fully automatic liver and lesion segmentation in CT volumes.

Semantic image segmentation methods based on fully convolutional neural networks (FCN) were developed in [18], with impressive results in natural image segmentation competitions [3, 24]. Likewise, new segmentation methods based on CNNs and FCNs were developed for medical image analysis, with highly competitive results compared to the state of the art [8, 12, 19-21, 23].

In this work, we demonstrate the combined automatic segmentation of the liver and its lesions in low-contrast heterogeneous CT volumes. Our contributions are three-fold. First, we train and apply fully convolutional CNNs on CT volumes of the liver for the first time, demonstrating their adaptability to the challenging segmentation of hepatic lesions. Second, we propose a cascaded fully convolutional neural network (CFCN) on CT slices, which segments liver and lesions sequentially, leading to significantly higher segmentation quality. Third, we propose to combine the cascaded CNN in 2D with a 3D dense conditional random field (3DCRF) as a post-processing step, to achieve higher segmentation accuracy while preserving low computational cost and memory consumption. In the following sections, we describe our proposed pipeline (Sect. 2.2) including the CFCN (Sect. 2.3) and the 3D CRF (Sect. 2.4), present experiments on the 3DIRCADb dataset (Sects. 2.1 and 3) and conclude (Sect. 4).

2 Methods

In the following section, we denote the 3D image volume as I, the total number of voxels as N and the set of possible labels as \(\mathcal{L} = \{0, 1, \ldots, l\}\). For each voxel i, we define a variable \(x_i \in \mathcal{L}\) that denotes the assigned label. The probability of voxel i belonging to label k given the image I is described by \(P(x_i = k \vert I)\) and will be modelled by the FCN. In our particular study, we use \(\mathcal{L} = \{0, 1, 2\}\) for background, liver and lesion, respectively.

Fig. 1. Automatic liver and lesion segmentation with cascaded fully convolutional networks (CFCN) and dense conditional random fields (CRF). Green depicts correctly predicted liver pixels, yellow liver false negatives and false positives (all erroneous liver predictions), blue correctly predicted lesion pixels, and red lesion false negatives and false positives (all erroneous lesion predictions). In the first row, the false positive lesion predictions in B of a single UNet as proposed by [20] were eliminated in C by the CFCN as a result of restricting lesion segmentation to the liver ROI. In the second row, applying the 3DCRF to the CFCN output in F increases both liver and lesion segmentation accuracy further, resulting in a lesion Dice score of 82.3 %.

2.1 3DIRCADb Dataset

For clinical routine usage, methods and algorithms have to be developed, trained and evaluated on heterogeneous real-life data. Therefore, we evaluated our proposed method on the 3DIRCADb dataset [22]. In comparison to the grand challenge datasets, the 3DIRCADb dataset offers a higher variety and complexity of livers and their lesions, and it is publicly available. It includes 20 venous-phase contrast-enhanced CT volumes from various European hospitals, acquired with different CT scanners. For our study, we trained and evaluated our models on the 15 volumes containing hepatic tumors, using 2-fold cross-validation. The analyzed CT volumes differ substantially in the level of contrast enhancement and in the size and number of tumor lesions (1 to 42). We assessed the performance of our proposed method using the quality metrics introduced in the grand challenges for liver and lesion segmentation [4, 9].

2.2 Data Preparation, Processing and Pipeline

Pre-processing was carried out in a slice-wise fashion. First, the Hounsfield unit values were windowed to the range \([-100, 400]\) to exclude irrelevant organs and objects; then we increased contrast through histogram equalization. As in [20], to teach the network the desired invariance properties, we augmented the data by applying translation, rotation and addition of Gaussian noise. This resulted in an augmented training dataset of 22,693 image slices, which were used to train two cascaded FCNs based on the UNet architecture [20]. The predicted segmentations are then refined using a dense 3D conditional random field. The entire pipeline is depicted in Fig. 2.
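As an illustration, the windowing, equalization and augmentation steps can be sketched as follows with NumPy, SciPy and scikit-image. The helper names and the augmentation magnitudes are our own assumptions, since the text does not specify them:

```python
import numpy as np
from scipy.ndimage import rotate, shift
from skimage import exposure

def preprocess_slice(hu_slice):
    """Window HU values to [-100, 400] and apply histogram equalization."""
    clipped = np.clip(hu_slice, -100, 400)
    scaled = (clipped + 100.0) / 500.0          # map to [0, 1]
    return exposure.equalize_hist(scaled)

def augment_slice(img, rng):
    """One randomly transformed copy: translation, rotation, Gaussian noise.
    The magnitudes below are illustrative, not the paper's exact values."""
    img = shift(img, shift=rng.uniform(-10, 10, size=2), order=1, mode='nearest')
    img = rotate(img, angle=rng.uniform(-15, 15), reshape=False, order=1, mode='nearest')
    img = img + rng.normal(0.0, 0.01, size=img.shape)
    return np.clip(img, 0.0, 1.0)

rng = np.random.default_rng(0)
slice_hu = np.random.randint(-1000, 1000, size=(512, 512)).astype(np.float32)  # dummy CT slice
train_img = augment_slice(preprocess_slice(slice_hu), rng)
```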

Fig. 2. Overview of the proposed image segmentation pipeline. In the training phase, the pre-processed and augmented CT volumes are used to train a cascaded fully convolutional neural network (CFCN). To obtain the final segmented volume, the test volume is fed forward through the CFCN and refined afterwards using a 3D conditional random field (3DCRF).

2.3 Cascaded Fully Convolutional Neural Networks (CFCN)

We used the UNet architecture [20] to compute the soft label probability maps \(P(x_i \vert I)\). The UNet architecture enables accurate pixel-wise prediction by combining spatial and contextual information in a network comprising 19 convolutional layers. In our method, we trained one network to segment the liver in abdominal slices (step 1), and another network to segment the lesions, given an image of the liver (step 2). The liver segmentation from step 1 is cropped and resampled to the required input size of the cascaded UNet in step 2, which then segments the lesions.
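A minimal sketch of the test-time cascade, assuming hypothetical callables liver_net and lesion_net that return probability maps of the same spatial size as their input; the threshold and input size are illustrative, not the paper's exact settings:

```python
import numpy as np
from skimage.transform import resize

def cascade_segment(ct_slice, liver_net, lesion_net, net_input=(572, 572)):
    """Step 1: segment the liver; step 2: segment lesions inside the liver ROI."""
    liver_mask = liver_net(ct_slice) > 0.5
    lesion_mask = np.zeros_like(liver_mask)

    coords = np.argwhere(liver_mask)
    if coords.size == 0:                        # no liver found on this slice
        return liver_mask, lesion_mask
    (y0, x0), (y1, x1) = coords.min(axis=0), coords.max(axis=0) + 1

    # Crop the liver ROI, mask out non-liver tissue, resample to the net input size
    roi = ct_slice[y0:y1, x0:x1] * liver_mask[y0:y1, x0:x1]
    roi_resized = resize(roi, net_input, order=1, preserve_range=True)

    # Step 2: lesion segmentation restricted to the ROI, mapped back to slice geometry
    lesion_roi = lesion_net(roi_resized) > 0.5
    lesion_mask[y0:y1, x0:x1] = resize(lesion_roi.astype(float),
                                       (y1 - y0, x1 - x0), order=0) > 0.5
    return liver_mask, lesion_mask
```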

The motivation behind the cascade approach is that UNets and other CNNs have been shown to learn a hierarchical representation of the provided data. The stacked layers of convolutional filters are tailored towards the desired classification in a data-driven manner, as opposed to hand-crafted features designed to separate different tissue types. By cascading two UNets, we ensure that the UNet in step 1 learns filters that are specific to the detection and segmentation of the liver from an overall abdominal CT scan, while the UNet in step 2 learns a set of filters for separating lesions from liver tissue. Furthermore, the liver ROI helps to reduce false positive lesion detections.

A crucial step in training FCNs is appropriate class balancing according to the pixel-wise frequency of each class in the data. In contrast to [18], we observed that training the network to segment small structures such as lesions is not possible without class balancing, due to the high class imbalance. Therefore, we introduced an additional weighting factor \(\omega^{class}\) in the cross-entropy loss function L of the FCN:

$$\begin{aligned} L = - \frac{1}{N} \sum \limits _{i=1}^N \omega _i^{class} \left[ \hat{P_i} \log P_i + (1 - \hat{P_i}) \log (1 - P_i) \right] \end{aligned}$$
(1)

\(P_i\) denotes the probability of voxel i belonging to the foreground and \(\hat{P_i}\) represents the ground truth label. We chose \(\omega^{class}_i\) to be \(\frac{1}{\vert \text{pixels of class } x_i = k \vert}\), i.e. inversely proportional to the pixel frequency of the class of voxel i.
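A minimal NumPy sketch of this class-balanced cross entropy (Eq. 1) for one binary segmentation network; the clipping constant is our own numerical safeguard, not part of the paper:

```python
import numpy as np

def balanced_cross_entropy(P, P_hat, eps=1e-7):
    """Eq. (1): cross entropy with inverse-class-frequency weights.
    P     -- predicted foreground probabilities, shape (H, W)
    P_hat -- binary ground truth, same shape
    """
    n_fg = max(P_hat.sum(), 1)                  # pixel count of the foreground class
    n_bg = max(P_hat.size - P_hat.sum(), 1)     # pixel count of the background class
    w = np.where(P_hat == 1, 1.0 / n_fg, 1.0 / n_bg)
    P = np.clip(P, eps, 1.0 - eps)              # numerical safeguard (our addition)
    return -np.mean(w * (P_hat * np.log(P) + (1 - P_hat) * np.log(1 - P)))
```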

The CFCNs were trained on an NVIDIA Titan X GPU using the deep learning framework Caffe [10], with a learning rate of 0.001, a momentum of 0.8 and a weight decay of 0.0005.

2.4 3D Conditional Random Field (3DCRF)

A volumetric FCN implementation with 3D convolutions is strongly limited by GPU hardware and available VRAM [19]. In addition, the anisotropic resolution of medical volumes (e.g. 0.57-0.8 mm in-plane vs. 1.25-4 mm slice spacing in 3DIRCADb) complicates the training of discriminative 3D filters. Instead, to capitalize on the locality information across slices within the dataset, we utilize 3D dense conditional random fields (CRFs) as proposed by [13]. To account for 3D information, we consider all slice-wise predictions of the FCN together, applying the CRF to the entire volume at once.

We formulate the final label assignment given the soft predictions (probability maps) from the FCN as maximum a posteriori (MAP) inference in a dense CRF, allowing us to consider both spatial coherence and appearance.

We specify the dense CRF following [13] on the complete graph \(\mathcal{G} = (\mathcal{V}, \mathcal{E})\) with vertices \(i \in \mathcal{V}\) for each voxel in the image and edges \(e_{ij} \in \mathcal{E} = \{ (i, j) \,\vert\, i, j \in \mathcal{V},\, i < j \}\) between all vertices. The variable vector \(\mathbf{x} \in \mathcal{L}^N\) describes the label of each vertex \(i \in \mathcal{V}\). The energy function that induces the corresponding Gibbs distribution is then given as:

$$\begin{aligned} E(\mathbf {x}) = \sum _{i \in \mathcal {V}} \phi _i(x_i) + \sum _{(i,j) \in \mathcal {E}} \phi _{ij}(x_i, x_j), \end{aligned}$$
(2)

where \(\phi _i(x_i) = -\log P\left( x_i \vert I \right) \) are the unary potentials derived from the FCN's probabilistic output \(P\left( x_i \vert I \right) \), and \(\phi _{ij}(x_i,x_j)\) are the pairwise potentials, which we set to:

$$\begin{aligned} \phi _{ij}(x_i, x_j) =&\mu (x_i, x_j) \bigg ( w_{\mathrm {pos}} \exp \left( -\frac{\vert p_i - p_j \vert ^2}{2 \sigma _{\mathrm {pos}}^2} \right) \qquad \nonumber \\ {}&+ w_{\mathrm {bil}} \exp \left( -\frac{\vert p_i - p_j \vert ^2}{2 \sigma _{\mathrm {bil}}^2} -\frac{\vert I_i - I_j \vert ^2}{2 \sigma _{\mathrm {int}}^2}\right) \bigg ), \end{aligned}$$
(3)

where \(\mu (x_i,x_j) = \mathbf {1}(x_i \ne x_j)\) is the Potts function, \(\vert p_i - p_j \vert \) is the spatial distance between voxels i and j and \(\vert I_i - I_j \vert \) is their intensity difference in the original image. The influence of the pairwise terms can be adjusted with their weights \(w_{\mathrm {pos}}\) and \(w_{\mathrm {bil}}\) and their effective range is tuned with the kernel widths \(\sigma _{\mathrm {pos}}, \sigma _{\mathrm {bil}}\) and \(\sigma _{\mathrm {int}}\).

We estimate the best labelling \(\mathbf {x}^* = {{\mathrm{arg\,min}}}_{\mathbf {x}\in \mathcal{L}^N} E(\mathbf {x})\) using the efficient mean field approximation algorithm of [13]. The weights and kernel widths of the CRF were chosen using a random search algorithm.
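For illustration, the publicly available pydensecrf package, a Python wrapper around the reference implementation of [13], supports dense CRFs over arbitrary-dimensional grids. The sketch below assumes that package; all parameter values are illustrative rather than our tuned ones:

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import (unary_from_softmax, create_pairwise_gaussian,
                              create_pairwise_bilateral)

def crf_refine(probs, volume, w_pos=3.0, w_bil=5.0,
               sigma_pos=(3, 3, 3), sigma_bil=(3, 3, 3), sigma_int=(10.0,),
               n_iters=5):
    """probs: (L, D, H, W) stacked slice-wise FCN softmax output;
    volume: (D, H, W) intensity volume."""
    n_labels = probs.shape[0]
    d = dcrf.DenseCRF(int(np.prod(volume.shape)), n_labels)

    # Unary potentials: negative log of the FCN probabilities, Eq. (2)
    d.setUnaryEnergy(unary_from_softmax(probs.reshape(n_labels, -1)))

    # Spatial (smoothness) kernel, first term of Eq. (3)
    d.addPairwiseEnergy(create_pairwise_gaussian(sdims=sigma_pos, shape=volume.shape),
                        compat=w_pos)

    # Bilateral (appearance) kernel, second term of Eq. (3)
    d.addPairwiseEnergy(create_pairwise_bilateral(sdims=sigma_bil, schan=sigma_int,
                                                  img=volume[..., np.newaxis], chdim=3),
                        compat=w_bil)

    Q = d.inference(n_iters)                    # mean field approximation
    return np.argmax(Q, axis=0).reshape(volume.shape)
```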

3 Results and Discussion

The qualitative results of the automatic segmentation are presented in Fig. 1. The complex and heterogeneous structure of the liver and all lesions were detected in the shown images. As can be seen in Fig. 1, the cascaded FCN approach yielded higher lesion segmentation accuracy than a single FCN. In general, we observe a significant additional improvement of the slice-wise Dice overlap of the liver segmentation, from a mean Dice of \(93.1\,\%\) to \(94.3\,\%\), after applying the 3D dense CRF.

Table 1. Quantitative segmentation results of the liver on the 3DIRCADb dataset. Scores are reported as presented in the original papers.

Quantitative results of the proposed method are reported in Table 1. The CFCN achieves higher scores than the single FCN architecture. Applying the 3D CRF further improved the segmentation results across the calculated metrics. The runtime per slice of the CFCN is \(2 \cdot 0.2\,\mathrm{s} = 0.4\,\mathrm{s}\) without and 0.8 s with the CRF.

In comparison to the state of the art, such as [2, 5, 15, 16], we presented a framework capable of a combined segmentation of the liver and its lesions.

4 Conclusion

Cascaded FCNs and dense 3D CRFs trained on CT volumes are suitable for the automatic localization and combined volumetric segmentation of the liver and its lesions. Our proposed method competes with the state of the art. We provide our trained models under an open-source license, allowing fine-tuning for other medical applications in CT data. Additionally, we introduced and evaluated the dense 3D CRF as a post-processing step for deep learning-based medical image analysis. Furthermore, and in contrast to prior work such as [5, 15, 16], our proposed method could be generalized to segment multiple organs in medical data using multiple cascaded FCNs. All in all, heterogeneous CT volumes from different scanners and protocols, as present in the 3DIRCADb dataset and in clinical trials, can be segmented in under 100 s each with the proposed approach. We conclude that CFCNs and dense 3D CRFs are promising tools for the automatic analysis of the liver and its lesions in clinical routine.