
Predicting standard-dose PET image from low-dose PET and multimodal MR images using mapping-based sparse representation


Published 6 January 2016 © 2016 Institute of Physics and Engineering in Medicine
Citation: Yan Wang et al 2016 Phys. Med. Biol. 61 791. DOI: 10.1088/0031-9155/61/2/791


Abstract

Positron emission tomography (PET) has been widely used in the clinical diagnosis of diseases and disorders. Obtaining high-quality PET images requires a standard-dose radionuclide (tracer) injection into the human body, which inevitably increases the risk of radiation exposure. One possible solution to this problem is to predict the standard-dose PET image from its low-dose counterpart and its corresponding multimodal magnetic resonance (MR) images. Inspired by the success of patch-based sparse representation (SR) in super-resolution image reconstruction, we propose a mapping-based SR (m-SR) framework for standard-dose PET image prediction. Compared with the conventional patch-based SR, our method uses a mapping strategy to ensure that the sparse coefficients, estimated from the multimodal MR images and low-dose PET image, can be applied directly to the prediction of the standard-dose PET image. As the mapping between multimodal MR images (or the low-dose PET image) and standard-dose PET images can be particularly complex, one step of mapping is often insufficient. To this end, an incremental refinement framework is proposed. Specifically, the predicted standard-dose PET image is further mapped to the target standard-dose PET image, and SR is then performed again to predict a new standard-dose PET image. This procedure can be repeated iteratively to progressively refine the prediction. In addition, a patch selection based dictionary construction method is used to speed up the prediction process. The proposed method is validated on a human brain dataset. The experimental results show that our method can outperform benchmark methods in both qualitative and quantitative measures.


1. Introduction

Positron emission tomography (PET) is an emerging imaging technology that is able to reveal metabolic activities of a tissue (or an organ). Unlike other imaging technologies (e.g. computed tomography (CT) and magnetic resonance imaging (MRI)) that capture anatomical changes in the tissue or organ, PET scans detect biochemical and physiological changes. As these changes often occur before anatomical changes, PET is widely used for proactive treatment and early disease detection, such as for tumors (Chen 2007, Terakawa et al 2008, Dunnwald et al 2011) and brain disorders (Foster et al 2007, Takasawa et al 2008, Quigley et al 2011).

A typical PET scan involves an injection of a radioactive substance (tracer) into the body and the subsequent detection of the gamma rays emitted by the tracer. As a high-quality PET image can provide more detailed functional information, a tracer with a sufficient dose is often used to generate a PET image of clinically acceptable quality. However, the use of a high-dose tracer inevitably leads to substantial radiation exposure, which may be detrimental to the subject's health, particularly for children. As a result, the well-known ALARA (as low as reasonably achievable) principle, commonly applied in CT imaging (Xu et al 2012), is also favored to help minimize the radiation exposure in clinical practice. While this principle minimizes the risk of radiation exposure, it also degrades the quality of the PET image, which is determined by the total injected dose and the total acquisition time. For example, figure 1 shows two images from a low-dose PET (L-PET) scan and a standard-dose PET (S-PET) scan, respectively. We can see that the image from the S-PET scan provides more functional details than the one from the L-PET scan. Consequently, in clinical practice, it is of great interest to obtain high-quality PET images without injecting a high-dose tracer. Note that what constitutes a standard dose may vary across clinics; in this paper, we take the standard dose to be at the low end of the guidelines. In any case, working at the low end of the guidelines is more challenging, since it results in a noisier image. In this paper, the original S-PET image, acquired via injection of a standard-dose tracer, is referred to as the 'ground truth'.

Figure 1. Comparison of image quality for a low-dose PET (L-PET) scan and its corresponding standard-dose PET (S-PET) scan.

Although many efforts have been made to tackle the problem above, most existing methods concentrate on improving the quality based on the PET image itself. Examples include motion correction (Gigengack et al 2012, Olesen et al 2013), partial volume correction (Lehnert et al 2012, Coello et al 2013), and attenuation correction (Bai and Brady 2011, Andersen et al 2014). In contrast, an alternative way is to incorporate anatomical information from other imaging modalities to further improve PET image quality, e.g. CT or MRI (Liu et al 2010, Pichler et al 2010, Zaidi et al 2011, Chopra et al 2012, Lumbreras et al 2010).

Combined PET/CT scanners have been widely used for diagnostic PET. However, soft-tissue contrast in CT is typically poor, and CT images are therefore of limited value, in terms of delineating organs and other tissue boundaries. Conversely, as an imaging modality complementary to the high-sensitivity functional information gathered by PET, MR scanners are able to provide high-resolution anatomical information with excellent soft-tissue contrast (Boss et al 2010). Consequently, the PET images are typically co-registered to MR images, which are then used to better reconstruct PET images (Bai et al 2013). Specifically, the multimodal MR images used in this paper include the T1-weighted MR image, fractional anisotropy (FA), and mean diffusivity (MD).

Sparse representation (SR) has been widely used to study high-dimensional data (Baraniuk et al 2010). Recently, patch-based sparse representation has drawn considerable attention, with broad applications in the fields of computer vision (Wright et al 2009, Gao et al 2010, Wright et al 2010, Xie et al 2010, Wagner et al 2012) and medical image analysis (Shi et al 2012a, Wee et al 2012, Zhang et al 2012a, 2012b). Specifically, inspired by the success of patch-based SR in super-resolution image reconstruction (Dong et al 2010, Yang et al 2010, Zhang et al 2011), we propose a similar method for S-PET image prediction. Note that the underlying assumption of the patch-based SR is that the sparse representation, estimated from the low-resolution image samples, can be directly applied to the high-resolution image samples, for reconstructing the high-resolution image from a given low-resolution image. This assumption implicitly requires the distribution of high-resolution image samples to be very similar to that of low-resolution image samples. However, this may not be true in practice, which adversely affects the final performance of the SR based method (Gao and Yang 2014). This problem may be more prominent for S-PET image prediction, due to the huge difference in imaging mechanisms between MRI and PET.

In this paper, we propose a mapping-based SR (m-SR) framework to address the aforementioned issue. Specifically, the feature distribution spaces vary among multimodal MR images, L-PET images and the corresponding S-PET images. Consequently, directly using the combination of multimodal MR and L-PET images to derive S-PET image may lead to inaccurate prediction. In order to solve this problem, we first design a graph-based mapping scheme to calculate a mapping matrix to transform the L-PET and MR to the feature distribution space of S-PET during the training phase. Next, the mapped training multimodal MR and L-PET images are used to build the coding dictionary. During the testing phase, the generated encoding dictionary is used to encode the input testing MR and L-PET images (transformed by the same mapping matrices obtained during the training phase) using SR. Finally, the obtained sparse coding coefficients are applied to the training S-PET images, associated with the mapped training MR and L-PET images, to predict the testing S-PET image. Note that, even though the proposed mapping procedure can reduce the differences between the multimodal MR/L-PET and S-PET images, it is still difficult to infer their true relationship with such a simple step of linear mapping. Therefore, a novel refinement framework is proposed to further improve the prediction. Specifically, the predicted S-PET image is further mapped to the target S-PET image again, and then a new prediction will be generated similarly by applying the aforementioned reconstruction procedure. By repeating the step above, we end up with an improved prediction for the S-PET image. To improve the efficiency of the SR procedure, we also perform patch selection to reduce the size of dictionary, before predicting an S-PET image.

In the following, we first briefly introduce the basic idea of the patch-based SR in section 2 with an instantiation of our problem. Then, we elaborate our method in section 3. Finally, we show the experimental results in section 4, and draw conclusions in section 5.

2. Patch-based sparse representation (SR)

For simplicity, we first introduce below the S-PET image prediction with the patch-based SR using an L-PET image and a T1-weighted MR image; the strategy for incorporating other multimodal MR images for better prediction will be discussed in the next section. Suppose that we have N training subjects, each with an MR image, an L-PET image, and a corresponding S-PET image. All of these 3N images are used as training data. Given a testing MR image and the associated L-PET image, the goal is to predict the S-PET image corresponding to the given L-PET image. The benefit of using an SR-based method is that it can reduce the noise in the S-PET image prediction; such noise is commonly introduced during the acquisition of the L-PET image (for more details, please refer to (Farouk 2012, Zhang et al 2012c)).

To predict the value of any voxel x in the unknown (testing) S-PET image, we first select a set of patches of the same size (p × p × p) from the training set to construct a pair of coupled dictionaries: a coding dictionary (CD) and a reconstruction dictionary (RD). Specifically, we first define a neighborhood centered at voxel x with the size of w × w × w in both the training MR and L-PET images. We then generate the coding dictionary, $D_{x}^{M\_L}$, by grouping all the patches within such a neighborhood across all MR and L-PET images of the training subjects (figure 2(a)). Each column of the coding dictionary is denoted as an atom, and there are in total w × w × w × N atoms in $D_{x}^{M\_L}$. Furthermore, each atom contains p × p × p × 2 intensity features (figure 2(a)), half of which (shown in purple) are extracted from the training MR image and the other half (shown in green) from the training L-PET image. Similarly, we can build the corresponding reconstruction dictionary, $D_{x}^{S}$, containing (p × p × p)-dimensional column vectors extracted from the corresponding S-PET images of the N training subjects, as shown in figure 2(b).

Figure 2. Illustration of the dictionary construction based on the training dataset: (a) coding dictionary (CD), $D_{x}^{M\_L}$; and (b) reconstruction dictionary (RD), $D_{x}^{S}$.
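To make this construction concrete, the following Python sketch builds the two coupled dictionaries for a voxel x, under simplifying assumptions (volumes stored as NumPy arrays, x far enough from the border, one MR channel); the function names `extract_patch` and `build_dictionaries` are ours, for illustration only:

```python
import numpy as np

def extract_patch(img, center, p):
    """Flattened p x p x p patch centered at `center` (assumes the patch fits in the volume)."""
    r = p // 2
    x, y, z = center
    return img[x - r:x + r + 1, y - r:y + r + 1, z - r:z + r + 1].ravel()

def build_dictionaries(mr_imgs, lpet_imgs, spet_imgs, x, p=5, w=15):
    """Coding dictionary D_x^{M_L} (MR and L-PET patches stacked per atom) and
    reconstruction dictionary D_x^S, collected over the w x w x w neighborhood
    of voxel x across all N training subjects."""
    r = w // 2
    cd_atoms, rd_atoms = [], []
    for mr, lp, sp in zip(mr_imgs, lpet_imgs, spet_imgs):
        for dx in range(-r, r + 1):
            for dy in range(-r, r + 1):
                for dz in range(-r, r + 1):
                    c = (x[0] + dx, x[1] + dy, x[2] + dz)
                    # each coding atom holds 2 * p^3 intensity features
                    cd_atoms.append(np.concatenate([extract_patch(mr, c, p),
                                                    extract_patch(lp, c, p)]))
                    # the paired reconstruction atom holds the p^3 S-PET intensities
                    rd_atoms.append(extract_patch(sp, c, p))
    # shapes: (2 p^3, w^3 N) and (p^3, w^3 N)
    return np.column_stack(cd_atoms), np.column_stack(rd_atoms)
```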

Given the testing MR and L-PET images from a new testing subject, it is assumed that each patch in the testing images can be sparsely represented by a linear combination of the patches in $D_{x}^{M\_L}$ . Specifically, to predict the patch centered at x in the unknown S-PET image of the testing subject, a set of sparse coefficients, ${{\alpha}_{x}}$ , is calculated by minimizing the following non-negative elastic-net problem (Zou and Hastie 2005):

$\alpha_{x}=\mathop{\arg\min}_{\alpha_{x}\geqslant 0}\left\| f^{M\_L}(x)-D_{x}^{M\_L}\alpha_{x}\right\|_{2}^{2}+\lambda_{1}\left\|\alpha_{x}\right\|_{1}+\lambda_{2}\left\|\alpha_{x}\right\|_{2}^{2}$   (1)

where ${{f}^{M\_L}}(x)$ is a feature vector containing the raw intensity values extracted at x from both the testing MR and L-PET images. There are three terms in equation (1). The first term measures how well the feature vector ${{f}^{M\_L}}(x)$ can be represented by the coding dictionary. The second term enforces the sparsity on coefficients via ${{l}_{1}}$ regularization. The last term encourages similar patches to have similar coefficients. ${{\lambda}_{1}}$ and ${{\lambda}_{2}}$ are weights to balance these three terms.

Once we obtain ${{\alpha}_{x}}$ , we can estimate the intensity values of the patch centered at x in the unknown S-PET of the testing subject, ${{f}^{S}}(x)$ , as follows:

$f^{S}(x)=D_{x}^{S}\,\alpha_{x}$   (2)

Then, the intensity value at voxel x of the unknown S-PET of the testing subject can be obtained by taking the value in the center of ${{f}^{S}}(x)$ . An outline of the patch-based SR is illustrated in figure 3.

Figure 3. Illustration of the patch-based SR procedure.
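The coding step (equation (1)) and reconstruction step (equation (2)) can be sketched as follows. This is our illustrative reimplementation, not the original code: it reuses scikit-learn's ElasticNet, whose positive=True option enforces the non-negativity constraint, and converts $\lambda_{1}$ and $\lambda_{2}$ to the (alpha, l1_ratio) parametrization, which rescales the data term by 1/(2n):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def predict_patch(f_ml, D_ml, D_s, lam1=0.1, lam2=0.01):
    """Sparse-code the test feature f_ml against D_ml (equation (1)), then apply
    the coefficients to the reconstruction dictionary D_s (equation (2))."""
    n = D_ml.shape[0]                      # feature dimension of an atom
    a = lam1 / (2 * n) + lam2 / n          # sklearn's alpha
    l1_ratio = (lam1 / (2 * n)) / a        # fraction of the penalty that is l1
    model = ElasticNet(alpha=a, l1_ratio=l1_ratio, positive=True,
                       fit_intercept=False, max_iter=5000)
    model.fit(D_ml, f_ml)                  # alpha_x = argmin_{alpha >= 0} ...
    return D_s @ model.coef_               # predicted S-PET patch f^S(x)
```

The center entry of the returned vector gives the predicted intensity at voxel x.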

3. Proposed method

3.1. Mapping based SR

3.1.1. Graph-based mapping procedure.

The underlying assumption of the patch-based SR is that the embedded geometric relationship of patches in the training S-PET images is very similar to that of patches in the training multimodal MR and L-PET images. However, this assumption hardly holds, due to noise in acquisition and transmission, as well as the huge difference in imaging mechanisms between PET and MRI. To solve this problem, we propose transforming the training multimodal MR and L-PET patches to their respective S-PET patches using a graph-based distribution mapping method. Note that a graph here represents the distribution of feature vectors of training patches, consisting of a set of nodes and edges (linking similar patches). Each node in the graph represents a training patch, and each edge describes the geometric relationship between a pair of training patches. Specifically, we use graphs ${{g}^{M}}$/${{g}^{L}}$ to describe the distributions of feature vectors of patches from the training MR and L-PET images, respectively, and graph ${{g}^{S}}$ to describe patches from the training S-PET images. The mapping procedure aims to make the transformed graphs ${{g}^{M}}^{'}$/${{g}^{L}}^{'}$ match the graph ${{g}^{S}}$. This can be achieved by node-to-node matching, edge-to-edge matching, or even high-order matching between ${{g}^{M}}^{'}$/${{g}^{L}}^{'}$ and ${{g}^{S}}$. Specifically, the mapping procedure for the graph of L-PET is given below as an example:

$M^{L}=\mathop{\arg\min}_{M^{L}}\sum_{i}\left\| A_{i}^{S}-M^{L}A_{i}^{L}\right\|_{2}^{2}+\beta_{1}\sum_{i,j}\left\| \left(A_{i}^{S}-A_{j}^{S}\right)-M^{L}\left(A_{i}^{L}-A_{j}^{L}\right)\right\|_{2}^{2}+\cdots$   (3)

where $A_{i}^{S}$ is an S-PET image patch, $A_{i}^{L}$ is its corresponding L-PET image patch, and ${{M}^{L}}$ is a mapping matrix to transform $A_{i}^{L}$. The first term represents node-to-node matching between the graphs ${{g}^{S}}$ and ${{g}^{L}}$; that is, the mapped L-PET image patch ${{M}^{L}}\times A_{i}^{L}$ should be similar to $A_{i}^{S}$. The second term enforces that the relationship between the mapped patches ${{M}^{L}}\times A_{i}^{L}$ and ${{M}^{L}}\times A_{j}^{L}$ should be very similar to that between $A_{i}^{S}$ and $A_{j}^{S}$. Higher-order matching terms (the trailing dots in equation (3)) can be appended analogously. Examples of node-to-node matching, edge-to-edge matching, and high-order matching are given in figure 4. Note that the goal of the procedure above is to obtain the mapping matrix ${{M}^{L}}$ between the L-PET and S-PET images. The mapping matrices ${{M}^{T1}}$, ${{M}^{FA}}$, ${{M}^{MD}}$ between the single-modality MR images (T1, FA, MD) and the S-PET images can be estimated in a similar way.

Figure 4. The illustration of the mapping procedure between the training L-PET and S-PET image patches.
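Using only the node-to-node and edge-to-edge terms of equation (3) (the setting adopted in our experiments, see section 4), the minimization over $M^{L}$ is a linear least-squares problem with a closed-form solution. The sketch below is our own derivation of that solution, with illustrative variable names:

```python
import numpy as np

def estimate_mapping(A_L, A_S, beta1=0.8, eps=1e-8):
    """A_L, A_S: (d, n) matrices whose columns are paired L-PET / S-PET patch vectors.
    Returns M minimizing ||A_S - M A_L||_F^2
                       + beta1 * sum_{i<j} ||(A_i^S - A_j^S) - M (A_i^L - A_j^L)||^2."""
    d, n = A_L.shape
    # pairwise-difference (edge) matrices; for large n, subsample the pairs
    idx_i, idx_j = np.triu_indices(n, k=1)
    E_L = A_L[:, idx_i] - A_L[:, idx_j]
    E_S = A_S[:, idx_i] - A_S[:, idx_j]
    # normal equations: M (A_L A_L^T + beta1 E_L E_L^T) = A_S A_L^T + beta1 E_S E_L^T
    lhs = A_L @ A_L.T + beta1 * (E_L @ E_L.T) + eps * np.eye(d)
    rhs = A_S @ A_L.T + beta1 * (E_S @ E_L.T)
    return np.linalg.solve(lhs.T, rhs.T).T   # M of shape (d, d)
```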

3.1.2. Fusion of multimodal MR images for prediction of the S-PET image.

Once we obtain the mapping matrices ${{M}^{T1}}$, ${{M}^{FA}}$ and ${{M}^{MD}}$, we can apply them to the multimodal MR images (T1, FA and MD), respectively, to obtain the mapped T1 (m-T1), mapped FA (m-FA), and mapped MD (m-MD) images, and then perform sparse representation. Note that we have to balance the contributions of the individual channels of the multimodal MR images before any sparse representation; otherwise, the errors in representing the multimodal MR images would overpower the L-PET image. To determine the weight for each channel, we use the following equation:

$\mathbf{w}=\mathop{\arg\min}_{\mathbf{w}}\sum_{i=1}^{N}\left\| I_{i}^{S}-\left({{w}_{1}}I_{i}^{T1}+{{w}_{2}}I_{i}^{FA}+{{w}_{3}}I_{i}^{MD}\right)\right\|_{2}^{2}$   (4)

where $I_{i}^{S}$ is the S-PET image of subject i, $I_{i}^{T1}$, $I_{i}^{FA}$, and $I_{i}^{MD}$ represent the mapped T1, mapped FA, and mapped MD images of subject i, respectively, and $\mathbf{w}=[{{w}_{1}},{{w}_{2}},{{w}_{3}}]$ is the vector of weights associated with $I_{i}^{T1}$, $I_{i}^{FA}$, and $I_{i}^{MD}$. Then, we can obtain the multimodal MR fusion image of subject i by using

$\tilde{I}_{i}^{M}={{w}_{1}}I_{i}^{T1}+{{w}_{2}}I_{i}^{FA}+{{w}_{3}}I_{i}^{MD}$   (5)

Similarly, the mapped low-dose PET (m-L-PET) image can be obtained by applying the mapping matrix ${{M}^{L}}$ to the L-PET image. Then, each voxel x in the unknown S-PET image of the testing subject can be predicted via two procedures: the coding procedure and the reconstruction procedure. Specifically, similarly to the patch-based SR procedure (section 2), the coding dictionary is generated by grouping all patches within a neighborhood centered at voxel x across all MR fusion images and m-L-PET images, and the reconstruction dictionary is built from the corresponding S-PET images in a similar manner. Finally, we obtain the predicted S-PET image (S-PET)' via voxel-wise prediction. The flowchart of the m-SR procedure for S-PET prediction is shown in figure 5.

Figure 5. Flowchart of the m-SR procedure for S-PET prediction in the testing stage.
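A minimal sketch of equations (4) and (5), assuming the mapped images are given as equally sized NumPy volumes; the channel weights are obtained by ordinary least squares over all voxels of all training subjects (function names are illustrative):

```python
import numpy as np

def fusion_weights(mapped_t1, mapped_fa, mapped_md, spet):
    """Equation (4): least-squares fit of w = [w1, w2, w3] over all training voxels."""
    # design matrix: one row per voxel, one column per mapped MR channel
    X = np.column_stack([np.concatenate([im.ravel() for im in mapped_t1]),
                         np.concatenate([im.ravel() for im in mapped_fa]),
                         np.concatenate([im.ravel() for im in mapped_md])])
    y = np.concatenate([im.ravel() for im in spet])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w                                    # [w1, w2, w3]

def fuse(t1, fa, md, w):
    """Equation (5): the multimodal MR fusion image of one subject."""
    return w[0] * t1 + w[1] * fa + w[2] * md
```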

3.2. Incremental refinement

Even if distributions of the S-PET image patches and the multimodal MR/L-PET image patches could be made to be more similar to each other, after the mapping strategy above, discrepancies may still exist due to the different sources of information, which cannot be bridged by one step of simple linear mapping. To solve this problem, we propose an incremental refinement scheme to further improve the quality of the prediction. This scheme consists of a training refinement stage and a testing refinement stage as detailed below.

Training refinement stage: We construct multiple layers to iteratively refine the quality of a set of predicted S-PET images ${{\mathbf{I}}^{S}}$. Here, we use the set of L-PET images ${{\mathbf{I}}^{L}}$ as the initial prediction set ${{\mathbf{I}}^{S(0)}}$. For each layer, the prediction results from the previous layer are fed into m-SR to generate a new set of predictions. Specifically, at the bth layer, the mapping matrices ${{M}^{T1}}$, ${{M}^{FA}}$, ${{M}^{MD}}$ are learned and used to generate the set of MR fusion images ${{\tilde{\mathbf{I}}}^{M}}$ according to equation (5). Meanwhile, ${{M}^{S(b)}}$ is learned and used to map the prediction results from the previous layer ${{\mathbf{I}}^{S(b-1)}}$ to the S-PET images ${{\mathbf{I}}^{S}}$, thus obtaining the new mapped images ${{\tilde{\mathbf{I}}}^{S(b)}}$. For each sample $\tilde{I}_{i}^{S(b)}\in {{\tilde{\mathbf{I}}}^{S(b)}}$, a leave-one-out strategy is then utilized to construct the coding dictionary and the reconstruction dictionary in the SR (i.e. all other mapped samples except $\tilde{I}_{i}^{M}$, $\tilde{I}_{i}^{S(b)}$ and $I_{i}^{S}$ are utilized). The prediction $I_{i}^{S(b)}$ of the bth layer is finally obtained via SR with the new dictionary. The above procedure is summarized in algorithm 1.

Algorithm 1: Training refinement stage
1: Input: A set of training L-PET images ${{\mathbf{I}}^{L}}=\left\{I_{1}^{L},\,I_{2}^{L},\ldots,I_{N}^{L}\right\}$ , a set of training multimodal MR images including ${{\mathbf{I}}^{T1}}=\left\{I_{1}^{T1},\,I_{2}^{T1},\ldots,I_{N}^{T1}\right\}$ , ${{\mathbf{I}}^{FA}}=\left\{I_{1}^{FA},\,I_{2}^{FA},\ldots,I_{N}^{FA}\right\}$ , ${{\mathbf{I}}^{MD}}=\left\{I_{1}^{MD},\,I_{2}^{MD},\ldots,I_{N}^{MD}\right\}$ , and a set of training S-PET images ${{\mathbf{I}}^{S}}=\left\{I_{1}^{S},\,I_{2}^{S},\ldots,I_{N}^{S}\right\}$ . N is the total number of training subjects, each with five images (L-PET, T1, FA, MD, and S-PET).
Parameter: the total number of layers B.
2: Initialize: ${{\mathbf{I}}^{L}}$ is used as the initial prediction ${{\mathbf{I}}^{S(0)}}\,$ of the set of S-PET images $\,{{\mathbf{I}}^{S}}$ .
3: For b  =  1:B
4: Perform the mapping procedure between ${{\mathbf{I}}^{T1}}$ , ${{\mathbf{I}}^{FA}}$ , ${{\mathbf{I}}^{MD}}$ ,$\,{{\mathbf{I}}^{S(b-1)}}$ and ${{\mathbf{I}}^{S}}$ to obtain the mapping matrices ${{M}^{T1}}$ , ${{M}^{FA}}$ , ${{M}^{MD}}$ and ${{M}^{S(b)}}$ (equation (3))
${{M}^{T1}}:\,{{\mathbf{I}}^{T1}}\to {{\mathbf{I}}^{S}}$
${{M}^{FA}}:{{\mathbf{I}}^{FA}}\to {{\mathbf{I}}^{S}}$
${{M}^{MD}}:\,{{\mathbf{I}}^{MD}}\to {{\mathbf{I}}^{S}}$
${{M}^{S(b)}}:\,{{\mathbf{I}}^{S(b-1)}}\to {{\mathbf{I}}^{S}}$
5: Compute the mapped multimodal MR fusion image (equations (4) and (5)) and L-PET images by
${{\tilde{\mathbf{I}}}^{M}}={{w}_{1}}{{M}^{T1}}{{\mathbf{I}}^{T1}}+{{w}_{2}}{{M}^{FA}}{{\mathbf{I}}^{FA}}+{{w}_{3}}{{M}^{MD}}{{\mathbf{I}}^{MD}}$
${{\tilde{\mathbf{I}}}^{S(b)}}={{M}^{S(b)}}{{\mathbf{I}}^{S(b-1)}}$
6: For each training subject i (i  =  1,2,...,N), construct the coding dictionary (CD) and reconstruction dictionary (RD) in a leave-one-out manner, i.e. excluding $\tilde{I}_{i}^{M}$, $\tilde{I}_{i}^{S(b)}$ and $I_{i}^{S}$, as follows:
CD: $\left[\begin{array}{c} \tilde{I}_{1}^{M},\tilde{I}_{2}^{M},\ldots,\tilde{I}_{i-1}^{M},\tilde{I}_{i+1}^{M},\ldots,\tilde{I}_{N}^{M} \\ \tilde{I}_{1}^{S(b)},\tilde{I}_{2}^{S(b)},\ldots,\tilde{I}_{i-1}^{S(b)},\tilde{I}_{i+1}^{S(b)},\ldots,\tilde{I}_{N}^{S(b)} \end{array}\right]$ RD: $\left[I_{1}^{S},I_{2}^{S},\ldots,I_{i-1}^{S},I_{i+1}^{S},\ldots,I_{N}^{S}\right]$
7: Compute the predicted S-PET image $I_{i}^{S(b)}\,\left(i=1,2,...,N\right)$ using the above-given new CD and RD via SR (equation (1), equation (2)).
8: End For
9: Output: The learned mapping matrices ${{M}^{T1}}$ , ${{M}^{FA}}$ , ${{M}^{MD}}$ and ${{M}^{S(1)}},{{M}^{S(2)}},\ldots,{{M}^{S(B)}}$ , and the predicted S-PET ${{\mathbf{I}}^{S(1)}},{{\mathbf{I}}^{S(2)}},\ldots,{{\mathbf{I}}^{S(B)}}$ .

Testing refinement stage: Suppose that we have a new L-PET image $I_{t}^{L}$, together with its corresponding multimodal MR images $I_{t}^{T1}$, $I_{t}^{FA}$, $I_{t}^{MD}$ from a testing subject, where $t$ denotes the testing subject. We can use the scheme shown in figure 6 to obtain the final prediction. Specifically, taking the bth layer as an example, the mapping matrices ${{M}^{T1}}$, ${{M}^{FA}}$, ${{M}^{MD}}$ and ${{M}^{S(b)}}$ learned for that layer in the training stage are applied, respectively, to the MR images and the previously predicted S-PET $I_{t}^{S(b-1)}$ of the testing subject. After applying the fusion strategy, we obtain the MR fusion image of the testing subject ($\tilde{I}_{t}^{M}$). Then, based on all mapped samples (${{\tilde{\mathbf{I}}}^{M}}$, ${{\tilde{\mathbf{I}}}^{S(b)}}$) learned from the same layer in the training stage and the S-PET samples (${{\mathbf{I}}^{S}}$), the coding dictionary and reconstruction dictionary can be built. Finally, the prediction $I_{t}^{S(b)}$ can be obtained via SR and used as the input for the next layer. By going through all of the layers, we obtain the final estimate of the S-PET image for the testing subject.

Figure 6. Illustration of the incremental refinement procedure.
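The layered control flow of algorithm 1 and the testing stage can be sketched as follows. This is a deliberately simplified, image-level toy: it assumes the MR mapping and fusion have already produced the fused features F_M, uses only the node-to-node term for the per-layer mapping, and takes a hypothetical callable `sr_predict` as a stand-in for the patch-wise m-SR of section 3.1. It illustrates the structure only, not the authors' implementation:

```python
import numpy as np

def ridge_map(X, Y, eps=1e-6):
    """Least-squares mapping M with M @ X ~= Y (node-to-node term only)."""
    d = X.shape[0]
    lhs = X @ X.T + eps * np.eye(d)
    return np.linalg.solve(lhs, (Y @ X.T).T).T

def refine(F_M, F_L, F_S, f_M_test, f_L_test, sr_predict, B=3):
    """F_M, F_L, F_S: (d, N) matrices of fused-MR, L-PET and S-PET features;
    f_M_test, f_L_test: (d,) test features. sr_predict(D_M, D_prev, D_S, q_M, q_prev)
    stands in for the patch-wise m-SR prediction."""
    preds_train, pred_test = F_L.copy(), f_L_test.copy()
    for b in range(B):
        M_S = ridge_map(preds_train, F_S)       # map layer (b-1) predictions toward S-PET
        mapped = M_S @ preds_train
        # leave-one-out: subject i is excluded from its own dictionaries
        preds_train = np.column_stack([
            sr_predict(np.delete(F_M, i, 1), np.delete(mapped, i, 1),
                       np.delete(F_S, i, 1), F_M[:, i], mapped[:, i])
            for i in range(F_S.shape[1])])
        pred_test = sr_predict(F_M, mapped, F_S, f_M_test, M_S @ pred_test)
    return pred_test                            # layer-B prediction for the test subject
```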

3.3. Patch selection based dictionary construction

One of the critical issues in the SR is how to construct a dictionary efficiently. The size of the dictionary is proportional to the size of the search neighborhood (w × w × w), as discussed in section 2. A small neighborhood will lead to a small dictionary, which may not be representative and reliable enough for coding (Dong et al 2011). On the other hand, a large neighborhood will lead to a redundant dictionary, which may introduce too many irrelevant patches and also increase the computational time.

To address this issue, we propose to perform patch selection before constructing the dictionary. Given an image patch to encode, we select the most relevant patches from all the patches across the training set to construct the dictionary. Similarity between patches is used as the criterion for patch selection; here, we measure patch similarity by the Euclidean distance between patches. By using patch selection, we can exclude irrelevant patches while retaining the related ones. All selected patches are then used as atoms to construct the dictionary. Since the new dictionary can better represent the given testing patch, the entire S-PET image can be better estimated than with a universal dictionary, as validated in the experiments below. A minimal sketch of this selection step follows.
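In the sketch (illustrative names, ours), the coding and reconstruction dictionaries are pruned together so that each kept coding atom retains its paired S-PET atom:

```python
import numpy as np

def select_atoms(f_query, D_cd, D_rd, R=800):
    """f_query: (d,) test feature; D_cd: (d, K) coding dictionary;
    D_rd: (p^3, K) reconstruction dictionary. Keeps the R most similar atom pairs."""
    dists = np.linalg.norm(D_cd - f_query[:, None], axis=0)   # Euclidean distance to each atom
    keep = np.argsort(dists)[:R]                              # indices of the R nearest atoms
    return D_cd[:, keep], D_rd[:, keep]
```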

4. Experiments

We first evaluate our method on a human brain dataset containing eight normal subjects. The detailed demographic information of these subjects is summarized in table 1.

Table 1. Demographic information of the normal subjects.

Subject ID Age Gender Weight (kg)
1 26 Female 50.35
2 30 Male 137.89
3 33 Male 102.97
4 25 Female 85.73
5 18 Female 59.87
6 19 Male 72.57
7 36 Male 102.06
8 28 Female 83.91

Before scanning, each subject was administered an average of 203 MBq (range 191 to 229 MBq) of 18F-2-deoxyglucose (18FDG). Within 60 min of injection, an S-PET image was acquired in 12 min, while an L-PET image was acquired in 3 min. The S-PET and L-PET were acquired separately, so the noise in the S-PET and L-PET images was not correlated. Since the acquisition time of the L-PET was a quarter of that of the S-PET, the L-PET can be assumed to correspond to a quarter of the standard dose. Meanwhile, a T1-weighted MPRAGE MR image and a diffusion tensor imaging (DTI) scan were also acquired. For each subject, the PET images and the DTI images were, respectively, co-registered to the T1 image via affine transformation (Smith et al 2004). Finally, the FA and MD images were computed from the resulting warped DTI image (Tournier et al 2011). All images were obtained on a Siemens Biograph mMR system housed in the Biomedical Research Imaging Center at the University of North Carolina at Chapel Hill (UNC), and this study was approved by the Institutional Review Board. Both the S-PET and L-PET images were reconstructed by the software provided by the vendor, including MRI-based attenuation correction using the Dixon sequence and corrections for randoms and scatter. Iterative reconstruction was employed with the ordered subsets expectation maximization (OSEM) algorithm (Hudson and Larkin 1994), with three iterations, 21 subsets, and post-reconstruction filtering with a 3D Gaussian of sigma 2 mm. Each image has a size of 256 × 256 × 256 and a resolution of 2.09 mm × 2.09 mm × 2.03 mm. The dataset was preprocessed by skull stripping (Shi et al 2012b) and intensity normalization before all the experiments.

A leave-one-out cross-validation (LOOCV) strategy was used for evaluation. We used the first term and the second term of equation (3) to do the mapping. We set the following parameters for the patch-based SR in all experiments throughout this paper:

  • Patch size: 5 × 5 × 5;
  • Neighborhood/search window: 15 × 15 × 15;
  • Number of atoms in the dictionary after patch selection: R = 800;
  • Mapping procedure: ${{\beta}_{1}}$ = 0.8;
  • Sparse coding procedure: ${{\lambda}_{1}}$ = 0.1 and ${{\lambda}_{2}}$ = 0.01;
  • Number of layers for the refinement: B = 3.

To evaluate the performance of the proposed method, we compare the prediction with the ground truth. Specifically, we use the following two metrics for evaluation:

  • Normalized mean square error (NMSE): This is used to measure the voxel-wise intensity differences between the predicted S-PET image ${{\widehat{I}}^{S}}$ and the ground truth ${{I}^{S}}$ .
    $\text{NMSE}=\frac{\left\| {{I}^{S}}-{{\widehat{I}}^{S}}\right\|_{2}^{2}}{\left\| {{I}^{S}}\right\|_{2}^{2}}$   (6)
  • Peak signal-to-noise ratio (PSNR): This is used to evaluate the prediction accuracy in terms of the logarithmic decibel scale. The PSNR is calculated as
    $\text{PSNR}=10\,{{\log}_{10}}\left(\frac{Q{{P}^{2}}}{\left\| {{I}^{S}}-{{\widehat{I}}^{S}}\right\|_{2}^{2}}\right)$   (7)
    where $P$ is the larger of the intensity ranges of ${{I}^{S}}$ and ${{\widehat{I}}^{S}}$ (i.e. the dynamic range of the image), and $Q$ is the total number of voxels in the image.

In general, a lower NMSE and a higher PSNR indicate higher image quality. A direct implementation of both metrics is sketched below.
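Both metrics follow equations (6) and (7) as given above, with $P$ taken as the larger dynamic range of the two images and $Q$ the voxel count:

```python
import numpy as np

def nmse(pred, gt):
    """Equation (6): squared error normalized by the ground-truth energy."""
    return np.sum((gt - pred) ** 2) / np.sum(gt ** 2)

def psnr(pred, gt):
    """Equation (7): peak signal-to-noise ratio in dB."""
    P = max(gt.max() - gt.min(), pred.max() - pred.min())   # dynamic range
    Q = gt.size                                             # number of voxels
    return 10 * np.log10(Q * P ** 2 / np.sum((gt - pred) ** 2))
```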

4.1. S-PET prediction using m-SR

We first run our method without the refinement strategy, i.e. using only a one-layer model: (i) estimating the mapping matrices before SR, (ii) using the multimodal MR and L-PET images in both the training and testing stages, and (iii) using patch selection when constructing the coding and reconstruction dictionaries. An example of the predicted S-PET image is shown in figure 7.

Figure 7. An example of the predicted S-PET image based on the m-SR method without refinement.

Figure 7 clearly shows that the prediction by our method significantly improves the details over the L-PET image, and is quite similar to the ground truth. For quantitative comparison, we computed NMSE and PSNR with the results given in figure 8, which indicates a significant improvement over the original L-PET image.

Figure 8. The performance of the proposed m-SR method in terms of PSNR and NMSE.

4.2. Influence of important elements in m-SR

We now investigate the influence of some important elements of the proposed method on prediction. Such elements include (1) the mapping procedure, (2) the use of multimodal MR images for prediction, and (3) the patch selection based dictionary construction. Note that the experiments below were done using only a one-layer model (B  =  1).

4.2.1. Influence of the mapping procedure.

To show the influence of the mapping procedure, we compare the m-SR with the classic patch-based SR (i.e. without the mapping procedure) (Yang et al 2010). An example of such comparison is given in figure 9, where red rectangles highlight the regions from which we can see the most significant difference.

Figure 9. An example of the influence of the mapping procedure on the S-PET image prediction. Red rectangles highlight the regions with the most significant difference.

We can see that our results look more similar to the ground truth than those obtained from the classic patch-based SR. This can also be observed from a detailed quantitative comparison as shown in figure 10. From this figure, we can also see that our method consistently achieves higher PSNR and lower NMSE across all subjects, indicating that the mapping procedure does help the SR to enhance the prediction quality.

Figure 10. Influence of the mapping procedure on the S-PET image prediction.

4.2.2. Influence of the use of multimodal MR images for prediction.

We respectively used (1) L-PET image alone, (2) T1  +  L-PET images, (3) FA  +  L-PET images, (4) MD  +  L-PET images, and (5) T1  +  FA  +  MD  +  L-PET images, for prediction. The detailed quantitative comparison is given in figure 11.

Figure 11. Influence of the use of multimodal MR images on the S-PET image prediction.

We can see that using only L-PET images generates the predictions with the lowest quality. This implies that the additional information introduced by the multimodal MR images is necessary to improve the prediction. Although an improvement can be obtained by using any single channel of MR images (i.e. T1, FA, MD) together with the L-PET images, the prediction is still not as good as when all multimodal MR images and the L-PET images are used jointly. We can also see that the predictions given by the single-modality MR images are similar to each other across all three channels. This suggests that the different channels contribute similarly in most brain areas. However, in some local areas, they may contribute differently. For example, FA and MD are commonly used clinically to localize white matter lesions, which may not show up on other clinical MRI sequences. Therefore, to evaluate whether the combination of multimodal MR with L-PET images could bring further improvement in local areas, we labeled the T1 images with 56 ROIs using a multi-atlas based method (Rohlfing et al 2004). Specifically, 15 MR brain images from the OASIS project, with the corresponding label maps provided by Neuromorphometrics, Inc. (http://Neuromorphometrics.com/) under academic subscription, were selected as atlases. For a target image to be labeled, we registered the multiple atlases onto the target image space (using the FLIRT and Demons registration methods) (Vercauteren et al 2009, Smith et al 2004), and then used the estimated transformations to warp the corresponding label maps of the atlases. Finally, we used a majority voting scheme to fuse the warped label maps of all atlases, thus obtaining the label map for the given target image; a sketch of this fusion step is given below. A sample label map with 56 ROIs is shown in figure 12.

Figure 12. The label map with 56 ROIs.
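For completeness, a compact sketch of the majority-voting fusion step (the FLIRT/Demons registration and label-map warping are assumed already done; the function below only fuses the aligned label maps, and the implementation is ours):

```python
import numpy as np

def majority_vote(warped_labels):
    """warped_labels: list of integer 3D label maps, all aligned to the target.
    Returns the per-voxel most frequent label."""
    stack = np.stack(warped_labels)               # (n_atlases, X, Y, Z)
    best_count = np.zeros(stack.shape[1:], dtype=np.int32)
    fused = np.zeros(stack.shape[1:], dtype=stack.dtype)
    for lab in np.unique(stack):                  # iterate labels to keep memory modest
        count = (stack == lab).sum(axis=0)        # votes for this label at each voxel
        win = count > best_count
        fused[win], best_count[win] = lab, count[win]
    return fused
```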

Across all 56 segmented ROIs, the mean, minimum and maximum PSNR/NMSE values of the different combination methods were computed over the eight subjects. The corresponding results are given in table 2.

Table 2. Mean, min and max PSNR/NMSE values across the 56 ROIs.

Combination method PSNR mean (min, max) NMSE mean (min, max)
L-PET 19.02(15.74, 22.41) 0.016(0.011, 0.026)
T1  +  L-PET 19.56(16.75, 22.91) 0.014(0.008, 0.025)
FA  +  L-PET 19.68(16.15, 23.03) 0.015(0.010, 0.026)
MD  +  L-PET 19.14(15.90, 22.58) 0.016(0.011, 0.029)
T1  +  FA  +  MD  +  L-PET 19.98(16.92, 23.24) 0.013(0.007, 0.022)

From table 2, it can be observed that the T1  +  FA  +  MD  +  L-PET scheme achieves the highest performance (i.e. the highest PSNR and the lowest NMSE). The average PSNR and NMSE for each of the 56 ROIs over the eight subjects are further shown in figure 13.

Figure 13. The mean PSNR and NMSE within different ROIs. Blue asterisks highlight the ROIs with significant improvement using the T1  +  FA  +  MD  +  L-PET method, while red asterisks highlight those with significant improvement using the T1  +  L-PET method.

It can be seen that, for most ROIs, the methods combining L-PET and MR images (T1  +  L-PET, FA  +  L-PET, MD  +  L-PET and T1  +  FA  +  MD  +  L-PET) outperform the method using only L-PET. Among all the combinations, the T1  +  FA  +  MD  +  L-PET method achieves the best performance across all ROIs, with the highest PSNR and lowest NMSE. In particular, compared with using L-PET only, the T1  +  FA  +  MD  +  L-PET method achieves the highest increase of PSNR (1.6485) on the right precentral gyrus (PrG) ROI, and significant improvements on 24 other ROIs (with PSNR increases larger than 1), denoted by blue asterisks in figure 13. For T1  +  L-PET, the top increase of PSNR (1.3446) is also achieved on the right precentral gyrus ROI, but significant improvements are obtained on only 4 ROIs (labeled with red asterisks). On the other hand, FA  +  L-PET and MD  +  L-PET both have their top PSNR increases on the angular gyrus (AnG) ROI, but with increases of only 0.7495 and 0.7435, respectively. The corresponding comparisons of NMSE, also shown in figure 13, support the same conclusion.

4.2.3. Influence of the patch selection based dictionary construction.

Since the neighborhood size was set to 15  ×  15  ×  15, there would be 23 625 atoms in the dictionary. To show the advantage of the patch selection based dictionary construction, the 800 patches most similar to each given local patch of the testing subject were chosen for dictionary construction. We ran our method twice, once with the patch selection process and once without it, with the results given in figure 14.

Figure 14. Influence of the patch selection based dictionary construction on the S-PET image prediction.

We can see that the patch selection based dictionary construction not only enhances the prediction quality (i.e. higher PSNR and lower NMSE), but also greatly reduces the computational time (requiring only about two thirds of the original computational time).

4.3. Effectiveness of the incremental refinement strategy

In this section, we evaluate the performance of using the incremental refinement procedure as discussed in section 3.2. The number of layers for the refinement was set to 3. We used both the multimodal MR and L-PET images for the following experiments. Figure 15 shows the predicted S-PET images of each layer, for a selected subject.

Figure 15. An example showing the influence of the incremental refinement strategy.

We can see that the quality of the predicted S-PET image is gradually improved with the increased number of layers. The detailed quantitative results on all testing subjects are given in figure 16.

Figure 16. Influence of using the incremental refinement strategy on the S-PET image prediction.

Figure 16 reveals that the performance improves gradually as the number of layers increases, indicating that the incremental refinement procedure can further improve the image quality of the predicted S-PET image. However, the computational time is proportional to the number of layers. Thus, the number of layers should not be excessive, in order to retain computational efficiency.

4.4. Performance on abnormal subjects

The above experiments evaluated the proposed method on normal subjects. However, data with abnormal uptake are often encountered in real-world applications, since a number of factors can affect tracer uptake (e.g. subjects with brain atrophy may show abnormal tracer uptake in particular regions). Thus, the robustness of a PET image prediction method to abnormal uptake is an important consideration in practice. To evaluate the proposed method on clinical data with abnormal uptake, we further acquired data from 8 subjects diagnosed with mild cognitive impairment (MCI). Table 3 summarizes the detailed demographic information of these 8 MCI subjects. Combined with the 8 normal subjects, this gives 16 subjects for evaluating the performance of the proposed method on clinical abnormal data.

Table 3. Demographic information of MCI subjects.

Subject ID Age Gender Weight (kg)
9 65 Male 68.04
10 86 Female 68.95
11 86 Male 74.84
12 66 Female 58.97
13 61 Male 83.91
14 81 Male 1.6.59
15 70 Female 61.23
16 72 Female 77.11

For diagnostic purposes in clinical practice, specific ROI-based measures are of more interest than the holistic image. Clinically, the hippocampus is an important region for the diagnosis of MCI patients (Zhou et al 2014). In this paper, contrast recovery (CR), which has been used to evaluate the quality of specific ROIs in PET images (Oehmigen et al 2014), is employed to assess the contrast around the hippocampal boundary. The CR value is computed as

$\text{CR}=\frac{\text{mean}\left(\text{ROI}\right)}{\text{mean}\left(\text{Background}\right)}$   (8)

where $\text{mean}\left(\text{ROI}\right)$ represents the mean value of the ROI, and $\text{mean}\left(\text{Background}\right)$ represents the mean value of the background. In this paper, the hippocampal regions are used as the ROIs, as these regions are particularly related to MCI, while the neighboring ROIs near the hippocampal regions are chosen as the background. The comparisons of the average CR bias across MCI subjects for different methods are shown in figure 17.
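A small sketch of this evaluation, assuming the ratio form of equation (8) given above and that binary masks for the hippocampal ROI and its neighboring background regions are provided (names are illustrative):

```python
import numpy as np

def contrast_recovery(img, roi_mask, bg_mask):
    """Equation (8): mean intensity in the ROI over mean intensity in the background."""
    return img[roi_mask].mean() / img[bg_mask].mean()

def cr_bias(pred, gt, roi_mask, bg_mask):
    """Absolute deviation of the predicted image's CR from the ground-truth CR."""
    return abs(contrast_recovery(pred, roi_mask, bg_mask)
               - contrast_recovery(gt, roi_mask, bg_mask))
```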

It can be observed that the CR bias in the L-PET images is quite large. This suggests that, with dose reduction, the reconstructed L-PET images are quite different from the desired S-PET images. Prediction by the patch-based SR achieves far less bias for the hippocampal ROI. For the m-SR method, the bias is further reduced, owing to the minimization of the feature distribution differences between the MR/L-PET and S-PET images. Compared to the results of the single-layer m-SR, the proposed method with the incremental refinement strategy (multi-layer model) yields the lowest CR bias. These experimental results demonstrate that the estimation of PET signals in the hippocampal region by our proposed method is more accurate, indicating that our proposed method can handle clinical data from abnormal subjects.

Figure 17. Average bias of contrast recovery (CR) across MCI subjects.

5. Conclusion

We have presented a novel sparse representation method for predicting the S-PET image from the multimodal MR and L-PET images, with two new strategies: a mapping strategy and an incremental refinement scheme. The first strategy aims to reduce the patch distribution differences between the MR/L-PET and S-PET images, while the second is used to gradually improve the quality of the prediction. Note that the main idea of the proposed method is fundamentally different from existing methods that aim to improve PET image quality by considering only the raw PET data and the corresponding MR image (Bai et al 2013). The proposed method learns the relationship between low-dose and standard-dose PET images and estimates a mapping function between them. This allows the proposed method to work even when the low-dose and standard-dose PET images have very different signal intensities, which cannot be handled by those PET reconstruction methods. Experimental results show that our method can predict high-quality PET images, suggesting its great potential for clinical diagnosis by reducing the radiation exposure of patients.

To the best of our knowledge, this is the first work to reliably predict the S-PET image from multimodal MR and L-PET images. In the proposed method, each training subject must have the complete set of data, i.e. the full set of multimodal MR images, the L-PET image, and the S-PET image. This excludes from training any subjects with missing MR or L-PET images, which often occur in real datasets. Therefore, our future work will focus on robustly predicting the S-PET image with different sets of available image modalities. On the other hand, it is worth noting that, in the experiments, we did not observe any artifacts arising from the estimation of standard-dose PET images from low-dose PET and MRI, nor is there any indication in the quantitative results of a systematic bias that might arise from a consistent artifact. However, the current study is limited to brain data, so at this stage we cannot comment on how our method might perform on regions other than the brain. All these aspects will be our future work. In addition, we are in the process of acquiring image data from more subjects, in order to rigorously evaluate our method on a larger-scale dataset in the future.

Acknowledgments

This work was supported in part by the National Institutes of Health grants MH100217, AG042599, MH070890, EB006733, EB008374, EB009634, NS055754, MH064065, HD053000, and STMSP 2014RZ0027.

Conflict of interest

The authors declare no conflict of interest.
