Background
Idiopathic pulmonary fibrosis (IPF) is a fibrotic interstitial lung disease characterized by irreversible scarring of the lung parenchyma that predominantly affects older adults. While older retrospective studies suggested median survival was 2–3 years [
1‐
3], IPF has a highly heterogeneous disease course, making prognostication difficult [
4,
5]. While lung transplantation remains the sole intervention to prolong survival in patients with IPF [
6], organ scarcity, and ineligibility secondary to comorbid health conditions, make this available to only a few. Pirfenidone [
7] and nintedanib [
8] have emerged as promising therapies that slow disease progression. Several other medications are currently under investigation. Without the ability to predict disease course, it is difficult to identify which IPF patients are most likely to benefit from these new therapies or from lung transplantation.
Many clinical parameters, including race, gender, age, radiographic and/or histopathologic patterns, and pulmonary function tests have been linked to prognosis in patients with IPF [
9,
10]. Lung tissue-based molecular genomic signatures [
11,
12] have also been used to predict IPF progression; however, given the resources needed to perform lung biopsy and risk associated with the procedure, the applicability of such genomic signatures is limited. Peripheral blood mononuclear cells (PBMC), comprised of circulating monocytes, T-cells, B-cells, and natural killer cells, have been successfully used as an alternative for exploratory transcriptional profiling studies [
13‐
15]. Advantages of using PBMC over lung biopsy specimens to delineate molecular mechanisms of IPF include easier access, larger quantities, and the ability to dynamically assess disease status through longitudinal sample collection.
Using PBMC gene expression profiling, our group previously identified a genomic signature consisting of 52 genes that predicted survival in patients with IPF [
16]. While this investigation drew attention to the potential role of T cell signaling in IPF progression, the contribution of other genes identified in the study were not addressed. Furthermore, the gene set identified from our previous study did not provide a weighted score for the gene expression pattern, which has the potential to be useful in practical application. We therefore aim to construct a functional genomic model to better predict prognosis of IPF patients. To do so, we compiled a set of IPF prognostic predictor genes from previously reported microarray data in the training cohort (accession number GSE28221) [
17,
18]. First, we coupled PBMC gene expression profiling to IPF clinical traits using an unbiased “Weighted Gene Co-expression Network Analysis (WGCNA)” approach which is useful for describing the pairwise correlated expression among gene transcripts with co-regulation implications [
19‐
21] and to restrict the search space of genes to those genes in modules associated with pulmonary function. Second, we performed a supervised “Significance Analysis of Microarray (SAM)” approach to identify differentially expressed genes between observed “good” vs. “poor” prognosis IPF patients. Third, we identified genes based on their association with survivorship. The IPF prognostic predictor gene set satisfying all aforementioned three functional genomic criteria was used to construct a genomic prediction model and derived a prognostic index (PI) score for each patient in the training cohort. We then assessed the prognostic prediction specificity in the training cohort and further validated it in two independent cohorts. This work produced a functional genomic model with a mechanism-anchored IPF prognostication score for each patient, which may better identify those most likely to benefit from IPF-specific therapy and provide a tool for personalized IPF management.
Methods
Study populations
Study populations were collected, as previously described, from the University of Chicago Medical Center (UCMC ) and University of Pittsburg Medical Center (UPMC) [
16]. The training cohort consisted of 45 individuals with IPF collected from November 2007 to July 2009 at UCMC. The University of Chicago validation cohort (UCV) consisted of 21 individuals with IPF along with 10 healthy control subjects without lung disease collected from February 2007 to October 2007. The University of Pittsburg validation cohort (UPV) consisted of 75 individuals with IPF collected from March 2001 to September 2010. While the site source of the samples overlapped between cohorts, all samples were independent of each other. All patients with IPF met American Thoracic Society/European Respiratory Society (ATS/ERS) diagnosis criteria [
2]. The local Institutional Review Boards at the University of Chicago and University of Pittsburg Medical Center approved the study and informed consent was provided by all study subjects.
Demographic information, clinical characteristics, and pulmonary function tests were collected from all patients with IPF. Spirometry testing, including forced vital capacity percent predicted (FVC% predicted), diffusion capacity for carbon monoxide percent predicted (D
LCO % predicted) as well as lung volumes by plethysmography were obtained per ATS guidelines [
19‐
21]. The composite physiologic index (CPI) was calculated as described by Wells et al. [
22]. Survivorship was obtained from medical records, telephone interviews, and the social security death index database. The prognosis of IPF subjects was dichotomously categorized as good or poor based on observed survival over 3 years of follow-up.
PBMC sample collection, RNA isolation, microarray hybridization, and data processing
See details in Additional file
1. Microarray experiments were compliant with MIAME (Minimum Information About a Microarray Experiment) guidelines. The complete data sets are available in the Gene Expression Omnibus database (
http://www.ncbi.nlm.nih.gov/geo/) under accession number GSE28221.
Identification of gene co-expression modules correlated with clinical traits in training cohort
Normalized microarray data were filtered to remove redundant genes and genes with minimum variation (i.e. coefficient of variation <0.3 across all samples). Genes that passed filtering criteria were clustered into gene modules, based on their co-expression pattern, using an unsupervised “Weighted gene co-expression network analysis (WGCNA)” package in R 2.13 [
23]. Principal Component Analysis (PCA) was used to calculate an eigengene for each gene module. Pearson’s correlation was used to determine the significance of correlation (
p < 0.05) between the eigengenes of individual gene modules with each clinical parameter including race, sex, age, FVC % predicted, D
LCO % predicted, and CPI.
Identification of differentially expressed genes in the training cohort
Significant Analysis of Microarray (SAM) software [
24] was used to identify differentially expressed genes between observed good vs. poor IPF prognosis using criteria of fold change (FC) >1.5 and false discovery rate (FDR) < 2 %.
Survival analysis
Survival analysis was performed using unadjusted log rank testing along with univariate and/or multivariate Cox regression analysis. After checking to ensure that the proportional hazard assumption was met with each Cox model, subdistributional mortality hazards were determined for covariates by treating lung transplantation as a competing event, as previously described by Fine and Gray [
25]. Survival time was defined as time from blood draw to death, transplant, loss-to-follow-up or study conclusion. Patients who were lost to follow-up were censored at that time in survival modeling. Survival between groups was plotted using the Kaplan-Meier estimator.
Compilation of the IPF prognostic predictor gene set from the training cohort for a genomic model construction
To construct a functional genomic model predictive of IPF prognosis, genomic, clinical, and outcome data from the training cohort were analyzed to identify a set of genes with individual prognostic significance. Genes were selected for the “IPF prognostic predictor gene set” if they met all of the following criteria: 1) genes in specific gene co-expression modules that correlated with pulmonary function (p < 0.05) in WGCNA, 2) genes differentially expressed (FC > 1.5 and FDR < 2 %) between observed good vs. poor prognosis by SAM, and 3) genes predictive of mortality (p < 0.05) in univariate Cox regression analysis.
Development and validation of the functional genomic model to predict prognosis
The set of IPF prognostic predictor genes identified was used to construct a genomic model using “Survival risk group prediction” implemented in BRB-ArrayTools 4.2 [
26] to predict prognosis in IPF patients. The output of the genomic model is a patient-specific “prognostic index (PI)” score. PI of each patient in training cohort was derived from formula, ∑W
i
* X
i
+ 13.5, where
W
i
and
X
i
represent the weight (computed by supervised PCA) and log-intensity of the
i-th gene in the gene set. To assign a patient to either a high- or low-risk group, each patient’s PI was compared to a predetermined classification threshold. For this study, the threshold was set at the upper tertile in the training cohort according to clinical observation [
3]. A “10-fold Cross-Validation (CV)” algorithm was used to assess the classification specificity. Briefly, 10 % of patients were randomly omitted leaving the remaining 90 % of patients to construct the genomic model and derive a PI for each of the omitted samples. The PI of omitted individuals was then ranked relative to the PI of patients included in the CV model. Finally, we determined the predicted risk category based on the percentile ranking, the number of risk groups specified (i.e.
n = 2 in current study), and the empirical risk percentile setting (i.e. low/high risk = 66.7/33.3). Misclassification rate was determined by the discrepancy between the predicted low or high-risk category with the observed good or poor prognosis according to follow-up. Receiver-Operating-Characteristic (ROC) analysis with area under curve (AUC) calculation was performed to assess how well the PI distinguished IPF patients with low vs. high-risk prognosis. To perform an independent validation of the predictor, we applied the PI weights computed from the training set of 45 IPF samples to the calculation of the PI on the UCV and UPV cohorts. Details can be found in Additional file
1.
Functional pathways enrichment analysis
Significant biological processes in Gene Ontology associated with the set of IPF prognostic predictor genes were identified using R package “GOSim” [
27] with the criterion of
q-value (Benjamini-Yekutieli adjusted
p-value) <0.01. Significant canonical pathways or gene interaction networks were analyzed using Ingenuity Pathway Analysis (IPA) software (Ingenuity Systems, Redwood City, CA) with the criterion of the right-tailed (referring to the overrepresented pathway) Fisher’s exact test
q-value (Benjamini-Hochberg adjusted
p-value) <0.05.
IPF diagnosis prediction using prognosis index derived from the functional genomic model
Using the generated PI, ROC analysis with AUC calculation was performed in UCV cohort to assess how well the PI distinguishes IPF patients from healthy controls. The true positive rate (sensitivity) is plotted in function of the false positive rate (1-specificity) for different cut-off points. Each point on the ROC curve represents a sensitivity/false alarm pair corresponding to a particular decision threshold.
Statistical analysis
Continuous variables are reported as a mean (± standard deviation) and compared using a one-way analysis of variance. Categorical variables are reported as counts and percentages and compared using a chi-square or Fischer’s exact test, as appropriate. Pearson’s correlation was used to evaluate the correlation of prognostic index (PI) derived from genomic model with clinical parameters. ROC analysis with AUC calculation was performed using R package “caTools”. Other than when indicated above, statistical analysis was conducted using STATA 12 (StataCorp. 2011. College Station, TX).
Discussions
In this study, we constructed a functional genomic model that predicted survival in three independent cohorts of IPF patients. In the training cohort, we analyzed genomic data using both unsupervised WGCNA and supervised SAM approaches. By applying WGCNA algorithm, we first associated the pathophysiological alterations in the transcriptome level to the clinic traits of IPF and found 55 % of the genes clustered into the turquoise, black and red modules which were significantly correlated with pulmonary function. In a parallel analysis, 95 % of the differentially expressed genes between IPF patients with good and poor prognosis identified by SAM were attributed to these three pulmonary function associated gene modules. This analytical pipeline highlights the potential applicability of an unsupervised correlation network approach, whereby functional characterization of correlated gene modules provides insight into the molecular mechanisms underlying a clinical trait of a complex pulmonary disease. Lastly, we correlated gene expression levels with survival, which contributed another important feature of IPF. We defined genes met all three selection criteria as “IPF prognostic predictor genes”.
Pathway analysis of the IPF prognostic predictor genes revealed several canonical pathways including T-cell receptor signaling pathway in turquoise module (
q = 0.0087); hemoglobin metabolic process and oxygen transport in black module (
q = 0.0031 and 0.0032, respectively); and defense response to bacterium and neutrophil degranulation in red module (
q = 0.000 and 0.0041, respectively) (Additional file
1: Table S3). The enriched T-cell biology, including iCOS signaling in T-helper cells, CD28 signaling, and T-helper cell differentiation is supported by our prior work, with similar analyses of a smaller gene set demonstrating that decreased expression of CD28, ICOS, LCK, and ITK predicted mortality in patients with IPF [
16]. Impaired regulatory T-cells from bronchoalveolar lavage fluid have been strongly correlated with pulmonary dysfunction of IPF patients [
28]. Down-regulation of CD28 on circulating CD4 T-cells has been associated with poor outcomes in IPF patients [
29]. IL-17A, a cytokine produced by CD4
+ and gamma-delta
+ T cells, has been shown to play a critical role in inducing fibrosis in a mouse model [
30]. Although the role of the immune system in IPF remains unclear, a large multicenter study has shown that IPF patients treated with prednisone and azathioprine had an increased risk of death and hospitalization compared to those receiving placebo [
31]. It remains unknown whether a down-regulated immune system is causally involved in IPF pathogenesis, or is the result of primary lung injury. In addition, a down-regulated immune system could result in a reduced T-cell population [
32,
33]. Nevertheless, the down-regulated T-cell pathways or reduced T-cell population can both lead to impaired immune function. These studies are congruent with the functional profile of our IPF genomic model suggesting that suppression of the immune system with medications such as prednisone and azathioprine may worsen the clinical course for IPF patients whose immune systems are already down-regulated.
By evaluating the performance of the genomic model in two independent validation cohorts with different microarray platforms performed at different medical centers, we demonstrate the potential applicability of our findings for real-world use. Notably, the prognostic index (PI) derived from the genomic model showed consistent prognostic prediction specificity in each validation cohort and produced similar mortality hazard estimation across all three cohorts. Genomic model constructed using IPF prognostic predictor genes also displayed concordant fold changes between patients with predicted low- and high-risk prognosis in training and validation cohorts. Furthermore, the PI was able to discriminate between IPF and healthy controls with great accuracy, suggesting a future potential screening tool. However, it is unclear whether the PI can distinguish IPF patients from patients with other pulmonary fibrotic diseases such as nonspecific interstitial pneumonia (NSIP), hypersensitivity pneumonitis (HP), and respiratory bronchiolitis-associated interstitial lung disease (RB-ILD) etc. This question can be addressed in future studies.
While the genomic model developed in this study has been successfully validated in two independent (UCV and UPV) cohorts, certain technical issues and potential clinical confounders require further study. First, there were demographic differences in gender and race between the training and UPV cohorts. The training population was strongly biased towards male patients, while the UPV population was more balanced with respect to gender. There was a greater prevalence of Caucasians in the UPV cohort. Interestingly, the PI scores were higher and indicative of poorer outcome in women overall, while being primarily derived from a male cohort. While IPF is more common in men than women [
31], women appear to have improved survival [
34,
35]. Although the reasons for this clinical observation are unclear, our results indicate that there may be differential gene expressions between male and female patients with IPF that underlie this observation.
In addition, the different microarray platforms used in different cohort studies might affect the prediction specificity of the model. Notably, the correlation of the PI with pulmonary function in UPV cohort is less strong compared to that in training and in UCV cohorts. We speculated that this observation may be partially attributed to the loss of the 10 classifiers when mapped from Affymetrix to Agilent microarray platform. Another potential confounding factor is the higher rate of lung transplant in the UPV cohort (20 %) compared to the UCV cohorts (7 %). We attempted to adjust for this in our survival analysis by treating transplant as a competing event. Validation with a larger prospective cohort would be beneficial. Finally, the PI cannot be standardized across different microarray platforms, because the gene expression levels in microarray assay were measured by arbitrary fluorescent intensities rather than transcript copy numbers. Therefore, an absolute cut-off or carry-over of PI across individual studies is not feasible, and clinical elaboration of the hazard ratio of PI is impractical at this stage. Future approaches with direct assessment of these IPF prognostic predictor genes could overcome this issue.
Acknowledgments
The authors are grateful to the individuals who participated in this study. The study was supported by the National Institutes of Health, National Heart, Lung, and Blood Institute [Grants HL080513, 1RC1HL099619-01, 1RC2HL1011740]; Pulmonary Fibrosis Foundation (Chicago, IL); and Coalition for Pulmonary Fibrosis (San Jose, CA).
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
All authors contributed to the conception and design of the study; YH and JO performed the statistical analysis; YH drafted the manuscript; SFM, RV, JO, YAL, JGN, IN contributed to the writing of the manuscript; YH, SFM, RV contributed to this work equally; SMB, MSW collected biological samples; SFM supervised the sample collection; DKH, IN, NKS, MES, RV, SRW collected phenotypic data which were further evaluated by IN and RV; All authors read and approved the final manuscript.