Background
The clinical course of idiopathic pulmonary fibrosis (IPF), a chronic and fatal fibrotic lung disease, is highly variable. With a median survival of about 3 years, it ranges from a slow, steady loss of lung function over 5 or more years to a rapidly progressive state and death within 1–3 years post-diagnosis. The typically slowly progressive course of IPF can be punctuated by intermittent episodes of precipitous decline in lung function termed acute exacerbations (AEIPF) [1, 2], which often lead to a new, worsened baseline of respiratory impairment. The mechanisms underlying AEIPF remain poorly understood [1, 3]. Further, the lack of a robust means of identifying biological heterogeneity and selecting patient cohorts at risk for outcomes of interest continues to limit the scope and design of interventional clinical studies in IPF [1].
The current approach to IPF diagnosis is limited to clinical assessment based on imaging and histological features. Substantial efforts, however, are underway to develop genomic signatures and blood-specific or lung-specific biomarkers [4]. Gene signatures derived from transcriptomic studies have been reported to differentiate IPF patients from patients with other interstitial lung diseases [5, 6] and from healthy controls [7, 8]. Comparison of gene signatures of healthy controls with ungrouped IPF patients revealed extensive genetic heterogeneity in the disease samples, and differential gene expression profiles in IPF subgroups have been reported in several studies [1, 9, 10]. This calls for the development of computational approaches to resolve heterogeneity and identify IPF-specific transcriptomes that may help to predict disease progression. We therefore reasoned that unsupervised machine learning approaches could be applied prior to differential gene expression analysis to facilitate recognition of potential IPF subgroups with novel gene signatures that have predictive or prognostic value.
We postulated that data-driven and knowledge-based approaches using gene expression profiling of a large set of IPF/UIP cases would allow us both to identify novel patient subgroups with shared molecular characteristics and to reveal novel candidate genes. Using transcriptional profiles of lung tissue from 131 patients with IPF/UIP and 12 non-diseased controls, we identified six subtypes of IPF that reflect disease severity. We further identified molecular signatures capable of differentiating (a) IPF from normal controls and (b) severe from mild IPF. These signatures were subsequently validated in three independent cohorts of IPF/UIP. Finally, using knowledge-based approaches, we identified several novel candidate genes and potential biomarkers for IPF.
Discussion
In this study of patients with UIP/IPF, we stratified subgroups based on lung function measures and applied unsupervised analysis to gene expression data. Genes enriched in cilium or lung alveolar morphology were expressed at different levels in two distinct transcriptomic profiles from patient clusters with moderate disease (cluster average DLCO: 40–60%), but not severe disease (cluster average DLCO: <40%). Comparison of differentially expressed genes (DEG) from each patient cluster revealed additional gene signatures that robustly differentiated IPF from normal lung, and advanced IPF from usual IPF. Finally, using knowledge-based approaches, we identified several novel gene candidates and potential BALF biomarkers for IPF.
What distinguishes the current study is that we used unsupervised, data-driven approaches to discover potential subgroups within IPF patient samples prior to extracting IPF-specific gene signatures, which allowed us to identify genes commonly involved in IPF as well as genes associated only with advanced IPF. In contrast, gene signatures in previous studies were all derived by comparing pooled IPF samples with healthy controls [5–9]. As a result, we identified an additional 1981 DEG beyond those discoverable without a clustering step prior to differential analysis, and 382 of the 392 advanced IPF genes were among these additional genes.
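To make this rationale concrete, the following minimal Python sketch shows, on a toy data set, why clustering patients before differential analysis can expose genes that a pooled comparison misses. It is not the pipeline used in this study: the expression values, the gene names (`gene_core`, `gene_adv`, `gene_flat`), the two-cluster k-means helper `kmeans2`, the use of Welch's t-statistic, and the cutoff `T_CUTOFF` are all hypothetical choices made for illustration only.

```python
from math import sqrt
from statistics import mean, variance

GENES = ["gene_core", "gene_adv", "gene_flat"]  # hypothetical gene names
T_CUTOFF = 4.0  # assumed, arbitrary |t| threshold for calling a gene a DEG

# Toy expression values (rows = samples, columns = GENES); not real data.
controls = [[5.0, 5.1, 3.0], [5.2, 4.9, 3.1], [4.8, 5.0, 2.9],
            [5.1, 5.2, 3.2], [4.9, 4.8, 2.8], [5.0, 5.0, 3.0]]
patients = [[7.0, 5.0, 3.0], [7.1, 5.1, 3.1], [6.9, 4.9, 2.9], [7.2, 5.2, 3.0],
            [6.8, 4.8, 3.2], [7.0, 5.0, 2.8], [7.1, 4.9, 3.0], [6.9, 5.1, 3.0],
            [7.0, 8.0, 3.1], [7.2, 8.2, 2.9], [6.8, 7.8, 3.0], [7.0, 8.0, 3.0]]


def sq_dist(u, v):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans2(rows, n_iter=10):
    """Tiny 2-means; starts from the first sample and the sample farthest from it."""
    centers = [rows[0], max(rows, key=lambda r: sq_dist(r, rows[0]))]
    labels = [0] * len(rows)
    for _ in range(n_iter):
        labels = [0 if sq_dist(r, centers[0]) <= sq_dist(r, centers[1]) else 1
                  for r in rows]
        for k in (0, 1):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:
                centers[k] = [mean(col) for col in zip(*members)]
    return labels


def welch_t(a, b):
    """Welch's t-statistic for one gene measured in two groups."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def deg_vs_controls(group):
    """Genes whose |Welch t| against controls exceeds the assumed cutoff."""
    found = set()
    for j, gene in enumerate(GENES):
        t = welch_t([row[j] for row in group], [row[j] for row in controls])
        if abs(t) > T_CUTOFF:
            found.add(gene)
    return found


# Pooled comparison: all patients versus controls.
pooled_deg = deg_vs_controls(patients)

# Cluster first, then compare each patient cluster with controls.
labels = kmeans2(patients)
cluster_deg = {k: deg_vs_controls([p for p, lab in zip(patients, labels) if lab == k])
               for k in (0, 1)}
additional_deg = set.union(*cluster_deg.values()) - pooled_deg
shared_deg = set.intersection(*cluster_deg.values())

print("pooled DEG:", sorted(pooled_deg))
print("additional DEG after clustering:", sorted(additional_deg))
print("DEG shared by all clusters:", sorted(shared_deg))
```

In this toy example, the gene elevated only in the smaller patient cluster is diluted in the pooled comparison and is recovered once that cluster is compared with controls on its own, whereas the gene elevated in all patients is detected either way.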
Our results indicate that gene expression profiles from IPF patients are heterogeneous. Grouping patients according to lung function measures such as FVC and DLCO reduced this heterogeneity and allowed discovery of more DEG. However, different gene expression profiles could still be found within lung function groups defined by FVC or DLCO measurements, and several genes were expressed at similar levels in different patient groups. Gene expression heterogeneity not yet explainable by lung function measures suggests activation of disease-driving pathways that could be informative in efforts to improve therapeutic response and outcome. On the other hand, genes expressed at comparable levels across patient subgroups of different severity suggest potential involvement of these genes and linked biological processes in distinct stages of IPF. In this study, we validated these clusters by cross-referencing them with clinical data to avoid generating clusters with little clinical relevance. Clustering patients based on gene expression prior to differential analysis may also help to circumvent some of the limitations we encountered.
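As an illustration of cross-referencing clusters with clinical data, the short Python sketch below assigns each expression-derived cluster to a severity band from its average DLCO. The 40% and 60% thresholds follow the cluster-average DLCO bands quoted earlier in this Discussion; the cluster names, the DLCO values, the helper `severity_band`, the handling of values exactly at the boundaries, and the label for averages above 60% are assumptions made for illustration and are not data from this study.

```python
from statistics import mean


def severity_band(avg_dlco):
    """Map a cluster-average DLCO (% predicted) to a severity band."""
    if avg_dlco < 40:
        return "severe"
    if avg_dlco <= 60:
        return "moderate"
    return "above 60%"  # assumption: this band is not named in the text


# Hypothetical DLCO values (% predicted) per expression-derived cluster.
dlco_by_cluster = {
    "cluster_a": [55.0, 48.0, 52.0],
    "cluster_b": [35.0, 30.0, 38.0],
    "cluster_c": [45.0, 58.0, 50.0],
}

bands = {name: severity_band(mean(values))
         for name, values in dlco_by_cluster.items()}

for name, band in bands.items():
    print(name, round(mean(dlco_by_cluster[name]), 1), band)
```

Two clusters falling in the same band, as `cluster_a` and `cluster_c` do here, correspond to the situation in which distinct transcriptomic profiles are not separated by lung function alone.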
A recent study reported that cilium-associated genes were associated with more extensive microscopic honeycombing in IPF patients, although no difference in lung function measures was found between patient groups defined by these genes [10]. Our results are consistent with these data in that we found that cilium-associated genes are most highly expressed in patient cluster C5 with more severe IPF. These genes include MUC5B and DSP, which are known to be involved in IPF [10, 21], matrix metalloproteinases that are implicated in IPF such as MMP1, MMP3 and MMP7 [22], and collagens involved in ECM organization. However, cilium-associated genes were also highly expressed, although to a lesser extent, in the less severe patient clusters C1 and C3. More importantly, patient cluster C4, with low cilium-associated gene expression, had DLCO values comparable to those of C3, suggesting potential additional driver genes underlying IPF severity.
Our analyses revealed novel IPF-associated genes and biomarkers. Among the 55 prioritized genes, 22 were previously shown to be dysregulated in IPF or involved in IPF pathogenesis. For example, candidate genes including CTHRC1, CTSE, GREM1, NELL1 and PLA2G2A in the core set, and AREG, FST, LOX, THBS1 and SELE in the advanced set, were also found to be up-regulated in IPF animal models or in human IPF patients [7, 23–27]. In addition, ACE2, SFRP2 and WNT10A are known to be associated with fibrosis in IPF animal models and with survival in IPF patients [28–30]. The presence of these genes in the candidate list supports the validity and robustness of our prioritization, although further studies are needed to validate the remaining novel candidate genes identified. In addition to novel candidate IPF genes, we also identified putative BALF biomarkers that can potentially differentiate IPF patients from healthy normal volunteers. The high consistency of the expression of these biomarker genes with their corresponding protein expression in BALF [20] suggests that classifiers built on them could achieve predictive power comparable to that observed in our study. Thus, our biomarker list may inform future efforts to identify diagnostic, predictive and prognostic biomarkers in BALF that could obviate the need for more invasive diagnostic maneuvers and be used in decision making for IPF care.