Introduction
Chronic obstructive pulmonary disease (COPD) is a heterogeneous disease, including emphysema and small and large airways disease [
1,
2]. COPD diagnosis [
3] is based on spirometric measures reflecting airflow obstruction, specifically a ratio of forced expiratory volume in 1 s (FEV1) to forced vital capacity (FVC) of less than 0.70 [4]. However, this definition does not account for the vast heterogeneity observed among COPD cases in the rate of disease progression [
5], response to treatment [
6‐
8], symptom burden [
9], inflammatory response [
10], and lung physiology [
11]. Therefore, there has been tremendous interest in identifying COPD subtypes that reflect differences in these disease aspects [
12,
13]. Well-characterized subtypes with readily assessable biomarkers would allow high-risk COPD populations to be selected for therapeutic intervention and patients to be stratified, leading to better-powered clinical trials. Molecular subtyping could also help to identify rare genetic variants and individuals at elevated risk of developing the disease [
14].
Disease subtyping has been relatively successful in asthma [
15], but efforts in COPD have proven more difficult. Previous attempts to subtype COPD have been limited by a lack of reproducibility and by constraints in study design. Another limitation is the difficulty of validating and interpreting subtypes that are based on clinical characteristics (e.g., spirometry, body mass index). Some studies have tried to circumvent this problem by withholding a pre-defined subset of clinical characteristics at the clustering step and then using those to assess the resulting clusters [16]; however, this raises the question of whether the holdout set is representative of the population. While it is possible to find distinct groups of subjects with respect to these clinical variables, such classifications are unlikely to identify novel disease mechanisms.
Incorporation of genomic information can greatly enhance the relevance of COPD subtypes. Peripheral blood gene expression is an attractive candidate for potential biomarkers because it is easily accessible. One previous study identified four COPD clusters based on blood gene expression with a non-negative matrix factorization approach [
17]. Encouragingly, subjects in these clusters differed in disease severity; however, because the study relied on microarray gene expression data, discovery was limited to the genes included on those platforms.
We recently developed a new method for evaluating gene network perturbations in single samples (single sample Network Perturbation Assessment, ssNPA) [
18]. ssNPA uses probabilistic graphs [
19‐
22] to estimate the gene network from a set of reference (control) samples and assesses perturbations in each individual disease sample. ssNPA outperformed existing algorithms in identifying subgroups of samples based on these gene expression perturbation features and had superior clustering performance compared to gene expression itself [
18] and other methods [
23,
24]. In this paper, we apply ssNPA to the Genetic Epidemiology of COPD Study (COPDGene) and the Multi-ethnic Study of Atherosclerosis (MESA) data in order to identify and validate new COPD phenotypes solely from gene expression measured in peripheral blood samples.
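The general idea described above (learn gene relationships from reference samples, then score how strongly each individual sample departs from them) can be illustrated with a minimal sketch. This is not the ssNPA implementation: it assumes, purely for illustration, that each gene's neighbors are its most correlated genes in the controls (a simple stand-in for a learned probabilistic graph) and that perturbation is measured as the squared error of a linear prediction. The function names, the `n_neighbors` parameter, and the synthetic data are all hypothetical.

```python
import numpy as np

def fit_reference_models(controls, n_neighbors=3):
    """Fit one linear model per gene using reference (control) samples only.

    ASSUMPTION: neighbors are the most correlated genes, a stand-in for
    the neighborhood that a learned probabilistic graph would provide.
    """
    n_genes = controls.shape[1]
    corr = np.abs(np.corrcoef(controls, rowvar=False))
    np.fill_diagonal(corr, -1.0)  # a gene is never its own neighbor
    models = []
    for g in range(n_genes):
        nbrs = np.argsort(corr[g])[-n_neighbors:]
        X = np.column_stack([np.ones(len(controls)), controls[:, nbrs]])
        coef, *_ = np.linalg.lstsq(X, controls[:, g], rcond=None)
        models.append((nbrs, coef))
    return models

def perturbation_features(samples, models):
    """Squared prediction error per gene and sample (samples x genes)."""
    feats = np.empty(samples.shape, dtype=float)
    for g, (nbrs, coef) in enumerate(models):
        X = np.column_stack([np.ones(len(samples)), samples[:, nbrs]])
        feats[:, g] = (samples[:, g] - X @ coef) ** 2
    return feats

# Synthetic data (hypothetical): four co-regulated genes in controls.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
controls = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(4)])

# In the "disease" samples, gene 0 no longer follows its neighbors.
base_c = rng.normal(size=(50, 1))
cases = np.hstack([base_c + 0.1 * rng.normal(size=(50, 1)) for _ in range(4)])
cases[:, 0] = rng.normal(size=50) + 3.0

models = fit_reference_models(controls)
feats = perturbation_features(cases, models)
print(feats.mean(axis=0))  # gene 0 shows the largest mean perturbation
```

The resulting samples-by-genes matrix of perturbation scores, rather than raw expression, is what a clustering step would then operate on. Note that in this sketch a perturbed gene also raises the scores of genes that use it as a predictor.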
Discussion
COPD subtyping is essential not only for understanding the diversity of molecular mechanisms of the disease, but also for aiding the development of new intervention strategies. Here we present a new clustering of former smokers with COPD based on PBMC gene expression. We restricted this work to former smokers to avoid bias, because current smoking status has a large impact on gene expression. This clustering was the result of a novel network deregulation-based approach (ssNPA), which has been shown to outperform many standard methods in sample clustering [
18]. We identify four COPD subtypes, which exhibit different degrees of symptom presentation, exercise capacity and mortality. Two of the clusters (cluster 0 and cluster 1) have similar (milder) impairment in spirometry, but show differences in DLCO, disease progression and mortality. The other two clusters (cluster 2 and cluster 3) have similar levels of lung function impairment, which is significantly worse than in clusters 0 and 1. Compared with subjects in cluster 3, those in cluster 2 have more symptoms, shorter 6-min walk distance, higher neutrophil counts and worse survival despite similar reductions in FEV1 percent predicted and FEV1/FVC. Cluster 3 subjects have the most emphysema, although the differences are not statistically significant.
We show that these clusters are stable by validating them using (1) additional COPDGene samples and (2) the MESA study cohort. To demonstrate the utility of our subtyping method for future patient classification, the samples from the two validation cohorts were assigned to one of these four clusters based on their own gene network deregulation vectors (instead of re-clustering these cohorts). We find that the clinical differences among clusters remain largely the same in the new sets of samples, which not only validates our findings but also demonstrates that new samples can be accurately assigned to these four clusters. Unsurprisingly, the distribution of subtypes in MESA is skewed toward cluster 0 (the mildest disease phenotype): MESA enrolled subjects representing the general population, whereas COPDGene is a case–control study of COPD, so this distribution of MESA samples is consistent with our expectations.
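Assigning validation samples to existing clusters, rather than re-clustering them, can be sketched as follows. The assignment rule is not specified in this section, so the example assumes a nearest-centroid rule with Euclidean distance on the deregulation vectors; the function names and the toy two-cluster data are hypothetical.

```python
import numpy as np

def cluster_centroids(features, labels):
    """Mean deregulation vector of each discovery cluster, ordered by label."""
    ids = np.unique(labels)
    return ids, np.vstack([features[labels == k].mean(axis=0) for k in ids])

def assign_to_clusters(new_features, ids, centroids):
    """ASSUMPTION: each new sample joins the cluster with the nearest centroid."""
    diffs = new_features[:, None, :] - centroids[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)  # shape: (new samples, clusters)
    return ids[np.argmin(dists, axis=1)]

# Toy discovery cohort: two clusters in a two-gene feature space.
discovery = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
labels = np.array([0, 0, 1, 1])
ids, cents = cluster_centroids(discovery, labels)

# Validation samples are scored against the fixed discovery centroids.
new = np.array([[0.1, 0.3], [4.8, 5.1]])
assigned = assign_to_clusters(new, ids, cents)
print(assigned)
```

Because the centroids are fixed from the discovery cohort, the cluster definitions cannot drift toward the validation data, which is what makes the comparison of clinical characteristics across cohorts a meaningful check.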
Previous results in COPD subtype identification have proven difficult to replicate. For example, the number of identified subtypes generally varies from 2 to 5, and women and participants with mild disease are generally underrepresented [
30]. One study applied a consistent clustering analysis to 10 independent cohorts and found only modest reproducibility across cohorts, but had more success with a continuous PCA-based projection of the individuals [
31]. The authors suggest that the disease is best represented as a COPD continuum instead of separate and mutually exclusive subtypes. However, this interpretation does not account for the suspected varied genetic basis of COPD and, without clear cut-off points along the continuum, the practical utility is limited. Another study also applied a network-based clustering approach to blood microarray data and identified four clusters [
17]. These clusters differed in spirometry and emphysema, but the network component in that study came from existing knowledge (the STRING database), which has its own biases and limitations.
Next, we investigate the underlying molecular changes and how they may be implicated in the mechanism of the disease. Several of the genes whose deregulation drives the clustering into subtypes have previously been noted as having a role in COPD. Desmoplakin (DSP, 6p24.3) was identified in a genome-wide association study (GWAS) of COPD as one of 22 genes containing a top coding variant (rs2076295) [32]. DSP is a desmosomal protein that plays an essential role in cell–cell linkages, especially in epidermis and cardiac muscle [
33,
34]. DSP variants have also been associated with idiopathic pulmonary fibrosis [
35], although these variants may be protective against COPD [
32]. This GWAS included 15,256 COPD cases and 47,936 controls. This locus also colocalized with an expression quantitative trait locus (eQTL) from another lung tissue dataset that included subjects with COPD [36]. In another study, the locus was associated with change in two quantitative measures of emphysema, percentage of low-attenuation area less than -950 Hounsfield units (%LAA-950) and adjusted lung density [37]. Recently, the variant was shown to regulate DSP expression in airway epithelial cells, and loss of DSP expression led to increased expression of extracellular matrix-related genes and cell migration [
38].
Another gene we identified, GSTM1 (glutathione S-transferase μ 1, 1p13.3), belongs to a family of enzymes that are relevant for lung disease, likely through their roles in detoxifying electrophilic compounds, such as those found in cigarette smoke and environmental toxins [39]. A homozygous GSTM1-null genotype has been associated with lung cancer pathogenesis [
40,
41], emphysema [
42,
43], and COPD susceptibility [
44,
45]. However, GSTM1 has not previously been identified by COPD GWAS, although this may be because the presumed functional variation is a gene deletion rather than a single nucleotide polymorphism that would be included on GWAS chips.
Even though clusters 0 and 1 have similar lung function, we identify a number of genes whose deregulation differs between them. For example, the deregulation of CTNNA2 and SLC44A5 is higher in cluster 0 than in cluster 1, and the deregulation of MRGPRE is lower in cluster 0 than in cluster 1. Similarly, the deregulation of several genes, including MUC16, ZMAT4, GSTM1, MRGPRE, and ADAM29, differs between clusters 2 and 3, which also have similar lung function. These observations indicate the presence of different underlying molecular mechanisms despite similar lung function.
The list of genes we have identified provides important insights into the molecular mechanisms of susceptibility, such as the role of environmental toxin processing, and of progression, including pathways involved in extracellular matrix organization. Several of the genes on the list, such as Fibroblast Growth Factor 9 (FGF9), have not been specifically cited for an association with COPD, but they code for important signaling proteins and may play a role in lung development or airway remodeling.
As this study is not meant to investigate the detailed molecular mechanisms of the four subtypes, we mention these genes as a proof of principle for our method. Future studies could investigate these molecular mechanisms based on our results.
Conclusions
Using the ssNPA method on blood gene expression data, we identify and validate four clusters of former smokers with COPD, which correspond to clinically relevant disease subtypes, reflecting differences in severity, symptoms and mortality. These differences are not fully reflected by lung function impairment alone. Furthermore, the focus on differential regulation at the gene level provides insight into the disease mechanisms that differentiate COPD cases from the control group of subjects without COPD. We identify a set of genes whose deregulation drives the subtype separation. Several of these genes have previously described connections to COPD, although some new genes emerged as well. The network learning and gene selection were completely unbiased, using no prior knowledge of clinical characteristics, disease mechanisms or biological pathways. Finally, we show that ssNPA is a flexible general framework for disease subtyping. As more omics data become available through COPDGene and other studies, future work could incorporate genetic variant, epigenetic, proteomic, or metabolomic variables into the network learning and feature calculations, which would provide a multi-layered, more complete picture of the molecular pathology and heterogeneity of COPD.