Introduction
Parkinson’s disease (PD) is the second most prevalent neurodegenerative disorder, affecting ~ 1.8% of the population above 65 years [
45]. PD is a complex disorder caused by a combination of genetic and environmental factors, but the molecular mechanisms underlying its etiology remain largely unaccounted for. Genome-wide transcriptomic studies can identify expression signatures associated with PD. While not able to establish causality, these studies hold the potential to highlight important biological mechanisms, some of which may be exploited as targets for therapeutic modulation.
A recent systematic review identified 33 original genome-wide transcriptomic studies in the PD brain, of which 5 were performed on laser microdissected neurons from the
substantia nigra pars compacta (SNc) and the remaining in bulk tissue from various brain regions [
8]. These studies show surprisingly low replicability at the level of individual genes, however, and only partial concordance for pathways. The most consistent alterations have been found in pathways related to energy metabolism/mitochondrial function and protein degradation, followed by synaptic transmission, vesicle trafficking, lysosome/autophagy and neuroinflammation [
8]. While these processes commonly show differential expression signatures in PD, it remains unknown whether this is because they truly reflect the biology of PD or due to systematic bias and confounding factors. Two major sources of bias for transcriptomic studies in the human brain are the post-mortem degradation of RNA and the highly heterogeneous cell type composition of bulk tissue samples.
RNA degradation of variable extent occurs in post-mortem tissue. To further complicate the picture, it has been shown that different cell types exhibit different degrees of susceptibility to RNA degradation [
32], potentially confounding differences in cellular composition with differences in RNA quality. Access to high-quality brain tissue is generally limited, and thus an optimal choice of experimental platforms becomes paramount to maximize sensitivity. While RNA microarrays are being gradually superseded by RNA-seq technology, only 3 out of the 33 studies identified by an up-to-date review [
8] used RNA-seq, and all of them employed poly(A) capture, a widely used protocol (in both RNA-seq and microarray analyses) to restrict the analysis to mature mRNA [
20,
30,
46]. However, this library preparation method only picks up RNA fragments with a poly-A tail, introducing substantial bias in low quality RNA samples [
1,
25,
47,
56]. A well-established approach to mitigate this limitation is whole RNA-seq following active ribosomal RNA (rRNA) depletion and capture by random primers, such as the Illumina Ribo-Zero technique [
31]. To our knowledge there are no genome-wide transcriptomic studies on PD brain employing active rRNA depletion methods to date.
Systematic differences in sample cell composition represent another important confounding factor. These typically originate from two sources: biological differences (e.g. secondary to neurodegeneration) and technical variation in sample dissection and preparation. Brain areas affected by neurodegeneration are characterized by neuronal loss and gliosis, resulting in a systematically increased glia-to-neurons ratio in patients. This confounder is strongest in areas with severe changes, such as the SNc, but is also present to a variable degree in less affected areas, such as the neocortex. In addition, technical sources of variation due to sampling may affect any brain region and cause an uneven distribution of gray and white matter, resulting in a variable fraction of oligodendrocytes. Thus, transcriptional signatures associated with PD in bulk brain tissue may reflect changes in cellular composition rather than disease-specific transcriptional modulation. This observation has already been put forward using neurodegenerative mouse models and re-analysis of human brain transcriptomic data [
50]. Heterogeneous cell composition is, hence, a major confounder that needs to be considered and appropriately addressed in transcriptomic studies in bulk brain samples.
We report the first genome-wide transcriptomic study in the PD brain employing RNA-seq following rRNA depletion and random primer capture. We show that this approach is able to substantially mitigate the bias of post-mortem degradation, resulting in substantially better transcript coverage compared to poly(A) capture. Moreover, by estimating the relative cell type proportion in our samples, we confirm that cellular composition is a major source of variation in bulk tissue data, confounding the differential gene expression profile even in the less affected prefrontal cortex. By incorporating the estimated cell type proportions into our analysis models, we were able to unveil transcriptomic signatures which are more likely to be associated with the underlying disease mechanisms.
Material and methods
Subject cohorts
All experiments were conducted in fresh-frozen prefrontal cortex (Brodmann area 9) from a total of 49 individuals from two independent cohorts. The first cohort (
n = 29) comprised individuals with idiopathic PD (
n = 18) from the Park-West study (PW), a prospective population-based cohort which has been described in detail [
2] and neurologically healthy controls (Ctrl,
n = 11) from our brain bank for aging and neurodegeneration. Whole-exome sequencing had been performed on all patients [
24] and known/predicted pathogenic mutations in genes implicated in Mendelian PD and other monogenic neurological disorders had been excluded. None of the study participants had clinical signs of mitochondrial disease or use of medication known to influence mitochondrial function (Additional file
1). Controls had no known neurological disease and were matched for age and gender. The second cohort comprised samples from 21 individuals from the Netherlands Brain Bank (NBB) including idiopathic PD (
n = 10) and demographically matched neurologically healthy controls (
n = 11). Individuals with PD fulfilled the National Institute of Neurological Disorders and Stroke [
26] and the UK Parkinson’s disease Society Brain Bank [
54] diagnostic criteria for the disease at their final visit. Ethical permission for these studies was obtained from our regional ethics committee (REK 2017/2082, 2010/1700, 131.04).
To investigate the effect of the rRNA depletion and random primer capture protocol compared to the prevailing poly(A) method, we re-analyzed an RNA-seq dataset from a previous publication which employed a poly(A) tail selection kit on post-mortem tissue of the same brain area and same disease (PA cohort,
n = 29 PD samples,
n = 44 neurologically healthy controls, all males; GEO: GSE68719) [
20]. Informed consent was available from all individuals.
Tissue collection and neuropathology
Brains were collected at autopsy and split sagittaly along the
corpus callosum. One hemisphere was fixed whole in formaldehyde and the other coronally sectioned and snap-frozen in liquid nitrogen. All samples were collected using a standard technique and fixation time of ~ 2 weeks. There was no significant difference in post-mortem interval (PMI) (independent t-test, PW cohort
p = 0.16; NBB cohort
p = 0.92), age (independent t-test, PW cohort
p = 0.18; NBB cohort
p = 0.074) or gender (independent t-test, PW cohort
p = 0.94; NBB cohort
p = 0.53) between PD subjects and controls. Subject demographics and tissue availability are provided in Additional file
1. Routine neuropathological examination including immunohistochemistry for α-synuclein, tau and beta amyloid was performed on all brains. All cases showed neuropathological changes consistent with PD including degeneration of the dopaminergic neurons of the SNc in the presence of Lewy pathology. Controls had no pathological evidence of neurodegeneration.
RNA sequencing
Total RNA was extracted from prefrontal cortex tissue homogenate for all samples using RNeasy plus mini kit (Qiagen) with on-column DNase treatment according to manufacturer’s protocol. Final elution was made in 65 μl of dH2O. The concentration and integrity of the total RNA was estimated by Ribogreen assay (Thermo Fisher Scientific), and Fragment Analyzer (Advanced Analytical), respectively and 500 ng of total RNA was used for downstream RNA-seq applications. First, rRNA was removed using Ribo-Zero™ Gold (Epidemiology) kit (Illumina, San Diego, CA) using manufacturer’s recommended protocol. Immediately after the rRNA removal the RNA was fragmented and primed for the first strand synthesis using the NEBNext First Strand synthesis module (New England BioLabs Inc., Ipswich, MA). Directional second strand synthesis was performed using NEBNExt Ultra Directional second strand synthesis kit. Following this the samples were taken into standard library preparation protocol using NEBNext® DNA Library Prep Master Mix Set for Illumina® with slight modifications. Briefly, end-repair was done followed by poly(A) addition and custom adapter ligation. Post-ligated materials were individually barcoded with unique in-house Genomic Services Lab (GSL) primers and amplified through 12 cycles of PCR. Library quantity was assessed by Picogreen Assay (Thermo Fisher Scientific), and the library quality was estimated by utilizing a DNA High Sense chip on a Caliper Gx (Perkin Elmer). Accurate quantification of the final libraries for sequencing applications was determined using the qPCR-based KAPA Biosystems Library Quantification kit (Kapa Biosystems, Inc.). Each library was diluted to a final concentration of 12.5 nM and pooled equimolar prior to clustering. One hundred twenty-five bp Paired-End (PE) sequencing was performed on an Illumina HiSeq2500 sequencer (Illumina, Inc.). RNA quality, as measured by the RNA integrity number (RIN), varied across samples (mean = 5.3, range = 3.0–7.2 for PW; mean = 6.8, range = 3.2–9.1 for NBB), although the difference between conditions did not reach statistical significance in any of the cohorts (t-test P = 0.72 and 0.90 for PW and NBB cohorts, respectively).
Data quality control
FASTQ files were trimmed using Trimmomatic version 0.36 [
7] to remove potential Illumina adapters and low quality bases with the following parameters: ILLUMINACLIP:truseq.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15. FASTQ files were assessed using fastQC version 0.11.5 [
3] prior and following trimming. For an in-depth quality assessment, we mapped the trimmed reads using HISAT2 version 2.1.0 [
34] against the hg19 human reference genome (using --rna-strandness RF option) preserving lane-specific information. To discard potential lane-specific sequencing batch effects we inspected the output of the CollectRnaSeqMetrics tool of Picard Tools version 2.6 [
11]. Mapping efficiency and proportion of reads mapping to rRNA, intronic, intergenic and coding regions were obtained from the output of the CollectRnaSeqMetrics (Additional file
2: Figure S1 and S2).
For the poly(A) capture dataset [
20], raw FASTQ files were obtained from the Gene Expression Omnibus (GEO:GSE68719) and analyzed exactly as described for our cohorts (with the exception of --rna-strandness in HISAT2, which was turned off to take into account that the cDNA library of this cohort was unstranded).
RNA expression quantification and filtering
We used Salmon version 0.9.1 [
43] to quantify the abundance at the transcript level with the fragment-level GC bias correction option (−-gcBias) and the appropriate option for the library type (−l ISR) against the Ensembl release 75 transcriptome. Transcript-level quantification was collapsed onto gene-level quantification using the tximport R package version 1.8.0 [
49] according to the gene definitions provided by the same Ensembl release. We filtered out genes in non-canonical chromosomes and scaffolds, and transcripts encoded by the mitochondrial genome. To further reduce the potential for artifacts we filtered out transcripts with unusually high expression by removing transcripts that gathered more than 1% of the reads on more than half of the samples, which resulted in the removal of 3 and 4 transcripts from the PW and NBB cohorts, respectively. Additionally, low-expressed (i.e. genes whose expression was below the median expression in at least 20% of the samples) were filtered out from downstream analyses. Samples were then marked as outliers if their median correlation in gene expression (log counts per million) with the other samples was below Q
1–1.5*IQR or above Q
3 + 1.5*IQR (
Tukey’s fences; Q
1: first quartile, Q
3: third quartile, IQR: inter-quartile range). As a result, 3 samples were marked as outliers in the PW cohort and 3 in the NBB cohort, and were not included in downstream analyses (resulting sample sizes: N
PW = 26, N
NBB = 18, Additional file
2: Figure S3).
Estimation of marker gene profiles
It has been previously shown that cell type-specific transcriptional signature patterns derived from bulk tissue samples (marker gene profiles, MGPs), can be used as surrogates for relative cell type abundance across samples [
37]. MGPs for each cell type are calculated individually, by summarizing the concordant change in their respective marker genes via the first principal component of their expression (i.e. log-transformed counts per million (CPMs)). For the purpose of our study, we calculated MGPs for the main cortical cell types (neurons, oligodendroglia, microglia, endothelial cells, and astrocytes). Cortical cell type markers were obtained from the NeuroExpresso database [
37], a comprehensive database compiled using mouse brain cell type expression datasets, and human orthologs were defined using HomoloGene [
38]. To reduce the impact of outlier samples, principal component analysis was repeated 100 times on subsampled data, containing an equal number of subjects per group, and removing markers with opposite sign of the main trend. The median score for each sample was used as MGP for the downstream analyses. MGPs obtained with Neuroexpresso-based markers were highly correlated with MGPs calculated using two independent sets of markers from human brain single-cell transcriptomic studies [
33,
53] (Additional file
2: Figures S4–8, Additional file
2: Table S1). To assess potential variations associated with the disease across the neuronal markers, we examined the overlap between the markers and the differentially expressed genes in four publicly available datasets of laser microdissected neurons from PD brain (SNc dopaminergic neurons [
13,
22,
48] and posterior cingulate cortex pyramidal neurons [
51]). We found minimal overlap (3/78 genes) between our neuronal markers and genes differentially expressed in PD dopaminergic neurons. Moreover, none of the markers were differentially expressed in PD cortical neurons [
51] (Additional File
2: Figure S9). The vast majority of the cell type markers used for the calculation of MGPs changed in the same direction across our samples (Additional File
2: Figure S9), indicating that MGPs truly represent changes in global cell type-specific transcription profiles, rather than being driven by changes in specific genes.
To unravel potential complex interactions between MGPs and other experimental covariates, including disease status, we calculated the pairwise correlation between all the variables and also their association with the main axes of variation of gene expression. To assist us in choosing an optimal set of MGPs to include as covariates, we quantified the group differences in the cellular proportions between PD and controls using linear models adjusting for the known experimental covariates (i.e. RIN, PMI, sex, age, and sequencing batch). Significant association with disease status was found for oligodendrocyte MGP in the PW cohort and for microglia in the NBB cohort. Thus, these were included in the downstream analyses.
Differential gene expression and functional enrichment analyses
We performed differential gene expression analyses using the DESeq2 R package version 1.22.2 [
35] with default parameters. Experimental covariates (sex, age, RIN, PMI, and sequencing batch) as well as oligodendrocyte and microglia MGPs were incorporated into the statistical model. Multiple hypothesis testing was performed with the default automatic filtering of DESeq2 followed by false discovery rate (FDR) calculation by the Benjamini-Hochberg procedure. Analyses were carried out independently for the two cohorts. Genes were scored according to their significance by transforming the
p-values to account for direction of change. For each gene, the up-regulated score was calculated as
\( {S}_{\mathrm{up}}=\left\{\begin{array}{c}1-p/2, LFC<0\\ {}p/2, LFC\ge 0\end{array}\right. \), and the down-regulated score as
Sdown = 1 −
Sup, where LFC corresponds to the log fold change and
p to the nominal p-value of the gene. Genes were then tested for enrichment using alternatively log(
Sup) and log(
Sdown) scores employing the gene score resampling method implemented in the ermineR package version 1.0.1 [
39], an R wrapper package for ermineJ [
27] with the complete Gene Ontology (GO) database annotation [
5] to obtain lists of up- and down-regulated pathways for each cohort.
In order to characterize the main biological processes affected by the cell type correction, we scored pathways based on the loss of significance caused by the addition of cellular estimates to the gene expression model. We quantified the difference in the level of significance in the up- and down-regulated enrichment results for each significant pathway as ∆ = log(p0) − log(pCT), where pCT and p0 are the corrected enrichment p-values for the model with cell types (CT) and without (0), respectively. Only pathways that were significant in either one of the models were analyzed in this manner (p0 < 0.05 or pCT < 0.05).
Discussion
We present the first genome-wide transcriptomic study in the PD brain employing whole RNA-seq after rRNA depletion and random primer capture (Ribo-Zero). Our findings show that PD-associated differential gene expression signatures in bulk brain tissue are influenced to a great extent by the underlying differences in cell type composition of the samples. Modeling cell type heterogeneity allowed us to highlight transcriptional signatures that are likely to represent aberrant gene expression within the cells of the PD brain, rather than changes in cell composition.
Our results suggest that the Ribo-Zero approach is superior to the more commonly used poly(A) method and allows for a more accurate mapping and quantification of the transcriptome in post-mortem brain tissue. The Ribo-Zero method provides substantially higher evenness of coverage and effectively mitigates the 3′- and 5′-end coverage bias associated with poly(A) capture. Ultimately, the unevenness of coverage will influence transcript quantification, affecting the sensitivity of the differential expression estimates. While these observations are in agreement with previous comprehensive reports [
1,
25,
47,
56], we cannot rule out the contribution of experimental variables specific to each cohort, in addition to the RNA sequencing methodology. Furthermore, while the Ribo-Zero protocol shows advantages compared to the poly(A) method, it is certainly not sufficient to fully mitigate the impact of RNA degradation on transcript quantification.
Our study supports the notion that cell composition can be a major confounder in bulk brain tissue transcriptomics. We estimated the relative cell type abundance across our samples by calculating MGPs for the main cortical cell types. While MGPs do not provide a direct measure of cell counts, they are a validated and robust surrogate for cell type composition [
37,
52]. Moreover, we show that MGPs are (1) highly consistent across three different single cell-based marker sets, (2) highly robust to marker gene outliers, and (3) not susceptible to PD-associated changes in gene expression. Taken together, these results indicate that MGPs reliably represent the general behavior of cell type-specific transcriptional signature in our data.
Our analyses indicate that the observed expression profiles in both cohorts were driven predominantly by a combination of technical factors associated with RNA quality, and differences in cellular composition between PD and controls. This difference was primarily due to oligodendrocytes in PW and microglia in NBB. Since oligodendrocyte proliferation is not a pathological feature of PD, it is plausible that the difference in oligodendrocyte MGPs in PW was due to technical variation in gray/white matter content introduced during tissue sampling. Microglial infiltration does occur in affected areas of the PD brain [
18]. It is noteworthy, however, that increased microglial MGP was only observed in one of the cohorts (NBB), highlighting the biological heterogeneity of PD. Accounting for relative cell proportions reduced the number of differentially expressed genes and attenuated the calculated enrichment of cell type-specific pathways between PD and controls. In the PW cohort, this alleviated a substantial false positive signature of oligodendrocyte genes presumably caused by skewed grey/white matter sampling bias. Similar sampling bias could be responsible for oligodendroglia-specific functions appointed to PD brain in previous transcriptomic studies [
46].
Intriguingly, accounting for cellular proportions downplayed several of the transcriptomic signatures that have been previously associated with PD. For instance, the signal from vesicle trafficking- and synaptic transmission-related processes [
9,
10,
15,
21,
29,
40] was significantly attenuated in both cohorts, suggesting that the signal was primarily driven by changes in neuronal proportions between PD and controls, rather than modulation of these pathways within neurons. Moreover, we observed an attenuation in the down-regulation of mitochondrial pathways, including the respiratory chain and oxidative phosphorylation, which are among the most consistent transcriptomic signatures in PD [
8,
9,
19,
20,
29,
42,
55,
57]. The loss of transcriptional signal in these pathways is intriguing, because there is compelling evidence that decreased complex I protein levels occur in PD neurons [
23]. Our results suggest that the previously reported transcriptional down-regulation of the respiratory chain is at least partly driven by altered cellular composition (due to decreased number of neurons which highly express these genes) and may therefore not be the sole mechanism by which neuronal complex I deficiency occurs in PD. Indeed, it has been suggested that complex I deficiency in PD may be mediated by proteolytic degradation by the LON-ClpP protease system, rather than transcriptional regulation [
44].
Changes in the cell-composition of the affected brain regions occur in all neurodegenerative diseases, including PD, Alzheimer disease, amyotrophic lateral sclerosis (ALS) and Huntington disease. Interestingly, common and overlapping transcriptional signatures have been reported across these neurodegenerative diseases, including mitochondrial, neuronal-specific, and immunity-related pathways [
4,
17]. Our findings suggest that these common transcriptional signatures of neurodegeneration may largely represent the common pattern of altered cellularity, involving neuronal loss and glial proliferation, rather than biological processes of causal nature.
Accounting for cell type composition in our samples highlighted processes related to the endoplasmic reticulum, unfolded protein response and lipid/fatty acid oxidation as the top differential gene expression signatures in the PD prefrontal cortex. Unfolded protein response is indeed one of the most consistently reported transcriptomic signatures in PD [
8,
9,
20,
28,
41,
55]. Moreover, endoplasmic reticulum stress and aberrant proteostasis have been associated with the accumulation of misfolded proteins, including α-synuclein, in both in vitro studies and animal models of PD [
16]. While less is known regarding the role of lipid metabolism in PD, evidence of aberrant fatty acid oxidation has been found by metabolomic studies in serum [
12] and urine [
36] of patients. Our results corroborate these findings and indicate that aberrant fatty acid metabolism occurs in the PD prefrontal cortex.
Based on our findings, we advocate that modeling cell type heterogeneity is crucial in order to unveil transcriptomic signatures reflecting regulatory changes in the PD brain. It is, however, important noting that modeling of cellular estimates cannot completely mitigate the cell-composition bias in bulk tissue. Moreover, cell type correction complicates the identification of transcriptional changes that are confounded with changes in cellular composition and may thus increase the false negative rate. Single-cell or cell-sorting based methods will be key to overcoming this limitation and deciphering transcriptomic signatures directly associated with underlying disease mechanisms in PD.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.