Background
Deep sequencing of transcriptomes (RNA-seq) provides important insights into biology and disease. Bulk RNA-seq requires hundreds of thousands of cells. The resultant transcriptomic profile therefore represents the average of cells at different transcriptomic states or even different cell types within the same tissues (e.g., infiltrating immune cells or normal cells in tumor samples). With the discovery of new genes and splice junctions in the first single-cell RNA-seq (scRNA-seq) study [1], researchers realized the need to profile single-cell transcriptomes. Coinciding with this intense interest is the development of diverse approaches to perform scRNA-seq, which have been summarized in recent reviews [2–5]. Improvements in reagents that enable full-length transcriptome profiling by direct global amplification at the single-cell level (e.g., SMART-seq [6–8], Quartz-Seq [9], and the ‘Tang’ method [1]) have also enabled direct amplification for the analysis of groups of tens to hundreds of cells, i.e., limiting-cell RNA-seq (lcRNA-seq).
The advantages of direct amplification are manifold. First, it lowers the barrier in identifying differentially expressed genes (DEGs) in rare cell populations. Until recently, researchers had to pool cells from multiple samples to extract quantifiable amounts of total RNA for library generation. With direct cDNA preamplification, the number of cells required for successful library preparation dropped below 100 cells, a level often achievable from just one sample. Also, quantifying RNA accurately below 250 pg/μL is challenging. Direct cDNA preamplification eliminates this need by using cell counts (e.g., from fluorescence activated cell sorting (FACS) or laser capture) instead of RNA mass to standardize input amounts between experimental groups. Second, direct cDNA preamplification greatly preserves transcript quality by quickly transforming labile RNA into stable cDNA, as degradation associated with extraction can be significant. Third, direct amplification of enriched cells deposited into well-plates allows the incorporation of nanoliter microfluidics to deliver/mix reagents and templates quickly. This further preserves RNA integrity.
Even with these advances, systems noise associated with transcript degradation is inevitable and requires computational solutions, especially if large numbers of replicates are not feasible. Current publications on lcRNA-seq data fall into two categories: cell-pool samples as part of method development or scRNA-seq methodology comparisons [6, 10–12] and ultralow amounts of extracted RNA as input for RNA-seq library preparation [13–15]. Analysis workflows for bulk RNA-seq and scRNA-seq data are distinct as each approach addresses a different research question. The goal of bulk RNA-seq is to identify differences in transcriptomic profiles between treatment groups, whereas the goal of scRNA-seq is to characterize cell subpopulations in tissues or bulk cells. In this regard, the aim of lcRNA-seq experiments is like that of bulk RNA-seq experiments, whereas their data quality (e.g., prevalence of zero count genes or ‘dropout rate’ [10, 16, 17]) is similar to that of scRNA-seq. Therefore, statistical methods often used for between-group comparisons in bulk RNA-seq studies, such as the negative binomial distribution-based test in DESeq2 [18], should not be used for lcRNA-seq data without modifications because they are susceptible to zero-count artifacts. On the other hand, the myriad of tools [19–30] for analyzing scRNA-seq data are tuned to work with high variabilities (e.g., true biological variations in single cells or variabilities due to cDNA preamplification) but require large numbers of replicates not possible in lcRNA-seq studies.
Here, we describe CLEAR, a computational preprocessing approach for between-group comparisons of lcRNA-seq experiments. Designed for low numbers of replicates, CLEAR focuses on identifying robust transcripts with even read coverage for downstream analysis to control for data noise. It examines the noise pattern of individual samples to identify ‘reliably quantifiable’ transcripts for maximal signal and minimal noise. CLEAR transcripts common to replicates across two comparison groups will be used for subsequent analyses. Using a dataset derived from the same RNA stock but at dilutions spanning typical lcRNA-seq inputs, we show that CLEAR greatly improves similarity between results from three input RNA levels. In two public datasets, we demonstrate that the numbers and dispersion patterns of CLEAR transcripts yield a novel way to evaluate library qualities. In an in-house murine neural cell lcRNA-seq study, utilizing CLEAR transcripts significantly improves cell type separations by principal component analysis (PCA) and validations of cell phenotype markers. These case studies demonstrate the value of lcRNA-seq coupled with CLEAR in transcriptomic profiling of tissues found in rare niches and of precious clinical samples.
Methods
CLL patient sample acquisition
A chronic lymphocytic leukemia (CLL) patient sample was obtained from the Leukemia Tissue Bank (LTB), a shared resource of the NCI-funded OSU Comprehensive Cancer Center. The sample was obtained following written informed consent in accordance with the Declaration of Helsinki and under a protocol reviewed and approved by the Institutional Review Board of the Ohio State University. The patient had CLL as defined by the IWCLL 2008 criteria. The patient’s white blood cells were isolated by Ficoll density gradient centrifugation (Ficoll-Paque Plus, Amersham Biosciences, Little Chalfont, UK) and samples were banked at −180 °C in liquid nitrogen. Frozen cells were thawed and washed with RPMI 1640 media (Gibco, Life Technologies, Grand Island, NY, USA) and resuspended at 5 × 10⁶ cells/mL in complete medium containing 10% fetal bovine serum (FBS) (Sigma, St Louis, MO, USA), 2 mM L-glutamine, penicillin (100 U/mL), and streptomycin (100 µg/mL) (Gibco).
Animals for neural cell type analysis
All procedures involving animals were approved by the Ohio State Institutional Animal Care and Use Committee in accordance with institutional and national guidelines. Nestin-GFP mice were provided by Grigori Enikolopov at Cold Spring Harbor Laboratory [31]. All mice were housed in a 12 h light–dark cycle with food and water ad libitum. For isolation of the dentate gyrus (DG), adult mice (6–9 weeks old) were anesthetized with an intraperitoneal injection of ketamine (87.5 mg/kg) and xylazine (12.5 mg/kg) before perfusion with PBS. Following perfusion, the brain was removed and placed in cold Neurobasal A medium (Gibco 10-888-022) on ice. After bisecting the brain along the midsagittal line, the cerebellum and diencephalic structures were removed to expose the hippocampus. Under a dissection microscope (Zeiss), the DG was excised using a beveled syringe needle and placed in ice-cold PBS without calcium or magnesium (Gibco 10-010-049). DGs from mice were first mechanically dissociated with sterile scalpel blades and then enzymatically dissociated with a pre-warmed papain (Roche 10108014001)/dispase (Stem Cell Technologies 07913)/DNase (Stem Cell Technologies NC9007308) (PDD) cocktail at 37 °C for 20 min. Afterwards, the tissue was again mechanically disrupted by trituration for 1 min. Dissociated cells were collected by centrifugation at 500 × g for 5 min.
Fluorescence activated cell sorting (FACS)
For the human sample, all cells were stained and sorted by FACS Aria (BD Biosciences, San Jose, CA, USA). Live CLL B cells (CD5+ CD19+) and normal B cells (CD5− CD19+) were sorted from the patient sample. Briefly, cells for FACS were resuspended in PBS without calcium/magnesium and filtered through a 35 µm nylon filter and then stained for PE-conjugated CD45 (HI30), PerCp Cy-5.5-conjugated CD19 (HIB19), and Alexa Fluor 700-conjugated CD3 (UCHT1) monoclonal antibodies. Nonviable cells were excluded by the LIVE/DEAD Fixable Near-IR Dead Cell Stain Kit (Life Technologies, Carlsbad, CA, USA). Appropriate fluorescence minus one controls were used to determine nonspecific background staining. Single-cell gates were used to exclude the possibility of doublet cells. The FACS parameter diagrams for this process are available in Additional file 1: Figure S1.
For the Nestin-GFP mouse samples, cells were resuspended in PBS without calcium/magnesium and filtered through a 35 µm nylon filter before staining with the following antibodies: O4-APC (1:100, R&D FAB1326A), O1-eFluor660 (1:100, Thermo Fisher/eBioscience, Pittsburgh, PA, 5065082), GLAST-PE (1:50, Miltenyi Biotec, Bergisch Gladbach, Germany, 130-098-804), CD45-APC (1:100, BD Biosciences, Franklin Lakes, NJ, 561018), CD31-APC (1:100, BD Biosciences, Franklin Lakes, NJ, 561814). Cells were incubated with antibodies on ice for 30 min. During the last 10 min of staining, Hoechst dye (1:10,000, Thermo Fisher, Pittsburgh, PA, 33342) was added for live/dead discrimination. All cells were washed twice following staining and immediately sorted as stem, progenitor, or astrocyte populations based on fluorescent markers with the FACSAria III (BD Biosciences, Franklin Lakes, NJ). CD31, CD45, O1, and O4 negative live cells were designated as stem cells if double positive for GLAST and GFP, progenitors if only GFP positive, and astrocytes if only GLAST positive. For cells in the limited cell number study, 50 cells were sorted into 96 well format plates for downstream lcRNA-seq library generation.
Total RNA extraction was performed using Trizol reagent (Invitrogen, Carlsbad, CA). Briefly, approximately two million FACS-derived CD5+ and CD5− cells were separately sorted into 1.7 mL microcentrifuge tubes. Excess buffer was removed by centrifugation. Trizol reagent was added to cell pellets and the extraction protocol recommended by Invitrogen was followed. Total RNA was precipitated with 10 μg glycogen (Qiagen, Hilden, Germany). The quality of the total RNA was assessed with the Agilent 2100 Bioanalyzer (Agilent, Inc., Santa Clara, CA) using total RNA Pico chip.
Library generation and sample sequencing
Total CD5+ and CD5− RNA quantified using the Invitrogen Qubit RNA HS Assay kit (Invitrogen, Carlsbad, CA) was serially diluted to masses characteristic of single- and limiting-cell RNA-Seq (10-, 100-, and 1000-pg). The Clontech SMARTer v4 kit (Takara Bio USA, Inc., Mountain View, CA) was used for global preamplification of these serially diluted samples in triplicate and also for direct global preamplification in FACS-derived murine DG cell types prior to library generation in quadruplicates with the Nextera XT DNA Library Prep Kit (Illumina, Inc., San Diego, CA). Samples were sequenced to a depth of 15–20 million 2 × 150 bp clusters on the Illumina HiSeq 4000 platform (Illumina, Inc., San Diego, CA).
Publicly available data retrieval
Published data were retrieved in FASTQ format from the following accession numbers: ArrayExpress E-MTAB-2600 [32] (mouse embryonic stem cells from 2i and alternative 2i media) and GSE50856 [14] (SMART-seq samples).
Data preprocessing, alignment, and quantification
Individual FASTQ files were trimmed for adapter sequences and filtered for a minimum quality score of Q20 using AdapterRemoval v2.2.0 [33]. Preliminary alignment using HISAT2 v2.0.6 [34] was performed to a composite reference of rRNA, mtDNA, and PhiX bacteriophage sequences obtained from NCBI RefSeq [35]. Reads aligning to these references were excluded in subsequent analyses. Primary alignment was performed against the human genome reference GRCh38p7 or mouse genome reference GRCm38p4 using HISAT2. Gene expression values for genes described by the GENCODE [36, 37] Gene Transfer Format (GTF) release 25 (human) or release M14 (mouse) were quantified using the featureCounts tool of the Subread package v1.5.1 [38, 39] in unstranded mode.
Sequencing and alignment quality control
Quality control was performed using a modification of our custom workflow QuaCRS [40]. In brief, aligned read quality was verified using RNA-SeQC [41] and RSeQC [42]. Parameters evaluated included the exonic rate (the percentage of reads aligning to exons), the intronic rate (the percentage of reads aligning to introns), and the duplication rate (the percentage of reads that were identified as PCR duplicates).
Coverage profiling
Coverage depths across the aligned reference were calculated with a per-base resolution using the ‘genomecov’ utility of Bedtools v2.27.0 [43] in BedGraph format. It is imperative to utilize ‘split output’ mode to reduce the size of the output BedGraph files. These files are used as input into CLEAR.
Calculation of transcript μi
For each transcript (i) annotated in the NCBI RefSeq (GRCh38 or GRCm38) reference, as sourced from the UCSC Genome Browser [44], CLEAR calculates the transcript’s μi parameter. This quantifies the positional mean of the read distribution along that transcript, scaled to run from the 5′ end (μi = −1) to the 3′ end (μi = +1) (Eq. 1):
$$\mu_{i} = \frac{2}{L_{\text{transcript}}}\left( \frac{\sum\nolimits_{k = 0}^{L_{\text{transcript}} - 1} \left( k \cdot d_{k} \right)}{\sum\nolimits_{k = 0}^{L_{\text{transcript}} - 1} d_{k}} \right) - 1$$
(1)
where Ltranscript is the length of a given transcript, and dk is the coverage of exonic locus k, zero-indexed and starting at the transcription start site. When a gene has multiple isoforms, the longest transcript from the UCSC Genome Browser is used for the μi calculation.
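Eq. 1 can be illustrated with a short Python sketch. This is not the CLEAR source code: the function name `transcript_mu` and the convention of returning NaN for transcripts without coverage are assumptions made for illustration. The input is assumed to be the per-base exonic coverage of one transcript, already ordered from the transcription start site (for genes on the minus strand the array would need to be reversed first).

```python
import numpy as np

def transcript_mu(depth):
    """Positional mean of read coverage along a transcript (Eq. 1).

    depth[k] is the coverage at exonic position k, zero-indexed from the
    transcription start site. Values near -1 indicate 5'-weighted coverage,
    values near +1 indicate 3'-weighted coverage.
    """
    depth = np.asarray(depth, dtype=float)
    length = depth.size
    total = depth.sum()
    if length == 0 or total == 0:
        return float("nan")  # assumption: mu is undefined without coverage
    positions = np.arange(length)
    mean_position = (positions * depth).sum() / total
    return 2.0 * mean_position / length - 1.0
```

Because k runs from 0 to Ltranscript − 1, perfectly uniform coverage gives μi = −1/Ltranscript rather than exactly 0, which is negligible for transcripts of typical length.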
Determination of analysis-ready CLEAR transcripts
All transcripts quantified by featureCounts are sorted by overall length-normalized expression. Histograms of μi values, each built from a window of 250 transcripts, are collected and fit, using the ‘optimize’ module of the Python scipy package, to a double-beta distribution as described by Eq. 2:
$$H\left( \frac{(1 + x)\left(1 - \frac{1}{2}(1 + x)\right)^{1 + b}}{2B(2,2 + b)} + \frac{(1 + x)^{1 + a}\left(1 - \frac{1}{2}(1 + x)\right)}{2^{1 + a}B(2 + a,2)} \right)$$
(2)
where H is a normalization parameter fixed by the bin sizes, B(q,p) is the beta integral of q and p, and x is the bin location. The fitting parameters are a and b, which are each bounded to be non-negative. A value of 0 for both parameters represents a symmetric distribution, and positive values represent progressively bimodal distributions toward the positive (a) or negative (b) direction. The windows are advanced in units of 10 transcripts, reducing computational intensity. Once a window is found where a or b has a value greater than 2, the software exports transcripts with a measured expression higher than that bin’s value for analysis. For multi-sample comparisons, transcripts are overlapped and only transcripts found in all samples are included for downstream analysis. We note, however, that a command line argument is available to reduce this requirement to only a subset of samples for investigators interested in a less stringent inclusion criterion.
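The windowed fit can be sketched as follows. This is a minimal illustration under stated assumptions, not the CLEAR implementation: the function names, the use of `scipy.optimize.curve_fit` (the text names only the ‘optimize’ module), the choice of 20 histogram bins, the starting values, and the way H is fixed (number of values × bin width / 4, which makes the model integrate to the number of values) are all assumptions. Substituting u = (1 + x)/2 turns the two terms of Eq. 2 into Beta(2, 2 + b) and Beta(2 + a, 2) densities in u.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import beta as beta_fn

def double_beta(x, a, b, H=1.0):
    """Eq. 2 evaluated at x in [-1, 1], written in terms of u = (1 + x) / 2."""
    u = 0.5 * (1.0 + x)
    toward_5p = u * (1.0 - u) ** (1.0 + b) / beta_fn(2.0, 2.0 + b)
    toward_3p = u ** (1.0 + a) * (1.0 - u) / beta_fn(2.0 + a, 2.0)
    return H * (toward_5p + toward_3p)

def fit_window(mu_values, n_bins=20):
    """Fit a and b (both bounded to be non-negative) to one histogram of mu."""
    mu_values = np.asarray(mu_values, dtype=float)
    counts, edges = np.histogram(mu_values, bins=n_bins, range=(-1.0, 1.0))
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Assumption: H is fixed so that the model sums to the number of values.
    H = mu_values.size * (edges[1] - edges[0]) / 4.0
    model = lambda x, a, b: double_beta(x, a, b, H)
    (a, b), _ = curve_fit(model, centers, counts,
                          p0=(0.5, 0.5), bounds=(0.0, np.inf))
    return a, b

def clear_cutoff_index(mu_sorted, window=250, step=10, threshold=2.0):
    """Index of the first window whose fit exceeds the threshold.

    mu_sorted holds mu values ordered from highest to lowest expression.
    Assumption: transcripts ranked before the returned index are kept.
    """
    for start in range(0, len(mu_sorted) - window + 1, step):
        a, b = fit_window(mu_sorted[start:start + window])
        if a > threshold or b > threshold:
            return start
    return len(mu_sorted)
```

With a = b = 0 the model reduces to twice a Beta(2, 2) density, the symmetric case described above; larger a or b moves mass toward the 3′ or 5′ end, respectively.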
CLEAR visualizations
A core component of the quality control of the CLEAR selection process is the visualization of μi to confirm that the characteristic bifurcation is observed. Violin plots are produced using code included with CLEAR, which utilizes the matplotlib package in Python. Examples can be seen in Fig. 2d and Additional file 1: Figures S5A, C.
DEG comparisons
Differentially expressed genes were called using DESeq2 [18] run on counts tables generated with featureCounts as described above. In all summarization figures, a false discovery rate (FDR) q-value of < 0.05 was used as an inclusion criterion for DEGs.
Principal component analysis
Principal component analysis (PCA) was utilized to visualize differences between samples. All PCA plots were generated from counts tables that were size-normalized and rlog-transformed after CLEAR selection using methods included with DESeq2. Each comparison was processed with the scikit-learn [45] PCA implementation and plotted using custom scripts in Python.
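The PCA step might look like the sketch below. It is an approximation of the described workflow rather than the scripts used in the study: the size factors follow the median-of-ratios approach used by DESeq2, but the rlog transformation (an R function) is replaced here by a plain log2(x + 1) transform as a stand-in, and the function name `pca_coordinates` is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_coordinates(counts, n_components=2):
    """PCA of a genes x samples count table of CLEAR-passing transcripts."""
    counts = np.asarray(counts, dtype=float)
    # Median-of-ratios size factors, using genes with no zero counts.
    positive = (counts > 0).all(axis=1)
    if not positive.any():
        raise ValueError("no gene has non-zero counts in every sample")
    log_counts = np.log(counts[positive])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    size_factors = np.exp(np.median(log_counts - log_geo_mean, axis=0))
    normalized = counts / size_factors
    # Assumption: log2(x + 1) stands in for the DESeq2 rlog transformation.
    transformed = np.log2(normalized + 1.0)
    pca = PCA(n_components=n_components)
    coords = pca.fit_transform(transformed.T)  # rows are samples
    return coords, pca.explained_variance_ratio_
```

The returned coordinates have one row per sample and can be passed directly to a scatter plot.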
Discussion
Total RNA input amount for lcRNA-seq (e.g., 10–1000-pg from 1 to 100 cells) is closer to amounts found in scRNA-seq studies (1–10-pg) than to bulk RNA-seq studies (100–200-ng). Due to the minute RNA input amount in single-cell/limiting-cell transcriptomic studies, the library generation process, with the exception of single-molecule sequencing, is preceded by global cDNA preamplification. Researchers have observed artefacts such as a large proportion of the sequencing reads being dominated by a small number of transcripts [14], excessive transcripts with zero read counts [58], high variabilities between and within replicates of sample groups [30], distortion towards shorter transcripts [59], and higher variances at lower biological abundances [10, 11]. Together, these artefacts result in noisy sc/lcRNA-seq data that challenge assumptions fundamental to bulk RNA-seq analysis approaches and render them unsuitable for direct application to these data [16]. Existing strategies to overcome challenges associated with noisy data to uncover biological differences in single-cell studies include: (1) identifying low-quality, high-noise samples and removing them from downstream analyses [25]; (2) applying transcript normalization [16, 21, 27, 60]; (3) incorporating ERCC spike-in control [30, 61]; (4) utilizing median absolute deviation (more resistant to outliers) to characterize statistical dispersion [12]; and (5) integrating UMIs to track transcripts by molecular counts [62]. Yet, due to the often low number of replicates and the goal of identifying DEGs between sample groups, the strategies for scRNA-seq data just described are not appropriate or adequate for lcRNA-seq data.
Increasingly, RNA-seq is used in clinical trials to study patients’ transcriptional responses to novel drugs or drug combinations. Patient biospecimens are extremely precious, especially well-annotated samples. In this scenario and in cases where cell types of interest are rare, the combination of lcRNA-seq (data generation) and CLEAR (identification of robust transcripts from high noise data for between-group comparisons) enables global profiling without exhausting important samples while preserving RNA quality by bypassing total RNA extraction. Similar to single-cell analysis, lcRNA-seq has high signal dropouts and skewed read coverage profiles. CLEAR is designed to identify transcripts with acceptable systems noise for downstream analyses by evaluating one sample at a time (not by assigning a uniform cutoff across all samples). CLEAR then requires transcripts to be reliably quantifiable in all replicates within and across comparison groups. This criterion is made possible by the RNA quality preservation effects of direct cDNA preamplification reagents and library generation using nanoliter-microfluidics devices, as they increase even read coverage in lcRNA-seq transcripts. To our knowledge, preprocessing lcRNA-seq data using CLEAR followed by DEG analysis is unique and unlike approaches used by others in similar studies. For example, Shanker et al. [13] performed Pearson correlation comparisons between serial dilutions of input RNA vs. their 1 μg input control using log2 of median RPKM for all genes having ≥ 2 reads coverage. Bhargava et al. [14] performed DESeq without prior data preprocessing to compare the performance between three library preparation methods as well as within serial dilutions of mRNA derived from two mouse embryonic stem cell culture conditions. Liu et al. [63] utilized relative expression orderings to harmonize clinical transcriptome signatures between low-input and bulk RNA-seq libraries. In contrast, CLEAR comprehensively screens all transcripts in all replicates within an lcRNA-seq study for genes with even coverage for downstream analyses. The combination of characterizing each transcript by its read distribution mean μi followed by cutoff selection defined by expression across the full transcript distribution is deliberate. For any given transcript, even one with low systems noise, there are many reasons why its mean μi could deviate from zero: (1) transposase-based library generation methods [64] less effectively ‘tagment’ the 5′-end of transcripts (3′-ends are less affected, likely due to the presence of poly(A) tails), affecting 5′-end read coverage; (2) transcript isoforms skew the read distribution along the length of the transcript, especially when prominent isoforms are not correctly annotated [17]; (3) presence of truncated, polyadenylated RNA fragments/intermediates part way through the RNA decay pathway [65]; (4) presence of RNA fragments with alternative polyadenylation sites [17, 66, 67]. Thus, a non-zero μi alone does not signify a noisy transcript. Conversely, read coverage profiles of some of the noisy transcripts can become random and μi can approach zero by chance. Simply keeping transcripts with a mean μi close to zero would result in some valid transcripts being excluded and some noisy transcripts being included. By characterizing the entire distribution of μi over a defined range of expression levels, outliers due to the above-mentioned effects do not impact the classification of that particular expression bin. For example, a well quantified transcript with an incorrectly annotated isoform would have a non-zero μi but would be found in a bin with a majority of genes with μi close to zero and thus classified as reliably quantifiable. In this manner, CLEAR adapts to the quality of each sample to assess systems noise, an approach more precise than using an arbitrary cutoff. It is, however, important to note that systematic biases along transcripts that affect all genes, such as the known 3′ bias of the original SMART-seq protocol [6], do negatively affect CLEAR; we thus recommend its use only on the SMART-seq2 and newer protocols.
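The requirement that a transcript be reliably quantifiable in all replicates, and its optional relaxation to a subset of samples, amounts to a set overlap. A minimal sketch follows; the function name `shared_transcripts` and the `min_fraction` parameter are hypothetical and do not correspond to the actual CLEAR command line argument, whose name is not given here.

```python
import math
from collections import Counter

def shared_transcripts(passed_by_sample, min_fraction=1.0):
    """Transcripts passing CLEAR in at least min_fraction of the samples.

    passed_by_sample maps a sample name to the transcripts that passed in
    that sample. min_fraction=1.0 reproduces the strict 'all samples' rule.
    """
    n_required = math.ceil(min_fraction * len(passed_by_sample))
    tally = Counter(
        transcript
        for passed in passed_by_sample.values()
        for transcript in set(passed)
    )
    return {t for t, n in tally.items() if n >= n_required}
```

For example, with four samples, `min_fraction=0.75` keeps any transcript that passes in at least three of them.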
In this report, we have selected two sets of public data [14, 25] and one set of in-house data to illustrate the utility of CLEAR. One of the goals for using public datasets was to ascertain that the mean read/fragment distribution phenomenon observed in the CLEAR development data is also present in sc- and lcRNA-seq libraries derived from Fluidigm C1 IFCs and manual SMART-seq, respectively. As expected, the violin plots from the public data (Additional file 1: Figure S6A, C) illustrate that the distribution of μi transitions from an acceptable distribution centered around 0 to an unacceptably broad/bimodal distribution. Another goal was to evaluate the ability of CLEAR to identify noisy transcripts in a broad range of datasets. The application of CLEAR to the Ilicic et al. scRNA-seq data [25] revealed an association between cell integrity and the resultant data quality. Additional file 1: Figure S6B displays violin plots of the dispersion of CLEAR transcripts, which recapitulated the authors’ microscopy assessments of intact ‘single cells’ and ‘no cell’. When applied to the Bhargava et al. data [14], CLEAR identified increasing numbers of ‘passed’ transcripts with increasing mRNA input mass (Additional file 1: Figure S6D), which we also observe in the CD5+/CD5− data. Together, these assessments confirm that CLEAR can independently reveal sample/transcript quality and identify noisy data associated with empty wells (Ilicic et al. data [25]) and increases in data quality with increasing levels of RNA input (Bhargava et al. data [14]). As yet another example of CLEAR’s performance, we incorporated a proof-of-principle experiment with cells isolated from the neurogenic niche in murine hippocampus. Traditionally, this type of study has been hampered by the limited numbers of stem cells and progenitors present in murine DG. By coupling preamplification-based lcRNA-seq with CLEAR, DEGs between these cell types can be identified with only one mouse brain per biological replicate.
When comparing different biological groups, it is possible that a gene is highly expressed in one but not in the other. An example of this is the Eomes gene, depicted in Additional file 1: Figure S7. It passes CLEAR in the progenitors but fails in the other cell types. While true DEG analysis is not possible for such genes, one can still use the CLEAR cutoff expression level (the count of the lowest transcript passing CLEAR for a particular sample) to report a bound on the fold change of such a gene consistent with the CLEAR analysis. For example, DESeq2 reports a log2 fold change of 8.97 when comparing the expression of Eomes in progenitors to that in astrocytes. However, the ‘bound’ procedure suggests a more modest difference of 1.48 when the threshold is substituted for the featureCounts-reported expression value. The latter analysis is likely more reliable than the higher fold change reported by DESeq2 alone due to the presence of signal dropouts when all transcripts are used. For large enough numbers of replicates, it may also be reasonable to identify a gene as a DEG (albeit without quantifying its fold change) if it passes CLEAR in all samples of one group and fails CLEAR in all samples of the other group, as such behavior is exceedingly unlikely to occur by chance in the case of many replicates.
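One way to read the ‘bound’ procedure is sketched below. The exact calculation is not spelled out here, so this is an assumption: any count that falls below its sample’s CLEAR cutoff is replaced by that cutoff, and the log2 ratio of the group means is then reported. The function name and the example numbers are hypothetical and are not the Eomes values from the study.

```python
import math

def bounded_log2_fold_change(counts_a, counts_b, cutoffs_a, cutoffs_b):
    """Conservative log2(A / B) with sub-threshold counts raised to the cutoff.

    counts_* are normalized counts of one gene per sample; cutoffs_* are the
    per-sample CLEAR cutoff expression levels (assumed positive).
    """
    floored_a = [max(c, t) for c, t in zip(counts_a, cutoffs_a)]
    floored_b = [max(c, t) for c, t in zip(counts_b, cutoffs_b)]
    mean_a = sum(floored_a) / len(floored_a)
    mean_b = sum(floored_b) / len(floored_b)
    return math.log2(mean_a / mean_b)

# Hypothetical gene: well expressed in group A, near-dropout in group B.
bound = bounded_log2_fold_change(
    counts_a=[400, 360, 440, 400],
    counts_b=[0, 1, 0, 2],
    cutoffs_a=[100, 100, 100, 100],
    cutoffs_b=[100, 100, 100, 100],
)
```

Here the group B counts are all raised to the cutoff of 100, so the reported difference is limited by the level at which transcripts become reliably quantifiable rather than by dropout counts near zero.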
Although CLEAR can be applied to scRNA-seq data just as effectively as lcRNA-seq data (Additional file 1: Figure S6A, B, results from analyzing Ilicic et al. [25] scRNA-seq data using CLEAR), the low complexity and highly variable nature of scRNA-seq data would typically result in insufficient CLEAR transcripts for meaningful DEG analysis, especially when requiring transcripts to pass the CLEAR criterion in all (usually of a very large number) cells. The incorporation of imputation would relax the CLEAR criterion thereby rendering it useful in scRNA-seq data.