Background
RNA-sequencing (RNA-seq) has replaced gene expression microarrays as the most popular method for transcriptome profiling [1, 2]. Various computational tools have been developed for RNA-seq data quantification and analysis; they share a similar workflow structure but differ notably in certain processing steps [3, 4]. Starting from a FASTQ file containing sequence reads and corresponding quality scores, the reads can be mapped and aligned to a reference genome using aligners such as TopHat2 and/or STAR. Gene counts are then generated from the resulting SAM or BAM file using tools such as SAMtools and HTSeq. This process is time consuming and yields gene-level counts only. Because alternative splicing creates multiple structurally distinct transcripts of the same gene that may produce different phenotypes, several tools have been developed for RNA-seq isoform quantification, such as Salmon_aln, eXpress, RSEM, and TIGAR2, all of which require transcriptome-mapping BAM files [5]. In contrast to these alignment-based methods, the transcript quantification tools Salmon, Sailfish, and kallisto were designed to boost processing speed and to decrease memory and disk usage by bypassing the creation and storage of BAM files [6–8]. This approach is particularly useful for the discovery of novel transcripts, for sequencing poorly annotated transcriptomes, and for detecting lowly expressed genes [9]. Raw read counts cannot be used to compare expression levels between samples because of the need to account for differences in transcript length, total number of reads per sample, and sequencing biases [4]. Therefore, RNA-seq isoform quantification software summarizes transcript expression levels as TPM (transcripts per million), RPKM (reads per kilobase of transcript per million reads mapped), or FPKM (fragments per kilobase of transcript per million reads mapped); all three measures account for sequencing depth and feature length [4].
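The definitions of RPKM and TPM above can be illustrated with a minimal Python sketch. It assumes a single sample, raw read counts per transcript, and plain annotated transcript lengths in base pairs; quantification tools typically use estimated effective lengths instead, and FPKM follows the same formula as RPKM with fragments in place of reads. The function names and toy values are hypothetical and chosen only for illustration.

```python
import numpy as np

def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads (one sample)."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(lengths_bp, dtype=float) / 1e3
    millions_mapped = counts.sum() / 1e6
    # Depth is corrected using the total number of mapped reads.
    return counts / lengths_kb / millions_mapped

def tpm(counts, lengths_bp):
    """Transcripts per million (one sample)."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(lengths_bp, dtype=float) / 1e3
    reads_per_kb = counts / lengths_kb
    # Length is corrected first; the rates are then rescaled to sum to one million.
    return reads_per_kb / reads_per_kb.sum() * 1e6

# Toy sample (hypothetical values): three transcripts of different lengths.
counts = [100, 200, 700]
lengths_bp = [1000, 2000, 7000]

rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
```

In this sketch the TPM values of a sample always sum to one million, whereas the sum of RPKM values depends on the length distribution of the expressed transcripts. Within a sample, TPM is simply RPKM rescaled to a fixed total, which is why both measures support within-sample comparisons.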
Because of the nature of these quantification measures and their embedded implicit normalization, TPM, RPKM, and FPKM expression levels are suitable for comparing RNA transcript expression within a single sample. However, none of these measures can be used universally for cross-sample comparisons and downstream analyses such as the determination of differentially expressed genes between two or more biological states. Issues arise, especially in the case of lowly expressed genes, when attempts are made to correct for gene length differences [9]. In a comprehensive evaluation of normalization methods for Illumina high-throughput RNA-seq data analysis, Dillies et al. [9] concluded that total gene counts and RPKM were not recommended quantifications for use in downstream differential expression analysis. Only DESeq2 and TMM normalization methods were shown to produce quantifications robust to the presence of different library sizes and widely different library compositions. Conesa et al. [4] conducted a survey of best practices for RNA-seq data analysis and indicated that RPKM, FPKM, and TPM methods normalize away the most important factor for comparing samples, which is sequencing depth, whether directly or by accounting for the number of transcripts, which can differ significantly between samples. RPKM, FPKM, and TPM tend to perform poorly when transcript distributions differ between samples. Highly expressed features in certain samples can skew the distribution of the quantitative measure and adversely affect normalization, leading to the spurious identification of differentially expressed genes. Zhao et al. [10] recently reported the misuse of RPKM and TPM normalization when comparing data across samples and sequencing protocols. However, because experimental data from different types of replicates have been lacking to further validate their recommendation, the scientific community does not seem to have reached consensus on which RNA-seq quantification measure should be used for cross-sample comparison. Many recent peer-reviewed articles, as well as publicly available databases and websites, still use TPM or RPKM/FPKM for pooled data analyses, cross-sample comparisons, and differential expression (DE) analysis [11–15]. Furthermore, some researchers have attempted to improve the comparability of expression measures by applying certain transformations (e.g., median centering and unit variance scaling, also referred to here as Z-score) or by re-normalizing TPM or RPKM/FPKM data.
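The Z-score transformation mentioned above (median centering followed by unit variance scaling) can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in any particular study: the input is assumed to be a genes-by-samples matrix of log-transformed expression values, the transformation is assumed to be applied per gene across samples, and the function name is hypothetical.

```python
import numpy as np

def zscore_genes(log_expr):
    """Median-center each gene (row) and scale it to unit variance across samples."""
    x = np.asarray(log_expr, dtype=float)
    centered = x - np.median(x, axis=1, keepdims=True)
    sd = x.std(axis=1, ddof=1, keepdims=True)
    sd[sd == 0] = 1.0  # assumption: genes with no variation are left at zero
    return centered / sd

# Toy matrix (hypothetical values): two genes measured in three samples.
z = zscore_genes([[1.0, 2.0, 3.0],
                  [5.0, 5.0, 5.0]])
```

Because this transformation rescales each gene separately, it does not alter the relative values of a gene across samples; any between-sample distortion already present in the input is carried through.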
In recent years, cancer models developed from patient tumors have come to replace late-passage cell lines as the preferred tool in pre-clinical cancer research [16]. The resulting patient-derived xenograft (PDX) models recapitulate most histological and genetic characteristics of their human donor tumor, thus facilitating the prediction of clinical outcomes, the investigation of drug efficacy, biomarker identification, and the development of personalized medicine strategies. The National Cancer Institute (NCI) is developing a national repository of Patient-Derived Models (PDMs) comprising hundreds of PDX models spanning a wide variety of tumor types. The publicly accessible database, NCI PDMR (https://pdmr.cancer.gov/), provides clinical annotations as well as molecular characterization information, whole exome sequencing, and RNA-seq data for early-passage PDXs and, if available, for originator patient specimens, to aid in selection of the best model for the investigation of a specific research question.
Here we report on our evaluation of TPM, FPKM, and normalized counts on an RNA-seq dataset of PDX models from the NCI PDMR. Our study examined 61 replicate samples belonging to 20 different PDX models originating from patients with different cancer types to determine which quantitative measures should be used to minimize differences between replicate samples, while preserving biologically meaningful expression differences between genes and across PDX models.
Discussion
Choosing an appropriate gene quantification measure is a key step in the downstream analysis of RNA-seq data. To address this question, we explored the performance of several widely used measures on a comprehensive collection of replicate samples from 20 PDX models in RNA-seq experiments across 15 cancer types. We compared TPM, FPKM, and counts normalized with the DESeq2 and TMM approaches, and we also examined the impact of applying variance stabilizing Z-score normalization to TPM-level data. We found that, for our datasets, both DESeq2 normalized count data (i.e., the median of ratios method) and TMM normalized count data generally performed better than the other quantification measures.
Each normalization method comes with a set of assumptions; thus, the validity of downstream analysis results depends on whether the experimental setup is congruent with those assumptions [32]. For instance, library size normalization approaches such as RPKM and its variant FPKM rely on the assumption that the total amount of mRNA per cell is the same for all conditions. In contrast, approaches such as TMM and DESeq perform normalization by comparing read count distributions across samples, and assume symmetrical differential expression between conditions (i.e., most genes are not differentially expressed between two conditions, and the numbers of upregulated and downregulated genes are comparable) [20, 21, 32]. In these cases, all genes in a sample, whether differentially expressed or not, are scaled by the same normalization factor, which is derived from the distance to an empirical reference sample. In practice, RPKM/FPKM and TPM tend to perform worse than distribution normalization methods because the requirement for the same amount of mRNA per cell does not hold, as substantiated by multiple reports of a few highly expressed genes dominating the number of mapped reads [9, 33, 34]. We made a similar observation in our study of 61 PDX samples (Fig. 5; Additional file 1: Table S2).
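The idea of deriving one scaling factor per sample from an empirical reference can be illustrated with a simplified median of ratios calculation. The sketch below is not the DESeq2 implementation; it assumes a genes-by-samples matrix of raw counts, uses the per-gene geometric mean across samples as the reference, and skips genes with a zero count in any sample. The function name and toy counts are hypothetical.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Per-sample size factors: median ratio of each sample to a geometric-mean reference."""
    counts = np.asarray(counts, dtype=float)  # shape: genes x samples
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts)           # zero counts become -inf
    log_reference = log_counts.mean(axis=1)   # log geometric mean per gene
    usable = np.isfinite(log_reference)       # keep genes with no zero counts
    log_ratios = log_counts[usable] - log_reference[usable, None]
    return np.exp(np.median(log_ratios, axis=0))

# Toy data (hypothetical values): sample 2 was sequenced twice as deeply as
# sample 1, and the last gene is strongly upregulated in sample 2.
counts = np.array([
    [100, 200],
    [200, 400],
    [300, 600],
    [400, 800],
    [50, 10000],
])

size_factors = median_of_ratios_size_factors(counts)
normalized = counts / size_factors
```

In this toy example the ratio of the two size factors is 2, matching the difference in depth, so the four unchanged genes receive identical normalized counts in both samples. Scaling by total counts instead (1,050 versus 12,000 reads) would be dominated by the single highly expressed gene.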
Reproducibility data (i.e., a dataset comprising n sets of replicate samples) can be used effectively to evaluate the performance of different normalization methods. Wagner et al. [35] discussed some of the benefits of TPM over FPKM and advocated for the use of TPM based on a small dataset of six human tissue/cell samples with only two replicates. Additionally, Abrams et al. [37] recently published a protocol to evaluate RNA sequencing normalization methods using a pool of well-characterized RNA samples from the Universal Human Reference RNA (UHRR, from ten pooled cancer cell lines, Agilent Technologies, Inc.) and the Human Brain Reference RNA (HBRR, from multiple brain regions of 23 donors, Life Technologies, Inc.) [36, 37]. The authors performed a two-way ANOVA to assess the relative contributions of biology and technology to the measured gene expression variability, and concluded that TPM was the best-performing normalization method because it retained biological variability without introducing much additional bias in their dataset of reference cancer cell lines and human brain samples [37]. Their conclusion was based on the analysis of technical replicates (i.e., the same samples sequenced in different laboratories) from pooled human cancer cell lines and human brain tissue samples. A recent study from The Jackson Laboratory outlined a genomic data analysis workflow for PDX tumor samples from 455 models, wherein gene expression estimates were determined using RSEM. Both expected count and TPM data were used in their data analysis examples. However, no recommendations were made on optimal RNA-seq quantification measures for cross-sample comparison, as the study did not include a systematic comparison of replicate samples [38].
The focus of our study was PDX samples, which are inherently more heterogeneous than cell lines, thereby making selection of a sequencing data normalization method critical. We opted to use early-passage PDXs because they have encountered less evolutionary pressure to adapt to a new environment. Therefore, the PDX replicates from the 20 models that we chose are more genetically similar to the original tumor [39]. Furthermore, noise may have been introduced in the RNA extraction and library preparation steps, and the presence of host mouse cells within the xenografted tumor, which requires a bioinformatic filtration step, constitutes a further challenge [40–42].
Using the data in the NCI PDMR database, we compared three RNA-seq quantification measures (TPM, FPKM, and normalized counts) in 20 histologically diverse PDX models with three or more replicates. In our study, TPM seemed to perform the worst according to multiple evaluation metrics. Similar to FPKM, TPM performed poorly when replicate samples from the same PDX model had heterogeneous transcript distributions, as seen in Fig. 4; that is, highly and differentially expressed features can skew the count distribution. As pointed out by Pachter [43], the dependency of TPM on effective lengths means that abundances reported in TPM are very sensitive to the estimates of effective length. Zhao et al. [10] suggested a workflow for the analysis of TPM or FPKM/RPKM-level data, which includes different paths depending on whether the same protocol and library were used, and whether the fractions of ribosomal, mitochondrial, and globin RNA were similar. In our examples, the top five most highly expressed genes had imbalanced fractions across the replicates, leading to larger variation. Additionally, we noted that ribosomal and mitochondrial genes tended to be overrepresented among the genes with the highest TPM expression levels (Additional file 1: Table S2). These factors, in addition to differences in sequencing depth, may all contribute to the observed variation between replicate samples in our study, underscoring the need for a robust normalization routine.
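The effect of imbalanced top-gene fractions on TPM can be reproduced with a small sketch. The two toy replicates below are hypothetical and differ only in the count of one highly expressed gene; all transcripts are assumed to have the same length so that only the compositional effect is visible. The helper names are illustrative and are not taken from the study's pipeline.

```python
import numpy as np

def tpm(counts, lengths_bp):
    """Transcripts per million for one sample."""
    rate = np.asarray(counts, dtype=float) / (np.asarray(lengths_bp, dtype=float) / 1e3)
    return rate / rate.sum() * 1e6

def top_k_fraction(values, k=5):
    """Fraction of the total signal carried by the k most highly expressed genes."""
    ordered = np.sort(np.asarray(values, dtype=float))[::-1]
    return ordered[:k].sum() / ordered.sum()

lengths_bp = [1000] * 6
# Hypothetical replicates: identical except for the first (dominant) gene.
replicate_a = [500, 100, 100, 100, 100, 100]
replicate_b = [5000, 100, 100, 100, 100, 100]

tpm_a = tpm(replicate_a, lengths_bp)
tpm_b = tpm(replicate_b, lengths_bp)

fraction_a = top_k_fraction(tpm_a, k=1)
fraction_b = top_k_fraction(tpm_b, k=1)
```

Although the five remaining genes have identical raw counts in both replicates, their TPM values fall from 100,000 in the first replicate to roughly 18,182 in the second, because the dominant gene takes up about 91% of the fixed one-million total instead of 50%.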
There have been discussions on the pitfalls of using TPM for cross-sample comparisons. These pitfalls can lead to major problems in downstream analyses of RNA-seq data. First, when the correlation of gene expression values with some other continuous variable across experimental subjects is of interest, one must rely on the comparability of gene expression measurements both to reduce technical noise that may attenuate correlations and to avoid extreme measurements that could produce spurious correlations. Certain features of the underlying data may adversely affect the performance of some of these quantification methods. For example, high expression of ribosomal RNA may lead to a skewed distribution of TPM-normalized expression measures for a particular sample. Consequently, a computed correlation will not be accurate even if rank statistics are used, because the comparison is at the gene level. Secondly, for differential expression (DE) analysis, statistical models usually assume that the data follow some probability distribution. Currently, the majority of DE analysis tools for RNA-seq assume a Poisson or negative binomial distribution for the data. Since TPM and FPKM values are not count data, they cannot be modeled using these types of discrete probability distributions. In addition, shrinkage methods implemented in many DE analysis tools require those distributional assumptions to hold, which they clearly will not for length-normalized measures such as TPM or FPKM/RPKM. Thirdly, some gene set enrichment analysis methods rely on parametric assumptions about the data distribution for the calculation of test statistics and p values [e.g., Fisher (LS) statistics]. TPM and FPKM/RPKM may be acceptable if the ranks of genes in each sample are used, as opposed to their quantitative expression values. For example, the Broad Institute’s gene set enrichment analysis (GSEA) tool allows users to perform pathway analyses by uploading a single rank-based gene list [44, 45]. Finally, our analyses demonstrated that neither Z-score transformation nor additional normalization steps can resolve the potentially problematic issues in TPM data. We recommend using a raw count matrix normalized by either DESeq2 or TMM for PDX studies.
As described above, each normalization method is based on its own assumptions. When those assumptions are violated, the method may fail [32]. In this paper, we showed examples of such scenarios, in which TPM and FPKM did not perform as reliably as counts normalized by DESeq2 or TMM in at least four PDX models. Therefore, it is important to consider context when selecting normalization methods and not to arbitrarily use a single method for all purposes [38]. Researchers need to be aware of the assumptions made by various methods, and of data characteristics that might violate those assumptions, in order to choose the right normalization method for their study.