To validate the GPDS ranked lists, demonstrate the reliability of our method, and measure its predictive performance on in vitro data, we applied it to the Kinker et al. dataset [
29] comprising 53,513 single-cell transcriptional profiles from 198 tumorigenic cell lines spanning 22 distinct cancer types. To evaluate DREEP's performance, we first converted predictions from the single-cell level to the cell-line level by computing, for each drug (i.e. each GPDS), the median enrichment score across the cells of each individual cell line. Then, to minimize potential confounders, we evaluated the performance of the DREEP method on each sensitivity dataset independently (see the “
Methods” section). We also searched for the optimal number of top-ranked relevant genes to use in the enrichment analysis when predicting the sensitivity of a cell (i.e. 50, 100, 250, 500, or 1000 genes). Figure
1C shows the method’s overall performance in terms of precision-recall and ROC (receiver operating characteristic) curves across the 198 cancer cell lines for each drug sensitivity dataset independently (i.e. CTRP2, GDSC, and PRISM). In this analysis, predicted drug/cell-line pairs are ranked according to the median enrichment score estimated by DREEP, and performance is evaluated against the corresponding gold standard (see the “
Methods” section). Figure
1D reports the ROC curve’s AUC (area under the curve) for each drug and each cell line individually. Next, to investigate whether DREEP exhibited a prediction bias toward any class of drugs characterized by a specific mechanism of action (MoA), we categorized the ROC curve’s AUC of each drug in Fig.
1D (left panels) into two distinct groups: (i) accurately predicted drugs (defined as ROC-AUC ≥ 0.75) and (ii) erroneously predicted drugs (defined as ROC-AUC ≤ 0.5, indicating performance at or below random). This analysis yielded a total of 639 drugs whose efficacy was accurately predicted (209 from GDSC, 147 from CTRP2, and 282 from PRISM; Additional file
2: Table S1) and 135 drugs whose efficacy was erroneously predicted by DREEP (21 from GDSC, 23 from CTRP2, and 91 from PRISM; Additional file
2: Table S1). Notably, accurately predicted drugs exhibited a significant enrichment [
32] (FDR < 10%) across a broad spectrum of MoAs (see the “
Methods” section), spanning various biological mechanisms (Additional file
2: Table S2). In contrast, the 135 drugs predicted with low accuracy showed enrichment for a limited number of MoAs but overlapped with the MoAs of accurately predicted drugs (Additional file
2: Table S2), except for prostanoid and glucocorticoid agonist molecules. This analysis indicates that DREEP did not have a prediction bias toward a specific class of drugs, but it proved incapable of predicting the effects of these 135 small molecules, leading us to exclude them from the final version of the tool. Lastly, to assess DREEP’s ability to generalize across different cancer types, we grouped the ROC curve’s AUC values for individual cell lines (as reported in Fig.
1D, right panels) according to their respective cancer types. Figure S2 (Additional file
1) shows that DREEP exhibits no significant performance drop for any specific cancer type, except for neuroblastoma, where the average AUC is lower than in the other cancer types. Finally, we conducted an additional assessment of DREEP’s predictive performance in which we constructed GPDS ranked lists using each drug’s IC50 values instead of the AUC metric. As depicted in Fig. S3 (Additional file
1), DREEP’s performance noticeably decreased across all three drug viability datasets when employing this alternate approach.
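The evaluation pipeline described above (collapsing single-cell predictions to cell-line medians, computing a per-drug ROC-AUC against the gold standard, and classifying drugs by the AUC thresholds) can be sketched as follows. The variable names, data layout, and use of scikit-learn are our assumptions for illustration, not the published implementation:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def cell_line_scores(scores: pd.DataFrame, cell_line: pd.Series) -> pd.DataFrame:
    # Collapse single-cell predictions to the cell-line level: median GPDS
    # enrichment score of each drug across the cells of each line.
    # `scores` is cells x drugs; `cell_line` maps each cell to its line.
    return scores.groupby(cell_line).median()


def per_drug_auc(line_scores: pd.DataFrame, gold: pd.DataFrame) -> pd.Series:
    # ROC-AUC of each drug, ranking cell lines by median enrichment score
    # against a binary gold standard (1 = sensitive), e.g. CTRP2/GDSC/PRISM.
    aucs = {}
    for drug in line_scores.columns:
        y = gold[drug].reindex(line_scores.index)
        mask = y.notna()
        if y[mask].nunique() == 2:  # AUC is defined only with both classes
            aucs[drug] = roc_auc_score(y[mask], line_scores.loc[mask, drug])
    return pd.Series(aucs)


def classify_drugs(aucs: pd.Series):
    # Thresholds from the text: >= 0.75 accurate, <= 0.5 erroneous.
    accurate = aucs[aucs >= 0.75].index.tolist()
    erroneous = aucs[aucs <= 0.5].index.tolist()
    return accurate, erroneous
```

Drugs falling between the two thresholds belong to neither group, mirroring the two-group categorization used in the text.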
Next, we tested the prediction performance of the DREEP method on both the pan-cancer and breast cancer datasets using absolute gene expression levels instead of gf-icf-normalized data. We observed a drastic reduction in the method’s predictive performance (Additional file
1: Fig. S4) when absolute gene expression levels were used to estimate a gene’s relevance in a cell, underscoring the importance of data pre-processing prior to the application of a drug prediction algorithm. As a good compromise between the required computational time and prediction accuracy, we decided to use
N = 500 genes as a default value for the DREEP sensitivity predictions in all subsequent analyses in the manuscript.
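To illustrate why this normalization matters, gf-icf weighs each gene's frequency within a cell by how rarely that gene is expressed across cells, analogous to TF-IDF in text mining, so that ubiquitously expressed genes do not dominate the top-N ranking. The sketch below is a minimal TF-IDF-style approximation of this idea together with the top-N gene selection; the published gf-icf implementation may differ in its smoothing and normalization details:

```python
import numpy as np


def gficf(counts: np.ndarray) -> np.ndarray:
    # counts: cells x genes raw count matrix.
    gf = counts / counts.sum(axis=1, keepdims=True)  # gene frequency per cell
    n_cells = counts.shape[0]
    df = (counts > 0).sum(axis=0)                    # cells expressing each gene
    icf = np.log((1 + n_cells) / (1 + df)) + 1       # smoothed inverse cell frequency
    return gf * icf


def top_genes(weights: np.ndarray, gene_names, n=500):
    # Rank one cell's genes by gf-icf weight; the text defaults to N = 500.
    order = np.argsort(weights)[::-1][:n]
    return [gene_names[i] for i in order]
```

Under this weighting, a gene expressed in few cells outranks an equally expressed housekeeping gene, which is the relevance signal that raw absolute expression levels lack.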
Finally, we used both the single-cell pan-cancer and breast cancer datasets to conduct a comparative analysis of DREEP’s performance with four other widely used single-cell drug prediction methods, including scDRUG [
17], scDEAL [
19], and beyondCell [
20] (see the “
Methods” section). We evaluated the performance of these methods based on the shared drugs, as depicted in Fig. S5A (Additional file
1). As demonstrated in Fig. S5B-F (Additional file
1) and Tables S8–S10, DREEP consistently outperformed the other methods on both datasets. While the biomarker-based method beyondCell, like DREEP, showed reasonable performance when using the GDSC dataset, scDEAL and scDRUG instead yielded subpar results in all comparisons. This discrepancy is likely due to these two methods being trained on high-coverage single-cell data from 10X Genomics, which inherently have fewer dropout events than the low-coverage single-cell datasets used in our comparisons.
Taken together, these results demonstrate the validity of our GPDS ranked lists for predicting the sensitivity of cells to drugs. Furthermore, our comprehensive evaluation shows that DREEP consistently outperforms random predictions across various scenarios and also outperforms other state-of-the-art methods.
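For reference, the MoA enrichment test used earlier to check for prediction bias can be sketched as a one-sided hypergeometric test per MoA followed by Benjamini-Hochberg correction at FDR < 10%. This is a minimal illustration under our own assumptions about inputs and names; the actual analysis follows the enrichment procedure of ref. [32], which may differ:

```python
import numpy as np
from scipy.stats import hypergeom


def moa_enrichment(accurate: set, universe: set, moa_sets: dict, fdr=0.10):
    # For each MoA, test whether accurately predicted drugs are
    # over-represented among the drugs annotated with that MoA.
    moas, pvals = [], []
    for moa, drugs in moa_sets.items():
        drugs = drugs & universe
        k = len(accurate & drugs)            # accurate drugs with this MoA
        M, n, N = len(universe), len(drugs), len(accurate)
        pvals.append(hypergeom.sf(k - 1, M, n, N))  # P(X >= k) under the null
        moas.append(moa)
    # Benjamini-Hochberg adjustment of the p-values
    p = np.asarray(pvals)
    order = np.argsort(p)
    ranked = p[order] * len(p) / (np.arange(len(p)) + 1)
    adj = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce monotonicity
    qvals = dict(zip([moas[i] for i in order], adj))
    return {m for m, q in qvals.items() if q < fdr}
```

Applying the same test to the erroneously predicted drugs identifies the MoAs (e.g. prostanoid and glucocorticoid agonists) enriched only in that group.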