Comparing within-gene variant distributions
Many of the presented analyses required comparing the distributions of within-gene variant locations between cancer types. Within each gene, we collected all the called variants (from all samples) and partitioned them into six groups according to the cancer types they originated from. We considered only the per-gene exomic locations of the variants (e.g. coordinates 0 to ~8300 for BRCA1). Denote by Lg,t = (Lg,t(1),...,Lg,t(kg,t)) the collection of the gene exomic locations of all kg,t called variants in a given gene g originating from samples of a given cancer type t. For example, if singleton germline variants were called at nucleotide positions 17 and 65, and an additional variant was called in two individuals at position 183 of the KRAS transcript in SKCM (Skin Cutaneous Melanoma) samples, then LKRAS,SKCM = (17,65,183,183). Note that the same locations, or even the same variants, may appear multiple times in such a collection (e.g. if a variant is called in multiple samples).
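The construction of the collections Lg,t can be illustrated with a minimal Python sketch. The record layout (gene, cancer type, sample identifier, exomic position) and the function name `collect_locations` are assumptions made for illustration and are not taken from the analysis pipeline; the sample identifiers are invented.

```python
from collections import defaultdict


def collect_locations(calls):
    """Group per-gene exomic positions by (gene, cancer_type).

    `calls` is an iterable of (gene, cancer_type, sample_id, position)
    records, one per called variant per sample (hypothetical layout).
    A position called in several samples is kept once per sample.
    """
    collections = defaultdict(list)
    for gene, cancer_type, _sample_id, position in calls:
        collections[(gene, cancer_type)].append(position)
    # Sort each collection so that it reads like the tuples in the text.
    return {key: tuple(sorted(values)) for key, values in collections.items()}


# The KRAS / SKCM example from the text: singletons at positions 17 and 65,
# and one variant at position 183 called in two individuals.
calls = [
    ("KRAS", "SKCM", "sample_A", 17),
    ("KRAS", "SKCM", "sample_B", 65),
    ("KRAS", "SKCM", "sample_C", 183),
    ("KRAS", "SKCM", "sample_D", 183),
]
locations = collect_locations(calls)
print(locations[("KRAS", "SKCM")])  # (17, 65, 183, 183)
```

Keeping repeated positions, rather than deduplicating them, matters because the empirical distribution compared downstream weights each location by the number of samples carrying a variant there.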
To compare two cancer types t and s for a given gene g, and to obtain a p-value for the difference in the distributions of variants within that gene between the two cancer types, we applied a two-sided Kolmogorov-Smirnov (KS) test to the two (cumulative) empirical distributions of the collections, denoting the resulting p-value by pg,(t,s) = KS(Lg,t,Lg,s).
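As a rough illustration of the test, the following self-contained Python sketch computes the two-sample KS statistic and a two-sided p-value. The text does not state which implementation was used or whether the p-value was exact or asymptotic, so the asymptotic Kolmogorov series used here is an assumption; it can be inaccurate for small collections. The function names and the STAD positions are hypothetical. In practice a library routine such as `scipy.stats.ks_2samp` would normally be used instead.

```python
import math


def ks_statistic(xs, ys):
    """Largest absolute gap between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # Step past every copy of v in both samples so that ties
        # (repeated locations) are handled before comparing the CDFs.
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue_asymptotic(d, n, m):
    """Two-sided p-value from the limiting Kolmogorov distribution
    (an assumed approximation, not necessarily the one used in the text)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        return 1.0  # the survival function is numerically 1 in this range
    total = 0.0
    for k in range(1, 101):
        term = math.exp(-2.0 * k * k * lam * lam)
        total += term if k % 2 == 1 else -term
        if term < 1e-12:
            break
    return min(1.0, max(0.0, 2.0 * total))


def ks_test(xs, ys):
    """Return (statistic, p-value) for two non-empty location collections."""
    if not xs or not ys:
        raise ValueError("both collections must be non-empty")
    d = ks_statistic(xs, ys)
    return d, ks_pvalue_asymptotic(d, len(xs), len(ys))


# KRAS locations in SKCM from the text versus made-up STAD locations.
l_kras_skcm = (17, 65, 183, 183)
l_kras_stad = (20, 70, 150)  # hypothetical values for illustration only
statistic, p_value = ks_test(l_kras_skcm, l_kras_stad)
print(statistic, p_value)
```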
To obtain a final summary measure for the possible presence of a batch effect within a gene (with respect to the distribution of variants along it), we took the ratio of the KS p-value of an intra-sequencing-center pair to the KS p-value of an inter-sequencing-center pair. Specifically, we defined the ratio rg = pg,min/pg,max of the minimum of the p-values of BI-BI pairs, pg,min = min(pg,(SKCM,STAD), pg,(SKCM,THCA), pg,(STAD,THCA)), to the maximum of the p-values of BI-WUGSC pairs, pg,max = max(pg,(SKCM,BRCA), pg,(SKCM,UCEC), pg,(STAD,BRCA), pg,(STAD,UCEC), pg,(THCA,BRCA), pg,(THCA,UCEC)). We declared a gene to be possibly affected by the batch effect if rg > 1. By taking a minimum-to-maximum ratio, we adopted a conservative criterion for the presence of the batch effect, requiring that all between-center p-values be smaller than all within-center p-values. As reported, only 33% of the analyzed genes yielded a ratio rg < 1, indicating no batch effect.
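The ratio criterion can be sketched in Python as follows. The grouping of cancer types by sequencing center follows the pairs listed above; the function name, the dictionary layout, the handling of a zero denominator, and all p-values in the example are assumptions introduced for illustration only.

```python
import math
from itertools import combinations, product

# Cancer types grouped by sequencing center, following the pairs in the text.
BI_TYPES = ("SKCM", "STAD", "THCA")
WUGSC_TYPES = ("BRCA", "UCEC")


def batch_effect_ratio(pvalues):
    """Return (p_min, p_max, r) for one gene.

    `pvalues` maps frozenset({t, s}) to the KS p-value of cancer types
    t and s for that gene (hypothetical layout).
    """
    within = [pvalues[frozenset(pair)] for pair in combinations(BI_TYPES, 2)]
    between = [pvalues[frozenset(pair)] for pair in product(BI_TYPES, WUGSC_TYPES)]
    p_min = min(within)   # smallest within-center (BI-BI) p-value
    p_max = max(between)  # largest between-center (BI-WUGSC) p-value
    # Assumption not stated in the text: treat a zero denominator as infinity.
    ratio = math.inf if p_max == 0 else p_min / p_max
    return p_min, p_max, ratio


# Made-up p-values for a single gene, for illustration only.
example = {
    frozenset(("SKCM", "STAD")): 0.4,
    frozenset(("SKCM", "THCA")): 0.6,
    frozenset(("STAD", "THCA")): 0.5,
    frozenset(("SKCM", "BRCA")): 0.01,
    frozenset(("SKCM", "UCEC")): 0.02,
    frozenset(("STAD", "BRCA")): 0.001,
    frozenset(("STAD", "UCEC")): 0.03,
    frozenset(("THCA", "BRCA")): 0.005,
    frozenset(("THCA", "UCEC")): 0.002,
}
p_min, p_max, r = batch_effect_ratio(example)
flagged = r > 1  # possibly affected by the batch effect
print(p_min, p_max, flagged)
```

Because the numerator is a minimum and the denominator a maximum, r > 1 holds only when every between-center p-value is smaller than every within-center p-value, which is what makes the criterion conservative.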