Background
High-grade serous ovarian carcinoma (HGS-OvC) is the most lethal gynecological cancer, with around 22,000 new cases and 14,270 deaths per year in the United States. It ranks fifth among causes of cancer death in women [1–3]. The Cancer Genome Atlas (TCGA) program performed a comprehensive “omics” characterization of HGS-OvC (TCGA-OV). They studied 489 ovarian cancer samples, integrating copy number variation, transcriptomic, methylation array, and micro-RNA expression data, and performed exome sequencing for 316 of the samples [4]. Patients in the TCGA-OV project had advanced primary ovarian cancer, with 5% diagnosed at stage 2, 79% at stage 3 and 16% at stage 4.
TCGA-OV researchers identified mutations that are important in ovarian tumors by comparing pathogenic variants to those found in the Catalogue of Somatic Mutations in Cancer (COSMIC) and Online Mendelian Inheritance in Man (OMIM), and by predicting the mutations’ impacts on protein function. The TCGA-OV study further analyzed the significance of all mutated genes compared to the background mutation rate (BMR), which represents the rate of random mutation. These estimates assume that most observed mutations are neutral and do not confer any selective advantage or disadvantage [5, 6].
The TCGA-OV research network estimated the significance of gene mutations mainly with frequency-based criteria: a gene is identified as carrying a driver mutation if it is altered in significantly more patients than expected under the background model. Mutations in some genes, such as TP53, are detected in a large proportion of patients across different cancers, whereas mutations in other genes occur at low rates. For each gene, they calculated the probability of seeing the observed set of mutations and reported nine significantly mutated genes out of the 9986 observed mutated genes. TP53 was found to be mutated in more than 96% of all samples, as previously reported [7–10]. BRCA1/2 variants were also found in 22% of tumors (a combination of germline variation and somatic mutations). The TCGA-OV group also identified significantly mutated genes that occur at a low frequency, in only 2–6% of tumor samples. These genes are RB1, NF1, FAT3, CSMD3, GABRA6, and CDK12 [4].
While the TCGA-OV characterization of the spectrum of somatic mutations in ovarian cancer has had a high impact, cancer arises from a complex interplay between genes in cells and environmental factors [11], and both mutated and non-mutated genes interact to enable the acquisition of the hallmarks of cancer [12]. Understanding which non-mutated genes are important for tumors could lead to the development of new and more effective drug targets. Most studies have focused only on mutated genes because it is difficult to assign significance to non-mutated genes, since most genes in a single patient are non-mutated. However, using high-throughput sequencing data from many patients, it is possible to estimate the significance of non-mutated genes by comparing observed mutation frequencies to expected mutation frequencies and identifying genes with lower mutation frequencies than expected.
In this study, we used a computational biology approach and set up in-silico mutagenesis experiments. This allowed us to identify a subset of genes with fewer mutations in the observed data than expected from the simulated data, which we call non-mutated genes. We hypothesized that non-mutated genes were essential to tumor function. Pathway analysis showed that non-mutated genes interact in cancer-related pathways. Gene expression studies showed that non-mutated genes were well-expressed in cell lines and ovarian cancer tissues from patients. We also verified the relevance of these genes to tumor biology using proof-of-concept siRNA-based experiments. We conclude that non-mutated genes are potentially important for ovarian cancer tumor biology and could lead to new therapeutic strategies.
Methods
In-silico mutagenesis approach
We obtained somatic mutation data for the TCGA Ovarian Cancer Project from the GDC Data Portal (https://portal.gdc.cancer.gov/). We implemented a method to efficiently simulate mutations across a set of nucleotide sequences in Matlab, as previously described in Malek, Halabi and Rafii [13]. The TCGA-OV data consisted of 316 patients, so we performed a simulation run 316 times. Since mutations were random, each simulated set of 316 patients was also repeated 100 times, giving 31,600 simulated runs in total. Each simulation run consisted of simulating the mutagenesis of 40,362,938 nucleotide bases. Furthermore, since different bases mutate at different rates, it was necessary to implement a way to differentially mutate different sets of nucleotide bases. The sequence space was therefore divided into nucleotide bases that were (1) A or T, (2) C or G, or (3) CG or GC. These sets were assigned different mutation rates. We used the background mutation rates (BMR) published in the TCGA study [4] in Additional file 1: Table S2.2b (A/T mutations: 8.54 × 10⁻⁷, C/G mutations: 1.2 × 10⁻⁶, CG/GC mutations: 4.31 × 10⁻⁶, and insertions-deletions: 2.2 × 10⁻⁷). Since no information about the sequence specificity of insertions-deletions was available, we added the indel mutation rate to the other categories. Three random mutation vectors were generated, each of a length equal to the number of bases in the A/T, C/G, or CG/GC set. Each random mutation vector consisted of 0’s and 1’s, with 1’s placed at random at a density equal to the TCGA published background mutation rate. The three mutation vectors were then combined to form the final mutation vector containing all simulated mutations. We used a reduced sequence library corresponding to the sequences that overlapped with the Agilent SureSelect v2 probe sequences. To build this reduced library, we obtained the chromosomal locations of each probe from Agilent and identified the regions corresponding to those probes by detecting the overlaps between the exon coordinates and the probe coordinates. The final exon sequence library consisted of 40,362,938 bases. After carrying out simulated mutagenesis, the total number of mutations per gene was calculated by identifying all the exons corresponding to a gene. All exons sharing the same gene symbol were considered the same gene.
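The simulation described above can be sketched in a few lines. This is a minimal Python illustration, not the Matlab implementation used in the study. The function names, the toy inputs, and the way the indel rate is folded in (added in full to each category) are assumptions made for the example. Instead of building three separate mutation vectors and merging them, the sketch draws each base independently at the rate of its class, which should give an equivalent combined vector.

```python
import random

# Background mutation rates per base as quoted in the text (TCGA Table S2.2b).
BMR = {"AT": 8.54e-7, "CG": 1.2e-6, "CpG": 4.31e-6}
INDEL_RATE = 2.2e-7  # assumption: added in full to every category
RATES = {label: rate + INDEL_RATE for label, rate in BMR.items()}


def classify_bases(seq):
    """Label each base: 'CpG' if it lies in a CG or GC dinucleotide, else 'CG' or 'AT'."""
    seq = seq.upper()
    labels = []
    for i, base in enumerate(seq):
        # Dinucleotides formed with the previous and with the next base.
        pairs = {seq[max(i - 1, 0):i + 1], seq[i:i + 2]}
        if pairs & {"CG", "GC"}:
            labels.append("CpG")
        elif base in "CG":
            labels.append("CG")
        else:
            labels.append("AT")
    return labels


def simulate_patient(labels, rng, rates=RATES):
    """Return a 0/1 mutation vector; each base mutates independently at its class rate."""
    return [1 if rng.random() < rates[label] else 0 for label in labels]


def mutations_per_gene(mutation_vector, exon_spans):
    """Sum mutations over all exons sharing a gene symbol.

    exon_spans: iterable of (gene_symbol, start, end) with half-open [start, end).
    """
    counts = {}
    for gene, start, end in exon_spans:
        counts[gene] = counts.get(gene, 0) + sum(mutation_vector[start:end])
    return counts
```

A full run in the spirit of the text would call `simulate_patient` once per patient (316 times) with a seeded `random.Random`, repeat that cohort 100 times, and average the per-gene counts to obtain the expected mutation frequency of each gene.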
Identification and pathway analysis of specific candidate genes
With the ability to calculate the simulated mutation frequency for each gene, it is possible to compare the observed mutation frequency in a gene with the expected simulated mutation frequency. To prioritize the genes with the largest deviations from the random simulation expectation, we ranked genes by the ratio of observed to expected (simulated) mutation frequency and looked at the top 50 genes where the observed mutation rate was lower than expected and the top 50 genes where it was higher than expected. To guarantee coverage in the TCGA-OV dataset, we restricted our analysis to genes that were mutated at least once in the TCGA data, since the publicly available data only included a list of mutations per patient and not the coverage across all positions.
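The ranking step can be illustrated as follows. This is a hedged sketch: the function name `rank_by_ratio` and the dictionary inputs are hypothetical, and ties are broken arbitrarily by sort order.

```python
def rank_by_ratio(observed, expected, top_n=50):
    """Rank genes by observed/expected mutation count.

    observed: gene -> mutation count in the cohort
    expected: gene -> mean simulated mutation count
    Genes never mutated in the observed data (or with no simulated
    expectation) are skipped, mirroring the restriction in the text.
    Returns (lowest-ratio genes, highest-ratio genes, all ratios).
    """
    ratios = {
        gene: observed[gene] / expected[gene]
        for gene in observed
        if observed[gene] > 0 and expected.get(gene, 0) > 0
    }
    ordered = sorted(ratios, key=ratios.get)
    return ordered[:top_n], ordered[::-1][:top_n], ratios
```

The first list corresponds to candidates mutated less than expected (the "non-mutated" genes) and the second to genes mutated more than expected.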
We also performed pathway analysis using Ingenuity Pathway Analysis software (IPA from Qiagen, content version March 12, 2022). IPA software consists of a database of published relationships between genes with tools to analyze and visualize pathways. A list of the top 50 non-mutated genes with the observed/expected ratio of each gene was generated and uploaded to IPA. Network diagrams were generated among the genes in the list, with genes colored in shades of red based on their observed/expected ratio and the reddest indicating the lowest ratio. Networks were either built using the IPA tools (CONNECT, PATHWAY EXPLORER, TRIM, KEEP) or identified automatically by IPA software as indicated. Automatically generated network significance was assessed with an IPA generated score which represents the negative exponent of the right-tailed Fisher’s exact test result (described in the IPA documentation: http://qiagen.secure.force.com/KnowledgeBase/articles/Basic_Technical_Q_A/Listing-of-Networks).
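To make the score definition concrete, a right-tailed Fisher's exact test on a 2×2 table is equivalent to an upper-tail hypergeometric probability, and the score is the negative base-10 logarithm of that p-value. The sketch below is only an illustration of this definition. The counts that IPA actually feeds into the test are not described here, so the argument names are assumptions.

```python
from math import comb, log10


def fisher_right_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K are of interest.

    k: genes of interest observed in the network
    n: genes in the network
    K: genes of interest overall
    N: genes in the reference set
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def network_score(k, n, K, N):
    """Negative base-10 logarithm of the right-tailed p-value."""
    return -log10(fisher_right_tail(k, n, K, N))
```

For example, with 3 genes of interest among 10, a network of 4 genes containing at least 2 of them has a p-value of 70/210 = 1/3, which corresponds to a score of about 0.48.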
Cell culture
We used three ovarian cancer cell lines for silencing experiments: SKOV3, OVCAR3 and APOCC. SKOV3 and OVCAR3 were purchased from ATCC, and APOCC was derived in-house from ascites of a patient with Stage III serous adenocarcinoma. These cell lines were all maintained in DMEM high glucose (Hyclone, Thermo Scientific), 10% FBS (Hyclone, Thermo Scientific), 1% Penicillin-Streptomycin-Amphotericin B solution (Sigma), and 1X non-essential amino acids (Hyclone, Thermo Scientific). Additionally, one non-cancer primary ovarian epithelial cell line was purchased from ScienCell (Cat. No. 7310) and cultured in a poly-L-lysine-coated culture vessel (2 μg/cm², T-75 flask) following ScienCell recommendations in Ovarian Epithelial Cell Medium (OEpiCM, Cat. No. 7311), 1% Ovarian Epithelial Cell Growth Supplement (OEpiCGS, Cat. No. 7352) and 1% penicillin/streptomycin solution (p/s, Cat. No. 0503). All cultures were incubated in humidified 5% CO₂ incubators and the media was replaced every three days. For RNA sequencing, we used ovarian cancer cell lines SKOV3, APOCC, GOC-2, and GOC-A2 [14, 15], non-cancer ovary derived fibroblasts (ScienCell, Cat. No. 7330) and non-cancer primary ovarian epithelial cell lines (ScienCell, Cat. No. 7310). GOC-2 cells were isolated from a papillary serous ovarian cancer obtained after neoadjuvant chemotherapy, while GOC-A2 cells were derived from a stage IIIc serous ovarian cancer [14]. GOC-2, GOC-A2 and fibroblasts were cultured in DMEM high glucose as previously described.
Gene expression analysis of cell lines and TCGA-OV patient data
RNA from six different cell lines was isolated using the Qiagen Allprep DNA/RNA miniprep kit as per the manufacturer's instructions. Library preparation was done with Nugen’s Ovation Single Cell RNA-Seq System. Sequencing (100 bp paired-end reads) was done on an Illumina HiSeq 2500. Alignment to GRCh37 was done with STAR [16]. Mapping to genes was done with Rsubread using the featureCounts function [17]. Normalization and quantification of gene expression was done with edgeR [18]. All genes with any read count were included. The reads per kilobase of transcript per million mapped reads (RPKM) measure was calculated for all genes in all cell lines and used for distribution comparison.
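For reference, the RPKM measure reduces to a single expression. The sketch below is a bare Python illustration of the standard definition. It is not the edgeR routine used in the study, and the function and argument names are chosen for the example.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    Equivalent to read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6).
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)
```

A gene of 2,000 bp with 500 mapped reads in a library of 10 million mapped reads has an RPKM of 25.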
Publicly available gene expression data from the TCGA-OV project was downloaded from the GDC data portal (https://portal.gdc.cancer.gov/legacy-archive) using the following filters: Primary-site = Ovary, Data-category = Gene expression and Platform = HT_HG-U133A. This data consisted of gene-level, robust multiarray analysis (RMA) normalized and background-corrected expression values for 12,042 genes from primary ovarian cancer biopsies. The RMA values were used as provided. The gene expression data files were further filtered to include only those that had somatic mutation data. Somatic mutation data was similarly obtained from the GDC data portal (https://portal.gdc.cancer.gov). The intersection between gene expression data and somatic mutation data files resulted in 315 samples for further expression analysis.
Custom scripts in Matlab software (Mathworks) were used for further analysis and visualization. To analyze the distribution of gene expression of both cell lines and TCGA-OV we used the non-parametric, two-sample Kolmogorov–Smirnov test as implemented in Matlab software (version 2019a).
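The two-sample Kolmogorov–Smirnov statistic is the largest vertical distance between the empirical cumulative distribution functions of the two samples. The following is a small Python sketch of that statistic only. It does not compute the p-value that the Matlab routine reports, and the function name is chosen for illustration.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

A value of 0 means the two empirical distributions coincide, and a value of 1 means the samples do not overlap at all.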
Methylation and copy number analysis of TCGA-OV primary ovarian cancer samples
We downloaded from the GDC data portal (data release 32) methylation Beta value data obtained from Illumina human methylation 27 chip from 605 samples. Beta values represent the fraction of methylation at a specific site with 0 representing no methylation and 1 representing complete methylation. We then excluded from the analysis non-primary and non-cancer samples which resulted in 582 samples for further analysis. We aggregated the Beta value data from all patients for the top 50 non-mutated (41 matches) and top 50 mutated genes (33 matches) and compared their distribution using the two-sample Kolmogorov–Smirnov test implemented in Matlab software (version 2021a). When one gene matched multiple methylation sites, the beta values were aggregated across the gene.
Similarly, for copy number analysis we downloaded from the GDC data portal (data release 32) ‘Gene Level Copy Number’ data obtained from the Affymetrix SNP 6.0 array from 589 samples. We excluded non-primary cancer samples to obtain 562 samples for comparison. We matched 48 of the top 50 non-mutated genes and 43 of the top 50 mutated genes and aggregated all the copy number data across all samples. Distributions were compared using the non-parametric, two-sample Kolmogorov–Smirnov test as implemented in Matlab software (version 2021a).
siRNA screening system
Double-stranded siRNAs targeting each gene were obtained from Invitrogen (Silencer® Select Pre-Designed siRNA, Cat. No 4392420; siRNA IDs: LPP, s8270; TRAPPC9, s38115; ELFN2, s41621; ANKLE2, s23124; PGR, s10415; MAP1B, s8499; VEGFA, s461; SLC12A9, s224445; CELSR1, s18485). From the Qiagen RNAi Human/Mouse starter kit (Cat. No 301799), we also selected a positive control siRNA targeting the protein kinase MAPK1, also called ERK2, and a non-targeting (non-silencing) negative control siRNA that exhibits minimal nonspecific effects on gene expression and phenotype. Both the positive and negative controls were included in each 96-well plate.
To assess the degree of knockdown, cells were seeded in 96-well culture plates at a density of 5000 cells/well. cDNA synthesis was carried out 72 h after siRNA transfection using the TaqMan Gene Expression Cells-to-Ct kit (Thermo-Fisher). Normalization was done using the β-actin probe included in the Cells-to-Ct Control kit (Thermo-Fisher). All qPCR reactions were performed in triplicate and Cq values were averaged.
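One common way to turn averaged Cq values into a knockdown estimate is the 2^(−ΔΔCq) calculation. The text states only that Cq values were averaged and normalized to β-actin, so the use of this formula, the comparison against the negative control siRNA, and the function name below are assumptions made for illustration.

```python
from statistics import mean


def relative_expression(target_cq, actin_cq, control_target_cq, control_actin_cq):
    """Remaining target expression relative to the control sample (2^-ddCq).

    Each argument is a list of replicate Cq values (e.g. a triplicate).
    Assumes roughly 100% amplification efficiency for both probes.
    """
    d_treated = mean(target_cq) - mean(actin_cq)                   # dCq, siRNA-treated
    d_control = mean(control_target_cq) - mean(control_actin_cq)   # dCq, control
    return 2 ** -(d_treated - d_control)
```

A result of 0.25 would indicate that about 25% of the control expression level remains, that is, roughly 75% knockdown.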
Cell viability assay
We used Promega’s CellTiter-Glo® assay in 96-well plates. Briefly, cells were seeded at 5000 cells per well in 96-well plates and allowed to attach overnight at 37 °C. Twenty-four hours after attachment, cells were transfected with individual siRNAs at 10 nM using Lipofectamine Max (Thermo Fisher). Twenty-four hours after siRNA treatment, the transfection media was replaced with serum-free media. We used the same siRNA concentrations and transfection reagents in all cell lines and experiments. In addition, positive and negative siRNA controls were added in different wells. Seventy-two hours after transfection, 100 μl of CellTiter-Glo® reagent was added to 100 μl of medium containing cells in a 96-well plate and viability was evaluated using EnVision Workstation version 1.12 from PerkinElmer. All experiments were performed in triplicate. Student’s t-test was used to compare the proliferation fraction of the knockdowns with that of the negative control. A p-value less than 0.05 was considered significant.
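The comparison against the negative control can be sketched as below. This illustration assumes a two-sided, equal-variance Student's t-test on triplicates, so it uses the tabulated 5% critical value for 4 degrees of freedom instead of computing a p-value. The function names and the luminescence inputs are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

# Two-sided 5% critical value of Student's t with 4 degrees of freedom
# (two triplicates: 3 + 3 - 2 = 4).
T_CRIT_DF4 = 2.776


def viability_fraction(knockdown_lum, control_lum):
    """Luminescence of each knockdown well relative to the mean negative control."""
    reference = mean(control_lum)
    return [value / reference for value in knockdown_lum]


def pooled_t(sample_a, sample_b):
    """Student's t statistic with pooled (equal) variance."""
    na, nb = len(sample_a), len(sample_b)
    pooled_var = ((na - 1) * variance(sample_a) + (nb - 1) * variance(sample_b)) / (na + nb - 2)
    return (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var * (1 / na + 1 / nb))


def is_significant(sample_a, sample_b):
    """True if |t| exceeds the critical value; valid only for two triplicates."""
    return abs(pooled_t(sample_a, sample_b)) > T_CRIT_DF4
```

For other replicate numbers the critical value changes with the degrees of freedom, so a statistics package that returns the p-value directly would be the more general choice.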
Morphological marker staining
Seventy-two hours after transfection, cells were incubated with Invitrogen’s live cell stain CellMask Orange/Red for the plasma membrane and with Hoechst 33342. Both cell morphology and nuclear morphology were visualized by confocal microscopy (Zeiss LSM 510).
Wound healing assay
Cancer cells (50,000 cells/well) were plated in 24-well plates in triplicate. Twenty-four hours after siRNA transfection, cells were serum-starved. Forty-eight hours after siRNA transfection, a scratch was made in all the wells with a 1 μL pipette tip. Images were taken directly after the scratch (0H) and again after 24 h (24H) and 48 h (48H). Edges were identified by manual inspection and wound healing was quantified as the ratio of the pixel distance at each timepoint to the 0H distance. Student’s t-test was used to calculate the significance of differences between the ANKLE2 knockdown and the control by combining the 24H and 48H data.
Chemotherapy
Paclitaxel/taxol and carboplatin were purchased from the National Center for Cancer Care and Research (NCCCR; Doha, Qatar) pharmacy. Briefly, cancer cells (5000 cells/well) were plated in 96-well plates in triplicate for each condition. Twenty-four hours after siRNA treatment, the cells were serum-starved. Forty-eight hours after siRNA transfection, each drug suspended in phosphate buffered saline (PBS) was added to each well at a concentration of 50 µM and viability was analyzed after 24 h. Student’s t-test was used to compare the proliferation fraction of different pairs. A p-value less than 0.05 was considered significant.
Discussion
Here we focused on better understanding the role of non-mutated genes in ovarian cancer. We showed that in a few genes there are differences between simulated and observed mutation frequencies in the TCGA ovarian cancer cohort of 316 patients. These genes fell into two categories: genes that are mutated more than expected (such as TP53) and genes that are mutated less than expected, which we call here non-mutated genes. The non-mutated gene set was especially interesting because these genes could be selected against mutation due to their role in tumor biology and could offer new therapeutic strategies. The TCGA study in ovarian cancer uncovered 9,984 genes mutated in 316 patients with a very heterogeneous distribution among patients [94]. Not only are there many mutations but, with the exception of TP53, patients share few mutations. This mutational diversity makes treatment strategies that target mutated genes difficult, as every patient may have a different combination of mutated genes. However, treatment strategies that target non-mutated genes may be more effective, as these non-mutated genes would be the same in different patients if the observed non-mutation is due to selection against mutation. Indeed, one of the genes we identified as non-mutated is VEGFA, which is known to be involved in promoting cancer angiogenesis [65–67, 69, 103].
We found that the non-mutated genes are members of cancer-relevant networks. For example, SHROOM3 has a role in regulating cell shape in tissues [30] and is connected with the SNF complex, which mobilizes nucleosomes, remodels chromatin and opens up the transcription-binding domains, leading to an increase in transcription [104]. A growing number of studies support the role of the SNF complex in cancer development, as several subunits possess intrinsic tumor-suppressor activity [105]. Furthermore, several non-mutated genes interact indirectly and directly with the AKT network, which modulates the function of numerous substrates involved in the regulation of cell survival, cell cycle progression, cellular growth, and neo-vascularization [106, 107]. One of the most interesting genes we observed to be significantly non-mutated was ANKLE2, which is a member of the LEM family of inner nuclear membrane proteins. This gene functions as a mitotic regulator through the post-mitotic formation of the nuclear envelope [26]. Our inhibition strategy confirmed the important role of ANKLE2 in different tumor-associated phenotypic traits.
We observed that non-mutated genes are generally well-expressed in both non-cancer and cancer tissues. This could limit the clinical use of targeting non-mutated genes, as there could be significant side effects due to deleterious effects on non-cancer cells. However, targeting non-mutated genes could still be a viable strategy if cancer cells display greater sensitivity than non-cancer cells to inhibition of non-mutated genes. The greater sensitivity of cancer cells to radiation or chemotherapy compared to non-cancer cells has resulted in the wide use of these treatment modalities, although with significant side effects. We have performed one experiment showing that non-cancer ovarian epithelial cells are less susceptible than cancer cell lines to the effects of silencing in our viability assay. While these results are promising, they need to be further validated across different cell lines and especially across different cellular contexts. Cells grown in 2D monocultures are very different from cells grown in co-culture with other cells or in 3D organoids and from cells in tissues. It will be interesting to explore the differential sensitivity of cancer cells and non-cancer cells to non-mutated gene inhibition in future studies. Furthermore, we identified several non-mutated genes where three out of four cancer cell lines had high expression but where expression was low in non-cancer cells. These genes may be interesting therapeutic targets if this pattern is also seen across more cancer and non-cancer cells, as targeting them could result in reduced toxicity to non-cancer cells.
A related point that could affect therapeutic effectiveness is whether these genes are non-mutated because they are housekeeping genes, in which any mutation would be highly deleterious to all cells. Commonly known housekeeping genes include ACTB, which is part of the actin protein family, RAB7A, which belongs to the RAS oncogene family, and GAPDH (glyceraldehyde 3-phosphate dehydrogenase). In our study, these genes did not display any selection against mutation. To our knowledge, the top 50 non-mutated genes identified are not classical housekeeping genes.
Our analysis is novel, as most studies have focused on mutated genes. Further data can help refine our analysis, as we used only the restricted publicly available datasets in this work. Sequencing with better coverage, such as whole-genome sequencing, would improve this analysis since it would give a much better coverage distribution. In addition, it would be interesting in future studies to develop single-cell and deep sequencing experiments in rapidly dividing cancer cells in culture across different time points to determine the distribution of mutations, including low frequency mutations. With sufficient coverage it will also be possible to determine if there are significantly non-mutated genes in this context.
Comparing our data to the TCGA-OV study [4] shows that the top 50 mutated genes we identified include two of the nine genes the TCGA identified as being significant. These genes are TP53 and RB1. The TCGA-OV study used complex statistical models that considered sequence context in addition to the overall prevalence of mutations. Our random mutation model here is relatively simple, and the mutation probability of a specific base is independent of any other base. One possibility this limitation raises is that genes can be observed to be non-mutated not because they are selected against but because they have a sequence context that greatly reduces the chance of mutation. While these mutation-resistant genes could still make interesting targets if cancer cells are sensitive to their inhibition, their identification would require both high coverage data and improved mutation simulation models. Our overall approach here was to combine random simulation results with pathway analysis, gene expression and functional testing of selected genes.
In this study, we exploited large-scale cancer genomic databases and bioinformatics approaches to discover novel therapeutic candidates. Our combined bioinformatics and silencing approach could potentially lead to discoveries of interesting candidates without the need for complex, costly, high-throughput screening approaches. Understanding the broader landscape of non-mutated genes using combined TCGA datasets could lead to understanding key selection processes in place in cancer evolution and identifying critical steps that could be used as therapeutic targets.