Background
Lung cancer is a leading cause of cancer morbidity and mortality worldwide [1–3]. Non-small cell lung cancer (NSCLC) accounts for 80–85% of cases and differs in biological processes and pathological appearance from small cell lung cancer (SCLC), which accounts for the other 10–15% [4]. NSCLC includes lung adenocarcinoma, squamous cell carcinoma and large cell carcinoma. Although cancer is still a challenging and incurable disease, emerging therapies, including immunotherapy and targeted therapy, are showing promising effects in clinical treatment [5]. Progress in targeted therapy has been greatest in lung adenocarcinoma, where nearly ten genes have been developed as drug targets, including epidermal growth factor receptor (EGFR), anaplastic lymphoma kinase (ALK), ROS1, RET, HER2, BRAF, PIK3CA, KRAS, NRAS and MET [6–8]. The drugs developed according to the status of these genes have all shown encouraging curative effects [8–14].
However, the ten available targets are still few relative to the highly heterogeneous, complex and progressive nature of cancer development [4, 15–17]. It is well known that a main reason for the incurability of cancer is its rapid adaptation to changes in the external environment: malignant tumors possess ever-changing characteristics in response to different clinical treatments [9, 18]. For the other NSCLC subtypes, squamous cell carcinoma and large cell carcinoma, drug targets are scarcer still. In squamous cell carcinoma, for instance, only FGFR2 and DDR2 are known to be aberrantly mutated and could potentially be developed into clinical drug targets, and the drugs against both are still in clinical trials [19]. For large cell carcinoma, no probable drug target has been identified yet [20]. It is therefore of vital importance to keep identifying new prognostic biomarkers and other potential gene targets [21].
Recently, high-throughput technologies have advanced greatly and generated a tremendous amount of clinical data, providing a rich source for researchers to better understand the molecular basis of cancer development, to identify disease-causing gene alterations, and thereby to explore potential drug targets for therapeutic intervention [22–24]. A large portion of these data is publicly accessible to researchers worldwide. Bioinformatics is a data-driven branch of science in which many algorithms and databases have been developed to analyze different types of data [25]. Many analysis tools, including software, databases and web services, are powerful and free [25–28]; although some software is commercial, it can be purchased at very low cost by students and staff of educational institutions [29].
In this study, multiple bioinformatic tools were applied to analyze four cDNA expression profiles from the Gene Expression Omnibus (GEO) database: GSE19188, GSE101929, GSE18842 and GSE33532. First, the GEO2R tool was used to detect differentially expressed genes (DEGs) between NSCLC and normal lung tissues, and the DEGs shared by all four profiles were selected. Second, the protein–protein interaction (PPI) network of the shared DEGs was constructed using Cytoscape 3.6.0 software, and the core gene with the highest connectivity degree was identified. Then, its correlation with the overall survival (OS) of NSCLC patients was evaluated with the Kaplan–Meier plotter online database, and its clinical significance was analyzed based on immunohistochemistry (IHC) results. Finally, the signaling functions potentially underlying the core gene's regulation of NSCLC development were preliminarily explored, and genes that co-operate with it were explored using STRING, Oncomine and GEPIA. The results may provide insights for the discovery of prognostic biomarker candidates and new potential therapeutic targets for NSCLC patients.
Materials and methods
Data source: cDNA expression profiles from GEO database
Four cDNA expression profiles, GSE19188 [30], GSE101929 [31], GSE18842 [32] and GSE33532 [33], were chosen from the GEO online public database [34] based on sample size and publication date (we mainly focused on profiles that contain at least 20 paired samples and were published recently). All four profiles were based on the GPL570 [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array platform. GSE19188 contained 91 NSCLC samples and 20 normal lung tissues; GSE101929 contained 32 NSCLC samples and 34 normal lung samples; GSE18842 contained 46 NSCLC samples and 45 normal lung samples; and GSE33532 contained 80 NSCLC samples and 20 normal lung samples.
Identification of DEGs between NSCLC and normal lung tissue
To identify the DEGs between NSCLC and normal lung tissues, the GEO2R tool [35], a public interactive online service, was applied to each of the four cDNA profiles. The selection criteria for DEGs were an adjusted P value < 0.05 and |log2FC| ≥ 2. The E Chart online service for Venn diagrams was then used to screen the DEGs shared by all four cDNA profiles. In addition, GO and KEGG analyses were used to preliminarily identify the main biological processes, molecular functions and signaling pathways in which the DEGs were enriched.
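The selection rule above can be illustrated with a minimal sketch. It assumes each profile has already been exported as rows of (gene symbol, adjusted P value, log2 fold change); the function names, the row layout and the toy values are hypothetical and are not part of GEO2R or of the study data.

```python
# Thresholds taken from the text: adjusted P value < 0.05 and |log2FC| >= 2.
ADJ_P_MAX = 0.05
LOG2FC_MIN = 2.0


def select_degs(rows):
    """Return the set of gene symbols that pass both thresholds.

    rows: iterable of (gene_symbol, adjusted_p, log2_fold_change) tuples
    (a hypothetical layout chosen for this sketch).
    """
    return {
        gene
        for gene, adj_p, log2fc in rows
        if adj_p < ADJ_P_MAX and abs(log2fc) >= LOG2FC_MIN
    }


def shared_degs(profiles):
    """Intersect the DEG sets of all profiles (the core of the Venn diagram)."""
    deg_sets = [select_degs(rows) for rows in profiles.values()]
    return set.intersection(*deg_sets) if deg_sets else set()


# Toy values for illustration only; they are not data from the study.
toy_profiles = {
    "profile_A": [("GENE1", 0.001, 3.1), ("GENE2", 0.20, 2.5), ("GENE3", 0.01, -2.4)],
    "profile_B": [("GENE1", 0.004, 2.2), ("GENE3", 0.03, -2.0), ("GENE4", 0.01, 1.2)],
}
shared = shared_degs(toy_profiles)
print(sorted(shared))
```

Using the absolute value of log2FC keeps both up-regulated and down-regulated genes, which matches the two-sided criterion stated in the text.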
PPI network construction and core gene identification
To construct the PPI network of the shared DEGs, we used the Search Tool for the Retrieval of Interacting Genes (STRING), an online database for searching direct (physical) and indirect (functional) associations between proteins. STRING currently contains information on 9,643,763 proteins from 2031 species [27, 36]. The cut-off criteria for network construction were a confidence score ≥ 0.4 and a maximum number of interactors = 0. The gene with the highest connectivity degree in the PPI network was then identified using Cytoscape 3.6.0 software [37].
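The idea of ranking genes by connectivity degree can be sketched as follows. This is only an illustration under stated assumptions: the edge list layout (gene, gene, confidence score) and the function name are hypothetical, the placeholder node names are not real interactions, and the sketch does not reproduce how STRING or Cytoscape compute their results.

```python
from collections import defaultdict


def degree_ranking(edges, min_score=0.4):
    """Rank nodes by the number of distinct partners.

    edges: iterable of (node_a, node_b, confidence_score) tuples
    (a hypothetical layout). Edges below min_score are ignored,
    mirroring the confidence score >= 0.4 cut-off in the text.
    """
    neighbours = defaultdict(set)
    for node_a, node_b, score in edges:
        if score >= min_score and node_a != node_b:
            neighbours[node_a].add(node_b)
            neighbours[node_b].add(node_a)
    # Sort by descending degree, then by name so ties are deterministic.
    return sorted(
        ((node, len(partners)) for node, partners in neighbours.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Placeholder nodes and scores for illustration only.
toy_edges = [
    ("A", "B", 0.9),
    ("A", "C", 0.7),
    ("A", "D", 0.5),
    ("B", "C", 0.6),
    ("C", "D", 0.3),  # below the cut-off, so it is not counted
]
ranking = degree_ranking(toy_edges)
hub_node, hub_degree = ranking[0]
print(hub_node, hub_degree)
```

Storing partners in a set means duplicated edges do not inflate the degree of a node.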
Kaplan–Meier survival analysis
Kaplan–Meier plotter is an open-access online service for overall survival analysis of various cancers, including lung, breast, gastric and ovarian cancer as well as hepatocellular carcinoma, and contains a total of 10,461 patient samples with their clinical information [38, 39]. In this study, we used Kaplan–Meier plotter to analyze the correlation between TOP2A expression and the OS of NSCLC patients and to draw the survival curves. Additionally, clinical and mRNA transcription data for 574 lung adenocarcinoma and 555 lung squamous cell carcinoma samples were downloaded from the TCGA database for multivariate Cox regression analysis and for exploring the relationship between TOP2A expression and clinical parameters.
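For readers unfamiliar with the method, the product-limit (Kaplan–Meier) estimate that underlies such survival curves can be sketched in a few lines. This is a generic textbook estimator written for illustration, not the implementation used by the Kaplan–Meier plotter service; the function name and the toy follow-up values are hypothetical.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Product-limit estimate of the survival function.

    durations: follow-up time of each patient.
    events: 1 if death was observed at that time, 0 if the patient was censored.
    Returns a list of (time, survival_probability) at each distinct event time.
    """
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")
    deaths = Counter(t for t, e in zip(durations, events) if e)
    leaving = Counter(durations)  # deaths plus censored patients at each time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Patients censored at time t still count as at risk at time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Toy follow-up times (months) and event flags, for illustration only.
toy_durations = [5, 8, 8, 12, 15]
toy_events = [1, 1, 0, 1, 0]
curve = kaplan_meier(toy_durations, toy_events)
for time, prob in curve:
    print(time, round(prob, 3))
```

In practice one such curve would be computed for each expression group (for example, high versus low TOP2A) and the two curves compared; how the groups are split is a choice that the sketch leaves open.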
GEPIA analysis of gene expression
GEPIA is a recently developed online tool based on sequencing data from 9736 cancer and 8587 normal samples from the TCGA and GTEx projects. It is commonly used to analyze differences in the expression of a given gene between cancer and normal tissues in various tumor types [40, 41]. In this study, we used GEPIA to preliminarily explore the difference in TOP2A expression between NSCLC and normal lung samples.
Immunohistochemistry (IHC) experiment reagents and tissue samples
All clinical patient samples were stored in our biobank. They were collected during routine surgeries at the General Surgery Department and sent for pathological examination at the Department of Pathology of the local hospital. Informed consent from the patients and approval by the Hospital Institutional Board were obtained (ShanXi, China).
IHC was performed on the VENTANA platform (Roche). The TOP2A recombinant rabbit monoclonal primary antibody (SY27-00) was purchased from Invitrogen; the secondary antibody (Envision/HRP kit) and the DAB detection kit were from ZSBG-Bio. Other reagents, including H2O2, phosphate-buffered saline (PBS) and hematoxylin stain, were from the hospital supply department.
Immunohistochemistry (IHC) experiment protocol
IHC was conducted to confirm the difference in gene expression between NSCLC and normal lung tissues using 107 biobank cancer samples, following the procedure below.
The 107 paraffin-embedded tissues were first assembled into tissue arrays and sectioned. The stored sections were taken out of the refrigerator and rewarmed at room temperature for 20 min, followed by deparaffinization, rehydration and boiling for 10 min in 10 mmol/l citrate buffer for antigen retrieval. The sections were then soaked in methanol containing 0.3% H2O2 for 10 min to inhibit endogenous peroxidase activity. After blocking with bovine serum albumin in PBS for 30 min, the sections were incubated with primary antibody (dilution 1:250) for 2 h at 37 °C in a biochemical incubator, then for another 40 min at 37 °C with species-specific secondary antibodies labeled with horseradish peroxidase (HRP), and finally visualized with DAB, followed by counterstaining of nuclei with hematoxylin.
Evaluation of IHC results
The relative TOP2A protein expression level was evaluated according to both the staining intensity and the staining area of the tissue section. The intensity and area of immunostaining were scored by two experienced pathologists in our department who had no prior knowledge of the clinical and pathological details of the patients. Nuclear staining was regarded as positive according to the TOP2A antibody specification sheet. The staining intensity was classified as none (0), mild (1), moderate (2) or strong (3). The staining area was stratified as < 5% (0), 6–25% (1), 26–50% (2), 51–75% (3) or > 75% (4). The final TOP2A expression score of each sample was obtained by multiplying the staining intensity score by the staining area score. Using a score of 6 as the cutoff, a final score < 6 was defined as negative and a score ≥ 6 as positive [42].
Additionally, the clinical significance of the gene was analyzed based on the clinical data of the above 107 patients.
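The scoring rule in this subsection (intensity score multiplied by area score, with 6 as the cutoff) can be written as a short sketch. The function names are hypothetical. The text leaves percentages between 5% and 6% unassigned; the sketch assumes each bin is closed at its upper bound (so exactly 5% scores 0 and values just above 5% score 1), which is an assumption and not stated in the source.

```python
def area_score(percent_positive):
    """Map the percentage of stained area to the 0-4 area score."""
    if not 0 <= percent_positive <= 100:
        raise ValueError("percent_positive must be between 0 and 100")
    # Assumption: bins are closed at the upper bound (see lead-in).
    if percent_positive <= 5:
        return 0
    if percent_positive <= 25:
        return 1
    if percent_positive <= 50:
        return 2
    if percent_positive <= 75:
        return 3
    return 4


def ihc_score(intensity, percent_positive):
    """Final score = staining intensity (0-3) x staining area score (0-4)."""
    if intensity not in (0, 1, 2, 3):
        raise ValueError("intensity must be 0, 1, 2 or 3")
    return intensity * area_score(percent_positive)


def is_positive(intensity, percent_positive, cutoff=6):
    """A sample is positive when its final score reaches the cutoff of 6."""
    return ihc_score(intensity, percent_positive) >= cutoff


# Illustrative calls with made-up readings.
print(ihc_score(3, 60), is_positive(3, 60))  # strong staining, 60% area
print(ihc_score(2, 40), is_positive(2, 40))  # moderate staining, 40% area
```

With this rule the possible final scores are 0, 1, 2, 3, 4, 6, 8, 9 and 12, so the cutoff separates samples scoring 4 or less from those scoring 6 or more.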
The Oncomine database is a web-based data mining platform that incorporates 264 independent datasets for collecting, standardizing, analyzing and delivering transcriptomic cancer data for biomedical research [43]. In this study, we used Oncomine to analyze TOP2A expression levels in different subtypes of NSCLC and to explore genes co-expressed with TOP2A. For TOP2A expression in NSCLC subtypes, the query terms were: ① analysis type: lung cancer vs normal analysis; ② GENE: TOP2A. For the co-expression analysis, the query terms were: ① GENE: TOP2A; ② analysis type: co-expression analysis; ③ non-small cell lung cancer.
RNA extraction and quantitative real-time PCR (qRT-PCR)
Total RNA was extracted from 30 lung adenocarcinoma and 30 lung squamous cell carcinoma samples, and from the matched adjacent normal tissue of each cancer sample, using RNAiso-Plus (TAKARA, DaLian, China). cDNA was then synthesized from 1 μg of extracted RNA using a cDNA synthesis kit (TAKARA, DaLian, China) according to the manufacturer's instructions. Real-time PCR was performed on a Roche z 480 with the following primers:
TOP2A:
Forward: CATTGAAGACGCTTCGTTATGG
Reverse: CAGAAGAGAGGGCCAGTTGTG
TPX2:
Forward: CTTCCAATCACCGTCCCC
Reverse: TATTTCCACAGTTCTTGCCTCT
GAPDH:
Forward: AGAAGGCTGGGGCTCATTTG
Reverse: AGGGGCCATCCACAGTCTTC
The cycling conditions were: 95 °C for 5 min for 1 cycle; then 40 cycles of 95 °C for 5 s, 60 °C for 30 s and 72 °C for 34 s, followed by the melting curve stage. The relative expression of TOP2A and TPX2 was evaluated using the 2^(−ΔΔCT) calculation, with three replicates for each sample.
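The 2^(−ΔΔCT) calculation can be illustrated with a minimal sketch. It assumes that GAPDH serves as the reference gene and that the matched adjacent normal tissue serves as the calibrator; both are inferred from the primer list and sample design rather than stated explicitly. The function name and the CT values are hypothetical.

```python
from statistics import mean


def relative_expression(ct_target_tumor, ct_ref_tumor,
                        ct_target_normal, ct_ref_normal):
    """Fold change of a target gene in tumor versus matched normal tissue.

    Each argument is a list of replicate CT values (three per sample in
    the text). The reference gene is assumed to be GAPDH.
    """
    # Delta CT: target gene normalized to the reference gene.
    delta_ct_tumor = mean(ct_target_tumor) - mean(ct_ref_tumor)
    delta_ct_normal = mean(ct_target_normal) - mean(ct_ref_normal)
    # Delta-delta CT: tumor normalized to the matched normal tissue.
    delta_delta_ct = delta_ct_tumor - delta_ct_normal
    return 2 ** (-delta_delta_ct)


# Made-up CT values for one tumor/normal pair, for illustration only.
fold_change = relative_expression(
    ct_target_tumor=[22.0, 22.5, 21.5],
    ct_ref_tumor=[18.0, 18.0, 18.0],
    ct_target_normal=[25.0, 25.0, 25.0],
    ct_ref_normal=[18.0, 18.0, 18.0],
)
print(fold_change)
```

A lower CT means more template, so a target gene that crosses the threshold three cycles earlier in tumor than in normal tissue (after normalization) corresponds to an eightfold higher relative expression.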
Statistical analysis
The chi-square test was used to analyze the relationship between TOP2A expression and the clinicopathological features of NSCLC. The t-test was used to analyze the relative mRNA expression of TOP2A and TPX2 in the qRT-PCR experiment, and Pearson correlation analysis was performed to explore the association between TOP2A and TPX2. P < 0.05 was considered statistically significant.
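Two of the tests named above can be sketched with the standard library alone. The sketch assumes a 2×2 contingency table and a chi-square test without continuity correction; whether a correction was applied in the study is not stated. The function names and the toy numbers are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def chi_square_2x2(table):
    """Chi-square statistic and P value for a 2x2 table (no continuity correction).

    table: ((a, b), (c, d)), e.g. rows = expression positive/negative,
    columns = two levels of a clinical feature.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column must have a non-zero total")
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy numbers for illustration only; they are not data from the study.
r = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
chi2, p_value = chi_square_2x2(((10, 20), (20, 10)))
print(round(r, 3), round(chi2, 3), p_value < 0.05)
```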
Discussion
Lung cancer is a common malignant tumor with the highest mortality and morbidity among both male and female cancer patients [4]. NSCLC accounts for 80–85% of lung cancer and includes adenocarcinoma, squamous cell carcinoma and large cell carcinoma. Although current molecular targeted therapy and immunotherapy have brought promising effects for NSCLC treatment, the targets are still limited relative to the highly progressive and evolving cancer cells, and patient outcomes remain poor [15, 16]. This study aimed to identify potential prognostic indicators and new drug targets for NSCLC using bioinformatic analysis.
Bioinformatics is a data-driven branch of science that is commonly used for high-throughput data analysis and involves a large number of powerful analysis tools, software packages and databases [25]. Making good use of these tools is an effective way to avoid unnecessary repeated labour and to mine useful insights buried in high-throughput information, for instance microarray and sequencing "big data".
GEO and TCGA are the two databases most commonly used by researchers worldwide; both are open to the public and hold a tremendous amount of information. In this study, we first chose four cDNA expression profiles, GSE18842, GSE19188, GSE33532 and GSE101929, from the GEO database based on the number of samples and the publication date. The profiles contain a total of 249 NSCLC and 119 normal samples. The GEO2R tool was then used to analyze the DEGs between cancer and normal tissues, revealing 306 DEGs shared by all four profiles, including 214 down-regulated and 92 up-regulated genes. GO and KEGG analysis revealed that most of the 92 up-regulated DEGs were involved in cell cycle and cell division related signaling.
To better understand the relationships among the 306 genes, a PPI network was constructed, and the top 15 genes with the strongest connections to other genes were identified: TOP2A, BIRC5, CDC20, UBE2C, CCNB1, CDK1, AURKA, TTK, CCNA2, BUB1, KIF11, FOXM1, PBK, KIAA0101 and NDC80. Of these 15 genes, TOP2A had the highest connectivity.
TOP2A, short for topoisomerase II alpha, is located at 17q21.2 and encodes an enzyme that controls and alters the topological states of DNA during transcription. This enzyme is known to be involved in processes such as chromosome condensation, chromatid separation, and the relief of torsional stress that occurs during DNA transcription and replication. The disease best known to be associated with TOP2A is female breast cancer, in which the gene is usually deleted or amplified together with ERBB2; the two genes are therefore commonly co-tested in breast cancer patients to guide the use of the anticancer agent Herceptin [44–46]. TOP2A has also been reported to be targeted by tumor suppressors such as miR-144-3p in glioblastoma, resulting in cancer cell apoptosis [47]. In lung cancer, Pabla et al. [48] reported that TOP2A could be a potential new indicator in PD-L1-negative NSCLC; however, deeper analysis is still needed to explain the mechanism. In this study, we analyzed TOP2A function in NSCLC development using bioinformatic tools.
First, Kaplan–Meier plotter overall survival analysis was used to assess the correlation between TOP2A and OS in NSCLC. TOP2A expression correlated significantly with patient OS: higher TOP2A expression was associated with worse OS. Multivariate Cox regression analysis further supported TOP2A expression as an independent prognostic indicator in lung adenocarcinoma, suggesting a probable tumor-promoting role and potential use as a survival indicator in clinical practice.
Then, to validate the aberrant gain of TOP2A expression in NSCLC, GEPIA analysis was first performed, and the results showed that TOP2A was up-regulated in cancer compared with normal tissues. Our IHC experiment, conducted on surgical samples from 107 NSCLC patients hospitalized locally, confirmed these results: the rate of TOP2A gain of expression observed by IHC staining was significantly higher in NSCLC (36.4%) than in normal tissues (less than 1%). Meanwhile, clinical significance analysis showed that TOP2A expression was associated with cancer subtype, patient gender and smoking. TCGA data also supported the association between TOP2A expression and clinical parameters, including patient gender and smoking status.
Additionally, analysis of the signaling pathways involving TOP2A revealed that its main function in NSCLC is also related to cell cycle regulation, consistent with the GO/KEGG analysis of up-regulated DEGs in NSCLC. Three different analysis tools, the Oncomine database, the PPI network and GEPIA, all predicted a positive correlation between TOP2A and TPX2. The qRT-PCR experiment, conducted on 30 paired lung adenocarcinoma and 30 paired lung squamous cell carcinoma samples from the local hospital, validated the association between the two genes, indicating that TPX2 is a probable co-working partner of TOP2A.
TPX2, located at 20q11.21, is one of the main spindle assembly factors and plays a key role in inducing microtubule assembly and growth during M phase of mitosis [49–51]. Previous studies reported that TPX2 mRNA expression during cell cycle progression is high in G2/M phase, decreases dramatically upon G1 phase entry, increases upon entry into S phase, and peaks again at the next G2/M phase [52–56]. The drop in TPX2 is consistent with the drastic reorganization in structure and dynamics of the mitotic spindle [57]. Owing to its important role in microtubule assembly and mitosis, TPX2 has been found to be overexpressed in various human cancers, for instance clear cell renal cell carcinoma [58], esophageal carcinoma [59], hepatocellular carcinoma (HCC) [52, 60], gastric cancer [61] and bladder carcinoma [62]. TPX2 expression has been shown to be positively correlated with poor prognosis, metastasis and recurrence [49, 63].
However, the above results are not yet sufficient to establish TOP2A or TPX2 as a drug target in NSCLC. To distinguish gene aberrations that cause the disease and may serve as drug targets from those that are merely closely linked to the disease and associated with its development, comprehensive and longitudinal experiments, as well as clinical trials, are needed.