Background
Serological biomarkers represent a non-invasive and cost-effective aid in the clinical management of cancer patients, particularly in the areas of disease detection, prognosis, monitoring and therapeutic stratification. For a serological biomarker to be useful for early detection, its serum level must be relatively low in healthy individuals and in those with benign disease. The marker must be produced by the tumor or its microenvironment and enter the circulation, giving rise to increased serum levels. Mechanisms that facilitate entry into the circulation include secretion or shedding, angiogenesis, invasion and destruction of tissue architecture [
1]. The biomarker should preferably be tissue specific, such that a change in serum level can be directly attributed to disease (for example, cancer) of that tissue [
2]. The currently most widely used serological biomarkers include carcinoembryonic antigen (CEA) and carbohydrate antigen 19.9 for gastrointestinal cancer [
3‐
5]; CEA, cytokeratin 19 fragment, neuron-specific enolase, tissue polypeptide antigen, progastrin-releasing peptide and squamous cell carcinoma antigen for lung cancer [
6]; CA 125 for ovarian cancer [
2]; and prostate-specific antigen (PSA, also known as kallikrein-related peptidase (KLK) 3) in prostate cancer [
7]. These current serological biomarkers lack the appropriate sensitivity and specificity to be suitable for early cancer detection.
Serum PSA is commonly used for prostate cancer screening in men over 50 years old, but its usage remains controversial due to serum elevation in benign disease as well as prostate cancer [
8]. Nevertheless, PSA represents one of the most useful serological markers currently available. PSA is strongly expressed only in the prostate tissue of healthy men, with low levels in the serum established by normal diffusion through various anatomical barriers. These anatomical barriers are disrupted upon development of prostate cancer, allowing increased amounts of PSA to enter circulation [
1].
Recent advances in high-throughput technologies (for example, high-content microarray chips, serial analysis of gene expression, expressed sequence tags) have enabled the creation of publicly available gene and protein databases that describe the expression of thousands of genes and proteins in multiple tissues. In this study we used five gene databases and one protein database. The C-It [
9,
10], Tissue-specific Gene Expression and Regulation (TiGER) [
11,
12] and UniGene [
13,
14] databases are based on expressed sequence tags (ESTs). The BioGPS [
15‐
17] and VeryGene [
18,
19] databases are based on microarray data. The Human Protein Atlas (HPA) [
20,
21] is based on immunohistochemistry (IHC) data.
Our laboratory has previously characterized the proteomes of conditioned media (CM) from 44 cancer cell lines, three near-normal cell lines and 11 relevant biological fluids (for example, pancreatic juice and ascites) using multidimensional liquid chromatography tandem mass spectrometry, identifying between 1,000 and 4,000 proteins per cancer site [
22‐
33] (unpublished work).
Numerous candidate biomarkers have been identified from in silico mining of gene-expression profiling [
34‐
36] and the HPA [
37‐
48]. In the present study, we describe a strategy to identify tissue-specific proteins using publicly available gene and protein databases. Our strategy mines databases for proteins highly specific to or strongly expressed in one tissue, selects proteins which are secreted or shed, and integrates proteomic datasets enriched for the cancer secretome to prioritize candidates for further verification and validation studies. Integrating and comparing proteins identified from databases based on different data sources (ESTs, microarray and IHC) with the proteomes of the CM of cancer cell lines and relevant biological fluids will minimize the shortcomings of any one source, resulting in the identification of more promising candidates. Recently, the value of using an integrated approach in biomarker discovery has been described [
49].
In this study, we looked at identifying tissue-specific proteins as candidate biomarkers for colon, lung, pancreatic and prostate cancer. Our strategy can be applied to identify tissue-specific proteins for other cancer sites. Colon, lung, pancreatic and prostate cancer are ranked among the top leading causes of cancer-related deaths, cumulatively accounting for an estimated half of all cancer-related deaths [
50]. Early diagnosis is essential for improving patient outcomes, as early-stage cancers are less likely to have metastasized and are more amenable to curative treatment. Five-year survival drops dramatically when treatment begins at a metastatic stage rather than while the cancer is organ-confined: from 91% to 11% in colorectal cancer, 53% to 4% in lung cancer, 22% to 2% in pancreatic cancer and 100% to 31% in prostate cancer [
50].
We identified 48 tissue-specific proteins as candidate biomarkers for the selected tissue types. Of these, 14 had been previously studied as cancer or benign disease serum biomarkers, providing credence to our strategy. Investigation of the remaining proteins in future studies is warranted.
Discussion
We describe a strategy to identify tissue-specific biomarkers using publicly available gene and protein databases. Since serological biomarkers are protein-based, using only protein expression databases for the initial identification of candidate biomarkers seems more relevant. While the HPA has characterized more than 50% of human protein-encoding genes (11,200 unique proteins to date), it has not completely characterized the proteome [
51]. Therefore, proteins that have not been characterized by the HPA but fulfill our desired criteria would be missed by searching only the HPA. There are also important limitations in using gene expression databases since there is considerable variation between mRNA and protein expression [
69,
70] and gene expression does not account for post-translational modification events [
71]. Therefore, mining both gene and protein expression databases minimizes the limitations of each platform. To the best of our knowledge, no studies for the initial identification of candidate cancer biomarkers have been conducted using both gene and protein databases.
Initially, the databases were searched for proteins highly specific to or strongly expressed in one tissue. The search criteria were tailored to accommodate the design of the databases, which did not allow for simultaneous searching with both criteria. Identifying proteins that were highly specific to and strongly expressed in one tissue was considered in a later step. In the verification of the expression profiles (see Methods), only 34% (48 of 143) of the proteins were found to meet both criteria. The number of databases mined in the initial identification can be varied at the discretion of the investigator. Additional databases will result in the same number of, or more, proteins being identified in two or more databases.
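The step of keeping proteins reported by two or more databases can be sketched as a simple counting operation. This is a minimal illustration rather than the procedure used in the study: the function name `proteins_in_multiple_databases` is an assumption, and the hit lists with placeholder identifiers P1 to P4 are invented for the example.

```python
from collections import Counter

def proteins_in_multiple_databases(hits_by_database, min_databases=2):
    """Return proteins reported by at least `min_databases` databases."""
    counts = Counter()
    for proteins in hits_by_database.values():
        # A set ensures each database contributes at most one vote per protein.
        counts.update(set(proteins))
    return {p for p, n in counts.items() if n >= min_databases}

# Hypothetical search results; P1..P4 are placeholder identifiers, not real data.
example_hits = {
    "BioGPS": ["P1", "P2", "P3"],
    "TiGER": ["P2", "P3"],
    "UniGene": ["P3", "P4"],
}

selected = proteins_in_multiple_databases(example_hits)
print(sorted(selected))
```

Because every database only adds votes, mining an additional database can leave the selected set unchanged or enlarge it, but cannot shrink it.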
In the gene expression databases, the criteria used were set for maximum stringency for protein identification, to identify a manageable number of candidates. A more exhaustive search can be conducted using lower stringency criteria. The stringency could be varied in the correlation analysis using the BioGPS database plugin and the C-It database. The correlation cutoff of 0.9 used in identifying similarly expressed genes in the BioGPS database plugin could be reduced to as low as 0.75. The SymAtlas z-score of ≥|1.96| could be reduced to ≥|1.15|, corresponding to a 75% confidence level of enrichment. The literature information parameters used in the C-It database of fewer than five publications in PubMed and fewer than three publications with the MeSH term of the selected tissue could be reduced in stringency, to allow identification of well-studied proteins. Since C-It does not look at the content of publications in PubMed, it filters out proteins that have been studied even if they have not been studied in relation to cancer.
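The correspondence between the z-score cutoffs and confidence levels can be checked with a short calculation. The sketch below assumes the cutoffs are interpreted as two-sided thresholds on a standard normal distribution; the helper name `two_sided_confidence` is introduced here for illustration only.

```python
import math

def two_sided_confidence(z):
    """Fraction of a standard normal distribution lying within +/- |z|."""
    return math.erf(abs(z) / math.sqrt(2))

for z in (1.96, 1.15):
    print(f"|z| >= {z}: {two_sided_confidence(z):.1%} confidence")
```

Under this assumption, |z| of 1.96 corresponds to roughly 95% and |z| of 1.15 to roughly 75%, matching the levels quoted above.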
Although proteins that have been well studied but not as cancer biomarkers represent potential candidates, the emphasis in this study was on identifying novel candidates which have been, overall, minimally studied. A gene's mRNA level and protein expression can have significant variability. Therefore, if lower stringency criteria were used when identifying proteins from gene expression databases, a greater number of proteins would have been identified in at least two of the databases, potentially leading to a greater number of candidate protein biomarkers identified after application of the remaining filtering criteria.
The HPA was searched for proteins strongly expressed in one normal tissue with annotated IHC expression. Annotated IHC expression was selected because it uses paired antibodies to validate the staining pattern, providing the most reliable estimation of protein expression. Approximately 2,020 of the 10,100 proteins in version 7.0 of the HPA have annotated protein expression [
51]. Makawita et al. [33] included the criterion of annotated protein expression when searching for proteins with 'strong' pancreatic exocrine cell staining for prioritization of pancreatic cancer biomarkers. A more exhaustive search could be conducted by searching the HPA without annotated IHC expression.
Secreted or shed proteins have the highest chance of entering the circulation and being detected in the serum. Many groups, including ours [
23‐
25,
27‐
33], use Gene Ontology [
72] protein cellular localization annotations of 'extracellular space' and 'plasma membrane' to identify a protein as secreted or shed. Gene Ontology cellular annotations do not describe all proteins completely and are not always consistent as to whether a protein is secreted or shed. An in-house secretome algorithm (GS Karagiannis et al., unpublished work) designates a protein as secreted or shed if it is predicted to be secreted, based either on the presence of a signal peptide or on non-classical secretion, or if it is predicted to be a membranous protein, based on amino-acid sequences corresponding to transmembrane helices. Because it defines proteins as secreted or shed more robustly, it was used in this study.
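The decision rule described for the secretome algorithm can be written as a small Boolean function. The in-house algorithm itself is unpublished, so this is only a sketch of the rule as stated in the text; the class `SecretionPredictions`, its field names and the function `is_secreted_or_shed` are hypothetical, and the upstream predictors that would supply the three flags are not specified here.

```python
from dataclasses import dataclass

@dataclass
class SecretionPredictions:
    """Hypothetical per-protein outputs of upstream sequence-based predictors."""
    has_signal_peptide: bool
    non_classical_secretion: bool
    has_transmembrane_helix: bool

def is_secreted_or_shed(p: SecretionPredictions) -> bool:
    # Predicted secreted: classical (signal peptide) or non-classical secretion.
    secreted = p.has_signal_peptide or p.non_classical_secretion
    # Predicted membranous: sequence contains transmembrane helices.
    membranous = p.has_transmembrane_helix
    return secreted or membranous

# Illustrative inputs only; these do not describe real proteins.
print(is_secreted_or_shed(SecretionPredictions(True, False, False)))
print(is_secreted_or_shed(SecretionPredictions(False, False, False)))
```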
Evaluating which of the databases had initially identified the 48 tissue-specific proteins that passed the filtering criteria showed that the gene expression databases had identified more of the proteins than the protein expression database. The HPA had initially identified only 9 of the 48 tissue-specific proteins. This low initial identification was due to the stringent search criteria requiring annotated IHC expression. For example, 20 of the 48 tissue-specific proteins had protein expression data available in the HPA; the 11 of these that were not initially identified by the HPA did not have annotated IHC expression. Because the expression profiles of those proteins would have passed the 'Verification of in silico expression profiles' filtering criteria, searching without the requirement for annotated IHC expression would have resulted in a greater initial identification of tissue-specific proteins by the HPA.
The HPA has characterized 11,200 unique proteins, which is more than 50% of the human protein-encoding genes [
51]. Of the 48 tissue-specific proteins that met the selection criteria, only nine were initially identified from mining the HPA. Twenty of the tissue-specific proteins have been characterized by the HPA. This demonstrates the importance of combining gene and protein databases to identify candidate cancer serum biomarkers. If only the HPA had been searched for tissue-specific proteins, even with lowered stringency, the 28 proteins that met the filtering criteria and represent candidate biomarkers would not have been identified.
The TiGER, UniGene and C-It databases are based on ESTs and collectively identified 46 of the 48 proteins. Of those, only 41% (19 of the 46) were identified in two or more of those databases. The BioGPS and VeryGene databases are based on microarray data and collectively identified 46 of the 48 proteins. Of those, 56% (26 of the 46) were identified uniquely by BioGPS and VeryGene. Thus, even when databases are based on similar sources of data, individual databases still identify unique proteins, which supports our initial approach of using databases that mine the same data source differently. The TiGER, BioGPS and VeryGene databases collectively identified all 48 of the tissue-specific proteins. Of these, 88% (42 of the 48) were identified in two or more of the three databases, supporting the selection of proteins identified in more than one database.
The accuracy of each database's initial protein identification is related to how explicitly the database could be searched for the filtering criteria of proteins highly specific to and strongly expressed in one tissue. The BioGPS database had the highest accuracy, at 26%, as it was searched for proteins expressed similarly to a protein of known tissue specificity and strong expression. The UniGene database, with an accuracy of 20%, could only be searched for proteins with tissue-restricted expression, without the ability to also require strong expression in the tissue. The VeryGene database, with an accuracy of 9%, was searched for tissue-selective proteins, and the TiGER database, with an accuracy of 6%, was searched for proteins preferentially expressed in a tissue. Their lower accuracies reflect that they could not be explicitly searched for proteins highly specific to only one tissue. The C-It database, with an accuracy of 4%, was searched for tissue-enriched proteins, and the HPA, with an accuracy of 0.4%, was searched for proteins with strong tissue staining. These very low accuracies reflect that the searches targeted proteins with strong expression in a tissue but could not be restricted to proteins highly specific to only one tissue.
The low identification of tissue-specific proteins by the C-It database is not unexpected. Because the literature search parameters initially used filtered out any protein with five or more publications in PubMed, regardless of whether those publications were related to cancer, C-It only identified proteins enriched in a selected tissue that have been minimally, if at all, studied. Of the nine proteins C-It initially identified from the tissue-specific list, eight had not been previously studied as candidate serum cancer biomarkers. Syncollin (SYCN) has only very recently been shown to be elevated in the serum of pancreatic cancer patients [
33]. The eight remaining proteins that C-It identified represent especially interesting candidate biomarkers because they represent proteins that fulfill the filtering criteria but have not been well studied.
A PubMed search revealed that 14 of the 48 tissue-specific proteins identified had been previously studied as serum markers of cancer or benign disease, providing credence to our approach. The most widely used biomarkers currently suffer from a lack of sensitivity and specificity because they are not tissue-specific. CEA is a widely used colon and lung cancer biomarker. It was identified by the BioGPS and TiGER databases and the HPA as highly specific to or strongly expressed in the colon, but not by any of the databases for the lung. CEA was eliminated upon evaluating the protein expression profile in silico, because it is not tissue specific. High levels of CEA protein expression were seen in the normal tissues of the digestive tract, such as the esophagus, small intestine, appendix, colon and rectum, as well as in bone marrow, and medium levels were seen in the tonsil, nasopharynx, lung and vagina. PSA is an established, clinically relevant biomarker for prostate cancer with demonstrated tissue specificity. PSA was identified in our strategy as a prostate-specific protein, after passing all the filtering criteria. This provides credence to our approach because we re-identified known clinical biomarkers and our strategy retained or eliminated them on the basis of tissue specificity.
From the list of candidate proteins that have not been studied as serum cancer or benign disease biomarkers, 18 of the 26 proteins were identified in proteomic datasets. The proteomic datasets primarily contain the CM proteomes of various cancer cell lines, and other relevant fluids, enriched for the secretome. For proteins that have not been characterized by the HPA, it is possible the transcripts are not translated, in which case they would represent unviable candidates. If the transcripts are translated and the protein enters circulation, it must do so at a level detectable by current proteomic techniques. Proteins that have been characterized by the HPA may not necessarily enter the circulation. The identification of a protein in the proteomic datasets verifies the presence of the protein in the secretome of cancer at a detectable level; therefore, the protein represents a viable candidate. Because cancer is a highly heterogeneous disease, the integration of multiple cancer cell lines and relevant biological fluids likely provides a more complete, if not necessarily exhaustive, picture of the cancer proteome.
Relaxin 1 is a candidate protein that was not identified in any of the proteomes but its expression was confirmed by semi-quantitative RT-PCR in prostate carcinomas [
73]. Therefore, a protein not being identified in any of the proteomic datasets does not necessarily imply that it is not expressed in cancer.
Acid phosphatase is a previously studied prostate cancer serum biomarker [74]. When compared to proteomic datasets (data not shown), it was identified in the seminal plasma proteome [25], the CM of many prostate cancer cell lines [28] (P Saraon et al., unpublished work) and, interestingly, the CM of colon cancer cell lines Colo205 [52] and LS180 (GS Karagiannis et al., unpublished work), the CM of breast cancer cell lines HCC-1143 (MP Pavlou et al., unpublished work) and MCF-7 [52], the CM of oral cancer cell line OEC-M1 [52] and the CM of ovarian cancer cell line HTB161 (N Musrap et al., unpublished work). Graddis et al. [74] observed very low levels of acid phosphatase mRNA expression in both normal and cancerous breast and colon tissue, in normal ovary and salivary gland tissue and comparatively high levels in normal and malignant prostate tissue. We, therefore, reasoned that identification of a tissue-specific protein in a proteome of a different tissue does not necessarily correlate with strong expression in that proteome.
Identification of a tissue-specific protein in only proteomes corresponding to that tissue, coupled with in silico evidence of strong and specific protein expression in that tissue, indicates an especially promising candidate cancer biomarker. SYCN has been shown to be increased in the serum of pancreatic cancer patients [
33]. SYCN was identified in the pancreatic juice proteome [
33] and in normal pancreatic tissue (H Kosanam
et al., unpublished work) and by BioGPS, C-It, TiGER, UniGene and VeryGene databases as strongly expressed in only the pancreas. Folate hydrolase 1, also known as prostate-specific membrane antigen, and KLK2 have been studied as prostate cancer serum biomarkers [
67,
68]. Folate hydrolase 1 and KLK2 were both identified in the CM of various prostate cancer cell lines [
28] (P Saraon
et al., unpublished work) and the seminal plasma proteome [
25] and by BioGPS and TiGER databases as strongly expressed in only the prostate. Of the tissue-specific proteins which have not been previously studied as serum cancer or benign disease biomarkers, colon-specific protein GPA33, pancreas-specific proteins chymotrypsinogen B1 and B2, chymotrypsin C, CUB and zona pellucida-like domains 1, KLK1, PNLIP-related protein 1 and 2, regenerating islet-derived 1 beta and 3 gamma and prostate-specific protein NPY represent such candidates. Investigation of these candidates should be prioritized for further verification and validation studies.
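The prioritization rule in this paragraph, proteomic identification restricted to the tissue of interest combined with in silico tissue specificity, can be sketched as follows. The function names are hypothetical, and the tissue labels are a simplified paraphrase of the proteomic hits reported in the text (for instance, seminal plasma is labeled as a prostate-related proteome); this is an illustration of the rule, not the authors' implementation.

```python
def restricted_to_tissue(proteome_tissues, tissue):
    """True if the protein was found in at least one proteome, all from `tissue`."""
    return bool(proteome_tissues) and set(proteome_tissues) == {tissue}

def is_priority_candidate(in_silico_specific, proteome_tissues, tissue):
    """Combine in silico tissue specificity with tissue-restricted proteome hits."""
    return in_silico_specific and restricted_to_tissue(proteome_tissues, tissue)

# SYCN: pancreatic juice proteome and normal pancreatic tissue.
sycn_hits = ["pancreas", "pancreas"]
# Acid phosphatase: seminal plasma and prostate CM, plus CM of colon, breast,
# oral and ovarian cancer cell lines.
acid_phosphatase_hits = ["prostate", "prostate", "colon", "breast", "oral", "ovary"]

print(is_priority_candidate(True, sycn_hits, "pancreas"))
print(is_priority_candidate(True, acid_phosphatase_hits, "prostate"))
```

A protein that fails this check is not thereby excluded as a candidate; as noted for acid phosphatase, identification in the proteome of another tissue does not necessarily indicate strong expression there.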
The proposed strategy seeks to identify candidate tissue-specific biomarkers for further experimental studies. Using colon, lung, pancreatic and prostate cancer as case examples, we identified a total of 26 tissue-specific candidate biomarkers. In the future, we intend to validate the candidates; if validation is successful, we can validate the use of this strategy for in silico cancer biomarker discovery. Using this strategy, investigators can rapidly screen for candidate tissue-specific serum biomarkers and prioritize candidates for further study based on overlap with proteomic datasets. This strategy can be used to identify candidate biomarkers for any tissue, contingent on the data availability in the mined databases, and incorporate various proteomic datasets at the discretion of the investigator.