The use of confusing nomenclature
The rCRS [27] nucleotide at position 10398 is A. Therefore the correct notation for the transition at this position is A10398G, or m.10398A>G following the official nomenclature in medicine.
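As a small illustration of the two notations, the following Python sketch converts the shorthand A10398G into the form m.10398A>G and checks the first letter against the rCRS base. The function name and the lookup table are hypothetical. The table holds only the five rCRS bases that can be read off the notations quoted in this section (positions 7028, 10398, 14766, 16189, and 16519), not the full reference sequence.

```python
import re

# rCRS bases implied by the rCRS-based notations quoted in this section
# (illustrative subset only; a real check would read the full rCRS sequence).
RCRS_BASE = {7028: "C", 10398: "A", 14766: "C", 16189: "T", 16519: "T"}

_SHORTHAND = re.compile(r"^([ACGT])(\d+)([ACGT])$")


def to_medical_notation(variant: str) -> str:
    """Convert e.g. 'A10398G' to 'm.10398A>G', checking the rCRS base."""
    match = _SHORTHAND.match(variant)
    if match is None:
        raise ValueError(f"not a simple substitution: {variant!r}")
    ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
    expected = RCRS_BASE.get(pos)
    # Positions outside the small table are passed through unchecked.
    if expected is not None and expected != ref:
        raise ValueError(f"{variant}: the rCRS has {expected} at position {pos}")
    return f"m.{pos}{ref}>{alt}"


print(to_medical_notation("A10398G"))
```

With this table, the string "G10398A" is rejected because its first letter does not match the rCRS base at 10398.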
In their breast cancer paper, Canter et al. [12] erroneously employed the notation “G10398A”, which incidentally could be interpreted post hoc as if it followed the evolutionary order of nucleotide change (from the ancestral G to a first derived A at 10398), but this was certainly not intended by the authors, since C7028T, C14766T, T16189C, and T16519C all follow the standard rCRS-based style. The incorrect and confusing notation “G10398A” was then repeatedly used in numerous follow-up papers [20]. This led to the paradoxical situation that “G10398A” has become more widespread within the past five years than A10398G: e.g. Google now has ~60,600 entries for ‘G10398A mtDNA’ but only ~3,010 entries for ‘A10398G mtDNA’. This cannot be explained by the 2012 switch from A10398G to “G10398A” executed in PhyloTree, since this equally transformed the notation for the C7028T polymorphism by reversing the roles of C and T: there are ~4,250 entries for the standard form ‘C7028T mtDNA’ but only ~355 entries for ‘T7028C mtDNA’.
The confusion that has set in with the erroneous designation of the A10398G polymorphism is best reflected by a brief comment on the Canter et al. [12] study given by Benn et al. [28], where it is stated that “The mt10398a > g polymorphism present in European haplogroups J, K, and Z has been associated with increased risk of invasive breast cancer in black women (48 cases/54 controls and validated in 654 cases/605 controls) but not in white women (879 cases/760 controls).” Leaving aside the inaccuracies concerning haplogroups K (which as a whole is not defined by 10398G) and Z (not of European ancestry), the Canter et al. [12] study was misinterpreted: first, the smallest case–control analysis did not reach statistical significance (see below), so that the next one could not be regarded as a validation of the former; second, the nucleotide states A and G for the claimed association were confused.
Covarrubias et al. [13] claimed that “one of the interactions, 4216C and 10398G, observed in this [their] study although not statistically significant after controlling the GWER (Table 2), was previously reported in the Canter et al. case–control study…” This affirmation was unfortunate, because it was the rCRS nucleotide A10398 that was reported as being associated with breast cancer by Canter et al. Further, the authors correctly mentioned (in regard to the Canter et al. study) that “a synergistic interaction was observed between 4216C and 10398A”.
The G nucleotide is the ultimately ancestral nucleotide at 10398 with respect to the entire (known) human mtDNA phylogeny, while the A nucleotide is considered to be the derived allele. However, it is important to note that this site mutates as many as 21 times (5 from G to A and 16 from A to G) in the basal classification tree, PhyloTree Build 16, pointing to independent mutational events occurring at different times. The age of the mutation therefore varies depending on the targeted branch (haplogroup) in the phylogeny. Whether the allele is ancestral or derived could be completely irrelevant from the point of view of its presumed pathogenicity (there seems to be a bias towards considering ‘ancestral’ alleles as generally healthy and ‘derived’ ones as potentially pathogenic). It could be the case that the seemingly ‘ancestral’ nucleotide is in fact more recent than the seemingly ‘derived’ allele if one focuses on a particular mtDNA haplogroup where this polymorphism has mutated back to the ancestral nucleotide several times. The concept of ancestral allele is also misunderstood in the literature. For instance, Czarnecka et al. [21] mentioned that “while the revised Cambridge reference sequence [14] lists the wild type base as A, the alternate base (G) is also prevalent in many populations”. This affirmation reflects a popular misconception of the rCRS as being the ‘wild-type’ sequence; in reality, the rCRS represents just one particular European mtDNA sequence used for notational purposes [16]. All existing mtDNA lineages are quite distant from the root of the entire mtDNA tree.
Therefore, one has to be aware that the A10398G polymorphism targeted in different studies does not necessarily refer to the same mutational events, so that different studies could in reality be referring to different statistical associations. For instance, the statistical association could have been found in combination with another variant, thus pointing to a particular haplogroup and not to the complex polyphyletic group defined by a particular nucleotide at 10398.
The reply of Bai et al. [19] to the article by Mosquera-Miguel et al. [5] testifies to a basic misconception of the role of haplogroups in disease studies: “Although haplogroup K is a subclade of haplogroup U in Europeans, most, if not all published articles on haplogroups and disease association do not include haplogroup K within haplogroup U”. Only mutations that define monophyletic clades could be pinpointed as generating an effect on the fitness of the mtDNA. U minus K is just a conglomerate of nine monophyletic clades. Unfortunately, Bai et al. were correct in claiming that most medical geneticists perceive haplogroup U as an entity that does not include haplogroup K, but a mistake remains a mistake whether it is committed by a majority of researchers or not. This U-K misconception has its root in an early article on the classification of European mtDNA [29], which was based on limited mtDNA information as provided by RFLP analysis at the time.
The conflicting signals in the studies of Canter et al. and Bai et al.
Canter et al. [12] analyzed three different cohorts of women with invasive breast cancer, two ‘African-American’ and one ‘White’. In their pilot study (48 cases and 54 controls; all ‘African-American’ women), the authors did not find any association of this polymorphism with the disease.
In a second ‘African-American’ cohort (654 cases, 605 controls), these authors found the A10398 nucleotide significantly increased in breast cancer patients. In a third cohort of ‘White’ women (879 cases, 760 controls), they did not detect any statistical association. The authors interpreted their findings as “novel epidemiologic evidence that the mtDNA 10398A allele influences breast cancer susceptibility in African-American women”.
In reply to an article by Mims et al. [30] on prostate cancer (see also Verma et al. [31]), the response by Canter et al. [26] provided more clues about their previous findings from 2005 [12] (since they used the same breast cancer cohorts as in their 2005 study). Thus, one can infer that the main association signal found by Canter et al. [12] appears to be due to the combination of the variants A10398 and 4216C. These two variants together are good markers for the European haplogroups J1c8, T, and R2 (see PhyloTree), with haplogroup T being the most prevalent one. In other words, the statistical signal reported by Canter et al. in [12,32] just mirrors the existence of an increased component of matrilineal European ancestry in their cases compared to their controls.
Therefore, there is little support for the positive association reported in the Canter et al. study if we consider that the phylogenetic evidence clearly points to a false positive due to the confounding effect of population stratification. Control for the latter should be mandatory in case–control studies in order to avoid false positives, especially when targeting ‘African-Americans’, for whom one might a priori suspect large differential matrilineal admixture proportions in cases and controls [33].
In contrast to the Canter studies, Bai et al. [19] and the follow-up (non-independent) study by Covarrubias et al. [13] led to the conclusion that it is the 10398G variant that would be related to an increased risk of breast cancer. Considering the SNPs targeted by Bai et al., the variant 10398G pointed mainly to haplogroup K1 in their cases and controls (see e.g. their Figure 2).
Covarrubias et al. [13] reported an interaction between 10398G and 12308G (thereby ‘recycling’ the same findings reported by Bai et al.), thus effectively claiming that K1 has an apparent statistical association with breast cancer. Note therefore that Canter’s implicit finding (the ‘risky’ haplogroup T) was not replicated in Covarrubias et al. [13], while vice versa the Covarrubias et al. [13] finding (concerning the ‘risky’ haplogroup K1) was not observed in Canter’s study.
The risk of the 10398G variant in Polish breast cancer patients
Czarnecka et al. [21] reported the association of 10398G in a Polish breast cancer cohort (44 cases and 100 controls). Apart from this being an underpowered case–control study (Table 1), population stratification was not monitored. Regarding the latter, it is important to mention that stratification has to be measured empirically, and that the fact that controls were “…matched for ethnicity and region of residence.” does not guarantee lack of stratification. There are in fact solid reasons to believe that this study suffered from a strong bias in the estimation of the frequency of the 10398G mutation in controls. Czarnecka et al. [21] reported 10398G in 10 (23%) of the 44 cases (from one medical center) and 3 (3%) of the 100 controls (which came from a geographically distant medical center). Fisher’s (2-sided) exact test delivers an extraordinarily low P-value of 0.00042 for this contingency table. It is surprising that no reference was made to four or five sets of Polish mtDNA population data of total size >3,000, which were available in September 2008, well before the submission of that article. No attempt was made to compare the in-house control-region data with any one of these data sets, although one of them [34] was taken as the major part of the control group in a publication submitted two months later [35].
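The P-value quoted for the table of 10/44 cases versus 3/100 controls can be reconstructed by hand. The sketch below assumes the common two-sided convention that sums all tables no more probable than the observed one; the article does not state which convention or software was used. With 13 carriers of 10398G among 144 individuals in total, the number X of carriers among the 44 cases follows a hypergeometric distribution under the null hypothesis:

```latex
\[
P(X = k) = \frac{\binom{13}{k}\binom{131}{44-k}}{\binom{144}{44}},
\qquad k = 0, \dots, 13,
\]
\[
P(X = 10) \approx 3.8 \times 10^{-4}, \quad
P(X = 11) \approx 3.6 \times 10^{-5}, \quad
P(X = 12) \approx 2.0 \times 10^{-6}, \quad
P(X = 13) \approx 5 \times 10^{-8},
\]
\[
p = \sum_{k=10}^{13} P(X = k) \approx 4.2 \times 10^{-4}.
\]
```

No lower-tail table contributes under this convention, because even the most extreme one is more probable than the observed table: P(X = 0) is approximately 6.8 × 10⁻³, which exceeds P(X = 10).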
In any European population there are three major haplogroups defined by 10398G: haplogroups J, K1, and N1a1. Using control-region data, one can reliably recognize the following haplogroups by minimal motifs: J (16069T-16126C), K (16224C-16311C) versus K1a (16224C-16311C-497T) and K1c (16224C-16311C-498del), and haplogroup N1a1b (250C alone or plus 16391A), in which subhaplogroup I is nested as its dominating component. One thus loses N1a1a but gains J1c8 (with back-mutated A10398), both of which are very minor and expected to be equally uncommon (< 0.3%). In an enlarged Polish control group of 414 normal individuals, Gaweda-Walerych et al. [36] detected 22 carriers of haplogroup K, including 15 of its subhaplogroup K1a and 2 of K1c. Since 17/22 > 3/4, the fraction 3/4 is a conservative estimate of the K1 proportion of K. When applying this 3/4 rule for estimating the K1 frequencies in different samples, we obtain a compound frequency of 117/894 (13.1%) for N1a1b, J, and K1 in the three Polish data sets stored in EMPOP (http://www.empop.org) and a frequency of 47/277 (17.0%) for I, J, and K1 for the control group in Gaweda-Walerych et al. [36]. In total, we thus obtain the (conservative) frequency estimate 164/1171 (14.0%) for the occurrence of 10398G in Poland. This is also well in agreement with an estimate derived from the Polish haplogroup frequencies of Piechota et al. [37], who effectively targeted haplogroups I, J, and U8b (instead of the claimed K): assuming a lower bound of 1/4 for the K1 contribution to U8b, one obtains the estimate of 23/152 (15.1%) for 10398G in Poland. This data set served as the minor part of the control group of Czarnecka et al. [35]. Incidentally, the still larger Polish data set of Saxena et al. [38] would give an estimate of 13.0% using the same method of estimation. However, in that article one can directly read off the real frequency of 10398G in this Polish data set, viz. 337/2006 (16.8%), which demonstrates that our estimation was even somewhat too conservative.
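The pooling arithmetic behind these estimates is easy to re-check. The Python sketch below only re-adds the counts quoted in the text. The variable and function names are illustrative, and the counts are not recomputed from the underlying data sets.

```python
from fractions import Fraction

# (carriers of haplogroups implying 10398G, sample size), as quoted in the text.
EMPOP_POLAND = (117, 894)      # N1a1b, J, and K1 (3/4 rule applied to K)
GAWEDA_WALERYCH = (47, 277)    # I, J, and K1
PIECHOTA = (23, 152)           # estimate derived from I, J, and U8b
SAXENA_OBSERVED = (337, 2006)  # 10398G count read off directly


def percent(carriers: int, total: int) -> float:
    """Frequency in percent, rounded to one decimal place."""
    return round(100 * carriers / total, 1)


def pool(*samples):
    """Add up carriers and sample sizes across data sets."""
    return (sum(c for c, _ in samples), sum(n for _, n in samples))


# The 3/4 rule is conservative because the observed K1 share of K is larger.
k1_share = Fraction(15 + 2, 22)
rule_is_conservative = k1_share > Fraction(3, 4)

pooled = pool(EMPOP_POLAND, GAWEDA_WALERYCH)
print("pooled estimate:", pooled, percent(*pooled), "%")
print("Piechota-based estimate:", percent(*PIECHOTA), "%")
print("Saxena observed:", percent(*SAXENA_OBSERVED), "%")
```

Keeping the counts as integer pairs until the final step avoids averaging percentages from samples of different size.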
We then performed four (2-sided) Fisher exact tests for the control and case data from Czarnecka et al. [35], each compared against either of the literature data sets with counts 164 vs 1007 and 337 vs 1669, respectively. The control data from their study receive P-values of 0.00059 and 0.00004 (sic!) in these comparisons. This demonstrates that either (1) their control group is so special that it would have been mandatory to monitor population stratification village by village, or (2) the genotyping went wrong quite badly, or (3) the samples had not been chosen and aggregated in a correct way. Therefore, it is evident that the control group employed by Czarnecka et al. [35] cannot represent the population of cases and yields nucleotide variant frequencies that are completely unexpected in view of the patterns observed in other data sets. Comparing the frequencies for the cases in that paper to either of the literature data sets, we get P-values as high as 0.122 and 0.309, very far from being significant.
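These four comparisons can be re-run with a few lines of Python. The following is a minimal sketch using only the standard library. It assumes the common two-sided convention of summing all tables that are no more probable than the observed one, which may differ from the software actually used, so the last digits need not match the values in the text. The function name and table layout are illustrative.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact P-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the observed
    margins that are no more probable than the observed table.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    lo, hi = max(0, row1 + col1 - total), min(row1, col1)

    def weight(k: int) -> int:
        # Unnormalised hypergeometric probability, kept as an exact integer
        # so that the comparison with the observed table has no rounding.
        return comb(col1, k) * comb(total - col1, row1 - k)

    observed = weight(a)
    extreme = sum(w for w in map(weight, range(lo, hi + 1)) if w <= observed)
    return extreme / comb(total, row1)


# (10398G carriers, non-carriers), as quoted in the text.
study = {"controls": (3, 97), "cases": (10, 34)}
literature = {"pooled estimate": (164, 1007), "Saxena et al.": (337, 1669)}

for group, (g, non_g) in study.items():
    for source, (lit_g, lit_non_g) in literature.items():
        p = fisher_exact_two_sided(g, non_g, lit_g, lit_non_g)
        print(f"{group} vs {source}: P = {p:.5f}")
```

The same function applied to cases versus controls, `fisher_exact_two_sided(10, 34, 3, 97)`, corresponds to the contingency table discussed for the original study.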
The risk of the A10398 variant in Indian breast cancer patients
Darvishi et al. [20] analyzed 124 breast cancer patients and 273 controls; the authors reported a statistical association of A10398 with breast cancer. Given that these authors targeted an Indian population and only genotyped the A10398G polymorphism, one has to assume that the main statistical signal came from haplogroup N as a whole (10398 is one of the mutations that separate the two (macro-)haplogroups M and N in Asia). By analyzing squamous cell carcinoma samples (55 cases versus 163 controls), they also reported a positive association for A10398.
The frequency of haplogroup N in India is highly heterogeneous, as highlighted even by Darvishi et al. [20]. Their cases and controls were selected from North India (without any further geographic specification). It is noteworthy that their control sample has a frequency of A10398 of 43.6% (compared to 57.3% in their cases); however, the frequency of haplogroup N in e.g. Gujarat (western India) and Punjab and Kashmir (North India) is just below 60%, as also reported by the same authors (see their Figure 2), thus nearly matching the frequency of haplogroup N in their cases. This means that their control group does not properly represent their cases, which points once more to a false positive association.
On the other hand, the results of Darvishi et al. [20] conflict with the recent article by Francis et al. [14], carried out on a much larger sample of Indian patients and controls (three different cohorts plus meta-analysis), which found no association between the A10398G polymorphism and breast cancer.
The risk of the A10398 variant in breast cancer patients from Bangladesh
Recently, Sultana et al. [23] analyzed a sample of only 24 breast cancer cases and 20 controls from Bangladesh, claiming the association of A10398 and C10400 with breast cancer. These authors therefore targeted (macro-)haplogroup N. It is surprising to see that the frequency of haplogroup N in their cases is 75% versus 25% in controls.
To explain a possible false positive finding in the Sultana et al. [23] study, one could easily allege: (i) deficient statistical power due to their extremely small cohort, and (ii) the confounding effect of population substructure (given that these authors did not control for this possible confounding factor). Their most recent study [39] gives us more clues about this interesting case. In the latter study, these authors analyzed exactly the same samples as in their 2011 article [23], but instead of targeting the coding region, they now examined the control region. The authors reported that “two novel polymorphisms in the D-loop, one at position 16290 (T-ins) and the other at 16293 (A-del), was higher in breast cancer patients than in controls”. From the sequence electropherogram in their Figure 1, one discovers that the authors misaligned their sequences with respect to the rCRS: their two indels in fact constitute the well-known transition C16290T and the transversion A16293C. Both resulting variants together signal the rare haplogroup A11 within haplogroup N (PhyloTree) and would therefore necessarily bear the combination A10398-C10400 seen in their 2011 article. This haplogroup status is in phylogenetic conflict with another mutation reported in their 2012 article: the authors mentioned that 10316G is present in 69% of their cases but not in their controls. The transition 10316G is a good marker for haplogroups M43 and R22, but it has not yet been reported within A11. Whatever the solution to this phylogenetic puzzle may be, their data point to the fact that cases carry an exaggerated representation of a rare haplogroup (most likely A11) that constitutes 75% of them, thus inflating the signal given by haplogroup N in their cases. This is another clear demonstration of population stratification or inadequate selection of cases and controls in that small Bangladesh sample. Since the full haplotypes have not been presented in either article, one has to reject the conclusions drawn by the authors.
Publication bias in breast cancer risk
Several review articles have been written since the first publication of Canter et al. [12] in regard to the implications of the 10398 polymorphism in breast cancer (together with other cancers); strikingly, there are as many of them as original research articles [41‐47].
Unfortunately, all these surveys rephrase and summarize the conclusions of the original articles without critically investigating the robustness of the evidence. Worse, most of the time, only the positive findings of the literature are highlighted. Since 2009, according to the Web of Science (http://ip-science.thomsonreuters.com/es/productos/wok/), the studies by Canter et al. [12] and Bai et al. [19] received 109 and 86 citations (query: 28 February 2014), whereas for the same period, the studies of Setiawan et al. [18] and Mosquera-Miguel et al. [5], which reported negative findings, received 23 and 22 citations, respectively. Czarnecka and Bartnik [41] reported that “the first interesting and widely investigated mtDNA polymorphism in the cancer field was A10398G, first described as causative factor in breast cancer development (50–52)”. Curiously, their citation 50 in this quotation refers to the negative findings of Setiawan et al. [18], but they do not further mention or comment on this article, and the negative findings of Mosquera-Miguel et al. [5] are not cited at all. In the earlier review by Plak et al. [43], neither of the two reports with negative findings was cited. This kind of publication bias is harmful in science because it stimulates future scientific studies in wrong directions and promotes studies suffering from the same deficiencies as the previous ones.