Introduction
Reliable knowledge of the particular Y-chromosomal short tandem repeat (Y-STR) polymorphisms used in the forensic context is essential for the correct interpretation of the resulting profiles. In recent years, the 17 Y-STRs included in the commercially available AmpFlSTR® Yfiler® polymerase chain reaction (PCR) amplification kit (Applied Biosystems, Inc., Foster City, CA, USA) have become widely used in the forensic genetic community as well as in evolutionary anthropological studies. Establishing reliable knowledge of the mutation rates and characteristics of these 17 Y-STRs is therefore important for forensic and anthropological applications. In forensics, mutation rates are needed when STRs are applied to paternity testing, and Y-STRs are especially powerful in deficiency cases of disputed paternity involving male offspring, where the alleged father is not available for DNA analysis but is replaced by any of his male paternal relatives. In such applications, Y-STR mutation rates need to be considered in the paternity probabilities, and mutations become more likely the more generations separate the son from his putative male paternal relative [1]. There are also other forensic applications where Y-STR mutation rates have to be considered, i.e., all those that include different members of the same male lineage. In evolutionary anthropological studies, Y-STRs are usually applied to unveil the local and temporal origin of a given Y-SNP-based haplogroup, and Y-STR mutation rates are used for time estimations as well as (often) for weighted network constructions [2]. In addition, Y-STRs are useful in genealogical studies, where mutation data are needed as well [3].
There are several approaches to establishing Y-STR mutation rates, including genotyping father–son pairs from trio cases of autosomal DNA-confirmed paternity [4], males from deep-rooted pedigrees [5], single sperm cells or small pools of sperm cells [6], or using Y-STR population data in combination with known historical events for time calibration [7]. Of these, studying DNA-confirmed father–son pairs is the most reliable approach, but only if the number of father–son pairs investigated is large enough to yield reliable mutation rate estimates. This is because mutation rates of STRs, including Y-STRs, are expected to be small (about one mutation in 1,000 generations per locus). It is therefore important to further increase the number of father–son pairs typed for the specific Y-STR loci intended for forensic and evolutionary analyses, to provide more reliable knowledge about their mutability and thus to gain further certainty in Y-STR data interpretation.
Several studies have investigated mutation rates and characteristics of Y-STR loci widely used in forensic, genealogical, and evolutionary studies [4, 5, 8–23]. However, the mutation information for some of the Y-STRs included in the Yfiler kit is still very limited, as most of the Y-STR mutation rate studies so far were conducted on a subset of the markers included in the Yfiler kit (e.g., the nine Y-STRs defining the so-called minimal haplotype). Only six previous studies investigated the complete set of 16 Yfiler Y-STR loci (DYS385a/b was considered jointly) in father–son pair analyses, covering altogether only 1,624 meiotic transfers per single locus [16, 18, 19, 21–23]. In this paper, we report mutation data for the 17 Y-STRs included in the AmpFlSTR® Yfiler® PCR amplification kit from analyzing 1,730–1,764 father–son pairs per locus, comprising a total of 29,792 meiotic transfers (mutations at DYS385a and DYS385b were analyzed separately) and representing the largest single Yfiler mutation study available thus far. We additionally provide summarized mutation data from our study and previously published data for the same 16 Y-STR loci (DYS385a/b considered as a combined system) comprising 3,531–11,900 meiotic transfers per Y-STR locus (altogether 135,212 meiotic transfers).
Materials and methods
The family relationship of the father–son pairs used in this study had been confirmed before this study by DNA analysis using various sets of DNA markers, and all pairs had paternity probabilities of >99.9%. Samples came from five sampling regions: Cologne, Leipzig, and Berlin in Germany as well as Warsaw and Wroclaw in Poland. Individuals came from the named cities as well as their surrounding regions, i.e., the provinces/counties these cities are part of. Although the vast majority will have originated from the respective geographic regions, we cannot exclude some migrants from other regions. Individuals known to originate from countries other than those considered in the respective regional sample sets were excluded from the study. There is no sample overlap between the present study and our previously published mutation study [4]. Because of the very low DNA quantities available for the Leipzig samples, a whole genome amplification procedure was performed before Yfiler PCR analysis using the GenomiPhi DNA Amplification Kit (GE Healthcare, Little Chalfont, UK). One or 5 µl (depending on DNA concentration) of genomic DNA was added to 9 µl of sample buffer and denatured at 95°C for 3 min, then cooled on ice. Subsequently, 9 µl of reaction buffer plus 1 µl of enzyme mix were added to the cooled sample and incubated at 30°C for 16–18 h, then heat inactivated at 65°C for 10 min. Afterwards, the whole-genome-amplified DNA was purified using Invisorb® 96 Filter Microplates (Invitek GmbH, Berlin, Germany).
The Y-STRs included in the AmpFlSTR® Yfiler® PCR amplification kit (Applied Biosystems, Inc.), namely DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS385a/b, DYS437, DYS438, DYS439, DYS448, DYS456, DYS458, DYS635, and Y-GATA-H4, were genotyped according to the instructions provided by the manufacturer using a gold-plated silver block GeneAmp® PCR System 9700 (Applied Biosystems, Inc.). All PCRs, except for the Berlin samples, were carried out at the Department of Forensic Molecular Biology, Erasmus MC Rotterdam (The Netherlands), and after quality control, PCR products were shipped on dry ice to Applied Biosystems at Foster City (USA), where fragment length analysis was performed using the 3130xl genetic analyzer according to the guidelines in the AmpFlSTR® Yfiler® PCR amplification kit user manual. Yfiler profiles were generated using GeneMapper ID v3.2 software (Applied Biosystems Inc.), and the generated profiles were manually inspected by experienced technicians in Rotterdam for quality control. The Berlin samples were genotyped at the Abteilung für Forensische Genetik, Institut für Rechtsmedizin und Forensische Wissenschaften, Charité (Germany) according to the manufacturer’s instructions. Genotype differences between respective fathers and sons were identified using in-house developed MATLAB® scripts, version 7.6.0.324 (The MathWorks, Inc., Natick, MA, USA).
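The in-house MATLAB scripts used for the father–son comparison are not shown. As a rough illustration only, the comparison step might look like the following Python sketch. The profile layout (a dictionary mapping each locus name to a tuple of repeat counts), the function name `find_differences`, and the example alleles are all hypothetical and are not taken from the study.

```python
from typing import Dict, List, Tuple

# Hypothetical profile layout: locus name -> tuple of alleles (repeat counts).
# Single-copy loci carry one allele; DYS385a/b carries two.
Alleles = Tuple[int, ...]
Profile = Dict[str, Alleles]


def find_differences(father: Profile, son: Profile) -> List[Tuple[str, Alleles, Alleles]]:
    """Return (locus, father_alleles, son_alleles) for every locus typed in
    both profiles whose alleles differ. Alleles are sorted before comparison
    so that the order of the two DYS385a/b alleles does not matter."""
    differences = []
    for locus in sorted(father.keys() & son.keys()):
        father_alleles = tuple(sorted(father[locus]))
        son_alleles = tuple(sorted(son[locus]))
        if father_alleles != son_alleles:
            differences.append((locus, father_alleles, son_alleles))
    return differences


# Made-up example pair: one single-repeat gain at DYS390.
father = {"DYS19": (14,), "DYS390": (24,), "DYS385a/b": (11, 14)}
son = {"DYS19": (14,), "DYS390": (25,), "DYS385a/b": (14, 11)}

for locus, f_alleles, s_alleles in find_differences(father, son):
    print(locus, f_alleles, "->", s_alleles)
```

A difference flagged in this way is only a candidate mutation; in the study, every candidate was subsequently confirmed by DNA sequence analysis.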
All mutations were confirmed by DNA sequence analysis of the respective father and son DNA samples at the respective Y-STR locus in Rotterdam. Mutations at the DYS385a/b system were sequenced separately for DYS385a and DYS385b as described elsewhere [24]. Before DNA sequence analysis, PCR was carried out under the following conditions: 10–20 ng genomic DNA was used in a total PCR volume of 20 μl. Final concentrations were 1× GeneAmp PCR gold buffer and 0.5–1 unit AmpliTaq Gold (Applied Biosystems Inc.), 1 mM deoxyribonucleotide triphosphates (dNTPs; Roche Diagnostics GmbH, Mannheim, Germany), 250 nM of each primer (see Supplementary Table S1 for primer sequences used for sequencing as well as for PCR before sequence analysis) and 1.5–2.5 mM MgCl2 depending on the marker. DYS393, DYS439 (2.5 mM MgCl2), GATA-H4, DYS385a, and DYS385b (1.5 mM MgCl2) were amplified using a 60–50 touchdown protocol: 95°C 10 min; ten cycles of 94°C 30 s, 60°C decreasing by 1°C per cycle 30 s, 72°C 45 s; 25 cycles of 94°C 30 s, 50°C 30 s, 72°C 45 s; and a final extension at 72°C 10 min. The combined DYS389I/II fragment was amplified using a 60–55 touchdown protocol: 95°C 10 min; five cycles of 94°C 30 s, 60°C decreasing by 1°C per cycle 30 s, 72°C 45 s; 30 cycles of 94°C 30 s, 55°C 30 s, 72°C 45 s; and a final extension at 72°C 10 min. DYS437, DYS392, DYS438, DYS19, DYS456 (all 2.0 mM MgCl2), and DYS390 (2.5 mM MgCl2) were amplified with a 65–50 touchdown protocol: 95°C 10 min; 15 cycles of 94°C 30 s, 65°C decreasing by 1°C per cycle 30 s, 72°C 45 s; 20 cycles of 94°C 30 s, 50°C 30 s, 72°C 45 s; and a final extension at 72°C 10 min. DYS635 (1.5 mM MgCl2) and DYS391 (2.0 mM MgCl2) were amplified using a 70–50 touchdown protocol: 95°C 10 min; 20 cycles of 94°C 30 s, 70°C decreasing by 1°C per cycle 45 s, 72°C 1 min; 15 cycles of 94°C 30 s, 50°C 45 s, 72°C 1 min; and a final extension at 72°C 10 min. DYS458 (1.5 mM MgCl2) was amplified using a fixed annealing temperature of 60°C: 95°C 10 min; 35 cycles of 94°C 30 s, 60°C 30 s, 72°C 45 s; then a final extension at 72°C 10 min. DYS385a and DYS385b were amplified separately as described elsewhere [24].
Excess PCR primers and dNTPs were removed by enzymatic treatment with exonuclease I (Exo) and shrimp alkaline phosphatase (SAP) using the ExoSAP-IT™ Kit (USB Corporation, Cleveland, OH, USA), where 5 μl PCR product was incubated with 2 μl ExoSAP-IT mix for 15 min at 37°C, inactivated at 80°C for 15 min, then cooled to 15°C for 5 min. DNA sequence analysis was performed via cycle sequencing in a total volume of 10 μl using the BigDye Terminator Cycle Sequencing Ready Reaction kit (Applied Biosystems Inc.) and the following conditions: 1 μl ExoSAP-IT-treated PCR product, 1.5 μl sequencing buffer (Applied Biosystems Inc.), 1.0 μl BigDye Terminator v1.1 (Applied Biosystems Inc.), 5 pmol of sequencing primer (see Supplementary Table S1 for sequences) and LiChrosolv water (Merck KGaA, Darmstadt, Germany). The cycle sequencing was performed in an MJ-Research PTC-200 (Bio-Rad, Hercules, CA, USA) by heating to 96°C for 1 min, then 25 cycles of 96°C for 10 s, 50°C for 5 s and 60°C for 4 min, and subsequent cooling to 15°C. The sequencing products were purified using 96-well multiscreen plates (Millipore, Billerica, MA) filled with Sephadex G-50 superfine (GE Healthcare Bio-Sciences AB, Uppsala, Sweden) hydrated with LiChrosolv water (Merck KGaA). After spinning the column for 5 min at 2900 rpm, 10 μl sequencing product was added to the column and collected in a clean 96-well PCR plate after centrifugation for 5 min at 2900 rpm. To the purified product, 5 μl HiDi formamide (Applied Biosystems Inc.) was added, and the sample was loaded on the ABI 3100 Genetic Analyzer (Applied Biosystems Inc.). Separation of the purified sequencing products was performed using capillary electrophoresis under standard conditions. DNA sequences were aligned using the DNAstar software (DNASTAR, Inc., Madison, WI, USA).
Because Y-STR typing with the Yfiler chemistry uses labeled primers, DNA sequencing was performed from an independent PCR; our confirmation procedure thus included two independent analyses: one Yfiler fragment-length analysis and one sequence analysis. Y-STR mutations were only accepted as such if the repeat counts from the DNA sequence analysis matched the repeat-based allele nomenclature of the Yfiler fragment length analysis. For additional confirmation, we included, for all Y-STRs sequenced, control DNA samples with known size- and repeat-based alleles from multiple Yfiler fragment length analyses as well as known repeat counts from multiple sequence analyses performed previously.
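To make the touchdown notation in the PCR protocols concrete, the following Python sketch expands a protocol into its per-cycle annealing temperatures. It assumes that a notation such as "60–1°C" means an annealing temperature that starts at 60°C and drops by 1°C in each successive cycle, which is consistent with the stated cycle counts; the function name `touchdown_annealing` and its parameters are hypothetical.

```python
from typing import List


def touchdown_annealing(start_c: int, end_c: int, plateau_cycles: int,
                        step_c: int = 1) -> List[int]:
    """Annealing temperature (in degrees C) for each PCR cycle.

    The touchdown phase starts at start_c and drops by step_c per cycle
    until end_c would be reached; the plateau phase then runs
    plateau_cycles cycles at end_c.
    """
    temperatures = list(range(start_c, end_c, -step_c))  # touchdown phase
    temperatures.extend([end_c] * plateau_cycles)        # plateau phase
    return temperatures


# The four touchdown protocols named in the text: (start, end, plateau cycles).
protocols = {
    "60-50": (60, 50, 25),
    "60-55": (60, 55, 30),
    "65-50": (65, 50, 20),
    "70-50": (70, 50, 15),
}

for name, (start, end, plateau) in protocols.items():
    schedule = touchdown_annealing(start, end, plateau)
    touchdown_cycles = len(schedule) - plateau
    print(name, "touchdown cycles:", touchdown_cycles, "total cycles:", len(schedule))
```

Under this reading, the touchdown phase of each protocol has exactly the number of cycles given in the text (10, 5, 15, and 20), and every protocol totals 35 cycles, the same as the fixed-temperature protocol used for DYS458.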
Mutation rates were estimated by means of two different approaches: a frequentist approach and a Bayesian approach. Frequentist estimation of the mutation rates was conducted by dividing the number of sequence-confirmed mutations by the number of father–son pairs for every Y-STR locus and for every sampling region separately. Ninety-five percent confidence intervals of the mutation rates were established using a binomial model given the total number of working father–son pairs and the estimated mutation rate, and were obtained via the website http://statpages.org/confint.html. To test for locus-specific differences in the mean mutation rates between sampling regions (Cologne, Leipzig, Berlin, Warsaw, and Wroclaw), a permutation analysis was carried out. In each iteration, each father–son pair was assigned at random to one of the sampling regions, keeping the original population sample sizes. The average mutation rate computed for the permuted populations was then compared with the observed rate, and the number of times that the permuted average mutation rate was larger than the observed one was recorded. The one-tailed p value was obtained by dividing this number by the 100,000 iterations that were conducted for each locus. Overall mutation rate distributions collected from the present as well as previous studies were estimated by means of a binomial hierarchical Bayesian model [25] using the Markov chain Monte Carlo (MCMC) Gibbs sampling implemented in WinBUGS [26]. A non-informative prior normal distribution (μ = 0, σ = 1.0E−06) was specified to estimate the logit of the overall mutation rate, and a prior gamma distribution with parameters α = 1.0E−5 and β = 1.0E−5 was specified for the parameter tau, as suggested in WinBUGS. Three different Gibbs MCMC chains were generated when estimating the mutation rate for each locus, and 100,000 runs were performed for each chain. Mean, median, and 95% credible intervals (CI) were estimated from the three chains after discarding the first 50,000 runs and performing a thinning of 15 in order to reduce the amount of autocorrelation (representing a final number of 9,999 retained runs). Bayesian estimations of DYS385a and DYS385b separately (as only available from our own study) were performed using a binomial model with a uniform prior, which led to a posterior Beta distribution [25] with parameters α = m + 1 and β = n + 1, where m is the number of mutant father–son pairs and n is the number of non-mutant father–son pairs. The ratio of repeat gains versus losses and the ratio of single- versus multi-repeat changes were estimated using a multinomial-logistic Bayesian model. For the individual studies, the relatively low number of observed counts of each class required using informative priors, which highly skewed the posterior distributions towards the prior distributions, and credible intervals tended to be large, including the 1:1 ratio (results not shown). Therefore, we did not use the Bayesian approach for such estimations. The ages (at the time of the son’s birth) of fathers with and without mutations were compared with a Mann–Whitney U test. The effect of the father’s age on the mutation rate was estimated by means of a Bayesian approach. Mutation rate was modeled as a function of each age class using a Poisson distribution:
$$ p\left( y \mid \theta \right) = \prod\limits_{i = 1}^{n} \frac{1}{y_i !}\left( x_i \theta \right)^{y_i} e^{- x_i \theta} $$
where θ is the mutation rate, \( y_i \) is the number of mutations, and \( x_i \) is the number of father–son pairs for the age class i. θ is assumed to be dependent on the age of the father, with \( \theta = e^{\alpha a_i + \gamma} \), where \( a_i \) is the age of the fathers in age class i, α is the slope of the function, and γ is the associated error term. If the mutation rate θ is independent of the fathers’ age, α will be zero. Prior distributions for each parameter were chosen to be non-informative:
$$ \begin{array}{*{20}c} {\alpha \sim {\text{Normal}}\left( {\mu, \sigma_{\alpha } } \right)} \hfill \\ {\gamma \sim {\text{Normal}}\left( {0,\sigma_{\gamma } } \right)} \hfill \\ {\mu \sim {\text{Normal}}\left( {0,1000000} \right)} \hfill \\ {\sigma_{\alpha } \sim {\text{Gamma}}\left( {0.000001,0.000001} \right)} \hfill \\ {\sigma_{\gamma } \sim {\text{Gamma}}\left( {0.000001,0.000001} \right)} \hfill \\ \end{array} $$
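The Poisson age model above can be illustrated with a short Python sketch that evaluates the log of the likelihood for given values of α and γ. This is only a minimal illustration of the likelihood term, not the Bayesian (MCMC) fitting procedure with the priors listed above. The function name `poisson_log_likelihood`, the variable `theta_i` (θ evaluated for age class i), and the example age classes and counts are hypothetical.

```python
import math
from typing import Sequence


def poisson_log_likelihood(alpha: float, gamma: float,
                           ages: Sequence[float],
                           pairs: Sequence[int],
                           mutations: Sequence[int]) -> float:
    """Log of the Poisson likelihood p(y | theta) summed over age classes.

    For age class i, the mutation rate is theta_i = exp(alpha * a_i + gamma)
    and the number of mutations y_i is Poisson with mean x_i * theta_i.
    """
    total = 0.0
    for a_i, x_i, y_i in zip(ages, pairs, mutations):
        theta_i = math.exp(alpha * a_i + gamma)
        mean_i = x_i * theta_i
        # log[(1 / y_i!) * mean_i**y_i * exp(-mean_i)]
        total += y_i * math.log(mean_i) - mean_i - math.lgamma(y_i + 1)
    return total


# Made-up age classes (years), father-son pairs, and mutation counts.
ages = [20, 30, 40]
pairs = [500, 800, 400]
mutations = [1, 2, 2]

# alpha = 0 corresponds to a mutation rate that does not depend on age.
flat = poisson_log_likelihood(0.0, math.log(0.003), ages, pairs, mutations)
sloped = poisson_log_likelihood(0.03, math.log(0.001), ages, pairs, mutations)
print("log-likelihood, alpha = 0:   ", flat)
print("log-likelihood, alpha = 0.03:", sloped)
```

Working on the log scale with `math.lgamma` avoids computing large factorials directly. In a full analysis, this likelihood would be combined with the priors on α and γ to obtain their posterior distributions.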