Introduction
Increasing understanding of human genome variability is enabling better use of DNA’s predictive potential [1]. Besides clinical applications, predictive DNA analysis can be useful in forensics for intelligence purposes [2], in molecular anthropology [3] and in identification of historical figures [4–6]. In recent years, intensive research has been carried out on the prediction of various human appearance characteristics [e.g. 7–15]. The most significant progress was made in the prediction of pigmentation characteristics, and eye colour in particular [16]. Nevertheless, the genetic architecture of some categories of pigmentation phenotypes remains elusive, their prediction is still inaccurate and research to improve accuracy continues. One such category is intermediate eye colour, which in the most commonly used IrisPlex model is predicted with low sensitivity [16]. Because of the very complex genetic basis of appearance traits, a promising direction is to build predictive tools that select markers on the criterion of improved prediction rather than genetic association, and to use more advanced mathematical methods in prediction modelling [17].
There are many machine learning (ML) methods available for developing predictive models, and their effectiveness may depend on the type and amount of data used; some of them may be more suitable than others for taking into account diverse genetic phenomena, including epistasis. First, we can distinguish linear and nonlinear methods [18]. The linear methods in their basic form are limited to detecting a linear dependency between a class variable and attributes. Representative examples are logistic and multinomial regression, linear discriminant analysis (LDA), the basic linear version of support vector machines (SVM) and the perceptron. The nonlinear methods are designed to detect more complex dependencies between a class variable and attributes. Examples include various tree-based methods, multivariate adaptive regression splines (MARS) and multilayer neural networks (NN). The advantage of the first group is the relatively low computational cost of fitting the model as well as simplicity and interpretability. On the other hand, nonlinear models usually achieve greater predictive power, especially in the case of complex classification problems. Moreover, they are also able to detect interactions among attributes [19]. In addition to single models, ensemble techniques, which combine multiple learning algorithms, have gained great popularity. It has been shown that ensemble methods such as random forest (RF) or extreme gradient boosting (XGB) are among the most powerful classification models; they usually achieve significantly higher accuracy than simple models. The price for this is a higher computational cost and more complicated interpretation. An important line of research in ML is focused on combining classification methods with feature selection techniques. Feature selection plays a crucial role in many analyses, especially when the number of attributes is large compared with the sample size. Selection of relevant attributes improves the understandability of the considered model and allows one to discover the relationship between attributes and the class variable. It also helps to devise approaches with better generalization and larger predictive power [20]. In the case of some classification methods, feature selection is an integral element of learning the model; for example, in tree-based methods, relevant attributes are chosen during the building of the tree. Another solution is to use regularization techniques [18], such as least absolute shrinkage and selection operator (LASSO) regularization, which ensure sparsity in the parameter vector and allow one to find attributes influencing the class variable.
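As a minimal illustration of why a LASSO penalty leads to a sparse parameter vector, the sketch below applies the soft-thresholding operator, which gives the LASSO solution coordinate by coordinate in the special case of an orthonormal design. The coefficient values and the penalty strength `lam` are invented for illustration and are not estimates from this study.

```python
def soft_threshold(beta_ols, lam):
    """Shrink a coefficient towards zero by lam; return exactly 0.0
    when its absolute value does not exceed lam."""
    if beta_ols > lam:
        return beta_ols - lam
    if beta_ols < -lam:
        return beta_ols + lam
    return 0.0


# Hypothetical unpenalised coefficients for five markers (illustrative only).
ols_coefficients = [2.5, -0.3, 0.1, -1.2, 0.4]
lam = 0.5  # assumed penalty strength

lasso_coefficients = [soft_threshold(b, lam) for b in ols_coefficients]

# Markers whose coefficient survives the penalty are the "selected" ones.
selected = [i for i, b in enumerate(lasso_coefficients) if b != 0.0]

print(lasso_coefficients)
print(selected)
```

Coefficients with small absolute value are set exactly to zero rather than merely shrunk, which is what allows the penalty to act as a feature selection step.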
In this study, we explored the possibility of increasing the accuracy of eye colour prediction. To this end, we adopted the following strategies: (1) quantitative characterization of samples using high-quality images of the iris analysed with Digital Iris Analysis Tool (DIAT) software; (2) whole-exome sequencing (WES)-based identification of new potential predictors in a group of 150 phenotypically diverse Polish samples using the HyperLasso method and regression-based single-SNP association testing; (3) predictive modelling based on literature-derived and WES-identified markers, using various machine learning algorithms and independent sets of samples, in order to find the most accurate method for eye colour prediction in a moderate-dimensional dataset.
Discussion
Accuracy of phenotype prediction from genetic data is essential for the successful application of predictive methods in biomedical studies including anthropology, paleogenetics and forensics [29]. Several factors determine good accuracy of DNA-based predictive methods, including high heritability of a trait, identification of appropriate predictors and selection of the best mathematical approach to model development. Even highly heritable traits are often difficult to predict, due to polygenicity, epistasis, and allelic and locus heterogeneity. In this study, we used quantitative assessment of eye colour phenotypes and whole exome/regulome sequencing to identify additional predictors, and we evaluated multiple machine learning methods to assess their impact on prediction accuracy, focusing especially on the more complex intermediate phenotypes. The studied cohort of Polish individuals shows a relatively large diversity of pigmentation phenotypes compared to some other European populations, which makes it useful for studying the genetics of pigmentation traits. Objective phenotyping of eye colour for finding new loci provided quantitative measurements. The analysis using DIAT software [21] confirmed that the calculated PIE score, reflecting the ratio of blue to brown pixels, correlates highly with human evaluation of eye colour (Spearman correlation = −0.82, P-value = 5.46 × 10⁻²³³). Using single-SNP association testing of the WES/regulome data under P-value < 1 × 10⁻⁴, and the HyperLasso algorithm, which aims to select the subset of SNPs that best predicts the trait under study while controlling the type I error of the selected variants [24], we identified 34 SNPs and age as important factors for eye colour prediction. In the next step, we moved directly to extensive predictive modelling.
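The comparison between a quantitative iris score and a human-assigned colour category can be sketched as a rank correlation. The example below is a plain-Python Spearman coefficient with average ranks for ties; the scores, the category coding (1 = blue, 2 = intermediate, 3 = brown) and all values are invented for illustration and do not reproduce the PIE scores or the grading scheme of this study.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical quantitative scores (higher = more blue) and grader categories.
scores = [0.9, 0.7, 0.8, 0.1, -0.2, -0.6, -0.9, 0.3]
grades = [1, 1, 1, 2, 2, 3, 3, 2]

rho = spearman(scores, grades)
print(round(rho, 3))
```

With this assumed coding, a strongly negative coefficient means that higher (bluer) scores go with lower category codes, which is the direction of the correlation reported above.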
A large number of algorithms have been developed to deal with a variety of increasingly demanding and computationally challenging data analyses. Analysis of AIC, BIC and LASSO methods of marker selection conducted in this study revealed that all of them are robust, since they produced models with better performance compared to models without any selection method applied (i.e. LOG FULL). We confirmed that BIC, which more heavily penalizes the introduction of additional variables, produced the most parsimonious models. Together with BIC, LASSO yielded models with the best predictive performance. Interestingly, focusing on SNPs selected by at least two out of three feature selection methods and in at least 50% of data splits, we found the well-known pigmentation markers and the intronic variant rs2253104 in ARFIP2, newly identified by HyperLasso (Table 3).
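The heavier penalty of BIC mentioned above follows from the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of parameters, n the sample size and L the maximised likelihood. The sketch below compares the two criteria for two hypothetical models; the log-likelihoods, parameter counts and sample size are invented and are not results of this study.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: each parameter costs 2."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: each parameter costs ln(n_samples)."""
    return n_params * math.log(n_samples) - 2 * log_likelihood


n_samples = 1000  # assumed sample size

# Two hypothetical nested models: the larger one fits slightly better.
small = {"log_likelihood": -520.0, "n_params": 5}
large = {"log_likelihood": -514.0, "n_params": 8}

aic_small = aic(small["log_likelihood"], small["n_params"])
aic_large = aic(large["log_likelihood"], large["n_params"])
bic_small = bic(small["log_likelihood"], small["n_params"], n_samples)
bic_large = bic(large["log_likelihood"], large["n_params"], n_samples)

# Lower values are better for both criteria.
print("AIC prefers:", "large" if aic_large < aic_small else "small")
print("BIC prefers:", "large" if bic_large < bic_small else "small")
```

In this toy case the modest gain in fit is enough to justify three extra parameters under AIC but not under BIC, since ln(1000) is about 6.9 per parameter compared with 2 for AIC.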
ARFIP2 is located on 11p15.4 and encodes ADP-ribosylation factor-interacting protein 2 (ARFIP2), which is highly expressed in various tissues. This protein has been shown to be involved in several cellular processes and signalling pathways. They include Rac1-mediated signalling, triggering actin polymerization [30], which in melanocytes is involved in dendrite formation and therefore the transport of melanosomes to keratinocytes [31]. Also, ARFIP2 has been shown to negatively regulate NF-κB signalling [32], inducing expression of MITF, one of the most important melanogenesis regulators [33]. Interestingly, ARFIP2 was among the downregulated genes in human melanoma cells treated with arbutin [34], which is a known inhibitor of melanin biosynthesis used in cosmetology for skin whitening [35]. Therefore, although there is no evidence that ARFIP2 is directly involved in melanogenesis, it is possible that it may be engaged in indirect regulation of pigmentation-related genes. It has been speculated that the missing heritability of many complex traits can be explained by gene action outside the core pathways [36]. So far, rs2253104 in ARFIP2 has been associated with lung cancer [37]. The rs2253104 variant in ARFIP2 was selected in 65% of LOG AIC models and in 51% of LOG REG models, and therefore more frequently than, e.g. rs12203592 in IRF4 or rs1408799 in TYRP1 (LOG REG), the other well-established eye colour predictors (Fig. 1). Nevertheless, as the univariate association analysis did not reveal a statistically significant association of rs2253104 with eye colour in either the discovery or the predictive modelling cohort, its effect appears to be very complex and the direction of the effect is difficult to interpret. Therefore, further studies are needed to support our hypothesis about the potential role of this variant and to better understand this effect. The nonsynonymous OCA2 rs74653330 variant, which was very often selected by all three (AIC, BIC, REG) methods, also deserves more attention. The research by Yuasa et al. showed a north–south geographic gradient of the rarer T allele, which was interpreted as a possible case of adaptive evolution [38]. Indeed, it has been suggested that this OCA2 variant is responsible for reduced efficiency of melanogenesis [39] and thus lighter pigmentation, which is favoured in areas with lower ultraviolet radiation. Notably, the T allele was also found to have a measurable effect on normal eye colour variation in Scandinavian samples [40, 41]. The frequency of the minor T allele in the Scandinavian population was 0.005 and this variant was not present in the Italian and Portuguese populations [40]. In our population, the derived T allele was observed 10 times in 999 individuals, in heterozygous genotypes. Our study confirms the importance of rs74653330 for eye colour prediction and further indicates that allelic heterogeneity together with population-specific differences in allele frequencies may be important factors in predictive DNA analysis. Other SNPs for eye colour prediction included three out of six variants implemented in the IrisPlex model: rs12913832 (HERC2), rs1800407 (OCA2) and rs16891982 (SLC45A2) [16, 42] as well as others, previously associated with eye patterning (rs10874518, OLFM3; [43]), or other pigmentation traits (rs885479, MC1R; rs8049897, DEF8; [44]) (Table 3).
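The consensus rule used earlier in this section (SNPs selected by at least two of three feature selection methods and in at least 50% of data splits) can be sketched as follows. This is one possible reading of the rule, in which a marker must reach the split-frequency threshold within a method for that method to count; the marker names and selections are hypothetical and do not correspond to Table 3.

```python
def selection_frequency(split_selections):
    """Fraction of data splits in which each marker was selected."""
    n_splits = len(split_selections)
    counts = {}
    for chosen in split_selections:
        for marker in chosen:
            counts[marker] = counts.get(marker, 0) + 1
    return {marker: c / n_splits for marker, c in counts.items()}


def consensus_markers(selections_by_method, min_methods=2, min_fraction=0.5):
    """Markers reaching min_fraction of splits in at least min_methods methods."""
    support = {}
    for split_selections in selections_by_method.values():
        for marker, freq in selection_frequency(split_selections).items():
            if freq >= min_fraction:
                support[marker] = support.get(marker, 0) + 1
    return sorted(m for m, n in support.items() if n >= min_methods)


# Hypothetical selections: one set of markers per data split and method.
toy_selections = {
    "AIC": [{"snp_a", "snp_b"}, {"snp_a", "snp_c"}, {"snp_a", "snp_b"}, {"snp_a"}],
    "BIC": [{"snp_a"}, {"snp_a"}, {"snp_a", "snp_b"}, {"snp_a"}],
    "LASSO": [{"snp_a", "snp_c"}, {"snp_a", "snp_c"}, {"snp_b"}, {"snp_a", "snp_b"}],
}

consensus = consensus_markers(toy_selections)
print(consensus)
```

Requiring agreement across both methods and data splits favours markers whose selection does not depend on a particular penalty or on a particular partition of the samples.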
Importantly, the accuracy of predicting intermediate eye colour achieved a high level (e.g. regression model developed with BIC approach or with LASSO regularization: AUC = 0.85), higher than reported for IrisPlex [16] and Snipper [45], the two most widely used eye colour predictive tools. In the data analysed here, the sensitivity of intermediate eye colour prediction was also better (LOG BIC sens. = 0.29) compared to the results obtained with the original IrisPlex model (sens. = 0.00). In previous research, a significant increase in the sensitivity of intermediate eye colour prediction was achieved due to additional variation in the HERC2 gene included in the predictive model. The positive effect was, however, reversible, since the addition of other HERC2 variants decreased the ability of the model to predict intermediate eye colours [45]. A small increase in the accuracy of intermediate eye colour prediction was also reported in a study that involved genetic interactions [46].
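The two metrics quoted above measure different things, which is why a high AUC can coexist with a low sensitivity for the intermediate class. The sketch below computes a one-vs-rest sensitivity from predicted classes and a one-vs-rest AUC from predicted probabilities (as the probability that a random intermediate case scores higher than a random non-intermediate case). All labels and probabilities are invented for illustration.

```python
def sensitivity(true_labels, predicted_labels, target):
    """Share of true target cases that were predicted as the target class."""
    pairs = list(zip(true_labels, predicted_labels))
    true_positives = sum(t == target and p == target for t, p in pairs)
    positives = sum(t == target for t in true_labels)
    return true_positives / positives


def auc_one_vs_rest(true_labels, scores, target):
    """Probability that a target case outranks a non-target case (ties = 0.5)."""
    pos = [s for t, s in zip(true_labels, scores) if t == target]
    neg = [s for t, s in zip(true_labels, scores) if t != target]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical cohort of eight individuals.
true_labels = ["blue", "blue", "blue", "interm", "interm", "brown", "brown", "brown"]
# Assumed predicted probability of the intermediate class for each individual.
p_interm = [0.10, 0.15, 0.20, 0.30, 0.45, 0.35, 0.12, 0.05]
# Assumed final class calls (the class with the highest predicted probability).
predicted = ["blue", "blue", "blue", "blue", "interm", "brown", "brown", "brown"]

sens_interm = sensitivity(true_labels, predicted, "interm")
auc_interm = auc_one_vs_rest(true_labels, p_interm, "interm")
print(sens_interm, round(auc_interm, 3))
```

AUC depends only on how the intermediate probabilities are ranked, whereas sensitivity requires the intermediate probability to exceed those of blue and brown, which is rarely the case when intermediate probabilities are moderate.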
Besides classical regression, several more advanced machine learning algorithms were evaluated. The study demonstrated that advanced machine learning methods showed even higher sensitivity values of intermediate eye colour prediction (i.e. TREE, XGB and MARS, with intermediate eye colour sensitivity of 0.34–0.39); however, a slightly reduced sensitivity of brown eye colour prediction was observed for these models when compared to the regression model. It is well known that more advanced machine learning methods may better cope with recognition of complex phenotypes, including intermediate eye colour, due to the ability to identify possible nonlinear dependencies between variables, such as interactions. Nevertheless, while some advanced methods were found to demonstrate increased sensitivity or specificity in predicting certain categories, none of these approaches outperformed the regression method developed following prior feature selection using BIC or LASSO, when AUC or accuracy metrics were compared. Moreover, differences between the tested methods were modest. These results suggest that more sophisticated learning algorithms may need larger datasets to demonstrate their superiority and do not reveal their potential in low- and medium-dimensional data. Also, a systematic review [47] of logistic regression and other machine learning methods (among which the most common were classification trees, random forests, artificial neural networks and support vector machines) showed that, among studies at low risk of bias, no performance benefit of machine learning over logistic regression was reported for clinical prediction models. Further, evaluation of deep learning methods (multilayer perceptron and convolutional neural networks) conducted on high-dimensional data (~100 k individuals and ~500 k SNPs) did not provide any proof that these methods outperform simple linear methods and improve complex human trait prediction by a sizeable margin [48]. Although our analysis did not involve advanced tuning of machine learning hyperparameters aimed at improving the obtained prediction accuracies, there is evidence in the literature that such tuning may still not significantly improve accuracy [49]. Nevertheless, it was found that the superiority of the advanced ML approaches (random forests) depends on the dataset and tends to be more pronounced for an increasing number of analysed features or an increase in the ratio of the number of features to the number of cases [50]. Indeed, it has been shown that some of the advanced algorithms can be very successful in predicting complex traits if applied to very high-dimensional data [51]. It is also worth noting that advanced machine learning methods outperform basic linear regression in age prediction using DNA methylation data. In the evaluation of 17 different machine learning approaches performed by Aliferi et al., the support vector machine with a polynomial function was chosen as a highly robust, generalizable and best-performing modelling approach [52], as it was in another previous study [53]. Also, neural networks (e.g. [54]) and random forest regression [55] were successfully applied to accurate human age prediction. This demonstrates the superiority of some ML approaches over classical regression methods in data with observed nonlinear correlation effects and also suggests a possible dependence of ML methods’ efficiency on the data type: discrete for SNP vs. quantitative for DNA methylation.
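The kind of nonlinear dependency discussed above can be illustrated with an extreme, purely interactive toy case in which neither of two binary markers shows any marginal association with the phenotype, yet their combination determines it completely. The data are synthetic and the XOR-type effect is an assumption chosen for clarity, not a pattern observed in this study.

```python
from itertools import product

# Synthetic cohort: 25 copies of each genotype combination of two binary
# markers; the phenotype is 1 only when exactly one marker is present.
samples = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)] * 25


def accuracy(rule):
    """Fraction of samples for which rule(a, b) equals the phenotype."""
    return sum(rule(a, b) == y for a, b, y in samples) / len(samples)


def marker_a(a, b):
    return a


def not_marker_a(a, b):
    return 1 - a


def marker_b(a, b):
    return b


def not_marker_b(a, b):
    return 1 - b


def interaction_rule(a, b):
    """Rule that uses both markers jointly, as a depth-2 tree could."""
    return a ^ b


# Every possible rule that looks at a single marker only.
single_marker_rules = [marker_a, not_marker_a, marker_b, not_marker_b]
best_single = max(accuracy(rule) for rule in single_marker_rules)

print(best_single, accuracy(interaction_rule))
```

In such a case single-marker association testing would show no effect for either marker, and an additive model without an interaction term cannot represent the pattern exactly, whereas a tree-based method can. Real genetic effects are rarely this extreme, which is consistent with the modest differences between methods reported here.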
In summary, whole-exome sequencing of 150 individuals has allowed identification of 27 DNA variants that are relevant for eye colour prediction and have not been reported before in pigmentation predictive studies. Besides well-known pigmentation-associated variants, rs2253104 in ARFIP2 was selected by at least two different feature selection methods for regression predictive models, which turned out to be the most accurate. None of the sophisticated machine learning algorithms outperformed the overall prediction accuracy of regression models developed following prior feature selection using BIC or LASSO regularization, indicating that medium-dimensional data do not exploit the full potential of these more advanced algorithms.