Background
Osteoarthritis (OA), a debilitating musculoskeletal disease, is the main reason for permanent work incapacitation and for visits to primary care physicians. The therapies currently available for OA only relieve pain and do not address the structural alteration of the joint. Moreover, conventional diagnosis is ineffective for the early identification of patients in whom the disease will progress rapidly. This situation is a bottleneck for developing effective treatments that target the joint structure and for attaining precision medicine. Finding biomarkers that enable stratifying OA patients into subgroups/phenotypes will assist in a better understanding of individual patient needs and in the development of disease-modifying OA drugs (DMOADs). In this line of thought, genetics has been shown to play an important role in the prevalence and progression of OA [1–5], and genetic markers are believed to be important for the stratification of patients with OA.
Extensive genome-wide association studies (GWAS) have identified several single nucleotide polymorphisms (SNPs) associated with OA within different gene loci, including GDF5, MCF2L, TP63, FTO, DUS4L/COG5, GNL3, SUPT3H, and TGFA, to name a few [1–10]. Some SNPs were specific to a population, joint site, and/or gender.
Mounting evidence also implicates some mitochondrial DNA (mtDNA) SNPs in the pathogenesis of OA. The mtDNA is exclusively maternally transmitted, and its sequence evolution rate is higher than that of the average nuclear DNA. As a result, a significant number of mtDNA mutations have accumulated sequentially along radiating maternal lineages [11]. These accumulated mtDNA mutations define haplogroups, each characterized by the presence of a particular set of SNPs in its sequence. The most frequent Caucasian mtDNA haplogroups are H, J, T, U, K, and others (the latter not ascribed to any of these haplogroups) [4, 12–15], of which H is the most frequent (about 48%) [16]. In addition, haplogroups with a common phylogenetic origin are organized into clusters named HV, TJ, KU, and C-others. Although these mutations have been critical for human adaptation, some may be maladaptive in a different environment with new lifestyles. This could have occurred because these mutations lead to modifications in cytoplasmic signaling molecules, thus reprogramming nuclear DNA gene expression [11, 17]. Moreover, certain haplogroups are related to the pathogenesis of OA [4, 12, 18–23].
The need to develop tests to facilitate early and more appropriate therapeutic intervention is widely recognized and crucially required in the field of OA. The objective is to obtain not only an early and accurate diagnosis but also an early prognosis of the disease progression for a given individual. Precision and personalized medicine or at least stratified interventions could be achievable with biomarkers.
During the last few decades, researchers have looked at biomarkers, mostly proteins in the serum/urine, for early diagnosis, monitoring, or prediction of the course of the disease. Yet, none is sufficiently specific or sensitive, and none has been accepted by the regulatory agencies. In contrast to serum proteins, genetic markers are not affected by daily activities and are therefore more stable. A genetic OA biomarker should provide a robust and powerful tool for the early identification of OA patients at risk of structural progression.
In recent years, instead of using individual features to identify progressors, combining OA markers and patient parameters with machine learning (ML) approaches has been found successful. However, studies using ML methodologies generally included a small number of patients, did not lead to robust predictions, and used radiography and/or symptoms to define OA progressors [24–31]. Both radiography and symptoms are well known to lack sensitivity to early knee structure (tissue) alterations and their changes over time [32, 33]. Moreover, symptoms are not recommended because, in addition to not correlating well with knee OA structural progression, they are very subjective and dependent on the population studied. However, combining radiography with quantitative magnetic resonance imaging (MRI) variables improves the identification of structural progressors [30]. Indeed, MRI is very sensitive to knee structural alterations, which can be detected even before morphological alterations are seen with other imaging-based technologies [34].
For the first time, the present study aims to determine, by using ML technologies, the gender-based predictability of some SNP genes and mtDNA haplogroups/clusters alone or combined with two OA major risk factors (age and body mass index [BMI]) in the risk of being a structural progressor of the knee. Knee structural progression was determined using features from both radiography and quantitative MRI. The developed models were validated using tenfold cross-validation experiments and an external OA cohort from the community-based Tasmanian Older Adult Cohort Study (TASOAC).
Discussion
The main goal of the current study was to improve the clinical prognosis of knee OA for a better therapeutic approach. In this perspective, genetic biomarkers hold great potential for improving clinical outcomes in OA. We investigated, using ML methodologies, the value of SNP genes and mtDNA haplogroups/clusters as early predictors of knee OA structural progression.
All the SNPs evaluated in this work were shown to be related to OA [1, 3, 5–10]. Likewise, the mtDNA haplogroups H, J, T, Uk, and others and the clusters HV, TJ, and KU have all been associated with the disease. Haplogroups J and T, as well as cluster TJ, have been associated not only with a decreased risk of knee OA but also with a lower rate of incidence and progression of knee OA [4, 12, 18–21, 23], and patients belonging to cluster TJ had a slower OA progression than patients belonging to cluster KU [22]. In contrast, among patients with knee OA, (i) carriers of haplogroup U and cluster KU showed a more severe progression of the disease [18], (ii) carriers of haplogroup H were more prone to OA progression and also to total joint replacement [12, 19, 22], and (iii) cluster HV had a marginal correlation with OA [59].
Data-driven approaches were used because such methodologies do not require an a priori hypothesis and are, therefore, able to identify unanticipated patterns in the data and offer the potential to provide new insights. These methods are widely used in medical research but have only recently been applied to OA. In an ML-based study, one of the important challenges is the selection of an appropriate supervised methodology enabling optimal performance in classifying the dataset. Here, we compared seven ML techniques, each with its hyperparameters fine-tuned. The data showed that the supervised support vector machine (SVM) methodology had the highest accuracy.
Using the SVM methodology, the gender-based models consisting of all the variables (n = 12) had high accuracy (88.8%). However, to be more easily applicable, a model should in general use a minimum of features while retaining the maximum possible accuracy. To this end, and to determine the optimum model, 277 models were evaluated, and two gender-based scenarios were developed. The first (scenario 1) consisted of age, BMI, and four SNP genes (TP63, DUS4L, GDF5, FTO) and had an accuracy of 85.0%.
Furthermore, a second scenario was developed by exploiting data from the synergy analysis, in which a moderate level of synergy was found between age and mtDNA haplogroup, and by using these two as fixed variables in addition to BMI. This scenario included the three fixed variables (age, BMI, and mtDNA haplogroup) plus the SNP genes FTO and SUPT3H, and had an accuracy of 82.5%. Because it has one fewer variable than scenario 1, it was selected as the optimum model to predict early structural OA knee progressors. In this model, the mtDNA haplogroup H, as well as the presence of the alleles CT for rs8044769 at FTO and the absence of AA for rs10948172 at SUPT3H, demonstrated the highest impact.
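The trade-off described above, keeping a few variables fixed and searching for the smallest model whose accuracy stays close to the best, can be sketched as follows. This is only an illustration under stated assumptions: the candidate SNP list is taken from the loci named in the Background, while `toy_score`, the `tolerance` value, and the cap of three extra SNPs are hypothetical stand-ins and do not reproduce the study's 277 models or its SVM accuracies.

```python
from itertools import combinations

# Fixed variables as in scenario 2; candidate SNP loci as named in the Background.
FIXED = ("age", "BMI", "mtDNA_haplogroup")
CANDIDATE_SNPS = ("GDF5", "MCF2L", "TP63", "FTO", "DUS4L", "GNL3", "SUPT3H", "TGFA")


def enumerate_models(fixed, candidates, max_extra):
    """Yield every feature set made of the fixed variables plus 0..max_extra candidates."""
    for k in range(max_extra + 1):
        for extra in combinations(candidates, k):
            yield fixed + extra


def pick_parsimonious(models, score, tolerance=0.0):
    """Return (model, score) for the smallest model within `tolerance` of the best score."""
    scored = [(m, score(m)) for m in models]
    best = max(s for _, s in scored)
    eligible = [(m, s) for m, s in scored if s >= best - tolerance]
    # Fewest features first; ties are broken by the higher score.
    return min(eligible, key=lambda ms: (len(ms[0]), -ms[1]))


def toy_score(features):
    """Hypothetical stand-in for a cross-validated accuracy; NOT the study's results."""
    useful = {"FTO": 0.04, "SUPT3H": 0.03, "TP63": 0.01}
    return 0.75 + sum(useful.get(f, 0.0) for f in features)


models = list(enumerate_models(FIXED, CANDIDATE_SNPS, max_extra=3))
best_model, best_score = pick_parsimonious(models, toy_score, tolerance=0.015)
print(len(models), best_model, round(best_score, 3))
```

In a real analysis, `toy_score` would be replaced by the cross-validated accuracy of the chosen classifier on each feature set, and the tolerance would express how much accuracy one is willing to give up for each variable removed.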
The involvement of the mtDNA haplogroup H as a predictor of poor OA prognosis is not new. Different studies demonstrated that, compared with carriers of other mtDNA haplogroups, especially those belonging to the mtDNA cluster JT, patients with the haplogroup H (or cluster HV) show a higher rate of OA incidence and progression [4, 12]. Among the proposed functional consequences of harboring this variant, higher free radical production, lower cell survival under oxidative stress conditions, and a higher grade of apoptosis stand out [12].
In addition, the effect of specific nuclear risk alleles can be conditioned by the mitochondrial background and vice versa through mitonuclear interactions [60–62]. This was reported in diseases such as Alzheimer’s, where an association between the cluster HV and the risk of this disease following adjustment for the apolipoprotein E gene (APOE4) status was detected [63], and in obese patients with type 1 diabetes mellitus [45]. Mitonuclear interactions have also been described in terms of the differential association of the haplogroups H and J with the methylation status of articular cartilage by which apoptosis, among other biological processes, is enhanced in cartilage with the haplogroup H and repressed in those having the haplogroup J [64].
Interestingly, the rs8044769 variant at FTO was found to be linked to OA via its effect on the BMI [43]. Taking into account, on the one hand, previous associations of mitochondrial background with the risk of obesity [44–46] and, on the other hand, the differential methylation pattern between haplogroups H and J in cartilage in genes related to developmental processes, including the homeobox family [64], potential interactions involving haplogroup H/cluster HV that can modify the risk of structural progression in OA are not surprising.
This study has several strengths. The population included a sizeable number of participants of both genders, enabling the models to be developed in a gender-based fashion and permitting a high accuracy of the models. The validation and reproducibility of the developed models, assessed using cross-validation and an external cohort, respectively, demonstrated excellent accuracies for both the M32-3 and MH17 models, reinforcing the robustness and generalizability of the developed models. In addition, the OAI and TASOAC cohorts consist of people in the mild-to-moderate stage of the disease, thus representing a general population. Another strength is that, for classifying joint structural progressors (PVBSP) for each individual, and as suggested by Nelson et al. [30], we applied an image-based prediction algorithm from our previous study that uses both radiographic and MRI variables, with X-ray change over time as the outcome [35]. Finally, developing our models with both genetic and demographic information could have improved their ability to predict knee structural alterations compared with using genetic information alone, as previously described [65]. Moreover, incorporating modifiable risk factors (e.g., BMI) could also have increased the accuracy of the predictive models, as previously described for knee OA [66].
Like all studies, the present one has limitations. First, the participants were all of Caucasian origin; therefore, the results of this study may not extend beyond this ethnicity. Second, although we used the most common SNP genes and Caucasian mtDNA haplogroups associated with OA, others could also be tested. Third, for some SNP genes, the number of participants having a specific allele was limited (Additional file 1: Fig. S1). Fourth, although unbiased evaluation of the models in the training and testing stages was confirmed by tenfold cross-validation, one might argue that the results from the OAI cohort may be optimistic in forecasting PVBSP, as a nested cross-validation could have been used. However, this concern is alleviated by the validation analysis using an independent external cohort, in which an excellent accuracy was obtained for both developed ML models. Fifth, we acknowledge that one of the important challenges in performing this study was the proximity of the results in the development of the models and, more specifically, in the sensitivity analysis. The results of the different models for SNP genes and mtDNA haplogroups/clusters were very close, and the important variables were selected based on small differences using both training and testing stages. However, by doing this, we were able to decrease the number of variables from 12 to only five or six, while maintaining high accuracies for the models.
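To make the fourth limitation concrete, the sketch below shows the data flow of a nested cross-validation: an inner loop tunes a hyperparameter using only the outer-training rows, and the outer held-out fold is used once to estimate performance. It is a minimal illustration under stated assumptions, not the study's pipeline: a toy one-dimensional k-nearest-neighbour classifier stands in for the SVM, the data are synthetic, and all function names and fold counts are hypothetical.

```python
import random


def k_fold(indices, k, seed):
    """Shuffle the given indices and split them into k nearly equal folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def knn_predict(x, train_idx, xs, ys, k):
    """Majority vote among the k training points nearest to x (1-D toy model)."""
    nearest = sorted(train_idx, key=lambda i: abs(xs[i] - x))[:k]
    return int(2 * sum(ys[i] for i in nearest) > len(nearest))


def accuracy(train_idx, test_idx, xs, ys, k):
    """Fraction of test rows classified correctly by k-NN fitted on train rows."""
    hits = sum(knn_predict(xs[i], train_idx, xs, ys, k) == ys[i] for i in test_idx)
    return hits / len(test_idx)


def cv_accuracy(indices, xs, ys, k, n_folds, seed):
    """Mean accuracy over n_folds, training on all folds except the held-out one."""
    folds = k_fold(indices, n_folds, seed)
    scores = []
    for held_out in folds:
        train = [i for f in folds if f is not held_out for i in f]
        scores.append(accuracy(train, held_out, xs, ys, k))
    return sum(scores) / n_folds


def nested_cv(xs, ys, k_grid, outer_folds=5, inner_folds=3, seed=0):
    """Outer folds estimate performance; inner folds choose the hyperparameter k."""
    folds = k_fold(range(len(xs)), outer_folds, seed)
    outer_scores = []
    for held_out in folds:
        train = [i for f in folds if f is not held_out for i in f]
        # Tune on the outer-training rows only; the held-out fold is never seen here.
        best_k = max(k_grid, key=lambda k: cv_accuracy(train, xs, ys, k, inner_folds, seed + 1))
        outer_scores.append(accuracy(train, held_out, xs, ys, best_k))
    return sum(outer_scores) / outer_folds


# Synthetic, well-separated toy data (not study data).
rng = random.Random(42)
xs = [rng.gauss(0.0, 1.0) for _ in range(30)] + [rng.gauss(6.0, 1.0) for _ in range(30)]
ys = [0] * 30 + [1] * 30
estimate = nested_cv(xs, ys, k_grid=(1, 3, 5))
print(round(estimate, 3))
```

With a single (non-nested) cross-validation, the same folds would be used both to pick the hyperparameter and to report accuracy, which is the source of the possible optimism noted above.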
Results from this study are translational for Caucasians at risk of structural progressive knee OA and could have high and direct clinical relevance, as they could improve clinical prognosis with real-time patient monitoring. The next step will be to transform these automated models of OA knee structural progressors into an application that makes them practical for clinicians to use for a given patient. These models could be used early during the OA process and guide clinicians in adapting the therapeutic strategy to reduce harmful long-term outcomes. In addition, they could assist in the design of knee DMOAD clinical trials. As the disease progression may be slow for many OA patients, DMOAD trials require extremely large numbers of patients and long follow-up periods. However, stratifying patients who will likely have more rapid knee structural progression would result in trials enriched with appropriate patients, by discriminating potential responders from non-responders for a given therapeutic approach. Such a selection of OA patients, which at present is a major hurdle for DMOAD clinical trials, would result in lower trial costs, opportunities for testing more products, and faster end results.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.