Introduction
Costs associated with genomic investigations continue to fall [1], while the richness of the data generated increases. Globally, the adoption of wide-scale genome sequencing implies that all newborn infants may receive pre-emptive screening for pathogenic genetic variants at an asymptomatic stage [2]. The one-dimensionality of individual genomes is now being expanded by the possibility of massively parallel sequencing for somatic variant analysis and by single-cell or lineage-specific genotyping, culminating in a genotype spectrum. In whole blood, virtually every nucleotide position may be mutated across 10^5 cells [3]. Mapping one’s genotype across multiple cell types and at several periods during a person’s life may soon be feasible [4]. Such genotype snapshots might allow for prediction and tracking of somatic, epigenetic, and transcriptomic profiles.
The predictive value of genomic screening depends heavily on the computational tools used for data analysis and on how well the data correlate with functional assays or prior clinical experience. Interpretation of that data is especially challenging for rare human genetic disorders; candidate disease-causing variants that are predicted as pathogenic often require complex functional investigations to confirm their significance. There is a need for predictive genomic modelling that aims to provide reliable guidance for therapeutic intervention in patients harboring genetic defects that cause life-threatening disease, before the illness becomes clinically significant.
The study of predictive genomics is exemplified by consideration of gene essentiality, which is assessed by observing intolerance to loss-of-function variants. Several gene essentiality scoring methods are available for both the coding and non-coding genome [5]. Approximately 3000 human genes cannot tolerate the loss of one allele [5]. The greatest hurdle in monogenic disease is the interpretation of variants of unknown significance, while functional validation is a major time and cost investment for laboratories investigating rare disease.
Severe, life-threatening immune diseases are caused by genetic variations in almost 300 genes [6, 7]; however, only a small percentage of disease-causing variants have been characterized using functional studies. Several robust tools are in common usage for predicting variant pathogenicity. In contrast to pathogenicity prediction, a gap remains in predicting mutation probability, which is essential for efficient pre-emptive validation. Our investigation aims to apply predictive genomics as a tool to identify genetic variants that are most likely to be seen in patient cohorts.
We present the first application of our novel approach to predictive genomics, using recombination activating gene 1 (RAG1) and RAG2 deficiency as a model for a rare primary immunodeficiency (PID) caused by autosomal recessive variants.
RAG1 and RAG2 encode lymphoid-specific proteins that are essential for V(D)J recombination. This genetic recombination mechanism is essential for a robust immune response by diversification of the T and B cell repertoire in the thymus and bone marrow, respectively [8, 9]. Deficiency of RAG1 [10] and RAG2 [11] in mice causes inhibition of B and T cell development. Schwarz et al. [12] published the first report that RAG mutations in humans cause severe combined immunodeficiency (SCID) with a deficiency in peripheral B and T cells. Patient studies identified a form of immune dysregulation known as Omenn syndrome [13, 14]. The patient phenotype includes multi-organ infiltration with oligoclonal, activated T cells. The first reported cases of Omenn syndrome identified infants with hypomorphic RAG variants which retained partial recombination activity [15]. RAG deficiency can be measured by in vitro quantification of recombination activity [16–18]. Hypomorphic RAG1 and RAG2 mutations, which leave residual V(D)J recombination activity (on average 5–30%), result in a distinct phenotype of combined immunodeficiency with granuloma and/or autoimmunity (CID-G/A) [2, 19, 20].
Human RAG deficiency has traditionally been identified at very early ages due to the rapid drop of maternally acquired antibody in the first six months of life. A loss of adequate lymphocyte development quickly results in compromised immune responses. More recently, we have found that RAG deficiency is also present in some adults living with PID [16].
RAG1 and RAG2 are highly conserved genes, but disease is only reported with autosomal recessive inheritance. Only 44% of amino acids in RAG1 and RAG2 are reported as mutated on GnomAD, and functional validation of candidate variants is difficult [21]. Pre-emptive selection of residues for functional validation is a major challenge; a selection based on low allele frequency alone is infeasible since the majority of each gene is highly conserved. A shortened time between genetic analysis and diagnosis means that treatments may be delivered earlier. RAG deficiency may present with diverse phenotypes, and treatment strategies vary. With predictive tools, early intervention may be prompted. Some patients could benefit from hematopoietic stem cell transplant [22] when necessary, while others may be provided mechanism-based treatment [23]. Here, we provide a new method for predictive scoring that was validated against groups of functional assay values, human disease cases, and population genetics data. We present the list of variants most likely to be seen as future determinants of RAG deficiency, meriting functional investigation.
Discussion
Selection of candidate disease-causing variants for functional analysis typically targets conserved gene regions. On GnomAD, 56% of RAG1 (approx. 246,000 alleles) is conserved with no reported variants. Functional validation of unknown variants in genes with this level of purifying selection is generally infeasible. Furthermore, we saw that a vast number of candidates are “predicted pathogenic” by commonly used pathogenicity tools; these may indeed be damaging but are unlikely to occur. To overcome the challenge of manual selection, we quantified the likelihood of mutation for each candidate variant.
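As a rough illustration of the kind of calculation involved, the following Python sketch computes the fraction of residues with no reported variant and ranks positions by a simple per-residue-type variant rate. It is a toy proxy under stated assumptions and is not the MRF calculation used in this study; the function names, the short example sequence, and the variant positions are all hypothetical.

```python
from collections import Counter


def conserved_fraction(protein_length, variant_positions):
    """Fraction of residue positions (1-based) with no reported variant."""
    observed = {p for p in variant_positions if 1 <= p <= protein_length}
    return 1 - len(observed) / protein_length


def residue_type_rates(sequence, variant_positions):
    """For each amino acid type, the share of its positions that carry
    at least one reported variant (a crude stand-in for a mutation rate)."""
    totals = Counter(sequence)
    valid = {p for p in variant_positions if 1 <= p <= len(sequence)}
    hits = Counter(sequence[p - 1] for p in valid)
    return {aa: hits[aa] / totals[aa] for aa in totals}


def rank_positions(sequence, variant_positions):
    """Rank every position by the rate of its residue type, highest first."""
    rates = residue_type_rates(sequence, variant_positions)
    scored = [(i + 1, aa, rates[aa]) for i, aa in enumerate(sequence)]
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Hypothetical five-residue peptide with variants reported at positions 2 and 3.
example_sequence = "MARRK"
example_variants = [2, 3]

if __name__ == "__main__":
    print(conserved_fraction(len(example_sequence), example_variants))
    for position, residue, rate in rank_positions(example_sequence, example_variants):
        print(position, residue, rate)
```

In practice the variant positions would come from a population database export, and any real score would need the weighting described in the Methods rather than this simplified rate.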
Targeting clearly defined regions with high MRF scores allows for functional validation studies tailored to the most clinically relevant protein regions. An example of high MRF score clustering occurred in the RAG1 catalytic RNase H (RNH) domain at p.Ser638-Leu658, which is also considered a conserved Transib motif.
While many hypothetical variants with low MRF scores may be uncovered as functionally damaging, our findings suggest that human genomic studies will benefit from first targeting variants with the highest probability of occurrence (gene regions with high MRF). Table E1 lists the calculated MRF values for RAG1 and RAG2.
We have presented a basic application of MRF scoring for RAG deficiency. The method can be applied genome-wide. This can include phenotypically derived weights to target candidate genes or tissue-specific epigenetic features. In the state presented here, MRF scores are used for pre-clinical studies. A more advanced development may allow for use in single cases. During clinical investigations using personalized analysis of patient data, further scoring methods may be applied based on disease features. A patient phenotype can contribute a weight based on known genotype correlations separating primary immunodeficiencies or autoinflammatory diseases [6]. For example, a patient with autoinflammatory features may require a selection that favors genes associated with proinflammatory diseases such as MEFV and TNFAIP3, whereas a patient with mainly immunodeficiency may have preferential scoring for genes such as BTK and DOCK8. In this way, a check-list of most likely candidates can be confirmed or excluded by whole genome or panel sequencing. However, validation of these expanded implementations requires a deeper consolidation of functional studies than is currently available.
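The phenotype weighting described above could be sketched as follows in Python. This is a minimal illustration only: the grouping of genes, the base scores, and the weight values are placeholders chosen for the example, not values derived in this study, and the function name is hypothetical.

```python
def weighted_ranking(base_scores, gene_groups, phenotype_weights):
    """Multiply each gene's base score by the weight of its phenotype group.

    Genes without a group, or groups without a weight, keep weight 1.0.
    Returns (gene, weighted score) pairs sorted from highest to lowest.
    """
    ranked = []
    for gene, score in base_scores.items():
        group = gene_groups.get(gene)
        weight = phenotype_weights.get(group, 1.0)
        ranked.append((gene, score * weight))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Placeholder inputs: equal base scores so only the phenotype weight matters.
base_scores = {"MEFV": 1.0, "TNFAIP3": 1.0, "BTK": 1.0, "DOCK8": 1.0}
gene_groups = {
    "MEFV": "autoinflammatory",
    "TNFAIP3": "autoinflammatory",
    "BTK": "immunodeficiency",
    "DOCK8": "immunodeficiency",
}
# Hypothetical weights for a patient with mainly autoinflammatory features.
autoinflammatory_patient = {"autoinflammatory": 2.0, "immunodeficiency": 0.5}

if __name__ == "__main__":
    for gene, score in weighted_ranking(base_scores, gene_groups, autoinflammatory_patient):
        print(gene, score)
```

A multiplicative weight is used here only because it is the simplest choice; how phenotype evidence should actually be combined with a mutation-likelihood score remains an open design question.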
Havrilla et al. [61] have recently developed a method, with similar possible applications for human health, that maps constrained coding regions. Their study employed a method that included weighting by sequencing depth. Similarly, genome-wide scoring may benefit from a mutation significance cutoff, which is applied for tools such as CADD, PolyPhen-2, and SIFT [62]. We have not included an adjustment method as our analysis was gene-specific, but implementation is advised when calculating genome-wide MRF scores.
The MRF score was developed to identify the most probable variants that have the potential to cause disease. It is not a predictor of pathogenicity. However, MRF may contribute to disease prediction; a clinician may ask for the likelihood of RAG deficiency (or any other Mendelian disease of interest) prior to examination (Supplemental) [68].
Predicting the likelihood of discovering novel mutations has implications in genome-wide association studies (GWAS). Variants with low minor allele frequencies have a low discovery rate and low probability of disease association [63], an important consideration for rare diseases such as RAG deficiency. An analysis of the NHGRI-EBI catalogue data highlighted diseases whose average risk allele frequency was low [63]. Autoimmune diseases had risk allele frequencies considered low at approximately 0.4. Without a method to rank the most probable novel disease-causing variants, it is unlikely that GWAS will identify very rare disease alleles (with frequencies < 0.001). It is conceivable that a number of rare immune diseases are attributable to polygenic rare variants. However, evidence for low-frequency polygenic compounding mutations will not be available until large, accessible genetics databases are available, exemplified by the NIHR BioResource Rare Diseases study [16]. An interesting consideration when predicting probabilities of variant frequency is that of protective mutations. Disease risk variants are held at low frequency by negative selection, while protective variants may drift at higher allele frequencies [64].
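The arithmetic behind the difficulty of detecting very rare alleles can be sketched in a few lines of Python. The sketch assumes diploid individuals, independent sampling of alleles, and Hardy-Weinberg proportions; these are simplifying assumptions introduced for illustration, and the function names and the cohort size of 1000 are hypothetical.

```python
def expected_allele_copies(q, n_individuals):
    """Expected number of copies of an allele with frequency q
    among 2 * n_individuals sampled chromosomes."""
    return 2 * n_individuals * q


def prob_at_least_one_copy(q, n_individuals):
    """Probability that a cohort contains at least one copy of the allele,
    assuming independent sampling of 2 * n_individuals chromosomes."""
    return 1 - (1 - q) ** (2 * n_individuals)


def homozygote_frequency(q):
    """Expected frequency of homozygotes under Hardy-Weinberg proportions."""
    return q ** 2


if __name__ == "__main__":
    q = 0.001  # allele frequency at the "very rare" threshold
    n = 1000   # hypothetical cohort size
    print(expected_allele_copies(q, n))
    print(prob_at_least_one_copy(q, n))
    print(homozygote_frequency(q))
```

Under these assumptions an allele at a frequency of 0.001 is expected to appear as roughly two copies in a cohort of 1000 people, and homozygotes are expected at about one in a million, which is far too few observations for an association test on a recessive disease.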
Genomic diagnostic tests are already more cost-effective than traditional, targeted sequencing [1]. Even with substantial increases in data sharing capabilities and adoption of clinical genomics, rare diseases due to variants of unknown significance and low allele frequencies will remain non-actionable until reliable predictive genomics practices are developed. Bioinformatics as a whole has made staggering advances in the field of genetics [65]. Challenges that remain unsolved, hindering the benefit of national or global genomics databases, include DNA data storage and random access retrieval [66], data privacy management [67], and predictive genomics analysis methods. Variant filtration in rare disease is based on reference allele frequency, yet the result is not clinically actionable in many cases. Development of predictive genomics tools may play a critical role in single-patient studies and timely diagnosis [23].
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.