Introduction

Next generation sequencing (NGS) has ushered in a new era in the delivery of genomic medicine where the genetic diagnosis is no longer limited by the a priori knowledge of the caring physician (Bamshad et al. 2011). The capacity to screen all or a large subset of genes based on broad clinical categories, e.g., developmental delay, rather than a nuanced clinical phenotype, e.g., Malpuech syndrome, has also greatly expanded our knowledge of the relationship between known or putative disease genes and the resulting phenotype (Yang et al. 2013). Indeed, reverse phenotyping is now commonplace in the era of clinical genomics (Alkuraya 2015). The ultimate benefit of improving patient care in terms of precise diagnosis, management and treatment has propelled NGS-based assays to the forefront of modern molecular diagnostics.

Much has been learned from large-scale NGS studies in the diagnostic setting. They revealed a higher diagnostic yield (usually around 30%) across clinical indications compared to standard approaches, expanded the genetic and allelic heterogeneity of many conditions, and challenged the concept of “atypical” presentations by revealing dual (or sometimes triple) molecular diagnoses in some patients (Posey et al. 2017; Trujillano et al. 2017). However, very few studies focused on highly consanguineous populations, which are known to differ from outbred populations in a number of ways that impact the landscape of disease-causing mutations (Alfares et al. 2017; Yavarna et al. 2015). For example, we have previously shown that the highly consanguineous nature of the Saudi population favors the occurrence of recessive, typically homozygous, mutations in diseases that are typically caused by de novo dominant mutations, e.g., severe intellectual disability (Anazi et al. 2016). These characteristics are predicted to increase the sensitivity of next-generation sequencing-based tests, which has indeed been observed in a few studies (Alfares et al. 2017; Yavarna et al. 2015).

In March 2016, we launched the first reference lab for NGS-based assays in Saudi Arabia. The tests offered include previously validated multigene panels, as well as standard whole exome sequencing (WES) (Group 2015). The high demand for these tests by clinicians from various medical and surgical specialties across the country represented a unique opportunity to observe the distribution of disease-causing mutations in the Saudi population. This is the largest study to date on the mutational spectrum of genetic diseases in the Saudi population in the diagnostic setting. The unselected nature of the tested families and their representation of all regions of the Kingdom allowed us to infer important patterns of genetic diseases in our highly consanguineous population that are relevant to the wider community of diagnostic NGS labs around the world.

Materials and methods

Human subjects

All patients underwent testing on a clinical basis with the specific test (panel vs. exome) chosen by their treating physician. A standard informed consent was signed by all patients or their guardians to explain the nature of the test and its potential to reveal secondary findings with the option to decline receiving such findings. Phenotypes were collected as entered by the ordering physician on the requisition forms. For solo tests, only the index was sequenced, while parents were also included in trio tests. Couples who presented with a history of prior affected children were offered duo tests if none of the affected children were available for testing. Duo testing was also rarely requested on two affected siblings.

Panel tests

Seven clinically themed multigene panels were offered (neuro, dysmorphology/skeletal dysplasia, renal, inborn errors of metabolism, vision, primary immunodeficiency, and gastrointestinal). The gene content and validation of these panels were previously described (Group 2015).

Whole exome sequencing (WES), variant calling, annotation and autozygome determination

WES was performed using an Agilent SureSelect All Exons V5 (50 Mb) capture kit (Agilent Technologies; Santa Clara, CA, USA) for library preparation. Briefly, DNA was sheared mechanically after which targeted fragments were captured by probe hybridization and amplified before sequencing. An Illumina HiSeq 2500 (Illumina Inc; San Diego, CA, USA) was used for paired-end 100-nt sequencing. Sequence alignment, indexing of the reference genome (hg19), variant calling and annotation used a pipeline based on BWA, Samtools, GATK (https://software.broadinstitute.org/gatk/) and Annovar, respectively. Essentially, variants were annotated using a combination of public knowledge databases available from the Annovar package and in-house databases which included collections of previously published Saudi disease-causing variants. Autozygome definition solely from whole exome sequence data was undertaken as previously described (Carr et al. 2013).

Variant interpretation

Our previously described in-house variant interpretation pipeline was used (Group 2015). Briefly, each variant was annotated based on 117 tracks that include chromosomal location, zygosity, number of reads for the reference and alternate alleles, quality metrics, presence within autozygome, local and global frequency data, in silico prediction and previous association with diseases based on HGMD and ClinVar (Harrison et al. 2016; Stenson et al. 2017). Our reporting strategy was as follows: we first considered previously reported disease-causing variants and evaluated them for relevance to the patient’s phenotype. If such variants were identified, we independently evaluated them for likely pathogenicity using local and global population frequency data, as well as in silico prediction of pathogenicity. If our evaluation agreed with the previously reported pathogenicity, we reported them as the likely causal variants. If none was identified, we searched for other relevant variants. Again, such novel variants were evaluated for pathogenicity. In the case of apparently loss-of-function variants with a population frequency compatible with the disease in question, we reported these as likely causal. However, in the case of missense and in-frame indels, the burden of proof is much higher according to the ACMG guidelines, so these were usually reported as variants of unknown significance (VOUS) and the result was considered ambiguous (Richards et al. 2015). When only one heterozygous candidate variant was identified in a relevant known recessive disease gene, in the absence of a more compelling candidate, we also opted to report this finding as an ambiguous result. When no candidate variants were identified in known disease genes, we considered variants in genes not previously linked to human disease only when biological relevance was suspected based on a number of factors (pathway known to be implicated, relevant animal models or at least relevant expression data). Results involving novel candidate genes were also considered ambiguous. Only results involving pathogenic or likely pathogenic variants in known disease genes that explain the phenotype in the correct zygosity were labeled “positive”. Patients who opted to receive secondary information (virtually all) received only pathogenic or likely pathogenic variants in the ACMG list of 59 genes recommended for reporting (Kalia et al. 2016).
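The reporting strategy described in this section can be read as a small decision procedure. The following Python sketch is only an illustration of that reading, not the laboratory's pipeline: the `Candidate` fields, the `label_candidate` and `label_case` functions, and the choice to treat every remaining candidate in a known disease gene as ambiguous are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Result(Enum):
    POSITIVE = "positive"
    AMBIGUOUS = "ambiguous"
    NEGATIVE = "negative"


@dataclass
class Candidate:
    # Hypothetical fields; they do not correspond to the pipeline's 117 tracks.
    known_disease_gene: bool
    previously_reported: bool        # listed as disease-causing, e.g. in HGMD or ClinVar
    evaluation_agrees: bool          # in-house frequency and in silico review supports it
    loss_of_function: bool
    frequency_compatible: bool       # population frequency fits the disease in question
    zygosity_fits: bool              # False for a lone heterozygous hit in a recessive gene
    biologically_plausible: bool = False  # used only for genes not yet linked to disease


def label_candidate(c: Candidate) -> Result:
    if not c.known_disease_gene:
        # Novel candidate genes are never reported as positive.
        return Result.AMBIGUOUS if c.biologically_plausible else Result.NEGATIVE
    likely_causal = (c.previously_reported and c.evaluation_agrees) or (
        c.loss_of_function and c.frequency_compatible
    )
    if likely_causal and c.zygosity_fits:
        return Result.POSITIVE
    # Assumption: missense, in-frame indels and single heterozygous hits
    # in known disease genes all end up as ambiguous here.
    return Result.AMBIGUOUS


def label_case(candidates: list[Candidate]) -> Result:
    labels = {label_candidate(c) for c in candidates}
    for outcome in (Result.POSITIVE, Result.AMBIGUOUS):
        if outcome in labels:
            return outcome
    return Result.NEGATIVE
```

A case with no surviving candidate is labeled negative, and a single positive candidate takes precedence over any ambiguous ones, which mirrors the order in which the text considers known variants, novel variants and novel genes.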

Sanger confirmation

After Sanger validation of the first 200 variants, we identified a quality metric above which SNVs had a 99.5% probability of being a true positive. All subsequent candidate variants (other than indels) above this value were reported without Sanger validation and with a disclaimer explaining the above. All candidate indels were Sanger validated before reporting regardless of their quality value.
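One way such a cutoff could be derived from a validation set is sketched here in Python. The text does not state which quality metric was used or how the value was chosen, so the empirical rule, the function name `lowest_cutoff` and the toy data are all assumptions made for illustration.

```python
def lowest_cutoff(validated, target=0.995):
    """Return the lowest quality value q such that the fraction of
    Sanger-confirmed calls among all calls with quality >= q meets `target`.

    `validated` is an iterable of (quality, confirmed) pairs, where
    `confirmed` is True when Sanger sequencing confirmed the call.
    Returns None when no cutoff reaches the target.
    """
    ranked = sorted(validated, key=lambda pair: pair[0], reverse=True)
    best = None
    confirmed_so_far = 0
    for seen, (quality, confirmed) in enumerate(ranked, start=1):
        confirmed_so_far += bool(confirmed)
        # Evaluate only after the last call that shares this quality value.
        if seen < len(ranked) and ranked[seen][0] == quality:
            continue
        if confirmed_so_far / seen >= target:
            best = quality
    return best


# Toy validation set (made-up values), with a relaxed target for readability.
toy = [(100, True), (90, True), (80, True), (70, False), (60, True)]
print(lowest_cutoff(toy, target=0.9))  # 80
```

With a validation set of only 200 variants, a 99.5% target tolerates at most one unconfirmed call above the cutoff, so an estimate obtained this way is necessarily coarse.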

Results

High diagnostic yield of panels and WES in Saudi patients

We report in this study the results of the first 1013 families, all indigenous Arabs (originally from Arabia) referred for testing by our diagnostic laboratory (Table S1). One of the seven offered panels was requested in 666 families (645 solo, 16 trio and 5 duo) while WES was requested in the remaining 347 families (321 solo, 17 trio and 9 duo; 15 had WES after receiving negative panel results, i.e., reflex WES) (Table S1). A positive result was reported in 27% of those tested with panels, and 43% with WES (Fig. 1). A breakdown of the diagnostic yield by indication shows marked variability, with the highest yield being multiple congenital malformations in the prenatal setting where the yield was 68%, followed by skeletal dysplasia where the yield was 58% (Table S1). Similarly, duo testing of couples who lost children with a likely genetic diagnosis was associated with a high yield of 50% (83% if novel candidate genes are counted). One notable example is a couple with a history of recurrent non-immune hydrops, one child who died of epidermolysis bullosa and another of osteogenesis imperfecta. Shared carrier status for heterozygous variants in THSD1, ITGB4 and P3H1, respectively, was considered relevant (Table S1). Although we always recommended WES when the panel result was negative or ambiguous, reflex WES only accounted for 4% of the WES samples.
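The family counts quoted in this paragraph can be cross-checked with a few lines of Python. All numbers are taken from the text; only the variable names are introduced here.

```python
# Family counts as reported in the text, keyed by test design.
panel = {"solo": 645, "trio": 16, "duo": 5}
wes = {"solo": 321, "trio": 17, "duo": 9}
reflex_wes = 15  # WES ordered after a negative panel result

panel_total = sum(panel.values())
wes_total = sum(wes.values())
cohort_total = panel_total + wes_total
reflex_share = reflex_wes / wes_total

print(panel_total, wes_total, cohort_total)  # 666 347 1013
print(f"{reflex_share:.1%}")                 # 4.3%
```

The 4.3% share of reflex WES matches the rounded figure of 4% given in the text.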

Fig. 1

Pie charts showing the yield of the two testing modalities and the breakdown of mutation classes in positive cases

The distribution of disease-causing mutations in Saudi patients

Autosomal recessive pathogenic and likely pathogenic mutations accounted for the majority of positive cases (235/332, 71%), and were almost always homozygous (97%). This is consistent with the high rate of consanguinity in this cohort. Among the 482 families for whom information on consanguinity was provided, 376 reported consanguinity (78%). In addition, seven families reported endogamy, i.e., intratribal marriage, but no recognizable consanguinity otherwise. Dominant mutations, presumptive de novo being the rule, accounted for 27% of positive cases, followed by X-linked mutations (2%) (Fig. 1). In total, pathogenic or likely pathogenic variants spanning 279 known disease genes were identified; 166 of these variants (43%) are reported here for the first time (Table S1). Of the recessive positive cases, 33% were due to founder mutations (defined as those encountered in other patients or present in the heterozygous state at least once in our in-house database of Saudi variants, which is based on >10,000 Saudi individuals), whereas 67% were private. The frequency of identified pathogenic or likely pathogenic mutations ranged from the most common variant C12orf57:NM_001301837:exon1:c.1A>G:p.M1?, representing 1.5% of all positive cases, to the exceedingly rare, e.g., TRAF3IP2:NM_001164283:exon4:c.200G>C:p.W67S, which represents only the second mutation in TRAF3IP2 in the context of severe eczema (Maddirevula et al., in press) (Table S1). We also observed several instances of biallelic pathogenic or likely pathogenic variants in strictly dominant disease genes in the setting of normal parents, e.g., ITPR1 (Table 1). Dual molecular diagnosis was rare and only accounted for 1.5% of exome-sequenced cases (panel-sequenced cases were not included given the inherent limitation of panel testing to comprehensively detect dual diagnosis) (Table 2). ACMG secondary findings were also very rare and were only identified in 1.2% of exome-sequenced cases. On the other hand, carrier status for previously established Saudi pathogenic mutations was observed in a substantial fraction (90%).
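The working definition of a founder mutation used above (seen in other patients, or present in the heterozygous state at least once in the in-house database) can be expressed as a short rule. The Python sketch below is illustrative only: the function name, the two lookup tables and the placeholder variant identifiers are hypothetical and do not reflect the laboratory's actual data structures.

```python
def classify_recessive_variant(variant_id, families_with_variant, database_het_count):
    """Label a recessive causal variant as 'founder' or 'private'.

    families_with_variant: maps a variant ID to the set of family IDs in
        which it was reported as causal (hypothetical structure).
    database_het_count: maps a variant ID to the number of heterozygous
        observations in the in-house variant database (hypothetical structure).
    """
    seen_in_other_patients = len(families_with_variant.get(variant_id, ())) > 1
    seen_in_database = database_het_count.get(variant_id, 0) >= 1
    return "founder" if (seen_in_other_patients or seen_in_database) else "private"


# Placeholder identifiers and counts, made up for the example.
families = {"variant_A": {"fam1", "fam2"}, "variant_B": {"fam3"}, "variant_C": {"fam4"}}
het_counts = {"variant_B": 3}

for vid in ("variant_A", "variant_B", "variant_C"):
    print(vid, classify_recessive_variant(vid, families, het_counts))
```

Under this rule a variant found in a single family still counts as a founder mutation once it appears even once in the population database, which is why the size of that database matters for the founder versus private split.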

Table 1 Recessive mutations in genes only known to cause dominant phenotypes
Table 2 Dual molecular diagnoses

Expanding the morbid genome of Mendelian diseases

Pathogenic or likely pathogenic mutations were identified in nine genes with only tentative links to human diseases, thus confirming their classification as bona fide disease genes (Table 3). In addition, our analysis highlighted 75 genes that we propose as novel candidates based on a number of criteria (Table 4). A few such candidates deserve special emphasis. AKAP6 was mutated in a patient with intellectual disability (16W-0212). Through our internal matchmaking effort, we were able to identify another de novo truncating variant in the same gene in a patient with intellectual disability: NM_004274.4:c.1572_1573del:p.Lys525Glufs*30. Thus, it seems likely that this is a bona fide disease gene for intellectual disability in humans. Similarly, we have identified independent deleterious variants in UBR4 in two families with very early onset dementia (16-2737 and 16-2768). This gene is involved in ubiquitin ligation, a mechanism that is impaired in several neurodegenerative diseases, so it seems likely that our analysis has uncovered a bona fide autosomal dominant form of early-onset dementia (Parsons et al. 2015). In 16W-0295 we identified an apparently loss-of-function variant in a gene (WHSC1:NM_001042424:exon13:c.2518+1G>A) long suspected to be the candidate gene for manifestations of Wolf-Hirschhorn syndrome, which overlaps significantly with our patient’s features (Nimura et al. 2009).

Table 3 Confirming the candidacy of previously reported candidates
Table 4 Novel candidates

Phenotypic expansion

Table 5 summarizes positive cases in which the molecular lesion was unexpected, given the established phenotype for the respective gene. For example, 16N-0149 presented with a CHARGE-like presentation but was found to have a de novo variant in KMT2D, the gene responsible for Kabuki syndrome. We have previously suggested the overlap between CHARGE and Kabuki to be biologically relevant and this has been confirmed very recently by a large study (Butcher et al. 2017; Patel and Alkuraya 2015). Similarly, 16W-0253 presented with progressive spasticity, hyper-reflexia and behavioral changes due to a homozygous truncating mutation in FAM134B. This presentation is different from what has been described in the context of FAM134B mutation, which is a form of hereditary sensory and autonomic neuropathy characterized by early childhood onset of distal sensory impairment usually resulting in ulceration and associated with variable autonomic features, such as hyperhidrosis and urinary incontinence (Ilgaz Aydinlar et al. 2014; Kurth et al. 2009).

Table 5 Atypical phenotypes

Impact on management

With very few exceptions, solved cases were positive for genes not suspected a priori by the ordering physician according to the information provided. Furthermore, the correct molecular characterization of many patients led to immediate implementation of management changes. Examples include a child with maple syrup urine disease who was missed by newborn screening and presented at 6 months of age with failure to thrive, developmental regression, abnormal hair, exfoliative skin lesions and epilepsy (16N-0566). 16N-0616 presented with intellectual disability, which was misdiagnosed as hypoxic ischemic encephalopathy but was found to have a novel start-loss PTEN mutation, thus enabling the treating physician to implement active tumor surveillance. 16N-0617 is a 2-year-old child who presented with unexplained chronic liver failure with hyperbilirubinemia and was being considered for liver transplant. The finding that he has polycystic liver disease (no cysts were identified on ultrasonography) caused by an autosomal dominant PRKCSH:NM_001289103:exon17:c.1462-1G>C mutation was very helpful in selecting an unaffected donor from his relatives. 16N-0653 was managed for many years with anti-inflammatory agents to treat her “Crohn’s disease” but she was found to have leukocyte adhesion molecule deficiency caused by ITGB2:NM_000211:exon13:c.1756C>T:p.R586W, resulting in a drastically different management. 16N-0703 presented in the 2nd year of life with severe nephrolithiasis and his management was significantly altered when he was found to have autosomal dominant polycystic kidney disease (PKD2:NM_000297:exon1:c.567G>A:p.W189X), a known risk factor for nephrolithiasis, although his very early presentation was unusual (Grampsas et al. 2000). 16W-0247 is an 8-year-old child who was followed for seizures, ataxia and learning disability but the diagnosis of neurofibromatosis type I was missed until exome sequencing revealed a previously known pathogenic mutation in NF1. 16W-0269 is a 6-year-old child who was managed for years as a case of Bartter syndrome but exome sequencing led to a change in management after revealing a pathogenic mutation in SLC26A3, thus establishing the correct diagnosis of chloride-losing diarrhea. 16-2717 is a 2-year-old child whose diagnosis of pyridoxine-dependent epilepsy was missed until revealed by exome sequencing, leading to immediate initiation of pyridoxine.

Discussion

Genetic studies on the Saudi population have contributed significantly to the global effort of identifying the causal mutations of Mendelian diseases, particularly in the area of autosomal recessive diseases (Alkuraya 2014). The very characteristics that make this population a significant contributor to the analysis of recessive diseases, i.e., consanguinity and large family size, have also recently been exploited to advance our knowledge of other areas of human genetics research, including dominant disorders (by reclassifying dominant mutations when observed in homozygosity in individuals who lack the reported dominant phenotype) and even common diseases (by identifying Mendelian phenocopies and by challenging or supporting GWAS candidate genes when observed in individuals who are functional “knockouts”, i.e., homozygous for loss-of-function alleles as a function of autozygosity) (Abouelhoda et al. 2016; Monies et al. 2017) (Maddirevula et al., in press). However, despite the wealth of published data on genetic diseases in Saudi Arabia, unbiased data on the overall distribution of these diseases and their genetic landscape are largely lacking.

Our molecular diagnostic lab is the sole major referral NGS laboratory in Saudi Arabia. As such, it receives samples from patients with suspected genetic disorders from all healthcare sectors in the country (private hospitals and practices, as well as the state-funded healthcare system). This provided us with a unique opportunity to observe and report a wide spectrum of phenotypes from all regions of the country, as well as other countries in Arabia (Oman and Kuwait) in a largely unbiased fashion. In this report, we describe our laboratory’s prospective analysis of families referred for panel or exome sequencing. The unbiased selection of the first 1000 samples has allowed us to derive important conclusions about the nature of genetic diseases in the country.

Autosomal recessive disorders, as expected, accounted for the bulk of Mendelian disease burden. This is consistent with several previous reports, including the very recent report by Alfares and colleagues who retrospectively analyzed 454 families that underwent exome sequencing by various commercial labs although they did not provide phenotyping information (Alfares et al. 2017). It is noteworthy that enhanced autozygosity also led to several instances of autosomal recessive inheritance of strictly dominant genes. These cases can improve our understanding of the molecular pathogenesis of the dominant counterpart (Monies et al. 2017). For example, we report two families with a severe congenital myopathy phenotype each with a different homozygous truncating mutation in VAMP1 inherited from healthy non-ataxic parents. This suggests that the dominant spinocerebellar ataxia previously reported in a family with a heterozygous VAMP1 mutation may not be caused by haploinsufficiency as originally suggested but by a different mechanism, e.g., dominant negative (Bourassa et al. 2012). Furthermore, the difference in phenotype between the observed recessive and previously reported dominant phenotypes should serve as a reminder that caution is warranted before dismissing variants in genes deemed “irrelevant” to the phenotype in question. These cases also provide an alternative explanation for non-manifesting carriers of clearly pathogenic mutations in dominant genes other than reduced penetrance, i.e., truly recessive inheritance. The predominance of autosomal recessive disorders also allowed us to perform “molecular autopsy by proxy” in families who lost children in the past due to recessive conditions but have no material left to test those children directly. Duo analysis of the parents allowed us to identify the likely causal mutation in the overwhelming majority. The highlighted couple who lost three pregnancies/neonates with three different disorders, all of which were identified by duo exome sequencing, is an example of the power of this approach. It may be surprising that despite this trend, we have only confirmed dual molecular diagnosis in 1.5% of our cohort. This could be attributed to our methodology (we only considered dual diagnosis if both molecular lesions could be classified as at least likely pathogenic).

Secondary findings represent a complicated ethical and practical issue in genomic sequencing. In our analysis, we followed the ACMG recommendations of reporting only pathogenic/likely pathogenic variants in the list of 59 actionable genes (Kalia et al. 2016). Only 1.2% of our cohort received positive results for this class of secondary findings, consistent with previous studies (Dheensa et al. 2016). However, we also opted to report carrier status for a select list of pathogenic recessive mutations that we have previously characterized in our population. Although it is generally discouraged to disclose these results in the pediatric setting, the very high rate of consanguinity in our population prompted us to consider the consequences of not disclosing this information. The carrier status of the index essentially indicates that at least one parent is a carrier, so if the other parent is related then the probability of both parents being carriers of the same mutant allele may not be rare. Thus, we considered this an opportunity to prevent the occurrence of another recessive disease unrelated to the disease being tested by counseling parents about this possibility and offering them the option of being tested for their carrier status. This class of secondary findings was positive in a large fraction of WES cases (90%).
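The reasoning about related parents can be made concrete with a rough calculation. The Python sketch below is an approximation under stated assumptions, none of which come from the text: one parent is taken to carry the allele found in the index, the chance that the other parent carries the same allele by descent is set to the coefficient of relationship (1/8 for first cousins, 1/32 for second cousins, with no further inbreeding in the pedigree), and the population carrier frequency is a free parameter.

```python
from fractions import Fraction

# Coefficients of relationship for common parental relationships (assumed values
# for a pedigree with no additional inbreeding).
RELATIONSHIP = {
    "first cousins": Fraction(1, 8),
    "second cousins": Fraction(1, 32),
    "unrelated": Fraction(0),
}


def other_parent_carrier_prob(r, population_carrier_freq=Fraction(0)):
    """Chance that the second parent carries the allele: either inherited
    from the shared ancestors (r) or, failing that, by chance."""
    return r + (1 - r) * population_carrier_freq


def affected_child_risk(r, population_carrier_freq=Fraction(0)):
    """Risk per pregnancy of a homozygous child when one parent is a known carrier."""
    return other_parent_carrier_prob(r, population_carrier_freq) * Fraction(1, 4)


r = RELATIONSHIP["first cousins"]
print(other_parent_carrier_prob(r))                    # 1/8
print(affected_child_risk(r))                          # 1/32
print(other_parent_carrier_prob(r, Fraction(1, 100)))  # 107/800
```

Under these assumptions, the second parent in a first-cousin couple has at least a 1 in 8 chance of carrying the same allele, compared with only the population carrier frequency for an unrelated partner.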

The establishment of novel links between genes and diseases remains a priority in human genetics research to enable the molecular diagnosis of all Mendelian diseases in the near future. The final percentage of genes that are linked to Mendelian phenotypes remains a highly contested speculation (Boycott et al. 2017). However, large iterative sequencing efforts from the same population can provide helpful clues in this regard. For example, we have recently reported identification of 35 novel disease genes in the context of developmental delay and intellectual disability in an unselected sample of 337 patients (Anazi et al. 2016). Our identification of an additional 75 novel candidate genes by exome sequencing 348 patients from the same population seems to suggest that the number of yet to be identified genes with mutations etiologic in human disease remains large. We suggest that two of these novel candidates appear to meet the standard threshold for establishing a disease-gene link. Although the phenotype associated with AKAP6 is nonspecific developmental delay and seizures, the phenotype we observe in the two families with different heterozygous variants in UBR4 is highly specific and consists of early onset dementia. The remaining candidate genes will require independent confirmation in the future. It is noteworthy that if all 75 novel candidates are confirmed, this will increase the yield of exome sequencing to 83%, which suggests that most clinical exomes that are reported as “negative” by clinical laboratories are reported as such due to interpretation rather than technical limitations, as we have proposed in an earlier work (Shamseldin et al. 2016).

In conclusion, this is the largest diagnostic genomic sequencing study of an unselected clinical cohort from Saudi Arabia with suspected genetic disease from a single clinical sequencing lab. In addition to providing numerous examples of phenotypic expansion, we report the first autosomal recessive inheritance of genes that have thus far been linked to human diseases only in a dominant fashion. We add 166 novel variants in known disease genes, and propose 75 genes as novel disease candidates, two of which are independently mutated in more than one family with the same phenotype. It is hoped that the results of this study will improve the diagnostic yield of genomic sequencing locally and globally.