Background
Small insertions/deletions (indels) are the second most abundant form of human genetic variation after single nucleotide variants (SNVs) [
1]. These DNA changes can influence gene products through multiple mechanisms, including altering amino acid sequence and affecting gene expression [
2]. A number of computational tools that functionally annotate indels are available, including SIFT-indel [
3], PROVEAN [
4], DDG-in [
5], CADD [
6], PriVar [
7], PinPor [
2], HMMvar [
8], KD4i [
9], and VEST-indel [
10]. Although some of these tools are reported to achieve relatively high sensitivity and specificity values [
10], predicting the effect of protein-coding (frameshifting, in-frame) and non-protein-coding indels in the clinical setting remains a formidable challenge [
11].
Inherited eye disorders such as childhood cataracts (CC) and retinal dystrophies (RD) are a major cause of blindness among children and working-age adults [
12,
13]. Over the past decades, exciting progress has been made in elucidating the genetic basis of these disorders. Hundreds of disease-causing genes have been identified, leading to the development of diagnostic tests that are now regularly used in clinical practice [
14,
15]. The preferred testing method at present is panel-based genetic diagnostic testing [
16], although whole genome sequencing is increasingly being used in the clinical domain [
17]. For these tests to have the greatest medical impact, it is necessary to be able to pinpoint the disease-causing variant(s) among the considerable background of detected rare changes that might be potentially functional but not actually responsible for the phenotype under investigation [
18]. Guidelines for assigning clinical significance to sequence variants have been developed [
19] and it is clear that, among protein-coding changes, in-frame indels present a unique challenge.
When the phenotypic relevance of a protein-coding variant is investigated, knowledge of the structure and biochemistry of the associated protein can be very useful. Unfortunately, due to limitations of mainstream structural biology techniques (X-ray crystallography [XRC], nuclear magnetic resonance [NMR], 3D electron microscopy [3DEM]), experimentally determined structures are available for only a small proportion of proteins [
20]. Recently, computational methods have been used to generate reliable structural models based on complementary experimental data and theoretical information [
21]. Such integrative modeling approaches can be utilised to evaluate protein-coding variants
in silico, on the basis of 3D structure and molecular dynamics [
22].
In this study, a variety of methods, including integrative modeling, are used to gain insights into the role of in-frame indels in two genetically heterogeneous Mendelian disorders, CC and RD. Clinical genetic data (multigene panel testing) from 667 individuals are presented and 17 previously unreported in-frame indels are described.
Discussion
In this study we have investigated the role of small (≤21 bp) in-frame indels in two inherited eye disorders and have shown that integrative structural modeling can help interpret some of these changes. Known disease-associated genes were screened in 181 probands with CC and/or anterior segment developmental anomalies, and in 486 probands with RD; one small in-frame indel was clinically reported in 2.8 % (5/181) and in 2.7 % (13/486) of cases, respectively.
Although current high-throughput sequencing technologies provide unprecedented opportunities to detect genetic variation, it is still not possible to elucidate the molecular pathology in a significant proportion of cases with Mendelian disorders [
43]. It has been previously shown that a genetic diagnosis cannot be identified in 1 in 3 CC cases [
44] and in 1 in 2 RD cases [
16]. A combination of analytical/technical and biological factors is likely to contribute to this, including incomplete testing or knowledge of genes associated with these disorders [
43]. One key factor is the inability of high-throughput sequencing to consistently and reliably detect indels [
28]. There are two main reasons for this. First, most indels are associated with polymerase slippage and are located in difficult-to-sequence repetitive regions [
30]. In the present study, we have not analysed 4 extremely repetitive exons (such as RPGR ORF15, see Additional file
1: Table S1) and we would therefore expect the true number of indel events to be higher. Second, numerous analytical/technical factors can affect indel detection accuracy including indel size, read coverage, read length and software tool options [
28]. To minimize bias, we focused on small indels (≤21 bp), we analysed a high coverage subset (samples in which ≥99.5 % of target sequence had ≥50x coverage), and we employed the widely used Illumina chemistry (100 bp paired-end reads). Although there are bioinformatic pipelines that outperform the one utilized in this study [
26‐
29,
45], at present, there is no gold standard method. It is noteworthy that the setting of this study is a clinical diagnostic laboratory and our findings reflect the current real-world diagnostic context.
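The two selection criteria described above (indel size and per-sample coverage) can be illustrated with a minimal Python sketch. The thresholds (≤21 bp, ≥50x depth over ≥99.5 % of the target) are taken from the text; the function names, the per-base depth list, and the toy data are assumptions made purely for illustration and do not represent the pipeline used in this study.

```python
def is_small_inframe(indel_length_bp, max_bp=21):
    """True if the indel length is a non-zero multiple of 3 and at most max_bp."""
    return 0 < indel_length_bp <= max_bp and indel_length_bp % 3 == 0


def is_high_coverage(per_base_depths, min_depth=50, min_fraction=0.995):
    """True if at least min_fraction of target bases have depth >= min_depth."""
    if not per_base_depths:
        return False
    covered = sum(1 for depth in per_base_depths if depth >= min_depth)
    return covered / len(per_base_depths) >= min_fraction


# Toy per-base depths for one hypothetical sample (not data from the study):
# 999 well-covered bases and one poorly covered base.
toy_depths = [120] * 999 + [10]

print(is_small_inframe(21))          # 21 bp deletion: in-frame and within the size limit
print(is_small_inframe(4))           # 4 bp deletion: frameshifting
print(is_high_coverage(toy_depths))  # 99.9 % of bases at >= 50x
```

For a deletion written as c.136_156del, the deleted length is 156 - 136 + 1 = 21 bp, so it passes the size check; c.3840_3845del removes 6 bp and also passes.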
To date, over 4000 disease-causing in-frame indels have been reported, corresponding to 2.2 % of all mutations (Human Gene Mutation Database, HGMD Professional release 2015.4). Recently, the 1000 Genomes Project Consortium reported that 1.4 % of detected exonic variants were indels [
1] and it is expected that at least half of these changes will be in-frame [
31]. Notably, functional and population annotations for these in-frame indels are becoming increasingly available [
1,
10]. In this study, three computational tools were used and their annotations were found to be in agreement for 61.8 % (34/55) of variants. However, the results were probably erroneous for at least two of these variants (
ABCA4 c.3840_3845del and
FSCN2 c.1071_1073del). It can be speculated that the high degree of correlation between predictions (including the incorrect ones) was due to the fact that all three predictive models evaluated similar sets of variant properties (e.g. evolutionary conservation scores or regulatory-type annotations). We hypothesized that for the clinical utility to be maximised, not only the prediction but also the reasons for the prediction (e.g. disruption of a binding site or a β-sheet) should be available to the clinician. Protein structure was therefore used as an endophenotype (defined by Karchin [
11] as ‘measurable component unseen by the unaided eye along the pathway between disease and distal genotype’). Importantly, only 1 in 7 in-frame indels were found within regions that could be reliably modeled. This mostly reflects the fact that integrative models often represent only fractions of the full length of a protein [
20]. Nevertheless, as new structures become available and new techniques are developed, the applicability and utility of the discussed methods are expected to grow.
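The agreement figure quoted above (34/55 variants, 61.8 %) is a simple full-concordance rate. A minimal Python sketch of how such a rate might be computed is given below. The label strings and the toy calls are hypothetical and are not the outputs of the three tools used in this study; only the counts 34 and 55 come from the text.

```python
def full_agreement_rate(calls_per_variant):
    """Percentage of variants for which every tool gave the same label."""
    if not calls_per_variant:
        raise ValueError("no variants supplied")
    agree = sum(1 for calls in calls_per_variant if len(set(calls)) == 1)
    return round(100.0 * agree / len(calls_per_variant), 1)


# Hypothetical labels from three tools for four variants (illustration only).
toy_calls = [
    ("deleterious", "deleterious", "deleterious"),
    ("neutral", "neutral", "neutral"),
    ("deleterious", "neutral", "deleterious"),
    ("neutral", "neutral", "deleterious"),
]
print(full_agreement_rate(toy_calls))

# Recomputing the percentage from the counts quoted in the text.
reported = round(100.0 * 34 / 55, 1)
print(reported)
```

Note that full concordance says nothing about correctness: tools that rely on similar input features can agree with one another and still be wrong, as the two variants discussed above illustrate.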
A variety of properties can be evaluated to infer the impact of an amino acid sequence change on in vivo protein activity. Parameters assessed here and in previous studies include effect on protein folding/stability [
46] and consequences on interaction interfaces [
22]. Highly accurate protein structures are required for these types of analyses. To obtain such structures, we utilized a popular comparative modeling tool (Modeller 9.16 [
34]). Notably, a range of similar tools has been described and objective testing/evaluation of these methods is regularly performed (see
http://www.predictioncenter.org/). Although the pipeline and parameters used in this report have been carefully chosen, the current state-of-the-art method remains to be established.
Structural analysis of mutant proteins in this study suggested that the abnormal phenotype can arise through diverse molecular mechanisms. These include alterations in the DNA interaction site of transcription factors (PITX2 c.429_431del), and disruption of secondary structural elements in crystallins (CRYBA1 c.272_274del, CRYBA4 c.136_156del), cytoskeletal constituents (BFSP2 c.697_699del) and GTPase-activating proteins (RP2 c.260_268del). This wide range of effects could only be rationalized with a combination of (i) careful clinical characterization, (ii) knowledge of the molecular and cellular function of the proteins in question, and (iii) modeling of the likely effects of indels in the context of protein structure and protein interactions. There is an acute need for computational tools that are able to estimate the relative pathogenicity of sequence variants of all types, including indels. Our findings suggest that if such tools are to be effective, they must be able to model the full complexity of molecular mechanisms by which pathogenicity arises.
Acknowledgements
We would like to thank all patients and family members for their participation in this study.