Background
Many neurodegenerative diseases, including Huntington’s disease [1-4], spinocerebellar ataxias [1, 2], frontotemporal dementia (FTD) [3], and amyotrophic lateral sclerosis (ALS) [3], can be caused by nucleotide repeat expansions [1] that are historically challenging to sequence [4, 5]. A repeat expansion is a specific multi-nucleotide DNA sequence that is repeated (i.e., expanded) significantly more times than normal. In 2011, a C9orf72 ‘GGGGCC’ (G4C2) repeat expansion was discovered [3, 6] that causes approximately 34% and 26% of familial ALS and FTD cases, respectively [7]. This finding genetically linked ALS and FTD, generating an exciting opportunity to better understand the etiology of both diseases and potentially develop a therapeutic approach. Individuals with ALS and FTD caused by the G4C2 expansion generally have hundreds to thousands of G4C2 repeats [8], while healthy individuals typically have between 2 and 30 G4C2 repeats [6, 9], though a precise cutoff for pathogenicity is unclear [9]. Additional diseases caused by repeat expansions include Fuchs’ endothelial corneal dystrophy [10], myotonic dystrophy [11], Friedreich’s ataxia [12], and Fragile X syndrome [13], among others, demonstrating the breadth of diseases caused by such expansions. Revealing the underlying etiology of these diseases, and discovering additional repeat expansions that directly cause or modify disease, or modify risk for disease, will likely be accelerated through long-read sequencing technologies capable of characterizing at least major portions of the repeat. Characterizing these repeats at the nucleotide level will help determine, for example, whether the repeat is interrupted and whether such interruptions mitigate disease, as in other neurodegenerative disorders [14-17].
It is unclear whether third-generation long-read sequencing platforms, such as those from Pacific Biosciences (PacBio; RS II and Sequel) and Oxford Nanopore Technologies (ONT; MinION), can traverse these challenging disease-causing repeats, nor is there a report of nucleotide-level sequencing data in a C9orf72 repeat expansion carrier. Likewise, it is unclear whether the C9orf72 repeat expansion is a pure G4C2 repeat in affected carriers, or whether it is interrupted by non-G4C2 sequence. The C9orf72 G4C2 expansion may be the most challenging repeat to sequence, given its extreme length, “pure” GC content [4, 5], and propensity to form G-quadruplexes in both RNA [18, 19] and DNA [19].
Here, we demonstrate that both PacBio and ONT sequencing platforms can sequence through repeats cloned into plasmids, including the spinocerebellar ataxia type 36 (SCA36) disease-causing ‘GGCCTG’ repeat expansion [20] and the FTD- and ALS-causing G4C2 repeat expansion. We further report long-read sequencing data from the C9orf72 G4C2 repeat expansion at the nucleotide level in two symptomatic expansion carriers using both whole-genome and no-amplification (No-Amp) targeted sequencing [21, 22] on the PacBio Sequel. Our findings indicate that long-read sequencing is well suited to characterizing repeat expansions and that this technology has potential to accelerate future genetic discovery efforts across a broad range of diseases that may involve repeat expansions. These technologies may also have potential in clinical and genetic counseling environments for repeat-expansion disorders and for other structural variant disorders more generally. Structural mutations, and repeat expansions specifically, are challenging for short-read technologies. Thus, long-read sequencing technologies may be ideal for discovering new disease-causing or disease-modifying repeat expansions that have escaped detection with conventional short-read sequencing.
Discussion
Here, we showed that both PacBio and ONT long-read sequencing technologies can sequence through the SCA36 ‘GGCCTG’ and the C9orf72 G4C2 repeat expansions in relatively controlled repeats in plasmids, and that the PacBio Sequel can sequence through a human C9orf72 repeat expansion in its entirety, depending on length. Additionally, we demonstrated that the PacBio No-Amp targeted sequencing method can identify the unexpanded allele and determine whether the individual carries a repeat expansion. These results demonstrate the potential these technologies offer in clinical testing, genetic counseling, and future structural mutation genetic discovery efforts, including those involving challenging repeat expansions. For example, the C9orf72 G4C2 repeat expansion could have been discovered years earlier if long-read sequencing technologies had been available. Through great effort, using the best approaches available at the time, the G4C2 repeat was discovered in 2011 [3, 6], approximately 5 years after chromosome 9p was initially implicated in both ALS and FTD [36, 37]. With current long-read sequencing technologies, we can decrease the time to discover and characterize such mutations, begin studying them at the molecular level, and potentially translate them for use in clinical and genetic counseling environments.
We found that both platforms are fully capable of sequencing through challenging repeats like the SCA36 ‘GGCCTG’ and C9orf72 G4C2 repeat expansions when cloned into plasmids. It is unclear why the read length distributions for ONT’s MinION were much tighter than those from the PacBio RS II, but it shows promise for future ONT MinION applications. Additionally, while median read lengths were highly similar between the PacBio RS II and ONT MinION, the MinION had a higher percentage of reads that extended all the way through the repeat regions for all three repeat plasmids, particularly the C9-423 and C9-774 plasmids.
Both the PacBio RS II and ONT MinION correctly identified the maximal expected number of repeats in the C9-774 plasmid, based on their consensus sequences, but base calling accuracy in the repeat regions was higher for the PacBio RS II. The PacBio RS II attained approximately 99.8% consensus accuracy, while the ONT MinION consensus sequence was only 26.6% accurate because of the mixed nucleotides in the consensus sequence. We believe this will be relatively easy to address in the ONT base calling algorithms because it appears systematic based on the RRRRCM and RRRRMC repeats in the consensus sequences, which demonstrates the same base calling errors occur consistently.
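The gap between a correct repeat count and a low consensus accuracy can be illustrated by scoring a consensus against the expected motif in two ways: strict base identity, and compatibility under the standard IUPAC ambiguity codes (R = A or G, M = A or C). The following is a minimal sketch on a toy sequence; the function names are hypothetical, and it is not the procedure used to obtain the accuracy values reported above.

```python
# Standard IUPAC nucleotide ambiguity codes: each symbol maps to the bases it can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def strict_identity(consensus: str, expected: str) -> float:
    """Fraction of positions where the consensus symbol equals the expected base."""
    if len(consensus) != len(expected):
        raise ValueError("sequences must have equal length")
    matches = sum(c == e for c, e in zip(consensus, expected))
    return matches / len(expected)


def iupac_compatibility(consensus: str, expected: str) -> float:
    """Fraction of positions where the expected base is allowed by the consensus symbol."""
    if len(consensus) != len(expected):
        raise ValueError("sequences must have equal length")
    matches = sum(e in IUPAC.get(c, "") for c, e in zip(consensus, expected))
    return matches / len(expected)


# Toy example (not study data): four expected G4C2 units versus an ambiguous consensus.
expected = "GGGGCC" * 4
consensus = "RRRRCM" * 4

print(strict_identity(consensus, expected))      # only the fifth base of each unit matches exactly
print(iupac_compatibility(consensus, expected))  # every position is compatible with GGGGCC
```

In this toy case the ambiguous consensus is fully compatible with a pure GGGGCC repeat even though few positions match exactly, which is consistent with the interpretation that the errors are systematic rather than random.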
After verifying both PacBio’s and ONT’s technologies were capable of sequencing repeats in plasmids, we tested the technologies’ ability on two C9orf72 G4C2 expansion carriers and found the PacBio Sequel is capable of sequencing through challenging GC-rich repeat expansions, but throughput was problematic for the ONT MinION in this study. Newer chemistries and hardware from ONT are likely to alleviate this issue. During the timeline of our study, ONT released the GridION and PromethION sequencers that are based on the same nanopore technology and have greater throughput. The PromethION, in particular, can run many more flowcells concurrently, and each individual PromethION flowcell has significantly more nanopores than the MinION and GridION flowcells. We anticipate at least the PromethION will be suitable for large repeat studies, based on the MinION’s performance in the plasmids, but we cannot be certain without further testing.
Using the PacBio Sequel, we attained 8× coverage across the C9orf72 G4C2 repeat region for sample 2 using the whole-genome approach: four reads covered the individual’s expected eight-repeat (unexpanded) allele, three reads ended 30, 69, and 912 repeats into the expansion, respectively, and one read fully spanned an expanded repeat region with 1324 repeats. The read spanning 1324 repeats falls at the lower end of the size range seen on the Southern blot, suggesting longer repeat alleles may have been inaccessible to the PacBio Sequel, perhaps simply because their size impedes loading into the zero-mode waveguide (ZMW) wells. Deeper sequencing is generally required to detect the mutation using a variant caller, but we demonstrate here, as a proof of principle, that the technology is capable of generating such reads. We also could not reliably estimate G4C2 content for sample 2 because there were few reads and each read had only a single sequencing pass. These data do demonstrate, however, that the PacBio Sequel is capable of sequencing through at least a large portion of what may be the most challenging GC-rich repeat expansion known. Discovering whether a structural variant exists, its location, and its general nucleotide makeup is the critical first step to understanding its role in human disease.
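For scale, each G4C2 unit is 6 nt long, so a read spanning 1324 repeats contains 1324 × 6 = 7944 nt of repeat sequence. The sketch below shows one simple way to count uninterrupted tandem copies of the motif in a read and convert a repeat count to nucleotides. It assumes error-free sequence and uses hypothetical function names; real single-pass long reads contain insertions and deletions, so an exact-match count like this would underestimate the repeat number and is not the method used in this study.

```python
import re

MOTIF = "GGGGCC"  # the C9orf72 G4C2 repeat unit


def longest_tandem_run(read: str, motif: str = MOTIF) -> int:
    """Number of motif copies in the longest uninterrupted tandem run within the read."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    run_lengths = [m.end() - m.start() for m in pattern.finditer(read)]
    return max(run_lengths, default=0) // len(motif)


def repeat_span_nt(repeat_count: int, motif: str = MOTIF) -> int:
    """Length in nucleotides of a pure tract with the given number of motif copies."""
    return repeat_count * len(motif)


# Toy read (not study data): eight perfect units between arbitrary flanks.
toy_read = "ACGT" + MOTIF * 8 + "TTGA"
print(longest_tandem_run(toy_read))
print(repeat_span_nt(1324))
```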
Additional studies will be necessary to determine the maximum repeat size that these technologies can span, but our data reiterate that the PacBio Sequel is already adequate for genetic discovery efforts [38-41] and suggest that it is capable of sequencing and identifying large repeats. With sufficient read depth, the reads do not necessarily need to bridge the entire repeat (or other large structural variant) to discover whether it exists and characterize the general nucleotide content. Additional experiments can clarify size and nucleotide content if the sequencing technology cannot span the variant entirely or produces lower-quality base calls.
After verifying the PacBio Sequel was able to sequence through the C9orf72 G4C2 repeat expansion using whole-genome sequencing in a human case, we tested PacBio’s No-Amp targeted sequencing approach to determine how well it can characterize nucleotide content with the increased read depth, and to assess how amenable the approach is for clinical and genetic counseling environments. Our results suggest the method can identify an individual’s unexpanded allele, determine whether the individual carries a repeat expansion, and estimate size up to at least 5 kb, though a larger study is needed. While being able to perfectly determine an individual’s expansion size regardless of its length would be ideal, knowing the exact repeat expansion size does not clarify prognosis for C9orf72 G4C2 repeat expansion carriers [8], mitigating the need to determine the expansion’s precise size. Additionally, the repeat size is known to be highly variable throughout various body tissues, including different brain regions, and even within a small tissue piece from the same brain region [3, 8, 42-44]. This is further demonstrated by the smear within the Southern blots for both symptomatic carriers included in this study.
While repeat size is not informative for prognosis, overall G4C2 content and the presence of repeat interruptions may be, though more information is required. We were able to assess G4C2 content more accurately using the targeted approach, though distinguishing between G4C2 and G3C2 motifs is likely unreliable at this stage. Treating all G3C2 motifs as G4C2, we estimate the G4C2 content for sample 1 is >99%, and potentially 100%. There is some evidence supporting potential G3C2 and non-GC interruptions, but experimentally verifying these finer differences in low-complexity repeat regions is non-trivial. A larger study will be important to determine whether it is possible to identify more pronounced interruptions, or even to distinguish between G4C2 and G3C2. The ability to identify an individual’s unexpanded allele, clearly indicate expansion status, and assess nucleotide content in a single experiment could have important implications in clinical and genetic counseling environments, and will certainly be investigated further in the research environment.
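The effect of treating G3C2 motifs as G4C2 can be sketched as a simple content calculation: tile the repeat region with GGGGCC and GGGCC units, then report the fraction of bases covered, with or without counting the shorter unit. This is an illustrative assumption about how such a fraction could be computed, with a hypothetical function name and toy sequences, and is not the estimation procedure used for sample 1.

```python
import re

# The longer unit is listed first so that "GGGGCC" is not split into "G" + "GGGCC".
UNIT = re.compile(r"GGGGCC|GGGCC")


def repeat_content(region: str, merge_g3c2: bool = True) -> float:
    """Fraction of bases in the region covered by G4C2 units.

    When merge_g3c2 is True, G3C2 (GGGCC) units are counted as if they were G4C2.
    """
    if not region:
        return 0.0
    covered = 0
    for match in UNIT.finditer(region):
        unit = match.group()
        if unit == "GGGGCC" or merge_g3c2:
            covered += len(unit)
    return covered / len(region)


# Toy regions (not study data).
with_g3c2 = "GGGGCC" * 9 + "GGGCC"                   # one shortened unit at the end
with_interruption = "GGGGCC" * 2 + "AT" + "GGGGCC" * 2  # a non-GC interruption

print(repeat_content(with_g3c2, merge_g3c2=True))
print(repeat_content(with_g3c2, merge_g3c2=False))
print(repeat_content(with_interruption))
```

Merging the two unit types raises the apparent content of the first toy region to 100%, whereas a non-GC interruption lowers the fraction under either setting.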
Existing challenges for the No-Amp targeted sequencing method include low throughput and an inherent selection for shorter reads (shorter repeats, in this case), both because of loading bias (shorter fragments load preferentially) and because the polymerase traverses longer repeats less reliably than shorter ones. This is likely why there is a statistical mode at approximately 110 repeats (Fig. 8), even though there is no observable band at that size in the Southern blot. We are confident the reads are real, however, because the Southern blot clearly shows size mosaicism, and the flanking sequence on both sides of the repeat region aligned with ≥85% identity for every included read. The bias towards shorter reads does misrepresent the primary size distribution, but we were still able to determine that the individual carries a repeat expansion, and we were able to accurately estimate the size in this case. Determining whether an individual carries a repeat expansion in an automated fashion would be relatively simple using this approach.
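The read-inclusion criterion and the mode described above can be sketched as follows: keep a read only if both flanks align with at least 85% identity, then find the most populated repeat-count bin among the retained reads. The `ReadSummary` record, the function names, the 10-repeat bin width, and all values in the example are hypothetical and chosen only for illustration; the 85% threshold is the one stated in the text.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

MIN_FLANK_IDENTITY = 0.85  # flank identity threshold stated in the text


@dataclass
class ReadSummary:
    """Hypothetical per-read summary: repeat count plus identity of each aligned flank."""
    repeat_count: int
    left_flank_identity: float
    right_flank_identity: float


def passes_flank_filter(read: ReadSummary) -> bool:
    """True when both flanks meet the identity threshold."""
    return min(read.left_flank_identity, read.right_flank_identity) >= MIN_FLANK_IDENTITY


def modal_repeat_bin(reads: List[ReadSummary], bin_width: int = 10) -> Optional[int]:
    """Lower edge of the most populated repeat-count bin among reads passing the filter."""
    kept = [r for r in reads if passes_flank_filter(r)]
    if not kept:
        return None
    bins = Counter((r.repeat_count // bin_width) * bin_width for r in kept)
    return bins.most_common(1)[0][0]


# Invented values for illustration only (not study data).
toy_reads = [
    ReadSummary(112, 0.93, 0.91),
    ReadSummary(115, 0.90, 0.88),
    ReadSummary(118, 0.95, 0.97),
    ReadSummary(640, 0.89, 0.92),
    ReadSummary(25, 0.96, 0.94),
    ReadSummary(1300, 0.50, 0.90),  # excluded: left flank below threshold
]
print(modal_repeat_bin(toy_reads))
```

Because shorter fragments are over-represented, a mode computed this way describes the sequenced reads rather than the true allele size distribution in the tissue.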
Given that several studies have shown the C9orf72 repeat expansion is variable across tissues within a given patient [8, 42-44], we suggest that a large, deep long-read sequencing study across the C9orf72 repeat is important to better understand how repeat content affects disease onset and progression. Repeat interruptions are known to mitigate disease in other neurodegenerative disorders [14-17]. Fully characterizing the repeat at the nucleotide level in a large cohort may have critical implications for our understanding of disease etiology, development, and duration, and for future therapy. A large, long-read sequencing study of affected C9orf72 G4C2 repeat expansion carriers would also allow us to characterize mosaicism within individuals; there may be expansion sub-species that explain the more aggressive forms of ALS and FTD but are not measurable through traditional methods such as Southern blotting.
Cost is a limiting factor for long-read sequencing technologies, often making them impractical for large studies or for diagnostic use. Because of these limitations, researchers have made great efforts to maximize the utility of short-read sequencing technologies, employing the large amount of short-read sequencing data already generated across nearly every disease currently studied. An excellent example is the effort to identify repeat expansions based on evidence in existing short-read data [45, 46]. These efforts offer researchers who have already generated short-read data the ability to determine whether an individual has a repeat expansion, but only if the repeat expansion and its location are already known. The approaches are also generally unable to estimate the repeat size. The limitations in these approaches reflect the limitations of short-read sequencing, because short reads cannot span even relatively small repeat expansions. Long-read sequencing, while having a much higher error rate, addresses these limitations, and may be more amenable to regular use in the future. For now, long-read sequencing may be ideal for small familial studies or for smaller studies intent on identifying repeat expansions that exist among a small cohort of cases. Researchers could then follow up with more cost-effective methods such as repeat-primed PCR or Southern blotting.
Knowing that PacBio and ONT long-read sequencing technologies are fully capable of sequencing through challenging disease-causing repeats, such as the SCA36 ‘GGCCTG’ and C9orf72 ‘GGGGCC’ repeats, lays important groundwork for future sequencing studies to understand the nucleotide-level nature of all repeat-expansion disorders. It also demonstrates that long-read sequencing technologies offer great potential for future repeat expansion discovery efforts, and may be useful in clinical and genetic counseling environments for either the C9orf72 repeat expansion specifically, or for other large structural mutations; the ability to target specific regions will be particularly important in several settings. Further use of these technologies in larger studies will be critical to properly characterizing known repeats (e.g., C9orf72) and their allelic distributions (size and content) at the nucleotide level to better understand how they contribute to disease.