Background
The prevention and management of plant diseases largely depend on accurate identification of pathogens. Rapid and specific detection assays are required that need continuous development and optimisation as technology and knowledge of the pathogens advance. The increased use of high-throughput sequencing (HTS) for the construction of virome profiles of many agricultural crops has led to the discovery of many new viruses [
1‐
13]. Some viruses cause, or are associated with, economically damaging diseases, necessitating detection tools that can reliably detect them in the shortest timeframe. The successful control of viruses and viroids in commercial crops is directly correlated with the effectiveness of the detection assays used to screen plant propagation material. The challenges and opportunities of HTS for virus and viroid detection have been highlighted previously [
14‐
17], and in the United States of America HTS already forms part of clean plant propagation programmes through a provisional release category based on an HTS-negative selection [
16]. This category of plants is then allowed to be propagated in designated approved areas pending the completion of all conventional laboratory tests. This allows accelerated multiplication of plant material prior to official clean status certification and release.
The generation of sequence data using various metagenomic and enrichment strategies and the development of the associated bioinformatic pipelines has led to the identification of many known and novel viral sequences [
18]. The presence of these agents is usually verified subsequently using a PCR or RT-PCR assay designed from the same HTS data. The question is therefore whether HTS can be used as a reliable standalone detection tool if the necessary parameters, i.e. sensitivity and reliability, can be validated as for any other detection assay. As HTS is more routinely employed, the need to establish the influence of variables such as sample preparation, sequencing platforms and data analysis on the output becomes imperative. If HTS can be validated for sensitivity and specificity within a specific pathosystem, it can be used as a standalone detection assay to provide a fast and reliable diagnostic for any viral disease. For known viruses, an HTS detection assay has great application value for broad-based detection of viruses in high-value plant material at the post-import quarantine stage or in the clean-status verification of nuclear or mother plant material of plant improvement schemes. One limitation of this assay would be the comprehensiveness of the reference database used.
The current consensus is that HTS data analyses and the interpretation of the results for plant viral diseases require expertise in both bioinformatics and plant virology [
14,
15]. Nevertheless, numerous studies report attempts to streamline the analysis and to find a one-size-fits-all solution to the bioinformatic component of the HTS assay to inform a diagnostic call [
19‐
36]. There are also different options for the target nucleic acids to be sequenced, the most popular being total RNA (commonly ribo-depleted), small RNA (sRNA) and double-stranded RNA, depending on the application of the HTS assay. A popular strategy employed for the detection of plant viruses is sRNA sequencing. This strategy can effectively detect viruses with DNA or RNA genomes. However, sRNAs are generated as a host defence response to virus infection and a weaker response will result in lower levels of sRNAs that could impact negatively on this approach’s ability to detect these specific viruses [
37,
38]. The effect of different bioinformatic pipelines was also evaluated previously in a large-scale performance test on sRNA data and the variation in results was significant [
39]. The advantages of ribo-depleted RNA sequencing over sRNA sequencing for virus detection were also highlighted previously [
37,
40].
Even though plant virologists without training in bioinformatics can benefit from automated pipelines with graphical user interfaces, the quality and accuracy of the output are reliant on the quality of the input. The input to an HTS assay incorporates the whole process from sample collection and wet laboratory processing to data generation, including data quality control (QC). All the quality control measures up to data QC are the same as, or similar to, those for any other sensitive molecular assay and need to be incorporated as assay variables. To ensure optimum data analysis, data should be evaluated for the different quality parameters, including not only the quality scores of each base, but also the sequencing depth. All these parameters can impact on the specificity, sensitivity and repeatability of the diagnostic result. The specific application of HTS will determine the level of variation that is tolerable. Identifying the exact virus or viroid variant, for example, is not required for pathogen detection. Applications of HTS are therefore varied and include both detection and discovery. It is important to identify the application and to tailor the assay, data analysis and interpretation accordingly.
The citrus industry is one of the largest fruit industries worldwide with South Africa being the second largest fresh citrus exporter [
41]. However, citrus pathogens can lead to a reduction in yield and threaten cultivar sustainability. One of the most devastating and complex viral pathogens of citrus species locally and worldwide is the closterovirus, citrus tristeza virus (CTV) [
42,
43]. Other pathogens sporadically detected in non-certified citrus in South Africa include citrus tatter leaf virus (CTLV) and viroids such as hop stunt viroid (HSVd), citrus dwarfing viroid (CDVd) and citrus exocortis viroid (CEVd). More recently citrus virus A (CiVA) was detected in older orchards [
44].
Viruses and viroids that are mainly spread through vegetative propagation can be effectively controlled through the use of certification programmes to generate virus-free budwood/cuttings for propagation. HTS can detect multiple pathogens within a single assay and has the advantage that data can be re-evaluated as new viral agents and variants are added to global databases.
In this study, Citrus sinensis plant material infected with a range of viruses, including positive- and negative-sense RNA viruses, and viroids was established and subjected to HTS to evaluate the level of biological and technical variation that can arise from the RNA extraction method, sequencing platform and bioinformatic pipeline used. This study evaluated HTS variation for a citrus virome in order to use HTS as a standalone detection assay.
Discussion
In this study an experimentally constructed citrus virome was characterised using HTS to evaluate the influence of sampling, RNA extraction method, sequencing platform and data analysis pipeline. Four sweet orange (cv. ‘Madam Vinous’) trees were prepared, one negative control and three graft inoculated with CTV, CTLV, CiVA, HSVd, CDVd and CEVd. Each plant was sampled three times at the same timepoint, and RNA was extracted using two different methods. Two sequencing platforms were selected to generate data from three samples from four plants (Illumina) and one sample from four plants (Ion Torrent). All data sets were subjected to a reference-independent de novo assembly approach and a reference-dependent read mapping strategy to determine the virome profile of each plant sample.
HTS can be utilised for both detection and discovery, and the level of variation in specificity, sensitivity and repeatability that can be tolerated will depend on the application. In the present study, the detection of virus and viroid species was evaluated for application as a routine diagnostic assay. Identifying the exact virus or viroid variant, for example, was therefore not required for pathogen detection.
HTS is intrinsically specific and is therefore only limited by the accuracy of the base calls, the depth of the data and the completeness of the reference databases [
15]. A public reference database can be incomplete due to novel viruses or new variants of viruses that are yet to be discovered. Local databases require continuous upkeep to be complete, even if it is only for virus species or variant additions. Nonetheless, novel pathogens or different variants of known pathogens not contained in a local database can, in some cases, still be detected by de novo assembly followed by homology searches and read mapping, just with lower confidence levels and less robustness. The limitations of such databases should be considered throughout the data analysis and interpretation.
The composition of the starting nucleic acids additionally impacts detection specificity. The virome profiles of the Zymo Research kit extracted samples were less consistent across samples and replicates than the profiles of the CTAB extracted samples (Fig.
6). This variation can be due to several fundamental differences in the RNA extraction protocols. The input plant material amount for the CTAB extraction method is one gram compared to the 200 mg input for the Zymo Research kit. It is more difficult to obtain a representative sample using a lower input extraction method. This can potentially lead to the generation of false negative results if the virome includes pathogens that are unevenly distributed in the plant as was observed for plum pox virus (PPV) [
59] and certain grapevine viruses [
60].
The lower weight input restriction of the Zymo Research RNA extraction kit resulted in greater variation in the virome profiles between the technical replicates of each biological sample (Fig.
6), probably due to sectorial differences in concentrations of the target pathogens in the plants. The CTAB method yielded more consistent virome profiles between samples, likely due to the ability to process a more homogeneous sample; however, this method appears to have a slight bias against viroid sequences. Collectively, this virome analysis indicates that a low weight input extraction method has risk implications for use in a routine HTS detection assay.
Reproducibility is required at each step of the HTS assay, from nucleic acid extraction to data interpretation. Previous studies highlighted the link between appropriate depth of coverage and the repeatability of the assay [
37,
40], but no systematic studies on repeatability and reproducibility have been published. In this study an attempt at reproducibility was made by including both biological and technical replicates for the Illumina sequencing protocol.
The two different extraction methods yielded total RNA with different rRNA profiles. No significant difference between the RIN values for the two groups was observed; however, the rRNA ratio of the Zymo Research kit extracted RNA was significantly lower than that of the CTAB RNA due to a higher concentration of 5S rRNA yielded by the Zymo Research kit. This indicates a potential difference in the RNA species extracted with each method, and it is possible that the CTAB extraction selected against viroid sequences due to the lithium chloride (LiCl) precipitation step. The SPAdes de novo assembly of the Illumina data did not produce viroid scaffolds for one technical replicate of two different CTAB extracted samples, compared to all the expected viroid scaffolds assembled in the Zymo Research kit replicates (Additional file
2). The SPAdes assemblies of the Zymo Research data also generated more viroid scaffolds compared to the CTAB data (Fig.
2). The read mapping strategy also revealed the difference in viroid RNA concentration between the two extraction methods: the ratios of pathogen concentrations differed vastly between the extraction methods, independent of the sequencing platform (Fig.
6). More viroid RNA reads were obtained using the Zymo Research kit. Due to the potential lower representation of viroid RNA obtained with certain extraction methods and the small genome size of viroids, it is possible that a viroid infection may be missed with only a de novo assembly approach. Although the RNA extraction method influenced the performance of the detection assay in the present study, it did not alter the final combined de novo and read mapping results (Additional file
2, Additional file
5) and all pathogens were consistently detected.
The choice of de novo assembler can influence the result, as seen in the contigs assembled with CGW compared to the SPAdes scaffolds (Additional file
2). The SPAdes assembly with the Illumina data performed better in confirming the expected virome profile compared to the assemblies with the Ion Torrent data. Both assemblers were able to assemble more and longer contigs/scaffolds with the Illumina data. Even though 1.2 times more Ion Torrent data than Illumina data was obtained from the service providers, the 196 nt read distance between the paired-end reads of the Illumina data may contribute to better contig assemblies compared to the single-end Ion Torrent reads with an average length of 137 nt.
The principal component analyses using the TPM counts of the pathogen and gene accession read mappings showed a clear separation between the different extraction/platform protocols (Fig.
3). However, for the gene accessions, the most variation was between sequencing platforms and for the pathogens the variation was between RNA extraction methods (Fig.
3). This is partially explained by the viroid component of the pathogen profile that was greater in the Zymo Research data sets. The variation between technical replicates was, however, minimal, and the variation observed was instead a result of extraction method or sequencing platform.
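The TPM counts used for these comparisons can be illustrated with a minimal sketch. It assumes the standard transcripts-per-million definition applied to raw mapped-read counts and reference lengths; the exact computation used in this study is not restated here, and the function name and input values are hypothetical.

```python
def tpm(read_counts, lengths):
    """Convert mapped-read counts to TPM (standard definition, assumed here).

    read_counts: dict of reference name -> number of mapped reads
    lengths:     dict of reference name -> reference length in nucleotides
    """
    # Step 1: reads per kilobase of reference, to correct for reference length.
    rpk = {name: read_counts[name] / (lengths[name] / 1000.0)
           for name in read_counts}
    # Step 2: scale so that all values in one sample sum to one million.
    scale = sum(rpk.values()) / 1_000_000
    if scale == 0:
        return {name: 0.0 for name in rpk}
    return {name: value / scale for name, value in rpk.items()}


# Hypothetical counts and lengths, for illustration only.
example = tpm({"ref_a": 100, "ref_b": 100}, {"ref_a": 1000, "ref_b": 2000})
```

Because every sample is scaled to the same total, TPM values describe relative composition within a sample, which is why a larger viroid component in one protocol shifts the apparent proportions of the other pathogens.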
The investigation into the expression profile of reference genes allowed the comparison between samples across different extraction/platform protocols to potentially answer questions relating to the suitability of the sequencing depth to address pathogen detection. The expression pattern of these genes is hypothesized to be stable and even if the gene expression is modulated in response to biotic stress, the variation between samples should be reflected in each of the different extraction/platform protocols selected. By identifying low and high expressing genes, gene expression profiles can be used as internal controls for RNA extraction efficiency, library construction and also to determine the number of reads required for accurate detection. Using the host reference gene mapping ratios, outlier samples can be identified as seen for sample C122 (Illumina) (Fig.
4). The expression pattern of the 12 genes selected in this study showed a consistent pattern across extraction methods but differed for each of the sequencing platforms (Fig.
4). No significant variation in expression patterns was observed between healthy and infected plants. The different pattern per sequencing platform was also consistent, independent of data set size (Fig.
5). Therefore, based on the data generated in this study, UPL7 and GAPC2 might not be consistent internal controls for cross-platform comparisons; however, when a single platform is selected, these genes can be used. The reference genes can also be used to normalise the virus or viroid TPM count to allow for direct virus and viroid concentration comparisons between samples. Only five of the gene accessions had a consistent gene coverage of more than 90% for all technical and subset replicates (Additional file
5).
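One way such a normalisation could be carried out is sketched below. The use of the geometric mean of several reference genes is an assumption borrowed from common expression-normalisation practice, not a method stated in this study, and the function name is hypothetical.

```python
import math


def normalise_to_reference(pathogen_tpm, reference_tpms):
    """Express a pathogen TPM relative to host reference genes.

    Assumption: the geometric mean of the reference-gene TPMs is used as
    the normalisation factor; the study does not specify the statistic.
    """
    if not reference_tpms or any(v <= 0 for v in reference_tpms):
        raise ValueError("reference TPMs must be non-empty and positive")
    # Geometric mean computed in log space for numerical stability.
    log_mean = sum(math.log(v) for v in reference_tpms) / len(reference_tpms)
    return pathogen_tpm / math.exp(log_mean)


# Hypothetical values: pathogen TPM of 200 against two reference genes.
ratio = normalise_to_reference(200.0, [4.0, 16.0])
```

A ratio of this kind is only comparable between samples processed with the same extraction/platform protocol, given the platform-dependent expression patterns described above.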
The expected virome was confirmed with RT-PCR and included five CTV genotypes (RB, VT, T3, T30 and S1), CTLV, CiVA and three viroids (HSVd, CDVd and CEVd). The influence of read mapping to a distant variant of the target virus/viroid was assessed by including nontarget reference sequences of CTV (genotype T68) and the Cachexia variant of HSVd. An average genome coverage for CTV genotype T68 of 73–83% was obtained for the four different extraction/platform protocols. Compared to the genome coverage of the expected CTV genotype RB of > 99%, the T68 read mappings are distinguishable as false positive mappings. A previous study showed that it is possible to obtain up to 90% coverage for nontarget genotypes in mixed genotype infections and that read mapping across more than 95% of the genome is indicative of the presence of a particular genotype [
46]. Due to the extent of variation between CTV genotypes (2–9%) [
46], the selection of reference sequences will influence the coverage percentage. Therefore, if a reference for a genotype not present in the data is used for read mapping, a lower coverage percentage would still indicate the presence of CTV, but of a genotype different from the reference. This would be true for most viruses, and by including representative sequences of the different genotypes in the read mapping strategy, false negative diagnostic calls can be prevented. In the case of HSVd it is important to be able to differentiate between disease-causing and latent variants since they are biologically distinct in citrus. The nontarget Cachexia variant of the HSVd genome was only 36–61% covered for the four different extraction/platform protocols, clearly indicating that this variant was not present in the samples.
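The genome coverage percentages discussed above can be illustrated with a short sketch. It assumes read alignments are available as 0-based, end-exclusive (start, end) intervals on the reference; the function names are hypothetical, and the 95% default follows the genotype threshold reported in the cited study [46].

```python
def genome_coverage(intervals, genome_length):
    """Percentage of reference positions covered by at least one read.

    intervals: iterable of (start, end) alignment coordinates,
               0-based and end-exclusive (an assumption of this sketch).
    """
    covered = 0
    current_end = 0  # right-most position counted so far
    for start, end in sorted(intervals):
        start = max(start, current_end, 0)   # skip positions already counted
        end = min(end, genome_length)        # clip to the reference
        if end > start:
            covered += end - start
            current_end = end
    return 100.0 * covered / genome_length


def genotype_supported(coverage_percent, threshold=95.0):
    """True if coverage exceeds the genotype-level threshold."""
    return coverage_percent > threshold


# Hypothetical alignments on a 200 nt reference: 75% covered.
cov = genome_coverage([(0, 50), (40, 100), (150, 200)], 200)
```

Under this threshold, coverage such as the 73–83% obtained for the nontarget CTV genotype T68 would not support that genotype, while the > 99% coverage of genotype RB would.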
The sensitivity of any detection assay is directly linked to the proportion of viral RNAs among the host cellular RNAs. Therefore, sequencing depth plays an important role in the reliability of the HTS assay. The main conclusion from the subset experiment in this study was that less Illumina data was needed to obtain complete or near complete genomes of the expected pathogens and the Ion Torrent data can perform on par with Illumina if more reads are used for the read mapping. The number of bases in each subset size for the Ion Torrent data was 1.3–1.4 times more compared to the Illumina subsets as a consequence of the longer read length. However, the average distance of 196 bases between the Illumina read pairs may have increased the efficacy of the Illumina read mapping.
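A subset experiment of this kind can be approximated by drawing random read subsets of fixed size from a larger data set. The sketch below uses reservoir sampling with a fixed seed; this is an illustrative assumption, since how the subsets were drawn is not described here, and the function name is hypothetical.

```python
import random


def subsample_reads(reads, n, seed=0):
    """Draw n reads uniformly at random from an iterable of reads.

    Reservoir sampling keeps memory use proportional to n, so the full
    data set never has to be held in memory. If fewer than n reads are
    available, all of them are returned.
    """
    rng = random.Random(seed)  # fixed seed makes the subset repeatable
    reservoir = []
    for i, read in enumerate(reads):
        if i < n:
            reservoir.append(read)
        else:
            # Replace an existing element with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = read
    return reservoir


# Hypothetical read identifiers, for illustration only.
subset = subsample_reads((f"read_{i}" for i in range(1000)), 100, seed=1)
```

Repeating the read mapping on subsets of increasing size then shows the depth at which genome coverage for each pathogen stops improving.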
The finding of a previous study [
37] that showed that sequencing one million reads will provide sufficient genome coverage for closterovirus detection, was confirmed (Fig.
8, Additional file
5). It was also shown that a higher number of reads is needed for other pathogens depending on the extraction/platform protocol. Viroid detection was shown to be variable and even though it was possible to obtain a complete genome with lower read numbers, the detection is only consistent with more sequencing depth.
No citrus pathogen sequences were de novo assembled from the Ornithogalum nontarget positive control or C. sinensis negative control data that were included in the Illumina and Ion Torrent sequencing runs, and no pathogens associated with Ornithogalum were assembled from the citrus RNA data. The Ornithogalum data set was mapped to the citrus pathogens and negligible read counts were obtained. A maximum of 788 citrus RNA reads from the different samples mapped to ornithogalum mosaic virus; however, the genome coverage never exceeded 1.5%. This indicated no significant cross-contamination between samples.
The reproducibility of this study was not measured specifically, as a true test of reproducibility would require an interrogation of the extent to which consistent results could be obtained by repetition of the whole experiment at different timepoints. An attempt at reproducibility was made by including both biological and technical replicates for the Illumina sequencing protocol. The technical replicates were, however, not sequenced on the Ion Torrent platform due to cost implications. Ion Torrent data cost 43% more than the same amount of Illumina data.
In this biological context, reproducibility will not be completely achievable as variables such as plant age, growth stage and virus concentration linked to infection duration might influence the outcome.
A comparison of single-end data from the Illumina platform and single-end data from the Ion Torrent platform was not evaluated to keep the generation of data as close to a real-case scenario as possible. The two service providers, Macrogen and CAF, provide by default, paired-end Illumina and single-end Ion Torrent data, respectively.
Conclusions
This study is a detailed measurement of technical variation in HTS data associated with the detection of viruses and viroids in citrus. The study evaluated the efficiency of using HTS to detect two positive-sense single-stranded RNA viruses from different families, a negative-sense single-stranded RNA virus and three viroid species. The study evaluated the influence of RNA extraction protocol, sequencing platform and data analysis pipelines on the sensitivity, specificity and repeatability of HTS as a detection tool. Each of these parameters introduces a different bias that creates variation in the data output. Even though the different extraction methods, sequencing platforms and data analysis tools resulted in variation in the present study, the end result, the virome profile of each sample, could be confirmed independent of the HTS approach. The study highlights the need to be aware of the level of variation associated with each approach, from sample collection to data interpretation, and how these variables may impact on the initial objective of the HTS assay. This awareness is critical to enable informed adjustments to correctly interpret the data for a reliable result. The primary recommendation that follows from this study is that, irrespective of extraction method or sequencing platform, a combination of de novo assembly and read mapping be used for a routine detection assay.
Since the goal of this study was to evaluate HTS as a detection tool in quarantine or certification schemes, and not for discovery purposes, a list of known pathogens should be available in these settings for read mapping. The aim of a de novo assembly in the certification scheme context will be to identify unsuspected pathogens.
The absence of virus/viroid related de novo assembled contigs does not automatically indicate a negative status for the respective pathogen and read mapping is required as a validation step to confirm absence. This is especially necessary for low concentration viruses and viroids. Read mapping against multiple reference genes as internal controls is also recommended to establish gene ratios for a specific assay. This allows for the evaluation of sequencing depth to accurately determine the absence of low-level infections. The inclusion of a nontarget positive and a negative control can assist significantly to evaluate cross-contamination between samples. The final conclusion is that sequencing depth matters and that with enough data the variation observed between extraction methods and sequencing platforms is diminished and equivalent results can be obtained.
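The recommended combination of de novo assembly and read mapping can be expressed as a simple decision rule. The sketch below is illustrative only: the function name is hypothetical, and the coverage cut-off is deliberately left as a required argument because an appropriate value would have to be validated for each pathogen and assay.

```python
def combined_call(contig_found, mapping_coverage, min_coverage):
    """Combine de novo assembly and read mapping evidence for one pathogen.

    contig_found:     True if de novo assembly produced a contig or
                      scaffold with homology to the pathogen.
    mapping_coverage: genome coverage (percent) from read mapping.
    min_coverage:     assay-specific coverage cut-off (percent); a
                      hypothetical parameter that must be validated.

    A negative call requires both lines of evidence to be negative, so
    the absence of contigs alone never yields a negative result.
    """
    return bool(contig_found or mapping_coverage >= min_coverage)


# Hypothetical case: no contig assembled, but read mapping covers the genome.
detected = combined_call(False, 98.0, min_coverage=90.0)
```

This mirrors the situation described for viroids, where a de novo assembly may fail to produce scaffolds in some replicates while read mapping still recovers the genome.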
The application of HTS for the detection of plant viruses is commonly described as being unbiased; however, this is only true within a specific context, in that it does not require any prior knowledge of the pathogens. There are, however, slight biases and variations at every step of an HTS assay, as demonstrated, which can easily be corrected for once quantified. This study provides strong evidence that the application of HTS for routine pathogen detection is attainable if the detection pipeline is critically validated.