Experimental Design

Bisulfite Sequencing Methods

As with many ecological studies involving next-generation sequencing, the experimental design revolves around one fundamental trade-off, and striking the right balance is paramount to achieving the statistical power appropriate for the study's aim. In the previous chapter (chapter NGS sequencing), we discussed sequencing depth and coverage; the same considerations apply here, as we seek to maximize sequencing coverage with regard to the scope of the questions we want to answer while respecting the practical limitations of the study, such as cost and time. Next-generation sequencing is expensive, with costs driven by the quantity of material to be sequenced. This quantity is determined by the total genome size, the number of replicates, the level of sequence coverage, and the sequencing technology itself. An ideal study therefore defines and optimizes these factors before the experiment and the subsequent downstream analyses are carried out.
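As a simple illustration of this planning step, the sketch below estimates the total sequencing output, and the approximate number of lanes, required for a given genome size, target coverage, and number of samples. All numeric values (genome size, coverage, per-lane output) are illustrative assumptions and should be replaced with project-specific figures and current platform specifications.

```python
# Back-of-envelope estimate of the total sequencing output required for a
# whole-genome bisulfite sequencing experiment. All values are assumptions.

genome_size_gb = 0.4        # haploid genome size in gigabases (assumed)
target_coverage = 15        # desired mean per-sample coverage (x, assumed)
n_groups = 2                # e.g., treatment vs. control
replicates_per_group = 4    # biological replicates per group (assumed)

n_samples = n_groups * replicates_per_group
output_per_sample_gb = genome_size_gb * target_coverage
total_output_gb = output_per_sample_gb * n_samples

# Rough lane requirement, assuming ~100 Gb of usable output per lane
# (this varies widely between platforms and run modes).
usable_output_per_lane_gb = 100
lanes_needed = total_output_gb / usable_output_per_lane_gb

print(f"Per-sample output: {output_per_sample_gb:.1f} Gb")
print(f"Total output for {n_samples} samples: {total_output_gb:.1f} Gb")
print(f"Approximate lanes required: {lanes_needed:.2f}")
```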

Next-generation sequencing is a fast-developing field, with several commercial technologies currently available to address various experimental needs and applications. These can broadly be categorized into short-read technologies (~50-900 bp), which tend to be cheaper and offer higher per-base accuracy and throughput, and long-read technologies (~1-500 kbp) [Li 2018]. The advantage of long reads is that they are less prone to assembly- and mapping-related errors, making them suitable for resolving true genome arrangements even in repetitive and low-complexity regions, e.g., those found in heterochromatin. However, due to the issues mentioned above with generating long fragments from sodium bisulfite-treated DNA samples, these advantages are frequently lost. Short-read sequencing technologies are therefore predominantly selected for methylation analyses.

Among these short-read sequencing technologies, Ion Torrent [Rothberg 2011], Illumina [Bentley 2008], and SOLiD [Shendure 2006, McKernan 2006] have all been used successfully for bisulfite sequencing [Cokus 2008, Lister 2008, Lang 2017, Mirouze 2009, Bormann Chung 2010, Pabinger 2016, Venney 2016]. Each may be appropriate depending on the desired read length and the number of reads per run, which is influenced by the size of the genome and the nature of the study in question (i.e., sample size). The most extensively used is Illumina, for which the HiSeq 2500, 3000, and 4000, or, for large projects, the HiSeq X and NovaSeq, are all appropriate high-throughput platforms. A good comparison between these and some alternative systems is given by Grehl et al. [Grehl 2018]. A problem arises for bisulfite sequencing because many sequencing machines require the four nucleotides to be present in roughly even proportions across the DNA fragments for efficient base calling. Following bisulfite treatment, cytosines are underrepresented. To still enable high-throughput sequencing, additional DNA with balanced base proportions and a known sequence (e.g., Phi X) is added, or "spiked in", during library preparation. This DNA standard is sequenced together with the target DNA and later filtered out. The disadvantage is that it reduces the potential sequencing coverage considerably, because the machines may require 20-40% spiked-in standard DNA. Grehl et al. [Grehl 2018] note, however, that the HiSeq 2500 benefits from an optimization of the cluster-calling algorithm that allows it to handle bisulfite-treated libraries without additional base-proportion balancing (i.e., spiking). The HiSeq 2500 is thus a good starting point when designing your sequencing experiments, although your final choice should depend on whether the specific limitations of this method need addressing within the scope of your study.
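To make the effect of spiking concrete, the short sketch below compares the mean per-sample coverage obtainable from a single lane with and without a 30% spike-in. The lane output, genome size, and multiplexing level are again illustrative assumptions, not platform specifications.

```python
# Illustration of how a spike-in standard (e.g., Phi X) reduces the effective
# coverage of the target libraries. All values are assumptions.

lane_output_gb = 100        # raw usable output per lane (assumed)
spike_in_fraction = 0.30    # within the 20-40% range mentioned above
genome_size_gb = 0.4        # haploid genome size (assumed)
samples_per_lane = 4        # samples multiplexed on the lane (assumed)

effective_output_gb = lane_output_gb * (1 - spike_in_fraction)
coverage_with_spike = effective_output_gb / (genome_size_gb * samples_per_lane)
coverage_without_spike = lane_output_gb / (genome_size_gb * samples_per_lane)

print(f"Mean coverage per sample with 30% spike-in: {coverage_with_spike:.1f}x")
print(f"Mean coverage per sample without spike-in:  {coverage_without_spike:.1f}x")
```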

As methylation calls are derived from the number of overlapping reads aligning to each position, statistical power clearly increases with greater sequencing depth. Ideally, the sequencing depth would be uniform across all positions of interest to the study, but this is rarely the case. Sequence-related biases in random hexamer priming [Hansen 2010], random DNA fragmentation [Poptsova 2014], and the PCR amplification [Kozarewa 2009, Aird 2011] stages of library preparation impede uniformity and lead to underrepresentation of regions with extreme GC content [Benjamini 2012]. A straightforward way to mitigate these issues is to choose a mean coverage high enough that the majority of positions still meet a minimum lower bound, even where the realized coverage falls below that mean.
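As a rough guide, one can model the per-position depth as Poisson-distributed around the mean coverage and ask what fraction of positions would still reach a chosen minimum depth. Real bisulfite data are overdispersed by the biases listed above, so this minimal sketch should be read as an optimistic bound rather than a guarantee.

```python
# Minimal sketch: fraction of positions reaching a minimum depth under an
# idealized Poisson model of coverage (real data will be overdispersed).
from scipy.stats import poisson

min_depth = 5                      # minimum depth required to call methylation (assumed)
for mean_coverage in (5, 10, 15, 20, 30):
    # P(depth >= min_depth) under a Poisson(mean_coverage) model
    frac_callable = poisson.sf(min_depth - 1, mean_coverage)
    print(f"mean {mean_coverage:>2}x -> ~{frac_callable:.1%} of positions at >= {min_depth}x")
```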

To give a point of reference, Ziller et al. [Ziller 2015] found that a coverage of 5-15x was optimal in terms of statistical power for detecting differentially methylated regions between a range of human tissue and cell types in the CG context. Beyond that, resources are better allocated towards expanding the number of biological replicates, starting at a minimum of two to capture within-group variance. In plants, this point of reference should be considered a bare minimum for a homozygous diploid. Both heterozygosity and ploidy level influence the minimum level of coverage because of the increased variation at single positions, which may lead to greater within-group heterogeneity and thus necessitate a larger number of replicates. The highly repetitive and low-complexity regions often found in plant genomes are also notoriously difficult to map to, leading to multi-mapped reads and alignment ambiguities [Treangen 2011]. This problem can be mitigated by increasing the coverage and using paired-end (PE) sequencing in the absence of long-read data. Furthermore, the magnitude of methylation differences is usually less pronounced in the CHG and CHH contexts than in the CG context, requiring more power for their detection. If the study seeks to capture differences in these contexts, an increase in the number of replicates, relative to what would suffice for CG alone, should be considered.
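The sketch below, which is not taken from the cited studies, illustrates why smaller differences demand more data. It uses a normal-approximation power calculation for comparing two methylation proportions at a single cytosine, treating each overlapping read as an independent trial and ignoring between-replicate biological variation; the proportions and depths are assumed values chosen only to contrast a large CG-like difference with a small CHH-like one.

```python
# Rough illustration: approximate power of a two-proportion z-test at a single
# cytosine, for a large vs. a small methylation difference. Reads are treated
# as independent Bernoulli trials; biological replicate variance is ignored.
from math import sqrt
from scipy.stats import norm

def approx_power(p1, p2, depth, alpha=0.05):
    """Normal-approximation power for comparing two methylation proportions."""
    se = sqrt(p1 * (1 - p1) / depth + p2 * (1 - p2) / depth)
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - abs(p1 - p2) / se)

for depth in (10, 30, 100):
    large = approx_power(0.2, 0.8, depth)    # 60% difference (CG-like, assumed)
    small = approx_power(0.2, 0.35, depth)   # 15% difference (CHH-like, assumed)
    print(f"depth {depth:>3}x: power ~{large:.2f} (60% diff) vs ~{small:.2f} (15% diff)")
```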

Once the optimal level of coverage and the number of replicates have been decided, the total genome size of the species of interest may still push the cost beyond the available budget. In these instances, it is sometimes possible during library preparation to subset the material and exclude regions of little or no relevance to the scope of the study. Whether this is appropriate depends on the research questions. An overview of each method is given in the following section.
