Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Bisulfite Sequencing Methods
WGBS is the practice of applying bisulfite sequencing on a genome-wide scale, capturing all regions, and attempting to define global methylation patterns in each sample. This method is appropriate when the study question is broad in scope or if prior information on the genomic regions of interest is limited. It can be considered the "go-to" approach when other methods for concentrating the sequencing on reduced subsets of the genome are either unavailable or inappropriate for the study in question.
There are two main variations of WGBS library preparation, known as BS-Seq [Cokus 2008] and MethylC-Seq [Lister 2008] (Figure 2). In terms of the protocol, they differ primarily in the number of PCR steps and when the ligation of sequencing adaptors occurs relative to the treatment with sodium bisulfite. Many sequencing technologies require specific sequencing adaptors to facilitate base calling on selected DNA fragments. In the case of Illumina, these adaptors are bound to complementary sequences on the flow cell, forming clusters to be sequenced by synthesis. If the adaptor is not present, the DNA molecule is simply washed off the cell, and no information is retrieved. The issue here is that the bisulfite treatment alters the sequence of these adaptors wherever there are unmethylated cytosines present, rendering them incompatible with the complementary sequences on the flow cell. MethylC-Seq addresses this by using custom cytosine-methylated adaptors that remain unaffected by sodium bisulfite. In contrast, BS-Seq circumvents the issue by ligating the adaptors only after the bisulfite treatment.
In principle, the approach of BS-Seq seems more straightforward. However, ligating the sequencing adaptors after the bisulfite treatment presents another problem. The two strands of DNA are no longer complementary to each other and hence remain in a single-stranded state. This is a problem because sequencing adaptor ligation requires duplex DNA. Therefore an additional round of PCR is necessary before adaptor ligation can occur. This PCR step generates reverse complementary strands to both the bisulfite-treated Watson (+FW) and Crick (-FW) strands of the original DNA, which are themselves distinct sequences (+RC and -RC, respectively). The result is a set of four sequences where both the FW and RC strands are indistinguishable from each other by the sequencer. Strand-specificity is therefore lost, and additional bioinformatic processing is required to resolve which reads belong to which strand. In MethylC-Seq, only the +FW and -FW sequences are present, and strand-specificity is cleanly preserved during sequencing, though with paired-end data, it becomes more complex as the +RC and -RC strands are present as well.
A more recent variation of these approaches has also been developed, known as post-bisulfite adaptor tagging (PBAT) [Miura 2012]. In this case, the bisulfite conversion process itself is first used to fragment the genomic DNA. Adaptor ligation is then facilitated by two rounds of random priming extension in place of PCR, thereby maintaining strand-specificity while avoiding any denaturation of adaptor-ligated DNA. The real advantage of this method, however, is its sensitivity in handling sub microgram quantities of DNA without the need for additional amplification, contrary to MethylC-Seq, where the bisulfite treatment often fragments adaptor-ligated DNA templates, which then cannot be used during sequencing. In such a case, the remaining DNA may need to be amplified to achieve a reasonable DNA mass for sequencing, but this amplification risks inducing PCR artifacts. The approach of PBAT can circumvent the need for PCR amplification on sub microgram quantities of DNA. Still, it should be noted that random primer extension is subject to its own biases. Sequence-specific site preferences can give rise to "pile-ups" of reads, and differential priming between methylated and unmethylated alleles has been hypothesized. Therefore, it may be preferable to run MethylC-Seq with a very low number of PCR amplification cycles (e.g., ~ 4) in cases where sample availability is not strictly limited. Last but not least, an even newer approach (TAPS) was published recently that avoids using the bisulfite conversion altogether, allowing for higher mapping quality and nonfragmented duplex DNA after the conversion of methylated cytosines into thymines (Yibin et al. 2019). The downside of TAPS, it is too new to be available as a commercial kit. TAPS is very new and promising, but experiences with this method are yet scars, and so we do not discuss it further in this chapter.
Regardless of the approach selected, at least two cycles of post-bisulfite PCR are necessary to facilitate the conversion of uracil to thymine before sequencing can occur. For these PCRs, the presence of uracil in the sequence precludes the use of many standard, high-fidelity polymerase enzymes with proofreading mechanisms such as Phusion (Thermo Scientific) or KAPA HiFi (Roche). On encountering uracil, these enzymes stall as they await base excision repair [Lindahl 1999, Greagg 1999]. Fortunately, there are alternatives available, such as PfuTurbo Cx (Agilent) or KAPA HiFi uracil+ (Roche), specifically designed for bisulfite sequencing. Once a library has been prepared, it is standard practice to perform library quantification and normalization, using, for example, Qubit / PicoGreen assay or qPCR measurement. It should be noted during this step that methods that estimate only the total quantity of DNA may fail to give an accurate representation of the adapter-ligated DNA, particularly in MethylC-Seq libraries due to the aforementioned fragmentation caused by the bisulfite treatment. For this reason, it is recommended to use a BioAnalyzer for sizing only and qPCR to quantify the final library for bisulfite sequencing.
Several commercial kits are readily available for carrying out bisulfite conversion itself. Depending on your sample DNA quantity and library preparation methodology, the aim is to achieve maximum conversion efficiency relative to optimal DNA recovery. High temperature, high bisulfite molarity, and long incubation times are more likely to yield complete bisulfite conversion but degrade much of the DNA in the process. Incomplete conversion, however, leads to an overestimation of methylation levels on unconverted cytosines. With this trade-off in mind, a good evaluation of modern kits was provided by Kint et al. [Kint 2018], where EpiTect Bisulfite (Qiagen), EZ DNA Methylation-Gold (Zymo Research), and EZ DNA Methylation-Lightning (Zymo Research) kits were each cited for high performance with regards to several study-dependent factors.
To estimate conversion efficiency within bisulfite-treated samples, it is typical to have a control consisting of a known quantity of unmethylated DNA within the sample. Historically the conversion rate was estimated from non-CG cytosines in mammals [Hodges 2009], which is inappropriate for plants where DNA methylation occurs in the CHG and the CHH contexts. Alternatively, the mitochondria or chloroplast genomes were used, as both organelles are widely considered to escape DNA methylation [Marano 1991, Vanyushin 1988]. However, this may not be entirely reliable as conflicting evidence of DNA methylation has since been reported in both [Šimková 1998, Fojtová 2001]. Therefore, the most reliable method in plants is to use a "spike-in" of DNA from another source. The enterobacteria phage Lambda (~ 0.1% w/w) is often used, which is shown to be virtually devoid of 5mC when propagated on mutant bacteria strains lacking DNA methylase activity [Hattman 1973]. Reads aligning to the Lambda genome can then indicate the level of bisulfite conversion, as in theory, all cytosines should have been replaced with thymines.
In addition to Lambda, as noted earlier, the bacteriophage Phi X is commonly used as a "spike-in" to balance base proportions [Raine 2018, Illumina bulletin]. During the initial cycles of Illumina sequencing, the phasing/pre-phasing, color matrix corrections, and pass filter calculations are influenced by the flow cell imaging [Illumina bulletin]. In bisulfite-treated DNA, there is a notable deficiency in cytosine bases and the fluorescent color associated with it, which can skew the base-calling algorithm during this normalization process. Adding the well-balanced Phi X DNA [Sanger 1977] or any other well-balanced DNA to the sequencing library allows the Illumina sequencing to proceed unaffected. Another interesting possibility is to multiplex a bisulfite-treated library with an untreated library with each DNA fragment containing an identifying adapter sequence indicating which library it belongs to. This way, spiking can be omitted, and cytosine methylation and single nucleotide polymorphisms (SNPs) can be obtained from one sequencing run.
RRBS is similar to WGBS in many ways but differs primarily by adding an initial selection procedure at the beginning of the library preparation. It was developed by Meissner et al. [2005] to generate large-scale sequencing data with a lower resolution than WGBS and still evenly representing the genome, though with the option to focus either on Eu- or Heterochromatin. This reduces the sequencing cost compared to WGBS but results in the loss of much sequence content that could otherwise be relevant. In cases where this technique was employed, the enriched fraction was frequently reported to be less than 1% of the whole genome size.
Sample DNA is first subjected to a restriction endonuclease that targets a specific sequence context depending on the local cytosine methylation status. A typical enzyme used is MspI, which targets CG sites in the specific sequence 5’-CCGG-3'. MspI can not cleave when this specific recognition sequence is symmetrically methylated, thus focuses on weakly methylated euchromatin rather than the heavily methylated heterochromatin in the chromosomes. Different sequence contexts require different enzymes, although this application has not been broadly applied in non-CG contexts. The enzymatic digestion produces fragments that can be size selected, usually following some additional end repair and A-tailing depending on which enzyme was used. The rest of the library preparation follows closely with that which was outlined previously for WGBS and unfortunately suffers from the same loss of strand-specificity as BS-Seq. It must be noted here that the recognition site's methylation is not the main focus of the study. Instead, the sequence flanking the recognition site is sequenced, providing information on the methylation status of many cytosines, which can principally be in all three sequence contexts.
Reduced representation bisulfite sequencing is a beneficial technique when the aim is to sequence many biological samples, for example, to study population genetics or when the studied organism has a very large genome, like in many coniferous trees, for example. The technique further allows to roughly direct the analysis either to heterochromatin or euchromatin or, depending on the genome in question, enrich promotor or gene-body sites by choosing the appropriate cleavage enzyme. However, besides this possibility of setting a rough focus of the study, the idea is to provide a valid representation of the genome through a sample of random sequence reads scattered across the genome. Though, it may be desirable in a project to set the target more specifically to a particular region in the genome. This can be achieved through target capturing, which can be applied before or after bisulfite conversion (Wreczycka et al. 2017). Different techniques usually involving the hybridization of genomic DNA with the complementary of a known piece of the target sequence, combined with bisulfite conversion and followed by the above-described processing of converted DNA, enable the inference of the methylation status of a specific target location in the genome of interest. Such techniques may be helpful when, for example, unraveling the methylation status of a known promotor region is the aim of the investigation.
Bisulfite Sequencing Methods
When considering the epigenome, researchers may refer to changes in chromatin structure due to post-translational modification of histone proteins, populations of non-coding RNA (ncRNA), several chemical modifications to DNA sequences, or a combined effect of these factors. Most often, however, they are referring specifically to the methylome. That is to say: the distinct arrangement of methylcytosines present in the genome and the fluctuations between different organisms, tissues, or cell types within species. This generalization is not a reflection of a greater epigenetic significance but instead an overrepresentation of our current understanding of this type of methylation relative to other epigenetic mechanisms. However, recently more studies realize that epigenetic factors often interact.
In plants, DNA methylation can affect both cytosine [zhang 2018] and adenine [ratel 2006] nucleotides and has been associated with changes in gene expression [lang 2017, zhang 2006, lei 2015], chromosome interactions [grob 2014, feng 2014], and genome stability through the repression of transposable elements [mirouze 2009, tsukahara 2009, la 2011]. The modification can be further characterized as either 5-methylcytosine (5mC) or 5-hydroxymethylcytosine (5hmC) within the context of cytosine methylation. These subgroups may well have contrasting epigenetic functions, though, in Arabidopsis thaliana, at least no appreciable level of 5hmC has been observed in genomic DNA [erdmann 2015]. Methylation of cytosine residues is not a base modification that is limited to genomic DNA. However, the fraction of 5hmC present in RNA may be much higher [huber 2015]. Though, the topic of 5hmC in plants is debated among researchers heavily (Mahmood et al. 2019).
The underlying basis for the perceived emphasis on DNA cytosine methylation is due to the development of bisulfite sequencing as a means for studying epigenetics. Since its conception and initial application in 1992 by Frommer et al. [frommer 1992], the method has received much attention for its capacity to resolve DNA methylation patterns at the nucleotide level. This allows researchers to study the effect of differential methylation between organisms, tissues, or cell types on specific genomic elements such as gene bodies, promoter regions, or other regulatory motifs. This, in turn, provides a roadmap for linking epigenetics to gene expression, heritability, or the activation of particular genes or transposons.
The basis for the method is the usage of sodium bisulfite. This chemical compound catalyzes the hydrolytic deamination of cytosine to uracil via an intermediary sulfonation step from cytosine to unstable cytosinesulfonate [hayatsu 1970, shapiro 1970]. The loss of the amine group from cytosinesulfonate yields uracilsulfonate, which in turn is desulfonated under basic pH conditions to form uracil (Figure 1). Though principally also methylated cytosines can react with sodium bisulfite, the presence of a methyl group on position five of the aromatic ring inhibits the process by an order of two magnitudes [hayatsu 1979]. This inhibition is sufficient to confer selectivity for the conversion of unmethylated cytosines. Unfortunately, this selectivity does not extend to a measurable differentiation between 5mC and 5hmC, which are therefore indistinguishable during the regular application of this method. During the sequencing processes or preceding PCR reactions, uracil positions convert on the newly synthesized DNA fragment to thymine so that the sequencing reads contain the four DNA bases.
Figure 1 Bisulfite conversion of unmethylated Cytosin to Uracil.
Once a given DNA sequence has undergone treatment with sodium bisulfite, any remaining cytosines present in the sequence can be inferred as methylated positions. The problem then progresses to the question of mapping these sequences back to their original location on the genome. As is often the case with next-generation sequencing, it is very difficult to achieve a singular, continuous strand of DNA representing an entire chromosome. DNA stability is such that the molecules are easily fragmented during library preparation, and the sequencing approach itself is often restrictive in terms of the length of DNA that can be sequenced in a single iteration. The fragmentation is further confounded by the harsh bisulfite treatment, which undermines long-read sequencing technologies such as PacBio [Eid 2009, Uemura 2010] or Nanopore [Cherf 2012, Mikheyev 2014] and reduces the cost-benefit ratio for using them.
The dominating circumstance is one where the sequencing data consists of many, short bisulfite-converted DNA sequences that subsequently need to be aligned to a reference genome to determine the relationship of methylated positions with nearby genomic elements. In this regard, it is imperative that a high-quality reference genome already exists for the species of interest. Any improvements made to the existing genome assembly contribute towards mapping precision and reducing any false correlations between methylation patterns and nearby loci.
However, it is not a simple task to map bisulfite-treated reads to the reference genome as it would be performed with DNA-Seq data. The challenge arises from the many unmethylated cytosines, which have since been converted to thymines. The fraction of methylated cytosines varies much between different plant species, with as few as ~5% of total cytosines methylated in A. thaliana [Leutwiler 1984] to as many as ~33%, for example, in cultivated tobacco, Nicotiana tabacum [Wagner 1981]. Taking A. thaliana as an example, with a reported GC content of 41.4% [Leutwiler 1984], this could mean that almost ~20% of the total sequence content on one strand is artificially replaced with thymines where before, they were cytosines. In a standard read alignment procedure, this would result in many differences in the alignment of reads to the reference sequence due to the high fraction of C-T mismatches occurring in place of C-C matches. It would not be enough to specify in the algorithm to treat all cytosines and thymines unanimously because the bisulfite treatment dictates that only thymines present in the sequencing reads could potentially match cytosines in the reference genome. Vice versa, it does not follow conceptually that thymines in the reference genome might be able to match cytosines in the reads. Therefore treating the two bases equally would still generate many differences during read alignment.
As standard alignment tools are often not designed with this base-matching asymmetry in mind, the solution is to make further adaptations to these tools or operate specifically with algorithms designed to handle bisulfite data. As mentioned earlier, several specific alignment algorithms for bisulfite data exist. Prominent among them are Bismark [Krueger 2011], BWA-meth [Pedersen 2014], relying on popular alignment algorithms, and Segemehl [Otto 2012] or ERNE [Prezza 2012], with new specific algorithms. Despite the heuristics involved in adapting conventional software (see Read Alignment), recent benchmarks have shown that software such as BWA-meth compete favorably both in terms of precision-sensitivity and overall computational performance [Nunn et al 2021]. Most tools perform capably in non-repetitive regions, but some indeed struggle more-so than others in repetitive, hard-to-align loci. In some cases it may even be the case that the increased base complexity conveyed by methylated nucleotides under bisulfite sequencing may lead to bias in alignment, which can have downstream consequences for methylation calling. Further consideration should always be given to unique aspects of each tool in the context of the planned study. For example, BWA-meth defines mapping quality values differently from Bismark and this too can have implications for downstream analyses such as variant calling.
Another problem that arises during bisulfite sequencing is the loss of variant information following the sodium bisulfite treatment. A standard sequencing experiment identifies polymorphisms where most reads overlapping a single nucleotide position on the reference genome indicate a deviation from the original base in that position. Single nucleotide polymorphisms (SNPs) are used in genotyping to identify which samples belong to different variants or strains. During epigenetic studies, it is often of crucial importance that genetic variation is kept to a minimum to reduce confounding genetic effects. However, any true SNPs in the C-T context are obscured by the artificially converted bases. It is possible to retrieve this variant information by comparing the converted bases to their complementary bases on the opposite strand. Only true SNPs should have the correct sequence complement. But this is computationally intensive and yet to be implemented efficiently. Analyzing the strands separately in this way would theoretically imply that twice as much sequencing coverage would be necessary compared to normal genotyping to achieve a similarly credible result.
Following a successful read alignment, methylation calling can be performed similarly to variant calling but without the need for a complex model. Instead, each cytosine position is extracted from the alignment and the methylation level determined by majority voting. The ratio of reads with either cytosine or thymine in that position is used to calculate an overall methylation percentage or rate. The total collection of positions can be further subset into different genomic sequence contexts such as CG, CHG, or CHH, where H can be either A, T, or G (see Chapter DNA Methylation). This subsetting later allows for downstream analyses of methylation patterns.
Bisulfite Sequencing Methods
Although the use of whole genome bisulfite sequencing to analyse DNA methylation in CG, CHG, and CHH context is highly relevant to the study of epigenetics in plant ecology, it is not all-encompassing. There are indeed other techniques based on similar principles that can be used to capture methylation information, such as methylated DNA immunoprecipitation (meDIP) which can be used in combination with high-resolution DNA microarrays or high-throughput next-generation sequencing.
Recently there has also been some investigation into treatments that can facilitate nucleotide-level base conversion without the harsh side effects of sodium bisulfite [Yibin et al. 2019]. The advent of long-read sequencing technologies such as single molecule real-time (SMRT) analysis with PacBio or Nanopore has also provided an alternative to bisulfite sequencing. The base calling in these approaches can provide wavelength profiles for each nucleotide that differ between bases with and without base modifications [Flusberg et al. 2010]. This has the advantage of detecting DNA methylation without the need for harsh bisulfite treatment, while also allowing for detection of other forms of base modification. Unfortunately the profiles can be difficult to differentiate, but development of machine learning techniques may be a promising avenue of advancement in this regard.
Bisulfite Sequencing Methods
Once sequencing has taken place, the question of identifying DNA methylation is effectively reframed to a computational concern. Like standard sequencing, the basic workflow involves initial quality control (QC) of the generated raw reads, followed by mapping these reads to a reference genome to produce alignment files that are the basis of downstream analyses. Methylated positions can then be extracted from these files in a much similar manner to variant calling. Bisulfite sequencing, however, presents its own challenges, particularly during the read alignment step. This section explores common issues and significant divergences from standard practices in bioinformatics.
Like all reads generated via sequencing by synthesis, bisulfite reads are subject to a drop in quality towards the 3’-end due to the propensity of base-calling errors to accumulate following failures in the synthesis process. During a single cycle of Illumina sequencing, the next base in the read is incorporated into the template strand together with a reversible terminator molecule containing a fluorescent tag. The terminator prevents the following base from being incorporated, so the sequencer can read the color of the fluorescent tag and identify the current base. The molecule is then cleaved to facilitate the next cycle that repeats the process for the next base in the sequence. In these synthesis cycles, it can happen that the terminator molecule is either not cleaved or cleaved too early so that two bases are incorporated during a single cycle. If such an error occurs, the strand is out of sync compared to the other strands of the cluster for all remaining cycles and makes it more difficult for the imager to assess the correct base color within the sequence cluster. The consequence is a quality drop for the complete cluster. This phenomenon is known as phase-shifting and is usually corrected by trimming some bases that fall below a quality threshold (eg. phred score < 20) from the 3’-end of a read, limiting the negative effect on the previous correctly sequenced bases.
Another commonly-encountered issue is the tendency to sequence into the adaptor sequence on DNA fragments smaller than the total number of Illumina cycles [Illumina bulletin]. These sequence subsets are not part of the original DNA, making it much more difficult to map such reads to their true location on the reference genome. Fortunately, the adaptors are synthetically designed, distinct sequences which are therefore known and can thus be readily identified [Illumina technical document]. By considering the overlap of the known adaptor sequence at the 3’-end on each read, the reads can again be trimmed to remove the DNA that is not part of the original sequence.
These common problems can be identified with a standard QC tool such as FastQC [Andrews 2010], and frequently occur in standard sequencing and bisulfite sequencing data. As such tools (i.e., FastQC) are usually not designed for bisulfite sequencing, they may also flag errors such as unbalanced base proportions with as high as ~50% thymine content in read one (or adenine content in read two). If dealing with RRBS data, which has been digested with MspI, there is also the possibility that non-random sequence content is flagged at the 5’-ends of reads, as digested fragments always start with a C base. So long as there is confidence that the standard precautions were taken during library preparation, these warnings can be safely ignored at this stage.
The other facet of using standard QC tools that are not designed for bisulfite sequencing is the tendency to miss bisulfite-related sequencing problems. One such problem can occur during the initial DNA fragmentation step of the library preparation procedure, which often leaves protruding 5' and 3’-ends that must be restored to double-stranded DNA by a process known as overhang end-repair [Poptsova 2014]. The incorporation of unmethylated cytosines during this step can introduce artificially low methylation rates at each end of the DNA fragment, which cannot be detected in standard QC [Lin 2013]. Another such issue is thought to occur due to the re-annealing of single-strand sequences adjacent to the methylated sequencing adaptors during MethylC-Seq, which partially restores double-strandedness thereby affording a measure of protection from the bisulfite treatment [Lin 2013]. Therefore, there is a tendency for bisulfite conversion failure to be enriched towards the 5’-end of reads, leading to artificially high methylation rates. BS-Seq and PBAT libraries should theoretically avoid this bias due to the adaptor-ligation occurring after the bisulfite treatment.
As neither of these issues causes changes to the DNA sequence, they can only be detected once methylation calling has occurred (after read alignment). The standard procedure is to look at the total distribution of methylation levels across the average length of the reads in an approach known as M-bias analysis [Hansen 2012, Lin 2013]. A uniform distribution is expected across the read length, but spikes in methylation level can be observed at each end of the distribution. As the sequence information at each end remains unaffected, they should be used for the alignment and not be clipped similarly to quality trimming for repeated read alignment. Instead, the start and end positions of reads are "masked" from a follow-up repeat of the methylation calling procedure, depending on the deviation of their methylation status from the uniform distribution. It should be noted that a significant variation of read lengths reduces the accuracy of this step.
When all quality concerns are sorted out, the next step in the workflow is to align these reads to an appropriate high-quality reference genome. If a reference genome is not available, de novo assembly is first required before any methylation information can be retrieved. The availability of a good reference genome is fundamental to DNA methylation analysis, when the project aims to relate methylated positions to nearby annotations such as gene bodies, promoter regions, or transposable elements. The higher the genome quality and relatedness of the genome to the test sample, the more confidence you can have in the study's findings [Mardis 2002]. Unfortunately, this degree of confidence is not something that can be measured statistically. Therefore every effort should be made to ensure the validity of your reference genome prior to analysis.
With standard sequencing data, mapping typically involves the use of dynamic programming to determine the best alignment for a given read according to a scoring matrix. Positive scores are given for base matches, or certain types of mismatches, whereas penalties are given for other mismatches and positions where insertions or deletions (indels) are present. The cumulative score is then compared to other potential alignments above a set threshold, and in most cases, only the best one is selected as the most likely point of origin for the read.
However, mapping bisulfite-treated DNA presents a challenge. The majority of cytosine positions on the reference genome are likely unmethylated in the test sample [Wagner 1981, Leutwiler 1984] and therefore represented as thymines in the reads following bisulfite conversion. Aligning these reads results in many C-T mismatches, which negatively influence the scoring matrix and significantly inhibit successful read mapping (Figure 3). It is not enough to simply allow for a higher number of errors, as this would only obscure the correct alignments through an increased number of false positives.
Figure 3 Alignment missmatches with the reference genome resulting from bisulfite conversion.
To further complicate the issue, bisulfite conversion of DNA fragments results in two strands that are no longer complementary to each other. In paired-end sequencing, this means that four distinct sequences are now present, each one varying to some degree from the original DNA (Figure 4). From the original, untreated Watson (+) strand, the first mate pair is the direct bisulfite-converted variant (+FW), and the second is the reverse complement of this (+RC). From the original, untreated Crick (-) strand, the first mate pair is the direct bisulfite-converted variant (-FW), and the second is the reverse complement of this (-RC). In standard DNA sequencing, it is simply the case that the second mate-pair obtained from one strand aligns to the other strand, but in bisulfite sequencing, this no longer holds. In BS-Seq libraries, this is compounded further because the method already encompasses all four-strand variants, even in single-end sequencing. In this case, the directionality relative to the strand (indicated by the box arrows in Figure 4) is thus lost, which is why it is sometimes referred to as an unstranded bisulfite sequencing protocol.
Figure 4 Strand specific point mutations in the new synthesized strand resulting from bisulfite conversion of unmethylated cytosins.
One potential solution for this alignment problem is to adjust the scoring matrix so that a mismatch of thymine to cytosine is instead treated as a match. This can be implemented in standard sequence aligners by "collapsing" the genetic alphabet in both the read and the reference genome so that all cytosines are rewritten as thymines (Figure 5a). Mapping is then performed normally, and the methylated positions are retrieved through post-processing based on the composition of the pre-collapsed sequences. However, this procedure results in two undesirable scenarios that make no sense conceptually: 1) true thymines in the read match with cytosines in the reference genome, and 2) cytosines from the read (indicating methylated positions) match thymines in the reference genome. Such a solution undoubtedly produces many false positives and obscures the correct read alignments. Bisulfite read aligners such as Bismark [Kreuger 2011], BSmooth [Hansen 2012] (in bowtie2 mode), BS-Seeker [Chen 2010, Guo 2013], and BWA-meth [Pedersen 2014] follow this strategy.
A better strategy would be to allow matches between thymines and cytosines, but only between read-based thymines and reference-based cytosines (not vice versa). In this case, methylated cytosines are correctly considered mismatched with thymines in the reference genome, thus reducing false positives (Figure 5b). However, it is still possible for true read-based thymines to match incorrectly with reference-based cytosines (see scenario 1 in the last paragraph). This asymmetric base scoring is not easy to implement in most index structures (e.g., Burrows-Wheeler transform, suffix arrays) used in standard sequence aligners. Therefore specialized read alignment software is required that is explicitly designed for bisulfite sequencing. Such tools include BSMAP [Xi 2009], BSmooth [Hansen 2012] (in merman mode), and ERNE-BS5 [Prezza 2012]. The specialized read aligner segemehl [Otto 2012] uses collapsing (Figure 5a) during a starting step and then changes to asymmetric matching (Figure 5b) in the following. A common drawback of these methods is the increased memory consumption and processing time relative to tools that rely on a collapsed alphabet.
Figure 5 Collapsed alphabet versus assymmetric matchning.
Whichever approach is followed, the entire process likely has to be repeated to account for both C-T conversions and G-A conversions due to the aforementioned loss of complementarity between a given sequence complement and the opposite strand (figure). The possibility of four distinct sequences in bisulfite sequencing, as opposed to two in standard sequencing, dictates that two distinct variants of the reference genome are required to resolve the best alignments. These two alignment procedures may either be run in parallel, as is the case in Bismark [Kreuger 2011], or consecutively as is the case in segemehl [Otto 2012]. Still, the point is that they are interlinked with each other and treated as one.
One problem that might arise during sequencing is the potential for genomic rearrangement [Saxena 2014] that may have occurred in the test sample relative to the reference genome due to evolution. Particularly with reads that originate from a locus that has translocated to the opposite strand. In this case, the strand-specificity is inverted, and the aligner may attempt to map the read to the wrong strand. These false-stranded reads can be detected by the high proportion of G-A mismatches on the Watson (+) strand or C-T mismatches on the Crick (-) strand. In practical terms, a threshold of 3-5% regarding the length of the read is usually enough to identify false-stranded reads. As there is a high probability that these reads originate elsewhere in the genome, they are usually excluded from the alignment because it is not a trivial task to infer where the locus may have translocated to [Onishi-Seebacher 2011].
Filtering can also be applied based on several other factors, in a study-dependant manner, prior to methylation calling. Any spike-in of Lambda or Phi X DNA can be filtered easily in the case of multiplexing with alternative sequencing adaptors, based on the different sequencing tags or it is not indexed at all and therefore automatically filtered out during de-multiplexing. In the case of Lambda, the alignment indexes must be generated to contain both the sample genome and the Lambda phage genome. Any read alignments to the Lambda genome can therefore be filtered out. When splitting the alignment files in this manner, it is essential to remove any reference to the Lambda genome from the file header if there is any intention to view the alignment file with a genome browser (e.g., IGV, JBrowse).
Finally, filtering based on multi-mapped reads or PCR duplicates may also be considered, like in standard sequencing experiments. Any given read should conceptually originate from a single position on the genome. Still, equally-scoring alignments may be possible, particularly in highly repetitive regions or regions of low complexity [Treangen 2011]. If each of these alignments is considered separately during methylation calling, then the statistical power is skewed, especially if the methylation is not the same between these regions. The options are to exclude these alignments entirely, as is performed intrinsically by some sequence aligners [Krueger 2011, Guo 2013], to select one such alignment at random, or to accept the reduction in statistical power to retain the information from that read. Regarding PCR duplicates, several tools exist to identify such reads that arise from a single DNA fragment, such as Picard MarkDuplicates [Broad Institute] and samtools rmdup [Li 2009]. These tools identify PCR duplicates based on the proportionally higher likelihood that identical reads arise from PCR, then that they are separate fragments. In this case, such reads are counted only once during methylation calling. However, deduplication works only in WGBS sequencing projects. If PCR is absent or negligible during library preparation, however, this step should be avoided.
To detect the level of DNA methylation at any given cytosine position within the test sample, the reads overlapping that position are evaluated to give the proportion of the coverage that is methylated bases (cytosine) versus the sum of methylated bases (cytosine) and unmethylated bases (thymine). In practice, there is only one major divergence in the implementation of this method by different tools: what to do with C-T polymorphisms that arise before bisulfite treatment has taken place. Most tools simply ignore SNPs. If the consensus indicates anything other than a cytosine or a thymine in a cytosine position on the reference genome, then the entire position is often excluded from analysis. It is not necessarily accurate to consider such an SNP as evidence against DNA methylation, particularly in the case of adenine which may also be methylated [Ratel 2006].
A more robust analysis will recognise also that not all read-based thymines overlapping reference-based cytosine positions are representative of the bisulfite treatment [Liu 2012]. A mutation in this context is deceiving, as we might interpret it as demethylation resulting from an epigenetic mechanism rather than a genetic one. The only way to identify such a mutation from the data itself, rather than from independent genotyping of the test sample, is to compare the overlapping reads in that position to those on the opposite strand. The DNA strands after treatment with sodium bisulfite are no longer complementary to each other since the methylation information is strand-specific. Artificially converted bases should therefore always lack a complementary base on the opposite strand, whereas a C-T mutation will have support on the opposing strand. This difference can be used to exclude such base positions from the analysis; the independent testing of both strands in this manner however would suggest that twice the coverage be required to achieve a similar degree of confidence to standard genotyping.
Consideration should also be made when using PE sequencing to regions that overlap between read pairs. Generally each read pair will start from opposing ends of a single DNA fragment, and it is possible in cases where the fragment size is less than double the number of cycles during Illumina sequencing that a part of the fragment will be sequenced twice for the same read pair [Magoč 2011]. In this case, methylation information in this overlap region is redundant and does not constitute an independent observation of the methylation status as if they arose from separate DNA fragments. It is therefore wise to identify overlapping regions prior to methylation calling and mask those bases from either one of the two reads of a pair.
Adam Nunn
Though it is by no means the only biological mechanism within the domain of epigenetics, DNA methylation is among the most prevalent and the most studied throughout the field. In earlier chapters, we have discussed underlying molecular processes and the ecological consequences of differential methylation within species- but how exactly do we detect these differences? One technique that has emerged at the forefront of epigenetic research is bisulfite sequencing: a distinct adaptation of next-generation sequencing which produces genome-wide methylation profiles at a nucleotide-level resolution.
The technique, devised by Frommer et al. [Frommer 1992] and refined for modern sequencing techniques by Lister et al. [Lister 2008] and Cokus et al. [Cokus 2008], involves the treatment of extracted DNA from test samples with sodium bisulfite, a deaminating agent which mediates the conversion of unmethylated cytosine nucleotides into uracil. Cytosine bases that carry methyl groups (e.g. 5-methylcytosine, 5-hydroxymethylcytosine) are left unaffected by the treatment and remain in their original unconverted state. As uracil residues are subsequently converted to thymine during the PCR step of standard DNA sequencing, these bisulfite-treated samples can be subjected to standard sequencing protocols and used to generate sequencing reads, which carry epigenetic information. Once treated, the reads effectively reframe the research question from a biological to a computational, algorithmic concern, at least until the results require interpretation.
In standard sequencing, the next step is to follow workflows for read alignment of the sequencing reads to a reference genome assembly. Aligning sequencing reads uses overlapping sequences to puzzle as many reads as possible into long sequence fragments, so-called contigs. The alignment presents some issues when handling bisulfite data, as thymine residues can no longer be considered as entirely independent entities to cytosine. Read alignment algorithms usually operate based on scoring matrixes, which assign an overall probability for the alignment of two sequences based on the number and position of matches, mismatches, insertions, and deletions between nucleotides. The problem arises in that reference cytosines can conceptually match thymines in bisulfite-treated reads, but not vice versa. Existing algorithms are often not built to handle this asymmetry between bases, so the solution is either to adapt these tools in some way further or to operate specifically with algorithms designed for bisulfite data. Several tools now exist in representation of either category, including notably Bismark [Krueger 2011] and BWA-meth [Pederson 2014], which adapt the popular standard aligners bowtie [Langmead 2013] and BWA [Li 2010], and software such as segemehl [Otto 2012] or ERNE-BS5 [Prezza 2012] which are capable of interpreting bisulfite reads in their own right.
The principles of bisulfite sequencing notwithstanding, another important consideration when designing such an experiment involves the chosen strategy for library preparation. Like generally in next-generation-sequencing, sequencing depth and coverage are also very important for bisulfite-sequencing, as we seek to maximize sequencing coverage regarding the scope of the questions we are looking to answer and the practical limitations of the study, such as cost and time. For example, does your study seek to investigate genome-wide methylation patterns, or is it enough to focus on a reduced subset of the DNA? Herein, we will consider Whole-Genome Bisulfite Sequencing (WGBS) applications, Reduced-Representation Bisulfite Sequencing (RRBS), and variations of these methods. In particular, what implications have such protocols on data robustness, and how should we adjust the procedure to improve quality control during the downstream analyses?
The chapter covers various technical concerns of bisulfite sequencing, from DNA extraction and library preparation to sequencing itself and the downstream extraction of methylated positions. The bioinformatic principles determine the data validity for answering the questions posed by the study, and an a priori consideration, therefore, is fundamental to the successful outcome of any such experiment. Finally, we discuss the rigid limitations of bisulfite sequencing and give brief suggestions for alternative methods that might be used to address these issues.
Bisulfite Sequencing Methods
Adapter trimming: Why are adapter sequences trimmed from only the 3' ends of reads? (2018). Retrieved January 24, 2019, from https://support.illumina.com/bulletins/2016/04/adapter-trimming-why-are-adapter-sequences-trimmed-from-only-the--ends-of-reads.html
Aird, D., Ross, M. G., Chen, W. S., Danielsson, M., Fennell, T., Russ, C., … Gnirke, A. (2011). Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biology, 12(2), R18. https://doi.org/10.1186/gb-2011-12-2-r18
Benjamini, Y., & Speed, T. P. (2012). Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Research, 40(10), 1–14. https://doi.org/10.1093/nar/gks001
Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., Milton, J., Brown, C. G., … Smith, A. J. (2008). Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456, 53–49. https://doi.org/10.1038/nature07517
Bormann Chung, C. A., Boyd, V. L., McKernan, K. J., Fu, Y., Monighetti, C., Peckham, H. E., & Barker, M. (2010). Whole methylome analysis by ultra-deep sequencing using two-base encoding. PLoS ONE, 5(2). https://doi.org/10.1371/journal.pone.0009320
Chen, P.-Y., Cokus, S. J., & Pellegrini, M. (2010). BS Seeker: precise mapping for bisulfite sequencing. BMC Bioinformatics, 11(203), 2–7. https://doi.org/10.1186/1471-2105-11-203
Cherf, G. M., Lieberman, K. R., Rashid, H., Lam, C. E., Karplus, K., & Akeson, M. (2012). Automated Forward and Reverse Ratcheting of DNA in a Nanopore at Five Angstrom Precision. Nature Biotechnology, 30(4), 344–348. https://doi.org/10.1038/nbt.2147.Automated
Cokus, S. J., Feng, S., Zhang, X., Chen, Z., Merriman, B., Haudenschild, C. D., … Jacobsen, S. E. (2008). Shotgun bisulfite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature, 452(7184), 215–219. https://doi.org/10.3390/ijms151222874
Eid, J., Fehr, A., Gray, J., Luong, K., Lyle, J., Otto, G., … Dewinter, A. (2009). Real-Time DNA sequencing from Single Polymerase Molecules. Science, 323(January), 133–138. https://doi.org/10.1126/science.1162986
Erdmann, R. M., Souza, A. L., Clish, C. B., & Gehring, M. (2015). 5-Hydroxymethylcytosine Is Not Present in Appreciable Quantities in Arabidopsis DNA. G3: Genes, Genomes, Genetics, 5(1), 1–8. https://doi.org/10.1534/g3.114.014670
Feng, S., Cokus, S. J., Schubert, V., Zhai, J., Pellegrini, M., & Jacobsen, S. E. (2014). Genome-wide Hi-C Analyses in Wild-Type and Mutants Reveal High-Resolution Chromatin Interactions in Arabidopsis. Molecular Cell, 55(5), 694–707. https://doi.org/10.1016/j.molcel.2014.07.008
Flusberg, B. A., Webster, D. R., Lee, J. H., Travers, K. J., Olivares, E. C., Clark, T. A., ... & Turner, S. W. (2010). Direct detection of DNA methylation during single-molecule, real-time sequencing. Nature methods, 7(6), 461-465.
Fojtová, M., Kovařík, A., & Matyášek, R. (2001). Cytosine methylation of plastid genome in higher plants. Fact or artefact? Plant Science, 160(4), 585–593. https://doi.org/10.1016/S0168-9452(00)00411-8
Frommer, M., McDonald, L. E., Millar, D. S., Collist, C. M., Watt, F., Griggt, G. W., … Paul, C. L. (1992). A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. Proceedings of the National Academy of Sciences, 89(March), 1827–1831. https://doi.org/10.1073/pnas.89.5.1827
Greagg, M., Fogg, M. J., Panayotou, G., Evans, S. J., Connolly, B. A., & Pearl, L. H. (1999). A read-ahead function in archaeal DNA polymerases detects promutagenic template-strand uracil. Proceedings of the National Academy of Sciences, 96(16), 9045–9050. https://doi.org/10.1073/pnas.96.16.9045
Grehl, C., Kuhlmann, M., Becker, C., Glaser, B., & Grosse, I. (2018). How to Design a Whole-Genome Bisulfite Sequencing Experiment. Epigenomes, 2(21), 1–11. https://doi.org/10.3390/epigenomes2040021
Grob, S., Schmid, M. W., & Grossniklaus, U. (2014). Hi-C Analysis in Arabidopsis Identifies the KNOT, a Structure with Similarities to the flamenco Locus of Drosophila. Molecular Cell, 55(5), 678–693. https://doi.org/10.1016/j.molcel.2014.07.009
Guo, W., Fiziev, P., Yan, W., Cokus, S. J., Sun, X., Zhang, M. Q., … Pellegrini, M. (2013). BS-Seeker2: A versatile aligning pipeline for bisulfite sequencing data. BMC Genomics, 14(1). https://doi.org/10.1186/1471-2164-14-774
Hansen, K. D., Brenner, S. E., & Dudoit, S. (2010). Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research, 38(12), 1–7. https://doi.org/10.1093/nar/gkq224
Hansen, K. D., Langmead, B., & Irizarry, R. A. (2012). BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology, 13(10), 1–10. https://doi.org/10.1186/gb-2012-13-10-r83
Hattman, S., Schlagman, S., & Cousens, L. (1973). Isolation of a mutant of Escherichia coli defective in cytosine specific deoxyribonucleic acid methylase activity and in partial protection of bacteriophage ?? against restriction by cells containing the N 3 drug resistance factor. Journal of Bacteriology, 115(3), 1103–1107. Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC246359/pdf/jbacter00349-0397.pdf
Hayatsu, H., & Shiragami, M. (1979). Reaction of Bisulfite with the 5-Hydroxymethyl Group in Pyrimidines and in Phage DNAs. Biochemistry, 18(4), 632–637. https://doi.org/10.1021/bi00571a013
Hayatsu, H., Wataya, Y., Kai, K., & Iida, S. (1970). Reaction of sodium bisulfite with uracil, cytosine, and their derivatives. Biochemistry, 9(14), 2858–2865. https://doi.org/10.1021/ja00706a062
Hodges, E., Smith, A., Kendall, J., Xuan, Z., Ravi, K., Rooks, M., … Hicks, J. (2009). High definition profiling of mammalian DNA methylation by array capture and single molecule bisulfite sequencing. Genome Research, 19(9), 1593. Retrieved from https://genome.cshlp.org/content/early/2009/07/06/gr.095190.109.full.pdf
Huber, S. M., Van Delft, P., Mendil, L., Bachman, M., Smollett, K., Werner, F., … Balasubramanian, S. (2015). Formation and abundance of 5-hydroxymethylcytosine in RNA. ChemBioChem, 16(5), 752–755. https://doi.org/10.1002/cbic.201500013
Illumina Adapter Sequences. (2018). San Diego, California. Retrieved from https://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/experiment-design/illumina-adapter-sequences-1000000002694-09.pdf
Kint, S., De Spiegelaere, W., De Kesel, J., Vandekerckhove, L., & Van Criekinge, W. (2018). Evaluation of bisulfite kits for DNA methylation profiling in terms of DNA fragmentation and DNA recovery using digital PCR. PLoS ONE, 13(6), e0199091. https://doi.org/10.1371/journal.pone.0199091
Kozarewa, I., Ning, Z., Quail, M. a, Sanders, M. J., & Turner, D. J. (2009). Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of GC-biased genomes. Nature Methods, 6(4), 291–295. https://doi.org/10.1038/nmeth.1311.Amplification-free
Krueger, F., & Andrews, S. R. (2011). Bismark: A flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics, 27(11), 1571–1572. https://doi.org/10.1093/bioinformatics/btr167
La, H., Ding, B., Mishra, G. P., Zhou, B., Yang, H., Bellizzi, M. del R., … Wang, G.-L. (2011). A 5-methylcytosine DNA glycosylase/lyase demethylates the retrotransposon Tos17 and promotes its transposition in rice. Proceedings of the National Academy of Sciences, 108(37), 15498–15503. https://doi.org/10.1073/pnas.1112704108
Laehnemann, D., Borkhardt, A., & McHardy, A. C. (2016). Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction. Briefings in Bioinformatics, 17(1), 154–179. https://doi.org/10.1093/bib/bbv029
Lang, Z., Wang, Y., Tang, K., Tang, D., Datsenka, T., Cheng, J., … Zhu, J.-K. (2017). Critical roles of DNA demethylation in the activation of ripening-induced genes and inhibition of ripening-repressed genes in tomato fruit. Proceedings of the National Academy of Sciences, 114(22), E4511–E4519. https://doi.org/10.1073/pnas.1705233114
Langmead, B., & Salzberg, S. L. (2013). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357–359. https://doi.org/10.1038/nmeth.1923.Fast
Lei, M., Zhang, H., Julian, R., Tang, K., Xie, S., & Zhu, J.-K. (2015). Regulatory link between DNA methylation and active demethylation in Arabidopsis. Proceedings of the National Academy of Sciences, 112(11), 3553–3557. https://doi.org/10.1073/pnas.1502279112
Leutwiler, L. S., Hough-Evans, B. R., & Meyerowitz, E. M. (1984). The DNA of Arabidopsis thaliana. MGG Molecular & General Genetics, 194(1–2), 15–23. https://doi.org/10.1007/BF00383491
Li, F. W., & Harkess, A. (2018). A guide to sequence your favorite plant genomes. Applications in Plant Sciences. https://doi.org/10.1002/aps3.1030
Li, H., & Durbin, R. (2010). Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26(5), 589–595. https://doi.org/10.1093/bioinformatics/btp698
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., … Durbin, R. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. https://doi.org/10.1093/bioinformatics/btp352
Lin, X., Sun, D., Rodriguez, B., Zhao, Q., Sun, H., Zhang, Y., & Li, W. (2013). BSeQC: Quality control of bisulfite sequencing experiments. Bioinformatics, 29(24), 3227–3229. https://doi.org/10.1093/bioinformatics/btt548
Lindahl, T., & Wood, R. D. (1999). Quality control by DNA repair. Science, 286(5446), 1897–1905. https://doi.org/10.1126/science.286.5446.1897
Lister, R., O'Malley, R. C., Tonti-Filippini, J., Gregory, B. D., Berry, C. C., Millar, A. H., & Ecker, J. R. (2008). Highly Integrated Single-Base Resolution Maps of the Epigenome in Arabidopsis. Cell, 133(3), 523–536. https://doi.org/10.1016/j.cell.2008.03.029
Liu, Y., Siegmund, K. D., Laird, P. W., & Berman, B. P. (2012). Bis-SNP: Combined DNA methylation and SNP calling for Bisulfite-seq data. Genome Biology, 13(7). https://doi.org/10.1186/gb-2012-13-7-r61
Liu, Y., Siejka-Zielińska, P., Velikova, G., Bi, Y., Yuan, F., Tomkova, M., ... & Song, C. X. (2019). Bisulfite-free direct detection of 5-methylcytosine and 5-hydroxymethylcytosine at base resolution. Nature biotechnology, 37(4), 424-429.
Magoč, T., & Salzberg, S. L. (2011). FLASH: Fast length adjustment of short reads to improve genome assemblies. Bioinformatics, 27(21), 2957–2963. https://doi.org/10.1093/bioinformatics/btr507
Mahmood, A. M., & Dunwell, J. M. (2019). Evidence for novel epigenetic marks within plants. AIMS genetics, 6(4), 70.
Marano, M. R., & Carrillo, N. (1991). Chromoplast formation during tomato fruit ripening. No evidence for plastid DNA methylation. Plant Molecular Biology, 16(1), 11–19. https://doi.org/10.1007/BF00017913
Mardis, E., McPherson, J., Martienssen, R., Wilson, R. K., & McCombie, W. R. (2002). What is Finished, and Why Does it Matter. Genome Research, 12, 669–671. https://doi.org/10.1101/gr.032102.O
McKernan, K. J., Blanchard, A., Kotler, L., & Costa, G. (2006). 2008/0003571. United States.
Meissner, Alexander, Andreas Gnirke, George W. Bell, Bernard Ramsahoye, Eric S. Lander, and Rudolf Jaenisch. "Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis." Nucleic acids research 33, no. 18 (2005): 5868-5877.
Mikheyev, A. S., & Tin, M. M. Y. (2014). A first look at the Oxford Nanopore MinION sequencer. Molecular Ecology Resources, 14(6), 1097–1102. https://doi.org/10.1111/1755-0998.12324
Mirouze, M., Reinders, J., Bucher, E., Nishimura, T., Schneeberger, K., Ossowski, S., … Mathieu, O. (2009). Selective epigenetic control of retrotransposition in Arabidopsis. Nature, 461(7262), 427–430. https://doi.org/10.1038/nature08328
Miura, F., Enomoto, Y., Dairiki, R., & Ito, T. (2012). Amplification-free whole-genome bisulfite sequencing by post-bisulfite adaptor tagging. Nucleic Acids Research, 40(17). https://doi.org/10.1093/nar/gks454
Nunn, A., Otto, C., Stadler P. F., Langenberger D. (2021). Comprehensive benchmarking of software for mapping whole genome bisulfite data: from read alignment to DNA methylation analysis, Briefings in Bioinformatics, bbab021, https://doi.org/10.1093/bib/bbab021
Onishi-Seebacher, M., & Korbel, J. O. (2011). Challenges in studying genomic structural variant formation mechanisms: The short-read dilemma and beyond. BioEssays, 33(11), 840–850. https://doi.org/10.1002/bies.201100075
Otto, C., Stadler, P. F., & Hoffmann, S. (2012). Fast and sensitive mapping of bisulfite-treated sequencing data. Bioinformatics, 28(13), 1698–1704. https://doi.org/10.1093/bioinformatics/bts254
Pabinger, S., Ernst, K., Pulverer, W., Kallmeyer, R., Valdes, A. M., Metrustry, S., … Weinhaeusel, A. (2016). Analysis and visualization tool for targeted amplicon bisulfite sequencing on ion torrent sequencers. PLoS ONE, 11(7), 1–16. https://doi.org/10.1371/journal.pone.0160227
Pedersen, B., Eyring, K., De, S., Yang, I. V, & Schwartz, D. A. (2014). Fast and accurate alignment of long bisulfite-seq reads (arXiv q-bio). Denver, Colorado. Retrieved from https://arxiv.org/pdf/1401.1129.pdf
Picard Toolkit. (2019). Broad Institute. Retrieved from https://github.com/broadinstitute/picard
Poptsova, M. S., Il’Icheva, I. A., Nechipurenko, D. Y., Panchenko, L. A., Khodikov, M. V, Oparina, N. Y., … Grokhovsky, S. L. (2014). Non-random DNA fragmentation in next-generation sequencing. Scientific Reports, 4, 1–6. https://doi.org/10.1038/srep04532
Prezza, N., Del Fabbro, C., Vezzi, F., De Paoli, E., & Policriti, A. (2012). ERNE-BS5: Aligning BS-treated sequences by multiple hits on a 5-letters alphabet. 2012 ACM Conference on Bioinformatics, Computational Biology and Biomedicine, BCB 2012, (May), 12–19. https://doi.org/10.1145/2382936.2382938
Raine, A., Liljedahl, U., & Nordlund, J. (2018). Data quality of whole genome bisulfite sequencing on Illumina platforms. PLoS ONE, 13(4), 1–15. https://doi.org/10.1371/journal.pone.0195972
Ratel, D., Ravanat, J. L., Berger, F., & Wion, D. (2006, March). N6-methyladenine: The other methylated base of DNA. BioEssays. https://doi.org/10.1002/bies.20342
Rothberg, J. M., Hinz, W., Rearick, T. M., Schultz, J., Mileski, W., Davey, M., … Bustillo, J. (2011). An integrated semiconductor device enabling non-optical genome sequencing. Nature, 475, 348–352. https://doi.org/10.1038/nature10242
Sanger, F., Air, G. M., Barrell, B. G., Brown, N. L., Coulson, A. R., Fiddes, J. C., … Smith, M. (1977). Nucleotide sequence of bacteriophage φX174 DNA. Nature, 265, 687–695. https://doi.org/10.1038/267585a0
Saxena, R. K., Edwards, D., & Varshney, R. K. (2014). Structural variations in plant genomes. Briefings in Functional Genomics, 13(4). https://doi.org/10.1093/bfgp/elu016
Shapiro, R., Servis, R. E., & Welcher, M. (1970). Reactions of Uracil and Cytosine Derivatives with Sodium Bisulfite. A Specific Deamination Method. Journal of the American Chemical Society, 92(2), 422–424. https://doi.org/10.1021/ja00705a626
Shendure, J., Porreca, G. J., Reppas, N. B., Lin, X., McCutcheon, J. P., Rosenbaum, A. M., … Church, G. M. (2005). Accurate multiplex polony sequencing of an evolved bacterial genome. Science, 309, 1728–1732. https://doi.org/10.1126/science.1117389
Šimková, H. (1998). Methylation of mitochondrial DNA in carrot (Daucus carota L.). Plant Cell Reports, 17(3), 220–224. https://doi.org/10.1007/s002990050382
Treangen, T. J., & Salzberg, S. L. (2011). Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature Reviews Genetics, 13(1), 36–46. https://doi.org/10.1038/nrg3117.Repetitive
Tsukahara, S., Kobayashi, A., Kawabe, A., Mathieu, O., Miura, A., & Kakutani, T. (2009). Bursts of retrotransposition reproduced in Arabidopsis. Nature, 461(17), 423–427. https://doi.org/10.1038/nature08351
Uemura, S., Aitken, C. E., Korlach, J., Flusberg, B. A., Turner, S. W., & Puglisi, J. D. (2010). Real-time tRNA transit on single translating ribosomes at codon resolution. Nature, 464(7291), 1012–1017. https://doi.org/10.1038/nature08925
Vanyushin, B. F., & Kirnos, M. D. (1988). DNA methylation in plants. Gene, 74(1), 117–121. Retrieved from http://zero.sci-hub.se/2223/39b0cdc1e6d65e6ac3b157a88ec31cc7/vanyushin1988.pdf
Venney, C. J., Johansson, M. L., & Heath, D. D. (2016). Inbreeding effects on gene-specific DNA methylation among tissues of Chinook salmon. Molecular Ecology, 25(18), 4521–4533. https://doi.org/10.1111/mec.13777
Wagner, I., & Capesius, I. (1981). Determination of 5-methylcytosine from plant DNA by high-performance liquid chromatography. Biochimica et Biophysica Acta, 654, 52–56. https://doi.org/10.1360/zd-2013-43-6-1064
What is nucleotide diversity and why is it important? (2017). Retrieved January 24, 2018, from https://support.illumina.com/bulletins/2016/07/what-is-nucleotide-diversity-and-why-is-it-important.html
Wreczycka, K., Gosdschan, A., Yusuf, D., Grüning, B., Assenov, Y., & Akalin, A. (2017). Strategies for analyzing bisulfite sequencing data. Journal of biotechnology, 261, 105-115.
Xi, Y., & Li, W. (2009). BSMAP: Whole genome bisulfite sequence MAPping program. BMC Bioinformatics, 10. https://doi.org/10.1186/1471-2105-10-232
Zhang, H., Lang, Z., & Zhu, J.-K. (2018). Dynamics and function of DNA methylation in plants. Nature Reviews Molecular Cell Biology, 19(8), 489–506. https://doi.org/10.1038/s41580-018-0016-z
Zhang, X., Yazaki, J., Sundaresan, A., Cokus, S. J., Chan, S. W. L., Chen, H., … Ecker, J. R. (2006). Genome-wide High-Resolution Mapping and Functional Analysis of DNA Methylation in Arabidopsis. Cell, 126(6), 1189–1201. https://doi.org/10.1016/j.cell.2006.08.003
Ziller, M. J., Hansen, K. D., Meissner, A., & Aryee, M. J. (2015). Coverage recommendations for methylation analysis by whole genome bisulfite sequencing. Nature Methods, 12(3), 230–232. https://doi.org/10.1097/NCN.0b013e3181a91b58.Exploring
Bisulfite Sequencing Methods
As with many ecological studies involving next-generation sequencing, the experimental design is often based heavily around one fundamental trade-off. This trade-off is paramount to achieving a level of statistical power appropriate for the study's aim. In the previous chapter (chapter NGS sequencing), we discussed sequencing depth and coverage; this applies here as we seek to maximize sequencing coverage with regards to the scope of the questions we are looking to answer and the practical limitations of the study, such as cost and time. Next-generation sequencing is expensive, with costs driven by the quantity of material to be sequenced. This can be delineated to the total genome size, number of replicates, level of sequence coverage, and sequencing technology itself. Therefore, an ideal study seeks to define and optimize these factors a priori to carrying out the experiment and the subsequent downstream analyses.
Next-generation sequencing is a fast-developing field, with several commercial technologies currently being available to address various experimental needs and applications. These can broadly be categorized into short-read technologies (~ 50-900 bp), which tend to be cheaper with higher per-base accuracy and throughput than the counterpart long-read technologies (~ 1-500 kbp) [Li 2018]. The advantage of obtaining long reads is that they are less prone to assembly and mapping-related errors, making them suitable for resolving true genome arrangements even in repetitive regions and regions of low complexity, e.g., found in heterochromatin regions. Due to the issues mentioned above with generating long fragments from sodium bisulfite-treated DNA samples, however, these advantages are frequently lost. Short-read sequencing technologies, therefore, are dominantly selected for methylation analyses.
Among these short-read sequencing technologies, Ion Torrent [Rothberg 2011], Illumina [Bentley 2008], and SOLiD [Shendure 2006, McKernan 2006] have all been used successfully for bisulfite sequencing [Cokus 2008, Lister 2008, Lang 2017, Mirouze 2009, Bormann Chung 2010, Pabinger 2016, Venney 2016]. Each may be appropriate depending on the desired read size and number of reads per run, which is influenced by the size of the genome and the nature of the study in question (i.e., sample size). The most extensively used of these is Illumina, wherein the HiSeq 2500, 3000, and 4000 or for large projects, HiSeqX and NovoSeq are all appropriate high-throughput applications. A good comparison between these and some alternative systems is given by Grehl et al. [Grehl 2018]. A problem arises for bisulfite sequencing from the requirement of many sequencing machines that nucleotides should be present in the DNA fragments in roughly even proportions to perform efficient base calling. Following the bisulfite treatment, cytosines are underrepresented. To still enable high throughput sequencing, additional DNA with balanced base proportions and known sequence (e.g., Phi X) is added or "spiked in" during library preparation. This DNA standard is sequenced together with the target DNA and later filtered out. The disadvantage is that it reduces the potential sequencing coverage considerably because the machines may require 20-40% spiked in standard DNA. Grehl et al. [Grehl 2018] mention, however, that the HiSeq 2500 benefits specifically from an optimization of the cluster calling algorithm, which allows for the handling of bisulfite-treated libraries without the need for additional base proportion balancing (i.e., spiking). Thus, the HiSeq 2500 is a good baseline to act as a starting point for designing your sequencing experiments. However, your final choice should be influenced by whether or not the specific limitations of this method need addressing within the scope of your study.
As methylation calling is calculated from the number of overlapping reads aligning to each position, it is clear that the statistical power increases with greater sequencing depth. In an ideal scenario, the sequencing depth would be uniform and consistent across all positions of interest to the study, but this is rarely the case. Sequence-related biases in random hexamer priming [Hansen 2010], random DNA fragmentation [Poptsova 2014], and the PCR amplification [Kozarewa 2009, Aird 2011] stages of the library preparation impede uniformity and lead to coverage underrepresentation in regions of extreme GC-content [Benjamini 2012]. A straightforward approach in mitigating these issues is to select a level of coverage that ensures a minimum lower bound in the majority of regions where the distributed coverage is lower than the mean.
To give a point of reference, Ziller et al. [Ziller 2015] found that coverage between 5-15x was optimal in terms of statistical power for detecting differentially methylated regions between a range of human tissue and cell types in the CG context. Beyond that, resources would be better allocated towards expanding the number of biological replicates, starting at a minimum of two to capture within-group variance. In plants, it would be wise to consider this point of reference as a bare minimum for a homozygous diploid. Both the heterozygosity and the ploidy level invariably influence the minimum level of coverage due to the increased variation on single positions, which may lead to greater within-group heterogeneity necessitating a larger number of replicates. The highly-repetitive regions, as well as low-complexity regions often found in plant genomes, are also notoriously difficult to map to, leading to multi-mapped reads and alignment ambiguities [Treangen 2011]. This problem can be mitigated by increasing the coverage and using paired-end (PE) sequencing in the absence of long-read data. Furthermore, the magnitude of methylation differences is usually less pronounced in the CHG and CHH contexts than in the CG context requiring more power for detection. If the study seeks to capture differences in these contexts, then an increase in the number of replicates should be considered relative to CG alone.
Once the optimal level of coverage and number of replicates have been decided, it may be the case that the total genome size for the species of interest pushes the cost outside the range of affordability. In these instances, it is sometimes possible during library preparation to subset the material and exclude regions of minor or no relevance to the scope of the study. Whether this is appropriate or not depends on the research questions. An overview of each method is given in the following section.