Library Preparation

Bisulfite Sequencing Methods

Whole Genome Bisulfite Sequencing (WGBS)

WGBS is the practice of applying bisulfite sequencing on a genome-wide scale, capturing all regions, and attempting to define global methylation patterns in each sample. This method is appropriate when the study question is broad in scope or if prior information on the genomic regions of interest is limited. It can be considered the "go-to" approach when other methods for concentrating the sequencing on reduced subsets of the genome are either unavailable or inappropriate for the study in question.

There are two main variations of WGBS library preparation, known as BS-Seq [Cokus 2008] and MethylC-Seq [Lister 2008] (Figure 2). In terms of the protocol, they differ primarily in the number of PCR steps and when the ligation of sequencing adaptors occurs relative to the treatment with sodium bisulfite. Many sequencing technologies require specific sequencing adaptors to facilitate base calling on selected DNA fragments. In the case of Illumina, these adaptors are bound to complementary sequences on the flow cell, forming clusters to be sequenced by synthesis. If the adaptor is not present, the DNA molecule is simply washed off the cell, and no information is retrieved. The issue here is that the bisulfite treatment alters the sequence of these adaptors wherever there are unmethylated cytosines present, rendering them incompatible with the complementary sequences on the flow cell. MethylC-Seq addresses this by using custom cytosine-methylated adaptors that remain unaffected by sodium bisulfite. In contrast, BS-Seq circumvents the issue by ligating the adaptors only after the bisulfite treatment.

In principle, the approach of BS-Seq seems more straightforward. However, ligating the sequencing adaptors after the bisulfite treatment presents another problem. The two strands of DNA are no longer complementary to each other and hence remain in a single-stranded state, whereas sequencing adaptor ligation requires duplex DNA. Therefore, an additional round of PCR is necessary before adaptor ligation can occur. This PCR step generates reverse complementary strands (+RC and -RC, respectively) to both the bisulfite-treated Watson (+FW) and Crick (-FW) strands of the original DNA, and these are themselves distinct sequences. The result is a set of four sequences in which the sequencer cannot distinguish the FW strands from the RC strands. Strand-specificity is therefore lost, and additional bioinformatic processing is required to resolve which reads belong to which strand. In MethylC-Seq, only the +FW and -FW sequences are present, and strand-specificity is cleanly preserved during sequencing, although paired-end data are more complex because the +RC and -RC strands are then present as well.
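The four strand labels above can be illustrated with a small in-silico conversion. This is a minimal sketch rather than part of any published pipeline: the function names, the toy sequence, and the methylated positions are all hypothetical, and the conversion is idealized (every unmethylated cytosine converts, every methylated cytosine is protected, and uracil is written directly as T, as it would be read after PCR).

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def bisulfite_convert(seq, methylated=frozenset()):
    """Idealized conversion: unmethylated C becomes T (uracil read as T after PCR).

    `methylated` holds 0-based positions, in the 5'->3' coordinates of `seq`,
    that are protected from conversion.
    """
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def four_strands(watson, methylated_watson=frozenset(), methylated_crick=frozenset()):
    """Build the +FW, -FW, +RC and -RC sequences for one duplex fragment."""
    crick = reverse_complement(watson)
    plus_fw = bisulfite_convert(watson, methylated_watson)
    minus_fw = bisulfite_convert(crick, methylated_crick)
    return {
        "+FW": plus_fw,
        "-FW": minus_fw,
        "+RC": reverse_complement(plus_fw),
        "-RC": reverse_complement(minus_fw),
    }


# Hypothetical 8 bp fragment with one symmetrically methylated CG site
# (position 1 on the Watson strand, position 5 on the Crick strand).
strands = four_strands("ACGTCCGA", methylated_watson={1}, methylated_crick={5})
for name, seq in strands.items():
    print(name, seq)
```

In this toy case the converted Watson strand is no longer the reverse complement of the converted Crick strand, which is why the two strands cannot re-anneal, and all four sequences differ from one another.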

A more recent variation of these approaches has also been developed, known as post-bisulfite adaptor tagging (PBAT) [Miura 2012]. In this case, the bisulfite conversion process itself is first used to fragment the genomic DNA. Adaptor ligation is then facilitated by two rounds of random priming extension in place of PCR, thereby maintaining strand-specificity while avoiding any denaturation of adaptor-ligated DNA. The real advantage of this method, however, is its sensitivity in handling sub-microgram quantities of DNA without the need for additional amplification. This contrasts with MethylC-Seq, where the bisulfite treatment often fragments adaptor-ligated DNA templates, which then cannot be used during sequencing. In such a case, the remaining DNA may need to be amplified to achieve a reasonable DNA mass for sequencing, but this amplification risks introducing PCR artifacts. Still, it should be noted that random primer extension is subject to its own biases. Sequence-specific site preferences can give rise to "pile-ups" of reads, and differential priming between methylated and unmethylated alleles has been hypothesized. Therefore, it may be preferable to run MethylC-Seq with a very low number of PCR amplification cycles (e.g., ~4) in cases where sample availability is not strictly limited. Finally, an even newer approach (TAPS) was published recently that avoids bisulfite conversion altogether, allowing for higher mapping quality and non-fragmented duplex DNA after the conversion of methylated cytosines into thymines (Yibin et al. 2019). The downside of TAPS is that it is too new to be available as a commercial kit. The method is promising, but experience with it is still scarce, and so we do not discuss it further in this chapter.

Regardless of the approach selected, at least two cycles of post-bisulfite PCR are necessary to replace uracil with thymine before sequencing can occur. For these PCRs, the presence of uracil in the sequence precludes the use of many standard, high-fidelity polymerase enzymes with proofreading mechanisms, such as Phusion (Thermo Scientific) or KAPA HiFi (Roche). On encountering uracil, these enzymes stall as they await base excision repair [Lindahl 1999, Greagg 1999]. Fortunately, there are alternatives available, such as PfuTurbo Cx (Agilent) or KAPA HiFi uracil+ (Roche), specifically designed for bisulfite sequencing. Once a library has been prepared, it is standard practice to perform library quantification and normalization, using, for example, a Qubit or PicoGreen assay or a qPCR measurement. It should be noted during this step that methods that estimate only the total quantity of DNA may fail to give an accurate representation of the adaptor-ligated DNA, particularly in MethylC-Seq libraries, due to the aforementioned fragmentation caused by the bisulfite treatment. For this reason, it is recommended to use a BioAnalyzer for sizing only and qPCR to quantify the final library for bisulfite sequencing.
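The gap between total DNA and adaptor-ligated DNA can be made concrete with a little arithmetic. The sketch below is only an illustration under stated assumptions: it uses the common approximation of 660 g/mol per base pair of double-stranded DNA, takes the mean fragment length from a sizing trace, and treats the ratio of a qPCR-derived molarity to the mass-derived molarity as a rough estimate of the sequenceable fraction. The function names and example values are hypothetical.

```python
# Common approximation: average molar mass of one base pair of dsDNA (g/mol).
AVG_BP_MASS = 660.0


def mass_to_molarity_nm(conc_ng_per_ul, mean_fragment_bp):
    """Convert a mass concentration (ng/uL) to molarity (nM) for dsDNA fragments."""
    if mean_fragment_bp <= 0:
        raise ValueError("mean fragment length must be positive")
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * mean_fragment_bp)


def sequenceable_fraction(qpcr_nm, conc_ng_per_ul, mean_fragment_bp):
    """Rough share of molecules that carry intact adaptors.

    Assumes the qPCR assay only amplifies fragments with both adaptors,
    while the fluorometric assay measures all double-stranded DNA.
    """
    return qpcr_nm / mass_to_molarity_nm(conc_ng_per_ul, mean_fragment_bp)


# Hypothetical library: 6.6 ng/uL by fluorometry, 500 bp mean size, 10 nM by qPCR.
total_nm = mass_to_molarity_nm(6.6, 500)
fraction = sequenceable_fraction(10.0, 6.6, 500)
print(f"total: {total_nm:.1f} nM, sequenceable fraction: {fraction:.2f}")
```

With these example numbers, the mass-based estimate is about twice the qPCR-based one, so loading the flow cell according to the fluorometric value alone would under-load it.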

Several commercial kits are readily available for carrying out the bisulfite conversion itself. Depending on the quantity of sample DNA and the library preparation methodology, the aim is to balance maximum conversion efficiency against DNA recovery. High temperature, high bisulfite molarity, and long incubation times are more likely to yield complete bisulfite conversion but degrade much of the DNA in the process. Incomplete conversion, however, leads to an overestimation of methylation levels, because unconverted cytosines are read as methylated. With this trade-off in mind, a good evaluation of modern kits was provided by Kint et al. [Kint 2018], where the EpiTect Bisulfite (Qiagen), EZ DNA Methylation-Gold (Zymo Research), and EZ DNA Methylation-Lightning (Zymo Research) kits were each cited for high performance with regard to several study-dependent factors.
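The size of this overestimation follows from a simple model, sketched below. The model is an assumption made for illustration and is not taken from the cited evaluation: methylated cytosines are never converted, and each unmethylated cytosine is converted independently with probability equal to the conversion rate. The function names are hypothetical.

```python
def apparent_methylation(true_level, conversion_rate):
    """Methylation level that would be observed at a site under the model.

    Every methylated cytosine reads as C, and so does the unconverted share
    (1 - conversion_rate) of the unmethylated cytosines.
    """
    return true_level + (1.0 - true_level) * (1.0 - conversion_rate)


def corrected_methylation(observed_level, conversion_rate):
    """Invert the model to recover the true level, clipped to [0, 1]."""
    if not 0.0 < conversion_rate <= 1.0:
        raise ValueError("conversion rate must be in (0, 1]")
    failure = 1.0 - conversion_rate
    estimate = (observed_level - failure) / conversion_rate
    return min(1.0, max(0.0, estimate))


# With 98% conversion, a fully unmethylated site appears about 2% methylated,
# and a site that is truly 50% methylated appears about 51% methylated.
for true_level in (0.0, 0.5):
    print(true_level, apparent_methylation(true_level, 0.98))
```

The bias is largest at unmethylated sites and vanishes at fully methylated ones, which is why low methylation levels are the most sensitive to conversion failure.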

To estimate conversion efficiency within bisulfite-treated samples, it is typical to have a control consisting of a known quantity of unmethylated DNA within the sample. Historically, the conversion rate was estimated from non-CG cytosines in mammals [Hodges 2009], which is inappropriate for plants, where DNA methylation also occurs in the CHG and CHH contexts. Alternatively, the mitochondrial or chloroplast genomes were used, as both organelles are widely considered to escape DNA methylation [Marano 1991, Vanyushin 1988]. However, this may not be entirely reliable, as conflicting evidence of DNA methylation has since been reported in both [Šimková 1998, Fojtová 2001]. Therefore, the most reliable method in plants is to use a "spike-in" of DNA from another source. DNA of the enterobacteria phage Lambda (~0.1% w/w) is often used, as it has been shown to be virtually devoid of 5mC when propagated on mutant bacterial strains lacking DNA methylase activity [Hattman 1973]. Reads aligning to the Lambda genome can then indicate the level of bisulfite conversion, since, in theory, all of its cytosines should have been converted and read as thymines.
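The calculation behind the spike-in estimate is a simple ratio, sketched here on toy data. This is a simplified illustration, not a replacement for a methylation caller: it assumes ungapped reads already aligned to the forward strand of the control genome, ignores sequencing errors and reverse-strand reads, and the function name and input format are hypothetical.

```python
def conversion_rate(reference, alignments):
    """Estimate the bisulfite conversion rate from reads on an unmethylated control.

    `alignments` is a list of (start, read) pairs, where `read` is an ungapped
    forward-strand read placed at 0-based position `start` of `reference`.
    At every reference cytosine, a T in the read counts as converted and a C
    as unconverted; any other base is ignored.
    """
    converted = unconverted = 0
    for start, read in alignments:
        for offset, base in enumerate(read):
            if reference[start + offset] != "C":
                continue
            if base == "T":
                converted += 1
            elif base == "C":
                unconverted += 1
    total = converted + unconverted
    if total == 0:
        raise ValueError("no reference cytosines are covered by the reads")
    return converted / total


# Toy control sequence and two reads; one cytosine in the second read escaped conversion.
toy_reference = "ACGTCCGATC"
toy_reads = [(0, "ATGTTTGATT"), (2, "GTTCGATT")]
print(f"{conversion_rate(toy_reference, toy_reads):.3f}")
```

In practice the same ratio is taken over all cytosines covered by reads mapping to the Lambda genome, and rates well below the expected near-complete conversion suggest a problem with the bisulfite treatment.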

In addition to Lambda, as noted earlier, the bacteriophage Phi X is commonly used as a "spike-in" to balance base proportions [Raine 2018, Illumina bulletin]. During the initial cycles of Illumina sequencing, the phasing/pre-phasing, color matrix corrections, and pass filter calculations are influenced by the flow cell imaging [Illumina bulletin]. In bisulfite-treated DNA, there is a notable deficiency in cytosine bases and the fluorescent color associated with them, which can skew the base-calling algorithm during this normalization process. Adding the well-balanced Phi X DNA [Sanger 1977], or any other well-balanced DNA, to the sequencing library allows the Illumina sequencing to proceed unaffected. Another interesting possibility is to multiplex a bisulfite-treated library with an untreated library, with each DNA fragment carrying an adaptor sequence that identifies the library it belongs to. This way, spiking can be omitted, and both cytosine methylation and single nucleotide polymorphisms (SNPs) can be obtained from one sequencing run.

Reduced Representation Bisulfite Sequencing (RRBS)

RRBS is similar to WGBS in many ways but differs primarily by adding a selection procedure at the beginning of the library preparation. It was developed by Meissner et al. [2005] to generate large-scale sequencing data with a lower resolution than WGBS while still representing the genome evenly, with the option to focus on either euchromatin or heterochromatin. This reduces the sequencing cost compared to WGBS but results in the loss of much sequence content that could otherwise be relevant. In cases where this technique was employed, the enriched fraction was frequently reported to be less than 1% of the whole genome size.

Sample DNA is first subjected to a restriction endonuclease that targets a specific sequence context depending on the local cytosine methylation status. A typical enzyme used is MspI, which targets CG sites in the specific sequence 5'-CCGG-3'. MspI cannot cleave when this specific recognition sequence is symmetrically methylated and thus focuses on weakly methylated euchromatin rather than the heavily methylated heterochromatin in the chromosomes. Different sequence contexts require different enzymes, although this approach has not been broadly applied in non-CG contexts. The enzymatic digestion produces fragments that can be size-selected, usually following some additional end repair and A-tailing, depending on which enzyme was used. The rest of the library preparation closely follows that outlined previously for WGBS and unfortunately suffers from the same loss of strand-specificity as BS-Seq. It must be noted here that the methylation of the recognition site itself is not the main focus of the study. Instead, the sequence flanking the recognition site is sequenced, providing information on the methylation status of many cytosines, which can in principle be in all three sequence contexts.
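The digestion and size-selection steps can be mimicked in silico to get a feel for which part of a genome a given enzyme would retain. The following is a minimal sketch with several simplifications: it cuts every CCGG site between the two cytosines regardless of methylation status, works on a single strand, and uses a toy sequence and an arbitrary size window. The function names are hypothetical.

```python
import re


def mspi_fragments(sequence):
    """Cut a sequence at every MspI site (C^CGG) and return the fragments in order.

    Methylation is ignored here, so every recognition site is treated as cleavable.
    """
    cut_positions = [match.start() + 1 for match in re.finditer("CCGG", sequence)]
    bounds = [0] + cut_positions + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


def size_select(fragments, min_len, max_len):
    """Keep only fragments whose length lies within [min_len, max_len]."""
    return [frag for frag in fragments if min_len <= len(frag) <= max_len]


# Toy sequence with two MspI sites; the size window is chosen only for illustration.
toy_sequence = "AACCGGTTTTCCGGA"
fragments = mspi_fragments(toy_sequence)
selected = size_select(fragments, 4, 8)
retained = sum(len(frag) for frag in selected) / len(toy_sequence)
print(fragments, selected, f"{retained:.2f}")
```

Run on a real reference genome with a realistic size window, the retained fraction gives a rough expectation of how much of the genome a reduced-representation library would cover.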

Target Capture Bisulfite Sequencing

Reduced representation bisulfite sequencing is a beneficial technique when the aim is to sequence many biological samples, for example, to study population genetics, or when the studied organism has a very large genome, as in many coniferous trees. The technique also makes it possible to roughly direct the analysis to either heterochromatin or euchromatin or, depending on the genome in question, to enrich promoter or gene-body sites by choosing the appropriate cleavage enzyme. Beyond this possibility of setting a rough focus, however, the idea is to provide a valid representation of the genome through a sample of random sequence reads scattered across it. In some projects, it may be desirable to target a particular region of the genome more specifically. This can be achieved through target capture, which can be applied before or after bisulfite conversion (Wreczycka et al. 2017). Different techniques exist, usually involving the hybridization of genomic DNA with the complement of a known piece of the target sequence, combined with bisulfite conversion and followed by the processing of converted DNA described above. They enable the inference of the methylation status of a specific target location in the genome of interest. Such techniques may be helpful when, for example, the aim of the investigation is to unravel the methylation status of a known promoter region.
