Principles of Bisulfite Sequencing

Bisulfite Sequencing Methods

When considering the epigenome, researchers may refer to changes in chromatin structure due to post-translational modification of histone proteins, populations of non-coding RNA (ncRNA), several chemical modifications to DNA sequences, or a combined effect of these factors. Most often, however, they are referring specifically to the methylome. That is to say: the distinct arrangement of methylcytosines present in the genome and the fluctuations between different organisms, tissues, or cell types within species. This generalization is not a reflection of a greater epigenetic significance but instead an overrepresentation of our current understanding of this type of methylation relative to other epigenetic mechanisms. However, recently more studies realize that epigenetic factors often interact.

In plants, DNA methylation can affect both cytosine [zhang 2018] and adenine [ratel 2006] nucleotides and has been associated with changes in gene expression [lang 2017, zhang 2006, lei 2015], chromosome interactions [grob 2014, feng 2014], and genome stability through the repression of transposable elements [mirouze 2009, tsukahara 2009, la 2011]. The modification can be further characterized as either 5-methylcytosine (5mC) or 5-hydroxymethylcytosine (5hmC) within the context of cytosine methylation. These subgroups may well have contrasting epigenetic functions, though, in Arabidopsis thaliana, at least no appreciable level of 5hmC has been observed in genomic DNA [erdmann 2015]. Methylation of cytosine residues is not a base modification that is limited to genomic DNA. However, the fraction of 5hmC present in RNA may be much higher [huber 2015]. Though, the topic of 5hmC in plants is debated among researchers heavily (Mahmood et al. 2019).

The underlying basis for the perceived emphasis on DNA cytosine methylation is due to the development of bisulfite sequencing as a means for studying epigenetics. Since its conception and initial application in 1992 by Frommer et al. [frommer 1992], the method has received much attention for its capacity to resolve DNA methylation patterns at the nucleotide level. This allows researchers to study the effect of differential methylation between organisms, tissues, or cell types on specific genomic elements such as gene bodies, promoter regions, or other regulatory motifs. This, in turn, provides a roadmap for linking epigenetics to gene expression, heritability, or the activation of particular genes or transposons.

The basis for the method is the usage of sodium bisulfite. This chemical compound catalyzes the hydrolytic deamination of cytosine to uracil via an intermediary sulfonation step from cytosine to unstable cytosinesulfonate [hayatsu 1970, shapiro 1970]. The loss of the amine group from cytosinesulfonate yields uracilsulfonate, which in turn is desulfonated under basic pH conditions to form uracil (Figure 1). Though principally also methylated cytosines can react with sodium bisulfite, the presence of a methyl group on position five of the aromatic ring inhibits the process by an order of two magnitudes [hayatsu 1979]. This inhibition is sufficient to confer selectivity for the conversion of unmethylated cytosines. Unfortunately, this selectivity does not extend to a measurable differentiation between 5mC and 5hmC, which are therefore indistinguishable during the regular application of this method. During the sequencing processes or preceding PCR reactions, uracil positions convert on the newly synthesized DNA fragment to thymine so that the sequencing reads contain the four DNA bases.

Figure 1 Bisulfite conversion of unmethylated Cytosin to Uracil.

Once a given DNA sequence has undergone treatment with sodium bisulfite, any remaining cytosines present in the sequence can be inferred as methylated positions. The problem then progresses to the question of mapping these sequences back to their original location on the genome. As is often the case with next-generation sequencing, it is very difficult to achieve a singular, continuous strand of DNA representing an entire chromosome. DNA stability is such that the molecules are easily fragmented during library preparation, and the sequencing approach itself is often restrictive in terms of the length of DNA that can be sequenced in a single iteration. The fragmentation is further confounded by the harsh bisulfite treatment, which undermines long-read sequencing technologies such as PacBio [Eid 2009, Uemura 2010] or Nanopore [Cherf 2012, Mikheyev 2014] and reduces the cost-benefit ratio for using them.

The dominating circumstance is one where the sequencing data consists of many, short bisulfite-converted DNA sequences that subsequently need to be aligned to a reference genome to determine the relationship of methylated positions with nearby genomic elements. In this regard, it is imperative that a high-quality reference genome already exists for the species of interest. Any improvements made to the existing genome assembly contribute towards mapping precision and reducing any false correlations between methylation patterns and nearby loci.

However, it is not a simple task to map bisulfite-treated reads to the reference genome as it would be performed with DNA-Seq data. The challenge arises from the many unmethylated cytosines, which have since been converted to thymines. The fraction of methylated cytosines varies much between different plant species, with as few as ~5% of total cytosines methylated in A. thaliana [Leutwiler 1984] to as many as ~33%, for example, in cultivated tobacco, Nicotiana tabacum [Wagner 1981]. Taking A. thaliana as an example, with a reported GC content of 41.4% [Leutwiler 1984], this could mean that almost ~20% of the total sequence content on one strand is artificially replaced with thymines where before, they were cytosines. In a standard read alignment procedure, this would result in many differences in the alignment of reads to the reference sequence due to the high fraction of C-T mismatches occurring in place of C-C matches. It would not be enough to specify in the algorithm to treat all cytosines and thymines unanimously because the bisulfite treatment dictates that only thymines present in the sequencing reads could potentially match cytosines in the reference genome. Vice versa, it does not follow conceptually that thymines in the reference genome might be able to match cytosines in the reads. Therefore treating the two bases equally would still generate many differences during read alignment.

As standard alignment tools are often not designed with this base-matching asymmetry in mind, the solution is to make further adaptations to these tools or operate specifically with algorithms designed to handle bisulfite data. As mentioned earlier, several specific alignment algorithms for bisulfite data exist. Prominent among them are Bismark [Krueger 2011], BWA-meth [Pedersen 2014], relying on popular alignment algorithms, and Segemehl [Otto 2012] or ERNE [Prezza 2012], with new specific algorithms. Despite the heuristics involved in adapting conventional software (see Read Alignment), recent benchmarks have shown that software such as BWA-meth compete favorably both in terms of precision-sensitivity and overall computational performance [Nunn et al 2021]. Most tools perform capably in non-repetitive regions, but some indeed struggle more-so than others in repetitive, hard-to-align loci. In some cases it may even be the case that the increased base complexity conveyed by methylated nucleotides under bisulfite sequencing may lead to bias in alignment, which can have downstream consequences for methylation calling. Further consideration should always be given to unique aspects of each tool in the context of the planned study. For example, BWA-meth defines mapping quality values differently from Bismark and this too can have implications for downstream analyses such as variant calling.

Another problem that arises during bisulfite sequencing is the loss of variant information following the sodium bisulfite treatment. A standard sequencing experiment identifies polymorphisms where most reads overlapping a single nucleotide position on the reference genome indicate a deviation from the original base in that position. Single nucleotide polymorphisms (SNPs) are used in genotyping to identify which samples belong to different variants or strains. During epigenetic studies, it is often of crucial importance that genetic variation is kept to a minimum to reduce confounding genetic effects. However, any true SNPs in the C-T context are obscured by the artificially converted bases. It is possible to retrieve this variant information by comparing the converted bases to their complementary bases on the opposite strand. Only true SNPs should have the correct sequence complement. But this is computationally intensive and yet to be implemented efficiently. Analyzing the strands separately in this way would theoretically imply that twice as much sequencing coverage would be necessary compared to normal genotyping to achieve a similarly credible result.

Following a successful read alignment, methylation calling can be performed similarly to variant calling but without the need for a complex model. Instead, each cytosine position is extracted from the alignment and the methylation level determined by majority voting. The ratio of reads with either cytosine or thymine in that position is used to calculate an overall methylation percentage or rate. The total collection of positions can be further subset into different genomic sequence contexts such as CG, CHG, or CHH, where H can be either A, T, or G (see Chapter DNA Methylation). This subsetting later allows for downstream analyses of methylation patterns.

Last updated