Computational Processing

Bisulfite Sequencing Methods

Once sequencing has taken place, the question of identifying DNA methylation is effectively reframed to a computational concern. Like standard sequencing, the basic workflow involves initial quality control (QC) of the generated raw reads, followed by mapping these reads to a reference genome to produce alignment files that are the basis of downstream analyses. Methylated positions can then be extracted from these files in a much similar manner to variant calling. Bisulfite sequencing, however, presents its own challenges, particularly during the read alignment step. This section explores common issues and significant divergences from standard practices in bioinformatics.

Quality Control

Like all reads generated via sequencing by synthesis, bisulfite reads are subject to a drop in quality towards the 3’-end due to the propensity of base-calling errors to accumulate following failures in the synthesis process. During a single cycle of Illumina sequencing, the next base in the read is incorporated into the template strand together with a reversible terminator molecule containing a fluorescent tag. The terminator prevents the following base from being incorporated, so the sequencer can read the color of the fluorescent tag and identify the current base. The molecule is then cleaved to facilitate the next cycle that repeats the process for the next base in the sequence. In these synthesis cycles, it can happen that the terminator molecule is either not cleaved or cleaved too early so that two bases are incorporated during a single cycle. If such an error occurs, the strand is out of sync compared to the other strands of the cluster for all remaining cycles and makes it more difficult for the imager to assess the correct base color within the sequence cluster. The consequence is a quality drop for the complete cluster. This phenomenon is known as phase-shifting and is usually corrected by trimming some bases that fall below a quality threshold (eg. phred score < 20) from the 3’-end of a read, limiting the negative effect on the previous correctly sequenced bases.

Another commonly-encountered issue is the tendency to sequence into the adaptor sequence on DNA fragments smaller than the total number of Illumina cycles [Illumina bulletin]. These sequence subsets are not part of the original DNA, making it much more difficult to map such reads to their true location on the reference genome. Fortunately, the adaptors are synthetically designed, distinct sequences which are therefore known and can thus be readily identified [Illumina technical document]. By considering the overlap of the known adaptor sequence at the 3’-end on each read, the reads can again be trimmed to remove the DNA that is not part of the original sequence.

These common problems can be identified with a standard QC tool such as FastQC [Andrews 2010], and frequently occur in standard sequencing and bisulfite sequencing data. As such tools (i.e., FastQC) are usually not designed for bisulfite sequencing, they may also flag errors such as unbalanced base proportions with as high as ~50% thymine content in read one (or adenine content in read two). If dealing with RRBS data, which has been digested with MspI, there is also the possibility that non-random sequence content is flagged at the 5’-ends of reads, as digested fragments always start with a C base. So long as there is confidence that the standard precautions were taken during library preparation, these warnings can be safely ignored at this stage.

The other facet of using standard QC tools that are not designed for bisulfite sequencing is the tendency to miss bisulfite-related sequencing problems. One such problem can occur during the initial DNA fragmentation step of the library preparation procedure, which often leaves protruding 5' and 3’-ends that must be restored to double-stranded DNA by a process known as overhang end-repair [Poptsova 2014]. The incorporation of unmethylated cytosines during this step can introduce artificially low methylation rates at each end of the DNA fragment, which cannot be detected in standard QC [Lin 2013]. Another such issue is thought to occur due to the re-annealing of single-strand sequences adjacent to the methylated sequencing adaptors during MethylC-Seq, which partially restores double-strandedness thereby affording a measure of protection from the bisulfite treatment [Lin 2013]. Therefore, there is a tendency for bisulfite conversion failure to be enriched towards the 5’-end of reads, leading to artificially high methylation rates. BS-Seq and PBAT libraries should theoretically avoid this bias due to the adaptor-ligation occurring after the bisulfite treatment.

As neither of these issues causes changes to the DNA sequence, they can only be detected once methylation calling has occurred (after read alignment). The standard procedure is to look at the total distribution of methylation levels across the average length of the reads in an approach known as M-bias analysis [Hansen 2012, Lin 2013]. A uniform distribution is expected across the read length, but spikes in methylation level can be observed at each end of the distribution. As the sequence information at each end remains unaffected, they should be used for the alignment and not be clipped similarly to quality trimming for repeated read alignment. Instead, the start and end positions of reads are "masked" from a follow-up repeat of the methylation calling procedure, depending on the deviation of their methylation status from the uniform distribution. It should be noted that a significant variation of read lengths reduces the accuracy of this step.

Read Alignment

When all quality concerns are sorted out, the next step in the workflow is to align these reads to an appropriate high-quality reference genome. If a reference genome is not available, de novo assembly is first required before any methylation information can be retrieved. The availability of a good reference genome is fundamental to DNA methylation analysis, when the project aims to relate methylated positions to nearby annotations such as gene bodies, promoter regions, or transposable elements. The higher the genome quality and relatedness of the genome to the test sample, the more confidence you can have in the study's findings [Mardis 2002]. Unfortunately, this degree of confidence is not something that can be measured statistically. Therefore every effort should be made to ensure the validity of your reference genome prior to analysis.

With standard sequencing data, mapping typically involves the use of dynamic programming to determine the best alignment for a given read according to a scoring matrix. Positive scores are given for base matches, or certain types of mismatches, whereas penalties are given for other mismatches and positions where insertions or deletions (indels) are present. The cumulative score is then compared to other potential alignments above a set threshold, and in most cases, only the best one is selected as the most likely point of origin for the read.

However, mapping bisulfite-treated DNA presents a challenge. The majority of cytosine positions on the reference genome are likely unmethylated in the test sample [Wagner 1981, Leutwiler 1984] and therefore represented as thymines in the reads following bisulfite conversion. Aligning these reads results in many C-T mismatches, which negatively influence the scoring matrix and significantly inhibit successful read mapping (Figure 3). It is not enough to simply allow for a higher number of errors, as this would only obscure the correct alignments through an increased number of false positives.

Figure 3 Alignment missmatches with the reference genome resulting from bisulfite conversion.

To further complicate the issue, bisulfite conversion of DNA fragments results in two strands that are no longer complementary to each other. In paired-end sequencing, this means that four distinct sequences are now present, each one varying to some degree from the original DNA (Figure 4). From the original, untreated Watson (+) strand, the first mate pair is the direct bisulfite-converted variant (+FW), and the second is the reverse complement of this (+RC). From the original, untreated Crick (-) strand, the first mate pair is the direct bisulfite-converted variant (-FW), and the second is the reverse complement of this (-RC). In standard DNA sequencing, it is simply the case that the second mate-pair obtained from one strand aligns to the other strand, but in bisulfite sequencing, this no longer holds. In BS-Seq libraries, this is compounded further because the method already encompasses all four-strand variants, even in single-end sequencing. In this case, the directionality relative to the strand (indicated by the box arrows in Figure 4) is thus lost, which is why it is sometimes referred to as an unstranded bisulfite sequencing protocol.

Figure 4 Strand specific point mutations in the new synthesized strand resulting from bisulfite conversion of unmethylated cytosins.

One potential solution for this alignment problem is to adjust the scoring matrix so that a mismatch of thymine to cytosine is instead treated as a match. This can be implemented in standard sequence aligners by "collapsing" the genetic alphabet in both the read and the reference genome so that all cytosines are rewritten as thymines (Figure 5a). Mapping is then performed normally, and the methylated positions are retrieved through post-processing based on the composition of the pre-collapsed sequences. However, this procedure results in two undesirable scenarios that make no sense conceptually: 1) true thymines in the read match with cytosines in the reference genome, and 2) cytosines from the read (indicating methylated positions) match thymines in the reference genome. Such a solution undoubtedly produces many false positives and obscures the correct read alignments. Bisulfite read aligners such as Bismark [Kreuger 2011], BSmooth [Hansen 2012] (in bowtie2 mode), BS-Seeker [Chen 2010, Guo 2013], and BWA-meth [Pedersen 2014] follow this strategy.

A better strategy would be to allow matches between thymines and cytosines, but only between read-based thymines and reference-based cytosines (not vice versa). In this case, methylated cytosines are correctly considered mismatched with thymines in the reference genome, thus reducing false positives (Figure 5b). However, it is still possible for true read-based thymines to match incorrectly with reference-based cytosines (see scenario 1 in the last paragraph). This asymmetric base scoring is not easy to implement in most index structures (e.g., Burrows-Wheeler transform, suffix arrays) used in standard sequence aligners. Therefore specialized read alignment software is required that is explicitly designed for bisulfite sequencing. Such tools include BSMAP [Xi 2009], BSmooth [Hansen 2012] (in merman mode), and ERNE-BS5 [Prezza 2012]. The specialized read aligner segemehl [Otto 2012] uses collapsing (Figure 5a) during a starting step and then changes to asymmetric matching (Figure 5b) in the following. A common drawback of these methods is the increased memory consumption and processing time relative to tools that rely on a collapsed alphabet.

Figure 5 Collapsed alphabet versus assymmetric matchning.

Whichever approach is followed, the entire process likely has to be repeated to account for both C-T conversions and G-A conversions due to the aforementioned loss of complementarity between a given sequence complement and the opposite strand (figure). The possibility of four distinct sequences in bisulfite sequencing, as opposed to two in standard sequencing, dictates that two distinct variants of the reference genome are required to resolve the best alignments. These two alignment procedures may either be run in parallel, as is the case in Bismark [Kreuger 2011], or consecutively as is the case in segemehl [Otto 2012]. Still, the point is that they are interlinked with each other and treated as one.

One problem that might arise during sequencing is the potential for genomic rearrangement [Saxena 2014] that may have occurred in the test sample relative to the reference genome due to evolution. Particularly with reads that originate from a locus that has translocated to the opposite strand. In this case, the strand-specificity is inverted, and the aligner may attempt to map the read to the wrong strand. These false-stranded reads can be detected by the high proportion of G-A mismatches on the Watson (+) strand or C-T mismatches on the Crick (-) strand. In practical terms, a threshold of 3-5% regarding the length of the read is usually enough to identify false-stranded reads. As there is a high probability that these reads originate elsewhere in the genome, they are usually excluded from the alignment because it is not a trivial task to infer where the locus may have translocated to [Onishi-Seebacher 2011].

Filtering can also be applied based on several other factors, in a study-dependant manner, prior to methylation calling. Any spike-in of Lambda or Phi X DNA can be filtered easily in the case of multiplexing with alternative sequencing adaptors, based on the different sequencing tags or it is not indexed at all and therefore automatically filtered out during de-multiplexing. In the case of Lambda, the alignment indexes must be generated to contain both the sample genome and the Lambda phage genome. Any read alignments to the Lambda genome can therefore be filtered out. When splitting the alignment files in this manner, it is essential to remove any reference to the Lambda genome from the file header if there is any intention to view the alignment file with a genome browser (e.g., IGV, JBrowse).

Finally, filtering based on multi-mapped reads or PCR duplicates may also be considered, like in standard sequencing experiments. Any given read should conceptually originate from a single position on the genome. Still, equally-scoring alignments may be possible, particularly in highly repetitive regions or regions of low complexity [Treangen 2011]. If each of these alignments is considered separately during methylation calling, then the statistical power is skewed, especially if the methylation is not the same between these regions. The options are to exclude these alignments entirely, as is performed intrinsically by some sequence aligners [Krueger 2011, Guo 2013], to select one such alignment at random, or to accept the reduction in statistical power to retain the information from that read. Regarding PCR duplicates, several tools exist to identify such reads that arise from a single DNA fragment, such as Picard MarkDuplicates [Broad Institute] and samtools rmdup [Li 2009]. These tools identify PCR duplicates based on the proportionally higher likelihood that identical reads arise from PCR, then that they are separate fragments. In this case, such reads are counted only once during methylation calling. However, deduplication works only in WGBS sequencing projects. If PCR is absent or negligible during library preparation, however, this step should be avoided.

Methylation Calling

To detect the level of DNA methylation at any given cytosine position within the test sample, the reads overlapping that position are evaluated to give the proportion of the coverage that is methylated bases (cytosine) versus the sum of methylated bases (cytosine) and unmethylated bases (thymine). In practice, there is only one major divergence in the implementation of this method by different tools: what to do with C-T polymorphisms that arise before bisulfite treatment has taken place. Most tools simply ignore SNPs. If the consensus indicates anything other than a cytosine or a thymine in a cytosine position on the reference genome, then the entire position is often excluded from analysis. It is not necessarily accurate to consider such an SNP as evidence against DNA methylation, particularly in the case of adenine which may also be methylated [Ratel 2006].

A more robust analysis will recognise also that not all read-based thymines overlapping reference-based cytosine positions are representative of the bisulfite treatment [Liu 2012]. A mutation in this context is deceiving, as we might interpret it as demethylation resulting from an epigenetic mechanism rather than a genetic one. The only way to identify such a mutation from the data itself, rather than from independent genotyping of the test sample, is to compare the overlapping reads in that position to those on the opposite strand. The DNA strands after treatment with sodium bisulfite are no longer complementary to each other since the methylation information is strand-specific. Artificially converted bases should therefore always lack a complementary base on the opposite strand, whereas a C-T mutation will have support on the opposing strand. This difference can be used to exclude such base positions from the analysis; the independent testing of both strands in this manner however would suggest that twice the coverage be required to achieve a similar degree of confidence to standard genotyping.

Consideration should also be made when using PE sequencing to regions that overlap between read pairs. Generally each read pair will start from opposing ends of a single DNA fragment, and it is possible in cases where the fragment size is less than double the number of cycles during Illumina sequencing that a part of the fragment will be sequenced twice for the same read pair [Magoč 2011]. In this case, methylation information in this overlap region is redundant and does not constitute an independent observation of the methylation status as if they arose from separate DNA fragments. It is therefore wise to identify overlapping regions prior to methylation calling and mask those bases from either one of the two reads of a pair.

Last updated