Here we show the most basic steps for a differential methylation analysis. There is one major step upstream of current analysis that generates mapped reads in BAM files for each sample, which can be done by varies choices of softwares such as Hisat2, tophat2 or Kallisto. we will discuss in the sections below assuming BAM file has been obtained.
This code chunk assumes that you have bam files for sample1-6 s1.input.bam... s6..input.bam
and s1.m6A.bam... s6..m6A.bam
.
lazy pipeline not implemented yet…
To analyze for MeRIP-seq data, MeRIPtools expect a pair of BAM file for each sample. For each pair of BAM file, the naming convention for MeRIPtools is sample.input.bam
for INPUT and sample.m6A.bam
for m6A IP or sample.m1A.bam
for m1A IP sample. They key is INPUT and IP sample have the same prefix. The mid-name for IP sample can be specified in the countReads
function and thus can be anything.
For the organism being analyzed, user need to provide an annotation file in gtf format to define the genomic coordinate of gene features. A good source to download those supporting files are iGenome. Alternatively, the GENECODE is also a good source for annotation if you are interested in human and mouse genome.
The MeRIPtools uses a S4 class system to manage the data involved in the analysis, please see Manual page for more details. The countReads()
function takes BAM files and gtf annotation as inputs and initializes a MeRIP class object by counting the reads in consecutive bins of size set by the user.
First, let’s prepare input data.
library(MeRIPtools)
samplenames <- c("4582WT","4584WT","4587WT","4612WT","4583KO","4586KO","4615KO","4616KO")
bam_dir <- "~/METTL14_liver2017/bam_files"
The variable samplenames
correspond to the prefix of file names in the bam_dir
. We can check the files in the bam_dir
:
list.files(bam_dir)
## [1] "4582WT.input.bam" "4582WT.input.bam.bai" "4582WT.m6A.bam"
## [4] "4582WT.m6A.bam.bai" "4583KO.input.bam" "4583KO.input.bam.bai"
## [7] "4583KO.m6A.bam" "4583KO.m6A.bam.bai" "4584WT.input.bam"
## [10] "4584WT.input.bam.bai" "4584WT.m6A.bam" "4584WT.m6A.bam.bai"
## [13] "4586KO.input.bam" "4586KO.input.bam.bai" "4586KO.m6A.bam"
## [16] "4586KO.m6A.bam.bai" "4587WT.input.bam" "4587WT.input.bam.bai"
## [19] "4587WT.m6A.bam" "4587WT.m6A.bam.bai" "4612WT.input.bam"
## [22] "4612WT.input.bam.bai" "4612WT.m6A.bam" "4612WT.m6A.bam.bai"
## [25] "4615KO.input.bam" "4615KO.input.bam.bai" "4615KO.m6A.bam"
## [28] "4615KO.m6A.bam.bai" "4616KO.input.bam" "4616KO.input.bam.bai"
## [31] "4616KO.m6A.bam" "4616KO.m6A.bam.bai"
It is okay if your folder don’t have xxx.bai
files. These are BAM file index, which can be generated by the MeRIPtools.
The next step is to count reads in consecutive bins. Usually we do 50 bp bins for typical library of 20-30 Million reads. If you have deeper coverage, a smaller bin size could increase resolution of the analysis. However, for libraries of shallow coverage, larger bin size is recommended because the smaller the bin size, the less reads are sampled in a bin and therefore the larger sampling error is encountered.
MeRIP <- countReads(samplenames = samplenames,
gtf = "~/Database/genome/GRCm38.p6/gencode.vM18.annotation.gtf",
bamFolder = bam_dir,
outputDir = NA,
modification = "m6A",
strandToKeep = "opposite",
fragmentLength = 150,
binSize = 50,
threads = 20,
saveOutput = FALSE
)
Note 1 The parameter strandToKeep
defines the strand of read to count with respect to the transcript strand. Usually, Illumina Truseq library kit or other similar protocol resulted in sequencing opposite strand of the RNA molecules in read1; therefore, default is “opposite” for this parameter. However, for small RNA library protocol, the sense strand of the RNA molecule is sequenced as read1; thus “same” should be set to this parameter.
Note 2 The parameter fragmentLength
defines the average length of the RNAs (insert size of the library). This parameter will affect the location of the reads to be counted. This parameter can be learned from data for paired-end sequencing data.
A summary of the MeRIP object can be printed by just call the variable name (“MeRIP” in this example). Or use the summary(MeRIP)
function.
MeRIP
## MeRIP dataset of 8 samples.
## The total read count for Input and IP samples are (Million reads):
## 4582WT 4584WT 4587WT 4612WT 4583KO 4586KO 4615KO 4616KO
## Input 18.30 15.90 16.00 16.35 20.18 16.77 15.47 17.81
## IP 18.67 21.54 19.83 20.06 18.82 21.43 20.44 18.06
##
## Peak calling done by Binomial test.
## 40186 joint peak reported at threshold 2 (requiring peaks to be called in at least 2 samples).
## Input gene level count available.
## There are 1 predictor variables/covariates. Can access by function variable(MeRIPdata).
## [1] "Differential peaks tested by QNB test.\n"
After obtaining read count in consecutive bins, we can perform peak calling on each bin of each sample. This step will result in a matrix of logic value to indicate whether each bin is called a peak or not. The result will be stored in a MeRIP.Peak object and returned.
Currently, two methods were implemented in MeRIPtools to call peak.
(1) Binomial test This method models the total read count (Input + IP) and expected ratio of read count in IP as binomial distribution and test the probability of observing the total read count and IP read count where expected ratio \(p\) is given by the library size (total coverage) of IP/(Input+IP).
MeRIP <- callPeakBinomial(MeRIP = MeRIP, min_counts = 15, peak_cutoff_fdr = 0.05, peak_cutoff_oddRatio = 1, threads = 20)
The parameter min_counts
set the minimal number of total reads (Input + IP) presented in a bin to call this bin peak. The parameter peak_cutoff_fdr
set the FDR threshold (correct for multiple tests for bins in each gene) to call a bin peak. The parameter peak_cutoff_oddRatio
set the IP/Input ratio threshold to call a bin peak.
(2) Fisher’s exact test This method test a contingency table of read count in IP and gene-wise median read count in IP as well as read count in Input and gene-wise median read count in Input. Basically, it test for the deviation of read count distribution in IP compare to expectation given by the Input sample. This method assumes that only a small segment of a gene is enriched by the IP experiment as gene-wise median was used as a control. When the whole is enriched (e.g. very short transcript or widely methylated genes), this test would fail to detect enrichment.
MeRIP <- callPeakFisher(MeRIP = MeRIP, min_counts = 15, peak_cutoff_fdr = 0.05, peak_cutoff_oddRatio = 1, threads = 20)
The parameters are the same as previous method except for the oddRatio, which is represented by the (IP/gene-median-IP)/(Input/gene-median-Input).
Although peak calling has been perform for each samples and there are variations of peaks called across samples. However, it is not a good idea to directly compare across samples (groups) using the peak calling result (qualitative info) to infer changes of methylation status between groups/conditions (e.g. define emerged/lost peaks). There are two major drawbacks of such analysis: (1) some peak calling result variation actually reflect boundary cases where peak is called in one group with FDR = 0.049 while peak will not be called in another group with FDR = 0.051. However, the methylation level are not really different in these two group as the emerged/lost peak seemingly suggested. (2) peak could be call in one group but not another due to differential gene expression. Low expression of a gene in one group could resulted in total read counts lower than the minimal required read count for peak calling and thus not called as a peak. However the IP/Input ratio of this groupd could be roughly the same as another group.
Thus, we recommand using quantitative analysis to search for differential methylated regions (peaks). On one hand, we are interested in peaks that are vary across samples, thus we want to make differential test on the union of peaks. On another hand, we don’t want to include peaks that are called only one but not any other samples because these peaks are likely false signals. Thus, we defined a joint peak that requires peak to be called in at least 2 (or any other threshold the user defines) samples for comparative analysis. The joint peaks is an representation of the union of peaks in samples analyzed.
MeRIP <- reportJointPeak(MeRIPdata = MeRIP, joint_threshold = 2, threads = 10)
Now we have joint peaks stored in the MeRIP.Peak object returned from reportJointPeak()
function. We can get the joint Peak by jointPeaks(MeRIP)
.
head( jointPeak(MeRIP) )
## chr start end name score strand thickStart
## 1 chr3 108108911 108109157 ENSMUSG00000000001.4 0 - 108108911
## 2 chr3 108109207 108109256 ENSMUSG00000000001.4 0 - 108109207
## 3 chr3 108123800 108146097 ENSMUSG00000000001.4 0 - 108123800
## 4 chr7 142577946 142577995 ENSMUSG00000000031.16 0 - 142577946
## 5 chr11 121237303 121242550 ENSMUSG00000000056.7 0 + 121237303
## 6 chr11 121252834 121253082 ENSMUSG00000000056.7 0 + 121252834
## thickEnd itemRgb blockCount blockSizes blockStarts
## 1 108109157 0 1 246 0
## 2 108109256 0 1 49 0
## 3 108146097 0 2 38,210 0,22088
## 4 142577995 0 1 49 0
## 5 121242550 0 3 49,99,50 0,1116,5198
## 6 121253082 0 1 248 0
After obtaining the joint peaks, we will need to summarize the read count in peaks for Input and IP.
MeRIP <- jointPeakCount(MeRIP)
This step is to prepare read count data for differential methylation test. If you are interested in peak read count data, you can extract it by:
Input_count <- extractInput(MeRIP)
IP_count <- extractIP(MeRIP)
dim(Input_count)
## [1] 40186 8
Usually, we are interested in compare samples across experiment conditions or genotypes. We implemented three alternative tests in MeRIPtools: RADARtest
using a random effect model, QNBtest
using quad-negative binomial model and BetaBinTest
using beta-binomial regression model. Before we run the test, we need to set the variable associated with each samples. Predictor variable and potential covariates are inputed as a data.frame.
variable(MeRIP) <- data.frame( genotype = c("WT","WT","WT","WT","KO","KO","KO","KO") )
## if there is covariates
variable(MeRIP) <- data.frame( genotype = c("WT","WT","WT","WT","KO","KO","KO","KO"), covariate1 = c(...), covariate2 = c(...), ... )
QNB test was introduced by Liu et al 2017 BMC Bioinformatics. Here we implemented a wrapper function to call QNB test. Note this test is designed for small sample size scenario and cannot incorporate covariates. Thus I recommond using this test for cell line experiment where only 2-3 technical/biological replicates are available.
MeRIP <- QNBtest(MeRIP)
To be add… #### (3) BetaBinTest To be add…
Finally, we want to check the joint peaks and associated test statistics. The function results()
will extract the joint peaks and associated differential test statistics.
allPeakResult <- results(MeRIP)
head(allPeakResult)
## chr start end name score strand thickStart
## 1 chr3 108108911 108109157 ENSMUSG00000000001.4 0 - 108108911
## 2 chr3 108109207 108109256 ENSMUSG00000000001.4 0 - 108109207
## 3 chr3 108123800 108146097 ENSMUSG00000000001.4 0 - 108123800
## 4 chr7 142577946 142577995 ENSMUSG00000000031.16 0 - 142577946
## 5 chr11 121237303 121242550 ENSMUSG00000000056.7 0 + 121237303
## 6 chr11 121252834 121253082 ENSMUSG00000000056.7 0 + 121252834
## thickEnd itemRgb blockCount blockSizes blockStarts p.treated p.control
## 1 108109157 0 1 246 0 0.6107835 0.7072817
## 2 108109256 0 1 49 0 0.1892661 0.2908680
## 3 108146097 0 2 38,210 0,22088 0.4098838 0.3320385
## 4 142577995 0 1 49 0 0.2791100 0.5723506
## 5 121242550 0 3 49,99,50 0,1116,5198 0.4373165 0.6007194
## 6 121253082 0 1 248 0 0.2459933 0.4964980
## log2.RR beta p_value q padj
## 1 -0.2116238 -0.6226838 0.107213503 733.70257 0.9026464
## 2 -0.6199482 -0.8131224 0.480945117 59.62527 0.9895912
## 3 0.3038641 0.4826298 0.383857934 297.92652 0.9895912
## 4 -1.0360652 -1.7894158 0.399415869 11.77369 0.9895912
## 5 -0.4580136 -0.9529343 0.132806105 193.58893 0.9466131
## 6 -1.0131691 -1.5957490 0.007498819 345.49632 0.4312254
The column beta
is the coefficient associated with the predictor variable. The interpretation is log2-fold change in QNBtest
(log fold change RADARtest and BetaBinTest). The p_value
is the P value for the differential test of each peak. padj
is P value adjusted for multiple testing, which can be interpreted as false discovery rate (FDR).
To filter for significantly differentially methylated peaks, we usually use a FDR < 10% cutoff to filter the result:
diff_peaks <- allPeakResult[allPeakResult$padj < 0.1,]
dim(diff_peaks)
## [1] 688 19
There are 688 peaks differentially methylated peaks tested by QNB in this example. You can save the result to a file by:
write.table(diff_peaks, "Path/to/file/location.xls", sep = "\t", col.names = T, row.names = F, quote = F)
If you are only interested in peak calling and want to obtain peaks consistent across some samples, you can report consistent peak after step 2 d) in previous section.
For example, I’m only interested in peaks that are consistent in the first two samples:
consistentPeak1 <- consistentPeak(MeRIP, samplenames = c("4582WT","4584WT"), joint_threshold = 2)
Or I can report peaks that are consistent in at least 3 samples among 4 samples indicated by setting the joint_threshold=3
:
consistentPeak2 <- consistentPeak(MeRIP, samplenames = c("4582WT","4584WT","4587WT","4612WT"), joint_threshold = 3 )
Warning This function is intended to quality check peaks called in subset of samples or to just report a peak file. It is strongly not recommended to use this function to report peaks for two groups and compare the two peak files to infer variation between groups.
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 17.10
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats4 parallel stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] MeRIPtools_0.1.0 qvalue_2.14.0
## [3] RcppArmadillo_0.9.200.5.0 Rcpp_1.0.0
## [5] reshape2_1.4.3 GenomicAlignments_1.18.0
## [7] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
## [9] BiocParallel_1.16.2 matrixStats_0.54.0
## [11] rtracklayer_1.42.1 doParallel_1.0.14
## [13] iterators_1.0.10 foreach_1.4.4
## [15] ggplot2_3.1.0 Rsamtools_1.34.0
## [17] Biostrings_2.50.1 XVector_0.22.0
## [19] GenomicFeatures_1.34.1 AnnotationDbi_1.44.0
## [21] Biobase_2.42.0 GenomicRanges_1.34.0
## [23] GenomeInfoDb_1.18.1 IRanges_2.16.0
## [25] S4Vectors_0.20.1 BiocGenerics_0.28.0
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.3-2 rprojroot_1.3-2 htmlTable_1.12
## [4] base64enc_0.1-3 rstudioapi_0.8 QNB_1.1.11
## [7] bit64_0.9-7 codetools_0.2-15 splines_3.5.0
## [10] R.methodsS3_1.7.1 geneplotter_1.60.0 knitr_1.20
## [13] Formula_1.2-3 workflowr_1.1.1 broom_0.5.0
## [16] annotate_1.60.0 cluster_2.0.7-1 R.oo_1.22.0
## [19] gamlss.dist_5.1-0 compiler_3.5.0 httr_1.3.1
## [22] backports_1.1.2 assertthat_0.2.0 Matrix_1.2-15
## [25] lazyeval_0.2.1 acepack_1.4.1 htmltools_0.3.6
## [28] prettyunits_1.0.2 tools_3.5.0 bindrcpp_0.2.2
## [31] gtable_0.2.0 glue_1.3.0 GenomeInfoDbData_1.2.0
## [34] dplyr_0.7.8 ape_5.2 nlme_3.1-137
## [37] pinfsc50_1.1.0 stringr_1.3.1 XML_3.98-1.16
## [40] zlibbioc_1.28.0 MASS_7.3-51.1 scales_1.0.0
## [43] BSgenome_1.50.0 hms_0.4.2 gamlss.data_5.1-0
## [46] RColorBrewer_1.1-2 yaml_2.2.0 memoise_1.1.0
## [49] gridExtra_2.3 biomaRt_2.38.0 rpart_4.1-13
## [52] latticeExtra_0.6-28 stringi_1.2.4 RSQLite_2.1.1
## [55] genefilter_1.64.0 permute_0.9-4 checkmate_1.8.5
## [58] exomePeak_2.16.0 rlang_0.3.0.1 pkgconfig_2.0.2
## [61] bitops_1.0-6 evaluate_0.12 lattice_0.20-38
## [64] purrr_0.2.5 bindr_0.1.1 Guitar_1.20.0
## [67] htmlwidgets_1.3 bit_1.1-14 tidyselect_0.2.5
## [70] plyr_1.8.4 magrittr_1.5 DESeq2_1.22.1
## [73] R6_2.3.0 Hmisc_4.1-1 DBI_1.0.0
## [76] mgcv_1.8-26 pillar_1.3.0 foreign_0.8-71
## [79] withr_2.1.2 survival_2.43-3 RCurl_1.95-4.11
## [82] nnet_7.3-12 tibble_1.4.2 crayon_1.3.4
## [85] gamlss_5.1-2 vcfR_1.8.0 rmarkdown_1.10
## [88] progress_1.2.0 locfit_1.5-9.1 data.table_1.11.8
## [91] vegan_2.5-3 blob_1.1.1 digest_0.6.18
## [94] xtable_1.8-3 tidyr_0.8.2 R.utils_2.7.0
## [97] munsell_0.5.0 viridisLite_0.3.0