Epitranscriptome profiling using MeRIP-seq is a powerful technique for in vivo functional studies of reversible RNA modifications. We developed RADAR, a comprehensive analytical tool for detecting differentially methylated loci in MeRIP-seq data. RADAR enables accurate identification of altered methylation sites by accommodating variability of pre-immunoprecipitation expression level and post-immunoprecipitation count using different strategies. In addition, it is compatible with complex study design when covariates need to be incorporated in the analysis.
To analyze the count data of MeRIP-seq, we divide the transcript (concatenated exons of a gene) into consecutive bins and quantify the reads align to each bin. The summarized count data is in matrix format where each row represents a locus and each column is a sample. Statistical inference of changes cross groups, as compared to within-group variability is made for each filtered locus (bin) with sufficient count.
RADAR achieves improved performance mainly by the following advancements: (1) RADAR uses gene-level isntead of peak-level read count as an robust representation of pre-IP gene expression level. This significantly reduces the noise of the data. (2) The generalized linear model framework used in RADAR is compatible with complex study design so that biological and experimental covariates can be accounted for in differential methylation analysis. (3) To accomodate extra overdispersion due to immunoprecipitation, RADAR implemented a random effect model that is flexible on mean-variable relationship.
Depends: Rsamtools, GenomicFeatures (>= 1.14.5), BH, Rcpp, ggplot2, doParallel, foreach, gplots, RColorBrewer, RcppArmadillo
install.packages("devtools")
library(devtools)
install_github("scottzijiezhang/RADAR")
library("RADAR")
The installation has been tested on R version 3.4.4 and R version 3.5.0 and 3.5.1 on Linux and Mac.
To access the example data:
bam.path <- system.file("extdata",package = "RADAR")
list.files(bam.path)
## [1] "Case1.input.bam" "Case1.m6A.bam" "Case2.input.bam"
## [4] "Case2.m6A.bam" "Case3.input.bam" "Case3.m6A.bam"
## [7] "Case4.input.bam" "Case4.m6A.bam" "Case5.input.bam"
## [10] "Case5.m6A.bam" "Case6.input.bam" "Case6.m6A.bam"
## [13] "Case7.input.bam" "Case7.m6A.bam" "Case8.input.bam"
## [16] "Case8.m6A.bam" "Ctl1.input.bam" "Ctl1.m6A.bam"
## [19] "Ctl2.input.bam" "Ctl2.m6A.bam" "Ctl3.input.bam"
## [22] "Ctl3.m6A.bam" "Ctl4.input.bam" "Ctl4.m6A.bam"
## [25] "Ctl5.input.bam" "Ctl5.m6A.bam" "Ctl6.input.bam"
## [28] "Ctl6.m6A.bam" "Ctl7.input.bam" "Ctl7.m6A.bam"
## [31] "Ctl8.input.bam" "Ctl8.m6A.bam" "Ctl9.input.bam"
## [34] "Ctl9.m6A.bam"
Since MeRIP-seq generate a pair of sequencing data – Input (regular RNA-seq) and IP (immunoprecipitation), we denote these two libraries by .input
or .<modification>
followed by .bam
after sample names. Thus, the RADAR requires a pair of BAM files for each sample named as <samplename>.input.bam
+ <samplename>.<modification>.bam
. The <modification>
can be specified in the countRead(... , modification = "<modification>")
function. In the example data where we are interested in m6A modification, the files are named as : <samplename>.input.bam
and <samplename>.m6A.bam
. All BAM files to be processed and analyzed together should be saved in one directory, the path to which needs to be specified in the countRead(bamPath = ".../.../...")
.
MeRIPtools used the S4 object system to manage the data involved in processing MeRIP-seq data. We defined a class MeRIP
that stores the basic informations about a MeRIP-seq experiment (e.g. path to BAM files, read count matrix, gene model used to quantify reads etc.).
We also defined classes MeRIP.RADAR
that inherits MeRIP
class and stores intermediate data and result involved in RADAR pipelines. MeRIP.RADAR
is designed for peak-calling free differential methylation analysis pipeline as described in this documentation site.
Data slots in MeRIP
class:
getSlots("MeRIP")
## reads binSize geneModel gtf bamPath.input
## "matrix" "numeric" "GRangesList" "character" "character"
## bamPath.ip samplenames geneBins geneSum GTF
## "character" "character" "data.frame" "matrix" "GRanges"
reads
Numeric matrix data for read count in Input and IP library. The columns are sample names and rows are consecutive bins of user defined size (defaul is 50bp).
binSize
Interger specifying the bin size used to divide the transcript.
geneModel
A GRangesList that stores the gene model that has been filtered for ambiguous chromosome. Each element of the list contains a GRanges object specifying exons of a gene.
gtf
The path to the gtf annotation file used to obtain gene model.
bamPath.input
The path to input BAM files.
bamPath.IP
The path to IP BAM files.
samplenames
The prefix used for naming Input and IP BAM files (as introduced in naming convention), which specifies the name of a sample.
geneBins
Data.frame mapping each consecutive bins to the corresponding gene.
geneSum
Numeric matrix data for gene level read count in the Input library. This is equivalent to the read count matrix commenly used in RNA-seq differential expression analysis.
GTF
The gtf annotation in GRanges format.
Data slots in MeRIP.RADAR
class:
getSlots("MeRIP.RADAR")
## norm.input norm.ip sizeFactor
## "matrix" "matrix" "data.frame"
## variate ip_adjExpr ip_adjExpr_filtered
## "data.frame" "matrix" "matrix"
## test.est test.method fdr.method
## "matrix" "character" "character"
## reportBin.fdr reportBin.logFoldChange mergedBins
## "numeric" "numeric" "data.frame"
## reads binSize geneModel
## "matrix" "numeric" "GRangesList"
## gtf bamPath.input bamPath.ip
## "character" "character" "character"
## samplenames geneBins geneSum
## "character" "data.frame" "matrix"
## GTF
## "GRanges"
norm.input
The normalized Input read count matrix.
norm.ip
The normalized IP read count matrix.
sizeFactor
The estimated size factor for both Input and IP.
variate
This data.frame store the predictor variable (the first column) used for inferential test and covariates (the rest of columns) to be included.
ip_adjExpr
The IP read count matrix that have adjusted for the gene expression level.
ip_adjExpr_filtered
The IP read count matrix that have adjusted for the gene expression level and then filtered out low coverage bins.
test.est
Parameter estimation by differential test either by RADARtest (PoissonGamma test), QNB (quad-negative binomial test) or Beta-binomial test.
test.method
The method used to make inferential test (character).
mergedBins
The final result in BED12 format with additional columns for test statistics. This table are obtained by merginng the neighbouring significantly differentially methylated bins.
This is the very first function in MeRIP-seq data analysis that initianize a MeRIP
object. This function takes BAM files of Input/IP library of each samples as input and use given GTF file as gene annotation to divide genes into consecutive bins of user defined size.
countReads(
samplenames,
gtf,
fragmentLength = 150,
bamFolder,
outputDir=NA,
modification = "m6A",
binSize = 50,
strandToKeep = "opposite",
paired = FALSE,
threads = 1,
saveOutput = T
)
Input parameters:
The samplenames
should be a vector of charaters that are the same as the prefix name of input BAM files. BAM files should be named as <samplename>.input.bam
and <samplename>.<modification>.bam
.
The gtf
should be a (charater) path to the gft annotation file.
The fragmentLength
is the average length of RNA fragment. Default is 150bp.
The bamFolder
should be a (charater) path to the dirctory where input BAM files are located.
The outputDir
denote the path to save the read count result in RDS format.
The modification
is a character denoting the mid-name of BAM files for IP sample. Default is “m6A”, where countReads
function expect to find <samplenames>.m6A.bam
in the bamPath
directory.
The binSize
is an integer indicating the size of bin to divide the transcript.
The strandToKeep
parameter set the strand specificity when counting the reads. Default is “opposite” for reverse-stranded library construction protocol, which is the most common one seen in RNA-seq. “same” denotes the sense-stranded library construction protocol, commonly seen in small RNA library kit. This parameter can also be set to “both” to ignore strand when counting the reads.
The paired
is a logic parameter. Default is FALSE for single-end sequencing data. TURE for pair-end sequencing data where fragment length is calculated from the read pair instead of using the user provided parameter.
The threads
set the number of threads to use for multi-core processing.
The saveOutput
is logic parameter denoting whether read count result should be saved in addition to returning the output to the R environment.
Output data:
This function will return a MeRIP class object that stores the read count in bins for each samples.
Normalize the library size of Input and IP library of the MeRIP-seq data. This is the first step of data preprocessing after read count matrix is obtained.
normalizeLibrary(object, boxPlot)
Input parameters:
The object
is a MeRIP object. The boxPlot
is a logical parameter that defines whether to generate the boxplot of bin read counts comparing before and after normalization.
Output data:
This function will return a MeRIP class object that includes normalized bin level read counts and Input gene-level read counts.
Compute the IP read counts that are adjusted for gene expression level as represented by Input read counts.
adjustExprLevel(object, adjustBy)
Input parameters:
The object
is a MeRIP object which has run library normalization and gene expression level adjustment processing.
The adjustBy
set the denominator for adjusting the gene expression level; can be “geneSum” or “pos”. “geneSum” adjust the gene expression level by the gene-level read counts of Input library. “pos” adjust the gene expression level by the local (bin-level) read counts of Input library. Default is “geneSum”.
Output data:
This function will return a MeRIP class object that includes IP read counts adjusted for (normalized by) the gene expression level.
Filter out low-coverage bins that doesn’t have sufficient information for reliable inferential test and bins where IP coverage is lower than the Input coverage.
filterBins(object, minCountsCutOff)
Input parameters:
The object
is a MeRIP object which has run library normalization and gene expression level adjustment processing.
The minCountsCutOff
set the minimal IP read count required in each bin for inferential test.
Output data:
This function will return a MeRIP class object that includes filtered IP read counts adjusted for (normalized by) the gene expression level, which will can be then tested by random effect regression model.
Run PoissonGamma random effect regression model to test the differential methylation cross experimental group.
diffIP(object, exclude = NULL, maxPsi, fdrBy)
diffIP_parallel(object, exclude = NULL, maxPsi, fdrBy, thread)
Input parameters:
The object
should be a MeRIP object that is returned from the countRead()
function and other preprocessing functions.
The exclude
is a optional parameter setting the samples to be excluded from the inferential test. Default is NULL. If some sample are to be excluded from the inferential test (e.g. technical problem was found in certain sample), one can indicate which samples are to be excluded by a character vector denoting the sample names.
The maxPsi
set the max random effect parameter allowed in the fitting. Default is 100, which is roughly 1/100 variance.
The fdrBy
set the FDR method to compute adjusted p-value. Default is “qvalue” for Storey’s qvalue method, can also be “fdr” for Benjamini & Hochberg method.
The thread
set the number of threads to use when use diffIP_parallel
.
Output data:
This function returns the MeRIP.RADAR
object with test statistics saved in it.
This function aims to merge neighboring significant bins that are attributed to the same m6A locus. After merging connected significant bins, this function also retrieve the genomic location of the bin and save the differential peaks in BED12 format with addtion columns for the test statistics.
reportResult(object , cutoff = 0.1, Beta_cutoff = 0.5, threads = 1)
Input parameters:
The object
should be a MeRIP object that is returned from the countRead()
function and other preprocessing functions.
The cutoff
is number that indicate the FDR cutoff to select the significant differential methylated bins, default is 0.1 or 10% FDR.
The Beta_cutoff
is the log fold change cutoff for selecting significant differential methylated bins, default is 0.5 log fold change.
The threads
set the number of thread(s) to use for this process.
Output data:
This function returns the MeRIP.RADAR
object with final differential peak result saved in it.
Extract final result table as in BED12 format denoting the location of the differential peaks with additional columns for the test statistics.
results(object)
Input parameters:
The object
is a MeRIP object with preprocessing, test and postprocessing steps executed.
Output data:
This function returns a data.frame where the first 12 columns are BED12 format columns and the rest of columns for the test statistics.
Import GTF file as gene annotation from the path stored in the MeRIP object.
PrepCoveragePlot(object, gtf)
Input parameters:
The object
is a MeRIP object with preprocessing steps executed.
The gtf
is an optional parameter that define the path to the GTF file. A path to GTF file was stored in the MeRIP object when the object is initiated by the countRead()
function. However, if the user want to update the path to the GTF file, this parameter can be set to the new path to the GTF file. Default is NULL.
Output data:
This function returns a MeRIP.RADAR class object with GTF gene annotation imported as GRange object.
Generate coverage plot for a given gene or locus. Can plot (library size) normalized mean/median coverage for each experimental group or plot each individual sample in a panel.
plotGeneCov(object, geneName, libraryType, center, ZoomIn, adjustExprLevel, split)
Input parameters:
The object
is a MeRIP object with GTF annotation imported.
The geneName
should be a character defining the gene to be plotted. The gene name correspond to the ‘gene_name’ tag in the GTF files. For example, UCSC GTF annotation files use gene symbol, e.g. “METTL3”; Ensembl GTF files use Enembl gene ID, e.g. “ENSG00000165819.11”.
The libraryType
set the strandness of the library. Can be “same” or “opposite”. Most mRNA library protocol (e.g. TruSeq, Kapa mRNA stranded library kit, Takara total RNA pico V2) generate reverse complementary strand reads and thus should use paprameter “opposite” (Default). Small RNA library (RNA ligation based) protocol usually generate reads of the same strand as RNA molecules and thus should use parameter “same”.
The center
set how coverage of samples in the same experimental group will be combined. Can be mean
or median
. This option will not have effect if split
is set TRUE.
The ZoomIn
is a integer vector of two element setting a genomic range in the given gene to be plotted. e.g. c(12345600, 12345700)
. Note this range need to be within the gene to be plotted as set in geneName
parameter.
The adjustExprLevel
is logical option. If set TRUE, the coverage of Input and IP will be normalzied to Input coverage of that gene so that the coverage of IP is conditioned on gene expression level and directly comparable across group. If set FALSE, the plotted coverage will be only normalized to adjust sequencing library size.
The split
is a logical option. If set TRUE, coverage of each individual will be plotted in a grid panel. If set FALSE, the coverage of samples will be combined according to the experimental group.
Output data:
This function generates a coverage plot of one panel if split = FALSE
or multiple panel if split = TRUE
.
Extract variable(s) matrix of the experiment. Or set the variable(s) of the experiment.
variable(object)
variable(object) <- data.frame()
Input parameters:
The object
should be a MeRIP object.
Output data:
This function returns a data.frame that contains the predictor variable of the experiment and other covariates if exits.
Extract the mapping between the consecutive bins and genes.
geneBins(object)
Input parameters:
The object
should be a MeRIP object that is returned from the countRead()
function and other preprocessing functions.
Output data:
This function returns a data.frame that maps the bin names (row names of read count matrix) to genes and on-“transcript” location of the bin.
Extract the normalized gene-level read counts from Input library, which is equivalent to regular RNA-seq gene level quantification.
geneExpression(object)
Input parameters:
The object
should be a MeRIP object that has ran the library normalization preprocessing.
Output data:
This function returns a matrix of read counts data where each row is gene name and column is sample.
select(object, samples)
Input parameters:
The object
should be a MeRIP object.
The samples
is a character vector setting the sample names to be subset.
Output data:
This function returns a MeRIP object with subset of samples as set by the parameter. Note only read counts data, preprocessed data will be retained while test statistics will be removed because the result will change with subset of samples.
Plot the proportion of differential peaks in detected by RADAR.
peakDistribution(object)
Input parameters:
The object
should be a MeRIP object that has run all preprocessing, test as well as postprocessing.
Output data:
This function generates a pie plot showing the fraction of differential peaks in each genomic annotation.
Plot heatmap of significant bins detected by RADAR.
plotHeatMap(object, covariates)
Input parameters:
The object
should be a MeRIP object that has run all preprocessing, test as well as postprocessing.
The covariates
is a logical option that defines whether to regress out covariates when plotting the heatmap. If TRUE, covariates as provided in the variable(object)
data.frame will be regressed out.
Output data:
This function generates a heatmap plot showing variation of methylation level of significantly differentially methylated bins.
Extract IP read counts.
extractIP(object, normalized, adjusted, filtered)
Input parameters:
The object
should be a MeRIP object.
The normalized
is logical option. If TRUE, extract the normalized IP read counts, default is TRUE.
The adjusted
is logical option. If TRUE, extract the IP read counts adjusted for gene expression level.
The filtered
is logical option. If TRUE, extract the low-coverage filtered IP read counts that is adjusted for gene expression level.
Output data:
This function returns a IP read counts matrix.
Extract Input read counts.
extractInput(object, normalized)
Input parameters:
The object
should be a MeRIP object.
The normalized
is logical option. If TRUE, extract the normalized Input read counts, default is TRUE.
Output data:
This function returns a Input read counts matrix.
This R Markdown site was created with workflowr