Introduction

Epitranscriptome profiling using MeRIP-seq is a powerful technique for in vivo functional studies of reversible RNA modifications. We developed RADAR, a comprehensive analytical tool for detecting differentially methylated loci in MeRIP-seq data. RADAR enables accurate identification of altered methylation sites by accommodating variability of pre-immunoprecipitation expression level and post-immunoprecipitation count using different strategies. In addition, it is compatible with complex study design when covariates need to be incorporated in the analysis.

To analyze the count data of MeRIP-seq, we divide the transcript (concatenated exons of a gene) into consecutive bins and quantify the reads align to each bin. The summarized count data is in matrix format where each row represents a locus and each column is a sample. Statistical inference of changes cross groups, as compared to within-group variability is made for each filtered locus (bin) with sufficient count.

RADAR achieves improved performance mainly by the following advancements: (1) RADAR uses gene-level isntead of peak-level read count as an robust representation of pre-IP gene expression level. This significantly reduces the noise of the data. (2) The generalized linear model framework used in RADAR is compatible with complex study design so that biological and experimental covariates can be accounted for in differential methylation analysis. (3) To accomodate extra overdispersion due to immunoprecipitation, RADAR implemented a random effect model that is flexible on mean-variable relationship.

Install the R package from Github

Depends: Rsamtools, GenomicFeatures (>= 1.14.5), BH, Rcpp, ggplot2, doParallel, foreach, gplots, RColorBrewer, RcppArmadillo

    install.packages("devtools")
    library(devtools)
    install_github("scottzijiezhang/RADAR")
    library("RADAR")

The installation has been tested on R version 3.4.4 and R version 3.5.0 and 3.5.1 on Linux and Mac.

Example data

To access the example data:

bam.path <- system.file("extdata",package = "RADAR")
list.files(bam.path)
##  [1] "Case1.input.bam" "Case1.m6A.bam"   "Case2.input.bam"
##  [4] "Case2.m6A.bam"   "Case3.input.bam" "Case3.m6A.bam"  
##  [7] "Case4.input.bam" "Case4.m6A.bam"   "Case5.input.bam"
## [10] "Case5.m6A.bam"   "Case6.input.bam" "Case6.m6A.bam"  
## [13] "Case7.input.bam" "Case7.m6A.bam"   "Case8.input.bam"
## [16] "Case8.m6A.bam"   "Ctl1.input.bam"  "Ctl1.m6A.bam"   
## [19] "Ctl2.input.bam"  "Ctl2.m6A.bam"    "Ctl3.input.bam" 
## [22] "Ctl3.m6A.bam"    "Ctl4.input.bam"  "Ctl4.m6A.bam"   
## [25] "Ctl5.input.bam"  "Ctl5.m6A.bam"    "Ctl6.input.bam" 
## [28] "Ctl6.m6A.bam"    "Ctl7.input.bam"  "Ctl7.m6A.bam"   
## [31] "Ctl8.input.bam"  "Ctl8.m6A.bam"    "Ctl9.input.bam" 
## [34] "Ctl9.m6A.bam"

Data structure

Naming convention of input files

Since MeRIP-seq generate a pair of sequencing data – Input (regular RNA-seq) and IP (immunoprecipitation), we denote these two libraries by .input or .<modification> followed by .bam after sample names. Thus, the RADAR requires a pair of BAM files for each sample named as <samplename>.input.bam + <samplename>.<modification>.bam. The <modification> can be specified in the countRead(... , modification = "<modification>") function. In the example data where we are interested in m6A modification, the files are named as : <samplename>.input.bam and <samplename>.m6A.bam. All BAM files to be processed and analyzed together should be saved in one directory, the path to which needs to be specified in the countRead(bamPath = ".../.../...").

The MeRIP and MeRIP.RADAR class

MeRIPtools used the S4 object system to manage the data involved in processing MeRIP-seq data. We defined a class MeRIP that stores the basic informations about a MeRIP-seq experiment (e.g. path to BAM files, read count matrix, gene model used to quantify reads etc.).
We also defined classes MeRIP.RADAR that inherits MeRIP class and stores intermediate data and result involved in RADAR pipelines. MeRIP.RADAR is designed for peak-calling free differential methylation analysis pipeline as described in this documentation site.

Data slots in MeRIP class:

getSlots("MeRIP")
##         reads       binSize     geneModel           gtf bamPath.input 
##      "matrix"     "numeric" "GRangesList"   "character"   "character" 
##    bamPath.ip   samplenames      geneBins       geneSum           GTF 
##   "character"   "character"  "data.frame"      "matrix"     "GRanges"

reads Numeric matrix data for read count in Input and IP library. The columns are sample names and rows are consecutive bins of user defined size (defaul is 50bp).
binSize Interger specifying the bin size used to divide the transcript.
geneModel A GRangesList that stores the gene model that has been filtered for ambiguous chromosome. Each element of the list contains a GRanges object specifying exons of a gene.
gtf The path to the gtf annotation file used to obtain gene model.
bamPath.input The path to input BAM files.
bamPath.IP The path to IP BAM files.
samplenames The prefix used for naming Input and IP BAM files (as introduced in naming convention), which specifies the name of a sample.
geneBins Data.frame mapping each consecutive bins to the corresponding gene.
geneSum Numeric matrix data for gene level read count in the Input library. This is equivalent to the read count matrix commenly used in RNA-seq differential expression analysis.
GTF The gtf annotation in GRanges format.

Data slots in MeRIP.RADAR class:

getSlots("MeRIP.RADAR")
##              norm.input                 norm.ip              sizeFactor 
##                "matrix"                "matrix"            "data.frame" 
##                 variate              ip_adjExpr     ip_adjExpr_filtered 
##            "data.frame"                "matrix"                "matrix" 
##                test.est             test.method              fdr.method 
##                "matrix"             "character"             "character" 
##           reportBin.fdr reportBin.logFoldChange              mergedBins 
##               "numeric"               "numeric"            "data.frame" 
##                   reads                 binSize               geneModel 
##                "matrix"               "numeric"           "GRangesList" 
##                     gtf           bamPath.input              bamPath.ip 
##             "character"             "character"             "character" 
##             samplenames                geneBins                 geneSum 
##             "character"            "data.frame"                "matrix" 
##                     GTF 
##               "GRanges"

norm.input The normalized Input read count matrix.
norm.ip The normalized IP read count matrix.
sizeFactor The estimated size factor for both Input and IP.
variate This data.frame store the predictor variable (the first column) used for inferential test and covariates (the rest of columns) to be included.
ip_adjExpr The IP read count matrix that have adjusted for the gene expression level.
ip_adjExpr_filtered The IP read count matrix that have adjusted for the gene expression level and then filtered out low coverage bins.
test.est Parameter estimation by differential test either by RADARtest (PoissonGamma test), QNB (quad-negative binomial test) or Beta-binomial test.
test.method The method used to make inferential test (character).
mergedBins The final result in BED12 format with additional columns for test statistics. This table are obtained by merginng the neighbouring significantly differentially methylated bins.

Basic functions of the RADAR package

countReads

This is the very first function in MeRIP-seq data analysis that initianize a MeRIP object. This function takes BAM files of Input/IP library of each samples as input and use given GTF file as gene annotation to divide genes into consecutive bins of user defined size.

countReads(
  samplenames,
  gtf, 
  fragmentLength = 150,
  bamFolder,
  outputDir=NA,
  modification = "m6A",
  binSize = 50,
  strandToKeep = "opposite",
  paired = FALSE,
  threads = 1,
  saveOutput = T
)

Input parameters:
The samplenames should be a vector of charaters that are the same as the prefix name of input BAM files. BAM files should be named as <samplename>.input.bam and <samplename>.<modification>.bam.
The gtf should be a (charater) path to the gft annotation file.
The fragmentLength is the average length of RNA fragment. Default is 150bp.
The bamFolder should be a (charater) path to the dirctory where input BAM files are located.
The outputDir denote the path to save the read count result in RDS format.
The modification is a character denoting the mid-name of BAM files for IP sample. Default is “m6A”, where countReads function expect to find <samplenames>.m6A.bam in the bamPath directory.
The binSize is an integer indicating the size of bin to divide the transcript.
The strandToKeep parameter set the strand specificity when counting the reads. Default is “opposite” for reverse-stranded library construction protocol, which is the most common one seen in RNA-seq. “same” denotes the sense-stranded library construction protocol, commonly seen in small RNA library kit. This parameter can also be set to “both” to ignore strand when counting the reads.
The paired is a logic parameter. Default is FALSE for single-end sequencing data. TURE for pair-end sequencing data where fragment length is calculated from the read pair instead of using the user provided parameter.
The threads set the number of threads to use for multi-core processing.
The saveOutput is logic parameter denoting whether read count result should be saved in addition to returning the output to the R environment.
Output data:
This function will return a MeRIP class object that stores the read count in bins for each samples.

normalizeLibrary

Normalize the library size of Input and IP library of the MeRIP-seq data. This is the first step of data preprocessing after read count matrix is obtained.

normalizeLibrary(object, boxPlot)

Input parameters:
The object is a MeRIP object. The boxPlot is a logical parameter that defines whether to generate the boxplot of bin read counts comparing before and after normalization.
Output data:
This function will return a MeRIP class object that includes normalized bin level read counts and Input gene-level read counts.

adjustExprLevel

Compute the IP read counts that are adjusted for gene expression level as represented by Input read counts.

adjustExprLevel(object, adjustBy)

Input parameters:
The object is a MeRIP object which has run library normalization and gene expression level adjustment processing.
The adjustBy set the denominator for adjusting the gene expression level; can be “geneSum” or “pos”. “geneSum” adjust the gene expression level by the gene-level read counts of Input library. “pos” adjust the gene expression level by the local (bin-level) read counts of Input library. Default is “geneSum”.
Output data:
This function will return a MeRIP class object that includes IP read counts adjusted for (normalized by) the gene expression level.

filterBins

Filter out low-coverage bins that doesn’t have sufficient information for reliable inferential test and bins where IP coverage is lower than the Input coverage.

filterBins(object, minCountsCutOff)

Input parameters:
The object is a MeRIP object which has run library normalization and gene expression level adjustment processing.
The minCountsCutOff set the minimal IP read count required in each bin for inferential test.
Output data:
This function will return a MeRIP class object that includes filtered IP read counts adjusted for (normalized by) the gene expression level, which will can be then tested by random effect regression model.

diffIP

Run PoissonGamma random effect regression model to test the differential methylation cross experimental group.

diffIP(object, exclude = NULL, maxPsi, fdrBy)
diffIP_parallel(object, exclude = NULL, maxPsi, fdrBy, thread)

Input parameters:
The object should be a MeRIP object that is returned from the countRead() function and other preprocessing functions.
The exclude is a optional parameter setting the samples to be excluded from the inferential test. Default is NULL. If some sample are to be excluded from the inferential test (e.g. technical problem was found in certain sample), one can indicate which samples are to be excluded by a character vector denoting the sample names.
The maxPsi set the max random effect parameter allowed in the fitting. Default is 100, which is roughly 1/100 variance.
The fdrBy set the FDR method to compute adjusted p-value. Default is “qvalue” for Storey’s qvalue method, can also be “fdr” for Benjamini & Hochberg method.
The thread set the number of threads to use when use diffIP_parallel.
Output data:
This function returns the MeRIP.RADAR object with test statistics saved in it.

reportResult

This function aims to merge neighboring significant bins that are attributed to the same m6A locus. After merging connected significant bins, this function also retrieve the genomic location of the bin and save the differential peaks in BED12 format with addtion columns for the test statistics.

reportResult(object , cutoff = 0.1, Beta_cutoff = 0.5, threads = 1)

Input parameters:
The object should be a MeRIP object that is returned from the countRead() function and other preprocessing functions.
The cutoff is number that indicate the FDR cutoff to select the significant differential methylated bins, default is 0.1 or 10% FDR.
The Beta_cutoff is the log fold change cutoff for selecting significant differential methylated bins, default is 0.5 log fold change.
The threads set the number of thread(s) to use for this process.
Output data:
This function returns the MeRIP.RADAR object with final differential peak result saved in it.

results

Extract final result table as in BED12 format denoting the location of the differential peaks with additional columns for the test statistics.

results(object)

Input parameters:
The object is a MeRIP object with preprocessing, test and postprocessing steps executed.
Output data:
This function returns a data.frame where the first 12 columns are BED12 format columns and the rest of columns for the test statistics.

PrepCoveragePlot

Import GTF file as gene annotation from the path stored in the MeRIP object.

PrepCoveragePlot(object, gtf)

Input parameters:
The object is a MeRIP object with preprocessing steps executed.
The gtf is an optional parameter that define the path to the GTF file. A path to GTF file was stored in the MeRIP object when the object is initiated by the countRead() function. However, if the user want to update the path to the GTF file, this parameter can be set to the new path to the GTF file. Default is NULL.
Output data:
This function returns a MeRIP.RADAR class object with GTF gene annotation imported as GRange object.

plotGeneCov

Generate coverage plot for a given gene or locus. Can plot (library size) normalized mean/median coverage for each experimental group or plot each individual sample in a panel.

plotGeneCov(object, geneName, libraryType, center, ZoomIn, adjustExprLevel, split)

Input parameters:
The object is a MeRIP object with GTF annotation imported.
The geneName should be a character defining the gene to be plotted. The gene name correspond to the ‘gene_name’ tag in the GTF files. For example, UCSC GTF annotation files use gene symbol, e.g. “METTL3”; Ensembl GTF files use Enembl gene ID, e.g. “ENSG00000165819.11”.
The libraryType set the strandness of the library. Can be “same” or “opposite”. Most mRNA library protocol (e.g. TruSeq, Kapa mRNA stranded library kit, Takara total RNA pico V2) generate reverse complementary strand reads and thus should use paprameter “opposite” (Default). Small RNA library (RNA ligation based) protocol usually generate reads of the same strand as RNA molecules and thus should use parameter “same”.
The center set how coverage of samples in the same experimental group will be combined. Can be mean or median. This option will not have effect if split is set TRUE.
The ZoomIn is a integer vector of two element setting a genomic range in the given gene to be plotted. e.g. c(12345600, 12345700). Note this range need to be within the gene to be plotted as set in geneName parameter.
The adjustExprLevel is logical option. If set TRUE, the coverage of Input and IP will be normalzied to Input coverage of that gene so that the coverage of IP is conditioned on gene expression level and directly comparable across group. If set FALSE, the plotted coverage will be only normalized to adjust sequencing library size.
The split is a logical option. If set TRUE, coverage of each individual will be plotted in a grid panel. If set FALSE, the coverage of samples will be combined according to the experimental group.
Output data:
This function generates a coverage plot of one panel if split = FALSE or multiple panel if split = TRUE.

variable

Extract variable(s) matrix of the experiment. Or set the variable(s) of the experiment.

variable(object)
variable(object) <- data.frame()

Input parameters:
The object should be a MeRIP object.
Output data:
This function returns a data.frame that contains the predictor variable of the experiment and other covariates if exits.

geneBins

Extract the mapping between the consecutive bins and genes.

geneBins(object)

Input parameters:
The object should be a MeRIP object that is returned from the countRead() function and other preprocessing functions.
Output data:
This function returns a data.frame that maps the bin names (row names of read count matrix) to genes and on-“transcript” location of the bin.

geneExpression

Extract the normalized gene-level read counts from Input library, which is equivalent to regular RNA-seq gene level quantification.

geneExpression(object)

Input parameters:
The object should be a MeRIP object that has ran the library normalization preprocessing.
Output data:
This function returns a matrix of read counts data where each row is gene name and column is sample.

select

select(object, samples)

Input parameters:
The object should be a MeRIP object.
The samples is a character vector setting the sample names to be subset.
Output data:
This function returns a MeRIP object with subset of samples as set by the parameter. Note only read counts data, preprocessed data will be retained while test statistics will be removed because the result will change with subset of samples.

peakDistribution

Plot the proportion of differential peaks in detected by RADAR.

peakDistribution(object)

Input parameters:
The object should be a MeRIP object that has run all preprocessing, test as well as postprocessing.
Output data:
This function generates a pie plot showing the fraction of differential peaks in each genomic annotation.

plotHeatMap

Plot heatmap of significant bins detected by RADAR.

plotHeatMap(object, covariates)

Input parameters:
The object should be a MeRIP object that has run all preprocessing, test as well as postprocessing.
The covariates is a logical option that defines whether to regress out covariates when plotting the heatmap. If TRUE, covariates as provided in the variable(object) data.frame will be regressed out.
Output data:
This function generates a heatmap plot showing variation of methylation level of significantly differentially methylated bins.

extractIP

Extract IP read counts.

extractIP(object, normalized, adjusted, filtered)

Input parameters:
The object should be a MeRIP object.
The normalized is logical option. If TRUE, extract the normalized IP read counts, default is TRUE.
The adjusted is logical option. If TRUE, extract the IP read counts adjusted for gene expression level.
The filtered is logical option. If TRUE, extract the low-coverage filtered IP read counts that is adjusted for gene expression level.
Output data:
This function returns a IP read counts matrix.

extractInput

Extract Input read counts.

extractInput(object, normalized)

Input parameters:
The object should be a MeRIP object.
The normalized is logical option. If TRUE, extract the normalized Input read counts, default is TRUE.
Output data:
This function returns a Input read counts matrix.


This R Markdown site was created with workflowr