Intro_FindIT2:find_influential_TF module part2
上一篇文章提到了如何寻找调控你感兴趣基因
的那些TF,感兴趣基因的来源可以是差异分析、聚类等。这一篇文章来介绍如何寻找调控你感兴趣peak
的那些TF,感兴趣的peak可以来自于差异分析、聚类分析等,也可以是你感兴趣基因临近或远端的peak。这部分内容包含了两个函数:findIT_enrichFisher
和findIT_enrichWilcox
的介绍。前者基于的是Fisher精确检验
,后者基于的是Wilcox秩和检验
,具体的细节我们会在后面讨论。
Pre-preparation
这次用到的测试数据跟上一篇文章类似,包含了LEC2的在整个基因组上的ChIP-seq结果,GR诱导LEC2过表达的RNA-seq结果以及一个ATAC的多时间点的mergePeak文件和count矩阵以及一些公共数据。
# 加载包
library(FindIT2)
## Warning: package 'GenomeInfoDb' was built under R version 4.1.1
library(TxDb.Athaliana.BioMart.plantsmart28)
## Warning: package 'GenomicFeatures' was built under R version 4.1.1
library(BSgenome.Athaliana.TAIR.TAIR9)
library(motifmatchr)
library(dplyr)
library(ggplot2)
Txdb <- TxDb.Athaliana.BioMart.plantsmart28
seqlevels(Txdb) <- paste0("Chr", c(1:5, "M", "C"))
除了ATAC 多时间点的count矩阵文件之外,其他都已经在上一篇文章下载过了。
# 下载LEC2在整个基因组上的ChIP-seq结果
download.file(url = "https://raw.githubusercontent.com/shangguandong1996/FindIT2_paper_relatedCode/master/rawdata/ChIP/TF_ChIP_bindingSite.bed",
destfile = "TF_ChIP_bindingSite.bed")
# 下载ATAC peak
download.file(url = "https://raw.githubusercontent.com/shangguandong1996/FindIT2_paper_relatedCode/master/rawdata/ATAC/ATAC_mergePeak.bed",
destfile = "ATAC_mergePeak.bed")
# 下载ATAC 多时间点的count矩阵
download.file(url = "https://raw.githubusercontent.com/shangguandong1996/FindIT2_paper_relatedCode/master/rawdata/ATAC/ATAC_peakNormCount.rda",
destfile = "ATAC_peakNormCount.rda"
)
# 下载RNA-seq的差异分析结果
download.file(url = "https://raw.githubusercontent.com/shangguandong1996/FindIT2_paper_relatedCode/master/rawdata/RNA/LEC2_RNA_diffResult.rda",
destfile = "LEC2_RNA_diffResult.rda")
# 下载公共数据库中各种TF ChIP-seq的peak文件
download.file(url = "https://remap.univ-amu.fr/storage/remap2022/tair10/tf/MACS2/remap2022_all_macs2_TAIR10_v1_0.bed.gz",
destfile = "remap2022_all_macs2_TAIR10_v1_0.bed.gz")
由于用的是上一篇文章下载过的数据,所以指向的路径是上一篇文章的地址
ChIP_peak_GR <- loadPeakFile("../2021-09-24-intro-findit2-find-influential-tf-module-part1/TF_ChIP_bindingSite.bed")
ChIP_peak_GR
## GRanges object with 3223 ranges and 2 metadata columns:
## seqnames ranges strand | feature_id score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] Chr1 37901-38160 * | TF_ChIP_1 0.747831
## [2] Chr1 51301-51660 * | TF_ChIP_2 1.368107
## [3] Chr1 51801-51810 * | TF_ChIP_3 0.765108
## [4] Chr1 68451-68660 * | TF_ChIP_4 0.797340
## [5] Chr1 116401-116660 * | TF_ChIP_5 0.853717
## ... ... ... ... . ... ...
## [3219] Chr5 26903801-26904210 * | TF_ChIP_3219 0.829693
## [3220] Chr5 26905801-26906010 * | TF_ChIP_3220 1.008028
## [3221] Chr5 26906401-26906760 * | TF_ChIP_3221 0.990938
## [3222] Chr5 26933051-26933410 * | TF_ChIP_3222 0.814937
## [3223] Chr5 26933851-26934110 * | TF_ChIP_3223 0.943218
## -------
## seqinfo: 5 sequences from an unspecified genome; no seqlengths
load("../2021-09-24-intro-findit2-find-influential-tf-module-part1/LEC2_RNA_diffResult.rda")
diff_RNA
## # A tibble: 37,051 x 3
## gene_id log2FoldChange padj
## <chr> <dbl> <dbl>
## 1 AT1G01010 -0.0169 0.510
## 2 AT1G01020 0.000174 1.00
## 3 AT1G03987 -0.478 0.191
## 4 AT1G01030 -0.00859 1.00
## 5 AT1G01040 -0.000254 1.00
## 6 AT1G03993 0 NA
## 7 AT1G01050 -0.000576 1.00
## 8 AT1G03997 0 NA
## 9 AT1G01060 -0.00174 1.00
## 10 AT1G01070 -0.00384 1.00
## # ... with 37,041 more rows
ATAC_peak_GR <- loadPeakFile("../2021-09-24-intro-findit2-find-influential-tf-module-part1/ATAC_mergePeak.bed")
ATAC_peak_GR
## GRanges object with 22994 ranges and 2 metadata columns:
## seqnames ranges strand | feature_id score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] Chr1 30-875 * | ATAC_1 0
## [2] Chr1 1392-1953 * | ATAC_2 0
## [3] Chr1 2164-3625 * | ATAC_3 0
## [4] Chr1 8268-8966 * | ATAC_4 0
## [5] Chr1 13654-14698 * | ATAC_5 0
## ... ... ... ... . ... ...
## [22990] Chr5 26969184-26969915 * | ATAC_22990 0
## [22991] Chr5 26970542-26971504 * | ATAC_22991 0
## [22992] Chr5 26972304-26973186 * | ATAC_22992 0
## [22993] Chr5 26974280-26974825 * | ATAC_22993 0
## [22994] Chr5 26975157-26975438 * | ATAC_22994 0
## -------
## seqinfo: 5 sequences from an unspecified genome; no seqlengths
load("ATAC_peakNormCount.rda")
head(peakCountNorm_ATAC)
## WT_E5_0h_R1 WT_E5_0h_R2 WT_E5_4h_R1 WT_E5_4h_R2 WT_E5_8h_R1 WT_E5_8h_R2
## ATAC_1 126.65266 143.0386 220.6055 224.7059 171.1754 204.2058
## ATAC_2 93.45245 103.4747 106.0053 132.7808 151.9182 152.2089
## ATAC_3 354.75043 402.9427 621.7065 569.9358 487.8500 406.5208
## ATAC_4 158.62324 144.2559 183.3605 183.8503 165.8262 180.5708
## ATAC_5 222.56439 258.6868 203.4155 275.7754 171.1754 192.8610
## ATAC_6 201.04574 175.9070 137.5203 134.8235 145.4991 119.1200
## WT_E5_16h_R1 WT_E5_16h_R2 WT_E5_24h_R1 WT_E5_24h_R2 WT_E5_48h_R1
## ATAC_1 184.5536 152.9906 205.0690 182.4882 126.4342
## ATAC_2 153.0956 178.4891 204.3619 184.2152 105.5451
## ATAC_3 440.4120 413.7701 506.3083 472.0515 307.8398
## ATAC_4 191.8938 119.3790 221.3331 190.5476 180.3062
## ATAC_5 197.1368 200.5104 208.6047 223.3609 133.0308
## ATAC_6 139.4638 114.7430 185.2692 156.0073 124.2354
## WT_E5_48h_R2 WT_E5_48h_R3 WT_E5_72h_R1 WT_E5_72h_R2 WT_E5_72h_R3
## ATAC_1 128.4671 151.36817 134.9386 158.33027 112.9270
## ATAC_2 117.0983 97.89572 84.6225 119.51381 134.0553
## ATAC_3 345.6106 403.10002 408.2464 362.62738 446.6083
## ATAC_4 120.5090 176.04777 137.2257 150.15838 143.5266
## ATAC_5 172.8053 166.99858 152.0918 196.12523 214.9257
## ATAC_6 120.5090 106.12225 101.7757 79.67588 129.6840
# Remap2022的bed文件
database_ChIP_GR <- loadPeakFile("../2021-09-24-intro-findit2-find-influential-tf-module-part1/remap2022_all_macs2_TAIR10_v1_0.bed.gz")
database_ChIP_GR
## GRanges object with 4843030 ranges and 4 metadata columns:
## seqnames ranges strand | feature_id
## <Rle> <IRanges> <Rle> | <character>
## [1] 1 1-1033 * | GSE143831.MED12.Col-..
## [2] 1 1-1053 * | GSE124750.AGO4.Col_i..
## [3] 1 1-1194 * | GSE124750.AGO6.Col_i..
## [4] 1 1-1198 * | GSE124750.AGO9.Col_i..
## [5] 1 1-1234 * | GSE124750.AGO9.Col_i..
## ... ... ... ... . ...
## [4843026] 5 26975133-26975442 * | GSE166543.EICBP-B.Co..
## [4843027] 5 26975155-26975443 * | GSE124750.AGO4.Col_i..
## [4843028] 5 26975174-26975444 * | GSE108673.SPT6.Col-0..
## [4843029] 5 26975211-26975325 * | GSE60141.TGA4.Col-0_..
## [4843030] 5 26975213-26975502 * | GSE60141.TRP2.Col-0_..
## score itemRgb thick
## <numeric> <character> <IRanges>
## [1] 4.41114 #8C381C 500
## [2] 17.08726 #674F67 313
## [3] 21.07903 #556C10 294
## [4] 58.29619 #22DBC3 415
## [5] 7.67439 #22DBC3 267
## ... ... ... ...
## [4843026] 6.47935 #CDE461 26975253
## [4843027] 10.07379 #674F67 26975286
## [4843028] 12.09363 #F8D7EE 26975298
## [4843029] 5.75710 #DB9A42 26975291
## [4843030] 355.36154 #F7A31C 26975406
## -------
## seqinfo: 5 sequences from an unspecified genome; no seqlengths
integrate_ChIP_RNA
跟上一篇文章一样,我通过整合ChIP-seq和RNA-seq来拿到LEC2的靶基因,并选取top1000做进一步地分析。
# 我希望calcRP_TFHit能够将我bed文件里面的score考虑在内
# 所以我重命名了socre列为featrure_score
colnames(mcols(ChIP_peak_GR))[2] <- "feature_score"
mmAnno_TFHit <- mm_geneScan(ChIP_peak_GR,
Txdb,
upstream = 2e4,
downstream = 2e4)
## >> checking seqlevels match... 2021-10-26 20:46:57
## >> your peak_GR seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5...
## >> your Txdb seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5 ChrM ChrC...
## Good, your Chrs in peak_GR is all in Txdb
## ------------
## annotatePeak using geneScan mode begins
## >> preparing gene features information and scan region... 2021-10-26 20:46:57
## >> some scan range may cross Chr bound, trimming... 2021-10-26 20:46:58
## >> finding overlap peak in gene scan region... 2021-10-26 20:46:58
## >> dealing with left peak not your gene scan region... 2021-10-26 20:46:58
## >> merging two set peaks... 2021-10-26 20:46:58
## >> calculating distance and dealing with gene strand... 2021-10-26 20:46:58
## >> merging all info together ... 2021-10-26 20:46:58
## >> done 2021-10-26 20:46:58
TFHit_RP <- calcRP_TFHit(mmAnno = mmAnno_TFHit,
Txdb = Txdb,
decay_dist = 1000,
report_fullInfo = FALSE)
## >> calculating peakCenter to TSS using peak-gene pair... 2021-10-26 20:46:58
## >> calculating RP using centerToTSS and feature_score 2021-10-26 20:46:59
## >> merging all info together 2021-10-26 20:46:59
## >> done 2021-10-26 20:46:59
merge_data <- integrate_ChIP_RNA(result_geneRP = TFHit_RP,
result_geneDiff = diff_RNA,
lfc_threshold = 0.58,
padj_threshold = 0.05)
# 这里我取前1000个基因作为LEC2的靶基因
LEC2_target <- merge_data$data$gene_id[1:1000]
build peak-gene link
由于这篇文章讨论的是如何寻找调控你感兴趣peak
的那些TF,所以首先我们需要定义一个peak集合。最常见的peak集合就是通过差异分析得到的上调或者下调的peak,当然也可以是做完聚类之后每个类群里面的peak。这里我用FindIT2
前面的mmAnno
和peakGeneCor
模块来演示如何为LEC2靶基因寻找相关联的peak。
首先是用mmAnno module
中的mm_geneBound
函数来寻找LEC2靶基因的关联peak。mm_geneBound
的原理可见前面的文章,大致就是通过最近基因注释
和最近peak注释
两步注释来关联peak。尽管这样找到的peak不一定是最全的,但一定会为每个基因都至少找到一个peak。
mm_geneBound(peak_GR = ATAC_peak_GR,
Txdb = Txdb,
input_genes = LEC2_target) -> mmAnno_geneBound
## >> checking seqlevels match... 2021-10-26 20:46:59
## >> your peak_GR seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5...
## >> your Txdb seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5 ChrM ChrC...
## Good, your Chrs in peak_GR is all in Txdb
## >> using mm_nearestGene to annotate Peak... 2021-10-26 20:46:59
## >> checking seqlevels match... 2021-10-26 20:46:59
## >> your peak_GR seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5...
## >> your Txdb seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5 ChrM ChrC...
## Good, your Chrs in peak_GR is all in Txdb
## ------------
## annotating Peak using nearest gene mode begins
## >> preparing gene features information... 2021-10-26 20:46:59
## >> finding nearest gene and calculating distance... 2021-10-26 20:46:59
## >> dealing with gene strand ... 2021-10-26 20:47:00
## >> merging all info together ... 2021-10-26 20:47:00
## >> done 2021-10-26 20:47:00
## It seems that there 212 genes have not been annotated by nearestGene mode
## >> using distanceToNearest to find nearest peak of these genes... 2021-10-26 20:47:00
## >> merging all anno... 2021-10-26 20:47:00
## >> done 2021-10-26 20:47:00
当然进一步地,大家也可以用mm_geneScan
来划定每个基因的扫描范围,从而找到更多的关联peak,只不过关联的peak越多,相对应的噪音可能也就越多。由于这里我们的目的只是想找一个LEC2靶基因关联的集合,从而来验证函数的功效,所以没必要引入太多的关联peak。
由于这套ATAC-seq关联的数据集还含有一组多时间点的peak count矩阵,所以这里我们还可以使用peakGeneCor module
里面的enhancerPromoterCor
函数来寻找基因关联的peak。大致原理就是尽量为每个基因寻找一个最近的peak作为promoter,然后寻找较远的peak,通过计算远端peak与近端promoter的相关性,从而找到关联的远端peak。尽管这种方法可能并不是那么地准确,因为近端promoter的变化并不一定能够表征基因自身的变化,但在缺乏对应的表达量数据或者样本量不是那么地充分的时候,这个方法也可以作为一个参考。
# 我设置up/down_scanPromoter为500bp,即如果一个基因TSS 500bp以内有peak
# 则最近的那个peak被认为是这个基因的promoter
# 我设置up/down_scanEnhancer为2e4,如果一个基因TSS 2e4bp以内有peak
# 那么这些peak就会与基因的promoter计算相关性
enhancerPromoterCor(peak_GR = ATAC_peak_GR,
Txdb = Txdb,
up_scanPromoter = 1000,
down_scanPromoter = 1000,
up_scanEnhancer = 2e4,
down_scanEnhacner = 2e4,
peakScoreMt = peakCountNorm_ATAC) -> mmAnno_enhancerPromoter
## >> checking seqlevels match... 2021-10-26 20:47:00
## >> your peak_GR seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5...
## >> your Txdb seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5 ChrM ChrC...
## Good, your Chrs in peak_GR is all in Txdb
## >> using scanPromoter parameter to scan promoter for each gene... 2021-10-26 20:47:00
## >> checking seqlevels match... 2021-10-26 20:47:00
## >> your peak_GR seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5...
## >> your Txdb seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5 ChrM ChrC...
## Good, your Chrs in peak_GR is all in Txdb
## >> some scan range may cross Chr bound, trimming... 2021-10-26 20:47:01
## >> there are 22117 gene have scaned promoter
## >> using scanEnhancer parameter to scan Enhancer for each gene... 2021-10-26 20:47:01
## >> checking seqlevels match... 2021-10-26 20:47:01
## >> your peak_GR seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5...
## >> your Txdb seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5 ChrM ChrC...
## Good, your Chrs in peak_GR is all in Txdb
## >> some scan range may cross Chr bound, trimming... 2021-10-26 20:47:02
## >> calculating cor and pvalue, which may be time consuming... 2021-10-26 20:47:03
## >> merging all info together... 2021-10-26 20:47:28
## >> done 2021-10-26 20:47:29
# 我以相关性大于0.8,pvalue<0.01为阈值来挑选distal enhancer-promoter link
mmAnno_enhancerPromoter_filter <- subset(mmAnno_enhancerPromoter, cor > 0.8 & pvalue < 0.01)
当然,如果还有对应的RNA-seq数据,也可以使用peakGeneCor module
里面的peakGeneCor
来寻找基因关联的peak。
然后我们来整合peak,这样我们就得到了在全局性的peak中,LEC2靶基因所关联的peak集合。后面分析的目的就是想要看看这些LEC2关联的peak中富集了什么TF或者说什么TF可能调控了这些peak。
relatedPeak_1 <- mmAnno_geneBound$feature_id
relatedPeak_2 <- mmAnno_enhancerPromoter_filter %>%
as_tibble() %>%
filter(gene_id %in% LEC2_target) %>%
pull(feature_id)
relatedPeak_3 <- metadata(mmAnno_enhancerPromoter)$mm_promoter_tidy %>%
filter(gene_id %in% LEC2_target) %>%
pull(promoter_feature)
relatedPeak <- unique(c(relatedPeak_1, relatedPeak_2, relatedPeak_3))
length(relatedPeak)
## [1] 1531
# 当然,relatedPeak_1,即mmAnno_geneBound的结果已经包含了大部分的relatedPeak了
# 所以我们单纯地用mmAnno_geneBound的结果也行
mean(relatedPeak %in% unique(relatedPeak_1))
## [1] 0.9150882
mean(relatedPeak %in% unique(relatedPeak_2))
## [1] 0.1404311
mean(relatedPeak %in% unique(relatedPeak_3))
## [1] 0.5453952
findIT_enrichFisher
在进一步讲解函数之前,我想要先提下在使用HOMER findMotifsGenome.pl
这个命令中的一个小问题。现在大部分的ATAC-seq文章会用这个函数去得到感兴趣peak集合中的富集TF,常规地就是对上调或者下调的peak去使用这个命令。但根据我的经验来看,如果你在使用findMotifsGenome.pl
去Finding Enriched Motifs
的时候,最好不要用随机产生的location作为背景,而是应该用全部的ATAC peak集合。即findMotifsGenome.pl
里面的-bg
参数应该设置为全部的ATAC peak。不然地话,所产生的结果就很有可能会是ATAC-seq建库本身所富集的东西,而不是跟你生物学过程密切相关的TF了。这种情况在你感兴趣的peak数目较多的时候尤其严重,最典型的现象就是你经常会在富集结果中看到某一家族的TF。
findIT_enrichFisher
的原理跟上一篇中的findIT_TTPair
一模一样,都是基于Fisher精确检验
来查看你感兴趣的集合,相比于全局性的集合,更加富集哪些TF。只不过findIT_TTPair
要求的是一个感兴趣的基因集和一个两列的TF-target
数据库。findIT_enrichFisher
则要求的是一个感兴趣的peak集合和GRanges格式的TF peak location数据库。数据库的格式跟上一篇的findIT_TFHit
是一样的。
我们首先用Remap2022
的公共数据库来做下富集
# 首先我们需要对数据库的binding site GRange对象做一些修改
# 将seqlevel修改一下
(seqlevels(database_ChIP_GR) <- paste0("Chr", seqlevels(database_ChIP_GR)))
## [1] "Chr1" "Chr2" "Chr3" "Chr4" "Chr5"
# 将feature_id修改成TF_id
colnames(mcols(database_ChIP_GR))[1] <- "TF_id"
# peak_GR参数包含了你全部ATAC peak的location
# 同时还需要有一列feature_id来标识peak的id
# 你的input_feature_id必须是这其中的一部分
result_enrichFisher_Remap <- findIT_enrichFisher(input_feature_id = relatedPeak,
peak_GR = ATAC_peak_GR,
TF_GR_database = database_ChIP_GR)
# 加as.data.frame只是为了能在博客里面展现所有的结果
result_enrichFisher_Remap %>%
as.data.frame() %>%
head()
## TF_id num_TFHit_inputFeature inputRatio
## 1 GSE145387.VAL1.Col-0_seedling 885 885/1531
## 2 GSE94307.HEC1.Col-0_seedling_12d 789 789/1531
## 3 GSE142621.BRM.Col-0_seedling_BRIP1 798 798/1531
## 4 GSE26722.REV.Col-0_seedling_10d-DEX-90min 711 711/1531
## 5 GSE20176.AP1.Col_apical-meristem_DEX 676 676/1531
## 6 GSE122579.AT2G17950.Col-0_seedling 845 845/1531
## bgRatio pvalue odds_ratio padj qvalue rank
## 1 6925/22994 4.827909e-120 3.497960 3.311946e-117 1.834606e-117 1
## 2 6425/22994 8.537808e-91 2.985903 5.848398e-88 1.622184e-88 2
## 3 6706/22994 7.258159e-85 2.866194 4.964581e-82 9.193669e-83 3
## 4 5746/22994 1.633122e-79 2.828857 1.115422e-76 1.551466e-77 4
## 5 5487/22994 3.225643e-73 2.736451 2.199888e-70 2.451488e-71 5
## 6 7747/22994 1.705800e-71 2.598544 1.161650e-68 1.080340e-69 6
跟之前一样,因为数据库里面没有LEC2,所以结果里面LEC2并不在top。我们可以把LEC2加入到数据库里面再跑一次。
mcols(database_ChIP_GR) <- mcols(database_ChIP_GR)[, 1, drop = FALSE]
mcols(ChIP_peak_GR) <- NULL
ChIP_peak_GR$TF_id <- "LEC2"
database_ChIP_GR <- c(database_ChIP_GR, ChIP_peak_GR)
result_enrichFisher_Remap <- findIT_enrichFisher(input_feature_id = relatedPeak,
peak_GR = ATAC_peak_GR,
TF_GR_database = database_ChIP_GR)
result_enrichFisher_Remap %>%
as.data.frame() %>%
head()
## TF_id num_TFHit_inputFeature inputRatio
## 1 LEC2 821 821/1531
## 2 GSE145387.VAL1.Col-0_seedling 885 885/1531
## 3 GSE94307.HEC1.Col-0_seedling_12d 789 789/1531
## 4 GSE142621.BRM.Col-0_seedling_BRIP1 798 798/1531
## 5 GSE26722.REV.Col-0_seedling_10d-DEX-90min 711 711/1531
## 6 GSE20176.AP1.Col_apical-meristem_DEX 676 676/1531
## bgRatio pvalue odds_ratio padj qvalue rank
## 1 2216/22994 0.000000e+00 16.622502 0.000000e+00 0.000000e+00 1
## 2 6925/22994 4.827909e-120 3.497960 3.311946e-117 9.173028e-118 2
## 3 6425/22994 8.537808e-91 2.985903 5.848398e-88 1.081456e-88 3
## 4 6706/22994 7.258159e-85 2.866194 4.964581e-82 6.895251e-83 4
## 5 5746/22994 1.633122e-79 2.828857 1.115422e-76 1.241173e-77 5
## 6 5487/22994 3.225643e-73 2.736451 2.199888e-70 2.042907e-71 6
跟之前一样,如果你的物种没有公共数据库,也可以考虑用ATAC motif scan或者footprint scan。用motif scan来做Fisher精确检验
其实就很类似于findMotifGenome.pl
了,只不过这里对应的是-bg参数设置的是全部的peak集合。
# 写个函数来整理提取出JASPAR2020中的内容
getJasparMotifs <- function(species = "Arabidopsis thaliana",
collection = "CORE", ...) {
opts <- list()
opts["species"] <- species
opts["collection"] <- collection
opts <- c(opts, list(...))
out <- TFBSTools::getMatrixSet(JASPAR2020::JASPAR2020, opts)
if (!isTRUE(all.equal(TFBSTools::name(out), names(out))))
names(out) <- paste(names(out), TFBSTools::name(out), sep = "_")
return(out)
}
motifs_JASPAR2020 <- getJasparMotifs()
# call motif
motif_ix <- matchMotifs(motifs_JASPAR2020,
ATAC_peak_GR,
genome = BSgenome.Athaliana.TAIR.TAIR9,
out = "positions",
bg = "genome")
motif_GR <- unlist(motif_ix)
motif_GR$TF_id <- gsub("(MA[0-9]{4}\\.[0-9]_)", "", names(motif_GR))
motif_GR
## GRanges object with 2147996 ranges and 2 metadata columns:
## seqnames ranges strand | score TF_id
## <Rle> <IRanges> <Rle> | <numeric> <character>
## MA0121.1_ARR10 Chr1 1490-1497 - | 10.9848 ARR10
## MA0121.1_ARR10 Chr1 1904-1911 - | 10.9848 ARR10
## MA0121.1_ARR10 Chr1 33532-33539 - | 10.9848 ARR10
## MA0121.1_ARR10 Chr1 55227-55234 - | 11.1123 ARR10
## MA0121.1_ARR10 Chr1 62176-62183 - | 11.1123 ARR10
## ... ... ... ... . ... ...
## MA1329.2_ZHD1 Chr5 26876218-26876230 - | 18.9609 ZHD1
## MA1329.2_ZHD1 Chr5 26879928-26879940 - | 19.6284 ZHD1
## MA1329.2_ZHD1 Chr5 26880196-26880208 - | 19.6284 ZHD1
## MA1329.2_ZHD1 Chr5 26889355-26889367 - | 20.1565 ZHD1
## MA1329.2_ZHD1 Chr5 26901566-26901578 - | 19.1066 ZHD1
## -------
## seqinfo: 5 sequences from an unspecified genome; no seqlengths
result_enrichFisher_motifScan <- findIT_enrichFisher(input_feature_id = relatedPeak,
peak_GR = ATAC_peak_GR,
TF_GR_database = motif_GR)
result_enrichFisher_motifScan %>%
as.data.frame() %>%
head()
## TF_id num_TFHit_inputFeature inputRatio bgRatio pvalue odds_ratio
## 1 FUS3 449 449/1531 2805/22994 5.935729e-78 3.364828
## 2 ABI3 283 283/1531 1369/22994 7.183527e-71 4.254414
## 3 LEC2 400 400/1531 2455/22994 1.788268e-70 3.339894
## 4 HBI1 321 321/1531 2799/22994 4.101352e-24 2.032429
## 5 TCP24 368 368/1531 3412/22994 5.647347e-23 1.914660
## 6 TCP3 363 363/1531 3437/22994 5.067003e-21 1.859006
## padj qvalue rank
## 1 2.760114e-75 5.377072e-76 1
## 2 3.333156e-68 3.253715e-69 2
## 3 8.279683e-68 5.399869e-69 3
## 4 1.894825e-21 9.288356e-23 4
## 5 2.603427e-20 1.023166e-21 5
## 6 2.330822e-18 7.650181e-20 6
findIT_enrichWilcox
findIT_enrichWilcox
相比于前面的findIT_enrichFisher
,其还额外考虑了TFBS在你感兴趣peak集合的数目。举个例子而言,假如我们想要看看LEC2在感兴趣的peak集合中的富集情况,在findIT_enrichFisher
中,LEC2 motif scan得到的TFBS对于每个感兴趣的peak,只会有在和不在两种结果。所以最终产生的就是一个列联表。
TF Hit | TF not Hit | total | ||
---|---|---|---|---|
select peak | 400 | 1131 | 1531 | |
left peak | 2055 | 19408 | 21463 | |
total | 2455 | 20539 | 22994 |
(matrix_fisher <- matrix(c(400, 1131, 2055, 19408), nrow = 2, byrow = TRUE))
## [,1] [,2]
## [1,] 400 1131
## [2,] 2055 19408
fisher.test(matrix_fisher, alternative = "greater")
##
## Fisher's Exact Test for Count Data
##
## data: matrix_fisher
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 3.005858 Inf
## sample estimates:
## odds ratio
## 3.339894
fisher.test(matrix_fisher, alternative = "greater")$p.value
## [1] 1.788268e-70
但实际上每个感兴趣的peak里面可能会有1个,2个甚至多个LEC2 TFBS,而这些TFBS在peak里面的数目其实是有意义的。所以findIT_enrichWilcox
会随机选择一个背景peak集合,计算TFBS在背景peak集合中的数目分布,然后将感兴趣集合与背景集合进行wilcox秩和检验,从而计算哪些TF是显著的。显然,如果一个TF的TFBS在感兴趣peak集合中很有多的hit,而在背景peak集合中hit数目很少,就说明这个TF很有可能调控了你的peak。
set.seed(20160806)
result_enrichWilcox_Remap <- findIT_enrichWilcox(input_feature_id = relatedPeak,
peak_GR = ATAC_peak_GR,
TF_GR_database = database_ChIP_GR,
background_peaks = NULL,
background_number = 3000)
result_enrichWilcox_Remap %>%
as.data.frame() %>%
head()
## TF_id input_meanMotifScore
## 1 LEC2 0.7668191
## 2 GSE145387.VAL1.Col-0_seedling 0.6544742
## 3 GSE94307.HEC1.Col-0_seedling_12d 0.6610059
## 4 GSE142621.BRM.Col-0_seedling_BRIP1 0.5963423
## 5 GSE122579.AT2G17950.Col-0_seedling 0.7459177
## 6 GSE26722.REV.Col-0_seedling_10d-DEX-90min 0.5682560
## bg_meanMotifScore pvalue padj qvalue rank
## 1 0.06833333 4.989869e-297 3.467959e-294 1.776393e-294 1
## 2 0.30200000 5.699338e-87 3.955341e-84 1.014482e-84 2
## 3 0.29966667 3.371684e-71 2.336577e-68 4.001065e-69 3
## 4 0.30266667 1.622719e-60 1.122922e-57 1.444220e-58 4
## 5 0.38866667 1.494534e-55 1.032723e-52 1.064108e-53 5
## 6 0.27633333 3.777533e-55 2.606498e-52 2.241336e-53 6
set.seed(20160806)
result_enrichWilcox_motifScan <- findIT_enrichWilcox(input_feature_id = relatedPeak,
peak_GR = ATAC_peak_GR,
TF_GR_database = motif_GR,
background_peaks = NULL,
background_number = 3000)
result_enrichWilcox_motifScan %>%
as.data.frame() %>%
head()
## TF_id input_meanMotifScore bg_meanMotifScore pvalue padj
## 1 FUS3 0.3932071 0.1206667 2.061566e-61 9.586283e-59
## 2 ABI3 0.3696930 0.0780000 1.422173e-52 6.598883e-50
## 3 LEC2 0.3507511 0.1163333 4.487473e-47 2.077700e-44
## 4 TCP24 0.2906597 0.1616667 1.096020e-18 5.063613e-16
## 5 HBI1 0.2769432 0.1663333 2.237878e-17 1.031662e-14
## 6 TCP5 0.2129327 0.1120000 3.065327e-17 1.410050e-14
## qvalue rank
## 1 1.958488e-59 1
## 2 6.755322e-51 2
## 3 1.421033e-45 3
## 4 2.603048e-17 4
## 5 4.251967e-16 5
## 6 4.853434e-16 6
这里面有一些小的注意点需要考虑,首先就是背景peak的选择问题。如果你不指定背景peak,那么函数默认地就会随机从全部的ATAC peak集合中抽取background_number
数目的peak来作为背景集合。当然,你也可以使用你自己指定的背景peak填入background_peaks
中。
还有就是findIT_enrichWilcox
和findIT_enrichFisher
函数选择哪个的问题。findIT_enrichFisher
基于的是Fisher精确检验,原理跟基因的GO富集很相似,比较容易解释,也不需要你自己选择背景基因集,但是其只是单纯地考虑了TFBS在peak集合中的有和无。而findIT_enrichWilcox
则更多地考虑了TFBS在peak集合中的数目分布,但是背景集合的选取是一个问题。怎么样选取比较有意义的“背景集合”是一个值得思考的问题,我这里用的是随机选取,但是否有更好的集合来充当背景的任务是一个有待商榷的事情。这里我还是推荐大家两个函数都试一下,因为从我自己的试用结果来看,两者的结果是差不多的。
还有一个小问题就是什么时候设置set.seed
的问题。大家只要看下函数的manual,如果如果参数里面有background相关的东西,就说明这个函数需要随机选取背景集合,那么这时候就需要加set.seed了。