Intro_FindIT2:find_influential_TF module part2

Last updated on 2021-10-05 13 min read manual_tools

上一篇文章提到了如何寻找调控你感兴趣基因的那些TF，感兴趣基因的来源可以是差异分析、聚类等。这一篇文章来介绍如何寻找调控你感兴趣peak的那些TF，感兴趣的peak可以来自于差异分析、聚类分析等，也可以是你感兴趣基因临近或远端的peak。这部分内容包含了两个函数：findIT_enrichFisher和findIT_enrichWilcox的介绍。前者基于的是Fisher精确检验，后者基于的是Wilcox秩和检验，具体的细节我们会在后面讨论。

Pre-preparation

这次用到的测试数据跟上一篇文章类似，包含了LEC2的在整个基因组上的ChIP-seq结果，GR诱导LEC2过表达的RNA-seq结果以及一个ATAC的多时间点的mergePeak文件和count矩阵以及一些公共数据。

# 加载包
library(FindIT2)
## Warning: package 'GenomeInfoDb' was built under R version 4.1.1

library(TxDb.Athaliana.BioMart.plantsmart28)
## Warning: package 'GenomicFeatures' was built under R version 4.1.1
library(BSgenome.Athaliana.TAIR.TAIR9)
library(motifmatchr)

library(dplyr)
library(ggplot2)

Txdb <- TxDb.Athaliana.BioMart.plantsmart28
seqlevels(Txdb) <- paste0("Chr", c(1:5, "M", "C"))

除了ATAC 多时间点的count矩阵文件之外，其他都已经在上一篇文章下载过了。

# 下载LEC2在整个基因组上的ChIP-seq结果
download.file(url = "https://raw.githubusercontent.com/shangguandong1996/FindIT2_paper_relatedCode/master/rawdata/ChIP/TF_ChIP_bindingSite.bed",
              destfile = "TF_ChIP_bindingSite.bed")

# 下载ATAC peak
download.file(url = "https://raw.githubusercontent.com/shangguandong1996/FindIT2_paper_relatedCode/master/rawdata/ATAC/ATAC_mergePeak.bed",
              destfile = "ATAC_mergePeak.bed")

# 下载ATAC 多时间点的count矩阵
download.file(url = "https://raw.githubusercontent.com/shangguandong1996/FindIT2_paper_relatedCode/master/rawdata/ATAC/ATAC_peakNormCount.rda",
              destfile = "ATAC_peakNormCount.rda"
              )

# 下载RNA-seq的差异分析结果
download.file(url = "https://raw.githubusercontent.com/shangguandong1996/FindIT2_paper_relatedCode/master/rawdata/RNA/LEC2_RNA_diffResult.rda",
              destfile = "LEC2_RNA_diffResult.rda")

# 下载公共数据库中各种TF ChIP-seq的peak文件
download.file(url = "https://remap.univ-amu.fr/storage/remap2022/tair10/tf/MACS2/remap2022_all_macs2_TAIR10_v1_0.bed.gz",
              destfile = "remap2022_all_macs2_TAIR10_v1_0.bed.gz")

由于用的是上一篇文章下载过的数据，所以指向的路径是上一篇文章的地址

ChIP_peak_GR <- loadPeakFile("../2021-09-24-intro-findit2-find-influential-tf-module-part1/TF_ChIP_bindingSite.bed")

ChIP_peak_GR
## GRanges object with 3223 ranges and 2 metadata columns:
##          seqnames            ranges strand |   feature_id     score
##             <Rle>         <IRanges>  <Rle> |  <character> <numeric>
##      [1]     Chr1       37901-38160      * |    TF_ChIP_1  0.747831
##      [2]     Chr1       51301-51660      * |    TF_ChIP_2  1.368107
##      [3]     Chr1       51801-51810      * |    TF_ChIP_3  0.765108
##      [4]     Chr1       68451-68660      * |    TF_ChIP_4  0.797340
##      [5]     Chr1     116401-116660      * |    TF_ChIP_5  0.853717
##      ...      ...               ...    ... .          ...       ...
##   [3219]     Chr5 26903801-26904210      * | TF_ChIP_3219  0.829693
##   [3220]     Chr5 26905801-26906010      * | TF_ChIP_3220  1.008028
##   [3221]     Chr5 26906401-26906760      * | TF_ChIP_3221  0.990938
##   [3222]     Chr5 26933051-26933410      * | TF_ChIP_3222  0.814937
##   [3223]     Chr5 26933851-26934110      * | TF_ChIP_3223  0.943218
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

load("../2021-09-24-intro-findit2-find-influential-tf-module-part1/LEC2_RNA_diffResult.rda")
diff_RNA
## # A tibble: 37,051 x 3
##    gene_id   log2FoldChange   padj
##    <chr>              <dbl>  <dbl>
##  1 AT1G01010      -0.0169    0.510
##  2 AT1G01020       0.000174  1.00 
##  3 AT1G03987      -0.478     0.191
##  4 AT1G01030      -0.00859   1.00 
##  5 AT1G01040      -0.000254  1.00 
##  6 AT1G03993       0        NA    
##  7 AT1G01050      -0.000576  1.00 
##  8 AT1G03997       0        NA    
##  9 AT1G01060      -0.00174   1.00 
## 10 AT1G01070      -0.00384   1.00 
## # ... with 37,041 more rows

ATAC_peak_GR <- loadPeakFile("../2021-09-24-intro-findit2-find-influential-tf-module-part1/ATAC_mergePeak.bed")

ATAC_peak_GR
## GRanges object with 22994 ranges and 2 metadata columns:
##           seqnames            ranges strand |  feature_id     score
##              <Rle>         <IRanges>  <Rle> | <character> <numeric>
##       [1]     Chr1            30-875      * |      ATAC_1         0
##       [2]     Chr1         1392-1953      * |      ATAC_2         0
##       [3]     Chr1         2164-3625      * |      ATAC_3         0
##       [4]     Chr1         8268-8966      * |      ATAC_4         0
##       [5]     Chr1       13654-14698      * |      ATAC_5         0
##       ...      ...               ...    ... .         ...       ...
##   [22990]     Chr5 26969184-26969915      * |  ATAC_22990         0
##   [22991]     Chr5 26970542-26971504      * |  ATAC_22991         0
##   [22992]     Chr5 26972304-26973186      * |  ATAC_22992         0
##   [22993]     Chr5 26974280-26974825      * |  ATAC_22993         0
##   [22994]     Chr5 26975157-26975438      * |  ATAC_22994         0
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

load("ATAC_peakNormCount.rda")
head(peakCountNorm_ATAC)
##        WT_E5_0h_R1 WT_E5_0h_R2 WT_E5_4h_R1 WT_E5_4h_R2 WT_E5_8h_R1 WT_E5_8h_R2
## ATAC_1   126.65266    143.0386    220.6055    224.7059    171.1754    204.2058
## ATAC_2    93.45245    103.4747    106.0053    132.7808    151.9182    152.2089
## ATAC_3   354.75043    402.9427    621.7065    569.9358    487.8500    406.5208
## ATAC_4   158.62324    144.2559    183.3605    183.8503    165.8262    180.5708
## ATAC_5   222.56439    258.6868    203.4155    275.7754    171.1754    192.8610
## ATAC_6   201.04574    175.9070    137.5203    134.8235    145.4991    119.1200
##        WT_E5_16h_R1 WT_E5_16h_R2 WT_E5_24h_R1 WT_E5_24h_R2 WT_E5_48h_R1
## ATAC_1     184.5536     152.9906     205.0690     182.4882     126.4342
## ATAC_2     153.0956     178.4891     204.3619     184.2152     105.5451
## ATAC_3     440.4120     413.7701     506.3083     472.0515     307.8398
## ATAC_4     191.8938     119.3790     221.3331     190.5476     180.3062
## ATAC_5     197.1368     200.5104     208.6047     223.3609     133.0308
## ATAC_6     139.4638     114.7430     185.2692     156.0073     124.2354
##        WT_E5_48h_R2 WT_E5_48h_R3 WT_E5_72h_R1 WT_E5_72h_R2 WT_E5_72h_R3
## ATAC_1     128.4671    151.36817     134.9386    158.33027     112.9270
## ATAC_2     117.0983     97.89572      84.6225    119.51381     134.0553
## ATAC_3     345.6106    403.10002     408.2464    362.62738     446.6083
## ATAC_4     120.5090    176.04777     137.2257    150.15838     143.5266
## ATAC_5     172.8053    166.99858     152.0918    196.12523     214.9257
## ATAC_6     120.5090    106.12225     101.7757     79.67588     129.6840

# Remap2022的bed文件
database_ChIP_GR <- loadPeakFile("../2021-09-24-intro-findit2-find-influential-tf-module-part1/remap2022_all_macs2_TAIR10_v1_0.bed.gz")

database_ChIP_GR
## GRanges object with 4843030 ranges and 4 metadata columns:
##             seqnames            ranges strand |             feature_id
##                <Rle>         <IRanges>  <Rle> |            <character>
##         [1]        1            1-1033      * | GSE143831.MED12.Col-..
##         [2]        1            1-1053      * | GSE124750.AGO4.Col_i..
##         [3]        1            1-1194      * | GSE124750.AGO6.Col_i..
##         [4]        1            1-1198      * | GSE124750.AGO9.Col_i..
##         [5]        1            1-1234      * | GSE124750.AGO9.Col_i..
##         ...      ...               ...    ... .                    ...
##   [4843026]        5 26975133-26975442      * | GSE166543.EICBP-B.Co..
##   [4843027]        5 26975155-26975443      * | GSE124750.AGO4.Col_i..
##   [4843028]        5 26975174-26975444      * | GSE108673.SPT6.Col-0..
##   [4843029]        5 26975211-26975325      * | GSE60141.TGA4.Col-0_..
##   [4843030]        5 26975213-26975502      * | GSE60141.TRP2.Col-0_..
##                 score     itemRgb     thick
##             <numeric> <character> <IRanges>
##         [1]   4.41114     #8C381C       500
##         [2]  17.08726     #674F67       313
##         [3]  21.07903     #556C10       294
##         [4]  58.29619     #22DBC3       415
##         [5]   7.67439     #22DBC3       267
##         ...       ...         ...       ...
##   [4843026]   6.47935     #CDE461  26975253
##   [4843027]  10.07379     #674F67  26975286
##   [4843028]  12.09363     #F8D7EE  26975298
##   [4843029]   5.75710     #DB9A42  26975291
##   [4843030] 355.36154     #F7A31C  26975406
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

integrate_ChIP_RNA

跟上一篇文章一样，我通过整合ChIP-seq和RNA-seq来拿到LEC2的靶基因，并选取top1000做进一步地分析。


# 我希望calcRP_TFHit能够将我bed文件里面的score考虑在内
# 所以我重命名了socre列为featrure_score
colnames(mcols(ChIP_peak_GR))[2] <- "feature_score"

mmAnno_TFHit <- mm_geneScan(ChIP_peak_GR,
                            Txdb,
                            upstream = 2e4,
                            downstream = 2e4)
## >> checking seqlevels match...       2021-10-26 20:46:57
## >> your peak_GR seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5...
## >> your Txdb seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5 ChrM ChrC...
## Good, your Chrs in peak_GR is all in Txdb
## ------------
## annotatePeak using geneScan mode begins
## >> preparing gene features information and scan region...        2021-10-26 20:46:57
## >> some scan range may cross Chr bound, trimming...      2021-10-26 20:46:58
## >> finding overlap peak in gene scan region...       2021-10-26 20:46:58
## >> dealing with left peak not your gene scan region...       2021-10-26 20:46:58
## >> merging two set peaks...      2021-10-26 20:46:58
## >> calculating distance and dealing with gene strand...      2021-10-26 20:46:58
## >> merging all info together ...     2021-10-26 20:46:58
## >> done      2021-10-26 20:46:58

TFHit_RP <- calcRP_TFHit(mmAnno = mmAnno_TFHit,
                         Txdb = Txdb,
                         decay_dist = 1000,
                         report_fullInfo = FALSE)
## >> calculating peakCenter to TSS using peak-gene pair...     2021-10-26 20:46:58
## >> calculating RP using centerToTSS and feature_score        2021-10-26 20:46:59
## >> merging all info together     2021-10-26 20:46:59
## >> done      2021-10-26 20:46:59

merge_data <- integrate_ChIP_RNA(result_geneRP = TFHit_RP,
                                 result_geneDiff = diff_RNA,
                                 lfc_threshold = 0.58,
                                 padj_threshold = 0.05)

# 这里我取前1000个基因作为LEC2的靶基因
LEC2_target <- merge_data$data$gene_id[1:1000]

build peak-gene link

由于这篇文章讨论的是如何寻找调控你感兴趣peak的那些TF，所以首先我们需要定义一个peak集合。最常见的peak集合就是通过差异分析得到的上调或者下调的peak，当然也可以是做完聚类之后每个类群里面的peak。这里我用FindIT2前面的mmAnno和peakGeneCor模块来演示如何为LEC2靶基因寻找相关联的peak。

首先是用mmAnno module中的mm_geneBound函数来寻找LEC2靶基因的关联peak。mm_geneBound的原理可见前面的文章，大致就是通过最近基因注释和最近peak注释两步注释来关联peak。尽管这样找到的peak不一定是最全的，但一定会为每个基因都至少找到一个peak。

mm_geneBound(peak_GR = ATAC_peak_GR,
             Txdb = Txdb,
             input_genes = LEC2_target) -> mmAnno_geneBound
## >> checking seqlevels match...       2021-10-26 20:46:59
## >> your peak_GR seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5...
## >> your Txdb seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5 ChrM ChrC...
## Good, your Chrs in peak_GR is all in Txdb
## >> using mm_nearestGene to annotate Peak...      2021-10-26 20:46:59
## >> checking seqlevels match...       2021-10-26 20:46:59
## >> your peak_GR seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5...
## >> your Txdb seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5 ChrM ChrC...
## Good, your Chrs in peak_GR is all in Txdb
## ------------
## annotating Peak using nearest gene mode begins
## >> preparing gene features information...        2021-10-26 20:46:59
## >> finding nearest gene and calculating distance...      2021-10-26 20:46:59
## >> dealing with gene strand ...      2021-10-26 20:47:00
## >> merging all info together ...     2021-10-26 20:47:00
## >> done      2021-10-26 20:47:00
## It seems that there 212 genes have not been annotated by nearestGene mode
## >> using distanceToNearest to find nearest peak of these genes...        2021-10-26 20:47:00
## >> merging all anno...       2021-10-26 20:47:00
## >> done      2021-10-26 20:47:00

当然进一步地，大家也可以用mm_geneScan来划定每个基因的扫描范围，从而找到更多的关联peak，只不过关联的peak越多，相对应的噪音可能也就越多。由于这里我们的目的只是想找一个LEC2靶基因关联的集合，从而来验证函数的功效，所以没必要引入太多的关联peak。

由于这套ATAC-seq关联的数据集还含有一组多时间点的peak count矩阵，所以这里我们还可以使用peakGeneCor module里面的enhancerPromoterCor函数来寻找基因关联的peak。大致原理就是尽量为每个基因寻找一个最近的peak作为promoter，然后寻找较远的peak，通过计算远端peak与近端promoter的相关性，从而找到关联的远端peak。尽管这种方法可能并不是那么地准确，因为近端promoter的变化并不一定能够表征基因自身的变化，但在缺乏对应的表达量数据或者样本量不是那么地充分的时候，这个方法也可以作为一个参考。

# 我设置up/down_scanPromoter为500bp，即如果一个基因TSS 500bp以内有peak
# 则最近的那个peak被认为是这个基因的promoter

# 我设置up/down_scanEnhancer为2e4，如果一个基因TSS 2e4bp以内有peak
# 那么这些peak就会与基因的promoter计算相关性
enhancerPromoterCor(peak_GR = ATAC_peak_GR,
                    Txdb = Txdb,
                    up_scanPromoter = 1000,
                    down_scanPromoter = 1000,
                    up_scanEnhancer = 2e4,
                    down_scanEnhacner = 2e4,
                    peakScoreMt = peakCountNorm_ATAC) -> mmAnno_enhancerPromoter
## >> checking seqlevels match...       2021-10-26 20:47:00
## >> your peak_GR seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5...
## >> your Txdb seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5 ChrM ChrC...
## Good, your Chrs in peak_GR is all in Txdb
## >> using scanPromoter parameter to scan promoter for each gene...        2021-10-26 20:47:00
## >> checking seqlevels match...       2021-10-26 20:47:00
## >> your peak_GR seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5...
## >> your Txdb seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5 ChrM ChrC...
## Good, your Chrs in peak_GR is all in Txdb
## >> some scan range may cross Chr bound, trimming...      2021-10-26 20:47:01
## >> there are 22117 gene have scaned promoter
## >> using scanEnhancer parameter to scan Enhancer for each gene...        2021-10-26 20:47:01
## >> checking seqlevels match...       2021-10-26 20:47:01
## >> your peak_GR seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5...
## >> your Txdb seqlevel:Chr1 Chr2 Chr3 Chr4 Chr5 ChrM ChrC...
## Good, your Chrs in peak_GR is all in Txdb
## >> some scan range may cross Chr bound, trimming...      2021-10-26 20:47:02
## >> calculating cor and pvalue, which may be time consuming...        2021-10-26 20:47:03
## >> merging all info together...      2021-10-26 20:47:28
## >> done      2021-10-26 20:47:29

# 我以相关性大于0.8，pvalue<0.01为阈值来挑选distal enhancer-promoter link
mmAnno_enhancerPromoter_filter <- subset(mmAnno_enhancerPromoter, cor > 0.8 & pvalue < 0.01)

当然，如果还有对应的RNA-seq数据，也可以使用peakGeneCor module里面的peakGeneCor来寻找基因关联的peak。

然后我们来整合peak，这样我们就得到了在全局性的peak中，LEC2靶基因所关联的peak集合。后面分析的目的就是想要看看这些LEC2关联的peak中富集了什么TF或者说什么TF可能调控了这些peak。

relatedPeak_1 <- mmAnno_geneBound$feature_id
relatedPeak_2 <- mmAnno_enhancerPromoter_filter %>% 
  as_tibble() %>% 
  filter(gene_id %in% LEC2_target) %>% 
  pull(feature_id)
relatedPeak_3 <- metadata(mmAnno_enhancerPromoter)$mm_promoter_tidy %>% 
  filter(gene_id %in% LEC2_target) %>% 
  pull(promoter_feature)

relatedPeak <- unique(c(relatedPeak_1, relatedPeak_2, relatedPeak_3))

length(relatedPeak)
## [1] 1531

# 当然，relatedPeak_1，即mmAnno_geneBound的结果已经包含了大部分的relatedPeak了
# 所以我们单纯地用mmAnno_geneBound的结果也行
mean(relatedPeak %in% unique(relatedPeak_1))
## [1] 0.9150882
mean(relatedPeak %in% unique(relatedPeak_2))
## [1] 0.1404311
mean(relatedPeak %in% unique(relatedPeak_3))
## [1] 0.5453952

findIT_enrichFisher

在进一步讲解函数之前，我想要先提下在使用HOMER findMotifsGenome.pl这个命令中的一个小问题。现在大部分的ATAC-seq文章会用这个函数去得到感兴趣peak集合中的富集TF，常规地就是对上调或者下调的peak去使用这个命令。但根据我的经验来看，如果你在使用findMotifsGenome.pl去Finding Enriched Motifs的时候，最好不要用随机产生的location作为背景，而是应该用全部的ATAC peak集合。即findMotifsGenome.pl里面的-bg参数应该设置为全部的ATAC peak。不然地话，所产生的结果就很有可能会是ATAC-seq建库本身所富集的东西，而不是跟你生物学过程密切相关的TF了。这种情况在你感兴趣的peak数目较多的时候尤其严重，最典型的现象就是你经常会在富集结果中看到某一家族的TF。

findIT_enrichFisher的原理跟上一篇中的findIT_TTPair一模一样，都是基于Fisher精确检验来查看你感兴趣的集合，相比于全局性的集合，更加富集哪些TF。只不过findIT_TTPair要求的是一个感兴趣的基因集和一个两列的TF-target数据库。findIT_enrichFisher则要求的是一个感兴趣的peak集合和GRanges格式的TF peak location数据库。数据库的格式跟上一篇的findIT_TFHit是一样的。

我们首先用Remap2022的公共数据库来做下富集

# 首先我们需要对数据库的binding site GRange对象做一些修改

# 将seqlevel修改一下
(seqlevels(database_ChIP_GR) <- paste0("Chr", seqlevels(database_ChIP_GR)))
## [1] "Chr1" "Chr2" "Chr3" "Chr4" "Chr5"

# 将feature_id修改成TF_id
colnames(mcols(database_ChIP_GR))[1] <- "TF_id"

# peak_GR参数包含了你全部ATAC peak的location
# 同时还需要有一列feature_id来标识peak的id
# 你的input_feature_id必须是这其中的一部分
result_enrichFisher_Remap <- findIT_enrichFisher(input_feature_id = relatedPeak,
                                                 peak_GR = ATAC_peak_GR,
                                                 TF_GR_database = database_ChIP_GR)

# 加as.data.frame只是为了能在博客里面展现所有的结果
result_enrichFisher_Remap %>% 
  as.data.frame() %>% 
  head()
##                                       TF_id num_TFHit_inputFeature inputRatio
## 1             GSE145387.VAL1.Col-0_seedling                    885   885/1531
## 2          GSE94307.HEC1.Col-0_seedling_12d                    789   789/1531
## 3        GSE142621.BRM.Col-0_seedling_BRIP1                    798   798/1531
## 4 GSE26722.REV.Col-0_seedling_10d-DEX-90min                    711   711/1531
## 5      GSE20176.AP1.Col_apical-meristem_DEX                    676   676/1531
## 6        GSE122579.AT2G17950.Col-0_seedling                    845   845/1531
##      bgRatio        pvalue odds_ratio          padj        qvalue rank
## 1 6925/22994 4.827909e-120   3.497960 3.311946e-117 1.834606e-117    1
## 2 6425/22994  8.537808e-91   2.985903  5.848398e-88  1.622184e-88    2
## 3 6706/22994  7.258159e-85   2.866194  4.964581e-82  9.193669e-83    3
## 4 5746/22994  1.633122e-79   2.828857  1.115422e-76  1.551466e-77    4
## 5 5487/22994  3.225643e-73   2.736451  2.199888e-70  2.451488e-71    5
## 6 7747/22994  1.705800e-71   2.598544  1.161650e-68  1.080340e-69    6

跟之前一样，因为数据库里面没有LEC2，所以结果里面LEC2并不在top。我们可以把LEC2加入到数据库里面再跑一次。

mcols(database_ChIP_GR) <- mcols(database_ChIP_GR)[, 1, drop = FALSE]

mcols(ChIP_peak_GR) <- NULL

ChIP_peak_GR$TF_id <- "LEC2"

database_ChIP_GR <- c(database_ChIP_GR, ChIP_peak_GR)

result_enrichFisher_Remap <- findIT_enrichFisher(input_feature_id = relatedPeak,
                                                 peak_GR = ATAC_peak_GR,
                                                 TF_GR_database = database_ChIP_GR)

result_enrichFisher_Remap %>% 
  as.data.frame() %>% 
  head()
##                                       TF_id num_TFHit_inputFeature inputRatio
## 1                                      LEC2                    821   821/1531
## 2             GSE145387.VAL1.Col-0_seedling                    885   885/1531
## 3          GSE94307.HEC1.Col-0_seedling_12d                    789   789/1531
## 4        GSE142621.BRM.Col-0_seedling_BRIP1                    798   798/1531
## 5 GSE26722.REV.Col-0_seedling_10d-DEX-90min                    711   711/1531
## 6      GSE20176.AP1.Col_apical-meristem_DEX                    676   676/1531
##      bgRatio        pvalue odds_ratio          padj        qvalue rank
## 1 2216/22994  0.000000e+00  16.622502  0.000000e+00  0.000000e+00    1
## 2 6925/22994 4.827909e-120   3.497960 3.311946e-117 9.173028e-118    2
## 3 6425/22994  8.537808e-91   2.985903  5.848398e-88  1.081456e-88    3
## 4 6706/22994  7.258159e-85   2.866194  4.964581e-82  6.895251e-83    4
## 5 5746/22994  1.633122e-79   2.828857  1.115422e-76  1.241173e-77    5
## 6 5487/22994  3.225643e-73   2.736451  2.199888e-70  2.042907e-71    6

跟之前一样，如果你的物种没有公共数据库，也可以考虑用ATAC motif scan或者footprint scan。用motif scan来做Fisher精确检验其实就很类似于findMotifGenome.pl了，只不过这里对应的是-bg参数设置的是全部的peak集合。

# 写个函数来整理提取出JASPAR2020中的内容
getJasparMotifs <- function(species = "Arabidopsis thaliana", 
                                collection = "CORE", ...) {
  opts <- list()
  opts["species"] <- species
  opts["collection"] <- collection
  opts <- c(opts, list(...))
  out <- TFBSTools::getMatrixSet(JASPAR2020::JASPAR2020, opts)
  if (!isTRUE(all.equal(TFBSTools::name(out), names(out)))) 
    names(out) <- paste(names(out), TFBSTools::name(out), sep = "_")
  return(out)
}

motifs_JASPAR2020 <- getJasparMotifs()

# call motif
motif_ix <- matchMotifs(motifs_JASPAR2020, 
                        ATAC_peak_GR, 
                        genome = BSgenome.Athaliana.TAIR.TAIR9, 
                        out = "positions", 
                        bg = "genome")
motif_GR <- unlist(motif_ix)
motif_GR$TF_id <- gsub("(MA[0-9]{4}\\.[0-9]_)", "", names(motif_GR))

motif_GR
## GRanges object with 2147996 ranges and 2 metadata columns:
##                  seqnames            ranges strand |     score       TF_id
##                     <Rle>         <IRanges>  <Rle> | <numeric> <character>
##   MA0121.1_ARR10     Chr1         1490-1497      - |   10.9848       ARR10
##   MA0121.1_ARR10     Chr1         1904-1911      - |   10.9848       ARR10
##   MA0121.1_ARR10     Chr1       33532-33539      - |   10.9848       ARR10
##   MA0121.1_ARR10     Chr1       55227-55234      - |   11.1123       ARR10
##   MA0121.1_ARR10     Chr1       62176-62183      - |   11.1123       ARR10
##              ...      ...               ...    ... .       ...         ...
##    MA1329.2_ZHD1     Chr5 26876218-26876230      - |   18.9609        ZHD1
##    MA1329.2_ZHD1     Chr5 26879928-26879940      - |   19.6284        ZHD1
##    MA1329.2_ZHD1     Chr5 26880196-26880208      - |   19.6284        ZHD1
##    MA1329.2_ZHD1     Chr5 26889355-26889367      - |   20.1565        ZHD1
##    MA1329.2_ZHD1     Chr5 26901566-26901578      - |   19.1066        ZHD1
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

result_enrichFisher_motifScan <- findIT_enrichFisher(input_feature_id = relatedPeak,
                                                     peak_GR = ATAC_peak_GR,
                                                     TF_GR_database = motif_GR)

result_enrichFisher_motifScan %>% 
  as.data.frame() %>% 
  head()
##   TF_id num_TFHit_inputFeature inputRatio    bgRatio       pvalue odds_ratio
## 1  FUS3                    449   449/1531 2805/22994 5.935729e-78   3.364828
## 2  ABI3                    283   283/1531 1369/22994 7.183527e-71   4.254414
## 3  LEC2                    400   400/1531 2455/22994 1.788268e-70   3.339894
## 4  HBI1                    321   321/1531 2799/22994 4.101352e-24   2.032429
## 5 TCP24                    368   368/1531 3412/22994 5.647347e-23   1.914660
## 6  TCP3                    363   363/1531 3437/22994 5.067003e-21   1.859006
##           padj       qvalue rank
## 1 2.760114e-75 5.377072e-76    1
## 2 3.333156e-68 3.253715e-69    2
## 3 8.279683e-68 5.399869e-69    3
## 4 1.894825e-21 9.288356e-23    4
## 5 2.603427e-20 1.023166e-21    5
## 6 2.330822e-18 7.650181e-20    6

findIT_enrichWilcox

findIT_enrichWilcox相比于前面的findIT_enrichFisher，其还额外考虑了TFBS在你感兴趣peak集合的数目。举个例子而言，假如我们想要看看LEC2在感兴趣的peak集合中的富集情况，在findIT_enrichFisher中，LEC2 motif scan得到的TFBS对于每个感兴趣的peak，只会有在和不在两种结果。所以最终产生的就是一个列联表。

	TF Hit	TF not Hit	total
select peak	400	1131	1531
left peak	2055	19408	21463
total	2455	20539	22994

(matrix_fisher <- matrix(c(400, 1131, 2055, 19408), nrow = 2, byrow = TRUE))
##      [,1]  [,2]
## [1,]  400  1131
## [2,] 2055 19408

fisher.test(matrix_fisher, alternative = "greater")
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matrix_fisher
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  3.005858      Inf
## sample estimates:
## odds ratio 
##   3.339894

fisher.test(matrix_fisher, alternative = "greater")$p.value
## [1] 1.788268e-70

但实际上每个感兴趣的peak里面可能会有1个,2个甚至多个LEC2 TFBS，而这些TFBS在peak里面的数目其实是有意义的。所以findIT_enrichWilcox会随机选择一个背景peak集合，计算TFBS在背景peak集合中的数目分布，然后将感兴趣集合与背景集合进行wilcox秩和检验，从而计算哪些TF是显著的。显然，如果一个TF的TFBS在感兴趣peak集合中很有多的hit，而在背景peak集合中hit数目很少，就说明这个TF很有可能调控了你的peak。

set.seed(20160806)
result_enrichWilcox_Remap <- findIT_enrichWilcox(input_feature_id = relatedPeak,
                                                 peak_GR = ATAC_peak_GR,
                                                 TF_GR_database = database_ChIP_GR,
                                                 background_peaks = NULL,
                                                 background_number = 3000)

result_enrichWilcox_Remap %>% 
  as.data.frame() %>% 
  head()
##                                       TF_id input_meanMotifScore
## 1                                      LEC2            0.7668191
## 2             GSE145387.VAL1.Col-0_seedling            0.6544742
## 3          GSE94307.HEC1.Col-0_seedling_12d            0.6610059
## 4        GSE142621.BRM.Col-0_seedling_BRIP1            0.5963423
## 5        GSE122579.AT2G17950.Col-0_seedling            0.7459177
## 6 GSE26722.REV.Col-0_seedling_10d-DEX-90min            0.5682560
##   bg_meanMotifScore        pvalue          padj        qvalue rank
## 1        0.06833333 4.989869e-297 3.467959e-294 1.776393e-294    1
## 2        0.30200000  5.699338e-87  3.955341e-84  1.014482e-84    2
## 3        0.29966667  3.371684e-71  2.336577e-68  4.001065e-69    3
## 4        0.30266667  1.622719e-60  1.122922e-57  1.444220e-58    4
## 5        0.38866667  1.494534e-55  1.032723e-52  1.064108e-53    5
## 6        0.27633333  3.777533e-55  2.606498e-52  2.241336e-53    6

set.seed(20160806)
result_enrichWilcox_motifScan <- findIT_enrichWilcox(input_feature_id = relatedPeak,
                                                     peak_GR = ATAC_peak_GR,
                                                     TF_GR_database = motif_GR,
                                                     background_peaks = NULL,
                                                     background_number = 3000)

result_enrichWilcox_motifScan %>% 
  as.data.frame() %>% 
  head()
##   TF_id input_meanMotifScore bg_meanMotifScore       pvalue         padj
## 1  FUS3            0.3932071         0.1206667 2.061566e-61 9.586283e-59
## 2  ABI3            0.3696930         0.0780000 1.422173e-52 6.598883e-50
## 3  LEC2            0.3507511         0.1163333 4.487473e-47 2.077700e-44
## 4 TCP24            0.2906597         0.1616667 1.096020e-18 5.063613e-16
## 5  HBI1            0.2769432         0.1663333 2.237878e-17 1.031662e-14
## 6  TCP5            0.2129327         0.1120000 3.065327e-17 1.410050e-14
##         qvalue rank
## 1 1.958488e-59    1
## 2 6.755322e-51    2
## 3 1.421033e-45    3
## 4 2.603048e-17    4
## 5 4.251967e-16    5
## 6 4.853434e-16    6

这里面有一些小的注意点需要考虑，首先就是背景peak的选择问题。如果你不指定背景peak，那么函数默认地就会随机从全部的ATAC peak集合中抽取background_number数目的peak来作为背景集合。当然，你也可以使用你自己指定的背景peak填入background_peaks中。

还有就是findIT_enrichWilcox和findIT_enrichFisher函数选择哪个的问题。findIT_enrichFisher基于的是Fisher精确检验，原理跟基因的GO富集很相似，比较容易解释，也不需要你自己选择背景基因集，但是其只是单纯地考虑了TFBS在peak集合中的有和无。而findIT_enrichWilcox则更多地考虑了TFBS在peak集合中的数目分布，但是背景集合的选取是一个问题。怎么样选取比较有意义的“背景集合”是一个值得思考的问题，我这里用的是随机选取，但是否有更好的集合来充当背景的任务是一个有待商榷的事情。这里我还是推荐大家两个函数都试一下，因为从我自己的试用结果来看，两者的结果是差不多的。

还有一个小问题就是什么时候设置set.seed的问题。大家只要看下函数的manual，如果如果参数里面有background相关的东西，就说明这个函数需要随机选取背景集合，那么这时候就需要加set.seed了。

FindIT2 R

Intro_FindIT2:find_influential_TF module part2

Pre-preparation

integrate_ChIP_RNA

build peak-gene link

findIT_enrichFisher

findIT_enrichWilcox

Guandong Shang

PhD Candidate

Related