Global Mapping of Transcription Factor Binding Sites by Sequencing Chromatin Surrogates: a Perspective on Experimental Design, Data Analysis, and Open Problems

Mapping genome-wide binding sites of all transcription factors (TFs) in all biological contexts is a critical step toward understanding gene regulation. The state-of-the-art technologies for mapping transcription factor binding sites (TFBSs) couple chromatin immunoprecipitation (ChIP) with high-throughput sequencing (ChIP-seq) or tiling array hybridization (ChIP-chip). These technologies have limitations: they are low-throughput with respect to surveying many TFs. Recent advances in genome-wide chromatin profiling, including development of technologies such as DNase-seq, FAIRE-seq and ChIP-seq for histone modifications, make it possible to predict in vivo TFBSs by analyzing chromatin features at computationally determined DNA motif sites. This promising new approach may allow researchers to monitor the genome-wide binding sites of many TFs simultaneously. In this article, we discuss various experimental design and data analysis issues that arise when applying this approach. Through a systematic analysis of the data from the Encyclopedia Of DNA Elements (ENCODE) project, we compare the predictive power of individual and combinations of chromatin marks using supervised and unsupervised learning methods, and evaluate the value of integrating information from public ChIP and gene expression data. We also highlight the challenges and opportunities for developing novel analytical methods, such as resolving the one-motif-multiple-TF ambiguity and distinguishing functional and non-functional TF binding targets from the predicted binding sites. Electronic Supplementary Material The online version of this article (doi:10.1007/s12561-012-9066-5) contains supplementary material, which is available to authorized users.

timate of fold changes when c ijk is small. Different values of ∆ (1, 5, and 10) were tried, and they produced the same qualitative conclusions. In the paper, we only show results based on ∆ = 5 for simplicity. After obtaining b ijk s, replicates were averaged to obtain a ij = ∑ k b ijk /n j . n j is the number of replicate samples in dataset j. If control samples were available (e.g., Input controls in ChIP-seq experiments), control read counts were subtracted from the ChIP read counts to obtain a ij = Here j ′ indicates the control dataset corresponding to the ChIP dataset j. a ij provides a one number summary for each bin i and dataset j. Next, each motif site was extended 250bp to both ends. The a ij s within the 500bp window were averaged to obtain x sj . y s was computed similarly.

Supplemental Method 2: Various prediction methods
We compared nine different methods for predicting TFBSs. Unsupervised methods: 1. Single surrogate (SS): TFBSs are predicted based on each individual surrogate. It was assumed that y s is a monotone function of x sj , and ranking of motif sites based on x sj determines the ranking of motif sites based on y s . Top ranked motif sites were predicted to be TFBSs. 2. Principal component of all surrogates (AS PC1): Let x s = (x s1 , . . . , x sJ ) T be the vector that contains the intensity values of all surrogates at motif site s. The first principal component (PC1) [5] of all x s s was computed and motif sites were ranked accordingly. Since the direction of unique PCs can only be determined up to a positive or negative sign, motif sites can be ranked based on either PC1 scores or −1×PC1. Both rankings were tested, and the ranking that produced better results was reported.
Supervised methods: 1. Best subset of multiple surrogates, linear regression (MS L): Using a training TF ChIP-seq dataset, the following linear model y s = β 0 + β j1 x sj1 + · · · + β j k x sj k + error is fit using a subset of surrogates (j 1 , · · · , j k ). All possible combinations of surrogates were enumerated and tested. The exhaustive search finds the best subset of surrogates using the Mallows' Cp as the selection criterion.
To compare different methods, TF ChIP-seq was used as gold standard. For instance, we train a model using TF A, say EGR1, and use the model to predict binding sites of TF B, say GABP. We can then use TF B ChIP-seq data to benchmark prediction performance. TF B motif sites with y s > 1 were treated as true binding sites. For each prediction method, motif sites were rank ordered based on the predicted TF binding intensities. The positive predictive value (PPV) (i.e., the percentage of true positives among the top predictions) was reported for the top N predictions, where N = 1, 2, · · · , etc. This created a curve showing the PPV as a function of N . Curves of different methods were compared. The area under the receiver operating characteristic curves (AUC) was also computed and compared across methods. In parallel, we also computed and compared the Pearson correlation between y s (from the actual TF B ChIP-seq data) and the predicted binding intensities for each prediction model.

Supplemental Method 3: Clustering analysis
To generate Figure 6, we first took the average of x sj as computed in Supplemental Method 1 over the TF bound motif sites (where the TF ChIP-seq y s > 1) and the TF non-bound sites (where y s ≤ 1) respectively for TF t and surrogate data type j. This created the average surrogate signals for the bound and non-bound sites, denoted by u tj and v tj respectively. Next, we subtracted v tj from u tj to obtain r tj = u tj − v tj . Since x sj , u tj and v tj were all on log2 scale, r tj describes the enrichment of the surrogate signals in the bound class compared to the non-bound class for surrogate data type j and TF t. Using r tj s, we then conducted a hierarchical clustering using Euclidean distance and complete linkage, and the result is shown as a heat map.

Supplemental Method 4: Sensitivity analysis
To generate Figure 9, ChIP-seq binding regions for each TF in Hepg2 and G-m12878 cell lines were detected using CisGenome [4] at 1% FDR. We recorded the center of binding regions as peak sites. When combining peak sites from Hepg2 and Gm12878, we merged any two binding regions if the distance between their centers were less than 250bp. Now these peak sites detected in other cell lines can serve the same function as our motif sites, therefore from now on we call both of them candidate sites. Using the mapped DNase-seq reads downloaded from ENCODE, we counted the number of reads in a 500bp flanking window of each candidate site which was used as the predictor.
For Figures 8 and 9, to calculate the false discovery rate (FDR) for each candidate site, we need to learn the null distribution of DNase intensities. For that purpose, we randomly sampled 1,000,000 loci without replacement from the whole genome and used these loci as our null motif sites. Then we counted DNase read numbers for the null motif sites in the same way as before, and obtained the background null distribution p 0 (x). For a given TF A and a candidate binding sites list, we ranked sites according to the decreasing order of DNase read count. At each cutoff k of the rank list, we computed the pvalue based on the null distribution p 0 (x) and estimated the false discovery rate (FDR) using the Benjamini-Hechberg procedure [1].
To compute the sensitivity in Figure 9, we first used CisGenome [4] to call peaks for TF ChIP-seqs in K562, and adopted these peaks as the gold standard. For Figure 8, we used the ChIP-seq peaks containing the canonical motifs as the gold standard. Next, for each FDR cutoff, we detected candidate sites passing the cutoff, and counted the number of gold standard ChIP-seq peaks that were discovered by these sites. Discovery of a gold standard peak is defined as existence of one or more positive motif sites located within 250bp of the center of the peak. Sensitivity (i.e. [Number of ChIP-seq peaks found among the top predictions]/[Total number of ChIP-seq peaks]) of each candidate sites list was computed accordingly.

Supplemental Method 5: One-motif-multiple-TFs
We computed E2F4 and E2F6 ChIP-seq binding intensities at 347,952 E2F motif sites using the way described in Supplemental Method 1. The intensities intuitively are log2 ratio of normalized ChIP and Input control read counts, with an offset ∆ introduced to the counts. Since antibodies for different TFs may have different efficiencies, the ChIP-seq intensities for E2F4 and E2F6 may not be directly comparable. Therefore, we quantile normalized the E2F4 and E2F6 ChIP-seq intensities and focused on studying their correlation. We plotted the quantile normalized intensities of E2F4 against DHS, and quantile normalized E2F6 intensities against DHS. E2F motif sites with DHS intensity greater than log 2 (10) were predicted to be bound by a TF. For these motif sites, we also plotted the quantile normalized intensities of E2F4 against the quantile normalized intensities of E2F6. We repeated the same analysis for USF against MYC at 139,222 E-box motif sites. E-box consensus CACGTG was mapped to human genome without allowing any mismatch. The obtained motif sites were used for comparing USF and MYC.

Supplemental Method 6: Functional target analysis
To study functional targets, we first analyzed MYC ENCODE ChIP-seq data from HelaS3 and GEO gene expression data (GSE5823) before and after TF perturbation. ChIP-seq were analyzed using CisGenome [4] with the default settings to call peaks. 3346 peaks corresponding to 2772 unique genes were found (F DR < 10%). Gene expression data were analyzed using limma [11] which identified 3019 differentially expressed genes at FDR = 10%. In total, 659 unique genes had at least one overlapping MYC binding peak in the -25kb ∼ +10kb region around the transcription start site (TSS) and was differentially expressed. These genes were defined as the functional targets of MYC in HelaS3.
We then mapped MYC motif to genome and extracted DNase-seq read count for each motif site. For each gene, motif sites within the -25kb ∼ +10kb were collected and their DNase I read counts were added up. Genes were then ranked based on the DNase I signal. Figure 11a shows what percentage of top ranked genes are functional targets. Next, for each gene, we also computed its Pearson correlation with MYC across 13,182 gene expression microarray samples in the GEO compendium. Genes were also rank ordered based on the absolute value of the Pearson correlation. The average of the DNase rank and GEO rank was taken as the final score of each gene, and genes were re-ranked accordingly. For this new ranking method, the percentage of top ranked genes that are functional targets is also shown in Figure 11a. Based on the figure, integrating GEO data clearly improved functional target identification.
We performed similar analysis by using MYC ChIP-seq in place of DNaseseq to rank genes. Integrating GEO data again improved. We rank the surrogate data types in terms of AUC for predicting TFBSs of each TF and list the top five surrogate data types.  , let a jk be the number of motif sites in bin j that belong to cluster k. The proportion of cluster k motif sites in bin j is p jk = a jk /A j . Let N k be the total number of cluster k motif sites in the whole genome and N be the total number of all motif sites in the genome. P k = N k /N is the genome-wide proportion of cluster k motif sites. For each bin j and cluster k, we computed λ jk = p jk /P k , the relative enrichment of cluster k motif site in bin j compared to the genome-wide proportion. λ jk is plotted across the genome for two representative clusters: (c) cluster 4, and (d) cluster 3. Different chromosomes are concatenated together in the plots. Since each bin has only a few motif sites (typically 2-4 on average) for each cluster, the bin level λ jk has large statistical variation, therefore we used smoothing spline to obtain more stable estimates of λ jk (red curve). We see that the red curves fluctuate around 1 (blue line) across the genome, and the fluctuation is relatively mild. For each cluster, there is no strong regionalized distribution of motif sites. Similar analyses were performed for all other TFs and clusters, and similar results were obtained (data not shown).