RPol II binding patterns around the bidirectional promoter region
We have examined the RPol II binding pattern around the TSS of common protein coding gene and the bidirectional promoter region. To avoid the RPol II binding effect of a gene's neighbourhood, focus was only on the genes whose transcript lengths were > 10,000 bp and no other genes were presented within 10,000 bp of their TSSs. This identified 4,120 expressed genes and 2,682 unexpressed genes in HeLa cells, based on the gene expression array data. We divided the genomic regions into multiple 20 bp bins and calculated the total number of RPol II derived fragments located in each bins within 2,000 bp upstream and downstream of the TSS, producing a RPol II binding landscape in the regulatory regions of the expressed genes. Not unexpectedly, a significant enrichment of the RPol II signal on top of the TSS was seen (Figure 1A), which gradually declined towards both upstream and downstream (transcript) regions. We sub-classified expressed genes based upon their expression levels, and genes with higher expression levels tended to display higher than average RPol II signals around the TSS (Figure 1A). For the coding genes with undetectable (Absent) expression levels, RPol II enrichment around the TSS was markedly lower.
Removing the same distance between the different isoforms of genes, 1564 head-to-head paired genes in which 522 pairs included the non-coding genes were used to analysis. The numbers of RPol II fragments in each 20 bp bins within 2,000 bp upstream and downstream of the centre of bidirectional regions were calculated. Based on the distance between 2 opposite genes, the bidirectional regions were divided into 5 categories, and the average RPol II signals of each category were calculated (Figure 2B). RPol II signals formed a significant bi-peak shape, and there was a peak-valley in the middle of bidirectional region, which indicated that 2 promoters were located in the bidirectional region. As the distance got shorter, the 2 peaks tended to be close to each other and the higher peak-valley was evident. Within a 200 bp distance, the 2 peaks tended to overlap and form one peak in the middle of the bidirectional region. Therefore, the RPol II signals can be used to identify each regulatory region and the overlap region for 2 promoters of the opposite gene pair.
Identification of the bi-peak shape in the bidirectional promoter region
In the bidirectional promoter region, RPol II signals usually formed 3 shapes, bi-peak, single peak and no peak. Bi-peak denotes 2 different promoters located in the bidirectional promoter region, and each promoter regulates its closest gene. Single peak had 2 options: (1) the regulatory regions of 2 opposite genes were very close, RPol II bound the same regulatory region; and (2) RPol II bound only one gene promoter, and no RPol II bound the other.
To find the exact regulatory regions, it was important to identify the RPol II bi-peak shape between the TSSs of 2 opposite genes. We adopted 3 features to identify bi-peak distribution of RPol II on bidirectional promoters: P1 - the number of RPol II fragments at TSS of the gene on the positive strand, P2 - the number of RPol II fragments at TSS of the gene on the negative strand, and V - the lowest number of RPol II fragments between the TSSs (Figure 2A). Two parameters, the significant ratio and the difference ratio of RPol II binding pattern between two peaks, were described by these 3 features (see Method). The significant ratio presents the enrichment of RPol II in the lower peak, and the difference ratio presents the difference of RPol II binding between 2 peaks. Larger significant ratio and smaller difference ratio support a strong bi-peak distribution of RPol II fragments.
To estimate the cutoff of significant ratio and difference ratio for bi-peak distribution of RPol II, we simulated some bi-peak shapes using both expressed genes, and single peak shapes with an expressed gene and an unexpressed gene. The paired genes randomly selected from 4,120 expressed genes and 2,682 unexpressed genes were arranged head-to-head, and RPol II signals around the TSSs of genes were summed. According to the distance between paired genes, we simulated 5 sets, each of which included 10,000 bi-peak shapes and 10,000 single peak shapes.
The area under the curve (AUC) in the Receiver Operator Characteristic (ROC) reached 0.78 in differentiating all bi-peak shapes and single peak shapes of RPol II signals (Figure 3), suggesting an effective distinguished power. Figure 3 clearly shows that the distinguished accuracy of our model is higher for paired genes that have longer distance.
Identification of regulatory regions in bidirectional promoter
In the regulatory region, the total number of RPol II binding fragments should follow a Poisson distribution; a Poisson mixture model had already been used to identify the microRNA regulatory regions based on the genome wide RPol II binding patterns of protein coding genes [19]. Briefly, the 5 parameters S, B, T, K
p
, and K
t
determine the Poisson parameter, λ, associated with the distribution of the number of RPol II binding fragments. In the bidirectional promoter region that showed bi-peak distribution of RPol II signals, 10 parameters were used to describe RPol II binding pattern of 2 different promoters (Figure 2B). Particle Swarm Optimization (PSO) algorithm was used to maximize probability, and gave the optimized parameters. Genomic regions with < 90% RPol II signal decay compared with those ones in TSS-bin were considered as potential regulatory regions. Two regulatory regions were recognized for each gene pair previously characterized as bi-peak.
The regulatory regions located in the bidirectional region were divided into 4 categories, double promoters which included 2 individual promoters, left promoters that only cover the TSS of gene in negative strand, right promoters that only cover the TSS of gene in right stand, and centre promoters that cover 2 TSSs. Figure 4A shows the RPol II binding pattern of four types of promoters. To assess the efficiency of bi-peak identification, false discovery rate (FDR) was calculated by comparing the number of simulated bi-peaks and single peaks. Using an FDR ≤ 0.2, the significant ratio ≥ 1.2 and difference ratio ≤ 4, 249 bi-peaks (Additional file 1) and 1094 single peaks (Additional file 2) were identified from the bidirectional promoter regions (Figure 4A). After using the Poisson mixture model to identify the regulatory region of each gene, the overlaps of regulatory regions of 249 gene pairs are shown in Figure 4B. No overlap of regulatory regions were found for 55 pairs of genes, and the width of overlapped regulatory regions in 135 (54%) gene pairs were < 200 bp (Figure 4C), which indicates that the promoters of these opposite genes are relatively independent. For RPol II signals following the single peak shapes, 86 left promoters, 76 right promoters and 932 centre promoters in Hela cell were identified (Figure 4A).
Expression of paired genes
Expression levels of each of the paired genes were checked to examine the function of identified regulatory regions. Removing the genes not printed in the expression array, 233 pairs of promoters, 67 left promoters, 51 right promoters, and 770 centre promoters were kept. In total 3 microarrays experiments, genes presented at least twice were expressed, others were not. Figure 5 shows the number and percentage of expressed left gene, right gene, and 2 genes in bidirectional region under the different promoter categories. The percentage of expressed left gene and right gene were similar in double promoter and centre promoter categories, indicating that the regulation of promoters to each strand was unbiased. In the left promoter category, the left gene was presented more than the right one (90 vs 71%), which indicates that left promoters regulate left genes; in contrary, the right promoters tended to regulate the right genes.
STAT1 binding site in the promoter region
To see the effect of transcription factor on bidirectional region, overlapping of ChIP-seq-derived STAT1 binding regions, identified by Gerstein's group [11], with the regulatory regions of four types of promoters (Additional file 3, 4, 5 and 6) were explored. To avoid the bias where left promoters and right promoters only cover the TSS of one gene, the regions were extend to 500 bp of the opposite gene body. Figure 6 shows the number of the centre of ChIP-seq-derived STAT1-enriched regions located in left gene body, right gene body, and the middle of two genes in 4 promoter categories. Intuitively, in left promoter category more STAT1 binding sites were located in the left gene body, and same pattern was found in right category, which suggests that STAT1 is involved in the primary promoter.