ConTra - identification of conserved transcription factor binding sites

ConTra - a promoter alignment analysis tool to identify transcription factor binding sites across species

This help page provides a step-by-step walkthrough of the ConTra analysis process.

Step 1: Select the type of analysis and specify the genes

ConTra consists of two parts, each giving a suggestive answer to a different type of question:
Visualization part: for users who want to identify (conserved) binding sites for specific transcription factors that may regulate the gene of interest (default option).
Exploration part: for users who do not know which transcription factors may regulate their gene of interest. In that case ConTra provides a list of all possible transcription factors, ranked by their binding probability to the promoter of interest. The binding probability of a transcription factor (TF) is determined by a score that takes into account the number of predicted binding sites for that TF, the phylogenetic depth of each predicted site (roughly, the number of other species in the alignment that have a predicted binding site for the same position weight matrix (PWM) within a window of up to 200% of the ungapped reference site length on each side of the reference site) and the Information Content (IC) of the predicting PWM. An e-mail address is required for this part, because the analysis can take up to 20 minutes.
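To make two ingredients of this score concrete, the sketch below computes the information content of a PWM and counts the phylogenetic depth of one reference site. It is an illustration only: the function names, the uniform background (a maximum of 2 bits per position), the use of alignment-column coordinates and the rule that a hit must lie entirely inside the window are all assumptions. How ConTra combines these quantities into its final score is not described on this page and is not reproduced here.

```python
import math


def information_content(pwm):
    """Total information content (in bits) of a PWM.

    pwm is a list of columns; each column maps a base to its probability.
    Assumes a uniform background, so each column contributes at most 2 bits.
    """
    total = 0.0
    for column in pwm:
        entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
        total += 2.0 - entropy
    return total


def phylogenetic_depth(ref_site, ungapped_length, hits_by_species,
                       window_factor=2.0):
    """Count the other species with a hit for the same PWM near the site.

    ref_site is a (start, end) pair of alignment columns for the reference
    site. The window extends window_factor * ungapped_length columns on each
    side of it (2.0 corresponds to the 200% mentioned in the text).
    hits_by_species maps a species name to a list of (start, end) hits.
    Assumption: a hit counts only if it lies entirely inside the window.
    """
    flank = int(window_factor * ungapped_length)
    low, high = ref_site[0] - flank, ref_site[1] + flank
    return sum(
        1
        for hits in hits_by_species.values()
        if any(start >= low and end <= high for start, end in hits)
    )
```

For a 10-base reference site at columns 100 to 109, the window in this sketch spans columns 80 to 129, so a hit at columns 85 to 94 in another species adds one to the depth.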

To start a ConTra analysis, the user can provide different kinds of information about the gene(s) or transcript(s) to analyse.
ConTra treats the following terms as valid input: an HGNC-approved human gene name, symbol or alias, an Entrez Gene ID (number), a human RefSeq ID (starting with NM_) or a human Ensembl ID (starting with ENSG for a gene ID or ENST for a transcript ID). The text area accepts multiple genes or transcripts, with every identifier on a new line (without a comma or other separator).
It may happen that no match is found for an identifier or that no alignment is available, for instance because the gene and/or transcripts are relatively new and no alignments are available yet from UCSC and Ensembl. In that case the user may provide their own alignment for the analysis. Alignment programs that can be used for this are: CLUSTALW, CHAOS/DIALIGN, LAGAN, MAVID, BLASTZ & MULTIZ (Miller Lab), Pecan and T-COFFEE. The uploaded file may be in UCSC maf format, clustal format or fasta format. Avoid special characters (e.g. +, /, |) in the file name; use _ instead. When uploading a file with multiple alignments in fasta format, position tags can be used to indicate the position of the pieces relative to the TSS, as described below:

fasta format:
>reference species, transcript (group) 1
AAATTTGGGCCC
>aligning species, transcript (group) 1
AAATTTG-GCCC
//
>reference species, transcript (group) 2
AAATGTGGGCCC
>aligning species, transcript (group) 2
AAATTTG-GCCC
>yet another aligning species, transcript (group) 2
AGATTTG-GCCC
//
>reference species, transcript (group) 3, 2 concatenated pieces of a blastz-net pairwise alignment
!-800<->-789!AAATGTGGGCCC!-655<->-646!CGTAATT---GGA
>aligning species, transcript (group) 3, 2 concatenated pieces of a blastz-net pairwise alignment
!-800<->-789!AAATTTG-GCCC!-655<->-646!CGTAATTCGTGGA

Position tags (format !-2378<->-1500!) give the positions of the alignment piece that follows them, relative to the transcription start site (TSS) of the reference species transcript(s). If no position tags are used, the last base of the reference species sequence is taken to be position -1. If a blastz-net pairwise alignment contains several alignment pieces that need to be analyzed or visualized as one, the pieces can be concatenated with position tags as shown above. This concatenation matters mainly for the exploration part. In the visualization part, the pieces can be shown together only in the files for Jalview, not in the HTML view.
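The tag syntax can be illustrated with a small parser. This is a sketch of how such a line could be read, not ConTra's own code; the names TAG, split_tagged_reference and piece_is_consistent are hypothetical. It applies the default stated above (last base at -1) to an untagged reference sequence, and the consistency check assumes that the tagged span should equal the ungapped length of the reference piece, as is the case in the example lines above.

```python
import re

# A position tag looks like !-800<->-789! (start and end relative to the TSS).
TAG = re.compile(r"!(-?\d+)<->(-?\d+)!")


def split_tagged_reference(line):
    """Split a reference sequence line into (start, end, sequence) pieces."""
    parts = TAG.split(line.strip())
    if len(parts) == 1:
        # No tags: the last base of the reference sequence is position -1.
        sequence = parts[0]
        ungapped = len(sequence.replace("-", ""))
        return [(-ungapped, -1, sequence)]
    if parts[0]:
        raise ValueError("sequence found before the first position tag")
    # With two capture groups, re.split yields: '', start, end, seq, start, ...
    return [
        (int(parts[i]), int(parts[i + 1]), parts[i + 2])
        for i in range(1, len(parts), 3)
    ]


def piece_is_consistent(piece):
    """Check that the tagged span matches the ungapped piece length."""
    start, end, sequence = piece
    return end - start + 1 == len(sequence.replace("-", ""))


pieces = split_tagged_reference(
    "!-800<->-789!AAATGTGGGCCC!-655<->-646!CGTAATT---GGA"
)
for piece in pieces:
    print(piece, piece_is_consistent(piece))
```

Only the reference line is checked this way, because the aligning species can contain extra bases where the reference has gaps.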

Users can also upload their own collection of PWMs in this step; that collection is then used alongside the built-in PWM libraries.
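As a recap of the input rules for this step, the sketch below pre-checks a list of identifiers and an upload file name before submission. It is a local convenience under stated assumptions, not part of ConTra: the function names and category labels are hypothetical, any identifier that matches none of the prefixes is simply treated as a gene name, symbol or alias, and the file name rule replaces every character other than letters, digits, dot, underscore and hyphen, which is broader than the three example characters given above.

```python
import re


def classify_identifier(identifier):
    """Guess the identifier type from the prefixes listed in this step."""
    token = identifier.strip()
    if token.startswith("NM_"):
        return "refseq"
    if token.startswith("ENSG"):
        return "ensembl_gene"
    if token.startswith("ENST"):
        return "ensembl_transcript"
    if token.isdigit():
        return "entrez_gene_id"
    # Assumption: everything else is a gene name, symbol or alias.
    return "gene_name_symbol_or_alias"


def parse_identifier_field(text):
    """Read one identifier per line, skipping empty lines."""
    return [
        (line.strip(), classify_identifier(line))
        for line in text.splitlines()
        if line.strip()
    ]


def safe_file_name(name):
    """Replace special characters such as +, / and | with underscores."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", name)


# "ENSG00000000000" is a made-up placeholder, not a real gene ID.
print(parse_identifier_field("ATOH7\nNM_145178\nENSG00000000000\n"))
print(safe_file_name("my+alignment/v1|final.maf"))
```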

Step 2: Select which promoters ConTra should use

In the second step ConTra shows a list of genes that (fuzzily) matched the identifier provided in the first step. Some genes encode several transcripts, whose possible context-dependent differential expression could be (partially) explained by the presence or absence of binding sites for context-specific TFs in their promoter regions. Therefore, ConTra groups the transcripts by transcription start site (TSS), resulting in one or more groups of transcripts that share the same promoter. The user can select up to 10 different promoters from one or more genes. More information is provided through links to Entrez Gene and the Ensembl gene view for genes, and to the UCSC genome browser and the Ensembl transcript view for transcripts.
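The grouping of transcripts by TSS can be pictured as follows. This is a minimal sketch under assumptions: the function name, the tuple layout, the transcript IDs and the coordinates are all hypothetical, and transcripts are treated as sharing a promoter only when chromosome, strand and TSS position are identical.

```python
from collections import defaultdict


def group_by_tss(transcripts):
    """Group transcript IDs that share the same transcription start site.

    transcripts is an iterable of (transcript_id, chromosome, strand, tss).
    Returns a dict mapping (chromosome, strand, tss) to a list of IDs.
    """
    groups = defaultdict(list)
    for transcript_id, chromosome, strand, tss in transcripts:
        groups[(chromosome, strand, tss)].append(transcript_id)
    return dict(groups)


# Hypothetical transcripts of one gene; IDs and positions are made up.
example = [
    ("transcript_A", "chr1", "+", 1000),
    ("transcript_B", "chr1", "+", 1000),
    ("transcript_C", "chr1", "+", 4200),
]
for promoter, members in group_by_tss(example).items():
    print(promoter, members)
```

Each key in the result then corresponds to one selectable promoter in this step.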

ConTra uses the promoter region 2000 nt upstream of the TSS by default, but lengths of 500, 1000, or 5000 nt can be specified.

Step 3: Select which alignment to use for promoter analysis

The available multiple sequence alignments are: UCSC multiz 17-way, UCSC multiz 28-way, Pecan 12 amniota vertebrates, 9 eutherian mammals and 4 catarrhini primates.

The pairwise alignments are blastz-net alignments from UCSC.

In our experience, multiz 17-way alignments are in most cases best suited for exploration and visualization.

Step 4: Select the stringency of TFBS prediction (and PWMs)

ConTra uses position weight matrices (PWMs) from four databases: TransFac, JASPAR core and JASPAR phyloFACTS, and a Protein Binding Microarray (PBM) derived collection of homeodomain TF PWMs (see here and UniProbe).
Up to 20 PWMs can be selected for visualization. When doing an exploration, one can (de)select any of the libraries.
For more information on a specific PWM, each identifier links back to its source database (except for the PBM-derived homeodomain PWMs). A valid login for the user's own TransFac license is needed to see the latest TransFac information. JASPAR core and JASPAR phyloFACTS are open-access collections.
TransFac is the largest collection of PWMs, whereas JASPAR core is a higher-quality, non-redundant collection.
JASPAR phyloFACTS is a collection of PWMs that represent ultra-conserved sequences among vertebrates, as calculated by Xie et al.; these sequences do not necessarily represent the binding specificity of a TF. Using phyloFACTS in combination with 'wet lab' experiments could lead to the identification of new TFs.

ConTra results

Visualization part: The results page shows the user's input followed by the output. The input consists of the promoters (each indicated by a (group of) transcript(s)), the alignment type and the PWM prediction stringency. The output starts with links to two downloadable files per promoter: one contains the alignment (.fasta file) and the other contains the positions and colors of the binding sites (feature color or .fc file). These files can be opened with Jalview to create output for publication. First open the alignment file via File > Input alignment > from file and select the file. Next use the "Load Features" command from the "File" menu; the predicted binding sites (features) are now colored. The window with the features can be activated by selecting "Feature Settings..." in the "View" menu. In this window the different binding sites can be selected and deselected with the checkboxes. For more information on the use of Jalview, see the Jalview documentation. Below the download links, the results page shows the promoter alignments with the predicted TFBSs in a different color per PWM. The TFBSs can be made (in)visible by (un)checking the checkboxes at the left.
Exploration part: The results page shows the user's input in the same way as for the visualization part (described above). The output consists of a list of the 100 highest-scoring PWMs per promoter, from which a selection can be forwarded to the visualization part (this opens a new window). The full ranked list is also available for download. When multiple promoters were explored, a file is created that contains the summed ranks of the separate ranked lists. This file may give an idea of which binding TFs are common to the group of promoters.
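The idea behind the summed ranks can be sketched in a few lines. This is an illustration, not ConTra's implementation: the function name and the PWM labels are hypothetical, ranks are assumed to start at 1 for the best PWM, every list is assumed to rank the same set of PWMs, and ties are broken alphabetically.

```python
def summed_ranks(ranked_lists):
    """Sum the rank of each PWM over several ranked lists (best first).

    Returns (pwm, summed_rank) pairs, lowest sum first. A low sum means the
    PWM ranked high for most promoters in the group.
    """
    totals = {}
    for ranking in ranked_lists:
        for rank, pwm in enumerate(ranking, start=1):
            totals[pwm] = totals.get(pwm, 0) + rank
    return sorted(totals.items(), key=lambda item: (item[1], item[0]))


# Hypothetical rankings for three promoters; the PWM names are made up.
rankings = [
    ["PWM_A", "PWM_B", "PWM_C"],
    ["PWM_B", "PWM_A", "PWM_C"],
    ["PWM_B", "PWM_C", "PWM_A"],
]
print(summed_ranks(rankings))
```

In this toy input PWM_B has the lowest summed rank (2 + 1 + 1 = 4), so it would be the strongest candidate for a TF common to all three promoters.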

Example: ATOH7

The UCSC multiz 17-way alignment of ATOH7 (Ath5) clearly shows the conserved TATA box and two conserved E-box motifs as described by Del Bene et al (PLoS Genetics, 2007).

Step 1: Enter "ATOH7" as gene name and click next.
Step 2: Select the first transcript, NM_145178 (RefSeq), and set the number of upstream bases to use to 200. Click next.
Step 3: Use the default UCSC multiz 17-way alignment and click next.
Step 4: Select the V$EBOX_Q6_01, V$TATA_C and V$TATA_01 position weight matrices (PWMs) from the alphabetical list of PWMs from Transfac and click "Run Contra".
Results:
[Figure: example ConTra results for ATOH7]

For publication purposes, the ConTra output files (.fasta and .fc) can be imported into Jalview for visualization and header editing, and then exported again, e.g. as EPS or PDF.

Validated examples analysed in ConTra

Gene	TFBS	References
ATOH7	E-box, TATA box	[Del Bene et al, 2007]
CDH1 (E-cadherin)	E-box, GC box, CCAAT, AP-2	[Comijn et al 2001] [Bar-Eli, 2001]
MX1	ISRE (ISGF-3), STAT1, Sp1	[Nakade et al, 1997] [Ronni et al, 1998] [Altmann et al, 2004]
IL-2	AP-1, Sp1, Oct-1	[Kim et al, 2006]
FASN	SREBP, E-box	[Foufelle et al, 2002] [Magana et al, 1996]
ACACA (ACC)	SREBP	[Foufelle et al, 2002] [Magana et al, 1997]
