See also the About page.


How to use COTRASIF

To use COTRASIF's PWM and HMM-PWM tools, one needs to follow these simple steps:

  1. go to the COTRASIF start page and select the type of task to run;
  2. select the species promoter database to use for the current search;
  3. for the PWM method (and optionally for HMM-PWM), either enter the PFM manually (copy-paste) or choose one from the JASPAR matrices select field;
  4. for the HMM-PWM method, enter input sequences; for the PWM method, specify a threshold (if different from the default 0.75);
  5. for both HMM-PWM and PWM methods, enter the email address to which the results link should be sent when the calculation is complete; the email address is also used as an identifier for the conservation filter, so be sure to use the same email if you plan to use the conservation filter;
  6. click "submit task";
  7. go to the status page, and wait until it finishes auto-reloading and displays direct download links for the results file. Alternatively, wait for the email with a link to the TSV (tab-separated values) results file. Download the results file, and open it in your favorite spreadsheet editor for manual analysis, or use it in your own data processing pipeline.

To use the conservation filter, one must have two completed tasks in COTRASIF. The user must input the email address used for previous tasks, then select two tasks to run through the filter, submit the form, and save the returned text file (the result is immediate).

PFM format

A well-formed PFM (also known as position-counts matrix, PCM, and also position-specific scoring matrix, PSSM) can be defined by the following list of requirements:

This definition is valid both for integer-count matrices and for fractional (floating-point) matrices.

This is an example of a valid integer-counts matrix:

1 12 0 0 0 0 0 7 1 1 0 0 0 2 1
8 0 0 0 0 0 13 1 7 0 0 3 8 7 8
2 1 13 0 0 0 0 1 2 0 0 0 0 2 3
2 0 0 13 13 13 0 4 3 12 13 10 5 2 1

Another valid example, this time a fractional PFM:

0.002392631 0.018539367 0.023213127 0.055630882 0.028694009
0.85478965 0.133623499 0.115525923 0.914481573 0.079327923
0.118833399 0.003637736 0.854790957 0.016689723 0.142543009
0.02398432 0.844199398 0.006469993 0.013197821 0.749435059

Important: the PWM search method uses integer and fractional PFMs differently. If an integer PFM is supplied, a pseudocount correction is applied, and the PFM is converted into a PWM using the formula w = log2( ((f + 0.25*sqrt(N)) / (N + sqrt(N))) / 0.25 ), where 'w' is the weight value, 'f' is the integer frequency (count) value, N is the column sum of the PFM, and sqrt(N) and 0.25*sqrt(N) are the pseudocounts.

If a fractional PFM is supplied, it is converted into a PWM using the same formula, but a fixed small pseudocount (10^-9) is used instead of sqrt(N). Effectively, zero-value positions of the user-supplied fractional matrix translate to a PWM value of -29.897, which is ~10 times lower than the lowest PWM value obtained from a non-zero position. This leads to a very low similarity score for sequences with a mismatch in a zero-value position of the user-submitted fractional PFM.
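The conversion described above can be sketched in a few lines of Python. This is only an illustration of the formula as written here, not COTRASIF's actual code. The function names, the representation of a PFM as a list of four rows, and the rule used to tell an integer PFM from a fractional one (every entry is a whole number) are assumptions made for this example.

```python
import math

BACKGROUND = 0.25               # uniform background frequency from the formula
FRACTIONAL_PSEUDOCOUNT = 1e-9   # fixed pseudocount used for fractional PFMs


def is_integer_pfm(pfm):
    """Assumption: a PFM counts as 'integer' when every entry is a whole number."""
    return all(float(value).is_integer() for row in pfm for value in row)


def pfm_to_pwm(pfm):
    """Convert a PFM (list of 4 rows of equal length) into a PWM of log2 weights."""
    integer = is_integer_pfm(pfm)
    n_cols = len(pfm[0])
    pwm = [[0.0] * n_cols for _ in pfm]
    for j in range(n_cols):
        col_sum = sum(row[j] for row in pfm)  # N in the formula
        # sqrt(N) for integer counts, a fixed tiny value for fractional input
        pseudo = math.sqrt(col_sum) if integer else FRACTIONAL_PSEUDOCOUNT
        for i, row in enumerate(pfm):
            freq = (row[j] + BACKGROUND * pseudo) / (col_sum + pseudo)
            pwm[i][j] = math.log2(freq / BACKGROUND)
    return pwm
```

With a fractional column such as 0, 1, 0, 0, the zero entries come out at about -29.897, which matches the value quoted above.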

Promoter definition in COTRASIF

In COTRASIF, "promoter" is defined as 2000 bp upstream from the TSS of the gene, plus the first 5' UTR (if any). Thus, the length of the majority of promoters is between 2000 and several thousand nucleotides. In rare cases, the length can be less than 2000 nucleotides (for genes located near chromosome termini).
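As a rough illustration of this definition, the sketch below computes promoter coordinates from a TSS position. It is not taken from the COTRASIF pipeline. It assumes 1-based inclusive coordinates, a TSS that is the first base of the 5' UTR, a 5' UTR treated as one contiguous stretch of `utr5_len` bases, and strand values of 1 and -1; the function and parameter names are made up for this example.

```python
UPSTREAM_BP = 2000  # upstream length from the promoter definition


def promoter_interval(tss, strand, utr5_len, chrom_len):
    """Return (start, end) of the promoter in forward-strand coordinates.

    The interval is clipped at the chromosome ends, which is why promoters
    near chromosome termini can be shorter than 2000 bp.
    """
    if strand == 1:
        start = max(1, tss - UPSTREAM_BP)        # upstream lies at lower coordinates
        end = tss + utr5_len - 1
    elif strand == -1:
        start = tss - utr5_len + 1
        end = min(chrom_len, tss + UPSTREAM_BP)  # upstream lies at higher coordinates
    else:
        raise ValueError("strand must be 1 or -1")
    return start, end
```

For a forward-strand gene with the TSS at position 5000 and a 150 bp UTR, this gives the interval 3000 to 5149, which is 2150 nucleotides long.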

Description of the results text file format

Results are presented as a TSV text file. This format is convenient both for manual processing (using any spreadsheet program) and for scripted processing/parsing. The file has a header line, which names the columns. All columns, including the column names in the header line, are separated with a single tab character.
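For scripted processing, a results file of this shape can be read with Python's standard csv module. The sketch below relies only on what is stated above (one header line, tab-separated columns); the function name is an assumption, and no particular column names are assumed.

```python
import csv


def parse_results(lines):
    """Parse tab-separated lines (header first) into a list of dicts.

    `lines` can be an open file or any iterable of text lines; each returned
    dict maps a column name from the header line to that row's value.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return list(reader)
```

To use it on a downloaded file, open the file in text mode with newline="" and pass the file object to parse_results. All values are returned as strings, so scores and positions need to be converted explicitly.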

Here is the description of each column:

About TRANSFAC matrices

COTRASIF allows easy selection of one of 398 TRANSFAC 7.0 Public matrices.

Some of the matrices had columns with different sums, that is, the columns did not add up to the same number of sequences or to a common normalization base. These matrices were re-normalized. Here is the list of such matrices.

Note: some matrices had identical column sums, but were normalized not to 1 (as accepted by COTRASIF), but to a different number (most commonly 100). See the detailed log of re-normalization for more details.
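One simple way to bring such a matrix to a column sum of 1 is to divide every entry by the sum of its column. The sketch below shows that idea only; it is not necessarily the exact procedure used to re-normalize the TRANSFAC matrices, and the function name and the list-of-rows representation are assumptions.

```python
def normalize_columns(pfm):
    """Scale every column of a PFM (list of rows) so that it sums to 1."""
    n_cols = len(pfm[0])
    sums = [sum(row[j] for row in pfm) for j in range(n_cols)]
    if any(total <= 0 for total in sums):
        raise ValueError("every column must have a positive sum")
    return [[row[j] / sums[j] for j in range(n_cols)] for row in pfm]
```

This handles both cases mentioned above: columns normalized to a common base other than 1 (for example 100), and columns whose sums differ from each other.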

Which matrix to use: TRANSFAC or JASPAR?

COTRASIF allows easy selection of one of the JASPAR CORE matrices.

So what is the difference between JASPAR and TRANSFAC?

JASPAR is "high-quality, manually-curated", while TRANSFAC is a "broad compilation of binding sites"; this distinction is based on my own limited knowledge.

Quoting the JASPAR website:

"The JASPAR CORE database contains a curated, non-redundant set of 123 profiles, derived from published collections of experimentally defined transcription factor binding sites for multicellular eukaryotes. The prime difference to similar resources (TRANSFAC, TESS etc) consist of the open data acess, non-redundancy and quality: JASPAR CORE is a smaller set that is non-redundant and curated.

When should it be used? When seeking models for specific factors or structural classes, or if experimental evidence is paramount."

TRANSFAC is also literature-derived (read more at TRANSFAC Release 7.0 Documentation).

TRANSFAC assigns a "quality rating" in the range 1..6 (where 1 is the best quality) to each matrix. In each individual case, this quality metric can help you decide which matrix to use.

TRANSFAC Public is at version 7.0 and is pretty old; to the best of my knowledge, the current version number is 12 (April 2009).

There is also a general consideration relevant to the quality of a matrix: the number of sequences used to build it. Generally, more sequences mean a better matrix.

Sometimes, there are multiple matrices for a single binding site.
Let us review in detail the process of selecting a proper matrix to search for NFkB binding sites.

If you register (for free) to get access to TRANSFAC Public 7.0, you will be able to see more details about each of the matrices for NFkB: M00194, M00208, M00054 (records M00051 and M00052 are subunits; I'm not showing them here, but they are also accessible).

Of these:

Of these, I would trust M00054 for its 40 sequences; M00208 should be interesting as well for the two extra positions it covers (a short matrix produces more false positives, so longer matrices are generally better).

JASPAR's NFkB matrix MA0061 also refers to PMID 8449662, and is built from 38 sequences; if PMID 1985897 (the second citation from TRANSFAC's M00054) is about viral genomes, then I'd say that JASPAR's MA0061 and TRANSFAC's M00054 are of roughly the same quality.

As this matrix is short, I'd start with a cut-off of 0.9 or even 0.95 to get initial predictions. The matrix was built from rat, mouse and human sequences, so any pair of these species would be good for the evolutionary conservation filtering step.

Hopefully this extended matrix selection example will help you use the proper matrix for your research.

Note on the duplicate lines in the TSV results file

As COTRASIF is transcript-centric (with one promoter defined for each transcript), there can be multiple promoters defined per gene. Most of the time the transcription start sites of alternative gene transcripts are the same, which leads to several identical promoters stored in our database for the gene. When the TFBS search is performed, those multiple promoters are translated to multiple found TFBS lines in the results file.

Starting with Ensembl release 48, we introduced duplicate-cleanup routines into our automatic promoter-importing pipeline. This significantly decreased the number of duplicate result lines. However, the solution appears to be incomplete, and some duplicates still make it into the results file.

Sample duplicate records, as of Ensembl release 52 (description and chromosome not shown):

ENSG00000197110  ENST00000355055  500  CAGTTTCTCTTTCCC  0.961344  -1  4862565
ENSG00000197110  ENST00000392072  500  CAGTTTCTCTTTCCC  0.961344  -1  4862565
Note that, in the sample above, every column shown is the same for these two lines except the Ensembl Transcript ID. In general, "promoter size" can be different between duplicate lines, while the absolute chromosomal "position" will be the same for true overlaps. The only thing which is always different is the Ensembl Transcript ID.

Line duplication is now quite a rare problem.
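If the remaining duplicates get in the way of your own pipeline, they can be collapsed after parsing the results file. The sketch below keeps the first line of each group of lines that agree on every column except the ones listed in `ignore`. The column names "transcript_id" and "promoter_size" are placeholders; replace them with the names from the header line of your results file.

```python
def drop_duplicate_hits(rows, ignore=("transcript_id", "promoter_size")):
    """Keep the first row of each group that matches on all non-ignored columns.

    `rows` is a list of dicts (one per result line, keyed by column name).
    """
    seen = set()
    unique = []
    for row in rows:
        # Build a hashable key from every column that is not ignored.
        key = tuple(sorted((name, value) for name, value in row.items()
                           if name not in ignore))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

Ignoring the promoter size follows from the note above that it can differ between duplicates, while the absolute chromosomal position stays in the comparison key.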

Conservation filter

The conservation filter (also referred to as the orthology filter and the [non-classical] phylogenetic footprinting method) helps to identify TFBSs occurring in the promoters of orthologous genes.

No sequence alignments are performed: the result is based solely on the presence (or absence) of the specified TFBS in the promoters of orthologs.

When applying the conservation filter, one can specify which orthology types to take into account. These are better described at the Ensembl website. One can also constrain the distance from the found TFBS to the TSS of each of the orthologs; this is then referred to as position-constrained orthology filtering.
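The presence/absence logic can be illustrated with a short sketch. This is a simplified reading of the description above, not the filter's actual implementation: the data layout (a dict from gene ID to a list of TFBS-to-TSS distances), the function name, and the interpretation of the position constraint as a maximum allowed distance to the TSS in both species are all assumptions.

```python
def conservation_filter(target_hits, reference_hits, orthologs,
                        max_tss_distance=None):
    """Return ortholog pairs in which both promoters contain a TFBS hit.

    target_hits / reference_hits: dict gene_id -> list of distances to the TSS.
    orthologs: iterable of (target_gene_id, reference_gene_id) pairs.
    max_tss_distance: optional limit applied to hits in both species.
    """
    def has_hit(hits, gene_id):
        distances = hits.get(gene_id, [])
        if max_tss_distance is None:
            return bool(distances)
        return any(abs(d) <= max_tss_distance for d in distances)

    return [(target, reference) for target, reference in orthologs
            if has_hit(target_hits, target) and has_hit(reference_hits, reference)]
```

No alignment is involved at any point: a pair passes as soon as each of the two promoters has at least one qualifying hit.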

When allowing orthology types other than ortholog_one2one, it is possible that TFBSs are found in several different orthologs in the reference genome. Currently, this is resolved by taking a single - best - orthology pair. Here, best only applies to the quality of the orthology pair, and has nothing to do with the reliability of the final result.

Previously, both orthology type and protein_percent_id were used to rank orthology pairs and choose the best pair. However, as protein_percent_id is not a proper measure of orthology quality (this being the reason for its removal from Ensembl Compara in release 54), now only the orthology type is used to decide which orthology pair is better. The hierarchy used (best to worst) is:

At the moment, if there are several candidate orthologs with identical orthology types, the one with the numerically smaller Ensembl Gene ID is taken.
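The selection rule (orthology type first, numerically smaller Ensembl Gene ID as the tie-breaker) could be sketched as follows. The hierarchy is passed in as an argument, ordered best to worst, because this sketch does not assume any particular ordering; the function names and the tuple layout are also assumptions.

```python
import re


def gene_number(gene_id):
    """Numeric part of an Ensembl Gene ID, e.g. 'ENSG00000197110' -> 197110."""
    match = re.search(r"(\d+)$", gene_id)
    if match is None:
        raise ValueError(f"no numeric suffix in gene ID: {gene_id!r}")
    return int(match.group(1))


def best_pair(candidates, hierarchy):
    """Pick the best candidate from (gene_id, orthology_type) tuples.

    `hierarchy` lists orthology types from best to worst; ties on type are
    broken by the numerically smaller gene ID.
    """
    rank = {orthology_type: i for i, orthology_type in enumerate(hierarchy)}
    return min(candidates, key=lambda c: (rank[c[1]], gene_number(c[0])))
```

A candidate whose orthology type is missing from `hierarchy` raises a KeyError, which makes an incomplete hierarchy visible instead of silently ranking that candidate last.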

The conservation filter output is tab-separated plain text, with the following columns:

Target species is the one selected first when choosing task results to compare; reference species is represented by the second selected task.

© Bogdan Tokovenko (2006 - 2011) and Rostyslav Golda (2008 - 2009)
Portions © Oleksiy Protas (2009)