Help

See also the About page.

How to use COTRASIF

To use COTRASIF's PWM and HMM-PWM tools, follow these steps:

  1. go to the COTRASIF start page (http://biomed.org.ua/COTRASIF/) and select the type of task to run;
  2. select the species promoter database to use for the current search;
  3. for the PWM method (and optionally for HMM-PWM), either enter a PFM manually (copy-paste) or choose one from the JASPAR matrices select field;
  4. for the HMM-PWM method, enter input sequences; for the PWM method, specify a threshold (if different from the default of 0.75);
  5. for both the HMM-PWM and PWM methods, enter the email address to which the results link should be sent when the calculation is complete; the email address also serves as an identifier for the conservation filter, so be sure to use the same email if you plan to use the conservation filter;
  6. click "submit task";
  7. go to the status page and wait until it stops auto-reloading and displays direct download links for the results file. Alternatively, wait for the email with a link to the TSV (tab-separated values) results file. Download the results file and open it in your favorite spreadsheet editor for manual analysis, or use it in your own data processing pipeline.

To use the conservation filter, one must have two completed tasks in COTRASIF. The user must input the email address used for previous tasks, then select two tasks to run through the filter, submit the form, and save the returned text file (the result is immediate).

PFM format

A well-formed PFM (also known as position-counts matrix, PCM, and also position-specific scoring matrix, PSSM) can be defined by the following list of requirements:

This definition is valid both for integer-count matrices and for fractional (floating-point) matrices.

This is an example of a valid integer-count matrix:

1 12 0 0 0 0 0 7 1 1 0 0 0 2 1
8 0 0 0 0 0 13 1 7 0 0 3 8 7 8
2 1 13 0 0 0 0 1 2 0 0 0 0 2 3
2 0 0 13 13 13 0 4 3 12 13 10 5 2 1

Another valid example, this time a fractional PFM:

0.002392631 0.018539367 0.023213127 0.055630882 0.028694009
0.85478965 0.133623499 0.115525923 0.914481573 0.079327923
0.118833399 0.003637736 0.854790957 0.016689723 0.142543009
0.02398432 0.844199398 0.006469993 0.013197821 0.749435059
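
A matrix in this layout can be checked before submission with a few lines of Python. The sketch below is only an illustration: the function names are hypothetical, and the checks (exactly four rows, equal row lengths, non-negative numbers) are assumptions about what "well-formed" means, not COTRASIF's own validation rules.

```python
def parse_pfm(text):
    """Parse whitespace-separated PFM text into a list of 4 rows of floats.

    Assumed checks: exactly 4 non-empty rows (one per nucleotide),
    all rows of equal length, no negative values.
    """
    rows = [[float(tok) for tok in line.split()]
            for line in text.strip().splitlines() if line.strip()]
    if len(rows) != 4:
        raise ValueError(f"expected 4 rows, got {len(rows)}")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same number of columns")
    if any(value < 0 for row in rows for value in row):
        raise ValueError("matrix values must be non-negative")
    return rows


def is_integer_pfm(rows):
    """True if every value is a whole number (a counts matrix)."""
    return all(value.is_integer() for row in rows for value in row)


def column_sums(rows):
    """Sum of each matrix column (N in the conversion formula)."""
    return [sum(column) for column in zip(*rows)]


integer_pfm = parse_pfm("""
1 12 0 0 0 0 0 7 1 1 0 0 0 2 1
8 0 0 0 0 0 13 1 7 0 0 3 8 7 8
2 1 13 0 0 0 0 1 2 0 0 0 0 2 3
2 0 0 13 13 13 0 4 3 12 13 10 5 2 1
""")
print(is_integer_pfm(integer_pfm), set(column_sums(integer_pfm)))
```

For the integer example above every column sums to 13, and the columns of the fractional example sum to 1 within rounding error.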

Important: the PWM search method treats integer and fractional PFMs differently. If an integer PFM is supplied, a pseudocount correction is applied, and the PFM is converted into a PWM using the formula w = log2( ((f + 0.25*sqrt(N)) / (N + sqrt(N))) / 0.25 ), where 'w' is the weight value, 'f' is the integer frequency (count) value, N is the column sum of the PFM, and sqrt(N) and 0.25*sqrt(N) are the pseudocounts.
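
The integer-PFM conversion can be reproduced offline. The following is a minimal sketch of the formula as stated in this section, not COTRASIF's actual code; the function name is hypothetical, and the guard against empty columns is an added assumption.

```python
import math


def integer_pfm_to_pwm(pfm):
    """Convert an integer-count PFM (4 rows) into a PWM.

    Each weight is log2(((f + 0.25*sqrt(N)) / (N + sqrt(N))) / 0.25),
    where N is the sum of the column that f belongs to.
    """
    width = len(pfm[0])
    pwm = [[0.0] * width for _ in pfm]
    for j in range(width):
        n = sum(row[j] for row in pfm)
        if n <= 0:
            # Assumption: a column without any counts is rejected.
            raise ValueError(f"column {j} has no counts")
        root = math.sqrt(n)
        for i, row in enumerate(pfm):
            pwm[i][j] = math.log2(((row[j] + 0.25 * root) / (n + root)) / 0.25)
    return pwm


# Two columns with N = 4: one uniform, one fully conserved.
pwm = integer_pfm_to_pwm([[1, 4], [1, 0], [1, 0], [1, 0]])
print([round(row[0], 3) for row in pwm])
print([round(row[1], 3) for row in pwm])
```

In the uniform column every weight is 0, because the corrected frequency equals the background of 0.25. In the conserved column the observed base gets log2(3) and each absent base gets -log2(3).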

If a fractional PFM is supplied, it is converted into a PWM using the same formula, but with a fixed small pseudocount (10^-9) in place of sqrt(N). Effectively, zero-value positions of the user-supplied fractional matrix translate to a PWM value of -29.897, which is ~10 times lower than the lowest PWM value obtained from a non-zero position. This leads to a very low similarity score for sequences with a mismatch at a zero-value position of the user-submitted fractional PFM.
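
The value -29.897 can be checked directly. The sketch below assumes that the fixed pseudocount replaces sqrt(N) in both places of the formula (so the numerator receives 0.25 times the pseudocount) and that the column sum N is 1; the function name is hypothetical.

```python
import math

PSEUDOCOUNT = 1e-9  # fixed pseudocount used for fractional PFMs


def fractional_weight(f, n=1.0, pseudocount=PSEUDOCOUNT):
    """PWM weight for a fractional frequency f in a column with sum n."""
    return math.log2(((f + 0.25 * pseudocount) / (n + pseudocount)) / 0.25)


print(round(fractional_weight(0.0), 3))   # zero-value position: -29.897
print(round(fractional_weight(0.25), 3))  # background frequency: close to 0
```

Under these assumptions a zero-value position reproduces the -29.897 quoted above, while a frequency equal to the 0.25 background scores approximately 0.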

Promoter definition in COTRASIF

In COTRASIF, "promoter" is defined as the 2000 bp upstream of the TSS of the gene, plus the first 5' UTR (if any). Thus, the length of most promoters is between 2000 and several thousand nucleotides. In rare cases, the length can be less than 2000 nucleotides (for genes located near chromosome termini).
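
To make the definition concrete, here is a rough sketch of how such a promoter interval could be computed. It is not the COTRASIF import pipeline. It assumes 1-based inclusive coordinates, a TSS that is the first base of the 5' UTR, a contiguous first 5' UTR, and strand given as 1 or -1; all names are hypothetical.

```python
UPSTREAM_BP = 2000  # upstream length from the definition above


def promoter_interval(tss, strand, utr5_len, chrom_len):
    """Return (start, end) of the promoter on the forward strand.

    The interval covers UPSTREAM_BP bases upstream of the TSS plus
    utr5_len bases of 5' UTR, clipped to the chromosome ends.
    """
    if strand not in (1, -1):
        raise ValueError("strand must be 1 or -1")
    if strand == 1:
        start = tss - UPSTREAM_BP
        end = tss + utr5_len - 1
    else:
        # On the reverse strand, "upstream" means higher coordinates.
        start = tss - utr5_len + 1
        end = tss + UPSTREAM_BP
    return max(start, 1), min(end, chrom_len)


def promoter_length(tss, strand, utr5_len, chrom_len):
    start, end = promoter_interval(tss, strand, utr5_len, chrom_len)
    return end - start + 1


print(promoter_length(10_000, 1, 150, 1_000_000))  # 2000 bp + 150 bp UTR
print(promoter_length(501, 1, 0, 1_000_000))       # clipped at the chromosome start
```

The second call illustrates the rare case mentioned above: a gene close to a chromosome terminus yields a promoter shorter than 2000 nucleotides.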

Description of the results text file format

Results are presented as a TSV text file. This format is convenient both for manual processing (using any spreadsheet program) and for scripted processing/parsing. The file has a header line that names the columns. All columns are separated by a single tab character, including the column names in the header line.
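
For scripted processing, a file of this shape can be read with Python's standard csv module. The example below is a sketch only: the column names "transcript_id" and "score" are hypothetical placeholders, and a real script should use the names found in the header line of the downloaded file.

```python
import csv
import io


def read_results(handle):
    """Read a tab-separated results file with a header line into dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Hypothetical two-column excerpt standing in for a downloaded results file.
sample = io.StringIO(
    "transcript_id\tscore\n"
    "ENST00000355055\t0.961344\n"
    "ENST00000392072\t0.961344\n"
)
rows = read_results(sample)
best = max(rows, key=lambda row: float(row["score"]))
print(len(rows), best["transcript_id"])
```

For a file on disk, pass an open file object instead, for example open(path, newline="", encoding="utf-8").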

Here is the description of each column:

About TRANSFAC matrices

COTRASIF allows easy selection of one of 398 TRANSFAC 7.0 Public matrices.

Some of the matrices had different column sums - that is, did not add up to the same number of sequences or to a common normalization base. Here is the list of such matrices.

Note: some matrices had identical column sums, but were normalized not to 1 (as accepted by COTRASIF), but to a different number (most commonly 100). See the detailed log of re-normalization for more details.

Note on the duplicate lines in the TSV results file

As COTRASIF is transcript-centric (with one promoter defined for each transcript), there can be multiple promoters defined per gene. Most of the time the transcription start sites of alternative gene transcripts are the same, which leads to several identical promoters being stored in our database for the gene. When the TFBS search is performed, each of those promoters produces its own TFBS line in the results file.

Starting with Ensembl release 48, we introduced duplicate-cleanup routines into our automatic promoter-importing pipeline. This significantly decreased the number of duplicate result lines. However, the solution appears to be incomplete, and some duplicates still make it into the results file.

Sample duplicate records, as of Ensembl release 52 (description and chromosome not shown):

ENSG00000197110  ENST00000355055  500  CAGTTTCTCTTTCCC  0.961344  -1  4862565
ENSG00000197110  ENST00000392072  500  CAGTTTCTCTTTCCC  0.961344  -1  4862565
Note that in these two lines every column shown is identical except the Ensembl Transcript ID. In general, "promoter size" can differ between duplicates, while the absolute chromosomal "position" is the same for true overlaps. The only column that always differs is the Ensembl Transcript ID.

Duplication is now a rare problem, but not yet completely solved.
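
Remaining duplicates can be collapsed in a user's own pipeline. The sketch below groups rows that agree on gene ID, sequence, strand and position, and collects their transcript IDs. The column positions follow the sample records above, and the reading of "-1" as strand and of the last field as position is an assumption; the function name is hypothetical.

```python
def collapse_duplicates(rows, transcript_col=1, key_cols=(0, 3, 5, 6)):
    """Group result rows that describe the same site.

    Returns a dict mapping the key columns (assumed: gene ID, sequence,
    strand, position) to the list of transcript IDs sharing that site.
    """
    groups = {}
    for row in rows:
        key = tuple(row[i] for i in key_cols)
        groups.setdefault(key, []).append(row[transcript_col])
    return groups


rows = [
    ["ENSG00000197110", "ENST00000355055", "500", "CAGTTTCTCTTTCCC", "0.961344", "-1", "4862565"],
    ["ENSG00000197110", "ENST00000392072", "500", "CAGTTTCTCTTTCCC", "0.961344", "-1", "4862565"],
]
unique_hits = collapse_duplicates(rows)
for key, transcripts in unique_hits.items():
    print(key, transcripts)
```

Promoter size is deliberately left out of the key, since it can differ between duplicate records.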

Description of conservation filter output format

Conservation filter output is tab-separated plain text, with the following columns:

Target species is the one selected first when choosing task results to compare; reference species is the second selected task.

© 2007 - 2009
Bogdan Tokovenko, Rostyslav Golda, Protas Oleksiy.