
Help
See also the About page.
Contents
- How to use COTRASIF
- PFM format
- Promoter definition in COTRASIF
- Description of the results file format
- About TRANSFAC matrices
- About duplicates in results file
- Conservation filter output format
How to use COTRASIF
To use COTRASIF's PWM and HMM-PWM tools, one needs to follow these simple steps:
- go to the COTRASIF start page (http://biomed.org.ua/COTRASIF/) and select the type of task to run;
- select the species promoter database to use for the current search;
- for the PWM method (and optionally for HMM-PWM), either enter a PFM manually (copy-paste) or select one from the JASPAR matrices select field;
- for the HMM-PWM method, enter input sequences; for the PWM method, specify a threshold (if different from the default 0.75);
- for both HMM-PWM and PWM methods, enter the email address to which the results link should be sent when the calculation is complete; the email address is also used as an identifier for the conservation filter, so be sure to use the same email if you plan to use the conservation filter;
- click "submit task";
- go to the status page, and wait until it finishes auto-reloading and displays direct download links for the results file. Alternatively, wait for the email with a link to the TSV (tab-separated values) results file. Download the results file, and open it in your favorite spreadsheet editor for manual analysis, or use it in your own data processing pipeline.
To use the conservation filter, one must have two completed tasks in COTRASIF. The user must input the email address used for previous tasks, then select two tasks to run through the filter, submit the form, and save the returned text file (the result is immediate).
PFM format
A well-formed PFM (also known as position-counts matrix, PCM, and also position-specific scoring matrix, PSSM) can be defined by the following list of requirements:
- it has exactly 4 lines - one per A, C, G and T
- all lines have equal number of elements (counts)
- the sum of counts in each column is equal for all columns (for fractional PFMs, each column sum must round to 1.0).
This is an example of valid integer-counts matrix:
1 12 0 0 0 0 0 7 1 1 0 0 0 2 1
8 0 0 0 0 0 13 1 7 0 0 3 8 7 8
2 1 13 0 0 0 0 1 2 0 0 0 0 2 3
2 0 0 13 13 13 0 4 3 12 13 10 5 2 1
Another valid example - this time fractional PFM:
0.002392631 0.018539367 0.023213127 0.055630882 0.028694009
0.85478965 0.133623499 0.115525923 0.914481573 0.079327923
0.118833399 0.003637736 0.854790957 0.016689723 0.142543009
0.02398432 0.844199398 0.006469993 0.013197821 0.749435059
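The three PFM requirements listed above can be checked mechanically before submitting a matrix. The following is only a minimal sketch, not COTRASIF's own validation code: the function names parse_pfm and is_well_formed_pfm and the tolerance value are assumptions made for illustration.

```python
def parse_pfm(text):
    """Split pasted PFM text into rows of numbers (one row per non-empty line)."""
    return [[float(x) for x in line.split()]
            for line in text.strip().splitlines() if line.strip()]


def is_well_formed_pfm(rows, tol=1e-3):
    """Check the three requirements: 4 rows, equal row lengths, equal column sums.

    tol is a hypothetical tolerance that lets fractional column sums
    differ by rounding error (e.g. 0.999999999 vs 1.0).
    """
    if len(rows) != 4:
        return False
    width = len(rows[0])
    if width == 0 or any(len(row) != width for row in rows):
        return False
    column_sums = [sum(column) for column in zip(*rows)]
    return all(abs(s - column_sums[0]) <= tol for s in column_sums)


# The integer-counts example, one line per A, C, G and T.
integer_pfm = """
1 12 0 0 0 0 0 7 1 1 0 0 0 2 1
8 0 0 0 0 0 13 1 7 0 0 3 8 7 8
2 1 13 0 0 0 0 1 2 0 0 0 0 2 3
2 0 0 13 13 13 0 4 3 12 13 10 5 2 1
"""
print(is_well_formed_pfm(parse_pfm(integer_pfm)))
```

Every column of the integer example sums to 13, so the check passes; removing a line or a single count makes it fail.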
Important: the PWM search method uses integer and fractional PFMs differently. If an integer PFM is supplied, a pseudocount correction is applied, and the PFM is converted into a PWM using the formula w = log2( ((f + 0.25*sqrt(N))/(N + sqrt(N))) / 0.25), where 'w' is the weight value, 'f' is the integer frequency (count) value, N is the column sum of the PFM, and sqrt(N) and 0.25*sqrt(N) are pseudocounts.
If a fractional PFM is supplied, it is converted into a PWM using the same formula, but a fixed small pseudocount (10^-9) is used instead of sqrt(N). Effectively, zero-value positions of the user-supplied fractional matrix translate to a PWM value of -29.897, which is ~10 times lower than the lowest PWM value obtained from a non-zero entry. This leads to a very low similarity score for sequences with a mismatch at a zero-value position of the user-submitted fractional PFM.
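The conversion formula can be written out as a short sketch. This is an illustration of the formula as stated in this section, not COTRASIF's actual code: the function name pfm_to_pwm and the explicit fractional flag are assumptions, since the way COTRASIF itself distinguishes integer from fractional matrices is not described here.

```python
import math


def pfm_to_pwm(rows, fractional=False, small_pseudocount=1e-9):
    """Convert a 4-row PFM into a PWM with w = log2(((f + 0.25*p)/(N + p)) / 0.25).

    p is sqrt(N) for integer PFMs and a fixed small pseudocount for
    fractional PFMs; N is the column sum.
    """
    column_sums = [sum(column) for column in zip(*rows)]
    pwm = []
    for row in rows:
        weights = []
        for f, n in zip(row, column_sums):
            p = small_pseudocount if fractional else math.sqrt(n)
            weights.append(math.log2(((f + 0.25 * p) / (n + p)) / 0.25))
        pwm.append(weights)
    return pwm


# A one-column fractional PFM with a zero entry for A.
fractional_pwm = pfm_to_pwm([[0.0], [0.5], [0.25], [0.25]], fractional=True)
print(round(fractional_pwm[0][0], 3))  # weight of the zero-value position
```

For the zero entry the formula reduces to roughly log2(10^-9), which is where the -29.897 figure comes from.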
Promoter definition in COTRASIF
In COTRASIF, a "promoter" is defined as the 2000 bp upstream of the TSS of the gene, plus the first 5' UTR (if any). Thus, the length of the majority of promoters is between 2000 and several thousand nucleotides. In rare cases, the length can be less than 2000 nucleotides (for genes located near chromosome termini).
Description of the results text file format
Results are presented as a TSV text file. This format is convenient both for manual processing (using any spreadsheet program) and for scripted processing/parsing. The file has a header line, which names the columns. All columns are separated by a single tab character, including the column names in the header line.
Here is the description of each column:
- Ensembl Gene ID: an identifier which can be used to find the full gene record in the Ensembl genomes database. This identifier is "stable" (permanent), which means that it does not change between Ensembl releases. However, the gene record may change between releases, including the gene coordinates (location) on the chromosome. If the location changes, the promoter (by definition) will also change, leading to minor changes in the gene lists of COTRASIF results files across Ensembl releases.
- Ensembl Transcript ID: stable Ensembl transcript identifier. COTRASIF's TFBS search is transcript-centric - "one transcript - one promoter". This was done to account for the possible alternative transcription start sites.
- position: contains the absolute chromosomal start coordinate of the found putative TFBS.
Here's an example of interpreting coordinates:
- let's assume TFBS length is 15
- one TFBS was found for the forward-strand gene A at 11'567'943; another one - for reverse-strand gene B at 5'948'128, both on chromosome 15
- for gene A, TFBS position will be 15:11'567'943-11'567'957 (start position = "position" column, end position = "position" column + 14; coordinates are inclusive)
- for gene B, TFBS position will be 15:5'948'114-5'948'128 (start position = "position" column - 14, end position = "position" column; coordinates are inclusive)
These coordinates can be pasted into either UCSC or Ensembl genome browsers.
- matched sequence: the sequence of the found TFBS.
- score: similarity of the user-provided PFM matrix/set of sequences to the "matched sequence". Maximal theoretical value is 1.00 (best possible match).
- gene strand: DNA strand of the gene.
- promoter size: the length of the promoter used for searching.
- chromosome: self-evident.
- gene description: description of the gene, as obtained from Ensembl.
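The coordinate interpretation given for the "position" column can be sketched as a small helper. This is only an illustration: the function name tfbs_interval is hypothetical, and the strand encoding (1 for forward, -1 for reverse) is an assumption based on the sample records shown on this page.

```python
def tfbs_interval(chromosome, position, strand, site_length):
    """Return an inclusive 'chromosome:start-end' string for a found TFBS.

    For forward-strand genes the site starts at `position`; for
    reverse-strand genes it ends at `position`.
    """
    if strand >= 0:
        start, end = position, position + site_length - 1
    else:
        start, end = position - site_length + 1, position
    return f"{chromosome}:{start}-{end}"


# Gene A (forward strand) and gene B (reverse strand), TFBS length 15.
print(tfbs_interval("15", 11567943, 1, 15))
print(tfbs_interval("15", 5948128, -1, 15))
```

The two calls reproduce the intervals from the example: 15:11567943-11567957 for gene A and 15:5948114-5948128 for gene B.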
About TRANSFAC matrices
COTRASIF allows easy selection of one of 398 TRANSFAC 7.0 Public matrices.
Some of the matrices had different column sums - that is, did not add up to the same number of sequences or to a common normalization base. Here is the list of such matrices.
Note: some matrices had identical column sums, but were normalized not to 1 (as accepted by COTRASIF), but to a different number (most commonly 100). See the detailed log of re-normalization for more details.
Note on the duplicate lines in the TSV results file
As COTRASIF is transcript-centric (with one promoter defined for each transcript), there can be multiple promoters defined per gene. Most of the time the transcription start sites of alternative gene transcripts are the same, which leads to several identical promoters stored in our database for the gene. When the TFBS search is performed, those multiple promoters are translated to multiple found TFBS lines in the results file.
Starting with Ensembl release 48, we introduced duplicate-cleanup routines into our automatic promoter-importing pipeline. This significantly decreased the number of duplicate result lines. However, the solution appears to be incomplete, and some duplicates still make it into the results file.
Sample duplicate records, as of Ensembl release 52 (description and chromosome not shown):
ENSG00000197110 ENST00000355055 500 CAGTTTCTCTTTCCC 0.961344 -1 4862565
ENSG00000197110 ENST00000392072 500 CAGTTTCTCTTTCCC 0.961344 -1 4862565
Note that the following columns are the same for these two lines:
- Ensembl Gene ID
- sequence
- score
- strand
- position
Duplication is now a rare problem, but not yet completely solved.
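Remaining duplicates can be removed on the user's side after download. The sketch below keeps the first line for each combination of gene ID, position, sequence, score and strand. It is not part of COTRASIF: the function name drop_duplicate_hits is hypothetical, and the column indices assume that the file columns appear in the same order as in the column description on this page, which should be verified against the header line of an actual results file.

```python
import csv
import io

# Assumed 0-based column order: gene ID, transcript ID, position,
# matched sequence, score, gene strand, ...
KEY_COLUMNS = (0, 2, 3, 4, 5)


def drop_duplicate_hits(tsv_text):
    """Return (header, rows), keeping only the first row for each key."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    header = next(reader)
    seen = set()
    unique_rows = []
    for row in reader:
        key = tuple(row[i] for i in KEY_COLUMNS)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return header, unique_rows
```

To use it, read the downloaded results file into a string and pass it to drop_duplicate_hits. Rows that differ only in transcript ID collapse into one, so the transcript ID that survives is simply the first one encountered.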
Description of conservation filter output format
Conservation filter output is tab-separated plain text, with the following columns:
- target species gene id
- target transcript id
- target found site(s) position(s) (the "(s)" covers the case when multiple sites are present in the gene promoter)
- target found site(s) score(s)
- reference gene id
- reference transcript id
- reference position(s)
- reference score(s)
- percent of peptide identity between the two orthologous genes (as judged by the translation of the longest transcript)
- orthology type
© 2007 - 2009
Bogdan Tokovenko, Rostyslav Golda, Protas Oleksiy.