
Help
This page is being extended. See the About page for more details.
PFM format
A well-formed PFM (also known as position-counts matrix) can be defined by the following list of requirements:
- it has exactly 4 lines (that is, a total of 3 newline characters)
- all lines have equal number of elements (counts)
- the sum of counts in each column is the same for all columns (for a fractional frequency matrix, after rounding the column sums to integers).
This is an example of a valid integer-counts matrix:
1 12 0 0 0 0 0 7 1 1 0 0 0 2 1
8 0 0 0 0 0 13 1 7 0 0 3 8 7 8
2 1 13 0 0 0 0 1 2 0 0 0 0 2 3
2 0 0 13 13 13 0 4 3 12 13 10 5 2 1
Another valid example, this time a fractional frequency matrix:
0.002392631 0.018539367 0.023213127 0.055630882 0.028694009 0.009708435 0.010513775
0.85478965 0.133623499 0.115525923 0.914481573 0.079327923 0.035589784 0.907054561
0.118833399 0.003637736 0.854790957 0.016689723 0.142543009 0.951495944 0.036552662
0.02398432 0.844199398 0.006469993 0.013197821 0.749435059 0.003205836 0.045879001
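The three requirements above can be checked mechanically. The following is a minimal Python sketch, not COTRASIF's own validator: the function name `parse_pfm` is hypothetical, and it assumes that elements are separated by whitespace and that "rounding" means rounding each column sum to the nearest integer.

```python
def parse_pfm(text):
    """Parse PFM text into 4 rows of floats; raise ValueError if malformed.

    Hypothetical helper: assumes whitespace-separated elements.
    """
    rows = [[float(tok) for tok in line.split()]
            for line in text.strip().splitlines()]

    # Requirement 1: exactly 4 lines.
    if len(rows) != 4:
        raise ValueError(f"expected 4 lines, got {len(rows)}")

    # Requirement 2: all lines have the same number of elements.
    width = len(rows[0])
    if width == 0 or any(len(row) != width for row in rows):
        raise ValueError("all lines must have the same number of elements")

    # Requirement 3: equal column sums. Sums are rounded to the nearest
    # integer (an assumption) so that fractional matrices whose columns
    # sum to roughly 1.0 are accepted.
    column_sums = {round(sum(column)) for column in zip(*rows)}
    if len(column_sums) != 1:
        raise ValueError("column sums differ between columns")

    return rows


integer_pfm = """1 12 0 0 0 0 0 7 1 1 0 0 0 2 1
8 0 0 0 0 0 13 1 7 0 0 3 8 7 8
2 1 13 0 0 0 0 1 2 0 0 0 0 2 3
2 0 0 13 13 13 0 4 3 12 13 10 5 2 1"""

matrix = parse_pfm(integer_pfm)
```

Applied to the integer-counts example, the sketch returns 4 rows of 15 counts each; every column sums to 13.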
Promoter definition in COTRASIF
In COTRASIF, "promoter" is defined as 800 bp upstream from the TSS of the gene, plus the first 5' UTR (if any). Thus, the length of the majority of promoters is between 800 and several thousand nucleotides. In rare cases, the length can be less than 800 nucleotides.
Description of the PWM results text file
Results are presented as a TSV text file. This format is convenient both for manual processing (using any spreadsheet program) and for scripted processing/parsing. The file has a header line, which names the columns. All columns are separated by a single tab character, including the column names in the header line.
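For scripted processing, a file of this shape can be read with Python's standard `csv` module. This is a minimal sketch under assumptions: the function name `read_results` and the file name in the usage note are hypothetical, and the header names in the demonstration string are illustrative, so the keys in a real file are whatever its header line contains.

```python
import csv
import io


def read_results(handle):
    """Yield one dict per result line, keyed by the header-line column names."""
    # QUOTE_NONE keeps quote characters in free-text fields as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


# Illustrative two-column excerpt; the header names here are assumptions.
sample = "Ensembl Gene ID\tposition\nENSG00000197110\t500\n"
rows = list(read_results(io.StringIO(sample)))
```

For a file on disk, the same function can be given an open handle, for example `open("results.tsv", newline="", encoding="utf-8")`. All values are returned as strings, so numeric columns need an explicit `int()` or `float()` conversion.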
Here is the description of each column:
- Ensembl Gene ID: an identifier which can be used to find the full gene record in the Ensembl genomes database. This identifier is "stable" (permanent), which means that it does not change between Ensembl releases. However, the gene record may change between releases, including the gene coordinates (location) on the chromosome. If the location changes, the promoter (by definition) will also change, leading to minor differences between the gene lists in COTRASIF results files for different Ensembl releases.
- Ensembl Transcript ID: stable Ensembl transcript identifier. COTRASIF's TFBS search is transcript-centric - "one transcript - one promoter". This was done to account for the possible alternative transcription start sites.
- position: position of the 1st nucleotide of the found TFBS relative to the 1st nucleotide of the promoter. Soon we plan to change the format, introducing absolute chromosome coordinates of the start and end of the found TFBS - there will be an additional notice about the change.
- matched sequence: the sequence of the found TFBS.
- score: similarity of the user-provided PFM matrix to the "matched sequence". Maximal theoretical value is 1.00, which means "best matrix match".
- gene strand: DNA strand of the gene.
- promoter size: the length of the promoter used for searching. Based on the definition of the promoter, the position of the found TFBS relative to the gene's TSS can be found by subtracting 800 from the "position" column (if "promoter size" is greater than or equal to 800), or by subtracting "promoter size" from the "position" column (if "promoter size" is less than 800). For example, if "position" is 500 and "promoter size" is 890, then the TFBS's position relative to the TSS is 500 - 800 = -300. As another example, if "position" is 200 and "promoter size" is 300, then the TFBS's position is 200 - 300 = -100. This complexity will be removed with the upcoming introduction of absolute chromosome coordinates for the found TF binding sites.
- chromosome: self-evident.
- gene description: description of the gene, as obtained from Ensembl.
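The TSS-relative position rule given under "promoter size" can be written as a short function. This is a sketch of that rule only; the function and parameter names are hypothetical, and the inputs are the "position" and "promoter size" values converted to integers.

```python
def tss_relative_position(position, promoter_size, upstream=800):
    """Return the TFBS position relative to the TSS.

    Subtract 800 when the promoter is at least 800 nucleotides long,
    otherwise subtract the promoter size itself.
    """
    offset = upstream if promoter_size >= upstream else promoter_size
    return position - offset


first = tss_relative_position(500, 890)   # 500 - 800
second = tss_relative_position(200, 300)  # 200 - 300
```

The two calls reproduce the worked examples from the column description, giving -300 and -100.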
Important note on the duplicate lines in the TSV results file
As COTRASIF is transcript-centric (with one promoter defined for each transcript), there can be multiple promoters defined per gene. Most of the time the transcription start sites of alternative gene transcripts are the same, which leads to several identical promoters stored in our database for the gene. When the TFBS search is performed, those multiple promoters are translated to multiple found TFBS lines in the results file.
Starting with Ensembl release 48, we introduced duplicate-cleanup routines into our automatic promoter-importing pipeline. This significantly decreased the number of duplicate result lines. However, the solution appears to be incomplete, and some duplicates still make it into the results file.
Sample duplicate records (description and chromosome not shown):
ENSG00000197110 ENST00000355055 500 CAGTTTCTCTTTCCC 0.961344 -1 862
ENSG00000197110 ENST00000392072 500 CAGTTTCTCTTTCCC 0.961344 -1 852
Note that the following columns are the same for these two lines:
- Ensembl Gene ID
- sequence
- score
- strand
We are working to fully eradicate the duplicates problem before the next Ensembl release.
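Duplicates of this kind can also be filtered after downloading the results. The sketch below is one possible approach, not part of COTRASIF: the function name `drop_duplicates` is hypothetical, rows are assumed to be lists of fields in the column order described above, and "position" is added to the comparison key (an assumption beyond the four columns listed) so that distinct hits of the same sequence within one gene are kept.

```python
def drop_duplicates(rows):
    """Keep the first row for each (gene, position, sequence, score, strand)."""
    seen = set()
    unique = []
    for row in rows:
        # Column order assumed: gene ID, transcript ID, position,
        # matched sequence, score, gene strand, then the remaining columns.
        gene, _transcript, position, sequence, score, strand = row[:6]
        key = (gene, position, sequence, score, strand)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


records = [
    "ENSG00000197110 ENST00000355055 500 CAGTTTCTCTTTCCC 0.961344 -1 862".split(),
    "ENSG00000197110 ENST00000392072 500 CAGTTTCTCTTTCCC 0.961344 -1 852".split(),
]
deduplicated = drop_duplicates(records)
```

With the two sample records, only the first line (transcript ENST00000355055) is kept. Note that this discards the transcript ID and promoter size of the dropped line.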
© 2007 - 2008
Bogdan Tokovenko, Rostyslav Golda, Protas Oleksiy.