Help

This page is being extended. See the About page for more details.

PFM format

A well-formed PFM (position frequency matrix, also known as a position-counts matrix) can be defined by the following list of requirements:

This definition applies both to integer-counts matrices and to fractional (floating-point) matrices.

This is an example of a valid integer-counts matrix:

1 12 0 0 0 0 0 7 1 1 0 0 0 2 1
8 0 0 0 0 0 13 1 7 0 0 3 8 7 8
2 1 13 0 0 0 0 1 2 0 0 0 0 2 3
2 0 0 13 13 13 0 4 3 12 13 10 5 2 1

Another valid example, this time a fractional frequency matrix:

0.002392631 0.018539367 0.023213127 0.055630882 0.028694009 0.009708435 0.010513775
0.85478965 0.133623499 0.115525923 0.914481573 0.079327923 0.035589784 0.907054561
0.118833399 0.003637736 0.854790957 0.016689723 0.142543009 0.951495944 0.036552662
0.02398432 0.844199398 0.006469993 0.013197821 0.749435059 0.003205836 0.045879001
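
The two examples above share a simple layout: four rows of whitespace-separated, non-negative numbers, with the same number of columns in every row. A minimal Python sketch of a parser for that layout could look like the following. The function name `parse_pfm` and the specific checks are assumptions inferred from the two examples only; they are not a restatement of the full list of requirements, and COTRASIF itself may validate input differently.

```python
def parse_pfm(text):
    """Parse a PFM given as four whitespace-separated rows of numbers.

    Hypothetical helper: the checks below are inferred from the two
    example matrices on this page, not from the official requirements.
    """
    rows = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue  # tolerate blank lines between rows
        # float() accepts both integer counts and fractional frequencies
        rows.append([float(token) for token in line.split()])

    if len(rows) != 4:
        raise ValueError(f"expected 4 rows, got {len(rows)}")

    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same number of columns")

    if any(value < 0 for row in rows for value in row):
        raise ValueError("matrix values must be non-negative")

    return rows


matrix = parse_pfm("""
1 12 0 0 0 0 0 7 1 1 0 0 0 2 1
8 0 0 0 0 0 13 1 7 0 0 3 8 7 8
2 1 13 0 0 0 0 1 2 0 0 0 0 2 3
2 0 0 13 13 13 0 4 3 12 13 10 5 2 1
""")
print(len(matrix), "rows,", len(matrix[0]), "columns")
```

Parsing every value as a float keeps one code path for both kinds of matrix; a caller that needs integer counts can convert afterwards.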

Promoter definition in COTRASIF

In COTRASIF, a "promoter" is defined as the 800 bp upstream of the gene's TSS (transcription start site), plus the first 5' UTR (if any). Thus, most promoters are between 800 and several thousand nucleotides long. In rare cases, a promoter can be shorter than 800 nucleotides.
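
As an illustration of this definition, here is a small Python sketch that turns a TSS position, a strand, and the length of the first 5' UTR into a genomic interval. Everything beyond the 800 bp figure is an assumption made for the example: the function name `promoter_interval`, the use of 1-based inclusive coordinates, the `1`/`-1` strand encoding, and the convention that the TSS is the first base of the UTR. It does not model the rare promoters shorter than 800 nucleotides.

```python
UPSTREAM_BP = 800  # upstream length taken from the definition above


def promoter_interval(tss, strand, utr5_len=0):
    """Return (start, end) of the promoter on the forward strand.

    Hypothetical helper. Assumes 1-based inclusive coordinates, strand
    given as 1 or -1, and a 5' UTR of utr5_len bases that begins at the
    TSS itself.
    """
    if strand not in (1, -1):
        raise ValueError("strand must be 1 or -1")
    if utr5_len < 0:
        raise ValueError("utr5_len must be non-negative")

    if strand == 1:
        # Upstream means smaller coordinates; the UTR extends downstream.
        return tss - UPSTREAM_BP, tss + utr5_len - 1
    # On the reverse strand, upstream means larger coordinates.
    return tss - utr5_len + 1, tss + UPSTREAM_BP


start, end = promoter_interval(tss=1000, strand=1, utr5_len=50)
print(start, end, end - start + 1)
```

Under these assumptions the promoter length is always 800 plus the UTR length, whichever strand the gene is on.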

Description of the PWM results text file

Results are presented as a TSV (tab-separated values) text file. This format is convenient both for manual processing (using any spreadsheet program) and for scripted processing or parsing. The file starts with a header line that names the columns. All columns, including the column names in the header line, are separated by a single tab character.
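
For scripted processing, a file with this layout can be read with Python's standard `csv` module. The sketch below is only an illustration: the function name `read_results` is made up, and the header names in the demo (`first_column`, `second_column`) are placeholders, not the real column names of the results file.

```python
import csv
import io


def read_results(handle):
    """Read a tab-separated file with a header line into a list of dicts.

    Hypothetical helper: each dict maps a header name to that row's value.
    """
    # QUOTE_NONE keeps any quote characters in the data as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Placeholder data; a real results file has its own header names.
demo = io.StringIO("first_column\tsecond_column\nx\t1\ny\t2\n")
rows = read_results(demo)
print(len(rows), rows[0]["first_column"], rows[1]["second_column"])
```

When reading from disk, open the file with `open(path, newline="")` and pass the resulting handle to the function. All values are returned as strings, so numeric columns need an explicit conversion.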

Here is the description of each column:

Important note on the duplicate lines in the TSV results file

As COTRASIF is transcript-centric (with one promoter defined for each transcript), there can be multiple promoters defined per gene. Most of the time, the transcription start sites of a gene's alternative transcripts are the same, which leads to several identical promoters being stored in our database for that gene. When the TFBS search is performed, each of these promoters produces its own TFBS line in the results file.

Starting with Ensembl release 48, we introduced duplicate-cleanup routines into our automatic promoter-importing pipeline. This significantly decreased the number of duplicate result lines. However, the solution appears to be incomplete, and some duplicates still make it into the results file.

Sample duplicate records (description and chromosome not shown):

ENSG00000197110	ENST00000355055	500	CAGTTTCTCTTTCCC	0.961344	-1	862
ENSG00000197110	ENST00000392072	500	CAGTTTCTCTTTCCC	0.961344	-1	852
Note that several columns are the same for these two lines. Because the Gene ID is identical, the description and chromosome are also the same. The "promoter size" can differ. The "position" is usually the same, but can sometimes differ (the case of "overlapping promoters"). The only value that always differs is the Ensembl Transcript ID.
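
Until duplicates are removed at the source, they can be collapsed after download. The Python sketch below groups rows that agree on every field except a caller-chosen set of positions. It is an assumption-laden illustration, not the cleanup routine used in the COTRASIF pipeline: the function name `collapse_duplicates` is hypothetical, and the ignored positions (1 for the transcript ID, 6 for the last field) match only the shortened sample above. A full results file also contains the description and chromosome columns, so the indices would have to be adjusted.

```python
def collapse_duplicates(rows, ignore=(1, 6)):
    """Group rows that are equal in every field outside `ignore`.

    Hypothetical helper. `rows` is a list of field lists; `ignore` holds
    the 0-based positions that may differ between duplicates.
    Returns a dict mapping each shared key to the rows that carry it.
    """
    groups = {}
    for row in rows:
        key = tuple(v for i, v in enumerate(row) if i not in ignore)
        groups.setdefault(key, []).append(row)
    return groups


sample = [
    "ENSG00000197110\tENST00000355055\t500\tCAGTTTCTCTTTCCC\t0.961344\t-1\t862",
    "ENSG00000197110\tENST00000392072\t500\tCAGTTTCTCTTTCCC\t0.961344\t-1\t852",
]
groups = collapse_duplicates([line.split("\t") for line in sample])
for key, members in groups.items():
    transcripts = [row[1] for row in members]
    print(key[0], len(members), transcripts)
```

Keeping the grouped rows, instead of discarding all but one, preserves the list of transcript IDs behind each collapsed hit.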

We are working to fully eradicate the duplicates problem before the next Ensembl release.

© 2007 - 2008
Bogdan Tokovenko, Rostyslav Golda, Protas Oleksiy.