
Help
See also the About page.
Contents
- How to use COTRASIF
- PFM format
- Promoter definition in COTRASIF
- Description of the results file format
- About TRANSFAC matrices
- Which matrix to use: TRANSFAC or JASPAR?
- About duplicates in results file
- Conservation filter
How to use COTRASIF
To use COTRASIF's PWM and HMM-PWM tools, one needs to follow these simple steps:
- go to the COTRASIF start page (http://biomed.org.ua/COTRASIF/) and select the type of task to run;
- select the species promoter database to use for the current search;
- for the PWM method (and optionally for HMM-PWM), either enter a PFM manually (copy-paste) or choose one from the JASPAR matrices selection field;
- for the HMM-PWM method, enter the input sequences; for the PWM method, specify a threshold (if different from the default 0.75);
- for both the HMM-PWM and PWM methods, enter the email address to which the results link should be sent when the calculation is complete; the email address is also used as an identifier by the conservation filter, so be sure to use the same email for all tasks you plan to run through the conservation filter;
- click "submit task";
- go to the status page, and wait until it finishes auto-reloading and displays direct download links for the results file. Alternatively, wait for the email with a link to the TSV (tab-separated values) results file. Download the results file, and open it in your favorite spreadsheet editor for manual analysis, or use it in your own data processing pipeline.
To use the conservation filter, one must have two completed tasks in COTRASIF. The user must input the email address used for previous tasks, then select two tasks to run through the filter, submit the form, and save the returned text file (the result is immediate).
PFM format
A well-formed PFM (position frequency matrix, also known as a position-counts matrix, PCM, and sometimes loosely called a position-specific scoring matrix, PSSM) can be defined by the following list of requirements:
- it has exactly 4 lines - one per A, C, G and T
- all lines have equal number of elements (counts)
- the sum of counts in each column is equal for all columns (for fractional PFMs, each column sums to 1.0 after rounding).
This is an example of valid integer-counts matrix:
1 12 0 0 0 0 0 7 1 1 0 0 0 2 1
8 0 0 0 0 0 13 1 7 0 0 3 8 7 8
2 1 13 0 0 0 0 1 2 0 0 0 0 2 3
2 0 0 13 13 13 0 4 3 12 13 10 5 2 1
Another valid example - this time fractional PFM:
0.002392631 0.018539367 0.023213127 0.055630882 0.028694009
0.85478965 0.133623499 0.115525923 0.914481573 0.079327923
0.118833399 0.003637736 0.854790957 0.016689723 0.142543009
0.02398432 0.844199398 0.006469993 0.013197821 0.749435059
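The three requirements above can be checked programmatically before a matrix is submitted. The following Python sketch is one possible way to do so; the function names and the tolerance used when comparing column sums are assumptions of this example and are not part of COTRASIF.

```python
import math

# The integer-counts example from the text, one line per A, C, G and T.
INTEGER_PFM = """
1 12 0 0 0 0 0 7 1 1 0 0 0 2 1
8 0 0 0 0 0 13 1 7 0 0 3 8 7 8
2 1 13 0 0 0 0 1 2 0 0 0 0 2 3
2 0 0 13 13 13 0 4 3 12 13 10 5 2 1
"""


def parse_pfm(text):
    """Parse whitespace-separated PFM text into a list of rows of floats."""
    return [[float(value) for value in line.split()]
            for line in text.strip().splitlines() if line.strip()]


def is_well_formed_pfm(rows, tol=1e-6):
    """Check the three requirements: 4 lines, equal widths, equal column sums.

    `tol` is an assumed tolerance that absorbs rounding in fractional PFMs.
    """
    if len(rows) != 4:
        return False
    width = len(rows[0])
    if width == 0 or any(len(row) != width for row in rows):
        return False
    column_sums = [sum(column) for column in zip(*rows)]
    return all(math.isclose(s, column_sums[0], abs_tol=tol)
               for s in column_sums)


print(is_well_formed_pfm(parse_pfm(INTEGER_PFM)))
```

The same check accepts the fractional example, because its column sums differ from 1.0 only in the last decimal places.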
Important: the PWM search method handles integer and fractional PFMs differently. If an integer PFM is supplied, a pseudocount correction is applied, and the PFM is converted into a PWM using the formula w = log2( ((f + 0.25*sqrt(N)) / (N + sqrt(N))) / 0.25 ), where 'w' is the weight value, 'f' is the integer frequency (count) value, N is the column sum of the PFM, and sqrt(N) and 0.25*sqrt(N) are the pseudocounts.
If a fractional PFM is supplied, it is converted into a PWM using the same formula, but with a fixed small pseudocount (10^-9) in place of sqrt(N). Effectively, zero-value positions of the user-supplied fractional matrix translate to a PWM value of -29.897, which is ~10 times lower than the lowest PWM value obtained for a non-zero position. This leads to a very low similarity score for sequences with a mismatch at a zero-value position of the user-submitted fractional PFM.
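The conversion described above can be sketched in a few lines of Python. This is only an illustration of the stated formula, not COTRASIF's actual code: the function names are hypothetical, and the explicit `fractional` flag is an assumption (the text does not say how the two kinds of PFM are told apart).

```python
import math

BACKGROUND = 0.25         # uniform background frequency used in the formula
FRACTIONAL_PSEUDO = 1e-9  # fixed pseudocount for fractional PFMs


def weight(f, n, pseudo):
    """w = log2(((f + 0.25*pseudo) / (n + pseudo)) / 0.25)."""
    return math.log2(((f + BACKGROUND * pseudo) / (n + pseudo)) / BACKGROUND)


def pfm_to_pwm(rows, fractional=False):
    """Convert a 4-row PFM into a PWM, column by column.

    For integer PFMs the pseudocount is sqrt(N), where N is the column sum;
    for fractional PFMs the fixed value 1e-9 is used instead.
    """
    n_columns = len(rows[0])
    pwm = [[0.0] * n_columns for _ in rows]
    for j in range(n_columns):
        n = sum(row[j] for row in rows)
        pseudo = FRACTIONAL_PSEUDO if fractional else math.sqrt(n)
        for i, row in enumerate(rows):
            pwm[i][j] = weight(row[j], n, pseudo)
    return pwm
```

For a zero entry in a fractional matrix (column sum 1.0), `weight(0.0, 1.0, FRACTIONAL_PSEUDO)` evaluates to approximately -29.897, which agrees with the value quoted above.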
Promoter definition in COTRASIF
In COTRASIF, "promoter" is defined as 2000 bp upstream from the TSS of the gene, plus the first 5' UTR (if any). Thus, the length of most promoters is between 2000 and several thousand nucleotides. In rare cases, the length can be less than 2000 nucleotides (for genes located near chromosome termini).
Description of the results text file format
Results are presented as a TSV text file. This format is convenient both for manual processing (using any spreadsheet program) and for scripted processing/parsing. The file has a header line, which names the columns. All columns are separated by a single tab character, including the column names in the header line.
Here is the description of each column:
- Ensembl Gene ID: an identifier which can be used to find the full gene record in the Ensembl genomes database. This identifier is "stable" (permanent), which means that it does not change between Ensembl releases. However, the gene record may change between releases, including the gene coordinates (location) on the chromosome. If the location changes, the promoter (by definition) also changes, which leads to minor differences between the gene lists in COTRASIF results files for different Ensembl releases.
- Ensembl Transcript ID: stable Ensembl transcript identifier. COTRASIF's TFBS search is transcript-centric - "one transcript - one promoter". This was done to account for the possible alternative transcription start sites.
- position: contains the absolute chromosomal start coordinate of the found putative TFBS.
Here's an example of interpreting coordinates:
- let's assume TFBS length is 15
- one TFBS was found for the forward-strand gene A at 11'567'943; another one - for reverse-strand gene B at 5'948'128, both on chromosome 15
- for gene A, TFBS position will be 15:11'567'943-11'567'957 (start position = "position" column, end position = "position" column + 14; coordinates are inclusive)
- for gene B, TFBS position will be 15:5'948'114-5'948'128 (start position = "position" column - 14, end position = "position" column; coordinates are inclusive)
These coordinates can be pasted into either the UCSC or the Ensembl genome browser. We plan to add links to these genome browsers in a future HTML version of the results file.
- TSS-relative position: start of the found TFBS relative to the transcript's TSS; the minimum possible value is -2000, and the maximum possible value is (longest_promoter - 2000).
- matched sequence: the sequence of the found TFBS.
- score: similarity of the user-provided PFM matrix/set of sequences to the "matched sequence". Maximal theoretical value is 1.00 (best possible match).
- gene strand: DNA strand of the gene.
- promoter size: the length of the promoter used for searching.
- chromosome: self-evident.
- gene description: description of the gene, as obtained from Ensembl.
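The coordinate interpretation given for the "position" column can be expressed as a small helper. The Python sketch below is an illustration only: the function names are hypothetical, and encoding the forward strand as 1 and the reverse strand as -1 is an assumption based on the Ensembl convention.

```python
def tfbs_interval(position, tfbs_length, strand):
    """Return inclusive (start, end) chromosomal coordinates of a TFBS.

    `position` is the value of the "position" column; `strand` is assumed
    to be 1 for forward-strand genes and -1 for reverse-strand genes.
    """
    if strand == 1:
        return position, position + tfbs_length - 1
    if strand == -1:
        return position - tfbs_length + 1, position
    raise ValueError("strand must be 1 or -1")


def browser_region(chromosome, position, tfbs_length, strand):
    """Format the interval as chromosome:start-end for a genome browser."""
    start, end = tfbs_interval(position, tfbs_length, strand)
    return f"{chromosome}:{start}-{end}"


# Gene A (forward strand) and gene B (reverse strand) from the example.
print(browser_region("15", 11567943, 15, 1))
print(browser_region("15", 5948128, 15, -1))
```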
About TRANSFAC matrices
COTRASIF allows easy selection of one of 398 TRANSFAC 7.0 Public matrices.
Some of the matrices had different column sums - that is, did not add up to the same number of sequences or to a common normalization base, and were re-normalized. Here is the list of such matrices.
Note: some matrices had identical column sums, but were normalized not to 1 (as accepted by COTRASIF), but to a different number (most commonly 100). See the detailed log of re-normalization for more details.
Which matrix to use: TRANSFAC or JASPAR?
COTRASIF allows easy selection of one of the JASPAR CORE matrices.
So what is the difference between JASPAR and TRANSFAC?
JASPAR is "high-quality, manually-curated", while TRANSFAC is a "broad compilation of binding sites"; this distinction is based on my own limited knowledge.
Quoting the JASPAR website:
"The JASPAR CORE database contains a curated, non-redundant set of 123 profiles, derived from published collections of experimentally defined transcription factor binding sites for multicellular eukaryotes. The prime difference to similar resources (TRANSFAC, TESS etc) consist of the open data acess, non-redundancy and quality: JASPAR CORE is a smaller set that is non-redundant and curated.
When should it be used? When seeking models for specific factors or structural classes, or if experimental evidence is paramount."
TRANSFAC is also literature-derived (read more at TRANSFAC Release 7.0 Documentation).
TRANSFAC assigns a "quality rating" in the range 1..6 (where 1 is the best quality) to each matrix. In each individual case, this quality metric could help you decide which matrix to use.
TRANSFAC Public is at version 7.0 and is rather old; to the best of my knowledge, the current version is 12 (as of April 2009).
There is also a general consideration relevant to the quality of a matrix: the number of sequences used to build it. Generally, more sequences mean a better matrix.
Sometimes, there are multiple matrices for a single binding site.
Let us review in detail the process of selecting a proper matrix to search for NFkB binding sites.
If you register (for free) to get access to TRANSFAC Public 7.0, you will be able to see more details about each of the matrices for NFkB: M00194, M00208, M00054 (records M00051 and M00052 are subunits; I am not showing them here, but they are also accessible).
Of these:
- M00208 was last updated in 1995, and doesn't have a quality rating assigned; spans 12 positions;
- M00054 was updated in 1996, was built from "40 binding sites from 30 genes (26 cellular genes and 4 viral genomes)", and refers to 2 publications as sources (PMID 8449662, 1985897), spans 10 positions;
- M00194 was updated in 2002, built from 13 sequences, has quality <= 6, spans 14 ("really" - 12) positions.
Of these, I would trust M00054 most, because it was built from 40 sequences; M00208 should be interesting as well for the two extra positions it covers (a short matrix produces more false positives, so longer matrices are generally better).
JASPAR's NFkB matrix MA0061 also refers to PMID 8449662, and is built from 38 sequences; if PMID 1985897 (the second citation from TRANSFAC's M00054) is about viral genomes, then I'd say that JASPAR's MA0061 and TRANSFAC's M00054 are of roughly the same quality.
As this matrix is short, I'd start with a cut-off of 0.9 or even 0.95 to get initial predictions. The matrix was built from rat, mouse and human sequences, so any pair of these species would be a good choice for the evolutionary conservation filtering step.
Hopefully this extended matrix selection example will help you use the proper matrix for your research.
Note on the duplicate lines in the TSV results file
As COTRASIF is transcript-centric (with one promoter defined for each transcript), there can be multiple promoters defined per gene. Most of the time the transcription start sites of alternative gene transcripts are the same, which leads to several identical promoters stored in our database for the gene. When the TFBS search is performed, those multiple promoters are translated to multiple found TFBS lines in the results file.
Starting with Ensembl release 48, we introduced duplicate-cleanup routines into our automatic promoter-importing pipeline. This significantly decreased the number of duplicate result lines. However, the solution appears to be incomplete, and some duplicates still make it into the results file.
Sample duplicate records, as of Ensembl release 52 (description and chromosome not shown):
ENSG00000197110 ENST00000355055 500 CAGTTTCTCTTTCCC 0.961344 -1 4862565
ENSG00000197110 ENST00000392072 500 CAGTTTCTCTTTCCC 0.961344 -1 4862565
Note that the following columns are the same for these two lines:
- Ensembl Gene ID
- sequence
- score
- strand
- position
Line duplication is now quite a rare problem.
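If the remaining duplicates matter for your analysis, they can be removed after download. The Python sketch below keeps the first line for each combination of the columns listed above. The header names in `DEDUP_KEYS` are assumptions taken from the column descriptions on this page; adjust them to the actual header line of your results file. The function names are hypothetical.

```python
import csv
import io

# Assumed header names; check them against the header line of your file.
DEDUP_KEYS = ("Ensembl Gene ID", "matched sequence", "score",
              "gene strand", "position")


def read_hits(tsv_text):
    """Read TSV text with a header line into a list of dicts."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def drop_duplicate_hits(rows, keys=DEDUP_KEYS):
    """Keep only the first row for each combination of the key columns."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

Note that this keeps only one transcript ID per duplicate group, so the other transcript IDs are lost from the filtered list.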
Conservation filter
The conservation filter (also referred to as the orthology filter and the [non-classical] phylogenetic footprinting method) helps to identify TFBSs occurring in the promoters of orthologous genes.
No sequence alignments are performed: the result is based solely on the presence (or absence) of the specified TFBS in the promoters of orthologs.
When applying conservation filter, one can specify orthology types to take into account. These are better described at the Ensembl website. One can also constrain the distance from the found TFBS to the TSS of each of the orthologs; this is then referred to as position-constrained orthology filtering.
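The idea of the filter can be illustrated with a short Python sketch. It is not COTRASIF's implementation: the data structures, the function name and the parameter names are hypothetical, and the position constraint is interpreted here as a maximum absolute TSS-relative distance, which is an assumption.

```python
def conservation_filter(target_hits, reference_hits, ortholog_pairs,
                        allowed_types=("ortholog_one2one",),
                        max_tss_distance=None):
    """Keep orthologous gene pairs whose promoters both contain the TFBS.

    target_hits, reference_hits: dict mapping gene ID to a list of
        TSS-relative positions of found TFBSs.
    ortholog_pairs: iterable of (target_gene, reference_gene, orthology_type).
    max_tss_distance: optional constraint (assumed meaning: only sites with
        |TSS-relative position| <= max_tss_distance are considered).
    """
    def usable(positions):
        if max_tss_distance is None:
            return list(positions)
        return [p for p in positions if abs(p) <= max_tss_distance]

    conserved = []
    for target_gene, reference_gene, orthology_type in ortholog_pairs:
        if orthology_type not in allowed_types:
            continue
        target_sites = usable(target_hits.get(target_gene, []))
        reference_sites = usable(reference_hits.get(reference_gene, []))
        # No alignment: presence of a site in both promoters is sufficient.
        if target_sites and reference_sites:
            conserved.append((target_gene, target_sites,
                              reference_gene, reference_sites,
                              orthology_type))
    return conserved
```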
When allowing orthology types other than ortholog_one2one, it is possible that TFBSs are found in several different orthologs in the reference genome. Currently, this is resolved by taking a single - best - orthology pair. Here, best only applies to the quality of the orthology pair, and has nothing to do with the reliability of the final result.
Previously, both orthology type and protein_percent_id were used to rank orthology pairs and choose the best pair. However, as protein_percent_id is not a proper measure of orthology quality (this being the reason for its removal from Ensembl Compara in release 54), now only orthology type is used to decide which orthology pair is better. The hierarchy used (best to worst) is:
- ortholog_one2one
- ortholog_one2many
- ortholog_many2many
- apparent_ortholog_one2one
At the moment, if there are several candidate orthologs with identical orthology types, the one with the numerically smaller Ensembl Gene ID is taken.
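The selection rule described above (orthology type first, then the smaller gene ID) can be sketched in Python as follows. The names are hypothetical, and comparing gene IDs by their trailing digits is an assumption about what "numerically smaller" means.

```python
import re

# Hierarchy from the list above, best (0) to worst (3).
ORTHOLOGY_RANK = {
    "ortholog_one2one": 0,
    "ortholog_one2many": 1,
    "ortholog_many2many": 2,
    "apparent_ortholog_one2one": 3,
}


def gene_id_number(gene_id):
    """Numeric part of an Ensembl gene ID, e.g. ENSG00000197110 -> 197110."""
    match = re.search(r"(\d+)$", gene_id)
    if match is None:
        raise ValueError(f"no numeric suffix in gene ID: {gene_id!r}")
    return int(match.group(1))


def best_ortholog(candidates):
    """Pick the best (gene_id, orthology_type) pair from the candidates.

    Candidates are ranked by orthology type; ties are broken by the
    numerically smaller gene ID.
    """
    return min(candidates,
               key=lambda c: (ORTHOLOGY_RANK[c[1]], gene_id_number(c[0])))
```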
Conservation filter output is tab-separated plain text, with the following columns:
- target species gene id
- target transcript id
- target found site(s) absolute chromosomal position(s); (s) is for the case when multiple sites are present in gene promoter - if this is the case, then multiple values are comma-separated
- target found site(s) TSS-relative position(s)
- target found site(s) score(s)
- reference gene id
- reference transcript id
- reference found site(s) absolute chromosomal position(s)
- reference found site(s) TSS-relative position(s)
- reference found site(s) score(s), comma-separated
- orthology type
© Bogdan Tokovenko (2006 - 2011) and Rostyslav Golda (2008 - 2009)
Portions © Oleksiy Protas (2009)