- How to use the binaries
- Download binaries for Linux (32/64 bit)
- FASTA files, as obtained from Ensembl
- Promoters, as used by COTRASIF
Binaries status note
The binaries were tested for the correctness of absolute coordinates. They also proved quite stable in tests on the human_toplevel.fa.gz sequence.
However, test coverage was insufficient. This is a beta version, so the binaries might be updated or replaced without further notice, although any bugs found (and fixed) will be reported in the COTRASIF googlegroup.
How to use the binaries
- Expected input: ENSEMBL toplevel FASTA files (see here; you need the "FASTA (DNA)" column). As an example, see Saccharomyces_cerevisiae.SGD1.01.53.dna_rm.toplevel.fasta, a repeat-masked yeast genome from E! release 53.
- Output format: TSV, tab-separated values, with header; columns are:
- chromosome name, as found in FASTA file
- absolute chromosomal start coordinate of the found TFBS
- actual matching sequence window
- normalized similarity score
- strand where TFBS was found ("1" or "-1"; complementary "-1" search is not yet implemented!)
- base (chromosomal) strand, always "1"
- Sample PFM file; columns represent positions, rows top-down are ACGT counts per position. Please ensure all the columns add up to the same number (otherwise your search results will not be correct).
- Sample sequences file (for use as HMM method input) - single sequence per line, all of the same length, and aligned.
- Use examples:
zcat Saccharomyces_cerevisiae.SGD1.01.53.dna_rm.toplevel.fa.gz | ./cotrasif_gw --method=pwm --stdin --cutoff=0.95 --pfmfile=isre.txt > found_TFBS.tsv
./cotrasif_gw --infile=Saccharomyces_cerevisiae.SGD1.01.53.dna_rm.toplevel.fasta --cutoff=0.95 --pfmfile=isre.txt > tfbs.tsv
cat Saccharomyces_cerevisiae.SGD1.01.53.dna_rm.toplevel.fasta | ./cotrasif_gw --method=hmm --stdin --seqfile=isre.seq > found_with_hmm_TFBS.tsv
./cotrasif_gw --method=hmm --infile=my_genome.fasta --cutoff=0.74 --seqfile=isre.seq > more_TFBS.tsv
Prior testing suggests that both usage forms (zcat and cat) have comparable performance.
Options --method=pwm and --stdin are defaults, and can be omitted.
- Sample output from the example above:
Chromosome	Coordinate	Sequence	Score	Strand	Base strand
VII	964158	TAGTTTCACTTTTCC	0.954210	1	1
VII	1019324	CAGTTTCTTTTTCCC	0.964635	1	1
- Performance note: the binaries do not use multiple cores, so it is fine to run N instances at once, where N is the number of cores/CPUs you have. A full-genome PWM-based scan of the repeat-masked human genome (a 3.5 GiB uncompressed FASTA file) with a 15-column-wide PFM takes ~11 minutes on our system (Athlon X2, 2.2 GHz); a wider matrix should take a bit more time, a narrower one less. An HMM search on the same file took ~40 minutes (also with 15-nucleotide sequences). Both the PWM and the HMM search methods need at most ~20 MiB of memory for a 15-nucleotide-long TFBS.
- Low cut-off warning: a cut-off that is too low will produce a huge result file and slow the search down because of the I/O bottleneck.
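The input-file requirements listed above (PFM columns that all add up to the same number, and HMM input sequences of equal length) can be checked before starting a long scan. The following is only a minimal sketch in Python, not part of COTRASIF. It assumes that the PFM file holds exactly four rows of whitespace-separated counts in A, C, G, T order; the exact delimiter is an assumption, and the function names are invented for illustration.

```python
from typing import List


def parse_pfm(text: str) -> List[List[float]]:
    """Parse a PFM given as four rows (A, C, G, T) of counts.

    Assumption: counts are separated by whitespace (spaces or tabs).
    """
    rows = [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]
    if len(rows) != 4:
        raise ValueError(f"expected 4 rows (A, C, G, T), got {len(rows)}")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all four rows must have the same number of columns")
    return rows


def check_pfm(text: str, tol: float = 1e-9) -> float:
    """Return the common column sum, or raise if the columns disagree."""
    pfm = parse_pfm(text)
    # zip(*pfm) turns the four rows into per-position columns.
    sums = [sum(column) for column in zip(*pfm)]
    bad = [i + 1 for i, s in enumerate(sums) if abs(s - sums[0]) > tol]
    if bad:
        raise ValueError(f"columns {bad} do not sum to {sums[0]}")
    return sums[0]


def check_sequences(text: str) -> int:
    """Return the common sequence length, or raise if lengths differ.

    One sequence per line; blank lines are ignored.
    """
    seqs = [line.strip() for line in text.splitlines() if line.strip()]
    if not seqs:
        raise ValueError("no sequences found")
    lengths = {len(s) for s in seqs}
    if len(lengths) != 1:
        raise ValueError(f"sequences differ in length: {sorted(lengths)}")
    return lengths.pop()
```

For example, check_pfm(open("isre.txt").read()) and check_sequences(open("isre.seq").read()) would each raise a ValueError describing the problem if the corresponding file does not meet the requirement.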
Download binaries for Linux
FASTA files, as obtained from Ensembl
These files were downloaded as responses to martservice queries asking for the -2000..0 upstream sequence (see promoter definition for details) and for the 5' UTR. They were compressed for faster downloads. Please note that archives prior to E!52 contain only the -800..0 upstream sequence, as the promoter definition was different at that time. Also, the number of genomes decreases as you go to older E! releases. Zipped files are named by release version. Each file is ~100 MiB or smaller.
If you use these FASTA files in your research, you may want to cite both Ensembl and our article as the source of promoters.
Promoters, as used by COTRASIF
These are "glued" upstream and 5' UTR sequences (see those separately above). Promoters are available only from Ensembl release 54 onward. The archives are also named by release version. The FASTA header structure is:
>chromosome|species name|gene ID|transcript ID|promoter start coordinate|promoter length|gene strand
The promoter start coordinate is calculated relative to the chromosome start coordinate, and has not yet been verified in the presented FASTA files for either the sense or the antisense strand (any reports on that are welcome at the COTRASIF googlegroup). Each archive is smaller than 100 MiB.
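The header structure above can be split into named fields with a few lines of Python. This is only a sketch. It assumes that the start coordinate, length and strand are integers and that the strand is written as "1" or "-1" (as in the TSV output of the binaries, which is not confirmed for these headers). The function and field names are invented for illustration.

```python
from typing import NamedTuple


class PromoterHeader(NamedTuple):
    chromosome: str
    species: str
    gene_id: str
    transcript_id: str
    start: int    # promoter start coordinate on the chromosome
    length: int   # promoter length
    strand: int   # gene strand; assumed to be 1 or -1


def parse_promoter_header(line: str) -> PromoterHeader:
    """Split a '>' FASTA header with seven '|'-separated fields."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    fields = line[1:].rstrip("\r\n").split("|")
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    chrom, species, gene, transcript, start, length, strand = fields
    return PromoterHeader(chrom, species, gene, transcript,
                          int(start), int(length), int(strand))
```

With a made-up header such as ">VII|Saccharomyces cerevisiae|GENE1|TRANSCRIPT1|964158|2150|-1" (placeholder IDs and numbers, not taken from a real archive), the function returns a record whose fields can be read as h.chromosome, h.start, h.strand and so on.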
If you use these promoter sets in your research, please consider citing the source of promoters.