Motivation: Evolutionarily conserved non-coding genomic sequences represent a potentially rich source for the discovery of gene regulatory regions such as transcriptional enhancers. However, detecting orthologous enhancers using alignment-based methods in higher eukaryotic genomes is particularly challenging, as regulatory regions can undergo considerable sequence changes while maintaining their functionality.
Results: We have developed an alignment-free method which identifies conserved enhancers in multiple diverged species. Our method uses similarity metrics between two sequences that are based on the co-occurrence of sequence patterns regardless of their order and orientation, thus tolerating sequence changes observed in non-coding evolution. We show that our method is highly successful in detecting orthologous enhancers in distantly related species without requiring additional information such as knowledge about transcription factors involved, or predicted binding sites. By estimating the significance of similarity scores, we are able to discriminate experimentally validated functional enhancers from seemingly equally conserved candidates without function. We demonstrate the effectiveness of this approach on a wide range of enhancers in Drosophila, and also present encouraging results in detecting conserved functional regions across large evolutionary distances. Our work provides encouraging steps on the way to ab initio unbiased enhancer prediction to complement ongoing experimental efforts.
Availability: The software, data and the results used in this article are available at http://www.genome.duke.edu/labs/ohler/research/transcription/fly_enhancer/
Contact: tomancak@mpi-cbg.de; uwe.ohler@duke.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
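The idea of scoring two sequences by shared sequence patterns, regardless of order and orientation, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not name the similarity metrics used, so cosine similarity over k-mer counts stands in for them, each k-mer is merged with its reverse complement to ignore orientation, and the significance estimation mentioned in the Results is not covered. The function names and the default k=4 are hypothetical.

```python
from collections import Counter
from math import sqrt

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmer_counts(seq, k=4):
    """Count k-mers, merging each k-mer with its reverse complement.

    Only the multiset of patterns is kept, so pattern order is ignored,
    and canonicalisation makes the counts independent of strand.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip k-mers with ambiguous bases
            counts[min(kmer, reverse_complement(kmer))] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def kmer_similarity(seq1, seq2, k=4):
    """Alignment-free similarity score (illustrative stand-in metric)."""
    return cosine_similarity(canonical_kmer_counts(seq1, k),
                             canonical_kmer_counts(seq2, k))
```

Because the score depends only on pattern counts, a sequence and its reverse complement receive the maximal score even though a strand-specific alignment would treat them as unrelated.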
Motivation: Comparative genomic sequence analysis is a powerful approach for identifying putative functional elements in silico. The availability of full-genome sequences from many vertebrate species has resulted in the development of popular tools, such as the phastCons software package, that search large numbers of genomes to identify conserved elements. While phastCons can analyze many genomes simultaneously, it ignores potentially informative insertion and deletion events and relies on a fixed, precomputed multiple sequence alignment.
Results: We have developed a new method, GRAPeFoot, which simultaneously aligns two full genomes and annotates a set of conserved regions exhibiting reduced rates of insertion, deletion and substitution mutations. We tested GRAPeFoot using the human and mouse genomes and compared its performance to a set of phastCons predictions hosted on the UCSC genome browser. Our results demonstrate that despite the use of only two genomes, GRAPeFoot identified constrained elements at rates comparable with phastCons, which analyzed data from 28 vertebrate genomes. This study demonstrates how integrated modelling of substitutions, indels and purifying selection allows a pairwise analysis to exhibit a sensitivity similar to a heuristic analysis of many genomes.
Availability: The GRAPeFoot software and set of genome-wide functional element predictions are freely available to download online at http://www.stats.ox.ac.uk/~satija/GRAPeFoot/
Contact: satija@stats.ox.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
Summary: The accurate prediction of protein folding rate change upon mutation is an important and challenging problem in protein folding kinetics and design. In this work, we have collected experimental data on protein folding rate change upon mutation from various sources and constructed a reliable and non-redundant dataset with 467 mutants. These mutants are widely distributed based on secondary structure, solvent accessibility, conservation score and long-range contacts. From systematic analysis of these parameters along with a set of 49 amino acid properties, we have selected a set of 12 features for discriminating the mutants that speed up or slow down the folding process. We have developed a method based on quadratic regression models for discriminating the accelerating and decelerating mutants, which showed an accuracy of 74% using the 10-fold cross-validation test. The sensitivity and specificity are 63% and 76%, respectively. The method can be improved with the inclusion of physical interactions and structure-based parameters.
Availability: http://bioinformatics.myweb.hinet.net/freedom.htm
Contact: michael-gromiha@aist.go.jp
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: An observed metabolic response is the result of the coordinated activation and interaction between multiple genetic pathways. However, the complex structure of metabolism has meant that it is not yet fully understood which pathways are required to produce an observed metabolic response. In this article, we propose an approach that can identify the genetic pathways which dictate the response of the metabolic network to specific experimental conditions.
Results: Our approach is a combination of probabilistic models for pathway ranking, clustering and classification. First, we use a non-parametric pathway extraction method to identify the most highly correlated paths through the metabolic network. We then extract the defining structure within these top-ranked pathways using both Markov clustering and classification algorithms. Furthermore, we define detailed node and edge annotations, which enable us to track each pathway, not only with respect to its genetic dependencies, but also allow for an analysis of the interacting reactions, compounds and KEGG sub-networks. We show that our approach identifies biologically meaningful pathways within two microarray expression datasets using entire KEGG metabolic networks.
Availability and implementation: An R package containing a full implementation of our proposed method is currently available from http://www.bic.kyoto-u.ac.jp/pathway/timhancock
Contact: timhancock@kuicr.kyoto-u.ac.jp
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: One of the main goals of high-throughput gene-expression studies in cancer research is to identify prognostic gene signatures, which have the potential to predict the clinical outcome. It is common practice to investigate these questions using classification methods. However, standard methods merely rely on gene-expression data and assume the genes to be independent. Including pathway knowledge a priori into the classification process has recently been indicated as a promising way to increase classification accuracy as well as the interpretability and reproducibility of prognostic gene signatures.
Results: We propose a new method called Reweighted Recursive Feature Elimination. It is based on the hypothesis that a gene with a low fold-change should have an increased influence on the classifier if it is connected to differentially expressed genes. We used a modified version of Google's PageRank algorithm to alter the ranking criterion of the SVM-RFE algorithm. Evaluations of our method on an integrated breast cancer dataset comprising 788 samples showed an improvement of the area under the receiver operating characteristic curve as well as in the reproducibility and interpretability of selected genes.
Availability: The R code of the proposed algorithm is given in Supplementary Material.
Contact: m.johannes@DKFZ-heidelberg.de; tim.beissbarth@ams.med.uni-goettingen.de
Supplementary information: Supplementary data are available at Bioinformatics online.
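The hypothesis stated in the Results above, that a gene with a low fold-change gains influence when it is connected to differentially expressed genes, can be illustrated with a generic personalized PageRank. This is a sketch under stated assumptions, not the authors' modified algorithm: the restart distribution is taken to be proportional to absolute fold-change, the network is given as an adjacency dictionary, and the function name, damping factor and iteration count are hypothetical. How such a score would enter the SVM-RFE ranking criterion is not specified in the abstract and is left out.

```python
def personalized_pagerank(adjacency, prior, damping=0.85, iterations=100):
    """Power-iteration PageRank with a gene-specific restart distribution.

    adjacency: dict mapping each gene to a list of neighbouring genes
               (list both directions for an undirected network).
    prior:     dict mapping each gene to a non-negative weight, e.g. its
               absolute fold-change (an assumption of this sketch).
    """
    genes = list(adjacency)
    total = sum(prior[g] for g in genes)
    restart = {g: prior[g] / total for g in genes}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {g: (1.0 - damping) * restart[g] for g in genes}
        for g in genes:
            neighbours = adjacency[g]
            if neighbours:
                # Pass this gene's score on to its neighbours in equal shares.
                share = damping * rank[g] / len(neighbours)
                for n in neighbours:
                    new_rank[n] += share
            else:
                # Isolated gene: return its score to the restart distribution.
                for n in genes:
                    new_rank[n] += damping * rank[g] * restart[n]
        rank = new_rank
    return rank
```

In a toy network where gene C has a low fold-change but is linked to two strongly changed genes, C ends up ranked well above an equally weak gene whose only neighbour is also weak, which is the behaviour the hypothesis calls for.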
Motivation: Directed evolution, in addition to its principal application of obtaining novel biomolecules, offers significant potential as a vehicle for obtaining useful information about the topologies of biomolecular fitness landscapes. In this article, we make use of a special type of model of fitness landscapes—based on finite state machines—which can be inferred from directed evolution experiments. Importantly, the model is constructed only from the fitness data and phylogeny, not sequence or structural information, which is often absent. The model, called a landscape state machine (LSM), has already been used successfully in the evolutionary computation literature to model the landscapes of artificial optimization problems. Here, we use the method for the first time to simulate a biological fitness landscape based on experimental evaluation.
Results: We demonstrate in this study that LSMs are capable not only of representing the structure of model fitness landscapes such as NK-landscapes, but also the fitness landscape of real DNA oligomers binding to a protein (allophycocyanin), data we derived from experimental evaluations on microarrays. The LSMs prove adept at modelling the progress of evolution as a function of various controlling parameters, as validated by evaluations on the real landscapes. Specifically, the ability of the model to ‘predict’ optimal mutation rates and other parameters of the evolution is demonstrated. A modification to the standard LSM also proves accurate at predicting the effects of recombination on the evolution.
Contact: william.rowe@manchester.ac.uk
Motivation: Complex patterns of protein phosphorylation mediate many cellular processes. Tandem mass spectrometry (MS/MS) is a powerful tool for identifying these post-translational modifications. In high-throughput experiments, mass spectrometry database search engines, such as MASCOT, provide a ranked list of peptide identifications based on hundreds of thousands of MS/MS spectra obtained in a mass spectrometry experiment. These search results are not in themselves sufficient for confident assignment of phosphorylation sites as identification of characteristic mass differences requires time-consuming manual assessment of the spectra by an experienced analyst. The time required for manual assessment has previously rendered high-throughput confident assignment of phosphorylation sites challenging.
Results: We have developed a knowledge base of criteria, which replicate expert assessment, allowing more than half of cases to be automatically validated and site assignments verified with a high degree of confidence. This was assessed by comparing automated spectral interpretation with careful manual examination of the assignments for 501 peptides above the 1% false discovery rate (FDR) threshold corresponding to 259 putative phosphorylation sites in 74 proteins of the Trypanosoma brucei proteome. Despite this stringent approach, we are able to validate 80 of the 91 phosphorylation sites (88%) positively identified by manual examination of the spectra used for the MASCOT searches with an FDR < 15%.
Conclusions: High-throughput computational analysis can provide a viable second stage validation of primary mass spectrometry database search results. Such validation gives rapid access to a systems level overview of protein phosphorylation in the experiment under investigation.
Availability: A GPL licensed software implementation in Perl for analysis and spectrum annotation is available in the supplementary material and a web server can be accessed online at http://www.compbio.dundee.ac.uk/prophossi
Contact: d.m.a.martin@dundee.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Distinguishing direct from indirect influences is a central issue in reverse engineering of biological networks because it facilitates detection and removal of false positive edges. Transitive reduction is one approach for eliminating edges reflecting indirect effects but its use in reconstructing cyclic interaction graphs with true redundant structures is problematic.
Results: We present TRANSWESD, an elaborated variant of TRANSitive reduction for WEighted Signed Digraphs that overcomes conceptual problems of existing versions. Major changes and improvements concern: (i) new statistical approaches for generating high-quality perturbation graphs from systematic perturbation experiments; (ii) the use of edge weights (association strengths) for recognizing true redundant structures; (iii) causal interpretation of cycles; (iv) relaxed definition of transitive reduction; and (v) approximation algorithms for large networks. Using standardized benchmark tests, we demonstrate that our method outperforms existing variants of transitive reduction and is, despite its conceptual simplicity, highly competitive with other reverse engineering methods.
Contact: klamt@mpi-magdeburg.mpg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
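The role of edge signs and weights in transitive reduction, as listed in the Results above, can be sketched with a deliberately simplified rule. The rule is an assumption of this illustration and not the TRANSWESD criterion itself: a direct edge is dropped only if some indirect simple path has the same overall sign and consists entirely of edges with a larger association strength. Every edge is tested against the original graph, so cyclic graphs (which the abstract treats with dedicated handling) are outside the scope of the sketch. All names are hypothetical.

```python
def explained_by_path(edges, source, target):
    """Check whether an indirect path can account for the edge source -> target.

    edges: dict mapping (u, v) to (sign, strength), where sign is +1 or -1
           and a larger strength means a stronger association.
    """
    direct_sign, direct_strength = edges[(source, target)]
    stack = [(source, 1, frozenset([source]))]
    while stack:
        node, sign_so_far, visited = stack.pop()
        for (u, v), (edge_sign, strength) in edges.items():
            if u != node or (u, v) == (source, target):
                continue  # not an outgoing edge, or the direct edge itself
            if v in visited or strength <= direct_strength:
                continue  # keep paths simple; use only stronger edges
            new_sign = sign_so_far * edge_sign
            if v == target:
                if new_sign == direct_sign:
                    return True
                continue
            stack.append((v, new_sign, visited | {v}))
    return False


def transitive_reduction(edges):
    """Return the edges that no stronger, same-sign indirect path explains."""
    return {edge: attrs for edge, attrs in edges.items()
            if not explained_by_path(edges, *edge)}
```

Under this rule a weak positive edge A -> C is removed when A -> B -> C is positive and strong, but it is kept if its sign differs from that of the path or if it is stronger than the path, which mimics how weights help preserve true redundant structures.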
Motivation: Viewing a cellular system as a collection of interacting parts can lead to new insights into the complex cellular behavior. In this study, we have investigated the aryl hydrocarbon receptor (AhR) signal transduction pathway from such a system-level perspective. AhR detects various xenobiotics, such as drugs or endocrine disruptors (e.g. dioxin), and mediates transcriptional regulation of target genes such as those in the cytochrome P450 (CYP450) family. On binding with 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD), however, AhR becomes abnormally activated and conveys toxic effects on cells. Despite many related studies on the TCDD-mediated toxicity, quantitative system-level understanding of how TCDD-mediated toxicity generates various toxic responses is still lacking.
Results: Here, we present a manually curated TCDD-mediated AhR signaling pathway including crosstalk with the hypoxia pathway that copes with oxygen deficiency and the p53 pathway that induces a DNA damage response. Based on the integrated pathway, we have constructed a mathematical model and validated it through quantitative experiments. Using the mathematical model, we have investigated: (i) TCDD dose-dependent effects on AhR target genes; (ii) the crosstalk effect between AhR and hypoxia signals; and (iii) p53 inhibition effect of TCDD-liganded AhR. Our results show that cellular intake of TCDD causes the AhR signaling pathway to be abnormally up-regulated and thereby interrupts other signaling pathways. Interruption of hypoxia and p53 pathways, in turn, can incur various hazardous effects on cells. Taken together, our study provides a system-level understanding of how the AhR signal mediates various TCDD-induced toxicities in the presence of hypoxia and/or DNA damage in cells.
Contact: ckh@kaist.ac.kr
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: A gene set test is a differential expression analysis in which a P-value is assigned to a set of genes as a unit. Gene set tests are valuable for increasing statistical power, organizing and interpreting results and for relating expression patterns across different experiments. Existing methods are based on permutation. Methods that rely on permutation of probes unrealistically assume independence of genes, while those that rely on permutation of samples are suitable only for two-group comparisons with a good number of replicates in each group.
Results: We present ROAST, a statistically rigorous gene set test that allows for gene-wise correlation while being applicable to almost any experimental design. Instead of permutation, ROAST uses rotation, a Monte Carlo technique for multivariate regression. Since the number of rotations does not depend on sample size, ROAST gives useful results even for experiments with minimal replication. ROAST allows for any experimental design that can be expressed as a linear model, and can also incorporate array weights and correlated samples. ROAST can be tuned for situations in which only a subset of the genes in the set are actively involved in the molecular pathway. ROAST can test for uni- or bi-directional regulation. Probes can also be weighted to allow for prior importance. The power and size of the ROAST procedure are demonstrated in a simulation study, and compared to those of a representative permutation method. Finally, ROAST is used to test the degree of transcriptional conservation between human and mouse mammary stem cells.
Availability: ROAST is implemented as a function in the Bioconductor package limma available from www.bioconductor.org
Contact: smyth@wehi.edu.au
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Highly sensitive and specific screening tools may reduce disease-related mortality by enabling physicians to diagnose diseases in asymptomatic patients or at-risk individuals. Diagnostic tests based on multiple biomarkers may achieve the needed sensitivity and specificity to realize this clinical gain.
Results: Logic regression, a multivariable regression method predicting an outcome using logical combinations of binary predictors, yields interpretable models of the complex interactions in biologic systems. However, its performance degrades in noisy data. We extend logic regression for classification to an ensemble of logic trees (Logic Forest, LF). We conduct simulation studies comparing the ability of logic regression and LF to identify variable interactions predictive of disease status. Our findings indicate LF is superior to logic regression for identifying important predictors. We apply our method to single nucleotide polymorphism data to determine associations of genetic and health factors with periodontal disease.
Availability: LF code is publicly available on CRAN, http://cran.r-project.org/.
Contact: wolfb@musc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Summary: METAL provides a computationally efficient tool for meta-analysis of genome-wide association scans, which is a commonly used approach for improving power in complex trait gene-mapping studies. METAL provides a rich scripting interface and implements efficient memory management to allow analyses of very large data sets and to support a variety of input file formats.
Availability and implementation: METAL, including source code, documentation, examples, and executables, is available at http://www.sph.umich.edu/csg/abecasis/metal/
Contact: goncalo@umich.edu
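To show why combining scans improves power, the following minimal sketch implements the standard fixed-effects inverse-variance meta-analysis for a single marker. The abstract does not describe the weighting scheme, so this generic textbook formula is an assumption of the example and is not taken from the METAL source code; the function name and argument layout are hypothetical.

```python
from math import erfc, sqrt


def inverse_variance_meta(effects, std_errors):
    """Combine per-study effect estimates for one marker.

    effects:    per-study effect sizes (e.g. regression betas), all
                reported relative to the same allele.
    std_errors: the matching standard errors.
    Returns the pooled effect, its standard error, the Z score and the
    two-sided P-value under a normal approximation.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, effects)) / total_weight
    pooled_se = sqrt(1.0 / total_weight)
    z = pooled / pooled_se
    p_value = erfc(abs(z) / sqrt(2.0))  # two-sided normal tail probability
    return pooled, pooled_se, z, p_value
```

Two studies that each report an effect of 0.2 with standard error 0.1 have individual Z scores of 2, whereas the pooled estimate has a standard error smaller by a factor of the square root of two and a correspondingly stronger signal.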
Summary: Large volumes of data generated by high-throughput sequencing instruments present non-trivial challenges in data storage, content access and transfer. We present G-SQZ, a Huffman coding-based sequencing-reads-specific representation scheme that compresses data without altering the relative order. G-SQZ has achieved from 65% to 81% compression on benchmark datasets, and it allows selective access without scanning and decoding from the start. This article focuses on describing the underlying encoding scheme and its software implementation, and a more theoretical problem of optimal compression is out of scope. The immediate practical benefits include reduced infrastructure and informatics costs in managing and analyzing large sequencing data.
Availability: http://public.tgen.org/sqz. Academic/non-profit: Source: available at no cost under a non-open-source license by requesting from the web-site; Binary: available for direct download at no cost. For-Profit: Submit request for for-profit license from the web-site.
Contact: wtembe@tgen.org
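The Huffman coding on which the scheme is based can be sketched generically. The choice of symbols is an assumption of this illustration: each symbol is taken to be a (base, quality) pair, and nothing here reproduces the G-SQZ file format or its indexing for selective access. All function names are hypothetical, and the bit string is kept as text for readability rather than packed into bytes.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_codes(symbols):
    """Build a prefix-free code that gives frequent symbols shorter codes."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tiebreak = count()  # keeps heap entries comparable when counts tie
    heap = [(n, next(tiebreak), {sym: ""}) for sym, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, codes1 = heapq.heappop(heap)
        n2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (n1 + n2, next(tiebreak), merged))
    return heap[0][2]


def encode(symbols, codes):
    """Concatenate the codes of the symbols in their original order."""
    return "".join(codes[s] for s in symbols)


def decode(bits, codes):
    """Recover the symbol sequence from a bit string."""
    lookup = {code: sym for sym, code in codes.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:
            decoded.append(lookup[current])
            current = ""
    return decoded
```

For a read such as ACGTTTTA with quality string IIIIHHHI, zipping the two strings gives the symbol sequence; decoding the encoded bits returns the pairs in the same order, consistent with the requirement that compression does not alter the relative order.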
Summary: Computational methods designed to discover transcription factor binding sites in DNA sequences often have a tendency to make a lot of false predictions. One way to improve accuracy in motif discovery is to rely on positional priors to focus the search to parts of a sequence that are considered more likely to contain functional binding sites. We present here a program called PriorsEditor that can be used to create such positional priors tracks based on a combination of several features, including phylogenetic conservation, nucleosome occupancy, histone modifications, physical properties of the DNA helix and many more.
Availability: PriorsEditor is available as a web start application and downloadable archive from http://tare.medisin.ntnu.no/priorseditor (requires Java 1.6). The web site also provides tutorials, screenshots and example protocol scripts.
Contact: kjetil.klepper@ntnu.no
Summary: We describe mbmdr, an R package for implementing the model-based multifactor dimensionality reduction (MB-MDR) method. MB-MDR has been proposed by Calle et al. as a dimension reduction method for exploring gene–gene interactions in case-control association studies. It is an extension of the popular multifactor dimensionality reduction (MDR) method of Ritchie et al. allowing a more flexible definition of risk cells. In MB-MDR, risk categories are defined using a regression model which allows adjustment for covariates and main effects and, in addition to the classical low risk and high risk categories, MB-MDR considers a third category of indeterminate or not informative cells. An important improvement added to the current mbmdr algorithm with respect to the original MB-MDR formulation in Calle et al., and also to the classical MDR approach, is the extension of the methodology to different outcome types. While MB-MDR was initially proposed for binary traits in the context of case-control studies, the mbmdr package provides options to analyze both binary and quantitative traits for unrelated individuals.
Availability: http://cran.r-project.org/
Contact: malu.calle@uvic.cat
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Most population genetic simulators fall into one of two classes, backward time simulators that quickly generate trees but accommodate only relatively simple selective and demographic regimes, and forward simulators that allow for a broader range of evolutionary scenarios but which cannot produce genealogies. Thus, few tools are available that allow for producing genealogies under arbitrarily complex selective and demographic models.
Results: TreesimJ is a forward time population genetic simulator that allows for sampling of genealogies, genetic data and many population parameters from populations evolving under complex evolutionary scenarios. The application provides many fitness and demographic models and new models are easy to develop. Data collection is performed by a variety of independently configurable collectors which periodically sample the population and record statistics. Output options include writing traces, histograms and summary statistics from the data collectors in addition to sampled genetic sequences and genealogies.
Summary: TreesimJ allows researchers to easily sample and analyze gene genealogies and related data from populations evolving under a wide variety of selective and demographic regimes. It is likely to be useful for population genetic researchers seeking to understand the links between evolutionary and demographic forces, genealogical structure and the resulting patterns of genetic variation.
Availability: TreesimJ home: http://staff.washington.edu/brendano/treesimj. Source and developer resources: http://code.google.com/p/treesimj
Contact: brendano@u.washington.edu
Summary: RPPanalyzer is a statistical tool developed to read reverse-phase protein array data, to perform the basic data analysis and to visualize the resulting biological information. The R-package provides different functions to compare protein expression levels of different samples and to normalize the data. Implemented plotting functions permit a quality control by monitoring data distribution and signal validity. Finally, the data can be visualized in heatmaps, boxplots, time course plots and correlation plots. RPPanalyzer is a flexible tool and tolerates a huge variety of different experimental designs.
Availability: RPPanalyzer is open source and freely available as an R package on the CRAN platform http://cran.r-project.org/
Contact: h.mannsperger@dkfz.de
Supplementary information: Supplementary data are available at Bioinformatics online.
Summary: BigWig and BigBed files are compressed binary indexed files containing data at several resolutions that allow the high-performance display of next-generation sequencing experiment results in the UCSC Genome Browser. The visualization is implemented using a multi-layered software approach that takes advantage of specific capabilities of web-based protocols and Linux and UNIX operating systems files, R trees and various indexing and compression tricks. As a result, only the data needed to support the current browser view is transmitted rather than the entire file, enabling fast remote access to large distributed data sets.
Availability and implementation: Binaries for the BigWig and BigBed creation and parsing utilities may be downloaded at http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/. Source code for the creation and visualization software is freely available for non-commercial use at http://hgdownload.cse.ucsc.edu/admin/jksrc.zip, implemented in C and supported on Linux. The UCSC Genome Browser is available at http://genome.ucsc.edu
Contact: ann@soe.ucsc.edu
Supplementary information: Supplementary byte-level details of the BigWig and BigBed file formats are available at Bioinformatics online. For an in-depth description of UCSC data file formats and custom tracks, see http://genome.ucsc.edu/FAQ/FAQformat.html and http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html
Motivation: Copy number variation (CNV), a major contributor to human genetic variation, comprises ≥ 1 kb genomic deletions and insertions. Yet, the identification of CNVs from microarray data is still hampered by high false negative and positive prediction rates due to the noisy nature of the raw data. Here, we present CNVineta, an R package for rapid data mining and visualization of CNVs in large case–control datasets genotyped with single nucleotide polymorphism oligonucleotide arrays. CNVineta is compatible with various established CNV prediction algorithms, can be used for genome-wide association analysis of rare and common CNVs and enables rapid and serial display of log2 of raw data ratios as well as B-allele frequencies for visual quality inspection. In summary, CNVineta aids in the interpretation of large-scale CNV datasets and prioritization of target regions for follow-up experiments.
Availability and Implementation: CNVineta is available as an R package and can be downloaded from http://www.ikmb.uni-kiel.de/CNVineta/; the package contains a tutorial outlining a typical workflow. The CNVineta compatible HapMap dataset can also be downloaded from the link above.
Contact: m.wittig@mucosa.de
Supplementary information: Supplementary data are available at Bioinformatics online.
Summary: In recent years, the number of knowledge bases developed using Wiki technology has exploded. Unfortunately, next to their numerous advantages, classical Wikis present a critical limitation: the invaluable knowledge they gather is represented as free text, which hinders their computational exploitation. This is in sharp contrast with the current practice for biological databases where the data is made available in a structured way. Here, we present WikiOpener, an extension for the classical MediaWiki engine that augments Wiki pages by allowing on-the-fly querying and formatting of resources external to the Wiki. Those resources may provide data extracted from databases or DAS tracks, or even results returned by local or remote bioinformatics analysis tools. This also implies that structured data can be edited via dedicated forms. Hence, this generic resource combines the structure of biological databases with the flexibility of collaborative Wikis.
Availability: The source code and its documentation are freely available on the MediaWiki website: http://www.mediawiki.org/wiki/Extension:WikiOpener.
Contact: sbrohee@esat.kuleuven.be
Supplementary information: Supplementary data are available at Bioinformatics online.