Motivation: Variable selection is a typical approach used for molecular-signature and biomarker discovery; however, its application to survival data is often complicated by censored samples. We propose a new algorithm for variable selection, called Survival Max–Min Parents and Children (SMMPC), that is suitable for the analysis of high-dimensional, right-censored data. The algorithm is conceptually simple, scalable, based on the theory of Bayesian networks (BNs) and the Markov blanket, and extends the corresponding algorithm (MMPC) for classification tasks. The selected variables have a structural interpretation: if T is the survival time (in general the time-to-event), SMMPC returns the variables adjacent to T in the BN representing the data distribution. The selected variables also have a causal interpretation that we discuss.
Results: We conduct an extensive empirical analysis of prototypical and state-of-the-art variable selection algorithms for survival data that are applicable to high-dimensional biological data. SMMPC selects, on average, the smallest variable subsets (fewer than a dozen per dataset) while statistically significantly outperforming all of the other methods in the study that return a manageable number of genes, i.e. few enough to be inspected by a human expert.
Availability: Matlab and R code are freely available from http://www.mensxmachina.org
Contact: vlagani@ics.forth.gr
Supplementary information: Supplementary data are available at Bioinformatics online.
Summary: We present SVDetect, a program designed to identify genomic structural variations from paired-end and mate-pair next-generation sequencing data produced by the Illumina GA and ABI SOLiD platforms. Applying both sliding-window and clustering strategies, we use anomalously mapped read pairs provided by current short read aligners to localize genomic rearrangements and classify them according to their type, e.g. large insertions–deletions, inversions, duplications and balanced or unbalanced inter-chromosomal translocations. SVDetect outputs predicted structural variants in various file formats for appropriate graphical visualization.
Availability: Source code and sample data are available at http://svdetect.sourceforge.net/
Contact: svdetect@curie.fr
Supplementary information: Supplementary data are available at Bioinformatics online.
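As a rough illustration of how an anomalously mapped read pair can hint at the rearrangement types listed in the SVDetect summary above, the sketch below labels a single pair from its chromosomes, strands and mapped distance. It assumes Illumina paired-end conventions (mates facing each other, with an expected insert-size range). The `ReadPair` type, the `classify_pair` function, the thresholds and the labels are all hypothetical; they do not reproduce SVDetect's sliding-window and clustering logic or its file formats.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """One mapped read pair (hypothetical minimal record)."""
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str  # '+' or '-'


def classify_pair(pair, min_insert=200, max_insert=500):
    """Return a coarse, illustrative label for one read pair.

    min_insert and max_insert are assumed bounds on the normal
    mapped distance between mates; real values depend on the library.
    """
    if pair.chrom1 != pair.chrom2:
        return "inter-chromosomal"
    # Order the two mates by mapping position (leftmost first).
    (lpos, lstrand), (rpos, rstrand) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )
    if lstrand == rstrand:
        # Both mates on the same strand suggests an inversion.
        return "inversion"
    if lstrand == "-" and rstrand == "+":
        # Mates pointing away from each other suggests a duplication.
        return "duplication"
    distance = rpos - lpos
    if distance > max_insert:
        return "deletion"
    if distance < min_insert:
        return "insertion"
    return "normal"


# Usage: a pair mapped 4.9 kb apart in the expected orientation.
print(classify_pair(ReadPair("chr1", 100, "+", "chr1", 5000, "-")))  # deletion
```

A single pair is only weak evidence; in practice a call would be supported by many concordant anomalous pairs, which is what the clustering step described above provides.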
Summary: Genomes undergo large structural changes that alter their organization. The chromosomal regions affected by these rearrangements are called breakpoints, while those which have not been rearranged are called synteny blocks. Lemaitre et al. presented a new method to precisely delimit rearrangement breakpoints in a genome by comparison with the genome of a related species. Receiving as input a list of one-to-one orthologous genes found in the genomes of two species, the method builds a set of reliable and non-overlapping synteny blocks and refines the regions that are not contained in them. Through the alignment of each breakpoint sequence against its specific orthologous sequences in the other species, we can look for weak similarities inside the breakpoint, thus extending the synteny blocks and narrowing the breakpoints. The identification of the narrowed breakpoints relies on a segmentation algorithm and is statistically assessed. Here, we present the package Cassis that implements this method of precise detection of genomic rearrangement breakpoints.
Availability: Perl and R scripts are freely available for download at http://pbil.univ-lyon1.fr/software/Cassis/. Documentation with methodological background, technical aspects, download and setup instructions, as well as examples of applications are available together with the package. The package was tested on Linux and Mac OS environments and is distributed under the GNU GPL License.
Contact: Marie-France.Sagot@inria.fr
Supplementary information: Supplementary data are available at Bioinformatics online.
Summary: Multiple sequence alignment (MSA) is an important step in comparative sequence analyses. Parallelization is a key technique for reducing the time required for large-scale sequence analyses. The three calculation stages, all-to-all comparison, progressive alignment and iterative refinement, of the MAFFT MSA program were parallelized using the POSIX Threads library. Two natural parallelization strategies (best-first and simple hill-climbing) were implemented for the iterative refinement stage. Based on comparisons of the objective scores and benchmark scores between the two approaches, we selected a simple hill-climbing approach as the default.
Availability: The parallelized version of MAFFT is available at http://mafft.cbrc.jp/alignment/software/. This version currently supports the Linux operating system only.
Contact: kazutaka.katoh@aist.go.jp
Supplementary information: Supplementary data are available at Bioinformatics online.
Summary: Bisulfite sequencing allows cytosine methylation, an important epigenetic marker, to be detected via nucleotide substitutions. Since the Applied Biosystems SOLiD System uses a unique di-base encoding that increases confidence in the detection of nucleotide substitutions, it is a potentially advantageous platform for this application. However, the di-base encoding also makes reads with many nucleotide substitutions difficult to align to a reference sequence with existing tools, preventing the platform's potential utility for bisulfite sequencing from being realized. Here, we present SOCS-B, a reference-based, un-gapped alignment algorithm for the SOLiD System that is tolerant of both bisulfite-induced nucleotide substitutions and a parametric number of sequencing errors, facilitating bisulfite sequencing on this platform. An implementation of the algorithm has been integrated with the previously reported SOCS alignment tool, and was used to align CpG methylation-enriched Arabidopsis thaliana bisulfite sequence data, exhibiting a 2-fold increase in sensitivity compared to existing methods for aligning SOLiD bisulfite data.
Availability: Executables, source code, and sample data are available at http://solidsoftwaretools.com/gf/project/socs/
Contact: bergmann@nbacc.net
Supplementary information: Supplementary data are available at Bioinformatics online.
Summary: We present the first parallel implementation of the T-Coffee consistency-based multiple aligner. We benchmark it on the Amazon Elastic Cloud (EC2) and show that the parallelization procedure is reasonably effective. We also conclude that for a web server with moderate usage (10K hits/month) the cloud provides a cost-effective alternative to in-house deployment.
Availability: T-Coffee is a freeware open source package available from http://www.tcoffee.org/homepage.html
Contact: cedric.notredame@crg.es
Summary: Detection of distant homology is a widely used computational approach for studying protein evolution, structure and function. Here, we report a homology search web server based on sequence profile–profile comparison. The user may perform searches in one of several regularly updated profile databases using either a single sequence or a multiple sequence alignment as an input. The same profile databases can also be downloaded for local use. The capabilities of the server are illustrated with the identification of new members of the highly diverse PD-(D/E)XK nuclease superfamily.
Availability: http://www.ibt.lt/bioinformatics/coma/
Contact: venclovas@ibt.lt
Supplementary information: Supplementary data are available at Bioinformatics online.
Summary: adephylo is a package for the R software dedicated to the analysis of comparative evolutionary data. Phylogenetic comparative methods initially aimed at accounting for or removing the effects of phylogenetic signal in the analysis of biological traits. However, recent approaches have shown that considerable information can be gathered from the study of the phylogenetic signal. In particular, close examination of phylogenetic structures can unveil interesting evolutionary patterns. For this purpose, we developed the package adephylo that provides tools for quantifying and describing the phylogenetic structures of biological traits. adephylo implements tests of phylogenetic signal, phylogenetic distances and proximities, and novel methods for describing further univariate and multivariate phylogenetic structures. These tools open up new perspectives in the analysis of evolutionary comparative data.
Availability: The stable version is available from CRAN: http://cran.r-project.org/web/packages/adephylo/. The development version is hosted by R-Forge: http://r-forge.r-project.org/projects/adephylo/. Both versions can be installed directly from R. adephylo is distributed under the GNU General Public Licence (≥2).
Contact: t.jombart@imperial.ac.uk; dray@biomserv.univ-lyon1.fr
Supplementary information: Supplementary data are available at Bioinformatics online.
Summary: Count is a software package for the analysis of numerical profiles on a phylogeny. It is primarily designed to deal with profiles derived from the phyletic distribution of homologous gene families, but is suited to study any other integer-valued evolutionary characters. Count performs ancestral reconstruction, and infers family- and lineage-specific characteristics along the evolutionary tree. It implements popular methods employed in gene content analysis such as Dollo and Wagner parsimony, propensity for gene loss, as well as probabilistic methods involving a phylogenetic birth-and-death model.
Availability: Count is available as a stand-alone Java application, as well as an application bundle for MacOS X, at the web site http://www.iro.umontreal.ca/~csuros/gene_content/count.html. It can also be launched using Java Webstart from the same site. The software is distributed under a BSD-style license. Source code is available upon request from the author.
Contact: csuros@iro.umontreal.ca
Summary: Structure-based approaches complement ligand-based approaches for lead-discovery and cross-reactivity prediction. We present to the scientific community a web server for comparing the surface of a ligand bound site of a protein against a ligand bound site surface database of 106 796 sites. The web server implements the property encoded shape distributions (PESD) algorithm for surface comparison. A typical virtual screen takes 5 min to complete. The output provides a ranked list of sites (by site similarity), hyperlinked to the corresponding entries in the PDB and PDBeChem databases.
Availability: The server is freely accessible at http://reccr.chem.rpi.edu/Software/pesdserv/
Contact: brenec@rpi.edu
Summary: ViewDock TDW is a modification of the pre-existing ViewDock Chimera extension (http://www.cgl.ucsf.edu/chimera/) used to visualize results of virtual screening experiments. By combining TDW hardware and an enhanced ViewDock interface, dozens of ligand–protein complexes are rendered simultaneously to parallelize the analysis of candidate ligands. The ViewDock TDW GUI allows the user to easily and interactively manipulate the molecules on the TDW as an entire set, a selected subset or a single ligand–protein complex and preserves all Chimera functionality.
Availability and Implementation: ViewDock TDW is an open source software; freely available on the web at http://www.tdw-prime.webs.com. Chimera UCSF is also available, free of charge, at http://www.cgl.ucsf.edu/chimera/
Contact: jhaga@bioeng.ucsd.edu
Summary: We present LOX (Level Of eXpression), which estimates the Level Of gene eXpression from high-throughput-expressed sequence datasets with multiple treatments or samples. Unlike most analyses, LOX incorporates a gene bias model that facilitates the integration of transcriptomic sequencing data produced using diverse experimental methodologies. LOX integrates overall sequence count tallies normalized by total expressed sequence count to provide expression levels for each gene relative to all treatments as well as Bayesian credible intervals.
Availability: http://www.yale.edu/townsend/software.html
Contact: jeffrey.townsend@yale.edu
Summary: The miRror application provides insights on microRNA (miRNA) regulation. It is based on the notion of a combinatorial regulation by an ensemble of miRNAs or genes. miRror integrates predictions from a dozen miRNA resources that are based on complementary algorithms into a unified statistical framework. For a set of miRNAs given as input, the online tool provides a list of targets, based on the set of resources selected by the user, ranked according to their significance of being coordinately regulated. Symmetrically, a set of genes can be used as input to suggest a set of miRNAs. The user can restrict the analysis to the preferred tissue or cell line. miRror is suitable for analyzing results from miRNA profiling, proteomics and gene expression arrays.
Availability: http://www.proto.cs.huji.ac.il/mirror
Contact: michall@cc.huji.ac.il
Summary: Endeavour is a tool that detects the most promising genes within large lists of candidates with respect to a biological process of interest and by combining several genomic data sources. We have benchmarked Endeavour using 450 pathway maps and 826 disease marker sets from MetaCore™ of GeneGo, Inc. containing a total of 9911 and 12 432 genes, respectively. We obtained an area under the receiver operating characteristic curves of 0.97 for pathway and of 0.91 for disease gene sets. These results indicate that Endeavour can be used to efficiently prioritize candidate genes for pathways and diseases.
Availability: Endeavour is available at http://www.esat.kuleuven.be/endeavour
Contact: sven.schuierer@novartis.com; leon-charles.tranchevent@esat.kuleuven.be
Supplementary information: Supplementary data are available at Bioinformatics online.
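The area under the receiver operating characteristic curve reported for Endeavour above can be read as the probability that a randomly chosen true gene is scored above a randomly chosen non-member. The sketch below computes that quantity directly from pairwise comparisons. It is a generic illustration, not Endeavour's benchmarking code: the `roc_auc` function, the toy scores and the convention that a higher score means a more promising gene are assumptions.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: numeric prioritization scores (assumed: higher is better).
    labels: 1 for a true member of the gene set, 0 otherwise.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

The quadratic loop is chosen for clarity; a rank-based formulation gives the same value more efficiently on large candidate lists.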
Summary: Recently, several methods for analyzing phenotype data have been published, but only a few are able to cope with data sets generated in different studies, with different methods, or for different species. We developed an online system in which more than 300 000 phenotypes from a wide variety of sources and screening methods can be analyzed together. Clusters of similar phenotypes are visualized as networks of highly similar phenotypes, inducing gene groups useful for functional analysis. This system is part of PhenomicDB, providing the world's largest cross-species phenotype data collection with a tool to mine its wealth of information.
Availability: Freely available at http://www.phenomicdb.de
Contact: bertram.weiss@bayerhealthcare.com
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Finding association between genetic variants and phenotypes related to disease has become an important vehicle for the study of complex disorders. In this context, multi-loci genetic association might unravel additional information when compared with single loci search. The main goal of this work is to propose a non-linear methodology based on information theory for finding combinatorial association between multi-SNPs and a given phenotype.
Results: The proposed methodology, called MISS (mutual information statistical significance), has been integrated jointly with a feature selection algorithm and has been tested on a synthetic dataset with a controlled phenotype and in the particular case of the F7 gene. The MISS methodology has been contrasted with a multiple linear regression (MLR) method used for genetic association in both a population-based study and a sib-pairs analysis, and with the maximum entropy conditional probability modelling (MECPM) method, which searches for predictive multi-locus interactions. Several sets of SNPs within the F7 gene region have been found to show a significant correlation with the FVII levels in blood. The proposed multi-site approach unveils combinations of SNPs that explain more significant information of the phenotype than their individual polymorphisms. MISS is able to find more correlations between SNPs and the phenotype than MLR and MECPM. Most of the marked SNPs appear in the literature as functional variants with a real effect on the protein FVII levels in blood.
Availability: The code is available at http://sisbio.recerca.upc.edu/R/MISS_0.2.tar.gz
Contact: helena.brunel@upc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
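To make concrete the idea in the MISS abstract above that a combination of SNPs can carry information about a phenotype that the individual SNPs do not, the following sketch estimates mutual information from counts and attaches a permutation-based p-value. It is not the MISS implementation: the abstract does not specify the estimator or the significance test, so the plug-in estimator, the permutation test, the function names and the assumption of a discrete (or discretized) phenotype are all illustrative choices.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def permutation_p_value(xs, ys, n_perm=1000, seed=0):
    """Share of phenotype shufflings with MI at least as large as observed."""
    rng = random.Random(seed)
    observed = mutual_information(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mutual_information(xs, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data: the phenotype is the XOR of two hypothetical SNPs.
snp1 = [0, 0, 1, 1]
snp2 = [0, 1, 0, 1]
phenotype = [0, 1, 1, 0]
print(mutual_information(snp1, phenotype))                   # 0.0
print(mutual_information(list(zip(snp1, snp2)), phenotype))  # 1.0
```

Treating a tuple of genotypes as a single joint variable is the simplest way to score a multi-SNP combination; with many SNPs the joint table becomes sparse, which is one reason a significance assessment is needed.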
Motivation: New sequencing technologies have accelerated research on prokaryotic genomes and have made genome sequencing operations outside major genome sequencing centers routine. However, no off-the-shelf solution exists for the combined assembly, gene prediction, genome annotation and data presentation necessary to interpret sequencing data. The resulting requirement to invest significant resources into custom informatics support for genome sequencing projects remains a major impediment to the accessibility of high-throughput sequence data.
Results: We present a self-contained, automated high-throughput open source genome sequencing and computational genomics pipeline suitable for prokaryotic sequencing projects. The pipeline has been used at the Georgia Institute of Technology and the Centers for Disease Control and Prevention for the analysis of Neisseria meningitidis and Bordetella bronchiseptica genomes. The pipeline is capable of enhanced or manually assisted reference-based assembly using multiple assemblers and modes; gene predictor combining; and functional annotation of genes and gene products. Because every component of the pipeline is executed on a local machine with no need to access resources over the Internet, the pipeline is suitable for projects of a sensitive nature. Annotation of virulence-related features makes the pipeline particularly useful for projects working with pathogenic prokaryotes.
Availability and implementation: The pipeline is licensed under the open-source GNU General Public License and available at the Georgia Tech Neisseria Base (http://nbase.biology.gatech.edu/). The pipeline is implemented with a combination of Perl, Bourne Shell and MySQL and is compatible with Linux and other Unix systems.
Contact: king.jordan@biology.gatech.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Current algorithms for estimating DNA copy numbers (CNs) borrow concepts from gene expression analysis methods. However, single nucleotide polymorphism (SNP) arrays have special characteristics that, if taken into account, can improve the overall performance. For example, cross hybridization between alleles occurs in SNP probe pairs. In addition, most of the current CN methods are focused on total CNs, while it has been shown that allele-specific CNs are of paramount importance for some studies. Therefore, we have developed a summarization method that estimates high-quality allele-specific CNs.
Results: The proposed method estimates the allele-specific DNA CNs for all Affymetrix SNP arrays dealing directly with the cross hybridization between probes within SNP probesets. This algorithm outperforms (or at least it performs as well as) other state-of-the-art algorithms for computing DNA CNs. It better discerns an aberration from a normal state and it also gives more precise allele-specific CNs.
Availability: The method is available in the open-source R package ACNE, which also includes an add on to the aroma.affymetrix framework (http://www.aroma-project.org/).
Contact: arubio@ceit.es
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Finding biologically causative genotype–phenotype associations from whole-genome data is difficult due to the large gene feature space to mine, the potential for interactions among genes and phylogenetic correlations between genomes. Associations within phylogenetically distinct organisms with unusual molecular mechanisms underlying their phenotype may be particularly difficult to assess.
Results: We have developed a new genotype–phenotype association approach that uses Classification based on Predictive Association Rules (CPAR), and compare it with NETCAR, a recently published association algorithm. Our implementation of CPAR gave on average slightly higher classification accuracy, with approximately 100 times faster running times. Given the influence of phylogenetic correlations in the extraction of genotype–phenotype association rules, we furthermore propose a novel measure for downweighting the dependence among samples by modeling shared ancestry using conditional mutual information, and demonstrate its complementary nature to traditional mining approaches.
Availability: Software implemented for this study is available under the Creative Commons Attribution 3.0 license from the author at http://kiwi.cs.dal.ca/Software/PICA
Contact: beiko@cs.dal.ca
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: The limited availability of protein structures often restricts the functional annotation of proteins and the identification of their protein–protein interaction sites. Computational methods to identify interaction sites from protein sequences alone are, therefore, required for unraveling the functions of many proteins. This article describes a new method (PSIVER) to predict interaction sites, i.e. residues binding to other proteins, in protein sequences. Only sequence features (position-specific scoring matrix and predicted accessibility) are used for training a Naïve Bayes classifier (NBC), and conditional probabilities of each sequence feature are estimated using a kernel density estimation method (KDE).
Results: The leave-one-out cross-validation of PSIVER achieved a Matthews correlation coefficient (MCC) of 0.151, an F-measure of 35.3%, a precision of 30.6% and a recall of 41.6% on a non-redundant set of 186 protein sequences extracted from 105 heterodimers in the Protein Data Bank (consisting of 36 219 residues, of which 15.2% were known interface residues). Even though the dataset used for training was highly imbalanced, a randomization test demonstrated that the proposed method managed to avoid overfitting. PSIVER was also tested on 72 sequences not used in training (consisting of 18 140 residues, of which 10.6% were known interface residues), and achieved an MCC of 0.135, an F-measure of 31.5%, a precision of 25.0% and a recall of 46.5%, outperforming other publicly available servers tested on the same dataset. PSIVER enables experimental biologists to identify potential interface residues in unknown proteins from sequence information alone, and to mutate those residues selectively in order to unravel protein functions.
Availability: Freely available on the web at http://tardis.nibio.go.jp/PSIVER/
Contact: yoichi@nibio.go.jp; kenji@nibio.go.jp
Supplementary information: Supplementary data are available at Bioinformatics online.
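The evaluation measures quoted for PSIVER above (precision, recall, F-measure and MCC) all derive from the counts of true and false positives and negatives. The sketch below gives the standard definitions. The function names and the toy confusion counts are illustrative and not taken from the paper, and the sketch assumes the measures are computed over pooled residues rather than averaged per protein.

```python
import math


def precision(tp, fp):
    """Fraction of predicted interface residues that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of true interface residues that are recovered."""
    return tp / (tp + fn)


def f_measure(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventionally 0 when any marginal total is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Harmonic mean of the cross-validation precision (30.6%) and recall (41.6%).
print(round(f_measure(0.306, 0.416), 3))  # 0.353

# Toy confusion counts (hypothetical, not from the paper).
print(round(mcc(tp=5, fp=5, fn=5, tn=85), 3))  # 0.444
```

Because interface residues are a minority class, MCC is a more conservative summary than accuracy here: a predictor that labels everything as non-interface scores high accuracy but an MCC of zero.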