Bioinformatics

Molecular signatures that can be transferred across different omics platforms

Mon, 2017-08-07 23:00
Bioinformatics (2017) 33 (14): i333-i340.
Categories: Bioinformatics, Journals

Author Index

Tue, 2017-07-11 23:00
Aittokallio, T. i359
Categories: Bioinformatics, Journals

ISMB/ECCB 2017 proceedings

Tue, 2017-07-11 23:00
The 25th Annual Conference on Intelligent Systems for Molecular Biology (ISMB), held jointly with the 16th Annual European Conference on Computational Biology (ECCB), took place in Prague, Czech Republic, July 21–25, 2017 (http://www.iscb.org/ismbeccb2017). This special issue of Bioinformatics serves as the proceedings for this official conference of the International Society for Computational Biology (ISCB, http://www.iscb.org/).
Categories: Bioinformatics, Journals

HopLand: single-cell pseudotime recovery using continuous Hopfield network-based modeling of Waddington’s epigenetic landscape

Tue, 2017-07-11 23:00
Abstract
Motivation: The interpretation of transcriptional dynamics in single-cell data, especially pseudotime estimation, could help understand the transition of gene expression profiles. The recovery of pseudotime increases the temporal resolution of single-cell transcriptional data, but is challenging due to the high variability in gene expression between individual cells. Here, we introduce HopLand, a pseudotime recovery method that uses a continuous Hopfield network to map cells to a Waddington's epigenetic landscape. It reveals from the single-cell data the combinatorial regulatory interactions among genes that control the dynamic progression through successive cell states.
Results: We applied HopLand to different types of single-cell transcriptomic data. It achieved high accuracy of pseudotime prediction compared with existing methods. Moreover, a kinetic model can be extracted from each dataset. Through the analysis of such a model, we identified key genes and regulatory interactions driving the transition of cell states. Therefore, our method has the potential to generate fundamental insights into cell fate regulation.
Availability and implementation: The MATLAB implementation of HopLand is available at https://github.com/NetLand-NTU/HopLand.
Contact: zhengjie@ntu.edu.sg
Categories: Bioinformatics, Journals
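HopLand builds its landscape on continuous Hopfield dynamics. As a minimal, hedged sketch of such dynamics (illustrative only; the two-node network, weights and step size below are assumptions, not HopLand's actual model), the state relaxes toward an attractor of dx_i/dt = -x_i + Σ_j W[i][j]·tanh(x_j):

```python
import math

def hopfield_step(x, W, dt=0.1):
    """One Euler step of continuous Hopfield dynamics:
    dx_i/dt = -x_i + sum_j W[i][j] * tanh(x_j)."""
    return [xi + dt * (-xi + sum(w * math.tanh(xj) for w, xj in zip(row, x)))
            for xi, row in zip(x, W)]

W = [[0.0, 1.2],   # symmetric coupling, so an energy function exists
     [1.2, 0.0]]   # and trajectories settle into attractor states
x = [0.5, -0.3]
for _ in range(500):
    x = hopfield_step(x, W)
# x is now near a fixed point satisfying x_i = sum_j W[i][j] * tanh(x_j)
```

In a landscape view, the attractors of such a network play the role of valleys; placing cells on the landscape and tracking progression between states is what yields a pseudotime ordering.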

Improving the performance of minimizers and winnowing schemes

Tue, 2017-07-11 23:00
Abstract
Motivation: The minimizers scheme is a method for selecting k-mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a sequence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing time. Although very useful, the minimizers selection procedure has undesirable behaviors (e.g. too many k-mers are selected when processing certain sequences). Some of these problems were already known to the authors of the minimizers technique, and the natural lexicographic ordering of k-mers used by minimizers was recognized as their origin. Many software tools using minimizers employ ad hoc variations of the lexicographic order to alleviate those issues.
Results: We provide an in-depth analysis of the effect of k-mer ordering on the performance of the minimizers technique. By using small universal hitting sets (a recently defined concept), we show how to significantly improve the performance of minimizers and avoid some of their worst behaviors. Based on these results, we encourage bioinformatics software developers to use an ordering based on a universal hitting set or, if that is not possible, a randomized ordering, rather than the lexicographic order. This analysis also settles negatively a conjecture (by Schleimer et al.) on the expected density of minimizers in a random sequence.
Availability and implementation: The software used for this analysis is available on GitHub: https://github.com/gmarcais/minimizers.git.
Contact: gmarcais@cs.cmu.edu or carlk@cs.cmu.edu
Categories: Bioinformatics, Journals
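To make the scheme concrete, here is a toy sketch (not the authors' implementation): each window of w consecutive k-mers contributes its smallest k-mer under a chosen ordering, and swapping the lexicographic order for a randomized one is just a change of comparison key.

```python
def select_minimizers(seq, k, w, key):
    """Return the sorted positions of minimizers: for each window of w
    consecutive k-mers, pick the position of the smallest k-mer under
    `key` (ties broken by leftmost position)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (key(kmers[i]), i))
        picked.add(best)
    return sorted(picked)

lex = lambda kmer: kmer                   # natural lexicographic ordering
rnd = lambda kmer: hash(("salt", kmer))   # stand-in for a randomized ordering
                                          # (Python string hashing is salted
                                          # per process, which is fine here)

positions = select_minimizers("ACGTTGCATGACGTCA", k=4, w=5, key=lex)
positions_rnd = select_minimizers("ACGTTGCATGACGTCA", k=4, w=5, key=rnd)
```

Whatever the ordering, the selection keeps the scheme's defining property: every window of w consecutive k-mers contains at least one selected position, so two sequences sharing a long enough substring share a selected k-mer.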

Modelling haplotypes with respect to reference cohort variation graphs

Tue, 2017-07-11 23:00
Abstract
Motivation: Current statistical models of haplotypes are limited to panels of haplotypes whose genetic variation can be represented by arrays of values at linearly ordered bi- or multiallelic loci. These methods cannot model structural variants or variants that nest or overlap.
Results: A variation graph is a mathematical structure that can encode arbitrarily complex genetic variation. We present the first haplotype model that operates on a variation graph-embedded population reference cohort. We describe an algorithm to calculate the likelihood that a haplotype arose from this cohort through recombinations and demonstrate time complexity linear in haplotype length and sublinear in population size. We furthermore demonstrate a method of rapidly calculating likelihoods for related haplotypes. We describe mathematical extensions to allow modelling of mutations. This work is an important incremental step for clinical genomics and genetic epidemiology, since it is the first haplotype model that can represent all classes of variation in the population.
Availability and implementation: Available on GitHub at https://github.com/yoheirosen/vg.
Contact: benedict@soe.ucsc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Abundance estimation and differential testing on strain level in metagenomics data

Tue, 2017-07-11 23:00
Abstract
Motivation: Current metagenomics approaches allow analyzing the composition of microbial communities at high resolution. Important changes to the composition are known to occur even on strain level and to go hand in hand with changes in disease or ecological state. However, specific challenges arise for strain-level analysis due to the highly similar genome sequences present. Only a limited number of tools approach taxa abundance estimation beyond species level, and there is a strong need for dedicated tools for strain resolution and differential abundance testing.
Methods: We present DiTASiC (Differential Taxa Abundance including Similarity Correction) as a novel approach for quantification and differential assessment of individual taxa in metagenomics samples. We introduce a generalized linear model for the resolution of shared read counts, which cause a significant bias on strain level. Further, we capture abundance estimation uncertainties, which play a crucial role in differential abundance analysis. A novel statistical framework is built, which integrates the abundance variance and infers abundance distributions for differential testing sensitive to strain level.
Results: As a result, we obtain highly accurate abundance estimates down to sub-strain level and enable fine-grained resolution of strain clusters. We demonstrate the relevance of read ambiguity resolution and the integration of abundance uncertainties for differential analysis. Accurate detection of even small changes is achieved and false positives are significantly reduced. Superior performance is shown on the latest benchmark sets of various complexities and in comparison to existing methods.
Availability and implementation: DiTASiC code is freely available from https://rki_bioinformatics.gitlab.io/ditasic.
Contact: renardB@rki.de
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals
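The shared-read bias described above can be illustrated with a toy linear system (a deliberately simplified sketch; DiTASiC itself fits a generalized linear model and propagates estimation uncertainty, and the matrix entries here are invented). If A[i][j] is the fraction of reads from strain j that also map to reference i, naive per-reference counts are A applied to the true abundances, and solving the system removes the bias:

```python
def solve2(A, y):
    """Solve a 2x2 linear system A x = y by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    x0 = (y[0] * A[1][1] - A[0][1] * y[1]) / det
    x1 = (A[0][0] * y[1] - y[0] * A[1][0]) / det
    return [x0, x1]

# Toy mapping-similarity matrix: off-diagonal entries are reads shared
# between two near-identical strains, which inflate naive counts.
A = [[1.0, 0.8],
     [0.7, 1.0]]
true_abund = [100.0, 50.0]
observed = [sum(A[i][j] * true_abund[j] for j in range(2)) for i in range(2)]
est = solve2(A, observed)   # recovers the unbiased strain abundances
```

The same idea scales to many strains as a least-squares or GLM fit; the uncertainty of that fit is what DiTASiC carries forward into differential testing.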

Deep learning-based subdivision approach for large scale macromolecules structure recovery from electron cryo tomograms

Tue, 2017-07-11 23:00
Abstract
Motivation: Cellular Electron CryoTomography (CECT) enables 3D visualization of cellular organization at near-native state and in sub-molecular resolution, making it a powerful tool for analyzing structures of macromolecular complexes and their spatial organization inside single cells. However, the high degree of structural complexity together with practical imaging limitations makes the systematic de novo discovery of structures within cells challenging. It would likely require averaging and classifying millions of subtomograms potentially containing hundreds of highly heterogeneous structural classes. Although it is no longer difficult to acquire CECT data containing such numbers of subtomograms thanks to advances in data acquisition automation, existing computational approaches have very limited scalability or discrimination ability, making them incapable of processing such amounts of data.
Results: To complement existing approaches, in this article we propose a new approach for subdividing subtomograms into smaller but relatively homogeneous subsets. The structures in these subsets can then be separately recovered using existing computation-intensive methods. Our approach is based on supervised structural feature extraction using deep learning, in combination with unsupervised clustering and reference-free classification. Our experiments show that, compared with existing unsupervised rotation-invariant feature and pose-normalization based approaches, our new approach achieves significant improvements in both discrimination ability and scalability. More importantly, our new approach is able to discover new structural classes and recover structures that do not exist in the training data.
Availability and implementation: Source code freely available at http://www.cs.cmu.edu/∼mxu1/software.
Contact: mxu1@cs.cmu.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

deBGR: an efficient and near-exact representation of the weighted de Bruijn graph

Tue, 2017-07-11 23:00
Abstract
Motivation: Almost all de novo short-read genome and transcriptome assemblers start by building a representation of the de Bruijn graph of the reads they are given as input. Even when other approaches are used for subsequent assembly (e.g. when one is using ‘long read’ technologies like those offered by PacBio or Oxford Nanopore), efficient k-mer processing is still crucial for accurate assembly, and state-of-the-art long-read error-correction methods use de Bruijn graphs. Because of the centrality of de Bruijn graphs, researchers have proposed numerous methods for representing de Bruijn graphs compactly. Some of these proposals sacrifice accuracy to save space. Further, none of these methods store abundance information, i.e. the number of times that each k-mer occurs, which is key in transcriptome assemblers.
Results: We present a method for compactly representing the weighted de Bruijn graph (i.e. with abundance information) with essentially no errors. Our representation yields zero errors while increasing the space requirements by less than 18–28% compared to the approximate de Bruijn graph representation in Squeakr. Our technique is based on a simple invariant that all weighted de Bruijn graphs must satisfy, and hence is likely to be of general interest and applicable in most weighted de Bruijn graph-based systems.
Availability and implementation: https://github.com/splatlab/debgr.
Contact: rob.patro@cs.stonybrook.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals
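The kind of invariant the abstract alludes to can be seen on an exact weighted de Bruijn graph (a toy sketch with an invented example sequence; the precise invariant deBGR exploits is described in the paper): away from sequence ends, a k-mer's abundance equals the total abundance of the (k+1)-mers (edges) extending it.

```python
from collections import Counter

def weighted_dbg(seq, k):
    """Exact weighted de Bruijn graph: map each k-mer to its abundance."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

seq = "ACGTACGTT"
k = 3
nodes = weighted_dbg(seq, k)        # k-mer (node) abundances
edges = weighted_dbg(seq, k + 1)    # (k+1)-mer (edge) abundances

# Flow-style consistency check: a k-mer that never ends the sequence has
# abundance equal to the total abundance of its outgoing edges.
for kmer, count in nodes.items():
    if not seq.endswith(kmer):
        out = sum(c for e, c in edges.items() if e.startswith(kmer))
        assert count == out
```

Such redundancy is what makes a near-exact compressed representation possible: most abundances can be reconstructed from their neighbors rather than stored outright.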

Improved data-driven likelihood factorizations for transcript abundance estimation

Tue, 2017-07-11 23:00
Abstract
Motivation: Many methods for transcript-level abundance estimation reduce the computational burden associated with the iterative algorithms they use by adopting an approximate factorization of the likelihood function they optimize. This leads to considerably faster convergence of the optimization procedure, since each round of, e.g., the EM algorithm can execute much more quickly. However, these approximate factorizations of the likelihood function simplify calculations at the expense of discarding certain information that can be useful for accurate transcript abundance estimation.
Results: We demonstrate that model simplifications (i.e. factorizations of the likelihood function) adopted by certain abundance estimation methods can lead to a diminished ability to accurately estimate the abundances of highly related transcripts. In particular, considering factorizations based on transcript-fragment compatibility alone can result in a loss of accuracy compared to the per-fragment, unsimplified model. However, we show that such shortcomings are not an inherent limitation of approximately factorizing the underlying likelihood function. By considering the appropriate conditional fragment probabilities, and adopting improved, data-driven factorizations of this likelihood, we demonstrate that such approaches can achieve accuracy nearly indistinguishable from methods that consider the complete (i.e. per-fragment) likelihood, while retaining the computational efficiency of the compatibility-based factorizations.
Availability and implementation: Our data-driven factorizations are incorporated into a branch of the Salmon transcript quantification tool: https://github.com/COMBINE-lab/salmon/tree/factorizations.
Contact: rob.patro@cs.stonybrook.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals
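A compatibility-based factorization, the baseline the paper improves on, reduces EM to equivalence classes of fragments that align to the same set of transcripts. A minimal sketch with invented counts (real factorizations also condition on fragment-level alignment probabilities, which is exactly the information this simplification discards):

```python
# Each equivalence class: (compatible transcripts) -> fragment count.
eq_classes = {("t1",): 10, ("t1", "t2"): 20, ("t2",): 5}
abund = {"t1": 0.5, "t2": 0.5}          # uniform starting abundances

for _ in range(100):
    # E-step: split each class's count among its transcripts in
    # proportion to current abundances.
    expected = {t: 0.0 for t in abund}
    for txps, count in eq_classes.items():
        total = sum(abund[t] for t in txps)
        for t in txps:
            expected[t] += count * abund[t] / total
    # M-step: renormalize expected counts into abundances.
    n = sum(expected.values())
    abund = {t: e / n for t, e in expected.items()}
```

With these counts the iteration converges to abund["t1"] = 2/3 and abund["t2"] = 1/3: the 20 ambiguous fragments are apportioned by the fixed point of the update rather than by any per-fragment evidence.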

Tumor phylogeny inference using tree-constrained importance sampling

Tue, 2017-07-11 23:00
Abstract
Motivation: A tumor arises from an evolutionary process that can be modeled as a phylogenetic tree. However, reconstructing this tree is challenging, as most cancer sequencing uses bulk tumor tissue containing heterogeneous mixtures of cells.
Results: We introduce Probabilistic Algorithm for Somatic Tree Inference (PASTRI), a new algorithm for bulk-tumor sequencing data that clusters somatic mutations into clones and infers a phylogenetic tree that describes the evolutionary history of the tumor. PASTRI uses an importance sampling algorithm that combines a probabilistic model of DNA sequencing data with an enumeration algorithm based on the combinatorial constraints defined by the underlying phylogenetic tree. As a result, tree inference is fast, accurate and robust to noise. We demonstrate on simulated data that PASTRI outperforms other cancer phylogeny algorithms in terms of runtime and accuracy. On real data from a chronic lymphocytic leukemia (CLL) patient, we show that a simple linear phylogeny explains the data better than the complex branching phylogeny that was previously reported. PASTRI provides a robust approach for phylogenetic tree inference from mixed samples.
Availability and implementation: Software is available at compbio.cs.brown.edu/software.
Contact: braphael@princeton.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals
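Importance sampling, the core inferential tool named above, estimates an expectation under an awkward target distribution by drawing from a tractable proposal and reweighting. A generic one-dimensional sketch (a toy, not PASTRI's tree-constrained sampler; the target and proposal below are invented for illustration):

```python
import math
import random

random.seed(1)

def importance_estimate(f, p_pdf, q_pdf, q_sample, n=50000):
    """Self-normalized importance sampling: estimate E_p[f(X)] using
    draws from a tractable proposal q, weighting each by p(x)/q(x)."""
    total_w = 0.0
    total_fw = 0.0
    for _ in range(n):
        x = q_sample()
        w = p_pdf(x) / q_pdf(x)
        total_w += w
        total_fw += w * f(x)
    return total_fw / total_w

# Toy target p = N(2, 1); proposal q = Uniform(-3, 7). True E_p[X] = 2.
p_pdf = lambda x: math.exp(-0.5 * (x - 2.0) ** 2) / math.sqrt(2.0 * math.pi)
q_pdf = lambda x: 0.1
est = importance_estimate(lambda x: x, p_pdf, q_pdf,
                          lambda: random.uniform(-3.0, 7.0))
```

In PASTRI's setting the proposal is built over the combinatorially constrained space of trees, so every sample is a valid phylogeny and the weights account for how well it explains the sequencing data.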

Discovery and genotyping of novel sequence insertions in many sequenced individuals

Tue, 2017-07-11 23:00
Abstract
Motivation: Despite recent advances in algorithm design for characterizing structural variation using high-throughput short read sequencing (HTS) data, characterization of novel sequence insertions longer than the average read length remains a challenging task. This is mainly due to both computational difficulties and the complexities imposed by genomic repeats in generating reliable assemblies to accurately detect both the sequence content and the exact location of such insertions. Additionally, de novo genome assembly algorithms typically require a very high depth of coverage, which may be a limiting factor for most genome studies. Therefore, characterization of novel sequence insertions is not a routine part of most sequencing projects. There are only a handful of algorithms specifically developed for novel sequence insertion discovery that can bypass the need for whole-genome de novo assembly. Still, most such algorithms rely on a high depth of coverage, and to our knowledge there is only one method (PopIns) that can use multi-sample data to “collectively” obtain a very high coverage dataset to accurately find insertions common in a given population.
Results: Here, we present Pamir, a new algorithm to efficiently and accurately discover and genotype novel sequence insertions using either single or multiple genome sequencing datasets. Pamir is able to detect breakpoint locations of the insertions and calculate their zygosity (i.e. heterozygous versus homozygous) by analyzing multiple sequence signatures, matching one-end-anchored sequences to small-scale de novo assemblies of unmapped reads, and conducting strand-aware local assembly. We test the efficacy of Pamir on both simulated and real data, and demonstrate its potential use in accurate and routine identification of novel sequence insertions in genome projects.
Availability and implementation: Pamir is available at https://github.com/vpc-ccg/pamir.
Contact: fhach@{sfu.ca, prostatecentre.com} or calkan@cs.bilkent.edu.tr
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Incorporating interaction networks into the determination of functionally related hit genes in genomic experiments with Markov random fields

Tue, 2017-07-11 23:00
Abstract
Motivation: Incorporating gene interaction data into the identification of ‘hit’ genes in genomic experiments is a well-established approach leveraging the ‘guilt by association’ assumption to obtain a network-based hit list of functionally related genes. We aim to develop a method that allows for multivariate gene scores and multiple hit labels in order to extend the analysis of genomic screening data within such an approach.
Results: We propose a Markov random field-based method to achieve our aim and show that the particular advantages of our method compared with those currently used lead to new insights in previously analysed data as well as in our own motivating data. Our method additionally achieves the best performance in an independent simulation experiment. The real data applications we consider comprise a survival analysis and differential expression experiment and a cell-based RNA interference functional screen.
Availability and implementation: We provide all of the data and code related to the results in the paper.
Contact: sean.j.robinson@utu.fi or laurent.guyon@cea.fr
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Alignment of dynamic networks

Tue, 2017-07-11 23:00
Abstract
Motivation: Network alignment (NA) aims to find a node mapping that conserves similar regions between compared networks. NA is applicable to many fields, including computational biology, where NA can guide the transfer of biological knowledge from well- to poorly-studied species across aligned network regions. Existing NA methods can only align static networks. However, most complex real-world systems evolve over time and should thus be modeled as dynamic networks. We hypothesize that aligning dynamic network representations of evolving systems will produce superior alignments compared to aligning the systems’ static network representations, as is currently done.
Results: For this purpose, we introduce the first ever dynamic NA method, DynaMAGNA++. This proof-of-concept dynamic NA method is an extension of a state-of-the-art static NA method, MAGNA++. Even though both MAGNA++ and DynaMAGNA++ optimize edge as well as node conservation across the aligned networks, MAGNA++ conserves static edges and similarity between static node neighborhoods, while DynaMAGNA++ conserves dynamic edges (events) and similarity between evolving node neighborhoods. For this purpose, we introduce the first ever measure of dynamic edge conservation and rely on our recent measure of dynamic node conservation. Importantly, the two dynamic conservation measures can be optimized with any state-of-the-art NA method and not just MAGNA++. We confirm our hypothesis that dynamic NA is superior to static NA, on synthetic and real-world networks, in computational biology and social domains. DynaMAGNA++ is parallelized and has a user-friendly graphical interface.
Availability and implementation: http://nd.edu/∼cone/DynaMAGNA++/.
Contact: tmilenko@nd.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Predicting multicellular function through multi-layer tissue networks

Tue, 2017-07-11 23:00
Abstract
Motivation: Understanding functions of proteins in specific human tissues is essential for insights into disease diagnostics and therapeutics, yet prediction of tissue-specific cellular function remains a critical challenge for biomedicine.
Results: Here, we present OhmNet, a hierarchy-aware unsupervised node feature learning approach for multi-layer networks. We build a multi-layer network, where each layer represents molecular interactions in a different human tissue. OhmNet then automatically learns a mapping of proteins, represented as nodes, to a neural embedding-based low-dimensional space of features. OhmNet encourages sharing of similar features among proteins with similar network neighborhoods and among proteins activated in similar tissues. The algorithm generalizes prior work, which generally ignores relationships between tissues, by modeling tissue organization with a rich multiscale tissue hierarchy. We use OhmNet to study multicellular function in a multi-layer protein interaction network of 107 human tissues. In 48 tissues with known tissue-specific cellular functions, OhmNet provides more accurate predictions of cellular function than alternative approaches, and also generates more accurate hypotheses about tissue-specific protein actions. We show that taking into account the tissue hierarchy leads to improved predictive power. Remarkably, we also demonstrate that it is possible to leverage the tissue hierarchy in order to effectively transfer cellular functions to a functionally uncharacterized tissue. Overall, OhmNet moves from flat networks to multiscale models able to predict a range of phenotypes spanning cellular subsystems.
Availability and implementation: Source code and datasets are available at http://snap.stanford.edu/ohmnet.
Contact: jure@cs.stanford.edu
Categories: Bioinformatics, Journals

A new method to study the change of miRNA–mRNA interactions due to environmental exposures

Tue, 2017-07-11 23:00
Abstract
Motivation: Integrative approaches characterizing the interactions among different types of biological molecules have been demonstrated to be useful for revealing informative biological mechanisms. One such example is the interaction between microRNA (miRNA) and messenger RNA (mRNA), whose deregulation may be sensitive to environmental insult, leading to altered phenotypes. The goal of this work is to develop an effective data integration method to characterize deregulation between miRNA and mRNA due to environmental toxicant exposures. We use data from an animal experiment designed to investigate the effect of low-dose environmental chemical exposure on normal mammary gland development in rats to motivate and evaluate the proposed method.
Results: We propose a new network approach, integrative Joint Random Forest (iJRF), which characterizes the regulatory system between miRNAs and mRNAs using a network model. iJRF is designed to work under the high-dimension low-sample-size regime, and can borrow information across different treatment conditions to achieve more accurate network inference. It also effectively takes into account prior information on miRNA–mRNA regulatory relationships from existing databases. When iJRF is applied to the data from the environmental chemical exposure study, we detected a few important miRNAs that regulated a large number of mRNAs in the control group but not in the exposed groups, suggesting disruption of miRNA activity due to chemical exposure. Effects of chemical exposure on two affected miRNAs were further validated using human breast cancer cell lines.
Availability and implementation: R package iJRF is available at CRAN.
Contact: pei.wang@mssm.edu or susan.teitelbaum@mssm.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Multiple network-constrained regressions expand insights into influenza vaccination responses

Tue, 2017-07-11 23:00
Abstract
Motivation: Systems immunology leverages recent technological advancements that enable broad profiling of the immune system to better understand the response to infection and vaccination, as well as the dysregulation that occurs in disease. An increasingly common approach to gain insights from these large-scale profiling experiments involves the application of statistical learning methods to predict disease states or the immune response to perturbations. However, the goal of many systems studies is not to maximize accuracy, but rather to gain biological insights. The predictors identified using current approaches can be biologically uninterpretable or present only one of many equally predictive models, leading to a narrow understanding of the underlying biology.
Results: Here we show that incorporating prior biological knowledge within a logistic modeling framework, by using network-level constraints on transcriptional profiling data, significantly improves interpretability. Moreover, incorporating different types of biological knowledge produces models that highlight distinct aspects of the underlying biology, while maintaining predictive accuracy. We propose a new framework, Logistic Multiple Network-constrained Regression (LogMiNeR), and apply it to understand the mechanisms underlying differential responses to influenza vaccination. Although standard logistic regression approaches were predictive, they were minimally interpretable. Incorporating prior knowledge using LogMiNeR led to models that were equally predictive yet highly interpretable. In this context, B cell-specific genes and mTOR signaling were associated with an effective vaccination response in young adults. Overall, our results demonstrate a new paradigm for analyzing high-dimensional immune profiling data in which multiple networks encoding prior knowledge are incorporated to improve model interpretability.
Availability and implementation: The R source code described in this article is publicly available at https://bitbucket.org/kleinstein/logminer.
Contact: steven.kleinstein@yale.edu or stefan.avey@yale.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Image-based spatiotemporal causality inference for protein signaling networks

Tue, 2017-07-11 23:00
Abstract
Motivation: Efforts to model how signaling and regulatory networks work in cells have largely either not considered spatial organization or have used compartmental models with minimal spatial resolution. Fluorescence microscopy provides the ability to monitor the spatiotemporal distribution of many molecules during signaling events, but no methods have yet been described for large-scale image analysis to learn a complex protein regulatory network. Here we present and evaluate methods for identifying how changes in concentration in one cell region influence the concentration of other proteins in other regions.
Results: Using 3D confocal microscope movies of GFP-tagged T cells undergoing costimulation, we learned models containing putative causal relationships among 12 proteins involved in T cell signaling. The models included both relationships consistent with current knowledge and novel predictions deserving further exploration. Further, when these models were applied to the initial frames of movies of T cells that had been only partially stimulated, they predicted the localization of proteins at later times with statistically significant accuracy. The methods, consisting of spatiotemporal alignment, automated region identification and causal inference, are anticipated to be applicable to a number of biological systems.
Availability and implementation: The source code and data are available as a Reproducible Research Archive at http://murphylab.cbd.cmu.edu/software/2017_TcellCausalModels/
Contact: murphy@cmu.edu
Categories: Bioinformatics, Journals

Denoising genome-wide histone ChIP-seq with convolutional neural networks

Tue, 2017-07-11 23:00
Abstract
Motivation: Chromatin immunoprecipitation sequencing (ChIP-seq) experiments are commonly used to obtain genome-wide profiles of histone modifications associated with different types of functional genomic elements. However, the quality of histone ChIP-seq data is affected by many experimental parameters such as the amount of input DNA, antibody specificity, ChIP enrichment and sequencing depth. Making accurate inferences from chromatin profiling experiments that involve diverse experimental parameters is challenging.
Results: We introduce a convolutional denoising algorithm, Coda, that uses convolutional neural networks to learn a mapping from suboptimal to high-quality histone ChIP-seq data. This overcomes various sources of noise and variability, substantially enhancing and recovering signal when applied to low-quality chromatin profiling datasets across individuals, cell types and species. Our method has the potential to improve data quality at reduced costs. More broadly, this approach, using a high-dimensional discriminative model to encode a generative noise process, is generally applicable to other biological domains where it is easy to generate noisy data but difficult to analytically characterize the noise or underlying data distribution.
Availability and implementation: https://github.com/kundajelab/coda.
Contact: akundaje@stanford.edu
Categories: Bioinformatics, Journals

Large-scale structure prediction by improved contact predictions and model quality assessment

Tue, 2017-07-11 23:00
Abstract
Motivation: Accurate contact predictions can be used for predicting the structure of proteins. Until recently these methods were limited to very large protein families, decreasing their utility. However, recent progress in combining direct coupling analysis with machine learning methods has made it possible to predict accurate contact maps for smaller families. To what extent these predictions can be used to produce accurate models of the families is not known.
Results: We present the PconsFold2 pipeline, which uses contact predictions from PconsC3, the CONFOLD folding algorithm and model quality estimations to predict the structure of a protein. We show that the model quality estimation significantly increases the number of models that can reliably be identified. Finally, we apply PconsFold2 to 6379 Pfam families of unknown structure and find that PconsFold2 can, with an estimated 90% specificity, predict the structure of up to 558 Pfam families of unknown structure. Of these, 415 have not been reported before.
Availability and implementation: Datasets as well as models of all 558 Pfam families are available at http://c3.pcons.net/. All programs used here are freely available.
Contact: arne@bioinfo.se
Categories: Bioinformatics, Journals