Bioinformatics


An introduction to deep learning on biological sequence data: examples and solutions

Tue, 2017-08-22 23:00
Abstract
Motivation
Deep neural network architectures such as convolutional and long short-term memory networks have become increasingly popular machine learning tools in recent years. This development is driven by the availability of greater computational resources, more data, new algorithms for training deep models and easy-to-use libraries for implementing and training neural networks. Deep learning has been especially successful in image recognition, and the development of tools, applications and code examples is in most cases centered within this field rather than within biology.
Results
Here, we aim to further the development of deep learning methods within biology by providing application examples and code templates that are ready to apply and adapt. Using these examples, we illustrate how architectures consisting of convolutional and long short-term memory neural networks can relatively easily be designed and trained to state-of-the-art performance on three biological sequence problems: prediction of subcellular localization, protein secondary structure and the binding of peptides to MHC Class II molecules.
Availability and implementation
All implementations and datasets are available online to the scientific community at https://github.com/vanessajurtz/lasagne4bio.
Contact
skaaesonderby@gmail.com
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

BSviewer: a genotype-preserving, nucleotide-level visualizer for bisulfite sequencing data

Mon, 2017-08-07 23:00
Abstract
Motivation
Bisulfite sequencing technology has been widely used to study the DNA methylation profile in many species. However, most current visualization tools for bisulfite sequencing data provide only high-level views (i.e. overall methylation densities) and miss the methylation dynamics at the nucleotide level. They also focus on CpG sites while omitting other information (such as genotypes at SNP sites) that could be helpful for interpreting the methylation pattern of the data. A bioinformatics tool that visualizes methylation statuses at the nucleotide level and preserves the most essential information of the sequencing data is thus valuable and needed.
Results
We have developed BSviewer, a lightweight nucleotide-level visualization tool for bisulfite sequencing data. Using an imprinted gene as an example, we show that BSviewer can be especially helpful for interpreting data with allele-specific DNA methylation patterns.
Availability and implementation
BSviewer is implemented in Perl and runs on most GNU/Linux platforms. Source code and testing dataset are freely available at http://sunlab.cpy.cuhk.edu.hk/BSviewer/.
Contact
haosun@cuhk.edu.hk
Categories: Bioinformatics, Journals

Molecular signatures that can be transferred across different omics platforms

Mon, 2017-08-07 23:00
Bioinformatics (2017) 33 (14): i333-i340.
Categories: Bioinformatics, Journals

Motif independent identification of potential RNA G-quadruplexes by G4RNA screener

Wed, 2017-08-02 23:00
Abstract
Motivation
G-quadruplex structures in RNA molecules are known to have regulatory impacts in cells but are difficult to locate in the genome. The minimal requirements for G-quadruplex folding in RNA (G≥3N1-7 G≥3N1-7 G≥3N1-7 G≥3) are being challenged by observations made on specific examples in recent years. The definition of potential G-quadruplex sequences has major repercussions on the observation of the structure since it introduces a bias. The canonical motif only describes a sub-population of the reported G-quadruplexes. To address these issues, we propose an RNA G-quadruplex prediction strategy that does not rely on a motif definition.
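For orientation, the canonical motif quoted above (four runs of at least three G, separated by loops of one to seven nucleotides) can be written as a regular expression. The following is a minimal sketch of that motif-based search only, that is, the approach the authors argue is too restrictive; it is not part of G4RNA screener, and the function name `find_canonical_g4` is hypothetical.

```python
import re

# Canonical motif from the abstract: G>=3 N1-7 G>=3 N1-7 G>=3 N1-7 G>=3.
# Assumption: loops (N) may contain any of the four RNA nucleotides.
CANONICAL_G4 = re.compile(
    r"G{3,}[ACGU]{1,7}G{3,}[ACGU]{1,7}G{3,}[ACGU]{1,7}G{3,}"
)

def find_canonical_g4(sequence):
    """Return (start, end, match) for each non-overlapping canonical motif hit.

    Input may be RNA or DNA in any case; T is converted to U before matching.
    """
    rna = sequence.upper().replace("T", "U")
    return [(m.start(), m.end(), m.group()) for m in CANONICAL_G4.finditer(rna)]

hits = find_canonical_g4("AAGGGAGGGAGGGAGGGAA")
print(hits)
```

A sequence with only two-G runs (for example `GGAGGAGGAGG`) yields no hit under this definition, which illustrates how a fixed motif can exclude candidates by construction.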
Results
We trained an artificial neural network with sequences of experimentally validated G-quadruplexes from the G4RNA database encoded using an abstract definition of their sequence. This artificial neural network, G4NN, evaluates the similarity of a given sequence to known G-quadruplexes and reports it as a score. G4NN has a predictive power comparable to the reported G richness and G/C skewness evaluations that are the current state-of-the-art for the identification of potential RNA G-quadruplexes. We combined these approaches in the G4RNA screener, a program designed to manage and evaluate the sequences to identify potential G-quadruplexes.
Availability and implementation
G4RNA screener is available for download at http://gitlabscottgroup.med.usherbrooke.ca/J-Michel/g4rna_screener.
Contact
jean-michel.garant@usherbrooke.ca or jean-pierre.perreault@usherbrooke.ca or michelle.scott@usherbrooke.ca
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

DelPhiForce web server: electrostatic forces and energy calculations and visualization

Wed, 2017-08-02 23:00
Abstract
Summary
Electrostatic force is an essential component of the total force acting between atoms and macromolecules. Therefore, accurate calculations of electrostatic forces are crucial for revealing the mechanisms of many biological processes. We developed a DelPhiForce web server to calculate and visualize the electrostatic forces at molecular level. DelPhiForce web server enables modeling of electrostatic forces on individual atoms, residues, domains and molecules, and generates an output that can be visualized by VMD software. Here we demonstrate the usage of the server for various biological problems including protein–cofactor, domain–domain, protein–protein, protein–DNA and protein–RNA interactions.
Availability and implementation
The DelPhiForce web server is available at: http://compbio.clemson.edu/delphi-force.
Contact
delphi@clemson.edu
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

ComplexViewer: visualization of curated macromolecular complexes

Wed, 2017-08-02 23:00
Abstract
Summary
Proteins frequently function as parts of complexes, assemblages of multiple proteins and other biomolecules, yet network visualizations usually only show proteins as parts of binary interactions. ComplexViewer visualizes interactions with more than two participants and thereby avoids the need to first expand these into multiple binary interactions. Furthermore, if binding regions between molecules are known then these can be displayed in the context of the larger complex.
Availability and implementation
ComplexViewer is freely available under the Apache version 2 license. EMBL-EBI Complex Portal: http://www.ebi.ac.uk/complexportal; Source code: https://github.com/MICommunity/ComplexViewer; Package: https://www.npmjs.com/package/complexviewer; http://biojs.io/d/complexviewer. Language: JavaScript; Web technology: Scalable Vector Graphics; Libraries: D3.js.
Contact
colin.combe@ed.ac.uk or juri.rappsilber@ed.ac.uk
Categories: Bioinformatics, Journals

MFIB: a repository of protein complexes with mutual folding induced by binding

Wed, 2017-08-02 23:00
Abstract
Motivation
It is commonplace that intrinsically disordered proteins (IDPs) are involved in crucial interactions in the living cell. However, the study of protein complexes formed exclusively by IDPs is hindered by the lack of data, and such analyses remain sporadic. Systematic studies have benefited other types of protein–protein interactions, paving a way from basic science to therapeutics; yet these efforts require reliable datasets that are currently lacking for synergistically folding complexes of IDPs.
Results
Here we present the Mutual Folding Induced by Binding (MFIB) database, the first systematic collection of complexes formed exclusively by IDPs. MFIB contains an order of magnitude more data than any dataset used in corresponding studies and offers a wide coverage of known IDP complexes in terms of flexibility, oligomeric composition and protein function from all domains of life. The included complexes are grouped using a hierarchical classification and are complemented with structural and functional annotations. MFIB is backed by a firm development team and infrastructure, and together with possible future community collaboration it will provide the cornerstone for structural and functional studies of IDP complexes.
Availability and implementation
MFIB is freely accessible at http://mfib.enzim.ttk.mta.hu/. The MFIB application is hosted on an Apache web server and was implemented in PHP. To enrich querying features and to enhance backend performance, a MySQL database was also created.
Contact
simon.istvan@ttk.mta.hu, meszaros.balint@ttk.mta.hu
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

BiobankUniverse: automatic matchmaking between datasets for biobank data discovery and integration

Tue, 2017-08-01 23:00
Abstract
Motivation
Biobanks are indispensable for large-scale genetic/epidemiological studies, yet it remains difficult for researchers to determine which biobanks contain data matching their research questions.
Results
To overcome this, we developed a new matching algorithm that identifies pairs of related data elements between biobanks and research variables with high precision and recall. It integrates lexical comparison, Unified Medical Language System ontology tagging and semantic query expansion. The result is BiobankUniverse, a fast matchmaking service for biobanks and researchers. Biobankers upload their data elements and researchers upload their desired study variables; BiobankUniverse then automatically shortlists matching attributes between them. Users can quickly explore matching potential and search for biobanks/data elements matching their research. They can also curate matches and define personalized data-universes.
Availability and implementation
BiobankUniverse is available at http://biobankuniverse.com or can be downloaded as part of the open source MOLGENIS suite at http://github.com/molgenis/molgenis.
Contact
m.a.swertz@rug.nl
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

LRCstats, a tool for evaluating long reads correction methods

Tue, 2017-08-01 23:00
Abstract
Motivation
Third-generation sequencing (TGS) platforms that generate long reads, such as PacBio and Oxford Nanopore technologies, have had a dramatic impact on genomics research. However, despite recent improvements, TGS reads suffer from high error rates, and the development of read correction methods is an active field of research. This motivates the need to develop tools that can evaluate the accuracy of correction tools for noisy long reads.
Results
We introduce LRCstats, a tool that measures the accuracy of long reads correction tools. LRCstats takes advantage of long reads simulators that provide each simulated read with an alignment to the reference genome segment they originate from, and does not rely on a step of mapping corrected reads onto the reference genome. This allows for the measurement of the accuracy of the correction while being consistent with the actual errors introduced in the simulation process used to generate noisy reads. We illustrate the usefulness of LRCstats by analyzing the accuracy of four hybrid correction methods for PacBio long reads over three datasets.
Availability and implementation
https://github.com/cchauve/lrcstats
Contact
laseanl@sfu.ca or cedric.chauve@sfu.ca
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

fastNGSadmix: admixture proportions and principal component analysis of a single NGS sample

Tue, 2017-08-01 23:00
Abstract
Motivation
Estimation of admixture proportions and principal component analysis (PCA) are fundamental tools in population genetics. However, applying these methods to low- or mid-depth sequencing data without taking genotype uncertainty into account can introduce biases.
Results
Here we present fastNGSadmix, a tool to quickly and reliably estimate admixture proportions and perform PCA from next generation sequencing data of a single individual. The analyses are based on genotype likelihoods of the input sample and a set of predefined reference populations. The method has high accuracy, even at low sequencing depth, and corrects for the biases introduced by small reference populations.
Availability and implementation
The admixture estimation method is implemented in C++ and the PCA method is implemented in R. The code is freely available at http://www.popgen.dk/software/index.php/FastNGSadmix
Contact
emil.jorsboe@bio.ku.dk
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Reference genome assessment from a population scale perspective: an accurate profile of variability and noise

Fri, 2017-07-28 23:00
Abstract
Motivation
Current plant and animal genomic studies are often based on newly assembled genomes that have not been properly consolidated. In this scenario, misassembled regions can easily lead to false-positive findings. Although quality control scores are included within genotyping protocols, they are usually employed to evaluate individual sample quality rather than reference sequence reliability. We propose a statistical model that combines quality control scores across samples in order to detect incongruent patterns at every genomic region. Our model is inherently robust since common artifact signals are expected to be shared between independent samples over misassembled regions of the genome.
Results
The reliability of our protocol has been extensively tested through different experiments and organisms with accurate results, improving on state-of-the-art methods. Our analysis demonstrates synergistic relations between quality control scores and allelic variability estimators that improve the detection of misassembled regions, and it is able to find strong artifact signals even within the human reference assembly. Furthermore, we demonstrate how our model can be trained to properly rank the confidence of a set of candidate variants obtained from new independent samples.
Availability and implementation
This tool is freely available at http://gitlab.com/carbonell/ces.
Contact
jcarbonell.cipf@gmail.com or joaquin.dopazo@juntadeandalucia.es
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Improved prediction of breast cancer outcome by identifying heterogeneous biomarkers

Fri, 2017-07-28 23:00
Abstract
Motivation
Identification of genes that can be used to predict prognosis in patients with cancer is important in that it can lead to improved therapy, and can also promote our understanding of tumor progression on the molecular level. One of the common but fundamental problems that render identification of prognostic genes and prediction of cancer outcomes difficult is the heterogeneity of patient samples.
Results
To reduce the effect of sample heterogeneity, we clustered data samples using the K-means algorithm and applied modified PageRank to functional interaction (FI) networks weighted using gene expression values of samples in each cluster. Hub genes among the resulting prioritized genes were selected as biomarkers to predict the prognosis of samples. This process outperformed traditional feature selection methods as well as several network-based prognostic gene selection methods when applied to Random Forest. We were able to find many cluster-specific prognostic genes for each dataset. Functional study showed that distinct biological processes were enriched in each cluster, which seems to reflect different aspects of tumor progression or oncogenesis among distinct patient groups. Taken together, these results provide support for the hypothesis that our approach can effectively identify heterogeneous prognostic genes, and that these are complementary to each other, improving prediction accuracy.
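The network-ranking step can be pictured with a plain weighted PageRank on an undirected graph. The sketch below is an illustration under stated assumptions, not the authors' modified PageRank: the abstract does not specify the modification, the edge weights here are arbitrary positive numbers standing in for expression-derived weights, and the gene names and the function `weighted_pagerank` are hypothetical.

```python
def weighted_pagerank(edges, damping=0.85, iterations=100):
    """Standard weighted PageRank by power iteration on an undirected graph.

    edges: dict mapping (gene_u, gene_v) -> positive weight.
    Returns a dict gene -> score; scores sum to 1.
    """
    # Build a symmetric adjacency map from the edge list.
    neighbors = {}
    for (u, v), w in edges.items():
        neighbors.setdefault(u, {})[v] = w
        neighbors.setdefault(v, {})[u] = w

    nodes = sorted(neighbors)
    n = len(nodes)
    # Total edge weight per node, used to normalise outgoing rank.
    strength = {node: sum(neighbors[node].values()) for node in nodes}
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbour passes on rank in proportion to the edge weight.
            incoming = sum(
                rank[nb] * w / strength[nb] for nb, w in neighbors[node].items()
            )
            new_rank[node] = (1.0 - damping) / n + damping * incoming
        rank = new_rank
    return rank

# Hypothetical toy network: one hub gene linked to three others.
toy_edges = {("HUB", "A"): 1.0, ("HUB", "B"): 2.0, ("HUB", "C"): 1.5}
scores = weighted_pagerank(toy_edges)
top_gene = max(scores, key=scores.get)
print(top_gene, scores)
```

In this toy network the hub receives the highest score, which mirrors the idea of selecting hub genes among the prioritized genes within each cluster.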
Availability and implementation
https://github.com/mathcom/CPR
Contact
jgahn@inu.ac.kr
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

FAF-Drugs4: free ADME-tox filtering computations for chemical biology and early stages drug discovery

Fri, 2017-07-28 23:00
Abstract
Motivation
Identification of small molecules that could be interesting starting points for drug discovery or to investigate a biological system as in chemical biology endeavours is both time consuming and costly. In silico approaches that assist the design of quality compound collections or help to prioritize molecules before synthesis or purchase are therefore valuable. Here quality refers to the selection of molecules that pass one or several selected filters that can be tuned by the users according to the project and the stage of the project. These filters can involve prediction of physicochemical properties, search for toxicophores or other unwanted chemical groups.
Results
FAF-Drugs4 is a novel version of our online server dedicated to the preparation and annotation of compound collections. The tool is now faster and several parameters have been optimized. In addition, a new service referred to as FAF-QED, an implementation of the quantitative estimate of drug-likeness method, is now available.
Availability and implementation
The server is available at http://fafdrugs4.mti.univ-paris-diderot.fr.
Contact
Bruno.Villoutreix@inserm.fr
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

3DBIONOTES v2.0: a web server for the automatic annotation of macromolecular structures

Thu, 2017-07-27 23:00
Abstract
Motivation
Complementing structural information with biochemical and biomedical annotations is a powerful approach to explore the biological function of macromolecular complexes. However, currently the compilation of annotations and structural data is a feature only available for those structures that have been released as entries to the Protein Data Bank.
Results
To help researchers in assessing the consistency between structures and biological annotations for structural models not deposited in databases, we present 3DBIONOTES v2.0, a web application designed for the automatic annotation of biochemical and biomedical information onto macromolecular structural models determined by any experimental or computational technique.
Availability and implementation
The web server is available at http://3dbionotes-ws.cnb.csic.es.
Contact
jsegura@cnb.csic.es
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Sequence2Vec: a novel embedding approach for modeling transcription factor binding affinity landscape

Wed, 2017-07-26 23:00
Abstract
Motivation
An accurate characterization of transcription factor (TF)-DNA affinity landscape is crucial to a quantitative understanding of the molecular mechanisms underpinning endogenous gene regulation. While recent advances in biotechnology have brought the opportunity for building binding affinity prediction methods, the accurate characterization of TF-DNA binding affinity landscape still remains a challenging problem.
Results
Here we propose a novel sequence embedding approach for modeling the transcription factor binding affinity landscape. Our method represents DNA binding sequences as a hidden Markov model which captures both position specific information and long-range dependency in the sequence. A cornerstone of our method is a novel message passing-like embedding algorithm, called Sequence2Vec, which maps these hidden Markov models into a common nonlinear feature space and uses these embedded features to build a predictive model. Our method is a novel combination of the strength of probabilistic graphical models, feature space embedding and deep learning. We conducted comprehensive experiments on over 90 large-scale TF-DNA datasets which were measured by different high-throughput experimental technologies. Sequence2Vec outperforms alternative machine learning methods as well as the state-of-the-art binding affinity prediction methods.
Availability and implementation
Our program is freely available at https://github.com/ramzan1990/sequence2vec.
Contact
xin.gao@kaust.edu.sa or lsong@cc.gatech.edu.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

pgRNAFinder: a web-based tool to design distance independent paired-gRNA

Wed, 2017-07-26 23:00
Abstract
Summary
The CRISPR/Cas system has been shown to be an efficient and accurate genome-editing technique. A number of tools exist to design guide RNA sequences and predict potential off-target sites. However, most existing computational tools for gRNA design are restricted to small deletions. To address this issue, we present pgRNAFinder, with an easy-to-use web interface, which enables researchers to design single or distance-free paired-gRNA sequences. The web interface of pgRNAFinder contains both a gRNA search and a scoring system. After users input query sequences, it searches for gRNAs by 3' protospacer-adjacent motif (PAM), identifies possible off-targets, and rapidly scores the conservation of the deleted sequences. Filters can be applied to identify high-quality CRISPR sites. pgRNAFinder offers gRNA design functionality for 8 vertebrate genomes. Furthermore, to keep pgRNAFinder open and extensible to any organism, we provide the source package for local use.
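The PAM-based search step can be sketched as a scan for protospacers that are immediately followed by a PAM on their 3' side. This is a minimal illustration and not pgRNAFinder's code: the 20-nt spacer length and the NGG PAM are assumptions (they are typical for SpCas9 but are not stated in the abstract), only the forward strand is scanned, and the function name `find_grna_sites` is hypothetical.

```python
import re

def find_grna_sites(sequence, spacer_len=20, pam_regex="[ACGT]GG"):
    """Return (start, spacer, pam) for every spacer followed by a 3' PAM.

    Assumptions: forward strand only; NGG PAM and 20-nt spacer by default.
    A lookahead is used so that overlapping candidate sites are all reported.
    """
    seq = sequence.upper()
    pattern = re.compile(r"(?=([ACGT]{%d})(%s))" % (spacer_len, pam_regex))
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]

# Hypothetical query: one base, a 20-nt stretch, then a TGG PAM.
query = "C" + "A" * 20 + "TGG"
sites = find_grna_sites(query)
print(sites)
```

For a paired-gRNA design, two such sites flanking the region to be deleted would be chosen; off-target search and conservation scoring, which the abstract also mentions, are outside the scope of this sketch.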
Availability and implementation
The pgRNAFinder is freely available at http://songyanglab.sysu.edu.cn/wangwebs/pgRNAFinder/, and the source code and user manual can be obtained from https://github.com/xiexiaowei/pgRNAFinder.
Contact
songyang@bcm.edu or daizhim@mail.sysu.edu.cn
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

iDNA4mC: identifying DNA N4-methylcytosine sites based on nucleotide chemical properties

Tue, 2017-07-25 23:00
Abstract
Motivation
DNA N4-methylcytosine (4mC) is an epigenetic modification. The knowledge about the distribution of 4mC is helpful for understanding its biological functions. Although experimental methods have been proposed to detect 4mC sites, they are expensive for performing genome-wide detections. Thus, it is necessary to develop computational methods for predicting 4mC sites.
Results
In this work, we developed iDNA4mC, the first webserver to identify 4mC sites, in which DNA sequences are encoded with both nucleotide chemical properties and nucleotide frequency. The predictive results of the rigorous jackknife test and the cross-species test demonstrated that the performance of iDNA4mC is quite promising and that it holds high potential to become a useful tool for identifying 4mC sites.
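One way to picture an encoding that combines chemical properties with nucleotide frequency is shown below. This is a sketch under assumptions, not the published iDNA4mC feature definition: the abstract does not give the coding, so the three binary properties used here (purine ring, amino group, weak hydrogen bonding) and the running-frequency term are illustrative choices, and the name `encode_sequence` is hypothetical.

```python
# Assumed binary coding per nucleotide: (purine ring, amino group, weak H-bond).
# A and G are purines; A and C carry an amino group; A and T pair with two
# hydrogen bonds (weak), C and G with three (strong).
CHEMICAL_PROPERTIES = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
}

def encode_sequence(sequence):
    """Encode each position as three chemical-property bits plus a frequency.

    The frequency term is the fraction of positions up to and including the
    current one that hold the same nucleotide (an assumed definition).
    """
    counts = {}
    encoded = []
    for position, nucleotide in enumerate(sequence.upper(), start=1):
        counts[nucleotide] = counts.get(nucleotide, 0) + 1
        frequency = counts[nucleotide] / position
        encoded.append(CHEMICAL_PROPERTIES[nucleotide] + (frequency,))
    return encoded

features = encode_sequence("ACA")
print(features)
```

Each position becomes a four-element vector, so a sequence of length L gives 4L features that could be passed to a classifier.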
Availability and implementation
The user-friendly web-server, iDNA4mC, is freely accessible at http://lin.uestc.edu.cn/server/iDNA4mC.
Contact
chenweiimu@gmail.com or hlin@uestc.edu.cn
Categories: Bioinformatics, Journals

Towards clinically more relevant dissection of patient heterogeneity via survival-based Bayesian clustering

Tue, 2017-07-25 23:00
Abstract
Motivation
Discovery of clinically relevant disease sub-types is of prime importance in personalized medicine. Disease sub-type identification has in the past often been explored in an unsupervised machine learning paradigm which involves clustering of patients based on available -omics data, such as gene expression. A follow-up analysis involves determining the clinical relevance of the molecular sub-types, for example by comparing their disease progressions. The above methodology, however, fails to guarantee the separability of the sub-types based on their subtype-specific survival curves.
Results
We propose a new algorithm, Survival-based Bayesian Clustering (SBC), which simultaneously clusters heterogeneous -omics and clinical end point data (time to event) in order to discover clinically relevant disease subtypes. For this purpose we formulate a novel Hierarchical Bayesian Graphical Model which combines a Dirichlet Process Gaussian Mixture Model with an Accelerated Failure Time model. In this way we make sure that patients are grouped in the same cluster only when they show similar characteristics with respect to molecular features across data types (e.g. gene expression, miRNA) as well as survival times. We extensively test our model in simulation studies and apply it to cancer patient data from the Breast Cancer dataset and The Cancer Genome Atlas repository. Notably, our method is not only able to find clinically relevant sub-groups, but also predicts cluster membership and survival on test data better than competing methods.
Availability and implementation
Our R code can be accessed at https://github.com/ashar799/SBC.
Contact
ashar@bit.uni-bonn.de
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

pLoc-mAnimal: predict subcellular localization of animal proteins with both single and multiple sites

Sun, 2017-07-23 23:00
Abstract
Motivation
Cells are deemed the basic unit of life. However, many important functions of cells, as well as their growth and reproduction, are performed by protein molecules located in their different organelles or locations. Facing the explosive growth of protein sequences, we are challenged to develop fast and effective methods to annotate their subcellular localization. However, this is by no means an easy task. In particular, mounting evidence indicates that proteins have a multi-label feature, meaning that they may simultaneously exist at, or move between, two or more different subcellular location sites. Unfortunately, most existing computational methods can only deal with single-label proteins. Although the recently developed ‘iLoc-Animal’ predictor is quite powerful and can also deal with animal proteins with multiple locations, its prediction quality needs to be improved, particularly by enhancing the absolute true rate and reducing the absolute false rate.
Results
Here we propose a new predictor called ‘pLoc-mAnimal’, which is superior to iLoc-Animal, as shown by the following results. When tested by the most rigorous cross-validation on the same high-quality benchmark dataset, the absolute true success rate achieved by the new predictor is 37% higher and the absolute false rate is four times lower in comparison with the state-of-the-art predictor.
Availability and implementation
To maximize the convenience of most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc-mAnimal/, by which users can easily get their desired results without the need to go through the complicated mathematics involved.
Contact
xxiao@gordonlifescience.org or kcchou@gordonlifescience.org
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

SPRINT: an SNP-free toolkit for identifying RNA editing sites

Sun, 2017-07-23 23:00
Abstract
Motivation
RNA editing generates post-transcriptional sequence alterations. Detection of RNA editing sites (RESs) typically requires the filtering of SNVs called from RNA-seq data using an SNP database, an obstacle that is difficult to overcome for most organisms.
Results
Here, we present a novel method named SPRINT that identifies RESs without the need to filter out SNPs. SPRINT also integrates the detection of hyper RESs from remapped reads, and has been fully automated for any RNA-seq data with an available reference genome sequence. We have rigorously validated SPRINT’s effectiveness in detecting RESs using RNA-seq data of samples in which genes encoding RNA editing enzymes are knocked down or over-expressed, and have also demonstrated its superiority over current methods. We have applied SPRINT to investigate RNA editing across tissues and species, and also in the development of the mouse embryonic central nervous system. A web resource (http://sprint.tianlab.cn) of RESs identified by SPRINT has been constructed.
Availability and implementation
The software and related data are available at http://sprint.tianlab.cn.
Contact
weidong.tian@fudan.edu.cn
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals