Bioinformatics

RSS-материал
Обновлено: 46 min 44 sec ago

DisruPPI: structure-based computational redesign algorithm for protein binding disruption

ср, 2018-06-27 02:00
Abstract
Motivation
Disruption of protein–protein interactions can mitigate antibody recognition of therapeutic proteins, yield monomeric forms of oligomeric proteins, and elucidate signaling mechanisms, among other applications. While designing affinity-enhancing mutations remains generally quite challenging, both statistically and physically based computational methods can precisely identify affinity-reducing mutations. In order to leverage this ability to design variants of a target protein with disrupted interactions, we developed the DisruPPI protein design method (DISRUpting Protein–Protein Interactions) to optimize combinations of mutations simultaneously for both disruption and stability, so that incorporated disruptive mutations do not inadvertently affect the target protein adversely.
Results
Two existing methods for predicting mutational effects on binding, FoldX and INT5, were demonstrated to be quite precise in selecting disruptive mutations from the SKEMPI and AB-Bind databases of experimentally determined changes in binding free energy. DisruPPI was implemented to use an INT5-based disruption score integrated with an AMBER-based stability assessment and was applied to disrupt protein interactions in a set of different targets representing diverse applications. In retrospective evaluation with three different case studies, comparison of DisruPPI-designed variants to published experimental data showed that DisruPPI was able to identify more diverse interaction-disrupting and stability-preserving variants more efficiently and effectively than previous approaches. In prospective application to an interaction between enhanced green fluorescent protein (EGFP) and a nanobody, DisruPPI was used to design five EGFP variants, all of which were shown to have significantly reduced nanobody binding while maintaining function and thermostability. This demonstrates that DisruPPI may be readily utilized for effective removal of known epitopes of therapeutically relevant proteins.
Availability and implementation
DisruPPI is implemented in the EpiSweep package, freely available under an academic use license.
Supplementary information
Supplementary dataSupplementary data are available at Bioinformatics online.
Категории: Bioinformatics, Journals

DeepFam: deep learning based alignment-free method for protein family modeling and prediction

ср, 2018-06-27 02:00
Abstract
Motivation
A large number of newly sequenced proteins are generated by the next-generation sequencing technologies and the biochemical function assignment of the proteins is an important task. However, biological experiments are too expensive to characterize such a large number of protein sequences, thus protein function prediction is primarily done by computational modeling methods, such as profile Hidden Markov Model (pHMM) and k-mer based methods. Nevertheless, existing methods have some limitations; k-mer based methods are not accurate enough to assign protein functions and pHMM is not fast enough to handle large number of protein sequences from numerous genome projects. Therefore, a more accurate and faster protein function prediction method is needed.
Results
In this paper, we introduce DeepFam, an alignment-free method that can extract functional information directly from sequences without the need of multiple sequence alignments. In extensive experiments using the Clusters of Orthologous Groups (COGs) and G protein-coupled receptor (GPCR) dataset, DeepFam achieved better performance in terms of accuracy and runtime for predicting functions of proteins compared to the state-of-the-art methods, both alignment-free and alignment-based methods. Additionally, we showed that DeepFam has a power of capturing conserved regions to model protein families. In fact, DeepFam was able to detect conserved regions documented in the Prosite database while predicting functions of proteins. Our deep learning method will be useful in characterizing functions of the ever increasing protein sequences.
Availability and implementation
Codes are available at https://bhi-kimlab.github.io/DeepFam.
Категории: Bioinformatics, Journals

Protein threading using residue co-variation and deep learning

ср, 2018-06-27 02:00
Abstract
Motivation
Template-based modeling, including homology modeling and protein threading, is a popular method for protein 3D structure prediction. However, alignment generation and template selection for protein sequences without close templates remain very challenging.
Results
We present a new method called DeepThreader to improve protein threading, including both alignment generation and template selection, by making use of deep learning (DL) and residue co-variation information. Our method first employs DL to predict inter-residue distance distribution from residue co-variation and sequential information (e.g. sequence profile and predicted secondary structure), and then builds sequence-template alignment by integrating predicted distance information and sequential features through an ADMM algorithm. Experimental results suggest that predicted inter-residue distance is helpful to both protein alignment and template selection especially for protein sequences without very close templates, and that our method outperforms currently popular homology modeling method HHpred and threading method CNFpred by a large margin and greatly outperforms the latest contact-assisted protein threading method EigenTHREADER.
Availability and implementation
http://raptorx.uchicago.edu/
Supplementary information
Supplementary dataSupplementary data are available at Bioinformatics online.
Категории: Bioinformatics, Journals

mGPfusion: predicting protein stability changes with Gaussian process kernel learning and data fusion

ср, 2018-06-27 02:00
Abstract
Motivation
Proteins are commonly used by biochemical industry for numerous processes. Refining these proteins’ properties via mutations causes stability effects as well. Accurate computational method to predict how mutations affect protein stability is necessary to facilitate efficient protein design. However, accuracy of predictive models is ultimately constrained by the limited availability of experimental data.
Results
We have developed mGPfusion, a novel Gaussian process (GP) method for predicting protein’s stability changes upon single and multiple mutations. This method complements the limited experimental data with large amounts of molecular simulation data. We introduce a Bayesian data fusion model that re-calibrates the experimental and in silico data sources and then learns a predictive GP model from the combined data. Our protein-specific model requires experimental data only regarding the protein of interest and performs well even with few experimental measurements. The mGPfusion models proteins by contact maps and infers the stability effects caused by mutations with a mixture of graph kernels. Our results show that mGPfusion outperforms state-of-the-art methods in predicting protein stability on a dataset of 15 different proteins and that incorporating molecular simulation data improves the model learning and prediction accuracy.
Availability and implementation
Software implementation and datasets are available at github.com/emmijokinen/mgpfusion.
Supplementary information
Supplementary dataSupplementary data are available at Bioinformatics online.
Категории: Bioinformatics, Journals

DLBI: deep learning guided Bayesian inference for structure reconstruction of super-resolution fluorescence microscopy

ср, 2018-06-27 02:00
Abstract
Motivation
Super-resolution fluorescence microscopy with a resolution beyond the diffraction limit of light, has become an indispensable tool to directly visualize biological structures in living cells at a nanometer-scale resolution. Despite advances in high-density super-resolution fluorescent techniques, existing methods still have bottlenecks, including extremely long execution time, artificial thinning and thickening of structures, and lack of ability to capture latent structures.
Results
Here, we propose a novel deep learning guided Bayesian inference (DLBI) approach, for the time-series analysis of high-density fluorescent images. Our method combines the strength of deep learning and statistical inference, where deep learning captures the underlying distribution of the fluorophores that are consistent with the observed time-series fluorescent images by exploring local features and correlation along time-axis, and statistical inference further refines the ultrastructure extracted by deep learning and endues physical meaning to the final image. In particular, our method contains three main components. The first one is a simulator that takes a high-resolution image as the input, and simulates time-series low-resolution fluorescent images based on experimentally calibrated parameters, which provides supervised training data to the deep learning model. The second one is a multi-scale deep learning module to capture both spatial information in each input low-resolution image as well as temporal information among the time-series images. And the third one is a Bayesian inference module that takes the image from the deep learning module as the initial localization of fluorophores and removes artifacts by statistical inference. Comprehensive experimental results on both real and simulated datasets demonstrate that our method provides more accurate and realistic local patch and large-field reconstruction than the state-of-the-art method, the 3B analysis, while our method is more than two orders of magnitude faster.
Availability and implementation
The main program is available at https://github.com/lykaust15/DLBI
Supplementary information
Supplementary dataSupplementary data are available at Bioinformatics online.
Категории: Bioinformatics, Journals

A novel methodology on distributed representations of proteins using their interacting ligands

ср, 2018-06-27 02:00
Abstract
Motivation
The effective representation of proteins is a crucial task that directly affects the performance of many bioinformatics problems. Related proteins usually bind to similar ligands. Chemical characteristics of ligands are known to capture the functional and mechanistic properties of proteins suggesting that a ligand-based approach can be utilized in protein representation. In this study, we propose SMILESVec, a Simplified molecular input line entry system (SMILES)-based method to represent ligands and a novel method to compute similarity of proteins by describing them based on their ligands. The proteins are defined utilizing the word-embeddings of the SMILES strings of their ligands. The performance of the proposed protein description method is evaluated in protein clustering task using TransClust and MCL algorithms. Two other protein representation methods that utilize protein sequence, Basic local alignment tool and ProtVec, and two compound fingerprint-based protein representation methods are compared.
Results
We showed that ligand-based protein representation, which uses only SMILES strings of the ligands that proteins bind to, performs as well as protein sequence-based representation methods in protein clustering. The results suggest that ligand-based protein description can be an alternative to the traditional sequence or structure-based representation of proteins and this novel approach can be applied to different bioinformatics problems such as prediction of new protein–ligand interactions and protein function annotation.
Availability and implementation
https://github.com/hkmztrk/SMILESVecProteinRepresentation
Supplementary information
Supplementary dataSupplementary data are available at Bioinformatics online.
Категории: Bioinformatics, Journals

HFSP: high speed homology-driven function annotation of proteins

ср, 2018-06-27 02:00
Abstract
Motivation
The rapid drop in sequencing costs has produced many more (predicted) protein sequences than can feasibly be functionally annotated with wet-lab experiments. Thus, many computational methods have been developed for this purpose. Most of these methods employ homology-based inference, approximated via sequence alignments, to transfer functional annotations between proteins. The increase in the number of available sequences, however, has drastically increased the search space, thus significantly slowing down alignment methods.
Results
Here we describe homology-derived functional similarity of proteins (HFSP), a novel computational method that uses results of a high-speed alignment algorithm, MMseqs2, to infer functional similarity of proteins on the basis of their alignment length and sequence identity. We show that our method is accurate (85% precision) and fast (more than 40-fold speed increase over state-of-the-art). HFSP can help correct at least a 16% error in legacy curations, even for a resource of as high quality as Swiss-Prot. These findings suggest HFSP as an ideal resource for large-scale functional annotation efforts.
Supplementary information
Supplementary dataSupplementary data are available at Bioinformatics online.
Категории: Bioinformatics, Journals

Enumerating consistent sub-graphs of directed acyclic graphs: an insight into biomedical ontologies

ср, 2018-06-27 02:00
Abstract
Motivation
Modern problems of concept annotation associate an object of interest (gene, individual, text document) with a set of interrelated textual descriptors (functions, diseases, topics), often organized in concept hierarchies or ontologies. Most ontology can be seen as directed acyclic graphs (DAGs), where nodes represent concepts and edges represent relational ties between these concepts. Given an ontology graph, each object can only be annotated by a consistent sub-graph; that is, a sub-graph such that if an object is annotated by a particular concept, it must also be annotated by all other concepts that generalize it. Ontologies therefore provide a compact representation of a large space of possible consistent sub-graphs; however, until now we have not been aware of a practical algorithm that can enumerate such annotation spaces for a given ontology.
Results
We propose an algorithm for enumerating consistent sub-graphs of DAGs. The algorithm recursively partitions the graph into strictly smaller graphs until the resulting graph becomes a rooted tree (forest), for which a linear-time solution is computed. It then combines the tallies from graphs created in the recursion to obtain the final count. We prove the correctness of this algorithm, propose several practical accelerations, evaluate it on random graphs and then apply it to characterize four major biomedical ontologies. We believe this work provides valuable insights into the complexity of concept annotation spaces and its potential influence on the predictability of ontological annotation.
Availability and implementation
https://github.com/shawn-peng/counting-consistent-sub-DAG
Supplementary information
Supplementary dataSupplementary data are available at Bioinformatics online.
Категории: Bioinformatics, Journals

MicroPheno: predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples

ср, 2018-06-27 02:00
Abstract
Motivation
Microbial communities play important roles in the function and maintenance of various biosystems, ranging from the human body to the environment. A major challenge in microbiome research is the classification of microbial communities of different environments or host phenotypes. The most common and cost-effective approach for such studies to date is 16S rRNA gene sequencing. Recent falls in sequencing costs have increased the demand for simple, efficient and accurate methods for rapid detection or diagnosis with proved applications in medicine, agriculture and forensic science. We describe a reference- and alignment-free approach for predicting environments and host phenotypes from 16S rRNA gene sequencing based on k-mer representations that benefits from a bootstrapping framework for investigating the sufficiency of shallow sub-samples. Deep learning methods as well as classical approaches were explored for predicting environments and host phenotypes.
Results
A k-mer distribution of shallow sub-samples outperformed Operational Taxonomic Unit (OTU) features in the tasks of body-site identification and Crohn’s disease prediction. Aside from being more accurate, using k-mer features in shallow sub-samples allows (i) skipping computationally costly sequence alignments required in OTU-picking and (ii) provided a proof of concept for the sufficiency of shallow and short-length 16S rRNA sequencing for phenotype prediction. In addition, k-mer features predicted representative 16S rRNA gene sequences of 18 ecological environments, and 5 organismal environments with high macro-F1 scores of 0.88 and 0.87. For large datasets, deep learning outperformed classical methods such as Random Forest and Support Vector Machine.
Availability and implementation
The software and datasets are available at https://llp.berkeley.edu/micropheno.
Supplementary information
Supplementary dataSupplementary data are available at Bioinformatics online.
Категории: Bioinformatics, Journals

SIMPLE: Sparse Interaction Model over Peaks of moLEcules for fast, interpretable metabolite identification from tandem mass spectra

ср, 2018-06-27 02:00
Abstract
Motivation
Recent success in metabolite identification from tandem mass spectra has been led by machine learning, which has two stages: mapping mass spectra to molecular fingerprint vectors and then retrieving candidate molecules from the database. In the first stage, i.e. fingerprint prediction, spectrum peaks are features and considering their interactions would be reasonable for more accurate identification of unknown metabolites. Existing approaches of fingerprint prediction are based on only individual peaks in the spectra, without explicitly considering the peak interactions. Also the current cutting-edge method is based on kernels, which are computationally heavy and difficult to interpret.
Results
We propose two learning models that allow to incorporate peak interactions for fingerprint prediction. First, we extend the state-of-the-art kernel learning method by developing kernels for peak interactions to combine with kernels for peaks through multiple kernel learning (MKL). Second, we formulate a sparse interaction model for metabolite peaks, which we call SIMPLE, which is computationally light and interpretable for fingerprint prediction. The formulation of SIMPLE is convex and guarantees global optimization, for which we develop an alternating direction method of multipliers (ADMM) algorithm. Experiments using the MassBank dataset show that both models achieved comparative prediction accuracy with the current top-performance kernel method. Furthermore SIMPLE clearly revealed individual peaks and peak interactions which contribute to enhancing the performance of fingerprint prediction.
Availability and implementation
The code will be accessed through http://mamitsukalab.org/tools/SIMPLE/.
Категории: Bioinformatics, Journals

Bayesian networks for mass spectrometric metabolite identification via molecular fingerprints

ср, 2018-06-27 02:00
Abstract
Motivation
Metabolites, small molecules that are involved in cellular reactions, provide a direct functional signature of cellular state. Untargeted metabolomics experiments usually rely on tandem mass spectrometry to identify the thousands of compounds in a biological sample. Recently, we presented CSI:FingerID for searching in molecular structure databases using tandem mass spectrometry data. CSI:FingerID predicts a molecular fingerprint that encodes the structure of the query compound, then uses this to search a molecular structure database such as PubChem. Scoring of the predicted query fingerprint and deterministic target fingerprints is carried out assuming independence between the molecular properties constituting the fingerprint.
Results
We present a scoring that takes into account dependencies between molecular properties. As before, we predict posterior probabilities of molecular properties using machine learning. Dependencies between molecular properties are modeled as a Bayesian tree network; the tree structure is estimated on the fly from the instance data. For each edge, we also estimate the expected covariance between the two random variables. For fixed marginal probabilities, we then estimate conditional probabilities using the known covariance. Now, the corrected posterior probability of each candidate can be computed, and candidates are ranked by this score. Modeling dependencies improves identification rates of CSI:FingerID by 2.85 percentage points.
Availability and implementation
The new scoring Bayesian (fixed tree) is integrated into SIRIUS 4.0 (https://bio.informatik.uni-jena.de/software/sirius/).
Категории: Bioinformatics, Journals

A spectral clustering-based method for identifying clones from high-throughput B cell repertoire sequencing data

ср, 2018-06-27 02:00
Abstract
Motivation
B cells derive their antigen-specificity through the expression of Immunoglobulin (Ig) receptors on their surface. These receptors are initially generated stochastically by somatic re-arrangement of the DNA and further diversified following antigen-activation by a process of somatic hypermutation, which introduces mainly point substitutions into the receptor DNA at a high rate. Recent advances in next-generation sequencing have enabled large-scale profiling of the B cell Ig repertoire from blood and tissue samples. A key computational challenge in the analysis of these data is partitioning the sequences to identify descendants of a common B cell (i.e. a clone). Current methods group sequences using a fixed distance threshold, or a likelihood calculation that is computationally-intensive. Here, we propose a new method based on spectral clustering with an adaptive threshold to determine the local sequence neighborhood. Validation using simulated and experimental datasets demonstrates that this method has high sensitivity and specificity compared to a fixed threshold that is optimized for these measures. In addition, this method works on datasets where choosing an optimal fixed threshold is difficult and is more computationally efficient in all cases. The ability to quickly and accurately identify members of a clone from repertoire sequencing data will greatly improve downstream analyses. Clonally-related sequences cannot be treated independently in statistical models, and clonal partitions are used as the basis for the calculation of diversity metrics, lineage reconstruction and selection analysis. Thus, the spectral clustering-based method here represents an important contribution to repertoire analysis.
Availability and implementation
Source code for this method is freely available in the SCOPe (Spectral Clustering for clOne Partitioning) R package in the Immcantation framework: www.immcantation.org under the CC BY-SA 4.0 license.
Supplementary information
Supplementary dataSupplementary data are available at Bioinformatics online.
Категории: Bioinformatics, Journals

An evolutionary model motivated by physicochemical properties of amino acids reveals variation among proteins

ср, 2018-06-27 02:00
Abstract
Motivation
The relative rates of amino acid interchanges over evolutionary time are likely to vary among proteins. Variation in those rates has the potential to reveal information about constraints on proteins. However, the most straightforward model that could be used to estimate relative rates of amino acid substitution is parameter-rich and it is therefore impractical to use for this purpose.
Results
A six-parameter model of amino acid substitution that incorporates information about the physicochemical properties of amino acids was developed. It showed that amino acid side chain volume, polarity and aromaticity have major impacts on protein evolution. It also revealed variation among proteins in the relative importance of those properties. The same general approach can be used to improve the fit of empirical models such as the commonly used PAM and LG models.
Availability and implementation
Perl code and test data are available from https://github.com/ebraun68/sixparam.
Supplementary information
Supplementary dataSupplementary data are available at Bioinformatics online.
Категории: Bioinformatics, Journals

Deconvolution and phylogeny inference of structural variations in tumor genomic samples

ср, 2018-06-27 02:00
Abstract
Motivation
Phylogenetic reconstruction of tumor evolution has emerged as a crucial tool for making sense of the complexity of emerging cancer genomic datasets. Despite the growing use of phylogenetics in cancer studies, though, the field has only slowly adapted to many ways that tumor evolution differs from classic species evolution. One crucial question in that regard is how to handle inference of structural variations (SVs), which are a major mechanism of evolution in cancers but have been largely neglected in tumor phylogenetics to date, in part due to the challenges of reliably detecting and typing SVs and interpreting them phylogenetically.
Results
We present a novel method for reconstructing evolutionary trajectories of SVs from bulk whole-genome sequence data via joint deconvolution and phylogenetics, to infer clonal sub-populations and reconstruct their ancestry. We establish a novel likelihood model for joint deconvolution and phylogenetic inference on bulk SV data and formulate an associated optimization algorithm. We demonstrate the approach to be efficient and accurate for realistic scenarios of SV mutation on simulated data. Application to breast cancer genomic data from The Cancer Genome Atlas shows it to be practical and effective at reconstructing features of SV-driven evolution in single tumors.
Availability and implementation
Python source code and associated documentation are available at https://github.com/jaebird123/tusv.
Категории: Bioinformatics, Journals

Accurate prediction of orthologs in the presence of divergence after duplication

ср, 2018-06-27 02:00
Abstract
Motivation
When gene duplication occurs, one of the copies may become free of selective pressure and evolve at an accelerated pace. This has important consequences on the prediction of orthology relationships, since two orthologous genes separated by divergence after duplication may differ in both sequence and function. In this work, we make the distinction between the primary orthologs, which have not been affected by accelerated mutation rates on their evolutionary path, and the secondary orthologs, which have. Similarity-based prediction methods will tend to miss secondary orthologs, whereas phylogeny-based methods cannot separate primary and secondary orthologs. However, both types of orthology have applications in important areas such as gene function prediction and phylogenetic reconstruction, motivating the need for methods that can distinguish the two types.
Results
We formalize the notion of divergence after duplication and provide a theoretical basis for the inference of primary and secondary orthologs. We then put these ideas to practice with the Hybrid Prediction of Paralogs and Orthologs (HyPPO) framework, which combines ideas from both similarity and phylogeny approaches. We apply our method to simulated and empirical datasets and show that we achieve superior accuracy in predicting primary orthologs, secondary orthologs and paralogs.
Availability and implementation
HyPPO is a modular framework with a core developed in Python and is provided with a variety of C++ modules. The source code is available at https://github.com/manuellafond/HyPPO.
Supplementary information
Supplementary dataSupplementary data are available at Bioinformatics online.
Категории: Bioinformatics, Journals

Inference of species phylogenies from bi-allelic markers using pseudo-likelihood

ср, 2018-06-27 02:00
Abstract
Motivation
Phylogenetic networks represent reticulate evolutionary histories. Statistical methods for their inference under the multispecies coalescent have recently been developed. A particularly powerful approach uses data that consist of bi-allelic markers (e.g. single nucleotide polymorphism data) and allows for exact likelihood computations of phylogenetic networks while numerically integrating over all possible gene trees per marker. While the approach has good accuracy in terms of estimating the network and its parameters, likelihood computations remain a major computational bottleneck and limit the method’s applicability.
Results
In this article, we first demonstrate why likelihood computations of networks take orders of magnitude more time when compared to trees. We then propose an approach for inference of phylogenetic networks based on pseudo-likelihood using bi-allelic markers. We demonstrate the scalability and accuracy of phylogenetic network inference via pseudo-likelihood computations on simulated data. Furthermore, we demonstrate aspects of robustness of the method to violations in the underlying assumptions of the employed statistical model. Finally, we demonstrate the application of the method to biological data. The proposed method allows for analyzing larger datasets in terms of the numbers of taxa and reticulation events. While pseudo-likelihood had been proposed before for data consisting of gene trees, the work here uses sequence data directly, offering several advantages as we discuss.
Availability and implementation
The methods have been implemented in PhyloNet (http://bioinfocs.rice.edu/phylonet).
Категории: Bioinformatics, Journals

A gene–phenotype relationship extraction pipeline from the biomedical literature using a representation learning approach

ср, 2018-06-27 02:00
Abstract
Motivation
The fundamental challenge of modern genetic analysis is to establish gene-phenotype correlations that are often found in the large-scale publications. Because lexical features of gene are relatively regular in text, the main challenge of these relation extraction is phenotype recognition. Due to phenotypic descriptions are often study- or author-specific, few lexicon can be used to effectively identify the entire phenotypic expressions in text, especially for plants.
Results
We have proposed a pipeline for extracting phenotype, gene and their relations from biomedical literature. Combined with abbreviation revision and sentence template extraction, we improved the unsupervised word-embedding-to-sentence-embedding cascaded approach as representation learning to recognize the various broad phenotypic information in literature. In addition, the dictionary- and rule-based method was applied for gene recognition. Finally, we integrated one of famous information extraction system OLLIE to identify gene-phenotype relations. To demonstrate the applicability of the pipeline, we established two types of comparison experiment using model organism Arabidopsis thaliana. In the comparison of state-of-the-art baselines, our approach obtained the best performance (F1-Measure of 66.83%). We also applied the pipeline to 481 full-articles from TAIR gene-phenotype manual relationship dataset to prove the validity. The results showed that our proposed pipeline can cover 70.94% of the original dataset and add 373 new relations to expand it.
Availability and implementation
The source code is available at http://www.wutbiolab.cn: 82/Gene-Phenotype-Relation-Extraction-Pipeline.zip.
Supplementary information
Supplementary dataSupplementary data are available at Bioinformatics online.
Категории: Bioinformatics, Journals

Improving genomics-based predictions for precision medicine through active elicitation of expert knowledge

ср, 2018-06-27 02:00
Abstract
Motivation
Precision medicine requires the ability to predict the efficacies of different treatments for a given individual using high-dimensional genomic measurements. However, identifying predictive features remains a challenge when the sample size is small. Incorporating expert knowledge offers a promising approach to improve predictions, but collecting such knowledge is laborious if the number of candidate features is very large.
Results
We introduce a probabilistic framework to incorporate expert feedback about the impact of genomic measurements on the outcome of interest and present a novel approach to collect the feedback efficiently, based on Bayesian experimental design. The new approach outperformed other recent alternatives in two medical applications: prediction of metabolic traits and prediction of sensitivity of cancer cells to different drugs, both using genomic features as predictors. Furthermore, the intelligent approach to collect feedback reduced the workload of the expert to approximately 11%, compared to a baseline approach.
Availability and implementation
Source code implementing the introduced computational methods is freely available at https://github.com/AaltoPML/knowledge-elicitation-for-precision-medicine.
Supplementary information
Supplementary dataSupplementary data are available at Bioinformatics online.
Категории: Bioinformatics, Journals

Training for translation between disciplines: a philosophy for life and data sciences curricula

ср, 2018-06-27 02:00
Abstract
Motivation
Our society has become data-rich to the extent that research in many areas has become impossible without computational approaches. Educational programmes seem to be lagging behind this development. At the same time, there is a growing need not only for strong data science skills, but foremost for the ability to both translate between tools and methods on the one hand, and application and problems on the other.
Results
Here we present our experiences with shaping and running a masters’ programme in bioinformatics and systems biology in Amsterdam. From this, we have developed a comprehensive philosophy on how translation in training may be achieved in a dynamic and multidisciplinary research area, which is described here. We furthermore describe two requirements that enable translation, which we have found to be crucial: sufficient depth and focus on multidisciplinary topic areas, coupled with a balanced breadth from adjacent disciplines. Finally, we present concrete suggestions on how this may be implemented in practice, which may be relevant for the effectiveness of life science and data science curricula in general, and of particular interest to those who are in the process of setting up such curricula.
Supplementary information
Supplementary dataSupplementary data are available at Bioinformatics online.
Категории: Bioinformatics, Journals

Driver gene mutations based clustering of tumors: methods and applications

ср, 2018-06-27 02:00
Abstract
Motivation
Somatic mutations in proto-oncogenes and tumor suppressor genes constitute a major category of causal genetic abnormalities in tumor cells. The mutation spectra of thousands of tumors have been generated by The Cancer Genome Atlas (TCGA) and other whole genome (exome) sequencing projects. A promising approach to utilizing these resources for precision medicine is to identify genetic similarity-based sub-types within a cancer type and relate the pinpointed sub-types to the clinical outcomes and pathologic characteristics of patients.
Results
We propose two novel methods, ccpwModel and xGeneModel, for mutation-based clustering of tumors. In the former, binary variables indicating the status of cancer driver genes in tumors and the genes’ involvement in the core cancer pathways are treated as the features in the clustering process. In the latter, the functional similarities of putative cancer driver genes and their confidence scores as the ‘true’ driver genes are integrated with the mutation spectra to calculate the genetic distances between tumors. We apply both methods to the TCGA data of 16 cancer types. Promising results are obtained when these methods are compared to state-of-the-art approaches as to the associations between the determined tumor clusters and patient race (or survival time). We further extend the analysis to detect mutation-characterized transcriptomic prognostic signatures, which are directly relevant to the etiology of carcinogenesis.
Availability and implementation
R codes and example data for ccpwModel and xGeneModel can be obtained from http://webusers.xula.edu/kzhang/ISMB2018/ccpw_xGene_software.zip.
Supplementary information
Supplementary dataSupplementary data are available at Bioinformatics online.
Категории: Bioinformatics, Journals