Bioinformatics

Updated: 12 hours 19 min ago

ECCB2020: the 19th European Conference on Computational Biology

Tue, 2020-12-29 02:00
This volume of Bioinformatics includes the proceedings papers of the 19th European Conference on Computational Biology (ECCB), an annual international conference for research in computational biology and bioinformatics.
Categories: Bioinformatics, Journals

Dementia key gene identification with multi-layered SNP-gene-disease network

Tue, 2020-12-29 02:00
Abstract
Motivation
Recently, various approaches for diagnosing and treating dementia have received significant attention, especially in identifying key genes that are crucial for dementia. If the mutations of such key genes could be tracked, it would be possible to predict the time of onset of dementia and significantly aid in developing drugs to treat dementia. However, gene finding involves tremendous cost, time and effort. To alleviate these problems, research on utilizing computational biology to decrease the search space of candidate genes is actively conducted.
In this study, we propose a framework in which diseases, genes and single-nucleotide polymorphisms are represented by a layered network, and key genes are predicted by a machine learning algorithm. The algorithm utilizes a network-based semi-supervised learning model that can be applied to layered data structures.
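As a rough illustration of network-based semi-supervised scoring, the sketch below runs a plain label-propagation loop over a small graph that mixes disease, gene and SNP nodes. It is not the authors' algorithm: the node names, the `alpha` weight and the update rule are assumptions chosen only to show how a known disease label can spread to nearby genes.

```python
from collections import defaultdict


def propagate_labels(edges, seeds, alpha=0.8, iterations=50):
    """Score nodes by iterative label propagation on an undirected graph.

    edges: iterable of (node_a, node_b) pairs
    seeds: dict mapping labelled nodes to their initial score
    alpha: weight of the neighbour average versus the initial seed label
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    scores = {node: seeds.get(node, 0.0) for node in neighbours}
    for _ in range(iterations):
        updated = {}
        for node, adjacent in neighbours.items():
            neighbour_mean = sum(scores[n] for n in adjacent) / len(adjacent)
            updated[node] = alpha * neighbour_mean + (1 - alpha) * seeds.get(node, 0.0)
        scores = updated
    return scores


# Hypothetical toy network: one disease node, three genes, two SNPs.
edges = [
    ("disease:dementia", "gene:G1"),
    ("gene:G1", "gene:G2"),
    ("gene:G2", "snp:rsA"),
    ("gene:G2", "gene:G3"),
    ("gene:G3", "snp:rsB"),
]
seeds = {"disease:dementia": 1.0}

scores = propagate_labels(edges, seeds)
# Rank gene nodes by propagated score, highest first.
ranked_genes = sorted(
    (node for node in scores if node.startswith("gene:")),
    key=scores.get,
    reverse=True,
)
```

In this toy graph the genes closest to the labelled disease node receive the highest scores, which is the intuition behind using network proximity to narrow a candidate list.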
Results
The proposed method was applied to a dataset extracted from public databases related to diseases and genes with data collected from 186 patients. A portion of key genes obtained using the proposed method was verified in silico through PubMed literature, and the remaining genes were left as possible candidate genes.
Availability and implementation
The code for the framework will be available at http://www.alphaminers.net/.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

panRGP: a pangenome-based method to predict genomic islands and explore their diversity

Tue, 2020-12-29 02:00
Abstract
Motivation
Horizontal gene transfer (HGT) is a major source of variability in prokaryotic genomes. Regions of genome plasticity (RGPs) are clusters of genes located in highly variable genomic regions. Most of them arise from HGT and correspond to genomic islands (GIs). The study of those regions at the species level has become increasingly difficult with the data deluge of genomes. To date, no methods are available to identify GIs using hundreds of genomes to explore their diversity.
Results
We present here the panRGP method that predicts RGPs using pangenome graphs made of all available genomes for a given species. It allows the study of thousands of genomes in order to access the diversity of RGPs and to predict spots of insertions. It gave the best predictions when benchmarked alongside other GI detection tools against a reference dataset. In addition, we illustrated its use on metagenome-assembled genomes by redefining the borders of the leuX tRNA hotspot, a well-studied spot of insertion in Escherichia coli. panRGP is a scalable and reliable tool to predict GIs and spots, making it an ideal approach for large comparative studies.
Availability and implementation
The methods presented in the current work are available through the following software: https://github.com/labgem/PPanGGOLiN. Detailed results and scripts to compute the benchmark metrics are available at https://github.com/axbazin/panrgp_supdata.
Categories: Bioinformatics, Journals

GRaSP: a graph-based residue neighborhood strategy to predict binding sites

Tue, 2020-12-29 02:00
Abstract
Motivation
The discovery of protein–ligand-binding sites is a major step for elucidating protein function and for investigating new functional roles. Detecting protein–ligand-binding sites experimentally is time-consuming and expensive. Thus, a variety of in silico methods to detect and predict binding sites have been proposed, as they can be scalable, fast and low-cost.
Results
We proposed Graph-based Residue neighborhood Strategy to Predict binding sites (GRaSP), a novel residue centric and scalable method to predict ligand-binding site residues. It is based on a supervised learning strategy that models the residue environment as a graph at the atomic level. Results show that GRaSP made compatible or superior predictions when compared with methods described in the literature. GRaSP outperformed six other residue-centric methods, including the one considered as state-of-the-art. Also, our method achieved better results than the method from CAMEO independent assessment. GRaSP ranked second when compared with five state-of-the-art pocket-centric methods, which we consider a significant result, as it was not devised to predict pockets. Finally, our method proved scalable as it took 10–20 s on average to predict the binding site for a protein complex whereas the state-of-the-art residue-centric method takes 2–5 h on average.
Availability and implementation
The source code and datasets are available at https://github.com/charles-abreu/GRaSP.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

FBA reveals guanylate kinase as a potential target for antiviral therapies against SARS-CoV-2

Tue, 2020-12-29 02:00
Abstract
Motivation
The novel coronavirus (SARS-CoV-2) currently spreads worldwide, causing the disease COVID-19. The number of infections increases daily, without any approved antiviral therapy. The recently released viral nucleotide sequence enables the identification of therapeutic targets, e.g. by analyzing integrated human-virus metabolic models. Investigations of changed metabolic processes after virus infections and the effect of knock-outs on the host and the virus can reveal new potential targets.
Results
We generated an integrated host–virus genome-scale metabolic model of human alveolar macrophages and SARS-CoV-2. Analyses of stoichiometric and metabolic changes between uninfected and infected host cells using flux balance analysis (FBA) highlighted the different requirements of host and virus. Consequently, alterations in the metabolism can have different effects on host and virus, leading to potential antiviral targets. One of these potential targets is guanylate kinase (GK1). In FBA analyses, the knock-out of the GK1 decreased the growth of the virus to zero, while not affecting the host. As GK1 inhibitors are described in the literature, its potential therapeutic effect for SARS-CoV-2 infections needs to be verified in in-vitro experiments.
Availability and implementation
The computational model is accessible at https://identifiers.org/biomodels.db/MODEL2003020001.
Categories: Bioinformatics, Journals

MirCure: a tool for quality control, filter and curation of microRNAs of animals and plants

Tue, 2020-12-29 02:00
Abstract
Motivation
microRNAs (miRNAs) are essential components of gene expression regulation at the post-transcriptional level. miRNAs have a well-defined molecular structure, and this has facilitated the development of computational and high-throughput approaches to predict miRNA genes. However, due to their short size, miRNAs have often been incorrectly annotated in both plants and animals. Consequently, published miRNA annotations and miRNA databases are enriched for false miRNAs, jeopardizing their utility as molecular information resources. To address this problem, we developed MirCure, a new software tool for quality control, filtering and curation of miRNA candidates. MirCure is an easy-to-use tool with a graphical interface that allows both scoring of miRNA reliability and browsing of supporting evidence by manual curators.
Results
Given a list of miRNA candidates, MirCure evaluates a number of miRNA-specific features based on gene expression, biogenesis and conservation data, and generates a score that can be used to discard poorly supported miRNA annotations. MirCure can also curate and adjust the annotation of the 5p and 3p arms based on user-provided small RNA-seq data. We evaluated MirCure on a set of manually curated animal and plant miRNAs and demonstrated high accuracy. Moreover, we show that MirCure can be used to revisit previous bona fide miRNA annotations to improve miRNA databases.
Availability and implementation
The MirCure software and all the additional scripts used in this project are publicly available at https://github.com/ConesaLab/MirCure. A Docker image of MirCure is available at https://hub.docker.com/r/conesalab/mircure.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Exploring chromatin conformation and gene co-expression through graph embedding

Tue, 2020-12-29 02:00
Abstract
Motivation
The relationship between gene co-expression and chromatin conformation is of great biological interest. Thanks to high-throughput chromosome conformation capture technologies (Hi-C), researchers are gaining insights into the three-dimensional organization of the genome. Given the high complexity of Hi-C data and the difficult definition of gene co-expression networks, the development of proper computational tools to investigate this relationship is rapidly gaining the interest of researchers. One of the most fascinating questions in this context is how chromatin topology correlates with gene co-expression and which physical interaction patterns are most predictive of co-expression relationships.
Results
To address these questions, we developed a computational framework for the prediction of co-expression networks from chromatin conformation data. We first define a gene chromatin interaction network where each gene is associated to its physical interaction profile; then, we apply two graph embedding techniques to extract a low-dimensional vector representation of each gene from the interaction network; finally, we train a classifier on gene embedding pairs to predict if they are co-expressed. Both graph embedding techniques outperform previous methods based on manually designed topological features, highlighting the need for more advanced strategies to encode chromatin information. We also establish that the most recent technique, based on random walks, is superior. Overall, our results demonstrate that chromatin conformation and gene regulation share a non-linear relationship and that gene topological embeddings encode relevant information, which could be used also for downstream analysis.
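Random-walk-based embedding methods generally start by sampling node sequences from the graph and then feed those sequences to a word-embedding style model. The sketch below covers only the sampling step, with uniform transition probabilities. The gene names and the adjacency structure are hypothetical, and the abstract does not state which sampling strategy the authors used.

```python
import random


def random_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Sample fixed-length uniform random walks starting from every node.

    adjacency: dict mapping each node to a list of its neighbours
    """
    rng = random.Random(seed)
    walks = []
    for start in adjacency:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency[walk[-1]]
                if not neighbours:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks


# Hypothetical gene chromatin interaction network.
adjacency = {
    "geneA": ["geneB", "geneC"],
    "geneB": ["geneA", "geneC"],
    "geneC": ["geneA", "geneB", "geneD"],
    "geneD": ["geneC"],
}

walks = random_walks(adjacency, walk_length=5, walks_per_node=3)
```

Genes that co-occur often in these walks end up with similar vectors once an embedding model is trained on them, and a classifier can then be applied to pairs of vectors.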
Availability and implementation
The source code for the analysis is available at: https://github.com/marcovarrone/gene-expression-chromatin.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Feasible-metabolic-pathway-exploration technique using chemical latent space

Tue, 2020-12-29 02:00
Abstract
Motivation
Exploring metabolic pathways is one of the key techniques for developing highly productive microbes for the bioproduction of chemical compounds. To explore feasible pathways, it is necessary not only to examine combinations of well-known enzymatic reactions but also to find potential enzymatic reactions that can catalyze the desired structural changes. To achieve this, most conventional techniques use manually predefined reaction rules; however, they cannot sufficiently find potential reactions because the conventional rules cannot comprehensively express structural changes before and after enzymatic reactions. Evaluating the feasibility of the explored pathways is another challenge because there is no way to validate the reaction possibility of unknown enzymatic reactions by these rules. Therefore, a technique for comprehensively capturing the structural changes in enzymatic reactions and a technique for evaluating the pathway feasibility are still necessary to explore feasible metabolic pathways.
Results
We developed a feasible-pathway-exploration technique using chemical latent space obtained from a deep generative model for compound structures. With this technique, an enzymatic reaction is regarded as a difference vector between the main substrate and the main product in chemical latent space acquired from the generative model. Features of the enzymatic reaction are embedded into the fixed-dimensional vector, and it is possible to express structural changes of enzymatic reactions comprehensively. The technique also involves differential-evolution-based reaction selection to design feasible candidate pathways and pathway scoring using neural-network-based reaction-possibility prediction. The proposed technique was applied to the non-registered pathways relevant to the production of 2-butanone, and successfully explored feasible pathways that include such reactions.
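The idea of treating a reaction as a difference vector can be sketched with plain vector arithmetic. In the example below the three-dimensional latent codes, the compound names and the nearest-neighbour lookup are all hypothetical. In the described technique the codes would come from a deep generative model, and the reaction selection and scoring steps are considerably more involved.

```python
def reaction_vector(substrate, product):
    """Represent a reaction as product minus substrate in latent space."""
    return [p - s for s, p in zip(substrate, product)]


def apply_reaction(compound, reaction):
    """Shift a compound's latent code by a reaction vector."""
    return [c + r for c, r in zip(compound, reaction)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Hypothetical latent codes for the main substrate and main product.
substrate = [1.0, 0.0, 2.0]
product = [1.5, 1.0, 2.0]
reaction = reaction_vector(substrate, product)

# Apply the same structural change to a different starting compound.
query = [0.0, 0.0, 1.0]
predicted = apply_reaction(query, reaction)

# Look up the closest known compound to the predicted latent point.
candidates = {
    "compound_x": [0.4, 1.1, 1.0],
    "compound_y": [2.0, 0.0, 0.0],
}
nearest = min(candidates, key=lambda name: squared_distance(candidates[name], predicted))
```

Because every reaction becomes a vector of the same fixed dimension, reactions can be compared, combined and scored without hand-written rules.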
Categories: Bioinformatics, Journals

Ensembling graph attention networks for human microbe–drug association prediction

Tue, 2020-12-29 02:00
Abstract
Motivation
Human microbes are closely involved in a wide variety of complex human diseases and have become new drug targets. In silico methods for identifying potential microbe–drug associations provide an effective complement to conventional experimental methods, which can not only benefit screening candidate compounds for drug development but also facilitate novel knowledge discovery for understanding microbe–drug interaction mechanisms. On the other hand, the recent increased availability of accumulated biomedical data for microbes and drugs provides a great opportunity for a machine learning approach to predict microbe–drug associations. We are thus highly motivated to integrate these data sources to improve prediction accuracy. In addition, it is extremely challenging to predict interactions for new drugs or new microbes, which have no existing microbe–drug associations.
Results
In this work, we leverage various sources of biomedical information and construct multiple networks (graphs) for microbes and drugs. Then, we develop a novel ensemble framework of graph attention networks with a hierarchical attention mechanism for microbe–drug association prediction from the constructed multiple microbe–drug graphs, denoted as EGATMDA. In particular, for each input graph, we design a graph convolutional network with node-level attention to learn embeddings for nodes (i.e. microbes and drugs). To effectively aggregate node embeddings from multiple input graphs, we implement graph-level attention to learn the importance of different input graphs. Experimental results under different cross-validation settings (e.g. the setting for predicting associations for new drugs) showed that our proposed method outperformed seven state-of-the-art methods. Case studies on predicted microbe–drug associations further demonstrated the effectiveness of our proposed EGATMDA method.
Availability
Source code and supplementary materials are available at: https://github.com/longyahui/EGATMDA/
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

A general near-exact k-mer counting method with low memory consumption enables de novo assembly of 106× human sequence data in 2.7 hours

Tue, 2020-12-29 02:00
Abstract
Motivation
In de novo sequence assembly, a standard pre-processing step is k-mer counting, which computes the number of occurrences of every length-k sub-sequence in the sequencing reads. Sequencing errors can produce many k-mers that do not appear in the genome, leading to the need for an excessive amount of memory during counting. This issue is particularly serious when the genome to be assembled is large, the sequencing depth is high, or when the memory available is limited.
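For readers unfamiliar with the step, the sketch below shows exact k-mer counting with an in-memory hash table, followed by a simple count threshold. It is a baseline illustration only and is not CQF-deNoise: the reads are made up, and the fixed threshold of 2 is an assumption standing in for a real noise-removal procedure.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy reads; the third one carries a simulated sequencing error (A -> T).
reads = ["ACGTAC", "CGTACG", "ACGTTC"]
counts = count_kmers(reads, k=3)

# Keep only k-mers seen at least twice; singletons are likely error-derived.
solid = {kmer: n for kmer, n in counts.items() if n >= 2}
```

Here the single erroneous base creates two k-mers (`GTT`, `TTC`) that appear only once. At real sequencing depths such singletons can dominate the table, which is the memory problem described above.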
Results
Here, we propose a fast near-exact k-mer counting method, CQF-deNoise, which has a module for dynamically removing noisy false k-mers. It automatically determines the suitable time and number of rounds of noise removal according to a user-specified wrong removal rate. We tested CQF-deNoise comprehensively using data generated from a diverse set of genomes with various data properties, and found that the memory consumed was almost constant regardless of the sequencing errors while the noise removal procedure had minimal effects on counting accuracy. Compared with four state-of-the-art k-mer counting methods, CQF-deNoise consistently performed the best in terms of memory usage, consuming 49–76% less memory than the second best method. When counting the k-mers from a human dataset with around 60× coverage, the peak memory usage of CQF-deNoise was only 10.9 GB (gigabytes) for k = 28 and 21.5 GB for k = 55. De novo assembly of 106× human sequencing data using CQF-deNoise for k-mer counting required only 2.7 h and 90 GB peak memory.
Availability and implementation
The source codes of CQF-deNoise and SH-assembly are available at https://github.com/Christina-hshi/CQF-deNoise.git and https://github.com/Christina-hshi/SH-assembly.git, respectively, both under the BSD 3-Clause license.
Categories: Bioinformatics, Journals

Adversarial deconfounding autoencoder for learning robust gene expression embeddings

Tue, 2020-12-29 02:00
Abstract
Motivation
The increasing number of gene expression profiles has enabled the use of complex models, such as deep unsupervised neural networks, to extract a latent space from these profiles. However, expression profiles, especially when collected in large numbers, inherently contain variations introduced by technical artifacts (e.g. batch effects) and uninteresting biological variables (e.g. age) in addition to the true signals of interest. These sources of variation, called confounders, produce embeddings that fail to transfer to different domains, i.e. an embedding learned from one dataset with a specific confounder distribution does not generalize to different distributions. To remedy this problem, we attempt to disentangle confounders from true signals to generate biologically informative embeddings.
Results
In this article, we introduce the Adversarial Deconfounding AutoEncoder (AD-AE) approach to deconfounding gene expression latent spaces. The AD-AE model consists of two neural networks: (i) an autoencoder to generate an embedding that can reconstruct original measurements, and (ii) an adversary trained to predict the confounder from that embedding. We jointly train the networks to generate embeddings that can encode as much information as possible without encoding any confounding signal. By applying AD-AE to two distinct gene expression datasets, we show that our model can (i) generate embeddings that do not encode confounder information, (ii) conserve the biological signals present in the original space and (iii) generalize successfully across different confounder domains. We demonstrate that AD-AE outperforms standard autoencoder and other deconfounding approaches.
Availability and implementation
Our code and data are available at https://gitlab.cs.washington.edu/abdincer/ad-ae.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

DeepCDR: a hybrid graph convolutional network for predicting cancer drug response

Tue, 2020-12-29 02:00
Abstract
Motivation
Accurate prediction of cancer drug response (CDR) is challenging due to the uncertainty of drug efficacy and heterogeneity of cancer patients. Strong evidence has implicated the high dependence of CDR on tumor genomic and transcriptomic profiles of individual patients. Precise identification of CDR is crucial in both guiding anti-cancer drug design and understanding cancer biology.
Results
In this study, we present DeepCDR which integrates multi-omics profiles of cancer cells and explores intrinsic chemical structures of drugs for predicting CDR. Specifically, DeepCDR is a hybrid graph convolutional network consisting of a uniform graph convolutional network and multiple subnetworks. Unlike prior studies modeling hand-crafted features of drugs, DeepCDR automatically learns the latent representation of topological structures among atoms and bonds of drugs. Extensive experiments showed that DeepCDR outperformed state-of-the-art methods in both classification and regression settings under various data settings. We also evaluated the contribution of different types of omics profiles for assessing drug response. Furthermore, we provided an exploratory strategy for identifying potential cancer-associated genes concerning specific cancer types. Our results highlighted the predictive power of DeepCDR and its potential translational value in guiding disease-specific drug design.
Availability and implementation
DeepCDR is freely available at https://github.com/kimmo1019/DeepCDR.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

CLPred: a sequence-based protein crystallization predictor using BLSTM neural network

Tue, 2020-12-29 02:00
Abstract
Motivation
Determining the structures of proteins is a critical step to understand their biological functions. Crystallography-based X-ray diffraction technique is the main method for experimental protein structure determination. However, the underlying crystallization process, which needs multiple time-consuming and costly experimental steps, has a high attrition rate. To overcome this issue, a series of in silico methods have been developed with the primary aim of selecting the protein sequences that are promising to be crystallized. However, the predictive performance of the current methods is modest.
Results
We propose a deep learning model, CLPred, which uses a bidirectional recurrent neural network with long short-term memory (BLSTM) to capture the long-range interaction patterns between k-mers of amino acids to predict protein crystallizability. Using sequence-only information, CLPred outperforms the existing deep-learning predictors and a vast majority of sequence-based diffraction-quality crystal predictors on three independent test sets. The results highlight the effectiveness of BLSTM in capturing non-local, long-range inter-peptide interaction patterns to distinguish proteins that can result in diffraction-quality crystals from those that cannot. CLPred improves steadily on the previous window-based neural networks and is able to predict crystallization propensity with high accuracy. CLPred can also be improved significantly if it incorporates additional features from pre-extracted evolutionary, structural and physicochemical characteristics. The correctness of CLPred predictions is further validated by case studies of Sox transcription factor family member proteins and Zika virus non-structural proteins.
Availability and implementation
https://github.com/xuanwenjing/CLPred.
Categories: Bioinformatics, Journals

Conditional out-of-distribution generation for unpaired data using transfer VAE

Tue, 2020-12-29 02:00
Abstract
Motivation
While generative models have shown great success in sampling high-dimensional samples conditional on low-dimensional descriptors (stroke thickness in MNIST, hair color in CelebA, speaker identity in WaveNet), their generation out-of-distribution poses fundamental problems due to the difficulty of learning a compact joint distribution across conditions. The canonical example of the conditional variational autoencoder (CVAE), for instance, does not explicitly relate conditions during training and, hence, has no explicit incentive to learn such a compact representation.
Results
We overcome the limitation of the CVAE by matching distributions across conditions using maximum mean discrepancy in the decoder layer that follows the bottleneck. This introduces a strong regularization both for reconstructing samples within the same condition and for transforming samples across conditions, resulting in much improved generalization. As this amounts to solving a style-transfer problem, we refer to the model as transfer VAE (trVAE). Benchmarking trVAE on high-dimensional image and single-cell RNA-seq data, we demonstrate higher robustness and higher accuracy than existing approaches. We also show qualitatively improved predictions by tackling previously problematic minority classes and multiple conditions in the context of cellular perturbation response to treatment and disease based on high-dimensional single-cell gene expression data. For generic tasks, we improve Pearson correlations of high-dimensional estimated means and variances with their ground truths from 0.89 to 0.97 and 0.75 to 0.87, respectively. We further demonstrate that trVAE learns cell-type-specific responses after perturbation and improves the prediction of most cell-type-specific genes by 65%.
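Maximum mean discrepancy (MMD) measures how far apart two samples are by comparing average kernel similarities within and between them. The sketch below computes a biased estimate of squared MMD with an RBF kernel. The sample values, the `gamma` bandwidth and the condition names are assumptions for illustration. In a model such as the one described, a term of this kind would be computed on layer activations and added to the training loss.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between samples xs and ys."""
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, gamma) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


# Hypothetical activations for cells under two conditions.
control = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]]
stimulated = [[2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]

same = mmd_squared(control, control)          # identical samples
different = mmd_squared(control, stimulated)  # well-separated samples
```

The statistic is zero for identical samples and grows as the two distributions separate, so penalizing it pushes the representations of different conditions toward each other.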
Availability and implementation
The trVAE implementation is available via github.com/theislab/trvae. The results of this article can be reproduced via github.com/theislab/trvae_reproducibility.
Categories: Bioinformatics, Journals

Supervised learning on phylogenetically distributed data

Tue, 2020-12-29 02:00
Abstract
Motivation
The ability to develop robust machine-learning (ML) models is considered imperative to the adoption of ML techniques in biology and medicine. This challenge is particularly acute when the data available for training are not independent and identically distributed (iid), in which case trained models are vulnerable to out-of-distribution generalization problems. Of particular interest are problems where data correspond to observations made on phylogenetically related samples (e.g. antibiotic resistance data).
Results
We introduce DendroNet, a new approach to train neural networks in the context of evolutionary data. DendroNet explicitly accounts for the relatedness of the training/testing data, while allowing the model to evolve along the branches of the phylogenetic tree, hence accommodating potential changes in the rules that relate genotypes to phenotypes. Using simulated data, we demonstrate that DendroNet produces models that can be significantly better than non-phylogenetically aware approaches. DendroNet also outperforms other approaches at two biological tasks of significant practical importance: antibiotic resistance prediction in bacteria and trophic level prediction in fungi.
Availability and implementation
https://github.com/BlanchetteLab/DendroNet.
Categories: Bioinformatics, Journals

Matrix (factorization) reloaded: flexible methods for imputing genetic interactions with cross-species and side information

Tue, 2020-12-29 02:00
Abstract
Motivation
Mapping genetic interactions (GIs) can reveal important insights into cellular function and has potential translational applications. There has been great progress in developing high-throughput experimental systems for measuring GIs (e.g. with double knockouts) as well as in defining computational methods for inferring (imputing) unknown interactions. However, existing computational methods for imputation have largely been developed for and applied in baker’s yeast, even as experimental systems have begun to allow measurements in other contexts. Importantly, existing methods face a number of limitations in requiring specific side information and with respect to computational cost. Further, few have addressed how GIs can be imputed when data are scarce.
Results
In this article, we address these limitations by presenting a new imputation framework, called Extensible Matrix Factorization (EMF). EMF is a framework of composable models that flexibly exploit cross-species information in the form of GI data across multiple species, and arbitrary side information in the form of kernels (e.g. from protein–protein interaction networks). We perform a rigorous set of experiments on these models in matched GI datasets from baker’s and fission yeast. These include the first such experiments on genome-scale GI datasets in multiple species in the same study. We find that EMF models that exploit side and cross-species information improve imputation, especially in data-scarce settings. Further, we show that EMF outperforms the state-of-the-art deep learning method, even when using strictly less data, and incurs orders of magnitude less computational cost.
Availability
Implementations of models and experiments are available at: https://github.com/lrgr/EMF.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

The effect of kinship in re-identification attacks against genomic data sharing beacons

Tue, 2020-12-29 02:00
Abstract
Motivation
The big data era in genomics promises a breakthrough in medicine, but the need to share data in a private manner limits the pace of the field. The widely accepted ‘genomic data sharing beacon’ protocol provides a standardized and secure interface for querying genomic datasets. The data are only shared if the desired information (e.g. a certain variant) exists in the dataset. Various studies have shown that beacons are vulnerable to re-identification (or membership inference) attacks. As beacons are generally associated with sensitive phenotype information, re-identification creates a significant risk for the participants. Unfortunately, proposed countermeasures against such attacks have failed to be effective, as they do not consider the utility of the beacon protocol.
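The query interface that makes such attacks possible can be sketched as a service that answers only yes or no for a given variant. The class below is a toy model and not the actual beacon API: the variant tuples, participant identifiers and method names are hypothetical.

```python
class Beacon:
    """Toy beacon: answers whether any participant carries a variant."""

    def __init__(self, genomes):
        # genomes: dict mapping participant id -> set of variants,
        # where a variant is a (chromosome, position, allele) tuple.
        self._variants = set()
        for variants in genomes.values():
            self._variants |= set(variants)

    def query(self, chromosome, position, allele):
        # Only presence or absence is revealed, never who carries it.
        return (chromosome, position, allele) in self._variants


# Hypothetical dataset with two participants.
genomes = {
    "participant_1": {("1", 12345, "A"), ("2", 500, "T")},
    "participant_2": {("1", 12345, "A"), ("7", 42, "G")},
}
beacon = Beacon(genomes)

present = beacon.query("1", 12345, "A")
absent = beacon.query("3", 999, "C")

# An attacker holding a target's variants counts how many the beacon confirms.
target_variants = [("2", 500, "T"), ("7", 42, "G"), ("3", 999, "C")]
yes_count = sum(beacon.query(*variant) for variant in target_variants)
```

A re-identification attack repeats such queries for many of a target's variants and uses the pattern of yes answers as evidence of membership. Relatives in the beacon share many variants with the target, which blurs that evidence.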
Results
In this study, for the first time, we analyze the mitigation effect of the kinship relationships among beacon participants against re-identification attacks. We argue that having multiple family members in a beacon can garble the information for attacks since a substantial number of variants are shared among kin-related people. Using family genomes from HapMap and synthetically generated datasets, we show that having one of the parents of a victim in the beacon causes (i) a significant decrease in the power of attacks and (ii) a substantial increase in the number of queries needed to confirm an individual’s beacon membership. We also show how the protection effect attenuates when more distant relatives, such as grandparents, are included alongside the victim. Furthermore, we quantify the utility loss due to adding relatives and show that it is smaller than that of flipping-based techniques.
Categories: Bioinformatics, Journals

PathFinder: Bayesian inference of clone migration histories in cancer

Tue, 2020-12-29 02:00
Abstract
Summary
Metastases cause a vast majority of cancer morbidity and mortality. Metastatic clones are formed by dispersal of cancer cells to secondary tissues, and are not medically detected or visible until later stages of cancer development. Clone phylogenies within patients provide a means of tracing the otherwise inaccessible dynamic history of migrations of cancer cells. Here, we present a new Bayesian approach, PathFinder, for reconstructing the routes of cancer cell migrations. PathFinder uses the clone phylogeny, the number of mutational differences among clones, and the information on the presence and absence of observed clones in primary and metastatic tumors. By analyzing simulated datasets, we found that PathFinder performs well in reconstructing clone migrations from the primary tumor to new metastases as well as between metastases. It was more challenging to trace migrations from metastases back to primary tumors. We found that a vast majority of errors can be corrected by sampling more clones per tumor, and by increasing the number of genetic variants assayed per clone. We also identified situations in which phylogenetic approaches alone are not sufficient to reconstruct migration routes. In conclusion, we anticipate that the use of PathFinder will enable a more reliable inference of migration histories and their posterior probabilities, which is required to assess the relative preponderance of seeding of new metastasis by clones from primary tumors and/or existing metastases.
Availability and implementation
PathFinder is available on the web at https://github.com/SayakaMiura/PathFinder.
Categories: Bioinformatics, Journals

Probabilistic graphlets capture biological function in probabilistic molecular networks

Tue, 2020-12-29 02:00
Abstract
Motivation
Molecular interactions have been successfully modeled and analyzed as networks, where nodes represent molecules and edges represent the interactions between them. These networks revealed that molecules with similar local network structure also have similar biological functions. The most sensitive measures of network structure are based on graphlets. However, graphlet-based methods thus far are only applicable to unweighted networks, whereas real-world molecular networks may have weighted edges that can represent the probability of an interaction occurring in the cell. This information is commonly discarded when applying thresholds to generate unweighted networks, which may lead to information loss.
Results
We introduce probabilistic graphlets as a tool for analyzing the local wiring patterns of probabilistic networks. To assess their performance compared to unweighted graphlets, we generate synthetic networks based on different well-known random network models and edge probability distributions and demonstrate that probabilistic graphlets outperform their unweighted counterparts in distinguishing network structures. Then we model different real-world molecular interaction networks as weighted graphs with probabilities as weights on edges and we analyze them with our new weighted graphlets-based methods. We show that due to their probabilistic nature, probabilistic graphlet-based methods more robustly capture biological information in these data, while simultaneously showing a higher sensitivity to identify condition-specific functions compared to their unweighted graphlet-based method counterparts.
Availability and implementation
Our implementation of probabilistic graphlets is available at https://github.com/Serdobe/Probabilistic_Graphlets.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

svMIL: predicting the pathogenic effect of TAD boundary-disrupting somatic structural variants through multiple instance learning

Tue, 2020-12-29 02:00
Abstract
Motivation
Despite the fact that structural variants (SVs) play an important role in cancer, methods to predict their effect, especially for SVs in non-coding regions, are lacking, leaving them often overlooked in the clinic. Non-coding SVs may disrupt the boundaries of Topologically Associated Domains (TADs), thereby affecting interactions between genes and regulatory elements such as enhancers. However, it is not known when such alterations are pathogenic. Although machine learning techniques are a promising solution to answer this question, representing the large number of interactions that an SV can disrupt in a single feature matrix is not trivial.
Results
We introduce svMIL: a method to predict pathogenic TAD boundary-disrupting SV effects based on multiple instance learning, which circumvents the need for a traditional feature matrix by grouping SVs into bags that can contain any number of disruptions. We demonstrate that svMIL can predict SV pathogenicity, measured through same-sample gene expression aberration, for various cancer types. In addition, our approach reveals that somatic pathogenic SVs alter different regulatory interactions than somatic non-pathogenic SVs and germline SVs.
Availability and implementation
All code for svMIL is publicly available on GitHub: https://github.com/UMCUGenetics/svMIL.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals