Bioinformatics

genozip: a fast and efficient compression tool for VCF files

Bioinformatics - Thu, 2020-05-14 02:00
Abstract
Motivation
genozip is a new lossless compression tool for Variant Call Format (VCF) files. By applying field-specific algorithms and fully utilizing the available computational hardware, genozip achieves the highest compression ratios amongst existing lossless compression tools known to the authors, at speeds comparable with the fastest multi-threaded compressors.
Availability and implementation
genozip is freely available to non-commercial users. It can be installed via conda-forge or Docker Hub, or downloaded from github.com/divonlan/genozip.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

rScudo: an R package for classification of molecular profiles using rank-based signatures

Bioinformatics - Tue, 2020-05-12 02:00
Abstract
Summary
The classification of biological samples by means of their respective molecular profiles is a topic of great interest for its potential diagnostic, prognostic and investigational applications. rScudo is an R package for the classification of molecular profiles based on a radically new approach that analyzes the similarity of rank-based sample-specific signatures. rScudo's unconventional approach has been validated through direct comparison with current methods in the international SBV IMPROVER Diagnostic Signature Challenge. Because the approach is novel, there is ample room for conceptual improvements and for exploring additional applications. The rScudo package has been specifically designed to facilitate experimenting with the rank-based signature approach, to test its application to different types of molecular profiles and to simplify direct comparison with existing methods.
Availability and implementation
The package is available as part of the Bioconductor suite at https://bioconductor.org/packages/rScudo.
Categories: Bioinformatics, Journals

COVID-2019-associated overexpressed Prevotella proteins mediated host–pathogen interactions and their role in coronavirus outbreak

Bioinformatics - Wed, 2020-05-06 02:00
Abstract
Motivation
The outbreak of COVID-2019, which began in Wuhan, China, has become a global threat through rapid transmission and severe fatalities. Recent studies have uncovered the whole-genome sequence of SARS-CoV-2 (the cause of COVID-2019). In addition, lung metagenomic studies on infected patients revealed overrepresented Prevotella spp. producing certain proteins in abundance. We performed host–pathogen protein–protein interaction analysis of SARS-CoV-2 and overrepresented Prevotella proteins with the human proteome. We also performed functional overrepresentation analysis of interacting proteins to understand their role in COVID-2019 severity.
Results
It was found that overexpressed Prevotella proteins can promote viral infection. According to the results, Prevotella proteins, but not viral proteins, are involved in multiple interactions with NF-kB, which is involved in increasing the clinical severity of COVID-2019. Prevotella may have a role in the COVID-2019 outbreak and should be given importance for understanding disease mechanisms and improving treatment outcomes.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

PathWalks: identifying pathway communities using a disease-related map of integrated information

Bioinformatics - Tue, 2020-05-05 02:00
Abstract
Motivation
Understanding the underlying biological mechanisms and respective interactions of a disease remains an elusive, time consuming and costly task. Computational methodologies that propose pathway/mechanism communities and reveal respective relationships can be of great value as they can help expedite the process of identifying how perturbations in a single pathway can affect other pathways.
Results
We present a random-walks-based methodology called PathWalks, where a walker crosses a pathway-to-pathway network under the guidance of a disease-related map. The latter is a gene network that we construct by integrating multi-source information regarding a specific disease. The most frequent trajectories highlight communities of pathways that are expected to be strongly related to the disease under study. We apply the PathWalks methodology to Alzheimer's disease and idiopathic pulmonary fibrosis and establish that it can highlight pathways that are also identified by other pathway analysis tools and are backed by bibliographic references. More importantly, PathWalks produces additional new pathways that are functionally connected with those already established, giving insight for further experimentation.
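The idea of a guided walker that highlights frequently traversed pathway connections can be illustrated with a minimal sketch. This is not the PathWalks implementation: the function name, the toy pathway names and the edge weights are all hypothetical, and the weights simply stand in for guidance scores that would, in the paper's setting, be derived from the disease-related gene map. The sketch assumes an undirected graph without self-loops.

```python
import random
from collections import Counter


def guided_random_walk(edges, start, steps, seed=0):
    """Walk a weighted undirected pathway graph and count edge traversals.

    edges maps (pathway_a, pathway_b) tuples to positive weights.
    The weights are hypothetical guidance scores; higher means the
    walker is more likely to cross that edge.
    """
    rng = random.Random(seed)

    # Build an adjacency list that keeps the weight next to each neighbour.
    neighbours = {}
    for (a, b), weight in edges.items():
        neighbours.setdefault(a, []).append((b, weight))
        neighbours.setdefault(b, []).append((a, weight))

    traversals = Counter()
    current = start
    for _ in range(steps):
        targets, weights = zip(*neighbours[current])
        # Pick the next pathway with probability proportional to edge weight.
        nxt = rng.choices(targets, weights=weights, k=1)[0]
        traversals[tuple(sorted((current, nxt)))] += 1
        current = nxt
    return traversals


# Toy network (hypothetical): one strongly supported edge, two weak ones.
toy_edges = {
    ("pathway_A", "pathway_B"): 100.0,
    ("pathway_A", "pathway_C"): 1.0,
    ("pathway_B", "pathway_C"): 1.0,
}
counts = guided_random_walk(toy_edges, "pathway_A", steps=1000)
print(counts.most_common(1))
```

Edges with the highest traversal counts play the role of the "most frequent trajectories" described above; grouping them into communities would be a separate step that this sketch does not attempt.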
Availability and implementation
https://github.com/vagkaratzas/PathWalks.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

grünifai: interactive multiparameter optimization of molecules in a continuous vector space

Bioinformatics - Tue, 2020-05-05 02:00
Abstract
Summary
Optimizing small molecules in a drug discovery project is a notoriously difficult task as multiple molecular properties have to be considered and balanced at the same time. In this work, we present our novel interactive in silico compound optimization platform termed grünifai to support the ideation of the next generation of compounds under the constraints of a multiparameter objective. grünifai integrates adjustable in silico models, a continuous representation of the chemical space, a scalable particle swarm optimization algorithm and the possibility to actively steer the compound optimization through providing feedback on generated intermediate structures.
Availability and implementation
Source code and documentation are freely available on GitHub under an MIT license (https://github.com/jrwnter/gruenifai). The backend, including the optimization method and distribution on multiple GPU nodes, is written in Python 3. The frontend is written in ReactJS.
Categories: Bioinformatics, Journals

GM-DockZn: a geometry matching-based docking algorithm for zinc proteins

Bioinformatics - Tue, 2020-05-05 02:00
Abstract
Motivation
Molecular docking is a widely used technique for large-scale virtual screening of the interactions between small-molecule ligands and their target proteins. However, docking methods often perform poorly for metalloproteins due to additional complexity from the three-way interactions among amino-acid residues, metal ions and ligands. This is a significant problem because zinc proteins alone comprise about 10% of all available protein structures in the Protein Data Bank. Here, we developed GM-DockZn, which is dedicated to ligand docking to zinc proteins. Unlike the existing docking methods developed specifically for zinc proteins, GM-DockZn samples ligand conformations directly using a geometric grid around the ideal zinc-coordination positions of seven discovered coordination motifs, which were found from a survey of known zinc proteins complexed with a single ligand.
Results
GM-DockZn has the best performance in sampling near-native poses with correct coordination atoms and numbers within the top 50 and top 10 predictions when compared to several state-of-the-art techniques. This is true not only for a non-redundant dataset of zinc proteins but also for a homolog set of different ligand and zinc-coordination systems for the same zinc proteins. Similar superior performance of GM-DockZn for near-native-pose sampling was also observed for docking to apo-structures and cross-docking between different ligand complex structures of the same protein. The highest success rate for sampling nearest near-native poses within top 5 and top 1 was achieved by combining GM-DockZn for conformational sampling with GOLD for ranking. The proposed geometry-based sampling technique will be useful for ligand docking to other metalloproteins.
Availability and implementation
GM-DockZn is freely available at www.qmclab.com/ for academic users.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Bayesian structural equation modeling in multiple omics data with application to circadian genes

Bioinformatics - Tue, 2020-05-05 02:00
Abstract
Motivation
It is well known that integrating different data sources is valuable because it can unveil new functionalities of genomic expression that might remain dormant in a single-source analysis. Moreover, different studies have justified the greater power of analyses of multi-platform data. To this end, we consider the omics profiles of circadian genes, such as copy number changes and RNA-sequence data, along with their survival response. We develop a Bayesian structural equation model coupled with linear regressions and a log-normal accelerated failure-time regression to integrate the information between these two platforms to predict the survival of the subjects. We place conjugate priors on the regression parameters and derive the Gibbs sampler from their conditional distributions.
Results
Our extensive simulation study shows that the integrative model provides a better fit to the data than its closest competitor. The analyses of glioblastoma cancer data and the breast cancer data from TCGA, the largest genomics and transcriptomics database, support our findings.
Availability and implementation
The developed method is wrapped in an R package available at https://github.com/MAITYA02/semmcmc.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

HiC-Hiker: a probabilistic model to determine contig orientation in chromosome-length scaffolds with Hi-C

Bioinformatics - Tue, 2020-05-05 02:00
Abstract
Motivation
De novo assembly of reference-quality genomes used to require enormously laborious tasks. In particular, it is extremely time-consuming to build genome markers for ordering assembled contigs along chromosomes; thus, they are only available for well-established model organisms. To resolve this issue, recent studies demonstrated that Hi-C could be a powerful and cost-effective means to output chromosome-length scaffolds for non-model species with no genome marker resources, because the Hi-C contact frequency between a pair of loci can be a good estimator of their genomic distance, even if there is a large gap between them. Indeed, state-of-the-art methods such as 3D-DNA are now widely used for locating contigs in chromosomes. However, it remains challenging to reduce errors in contig orientation because shorter contigs have fewer contacts with their neighboring contigs. These orientation errors lower the accuracy of gene prediction, read alignment, and synteny block estimation in comparative genomics.
Results
To reduce these contig orientation errors, we propose a new algorithm, named HiC-Hiker, which has a firm grounding in probabilistic theory, rigorously models Hi-C contacts across contigs, and effectively infers the most probable orientations via the Viterbi algorithm. We compared HiC-Hiker and 3D-DNA using human and worm genome contigs generated from short reads, evaluated their performances, and observed a remarkable reduction in the contig orientation error rate from 4.3% (3D-DNA) to 1.7% (HiC-Hiker). Our algorithm can consider long-range information between distal contigs and precisely estimates Hi-C read contact probabilities among contigs, which may also be useful for determining the ordering of contigs.
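To make the role of the Viterbi algorithm concrete, the following is a simplified sketch of orientation inference along an ordered chain of contigs. It is not the HiC-Hiker model: it assumes that only adjacent contigs interact (whereas the abstract describes long-range information between distal contigs), and the input table `pair_loglik` with its log-likelihood values is hypothetical.

```python
def most_probable_orientations(pair_loglik):
    """Viterbi decoding of contig orientations along an ordered scaffold.

    pair_loglik[i][a][b] is an assumed log-likelihood of the observed
    contacts between contig i in orientation a and contig i + 1 in
    orientation b, where 0 means forward and 1 means reversed.
    Returns one orientation per contig (len(pair_loglik) + 1 values).
    """
    # Best score of any partial assignment ending in each orientation.
    score = [0.0, 0.0]
    back_pointers = []
    for table in pair_loglik:
        new_score, pointers = [], []
        for b in (0, 1):
            candidates = [score[a] + table[a][b] for a in (0, 1)]
            best_a = 0 if candidates[0] >= candidates[1] else 1
            new_score.append(candidates[best_a])
            pointers.append(best_a)
        score = new_score
        back_pointers.append(pointers)

    # Trace back from the best final orientation to the first contig.
    last = 0 if score[0] >= score[1] else 1
    path = [last]
    for pointers in reversed(back_pointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return path


# Toy input (hypothetical numbers) for three contigs.
toy_loglik = [
    [[-1.0, -5.0], [-5.0, -5.0]],  # contigs 0 and 1: both forward fits best
    [[-5.0, -1.0], [-5.0, -5.0]],  # contigs 1 and 2: contig 2 reversed fits best
]
orientations = most_probable_orientations(toy_loglik)
```

Dynamic programming keeps the cost linear in the number of contigs, whereas enumerating all orientation combinations grows exponentially.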
Availability and implementation
HiC-Hiker is freely available at: https://github.com/ryought/hic_hiker.
Categories: Bioinformatics, Journals

SHOGUN: a modular, accurate and scalable framework for microbiome quantification

Bioinformatics - Mon, 2020-05-04 02:00
Abstract
Summary
The software pipeline SHOGUN profiles known taxonomic and gene abundances of short-read shotgun metagenomics sequencing data. The pipeline is scalable, modular and flexible. Data analysis and transformation steps can be run individually or together in an automated workflow. Users can easily create new reference databases and can select one of three DNA alignment tools, ranging from ultra-fast low-RAM k-mer-based database search to fully exhaustive gapped DNA alignment, to best fit their analysis needs and computational resources. The pipeline includes an implementation of a published method for taxonomy assignment disambiguation with empirical Bayesian redistribution. The software is installable via the conda resource management framework, has plugins for the QIIME2 and QIITA packages and produces both taxonomy and gene abundance profile tables with a single command, thus promoting convenient and reproducible metagenomics research.
Availability and implementation
https://github.com/knights-lab/SHOGUN.
Categories: Bioinformatics, Journals

PRIME: a probabilistic imputation method to reduce dropout effects in single-cell RNA sequencing

Bioinformatics - Wed, 2020-04-29 02:00
Abstract
Summary
Single-cell RNA sequencing technology provides a novel means to analyze the transcriptomic profiles of individual cells. The technique is vulnerable, however, to a type of noise called dropout effects, which lead to zero-inflated distributions in the transcriptome profile and reduce the reliability of the results. Single-cell RNA sequencing data, therefore, need to be carefully processed before in-depth analysis. Here, we describe a novel imputation method that reduces dropout effects in single-cell sequencing. We construct a cell correspondence network and adjust gene expression estimates based on transcriptome profiles for the local subnetwork of cells of the same type. We comprehensively evaluated this method, called PRIME (PRobabilistic IMputation to reduce dropout effects in Expression profiles of single-cell sequencing), on synthetic and eight real single-cell sequencing datasets and verified that it improves the quality of visualization and accuracy of clustering analysis and can discover gene expression patterns hidden by noise.
Availability and implementation
The source code for the proposed method is freely available at https://github.com/hyundoo/PRIME.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

PPTPP: a novel therapeutic peptide prediction method using physicochemical property encoding and adaptive feature representation learning

Bioinformatics - Wed, 2020-04-29 02:00
Abstract
Motivation
Peptides are promising candidates for therapeutic and diagnostic development due to their great physiological versatility and structural simplicity. Thus, identifying therapeutic peptides and investigating their properties are fundamentally important. As an inexpensive and fast approach, machine learning-based predictors have shown their strength in therapeutic peptide identification owing to their capacity for massive data processing. To date, no reported therapeutic peptide predictor can perform high-quality generic prediction and identification of informative physicochemical properties (IPPs) simultaneously.
Results
In this work, Physicochemical Property-based Therapeutic Peptide Predictor (PPTPP), a Random Forest-based prediction method, is presented to address this issue. A novel feature encoding and learning scheme was devised to produce and rank physicochemical property-related features. Besides being capable of predicting multiple therapeutic peptides with performance comparable to established predictors, the presented method is also able to identify peptides’ IPPs. The results presented in this work not only illustrate the soundness of the method but also demonstrate its potential for investigating other therapeutic peptides.
Availability and implementation
https://github.com/YPZ858/PPTPP.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

aPCoA: covariate adjusted principal coordinates analysis

Bioinformatics - Mon, 2020-04-27 02:00
Abstract
Summary
In fields such as ecology, microbiology and genomics, non-Euclidean distances are widely applied to describe pairwise dissimilarity between samples. Given these pairwise distances, principal coordinates analysis is commonly used to construct a visualization of the data. However, confounding covariates can make patterns related to the scientific question of interest difficult to observe. We provide adjusted principal coordinates analysis as an easy-to-use tool, available as both an R package and a Shiny app, to improve data visualization in this context, enabling enhanced presentation of the effects of interest.
Availability and implementation
The R package ‘aPCoA’ and Shiny app can be accessed at https://cran.r-project.org/web/packages/aPCoA/index.html and https://biostatistics.mdanderson.org/shinyapps/aPCoA/.
Categories: Bioinformatics, Journals

OpenBioLink: a benchmarking framework for large-scale biomedical link prediction

Bioinformatics - Mon, 2020-04-27 02:00
Abstract
Summary
Recently, novel machine-learning algorithms have shown potential for predicting undiscovered links in biomedical knowledge networks. However, dedicated benchmarks for measuring algorithmic progress have not yet emerged. With OpenBioLink, we introduce a large-scale, high-quality and highly challenging biomedical link prediction benchmark to transparently and reproducibly evaluate such algorithms. Furthermore, we present preliminary baseline evaluation results.
Availability and implementation
Source code and data are openly available at https://github.com/OpenBioLink/OpenBioLink.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Targeted domain assembly for fast functional profiling of metagenomic datasets with S3A

Bioinformatics - Fri, 2020-04-24 02:00
Abstract
Motivation
Understanding the ever-increasing number of metagenomic sequences accumulating in our databases demands approaches that rapidly ‘explore’ the content of multiple and/or large metagenomic datasets with respect to specific domain targets, avoiding full domain annotation and full assembly.
Results
S3A is a fast and accurate domain-targeted assembler designed for rapid functional profiling. It is based on a novel construction and a fast traversal of the Overlap-Layout-Consensus graph, designed to reconstruct coding regions from domain-annotated metagenomic sequence reads. S3A relies on high-quality domain annotation to efficiently assemble metagenomic sequences and on a new confidence measure for fast evaluation of overlapping reads. Its implementation is highly generic and can be applied to any type of annotation. On simulated data, S3A achieves a level of accuracy similar to that of classical metagenomics assembly tools while allowing faster and sensitive profiling of domains of interest. When studying a few dozen functional domains, a typical scenario, S3A is up to an order of magnitude faster than general-purpose metagenomic assemblers, thus enabling the analysis of a larger number of datasets in the same amount of time. S3A opens new avenues to the fast exploration of the rapidly increasing number of metagenomic datasets of ever-increasing size.
Availability and implementation
S3A is available at http://www.lcqb.upmc.fr/S3A_ASSEMBLER/.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

A reformulation of pLSA for uncertainty estimation and hypothesis testing in bio-imaging

Bioinformatics - Thu, 2020-04-23 02:00
Abstract
Motivation
Probabilistic latent semantic analysis (pLSA) is commonly applied to describe mass spectra (MS) images. However, the method does not provide certain outputs necessary for the quantitative scientific interpretation of data. In particular, it lacks assessment of statistical uncertainty and the ability to perform hypothesis testing. We show how linear Poisson modelling advances pLSA, giving covariances on model parameters and supporting χ² testing for the presence/absence of MS signal components. As an example, this is useful for the identification of pathology in MALDI biological samples. We also show potential wider applicability, beyond MS, using magnetic resonance imaging (MRI) data from colorectal xenograft models.
Results
Simulations and MALDI spectra of a stroke-damaged rat brain show MS signals from pathological tissue can be quantified. MRI diffusion data of control and radiotherapy-treated tumours further show high sensitivity hypothesis testing for treatment effects. Successful χ² and degrees-of-freedom are computed, allowing null-hypothesis thresholding at high levels of confidence.
Availability and implementation
Open-source image analysis software available from TINA Vision, www.tina-vision.net.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Can ODE gene regulatory models neglect time lag or measurement scaling?

Bioinformatics - Thu, 2020-04-23 02:00
Abstract
Motivation
Many ordinary differential equation (ODE) models have been introduced to replace linear regression models for inferring gene regulatory relationships from time-course gene expression data. However, because the observed data are usually not direct measurements of the gene products, or there is an unknown time lag in gene regulation, it is problematic to directly apply traditional ODE models or linear regression models.
Results
We introduce a lagged ODE model to infer lagged gene regulatory relationships from time-course measurements, which are modeled as a linear transformation of the gene products. A time-course microarray dataset from a yeast cell-cycle study is used for simulation assessment of the methods and real data analysis. The results show that our method, by considering both time lag and measurement scaling, performs much better than other linear and ODE models. This indicates the necessity of explicitly modeling the time lag and measurement scaling in ODE gene regulatory models.
Availability and implementation
R code is available at https://www.sta.cuhk.edu.hk/xfan/share/lagODE.zip.
Categories: Bioinformatics, Journals

Learning context-aware structural representations to predict antigen and antibody binding interfaces

Bioinformatics - Wed, 2020-04-22 02:00
Abstract
Motivation
Understanding how antibodies specifically interact with their antigens can enable better drug and vaccine design, as well as provide insights into natural immunity. Experimental structural characterization can detail the ‘ground truth’ of antibody–antigen interactions, but computational methods are required to efficiently scale to large-scale studies. To increase prediction accuracy as well as to provide a means to gain new biological insights into these interactions, we have developed a unified deep learning-based framework to predict binding interfaces on both antibodies and antigens.
Results
Our framework leverages three key aspects of antibody–antigen interactions to learn predictive structural representations: (i) since interfaces are formed from multiple residues in spatial proximity, we employ graph convolutions to aggregate properties across local regions in a protein; (ii) since interactions are specific between antibody–antigen pairs, we employ an attention layer to explicitly encode the context of the partner; (iii) since more data are available for general protein–protein interactions, we employ transfer learning to leverage this data as a prior for the specific case of antibody–antigen interactions. We show that this single framework achieves state-of-the-art performance at predicting binding interfaces on both antibodies and antigens, and that each of its three aspects drives additional improvement in the performance. We further show that the attention layer not only improves performance, but also provides a biologically interpretable perspective into the mode of interaction.
Availability and implementation
The source code is freely available on GitHub at https://github.com/vamships/PECAN.git.
Categories: Bioinformatics, Journals

dv-trio: a family-based variant calling pipeline using DeepVariant

Bioinformatics - Tue, 2020-04-21 02:00
Abstract
Motivation
In 2018, Google published an innovative variant caller, DeepVariant, which converts pileups of sequence reads into images and uses a deep neural network to identify single-nucleotide variants and small insertions/deletions from next-generation sequencing data. This approach outperforms existing state-of-the-art tools. However, DeepVariant was designed to call variants within a single sample. In disease sequencing studies, the ability to examine a family trio (father-mother-affected child) provides greater power for disease mutation discovery.
Results
To further improve DeepVariant’s variant calling accuracy in family-based sequencing studies, we have developed a family-based variant calling pipeline, dv-trio, which incorporates the trio information from the Mendelian genetic model into variant calling based on DeepVariant.
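The kind of constraint that a Mendelian genetic model places on a trio can be illustrated with a small consistency check. This sketch is not taken from dv-trio and says nothing about how the pipeline actually combines trio information with DeepVariant calls. It only assumes diploid autosomal genotypes encoded as pairs of allele indices (0 for reference, 1 for alternate), and the function name is hypothetical.

```python
from itertools import product


def is_mendelian_consistent(father, mother, child):
    """Return True if the child genotype can be formed by taking one
    allele from the father and one from the mother.

    Genotypes are pairs of allele indices such as (0, 1); allele order
    within a genotype is ignored. Assumes a diploid autosomal site.
    """
    child_sorted = tuple(sorted(child))
    return any(
        tuple(sorted((f_allele, m_allele))) == child_sorted
        for f_allele, m_allele in product(father, mother)
    )


# A heterozygous child of a 0/0 father and a 0/1 mother is consistent.
consistent = is_mendelian_consistent((0, 0), (0, 1), (0, 1))
# A 1/1 child of the same parents is not: the father has no alternate allele.
violation = is_mendelian_consistent((0, 0), (0, 1), (1, 1))
```

A call that fails this check is either a genotyping error in one of the three samples or a candidate de novo mutation, which is why trio information is useful for refining per-sample calls.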
Availability and implementation
dv-trio is available via an open source BSD3 license at GitHub (https://github.com/VCCRI/dv-trio/).
Contact
e.giannoulatou@victorchang.edu.au
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Codon optimization: a mathematical programing approach

Bioinformatics - Mon, 2020-04-20 02:00
Abstract
Motivation
Synthesizing proteins in heterologous hosts is an important tool in biotechnology. However, the genetic code is degenerate and the codon usage is biased in many organisms. Synonymous codon changes that are customized for each host organism may have a significant effect on the level of protein expression. This effect can be measured by using metrics such as the codon adaptation index, codon pair bias, relative codon bias and relative codon pair bias. Codon optimization is the design of codon sequences that improve one or more of these metrics. Currently available algorithms and software solutions either rely on heuristics without providing optimality guarantees or are very rigid in modeling different objective functions and restrictions.
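As an illustration of one of the metrics named above, the codon adaptation index (CAI) is commonly defined as the geometric mean of per-codon relative adaptiveness weights, where each weight is the usage of a codon divided by the usage of the most frequent synonymous codon. The sketch below follows that standard definition; it is not taken from the paper's software, the usage counts are made-up toy values, and it assumes every codon has a positive count.

```python
import math


def relative_adaptiveness(codon_counts, synonymous_groups):
    """Weight of each codon: its count divided by the count of the most
    used codon encoding the same amino acid."""
    weights = {}
    for codons in synonymous_groups.values():
        top = max(codon_counts[c] for c in codons)
        for c in codons:
            weights[c] = codon_counts[c] / top
    return weights


def codon_adaptation_index(sequence, weights):
    """Geometric mean of the weights of the codons in a coding sequence.

    Trailing bases that do not form a full codon are ignored.
    """
    usable = len(sequence) - len(sequence) % 3
    codons = [sequence[i:i + 3] for i in range(0, usable, 3)]
    log_sum = sum(math.log(weights[c]) for c in codons)
    return math.exp(log_sum / len(codons))


# Toy host usage table (hypothetical counts) for lysine and glutamate.
groups = {"Lys": ["AAA", "AAG"], "Glu": ["GAA", "GAG"]}
counts = {"AAA": 30, "AAG": 10, "GAA": 20, "GAG": 20}
w = relative_adaptiveness(counts, groups)

cai_preferred = codon_adaptation_index("AAAGAA", w)  # uses the top Lys codon
cai_rare = codon_adaptation_index("AAGGAA", w)       # uses the rarer Lys codon
```

Both sequences encode the same two residues, so the difference in CAI reflects only the synonymous codon choice, which is the quantity a codon optimization formulation would try to improve.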
Results
We develop an effective mixed integer linear programing (MILP) formulation, which considers multiple objectives. Our numerical study shows that this formulation can be effectively used to generate (Pareto) optimal codon designs even for very long amino acid sequences using a standard commercial solver. We also show that one can obtain designs in the efficient frontier in reasonable solution times and incorporate other complex objectives, such as mRNA secondary structures in codon design using MILP formulations.
Availability and implementation
http://alpersen.bilkent.edu.tr/codonoptimization/CodonOptimization.zip.
Categories: Bioinformatics, Journals

A novel normalization and differential abundance test framework for microbiome data

Bioinformatics - Mon, 2020-04-20 02:00
Abstract
Motivation
Microbial communities have been shown to have a close relationship with many diseases. The identification of differentially abundant microbial species is clinically meaningful for finding disease-related pathogenic or probiotic bacteria. However, certain characteristics of microbiome data have hindered the accuracy and effectiveness of differential abundance analysis. The abundances or counts of microbiome species are usually on different scales and exhibit zero-inflation and over-dispersion. Normalization is a crucial step before the differential abundance test. However, existing normalization methods typically try to adjust counts on different scales to a common scale by constructing size factors with the assumption that count distributions across samples are equivalent up to a certain percentile. These methods often yield undesirable results when differentially abundant species are of low to medium abundance level. For differential abundance analysis, existing methods often use a single distribution to model the dispersion of species, which lacks the flexibility to capture a single species’ distinctiveness. These methods tend to detect many false positives and often lack power when the effect size is small.
Results
We develop a novel framework for differential abundance analysis on sparse high-dimensional marker gene microbiome data. Our methodology relies on a novel network-based normalization technique and a two-stage zero-inflated mixture count regression model (RioNorm2). Our normalization method aims to find a group of relatively invariant microbiome species across samples and conditions in order to construct the size factor. Another contribution of the paper is that our testing approach can take under-sampling and over-dispersion into consideration by separating microbiome species into two groups and modeling them separately. In comprehensive simulation studies, our method is consistently powerful and robust across settings with different sample sizes, library sizes and effect sizes. We also demonstrate the effectiveness of our novel framework using a published dataset of metastatic melanoma and draw biological insights from the results.
Availability and implementation
The R package ‘RioNorm2’ can be installed from GitHub at https://github.com/yuanjing-ma/RioNorm2.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals