Bioinformatics


Quantifying the similarity of topological domains across normal and cancer human cell types

Wed, 2018-06-27 02:00
Abstract
Motivation
Three-dimensional chromosome structure has been increasingly shown to influence various levels of cellular and genomic functions. Through Hi-C data, which maps contact frequency on chromosomes, it has been found that structural elements termed topologically associating domains (TADs) are involved in many regulatory mechanisms. However, we have little understanding of the level of similarity or variability of chromosome structure across cell types and disease states. In this study, we present a method to quantify resemblance and identify structurally similar regions between any two sets of TADs.
Results
We present an analysis of 23 human Hi-C samples representing various tissue types in normal and cancer cell lines. We quantify global and chromosome-level structural similarity, and compare the relative similarity between cancer and non-cancer cells. We find that cancer cells show higher structural variability around commonly mutated pan-cancer genes than normal cells at these same locations.
Availability and implementation
Software for the methods and analysis can be found at https://github.com/Kingsford-Group/localtadsim
Categories: Bioinformatics, Journals

Classifying tumors by supervised network propagation

Wed, 2018-06-27 02:00
Abstract
Motivation
Network propagation has been widely used to aggregate and amplify the effects of tumor mutations using knowledge of molecular interaction networks. However, propagating mutations through interactions irrelevant to cancer leads to erosion of pathway signals and complicates the identification of cancer subtypes.
Results
To address this problem we introduce a propagation algorithm, Network-Based Supervised Stratification (NBS2), which learns the mutated subnetworks underlying tumor subtypes using a supervised approach. Given an annotated molecular network and reference tumor mutation profiles for which subtypes have been predefined, NBS2 is trained by adjusting the weights on interaction features such that network propagation best recovers the provided subtypes. After training, weights are fixed such that mutation profiles of new tumors can be accurately classified. We evaluate NBS2 on breast and glioblastoma tumors, demonstrating that it outperforms the best network-based approaches in classifying tumors to known subtypes for these diseases. By interpreting the interaction weights, we highlight characteristic molecular pathways driving selected subtypes.
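The basic, unsupervised form of network propagation that this abstract builds on can be sketched as a random walk with restart. This is only an illustration under stated assumptions: the toy network, the function name `propagate` and the restart parameter `alpha` are hypothetical, edges are unweighted, and nothing here reproduces the supervised weight learning that NBS2 adds.

```python
def propagate(adjacency, seed_scores, alpha=0.5, tol=1e-9, max_iter=1000):
    """Random walk with restart on an undirected, unweighted graph.

    adjacency: dict mapping node -> set of neighbour nodes
    seed_scores: dict mapping node -> initial score (e.g. 1.0 if mutated)
    """
    nodes = sorted(adjacency)
    initial = {n: float(seed_scores.get(n, 0.0)) for n in nodes}
    current = dict(initial)
    for _ in range(max_iter):
        updated = {}
        for n in nodes:
            # Each neighbour shares its score equally among its own neighbours.
            incoming = sum(current[m] / len(adjacency[m]) for m in adjacency[n])
            # Mix the propagated signal with the original mutation profile.
            updated[n] = alpha * incoming + (1 - alpha) * initial[n]
        change = max(abs(updated[n] - current[n]) for n in nodes)
        current = updated
        if change < tol:
            break
    return current


# Hypothetical four-gene path network A - B - C - D with a mutation in A.
toy_network = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
}
smoothed = propagate(toy_network, {"A": 1.0})
```

In this toy case the mutation signal decays with distance from the mutated gene. A supervised variant would replace the equal sharing among neighbours with learned edge weights.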
Availability and implementation
The NBS2 package is freely available at: https://github.com/wzhang1984/NBSS.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Bayesian parameter estimation for biochemical reaction networks using region-based adaptive parallel tempering

Wed, 2018-06-27 02:00
Abstract
Motivation
Mathematical models have become standard tools for the investigation of cellular processes and the unraveling of signal processing mechanisms. The parameters of these models are usually derived from the available data using optimization and sampling methods. However, the efficiency of these methods is limited by the properties of the mathematical model, e.g. non-identifiabilities, and the resulting posterior distribution. In particular, multi-modal distributions with long valleys or pronounced tails are difficult to optimize and sample. Thus, the development or improvement of optimization and sampling methods is the subject of ongoing research.
Results
We suggest a region-based adaptive parallel tempering algorithm which adapts to the problem-specific posterior distributions, i.e. modes and valleys. The algorithm combines several established algorithms to overcome their individual shortcomings and to improve sampling efficiency. We assessed its properties for established benchmark problems and two ordinary differential equation models of biochemical reaction networks. The proposed algorithm outperformed state-of-the-art methods in terms of calculation efficiency and mixing. Since the algorithm does not rely on a specific problem structure, but adapts to the posterior distribution, it is suitable for a variety of model classes.
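For readers unfamiliar with parallel tempering, the standard swap step between two tempered chains can be sketched as follows. This shows only the textbook acceptance rule; the region-based adaptation described in the abstract is not reproduced, and the function and argument names are hypothetical.

```python
import math


def swap_acceptance(beta_i, beta_j, log_post_i, log_post_j):
    """Probability of accepting a state swap between two tempered chains.

    Chain i samples from posterior**beta_i and currently sits at a state with
    log-posterior log_post_i; likewise for chain j.
    """
    log_ratio = (beta_i - beta_j) * (log_post_j - log_post_i)
    # Cap the exponent at 0 so the result is min(1, exp(log_ratio)) without overflow.
    return math.exp(min(0.0, log_ratio))


# The cold chain (beta = 1) sits in a poor region, the hot chain found a better one.
always_swap = swap_acceptance(1.0, 0.5, log_post_i=-10.0, log_post_j=-2.0)
# The reverse situation: swapping would move the cold chain to a worse state.
rarely_swap = swap_acceptance(1.0, 0.5, log_post_i=-2.0, log_post_j=-10.0)
```

Swaps that hand a better state to the colder chain are always accepted, which is how hot chains help the cold chain escape isolated modes.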
Availability and implementation
The code, written in MATLAB, is available both as Supplementary Material and in a Git repository.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

An optimization framework for network annotation

Wed, 2018-06-27 02:00
Abstract
Motivation
A chief goal of systems biology is the reconstruction of large-scale executable models of cellular processes of interest. While accurate continuous models are still beyond reach, a powerful alternative is to learn a logical model of the processes under study, which predicts the logical state of any node of the model as a Boolean function of its incoming nodes. Key to learning such models is the functional annotation of the underlying physical interactions with activation/repression (sign) effects. Such annotations are pretty common for a few well-studied biological pathways.
Results
Here we present a novel optimization framework for large-scale sign annotation that employs different plausible models of signaling and combines them in a rigorous manner. We apply our framework to two large-scale knockout datasets in yeast and evaluate its different components as well as the combined model to predict signs of different subsets of physical interactions. Overall, we obtain an accurate predictor that outperforms previous work by a considerable margin.
Availability and implementation
The code is publicly available at https://github.com/spatkar94/NetworkAnnotation.git.
Categories: Bioinformatics, Journals

Learning with multiple pairwise kernels for drug bioactivity prediction

Wed, 2018-06-27 02:00
Abstract
Motivation
Many inference problems in bioinformatics, including drug bioactivity prediction, can be formulated as pairwise learning problems, in which one is interested in making predictions for pairs of objects, e.g. drugs and their targets. Kernel-based approaches have emerged as powerful tools for solving problems of that kind, and especially multiple kernel learning (MKL) offers promising benefits as it enables integrating various types of complex biomedical information sources in the form of kernels, along with learning their importance for the prediction task. However, the immense size of pairwise kernel spaces remains a major bottleneck, making the existing MKL algorithms computationally infeasible even for a small number of input pairs.
Results
We introduce pairwiseMKL, the first method for time- and memory-efficient learning with multiple pairwise kernels. pairwiseMKL first determines the mixture weights of the input pairwise kernels, and then learns the pairwise prediction function. Both steps are performed efficiently without explicit computation of the massive pairwise matrices, therefore making the method applicable to solving large pairwise learning problems. We demonstrate the performance of pairwiseMKL in two related tasks of quantitative drug bioactivity prediction using up to 167 995 bioactivity measurements and 3120 pairwise kernels: (i) prediction of anticancer efficacy of drug compounds across a large panel of cancer cell lines; and (ii) prediction of target profiles of anticancer compounds across their kinome-wide target spaces. We show that pairwiseMKL provides accurate predictions using sparse solutions in terms of selected kernels, and therefore it also automatically identifies data sources relevant for the prediction problem.
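One common way to avoid building a pairwise kernel matrix explicitly is the Kronecker identity (A ⊗ B) vec(X) = vec(A X Bᵀ), using row-major flattening. The sketch below checks that identity on toy data. It assumes the pairwise kernel is a Kronecker product of a drug kernel and a target kernel; the abstract does not spell this out, so treat it as an illustration of the general idea rather than the pairwiseMKL implementation. All names and values are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def kron(a, b):
    """Explicit Kronecker product; its size grows as (rows_a * rows_b) squared."""
    return [[a[i][k] * b[j][l]
             for k in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for j in range(len(b))]


def pairwise_product_explicit(k_drug, k_target, x):
    """Multiply the full pairwise kernel with the row-major flattening of x."""
    big = kron(k_drug, k_target)
    flat = [v for row in x for v in row]
    return [sum(r[c] * flat[c] for c in range(len(flat))) for r in big]


def pairwise_product_implicit(k_drug, k_target, x):
    """Same product via K_drug @ X @ K_target^T, never forming the big matrix."""
    result = matmul(matmul(k_drug, x), transpose(k_target))
    return [v for row in result for v in row]


# Toy kernels for 2 drugs and 3 targets, and one value per (drug, target) pair.
k_drug = [[1.0, 0.2], [0.2, 1.0]]
k_target = [[1.0, 0.5, 0.1], [0.5, 1.0, 0.3], [0.1, 0.3, 1.0]]
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

explicit = pairwise_product_explicit(k_drug, k_target, x)
implicit = pairwise_product_implicit(k_drug, k_target, x)
```

For m drugs and n targets the explicit route stores (mn)² kernel entries, while the implicit route only ever holds m × m, n × n and m × n matrices.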
Availability and implementation
Code is available at https://github.com/aalto-ics-kepaco.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Improved pathway reconstruction from RNA interference screens by exploiting off-target effects

Wed, 2018-06-27 02:00
Abstract
Motivation
Pathway reconstruction has proven to be an indispensable tool for analyzing the molecular mechanisms of signal transduction underlying cell function. Nested effects models (NEMs) are a class of probabilistic graphical models designed to reconstruct signalling pathways from high-dimensional observations resulting from perturbation experiments, such as RNA interference (RNAi). NEMs assume that the short interfering RNAs (siRNAs) designed to knockdown specific genes are always on-target. However, it has been shown that most siRNAs exhibit strong off-target effects, which further confound the data, resulting in unreliable reconstruction of networks by NEMs.
Results
Here, we present an extension of NEMs called probabilistic combinatorial nested effects models (pc-NEMs), which capitalize on the ancillary siRNA off-target effects for network reconstruction from combinatorial gene knockdown data. Our model employs an adaptive simulated annealing search algorithm for simultaneous inference of network structure and error rates inherent to the data. Evaluation of pc-NEMs on simulated data with varying numbers of phenotypic effects and noise levels as well as real data demonstrates improved reconstruction compared to classical NEMs. Application to Bartonella henselae infection RNAi screening data yielded an eight-node network largely in agreement with previous works, and revealed novel binary interactions of direct impact between established components.
Availability and implementation
The software used for the analysis is freely available as an R package at https://github.com/cbg-ethz/pcNEM.git.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Onto2Vec: joint vector-based representation of biological entities and their ontology-based annotations

Wed, 2018-06-27 02:00
Abstract
Motivation
Biological knowledge is widely represented in the form of ontology-based annotations: ontologies describe the phenomena assumed to exist within a domain, and the annotations associate a (kind of) biological entity with a set of phenomena within the domain. The structure and information contained in ontologies and their annotations make them valuable for developing machine learning, data analysis and knowledge extraction algorithms; notably, semantic similarity is widely used to identify relations between biological entities, and ontology-based annotations are frequently used as features in machine learning applications.
Results
We propose the Onto2Vec method, an approach to learn feature vectors for biological entities based on their annotations to biomedical ontologies. Our method can be applied to a wide range of bioinformatics research problems such as similarity-based prediction of interactions between proteins, classification of interaction types using supervised learning, or clustering. To evaluate Onto2Vec, we use the gene ontology (GO) and jointly produce dense vector representations of proteins, the GO classes to which they are annotated, and the axioms in GO that constrain these classes. First, we demonstrate that Onto2Vec-generated feature vectors can significantly improve prediction of protein–protein interactions in human and yeast. We then illustrate how Onto2Vec representations provide the means for constructing data-driven, trainable semantic similarity measures that can be used to identify particular relations between proteins. Finally, we use an unsupervised clustering approach to identify protein families based on their Enzyme Commission numbers. Our results demonstrate that Onto2Vec can generate high quality feature vectors from biological entities and ontologies. Onto2Vec has the potential to significantly outperform the state-of-the-art in several predictive applications in which ontologies are involved.
Availability and implementation
https://github.com/bio-ontology-research-group/onto2vec
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

A new method for constructing tumor specific gene co-expression networks based on samples with tumor purity heterogeneity

Wed, 2018-06-27 02:00
Abstract
Motivation
Tumor tissue samples often contain an unknown fraction of stromal cells. This problem, widely known as tumor purity heterogeneity (TPH), was recently recognized as a severe issue in omics studies. Specifically, if TPH is ignored when inferring co-expression networks, edges are likely to be estimated among genes with mean shift between non-tumor and tumor cells rather than among gene pairs interacting with each other in tumor cells. To address this issue, we propose Tumor Specific Net (TSNet), a new method which constructs tumor-cell specific gene/protein co-expression networks based on gene/protein expression profiles of tumor tissues. TSNet treats the observed expression profile as a mixture of expressions from different cell types and explicitly models tumor purity percentage in each tumor sample.
Results
Using extensive synthetic data experiments, we demonstrate that TSNet outperforms a standard graphical model which does not account for TPH. We then apply TSNet to estimate tumor specific gene co-expression networks based on TCGA ovarian cancer RNAseq data. We identify novel co-expression modules and hub structure specific to tumor cells.
Availability and implementation
R code can be found at https://github.com/petraf01/TSNet.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

PrimAlign: PageRank-inspired Markovian alignment for large biological networks

Wed, 2018-06-27 02:00
Abstract
Motivation
Cross-species analysis of large-scale protein–protein interaction (PPI) networks has played a significant role in understanding the principles deriving evolution of cellular organizations and functions. Recently, network alignment algorithms have been proposed to predict conserved interactions and functions of proteins. These approaches are based on the notion that orthologous proteins across species are sequentially similar and that topology of PPIs between orthologs is often conserved. However, high accuracy and scalability of network alignment are still a challenge.
Results
We propose a novel pairwise global network alignment algorithm, called PrimAlign, which is modeled as a Markov chain and iteratively transited until convergence. The proposed algorithm also incorporates the principles of PageRank. This approach is evaluated on tasks with human, yeast and fruit fly PPI networks. The experimental results demonstrate that PrimAlign outperforms several prevalent methods with statistically significant differences in multiple evaluation measures. PrimAlign, which is multi-platform, achieves superior performance in runtime with its linear asymptotic time complexity. Further evaluation is done with synthetic networks and results suggest that popular topological measures do not reflect real precision of alignments.
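The idea of iterating a Markov chain until convergence, as PageRank does, can be sketched with a plain power iteration. This is a generic illustration, not the PrimAlign algorithm: the transition matrix, the damping value 0.85 and the function name are assumptions chosen for the example.

```python
def stationary_distribution(transition, damping=0.85, tol=1e-10, max_iter=10_000):
    """Power iteration for a PageRank-style Markov chain.

    transition[i][j] is the probability of moving from state i to state j;
    every row must sum to 1.
    """
    n = len(transition)
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        new_rank = []
        for j in range(n):
            walk = sum(rank[i] * transition[i][j] for i in range(n))
            # With probability 1 - damping, jump to a uniformly random state.
            new_rank.append(damping * walk + (1.0 - damping) / n)
        converged = sum(abs(a - b) for a, b in zip(new_rank, rank)) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Hypothetical three-state chain in which state 1 links states 0 and 2.
toy_chain = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]
ranks = stationary_distribution(toy_chain)
```

Each iteration costs time proportional to the number of nonzero transitions when the matrix is stored sparsely, which is the usual reason such methods scale to large networks.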
Availability and implementation
The source code is available at http://web.ecs.baylor.edu/faculty/cho/PrimAlign.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

SigMat: a classification scheme for gene signature matching

Wed, 2018-06-27 02:00
Abstract
Motivation
Several large-scale efforts have been made to collect gene expression signatures from a variety of biological conditions, such as response of cell lines to treatment with drugs, or tumor samples with different characteristics. These gene signature collections are utilized through bioinformatics tools for ‘signature matching’, whereby a researcher studying an expression profile can identify previously cataloged biological conditions most related to their profile. Signature matching tools typically retrieve from the collection the signature that has highest similarity to the user-provided profile. Alternatively, classification models may be applied where each biological condition in the signature collection is a class label; however, such models are trained on the collection of available signatures and may not generalize to the novel cellular context or cell line of the researcher’s expression profile.
Results
We present an advanced multi-way classification algorithm for signature matching, called SigMat, that is trained on a large signature collection from a well-studied cellular context, but can also classify signatures from other cell types by relying on an additional, small collection of signatures representing the target cell type. It uses these ‘tuning data’ to learn two additional parameters that help adapt its predictions for other cellular contexts. SigMat outperforms other similarity scores and classification methods in identifying the correct label of a query expression profile from as many as 244 or 500 candidate classes (drug treatments) cataloged by the LINCS L1000 project. SigMat retains its high accuracy in cross-cell line applications even when the amount of tuning data is severely limited.
Availability and implementation
SigMat is available on GitHub at https://github.com/JinfengXiao/SigMat.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

GSEA-InContext: identifying novel and common patterns in expression experiments

Wed, 2018-06-27 02:00
Abstract
Motivation
Gene Set Enrichment Analysis (GSEA) is routinely used to analyze and interpret coordinate pathway-level changes in transcriptomics experiments. For an experiment where fewer than seven samples per condition are compared, GSEA employs a competitive null hypothesis to test significance. A gene set enrichment score is tested against a null distribution of enrichment scores generated from permuted gene sets, where genes are randomly selected from the input experiment. Looking across a variety of biological conditions, however, genes are not randomly distributed, with many showing consistent patterns of up- or down-regulation. As a result, common patterns of positively and negatively enriched gene sets are observed across experiments. Placing a single experiment into the context of a relevant set of background experiments allows us to identify both the common and experiment-specific patterns of gene set enrichment.
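The competitive null described above, in which an enrichment score is compared with scores from randomly drawn gene sets, can be sketched as follows. This is a simplified illustration: it uses an unweighted running-sum score rather than the weighted statistic of the GSEA software, draws random sets uniformly from the input genes, and does not implement the background-aware sampling of GSEA-InContext. Gene names and function names are hypothetical.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum (Kolmogorov-Smirnov-like) enrichment score."""
    hits = sum(1 for g in ranked_genes if g in gene_set)
    misses = len(ranked_genes) - hits
    if hits == 0 or misses == 0:
        return 0.0
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        # Step up on a gene-set member, step down otherwise.
        running += 1.0 / hits if gene in gene_set else -1.0 / misses
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Compare the observed score with scores of random same-size gene sets."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    size = sum(1 for g in ranked_genes if g in gene_set)
    extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, size))
        if abs(enrichment_score(ranked_genes, random_set)) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_perm + 1)


# Hypothetical ranking of 100 genes with a gene set concentrated at the top.
ranked = [f"g{i}" for i in range(100)]
top_set = {"g0", "g1", "g2", "g3", "g4"}
score, p_value = permutation_p_value(ranked, top_set)
```

Because the random sets are drawn uniformly, this null ignores the fact that some genes are up- or down-regulated in many experiments, which is the limitation the abstract goes on to address.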
Results
We compiled a compendium of 442 small molecule transcriptomic experiments and used GSEA to characterize common patterns of positively and negatively enriched gene sets. To identify experiment-specific gene set enrichment, we developed the GSEA-InContext method that accounts for gene expression patterns within a background set of experiments to identify statistically significantly enriched gene sets. We evaluated GSEA-InContext on experiments using small molecules with known targets to show that it successfully prioritizes gene sets that are specific to each experiment, thus providing valuable insights that complement standard GSEA analysis.
Availability and implementation
GSEA-InContext (implemented in Python), supplementary results and the background expression compendium are available at: https://github.com/CostelloLab/GSEA-InContext.
Categories: Bioinformatics, Journals

Deep neural networks and distant supervision for geographic location mention extraction

Wed, 2018-06-27 02:00
Abstract
Motivation
Virus phylogeographers rely on DNA sequences of viruses and the locations of the infected hosts found in public sequence databases like GenBank for modeling virus spread. However, the locations in GenBank records are often only at the country or state level, and may require phylogeographers to scan the journal articles associated with the records to identify more localized geographic areas. To automate this process, we present a named entity recognizer (NER) for detecting locations in biomedical literature. We built the NER using a deep feedforward neural network to determine whether a given token is a toponym or not. To overcome the limited human annotated data available for training, we use distant supervision techniques to generate additional samples to train our NER.
Results
Our NER achieves an F1-score of 0.910 and significantly outperforms the previous state-of-the-art system. Using the additional data generated through distant supervision further boosts the performance of the NER, achieving an F1-score of 0.927. Our experiments also demonstrate the NER’s capability to embed external features to further boost the system’s performance. We believe that the same methodology can be applied for recognizing similar biomedical entities in scientific literature.
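For reference, the F1-score reported above is the harmonic mean of precision and recall. A minimal sketch of how it could be computed for toponym detection follows, assuming exact matching of (token position, text) pairs. The matching scheme actually used in the paper is not stated in the abstract, and the example annotations are invented.

```python
def f1_score(gold, predicted):
    """Harmonic mean of precision and recall over two sets of annotations."""
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations as (token position, text) pairs.
gold_toponyms = {(0, "Phoenix"), (5, "Arizona"), (9, "Mexico")}
predicted_toponyms = {(0, "Phoenix"), (5, "Arizona"), (7, "Tempe")}
f1 = f1_score(gold_toponyms, predicted_toponyms)
```

Here two of three predictions are correct and two of three gold toponyms are found, so precision, recall and F1 all equal 2/3.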
Categories: Bioinformatics, Journals

NeuroMorphoVis: a collaborative framework for analysis and visualization of neuronal morphology skeletons reconstructed from microscopy stacks

Wed, 2018-06-27 02:00
Abstract
Motivation
From image stacks to computational models, processing digital representations of neuronal morphologies is essential to neuroscientific research. Workflows involve various techniques and tools, leading in certain cases to convoluted and fragmented pipelines. The existence of an integrated, extensible and free framework for processing, analysis and visualization of those morphologies is a challenge that is still largely unfulfilled.
Results
We present NeuroMorphoVis, an interactive, extensible and cross-platform framework for building, visualizing and analyzing digital reconstructions of neuronal morphology skeletons extracted from microscopy stacks. Our framework is capable of detecting and repairing tracing artifacts, allowing the generation of high fidelity surface meshes and high resolution volumetric models for simulation and in silico imaging studies. The applicability of NeuroMorphoVis is demonstrated with two case studies. The first simulates the construction of three-dimensional profiles of neuronal somata and the other highlights how the framework is leveraged to create volumetric models of neuronal circuits for simulating different types of in vitro imaging experiments.
Availability and implementation
The source code and documentation are freely available on https://github.com/BlueBrain/NeuroMorphoVis under the GNU public license. The morphological analysis, visualization and surface meshing are implemented as an extensible Python API (Application Programming Interface) based on Blender, and the volume reconstruction and analysis code is written in C++ and parallelized using OpenMP. The framework features are accessible from a user-friendly GUI (Graphical User Interface) and a rich CLI (Command Line Interface).
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

The Kappa platform for rule-based modeling

Wed, 2018-06-27 02:00
Abstract
Motivation
We present an overview of the Kappa platform, an integrated suite of analysis and visualization techniques for building and interactively exploring rule-based models. The main components of the platform are the Kappa Simulator, the Kappa Static Analyzer and the Kappa Story Extractor. In addition to these components, we describe the Kappa User Interface, which includes a range of interactive visualization tools for rule-based models needed to make sense of the complexity of biological systems. We argue that, in this approach, modeling is akin to programming and can likewise benefit from an integrated development environment. Our platform is a step in this direction.
Results
We discuss details about the computation and rendering of static, dynamic, and causal views of a model, which include the contact map (CM), snapshots at different resolutions, the dynamic influence network (DIN) and causal compression. We provide use cases illustrating how these concepts generate insight. Specifically, we show how the CM and snapshots provide information about systems capable of polymerization, such as Wnt signaling. A well-understood model of the KaiABC oscillator, translated into Kappa from the literature, is deployed to demonstrate the DIN and its use in understanding systems dynamics. Finally, we discuss how pathways might be discovered or recovered from a rule-based model by means of causal compression, as exemplified for early events in EGF signaling.
Availability and implementation
The Kappa platform is available via the project website at kappalanguage.org. All components of the platform are open source and freely available through the authors’ code repositories.
Categories: Bioinformatics, Journals

Author Index

Wed, 2018-06-27 02:00
Abdellah,M. i574
Categories: Bioinformatics, Journals

Covariate-dependent negative binomial factor analysis of RNA sequencing data

Wed, 2018-06-27 02:00
Abstract
Motivation
High-throughput sequencing technologies, in particular RNA sequencing (RNA-seq), have become the basic practice for genomic studies in biomedical research. In addition to studying genes individually, for example, through differential expression analysis, investigating co-ordinated expression variations of genes may help reveal the underlying cellular mechanisms to derive better understanding and more effective prognosis and intervention strategies. Although a variety of co-expression network based methods exist for analyzing microarray data for this purpose, blindly extending these microarray methods to RNA-seq data may introduce unnecessary bias. It is therefore crucial to develop methods well adapted to RNA-seq data to identify the functional modules of genes with similar expression patterns.
Results
We have developed a fully Bayesian covariate-dependent negative binomial factor analysis (dNBFA) method for RNA-seq count data, to capture coordinated gene expression changes, while considering effects from covariates reflecting different influencing factors. Unlike existing co-expression network based methods, our proposed model does not require multiple ad-hoc choices on data processing, transformation, or co-expression measures, and can be directly applied to RNA-seq data. Furthermore, being capable of incorporating covariate information, the proposed method can tackle setups with complex confounding factors in different experiment designs. Finally, the natural model parameterization removes the need for a normalization preprocessing step, as commonly adopted to compensate for the effect of sequencing-depth variations. Efficient Bayesian inference of model parameters is derived by exploiting conditional conjugacy via novel data augmentation techniques. Experimental results on several real-world RNA-seq datasets on complex diseases suggest dNBFA as a powerful tool for discovering the gene modules with significant differential expression and meaningful biological insight.
Availability and implementation
dNBFA is implemented in R and is available at https://github.com/siamakz/dNBFA.
Categories: Bioinformatics, Journals

aliFreeFold: an alignment-free approach to predict secondary structure from homologous RNA sequences

Wed, 2018-06-27 02:00
Abstract
Motivation
Predicting the conserved secondary structure of homologous ribonucleic acid (RNA) sequences is crucial for understanding RNA functions. However, fast and accurate RNA structure prediction is challenging, especially when the number and the divergence of homologous RNA increase. To address this challenge, we propose aliFreeFold, based on a novel alignment-free approach which computes a representative structure from a set of homologous RNA sequences using sub-optimal secondary structures generated for each sequence. It is based on a vector representation of sub-optimal structures capturing structure conservation signals by weighting structural motifs according to their conservation across the sub-optimal structures.
Results
We demonstrate that aliFreeFold provides a good balance between speed and accuracy regarding predictions of representative structures for sets of homologous RNA compared to traditional methods based on sequence and structure alignment. We show that aliFreeFold is capable of uncovering conserved structural features quickly and effectively thanks to its weighting scheme that gives more (resp. less) importance to common (resp. uncommon) structural motifs. The weighting scheme is also shown to be capable of capturing conservation signal as the number of homologous RNA increases. These results demonstrate the ability of aliFreeFold to efficiently and accurately provide interesting structural representatives of RNA families.
Availability and implementation
aliFreeFold was implemented in C++. Source code and Linux binary are freely available at https://github.com/UdeS-CoBIUS/aliFreeFold.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Random forest based similarity learning for single cell RNA sequencing data

Wed, 2018-06-27 02:00
Abstract
Motivation
Genome-wide transcriptome sequencing applied to single cells (scRNA-seq) is rapidly becoming an assay of choice across many fields of biological and biomedical research. Scientific objectives often revolve around discovery or characterization of types or sub-types of cells, and therefore, obtaining accurate cell–cell similarities from scRNA-seq data is a critical step in many studies. While rapid advances are being made in the development of tools for scRNA-seq data analysis, few approaches exist that explicitly address this task. Furthermore, abundance and type of noise present in scRNA-seq datasets suggest that application of generic methods, or of methods developed for bulk RNA-seq data, is likely suboptimal.
Results
Here, we present RAFSIL, a random forest based approach to learn cell–cell similarities from scRNA-seq data. RAFSIL implements a two-step procedure, where feature construction geared towards scRNA-seq data is followed by similarity learning. It is designed to be adaptable and expandable, and RAFSIL similarities can be used for typical exploratory data analysis tasks like dimension reduction, visualization and clustering. We show that our approach compares favorably with current methods across a diverse collection of datasets, and that it can be used to detect and highlight unwanted technical variation in scRNA-seq datasets in situations where other methods fail. Overall, RAFSIL implements a flexible approach yielding a useful tool that improves the analysis of scRNA-seq data.
Availability and implementation
The RAFSIL R package is available at www.kostkalab.net/software.html
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

A pan-genome-based machine learning approach for predicting antimicrobial resistance activities of the Escherichia coli strains

Wed, 2018-06-27 02:00
Abstract
Motivation
Antimicrobial resistance (AMR) is becoming a huge problem in both developed and developing countries, and identifying strains resistant or susceptible to certain antibiotics is essential in fighting against antibiotic-resistant pathogens. Whole-genome sequences have been collected for different microbial strains in order to identify crucial characteristics that allow certain strains to become resistant to antibiotics; however, a global inspection of the gene content responsible for AMR activities remains to be done.
Results
We propose a pan-genome-based approach to characterize antibiotic-resistant microbial strains and test this approach on the bacterial model organism Escherichia coli. By identifying core and accessory gene clusters and predicting AMR genes for the E. coli pan-genome, we not only showed that certain classes of genes are unevenly distributed between the core and accessory parts of the pan-genome but also demonstrated that only a portion of the identified AMR genes belong to the accessory genome. Application of machine learning algorithms to predict whether specific strains were resistant to antibiotic drugs yielded the best prediction accuracy for the set of AMR genes within the accessory part of the pan-genome, suggesting that these gene clusters were most crucial to AMR activities in E. coli. Selecting subsets of AMR genes for different antibiotic drugs based on a genetic algorithm (GA) achieved better prediction performances than the gene sets established in the literature, hinting that the gene sets selected by the GA may warrant further analysis in investigating more details about how E. coli fight against antibiotics.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Unsupervised embedding of single-cell Hi-C data

Wed, 2018-06-27 02:00
Abstract
Motivation
Single-cell Hi-C (scHi-C) data promises to enable scientists to interrogate the 3D architecture of DNA in the nucleus of the cell, studying how this structure varies stochastically or along developmental or cell-cycle axes. However, Hi-C data analysis requires methods that take into account the unique characteristics of this type of data. In this work, we explore whether methods that have been developed previously for the analysis of bulk Hi-C data can be applied to scHi-C data. We apply methods designed for analysis of bulk Hi-C data to scHi-C data in conjunction with unsupervised embedding.
Results
We find that one of these methods, HiCRep, when used in conjunction with multidimensional scaling (MDS), strongly outperforms three other methods, including a technique that has been used previously for scHi-C analysis. We also provide evidence that the HiCRep/MDS method is robust to extremely low per-cell sequencing depth, that this robustness is improved even further when high-coverage and low-coverage cells are projected together, and that the method can be used to jointly embed cells from multiple published datasets.
Categories: Bioinformatics, Journals