Bioinformatics

RSS feed
Updated: 19 hours 17 min ago

Approximate, simultaneous comparison of microbial genome architectures via syntenic anchoring of quiver representations

Sat, 2018-09-08 02:00
Motivation
A long-standing limitation in comparative genomic studies is the dependency on a reference genome, which limits the spectrum of genetic diversity that can be identified across a population of organisms. This is especially true in the microbial world, where genome architectures can vary significantly. There is therefore a need for computational methods that can simultaneously analyze the architectures of multiple genomes without introducing bias from a reference.
Results
In this article, we present Ptolemy: a novel method for studying the diversity of genome architectures, such as structural variation and pan-genomes, across a collection of microbial assemblies without the need for a reference. Ptolemy is a ‘top-down’ approach to comparing whole-genome assemblies. Genomes are represented as labeled multi-directed graphs, known as quivers, which are then merged into a single, canonical quiver by identifying ‘gene anchors’ via synteny analysis. The canonical quiver represents an approximate structural alignment of all genomes in a given collection and encodes structural variation across (sub-)populations within the collection. We highlight various applications of Ptolemy by analyzing structural variation and the pan-genomes of different datasets composed of Mycobacterium, Saccharomyces, Escherichia and Shigella species. Our results show that Ptolemy is flexible and can handle both conserved and highly dynamic genome architectures. Ptolemy is user-friendly (it requires only a FASTA-formatted assembly along with a corresponding GFF-formatted file) and resource-friendly (it can align 24 genomes in ∼10 min with four CPUs and <2 GB of RAM).
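The merging step described above can be illustrated with a minimal sketch. It is not Ptolemy's implementation: it assumes that the gene-to-anchor mapping (`anchor_of`) has already been produced by a synteny analysis, and the function names and the toy genomes are hypothetical. Each genome contributes labeled edges between consecutive anchors, and edges carried by only a subset of genomes point to candidate structural variation.

```python
from collections import defaultdict


def canonical_quiver(gene_orders, anchor_of):
    """Merge per-genome gene orders into one labeled multi-digraph.

    gene_orders: genome name -> list of gene ids in chromosomal order.
    anchor_of:   gene id -> canonical anchor id (assumed to be given).
    Returns a dict mapping (source anchor, target anchor) -> set of genomes.
    """
    edges = defaultdict(set)
    for genome, genes in gene_orders.items():
        anchors = [anchor_of[gene] for gene in genes]
        # Connect each anchor to its successor and label the edge with the genome.
        for src, dst in zip(anchors, anchors[1:]):
            edges[(src, dst)].add(genome)
    return dict(edges)


def variable_edges(edges, genomes):
    """Return the edges that are not shared by every genome."""
    everyone = set(genomes)
    return {edge: labels for edge, labels in edges.items() if labels != everyone}


# Hypothetical toy input: genome g2 lacks gene c relative to g1.
gene_orders = {
    "g1": ["g1_a", "g1_b", "g1_c", "g1_d"],
    "g2": ["g2_a", "g2_b", "g2_d"],
}
anchor_of = {
    "g1_a": "A", "g1_b": "B", "g1_c": "C", "g1_d": "D",
    "g2_a": "A", "g2_b": "B", "g2_d": "D",
}
quiver = canonical_quiver(gene_orders, anchor_of)
variation = variable_edges(quiver, gene_orders)
```

In this toy case the edge A→B is shared by both genomes, while the remaining edges are genome-specific and mark the deletion.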
Availability and implementation
GitHub: https://github.com/AbeelLab/ptolemy
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

CNEFinder: finding conserved non-coding elements in genomes

Sat, 2018-09-08 02:00
Motivation
Conserved non-coding elements (CNEs) represent an enigmatic class of genomic elements which, despite being extremely conserved across evolution, do not encode proteins. Their functions are still largely unknown. Thus, there exists a need to systematically investigate their roles in genomes. To this end, identifying sets of CNEs in a wide range of organisms is an important first step. Currently, there are no tools published in the literature for systematically identifying CNEs in genomes.
Results
We fill this gap by presenting CNEFinder, a tool for identifying CNEs between two given DNA sequences with user-defined criteria. The results presented here show the tool’s ability to identify CNEs accurately and efficiently. CNEFinder is based on a k-mer technique for computing maximal exact matches. The tool thus does not require or compute whole-genome alignments or indexes, such as the suffix array or the Burrows-Wheeler Transform (BWT), which makes it flexible to use on a wide scale.
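To make the idea of k-mer-based maximal exact matching concrete, here is a minimal sketch. It is an illustration under simple assumptions, not CNEFinder's algorithm: every k-mer of the first sequence is stored in a hash table, shared k-mers in the second sequence act as seeds, and each seed is extended in both directions until a mismatch. The function name and the example sequences are hypothetical.

```python
from collections import defaultdict


def maximal_exact_matches(a, b, k):
    """Return a set of (start_a, start_b, length) for maximal exact matches >= k."""
    # Index every k-mer of sequence a by its start positions.
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)

    matches = set()
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            start_a, start_b = i, j
            # Extend the seed to the left while the characters agree.
            while start_a > 0 and start_b > 0 and a[start_a - 1] == b[start_b - 1]:
                start_a -= 1
                start_b -= 1
            end_a, end_b = i + k, j + k
            # Extend the seed to the right while the characters agree.
            while end_a < len(a) and end_b < len(b) and a[end_a] == b[end_b]:
                end_a += 1
                end_b += 1
            # Seeds inside the same match collapse to one entry in the set.
            matches.add((start_a, start_b, end_a - start_a))
    return matches


mems = maximal_exact_matches("TTACGTACGG", "CCACGTACAA", 4)
```

No suffix array or BWT is built here; the only index is the k-mer hash table, at the cost of re-extending overlapping seeds.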
Availability and implementation
Free software under the terms of the GNU GPL (https://github.com/lorrainea/CNEFinder).
Categories: Bioinformatics, Journals

A fast adaptive algorithm for computing whole-genome homology maps

Sat, 2018-09-08 02:00
Motivation
Whole-genome alignment is an important problem in genomics for comparing different species, mapping draft assemblies to reference genomes and identifying repeats. However, for large plant and animal genomes, this task remains compute and memory intensive. In addition, current practical methods lack any guarantee on the characteristics of output alignments, thus making them hard to tune for different application requirements.
Results
We introduce an approximate algorithm for computing local alignment boundaries between long DNA sequences. Given a minimum alignment length and an identity threshold, our algorithm computes the desired alignment boundaries and identity estimates using k-mer-based statistics, and maintains sufficient probabilistic guarantees on the output sensitivity. Further, to prioritize higher scoring alignment intervals, we develop a plane-sweep based filtering technique which is theoretically optimal and practically efficient. Implementation of these ideas resulted in a fast and accurate assembly-to-genome and genome-to-genome mapper. As a result, we were able to map an error-corrected whole-genome NA12878 human assembly to the hg38 human reference genome in about 1 min total execution time and <4 GB memory using eight CPU threads, achieving significant improvement in memory usage over competing methods. Recall accuracy of computed alignment boundaries was consistently found to be >97% on multiple datasets. Finally, we performed a sensitive self-alignment of the human genome to compute all duplications of length ≥1 Kbp and ≥90% identity. The reported output achieves good recall and covers twice as many bases as the current UCSC browser’s segmental duplication annotation.
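As a rough illustration of how an identity estimate can be derived from k-mer statistics, the sketch below computes the Jaccard similarity of two k-mer sets and converts it to an identity with the Mash distance formula, D = -(1/k) ln(2j / (1 + j)). This is an assumption-laden simplification: it uses full k-mer sets rather than sketches, it does not compute alignment boundaries, and it should not be read as the mapper's actual procedure. All function names are hypothetical.

```python
import math


def kmer_set(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k):
    """Jaccard similarity of the k-mer sets of a and b."""
    kmers_a, kmers_b = kmer_set(a, k), kmer_set(b, k)
    union = kmers_a | kmers_b
    return len(kmers_a & kmers_b) / len(union) if union else 0.0


def estimated_identity(a, b, k):
    """Estimate sequence identity as 1 - Mash distance."""
    j = jaccard(a, b, k)
    if j == 0.0:
        return 0.0  # no shared k-mers: the estimate bottoms out
    distance = -math.log(2 * j / (1 + j)) / k
    return max(0.0, 1.0 - distance)


identity = estimated_identity("ACGTTGCAAT", "ACGTTGCTAT", 3)
```

A candidate region would then be reported only if its estimate meets the user-supplied identity threshold.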
Availability and implementation
https://github.com/marbl/MashMap
Categories: Bioinformatics, Journals

Recognition of CRISPR/Cas9 off-target sites through ensemble learning of uneven mismatch distributions

Sat, 2018-09-08 02:00
Motivation
CRISPR/Cas9 is driving a broad range of innovative applications from basic biology to biotechnology and medicine. One of its current issues is off-target editing, which must be resolved and, ideally, avoided completely when the system is used.
Results
We developed an ensemble learning method to detect the off-target sites of a single guide RNA (sgRNA) from its thousands of genome-wide candidates. Nucleotide mismatches between on-target and off-target sites have been studied recently. We confirm that there exists strong mismatch enrichment and preferences at the 5′-end close regions of the off-target sequences. Compared with the on-target sites, sequences of no-editing sites can also be characterized by GC composition changes and position-specific mismatch binary features. Under this novel space of features, an ensemble strategy was applied to train a prediction model. The model achieved a mean area under the receiver operating characteristic curve of 0.99 and a mean area under the precision-recall curve of 0.45 in cross-validation on large datasets, outperforming state-of-the-art methods in various test scenarios. Our predicted off-target sites also correspond very well to those detected by high-throughput sequencing techniques. In particular, two case studies on selecting sgRNAs to cure hearing loss and retinal degeneration partly demonstrate the effectiveness of our method.
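The two feature families named above, position-specific binary mismatch features and GC composition change, can be sketched as follows. This is a hypothetical encoding for illustration only; the exact feature definitions used by the authors are not given in the abstract, and the function names and sequences are made up.

```python
def mismatch_vector(on_target, candidate):
    """Binary vector with 1 at every position where the two sites differ."""
    if len(on_target) != len(candidate):
        raise ValueError("sites must have the same length")
    return [int(x != y) for x, y in zip(on_target.upper(), candidate.upper())]


def gc_fraction(seq):
    """Fraction of G and C bases in seq."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def site_features(on_target, candidate):
    """Concatenate mismatch bits with the GC change relative to the on-target site."""
    gc_change = gc_fraction(candidate) - gc_fraction(on_target)
    return mismatch_vector(on_target, candidate) + [gc_change]


features = site_features("GACGT", "GTCAA")
```

Feature vectors of this kind could then be passed to any ensemble classifier; which learners the authors combined is not stated here.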
Availability and implementation
The Python and MATLAB versions of the source code for detecting off-target sites of a given sgRNA, together with the supplementary files, are freely available at https://github.com/penn-hui/OfftargetPredict.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

DREAM-Yara: an exact read mapper for very large databases with short update time

Sat, 2018-09-08 02:00
Motivation
Mapping-based approaches have become limited in their application to very large sets of references since computing an FM-index for very large databases (e.g. >10 GB) has become a bottleneck. This affects many analyses that need such an index as an essential step for approximate matching of NGS reads to reference databases. For instance, in typical metagenomics analysis, the size of the reference sequences has become prohibitive for computing a single full-text index on standard machines. Even on large-memory machines, computing such an index takes about 1 day of computing time. As a result, updates of indices are rarely performed. Hence, it is desirable to create an alternative way of indexing while preserving fast search times.
Results
To solve the index construction and update problem, we propose the DREAM (Dynamic seaRchablE pArallel coMpressed index) framework and provide an implementation. The first main contribution is an approximate search distributor based on a novel use of Bloom filters. We combine several Bloom filters to form an interleaved Bloom filter and use this new data structure to quickly exclude reads for parts of the databases where they cannot match. This allows us to keep the databases in several indices which can be easily rebuilt if parts are updated while maintaining a fast search time. The second main contribution is an implementation of DREAM-Yara, a distributed version of a fully sensitive read mapper under the DREAM framework.
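The interleaved Bloom filter idea can be sketched in a few lines. The version below is a simplified illustration, not the DREAM-Yara data structure: each filter slot holds one bit per database bin (stored here as a Python integer), so a single lookup answers "which bins may contain this k-mer" for all bins at once. The hash construction, class name, and hit threshold are assumptions made for the example.

```python
import hashlib


class InterleavedBloomFilter:
    """One Bloom filter per bin, interleaved so each slot holds a bit per bin."""

    def __init__(self, num_bins, num_slots, num_hashes=3):
        self.num_bins = num_bins
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        self.rows = [0] * num_slots  # rows[slot] is a bitmask over bins

    def _slots(self, kmer):
        # Derive several slot positions by hashing the k-mer with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_slots

    def add(self, kmer, bin_id):
        for slot in self._slots(kmer):
            self.rows[slot] |= 1 << bin_id

    def bins_for(self, kmer):
        """Bitmask of bins that may contain kmer (false positives are possible)."""
        mask = (1 << self.num_bins) - 1
        for slot in self._slots(kmer):
            mask &= self.rows[slot]
        return mask


def candidate_bins(ibf, read, k, min_hits):
    """Bins in which at least min_hits k-mers of the read may occur."""
    counts = [0] * ibf.num_bins
    for i in range(len(read) - k + 1):
        mask = ibf.bins_for(read[i:i + k])
        for bin_id in range(ibf.num_bins):
            if (mask >> bin_id) & 1:
                counts[bin_id] += 1
    return [bin_id for bin_id, hits in enumerate(counts) if hits >= min_hits]
```

A read would then be mapped only against the indices of its candidate bins, and updating one bin requires rebuilding only that bin's index and filter bits.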
Availability and implementation
https://gitlab.com/pirovc/dream_yara/
Categories: Bioinformatics, Journals

Learning structural motif representations for efficient protein structure search

Sat, 2018-09-08 02:00
Motivation
Given a protein of unknown function, fast identification of similar protein structures from the Protein Data Bank (PDB) is a critical step for inferring its biological function. Such structural neighbors can provide evolutionary insights into protein conformation, interfaces and binding sites that are not detectable from sequence similarity. However, the computational cost of performing pairwise structural alignment against all structures in PDB is prohibitively expensive. Alignment-free approaches have been introduced to enable fast but coarse comparisons by representing each protein as a vector of structure features or fingerprints and only computing similarity between vectors. As a notable example, FragBag represents each protein by a ‘bag of fragments’, which is a vector of frequencies of contiguous short backbone fragments from a predetermined library. Despite being efficient, the accuracy of FragBag is unsatisfactory because its backbone fragment library may not be optimally constructed and long-range interacting patterns are omitted.
Results
Here we present a new approach to learning effective structural motif representations using deep learning. We develop DeepFold, a deep convolutional neural network model to extract structural motif features of a protein structure. We demonstrate that DeepFold substantially outperforms FragBag on protein structural search on a non-redundant protein structure database and a set of newly released structures. Remarkably, DeepFold not only extracts meaningful backbone segments but also finds important long-range interacting motifs for structural comparison. We expect that DeepFold will provide new insights into the evolution and hierarchical organization of protein structural motifs.
Availability and implementation
https://github.com/largelymfs/DeepFold
Categories: Bioinformatics, Journals

Insights on the alteration of functionality of a tyrosine kinase 2 variant: a molecular dynamics study

Sat, 2018-09-08 02:00
Motivation
The tyrosine kinase 2 protein (Tyk2), encoded by the TYK2 gene, has a crucial role in signal transduction and the pathogenesis of many diseases. A single nucleotide polymorphism of the TYK2 gene, SNP rs34536443, is of major importance, since it has been shown to confer protection against various, mainly autoimmune, diseases. This polymorphism results in a Pro to Ala change at amino acid position 1104 of the encoded Tyk2 protein that affects its enzymatic activity. However, the details of the underlying mechanism are unknown. To address this issue, in this study, we used molecular dynamics simulations on the kinase domains of both wild type and variant Tyk2 protein.
Results
Our MD results provided information, at the atomic level, on the consequences of the Pro1104 to Ala substitution on the structure and dynamics of the kinase domain of Tyk2. They suggested reduced enzymatic activity of the resulting protein variant due to stabilization of inactive conformations, thus adding to the knowledge needed to elucidate the protection mechanism against autoimmune diseases associated with this point mutation.
Categories: Bioinformatics, Journals

Topology independent structural matching discovers novel templates for protein interfaces

Sat, 2018-09-08 02:00
Motivation
Protein–protein interactions (PPI) are essential for the function of the cellular machinery. The rapid growth of protein–protein complexes with known 3D structures offers a unique opportunity to study PPI to gain crucial insights into protein function and the causes of many diseases. In particular, it would be extremely useful to compare interaction surfaces of monomers, as this would enable the pinpointing of potential interaction surfaces based solely on the monomer structure, without the need to predict the complete complex structure. While there are many structural alignment algorithms for individual proteins, very few have been developed for protein interfaces, and none that can align only the interface residues to other interfaces or surfaces of interacting monomer subunits in a topology independent (non-sequential) manner.
Results
We present InterComp, a method for topology and sequence-order independent structural comparisons. The method is general and can be applied to various structural comparison applications. By representing residues as independent points in space rather than as a sequence of residues, InterComp can be applied to a wide range of problems including interface–surface comparisons and interface–interface comparisons. We demonstrate a use-case by applying InterComp to find similar protein interfaces on the surface of proteins. We show that InterComp pinpoints the correct interface for almost half of the targets (283 of 586) when considering the top 10 hits, and for 24% when considering only the top hit, even when no templates can be found with regular sequence-order dependent structural alignment methods.
Availability and implementation
The source code and the datasets are available at: http://wallnerlab.org/InterComp.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Analysis of single amino acid variations in singlet hot spots of protein–protein interfaces

Sat, 2018-09-08 02:00
Motivation
Single amino acid variations (SAVs) in protein–protein interaction (PPI) sites play critical roles in diseases. PPI sites (interfaces) have a small subset of residues called hot spots that contribute significantly to the binding energy, and they may form clusters called hot regions. Singlet hot spots are the single amino acid hot spots outside of the hot regions. The distribution of SAVs on the interface residues may be related to their disease association.
Results
We performed statistical and structural analyses of SAVs with literature-curated experimental thermodynamics data, and demonstrated that SAVs which destabilize PPIs are more likely to be found in singlet hot spots than in hot regions and energetically less important interface residues. In contrast, non-hot spot residues are significantly enriched in neutral SAVs, which do not affect PPI stability. Surprisingly, we observed that singlet hot spots tend to be enriched in disease-causing SAVs, while benign SAVs significantly occur in non-hot spot residues. Our work demonstrates that SAVs in singlet hot spot residues have a significant effect on protein stability and function.
Availability and implementation
The dataset used in this paper is available as Supplementary Material. The data can be found at http://prism.ccbb.ku.edu.tr/data/sav/ as well.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Predicting protein–protein interactions through sequence-based deep learning

Sat, 2018-09-08 02:00
Motivation
High-throughput experimental techniques have produced a large amount of protein–protein interaction (PPI) data, but their coverage is still low and the PPI data is also very noisy. Computational prediction of PPIs can be used to discover new PPIs and identify errors in the experimental PPI data.
Results
We present a novel deep learning framework, DPPI, to model and predict PPIs from sequence information alone. Our model efficiently applies a deep, Siamese-like convolutional neural network combined with random projection and data augmentation to predict PPIs, leveraging existing high-quality experimental PPI data and evolutionary information of a protein pair under prediction. Our experimental results show that DPPI outperforms the state-of-the-art methods on several benchmarks in terms of area under precision-recall curve (auPR), and is computationally more efficient. We also show that DPPI is able to predict homodimeric interactions where other methods fail to work accurately, and demonstrate the effectiveness of DPPI in specific applications such as predicting cytokine-receptor binding affinities.
Availability and implementation
https://github.com/hashemifar/DPPI/
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

iCFN: an efficient exact algorithm for multistate protein design

Sat, 2018-09-08 02:00
Motivation
Multistate protein design addresses real-world challenges, such as multi-specificity design and backbone flexibility, by considering both positive and negative protein states with an ensemble of substates for each. It also presents an enormous challenge to exact algorithms that guarantee the optimal solutions and enable a direct test of mechanistic hypotheses behind models. However, efficient exact algorithms are lacking for multistate protein design.
Results
We have developed an efficient exact algorithm called interconnected cost function networks (iCFN) for multistate protein design. Its generic formulation allows for a wide array of applications such as stability, affinity and specificity designs while addressing concerns such as global flexibility of protein backbones. iCFN treats each substate design as a weighted constraint satisfaction problem (WCSP) modeled through a CFN; and it solves the coupled WCSPs using novel bounds and a depth-first branch-and-bound search over a tree structure of sequences, substates, and conformations. When iCFN is applied to specificity design of a T-cell receptor, a problem of unprecedented size to exact methods, it drastically reduces search space and running time to make the problem tractable. Moreover, iCFN generates experimentally-agreeing receptor designs with improved accuracy compared with state-of-the-art methods, highlights the importance of modeling backbone flexibility in protein design, and reveals molecular mechanisms underlying binding specificity.
Availability and implementation
https://shen-lab.github.io/software/iCFN
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

DeepDTA: deep drug–target binding affinity prediction

Sat, 2018-09-08 02:00
Motivation
The identification of novel drug–target (DT) interactions is a substantial part of the drug discovery process. Most of the computational methods that have been proposed to predict DT interactions have focused on binary classification, where the goal is to determine whether a DT pair interacts or not. However, protein–ligand interactions assume a continuum of binding strength values, also called binding affinity, and predicting this value still remains a challenge. The increase in the affinity data available in DT knowledge-bases allows the use of advanced learning techniques such as deep learning architectures in the prediction of binding affinities. In this study, we propose a deep-learning based model that uses only sequence information of both targets and drugs to predict DT interaction binding affinities. The few studies that focus on DT binding affinity prediction use either 3D structures of protein–ligand complexes or 2D features of compounds. One novel approach used in this work is the modeling of protein sequences and compound 1D representations with convolutional neural networks (CNNs).
Results
The results show that the proposed deep learning based model, which uses the 1D representations of targets and drugs, is an effective approach for drug–target binding affinity prediction. The model in which high-level representations of a drug and a target are constructed via CNNs achieved the best Concordance Index (CI) performance in one of our larger benchmark datasets, outperforming the KronRLS algorithm and SimBoost, a state-of-the-art method for DT binding affinity prediction.
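For readers unfamiliar with the Concordance Index, the following minimal sketch shows one common way to compute it: over all pairs with different true affinities, count how often the predictions are ordered the same way, with tied predictions counting as one half. The quadratic loop and the function name are illustrative choices, not code from the DeepDTA repository.

```python
def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order."""
    concordant = 0.0
    comparable = 0
    for i in range(len(y_true)):
        for j in range(len(y_true)):
            if y_true[i] > y_true[j]:  # only pairs with distinct true affinities
                comparable += 1
                if y_pred[i] > y_pred[j]:
                    concordant += 1.0
                elif y_pred[i] == y_pred[j]:
                    concordant += 0.5  # tied predictions get half credit
    return concordant / comparable if comparable else float("nan")


ci = concordance_index([1.0, 2.0, 3.0], [0.1, 0.3, 0.2])
```

A CI of 1.0 means every pair is ranked correctly, while 0.5 corresponds to random ordering.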
Availability and implementation
https://github.com/hkmztrk/DeepDTA
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Protein pocket detection via convex hull surface evolution and associated Reeb graph

Sat, 2018-09-08 02:00
Motivation
Protein pocket information is invaluable for drug target identification, agonist design, virtual screening and receptor-ligand binding analysis. A recent study indicates that about half of holoproteins can simultaneously bind multiple interacting ligands in a large pocket containing structured sub-pockets. Although this hierarchical pocket and sub-pocket structure has a significant impact on multi-ligand synergistic interactions in the protein binding site, there is no method available for this analysis. This work introduces a computational tool based on differential geometry, algebraic topology and physics-based simulation to address this pressing issue.
Results
We propose to detect protein pockets by evolving the convex hull surface inwards until it touches the protein surface everywhere. The governing partial differential equations (PDEs) include the mean curvature flow combined with the eikonal equation commonly used in the fast marching algorithm in the Eulerian representation. The surface evolution induced Morse function and Reeb graph are utilized to characterize the hierarchical pocket and sub-pocket structure in controllable detail. The proposed method is validated on PDBbind refined sets of 4414 protein-ligand complexes. Extensive numerical tests indicate that the proposed method not only provides a unique description of pocket-sub-pocket relations, but also offers efficient estimations of pocket surface area, pocket volume and pocket depth.
Availability and implementation
Source code available at https://github.com/rdzhao/ProteinPocketDetection. Webserver available at http://weilab.math.msu.edu/PPD/.
Categories: Bioinformatics, Journals

MDPbiome: microbiome engineering through prescriptive perturbations

Sat, 2018-09-08 02:00
Motivation
Recent microbiome dynamics studies highlight the current inability to predict the effects of external perturbations on complex microbial populations. To do so would be particularly advantageous in fields such as medicine, bioremediation or industrial scenarios.
Results
MDPbiome statistically models longitudinal metagenomics samples undergoing perturbations as a Markov Decision Process (MDP). Given a starting microbial composition, our MDPbiome system suggests the sequence of external perturbation(s) that will engineer that microbiome to a goal state, for example, a healthier or more performant composition. It also estimates intermediate microbiome states along the path, thus making it possible to avoid particularly undesirable/unhealthy states. We demonstrate MDPbiome performance over three real and distinct datasets, proving its flexibility, and the reliability and universality of its output ‘optimal perturbation policy’. For example, an MDP created using a vaginal microbiome time series, with a goal of recovering from bacterial vaginosis, suggested avoidance of perturbations such as lubricants or sex toys; while another MDP provided a quantitative explanation for why salmonella vaccine accelerates gut microbiome maturation in chicks. This novel analytical approach has clear applications in medicine, where it could suggest low-impact clinical interventions that will lead to achievement or maintenance of a healthy microbial population, or alternately, the sequence of interventions necessary to avoid strongly negative microbiome states.
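How an "optimal perturbation policy" falls out of a Markov Decision Process can be shown with a small value-iteration sketch. Everything in it is hypothetical: the two states, the two perturbations, the transition probabilities, and the rewards are invented for illustration, and the abstract does not say which solver MDPbiome uses.

```python
def q_value(state, action, values, transition, reward, gamma):
    """Expected discounted return of taking action in state."""
    return sum(
        prob * (reward[nxt] + gamma * values[nxt])
        for nxt, prob in transition[state][action].items()
    )


def value_iteration(states, actions, transition, reward, gamma=0.9, tol=1e-9):
    """Return (state values, greedy policy) for a finite MDP."""
    values = {s: 0.0 for s in states}
    while True:
        updated = {
            s: max(q_value(s, a, values, transition, reward, gamma) for a in actions)
            for s in states
        }
        delta = max(abs(updated[s] - values[s]) for s in states)
        values = updated
        if delta < tol:
            break
    policy = {
        s: max(actions, key=lambda a: q_value(s, a, values, transition, reward, gamma))
        for s in states
    }
    return values, policy


# Hypothetical toy model: probabilities of the next microbiome state per perturbation.
states = ["dysbiotic", "healthy"]
actions = ["treat", "none"]
transition = {
    "dysbiotic": {"treat": {"healthy": 0.8, "dysbiotic": 0.2},
                  "none": {"healthy": 0.1, "dysbiotic": 0.9}},
    "healthy": {"treat": {"healthy": 0.6, "dysbiotic": 0.4},
                "none": {"healthy": 0.9, "dysbiotic": 0.1}},
}
reward = {"healthy": 1.0, "dysbiotic": 0.0}
values, policy = value_iteration(states, actions, transition, reward)
```

In a real analysis the transition probabilities would be estimated from the longitudinal samples rather than written by hand.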
Availability and implementation
Code (https://github.com/beatrizgj/MDPbiome) and result files (https://tomdelarosa.shinyapps.io/MDPbiome/) are available online.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

piMGM: incorporating multi-source priors in mixed graphical models for learning disease networks

Sat, 2018-09-08 02:00
Motivation
Learning probabilistic graphs over mixed data is an important way to combine gene expression and clinical disease data. Leveraging the existing, yet imperfect, information in pathway databases for mixed graphical model (MGM) learning is an understudied problem with tremendous potential applications in systems medicine, the problems of which often involve high-dimensional data.
Results
We present a new method, piMGM, which can learn with accuracy the structure of probabilistic graphs over mixed data by appropriately incorporating priors from multiple experts with different degrees of reliability. We show that piMGM accurately scores the reliability of prior information from a given expert even at low sample sizes. The reliability scores can be used to determine active pathways in healthy and disease samples. We tested piMGM on both simulated and real data from TCGA, and we found that its performance is not affected by unreliable priors. We demonstrate the applicability of piMGM by successfully using prior information to identify pathway components that are important in breast cancer and improve cancer subtype classification.
Availability and implementation
http://www.benoslab.pitt.edu/manatakisECCB2018.html
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Ontology-based validation and identification of regulatory phenotypes

Sat, 2018-09-08 02:00
Motivation
Function annotations of gene products, and phenotype annotations of genotypes, provide valuable information about molecular mechanisms that can be utilized by computational methods to identify functional and phenotypic relatedness, improve our understanding of disease and pathobiology, and lead to discovery of drug targets. Identifying functions and phenotypes commonly requires experiments which are time-consuming and expensive to carry out; creating the annotations additionally requires a curator to make an assertion based on reported evidence. Support to validate the mutual consistency of functional and phenotype annotations as well as a computational method to predict phenotypes from function annotations, would greatly improve the utility of function annotations.
Results
We developed a novel ontology-based method to validate the mutual consistency of function and phenotype annotations. We apply our method to mouse and human annotations, and identify several inconsistencies that can be resolved to improve overall annotation quality. We also apply our method to the rule-based prediction of regulatory phenotypes from functions and demonstrate that we can predict these phenotypes with Fmax of up to 0.647.
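The Fmax figure quoted above is the maximum F-measure over all score thresholds. The sketch below shows a simplified, pooled variant of that computation; the evaluation protocol actually used by the authors (for example, per-gene averaging) is not described in the abstract, so the function name, the pooling, and the example data are assumptions.

```python
def f_max(scores, truth, thresholds):
    """Largest F1 obtained by thresholding scores against the true set."""
    best = 0.0
    if not truth:
        return best
    for threshold in thresholds:
        predicted = {item for item, score in scores.items() if score >= threshold}
        if not predicted:
            continue  # precision is undefined when nothing is predicted
        true_positives = len(predicted & truth)
        precision = true_positives / len(predicted)
        recall = true_positives / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Hypothetical predicted phenotype scores and the annotated (true) phenotypes.
scores = {"a": 0.9, "b": 0.8, "c": 0.4, "d": 0.2}
truth = {"a", "c"}
best_f = f_max(scores, truth, [0.1, 0.3, 0.5, 0.85])
```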
Availability and implementation
https://github.com/bio-ontology-research-group/phenogocon
Categories: Bioinformatics, Journals

Quantitative trait loci identification for brain endophenotypes via new additive model with random networks

Sat, 2018-09-08 02:00
Motivation
The identification of quantitative trait loci (QTL) is critical to the study of causal relationships between genetic variations and disease abnormalities. We focus on identifying the QTLs associated with brain endophenotypes in an imaging genomics study of Alzheimer’s Disease (AD). Existing research mainly depicts the association between single nucleotide polymorphisms (SNPs) and brain endophenotypes via linear methods, which may introduce high bias due to the simplicity of the models. Since the influence of QTLs on brain endophenotypes is quite complex, it is desirable to design appropriate non-linear models to investigate the associations between genotypes and endophenotypes.
Results
In this paper, we propose a new additive model to learn the non-linear associations between SNPs and brain endophenotypes in Alzheimer’s disease. Our model can be flexibly employed to explain the non-linear influence of QTLs, and is thus more adaptive to the complex distribution of high-throughput biological data. Meanwhile, as an important computational learning theory contribution, we provide a generalization error analysis for the proposed approach. Unlike most previous theoretical analyses, which assume independent and identically distributed samples, our error bound is based on m-dependent observations, which is more appropriate for high-throughput and noisy biological data. Experiments on data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort demonstrate the promising performance of our approach for identifying biologically meaningful SNPs.
Availability and implementation
An executable is available at https://github.com/littleq1991/additive_FNNRW.
Categories: Bioinformatics, Journals

Liquid-chromatography retention order prediction for metabolite identification

Sat, 2018-09-08 02:00
Motivation
Liquid Chromatography (LC) followed by tandem Mass Spectrometry (MS/MS) is one of the predominant methods for metabolite identification. In recent years, machine learning has started to transform the analysis of tandem mass spectra and the identification of small molecules. In contrast, LC data is rarely used to improve metabolite identification, despite numerous published methods for retention time prediction using machine learning.
Results
We present a machine learning method for predicting the retention order of molecules; that is, the order in which molecules elute from the LC column. Our method has important advantages over previous approaches: we show that retention order is much better conserved between instruments than retention time. Consequently, our method can be trained using retention time measurements from different LC systems and configurations without tedious pre-processing, significantly increasing the amount of available training data. Our experiments demonstrate that retention order prediction is an effective way to learn retention behaviour of molecules from heterogeneous retention time data. Finally, we demonstrate how retention order prediction and MS/MS-based scores can be combined for more accurate metabolite identifications when analyzing a complete LC-MS/MS run.
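Why retention order pools more easily across instruments than retention time can be seen in a small sketch. It converts the retention times measured on each LC system into pairwise "elutes before" relations, which are unchanged by any monotone shift or stretch of the time axis, and scores a predictor by the fraction of pairs it orders correctly. The molecule names, times, and function names are hypothetical, and the learning method itself is not reproduced here.

```python
from itertools import combinations


def order_pairs(retention_times):
    """Set of (first, second) pairs meaning first elutes before second."""
    pairs = set()
    for a, b in combinations(retention_times, 2):
        if retention_times[a] < retention_times[b]:
            pairs.add((a, b))
        elif retention_times[b] < retention_times[a]:
            pairs.add((b, a))
    return pairs


def pairwise_accuracy(predicted_scores, pairs):
    """Fraction of pairs whose predicted scores respect the elution order."""
    correct = sum(predicted_scores[a] < predicted_scores[b] for a, b in pairs)
    return correct / len(pairs)


# Hypothetical measurements of the same molecules on two LC systems (minutes).
system_a = {"m1": 1.2, "m2": 3.4, "m3": 5.0}
system_b = {"m1": 2.0, "m2": 7.5, "m3": 9.1}
training_pairs = order_pairs(system_a) | order_pairs(system_b)
accuracy = pairwise_accuracy({"m1": 0.1, "m2": 0.9, "m3": 0.5}, training_pairs)
```

Although the absolute times differ between the two systems, both yield the same three ordered pairs, so they can be merged without alignment.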
Availability and implementation
Implementation of the method is available at https://version.aalto.fi/gitlab/bache1/retention_order_prediction.git.
Categories: Bioinformatics, Journals

fastp: an ultra-fast all-in-one FASTQ preprocessor

Sat, 2018-09-08 02:00
Motivation
Quality control and preprocessing of FASTQ files are essential to providing clean data for downstream analysis. Traditionally, a different tool is used for each operation, such as quality control, adapter trimming and quality filtering. These tools are often insufficiently fast as most are developed using high-level programming languages (e.g. Python and Java) and provide limited multi-threading support. Reading and loading data multiple times also renders preprocessing slow and I/O inefficient.
Results
We developed fastp as an ultra-fast FASTQ preprocessor with useful quality control and data-filtering features. It can perform quality control, adapter trimming, quality filtering, per-read quality pruning and many other operations with a single scan of the FASTQ data. This tool is developed in C++ and has multi-threading support. Based on our evaluation, fastp is 2–5 times faster than other FASTQ preprocessing tools such as Trimmomatic or Cutadapt despite performing far more operations than similar tools.
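The benefit of a single scan can be illustrated with a short sketch that trims, filters, and collects statistics in one pass over the records. It is written in Python for readability and is not how fastp works internally: the exact-match adapter search, the Phred+33 assumption, the thresholds, and all names are simplifications chosen for the example.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def preprocess(handle, adapter, min_mean_quality, min_length):
    """Trim the adapter, filter reads, and gather statistics in a single scan."""
    stats = {"total": 0, "trimmed": 0, "passed": 0}
    kept = []
    for header, sequence, quality in read_fastq(handle):
        stats["total"] += 1
        cut = sequence.find(adapter)  # exact adapter match only (simplification)
        if cut != -1:
            sequence, quality = sequence[:cut], quality[:cut]
            stats["trimmed"] += 1
        if len(sequence) < max(min_length, 1):
            continue
        # Phred+33 encoding is assumed for the quality string.
        mean_quality = sum(ord(ch) - 33 for ch in quality) / len(quality)
        if mean_quality < min_mean_quality:
            continue
        stats["passed"] += 1
        kept.append((header, sequence, quality))
    return kept, stats


example = io.StringIO(
    "@r1\nACGTACGTAGATCGG\n+\nIIIIIIIIIIIIIII\n"
    "@r2\nACGTACGT\n+\n!!!!!!!!\n"
    "@r3\nACG\n+\nIII\n"
)
kept, stats = preprocess(example, adapter="AGATCGG", min_mean_quality=20, min_length=5)
```

Because every operation sees each record once, the file is read from disk a single time regardless of how many steps are enabled.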
Availability and implementation
The open-source code and corresponding instructions are available at https://github.com/OpenGene/fastp.
Categories: Bioinformatics, Journals

DeepDiff: DEEP-learning for predicting DIFFerential gene expression from histone modifications

Sat, 2018-09-08 02:00
Motivation
Computational methods that predict differential gene expression from histone modification signals are highly desirable for understanding how histone modifications control the functional heterogeneity of cells through influencing differential gene regulation. Recent studies either failed to capture combinatorial effects on differential prediction or focused primarily on cell type-specific analysis. In this paper we develop a novel attention-based deep learning architecture, DeepDiff, that provides a unified and end-to-end solution to model and to interpret how dependencies among histone modifications control the differential patterns of gene regulation. DeepDiff uses a hierarchy of multiple Long Short-Term Memory (LSTM) modules to encode the spatial structure of input signals and to automatically model how various histone modifications cooperate. We introduce and train two levels of attention jointly with the target prediction, enabling DeepDiff to attend differentially to relevant modifications and to locate important genome positions for each modification. Additionally, DeepDiff introduces a novel deep-learning based multi-task formulation to use the cell-type-specific gene expression predictions as auxiliary tasks, encouraging richer feature embeddings in our primary task of differential expression prediction.
Results
Using data from Roadmap Epigenomics Project (REMC) for ten different pairs of cell types, we show that DeepDiff significantly outperforms the state-of-the-art baselines for differential gene expression prediction. The learned attention weights are validated by observations from previous studies about how epigenetic mechanisms connect to differential gene expression.
Availability and implementation
Code and results are available at deepchrome.org.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals