Bioinformatics

In vitro versus in vivo compositional landscapes of histone sequence preferences in eucaryotic genomes

Mon, 2018-09-10 02:00
Abstract
Motivation
Although the nucleosome occupancy along a genome can be in part predicted by in vitro experiments, it has been recently observed that the chromatin organization presents important differences in vitro with respect to in vivo. Such differences mainly regard the hierarchical and regular structures of the nucleosome fiber, whose existence has long been assumed, and in part also observed in vitro, but that does not apparently occur in vivo. It is also well known that the DNA sequence has a role in determining the nucleosome occupancy. Therefore, an important issue is to understand if, and to what extent, the structural differences in the chromatin organization between in vitro and in vivo have a counterpart in terms of the underlying genomic sequences.
Results
We present the first quantitative comparison between the in vitro and in vivo nucleosome maps of two model organisms (S. cerevisiae and C. elegans). The comparison is based on the construction of weighted k-mer dictionaries. Our findings show that there is a good level of sequence conservation between in vitro and in vivo in both organisms, in contrast to the above-mentioned important differences in chromatin structural organization. Moreover, our results provide evidence that the two organisms predispose themselves differently for nucleosome occupancy, in terms of sequence composition, both in vitro and in vivo. This leads to the conclusion that, although the notion of a genome encoding for its own nucleosome occupancy is general, the intrinsic histone k-mer sequence preferences tend to be species-specific.
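As a rough illustration of what a weighted k-mer dictionary can look like, the sketch below maps each k-mer to a normalised weight. The abstract does not say how the weights are defined, so the weighting used here (summing a per-sequence occupancy score over every occurrence of the k-mer) and the function name `weighted_kmer_dictionary` are assumptions made for illustration only.

```python
from collections import defaultdict


def weighted_kmer_dictionary(sequences, scores, k):
    """Map each k-mer to a weight (hypothetical weighting scheme).

    Every occurrence of a k-mer adds the score of the sequence it occurs in;
    the weights are then normalised to sum to 1.
    """
    weights = defaultdict(float)
    for seq, score in zip(sequences, scores):
        for i in range(len(seq) - k + 1):
            weights[seq[i:i + k]] += score
    total = sum(weights.values())
    if total > 0:
        return {kmer: w / total for kmer, w in weights.items()}
    return dict(weights)


# Two toy sequences with made-up occupancy scores.
example = weighted_kmer_dictionary(["ACGT", "CGTA"], [1.0, 3.0], k=3)
```

Two such dictionaries, one built from an in vitro map and one from an in vivo map, could then be compared k-mer by k-mer.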
Availability and implementation
The files containing the dictionaries and the main results of the analysis are available at http://math.unipa.it/rombo/material.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Branch-recombinant Gaussian processes for analysis of perturbations in biological time series

Sat, 2018-09-08 02:00
Motivation
A common class of behaviour encountered in the biological sciences involves branching and recombination. During branching, a statistical process bifurcates resulting in two or more potentially correlated processes that may undergo further branching; the contrary is true during recombination, where two or more statistical processes converge. A key objective is to identify the time of this bifurcation (branch or recombination time) from time series measurements, e.g. by comparing a control time series with perturbed time series. Gaussian processes (GPs) represent an ideal framework for such analysis, allowing for nonlinear regression that includes a rigorous treatment of uncertainty. Currently, however, GP models only exist for two-branch systems. Here, we highlight how arbitrarily complex branching processes can be built using the correct composition of covariance functions within a GP framework, thus outlining a general framework for the treatment of branching and recombination in the form of branch-recombinant Gaussian processes (B-RGPs).
Results
We first benchmark the performance of B-RGPs against a variety of existing regression approaches, and demonstrate robustness to model misspecification. B-RGPs are then used to investigate the branching patterns of Arabidopsis thaliana gene expression following inoculation with the hemibiotrophic bacterium Pseudomonas syringae DC3000 and a disarmed mutant strain, hrpA. By grouping genes according to the number of branches, we could naturally separate out genes involved in basal immune response from those subverted by the virulent strain, and show enrichment for targets of pathogen protein effectors. Finally, we identify two early branching genes, WRKY11 and WRKY17, and show that genes that branched at similar times to WRKY11/17 were enriched for W-box binding motifs and overrepresented for genes differentially expressed in WRKY11/17 knockouts, suggesting that branch time could be used for identifying direct and indirect binding targets of key transcription factors.
Availability and implementation
https://github.com/cap76/BranchingGPs
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Author Index

Sat, 2018-09-08 02:00
Categories: Bioinformatics, Journals

ECCB 2018: The 17th European Conference on Computational Biology

Sat, 2018-09-08 02:00
This volume of Bioinformatics includes the proceedings papers of the 17th European Conference on Computational Biology (ECCB), an annual international conference for research in computational biology and bioinformatics.
Categories: Bioinformatics, Journals

ECCB 2018 Organization

Sat, 2018-09-08 02:00
Categories: Bioinformatics, Journals

Conditional generative adversarial network for gene expression inference

Sat, 2018-09-08 02:00
Motivation
The rapid progress of gene expression profiling has benefited recent biological studies in various fields, where gene expression data characterize cell conditions and regulatory mechanisms under different experimental circumstances. Despite the widespread application of gene expression profiling and advances in high-throughput technologies, profiling at the genome-wide level is still expensive and difficult. Previous studies found that the expression patterns of different genes are highly correlated, such that a small subset of genes can be informative enough to approximately describe the entire transcriptome. In the Library of Integrated Network-based Cell-Signature program, a set of ∼1000 landmark genes has been identified that contains ∼80% of the information of the whole genome and can be used to predict the expression of the remaining genes. For a cost-effective profiling strategy, traditional methods measure the profiles of landmark genes and then infer the expression of other target genes via linear models. However, linear models do not have the capacity to capture the non-linear associations in gene regulatory networks.
Results
As flexible models with high representational power, deep learning models provide an alternative for interpreting the complex relations among genes. In this paper, we propose a deep learning architecture for the inference of target gene expression profiles. We construct a novel conditional generative adversarial network by incorporating both the adversarial and ℓ1-norm loss terms in our model. Unlike the smooth and blurry predictions produced by a mean squared error objective, the coupled adversarial and ℓ1-norm loss function leads to more accurate and sharper predictions. We validate our method under two different settings and find consistent and significant improvements over all compared methods.
Categories: Bioinformatics, Journals

Prioritising candidate genes causing QTL using hierarchical orthologous groups

Sat, 2018-09-08 02:00
Motivation
A key goal in plant biotechnology applications is the identification of genes associated with particular phenotypic traits (for example: yield, fruit size, root length). Quantitative Trait Loci (QTL) studies identify genomic regions associated with a trait of interest. However, to infer potential causal genes in these regions, each of which can contain hundreds of genes, these data are usually intersected with prior functional knowledge of the genes. This process is, however, laborious, particularly if the experiment is performed in a non-model species, and the statistical significance of the inferred candidates is typically unknown.
Results
This paper introduces QTLSearch, a method and software tool to search for candidate causal genes in QTL studies by combining Gene Ontology annotations across many species, leveraging hierarchical orthologous groups. The usefulness of this approach is demonstrated by re-analysing two metabolic QTL studies: one in Arabidopsis thaliana, the other in Oryza sativa subsp. indica. Even after controlling for statistical significance, QTLSearch inferred potential causal genes for more QTL than BLAST-based functional propagation against UniProtKB/Swiss-Prot, and for more QTL than in the original studies.
Availability and implementation
QTLSearch is distributed under the LGPLv3 license. It is available to install from the Python Package Index (as qtlsearch), with the source available from https://bitbucket.org/alex-warwickvesztrocy/qtlsearch.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

IRSOM, a reliable identifier of ncRNAs based on supervised self-organizing maps with rejection

Sat, 2018-09-08 02:00
Motivation
Non-coding RNAs (ncRNAs) play important roles in many biological processes and are involved in many diseases. Their identification is an important task, and many tools exist in the literature for this purpose. However, almost all of them focus on discriminating between coding RNAs and ncRNAs without giving more biological insight. In this paper, we propose a new reliable method called IRSOM, based on a supervised Self-Organizing Map (SOM) with a rejection option, that overcomes these limitations. The rejection option in IRSOM improves the accuracy of the method and also allows ambiguous transcripts to be identified. Furthermore, with the visualization of the SOM, we analyze the rejected predictions and highlight the ambiguity of the transcripts.
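The idea of a rejection option can be illustrated independently of the SOM itself. The sketch below is not IRSOM's implementation: it assumes a classifier that outputs a probability of being coding, and the function name `classify_with_rejection` and the threshold value 0.8 are hypothetical choices for illustration.

```python
def classify_with_rejection(p_coding, threshold=0.8):
    """Return 'coding', 'noncoding' or 'rejected' for one transcript.

    p_coding is an assumed classifier output in [0, 1]; the prediction is
    withheld when the classifier's confidence falls below the threshold.
    """
    if not 0.0 <= p_coding <= 1.0:
        raise ValueError("p_coding must be a probability")
    confidence = max(p_coding, 1.0 - p_coding)
    if confidence < threshold:
        return "rejected"  # ambiguous transcript, left unlabelled
    return "coding" if p_coding >= 0.5 else "noncoding"


labels = [classify_with_rejection(p) for p in (0.95, 0.55, 0.10)]
```

Raising the threshold rejects more transcripts but makes the remaining predictions more reliable, which is the trade-off a rejection option exposes.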
Results
IRSOM was tested on datasets of several species from different kingdoms and showed better results than state-of-the-art tools. The accuracy of IRSOM is always greater than 0.95 for all the species, with an average specificity of 0.98 and an average sensitivity of 0.99. In addition, IRSOM is fast (it takes around 254 s to analyze a dataset of 147 000 transcripts) and is able to handle very large datasets.
Availability and implementation
IRSOM is implemented in Python and C++. It is available on our software platform EvryRNA (http://EvryRNA.ibisc.univ-evry.fr).
Categories: Bioinformatics, Journals

Discovering epistatic feature interactions from neural network models of regulatory DNA sequences

Sat, 2018-09-08 02:00
Motivation
Transcription factors bind regulatory DNA sequences in a combinatorial manner to modulate gene expression. Deep neural networks (DNNs) can learn the cis-regulatory grammars encoded in regulatory DNA sequences associated with transcription factor binding and chromatin accessibility. Several feature attribution methods have been developed for estimating the predictive importance of individual features (nucleotides or motifs) in any input DNA sequence to its associated output prediction from a DNN model. However, these methods do not reveal higher-order feature interactions encoded by the models.
Results
We present a new method called Deep Feature Interaction Maps (DFIM) to efficiently estimate interactions between all pairs of features in any input DNA sequence. DFIM accurately identifies ground truth motif interactions embedded in simulated regulatory DNA sequences. DFIM identifies synergistic interactions between GATA1 and TAL1 motifs from in vivo TF binding models. DFIM reveals epistatic interactions involving nucleotides flanking the core motif of the Cbf1 TF in yeast from in vitro TF binding models. We also apply DFIM to regulatory sequence models of in vivo chromatin accessibility to reveal interactions between regulatory genetic variants and proximal motifs of target TFs as validated by TF binding quantitative trait loci. Our approach makes significant strides in improving the interpretability of deep learning models for genomics.
Availability and implementation
Code is available at: https://github.com/kundajelab/dfim.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

A deep neural network approach for learning intrinsic protein-RNA binding preferences

Sat, 2018-09-08 02:00
Motivation
The complexes formed by binding of proteins to RNAs play key roles in many biological processes, such as splicing, gene expression regulation, translation and viral replication. Understanding protein-RNA binding may thus provide important insights into the functionality and dynamics of many cellular processes. This has sparked substantial interest in exploring protein-RNA binding experimentally, and predicting it computationally. The key computational challenge is to efficiently and accurately infer protein-RNA binding models that will enable prediction of novel protein-RNA interactions with additional transcripts of interest.
Results
We developed DLPRB (Deep Learning for Protein-RNA Binding), a new deep neural network (DNN) approach for learning intrinsic protein-RNA binding preferences and predicting novel interactions. We present two different network architectures: a convolutional neural network (CNN), and a recurrent neural network (RNN). The novelty of our network hinges upon two key aspects: (i) the joint analysis of both RNA sequence and structure, which is represented as a probability vector of different RNA structural contexts; (ii) novel features in the architecture of the networks, such as the application of RNNs to RNA-binding prediction, and the combination of hundreds of variable-length filters in the CNN. Our results in inferring accurate RNA-binding models from high-throughput in vitro data exhibit substantial improvements compared to all previous approaches for protein-RNA binding prediction (both DNN and non-DNN based). A more modest, yet statistically significant, improvement is achieved for in vivo binding prediction. When experimentally measured RNA structure is incorporated instead of predicted structure, the improvement on in vivo data increases. By visualizing the binding specificities, we can gain biological insights into the mechanism underlying protein-RNA binding.
Availability and implementation
The source code is publicly available at https://github.com/ilanbb/dlprb.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Bayesian inference on stochastic gene transcription from flow cytometry data

Sat, 2018-09-08 02:00
Motivation
Transcription in single cells is an inherently stochastic process as mRNA levels vary greatly between cells, even for genetically identical cells under the same experimental and environmental conditions. We present a stochastic two-state switch model for the population of mRNA molecules in single cells where genes stochastically alternate between a more active ON state and a less active OFF state. We prove that the stationary solution of such a model can be written as a mixture of a Poisson and a Poisson-beta probability distribution. This finding facilitates inference for single cell expression data, observed at a single time point, from flow cytometry experiments such as FACS or fluorescence in situ hybridization (FISH) as it allows one to sample directly from the equilibrium distribution of the mRNA population. We hence propose a Bayesian inferential methodology using a pseudo-marginal approach and a recent approximation to integrate over unobserved states associated with measurement error.
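To make the Poisson-beta component concrete, the sketch below draws mRNA counts by first sampling a Beta-distributed activity level and then a Poisson count whose mean is scaled by that level. This is a generic illustration, not the authors' R code: the parameter names `alpha`, `beta` and `rate`, the helper names, and the use of Knuth's simple Poisson sampler are all assumptions. The additional Poisson term of the mixture described above is not modelled here.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method.

    Adequate for small to moderate lam; slow for large lam.
    """
    if lam <= 0:
        return 0
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def sample_poisson_beta(alpha, beta, rate, rng):
    """Draw p ~ Beta(alpha, beta), then a count ~ Poisson(rate * p)."""
    p = rng.betavariate(alpha, beta)
    return sample_poisson(rate * p, rng)


rng = random.Random(0)  # fixed seed so the run is reproducible
counts = [sample_poisson_beta(2.0, 3.0, 20.0, rng) for _ in range(5000)]
mean_count = sum(counts) / len(counts)
```

Under these assumptions the theoretical mean is rate * alpha / (alpha + beta), which is 8 for the parameters above, so `mean_count` should land close to that value.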
Results
We provide a general inferential framework which can be widely used to study transcription in single cells from the kind of data arising in flow cytometry experiments. The approach allows us to separate the intrinsic stochasticity of the molecular dynamics from the measurement noise. The methodology is tested in simulation studies, and results are obtained for experimental multiple single cell expression data from FISH flow cytometry experiments.
Availability and implementation
All analyses were implemented in R. Source code and the experimental data are available at https://github.com/SimoneTiberi/Bayesian-inference-on-stochastic-gene-transcription-from-flow-cytometry-data.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Off-target predictions in CRISPR-Cas9 gene editing using deep learning

Sat, 2018-09-08 02:00
Motivation
The prediction of off-target mutations in CRISPR-Cas9 is a hot topic due to its relevance to gene editing research. Several prediction methods have been developed; however, most of them only compute scores based on mismatches to the guide sequence in CRISPR-Cas9. These methods are therefore unable to scale and improve their performance with the rapid expansion of experimental data in CRISPR-Cas9. Moreover, the existing methods still do not achieve sufficient precision in off-target prediction for gene editing at the clinical level.
Results
To address this, we design and implement two algorithms using deep neural networks to predict off-target mutations in CRISPR-Cas9 gene editing (i.e. a deep convolutional neural network and a deep feedforward neural network). The models were trained and tested on the recently released CRISPOR off-target dataset for performance benchmarking. Another off-target dataset identified by GUIDE-seq was adopted for additional evaluation. We demonstrate that the convolutional neural network achieves the best performance on the CRISPOR dataset, yielding an average classification area under the ROC curve (AUC) of 97.2% under stratified 5-fold cross-validation. Interestingly, the deep feedforward neural network is also competitive, with an average AUC of 97.0% under the same setting. We compare the two deep neural network models with the state-of-the-art off-target prediction methods (i.e. CFD, MIT, CROP-IT, and CCTop) and three traditional machine learning models (i.e. random forest, gradient boosting trees, and logistic regression) on both datasets in terms of AUC values, demonstrating the competitive edge of the proposed algorithms. Additional analyses are conducted to investigate the underlying reasons from different perspectives.
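Since the comparison above is reported in terms of AUC, a small sketch of how that quantity can be computed may help. It uses the rank interpretation of the AUC (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half). The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not data from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels are 1 (off-target site) or 0 (not an off-target site).
    Quadratic in the number of examples, which is fine for a small sketch.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: five candidate sites with made-up model scores.
auc = roc_auc([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.3, 0.2])
```

In a stratified 5-fold setting this value would be computed once per held-out fold and then averaged.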
Availability and implementation
The example code is available at https://github.com/MichaelLinn/off_target_prediction. The related datasets are available at https://github.com/MichaelLinn/off_target_prediction/tree/master/data.
Categories: Bioinformatics, Journals

CisPi: a transcriptomic score for disclosing cis-acting disease-associated lincRNAs

Sat, 2018-09-08 02:00
Motivation
Long intergenic noncoding RNAs (lincRNAs) have risen to prominence in cancer biology as new biomarkers of disease. Those lincRNAs transcribed from active cis-regulatory elements (enhancers) have provided mechanistic insight into cis-acting regulation; however, in the absence of an enhancer hallmark, computational prediction of cis-acting transcription of lincRNAs remains challenging. Here, we introduce a novel transcriptomic method: a cis-regulatory lincRNA–gene associating metric, termed ‘CisPi’. CisPi quantifies the mutual information between lincRNAs and local gene expression regarding their response to perturbation, such as disease risk-dependence. To predict risk-dependent lincRNAs in neuroblastoma, an aggressive pediatric cancer, we advance this scoring scheme to measure lincRNAs that represent the minority of reads in RNA-Seq libraries by a novel side-by-side analytical pipeline.
Results
Altered expression of lincRNAs that stratifies tumor risk is an informative readout of oncogenic enhancer activity. Our CisPi metric therefore provides a powerful computational model to identify enhancer-templated RNAs (eRNAs), eRNA-like lincRNAs, or active enhancers that regulate the expression of local genes. First, risk-dependent lincRNAs revealed active enhancers, over-represented neuroblastoma susceptibility loci, and uncovered novel clinical biomarkers. Second, the prioritized lincRNAs were significantly prognostic. Third, the predicted target genes further inherited the prognostic significance of these lincRNAs. In sum, RNA-Seq alone is sufficient to identify disease-associated lincRNAs using our methodologies, allowing broader applications to contexts in which enhancer hallmarks are not available or show limited sensitivity.
Availability and implementation
The source code is available on request. The prioritized lincRNAs and their target genes are in the Supplementary Material.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

SPhyR: tumor phylogeny estimation from single-cell sequencing data under loss and error

Sat, 2018-09-08 02:00
Motivation
Cancer is characterized by intra-tumor heterogeneity, the presence of distinct cell populations with distinct complements of somatic mutations, which include single-nucleotide variants (SNVs) and copy-number aberrations (CNAs). Single-cell sequencing technology enables one to study these cell populations at single-cell resolution. Phylogeny estimation algorithms that employ appropriate evolutionary models are key to understanding the evolutionary mechanisms behind intra-tumor heterogeneity.
Results
We introduce Single-cell Phylogeny Reconstruction (SPhyR), a method for tumor phylogeny estimation from single-cell sequencing data. In light of frequent loss of SNVs due to CNAs in cancer, SPhyR employs the k-Dollo evolutionary model, where a mutation can only be gained once but lost k times. Underlying SPhyR is a novel combinatorial characterization of solutions as constrained integer matrix completions, based on a connection to the cladistic multi-state perfect phylogeny problem. SPhyR outperforms existing methods on simulated data and on a metastatic colorectal cancer.
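The k-Dollo condition stated above (one gain, at most k losses) can be checked on a fully labelled tree with a few lines of code. The sketch below is only such a checker, not SPhyR's matrix-completion algorithm. The tree encoding (a child-to-parent dictionary), the assumption that the state above the root is 0, and the function name `satisfies_k_dollo` are all illustrative.

```python
def satisfies_k_dollo(parent, state, k):
    """Check one mutation against the k-Dollo model on a labelled tree.

    parent maps each node to its parent (the root maps to None);
    state maps each node to 0 (mutation absent) or 1 (mutation present).
    The state above the root is assumed to be 0.
    """
    gains = losses = 0
    for node, par in parent.items():
        before = 0 if par is None else state[par]
        after = state[node]
        if before == 0 and after == 1:
            gains += 1
        elif before == 1 and after == 0:
            losses += 1
    return gains <= 1 and losses <= k


# Toy tree: the mutation is gained on the edge to "a" and lost again in "b".
parent = {"root": None, "a": "root", "b": "a", "c": "a", "d": "root"}
state = {"root": 0, "a": 1, "b": 0, "c": 1, "d": 0}
ok_with_one_loss = satisfies_k_dollo(parent, state, k=1)
ok_with_no_loss = satisfies_k_dollo(parent, state, k=0)
```

With k = 0 the check reduces to the perfect phylogeny (infinite sites) condition, in which a mutation is never lost.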
Availability and implementation
SPhyR is available on https://github.com/elkebir-group/SPhyR.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

S-Cluster++: a fast program for solving the cluster containment problem for phylogenetic networks

Sat, 2018-09-08 02:00
Motivation
Comparative genomic studies indicate that extant genomes are more properly considered to be a fusion product of random mutations over generations (vertical evolution) and genomic material transfers between individuals of different lineages (reticulate transfer). This has motivated biologists to use phylogenetic networks and other general models to study genome evolution. Two fundamental algorithmic problems arising from verification of phylogenetic networks and from computing Robinson-Foulds distance in the space of phylogenetic networks are the tree and cluster containment problems. The former asks how to decide whether or not a phylogenetic tree is displayed in a phylogenetic network. The latter is to decide whether a subset of taxa appears as a cluster in some tree displayed in a phylogenetic network. The cluster containment problem (CCP) is also closely related to testing the infinite site model on a recombination network. Both the tree containment and CCP are NP-complete. Although the CCP was introduced a decade ago, there has been little progress in developing fast algorithms for it on arbitrary phylogenetic networks.
Results
In this work, we present a fast computer program for the CCP. This program is developed on the basis of a linear-time transformation from the small version of the CCP to the SAT problem.
Availability and implementation
The program package is available for download at http://www.math.nus.edu.sg/~matzlx/ccp.
Categories: Bioinformatics, Journals

Accurate and adaptive imputation of summary statistics in mixed-ethnicity cohorts

Sat, 2018-09-08 02:00
Motivation
Methods based on summary statistics obtained from genome-wide association studies have gained considerable interest in genetics due to the computational cost and privacy advantages they present. Imputing missing summary statistics has therefore become a key procedure in many bioinformatics pipelines, but available solutions may rely on additional knowledge about the populations used in the original study and, as a result, may not always ensure feasibility or high accuracy of the imputation procedure.
Results
We present ARDISS, a method to impute missing summary statistics in mixed-ethnicity cohorts through Gaussian Process Regression and automatic relevance determination. ARDISS is trained on an external reference panel and does not require information about allele frequencies of genotypes from the original study. Our method approximates the original GWAS population by a combination of samples from a reference panel relying exclusively on the summary statistics and without any external information. ARDISS successfully reconstructs the original composition of mixed-ethnicity cohorts and outperforms alternative solutions in terms of speed and imputation accuracy both for heterogeneous and homogeneous datasets.
Availability and implementation
The proposed method is available at https://github.com/BorgwardtLab/ARDISS.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals

Towards an accurate and efficient heuristic for species/gene tree co-estimation

Sat, 2018-09-08 02:00
Motivation
Species and gene trees represent how species and individual loci within their genomes evolve from their most recent common ancestors. These trees are central to addressing several questions in biology relating to, among other issues, species conservation, trait evolution and gene function. Consequently, their accurate inference from genomic data is a major endeavor. One approach to their inference is to co-estimate species and gene trees from genome-wide data. Indeed, Bayesian methods based on this approach already exist. However, these methods are very slow, limiting their applicability to datasets with small numbers of taxa. The more commonly used approach is to first infer gene trees individually, and then use gene tree estimates to infer the species tree. Methods in this category rely significantly on the accuracy of the gene trees which is often not high when the dataset includes closely related species.
Results
In this work, we introduce a simple, yet effective, iterative method for co-estimating gene and species trees from sequence data of multiple, unlinked loci. In every iteration, the method estimates a species tree, uses it as a generative process to simulate a collection of gene trees, and then selects gene trees for the individual loci from among the simulated gene trees by making use of the sequence data. We demonstrate the accuracy and efficiency of our method on simulated as well as biological data, and compare them to those of existing competing methods.
Availability and implementation
The method has been implemented in PhyloNet, which is publicly available at http://bioinfocs.rice.edu/phylonet.
Categories: Bioinformatics, Journals

Fast characterization of segmental duplications in genome assemblies

Sat, 2018-09-08 02:00
Motivation
Segmental duplications (SDs), or low-copy repeats, are segments of DNA > 1 Kbp with high sequence identity that are copied to other regions of the genome. SDs are among the most important sources of evolution and a common cause of genomic structural variation, and several are associated with diseases of genomic origin, including schizophrenia and autism. Despite their functional importance, SDs present one of the major hurdles for de novo genome assembly due to the ambiguity they cause in building and traversing both state-of-the-art overlap-layout-consensus and de Bruijn graphs. This causes SD regions to be misassembled, collapsed into a unique representation, or completely missing from assembled reference genomes for various organisms. In turn, this missing or incorrect information limits our ability to fully understand the evolution and the architecture of the genomes. Despite the essential need to accurately characterize SDs in assemblies, only one tool has been developed for this purpose, called Whole-Genome Assembly Comparison (WGAC); its primary goal is SD detection. WGAC comprises several steps that employ different tools and custom scripts, which makes this strategy difficult and time-consuming to use. Thus there is still a need for algorithms to characterize within-assembly SDs quickly, accurately, and in a user-friendly manner.
Results
Here we introduce SEgmental Duplication Evaluation Framework (SEDEF) to rapidly detect SDs through sophisticated filtering strategies based on Jaccard similarity and local chaining. We show that SEDEF accurately detects SDs while maintaining a substantial speedup over WGAC that translates into practical run times of minutes instead of weeks. Notably, our algorithm captures up to 25% ‘pairwise error’ between segments, whereas previous studies focused on only 10%, allowing us to more deeply track the evolutionary history of the genome.
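The Jaccard similarity mentioned above can be illustrated on the k-mer sets of two sequences. This is a minimal sketch of the measure itself, not SEDEF's filtering strategy, which the abstract does not detail. The function names `kmer_set` and `jaccard`, the choice k = 4 and the toy sequences are assumptions made for illustration.

```python
def kmer_set(seq, k):
    """Return the set of distinct k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of the k-mer sets of a and b."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0  # both sequences are shorter than k
    return len(ka & kb) / len(union)


# Two toy segments that differ by a single substitution.
similarity = jaccard("ACGTACGT", "ACGTTCGT", k=4)
```

A single substitution already changes several overlapping k-mers, so tolerating a high pairwise error between segments means accepting fairly low Jaccard values as candidate matches.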
Availability and implementation
SEDEF is available at https://github.com/vpc-ccg/sedef.
Categories: Bioinformatics, Journals

PAIPline: pathogen identification in metagenomic and clinical next generation sequencing samples

Sat, 2018-09-08 02:00
Motivation
Next generation sequencing (NGS) has provided researchers with a powerful tool to characterize metagenomic and clinical samples in research and diagnostic settings. NGS provides an open view into samples, which is useful for detecting pathogens in an unbiased fashion and without a prior hypothesis about possible causative agents. However, NGS datasets for pathogen detection come with several obstacles, such as a very unfavorable ratio of pathogen to host reads. In addition to frequent false positives and irrelevant organisms, such as contaminants, tools are often challenged by samples with low pathogen loads and might not report organisms present below a certain threshold. Furthermore, some metagenomic profiling tools focus on only one particular set of pathogens, for example bacteria.
Results
We present PAIPline, a bioinformatics pipeline specifically designed to address problems associated with detecting pathogens in diagnostic samples. PAIPline particularly focuses on user-friendliness and encapsulates all necessary steps, from preprocessing to resolution of ambiguous reads and filtering up to visualization, in a single tool. In contrast to existing tools, PAIPline is more specific while maintaining sensitivity. This is shown in a comparative evaluation where PAIPline was benchmarked alongside other well-known metagenomic profiling tools on previously published, well-characterized datasets. Additionally, as part of an international cooperation project, PAIPline was applied to an outbreak sample of hemorrhagic fevers of then unknown etiology. The presented results show that PAIPline can serve as a robust, reliable, user-friendly, adaptable and generalizable stand-alone software for diagnostics from NGS samples and as a stepping stone for further downstream analyses.
Availability and implementation
PAIPline is freely available under https://gitlab.com/rki_bioinformatics/paipline.
Categories: Bioinformatics, Journals

An accurate and rapid continuous wavelet dynamic time warping algorithm for end-to-end mapping in ultra-long nanopore sequencing

Sat, 2018-09-08 02:00
Motivation
Nanopore sequencing promises long reads, point-of-care use and freedom from polymerase chain reaction. Among the various steps in nanopore data analysis, the end-to-end mapping between the raw electrical current signal sequence and the expected reference signal sequence serves as the key building block for signal labeling and the subsequent signal visualization, variant identification and methylation detection. One of the classic algorithms for solving the signal mapping problem is dynamic time warping (DTW). However, ultra-long nanopore sequencing and an order-of-magnitude difference in sampling speed complicate the scenario and make classical DTW infeasible.
Results
Here, we propose a novel multi-level DTW algorithm, continuous wavelet DTW (cwDTW), based on continuous wavelet transforms with different scales of the two signal sequences. Our algorithm starts from low-resolution wavelet transforms of the two sequences, such that the transformed sequences are short and have similar sampling rates. Then the peaks and nadirs of the transformed sequences are extracted to form feature sequences with similar lengths, which can be easily mapped by the original DTW. Our algorithm then recursively projects the warping path from a lower-resolution level to a higher-resolution one by building a context-dependent boundary and enabling a constrained search for the warping path in the latter. Comprehensive experiments on two real nanopore datasets, from human and from Pandoraea pnomenusa, demonstrate the efficiency and effectiveness of the proposed algorithm. In particular, cwDTW achieves remarkable acceleration with only a small loss of alignment accuracy. On the real nanopore datasets, cwDTW can finish an alignment task in a few seconds, which is about 3000 times faster than the original DTW. By successfully applying cwDTW to the tasks of signal labeling and ultra-long sequence comparison, we further demonstrate the power and applicability of cwDTW.
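For reference, the original DTW that cwDTW builds on can be written as a short dynamic program. The sketch below is the classical quadratic-time algorithm with an absolute-difference cost, not cwDTW itself. The function name `dtw_distance` and the toy signals are illustrative assumptions.

```python
def dtw_distance(x, y):
    """Classical DTW cost between two numeric sequences.

    cost[i][j] is the minimal accumulated cost of aligning x[:i] with y[:j];
    time and memory are both O(len(x) * len(y)).
    """
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# The second signal repeats one sample, as a slower sampling rate might.
distance = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
```

The quadratic table is what becomes prohibitive for ultra-long reads, and it is this full search that the coarse-to-fine constrained search described above avoids.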
Availability and implementation
Our program is available at https://github.com/realbigws/cwDTW.
Supplementary information
Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics, Journals