Bioinformatics

Bioinformatics - RSS feed of current issue

Predicting giant transmembrane β-barrel architecture

Wed, 2012-05-09 12:19

Motivation: The β-barrel is a ubiquitous fold that is deployed to accomplish a wide variety of biological functions including membrane-embedded pores. Key determinants of β-barrel lumen diameter include the number of β-strands (n) and the degree of shear (S), the latter value measuring the extent to which the β-sheet is tilted within the β-barrel. Notably, it has previously been reported that the shear value for small antiparallel β-barrels (n≤24) typically ranges between n and 2n. Conversely, it has been suggested that the β-strands in giant antiparallel β-barrels, such as those formed by pore-forming cholesterol-dependent cytolysins (CDC), are parallel relative to the axis of the β-barrel, i.e. S=0. The S=0 arrangement, however, has never been observed in crystal structures of small β-barrels. Therefore, the structural basis for how CDCs form a β-barrel and span a membrane remains to be understood.

Results: Through comparison of molecular models with experimental data, we are able to identify how giant CDC β-barrels utilize a ‘near parallel’ arrangement of β-strands where S=n/2. Furthermore, we show how side-chain packing within the β-barrel lumen is an important limiting factor with respect to the possible shear values for small β-barrels (n≤24 β-strands). In contrast, our models reveal that no such limitation restricts the shear value of giant β-barrels (n>24 β-strands). Giant β-barrels can thus access a different architecture compared with smaller β-barrels.
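
How n and S jointly set the strand tilt and barrel size can be illustrated with a short calculation. The sketch below assumes the idealized cylindrical model commonly attributed to Murzin, Lesk and Chothia, which is not stated in the abstract: tan α = S·a/(n·b) and R = b/(2·sin(π/n)·cos α), with a ≈ 3.3 Å and b ≈ 4.4 Å taken as typical values. The function name and the example value n=144 are illustrative assumptions, not figures from the paper.

```python
import math

A_RISE = 3.3   # assumed C-alpha rise per residue along a strand, in angstroms
B_DIST = 4.4   # assumed distance between adjacent strands, in angstroms


def barrel_geometry(n, s, a=A_RISE, b=B_DIST):
    """Return (strand tilt in degrees, barrel radius in angstroms).

    Uses the idealized relations tan(alpha) = s*a / (n*b) and
    R = b / (2 * sin(pi/n) * cos(alpha)); an assumption, see the lead-in.
    """
    alpha = math.atan2(s * a, n * b)
    radius = b / (2.0 * math.sin(math.pi / n) * math.cos(alpha))
    return math.degrees(alpha), radius


if __name__ == "__main__":
    # Small barrels with S between n and 2n, and a hypothetical giant
    # barrel with S = 0 and S = n/2.
    for n, s in [(16, 16), (16, 32), (144, 0), (144, 72)]:
        tilt, radius = barrel_geometry(n, s)
        print(f"n={n:3d} S={s:3d} tilt={tilt:5.1f} deg radius={radius:6.1f} A")
```

Under this model S=0 gives strands exactly parallel to the barrel axis, and S=n/2 gives a tilt of roughly 20 degrees regardless of n, which is one way to read the term ‘near parallel’.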

Contact: michelle.dunstone@monash.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Bioinformatics, Journals

1001 Proteomes: a functional proteomics portal for the analysis of Arabidopsis thaliana accessions

Wed, 2012-05-09 12:19

Motivation: The sequencing of over a thousand natural strains of the model plant Arabidopsis thaliana is producing unparalleled information at the genetic level for plant researchers. To enable the rapid exploitation of these data for functional proteomics studies, we have created a resource for the visualization of protein information and proteomic datasets for sequenced natural strains of A. thaliana.

Results: The 1001 Proteomes portal can be used to visualize amino acid substitutions or non-synonymous single-nucleotide polymorphisms in individual proteins of A. thaliana based on the reference genome Col-0. We have used the available processed sequence information to analyze the conservation of known residues subject to protein phosphorylation among these natural strains. The substitution of amino acids in A. thaliana natural strains is heavily constrained and is likely a result of the conservation of functional attributes within proteins. At a practical level, we demonstrate that this information can be used to clarify ambiguously defined phosphorylation sites from phosphoproteomic studies. Protein sets of available natural variants are available for download to enable proteomic studies on these accessions. Together this information can be used to uncover the possible roles of specific amino acids in determining the structure and function of proteins in the model plant A. thaliana. An online portal to enable the community to exploit these data can be accessed at http://1001proteomes.masc-proteomics.org/

Contact: jlheazlewood@lbl.gov

Supplementary information: Supplementary data are available at Bioinformatics online.


CONTRA: copy number analysis for targeted resequencing

Wed, 2012-05-09 12:19

Motivation: In light of the increasing adoption of targeted resequencing (TR) as a cost-effective strategy to identify disease-causing variants, a robust method for copy number variation (CNV) analysis is needed to maximize the value of this promising technology.

Results: We present a method for CNV detection for TR data, including whole-exome capture data. Our method calls copy number gains and losses for each target region based on normalized depth of coverage. Our key strategies include the use of base-level log-ratios to remove GC-content bias, correction for an imbalanced library size effect on log-ratios, and the estimation of log-ratio variations via binning and interpolation. Our methods are made available via CONTRA (COpy Number Targeted Resequencing Analysis), a software package that takes standard alignment formats (BAM/SAM) and outputs in variant call format (VCF4.0), for easy integration with other next-generation sequencing analysis packages. We assessed our methods using samples from seven different target enrichment assays, and evaluated our results using simulated data and real germline data with known CNV genotypes.
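
The idea of a library-size-corrected, base-level log-ratio can be sketched as follows. This is a minimal illustration, not CONTRA's implementation: the function name, the pseudocount and the simple rescaling by total read count are all assumptions, and the binning and interpolation step that estimates log-ratio variation is omitted.

```python
import math


def region_log_ratio(test_depths, control_depths, test_total, control_total,
                     pseudo=0.5):
    """Mean base-level log2 ratio for one target region.

    test_depths / control_depths: per-base coverage over the same bases.
    test_total / control_total: total mapped reads in each library, used
    to rescale the test depth to the control library size.
    pseudo: assumed pseudocount that avoids log(0) at uncovered bases.
    """
    if len(test_depths) != len(control_depths):
        raise ValueError("depth vectors must cover the same bases")
    scale = control_total / test_total
    ratios = [
        math.log2((t * scale + pseudo) / (c + pseudo))
        for t, c in zip(test_depths, control_depths)
    ]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    control = [100, 110, 90, 105]
    single_copy_loss = [50, 55, 45, 52]
    print(region_log_ratio(single_copy_loss, control, 1_000_000, 1_000_000))
```

Because test and control depths are compared base by base, sequence-dependent effects such as GC content that act on both samples at the same position largely cancel in the ratio.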

Availability and implementation: Source code and sample data are freely available under GNU license (GPLv3) at http://contra-cnv.sourceforge.net/

Contact: Jason.Li@petermac.org

Supplementary information: Supplementary data are available at Bioinformatics online.


Probabilistic suffix array: efficient modeling and prediction of protein families

Wed, 2012-05-09 12:19

Motivation: Markov models are very popular for analyzing complex sequences such as protein sequences, whose sources are unknown, or whose underlying statistical characteristics are not well understood. A major problem is the computational complexity involved with using Markov models, especially the exponential growth of their size with the order of the model. The probabilistic suffix tree (PST) and its improved variant sparse probabilistic suffix tree (SPST) have been proposed to address some of the key problems with Markov models. The use of the suffix tree, however, implies that the space requirement for the PST/SPST could still be high.

Results: We present the probabilistic suffix array (PSA), a data structure for representing information in variable length Markov chains. The PSA essentially encodes information in a Markov model by providing a time and space-efficient alternative to the PST/SPST. Given a sequence of length N, construction and learning in the PSA is done in O(N) time and space, independent of the Markov order. Prediction using the PSA is performed in O(m log(N/|Σ|)) time, where m is the pattern length, and Σ is the symbol alphabet. In terms of modeling and prediction accuracy, using protein families from Pfam 25.0, SPST and PSA produced similar results (SPST 89.82%, PSA 89.56%), but slightly lower than HMMER3 (92.55%). A modified algorithm for PSA prediction improved the performance to 91.7%, or just 0.79% from HMMER3 results. The average (maximum) practical construction space for the protein families tested was 21.58±6.32N (41.11N) bytes using the PSA, 27.55±13.16N (63.01N) bytes using SPST and 47±24.95N (140.3N) bytes for HMMER3. The PSA was 255 times faster to construct than the SPST, and 11 times faster than HMMER3.
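
To make the role of the suffix array concrete, the sketch below shows how a plain suffix array can answer the query a variable-length Markov model needs: which symbols follow a given context, and how often. It is an illustration only, not the PSA itself. Construction here is a naive sort rather than the O(N) method described, lookup is an ordinary binary search costing O(m log N) comparisons, and all function names are assumptions.

```python
from collections import Counter


def build_suffix_array(text):
    """Naive suffix array: start positions of suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def pattern_range(text, sa, pattern):
    """Half-open range [lo, hi) of suffix-array slots starting with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first slot whose prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                      # first slot whose prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def next_symbol_probs(text, sa, context):
    """Empirical P(next symbol | context) from occurrences of context."""
    start, end = pattern_range(text, sa, context)
    m = len(context)
    counts = Counter(
        text[sa[k] + m] for k in range(start, end) if sa[k] + m < len(text)
    )
    total = sum(counts.values())
    return {sym: c / total for sym, c in counts.items()} if total else {}


if __name__ == "__main__":
    seq = "ABABAC"
    sa = build_suffix_array(seq)
    print(next_symbol_probs(seq, sa, "A"))
```

All occurrences of a context occupy one contiguous block of the suffix array, so the counts for every context length come from the same structure without storing an explicit tree.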

Availability: http://www.csee.wvu.edu/~adjeroh/projects/PSA

Contact: don@csee.wvu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


Fulcrum: condensing redundant reads from high-throughput sequencing studies

Wed, 2012-05-09 12:19

Motivation: Ultra-high-throughput sequencing produces duplicate and near-duplicate reads, which can consume computational resources in downstream applications. A tool that collapses such reads should reduce storage and assembly complications and costs.

Results: We developed Fulcrum to collapse identical and near-identical Illumina and 454 reads (such as those from PCR clones) into single error-corrected sequences; it can process paired-end as well as single-end reads. Fulcrum is customizable and can be deployed on a single machine, a local network or a commercially available MapReduce cluster, and it has been optimized to maximize ease-of-use, cross-platform compatibility and future scalability. Sequence datasets have been collapsed by up to 71%, and the reduced number and improved quality of the resulting sequences allow assemblers to produce longer contigs while using less memory.
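
Collapsing near-identical reads into a single consensus can be sketched in a few lines. The example below is a simplified illustration and not Fulcrum's algorithm: it bins reads by length and a prefix key, greedily clusters reads within a Hamming-distance threshold of a seed, and takes a per-column majority vote. The function names, the prefix binning and the default thresholds are assumptions, and quality scores and paired ends are ignored.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def collapse_reads(reads, key_len=8, max_mismatches=1):
    """Return a list of (consensus_sequence, read_count) tuples."""
    # Bin by (length, prefix) so only plausible duplicates are compared.
    bins = defaultdict(list)
    for read in reads:
        bins[(len(read), read[:key_len])].append(read)

    collapsed = []
    for members in bins.values():
        clusters = []                   # each cluster: list of reads, seed first
        for read in members:
            for cluster in clusters:
                if hamming(cluster[0], read) <= max_mismatches:
                    cluster.append(read)
                    break
            else:
                clusters.append([read])
        for cluster in clusters:
            # Majority vote per column acts as a simple error correction.
            consensus = "".join(
                Counter(column).most_common(1)[0][0] for column in zip(*cluster)
            )
            collapsed.append((consensus, len(cluster)))
    return collapsed


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "TTTTACGTAC"]
    print(collapse_reads(reads))
```

Binning by prefix keeps the number of pairwise comparisons small, and bins are independent of one another, which is the kind of structure that maps naturally onto a MapReduce-style deployment.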

Availability and implementation: Source code and a tutorial are available at http://pringlelab.stanford.edu/protocols.html under a BSD-like license. Fulcrum was written and tested in Python 2.6, and the single-machine and local-network modes depend on a modified version of the Parallel Python library (provided).

Contact: erik.m.lehnert@gmail.com

Supplementary information: Supplementary information is available at Bioinformatics online.


A subspace method for the detection of transcription factor binding sites

Wed, 2012-05-09 12:19

Motivation: The identification of the sites at which transcription factors (TFs) bind to deoxyribonucleic acid (DNA) is an important problem in molecular biology. Many computational methods have been developed for motif finding, most of them based on position-specific scoring matrices (PSSMs) which assume the independence of positions within a binding site. However, some experimental and computational studies demonstrate that interdependences between positions exist.

Results: In this article, we introduce a novel motif finding method which constructs a subspace based on the covariance of numerical DNA sequences. When a candidate sequence is projected into the modeled subspace, a threshold in the Q-residuals confidence allows us to predict whether this sequence is a binding site. Using the TRANSFAC and JASPAR databases, we compared our Q-residuals detector with existing PSSM methods. In most of the studied TF binding sites, the Q-residuals detector performs significantly better and faster than MATCH and MAST. As compared with Motifscan, a method which takes into account interdependences, the performance of the Q-residuals detector is better when the number of available sequences is small.

Availability: http://r-forge.r-project.org/projects/meet

Contact: epairo@ibecbarcelona.eu; alexandre.perera@upc.edu

Supplementary information: Supplementary data (1, 2, 3 and 4) are available at Bioinformatics online.


PhyLAT: a phylogenetic local alignment tool

Wed, 2012-05-09 12:19

Motivation: The expansion of DNA sequencing capacity has enabled the sequencing of whole genomes from a number of related species. These genomes can be combined in a multiple alignment that provides useful information about the evolutionary history at each genomic locus. One area in which evolutionary information can productively be exploited is in aligning a new sequence to a database of existing, aligned genomes. However, existing high-throughput alignment tools are not designed to work effectively with multiple genome alignments.

Results: We introduce PhyLAT, the phylogenetic local alignment tool, to compute local alignments of a query sequence against a fixed multiple-genome alignment of closely related species. PhyLAT uses a known phylogenetic tree on the species in the multiple alignment to improve the quality of its computed alignments while also estimating the placement of the query on this tree. It combines a probabilistic approach to alignment with seeding and expansion heuristics to accelerate discovery of significant alignments. We provide evidence, using alignments of human chromosome 22 against a five-species alignment from the UCSC Genome Browser database, that PhyLAT's alignments are more accurate than those of other commonly used programs, including BLAST, POY, MAFFT, MUSCLE and CLUSTAL. PhyLAT also identifies more alignments in coding DNA than does pairwise alignment alone. Finally, our tool determines the evolutionary relationship of query sequences to the database more accurately than do POY, RAxML, EPA or pplacer.

Availability: www.cse.wustl.edu/~htsun/phylat

Contact: sunhongtao@wustl.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


Fast protein binding site comparisons using visual words representation

Wed, 2012-05-09 12:19

Motivation: Finding geometrically similar protein binding sites is crucial for understanding protein functions and can provide valuable information for protein–protein docking and drug discovery. As the number of known protein–protein interaction structures has dramatically increased, a high-throughput and accurate protein binding site comparison method is essential. Traditional alignment-based methods can provide accurate correspondence between the binding sites but are computationally expensive.

Results: In this article, we present a novel method for the comparisons of protein binding sites using a ‘visual words’ representation (PBSword). We first extract geometric features of binding site surfaces and build a vocabulary of visual words by clustering a large set of feature descriptors. We then describe a binding site surface with a high-dimensional vector that encodes the frequency of visual words, enhanced by the spatial relationships among them. Finally, we measure the similarity of binding sites by utilizing metric space operations, which provide speedy comparisons between protein binding sites. Our experimental results show that PBSword achieves a comparable classification accuracy to an alignment-based method and improves accuracy of a feature-based method by 36% on a non-redundant dataset. PBSword also exhibits a significant efficiency improvement over an alignment-based method.
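
The ‘visual words’ pipeline described above (quantize local descriptors against a vocabulary, build a frequency vector, compare vectors with a metric) can be sketched as follows. This is a generic bag-of-words illustration under stated assumptions, not PBSword: the vocabulary is taken as given rather than learned by clustering, the spatial relationships among words are left out, and Euclidean distance stands in for whatever metric the authors use.

```python
import math


def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary entry closest to the descriptor."""
    return min(
        range(len(vocabulary)),
        key=lambda i: math.dist(descriptor, vocabulary[i]),
    )


def word_histogram(descriptors, vocabulary):
    """Normalized frequency of each visual word over a surface's descriptors."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(vocabulary)


def histogram_distance(h1, h2):
    """Euclidean distance between two word-frequency vectors."""
    return math.dist(h1, h2)


if __name__ == "__main__":
    vocabulary = [(0.0, 0.0), (1.0, 1.0)]          # hypothetical 2-word vocabulary
    site_a = [(0.1, 0.0), (0.9, 1.0), (1.0, 1.2)]  # hypothetical descriptors
    site_b = [(0.0, 0.2), (0.1, 0.1), (1.1, 0.9)]
    ha = word_histogram(site_a, vocabulary)
    hb = word_histogram(site_b, vocabulary)
    print(ha, hb, histogram_distance(ha, hb))
```

Once each binding site is reduced to a fixed-length vector, comparing two sites costs a single distance computation, which is where the speed advantage over alignment-based comparison comes from.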

Availability: PBSword is available at http://proteindbs.rnet.missouri.edu/pbsword/pbsword.html

Contact: shyuc@missouri.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


Matrix eQTL: ultra fast eQTL analysis via large matrix operations

Wed, 2012-05-09 12:19

Motivation: Expression quantitative trait loci (eQTL) analysis links variations in gene expression levels to genotypes. For modern datasets, eQTL analysis is a computationally intensive task as it involves testing for association of billions of transcript-SNP (single-nucleotide polymorphism) pairs. The heavy computational burden makes eQTL analysis less popular and sometimes forces analysts to restrict their attention to just a small subset of transcript-SNP pairs. As more transcripts and SNPs get interrogated over a growing number of samples, the demand for faster tools for eQTL analysis grows stronger.

Results: We have developed new software for computationally efficient eQTL analysis, called Matrix eQTL. In tests on large datasets, it was 2–3 orders of magnitude faster than existing popular tools for QTL/eQTL analysis, while finding the same eQTLs. The fast performance is achieved by special preprocessing and expressing the most computationally intensive part of the algorithm in terms of large matrix operations. Matrix eQTL supports additive linear and ANOVA models with covariates, including models with correlated and heteroskedastic errors. The issue of multiple testing is addressed by calculating false discovery rate; this can be done separately for cis- and trans-eQTLs.
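
The core trick of expressing all pairwise tests as one matrix product can be shown for the simplest case, a linear model with no covariates. After each SNP row and each transcript row is centered and scaled to unit length, the product of the two matrices holds every Pearson correlation, and each correlation converts to a t-statistic in closed form. The sketch below uses plain Python lists to stay dependency-free; a practical version would use an optimized matrix library. The function names are assumptions, covariates and the ANOVA model are omitted, and rows are assumed to be non-constant.

```python
import math


def standardize(row):
    """Center a row and scale it to unit Euclidean norm (row must vary)."""
    mean = sum(row) / len(row)
    centered = [x - mean for x in row]
    norm = math.sqrt(sum(x * x for x in centered))
    return [x / norm for x in centered]


def correlation_matrix(snps, genes):
    """All SNP-by-gene Pearson correlations as one matrix product."""
    s = [standardize(row) for row in snps]
    g = [standardize(row) for row in genes]
    return [[sum(a * b for a, b in zip(srow, grow)) for grow in g] for srow in s]


def t_statistic(r, n_samples):
    """t-statistic for testing r = 0 in simple linear regression (|r| < 1)."""
    return r * math.sqrt((n_samples - 2) / (1.0 - r * r))


if __name__ == "__main__":
    snps = [[0, 1, 2, 1, 0, 2], [0, 0, 1, 1, 2, 2]]
    genes = [[1.0, 2.1, 2.9, 2.0, 1.2, 3.1], [5.0, 4.8, 5.1, 4.9, 5.2, 5.0]]
    for row in correlation_matrix(snps, genes):
        print([round(r, 3) for r in row])
```

The standardization is done once per row, so the cost of testing every pair is dominated by a single large multiplication rather than billions of separate regressions.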

Availability: Matlab and R implementations are available for free at http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL

Contact: shabalin@email.unc.edu


Fast and accurate inference of local ancestry in Latino populations

Wed, 2012-05-09 12:19

Motivation: It is becoming increasingly evident that the analysis of genotype data from recently admixed populations is providing important insights into medical genetics and population history. Such analyses have been used to identify novel disease loci, to understand recombination rate variation and to detect recent selection events. The utility of such studies crucially depends on accurate and unbiased estimation of the ancestry at every genomic locus in recently admixed populations. Although various methods have been proposed and shown to be extremely accurate in two-way admixtures (e.g. African Americans), only a few approaches have been proposed and thoroughly benchmarked on multi-way admixtures (e.g. Latino populations of the Americas).

Results: To address these challenges we introduce here methods for local ancestry inference which leverage the structure of linkage disequilibrium in the ancestral population (LAMP-LD), and incorporate the constraint of Mendelian segregation when inferring local ancestry in nuclear family trios (LAMP-HAP). Our algorithms uniquely combine hidden Markov models (HMMs) of haplotype diversity within a novel window-based framework to achieve superior accuracy as compared with published methods. Further, unlike previous methods, the structure of our HMM does not depend on the number of reference haplotypes but on a fixed constant, and it is thereby capable of utilizing large datasets while remaining highly efficient and robust to over-fitting. Through simulations and analysis of real data from 489 nuclear trio families from the mainland US, Puerto Rico and Mexico, we demonstrate that our methods achieve superior accuracy compared with published methods for local ancestry inference in Latinos.

Availability: http://lamp.icsi.berkeley.edu/lamp/lampld/

Contact: bpasaniu@hsph.harvard.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


Penalized logistic regression for high-dimensional DNA methylation data with case-control studies

Wed, 2012-05-09 12:19

Motivation: DNA methylation is a molecular modification of DNA that plays crucial roles in regulation of gene expression. Particularly, CpG rich regions are frequently hypermethylated in cancer tissues, but not methylated in normal tissues. However, compared with microarray gene expression, there is little methodological literature on case-control association studies for high-dimensional DNA methylation data. One key feature of DNA methylation data is a grouped structure among CpG sites from a gene that are possibly highly correlated. In this article, we proposed a penalized logistic regression model for correlated DNA methylation CpG sites within genes from high-dimensional array data. Our regularization procedure is based on a combination of the l1 penalty and squared l2 penalty on degree-scaled differences of coefficients of CpG sites within one gene, so it induces both sparsity and smoothness with respect to the correlated regression coefficients. We combined the penalized procedure with a stability selection procedure such that a selection probability of each regression coefficient was provided, which helps us make a stable and confident selection of methylation CpG sites that are possibly truly associated with the outcome.
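
One plausible reading of the penalty described above can be written out explicitly. The sketch assumes that CpG sites within a gene are treated as fully connected, so each site in a gene with k sites has degree k−1, and that the penalty is λ1·Σ|β_j| + λ2·Σ(β_u/√d_u − β_v/√d_v)² over pairs of sites in the same gene. The abstract does not give the exact formula, so this form, the function name and the parameter names are assumptions.

```python
import math
from itertools import combinations


def grouped_penalty(beta, gene_of, lam1, lam2):
    """l1 penalty plus squared l2 penalty on degree-scaled differences.

    beta:    regression coefficients, one per CpG site.
    gene_of: gene label for each CpG site (same length as beta).
    lam1, lam2: assumed tuning parameters for sparsity and smoothness.
    """
    groups = {}
    for j, gene in enumerate(gene_of):
        groups.setdefault(gene, []).append(j)

    l1_term = sum(abs(b) for b in beta)

    smooth_term = 0.0
    for sites in groups.values():
        degree = len(sites) - 1         # degree in a fully connected gene
        if degree == 0:
            continue                    # a lone site has no neighbour
        scale = math.sqrt(degree)
        for u, v in combinations(sites, 2):
            diff = beta[u] / scale - beta[v] / scale
            smooth_term += diff * diff
    return lam1 * l1_term + lam2 * smooth_term


if __name__ == "__main__":
    genes = ["g1", "g1", "g2"]
    print(grouped_penalty([1.0, 1.0, 0.0], genes, lam1=1.0, lam2=1.0))
    print(grouped_penalty([1.0, -1.0, 0.0], genes, lam1=1.0, lam2=1.0))
```

The two example calls have the same l1 term, but the second is penalized more heavily because the two coefficients within gene g1 disagree, which is the smoothness effect the abstract refers to.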

Results: Using simulation studies we demonstrated that the proposed procedure outperforms existing main-stream regularization methods such as lasso and elastic-net when data is correlated within a group. We also applied our method to identify important CpG sites and corresponding genes for ovarian cancer from over 20 000 CpGs generated from Illumina Infinium HumanMethylation27K Beadchip. Some genes identified are potentially associated with cancers.

Contact: sw2206@columbia.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


Inferring gene regulatory networks by ANOVA

Wed, 2012-05-09 12:19

Motivation: To improve the understanding of molecular regulation events, various approaches have been developed for deducing gene regulatory networks from mRNA expression data.

Results: We present a new score for network inference, η², that is derived from an analysis of variance. Candidate transcription factor:target gene (TF:TG) relationships are assumed more likely if the expression of TF and TG are mutually dependent in at least a subset of the examined experiments. We evaluate this dependency by η², a non-parametric, non-linear correlation coefficient. It is fast, easy to apply and does not require the discretization of the input data. In the recent DREAM5 blind assessment, the arguably most comprehensive evaluation of inference methods, our approach based on η² was rated the best performer on real expression compendia. It also performs better than methods tested in other recently published comparative assessments. About half of our novel predictions are true interactions as estimated from qPCR experiments performed for DREAM5.

Conclusions: The score η² has a number of interesting features that enable the efficient detection of gene regulatory interactions. For most experimental setups, it is an interesting alternative to other measures of dependency such as Pearson's correlation or mutual information.
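
For orientation, the standard ANOVA effect size eta squared is the fraction of total variance explained by group membership, SS_between / SS_total. The sketch below computes it for values already partitioned into groups. How the authors form groups from TF and TG expression across experiments is not described in the abstract, so the grouping here is left to the caller and the function name is an assumption.

```python
def eta_squared(groups):
    """One-way ANOVA effect size: between-group over total sum of squares.

    groups: list of non-empty lists of numeric values.
    Returns 0.0 when all values are identical (no variance to explain).
    """
    values = [x for group in groups for x in group]
    grand_mean = sum(values) / len(values)
    ss_total = sum((x - grand_mean) ** 2 for x in values)
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    return ss_between / ss_total if ss_total else 0.0


if __name__ == "__main__":
    # Hypothetical target-gene expression, grouped by experimental condition.
    print(eta_squared([[1.0, 2.0], [3.0, 4.0]]))
    print(eta_squared([[1.0, 3.0], [1.0, 3.0]]))
```

The value lies between 0 and 1 and makes no assumption that the dependence is linear, which is consistent with its description as a non-linear correlation coefficient.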

Availability: See http://www2.bio.ifi.lmu.de/kueffner/anova.tar.gz for code and example data.

Contact: kueffner@bio.ifi.lmu.de

Supplementary information: Supplementary data are available at Bioinformatics online.


Improving GO semantic similarity measures by exploring the ontology beneath the terms and modelling uncertainty

Wed, 2012-05-09 12:19

Motivation: Several measures have been recently proposed for quantifying the functional similarity between gene products according to well-structured controlled vocabularies where biological terms are organized in a tree or in a directed acyclic graph (DAG) structure. However, existing semantic similarity measures ignore two important facts. First, when calculating the similarity between two terms, they disregard the descendants of these terms. While this makes no difference when the ontology is a tree, we shall show that it has important consequences when the ontology is a DAG—this is the case, for example, with the Gene Ontology (GO). Second, existing similarity measures do not model the inherent uncertainty which comes from the fact that our current knowledge of the gene annotation and of the ontology structure is incomplete. Here, we propose a novel approach based on downward random walks that can be used to improve any of the existing similarity measures to exhibit these two properties. The approach is computationally efficient—random walks do not need to be simulated as we provide formulas to calculate their stationary distributions.

Results: To show that our approach can potentially improve any semantic similarity measure, we test it on six different semantic similarity measures: three commonly used measures by Resnik (1999), Lin (1998), and Jiang and Conrath (1997); and three recently proposed measures: simUI, simGIC by Pesquita et al. (2008); GraSM by Couto et al. (2007); and Couto and Silva (2011). We applied these improved measures to the GO annotations of the yeast Saccharomyces cerevisiae, and tested how they correlate with sequence similarity, mRNA co-expression and protein–protein interaction data. Our results consistently show that the use of downward random walks leads to more reliable similarity measures.
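
As background for the baseline measures named above, the sketch below implements Resnik's similarity on a toy DAG: the information content of a term is −log of the fraction of annotations at or below it, and the similarity of two terms is the highest information content among their common ancestors. This shows only the existing measure that the paper sets out to improve, not the downward random walk method. The toy ontology, the annotation list and the function names are assumptions.

```python
import math


def ancestors(term, parents):
    """The term itself plus every term reachable by following parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(annotations, parents):
    """IC(t) = -log(fraction of annotations to t or any of its descendants)."""
    counts = {}
    for term in annotations:
        for anc in ancestors(term, parents):
            counts[anc] = counts.get(anc, 0) + 1
    total = len(annotations)
    return {term: -math.log(count / total) for term, count in counts.items()}


def resnik(term1, term2, parents, ic):
    """Highest information content among the common ancestors of two terms."""
    common = ancestors(term1, parents) & ancestors(term2, parents)
    return max((ic.get(term, 0.0) for term in common), default=0.0)


if __name__ == "__main__":
    # Toy DAG: term "c" has two parents, which a tree could not express.
    parents = {"root": [], "a": ["root"], "b": ["root"],
               "c": ["a", "b"], "d": ["a"]}
    annotations = ["c", "d", "b", "d"]
    ic = information_content(annotations, parents)
    print(resnik("c", "d", parents, ic), resnik("c", "b", parents, ic))
```

Note that the computation looks only upward from each term; nothing about the descendants of the two terms enters the score, which is the first limitation the abstract identifies.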

Availability: We have developed a suite of tools that implement existing semantic similarity measures and our improved measures based on random walks. The tools are implemented in Matlab and are freely available from: http://www.paccanarolab.org/papers/GOsim/

Contact: alberto@cs.rhul.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.


AutoLabDB: a substantial open source database schema to support a high-throughput automated laboratory

Wed, 2012-05-09 12:19

Motivation: Modern automated laboratories need substantial data management solutions to both store and make accessible the details of the experiments they perform. To be useful, a modern Laboratory Information Management System (LIMS) should be flexible and easily extensible to support evolving laboratory requirements, and should be based on the solid foundations of a robust, well-designed database. We have developed such a database schema to support an automated laboratory that performs experiments in systems biology and high-throughput screening.

Results: We describe the design of the database schema (AutoLabDB), detailing the main features and describing why we believe it will be relevant to LIMS manufacturers or custom builders. This database has been developed to support two large automated Robot Scientist systems over the last 5 years, where it has been used as the basis of an LIMS that helps to manage both the laboratory and all the experiment data produced.

Availability and implementation: The database schema has been made available as open source (BSD license), so that others may use, extend and improve it to meet their own needs. Example software interfaces to the database are also provided. http://autolabdb.sourceforge.net/

Contact: afc@aber.ac.uk


GECA: a fast tool for gene evolution and conservation analysis in eukaryotic protein families

Wed, 2012-05-09 12:19

Summary: GECA is a fast, user-friendly and freely-available tool for representing gene exon/intron organization and highlighting changes in gene structure among members of a gene family. It relies on protein alignment, completed with the identification of common introns in the corresponding genes using CIWOG. GECA produces a main graphical representation showing the resulting aligned set of gene structures, where exons are to scale. The important and original feature of GECA is that it combines these gene structures with a symbolic display highlighting sequence similarity between subsequent genes. It is worth noting that this combination of gene structure with the indications of similarities between related genes allows rapid identification of possible events of gain or loss of introns, or points to erroneous structural annotations. The output image is generated in a portable network graphics format which can be used for scientific publications.

Availability and implementation: Web-implemented version and source code are freely available at https://peroxibase.toulouse.inra.fr/geca_input_demo.php and a detailed example can be found at https://peroxibase.toulouse.inra.fr/geca_instructions.php

Contact: mathe@lrsv.ups-tlse.fr

Supplementary information: Supplementary data are available at Bioinformatics online.


Rknots: topological analysis of knotted biopolymers with R

Wed, 2012-05-09 12:19

Motivation: Rknots is a flexible R package providing tools for the detection and characterization of topological knots in biological polymers. The package is well documented and provides a simple syntax for data import and preprocessing, structure reduction, topological analyses and 2D and 3D visualization. Remarkably, Rknots is not limited to protein knots and allows researchers from interdisciplinary fields to analyze different topological structures and to develop simple yet fully custom pipelines.

Availability: Rknots is distributed under the GPL-2 license and is available from CRAN (the Comprehensive R Archive Network) at http://cran.r-project.org/web/packages/Rknots

Contact: federico.comoglio@bsse.ethz.ch

Supplementary information: Supplementary data are available at Bioinformatics online.


MageComet--web application for harmonizing existing large-scale experiment descriptions

Wed, 2012-05-09 12:19

Motivation: Meta-analysis of large gene expression datasets obtained from public repositories requires consistently annotated data. Curation of such experiments, however, is an expert activity which involves repetitive manipulation of text. Existing tools for automated curation are few, which creates a bottleneck in the analysis pipeline.

Results: We present MageComet, a web application for biologists and annotators that facilitates the re-annotation of gene expression experiments in MAGE-TAB format. It incorporates data mining, automatic annotation, use of ontologies and data validation to improve the consistency and quality of experimental meta-data from the ArrayExpress Repository.

Availability and implementation: Source and tutorials for MageComet are openly available at goo.gl/8LQPR under the GNU GPL v3 license. An implementation can be found at goo.gl/IdCuA

Contact: parkinson@ebi.ac.uk or xue.vin@gmail.com


OCAP: an open comprehensive analysis pipeline for iTRAQ

Wed, 2012-05-09 12:19

Motivation: Mass spectrometry-based iTRAQ protein quantification is a high-throughput assay for determining relative protein expressions and identifying disease biomarkers. Processing and analysis of these large and complex data involves a number of distinct components and it is desirable to have a pipeline to efficiently integrate these together. To date, there are few publicly available comprehensive analysis pipelines for iTRAQ data, and many of the existing pipelines have limited visualization tools and no convenient interfaces with downstream analyses. We have developed a new open source comprehensive iTRAQ analysis pipeline, OCAP, integrating a wavelet-based preprocessing algorithm which provides better peak picking, a new quantification algorithm and a suite of visualization tools. OCAP is mainly developed in C++ and is provided as a standalone version (OCAP_standalone) as well as an R package. The R package (OCAP) provides the necessary interfaces with downstream statistical analysis.

Availability: OCAP is freely available and can be downloaded at http://www.maths.usyd.edu.au/u/penghao

Contact: penghao.wang@sydney.edu.au


DADP: the database of anuran defense peptides

Wed, 2012-05-09 12:19

Summary: Anuran tissues, and especially skin, are a rich source of bioactive peptides and their precursors. We here present a manually curated database of antimicrobial and other defense peptides with a total of 2571 entries, most of them in the precursor form with demarcated signal peptide (SP), acidic proregion(s) and bioactive moiety(s) corresponding to 1923 non-identical bioactive sequences. Search functions on the corresponding web server facilitate the extraction of six distinct SP classes. The more conserved of these can be used for searching cDNA and UniProtKB databases for potential bioactive peptides, for creating PROSITE search patterns, and for phylogenetic analysis.

Availability: DADP is accessible at http://split4.pmfst.hr/dadp/

Contact: juretic@pmfst.hr

Supplementary information: Supplementary data are available at Bioinformatics online.


Metab2MeSH: annotating compounds with medical subject headings

Wed, 2012-05-09 12:19

Summary: Progress in high-throughput genomic technologies has led to the development of a variety of resources that link genes to functional information contained in the biomedical literature. However, tools that link small molecules to normal and diseased physiology, and to published data relevant to biologists and clinical investigators, are still lacking. With metabolomics rapidly emerging as a new omics field, the task of annotating small molecule metabolites becomes highly relevant. Our tool Metab2MeSH uses a statistical approach to reliably and automatically annotate compounds with concepts defined in Medical Subject Headings (MeSH), the National Library of Medicine's controlled vocabulary for biomedical concepts. These annotations provide links from compounds to biomedical literature and complement existing resources such as PubChem and the Human Metabolome Database.

Availability: http://metab2mesh.ncibi.org

Contact: akarnovs@umich.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
