Small molecule metabolism in biological systems
We contribute to the understanding and simulation of metabolism and its role in biological systems through open-source computational methods. We aim to decipher the structural identity and physical properties of biochemical metabolites in organisms. We document this information in open chemical information systems and develop cheminformatics and bioinformatics tools and methods. This includes, but is not limited to, methods for the elucidation of metabolomes, Computer-Assisted Structure Elucidation (CASE), the reconstruction of metabolic networks and algorithm development in chem- and bioinformatics.
Active research
Open Science, Open Source Software and FAIR, Open Data
As a research group, we look back on 20 years of dedication to the ideas of Open Science, Open Source Software and Open Data in cheminformatics and computational metabolomics. In chemistry, these ideas have been promoted by the Blue Obelisk movement, which represents the chemical scientists dedicated to them [1][2]. Today, the idea of FAIR data [3], that is, data which is Findable, Accessible, Interoperable and Reusable, resonates with many scientists, funders and learned societies. FAIR data, like open-source software, is a means to enable open science, which in turn is at the heart of the scientific method. We are active in establishing FAIR data resources in metabolomics and linking this work to the European Open Science Cloud (EOSC).
Computer-Assisted Structure Elucidation (CASE)
Computer-Assisted Structure Elucidation (CASE) methods [4] aim to determine the structure of metabolites by screening large candidate spaces against spectroscopic data. Our SENECA system [5] is based on a stochastic structure generator, which is guided by a spectroscopy-based scoring function. Simulated Annealing [6] and Evolutionary Algorithms [7] are at the core of the structure generation process and allow us to explore the structural space of isomers.
In order to perform the scoring, we need precise and fast methods for the prediction of mass and NMR spectra. Here, we employ machine learning methods such as support vector machines to correlate graph-based molecular descriptors with database knowledge [8]. The resulting prediction engines are then used as judges in our SENECA scoring function or elsewhere. Further, to effectively narrow down our search space during the structure determination process, we employ Natural Product (NP) likeness as a filter (for more details, see below). The stochastic CASE tool is valuable in structure elucidation of newly isolated Natural Products or unknown metabolites, information that is crucial for further metabolomic approaches.
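The guided stochastic search described above can be illustrated with a minimal simulated annealing sketch. This is not the SENECA implementation: the candidate "structure" is a toy list of integers, `predict_spectrum` is a hypothetical stand-in for a real spectrum predictor, and the swap move merely mimics a change that keeps the composition (the molecular formula) fixed. All function names and parameters are assumptions made for illustration.

```python
import math
import random

def predict_spectrum(candidate):
    # Toy stand-in for a spectrum predictor: one "peak" per adjacent pair.
    return [a + b for a, b in zip(candidate, candidate[1:])]

def score(candidate, target_spectrum):
    # Cost (lower is better): summed absolute deviation from the target peaks.
    predicted = predict_spectrum(candidate)
    return sum(abs(p - t) for p, t in zip(predicted, target_spectrum))

def mutate(candidate, rng):
    # Swap two positions, so the composition of the candidate never changes.
    i, j = rng.sample(range(len(candidate)), 2)
    neighbour = list(candidate)
    neighbour[i], neighbour[j] = neighbour[j], neighbour[i]
    return neighbour

def simulated_annealing(start, target_spectrum, steps=5000,
                        t_start=5.0, t_end=0.01, seed=0):
    rng = random.Random(seed)
    current = list(start)
    current_cost = score(current, target_spectrum)
    best, best_cost = current, current_cost
    for step in range(steps):
        # Geometric cooling schedule from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        candidate = mutate(current, rng)
        cost = score(candidate, target_spectrum)
        delta = cost - current_cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = candidate, cost
    return best, best_cost

# Usage: try to recover a hidden arrangement from its toy "spectrum".
hidden = [3, 1, 4, 1, 5, 9, 2, 6]
target = predict_spectrum(hidden)
start = sorted(hidden)
best, best_cost = simulated_annealing(start, target)
```

Accepting occasional worse candidates at high temperature is what lets the search escape local optima. In a real CASE setting the scoring function would compare predicted and measured spectra, and the moves would rearrange bonds in a molecular graph.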
Chemical information systems
Open Science requires FAIR and Open Data [9]. A number of our research projects would not have been possible without open chemical databases, a number of which we had to create. An early example is NMRShiftDB [10][11], a database of NMR spectra of organic compounds, without which our work on the prediction of NMR spectra with statistical and machine learning methods [12] would not have been possible.
Species-specific metabolome inference
In metabolomics and lipidomics experiments, only a minor fraction of the many small molecules detected can be identified. One of the reasons for this is the lack of adequate databases that contain extensive metabolome data (small molecules, reactions, association to enzymes, biological containers, etc.) in a species-specific way. To address this problem, we explore several ways of producing such resources: chemical unification of existing metabolism databases; text mining of small molecules, proteins, tissues/cell types, and organisms; and chemical enumeration of generic reactions and lipids. Through these approaches, we provide species-specific molecule catalogues that aim to improve the chances of researchers in metabolomics to identify detected small molecules.
Metabolism data integration, through a novel merge method, shows that merging metabolism resources significantly increases the size of the metabolite catalogue. This is complemented by a text mining pipeline, which analyzes PubMed abstracts and produces several thousand additional metabolites and relations between tissues and small molecules. Results retrieved only through text mining have a bias towards exogenous small molecules.
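As a rough illustration of how merging resources on chemical identity enlarges a catalogue, the following sketch unifies records that share a structure key. It is not the merge method referred to above: the record layout, the resource names and the placeholder keys (`KEY-0001` and so on, standing in for a structure identifier such as an InChIKey) are all hypothetical.

```python
from collections import defaultdict

def merge_catalogues(resources):
    """Merge records from several resources, unifying entries with the same structure key."""
    merged = defaultdict(lambda: {"names": set(), "sources": set()})
    for source_name, records in resources.items():
        for record in records:
            entry = merged[record["structure_key"]]
            entry["names"].add(record["name"])
            entry["sources"].add(source_name)
    return dict(merged)

# Two hypothetical resources that overlap in one metabolite.
resources = {
    "resource_a": [
        {"structure_key": "KEY-0001", "name": "D-glucose"},
        {"structure_key": "KEY-0002", "name": "pyruvate"},
    ],
    "resource_b": [
        {"structure_key": "KEY-0001", "name": "dextrose"},
        {"structure_key": "KEY-0003", "name": "citrate"},
    ],
}

catalogue = merge_catalogues(resources)
for key, entry in sorted(catalogue.items()):
    print(key, sorted(entry["names"]), sorted(entry["sources"]))
```

Keying on structure rather than on names means that synonyms collapse into one entry, while metabolites present in only one resource still extend the merged catalogue.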
When generic reactions from the previous sets are enumerated, the number of small molecules generated grows exponentially, and only a few paths lead to known metabolites. To narrow down the results, we explore methods that rely on thermodynamic feasibility, catalogue lookups, and reaction similarity.
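The growth of the enumerated space and the effect of a catalogue lookup can be seen in a toy sketch. Molecules are represented as plain strings and the two "generic reactions" are hypothetical rules that append a building block. Real enumeration would operate on chemical structures, and the catalogue contents here are invented for illustration.

```python
def enumerate_products(seeds, rules, rounds):
    """Apply every rule to every newly generated molecule for a fixed number of rounds."""
    frontier = set(seeds)
    seen = set(seeds)
    sizes = [len(seen)]  # cumulative number of distinct molecules after each round
    for _ in range(rounds):
        frontier = {rule(m) for m in frontier for rule in rules} - seen
        seen |= frontier
        sizes.append(len(seen))
    return seen, sizes

# Hypothetical generic "reactions": each extends a molecule by one building block.
rules = [lambda m: m + "A", lambda m: m + "B"]

products, sizes = enumerate_products({"S"}, rules, rounds=6)

# Catalogue lookup: keep only enumerated molecules that are already known.
known_catalogue = {"SAB", "SBBA", "SXYZ"}
supported = products & known_catalogue

print(sizes)              # [1, 3, 7, 15, 31, 63, 127]
print(sorted(supported))  # ['SAB', 'SBBA']
```

With only two rules the candidate set roughly doubles every round, while the lookup retains two molecules. This is the imbalance that filters such as thermodynamic feasibility and reaction similarity are meant to reduce.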
Historic projects, listed for documentation purposes
Generating and curating high-quality metabolic models using chemical structure
Creating a high-quality genome-scale metabolic model reconstruction requires meticulous manual curation and can take several years to finish. Consequently, many automated pipelines and curation tools have emerged to assist in the process. Despite the tools available, there remains a disconnect, with extensive curation still required on automatically produced draft reconstructions. We have developed a flexible desktop application (Metingear) and library (MDK) [13] that allows the development of new and existing models utilising the chemical structure of metabolites. The chemical structure can be utilised for unambiguous metabolite identification, which is important when comparing and merging existing models.
Polyketide structure prediction
Polyketides are complex small molecules, mostly of high molecular weight, that are produced mainly by secondary metabolism in bacteria and fungi and have a wide variety of applications. Huge modular enzymes, called polyketide synthases (PKS), assemble polyketides through a series of elongation steps in which malonyl-CoA derivatives are added (but only a C2-unit is incorporated due to decarboxylation), in a way similar to fatty acid synthesis. Well-known examples of polyketides are erythromycin and the tetracyclines. Polyketides, in general, have found applications as antibiotics, anti-tumoral agents, antifungals, insecticides and growth factors, among others.
Working on trans-AT polyketides, and in close collaboration with Prof. Piel at ETH Zürich, we have implemented an algorithm for the recognition of fine-grained keto synthase domain variants, which allows structural hypotheses to be produced for a polyketide starting from the sequence of the polyketide synthase that assembles it.
Structure identification in mass spectrometry-based metabolomics
Part of our research comprised the development and implementation of methods to analyze tandem mass spectrometry (MSn) data in metabolomics. In recent years, tandem and accurate mass MS have become the techniques of choice to study the metabolome, with various instruments and methods being available to cover the whole metabolome landscape. However, the chemical diversity of the metabolome and a lack of accepted reporting standards make the analysis inherently challenging and time-consuming. Typical mass spectrometry-based studies generate complex data where the signals of interest are obscured by systematic and random noise. Proper data preprocessing and subsequent peak detection and extraction are essential for compound identification.
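To make the peak detection step concrete, here is a minimal sketch that picks local maxima rising above a robust noise estimate. It is a generic illustration, not the method used in MassCascade: the median/MAD noise model, the `snr` parameter and the example trace are assumptions chosen for simplicity.

```python
from statistics import median

def detect_peaks(intensities, snr=3.0):
    """Return indices of local maxima that exceed baseline + snr * MAD."""
    baseline = median(intensities)
    # Median absolute deviation: a noise estimate that is robust to the peaks themselves.
    mad = median([abs(x - baseline) for x in intensities])
    threshold = baseline + snr * mad
    peaks = []
    for i in range(1, len(intensities) - 1):
        is_local_max = (intensities[i] > intensities[i - 1]
                        and intensities[i] >= intensities[i + 1])
        if is_local_max and intensities[i] > threshold:
            peaks.append(i)
    return peaks

# Hypothetical intensity trace: low-level noise with two peaks.
trace = [2, 3, 2, 3, 2, 40, 90, 45, 3, 2, 3, 2, 2, 30, 60, 28, 3, 2, 3, 2]
peak_indices = detect_peaks(trace)
print(peak_indices)  # [6, 14]
```

The small fluctuations between 2 and 3 are local maxima too, but they stay below the threshold and are discarded as noise. Production workflows typically add smoothing, baseline correction and peak-shape checks before extraction.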
We worked on a modular workflow-based MS data analysis system to facilitate efficient compound identification and to further open standards and free data/methods exchange in the field of metabolomics. Choosing a non-commercial, open-source workflow environment guarantees unrestricted accessibility and gives the data analyst the advantage of having a variety of analytical pre- and post-processing methods available from the community. Ongoing efforts include the development of the MassCascade library [14][15] and the KNIME plugin, adoption of open standards from the Metabolomics Standards Initiative, and implementation of robust methods for peak identification going beyond simple mass and spectra similarity queries.
References
- (2006): The Blue Obelisk - Interoperability in Chemical Informatics. In: Journal of Chemical Information and Modeling, vol. 46, no. 3, pp. 991–998, 2006.
- (2011): Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on. In: Journal of Cheminformatics, vol. 3, no. 1, pp. 37, 2011.
- (2016): The FAIR Guiding Principles for scientific data management and stewardship. In: Scientific Data, vol. 3, pp. 160018, 2016.
- (2004): Recent developments in automated structure elucidation of natural products. In: Natural Product Reports, vol. 21, no. 4, pp. 512–518, 2004.
- (2001): SENECA: A platform-independent, distributed, and parallel system for computer-assisted structure elucidation in organic chemistry. In: Journal of Chemical Information & Computer Sciences, vol. 41, no. 6, pp. 1500–1507, 2001.
- (2001): SENECA: A platform-independent, distributed, and parallel system for computer-assisted structure elucidation in organic chemistry. In: Journal of Chemical Information & Computer Sciences, vol. 41, no. 6, pp. 1500–1507, 2001.
- (2004): Evolutionary-algorithm-based strategy for computer-assisted structure elucidation. In: Journal of Chemical Information & Computer Sciences, vol. 44, no. 2, pp. 489–498, 2004.
- (2008): Building blocks for automated elucidation of metabolites: machine learning methods for NMR prediction. In: BMC Bioinformatics, vol. 9, no. 1, pp. 400, 2008.
- (2016): The FAIR Guiding Principles for scientific data management and stewardship. In: Scientific Data, vol. 3, pp. 160018, 2016.
- (2003): NMRShiftDB - Constructing a free chemical information system with open-source components. In: Journal of Chemical Information & Computer Sciences, vol. 43, no. 6, pp. 1733–1739, 2003.
- (2004): NMRShiftDB -- compound identification and structure elucidation support through a free community-built web database. In: Phytochemistry, vol. 65, no. 19, pp. 2711–2717, 2004.
- (2008): Building blocks for automated elucidation of metabolites: machine learning methods for NMR prediction. In: BMC Bioinformatics, vol. 9, no. 1, pp. 400, 2008.
- (2013): Metingear: a development environment for annotating genome-scale metabolic models. In: Bioinformatics, vol. 29, no. 17, pp. 2213–2215, 2013.
- (2014): MassCascade: Visual Programming for LC-MS Data Processing in Metabolomics. In: Molecular Informatics, vol. 33, no. 4, pp. 307–310, 2014.
- (2013): KNIME-CDK: Workflow-driven cheminformatics. In: BMC Bioinformatics, vol. 14, no. 1, pp. 257, 2013.