MARCUS: Molecular Annotation and Recognition for Curating Unravelled Structures

Our new article, “MARCUS: molecular annotation and recognition for curating unravelled structures”, has been published in the RSC journal Digital Discovery. MARCUS is a tool designed for curating natural product literature, integrating COCONUT-aware schema mapping, CIP-based stereochemical validation, and human-in-the-loop structure refinement. This integrated web-based platform combines automated text annotation, multi-engine optical chemical structure recognition (OCSR), and direct submission to the COCONUT database. MARCUS employs a fine-tuned GPT-4 model to extract chemical entities and utilises a human-in-the-loop ensemble approach integrating DECIMER, MolNexTR, and MolScribe for structure recognition. The platform aims to streamline the data extraction workflow from PDF upload to database submission, significantly reducing curation time. MARCUS bridges the gap between unstructured chemical literature and machine-actionable databases, enabling the application of FAIR data principles and facilitating AI-driven chemical discovery. Through open-source code, accessible models, and comprehensive documentation, the web application enhances accessibility and promotes community-driven development. This approach facilitates unrestricted use and encourages the collaborative advancement of automated chemical literature curation tools.
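To give a feel for what a human-in-the-loop ensemble over several OCSR engines can look like, here is a minimal sketch. It is not taken from the MARCUS code base: the function name `consensus_smiles`, the two-vote threshold, and the exact string comparison of SMILES are all assumptions made for illustration. A real pipeline would most likely canonicalise the structures with a cheminformatics toolkit before comparing them.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def consensus_smiles(
    predictions: Dict[str, Optional[str]],
) -> Tuple[Optional[str], bool]:
    """Hypothetical helper: pick a consensus SMILES from per-engine outputs.

    Returns (smiles, needs_review). needs_review is True whenever fewer
    than two engines agree, so that a human curator checks the structure.
    """
    # Drop engines that returned nothing (None or an empty string).
    valid = [smiles for smiles in predictions.values() if smiles]
    if not valid:
        return None, True

    # Assumption: predictions are compared as plain strings.
    smiles, votes = Counter(valid).most_common(1)[0]

    # Accept automatically only when at least two engines agree.
    return smiles, votes < 2


if __name__ == "__main__":
    example = {"DECIMER": "CCO", "MolNexTR": "CCO", "MolScribe": "OCC"}
    print(consensus_smiles(example))
```

In this sketch, a structure on which all engines disagree is still returned (the first valid prediction) but is flagged for review, which mirrors the idea of leaving the final decision to the curator.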

We are pleased to dedicate MARCUS to Dr Marcus Ennis, the longest-serving curator of the ChEBI database, on the occasion of his 75th birthday.

Rajan K, Weissenborn VK, Lederer L, Zielesny A, Steinbeck C (2025) MARCUS: Molecular annotation and recognition for curating unravelled structures. Digit Discov. https://doi.org/10.1039/d5dd00313j