Schedule subject to change
All times listed are in BST
Wednesday, July 23rd
11:20-11:40
Knowledge Graph–Powered and LLM-Assisted Microbial Growth Predictions: Integrating Symbolic Rule Mining, Boosted Trees, and Deep Graph-Based Models
Confirmed Presenter: Marcin Joachimiak, Lawrence Berkeley National Laboratory, United States

Room: 03A
Format: In person

Moderator(s): Tiffany Callahan


Authors List:

  • Marcin Joachimiak, Lawrence Berkeley National Laboratory, United States

Presentation Overview:

Joachimiak MP

BOKR

Predicting microbial growth preferences has broad impacts in biotechnology, healthcare, and environmental management. By identifying the media and conditions conducive to the growth of an organism, researchers can streamline strain selection for industrial processes, develop targeted antimicrobials, and uncover metabolic pathways for biodegradation or bioproduction. However, microbial cultivation remains an unsolved challenge, with only a small fraction of microbial taxa easily culturable.
Microorganisms are diverse in their metabolic capabilities and growth preferences, though much of this knowledge remains fragmented and locked in unstructured text. To address this, we developed KG-Microbe, a knowledge graph (KG) of over 800,000 bacterial and archaeal taxa, 3,000 types of complex traits, and 30,000 types of genome functional annotations. Built using a reproducible pipeline grounded in ontologies, KG-Microbe supports a spectrum of applications, such as predicting growth conditions and traits of microbes, interpreting metagenomics and other omics data, and providing recommendations for bioengineering and biomanufacturing.
Using KG-Microbe, we constructed machine learning pipelines to predict microbial growth preferences using different combinations of KG-derived input data types with (1) symbolic rule mining, producing editable, human-readable explanations, (2) gradient boosted decision trees, and (3) deep graph-based models, which can achieve higher accuracy but with lower transparency. We demonstrate that symbolic rule mining can match the performance of “black box” methods, while boosted tree models yielded a mean precision of 70% across 46 diverse cultivation media. To help interpret and validate these predictions, we show that Large Language Models (LLMs) can be used to synthesize and explain model outputs. By comparing the models and their results, we identified key data features, data type biases, and knowledge gaps relevant to predicting growth preferences. We also use KG-Microbe embedding vector analogies and complex semantic queries across combinations of organismal traits to generate hypotheses and identify target organisms with specific properties.
Our work highlights the capabilities of a KG-driven approach and the trade-offs between model interpretability and predictive performance. These findings motivate the development of hybrid AI/ML approaches that combine model transparency with enhanced data utilization and predictive performance to advance microbial cultivation.
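
The embedding vector analogies mentioned above can be illustrated with a minimal sketch. The entity names and 3-dimensional vectors below are hypothetical and are not drawn from KG-Microbe; the sketch only shows the general "a is to b as c is to ?" arithmetic on embeddings, not the presenter's actual pipeline.

```python
import math

# Hypothetical toy embeddings; real KG embeddings are learned and much larger.
EMBEDDINGS = {
    "organism_A": [1.0, 0.0, 0.0],
    "medium_A": [1.0, 1.0, 0.0],
    "organism_B": [0.0, 0.0, 1.0],
    "medium_B": [0.0, 1.0, 1.0],
    "medium_C": [1.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' by ranking candidates against b - a + c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = [name for name in embeddings if name not in {a, b, c}]
    return max(candidates, key=lambda name: cosine(embeddings[name], target))


# organism_A grows on medium_A; which entity plays that role for organism_B?
best_match = analogy("organism_A", "medium_A", "organism_B", EMBEDDINGS)
print(best_match)
```

In this toy setup the offset between an organism and its medium is the same for both pairs, so the query vector lands exactly on one candidate; with learned embeddings the result is a ranked list of hypotheses to check rather than a definitive answer.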

11:40-12:00
ProDiGenIDB – a unified resource of disease-associated genes, their protein products, and intrinsic disorder annotations
Confirmed Presenter: Jovana Kovacevic, Faculty of Mathematics, Belgrade University, Belgrade, Serbia

Room: 03A
Format: In person

Moderator(s): Tiffany Callahan


Authors List:

  • Jovana Kovacevic, Faculty of Mathematics, Belgrade University, Belgrade, Serbia
  • Anđelka Zečević, Mathematical Institute, Serbian Academy of Sciences and Arts, Belgrade, Serbia
  • Lazar Vasović, Faculty of Mathematics, Belgrade University, Belgrade, Serbia

Presentation Overview:

Understanding gene-disease associations is essential in biomedical research, yet relevant information is often distributed across multiple heterogeneous databases. To overcome this inconsistency, we developed ProDiGenIDB, an integrated database that consolidates gene-disease relationships from several recognized and publicly available sources, while also enriching them with complementary data on gene and protein identifiers, disease ontology, and protein structural disorder.

ProDiGenIDB brings together over 400,000 curated associations sourced from DisGeNet, COSMIC, HumsaVar, Orphanet, ClinVar, HPO, and DISEASES. Each entry includes gene-related metadata (Gene Symbol, Entrez ID, UniProt ID, Ensembl ID), disease descriptors (Disease Name, DOID), and a reference to the original source database.

Importantly, the database also incorporates predicted intrinsic disorder information for proteins encoded by the associated genes. These predictions were generated using commonly used protein disorder prediction tools such as IUPred and VSL2, providing additional insight into the potential lack of structure in disease-related proteins.

Another important aspect of the database construction involved mapping disease names to standardized Disease Ontology IDs (DOIDs). To improve this process, we applied Natural Language Processing (NLP) techniques using advanced text representation models to enhance the accuracy and consistency of term association.
ProDiGenIDB represents a valuable resource for integrative biomedical studies, particularly in contexts where protein disorder is hypothesized to play a functional or pathological role.

12:00-12:20
Causal knowledge graph analysis identifies adverse drug effects
Confirmed Presenter: Sumyyah Toonsi, King Abdullah University of Science and Technology, Saudi Arabia

Room: 03A
Format: In person

Moderator(s): Tiffany Callahan


Authors List:

  • Sumyyah Toonsi, King Abdullah University of Science and Technology, Saudi Arabia
  • Paul Schofield, Cambridge University, United Kingdom
  • Robert Hoehndorf, King Abdullah University of Science and Technology, Saudi Arabia

Presentation Overview:

Motivation: Knowledge graphs and structural causal models have each proven valuable for organizing biomedical knowledge and estimating causal effects, but remain largely disconnected: knowledge graphs encode qualitative relationships focusing on facts and deductive reasoning without formal probabilistic semantics, while causal models lack integration with background knowledge in knowledge graphs and have no access to the deductive reasoning capabilities that knowledge graphs provide.
Results: To bridge this gap, we introduce a novel formulation of Causal Knowledge Graphs (CKGs) which extend knowledge graphs with formal causal semantics, preserving their deductive capabilities while enabling principled causal inference. CKGs support deconfounding via explicitly marked causal edges and facilitate hypothesis formulation aligned with both encoded and entailed background knowledge. We constructed a Drug–Disease CKG (DD-CKG) integrating disease progression pathways, drug indications, side-effects, and hierarchical disease classification to enable automated large-scale mediation analysis. Applied to UK Biobank and MIMIC-IV cohorts, we tested whether drugs mediate effects between indications and downstream disease progression, adjusting for confounders inferred from the DD-CKG. Our approach successfully reproduced known adverse drug reactions with high precision while identifying previously undocumented significant candidate adverse effects. Further validation through side effect similarity analysis demonstrated that combining our predicted drug effects with established databases significantly improves the prediction of shared drug indications, supporting the clinical relevance of our novel findings. These results demonstrate that our methodology provides a generalizable, knowledge-driven framework for scalable causal inference.

12:20-12:40
CROssBARv2: A Unified Biomedical Knowledge Graph for Heterogeneous Data Representation and LLM-Driven Exploration
Confirmed Presenter: Bünyamin Şen, Hacettepe University, Turkey

Room: 03A
Format: In person

Moderator(s): Tiffany Callahan


Authors List:

  • Bünyamin Şen, Hacettepe University, Turkey
  • Erva Ulusoy, Hacettepe University, Turkey
  • Melih Darcan, Hacettepe University, Turkey
  • Mert Ergün, Hacettepe University, Turkey
  • Tunca Dogan, Hacettepe University, Turkey

Presentation Overview:

Developing effective therapeutics against prevalent diseases requires a deep understanding of molecular, genetic, and cellular factors involved in disease development/progression. However, such knowledge is dispersed across different databases, publications, and ontologies, making collecting, integrating and analysing biological data a major challenge. Here, we present CROssBARv2, an extended and improved version of our previous work (https://crossbar.kansil.org/), a heterogeneous biological knowledge graph (KG) based system to facilitate systems biology and drug discovery/repurposing. CROssBARv2 collects large-scale biological data from 32 data sources and stores them in a Neo4j graph database. CROssBARv2 consists of 2,709,502 nodes and 12,688,124 relationships between 14 node types. On top of that, we developed a GraphQL API and a large language model (LLM) interface that translates between users’ natural-language queries and Neo4j's Cypher query language, so that users can access information within the KG and answer specific scientific questions without LLM hallucinations; the interface is mainly intended to facilitate use of the resource. To evaluate the capability of CROssBAR-LLMs (LLMs augmented with structured knowledge in CROssBAR) in answering biomedical questions, we constructed multiple benchmark datasets and employed an independent benchmark to systematically compare various open- and closed-source LLMs. Our results revealed that CROssBAR-LLMs display significantly improved accuracy in answering these scientific questions compared to standalone LLMs and even LLMs augmented with web search. CROssBARv2 (https://crossbarv2.hubiodatalab.com/) is expected to contribute to life sciences research, particularly regarding (i) the discovery of disease mechanisms at the molecular level and (ii) the development of effective personalised therapeutic strategies.

12:40-12:45
Benchmarking Data Leakage on Link Prediction in Biomedical Knowledge Graph Embeddings
Confirmed Presenter: Galadriel Brière, Aix Marseille Univ, INSERM, MMG, Marseille, France

Room: 03A
Format: In person

Moderator(s): Tiffany Callahan


Authors List:

  • Galadriel Brière, Aix Marseille Univ, INSERM, MMG, Marseille, France
  • Thomas Stosskopf, Aix Marseille Univ, INSERM, MMG, Marseille, France
  • Benjamin Loire, Aix Marseille Univ, INSERM, MMG, Marseille, France
  • Anaïs Baudot, Aix Marseille Univ, INSERM, MMG, Marseille, France

Presentation Overview:

In recent years, Knowledge Graphs (KGs) have gained significant attention for their ability to organize massive biomedical knowledge into entities and relationships. Knowledge Graph Embedding (KGE) models facilitate efficient exploration of KGs by learning compact data representations. These models are increasingly applied to biomedical KGs for various tasks, notably link prediction, which enables applications such as drug repurposing.

The research community has implemented benchmarks to evaluate and compare the large diversity of KGE models. However, existing benchmarks often overlook the issue of Data Leakage (DL), which can lead to inflated performance and compromise the validity of benchmark results. DL may occur due to inadequate separation between training and test sets (DL1), use of illegitimate features (DL2), or evaluation settings that fail to reflect real-world inference conditions (DL3).

In this study, we implement systematic procedures to detect and mitigate these sources of DL. We evaluate popular KGE models on a biomedical KG and show that inappropriate data separation (DL1) artificially inflates model performance and that models do not rely on node degree as a shortcut feature (DL2). For DL3, we implement realistic inference conditions with i) a zero-shot training procedure in which drugs in test and validation sets have no known indications during training and ii) a drug repurposing ground-truth for rare diseases. Performance collapses in both scenarios.

Our findings highlight the need for more rigorous evaluation protocols and raise concerns about the reliability of current KGE models for real-world biomedical applications such as drug repurposing.
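
As a rough illustration of the train/test separation issue described above (DL1), the sketch below flags test triples whose entity pair already appears in the training set, in either direction. This is one plausible check among several and is an assumption, not the authors' procedure; the function name and the toy triples are hypothetical.

```python
def find_leaky_test_triples(train, test):
    """Return test triples whose (head, tail) pair already occurs in training.

    Pairs are compared without regard to direction or relation type, so an
    inverse edge in the training set also counts as overlap.
    """
    seen_pairs = {frozenset((head, tail)) for head, _relation, tail in train}
    return [
        (head, relation, tail)
        for head, relation, tail in test
        if frozenset((head, tail)) in seen_pairs
    ]


# Hypothetical toy triples, for illustration only.
train = [
    ("drug_1", "indicated_for", "disease_1"),
    ("disease_2", "has_indication", "drug_2"),
]
test = [
    ("drug_2", "indicated_for", "disease_2"),  # inverse of a training edge
    ("drug_1", "indicated_for", "disease_3"),  # pair not seen in training
]

leaky = find_leaky_test_triples(train, test)
print(leaky)
```

Removing or reassigning the flagged triples before evaluation keeps a model from scoring well simply by memorising an inverse or redundant edge.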

12:45-12:50
A machine learning framework for extracting and structuring biological pathway knowledge from scientific literature
Confirmed Presenter: Mun Su Kwon, Korea Advanced Institute of Science and Technology (KAIST), South Korea

Room: 03A
Format: In person

Moderator(s): Tiffany Callahan


Authors List:

  • Mun Su Kwon, Korea Advanced Institute of Science and Technology (KAIST), South Korea
  • Junkyu Lee, Korea Advanced Institute of Science and Technology (KAIST), South Korea
  • Haechan Sung, Korea Advanced Institute of Science and Technology (KAIST), South Korea
  • Hyun Uk Kim, Korea Advanced Institute of Science and Technology (KAIST), South Korea

Presentation Overview:

Advances in text mining have significantly improved the accessibility of scientific knowledge from literature. However, a major challenge in biology and biotechnology remains the extraction of information embedded within biological pathway images, which are not easily accessible through conventional text-based methods. To overcome this limitation, we present a machine learning–based framework called “Extraction of Biological Pathway Information” (EBPI). The framework systematically retrieves relevant publications based on user-defined queries, identifies biological pathway figures, and extracts structured information such as genes, enzymes, and metabolites. EBPI combines image processing and natural language models to identify text in diagrams, classify terms into biological categories, and infer biochemical reaction directionality using graphical cues such as arrows. The extracted information is output in an editable, tabular format suitable for integration with pathway databases and knowledge graphs. Validated against manually curated pathway maps, EBPI enables scalable knowledge extraction from complex visual data of biological pathways and opens new directions for automated literature curation across many biological disciplines.

12:50-13:00
Invited Presentation: Poster Madness
Room: 03A
Format: In person

Moderator(s): Tiffany Callahan


Presentation Overview:

Each accepted poster presenter is given up to 1 minute to advertise their poster.

14:00-14:20
Proceedings Presentation: scGOclust: leveraging gene ontology to find functionally analogous cell types between distant species
Confirmed Presenter: Yuyao Song, European Bioinformatics Institute, United Kingdom

Room: 03A
Format: In person

Moderator(s): Robert Hoehndorf


Authors List:

  • Yuyao Song, European Bioinformatics Institute, United Kingdom
  • Yanhui Hu, Department of Genetics, Harvard Medical School, United States
  • Julian Dow, School of Molecular Biosciences, University of Glasgow, United Kingdom
  • Norbert Perrimon, Department of Genetics, Harvard Medical School and Howard Hughes Medical Institute, United States
  • Irene Papatheodorou, European Bioinformatics Institute; Earlham Institute and University of East Anglia, United Kingdom

Presentation Overview:

Basic biological processes are shared across animal species, yet their cellular mechanisms are profoundly diverse. Comparing cell-type gene expression between species reveals conserved and divergent cellular functions. However, as phylogenetic distance increases, gene-based comparisons become less informative. The Gene Ontology (GO) knowledgebase offers a solution by serving as the most comprehensive resource of gene functions across a vast diversity of species, providing a bridge for distant species comparisons. Here, we present scGOclust, a computational tool that constructs de novo cellular functional profiles using GO terms, facilitating systematic and robust comparisons within and across species. We applied scGOclust to analyse and compare the heart, gut and kidney between mouse and fly, and whole-body data from C. elegans and H. vulgaris. We show that scGOclust effectively recapitulates the functional spectrum of different cell types, characterises functional similarities between homologous cell types, and reveals functional convergence between unrelated cell types. Additionally, we identified subpopulations within the fly crop that show circadian rhythm-regulated secretory properties and hypothesize an analogy between fly principal cells from different segments and distinct mouse kidney tubules. We envision scGOclust as an effective tool for uncovering functionally analogous cell types or organs across distant species, offering fresh perspectives on evolutionary and functional biology.

14:20-14:40
Integrating autoantibody-related knowledge in an ontology populated using a curated dataset from literature
Confirmed Presenter: Fabien Maury, Inserm, France

Room: 03A
Format: In person

Moderator(s): Robert Hoehndorf


Authors List:

  • Fabien Maury, Inserm, France
  • Solène Grosdidier, BlueMed Writing, Netherlands
  • Killian Halberda, Inserm, France
  • Isabelle Desguerre, AP-HP, France
  • Adrien Coulet, Inria, France
  • Maud de Dieuleveult, Inserm, France

Presentation Overview:

Autoimmune diseases (AIDs) are often characterized by the presence of autoantibodies (AAbs). But many of these diseases are rare and can be hard to diagnose, partly due to the lack of easily accessible knowledge, such as which type of AAb to test for in order to diagnose a particular AID. Indeed, to our knowledge, no centralized resource including all available knowledge related to human autoantibodies exists as of 04-2025.
To fill this gap, first, we introduce a light ontology that represents relationships among AAbs, their molecular targets, and the related AIDs and their clinical signs. This ontology also allows the provenance of the relationships to be specified, by reusing the PROV-O ontology.
Second, we introduce the MAKAAO Core dataset, a dataset compiled manually from the literature by several curators. MAKAAO Core includes the name and synonyms (both in English and French) of over 350 autoantibodies, along with their targets and associated AIDs. Targets and diseases are referred to using identifiers from reference resources.
We used this dataset to populate our ontology, and named the result the MAKAAO knowledge graph (MAKAAO KG), which constitutes the central part of a future reference resource.

14:40-15:00
Ontology pre-training improves machine learning predictions of aqueous solubility and other metabolite properties
Room: 03A
Format: In person

Moderator(s): Robert Hoehndorf


Authors List:

  • Charlotte Tumescheit, University of Zurich, Swiss Institute of Bioinformatics, Switzerland
  • Martin Glauer, Otto-von-Guericke University Magdeburg, Germany
  • Simon Flügel, Osnabrück University, Germany
  • Fabian Neuhaus, Otto-von-Guericke University Magdeburg, Germany
  • Till Mossakowski, Osnabrück University, Germany
  • Janna Hastings, University of Zurich, Swiss Institute of Bioinformatics, University of St. Gallen, Switzerland

Presentation Overview:

Predicting properties of small molecule metabolites from structures is a challenging task. Molecular language models have emerged as a highly performant AI approach for prediction of diverse properties directly from ‘language-like’ representations of the structures of molecules. However, for many prediction problems, there is a shortage of available training data and model performance is still limited.

Integrating expert knowledge into language models has the potential to improve performance on prediction tasks and model generalisability. Bio-ontologies offer curated knowledge ideal for this purpose. Here, we demonstrate a novel approach to knowledge injection, ‘ontology pre-training’, which we have previously shown to work for a pilot case study in the classification task of toxicity prediction. Now, we extend this to regression tasks such as solubility prediction.

First, we pre-train a Transformer-based language model on molecules from PubChem. Then, using our novel method, we embed the knowledge contained in a classification hierarchy derived from the ChEBI ontology into the model as an intermediate training step between general-purpose pre-training and task-specific fine-tuning. Finally, we fine-tune the models on a range of regression tasks. We find a clear improvement in performance and training times across the diverse prediction tasks.

Our results show that adding a knowledge-based training step to a machine learning model can improve performance. Our method is intuitive and generalisable, and we plan to extend it to further biological modalities and prediction datasets, including proteins and RNA, as well as to explore the impact of different ontologies.

15:00-15:20
Building the Aging Biomarkers Ontology and Its Applications in Aging Research
Room: 03A
Format: In person

Moderator(s): Robert Hoehndorf


Authors List:

  • Hande McGinty, Kansas State University, Manhattan KS, United States
  • Srikar Reddy Gadusu, Kansas State University, United States
  • Yigit Kucuk, KONCORDANT Lab, United States
  • Aaron King, Aeon Biomarkers, United States

Presentation Overview:

Aging is a complex biological process shaped by numerous biomarkers—such as cholesterol and blood sugar levels—that serve as measurable indicators of health and disease. Despite the abundance of biomarker data, identifying meaningful patterns and relationships remains a significant challenge. To address this, we began developing the Aging Biomarkers Ontology (ABO), a structured framework that formally defines aging-related biomarkers, organizes them hierarchically, and maps their interconnections to facilitate deeper analysis. Furthermore, we employed two complementary approaches to enrich the graph and uncover hidden associations among aging biomarkers: Depth-Limited Search (DLS) and machine learning-based embedding search. DLS identifies associations by traversing connected nodes within a predefined depth, while the embedding-based method encodes biomarker relationships as numerical vectors and uses cosine similarity to predict potential links. We evaluated the performance of both methods in detecting known and novel relationships. Our results demonstrate the value of systematically integrating statistical analysis with graph-based reasoning and machine learning to explore aging-related biomarkers. The resulting framework enhances the interpretability of biomarker data, supports hypothesis generation, and contributes to advancing biomedical research in aging and longevity.
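
The two search strategies named above can be sketched in a few lines. The graph, node names, and function names below are hypothetical and are not taken from ABO; the sketch only shows what a depth-limited traversal and a cosine-similarity score compute in general, not the authors' implementation.

```python
import math

# Hypothetical toy graph; node names are illustrative, not ABO terms.
GRAPH = {
    "cholesterol": ["LDL", "HDL"],
    "LDL": ["atherosclerosis"],
    "HDL": [],
    "atherosclerosis": ["cardiovascular aging"],
    "cardiovascular aging": [],
}


def depth_limited_search(graph, start, max_depth):
    """Map each node reachable from `start` within `max_depth` edges to its depth."""
    found = {}
    stack = [(start, 0)]
    while stack:
        node, depth = stack.pop()
        # Skip nodes already reached by a path that is at least as short.
        if node in found and found[node] <= depth:
            continue
        found[node] = depth
        if depth < max_depth:
            for neighbor in graph.get(node, []):
                stack.append((neighbor, depth + 1))
    found.pop(start)
    return found


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


within_two_hops = depth_limited_search(GRAPH, "cholesterol", 2)
print(within_two_hops)
```

The depth limit bounds how indirect an association may be, while a cosine score over embeddings can propose links between nodes that share no path at all, which is one reason the two approaches complement each other.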

15:20-15:40
Discovering cellular contributions to disease pathogenesis in the NLM Cell Knowledge Network
Confirmed Presenter: Richard Scheuermann, Division of Intramural Research, National Library of Medicine, United States

Room: 03A
Format: In person

Moderator(s): Robert Hoehndorf


Authors List:

  • Richard Scheuermann, Division of Intramural Research, National Library of Medicine, United States
  • Anne Deslattes Mays, Division of Intramural Research, National Library of Medicine, United States
  • Matthew Diller, Division of Intramural Research, National Library of Medicine, United States
  • Caroline Eastwood, Wellcome Sanger Institute, United Kingdom
  • Rezarta Islamaj, Division of Intramural Research, National Library of Medicine, United States
  • James Leaman, Division of Intramural Research, National Library of Medicine, United States
  • Raymond LeClair, Division of Intramural Research, National Library of Medicine, United States
  • Zhiyong Lu, Division of Intramural Research, National Library of Medicine, United States
  • Chris Mungall, Lawrence Berkeley National Laboratory, United States
  • Vinh Nguyen, Division of Intramural Research, National Library of Medicine, United States
  • David Osumi-Sutherland, Wellcome Sanger Institute, United Kingdom
  • Beverly Peng, J. Craig Venter Institute, United States
  • Noam Rotenberg, Division of Intramural Research, National Library of Medicine, United States
  • William Spear, Division of Intramural Research, National Library of Medicine, United States
  • Bingfang Xu, Division of Intramural Research, National Library of Medicine, United States
  • Yun Zhang, Division of Intramural Research, National Library of Medicine, United States

Presentation Overview:

Knowledge about the role of genes in disease pathogenesis has been obtained from genetic and genome-wide association studies. The proteins encoded by these genes are frequently found to be effective therapeutic targets. However, little is known about which cells are the functional home of these disease-associated genes and proteins. Single cell genomic technologies are now revealing the cellular complexity of human tissues at high resolution. The transcriptomes defined by these technologies reflect the functional cellular phenotypes. Database resources that capture and disseminate data derived from these single cell technologies have been developed. But the knowledge derived from their analysis and interpretation is largely buried as free text in the scientific literature.
Here we describe the development of a Cell Knowledge Network (CKN) prototype at the National Library of Medicine (NLM) that captures and exposes knowledge about cell phenotypes (cell types and states) derived from single cell technologies and related experiments. NLM-CKN is populated using validated computational analysis pipelines and natural language processing of the scientific literature and integrated with other sources of relevant knowledge about genes, anatomical structures, diseases, and drugs.
Using this integration of experimental sc/snRNAseq data with prior knowledge about disease predispositions and drug targets, a novel linkage between lung pericytes and pulmonary hypertension was discovered through the KCNK3 gene intermediary with implications for novel therapeutic interventions.
Through the integration of knowledge from single cell technologies with other sources of knowledge about genetic predispositions and therapeutic targets, the NLM-CKN is revealing the cellular contributions to disease pathogenesis.

15:40-16:00
Cat-VRS for Genomic Knowledge Curation: A Hyperintensional Representation Framework for FAIR Categorical Variation
Confirmed Presenter: Daniel Puthawala, Nationwide Children's Hospital, United States

Room: 03A
Format: In person

Moderator(s): Robert Hoehndorf


Authors List:

  • Daniel Puthawala, Nationwide Children's Hospital, United States
  • Brendan Reardon, Dana-Farber Cancer Institute, United States

Presentation Overview:

Cat-VRS: A FAIR catvar Standard
Categorical variants (catvars)—such as “MET exon 14 skipping” and “TP53 loss”—are foundational to genomic knowledge, linking sets of genomic variants to clinically relevant assertions like oncogenicity scores or predicted therapeutic response. Yet despite their importance, catvars remain unstandardized, ambiguous, and largely non-computable, creating persistent barriers to search, curation, interoperability, and reuse. Existing standards either offer flexible models for sequence-resolved variants (e.g., GA4GH VRS) or rigid top-down nomenclatures (e.g., HGVS) that fail to capture the diversity and nuance of categorical assertions.

We present the Categorical Variation Representation Specification (Cat-VRS), a new GA4GH standard for representing catvars using a hyperintensional, constraint-based model. Cat-VRS encodes categorical meaning compositionally and bottom-up: structured constraints—such as sequence location or protein functional consequence—support precise, flexible representations at varying levels of granularity. Cat-VRS is fully interoperable with other GA4GH standards, supports ontology mappings, and was developed through global community collaboration in alignment with the FAIR data principles.

Cat-VRS 1.0 was recently released by GA4GH and is already in use by ClinVar and MaveDB, with integration underway in CIViC and the VICC MetaKB. These early implementations demonstrate Cat-VRS’s practical utility in enabling reusable, computable representations of categorical knowledge.

As precision medicine scales, so too does the need for infrastructure that supports consistent curation, standardized data sharing, and automated variant knowledge matching. We invite the bio-ontologies and knowledge representation community to engage with Cat-VRS as both a practical tool and an extensible framework for advancing interoperable genomic knowledge.

16:40-17:40
Invited Presentation: Knowledge Graphs: Theory, Applications and Challenges
Room: 03A
Format: In person

Moderator(s): Augustin Luna


Authors List:

  • Ian Horrocks

Presentation Overview:

Knowledge Graphs have rapidly become a mainstream technology that combines features of databases and AI. In this talk I will introduce Knowledge Graphs, explaining their features and the theory behind them. I will then consider some of the challenges inherent in both the theory and implementation of Knowledge Graphs and present some solutions that have made possible the development of popular language standards and robust and high-performance Knowledge Graph systems. Finally, I will illustrate the wide applicability of knowledge graph technology with some example use cases.

17:40-17:45
Bridging Language Barriers in Bio-Curation: An LLM-Enhanced Workflow for Ontology Translation into Japanese
Confirmed Presenter: Mark Streer, SciBite (Elsevier Ltd.), United Kingdom

Room: 03A
Format: In person

Moderator(s): Augustin Luna


Authors List:

  • Mark Streer, SciBite (Elsevier Ltd.), United Kingdom
  • Olivia Watson, SciBite (Elsevier Ltd.), United Kingdom
  • Mark McDowall, SciBite (Elsevier Ltd.), United Kingdom
  • Jane Lomax, SciBite (Elsevier Ltd.), United Kingdom

Presentation Overview:

SciBite’s ontology management and named entity recognition (NER) software relies on curated public ontologies to support data harmonization under FAIR principles (findable, accessible, interoperable, and reusable). Public ontologies are foundational for data FAIR-ification, providing structured vocabularies that enable consistent annotation and semantic integration; however, they are predominantly developed in English, creating barriers for non-English users and applications. To address this challenge for our Japanese customers, we developed a large language model (LLM)-enhanced bio-curation workflow for English-to-Japanese translation, focusing on synonym enrichment of the Uberon anatomy ontology as a case study. Our approach implements a three-step process: (1) importing mapped Japanese synonyms from existing bilingual datasets (e.g., DBCLS resources), (2) generating Japanese candidate synonyms based on English synonyms and definitions using an LLM, and (3) validating candidates against the source ontology to ensure appropriate placement, and against online dictionaries and other references to confirm their real-world applicability. Initially developed for synonym enrichment, this workflow can be extended to semantic refinement into broadMatch and narrowMatch relationships in addition to exactMatch, which is critical for terminology lacking perfect English equivalents. Furthermore, the workflow is well-suited to agentic frameworks such as LangGraph to orchestrate generation and Internet research processes, as well as LLM-ensemble evaluation to automatically confirm clear matches, allowing ambiguous cases to be prioritized for “human-in-the-loop” curation. This approach represents a promising solution for scalable ontology translation, contributing to the FAIR development and application of bio-ontologies across language barriers and enhancing international biomedical research collaboration.

17:45-17:50
Enabling FAIR Single-Cell RNAseq Data Management with COPO
Confirmed Presenter: Felix Shaw, Earlham Institute, United Kingdom

Room: 03A
Format: In person

Moderator(s): Augustin Luna


Authors List: Show

  • Felix Shaw, Earlham Institute, United Kingdom
  • Debby Ku, Earlham Institute, United Kingdom
  • Aaliyah Providence, Earlham Institute, United Kingdom
  • Irene Papatheodorou, Earlham Institute, United Kingdom

Presentation Overview: Show

We present our work on establishing standards and tools for validating and submitting single-cell RNA sequencing (scRNA-seq) data and metadata using the COPO brokering platform. Effective research data management is essential for enabling data reuse, integration, and the discovery of new biological insights. As new technologies like single-cell sequencing and transcriptomics emerge, they often outpace existing data infrastructure.

Single-cell technologies allow detailed insights into biological processes, for example, tracking gene expression dynamics in crops, dissecting pathogen-host interactions at the cellular level, or identifying stress-resilient cell types. Yet without comprehensive metadata and appropriate data management tools, the full potential of these datasets remains unrealised.

Implementing the FAIR principles, particularly around metadata quality, is crucial. At present, there are few widely adopted standards or tools for describing scRNA-seq experiments. In response, we have developed a structured metadata template tailored to these experiments, informed by extensive consultation with researchers across the single-cell community and aligned with existing standards.

This metadata standard is integrated into COPO, which provides a streamlined interface for validating and brokering data and metadata to public repositories. Standardised metadata improves discoverability, supports data integration across platforms, and enables consistent reuse. It also ensures proper attribution, facilitates collaboration across diverse disciplines, and enhances reproducibility.

By submitting with FAIR metadata via COPO, we transform scRNA-seq outputs from isolated experimental results into well-labelled, interoperable datasets suitable for downstream applications such as machine learning. Our work addresses a key infrastructure gap, enabling more effective, collaborative, and impactful research in the single-cell field.
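As a rough illustration of what validating a record against a structured metadata template could involve, the sketch below checks a submission for required fields before it would be brokered. The field names and the `validate_record` function are hypothetical and do not represent COPO's actual template or API.

```python
# Hypothetical required fields and their expected types; the real template
# described in the abstract is not reproduced here.
REQUIRED_FIELDS = {
    "sample_id": str,
    "organism": str,
    "tissue": str,
    "library_preparation": str,
}


def validate_record(record):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        value = record.get(field)
        # Treat absent values and blank strings as missing.
        if value is None or (isinstance(value, str) and not value.strip()):
            errors.append(f"missing required field: {field}")
        elif not isinstance(value, expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors


complete = {
    "sample_id": "S1",
    "organism": "Triticum aestivum",
    "tissue": "root",
    "library_preparation": "droplet-based",
}
print(validate_record(complete))
print(validate_record({"sample_id": "S1", "organism": " "}))
```

Returning every problem at once, rather than stopping at the first, lets a submitter correct a record in a single pass.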

17:50-17:55
Cancer Complexity Knowledge Portal: A centralized web portal for finding cancer related data, software tools, and other resources
Room: 03A
Format: In person

Moderator(s): Augustin Luna


Authors List: Show

  • Orion Banks, Sage Bionetworks, United States
  • Ashley Clayton, Sage Bionetworks, United Kingdom
  • Aditi Gopalan, Sage Bionetworks, United Kingdom
  • Amber Nelson, Sage Bionetworks, United States
  • Stockard Simon, Sage Bionetworks, United States
  • Verena Chung, Sage Bionetworks, United States
  • Amy Heiser, Sage Bionetworks, United States
  • Jay Hodgson, Sage Bionetworks, United States
  • Aditya Nath, Sage Bionetworks, United States
  • Adam Hindman, Sage Bionetworks, United States
  • Milen Nikolov, Sage Bionetworks, United States
  • Adam Taylor, Sage Bionetworks, United Kingdom
  • James Eddy, Sage Bionetworks, United States
  • Susheel Varma, Sage Bionetworks, United States
  • Jineta Banerjee, Sage Bionetworks, United States

Presentation Overview: Show

Applying artificial intelligence and machine learning to biomedical problems requires clean, high-quality data and reusable software tools. The Cancer Complexity Knowledge Portal (CCKP), an NIH-listed domain-specific repository maintained by the Multi-Consortia Coordinating (MC2) Center at Sage Bionetworks, makes oncology data findable and accessible. The MC2 Center coordinates resources among six cancer-focused research consortia funded by the National Cancer Institute.

To establish metadata standards, the CCKP hosts data models for various modalities, including genomics and imaging. New models are also being developed for emerging types, such as spatial transcriptomics. These models undergo iterative development with versioned releases maintained in a public GitHub repository. They power data management tools developed by Sage Bionetworks, including the Schematic Python package and the Data Curator App, which support FAIR data annotation.

The data models help researchers link research outputs and assist the CCKP in highlighting activities from NCI-funded cancer research programs. The portal offers search and filtering capabilities to accelerate discovery and collaboration. As of November 2024, it hosts information on 3,786 publications, 904 datasets, and 292 computational tools from over 140 research grants. The models incorporate elements from the Cancer Research Data Commons Data Hub to support integration within the CRDC ecosystem.

We are engaging with scientists, clinicians, and patient advocates to leverage user-centred design and structured data models, making cancer data more findable, accessible, and reusable. These improvements aim to bridge the gap between experimental and computational labs, fueling scientific discovery.

17:55-18:00
COSI Closing Remarks
Room: 03A
Format: In person

Moderator(s): Augustin Luna


Authors List: Show

  • Augustin Luna
  • Tiffany Callahan