Attention Presenters - please review the Speaker Information Page available here
Schedule subject to change
All times listed are in BST
Tuesday, July 22nd
11:20-12:20
Invited Presentation: Open Knowledge Bases in the Age of Generative AI
Confirmed Presenter: Chris Mungall

Room: 03A
Format: In person

Moderator(s): Karsten Hokamp


Authors List:

  • Chris Mungall

Presentation Overview:

The scientific and clinical community relies on the active development of a wide range of inter-linked knowledge bases in order to plan experiments, interpret omics data, and help with the diagnosis and treatment of disease. These knowledge bases rely on expert curation and community ontologies to provide accurate, structured information that can be used algorithmically. The advent of generative AI and agentic methods presents fantastic opportunities for accelerating curation and increasing the breadth and depth of coverage. Open knowledge bases also present opportunities for generative AI, in the form of a trusted backbone of knowledge that can mitigate the hallucinations that plague large language models. However, the pace of AI development, combined with misunderstandings about both its strengths and weaknesses, poses significant dangers. In this talk, I will present our recent work on the use of agentic AI to assist with manual knowledge base tasks, particularly complex ontology development and maintenance. I will present a realistic picture of the challenges we face, strategies to mitigate them, and a path towards a future where agents, curators, and others can work together to leverage and integrate open source tools and data along with the combined knowledge of the scientific community.

12:20-12:40
textToKnowledgeGraph: Generation of Molecular Interaction Knowledge Graphs Using Large Language Models for Exploration in Cytoscape
Confirmed Presenter: Favour James, Obafemi Awolowo University, Nigeria

Room: 03A
Format: In person

Moderator(s): Karsten Hokamp


Authors List:

  • Favour James, Obafemi Awolowo University, Nigeria
  • Christopher Churas, Department of Medicine, University of California San Diego, La Jolla, CA, United States
  • Trey Ideker, Department of Medicine, University of California San Diego, La Jolla, CA, United States
  • Dexter Pratt, Department of Medicine, University of California San Diego, La Jolla, CA, United States
  • Augustin Luna, National Library of Medicine and National Cancer Institute, Bethesda, MD, United States

Presentation Overview:

Knowledge graphs (KGs) are powerful tools for structuring and analyzing biological information due to their ability to represent data and improve queries across heterogeneous datasets. However, constructing KGs from unstructured literature remains challenging due to the cost and expertise required for manual curation. Prior work has explored text-mining techniques to automate this process, but these approaches have limitations that prevent them from fully capturing complex relationships. Traditional text-mining methods struggle to understand context across sentences. Additionally, these methods lack expert-level background knowledge, making it difficult to infer relationships that require awareness of concepts indirectly described in the text.
Large Language Models (LLMs) present an opportunity to overcome these challenges. LLMs are trained on diverse literature, equipping them with contextual knowledge that enables more accurate extraction. Additionally, LLMs can process the entirety of an article, capturing relationships across sections rather than analyzing single sentences; this allows for more precise extraction. We present textToKnowledgeGraph, an artificial intelligence tool using LLMs to extract interactions from individual publications directly in Biological Expression Language (BEL). BEL was chosen for its compact and detailed representation of biological relationships, allowing for structured and computationally accessible encoding.
This work makes several contributions: (1) the open-source Python textToKnowledgeGraph package (pypi.org/project/texttoknowledgegraph) for BEL extraction from scientific articles, usable from the command line and within other projects; (2) an interactive application within Cytoscape Web to simplify extraction and exploration; and (3) a dataset of extractions that have been both computationally and manually reviewed to support future fine-tuning efforts.
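
For illustration, the sketch below shows how a single extracted BEL edge might be represented and iterated over; the dictionary fields and values are assumptions made for this example, not the package's documented output schema.

```python
# Illustrative only: the field names and values below are assumptions, not
# textToKnowledgeGraph's documented output schema.
extracted_edge = {
    "bel_statement": "p(HGNC:TP53) decreases p(HGNC:BCL2)",           # example BEL statement
    "evidence": "TP53 activation suppressed BCL2 expression in ...",  # supporting sentence
    "source": "PMC0000000",                                           # placeholder article identifier
}

# Edges like this can be collected into a simple list and inspected before
# loading them into Cytoscape Web or another network tool.
for edge in [extracted_edge]:
    subject, relation, obj = edge["bel_statement"].split(" ", 2)  # naive split, for illustration
    print(subject, relation, obj, "|", edge["source"])
```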

12:40-13:00
BMFM-RNA: An Open Framework for Building and Evaluating Transcriptomic Foundation Models built on Biomed-Multi-Omic
Confirmed Presenter: Bharath Dandala, IBM, United States

Room: 03A
Format: In person

Moderator(s): Karsten Hokamp


Authors List:

  • Bharath Dandala, IBM, United States
  • Michael M Danziger, IBM, Israel
  • Ching-Huei Tsou, IBM, United States
  • Akira Koseki, IBM, Japan
  • Viatcheslav Gurev, IBM, United States
  • Tal Kozlovski, IBM, Israel
  • Ella Barkan, IBM, Israel
  • Matthew Madgwick, IBM, United Kingdom
  • Akihiro Kosugi, IBM, Japan
  • Tanwi Biswas, IBM, Japan
  • Liran Szalk, IBM, Israel
  • Matan Ninio, IBM, Israel

Presentation Overview:

High-throughput sequencing has revolutionized transcriptomic studies, and the synthesis of these diverse datasets holds significant potential for a deeper understanding of cell biology. Recent advancements have introduced several promising techniques for building transcriptomic foundation models (TFMs), each emphasizing unique modeling decisions and demonstrating potential in handling the inherent challenges of high-dimensional, sparse data. However, despite their individual strengths, current TFMs still struggle to fully capture biologically meaningful representations, highlighting the need for further improvements. Recognizing that existing TFM approaches possess complementary strengths and weaknesses, a promising direction lies in the systematic exploration of various combinations of design, training, and evaluation methodologies. Thus, to accelerate progress in this field, we present bmfm-rna, a comprehensive framework that not only facilitates this combinatorial exploration but is also inherently flexible and easily extensible to incorporate novel methods as the field continues to advance. This framework enables scalable data processing and features extensible transformer architectures. It supports a variety of input representations, pretraining objectives, masking strategies, domain-specific metrics, and model interpretation methods. Furthermore, it facilitates downstream tasks such as cell type annotation, perturbation prediction, and batch effect correction on benchmark datasets. Models trained with the framework achieve performance comparable to scGPT, Geneformer and other TFMs on these downstream tasks. By open-sourcing this framework with strong performance, we aim to lower barriers for developing TFMs and invite the community to build more effective TFMs. bmfm-rna is available under the Apache license at https://github.com/BiomedSciAI/biomed-multi-omic

DOME Registry - Supporting ML transparency and reproducibility in the life sciences
Confirmed Presenter: Gavin Farrell, University of Padova, Italy

Room: 03A
Format: In person

Moderator(s): Karsten Hokamp


Authors List:

  • Gavin Farrell, University of Padova, Italy
  • Omar Attafi, University of Padova, Italy
  • Silvio Tosatto, University of Padova, Italy

Presentation Overview:

The adoption of machine learning (ML) methods in the life sciences has been transformative, solving landmark challenges such as accurate protein structure prediction, improving bioimaging diagnostics and accelerating drug discovery. However, researchers face a reuse and reproducibility crisis in ML publications: authors are publishing ML methods that lack the core information needed to transfer value back to the reader. Links to code, data and models are commonly absent, eroding trust in the methods.

In response, ELIXIR Europe developed a practical checklist of recommendations covering the key aspects of ML methods that should be disclosed: data, optimisation, model and evaluation. These are now known collectively as the DOME Recommendations, published in Nature Methods by Walsh et al. (2021). Building on this successful first step towards addressing the ML publishing crisis, ELIXIR has developed a technological solution to support the implementation of the DOME Recommendations. This solution is known as the DOME Registry and was published in GigaScience by Attafi et al. in late 2024.

This talk will cover the DOME Registry technology which serves as a curated database of ML methods for life science publications by allowing researchers to annotate and share their methods. The service can also be adopted by publishers during their ML publishing workflow to increase a publication’s transparency and reproducibility. An overview of the next steps for the DOME Registry will also be provided - considering new ML ontologies, metadata formats and integrations building towards a stronger ML ecosystem for the life sciences.

AutoPeptideML 2: An open source library for democratizing machine learning for peptide bioactivity prediction
Confirmed Presenter: Raúl Fernández-Díaz, IBM Research | UCD Conway Institute, Ireland

Room: 03A
Format: In person

Moderator(s): Karsten Hokamp


Authors List:

  • Raúl Fernández-Díaz, IBM Research | UCD Conway Institute, Ireland
  • Thanh Lam Hoang, IBM Research Dublin, Ireland
  • Vanessa Lopez, IBM Research Dublin, Ireland
  • Denis Shields, University College Dublin, Ireland

Presentation Overview:

Peptides are a rapidly growing drug modality with diverse bioactivities and accessible synthesis, particularly for canonical peptides composed of the 20 standard amino acids. However, enhancing their pharmacological properties often requires chemical modifications, increasing synthesis cost and complexity. Consequently, most existing data and predictive models focus on canonical peptides. To accelerate the development of peptide drugs, there is a need for models that generalize from canonical to non-canonical peptides.

We present AutoPeptideML, an open-source, user-friendly machine learning platform designed to bridge this gap. It empowers experimental scientists to build custom predictive models without specialized computational knowledge, enabling active learning workflows that optimize experimental design and reduce sample requirements. AutoPeptideML introduces key innovations: (1) preprocessing pipelines for harmonizing diverse peptide formats (e.g., sequences, SMILES); (2) automated sampling of negative peptides with matched physicochemical properties; (3) robust test set selection with multiple similarity functions (via the Hestia-GOOD framework); (4) flexible model building with multiple representation and algorithm choices; (5) thorough model evaluation for unseen data at multiple similarity levels; and (6) FAIR-compliant, interpretable outputs to support reuse and sharing. A webserver with GUI enhances accessibility and interoperability.
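
To illustrate innovation (2), the toy sketch below matches negatives to positives on just two properties (length and hydrophobic fraction); it is not AutoPeptideML's code, which presumably relies on a richer set of physicochemical descriptors.

```python
# Toy sketch of property-matched negative sampling (not AutoPeptideML's implementation).
HYDROPHOBIC = set("AILMFWVY")

def props(seq):
    """Two simple physicochemical properties: length and hydrophobic fraction."""
    return len(seq), sum(aa in HYDROPHOBIC for aa in seq) / len(seq)

def sample_matched_negatives(positives, candidate_pool, len_tol=2, hyd_tol=0.1):
    """Greedily pick one candidate per positive with similar length and hydrophobicity."""
    negatives = []
    pool = list(candidate_pool)
    for pos in positives:
        p_len, p_hyd = props(pos)
        for cand in pool:
            c_len, c_hyd = props(cand)
            if abs(c_len - p_len) <= len_tol and abs(c_hyd - p_hyd) <= hyd_tol:
                negatives.append(cand)
                pool.remove(cand)
                break
    return negatives

positives = ["KLWKKWAKKWLK", "GIGAVLKVLTTGL"]            # hypothetical bioactive peptides
pool = ["AAAAAGGGGSSS", "KLFKKILKYLAG", "DDEEDDEEDDEE"]  # hypothetical candidate negatives
print(sample_matched_negatives(positives, pool))
```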

We validated AutoPeptideML on 18 peptide bioactivity datasets and found that automated negative sampling and rigorous evaluation reduce overestimation of model performance, promoting user trust. A follow-up investigation also highlighted the current limitations in extrapolating from canonical to non-canonical peptides using existing representation methods.

AutoPeptideML is a powerful platform for democratizing machine learning in peptide research, facilitating integration with experimental workflows across academia and industry.

14:00-14:20
BioPortal: a rejuvenated resource for biomedical ontologies
Confirmed Presenter: J. Harry Caufield, Lawrence Berkeley National Laboratory, United States

Room: 03A
Format: In person

Moderator(s): Karsten Hokamp


Authors List:

  • J. Harry Caufield, Lawrence Berkeley National Laboratory, United States
  • Jennifer Vendetti, Stanford University, United States
  • Nomi Harris, Lawrence Berkeley National Laboratory, United States
  • Michael Dorf, Stanford University, United States
  • Alex Skrenchuk, Stanford University, United States
  • Rafael Gonçalves, Stanford University, United States
  • John Graybeal, Stanford University, United States
  • Harshad Hegde, Lawrence Berkeley National Laboratory, United States
  • Timothy Redmond, Stanford University, United States
  • Chris Mungall, Lawrence Berkeley National Laboratory, United States
  • Mark Musen, Stanford University, United States

Presentation Overview:

BioPortal is an open repository of biomedical ontologies that supports data organization, curation, and integration across various domains. Serving as a fundamental infrastructure for modern information systems, BioPortal has been an open-source project for 20 years and currently hosts over 1,500 ontologies, with 1,192 publicly accessible.

Recent enhancements include tools for creating cross-ontology knowledge graphs and a semi-automated process for ontology change requests. Traditionally, ontology updates required expertise and were time-consuming, as users had to submit requests through developers. BioPortal's new service expedites this process using the Knowledge Graph Change Language (KGCL). A user-friendly interface accepts change requests via forms, which are then converted to GitHub issues with KGCL commands.
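
The examples below indicate the kinds of change requests involved; they paraphrase KGCL-style statements with placeholder identifiers, so consult the KGCL specification for the exact command grammar.

```python
# Paraphrased, KGCL-style change requests with placeholder term IDs (EX:...).
change_requests = [
    "rename EX:0000001 from 'old label' to 'new label'",
    "obsolete EX:0000002",
    "create exact synonym 'alternative name' for EX:0000003",
]

# In the workflow described above, a web form captures a request like one of these
# and files it as a GitHub issue carrying the corresponding KGCL command for review.
for change in change_requests:
    print("Proposed change:", change)
```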

The new BioPortal Knowledge Graph (KG-Bioportal) tool merges user-selected ontology subsets using a common graph format and the Biolink Model. An open-source pipeline translates ontologies into the KGX graph format, facilitating interoperability with other biomedical knowledge sources. KG-Bioportal enables more integrated and flexible querying of ontologies, allowing researchers to connect information across domains.

Future improvements include enhanced ontology pages, automated metadata updates, and KG features with graph-based search and large language model integration. These enhancements aim to position BioPortal as an interoperable resource that meets the community's evolving needs.

14:20-14:40
Formal Validation of Variant Classification Rules Using Domain-Specific Language and Meta-Predicates
Confirmed Presenter: Michael Bouzinier, Forome Association, Harvard University, IDEXX Laboratories, United States

Room: 03A
Format: In person

Moderator(s): Karsten Hokamp


Authors List:

  • Michael Bouzinier, Forome Association, Harvard University, IDEXX Laboratories, United States
  • Michael Chumack, Forome Association, United States
  • Giorgi Shavtvalishvili, Forome Association, Impel, Georgia
  • Eugenia Lvova, Forome Association, Deggendorf Institute of Technology, Germany
  • Dmitry Etin, Forome Association, Deggendorf Institute of Technology, Austria

Presentation Overview:

This talk aims to initiate a community discussion on strategies for validating the selection and curation of genetic variants for clinical and research purposes. We present our approach using a Domain-Specific Language (DSL), first introduced with the AnFiSA platform at BOSC 2019.

Since our 2022 publication, we have continued developing this methodology. At BOSC 2023, we presented two extensions: the strong typing of genetic variables in the DSL, and the application of our framework beyond genetics, into population and environmental health.

This year, we focus on validating the provenance and evidentiary support of annotation elements based on purpose, knowledge domain, method of derivation, and scale — an ontology we introduced in 2023. We aim to support two key use cases: (1) logical validation during rule development, and (2) ensuring rule portability when existing rules are adapted for new clinical or laboratory settings.

We present a proof of concept using meta-predicates — embedded assertions in DSL scripts that validate specific properties of genetic annotations used in variant curation. This technique draws inspiration from Invariant-based Programming.
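
The AnFiSA DSL is not Python, but the sketch below illustrates the invariant-style idea with a hypothetical annotation catalog: a meta-predicate asserts provenance properties of an annotation element before a rule is allowed to use it.

```python
# Hypothetical catalog describing annotation elements by domain, derivation, and scale.
ANNOTATION_CATALOG = {
    "gnomAD_AF": {"domain": "population", "derivation": "measured", "scale": "continuous"},
    "ClinVar_significance": {"domain": "clinical", "derivation": "curated", "scale": "categorical"},
}

def meta_assert(annotation, **expected):
    """Invariant-style meta-predicate: fail early if provenance metadata does not match."""
    meta = ANNOTATION_CATALOG[annotation]
    for key, value in expected.items():
        assert meta.get(key) == value, f"{annotation}: expected {key}={value}, got {meta.get(key)}"

# A curation rule first asserts the assumptions it relies on, then applies its logic.
meta_assert("gnomAD_AF", domain="population", scale="continuous")

def is_rare(allele_frequency):
    return allele_frequency is not None and allele_frequency < 1e-4

print(is_rare(3e-6))
```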

Finally, we frame our work in the context of AI-assisted code synthesis. Recent studies highlight the advantages of deep learning-guided program induction over test-time training and fine-tuning (TTT/TTFT) for structured reasoning tasks. This reinforces the promise of DSL-based approaches as transparent, verifiable complements to generative AI in modern computational genomics.

14:40-15:00
BioChatter: An open-source framework integrating knowledge graphs and large language models for Accessible Biomedical AI
Confirmed Presenter: Sebastian Lobentanzer, Helmholtz Munich, Germany

Room: 03A
Format: In person

Moderator(s): Karsten Hokamp


Authors List:

  • Sebastian Lobentanzer, Helmholtz Munich, Germany

Presentation Overview:

The integration of large language models (LLMs) with structured biomedical knowledge remains a key challenge for building robust, trustworthy, and reproducible AI applications in biomedicine. We present BioChatter (https://biochatter.org), an open-source Python framework that bridges ontology-driven knowledge graphs (KGs) and LLMs through a modular, extensible architecture. Built as a companion to the BioCypher ecosystem for constructing biomedical KGs (https://biocypher.org), BioChatter allows researchers to easily build LLM-powered applications grounded in domain knowledge and interoperable data standards.

BioChatter emphasises transparent, community-driven development, supported by extensive documentation, real-world usage examples, and active support channels. Its design supports multiple modes of use from lightweight prototyping to server-based deployment and integrates naturally with open LLM ecosystems (e.g., Ollama, LangChain), knowledge graphs, and the Model Context Protocol (MCP) for LLM tool usage. We highlight ongoing applications across biomedical domains, including automated knowledge integration pipelines for drug discovery (Open Targets), clinical decision support prototypes, and data sharing platforms within the German research infrastructure.

The open-source nature of BioChatter, together with its benchmark-first approach for validating biomedical LLM applications, facilitates broad adoption and collaboration. By lowering the entry barrier for building trustworthy biomedical AI systems, BioChatter contributes to the growing open-source ecosystem supporting reproducible, transparent, and community-driven AI development in the life sciences.

15:00-15:20
Applications of Bioschemas in FAIR, AI and knowledge representation
Confirmed Presenter: Nick Juty, The University of Manchester, United Kingdom

Room: 03A
Format: In person

Moderator(s): Karsten Hokamp


Authors List:

  • Nick Juty, The University of Manchester, United Kingdom
  • Phil Reed, The University of Manchester, United Kingdom
  • Helena Schnitzer, Forschungszentrum Jülich GmbH, Germany
  • Leyla Jael Castro, ZB MED Information Centre for Life Sciences, Germany
  • Alban Gaignard, University of Nantes, France
  • Carole Goble, The University of Manchester, United Kingdom

Presentation Overview:

Bioschemas.org defines domain-specific metadata schemas based on schema.org extensions, which expose key metadata properties from resource records. This provides a lightweight and easily adoptable means to incorporate key metadata on web records, and a mechanism to link to domain-specific ontology/vocabulary terms. As an established community effort focused on improving the FAIRness of resources in the Life Sciences, we now aim to extend the impact of Bioschemas beyond improvements to ‘findability’.
Bioschemas metadata markup has been used to aggregate data in a distributed environment through federation. More recently, we have been leveraging Bioschemas deployments on resource websites, harvesting them directly to populate SPARQL endpoints and subsequently creating queryable knowledge graphs.
An improved Bioschemas validation process will assess the ‘FAIR’ level of the user’s web records and suggest the most appropriate Bioschemas profile based on similarity to those in the Bioschemas registry.
We will extend what we have learned from operating this community to non-'bio' domains that wish to more easily incorporate ontologies and metadata in their web-based records. To that end, we have a sister site dedicated to hosting the many domain-agnostic types/profiles that have already emerged from our work (so far 7 profiles aligned to digital objects in research, e.g., workflows, datasets, training materials): https://schemas.science/. Through this infrastructure we will develop a sustainable, cross-institutional collaborative space for long-term and wide-ranging impact, supporting our existing community engagement with global AI, ML, and Training communities, and others in the future.
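
For context, the sketch below shows Bioschemas-style JSON-LD markup for a dataset landing page; the identifiers, values, and profile IRI are placeholders, so current profiles should be taken from bioschemas.org.

```python
import json

# Placeholder Bioschemas-style markup for a Dataset page (values and profile IRI are illustrative).
dataset_markup = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "@id": "https://example.org/datasets/42",
    "http://purl.org/dc/terms/conformsTo": {
        "@id": "https://bioschemas.org/profiles/Dataset"  # point at the specific profile version in use
    },
    "name": "Example protein interaction dataset",
    "description": "Pairwise interactions measured by ...",
    "keywords": ["protein interaction", "proteomics"],
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

# Markup like this is embedded in the page (typically in a script tag of type application/ld+json)
# so that harvesters can populate SPARQL endpoints and knowledge graphs from it.
print(json.dumps(dataset_markup, indent=2))
```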

15:20-15:40
RO-Crate: Capturing FAIR research outputs in bioinformatics and beyond
Confirmed Presenter: Phil Reed, The University of Manchester, United Kingdom

Room: 03A
Format: In person

Moderator(s): Karsten Hokamp


Authors List:

  • Eli Chadwick, The University of Manchester, United Kingdom
  • Stian Soiland-Reyes, The University of Manchester, United Kingdom
  • Phil Reed, The University of Manchester, United Kingdom
  • Claus Weiland, Leibniz Institute for Biodiversity and Earth System Research, Germany
  • Dag Endresen, University of Oslo, Norway
  • Felix Shaw, Earlham Institute, United Kingdom
  • Timo Mühlhaus, RPTU Kaiserslautern-Landau, Germany
  • Carole Goble, The University of Manchester, United Kingdom

Presentation Overview:

RO-Crate is a mechanism for packaging research outputs with structured metadata, providing machine-readability and reproducibility following the FAIR principles. It enables interlinking methods, data, and outputs with the outcomes of a project or a piece of work, even where distributed across repositories.

Researchers can distribute their work as an RO-Crate to ensure their data travels with its metadata, so that key components are correctly tracked, archived, and attributed. Data stewards and infrastructure providers can integrate RO-Crate into the projects and platforms they support, to make it easier for researchers to create and consume RO-Crates without requiring technical expertise.
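
As a concrete reference point, the sketch below writes a minimal ro-crate-metadata.json skeleton following the RO-Crate 1.1 convention; the file names and descriptive values are placeholders.

```python
import json

# Minimal RO-Crate 1.1 metadata skeleton; identifiers and names are placeholders.
ro_crate_metadata = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "Example analysis outputs",
            "datePublished": "2025-07-22",
            "hasPart": [{"@id": "results/summary.csv"}],
        },
        {"@id": "results/summary.csv", "@type": "File", "name": "Summary table of results"},
    ],
}

with open("ro-crate-metadata.json", "w") as fh:
    json.dump(ro_crate_metadata, fh, indent=2)
```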

Community-developed extensions called “profiles” allow the creation of tailored RO-Crates that serve the needs of a particular domain or data format.

Current uses of RO-Crate in bioinformatics include:
∙ Describing and sharing computational workflows registered with WorkflowHub
∙ Creating FAIR exports of workflow executions from workflow engines and biodiversity digital twin simulations
∙ Enabling an appropriate level of credit and attribution, particularly in currently under-recognised roles (e.g. sample gathering, processing, sample distribution)
∙ Capturing plant science experiments as Annotated Research Contexts (ARC), complex objects which include workflows, workflow executions, inputs, and results
∙ Defining metadata conventions for biodiversity genomics

This presentation will outline the RO-Crate project and highlight its most prominent applications within bioinformatics, with the aim of increasing awareness and sparking new conversations and collaborations within the BOSC community.

PheBee: A Graph-Based System for Scalable, Traceable, and Semantically Aware Phenotyping
Confirmed Presenter: David Gordon, Office of Data Sciences at Nationwide Children's Hospital, United States

Room: 03A
Format: In person

Moderator(s): Karsten Hokamp


Authors List:

  • David Gordon, Office of Data Sciences at Nationwide Children's Hospital, United States
  • Max Homilius, Office of Data Sciences at Nationwide Children's Hospital, United States
  • Austin Antoniou, Office of Data Sciences at Nationwide Children's Hospital, United States
  • Connor Grannis, Office of Data Sciences at Nationwide Children's Hospital, United States
  • Grant Lammi, Office of Data Sciences at Nationwide Children's Hospital, United States
  • Adam Herman, Office of Data Sciences at Nationwide Children's Hospital, United States
  • Ashley Kubatko, Office of Data Sciences at Nationwide Children's Hospital, United States
  • Peter White, Office of Data Sciences at Nationwide Children's Hospital, United States

Presentation Overview:

The association of phenotypes and disease diagnoses is a cornerstone of clinical care and biomedical research. Significant work has gone into standardizing these concepts in ontologies like the Human Phenotype Ontology and Mondo, and in developing interoperability standards such as Phenopackets. Managing subject-term associations in a traceable and scalable way that enables semantic queries and bridges clinical and research efforts remains a significant challenge.

PheBee is an open-source tool designed to address this challenge by using a graph-based approach to organize and explore data. It allows users to perform powerful, meaning-based searches and supports standardized data exchange through Phenopackets. The system is easy to deploy and share thanks to reproducible setup templates.

The graph model underlying PheBee captures subject-term associations along with their provenance and modifiers. Queries leverage ontology structure to traverse semantic term relationships. Terms can be linked at the patient, encounter, or note level, supporting temporal and contextual pattern analysis. PheBee accommodates both manually assigned and computationally derived phenotypes, enabling use across diverse pipelines. When integrated downstream of natural language processing pipelines, PheBee maintains traceability from extracted terms to the original clinical text, enabling high-throughput, auditable term capture.
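
PheBee's internal graph model is not reproduced here; the sketch below is a minimal, illustrative subset of the GA4GH Phenopacket (v2) JSON that such a system can exchange, with made-up identifiers.

```python
import json

# Minimal illustrative subset of a GA4GH Phenopacket v2 payload (identifiers are made up).
phenopacket = {
    "id": "example-phenopacket-001",
    "subject": {"id": "patient-001"},
    "phenotypicFeatures": [
        {"type": {"id": "HP:0001250", "label": "Seizure"}},
        {"type": {"id": "HP:0001263", "label": "Global developmental delay"}, "excluded": False},
    ],
    "metaData": {
        "created": "2025-01-01T00:00:00Z",
        "phenopacketSchemaVersion": "2.0",
    },
}

print(json.dumps(phenopacket, indent=2))
```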

PheBee is currently being piloted in internal translational research projects supporting phenotype-driven pediatric care. Its graph foundation also empowers future feature development, such as natural language querying using retrieval augmented generation or genomic data integration to identify subjects with variants in phenotypically relevant genes.

PheBee advances open science in biomedical research and clinical support by promoting structured, traceable phenotype data.

The role of the Ontology Development Kit in supporting ontology compliance in adverse legal landscapes
Confirmed Presenter: Damien Goutte-Gattat, University of Cambridge, United Kingdom

Room: 03A
Format: In person

Moderator(s): Karsten Hokamp


Authors List:

  • Damien Goutte-Gattat, University of Cambridge, United Kingdom

Presentation Overview:

Ontologies, like code, are a form of speech. As such, they can be subject to laws and other regulations that attempt to control how freedom of speech is exercised, and ontology editors may find themselves legally compelled to introduce changes in their ontologies for the sole purpose of complying with the laws that apply to them.

Therefore, developers of tools used for ontology editing and maintenance need to ponder whether their tools should provide features to facilitate the introduction of such legally mandated changes, and how.

As developers of the Ontology Development Kit (ODK), one of the main tools used to maintain ontologies of the OBO Foundry, we will consider both the moral and technical aspects of allowing ODK users to comply with arbitrary legal restrictions. The overall approach we are envisioning, in order to contain the impacts of such restrictions to the jurisdictions that mandate them, is a “split world” system, in which the ODK would facilitate the production of slightly different editions of the same ontology.

15:40-16:00
10 years of the AberOWL ontology repository: moving towards federated reasoning and natural language access
Confirmed Presenter: Robert Hoehndorf, King Abdullah University of Science and Technology, Saudi Arabia

Room: 03A
Format: In person

Moderator(s): Karsten Hokamp


Authors List:

  • Fernando Zhapa-Camacho, King Abdullah University of Science and Technology, Saudi Arabia
  • Olga Mashkova, King Abdullah University of Science and Technology, Saudi Arabia
  • Maxat Kulmanov, King Abdullah University of Science and Technology, Saudi Arabia
  • Robert Hoehndorf, King Abdullah University of Science and Technology, Saudi Arabia

Presentation Overview:

AberOWL is a framework for ontology-based data access in biology that has provided reasoning services for bio-ontologies since 2015. Unlike other ontology repositories in the life sciences such as BioPortal, OLS, and OntoBee, AberOWL uniquely focuses on providing access to Description Logic querying through a Description Logic reasoner. The system comprises a reasoning service using OWLAPI and the Elk reasoner, an ElasticSearch service for natural language queries, and a SPARQL endpoint capable of embedding Description Logic queries within SPARQL queries. AberOWL contains all ontologies from BioPortal and the OBO library, enabling lightweight reasoning over the OWL 2 EL profile and implementing the Ontology-Based Data Access paradigm. This allows query enhancement through reasoning to infer implicit knowledge not explicitly stated in data. After a decade of operation, AberOWL is evolving in three key directions: (1) introducing a lightweight, containerized version enabling local deployment for single ontologies with the ability to register with the central repository for federated reasoning access; (2) integrating improved natural language processing through Large Language Models to facilitate Description Logic querying without requiring strict syntax adherence; and (3) implementing a FAIR API that standardizes access to ontology querying and repositories, improving interoperability. These advancements will transform AberOWL into a more federated system with FAIR API access and LLM integration for enhanced ontology interaction.

16:40-16:50
The global biodata infrastructure: how, where, who, and what?
Confirmed Presenter: Guy Cochrane, Global Biodata Coalition, United Kingdom

Room: 03A
Format: In person

Moderator(s): Karsten Hokamp


Authors List:

  • Guy Cochrane, Global Biodata Coalition, United Kingdom
  • Chuck Cook, Global Biodata Coalition, United Kingdom

Presentation Overview:

Life science and biomedical research around the world is critically dependent on a global infrastructure of biodata resources that store and provide access to research data, and to tools and services that allow users to interrogate, combine and re-use these data to generate new insights. These resources, most of which are open and freely available, form a critical, globally distributed, and highly-connected infrastructure that has grown organically over time.

Funders and managers of biodata resources are keenly aware that the long-term sustainability of this infrastructure, and of the individual resources it comprises, is under threat. The infrastructure has not been well described and there is a need to understand how many resources there are, where they are located, who funds them, and which are of the greatest importance for the scientific community.

The Global Biodata Coalition has worked to describe the infrastructure by undertaking an inventory of global biodata resources and by running a selection process to identify a set of—currently 52—Global Core Biodata Resources (GCBRs) that are of fundamental importance to global life sciences research.

We will present an overview of the location and funders of the GCBRs, and will summarise the findings of the two rounds of the global inventory of biodata resources, which identified over 3,700 resources.

The results of these analyses provide an overview of the infrastructure and will allow the GBC to identify major funders of biodata resources that are not currently engaged in the discussion of issues of sustainability.

16:50-17:50
Panel: Data Sustainability
Room: 03A
Format: In person

Moderator(s): Monica Munoz-Torres


Authors List:

  • Chris Mungall
  • Varsha Khodiyar
  • Tony Burdett
  • Nicky Mulder

Presentation Overview:

This BOSC 2025 panel will tackle the essential challenge of Data Sustainability, defined as the proactive and principled approach to ensuring bioinformatics research data remains FAIR, ethically managed, and valuable for future generations through sufficient infrastructure, funding, expertise, and governance. In light of current funding pressures and the risk of data loss that impedes scientific progress and wastes resources, establishing sustainable practices has become more urgent than ever. This discussion will incorporate diverse perspectives to examine practical strategies and solutions across key areas, including FAIR/CARE principles, funding models, open science, data lifecycle management, technical scalability, and ethical considerations.

17:50-18:00
Closing Remarks
Room: 03A
Format: In person

Moderator(s): Monica Munoz-Torres


Authors List:

  • Nomi Harris
Wednesday, July 23rd
11:20-11:40
Knowledge-Graph-driven and LLM-enhanced Microbial Growth Predictions
Confirmed Presenter: Marcin Joachimiak, Lawrence Berkeley National Laboratory, United States

Room: 03A
Format: In person

Moderator(s): Tiffany Callahan


Authors List:

  • Marcin Joachimiak, Lawrence Berkeley National Laboratory, United States

Presentation Overview:

Predicting microbial growth preferences has far-reaching impacts in biotechnology, healthcare, and environmental management. Cultivating microbes allows researchers to streamline strain selection, develop targeted antimicrobials, and uncover metabolic pathways for biodegradation or biomanufacturing. However, with most microbial taxa remaining uncultivated and knowledge of their metabolic capabilities and organismal traits fragmented in unstructured text, cultivation remains a major challenge. To address this, we developed KG-Microbe, a knowledge graph (KG) of over 800,000 bacterial and archaeal taxa, 3,000 types of traits, and 30,000 types of functional annotations.
Using KG-Microbe, we constructed machine learning pipelines to predict microbial growth preferences. We compared symbolic rule mining, which produces human-readable explanations, with "black box" methods like gradient boosted decision trees and deep graph-based models. While boosted tree models achieved a mean precision of over 70% across 46 diverse media, we demonstrate that symbolic rule mining can match their performance, offering crucial interpretability. To further validate predictions, we used large language models (LLMs) to interpret and explain model outputs.
By comparing these different models and their outputs, we identified key data features and knowledge gaps relevant to predicting microbial cultivation media preferences. We also used vector embedding analogy reasoning as well as complex graph queries on KG-Microbe to generate novel hypotheses and identify organisms with specific properties. Our work highlights the power of a KG-driven approach and the trade-offs between model interpretability and predictive performance. These findings motivate the development of hybrid AI models that combine transparency with predictive accuracy to advance microbial cultivation.
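
As a generic illustration of one modeling step described above, the sketch below trains a boosted-tree classifier on a synthetic taxon-by-trait matrix; the data, features, and label are toy placeholders, not the authors' KG-Microbe pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic taxon-by-feature matrix (binary traits/annotations) and a toy growth label.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 40))
y = ((X[:, 0] & X[:, 3]) | X[:, 7]).astype(int)   # "grows on medium M" (toy rule)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print("precision:", precision_score(y_te, clf.predict(X_te)))

# A mined symbolic rule over the same data would instead read, for example:
#   IF trait_0 AND trait_3 THEN grows_on(M)   -- directly inspectable by a curator
```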

11:40-12:00
ProDiGenIDB – a unified resource of disease-associated genes, their protein products, and intrinsic disorder annotations
Confirmed Presenter: Jovana Kovacevic, Faculty of Mathematics, Belgrade University, Belgrade, Serbia

Room: 03A
Format: In person

Moderator(s): Tiffany Callahan


Authors List:

  • Jovana Kovacevic, Faculty of Mathematics, Belgrade University, Belgrade, Serbia
  • Anđelka Zečević, Mathematical Institute, Serbian Academy of Sciences and Arts, Belgrade, Serbia
  • Lazar Vasović, Faculty of Mathematics, Belgrade University, Belgrade, Serbia

Presentation Overview:

Understanding gene-disease associations is essential in biomedical research, yet relevant information is often distributed across multiple heterogeneous databases. To overcome this inconsistency, we developed ProDiGenIDB, an integrated database that consolidates gene-disease relationships from several recognized and publicly available sources, while also enriching them with complementary data on gene and protein identifiers, disease ontology, and protein structural disorder.

ProDiGenIDB brings together over 400,000 curated associations sourced from DisGeNet, COSMIC, HumsaVar, Orphanet, ClinVar, HPO, and DISEASES. Each entry includes gene-related metadata (Gene Symbol, Entrez ID, UniProt ID, Ensembl ID), disease descriptors (Disease Name, DOID), and a reference to the original source database.

Importantly, the database also incorporates predicted intrinsic disorder information for proteins encoded by the associated genes. These predictions were generated using commonly used protein disorder prediction tools such as IUPred and VSL2, providing additional insight into the potential lack of structure of disease-related proteins.

Another important aspect of the database construction involved mapping disease names to standardized Disease Ontology IDs (DOIDs). To improve this process, we applied Natural Language Processing (NLP) techniques using advanced text representation models to enhance the accuracy and consistency of term association.
ProDiGenIDB represents a valuable resource for integrative biomedical studies, particularly in contexts where protein disorder is hypothesized to play a functional or pathological role.

12:00-12:20
Causal knowledge graph analysis identifies adverse drug effects
Confirmed Presenter: Sumyyah Toonsi, King Abdullah University of Science and Technology, Saudi Arabia

Room: 03A
Format: In person

Moderator(s): Tiffany Callahan


Authors List:

  • Sumyyah Toonsi, King Abdullah University of Science and Technology, Saudi Arabia
  • Paul Schofield, Cambridge University, United Kingdom
  • Robert Hoehndorf, King Abdullah University of Science and Technology, Saudi Arabia

Presentation Overview:

Motivation: Knowledge graphs and structural causal models have each proven valuable for organizing biomedical knowledge and estimating causal effects, but remain largely disconnected: knowledge graphs encode qualitative relationships focusing on facts and deductive reasoning without formal probabilistic semantics, while causal models lack integration with background knowledge in knowledge graphs and have no access to the deductive reasoning capabilities that knowledge graphs provide.

Results: To bridge this gap, we introduce a novel formulation of Causal Knowledge Graphs (CKGs) which extend knowledge graphs with formal causal semantics, preserving their deductive capabilities while enabling principled causal inference. CKGs support deconfounding via explicitly marked causal edges and facilitate hypothesis formulation aligned with both encoded and entailed background knowledge. We constructed a Drug–Disease CKG (DD-CKG) integrating disease progression pathways, drug indications, side-effects, and hierarchical disease classification to enable automated large-scale mediation analysis. Applied to UK Biobank and MIMIC-IV cohorts, we tested whether drugs mediate effects between indications and downstream disease progression, adjusting for confounders inferred from the DD-CKG. Our approach successfully reproduced known adverse drug reactions with high precision while identifying previously undocumented significant candidate adverse effects. Further validation through side effect similarity analysis demonstrated that combining our predicted drug effects with established databases significantly improves the prediction of shared drug indications, supporting the clinical relevance of our novel findings. These results demonstrate that our methodology provides a generalizable, knowledge-driven framework for scalable causal inference.

12:20-12:40
CROssBARv2: A Unified Biomedical Knowledge Graph for Heterogeneous Data Representation and LLM-Driven Exploration
Confirmed Presenter: Erva Ulusoy, Hacettepe University, Turkey

Room: 03A
Format: In person

Moderator(s): Tiffany Callahan


Authors List:

  • Bünyamin Şen, Hacettepe University, Turkey
  • Erva Ulusoy, Hacettepe University, Turkey
  • Melih Darcan, Hacettepe University, Turkey
  • Mert Ergün, Hacettepe University, Turkey
  • Tunca Dogan, Hacettepe University, Turkey

Presentation Overview:

Developing effective therapeutics against prevalent diseases requires a deep understanding of the molecular, genetic, and cellular factors involved in disease development/progression. However, such knowledge is dispersed across different databases, publications, and ontologies, making collecting, integrating and analysing biological data a major challenge. Here, we present CROssBARv2, an extended and improved version of our previous work (https://crossbar.kansil.org/), a heterogeneous biological knowledge graph (KG)-based system to facilitate systems biology and drug discovery/repurposing. CROssBARv2 collects large-scale biological data from 32 data sources and stores them in a Neo4j graph database. CROssBARv2 consists of 2,709,502 nodes and 12,688,124 relationships between 14 node types. On top of this, we developed a GraphQL API and a large language model interface that converts users' natural-language queries into Neo4j's Cypher query language and translates the results back into natural language, providing access to information within the KG and answering specific scientific questions without LLM hallucinations, mainly to facilitate the usage of the resource. To evaluate the capability of CROssBAR-LLMs (LLMs augmented with structured knowledge in CROssBAR) in answering biomedical questions, we constructed multiple benchmark datasets and employed an independent benchmark to systematically compare various open- and closed-source LLMs. Our results revealed that CROssBAR-LLMs display significantly improved accuracy in answering these scientific questions compared to standalone LLMs and even LLMs augmented with web search. CROssBARv2 (https://crossbarv2.hubiodatalab.com/) is expected to contribute to life sciences research in terms of (i) the discovery of disease mechanisms at the molecular level and (ii) the development of effective personalised therapeutic strategies.
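
The sketch below illustrates the pattern of executing a generated Cypher query through the official neo4j Python driver; the node labels, relationship types, property names, and credentials are hypothetical and do not reflect CROssBARv2's actual schema or endpoint.

```python
from neo4j import GraphDatabase

# Hypothetical Cypher for a question like "Which drugs target proteins associated with Alzheimer disease?"
query = """
MATCH (d:Drug)-[:TARGETS]->(p:Protein)-[:ASSOCIATED_WITH]->(dis:Disease)
WHERE toLower(dis.name) CONTAINS 'alzheimer'
RETURN d.name AS drug, p.name AS protein
LIMIT 10
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholder endpoint
with driver.session() as session:
    for record in session.run(query):
        print(record["drug"], "->", record["protein"])
driver.close()
```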

12:40-12:45
Benchmarking Data Leakage on Link Prediction in Biomedical Knowledge Graph Embeddings
Confirmed Presenter: Galadriel Brière, Aix Marseille Univ, INSERM, MMG, Marseille, France

Room: 03A
Format: In person

Moderator(s): Tiffany Callahan


Authors List:

  • Galadriel Brière, Aix Marseille Univ, INSERM, MMG, Marseille, France
  • Thomas Stosskopf, Aix Marseille Univ, INSERM, MMG, Marseille, France
  • Benjamin Loire, Aix Marseille Univ, INSERM, MMG, Marseille, France
  • Anaïs Baudot, Aix Marseille Univ, INSERM, MMG, Marseille, France

Presentation Overview:

In recent years, Knowledge Graphs (KGs) have gained significant attention for their ability to organize massive biomedical knowledge into entities and relationships. Knowledge Graph Embedding (KGE) models facilitate efficient exploration of KGs by learning compact data representations. These models are increasingly applied on biomedical KGs for various tasks, notably link prediction that enables applications such as drug repurposing.

The research community has implemented benchmarks to evaluate and compare the large diversity of KGE models. However, existing benchmarks often overlook the issue of Data Leakage (DL), which can lead to inflated performance and compromise the validity of benchmark results. DL may occur due to inadequate separation between training and test sets (DL1), use of illegitimate features (DL2), or evaluation settings that fail to reflect real-world inference conditions (DL3).

In this study, we implement systematic procedures to detect and mitigate these sources of DL. We evaluate popular KGE models on a biomedical KG and show that inappropriate data separation (DL1) artificially inflates model performance and that models do not rely on node degree as a shortcut feature (DL2). For DL3, we implement realistic inference conditions with (i) a zero-shot training procedure in which drugs in the test and validation sets have no known indications during training and (ii) a drug-repurposing ground truth for rare diseases. Performance collapses in both of these scenarios.

Our findings highlight the need for more rigorous evaluation protocols and raise concerns about the reliability of current KGE models for real-world biomedical applications such as drug repurposing.
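
The toy sketch below, on made-up triples, illustrates two of the checks described above: flagging inverse-edge leakage between train and test (DL1) and building a zero-shot drug split (DL3 mitigation). It is not the benchmark's actual code.

```python
# Triples are (head, relation, tail); all identifiers here are made up.
train = {("drug:A", "indication", "disease:X"), ("disease:X", "is_a", "disease:Y")}
test = {("disease:X", "indication_of", "drug:A")}   # inverse of a training edge -> leakage

def dl1_leakage(train_triples, test_triples):
    """Flag test triples whose (head, tail) pair already occurs in training, in either direction."""
    seen = {(h, t) for h, _, t in train_triples} | {(t, h) for h, _, t in train_triples}
    return [trip for trip in test_triples if (trip[0], trip[2]) in seen]

def zero_shot_drug_split(triples, test_drug_fraction):
    """Hold out *all* indication edges of a subset of drugs, so test drugs are unseen in training."""
    drugs = sorted({h for h, r, _ in triples if r == "indication"})
    held_out = set(drugs[: max(1, int(len(drugs) * test_drug_fraction))])
    test_set = {t for t in triples if t[1] == "indication" and t[0] in held_out}
    return triples - test_set, test_set

print(dl1_leakage(train, test))
print(zero_shot_drug_split(train | {("drug:B", "indication", "disease:Y")}, 0.5))
```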

12:45-12:50
A machine learning framework for extracting and structuring biological pathway knowledge from scientific literature
Confirmed Presenter: Mun Su Kwon, Korea Advanced Institute of Science and Technology (KAIST), South Korea

Room: 03A
Format: In person

Moderator(s): Tiffany Callahan


Authors List:

  • Mun Su Kwon, Korea Advanced Institute of Science and Technology (KAIST), South Korea
  • Junkyu Lee, Korea Advanced Institute of Science and Technology (KAIST), South Korea
  • Haechan Sung, Korea Advanced Institute of Science and Technology (KAIST), South Korea
  • Hyun Uk Kim, Korea Advanced Institute of Science and Technology (KAIST), South Korea

Presentation Overview:

Advances in text mining have significantly improved the accessibility of scientific knowledge from literature. However, a major challenge in biology and biotechnology remains in extracting information embedded within biological pathway images, which are not easily accessible through conventional text-based methods. To overcome this limitation, we present a machine learning–based framework called “Extraction of Biological Pathway Information” (EBPI). The framework systematically retrieves relevant publications based on user-defined queries, identifies biological pathway figures, and extracts structured information such as genes, enzymes, and metabolites. EBPI combines image processing and natural language models to identify texts from diagrams, classify terms into biological categories, and infer biochemical reaction directionality using graphical cues such as arrows. The extracted information is output in an editable, tabular format suitable for integration with pathway databases and knowledge graphs. Validated against manually curated pathway maps, EBPI enables scalable knowledge extraction from complex visual data of biological pathways and opens new directions for automated literature curation across many biological disciplines.

12:50-13:00
Invited Presentation: Poster Madness
Room: 03A
Format: In person

Moderator(s): Tiffany Callahan


Authors List:

Presentation Overview:

Each accepted poster presenter is given up to 1 minute to advertise their poster.

14:00-14:20
Proceedings Presentation: ScGOclust: leveraging gene ontology to find functionally analogous cell types between distant species
Confirmed Presenter: Yuyao Song, European Bioinformatics Institute, United Kingdom

Room: 03A
Format: In person

Moderator(s): Robert Hoehndorf


Authors List:

  • Yuyao Song, European Bioinformatics Institute, United Kingdom
  • Yanhui Hu, Department of Genetics, Harvard Medical School, United States
  • Julian Dow, School of Molecular Biosciences, University of Glasgow, United Kingdom
  • Norbert Perrimon, Department of Genetics, Harvard Medical School and Howard Hughes Medical Institute, United States
  • Irene Papatheodorou, European Bioinformatics Institute; Earlham Institute and University of East Anglia, United Kingdom

Presentation Overview:

Basic biological processes are shared across animal species, yet their cellular mechanisms are profoundly diverse. Comparing cell-type gene expression between species reveals conserved and divergent cellular functions. However, as phylogenetic distance increases, gene-based comparisons become less informative. The Gene Ontology (GO) knowledgebase offers a solution by serving as the most comprehensive resource of gene functions across a vast diversity of species, providing a bridge for distant species comparisons. Here, we present scGOclust, a computational tool that constructs de novo cellular functional profiles using GO terms, facilitating systematic and robust comparisons within and across species. We applied scGOclust to analyse and compare the heart, gut and kidney between mouse and fly, and whole-body data from C. elegans and H. vulgaris. We show that scGOclust effectively recapitulates the functional spectrum of different cell types, characterises functional similarities between homologous cell types, and reveals functional convergence between unrelated cell types. Additionally, we identified subpopulations within the fly crop that show circadian rhythm-regulated secretory properties and hypothesize an analogy between fly principal cells from different segments and distinct mouse kidney tubules. We envision scGOclust as an effective tool for uncovering functionally analogous cell types or organs across distant species, offering fresh perspectives on evolutionary and functional biology.
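
The minimal sketch below conveys the underlying idea of GO-term-level cell profiles on synthetic matrices; it is not the scGOclust package itself.

```python
import numpy as np

# Synthetic cells-by-genes counts and a synthetic binary gene-to-GO annotation matrix.
n_cells, n_genes, n_terms = 100, 500, 50
expr = np.random.default_rng(1).poisson(1.0, size=(n_cells, n_genes)).astype(float)
gene2go = (np.random.default_rng(2).random((n_genes, n_terms)) < 0.05).astype(float)

# Collapse gene expression into a cells-by-GO-term functional profile, normalised per cell.
go_profile = expr @ gene2go
go_profile /= go_profile.sum(axis=1, keepdims=True) + 1e-9

# Because GO terms, unlike gene identifiers, are shared across species, profiles built this way
# for different species live in the same feature space and can be clustered or compared jointly.
print(go_profile.shape)
```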

14:20-14:40
Integrating autoantibody-related knowledge in an ontology populated using a curated dataset from literature
Confirmed Presenter: Fabien Maury, Inserm, France

Room: 03A
Format: In person

Moderator(s): Robert Hoehndorf


Authors List:

  • Fabien Maury, Inserm, France
  • Solène Grosdidier, BlueMed Writing, Netherlands
  • Killian Halberda, Inserm, France
  • Isabelle Desguerre, AP-HP, France
  • Adrien Coulet, Inria, France
  • Maud de Dieuleveult, Inserm, France

Presentation Overview:

Autoimmune diseases (AIDs) are often characterized by the presence of autoantibodies (AAbs). However, many of these diseases are rare and can be hard to diagnose, partly due to the lack of easily accessible knowledge, such as which AAb to test for in order to diagnose a particular AID. Indeed, to our knowledge, no centralized resource covering all available knowledge related to human autoantibodies exists as of April 2025.
To fill this gap, we first introduce a lightweight ontology for representing relationships between AAbs, their molecular targets, and the related AIDs and their clinical signs. This ontology also allows the provenance of each relationship to be specified, by reusing the PROV-O ontology.
Second, we introduce the MAKAAO Core dataset, compiled manually from the literature by several curators. MAKAAO Core includes the names and synonyms (in both English and French) of over 350 autoantibodies, along with their targets and associated AIDs. Targets and diseases are referred to using identifiers from reference resources.
We used this dataset to populate our ontology, and named the result the MAKAAO knowledge graph (MAKAAO KG), which constitutes the central part of a future reference resource.
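
The rdflib sketch below, with hypothetical URIs and a placeholder disease identifier, shows how a single AAb-target-AID association with PROV-O provenance could be expressed; it is not the actual MAKAAO model.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

EX = Namespace("https://example.org/makaao/")        # hypothetical namespace
PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
aab = EX["anti-AChR"]                                 # an autoantibody
assoc = EX["association-001"]                         # a reified association node
g.add((aab, RDF.type, EX.Autoantibody))
g.add((assoc, EX.hasAutoantibody, aab))
g.add((assoc, EX.hasTarget, URIRef("http://purl.uniprot.org/uniprot/P02708")))                 # AChR alpha subunit
g.add((assoc, EX.associatedDisease, URIRef("http://purl.obolibrary.org/obo/MONDO_0000001")))   # placeholder disease IRI
g.add((assoc, PROV.wasDerivedFrom, URIRef("https://pubmed.ncbi.nlm.nih.gov/00000000/")))       # placeholder reference
g.add((assoc, RDFS.label, Literal("anti-AChR autoantibody association (example)")))

print(g.serialize(format="turtle"))
```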

14:40-15:00
Ontology pre-training improves machine learning predictions of aqueous solubility and other metabolite properties
Confirmed Presenter: Charlotte Tumescheit, University of Zurich, Swiss Institute of Bioinformatics, Switzerland

Room: 03A
Format: In person

Moderator(s): Robert Hoehndorf


Authors List:

  • Charlotte Tumescheit, University of Zurich, Swiss Institute of Bioinformatics, Switzerland
  • Martin Glauer, Otto-von-Guericke University Magdeburg, Germany
  • Simon Flügel, Osnabrück University, Germany
  • Fabian Neuhaus, Otto-von-Guericke University Magdeburg, Germany
  • Till Mossakowski, Osnabrück University, Germany
  • Janna Hastings, University of Zurich, Swiss Institute of Bioinformatics, University of St. Gallen, Switzerland

Presentation Overview:

Predicting properties of small molecule metabolites from structures is a challenging task. Molecular language models have emerged as a highly performant AI approach for prediction of diverse properties directly from ‘language-like’ representations of the structures of molecules. However, for many prediction problems, there is a shortage of available training data and model performance is still limited.

Integrating expert knowledge into language models has the potential to improve performance on prediction tasks and model generalisability. Bio-ontologies offer curated knowledge ideal for this purpose. Here, we demonstrate a novel approach to knowledge injection, ‘ontology pre-training’, which we have previously shown to work for a pilot case study in the classification task of toxicity prediction. Now, we extend this to regression tasks such as solubility prediction and a wider range of classification tasks.

First, we pre-train a Transformer-based language model on molecules from PubChem. Then, using our novel method, we embed the knowledge contained in a classification hierarchy derived from the ChEBI ontology into the model as an intermediate training step between general-purpose pre-training and task-specific fine-tuning. Finally, we fine-tune the models on a range of regression tasks. We find a clear improvement in performance and training times across the diverse prediction tasks.

Our results show that adding an additional knowledge-based training step to a machine learning model can improve performance. Our method is intuitive and generalisable and we plan to extend it to further biological modalities and prediction datasets, including proteins and RNA, as well as exploring the impact of different ontologies.
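
The conceptual PyTorch sketch below shows the staged training idea (ontology pre-training inserted between general pre-training and task fine-tuning); the dimensions, class counts, and data are toy placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Toy encoder standing in for a pretrained molecular language model.
encoder = nn.Sequential(nn.Embedding(1000, 64), nn.Flatten(), nn.Linear(64 * 32, 128), nn.ReLU())
ontology_head = nn.Linear(128, 200)   # multi-label head over 200 placeholder ChEBI-derived classes
property_head = nn.Linear(128, 1)     # regression head, e.g. aqueous solubility

tokens = torch.randint(0, 1000, (8, 32))             # toy batch of tokenized molecules
onto_labels = torch.randint(0, 2, (8, 200)).float()  # toy ontology class memberships
solubility = torch.randn(8, 1)                       # toy regression targets

# Intermediate step: ontology pre-training (after general-purpose pre-training of the encoder).
opt = torch.optim.Adam(list(encoder.parameters()) + list(ontology_head.parameters()), lr=1e-3)
loss = nn.BCEWithLogitsLoss()(ontology_head(encoder(tokens)), onto_labels)
opt.zero_grad(); loss.backward(); opt.step()

# Final step: task-specific fine-tuning on the regression target.
opt = torch.optim.Adam(list(encoder.parameters()) + list(property_head.parameters()), lr=1e-4)
loss = nn.MSELoss()(property_head(encoder(tokens)), solubility)
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```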

15:00-15:20
Building the Aging Biomarkers Ontology and Its Applications in Aging Research
Confirmed Presenter: Hande McGinty, Kansas State University, Manhattan KS, United States

Room: 03A
Format: In person

Moderator(s): Robert Hoehndorf


Authors List:

  • Hande McGinty, Kansas State University, Manhattan KS, United States
  • Srikar Reddy Gadusu, Kansas State University, United States
  • Yigit Kucuk, KONCORDANT Lab, United States
  • Aaron King, Aeon Biomarkers, United States

Presentation Overview:

Aging is a complex biological process shaped by numerous biomarkers—such as cholesterol and blood sugar levels—that serve as measurable indicators of health and disease. Despite the abundance of biomarker data, identifying meaningful patterns and relationships remains a significant challenge. To address this, we began developing the Aging Biomarkers Ontology (ABO), a structured framework that formally defines aging-related biomarkers, organizes them hierarchically, and maps their interconnections to facilitate deeper analysis. Furthermore, we employed two complementary approaches to enrich the graph and uncover hidden associations among aging biomarkers: Depth-Limited Search (DLS) and machine learning-based embedding search. DLS identifies associations by traversing connected nodes within a predefined depth, while the embedding-based method encodes biomarker relationships as numerical vectors and uses cosine similarity to predict potential links. We evaluated the performance of both methods in detecting known and novel relationships. Our results demonstrate the value of systematically integrating statistical analysis with graph-based reasoning and machine learning to explore aging-related biomarkers. The resulting framework enhances the interpretability of biomarker data, supports hypothesis generation, and contributes to advancing biomedical research in aging and longevity.
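
The tiny numpy sketch below illustrates the embedding-based step (cosine similarity over made-up biomarker vectors); it is not the ABO pipeline itself.

```python
import numpy as np

# Made-up embeddings for four biomarker nodes.
emb = {
    "LDL_cholesterol": np.array([0.9, 0.1, 0.3]),
    "HDL_cholesterol": np.array([0.8, 0.2, 0.2]),
    "fasting_glucose": np.array([0.1, 0.9, 0.4]),
    "HbA1c":           np.array([0.2, 0.8, 0.5]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rank candidate links for one biomarker by cosine similarity; top pairs become candidate associations.
query = "fasting_glucose"
scores = sorted(
    ((other, cosine(emb[query], v)) for other, v in emb.items() if other != query),
    key=lambda pair: -pair[1],
)
print(scores)
```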

15:20-15:40
Discovering cellular contributions to disease pathogenesis in the NLM Cell Knowledge Network
Confirmed Presenter: Richard Scheuermann, Division of Intramural Research, National Library of Medicine, United States

Room: 03A
Format: In person

Moderator(s): Robert Hoehndorf


Authors List:

  • Richard Scheuermann, Division of Intramural Research, National Library of Medicine, United States
  • Anne Deslattes Mays, Division of Intramural Research, National Library of Medicine, United States
  • Matthew Diller, Division of Intramural Research, National Library of Medicine, United States
  • Caroline Eastwood, Wellcome Sanger Institute, United Kingdom
  • Rezarta Islamaj, Division of Intramural Research, National Library of Medicine, United States
  • James Leaman, Division of Intramural Research, National Library of Medicine, United States
  • Raymond LeClair, Division of Intramural Research, National Library of Medicine, United States
  • Zhiyong Lu, Division of Intramural Research, National Library of Medicine, United States
  • Chris Mungall, Lawrence Berkeley National Laboratory, United States
  • Vinh Nguyen, Division of Intramural Research, National Library of Medicine, United States
  • David Osumi-Sutherland, Wellcome Sanger Institute, United Kingdom
  • Beverly Peng, J. Craig Venter Institute, United States
  • Noam Rotenberg, Division of Intramural Research, National Library of Medicine, United States
  • William Spear, Division of Intramural Research, National Library of Medicine, United States
  • Bingfang Xu, Division of Intramural Research, National Library of Medicine, United States
  • Yun Zhang, Division of Intramural Research, National Library of Medicine, United States

Presentation Overview: Show

Knowledge about the role of genes in disease pathogenesis has been obtained from genetic and genome-wide association studies. The proteins encoded by these genes are frequently found to be effective therapeutic targets. However, little is known about which cells are the functional home of these disease-associated genes and proteins. Single-cell genomic technologies are now revealing the cellular complexity of human tissues at high resolution, and the transcriptomes they define reflect functional cellular phenotypes. Database resources that capture and disseminate data derived from these single-cell technologies have been developed, but the knowledge derived from their analysis and interpretation remains largely buried as free text in the scientific literature.
Here we describe the development of a Cell Knowledge Network (CKN) prototype at the National Library of Medicine (NLM) that captures and exposes knowledge about cell phenotypes (cell types and states) derived from single-cell technologies and related experiments. The NLM-CKN is populated using validated computational analysis pipelines and natural language processing of the scientific literature, and is integrated with other sources of relevant knowledge about genes, anatomical structures, diseases, and drugs.
Using this integration of experimental sc/snRNAseq data with prior knowledge about disease predispositions and drug targets, we discovered a novel linkage between lung pericytes and pulmonary hypertension through the KCNK3 gene intermediary, with implications for novel therapeutic interventions.
Through the integration of knowledge from single cell technologies with other sources of knowledge about genetic predispositions and therapeutic targets, the NLM-CKN is revealing the cellular contributions to disease pathogenesis.
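As a rough illustration of the kind of cell-to-disease traversal such a knowledge network enables (a toy sketch, not the NLM-CKN data model or API), the following walks from a cell type to diseases via gene intermediaries; the single KCNK3 path is taken from the example above, and the edge labels are placeholders.

import networkx as nx

# Toy knowledge graph: a cell type expresses a gene, and the gene is
# associated with a disease.
kg = nx.DiGraph()
kg.add_edge("lung pericyte", "KCNK3", relation="expresses")
kg.add_edge("KCNK3", "pulmonary hypertension", relation="associated_with")

def cell_disease_links(graph, cell_type):
    """Yield (gene, disease) pairs reachable from a cell type via a gene intermediary."""
    for gene in graph.successors(cell_type):
        if graph.edges[cell_type, gene].get("relation") == "expresses":
            for disease in graph.successors(gene):
                if graph.edges[gene, disease].get("relation") == "associated_with":
                    yield gene, disease

print(list(cell_disease_links(kg, "lung pericyte")))
# [('KCNK3', 'pulmonary hypertension')]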

15:40-16:00
Cat-VRS for Genomic Knowledge Curation: A Hyperintensional Representation Framework for FAIR Categorical Variation
Confirmed Presenter: Daniel Puthawala, Nationwide Children's Hospital, United States

Room: 03A
Format: In person

Moderator(s): Robert Hoehndorf


Authors List: Show

  • Daniel Puthawala, Nationwide Children's Hospital, United States
  • Brendan Reardon, Dana-Farber Cancer Institute, United States

Presentation Overview: Show

Cat-VRS: A FAIR catvar Standard
Categorical variants (catvars)—such as “MET exon 14 skipping” and “TP53 loss”—are foundational to genomic knowledge, linking sets of genomic variants to clinically relevant assertions like oncogenicity scores or predicted therapeutic response. Yet despite their importance, catvars remain unstandardized, ambiguous, and largely non-computable, creating persistent barriers to search, curation, interoperability, and reuse. Existing standards either offer flexible models for sequence-resolved variants (e.g., GA4GH VRS) or rigid top-down nomenclatures (e.g., HGVS) that fail to capture the diversity and nuance of categorical assertions.

We present the Categorical Variation Representation Specification (Cat-VRS), a new GA4GH standard for representing catvars using a hyperintensional, constraint-based model. Cat-VRS encodes categorical meaning compositionally and bottom-up: structured constraints—such as sequence location or protein functional consequence—support precise, flexible representations at varying levels of granularity. Cat-VRS is fully interoperable with other GA4GH standards, supports ontology mappings, and was developed through global community collaboration in alignment with the FAIR data principles.
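For intuition, here is a simplified, illustrative sketch of a constraint-based categorical variant record and a toy matcher. The field and constraint names are inventions for illustration only and are not the normative Cat-VRS 1.0 schema.

# Illustrative only: simplified field names, not the normative Cat-VRS 1.0 schema.
met_exon14_skipping = {
    "type": "CategoricalVariant",
    "label": "MET exon 14 skipping",
    "constraints": [
        {   # restrict matches to a genomic feature (MET exon 14)
            "type": "LocationConstraint",
            "gene": "MET",
            "feature": "exon 14",
        },
        {   # require a particular functional consequence
            "type": "ConsequenceConstraint",
            "consequence": "exon skipping",
        },
    ],
}

def matches(variant_annotation, catvar):
    """Toy matcher: an annotation matches a categorical variant if it
    satisfies every constraint (here, by simple key comparison)."""
    return all(
        all(variant_annotation.get(k) == v for k, v in c.items() if k != "type")
        for c in catvar["constraints"]
    )

print(matches({"gene": "MET", "feature": "exon 14", "consequence": "exon skipping"},
              met_exon14_skipping))  # True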

Cat-VRS 1.0 was recently released by GA4GH and is already in use by ClinVar and MaveDB, with integration underway in CIViC and the VICC MetaKB. These early implementations demonstrate Cat-VRS’s practical utility in enabling reusable, computable representations of categorical knowledge.

As precision medicine scales, so too does the need for infrastructure that supports consistent curation, standardized data sharing, and automated variant knowledge matching. We invite the bio-ontologies and knowledge representation community to engage with Cat-VRS as both a practical tool and an extensible framework for advancing interoperable genomic knowledge.

16:40-17:40
Invited Presentation: Knowledge Graphs: Theory, Applications and Challenges
Confirmed Presenter: Ian Horrocks

Room: 03A
Format: In person

Moderator(s): Augustin Luna


Authors List: Show

  • Ian Horrocks

Presentation Overview: Show

Knowledge Graphs have rapidly become a mainstream technology that combines features of databases and AI. In this talk I will introduce Knowledge Graphs, explaining their features and the theory behind them. I will then consider some of the challenges inherent in both the theory and implementation of Knowledge Graphs and present some solutions that have made possible the development of popular language standards and robust and high-performance Knowledge Graph systems. Finally, I will illustrate the wide applicability of knowledge graph technology with some example use cases.

17:40-17:45
Bridging Language Barriers in Bio-Curation: An LLM-Enhanced Workflow for Ontology Translation into Japanese
Confirmed Presenter: Mark Streer, SciBite (Elsevier Ltd.), United Kingdom

Room: 03A
Format: In person

Moderator(s): Augustin Luna


Authors List: Show

  • Mark Streer, SciBite (Elsevier Ltd.), United Kingdom
  • Olivia Watson, SciBite (Elsevier Ltd.), United Kingdom
  • Mark McDowall, SciBite (Elsevier Ltd.), United Kingdom
  • Jane Lomax, SciBite (Elsevier Ltd.), United Kingdom

Presentation Overview: Show

SciBite’s ontology management and named entity recognition (NER) software relies on curated public ontologies to support data harmonization under FAIR principles (findable, accessible, interoperable, and reusable). Public ontologies are foundational for data FAIR-ification, providing structured vocabularies that enable consistent annotation and semantic integration; however, they are predominantly developed in English, creating barriers for non-English users and applications. To address this challenge for our Japanese customers, we developed a large language model (LLM)-enhanced bio-curation workflow for English-to-Japanese translation, focusing on synonym enrichment of the Uberon anatomy ontology as a case study. Our approach implements a three-step process: (1) importing mapped Japanese synonyms from existing bilingual datasets (e.g., DBCLS resources), (2) generating Japanese candidate synonyms from English synonyms and definitions using an LLM, and (3) validating candidates against the source ontology to ensure appropriate placement, and against online dictionaries and other references to confirm their real-world applicability. Initially developed for synonym enrichment, this workflow can be extended beyond exactMatch to semantic refinement into broadMatch and narrowMatch relationships, which is critical for terminology lacking perfect English equivalents. Furthermore, the workflow is well suited to agentic frameworks such as LangGraph for orchestrating generation and Internet research processes, as well as LLM-ensemble evaluation to automatically confirm clear matches, allowing ambiguous cases to be prioritized for “human-in-the-loop” curation. This approach represents a promising solution for scalable ontology translation, contributing to the FAIR development and application of bio-ontologies across language barriers and enhancing international biomedical research collaboration.
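A compact sketch of the three-step workflow described above, with canned toy data standing in for the bilingual resources and the LLM call; the function names and example term are illustrative and not SciBite's implementation.

# Toy sketch of the three-step workflow (not SciBite's implementation).

def import_existing_synonyms(bilingual_map, term):
    """Step 1: reuse Japanese synonyms already mapped in bilingual resources (e.g., DBCLS)."""
    return list(bilingual_map.get(term, []))

def generate_candidates(english_label, english_synonyms, definition):
    """Step 2: in the real workflow this is an LLM prompt built from the English
    label, synonyms, and definition; here we return canned output for illustration."""
    canned = {"heart": ["心臓"]}
    return canned.get(english_label, [])

def validate_candidates(candidates, reference_vocabulary):
    """Step 3 (simplified): keep only candidates corroborated by reference
    vocabularies; the real workflow also checks placement in the source ontology
    and consults online dictionaries."""
    return [c for c in candidates if c in reference_vocabulary]

existing = import_existing_synonyms({"heart": ["心臓"]}, "heart")
candidates = generate_candidates("heart", ["cardiac organ"], "A muscular circulatory organ...")
accepted = validate_candidates(candidates, reference_vocabulary={"心臓"})
print(existing, accepted)  # ['心臓'] ['心臓']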

17:45-17:50
Enabling FAIR Single-Cell RNAseq Data Management with COPO
Confirmed Presenter: Felix Shaw, Earlham Institute, United Kingdom

Room: 03A
Format: In person

Moderator(s): Augustin Luna


Authors List: Show

  • Felix Shaw, Earlham Institute, United Kingdom
  • Debby Ku, Earlham Institute, United Kingdom
  • Aaliyah Providence, Earlham Institute, United Kingdom
  • Irene Papatheodorou, Earlham Institute, United Kingdom

Presentation Overview: Show

We present our work on establishing standards and tools for validating and submitting single-cell RNA sequencing (scRNA-seq) data and metadata using the COPO brokering platform. Effective research data management is essential for enabling data reuse, integration, and the discovery of new biological insights. As new technologies like single-cell sequencing and transcriptomics emerge, they often outpace existing data infrastructure.

Single-cell technologies allow detailed insights into biological processes, for example, tracking gene expression dynamics in crops, dissecting pathogen-host interactions at the cellular level, or identifying stress-resilient cell types. Yet without comprehensive metadata and appropriate data management tools, the full potential of these datasets remains unrealised.

Implementing the FAIR principles, particularly around metadata quality, is crucial. At present, there are few widely adopted standards or tools for describing scRNA-seq experiments. In response, we have developed a structured metadata template tailored to these experiments, informed by extensive consultation with researchers across the single-cell community and aligned with existing standards.

This metadata standard is integrated into COPO, which provides a streamlined interface for validating and brokering data and metadata to public repositories. Standardised metadata improves discoverability, supports data integration across platforms, and enables consistent reuse. It also ensures proper attribution, facilitates collaboration across diverse disciplines, and enhances reproducibility.

By submitting with FAIR metadata via COPO, we transform scRNA-seq outputs from isolated experimental results into well-labelled, interoperable datasets suitable for downstream applications such as machine learning. Our work addresses a key infrastructure gap, enabling more effective, collaborative, and impactful research in the single-cell field.
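As a simplified illustration of template-driven metadata checking (assuming the third-party jsonschema package is available), the template below is a trimmed-down stand-in, not COPO's actual scRNA-seq metadata template.

# Illustrative only: a minimal metadata check in the spirit of template-based
# validation; COPO's real templates and validation rules are more extensive.
from jsonschema import validate, ValidationError

SCRNASEQ_TEMPLATE = {  # hypothetical, trimmed-down template
    "type": "object",
    "required": ["sample_id", "organism", "tissue", "library_construction", "cell_count"],
    "properties": {
        "sample_id": {"type": "string"},
        "organism": {"type": "string"},
        "tissue": {"type": "string"},
        "library_construction": {"type": "string", "enum": ["10x 3' v3", "Smart-seq2"]},
        "cell_count": {"type": "integer", "minimum": 1},
    },
}

record = {
    "sample_id": "S001",
    "organism": "Triticum aestivum",
    "tissue": "root tip",
    "library_construction": "10x 3' v3",
    "cell_count": 5000,
}

try:
    validate(instance=record, schema=SCRNASEQ_TEMPLATE)
    print("metadata record passes the template checks")
except ValidationError as err:
    print("invalid metadata:", err.message)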

17:50-17:55
Cancer Complexity Knowledge Portal: A centralized web portal for finding cancer related data, software tools, and other resources
Confirmed Presenter: Susheel Varma, Sage Bionetworks, United States

Room: 03A
Format: In person

Moderator(s): Augustin Luna


Authors List: Show

  • Orion Banks, Sage Bionetworks, United States
  • Ashley Clayton, Sage Bionetworks, United Kingdom
  • Aditi Gopalan, Sage Bionetworks, United Kingdom
  • Amber Nelson, Sage Bionetworks, United States
  • Stockard Simon, Sage Bionetworks, United States
  • Verena Chung, Sage Bionetworks, United States
  • Amy Heiser, Sage Bionetworks, United States
  • Jay Hodgson, Sage Bionetworks, United States
  • Aditya Nath, Sage Bionetworks, United States
  • Adam Hindman, Sage Bionetworks, United States
  • Milen Nikolov, Sage Bionetworks, United States
  • Adam Taylor, Sage Bionetworks, United Kingdom
  • James Eddy, Sage Bionetworks, United States
  • Susheel Varma, Sage Bionetworks, United States
  • Jineta Banerjee, Sage Bionetworks, United States

Presentation Overview: Show

Applying artificial intelligence and machine learning to biomedical problems requires clean, high-quality data and reusable software tools. The Cancer Complexity Knowledge Portal (CCKP), an NIH-listed domain-specific repository maintained by the Multi-Consortia Coordinating (MC2) Center at Sage Bionetworks, makes oncology data findable and accessible. The MC2 Center coordinates resources among six cancer-focused research consortia funded by the National Cancer Institute.

To establish metadata standards, the CCKP hosts data models for various modalities, including genomics and imaging. New models are also being developed for emerging types, such as spatial transcriptomics. These models undergo iterative development with versioned releases maintained in a public GitHub repository. They power data management tools developed by Sage Bionetworks, including the Schematic Python package and the Data Curator App, which support FAIR data annotation.
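A generic sketch of how a versioned data model can drive annotation checks; this is not the Schematic package's API, and the model contents and version tag below are hypothetical.

# Generic sketch: not the Schematic package's API and not an actual CCKP model.
MODEL_VERSION = "v1.2.0"  # hypothetical release tag of a data model repository
IMAGING_MODEL = {
    "required": ["file_format", "imaging_modality", "organ"],
    "valid_values": {"file_format": {"OME-TIFF", "DICOM"}},
}

def check_annotations(record, model):
    """Report missing required attributes and values outside the controlled lists."""
    errors = [f"missing attribute: {a}" for a in model["required"] if a not in record]
    for attr, allowed in model["valid_values"].items():
        if attr in record and record[attr] not in allowed:
            errors.append(f"invalid value for {attr}: {record[attr]}")
    return errors

record = {"file_format": "OME-TIFF", "imaging_modality": "brightfield", "organ": "lung"}
print(check_annotations(record, IMAGING_MODEL) or f"valid against model {MODEL_VERSION}")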

The data models help researchers link research outputs and assist the CCKP in highlighting activities from NCI-funded cancer research programs. The portal offers search and filtering capabilities to accelerate discovery and collaboration. As of November 2024, it hosts information on 3,786 publications, 904 datasets, and 292 computational tools from over 140 research grants. The models incorporate elements from the Cancer Research Data Commons (CRDC) Data Hub to support integration within the CRDC ecosystem.

We are engaging with scientists, clinicians, and patient advocates to leverage user-centred design and structured data models, making cancer data more findable, accessible, and reusable. These improvements aim to bridge the gap between experimental and computational labs, fueling scientific discovery.

17:55-18:00
COSI Closing Remarks
Room: 03A
Format: In person

Moderator(s): Augustin Luna


Authors List: Show

  • Augustin Luna
  • Tiffany Callahan