ISMB/ECCB 2025 Agenda: Bio-Ontologies and Knowledge Representation Track
Date | Start Time | End Time | Room | Track | Title | Confirmed Presenter | Authors | Abstract |
---|---|---|---|---|---|---|---|---|
2025-07-23 | 11:20:00 | 11:40:00 | 03A | Bio-Ontologies and Knowledge Representation | Knowledge-Graph-driven and LLM-enhanced Microbial Growth Predictions | Marcin Joachimiak | Marcin Joachimiak | Predicting microbial growth preferences has far-reaching impacts in biotechnology, healthcare, and environmental management. Cultivating microbes allows researchers to streamline strain selection, develop targeted antimicrobials, and uncover metabolic pathways for biodegradation or biomanufacturing. However, with most microbial taxa remaining uncultivated and knowledge of their metabolic capabilities and organismal traits fragmented in unstructured text, cultivation remains a major challenge. To address this, we developed KG-Microbe, a knowledge graph (KG) of over 800,000 bacterial and archaeal taxa, 3,000 types of traits, and 30,000 types of functional annotations. Using KG-Microbe, we constructed machine learning pipelines to predict microbial growth preferences. We compared symbolic rule mining, which produces human-readable explanations, with "black box" methods like gradient boosted decision trees and deep graph-based models. While boosted tree models achieved a mean precision of over 70% across 46 diverse media, we demonstrate that symbolic rule mining can match their performance, offering crucial interpretability. To further validate predictions, we used large language models (LLMs) to interpret and explain model outputs. By comparing these different models and their outputs, we identified key data features and knowledge gaps relevant to predicting microbial cultivation media preferences. We also used vector embedding analogy reasoning as well as complex graph queries on KG-Microbe to generate novel hypotheses and identify organisms with specific properties. Our work highlights the power of a KG-driven approach and the trade-offs between model interpretability and predictive performance. These findings motivate the development of hybrid AI models that combine transparency with predictive accuracy to advance microbial cultivation. | |
2025-07-23 | 11:40:00 | 12:00:00 | 03A | Bio-Ontologies and Knowledge Representation | ProDiGenIDB – a unified resource of disease-associated genes, their protein products, and intrinsic disorder annotations | Jovana Kovacevic | Jovana Kovacevic, Anđelka Zečević, Lazar Vasović | Understanding gene-disease associations is essential in biomedical research, yet relevant information is often distributed across multiple heterogeneous databases. To overcome this fragmentation, we developed ProDiGenIDB, an integrated database that consolidates gene-disease relationships from several recognized and publicly available sources, while also enriching them with complementary data on gene and protein identifiers, disease ontology, and protein structural disorder. ProDiGenIDB brings together over 400,000 curated associations sourced from DisGeNet, COSMIC, HumsaVar, Orphanet, ClinVar, HPO, and DISEASES. Each entry includes gene-related metadata (Gene Symbol, Entrez ID, UniProt ID, Ensembl ID), disease descriptors (Disease Name, DOID), and a reference to the original source database. Importantly, the database also incorporates predicted intrinsic disorder information for proteins encoded by the associated genes. These predictions were generated using commonly used protein disorder prediction tools such as IUPred and VSL2, providing additional insight into the potential lack of structure of disease-related proteins. Another important aspect of the database construction involved mapping disease names to standardized Disease Ontology IDs (DOIDs). To improve this process, we applied Natural Language Processing (NLP) techniques using advanced text representation models to enhance the accuracy and consistency of term association. ProDiGenIDB represents a valuable resource for integrative biomedical studies, particularly in contexts where protein disorder is hypothesized to play a functional or pathological role. | |
2025-07-23 | 12:00:00 | 12:20:00 | 03A | Bio-Ontologies and Knowledge Representation | Causal knowledge graph analysis identifies adverse drug effects | Sumyyah Toonsi | Sumyyah Toonsi, Paul Schofield, Robert Hoehndorf | Motivation: Knowledge graphs and structural causal models have each proven valuable for organizing biomedical knowledge and estimating causal effects, but remain largely disconnected: knowledge graphs encode qualitative relationships focusing on facts and deductive reasoning without formal probabilistic semantics, while causal models lack integration with background knowledge in knowledge graphs and have no access to the deductive reasoning capabilities that knowledge graphs provide. Results: To bridge this gap, we introduce a novel formulation of Causal Knowledge Graphs (CKGs) which extend knowledge graphs with formal causal semantics, preserving their deductive capabilities while enabling principled causal inference. CKGs support deconfounding via explicitly marked causal edges and facilitate hypothesis formulation aligned with both encoded and entailed background knowledge. We constructed a Drug–Disease CKG (DD-CKG) integrating disease progression pathways, drug indications, side-effects, and hierarchical disease classification to enable automated large-scale mediation analysis. Applied to UK Biobank and MIMIC-IV cohorts, we tested whether drugs mediate effects between indications and downstream disease progression, adjusting for confounders inferred from the DD-CKG. Our approach successfully reproduced known adverse drug reactions with high precision while identifying previously undocumented significant candidate adverse effects. Further validation through side effect similarity analysis demonstrated that combining our predicted drug effects with established databases significantly improves the prediction of shared drug indications, supporting the clinical relevance of our novel findings. These results demonstrate that our methodology provides a generalizable, knowledge-driven framework for scalable causal inference. | |
2025-07-23 | 12:20:00 | 12:40:00 | 03A | Bio-Ontologies and Knowledge Representation | CROssBARv2: A Unified Biomedical Knowledge Graph for Heterogeneous Data Representation and LLM-Driven Exploration | Bünyamin Şen | Bünyamin Şen, Erva Ulusoy, Melih Darcan, Mert Ergün, Tunca Dogan | Developing effective therapeutics against prevalent diseases requires a deep understanding of molecular, genetic, and cellular factors involved in disease development/progression. However, such knowledge is dispersed across different databases, publications, and ontologies, making collecting, integrating and analysing biological data a major challenge. Here, we present CROssBARv2, an extended and improved version of our previous work (https://crossbar.kansil.org/), a heterogeneous biological knowledge graph (KG)-based system to facilitate systems biology and drug discovery/repurposing. CROssBARv2 collects large-scale biological data from 32 data sources and stores them in a Neo4j graph database. CROssBARv2 consists of 2,709,502 nodes and 12,688,124 relationships between 14 node types. To facilitate the use of the resource, we also developed a GraphQL API and a large language model (LLM) interface that translates users’ natural-language queries into Neo4j's Cypher query language and converts the retrieved results back into natural language, enabling users to access information within the KG and answer specific scientific questions without LLM hallucinations. To evaluate the capability of CROssBAR-LLMs (LLMs augmented with structured knowledge in CROssBAR) in answering biomedical questions, we constructed multiple benchmark datasets and employed an independent benchmark to systematically compare various open- and closed-source LLMs. Our results revealed that CROssBAR-LLMs achieve significantly improved accuracy in answering these scientific questions compared to standalone LLMs and even LLMs augmented with web search. CROssBARv2 (https://crossbarv2.hubiodatalab.com/) is expected to contribute to life sciences research through (i) the discovery of disease mechanisms at the molecular level and (ii) the development of effective personalised therapeutic strategies. | |
2025-07-23 | 12:40:00 | 12:45:00 | 03A | Bio-Ontologies and Knowledge Representation | Benchmarking Data Leakage on Link Prediction in Biomedical Knowledge Graph Embeddings | Galadriel Brière | Galadriel Brière, Thomas Stosskopf, Benjamin Loire, Anaïs Baudot | In recent years, Knowledge Graphs (KGs) have gained significant attention for their ability to organize massive biomedical knowledge into entities and relationships. Knowledge Graph Embedding (KGE) models facilitate efficient exploration of KGs by learning compact data representations. These models are increasingly applied to biomedical KGs for various tasks, notably link prediction, which enables applications such as drug repurposing. The research community has implemented benchmarks to evaluate and compare the large diversity of KGE models. However, existing benchmarks often overlook the issue of Data Leakage (DL), which can lead to inflated performance and compromise the validity of benchmark results. DL may occur due to inadequate separation between training and test sets (DL1), use of illegitimate features (DL2), or evaluation settings that fail to reflect real-world inference conditions (DL3). In this study, we implement systematic procedures to detect and mitigate these sources of DL. We evaluate popular KGE models on a biomedical KG and show that inappropriate data separation (DL1) artificially inflates model performance and that models do not rely on node degree as a shortcut feature (DL2). For DL3, we implement realistic inference conditions with i) a zero-shot training procedure in which drugs in test and validation sets have no known indications during training and ii) a drug repurposing ground-truth for rare diseases. Performance collapses in both scenarios. Our findings highlight the need for more rigorous evaluation protocols and raise concerns about the reliability of current KGE models for real-world biomedical applications such as drug repurposing. | |
2025-07-23 | 12:45:00 | 12:50:00 | 03A | Bio-Ontologies and Knowledge Representation | A machine learning framework for extracting and structuring biological pathway knowledge from scientific literature | Mun Su Kwon | Mun Su Kwon, Junkyu Lee, Haechan Sung, Hyun Uk Kim | Advances in text mining have significantly improved the accessibility of scientific knowledge from literature. However, a major challenge in biology and biotechnology remains in extracting information embedded within biological pathway images, which are not easily accessible through conventional text-based methods. To overcome this limitation, we present a machine learning–based framework called “Extraction of Biological Pathway Information” (EBPI). The framework systematically retrieves relevant publications based on user-defined queries, identifies biological pathway figures, and extracts structured information such as genes, enzymes, and metabolites. EBPI combines image processing and natural language models to identify texts from diagrams, classify terms into biological categories, and infer biochemical reaction directionality using graphical cues such as arrows. The extracted information is output in an editable, tabular format suitable for integration with pathway databases and knowledge graphs. Validated against manually curated pathway maps, EBPI enables scalable knowledge extraction from complex visual data of biological pathways and opens new directions for automated literature curation across many biological disciplines. | |
2025-07-23 | 12:50:00 | 13:00:00 | 03A | Bio-Ontologies and Knowledge Representation | Poster Madness | | | Each accepted poster presenter is given up to 1 minute to advertise their poster. | |
2025-07-23 | 14:00:00 | 14:20:00 | 03A | Bio-Ontologies and Knowledge Representation | ScGOclust: leveraging gene ontology to find functionally analogous cell types between distant species | Yuyao Song | Yuyao Song, Yanhui Hu, Julian Dow, Norbert Perrimon, Irene Papatheodorou | Basic biological processes are shared across animal species, yet their cellular mechanisms are profoundly diverse. Comparing cell-type gene expression between species reveals conserved and divergent cellular functions. However, as phylogenetic distance increases, gene-based comparisons become less informative. The Gene Ontology (GO) knowledgebase offers a solution by serving as the most comprehensive resource of gene functions across a vast diversity of species, providing a bridge for distant species comparisons. Here, we present scGOclust, a computational tool that constructs de novo cellular functional profiles using GO terms, facilitating systematic and robust comparisons within and across species. We applied scGOclust to analyse and compare the heart, gut and kidney between mouse and fly, and whole-body data from C. elegans and H. vulgaris. We show that scGOclust effectively recapitulates the function spectrum of different cell types, characterises functional similarities between homologous cell types, and reveals functional convergence between unrelated cell types. Additionally, we identified subpopulations within the fly crop that show circadian rhythm-regulated secretory properties and hypothesize an analogy between fly principal cells from different segments and distinct mouse kidney tubules. We envision scGOclust as an effective tool for uncovering functionally analogous cell types or organs across distant species, offering fresh perspectives on evolutionary and functional biology. | |
2025-07-23 | 14:20:00 | 14:40:00 | 03A | Bio-Ontologies and Knowledge Representation | Integrating autoantibody-related knowledge in an ontology populated using a curated dataset from literature | Fabien Maury | Fabien Maury, Solène Grosdidier, Killian Halberda, Isabelle Desguerre, Adrien Coulet, Maud de Dieuleveult | Autoimmune diseases (AIDs) are often characterized by the presence of autoantibodies (AAbs). However, many of these diseases are rare and can be hard to diagnose, partly because knowledge such as which type of AAb to test for in order to diagnose a particular AID is not easily accessible. Indeed, to our knowledge, no centralized resource covering all available knowledge related to human autoantibodies exists as of April 2025. To fill this gap, we first introduce a lightweight ontology for representing relationships among AAbs, their molecular targets, and the related AIDs and their clinical signs. This ontology also allows the provenance of these relationships to be specified, by reusing the PROV-O ontology. Second, we introduce the MAKAAO Core dataset, compiled manually from the literature by several curators. MAKAAO Core includes the names and synonyms (in both English and French) of over 350 autoantibodies, along with their targets and associated AIDs. Targets and diseases are referred to using identifiers from reference resources. We used this dataset to populate our ontology and named the result the MAKAAO knowledge graph (MAKAAO KG), which constitutes the central part of a future reference resource. | |
2025-07-23 | 14:40:00 | 15:00:00 | 03A | Bio-Ontologies and Knowledge Representation | Ontology pre-training improves machine learning predictions of aqueous solubility and other metabolite properties | Charlotte Tumescheit | Charlotte Tumescheit, Martin Glauer, Simon Flügel, Fabian Neuhaus, Till Mossakowski, Janna Hastings | Predicting properties of small molecule metabolites from structures is a challenging task. Molecular language models have emerged as a highly performant AI approach for prediction of diverse properties directly from ‘language-like’ representations of the structures of molecules. However, for many prediction problems, there is a shortage of available training data and model performance is still limited. Integrating expert knowledge into language models has the potential to improve performance on prediction tasks and model generalisability. Bio-ontologies offer curated knowledge ideal for this purpose. Here, we demonstrate a novel approach to knowledge injection, ‘ontology pre-training’, which we have previously shown to work for a pilot case study in the classification task of toxicity prediction. Now, we extend this to regression tasks such as solubility prediction and a wider range of classification tasks. First, we pre-train a Transformer-based language model on molecules from PubChem. Then, using our novel method, we embed the knowledge contained in a classification hierarchy derived from the ChEBI ontology into the model as an intermediate training step between general-purpose pre-training and task-specific fine-tuning. Finally, we fine-tune the models on a range of regression tasks. We find a clear improvement in performance and training times across the diverse prediction tasks. Our results show that adding an additional knowledge-based training step to a machine learning model can improve performance. Our method is intuitive and generalisable and we plan to extend it to further biological modalities and prediction datasets, including proteins and RNA, as well as exploring the impact of different ontologies. | |
2025-07-23 | 15:00:00 | 15:20:00 | 03A | Bio-Ontologies and Knowledge Representation | Building the Aging Biomarkers Ontology and Its Applications in Aging Research | Hande McGinty | Hande McGinty, Srikar Reddy Gadusu, Yigit Kucuk, Aaron King | Aging is a complex biological process shaped by numerous biomarkers—such as cholesterol and blood sugar levels—that serve as measurable indicators of health and disease. Despite the abundance of biomarker data, identifying meaningful patterns and relationships remains a significant challenge. To address this, we began developing the Aging Biomarkers Ontology (ABO), a structured framework that formally defines aging-related biomarkers, organizes them hierarchically, and maps their interconnections to facilitate deeper analysis. Furthermore, we employed two complementary approaches to enrich the graph and uncover hidden associations among aging biomarkers: Depth-Limited Search (DLS) and machine learning-based embedding search. DLS identifies associations by traversing connected nodes within a predefined depth, while the embedding-based method encodes biomarker relationships as numerical vectors and uses cosine similarity to predict potential links. We evaluated the performance of both methods in detecting known and novel relationships. Our results demonstrate the value of systematically integrating statistical analysis with graph-based reasoning and machine learning to explore aging-related biomarkers. The resulting framework enhances the interpretability of biomarker data, supports hypothesis generation, and contributes to advancing biomedical research in aging and longevity. | |
2025-07-23 | 15:20:00 | 15:40:00 | 03A | Bio-Ontologies and Knowledge Representation | Discovering cellular contributions to disease pathogenesis in the NLM Cell Knowledge Network | Richard Scheuermann | Richard Scheuermann, Anne Deslattes Mays, Matthew Diller, Caroline Eastwood, Rezarta Islamaj, James Leaman, Raymond LeClair, Zhiyong Lu, Chris Mungall, Vinh Nguyen, David Osumi-Sutherland, Beverly Peng, Noam Rotenberg, William Spear, Bingfang Xu, Yun Zhang | Knowledge about the role of genes in disease pathogenesis has been obtained from genetic and genome-wide association studies. The proteins encoded by these genes are frequently found to be effective therapeutic targets. However, little is known about which cells are the functional home of these disease-associated genes and proteins. Single cell genomic technologies are now revealing the cellular complexity of human tissues at high resolution. The transcriptomes defined by these technologies reflect the functional cellular phenotypes. Database resources that capture and disseminate data derived from these single cell technologies have been developed. But the knowledge derived from their analysis and interpretation is largely buried as free text in the scientific literature. Here we describe the development of a Cell Knowledge Network (CKN) prototype at the National Library of Medicine (NLM) that captures and exposes knowledge about cell phenotypes (cell types and states) derived from single cell technologies and related experiments. NLM-CKN is populated using validated computational analysis pipelines and natural language processing of the scientific literature and integrated with other sources of relevant knowledge about genes, anatomical structures, diseases, and drugs. Using this integration of experimental sc/snRNAseq data with prior knowledge about disease predispositions and drug targets, a novel linkage between lung pericytes and pulmonary hypertension was discovered through the KCNK3 gene intermediary with implications for novel therapeutic interventions. Through the integration of knowledge from single cell technologies with other sources of knowledge about genetic predispositions and therapeutic targets, the NLM-CKN is revealing the cellular contributions to disease pathogenesis. | |
2025-07-23 | 15:40:00 | 16:00:00 | 03A | Bio-Ontologies and Knowledge Representation | Cat-VRS for Genomic Knowledge Curation: A Hyperintensional Representation Framework for FAIR Categorical Variation | Daniel Puthawala | Daniel Puthawala, Brendan Reardon | Cat-VRS: a FAIR catvar standard. Categorical variants (catvars)—such as “MET exon 14 skipping” and “TP53 loss”—are foundational to genomic knowledge, linking sets of genomic variants to clinically relevant assertions like oncogenicity scores or predicted therapeutic response. Yet despite their importance, catvars remain unstandardized, ambiguous, and largely non-computable, creating persistent barriers to search, curation, interoperability, and reuse. Existing standards either offer flexible models for sequence-resolved variants (e.g., GA4GH VRS) or rigid top-down nomenclatures (e.g., HGVS) that fail to capture the diversity and nuance of categorical assertions. We present the Categorical Variation Representation Specification (Cat-VRS), a new GA4GH standard for representing catvars using a hyperintensional, constraint-based model. Cat-VRS encodes categorical meaning compositionally and bottom-up: structured constraints—such as sequence location or protein functional consequence—support precise, flexible representations at varying levels of granularity. Cat-VRS is fully interoperable with other GA4GH standards, supports ontology mappings, and was developed through global community collaboration in alignment with the FAIR data principles. Cat-VRS 1.0 was recently released by GA4GH and is already in use by ClinVar and MaveDB, with integration underway in CIViC and the VICC MetaKB. These early implementations demonstrate Cat-VRS’s practical utility in enabling reusable, computable representations of categorical knowledge. As precision medicine scales, so too does the need for infrastructure that supports consistent curation, standardized data sharing, and automated variant knowledge matching. We invite the bio-ontologies and knowledge representation community to engage with Cat-VRS as both a practical tool and an extensible framework for advancing interoperable genomic knowledge. | |
2025-07-23 | 16:40:00 | 17:40:00 | 03A | Bio-Ontologies and Knowledge Representation | Knowledge Graphs: Theory, Applications and Challenges | Ian Horrocks | Ian Horrocks | Knowledge Graphs have rapidly become a mainstream technology that combines features of databases and AI. In this talk I will introduce Knowledge Graphs, explaining their features and the theory behind them. I will then consider some of the challenges inherent in both the theory and implementation of Knowledge Graphs and present some solutions that have made possible the development of popular language standards and robust and high-performance Knowledge Graph systems. Finally, I will illustrate the wide applicability of knowledge graph technology with some example use cases. | |
2025-07-23 | 17:40:00 | 17:45:00 | 03A | Bio-Ontologies and Knowledge Representation | Bridging Language Barriers in Bio-Curation: An LLM-Enhanced Workflow for Ontology Translation into Japanese | Mark Streer | Mark Streer, Olivia Watson, Mark McDowall, Jane Lomax | SciBite’s ontology management and named entity recognition (NER) software relies on curated public ontologies to support data harmonization under FAIR principles (findable, accessible, interoperable, and reusable). Public ontologies are foundational for data FAIR-ification, providing structured vocabularies that enable consistent annotation and semantic integration; however, they are predominantly developed in English, creating barriers for non-English users and applications. To address this challenge for our Japanese customers, we developed a large language model (LLM)-enhanced bio-curation workflow for English-to-Japanese translation, focusing on synonym enrichment of the Uberon anatomy ontology as a case study. Our approach implements a three-step process: (1) importing mapped Japanese synonyms from existing bilingual datasets (e.g., DBCLS resources), (2) generating Japanese candidate synonyms based on English synonyms and definitions using an LLM, and (3) validating candidates against the source ontology to ensure appropriate placement, as well as against online dictionaries and other references to confirm their real-world applicability. Initially developed for synonym enrichment, this workflow can be extended to semantic refinement into broadMatch and narrowMatch relationships in addition to exactMatch—critical for terminology lacking perfect English equivalents. Furthermore, the workflow is well-suited to agentic frameworks such as LangGraph to orchestrate generation and Internet research processes, as well as LLM-ensemble evaluation to automatically confirm clear matches, allowing ambiguous cases to be prioritized for “human-in-the-loop” curation. This approach represents a promising solution for scalable ontology translation, contributing to the FAIR development and application of bio-ontologies across language barriers and enhancing international biomedical research collaboration. | |
2025-07-23 | 17:45:00 | 17:50:00 | 03A | Bio-Ontologies and Knowledge Representation | Enabling FAIR Single-Cell RNAseq Data Management with COPO | Felix Shaw | Felix Shaw, Debby Ku, Aaliyah Providence, Irene Papatheodorou | We present our work on establishing standards and tools for validating and submitting single-cell RNA sequencing (scRNA-seq) data and metadata using the COPO brokering platform. Effective research data management is essential for enabling data reuse, integration, and the discovery of new biological insights. As new technologies like single-cell sequencing and transcriptomics emerge, they often outpace existing data infrastructure. Single-cell technologies allow detailed insights into biological processes, for example, tracking gene expression dynamics in crops, dissecting pathogen-host interactions at the cellular level, or identifying stress-resilient cell types. Yet without comprehensive metadata and appropriate data management tools, the full potential of these datasets remains unrealised. Implementing the FAIR principles—particularly around metadata quality—is crucial. At present, there are few widely adopted standards or tools for describing scRNA-seq experiments. In response, we have developed a structured metadata template tailored to these experiments, informed by extensive consultation with researchers across the single-cell community and aligned with existing standards. This metadata standard is integrated into COPO, which provides a streamlined interface for validating and brokering data and metadata to public repositories. Standardised metadata improves discoverability, supports data integration across platforms, and enables consistent reuse. It also ensures proper attribution, facilitates collaboration across diverse disciplines, and enhances reproducibility. By submitting with FAIR metadata via COPO, we transform scRNA-seq outputs from isolated experimental results into well-labelled, interoperable datasets suitable for downstream applications such as machine learning. Our work addresses a key infrastructure gap, enabling more effective, collaborative, and impactful research in the single-cell field. | |
2025-07-23 | 17:50:00 | 17:55:00 | 03A | Bio-Ontologies and Knowledge Representation | Cancer Complexity Knowledge Portal: A centralized web portal for finding cancer-related data, software tools, and other resources | Susheel Varma | Orion Banks, Ashley Clayton, Aditi Gopalan, Amber Nelson, Stockard Simon, Verena Chung, Amy Heiser, Jay Hodgson, Aditya Nath, Adam Hindman, Milen Nikolov, Adam Taylor, James Eddy, Susheel Varma, Jineta Banerjee | Applying artificial intelligence and machine learning to biomedical problems requires clean, high-quality data and reusable software tools. The Cancer Complexity Knowledge Portal (CCKP), an NIH-listed domain-specific repository maintained by the Multi-Consortia Coordinating (MC2) Center at Sage Bionetworks, makes oncology data findable and accessible. The MC2 Center coordinates resources among six cancer-focused research consortia funded by the National Cancer Institute. To establish metadata standards, the CCKP hosts data models for various modalities, including genomics and imaging. New models are also being developed for emerging types, such as spatial transcriptomics. These models undergo iterative development with versioned releases maintained in a public GitHub repository. They power data management tools developed by Sage Bionetworks, including the Schematic Python package and the Data Curator App, which support FAIR data annotation. The data models help researchers link research outputs and assist the CCKP in highlighting activities from NCI-funded cancer research programs. The portal offers search and filtering capabilities to accelerate discovery and collaboration. As of November 2024, it hosts information on 3,786 publications, 904 datasets, and 292 computational tools from over 140 research grants. The models incorporate elements from the Cancer Research Data Commons Data Hub to support integration within the CRDC ecosystem. We are engaging with scientists, clinicians, and patient advocates to leverage user-centred design and structured data models, making cancer data more findable, accessible, and reusable. These improvements aim to bridge the gap between experimental and computational labs, fueling scientific discovery. | |
2025-07-23 | 17:55:00 | 18:00:00 | 03A | Bio-Ontologies and Knowledge Representation | COSI Closing Remarks | Augustin Luna, Tiffany Callahan | Augustin Luna, Tiffany Callahan | | |