Attention Presenters - please review the Speaker Information Page available here
Schedule subject to change
All times listed are in BST
Wednesday, July 23rd
11:20-11:40
Session: Session I: Ontologies and Knowledge Graphs in the Biodata Ecosystem
Invited Presentation: Disease Ontology Knowledgebase: A Global BioData hub for FAIR disease data discovery
Confirmed Presenter: Lynn Schriml, University of Maryland School of Medicine

Room: 12
Format: In person

Moderator(s): Ishwar Chandramouliswaran


Authors List:

  • Lynn Schriml, University of Maryland School of Medicine

Presentation Overview:

Development of long-term biodata resources depends, by design, on a stable data model with persistent identifiers, regular data releases, and reliable responsiveness to ongoing community needs. Addressing evolving needs while continually advancing our data representation has sustained the 20-year growth and utility of the Human Disease Ontology (DO, https://www.disease-ontology.org/). Biodata resources must maintain their relevance, adapting to address and fulfill persistent, evolving needs. Strategically, the DO team actively identifies and connects with our expanding user community, thus driving the DO’s integration of diverse disease factors (e.g., molecular, environmental, and mechanistic) into a single framework. Serving a vast user community since 2003 (>415 biomedical resources across 45 countries), the DO’s continual content and classification expansion is driven by the ever-evolving disease knowledge ecosystem. The DO, a designated Global Core Biodata Resource (https://globalbiodata.org/), empowers disease data integration, standardization, and analysis across the interconnected web of biomedical information. A focus on modernizing infrastructure is imperative to provide new mechanisms for data interoperability and accessibility. Our strategic approach, which includes following community best practices (e.g., OBO Foundry, FAIR principles), adapting established technical approaches (e.g., Neo4j; Swagger for the API), and openly sharing project-developed tooling, reduces technical debt while maximizing data delivery opportunities. The DO Knowledgebase (DO-KB) tools (DO-KB SPARQL service and endpoint, Faceted Search Interface, advanced API service, DO.utils) have been developed to enhance data discovery, delivering an integrated data system that exposes the DO’s semantic knowledge and connects disease-related data across Open Linked Data resources.
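
As a concrete illustration of the kind of programmatic access the DO-KB exposes, the minimal sketch below retrieves the label of the root disease term from a DO-KB-style SPARQL endpoint. The endpoint URL is an assumption for illustration only; consult https://www.disease-ontology.org/ for the documented DO-KB SPARQL service and API.

    # Minimal sketch: querying a DO-KB-style SPARQL endpoint from Python.
    # The ENDPOINT URL below is an illustrative assumption, not the
    # documented DO-KB address; DOID_4 is the DO's root "disease" term.
    import requests

    ENDPOINT = "https://sparql.disease-ontology.org/"  # assumed URL
    QUERY = """
    PREFIX obo: <http://purl.obolibrary.org/obo/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE { obo:DOID_4 rdfs:label ?label . }
    """

    resp = requests.get(
        ENDPOINT,
        params={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["label"]["value"])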

11:40-12:00
Session: Session I: Ontologies and Knowledge Graphs in the Biodata Ecosystem
Invited Presentation: Integrating Data Treasures: Knowledge graphs of the DSMZ Digital Diversity
Confirmed Presenter: Julia Koblitz, Leibniz Institute DSMZ, Germany

Room: 12
Format: In person

Moderator(s): Ishwar Chandramouliswaran


Authors List:

  • Julia Koblitz, Leibniz Institute DSMZ, Germany
  • Lorenz Christian Reimer, Leibniz Institute DSMZ, Germany

Presentation Overview:

The DSMZ (German Collection of Microorganisms and Cell Cultures) hosts a wealth of biological data, covering microbial traits (BacDive), taxonomy (LPSN), enzymes and ligands (BRENDA), rRNA genes (SILVA), cell lines (CellDive), cultivation media (MediaDive), strain identity (StrainInfo), and more. To make these diverse datasets accessible and interoperable, the DSMZ Digital Diversity initiative provides a central hub for integrated data and establishes a framework for linking and accessing these resources (https://hub.dsmz.de).
At its core lies the DSMZ Digital Diversity Ontology (D3O), an upper ontology designed to unify key concepts across all databases, enabling seamless integration and advanced exploration. This ontology is complemented by well-established ontologies such as ChEBI, ENVO, and NCIT, among others. By standardizing all resources within a defined vocabulary, we enhance their interoperability, both internally and with the Linked Open Data community. Where necessary, we also develop and curate our own ontologies, such as the well-known BRENDA Tissue Ontology (BTO), a comprehensive ontology for LPSN taxonomy and nomenclature, and the Microbial Isolation Source Ontology (MISO), which has already been applied to annotate more than 80,000 microbial strains.
D3O also provides a stable foundation for transforming our databases into RDF (Resource Description Framework) and providing the knowledge graphs via open SPARQL endpoints. The first knowledge graphs, for BacDive and MediaDive, are already available at https://sparql.dsmz.de, enabling researchers to query and analyze microbial trait data and cultivation media. These initial steps lay the groundwork for integrating additional databases, such as BRENDA and StrainInfo, into unified, queryable knowledge graphs.
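
Since the abstract names an open SPARQL service at https://sparql.dsmz.de, a minimal sketch of how a researcher might probe the BacDive or MediaDive knowledge graphs follows. The exact endpoint path and graph vocabulary are assumptions; the generic triple pattern simply lists a few statements.

    # Minimal sketch: exploring the DSMZ knowledge graphs over SPARQL.
    # The endpoint path is an assumption; https://sparql.dsmz.de is the
    # service named in the abstract.
    import requests

    ENDPOINT = "https://sparql.dsmz.de/sparql"  # assumed endpoint path
    QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o . } LIMIT 10"

    resp = requests.get(
        ENDPOINT,
        params={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["s"]["value"], row["p"]["value"], row["o"]["value"])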

12:00-12:20
Session: Session I: Ontologies and Knowledge Graphs in the Biodata Ecosystem
Invited Presentation: Metabolomics Workbench: Data Sharing, Analysis and Integration at the National Metabolomics Data Repository
Confirmed Presenter: Mano Maurya, University of California, San Diego, USA

Room: 12
Format: In person

Moderator(s): Ishwar Chandramouliswaran


Authors List:

  • Eoin Fahy, University of California, San Diego, USA
  • Neela Srinivasan, University of California, San Diego, USA
  • Shashidhar Rao, University of California, San Diego, USA
  • Mano Maurya, University of California, San Diego, USA
  • Srinivasan Ramachandran, University of California, San Diego, USA
  • Kevin Coakley, University of California, San Diego, USA
  • Sara Rahiminejad, University of California, San Diego, USA
  • Hardik Dodia, University of California, San Diego, USA
  • Sumana Srinivasan, University of California, San Diego, USA
  • Shakti Gupta, University of California, San Diego, USA
  • Christine Kirkpatrick, University of California, San Diego, USA
  • Shankar Subramaniam, University of California, San Diego, USA

Presentation Overview:

The National Metabolomics Data Repository (NMDR) was developed as part of the National Institutes of Health (NIH) Common Fund Metabolomics Program to facilitate the deposition and sharing of metabolomics data and metadata from researchers worldwide. The NMDR, housed at the San Diego Supercomputer Center (SDSC), University of California, San Diego, has developed the Metabolomics Workbench (MW). The Metabolomics Workbench also provides analysis tools and access to metabolite standards, including RefMet, protocols, tutorials, training, and more. RefMet facilitates metabolite name harmonization, an essential step in data integration across different studies and collaboration across different research centers. Thus, the MW-NMDR serves as a one-stop infrastructure for metabolomics research and is widely regarded as one of the most FAIR (findable, accessible, interoperable, reusable) data resources. In this work, we will present some of the key aspects of the MW-NMDR, such as continuous curation to maintain quality, use of controlled vocabularies and ontologies to promote interoperability, development of tools that drive scientific innovation, and integration of community-developed tools into the MW. We will also discuss our involvement in other data sharing, reuse, and integration efforts, namely the NIH Common Fund Data Ecosystem (CFDE) and a collaboration with the European Bioinformatics Institute (EBI)’s MetabolomeXchange as part of the Chan Zuckerberg Initiative.
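
Because RefMet name harmonization is central to the integration story above, a minimal sketch of a harmonization call against the Metabolomics Workbench REST service follows. The exact route is an assumption modeled on MW's REST conventions and should be checked against the API documentation.

    # Minimal sketch: matching a free-text metabolite name to RefMet via
    # the Metabolomics Workbench REST API. The path below is an assumed
    # route, not a verified one.
    import requests

    BASE = "https://www.metabolomicsworkbench.org/rest"  # assumed base URL
    name = "glucose"

    resp = requests.get(f"{BASE}/refmet/match/{name}/name", timeout=30)
    resp.raise_for_status()
    print(resp.json())  # standardized RefMet name(s) for the input string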

12:20-12:40
Session: Session I: Ontologies and Knowledge Graphs in the Biodata Ecosystem
Invited Presentation: Building sustainable solutions for federally-funded open-source biomedical tools and technologies
Confirmed Presenter: Karamarie Fecho

Room: 12
Format: Live stream

Moderator(s): Ishwar Chandramouliswaran


Authors List:

  • Karamarie Fecho

Presentation Overview:

Federally-funded, open-source biomedical tools and technologies often fail due to the lack of a business model for sustainability, which quickly leads to technical obsolescence and is often preceded by insufficient scientific impact and the failure to create a thriving Community of Practice. The open-source ROBOKOP (Reasoning Over Biomedical Objects linked in Knowledge Oriented Pathways) knowledge graph (KG) system is jointly funded by the National Institute of Environmental Health Sciences and the Office of Data Science Strategy within the National Institutes of Health as a modular biomedical KG system designed to explore relationships between biomedical entities. The ROBOKOP system includes the aggregated ROBOKOP KG, composed of integrated and harmonized “knowledge” derived from dozens of “knowledge sources”; a user interface to the ROBOKOP KG; and a collection of supporting tools and resources. ROBOKOP has demonstrated its utility in a variety of use cases, including suggesting “adverse outcome pathways” to explain the biological relationships between chemical exposures and disease outcomes, and the related concept of “clinical outcome pathways” to explain the biological mechanisms underlying the therapeutic effects of drug exposures. We have been evaluating approaches to ensure the long-term sustainability of ROBOKOP, independent of federal funding. One approach is to adopt and adapt the best practices of, and lessons learned by, successful open-source biomedical Communities of Practice with engaged scientific end users and technical contributors. This presentation will provide an overview of our evaluation results and detail our proposed solution for transitioning ROBOKOP from federal funding to independent long-term sustainability.
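
To make the “adverse outcome pathway” use case concrete, the sketch below shows the kind of chemical-to-disease path query a Neo4j-backed knowledge graph such as ROBOKOP supports. The connection details, node labels, and relationship depth are illustrative assumptions, not ROBOKOP's actual schema or deployment.

    # Minimal sketch: a Cypher path query over a Neo4j-backed biomedical
    # knowledge graph. Labels (ChemicalEntity, Disease), the chemical name,
    # and the connection settings are assumptions for illustration.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(
        "bolt://localhost:7687", auth=("neo4j", "password")  # assumed
    )

    CYPHER = """
    MATCH p = (c:ChemicalEntity {name: $chemical})-[*1..3]-(d:Disease)
    RETURN p LIMIT 10
    """

    with driver.session() as session:
        for record in session.run(CYPHER, chemical="bisphenol A"):
            print(record["p"])
    driver.close()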

12:40-13:00
Session: Session I: Ontologies and Knowledge Graphs in the Biodata Ecosystem
Invited Presentation: SEA CDM: An Ontology-Based Common Data Model for Standardizing and Integrating Biomedical Experimental Data in Vaccine Research
Confirmed Presenter: Yongqun He, University of Michigan Medical School, Ann Arbor, MI, USA

Room: 12
Format: In person

Moderator(s): Ishwar Chandramouliswaran


Authors List:

  • Anthony Huffman, University of Michigan Medical School, Ann Arbor, MI, USA
  • Jie Zheng, University of Michigan Medical School, Ann Arbor, MI, USA
  • Guanming Wu, Oregon Health & Science University, Portland, OR, USA
  • Ann Maria Masci, University of Texas, MD Anderson Cancer Center, Houston, TX, USA
  • Junguk Hur, University of North Dakota School of Medicine and Health Sciences, Grand Forks, ND, USA
  • Tao Cui, Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL, USA
  • Yongqun He, University of Michigan Medical School, Ann Arbor, MI, USA

Presentation Overview:

With the increasing volume of experimental data across biomedical fields, standardizing, sharing, and integrating heterogeneous experimental data has become a major challenge. Our VIOLIN vaccine database has collected and annotated over 4,700 vaccines against 217 infectious and non-infectious diseases (such as cancer), along with vaccine components such as over 100 vaccine adjuvants, and over 1,700 vaccine-induced host immune factors. To support standardization, we developed the community-based Vaccine Ontology (VO) to represent vaccine knowledge and associated metadata. To support interoperable standardization, annotation, and integration of various biomedical experimental datasets, we have developed an ontology-supported Study-Experiment-Assay (SEA) common data model (CDM), consisting of 12 core classes (called tables in a relational database setting), such as Organism, Sample, Intervention, and Assay. The SEA CDM was evaluated systematically using the vaccine-induced host gene immune response data from our VIOLIN VIGET (Vaccine Induced Gene Expression Analysis Tool) system. We also developed a MySQL database and a Neo4j knowledge graph based on the SEA CDM to systematically represent the VIGET data and influenza-related host gene expression data from two large-scale data resources: ImmPort and CELLxGENE. Our results show that ontologies such as VO can greatly support interoperable data annotation and provide additional semantic knowledge (e.g., vaccine hierarchy). This proof-of-concept study demonstrates the feasibility and validity of the SEA CDM for standardizing and integrating heterogeneous datasets and highlights its potential for application to other large bioresources. The novel SEA CDM lays a foundation for building a FAIR and AI-ready Biodata Ecosystem, leading to advanced AI research.
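
To give a feel for the model's shape, the sketch below renders four of the SEA CDM's named core classes (Organism, Sample, Intervention, Assay) as Python dataclasses. The field names and types are illustrative assumptions; the published CDM defines the authoritative schema.

    # Minimal sketch: four of the twelve SEA CDM core classes as Python
    # dataclasses. All field names and types are assumed for illustration.
    from dataclasses import dataclass

    @dataclass
    class Organism:
        organism_id: str
        taxon_iri: str           # e.g., an NCBITaxon term IRI

    @dataclass
    class Sample:
        sample_id: str
        organism_id: str         # link to Organism

    @dataclass
    class Intervention:
        intervention_id: str
        vaccine_vo_iri: str      # e.g., a Vaccine Ontology (VO) term IRI

    @dataclass
    class Assay:
        assay_id: str
        sample_id: str           # link to Sample
        assay_type_iri: str      # e.g., an OBI assay term IRI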

14:00-14:20
Session: Session II: Building Scalable and Sustainable Biodata Infrastructure
Invited Presentation: The Evolution of Ensembl: Scaling for Accessibility, Performance, and Interoperability
Confirmed Presenter: Mallory Freeberg

Room: 12
Format: In person

Moderator(s): Peter Maccallum


Authors List:

  • Mallory Freeberg

Presentation Overview:

Ensembl is an open platform that integrates publicly available genomics data across the tree of life, enabling activities spanning research to clinical and agricultural applications. Ensembl provides a comprehensive collection of data including genomes, genomic annotations, and genetic variants, as well as computational outputs such as gene predictions, functional annotations, regulatory region predictions, and comparative genomic analyses.
In its 25-year history, Ensembl has grown to support all domains of life - from vertebrates to plants to bacteria - releasing new data roughly quarterly. Initially developed for the human genome, Ensembl expanded to include additional key vertebrates totalling a few hundred genomes. With the advent of global biodiversity and pangenome projects, Ensembl now contains thousands of genomes and is anticipated to grow to tens of thousands of genomes in the coming years. This explosion in data size necessitates a more scalable and rapidly deployable mechanism to ensure timely release of new high-quality genomes for immediate use by the community.
Ensembl is evolving to meet increasing scalability demands to ensure continued accessibility, performance, and interoperability. We have developed a new service-oriented infrastructure, deployed as a set of orchestrated microservices. Our new refget implementation enables rapid, unambiguous sequence retrieval using checksum-based identifiers. Our GraphQL service has been expanded to support genome metadata queries, facilitating programmatic access to assembly composition and linked datasets. With streamlined components and more modern technologies, Ensembl will be easier to maintain, delivering high-quality data quickly and benefiting the global scientific communities that rely on this key resource.
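
As an illustration of the checksum-based retrieval the new refget implementation enables, the sketch below fetches the first 100 bases of a sequence by its checksum, following the GA4GH refget pattern. The base URL and checksum are placeholders, not Ensembl's actual deployment details.

    # Minimal sketch: refget-style sequence retrieval by checksum.
    # BASE and the checksum are placeholder assumptions; consult Ensembl's
    # documentation for the real refget endpoint.
    import requests

    BASE = "https://example.org/refget"            # placeholder base URL
    checksum = "6681ac2f62509cfc220d78751b8dc524"  # assumed MD5-style ID

    resp = requests.get(
        f"{BASE}/sequence/{checksum}",
        params={"start": 0, "end": 100},  # 0-based, end-exclusive per refget
        headers={"Accept": "text/plain"},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.text)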

14:20-14:40
Session: Session II: Building Scalable and Sustainable Biodata Infrastructure
Invited Presentation: Insights from GlyGen in Developing Sustainable Knowledgebases with Well-Defined Infrastructure Stacks
Confirmed Presenter: Kate Warner, George Washington University, USA

Room: 12
Format: In person

Moderator(s): Peter Maccallum


Authors List:

  • Kate Warner, George Washington University, USA

Presentation Overview:

GlyGen is a data integration and dissemination project for glycan and glycoconjugate related data, which retrieves information from multiple international data sources to form a central knowledgebase for glycoscience data exploration. To maintain our high-quality service while meeting the needs of our users, we have structured GlyGen into related but distinct spaces - the Web portal, Data portal, API, Wiki, and SPARQL endpoint - which allows a clear delineation of maintenance and innovation tasks while providing different mechanisms for data access. General users can use the interactive GlyGen web portal to search and explore GlyGen data through our various tools and search functionalities. For programmatic access, users can use the API (https://api.glygen.org) to access GlyGen data objects for glycans and proteins, while the SPARQL endpoint (https://sparql.glygen.org) provides alternative programmatic access to the GlyGen data using semantic web technologies. For users interested in using the datasets in research, data mining, or machine learning projects, versioned dataset flat files can be downloaded from our Data portal (https://data.glygen.org), along with each dataset’s BioCompute Object (BCO) (https://biocomputeobject.org), which documents the dataset’s metadata for proper attribution, reproducibility, and data sharing. All components of the GlyGen ecosystem are built using well-established web technology stacks, enabling rapid development and deployment on both on-premise infrastructure and commercial cloud platforms, while also ensuring straightforward maintenance.
Finally, we will discuss how being freely accessible under the Creative Commons Attribution 4.0 International (CC BY 4.0) license helps to encourage FAIR data, open science, and collaboration.
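
For readers who want a starting point with the programmatic interfaces mentioned above, the sketch below requests a glycan record from the GlyGen API. The route and payload shape are assumptions; the real interface is documented at https://api.glygen.org.

    # Minimal sketch: fetching a glycan record from the GlyGen API.
    # The route and payload shape are assumed for illustration;
    # G17689DH is a GlyTouCan accession used as an example.
    import requests

    resp = requests.post(
        "https://api.glygen.org/glycan/detail/",  # assumed route
        json={"glytoucan_ac": "G17689DH"},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())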

14:40-15:00
Session: Session II: Building Scalable and Sustainable Biodata Infrastructure
Invited Presentation: TBD
Room: 12
Format: In person

Moderator(s): Peter Maccallum


Authors List:

  • Philip Blood
15:00-15:20
Session: Session II: Building Scalable and Sustainable Biodata Infrastructure
Invited Presentation: Production workflows and orchestration at MGnify, ELIXIR’s Core Data Resource for metagenomics
Confirmed Presenter: Martin Beracochea, EMBL-EBI

Room: 12
Format: In person

Moderator(s): Peter Maccallum


Authors List:

  • Martin Beracochea, EMBL-EBI

Presentation Overview:

MGnify is a key resource for the assembly, analysis, and archiving of microbiome-derived sequencing datasets. Designed to be interoperable with the European Nucleotide Archive (ENA) for data archiving, MGnify’s analyses can be initiated from various ENA sequence data products, including private datasets. Accessioned data outputs are produced in commonly used formats and are available via web visualisation and APIs.
The rapid evolution of the field of microbiome research over the past decade has brought significant challenges: exponential dataset growth, increased sample diversity, additional data analyses, and new sequencing technologies. To address these challenges, MGnify’s latest pipelines have transitioned from the Common Workflow Language to Nextflow, nf-core, and a new automation system. This transition enhances resource management, supports heterogeneous computing (including cloud environments), handles large-scale data production, and reduces manual intervention.
Key MGnify outputs include taxonomic and functional analyses of metagenomes, covering >600,000 datasets. The service produces and organises metagenome assemblies and metagenome-assembled genomes, totaling >480,000, as well as nearly 2.5 billion protein sequences. The available annotations have broadened to include the mobilome and virome, as well as increased taxonomic specificity via additional amplicon sequence variant analyses.
While these developments have positioned MGnify to efficiently take advantage of elastic compute resources, the volume of demand still outstrips the available resources. As such, we have started to evaluate how analyses can be federated through the use of our Nextflow pipelines (and community-produced Galaxy versions), in combination with Research Objects, to provide future scalability while retaining a centralised point of discovery.
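
As a small illustration of the programmatic access mentioned above, the sketch below lists the analyses attached to a study through the MGnify API. The API root is MGnify's public address; the specific route, accession, and attribute names are assumptions to be checked against the API documentation.

    # Minimal sketch: listing the analyses attached to an MGnify study.
    # The study accession and attribute keys are illustrative assumptions.
    import requests

    BASE = "https://www.ebi.ac.uk/metagenomics/api/v1"
    study = "MGYS00002008"  # example accession (assumed)

    resp = requests.get(f"{BASE}/studies/{study}/analyses", timeout=30)
    resp.raise_for_status()
    for analysis in resp.json()["data"]:
        print(analysis["id"], analysis["attributes"].get("pipeline-version"))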

15:20-15:40
Session: Session II: Building Scalable and Sustainable Biodata Infrastructure
Invited Presentation: A SCALE-Able Approach to Building “Hybrid” Repositories to Drive Sustainable Data Ecosystems
Confirmed Presenter: Robert Schuler

Room: 12
Format: In person

Moderator(s): Peter Maccallum


Authors List:

  • Robert Schuler

Presentation Overview:

Scientific discovery increasingly relies on the ability to acquire, curate, integrate, analyze, and share vast and varied datasets. For instance, advancements like AlphaFold, an AI-based protein structure prediction tool, and ChatGPT, a large language model-based chatbot, have generated immense excitement in science and industry for harnessing data and computation to solve significant challenges. However, it is easy to overlook that these remarkable achievements were only made possible by the accumulation of a critical mass of AI-ready data. Both examples relied on open data sources meticulously generated by user communities over several decades. We argue that meeting the challenge of producing a future critical mass of data to unlock new discoveries will require scalable, sustainable data repositories that bridge the divide between domain-specific and generalist repositories and that actively engage communities of investigators in organizing and curating data. Such resources must move beyond the label of “repository” and instead employ a socio-technical approach that inculcates a culture and skill set for data management, sharing, reuse, and reproducibility. In this talk, we will discuss our efforts toward developing FaceBase as a “SCALE-able” data resource built on the principles of Self-service Curation, domain-Agnostic data-centric platforms, Lightweight information models, and Evolvable systems. Based on our approach, working within the dental, oral, craniofacial, and related biological research community, we have seen several hundred studies, encompassing data from many thousands of subjects and specimens across multiple imaging modalities and sequencing assay types, contributed and curated by the community.

15:40-15:50
Session: Session II: Building Scalable and Sustainable Biodata Infrastructure
Invited Presentation: From Platforms to Practice: How the ELIXIR Model Enables Impactful, Sustainable Biodata Resources
Confirmed Presenter: Fabio Liberante

Room: 12
Format: In person

Moderator(s): Peter Maccallum


Authors List:

  • Fabio Liberante

Presentation Overview:

Biodata resources are only as impactful as the ecosystems in which they operate. ELIXIR provides a coordinated European infrastructure that supports the sustainability, discoverability, and effective reuse of life science data — enabling biodata resources to thrive in an increasingly complex global research environment.
This talk will provide an overview of how ELIXIR delivers this support through its Core Data Resources, five Platforms — including Data and Interoperability — and an active network of Communities. Together, these elements underpin the long-term value and resilience of biodata infrastructures by helping resources:
  • Implement FAIR practices
  • Link across scientific domains
  • Plan for the full biodata resource lifecycle
We will highlight the role of registries and standards, the monitoring and periodic review of Core Data Resources, and the importance of both qualitative and quantitative indicators in tracking impact. Recent challenges — including the effects of large-scale data scraping — will also be discussed, alongside the need to balance openness with sustainability.
Finally, we will share some insights from ELIXIR’s international collaborations, including with the NIH, to illustrate how global coordination enhances the visibility, value, and future-proofing of open data infrastructures.

15:50-16:00
Session: Session II: Building Scalable and Sustainable Biodata Infrastructure
Invited Presentation: NIH-ODSS
Room: 12
Format: In person

Moderator(s): Ishwar Chandramouliswaran


Authors List:

  • Ishwar Chandramouliswaran
16:40-17:00
Session: Session III: Funding Models, Impact, and Engagement in Biodata Resources
Invited Presentation: Meeting user expectations in a resource constrained environment: Europe PMC’s approach
Confirmed Presenter: Melissa Harrison, EMBL-EBI, UK

Room: 12
Format: In person

Moderator(s): Fabio Liberante


Authors List:

  • Melissa Harrison, EMBL-EBI, UK

Presentation Overview:

Artificial intelligence (AI), in particular generative AI, is rapidly changing the expectations of researchers and how they approach literature search. New tools are being brought to market and established services are focussing on incorporating AI to develop more advanced features.
In 2024, landscape analyses and focus-group market research were conducted to understand evolving user needs as AI technology advances in research workflows. Over 50 scholarly discovery tools were assessed based on their governance and payment models, use of AI, and services offered. This analysis highlighted issues around the sustainability and widespread adoption of AI-enabled features, in particular for summarisation, recommendation, and natural language search. The focus groups outlined the main user journeys involved in literature research and discovery, uncovering community doubts about the use of AI in relation to trust and transparency, and the need for reproducible results.
As we develop innovative solutions and modernise existing infrastructure within existing resource constraints, careful planning is required to ensure that new features meet the needs of users. To better track user engagement, we have expanded our tracking capabilities with the Matomo web analytics tool and introduced A/B testing on the site, helping us ensure that iterative improvements address user needs and enabling efficient, data-driven decisions. The new advanced search tool is being released in beta stages to gather insights on user behaviour and to incrementally improve performance and design.
We will share outputs of our market and user research along with insights we have gained and our plans on how Europe PMC can address these challenges.
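
To ground the discussion of literature search and discovery, the sketch below runs a simple query against the Europe PMC REST search service. The route follows Europe PMC's public web services; the query string and printed fields are examples only.

    # Minimal sketch: searching Europe PMC for open-access metagenomics
    # papers via the REST web services.
    import requests

    resp = requests.get(
        "https://www.ebi.ac.uk/europepmc/webservices/rest/search",
        params={"query": "metagenomics AND OPEN_ACCESS:y", "format": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    for hit in resp.json()["resultList"]["result"]:
        print(hit.get("id"), hit.get("title"))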

17:00-17:20
Session: Session III: Funding Models, Impact, and Engagement in Biodata Resources
Invited Presentation: Coopetition as a Catalyst for Researcher Engagement with Open Data
Confirmed Presenter: Mark Hahnel

Room: 12
Format: In person

Moderator(s): Fabio Liberante


Authors List:

  • Mark Hahnel

Presentation Overview:

The NIH GREI (Generalist Repository Ecosystem Initiative) aims to enhance data sharing and reuse of NIH-funded research by fostering collaboration among generalist repositories. It focuses on establishing consistent standards and practices, and promoting FAIR data principles to improve data discoverability.


GREI's coopetition model enables a more cohesive and effective data-sharing landscape while still allowing individual repositories to innovate and differentiate themselves. Repositories work together to establish common standards, metadata, and best practices for data sharing, improving overall interoperability. They collaborate on initiatives that benefit the entire ecosystem, such as developing consistent metrics and enhancing data discoverability.

While cooperating on core principles, repositories maintain their unique features, services, and competitive advantages. They continue to attract users by offering specialized functionalities, such as data visualization, analysis tools, or specific community support.


This unique strategy for innovating in the repository space allows for uniform ways to track the impact of open datasets. Primarily through citations, we can begin to explore ways in which researchers can be rewarded and incentivised to follow good data practices.

17:20-17:40
Session: Session III: Funding Models, Impact, and Engagement in Biodata Resources
Invited Presentation: Evaluating the Impact of Biodata Resources: Insights from EMBL-EBI’s Impact Assessments
Confirmed Presenter: Eleni Tzampatzopoulou, EMBL-EBI, UK

Room: 12
Format: In person

Moderator(s): Fabio Liberante


Authors List:

  • Eleni Tzampatzopoulou, EMBL-EBI, UK

Presentation Overview:

The provision of open access data through biodata resources is a critical driver of breakthroughs in life sciences research, advances in clinical practice and industry innovations that benefit humankind. However, understanding their long-term economic and societal impacts remains a challenge. As part of ongoing efforts to establish a framework and evidence base for demonstrating the value of open data resources, EMBL-EBI employs a combination of qualitative and quantitative approaches, such as service monitoring metrics, cost-benefit analyses, large-scale user surveys, data resource usage analysis and in-depth case studies. Service monitoring metrics, including unique visitors, data submission volumes and citation of datasets, indicate the breadth and diversity of user engagement with FAIR data resources. The 2024 user survey showcased the depth of utility users derive from resources, such as research years saved and reduced duplication of effort. Surveys and other user engagement also highlight EMBL-EBI’s contribution to downstream products and AI model development. Economic impact analyses, focused on the impact of direct increases in research efficiency, do not quantify these secondary or indirect impacts through data reuse, even though qualitative data suggests they are likely to be significant. Here we explore how mixed methods can characterise the impact of data reuse, considering methodologies such as in-depth case studies, data mining, administrative data and other novel approaches. We consider different methodologies EMBL-EBI has explored and propose how future impact monitoring could capture a fuller extent of the direct and indirect impacts of biodata resources, informing priority setting for life sciences funders.

17:40-18:00
Session: Session III: Funding Models, Impact, and Engagement in Biodata Resources
Invited Presentation: TBD
Room: 12
Format: In person

Moderator(s): Fabio Liberante


Authors List:

  • Alex Bateman