BOSC

Attention Presenters - please review the Speaker Information Page available here
Schedule subject to change
All times listed are in CEST
Monday, July 24th
10:30-10:35
BOSC Opening Remarks
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Karsten Hokamp

  • Nomi Harris
10:35-10:40
Open Bioinformatics Foundation update
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Karsten Hokamp

10:40-10:45
Platinum & Gold Sponsor videos
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Karsten Hokamp

10:45-10:50
CoFest summary
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Karsten Hokamp

  • Thomas Schlapp
10:50-11:50
Invited Presentation: A New Odyssey: Pioneering the Future of Scientific Progress Through Open Collaboration
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Nomi Harris

  • Sara EL-Gebali, FAIRPoints & SciLifeLab-Data Centre, Sweden


Presentation Overview: Show

Join us in humanity’s quest for knowledge and understanding, on a transformative odyssey of scientific progress, fuelled by the power of open collaboration. This presentation will navigate the realms of scientific discovery, exploring the profound impact of globally inclusive and collaborative efforts that have the potential to revolutionize the very fabric of scientific advancement.

As we embark on this journey, we will delve into the principles of diverse global alliances and pioneering scientific institutions, illustrating how their values align with real-world initiatives that promote open science, and provide examples of the groundbreaking opportunities that arise when individuals from various backgrounds, disciplines, and experiences unite in pursuit of a common goal.

In the spirit of “Infinite Diversity in Infinite Combinations,” this presentation will showcase inspiring examples of collaboration and inclusivity in action, demonstrating how community building, mentorship programs, and grassroots movements play an essential role in fostering inclusive scientific communities.

These initiatives empower individuals from underrepresented backgrounds to overcome barriers, access valuable resources, and contribute their unique perspectives to the scientific conversation. By nurturing a culture of inclusivity and support, we can create a flourishing environment that encourages the exchange of ideas and accelerates the pace of scientific discovery. Join us as we chart a course toward a brighter, more inclusive future, where the collective power of diverse minds and open collaboration propels us forward into the uncharted territories of scientific discovery.

11:50-12:10
Open Targets Platform and Open Targets Genetics: Supporting systematic open-source approach for drug-target identification and prioritisation
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Monica Munoz-Torres

  • Prashant Uniyal, Open Targets, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK; EMBL-EBI, Cambridgeshire CB10 1SD, UK
  • Ehsan Barkhordari, Open Targets, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK; EMBL-EBI, Cambridgeshire CB10 1SD, UK
  • Manuel Bernal Llinares, Open Targets, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK; EMBL-EBI, Cambridgeshire CB10 1SD, UK
  • Annalisa Buniello, Open Targets, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK; EMBL-EBI, Cambridgeshire CB10 1SD, UK
  • Carlos Cruz, Open Targets, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK; EMBL-EBI, Cambridgeshire CB10 1SD, UK
  • Ricardo Esteban Martinez Osorio, Open Targets, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK; EMBL-EBI, Cambridgeshire CB10 1SD, UK
  • Luca Fumis, Open Targets, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK; EMBL-EBI, Cambridgeshire CB10 1SD, UK
  • Irene Lopez, Open Targets, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK; EMBL-EBI, Cambridgeshire CB10 1SD, UK
  • Chintan Mehta, Open Targets, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK; EMBL-EBI, Cambridgeshire CB10 1SD, UK
  • Daniel Suveges, Open Targets, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK; EMBL-EBI, Cambridgeshire CB10 1SD, UK
  • Kirill Tsukanov, Open Targets, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK; EMBL-EBI, Cambridgeshire CB10 1SD, UK
  • Daniel Considine, Open Targets, Hinxton, Cambridgeshire, CB10 1SD, UK; Wellcome Sanger Institute, Cambridgeshire, CB10 1SA, UK
  • Xiangyu Jack Ge, Open Targets, Hinxton, Cambridgeshire, CB10 1SD, UK; Wellcome Sanger Institute, Cambridgeshire, CB10 1SA, UK
  • Yakov Tsepilov, Open Targets, Hinxton, Cambridgeshire, CB10 1SD, UK; Wellcome Sanger Institute, Cambridgeshire, CB10 1SA, UK
  • Stuart Horswell, Open Targets, Hinxton, Cambridgeshire, CB10 1SD, UK; Wellcome Sanger Institute, Cambridgeshire, CB10 1SA, UK
  • Helena Cornu, Open Targets, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK; EMBL-EBI, Cambridgeshire CB10 1SD, UK
  • Sarah Young, Open Targets, Hinxton, Cambridgeshire, CB10 1SD, UK; Wellcome Sanger Institute, Cambridgeshire, CB10 1SA, UK
  • Maya Ghoussaini, Open Targets, Hinxton, Cambridgeshire, CB10 1SD, UK; Wellcome Sanger Institute, Cambridgeshire, CB10 1SA, UK
  • David Ochoa, Open Targets, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK; EMBL-EBI, Cambridgeshire CB10 1SD, UK
  • David G Hulcoop, Open Targets, Cambridgeshire CB10 1SD, UK; GlaxoSmithKline plc, Gunnels Wood Road, Stevenage, SG1 2NY, UK
  • Ellen McDonagh, Open Targets, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK; EMBL-EBI, Cambridgeshire CB10 1SD, UK
  • Ian Dunham, Open Targets, Hinxton, Cambridge CB10 1SD, UK; EMBL-EBI, Cambridge CB10 1SD, UK; Wellcome Sanger Institute, CB10 1SA, UK


Presentation Overview: Show

The process of drug discovery and development is costly and inefficient: an estimated 90% of drugs entering phase 1 clinical trials are never approved. However, drugs whose targets have genetic evidence of association with the relevant disease are twice as likely to be approved in clinical trials.
The Open Targets consortium is a pre-competitive collaboration between academic institutes and industry partners to systematically identify novel targets for disease treatment. The Open Targets Platform (https://platform.opentargets.org/) and Open Targets Genetics Portal (https://genetics.opentargets.org/) are open-source informatics resources that systematically harmonise and integrate key evidence data sources and provide tooling for the identification and prioritisation of targets. The Platform builds and scores target-disease associations by integrating over 22 data sources, covering evidence from genetic associations, somatic mutations, known drugs, differential expression, animal models, and pathways. The Platform and Genetics Portal use machine learning for mining published literature, classifying clinical trials, and linking GWAS loci to target genes. Our codebase and data are open source and can be accessed via the web interface or programmatically through APIs. The NIH National Cancer Institute has reused our code to create the Molecular Targets Platform, an instance focused on paediatric cancer.
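The Platform's programmatic interface is GraphQL-based. As a rough sketch of a programmatic query (the endpoint URL and schema fields below are assumptions drawn from the public Open Targets documentation, not from this abstract), one might request a target's top disease associations like this:

```python
import json

# Endpoint of the Open Targets Platform GraphQL API -- an assumption based
# on the public documentation; check platform.opentargets.org for the
# current version.
OT_GRAPHQL_URL = "https://api.platform.opentargets.org/api/v4/graphql"

def build_association_query(ensembl_id: str, page_size: int = 5) -> dict:
    """Build a GraphQL payload requesting a target's top disease
    associations, ranked by overall association score."""
    query = """
    query targetAssociations($ensemblId: String!, $size: Int!) {
      target(ensemblId: $ensemblId) {
        approvedSymbol
        associatedDiseases(page: {index: 0, size: $size}) {
          rows { disease { name } score }
        }
      }
    }"""
    return {"query": query,
            "variables": {"ensemblId": ensembl_id, "size": page_size}}

payload = build_association_query("ENSG00000157764")  # BRAF
# To execute the query (requires network access):
#   import urllib.request
#   req = urllib.request.Request(
#       OT_GRAPHQL_URL, data=json.dumps(payload).encode(),
#       headers={"Content-Type": "application/json"})
#   result = json.load(urllib.request.urlopen(req))
```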

12:10-12:15
Systematic approach to preparing medical claims data for biomedical research
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Monica Munoz-Torres

  • Michelle Audirac, Harvard T.H. Chan School of Public Health, Harvard University, United States
  • Michael Bouzinier, Harvard University Research Computing, United States
  • Danielle Braun, Harvard T.H. Chan School of Public Health, Harvard University, United States
  • Mahmood Shad, Harvard University Research Computing, United States
  • Scott Yockel, Harvard University Research Computing, United States


Presentation Overview: Show

Biomedical research often requires combining domain-specific data with broad healthcare data. The Centers for Medicare and Medicaid Services (CMS) is a vital source of medical claims data in the US, collecting and maintaining healthcare data for millions of beneficiaries. CMS data provides important insights into healthcare utilization, expenditures, and health outcomes, making it a valuable resource for research in epidemiology and environmental health. However, the quality of CMS data often leaves considerable room for improvement. We present a systematic approach to cleansing CMS data and preparing it for ML and AI. Our approach is based on designing a domain-specific language (DSL) to describe the most common data transformations in our feature engineering pipelines: isomorphic transformations, unions, rollups or projections, approximations, simple and custom aggregations, nesting and unnesting of arrays, collapsing of multiple columns, and transpositions, as well as disambiguation and QC rules. We will illustrate each of these operations with examples of data cleansing for different types of data, such as a patient's date of birth, sex, race and ethnicity, and admission data. We will present and discuss detailed QC results for 50 US states for the years 2011 to 2018.
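The abstract does not show the DSL itself; purely as a hypothetical illustration of two of the operations it names (disambiguation rules and rollups), a Python sketch might look like this:

```python
from collections import Counter

# Hypothetical illustration of two transformation kinds named in the
# abstract; the actual DSL and its syntax are not shown there.

def disambiguate(values):
    """Disambiguation rule: when a beneficiary's records disagree
    (e.g. on date of birth), keep the most frequent non-null value."""
    non_null = [v for v in values if v is not None]
    return Counter(non_null).most_common(1)[0][0] if non_null else None

def rollup(records, key, agg, field):
    """Rollup: aggregate one field over records grouped by a key."""
    groups = {}
    for r in records:
        groups.setdefault(r[key], []).append(r[field])
    return {k: agg(v) for k, v in groups.items()}

claims = [
    {"bene_id": "A", "dob": "1950-01-02"},
    {"bene_id": "A", "dob": "1950-01-02"},
    {"bene_id": "A", "dob": "1950-02-01"},  # conflicting entry
]
print(rollup(claims, "bene_id", disambiguate, "dob"))
# → {'A': '1950-01-02'}
```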

12:15-12:20
Domain Specific Language and variables for systematic approach to genetic variant curation and interpretation
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Monica Munoz-Torres

  • Marina Pozhidaeva, Deggendorf Institute of Technology, Forome Association, MA, USA, Germany
  • Dmitry Etin, Forome Association, MA, USA; Deggendorf Institute of Technology; Oracle Corporation, Austria
  • Gennadii Zakharov, Quanotri, Georgia
  • Michael Bouzinier, Forome Association, MA, USA; Oracle Corporation; Harvard University, United States


Presentation Overview: Show

Many clinicians and researchers agree that genome sequencing should become part of routine clinical practice, eventually becoming available in community hospitals and remote places. One of the roadblocks slowing this journey is the caution with which insurance companies and governments are reimbursing these services. One reason for this caution is a gap in the standards for demonstrating the clinical utility of genome sequencing based on current clinical genetics guidelines. To facilitate higher evidence-based medicine standards, we need a systematic and structured approach to developing, maintaining, and publishing these guidelines. We believe that defining a weakly typed Domain Specific Language (DSL) for expressing genetic variant curation and interpretation rules in the context of the provenance, confidence, and biological evidence of the data would be a step forward in this direction. The types of DSL variables are based on the scale and resolution, knowledge domain, and method by which the technical or biological annotation corresponding to a variable has been produced.

We will present the prototype of such a DSL and annotation classification system used as part of our AnFiSA software tool. We will provide examples of the variant curation rules written in the DSL and demonstrate how they work on real-life data.
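As a hypothetical sketch of what annotation-driven curation rules can look like (the function names, fields, and thresholds below are illustrative, not AnFiSA's actual DSL; PM2 and PVS1 are criteria from the ACMG/AMP variant interpretation guidelines):

```python
# Illustrative, simplified curation rules in plain Python; the actual
# AnFiSA DSL and its variable typing are not reproduced here.

def rule_pm2(variant):
    """PM2 (ACMG): variant absent from, or extremely rare in,
    population databases. gnomad_af is a population-scale annotation."""
    af = variant.get("gnomad_af")
    return af is None or af < 1e-4

def rule_pvs1(variant):
    """PVS1 (ACMG): null variant (e.g. nonsense, frameshift) in a gene
    where loss of function is a known disease mechanism."""
    return (variant.get("consequence") in {"stop_gained", "frameshift"}
            and variant.get("lof_mechanism", False))

variant = {"gnomad_af": 2e-5, "consequence": "stop_gained",
           "lof_mechanism": True}
triggered = [r.__name__ for r in (rule_pm2, rule_pvs1) if r(variant)]
print(triggered)  # → ['rule_pm2', 'rule_pvs1']
```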

12:20-12:25
Platform for global genomic surveillance of emerging diseases applied to Mpox
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Monica Munoz-Torres

  • Ferdous Nasri, Data Analytics & Computational Statistics, Hasso Plattner Institute, University of Potsdam, Germany, Germany
  • Kunaphas Kongkitimanon, Data Analytics & Computational Statistics, Hasso Plattner Institute, University of Potsdam, Germany, Germany
  • Alice Wittig, Bioinformatics and Systems Biology, Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany, Germany
  • Jorge Sánchez Cortés, Data Analytics & Computational Statistics, Hasso Plattner Institute, University of Potsdam, Germany, Germany
  • Annika Brinkmann, Centre for Biological Threats and Special Pathogens, Robert Koch Institute, Seestrasse 10, Berlin 13353, Germany, Germany
  • Andreas Nitsche, Centre for Biological Threats and Special Pathogens, Robert Koch Institute, Seestrasse 10, Berlin 13353, Germany, Germany
  • Anna-Juliane Schmachtenberg, Data Analytics & Computational Statistics, Hasso Plattner Institute, University of Potsdam, Germany, Germany
  • Bernhard Y. Renard, Data Analytics & Computational Statistics, Hasso Plattner Institute, University of Potsdam, Germany, Germany
  • Stephan Fuchs, Bioinformatics and Systems Biology, Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany, Germany


Presentation Overview: Show

Monkeypox (Mpox) is mutating at an exceptional rate for a DNA virus, and its global spread is concerning, making genomic surveillance a necessity. With MpoxRadar, we provide an interactive dashboard to track virus variants at the mutation level worldwide. MpoxRadar allows users to select among different genomes as the reference for comparison. The occurrence of mutation profiles based on the selected reference is indicated on an interactive world map that shows the respective geographic sampling sites in customizable time ranges, making it easy to follow the frequency or trend of defined mutations. Furthermore, the user can filter for specific mutations, genes, countries, genome types, and sequencing protocols, and download the filtered data directly from MpoxRadar. MpoxRadar is open-source and freely accessible at https://MpoxRadar.net.
Behind the scenes, MpoxRadar downloads genomic data from the National Center for Biotechnology Information (NCBI), aligns genomes to multiple references, calls variants, and maintains a database serving data and metadata to the web server. Using this tool, we can find possible sequencing artifacts by comparing mutations from samples assembled using different sequencing technologies. Furthermore, we enable easy filtering for researchers, who can simply search, e.g., for mutations highly associated with APOBEC3 enzymes.

13:50-14:10
From 2023 to a FAIR Future; bridging the provenance metadata gap by centering the bioinformatics practitioner perspective
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Hervé Ménager

  • Renske de Wit, Vrije Universiteit Amsterdam, Netherlands
  • Alexandru Iosup, Vrije Universiteit Amsterdam, Netherlands
  • Sanne Abeln, Vrije Universiteit Amsterdam, Netherlands
  • Michael Crusoe, Vrije Universiteit Amsterdam; Common Workflow Language project, Germany


Presentation Overview: Show

As bioinformaticians, we care about the quality, re-usability, and transparency of our research. But even when data is FAIR (which it often is not), its provenance metadata is not connected to workflow executions and is lost. Even with workflow provenance approaches like CWLProv, as of 2023 manual annotation of inputs is required for full provenance capture. Here, we present an approach to help practitioners determine which provenance data they should focus on, using a practitioner-centered inquisitive process. We also show how the insights gathered from this process can be used to improve existing provenance and metadata standards. Our process to develop the list of relevant metadata was driven by an imagination exercise: what questions would we, as bioinformaticians, want to ask of research objects produced from our specific workflow? Using our list of questions, we then identified the metadata required to answer them (using existing metadata standards).
We identified three important areas of improvement and two extensions for CWLProv, which are also being added to the CWLProv successor, the RO-Crate WorkflowRun profile.

14:10-14:30
SWIPE: Open source infrastructure as code for running WDL workflows at low cost
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Hervé Ménager

  • Todd Morse, The Chan Zuckerberg Foundation, United States
  • Ryan Lim, The Chan Zuckerberg Foundation, United States
  • Jessica Gadling, The Chan Zuckerberg Foundation, United States


Presentation Overview: Show

SWIPE packages the cloud infrastructure used by Chan Zuckerberg Infectious Disease to run our bioinformatics WDL workflows so our architecture can be used by others. To make the infrastructure portable, we used Terraform to define our infrastructure as code. The architecture is well tuned for our use case: running the same pipelines with some common inputs, while scaling to meet highly variable demand for pipeline runs, at low cost. This is achieved by using AWS Batch to quickly scale up and down, high-bandwidth and high disk I/O instances to quickly load and read large file inputs, a caching strategy to re-use common inputs between pipeline runs, and AWS Spot Instances to lower costs. Leveraging Spot Instances can be a challenge because workloads may be interrupted, requiring workloads that run on Spot to implement recovery handling. SWIPE leverages the WDL workflow definitions to automatically resume at the last completed step, freeing workflow developers from needing to implement their own recovery logic. Coordination is handled by AWS Step Functions, a serverless workflow orchestrator. This allows SWIPE to run a workflow with an AWS API call and scale down to zero when not in use.

14:30-14:50
Automated production engine to decode the tree of life
Room: Salle Rhone 3b
Format: Live-stream

Moderator(s): Hervé Ménager

  • Priyanka Surana, Wellcome Sanger Institute, United Kingdom


Presentation Overview: Show

Darwin Tree of Life, a collaboration between 10 organisational partners, aims to capture the biodiversity of the islands of Britain and Ireland through genomics. To analyse this diversity of life, we are building a series of production-grade workflows that take the raw data from the sequencing machines to (1) assemble, decontaminate, and curate the genome, (2) automatically generate standardised genome publications, and (3) run comparative genomics analyses. Here, we showcase how data flows from the sequencing machines to our pipelines and through them to public archives, where all our data is made available rapidly and without embargo. Next, our released data is downloaded back to our servers, where it is processed into standardised genome publications. Finally, we share our roadmap for the next phase, which involves making our pipelines sustainable using green coding principles. All our pipelines are developed using open-source principles with nf-core templates and tools to ensure they meet the highest community standards. We are one of several initiatives working towards the goal of sequencing all complex life on Earth. This should aid conservation, deepen our understanding of the interconnectedness of all life, and support a new economy based on biological materials.

14:50-15:10
Tonkaz: A workflow reproducibility scale for automatic validation of biological interpretation results
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Hervé Ménager

  • Hirotaka Suetake, Sator, Inc., Japan
  • Tazro Ohta, Institute for Advanced Academic Research, Chiba University, Japan


Presentation Overview: Show

Reproducibility of data analysis workflows is a key concern in bioinformatics. While recent advancements in computing technologies, such as virtualization, have facilitated easier reproduction of workflow execution, assessing the reproducibility of results remains a challenge. Specifically, there is a lack of standardized methods for verifying whether the biological interpretation of reproduced results is consistent across different executions.

To address this issue, we propose a new metric: a reproducibility scale for workflow execution results. This metric evaluates the reproducibility of results based on biological feature values, such as the number of reads, mapping rate, and variant frequency, representing the biological interpretation of the data. We have developed a prototype system that automatically evaluates the reproducibility of results using the proposed metric, streamlining the evaluation process.

To demonstrate the effectiveness of our approach, we conducted experiments using workflows employed by researchers in real-life research projects, as well as common use cases frequently encountered in bioinformatics. Our approach enables the automatic evaluation of result reproducibility on a fine-grained scale, promoting a more nuanced perspective on reproducibility. This shift from a binary view of identical or non-identical results to a graduated scale facilitates more informed discussions and decision-making in the bioinformatics field.
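The idea of grading reproducibility by biological feature values rather than by bit-identity can be sketched as follows (the tolerance and verdict labels are illustrative assumptions, not Tonkaz's actual scale):

```python
# Illustrative sketch: compare two workflow runs on biological feature
# values with a relative tolerance, instead of a binary identical /
# non-identical check. Thresholds here are arbitrary examples.

def compare_runs(run_a, run_b, rel_tol=0.05):
    """Classify each shared feature as identical, acceptable (within a
    relative tolerance), or divergent."""
    verdicts = {}
    for feature in run_a.keys() & run_b.keys():
        a, b = run_a[feature], run_b[feature]
        if a == b:
            verdicts[feature] = "identical"
        elif abs(a - b) <= rel_tol * max(abs(a), abs(b)):
            verdicts[feature] = "acceptable"
        else:
            verdicts[feature] = "divergent"
    return verdicts

run1 = {"mapped_reads": 1_000_000, "mapping_rate": 0.97, "variants": 4321}
run2 = {"mapped_reads": 1_020_000, "mapping_rate": 0.97, "variants": 5000}
print(compare_runs(run1, run2))
# mapped_reads differ by 2% (acceptable); variant counts differ by far
# more than 5% (divergent); mapping_rate is identical.
```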

15:10-15:15
EASEL (Efficient, Accurate, Scalable Eukaryotic modeLs), a tool for the improvement of eukaryotic structural and functional genome annotation
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Hervé Ménager

  • Cynthia Webster, Ecology and Evolutionary Biology Department, University of Connecticut, United States
  • Karl Fetter, Ecology and Evolutionary Biology Department, University of Connecticut, United States
  • Sumaira Zaman, Ecology and Evolutionary Biology Department, University of Connecticut, United States
  • Vidya Vuruputoor, Ecology and Evolutionary Biology Department, University of Connecticut, United States
  • Akriti Bhattarai, Ecology and Evolutionary Biology Department, University of Connecticut, United States
  • Jill Wegrzyn, Ecology and Evolutionary Biology Department, University of Connecticut, United States


Presentation Overview: Show

The emergence of affordable high-throughput sequencing technologies has increased both the number and quality of eukaryotic genomes. Although reference genomes and their associated contiguity are increasingly accessible, an efficient and accurate workflow for structural annotation of protein-coding genes remains a challenge. Existing programs struggle with predicting less common gene structures (long introns, micro-exons), finding the preferred translation initiation site (TIS), and distinguishing pseudogenes. We present EASEL (Efficient, Accurate, Scalable Eukaryotic modeLs), an open-source genome annotation tool that leverages machine learning, RNA folding, and functional annotations to enhance gene prediction accuracy (https://gitlab.com/PlantGenomicsLab/easel). EASEL works by aligning high-throughput short-read data (RNA-Seq) and assembling putative transcripts via StringTie2 and PsiCLASS. Frames are subsequently predicted through TransDecoder using a gene family database (EggNOG), and expressed sequence tag (EST) and protein hints are generated. Each gene model is independently used to train AUGUSTUS, and the resulting predictions are combined into a single gene set using AGAT. Implicated gene structures are filtered by primary and secondary features (molecular weight, GC content, free energy, etc.) with a clade-specific random forest model and then functionally annotated with EnTAP. The result is a full-scale workflow that balances efficiency and accuracy to generate high-quality genome annotations.

15:15-15:20
Realizing FAIR Principles For Data and Workflows with the Arvados Platform
Room: Salle Rhone 3b
Format: Live-stream

Moderator(s): Hervé Ménager

  • Peter Amstutz, Curii Corporation, United States
  • Brett Smith, Curii Corporation, United States
  • Stephen Smith, Curii Corporation, United States
  • Lucas Di Pentima, Curii Corporation, United States
  • Tom Clegg, Curii Corporation, United States
  • Lisa Knox, Curii Corporation, United States
  • Alexander Sasha Wait Zaranek, Curii Corporation, United States
  • Sarah Zaranek, Curii Corporation, United States


Presentation Overview: Show

Introduced in 2016, the FAIR principles provide guidelines to improve the Findability, Accessibility, Interoperability, and Reuse of digital assets. The FAIR principles refer to data, metadata and infrastructure. Additionally, computational workflows can follow FAIR principles for both the workflows’ descriptions themselves and the (meta)data the workflows use or produce.

The largest pharmaceutical organizations are using Arvados to implement FAIR principles in large multinational deployments. As a result, these organizations foster cross-department data sharing, save time and reduce workload by re-using work from other groups, as well as enable better decision making and scientific excellence. Our talk will focus on the FAIR principles and how Arvados can help you “go FAIR” with your data and other digital objects. Arvados is a 100% open-source platform that integrates a data management system called “Keep” and a compute management system called “Crunch”, creating a unified environment to store and organize data and run Common Workflow Language (CWL) workflows on that data. Arvados is multi-user and multi-platform, running on various cloud and high performance computing environments.

Open Source infrastructure is ideally suited for realizing the FAIR vision. The Arvados community welcomes users, contributors and those interested in FAIR data and workflows (https://arvados.org/community/).

16:00-16:05
Faster evaluation of CRISPR guide RNAs across entire genomes
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Karsten Hokamp

  • Carl Schmitz, Queensland University of Technology, Australia
  • Jacob Bradford, Queensland University of Technology, Australia
  • Dimitri Perrin, Queensland University of Technology, Australia


Presentation Overview: Show

The design of CRISPR-Cas9 guide RNAs is not trivial. In particular, evaluating the risk of off-target modifications is computationally expensive: a brute-force approach would require comparing each candidate guide with every possible CRISPR target site in the genome. In a mammalian genome, this means hundreds of millions of comparisons for each guide. We have previously introduced Crackling, a gRNA design tool that relies on Inverted Signature Slice Lists (ISSL) to accelerate off-target scoring by only considering sites with partial matches (a slice) with the candidate guide. This produced an order of magnitude speed up whilst still maintaining scoring accuracy. Here, we present a complete reimplementation of Crackling in C++ and discuss further improvements. Using longer slices we perform fewer comparisons, and we show it is possible to construct a collection of slices that still preserve an exact off-target score. We have benchmarked two ISSL configurations with the new version of Crackling and report a 15-22 times speed up over the default ISSL configuration. We also show that, using memory-mapped files, this can be achieved without any significant increase in memory usage. CracklingPlusPlus is available at https://github.com/bmds-lab/CracklingPlusPlus under the Berkeley Software Distribution (BSD) 3-Clause license.
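The slice idea can be illustrated with a minimal sketch (illustrative only; Crackling's ISSL implementation uses encoded signatures and is far more efficient): splitting a 20 nt guide into five 4 nt slices guarantees, by the pigeonhole principle, that any site within four mismatches of the guide shares at least one slice exactly, so only sites sharing a slice need a full comparison.

```python
from collections import defaultdict

# Minimal sketch of slice-based candidate filtering: index every target
# site under each of its fixed-width slices; a guide only needs to be
# fully compared against sites that share at least one exact slice.

SLICES = 5              # 5 slices of 4 nt over a 20 nt guide
WIDTH = 20 // SLICES

def slices(seq):
    """(position, subsequence) pairs for each fixed-width slice."""
    return [(i, seq[i * WIDTH:(i + 1) * WIDTH]) for i in range(SLICES)]

def build_index(sites):
    index = defaultdict(set)
    for site in sites:
        for key in slices(site):
            index[key].add(site)
    return index

def candidates(index, guide):
    """Sites worth a full mismatch count for this guide."""
    hits = set()
    for key in slices(guide):
        hits |= index.get(key, set())
    return hits

sites = ["ACGTACGTACGTACGTACGT",   # exact match for the guide below
         "TTTTACGTACGTACGTCCCC",   # shares interior slices
         "GGGGGGGGGGGGGGGGGGGG"]   # shares no slice: never compared
idx = build_index(sites)
print(sorted(candidates(idx, "ACGTACGTACGTACGTACGT")))
```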

16:05-16:10
JBrowse 2: a modular genome browser with views of synteny and structural variation
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Karsten Hokamp

  • Colin Diesh, University of California, Berkeley, United States
  • Garrett Stevens, University of California, Berkeley, United States
  • Peter Xie, University of California, Berkeley, United States
  • Teresa De Jesus Martinez, University of California, Berkeley, United States
  • Elliot Hershberg, University of California, Berkeley, United States
  • Angel Leung, University of California, Berkeley, United States
  • Emma Guo, University of California, Berkeley, United States
  • Shihab Dider, University of California, Berkeley, United States
  • Junjun Zhang, OICR, Canada
  • Caroline Bridge, OICR, Canada
  • Gregory Hogue, OICR, Canada
  • Andrew Duncan, OICR, Canada
  • Scott Cain, OICR, Canada
  • Robert Buels, University of California, Berkeley, United States
  • Lincoln Stein, OICR, Canada
  • Ian Holmes, University of California, Berkeley, United States


Presentation Overview: Show

Genome browsers are commonly used in bioinformatics for interactive visualization of genomic datasets. However, most genome browsers are only capable of showing data from a single genome at a single locus. We created JBrowse 2 to enable the visualization of multiple related genomes with built-in synteny views, and to show complex structural variants that can span multiple genomic loci. JBrowse 2 can be extensively customized via configuration or extended via plugins to address the needs of its diverse user base. Recent improvements to the JBrowse 2 core codebase include: the addition of multi-wiggle tracks, which can show many different quantitative signals in a compact format; ability to launch a synteny view from a regular genome browser view; and new visualization modalities for alignments tracks. We will also demonstrate new plugins to map genome coordinates onto 3-D protein structures and to explore splicing with the isoform inspector. JBrowse 2 uses a modern web application stack using React and TypeScript. It runs as a web app, a desktop app, or as components installable via NPM. JBrowse 2 also has Jupyter Notebook integration via jbrowse_jupyter (on PyPI) and R/Shiny integration via JBrowseR (on CRAN).

16:10-16:15
PhyloGenes: A web-based tool for plant gene function inference using phylogenetics
Room: Salle Rhone 3b
Format: Live-stream

Moderator(s): Karsten Hokamp

  • Swapnil Sawant, Phoenix Bioinformatics, United States
  • Tanya Berardini, Phoenix Bioinformatics, United States
  • Trilok Prithvi, Phoenix Bioinformatics, United States
  • Peifen Zhang, Computercraft Corporation (for NCBI), United States


Presentation Overview: Show

PhyloGenes (phylogenes.org) is a web-based bioinformatics tool that leverages advanced search and indexing technology based on Apache Solr to index and analyze genes and phylogenetic trees of over 8,000 gene families across 50 organisms, including 40 plant species. It integrates experimental and phylogenetically inferred gene function annotations, publications, and sequence alignments from PantherDB, Gene Ontology, and UniProt, and displays them using interactive and efficient visualization tools based on D3.js. By presenting information in a way that visually reflects phylogenetic relationships, PhyloGenes facilitates more effective inference of gene function. By making annotation evidence, sources, and other metadata clearly visible and traceable, PhyloGenes improves the accuracy of inferred gene functions. PhyloGenes enables users to address research questions such as identifying orthologs, predicting unknown gene function, and discovering novel gene families in plants. PhyloGenes contributes to the open-source bioinformatics ecosystem by providing a visually intuitive, robust, and transparent solution for gene function inference that could be adapted to other sets of organisms.

16:20-16:25
CCQTL: facilitating QTL mapping in the Collaborative Cross
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Karsten Hokamp

  • Remi Planel, Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, F-75015 Paris, France
  • Victoire Baillet, Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, F-75015 Paris, France
  • Vincent Guillemot, Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, F-75015 Paris, France
  • Jean Jaubert, Institut Pasteur, Université Paris Cité, Mouse Genetics Laboratory, F-75015 Paris, France
  • Christian Vosshenrich, Institut Pasteur, Université Paris Cité, Innate Immunity Unit, F-75015 Paris, France
  • Rachel Torchet, Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, F-75015 Paris, France
  • Marion Rincel, Institut Pasteur, Université Paris Cité, Microenvironment and Immunity Unit, F-75015 Paris, France
  • Pascal Campagne, Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, F-75015 Paris, France
  • Xavier Montagutelli, Institut Pasteur, Université Paris Cité, Mouse Genetics Laboratory, F-75015 Paris, France


Presentation Overview: Show

Quantitative Trait Locus (QTL) mapping in mapping populations and Genome-Wide Association Studies (GWAS) in natural populations are complementary approaches for dissecting the genetic architecture of complex traits. While GWAS are typically carried out by statistical genetics groups well-versed in quantitative environments and code management, experimental geneticists performing QTL mapping focus on labor-intensive phenotyping experiments and thus require further support, for both code and statistics, to benefit from best practices in the field.
We present CCQTL, a comprehensive platform for QTL mapping in the Collaborative Cross (CC), an increasingly used mouse mapping population. CCQTL features an intuitive graphical user interface (GUI) for seamless end-to-end QTL mapping analysis, from data transformation to candidate gene identification. It also includes a robust database structure ensuring secure, organized storage of phenotypic data, accompanied by an advanced permissions system.
CCQTL's analytical component leverages R/qtl2 tools integrated into preconfigured Galaxy workflows designed explicitly for the CC. This setup facilitates one-click, reproducible analyses. The platform (GUI, database, and analytics) is containerized using Docker, enabling straightforward deployment and scalability. While primarily designed to empower non-specialists to conduct their own data analyses, the reproducibility brought by Galaxy and the sophisticated database permission system also make CCQTL valuable for experienced users seeking streamlined solutions.

16:25-16:30
higlass-python: A Programmable Genome Browser for Linked Interactive Visualization and Exploration of Genomic Data
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Karsten Hokamp

  • Trevor Manz, Harvard Medical School, United States
  • Vedaz Yilmaz, UMass Chan Medical School, United States
  • Nils Gehlenborg, Harvard University, United States
  • Nezar Abdennur, UMass Chan Medical School, United States


Presentation Overview: Show

HiGlass is a web-based viewer for 1D and 2D genomic datasets, providing smooth navigation and flexible view configurations. We developed higlass-python, a software development kit (SDK) to expand HiGlass capabilities and integrate it with the scientific Python ecosystem. This SDK enables custom HiGlass-powered applications, dashboards, and rapid genomic data exploration in computational notebooks, such as Jupyter Notebooks.

Higlass-python offers a toolkit for synchronizing genomic representations within HiGlass and third-party visualizations presenting alternative representations of genomic loci. This feature empowers computational researchers to extend the genome browser using Python scripts.

Integrating traditional genome browser views and dynamically linked alternative views, higlass-python facilitates more complete exploration and analysis of genomic data. The toolkit provides building blocks for two-way data binding between HiGlass and other visualizations using single-locus mapping information.

We utilized higlass-python in Jupyter Notebooks to examine latent embedding spaces of 3D genomic contact frequency profiles within functional and epigenomic contexts. Dimensionality-reduced contact profile representations are embedded in 2D visualizations, with each point representing a genomic locus or interval. Our SDK coordinates multiple views of individual loci with HiGlass, synchronizing selections across all views and allowing data loading, exploration, export, and analysis within the same computational environment.
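The two-way data binding described in this abstract can be illustrated, independently of the actual higlass-python API, as a minimal observer pattern: every linked view registers a callback, and a selection in any one view propagates the same single-locus information to all the others. All names below are hypothetical; this is a conceptual sketch, not SDK code.

```python
# Minimal sketch of linked-view synchronization on a shared genomic locus.
# This does NOT use the higlass-python API; it only shows the concept.

class LinkedViews:
    """Broadcast a selected genomic locus to every registered view."""

    def __init__(self):
        self._views = []

    def register(self, callback):
        """Register a view's update callback (e.g. a browser or embedding plot)."""
        self._views.append(callback)

    def select(self, chrom, start, end):
        # Any view can call select(); every registered view receives the
        # same single-locus mapping information, keeping them in sync.
        for callback in self._views:
            callback(chrom, start, end)

seen = []
sync = LinkedViews()
sync.register(lambda c, s, e: seen.append(("browser", c, s, e)))
sync.register(lambda c, s, e: seen.append(("embedding", c, s, e)))
sync.select("chr1", 1_000_000, 1_050_000)
print(seen)  # both views received the selected locus
```

In the real SDK the callbacks would update HiGlass view configurations and third-party plots (e.g. the 2D embedding of contact profiles) rather than append to a list.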

16:30-16:35
Accelerated nanopore basecalling with SLOW5 data format
Room: Salle Rhone 3b
Format: Live-stream

Moderator(s): Karsten Hokamp

  • Hiruna Samarakoon, Genomics Pillar, Garvan Institute of Medical Research, Sydney, NSW, Australia
  • James M. Ferguson, Genomics Pillar, Garvan Institute of Medical Research, Sydney, NSW, Australia
  • Hasindu Gamaarachchi, Genomics Pillar, Garvan Institute of Medical Research, Sydney, NSW, Australia
  • Ira W. Deveson, Genomics Pillar, Garvan Institute of Medical Research, Sydney, NSW, Australia


Presentation Overview: Show

Nanopore sequencing is emerging as a key pillar in the genomic technology landscape, but computational constraints limiting its scalability remain to be overcome. The translation of raw current signal data into DNA or RNA sequence reads, known as ‘basecalling’, is a major point of friction in any nanopore sequencing workflow. Here, we exploit the advantages of the recently developed signal data format ‘SLOW5’ to streamline and accelerate nanopore basecalling in high-performance computing (HPC) and cloud environments. SLOW5 permits highly efficient sequential data access, eliminating a significant analysis bottleneck. To take advantage of this, we introduce Buttery-eel, an open-source wrapper for Oxford Nanopore’s Guppy basecaller that enables SLOW5 data access, resulting in performance improvements that are essential for scalable, affordable basecalling. Basecalling a realistic human whole-genome sequencing dataset (~30X coverage) in FAST5 format with 4 GPUs took a minimum of 13.3 hours (cloud system) and a maximum of 41.6 hours (distributed file system) with Guppy (high-accuracy model). In BLOW5 format, basecalled with Buttery-eel, overall runtimes were reduced to ~5 hours on every system, corresponding to 2.7-fold (cloud system), 3.4-fold (parallel file system) and 9.1-fold (distributed file system) improvements, respectively.

16:35-16:40
Pre-upload quality checks: Efficient bioinformatics file validation in the browser for time and cost savings
Room: Salle Rhone 3b
Format: Live-stream

Moderator(s): Karsten Hokamp

  • Charles Bickham, University of Southern California, United States
  • Julie Han, Chan Zuckerberg Initiative, United States
  • Robert Aboukhalil, Chan Zuckerberg Initiative, United States


Presentation Overview: Show

Web applications are important tools for reducing the barrier to entry in bioinformatics. One common challenge is that users who upload invalid or low-quality data sometimes wait hours for the upload to finish and analysis to begin before they are notified of data quality issues. We developed an approach to detect low-quality data in the browser before it is uploaded to a server, which we applied to Chan Zuckerberg ID (CZ ID), an open-source, cloud-based metagenomics platform that helps researchers detect and track infectious diseases worldwide. We compiled seqtk and htsfile to WebAssembly to run diagnostic checks on a subset of each FASTA or FASTQ input file and warn users of issues before the upload starts. Diagnostics include detecting duplicate read names, truncated files, mismatched R1/R2 read names, and reads that don't match the selected sequencing platform. Our approach prevented ~3 TB of invalid data from being uploaded, corresponding to ~15 days of user upload time. We found that 97% of the ~60,000 issues prevented stemmed from users selecting a type of analysis that did not fit their input data. In addition, detecting these issues at the source reduces cloud compute costs and saves time for end users.
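Two of the diagnostics named in this abstract (duplicate read names, mismatched R1/R2 read names) can be sketched in a few lines. The real CZ ID checks run seqtk and htsfile compiled to WebAssembly inside the browser; the Python below, with invented function names, only illustrates the validation logic applied to the first records of a paired FASTQ upload.

```python
# Hypothetical sketch of pre-upload FASTQ pair validation; not CZ ID code.

def fastq_names(lines):
    """Yield read names from FASTQ text (header every 4th line, '@' stripped)."""
    for i in range(0, len(lines) - len(lines) % 4, 4):
        header = lines[i]
        if not header.startswith("@"):
            raise ValueError(f"truncated or malformed record at line {i + 1}")
        yield header[1:].split()[0]

def check_pair(r1_text, r2_text, max_records=100):
    """Return a list of issues found in the first records of an R1/R2 pair."""
    issues = []
    names1 = list(fastq_names(r1_text.splitlines()[: max_records * 4]))
    names2 = list(fastq_names(r2_text.splitlines()[: max_records * 4]))
    if len(set(names1)) != len(names1):
        issues.append("duplicate read names in R1")
    # Paired files should list the same read names in the same order
    # (ignoring a trailing /1 or /2 suffix).
    strip = lambda n: n.rstrip("12").rstrip("/")
    if [strip(n) for n in names1] != [strip(n) for n in names2]:
        issues.append("mismatched R1/R2 read names")
    return issues

r1 = "@read1/1\nACGT\n+\nIIII\n@read2/1\nACGT\n+\nIIII\n"
r2 = "@read1/2\nACGT\n+\nIIII\n@readX/2\nACGT\n+\nIIII\n"
print(check_pair(r1, r2))  # ['mismatched R1/R2 read names']
```

Running such checks on only the first records is what makes the approach cheap enough to execute client-side before any bytes are uploaded.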

16:45-17:00
Collaborative bioinformatics with Multiplayer IGV
Room: Salle Rhone 3b
Format: Live-stream

Moderator(s): Karsten Hokamp

  • Robert Aboukhalil, Chan Zuckerberg Initiative, United States


Presentation Overview: Show

Bioinformatics is a highly collaborative field, yet most of our visualization tools do not support real-time collaboration. Here we introduce Multiplayer IGV, an interactive web app that adds collaborative functionality to IGV. Multiplayer IGV lets users create IGV sessions with shareable URLs, enabling a group of scientists to collaborate on a shared view. In real time, the interface syncs user presence, cursor movements, mouse clicks, changes to genomic loci, and the adding, removing, and reordering of IGV tracks. To speed up variant review, Multiplayer IGV supports annotating loci of interest with comments, which are also synced to other users.

Real-time syncing is implemented using a combination of two technologies. First, using WebSockets instead of HTTP allows us to more efficiently broadcast messages to users connected to a channel. Second, to ensure the session information is not lost once all users have left the channel, IGV state is regularly persisted to a Postgres database.

One application of this approach is for clinical genomics, where a common need is to enable multiple scientists and clinicians to review variants by inspecting them in IGV. Used alongside video conferencing and screen sharing, Multiplayer IGV can streamline synchronous and asynchronous modes of collaboration.
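The two syncing mechanisms described above (broadcasting events to everyone in a channel, and persisting session state so it survives after the last user leaves) can be sketched as a small in-memory model. The real system uses WebSockets for the broadcast and Postgres for persistence; every name below is hypothetical and this only shows the control flow.

```python
# Conceptual sketch of channel broadcast + session persistence; not the
# actual Multiplayer IGV implementation.

class Channel:
    def __init__(self, session_id, store):
        self.session_id = session_id
        self.store = store          # stands in for the Postgres table
        self.clients = []           # stands in for open WebSocket connections
        self.state = store.get(session_id, {"locus": None, "tracks": []})

    def join(self, client):
        self.clients.append(client)
        client.append(("snapshot", dict(self.state)))  # late joiners catch up

    def broadcast(self, event, payload, sender):
        # One message in, N-1 messages out: every other client is updated.
        self.state[event] = payload
        for client in self.clients:
            if client is not sender:
                client.append((event, payload))
        self.store[self.session_id] = dict(self.state)  # regular persistence

store = {}
chan = Channel("abc123", store)
alice, bob = [], []
chan.join(alice)
chan.join(bob)
chan.broadcast("locus", "chr8:127,735,000-127,742,000", sender=alice)
print(bob[-1])                   # ('locus', 'chr8:127,735,000-127,742,000')
print(store["abc123"]["locus"])  # persisted for users who join later
```

Persisting after each broadcast is what lets asynchronous reviewers reopen the shared URL later and see the session exactly as it was left.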

17:00-17:20
Standards to Connect Biomedical and Behavioral Research to Artificial Intelligence in the Bridge2AI Program
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Jason Williams

  • Harry Caufield, Lawrence Berkeley National Laboratory, United States
  • James Eddy, Sage Bionetworks, United States
  • Wesley Goar, Nationwide Children's Hospital, United States
  • Marcin Joachimiak, Lawrence Berkeley National Laboratory, United States
  • Corey Cox, University of Colorado Anschutz Medical Campus, United States
  • Milen Nikolov, Sage Bionetworks, United States
  • Justin Reese, Lawrence Berkeley National Laboratory, United States
  • James Stevenson, Nationwide Children's Hospital, United States
  • Sarah Gehrke, University of Colorado Anschutz Medical Campus, United States
  • Nomi Harris, Lawrence Berkeley National Laboratory, United States
  • Amy Heiser, Sage Bionetworks, United States
  • Jessica Mitchell, Johns Hopkins University, United States
  • Christopher Chute, Johns Hopkins University, United States
  • Melissa Haendel, University of Colorado Anschutz Medical Campus, United States
  • Christopher Mungall, Lawrence Berkeley National Laboratory, United States
  • Alex Wagner, Nationwide Children's Hospital, United States
  • Monica Munoz-Torres, University of Colorado Anschutz Medical Campus, United States


Presentation Overview: Show

The Bridge to Artificial Intelligence (Bridge2AI) Program aims to facilitate the use of AI in biomedical and behavioral research by generating new, ethically-sourced data and promoting a culture of ethical consideration. The Bridge2AI Integration, Dissemination, and Evaluation (Bridge) Center is tasked with ensuring that these goals are met throughout the data lifecycle. The current lack of standardized data is hindering progress in this field. To remediate this, the Standards Core of the Bridge Center is developing best practices for data collection, deposition, quality assurance, query, dissemination, and integration, with the aim of ensuring that the standards implemented in the quest for generating AI-ready data are generalizable and useful to the wider scientific community. By promoting open source and collaborative development of best practices and norms, and by developing methods to build and extend standards to address key data linkage and integration use cases for AI, the Bridge2AI Standards Core aims to provide researchers with access to new resources for discovery and innovation. Learn more at https://bridge2ai.org/standards-core

17:20-17:40
Transforming unstructured biomedical texts with large language models
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Jason Williams

  • J. Harry Caufield, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
  • Harshad Hegde, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
  • Vincent Emonet, Institute of Data Science, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands
  • Nomi Harris, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
  • Marcin Joachimiak, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
  • Nicolas Matentzoglu, Semanticly Ltd, Athens, Greece
  • Hyeongsik Kim, Robert Bosch LLC, Sunnyvale, CA 94085, USA
  • Sierra Moxon, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
  • Justin Reese, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
  • Melissa Haendel, Anschutz Medical Campus, University of Colorado, Aurora, CO 80217, USA
  • Peter Robinson, The Jackson Laboratory, Bar Harbor, ME 04609, USA
  • Christopher Mungall, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA


Presentation Overview: Show

Creating biological knowledge bases and ontologies relies on time-consuming curation. Newly emerging approaches driven by artificial intelligence and natural language processing can assist curators in populating these knowledge bases, but current approaches rely on extensive training data and are unable to populate arbitrarily complex nested knowledge schemas. We have developed Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES), a knowledge extraction approach that relies on the ability of Large Language Models (LLMs) to perform zero-shot learning and general-purpose query answering from flexible prompts and to return information conforming to a schema. Given a user-defined schema and an input text, SPIRES recursively queries GPT-3+ to obtain responses matching the schema. SPIRES uses existing ontologies and vocabularies to provide identifiers for all matched elements. SPIRES may be applied to varied tasks, including extraction of cellular signaling pathways, disease treatments, drug mechanisms, and chemical-to-disease causation graphs. This approach offers easy customization, flexibility, and the ability to perform new tasks in the absence of any additional training data. SPIRES supports a strategy of leveraging the language-interpreting capabilities of LLMs to assemble knowledge bases, assisting manual knowledge curation and acquisition while supporting validation against publicly available databases and ontologies external to the LLM.
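The recursive, schema-driven querying described above can be sketched as follows. The LLM is replaced by a stub, and the schema, prompt strings, and grounding table are invented for illustration; this is not the SPIRES implementation itself.

```python
# Schema-driven recursive extraction in the spirit of SPIRES, with the LLM
# stubbed out. All schemas, prompts, and identifiers are illustrative only.

def ask_llm(prompt):
    """Stub standing in for a GPT query; returns canned answers."""
    canned = {
        "treatment.drug": "imatinib",
        "treatment.disease": "chronic myeloid leukemia",
    }
    return canned[prompt]

GROUNDING = {  # stand-in for an ontology/vocabulary lookup
    "imatinib": "CHEBI:45783",
    "chronic myeloid leukemia": "MONDO:0011996",
}

def extract(schema, path=""):
    """Recursively query the LLM for each slot of a (possibly nested) schema,
    then ground leaf values to ontology identifiers."""
    result = {}
    for slot, subschema in schema.items():
        full = f"{path}.{slot}" if path else slot
        if isinstance(subschema, dict):       # nested schema: recurse
            result[slot] = extract(subschema, full)
        else:                                 # leaf slot: query and ground
            value = ask_llm(full)
            result[slot] = {"text": value, "id": GROUNDING.get(value)}
    return result

schema = {"treatment": {"drug": "string", "disease": "string"}}
print(extract(schema))
# {'treatment': {'drug': {'text': 'imatinib', 'id': 'CHEBI:45783'}, ...}}
```

The recursion is what lets arbitrarily nested schemas be populated slot by slot, while the grounding step ties free-text answers back to identifiers that can be validated against external ontologies.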

17:40-17:45
EUGENe: A Python toolkit for predictive analyses of regulatory sequences
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Jason Williams

  • Adam Klie, University of California, San Diego, United States
  • David Laub, University of California, San Diego, United States
  • James Talwar, University of California, San Diego, United States
  • Hayden Stites, Daniel Hand High School, United States
  • Joe Solvason, University of California, San Diego, United States
  • Tobias Jores, University of Washington, United States
  • Emma Farley, University of California, San Diego, United States
  • Hannah Carter, University of California, San Diego, United States


Presentation Overview: Show

Deep learning (DL) has made significant strides in regulatory genomics, but its progress and widespread adoption have been limited by a fragmented set of tools, methods, and data. To address this, we introduce EUGENe (Elucidating the Utility of Genomic Elements with Neural Nets), a Findable, Accessible, Interoperable, and Reusable (FAIR) toolkit that integrates these aspects into a cohesive set of Python packages. EUGENe offers unparalleled functionality while maintaining ease of use through simplified and standardized interfaces for dataset loading and preprocessing, model instantiation and training, and model evaluation and interpretation. We demonstrate EUGENe's ability to go from processed data to manuscript-quality figures by training and evaluating built-in, seminal, and custom sequence models on three distinct tasks: promoter activity prediction, RNA binding protein specificity prediction, and transcription factor binding event classification. We emphasize that the code used in each use case is simple, readable, and well documented. We believe that EUGENe provides a platform upon which computational scientists can rapidly develop and share methods and models for answering critical questions about the regulatory sequence code, thereby facilitating a collaborative ecosystem for DL applications in regulatory genomics research.
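One of the preprocessing steps such sequence-model toolkits standardize is one-hot encoding of DNA. The sketch below is a generic illustration of that step only; it does not use the EUGENe API, and the function name is invented.

```python
# Minimal sketch of one-hot encoding DNA for a sequence model; illustrative
# only, not EUGENe code.
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) float32 array; any base not
    in ACGT (e.g. N) becomes an all-zero row."""
    idx = {base: i for i, base in enumerate(ALPHABET)}
    arr = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for row, base in enumerate(seq.upper()):
        if base in idx:
            arr[row, idx[base]] = 1.0
    return arr

x = one_hot("ACGTN")
print(x.shape)        # (5, 4)
print(x.sum(axis=1))  # [1. 1. 1. 1. 0.] -- the N row is all zeros
```

The resulting (length, 4) matrices are the standard input representation for the convolutional sequence models used in tasks like promoter activity prediction.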

17:45-17:50
Identifying integrated multi-omics biomarkers to build a sepsis detection model using machine learning
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Jason Williams

  • Tyrone Chen, Monash University, Australia
  • Anton Y. Peleg, Monash University, Australia
  • Sonika Tyagi, Royal Melbourne Institute of Technology, Australia


Presentation Overview: Show

Sepsis, which is responsible for 20% of global deaths, has prompted the development of data-driven methods to improve its detection, resolution, prognosis, and treatment. Existing methods detect biomarkers through a single assay, but face challenges in capturing the complex combination of biomolecular interactions involved in gene regulation and expression. To address this, we developed a multi-omics approach that assays cellular levels of various molecules to obtain a standardised sepsis diagnosis and prognosis model. Fast and accurate sepsis detection will supplement clinical treatment of sepsis at all levels, leading to a personalised medicine approach.

We first obtained multi-omics data, through our partnership with an Australian National Framework project, on multiple bacterial strains associated with sepsis and non-sepsis conditions. Genomics, transcriptomics, proteomics, and metabolomics data are available. Using these as input, we split our workflow into two stages: in the first, we convert the regulatory omics data into a generic data representation; in the second, we integrate the functional omics component with the regulatory omics component. We hypothesised that the regulatory omics signatures drive the functional signature. Our method is both species- and data-agnostic, and is publicly available as a conda package.

Tuesday, July 25th
10:30-11:30
Invited Presentation: The Dissonance between Scientific Altruism & Capitalist Extraction: The Zero Trust and Federated Data Sovereignty Solution
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Jason Williams

  • Joseph M Yracheta, Native Bio-Data Consortium, United States


Presentation Overview: Show

The history that led to the roadbuilding of Open Data’s current pathway is a murky one, not fully transparent as to its causes, sponsors, or decision makers. The philosophical underpinnings and moral obligation of Open Data are flawed and over-prioritize funding commitments and scientists’ rights to data rather than the privacy, development, and implementation obligations that are often left to the unjust realm of commercial capitalization. Mr. Yracheta will discuss this from the perspective of the USA’s American Indian nations (as sovereign Domestic Dependent Nations) and their relationship with state and federal governments. Several U.S. policy documents and legislative acts that were created in and out of consultation with American Indian nations will be shared and discussed. Key gaps in the moral obligation to society at large will be demonstrated via this special relationship with tribes. Where possible, Mr. Yracheta will employ game theory to move the Open Data argument into a more socially just and philosophically robust posture. Two possible pathways to our current landscape are interrogated: 1) Is the current Open Data environment well intended but ignorant? Neglectful? Aligned with or against the Belmont principle of non-coercion? Or 2) Is the current Open Data environment obfuscatory, purposely predatory, and extractive? The answer to either question will determine the timing and delivery of benefits to societies and individuals.

Joseph M. Yracheta is an Amerindigenous Scientist (P’urhepecha y Raramuri from Mexico) and Executive Director of the Native BioData Consortium within the Cheyenne River Lakota Nation (Sioux). Mr. Yracheta has been a scientist since 1990; he started as a bench biotechnician and worked across many disciplines. In 2014 he graduated from UW Seattle with a master’s in Pharmaceutics and Bioethics under Drs. Ken Thummel and Wylie Burke. He is currently finishing a DrPH in Environmental Health and Engineering from Johns Hopkins under Drs. Ana Navas-Acien and Paul Locke.

Mr. Yracheta is passionately working to end Amerindigenous Health Disparity by the “wearing many research hats” of law, ethics, policy, genomics, omics, health outcomes, epidemiology, health care prevention/intervention and allostatic load from systemic racism. Mr. Yracheta believes that ALL data and resources must be seen as unforeseen futures, where their value will constantly change. He feels this data must be secured for Indigenous economic sustainability.

11:30-11:50
An Open Source Platform for Scalable Genomics Data Infrastructures
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Radhika Khetani

  • Mitchell Shiell, Ontario Institute of Cancer Research (OICR), Canada
  • Jon Eubank, Ontario Institute of Cancer Research (OICR), Canada
  • Justin Richardson, Ontario Institute of Cancer Research (OICR), Canada
  • Brandon Chan, Ontario Institute of Cancer Research (OICR), Canada
  • Puneet Bajwa, Ontario Institute of Cancer Research (OICR), Canada
  • Robin Haw, Ontario Institute of Cancer Research (OICR), Canada
  • Christina Yung, Ontario Institute of Cancer Research (OICR), Canada
  • Lincoln Stein, Ontario Institute of Cancer Research (OICR), Canada
  • Melanie Courtot, Ontario Institute of Cancer Research (OICR), Canada


Presentation Overview: Show

Data repositories are essential resources to accelerate scientific discoveries over unified genomics datasets. Unfortunately, building and maintaining them is resource-intensive, requiring a breadth of expertise from software engineers, cloud infrastructure specialists, and bioinformaticians. Overture addresses this with an extensible open-source platform of modular components designed to be built into scalable genomics data infrastructures.

The five core microservices of Overture work in concert to create scalable data commons for filtering, querying and collaborating over large datasets. Ego handles authentication and authorization, alongside an administrative Ego UI component. Score is our file transfer and object storage microservice with integrated SAMtools capabilities. Metadata management is handled by Song, which tracks and validates file metadata across distributed servers and against user-defined schemas. Maestro handles the indexing of Song repositories into a single Elasticsearch index, which the Arranger Search API then consumes and exposes through pre-built, configurable UI components.

Overture's core capabilities were initially informed by our experiences working on International Cancer Genome Consortium (ICGC) and the NCI Genomic Data Commons Data Portal. Today, Overture powers and informs all our projects, most notably ICGC-Accelerating Research in Genomic Ontology (ICGC-ARGO), VirusSeq and the Ontario Hereditary Cancer Research Network.

11:50-12:05
BioThings SDK for building a knowledge base API ecosystem in the context of the Biomedical Translator Program
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Radhika Khetani

  • Yao Yao, The Scripps Research Institute, United States
  • Jason Lin, The Scripps Research Institute, United States
  • Everaldo Rodolpho, The Scripps Research Institute, United States
  • Marco Alvarado Cano, The Scripps Research Institute, United States
  • Nichollette Acosta, The Scripps Research Institute, United States
  • Ginger Tsueng, The Scripps Research Institute, United States
  • Andrew I. Su, The Scripps Research Institute, United States
  • Chunlei Wu, The Scripps Research Institute, United States


Presentation Overview: Show

Building web-based APIs (Application Programming Interfaces) has become an increasingly popular way to access biomedical data and knowledge, thanks to their flexibility, simplicity, and reliability compared with traditional flat-file downloads. Our team has previously developed a set of high-performance, scalable biomedical knowledge APIs, called “BioThings APIs”, which are now commonly used by the community. Here, we focus on the underlying toolkit used to build these APIs and how it can help specific research communities build their own API-centered knowledge base ecosystems. This toolkit, the BioThings SDK (https://biothings.io), is a generalized software development kit for developers to build, update, and deploy high-performance APIs from any knowledge source and biomedical data type. Users can take advantage of the abstracted technical layers we built into the SDK to produce a high-performance API that follows best practices and community standards. In the NCATS-funded Biomedical Translator program, the BioThings SDK has been used by multiple teams to create dozens of new Knowledge Provider APIs. These APIs in turn contribute to a large-scale, sustainable, and interoperable API-based knowledge base ecosystem. We believe this use case from the Translator program is also applicable to other domain-specific research communities facing the same knowledge sharing, integration, and long-term maintenance challenges.

12:05-12:10
Reproducible models in Systems Biology are higher cited
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Radhika Khetani

  • Sebastian Höpfl, Institute for Stochastics and Applications (ISA), Germany
  • Nicole Radde, Institute for Stochastics and Applications (ISA), Germany


Presentation Overview: Show

The Systems Biology community was among the first scientific communities to recognize the need for reproducible models. Still, many published models are not reproducible. Tiwari et al. recently classified 328 published models by their reproducibility and found that only about every second model was directly reproducible. We use this classification to analyze whether reproducible models in Systems Biology have a higher impact in terms of citation numbers. For the analysis, we use Bayesian estimation, which provides complete distributional information for group means and standard deviations; outliers are handled via a non-central t-distribution. Beginning about ten years after the introduction of SBML, around 2013, reproducibility gained broad awareness in the Systems Biology community. Since then, reproducible models have received significantly more citations than non-reproducible ones. Our analysis shows, with 95% credibility, higher citation rates for reproducible models over 2013-2020 and over all investigated sub-periods within that range.
Further, normalization by journal impact factor shows that journals of all ranks could profit from enforcing reproducibility. In conclusion, reproducible models offer long-term benefits for journals and individual researchers in terms of more citations. The higher citation count provides evidence of increased reuse of these models and could promote progress in Systems Biology.

12:10-12:15
AutSPACEs: a co-created and open source citizen science project to improve environments for sensory processing in autistic people
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Radhika Khetani

  • Bastian Greshake Tzovaras, The Alan Turing Institute, United Kingdom
  • Georgia Aitkenhead, The Alan Turing Institute, United Kingdom
  • Helen Duncan, The Alan Turing Institute, United Kingdom
  • Callum Mole, The Alan Turing Institute, United Kingdom
  • Martin Stoffel, The Alan Turing Institute, United Kingdom
  • David Llewellyn-Jones, The Alan Turing Institute, United Kingdom
  • Robin Taylor, Independent, United Kingdom
  • Susanna Fantoni, Independent, United Kingdom
  • James Scott, Independent, United Kingdom
  • Otis Smith, Independent, United Kingdom
  • Kirstie Whitaker, The Alan Turing Institute, United Kingdom


Presentation Overview: Show

Around 90% of autistic people process sensory information differently to non-autistic people. Consequently, many public spaces are not well designed for autistic people, e.g. are too bright, busy or loud. Prior research on sensory processing and autism was mostly done in laboratory settings and has overlooked the lived experience of autistic people, thus failing to generalise to real world environments. With AutSPACEs – short for Autism research into Sensory Processing for Accessible Community Environments – we are running a community-led, online citizen science project aimed at filling this data gap in order to make environments more accessible for autistic people.

While an open-source web platform for collecting qualitative data on sensory processing experiences is at the technical heart of the project, the real power of AutSPACEs lies in its implementation of the disability rights movement’s motto “Nothing about us, without us”. AutSPACEs is co-designed by autistic people with the goal to empower autistic people throughout the process. We engaged in co-creation to expand our scope to include recommendations of what could have made sensory processing experiences better. Based on this, AutSPACEs has become a joint repository of qualitative data on sensory processing experiences and strategies for how to improve them.

12:15-12:20
Open Life Science: A mentoring & training virtual program for Open Science ambassadors
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Radhika Khetani

  • Yo Yehudi, Open Life Science, United Kingdom
  • Pradeep Eranti, Open Life Science, France
  • Sara El-Gebali, FAIRPoint & Open Life Science, Sweden
  • Bérénice Batut, University of Freiburg & Open Life Science, Germany


Presentation Overview: Show

Open Life Science (OLS) offers 16-week training and mentoring virtual programs to empower researchers and their teams to lead open research projects in their respective domains and become open science ambassadors for their communities.
With the combination of practical training and 1:1 support from our mentors, we guide participants to reflect on and apply open practices in the context of the socio-cultural environment where they conduct their research. Participants join the program with a project. They attend cohort calls to learn about open science practices and frameworks to apply open science skills, including FAIR research principles and equitable and inclusive community practices. In mentoring calls every alternating week, they reflect on their progress and identify the next steps. To strengthen their skills, graduates can re-join subsequent cohorts as mentors, call facilitators, or experts. We offer micro-grants, live transcription, and other resources to make the program inclusive and accessible.
Since 2020, we have run 7 cohorts and trained >300 participants from 6 continents, across 50+ low- and middle-income countries (LMICs) and high-income countries (HICs), with the help of 120+ mentors and 150+ experts.
This talk will describe the various aspects of the program and highlight all open, reusable, and FAIR materials.

12:20-12:25
Building and Sustaining a Community of Computational Biologists at EMBL through Open-Source Tools and Four Pillars: Training, Community, Infrastructure, and Information
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Radhika Khetani

  • Lisanna Paladin, EMBL - European Molecular Biology Laboratory, Germany
  • Renato Alves, EMBL - European Molecular Biology Laboratory, Germany


Presentation Overview: Show

Especially in the interdisciplinary field of computational biology, collaboration is crucial. The Bio-IT project at the European Molecular Biology Laboratory (EMBL) established a community of bioinformaticians through training, community building, infrastructure, and information support. The project is managed by staff, providing a model for how institutional support can empower community-driven initiatives in science.
Training is an essential component of the Bio-IT project. Researchers learn from their peers through open-source materials and practical workshops, run in a hybrid format to allow participation across all six EMBL sites in Europe. But training alone doesn’t change the culture. It is coupled with community-building initiatives such as meetups, coding clubs, an internal Mattermost chat, and a "grassroots table" of internal experts. Infrastructure support includes coding tools and a GitLab server. The project provides newcomer guides, intranet links, and a blog to ensure access to comprehensive information.
The Bio-IT project provides a model for research institutes looking to develop similar programs to support internal field-specific communities. We will highlight the benefits and challenges of implementing this model, and we hope to spark a little revolution in the way scientists approach community and in the way institutions recognise the value of professionals contributing to it.

13:50-14:10
Session: Joint Session with Bio-Ontologies
The Research Software Ecosystem: an open software metadata commons
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Nomi Harris

  • Hans Ienasescu, Technical University of Denmark, Denmark
  • Salvador Capella-Gutiérrez, Barcelona Supercomputing Center (BSC), Spain
  • Frederik Coppens, Ghent University, Belgium
  • José Mª Fernández, Barcelona Supercomputing Center (BSC), Spain
  • Alban Gaignard, Institut du Thorax, University of Nantes, France
  • Carole Goble, The University of Manchester, United Kingdom
  • Bjoern Gruening, Uni-Freiburg, Germany
  • Johan Gustafsson, Australian Biocommons, Australia
  • Josep Ll Gelpi, Dept. Bioquimica i Biologia Molecular, Univ. Barcelona, Spain
  • Jennifer Harrow, ELIXIR, United Kingdom
  • Steven Manos, Australian BioCommons, Australia
  • Kota Miura, Bioimage Analysis & Research, Japan
  • Steffen Möller, Rostock University Medical Center, Germany
  • Stuart Owen, The University of Manchester, United Kingdom
  • Perrine Paul-Gilloteaux, Institut du Thorax, University of Nantes, France
  • Hedi Peterson, University of Tartu, Estonia
  • Manthos Pithoulias, ELIXIR Europe, United Kingdom
  • Jonathan Tedds, ELIXIR Europe, United Kingdom
  • Dmitri Repchevsky, Barcelona Supercomputing Center (BSC), Spain
  • Federico Zambelli, Department of Biosciences, University of Milan, Italy
  • Oleg Zharkov, University of Freiburg, Germany
  • Matúš Kalaš, Computational Biology Unit, Department of Informatics, University of Bergen, Norway
  • Herve Menager, Institut Pasteur, Université Paris Cité, France


Presentation Overview: Show

Research software is a critical component of computational research. Being able to discover, understand and adequately utilize software is essential. Many existing services facilitate these tasks, all of them relying heavily on software metadata. The continued upkeep of such complex and large sets of metadata requires duplicated curation effort, and the resulting metadata are often sparse and inconsistent.
The Research Software Ecosystem (RSEc) aims to act as a proxy to maintain and preserve high-quality metadata for describing research software. These metadata are retrieved from and synchronized with many major software-related services, within and beyond the ELIXIR Tools Platform. The EDAM ontology enables the semantic description of the scientific function of the described software.
The RSEc central repository is a GitHub repository that aggregates software metadata mostly related to life sciences, and spanning the multiple aspects of software discovery, evaluation, deployment and execution. The aggregation of metadata in a centralized, open, and version-controlled repository enables the cross-linking of services, the validation and enrichment of software metadata, the development of new services and the analysis of these metadata.
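The enrichment step described above can be sketched as follows. This is an illustrative sketch only: the record layout, field names, and source services are hypothetical stand-ins, not the actual RSEc repository structure.

```python
# Sketch: aggregate per-tool metadata harvested from several registries.
# The record layout and service names are illustrative, not the real RSEc schema.

def merge_tool_metadata(records):
    """Merge partial metadata records that describe the same tool.

    Later sources fill in fields the earlier ones left empty;
    topic annotations are unioned across sources.
    """
    merged = {"topics": set()}
    for rec in records:
        for key, value in rec.items():
            if key == "topics":
                merged["topics"].update(value)
            elif value and not merged.get(key):
                merged[key] = value
    return merged

# Partial records for one tool, as three hypothetical services might expose them.
from_registry_a = {"id": "samtools", "homepage": "https://www.htslib.org",
                   "topics": {"Sequence analysis"}}
from_registry_b = {"id": "samtools", "container": "quay.io/biocontainers/samtools",
                   "topics": set()}
from_registry_c = {"id": "samtools", "topics": {"Sequencing"}}

entry = merge_tool_metadata([from_registry_a, from_registry_b, from_registry_c])
```

Aggregating in one version-controlled place, as RSEc does, means each downstream service sees the union of what every source knows rather than its own sparse slice.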

14:10-14:30
Session: Joint Session with Bio-Ontologies
The Linked data Modeling Language (LinkML): a general-purpose data modeling framework
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Nomi Harris

  • Sierra Moxon, LBNL, United States
  • Harold Solbrig, solbrig informatics, United States
  • Deepak Unni, Swiss Institute of Bioinformatics, Switzerland
  • Mark Miller, LBNL, United States
  • Patrick Kalita, LBNL, United States
  • Sujay Patil, LBNL, United States
  • Kevin Schaper, Anschutz Medical Campus, University of Colorado, United States
  • Tim Putman, Anschutz Medical Campus, University of Colorado, United States
  • Corey Cox, Anschutz Medical Campus, University of Colorado, United States
  • Harshad Hegde, LBNL, United States
  • J. Harry Caufield, LBNL, United States
  • Justin Reese, LBNL, United States
  • Melissa Haendel, Anschutz Medical Campus, University of Colorado, United States
  • Christopher J. Mungall, LBNL, United States


Presentation Overview: Show

The Linked data Modeling Language (https://linkml.io) is a data modeling framework that provides a flexible yet expressive standard for describing many kinds of data models, from value sets and flat, checklist-style standards to complex normalized data structures that use polymorphism and inheritance. It is purposefully designed so that software engineers and subject matter experts can communicate effectively in the same language, while also providing the semantic underpinnings to make data conforming to LinkML schemas easier to understand and reuse computationally. The LinkML framework includes tools to serialize data models in many formats, including but not limited to JSON Schema, OWL, SQL DDL, and Python Pydantic classes. It also includes tools to help convert instance data between different model serializations (LinkML runtime), convert schemas from one framework to another (LinkML convert), validate data against a LinkML schema (LinkML validate), retrieve model metadata (LinkML schemaview), bootstrap a LinkML schema from another framework (LinkML schema automator), and tools that auto-generate documentation and schema diagrams. LinkML is an open, extensible modeling framework that allows computers and people to work cooperatively, and it makes it easy to model, validate, and distribute data that is reusable and interoperable.
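To give a flavor of what schema-driven validation buys you, here is a deliberately tiny sketch in plain Python. The dict mimics a much-simplified LinkML class definition; it is not the LinkML API itself, which is installed separately (`pip install linkml`) and is far richer.

```python
# Illustrative sketch in the spirit of LinkML: a class with typed, optionally
# required slots, and a validator that checks instance data against it.
# This is NOT the real LinkML framework, just a minimal model of the idea.

schema = {
    "classes": {
        "Person": {
            "slots": {
                "id":   {"required": True,  "range": "string"},
                "name": {"required": True,  "range": "string"},
                "age":  {"required": False, "range": "integer"},
            }
        }
    }
}

PY_RANGES = {"string": str, "integer": int}

def validate(instance, class_name):
    """Return a list of validation problems (empty list = valid)."""
    problems = []
    slots = schema["classes"][class_name]["slots"]
    for slot, spec in slots.items():
        if slot not in instance:
            if spec["required"]:
                problems.append(f"missing required slot: {slot}")
            continue
        if not isinstance(instance[slot], PY_RANGES[spec["range"]]):
            problems.append(f"slot {slot}: expected {spec['range']}")
    return problems

assert validate({"id": "p1", "name": "Ada", "age": 36}, "Person") == []
assert validate({"id": "p1"}, "Person") == ["missing required slot: name"]
```

In actual LinkML the schema would be written once in YAML and the validator, JSON Schema, OWL, and Pydantic classes would all be generated from it.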

14:30-14:50
Session: Joint Session with Bio-Ontologies
KG-Hub: a framework to facilitate discovery using biological and biomedical knowledge graphs
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Nomi Harris

  • J. Harry Caufield, Lawrence Berkeley National Laboratory, United States
  • Harshad Hegde, Lawrence Berkeley National Laboratory, United States
  • Sierra Moxon, Lawrence Berkeley National Laboratory, United States
  • Marcin Joachimiak, Lawrence Berkeley National Laboratory, United States
  • Chris Mungall, Lawrence Berkeley National Laboratory, United States
  • Justin Reese, Lawrence Berkeley National Laboratory, United States


Presentation Overview: Show

Knowledge graphs (KGs) are a powerful approach for integrating and extracting new knowledge from heterogeneous data using techniques such as graph machine learning, and have been successful in biomedicine and many other domains. However, a framework for FAIR construction and exchange of KGs is absent, resulting in redundant efforts, a lack of KG reuse, and insufficient reproducibility. KG-Hub is an open-source framework that was created to address these challenges by standardizing and facilitating KG assembly. KG-Hub enables consistent extract-transform-load (ETL), ensuring compliance with the Biolink Model (a data model for standardizing biological concepts and relationships), and producing versioned and automatically updated builds with stable URLs for graph data and other artifacts. The resulting graphs are easily integrated with any OBO (Open Biological and Biomedical Ontologies) ontology. KG-Hub also includes web-browsable storage of KG artifacts on cloud infrastructure, easy reuse of transformed subgraphs across projects, automated graph machine learning on KGs using a YAML-based framework, and a visualization dashboard and manifest file to quickly assess KG contents.
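The transform step KG-Hub standardizes can be sketched like this: source records become separate node and edge tables carrying Biolink categories and predicates. The two-table layout here is a simplified stand-in for the real KGX-style files; the source records are invented for illustration.

```python
# Sketch of an extract-transform-load (ETL) step: raw association records
# become Biolink-Model-compliant node and edge tables. Input data invented;
# the real KG-Hub pipelines emit richer KGX-format files.

source_records = [
    {"gene": "HGNC:1100", "disease": "MONDO:0007254", "evidence": "PMID:123"},
    {"gene": "HGNC:620",  "disease": "MONDO:0007254", "evidence": "PMID:456"},
]

def transform(records):
    """Build deduplicated node and edge tables from raw association records."""
    nodes, edges = {}, []
    for rec in records:
        nodes[rec["gene"]] = {"id": rec["gene"], "category": "biolink:Gene"}
        nodes[rec["disease"]] = {"id": rec["disease"], "category": "biolink:Disease"}
        edges.append({
            "subject": rec["gene"],
            "predicate": "biolink:gene_associated_with_condition",
            "object": rec["disease"],
            "publications": rec["evidence"],
        })
    return list(nodes.values()), edges

nodes, edges = transform(source_records)
```

Because every source is transformed into the same shape, graphs from different projects can be merged, subsetted, and fed to graph machine learning without per-source glue code.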

14:50-14:55
Session: Joint Session with Bio-Ontologies
The SPHN Semantic Interoperability Framework: From clinical routine data to FAIR research data
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Nomi Harris

  • Vasundra Touré, Swiss Institute of Bioinformatics SIB, Switzerland
  • Deepak Unni, Swiss Institute of Bioinformatics SIB, Switzerland
  • Sabine Österle, Swiss Institute of Bioinformatics SIB, Switzerland
  • Katrin Crameri, Swiss Institute of Bioinformatics SIB, Switzerland


Presentation Overview: Show

The Swiss Personalized Health Network (SPHN) is an initiative funded by the Swiss government for building a nationwide infrastructure for sharing clinical and health-related data in a secure and FAIR (Findable, Accessible, Interoperable, Reusable) manner. One goal is to ensure that data coming from different sources is interoperable between stakeholders. The priority was to develop a purpose-independent description of existing knowledge rather than relying on existing data models which are focused on specific use cases.

Together with partners at University Hospitals we have developed the SPHN Semantic Interoperability Framework which encompasses:
- semantics definition for data standardization
- data format specifications for data exchange
- software tools to support data providers and users
- training for facilitating knowledge sharing with stakeholders

Well-defined concepts connected to machine-readable semantic standards (e.g., SNOMED CT and LOINC) function as reusable universal building blocks that can be connected with each other to represent information. By adopting semantic web technologies, a specific schema has been built to encode the semantic concepts with given rules and conventions.

This framework is implemented in all Swiss university hospitals and forms the basis for future data-driven research projects with clinical and other health-related data.

14:55-15:00
Session: Joint Session with Bio-Ontologies
OMEinfo: global geographic metadata for -omics experiments
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Nomi Harris

  • Matthew Crown, Northumbria University, United Kingdom
  • Matthew Bashton, Northumbria University, United Kingdom


Presentation Overview: Show

Microbiome classification studies increasingly associate geographical features like rurality and climate with microbiomes. However, microbiologists/bioinformaticians often struggle to access and integrate rich geographical metadata from sources such as GeoTIFFs. Inconsistent definitions of rurality, for example, can hinder cross-study comparisons. To address this, we present OMEinfo, a Python-based tool for automated retrieval of consistent geographical metadata from user-provided location data. OMEinfo leverages open data sources like the Global Human Settlement Layer, Köppen-Geiger climate classification models and the Open-Data Inventory for Anthropogenic Carbon dioxide, to ensure metadata accuracy and provenance.

OMEinfo's Dash application enables users to visualise their sample metadata on an interactive map, to investigate the spatial distribution of metadata features, which is complemented by a numerical data visualisation to analyse patterns and trends in the geographical data before further analysis. The tool is available as a Docker container, providing a portable, lightweight solution for researchers. Through its standardised metadata retrieval approach and incorporation of FAIR and Open data principles, OMEinfo promotes reproducibility and consistency in microbiome metadata. As the field continues to explore the relationship between microbiomes and geographical features, tools like OMEinfo will prove vital in developing a robust, accurate, and interconnected understanding of these interactions.
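The core lookup OMEinfo automates can be illustrated with a toy version: given sample coordinates, attach a consistent annotation from a reference layer. Here a tiny in-memory table stands in for a real raster such as a Köppen-Geiger GeoTIFF; the function and field names are hypothetical, not OMEinfo's API.

```python
import math

# Sketch: annotate samples with a climate class by nearest reference point.
# A three-row table stands in for a real gridded climate raster; in practice
# a tool like OMEinfo reads the value directly from the GeoTIFF cell.

REFERENCE = [
    (51.5, -0.1, "Cfb"),   # London: temperate oceanic
    (40.4, -3.7, "Csa"),   # Madrid: hot-summer Mediterranean
    (64.1, -21.9, "Cfc"),  # Reykjavik: subpolar oceanic
]

def climate_class(lat, lon):
    """Return the climate class of the nearest reference point."""
    def dist(point):
        return math.hypot(lat - point[0], lon - point[1])
    return min(REFERENCE, key=dist)[2]

samples = [{"id": "s1", "lat": 52.2, "lon": 0.1},
           {"id": "s2", "lat": 41.0, "lon": -4.0}]
annotated = [{**s, "climate": climate_class(s["lat"], s["lon"])} for s in samples]
```

Deriving such annotations from one shared reference layer, rather than per-study definitions, is what makes the resulting metadata comparable across studies.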

15:00-15:05
Session: Joint Session with Bio-Ontologies
FAIR-BioRS: Actionable guidelines for making biomedical research software FAIR
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Nomi Harris

  • Bhavesh Patel, FAIR Data Innovations Hub, California Medical Innovations Institute, United States
  • Hervé Ménager, Institut Pasteur, Université Paris Cité, France
  • Sanjay Soundarajan, FAIR Data Innovations Hub, California Medical Innovations Institute, United States


Presentation Overview: Show

We present the first actionable guidelines for making biomedical research software Findable, Accessible, Interoperable, and Reusable (FAIR) in line with the FAIR principles for Research Software (FAIR4RS principles). The FAIR4RS principles are the outcome of a large-scale global initiative to adapt the FAIR data principles to research software. They provide a framework for optimizing the reusability of research software and encourage open science. The FAIR4RS principles are, however, aspirational. Practical guidelines that biomedical researchers can easily follow for making their research software FAIR are lacking. To fill this gap, we established the first minimal and actionable guidelines that researchers can follow to easily make their biomedical research software FAIR. We designate these guidelines as the FAIR Biomedical Research Software (FAIR-BioRS) guidelines. The guidelines provide actionable step-by-step instructions that clearly specify relevant standards, best practices, metadata, and sharing platforms to use. We believe that the FAIR-BioRS guidelines will empower and encourage biomedical researchers into adopting FAIR and open practices for their research software. We present here our approach to establishing these guidelines, summarize their major evolution through community feedback since the first version was presented at BOSC 2022, and explain how the community can benefit from and contribute to them.

15:05-15:25
Session: Joint Session with Bio-Ontologies
BioThings Explorer: a query engine for a federated knowledge graph of biomedical APIs
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Nomi Harris

  • Jackson Callaghan, Scripps Research, United States
  • Colleen Xu, Scripps Research, United States
  • Jiwen Xin, Scripps Research, United States
  • Marco Cano, Scripps Research, United States
  • Eric Zhou, Scripps Research, United States
  • Rohan Juneja, Scripps Research, United States
  • Yao Yao, Scripps Research, United States
  • Madhumita Narayan, Scripps Research, United States
  • Kristina Hanspers, Gladstone Institutes, United States
  • Ayushi Agrawal, Gladstone Institutes, United States
  • Alexander Pico, Gladstone Institutes, United States
  • Chunlei Wu, Scripps Research, United States
  • Andrew Su, Scripps Research, United States


Presentation Overview: Show

Knowledge graphs are an increasingly common data structure for representing biomedical information. They can easily represent heterogeneous types of information, and many algorithms and tools exist for operating on them. Biomedical knowledge graphs have been used in a variety of applications, including drug repurposing, identification of drug targets, prediction of drug side effects, and clinical decision support. Typically, such graphs are constructed as a single structural entity by centralizing and integrating data from multiple disparate sources. We present BioThings Explorer, an application that can query a virtual, federated knowledge graph representing the aggregated information of many disparate biomedical web services. BioThings Explorer leverages semantically precise annotations of the inputs and outputs for each resource, and automates the chaining of web service calls to execute multi-step graph queries. Because there is no large, centralized knowledge graph to maintain, BioThings Explorer is distributed as a lightweight application that dynamically retrieves information at query time. More information can be found at https://explorer.biothings.io, and code is available at https://github.com/biothings/biothings_explorer.
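The query-time chaining described above can be sketched with mock services: each "service" answers one edge type, and a two-hop query is executed by feeding the outputs of the first call into the second. The services, mappings, and identifiers below are invented stand-ins, not real BioThings APIs.

```python
# Sketch of federated multi-hop querying: no central graph is stored; each
# hop is one (mock) web-service call, and intermediate results are piped
# from one call into the next. All data below is illustrative.

gene_to_pathway = {"BRCA1": ["Homologous recombination"]}
pathway_to_drug = {"Homologous recombination": ["olaparib", "talazoparib"]}

def call_service(mapping, inputs):
    """Simulate one annotated web-service call: map inputs to outputs."""
    return sorted({out for item in inputs for out in mapping.get(item, [])})

def two_hop(gene):
    """gene -> pathways -> drugs, chaining two service calls."""
    pathways = call_service(gene_to_pathway, [gene])
    return call_service(pathway_to_drug, pathways)

drugs = two_hop("BRCA1")
```

The semantic annotations on each real service's inputs and outputs are what let the engine decide, automatically, which calls can be chained to answer a given graph query.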

15:25-15:30
Session: Joint Session with Bio-Ontologies
Presenter Q&A and Joint Session Closing
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Nomi Harris

  • Nomi Harris
16:00-16:20
Ten lessons learned on improving the open data reusability of bioinformatics knowledge bases
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Deepak Unni

  • Tarcisio Mendes de Farias, SIB Swiss Institute of Bioinformatics, Switzerland
  • Julien Wollbrett, SIB Swiss Institute of Bioinformatics, Switzerland
  • Marc Robinson-Rechavi, University of Lausanne, Switzerland
  • Frederic Bastian, SIB Swiss Institute of Bioinformatics, Switzerland


Presentation Overview: Show

Background: Enhancing the interoperability of bioinformatics knowledge bases is a high-priority requirement to maximize open data reusability, and thus increase their utility, such as the return on investment for biomedical research. A knowledge base may provide useful information for life scientists and other knowledge bases, but it only acquires exchange value once the knowledge base is (re)used; without interoperability, this utility lies dormant. Results: In this talk, we discuss several approaches to boost interoperability depending on the interoperable parts. The findings are driven by several real-world scenario examples that were mostly implemented by Bgee, a well-established gene expression database. Moreover, we discuss ten general lessons learnt. These lessons can be applied in the context of any bioinformatics knowledge base to foster open data reusability. Conclusions: This work provides pragmatic methods and transferable skills to promote reusability of bioinformatics knowledge bases by focusing on interoperability.

16:20-16:25
FAIR Data Cube, a FAIR data infrastructure for integrated multi-omics data analysis
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Deepak Unni

  • Xiaofeng Liao, RadboudUMC, Netherlands
  • Anna Niehues, Radboudumc, Netherlands
  • Casper de Visser, Radboudumc, Netherlands
  • Junda Huang, Radboudumc, Netherlands
  • Cenna Doornbos, Radboudumc, Netherlands
  • Thomas Ederveen, Radboudumc, Netherlands
  • Purva Kulkarni, Radboudumc, Netherlands
  • Joeri van de Velde, University Medical Center Groningen, Netherlands
  • Morris Swertz, University Medical Center Groningen, Netherlands
  • Martin Brandt, SURF, Netherlands
  • Alain van Gool, Radboudumc, Netherlands
  • Peter-Bram Hoen, Radboudumc, Netherlands


Presentation Overview: Show

We are witnessing an enormous growth in the amount of molecular profiling (-omics) data. A significant fraction of these data may be misused to de-anonymize and (re-)identify individuals. Hence, most data is kept in secure and protected silos. Therefore, it remains a challenge to reuse these data without infringing the privacy of the individuals from which the data were derived. Federated analysis of FAIR data is a privacy-preserving solution to make optimal use of these multi-omics data and transform them into actionable knowledge.

The Netherlands X-omics Initiative is a National Roadmap Large-Scale Research Infrastructure aiming for efficient integration of data generated within X-omics and external datasets. To facilitate this, we developed the FAIR Data Cube (FDCube), which adopts and applies the FAIR principles and helps researchers to create FAIR data and metadata, facilitate reuse of their data, and make their data analysis workflows transparent.

Considering the privacy-sensitive nature of -omics data, the FDCube also provides a secure data analysis environment to address ethical, societal, and legal issues.

16:25-16:30
Advancing FAIR meta-analyses of nucleotide sequence data with q2-fondue
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Deepak Unni

  • Anja Adamov, ETH Zurich, Switzerland
  • Michal Ziemski, ETH Zurich, Switzerland
  • Lina Kim, ETH Zurich, Switzerland
  • Lena Flörl, ETH Zurich, Switzerland
  • Nicholas Bokulich, ETH Zurich, Switzerland


Presentation Overview: Show

The growing availability of public nucleotide sequencing data enables meta-analysis studies that expand our knowledge of the microbiome, revealing consistent insights into the diversity of microbial communities and their interactions with hosts and environments. For the resulting scientific findings to be reliable, analytical workflows must be reproducible and transparent. However, raw data inputs are typically generated through manual data retrieval steps, hindering reproducibility and causing research bottlenecks. To address these challenges, we developed q2-fondue (Functions for reproducibly Obtaining and Normalizing Data re-Used from Elsewhere), an open-source Python package that streamlines data acquisition, management, and meta-analysis of nucleotide sequence data and metadata.

q2-fondue adheres to the FAIR principles, promoting data findability, accessibility, interoperability, and reusability. It simplifies the acquisition of sequence (meta)data from the NCBI Sequence Read Archive while providing full provenance tracking from download to final visualization. Through its integration in the widely used QIIME 2 ecosystem, q2-fondue enables researchers to utilize other plugins for comprehensive data analysis. We demonstrate the package’s effectiveness through an example meta-analysis of marker gene sequencing studies.

To guarantee consistent functionality, q2-fondue receives ongoing maintenance. Overall, q2-fondue promises to accelerate novel discoveries by improving scalability, accessibility, and reproducibility in a diverse array of meta-analysis studies.

16:30-16:35
Analysing multi-omics data through FAIR Data Points: a X-omics/TWOC demonstrator
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Deepak Unni

  • Junda Huang, Radboudumc, Netherlands
  • Tom Ederveen, Radboudumc, Netherlands
  • Peter-Bram 't Hoen, Raboud University Medical Center, Netherlands
  • Casper de Visser, Radboudumc, Netherlands
  • Anna Niehues, Radboud University Medical Center, Netherlands
  • Francis Agamah, University of Cape Town, South Africa
  • Kees Burger, DTL, Netherlands
  • Purva Kulkarni, Radboud University Medical Center, Netherlands
  • Erik Schultes, GO FAIR International Support and Coordination Office, Netherlands
  • Emile R Chimusa, Department of Applied Sciences, Faculty of Health and Life Sciences, Northumbria University, United Kingdom
  • Xiaofeng Liao, RadboudUMC, Netherlands
  • Alain J. van Gool, Radboudumc, Netherlands
  • Eugene van Someren, TNO, Netherlands


Presentation Overview: Show

Our FAIR research group has filled a FAIR Data Point (FDP) with public COVID-19 multi-omics research datasets. To have COVID-19-relevant biomedical data that we can query across various FDPs, we used our FAIR Data Cube infrastructure, a set of tools and services that help researchers in different stages of the Research Data Life Cycle, including creating and describing new data and finding, understanding, and reusing existing FAIR multi-omics data. In addition, we have defined a specific (meta)data structure based on the Investigation-Study-Assay (ISA) framework for study and omics information, and the Phenopackets framework for phenotypic information. An important novelty of this infrastructure is that we aim to enable the direct querying of biomedical multi-omics data at the feature level. We demonstrate this with a biologically relevant COVID-19 use case based on Tocilizumab-relevant molecular features.

The experimental metadata was captured in the ISA framework. Transcriptomics data were converted to pseudobulk data by taking the sum of all gene expression values for each individual gene. Metabolomics data were mapped to ChEBI ID features. Proteomics data were mapped to UniProt IDs. Phenotypic data were captured using Phenopackets. We converted our data into RDF format for storage in the FDP. We have also defined human-readable questions which were translated into SPARQL queries.
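Once the data are expressed as RDF triples, feature-level questions become graph pattern matches. The sketch below uses a naive triple-pattern matcher in place of a real SPARQL engine, and the identifiers are illustrative rather than the project's actual vocabulary (though UniProt P08887 is indeed IL6R, the Tocilizumab target).

```python
# Sketch: feature-level querying over RDF-style triples. A real setup would
# store these in a FAIR Data Point and query with SPARQL; a naive pattern
# matcher stands in here. Sample identifiers are invented.

triples = [
    ("sample:001", "has_phenotype", "COVID-19"),
    ("sample:001", "measured", "uniprot:P08887"),   # IL6R
    ("sample:002", "has_phenotype", "healthy"),
    ("sample:002", "measured", "uniprot:P08887"),
]

def match(pattern):
    """Match one (subject, predicate, object) pattern; None is a wildcard."""
    return [t for t in triples
            if all(v is None or v == t[i] for i, v in enumerate(pattern))]

# "Which COVID-19 samples have a measurement for IL6R?"
covid = {s for s, _, _ in match((None, "has_phenotype", "COVID-19"))}
il6r = {s for s, _, _ in match((None, "measured", "uniprot:P08887"))}
hits = sorted(covid & il6r)
```

The equivalent SPARQL query would join the same two patterns on a shared sample variable; the point is that phenotypic and molecular features live in one queryable graph.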

16:40-17:35
Panel: Panel on Open and Ethical Data Sharing
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Monica Munoz-Torres

  • Sara El-Gebali
  • Joseph Yracheta
  • Bastian Greshake-Tzovaras
  • Verena Ras


Presentation Overview: Show

As a conference that promotes open source and open science, we are big proponents of open data. We love to hear about tools and frameworks (such as FAIR) that promote data sharing and reuse. But sharing data openly also has its challenges: for example, how to securely share data that has personally identifiable health information, or how to ethically obtain and share data from marginalized communities. In this panel discussion, we will delve into some of these intriguing technical and ethical issues, with audience participation encouraged!

17:35-17:40
BOSC Closing Remarks
Room: Salle Rhone 3b
Format: Live from venue

Moderator(s): Nomi Harris

  • Nomi Harris