Attention Presenters - please review the Presenter Information Page available here
Schedule subject to change
All times listed are in EDT
Monday, July 15th
10:40-11:00
Opening Remarks
Room: 524ab
Format: In person

Moderator(s): Nomi Harris


Authors List:

  • Nomi Harris
Open Bioinformatics Foundation Update
Room: 524ab
Format: In person

Moderator(s): Nomi Harris


Authors List:

  • Heather Wiencko
CoFest Intro
Room: 524ab
Format: In person

Moderator(s): Nomi Harris


Authors List:

  • Hervé Ménager
Platinum and Gold Sponsor Videos
Room: 524ab
Format: In person

Moderator(s): Nomi Harris


Authors List:

11:00-11:20
Gemma: Curation, re-analysis and dissemination of 18,000 gene expression studies
Confirmed Presenter: Paul Pavlidis, University of British Columbia, Canada

Room: 524ab
Format: In Person

Moderator(s): Hervé Ménager


Authors List:

  • Guillaume Poirier-Morency, University of British Columbia, Canada
  • Sanja Rogic, University of British Columbia, Canada
  • Ogan Mancarci, University of British Columbia, Canada
  • Neera Patadia, University of British Columbia, Canada
  • Salva Sherif, University of British Columbia, Canada
  • Delphine Zhou, University of British Columbia, Canada
  • Alice Ma, University of British Columbia, Canada
  • Paul Pavlidis, University of British Columbia, Canada

Presentation Overview:

Gemma (https://gemma.msl.ubc.ca) is an open source and open data project focused on increasing the utility of existing gene expression studies. While data repositories like the Gene Expression Omnibus (GEO) and other secondary sources derived from it have important use cases, Gemma goes further than most in curating and re-analyzing data sets. Here we highlight recent software features, annotations and data with the hope of engaging the genomics and bioinformatics community in using and improving Gemma as a resource for computational analyses and biological discovery. Gemma currently contains over 18,000 data sets (>600,000 samples), manually curated using formal ontologies, re-processed and batch-corrected, quality-controlled and subjected to differential expression analysis using multivariate linear models, covering approximately 40% of all human, mouse and rat transcriptome samples from GEO. Our holdings are especially strong in studies of the nervous system (~4,900 studies), perturbations of transcription regulators (~2,200), and small molecule (drug-like) exposures (>1,000). All data and analysis results are available through a web API or via supporting R/Bioconductor and Python packages. We also introduce a new data browser that advances users’ ability to quickly identify and access data sets of interest. We will discuss examples of insights and applications from Gemma such as “generic” patterns of differential expression and observations on data quality, and ongoing work to incorporate single-cell data. In summary, Gemma adds extensive value to existing gene expression studies for the benefit of the research community.
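
As the abstract notes, all data and analysis results are exposed through a web API and supporting R/Bioconductor and Python packages. A minimal sketch of querying the REST API from Python with requests is shown below; the endpoint path, query parameters, and response field names are assumptions based on the API described above and should be checked against the Gemma API documentation.

    import requests

    BASE = "https://gemma.msl.ubc.ca/rest/v2"  # public REST API root (path assumed)

    # Search for datasets matching a term; parameter and field names are assumptions.
    resp = requests.get(f"{BASE}/datasets",
                        params={"query": "hippocampus", "limit": 5},
                        timeout=30)
    resp.raise_for_status()

    for dataset in resp.json().get("data", []):
        # Adjust field names to the actual response schema if they differ.
        print(dataset.get("shortName"), "-", dataset.get("name"))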

11:20-11:40
EASTR: Identifying and eliminating systematic alignment errors in multi-exon genes
Confirmed Presenter: Ida Shinder, Johns Hopkins School of Medicine, United States

Room: 524ab
Format: In Person

Moderator(s): Hervé Ménager


Authors List:

  • Ida Shinder, Johns Hopkins School of Medicine, United States
  • Richard Hu, Johns Hopkins University, United States
  • Hyun Joo Ji, Johns Hopkins University, United States
  • Kuan-Hao Chao, Johns Hopkins University, United States
  • Mihaela Pertea, Johns Hopkins University, United States

Presentation Overview:

Accurate RNA-seq alignment to reference genomes is fundamental to transcript assembly, annotation, and gene expression studies, integral to advancements in both biomedical research and basic sciences. Our research shows that splice-aware aligners, tools pivotal for interpreting RNA-seq data, introduce systematic alignment errors. Specifically, widely used tools like STAR and HISAT2 often misalign sequences from repeat regions, resulting in spurious spliced alignments. These misalignments have propagated into reference annotations, leading to the inclusion of numerous 'phantom' introns—artifacts erroneously identified as genuine intronic sequences—in gene databases relied upon by the scientific community.

EASTR (Emending Alignments of Spliced Transcript Reads), our newly developed tool, significantly improves RNA-seq data analysis reliability. EASTR effectively detects and removes erroneous spliced alignments by examining the sequence similarity between the flanking upstream and downstream regions of an intron and the frequency of their sequence occurrence in the reference genome. Our analysis of various RNA-seq datasets aligned with HISAT2 or STAR reveals that, in some datasets, up to 20% of splice-aligned reads are spurious. By removing these spurious spliced alignments, EASTR enhances the reliability of downstream analyses that depend on accurate spliced alignment, such as transcript assembly. Applying EASTR to preprocess alignments before transcript assembly with StringTie2 significantly improves transcript assembly precision and sensitivity across various species and experimental setups. Moreover, applying EASTR to reexamine gene catalogs has revealed a multitude of intronic sequences that were previously incorrectly annotated, improving the precision of annotations and equipping the research community with more accurate gene catalogs.
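
The core signal described above is that the sequences flanking a spurious intron tend to resemble each other, a repeat-induced artifact. The sketch below is a conceptual illustration of that idea only, not EASTR's actual algorithm or command-line interface; the toy sequence, coordinates, and thresholds are made up.

    from difflib import SequenceMatcher

    def flanks_look_repetitive(genome_seq, intron_start, intron_end, k=20, min_similarity=0.9):
        """Conceptual check: compare the k bases upstream of the donor site with
        the k bases upstream of the acceptor site; high similarity suggests the
        spliced alignment may be a repeat-induced artifact (0-based coordinates)."""
        upstream_donor = genome_seq[max(0, intron_start - k):intron_start]
        upstream_acceptor = genome_seq[max(0, intron_end - k):intron_end]
        similarity = SequenceMatcher(None, upstream_donor, upstream_acceptor).ratio()
        return similarity >= min_similarity

    # Toy genome with two copies of a repeat flanking a putative "intron".
    repeat = "ACGTTGCA" * 3
    toy_genome = "A" * 50 + repeat + "T" * 100 + repeat + "G" * 50
    donor = 50 + len(repeat)
    acceptor = 50 + len(repeat) + 100 + len(repeat)
    print(flanks_look_repetitive(toy_genome, donor, acceptor))  # True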

ROC Picker: propagating statistical and systematic uncertainties in biological analyses
Confirmed Presenter: Jeffrey Roskes, Johns Hopkins University, United States

Room: 524ab
Format: In Person

Moderator(s): Hervé Ménager


Authors List:

  • Jeffrey Roskes, Johns Hopkins University, United States

Presentation Overview:

We present the first version of ROC Picker, a software package for propagating statistical and systematic uncertainties in a biomedical analysis. The confidence bands are displayed on the ROC curve and can be used to assess the confidence of the discriminant’s power to separate two types of samples. In contrast to previous approaches, which only include the statistical error on the number of samples in constructing the ROC curve, ROC Picker uses a likelihood-based approach that can handle sample-wise uncertainties, including systematic effects (for example, batch-based errors) that are correlated across a subset of the samples. ROC Picker is modeled on the Combine tool by the CMS Collaboration in experimental particle physics, and is developed as part of the AstroPath project, which applies data processing and analysis methodologies from astronomy to pathology. We plan to further develop ROC Picker to handle various other metrics (such as Kaplan-Meier curves) and other types of uncertainties applicable to biomedical analyses.
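
For orientation, the sketch below shows a purely statistical confidence band on an ROC curve obtained by bootstrap resampling of toy data. This is not ROC Picker's likelihood-based method, which additionally models sample-wise and correlated systematic uncertainties; it only illustrates what a band on an ROC curve represents.

    import numpy as np
    from sklearn.metrics import roc_curve

    rng = np.random.default_rng(0)

    # Toy discriminant scores for two classes (hypothetical data).
    labels = np.concatenate([np.zeros(100), np.ones(100)])
    scores = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(1.0, 1.0, 100)])

    # Bootstrap the samples to get a statistical-only band on the TPR.
    fpr_grid = np.linspace(0, 1, 101)
    tpr_boot = []
    for _ in range(200):
        idx = rng.integers(0, len(labels), len(labels))
        fpr, tpr, _ = roc_curve(labels[idx], scores[idx])
        tpr_boot.append(np.interp(fpr_grid, fpr, tpr))
    lo, hi = np.percentile(np.array(tpr_boot), [16, 84], axis=0)  # ~68% band
    print(lo[50], hi[50])  # band on TPR at FPR = 0.5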

Djerba: Sharing and Updating a Modular System for Clinical Report Generation
Confirmed Presenter: Iain Bancarz, Ontario Institute for Cancer Research, Canada

Room: 524ab
Format: In Person

Moderator(s): Hervé Ménager


Authors List:

  • Iain Bancarz, Ontario Institute for Cancer Research, Canada
  • Felix Beaudry, Ontario Institute for Cancer Research, Canada
  • Aqsa Alam, Ontario Institute for Cancer Research, Canada
  • Oumaima Hamza, Ontario Institute for Cancer Research, Canada
  • Alexander Fortuna, Ontario Institute for Cancer Research, Canada
  • Trevor Pugh, Ontario Institute for Cancer Research, Canada

Presentation Overview:

Djerba is an open-source software package designed to streamline the translation of bioinformatic pipeline output from individual tumor samples into clinical reports. Operating within a CAP/CLIA/ACD-accredited laboratory, Djerba supports use cases including clinical genome and transcriptome sequencing for therapeutic assignment or trial eligibility, cell-free DNA sequencing using targeted panels for early cancer detection, and plasma whole-genome sequencing to detect minimal residual disease.

Djerba integrates output from multiple bioinformatics workflows, external resource queries, and customized user inputs; generates plots, tables, hyperlinks and summary statistics; collates its results into a machine-readable JSON document; and finally renders its output as a PDF report for use by clinicians. It includes mini-Djerba, a pre-built, self-contained tool to simplify manual updates to reports. Djerba has a flexible schema for report syntax with automated validation, and supports uploading JSON report documents to a CouchDB database. This enables reports to be automatically searched, compared, and summarized, and provides a valuable resource for a better understanding of clinical datasets.

The modular structure of Djerba enables it to be easily shared, updated and customized while retaining interoperability between sites. Reproducible research is enabled by versioning components and automatically recording input parameters. Partner sites can run a set of analysis and reporting functions appropriate to their needs, while improvements are efficiently tested and integrated into the reporting framework. Examples include deployment to multiple institutions across Canada as part of the MOHCCN-O initiative, and rapidly updating an accredited clinical report with improved copy number variation (CNV) analysis.
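
To illustrate the kind of machine-readable, schema-validated JSON report document described above, here is a small hedged sketch using the jsonschema package. The field names and schema are placeholders for illustration only, not Djerba's actual report schema.

    import json
    from jsonschema import validate

    # Hypothetical, simplified schema for a machine-readable report document.
    report_schema = {
        "type": "object",
        "required": ["core", "plugins"],
        "properties": {
            "core": {"type": "object", "required": ["report_id", "author"]},
            "plugins": {"type": "object"},
        },
    }

    report = {
        "core": {"report_id": "EXAMPLE-0001", "author": "CGI Lab"},
        "plugins": {"cnv": {"results": {"total_variants": 12}}},
    }

    validate(instance=report, schema=report_schema)  # raises ValidationError on failure
    print(json.dumps(report, indent=2))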

Q&A For Flash Talks
Room: 524ab
Format: In person

Moderator(s): Hervé Ménager


Authors List:

11:40-12:00
Antimicrobial resistance prediction of nontuberculous mycobacteria from whole genome sequence data
Confirmed Presenter: Idowu Olawoye, Department of Microbiology & Immunology, University of Western Ontario, London, Ontario, Canada

Room: 524ab
Format: In Person

Moderator(s): Hervé Ménager


Authors List:

  • Idowu Olawoye, Department of Microbiology & Immunology, University of Western Ontario, London, Ontario, Canada
  • Ashlyn Kim, Department of Microbiology & Immunology, University of Western Ontario, London, Ontario, Canada
  • Maxim Federov, Department of Microbiology & Immunology, University of Western Ontario, London, Ontario, Canada
  • Nicholas Waglechner, Sinai Health, Toronto, Ontario, Canada
  • Jennifer Guthrie, Department of Microbiology & Immunology, University of Western Ontario, London, Ontario, Canada

Presentation Overview:

Nontuberculous mycobacteria (NTM) are opportunistic pathogens, predominantly causing pulmonary infections. NTMs are naturally resistant to many antibiotics, with treatment failure common due to acquired antimicrobial resistance (AMR). Whole-genome sequencing (WGS) offers an avenue for high-throughput detection of mutations and genes associated with drug resistance. Furthermore, there is a need to develop bioinformatics tools and curated databases to enable detection of NTM resistance.
In this research, we have expanded the databases of ResFinder to accommodate positional mutations and resistance genes of clinically relevant NTMs (M. abscessus and M. avium) from peer-reviewed literature that were confirmed with phenotypic susceptibility tests. The pipeline uses sequence data (FASTQ) or genome assemblies (FASTA) to predict AMR results. Accuracy was examined by the detection of known mutations using both FASTQ and FASTA inputs.
The ResFinder database additions were successfully validated with raw data and genome assemblies of M. abscessus (n=57) and M. avium (n=58). Comparing the detection of AMR genes and known mutations in sequence files versus genome assemblies, we observed a specificity of 100% for both data formats, while we saw a sensitivity of 98.2% (56/57) for FASTQ versus 100% (57/57) for FASTA files in M. abscessus and a sensitivity of 100% (58/58) for FASTQ versus 87.9% (51/58) for FASTA files in M. avium.
Our expansion of the ResFinder database holds much promise for advancing in silico AMR prediction, particularly in pathogens such as NTMs. It also creates the opportunity for continuous development by including AMR markers for other clinically relevant NTMs.
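
The sensitivity figures quoted above follow directly from the reported counts; a small worked check:

    def sensitivity(true_positives, total_positives):
        return 100.0 * true_positives / total_positives

    # Counts reported in the abstract.
    print(f"M. abscessus FASTQ: {sensitivity(56, 57):.1f}%")  # 98.2%
    print(f"M. abscessus FASTA: {sensitivity(57, 57):.1f}%")  # 100.0%
    print(f"M. avium FASTQ:     {sensitivity(58, 58):.1f}%")  # 100.0%
    print(f"M. avium FASTA:     {sensitivity(51, 58):.1f}%")  # 87.9%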

12:00-12:20
Open2C: Advancing 3D and functional genomics research
Confirmed Presenter: Vedat Yilmaz, UMass Chan Medical School, United States

Room: 524ab
Format: In Person

Moderator(s): Hervé Ménager


Authors List:

  • Vedat Yilmaz, UMass Chan Medical School, United States
  • Nezar Abdennur, UMass Chan Medical School, United States
  • Open Chromosome Collective, Open2C, United States

Presentation Overview:

Open2C (https://open2c.github.io) is a collaborative, open-source software community focused on advancing research in genome architecture and chromosome biology through computational 3D and functional genomics. Our libraries primarily leverage the scientific Python software stack and are designed to be user-friendly, flexible, scalable, and interoperable with the wider ecosystem in order to meet the demands of contemporary genomic research. Our resources aim to promote a deeper understanding of genome structure and function, accommodating large next-generation sequencing datasets from Chromosome Conformation Capture (3C/Hi-C) technologies. Among the packages we maintain are four notable NumFOCUS-affiliated projects: Cooler (Abdennur et al, Bioinformatics, 2020), a Python library and sparse, HDF5-based storage format for genomic contact maps; Pairtools, a command-line suite for interpreting and extracting contact information from 3D genomic sequencing data; Cooltools (Open2C et al, PLOS Comp Bio, in press), a suite for analyzing genomic contact maps; and Bioframe (Open2C et al, Bioinformatics, 2024), a library for genomic interval operations on Pandas dataframes. In its commitment to open science, Open2C has been selected as a host organization for Google Summer of Code 2024, to help mentor the next generation of open-source developers and contributors. All of our packages are open-source and permissively licensed and the source code can be found at https://github.com/open2c.
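
As a small illustration of the Cooler package mentioned above, the sketch below opens a multi-resolution contact file and fetches a balanced matrix for a region. The file URI and region are placeholders, and the calls follow cooler's documented high-level API but should be verified against your installed version.

    import cooler

    # Placeholder URI: any multi-resolution .mcool contact matrix will do.
    clr = cooler.Cooler("example.mcool::resolutions/10000")
    print(clr.chromnames[:5])

    # Fetch a balanced contact matrix for a region as a dense numpy array.
    matrix = clr.matrix(balance=True).fetch("chr2:10,000,000-12,000,000")
    print(matrix.shape)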

A Framework for DNA Binding Motifs Prediction for Nontraditional Model Organism Transcription Factors
Confirmed Presenter: Stephanie Hao, Boston University, United States

Room: 524ab
Format: In Person

Moderator(s): Hervé Ménager


Authors List:

  • Stephanie Hao, Boston University, United States
  • Anthony Garza, Boston University, United States
  • Yeting Li, Boston University, United States
  • Nofal Ouardaoui, Boston University, United States
  • Cynthia Bradham, Boston University, United States

Presentation Overview:

To improve our predictions of phenotypic outcomes, it is essential to unravel the complex networks of genes and regulatory elements across different species. Consequently, research focusing on the genomics of nontraditional model organisms is advancing rapidly. A bottleneck in non-model organism research is the lack of well-established mappings between transcription factors and their DNA binding motifs. To address this, we implemented a modular prediction algorithm that deciphers DNA binding motifs from the DNA sequences of expressed transcription factors (TFs). We successfully applied this algorithm to 518 sequenced transcription factors of L. variegatus, a long-standing model organism for embryonic development. The modular nature of the algorithm affords species-agnostic applications across all organisms of interest, thereby unlocking new frontiers in genomic prediction and biological understanding for model and non-model organisms alike.

Bioinformatics tools for comparative genomics analysis of highly similar duplicate genes in eukaryotic genomes
Confirmed Presenter: Xi Zhang, Dalhousie University, Canada

Room: 524ab
Format: Live Stream

Moderator(s): Hervé Ménager


Authors List:

  • Xi Zhang, Dalhousie University, Canada
  • Yining Hu, Western University, Canada

Presentation Overview:

Gene duplication plays an important role in evolution, acting as a new source of genetic material in genome evolution. However, detecting duplicate genes from genomic data can be challenging. Various bioinformatics resources have been developed to identify duplicate genes from single and/or multiple species. Here, we developed a Basic Local Alignment Search Tool (BLAST)-based web tool (HSDFinder) and database (HSDatabase), allowing future researchers to easily identify intra-species gene duplications in their own genomes of interest by following a pipeline (HSDecipher). In addition, we reviewed recent advances in gene duplication detection tools and summarized the metrics used to measure sequence identity among gene duplicates within species, providing a quick reference guide to research tools used for detecting gene duplicates.
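
To make the BLAST-based idea concrete, here is a conceptual sketch (not HSDFinder's actual implementation) that parses an all-vs-all BLAST tabular report (-outfmt 6) and keeps non-self hits above identity and alignment-length cutoffs; the file path and thresholds are hypothetical.

    import csv

    def candidate_duplicates(blast_outfmt6_path, min_identity=90.0, min_aln_len=300):
        """Yield non-self query/subject pairs from an all-vs-all BLAST tabular
        (-outfmt 6) report that pass identity and alignment-length cutoffs."""
        with open(blast_outfmt6_path) as handle:
            for row in csv.reader(handle, delimiter="\t"):
                query, subject = row[0], row[1]
                identity, aln_len = float(row[2]), int(row[3])
                if query != subject and identity >= min_identity and aln_len >= min_aln_len:
                    yield query, subject, identity

    # Example usage (path is a placeholder):
    # for q, s, pid in candidate_duplicates("all_vs_all.blast.tsv"):
    #     print(q, s, pid)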

Q&A For Flash Talks
Room: 524ab
Format: In person

Moderator(s): Hervé Ménager


Authors List:

14:20-15:20
Invited Presentation: The Data Shows We Need Better Data
Confirmed Presenter: Mélanie Courtot, Ontario Institute for Cancer Research and University of Toronto, Canada

Room: 524ab
Format: In Person

Moderator(s): Hervé Ménager


Authors List:

  • Mélanie Courtot, Ontario Institute for Cancer Research and University of Toronto, Canada

Presentation Overview:

Big data, AI, LLMs… do they live up to the hype? In a bright and hopeful future, AI accelerates progress, revolutionizes healthcare, alerts us to health risks, and creates fresh career paths. Yet, in a bleaker outlook, it obliterates jobs, fosters rampant misinformation and increases inequity.

At the root of AI is the data it relies on. In this talk we will discuss how to steer the course by improving the data AI leverages. We will explore the vast ecosystem formed by data, projects and infrastructure. We will travel along different axes to think about the data we are generating and using every day. We will consider data governance – where does it come from, who owns it, and how can we access it? We will investigate open data – how can we leverage health care knowledge for research? Finally, we will share a few thoughts about data quality and data sharing to increase reproducibility and reuse.


Dr Mélanie Courtot is the Director of Genome Informatics at the Ontario Institute for Cancer Research in Toronto, and an Assistant Professor in the Department of Medical Biophysics at University of Toronto. Dr Courtot is passionate about translational informatics – building intelligent systems to gain new insights and impact human health. Her lab aims to build a globally shared knowledge ecosystem to advance science and improve health for all. Her team develops the Overture open source software suite, which supports many active large-scale cancer genomics projects including ICGC and ICGC-ARGO, VirusSeq, and the upcoming Pan-Canadian Genome Library. It also drives the African Pathogen Data Sharing and Archive Platform.


Dr Courtot obtained her PhD in Bioinformatics from the University of British Columbia in 2014, followed by a postdoctoral fellowship in Public Health. Dr Courtot co-leads the Clinical and Phenotypic workstream and Data Use and Cohort representation groups for the Global Alliance for Genomics and Health (GA4GH) as well as cohort harmonization efforts for the International HundredK+ Cohorts Consortium. She is an advisory board member for the Public Health Alliance for Genomic Epidemiology coalition, European Open Science Cloud for Cancer project and the eLwazi open data science platform.

15:20-15:40
Creating an open-source data platform.
Confirmed Presenter: Mitchell Shiell, Ontario Institute for Cancer Research (OICR), Canada

Room: 524ab
Format: In Person

Moderator(s): Monica Munoz-Torres


Authors List:

  • Mitchell Shiell, Ontario Institute for Cancer Research (OICR), Canada
  • Jon Eubank, Ontario Institute for Cancer Research (OICR), Canada
  • Justin Richardsson, Ontario Institute for Cancer Research (OICR), Canada
  • Brandon Chan, Ontario Institute for Cancer Research (OICR), Canada
  • Robin Haw, OICR, Canada
  • Lincoln Stein, Ontario Institute for Cancer Research, Canada
  • Melanie Courtot, Ontario Institute for Cancer Research (OICR), Canada
  • Overture Team, Ontario Institute for Cancer Research (OICR), Canada

Presentation Overview:

At BOSC 2023, we were excited to introduce Overture, a collection of open-source software used to overcome significant obstacles in storing, managing, and sharing genome-scale datasets. We spoke about our modular and scalable microservice architecture comprising Ego, Song, Score, Maestro, and Arranger, and how these services help power VirusSeq, ICGC-ARGO, and the original ICGC portal. We discussed Overture's core capabilities and how they are split between data consumers, who retrieve data from our platforms; providers, who submit data to our platforms; and administrators, who manage users, configure data models and customize the portal interface.

In this talk, we’d like to touch on our experience extracting our existing infrastructure and making it work in an open-source environment. We’d then like to show you what it takes to create a data portal, guiding you through a high-level overview of how our platforms get deployed. This will be followed by a more detailed discussion of using the platform, specifically supplying your data model to it, validating and submitting data to it, and configuring its front-end interface. To conclude, we will discuss future aims, highlighting our feature roadmap and opening up further discussion of these aims with the BOSC community. We invite you to check out our website and demo portal (https://demo.overture.bio/) and chat with us at BOSC 2024. If you wish to contact us remotely, our Slack channel, like our software, will always be open and available.

15:40-16:00
Going Viral: The Development of the VirusSeq Data Portal
Confirmed Presenter: Justin Richardsson, Ontario Institute for Cancer Research (OICR), Canada

Room: 524ab
Format: In Person

Moderator(s): Monica Munoz-Torres


Authors List:

  • Justin Richardsson, Ontario Institute for Cancer Research (OICR), Canada
  • Jon Eubank, Ontario Institute for Cancer Research (OICR), Canada
  • Leonardo Rivera, Ontario Institute for Cancer Research (OICR), Canada
  • Scott Cain, Ontario Institute for Cancer Research (OICR), Canada
  • Robin Haw, Ontario Institute for Cancer Research (OICR), Canada
  • Mitchell Shiell, Ontario Institute for Cancer Research, Canada
  • Lincoln Stein, Ontario Institute for Cancer Research, Canada
  • Melanie Courtot, Ontario Institute for Cancer Research, Canada
  • Virusseq Team, VirusSeq, Canada

Presentation Overview:

Tracking the evolution and spread of the COVID-19 virus prompted a rapid global initiative to sequence SARS-CoV-2 genomes. However, challenges arose concerning the organization and dissemination of this data, particularly in meeting FAIR standards (Findable, Accessible, Interoperable, and Reusable). In response, we were tasked with spearheading efforts to build and deploy an open-access data platform that would harmonise, validate, and automate submissions of all Canadian SARS-CoV-2 sequences to international databases. Having created data infrastructure for many major cancer genomics projects, this shift represented an opportunity for our group to extend our tools and expertise toward building a pathogen-based data platform.

We built the VirusSeq data portal using five open-source Overture software microservices (Ego, Song, Score, Maestro, and Arranger), along with five additional custom-made services, and the third-party OAuth service Keycloak. By using existing software designed for reuse, we were able to successfully build and deploy the platform in a record four weeks. Furthermore, the portal's modular architecture enabled the platform's servers to scale dramatically. From the platform's inception in 2020, it was expected to harmonise and validate the submission of 150,000 viral sequences; by 2024, it has reached over 500,000 sequences.

The VirusSeq project stands as visible evidence of the power of collaborative effort, modular design, and the reusability of software tools. The portal implementation has already been utilised in other projects to track pathogen genomics, and serves as a testament to our ability to combat outbreaks and pandemics through mass collaboration and rapid data dissemination.

Project Website: https://virusseq-dataportal.ca/

intermine.bio2rdf.org : A QLever SPARQL endpoint for InterMine databases
Confirmed Presenter: François Belleau, Arnaud Droit Computational Laboratory, Canada

Room: 524ab
Format: In Person

Moderator(s): Monica Munoz-Torres


Authors List:

  • François Belleau, Arnaud Droit Computational Laboratory, Canada
  • Gos Micklem, Department of Genetics, University of Cambridge, United Kingdom
  • Deepak Unni, SIB Swiss Institute of Bioinformatics, Switzerland
  • Arnaud Droit, Research Center of the CHU de Québec, Canada

Presentation Overview:

This project explores converting biological data from InterMine, an open-source data warehouse, into a knowledge graph accessible via Bio2RDF. This exposes global InterMine data as RDF, enabling SPARQL queries or exploration through the OpenSearch interactive dashboard.

The approach leverages novel semantic web technologies for Bio2RDF: JSON-LD (lightweight Linked Data format), LinkML (Linked Data modeling language), OpenSearch (search engine and dashboard software), and the scalable QLever SPARQL endpoint.

Using our pipeline, available at Github (https://github.com/intermineorg/intermine2sparql), we retrieve the data model from the InterMine API and transform it into JSON-LD documents stored in an OpenSearch index. Class definitions in LinkML are used to generate JSON-LD context for RDF conversion. Finally, JSON-LD documents are converted to N-Triples format and loaded into the QLever endpoint.

The complete data model from nine InterMine model organism instances is available through our services:
OpenSearch REST API: http://kibio.science
QLever SPARQL endpoint: http://intermine.bio2rdf.org
LinkML model definitions

This project demonstrates the feasibility of converting InterMine data into JSON-LD. This approach facilitates querying biological data with SPARQL across InterMine instances and integrates it with other LinkML-compatible data sources. OpenSearch APIs empower data scientists with programming tools like Pandas and R, and with BI software (Superset, Power BI, Tableau) to create insightful visualizations.
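
As a hedged example of querying the QLever endpoint listed above over the standard SPARQL HTTP protocol, the sketch below sends a deliberately schema-agnostic query (a triple count); the graph patterns needed for real questions depend on the classes and predicates generated from the LinkML model.

    import requests

    ENDPOINT = "http://intermine.bio2rdf.org"  # QLever SPARQL endpoint listed above

    # Schema-agnostic query: count triples in the endpoint. Adjust the graph
    # patterns once you know the classes/predicates generated from LinkML.
    query = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"

    resp = requests.get(
        ENDPOINT,
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=60,
    )
    resp.raise_for_status()
    for binding in resp.json()["results"]["bindings"]:
        print(binding["triples"]["value"])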

In conclusion, the project will unlock biological data from InterMine for AI. Once converted, the data becomes accessible to Large Language Models, empowering them to analyze vast datasets, identify relationships, and potentially lead to discoveries in biological research.

Organizing community curation to create an Open database on Thermodynamics of Enzyme-Catalyzed Reactions (openTECR)
Confirmed Presenter: Robert T. Giessmann, Institute for Globally Distributed Open Research and Education, igdore.org, Germany

Room: 524ab
Format: Live Stream

Moderator(s): Monica Munoz-Torres


Authors List:

  • Robert T. Giessmann, Institute for Globally Distributed Open Research and Education, igdore.org, Germany

Presentation Overview:

openTECR ("Open database on Thermodynamics of Enzyme-Catalyzed Reactions") is a database and a community.

We create a data collection of apparent equilibrium constants of enzyme-catalyzed reactions that is reliable, open and machine-actionable, with a clear change process to integrate new data and correct errors. We believe that Open Science principles, and specifically Open Data and Open Source, are key to achieving our vision.

The openTECR database serves computational and experimental scientists in the fields of metabolic engineering, genome-scale metabolic modelling, biocatalysis and related fields by providing curated information. It is used by eQuilibrator as the data basis for making predictions about any possible reaction.

Recently, we organized an open community curation effort (https://opentecr.github.io/invitation-to-curate). We prepared a curation workflow to analyze 278 pages densely packed with tables and textual information. We invited volunteer contributions, and are immensely grateful to the 17 volunteers who invested almost 100 working hours.

At BOSC 2024, I would like to present our initiative and share our lessons learned about organizing successful community curation. I believe that our example can serve as a blueprint for other databases or project ideas that require a large number of working hours.

We discovered that the key to receiving contributions is offering very small packages of work and a detailed curation manual. Our smallest task was 3 minutes long and well received.

Our small community (~40 members) shares a mailing list (https://w3id.org/opentecr) and a GitHub organization where we store our data and code under open licenses (https://github.com/opentecr/).

Q&A For Flash Talks
Room: 524ab
Format: In person

Moderator(s): Monica Munoz-Torres


Authors List:

16:40-17:00
Connecting Integrated Genome Browser to a huge genome database using its open API solves one problem and creates another
Confirmed Presenter: Ann Loraine, University of North Carolina Charlotte, United States

Room: 524ab
Format: In Person

Moderator(s): Karsten Hokamp


Authors List:

  • Ann Loraine, University of North Carolina Charlotte, United States
  • Jaya Sravani Sirigineedi, University of North Carolina Charlotte, United States
  • Nowlan Freese, University of North Carolina Charlotte, United States

Presentation Overview:

Integrated Genome Browser (IGB, pronounced “ig-bee”) is a fast, feature-rich, open-source desktop genome browser thousands of researchers have used to explore and analyze genomic data. To support this user audience, we maintain data delivery Web sites called “IGB Quickloads” that supply IGB with around 60 different reference genome assemblies. IGB can open user-provided genome assembly files, but if their desired assembly already exists in an IGB Quickload, they can avoid this inconvenient work. However, we are finding it increasingly difficult to update these Quickload sites as new assemblies are published. Fortunately, many genome database systems now offer robust computational access to their data. By accessing these computational resources, IGB could show new assemblies without our copying them to a Quickload. To test this idea, we developed a new IGB version that consumes and displays data from one such resource, a JSON-emitting API (application programming interface) from the UCSC Genome Browser system. Now available as an “early access” version at the BioViz.org Web site, this new IGB version can display more than 200 assemblies visible in the UCSC Browser, along with dozens of annotation and data tracks. This wealth of data has now given us a new problem to solve. The API provides information like track names and data formats, but little about what the data represent. Thus, we face a new form of an old problem in bioinformatics: how do we categorize and label data so that computer programs and people can understand and use it?
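
For readers unfamiliar with the kind of JSON API described here, the sketch below queries the UCSC Genome Browser REST API from Python. The endpoint paths and response keys follow the publicly documented API but should be treated as assumptions and verified against the current documentation.

    import requests

    UCSC_API = "https://api.genome.ucsc.edu"

    # List available genome assemblies (the same kind of call a browser client
    # such as IGB can use to discover assemblies without local copies).
    genomes = requests.get(f"{UCSC_API}/list/ucscGenomes", timeout=30).json()["ucscGenomes"]
    print(len(genomes), "assemblies available")

    # List track metadata for one assembly; note how little the API says about
    # what each track's data actually represents, which is the labeling problem
    # raised in the abstract.
    tracks = requests.get(f"{UCSC_API}/list/tracks", params={"genome": "hg38"}, timeout=30).json()
    print(list(tracks.get("hg38", {}).keys())[:5])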

17:00-17:20
Collaborating our way to optimal integration between Tripal 4 and JBrowse 2
Confirmed Presenter: Carolyn T. Caron, University of Saskatchewan, Canada

Room: 524ab
Format: In Person

Moderator(s): Karsten Hokamp


Authors List:

  • Carolyn T. Caron, University of Saskatchewan, Canada
  • Lacey-Anne Sanderson, University of Saskatchewan, Canada
  • Kirstin E. Bett, University of Saskatchewan, Canada

Presentation Overview:

To meet the diverse needs of their research communities, biological web portals not only strive to make data and their associated metadata accessible and reusable, but also provide tools to help analyze and make connections between the data. Tripal JBrowse extends the functionality of Tripal, an open-source toolkit for building biological web portals, to embed a very popular tool within the community, JBrowse 2. Previous versions of the Tripal JBrowse module were limited to embedding JBrowse into a Tripal site using an iFrame, due to theming collisions between JBrowse and the content management system that Tripal extends, Drupal. This prevented any exchange of data between a Tripal site and a JBrowse instance. In large part due to the JBrowse team’s receptiveness to our feedback regarding these pain points, JBrowse 2 has been developed in such a way as to make embedding the application into a Drupal site a reality. Furthermore, advancements made in Drupal 9 and 10 have removed barriers to communication with other web frameworks, allowing direct interaction with the React backend of JBrowse 2. In the course of upgrading our module to Tripal 4, we have developed a working prototype of native embedding for JBrowse 2 within a Tripal site. We are very excited about the possibilities this will bring for the Tripal 4 community. For example, one feature we aim to implement is to embed a miniature JBrowse in gene content pages provided by Tripal that shows the structure and context around that gene.

An integrated environment for browsing 3-D protein structures and multiple sequence alignments in JBrowse 2
Confirmed Presenter: Colin Diesh, University of California, Berkeley, United States

Room: 524ab
Format: In Person

Moderator(s): Karsten Hokamp


Authors List:

  • Colin Diesh, University of California, Berkeley, United States
  • Caroline Bridge, Ontario Institute for Cancer Research, Canada
  • Garrett Stevens, University of California, Berkeley, United States
  • Scott Cain, Ontario Institute for Cancer Research, Canada
  • Lincoln Stein, Ontario Institute for Cancer Research, Canada
  • Ian Holmes, University of California, Berkeley, United States

Presentation Overview:

Recent advances in protein structure prediction have invigorated research in protein structural biology. To enable the visualization of genomic datasets and protein structures in a unified environment, we created a suite of JBrowse 2 plugins to show 3-D protein structures and multiple sequence alignments (MSAs) alongside the genome browser.

To display 3-D protein structures, we incorporated Mol* into a JBrowse 2 plugin. Users can right-click a gene of interest in the genome browser, and the app can either (a) automatically locate AlphaFoldDB structures associated with the gene of interest or (b) allow the user to import their own PDB/MMCIF structure files produced by tools such as ColabFold. The protein sequence encoded by a user selected transcript isoform of the gene of interest is aligned with the sequence from the protein structure file, which allows mouse clicks and mouseovers to show matching positions on the 3-D structure and the genome.

To display MSAs, we incorporated react-msaview, our MSA visualization tool, into a JBrowse 2 plugin. Users can select a gene of interest in the genome browser and launch an in-app NCBI BLAST workflow to recruit sequences for an MSA, or open their own MSA data files. The MSA viewer can display a dendrogram alongside the MSA to show hierarchical or phylogenetic grouping of the sequences. Protein domains and features can be highlighted on the MSA by loading InterProScan results. Like the 3-D protein viewer, mouse clicks and mouseovers show matching positions in the genome browser.

iCn3D, a Platform to Integrate Structures with Functions and Genomics
Confirmed Presenter: Jiyao Wang, NIH/NLM/NCBI, United States

Room: 524ab
Format: In Person

Moderator(s): Karsten Hokamp


Authors List:

  • Jiyao Wang, NIH/NLM/NCBI, United States
  • Philippe Youkharibache, NIH/NCI, United States
  • Ravinder Abrol, California State University, Northridge, United States
  • Caesar Tawfeeq, California State University, Northridge, United States
  • Jack Lin, Digital World Biology, United States
  • Umesh Khaniya, NIH/NCI, United States
  • Thomas Madej, NIH/NLM/NCBI, United States
  • James Song, NIH/NLM/NCBI, United States
  • Dachuan Zhang, NIH/NLM/NCBI, United States
  • Christopher Lanczycki, NIH/NLM/NCBI, United States
  • Aron Marchler-Bauer, NIH/NLM/NCBI, United States

Presentation Overview:

With improvements in structure prediction tools such as AlphaFold, a key challenge is integrating structures with functions and genomics. We started iCn3D as a 3D structure viewer with annotations such as domains and SNPs {Wang, 2020}, then expanded iCn3D to interaction analysis and mutational analysis, both interactively and on the command line {Wang, 2022}. Recently we added several new features in iCn3D to integrate structures with functions and genomics. First, iCn3D shows the isoforms and their corresponding exons for the gene related to the protein sequence. Second, iCn3D shows a few new annotations: Post Translational Modification (PTM), transmembrane domain detection, and Immunoglobulin (Ig) domain detection. Third, iCn3D allows users to align proteins based on structure, sequence, or residue mapping. Fourth, iCn3D can be used not only in Jupyter Notebook and for 3D printing, but has also been extended to Virtual Reality (VR) and Augmented Reality (AR). Furthermore, all iCn3D views can be reproduced via a sharable URL or iCn3D PNG image. The source code of iCn3D is at https://github.com/ncbi/icn3d.
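
Since the abstract notes that any iCn3D view can be reproduced via a sharable URL, here is a minimal sketch that builds such a URL in Python; the parameter names are illustrative assumptions and should be checked against the iCn3D documentation.

    from urllib.parse import urlencode

    ICN3D_FULL = "https://www.ncbi.nlm.nih.gov/Structure/icn3d/full.html"

    # Illustrative parameters: load a structure by PDB ID and run a command.
    # Parameter names and commands are assumptions; check the iCn3D docs.
    params = {"pdbid": "1tup", "command": "color spectrum"}
    print(f"{ICN3D_FULL}?{urlencode(params)}")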

Q&A For Flash Talks
Room: 524ab
Format: In person

Moderator(s): Karsten Hokamp


Authors List:

17:20-17:40
Codefair: Make Biomedical Research Software FAIR Without Breaking a Sweat
Confirmed Presenter: Bhavesh Patel, FAIR Data Innovations Hub, California Medical Innovations Institute, United States

Room: 524ab
Format: In Person

Moderator(s): Swapnil Savant


Authors List:

  • Dorian Portillo, FAIR Data Innovations Hub, California Medical Innovations Institute, United States
  • Sanjay Soundarajan, FAIR Data Innovations Hub, California Medical Innovations Institute, United States
  • Jacob Clark, FAIR Data Innovations Hub, California Medical Innovations Institute, United States
  • Bhavesh Patel, FAIR Data Innovations Hub, California Medical Innovations Institute, United States

Presentation Overview:

We present codefair, an innovative solution that helps researchers make their biomedical research software Findable, Accessible, Interoperable, and Reusable (FAIR), i.e., optimally reusable by humans and machines. The FAIR Biomedical Research Software (FAIR-BioRS) guidelines provide actionable instructions for making biomedical research software FAIR. While the guidelines are designed to be convenient to follow, we learned that implementing them can still be time-consuming for researchers. To address this challenge, we are developing codefair, a free and open source GitHub app that acts as a personal assistant for making research software FAIR. Researchers simply need to install codefair from the GitHub marketplace and proceed with their software development as usual. By leveraging GitHub’s tools such as Probot, codefair monitors activities on the software repository, communicates via GitHub issues, and submits pull requests to help researchers make their software FAIR. Currently, codefair helps with including essential metadata elements such as a license file, a CITATION.cff metadata file, and a codemeta.json metadata file. Additional features are being added to provide support for complying with best coding practices, archiving on Zenodo, registering on bio.tools, and much more to cover all the steps for making software FAIR. By alleviating researchers' burden in the process, we believe codefair will empower and encourage them to adopt FAIR and open practices for their research software. We present here our approach to developing codefair, highlight the current and planned features, and explain how the community can benefit from and contribute to codefair.
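
One of the metadata files mentioned above is CITATION.cff. A minimal hedged sketch of generating such a file with PyYAML is shown below; the values are placeholders, and codefair itself proposes these files through pull requests rather than requiring you to write them by hand.

    import yaml  # PyYAML

    # Minimal CITATION.cff content with placeholder values; cff-version,
    # message, title, and authors are required by the CFF specification.
    citation = {
        "cff-version": "1.2.0",
        "message": "If you use this software, please cite it as below.",
        "title": "Example Research Software",
        "authors": [{"family-names": "Doe", "given-names": "Jane"}],
        "license": "MIT",
    }

    with open("CITATION.cff", "w") as handle:
        yaml.safe_dump(citation, handle, sort_keys=False)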

An Open-source Ecosystem For Scalable And Computationally Efficient Nanopore Data Processing
Confirmed Presenter: Hasindu Gamaarachchi, University of New South Wales, Australia

Room: 524ab
Format: In Person

Moderator(s): Swapnil Savant


Authors List:

  • Hasindu Gamaarachchi, University of New South Wales, Australia
  • Hiruna Samarakoon, UNSW Sydney, Australia
  • James Ferguson, Garvan Institute of Medical Research, Australia
  • Sasha Jenner, University of Sydney, Australia
  • Bonson Wong, Garvan Institute of Medical Research, Australia
  • Timothy Amos, Garvan Institute of Medical Research, Australia
  • Jillian Hammond, Garvan Institute of Medical Research, Australia
  • Hassaan Saadat, UNSW Sydney, Australia
  • Martin Smith, UNSW Sydney, Australia
  • Sri Parameswaran, University of Sydney, Australia
  • Ira Deveson, Garvan Institute of Medical Research, Australia

Presentation Overview:

Emerging long-read sequencing - recently dubbed Nature Methods’ “Method of the Year” - has now become an important tool in understanding genomics. Nanopore sequencing is a major commercially available long-read technology that offers ultra-long reads with limited capital cost. However, computational aspects of nanopore sequence analysis (e.g., data access, storage, basecalling, methylation calling) are still a burden, impeding the scalability of population-scale experiments. In this talk, I will present a complete ecosystem that enables nanopore data analysis at scale in a computationally efficient way, built on top of our file format called S/BLOW5 (Nature Biotechnology, 2022). S/BLOW5 reduces computational time by an order of magnitude and additionally reduces storage footprint by ~20-80% compared to the existing FAST5 format. The S/BLOW5 ecosystem, which is fully open source, now includes: (i) the S/BLOW5 file format and accompanying specifications; (ii) the slow5lib (C/C++) and pyslow5 (Python) software libraries for reading and writing S/BLOW5 files; (iii) the slow5tools toolkit for creating, converting, handling and interacting with SLOW5/BLOW5 files (Genome Biology 2023); and (iv) a suite of open source bioinformatics software packages (including basecalling and methylation calling tools) with which SLOW5 is now integrated (Bioinformatics 2023, GigaScience 2024). The research community has already started building on top of S/BLOW5; slow5-rs, which allows S/BLOW5 access from the Rust programming language, is one example. S/BLOW5 will continue to prioritise performance, compatibility, usability and transparency. S/BLOW5 for the nanopore signal space is analogous to the seminal SAM/BAM formats in the base space that bioinformaticians are familiar with, making the adoption of S/BLOW5 seamless.
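
A hedged sketch of reading a BLOW5 file with the pyslow5 library mentioned above is shown below; the file path is a placeholder, and the method and field names follow the pyslow5 documentation but should be checked against your installed version.

    import pyslow5

    # Open a BLOW5/SLOW5 file for reading (path is a placeholder).
    s5 = pyslow5.Open("example.blow5", "r")

    # Iterate over reads and report raw signal lengths.
    for read in s5.seq_reads():
        print(read["read_id"], len(read["signal"]))

    s5.close()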

GenomeKit, a Python library for fast and easy access to genomic resources
Confirmed Presenter: Avishai Weissberg, Deep Genomics, Canada

Room: 524ab
Format: In Person

Moderator(s): Swapnil Savant


Authors List:

  • Avishai Weissberg, Deep Genomics, Canada

Presentation Overview:

GenomeKit is Deep Genomics’ high performance Python library for fast and easy access to genomic resources such as sequence, data tracks, annotations, and variants.
GenomeKit has been in use internally by ML & data scientists and bioinformaticians at Deep Genomics for several years, and we have decided to make it available to the rest of the community. GenomeKit serves as the computational foundation for the data generation and evaluation of the recently published BigRNA foundation model, and most other workflows at Deep Genomics.
At its core, GenomeKit allows users to perform computational operations on the genome, like searching, applying variants, and comparing, extracting and expanding intervals. Classes like Genome, Interval, and Variant form the base for most of its APIs.

For example, GenomeKit allows users to easily get the principal transcript for a particular gene on a specific annotation and patch version of an assembly, accessing interval objects for each of its coding regions, UTRs, exons, introns, etc. These interval objects can further be expanded, intersected, have variants applied to them, etc.

In addition, GenomeKit includes a variety of APIs to open and process the contents of standard data file types (GFF3, FASTA, etc.). GenomeKit's data formats are highly optimized and compressed for reduced I/O and efficient memory utilization.
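
A hypothetical usage sketch based only on the class names mentioned in this abstract (Genome, Interval, Variant) follows; the constructor arguments, annotation identifier, and method names shown here are assumptions and may not match GenomeKit's actual signatures.

    # Hypothetical sketch only: class names come from the abstract, but the
    # constructor arguments and method names below are assumptions.
    from genome_kit import Genome, Interval

    genome = Genome("gencode.v41")  # annotation/assembly identifier (assumed)
    interval = Interval("chr17", "+", 7668402, 7687550, genome)  # signature assumed

    print(genome.dna(interval)[:50])  # extract reference sequence (assumed API)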

This talk aims to cover:

  • the use cases for GenomeKit
  • an overview of the API and main capabilities
  • techniques used to achieve GenomeKit's level of performance
  • benchmarks comparing GenomeKit with similar libraries

Q&A For Flash Talks
Room: 524ab
Format: In person

Moderator(s): Swapnil Savant


Authors List:

17:40-18:00
Tataki: Enhancing the robustness of bioinformatics workflows with simple, tolerant file format detection
Confirmed Presenter: Masaki Fukui, Sator, Inc., Japan

Room: 524ab
Format: In Person

Moderator(s): Swapnil Savant


Authors List:

  • Masaki Fukui, Sator, Inc., Japan
  • Hirotaka Suetake, Sator, Inc., Japan
  • Tazro Ohta, Institute for Advanced Academic Research, Chiba University, Japan

Presentation Overview:

The increase in data volume in bioinformatics has heightened the demand for robust and reliable workflow analysis. Workflows enable the integration of various analytical tools to perform sequences of analyses in a portable and reproducible manner. However, inconsistencies in file inputs and outputs of workflow components can cause the tools to misidentify file formats of input files and terminate unexpectedly, which decreases the robustness of workflows.

One way to resolve this is to introduce a file format detection tool in between tools within workflows. However, current file format identification tools often fail to adequately handle the issues, as they might misidentify files with abnormalities. Additionally, because they are not always standalone components, integrating them seamlessly into workflows can be difficult.

To enhance workflow robustness, we developed Tataki, a simple command-line file format detection tool targeting major bioinformatics file formats. It is designed for ease of use within workflows, and also allows users to extend the identification criteria using the Common Workflow Language to tolerate file format irregularities, such as variant forms of a format or formats that are not predefined. We believe Tataki is a practical solution to these file format issues and boosts the productivity of bioinformatics researchers and developers.
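
As a purely conceptual illustration of tolerant format detection (this is not Tataki's implementation or interface), the sketch below peeks at a possibly gzip-compressed file and makes a rough guess from its first non-empty character.

    import gzip

    def sniff_format(path):
        """Very tolerant, conceptual format guess: handles gzip transparently and
        inspects the first non-empty character only (not Tataki's actual logic)."""
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                if line.startswith(">"):
                    return "fasta"
                if line.startswith("@"):
                    return "fastq-or-sam"
                return "unknown"
        return "empty"

    # print(sniff_format("reads.fastq.gz"))  # path is a placeholder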

Arvados Project Update
Confirmed Presenter: Peter Amstutz, Curii Corporation, United States

Room: 524ab
Format: In Person

Moderator(s): Swapnil Savant


Authors List:

  • Peter Amstutz, Curii Corporation, United States
  • Sarah Zaranek, Curii Corporation, United States
  • Alexander Sasha Wait Zaranek, Curii Corporation, United States
  • Tom Clegg, Curii Corporation, Canada
  • Lisa Knox, Curii Corporation, United States
  • Lucas Di Pentima, Curii Corporation, Argentina
  • Brett Smith, Curii Corporation, United States
  • Stephen Smith, Curii Corporation, United States

Presentation Overview:

Arvados is a comprehensive, mature, open source platform for managing and processing large scale biomedical data on HPC and cloud. By combining robust data and workflow management capabilities in a single platform, Arvados helps researchers organize and analyze petabytes of data and run reproducible and versioned computational bioinformatics workflows. Since the last time Arvados was presented at BOSC (2022), Arvados has had 3 major releases and 8 minor releases. This short talk will highlight major improvements including a more modern, performant interface, expanded workflow capabilities, and improved data storage and management. Arvados is used in production by some of the largest life sciences companies in the world as well as in academia, and welcomes community participation (https://arvados.org/community/).

BiocPy: Facilitate Bioconductor Workflows in Python
Confirmed Presenter: Jayaram Kancherla, Genentech, United States

Room: 524ab
Format: Live Stream

Moderator(s): Swapnil Savant


Authors List:

  • Jayaram Kancherla, Genentech, United States
  • Aaron Lun, Genentech, United States

Presentation Overview:

Bioconductor is an open-source software community that provides a rich repository of tools for the analysis and comprehension of genomic data. One of the main advantages of Bioconductor is the development of standardized data representations and a large number of statistical analysis methods tailored for genomic experiments. These tools allow researchers to seamlessly store, manipulate, and analyze data across various genomic experimental modalities in R. Analysts today use a variety of languages in their workflows, including R/Bioconductor for statistical analysis and Python for imaging or machine learning tasks.

Inspired by Bioconductor, BiocPy aims to enable and facilitate these bioinformatics workflows in Python. To achieve this goal, BiocPy provides data structures that are closely aligned with Bioconductor's implementations. These structures include BiocFrame, providing a Bioconductor-like data frame class, and GenomicRanges which aids in representing genomic regions and facilitating analysis. Together they serve as essential and foundational data structures, acting as the building blocks for extensive and complex representations. Notably, container classes such as SummarizedExperiment, SingleCellExperiment, and MultiAssayExperiment cater to the diverse needs of handling single or multi-omic experimental data and metadata.

By adapting mature Bioconductor data structures to Python, BiocPy offers a seamless transition and ease of use across programming languages, fostering reproducible and efficient genomic data analyses. To our knowledge, BiocPy is the first Python framework to provide well-integrated Bioconductor data structures and representations specifically designed for genomic data analysis, paving the way for enhanced cross-language interoperability in bioinformatics workflows. The BiocPy ecosystem is open-source and available at https://github.com/BiocPy.
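
A small hedged sketch of the foundational BiocFrame class mentioned above is shown below; the constructor and properties follow the BiocPy documentation but may differ between versions, and the column contents are placeholders.

    from biocframe import BiocFrame

    # A Bioconductor-style data frame in Python; columns can be any sequence-like.
    bf = BiocFrame({
        "symbol": ["TP53", "BRCA1", "EGFR"],
        "chrom": ["chr17", "chr17", "chr7"],
    })

    print(bf.shape)  # (rows, columns)
    print(bf)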

Q&A For Flash Talks
Room: 524ab
Format: In person

Moderator(s): Swapnil Savant


Authors List:

Tuesday, July 16th
8:40-9:00
Enhancing Reproducibility in Immunogenetics: Leveraging Containerization Technology for Bioinformatics Workflows
Confirmed Presenter: Rayo Suseno, UCSF, United States

Room: 524ab
Format: Live Stream

Moderator(s): Jason Williams


Authors List:

  • Rayo Suseno, UCSF, United States
  • Kristen Wade, UCSF, United States
  • Jill Hollenbach, UCSF, United States

Presentation Overview:

Bioinformatics is experiencing a crisis of reproducibility, which inhibits research progress and undermines scientific findings. This is driven by a variety of factors, including incomplete documentation, poor version control, lack of accessible code, and incompatible software dependencies. Leveraging containerization technology is a promising solution to address these issues and streamline the deployment of specialized bioinformatics workflows. The field of immunogenetics is especially in need of such workflows, as high levels of genomic complexity characteristic of immune loci require the development of unique tools. For example, in 2021, we published a pipeline, Pushing Immunogenetics to the Next Generation (PING), designed to genotype the killer immunoglobulin-like receptor (KIR) genes from short read data. Due to its various dependencies, however, some investigators found PING challenging to run and install. This prompted us to containerize both PING and a recently developed software from our lab, MHConstructor, a de novo short read sequence assembler for the human major histocompatibility complex (MHC) region. A particular challenge faced by MHConstructor is reliance on multiple Python versions, due to dependencies on different bioinformatics tools. This requires usage of multiple conda environments within one container, which we successfully implemented while ensuring seamless switching between environments. Singularity was chosen due to its user-friendly nature, encapsulating the entire workflow in a single file that can be effortlessly executed regardless of the operating system. The containerization of PING and MHConstructor ensures the reproducibility of these two immunogenetic pipelines, providing reliable high throughput analysis of large datasets not previously accessible with extant tools.

9:00-9:20
Breaking the silo: composable bioinformatics through cross-disciplinary open standards
Confirmed Presenter: Nezar Abdennur, UMass Chan Medical School, United States

Room: 524ab
Format: In Person

Moderator(s): Jason Williams


Authors List:

  • Nezar Abdennur, UMass Chan Medical School, United States
  • Trevor Manz, Harvard Medical School, United States
  • Jack Huey, UMass Chan Medical School, United States
  • Garrett Ng, UMass Chan Medical School, United States
  • Vedat Yilmaz, UMass Chan Medical School, United States
  • Nils Gehlenborg, Harvard Medical School, United States
  • Open Chromosome Collective, Open2C, United States

Presentation Overview:

The practice of data science in genomics and computational biology is fraught with friction. This is largely due to a tight coupling of bioinformatic tools to file input/output. While omic data is specialized and the storage formats for high-throughput sequencing and related data are often standardized, the adoption of emerging open standards not tied to bioinformatics can help better integrate bioinformatic workflows into the wider data science, visualization, and AI/ML ecosystems. Here, we present three libraries as short vignettes for composable bioinformatics. First, we present Oxbow, a Rust-based adapter library that unifies access to common genomic data formats by efficiently transforming queries into Apache Arrow, a standard in-memory columnar representation for tabular data analytics. Second, we present Bioframe, a Python library that performs genomic range operations using standard Pandas dataframes. Last, we present Anywidget, an architecture based on modern web standards for sharing interactive visualizations across all Jupyter-compatible runtimes, including JupyterLab, Google Colab, and VSCode. Together, we demonstrate the composition of these libraries to build a custom connected genomic analysis and visualization environment. We propose that components such as these, which leverage scientific domain-agnostic standards to unbundle specialized file manipulation, analytics, and web interactivity, can serve as reusable building blocks for composing flexible genomic data analysis and machine learning workflows as well as systems for exploratory data analysis and visualization.
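
To make the Bioframe vignette concrete, the sketch below performs genomic range operations directly on plain pandas DataFrames; the intervals are toy data, and the calls follow bioframe's documented API but should be checked against your installed version.

    import pandas as pd
    import bioframe

    # Plain pandas DataFrames with the standard chrom/start/end columns.
    genes = pd.DataFrame({
        "chrom": ["chr1", "chr1", "chr2"],
        "start": [100, 1000, 500],
        "end":   [400, 2000, 900],
        "name":  ["geneA", "geneB", "geneC"],
    })
    peaks = pd.DataFrame({
        "chrom": ["chr1", "chr2"],
        "start": [350, 5000],
        "end":   [1200, 5500],
    })

    # Genomic range operations stay entirely within the pandas ecosystem.
    print(bioframe.overlap(genes, peaks, how="inner"))
    print(bioframe.cluster(genes, min_dist=0))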

9:20-9:40
For long-term sustainable software in bioinformatics: a manifesto
Confirmed Presenter: Luis Pedro Coelho, Queensland University of Technology, Australia

Room: 524ab
Format: In Person

Moderator(s): Jason Williams


Authors List:

  • Luis Pedro Coelho, Queensland University of Technology, Australia

Presentation Overview:

I will discuss the challenges of maintaining research software in bioinformatics, especially considering the transient nature of funding and the turnover of researchers involved in coding projects. I will also discuss the approaches that my group takes to ensure that we maintain our software over the long-term, even as trainees leave the group. Maintenance involves ensuring the software performs as described. This includes updating it to handle new dependencies and fix bugs as well as providing a modicum of support to users.

I categorize research software into three levels: Level 0 (one-off scripts for specific analyses, often containing minor errors), Level 1 (Extended Methods Code, supporting specific results in publications), and Level 2 (Tools intended for broad use, requiring robustness and extensive documentation). Both Level 1 and Level 2 are made public, but they serve different purposes and upgrading from 1 to 2 involves significant effort. In the case of Tools, we aim for ease of use, reproducibility, and good error reporting.

We follow several practices that facilitate maintenance and support: reproducible research techniques, "dogfooding" (using one's own tools), clear and public support channels, providing error messages that guide users to solutions, and distributing software via Bioconda to minimize installation issues. Additionally, we attempt to gather beta users before publication and improve software based on feedback.

Furthermore, I propose that journals should require a Maintenance and Support statement from authors, similar to the Data Availability statement, to ensure transparency and accountability regarding the long-term support of research software.

BioCompute: A Descriptive Standard for Computable Metadata
Confirmed Presenter: Jonathon Keeney, The George Washington university, United States

Room: 524ab
Format: In Person

Moderator(s): Jason Williams


Authors List: Show

  • Jonathon Keeney, The George Washington University, United States
  • Hadley King, The George Washington University, United States
  • Tianyi Wang, The George Washington University, United States
  • Chinweoke Okonkwo, The George Washington University, United States
  • Raja Mazumder, The George Washington University, United States

Presentation Overview: Show

Scientific review of work in the life sciences has been hindered by a lack of standards for communicating computational pipelines. Often, little or no information about the computational component is described in detail, rendering the work unreviewable, unfindable, and not reproducible. This challenge has been felt particularly acutely in reviews for academic publishing and in regulatory reviews at agencies such as the US FDA. BioCompute is a descriptive standard (officially "IEEE 2791-2020") that is flexible enough to accommodate any pipeline yet robust enough to provide a computable structure for metadata and annotation of the pipeline. The standard is supported by several major bioinformatics platforms and two workflow languages, and it is the only framework standard of its kind accepted by the FDA for regulatory reviews. This presentation will describe the need for and architecture of the standard, the community, and the tools that have been developed to work with the standard. URL: https://biocomputeobject.org/
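To make "computable structure for metadata" concrete, here is a schematic Python sketch of a domain-organized pipeline record in the spirit of IEEE 2791; the field names and values are illustrative assumptions, not a complete or validated BioCompute Object.

    # Schematic, illustrative record only; not a validated BioCompute Object.
    bco_like = {
        "provenance_domain": {
            "name": "Example variant-calling pipeline",   # invented
            "version": "1.0.0",
            "contributors": [
                {"name": "A. Researcher", "contribution": ["authoredBy"]},
            ],
        },
        "description_domain": {
            "pipeline_steps": [
                {"step_number": 1, "name": "align reads", "version": "0.7.17"},
                {"step_number": 2, "name": "call variants", "version": "4.2.0"},
            ],
        },
        "execution_domain": {
            "script": ["run_pipeline.sh"],                 # invented path
            "environment_variables": {"THREADS": "8"},
        },
        "io_domain": {
            "input_subdomain": [{"uri": "s3://bucket/reads.fastq.gz"}],
            "output_subdomain": [{"uri": "s3://bucket/variants.vcf.gz"}],
        },
    }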

Breaking Down Research Silos and Fostering Radical Collaboration through Collective Intelligence
Confirmed Presenter: Alberto Pepe, Sage Bionetworks, United States

Room: 524ab
Format: In Person

Moderator(s): Jason Williams


Authors List: Show

  • Robert Allaway, Sage Bionetworks, United States
  • Megan Doerr, Sage Bionetworks, United States
  • Jineta Bannerjee, Sage Bionetworks, United States
  • Milen Nikolov, Sage Bionetworks, United States
  • Amy Heiser, Sage Bionetworks, United States
  • Mialy DeFelice, Sage Bionetworks, United States
  • Adam Hindman, Sage Bionetworks, United States
  • Jake Albrecht, Sage Bionetworks, United States
  • Lakaija Johnson, Sage Bionetworks, United States
  • James Eddy, Sage Bionetworks, United States
  • Miranda McManus, College of Charleston, United States
  • J. Harry Caufield, Lawrence Berkeley National Laboratory, United States
  • Christopher J. Mungall, Lawrence Berkeley National Laboratory, United States
  • Monica Munoz-Torres, University of Colorado School of Medicine, United States
  • Alberto Pepe, Sage Bionetworks, United States

Presentation Overview: Show

The data landscape is rapidly expanding. Scientists must navigate increasing amounts of multimodal data to produce high-quality, impactful research. As knowledge expands, specialization increases. This comes at a cost: the emergence of deep data silos within separate disciplines. We believe that progress lies in the open exchange of ideas among all stakeholders, harnessing our collective human and artificial intelligence (AI). By bringing these communities together and fostering unanticipated connections, we can drive a new age of biomedical innovation. In this talk, we present two ongoing projects at Sage Bionetworks that underscore the need for open approaches to AI/ML to pave the way for the next generation of biological and medical innovations.

(Extended abstract attached)

Q&A For Flash Talks
Room: 524ab
Format: In person

Moderator(s): Jason Williams


Authors List: Show

9:40-10:00
Tripal: a community-driven framework supporting open science, sustainable data web portals
Confirmed Presenter: Lacey-Anne Sanderson, University of Saskatchewan, Canada

Room: 524ab
Format: In Person

Moderator(s): Jason Williams


Authors List: Show

  • Lacey-Anne Sanderson, University of Saskatchewan, Canada
  • Stephen P. Ficklin, Department of Horticulture, Washington State University, Pullman WA, United States
  • Doug Senalik, USDA-ARS, University of Wisconsin-Madison, Madison WI, United States
  • Risharde Ramnath, Department of Horticulture, Washington State University, Pullman WA, United States
  • Sean Buehler, Department of Horticulture, Washington State University, Pullman WA, United States
  • Josh Burns, Department of Horticulture, Washington State University, Pullman WA, United States
  • Valentin Guignon, Bioversity International, Montpellier, France
  • Dorrie Main, Department of Horticulture, Washington State University, Pullman WA, United States
  • Kirstin Bett, Department of Plant Sciences, University of Saskatchewan, Saskatoon SK, Canada

Presentation Overview: Show

As the open science movement gains momentum, UNESCO is highlighting the need for infrastructure to (1) support the building of global, inclusive research communities and (2) provide open access to research-associated data and tools. Open-source bioinformatics software like Tripal is uniquely poised to fill such a need, as open source and open science embody the same core principles. Tripal (https://tripal.info) extends several open-source packages into a cohesive platform meant to make the development of open-science data web portals accessible. Specifically, Drupal provides user management, page templating, content curation, and site administration, while GMOD Chado provides community-developed standards for storage of biological datasets. Tripal 4 currently offers ontology-driven data pages, extensive administrative and curation interfaces, and standards-focused data importers. Sites are fully customizable through various web forms and extensive developer APIs provided by Drupal and Tripal. While some key integrations are still outstanding, we are now at the point of expanding the default configuration to guide community builders in creating inclusive and open data portals. Our first step on this path is to provide a set of well-documented default fields designed to promote high standards of data attribution and completeness of metadata. These fields are based on input and experience from our existing international community of Tripal data portals. Additionally, we hope to engage the wider open-source and open-science communities in collaboration. Please reach out to us in person at the BOSC CoFest, on GitHub, on Slack, or in our weekly CoFests on GatherTown (see https://tripal.info/community).
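As a small illustration of the Chado storage layer mentioned above (this is not Tripal code), the sketch below queries ontology-typed sequence features from a Chado PostgreSQL schema; the connection details are invented.

    import psycopg2

    # Invented connection details; a real Tripal site would configure these.
    conn = psycopg2.connect(dbname="chado", user="tripal", host="localhost")
    with conn, conn.cursor() as cur:
        # Chado types every row in its feature table with a controlled
        # vocabulary term (cvterm), which is what makes the storage
        # ontology-driven.
        cur.execute(
            """
            SELECT f.uniquename, t.name AS feature_type
            FROM feature f
            JOIN cvterm t ON f.type_id = t.cvterm_id
            WHERE t.name = %s
            LIMIT 10
            """,
            ("gene",),
        )
        for uniquename, feature_type in cur.fetchall():
            print(uniquename, feature_type)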

10:40-11:40
Invited Presentation: Open Data, Knowledge Graphs, and Large Language Models
Confirmed Presenter: Andrew Su

Room: 524ab
Format: In Person

Moderator(s): Nomi Harris


Authors List: Show

  • Andrew Su

Presentation Overview: Show

Bioinformatics is the science of collecting, storing, analyzing, and disseminating biological data and information. As in most domains of data science, bioinformaticians have long focused on structured data – information that is represented using ontologies and controlled vocabularies in well-defined data formats and often stored in databases with predefined schemas. This focus on structured data over the last 30 years has been the most efficient way to convert information into testable hypotheses and new scientific insights.

Recent developments in artificial intelligence, particularly the advent of large language models (LLMs), have started to challenge this traditional focus on structured data. By utilizing massive training sets of unstructured text, LLMs have shown exceptional capabilities not only in tasks like question answering and text generation but also in summarization, translation, and code generation. In this presentation, we will examine how LLMs are changing and will continue to change the practice of bioinformatics, particularly at the interface between structured and unstructured data.

Andrew Su, Ph.D., is the Elden and Verna Strahm Professor at the Scripps Research Institute in the Department of Integrative Structural and Computational Biology (ISCB). Dr. Su earned his PhD in chemistry at Scripps Research in 2002, and was the Associate Director of Bioinformatics at The Genomics Institute of the Novartis Research Foundation (GNF) before returning to Scripps Research as a faculty member in 2011.

The Su lab focuses on building and applying bioinformatics infrastructure for biomedical discovery. Dr. Su has had a long-standing interest in leveraging crowdsourcing to organize and integrate knowledge through projects like the Gene Wiki and Wikidata. In partnership with Chunlei Wu's lab, he has also worked extensively on creating biomedical APIs and enabling API interoperability through the BioThings project. Most recently, his lab has focused on constructing and mining knowledge graphs for drug repurposing. In all this work, the Su lab has embraced the principles of open science, open data, and open source software.

11:40-12:00
Gene Set Summarization Using Large Language Models
Confirmed Presenter: Marcin Joachimiak, Lawrence Berkeley National Laboratory, United States

Room: 524ab
Format: In Person

Moderator(s): Jessica Maia


Authors List: Show

  • Marcin Joachimiak, Lawrence Berkeley National Laboratory, United States
  • J. Harry Caufield, Lawrence Berkeley National Laboratory, United States
  • Nomi Harris, Lawrence Berkeley National Laboratory, United States
  • Hyeongsik Kim, Lawrence Berkeley National Laboratory, United States
  • Chris Mungall, Lawrence Berkeley National Laboratory, United States

Presentation Overview: Show

Molecular biologists often use statistical enrichment analysis to interpret gene lists derived from high-throughput experiments and computational analyses. This traditional method assesses the over- or under-representation of biological function terms associated with genes, based on curated assertions from databases such as Gene Ontology (GO). Alternatively, interpreting gene lists can be conceptualized as a textual summarization task, where Large Language Models (LLMs) utilize scientific texts to reduce reliance on traditional knowledge bases. This approach offers advantages because traditional knowledge bases struggle to scale their curation and integration efforts, while being unable to encompass all available knowledge.
Our tool, TALISMAN (Terminological ArtificiaL Intelligence SuMmarization of Annotation and Narratives), employs generative AI to perform gene set summarization, complementing standard enrichment analysis. This innovative approach leverages various sources of gene functional information including: 1) structured text from curated ontological databases, 2) narrative summaries without ontology, and 3) direct model retrieval.
LLMs can generate biologically plausible GO term summaries for gene sets, but they struggle to provide reliable significance scores or rankings and do not match the precision of standard methods. Notably, newer LLMs significantly outperform older versions.
While these methods are not yet suitable replacements for standard term enrichment analysis, they do offer advantages for summarizing implicit knowledge across large and unstandardized datasets, particularly where the volume of information exceeds human processing capabilities. This, together with the generative capacities of LLMs, such as suggesting novel summarization terms, makes them a valuable tool for improving understanding in complex biological data analyses.
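The following is a conceptual sketch only, not TALISMAN's implementation: it shows how a gene set plus ontology-free narrative summaries can be framed as a single summarization prompt. The gene list, the narratives, and the ask_llm placeholder are invented for illustration.

    # Conceptual sketch only; ask_llm() is a placeholder, not part of TALISMAN.
    def build_prompt(genes, narratives):
        """Assemble an ontology-free summarization prompt from gene narratives."""
        lines = [f"- {g}: {narratives.get(g, 'no summary available')}" for g in genes]
        return (
            "Summarize the shared biological functions of the following genes, "
            "proposing candidate GO-style terms:\n" + "\n".join(lines)
        )

    genes = ["TP53", "MDM2", "CDKN1A"]  # invented example gene set
    narratives = {
        "TP53": "tumor suppressor; regulates cell cycle arrest and apoptosis",
        "MDM2": "E3 ubiquitin ligase; negative regulator of p53",
        "CDKN1A": "cyclin-dependent kinase inhibitor induced by p53",
    }

    prompt = build_prompt(genes, narratives)
    # summary = ask_llm(prompt)  # placeholder for a call to an LLM
    print(prompt)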

12:00-12:20
FAIR, modular and reproducible image-based ML workflows for biologists: a template and case study from imageomics
Confirmed Presenter: Hilmar Lapp, Duke University, United States

Room: 524ab
Format: In Person

Moderator(s): Jessica Maia


Authors List: Show

  • Meghan Balk, National Ecological Observatory Network (NEON), United States
  • John Bradley, Duke University, United States
  • Hilmar Lapp, Duke University, United States

Presentation Overview: Show

Machine Learning (ML) has become a critical tool in the life sciences and is being applied to diverse biological data types, including the vast and rapidly growing trove of biological image data. Using image-based ML for biological research questions frequently requires combining different ML models and algorithms into complex computational workflows. We present a template for creating FAIR and reproducible workflows for imageomics, an emerging field that uses AI and ML to extract knowledge from biological images. Recognizing the inherently interdisciplinary nature of imageomics, the template distinguishes between a conceptual workflow, which supports interdisciplinary research convergence, and an executable, application-specific workflow implemented from it. We show how implementation technology choices can promote research software engineering best practices and enable end-to-end automation, while also accommodating ongoing ML and computer science research and empowering biologists to make modifications. The results include a conceptual workflow for detecting and quantifying traits from biological specimen images, and a concrete workflow for a dataset of fish museum specimen images. We extended core FAIR data and software practices to ML models, such as persistent identifiers, version control, semantic versioning, and rich metadata. We find that the objective of a FAIR ML workflow promotes making all workflow components FAIR. Ensuring full reproducibility is a separate step, and achieving workflow reproducibility requires end-to-end automation interoperable across high-performance computing environments, necessitating a formal workflow definition language with an associated workflow manager and execution engine.
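As a toy illustration (all identifiers invented, not the project's actual workflow), the sketch below separates the two conceptual steps, detection and trait quantification, and attaches a persistent identifier and semantic version to each ML component, mirroring the FAIR practices described above.

    from dataclasses import dataclass

    @dataclass
    class ModelRef:
        """Rich metadata for an ML component: persistent ID plus semantic version."""
        doi: str
        version: str

    def detect_specimen(image_path: str, model: ModelRef) -> dict:
        # Conceptual step 1: locate the specimen in the image (stubbed here).
        return {"image": image_path, "bbox": [10, 10, 200, 80], "model": model}

    def quantify_traits(detection: dict, model: ModelRef) -> dict:
        # Conceptual step 2: measure traits from the detected region (stubbed).
        return {"standard_length_px": 190, "detection": detection, "model": model}

    # Invented identifiers standing in for persistently published models.
    detector = ModelRef(doi="10.5281/zenodo.0000000", version="1.2.0")
    trait_model = ModelRef(doi="10.5281/zenodo.1111111", version="0.3.1")

    result = quantify_traits(detect_specimen("fish_specimen_001.jpg", detector), trait_model)
    print(result["standard_length_px"])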

14:20-14:40
Trust and Transparency in Reporting Machine Learning: The DOME-GigaScience Press Trial
Confirmed Presenter: Chris Armit, GigaScience Press, Hong Kong

Room: 524ab
Format: In Person

Moderator(s): Jessica Maia


Authors List: Show

  • Chris Armit, GigaScience Press, Hong Kong
  • Mary Ann Tuli, GigaScience Press, Hong Kong
  • Yannan Fan, GigaScience Press, China
  • Nafisa Qazi, GigaScience Press, Hong Kong
  • Nicole Nogoy, GigaScience Press, Hong Kong
  • Hans Zauner, GigaScience Press, Hong Kong
  • Hongling Zhou, GigaScience Press, China
  • Hongfang Zhang, GigaScience Press, China
  • Christopher Hunter, GigaScience Press, Hong Kong
  • Scott Edmunds, GigaScience Press, Hong Kong
  • Laurie Goodman, GigaScience Press, United States

Presentation Overview: Show

Machine learning is increasingly applied to biological and biomedical data, and reviewers and readers need sufficient detail to understand the machine learning approach used in a research study. This is made more challenging because machine learning studies are inherently difficult to interpret (the so-called “black box” effect). To shed light on these methods, GigaScience Press (https://www.gigasciencepress.org/) has partnered with the DOME Consortium to encourage authors to follow the DOME (Data, Optimisation, Model, Evaluation) recommendations.

The role of the GigaScience Database (GigaDB) data curation team is to ensure that the data submission process runs as smoothly as possible. The DOME Consortium has created the DOME Wizard (https://dome.ds-wizard.org/), which enables researchers to submit their DOME annotations to a central repository (https://registry.dome-ml.org/) and share them with reviewers. The GigaDB team scans submitted manuscripts for machine learning content and performs checks to ensure that the DOME annotations supporting GigaScience and GigaByte manuscripts are sufficiently complete.

To increase the visibility of the supporting DOME annotation, a link to the annotation is included in the GigaDB dataset that accompanies a GigaScience or GigaByte manuscript. DOME annotations are a great asset to peer review, providing the high-level overview needed to properly understand a machine learning study. We recommend that other journals follow our example in encouraging authors to submit DOME annotations early in the publication process, prior to peer review.
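For orientation, a minimal sketch of the four DOME sections follows; the field names and values are invented placeholders, not a record from the DOME registry.

    # Illustrative placeholders only; not an actual DOME registry record.
    dome_annotation = {
        "data": {
            "source": "public expression compendium (example)",
            "splits": {"train": 0.7, "validation": 0.15, "test": 0.15},
        },
        "optimization": {
            "algorithm": "gradient boosting (example)",
            "hyperparameter_search": "5-fold cross-validation",
        },
        "model": {
            "availability": "https://example.org/model (placeholder)",
            "interpretability": "feature importances reported",
        },
        "evaluation": {
            "metrics": ["AUROC", "F1"],
            "comparison": "baseline logistic regression",
        },
    }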

14:40-15:40
Panel: Open Approaches to AI/ML in Bioinformatics
Room: 524ab
Format: In person

Moderator(s): Monica Munoz-Torres


Authors List: Show