The SciFinder tool lets you search Titles, Authors, and Abstracts of talks and panels. Enter your search term below and your results will be shown at the bottom of the page. You can also click on a track to see all the talks given in that track on that day.


Results

July 15, 2024
10:40-11:00
Opening Remarks
Track: BOSC

Room: 524ab
Moderator(s): Nomi Harris


Authors List:

  • Nomi Harris
July 15, 2024
10:40-11:00
Open Bioinformatics Foundation Update
Track: BOSC

Room: 524ab
Moderator(s): Nomi Harris


Authors List:

  • Heather Wiencko
July 15, 2024
10:40-11:00
CoFest Intro
Track: BOSC

Room: 524ab
Moderator(s): Nomi Harris


Authors List:

  • Hervé Ménager
July 15, 2024
10:40-11:00
Platinum and Gold Sponsor Videos
Track: BOSC

Room: 524ab
Moderator(s): Nomi Harris


July 15, 2024
11:00-11:20
Gemma: Curation, re-analysis and dissemination of 18,000 gene expression studies
Confirmed Presenter: Paul Pavlidis, University of British Columbia, Canada
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Hervé Ménager


Authors List:

  • Guillaume Poirier-Morency, University of British Columbia
  • Sanja Rogic, University of British Columbia
  • Ogan Mancarci, University of British Columbia
  • Neera Patadia, University of British Columbia
  • Salva Sherif, University of British Columbia
  • Delphine Zhou, University of British Columbia
  • Alice Ma, University of British Columbia
  • Paul Pavlidis, University of British Columbia

Presentation Overview:

Gemma (https://gemma.msl.ubc.ca) is an open source and open data project focused on increasing the utility of existing gene expression studies. While data repositories like the Gene Expression Omnibus (GEO) and other secondary sources derived from it have important use cases, Gemma goes further than most in curating and re-analyzing data sets. Here we highlight recent software features, annotations and data with the hope of engaging the genomics and bioinformatics community in using and improving Gemma as a resource for computational analyses and biological discovery. Gemma currently contains over 18,000 data sets (>600,000 samples), manually curated using formal ontologies, re-processed and batch-corrected, quality-controlled and subjected to differential expression analysis using multivariate linear models, covering approximately 40% of all human, mouse and rat transcriptome samples from GEO. Our holdings are especially strong in studies of the nervous system (~4,900 studies), perturbations of transcription regulators (~2,200), and small molecule (drug-like) exposures (>1,000). All data and analysis results are available through a web API, or via supporting R/Bioconductor and Python packages. We also introduce a new data browser that advances users’ ability to quickly identify and access data sets of interest. We will discuss examples of insights and applications from Gemma such as “generic” patterns of differential expression and observations on data quality, and ongoing work to incorporate single-cell data. In summary, Gemma adds extensive value to existing gene expression studies for the benefit of the research community.

July 15, 2024
11:20-11:40
EASTR: Identifying and eliminating systematic alignment errors in multi-exon genes
Confirmed Presenter: Ida Shinder, Johns Hopkins School of Medicine, United States
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Hervé Ménager


Authors List:

  • Ida Shinder, Johns Hopkins School of Medicine
  • Richard Hu, Johns Hopkins University
  • Hyun Joo Ji, Johns Hopkins University
  • Kuan-Hao Chao, Johns Hopkins University
  • Mihaela Pertea, Johns Hopkins University

Presentation Overview:

Accurate RNA-seq alignment to reference genomes is fundamental to transcript assembly, annotation, and gene expression studies, and integral to advancements in both biomedical research and basic sciences. Our research shows that splice-aware aligners, tools pivotal for interpreting RNA-seq data, introduce systematic alignment errors. Specifically, widely used tools like STAR and HISAT2 often misalign sequences from repeat regions, resulting in spurious spliced alignments. These misalignments have propagated into reference annotations, leading to the inclusion of numerous 'phantom' introns (artifacts erroneously identified as genuine intronic sequences) in gene databases relied upon by the scientific community.

EASTR (Emending Alignments of Spliced Transcript Reads), our newly developed tool, significantly improves RNA-seq data analysis reliability. EASTR effectively detects and removes erroneous spliced alignments by examining the sequence similarity between the flanking upstream and downstream regions of an intron and the frequency of their sequence occurrence in the reference genome. Our analysis of various RNA-seq datasets aligned with HISAT/STAR reveals that, in some datasets, up to 20% of splice-aligned reads are spurious. By removing these spurious spliced alignments, EASTR enhances the reliability of downstream analyses that depend on accurate spliced alignment, such as transcript assembly. Applying EASTR to preprocess alignments before transcript assembly with StringTie2 significantly improves transcript assembly precision and sensitivity across various species and experimental setups. Moreover, applying EASTR to reexamine gene catalogs has revealed a multitude of intronic sequences that were previously incorrectly annotated, improving the precision of annotations and equipping the research community with more accurate gene catalogs.
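
The flanking-similarity idea can be illustrated with a toy sketch. This is not EASTR's implementation: it assumes an ungapped position-by-position comparison, ignores the genome-wide frequency check described above, and uses hypothetical function names, window size and threshold chosen only for illustration.

```python
def window(seq, center, half_width):
    """Return the bases within half_width of a position, clipped at the sequence start."""
    start = max(0, center - half_width)
    return seq[start:center + half_width]


def identity(a, b):
    """Fraction of matching positions over the shorter of two ungapped windows."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def looks_spurious(ref, intron_start, intron_end, half_width=5, threshold=0.9):
    """Flag an intron whose two splice-site flanks are nearly identical.

    half_width and threshold are illustrative values, not EASTR defaults.
    """
    upstream = window(ref, intron_start, half_width)
    downstream = window(ref, intron_end, half_width)
    return identity(upstream, downstream) >= threshold


# Hypothetical references: the first repeats the same 10 bases on both sides
# of the candidate intron, the second has unrelated flanks.
repeat_ref = "ACGTACGTAC" + "TTTTGGGGCC" + "ACGTACGTAC"
unique_ref = "ACGTACGTAC" + "TTTTGGGGCC" + "GGATCCTTAA"

repeat_flag = looks_spurious(repeat_ref, 5, 25)
unique_flag = looks_spurious(unique_ref, 5, 25)
print(repeat_flag, unique_flag)
```

A real tool would need gapped alignment of the flanks and a check of how often those sequences occur elsewhere in the genome, as the abstract describes.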

July 15, 2024
11:20-11:40
ROC Picker: propagating statistical and systematic uncertainties in biological analyses
Confirmed Presenter: Jeffrey Roskes, Johns Hopkins University, United States
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Hervé Ménager


Authors List:

  • Jeffrey Roskes, Johns Hopkins University

Presentation Overview:

We present the first version of ROC Picker, a software package for propagating statistical and systematic uncertainties in a biomedical analysis. The resulting confidence bands are displayed on the ROC curve and can be used to assess the confidence of the discriminant’s power to separate two types of samples. In contrast to previous approaches, which only include the statistical error on the number of samples in constructing the ROC curve, ROC Picker uses a likelihood-based approach that can handle sample-wise uncertainties, including systematic effects (for example, batch-based errors) that are correlated between a subset of the samples. ROC Picker is modeled on the Combine tool by the CMS Collaboration in experimental particle physics, and is developed as part of the AstroPath project, which applies data processing and analysis methodologies from astronomy to pathology. We plan to further develop ROC Picker to handle various other metrics (such as Kaplan-Meier curves) and other types of uncertainties applicable to biomedical analyses.
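
For orientation, the sketch below shows the kind of statistics-only baseline the abstract contrasts itself with: an ROC curve built by sweeping a threshold, plus a bootstrap interval on its area. It is not ROC Picker's likelihood-based method and does not model correlated systematic effects. The function names, example scores and resampling settings are assumptions for illustration, and the example assumes no score is shared between a positive and a negative sample.

```python
import random


def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering a threshold one sample at a time."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Sort descending so each step admits the next-highest-scoring sample.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def bootstrap_auc_interval(scores, labels, n_resamples=200, seed=0):
    """Approximate 95% interval on the AUC from resampling samples with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    aucs = []
    while len(aucs) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_labels = [labels[i] for i in idx]
        # Skip resamples that lost one of the two classes.
        if 0 < sum(resampled_labels) < n:
            resampled_scores = [scores[i] for i in idx]
            aucs.append(auc(roc_points(resampled_scores, resampled_labels)))
    aucs.sort()
    return aucs[int(0.025 * n_resamples)], aucs[int(0.975 * n_resamples) - 1]


# Hypothetical discriminant scores (1 = first sample type, 0 = second).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

point_estimate = auc(roc_points(scores, labels))
low, high = bootstrap_auc_interval(scores, labels)
print(point_estimate, low, high)
```

Because the bootstrap treats every sample as independent, a batch effect shared by several samples would not widen this interval, which is the gap a sample-wise likelihood approach is meant to address.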

July 15, 2024
11:20-11:40
Djerba: Sharing and Updating a Modular System for Clinical Report Generation
Confirmed Presenter: Iain Bancarz, Ontario Institute for Cancer Research, Canada
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Hervé Ménager


Authors List:

  • Iain Bancarz, Ontario Institute for Cancer Research
  • Felix Beaudry, Ontario Institute for Cancer Research
  • Aqsa Alam, Ontario Institute for Cancer Research
  • Oumaima Hamza, Ontario Institute for Cancer Research
  • Alexander Fortuna, Ontario Institute for Cancer Research
  • Trevor Pugh, Ontario Institute for Cancer Research

Presentation Overview:

Djerba is an open-source software package designed to streamline the translation of bioinformatic pipeline output from individual tumor samples into clinical reports. It operates within a CAP/CLIA/ACD-accredited laboratory, where use cases include clinical genome and transcriptome sequencing for therapeutic assignment or trial eligibility, cell-free DNA sequencing using targeted panels for early cancer detection, and plasma whole genome sequencing to detect minimal residual disease.

Djerba integrates output from multiple bioinformatics workflows, external resource queries, and customized user inputs; generates plots, tables, hyperlinks and summary statistics; collates its results into a machine-readable JSON document; and finally renders its output as a PDF report for use by clinicians. It includes mini-Djerba, a pre-built, self-contained tool to simplify manual updates to reports. Djerba has a flexible schema for report syntax with automated validation, and supports uploading JSON report documents to a CouchDB database. This enables reports to be automatically searched, compared, and summarized, and provides a valuable resource for better understanding of clinical datasets.

The modular structure of Djerba enables it to be easily shared, updated and customized while retaining interoperability between sites. Reproducible research is enabled by versioning components and automatically recording input parameters. Partner sites can run a set of analysis and reporting functions appropriate to their needs, while improvements are efficiently tested and integrated into the reporting framework. Examples include deployment to multiple institutions across Canada as part of the MOHCCN-O initiative, and rapidly updating an accredited clinical report with improved copy number variation (CNV) analysis.

July 15, 2024
11:20-11:40
Q&A For Flash Talks
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Hervé Ménager


July 15, 2024
11:40-12:00
Antimicrobial resistance prediction of nontuberculous mycobacteria from whole genome sequence data
Confirmed Presenter: Idowu Olawoye, Department of Microbiology & Immunology, University of Western Ontario
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Hervé Ménager


Authors List:

  • Idowu Olawoye, Department of Microbiology & Immunology
  • Ashlyn Kim, Department of Microbiology & Immunology
  • Maxim Federov, Department of Microbiology & Immunology
  • Nicholas Waglechner, Sinai Health
  • Jennifer Guthrie, Department of Microbiology & Immunology

Presentation Overview:

Nontuberculous mycobacteria (NTM) are opportunistic pathogens, predominantly causing pulmonary infections. NTMs are naturally resistant to many antibiotics, with treatment failure common due to acquired antimicrobial resistance (AMR). Whole-genome sequencing (WGS) offers an avenue for high-throughput detection of mutations and genes associated with drug resistance. Furthermore, there is a need to develop bioinformatics tools and curated databases to enable detection of NTM resistance.
In this research, we have expanded the databases of ResFinder to accommodate positional mutations and resistance genes of clinically relevant NTMs (M. abscessus and M. avium) from peer-reviewed literature, that were confirmed with phenotypic susceptibility tests. The pipeline uses sequence data (FASTQ) or genome assemblies (FASTA) to predict AMR results. Accuracy was examined by the detection of known mutations using both FASTQ and FASTA inputs.
The ResFinder database additions were successfully validated with raw data and genome assemblies of M. abscessus (n=57) and M. avium (n=58). Comparing the detection of AMR genes and known mutations in sequence files versus genome assemblies, we observed a specificity of 100% for both data formats, while we saw a sensitivity of 98.2% (56/57) for FASTQ versus 100% (57/57) for FASTA files in M. abscessus and sensitivity of 100% (58/58) for FASTQ versus 87.9% (51/58) for FASTA files in M. avium.
Our expansion of the ResFinder database holds much promise for advancing in silico prediction of AMR, particularly in pathogens such as NTMs. It also creates the opportunity for continuous development by including AMR markers for other clinically relevant NTMs.
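
The sensitivity percentages quoted in this abstract follow directly from the reported counts. A small sketch of the arithmetic, using a hypothetical helper name and only the counts given in the text:

```python
def sensitivity_pct(detected, total):
    """Percentage of expected resistance markers that were detected, to one decimal."""
    return round(100.0 * detected / total, 1)


# (detected, total) counts as reported in the abstract.
reported = {
    ("M. abscessus", "FASTQ"): (56, 57),
    ("M. abscessus", "FASTA"): (57, 57),
    ("M. avium", "FASTQ"): (58, 58),
    ("M. avium", "FASTA"): (51, 58),
}

results = {key: sensitivity_pct(*counts) for key, counts in reported.items()}
for (species, file_format), pct in results.items():
    print(f"{species} {file_format}: {pct}%")
```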

July 15, 2024
12:00-12:20
Open2C: Advancing 3D and functional genomics research
Confirmed Presenter: Vedat Yilmaz, UMass Chan Medical School, United States
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Hervé Ménager


Authors List:

  • Vedat Yilmaz, UMass Chan Medical School
  • Nezar Abdennur, UMass Chan Medical School
  • Open Chromosome Collective, Open2C

Presentation Overview:

Open2C (https://open2c.github.io) is a collaborative, open-source software community focused on advancing research in genome architecture and chromosome biology through computational 3D and functional genomics. Our libraries primarily leverage the scientific Python software stack and are designed to be user-friendly, flexible, scalable, and interoperable with the wider ecosystem in order to meet the demands of contemporary genomic research. Our resources aim to promote a deeper understanding of genome structure and function, accommodating large next-generation sequencing datasets from Chromosome Conformation Capture (3C/Hi-C) technologies. Among the packages we maintain are four notable NumFOCUS-affiliated projects: Cooler (Abdennur et al, Bioinformatics, 2020), a Python library and sparse, HDF5-based storage format for genomic contact maps; Pairtools, a command-line suite for interpreting and extracting contact information from 3D genomic sequencing data; Cooltools (Open2C et al, PLOS Comp Bio, in press), a suite for analyzing genomic contact maps; and Bioframe (Open2C et al, Bioinformatics, 2024), a library for genomic interval operations on Pandas dataframes. In its commitment to open science, Open2C has been selected as a host organization for Google Summer of Code 2024, to help mentor the next generation of open-source developers and contributors. All of our packages are open-source and permissively licensed and the source code can be found at https://github.com/open2c.

July 15, 2024
12:00-12:20
A Framework for DNA Binding Motifs Prediction for Nontraditional Model Organism Transcription Factors
Confirmed Presenter: Stephanie Hao, Boston University, United States
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Hervé Ménager


Authors List:

  • Stephanie Hao, Boston University
  • Anthony Garza, Boston University
  • Yeting Li, Boston University
  • Nofal Ouardaoui, Boston University
  • Cynthia Bradham, Boston University

Presentation Overview:

To improve our predictions of phenotypic outcomes, it is essential to unravel the complex networks of genes and regulatory elements across different species. Consequently, research focusing on the genomics of nontraditional model organisms is advancing rapidly. A bottleneck in non-model organism research is the lack of well-mapped transcription factors to DNA binding motifs. To this end, we implemented a modular prediction algorithm that deciphers DNA binding motifs from the DNA sequences of expressed transcription factors (TFs). We successfully applied this algorithm to 518 sequenced transcription factors of L. variegatus, a long-standing model organism for embryonic development. The modular nature of the algorithm affords species-agnostic applications across all organisms of interest, thereby unlocking new frontiers in genomic prediction and biological understanding for model and non-model organisms alike.

July 15, 2024
12:00-12:20
Bioinformatics tools for comparative genomics analysis of highly similar duplicate genes in eukaryotic genomes
Confirmed Presenter: Xi Zhang, Dalhousie University, Canada
Track: BOSC

Room: 524ab
Format: Live Stream
Moderator(s): Hervé Ménager


Authors List:

  • Xi Zhang, Dalhousie University
  • Yining Hu, Western University

Presentation Overview:

Gene duplication plays an important role in evolution, acting as a new source of genetic material in genome evolution. However, detecting duplicate genes from genomic data can be challenging. Various bioinformatics resources have been developed to identify duplicate genes from single and/or multiple species. Here, we developed a Basic Local Alignment Search Tool (BLAST)-based web tool (HSDFinder) and database (HSDatabase), allowing future researchers to easily identify intra-species gene duplications in their genomes of interest by following a pipeline (HSDecipher). In addition, we reviewed recent advances in gene duplication detection tools and summarized the metrics used to measure sequence identity among gene duplicates within species, providing a quick reference guide to research tools used for detecting gene duplicates.

July 15, 2024
12:00-12:20
Q&A For Flash Talks
Track: BOSC

Room: 524ab
Format: Live Stream
Moderator(s): Hervé Ménager


July 15, 2024
14:20-15:20
Invited Presentation: The Data Shows We Need Better Data
Confirmed Presenter: Mélanie Courtot, Ontario Institute for Cancer Research and University of Toronto, Canada
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Hervé Ménager


Authors List:

  • Mélanie Courtot, Ontario Institute for Cancer Research and University of Toronto

Presentation Overview:

Big data, AI, LLMs… do they live up to the hype? In a bright and hopeful future, AI accelerates progress, revolutionizes healthcare, alerts us to health risks, and creates fresh career paths. Yet, in a bleaker outlook, it obliterates jobs, fosters rampant misinformation and increases inequity.

At the root of AI is the data it relies on. In this talk we will discuss how to steer the course by improving the data AI leverages. We will explore the vast ecosystem formed by data, projects and infrastructure. We will travel along different axes to think about the data we are generating and using every day. We will consider data governance – where does it come from, who owns it, and how can we access it? We will investigate open data – how can we leverage health care knowledge for research? Finally, we will share a few thoughts about data quality and data sharing to increase reproducibility and reuse.


Dr Mélanie Courtot is the Director of Genome Informatics at the Ontario Institute for Cancer Research in Toronto, and an Assistant Professor in the Department of Medical Biophysics at University of Toronto. Dr Courtot is passionate about translational informatics – building intelligent systems to gain new insights and impact human health. Her lab aims to build a globally shared knowledge ecosystem to advance science and improve health for all. Her team develops the Overture open source software suite, which supports many active large-scale cancer genomics projects including ICGC and ICGC-ARGO, VirusSeq, and the upcoming Pan-Canadian Genome Library. It also drives the African Pathogen Data Sharing and Archive Platform.


Dr Courtot obtained her PhD in Bioinformatics from the University of British Columbia in 2014, followed by a postdoctoral fellowship in Public Health. Dr Courtot co-leads the Clinical and Phenotypic workstream and Data Use and Cohort representation groups for the Global Alliance for Genomics and Health (GA4GH) as well as cohort harmonization efforts for the International HundredK+ Cohorts Consortium. She is an advisory board member for the Public Health Alliance for Genomic Epidemiology coalition, European Open Science Cloud for Cancer project and the eLwazi open data science platform.

July 15, 2024
15:20-15:40
Creating an open-source data platform.
Confirmed Presenter: Mitchell Shiell, Ontario Institute for Cancer Research (OICR), Canada
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Monica Munoz-Torres


Authors List:

  • Mitchell Shiell, Ontario Institute for Cancer Research (OICR)
  • Jon Eubank, Ontario Institute for Cancer Research (OICR)
  • Justin Richardsson, Ontario Institute for Cancer Research (OICR)
  • Brandon Chan, Ontario Institute for Cancer Research (OICR)
  • Robin Haw, OICR
  • Lincoln Stein, Ontario Institute for Cancer Research
  • Melanie Courtot, Ontario Institute for Cancer Research (OICR)
  • Overture Team, Ontario Institute for Cancer Research (OICR)

Presentation Overview:

At BOSC 2023, we were excited to introduce Overture, a collection of open-source software used to overcome significant obstacles in storing, managing, and sharing genome-scale datasets. We spoke about our modular and scalable microservice architecture comprising Ego, Song, Score, Maestro, and Arranger and how these services help power VirusSeq, ICGC-ARGO, and the original ICGC portal. We discussed Overture's core capabilities and how they are split between data consumers, who retrieve data from our platforms; providers, who submit data to our platforms; and administrators, who manage users, configure data models and customize the portal interface.

In this talk, we’d like to touch on our experience extracting our existing infrastructure and making it work in an open-source environment. We’d then like to show you what it takes to create a data portal, guiding you through a high-level overview of how our platforms get deployed. This will be followed by a more detailed discussion of using the platform, specifically supplying your data model to it, validating and submitting data to it, and configuring its front-end interface. To conclude, we will discuss future aims, highlighting our feature roadmap and opening up to further discussion on the value of our aims with the BOSC community. We invite you to check out our website and demo portal (https://demo.overture.bio/) and chat with us at BOSC 2024. If you wish to contact us remotely, our Slack channel, like our software, will always be open and available.

July 15, 2024
15:40-16:00
Going Viral: The Development of the VirusSeq Data Portal
Confirmed Presenter: Justin Richardsson, Ontario Institute for Cancer Research (OICR), Canada
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Monica Munoz-Torres


Authors List:

  • Justin Richardsson, Ontario Institute for Cancer Research (OICR)
  • Jon Eubank, Ontario Institute for Cancer Research (OICR)
  • Leonardo Rivera, Ontario Institute for Cancer Research (OICR)
  • Scott Cain, Ontario Institute for Cancer Research (OICR)
  • Robin Haw, Ontario Institute for Cancer Research (OICR)
  • Mitchell Shiell, Ontario Institute for Cancer Research
  • Lincoln Stein, Ontario Institute for Cancer Research
  • Melanie Courtot, Ontario Institute for Cancer Research
  • Virusseq Team, VirusSeq

Presentation Overview:

Tracking the evolution and spread of the COVID-19 virus prompted a rapid global initiative to sequence SARS-CoV-2 genomes. However, challenges arose concerning the organization and dissemination of this data, particularly in meeting FAIR standards (Findable, Accessible, Interoperable, and Reusable). In response, we were tasked with spearheading efforts to build and deploy an open-access data platform that would harmonise, validate, and automate submissions of all Canadian SARS-CoV-2 sequences to international databases. Having created data infrastructure for many major cancer genomics projects, this shift represented an opportunity for our group to extend our tools and expertise toward building a pathogen-based data platform.

We built the VirusSeq data portal using five open-source Overture software microservices (Ego, Song, Score, Maestro, and Arranger), along with five additional custom-made services, and the third-party OAuth service Keycloak. By using existing software designed for reuse, we were able to successfully build and deploy the platform in a record four weeks. Furthermore, the portal's modular architecture enabled the platform's servers to scale dramatically. From the platform's inception in 2020, it was expected to harmonise and validate the submission of 150,000 viral sequences; by 2024, it has reached over 500,000 sequences.

The VirusSeq project stands as visible evidence of the power of collaborative effort, modular design, and the reusability of software tools. The portal implementation has already been utilised in other projects to track pathogen genomics, and serves as a testament to our ability to combat outbreaks and pandemics through mass collaboration and rapid data dissemination.

Project Website: https://virusseq-dataportal.ca/

July 15, 2024
15:40-16:00
intermine.bio2rdf.org : A QLever SPARQL endpoint for InterMine databases
Confirmed Presenter: François Belleau, Arnaud Droit Computational Laboratory, Canada
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Monica Munoz-Torres


Authors List:

  • François Belleau, Arnaud Droit Computational Laboratory
  • Gos Micklem, Department of Genetics
  • Deepak Unni, SIB Swiss Institute of Bioinformatics
  • Arnaud Droit, Research Center of the CHU de Québec

Presentation Overview:

This project explores converting biological data from InterMine, an open-source data warehouse, into a knowledge graph accessible via Bio2RDF. This exposes global InterMine data as RDF, enabling SPARQL queries or exploration through the OpenSearch interactive dashboard.

The approach leverages novel semantic web technologies for Bio2RDF: JSON-LD (lightweight Linked Data format), LinkML (Linked Data modeling language), OpenSearch (search engine and dashboard software), and the scalable QLever SPARQL endpoint.

Using our pipeline, available on GitHub (https://github.com/intermineorg/intermine2sparql), we retrieve the data model from the InterMine API and transform it into JSON-LD documents stored in an OpenSearch index. Class definitions in LinkML are used to generate JSON-LD context for RDF conversion. Finally, JSON-LD documents are converted to N-Triples format and loaded into the QLever endpoint.
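
The final conversion step can be pictured with a deliberately simplified sketch. It is not the project's pipeline and not a conforming JSON-LD processor: it assumes a flat document whose context maps every term directly to an IRI, treats all property values as plain string literals, and uses hypothetical example.org IRIs and a hypothetical function name.

```python
import json

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def to_ntriples(doc):
    """Convert one flat JSON-LD-style document into a list of N-Triples lines.

    Assumes doc["@context"] maps each term to a full IRI and that every
    non-keyword value is a simple ASCII-friendly literal.
    """
    context = doc["@context"]
    subject = f"<{doc['@id']}>"
    triples = []
    if "@type" in doc:
        triples.append(f"{subject} <{RDF_TYPE}> <{context[doc['@type']]}> .")
    for key, value in doc.items():
        if key.startswith("@"):
            continue  # keywords are handled above, not emitted as properties
        # json.dumps quotes and escapes the value; adequate for simple text only.
        literal = json.dumps(str(value))
        triples.append(f"{subject} <{context[key]}> {literal} .")
    return triples


# Hypothetical document; the IRIs are placeholders, not InterMine's model.
gene_doc = {
    "@context": {
        "Gene": "http://example.org/Gene",
        "symbol": "http://example.org/symbol",
    },
    "@id": "http://example.org/gene/1",
    "@type": "Gene",
    "symbol": "zen",
}

lines = to_ntriples(gene_doc)
print("\n".join(lines))
```

Nested objects, typed literals, language tags and prefixed terms all require the full JSON-LD expansion algorithm, which a dedicated library would normally provide.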

The complete data model from nine InterMine model organism instances is available through our services:

  • OpenSearch REST API: http://kibio.science
  • QLever SPARQL endpoint: http://intermine.bio2rdf.org
  • LinkML model definitions

This project demonstrates the feasibility of converting InterMine data into JSON-LD. This approach facilitates querying biological data with SPARQL across InterMine instances and integrates it with other LinkML-compatible data sources. OpenSearch APIs empower data scientists with programming tools like Pandas and R, and with BI software (Superset, Power BI, Tableau) to create insightful visualizations.

In conclusion, the project will unlock biological data from InterMine for AI. By converting the data, it becomes accessible to Large Language Models, empowering them to analyze vast datasets, identify relationships, and potentially lead to discoveries in biological research.

July 15, 2024
15:40-16:00
Organizing community curation to create an Open database on Thermodynamics of Enzyme-Catalyzed Reactions (openTECR)
Confirmed Presenter: Robert T. Giessmann, Institute for Globally Distributed Open Research and Education, igdore.org
Track: BOSC

Room: 524ab
Format: Live Stream
Moderator(s): Monica Munoz-Torres


Authors List:

  • Robert T. Giessmann, Institute for Globally Distributed Open Research and Education

Presentation Overview:

openTECR ("Open database on Thermodynamics of Enzyme-Catalyzed Reactions") is a database and a community.

We create a data collection of apparent equilibrium constants of enzyme-catalyzed reactions that is reliable, open and machine-actionable, with a clear change process to integrate new data and correct errors. We believe that Open Science principles, and specifically Open Data and Open Source, are key to achieving our vision.

The openTECR database serves computational and experimental scientists in the fields of metabolic engineering, genome-scale metabolic modelling, biocatalysis and related fields by providing curated information. It is used by eQuilibrator as the data basis for making predictions about any possible reaction.

Recently, we organized an open community curation effort (https://opentecr.github.io/invitation-to-curate). We prepared a curation workflow to analyze 278 pages densely packed with tables and textual information. We invited volunteer contributions, and are immensely grateful to the 17 volunteers who invested almost 100 working hours.

At BOSC 2024, I would like to present our initiative and share our lessons learned about organizing successful community curation. I believe that our example can serve as a blueprint for other databases / project ideas which require a large amount of working hours.

We discovered that key to receiving contributions is to offer very small packages of work and a detailed curation manual. Our smallest task was 3 minutes long and well received.

Our small community (~40 members) shares a mailing list (https://w3id.org/opentecr) and a GitHub organization where we store our data and code under open licenses (https://github.com/opentecr/).

July 15, 2024
15:40-16:00
Q&A For Flash Talks
Track: BOSC

Room: 524ab
Format: Live Stream
Moderator(s): Monica Munoz-Torres


July 15, 2024
16:40-17:00
Connecting Integrated Genome Browser to a huge genome database using its open API solves one problem and creates another
Confirmed Presenter: Ann Loraine, University of North Carolina Charlotte, United States
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Karsten Hokamp


Authors List:

  • Ann Loraine, University of North Carolina Charlotte
  • Jaya Sravani Sirigineedi, University of North Carolina Charlotte
  • Nowlan Freese, University of North Carolina Charlotte

Presentation Overview:

Integrated Genome Browser (IGB, pronounced “ig-bee”) is a fast, feature-rich, open-source desktop genome browser that thousands of researchers have used to explore and analyze genomic data. To support this user audience, we maintain data delivery Web sites called “IGB Quickloads” that supply IGB with around 60 different reference genome assemblies. IGB can open user-provided genome assembly files, but if their desired assembly already exists in an IGB Quickload, users can avoid this inconvenient work. However, we are finding it increasingly difficult to update these Quickload sites as new assemblies are published. Fortunately, many genome database systems now offer robust computational access to their data. By accessing these computational resources, IGB could show new assemblies without our copying them to a Quickload. To test this idea, we developed a new IGB version that consumes and displays data from one such resource, a JSON-emitting API (application programming interface) from the UCSC Genome Browser system. Now available as an “early access” version at the BioViz.org Web site, this new IGB version can display more than 200 assemblies visible in the UCSC Browser, along with dozens of annotation and data tracks. This wealth of data has now given us a new problem to solve. The API provides information like track names and data formats, but little about what the data represent. Thus, we face a new form of an old problem in bioinformatics: how do we categorize and label data so that computer programs and people can understand and use it?

July 15, 2024
17:00-17:20
Collaborating our way to optimal integration between Tripal 4 and JBrowse 2
Confirmed Presenter: Carolyn T. Caron, University of Saskatchewan, Canada
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Karsten Hokamp


Authors List: Show

  • Carolyn T. Caron, Carolyn T. Caron, University of Saskatchewan
  • Lacey-Anne Sanderson, Lacey-Anne Sanderson, University of Saskatchewan
  • Kirstin E. Bett, Kirstin E. Bett, University of Saskatchewan

Presentation Overview:Show

To meet the diverse needs of their research communities, biological web portals not only strive to make data and their associated metadata accessible and reusable, but also provide tools to help analyze and make connections between the data. Tripal JBrowse extends the functionality of Tripal, an open-source toolkit for building biological web portals, to embed a very popular tool within the community, JBrowse 2. Previous versions of the Tripal JBrowse module were limited to embedding JBrowse into a Tripal site using an iFrame, due to theming collisions between JBrowse and the content management system that Tripal extends, Drupal. This prevented any exchange of data between a Tripal site and a JBrowse instance. In large part due to the JBrowse team’s receptiveness to our feedback regarding these pain points, JBrowse 2 has been developed in such a way as to make embedding the application into a Drupal site a reality. Furthermore, advancements made in Drupal 9 and 10 have removed barriers to communication with other web frameworks, allowing direct interaction with the React backend of JBrowse 2. Amidst upgrading our module to Tripal 4, we now have a working prototype of native embedding for JBrowse 2 within a Tripal site. We are very excited for the possibilities this will now bring for the Tripal 4 community. For example, one feature we aim to implement is to embed a miniature JBrowse in gene content pages provided by Tripal that shows the structure and context around that gene.

July 15, 2024
17:00-17:20
An integrated environment for browsing 3-D protein structures and multiple sequence alignments in JBrowse 2
Confirmed Presenter: Colin Diesh, University of California, Berkeley
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Karsten Hokamp


Authors List: Show

  • Colin Diesh, Colin Diesh, University of California
  • Caroline Bridge, Caroline Bridge, Ontario Institute for Cancer Research
  • Garrett Stevens, Garrett Stevens, University of California
  • Scott Cain, Scott Cain, Ontario Institute for Cancer Research
  • Lincoln Stein, Lincoln Stein, Ontario Institute for Cancer Research
  • Ian Holmes, Ian Holmes, University of California

Presentation Overview:Show

Recent advances in protein structure prediction have invigorated research in protein structural biology. To enable the visualization of genomic datasets and protein structures in a unified environment, we created a suite of JBrowse 2 plugins to show 3-D protein structures and multiple sequence alignments (MSAs) alongside the genome browser.

To display 3-D protein structures, we incorporated Mol* into a JBrowse 2 plugin. Users can right-click a gene of interest in the genome browser, and the app can either (a) automatically locate AlphaFoldDB structures associated with the gene of interest or (b) allow the user to import their own PDB/mmCIF structure files produced by tools such as ColabFold. The protein sequence encoded by a user-selected transcript isoform of the gene of interest is aligned with the sequence from the protein structure file, which allows mouse clicks and mouseovers to show matching positions on the 3-D structure and the genome.

To display MSAs, we incorporated react-msaview, our MSA visualization tool, into a JBrowse 2 plugin. Users can select a gene of interest in the genome browser and launch an in-app NCBI BLAST workflow to recruit sequences for an MSA, or open their own MSA data files. The MSA viewer can display a dendrogram alongside the MSA to show hierarchical or phylogenetic grouping of the sequences. Protein domains and features can be highlighted on the MSA by loading InterProScan results. Like the 3-D protein viewer, mouse clicks and mouseovers show matching positions in the genome browser.
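
The position matching described above depends on translating a coordinate in one aligned sequence into the corresponding coordinate in another. The Python sketch below shows one generic way to derive such a mapping from two gapped alignment rows. It is an illustration only, not the plugins' actual code, and the function name and example sequences are hypothetical.

```python
def map_aligned_positions(aligned_a: str, aligned_b: str, gap: str = "-") -> dict[int, int]:
    """Map 0-based ungapped positions in sequence A to positions in sequence B.

    Both inputs are rows of the same pairwise alignment, so they must have
    equal length. Positions aligned to a gap have no entry in the mapping.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("alignment rows must have the same length")

    mapping: dict[int, int] = {}
    pos_a = 0  # next ungapped index in sequence A
    pos_b = 0  # next ungapped index in sequence B
    for char_a, char_b in zip(aligned_a, aligned_b):
        if char_a != gap and char_b != gap:
            mapping[pos_a] = pos_b
        if char_a != gap:
            pos_a += 1
        if char_b != gap:
            pos_b += 1
    return mapping


if __name__ == "__main__":
    # Hypothetical transcript-derived protein vs. structure-file sequence.
    print(map_aligned_positions("MKT-LLV", "M-TALLV"))
```

With a mapping like this, a click on residue i of the transcript's protein can be translated to the matching residue in the structure, and the inverse mapping serves the opposite direction.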

July 15, 2024
17:00-17:20
iCn3D, a Platform to Integrate Structures with Functions and Genomics
Confirmed Presenter: Jiyao Wang, NIH/NLM/NCBI, United States
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Karsten Hokamp


Authors List: Show

  • Jiyao Wang, Jiyao Wang, NIH/NLM/NCBI
  • Philippe Youkharibache, Philippe Youkharibache, NIH/NCI
  • Ravinder Abrol, Ravinder Abrol, California State University
  • Caesar Tawfeeq, Caesar Tawfeeq, California State University
  • Jack Lin, Jack Lin, Digital World Biology
  • Umesh Khaniya, Umesh Khaniya, NIH/NCI
  • Thomas Madej, Thomas Madej, NIH/NLM/NCBI
  • James Song, James Song, NIH/NLM/NCBI
  • Dachuan Zhang, Dachuan Zhang, NIH/NLM/NCBI
  • Christopher Lanczycki, Christopher Lanczycki, NIH/NLM/NCBI
  • Aron Marchler-Bauer, Aron Marchler-Bauer, NIH/NLM/NCBI

Presentation Overview:Show

With improvements in structure prediction from tools such as AlphaFold, a challenge is to integrate structures with functions and genomics. We started iCn3D as a 3D structure viewer with annotations such as domains and SNPs (Wang, 2020), then expanded iCn3D to interaction analysis and mutational analysis, both interactively and from the command line (Wang, 2022). Recently we added several new features in iCn3D to integrate structures with functions and genomics. First, iCn3D shows the isoforms and their corresponding exons for the gene related to the protein sequence. Second, iCn3D shows a few new annotations: Post Translational Modification (PTM), transmembrane domain detection, and Immunoglobulin (Ig) domain detection. Third, iCn3D allows users to align proteins based on structure, sequence, or residue mapping. Fourth, iCn3D is not only used in Jupyter Notebook and 3D printing, but has also been extended to Virtual Reality (VR) and Augmented Reality (AR). Furthermore, all iCn3D views can be reproduced in a sharable URL or iCn3D PNG image. The source code of iCn3D is at https://github.com/ncbi/icn3d.

July 15, 2024
17:00-17:20
Q&A For Flash Talks
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Karsten Hokamp



July 15, 2024
17:20-17:40
Codefair: Make Biomedical Research Software FAIR Without Breaking a Sweat
Confirmed Presenter: Bhavesh Patel, FAIR Data Innovations Hub, California Medical Innovations Institute
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Swapnil Savant


Authors List: Show

  • Dorian Portillo, Dorian Portillo, FAIR Data Innovations Hub
  • Sanjay Soundarajan, Sanjay Soundarajan, FAIR Data Innovations Hub
  • Jacob Clark, Jacob Clark, FAIR Data Innovations Hub
  • Bhavesh Patel, Bhavesh Patel, FAIR Data Innovations Hub

Presentation Overview:Show

We present codefair, an innovative solution that helps researchers make their biomedical research software Findable, Accessible, Interoperable, and Reusable (FAIR), i.e. optimally reusable by humans and machines. The FAIR Biomedical Research Software (FAIR-BioRS) guidelines provide actionable instructions for making biomedical research software FAIR. While the guidelines are designed to be convenient to follow, we learned that their implementation can still be time-consuming for researchers. To address this challenge, we are developing codefair, a free and open source GitHub app that acts as a personal assistant for making research software FAIR. Researchers simply need to install codefair from the GitHub marketplace and proceed with their software development as usual. By leveraging GitHub’s tools such as Probot, codefair monitors activities on the software repository, communicates via GitHub issues, and submits pull requests to help researchers make their software FAIR. Currently, codefair helps with including essential metadata elements such as a license file, a CITATION.cff metadata file, and a codemeta.json metadata file. Additional features are being added to provide support for complying with best coding practices, archiving on Zenodo, registering on bio.tools, and much more to cover all the steps for making software FAIR. By alleviating researchers' burden in the process, we believe codefair will empower and encourage them to adopt FAIR and open practices for their research software. We present here our approach to developing codefair, highlight the current and planned features, and explain how the community can benefit from and contribute to codefair.
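
To make the metadata check concrete, the Python sketch below reports which of the three metadata elements named in the abstract are absent from a list of repository file names. It is a minimal sketch, not codefair's implementation: the function name, the accepted license file name variants, and the restriction to top-level files are assumptions for this example.

```python
# Accepted file names per metadata element. The license variants are an
# assumption for this sketch; the other two names come from the abstract.
METADATA_FILES = {
    "LICENSE": {"license", "license.md", "license.txt"},
    "CITATION.cff": {"citation.cff"},
    "codemeta.json": {"codemeta.json"},
}


def missing_metadata_files(repo_files: list[str]) -> list[str]:
    """Return the metadata elements not found among top-level repo files."""
    # Only top-level files are considered; paths with "/" are ignored.
    present = {name.lower() for name in repo_files if "/" not in name}
    return [
        element
        for element, accepted_names in METADATA_FILES.items()
        if not (present & accepted_names)
    ]


if __name__ == "__main__":
    files = ["README.md", "LICENSE.txt", "src/main.py"]
    print(missing_metadata_files(files))
```

A tool built around a check like this could open an issue listing the missing elements, which is the kind of interaction the abstract describes.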

July 15, 2024
17:20-17:40
An Open-source Ecosystem For Scalable And Computationally Efficient Nanopore Data Processing
Confirmed Presenter: Hasindu Gamaarachchi, University of New South Wales, Australia
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Swapnil Savant


Authors List: Show

  • Hasindu Gamaarachchi, Hasindu Gamaarachchi, University of New South Wales
  • Hiruna Samarakoon, Hiruna Samarakoon, UNSW Sydney
  • James Ferguson, James Ferguson, Garvan Institute of Medical Research
  • Sasha Jenner, Sasha Jenner, University of Sydney
  • Bonson Wong, Bonson Wong, Garvan Institute of Medical Research
  • Timothy Amos, Timothy Amos, Garvan Institute of Medical Research
  • Jillian Hammond, Jillian Hammond, Garvan Institute of Medical Research
  • Hassaan Saadat, Hassaan Saadat, UNSW Sydney
  • Martin Smith, Martin Smith, UNSW Sydney
  • Sri Parameswaran, Sri Parameswaran, University of Sydney
  • Ira Deveson, Ira Deveson, Garvan Institute of Medical Research

Presentation Overview:Show

Emerging long-read sequencing, recently named “Method of the Year” by Nature Methods, has now become an important tool in understanding genomics. Nanopore sequencing is a major commercially available long-read technology that offers ultra-long reads with limited capital cost. However, computational aspects of nanopore sequence analysis (e.g., data access, storage, basecalling, methylation calling) are still a burden, impeding the scalability of population-scale experiments. In this talk, I will present a complete ecosystem that enables scalable nanopore data analysis in a computationally efficient way, built on top of our file format called S/BLOW5 (Nature Biotechnology, 2022). S/BLOW5 reduces computational time by an order of magnitude and additionally reduces storage footprint by ~20-80% compared to the existing FAST5 format. The S/BLOW5 ecosystem, which is fully open source, now includes: (i) the S/BLOW5 file format and accompanying specifications; (ii) the slow5lib (C/C++) and pyslow5 (Python) software libraries for reading and writing S/BLOW5 files; (iii) the slow5tools toolkit for creating, converting, handling and interacting with SLOW5/BLOW5 files (Genome Biology 2023); and (iv) a suite of open source bioinformatics software packages (including basecalling and methylation calling tools) with which SLOW5 is now integrated (Bioinformatics 2023, GigaScience 2024). The research community has already started building on top of S/BLOW5; slow5-rs, which allows S/BLOW5 access from the Rust programming language, is one example. S/BLOW5 will continue to prioritise performance, compatibility, usability and transparency. S/BLOW5 for nanopore signal space is analogous to the seminal SAM/BAM formats in the base-space that bioinformaticians are familiar with, thus making the adoption of S/BLOW5 seamless.

July 15, 2024
17:20-17:40
GenomeKit, a Python library for fast and easy access to genomic resources
Confirmed Presenter: Avishai Weissberg, Deep Genomics, Canada
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Swapnil Savant


Authors List: Show

  • Avishai Weissberg, Avishai Weissberg, Deep Genomics

Presentation Overview:Show

GenomeKit is Deep Genomics’ high performance Python library for fast and easy access to genomic resources such as sequence, data tracks, annotations, and variants.
GenomeKit has been in use internally by ML & data scientists and bioinformaticians at Deep Genomics for several years, and we have decided to make it available to the rest of the community. GenomeKit serves as the computational foundation for the data generation and evaluation of the recently published BigRNA foundation model, and most other workflows at Deep Genomics.
At its core, GenomeKit allows users to perform computational operations on the genome, like searching, applying variants, and comparing, extracting and expanding intervals. Classes like Genome, Interval, and Variant form the base for most of its APIs.

For example, GenomeKit allows users to easily get the principal transcript for a particular gene on a specific annotation and patch version of an assembly, accessing interval objects for each of its coding regions, UTRs, exons, introns, etc. These interval objects can further be expanded, intersected, have variants applied to them, etc.
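
The interval operations mentioned above can be illustrated with a toy class. The Python sketch below is not GenomeKit's API: `ToyInterval`, its fields, and its method signatures are hypothetical, and strand handling is deliberately left out. It only shows what expanding and intersecting half-open genomic intervals means.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ToyInterval:
    """Hypothetical half-open interval [start, end) on one chromosome."""

    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive

    def __post_init__(self) -> None:
        if self.start < 0 or self.start > self.end:
            raise ValueError("require 0 <= start <= end")

    def expand(self, left: int, right: int) -> ToyInterval:
        """Grow the interval by `left` and `right` bases, clamped at 0."""
        return ToyInterval(self.chrom, max(0, self.start - left), self.end + right)

    def intersect(self, other: ToyInterval) -> ToyInterval | None:
        """Return the overlap with `other`, or None if there is none."""
        if self.chrom != other.chrom:
            return None
        start = max(self.start, other.start)
        end = min(self.end, other.end)
        return ToyInterval(self.chrom, start, end) if start < end else None


if __name__ == "__main__":
    exon = ToyInterval("chr1", 100, 200)
    print(exon.expand(10, 5))
    print(exon.intersect(ToyInterval("chr1", 150, 300)))
```

Making the class immutable means each operation returns a new interval, so results can be chained without modifying the original coordinates.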

In addition, GenomeKit includes a variety of APIs to open and process the contents of standard data file types (gff3, fasta, etc). GenomeKit's data formats are highly optimized and compressed for reduced I/O and efficient memory utilization.

This talk aims to cover:

  • the use cases for GenomeKit
  • an overview of the API and main capabilities
  • techniques used to achieve GenomeKit's level of performance
  • benchmarks comparing GenomeKit with similar libraries

July 15, 2024
17:20-17:40
Q&A For Flash Talks
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Swapnil Savant



July 15, 2024
17:40-18:00
Tataki: Enhancing the robustness of bioinformatics workflows with simple, tolerant file format detection
Confirmed Presenter: Masaki Fukui, Sator, Inc.
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Swapnil Savant


Authors List: Show

  • Masaki Fukui, Masaki Fukui, Sator
  • Hirotaka Suetake, Hirotaka Suetake, Sator
  • Tazro Ohta, Tazro Ohta, Institute for Advanced Academic Research

Presentation Overview:Show

The increase in data volume in bioinformatics has heightened the demand for robust and reliable workflow analysis. Workflows enable the integration of various analytical tools to perform sequences of analyses in a portable and reproducible manner. However, inconsistencies in file inputs and outputs of workflow components can cause the tools to misidentify file formats of input files and terminate unexpectedly, which decreases the robustness of workflows.

One way to resolve this is to introduce a file format detection tool in between tools within workflows. However, current file format identification tools often fail to adequately handle the issues, as they might misidentify files with abnormalities. Additionally, because they are not always standalone components, integrating them seamlessly into workflows can be difficult.

To enhance workflow robustness, we developed Tataki, a simple command-line file format detection tool targeting major bioinformatics file formats. It is designed for ease of use within workflows, and also allows users to extend identification criteria using the Common Workflow Language to tolerate file format irregularities, such as variations within a format or formats that are not predefined. We believe Tataki is a practical solution to these file format issues and boosts the productivity of bioinformatics researchers and developers.
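
To illustrate what tolerant format detection between workflow steps can look like, the Python sketch below guesses a format from the first non-blank lines of a text file. It is a minimal sketch and not Tataki's implementation: the function name, the set of formats, and the heuristics are assumptions chosen for this example.

```python
# SAM header record types, each followed by a tab in a valid header line.
SAM_HEADER_PREFIXES = ("@HD\t", "@SQ\t", "@RG\t", "@PG\t", "@CO\t")


def sniff_format(text: str) -> str:
    """Guess a bioinformatics text format from its leading lines.

    Blank lines are skipped, so a file with stray empty lines at the top
    is still identified. Returns "unknown" when no heuristic matches.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return "unknown"

    first = lines[0]
    if first.startswith("##fileformat=VCF"):
        return "vcf"
    if first.startswith(">"):
        return "fasta"
    # Check SAM before FASTQ because both begin with "@".
    if first.startswith(SAM_HEADER_PREFIXES):
        return "sam"
    if first.startswith("@") and len(lines) >= 3 and lines[2].startswith("+"):
        return "fastq"
    return "unknown"


if __name__ == "__main__":
    print(sniff_format("\n\n>seq1\nACGT\n"))
    print(sniff_format("@read1\nACGT\n+\nIIII\n"))
```

A workflow step could call a check like this before invoking a tool and fail early with a clear message rather than letting the tool terminate unexpectedly.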

July 15, 2024
17:40-18:00
Arvados Project Update
Confirmed Presenter: Peter Amstutz, Curii Corporation, United States
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Swapnil Savant


Authors List: Show

  • Peter Amstutz, Peter Amstutz, Curii Corporation
  • Sarah Zaranek, Sarah Zaranek, Curii Corporation
  • Alexander Sasha Wait Zaranek, Alexander Sasha Wait Zaranek, Curii Corporation
  • Tom Clegg, Tom Clegg, Curii Corporation
  • Lisa Knox, Lisa Knox, Curii Corporation
  • Lucas Di Pentima, Lucas Di Pentima, Curii Corporation
  • Brett Smith, Brett Smith, Curii Corporation
  • Stephen Smith, Stephen Smith, Curii Corporation

Presentation Overview:Show

Arvados is a comprehensive, mature, open source platform for managing and processing large scale biomedical data on HPC and cloud. By combining robust data and workflow management capabilities in a single platform, Arvados helps researchers organize and analyze petabytes of data and run reproducible and versioned computational bioinformatics workflows. Since the last time Arvados was presented at BOSC (2022), Arvados has had 3 major releases and 8 minor releases. This short talk will highlight major improvements including a more modern, performant interface, expanded workflow capabilities, and improved data storage and management. Arvados is used in production by some of the largest life sciences companies in the world as well as in academia, and welcomes community participation (https://arvados.org/community/).

July 15, 2024
17:40-18:00
BiocPy: Facilitate Bioconductor Workflows in Python
Confirmed Presenter: Jayaram Kancherla, Genentech, United States
Track: BOSC

Room: 524ab
Format: Live Stream
Moderator(s): Swapnil Savant


Authors List: Show

  • Jayaram Kancherla, Jayaram Kancherla, Genentech
  • Aaron Lun, Aaron Lun, Genentech

Presentation Overview:Show

Bioconductor is an open-source software community that provides a rich repository of tools for the analysis and comprehension of genomic data. One of the main advantages of Bioconductor is the development of standardized data representations and a large number of statistical analysis methods tailored for genomic experiments. These tools allow researchers to seamlessly store, manipulate, and analyze data across various genomic experimental modalities in R. Analysts today use a variety of languages in their workflows, including R/Bioconductor for statistical analysis and Python for imaging or machine learning tasks.

Inspired by Bioconductor, BiocPy aims to enable and facilitate these bioinformatics workflows in Python. To achieve this goal, BiocPy provides data structures that are closely aligned with Bioconductor's implementations. These structures include BiocFrame, providing a Bioconductor-like data frame class, and GenomicRanges which aids in representing genomic regions and facilitating analysis. Together they serve as essential and foundational data structures, acting as the building blocks for extensive and complex representations. Notably, container classes such as SummarizedExperiment, SingleCellExperiment, and MultiAssayExperiment cater to the diverse needs of handling single or multi-omic experimental data and metadata.

By adapting mature Bioconductor data structures to Python, BiocPy offers a seamless transition and ease of use across programming languages, fostering reproducible and efficient genomic data analyses. To our knowledge, BiocPy is the first Python framework to provide well-integrated Bioconductor data structures and representations specifically designed for genomic data analysis, paving the way for enhanced cross-language interoperability in bioinformatics workflows. The BiocPy ecosystem is open-source and available at https://github.com/BiocPy.

July 15, 2024
17:40-18:00
Q&A For Flash Talks
Track: BOSC

Room: 524ab
Format: Live Stream
Moderator(s): Swapnil Savant

