Codefair: Make Biomedical Research Software FAIR Without Breaking a Sweat
Confirmed Presenter: Bhavesh Patel, FAIR Data Innovations Hub, California Medical Innovations Institute, United States
Room: 524ab
Format: In Person
Moderator(s): Swapnil Savant
Authors List:
- Dorian Portillo, FAIR Data Innovations Hub, California Medical Innovations Institute, United States
- Sanjay Soundarajan, FAIR Data Innovations Hub, California Medical Innovations Institute, United States
- Jacob Clark, FAIR Data Innovations Hub, California Medical Innovations Institute, United States
- Bhavesh Patel, FAIR Data Innovations Hub, California Medical Innovations Institute, United States
Presentation Overview:
We present codefair, an innovative solution that helps researchers make their biomedical research software Findable, Accessible, Interoperable, and Reusable (FAIR), i.e., optimally reusable by humans and machines. The FAIR Biomedical Research Software (FAIR-BioRS) guidelines provide actionable instructions for making biomedical research software FAIR. Although the guidelines were designed to be convenient to follow, we learned that implementing them can still be time-consuming for researchers. To address this challenge, we are developing codefair, a free and open source GitHub app that acts as a personal assistant for making research software FAIR. Researchers simply need to install codefair from the GitHub Marketplace and proceed with their software development as usual. By leveraging GitHub’s tools such as Probot, codefair monitors activity on the software repository, communicates via GitHub issues, and submits pull requests to help researchers make their software FAIR. Currently, codefair helps with including essential metadata elements such as a license file, a CITATION.cff metadata file, and a codemeta.json metadata file. Additional features are being added to support compliance with best coding practices, archiving on Zenodo, registering on bio.tools, and the remaining steps for making software FAIR. We believe that, by alleviating this burden, codefair will empower and encourage biomedical researchers to adopt FAIR and open practices for their research software. We present here our approach to developing codefair, highlight the current and planned features, and explain how the community can benefit from and contribute to codefair.
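To make the metadata step concrete, the following is a minimal sketch of the kind of check described above: looking for a license file, CITATION.cff, and codemeta.json at the root of a local repository checkout. It is not codefair's implementation. The function name `missing_fair_metadata` and the accepted license file spellings are assumptions made for illustration.

```python
from pathlib import Path

# Metadata elements named in the abstract. The accepted LICENSE spellings
# are an assumption for this sketch, not a codefair rule.
REQUIRED_FILES = {
    "license": ("LICENSE", "LICENSE.md", "LICENSE.txt"),
    "citation": ("CITATION.cff",),
    "codemeta": ("codemeta.json",),
}


def missing_fair_metadata(repo_root):
    """Return the metadata elements that have no matching file in repo_root."""
    root = Path(repo_root)
    return [
        name
        for name, candidates in REQUIRED_FILES.items()
        if not any((root / candidate).is_file() for candidate in candidates)
    ]


if __name__ == "__main__":
    # Check the current working directory as if it were a repository root.
    print(missing_fair_metadata("."))
```

A tool built along these lines could report the returned names in an issue, which is the communication channel the abstract describes.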
An Open-source Ecosystem For Scalable And Computationally Efficient Nanopore Data Processing
Confirmed Presenter: Hasindu Gamaarachchi, University of New South Wales, Australia
Room: 524ab
Format: In Person
Moderator(s): Swapnil Savant
Authors List:
- Hasindu Gamaarachchi, University of New South Wales, Australia
- Hiruna Samarakoon, UNSW Sydney, Australia
- James Ferguson, Garvan Institute of Medical Research, Australia
- Sasha Jenner, University of Sydney, Australia
- Bonson Wong, Garvan Institute of Medical Research, Australia
- Timothy Amos, Garvan Institute of Medical Research, Australia
- Jillian Hammond, Garvan Institute of Medical Research, Australia
- Hassaan Saadat, UNSW Sydney, Australia
- Martin Smith, UNSW Sydney, Australia
- Sri Parameswaran, University of Sydney, Australia
- Ira Deveson, Garvan Institute of Medical Research, Australia
Presentation Overview:
Emerging long-read sequencing - recently dubbed the Nature Methods “Method of the Year” - has become an important tool in genomics. Nanopore sequencing is a major commercially available long-read technology that offers ultra-long reads at limited capital cost. However, the computational aspects of nanopore sequence analysis (e.g., data access, storage, basecalling, methylation calling) are still a burden, impeding the scalability of population-scale experiments. In this talk, I will present a complete ecosystem that enables scalable nanopore data analysis in a computationally efficient way, built on top of our file format called S/BLOW5 (Nature Biotechnology, 2022). S/BLOW5 reduces computational time by an order of magnitude and additionally reduces the storage footprint by ~20-80% compared to the existing FAST5 format. The S/BLOW5 ecosystem, which is fully open source, now includes: (i) the S/BLOW5 file format and accompanying specifications; (ii) the slow5lib (C/C++) and pyslow5 (Python) software libraries for reading and writing S/BLOW5 files; (iii) the slow5tools toolkit for creating, converting, handling and interacting with SLOW5/BLOW5 files (Genome Biology 2023); and (iv) a suite of open source bioinformatics software packages (including basecalling and methylation calling tools) with which SLOW5 is now integrated (Bioinformatics 2023, GigaScience 2024). The research community has already started building on top of S/BLOW5; one example is slow5-rs, which allows S/BLOW5 access from the Rust programming language. S/BLOW5 will continue to prioritise performance, compatibility, usability and transparency. S/BLOW5 in the nanopore signal space is analogous to the seminal SAM/BAM formats in the base space that bioinformaticians are familiar with, which makes the adoption of S/BLOW5 seamless.
GenomeKit, a Python library for fast and easy access to genomic resources
Confirmed Presenter: Avishai Weissberg, Deep Genomics, Canada
Room: 524ab
Format: In Person
Moderator(s): Swapnil Savant
Authors List:
- Avishai Weissberg, Deep Genomics, Canada
Presentation Overview:
GenomeKit is Deep Genomics’ high performance Python library for fast and easy access to genomic resources such as sequence, data tracks, annotations, and variants.
GenomeKit has been in use internally by ML & data scientists and bioinformaticians at Deep Genomics for several years, and we have decided to make it available to the rest of the community. GenomeKit serves as the computational foundation for the data generation and evaluation of the recently published BigRNA foundation model, and most other workflows at Deep Genomics.
At its core, GenomeKit allows users to perform computational operations on the genome, like searching, applying variants, and comparing, extracting and expanding intervals. Classes like Genome, Interval, and Variant form the base for most of its APIs.
For example, GenomeKit allows users to easily get the principal transcript for a particular gene on a specific annotation and patch version of an assembly, accessing interval objects for each of its coding regions, UTRs, exons, introns, etc. These interval objects can further be expanded, intersected, have variants applied to them, etc.
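The interval operations mentioned above (expanding and intersecting) can be illustrated with a small standalone sketch. This is not GenomeKit's API: the `Interval` class below, its fields, and its method names are hypothetical and written only to show the idea, assuming 0-based, half-open coordinates and strand-aware expansion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Interval:
    """Hypothetical genomic interval (0-based start, exclusive end)."""

    chrom: str
    strand: str  # "+" or "-"
    start: int
    end: int

    def __len__(self) -> int:
        return self.end - self.start

    def expand(self, upstream: int, downstream: int) -> "Interval":
        """Grow the interval; upstream and downstream are relative to the strand."""
        if self.strand == "+":
            left, right = upstream, downstream
        else:
            left, right = downstream, upstream
        # Clamp at 0 so the interval never extends before the chromosome start.
        return Interval(self.chrom, self.strand, max(0, self.start - left), self.end + right)

    def intersect(self, other: "Interval") -> Optional["Interval"]:
        """Return the overlapping region, or None if the intervals do not overlap."""
        if (self.chrom, self.strand) != (other.chrom, other.strand):
            return None
        start, end = max(self.start, other.start), min(self.end, other.end)
        return Interval(self.chrom, self.strand, start, end) if start < end else None


if __name__ == "__main__":
    exon = Interval("chr1", "+", 100, 200)
    print(exon.expand(10, 5))
    print(exon.intersect(Interval("chr1", "+", 150, 300)))
```

Making the class immutable means each operation returns a new interval, so the results of expansion and intersection can be chained without modifying the original object.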
In addition, GenomeKit includes a variety of APIs to open and process the contents of standard data file types (GFF3, FASTA, etc.). GenomeKit's own data formats are highly optimized and compressed for reduced I/O and efficient memory utilization.
This talk aims to cover:
- the use cases for GenomeKit
- an overview of the API and main capabilities
- techniques used to achieve GenomeKit's level of performance
- benchmarks comparing GenomeKit with similar libraries
Q&A For Flash Talks
Room: 524ab
Format: In Person
Moderator(s): Swapnil Savant