Breaking the silo: composable bioinformatics through cross-disciplinary open standards
Confirmed Presenter: Nezar Abdennur, UMass Chan Medical School, United States
Track: BOSC
Room: 524ab
Format: In Person
Moderator(s): Jason Williams
Authors List:
- Nezar Abdennur, UMass Chan Medical School
- Trevor Manz, Harvard Medical School
- Jack Huey, UMass Chan Medical School
- Garrett Ng, UMass Chan Medical School
- Vedat Yilmaz, UMass Chan Medical School
- Nils Gehlenborg, Harvard Medical School
- Open Chromosome Collective, Open2C
Presentation Overview:
The practice of data science in genomics and computational biology is fraught with friction, largely because bioinformatic tools are tightly coupled to file input/output. Omic data is specialized, and the storage formats for high-throughput sequencing and related data are often standardized. Even so, adopting emerging open standards that are not tied to bioinformatics can better integrate bioinformatic workflows into the wider data science, visualization, and AI/ML ecosystems. Here, we present three libraries as short vignettes for composable bioinformatics. First, we present Oxbow, a Rust-based adapter library that unifies access to common genomic data formats by efficiently transforming query results into Apache Arrow, a standard in-memory columnar representation for tabular data analytics. Second, we present Bioframe, a Python library that performs genomic range operations on standard pandas dataframes. Last, we present Anywidget, an architecture based on modern web standards for sharing interactive visualizations across all Jupyter-compatible runtimes, including JupyterLab, Google Colab, and VS Code. Finally, we demonstrate how these libraries compose to build a custom, connected genomic analysis and visualization environment. We propose that components such as these, which leverage domain-agnostic standards to unbundle specialized file manipulation, analytics, and web interactivity, can serve as reusable building blocks for composing flexible genomic data analysis and machine learning workflows, as well as systems for exploratory data analysis and visualization.
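To make the idea of genomic range operations on ordinary dataframes concrete, the following is a minimal sketch written with pandas alone. It does not use Bioframe's own API, and it is not how Bioframe is implemented. The function name `overlap_intervals`, the column names `chrom`, `start`, `end`, and `name`, and the toy coordinates are all assumptions made for illustration, with intervals treated as half-open.

```python
import pandas as pd


def overlap_intervals(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Return every pair of overlapping half-open intervals [start, end)
    that lie on the same chromosome.

    Hypothetical helper for illustration only; it is not part of Bioframe.
    """
    # Pair up intervals that share a chromosome. This builds all candidate
    # pairs per chromosome, which is simple but quadratic in the worst case.
    pairs = left.merge(right, on="chrom", suffixes=("_left", "_right"))
    # Two half-open intervals overlap when each one starts before the
    # other one ends.
    mask = (pairs["start_left"] < pairs["end_right"]) & (
        pairs["start_right"] < pairs["end_left"]
    )
    return pairs[mask].reset_index(drop=True)


# Toy data (made up for this sketch): gene annotations and called peaks.
genes = pd.DataFrame(
    {
        "chrom": ["chr1", "chr1", "chr2"],
        "start": [100, 500, 100],
        "end": [200, 600, 300],
        "name": ["geneA", "geneB", "geneC"],
    }
)
peaks = pd.DataFrame(
    {
        "chrom": ["chr1", "chr2", "chr2"],
        "start": [150, 250, 400],
        "end": [550, 350, 450],
        "name": ["peak1", "peak2", "peak3"],
    }
)

hits = overlap_intervals(genes, peaks)
print(hits[["chrom", "name_left", "name_right"]])
```

Because the inputs and the result are plain dataframes, the output can be passed directly to any other tool that accepts tabular data, which is the kind of composition the abstract argues for. A dedicated library would be expected to replace the all-pairs join with a more efficient interval algorithm.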