July 16, 2024
8:40-9:00
Enhancing Reproducibility in Immunogenetics: Leveraging Containerization Technology for Bioinformatics Workflows
Confirmed Presenter: Rayo Suseno, UCSF, United States
Track: BOSC

Room: 524ab
Format: Live Stream
Moderator(s): Jason Williams


Authors List:

  • Rayo Suseno, UCSF
  • Kristen Wade, UCSF
  • Jill Hollenbach, UCSF

Presentation Overview:

Bioinformatics is experiencing a crisis of reproducibility, which inhibits research progress and undermines scientific findings. This is driven by a variety of factors, including incomplete documentation, poor version control, lack of accessible code, and incompatible software dependencies. Leveraging containerization technology is a promising solution to address these issues and streamline the deployment of specialized bioinformatics workflows. The field of immunogenetics is especially in need of such workflows, as high levels of genomic complexity characteristic of immune loci require the development of unique tools. For example, in 2021, we published a pipeline, Pushing Immunogenetics to the Next Generation (PING), designed to genotype the killer immunoglobulin-like receptor (KIR) genes from short read data. Due to its various dependencies, however, some investigators found PING challenging to install and run. This prompted us to containerize both PING and a tool recently developed in our lab, MHConstructor, a de novo short read sequence assembler for the human major histocompatibility complex (MHC) region. A particular challenge faced by MHConstructor is reliance on multiple Python versions, due to dependencies on different bioinformatics tools. This requires using multiple conda environments within one container, which we successfully implemented while ensuring seamless switching between environments. Singularity was chosen due to its user-friendly nature, encapsulating the entire workflow in a single file that can be effortlessly executed regardless of the operating system. The containerization of PING and MHConstructor ensures the reproducibility of these two immunogenetic pipelines, providing reliable high throughput analysis of large datasets not previously accessible with extant tools.

July 16, 2024
9:00-9:20
Breaking the silo: composable bioinformatics through cross-disciplinary open standards
Confirmed Presenter: Nezar Abdennur, UMass Chan Medical School, United States
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Jason Williams


Authors List:

  • Nezar Abdennur, UMass Chan Medical School
  • Trevor Manz, Harvard Medical School
  • Jack Huey, UMass Chan Medical School
  • Garrett Ng, UMass Chan Medical School
  • Vedat Yilmaz, UMass Chan Medical School
  • Nils Gehlenborg, Harvard Medical School
  • Open Chromosome Collective, Open2C

Presentation Overview:

The practice of data science in genomics and computational biology is fraught with friction. This is largely due to a tight coupling of bioinformatic tools to file input/output. While omic data is specialized and the storage formats for high-throughput sequencing and related data are often standardized, the adoption of emerging open standards not tied to bioinformatics can help better integrate bioinformatic workflows into the wider data science, visualization, and AI/ML ecosystems. Here, we present three libraries as short vignettes for composable bioinformatics. First, we present Oxbow, a Rust-based adapter library that unifies access to common genomic data formats by efficiently transforming queries into Apache Arrow, a standard in-memory columnar representation for tabular data analytics. Second, we present Bioframe, a Python library that performs genomic range operations using standard Pandas dataframes. Last, we present Anywidget, an architecture based on modern web standards for sharing interactive visualizations across all Jupyter-compatible runtimes, including JupyterLab, Google Colab, and VSCode. Together, we demonstrate the composition of these libraries to build a custom connected genomic analysis and visualization environment. We propose that components such as these, which leverage scientific domain-agnostic standards to unbundle specialized file manipulation, analytics, and web interactivity, can serve as reusable building blocks for composing flexible genomic data analysis and machine learning workflows as well as systems for exploratory data analysis and visualization.

July 16, 2024
9:20-9:40
For long-term sustainable software in bioinformatics: a manifesto
Confirmed Presenter: Luis Pedro Coelho, Queensland University of Technology, Australia
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Jason Williams


Authors List:

  • Luis Pedro Coelho, Queensland University of Technology

Presentation Overview:

I will discuss the challenges of maintaining research software in bioinformatics, especially considering the transient nature of funding and the turnover of researchers involved in coding projects. I will also discuss the approaches that my group takes to ensure that we maintain our software over the long term, even as trainees leave the group. Maintenance involves ensuring the software performs as described. This includes updating it to handle new dependencies and fixing bugs, as well as providing a modicum of support to users.

I categorize research software into three levels: Level 0 (one-off scripts for specific analyses, often containing minor errors), Level 1 (Extended Methods Code, supporting specific results in publications), and Level 2 (Tools intended for broad use, requiring robustness and extensive documentation). Both Level 1 and Level 2 are made public, but they serve different purposes and upgrading from 1 to 2 involves significant effort. In the case of Tools, we aim for ease of use, reproducibility, and good error reporting.

We follow several practices that facilitate maintenance and support: reproducible research techniques, "dogfooding" (using one's own tools), clear and public support channels, providing error messages that guide users to solutions, and distributing software via Bioconda to minimize installation issues. Additionally, we attempt to gather beta users before publication and improve software based on feedback.

Furthermore, I propose that journals should require a Maintenance and Support statement from authors, similar to the Data Availability statement, to ensure transparency and accountability regarding the long-term support of research software.

July 16, 2024
9:20-9:40
BioCompute: A Descriptive Standard for Computable Metadata
Confirmed Presenter: Jonathon Keeney, The George Washington University, United States
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Jason Williams


Authors List:

  • Jonathon Keeney, The George Washington University
  • Hadley King, The George Washington University
  • Tianyi Wang, The George Washington University
  • Chinweoke Okonkwo, The George Washington University
  • Raja Mazumder, The George Washington University

Presentation Overview:

Scientific review of work in the life sciences has been hindered by a lack of standards relating to the communication of computational pipelines. Often, little or no information related to the computational component is described in detail, rendering the work unreviewable, unfindable, and not reproducible. This challenge has been felt most acutely in peer review for academic publishing and in regulatory reviews at agencies such as the US FDA. BioCompute is a descriptive standard (officially "IEEE 2791-2020") that is flexible enough to accommodate any pipeline but robust enough to provide a computable structure for metadata and annotation of the pipeline. The standard is supported by several major bioinformatics platforms and two workflow languages, and is the only framework standard of its kind accepted by the FDA for regulatory reviews. This presentation will describe the need for and architecture of the standard, the community, and the tools that have been developed to work with the standard. URL: https://biocomputeobject.org/

July 16, 2024
9:20-9:40
Breaking Down Research Silos and Fostering Radical Collaboration through Collective Intelligence
Confirmed Presenter: Alberto Pepe, Sage Bionetworks, United States
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Jason Williams


Authors List:

  • Robert Allaway, Sage Bionetworks
  • Megan Doerr, Sage Bionetworks
  • Jineta Bannerjee, Sage Bionetworks
  • Milen Nikolov, Sage Bionetworks
  • Amy Heiser, Sage Bionetworks
  • Mialy DeFelice, Sage Bionetworks
  • Adam Hindman, Sage Bionetworks
  • Jake Albrecht, Sage Bionetworks
  • Lakaija Johnson, Sage Bionetworks
  • James Eddy, Sage Bionetworks
  • Miranda McManus, College of Charleston
  • J. Harry Caufield, Lawrence Berkeley National Laboratory
  • Christopher J. Mungall, Lawrence Berkeley National Laboratory
  • Monica Munoz-Torres, University of Colorado School of Medicine
  • Alberto Pepe, Sage Bionetworks

Presentation Overview:

The data landscape is rapidly expanding. Scientists are required to navigate increasing amounts of multimodal data in an attempt to create high-quality, impactful research. As knowledge expands, specialization increases. This comes at a cost: the emergence of deep data silos within separate disciplines. We believe that progress lies in the open exchange of ideas between all stakeholders, harnessing our collective human and artificial intelligence (AI). By bringing these communities together and fostering unanticipated connections, we can drive a new age of biomedical innovation. In this talk, we present two ongoing projects at Sage Bionetworks that underscore the need for open approaches to AI/ML to pave the way for the next generation of biological and medical innovations.


July 16, 2024
9:20-9:40
Q&A For Flash Talks
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Jason Williams



July 16, 2024
9:40-10:00
Tripal: a community-driven framework supporting open science, sustainable data web portals
Confirmed Presenter: Lacey-Anne Sanderson, University of Saskatchewan, Canada
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Jason Williams


Authors List:

  • Lacey-Anne Sanderson, University of Saskatchewan
  • Stephen P. Ficklin, Department of Horticulture
  • Doug Senalik, USDA-ARS
  • Risharde Ramnath, Department of Horticulture
  • Sean Buehler, Department of Horticulture
  • Josh Burns, Department of Horticulture
  • Valentin Guignon, Bioversity International
  • Dorrie Main, Department of Horticulture
  • Kirstin Bett, Department of Plant Sciences

Presentation Overview:

As the open science movement gains momentum, UNESCO is highlighting the need for infrastructure to (1) support building of global, inclusive research communities and (2) provide open access to research-associated data and tools. Open source bioinformatic software, like Tripal, is uniquely poised to fill such a need, as both open source and open science embody the same core principles. Tripal (https://tripal.info) extends several open-source packages into a cohesive platform meant to make development of open science data web portals accessible. Specifically, Drupal provides user management, page templating, content curation, and site administration, while GMOD Chado provides community-developed standards for storage of biological datasets. Tripal 4 currently offers ontology-driven data pages, extensive administrative and curation interfaces, and standards-focused data importers. Sites are fully customizable through various web forms and extensive developer APIs provided by Drupal and Tripal. While there are still some key integrations outstanding, we are now at the point of expanding the default configuration to guide community builders in creating inclusive and open data portals. Our first step on this path is to provide a set of well-documented default fields designed to promote high standards of data attribution and completeness of metadata. These fields are based on input and experience from our existing international community of Tripal data portals. Additionally, we are hoping to engage those in the wider open-source and open-science communities in collaboration. Please reach out to us in person at the BOSC cofest, on GitHub, on Slack, or in our weekly cofests on GatherTown (see https://tripal.info/community).

July 16, 2024
10:40-11:40
Invited Presentation: Open Data, Knowledge Graphs, and Large Language Models
Confirmed Presenter: Andrew Su
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Nomi Harris


Authors List:

  • Andrew Su

Presentation Overview:

Bioinformatics is the science of collecting, storing, analyzing, and disseminating biological data and information. As in most domains of data science, bioinformaticians have long focused on structured data – information that is represented using ontologies and controlled vocabularies in well-defined data formats and often stored in databases with predefined schemas. This focus on structured data over the last 30 years has been the most efficient way to convert information into testable hypotheses and new scientific insights.

Recent developments in artificial intelligence, particularly the advent of large language models (LLMs), have started to challenge this traditional focus on structured data. By utilizing massive training sets of unstructured text, LLMs have shown exceptional capabilities not only in tasks like question answering and text generation but also in summarization, translation, and code generation. In this presentation, we will examine how LLMs are changing and will continue to change the practice of bioinformatics, particularly at the interface between structured and unstructured data.

Andrew Su, Ph.D., is the Elden and Verna Strahm Professor at the Scripps Research Institute in the Department of Integrative Structural and Computational Biology (ISCB). Dr. Su earned his PhD in chemistry at Scripps Research in 2002, and was the Associate Director of Bioinformatics at The Genomics Institute of the Novartis Research Foundation (GNF) before returning to Scripps Research as a faculty member in 2011.

The Su lab focuses on building and applying bioinformatics infrastructure for biomedical discovery. Dr. Su has had a long-standing interest in leveraging crowdsourcing to organize and integrate knowledge through projects like the Gene Wiki and Wikidata. In partnership with Chunlei Wu’s lab, he has also worked extensively on creating biomedical APIs and enabling API interoperability through the BioThings project. Most recently, his lab has placed particular emphasis on constructing and mining knowledge graphs for drug repurposing. In all this work, the Su lab has embraced the principles of open science, open data, and open source software.

July 16, 2024
11:40-12:00
Gene Set Summarization Using Large Language Models
Confirmed Presenter: Marcin Joachimiak, Lawrence Berkeley National Laboratory, United States
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Jessica Maia


Authors List:

  • Marcin Joachimiak, Lawrence Berkeley National Laboratory
  • J. Harry Caufield, Lawrence Berkeley National Laboratory
  • Nomi Harris, Lawrence Berkeley National Laboratory
  • Hyeongsik Kim, Lawrence Berkeley National Laboratory
  • Chris Mungall, Lawrence Berkeley National Laboratory

Presentation Overview:

Molecular biologists often use statistical enrichment analysis to interpret gene lists derived from high-throughput experiments and computational analyses. This traditional method assesses the over- or under-representation of biological function terms associated with genes, based on curated assertions from databases such as Gene Ontology (GO). Alternatively, interpreting gene lists can be conceptualized as a textual summarization task, where Large Language Models (LLMs) utilize scientific texts to reduce reliance on traditional knowledge bases. This approach offers advantages because traditional knowledge bases struggle to scale their curation and integration efforts and cannot encompass all available knowledge.

Our tool, TALISMAN (Terminological ArtificiaL Intelligence SuMmarization of Annotation and Narratives), employs generative AI to perform gene set summarization, complementing standard enrichment analysis. This approach leverages various sources of gene functional information, including: 1) structured text from curated ontological databases, 2) narrative summaries without ontology, and 3) direct model retrieval.

LLMs can generate biologically plausible GO term summaries for gene sets but struggle to provide reliable significance scores or rankings and do not match the precision of standard methods. Notably, newer LLMs significantly outperform older versions.

While these methods are not yet suitable replacements for standard term enrichment analysis, they do offer advantages for summarizing implicit knowledge across large and unstandardized datasets, particularly where the volume of information exceeds human processing capabilities. This, together with the generative capacities of LLMs, such as suggesting novel summarization terms, makes them a valuable tool for improved understanding in complex biological data analyses.
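
For readers unfamiliar with the standard enrichment analysis that this abstract uses as its baseline, the following is a minimal sketch of an over-representation test. It assumes a one-sided hypergeometric test, which is a common choice; the abstract does not say which test the authors used. The function name and the example counts are illustrative, not taken from TALISMAN.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, sample: int, hits: int) -> float:
    """Upper-tail hypergeometric probability P(X >= hits).

    population: number of genes in the background set
    annotated:  background genes annotated with the term of interest
    sample:     size of the gene list being interpreted
    hits:       genes in the list that carry the term
    """
    max_hits = min(annotated, sample)
    total = comb(population, sample)  # all possible gene lists of this size
    # Count the lists that contain at least `hits` annotated genes.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 background genes, 200 annotated with a term,
# and a 50-gene list in which 5 genes carry that term.
p = enrichment_p_value(population=20000, annotated=200, sample=50, hits=5)
print(f"over-representation p-value: {p:.3g}")
```

In practice such a test is run once per term and the resulting p-values are corrected for multiple testing; that step is omitted here.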

July 16, 2024
12:00-12:20
FAIR, modular and reproducible image-based ML workflows for biologists: a template and case study from imageomics
Confirmed Presenter: Hilmar Lapp, Duke University, United States
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Jessica Maia


Authors List:

  • Meghan Balk, National Ecological Observatory Network (NEON)
  • John Bradley, Duke University
  • Hilmar Lapp, Duke University

Presentation Overview:

Machine Learning (ML) has become a critical tool in the life sciences, and is being applied to diverse biological data types, including the vast and rapidly growing trove of biological image data. Using image-based ML for biological research questions frequently requires combining different ML models and algorithms into complex computational workflows. We present a template for creating FAIR and reproducible workflows for imageomics, an emerging field that uses AI and ML to extract knowledge from biological images. Recognizing the inherently interdisciplinary nature of imageomics, the template distinguishes between a conceptual workflow for interdisciplinary research convergence and the use of that conceptual workflow to implement an executable application-specific workflow. We present how implementation technology choices can promote research software engineering best practices and enable end-to-end automation, while also accommodating ongoing ML and computer science research, and empowering biologists to make modifications. The results include a conceptual workflow for detecting and quantifying traits from biological specimen images, and a concrete workflow for a dataset of fish museum specimen images. We extended core FAIR data and software practices to ML models, such as persistent identifiers, version control, semantic versioning, and rich metadata. We find that aiming for a FAIR ML workflow pushes every workflow component toward being FAIR. Ensuring full reproducibility is a separate step, and achieving workflow reproducibility requires end-to-end automation interoperable between high-performance computing environments, necessitating a formal workflow definition language with an associated workflow manager and execution engine.

July 16, 2024
14:20-14:40
Trust and Transparency in Reporting Machine Learning: The DOME-GigaScience Press Trial
Confirmed Presenter: Chris Armit, GigaScience Press, Hong Kong
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Jessica Maia


Authors List:

  • Chris Armit, GigaScience Press
  • Mary Ann Tuli, GigaScience Press
  • Yannan Fan, GigaScience Press
  • Nafisa Qazi, GigaScience Press
  • Nicole Nogoy, GigaScience Press
  • Hans Zauner, GigaScience Press
  • Hongling Zhou, GigaScience Press
  • Hongfang Zhang, GigaScience Press
  • Christopher Hunter, GigaScience Press
  • Scott Edmunds, GigaScience Press
  • Laurie Goodman, GigaScience Press

Presentation Overview:

Machine learning is increasingly applied to biological and biomedical data, and research studies need to report enough detail for a reader to understand the machine learning approach used. This is made more challenging by machine learning studies being inherently difficult to interpret (the so-called “black box” effect). To throw light on these methods, GigaScience Press (https://www.gigasciencepress.org/) has partnered with the DOME Consortium with the goal of encouraging authors to follow the DOME (Data, Optimisation, Model, Evaluation) recommendations.

The role of the GigaScience DataBase (GigaDB) Data Curation team is to ensure the Data Submission process runs as smoothly as possible. The DOME Consortium has generated the DOME Wizard (https://dome.ds-wizard.org/) which enables researchers to submit their DOME annotations to a central repository (https://registry.dome-ml.org/) and share them with reviewers. The GigaDB team scans submitted manuscripts for Machine Learning content, and performs checks to ensure that DOME annotations in support of GigaScience and GigaByte manuscripts are sufficiently complete.

To increase the visibility of the supporting DOME annotation, a link to the DOME annotation is included in the GigaDB dataset that accompanies a GigaScience or GigaByte manuscript. The DOME annotations are a great asset to peer review, providing the high-level overview needed to properly understand a machine learning study. We recommend that other journals follow our example in encouraging DOME annotations to be submitted early in the publication process and prior to peer review.

July 16, 2024
14:40-15:40
Panel: Open Approaches to AI/ML in Bioinformatics
Track: BOSC

Room: 524ab
Format: In Person
Moderator(s): Monica Munoz-Torres

