Attention Presenters - please review the Speaker Information Page available here
Schedule subject to change
All times listed are in CDT
Tuesday, May 13th
8:45-9:00
Conference Welcome
Format: In person


Authors List: Show

9:00-10:00
Invited Presentation: Computational Immunology: Bridging Algorithms, Biology, and Medicine
Confirmed Presenter: Aly Azeem Khan

Format: In person


Authors List: Show

  • Aly Azeem Khan

Presentation Overview: Show

The complexity of the human immune system presents profound challenges and opportunities for computational biology. This talk will demonstrate how integrating principled computational and machine learning methods with deep biological knowledge yields powerful tools for deciphering immune function. We will present vignettes highlighting the development of novel algorithms, specifically informed by immunological principles, to analyze challenging single-cell RNA-seq data. Examples include modeling B cell clonal evolution and affinity maturation, illustrating how such component models serve as critical steps towards understanding system-level immune behavior. The talk will conclude by exploring key computational frontiers and pressing challenges at the intersection of systems immunology, machine learning, and medicine.

10:30-10:50
Proceedings Presentation: GRPhIN: Graphlet Characterization of Regulatory and Physical Interaction Networks
Confirmed Presenter: Altaf Barelvi, Reed College, United States

Format: In Person


Authors List: Show

  • Altaf Barelvi, Reed College, United States
  • Oliver Anderson, Reed College, United States
  • Anna Ritz, Reed College, United States

Presentation Overview: Show

Graphs are powerful tools for modeling and analyzing molecular interaction networks. Graphs typically represent either undirected physical interactions or directed regulatory relationships, which can obscure a particular protein's functional context. Graphlets can describe local topologies and patterns within graphs, and combining physical and regulatory interactions offers new graphlet configurations that can provide biological insights. We present GRPhIN, a tool for characterizing graphlets and protein roles within graphlets in mixed physical and regulatory interaction networks. We describe the graphlets of mixed networks in B. subtilis, C. elegans, D. melanogaster, D. rerio, and S. cerevisiae, and examine local topologies of proteins and subnetworks related to the oxidative stress response pathway. We found a number of graphlets that were abundant in all species, specific node positions (orbits) within graphlets that were over-represented in stress-associated proteins, and rarely-occurring graphlets that were enriched in stress subnetworks. These results showcase the potential for using graphlets in mixed physical and regulatory interaction networks to identify new patterns beyond a single interaction type.
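As a rough illustration of how mixing interaction types creates new local configurations (toy edges and gene names, not GRPhIN's data or algorithm), the sketch below classifies each connected node pair by its physical/regulatory signature, the basic building block of mixed graphlets:

```python
# Hypothetical sketch: classify each connected node pair in a mixed network by its
# interaction signature, the basic building block of mixed (physical + regulatory)
# graphlets. Edge lists and gene names are illustrative, not from GRPhIN.
from collections import Counter
from itertools import combinations

physical = {frozenset(e) for e in [("sigB", "rsbW"), ("rsbW", "rsbV")]}  # undirected
regulatory = {("sigB", "katE"), ("sigB", "rsbV")}                        # directed (u -> v)

def pair_signature(u, v):
    """Return the mixed-interaction signature of an unordered node pair."""
    phys = frozenset((u, v)) in physical
    fwd = (u, v) in regulatory
    rev = (v, u) in regulatory
    if phys and (fwd or rev):
        return "physical+regulatory"
    if phys:
        return "physical"
    if fwd and rev:
        return "mutual regulation"
    if fwd or rev:
        return "regulatory"
    return None

nodes = {n for e in physical for n in e} | {n for e in regulatory for n in e}
signatures = Counter(s for u, v in combinations(sorted(nodes), 2)
                     if (s := pair_signature(u, v)) is not None)
print(signatures)
```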

10:50-11:10
Proceedings Presentation: Vector Semantics of Multidomain Protein Architectures
Confirmed Presenter: Xiaoyue Cui, Carnegie Mellon University, United States

Format: In Person


Authors List: Show

  • Xiaoyue Cui, Carnegie Mellon University, United States
  • Yuting Xiao, Carnegie Mellon University, United States
  • Maureen Stolzer, Carnegie Mellon University, United States
  • Dannie Durand, Carnegie Mellon University, United States

Presentation Overview: Show

Multidomain proteins are mosaics of sequence segments that encode distinct structural or functional modules, called domains. Domains fold independently in diverse sequence contexts and are found in proteins that are otherwise unrelated. The modular organization of multidomain proteins poses analytical challenges. Many bioinformatic tools cannot be directly applied to families with variable domain architectures, in which some regions of the sequence have discernible similarity, while others are not alignable. At the same time, the variations in domain composition carry useful information that current tools are not designed to exploit.

In the area of protein function prediction, modeling the relationship between the domain content of a protein and its function is particularly challenging. While mapping GO terms to protein domains is an active area of bioinformatic research, there is no consensus on how these annotations should be used. It is not clear whether a given domain confers the same functional properties in all contexts, or whether the functional contribution of a domain depends on the neighboring domains in the protein. Here we investigate the potential of vector semantic models, first developed for natural language processing, for elucidating functional relationships between multidomain proteins.
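As a toy illustration of the vector-semantics idea (not the authors' implementation), the sketch below treats each domain architecture as a "sentence" of domain tokens and learns domain embeddings with a skip-gram model; the Pfam-like domain names and architectures are made up:

```python
# Minimal sketch of the vector-semantics idea (not the authors' code): treat each
# multidomain architecture as a "sentence" of domain tokens and learn domain
# embeddings with a skip-gram model. Domain names and architectures are illustrative.
from gensim.models import Word2Vec

architectures = [
    ["SH3", "SH2", "Pkinase"],          # e.g., SRC-like kinases
    ["SH2", "Pkinase"],
    ["PH", "SH2", "Pkinase"],
    ["PH", "RasGEF"],
    ["SH3", "GTPase_binding"],
]

model = Word2Vec(architectures, vector_size=16, window=2, min_count=1, sg=1,
                 epochs=200, seed=1)

# Domains that co-occur in similar architectural contexts get similar vectors;
# a protein architecture can then be embedded, e.g., by averaging its domain vectors.
print(model.wv.most_similar("SH2", topn=3))
```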

Availability: Relevant scripts and data can be found at https://github.com/xcui297/Domain-architecture-semantics

11:10-11:25
lociPARSE: A Locality-aware Invariant Point Attention Model for Scoring RNA 3D Structures
Confirmed Presenter: Sumit Tarafder, Virginia Tech, United States

Format: In Person


Authors List: Show

  • Sumit Tarafder, Virginia Tech, United States
  • Debswapna Bhattacharya, Virginia Tech, United States

Presentation Overview: Show

Introduction

Advancements in deep learning have improved RNA 3D structure prediction, but developing reliable scoring functions remains challenging, especially without experimental structures. Existing methods fall into two categories: statistical potentials struggle to distinguish accurate structures due to a limited understanding of RNA energetics, while deep learning approaches rely on RMSD, which fails to capture local atomic environments and RNA flexibility. A better alternative to RMSD as a ground truth is the Local Distance Difference Test (lDDT), which is superposition-free, rotation-invariant, and robust to structural variations. How can we design an RNA scoring model that leverages lDDT while ensuring invariance under global Euclidean transformations?
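For reference, a simplified superposition-free lDDT-style score, the ground-truth signal discussed above, can be computed per nucleotide as sketched below (one representative atom per nucleotide; the 15 Å radius and 0.5/1/2/4 Å thresholds follow the standard lDDT definition; this is not the authors' code):

```python
# Simplified, superposition-free lDDT-style score per nucleotide (a sketch of the
# ground-truth signal lociPARSE is trained on, not the authors' implementation).
# Coordinates are one representative atom (e.g., C3') per nucleotide.
import numpy as np

def lddt_per_residue(ref_xyz, model_xyz, radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    ref_d = np.linalg.norm(ref_xyz[:, None, :] - ref_xyz[None, :, :], axis=-1)
    mod_d = np.linalg.norm(model_xyz[:, None, :] - model_xyz[None, :, :], axis=-1)
    n = len(ref_xyz)
    mask = (ref_d < radius) & ~np.eye(n, dtype=bool)   # pairs assessed, defined on the reference
    diff = np.abs(ref_d - mod_d)
    scores = np.zeros(n)
    for i in range(n):
        if mask[i].any():
            frac = [(diff[i, mask[i]] < t).mean() for t in thresholds]
            scores[i] = float(np.mean(frac))
    return scores   # one lDDT value in [0, 1] per nucleotide

# toy example: a 3-nucleotide "structure" with one displaced nucleotide
ref = np.array([[0.0, 0, 0], [5, 0, 0], [10, 0, 0]])
mod = np.array([[0.0, 0, 0], [5, 0, 0], [13, 0, 0]])
print(lddt_per_residue(ref, mod))
```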

Methods

Here, we provide such a solution by developing a new attention-based architecture, called lociPARSE (locality-aware invariant Point Attention-based RNA ScorEr) [1], for scoring RNA 3D structures using local nucleotide-wise lDDT instead of RMSD. Inspired by AlphaFold2, it defines nucleotide-wise frames with rotation and translation parameters to model local atomic environments. By modifying IPA to incorporate RNA-specific locality, lociPARSE effectively captures structural accuracy at the nucleotide level. It outperforms traditional statistical potentials and state-of-the-art ML-based scoring methods like ARES across multiple benchmarks, including CASP15 blind tests, demonstrating superior performance in RNA structure assessment across a wide range of performance measures.

Results and Discussions

We compared lociPARSE with statistical potentials (rsRNASP, cgRNASP, RASP, DFIRE-RNA) and ML-based methods (RNA3DCNN, ARES) on 12 blind CASP15 RNA test targets. The corresponding 3D structural models were collected directly from the CASP15 website (https://predictioncenter.org/casp15/), based on the blind predictions submitted by the participating groups in the CASP15 RNA 3D structure prediction challenge. lociPARSE consistently outperformed all methods across nearly all performance metrics. A key strength is its ability to predict nucleotide-wise quality scores, where it surpasses RNA3DCNN, the only other method with this capability. By combining a locality-aware IPA framework with lDDT, lociPARSE effectively assesses nucleotide accuracy while considering local atomic environments, addressing core challenges in RNA scoring.

Availability
lociPARSE is published in the Journal of Chemical Information and Modeling (https://doi.org/10.1021/acs.jcim.4c01621) and freely available to download under the GNU General Public License v3 at https://github.com/Bhattacharya-Lab/lociPARSE.

References

[1] Sumit Tarafder, Debswapna Bhattacharya, “lociPARSE: a locality-aware invariant point attention model for scoring RNA 3D structures”, Journal of Chemical Information and Modeling, Volume 64, Issue 22, Pages 8655–8664, November 2024, doi: https://doi.org/10.1021/acs.jcim.4c01621

11:25-11:40
PharmAlchemy: An Agentic Framework for Integrative Drug–Gene–Disease Knowledge and Precision Drug Discovery
Confirmed Presenter: Kevin Song, The University of Alabama at Birmingham, United States

Format: In Person


Authors List: Show

  • Kevin Song, The University of Alabama at Birmingham, United States
  • Andrew Trotter, The University of Alabama at Birmingham, United States
  • Jake Chen, The University of Alabama at Birmingham, United States

Presentation Overview: Show

Integrating fragmented biomedical data remains a major challenge in systems pharmacology and therapeutic development. We present PharmAlchemy, an agentic computational framework that consolidates diverse biomedical knowledge—including curated drug–gene–disease associations, chemical annotations, pathway-linked gene sets, and clinical trial data—into a unified, interoperable platform for translational research. Built on a formal set-theoretic and relational-algebraic foundation, PharmAlchemy enables both structured queries and advanced hypergraph-based analyses across heterogeneous datasets. The framework comprises three key components: (1) an agentic AI system that extracts and integrates information from public datasets of disease-related genes, drug targets, LLM-mined PubMed abstracts, and curated pathway-linked gene sets; (2) a multi-hop, reliability-aware traversal algorithm that dynamically navigates drug–gene–disease networks to identify high-confidence therapeutic candidates; and (3) a semantic compound similarity engine that infers molecular relatedness and bioactivity across chemical space and drug-gene interactions. In case studies on opioid receptor ligand discovery and pancreatic cancer drug repositioning, PharmAlchemy significantly expanded the candidate compound space and uncovered latent mechanistic insights not captured by conventional resources. By unifying structured biomedical knowledge into a scalable and intelligent platform, PharmAlchemy advances AI-enabled hypothesis generation and accelerates the development of precision therapeutics.
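As a toy sketch of the reliability-aware multi-hop traversal idea (node names, reliabilities, and the path-scoring rule are illustrative assumptions, not PharmAlchemy's knowledge base or algorithm):

```python
# Toy sketch of a reliability-aware multi-hop traversal over a drug-gene-disease
# graph (illustrative only; node names, reliabilities, and the scoring rule are
# assumptions, not PharmAlchemy's actual knowledge base or algorithm).
import networkx as nx

G = nx.DiGraph()
G.add_edge("drugA", "OPRM1", reliability=0.9)          # drug -> gene (target)
G.add_edge("drugB", "OPRM1", reliability=0.6)
G.add_edge("OPRM1", "chronic pain", reliability=0.8)   # gene -> disease association
G.add_edge("drugA", "KRAS", reliability=0.4)
G.add_edge("KRAS", "pancreatic cancer", reliability=0.95)

def ranked_paths(graph, source, target, max_hops=3, min_score=0.3):
    """Score each simple path by the product of its edge reliabilities."""
    hits = []
    for path in nx.all_simple_paths(graph, source, target, cutoff=max_hops):
        score = 1.0
        for u, v in zip(path, path[1:]):
            score *= graph[u][v]["reliability"]
        if score >= min_score:
            hits.append((score, path))
    return sorted(hits, reverse=True)

print(ranked_paths(G, "drugA", "chronic pain"))
```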

11:40-12:00
Proceedings Presentation: Assessing Deep Segmentation Performance for Macromolecular Subunits Using Extreme Points
Confirmed Presenter: Manuel Zumbado-Corrales, Instituto Tecnológico de Costa Rica, Costa Rica

Format: Live Stream


Authors List: Show

  • Manuel Zumbado-Corrales, Instituto Tecnológico de Costa Rica, Costa Rica
  • Juan Esquivel-Rodriguez, Instituto Tecnológico de Costa Rica, Costa Rica

Presentation Overview: Show

Motivation: The accurate segmentation of protein-based macromolecular structures is fundamental to understanding biological processes. Cryo-electron microscopy (cryo-EM) enables imaging of these structures, but automated segmentation remains challenging. We evaluated the use of extreme points as an additional input channel to a neural network that segments macromolecular structures, as integrating additional features as input channels has shown potential to enhance segmentation accuracy. The Intersection-over-Union metric and precision-recall curves were employed to assess model performance.
Results: The model demonstrated substantial performance improvements in complex segmentation contexts, with some challenges in regions, especially those containing narrow shapes or structures extending across the EM map. Distribution analysis of Intersection-over-Union and Recall scores reinforced the model's potential. Given the inherent complexity of the segmentation task, the incorporation of extreme points, simulating expert input, emerges as a significant strategy. This method not only underscores the challenge of the segmentation process but also emphasizes the role of human-in-the-loop guidance in enhancing overall segmentation performance.
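The extreme-point input channel described above can be sketched as follows (2D toy mask with NumPy, not the authors' code): the extreme points of a binary mask are rasterized as a Gaussian heatmap that can be stacked with the density map.

```python
# Sketch of the "extreme points as an extra input channel" idea (not the authors'
# code): derive the 2D extreme points of a binary subunit mask and rasterize them
# as a Gaussian heatmap channel that can be stacked with the density map.
import numpy as np

def extreme_points(mask):
    """Left-, right-, top-, and bottom-most foreground pixels of a binary mask."""
    ys, xs = np.nonzero(mask)
    return [
        (ys[np.argmin(xs)], xs.min()),   # left-most
        (ys[np.argmax(xs)], xs.max()),   # right-most
        (ys.min(), xs[np.argmin(ys)]),   # top-most
        (ys.max(), xs[np.argmax(ys)]),   # bottom-most
    ]

def heatmap_channel(shape, points, sigma=2.0):
    """Sum of Gaussians centered on the extreme points, used as an extra input channel."""
    yy, xx = np.mgrid[:shape[0], :shape[1]]
    channel = np.zeros(shape, dtype=np.float32)
    for py, px in points:
        channel += np.exp(-((yy - py) ** 2 + (xx - px) ** 2) / (2 * sigma ** 2))
    return np.clip(channel, 0.0, 1.0)

mask = np.zeros((32, 32), dtype=bool)
mask[10:20, 8:25] = True                  # toy subunit mask
extra_channel = heatmap_channel(mask.shape, extreme_points(mask))
print(extra_channel.shape, extra_channel.max())
```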

13:30-14:30
Invited Presentation: Programming life that is not alive: biocomputing in synthetic cells
Confirmed Presenter: Kate Adamala, University of Minnesota, United States

Format: In Person


Authors List: Show

  • Kate Adamala, University of Minnesota, United States

Presentation Overview: Show

Computation with biological logic gates promises to bridge the gap between traditional hardware and living organisms. Boolean logic executed by biological circuits can offer the advantages of life-like properties, including regeneration, self-replication and evolution. However, the first priority of any self-respecting live cell is to remain alive and to reproduce, which often conflicts with the requirements of stringent, pre-programmed logic gates. As a result, biological computing in bacteria and other live cells is often unpredictable: natural cell gates are leaky and not very scalable.
Synthetic cells are emerging as an alternative to live natural cells for biocomputing, providing greater flexibility and engineerability. They combine the complexity and enzymatic flexibility of live biology with the simplicity of in vitro systems. Synthetic cells offer a way to bridge natural biology with electronic devices, and to engineer bio-based tools with unprecedented accuracy and precision.

14:30-14:45
CLEAR: Concise List Enrichment Analysis using R
Confirmed Presenter: Xinglin Jia, Department of Mathematics, Iowa State University, United States

Format: In Person


Authors List: Show

  • Xinglin Jia, Department of Mathematics, Iowa State University, United States
  • An Phan, Department of Mathematics, Iowa State University, United States
  • Karin Dorman, Department of Statistics, Iowa State University, United States
  • Claus Kadelka, Department of Mathematics, Iowa State University, United States

Presentation Overview: Show

Many modern high-throughput methods provide genome-wide data for all genes, SNPs, or other molecular features. Since biological functions are carried out by interacting proteins rather than individual genes, gene set analysis is crucial for interpreting these large-scale datasets. Model-based gene set analysis methods such as GenGO and MGSA use probabilistic approaches to infer which biological categories are activated. GenGO identifies active Gene Ontology (GO) categories using a generative probabilistic model that accounts for noise and overlapping GO terms to reduce redundancy in the result. MGSA extends this framework by introducing a Bayesian network, simultaneously inferring all categories, and improving robustness against noise. These methods have the advantage of returning a group of concise, non-redundant gene sets, which traditional methods (such as Over-representation Analysis (ORA) and Gene Set Enrichment Analysis (GSEA)) lack since they test each gene set individually.
However, GenGO and MGSA rely on binary gene activation states, which are determined by an arbitrary, user-defined threshold, rather than utilizing the underlying continuous test statistics such as effect sizes or p-values. Some extensions of MGSA incorporate the topological structure of the Gene Ontology or additional constraints to improve the model performance, but the statistical information associated with the genes is disregarded. We propose a novel, Bayesian model-based method, Concise List Enrichment Analysis using R (CLEAR), which directly models the gene-level statistics rather than the binary activation states.
CLEAR assumes that the gene statistics follow distinct distributions under the alternative and null hypotheses, enabling a more sensitive and nuanced interpretation of gene-level variation within gene sets. This probabilistic, continuous framework improves the robustness and interpretability of gene set analysis. We compared the performance of CLEAR against established methods using both in silico and real datasets, assessing its sensitivity and ability to return gene sets with established phenotype relevance. CLEAR achieves higher sensitivity and improves output interpretability by reducing redundancy and preserving more meaningful information. In conclusion, CLEAR is a powerful gene set enrichment analysis method that leverages all the information available in gene-level statistics and identifies relevant gene sets with greater precision.
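A toy sketch of the underlying modeling idea, gene-level z-scores treated as a mixture of a null and a shifted alternative distribution, is shown below; the parameters and data are illustrative, and this is not CLEAR's actual model or inference procedure:

```python
# Toy sketch of the modeling idea behind CLEAR (not the actual model or inference):
# gene-level z-scores are treated as draws from a null N(0,1) distribution, or from a
# shifted alternative when the gene set is "active"; the marginal likelihood of each
# state can then drive Bayesian gene set selection. Parameters here are illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
gene_z = {f"g{i}": z for i, z in enumerate(rng.normal(0, 1, 50))}
gene_z.update({f"hit{i}": z for i, z in enumerate(rng.normal(2.5, 1, 10))})  # truly active genes
gene_set = [f"hit{i}" for i in range(10)] + ["g0", "g1"]

def log_lik(zs, active, mu_alt=2.5, pi_alt=0.8):
    """Marginal log-likelihood of z-scores under an inactive vs. active gene set."""
    zs = np.asarray(zs)
    if not active:
        return norm.logpdf(zs, 0, 1).sum()
    mix = pi_alt * norm.pdf(zs, mu_alt, 1) + (1 - pi_alt) * norm.pdf(zs, 0, 1)
    return np.log(mix).sum()

zs = [gene_z[g] for g in gene_set]
print("log Bayes factor (active vs. inactive):",
      round(log_lik(zs, True) - log_lik(zs, False), 2))
```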

14:45-15:00
Enumerating and Exploring the Space of Clonal Trees
Confirmed Presenter: Kendra Winhall, Carleton College, United States

Format: In Person


Authors List: Show

  • Bryce Bernstein, Carleton College, United States
  • Kendra Winhall, Carleton College, United States
  • Layla Oesper, Carleton College, United States

Presentation Overview: Show

Tumor growth is a complex evolutionary process initiated by an abnormal ancestor cell progressively gaining mutations, which eventually results in uncontrolled cell division. Researchers use a structure called a clonal tree--which depicts ancestral relationships between mutated cell populations--to represent a tumor's evolutionary history. Researchers have developed numerous methods of reconstructing clonal trees from tumor sequencing data. To evaluate the accuracy of their tree inference methods, researchers have devised many techniques to simulate clonal trees. However, previous research has not analyzed the space of clonal trees under different evolutionary models. Such exploration would help to better understand the underlying structure and characteristics of these spaces and create appropriate simulation procedures.

We analyzed four different categories of clonal trees, each with their own set of assumptions. For each category, we designed and implemented algorithms that provably generate all such clonal trees with a specified number of mutations. The Infinite Sites Assumption (ISA) is a common model that states that once a mutation is gained it is never lost or gained again. We analyzed two categories of ISA trees: one which only permits one new mutation per subpopulation of cells and one which allows multiple. We also relaxed the ISA by exploring k-Dollo trees which allow for each mutation to be deleted up to k times. These included a simplified subset, which we named Restricted 1-Dollo Trees, and the broader set of all 1-Dollo trees.

We analyzed the generated trees to discover patterns in the data across different assumptions. We investigated a frequently used simulation method for ISA trees and found that it was not representative of the corresponding full set of ISA trees. We then verified that an approach called Wilson's Algorithm, which is designed to generate uniform spanning trees, successfully generates representative samples of ISA trees. However, we are currently unaware of representative sampling methods for 1-Dollo trees.
The continued research and development of algorithms that appropriately sample trees for any of these groups will have implications on the evaluation and comparison of clonal tree inference methods.
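For reference, the sketch below shows a generic implementation of Wilson's Algorithm (loop-erased random walks) sampling a uniform spanning tree of the complete graph on labeled nodes, read as a rooted clonal-tree topology; it illustrates the sampling idea mentioned above rather than the authors' implementation:

```python
# Generic sketch of Wilson's algorithm (loop-erased random walks) for sampling a
# uniform spanning tree of the complete graph on n labeled nodes, here read as a
# rooted clonal-tree topology. This is an illustration of the sampling idea the
# abstract refers to, not the authors' implementation.
import random

def wilson_uniform_tree(n, root=0, seed=0):
    rng = random.Random(seed)
    in_tree = {root}
    parent = {root: None}
    for start in range(n):
        if start in in_tree:
            continue
        # Random walk from `start`; overwriting nxt[] performs the loop erasure.
        nxt = {}
        u = start
        while u not in in_tree:
            v = rng.choice([w for w in range(n) if w != u])
            nxt[u] = v
            u = v
        # Add the loop-erased path to the tree.
        u = start
        while u not in in_tree:
            parent[u] = nxt[u]
            in_tree.add(u)
            u = nxt[u]
    return parent   # parent[v] = ancestor clone of v; parent[root] = None

print(wilson_uniform_tree(6))
```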

15:00-15:15
Hydrocarbon degrading potential of microbes in Great Lakes sediments assessed through metagenomics
Confirmed Presenter: Yogita Warkhade, Michigan Technological University, United States

Format: In Person


Authors List: Show

  • Yogita Warkhade, Michigan Technological University, United States
  • Dr. Stephen Techtmann, Michigan Technological University, United States

Presentation Overview: Show

The Great Lakes are one of the world's largest freshwater systems, supporting diverse ecosystems, industries, and human populations. However, hydrocarbon pollution from industrial activities, oil transport, and urban runoff threatens their ecological health. Microorganisms play a crucial role in breaking down these pollutants through aerobic (oxygen-dependent) and anaerobic (oxygen-independent) processes, but their metabolic capabilities in Great Lakes sediments remain underexplored. This study employs metagenomic sequencing to investigate microbial diversity and hydrocarbon-degrading gene abundance in ambient sediment samples from five locations across the Great Lakes. The results showed that Proteobacteria was the most common microbial phylum, with Pseudomonas being the dominant hydrocarbon degrader. Aerobic hydrocarbon-degrading genes were the most abundant, particularly those involved in aliphatic hydrocarbon metabolism. Samples in Lake Michigan (station MI_141) and the Straits of Mackinac (station Strait_3F) had the highest hydrocarbon degradation potential, while a sample from Lake Superior (station Superior_4B) showed a higher proportion of anaerobic hydrocarbon-degrading genes relative to aerobic genes at the same site, possibly due to lower oxygen availability. To better understand the organisms that encode hydrocarbon-degrading genes, we binned metagenomes into metagenome-assembled genomes. The combination of binned and non-binned analyses helped capture both well-known and uncultured microbial contributors to hydrocarbon degradation. These findings provide valuable insights into the microbial potential for natural bioremediation of hydrocarbons in the Great Lakes. Understanding these microbial communities can help design better strategies to mitigate hydrocarbon pollution, ensuring the long-term health of these vital ecosystems.

15:15-15:30
Jaeger: an accurate and fast deep-learning tool to detect bacteriophage sequences
Confirmed Presenter: Yasas Wijesekara, Institute of Bioinformatics, University Medicine Greifswald, Felix-Hausdorff-Str. 8, 17475 Greifswald, Germany

Format: In Person


Authors List: Show

  • Yasas Wijesekara, Institute of Bioinformatics, University Medicine Greifswald, Felix-Hausdorff-Str. 8, 17475 Greifswald, Germany
  • Ling-Yi Wu, Theoretical Biology and Bioinformatics, Utrecht University, Padualaan 8, 3584 CH Utrecht, the Netherlands
  • Rick Beeloo, Theoretical Biology and Bioinformatics, Utrecht University, Padualaan 8, 3584 CH Utrecht, the Netherlands
  • Piotr Rozwalak, Faculty of Biological Sciences, Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, Germany
  • Ernestina Hauptfeld, Theoretical Biology and Bioinformatics, Utrecht University, Padualaan 8, 3584 CH Utrecht, the Netherlands
  • Swapnil P. Doijad, Faculty of Biological Sciences, Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, Germany
  • Bas E. Dutilh, Faculty of Biological Sciences, Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, Germany
  • Lars Kaderali, Institute of Bioinformatics, University Medicine Greifswald, Felix-Hausdorff-Str. 8, 17475 Greifswald, Germany

Presentation Overview: Show

Bacteriophages—the viruses that infect prokaryotes—are the most abundant biological entities on Earth and play fundamental roles in shaping microbial communities. Despite their ubiquity, the vast majority remain uncharacterized, constituting a significant fraction of unidentified sequences in metagenomic datasets. While deep learning-based tools have improved viral sequence identification, they often suffer from high false positive rates when analyzing divergent sequences [1]. To address this challenge, we introduce Jaeger, a homology-free deep learning framework designed to identify bacteriophage genome fragments from metagenome-assembled contigs.
Jaeger leverages a convolutional neural network (CNN) with dilated convolutions and six-frame amino acid parameter sharing to directly recognize protein-level signatures from six-frame translated nucleotide sequences. The model is trained to classify short nucleotide fragments into one of four categories: bacteria, archaea, eukaryote, and phage. For longer sequences, a sliding window approach aggregates predictions across multiple non-overlapping fragments to determine the final classification.
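Two of the ingredients described above, six-frame translation of a contig and aggregation of per-window scores, can be sketched as follows using Biopython (the scoring function is a stand-in for the trained CNN; this is not Jaeger's code):

```python
# Minimal illustration of two ingredients described above: six-frame translation of
# a contig and aggregation of per-window scores, using Biopython. This is not
# Jaeger's code; the scoring function is a stand-in for the CNN.
import numpy as np
from Bio.Seq import Seq

def six_frame_translations(nt_seq):
    """Translate a nucleotide sequence in all six reading frames."""
    seq = Seq(nt_seq)
    frames = []
    for strand in (seq, seq.reverse_complement()):
        for offset in range(3):
            sub = strand[offset:]
            sub = sub[: len(sub) - len(sub) % 3]     # trim to a codon multiple
            frames.append(str(sub.translate()))
    return frames

def classify_contig(nt_seq, window=2048, score_fn=None):
    """Average per-window 'phage scores' over non-overlapping windows."""
    score_fn = score_fn or (lambda frames: 0.5)       # stand-in for the trained CNN
    scores = [score_fn(six_frame_translations(nt_seq[i:i + window]))
              for i in range(0, max(len(nt_seq) - window + 1, 1), window)]
    return float(np.mean(scores))

print(six_frame_translations("ATGGCCATTGTAATGGGCCGC")[0])
print(classify_contig("ATGGCCATTGTAATGGGCCGC" * 400))
```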
While neural networks are highly sensitive, they can generate spurious predictions when encountering sequences that significantly deviate from the training distribution. To mitigate this, we incorporated a neural mean discrepancy-based [2] auxiliary model—termed the reliability model—to detect out-of-distribution samples at deployment, further improving performance.
Extensive benchmarking on the IMG/VR [3] database and real-world metagenomes reveals Jaeger’s consistently high sensitivity (0.87) and precision (0.92). Compared to state-of-the-art tools such as VirSorter2 and geNomad [4], Jaeger achieves similar classification accuracy while offering substantial computational speed improvements—running up to 20 times faster in CPU mode and 140 times faster with GPU acceleration. Its scalability allows it to process vast metagenomic datasets efficiently.
Application of Jaeger to approximately 16,000 metagenomic assemblies from the MGnify [5] database identified over five million putative phage contigs, highlighting its potential for uncovering hidden viral diversity. Additionally, Jaeger effectively identifies prophages and distinguishes viral sequences from bacterial, archaeal, and eukaryotic sequences. By integrating deep learning with reliability assessment, Jaeger enhances the robustness of viral sequence identification, making it a powerful tool for large-scale metagenomic studies.
Jaeger is open-source, easy to install, and supports GPU acceleration, making it accessible for large-scale analyses. Its ability to accurately and efficiently classify bacteriophage sequences will aid in uncovering viral diversity and advancing microbial ecology research.
Availability:
Code: https://github.com/MGXlab/Jaeger
Preprint: https://www.biorxiv.org/content/10.1101/2024.09.24.612722v1

Bibliography
1. Wu, L.-Y. et al. Benchmarking bioinformatic virus identification tools using real-world metagenomic data across biomes. Genome Biol. 25, 97 (2024).
2. Dong, X. et al. Neural Mean Discrepancy for Efficient Out-of-Distribution Detection. arXiv (2021) doi:10.48550/arxiv.2104.11408.
3. Camargo, A. P. et al. IMG/VR v4: an expanded database of uncultivated virus genomes within a framework of extensive functional, taxonomic, and ecological metadata. Nucleic Acids Res. 51, D733–D743 (2023).
4. Camargo, A. P. et al. Identification of mobile genetic elements with geNomad. Nat. Biotechnol. 42, 1303–1312 (2024).
5. Richardson, L. et al. MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Res. 51, D753–D759 (2023).

16:00-16:20
Proceedings Presentation: Revealing tissue architecture through the hypercomplex Fourier analysis of spatial transcriptomics data
Confirmed Presenter: H. Robert Frost, Dartmouth College, United States

Format: In Person


Authors List: Show

  • H. Robert Frost, Dartmouth College, United States

Presentation Overview: Show

We present an approach for analyzing spatial transcriptomics (ST) data using a quaternion-domain discrete two-dimensional Fourier transform. Quaternions are four dimensional hypercomplex numbers that have been primarily employed to represent rotations in computer graphics with biomedical applications focused on biomolecule structure and orientation. According to our proposed model, the quaternion associated with each location in an ST dataset represents a vector in R^3 whose length captures sequencing depth and whose direction captures three transcriptomic features (individual genes, gene sets or latent variables). This representation has several important benefits: 1) it enables the use of powerful Fourier-based image analysis techniques on a multi-dimensional representation of ST data, 2) it implies that transformations in transcriptomic state can be viewed as three-dimensional rotations with a corresponding representation as rotation quaternions, and 3) it supports an ST visualization that captures transcriptomic uncertainty. We demonstrate the features of this model through the analysis of Visium ST data and discuss how a similar model can be applied to single cell RNA-sequencing data. An implementation of this model, support for the hypercomplex Fourier analysis of ST data, and example vignettes are included in the QSC R package.
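The quaternion encoding described above can be sketched as follows (NumPy only; feature choices and array shapes are illustrative, and this is not the QSC package API):

```python
# Sketch of the quaternion encoding described above (not the QSC package API):
# each spatial location becomes a pure quaternion (0, x, y, z) whose direction is
# given by three per-spot transcriptomic features and whose length is the spot's
# sequencing depth. Feature choices and array shapes here are illustrative.
import numpy as np

def quaternion_field(features, depth, eps=1e-12):
    """features: (H, W, 3) transcriptomic features; depth: (H, W) sequencing depth.
    Returns an (H, W, 4) array of quaternions (w, x, y, z) with w = 0."""
    direction = features / (np.linalg.norm(features, axis=-1, keepdims=True) + eps)
    xyz = direction * depth[..., None]
    w = np.zeros(depth.shape + (1,))
    return np.concatenate([w, xyz], axis=-1)

rng = np.random.default_rng(0)
features = rng.random((8, 8, 3))        # e.g., three gene-set scores per Visium spot
depth = rng.integers(1_000, 10_000, (8, 8)).astype(float)
Q = quaternion_field(features, depth)
print(Q.shape, np.allclose(np.linalg.norm(Q[..., 1:], axis=-1), depth))
```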

16:20-16:40
Proceedings Presentation: ScaleSC: A superfast and scalable single cell RNA-seq data analysis pipeline powered by GPU
Confirmed Presenter: Haotian Zhang, Biogen, United States

Format: In Person


Authors List: Show

  • Wenxing Hu, Biogen, United States
  • Haotian Zhang, Biogen, United States
  • Yu Sun, Biogen, United States
  • Shaolong Cao, Biogen, United States
  • Jake Gagnon, Biogen, United States
  • Zhengyu Ouyang, BioInfoRx, United States
  • Yuka Moroishi, Biogen, United States
  • Baohong Zhang, Biogen, United States

Presentation Overview: Show

The rise of large-scale single-cell RNA-seq data has introduced data-processing challenges due to slow processing speeds. Leveraging advancements in GPU computing ecosystems, such as CuPy, and building on Scanpy and the rapids-singlecell package, we developed ScaleSC, a GPU-accelerated solution for single-cell data processing. ScaleSC delivers over a 20x speedup through GPU computing and significantly improves scalability, handling datasets of 10–40 million cells with over 1000 batches by overcoming the memory bottleneck on a single A100 card, far surpassing rapids-singlecell’s capacity of processing only 1 million cells without multi-GPU support. We also resolved discrepancies between GPU and CPU algorithm implementations to ensure consistency. In addition to core optimizations, we developed new advanced tools for marker gene identification, cluster merging, and more, with GPU-optimized implementations seamlessly integrated. Designed for ease of use, the ScaleSC package is compatible with Scanpy workflows, requiring minimal adaptation from users. The ScaleSC package (https://github.com/interactivereport/ScaleSC) promises significant benefits for the single-cell RNA-seq computational community.
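Since ScaleSC is described as compatible with Scanpy workflows, the kind of pipeline it accelerates looks like the standard Scanpy clustering workflow sketched below (plain Scanpy API on a small public dataset; ScaleSC's own function names are not reproduced here):

```python
# Minimal sketch of the standard Scanpy workflow that a Scanpy-compatible,
# GPU-accelerated package such as ScaleSC is described as accelerating. This uses
# plain Scanpy calls only; ScaleSC's own API is not reproduced here.
import scanpy as sc

adata = sc.datasets.pbmc3k()                      # small public demo dataset

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var.highly_variable].copy()
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.leiden(adata, resolution=1.0)               # clustering (requires leidenalg)
sc.tl.umap(adata)

print(adata.obs["leiden"].value_counts().head())
```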

16:40-16:55
ScCheck - Evaluating Data Quality in Plant Single-Cell RNA Analysis Through Denoising Techniques
Confirmed Presenter: Sania Zafar Awan, MU Institute of Data Science & Analytics, University of Missouri - Columbia, United States

Format: In Person


Authors List: Show

  • Sania Zafar Awan, MU Institute of Data Science & Analytics, University of Missouri - Columbia, United States
  • Edward Mirielli, MU Institute of Data Science & Analytics, University of Missouri - Columbia, United States
  • Timothy Haithcoat, MU Institute of Data Science & Analytics, University of Missouri - Columbia, United States
  • Marc Libault, Division of Plant Science and Technology, College of Agriculture, Food, and Natural Resources, University of Missouri-Columbia, Columbia, MO, United States

Presentation Overview: Show


Single-cell RNA sequencing (scRNA-Seq) significantly advances our ability to explore complex biological systems by providing gene expression profiles at the cellular level. However, this technology is still vulnerable to technical noise, including dropout effects and insufficient detection sensitivity, which can obscure authentic biological signals. While various denoising techniques have been proposed to address these challenges, their effectiveness has primarily been assessed using human and mouse datasets, creating a notable gap in understanding how these methods apply to plant systems.

This research develops a pipeline to thoroughly benchmark three advanced denoising methodologies, MAGIC, Deep Count Autoencoder (DCA), and scVI, applied to plant single-cell transcriptomics data. The study examines the impact of denoising on critical downstream analyses, including clustering accuracy, the resolution of transcriptional subpopulations, and the ability to recover marker genes. Additionally, we consider computational factors such as runtime efficiency, scalability, and reproducibility, which are crucial for integrating these methods into plant research workflows. In contrast to studies that prioritize marker gene discovery, this research positions denoising as a crucial step for enhancing data quality and interpretability within plant scRNA-seq workflows. The findings establish a replicable framework for benchmarking denoising methods in non-model organisms, highlighting specific trade-offs that researchers must consider when selecting a denoising strategy. It also offers options for automatic hyperparameter tuning of models like DCA and scVI.
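One of the downstream checks mentioned above, clustering accuracy, can be quantified for example as the adjusted Rand index against reference labels before and after denoising; the sketch below uses scikit-learn with synthetic data as a stand-in for MAGIC/DCA/scVI output:

```python
# Sketch of one benchmark described above: comparing clustering accuracy before and
# after denoising against reference cell-type labels with the adjusted Rand index.
# Data and the "denoising" step are synthetic stand-ins for MAGIC/DCA/scVI output.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 100)                       # reference cell types
signal = rng.normal(0, 1, (300, 50)) + labels[:, None]   # type-specific expression shift
raw = signal + rng.normal(0, 3, signal.shape)            # heavy technical noise
denoised = signal + rng.normal(0, 0.5, signal.shape)     # stand-in for a denoised matrix

for name, X in [("raw", raw), ("denoised", denoised)]:
    pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(name, "ARI =", round(adjusted_rand_score(labels, pred), 3))
```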

This study emphasizes plant datasets, addressing a critical need in agricultural genomics. It marks a step toward the innovative use of single-cell data for crop improvement, stress response investigations, and functional annotation. The results are designed to guide and facilitate the adoption of complex computational workflows within the field of plant single-cell research analysis.

16:55-17:10
Autoencoder Mixed Effects Deep Learning for the interpretable analysis of single cell RNA sequencing data by separately modeling batch-specific and batch-agnostic effects
Confirmed Presenter: Aixa X. Andrade, University of Texas Southwestern Medical Center, United States

Format: In Person


Authors List: Show

  • Aixa X. Andrade, University of Texas Southwestern Medical Center, United States
  • Son Nguyen, University of Texas Southwestern Medical Center, United States
  • Albert Montillo, University of Texas Southwestern Medical Center, United States

Presentation Overview: Show

Single-cell RNA sequencing data can provide unprecedented insights into cellular heterogeneity, yet batch effects arising from both technical and biological factors can obscure meaningful signals. We propose an autoencoder Mixed Effects Deep Learning framework, called aMEDL, that separately models batch-invariant and batch-specific variation to improve the suppression of batch effects, while preserving biologically relevant information. The aMEDL framework comprises two complementary autoencoder networks: an adversarial network that learns a batch-invariant representation, and a probabilistic network that learns batch-specific signals. This dual network approach explicitly models batch distributions rather than discarding them, capturing crucial biological variation that might otherwise be lost. We evaluate aMEDL across diverse datasets, including a single-cell dataset from cardiovascular tissue of healthy donors [1] and a single-nucleus dataset from subjects with Autism Spectrum Disorder (ASD) and Typically Developing (TD) individuals [2]. The framework is compared to the traditional method for scRNA-seq processing, principal component analysis (PCA), and to a newer neural network approach for data abstraction that uses a single autoencoder (AE) network. In both cases, the proposed framework outperforms the comparable methods. In the Healthy Heart dataset, while measuring batch separability via the mean Average Silhouette Width (ASW) with a range of -1.0 to +1.0, we find that aMEDL’s random effects subnetwork accurately captures batch differences (higher is better) with an ASW of +0.37, outperforming PCA (−0.48) and AE (−0.45). Meanwhile, its fixed effects component effectively suppresses batch signals in the latent space (lower is better), with an ASW of −0.50 compared to −0.48 (PCA) and −0.45 (AE). Additionally, using UMAP-based visualizations, aMEDL is observed to regularly outperform the comparable methods. For example, in the ASD dataset, it preserved cell type information that PCA did not and avoided spurious clusters observed from the AE approach. Similar favorable results were obtained in the ASD dataset, where the random effects subnetwork reliably captured donor-specific variations, demonstrating aMEDL’s ability to disentangle donor variability from shared biological signals. Overall, aMEDL not only eliminates undesired batch effects, but also maintains batch-specific differences, preventing overcorrection and false clustering. As the first deep learning framework to simultaneously model batch-invariant and batch-specific signals, aMEDL provides an interpretable, generative platform for uncovering disease mechanisms, donor variability, and technical artifacts in single-cell transcriptomics, ultimately paving the way for deeper insights into health and disease.
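The batch-separability metric used above, the mean Average Silhouette Width over batch labels, can be computed with scikit-learn as sketched below; the embeddings are synthetic stand-ins, not aMEDL's fixed- or random-effects latent spaces:

```python
# Sketch of the batch-separability metric used above: the mean silhouette width of
# batch labels computed on a latent embedding (range -1 to +1). Embeddings here are
# synthetic stand-ins, not aMEDL's fixed- or random-effects latent spaces.
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
batch = np.repeat([0, 1], 200)

# Batch-specific ("random effects") embedding: batches form separate clouds -> ASW near +1.
z_batch = rng.normal(0, 0.3, (400, 10)) + 3.0 * batch[:, None]
# Batch-invariant ("fixed effects") embedding: batches fully mixed -> ASW near 0 or below.
z_invariant = rng.normal(0, 1.0, (400, 10))

print("batch-specific ASW:  ", round(silhouette_score(z_batch, batch), 2))
print("batch-invariant ASW: ", round(silhouette_score(z_invariant, batch), 2))
```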

References
[1] Litvinukova, M. et al. Cells of the adult human heart. Nature 588, 466–472 (2020).
[2] Velmeshev, D. et al. Single-cell genomics identifies cell type-specific molecular changes in autism. Science 364, 685–689 (2019).

17:10-17:25
Enhanced Single-Cell Transcript Assembly via Discriminative Modeling of UMI-indexed and Internal Reads
Confirmed Presenter: Xiaofei Carl Zang, Pennsylvania State University, United States

Format: In Person


Authors List: Show

  • Xiaofei Carl Zang, Pennsylvania State University, United States
  • Mingfu Shao, Pennsylvania State University, United States

Presentation Overview: Show

Single-cell RNA sequencing (scRNA-seq) has revolutionized transcriptome profiling at cellular resolution. However, full-length transcript reconstruction remains a critical bottleneck. While technologies such as SMART-Seq3 combine UMI-linked reads that index and thread multiple reads with the same unique molecule, and internal reads that fill coverage gaps, existing assembly tools fail to leverage the distinct biological and statistical properties of these read types. For example, UMI reads exhibit distribution biases toward the 5' end and coverage sparsity, whereas internal reads resemble bulk RNA-seq data and increase both coverage and noise levels. Current methods, including reference-dependent approaches (e.g., scRNAss) and meta-assemblers (e.g., Aletsch, TransMeta, PsiClass), borrow additional information from either a reference transcriptome or other cells. Hence, they either sacrifice cell-specificity or overlook read-type heterogeneity, leading to compromised precision or sensitivity.

Here, we present Amaranth, a novel single-cell assembler that discriminatively integrates UMI and internal reads to achieve accurate, cell-specific transcript modeling without relying on external references or cross-cell aggregation. Amaranth employs a multi-tiered computational framework designed to address the distinct specificity, sensitivity, and strandness biases inherent to UMI-linked and internal reads. First, a noise-aware read classifier filters spurious internal reads and PCR duplicates while preserving unique molecular signals. Second, a discriminative algorithm fills UMI-read coverage gaps with internal reads while circumventing read contaminations, for example, from intron retentions. Third, precise detection of transcription start/end sites leverages UMI-read termini to anchor isoform boundaries. Finally, a probabilistic model infers unique transcript molecules by clustering reads using both splice junctions and UMI indices, minimizing isoform ambiguity. By explicitly distinguishing and weighting read types, Amaranth reduces false positives from non-specific internal reads while recovering transcripts undetected by UMI-dependent approaches.

Benchmarked on SMART-Seq3 datasets from human and mouse, Amaranth outperformed other state-of-the-art tools, including Scallop2 and StringTie2, achieving an increase in precision and sensitivity at the single-cell level. By enabling transcript-level resolution in scRNA-seq, Amaranth unlocks new avenues to study isoform-specific regulation across heterogeneous cell populations.

17:25-17:40
scGEN: Adaptive Weighting Strategy for Challenging Cell Clustering in Single-cell RNA-seq
Confirmed Presenter: Fang Huang, University of North Dakota, United States

Format: Live Stream


Authors List: Show

  • Fang Huang, University of North Dakota, United States
  • Dayu Hu, National University of Defense Technology, China
  • Renxiang Guan, Northeastern University, China
  • Hao Quan, Northeastern University, China
  • Brett McGregor, University of North Dakota, United States
  • Junguk Hur, University of North Dakota, United States
  • Jinghua Zhang, National University of Defense Technology, China
  • Kai Guo, University of North Dakota, United States

Presentation Overview: Show

Recent advances in transcriptional sequencing have greatly improved our ability to explore heterogeneity within cell populations, and a crucial component of that progress has been built on improvements in unsupervised clustering. The quality of clustering significantly influences downstream analyses. However, many current methods fail to adequately address challenging cell populations, particularly those at the boundaries between different cell states or developmental stages. This limitation risks the loss of essential biological insights. Attempts to apply hard-sample mining techniques from traditional machine learning to single-cell analysis have so far fallen short of capturing the complexity of cellular relationships, often missing key biological contexts. Furthermore, current methods typically rely on highly variable genes (HVGs) without considering differences in their expression levels. HVGs with lower expression may still carry important disease signals, and disregarding these subtle cues limits the model's ability to capture clinically relevant information. To overcome these limitations, we present a single-cell Gene-aware Embedded Network (scGEN), which was designed to capture the topological relationships among cells while accounting for challenging samples to enhance clustering accuracy. scGEN employs an adaptive weighting strategy to guide the network in concentrating intensively on the representative hard cells. It employs two fine-tuning parameters to prioritize HVGs based on their expression levels, allowing the model to detect nuanced, lowly-expressed signals that are crucial for understanding complex biological processes. Testing on eight independent scRNA-seq datasets demonstrated that scGEN consistently outperforms seven other leading clustering approaches, underscoring its effectiveness in revealing significant biological structures. Additionally, scGEN resolved 14 distinct cell populations from a human fetal pituitary dataset and identified 372 cells with discordant annotations from their published paper. The majority of the discordant cells were originally labeled as stem cells but, in fact, exhibited higher expression of various gonadotrope precursor markers (reclassified as Pre.Gonado). Moreover, scGEN refined cell-type assignments and uncovered subtle but biologically meaningful differences as demonstrated by differential expression and GO enrichment analysis results that provided a more consistent and robust view of cellular heterogeneity than existing approaches. These findings underscore scGEN’s ability to refine cell-type assignments and reveal subtle, biologically meaningful differences.

17:40-17:55
Improving rigor in defining distinct cell groups within single cell RNA-seq analysis
Confirmed Presenter: Sarah Munro, University of Minnesota, United States

Format: In Person


Authors List: Show

  • Sarah Munro, University of Minnesota, United States
  • Jeremy Chacón, University of Minnesota, United States

Presentation Overview: Show

Defining distinct cell groups within a population is a critical step in single cell RNA-seq (scRNA-seq) analyses. Although a variety of approaches exist for defining cell clusters in scRNA-seq experiments, it remains challenging for researchers to employ methods that maintain scientific objectivity and rigor, because some of these approaches allow too much subjectivity and enable misleading statistical analysis. Bias can be introduced when researchers overinterpret UMAP visualizations and gene markers, or when they use reference-based annotations that are not well-matched to their experimental data.

In light of these challenges, what should researchers do when defining cell groups in their scRNA-seq analysis? In our roles as bioinformaticians at the Minnesota Supercomputing Institute we have collectively analyzed many different types of scRNA-seq experiments. Our aim with this work is to demonstrate the practices that we use to gain confidence in our cell annotation and clustering.

When working with a well-understood tissue, reference-based cell typing is commonly applied. Here we emphasize the need to use multiple cell type references at different scales and compare annotation consistency with defined metrics. For example, in an experiment in which CD45+ immune cells were sorted from a tissue, one might expect that using an immune-cell-only reference would be sufficient, but we have observed that sorting can fail silently. We show the importance of using a broader reference (e.g. including epithelial cells) to prevent misannotation. In other cases, when reference-based annotation is ineffective or not possible (e.g. highly-specific cell populations or non-model organisms), we often rely on reference-free clustering algorithms to define cell groups followed by marker gene analysis.

In both of these approaches for defining cell groups it is important to mitigate bias and avoid misleading results. Based on our practical experience, we have identified a suite of appropriate published metrics and visualization tools that we can apply to assess the consistency of results across algorithms and limit subjectivity when we select our final cell group definitions. These recommendations are intended to be adaptable to a wide variety of scRNA-seq study designs and therefore should be beneficial to researchers across disciplines.

Wednesday, May 14th
8:45-9:00
Welcome - Day 2
Format: In person


Authors List: Show

9:00-10:00
Invited Presentation: Accelerate Discovery by Integrating AI, Statistics and Genomic Health Science
Confirmed Presenter: Xihong Lin

Format: In person


Authors List: Show

  • Xihong Lin

Presentation Overview: Show

Building a state-of-the-art ecosystem is pivotal for accelerating biological and health science discoveries and translational science. Such an ecosystem involves several pillars, including technology development, data fairness, scalable and powerful statistical and ML methods and tools, interpretable data analysis, and trustworthy decision-making. Rapid advancements in generative ML have revolutionized data utilization and enabled machines to learn from data more effectively. Statistics, as the science of learning from data while accounting for uncertainty, plays a pivotal role in addressing complex real-world problems and facilitating trustworthy decision-making. Massive multi-modal data have been generated in genomics and health. In this talk, I will discuss the challenges and opportunities in building such an ecosystem that integrates statistics, generative ML, and genomics and health. I will illustrate key points using the analysis of large-scale biobanks, whole genome sequencing data, and electronic health records, and demonstrate the power of scientific discovery using ML-generated synthetic data. This talk aims to stimulate proactive and thought-provoking discussions, foster interdisciplinary collaboration, and cultivate open-minded approaches to advance scientific discovery.

10:30-10:50
Proceedings Presentation: Asymmetric Integration of Various Cancer Datasets for Identifying Risk-Associated Variants and Genes
Confirmed Presenter: Ruixuan Wang, University of Michigan, United States

Format: In Person


Authors List: Show

  • Ruixuan Wang, University of Michigan, United States
  • Lam Tran, University of Michigan, United States
  • Ben Brennen, University of Michigan, United States
  • Lars Fritsche, University of Michigan, United States
  • Kevin He, University of Michigan, United States
  • Chad Brenner, University of Michigan, United States
  • Hui Jiang, University of Michigan, United States

Presentation Overview: Show

Cancer genomic research provides an opportunity to identify cancer risk-associated genes, but often suffers from undesirably low statistical power due to limited sample sizes. Integrated analysis with different cancers has the potential to enhance statistical power for identifying pan-cancer risk genes. However, substantial heterogeneity across various cancers makes this challenging. Recently, a novel asymmetric integration method was developed that can deal with data heterogeneity and exclude unhelpful datasets from the analysis. We adapted and applied this method to integrate genotype datasets with matched case and control individuals from the Michigan Genomics Initiative (MGI), using each cancer in turn as the primary dataset of interest and the other cancers as auxiliary datasets. Conditional logistic regression models were coupled with the asymmetric integration framework to handle the matched case-control study design, and permutation tests were performed to control for false discovery rates (FDR). At the same FDR level, the integrated analysis found more potential genetic variants and genes that are associated with the risks of various cancers, showcasing the promise of the proposed approach for integrated analysis of cancer datasets.
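A generic sketch of permutation-based FDR control of the kind mentioned above is shown below (synthetic statistics; not the authors' conditional logistic regression pipeline):

```python
# Generic sketch of permutation-based FDR control: observed association statistics
# are compared against statistics recomputed on permuted case/control labels, and
# the empirical FDR at a threshold is the expected number of null discoveries
# divided by the observed discoveries. Statistics here are synthetic; this is not
# the authors' conditional logistic regression pipeline.
import numpy as np

rng = np.random.default_rng(0)
n_variants, n_perms = 1_000, 200
observed = np.abs(rng.normal(0, 1, n_variants))
observed[:30] += 3.0                                          # a few truly associated variants
permuted = np.abs(rng.normal(0, 1, (n_perms, n_variants)))    # stats under permuted labels

def empirical_fdr(threshold):
    discovered = (observed >= threshold).sum()
    null_expected = (permuted >= threshold).sum() / n_perms
    return null_expected / max(discovered, 1)

for t in (2.5, 3.0, 3.5):
    print(f"threshold {t}: {int((observed >= t).sum())} discoveries, "
          f"empirical FDR ~ {empirical_fdr(t):.3f}")
```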

10:50-11:05
Genotyping CFTR with next-generation sequencing data using T1K
Confirmed Presenter: Yifei Gao, Dartmouth College, United States

Format: In Person


Authors List: Show

  • Yifei Gao, Dartmouth College, United States
  • Li Song, Dartmouth College, United States

Presentation Overview: Show

Cystic fibrosis (CF), one of the most fatal monogenic diseases in the United States, is caused by mutations in the CFTR (cystic fibrosis transmembrane conductance regulator) gene on Chromosome 7. These mutations lead to malfunction of the CFTR protein, which is present in every organ of the body, resulting in thick, sticky mucus that causes blockages and traps germs, leading to recurrent infections. Over 1,000 CFTR variants have been identified, with the most common being the deletion of three base pairs, resulting in the loss of the amino acid phenylalanine at position 508 (F508del) of the protein. These variants are classified into six categories based on their functional consequences. Accurate genotyping of CFTR is essential for patient stratification and the development of precision medicine strategies.

T1K is a powerful computational method developed by our lab that can robustly genotype highly polymorphic genes, such as HLA and KIR genes, from sequencing data. T1K genotypes genes by identifying abundant alleles from the reference allele database based on read alignments. In detail, it implements a weighted expectation-maximization (EM) algorithm to simultaneously compute allele abundances across all the genes in the database. Unlike many HLA and KIR genotyping methods that strictly rely on the IPD-IMGT/HLA and IPD-KIR databases, T1K is flexible about the reference database format and can be extended to genotype other genes.
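A toy expectation-maximization for allele abundances from a read-to-allele compatibility matrix, in the spirit of the approach described above, is sketched below; this is not T1K's implementation, which additionally weights reads by alignment quality:

```python
# Toy expectation-maximization for allele abundance estimation from read-to-allele
# compatibility, in the spirit of the approach described above. This is not T1K's
# implementation; real data would add alignment-quality weights per read.
import numpy as np

# compat[r, a] = 1 if read r aligns to allele a (toy data: 3 candidate alleles)
compat = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
], dtype=float)

n_reads, n_alleles = compat.shape
abundance = np.full(n_alleles, 1.0 / n_alleles)

for _ in range(100):
    # E-step: responsibility of each allele for each read
    weighted = compat * abundance
    resp = weighted / weighted.sum(axis=1, keepdims=True)
    # M-step: abundances are the average responsibilities
    new = resp.mean(axis=0)
    if np.allclose(new, abundance, atol=1e-9):
        break
    abundance = new

print(np.round(abundance, 3))   # estimated relative allele abundances
```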

In this study, we adapt the T1K framework to analyze RNA and DNA sequencing data for the CFTR genotyping. We developed a method to generate the reference allele database for T1K by systematically inducing mutations for each CFTR variant given their variant names curated in CFTR2 database (https://www.cftr2.org/). We applied T1K to multiple publicly available sequencing datasets from NCBI SRA and successfully identified CFTR genotypes across diverse patient samples. Our work provides a novel solution for CFTR genotyping in clinical and research settings using sequencing data. Our approach also demonstrates T1K’s generalizability in genes other than HLA and KIR, thus proving its potential for broad applicability in genetic studies. All code and analysis pipelines developed for this study are publicly available at T1K’s GitHub repository (https://github.com/mourisl/T1K).

11:05-11:20
Statistical considerations for allele-specific mQTL association analysis using long-read nanopore-based DNA sequencing
Confirmed Presenter: Nicholas Larson, Mayo Clinic College of Medicine and Science, United States

Format: In Person


Authors List: Show

  • Nicholas Larson, Mayo Clinic College of Medicine and Science, United States
  • Shannon McDonnell, Mayo Clinic College of Medicine and Science, United States
  • Yijun Tian, H. Lee Moffitt Cancer Center & Research Institute, United States
  • Liang Wang, H. Lee Moffitt Cancer Center & Research Institute, United States

Presentation Overview: Show

Common single-nucleotide polymorphisms (SNPs) that are associated with complex traits are believed to be primarily regulatory in nature, conferring effects via gene expression dysregulation. Some of these may act as methylation quantitative trait loci (mQTL), whereby epigenetic alterations mediate these SNP regulatory effects. Traditionally, mQTL studies require separate collection of genetic and epigenetic datasets on the same set of samples, and next-generation sequencing methods for methylation profiling are prone to biases from PCR amplification and bisulfite conversion. Third-generation single-molecule nanopore DNA sequencing offers multiple advantages over these established approaches for mQTL analysis, notably ultra-long read length and simultaneous characterization of 5mC and 5hmC nucleotide modifications, yielding both necessary data elements in one experiment. Like other sequencing-based data, mQTL analyses using nanopore reads are amenable to count-based modelling, which can leverage both sequencing depth and number of altered reads in modeling base methylation probabilities. Herein, we propose a statistical framework that explicitly leverages long-range phase information in nanopore-based mQTL association analyses. First, we discuss a two-stage strategy for combining read-backed and statistical phasing utilizing LongPhase and ShapeIT4 to yield high-confidence chromosomally phased genetic and epigenetic data. We next define extensions of (quasi-)binomial models for allele-specific mQTL analysis, notably in handling read-phase uncertainty within and across individual samples. We examine the properties of these methods under various conditions via simulation, including latent phase switch errors for distal mQTL analyses. Finally, we illustrate our methods via real data application using nanopore DNA sequencing data on 28 normal prostate tissue samples to validate previously identified mQTLs linked to prostate cancer risk as well as exploring novel long-range mQTL associations afforded by high-confidence chromosomal phasing.
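A count-based allele-specific test in the spirit described above can be sketched with a binomial GLM on phased allele labels (statsmodels, synthetic counts; this is not the authors' quasi-binomial framework with phase-uncertainty handling):

```python
# Sketch of a count-based allele-specific mQTL test in the spirit described above:
# per-haplotype methylated/unmethylated read counts at a CpG are modeled with a
# binomial GLM on the phased SNP allele. Counts are synthetic; this is not the
# authors' quasi-binomial framework with phase-uncertainty handling.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_haplotypes = 60
allele = rng.integers(0, 2, n_haplotypes)                  # phased SNP allele per haplotype
depth = rng.integers(10, 40, n_haplotypes)                 # nanopore reads covering the CpG
p_meth = np.where(allele == 1, 0.7, 0.4)                   # allele-specific methylation rate
methylated = rng.binomial(depth, p_meth)

endog = np.column_stack([methylated, depth - methylated])  # (successes, failures)
exog = sm.add_constant(allele.astype(float))
fit = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
print("allele log-odds effect:", round(fit.params[1], 2), " p =", f"{fit.pvalues[1]:.2e}")
```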

11:20-11:35
GWAS Meta-Analysis of Admixed Populations (GMAX) uses local ancestry inference to identify associated loci in GSCAN meta-analysis
Confirmed Presenter: Natashia Benjamin, Penn State College of Medicine, United States

Format: In Person


Authors List: Show

  • Natashia Benjamin, Penn State College of Medicine, United States
  • Dajiang Liu, Penn State College of Medicine, United States

Presentation Overview: Show

Admixed populations possess ancestry from multiple continental source groups, resulting in a unique mosaic genome structure derived from distinct continental ancestries. Hence, it is important to properly analyze admixed population genomes, accounting for heterogeneity in effect sizes and linkage disequilibrium structure. Existing methods, for example TRACTOR, have already shown that incorporating local ancestry information in genome-wide association studies (GWAS) can increase the power of discovering variant-trait associations, especially for admixed populations. Despite this fact, there are no current methods that have incorporated local ancestry information in meta-analysis. Here, we develop a method, GMAX Local Ancestry Inference (GMAX-LAI), to incorporate local ancestry across the genome of admixed individuals for GWAS meta-analysis. We first estimate ancestry proportions at a given variant for an admixed study by decomposing allele frequencies as a weighted sum of allele frequencies of continental ancestries. By comparing with RFMix, a commonly used individual-level LAI method, our approach provides comparable estimation. These ancestral estimates are later incorporated into our mixed-effects meta-regression model to model genetic effects in our meta-analysis. We apply our method to GSCAN (GWAS & Sequencing Consortium of Alcohol and Nicotine use) smoking and drinking traits with a diverse ancestry background (55% European, 15% African American, 6% Latino/Hispanic American, 24% East Asian). For African American studies, the proportions, on average, range from 64%-100% for African ancestry and 0%-36% for European ancestry. For Latino/Hispanic studies, the estimated average compositions are 65%-82% European, 0%-17% African and 20%-28% Native American. We also observe significant ancestry proportion differences across studies, reflecting substantial study-specific local ancestry genetic structure. By meta-analyzing 121 studies, our method identifies 444 loci associated with the ‘Drinks per Week’ (DrnkWk) trait, 32 loci associated with the ‘Age at Smoking Initiation’ (AgeSmk) trait, 74 loci associated with the ‘Cigarettes per Day’ (CigDay) trait, 76 loci associated with the ‘Smoking Cessation’ (SmkCes) trait and 930 loci associated with the ‘Smoking Initiation’ (SmkInit) trait. Comparing our results to MEMO, a global ancestry model, our local ancestry model was able to identify novel loci mapped to MSANTD4, RIT2, TENM4, GAS2L3, STARD9, UXS1 and NLGN1. Overall, our model highlights the benefits of including local ancestry information for admixed individuals under a GWAS meta-analysis setting. The application of our method to GSCAN provides a significant step forward in understanding the genetic architecture of tobacco and alcohol use in admixed populations.
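The allele-frequency decomposition step described above can be sketched as a non-negative least-squares fit of study frequencies against continental reference frequencies (synthetic panels; not the GMAX-LAI estimator itself):

```python
# Sketch of the allele-frequency decomposition idea described above: a study's
# observed allele frequencies are modeled as a weighted sum of continental reference
# frequencies, with non-negative weights renormalized to proportions. Reference
# panels and frequencies are synthetic; this is not the GMAX-LAI estimator itself.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n_variants = 500
# Columns: reference allele frequencies for AFR, EUR, NAT ancestries (synthetic)
ref_freq = rng.uniform(0.05, 0.95, (n_variants, 3))
true_w = np.array([0.75, 0.22, 0.03])                    # e.g., an African American study
study_freq = ref_freq @ true_w + rng.normal(0, 0.01, n_variants)

weights, _ = nnls(ref_freq, study_freq)                  # non-negative least squares
weights /= weights.sum()                                 # renormalize to proportions
print("estimated ancestry proportions:", np.round(weights, 3))
```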

11:35-11:55
Proceedings Presentation: Exploration of Chaos Game Representation and Integrative Deep Learning Approaches for Whole-genome Sequencing-Based Grapevine Genetic Testing
Confirmed Presenter: Ping Liang, Department of Biological Sciences, Brock University, Canada

Format: In Person


Authors List: Show

  • Andrew Vu, Brock University, Canada
  • Brendan Park, Brock University, Canada
  • Yifeng Li, Department of Computer Science, Brock University, Canada
  • Ping Liang, Department of Biological Sciences, Brock University, Canada

Presentation Overview: Show

The identification of grapevine species, cultivars, and clones associated with desired traits such as disease resistance, crop yield, and crop quality is an important component of viticulture. True-to-type identification has proven to be very challenging for grapevine due to the existence of a large number of cultivars and clones and the historical issues of synonyms and homonyms. DNA-based identification, superior to morphology-based methods in accuracy, is the current standard true-to-type method for grapevine, but not without shortcomings, such as a limited number of biomarkers and limited accessibility of services. To overcome some of the limitations of traditional microsatellite marker-based genetic testing, we explored a whole-genome sequencing (WGS)-based approach that takes advantage of the latest next-generation sequencing (NGS) technologies to achieve the best accuracy at affordable cost. To address the challenges posed by the extremely high-dimensional nature of WGS data, we examined the effectiveness of using Chaos Game Representation (CGR) to represent the genome sequence data and the use of deep learning for visual analysis in species and cultivar identification. We found that CGR images provide a meaningful way of capturing patterns and motifs for use with visual analysis, with the best prediction results demonstrating a 99% mean balanced accuracy in classifying a subset of five species, and 80% mean balanced accuracy in classifying 41 cultivars. Our preliminary research highlights the potential of CGR and deep learning as a complementary tool for WGS-based species-level and cultivar-level classification.
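
For readers unfamiliar with CGR, the sketch below builds a standard frequency Chaos Game Representation (FCGR) image from a DNA sequence, the kind of k-mer grid that can be fed to an image-based deep learning model; it is a generic construction (bit-order conventions vary between implementations), not the authors' exact pipeline.

    import numpy as np

    def fcgr(seq: str, k: int = 6) -> np.ndarray:
        """Return a (2**k, 2**k) k-mer frequency matrix (frequency CGR) for a DNA sequence."""
        corner = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}  # corners of the CGR square
        n = 2 ** k
        grid = np.zeros((n, n))
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if any(base not in corner for base in kmer):
                continue                      # skip k-mers containing ambiguous bases (e.g., N)
            x = y = 0
            for base in kmer:                 # each base contributes one bit of the cell index
                cx, cy = corner[base]
                x = (x << 1) | cx
                y = (y << 1) | cy
            grid[y, x] += 1
        return grid / max(grid.sum(), 1)      # normalize counts to frequencies

    image = fcgr("ACGTACGTGGCCAATT" * 100, k=4)   # toy sequence; real input would be a whole genome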

13:30-14:30
Invited Presentation: Computational methods for improved inference of tumor evolution
Confirmed Presenter: Layla Oesper

Format: In Person


Authors List: Show

  • Layla Oesper

Presentation Overview: Show

Tumors develop through an evolutionary process in which distinct sets of genomic mutations accumulate in different cell lineages descending from an original founder cell. A better understanding of how such tumor lineages evolve over time, which mutations occur together or separately, and in what order these mutations were gained may yield important insight into cancer and how to treat it. Thus, in recent years there has been an increased interest in computationally inferring the evolutionary history of a tumor – that is, a rooted tree where vertices represent populations of cells that have a unique complement of somatic mutations and edges represent ancestral relationships between these populations. However, accurately inferring these trees from sequencing data is often challenging. In this talk, I will describe computational methods that address issues related to the inference and analysis of tumor evolution.

14:30-14:45
NetCIS: A Network-based Common Insertion Site Analysis of Case-Control Sleeping Beauty Screens
Confirmed Presenter: Mathew Fischbach, Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA, United States

Format: In Person


Authors List: Show

  • Mathew Fischbach, Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA, United States
  • Dan Reiter, Department of Immunology, Mayo Clinic, Rochester, MN, USA, United States
  • Wen Wang, Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA, United States
  • Chad Myers, Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA, United States
  • Laura Rogers, Department of Immunology, Mayo Clinic, Rochester, MN, USA, United States

Presentation Overview: Show

Forward genetic screens are powerful tools for novel biological discoveries because they do not require any a priori knowledge of genes controlling a phenotype of interest. The Sleeping Beauty (SB) transposon system is a commonly employed genetic screen tool that can induce both gain- and loss-of-function mutations via insertional mutagenesis. SB screens have successfully discovered cancer driver genes that could serve as novel therapeutic targets using a single experimental group. More recently, SB technologies have been extended to other applications, including mechanisms of drug resistance and immunotherapy target discovery. These screen designs commonly include selected (case) and unselected (control) groups. A plethora of sequencing approaches have been developed to identify common insertion sites (CIS), and many computational tools have been created to identify which mutational insertions are relevant to the biology being studied. Importantly, existing tools do not implement data-driven case-control comparisons capable of analyzing screens with this design. To this end, we developed NetCIS, a Network-based Common Insertion Site analysis tool, to robustly identify CIS in a case-control phenotype selection screen. NetCIS performed well on a previously published SB screen dataset, recapitulating the top CIS identified using an alternate tool. In addition, NetCIS identified several novel unannotated CIS that were missed by other methods. NetCIS code and an accompanying tutorial can be found at https://github.com/RogersLabGroup/NetCIS.

14:45-15:00
GRN inference optimization with mammalian gold standard datasets
Confirmed Presenter: Seyifunmi Owoeye, Cincinnati Children's Hospital Medical Centre, United States

Format: In Person


Authors List: Show

  • Seyifunmi Owoeye, Cincinnati Children's Hospital Medical Centre, United States
  • Alexandar Katko, Cincinnati Children's Hospital Medical Centre, United States
  • Joseph Wayman, Cincinnati Children's Hospital Medical Centre, United States
  • Ellie Kim, University of Cincinnati, United States
  • Svetlana Korinfskaya, Cincinnati Children's Hospital Medical Centre, United States
  • Adenike Shittu, Cincinnati Children's Hospital Medical Centre, United States
  • Artem Barski, Cincinnati Children's Hospital Medical Centre, United States
  • Emily Miraldi, Cincinnati Children's Hospital Medical Centre, United States

Presentation Overview: Show

Single-cell RNA-sequencing (scRNA-seq) provides a quantitative, genome-scale approximation of cell behavior within complex tissue environments. Gene regulatory networks (GRNs) describe the control of gene expression by transcription factors (TFs) and are thus a critical engineering tool linking cellular behaviors to targetable molecular regulatory mechanisms. State-of-the-art GRN inference methods use prior information (knowledge of TF binding, promoter-enhancer interactions), often derived from parallel single-nuclei (sn)ATAC-seq (chromatin accessibility) data, to improve GRN inference accuracy from scRNA-seq data. For example, our method, the Inferelator, models gene expression as a multivariate linear function of protein TF activities, where an ATAC-derived prior of TF-gene interactions is used to (1) estimate TF activities and (2) guide GRN inference via an adaptive LASSO penalty. Even within our particular modeling framework, numerous modeling decisions impact the quality of GRN inference, including, but not limited to, (1) methods for prior construction from ATAC-seq, (2) how and whether to incorporate generic prior information sources (e.g., literature) when ATAC data are available, and (3) the resolution of gene expression data (single-cell or pseudobulk).

Here, we present two relevant benchmark datasets to enable the assessment of key modeling decisions within the Inferelator and comparison to state-of-the-art methods (SCENIC+, CellOracle, and others). Critically, our benchmarks utilize multiome-seq designs (tandem snRNA-seq and snATAC-seq) from complex mammalian settings that mirror our target GRN inference applications: (1) murine CD4 T cells derived from young and aged tissue contexts (77k cells) and (2) dynamic response of human CD4 T cell populations to T cell receptor activation (66k cells). Equally important, for each benchmark (mouse physiological, human dynamic), we have curated gold-standard TF-gene interactions supported by both TF binding (e.g., ChIP-seq) and TF functional (e.g., perturbation followed by RNA-seq) data. Overall, we find that best practices for the Inferelator depend on context (dynamic versus steady-state) and technical factors (e.g., size of gene expression dataset), supporting that knowledge of how to use a GRN inference tool (as opposed to which tool is used) is critical to achieving quality GRN inference in mammalian settings.
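
As a generic illustration of the prior-guided adaptive LASSO step mentioned above, the sketch below down-weights the penalty on TFs supported by an ATAC-derived prior using the standard feature-rescaling reparameterization with scikit-learn; the data, penalty weights, and alpha are placeholders, and this is not the Inferelator code.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    tf_activity = rng.normal(size=(500, 50))          # samples x estimated TF activities
    target_expr = (tf_activity[:, 3] - 0.5 * tf_activity[:, 7]
                   + rng.normal(scale=0.5, size=500)) # expression of one target gene

    prior = np.zeros(50, dtype=bool)
    prior[[3, 11]] = True                             # TF-target edges supported by the ATAC prior

    penalty = np.where(prior, 0.25, 1.0)              # smaller penalty for prior-supported TFs
    X_scaled = tf_activity / penalty                  # reparameterize the weighted L1 penalty

    fit = Lasso(alpha=0.05).fit(X_scaled, target_expr)
    coefs = fit.coef_ / penalty                       # map coefficients back to the original scale
    print(np.flatnonzero(coefs))                      # TFs selected as regulators of this target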

15:00-15:15
Network medicine-based epistasis detection in complex diseases: ready for quantum computing
Confirmed Presenter: Markus Hoffmann, National Institutes of Health, United States

Format: In Person


Authors List: Show

  • Markus Hoffmann, National Institutes of Health, United States

Presentation Overview: Show

Most heritable diseases are polygenic. To comprehend the underlying genetic architecture, it is crucial to discover the clinically relevant epistatic interactions (EIs) between genomic single nucleotide polymorphisms (SNPs) (1–3). Existing statistical and computational methods for EI detection are mostly limited to pairs of SNPs due to the combinatorial explosion of higher-order EIs. With NeEDL (network-based epistasis detection via local search), we leverage network medicine to inform the selection of EIs that are an order of magnitude more statistically significant compared to existing tools and consist, on average, of five SNPs. We further show that this computationally demanding task can be substantially accelerated once quantum computing hardware becomes available. We apply NeEDL to eight different diseases and discover genes (affected by EIs of SNPs) that are partly known to affect the disease; additionally, these results are reproducible across independent cohorts. EIs for these eight diseases can be interactively explored in the Epistasis Disease Atlas (https://epistasis-disease-atlas.com). In summary, NeEDL demonstrates the potential of seamlessly integrated quantum computing techniques to accelerate biomedical research. Our network medicine approach detects higher-order EIs with unprecedented statistical and biological evidence, yielding unique insights into polygenic diseases and providing a basis for the development of improved risk scores and combination therapies.

1. Heap G.A., Trynka G., Jansen R.C., Bruinenberg M., Swertz M.A., Dinesen L.C., Hunt K.A., Wijmenga C., Vanheel D.A., Franke L. Complex nature of SNP genotype effects on gene expression in primary human leucocytes. BMC Med. Genom. 2009; 2:1.

2. Bush W.S., Moore J.H. Chapter 11: Genome-wide association studies. PLoS Comput. Biol. 2012; 8:e1002822.

3. MacArthur J., Bowler E., Cerezo M., Gil L., Hall P., Hastings E., Junkins H., McMahon A., Milano A., Morales J. et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 2017; 45:D896–D901.

15:15-15:30
GEMINI: A Breakthrough System for Robust Gene Regulatory Network Discovery, Enabling the Application of GRNs to Industrial Level Genetic Engineering
Confirmed Presenter: Ridhi Gutta, Academies of Loudoun, United States

Format: Live Stream


Authors List: Show

  • Ridhi Gutta, Academies of Loudoun, United States

Presentation Overview: Show

In order to resolve crucial global issues, the widespread application of genetic engineering at an industrial level is key. Effective genetic engineering at an industrial scale hinges heavily on precise cellular control of the microorganism at hand. However, the majority of synthetically engineered strains fail at the industrial level due to disruptions in gene regulation. This stems from a lack of understanding and usage of gene regulatory networks (GRNs), which control cellular processes and metabolism. Research shows that effective manipulation of host GRNs and effective introduction of synthetic GRNs can significantly improve product yield and functionality. However, current GRN inference tools are extremely slow, inaccurate, and incompatible with industrial-scale processes; as a result, there are no complete expression-based GRNs for any commonly used organism, limiting the application of GRNs as a practical tool in genetic engineering at the industrial level. This research proposes a novel computational system, GEMINI, to enable fast and efficient GRN inference for integration into industrial-scale pipelines. GEMINI consists of two main parts. First, I create a novel information-theoretic algorithm that replaces traditional sequential inference and calculation methods, ensuring compatibility with parallel processing. Second, I integrate a novel GNN architecture based on spectral convolution to bypass intensive eigenvalue computation and efficiently learn global and local regulatory structures. On the DREAM4 and DREAM5 in silico benchmarks, GEMINI outperforms all industry leaders in terms of AUROC and AUPRC, achieving a nearly 300% increase in AUPRC compared to the industry-leading method, GENIE3. When applied to a real biological E. coli dataset, GEMINI not only recovered 98% of existing interactions but also discovered 468 novel candidate interactions, which were validated against literature. Thus, GEMINI was able to construct the most complete expression-based GRN of E. coli to date, providing a novel biological blueprint for genetic engineers to use at the industrial level. GEMINI removes reliance on expensive computing equipment and enables fast and accurate GRN inference for the first time, opening doors to more efficient gene expression control and metabolic pathway manipulation for more effective application of genetic engineering at an industrial level.

16:00-16:15
VECTr: Facilitating Visual Comparison of Clonal Trees
Confirmed Presenter: Thea Traw, Carleton College, United States

Format: In Person


Authors List: Show

  • Thea Traw, Carleton College, United States
  • Quoc Nguyen, University of Minnesota - Twin Cities, United States
  • Layla Oesper, Carleton College, United States
  • Eric Alexander, Carleton College, United States

Presentation Overview: Show

Cancer is an evolutionary process, in that a tumor results from a series of genetic mutations that are acquired over time. This series of mutations is often represented as a clonal tree: a particular type of rooted tree where vertices represent tumor cell clones and edges represent ancestral relationships between them. Such trees have the potential for impact in a variety of ways, from clinical settings, in which understanding a tumor's evolution may lead to more targeted therapies, to a broader understanding of how tumors evolve. Comparing similar clonal trees is an important task and is an essential part of analysis whenever a new algorithm is developed to infer clonal trees (usually from sequencing data). Distance measures have been developed that compare clonal trees and allow a user to quantify how different two clonal trees are from each other. While such distance measures are incredibly useful, as they allow for direct comparison of which trees are more or less similar, they would be much more effective if it were easier to identify exactly which parts of the clonal trees contribute to the calculated distances, rather than simply reporting a single numerical value.

In this work, we describe a set of visual encodings that unpack differences across clonal tree structure, giving researchers a sense of where and how two trees differ from one another. These encodings were informed by interviews with computational cancer biologists, from which we drew a collection of key tree difference classes. We then introduce a tool called VECTr (Visual Encodings for Clonal Trees) that affords pairwise comparison of clonal trees across three different visualizations: a node-link diagram overlaid with information from a user-selected distance measure; a heatmap matrix encoding changes in relationships between mutations across the two trees; and a tripartite graph highlighting the shifts of individual mutations. All three visualizations allow for more granular interpretation of which parts of the clonal trees contributed to the inferred total distance. We afford comparison using three distance measures: parent-child, ancestor-descendant and distinct lineage. We demonstrate the utility of our visualization tool using a variety of different use cases, including application to both simulated and real clonal trees inferred from sequencing data.
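
To make one of these distance measures concrete, the sketch below computes a simplified parent-child distance (the symmetric difference of parent-child mutation pairs) for trees in which each clone carries a single mutation; definitions in the literature generalize this to multi-mutation clones, and this is not VECTr's implementation.

    def parent_child_distance(tree_a: dict, tree_b: dict) -> int:
        """Symmetric difference of parent-child mutation pairs between two single-mutation-per-clone trees."""
        edges_a = {(parent, child) for child, parent in tree_a.items() if parent is not None}
        edges_b = {(parent, child) for child, parent in tree_b.items() if parent is not None}
        return len(edges_a ^ edges_b)      # pairs present in exactly one of the two trees

    # Trees are {child_mutation: parent_mutation} maps with the root's parent set to None.
    t1 = {"TP53": None, "KRAS": "TP53", "PIK3CA": "KRAS"}
    t2 = {"TP53": None, "KRAS": "TP53", "PIK3CA": "TP53"}
    print(parent_child_distance(t1, t2))   # 2: (KRAS, PIK3CA) only in t1, (TP53, PIK3CA) only in t2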

16:15-16:30
Understanding segmental duplications and genomic evolvability through network analysis
Confirmed Presenter: Saiful Islam, Institute for Artificial Intelligence and Data Science, State University of New York at Buffalo, Buffalo, NY, USA, United States

Format: In Person


Authors List: Show

  • Saiful Islam, Institute for Artificial Intelligence and Data Science, State University of New York at Buffalo, Buffalo, NY, USA, United States
  • Alber Aqil, Department of Biological Sciences, State University of New York at Buffalo, Buffalo, NY, USA, United States
  • Faraz Hach, Vancouver Prostate Centre and Department of Urologic Sciences, University of British Columbia, Canada., Canada
  • Ibrahim Numanagić, Department of Computer Science, University of Victoria, Victoria, BC, Canada, Canada
  • Naoki Masuda, Department of Mathematics, State University of New York at Buffalo, Buffalo, NY, USA, United States
  • Omer Gokcumen, Department of Biological Sciences, State University of New York at Buffalo, Buffalo, NY, USA, United States

Presentation Overview: Show

Evolvability, the ability of populations to adapt to selection pressures, is shaped by genomic structures like segmental duplications. Segmental duplications introduce genomic redundancy, enabling the emergence of new functions from duplicated elements. Using long-read genome assemblies, we analyzed segmental duplications in 117 vertebrate species and an outgroup (i.e., starfish) [1]. We constructed a segmental duplication network for each species, representing the relationships between duplicated parts of the genome and capturing the species' duplication landscape. We then quantified these landscapes using 13 network properties (e.g., network density, average clustering coefficient) and tested three evolutionary hypotheses: (1) selective constraint: minimal variation in the segmental duplication landscape among vertebrates, indicating that strong selective pressures constrain genomic structural evolvability; (2) phylogenetic drift: the duplication landscape has drifted gradually along evolutionary lineages from the most recent common ancestor; and (3) species-specific dynamics: the segmental duplication landscape is highly diverse and does not align closely with the phylogeny. Our findings reveal that segmental duplication profiles in vertebrates are predominantly driven by fast-evolving, species-specific dynamics, supporting hypothesis (3) and thereby fostering unique adaptive potentials. While the examination of genome size variation across vertebrates revealed no significant correlation between genome size and duplication metrics, lineage-specific events, such as whole-genome duplications in ray-finned fish (e.g., sterlet sturgeon and brown trout), do influence genome size. These results highlight the role of segmental duplications in shaping species-specific adaptability and underscore the effectiveness of network-based approaches in genomic research. Our dataset and analytical framework serve as valuable resources for exploring genomic evolvability, bridging evolutionary biology and network analysis to advance the study of genome evolution.
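
Two of the network properties mentioned above, density and average clustering coefficient, can be computed directly with networkx, as in the toy sketch below; the duplication network shown is illustrative only.

    import networkx as nx

    # Nodes are duplicated genomic segments; an edge links two segments that align to each other.
    G = nx.Graph()
    G.add_edges_from([
        ("seg1", "seg2"), ("seg1", "seg3"), ("seg2", "seg3"),   # one duplication cluster (triangle)
        ("seg4", "seg5"),                                        # an isolated duplication pair
    ])

    density = nx.density(G)                   # fraction of possible segment-segment links present
    clustering = nx.average_clustering(G)     # how often a segment's partners also duplicate each other
    print(f"density={density:.3f}, clustering={clustering:.3f}")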

16:30-16:45
A machine learning framework for inferring cell-specific gene regulatory networks from single-cell multi-omics
Confirmed Presenter: Yasin Uzun, Penn State College of Medicine, United States

Format: In Person


Authors List: Show

  • Yasin Uzun, Penn State College of Medicine, United States
  • Karamveer Karamveer, Penn State College of Medicine, United States
  • Hannah Valensi, Penn State College of Medicine, United States
  • Eric Moeller, Penn State College of Medicine, United States

Presentation Overview: Show

Gene regulatory networks (GRNs) control gene expression programs that drive cellular functions and differentiation, making them essential for understanding biological processes in development and disease. The emergence of single-cell multi-omics sequencing, which simultaneously profiles transcriptomic and epigenomic features in the same cells, has significantly enhanced GRN inference.

Despite recent advances, existing GRN inference methods based on single-cell multi-omics data vary in benchmarking datasets and performance metrics, making direct comparisons challenging. A standardized evaluation of these methods remains lacking. To address this, we developed a publicly available repository of single-cell GRN datasets. This resource integrates reference networks derived from transcription factor (TF)-DNA interaction datasets and functional perturbation studies, alongside diverse single-cell multi-omics datasets that match the reference networks in cell type composition.

Using these curated datasets, we conducted an unbiased benchmarking of leading single-cell multi-omics GRN inference methods. We assessed accuracy by comparing inferred networks to reference networks, evaluated stability by subsampling cell sets, and measured scalability in handling large datasets. Our analysis revealed key limitations in current GRN inference methods, particularly in accuracy and robustness.

To overcome these challenges, we developed a novel machine learning-based approach for GRN inference. Unlike traditional regression-based methods that predict target gene expression from regulator expression, our framework formulates GRN inference as a binary classification problem. By integrating both gene expression and chromatin accessibility data, we determine whether a regulatory interaction exists between a transcription factor and its target gene.

We validated our method using established single-cell multi-omics datasets and reference networks, demonstrating significant improvement in accuracy over existing approaches. This novel framework provides a robust and scalable solution for inferring gene regulatory networks from single-cell multi-omics data, advancing our ability to model complex gene regulation dynamics.
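
A minimal sketch of the binary-classification framing described above is shown below: each candidate TF-target pair becomes one feature vector combining expression- and accessibility-derived features, with a reference network supplying labels; the specific features, classifier, and synthetic data are placeholders rather than the authors' model.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)
    n_pairs = 2000
    features = np.column_stack([
        rng.normal(size=n_pairs),   # e.g., TF-target expression correlation across cells
        rng.normal(size=n_pairs),   # e.g., accessibility of the target's promoter/enhancers
        rng.normal(size=n_pairs),   # e.g., TF motif score within accessible peaks near the target
    ])
    labels = rng.integers(0, 2, size=n_pairs)   # 1 if the TF-target pair is in the reference network

    X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.3, random_state=0)
    clf = GradientBoostingClassifier().fit(X_tr, y_tr)
    print("AUROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))   # ~0.5 on random placeholders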

16:45-17:00
Discussing the performance of graph neural network-based models for anti-CRISPR protein prediction
Confirmed Presenter: Michelle Ramsahoye, University of Colorado - Boulder, United States

Format: In Person


Authors List: Show

  • Michelle Ramsahoye, University of Colorado - Boulder, United States
  • Mirela Alistar, University of Colorado - Boulder, United States

Presentation Overview: Show

Anti-CRISPR (Acr) proteins are found in bacteriophages and have recently been shown to be capable of inhibiting the CRISPR-Cas defense systems of various bacteria [1]. Acr proteins have been proposed to have possible applications as a form of phage therapy treatment and as a tool to regulate CRISPR-Cas genome editing [2,3]. These applications prompt interest in being able to identify these proteins; however, there is little understanding of the origin or evolution of Acr proteins, as they do not possess significant sequence or structural similarity to any other proteins. As of October 2024, researchers have experimentally validated 122 Acr proteins belonging to 92 different subtypes [4]. Large databases such as Anti-CRISPRdb have expanded to 3681 Acr proteins (containing both experimentally validated and putative proteins found using PSI-BLAST and the PDB database) [5]. This presents the opportunity to further narrow this putative list via in silico methods, thus saving time and resources for biologists performing in vitro experimental validation.

We frame this task as a binary classification problem. Inspired by DeepFRI (a protein function and functional residue prediction model), we utilized protein structure networks (PSNs) as input to graph neural networks (GNNs) for the task of Acr protein prediction [6]. We use a version of the Gussow et al. dataset as a benchmark, as was previously used in other Acr protein prediction machine learning and deep learning models [7]. The work encompasses the following steps: (1) data curation, (2) data preprocessing, (3) training and validation of two GNN models (graph convolutional network and graph attention network architectures), and (4) evaluating performance on a dataset used by prior machine learning and deep learning models for the identical task of Acr protein prediction. In this work, we combine PSNs made using both experimentally validated Protein Data Bank (PDB) files and PDB files predicted using ESMFold2. The best-performing GCN model has an accuracy of 89% and F1-score of 91% on the test set, and the best-performing GAT model has an accuracy of 85% and F1-score of 87.5%.

By limiting the training data to only Acrs, we discuss the process of creating smaller, specialized graph neural network models that can benefit from domain knowledge and can be further probed for interpretability. We also discuss limitations associated with the application of structural Acr protein data for use in graph neural networks.

[1] Bondy-Denomy, J. et al. (2013) Nature.

[2] Lin DM et al. (2017) World J Gastrointest Pharmacol Ther.

[3] Marino ND et al. (2020) Nat Methods.

[4] Allemailem KS et al. (2024) Int J Nanomedicine.

[5] Dong et al. (2022) Database.

[6] Gligorijević V et al (2021) Nat Commun.

[7] Gussow, A.B. (2020) Nat Commun.
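
For orientation, the sketch below shows a minimal Kipf-and-Welling-style graph convolutional layer in plain PyTorch applied to a toy protein structure network (contact-map adjacency plus residue features); layer counts, the attention variant, and the readout of the models discussed above are not reproduced here.

    import torch
    import torch.nn as nn

    class GCNLayer(nn.Module):
        def __init__(self, in_dim: int, out_dim: int):
            super().__init__()
            self.linear = nn.Linear(in_dim, out_dim)

        def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
            # Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2
            a_hat = adj + torch.eye(adj.size(0))
            d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
            a_norm = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
            return torch.relu(self.linear(a_norm @ x))

    # Toy protein structure network: 5 residues, contact-map adjacency, 8-dimensional residue features.
    adj = torch.tensor([[0., 1., 0., 0., 1.],
                        [1., 0., 1., 0., 0.],
                        [0., 1., 0., 1., 0.],
                        [0., 0., 1., 0., 1.],
                        [1., 0., 0., 1., 0.]])
    x = torch.randn(5, 8)
    h = GCNLayer(8, 16)(x, adj)
    graph_embedding = h.mean(dim=0)   # simple mean-pooling readout before a final classifier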

17:00-17:15
Sliding Window INteraction Grammar (SWING): a generalized interaction language model for peptide and protein interactions
Confirmed Presenter: Jishnu Das, University of Pittsburgh, United States

Format: In Person


Authors List: Show

  • Jane Siwek, University of Pittsburgh, United States
  • Alisa Omelchenko, University of Pittsburgh, United States
  • Prabal Chhibbar, University of Pittsburgh, United States
  • Alok Joglekar, University of Pittsburgh, United States
  • Jishnu Das, University of Pittsburgh, United States

Presentation Overview: Show

The explosion of sequence data has allowed the rapid growth of protein language models (pLMs). pLMs have now been employed in many frameworks, including variant-effect and peptide-specificity prediction. Traditionally, for protein-protein or peptide-protein interactions (PPIs), the corresponding sequences are either co-embedded followed by post-hoc integration, or the sequences are concatenated prior to embedding. Interestingly, no method utilizes a language representation of the interaction itself. We developed an interaction LM (iLM), which uses a novel language to represent interactions between protein/peptide sequences. Sliding Window Interaction Grammar (SWING) leverages differences in amino acid properties to generate an interaction vocabulary. This vocabulary is the input to an LM, followed by a supervised prediction step in which the LM’s representations are used as features.
SWING was first applied to predicting peptide:MHC (pMHC) interactions. With over 10,000 MHC I and 3,000 MHC II alleles, the possible pMHC combinations are vast, making it infeasible to experimentally identify all potential pMHC interactions. SWING was not only successful at generating Class I and Class II models with predictive performance comparable to state-of-the-art approaches, but the unique Mixed Class model was also successful at jointly predicting both classes. Further, the SWING model trained only on Class I alleles was predictive for Class II, a complex prediction task not attempted by any existing approach. For de novo data, using only Class I or Class II data, SWING also accurately predicted Class II pMHC interactions in murine models of SLE (MRL/lpr model) and T1D (NOD model), which were validated experimentally.
To further evaluate SWING’s generalizability, we tested its ability to predict the disruption of specific edges in protein interactome networks by missense mutations. Although modern methods like AlphaMissense and ESM1b can predict interfaces and variant effects/pathogenicity per mutation, they are unable to predict edge-specific disruptions in protein networks. Predicting which missense mutations can lead to the disruption of specific protein interactions provides a fundamental genotype to phenotype link (edgotype) at a molecular level. SWING was successful at accurately predicting the impact of both Mendelian mutations and population variants on PPIs. This is the first generalizable approach that can accurately predict interaction-specific disruptions by missense mutations with only sequence information. When benchmarked against other PPI methods such as passively using protein embeddings, using only the interaction encoding, and alternative iLM architectures, only SWING was able to learn enough information to perform well across prediction tasks for missense mutation perturbation prediction and pMHC binding. Overall, SWING is a first-in-class generalizable zero-shot iLM that learns the language of PPIs.

The corresponding manuscript is currently in press at Nature Methods.
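
To convey the flavor of a sliding-window interaction vocabulary, the sketch below tokenizes a peptide-protein pair using differences in a single amino acid property (Kyte-Doolittle hydrophobicity) binned into coarse symbols; SWING's actual encoding, property scales, and vocabulary are defined in the manuscript, so treat this purely as an illustration.

    # Kyte-Doolittle hydrophobicity scale (standard published values).
    KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5, "G": -0.4,
          "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8,
          "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

    def interaction_tokens(peptide: str, protein: str) -> list:
        """Slide the peptide along the protein and bin per-position hydrophobicity differences into 'words'."""
        tokens = []
        for start in range(len(protein) - len(peptide) + 1):
            window = protein[start:start + len(peptide)]
            diffs = [KD[p] - KD[w] for p, w in zip(peptide, window)]
            tokens.append("".join("L" if d < -2 else "H" if d > 2 else "M" for d in diffs))
        return tokens

    print(interaction_tokens("ACD", "MKVLACDEF"))   # the resulting token sequence would feed a language model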

17:15-17:30
Unveiling antimicrobial resistance mechanisms in ESKAPE through machine learning
Confirmed Presenter: Abhirupa Ghosh, Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA, United States

Format: In Person


Authors List: Show

  • Abhirupa Ghosh, Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA, United States
  • Charmie Vang, Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA, United States
  • Ethan Wolfe, Undergraduate program in Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA, United States
  • Evan Brenner, Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA, United States
  • Janani Ravi, Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA, United States

Presentation Overview: Show

ESKAPE is a group of notorious bacterial pathogens that cause nosocomial infections and contribute to high morbidity and mortality. These WHO-priority pathogens can evade multiple antibiotics; thus, understanding their antimicrobial resistance (AMR) patterns and mechanisms is essential for improving detection and treatment strategies. Traditional approaches for detecting AMR in novel bacterial strains require time-consuming, labor-intensive genetic and drug screens. Advances in sequencing technology offer a plethora of bacterial genome data, and computational approaches like machine learning (ML) provide scope for in silico AMR prediction leveraging the existing repertoire of genome data. Existing ML-based AMR predictions in bacteria are often limited to predicting resistance to one drug in one species at a time, neglecting spatiotemporal variations among strains, which are crucial for understanding the evolution and spread of resistance. Here, we introduce a comprehensive ML approach to identify AMR-associated molecular features across bacterial species, associated with single or multiple antibiotics or antibiotic classes, and stratified by time and geographic location. This project integrates hundreds to thousands of publicly available genomes for each species, coupled with AMR phenotypes, and leverages comparative genomics with supervised ML models to predict AMR phenotypes and underlying mechanisms. The genomes were annotated to obtain a genome x gene feature matrix (encoding presence/absence) with AMR phenotype labels from BV-BRC. Each genome is paired with curated metadata such as collection year, isolation country, host, and disease. We used supervised ML (e.g., logistic regression and random forest) to classify new pathogen genomes as resistant or susceptible and to identify the top predictors of resistance. F1 scores and areas under the precision-recall curve (relative to the prior) are consistently high across bug-drug combinations, often exceeding 0.90. The most predictive features returned by the models include biologically relevant, experimentally validated genes playing determinative roles in specific resistance mechanisms (e.g., tetK for tetracycline, mecA for methicillin resistance). We demonstrate the potential of ML for the discovery of general and context-specific AMR-associated biomarkers, especially when stratified by clinically relevant factors. For example, time-specific holdouts can evaluate how well the model predicts resistance in future data and can distinguish persistent resistance mechanisms from mechanisms that evolve over different periods, and the time-stratified model helps discover AMR features active for specific drugs (by generation) within a class. We are also developing a companion R package, amR, for broad application beyond ESKAPE to other bug-drug combinations, to predict AMR phenotypes and identify the best AMR-associated features.
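
A minimal version of the supervised setup described above might look like the sketch below: a genome x gene presence/absence matrix with resistant/susceptible labels, a random forest, and feature importances as candidate AMR determinants; the data are synthetic placeholders and the pipeline details differ from the study's.

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(7)
    genes = [f"gene_{i}" for i in range(200)]
    X = pd.DataFrame(rng.integers(0, 2, size=(300, 200)), columns=genes)   # gene presence/absence
    y = (X["gene_42"] | X["gene_7"]).astype(int)                           # toy resistance phenotype

    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    print("CV F1:", cross_val_score(rf, X, y, cv=5, scoring="f1").mean())

    rf.fit(X, y)
    top_genes = pd.Series(rf.feature_importances_, index=genes).nlargest(5)
    print(top_genes)   # top-ranked genes are candidate resistance determinants (mecA/tetK analogues)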

17:30-17:45
The International Society for Computational Biology (ISCB) Degree Endorsement Program
Confirmed Presenter: Russell Schwartz, Carnegie Mellon University, United States

Format: In Person


Authors List: Show

  • Russell Schwartz, Carnegie Mellon University, United States

Presentation Overview: Show

Computational biology, as a relatively young discipline, has benefited greatly from the tremendous diversity of perspectives that have come together to define what it means to be competent in the field, yet that diversity has also brought complications. One longstanding challenge is how we as a community can provide useful guidance and credentialing for work in the field when there is little agreement about what someone competent in computational biology should know or be capable of doing. Answering that question has been the work of a large community of educators over a period of years, who have defined competencies for the discipline and determined how they can be used for such tasks as designing or assessing a curriculum. This presentation will focus on one particular outcome of those efforts: a program the ISCB has launched for endorsing degree programs based on an assessment through the framework of the ISCB competencies. The presentation will discuss some of the history of the ISCB competencies efforts that led up to this initiative [1,2] and how the competencies evolved [3,4] and continue to evolve to the present day [5]. It will consider how competencies like these can be used in practice, with a particular focus on applications to program assessment as they are used in the endorsement scheme. It will then explain how that led to the ISCB endorsement process and go through the process as it is now underway. Finally, it will provide guidance for prospective applicants and program reviewers and consider some potential future steps.

[1] Welch, L.R., Schwartz, R. and Lewitter, F. (2012) A report of the curriculum task force of the ISCB Education Committee. PLoS Computational Biology, 8(6), p.e1002570.

[2] Welch, L. et al. (2014) Bioinformatics curriculum guidelines: toward a definition of core competencies. PLOS Computational Biology, 10(3), p.e1003496.

[3] Welch, L. et al. (2016) Applying, evaluating and refining bioinformatics core competencies (an update from the curriculum task force of ISCB’s education committee). PLoS Computational Biology, 12(5), p.e1004943.

[4] Mulder, N. et al. (2018) The development and application of bioinformatics core competencies to improve bioinformatics training and education. PLoS Computational Biology, 14(2), p.e1005772.

[5] Brooksbank, C. et al. (2024) The ISCB competency framework v. 3: a revised and extended standard for bioinformatics education and training. Bioinformatics Advances, 4(1), p.vbae166.

Thursday, May 15th
8:45-9:00
Welcome - Day 3
Format: In person


Authors List: Show

9:00-10:00
Invited Presentation: Sequence-basis of transcription initiation in the human genome
Confirmed Presenter: Jian Zhou

Format: In Person


Authors List: Show

  • Jian Zhou

Presentation Overview: Show

Transcription initiation is a process that is essential to ensuring the proper function of any gene, yet we still lack a unified understanding of sequence patterns and rules that explain most transcription start sites in the human genome. By predicting transcription initiation at base-pair resolution from sequences with a deep learning–inspired explainable model called Puffin, we show that a small set of simple rules can explain transcription initiation at most human promoters. We identify key sequence patterns that contribute to human promoter activity, each activating transcription with distinct position-specific effects. Furthermore, we explain the sequence basis of bidirectional transcription at promoters, identify the links between promoter sequence and gene expression variation across cell types, and explore the conservation of sequence determinants of transcription initiation across mammalian species.

10:30-10:50
Proceedings Presentation: ProtFun: A Protein Function Prediction Model Using Graph Attention Networks with a Protein Large Language Model
Confirmed Presenter: Serdar Bozdag, University of North Texas, United States

Format: In Person


Authors List: Show

  • Muhammed Talo, University of North Texas, United States
  • Serdar Bozdag, University of North Texas, United States

Presentation Overview: Show

Understanding protein functions facilitates the identification of the underlying causes of many diseases and guides the research for new therapeutic targets and medications. With the advancement of high-throughput technologies, obtaining novel protein sequences has become a routine process. However, determining protein functions experimentally is cost- and labor-prohibitive. Therefore, it is crucial to develop computational methods for automatic protein function prediction. In this study, we propose a multi-modal deep learning architecture called ProtFun to predict protein functions. ProtFun integrates protein large language model (LLM) embeddings as node features in a protein family network. Employing a graph attention network (GAT) on this protein family network, ProtFun learns protein embeddings, which are integrated with protein signature representations from InterPro to train a protein function prediction model. We evaluated our architecture using two benchmark datasets. Our results showed that our proposed approach outperformed current state-of-the-art methods in most cases. An ablation study also highlighted the importance of the different components of ProtFun. The data and source code of ProtFun are available at https://github.com/bozdaglab/ProtFun under the Creative Commons Attribution Non Commercial 4.0 International Public License.

10:50-11:05
Crowdsourcing the Fifth Critical Assessment of protein Function Annotation algorithms (CAFA 5)
Confirmed Presenter: Iddo Friedberg, Iowa State University, United States

Format: In Person


Authors List: Show

  • M. Clara De Paolis Kaluza, Northeastern University, United States
  • Rashika Ramola, Northeastern University, United States
  • Parnal Joshi, Iowa State University, United States
  • Damiano Piovesan, BioComputing UP - University of Padova, Italy
  • Walter Reade, Kaggle, United States
  • Addison Howard, Kaggle, United States
  • Maggie Demkin, Kaggle, United States
  • Predrag Radivojac, Northeastern University, United States
  • Iddo Friedberg, Iowa State University, United States

Presentation Overview: Show

The Critical Assessment of Functional Annotation (CAFA) is a long-standing, ongoing community effort to independently assess computational methods for protein function prediction, highlight well-performing methodologies, identify bottlenecks in the field, and provide a forum for disseminating results and exchanging ideas. CAFA was started in 2010 and, every three years since, has solicited participation from computational groups and engaged biocurators and experimental biologists to collect high-quality data on which to evaluate algorithmic performance in a series of prospective computational challenges to predict function for a large set of target proteins. For the 5th iteration, CAFA 5, we partnered with Kaggle, a data science company that facilitates large machine learning competitions. The partnership with Kaggle facilitated the participation of a much broader community of data scientists in the CAFA challenge. Predictions were collected as entries to a competitive challenge on the crowdsourced science platform. The reach and technology of this approach resulted in a large increase in the number of participating teams, comprising participants from 77 countries and various scientific and technical backgrounds. After the challenge, predictions were evaluated using a summary metric on a limited set of proteins that had accumulated annotations during a four-month period. In this talk, we will present how the increased and diversified participation affected the quality of Gene Ontology (GO) term predictions, evaluated on an expanded set of annotations, in greater detail across ontology aspects, and in comparison with past CAFA evaluations. We will specifically focus on how engaging over 1,987 participants in an open data science competition facilitated a noticeable improvement in computational protein function prediction.

11:05-11:20
Identifying Cancer Vaccine Adjuvants in Biomedical Literature Using Large Language Models
Confirmed Presenter: Hasin Rehana, University of North Dakota, United States

Format: In Person


Authors List: Show

  • Hasin Rehana, University of North Dakota, United States
  • Jie Zheng, University of Michigan, United States
  • Leo Yeh, University of Michigan, United States
  • Benu Bansal, University of North Dakota, United States
  • Nur Bengisu Çam, Bogazici University, Turkey
  • Christianah Jemiyo, University of North Dakota, United States
  • Brett McGregor, University of North Dakota, United States
  • Arzucan Özgür, Bogazici University, Turkey
  • Yongqun He, University of Michigan, United States
  • Junguk Hur, University of North Dakota, United States

Presentation Overview: Show

Background: Adjuvants are substances incorporated into vaccines to enhance the immune response, increasing its effectiveness and duration. Recognizing the various adjuvants used in existing biomedical research is essential for expediting novel therapeutic development. However, manual curation of the constantly expanding biomedical literature poses significant challenges. This study addresses these limitations by automatically extracting adjuvant names from literature related to cancer vaccines.
Methods: Advanced Large Language Models (LLMs), specifically Generative Pretrained Transformers (GPT) and Large Language Model Meta AI (Llama), were employed for this study. Two datasets were used for a comprehensive performance evaluation of these models. AdjuvareDB comprised 97 clinical trial records focused on established or potential adjuvants. Another dataset, the Vaccine Adjuvant Compendium (VAC), contained 290 annotated PubMed abstracts. GPT-4o and Llama 3.2 were implemented in zero-shot and few-shot settings, offering up to four examples per prompt. The temperature parameter was set to zero to eliminate randomness, and three independent runs were conducted for each setting to ensure consistency. Prompts explicitly targeted adjuvant names, testing the impact of contextual information such as substances or interventions. The outputs underwent automated and manual evaluation for accuracy, completeness, and consistency.
Results: GPT-4o demonstrated 100% precision in all assessed configurations, underscoring its strong capacity to avoid false positives. Moreover, incorporating contextual information led to substantial improvements in recall and F1-score, affirming the significance of context for model performance. For the VAC dataset, GPT-4o attained a maximum F1-score of 77.32% when incorporating interventions, exceeding Llama-3.2-3B by around 2%. For the AdjuvareDB dataset, GPT-4o attained an F1-score of 81.67% with three-shot prompting that included corresponding interventions, surpassing Llama-3.2-3B’s maximum F1-score of 65.62%. These findings underscore the importance of leveraging examples and contextual information to improve model accuracy and effectiveness in natural language processing tasks.
Conclusion: This study demonstrates that LLMs provide a scalable solution for identifying adjuvant names and streamlining cancer vaccine research. Future work will focus on expanding this framework to a broader range of vaccine types, refining model architectures, and optimizing prompt engineering strategies to further improve generalizability across diverse biomedical literature.
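
For readers who want to see what a zero-shot extraction call of this kind looks like, the sketch below uses the openai Python client (v1+) with temperature zero; the prompt wording, abstract text, and post-processing are simplified placeholders rather than the study's exact pipeline.

    from openai import OpenAI

    client = OpenAI()   # reads OPENAI_API_KEY from the environment

    abstract = "We evaluated a peptide vaccine formulated with poly-ICLC and Montanide ISA-51 ..."

    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,   # temperature zero removes sampling randomness, as in the study design
        messages=[
            {"role": "system", "content": "You extract vaccine adjuvant names from biomedical text."},
            {"role": "user", "content": "List the adjuvant names mentioned in this abstract, "
                                        "comma-separated, or answer 'none':\n\n" + abstract},
        ],
    )
    print(response.choices[0].message.content)   # e.g., "poly-ICLC, Montanide ISA-51"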

11:20-11:35
A deep learning approach for predicting synthetic lethality in human cells
Confirmed Presenter: Xiang Zhang, University of Minnesota, United States

Format: In Person


Authors List: Show

  • Xiang Zhang, University of Minnesota, United States
  • Ke-Chin Chen, University of Minnesota, United States
  • Daniel Chang, University of Minnesota, United States
  • Arshia Hassan, University of Minnesota, United States
  • Michael Costanzo, University of Toronto, Canada
  • Kevin Brown, The Hospital for Sick Children, Canada
  • Brenda Andrews, University of Toronto, Canada
  • Jason Moffat, The Hospital for Sick Children, Canada
  • Charles Boone, University of Toronto, Canada
  • Chad Myers, University of Minnesota, United States

Presentation Overview: Show

Synthetic lethality (SL), where inhibition of one gene is selectively lethal in cells with mutations in its partner, is a key type of genetic interaction (GI) with significant implications for cancer therapy. One long-term goal is to develop models capable of accurately predicting SLs across diverse cell types and contexts, even with minimal new data input. A major step toward this goal is the generation of a global reference map of human GIs based on genome-wide CRISPR-Cas9 screens. Recently, we generated a genome-scale GI network in the human haploid cell line, HAP1. This network, encompassing ~17,000 genes across over 200 query genes, includes data from double mutants for approximately 4 million unique gene pairs, offering an unprecedented resource for understanding genetic dependencies at scale. A stringent cutoff identified ~7,000 SL pairs accounting for ~0.18% of the screened population.

As a basis for building a model for predicting SL pairs, we leveraged the DepMap dataset as the primary input. DepMap provides comprehensive functional and molecular profiles across hundreds of diverse cancer cell lines, including CRISPR KO gene effect, gene expression, and mutation data. These datasets capture diverse cellular contexts and genetic variability, making DepMap an ideal source for constructing biologically rich, high-dimensional feature representations. Paired with the large collection of double mutant SL interactions from our HAP1 reference map, this combination enables the application of supervised learning to SL prediction.

Using these inputs, we developed a deep neural network (DNN) model for predicting SL pairs, trained and tested on the extensive HAP1 GI dataset. The DNN model demonstrated strong performance with an AUROC of 0.88, which is comparable to the median AUROC expected from control predictions derived from biological replicate screens. We evaluated the model’s generalizability on unseen KO pairs and GIs from other cellular contexts beyond the HAP1 screening system. In addition, we used the model’s predictions on cancer driver genes to guide GI experiments for identifying potential novel drug targets. Our findings highlight the potential of deep learning approaches to predict GIs at scale and suggest a strategy for leveraging a reference human GI network in a single cell line for building predictive models of SL that generalize across contexts. Our model can facilitate the design of more efficient GI experiments and improve our understanding of genetic dependencies and SL in human cells.
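
A schematic of this supervised setup, with gene-pair feature vectors built from per-gene DepMap-style profiles and labeled by SL calls, is sketched below; the order-invariant pair encoding, network size, and synthetic data are illustrative assumptions, not the authors' exact model.

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(3)
    n_genes, n_feat = 1000, 64
    gene_profiles = rng.normal(size=(n_genes, n_feat))   # per-gene features (e.g., CRISPR effect, expression)

    pairs = rng.integers(0, n_genes, size=(5000, 2))
    # Order-invariant pair encoding: element-wise sum and absolute difference of the two gene vectors.
    X = np.hstack([gene_profiles[pairs[:, 0]] + gene_profiles[pairs[:, 1]],
                   np.abs(gene_profiles[pairs[:, 0]] - gene_profiles[pairs[:, 1]])])
    y = rng.integers(0, 2, size=5000)                    # placeholder SL / non-SL labels

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    mlp = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=200).fit(X_tr, y_tr)
    print("AUROC:", roc_auc_score(y_te, mlp.predict_proba(X_te)[:, 1]))   # ~0.5 on random placeholders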

11:35-11:55
Proceedings Presentation: Deep Active Learning based Experimental Design to Uncover Synergistic Genetic Interactions for Host Targeted Therapeutics
Confirmed Presenter: Haonan Zhu, Lawrence Livermore National Laboratory, United States

Format: In Person


Authors List: Show

  • Haonan Zhu, Lawrence Livermore National Laboratory, United States
  • Mary Silva, Lawrence Livermore National Laboratory, United States
  • Jose Cadena, Lawrence Livermore National Laboratory, United States
  • Braden Soper, Lawrence Livermore National Laboratory, United States
  • Michał Lisicki, University of Guelph, Canada
  • Braian Peetoom, University of California San Francisco, United States
  • Sergio Baranzini, University of California San Francisco, United States
  • Shivshankar Sundaram, Lawrence Livermore National Laboratory, United States
  • Priyadip Ray, Lawrence Livermore National Laboratory, United States
  • Jeff Drocco, Lawrence Livermore National Laboratory, United States

Presentation Overview: Show

Recent technological advances have introduced new high-throughput methods for studying host-virus interactions, but testing synergistic interactions between host gene pairs during infection remains relatively slow and labor intensive. Identification of multiple gene knockdowns that effectively inhibit viral replication requires a search over the combinatorial space of all possible target gene pairs and is infeasible via brute-force experiments. Although active learning methods for sequential experimental design have shown promise, existing approaches have generally been restricted to single-gene knockdowns or small-scale double knockdown datasets. In this study, we present an integrated Deep Active Learning (DeepAL) framework that incorporates information from a biological knowledge graph (SPOKE, the Scalable Precision Medicine Open Knowledge Engine) to efficiently search the configuration space of a large dataset of all pairwise knock-downs of 356 human genes in HIV infection. Through graph representation learning, the framework is able to generate task-specific representations of genes while also balancing the exploration-exploitation trade-off to pinpoint highly effective double-knockdown pairs. We additionally present an ensemble method for uncertainty quantification and an interpretation of the gene pairs selected by our algorithm via pathway analysis. To our knowledge, this is the first work to show promising results on double-gene knockdown experimental data of appreciable scale (356 by 356 matrix).

13:30-13:45
Using language models to find gene-expression datasets
Confirmed Presenter: Stephen Piccolo, Brigham Young University, United States

Format: In Person


Authors List: Show

  • Stephen Piccolo, Brigham Young University, United States
  • Amanda Warren, Brigham Young University, United States
  • Abigail Muir, Brigham Young University, United States

Presentation Overview: Show

With so many existing gene expression datasets in online databases, it is a common challenge for scientists to identify similar and relevant datasets to support their research. The purpose of this study was to explore the possibility of using language models to aid with dataset finding, to evaluate the performance of different models, to assess how well they would perform in different biomedical contexts, and to compare them against the Gene Expression Omnibus (GEO) Advanced Search Builder.

Using the GEO database, we selected the title, summary, and overall design to create vector representations of the gene expression datasets. We chose and categorized 30 language models based on the type of corpus on which they were trained (general purpose, biomedical, or more broadly scientific) and evaluated the accuracy of each model by comparing its results to manually selected, relevant datasets.

We found that the language models often returned very different and more relevant results than GEO’s Advanced Search Builder. Generally, models with larger embedding sizes performed better than those with smaller sizes. In the context of our study, we found that the gte-large language model returned the most relevant results when compared to the manually selected datasets. The extent to which our findings would generalize to other contexts is currently unclear; using large language models to query public datasets merits further study.

In order for this language model tool to be useful to other scientists, we created a Web app called GEOfinder that allows a user to enter dataset identifiers (GSE IDs) from the GEO database and select filters based on year, experiment type, and number of samples. The Web app then finds studies with similar descriptions and provides an interface that researchers can use to navigate the results and link to the datasets on GEO.
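
A minimal version of the retrieval step described above is sketched below: dataset descriptions are embedded with a sentence-embedding model and ranked by cosine similarity to a query; the descriptions are placeholders, the model identifier is assumed to be the public gte-large release, and GEOfinder's implementation may differ.

    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    descriptions = [
        "RNA-seq of human CD4 T cells after TCR stimulation",
        "Single-cell RNA-seq of mouse cortical neurons during development",
        "Microarray profiling of breast cancer subtypes",
    ]
    query = "Transcriptomic response of human T cells to activation"

    model = SentenceTransformer("thenlper/gte-large")   # assumed public identifier for gte-large
    embeddings = model.encode(descriptions + [query])
    scores = cosine_similarity(embeddings[-1:], embeddings[:-1])[0]
    for score, description in sorted(zip(scores, descriptions), reverse=True):
        print(f"{score:.3f}  {description}")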

13:45-14:00
MIMIC: a pipeline facilitating Multi-modal Imputation and Multi-modal Integration for disease Classification
Confirmed Presenter: Manikandan Narayanan, Indian Institute of Technology (IIT) Madras, India

Format: In Person


Authors List: Show

  • Dipra Bhagat, Indian Institute of Technology (IIT) Madras, India
  • Manikandan Narayanan, Indian Institute of Technology (IIT) Madras, India

Presentation Overview: Show

Genome-wide data of different types or modalities (e.g., bulk/spatial/single-cell RNAseq, proteomic, DNAseq, genome-wide DNA methylation data) are increasingly used to study complex diseases. Such multi-omic datasets have the potential to transform disease classification and biomarker discovery tasks, once challenges related to redundancy, missingness (of a few data points or of an entire data modality), and the noise levels of the different data types are addressed. Several methods have been proposed to address these challenges, such as ones that perform imputation of missing modalities and/or integration of multiple modalities to classify the disease status of a test sample. But few studies have explored the tight connection between data imputation and integration in the context of disease classification; for instance, the extent to which imputation errors adversely impact, or are surprisingly well tolerated in, the classification of disease samples is not well understood.

In this study, we propose a systematic framework called MIMIC (Multi-modal Imputation and Multi-modal Integration for disease Classification) to study how methods for imputing entire data modalities from other measured modalities affect downstream disease classification accuracy. The MIMIC pipeline allows exploring a range of shallow to deep machine learning models for classification, each of which utilizes features from multiple measured/imputed data types to predict disease status. We applied our framework to three diverse disease datasets: Alzheimer’s (MSBB and ROSMAP datasets), Breast Cancer (TCGA-BRCA and METABRIC), and Preterm Birth (MOMI). In a majority of these applications, we found that imputed datasets offer a way to either improve model performance (by 1-5%) or provide comparable performance relative to the measured dataset while expanding the pool of potential biomarkers. Further analysis of genes with top feature importance scores identified by MIMIC revealed informative vs. redundant omics types for the classification task and gene-disease associations. These results are promising and encourage use of the MIMIC pipeline to dissect other complex diseases as well, via integration of multiple modalities, measured or imputed.

(The research presented in this work was supported by Wellcome Trust/DBT grant IA/I/17/2/503323 awarded to MN. The authors thank Adiya Jeevannavar for contributions to the initial phases of this project.)

14:00-14:15
Chromatin Accessibility in Human Primary and Metastatic Cancers Links Regional Mutational Processes and Signatures to Tissues of Origin
Confirmed Presenter: Hanli Jiang, University of Toronto & Ontario Institute for Cancer Research, Canada

Format: In Person


Authors List: Show

  • Hanli Jiang, University of Toronto & Ontario Institute for Cancer Research, Canada
  • Jüri Reimand, University of Toronto & Ontario Institute for Cancer Research, Canada

Presentation Overview: Show

Background
Cancer metastasis significantly worsens patient prognosis and is responsible for the majority of cancer-related deaths. Metastatic tumors accumulate additional mutations that facilitate their spread, yet their mutational processes and genomic characteristics remain only partially understood. Large-scale projects have characterized mutational landscapes in primary tumors and in metastatic tumors. Machine learning models have also been used to link regional mutation rates with tissue-specific chromatin and replication profiles in primary cancers. However, comparatively few studies have systematically contrasted primary versus metastatic tumor genomes using advanced machine learning approaches.

Methodology
We applied machine learning techniques to investigate mutational processes in metastatic cancers. In particular, we developed a multi-layer perceptron (MLP) model with gating layers to integrate chromatin accessibility and DNA replication timing as features to predict regional mutation rates. This deep learning model was trained separately to predict mutation rates in primary tumors using data from the Pan-Cancer Analysis of Whole Genomes (PCAWG) consortium and in metastatic tumors using data from the Hartwig Medical Foundation (HMF) cohort, while consistently using the same chromatin accessibility and replication timing features for both. The gating mechanism allowed the model to prioritize tissue-specific genomic features, improving mutation rate prediction and interpretability at finer genomic scales.

Key Findings
Our MLP_Gating model outperformed a random forest baseline across 15 cancer types, achieving adjusted R² values up to 0.95, demonstrating robust predictive power. This strong performance enhances confidence in subsequent findings. Metastatic tumors exhibited distinct mutational landscapes compared to their primary counterparts, yet their chromatin accessibility and replication timing profiles largely retained tissue-specific characteristics. The model further identified genomic regions with disproportionately high mutation burdens in metastases, pinpointing several known cancer driver genes. Notably, certain mutation hotspots displayed mutation rates exceeding expectations based on chromatin and replication timing features, suggesting additional mutagenic influences unique to metastatic progression.
Our findings indicate that even as cancers metastasize, they preserve key epigenomic features from their tissue of origin, offering opportunities to tailor treatments according to the primary tumor's context. Moreover, the known cancer driver genes identified with exceptionally high mutation burdens in metastatic tumors can serve as both potential therapeutic targets and biomarkers, guiding personalized oncology strategies. These insights into mutation hotspots in metastatic cancers refine our understanding of how mutational processes diverge from primary tumors and may enhance the accuracy of future models predicting metastatic risk and progression. By integrating whole-genome sequencing data with deep learning, this study underscores the value of computational approaches in unraveling metastatic tumor evolution.
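For readers curious what a gated MLP of this general kind looks like, the sketch below applies feature-wise sigmoid gates to epigenomic input tracks before a standard MLP regresses regional mutation rates. Layer sizes, the exact gating form, and the toy data are illustrative assumptions, not the authors' architecture or training setup.
```python
# Hedged sketch of a gated multi-layer perceptron for regional mutation-rate regression.
# Gates produce per-feature weights in (0, 1), letting the model emphasize
# tissue-matched chromatin accessibility / replication-timing tracks.
import torch
import torch.nn as nn

class GatedMLP(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(n_features, n_features), nn.Sigmoid())
        self.mlp = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise gating of the input features, then regression to a mutation rate.
        return self.mlp(x * self.gate(x)).squeeze(-1)

# Toy usage: 1,000 genomic bins x 30 epigenomic features, regressed on synthetic targets.
x = torch.randn(1000, 30)
y = torch.rand(1000)
model = GatedMLP(n_features=30)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
print(f"final training MSE: {loss.item():.4f}")
```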

14:15-14:30
Genomic Language Model for Predicting Enhancers and Their Allele-Specific Activity in the Human Genome
Confirmed Presenter: Rekha Sathian, Department of Biomedical Informatics, Stony Brook University, United States

Format: In Person


Authors List: Show

  • Rekha Sathian, Department of Biomedical Informatics, Stony Brook University, United States
  • Pratik Dutta, Department of Biomedical Informatics, Stony Brook University, United States
  • Ferhat Ay, University of California; Centers for Cancer Immunotherapy and Autoimmunity, La Jolla Institute for Immunology, United States
  • Ramana V Davuluri, Department of Biomedical Informatics, Stony Brook University, United States

Presentation Overview: Show

Enhancers are distal cis-regulatory sequences that coordinate target gene expression in a tissue-specific manner. Disruptions in enhancer function, due to mutations, structural variations, or other mechanisms, can cause aberrant gene expression, leading to congenital disorders, cancers, and common complex diseases, collectively termed enhanceropathies. This notion is supported by the large number of disease-associated SNPs identified in these distal regulatory elements through genome-wide association studies. Predicting and deciphering the regulatory logic of enhancers is challenging due to their intricate sequence features and the lack of consistent genetic or epigenetic signatures that accurately distinguish enhancers from other genomic regions. Recent machine-learning-based methods have highlighted the importance of extracting the nucleotide composition of enhancers but have failed to capture sequence context effectively, leading to suboptimal performance. Motivated by advances in genomic language models, we applied DNABERT(1), a transformer-based language model pre-trained on the human genome, to develop a novel enhancer prediction method called DNABERT-Enhancer. We trained two different models using a large collection of enhancers curated from the ENCODE registry of candidate cis-regulatory elements, an integrative layer of the ENCODE Encyclopedia(2). The best fine-tuned model achieved 88.05% accuracy, with a Matthews correlation coefficient of 76%, on independent, held-out data. To further validate the model’s performance, we applied it genome-wide and compared the predictions to publicly available enhancer databases. Our model demonstrated remarkable accuracy, successfully predicting a substantial portion of enhancers cataloged in nine databases, including nearly all enhancers in VISTA, 91% of typical enhancers from SEdb, 85% of enhancers in TiED, and 77% of EnhancerAtlas enhancers, amounting to 99% of the model’s total predictions. Finally, we applied DNABERT-Enhancer, along with other DNABERT-based regulatory genomic region prediction models, to identify candidate SNPs with allele-specific enhancer and transcription factor binding activity. These candidates were identified by introducing variant alleles to test for functional site disruption and flagging sites with significant changes in prediction probability. Through careful evaluation, we identified candidate variants exhibiting loss-of-function effects, and their clinical significance was further assessed using the ClinVar and GWAS databases. The genome-wide enhancer annotations and candidate loss-of-function genetic variants predicted by DNABERT-Enhancer provide valuable resources for genome interpretation in functional and clinical genomics studies.

1. Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics. 2021;37(15):2112-20.
2. The ENCODE Project Consortium, Moore JE, Purcaro MJ, Pratt HE, Epstein CB, Shoresh N, et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature. 2020;583(7818):699-710.
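The reference-versus-variant scoring step described in the abstract can be sketched generically as below: run a fine-tuned enhancer classifier on the same genomic window with the reference and the alternate allele and compare the predicted enhancer probabilities. The checkpoint path is a placeholder (not a released model ID), DNABERT-style 6-mer tokenization is assumed, and the sequences are invented; this is not the authors' code.
```python
# Sketch of allele-specific effect scoring with a fine-tuned sequence classifier.
# CHECKPOINT is a placeholder path; substitute a real fine-tuned enhancer model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "path/to/fine-tuned-enhancer-model"   # placeholder, not a real model ID

def to_kmers(seq: str, k: int = 6) -> str:
    """DNABERT v1 tokenizers expect sequences as space-separated overlapping k-mers."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

def enhancer_prob(model, tokenizer, seq: str) -> float:
    inputs = tokenizer(to_kmers(seq), return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()   # probability of the enhancer class

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT).eval()

# Same window centered on a variant, with reference vs. alternate allele (placeholder sequences).
ref_seq = "TTGACCTAGATGAGGGCACTGGCCTAGGAAAC"
alt_seq = "TTGACCTAGATGAGGGTACTGGCCTAGGAAAC"

delta = enhancer_prob(model, tokenizer, alt_seq) - enhancer_prob(model, tokenizer, ref_seq)
print(f"allele-specific change in predicted enhancer probability: {delta:+.3f}")
```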

14:30-14:45
Longitudinal metabolomics study on serine supplementation trial reveals treatment efficacy for retinal disease driven by genetic amino acid dysregulation.
Confirmed Presenter: Roberto Bonelli, The Lowy Medical Research Institute, United States

Format: In Person


Authors List: Show

  • Roberto Bonelli, The Lowy Medical Research Institute, United States
  • Jennifer Zhang, The Lowy Medical Research Institute, United States
  • Katie Nardo, The Lowy Medical Research Institute, United States
  • Victoria Mannor, The Lowy Medical Research Institute, United States
  • Jennifer Trombley, The Lowy Medical Research Institute, United States
  • Lea Scheppke, The Lowy Medical Research Institute, United States
  • Simone Muller, The Lowy Medical Research Institute, United States
  • Martin Friedlander, The Lowy Medical Research Institute, United States
  • Marin Gantner, The Lowy Medical Research Institute, United States

Presentation Overview: Show

Macular Telangiectasia Type 2 (MacTel) is a rare neurovascular degenerative retinal disease with complex genetic and metabolic underpinnings. Genome-wide association studies (GWAS) have identified 11 susceptibility loci, many of which implicate the glycine-serine metabolic pathway. Metabolomic analyses have confirmed significant reductions in glycine and serine levels among MacTel patients, highlighting metabolic dysregulation as a key feature of the disease. Further Mendelian randomization studies have established serine deficiency as a causal factor, driving disease progression through the accumulation of neurotoxic deoxysphingolipids. Experimental evidence further corroborates the link between serine metabolism and disease pathology, as elevated deoxysphingolipids have been observed to induce photoreceptor cell death in retinal organoids and compromise visual function in serine-deficient animal models.
Building on these insights, we conducted a study investigating the metabolic effects of 6 weeks of serine and fenofibrate supplementation in 90 MacTel patients. Participants received serine at one of two doses, fenofibrate, a combination of serine and fenofibrate, or no treatment. Four time points were included: baseline, 3, 6, and 10 weeks. Each patient was genotyped via SNP array, and at each time point laboratory blood panels were performed alongside the collection of serum for metabolomics. Using normalisation and regression techniques, we identified hundreds of metabolites affected by these interventions. By integrating summary statistics from previous MacTel metabolomic studies, we observed striking differences in disease specificity for the metabolic changes induced by the different treatments. We also found that most of the observed metabolic changes were reversed by a 4-week washout period.
To characterize the broader metabolic impact of supplementation, we employed dimensionality reduction techniques, including factorial analysis and lasso models, to derive a MacTel metabolic signature. This composite metabolic endpoint allowed us to quantify the extent to which supplementation modulates the heterogeneous metabolic disturbances observed in MacTel patients. Our results demonstrate that targeted supplementation can effectively mitigate the metabolic dysregulation characteristic of MacTel, providing further evidence for the likely causative role of serine metabolism in disease pathogenesis.
This study underscores the utility of integrative bioinformatic and metabolomic approaches in elucidating complex metabolic diseases. By leveraging ‘omics data and advanced statistical modelling, we provide a framework for dissecting metabolic perturbations at a systems level, with implications for both MacTel and broader metabolic disorders. These methodologies not only enhance our understanding of disease mechanisms but also inform the development of metabolically targeted interventions.
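As a generic illustration of the sparse-signature idea, the toy below fits an L1-penalized logistic model to synthetic metabolite profiles and uses its linear predictor as a single composite endpoint to compare pre- and post-treatment samples. The data, penalty strength, and "treatment shift" are all invented; this is not the study's analysis code.
```python
# Minimal sketch of deriving a sparse composite metabolic signature with a lasso-type
# model and scoring samples against it (synthetic data, assumed modelling choices).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_case, n_ctrl, n_metab = 90, 90, 400
X_case = rng.normal(loc=0.3, size=(n_case, n_metab))   # stand-in for patient metabolomes
X_ctrl = rng.normal(loc=0.0, size=(n_ctrl, n_metab))
X = np.vstack([X_case, X_ctrl])
y = np.r_[np.ones(n_case), np.zeros(n_ctrl)]

# L1-penalized logistic regression selects a sparse panel of metabolites.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_[0])
print(f"{selected.size} metabolites retained in the signature")

def signature_score(metabolome: np.ndarray) -> np.ndarray:
    """Composite endpoint: the linear predictor of the sparse model."""
    return metabolome @ lasso.coef_[0] + lasso.intercept_[0]

# Compare the composite score at baseline and after a simulated treatment-induced shift.
baseline = X_case
post_treatment = X_case - 0.2   # placeholder shift, not a real treatment effect
print(f"mean score baseline: {signature_score(baseline).mean():.2f}, "
      f"post-treatment: {signature_score(post_treatment).mean():.2f}")
```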

14:45-15:00
SigMatch: a transcriptome-based regression method to detect mechanistic links across diseases and drugs
Confirmed Presenter: Kewalin Samart, Computational Bioscience Program, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA, United States

Format: In Person


Authors List: Show

  • Kewalin Samart, Computational Bioscience Program, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA, United States
  • Landon Buskirk, Data Science Undergraduate Program, Michigan State University, East Lansing, MI, USA, United States
  • Arjun Krishnan, Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA, United States
  • Janani Ravi, Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA, United States

Presentation Overview: Show

Due to the meteoric rise of antibiotic resistance, research on treating infectious diseases (InfD) is turning to host-directed therapeutics (HDTs) that can complement antibiotics by modulating host responses. Computational drug repurposing has demonstrated the potential to accelerate HDT discovery. Traditional transcriptomics-based repurposing methods identify drugs that reverse the disease differential gene expression signature of the host. However, these methods only consider each disease in isolation and often overlook shared molecular mechanisms across diseases, e.g., the NF-κB pathway, which plays a role in active pulmonary tuberculosis (TB), autoimmune disorders, and cancer. This limitation hinders the identification of potential HDT combinations for less-studied InfDs by leveraging evidence from better-studied non-infectious diseases (NInfDs). Drug combination and synergy prediction, while extensively used in cancer, have rarely been applied to InfDs. To address this, we introduce SigMatch, a novel signature-based sparse regression model designed to discover new candidate HDT combinations using (i) direct mechanistic connections across drugs and InfDs, and (ii) transferred mechanistic insights from NInfDs to InfDs.
SigMatch applies a boosting-based ensemble regression model to identify combinations of features from drug or disease gene expression signatures that best explain the target disease signature. First, SigMatch utilizes a library of FDA-approved drug signatures and identifies combinations of drugs that collectively reverse a TB signature. Each iteration of the model selects a drug signature that explains a specific facet of the reversed disease signature, continuing until an optimal mechanistic match is reached. Thus, the ensemble model naturally mirrors the reversal of the entire disease state by a sparse combination of drugs. Next, SigMatch can also find a combination of NInfD signatures that share signature patterns with the input InfD, enabling a transfer of knowledge from NInfDs to identify potential drug candidates for understudied InfDs.
We validated the ability of SigMatch to identify mechanistically related diseases by applying it to NInfDs. Specifically, we trained SigMatch using ~600 NInfD signature features with one NInfD target signature (removed from the features) iteratively for each NInfD, and identified numerous meaningful disease-disease associations, including between Parkinson’s and Alzheimer’s diseases. We are now applying SigMatch to discover potential HDT combinations against TB via both direct and transferred evidence.
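The iterate-and-select flavor of such a model can be illustrated with a simple forward-stagewise analogue: at each round, pick the drug signature most correlated with the unexplained part of the reversed disease signature, add it to the sparse combination, and update the residual. This is an illustrative stand-in under stated assumptions, not the SigMatch implementation, and all signatures below are synthetic.
```python
# Greedy stagewise sketch of building a sparse combination of drug signatures that
# collectively explain a reversed disease signature (synthetic data throughout).
import numpy as np

rng = np.random.default_rng(2)
n_genes, n_drugs = 978, 50                               # e.g., a landmark-gene-sized space
drug_sigs = rng.normal(size=(n_drugs, n_genes))          # library of drug signatures
# Construct a disease signature that two of the drugs would reverse, plus noise.
disease_sig = -(0.7 * drug_sigs[3] + 0.5 * drug_sigs[17]) + rng.normal(scale=0.3, size=n_genes)

target = -disease_sig                                    # the reversed disease signature
residual = target.copy()
selected = {}

for _ in range(5):                                       # a few boosting-style rounds
    # Pick the drug whose signature is most correlated with the unexplained residual.
    scores = drug_sigs @ residual / np.linalg.norm(drug_sigs, axis=1)
    best = int(np.argmax(np.abs(scores)))
    # Least-squares coefficient for that single signature against the residual.
    coef = float(drug_sigs[best] @ residual / (drug_sigs[best] @ drug_sigs[best]))
    selected[best] = selected.get(best, 0.0) + coef
    residual = residual - coef * drug_sigs[best]

explained = 1 - np.linalg.norm(residual) ** 2 / np.linalg.norm(target) ** 2
print("selected drug indices and weights:", {k: round(v, 2) for k, v in selected.items()})
print(f"fraction of reversed disease signature explained: {explained:.2f}")
```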

15:00-15:20
Proceedings Presentation: Product Manifold Representations for Learning on Biological Pathways
Confirmed Presenter: Daniel McNeela, University of Wisconsin-Madison, United States

Format: Live Stream


Authors List: Show

  • Daniel McNeela, University of Wisconsin-Madison, United States
  • Frederic Sala, University of Wisconsin-Madison, United States
  • Anthony Gitter, University of Wisconsin-Madison, United States

Presentation Overview: Show

Machine learning models that embed graphs in non-Euclidean spaces have shown substantial benefits in a variety of contexts, but their application has not been studied extensively in the biological domain, particularly with respect to biological pathway graphs. Such graphs exhibit a variety of complex network structures, presenting challenges to existing embedding approaches. Learning high-quality embeddings for biological pathway graphs is important for researchers looking to understand the underpinnings of disease and train high-quality predictive models on these networks. In this work, we investigate the effects of embedding pathway graphs in non-Euclidean mixed-curvature spaces and compare against traditional Euclidean graph representation learning models. We then train a supervised model using the learned node embeddings to predict missing protein-protein interactions in pathway graphs. We find large reductions in distortion and boosts in in-distribution edge prediction performance as a result of using mixed-curvature embeddings and their corresponding graph neural network models. However, we find that mixed-curvature representations underperform existing baselines on out-of-distribution edge prediction, suggesting that these representations may overfit to the training graph topology. We provide our Mixed-Curvature Product Graph Convolutional Network code at https://github.com/mcneela/Mixed-Curvature-GCN and our pathway analysis code at https://github.com/mcneela/Mixed-Curvature-Pathways.
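For intuition on what a mixed-curvature product space is, the snippet below composes distances from hyperbolic, spherical, and Euclidean components into a single product-manifold distance, using the standard Poincaré-ball and great-circle formulas. The dimension split and the random points are arbitrary illustrations, not the authors' embedding model.
```python
# Numeric sketch of distance in a product manifold (hyperbolic x spherical x Euclidean).
import numpy as np

def poincare_dist(u: np.ndarray, v: np.ndarray) -> float:
    """Geodesic distance in the Poincaré ball (curvature -1); inputs must have norm < 1."""
    diff = np.linalg.norm(u - v) ** 2
    denom = (1 - np.linalg.norm(u) ** 2) * (1 - np.linalg.norm(v) ** 2)
    return float(np.arccosh(1 + 2 * diff / denom))

def sphere_dist(u: np.ndarray, v: np.ndarray) -> float:
    """Great-circle distance on the unit sphere (curvature +1)."""
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

def product_dist(x: dict, y: dict) -> float:
    """Product-manifold distance: l2 combination of the component distances."""
    d_h = poincare_dist(x["hyp"], y["hyp"])
    d_s = sphere_dist(x["sph"], y["sph"])
    d_e = float(np.linalg.norm(x["euc"] - y["euc"]))
    return float(np.sqrt(d_h ** 2 + d_s ** 2 + d_e ** 2))

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

rng = np.random.default_rng(3)
a = {"hyp": rng.uniform(-0.3, 0.3, 4), "sph": normalize(rng.normal(size=4)), "euc": rng.normal(size=4)}
b = {"hyp": rng.uniform(-0.3, 0.3, 4), "sph": normalize(rng.normal(size=4)), "euc": rng.normal(size=4)}
print(f"product-manifold distance between two embedded nodes: {product_dist(a, b):.3f}")
```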

15:20-15:35
Multi-omic analysis of testes reveals pseudogenes as overlooked biological actors
Confirmed Presenter: Ihor Arefiev, University of Sherbrooke, Canada

Format: Live Stream


Authors List: Show

  • Ihor Arefiev, University of Sherbrooke, Canada
  • Joëlle Vincent, University of Sherbrooke, Canada
  • Francis Bourassa, University of Sherbrooke, Canada
  • Marie Brunet, University of Sherbrooke, Canada

Presentation Overview: Show

Pseudogenes are commonly described as defective copies of genes or “junk” DNA that neither encode any protein nor harbor any biological function. Yet growing evidence demonstrates the transcription and even translation of pseudogenes in mammals. Pseudogene expression has been found to be predictive of tissue type in 7 human cancers. We suggest that pseudogene transcription and translation can be specific to cell types and should not be marginalized in analyses.
To investigate cell-specific transcription, we performed single-cell RNA sequencing (scRNA-seq) on 4 testes of adult mice. Pseudogenes share, on average, 85.5% sequence identity with their parental counterparts, so many reads from pseudogenes are multi-mapped reads (MMRs; reads that align to multiple loci in the genome). We tested 4 algorithms for MMR handling and identified expectation-maximization (EM) as the most reliable based on sequence coverage uniformity. Sequencing data were then aligned to the genome using EM for count correction of MMRs. We retrieved an average of 6,608 cells per sample and, after clustering, identified all spermatogenic stages as well as Leydig cells, Sertoli cells, and spermatogonia. Pseudogenes comprised ~7% (2,295) of all confidently detected transcripts, with 77% (1,759) of them detected in all samples and an additional 11.4% (261) detected in at least 3 samples. Differential expression analysis identified 27 pseudogenes specific to clusters. Trajectory analysis also revealed that 21 of these were differentially expressed through pseudotime.
Next, we investigated the translation of pseudogenes with tandem mass spectrometry on the same samples. We identified 103 pseudogene-encoded proteins, 8 of which were detected in all 4 samples and 5 in at least 3 samples. 40 of these 103 were also detected in scRNA-seq. 12.62% (13 of 103) of the pseudogenic proteins were detected according to HPP guidelines, whilst 87.4% were detected with only a single unique peptide. 18 of the 21 cell-specific pseudogenes were detected as proteins, including 1 detected with 2 unique peptides that met the HPP guidelines. Functional inference for this pseudogene, based on its sequence identity with its parental gene, suggests a role in spermatogonia differentiation.
This project highlights the overlooked transcription and translation of pseudogenes. Our findings challenge the definition of pseudogenes as mere genomic relics.
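The expectation-maximization rescue of multi-mapped reads mentioned above follows a well-known pattern, sketched below on toy data: reads compatible with several loci (e.g., a pseudogene and its parent gene) are fractionally assigned in proportion to current locus abundance estimates, which are then re-estimated until convergence. The compatibility lists and counts are synthetic, and this is not the pipeline's code.
```python
# Toy sketch of EM-based reassignment of multi-mapped reads between two loci.
import numpy as np

# Each read maps to one or more loci (indices); multi-mapped reads list several.
read_compat = [
    [0],        # uniquely mapped to locus 0
    [0],
    [0, 1],     # multi-mapped between locus 0 and locus 1
    [0, 1],
    [1],        # uniquely mapped to locus 1
]
n_loci = 2
abundance = np.full(n_loci, 1.0 / n_loci)           # start from a uniform prior

for _ in range(50):                                  # iterate E and M steps to convergence
    counts = np.zeros(n_loci)
    for loci in read_compat:
        w = abundance[loci]
        counts[loci] += w / w.sum()                  # E-step: fractional read assignment
    abundance = counts / counts.sum()                # M-step: re-estimate locus abundances

print("estimated locus abundances:", np.round(abundance, 3))
print("corrected read counts:", np.round(abundance * len(read_compat), 2))
```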