Schedule subject to change
All times listed are in CDT

Tuesday, May 13th

8:45-9:00

Conference Welcome

Format: In Person

9:00-10:00

Invited Presentation: Computational Immunology: Bridging Algorithms, Biology, and Medicine

Confirmed Presenter: Aly Azeem Khan

Format: In Person

10:30-10:50

Proceedings Presentation: GRPhIN: Graphlet Characterization of Regulatory and Physical Interaction Networks

Confirmed Presenter: Altaf Barelvi, Reed College, United States

Format: In Person

10:50-11:10

Proceedings Presentation: Vector Semantics of Multidomain Protein Architectures

Confirmed Presenter: Xiaoyue Cui, Carnegie Mellon University, United States

Format: In Person

11:10-11:25

lociPARSE: A Locality-aware Invariant Point Attention Model for Scoring RNA 3D Structures

Confirmed Presenter: Sumit Tarafder, Virginia Tech, United States

Format: In Person

Introduction

Advancements in deep learning have improved RNA 3D structure prediction, but developing reliable scoring functions remains challenging, especially without experimental structures. Existing methods fall into two categories: statistical potentials struggle to distinguish accurate structures due to a limited understanding of RNA energetics, while deep learning approaches rely on RMSD, which fails to capture local atomic environments and RNA flexibility. A better alternative to RMSD as a ground truth is the Local Distance Difference Test (lDDT), which is superposition-free, rotation-invariant, and robust to structural variations. How can we design an RNA scoring model that leverages lDDT while ensuring invariance under global Euclidean transformations?

Methods

Here, we provide such a solution by developing a new attention-based architecture, called lociPARSE (locality-aware invariant Point Attention-based RNA ScorEr) [1], for scoring RNA 3D structures using local nucleotide-wise lDDT instead of RMSD. Inspired by AlphaFold2, it defines nucleotide-wise frames with rotation and translation parameters to model local atomic environments. By modifying invariant point attention (IPA) to incorporate RNA-specific locality, lociPARSE captures structural accuracy at the nucleotide level. It outperforms traditional statistical potentials and state-of-the-art ML-based scoring methods such as ARES across multiple benchmarks, including CASP15 blind tests, on a wide range of performance measures.
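
To make the nucleotide-wise lDDT target concrete, the following is a minimal sketch of a simplified lDDT that uses a single representative point per nucleotide. It is not the lociPARSE implementation: the function name, the one-point-per-nucleotide simplification, the 15 Å inclusion radius, and the 0.5/1/2/4 Å tolerance thresholds are assumptions chosen for illustration (the published lDDT definition works on all atoms).

```python
import math

def per_residue_lddt(ref, model, radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified per-nucleotide lDDT (illustrative assumption, one point per nucleotide).

    ref, model: lists of (x, y, z) coordinates in the same nucleotide order.
    Only pairwise distances are compared, so no superposition is needed and the
    score is unchanged by rotating or translating the model.
    """
    if len(ref) != len(model):
        raise ValueError("ref and model must have the same length")
    scores = []
    for i in range(len(ref)):
        preserved = 0
        total = 0
        for j in range(len(ref)):
            if i == j:
                continue
            d_ref = math.dist(ref[i], ref[j])
            if d_ref >= radius:
                continue  # only the local environment of nucleotide i counts
            diff = abs(d_ref - math.dist(model[i], model[j]))
            preserved += sum(diff < t for t in thresholds)
            total += len(thresholds)
        scores.append(preserved / total if total else 0.0)
    return scores

reference = [(0, 0, 0), (5, 0, 0), (0, 5, 0), (5, 5, 0)]
# Rotate by 90 degrees about z and translate: all distances are preserved.
moved = [(-y + 10, x + 3, z) for x, y, z in reference]
print(per_residue_lddt(reference, moved))
```

Because only internal distances enter the score, the rigidly moved copy scores 1.0 at every nucleotide, which is the invariance property the abstract asks of the scoring model.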

Results and Discussions

We compared lociPARSE with statistical potentials (rsRNASP, cgRNASP, RASP, DFIRE-RNA) and ML-based methods (RNA3DCNN, ARES) on 12 blind CASP15 RNA test targets. The corresponding 3D structural models were collected directly from the CASP15 website (https://predictioncenter.org/casp15/) and are based on the blind predictions submitted by the groups participating in the CASP15 RNA 3D structure prediction challenge. lociPARSE consistently outperformed all methods across nearly all performance metrics. A key strength is its ability to predict nucleotide-wise quality scores, where it surpasses RNA3DCNN, the only other method with this capability. By combining a locality-aware IPA framework with lDDT, lociPARSE effectively assesses nucleotide accuracy while considering local atomic environments, addressing core challenges in RNA scoring.

Availability

lociPARSE is published in the Journal of Chemical Information and Modeling (https://doi.org/10.1021/acs.jcim.4c01621) and freely available to download under the GNU General Public License v3 at https://github.com/Bhattacharya-Lab/lociPARSE.

References

[1] Sumit Tarafder, Debswapna Bhattacharya, “lociPARSE: a locality-aware invariant point attention model for scoring RNA 3D structures”, Journal of Chemical Information and Modeling, Volume 64, Issue 22, Pages 8655–8664, November 2024, doi: https://doi.org/10.1021/acs.jcim.4c01621

11:25-11:40

PharmAlchemy: An Agentic Framework for Integrative Drug–Gene–Disease Knowledge and Precision Drug Discovery

Confirmed Presenter: Kevin Song, The University of Alabama at Birmingham, United States

Format: In Person

11:40-12:00

Proceedings Presentation: Assessing Deep Segmentation Performance for Macromolecular Subunits Using Extreme Points

Confirmed Presenter: Manuel Zumbado-Corrales, Instituto Tecnológico de Costa Rica, Costa Rica

Format: Live Stream

13:30-14:30

Invited Presentation: Programming life that is not alive: biocomputing in synthetic cells

Confirmed Presenter: Kate Adamala, University of Minnesota, United States

Format: In Person

14:30-14:45

CLEAR: Concise List Enrichment Analysis using R

Confirmed Presenter: Xinglin Jia, Department of Mathematics, Iowa State University, United States

Format: In Person

Many modern high-throughput methods provide genome-wide data for all genes, SNPs, or other molecular features. Since biological functions are carried out by interacting proteins rather than individual genes, gene set analysis is crucial for interpreting these large-scale datasets. Model-based gene set analysis methods such as GenGO and MGSA use probabilistic approaches to infer which biological categories are activated. GenGO identifies active Gene Ontology (GO) categories using a generative probabilistic model that accounts for noise and overlapping GO terms to reduce redundancy in the result. MGSA extends this framework by introducing a Bayesian network, simultaneously inferring all categories, and improving robustness against noise. These methods have the advantage of returning a concise, non-redundant group of gene sets, which traditional methods such as Over-representation Analysis (ORA) and Gene Set Enrichment Analysis (GSEA) cannot provide because they test each gene set individually.
However, GenGO and MGSA rely on binary gene activation states, which are determined by an arbitrary, user-defined threshold, rather than utilizing the underlying continuous test statistics such as effect sizes or p-values. Some extensions of MGSA incorporate the topological structure of the Gene Ontology or additional constraints to improve the model performance, but the statistical information associated with the genes is disregarded. We propose a novel, Bayesian model-based method, Concise List Enrichment Analysis using R (CLEAR), which directly models the gene-level statistics rather than the binary activation states.
CLEAR assumes that the gene statistics follow distinct distributions under the alternative and null hypotheses, enabling a more sensitive and nuanced interpretation of gene-level variation within gene sets. This probabilistic, continuous framework improves the robustness and interpretability of gene set analysis. We compared the performance of CLEAR against established methods using both in silico and real datasets, assessing its sensitivity and ability to return gene sets with established phenotype relevance. CLEAR achieves higher sensitivity and improves output interpretability by reducing redundancy and preserving more meaningful information. In conclusion, CLEAR is a powerful gene set enrichment analysis method that leverages all the information available in gene-level statistics and identifies relevant gene sets with greater precision.
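
For contrast with the continuous modelling described above, here is a minimal sketch of the threshold-based, one-set-at-a-time testing that ORA performs. It is not CLEAR, GenGO, or MGSA; the function names and the toy p-values are hypothetical, and the test used is the standard upper-tail hypergeometric test.

```python
from math import comb

def ora_p_value(n_universe, n_significant, set_size, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap)."""
    max_k = min(set_size, n_significant)
    total = comb(n_universe, n_significant)
    tail = sum(
        comb(set_size, k) * comb(n_universe - set_size, n_significant - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / total

def ora_from_scores(gene_p, gene_set, alpha=0.05):
    """Binarize gene-level p-values at alpha, then test a single gene set."""
    significant = {g for g, p in gene_p.items() if p < alpha}
    members = gene_set & set(gene_p)
    return ora_p_value(len(gene_p), len(significant),
                       len(members), len(significant & members))

# Hypothetical gene-level p-values for a universe of ten genes.
gene_p = {"a": 0.001, "b": 0.01, "c": 0.04, "d": 0.2, "e": 0.3,
          "f": 0.5, "g": 0.6, "h": 0.7, "i": 0.8, "j": 0.9}
gene_set = {"a", "b", "c", "d"}
print(ora_from_scores(gene_p, gene_set, alpha=0.05))
print(ora_from_scores(gene_p, gene_set, alpha=0.005))
```

The two calls differ only in the user-chosen cutoff, yet they give different enrichment p-values for the same data. That sensitivity to an arbitrary threshold is the limitation of binary activation states that the abstract points to.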

14:45-15:00

Enumerating and Exploring the Space of Clonal Trees

Confirmed Presenter: Kendra Winhall, Carleton College, United States

Format: In Person

Tumor growth is a complex evolutionary process initiated by an abnormal ancestor cell progressively gaining mutations, which eventually results in uncontrolled cell division. Researchers use a structure called a clonal tree--which depicts ancestral relationships between mutated cell populations--to represent a tumor's evolutionary history. Researchers have developed numerous methods of reconstructing clonal trees from tumor sequencing data. To evaluate the accuracy of their tree inference methods, researchers have devised many techniques to simulate clonal trees. However, previous research has not analyzed the space of clonal trees under different evolutionary models. Such exploration would help to better understand the underlying structure and characteristics of these spaces and create appropriate simulation procedures.

We analyzed four different categories of clonal trees, each with its own set of assumptions. For each category, we designed and implemented algorithms that provably generate all such clonal trees with a specified number of mutations. The Infinite Sites Assumption (ISA) is a common model that states that once a mutation is gained, it is never lost or gained again. We analyzed two categories of ISA trees: one that permits only one new mutation per subpopulation of cells and one that allows multiple. We also relaxed the ISA by exploring k-Dollo trees, which allow each mutation to be deleted up to k times. These included a simplified subset, which we named Restricted 1-Dollo Trees, and the broader set of all 1-Dollo trees.

We analyzed the generated trees to discover patterns in the data across different assumptions. We investigated a frequently used simulation method for ISA trees and found that it was not representative of the corresponding full set of ISA trees. We then verified that an approach called Wilson's Algorithm, which is designed to generate uniform spanning trees, successfully generates representative samples of ISA trees. However, we are currently unaware of representative sampling methods for 1-Dollo trees.
The continued research and development of algorithms that appropriately sample trees for any of these groups will have implications for the evaluation and comparison of clonal tree inference methods.
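
As an illustration of the sampling idea, the sketch below runs Wilson's algorithm (loop-erased random walks) on a complete graph. It assumes, for illustration only, that an ISA tree with one new mutation per subpopulation can be represented as a spanning tree over nodes 0..n, where node 0 is the unmutated root and node m is the clone that first carries mutation m. The function name and this encoding are hypothetical and not taken from the authors' code.

```python
import random

def wilson_tree(n_mutations, rng=None):
    """Sample a uniform spanning tree of the complete graph on nodes 0..n_mutations,
    rooted at node 0, using Wilson's algorithm. Returns a child -> parent map."""
    rng = rng or random.Random()
    nodes = list(range(n_mutations + 1))
    parent = {0: None}  # the tree initially contains only the root
    for start in nodes[1:]:
        if start in parent:
            continue
        # Random walk until the current tree is hit. Overwriting nxt[u] on a
        # revisit discards the loop that was just closed (loop erasure).
        nxt = {}
        u = start
        while u not in parent:
            v = rng.choice([w for w in nodes if w != u])
            nxt[u] = v
            u = v
        # Attach the loop-erased path to the tree.
        u = start
        while u not in parent:
            parent[u] = nxt[u]
            u = nxt[u]
    return parent

tree = wilson_tree(6, random.Random(0))
for child in sorted(tree):
    print(child, "<-", tree[child])
```

Under this encoding, each sampled parent map is one candidate tree, and repeated calls draw independently and uniformly from all spanning trees of the complete graph.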

15:00-15:15

Hydrocarbon degrading potential of microbes in Great Lakes sediments assessed through metagenomics

Confirmed Presenter: Yogita Warkhade, Michigan Technological University, United States

Format: In Person

15:15-15:30

Jaeger: an accurate and fast deep-learning tool to detect bacteriophage sequences

Confirmed Presenter: Yasas Wijesekara, Institute of Bioinformatics, University Medicine Greifswald, Felix-Hausdorff-Str. 8, 17475 Greifswald, Germany

Format: In Person

Bacteriophages, the viruses that infect prokaryotes, are the most abundant biological entities on Earth and play fundamental roles in shaping microbial communities. Despite their ubiquity, the vast majority remain uncharacterized, constituting a significant fraction of unidentified sequences in metagenomic datasets. While deep learning-based tools have improved viral sequence identification, they often suffer from high false positive rates when analyzing divergent sequences [1]. To address this challenge, we introduce Jaeger, a homology-free deep learning framework designed to identify bacteriophage genome fragments from metagenome-assembled contigs.
Jaeger leverages a convolutional neural network (CNN) with dilated convolutions and six-frame amino acid parameter sharing to directly recognize protein-level signatures from six-frame translated nucleotide sequences. The model is trained to classify short nucleotide fragments into one of four categories: bacteria, archaea, eukaryote, and phage. For longer sequences, a sliding window approach aggregates predictions across multiple non-overlapping fragments to determine the final classification.
While neural networks are highly sensitive, they can generate spurious predictions when encountering sequences that significantly deviate from the training distribution. To mitigate this, we incorporated an auxiliary model based on neural mean discrepancy [2], termed the reliability model, to detect out-of-distribution samples at deployment, further improving performance.
Extensive benchmarking on the IMG/VR [3] database and real-world metagenomes shows that Jaeger achieves consistently high sensitivity (0.87) and precision (0.92). Compared to state-of-the-art tools such as VirSorter2 and geNomad [4], Jaeger achieves similar classification accuracy while offering substantial computational speed improvements, running up to 20 times faster in CPU mode and 140 times faster with GPU acceleration. Its scalability allows it to process vast metagenomic datasets efficiently.
Application of Jaeger to approximately 16,000 metagenomic assemblies from the MGnify [5] database identified over five million putative phage contigs, highlighting its potential for uncovering hidden viral diversity. Additionally, Jaeger effectively identifies prophages and distinguishes viral sequences from bacterial, archaeal, and eukaryotic sequences. By integrating deep learning with reliability assessment, Jaeger enhances the robustness of viral sequence identification, making it a powerful tool for large-scale metagenomic studies.
Jaeger is open-source, easy to install, and supports GPU acceleration, making it accessible for large-scale analyses. Its ability to accurately and efficiently classify bacteriophage sequences will aid in uncovering viral diversity and advancing microbial ecology research.
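
The input preparation described above (six-frame translation and non-overlapping fragments) can be sketched as follows. This is not Jaeger's code: the function names are hypothetical, the standard genetic code is assumed, and dropping a trailing partial fragment is an assumption made for simplicity.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame(seq):
    """Return three forward-frame and three reverse-complement-frame translations."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:]) for strand in (seq, rev) for offset in range(3)]

def fragments(seq, size):
    """Split a contig into non-overlapping windows, dropping a trailing partial window."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]

print(six_frame("ATGGCCTAA"))
print(fragments("ATGGCCTAAGG", 4))
```

In a full pipeline, each fragment would be translated, scored by the classifier, and the per-fragment predictions combined into a contig-level call; the combination rule is not specified in the abstract and is left out here.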
Availability:
Code: https://github.com/MGXlab/Jaeger
Preprint: https://www.biorxiv.org/content/10.1101/2024.09.24.612722v1

Bibliography
1. Wu, L.-Y. et al. Benchmarking bioinformatic virus identification tools using real-world metagenomic data across biomes. Genome Biol. 25, 97 (2024).
2. Dong, X. et al. Neural Mean Discrepancy for Efficient Out-of-Distribution Detection. arXiv (2021) doi:10.48550/arxiv.2104.11408.
3. Camargo, A. P. et al. IMG/VR v4: an expanded database of uncultivated virus genomes within a framework of extensive functional, taxonomic, and ecological metadata. Nucleic Acids Res. 51, D733–D743 (2023).
4. Camargo, A. P. et al. Identification of mobile genetic elements with geNomad. Nat. Biotechnol. 42, 1303–1312 (2024).
5. Richardson, L. et al. MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Res. 51, D753–D759 (2023).

16:00-16:20

Proceedings Presentation: Revealing tissue architecture through the hypercomplex Fourier analysis of spatial transcriptomics data

Confirmed Presenter: H. Robert Frost, Dartmouth College, United States

Format: In Person

16:20-16:40

Proceedings Presentation: ScaleSC: A superfast and scalable single cell RNA-seq data analysis pipeline powered by GPU

Confirmed Presenter: Haotian Zhang, Biogen, United States

Format: In Person

16:40-16:55

ScCheck - Evaluating Data Quality in Plant Single-Cell RNA Analysis Through Denoising Techniques

Confirmed Presenter: Sania Zafar Awan, MU Institute of Data Science & Analytics, University of Missouri - Columbia, United States

Format: In Person

Single-cell RNA sequencing (scRNA-Seq) significantly advances our ability to explore complex biological systems by providing gene expression profiles at the cellular level. However, this technology is still vulnerable to technical noise, including dropout effects and insufficient detection sensitivity, which can obscure authentic biological signals. While various denoising techniques have been proposed to address these challenges, their effectiveness has primarily been assessed using human and mouse datasets, creating a notable gap in understanding how these methods apply to plant systems.

This research develops a pipeline to thoroughly benchmark three advanced denoising methods, MAGIC, Deep Count Autoencoder (DCA), and scVI, applied to plant single-cell transcriptomics data. The study examines the impact of denoising on critical downstream analyses, including clustering accuracy, the resolution of transcriptional subpopulations, and the ability to recover marker genes. Additionally, we consider computational factors such as runtime efficiency, scalability, and reproducibility, which are crucial for integrating these methods into plant research workflows. In contrast to studies that prioritize marker gene discovery, this research positions denoising as a crucial step for enhancing data quality and interpretability within plant scRNA-seq workflows. The findings establish a replicable framework for benchmarking denoising methods in non-model organisms, highlighting specific trade-offs that researchers must consider when selecting a denoising strategy. The pipeline also offers options for automatic hyperparameter tuning of models such as DCA and scVI.

This study emphasizes plant datasets, addressing a critical need in agricultural genomics. It marks a step toward the innovative use of single-cell data for crop improvement, stress response investigations, and functional annotation. The results are designed to guide and facilitate the adoption of complex computational workflows within the field of plant single-cell research analysis.

16:55-17:10

Autoencoder Mixed Effects Deep Learning for the interpretable analysis of single cell RNA sequencing data by separately modeling batch-specific and batch-agnostic effects

Confirmed Presenter: Aixa X. Andrade, University of Texas Southwestern Medical Center, United States

Format: In Person

Single-cell RNA sequencing data can provide unprecedented insights into cellular heterogeneity, yet batch effects arising from both technical and biological factors can obscure meaningful signals. We propose an autoencoder Mixed Effects Deep Learning framework, called aMEDL, that separately models batch-invariant and batch-specific variation to improve the suppression of batch effects while preserving biologically relevant information. The aMEDL framework comprises two complementary autoencoder networks: an adversarial network that learns a batch-invariant representation, and a probabilistic network that learns batch-specific signals. This dual network approach explicitly models batch distributions rather than discarding them, capturing crucial biological variation that might otherwise be lost. We evaluate aMEDL across diverse datasets, including a single-cell dataset from cardiovascular tissue of healthy donors [1] and a single-nucleus dataset from subjects with Autism Spectrum Disorder (ASD) and Typically Developing (TD) individuals [2]. The framework is compared to the traditional method for scRNA-seq processing, principal component analysis (PCA), and to a newer neural network approach for data abstraction that uses a single autoencoder (AE) network. In both cases, the proposed framework outperforms the comparable methods. In the Healthy Heart dataset, measuring batch separability via the mean Average Silhouette Width (ASW), which ranges from −1.0 to +1.0, we find that aMEDL’s random effects subnetwork accurately captures batch differences (higher is better) with an ASW of +0.37, outperforming PCA (−0.48) and AE (−0.45). Meanwhile, its fixed effects component effectively suppresses batch signals in the latent space (lower is better), with an ASW of −0.50 compared to −0.48 (PCA) and −0.45 (AE). Additionally, in UMAP-based visualizations, aMEDL regularly outperforms the comparable methods. For example, in the ASD dataset, it preserved cell type information that PCA did not and avoided spurious clusters observed from the AE approach. Similar favorable results were obtained in the ASD dataset, where the random effects subnetwork reliably captured donor-specific variations, demonstrating aMEDL’s ability to disentangle donor variability from shared biological signals. Overall, aMEDL not only eliminates undesired batch effects, but also maintains batch-specific differences, preventing overcorrection and false clustering. As the first deep learning framework to simultaneously model batch-invariant and batch-specific signals, aMEDL provides an interpretable, generative platform for uncovering disease mechanisms, donor variability, and technical artifacts in single-cell transcriptomics, ultimately paving the way for deeper insights into health and disease.
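
The batch-separability metric quoted above can be illustrated with a short sketch of the Average Silhouette Width, computed with batch identity as the cluster label. This is the textbook silhouette definition with Euclidean distance, not the authors' evaluation code; the function name and toy coordinates are hypothetical.

```python
import math
from collections import defaultdict

def average_silhouette_width(points, labels):
    """Mean silhouette over all points, using Euclidean distance.
    With batch labels, values near +1 mean batches are well separated,
    while values at or below 0 mean batches are mixed."""
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two distinct labels")
    widths = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same batch
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other batch
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom else 0.0)
    return sum(widths) / len(widths)

latent = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(average_silhouette_width(latent, [0, 0, 1, 1]))  # batches separated
print(average_silhouette_width(latent, [0, 1, 0, 1]))  # batches mixed
```

This matches the reading in the abstract: a latent space meant to capture batch differences should score high on this metric, and one meant to suppress them should score low.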

References
[1] Litvinukova, M. et al. Cells of the adult human heart. Nature 588, 466–472 (2020).
[2] Velmeshev, D. et al. Single-cell genomics identifies cell type-specific molecular changes in autism. Science 364, 685–689 (2019).

17:10-17:25

Enhanced Single-Cell Transcript Assembly via Discriminative Modeling of UMI-indexed and Internal Reads

Confirmed Presenter: Xiaofei Carl Zang, Pennsylvania State University, United States

Format: In Person

Single-cell RNA sequencing (scRNA-seq) has revolutionized transcriptome profiling at cellular resolution. However, full-length transcript reconstruction remains a critical bottleneck. While technologies such as SMART-Seq3 combine UMI-linked reads, which index and thread together multiple reads from the same unique molecule, with internal reads that fill coverage gaps, existing assembly tools fail to leverage the distinct biological and statistical properties of these read types. For example, UMI reads exhibit a distribution bias towards the 5' end and sparse coverage, whereas internal reads resemble bulk RNA-seq data and increase both coverage and noise levels. Current methods, including reference-dependent approaches (e.g., scRNAss) and meta-assemblers (e.g., Aletsch, TransMeta, PsiClass), borrow additional information from either a reference transcriptome or other cells. Hence, they either sacrifice cell-specificity or overlook read-type heterogeneity, leading to compromised precision or sensitivity.

Here, we present Amaranth, a novel single-cell assembler that discriminatively integrates UMI and internal reads to achieve accurate, cell-specific transcript modeling without relying on external references or cross-cell aggregation. Amaranth employs a multi-tiered computational framework designed to address the distinct specificity, sensitivity, and strandness biases inherent to UMI-linked and internal reads. First, a noise-aware read classifier filters spurious internal reads and PCR duplicates while preserving unique molecular signals. Second, a discriminative algorithm fills UMI-read coverage gaps with internal reads while circumventing read contaminations, for example, from intron retentions. Third, precise detection of transcription start/end sites leverages UMI-read termini to anchor isoform boundaries. Finally, a probabilistic model infers unique transcript molecules by clustering reads using both splice junctions and UMI indices, minimizing isoform ambiguity. By explicitly distinguishing and weighting read types, Amaranth reduces false positives from non-specific internal reads while recovering transcripts undetected by UMI-dependent approaches.

Benchmarked on SMART-Seq3 datasets from human and mouse, Amaranth outperformed other state-of-the-art tools, including Scallop2 and StringTie2, achieving an increase in precision and sensitivity at the single-cell level. By enabling transcript-level resolution in scRNA-seq, Amaranth unlocks new avenues to study isoform-specific regulation across heterogeneous cell populations.

17:25-17:40

scGEN: Adaptive Weighting Strategy for Challenging Cell Clustering in Single-cell RNA-seq

Confirmed Presenter: Fang Huang, University of North Dakota, United States

Format: Live Stream

Recent advances in transcriptional sequencing have greatly improved our ability to explore heterogeneity within cell populations, and a crucial component of that progress has been built on improvements in unsupervised clustering. The quality of clustering significantly influences downstream analyses. However, many current methods fail to adequately address challenging cell populations, particularly those at the boundaries between different cell states or developmental stages. This limitation risks the loss of essential biological insights. Attempts to apply hard-sample mining techniques from traditional machine learning to single-cell analysis have so far fallen short of capturing the complexity of cellular relationships, often missing key biological contexts. Furthermore, current methods typically rely on highly variable genes (HVGs) without considering differences in their expression levels. HVGs with lower expression may still carry important disease signals, and disregarding these subtle cues limits the model's ability to capture clinically relevant information. To overcome these limitations, we present a single-cell Gene-aware Embedded Network (scGEN), which was designed to capture the topological relationships among cells while accounting for challenging samples to enhance clustering accuracy. scGEN employs an adaptive weighting strategy to guide the network in concentrating intensively on the representative hard cells. It employs two fine-tuning parameters to prioritize HVGs based on their expression levels, allowing the model to detect nuanced, lowly-expressed signals that are crucial for understanding complex biological processes. Testing on eight independent scRNA-seq datasets demonstrated that scGEN consistently outperforms seven other leading clustering approaches, underscoring its effectiveness in revealing significant biological structures. 
Additionally, scGEN resolved 14 distinct cell populations from a human fetal pituitary dataset and identified 372 cells whose annotations were discordant with those in the original publication. The majority of the discordant cells were originally labeled as stem cells but, in fact, exhibited higher expression of various gonadotrope precursor markers (reclassified as Pre.Gonado). Moreover, scGEN refined cell-type assignments and uncovered subtle but biologically meaningful differences, as demonstrated by differential expression and GO enrichment analysis results that provided a more consistent and robust view of cellular heterogeneity than existing approaches.

17:40-17:55

Improving rigor in defining distinct cell groups within single cell RNA-seq analysis

Confirmed Presenter: Sarah Munro, University of Minnesota, United States

Format: In Person

Defining distinct cell groups within a population is a critical step in single cell RNA-seq (scRNA-seq) analyses. Although a variety of approaches exist for defining cell clusters in scRNA-seq experiments, it remains challenging for researchers to employ methods that maintain scientific objectivity and rigor, because some of these approaches allow too much subjectivity and enable misleading statistical analysis. Bias can be introduced when researchers overinterpret UMAP visualizations and gene markers, or when they use reference-based annotations that are not well-matched to their experimental data.

In light of these challenges, what should researchers do when defining cell groups in their scRNA-seq analysis? In our roles as bioinformaticians at the Minnesota Supercomputing Institute we have collectively analyzed many different types of scRNA-seq experiments. Our aim with this work is to demonstrate the practices that we use to gain confidence in our cell annotation and clustering.

When working with a well-understood tissue, reference-based cell typing is commonly applied. Here we emphasize the need to use multiple cell type references at different scales and compare annotation consistency with defined metrics. For example, in an experiment in which CD45+ immune cells were sorted from a tissue, one might expect that using an immune-cell-only reference would be sufficient, but we have observed that sorting can fail silently. We show the importance of using a broader reference (e.g. including epithelial cells) to prevent misannotation. In other cases, when reference-based annotation is ineffective or not possible (e.g. highly-specific cell populations or non-model organisms), we often rely on reference-free clustering algorithms to define cell groups followed by marker gene analysis.

In both of these approaches for defining cell groups it is important to mitigate bias and avoid misleading results. Based on our practical experience, we have identified a suite of appropriate published metrics and visualization tools that we can apply to assess the consistency of results across algorithms and limit subjectivity when we select our final cell group definitions. These recommendations are intended to be adaptable to a wide variety of scRNA-seq study designs and therefore should be beneficial to researchers across disciplines.

Wednesday, May 14th

8:45-9:00

Welcome - Day 2

Format: In Person

9:00-10:00

Invited Presentation: Accelerate Discovery by Integrating AI, Statistics and Genomic Health Science

Confirmed Presenter: Xihong Lin

Format: In Person

10:30-10:50

Proceedings Presentation: Asymmetric Integration of Various Cancer Datasets for Identifying Risk-Associated Variants and Genes

Confirmed Presenter: Ruixuan Wang, University of Michigan, United States

Format: In Person

10:50-11:05

Genotyping CFTR with next-generation sequencing data using T1K

Confirmed Presenter: Yifei Gao, Dartmouth College, United States

Format: In Person

Cystic fibrosis (CF), one of the most fatal monogenic diseases in the United States, is caused by mutations in the CFTR (cystic fibrosis transmembrane conductance regulator) gene on chromosome 7. These mutations lead to malfunction of the CFTR protein, which is present in every organ of the body, resulting in thick, sticky mucus that causes blockages and traps germs, leading to recurrent infections. Over 1,000 CFTR variants have been identified, the most common being a deletion of three base pairs that results in the loss of the amino acid phenylalanine at position 508 (F508del) of the protein. These variants are classified into six categories based on their functional effects. Accurate genotyping of CFTR is essential for patient stratification and the development of precision medicine strategies.

T1K is a powerful computational method developed by our lab that can robustly genotype highly polymorphic genes, such as HLA and KIR genes, from sequencing data. T1K genotypes genes by identifying abundant alleles from the reference allele database based on read alignments. In detail, it implements a weighted expectation-maximization (EM) algorithm to simultaneously compute allele abundances across all the genes in the database. Unlike many HLA and KIR genotyping methods that strictly rely on the IPD-IMGT/HLA and IPD-KIR databases, T1K is flexible with respect to the reference database format and can be extended to genotype other genes.
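
The idea of estimating allele abundances from ambiguous read alignments can be sketched with a plain EM loop. This is a generic, unweighted illustration and not T1K's algorithm, which the abstract describes as a weighted EM. The function name, the input format (one set of compatible alleles per read), and the toy data are assumptions.

```python
def em_allele_abundance(read_hits, alleles, n_iter=200, tol=1e-9):
    """Estimate relative allele abundances from reads that may align to several alleles.

    read_hits: list of sets, each holding the alleles one read is compatible with.
    Returns a dict mapping allele -> estimated abundance (summing to 1).
    """
    theta = {a: 1.0 / len(alleles) for a in alleles}
    for _ in range(n_iter):
        # E-step: split each read across its compatible alleles in proportion
        # to the current abundance estimates.
        counts = dict.fromkeys(alleles, 0.0)
        for hits in read_hits:
            z = sum(theta[a] for a in hits)
            if z == 0:
                continue
            for a in hits:
                counts[a] += theta[a] / z
        total = sum(counts.values())
        if total == 0:
            break
        # M-step: renormalize the expected read counts.
        updated = {a: c / total for a, c in counts.items()}
        delta = max(abs(updated[a] - theta[a]) for a in alleles)
        theta = updated
        if delta < tol:
            break
    return theta

# Toy example: six reads unique to A, two unique to B, two shared, none for C.
reads = [{"A"}] * 6 + [{"A", "B"}] * 2 + [{"B"}] * 2
print(em_allele_abundance(reads, ["A", "B", "C"]))
```

In this toy case the shared reads are apportioned according to the evidence from unique reads, and the allele with no supporting reads drops to zero abundance. A genotype call would then pick the most abundant alleles per gene.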

In this study, we adapt the T1K framework to analyze RNA and DNA sequencing data for CFTR genotyping. We developed a method to generate the reference allele database for T1K by systematically introducing the mutation for each CFTR variant, based on the variant names curated in the CFTR2 database (https://www.cftr2.org/). We applied T1K to multiple publicly available sequencing datasets from NCBI SRA and successfully identified CFTR genotypes across diverse patient samples. Our work provides a novel solution for CFTR genotyping in clinical and research settings using sequencing data. Our approach also demonstrates T1K’s generalizability to genes other than HLA and KIR, indicating its potential for broad applicability in genetic studies. All code and analysis pipelines developed for this study are publicly available at T1K’s GitHub repository (https://github.com/mourisl/T1K).

11:05-11:20

Statistical considerations for allele-specific mQTL association analysis using long-read nanopore-based DNA sequencing

Confirmed Presenter: Nicholas Larson, Mayo Clinic College of Medicine and Science, United States

Format: In Person

Common single-nucleotide polymorphisms (SNPs) that are associated with complex traits are believed to be primarily regulatory in nature, conferring effects via gene expression dysregulation. Some of these may act as methylation quantitative trait loci (mQTL), whereby epigenetic alterations mediate these SNP regulatory effects. Traditionally, mQTL studies require separate collection of genetic and epigenetic datasets on the same set of samples, and next-generation sequencing methods for methylation profiling are prone to biases from PCR amplification and bisulfite conversion. Third-generation single-molecule nanopore DNA sequencing offers multiple advantages over these established approaches for mQTL analysis, notably ultra-long read length and simultaneous characterization of 5mC and 5hmC nucleotide modifications, yielding both necessary data elements in one experiment. Like other sequencing-based data, mQTL analyses using nanopore reads are amenable to count-based modelling, which can leverage both sequencing depth and number of altered reads in modeling base methylation probabilities. Herein, we propose a statistical framework that explicitly leverages long-range phase information in nanopore-based mQTL association analyses. First, we discuss a two-stage strategy for combining read-backed and statistical phasing utilizing LongPhase and ShapeIT4 to yield high-confidence chromosomally phased genetic and epigenetic data. We next define extensions of (quasi-)binomial models for allele-specific mQTL analysis, notably in handling read-phase uncertainty within and across individual samples. We examine the properties of these methods under various conditions via simulation, including latent phase switch errors for distal mQTL analyses. 
Finally, we illustrate our methods via a real-data application using nanopore DNA sequencing data on 28 normal prostate tissue samples, both to validate previously identified mQTLs linked to prostate cancer risk and to explore novel long-range mQTL associations afforded by high-confidence chromosomal phasing.
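
As a rough illustration of count-based allele-specific modelling, the sketch below compares methylated read counts between reads phased to the reference and alternate haplotypes at one CpG site, using a plain binomial likelihood-ratio test. This is an assumption-laden simplification and not the proposed framework: it ignores overdispersion (the quasi-binomial part) and read-phase uncertainty, and the function names are hypothetical.

```python
import math


def binom_loglik(k, n, p):
    """Binomial log-likelihood, omitting the constant binomial coefficient."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll


def allele_specific_lrt(meth_ref, total_ref, meth_alt, total_alt):
    """Likelihood-ratio test for a methylation difference between haplotypes.

    Null: both haplotypes share one methylation probability.
    Alternative: each haplotype has its own probability.
    Returns (test statistic, p-value from a chi-square with 1 df).
    """
    p_ref = meth_ref / total_ref
    p_alt = meth_alt / total_alt
    p_pool = (meth_ref + meth_alt) / (total_ref + total_alt)
    ll_alt = (binom_loglik(meth_ref, total_ref, p_ref)
              + binom_loglik(meth_alt, total_alt, p_alt))
    ll_null = (binom_loglik(meth_ref, total_ref, p_pool)
               + binom_loglik(meth_alt, total_alt, p_pool))
    stat = max(0.0, 2.0 * (ll_alt - ll_null))
    # Survival function of chi-square(1): P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy counts: 18 of 20 reference-haplotype reads methylated,
# 4 of 20 alternate-haplotype reads methylated.
print(allele_specific_lrt(18, 20, 4, 20))
```

A quasi-binomial variant would additionally scale the statistic by an estimated dispersion parameter.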

11:20-11:35

GWAS Meta-Analysis of Admixed Populations (GMAX) uses local ancestry inference to identify associated loci in GSCAN meta-analysis

Confirmed Presenter: Natashia Benjamin, Penn State College of Medicine, United States

Format: In Person

Admixed populations possess ancestry from multiple continental source groups, resulting in a unique mosaic genome structure composed of segments from distinct continental ancestries. Hence, it is important to analyze admixed genomes properly, accounting for heterogeneity in effect sizes and linkage disequilibrium structure. Existing methods, for example TRACTOR, have already shown that incorporating local ancestry information in genome-wide association studies (GWAS) can increase the power to discover variant-trait associations, especially for admixed populations. Despite this, no current method incorporates local ancestry information in meta-analysis. Here, we develop a method, GMAX Local Ancestry Inference (GMAX-LAI), to incorporate local ancestry across the genome of admixed individuals into GWAS meta-analysis. We first estimate ancestry proportions at a given variant for each admixed study by decomposing its allele frequencies as a weighted sum of the allele frequencies of continental ancestries. Compared with RFMix, a commonly used individual-level LAI method, our approach provides comparable estimates. These ancestry estimates are then incorporated into our mixed-effect meta-regression model to model genetic effects in the meta-analysis. We apply our method to GSCAN (GWAS & Sequencing Consortium of Alcohol and Nicotine use) smoking and drinking traits with a diverse ancestry background (55% European, 15% African American, 6% Latino/Hispanic American, 24% East Asian). For African American studies, the estimated proportions range, on average, from 64%-100% for African ancestry and 0%-36% for European ancestry. For Latino/Hispanic studies, the estimated average compositions are 65%-82% European, 0%-17% African and 20%-28% Native American. We also observe significant differences in ancestry proportions across studies, reflecting substantial study-specific local ancestry genetic structure.
By meta-analyzing 121 studies, our method identifies 444 loci associated with the ‘Drinks per Week’ (DrnkWk) trait, 32 loci associated with the ‘Age at Smoking Initiation’ (AgeSmk) trait, 74 loci associated with the ‘Cigarettes per Day’ (CigDay) trait, 76 loci associated with the ‘Smoking Cessation’ (SmkCes) trait and 930 loci associated with the ‘Smoking Initiation’ (SmkInit) trait. Compared with MEMO, a global ancestry model, our local ancestry model identified novel loci mapped to MSANTD4, RIT2, TENM4, GAS2L3, STARD9, UXS1 and NLGN1. Overall, our model highlights the benefits of including local ancestry information for admixed individuals under a GWAS meta-analysis setting. The application of our method to GSCAN provides a significant step forward in understanding the genetic architecture of tobacco and alcohol use in admixed populations.
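
The allele-frequency decomposition can be illustrated for the simplest case of two source ancestries. The sketch assumes, purely for illustration, that a study's allele frequencies over a small window of neighbouring variants are modelled as `w * f_a + (1 - w) * f_b` and solves for `w` by least squares; the windowing, the function name, and the toy frequencies are not taken from the abstract.

```python
def two_way_ancestry_proportion(f_admixed, f_src_a, f_src_b):
    """Least-squares estimate of the proportion w of source ancestry A.

    Model (an illustrative assumption): for each variant j in a window,
        f_admixed[j] ~= w * f_src_a[j] + (1 - w) * f_src_b[j]
    The closed-form solution is clipped to the valid range [0, 1].
    """
    num = sum((f - b) * (a - b)
              for f, a, b in zip(f_admixed, f_src_a, f_src_b))
    den = sum((a - b) ** 2 for a, b in zip(f_src_a, f_src_b))
    if den == 0.0:
        raise ValueError("source frequencies are identical; w is not identifiable")
    return min(1.0, max(0.0, num / den))


# Toy reference frequencies for three neighbouring variants.
f_a = [0.10, 0.80, 0.50]
f_b = [0.60, 0.20, 0.40]
# Study frequencies simulated as an 80% / 20% mixture of A and B.
f_study = [0.8 * a + 0.2 * b for a, b in zip(f_a, f_b)]
print(two_way_ancestry_proportion(f_study, f_a, f_b))
```

With more than two source ancestries the same idea becomes a constrained least-squares problem with weights that are non-negative and sum to one.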

11:35-11:55

Proceedings Presentation: Exploration of Chaos Game Representation and Integrative Deep Learning Approaches for Whole-genome Sequencing-Based Grapevine Genetic Testing

Confirmed Presenter: Ping Liang, Department of Biological Sciences, Brock University, Canada

Format: In Person

13:30-14:30

Invited Presentation: Computational methods for improved inference of tumor evolution

Confirmed Presenter: Layla Oesper

Format: In Person

14:30-14:45

NetCIS: A Network-based Common Insertion Site Analysis of Case-Control Sleeping Beauty Screens

Confirmed Presenter: Mathew Fischbach, Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, United States

Format: In Person

14:45-15:00

GRN inference optimization with mammalian gold standard datasets

Confirmed Presenter: Seyifunmi Owoeye, Cincinnati Children's Hospital Medical Centre, United States

Format: In Person

Single-cell RNA-sequencing (scRNA-seq) provides a quantitative, genome-scale approximation of cell behavior within complex tissue environments. Gene regulatory networks (GRNs) describe the control of gene expression by transcription factors (TFs) and are thus a critical engineering tool linking cellular behaviors to targetable molecular regulatory mechanisms. State-of-the-art GRN inference methods use prior information (knowledge of TF binding, promoter-enhancer interactions), often derived from parallel single-nucleus (sn)ATAC-seq (chromatin accessibility data), to improve GRN inference accuracy from scRNA-seq data. For example, our method, the Inferelator, models gene expression as a multivariate linear function of protein TF activities, where an ATAC-derived prior of TF-gene interactions is used to (1) estimate TF activities and (2) guide GRN inference via an adaptive LASSO penalty. Even within our particular modeling framework, numerous modeling decisions impact the quality of GRN inference, including, but not limited to, (1) methods for prior construction from ATAC-seq, (2) how and whether to incorporate generic prior information sources (e.g., literature) when ATAC data are available, and (3) the resolution of gene expression data (single-cell or pseudobulk).

Here, we present two relevant benchmark datasets to enable the assessment of key modeling decisions within the Inferelator and comparison to state-of-the-art methods (SCENIC+, CellOracle, and others). Critically, our benchmarks utilize multiome-seq designs (tandem snRNA-seq and snATAC-seq) from complex mammalian settings that mirror our target GRN inference applications: (1) murine CD4 T cells derived from young and aged tissue contexts (77k cells) and (2) dynamic response of human CD4 T cell populations to T cell receptor activation (66k cells). Equally important, for each benchmark (mouse physiological, human dynamic), we have curated gold-standard TF-gene interactions supported by both TF binding (e.g., ChIP-seq) and TF functional (e.g., perturbation followed by RNA-seq) data. Overall, we find that best practices for the Inferelator depend on context (dynamic versus steady-state) and technical factors (e.g., size of gene expression dataset), supporting that knowledge of how to use a GRN inference tool (as opposed to which tool is used) is critical to achieving quality GRN inference in mammalian settings.

15:00-15:15

Network medicine-based epistasis detection in complex diseases: ready for quantum computing

Confirmed Presenter: Markus Hoffmann, National Institutes of Health, United States

Format: In Person

Most heritable diseases are polygenic. To comprehend the underlying genetic architecture, it is crucial to discover the clinically relevant epistatic interactions (EIs) between genomic single nucleotide polymorphisms (SNPs) (1–3). Existing statistical computational methods for EI detection are mostly limited to pairs of SNPs due to the combinatorial explosion of higher-order EIs. With NeEDL (network-based epistasis detection via local search), we leverage network medicine to inform the selection of EIs that are an order of magnitude more statistically significant compared to existing tools and consist, on average, of five SNPs. We further show that this computationally demanding task can be substantially accelerated once quantum computing hardware becomes available. We apply NeEDL to eight different diseases and discover genes (affected by EIs of SNPs) that are partly known to affect the disease. These results are reproducible across independent cohorts. EIs for these eight diseases can be interactively explored in the Epistasis Disease Atlas (https://epistasis-disease-atlas.com). In summary, NeEDL demonstrates the potential of seamlessly integrated quantum computing techniques to accelerate biomedical research. Our network medicine approach detects higher-order EIs with unprecedented statistical and biological evidence, yielding unique insights into polygenic diseases and providing a basis for the development of improved risk scores and combination therapies.

1. Heap G.A., Trynka G., Jansen R.C., Bruinenberg M., Swertz M.A., Dinesen L.C., Hunt K.A., Wijmenga C., Vanheel D.A., Franke L. Complex nature of SNP genotype effects on gene expression in primary human leucocytes. BMC Med. Genom. 2009; 2:1.

2. Bush W.S., Moore J.H. Chapter 11: Genome-wide association studies. PLoS Comput. Biol. 2012; 8:e1002822.

3. MacArthur J., Bowler E., Cerezo M., Gil L., Hall P., Hastings E., Junkins H., McMahon A., Milano A., Morales J. et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 2017; 45:D896–D901.

15:15-15:30

GEMINI: A Breakthrough System for Robust Gene Regulatory Network Discovery, Enabling the Application of GRNs to Industrial Level Genetic Engineering

Confirmed Presenter: Ridhi Gutta, Academies of Loudoun, United States

Format: Live Stream

In order to resolve crucial global issues, the widespread application of genetic engineering at an industrial level is key. Effective genetic engineering at an industrial scale hinges heavily on precise cellular control of the microorganism at hand. However, the majority of synthetically engineered strains fail at the industrial level due to disruptions in gene regulation. This stems from a lack of understanding and usage of gene regulatory networks (GRNs), which control cellular processes and metabolism. Research shows that effective manipulation of host GRNs and effective introduction of synthetic GRNs can improve product yield and functionality significantly. However, current GRN inference tools are extremely slow, inaccurate, and incompatible with industrial scale processes, because of which there are no complete expression based GRNs for any commonly used organism, limiting the application of GRNs as a practical tool in genetic engineering at the industrial level. This research proposes a novel computational system, GEMINI, to enable fast and efficient GRN inference for integration into industrial scale pipelines. GEMINI consists of two main parts. First, I create a novel information theoretic algorithm that replaces traditional sequential inference and calculation methods, ensuring compatibility with parallel processing. Second, I integrate a novel GNN architecture based on spectral convolution to bypass intensive eigenvalue computation and efficiently learn global and local regulatory structures. On the DREAM4 and DREAM5 in silico benchmarks, GEMINI outperforms all industry leaders in terms of AUROC and AUPRC, achieving a nearly 300% increase in AUPRC compared to the industry leading method, GENIE3. When applied on a real biological E. coli dataset, GEMINI not only recovered 98% of existing interactions, but discovered 468 novel candidate interactions, which were validated against literature. 
Thus, GEMINI was able to construct the most complete expression based GRN of E. coli to date, providing a novel biological blueprint for genetic engineers to use at the industrial level. GEMINI removes reliance on expensive computing equipment and enables fast and accurate GRN inference for the first time, opening doors to more efficient gene expression control and metabolic pathway manipulation for more effective application of genetic engineering at an industrial level.

16:00-16:15

VECTr: Facilitating Visual Comparison of Clonal Trees

Confirmed Presenter: Thea Traw, Carleton College, United States

Format: In Person

Cancer is an evolutionary process, in that a tumor results from a series of genetic mutations that are acquired over time. This series of mutations is often represented as a clonal tree: a particular type of rooted tree where vertices represent tumor cell clones and edges represent ancestral relationships between them. Such trees have the potential for impact in a variety of ways, from clinical settings--in which understanding a tumor's evolution may lead to more targeted therapies--to a general better understanding of how tumors evolve. Comparing similar clonal trees is an important task and ends up being an essential part of analysis whenever a new algorithm is developed to infer clonal trees (usually from sequencing data). Distance measures have been developed that compare clonal trees and allow a user to quantify how different two clonal trees are from each other. While such distance measures are incredibly useful, as they allow for direct comparison of which trees are more or less similar, they would be much more effective if it was easier to identify exactly what parts of the clonal trees contribute to the calculated distances, rather than simply reporting a single numerical value.

In this work, we describe a set of visual encodings that unpack differences across clonal tree structure, giving researchers a sense of where and how two trees differ from one another. These encodings were informed by interviews with computational cancer biologists, from which we drew a collection of key tree difference classes. We then introduce a tool called VECTr (Visual Encodings for Clonal Trees) that affords pairwise comparison of clonal trees across three different visualizations: a node-link diagram overlaid with information from a user-selected distance measure; a heatmap matrix encoding changes in relationships between mutations across the two trees; and a tripartite graph highlighting the shifts of individual mutations. All three visualizations allow for more granular interpretation of what parts of the clonal trees contributed to the inferred total distance. We afford comparison using three distance measures: parent-child, ancestor-descendant and distinct lineage. We demonstrate the utility of our visualization tool using a variety of different use cases, including application to both simulated and real clonal trees inferred from sequencing data.
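
To make the distance measures concrete, the sketch below computes parent-child and ancestor-descendant distances under one common formulation: the size of the symmetric difference between the two trees' sets of mutation pairs. It assumes one mutation per vertex and trees given as child-to-parent dictionaries; these conventions and the function names are illustrative and not VECTr's code.

```python
def parent_child_pairs(parent):
    """Set of (parent, child) pairs; `parent` maps each node to its parent."""
    return {(p, c) for c, p in parent.items() if p is not None}


def ancestor_descendant_pairs(parent):
    """Set of (ancestor, descendant) pairs, found by walking up to the root."""
    pairs = set()
    for node in parent:
        anc = parent[node]
        while anc is not None:
            pairs.add((anc, node))
            anc = parent[anc]
    return pairs


def pair_distance(pairs1, pairs2):
    """Distance and the specific pairs that contribute to it."""
    differing = pairs1 ^ pairs2  # pairs present in exactly one tree
    return len(differing), differing


# Two toy clonal trees over mutations A-D (root has parent None).
tree1 = {"A": None, "B": "A", "C": "B", "D": "A"}
tree2 = {"A": None, "B": "A", "C": "A", "D": "B"}

pc_dist, pc_diff = pair_distance(parent_child_pairs(tree1),
                                 parent_child_pairs(tree2))
ad_dist, ad_diff = pair_distance(ancestor_descendant_pairs(tree1),
                                 ancestor_descendant_pairs(tree2))
print(pc_dist, sorted(pc_diff))
print(ad_dist, sorted(ad_diff))
```

Returning the differing pairs alongside the number is the kind of information a visual encoding can surface, since it shows which relationships account for the distance.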

16:15-16:30

Understanding segmental duplications and genomic evolvability through network analysis

Confirmed Presenter: Saiful Islam, Institute for Artificial Intelligence and Data Science, State University of New York at Buffalo, Buffalo, NY, United States

Format: In Person

Evolvability, the ability of populations to adapt to selection pressures, is shaped by genomic structures like segmental duplications. Segmental duplications introduce genomic redundancy, enabling the emergence of new functions from duplicated elements. Using long-read genome assemblies, we analyzed segmental duplications in 117 vertebrate species and an outgroup (i.e., starfish) [1]. For each species, we constructed a segmental duplication network that represents the relationships between duplicated parts of the genome, capturing the duplication landscape of the species. We then quantified these landscapes using 13 network properties (e.g., network density, average clustering coefficient) and tested three evolutionary hypotheses: (1) selective constraint: minimal variation in the segmental duplication landscape among vertebrates, indicating that strong selective pressures constrain genomic structural evolvability, (2) phylogenetic drift: the duplication landscape has drifted gradually along the evolutionary lineage from the most recent common ancestor, and (3) species-specific dynamics: the segmental duplication landscape is highly diverse and does not align closely with the phylogeny. Our findings reveal that segmental duplication profiles in vertebrates are predominantly driven by fast-evolving, species-specific dynamics, supporting hypothesis (3) and thereby fostering unique adaptive potentials. Although we found no significant correlation between genome size and duplication metrics across vertebrates, lineage-specific events, such as whole-genome duplications in ray-finned fish (e.g., sterlet sturgeon and brown trout), influence genome size. These results highlight the role of segmental duplications in shaping species-specific adaptability and underscore the effectiveness of network-based approaches in genomic research.
Our dataset and analytical framework serve as valuable resources for exploring genomic evolvability, bridging evolutionary biology and network analysis to advance the study of genome evolution.
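
Two of the network properties named above, density and average clustering coefficient, can be computed as in the sketch below. It assumes an undirected, unweighted network whose nodes are duplicated regions and whose edges join regions that are duplicates of one another; that representation, the function names, and the toy edges are assumptions for illustration.

```python
from itertools import combinations


def build_graph(edges):
    """Adjacency sets for an undirected graph, ignoring self-loops."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(adj):
    """Fraction of possible node pairs that are connected: 2E / (N(N-1))."""
    n = len(adj)
    if n < 2:
        return 0.0
    n_edges = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 2.0 * n_edges / (n * (n - 1))


def average_clustering(adj):
    """Mean local clustering coefficient; nodes of degree < 2 count as 0."""
    if not adj:
        return 0.0
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count neighbour pairs that are themselves connected.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


# Toy network: a triangle of duplicated regions plus one pendant region.
graph = build_graph([("r1", "r2"), ("r2", "r3"), ("r1", "r3"), ("r3", "r4")])
print(round(density(graph), 3), round(average_clustering(graph), 3))
```

Computing such a vector of properties per species gives a fixed-length profile that can be compared across the phylogeny.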

16:30-16:45

A machine learning framework for inferring cell-specific gene regulatory networks from single-cell multi-omics

Confirmed Presenter: Yasin Uzun, Penn State College of Medicine, United States

Format: In Person

Gene regulatory networks (GRNs) control gene expression programs that drive cellular functions and differentiation, making them essential for understanding biological processes in development and disease. The emergence of single-cell multi-omics sequencing, which simultaneously profiles transcriptomic and epigenomic features in the same cells, has significantly enhanced GRN inference.

Despite recent advances, existing GRN inference methods based on single-cell multi-omics data vary in benchmarking datasets and performance metrics, making direct comparisons challenging. A standardized evaluation of these methods remains lacking. To address this, we developed a publicly available repository of single-cell GRN datasets. This resource integrates reference networks derived from transcription factor (TF)-DNA interaction datasets and functional perturbation studies, alongside diverse single-cell multi-omics datasets that match the reference networks in cell type composition.

Using these curated datasets, we conducted an unbiased benchmarking of leading single-cell multi-omics GRN inference methods. We assessed accuracy by comparing inferred networks to reference networks, evaluated stability by subsampling cell sets, and measured scalability in handling large datasets. Our analysis revealed key limitations in current GRN inference methods, particularly in accuracy and robustness.

To overcome these challenges, we developed a novel machine learning-based approach for GRN inference. Unlike traditional regression-based methods that predict target gene expression from regulator expression, our framework formulates GRN inference as a binary classification problem. By integrating both gene expression and chromatin accessibility data, we determine whether a regulatory interaction exists between a transcription factor and its target gene.

We validated our method using established single-cell multi-omics datasets and reference networks, demonstrating significant improvement in accuracy over existing approaches. This novel framework provides a robust and scalable solution for inferring gene regulatory networks from single-cell multi-omics data, advancing our ability to model complex gene regulation dynamics.

16:45-17:00

Discussing the performance of graph neural network-based models for anti-CRISPR protein prediction

Confirmed Presenter: Michelle Ramsahoye, University of Colorado - Boulder, United States

Format: In Person

Anti-CRISPR (Acr) proteins are found in bacteriophages and have recently been shown to be capable of inhibiting the CRISPR-Cas defense systems of various bacteria [1]. Acr proteins have been proposed to have possible applications as a form of phage therapy treatment and as a tool to regulate CRISPR-Cas genome editing [2,3]. These applications prompt interest in being able to identify these proteins – however, there is little understanding of the origin or evolution of Acr proteins as they do not possess significant sequential or structural similarity to any other protein. As of October 2024, researchers have experimentally validated 122 Acr proteins belonging to 92 different subtypes [4]. Large databases such as Anti-CRISPRdb have expanded to 3681 Acr proteins (containing both experimentally validated and putative proteins found using PSI-BLAST and the PDB database) [5]. This presents the opportunity to further narrow this putative list via in silico methods, thus saving time and resources for biologists performing in vitro experimental validation.

We approach this problem by viewing it as a binary classification problem. Inspired by DeepFRI (a protein function and functional residue prediction model), we utilized protein structure networks (PSNs) as input into graph neural networks (GNNs) for the task of Acr protein prediction [6]. We use a version of the Gussow et al. dataset as a benchmark, as was previously used in other Acr protein prediction machine learning and deep learning models [7]. The work encompasses the following steps: (1) data curation, (2) data preprocessing, (3) training and validation of two GNN models (graph convolutional network and graph attention network architectures), and (4) exploring performance on a dataset used for prior machine learning and deep learning models for the identical task of Acr protein prediction. In this work, we combine PSNs made using both experimentally validated Protein Data Bank (PDB) files and predicted PDB files using ESMFold2. The best performing GCN model has an accuracy of 89% and F1-Score of 91% on the test set, and the best performing GAT model has an accuracy of 85% and F1-Score of 87.5%.

By limiting the training data to only Acrs, we discuss the process of creating smaller, specialized graph neural network models that can benefit from domain knowledge and can be further probed for interpretability. We also discuss limitations associated with the application of structural Acr protein data for use in graph neural networks.

[1] Bondy-Denomy, J. et al. (2013) Nature.

[2] Lin DM et al. (2017) World J Gastrointest Pharmacol Ther.

[3] Marino ND et al. (2020) Nat Methods.

[4] Allemailem KS et al. (2024) Int J Nanomedicine.

[5] Dong et al. (2022) Database.

[6] Gligorijević V et al. (2021) Nat Commun.

[7] Gussow, A.B. (2020) Nat Commun.

17:00-17:15

Sliding Window INteraction Grammar (SWING): a generalized interaction language model for peptide and protein interactions

Confirmed Presenter: Jishnu Das, University of Pittsburgh, United States

Format: In Person

The explosion of sequence data has allowed the rapid growth of protein language models (pLMs). pLMs have now been employed in many frameworks, including variant-effect and peptide-specificity prediction. Traditionally, for protein-protein or peptide-protein interactions (PPIs), the corresponding sequences are either co-embedded followed by post-hoc integration or concatenated prior to embedding. Interestingly, no method utilizes a language representation of the interaction itself. We developed an interaction LM (iLM), which uses a novel language to represent interactions between protein/peptide sequences. Sliding Window Interaction Grammar (SWING) leverages differences in amino acid properties to generate an interaction vocabulary. This vocabulary is the input to an LM, followed by a supervised prediction step in which the LM’s representations are used as features.
SWING was first applied to predicting peptide:MHC (pMHC) interactions. With over 10,000 MHC I and 3,000 MHC II alleles, the possible pMHC combinations are vast, making it infeasible to experimentally identify all potential pMHC interactions. SWING was not only successful at generating Class I and Class II models that have comparable prediction to state-of-the-art approaches, but the unique Mixed Class model was also successful at jointly predicting both classes. Further, the SWING model trained only on Class I alleles was predictive for Class II, a complex prediction task not attempted by any existing approach. For de novo data, using only Class I or Class II data, SWING also accurately predicted Class II pMHC interactions in murine models of SLE (MRL/lpr model) and T1D (NOD model), that were validated experimentally.
To further evaluate SWING’s generalizability, we tested its ability to predict the disruption of specific edges in protein interactome networks by missense mutations. Although modern methods like AlphaMissense and ESM1b can predict interfaces and variant effects/pathogenicity per mutation, they are unable to predict edge-specific disruptions in protein networks. Predicting which missense mutations can lead to the disruption of specific protein interactions provides a fundamental genotype to phenotype link (edgotype) at a molecular level. SWING was successful at accurately predicting the impact of both Mendelian mutations and population variants on PPIs. This is the first generalizable approach that can accurately predict interaction-specific disruptions by missense mutations with only sequence information. When benchmarked against other PPI methods such as passively using protein embeddings, using only the interaction encoding, and alternative iLM architectures, only SWING was able to learn enough information to perform well across prediction tasks for missense mutation perturbation prediction and pMHC binding. Overall, SWING is a first-in-class generalizable zero-shot iLM that learns the language of PPIs.

The corresponding manuscript is currently in press at Nature Methods.

17:15-17:30

Unveiling antimicrobial resistance mechanisms in ESKAPE through machine learning

Confirmed Presenter: Abhirupa Ghosh, Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, United States

Format: In Person

ESKAPE is a group of notorious bacterial pathogens that cause nosocomial infections and contribute to high morbidity and mortality. These WHO-priority pathogens can evade multiple antibiotics; thus, understanding their antimicrobial resistance (AMR) patterns and mechanisms is essential for improving detection and treatment strategies. Traditional approaches for detecting AMR in novel bacterial strains require time-consuming, labor-intensive genetic and drug screens. Advances in sequencing technology offer a plethora of bacterial genome data, and computational approaches like machine learning (ML) provide scope for in silico AMR prediction leveraging the existing repertoire of genome data. Existing ML-based AMR predictions in bacteria are often limited to predicting resistance to one drug in one species at a time, neglecting spatiotemporal variations among strains, which are crucial for understanding the evolution and spread of resistance. Here, we introduce a comprehensive ML approach to identify AMR-associated molecular features across bacterial species, associated with single or multiple antibiotics or antibiotic classes and stratified by time and geographical location. This project integrates hundreds to thousands of publicly available genomes for each bacterial species, coupled with AMR phenotypes, and leverages comparative genomics with supervised ML models to predict AMR phenotype and underlying mechanisms. The genomes were annotated to obtain a genome x gene feature matrix (encoding presence/absence) with AMR phenotype labels from BV-BRC. Each genome is paired with curated metadata such as collection year, isolation country, host, and diseases. We used supervised ML (e.g., logistic regression and random forest) to classify new pathogen genomes as resistant or susceptible and identify the top predictors for resistance. F1 scores and area under precision-recall curves over prior across bug-drug combinations are consistently high, often exceeding 0.90.
The most predictive features returned by the models include biologically relevant, experimentally validated genes playing determinative roles in specific resistance mechanisms (e.g., tetK for tetracycline, mecA for methicillin resistance). We elucidate the potential of ML in the discovery of AMR-associated general and context-specific biomarkers, especially factored by clinically relevant strata. For example, the time-specific holdouts can evaluate how well the model predicts resistance in future data and can discover persistent resistance mechanisms and evolving mechanisms over different periods, and the time-stratified model helps discover AMR features active in specific drugs (by generation) within a class. We are also developing a companion R package, amR, for broad application beyond ESKAPE to other bug-drug combinations to predict AMR phenotypes and best AMR-associated features.
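
The feature construction and time-specific holdout described above could be set up along the following lines. This is a sketch under stated assumptions: the genome identifiers, gene sets, and collection years are invented toy values, and the helper names are hypothetical rather than part of the amR package.

```python
def presence_absence_matrix(genome_genes):
    """Build a genome x gene 0/1 matrix from per-genome gene sets.

    Returns (genome ids, gene names, matrix rows) with both axes sorted
    so that row and column order is reproducible.
    """
    genomes = sorted(genome_genes)
    genes = sorted(set().union(*genome_genes.values()))
    matrix = [[1 if gene in genome_genes[gid] else 0 for gene in genes]
              for gid in genomes]
    return genomes, genes, matrix


def time_holdout_split(collection_year, cutoff_year):
    """Train on genomes collected before the cutoff, test on the rest."""
    train = sorted(g for g, y in collection_year.items() if y < cutoff_year)
    test = sorted(g for g, y in collection_year.items() if y >= cutoff_year)
    return train, test


# Toy annotations and metadata (illustrative values only).
annotations = {
    "genome1": {"mecA", "tetK"},
    "genome2": {"tetK"},
    "genome3": {"blaZ"},
}
years = {"genome1": 2012, "genome2": 2018, "genome3": 2021}

genomes, genes, X = presence_absence_matrix(annotations)
train_ids, test_ids = time_holdout_split(years, cutoff_year=2018)
print(genes)
print(X)
print(train_ids, test_ids)
```

The rows selected by `train_ids` would be passed to a classifier such as logistic regression or a random forest, with the later genomes held back to check how well resistance is predicted in future data.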

17:30-17:45

The International Society for Computational Biology (ISCB) Degree Endorsement Program

Confirmed Presenter: Russell Schwartz, Carnegie Mellon University, United States

Format: In Person

Computational biology, as a relatively young discipline, has benefited greatly from the tremendous diversity of perspectives that have come together to define what it means to be competent in the field, yet that diversity has also brought complications. One longstanding challenge is how we as a community can provide useful guidance and credentialing for work in the field when there is little agreement about what someone competent in computational biology should know or be capable of doing. Efforts to answer that question have been the work of a large community of educators over a period of years to define competencies for the discipline and determine how they can be used for such tasks as designing or assessing a curriculum. The present presentation will focus on one particular outcome of those efforts: a program the ISCB has launched for endorsing degree programs based on an assessment through the framework of the ISCB competencies. This presentation will discuss some of the history of the ISCB competencies efforts that led up to this initiative [1,2] and how the competencies evolved [3,4] and continue to evolve to the present day [5]. It will consider how competencies like this can be used in practice, with a particular focus on applications to program assessment as they are used in the endorsement scheme. It will then explain how that led to the ISCB endorsement process and go through the process as it is now underway. Finally, it will provide guidance for prospective applicants and program reviewers and consider some potential future steps.

[1] Welch, L.R., Schwartz, R. and Lewitter, F. (2012) A report of the curriculum task force of the ISCB Education Committee. PLoS Computational Biology, 8(6), p.e1002570.

[2] Welch, L. et al. (2014) Bioinformatics curriculum guidelines: toward a definition of core competencies. PLoS Computational Biology, 10(3), p.e1003496.

[3] Welch, L. et al. (2016) Applying, evaluating and refining bioinformatics core competencies (an update from the curriculum task force of ISCB’s education committee). PLoS Computational Biology, 12(5), p.e1004943.

[4] Mulder, N. et al. (2018) The development and application of bioinformatics core competencies to improve bioinformatics training and education. PLoS Computational Biology, 14(2), p.e1005772.

[5] Brooksbank, C. et al. (2024) The ISCB competency framework v. 3: a revised and extended standard for bioinformatics education and training. Bioinformatics Advances, 4(1), p.vbae166.

Thursday, May 15^th

8:45-9:00

Welcome - Day 3

Format: In person

Authors List: Show

9:00-10:00

Invited Presentation: Sequence-basis of transcription initiation in the human genome

Confirmed Presenter: Jian Zhou

Format: In Person

Authors List: Show

Presentation Overview: Show

10:30-10:50

Proceedings Presentation: ProtFun: A Protein Function Prediction Model Using Graph Attention Networks with a Protein Large Language Model

Confirmed Presenter: Serdar Bozdag, University of North Texas, United States

Format: In Person

Authors List: Show

Presentation Overview: Show

10:50-11:05

Crowdsourcing the Fifth Critical Assessment of protein Function Annotation algorithms (CAFA 5)

Confirmed Presenter: Iddo Friedberg, Iowa State University, United States

Format: In Person

Authors List: Show

Presentation Overview: Show

11:05-11:20

Identifying Cancer Vaccine Adjuvants in Biomedical Literature Using Large Language Models

Confirmed Presenter: Hasin Rehana, University of North Dakota, United States

Format: In Person

Authors List: Show

Presentation Overview: Show

Background: Adjuvants are substances incorporated into vaccines to enhance the immune response, increasing effectiveness and duration. Recognizing the various adjuvants used in preexisting biomedical research is essential for expediting novel therapeutic developments. However, manual curation of the constantly expanding biomedical literature poses significant challenges. This study focuses on addressing these limitations by automatically extracting the adjuvant names from literature related to cancer vaccines.
Methods: Advanced Large Language Models (LLMs), specifically Generative Pretrained Transformers (GPT) and Large Language Model Meta AI (Llama), were employed for this study. Two datasets were utilized for a comprehensive performance evaluation of these models. AdjuvareDB comprised 97 clinical trial records focused on established or potential adjuvants. Another dataset, the Vaccine Adjuvant Compendium (VAC), had 290 annotated PubMed abstracts. GPT-4o and Llama 3.2 were implemented in zero-shot and few-shot settings, offering up to four examples per prompt. The temperature variable was set to zero to eliminate randomness, and three independent runs were conducted for each setting to ensure consistency. Prompts explicitly targeted adjuvant names, testing the impact of contextual information such as substances or interventions. The outputs went through automated and manual evaluations for accuracy, completeness, and consistency.
Results: GPT-4o demonstrated 100% Precision in all assessed configurations, underscoring its strong capacity to avoid false positives. Moreover, incorporating contextual information led to substantial improvements in Recall and F1-score, affirming the significance of context in model performance. For the VAC dataset, GPT-4o attained a maximum F1-score of 77.32% when interventions were incorporated, exceeding Llama-3.2-3B by around 2%. For the AdjuvareDB dataset, GPT-4o attained an F1-score of 81.67% with three-shot prompting that included corresponding interventions, surpassing Llama-3.2-3B’s maximum F1-score of 65.62%. These findings underscore the importance of leveraging examples and contextual information to improve model accuracy and effectiveness in natural language processing tasks.
Conclusion: This study demonstrates that LLMs provide a scalable solution for identifying adjuvant names and streamlining cancer vaccine research. Future endeavors will focus on expanding this framework to a broader range of vaccine types, refining model architectures, and optimizing prompt engineering strategies to further improve generalizability across diverse biomedical literature.
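
For readers less familiar with the metrics reported above, the following is a minimal sketch of how Precision, Recall, and F1-score can be computed for a set of extracted adjuvant names against a curated reference set. The function name, the case-insensitive exact matching, and the example names are illustrative assumptions, not the evaluation protocol used in the study.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 for extracted entity names.

    Matching is exact after lowercasing and trimming whitespace; this
    matching rule is an assumption made for the sketch.
    """
    pred = {name.strip().lower() for name in predicted}
    ref = {name.strip().lower() for name in gold}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative (made-up) model output and reference annotation for one record.
predicted_names = ["Montanide ISA-51", "GM-CSF"]
gold_names = ["montanide isa-51", "gm-csf", "poly-ICLC"]

p, r, f1 = precision_recall_f1(predicted_names, gold_names)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy case every predicted name is correct (precision 1.0) but one reference name is missed, which is the pattern of perfect precision with lower recall described in the Results.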

11:20-11:35

A deep learning approach for predicting synthetic lethality in human cells

Confirmed Presenter: Xiang Zhang, University of Minnesota, United States

Format: In Person

Authors List: Show

Presentation Overview: Show

Synthetic lethality (SL), where inhibition of one gene is selectively lethal in cells with mutations in its partner, is a key type of genetic interaction (GI) with significant implications for cancer therapy. One long-term goal is to develop models capable of accurately predicting SLs across diverse cell types and contexts, even with minimal new data input. A major step toward this goal is the generation of a global reference map of human GIs based on genome-wide CRISPR-Cas9 screens. Recently, we generated a genome-scale GI network in the human haploid cell line, HAP1. This network, encompassing ~17,000 genes across over 200 query genes, includes data from double mutants for approximately 4 million unique gene pairs, offering an unprecedented resource for understanding genetic dependencies at scale. A stringent cutoff identified ~7,000 SL pairs accounting for ~0.18% of the screened population.

As a basis for building a model for predicting SL pairs, we leveraged the DepMap dataset as the primary input. DepMap provides comprehensive functional and molecular profiles across hundreds of diverse cancer cell lines, including CRISPR KO gene effect, gene expression, and mutation data. These datasets capture diverse cellular contexts and genetic variability, making DepMap an ideal source for constructing biologically rich, high-dimensional feature representations. Paired with the large collection of double mutant SL interactions from our HAP1 reference map, this combination enables the application of supervised learning to SL prediction.

Using these inputs, we developed a deep neural network (DNN) model for predicting SL pairs, trained and tested on the extensive HAP1 GI dataset. The DNN model demonstrated strong performance with an AUROC of 0.88, which is comparable to the median AUROC expected from control predictions derived from biological replicate screens. We evaluated the model’s generalizability on unseen KO pairs and GIs from other cellular contexts beyond the HAP1 screening system. In addition, we used the model’s predictions on cancer driver genes to guide GI experiments for identifying potential novel drug targets. Our findings highlight the potential of deep learning approaches to predict GIs at scale and suggest a strategy for leveraging a reference human GI network in a single cell line for building predictive models of SL that generalize across contexts. Our model can facilitate the design of more efficient GI experiments and improve our understanding of genetic dependencies and SL in human cells.
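
As a reminder of what the AUROC figure above measures, here is a minimal sketch that computes it as the probability that a randomly chosen SL pair receives a higher score than a randomly chosen non-SL pair (ties count one half). The function name and the toy scores are illustrative assumptions; this is not the evaluation code used for the DNN model.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: predicted scores, higher meaning more likely SL.
    labels: 1 for SL pairs, 0 for non-SL pairs.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example with four gene pairs (made-up scores).
example_scores = [0.9, 0.8, 0.3, 0.2]
example_labels = [1, 0, 1, 0]
print(auroc(example_scores, example_labels))
```

Because AUROC depends only on the ranking of scores, it remains interpretable when positives are very rare, as with the roughly 0.18% of screened pairs called SL here.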

11:35-11:55

Proceedings Presentation: Deep Active Learning based Experimental Design to Uncover Synergistic Genetic Interactions for Host Targeted Therapeutics

Confirmed Presenter: Haonan Zhu, Lawrence Livermore National Laboratory, United States

Format: In Person

Authors List: Show

Presentation Overview: Show

13:30-13:45

Using language models to find gene-expression datasets

Confirmed Presenter: Stephen Piccolo, Brigham Young University, United States

Format: In Person

Authors List: Show

Presentation Overview: Show

13:45-14:00

MIMIC: a pipeline facilitating Multi-modal Imputation and Multi-modal Integration for disease Classification

Confirmed Presenter: Manikandan Narayanan, Indian Institute of Technology (IIT) Madras, India

Format: In Person

Authors List: Show

Presentation Overview: Show

Genome-wide data of different types or modalities (e.g., bulk/spatial/single-cell RNAseq, proteomic, DNAseq, genome-wide DNA methylation data) are increasingly used to study complex diseases. Such multi-omic datasets have the potential to transform disease classification and biomarker discovery tasks, once challenges related to redundancy, missingness (of a few data points or of an entire data modality), and noise levels of the different data types are addressed. Several methods have been proposed to address these challenges, such as ones that perform imputation of missing modalities and/or integration of multiple modalities to classify the disease status of a test sample. But few studies have explored the tight connection between data imputation and integration in the context of disease classification. For instance, the extent to which imputation errors adversely impact, or are surprisingly tolerated in, the classification of disease samples is not well understood.

In this study, we propose a systematic framework called MIMIC (Multi-modal Imputation and Multi-modal Integration for disease Classification) to study how methods for imputing entire data modalities from other measured modalities affect downstream disease classification accuracies. The MIMIC pipeline allows exploring a range of shallow to deep machine learning models for classification, each of which utilizes features from multiple measured/imputed data types to predict disease status. We applied our framework to datasets from three diverse diseases: Alzheimer’s disease (MSBB and ROSMAP datasets), breast cancer (TCGA-BRCA and METABRIC), and preterm birth (MOMI). In a majority of these applications, we found that imputed datasets offer a way to either improve model performance (by 1-5%) or provide comparable performance relative to the measured dataset while expanding the pool of potential biomarkers. Further analysis of genes with top feature importance scores identified by MIMIC revealed informative vs. redundant omics types for the classification task, as well as gene-disease associations. These results are promising and encourage use of the MIMIC pipeline to dissect other complex diseases as well, via integration of multiple modalities, measured or imputed.
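
The impute-then-integrate idea can be illustrated with a deliberately tiny sketch: a missing modality is imputed from a measured one with a one-feature least-squares fit, and the measured and imputed features are then passed together to a nearest-centroid classifier. Every name, the choice of models, and the numbers below are assumptions made for illustration; the sketch shows only the plumbing and is not the MIMIC implementation.

```python
from statistics import mean


def fit_linear(x, y):
    """Least-squares line y ~ slope * x + intercept for a single feature."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return slope, my - slope * mx


def fit_centroids(rows, labels):
    """Per-class mean vector of the integrated feature rows."""
    centroids = {}
    for label in set(labels):
        members = [row for row, lab in zip(rows, labels) if lab == label]
        centroids[label] = [mean(column) for column in zip(*members)]
    return centroids


def predict(centroids, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Training samples have both modalities measured (made-up values).
expression_train = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
methylation_train = [0.9, 2.1, 3.0, 6.2, 6.8, 8.1]
labels_train = [0, 0, 0, 1, 1, 1]

# Test samples are missing the second modality entirely.
expression_test = [2.5, 6.5]

# Step 1: learn to impute the missing modality from the measured one.
slope, intercept = fit_linear(expression_train, methylation_train)
methylation_imputed = [slope * x + intercept for x in expression_test]

# Step 2: integrate measured + imputed features and classify.
centroids = fit_centroids(list(zip(expression_train, methylation_train)), labels_train)
predictions = [predict(centroids, row) for row in zip(expression_test, methylation_imputed)]
print(predictions)
```

A real comparison of the kind described above would repeat step 2 with measured-only and measured-plus-imputed inputs and compare classification accuracy on held-out samples.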

(The research presented in this work was supported by Wellcome Trust/DBT grant IA/I/17/2/503323 awarded to MN. The authors thank Adiya Jeevannavar for contributions to the initial phases of this project.)

14:00-14:15

Chromatin Accessibility in Human Primary and Metastatic Cancers Links Regional Mutational Processes and Signatures to Tissues of Origin

Confirmed Presenter: Hanli Jiang, University of Toronto & Ontario Institute for Cancer Research, Canada

Format: In Person

Authors List: Show

Presentation Overview: Show

Background
Cancer metastasis significantly worsens patient prognosis and is responsible for the majority of cancer-related deaths. Metastatic tumors accumulate additional mutations that facilitate their spread, yet their mutational processes and genomic characteristics remain only partially understood. Large-scale projects have characterized mutational landscapes in primary tumors and in metastatic tumors. Machine learning models have also been used to link regional mutation rates with tissue-specific chromatin and replication profiles in primary cancers. However, comparatively few studies have systematically contrasted primary versus metastatic tumor genomes using advanced machine learning approaches.

Methodology
We applied machine learning techniques to investigate mutational processes in metastatic cancers. In particular, we developed a multi-layer perceptron (MLP) model with gating layers to integrate chromatin accessibility and DNA replication timing as features to predict regional mutation rates. This deep learning model was trained separately to predict mutation rates in primary tumors using data from the Pan-Cancer Analysis of Whole Genomes (PCAWG) consortium and in metastatic tumors using data from the Hartwig Medical Foundation (HMF) cohort, while consistently using the same chromatin accessibility and replication timing features for both. The gating mechanism allowed the model to prioritize tissue-specific genomic features, improving mutation rate prediction and interpretability at finer genomic scales.

Key Findings
Our MLP_Gating model outperformed a random forest baseline across 15 cancer types, achieving adjusted R² values up to 0.95, demonstrating robust predictive power. This strong performance enhances confidence in subsequent findings. Metastatic tumors exhibited distinct mutational landscapes compared to their primary counterparts, yet their chromatin accessibility and replication timing profiles largely retained tissue-specific characteristics. The model further identified genomic regions with disproportionately high mutation burdens in metastases, pinpointing several known cancer driver genes. Notably, certain mutation hotspots displayed mutation rates exceeding expectations based on chromatin and replication timing features, suggesting additional mutagenic influences unique to metastatic progression. Our findings indicate that even as cancers metastasize, they preserve key epigenomic features from their tissue of origin, offering opportunities to tailor treatments according to the primary tumor’s context. Moreover, the known cancer driver genes identified with exceptionally high mutation burdens in metastatic tumors can serve as both potential therapeutic targets and biomarkers, guiding personalized oncology strategies. These insights into mutation hotspots in metastatic cancers refine our understanding of how mutational processes diverge from primary tumors and may enhance the accuracy of future models predicting metastatic risk and progression. By integrating whole-genome sequencing data with deep learning, this study underscores the value of computational approaches in unraveling metastatic tumor evolution.
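
Two ingredients mentioned above can be made concrete with a short sketch: a feature-wise sigmoid gate that rescales each input feature before it reaches the rest of a network, and the adjusted R² used to report performance. The abstract does not specify the form of the gating layers, so the element-wise gate, the function names, and the toy values are assumptions for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate_features(features, gate_weights, gate_biases):
    """Element-wise gating: each feature is multiplied by a gate in (0, 1).

    A gate near 0 suppresses a feature (e.g., a chromatin track from an
    unrelated tissue); a gate near 1 lets it through unchanged.
    """
    return [
        sigmoid(w * x + b) * x
        for x, w, b in zip(features, gate_weights, gate_biases)
    ]


def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)


# With zero weights and biases every gate equals 0.5, halving each feature.
print(gate_features([2.0, 4.0], [0.0, 0.0], [0.0, 0.0]))

# Toy regional mutation rates: observed vs predicted, two input features.
print(adjusted_r2([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0, 6.0], n_features=2))
```

In a trained model the gate parameters are learned jointly with the downstream layers, so inspecting the gate values gives one view of which input tracks the model relies on.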

14:15-14:30

Genomic Language Model for Predicting Enhancers and Their Allele-Specific Activity in the Human Genome

Confirmed Presenter: Rekha Sathian, Department of Biomedical Informatics, Stony Brook University, United States

Format: In Person

Authors List: Show

Presentation Overview: Show

Enhancers are distal cis-regulatory sequences that coordinate target gene expression in a tissue-specific manner. Disruptions in enhancer function (due to mutations, structural variations, or other mechanisms) can cause aberrant gene expression, leading to congenital disorders, cancers, and common complex diseases, collectively termed enhanceropathies. This notion is supported by the large number of disease-associated SNPs identified in these distal regulatory elements through genome-wide association studies. Predicting and deciphering the regulatory logic of enhancers is challenging due to their intricate sequence features and the lack of consistent genetic or epigenetic signatures that accurately distinguish enhancers from other genomic regions. Recent machine-learning-based methods have highlighted the importance of extracting the nucleotide composition of enhancers but have failed to capture sequence context effectively, leading to suboptimal performance. Motivated by advances in genomic language models, we applied DNABERT(1), a transformer-based language model pre-trained on the human genome, to develop a novel enhancer prediction method called DNABERT-Enhancer. We trained two different models using a large collection of enhancers curated from the ENCODE registry of candidate cis-regulatory elements, an integrative layer of the ENCODE Encyclopedia(2). The best fine-tuned model achieved 88.05% accuracy, with a Matthews correlation coefficient of 0.76, on independent, held-out data. To further validate the model’s performance, we applied it genome-wide and compared the predictions to publicly available enhancer databases. Our model successfully predicted a substantial portion of enhancers cataloged in nine databases, including nearly all enhancers in Vista, 91% of typical enhancers from SEdb, 85% of enhancers in TiED, and 77% of EnhancerAtlas enhancers, amounting to 99% of the model’s total predictions.
Finally, we applied DNABERT-Enhancer, along with other DNABERT-based regulatory genomic region prediction models, to identify candidate SNPs with allele-specific enhancer and transcription factor binding activity. These candidates were identified by testing functional site disruption via the introduction of variant alleles, which exhibited significant changes in prediction probability. Through careful evaluation, we identified candidate variants exhibiting loss-of-function effects, and their clinical significance was further assessed using ClinVar and GWAS databases. The genome-wide enhancer annotations and candidate loss-of-function genetic variants predicted by DNABERT-Enhancer provide valuable resources for genome interpretation in functional and clinical genomics studies.

1. Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics. 2021;37(15):2112-20.
2. Consortium EP, Moore JE, Purcaro MJ, Pratt HE, Epstein CB, Shoresh N, et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature. 2020;583(7818):699-710.
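
The allele-specific analysis described in the abstract above (introducing a variant allele and measuring the change in prediction probability) can be sketched as follows. The scoring function here is a stand-in that simply checks for a placeholder motif; it is NOT DNABERT-Enhancer, and all function names, the motif, and the sequence are assumptions made for illustration.

```python
def apply_snp(sequence, position, ref, alt):
    """Return the sequence carrying the alternate allele at a 0-based position."""
    if sequence[position] != ref:
        raise ValueError("reference allele does not match the sequence")
    return sequence[:position] + alt + sequence[position + 1:]


def allele_effect(score_fn, sequence, position, ref, alt):
    """Change in predicted probability (alt minus ref) for one variant."""
    ref_score = score_fn(sequence)
    alt_score = score_fn(apply_snp(sequence, position, ref, alt))
    return alt_score - ref_score


def toy_enhancer_score(sequence):
    """Placeholder scorer: high if a hypothetical motif is present, low otherwise.

    A real analysis would call a trained sequence model here instead.
    """
    return 1.0 if "TGACTCA" in sequence else 0.1


reference_sequence = "AATGACTCAGG"
# A -> G at position 4 disrupts the placeholder motif.
delta = allele_effect(toy_enhancer_score, reference_sequence, 4, "A", "G")
print(round(delta, 2))
```

A strongly negative difference marks a candidate loss-of-function variant; in practice a significance threshold on the change would be chosen and the candidates cross-referenced with resources such as ClinVar and GWAS catalogs, as the abstract describes.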

14:30-14:45

Longitudinal metabolomics study on serine supplementation trial reveals treatment efficacy for retinal disease driven by genetic amino acid dysregulation

Confirmed Presenter: Roberto Bonelli, The Lowy Medical Research Institute, United States

Format: In Person

Authors List: Show

Presentation Overview: Show

Macular Telangiectasia Type 2 (MacTel) is a rare neurovascular degenerative retinal disease with complex genetic and metabolic underpinnings. Genome-wide association studies (GWAS) have identified 11 susceptibility loci, many of which implicate the glycine-serine metabolic pathway. Metabolomic analyses have confirmed significant reductions in glycine and serine levels among MacTel patients, highlighting metabolic dysregulation as a key feature of the disease. Further Mendelian randomization studies have established serine deficiency as a causal factor, driving disease progression through the accumulation of neurotoxic deoxysphingolipids. Experimental evidence further corroborates the link between serine metabolism and disease pathology, as elevated deoxysphingolipids have been observed to induce photoreceptor cell death in retinal organoids and compromise visual function in serine-deficient animal models.
Building on these insights, we conducted a study to investigate the metabolic effects of 6 weeks of serine and fenofibrate supplementation in 90 MacTel patients. Participants received either serine at one of two concentrations, fenofibrate, a combination of serine and fenofibrate, or no treatment. Four time points were included: baseline, 3, 6, and 10 weeks. Each patient was genotyped via SNP array, and at each time point, laboratory blood panels were performed alongside the collection of serum for metabolomics. Using normalisation and regression techniques, we identified hundreds of metabolites affected by these interventions. By integrating summary statistics from previous MacTel metabolomic studies, we observed striking differences in disease specificity for the metabolic changes induced by different treatments. We also found that most of the observed metabolic changes could be reversed by a 4-week washout period.
To characterize the broader metabolic impact of supplementation, we employed dimensionality reduction techniques, including factorial analysis and lasso models, to derive a MacTel metabolic signature. This composite metabolic endpoint allowed us to quantify the extent to which supplementation modulates the heterogeneous metabolic disturbances observed in MacTel patients. Our results demonstrate that targeted supplementation can effectively mitigate the metabolic dysregulation characteristic of MacTel, providing further evidence for the likely causative role of serine metabolism in disease pathogenesis.
This study underscores the utility of integrative bioinformatic approaches to metabolomics in elucidating complex metabolic diseases. By leveraging ‘omics data and advanced statistical modelling, we provide a framework for dissecting metabolic perturbations at a systems level, with implications for both MacTel and broader metabolic disorders. These methodologies not only enhance our understanding of disease mechanisms but also inform the development of metabolically targeted interventions.

14:45-15:00

SigMatch: a transcriptome-based regression method to detect mechanistic links across diseases and drugs

Confirmed Presenter: Kewalin Samart, Computational Bioscience Program, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, United States

Format: In Person

Authors List: Show

Presentation Overview: Show

Due to the meteoric rise of antibiotic resistance, research on treating infectious diseases (InfD) is turning to host-directed therapeutics (HDTs) that can complement antibiotics by modulating host responses. Computational drug repurposing has demonstrated the potential to accelerate HDT discovery. Traditional transcriptomics-based repurposing methods identify drugs that reverse the disease differential gene expression signature of the host. However, these methods only consider each disease in isolation and often overlook shared molecular mechanisms across diseases, e.g., the NF-κB pathway, which plays a role in active pulmonary tuberculosis (TB), autoimmune disorders, and cancer. This limitation hinders the identification of potential HDT combinations for less-studied InfDs by leveraging evidence from better-studied non-infectious diseases (NInfDs). Drug combination and synergy prediction, while extensively used in cancer, have rarely been applied to InfDs. To address this, we introduce SigMatch, a signature-based sparse regression model designed to discover novel HDT candidate combinations using (i) direct mechanistic connections across drugs and InfDs, and (ii) transferred mechanistic insights from NInfDs to InfDs.
SigMatch applies a boosting-based ensemble regression model to identify combinations of features from drug or disease gene expression signatures that best explain the target disease signature. First, SigMatch utilizes a library of FDA-approved drug signatures and identifies combinations of drugs that collectively reverse a TB signature. Each iteration of the model selects a drug signature that explains a specific facet of the reversed disease signature, continuing until an optimal mechanistic match is reached. Thus, the ensemble model naturally mirrors the reversal of the entire disease state by a sparse combination of drugs. Next, SigMatch can also find a combination of NInfD signatures that share signature patterns with the input InfD, enabling a transfer of knowledge from NInfDs to identify potential drug candidates for understudied InfDs.
We validated the ability of SigMatch to identify mechanistically related diseases by applying it to NInfDs. Specifically, we trained SigMatch using ~600 NInfD signature features with one NInfD target signature (removed from the features) iteratively for each NInfD and identified numerous meaningful disease-disease associations, including between Parkinson’s and Alzheimer’s diseases. We are now applying SigMatch to discover potential HDT combinations against TB via both direct and transferred evidence.
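
The iterative selection described above, in which each round picks one signature that explains part of what is still unexplained in the reversed disease signature, resembles greedy component-wise boosting. The sketch below illustrates that general idea on a four-gene toy problem; the function names, the squared-error criterion, the stopping rule, and all signature values are assumptions, and this is not the SigMatch implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def boosted_signature_match(target, library, n_rounds=10, learning_rate=1.0, tol=1e-9):
    """Greedy component-wise boosting of a target signature.

    target:  the signature to explain (here, the reversed disease signature).
    library: dict mapping a drug (or disease) name to its signature.
    Returns the accumulated weight per selected signature and the final residual.
    """
    residual = list(target)
    weights = {}
    for _ in range(n_rounds):
        best_name, best_coef, best_gain = None, 0.0, tol
        for name, signature in library.items():
            norm = dot(signature, signature)
            if norm == 0:
                continue
            coef = dot(residual, signature) / norm  # least-squares fit to the residual
            gain = coef * coef * norm               # squared-error reduction for a full step
            if gain > best_gain:
                best_name, best_coef, best_gain = name, coef, gain
        if best_name is None:  # nothing explains the residual any further
            break
        step = learning_rate * best_coef
        weights[best_name] = weights.get(best_name, 0.0) + step
        residual = [r - step * s for r, s in zip(residual, library[best_name])]
    return weights, residual


# Made-up four-gene example: the target is the sign-flipped disease signature.
disease_signature = [2.0, -1.0, 0.0, 1.0]
reversed_signature = [-value for value in disease_signature]
drug_library = {
    "drug_a": [-1.0, 0.0, 0.0, 0.0],
    "drug_b": [0.0, 1.0, 0.0, -1.0],
    "drug_c": [0.0, 0.0, 1.0, 0.0],
}

selected, leftover = boosted_signature_match(reversed_signature, drug_library)
print(selected)
```

Only the two signatures that contribute to reversing the toy disease signature receive non-zero weight, which is the sparse-combination behaviour the abstract describes. Swapping the drug library for a library of disease signatures gives the disease-to-disease matching used for knowledge transfer.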

15:00-15:20

Proceedings Presentation: Product Manifold Representations for Learning on Biological Pathways

Confirmed Presenter: Daniel McNeela, University of Wisconsin-Madison, United States

Format: Live Stream

Authors List: Show

Presentation Overview: Show

15:20-15:35

Multi-omic analysis of testes reveals pseudogenes as overlooked biological actors

Confirmed Presenter: Ihor Arefiev, University of Sherbrooke, Canada

Format: Live Stream

Authors List: Show

Presentation Overview: Show

Pseudogenes are commonly described as defective copies of genes or “junk” DNA, which neither encode any protein nor harbor any biological function. Yet growing evidence demonstrates the transcription and even translation of pseudogenes in mammals. Pseudogene expression was found to be predictive of tissue type in 7 human cancers. We suggest that pseudogene transcription and translation can be specific to cell types and should not be marginalized in analyses.
To investigate cell-specific transcription, we performed single-cell RNA sequencing (scRNA-seq) on 4 testes of adult mice. Pseudogenes share 85.5% sequence identity on average with their parental counterparts; thus, many reads from pseudogenes are multi-mapped reads (MMRs, reads that align at multiple loci in the genome). We tested 4 algorithms for MMR handling and identified expectation-maximization (EM) as the most reliable based on sequence coverage uniformity. Sequencing data were then aligned to the genome using EM for count correction of MMRs. We retrieved an average of 6,608 cells per sample, and identified all spermatogenic stages as well as Leydig cells, Sertoli cells and spermatogonia after clustering. Pseudogenes comprised ~7% (2,295) of all confidently detected transcripts, with 77% (1,759) of them detected in all samples and an additional 11.4% (261) detected in at least 3 samples. Differential expression analysis identified 27 pseudogenes specific to clusters. Trajectory analysis also revealed 21 of these as differentially expressed through pseudotime.
Next, we investigated the translation of pseudogenes with tandem mass spectrometry on the same samples. We identified 103 pseudogene-encoded proteins, 8 of which were detected in all 4 samples, and 5 in at least 3 samples. 40 of these 103 were also detected in scRNA-seq. 12.6% (13 out of 103) of pseudogenic proteins were detected according to HPP guidelines, whilst 87.4% were detected with a single unique peptide. 18 of 21 cell-specific pseudogenes were detected as proteins, including 1 with 2 unique peptides which met the HPP guidelines. Functional inference for this pseudogene based on its sequence identity with its parental gene suggests a role in spermatogonia differentiation.
This project highlights the overlooked transcription and translation of pseudogenes. Our findings challenge the definition of pseudogenes as mere genomic relics.
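
To show how expectation-maximization can redistribute multi-mapped reads between a parental gene and its pseudogene, here is a minimal sketch. It assumes each read is described only by the set of loci it aligns to and ignores transcript length and alignment quality; the function name, locus names, and read counts are illustrative and do not reproduce the tools evaluated in the study.

```python
def em_multimap_counts(read_alignments, n_iter=50):
    """Estimate per-locus read counts with a simple EM scheme.

    read_alignments: one list of candidate loci per read.
    E-step: split each read across its loci in proportion to current abundance.
    M-step: set abundance to each locus's share of the fractional counts.
    """
    loci = sorted({locus for alignment in read_alignments for locus in alignment})
    abundance = {locus: 1.0 / len(loci) for locus in loci}
    counts = {locus: 0.0 for locus in loci}
    n_reads = len(read_alignments)
    for _ in range(n_iter):
        counts = {locus: 0.0 for locus in loci}
        for alignment in read_alignments:
            total = sum(abundance[locus] for locus in alignment)
            for locus in alignment:
                counts[locus] += abundance[locus] / total
        abundance = {locus: count / n_reads for locus, count in counts.items()}
    return counts


# Toy data: 3 reads unique to the parental gene, 1 unique to the pseudogene,
# and 4 reads that align equally well to both loci.
reads = (
    [["parent_gene"]] * 3
    + [["pseudogene"]] * 1
    + [["parent_gene", "pseudogene"]] * 4
)
estimated = em_multimap_counts(reads)
print({locus: round(count, 3) for locus, count in estimated.items()})
```

In this toy case the four ambiguous reads are split 3:1, following the evidence from uniquely mapped reads, rather than being discarded or divided evenly between the two loci.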