The SciFinder tool lets you search Titles, Authors, and Abstracts of talks and panels. Enter your search term below and your results will be shown at the bottom of the page. You can also click on a track to see all the talks given in that track on that day.

Results

July 21, 2025
11:20-11:40
Proceedings Presentation: Harnessing Deep Learning for Proteome-Scale Detection of Amyloid Signaling Motifs
Confirmed Presenter: Witold Dyrka, Politechnika Wrocławska, Poland
Track: GenCompBio: General Computational Biology

Room: 02N
Format: In person

Authors List:

  • Krzysztof Pysz, Politechnika Wrocławska
  • Jakub Gałązka, Politechnika Wrocławska
  • Witold Dyrka, Politechnika Wrocławska

Presentation Overview:

Amyloid signaling sequences adopt the cross-β fold that is capable of self-replication in the templating process. Propagation of the amyloid fold from the receptor to the effector protein is used for signal transduction in the immune response pathways in animals, fungi and bacteria. So far, a dozen families of amyloid signaling motifs (ASMs) have been classified. Unfortunately, due to the wide variety of ASMs, it is difficult to identify them in the large protein databases available, which limits the possibility of conducting experimental studies. To date, various deep learning (DL) models have been applied across a range of protein-related tasks, including domain family classification and the prediction of protein structure and protein-protein interactions. In this study, we develop tailor-made bidirectional LSTM and BERT-based architectures to model ASMs, and compare their performance against a state-of-the-art machine learning grammatical model. Our research is focused on developing a discriminative model of generalized amyloid signaling motifs, capable of detecting ASMs in large data sets. The DL-based models are trained on a diverse set of motif families and a global negative set, and used to identify ASMs from remotely related families. We analyze how both models represent the data and demonstrate that the DL-based approaches effectively detect ASMs, including novel motifs, even at the genome scale.

July 21, 2025
11:40-12:00
Proceedings Presentation: From High-Throughput Evaluation to Wet-Lab Studies: Advancing Mutation Effect Prediction with a Retrieval-Enhanced Model
Confirmed Presenter: Bingxin Zhou, Shanghai Jiao Tong University, China
Track: GenCompBio: General Computational Biology

Room: 02N
Format: In person

Authors List:

  • Yang Tan, East China University of Science and Technology
  • Ruilin Wang, East China University of Science and Technology
  • Banghao Wu, Shanghai Jiao Tong University
  • Liang Hong, Shanghai Jiao Tong University
  • Bingxin Zhou, Shanghai Jiao Tong University

Presentation Overview:

Enzyme engineering is a critical approach for producing enzymes that meet industrial and research demands by modifying wild-type proteins to enhance properties such as catalytic activity and thermostability. Beyond traditional methods like directed evolution and rational design, recent advancements in deep learning offer cost-effective and high-performance alternatives. By encoding implicit coevolutionary patterns, these pre-trained models have become powerful tools for mutation effect prediction, with the central challenge being to uncover the intricate relationships among protein sequence, structure, and function. In this study, we present VenusREM, a retrieval-enhanced protein language model designed to capture local amino acid interactions across both spatial and temporal scales. VenusREM achieves state-of-the-art performance on 217 assays from the ProteinGym benchmark. Beyond high-throughput open benchmark validations, we conducted a low-throughput post-hoc analysis on more than 30 mutants to verify the model’s ability to improve the stability and binding affinity of a VHH antibody. We also validated the practical effectiveness of VenusREM by designing 10 novel mutants of a DNA polymerase and performing wet-lab experiments to evaluate their enhanced activity at elevated temperatures. Both in silico and experimental evaluations not only confirm the reliability of VenusREM as a computational tool for enzyme engineering but also demonstrate a comprehensive evaluation framework for future computational studies in mutation effect prediction. The implementation is publicly available at https://github.com/tyang816/VenusREM.

July 21, 2025
12:00-12:20
BE3D: A Computational Workflow for Integrative Structure-Function Analysis of Base-Editor Tiling Mutagenesis Data
Confirmed Presenter: Yoochan Myung, Broad Institute of MIT and Harvard, United States
Track: GenCompBio: General Computational Biology

Room: 02N
Format: In person

Authors List:

  • Yoochan Myung, Broad Institute of MIT and Harvard
  • Calvin Hu, Harvard University
  • Surya Mani, Broad Institute of MIT and Harvard
  • Annie Chen, Dana-Farber Cancer Institute
  • Vivian Lu, Broad Institute of MIT and Harvard
  • Brian Liau, Harvard University
  • Guillaume Poncet-Montange, Broad Institute of MIT and Harvard
  • Gabriel Griffin, Dana-Farber Cancer Institute
  • Sumaiya Iqbal, Broad Institute of MIT and Harvard

Presentation Overview:

Understanding functional consequences of single-nucleotide variants is critical for elucidating the genetic basis of diseases, yet current variant screening technologies have limitations. CRISPR base editors (BEs) efficiently generate transition mutations, enabling targeted variant screens. However, interpreting these screens in the context of protein structure-function relationships remains challenging due to technical constraints and biological variability. We introduce BE3D, an integrated workflow to systematically analyze BE tiling mutagenesis data within protein structural contexts. BE3D comprises three modules: (A) BE-QA, assessing screening quality based on biological hypotheses (e.g., knockout vs. neutral guides); (B) BE-Clust3D, identifying hits from BE screening with an expanded coverage using protein 3D structures and highlighting their clusters; and (C) BE-MetaClust3D, aggregating data from multiple screens, enhancing detection of functionally relevant sites across cell lines and species. Applying BE3D to published BE screens on DNMT3A and MEN1, we show that the BE-Clust3D method increased the coverage of functional residues by integrating structural data, yielding up to 3.5-fold improved detection of critical domains in DNMT3A and highlighting crucial drug-binding MEN1 residues (e.g., Met327, Trp346) that are inaccessible to and unidentifiable by BEs due to PAM limitations. Meta-aggregation of MEN1 BE screen readouts from two cell lines (MOLM-13, MV4-11) using BE-MetaClust3D further emphasized a drug-resistant mutational hotspot, achieving a stronger drug-binding site enrichment (3.43-fold) compared to individual screens (average odds ratio 2.2). In summary, BE3D is an open-source, scalable tool for integrative structure-function analysis and interpretation of BE tiling mutagenesis data (GitHub: https://github.com/broadinstitute/beclust3d-public). BE3D is expected to accelerate variant-to-function investigation and the discovery of drug-targetable sites.
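The enrichment figures in this abstract are reported as odds ratios. As a reminder of how such a number is computed, here is a minimal sketch of an odds ratio from a 2x2 table of screen hits versus membership in a site of interest. The function name and the counts are hypothetical illustrations and are not taken from BE3D or its data.

```python
def odds_ratio(hits_in_site, hits_outside, nonhits_in_site, nonhits_outside):
    """Odds ratio for a 2x2 table of hit status vs. site membership."""
    if hits_outside == 0 or nonhits_in_site == 0:
        raise ValueError("odds ratio is undefined when an off-diagonal cell is zero")
    # Odds of being a hit inside the site divided by the odds outside it.
    return (hits_in_site * nonhits_outside) / (hits_outside * nonhits_in_site)


# Hypothetical counts for illustration only (not from the BE3D study).
example = odds_ratio(hits_in_site=12, hits_outside=30,
                     nonhits_in_site=20, nonhits_outside=200)
print(f"odds ratio = {example:.2f}")
```

An odds ratio above 1 indicates that hits are over-represented inside the site relative to the rest of the protein.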

July 21, 2025
12:20-12:40
Enhanced protein evolution with inverse folding models using structural and evolutionary constraints
Confirmed Presenter: Yunjia Li, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences
Track: GenCompBio: General Computational Biology

Room: 02N
Format: In person

Authors List:

  • Yunjia Li, Institute of Genetics and Developmental Biology
  • Hongyuan Fei, Institute of Genetics and Developmental Biology
  • Caixia Gao, Institute of Genetics and Developmental Biology

Presentation Overview:

Protein engineering enables artificial protein evolution through iterative sequence changes, but current methods often suffer from low success rates and limited cost-effectiveness. Here, we present AiCE (AI-informed Constraints for protein Engineering), an approach that facilitates efficient protein evolution using generic protein inverse folding models, reducing dependence on human heuristics and task-specific models. By sampling sequences from inverse folding models and integrating structural and evolutionary constraints, AiCE identifies high-fitness single- and multi-mutations. We applied AiCE to eight protein engineering tasks, including deaminases, a nuclear localization sequence, nucleases, and a reverse transcriptase, spanning proteins from tens to thousands of residues, with success rates of 11%-88%. We also developed base editors for precision medicine and agriculture, including enABE8e (5 bp window), enSdd6-CBE (1.3-fold improved fidelity), and enDdd1-DdCBE (up to 14.3-fold enhanced mitochondrial activity). These results demonstrate that AiCE is a versatile, user-friendly mutation-design method that outperforms conventional approaches in efficiency, scalability, and generalizability.

July 21, 2025
12:40-13:00
Proceedings Presentation: Precise Prediction of Hotspot Residues in Protein-RNA Complexes Using Graph Attention Networks and Pre-trained Protein Language Models
Confirmed Presenter: Siyuan Shen, Central South University, China
Track: GenCompBio: General Computational Biology

Room: 02N
Format: In person

Authors List:

  • Siyuan Shen, Central South University
  • Jie Chen, Xinjiang University
  • Zhijian Huang, Central South University
  • Yuanpeng Zhang, Xinjiang University
  • Ziyu Fan, Central South University
  • Yuting Kong, Xinjiang Institute of Engineering
  • Lei Deng, Central South University

Presentation Overview:

Motivation: Protein-RNA interactions play a pivotal role in biological processes and disease mechanisms, with hotspot residues being critical for targeted drug design. Traditional experimental methods for identifying hotspot residues are often inefficient and expensive. Moreover, many existing prediction methods rely heavily on high-resolution structural data, which may not always be available. Consequently, there is an urgent need for an accurate and efficient sequence-based computational approach for predicting hotspot residues in protein-RNA complexes.
Results: In this study, we introduce DeepHotResi, a sequence-based computational method designed to predict hotspot residues in protein-RNA complexes. DeepHotResi leverages a pre-trained protein language model to predict protein structure and generate an amino acid contact map. To enhance feature representation, DeepHotResi integrates the Squeeze-and-Excitation (SE) module, which processes diverse amino acid-level features. Next, it constructs an amino acid feature network from the contact map and SE-Module-derived features. Finally, DeepHotResi employs a Graph Attention Network (GAT) to model hotspot residue prediction as a graph node classification task. Experimental results demonstrate that DeepHotResi outperforms state-of-the-art methods, effectively identifying hotspot residues in protein-RNA complexes with superior accuracy on the test set.

July 21, 2025
14:00-14:20
Proceedings Presentation: Trustworthy Causal Biomarker Discovery: A Multiomics Brain Imaging Genetics based Approach
Confirmed Presenter: Jin Zhang, Northwestern Polytechnical University, China
Track: GenCompBio: General Computational Biology

Room: 02N
Format: In person

Authors List:

  • Jin Zhang, Northwestern Polytechnical University
  • Yan Yang, Northwestern Polytechnical University
  • Muheng Shang, Northwestern Polytechnical University
  • Lei Guo, Northwestern Polytechnical University
  • Daoqiang Zhang, Nanjing University of Aeronautics and Astronautics
  • Lei Du, Northwestern Polytechnical University

Presentation Overview:

Discovering genetic variations underpinning brain disorders is important to understand their pathogenesis. Indirect associations or spurious causal relationships pose a threat to the reliability of biomarker discovery for brain disorders, potentially misleading or incurring bias in subsequent decision-making. Unfortunately, the stringent selection of reliable biomarker candidates for brain disorders remains a predominantly unexplored challenge. In this paper, to fill this gap, we propose a fresh and powerful scheme, referred to as the Causality-aware Genotype intermediate Phenotype Correlation Approach (Ca-GPCA). Specifically, we design a bidirectional association learning framework, integrated with a parallel causal variable decorrelation module and sparse variable regularizer module, to identify trustworthy causal biomarkers. A disease diagnosis module is further incorporated to ensure accurate diagnosis and identification of causal effects for pathogenesis. Additionally, considering the large computational burden incurred by high-dimensional genotype-phenotype covariances, we develop a fast and efficient strategy to reduce the runtime and promote practical availability and applicability. Extensive experimental results on four simulated datasets and real neuroimaging genetic data clearly show that Ca-GPCA outperforms state-of-the-art methods with excellent built-in interpretability. This can provide novel and reliable insights into the underlying pathogenic mechanisms of brain disorders.

July 21, 2025
14:20-14:40
Randomized Spatial PCA (RASP): a computationally efficient method for dimensionality reduction of high-resolution spatial transcriptomics data
Confirmed Presenter: Ian Gingerich, Dartmouth College, United States
Track: GenCompBio: General Computational Biology

Room: 03A
Format: In person

Authors List:

  • Ian Gingerich, Dartmouth College
  • Brittany Goods, Dartmouth College
  • H. Robert Frost, Dartmouth College

Presentation Overview:

Spatial transcriptomics (ST) provides critical insights into the complex spatial organization of gene expression in tissues, enabling researchers to unravel the intricate relationship between cellular environments and biological function. Identifying spatial domains within tissues is essential for understanding tissue architecture and the mechanisms underlying various biological processes, including development and disease progression. Here, we present Randomized Spatial PCA (RASP), a novel spatially aware dimensionality reduction method for spatial transcriptomics (ST) data. RASP is designed to be orders-of-magnitude faster than existing techniques, scale to ST data with hundreds-of-thousands of locations, support the flexible integration of non-transcriptomic covariates, and enable the reconstruction of de-noised and spatially smoothed expression values for individual genes. To achieve these goals, RASP uses a randomized two-stage principal component analysis (PCA) framework that leverages sparse matrix operations and configurable spatial smoothing. We compared the performance of RASP against five alternative methods (BASS, GraphST, SEDR, spatialPCA, and STAGATE) on four publicly available ST datasets generated using diverse techniques and resolutions (10x Visium, Stereo-Seq, MERFISH, and 10x Xenium) on human and mouse tissues. Our results demonstrate that RASP achieves tissue domain detection performance comparable or superior to existing methods with a several orders-of-magnitude improvement in computational speed. The efficiency of RASP enhances the analysis of complex ST data by facilitating the exploration of increasingly high-resolution subcellular ST datasets that are being generated.

July 21, 2025
14:40-15:00
Genetic Confounding and Comorbidity: Re-evaluating Causal Inference in Disease Associations
Confirmed Presenter: Hadasa Kaufman, The Hebrew University of Jerusalem, Israel
Track: GenCompBio: General Computational Biology

Room: 02N
Format: In person

Authors List:

  • Hadasa Kaufman, The Hebrew University of Jerusalem
  • Nadav Rapoport, Ben-Gurion University of the Negev
  • Michal Linial, The Hebrew University of Jerusalem

Presentation Overview:

Comorbidity analyses indicate that ~35% of common disease pairs tend to occur sequentially within the same individual. Understanding whether this comorbidity pairing reflects a causal relationship or results from shared (often unknown) external factors is crucial for clinical decisions. Although causal inference methods are increasingly used in clinical research, most methods fail to incorporate genetic information, despite the well-documented pleiotropy of single-nucleotide polymorphisms (SNPs). Herein, we develop methodologies aimed at addressing this knowledge gap. In an extensive analysis of 440×440 disease pairs (with ≥500 cases each) from the UK Biobank (UKB), we found that approximately 58% of disease pairs share at least one associated SNP. We compared and evaluated two complementary approaches for addressing the genetic confounding effects. In the first scheme (coined EXPO for Exclude Population), we removed all individuals that displayed shared associated SNPs for both diseases. When EXPO was applied to the 440×440 disease pairs, this method showed a significant shift in p-value distributions (p-value 6e-4), but failed to identify pairs confirming elimination of residual genetic signals. The second approach relied on a propensity score matching (PSM) protocol to balance genetic risk between matched groups. In a pilot test of 5×5 abundant disease pairs, we combined the PSM with polygenic risk scores (PRS). The PRS-PSM and classical PSM yielded consistent results in 80% of cases, and for another 15%, significant results were confirmed only in PRS-PSM. These findings suggest that incorporating genetic information via PRS-PSM will enhance genetic interpretation and validate the genuine causal relationship of outcomes.
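To make the matching step concrete, the following is a minimal sketch of greedy 1:1 nearest-neighbour propensity score matching with a caliper. It is a generic textbook variant and not the authors' protocol: the function name, the caliper value, and the example scores are hypothetical, and the propensity scores themselves (which in a PRS-PSM setting could be estimated from covariates that include polygenic risk scores) are assumed to be computed elsewhere.

```python
def match_by_propensity(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    treated, controls: dicts mapping an individual id to a propensity
    score in [0, 1]. Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(controls)  # controls not yet used in a pair
    pairs = []
    # Visit treated individuals in score order so the result is reproducible.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control by absolute score difference.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # match without replacement
    return pairs


# Hypothetical scores for illustration only.
treated = {"t1": 0.30, "t2": 0.62, "t3": 0.95}
controls = {"c1": 0.28, "c2": 0.60, "c3": 0.33, "c4": 0.10}
print(match_by_propensity(treated, controls))
```

Treated individuals with no control inside the caliper (here `t3`) are left unmatched, which is the usual trade-off between balance and sample size in this kind of matching.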

July 21, 2025
15:00-15:20
Sparse modeling of interactions enables fast detection of genome-wide epistasis in biobank-scale studies
Confirmed Presenter: Julian Stamp, Brown University, United States
Track: GenCompBio: General Computational Biology

Room: 02N
Format: In person

Authors List:

  • Julian Stamp, Brown University
  • Samuel Pattillo Smith, University of Texas
  • Daniel Weinreich, Brown University
  • Lorin Crawford, Microsoft Research

Presentation Overview:

The lack of computational methods capable of detecting epistasis in biobanks has led to uncertainty about the role of non-additive genetic effects on complex trait variation. The marginal epistasis framework is a powerful approach because it estimates the likelihood of a SNP being involved in any interaction, thereby reducing the multiple testing burden. Current implementations of this approach have failed to scale to large human studies. To address this, we present the sparse marginal epistasis (SME) test, which concentrates the scans for epistasis on regions of the genome that have known functional enrichment for a trait of interest. By leveraging the sparse nature of this modeling setup, we develop a novel statistical algorithm that allows SME to run 10 to 90 times faster than state-of-the-art epistatic mapping methods. In a study of blood traits measured in 349,411 individuals from the UK Biobank, we show that restricting searches for epistasis to variants in accessible chromatin regions facilitates the identification of genetic interactions associated with regulatory genomic elements.

July 21, 2025
15:20-15:40
Pan-cancer analysis in the real-world setting uncovers immunogenomic drivers of acquired resistance post-immunotherapy
Confirmed Presenter: Mohamed Reda Keddar, AstraZeneca, United Kingdom
Track: GenCompBio: General Computational Biology

Room: 02N
Format: In person

Authors List:

  • Mohamed Reda Keddar, AstraZeneca
  • Martin Miller, AstraZeneca

Presentation Overview:

Immune checkpoint blockade (ICB) has transformed cancer care, providing long-lasting benefit to patients across various cancer types. However, >80% of patients fail to respond to ICB (primary resistant) or eventually develop resistance after initial clinical benefit (acquired resistant). Due to difficulty in accessing post-progression clinical samples, remarkably little is known about which immunogenomic features emerge as patients progress on therapy. Here, we use the Tempus AI real-world clinicogenomic database to build a pan-cancer and multimodal dataset of clinical and pre/post-treatment RNA/DNA-seq data from >5,000 patients across NSCLC, HNC, and TNBC. Using a systematic bioinformatics approach, we characterise and compare the clinical and molecular features of acquired vs. primary resistant patients in the post-progression setting. We find that acquired resistant patients consistently derive an ICB-specific prognostic advantage, as they survive significantly longer than their primary resistant counterparts even after progressing. At the molecular level, acquired resistant tumours show a universally inflamed tumour microenvironment (TME) post-progression, specifically maintained or induced by ICB. Using dN/dS to evaluate mutation selection from pre- to post-treatment, we identify ICB-specific mutations selected for post-acquired resistance. These mutations were involved in functionally relevant molecular processes, including loss of antigen processing and presentation, dysregulated metabolism, and putative immune escape via oncogenic signalling pathways. Altogether, our analysis of post-progression samples mapped out the molecular underpinnings of acquired vs. primary ICB resistance and offers an opportunity for improved patient selection strategies and positioning of next-generation immunotherapies to re-activate an effective anti-tumour response and optimise outcome.

July 21, 2025
15:40-16:00
Beyond Mutation Frequency: A Bayesian Framework for Identifying Functional Cancer Drivers from Single-Cell Data
Confirmed Presenter: Komlan Atitey, NIH, United States
Track: GenCompBio: General Computational Biology

Room: 02N
Format: In person

Authors List:

  • Komlan Atitey, NIH
  • Benedict Anchang, National Institute of Environmental Health Sciences

Presentation Overview:

Cancer is driven by genetic alterations, especially gain-of-function mutations in oncogenes (OGs) and loss-of-function mutations in tumor suppressor genes (TSGs). Traditional approaches to identifying cancer driver genes (CDGs) rely heavily on mutation frequency across patient cohorts. While effective at detecting common drivers, these methods often miss rare but functionally significant mutations, and they struggle with the complexity introduced by tumor heterogeneity. To address these limitations, we present PICDGI (Predict Immunosuppressive Cancer Driver Genes using gene-gene Interaction features), a Bayesian framework that integrates time-series single-cell RNA sequencing (scRNA-seq) data with gene-gene interaction dynamics. PICDGI moves beyond mutation frequency by modeling gene regulatory influence and functional impact within evolving tumor cell populations. PICDGI begins by identifying cancer progenitor cells across tumor stages and reconstructs gene expression trajectories during tumor development. It then uses variational Bayesian inference to infer dynamic gene interaction networks and introduces the gene driver coefficient, a novel metric that quantifies each gene’s regulatory influence on downstream targets. This enables the identification of both known and previously unrecognized driver genes based on their functional roles in tumor progression and immune evasion. When applied to scRNA-seq data from nine samples across three lung adenocarcinoma (LUAD) patients, PICDGI successfully recovered established OGs and TSGs (62%) and revealed novel candidate drivers (38%) with strong expression patterns and relevance to tumor evolution, as confirmed by Moran’s I test. Overall, PICDGI provides a biologically grounded, interaction-driven strategy for identifying functional cancer drivers from single-cell data, offering a powerful tool for advancing personalized cancer genomics.

July 21, 2025
16:40-17:00
AdaGenes: A streaming processor for high-throughput annotation and filtering of sequence variant data
Confirmed Presenter: Nadine S. Kurz, Department of Medical Bioinformatics, University Medical Center Göttingen
Track: GenCompBio: General Computational Biology

Room: 02N
Format: In person

Authors List:

  • Nadine S. Kurz, Department of Medical Bioinformatics
  • Klara Drofenik, Department of Medical Bioinformatics
  • Kevin Kornrumpf, Department of Medical Bioinformatics
  • Kirsten Reuter-Jessen, Institute of Pathology
  • Jürgen Dönitz, Department of Medical Bioinformatics

Presentation Overview:

The amount of sequencing data resulting from whole exome or genome sequencing (WES / WGS) presents challenges for annotation, filtering, and analysis.
We introduce the Adaptive Genes processor (AdaGenes), a sequence variant streaming processor designed to efficiently annotate, filter, LiftOver and transform large-scale VCF files. AdaGenes provides a unified solution for researchers to streamline VCF processing workflows and address common challenges in genomic data processing, for example filtering out non-relevant variants so that further processing focuses on the relevant positions. AdaGenes integrates genomic, transcript and protein data annotations, while maintaining scalability and performance for high-throughput workflows. Leveraging a streaming architecture, AdaGenes processes variant data incrementally, enabling high performance on large files due to low memory consumption and seamless handling of whole genome files.
The interactive front end provides the user with the ability to dynamically filter variants based on user-defined criteria.
It allows researchers and clinicians to efficiently analyze large genomic datasets, facilitating variant interpretation in diverse genomics applications, such as population studies, clinical diagnostics, and precision medicine.
AdaGenes is able to parse and convert multiple file formats while preserving metadata, and provides a report of the changes made to the variant file.
AdaGenes is available at https://mtb.bioinf.med.uni-goettingen.de/adagenes.
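The streaming idea described above, processing variants one record at a time instead of loading the whole file, can be illustrated with a short generator. This is a generic sketch and not the AdaGenes implementation or API: the function name and the quality threshold are hypothetical, and only the standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) is assumed.

```python
import io


def stream_filter_vcf(lines, min_qual=30.0):
    """Yield header lines unchanged and data lines with QUAL >= min_qual.

    `lines` is any iterable of text lines (for example an open file), so
    only one record is held in memory at a time.
    """
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            yield line  # meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]  # sixth VCF column is QUAL; "." means missing
        if qual != "." and float(qual) >= min_qual:
            yield line


# Tiny in-memory example; a real run would pass an open file handle instead.
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n"
    "chr1\t200\t.\tC\tT\t10\tPASS\t.\n"
    "chr2\t300\t.\tG\tA\t.\tPASS\t.\n"
)
for kept in stream_filter_vcf(io.StringIO(vcf_text)):
    print(kept, end="")
```

Because the function is a generator, further steps such as annotation can be chained on its output without materialising the full file.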

July 21, 2025
17:00-17:20
Comprehensive framework for assessing discrepancies in genomic content and species-level annotations across microbial reference genomes
Confirmed Presenter: Serghei Mangul, Sage Bionetworks, United States; University of Suceava
Track: GenCompBio: General Computational Biology

Room: 02N
Format: In person

Authors List:

  • Grigore Boldirev, Georgia State University
  • Mohammed Alser, Georgia State University
  • Peace Aguma, Georgia State University
  • Viorel Munteanu, University of Suceava
  • Mihai Dimian, Department of Computers
  • Alex Zelikovsky, Georgia State University
  • Serghei Mangul, Sage Bionetworks

Presentation Overview:

Metagenomics research provides insights into the composition, diversity, and functions of microbial communities in various environments. To identify bacterial species, sequencing reads from samples are typically mapped to reference genomes found in bacterial reference databases. However, multiple references may share the same taxonomic identifiers while containing different genomic information, which can lead to inconsistencies in downstream analyses. We have developed a novel comprehensive framework for assessing discrepancies in genomic content and species-level annotations across microbial reference genomes, and applied it to evaluate the two most widely used bacterial reference databases: PATRIC and RefSeq. NCBI’s taxonomic identifiers were used to assess the agreement between databases at the species level. Species found in both databases were identified by matching taxIDs. To compare genomic representation, the BLAST tool was used to align all contigs from one database to all contigs of the corresponding strain in the other database. This analysis was extended to all overlapping species where strain-level information was available. The study revealed substantial discrepancies between databases. Among single-contig genomes, 85.5% exhibited 100% genomic similarity, 14.4% demonstrated an average similarity of 94.3%, and 17 genomes showed less than 75% similarity. For genomes with 2–10 contigs, 82.6% had 100% similarity, 17% averaged 94.79% similarity, and 128 genomes fell below the 75% threshold. Our results emphasize significant variability in genome representation across reference databases, especially for multi-contig genomes. Our framework will provide a foundation for building a more consistent and comprehensive reference database, which will improve the accuracy, rigor, and reproducibility of metagenomics research.

July 21, 2025
17:20-17:40
Building Ultralarge Pangenomes Using Scalable and Compressive Techniques
Confirmed Presenter: Sumit Walia, University of California San Diego, United States
Track: GenCompBio: General Computational Biology

Room: 02N
Format: In person

Authors List:

  • Sumit Walia, University of California San Diego
  • Harsh Motwani, University of California San Diego
  • Yu-Hsiang Tseng, University of California San Diego
  • Kyle Smith, University of California San Diego
  • Russell Corbett-Detig, University of California San Diego
  • Yatish Turakhia, University of California San Diego

Presentation Overview:

Pangenomics studies intra-species genetic diversity by analyzing collections of genomes from the same species. As pangenomics scales to millions of sequences, efficient data formats become crucial to enabling future applications and ensuring efficient computational and memory performance for pangenomic analysis. Current pangenomic formats primarily store variation across genomes but fail to capture shared evolutionary and mutational histories, limiting their applicability. They also face scalability issues due to storage and computational inefficiencies. To address these limitations, we present PanMAN (Pangenome Mutation-Annotated Network), a novel pangenomic format that is the most compact, scalable, and information-rich among all variation-preserving formats. PanMAN encodes not only genome alignments and variations but also shared mutational and evolutionary histories inferred across genomes, making it the first format to unify multiple whole-genome alignment, phylogeny, and mutational histories into a single framework. By leveraging "evolutionary compression," PanMAN achieves 3.5X to 1391X compression over other formats (GFA, VG, GBZ, PanGraph, AGC, and tskit) across microbial datasets. To demonstrate scalability, we built the largest pangenome in terms of number of sequences: a PanMAN with 8 million SARS-CoV-2 genomes that requires just 366MB of disk space. Using SARS-CoV-2 as a case study, we show that PanMAN offers a detailed and accurate portrayal of the pathogen's evolutionary and mutational history, facilitating the discovery of new biological insights. We also present panmanUtils, a software toolkit for constructing, analyzing, and integrating PanMANs with existing pangenomic workflows. PanMANs are poised to enhance the scale, speed, resolution, and overall scope of pangenomic analyses and data sharing.

July 21, 2025
17:40-18:00
Information Content as a metric to evaluate and compare DNA Language Models
Confirmed Presenter: Melissa Sanabria, Technische Universität Dresden, Germany
Track: GenCompBio: General Computational Biology

Room: 02N
Format: In person

Authors List:

  • Melissa Sanabria, Technische Universität Dresden
  • Anna R. Poetsch, Technische Universität Dresden

Presentation Overview:

Large language models have transformed the field of natural language processing by enabling the generation of coherent and meaningful text. This success has inspired researchers to apply similar approaches to biological sequences, particularly DNA, where the underlying "language" of the genome holds biological insight. DNA language models, such as GROVER, offer a promising avenue for advancing genomic analysis. Despite their potential, evaluating and comparing those models remains a significant challenge. Existing metrics often rely on genome-specific motifs, biological annotations, or the number of parameters used during model training. These limitations make it difficult to perform consistent and generalizable assessments across different models or genomic contexts. We propose the use of entropy and information content as general-purpose metrics to evaluate DNA language models. By computing these measures over whole-genome predictions, we can quantify how much information the model captures during training. It allows us to compare not only between different types of genomic elements and regions—such as coding vs. non-coding sequences or promoters vs. intergenic regions—but also across different versions of the human genome. We also introduce a set of pretrained DNA language models for three major human genome builds: hg19, hg38, and telomere-to-telomere (T2T). Our analysis reveals that, although T2T includes a substantially greater proportion of repetitive sequences, this increase does not adversely affect the information content observed in other genomic regions. Our approach provides a more interpretable and genome-agnostic framework for evaluating DNA language models and offers new insights into how different genome assemblies influence model learning and performance.
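As a rough illustration of the two metrics named in this abstract, the sketch below computes Shannon entropy and information content for a model's predicted probability distribution at one position, then averages information content over several positions. It assumes the common sequence-logo definition (information content = maximum entropy minus observed entropy) and a four-letter nucleotide alphabet; the authors' exact definition and the models' actual token vocabulary may differ, and all function names and probabilities here are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_content(probs):
    """Maximum possible entropy minus observed entropy, in bits."""
    return math.log2(len(probs)) - shannon_entropy(probs)


def mean_information_content(predictions):
    """Average information content over per-position predictions."""
    return sum(information_content(p) for p in predictions) / len(predictions)


# Hypothetical per-position predictions over (A, C, G, T).
predictions = [
    [0.25, 0.25, 0.25, 0.25],  # uninformative: 0 bits
    [1.0, 0.0, 0.0, 0.0],      # fully confident: 2 bits
    [0.5, 0.5, 0.0, 0.0],      # two equally likely bases: 1 bit
]
print(mean_information_content(predictions))
```

Averaging such values separately over, say, coding and non-coding positions would give the kind of region-level comparison the abstract describes.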