



Schedule for All


Date Start Time End Time Room Track Title Confirmed Presenter Format Authors Abstract
2025-07-21 11:20:00 12:00:00 03B 3DSIG Decoding Immunity: Structural and Dynamical Insights Driving Antibody Innovation Franca Fraternali Franca Fraternali Effective adaptive immune responses rely on antibodies of different isotypes performing distinct effector functions. Understanding their structural diversity is crucial for engineering antibodies with optimal stability, binding, and therapeutic potential. In this keynote, I will present our integrative computational approaches to guide antibody design, which include isotype classification, chain compatibility prediction, 3D structural modeling, and analysis of allosteric communication. In designing novel antibodies, correct pairing of antibody heavy and light chains is essential for effective function, yet the rules governing this remain unclear. I will introduce ImmunoMatch, a suite of AI models fine-tuned on full-length variable regions to predict cognate H–L chain pairs. Built on the AntiBERTa2 language model, ImmunoMatch outperforms CDR- and gene usage–based models, with further improvements from chain type–specific tuning. Applied to B cell repertoires and therapeutic antibodies, ImmunoMatch identifies chain pairing refinement as a hallmark of B cell maturation and uncovers key sequence features driving specificity. Moving beyond the traditional focus on CDRs, we show that framework (FW) mutations can modulate antibody stability and effector function through long-range structural effects. Our analyses revealed that antibody language models (AbLMs) alone lack predictive power for FW mutagenesis. To improve on this, we adopted a structure-based approach, suggesting future directions such as fine-tuning AbLMs with in vitro FW-specific mutational data to improve their utility in antibody design. This shift can broaden the scope of rational engineering toward non-CDR regions and developability attributes, highlighting the need for a holistic view of antibody design.
2025-07-21 12:00:00 12:20:00 03B 3DSIG Rapid and accurate prediction of protein homo-oligomer symmetry using Seq2Symm Meghana Kshirsagar Meghana Kshirsagar, Artur Meller, Ian R. Humphreys, Samuel Sledzieski, Yixi Xu, Rahul Dodhia, Eric Horvitz, Bonnie Berger, Gregory R Bowman, Juan Lavista Ferres, David Baker, Minkyung Baek The majority of proteins must form higher-order assemblies to perform their biological functions, yet few machine learning models can accurately and rapidly predict the symmetry of assemblies involving multiple copies of the same protein chain. Here, we address this gap by fine-tuning several classes of protein foundation models to predict homo-oligomer symmetry. Our best model, Seq2Symm, which utilizes ESM2, outperforms existing template-based and deep learning methods, achieving an average AUC-PR of 0.47, 0.44 and 0.49 across homo-oligomer symmetries on three held-out test sets, compared to 0.24, 0.24 and 0.25 with template-based search. Seq2Symm uses a single sequence as input and can predict at the rate of ~80,000 proteins/hour. We apply this method to 5 proteomes and ~3.5 million unlabeled protein sequences, showing its promise to be used in conjunction with downstream computationally intensive all-atom structure generation methods such as RoseTTAFold2 and AlphaFold2-multimer. Code, datasets, and models are available at: https://github.com/microsoft/seq2symm.
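A minimal sketch of the general recipe described here (a protein language model encoder feeding a symmetry classification head), not the authors' code: it uses the public fair-esm API with mean pooling and a made-up class list; the real Seq2Symm architecture and weights are in the linked repository.

import torch
import esm  # fair-esm package

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

_, strs, tokens = batch_converter([("query", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")])
with torch.no_grad():
    out = model(tokens, repr_layers=[33])
# Mean-pool per-residue embeddings (skipping the BOS token) into one vector.
embedding = out["representations"][33][0, 1:len(strs[0]) + 1].mean(0)

symmetry_classes = ["C1", "C2", "C3", "C4", "D2"]  # illustrative label set
head = torch.nn.Linear(embedding.shape[0], len(symmetry_classes))
logits = head(embedding)  # the head (and optionally the encoder) is fine-tuned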
2025-07-21 12:20:00 12:40:00 03B 3DSIG Probing Homo-Oligomeric Interaction Signals in Protein Language Models Zhidian Zhang Zhidian Zhang, Yo Akiyama, Yehlin Cho, Sergey Ovchinnikov Homo-oligomeric protein complexes—assemblies of identical subunits—are central to many biological processes and disease mechanisms. Accurate prediction of inter-subunit contacts within these assemblies remains a key challenge, especially as existing structure predictors like AlphaFold scale poorly with complex size. Protein language models (pLMs), trained on vast sequence databases, offer a scalable alternative by learning coevolutionary statistics from single-chain inputs. In this work, we systematically investigate whether pLMs implicitly learn inter-subunit interaction signals, even when trained solely on monomeric sequences. We find that pLM-predicted contact maps often contain partial inter-subunit signal, but prediction performance is consistently weaker than for intra-subunit contacts. Notably, we observe that larger pLMs recover more accurate inter-contacts, suggesting that model scaling enhances structural resolution of homo-oligomers. Interestingly, missing inter-contact signals often correspond to interfaces without strong biophysical support outside of crystallography, raising the possibility that some predicted absences reflect genuine lack of physiological relevance. Our findings suggest that inter-subunit contact prediction from pLMs could serve as a computational filter for distinguishing biologically relevant homo-oligomers from crystallographic artifacts. These findings open a new avenue for leveraging pLMs not only as structure predictors, but as tools for dissecting the evolutionary logic and physiological relevance of protein assemblies.
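ESM-style pLMs expose predicted contact maps directly, which makes this kind of probing easy to reproduce in outline. A sketch using the standard fair-esm API; the comparison against intra- and inter-chain contacts from a homodimer structure is only indicated in the final comment.

import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

_, _, tokens = batch_converter([("monomer", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")])
with torch.no_grad():
    contacts = model(tokens, return_contacts=True)["contacts"][0]  # (L, L)
# Score `contacts` separately against intra-chain and inter-chain contact maps
# derived from a homodimeric structure to quantify the inter-subunit signal.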
2025-07-21 12:40:00 13:00:00 03B 3DSIG Integration between large phenomics and genomics data using scRNA-seq techniques revealed a genetically driven differential progression pattern in neurodegenerative eye disease Christian Anderson, Liam Scott, Simone Muller, Melanie Bahlo, Lea Scheppke, Roberto Bonelli In this study, we used clinical imaging phenotypic data from a large imaging biobank of eyes affected by Macular Telangiectasia Type 2 (MacTel). We extracted 96 clinically graded phenotypic variables on 7,328 eyes (3,675 patients); 15% of eyes had longitudinal observations spanning an average of 5 years. By borrowing techniques from single-cell RNA-seq analysis, we were able to divide the eyes of MacTel patients into 11 distinct clusters. Just like cell differentiation, these clusters were differentiated by retinal phenotypes’ presence and severity. Pseudo-time calculations revealed a clear severity difference between these clusters, additionally validated via a clinically defined linear severity score. Minimum spanning tree calculation (oblivious to the longitudinal nature of some of our data) revealed two differential progression routes across clusters, one leading to a more vascularly altered retina and the other to a neurodegenerative process resulting in vision loss. Our longitudinal data additionally validated this bifurcated route. Analysis of the progression route in fellow eyes revealed a strong intra-patient agreement. Integration with demographic data revealed that patients affected by type 2 diabetes – a known MacTel comorbidity – were significantly more likely to progress following the neurodegenerative route. By integrating genetic data on 2,182 patients, we tested for association between all MacTel-significant GWAS loci and progression routes and found a strong genetic association with a locus known to impact retinal vasculature and thickness. Lastly, modelling the effect of genetic loci on clinical phenotypes revealed a strong genetic contribution to retinal insult presence, severity, and progression rate.
2025-07-21 14:00:00 14:40:00 03B 3DSIG Simulations in the age of AlphaFold: dynamics, drug resistance and enzyme design Adrian Mulholland Adrian Mulholland Molecular simulations contribute to practical protein design and engineering workflows. Equilibrium molecular dynamics (MD) simulations not only test and filter designs, but can predict binding, redox and other properties for engineering and optimization. This includes predicting activation heat capacities that determine enzyme temperature activity optima, and analysing the causes of epistasis. A particular challenge is understanding and predicting mutations far from the active site that affect activity, often introduced by evolution. Dynamical-nonequilibrium (D-NEMD) simulations can predict distal sites relevant to modulating activity, cryptic binding sites, and allosteric effects. Simulations of chemical reactions in proteins with combined quantum mechanics/molecular mechanics (QM/MM) methods characterize crucial species in catalysis, including transition states and reaction intermediates, and how they are formed and stabilized. QM/MM models can be used as ‘theozyme’ templates for enzyme design. QM/MM calculations also allow prediction of spectroscopic and other electronic properties, assisting in design and optimization of photovoltaic proteins, e.g. in designed spectral tuning. Simulations can also analyse trajectories and effects of directed and natural evolution, providing insights for enzyme design and engineering. Examples include identifying the dynamical origins of heat capacity changes introduced by directed evolution of designer enzymes, and revealing e.g. how local electric fields are optimized for specific catalytic activities in beta-lactamase enzymes that cause resistance to ‘last resort’ antibiotics. Electric fields are vital features of many natural enzymes, including heme peroxidases in which they drive proton delivery. Electric field calculations and MD simulations can be combined effectively with AI tools for protein engineering in evolutionary enzyme design.
2025-07-21 14:40:00 15:00:00 03B 3DSIG FlowProt: Classifier-Guided Flow Matching for Targeted Protein Backbone Generation in the de novo DNA Methyltransferase Family Ali Baran Taşdemir Ali Baran Taşdemir, Ayşe Berçin Barlas, Abdurrahman Olğaç, Ezgi Karaca, Tunca Doğan Designing novel proteins with both structural stability and targeted molecular function remains a central challenge in computational biology. While recent generative models such as diffusion and flow-matching offer promising capabilities for protein backbone generation, functional controllability is still limited. In this work, we introduce FlowProt, a classifier-guided flow-matching generative model designed to create protein backbones with domain-specific functional properties. As a case study, we focus on the catalytic domain of human DNA methyltransferase DNMT3A, a 286-residue protein essential in early epigenetic regulation. FlowProt builds on the FrameFlow architecture, predicting per-residue translation and rotation matrices to reconstruct 3D backbones from noise. A domain classifier, trained to distinguish DNMT proteins from others, guides the model during inference using gradient-based feedback. This enables FlowProt to steer generation toward DNMT-like structures. We evaluate backbone quality using self-consistency metrics (scRMSD, scTM, pLDDT) and domain relevance using ProGReS, sequence similarity, and SAM-binding potential. FlowProt consistently generates high-confidence structures up to 286 residues—the exact length of DNMT3A—with low scRMSD, high scTM, and strong functional similarity. We further validate our designs through structure-based alignment and cofactor-binding analysis with Chai-1, demonstrating high-confidence SAM-binding regions in the generated models. To our knowledge, FlowProt is the first method to integrate flow-matching with classifier guidance for domain-specific backbone design. As future work, we aim to assess DNA-binding potential and further refine functional capabilities via molecular dynamics simulations and benchmarking against state-of-the-art protein design models.
2025-07-21 15:00:00 15:20:00 03B 3DSIG Molecular design and structure-based modeling with generative deep learning Remo Rohs Jesse Weller, Remo Rohs The rapid expansion of crystal structure data and libraries of readily synthesizable molecules has recently opened up new areas of chemical space for drug discovery. Combined with advancements in virtual ligand screening, these expanded libraries are making an impact in early-stage drug discovery. However, traditional virtual screening methods are still only able to explore a small fraction of the near-infinite drug-like chemical space. Generative deep learning techniques address these limitations by leveraging existing data to learn the key intra- and inter-molecular relationships in drug-target interactions. We present DrugHIVE, a deep hierarchical variational autoencoder that surpasses leading autoregressive and diffusion-based models in both speed and performance on standard generative tasks. Our model generates molecules in a rapid single-shot fashion, making it highly scalable and orders of magnitude faster than other top approaches requiring slow, multi-step inference. DrugHIVE’s hierarchical architecture provides enhanced control over molecular generation, enabling substantial improvements in virtual screening efficiency and automating various drug design processes such as de novo generation, molecular optimization, scaffold hopping, linker design, and high-throughput pattern replacement. We demonstrate an improved ability to optimize drug-like properties, synthesizability, binding affinity, and selectivity of molecules through evolutionary latent space search using both experimentally resolved and AlphaFold-predicted receptor structures. Recently, we used DrugHIVE to design novel compounds as prospective therapeutics for the important p53 cancer target. These promising new compounds have been synthesized and are currently undergoing experimental testing.
2025-07-21 15:20:00 15:40:00 03B 3DSIG BC-Design: A Biochemistry-Aware Framework for Highly Accurate Inverse Protein Folding Xiangru Tang Xiangru Tang, Xinwu Ye, Fang Wu, Daniel Shao, Dong Xu, Mark Gerstein Inverse protein folding, which aims to design amino acid sequences for desired protein structures, is fundamental to protein engineering and therapeutic development. While recent deep-learning approaches have made remarkable progress, they typically represent biochemical properties as discrete features associated with individual residues. Here, we present BC-Design, a framework that represents biochemical properties as continuous distributions across protein surfaces and interiors. Through contrastive learning, our model learns to encode essential biochemical information within structure embeddings, enabling sequence prediction using only structural input during inference—maintaining compatibility with real-world applications while leveraging biochemical awareness. BC-Design achieves 88% sequence recovery versus state-of-the-art methods’ 67% (a 21% absolute improvement) and reduces perplexity from 2.4 to 1.5 (39.5% relative improvement) on the CATH 4.2 benchmark. Notably, our model exhibits robust generalization across diverse protein characteristics, performing consistently well on proteins of varying sizes (50-500 residues), structural complexity (measured by contact order), and all major CATH fold classes. Through ablation studies, we demonstrate the complementary contributions of structural and biochemical information to this performance. Overall, BC-Design establishes a new paradigm for integrating multimodal protein information, opening new avenues for computational protein engineering and drug discovery.
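For reference, the two headline metrics reduce to simple formulas; this generic sketch (not the BC-Design evaluation code) assumes per-position probability dictionaries for the perplexity.

import math

def sequence_recovery(designed: str, native: str) -> float:
    # Fraction of positions where the designed residue matches the native one.
    return sum(a == b for a, b in zip(designed, native)) / len(native)

def perplexity(native: str, per_residue_probs) -> float:
    # exp(mean negative log-likelihood) of native residues under the model's
    # per-position distributions (list of dicts: residue -> probability).
    nll = -sum(math.log(p[a]) for a, p in zip(native, per_residue_probs))
    return math.exp(nll / len(native))

print(sequence_recovery("MKTAYI", "MKTHYI"))  # 0.833...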
2025-07-21 15:40:00 16:00:00 03B 3DSIG DivPro: Diverse Protein Sequence Design with Direct Structure Recovery Guidance Xinyi Zhou Xinyi Zhou, Guibao Shen, Yingcong Chen, Guangyong Chen, Pheng Ann Heng Motivation: Structure-based protein design, which aims to generate sequences that fold into desired structures, is crucial for designing proteins with novel structures and functions. Current deep learning-based methods primarily focus on training and evaluating models using sequence recovery-based metrics. However, this approach overlooks the inherent ambiguity in the relationship between protein sequences and structures. Relying solely on sequence recovery as a training objective limits the models’ ability to produce diverse sequences that maintain similar structures. These limitations become more pronounced when dealing with remote homologous proteins, which share functional and structural similarities despite low sequence identity. Results: Here, we present DivPro, a model that learns to design diverse sequences that can fold into similar structures. To improve sequence diversity, instead of learning a single fixed sequence representation for an input structure as in existing methods, DivPro learns a probabilistic sequence space from which diverse sequences can be sampled. We leverage recent advancements in in-silico protein structure prediction. By incorporating structure prediction results as training guidance, DivPro ensures that sequences sampled from this learned space reliably fold into the target structure. We conducted extensive experiments on three sequence design benchmarks and evaluated the structures of designed sequences using structure prediction models including AlphaFold2. Results show that DivPro maintains high structure recovery while significantly improving sequence diversity.
2025-07-21 16:40:00 16:50:00 03B 3DSIG AlphaPulldown2—a general pipeline for high-throughput structural modeling Dmitry Molodenskiy Dmitry Molodenskiy, Valentin Maurer, Dingquan Yu, Grzegorz Chojnowski, Stefan Bienert, Gerardo Tauriello, Konstantin Gilep, Torsten Schwede, Jan Kosinski AlphaPulldown2 streamlines protein structural modeling by automating workflows, improving code adaptability, and optimizing data management for large-scale applications. It introduces an automated Snakemake pipeline, compressed data storage, support for additional modeling backends like AlphaFold3 and AlphaLink2, and a range of other improvements. These upgrades make AlphaPulldown2 a versatile platform for predicting both binary interactions and complex multi-unit assemblies.
2025-07-21 16:50:00 17:00:00 03B 3DSIG Extending 3Di: Increasing Protein Structure Search Sensitivity with a Complementary Alphabet Michel van Kempen Michel van Kempen, Johannes Soeding Fast protein structure search methods, such as Foldseek, are essential to make use of the vast amount of structural information generated by structure prediction methods. In Foldseek, the key idea is to represent structures as sequences of discrete tokens from a structural alphabet, enabling fast searches through structure databases using efficient sequence comparison methods. Foldseek uses the 3Di alphabet for structure representation. However, this representation, comprising 20 states, describes only a limited aspect of the overall structure, resulting in lower search sensitivity compared to methods like TM-align or Dali, which use the full structure. To further improve structure search sensitivity, we present a new structural alphabet as an extension to the established 3Di alphabet. Instead of replacing 3Di, our new alphabet was trained to encode structural information complementary to the 3Di states. Combining the two alphabets makes it possible to balance search sensitivity and speed: the 3Di alphabet alone is used for the most time-critical tasks, while the final alignments benefit from additional structural information from both alphabets, improving overall search performance. On the SCOPe dataset, extending the 3Di alphabet with 12 states of the new alphabet increases search sensitivity at the superfamily level by 22%, compared to 13% when adding the amino acid alphabet instead. Moreover, adding the new alphabet as a third alphabet to Foldseek improves its search sensitivity by 4.5%.
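For context, a standard Foldseek structure search with the released pipeline looks as follows (wrapped in Python to keep one language throughout; the complementary alphabet presented in this talk is not assumed to be a public option yet).

import subprocess

# Query one PDB file against a directory of target structures; hits land in
# BLAST-style tabular format in hits.m8.
subprocess.run(
    ["foldseek", "easy-search", "query.pdb", "target_structures/", "hits.m8", "tmp"],
    check=True,
)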
2025-07-21 17:00:00 17:10:00 03B 3DSIG PPI3D clusters: non-redundant datasets of protein-protein, protein-peptide and protein-nucleic acid complexes, interaction interfaces and binding sites Justas Dapkunas Justas Dapkunas, Kliment Olechnovic, Ceslovas Venclovas To accomplish their functions in living organisms, proteins usually interact with various biological macromolecules, including other proteins and nucleic acids. Despite recent progress in structure prediction, only part of these interactions can be predicted accurately, and modeling those involving nucleic acids is especially hard. Therefore, improved computational methods for analysis and prediction of biomolecular interactions are in high demand. The development of such methods largely depends on the availability of reliable data. However, the experimental data in the Protein Data Bank (PDB) are noisy and hard to interpret. To facilitate the analysis of biomolecular interactions, we developed the PPI3D web resource, which is based on a database of clustered non-redundant sets of biomolecular complexes, interaction interfaces and binding sites. The structures are clustered based on both sequence and structure similarity, thus retaining alternative interaction modes. All protein-protein, protein-peptide and protein-nucleic acid interaction interfaces and binding sites are pre-analyzed by means of Voronoi tessellation. The data are updated every week to keep in sync with the PDB. Users can query the data by different criteria, select the interactions of interest, download the desired data subsets in tabular format and as coordinate files, and use them for detailed investigation of protein interactions or for training machine learning models. We expect that the PPI3D clusters will become a useful resource for researchers working on diverse problems related to biomolecular interactions. PPI3D is available at http://bioinformatics.ibt.lt/ppi3d/.
2025-07-21 17:10:00 17:30:00 03B 3DSIG From GWAS to Protein Structures: Illuminating Stress Resistance in Plants Su Datt Lam Fatima Shahid, Neeladri Sen, Christine Orengo, Su Datt Lam Plants face significant environmental stresses such as pathogens, salinity, drought, and extreme temperatures. To survive, they evolve diverse adaptive mechanisms. Genome-wide association studies (GWAS) are widely used to identify genes linked to stress resistance, but often generate too many variants to interpret easily. This study maps GWAS-derived missense mutations to rice protein structures to prioritise those with functional impact. Despite limited experimentally determined plant protein structures, resources like the AlphaFold Protein Structure Database and The Encyclopedia of Domains (TED) offer high-quality models and domain annotations. We focused on TED domains with reliable structure—excluding those with low pLDDT scores, disorder, poor packing, or non-globular features. Stress-resistance mutations from the GWAS Atlas were then mapped to these domains. Functional sites were predicted using P2Rank and AlphaFill, and proximity of mutations to these sites was analysed. Among 149 mutations mapped to 113 TED domains, 14 were predicted as non-deleterious by MutPred2—potential gain-of-function variants. Seventy mutations were near predicted functional sites. To explore potential impacts on protein interactions, AlphaFold 3 was used to model 24 protein complexes, and mCSM-PPI2 was used to estimate changes in binding affinity following mutations. Several interesting cases demonstrated increased binding to interacting partners, which will be discussed in the talk. This is the first study using AlphaFold models to investigate stress-resistance mutations in plants, providing insights into their functional impact and supporting future breeding strategies vital for food security amid climate change.
2025-07-21 17:30:00 17:40:00 03B 3DSIG Chromatin as a Coevolutionary Graph: Modeling the Interplay of Replication with Chromatin Dynamics Sevastianos Korsak Sevastianos Korsak, Krzysztof H Banecki, Karolina Buka, Piotr Górski, Dariusz Plewczynski Modeling DNA replication poses significant challenges due to the intricate interplay of biophysical processes and the need for precise parameter optimization. In this study, we explore the interactions among three key biophysical factors that influence chromatin folding: replication, loop extrusion, and compartmentalization. Replication forks, which act as moving barriers to loop extrusion factors, contribute to the dynamic reorganization of chromatin during S phase. Notably, replication timing is known to correlate with the phase separation of chromatin into A and B compartments. Our approach integrates three components: (1) a numerical model that uses single-cell replication timing data to simulate fork propagation; (2) a stochastic Monte Carlo simulation capturing loop extrusion dynamics, CTCF and fork barriers, and epigenetic state spreading via a Potts Hamiltonian; and (3) a 3D OpenMM simulation that reconstructs chromatin structure based on the resulting state trajectories. In this work, we model the dynamic evolution of chromatin states using co-evolutionary graphs, in which both node and link states evolve stochastically and interactively. These graphs are translated into 3D chromatin structures: links correspond to harmonic bonds representing physical loops, while node states determine compartmental interactions modeled via block-copolymer attractive forces. We reconstruct 3D chromatin trajectories across the cell cycle by incorporating biologically grounded force-field parameters that vary between cell cycle phases to reflect experimentally observed changes in chromatin organization. Our framework, to our knowledge the first to dynamically integrate these three biophysical factors, provides new insights into chromatin behavior during replication and reveals how replication stress impacts chromatin organization.
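To make the Potts component concrete, a toy Metropolis sweep of a Potts-like Hamiltonian on a polymer graph could look like the sketch below; the parameters and topology are assumptions for illustration, and the study's model additionally couples loop extrusion and replication forks.

import numpy as np

rng = np.random.default_rng(0)
n, J, beta = 100, 1.0, 1.5
edges = [(i, i + 1) for i in range(n - 1)]  # simple chain as polymer backbone
state = rng.integers(0, 2, n)               # 0 = A compartment, 1 = B

neighbours = {i: [] for i in range(n)}
for a, b in edges:
    neighbours[a].append(b)
    neighbours[b].append(a)

for _ in range(10 * n):
    i = int(rng.integers(n))
    new = 1 - state[i]
    # Energy change: J times the change in the number of discordant edges.
    dE = J * sum((new != state[j]) - (state[i] != state[j]) for j in neighbours[i])
    if dE <= 0 or rng.random() < np.exp(-beta * dE):
        state[i] = new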
2025-07-21 17:40:00 18:00:00 03B 3DSIG RNA-TorsionBERT: leveraging language models for RNA 3D torsion angles prediction Clément Bernard Clément Bernard, Guillaume Postic, Sahar Ghannay, Fariz Tahi Predicting the 3D structure of RNA is an ongoing challenge that has yet to be completely addressed despite continuous advancements. RNA 3D structures depend not only on distances between residues and base interactions but also on backbone torsional angles. Knowing the torsional angles for each residue could help reconstruct its global folding, which is what we tackle in this work. We present a novel approach for directly predicting RNA torsional angles from raw sequence data. Our method draws inspiration from the successful application of language models in various domains and adapts them to RNA. We have developed a language-based model, RNA-TorsionBERT, incorporating better sequential interactions for predicting RNA torsional and pseudo-torsional angles from the sequence only. Through extensive benchmarking, we demonstrate that our method improves the prediction of torsional angles compared to state-of-the-art methods. In addition, by using our predictive model, we have inferred a torsion angle-dependent scoring function, called TB-MCQ, that replaces the true reference angles with our model's predictions. We show that it accurately evaluates the quality of near-native predicted structures, in terms of RNA backbone torsion angle values. Our work demonstrates promising results, suggesting the potential utility of language models in advancing RNA 3D structure prediction. The source code is freely available on the EvryRNA platform: https://evryrna.ibisc.univ-evry.fr/evryrna/RNA-TorsionBERT.
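Torsion-angle quality scores of the MCQ family reduce to averaging circular differences between predicted and reference angles. A minimal stand-in showing the circular arithmetic only; the exact TB-MCQ definition is in the paper.

import numpy as np

def circular_abs_diff(pred_deg, ref_deg):
    # Smallest absolute difference between two angles, in degrees (0..180).
    d = np.abs(np.asarray(pred_deg) - np.asarray(ref_deg)) % 360.0
    return np.minimum(d, 360.0 - d)

def mcq_like_score(pred_deg, ref_deg):
    # Mean absolute circular difference over residues and angle types.
    return float(np.mean(circular_abs_diff(pred_deg, ref_deg)))

print(mcq_like_score([170.0, -60.0], [-175.0, -55.0]))  # 10.0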
2025-07-22 11:20:00 12:00:00 03B 3DSIG A (bio)computational perspective on protein folding, function and evolution Diego Ulises Ferreiro Diego Ulises Ferreiro Natural protein molecules are amazing objects that somehow compute their structure, dynamics, and activities given a sequence of amino acids and an environment. In turn, protein evolution solves the problem of finding sequences that satisfy the constraints given by biological functions, closing an informational loop that relates an equilibrium thermodynamic system (protein folding) with a non-equilibrium information-gathering and -using system (protein evolution). I will present and discuss results from an information-theory perspective of protein folding, function, and evolution. I will also present extensions of the theory to other terrestrial biopolymers and potential extraterrestrial ones.
2025-07-22 12:00:00 12:20:00 03B 3DSIG Structural Phylogenetics: toward an evolutionary model capturing both sequence and structure David Moi David Moi, Christophe Dessimoz Inferring deep phylogenetic relationships between proteins requires methods that can capture the iterative optimisation of the final folded protein object through evolution. This entails considering both sequence and structure information. While sequence-based phylogenetics has long been the standard, recent progress in structure prediction and modeling has opened new opportunities to harness 3D structural information in tree reconstruction. We recently introduced FoldTree, a practical framework for structure-based phylogenetics. Central to FoldTree is a robust benchmarking strategy that enables fair comparison between sequence and structure-based methods — a critical step given their fundamentally different inputs. Using local structural alphabets derived from protein geometry, FoldTree not only outperforms conventional sequence-based approaches for remote homologs but, surprisingly, also improves phylogenetic resolution among relatively close relatives. Its success has enabled novel evolutionary insights, such as clarifying the diversification of RRNPPA quorum-sensing receptors across bacteria, plasmids, and phages. Building on this foundation, we now introduce a new generation of structural alphabets developed using graph neural networks (GNNs). In this approach, protein structures are represented as graphs where residues are nodes labeled with physicochemical and geometric features, and edges encode diverse relationships such as spatial proximity, hydrogen bonding, or allosteric coupling. These alphabets capture both local residue environments and the broader network of structural constraints, bridging the gap between sequence and structure information and enabling integrative phylogenetic inference from sequence and structure. Together, these developments chart a path towards integrative sequence and structural phylogenetics, expanding the reach of evolutionary inference beyond the twilight zone of sequence similarity.
2025-07-22 12:20:00 12:40:00 03B 3DSIG The structural and functional plasticity of the GNAT fold: A case of convergent evolution Joel Roca Martinez Joel Roca Martinez, Hazel Leiva, Jialin Lin, Misty L Kuhn, Christine Orengo Spermine/spermidine acetyltransferases (SSATs) are members of the highly diverse Gcn5-related N-acetyltransferase (GNAT) superfamily, which ranks among the top 1% most structurally and sequence-diverse families in the CATH database. Prior studies have shown that while bacterial and eukaryotic SSAT enzymes catalyze the same reaction, they differ in residue conservation patterns, oligomeric states, and presence of allosteric sites. This raises the question of whether their functional similarity reflects convergent or divergent evolution. To investigate this, we utilized complementary in silico and in vitro experimental approaches. In silico experiments included analyzing ~37,000 GNAT sequences using AlphaFold2 modelling and additional sequence- and structure-based tools including FunFamer and FunTuner (in-house tools), Zebra3D, and IQ-TREE. A total of 71 SSAT enzymes were selected for in vitro experimental validation whereby substrate screening and enzyme kinetic assays showed distinct substrate preferences linked to specificity-determining residues and structural features. Our results support a model of convergent evolution between bacterial SpeG and human SSAT1 enzymes, with additional subfamilies showing divergent evolutionary paths. This work highlights the evolutionary plasticity of the GNAT fold and demonstrates how integrating computational and experimental strategies can uncover functional insights in large, diverse enzyme families.
2025-07-22 12:40:00 13:00:00 03B 3DSIG Virus targeting as a dominant driver of interfacial evolution in the structurally resolved human-virus protein-protein interaction network Wan-Chun Su Wan-Chun Su, Yu Xia The competitive nature of host-virus protein-protein interactions drives an ongoing evolutionary arms race between hosts and viruses. The surface regions on a host protein that interact with virus proteins (exogenous interfaces) frequently overlap with those that interact with other host proteins (endogenous interfaces), forming interfaces that are shared between virus and host protein partners (mimic-targeted interfaces). This phenomenon, referred to as interface mimicry, is a common strategy used by viruses to invade and exploit the cellular pathways of host organisms. Yet, the quantitative evolutionary consequences of interface mimicry on the host are not well-understood. Here, we integrate experimentally determined 3D structures and homology-based molecular templates of protein complexes with protein-protein interaction networks to construct a high-resolution human-virus structural interaction network. We perform rigorous site-specific evolutionary analyses on this structural interaction network and find that exogenous-specific interfaces evolve significantly faster than endogenous-specific interfaces. Surprisingly, mimic-targeted interfaces are as fast evolving as exogenous-specific interfaces, despite being targeted by both human and virus proteins. Moreover, we find that rapidly evolving mimic-targeted interfaces bound by human viruses are only visible in the mammalian lineage. Our findings suggest that virus targeting exerts an overwhelming influence on host interfacial evolution, within the context of domain-domain interactions, and that mimic-targeted interfaces on human proteins are the key battleground for a mammalian-specific host-virus evolutionary arms race. Overall, our study provides insights into the selective pressures that viruses impose on their hosts at the protein residue level, enabling a quantitative and systematic understanding of host-pathogen interaction and evolution.
2025-07-22 14:00:00 14:20:00 03B 3DSIG Novel structural arrangements from a billion-scale protein universe Nicola Bordin Jingi Yeo, Yewon Han, Nicola Bordin, Andy M. Lau, Shaun Kandathil, Hyunbin Kim, Milot Mirdita, David Jones, Christine Orengo, Martin Steinegger Recent advances in protein structure prediction by AlphaFold2 and ESMFold have massively expanded the known protein structural landscape. The AlphaFold Protein Structure Database (AFDB) now contains over 200 million models, while ESMAtlas hosts more than 600 million predicted structures from metagenomic data in MGnify. These resources span diverse taxonomic groups, including many unculturable species, and reveal previously unknown evolutionary relationships and structural arrangements. To harness this data, new computational strategies for classification and comparison are essential. We clustered the ESMAtlas using Foldseek Cluster, identifying 72 million structure clusters and mapping their distribution across taxa. This uncovered novel evolutionary patterns, such as structural analogs in extreme environments and new domain combinations absent from PDB-based databases like CATH. In parallel, the Encyclopedia of Domains (TED) systematically classifies protein domains across the AFDB and reveals over 365 million domains—far surpassing traditional sequence-based methods. More than 100 million of these domains were previously undetected, underscoring the power of structure-based approaches in expanding known domain space. Together, ESMAtlas and TED help chart uncharted structural territory. ESMAtlas proteins enrich the known protein universe with unique domain architectures, while TED reveals thousands of putative new folds. These breakthroughs demonstrate how multidomain proteins evolve through novel fold combinations and packing geometries. Both efforts also reveal patterns of domain exclusivity, lineage-specific architectures, and structural convergence across the Tree of Life, suggesting environmental adaptations and ancient, conserved folds crucial for cellular function. By uncovering new domain arrangements and interactions, we approach a comprehensive map of the protein universe.
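The clustering step uses Foldseek's released clustering mode; a minimal invocation (wrapped in Python, with an assumed input directory of structure files) is:

import subprocess

subprocess.run(
    ["foldseek", "easy-cluster", "structures/", "clusterRes", "tmp"],
    check=True,
)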
2025-07-22 14:20:00 14:40:00 03B 3DSIG Towards a comprehensive view of the pocketome universe – biological implications and algorithmic challenges Hanne Zillmer Hanne Zillmer, Dirk Walther With the availability of reliably predicted 3D-structures for essentially all known proteins, characterizing the entirety of protein - small-molecule interaction sites (binding pockets) has become a possibility. The aim of this study was to identify and analyze all compound-binding sites, i.e. the pocketomes, of eleven species from different kingdoms of life to discern evolutionary trends as well as to arrive at a global cross-species view of the pocketome universe. All protein structures available in the AlphaFold database for each species were subjected to computational binding site predictions. The resulting set of potential binding sites was inspected for overlaps with known pockets and annotated with regard to the protein domains. 2D-projections of all pockets, embedded in a 128-dimensional feature space and characterized with regard to selected physicochemical properties, yielded informative global pocketome maps that reveal differentiating features between pockets. By clustering all pockets within species, our study revealed a sub-linear scaling law of the number of unique binding sites relative to the number of unique protein structures per species. Thus, larger proteomes harbor more distinct binding sites than smaller ones, but less than proportionally so. We discuss the significance of this finding as well as identify critical and unmet algorithmic challenges.
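The sub-linear scaling claim corresponds to a power-law fit with exponent below 1. A sketch with made-up counts (the real values come from the eleven analysed proteomes):

import numpy as np

structures = np.array([5_000, 12_000, 20_000, 48_000, 100_000])  # hypothetical
pockets = np.array([5_200, 10_500, 15_800, 30_800, 52_000])      # hypothetical

# Fit pockets ~ a * structures**b on log-log axes; b < 1 means larger
# proteomes gain binding sites less than proportionally.
b, log_a = np.polyfit(np.log(structures), np.log(pockets), 1)
print(f"scaling exponent b = {b:.2f}")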
2025-07-22 14:40:00 14:50:00 03B 3DSIG Towards a Biophysical Description of the Protein Universe Miguel Fernandez-Martin Miguel Fernandez-Martin, Nicola Bordin, Christine Orengo, Alfonso Valencia, Gonzalo Parra Understanding how protein families evolve and function remains a central question in molecular biophysics. By grouping evolutionarily related proteins into Functional Families (FunFams), CATH captures structural and functional conservation beyond sequence identity. By integrating AlphaFold2 models, CATH offers a representative view of the protein universe. Our group has developed a methodology to quantify local frustration conservation patterns in protein families, providing a biophysical interpretation of evolutionary constraints related to foldability, stability and function. In this study, we scaled frustration conservation analysis to a representative portion of the protein universe. We have analyzed over 8,900 FunFams (2.2M sequences) from CATH and TED, and explored the distributions of frustration and amino acid identities across the 20 Foldseek 3Di tertiary neighborhoods. We investigated how these geometries influence conservation patterns and found that some amino acid identities (e.g. C, V, L, F, I, M) are conserved in a minimally frustrated state, indicating their evolutionary importance as structural anchors. Other residues (e.g. T, S, H, G) tend to be conserved in a neutral state, historically overlooked, suggesting that neutral frustration is not just an energetic buffering state but an evolutionarily constrained one. Additionally, some residues (e.g. D, K, E, N, Q) exhibit high proportions of conserved high frustration, potentially relevant for function. We present the first large-scale frustration survey of the protein universe, which allows us to distinguish whether sequence conservation reflects stability, neutrality or function. This framework offers a new way of interpreting conservation and lays the foundation for a biophysically informed understanding of protein evolution.
2025-07-22 14:50:00 15:00:00 03B 3DSIG Computational methods for the characterisation and evaluation of protein-ligand binding sites Javier Sánchez Utgés Javier Sánchez Utgés, Stuart MacGowan, Geoff Barton Fragment screening is used for hit identification in drug discovery, but it is often unclear which binding sites are functionally relevant. Here, data from 37 experiments are analysed. A method to group ligands by protein interactions is introduced and sites are clustered by their solvent accessibility. This identified 293 ligand sites, grouped into four clusters. C1 includes buried, conserved, missense-depleted sites and is enriched in known functional sites. C4 comprises accessible, divergent, missense-enriched sites and is depleted in function. This approach is extended to the entire PDB, resulting in the LIGYSIS dataset, accessible through a new web server. LIGYSIS-web hosts a database of 65,000 protein-ligand binding sites across 25,000 proteins. LIGYSIS sites are defined by aggregating unique relevant protein-ligand interfaces across multiple structures. Additionally, users can upload structures for analysis, results visualisation and download. Results are displayed in LIGYSIS-web, a Python Flask web application. Finally, the human component of LIGYSIS, comprising 6,800 binding sites across 2,775 proteins, is employed to perform the largest benchmark of ligand site prediction to date. Thirteen canonical methods and fifteen novel variants are evaluated using fourteen metrics. Additionally, LIGYSIS is compared to datasets like PDBbind or MOAD and shown to be superior, since it considers non-redundant interfaces across biological assemblies. Re-scored fpocket predictions present the highest recall (60%). The detrimental effect of redundant predictions on performance, and the beneficial impact of stronger pocket-scoring schemes, are demonstrated. To conclude, top-N+2 recall is proposed as a robust benchmark metric and authors are encouraged to share their benchmark code.
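The proposed top-N+2 recall is simple to compute: for a protein with N known sites, only the N+2 best-ranked predictions count. A sketch, with the matching criterion left as an explicit assumption:

def top_n_plus_2_recall(predicted_sites, true_sites, is_hit):
    # predicted_sites must be sorted by decreasing score; is_hit(pred, true)
    # encodes the match criterion (e.g. residue overlap), an assumption here.
    n = len(true_sites)
    top = predicted_sites[: n + 2]
    recovered = sum(any(is_hit(p, t) for p in top) for t in true_sites)
    return recovered / n if n else 0.0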
2025-07-22 15:00:00 15:20:00 03B 3DSIG ScGOclust: leveraging gene ontology to find functionally analogous cell types between distant species Yuyao Song Yuyao Song, Yanhui Hu, Julian Dow, Norbert Perrimon, Irene Papatheodorou Basic biological processes are shared across animal species, yet their cellular mechanisms are profoundly diverse. Comparing cell-type gene expression between species reveals conserved and divergent cellular functions. However, as phylogenetic distance increases, gene-based comparisons become less informative. The Gene Ontology (GO) knowledgebase offers a solution by serving as the most comprehensive resource of gene functions across a vast diversity of species, providing a bridge for distant species comparisons. Here, we present scGOclust, a computational tool that constructs de novo cellular functional profiles using GO terms, facilitating systematic and robust comparisons within and across species. We applied scGOclust to analyse and compare the heart, gut and kidney between mouse and fly, and whole-body data from C. elegans and H. vulgaris. We show that scGOclust effectively recapitulates the functional spectrum of different cell types, characterises functional similarities between homologous cell types, and reveals functional convergence between unrelated cell types. Additionally, we identified subpopulations within the fly crop that show circadian rhythm-regulated secretory properties and hypothesize an analogy between fly principal cells from different segments and distinct mouse kidney tubules. We envision scGOclust as an effective tool for uncovering functionally analogous cell types or organs across distant species, offering fresh perspectives on evolutionary and functional biology.
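The core idea of GO-level functional profiles can be sketched with pandas: sum each cell's expression over the genes annotated to each GO term, then compare cells (or cell types) in GO space. Toy data below; scGOclust's own normalisation and annotation handling will differ.

import pandas as pd

expr = pd.DataFrame({"geneA": [5, 0], "geneB": [2, 3], "geneC": [0, 4]},
                    index=["cell1", "cell2"])
gene2go = {"geneA": ["GO:0006811"], "geneB": ["GO:0006811", "GO:0007155"],
           "geneC": ["GO:0007155"]}

terms = sorted({t for ts in gene2go.values() for t in ts})
go_profile = pd.DataFrame({
    term: expr[[g for g in expr.columns if term in gene2go.get(g, [])]].sum(axis=1)
    for term in terms})
print(go_profile)  # cells x GO terms; comparable across species via shared terms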
2025-07-22 15:20:00 15:40:00 03B 3DSIG Mapping and characterization of the human missense variation universe using AlphaFold 3D models Alessia David Gordon Hanna, Elbert Timothy, Suhail A Islam, Michael Sternberg, Alessia David The deep learning algorithm AlphaFold has revolutionized the field of structural biology by producing highly accurate three-dimensional models of the proteome, thus providing a unique opportunity for atom-based analysis of human missense variants. Current variant prediction tools, such as REVEL, EVE and AlphaMissense, have significantly improved the prediction of damaging amino acid substitutions, but do not explain the mechanism by which these variants impact the phenotype, which, in most cases, remains elusive. We developed a pipeline to automatically identify accurately modelled amino acid regions that can be used for variant characterization. The recommended AlphaFold pLDDT threshold for an accurately modelled residue is ≥70. When using this threshold for the query residue, the accuracy of the atom-based predictions calculated using our in-house variant prediction algorithm Missense3D is 0.66, MCC 0.36, TPR/FPR 5.1. We show that, when the model accuracy of the environment surrounding the query residue (E-pLDDT5A) is considered, an E-pLDDT5A ≥60 provides similar accuracy, MCC and TPR/FPR to those obtained using the pLDDT threshold ≥70 for the query residue alone, but increases the number of residues for which an atom-based analysis can be performed. When using this new E-pLDDT5A ≥60 threshold, >68% of the human proteome and >4 million missense variants can be modelled with sufficient quality to allow an atom-based analysis. In conclusion, AlphaFold 3D models offer a unique opportunity to understand the consequences of amino acid substitutions on protein structure, thus complementing existing evolutionary-based methods.
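Because AlphaFold models store pLDDT in the B-factor column, an environment score of this kind can be approximated directly from a model file. A Biopython sketch averaging pLDDT over residues within 5 Å of the query residue; the authors' exact E-pLDDT5A definition may differ.

from Bio.PDB import PDBParser, NeighborSearch

def environment_plddt(model_path, chain_id, resnum, radius=5.0):
    structure = PDBParser(QUIET=True).get_structure("m", model_path)
    ns = NeighborSearch(list(structure.get_atoms()))
    query = structure[0][chain_id][resnum]
    # Residues with any atom within `radius` of any query-residue atom.
    near = {a.get_parent() for qa in query for a in ns.search(qa.coord, radius)}
    plddts = [r["CA"].get_bfactor() for r in near if "CA" in r]
    return sum(plddts) / len(plddts)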
2025-07-22 15:40:00 16:00:00 03B 3DSIG CATH-ddG: towards robust mutation effect prediction on protein–protein interactions out of CATH homologous superfamily Guanglei Yu Guanglei Yu, Xuehua Bi, Teng Ma, Yaohang Li, Jianxin Wang Motivation: Protein-protein interactions (PPIs) are fundamental to understanding biological processes. Accurately predicting the effects of mutations on PPIs remains a critical requirement for drug design and disease mechanistic studies. Recently, deep learning models using protein 3D structures have become predominant for predicting mutation effects. However, significant challenges remain in practical applications, in part due to the considerable disparity in generalization capabilities between easy and hard mutations. Specifically, a hard mutation is defined as one with its maximum TM-score < 0.6 when compared to the training set. Additionally, compared to physics-based approaches, deep learning models may overestimate performance due to potential data leakage. Results: We propose new training/test splits that mitigate data leakage according to the CATH homologous superfamily. Under the constraints of physical energy, protein 3D structures and CATH domain objectives, we employ a hybrid noise strategy as data augmentation and present a geometric encoder scenario, named CATH-ddG, to represent the mutational microenvironment differences between wild-type and mutated protein complexes. Additionally, we fine-tune ESM2 representations by incorporating a lightweight nonlinear module to achieve transferability of sequence co-evolutionary information. Finally, our study demonstrates that the CATH-ddG framework provides enhanced generalization by outperforming other baselines on non-superfamily leakage splits, which is crucial for robust regression-based prediction of mutation effects. Independent case studies demonstrate successful enhancement of binding affinity on 419 antibody variants to human epidermal growth factor receptor 2 (HER2) and 285 variants in the receptor-binding domain (RBD) of SARS-CoV-2 to the angiotensin-converting enzyme 2 (ACE2) receptor.
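Leakage-aware splits of this kind can be built with group-aware splitters that hold out whole CATH superfamilies at once. A sketch with toy records and hypothetical superfamily IDs; the paper's split may add further filters such as TM-score thresholds.

from sklearn.model_selection import GroupShuffleSplit

records = ["mut1", "mut2", "mut3", "mut4", "mut5", "mut6"]  # toy data
superfamilies = ["3.40.50.300", "3.40.50.300", "1.10.10.10",
                 "1.10.10.10", "2.60.40.10", "2.60.40.10"]   # hypothetical IDs

splitter = GroupShuffleSplit(n_splits=1, test_size=0.34, random_state=0)
train_idx, test_idx = next(splitter.split(records, groups=superfamilies))
# No superfamily appears in both train_idx and test_idx.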
2025-07-22 16:40:00 17:00:00 03B 3DSIG Investigating Enzyme Function by Geometric Matching of Catalytic Motifs Raymund Hackett Raymund Hackett, Martin Larralde, Ioannis Riziotis, Janet Thornton, Georg Zeller Detecting catalytic features in protein structures can provide important hints about enzyme function and mechanism. Keeping pace with the rapidly growing universe of predicted protein structures requires computationally fast and scalable but interpretable tools. A library of 3D coordinates describing enzyme catalytic sites, referred to as templates, has been collected from manually curated and literature annotated examples of enzyme catalytic mechanisms described in the Mechanism and Catalytic Site Atlas. We provide this library of templates and a fast and modular python tool implementing the geometric matching algorithm Jess to identify matching catalytic sites in both experimental and predicted protein structures. We implement stringent match filtering to reduce the number of false matches occurring by chance. We validated this method against a non-redundant set of high quality experimental and predicted enzyme structures with well annotated catalytic sites. Geometric, knowledge based criteria are used to differentiate catalytically informative matches from spurious ones. We show that structurally matching catalytic templates is more sensitive than sequence based and even some structure based approaches in identifying homology between extremely distant enzymes. Since geometric matching does not depend on conserved sequence motifs or even common evolutionary history, we are able to identify examples of structural active site similarity in divergent and possibly convergent enzymes. Such examples make interesting case studies into the ancestral evolution of enzyme function. While insufficient for detecting and characterising substrate specific binding sites, this methodology could be suitable for expanding the annotation of enzyme active sites across proteomes.
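At its core, geometric template matching superimposes template atoms onto candidate site atoms and accepts matches under an RMSD cutoff. The sketch below shows only the Kabsch superposition step, as a generic illustration in the spirit of Jess rather than the Jess code itself.

import numpy as np

def kabsch_rmsd(template_xyz, site_xyz):
    # Optimal-rotation RMSD between two matched (n, 3) coordinate sets.
    P = template_xyz - template_xyz.mean(axis=0)
    Q = site_xyz - site_xyz.mean(axis=0)
    V, _, Wt = np.linalg.svd(P.T @ Q)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(V @ Wt))])  # avoid reflection
    R = V @ D @ Wt
    diff = P @ R - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))

# A candidate site matches a template if kabsch_rmsd(...) falls below a
# template-specific cutoff (plus residue-identity constraints).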
2025-07-22 17:00:00 17:20:00 03B 3DSIG Cellular location shapes quaternary structure of enzymes Gyorgy Abrusan Gyorgy Abrusan, Aleksej Zelezniak The main forces driving protein complex evolution are currently not well understood, especially in homomers, where quaternary structure might frequently evolve neutrally. Here we examine the factors determining oligomerisation by analysing the evolution of enzymes in circumstances where homomers rarely evolve. We show that 1) in extracellular environments, most enzymes with known structure are monomers, while in the cytoplasm most are homomers, indicating that the evolution of oligomers depends on the cellular environment; 2) the evolution of quaternary structure within protein orthogroups is more consistent with the predictions of constructive neutral evolution than an adaptive process: quaternary structure is gained more easily than it is lost, and most extracellular monomers evolved from proteins that were monomers also in their ancestral state, without the loss of interfaces. Our results indicate that oligomerisation is context-dependent, and even when adaptive, in many cases it is probably not driven by the intrinsic properties of enzymes, like their biochemical function, but rather by the properties of the environment where the enzyme is active. These factors might be macromolecular crowding and excluded volume effects facilitating the evolution of interfaces, and the maintenance of cellular homeostasis through shaping cytoplasm fluidity, protein degradation, or diffusion rates.
2025-07-22 17:20:00 17:40:00 03B 3DSIG In silico design of stable single-domain antibodies with high affinity Gabriel Cia Gabriel Cia, Frederic Rousseau, Luis Serrano Pubul, Joost Schymkowitz, Maarten Dewilde, Savvas N. Savvides, Alexander N. Volkov, Carlo Carolis, Nick Geukens, Zhongyao Zhang, Gabriele Orlando, Damiano Cianferoni, Javier Delgado Blanco, Katerina Maragkou, Teresa Garcia, David Vizarraga, Iva Marković, Rob Van der Kant Monoclonal antibodies are rapidly becoming a standard drug format in the pharmaceutical industry, but current immunization-based methods for antibody discovery often present limitations in terms of developability, binding affinity, cross-reactivity and, importantly, selectively targeting a prespecified epitope. Given these limitations, rational optimization and de novo design of antibodies with computational methods is becoming an attractive alternative to traditional antibody development methods. While recent deep learning methods have shown tremendous progress for protein design, their application to therapeutic antibody formats remains one of the major open challenges in the field. Here, we present EvolveX, a structure-based computational pipeline for antibody optimization and de novo design. EvolveX is a multi-objective optimization algorithm incorporating CDR modeling, biophysical parameters from the FoldX empirical force field and developability features into a single unified antibody design pipeline. We experimentally validated the ability of EvolveX to optimize the affinity and stability of a nanobody targeting mouse Vsig4 and, more challengingly, its ability to redesign the nanobody to bind the human Vsig4 ortholog with very high affinity compared to the wildtype nanobody, resulting in a 1000-fold improved Kd. Structural analyses by X-ray crystallography and NMR confirmed the accuracy of the predicted designs, which display optimized interactions with the antigen. Collectively, our study highlights EvolveX’s potential to overcome current limitations in antibody design, offering a powerful tool for the development of next-generation therapeutics with enhanced specificity, stability, and efficacy.
2025-07-22 17:40:00 18:00:00 03B 3DSIG An improved deep learning model for immunogenic B epitope prediction Rakshanda Sajeed Rakshanda Sajeed, Swatantra Pradhan, Rajgopal Srinivasan, Sadhna Rana The recognition of B epitopes by B cells of the immune system initiates an immune response that leads to the production of antibodies to combat bacterial and viral infections. Computational methods for predicting epitopes on antigens have shown promise for the development of subunit vaccines and therapeutics. Recently, the use of protein language models (pLMs) for epitope prediction has led to a substantial increase in prediction accuracy. However, precision must improve considerably for these methods to be of practical use. Here, we develop and evaluate a series of models using different combinations of features and feature fusion techniques on a curated independent test set. Our results show that models using protein embeddings together with structural features predict B epitopes better than a baseline model that uses only protein embeddings as features. We also show, from attention analysis of B and T epitopes, that the evolutionary scale model ESM-2 implicitly captures T-B reciprocity, as a large fraction of high-scoring B epitopes are highly attended by T epitopes.
2025-07-23 11:20:00 12:00:00 04AB BioInfo-Core Bioimage analysis in the age of AI: lessons and a path forward from a core facility perspective. Damian Dalle Nogare Damian Dalle Nogare In recent years, much of the toolchain used in advanced bioimage analysis has become dominated by approaches relying on deep learning/artificial intelligence. These approaches have enabled significant progress in the types of analyses that are possible, for example, large-scale connectomics, accurate segmentation of massive volumes, and image restoration with high fidelity, as well as increasing the ease with which many day-to-day analysis tasks can be accomplished. However, these approaches come with considerable risk as models become more opaque to the end-user and analysis pipelines involve numerous “black box” steps. In the context of a data analysis core facility, where trust and reproducibility are central to our mission, how can we think about deploying such tools and, more broadly, engaging with the rapidly changing landscape of data analysis tools? We propose that the challenges of 21st century data analysis can be ameliorated, if not entirely solved, by a combination of open science, community engagement, and deep collaboration between users, data analysts and research software engineering teams.
2025-07-23 12:00:00 13:00:00 04AB BioInfo-Core The rise of computational imaging Thanks in part to the popularity of spatial transcriptomics, many of us are now being faced with challenges that can be helped or solved with imaging data, or the need to combine images with other forms of data. How can we leverage the robust tools of computational imaging to help our collaborators and solve problems in our projects?
2025-07-23 14:00:00 15:00:00 04AB BioInfo-Core The practical use of AI in cores It’s here and it’s being used. How do we use it to good effect, and how do we teach our collaborators to use it? This topic could include not only generative AI for code, but extend to the use of foundation models in single cell analysis or related topics.
2025-07-23 15:00:00 15:20:00 04AB BioInfo-Core Benchmarking Variant-Calling Workflows: The nf-core/variantbenchmarking Pipeline within the GHGA Framework Kübra Narcı The nf-core/variantbenchmarking pipeline (https://github.com/nf-core/variantbenchmarking) is a versatile and comprehensive workflow designed to benchmark variant-calling tools across various use cases. Developed as part of the German Human Genome-Phenome Archive (GHGA) project, this pipeline supports the evaluation of small variants, insertions and deletions (indels), and structural variants for both germline and somatic samples. Users can leverage publicly available truth datasets, such as Genome in a Bottle or SEQC2, for benchmarking, or provide custom VCF files with or without specific regions of interest. The pipeline supports diverse normalization methods, including variant splitting, deduplication, left or right alignment, and filtration, and offers a choice of benchmarking tools such as hap.py, RTG Tools, Truvari, SVAnalyzer, or Witty.er. This flexibility enables tailored analyses to meet specific research needs. The workflow generates detailed performance metrics, such as precision, recall, and F1 scores, allowing researchers to accurately assess the strengths and limitations of their variant-calling workflows. GHGA’s architecture is built on cloud computing infrastructures and includes an ethico-legal framework to ensure data protection compliance. GHGA enables researchers to conduct reproducible, rigorous, and secure research by standardizing bioinformatics workflows and governing reusability through harmonized metadata schemas. Built using Nextflow, the nf-core/variantbenchmarking pipeline is scalable, reproducible, and compatible with diverse computational environments, including local systems, high-performance clusters, and cloud platforms. This ensures seamless integration with secure platforms like GHGA for smooth benchmarking analyses. Additionally, the pipeline is fully open source and adheres to nf-core community guidelines, ensuring high-quality, reviewed code, modularity, and extensibility.
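The precision, recall, and F1 metrics reported by the pipeline follow the standard definitions over true-positive, false-positive, and false-negative variant counts of the kind emitted by comparison engines such as hap.py or Truvari. A minimal sketch of the arithmetic, with made-up counts:

```python
# Hedged sketch: how the reported metrics relate to benchmarking counts.
def benchmark_metrics(tp: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(benchmark_metrics(tp=9500, fp=300, fn=500))  # illustrative counts only
```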
2025-07-23 15:00:00 15:20:00 04AB BioInfo-Core Assembly Curator: rapid and interactive consensus assembly generation for bacterial genomes Thomas Roder Thomas Roder, Rémy Bruggmann Introduction Long-read sequencing technologies enable the generation of near-complete bacterial genome assemblies. However, no de novo assembler is perfect – issues like duplicated or missing plasmids, spurious contigs, and failures to circularize sequences remain common problems. Achieving optimal results still requires manual consensus generation. While tools like Trycycler simplify this process, they are labor-intensive and require command-line expertise. With the ability to sequence hundreds of datasets quickly and affordably, there is a growing need for faster, more accessible solutions. Methods Here, we present Assembly Curator, a platform that (i) imports multiple assemblies, (ii) clusters contigs, and (iii) facilitates interactive comparison and selection through a user-friendly graphical interface. The software has a plug-in system to enable the import of data produced by different assemblers. Assembly Curator enables on-the-fly calculation of dotplots and can submit contig subsequences directly to NCBI’s BLAST servers for approximate taxonomic identification, aiding in contamination assessment. It generates standardized and informative headers in FASTA files which are directly compatible with the NCBI annotation pipeline PGAP. Results and Discussion Assembly Curator enables the semi-automatic processing of hundreds of genomes in just a few hours, significantly reducing manual effort while maintaining high assembly completeness and accuracy. Moreover, the browser-based UI enables biologists without programming skills but with domain-specific knowledge to perform or participate in the curation process. This can potentially lead to superior results.
2025-07-23 15:20:00 15:40:00 04AB BioInfo-Core Long Read Sequencing at Genomics England Adam Giess Adam Giess At Genomics England, in the Scientific R&D Team, we are evaluating the potential role of ‘long read’ technologies in clinical whole genome sequencing. Long read technologies such as those developed by Oxford Nanopore offer the promise of comprehensive whole genome sequencing, providing nucleotide variants and epigenetic modifications, alongside the potential to resolve large-scale variation and to uncover previously inaccessible parts of the genome. The possibility of such a comprehensive view of the genome is particularly appealing in clinical settings, and with developing platforms like the Oxford Nanopore PromethION sequencer, long read sequencing at scale has become a realistic prospect. Despite this, however, there is still a lack of large, publicly available clinical long read datasets, which presents a problem both for assessing the technologies themselves and for developing tools to get the most from them. Here we present our experiences with Oxford Nanopore PromethION sequencing at Genomics England, moving from pilot studies to projects involving thousands of participants across rare disease, cancer and diverse ancestries. We present our long read datasets and the steps that we took to generate them, highlighting the challenges unique to this developing technology and the solutions that we have adopted along our journey to long read sequencing at scale.
2025-07-23 15:20:00 15:40:00 04AB BioInfo-Core Autonomous Single Cell Transcriptomics Analysis in Persist-seq Anil S. Thanki Anil S. Thanki, Pablo Moreno, Ultan McDermott The analysis of large-scale biological datasets poses considerable challenges, particularly in managing data complexity, ensuring reproducibility, and reducing manual intervention. Traditional data processing pipelines often suffer from scalability issues, susceptibility to errors, and inconsistent reproducibility across computing environments. The Persist-SEQ consortium, comprising multiple partner institutions, is focused on generating and analyzing single-cell sequencing data to investigate early persister tumor cells in cancer treatment. Like many collaborative efforts, the consortium encounters the limitations of conventional data processing approaches. To overcome these barriers, we have developed a fully automated, scalable, and reproducible infrastructure tailored for high-throughput single-cell transcriptomics analysis. This system leverages Kubernetes, Jenkins, and Galaxy. The platform automates data retrieval from AWS, constructs Galaxy data libraries, and executes predefined single-cell analysis workflows with no manual intervention. Jenkins coordinates the end-to-end workflow—from data ingestion through to results delivery—while Kubernetes ensures a consistent, portable execution environment across various deployments. Galaxy provides an intuitive interface for executing reproducible analytical workflows and gives users access to the data. For enhanced operational transparency, the system integrates with Slack to deliver real-time status updates and error alerts, facilitating prompt monitoring and resolution. Currently deployed on the secure EBI Embassy Cloud, the infrastructure offers robust performance, data security, and efficient resource utilization. Our platform has been successfully implemented within the Persist-SEQ consortium, demonstrating its ability to streamline transcriptomic data analysis, enhance reproducibility, and reduce operational overhead. This approach represents a scalable and reliable solution for the evolving demands of modern biological research.
2025-07-23 15:40:00 16:00:00 04AB BioInfo-Core Advancing The Expression Atlas Resources: A Scalable Single-Cell Transcriptomics Pipeline to Facilitate Scientific Discoveries Iris Diana Yu Iris Diana Yu, Pedro Madrigal, Anil Thanki, Christina Ernst The Expression Atlas (https://www.ebi.ac.uk/gxa) and Single Cell Expression Atlas (https://www.ebi.ac.uk/gxa/sc) are EMBL-EBI knowledgebases for gene and protein expression across tissues and cells. They provide standardised re-analysis of high-quality RNA-seq and single-cell RNA-seq (scRNA-seq) datasets, respectively. Both pipelines use open-source tools for quantification, aggregation, and downstream analysis, with workflows publicly available. Their web interfaces support dataset exploration via gene and metadata queries. Current datasets span animal, plant, fungal, and protist species, and integrate data from major public archives, including the International Cancer Genome Consortium, the COVID-19 Data Portal, and the Human Cell Atlas. Over the past year, the Single Cell Expression Atlas (SCEA) has undergone significant changes. First, it can now consume and partially re-analyse pre-processed data, enabling the inclusion of studies that cannot provide raw data due to various constraints. The SCEA pipeline was also recently restructured into a fully end-to-end Nextflow pipeline, replacing a hybrid of Nextflow and Galaxy. This redesign improves maintainability and enables faster adaptation to advances in single-cell transcriptomics. Additionally, the post-quantification analysis workflow has been enhanced with new features that improve its portability and compatibility with non-Atlas workflows, supporting FAIR data principles. Key improvements include modularised processes, automated testing for continuous integration, full containerisation of the analysis environment, and support for different usage scenarios. Ongoing feature development aims to further modernise the pipeline and broaden its utility within the scientific community.
2025-07-23 15:40:00 16:00:00 04AB BioInfo-Core Mixed effects models applied to single nucleus RNA-seq data identify cell types associated with animal level pathological trait of Alzheimer’s disease Ayushi Agrawal Ayushi Agrawal, Michela Traglia, Nicole Koutsodendris, Yadong Huang, Reuben Thomas Apolipoprotein E4 (APOE4) is the strongest genetic risk factor for Alzheimer’s disease (AD). Although neuronal APOE4 expression is induced under conditions of stress or injury, its role in AD pathogenesis remains unclear. To investigate this, we analyzed single-nucleus RNA-seq data from APOE4 knock-in mice expressing human Tau-P301S mutant, alongside animal-level measurements of tau pathology, neurodegeneration, and myelin deficits. We applied generalized linear mixed effects models to test associations between transcriptionally defined cell-type proportions and neuropathological severity. Our analysis identified disease-associated subpopulations of neurons, oligodendrocytes, astrocytes, and microglia, whose relative abundance correlated with increased tau pathology, neuronal loss, and myelin disruption. These findings suggest that specific cellular populations track with AD-related pathologies in the presence of neuronal APOE4. To evaluate the validity of inferring absolute abundance changes from single-nucleus RNA-seq data, we generated a simulated dataset with known ground truth cell-type compositions and pathology metrics. By benchmarking multiple normalization strategies, we determined the conditions under which statistical inference of compositional changes is reliable. Together, our results underscore the utility of mixed-effects models for integrating single-nucleus transcriptomics with phenotypic data and highlight how normalization choices critically influence biological interpretation of cell-type shifts in complex tissues like the brain.
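As a rough illustration of the modelling setup described above (not the authors' code): a mixed effects model relating per-sample cell-type proportions to an animal-level pathology measure, with a random intercept per animal. The study used generalized linear mixed models; the statsmodels MixedLM below is a simplified linear stand-in, and all column names and data are hypothetical.

```python
# Illustrative sketch: does a cell type's proportion associate with an
# animal-level pathology score, accounting for repeated samples per animal?
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "animal": np.repeat([f"m{i}" for i in range(10)], 4),  # hypothetical mouse IDs
    "proportion": rng.uniform(0.01, 0.3, 40),              # cell-type proportion per sample
    "tau_pathology": rng.normal(0, 1, 40),                 # animal-level pathology measure
})

# Random intercept per animal; fixed effect of pathology on proportion
fit = smf.mixedlm("proportion ~ tau_pathology", df, groups=df["animal"]).fit()
print(fit.summary())
```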
2025-07-23 15:40:00 16:00:00 04AB BioInfo-Core Optimizing Clustering Resolution for Multi-subject Single Cell Studies Natalie Gill Natalie Gill, Min-Gyoung Shin, Ayushi Agrawal, Reuben Thomas Increasingly, single-cell omics analysis is being done on large cohorts of patients and model organisms, and modularity-based graph clustering algorithms are used to identify cell types and states across all subjects. Selecting the clustering resolution parameter is often based on the concentration of expression of cell type marker genes within clusters, increasing the parameter as needed to resolve clusters with mixed cell type gene signatures. This approach, however, is subjective in situations where one does not have complete knowledge of condition- or disease-associated cell types, as in the context of novel biology; it is time-consuming; and it has the potential to bias the final clustering results due to individual transcriptomic heterogeneity and subject-specific differences in cell composition. We introduce clustOpt, a method that improves the reproducibility of modularity-based clustering in multi-subject experiments by using a combination of subject-wise cross-validation, feature splitting, random forests, and measures of cluster quality based on the silhouette metric to guide the selection of the resolution parameter. We describe the results from benchmarking this method on the Asian Immune Diversity Atlas dataset.
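clustOpt combines several components (subject-wise cross-validation, feature splitting, random forests); the sketch below isolates just one of them, a silhouette-guided scan over resolution values, using scanpy's bundled demo data. It assumes scanpy and leidenalg are installed and is purely illustrative.

```python
# Sketch: scan Leiden resolutions and score each clustering by silhouette.
import scanpy as sc
from sklearn.metrics import silhouette_score

adata = sc.datasets.pbmc68k_reduced()       # bundled demo data
sc.pp.neighbors(adata, use_rep="X_pca")     # neighbor graph for Leiden

for res in (0.2, 0.5, 0.8, 1.1):
    key = f"leiden_{res}"
    sc.tl.leiden(adata, resolution=res, key_added=key)
    score = silhouette_score(adata.obsm["X_pca"], adata.obs[key])
    print(f"resolution={res}: {adata.obs[key].nunique()} clusters, "
          f"silhouette={score:.3f}")
```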
2025-07-23 16:40:00 17:00:00 04AB BioInfo-Core GEO Uploader: Simplifying data deposition in the GEO repository Hubert Rehrauer Ronald Domi, Falko Noé, Peter Leary, Hubert Rehrauer Introduction Making data FAIR is a key step in every research project. For NGS data, the GEO and ENA repositories provide long-term storage and open access, and are widely adopted in the research community. However, transferring the data and compiling the meta-information appropriately is still a manual activity that can be cumbersome for massive NGS data. Methods We implemented a Python-based web application that performs the data upload for users and compiles the meta-information in an appropriate way. The GEO Uploader can be run standalone but, in our environment, is tightly integrated with our SUSHI web framework for reproducible, web-based analysis of sequencing data. Results The GEO Uploader is running at our center at https://geo-uploader.fgcz.uzh.ch/ and close to 50 datasets have already been uploaded to GEO. The uploader collects the files, generates MD5 sums, transfers the data, and compiles the Excel table that is needed to provide the meta-information. It fills in protocol information automatically based on the data and lets users enter other information through a convenient web interface. It currently supports bulk RNA-seq as well as single-cell RNA-seq data. Discussion Our GEO Uploader contributes to the community-wide adoption of Open Research Data (ORD) best practices. It invites researchers to make data available early in the research process, and it simplifies and speeds up this step.
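One of the uploader steps mentioned above, generating MD5 sums before transfer, can be sketched in a few lines of Python; the directory name below is hypothetical.

```python
# Minimal sketch of checksum generation prior to upload.
import hashlib
from pathlib import Path

def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

for fastq in Path("raw_data").glob("*.fastq.gz"):  # hypothetical directory
    print(fastq.name, md5sum(fastq))
```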
2025-07-23 16:40:00 17:00:00 04AB BioInfo-Core Enhancing Bioinformatics Workflows with Analytical Visualization Tools Carlos Prieto Carlos Prieto, David Barrios Current front-end development technologies have enabled the creation of new visual analytics tools. These methodologies allow data to be visualized in a web browser in an interactive and dynamic way. The development of new visualization tools is essential for the effective exploration and interpretation of datasets and results produced by bioinformatics analysis techniques. This work presents programming methodologies and analytical visualization solutions that have been applied to the analysis of high-throughput sequencing data. Their development has been carried out using new web visualization technologies and a development architecture called LAMPR (Linux, Apache, MySQL, PHP, R). The following bioinformatics tools will be presented: - RJSplot: A collection of 17 interactive plots implemented in R. - D3GB: An interactive genome browser. - Looking4clusters: A tool for the interactive visualization of single-cell data. - Rvisdiff: Analytical visualization of differential expression results. - RaNA-Seq: A web-based platform for the analysis and visualization of RNA-Seq data. - MutationMinning: A self-analytical interface for the exploration of DNA resequencing results. The use of interactive and dynamic visualization tools enhances the interpretation of complex datasets and enables study designers or wet-lab members to work toward a deeper understanding of their data.
2025-07-23 16:40:00 17:00:00 04AB BioInfo-Core Competency framework profiles to reflect career progression within bioinformatics core facility scientists Patricia Carvajal-López Patricia Carvajal-López, Marta Lloret-Linares, Cath Brooksbank There is an expanding need for specialised services from Bioinformatics Core Facilities (BCF). Providing services for these infrastructures requires highly trained specialists; however, their ill-defined career pathways and their highly specialised skill set often hinder their efforts to progress in their professions. To address this challenge, members of the ISCB’s Bioinfo-Core group, the Curriculum Task Force of the ISCB Education Committee, and other interested individuals joined forces at the 2023 ISMB Bioinfo-Core meeting to create the ‘Bioinformatics Core Facility Scientists Competencies Taskforce’ (https://sites.google.com/ebi.ac.uk/bioinfocore-competencies). The taskforce worked to provide a benchmark reflecting the knowledge, skills and attitudes required by professionals in BCFs, and to provide a potential template for career progression for BCF scientists. This benchmark was developed as an extension of the ISCB Competency Framework (https://competency.ebi.ac.uk), which defines a ‘minimum standard’ for a generic, mid-career BCF scientist (and for several other distinct career profiles). The outcome of this work was the addition of six BCF scientist-focused competencies (project management, people management, collaborator engagement, users and service, training, and leadership) to the thirteen that already exist. We also created four different professional profiles for BCF scientists, outlining a potential transition from entry level to a managerial role. The development of a well-defined, competency-based career pathway, along with training for this community, is essential to support the career progression of BCF specialists who, in turn, enable research and development within the life sciences.
2025-07-23 17:00:00 18:00:00 04AB BioInfo-Core Breakout Groups Our unconferencing event - attendees will break into groups based on topics of interest to discuss further with other core members.
2025-07-24 08:40:00 08:45:00 04AB BioVis Opening Qianwen Wang, Zeynep Gumus
2025-07-24 08:45:00 09:40:00 04AB BioVis TBD Kay Nieselt
2025-07-24 09:40:00 10:00:00 04AB BioVis GENET: AI-Powered Interactive Visualization Workflows to Explore Biomedical Entity Networks Bum Chul Kwon Bum Chul Kwon, Natasha Mulligan, Joao Bettencourt-Silva, Ta-Hsin Li, Bharath Dandala, Feng Lin, Pablo Meyer, Ching-Huei Tsou Formulating experimental hypotheses that test the association between SNPs and diseases involves logical reasoning derived from prior observations, followed by the labor-intensive process of collecting and analyzing relevant literature to test scientific plausibility and viability. AI models trained with previous association data (e.g., the GWAS Catalog) can help infer potential associations between SNPs and diseases, but scientists still need to manually collect and inspect the evidence for such predictions from prior literature. To alleviate this burden, we introduce an AI-enhanced, end-to-end visual analytics workflow called GENET, which aims to help scientists discover SNP-target associations, collect evidence from scientific literature, extract knowledge as biomedical entity networks, and interactively explore them using visualizations. The workflow consists of the following four steps, where each step’s output serves as the input for the next step: 1) biomedical network analysis: identify interesting genes/SNPs that are associated with a target disease through indirectly connected genes/SNPs using a neural network; 2) literature evidence mining pipeline: collect relevant literature on the target diseases or the inferred genes/SNPs, and extract biomedical entities and their relations from the collection using large language models; 3) clustering: cluster the extracted entities and relations by generating embeddings using pre-trained biomedical language models (e.g., BioBERT, BioLinkBERT); 4) interactive visualizations: visualize the clusters of biomedical entities and their networks and provide interactive handles for exploration. The workflow enables users to iteratively formulate and test hypotheses involving SNPs/genes and diseases against evidence from scientific literature and databases, and to gain novel insights.
2025-07-24 11:20:00 11:40:00 04AB BioVis Prostruc: an open-source tool for 3D structure prediction using homology modeling Olaitan I. Awe Shivani Pawar, Wilson Sena Kwaku Banini, Musa Muhammad Shamsuddeen, Toheeb A Jumah, Nigel N O Dolling, Abdulwasiu Tiamiyu, Olaitan I. Awe Homology modeling is a widely used computational technique for predicting the three-dimensional (3D) structures of proteins based on known templates and evolutionary relationships, providing structural insights critical for understanding protein function, interactions, and potential therapeutic targets. However, existing tools often require significant expertise and computational resources, presenting a barrier for many researchers. Prostruc is a Python-based homology modeling tool designed to simplify protein structure prediction through an intuitive, automated pipeline. Integrating Biopython for sequence alignment, BLAST for template identification, and ProMod3 for structure generation, Prostruc streamlines complex workflows into a user-friendly interface. The tool enables researchers to input protein sequences, identify homologous templates from databases such as the Protein Data Bank (PDB), and generate high-quality 3D structures with minimal computational expertise. Prostruc implements a two-stage validation process: first, it uses TM-align for structural comparison, assessing Root Mean Square Deviation (RMSD) and TM-scores against reference models. Second, it evaluates model quality via QMEANDisCo to ensure high accuracy. The top five models are selected based on these metrics and provided to the user. Prostruc stands out by offering scalability, flexibility, and ease of use. It is accessible via a cloud-based web interface or as a Python package for local use, ensuring adaptability across research environments. Benchmarking against existing tools like SWISS-MODEL, I-TASSER, and Phyre2 demonstrates Prostruc's competitive performance in terms of structural accuracy and job runtime, while its open-source nature encourages community-driven innovation. Prostruc is positioned as a significant advancement in homology modeling, making high-quality protein structure prediction more accessible to the scientific community.
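As a hedged illustration of the template-identification step (Prostruc's own pipeline wraps BLAST and ProMod3 internally), a Biopython web BLAST query of the PDB database might look like the following; the query sequence is a toy example and the call goes over the network.

```python
# Sketch: search the PDB for candidate homology-modeling templates.
from Bio.Blast import NCBIWWW, NCBIXML

query_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy protein sequence
handle = NCBIWWW.qblast("blastp", "pdb", query_seq)
record = NCBIXML.read(handle)

for alignment in record.alignments[:5]:          # top candidate templates
    best_hsp = alignment.hsps[0]
    print(alignment.title[:60], f"E-value={best_hsp.expect:.2e}")
```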
2025-07-24 11:40:00 12:00:00 04AB BioVis Automatic Generation of Natural Language Descriptions of Genomics Data Visualizations for Accessibility and Machine Learning Thomas C. Smits Thomas C. Smits, Sehi L'Yi, Andrew P. Mar, Nils Gehlenborg Availability of multimodal representations, i.e., visual and textual, is crucial for both information accessibility and construction of retrieval systems and machine learning (ML) models. Interactive data visualizations, omnipresent in data analysis tools and data portals, are key to accessing biomedical knowledge and detecting patterns in large datasets. However, large-scale ML models for generating descriptions of visualizations are limited and cannot handle the complexity of data and visualizations in fields like genomics. Generating accurate descriptions of complex interactive genomics visualizations remains an open challenge. This limits both access for blind and visually impaired users, and the development of multimodal datasets for ML applications. Grammar-based visualizations offer a unique opportunity. Since specifications of visualization grammars contain structured information about visualizations, they can be used to generate text directly, rather than interpreting the rendered visualization, potentially resulting in more precise descriptions. We present AltGosling, an automated description generation tool focused on interactive visualizations of genome-mapped data, created with the grammar-based toolkit Gosling. AltGosling uses a logic-based algorithm to create descriptions in various forms, including a tree-structured navigable panel for keyboard accessibility, and visualization-text pairs for ML training. We show that AltGosling outperforms state-of-the-art large language models and image-based neural networks for text generation of genomics data visualizations. AltGosling was adopted in our follow-up study to construct a retrieval system for genomics visualizations combining different modalities (specification, image, and text). As a first in genomics research, we lay the groundwork for building multimodal resources, improving accessibility, and enabling integration of biomedical visualizations and ML.
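The core intuition, that a grammar specification can be walked to produce text directly, can be sketched as follows. The simplified spec fields and wording below are illustrative assumptions, not AltGosling's actual logic.

```python
# Sketch: generate a description by traversing a (simplified) Gosling-like
# JSON specification instead of interpreting rendered pixels.
spec = {
    "tracks": [
        {"mark": "bar", "x": {"field": "position", "type": "genomic"},
         "y": {"field": "coverage", "type": "quantitative"}},
        {"mark": "point", "x": {"field": "position", "type": "genomic"},
         "y": {"field": "p_value", "type": "quantitative"}},
    ]
}

def describe(spec: dict) -> str:
    parts = []
    for i, t in enumerate(spec["tracks"], 1):
        parts.append(f"Track {i} shows {t['y']['field']} as {t['mark']} marks "
                     f"along a {t['x']['type']} axis ({t['x']['field']}).")
    return " ".join(parts)

print(describe(spec))
```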
2025-07-24 12:00:00 12:20:00 04AB BioVis Can LLMs Bridge Domain and Visualization? A Case Study on High-Dimension Data Visualization in Single-Cell Transcriptomics Qianwen Wang Qianwen Wang, Xinyi Liu, Nils Gehlenborg While many visualizations are built for domain users (biologists), understanding how visualizations are used in the domain has long been a challenging task. Previous research has relied on either interviewing a limited number of domain users or reviewing relevant application papers in the visualization community, neither of which provides comprehensive insight into visualizations in the wild of a specific domain. This paper aims to fill this gap by examining the potential of using Large Language Models (LLMs) to analyze visualization usage in domain literature. We use high-dimension (HD) data visualization in single-cell transcriptomics as a test case, analyzing 1,203 papers that describe 2,056 HD visualizations with highly specialized domain terminologies (e.g., biomarkers, cell lineage). To facilitate this analysis, we introduce a human-in-the-loop LLM workflow that can effectively analyze a large collection of papers and translate domain-specific terminology into standardized data and task abstractions. Instead of relying solely on LLMs for end-to-end analysis, our workflow enhances analytical quality through 1) integrating image processing and traditional NLP methods to prepare well-structured inputs for three targeted LLM subtasks (i.e., translating domain terminology, summarizing analysis tasks, and performing categorization), and 2) establishing checkpoints for human involvement and validation throughout the process. The analysis results, validated with expert interviews and a test set, revealed three often overlooked aspects in HD visualization: trajectories in HD spaces, inter-cluster relationships, and dimension clustering. This research provides a stepping stone for future studies seeking to use LLMs to bridge the gap between visualization design and domain-specific usage.
2025-07-24 12:20:00 12:40:00 04AB BioVis ClusterChirp: A GPU-Accelerated Web Platform for AI-Supported Interactive Exploration of High-Dimensional Omics Data Zeynep H. Gümüş Osho Rawal, Edgar Gonzalez-Kozlova, Sacha Gnjatic, Zeynep H. Gümüş Modern omics technologies generate high-dimensional datasets that overwhelm traditional visualization tools, requiring computational tradeoffs that risk losing important patterns. Researchers without computational expertise face additional barriers when tools demand specialized syntax or command-line proficiency, while connecting visual patterns to biological meaning typically requires manual navigation across platforms. To address these challenges, we developed ClusterChirp, a GPU-accelerated web platform for real-time exploration of data matrices containing up to 10 million values. The platform leverages deck.gl for hardware-accelerated rendering and optimized multi-threaded clustering algorithms that significantly outperform conventional methods. Its intuitive interface features interactive heatmaps and correlation networks that visualize relationships between biomarkers, with capabilities to dynamically cluster or sort data by various metrics, search for specific biomarkers, and adjust visualization parameters. Uniquely, ClusterChirp includes a natural language interface powered by an Artificial Intelligence (AI)-supported Large Language Model (LLM), enabling interactions through conversational commands. The platform connects with biological knowledge-bases for pathway and ontology enrichment analyses. ClusterChirp is being developed through iterative feedback from domain experts while adhering to FAIR principles, and will be freely available upon publication. By uniting performance, usability, and biological context, ClusterChirp empowers researchers to extract meaningful insights from complex omics data with unprecedented ease.
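As a simplified, CPU-only illustration of the correlation-network view described above (ClusterChirp itself renders these GPU-accelerated via deck.gl), one can threshold a biomarker correlation matrix into network edges; with random toy data the edge list will be nearly empty, whereas real omics data would show structure.

```python
# Sketch: build a biomarker correlation network by thresholding Pearson r.
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(500, 40))        # 500 samples x 40 biomarkers (toy)
corr = np.corrcoef(data, rowvar=False)   # 40 x 40 correlation matrix

threshold = 0.5
edges = [(i, j, corr[i, j])
         for i in range(corr.shape[0]) for j in range(i + 1, corr.shape[1])
         if abs(corr[i, j]) >= threshold]
print(f"{len(edges)} edges with |r| >= {threshold}")
```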
2025-07-24 12:40:00 13:00:00 04AB BioVis Sketch, capture and layout Phylogenies Daniel Huson Daniel Huson Phylogenetic trees and networks play a central role in biology, bioinformatics, and mathematical biology, and producing clear, informative visualizations of them is an important task. We present new algorithms for visualizing rooted phylogenetic networks as either combining or transfer networks, in both cladogram and phylogram style. In addition, we introduce a layout algorithm that aims to improve clarity by minimizing the total stretch of reticulate edges. To address the common issue that biological publications often omit machine-readable representations of depicted trees and networks, we also provide an image-based algorithm for extracting their topology from figures. All algorithms are implemented in our new PhyloSketch app, which is open source and freely available at: https://github.com/husonlab/phylosketch2.
2025-07-24 12:40:00 13:00:00 04AB BioVis PhageExpressionAtlas - a comprehensive transcriptional atlas of phage infections of bacteria Maik Wolfram-Schauerte Maik Wolfram-Schauerte, Caroline Trust, Nils Waffenschmidt, Kay Nieselt Bacteriophages (phages) are bacterial viruses that infect and lyse their hosts. Phages shape microbial ecosystems and have contributed essential tools for biotechnology and applications in medical research. Their enzymes, takeover mechanisms, and interactions with their bacterial hosts are increasingly relevant, especially as phage therapy emerges to combat antibiotic resistances. Therefore, a thorough understanding of phage-host interactions, especially on the transcriptional level, is key to unlocking their full potential. Dual RNA sequencing (RNA-seq) enables such insight by capturing gene expression in both phages and hosts across infection stages. While individual studies have revealed host responses and phage takeover strategies, comprehensive and systematic analyses remain scarce. To fill this gap, we present the PhageExpressionAtlas, the first interactive resource for exploring phage-host interactions at the transcriptome level. We developed a unified analysis pipeline to process over 20 public dual RNA-seq datasets, covering diverse phage-host systems, including therapeutic and model phages infecting ESKAPE pathogens like Staphylococcus aureus and Pseudomonas aeruginosa. Users can visualize gene expression across infection phases, download datasets, and classify phage genes as early, middle, or late expressed using customizable criteria. Expression data can be explored via heat maps, profile plots, and in genome context, aiding functional gene characterization and phage genome analysis. The PhageExpressionAtlas will continue to grow, integrating new datasets and features, including cross-phage/host comparisons and host transcriptome architecture analysis. We envision the PhageExpressionAtlas to become a central resource for the phage research community, fostering data-driven insights and interdisciplinary collaboration. The resource is available at phageexpressionatlas.cs.uni-tuebingen.de.
2025-07-24 14:00:00 14:40:00 04AB BioVis SEAL: Spatially-resolved Embedding Analysis with Linked Imaging Data Simon Warchol Simon Warchol, Grace Guo, Johannes Knittel, Dan Freeman, Usha Bhalla, Jeremy Muhlich, Peter K Sorger, Hanspeter Pfister Dimensionality reduction techniques help analysts interpret complex, high-dimensional spatial datasets by projecting data attributes into two-dimensional space. For instance, when investigating multiplexed tissue imaging, these techniques help researchers identify and differentiate cell types and states. However, they abstract away crucial spatial, positional, and morphological contexts, complicating interpretation and limiting deeper biological insights. To address these limitations, we present SEAL, an interactive visual analytics system designed to bridge the gap between abstract 2D embeddings and their rich spatial imaging context. SEAL introduces a novel hybrid-embedding visualization that preserves morphological and positional information while integrating critical high-dimensional feature data. By adapting set visualization methods, SEAL allows analysts to identify, visualize, and compare selections—defined manually or algorithmically—in both the embedding and original spatial views, enabling richer interpretation of the spatial arrangement and morphological characteristics of entities of interest. To elucidate differences between selected sets, SEAL employs a scalable surrogate model to calculate feature importance scores, identifying the most influential features governing the position of objects within embeddings. These importance scores are visually summarized across selections, with mathematical set operations enabling detailed comparative analyses. We demonstrate SEAL’s effectiveness through two case studies with cancer researchers: colorectal cancer analysis with a pharmacologist and melanoma investigation with a cell biologist. We then illustrate broader cross-domain applicability by exploring multispectral astronomical imaging data with an astronomer. Implemented as a standalone tool or integrated seamlessly with computational notebooks, SEAL provides an interactive platform for spatially informed exploration of high-dimensional datasets, significantly enhancing interpretability and insight generation.
2025-07-24 14:00:00 14:40:00 04AB BioVis Nightingale - A collection of web components for visualizing protein-related data Swaathi Kandasaamy Swaathi Kandasaamy, Daniel Rice, Aurélien Luciani, Adam Midlik, Maria Martin Nightingale is an open-source web visualization library for rendering protein-related data including domains, sites, variants, structures, and interactions using reusable web components. It employs a track-based approach, where sequences are represented horizontally, and multiple tracks can be stacked vertically to visualize different annotations at the same position, aiding in the discovery of relationships across annotations. This intuitive approach enhances the exploration and interpretation of complex biological data. It leverages the HTML5 Canvas API for improved performance, handling large datasets efficiently in the most used tracks, while still using SVG as a layer on top of the canvas for interactive elements where performance is less critical. It is a collaborative effort by UniProt, InterPro, and PDBe to provide a unified set of components for their websites, including UniProt’s ProtVista, while allowing flexibility for specific needs. As a collection of standard web components, Nightingale integrates seamlessly into any web application, ensuring compatibility with various frameworks and libraries. It utilizes standard DOM event propagation and attribute-based communication to facilitate interoperability between Nightingale components and other web components, irrespective of their internal implementation details. As an evolving platform, we aim to engage with parallel visualization projects to identify and promote best practices in the application of web standards, with a focus on advancing the adoption and integration of web components within the domain of biological data visualization.
2025-07-24 14:00:00 14:40:00 04AB BioVis A Multimodal Search and Authoring System for Genomics Data Visualizations Huyen N. Nguyen Huyen N. Nguyen, Sehi L'Yi, Thomas C. Smits, Shanghua Gao, Marinka Zitnik, Nils Gehlenborg We present a database system for retrieving interactive genomics visualizations through multimodal search capabilities. Our system offers users flexibility through three query methods: example images, natural language, or grammar-based queries, via a user interface. For each visualization in our database, we generate three complementary representations: a declarative specification using the Gosling visualization grammar, a pixel-based image, and a natural language description. To support investigation of multiple embeddings and retrieval strategies, we implement three embedding methods that capture different aspects of these visualizations: (1) Context-free grammar embeddings specifically designed for genomics visualizations, addressing specialized features like genomic tracks, views, and interactivity, (2) Multimodal embeddings derived from a state-of-the-art biomedical vision-language foundation model, and (3) Textual embeddings generated by our fine-tuned specification-to-text large language model. We evaluated the proposed embedding strategies across different modality variations using top-k retrieval accuracy. Notably, our findings demonstrate that context-free grammar embedding approaches achieve comparable retrieval results with lower computational demands. Our current collection contains over three thousand visualization examples spanning approximately 50 categories, from basic to scalable encodings, from single- to coordinated multi-view visualizations, supporting diverse genomics applications including gene annotations and single-cell epigenomics analysis. Retrieved visualizations serve as ready-to-use scaffolds for authoring: they are templates that users can modify with their data and customize to their visual preferences. This approach provides researchers with reusable examples, allowing them to concentrate on meaningful data analysis and interpretation instead of the technicalities of building visualizations from scratch.
2025-07-24 14:00:00 14:40:00 04AB BioVis Tersect Browser: characterising introgressions through interactive visualisation of large numbers of resequenced genomes Tomasz Kurowski Tomasz Kurowski, Fady Mohareb Introgressive hybridisation has long been a major source of genetic variation in plant genomes, and the ability to precisely identify and delimit intervals of DNA originating from wild species or cultivars of interest is of great importance to both researchers seeking insights into the evolution and breeding of crops, and to plant breeders seeking to protect their intellectual property. The low cost of genome resequencing and the public availability of large sets of resequenced genomes for many species of commercial importance, as well as for their wild relatives, have made it possible to reliably characterise the origins of specific genomic intervals. However, such analyses are often hampered by the same large volume of data that enables them. They generally take a long time to execute, and their results are difficult to visualise in an easily explorable manner. We present Tersect Browser, a Web-based tool that leverages a novel, multi-tier indexing and pre-calculation scheme to allow biologists to explore the relationships between large sets of resequenced genomes in a fully interactive fashion. Users have the option to freely adjust interval size and resolution while navigating through detailed genetic distance heatmaps and phylogenies for genomes and regions of interest, smoothly zooming in and out depending on the needs of their exploratory data analysis, aided by extendable plugins and annotations. Results and visualisations can also be shared with others and downloaded as high-resolution figures for use outside the application, placing the researcher best prepared to interpret the results in full control.
2025-07-24 14:40:00 15:40:00 04AB BioVis Visual Data Analysis Research in Biomedical Applications: Navigating the Line Between Scientific Novelty and Practical Impact Ingrid Hotz Ingrid Hotz Visualization has a long-standing tradition in biomedical research, yet its potential as a tool for data exploration and analytical reasoning remains underused. In this talk, I will share results and experiences from recent interdisciplinary collaborations in this area, including projects on molecular dynamics, electronic structure modeling, and hypothesis generation in medicine. In addition to presenting results, I will reflect on the challenges of working across domains, the sometimes slow but often rewarding process of building trust, and the tension between scientific innovation in both fields and real-world applicability. These reflections also raise broader questions about research sustainability: When is a project complete, and when is it time to move on?
2025-07-23 11:20:00 11:40:00 03A Bio-Ontologies and Knowledge Representation Knowledge-Graph-driven and LLM-enhanced Microbial Growth Predictions Marcin Joachimiak Marcin Joachimiak Predicting microbial growth preferences has far-reaching impacts in biotechnology, healthcare, and environmental management. Cultivating microbes allows researchers to streamline strain selection, develop targeted antimicrobials, and uncover metabolic pathways for biodegradation or biomanufacturing. However, with most microbial taxa remaining uncultivated and knowledge of their metabolic capabilities and organismal traits fragmented in unstructured text, cultivation remains a major challenge. To address this, we developed KG-Microbe, a knowledge graph (KG) of over 800,000 bacterial and archaeal taxa, 3,000 types of traits, and 30,000 types of functional annotations. Using KG-Microbe, we constructed machine learning pipelines to predict microbial growth preferences. We compared symbolic rule mining, which produces human-readable explanations, with "black box" methods like gradient boosted decision trees and deep graph-based models. While boosted tree models achieved a mean precision of over 70% across 46 diverse media, we demonstrate that symbolic rule mining can match their performance, offering crucial interpretability. To further validate predictions, we used large language models (LLMs) to interpret and explain model outputs. By comparing these different models and their outputs, we identified key data features and knowledge gaps relevant to predicting microbial cultivation media preferences. We also used vector embedding analogy reasoning as well as complex graph queries on KG-Microbe to generate novel hypotheses and identify organisms with specific properties. Our work highlights the power of a KG-driven approach and the trade-offs between model interpretability and predictive performance. These findings motivate the development of hybrid AI models that combine transparency with predictive accuracy to advance microbial cultivation.
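A hedged sketch of the prediction setup compared in this work: a gradient boosted tree classifier over binary trait and annotation features, scored with AUC-PR (average precision). The simulated features and labels stand in for KG-Microbe-derived data; nothing here is the authors' pipeline.

```python
# Sketch: boosted trees predicting growth on a medium from KG-derived features.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(2000, 300))  # taxa x (traits + functional annotations)
y = rng.integers(0, 2, size=2000)         # grows on medium M? (toy labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
print("AUC-PR:", average_precision_score(y_te, clf.predict_proba(X_te)[:, 1]))
```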
2025-07-23 11:40:00 12:00:00 03A Bio-Ontologies and Knowledge Representation ProDiGenIDB – a unified resource of disease-associated genes, their protein products, and intrinsic disorder annotations Jovana Kovacevic Jovana Kovacevic, Anđelka Zečević, Lazar Vasović Understanding gene-disease associations is essential in biomedical research, yet relevant information is often distributed across multiple heterogeneous databases. To overcome this fragmentation, we developed ProDiGenIDB, an integrated database that consolidates gene-disease relationships from several recognized and publicly available sources, while also enriching them with complementary data on gene and protein identifiers, disease ontology, and protein structural disorder. ProDiGenIDB brings together over 400,000 curated associations sourced from DisGeNet, COSMIC, HumsaVar, Orphanet, ClinVar, HPO, and DISEASES. Each entry includes gene-related metadata (Gene Symbol, Entrez ID, UniProt ID, Ensembl ID), disease descriptors (Disease Name, DOID), and a reference to the original source database. Importantly, the database also incorporates predicted intrinsic disorder information for proteins encoded by the associated genes. These predictions were generated using commonly used protein disorder prediction tools such as IUPred and VSL2, providing additional insight into the potential lack of structure of disease-related proteins. Another important aspect of the database construction involved mapping disease names to standardized Disease Ontology IDs (DOIDs). To improve this process, we applied Natural Language Processing (NLP) techniques using advanced text representation models to enhance the accuracy and consistency of term association. ProDiGenIDB represents a valuable resource for integrative biomedical studies, particularly in contexts where protein disorder is hypothesized to play a functional or pathological role.
2025-07-23 12:00:00 12:20:00 03A Bio-Ontologies and Knowledge Representation Causal knowledge graph analysis identifies adverse drug effects Sumyyah Toonsi Sumyyah Toonsi, Paul Schofield, Robert Hoehndorf Motivation: Knowledge graphs and structural causal models have each proven valuable for organizing biomedical knowledge and estimating causal effects, but remain largely disconnected: knowledge graphs encode qualitative relationships focusing on facts and deductive reasoning without formal probabilistic semantics, while causal models lack integration with background knowledge in knowledge graphs and have no access to the deductive reasoning capabilities that knowledge graphs provide. Results: To bridge this gap, we introduce a novel formulation of Causal Knowledge Graphs (CKGs) which extend knowledge graphs with formal causal semantics, preserving their deductive capabilities while enabling principled causal inference. CKGs support deconfounding via explicitly marked causal edges and facilitate hypothesis formulation aligned with both encoded and entailed background knowledge. We constructed a Drug–Disease CKG (DD-CKG) integrating disease progression pathways, drug indications, side-effects, and hierarchical disease classification to enable automated large-scale mediation analysis. Applied to UK Biobank and MIMIC-IV cohorts, we tested whether drugs mediate effects between indications and downstream disease progression, adjusting for confounders inferred from the DD-CKG. Our approach successfully reproduced known adverse drug reactions with high precision while identifying previously undocumented significant candidate adverse effects. Further validation through side effect similarity analysis demonstrated that combining our predicted drug effects with established databases significantly improves the prediction of shared drug indications, supporting the clinical relevance of our novel findings. These results demonstrate that our methodology provides a generalizable, knowledge-driven framework for scalable causal inference.
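For intuition only, the product-of-coefficients form of mediation analysis (indication to drug to outcome) can be sketched with ordinary least squares on simulated data; the authors' actual approach additionally uses the DD-CKG to identify and adjust for confounders.

```python
# Illustrative mediation sketch: does drug use D mediate indication I -> outcome Y?
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 5000
indication = rng.integers(0, 2, n)
drug = (0.6 * indication + rng.normal(0, 1, n)) > 0.3          # I -> D
outcome = 0.4 * drug + 0.1 * indication + rng.normal(0, 1, n)  # D -> Y plus direct I -> Y

# a-path: indication -> drug; b-path: drug -> outcome, adjusting for indication
a = sm.OLS(drug.astype(float), sm.add_constant(indication)).fit().params[1]
b = sm.OLS(outcome, sm.add_constant(np.column_stack([drug, indication]))).fit().params[1]
print("indirect (mediated) effect a*b =", a * b)
```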
2025-07-23 12:20:00 12:40:00 03A Bio-Ontologies and Knowledge Representation CROssBARv2: A Unified Biomedical Knowledge Graph for Heterogeneous Data Representation and LLM-Driven Exploration Bünyamin Şen Bünyamin Şen, Erva Ulusoy, Melih Darcan, Mert Ergün, Tunca Dogan Developing effective therapeutics against prevalent diseases requires a deep understanding of molecular, genetic, and cellular factors involved in disease development/progression. However, such knowledge is dispersed across different databases, publications, and ontologies, making collecting, integrating and analysing biological data a major challenge. Here, we present CROssBARv2, an extended and improved version of our previous work (https://crossbar.kansil.org/), a heterogeneous biological knowledge graph (KG) based system to facilitate systems biology and drug discovery/repurposing. CROssBARv2 collects large-scale biological data from 32 data sources and stores them in a Neo4j graph database. CROssBARv2 consists of 2,709,502 nodes and 12,688,124 relationships between 14 node types. On top of that, we developed a GraphQL API and a large language model interface that translates users’ natural language queries into Neo4j's Cypher query language and translates query results back into natural language, allowing users to access information within the KG and answer specific scientific questions without LLM hallucinations, thereby facilitating use of the resource. To evaluate the capability of CROssBAR-LLMs (LLMs augmented with structured knowledge in CROssBAR) in answering biomedical questions, we constructed multiple benchmark datasets and employed an independent benchmark to systematically compare various open- and closed-source LLMs. Our results revealed that CROssBAR-LLMs display a significantly improved accuracy in answering these scientific questions compared to standalone LLMs and even LLMs augmented with web search. CROssBARv2 (https://crossbarv2.hubiodatalab.com/) is expected to contribute to life sciences research considering (i) the discovery of disease mechanisms at the molecular level and (ii) the development of effective personalised therapeutic strategies.
2025-07-23 12:40:00 12:45:00 03A Bio-Ontologies and Knowledge Representation Benchmarking Data Leakage on Link Prediction in Biomedical Knowledge Graph Embeddings Galadriel Brière Galadriel Brière, Thomas Stosskopf, Benjamin Loire, Anaïs Baudot In recent years, Knowledge Graphs (KGs) have gained significant attention for their ability to organize massive biomedical knowledge into entities and relationships. Knowledge Graph Embedding (KGE) models facilitate efficient exploration of KGs by learning compact data representations. These models are increasingly applied on biomedical KGs for various tasks, notably link prediction that enables applications such as drug repurposing. The research community has implemented benchmarks to evaluate and compare the large diversity of KGE models. However, existing benchmarks often overlook the issue of Data Leakage (DL), which can lead to inflated performance and compromise the validity of benchmark results. DL may occur due to inadequate separation between training and test sets (DL1), use of illegitimate features (DL2), or evaluation settings that fail to reflect real-world inference conditions (DL3). In this study, we implement systematic procedures to detect and mitigate these sources of DL. We evaluate popular KGE models on a biomedical KG and show that inappropriate data separation (DL1) artificially inflates model performances and that models do not rely on node degree as a shortcut feature (DL2). For DL3, we implement realistic inference conditions with i) a zero-shot training procedure in which drugs in test and validation sets have no known indications during training and ii) a drug repurposing ground-truth for rare diseases. Performances collapse in both these scenarios. Our findings highlight the need for more rigorous evaluation protocols and raise concerns about the reliability of current KGE models for real-world biomedical applications such as drug repurposing.
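The zero-shot condition used for DL3 can be illustrated with a drug-wise split: hold out entire drugs, so that no held-out drug has any indication edge available during training. The toy edge list below is a stand-in for a real knowledge graph.

```python
# Sketch: split indication edges by drug, not at random, to avoid DL3 leakage.
import random

edges = [(f"drug_{i % 50}", f"disease_{i % 90}") for i in range(400)]  # toy (drug, disease) pairs
drugs = sorted({d for d, _ in edges})
random.Random(0).shuffle(drugs)

test_drugs = set(drugs[:10])
train = [e for e in edges if e[0] not in test_drugs]
test = [e for e in edges if e[0] in test_drugs]

assert not {d for d, _ in train} & test_drugs  # no drug leaks into training
print(len(train), "train edges,", len(test), "zero-shot test edges")
```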
2025-07-23 12:45:00 12:50:00 03A Bio-Ontologies and Knowledge Representation A machine learning framework for extracting and structuring biological pathway knowledge from scientific literature Mun Su Kwon Mun Su Kwon, Junkyu Lee, Haechan Sung, Hyun Uk Kim Advances in text mining have significantly improved the accessibility of scientific knowledge from literature. However, a major challenge in biology and biotechnology remains in extracting information embedded within biological pathway images, which are not easily accessible through conventional text-based methods. To overcome this limitation, we present a machine learning–based framework called “Extraction of Biological Pathway Information” (EBPI). The framework systematically retrieves relevant publications based on user-defined queries, identifies biological pathway figures, and extracts structured information such as genes, enzymes, and metabolites. EBPI combines image processing and natural language models to identify texts from diagrams, classify terms into biological categories, and infer biochemical reaction directionality using graphical cues such as arrows. The extracted information is output in an editable, tabular format suitable for integration with pathway databases and knowledge graphs. Validated against manually curated pathway maps, EBPI enables scalable knowledge extraction from complex visual data of biological pathways and opens new directions for automated literature curation across many biological disciplines.
2025-07-23 12:50:00 13:00:00 03A Bio-Ontologies and Knowledge Representation Poster Madness Each accepted poster presenter is given up to 1 minute to advertise their poster.
2025-07-23 14:00:00 14:20:00 03A Bio-Ontologies and Knowledge Representation ScGOclust: leveraging gene ontology to find functionally analogous cell types between distant species Yuyao Song Yuyao Song, Yanhui Hu, Julian Dow, Norbert Perrimon, Irene Papatheodorou Basic biological processes are shared across animal species, yet their cellular mechanisms are profoundly diverse. Comparing cell-type gene expression between species reveals conserved and divergent cellular functions. However, as phylogenetic distance increases, gene-based comparisons become less informative. The Gene Ontology (GO) knowledgebase offers a solution by serving as the most comprehensive resource of gene functions across a vast diversity of species, providing a bridge for distant species comparisons. Here, we present scGOclust, a computational tool that constructs de novo cellular functional profiles using GO terms, facilitating systematic and robust comparisons within and across species. We applied scGOclust to analyse and compare the heart, gut and kidney between mouse and fly, and whole-body data from C. elegans and H. vulgaris. We show that scGOclust effectively recapitulates the functional spectrum of different cell types, characterises functional similarities between homologous cell types, and reveals functional convergence between unrelated cell types. Additionally, we identified subpopulations within the fly crop that show circadian rhythm-regulated secretory properties and hypothesize an analogy between fly principal cells from different segments and distinct mouse kidney tubules. We envision scGOclust as an effective tool for uncovering functionally analogous cell types or organs across distant species, offering fresh perspectives on evolutionary and functional biology.
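The central construction, a cell-by-GO functional profile obtained by projecting expression through gene-to-GO annotations, can be sketched with plain matrix algebra; the matrix sizes and random data below are illustrative assumptions, not scGOclust's implementation.

```python
# Sketch: project a cell-by-gene matrix onto GO terms via a binary
# gene-by-GO annotation matrix, giving a species-comparable GO profile.
import numpy as np

rng = np.random.default_rng(5)
cells_x_genes = rng.poisson(1.0, size=(100, 2000))  # toy expression matrix
genes_x_go = rng.integers(0, 2, size=(2000, 300))   # gene -> GO annotations

cells_x_go = cells_x_genes @ genes_x_go             # cell-level GO profile
# Normalise per cell so profiles are comparable across datasets/species
profile = cells_x_go / cells_x_go.sum(axis=1, keepdims=True)
print(profile.shape)  # (100 cells, 300 GO terms)
```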
2025-07-23 14:20:00 14:40:00 03A Bio-Ontologies and Knowledge Representation Integrating autoantibody-related knowledge in an ontology populated using a curated dataset from literature Fabien Maury Fabien Maury, Solène Grosdidier, Killian Halberda, Isabelle Desguerre, Adrien Coulet, Maud de Dieuleveult Autoimmune diseases (AIDs) are often characterized by the presence of autoantibodies (AAbs). However, many of these diseases are rare and can be hard to diagnose, partly due to the lack of easily accessible knowledge, such as which type of AAb to test for in order to diagnose a particular AID. Indeed, to our knowledge, no centralized resource covering all available knowledge related to human autoantibodies exists as of 04-2025. To fill this gap, we first introduce a light ontology that represents relationships between AAbs, their molecular targets, and the related AIDs and their clinical signs. This ontology also allows the provenance of the relationships to be specified, by reusing the PROV-O ontology. Second, we introduce the MAKAAO Core dataset, a dataset compiled manually from the literature by several curators. MAKAAO Core includes the names and synonyms (both in English and French) of over 350 autoantibodies, along with their targets and associated AIDs. Targets and diseases are referred to using identifiers from reference resources. We used this dataset to populate our ontology, and named the result the MAKAAO knowledge graph (MAKAAO KG), which constitutes the central part of a future reference resource.
2025-07-23 14:40:00 15:00:00 03A Bio-Ontologies and Knowledge Representation Ontology pre-training improves machine learning predictions of aqueous solubility and other metabolite properties Charlotte Tumescheit Charlotte Tumescheit, Martin Glauer, Simon Flügel, Fabian Neuhaus, Till Mossakowski, Janna Hastings Predicting the properties of small-molecule metabolites from their structures is a challenging task. Molecular language models have emerged as a highly performant AI approach for predicting diverse properties directly from ‘language-like’ representations of molecular structures. However, for many prediction problems there is a shortage of available training data, and model performance is still limited. Integrating expert knowledge into language models has the potential to improve performance on prediction tasks and model generalisability. Bio-ontologies offer curated knowledge ideal for this purpose. Here, we demonstrate a novel approach to knowledge injection, ‘ontology pre-training’, which we have previously shown to work in a pilot case study on the classification task of toxicity prediction. Now, we extend this to regression tasks such as solubility prediction and a wider range of classification tasks. First, we pre-train a Transformer-based language model on molecules from PubChem. Then, using our novel method, we embed the knowledge contained in a classification hierarchy derived from the ChEBI ontology into the model as an intermediate training step between general-purpose pre-training and task-specific fine-tuning. Finally, we fine-tune the models on a range of regression tasks. We find a clear improvement in performance and training times across the diverse prediction tasks. Our results show that adding an additional knowledge-based training step to a machine learning model can improve performance. Our method is intuitive and generalisable, and we plan to extend it to further biological modalities and prediction datasets, including proteins and RNA, and to explore the impact of different ontologies.
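The staged training recipe in this abstract (general pre-training, then ontology pre-training, then task fine-tuning) can be sketched schematically. In the sketch below, the tiny encoder, the prediction heads, and the random toy data are illustrative stand-ins, not the authors' actual models or datasets:

```python
import torch
from torch import nn

# Schematic three-stage training: general pre-training -> ontology
# pre-training on a class hierarchy -> task-specific fine-tuning.
# All sizes and data here are toy stand-ins for illustration only.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)

def train(model, head, batches, loss_fn, epochs=1):
    opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=1e-4)
    for _ in range(epochs):
        for x, y in batches:
            opt.zero_grad()
            pred = head(model(x).mean(dim=1))  # mean-pool token embeddings
            loss_fn(pred, y).backward()
            opt.step()

def toy_batches(out_dim, binary=False):
    # One toy batch: 8 "molecules" of 16 tokens with 64-d embeddings.
    y = torch.rand(8, out_dim).round() if binary else torch.randn(8, out_dim)
    return [(torch.randn(8, 16, 64), y)]

# Stage 1: general-purpose pre-training (e.g., on PubChem molecules).
train(encoder, nn.Linear(64, 64), toy_batches(64), nn.MSELoss())
# Stage 2: ontology pre-training on ChEBI-derived class labels (multi-label).
train(encoder, nn.Linear(64, 10), toy_batches(10, binary=True), nn.BCEWithLogitsLoss())
# Stage 3: fine-tuning on a regression task such as aqueous solubility.
train(encoder, nn.Linear(64, 1), toy_batches(1), nn.MSELoss())
```

The key design point is that the same encoder is carried through all three stages; only the head and loss change at each step.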
2025-07-23 15:00:00 15:20:00 03A Bio-Ontologies and Knowledge Representation Building the Aging Biomarkers Ontology and Its Applications in Aging Research Hande McGinty Hande McGinty, Srikar Reddy Gadusu, Yigit Kucuk, Aaron King Aging is a complex biological process shaped by numerous biomarkers—such as cholesterol and blood sugar levels—that serve as measurable indicators of health and disease. Despite the abundance of biomarker data, identifying meaningful patterns and relationships remains a significant challenge. To address this, we began developing the Aging Biomarkers Ontology (ABO), a structured framework that formally defines aging-related biomarkers, organizes them hierarchically, and maps their interconnections to facilitate deeper analysis. Furthermore, we employed two complementary approaches to enrich the graph and uncover hidden associations among aging biomarkers: Depth-Limited Search (DLS) and machine learning-based embedding search. DLS identifies associations by traversing connected nodes within a predefined depth, while the embedding-based method encodes biomarker relationships as numerical vectors and uses cosine similarity to predict potential links. We evaluated the performance of both methods in detecting known and novel relationships. Our results demonstrate the value of systematically integrating statistical analysis with graph-based reasoning and machine learning to explore aging-related biomarkers. The resulting framework enhances the interpretability of biomarker data, supports hypothesis generation, and contributes to advancing biomedical research in aging and longevity.
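The embedding-based link-prediction idea described in this abstract can be sketched in a few lines. The biomarker names and vectors below are hypothetical examples, not drawn from the ABO itself:

```python
import numpy as np

# Hypothetical pre-computed biomarker embeddings (e.g., from a graph
# embedding method); names and vectors are illustrative only.
embeddings = {
    "cholesterol": np.array([0.81, 0.12, 0.55]),
    "blood_sugar": np.array([0.75, 0.20, 0.60]),
    "crp":         np.array([0.10, 0.90, 0.30]),
}

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rank candidate links for one biomarker by cosine similarity.
query = "cholesterol"
scores = {name: cosine(embeddings[query], vec)
          for name, vec in embeddings.items() if name != query}
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{query} -> {name}: {score:.2f}")  # higher = more likely link
```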
2025-07-23 15:20:00 15:40:00 03A Bio-Ontologies and Knowledge Representation Discovering cellular contributions to disease pathogenesis in the NLM Cell Knowledge Network Richard Scheuermann Richard Scheuermann, Anne Deslattes Mays, Matthew Diller, Caroline Eastwood, Rezarta Islamaj, James Leaman, Raymond LeClair, Zhiyong Lu, Chris Mungall, Vinh Nguyen, David Osumi-Sutherland, Beverly Peng, Noam Rotenberg, William Spear, Bingfang Xu, Yun Zhang Knowledge about the role of genes in disease pathogenesis has been obtained from genetic and genome-wide association studies. The proteins encoded by these genes are frequently found to be effective therapeutic targets. However, little is known about which cells are the functional home of these disease-associated genes and proteins. Single-cell genomic technologies are now revealing the cellular complexity of human tissues at high resolution. The transcriptomes defined by these technologies reflect the functional cellular phenotypes. Database resources that capture and disseminate data derived from these single-cell technologies have been developed, but the knowledge derived from their analysis and interpretation remains largely buried as free text in the scientific literature. Here we describe the development of a Cell Knowledge Network (CKN) prototype at the National Library of Medicine (NLM) that captures and exposes knowledge about cell phenotypes (cell types and states) derived from single-cell technologies and related experiments. NLM-CKN is populated using validated computational analysis pipelines and natural language processing of the scientific literature, and is integrated with other sources of relevant knowledge about genes, anatomical structures, diseases, and drugs. Using this integration of experimental sc/snRNAseq data with prior knowledge about disease predispositions and drug targets, a novel linkage between lung pericytes and pulmonary hypertension was discovered through the KCNK3 gene intermediary, with implications for novel therapeutic interventions. Through the integration of knowledge from single-cell technologies with other sources of knowledge about genetic predispositions and therapeutic targets, the NLM-CKN is revealing the cellular contributions to disease pathogenesis.
2025-07-23 15:40:00 16:00:00 03A Bio-Ontologies and Knowledge Representation Cat-VRS for Genomic Knowledge Curation: A Hyperintensional Representation Framework for FAIR Categorical Variation Daniel Puthawala Daniel Puthawala, Brendan Reardon Cat-VRS: A FAIR catvar Standard Categorical variants (catvars)—such as “MET exon 14 skipping” and “TP53 loss”—are foundational to genomic knowledge, linking sets of genomic variants to clinically relevant assertions like oncogenicity scores or predicted therapeutic response. Yet despite their importance, catvars remain unstandardized, ambiguous, and largely non-computable, creating persistent barriers to search, curation, interoperability, and reuse. Existing standards either offer flexible models for sequence-resolved variants (e.g., GA4GH VRS) or rigid top-down nomenclatures (e.g., HGVS) that fail to capture the diversity and nuance of categorical assertions. We present the Categorical Variation Representation Specification (Cat-VRS), a new GA4GH standard for representing catvars using a hyperintensional, constraint-based model. Cat-VRS encodes categorical meaning compositionally and bottom-up: structured constraints—such as sequence location or protein functional consequence—support precise, flexible representations at varying levels of granularity. Cat-VRS is fully interoperable with other GA4GH standards, supports ontology mappings, and was developed through global community collaboration in alignment with the FAIR data principles. Cat-VRS 1.0 was recently released by GA4GH and is already in use by ClinVar and MaveDB, with integration underway in CIViC and the VICC MetaKB. These early implementations demonstrate Cat-VRS’s practical utility in enabling reusable, computable representations of categorical knowledge. As precision medicine scales, so too does the need for infrastructure that supports consistent curation, standardized data sharing, and automated variant knowledge matching. We invite the bio-ontologies and knowledge representation community to engage with Cat-VRS as both a practical tool and an extensible framework for advancing interoperable genomic knowledge.
2025-07-23 16:40:00 17:40:00 03A Bio-Ontologies and Knowledge Representation Knowledge Graphs: Theory, Applications and Challenges Ian Horrocks Knowledge Graphs have rapidly become a mainstream technology that combines features of databases and AI. In this talk I will introduce Knowledge Graphs, explaining their features and the theory behind them. I will then consider some of the challenges inherent in both the theory and implementation of Knowledge Graphs and present some solutions that have made possible the development of popular language standards and robust and high-performance Knowledge Graph systems. Finally, I will illustrate the wide applicability of knowledge graph technology with some example use cases.
2025-07-23 17:40:00 17:45:00 03A Bio-Ontologies and Knowledge Representation Bridging Language Barriers in Bio-Curation: An LLM-Enhanced Workflow for Ontology Translation into Japanese Mark Streer Mark Streer, Olivia Watson, Mark McDowall, Jane Lomax SciBite’s ontology management and named entity recognition (NER) software relies on curated public ontologies to support data harmonization under FAIR principles (findable, accessible, interoperable, and reusable). Public ontologies are foundational for data FAIR-ification, providing structured vocabularies that enable consistent annotation and semantic integration; however, they are predominantly developed in English, creating barriers for non-English users and applications. To address this challenge for our Japanese customers, we developed a large language model (LLM)-enhanced bio-curation workflow for English-to-Japanese translation, focusing on synonym enrichment of the Uberon anatomy ontology as a case study. Our approach implements a three-step process: (1) importing mapped Japanese synonyms from existing bilingual datasets (e.g., DBCLS resources), (2) generating Japanese candidate synonyms based on English synonyms and definitions using an LLM, and (3) validating candidates against the source ontology to ensure appropriate placement as well as online dictionaries and other references to confirm their real-world applicability. Initially developed for synonym enrichment, this workflow can be extended to semantic refinement into broadMatch and narrowMatch relationships in addition to exactMatch—critical for terminology lacking perfect English equivalents. Furthermore, the workflow is well-suited to agentic frameworks such as LangGraph to orchestrate generation and Internet research processes, as well as LLM-ensemble evaluation to automatically confirm clear matches, allowing ambiguous cases to be prioritized for “human-in-the-loop” curation. This approach represents a promising solution for scalable ontology translation, contributing to the FAIR development and application of bio-ontologies across language barriers and enhancing international biomedical research collaboration.
2025-07-23 17:45:00 17:50:00 03A Bio-Ontologies and Knowledge Representation Enabling FAIR Single-Cell RNAseq Data Management with COPO Felix Shaw Felix Shaw, Debby Ku, Aaliyah Providence, Irene Papatheodorou We present our work on establishing standards and tools for validating and submitting single-cell RNA sequencing (scRNA-seq) data and metadata using the COPO brokering platform. Effective research data management is essential for enabling data reuse, integration, and the discovery of new biological insights. As new technologies like single-cell sequencing and transcriptomics emerge, they often outpace existing data infrastructure. Single-cell technologies allow detailed insights into biological processes, for example, tracking gene expression dynamics in crops, dissecting pathogen-host interactions at the cellular level, or identifying stress-resilient cell types. Yet without comprehensive metadata and appropriate data management tools, the full potential of these datasets remains unrealised. Implementing the FAIR principles—particularly around metadata quality—is crucial. At present, there are few widely adopted standards or tools for describing scRNA-seq experiments. In response, we have developed a structured metadata template tailored to these experiments, informed by extensive consultation with researchers across the single-cell community and aligned with existing standards. This metadata standard is integrated into COPO, which provides a streamlined interface for validating and brokering data and metadata to public repositories. Standardised metadata improves discoverability, supports data integration across platforms, and enables consistent reuse. It also ensures proper attribution, facilitates collaboration across diverse disciplines, and enhances reproducibility. By submitting with FAIR metadata via COPO, we transform scRNA-seq outputs from isolated experimental results into well-labelled, interoperable datasets suitable for downstream applications such as machine learning. Our work addresses a key infrastructure gap, enabling more effective, collaborative, and impactful research in the single-cell field.
2025-07-23 17:50:00 17:55:00 03A Bio-Ontologies and Knowledge Representation Cancer Complexity Knowledge Portal: A centralized web portal for finding cancer related data, software tools, and other resources Susheel Varma Orion Banks, Ashley Clayton, Aditi Gopalan, Amber Nelson, Stockard Simon, Verena Chung, Amy Heiser, Jay Hodgson, Aditya Nath, Adam Hindman, Milen Nikolov, Adam Taylor, James Eddy, Susheel Varma, Jineta Banerjee Applying artificial intelligence and machine learning to biomedical problems requires clean, high-quality data and reusable software tools. The Cancer Complexity Knowledge Portal (CCKP), an NIH-listed domain-specific repository maintained by the Multi-Consortia Coordinating (MC2) Center at Sage Bionetworks, makes oncology data findable and accessible. The MC2 Center coordinates resources among six cancer-focused research consortia funded by the National Cancer Institute. To establish metadata standards, the CCKP hosts data models for various modalities, including genomics and imaging. New models are also being developed for emerging types, such as spatial transcriptomics. These models undergo iterative development with versioned releases maintained in a public GitHub repository. They power data management tools developed by Sage Bionetworks, including the Schematic Python package and the Data Curator App, which support FAIR data annotation. The data models help researchers link research outputs and assist the CCKP in highlighting activities from NCI-funded cancer research programs. The portal offers search and filtering capabilities to accelerate discovery and collaboration. As of November 2024, it hosts information on 3,786 publications, 904 datasets, and 292 computational tools from over 140 research grants. The models incorporate elements from the Cancer Research Data Commons Data Hub to support integration within the CRDC ecosystem. We are engaging with scientists, clinicians, and patient advocates to leverage user-centred design and structured data models, making cancer data more findable, accessible, and reusable. These improvements aim to bridge the gap between experimental and computational labs, fueling scientific discovery.
2025-07-23 17:55:00 18:00:00 03A Bio-Ontologies and Knowledge Representation COSI Closing Remarks Augustin Luna, Tiffany Callahan, Augustin Luna, Tiffany Callahan
2025-07-21 11:20:00 11:40:00 03A BOSC Welcome to BOSC; Open Bioinformatics Foundation update; CoFest announcement; sponsor video Nomi Harris
2025-07-21 11:40:00 12:40:00 03A BOSC Working together to develop, promote and protect our data resources: Lessons learnt developing CATH and TED Christine Orengo Christine Orengo The CATH protein domain structure classification was the vision of the pioneering computational scientist Janet Thornton. Algorithms developed by Orengo and Taylor in the lab of Willie Taylor enabled the analyses that laid the foundations for CATH. Since then, the Orengo team have taken CATH forward in many ways. Working closely with the protein sequence, structural and evolutionary biology communities provided the focus and feedback to shape the resource. Maintaining the value and integrity of CATH has necessitated continuously embracing new types of data as they became relevant and developing the appropriate tools for this. For example, CATH was recently expanded >400-fold with predicted structures from the AlphaFold Database (AFDB) using novel AI-based tools, and the CATH team are collaborating with the AFDB team at the EBI to make the data available to the wider community. CATH is also a partner resource in InterPro and was used by the Structural Genomics Consortia in the United States for more than 15 years to probe novel fold and function space. All CATH data and tools are publicly available. The talk will present landmark developments and how the resource has benefitted from extensive collaborations with the wider community to handle the data expansions and to provide accurate data needed by the community. It will also draw on the CATH experience to reflect on strategies for supporting open data and open source.
2025-07-21 12:40:00 13:00:00 03A BOSC Connecting Data, People, and Purpose: How Open Science is Advancing Bioinformatics in a Low-Resource region (Nigeria) Seun Olufemi Seun Olufemi In many low-resource regions, access to scientific training, collaboration opportunities, and computational tools remains limited, hindering both local research capacity and global scientific equity. In Nigeria, we sought to address these gaps by building a sustainable Community of Practice (CoP) for bioinformatics, grounded in open data, mentorship, and community-led learning. Through the Open Seeds program by Open Life Science (OLS), we launched Bioinformatics Outreach Nigeria (https://bioinformatics-outreach-nigeria.github.io/)—a grassroots initiative aimed at using open science principles to foster data literacy, equitable access to bioinformatics tools, and shared community resources. Our journey began with a nationwide survey, which revealed that over 60% of aspiring bioinformaticians lacked access to adequate training and infrastructure. In response, we designed and delivered an “Open Science for Bioinformaticians” workshop, training 48 of 232 applicants in open data practices, principles of open science, reproducible research, and collaborative science. Pre- and post-training data showed significant knowledge gains and emphasized the value of continuous peer support. Building on these insights, we developed shared community infrastructure (documents)—such as a publicly accessible Open Canvas, a Code of Conduct, and open documentation practices—that not only reinforces transparency but also promotes inclusive and collaborative scientific work. Our experience demonstrates how data-driven community building, open science mentorship, and collaborative infrastructure can enable lasting change, serving as a scalable model for other underserved regions aiming to bridge the scientific access gap.
2025-07-21 12:40:00 13:00:00 03A BOSC Analytical code sharing practices in biomedical research Serghei Mangul Serghei Mangul, Dhrithi Deshpande, Viorel Munteanu, Viorel Bostan, Dumitru Ciorbă, Nicole Nogoy Data-driven computational analysis is becoming increasingly important in biomedical research as the amount of data being generated continues to grow. However, the lack of practices for sharing research outputs, such as data, source code, and methods, affects the transparency and reproducibility of studies, which are critical to the advancement of science. Many published studies are not reproducible due to insufficient documentation, code, and data being shared. We conducted a comprehensive analysis of 453 manuscripts published between 2016 and 2021 and found that 50.1% of them failed to share their analytical code. Even among those that did disclose their code, the vast majority failed to offer additional research outputs, such as data. Furthermore, only one in ten articles organized their code in a structured and reproducible manner. We discovered a significant association between the presence of code availability statements and increased code availability. Additionally, a greater proportion of studies conducting secondary analyses were inclined to share their code compared to those conducting primary analyses. In light of our findings, we propose raising awareness of code-sharing practices and taking immediate steps to enhance code availability in order to improve reproducibility in biomedical research. By increasing transparency and reproducibility, we can promote scientific rigor, encourage collaboration, and accelerate scientific discoveries. We must prioritize open science practices, including sharing code, data, and other research products, to ensure that biomedical research can be replicated and built upon by others in the scientific community.
2025-07-21 12:40:00 13:00:00 03A BOSC Introducing the Actionable Guidelines for FAIR Research Software Task Force Bhavesh Patel Bhavesh Patel, Daniel Garijo, Marie-Christine Jacquemot-Perbal, Kelvin Lee, Carlos Martinez-Ortiz, Alexander Struck The Research Software Alliance (ReSA) has established a Task Force dedicated to translating the FAIR Principles for Research Software (FAIR4RS Principles) into practical, actionable guidelines. Existing field-specific actionable guidelines, such as the FAIR Biomedical Research Software (FAIR-BioRS) guidelines (first presented at BOSC 2022), lack cross-discipline community input. The Actionable Guidelines for FAIR Research Software Task Force, formed in December 2024, brings together a diverse team of research software developers to address this gap. The Task Force is using the FAIR-BioRS guidelines as a foundation to build upon while aiming to create generalized guidelines. The Task Force began by analyzing the FAIR4RS Principles, identifying six key requirement categories: Identifiers, Metadata for software publication and discovery, Standards for inputs/outputs, Qualified references, Metadata for software reuse, and License. To address these requirements, six sub-groups are conducting literature reviews and community outreach to define actionable practices for each category. Some of the challenges include identifying suitable identifiers, archival repositories, and metadata standards across research domains. This presentation provides an overview of the Task Force, presents its current progress, and outlines opportunities for community involvement. We will also explain how the FAIR-BioRS guidelines have now led to this global effort. The Task Force is dedicated to making all its outcomes openly available (CC-BY-4.0 license). This initiative will significantly benefit the biomedical open-source community by providing generalized guidelines to make software FAIR that apply beyond just biomedical software, which is critical to prevent siloed practices and drive cross-discipline collaborations.
2025-07-21 14:00:00 14:20:00 03A BOSC AMRColab: An Open-Access, Modular Bioinformatics Suite for Accessible Antimicrobial Resistance Genome Analysis Su Datt Lam Su Datt Lam, Sabrina Di Gregorio, Mia Yang Ang, Emma Griffiths, Tengku Zetty Maztura Tengku Jamaluddin, Sheila Nathan, Hui-Min Neoh Antimicrobial resistance (AMR) is a global health crisis, projected to cause 39 million deaths by 2050. Surveillance of AMR pathogens is essential for tracking their resistance profiles and guiding interventions. However, many public health professionals face barriers in bioinformatics expertise and computational resources, limiting their ability to analyse pathogen genomes effectively. We developed AMRColab, an open-source, modular bioinformatics suite hosted on Google Colaboratory. Released under the CC BY 4.0 license, AMRColab enables users with minimal technical background to detect and visualise AMR determinants from pathogen genomes in a ‘plug-and-play’ format—without requiring local installation or HPC infrastructure. The platform integrates tools such as AMRFinderPlus, ResFinder, and hAMRonization, supporting comparative and transmission analysis. A proof-of-concept study using MRSA strains validated AMRColab’s effectiveness across labs. Two workshops with 60 participants demonstrated high adoption potential, with participants reporting increased confidence in using genomics for AMR surveillance. We recently introduced two genome assembly modules: (1) SPAdes and QUAST for Illumina/IonTorrent data; (2) a Nanopore pipeline with FastQC, FastP, NanoPlot, Flye, medaka, and BactInspector. These new modules are currently in beta testing and will be featured in future workshops. These standalone modules extend AMRColab into a full workflow from raw reads to AMR profiling. AMRColab’s accessible design makes it valuable for medical laboratory technologists, clinicians, and public health researchers to perform genome analysis, regardless of their computational expertise. By lowering technical barriers, AMRColab contributes towards democratizing AMR surveillance and equipping healthcare professionals with essential genomic analysis tools. Project repository: https://github.com/amrcolab/AMRColab/ License: CC BY 4.0
2025-07-21 14:00:00 14:20:00 03A BOSC NApy: Efficient Statistics in Python for Large-Scale Heterogeneous Data with Enhanced Support for Missing Data Fabian Woller Fabian Woller, Lis Arend, Christian Fuchsberger, Markus List, David B. Blumenthal Existing Python libraries and tools lack the ability to efficiently run statistical tests (such as Pearson correlation, ANOVA, or the Mann-Whitney U test) on large datasets in the presence of missing values. This presents an issue as soon as constraints on runtime and memory availability become essential considerations for a particular use case. Relevant research areas where such limitations arise include interactive tools and databases for exploratory analysis of large mixed-type data. At the same time, biomedical data analyses on such large datasets (e.g., population cohorts or electronic health record data) still mostly investigate statistical associations between specific variables (e.g., correlations between measurements such as body mass index and blood pressure). However, the rapidly growing popularity of systems approaches in biomedicine makes it increasingly relevant to be able to efficiently compute pairwise statistical associations for all available pairs of variables in a dataset. To address this problem, we present the Python tool NApy, which relies on a Numba and C++ backend with OpenMP parallelization to enable scalable statistical testing for mixed-type datasets in the presence of missing values. With respect to both runtime and memory consumption, we assess NApy’s efficiency on simulated as well as real-world input data originating from a population cohort study. We show that NApy outperforms Python competitor tools and baseline implementations with naïve Python-based parallelization by orders of magnitude, enabling on-the-fly analyses in interactive applications. NApy is publicly available at https://github.com/DyHealthNet/NApy.
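To make the computational problem concrete, here is a naïve NumPy/SciPy baseline for all-pairs Pearson correlation on pairwise-complete observations. This is a sketch of the kind of slow reference implementation NApy is reported to outperform, not NApy's own API:

```python
import numpy as np
from scipy import stats

# Toy mixed dataset: rows = variables, columns = samples, NaN = missing.
rng = np.random.default_rng(0)
data = rng.normal(size=(4, 100))
data[rng.random(data.shape) < 0.1] = np.nan  # ~10% missing values

# Naive all-pairs Pearson correlation on pairwise-complete observations;
# each pair gets its own missingness mask before testing.
n_vars = data.shape[0]
r = np.full((n_vars, n_vars), np.nan)
p = np.full((n_vars, n_vars), np.nan)
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        mask = ~np.isnan(data[i]) & ~np.isnan(data[j])
        if mask.sum() > 2:  # need at least 3 shared observations
            r[i, j], p[i, j] = stats.pearsonr(data[i][mask], data[j][mask])
print(r)
```

The quadratic pair loop and per-pair masking are exactly what becomes prohibitive at cohort scale, which is the gap a compiled, parallel backend addresses.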
2025-07-21 14:00:00 14:20:00 03A BOSC pyANI-plus -- whole-genome classification of microbes using Average Nucleotide Identity and similar methods Peter Cock Peter Cock, Angelika Kiepas, Leighton Pritchard pyANI-plus is an open-source, MIT-licensed Python tool for whole-genome classification of microbes using Average Nucleotide Identity (ANI) and similar methods. It reimplements our earlier tool pyani with additional schedulers and methods. Rather than biological applications or method insights, this presentation focuses on technical changes. The workflow system snakemake is used internally as a scheduler-agnostic high-performance compute cluster wrapper. Compute jobs call the underlying tools and cache results as JSON files, which the main process imports into an SQLite3 database. The slowest methods can take around a day to compute for a thousand bacteria, meaning one million pairwise comparisons. The database schema references each FASTA-format input genome by MD5 checksum, and each pairwise comparison references the query and subject checksums and the method configuration (including underlying tool versions). This enables efficient reuse of previously computed results for common use cases like resuming an interrupted run, expanding an analysis by adding additional genomes, or reporting on a subset (for example, after removing outliers). The plotting commands provided use matplotlib and seaborn, but also export the associated data as tab-separated plain text, allowing users to produce their own custom figures. Our command line interface is defined using the typer library and Python type annotations. All type annotations are validated with mypy, and code linting and style formatting are handled with ruff; both run automatically via pre-commit hooks and continuous integration testing. We have full test coverage in terms of lines of code, explicitly excluding corner cases like race conditions.
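The checksum-keyed caching design described in this abstract can be illustrated with a small sketch. The table layout, column names, and the run_comparison() stub below are hypothetical stand-ins for illustration, not pyANI-plus's actual schema or internals:

```python
import hashlib
import sqlite3

# Conceptual sketch of checksum-keyed result caching: genomes are
# identified by MD5 checksum, so previously computed comparisons can
# be reused after a resumed or expanded run.
def md5_of_fasta(path):
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()

def run_comparison(query_fa, subject_fa):
    # Stand-in for invoking the underlying alignment tool.
    return 0.95

db = sqlite3.connect("comparisons.sqlite")
db.execute("""CREATE TABLE IF NOT EXISTS comparisons (
    query_md5 TEXT, subject_md5 TEXT, method TEXT, identity REAL,
    PRIMARY KEY (query_md5, subject_md5, method))""")

def cached_identity(query_fa, subject_fa, method="ANIm"):
    q, s = md5_of_fasta(query_fa), md5_of_fasta(subject_fa)
    row = db.execute(
        "SELECT identity FROM comparisons "
        "WHERE query_md5=? AND subject_md5=? AND method=?",
        (q, s, method),
    ).fetchone()
    if row is not None:
        return row[0]  # reuse a previously computed comparison
    identity = run_comparison(query_fa, subject_fa)
    db.execute("INSERT INTO comparisons VALUES (?,?,?,?)",
               (q, s, method, identity))
    db.commit()
    return identity
```

Keying on content checksums rather than file names is what makes the cache robust to renamed or re-downloaded genomes.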
2025-07-21 14:20:00 14:40:00 03A BOSC InterProScan 6: a modern large-scale protein function annotation pipeline Matthias Blum Matthias Blum, Emma Hobbs, Laise Florentino, Alex Bateman InterProScan 6 represents a major step forward in protein function annotation, addressing the scalability, modularity, and usability limitations of its predecessor. Re-engineered as a Nextflow-based workflow, InterProScan 6 is optimised for flexible deployment across a wide range of computational environments, from local workstations and high-performance computing clusters to cloud infrastructures, enabling efficient analysis of large protein datasets. A key architectural innovation is the decoupling of application code from signature databases, allowing users to download only the required datasets on demand. This modular design significantly reduces storage overhead and supports concurrent use of multiple data releases, enhancing both flexibility and reproducibility. InterProScan 6 also integrates state-of-the-art deep learning predictors, including DeepTMHMM for transmembrane helix prediction and SignalP 6.0 for signal peptide detection, resulting in improved annotation accuracy. Native support for containerisation via Docker, Singularity, and Apptainer ensures consistent execution across platforms and simplifies environment management. The legacy match-lookup service from InterProScan 5 has been replaced by a redesigned Matches API: an intuitive, RESTful interface providing programmatic access to precomputed InterPro matches for all UniParc sequences. This facilitates seamless integration with a wide range of external tools and workflows. InterProScan 6 delivers substantial improvements in flexibility, annotation quality, and accessibility. The alpha release is scheduled for late April 2025, with a full release planned for summer 2025 under the Apache 2.0 open-source license.
2025-07-21 14:40:00 15:00:00 03A BOSC Real-time base modification analysis for nanopore sequencing Suneth Samarasinghe Suneth Samarasinghe, Hasindu Gamaarachchi, Ira Deveson Real-time analysis of DNA base modifications, particularly methylation, is crucial for making rapid decisions in contexts such as forensics and clinical settings, especially when combined with selective or adaptive sequencing. Traditional methods like bisulfite sequencing, while accurate, are limited by their need for large DNA samples and their inability to capture long-range methylation patterns. Nanopore sequencing, with its ability to generate long reads and detect base modifications through electrical signal analysis, offers a promising alternative. We introduce RealFreq, a lightweight framework for retrieving real-time base modification frequencies while sequencing on nanopore devices. RealFreq is composed of two primary components: realfreq-pipeline, a modular script that manages data flow from the sequencing device to base modification detection (monitoring the raw-signal directory, basecalling raw signals, and aligning them to a reference genome), and realfreq-program, a C-based application that calls base modifications from the information in alignment files and maintains an in-memory map of base modifications. Additionally, we developed realfreq-server within realfreq-program using simple socket connections, providing an interface for querying base modification information in real time. We demonstrate RealFreq’s ability to keep up with the output data stream of the Oxford Nanopore Technologies (ONT) PromethION. The base modification calling algorithm we developed for realfreq-program is separately bundled with Minimod, a simple base modification analysis tool we developed. Minimod’s output aligns with that of current state-of-the-art tools while improving execution time by 2x for DNA and 40x for RNA datasets on a laptop computer.
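Since realfreq-server is described as exposing a simple socket interface, a client query might look roughly like the sketch below. The host, port, and query/response format here are assumptions made for illustration; the abstract does not specify RealFreq's actual protocol:

```python
import socket

# Minimal sketch of a client querying a local socket server such as
# realfreq-server. Host, port, and the line-based region query are
# hypothetical; consult the RealFreq documentation for the real protocol.
HOST, PORT = "localhost", 8080

with socket.create_connection((HOST, PORT), timeout=5) as sock:
    sock.sendall(b"chr20:58000000-58001000\n")  # hypothetical region query
    response = sock.recv(4096).decode()

print(response)  # e.g., per-site methylation frequencies for the region
```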
2025-07-21 14:40:00 15:00:00 03A BOSC Open-Source GPU Acceleration for State-of-the-Art Nanopore Basecalling with Slorado Bonson Wong Bonson Wong, Hasindu Gamaarachchi Nanopore sequencing has become a popular technology for genomic research because of its cost-effectiveness and ability to sequence long reads. Nanopore technology offers solutions from portable sequencing devices, such as the MinION designed for in-field applications, to large-scale sequencing devices like the PromethION. A nanopore sequencer generates a time-series ‘raw signal’, which is then converted into a nucleobase sequence (A, C, G, T) through basecalling. The basecalling step, however, can only be performed on a narrow range of hardware. Much of the implementation in the current state-of-the-art Dorado basecaller relies on a closed-source binary package for platform-specific optimisations. Dorado is developed specifically for high-compute Nvidia Graphics Processing Units (GPUs) as its main platform. Basecalling without these optimisations is impractical, so researchers working in resource-constrained environments are limited by Dorado’s narrow hardware compatibility. We aim to open-source these large sections of the codebase to make basecalling technology accessible to researchers and developers. We provide two open-source software packages to the genomics community: ‘Openfish’ is a library that accelerates CRF decoding tailored to nanopore signal processing, implementing Dorado’s GPU decoding for NVIDIA and AMD hardware. As a framework for testing and benchmarking the entire basecalling pipeline, we have also built the application ‘Slorado’: a lean and open-source basecaller that can be easily compiled for NVIDIA and AMD machines.
2025-07-21 14:40:00 15:00:00 03A BOSC Voyager-SDK: integrating and automating pipeline runs using Voyager-SDK and Voyager platform Sinisa Ivkovic Sinisa Ivkovic, Christopher Allan Bolipata, Nikhil Kumar, Eric Buehler, Danielle Pankey, Adrian Fraiha, Mark Donoghue, Nicholas Socci, Ronak Shah, David B. Solit At Memorial Sloan Kettering Cancer Center (MSKCC), we developed Voyager, a platform to automate the execution of computational pipelines built using the community standards Common Workflow Language (CWL) and Nextflow. Voyager streamlines the orchestration and monitoring of pipelines across various compute environments. By leveraging the nf-core input schema for Nextflow pipelines and the CWL schema, Voyager abstracts input handling across both technologies, enabling seamless integration and execution regardless of the underlying workflow engine. To enable broader adoption and community contribution, we are introducing the Voyager SDK—a toolkit that empowers developers to integrate their pipelines into the platform via modular components called Operators. As the number of pipelines in our organization grew, it became increasingly important to decouple the logic of these Operators from the core Voyager codebase. Operators encapsulate pipeline-specific logic and metadata, providing a structured interface to the Voyager engine. By externalizing this logic through the SDK, we enable independent development, promote extensibility and portability, and empower developers to onboard new pipelines without modifying the platform itself. This talk will present the architecture of the Voyager platform, demonstrate how the SDK supports the creation and testing of Operators, and discuss how open standards and open-source tooling have been central to our development strategy. We will also share lessons learned from building infrastructure that balances institutional requirements with community best practices.
2025-07-21 15:00:00 15:20:00 03A BOSC Empowering Bioinformatics Communities with nf-core: The success of an open-source bioinformatics ecosystem Jose Espinosa-Carrasco Sven Nahnsen, Peter W. Harrison, Matthias Hörtenhuber, Cyril Kurylo, Christa Kühn, Sandrine Lagarrigue, Delphine Lallias, Daniel J. Macqueen, Edmund Miller, Júlia Mir-Pedrol, Gabriel Costa Monteiro Moreira, Friederike Hanssen, Harshil Patel, Alexander Peltzer, Frederique Pitel, Yuliaxis Ramayo-Caldas, Marcel da Câmara Ribeiro-Dantas, Dominique Rocha, Mazdak Salavati, Alexey Sokolov, Jose Espinosa-Carrasco, Cedric Notredame, James A. Fellows Yates, Andreia Amaral, Marie-Odile Baudement, Franziska Bonath, Mathieu Charles, Praveen Krishna Chitneedi, Emily L. Clark, Paolo Di Tommaso, Sarah Djebali, Philip A. Ewels, Sonia Eynard, Björn Langer, Daniel Fischer, Evan W. Floden, Sylvain Foissac, Gisela Gabernet, Maxime U. Garcia, Gareth Gillard, Manu Kumar Gundappa, Cervin Guyomar, Christopher Hakkaart The nf-core community exemplifies how open-source software development fosters collaboration, innovation, and sustainability in bioinformatics. nf-core currently features a curated collection of 124 pipelines built using the Nextflow workflow management system according to community-agreed standards. These standards ensure that nf-core pipelines are high-quality, portable, and reproducible. Since its creation in 2018, nf-core has grown steadily. Notably, the project has expanded beyond genomics and now supports pipelines across domains such as imaging, mass spectrometry, and protein structure prediction, as well as disciplines outside the life sciences, such as economics or earth biosciences. One of the reasons for this community's success is nf-core’s strong commitment to outreach and inclusiveness, exemplified by free training videos, hackathons, webinars, and a mentorship program that supports newcomers and underrepresented groups. The recent introduction of Domain-Specific Language 2 (DSL2) in Nextflow enabled the development of reusable software components (modules and subworkflows), accelerating pipeline development. Currently, nf-core provides access to over 1,000 modules (single-tool components) and 50 subworkflows (combinations of modules that wrap high-level functionalities). This modular architecture strengthened nf-core’s collaborative nature, making it easier and more appealing to contribute. This open, community-driven framework was key for six European research consortia under the EuroFAANG umbrella, dedicated to farmed animal genomics. These consortia adopted nf-core as their standard for pipeline development, ensuring the long-term sustainability of their contributions. Notably, nf-core standards have also been adopted by flagship projects such as the Darwin Tree of Life and Genomics England, highlighting the broad value and impact of the nf-core model.
2025-07-21 15:20:00 15:40:00 03A BOSC JASPAR-Suite: An open toolkit for accessing TF binding motifs Aziz Khan Aziz Khan, Anthony Mathelier The JASPAR database (https://jaspar.elixir.no) is a widely used open-access database of manually curated, non-redundant transcription factor (TF) binding profiles across multiple species, supporting the global community of gene regulation researchers. As the field of regulatory genomics grows increasingly data-driven, JASPAR plays a vital role in providing high-quality position frequency matrices (PFMs) for TFs, enabling insights into gene expression regulation, enhancer activity, and transcriptional networks. The JASPAR database can be accessed through several user-friendly and programmatic interfaces, including a web interface for intuitive exploration, a RESTful API for cross-platform integration, the Bioconductor package for R users, and pyJASPAR—a flexible and Pythonic toolkit for both interactive and command-line access to TF motifs. In this talk, we will demonstrate how JASPAR can be accessed using its RESTful API (https://jaspar.elixir.no/api/) from any programming environment, allowing seamless integration into bioinformatics workflows. I will also introduce pyJASPAR (https://github.com/asntech/pyjaspar), a lightweight Python package we developed to make JASPAR motif queries easy, scriptable, and reproducible—whether from a Jupyter notebook or a shell terminal. Together, these tools form the JASPAR Suite, designed to empower the scientific community with open, reproducible, and interoperable access to TF binding motifs. All the code, data, and workflows are openly available under open licenses, supporting transparency and reproducibility in computational biology research.
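As a quick illustration of the programmatic access described in this abstract, the sketch below uses pyJASPAR's documented interface. The method names follow its README at the time of writing, but treat this as a sketch and verify against the release you have installed:

```python
from pyjaspar import jaspardb

# Query TF binding motifs with pyJASPAR (https://github.com/asntech/pyjaspar).
# jaspardb() defaults to the latest bundled JASPAR release; a specific
# release can typically be requested via the constructor.
jdb = jaspardb()

# Fetch all profiles for a TF by name.
ctcf_motifs = jdb.fetch_motifs_by_name("CTCF")
print(len(ctcf_motifs), "CTCF profiles found")

# Fetch a single profile by matrix ID (base ID resolves to the latest version).
motif = jdb.fetch_motif_by_id("MA0139")
print(motif.name, motif.matrix_id)
print(motif.counts)  # position frequency matrix counts per base
```

The returned motifs are Biopython motif objects, so downstream scanning and PWM conversion can follow the usual Bio.motifs workflow.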
2025-07-21 15:20:00 15:40:00 03A BOSC VueGen: automating the generation of scientific reports Sebastian Ayala-Ruano Sebastian Ayala-Ruano, Henry Webel, Alberto Santos Delgado The analysis of omics data typically involves multiple bioinformatics tools and methods, each producing distinct output files. However, compiling these results into comprehensive reports often requires additional effort and technical skills. This creates a barrier for non-bioinformaticians, limiting their ability to produce reports from their findings. Moreover, the lack of streamlined reporting workflows impacts reproducibility and transparency, making it difficult to communicate results and track analytical processes. Here, we present VueGen, an open-source software tool that addresses the limitations of current reporting tools by automating report generation from bioinformatics outputs, allowing researchers with minimal coding experience to communicate their results effectively. With VueGen, users can produce reports by simply specifying a directory containing output files, such as plots, tables, networks, Markdown text, and HTML components, along with the report format. Supported formats include documents (PDF, HTML, DOCX, ODT), presentations (PPTX, Reveal.js), Jupyter notebooks, and Streamlit web applications. To showcase VueGen’s functionality, we present two case studies and provide detailed documentation to help users generate customized reports. VueGen was designed with accessibility and community contribution in mind, offering multiple implementation options for users with varying technical expertise. It is available as a Python package, a portable Docker image, and an nf-core module, leveraging established open-source ecosystems to facilitate integration and reproducibility. Furthermore, a cross-platform desktop application for macOS and Windows provides a user-friendly interface for users less familiar with command-line tools. The source code is freely available at https://github.com/Multiomics-Analytics-Group/vuegen. Documentation is provided at https://vuegen.readthedocs.io/.
2025-07-21 15:20:00 15:40:00 03A BOSC The world’s biomedical knowledge in less than a gram: introducing the PGP incubator Peter Amstutz, Sarah Wait Zaranek, Alexander (Sasha) Wait Zaranek, Zoe Ma In this talk, we describe a new project, the Personal Genome Project (PGP) incubator. PGPincubator is an effort to create a distribution of open data, tools, workflows, AI models and learning materials that support validation, benchmarking, and education in bioinformatics and biomedicine for precision health and (pre-clinical) biomedical AI. In addition, the incubator is a distributed network of physical computing infrastructure used to test components included in the distribution, such as validating genomics workflows or benchmarking AI models. To help hatch this network, PGPincubator is running a private network of “h-grams.” An h-gram is 1-4 microSD cards (3-4 of them weigh about a gram!), each flashed with an operating system image that can be booted on compatible commodity PC hardware. The operating system (Ubuntu) is pre-configured to act as a server suitable for home, office or lab and is accessed by other devices through a browser. Each h-gram is pre-loaded with hundreds of gigabytes of openly licensed infrastructure software, bioinformatics tools, genomic datasets, AI models, and learning resources. The PGPincubator data and software distribution pre-loaded on the h-gram will be updated on a six-month release schedule, inspired by Linux distribution releases. With both software and datasets distributed in versioned releases, it becomes far easier for researchers to precisely identify the software and data used in their work, for others to reproduce that work, and for students to study that work, while ensuring that validation and benchmarking methods are run fairly against a common baseline.
2025-07-21 15:40:00 16:00:00 03A BOSC Slivka: a new ecosystem for wrapping local code as web services Jim Procter Jim Procter, Mateusz Warowny, Stuart MacGowan, Javier Utges, Geoff Barton Slivka is a Python/Flask/MongoDB framework that allows command-line tools to be made available as web services through the creation of a YAML document providing flexible execution configuration and semantic service discovery. Deployable via conda and Docker, the framework has been used to provide services for the Jalview desktop and web-based platform for interactive sequence alignment and analysis, and new services for the analysis of protein–ligand binding sites. Slivka is released under the Apache 2.0 license and is being developed under an open consortium model to foster community support across both industry and academia.
2025-07-21 16:40:00 17:00:00 03A BOSC FAIRDOM-SEEK: Platform for FAIR data and research asset management Munazah Andrabi Munazah Andrabi, Stuart Owen, Finn Bacall, Phil Reed, Xiaoming Hu, Ulrike Wittig, Maja Rey, Martin Golebiewski, Flora D'Anna, Kevin De Pelseneer, Jacky Snoep, Wolfgang Müller, Carole Goble As research becomes more data-driven, collaborative, and interdisciplinary, the need for structured, accessible, and well-curated data outputs with rich, standardized metadata is critical to ensure data is discoverable and reusable beyond its original context. The FAIRDOM-SEEK platform addresses these challenges by providing a customisable, open-source, web-based catalogue designed to support FAIR (Findable, Accessible, Interoperable, Reusable) data and asset management. FAIRDOM-SEEK enables scientists to organize, document, share, and publish research data using the Investigation, Study, Assay (ISA) framework, which structures experiments and related assets such as samples and protocols. Key features include robust metadata and sample management, version control, linking to external repositories, and integration with modeling tools. Controlled sharing and DOI creation further enhance collaboration and long-term accessibility. The platform supports the creation of dedicated Project Hubs, which are customised local instances deployed for specific projects. These allow tailored use of the platform’s core capabilities, including modified appearance, structure, and content. Notable examples of hubs include IBISBAHub, WorkflowHub, the NFDI4Health Local DataHub and DataHub. The MIT BioMicroCenter has integrated the platform to streamline data and sample management for their ongoing research initiatives. In addition, FAIRDOMHub, the flagship public instance, serves over 400 national and international projects as both a repository and a knowledge-sharing platform, promoting interdisciplinary collaboration and community engagement. As a core resource for many European organisations (e.g., de.NBI, ELIXIR) and international consortia, FAIRDOMHub plays a vital role in research data management. In the talk we will present the salient features of FAIRDOM-SEEK and highlight how it facilitates FAIR data management.
2025-07-21 17:00:00 17:20:00 03A BOSC Walkthrough of GA4GH standards and interoperability it provides for genomic data implementations Jimmy Payyappilly Jimmy Payyappilly, Dashrath Chauhan, Sasha Siegel, Andrew D Yates, Chen Chen, Deeptha Srirangam The sharing of genomic and health-related data for biomedical research is of key importance in ensuring continued progress in our understanding of human health and wellbeing. In this journey, bioinformatics and genomics continue to be closely coupled. To further expand the benefits of research, the Global Alliance for Genomics and Health (GA4GH) builds the foundation for the broad and responsible use of genomic data by setting standards and framing policies to expand genomic data use, guided by the Universal Declaration of Human Rights. As is true with any data, interoperability between open-source systems processing genomic data is vital. When systems are based on standards, interactions between technical ecosystems become easier, as there is a common framework for interacting and requesting resources. In this talk, we present two GA4GH open-source standards whose reference implementations demonstrate interoperability between standards. Through this session, we will showcase use cases of how these standards support data science for genomics research and ensure easy discoverability of data across the globe.
2025-07-21 17:00:00 17:20:00 03A BOSC LiMeTrack: A lightweight biosample management platform for the multicenter SATURN3 consortium Laura Godfrey Florian Heyl, Jonas Gassenschmidt, Lukas Heine, Frederik Voigt, Jens Kleesiek, Oliver Stegle, Jens Siveke, Melanie Boerries, Roland Schwarz, Laura Godfrey Biomedical research projects involving large patient cohorts are increasingly complex, both in terms of data modalities and number of samples. Hence, they require robust data management solutions to foster data integrity, reproducibility and secondary use compliant with the FAIR principles. SATURN3, a German consortium with 17 partner sites, investigates intratumoral heterogeneity using patient biosamples. As part of a complex, multicenter workflow, high-level multimodal analyses include bulk, single-cell, and spatial omics and corresponding data analysis. To manage this complexity and to avoid miscommunication, data loss and de-synchronization across project sites, harmonization in a central infrastructure is essential. Additionally, real-time monitoring of the sample processing status must be accessible to all project partners throughout the project. This use case goes far beyond the capabilities of spreadsheets, which are susceptible to security vulnerabilities, versioning mistakes, data loss and type conversion errors. Existing data management tools are often complex to set up or lack the flexibility needed to adapt to specific project requirements. To address these challenges, we introduce LightMetaTrack (LiMeTrack), a biosample management platform built on the Django framework. Key features include customizable and user-friendly forms for data entry and a real-time dashboard for project and sample status overview. LiMeTrack simplifies the creation and export of sample sheets, streamlining subsequent bioinformatics analyses and research workflows. By integrating real-time monitoring with robust sample tracking and data management, LiMeTrack improves research transparency and reproducibility, ensures data integrity and optimizes workflows, making it a powerful solution for biosample management in multicenter biomedical research endeavours.
2025-07-21 17:00:00 17:20:00 03A BOSC BFVD – A release of 351k viral protein structure predictions Rachel Seongeun Kim Rachel Seongeun Kim, Eli Levy Karin, Milot Mirdita, Rayan Chikhi, Martin Steinegger While the AlphaFold Protein Structure Database (AFDB) is the largest resource of accurately predicted structures – covering 214 million UniProt entries with taxonomic labels – it excludes viral sequences, limiting its utility for virology. To fill this gap, we present the Big Fantastic Virus Database (BFVD), a repository of 351,242 protein structures predicted using ColabFold on viral sequence representatives of the UniRef30 clusters. By augmenting Logan’s petabase-scale SRA assemblies in homology searches and applying 12-recycle refinement, we further enhanced the confidence scores of 41% of BFVD entries. BFVD serves as an essential, viral-focused expansion to existing protein structure repositories. Over 62% of its entries show no or low structural similarity to the PDB and AFDB, underscoring the novelty of its content. Notably, BFVD enables identification of a substantial fraction of bacteriophage proteins, which remain uncharacterized at the sequence level, by matching them to similar structures. In this respect, BFVD is on par with the AFDB, despite holding nearly three orders of magnitude fewer structures. Freely downloadable at bfvd.steineggerlab.workers.dev and explorable via Foldseek with UniProt labels at bfvd.foldseek.com, BFVD offers new opportunities for advanced viral research.
2025-07-21 17:20:00 17:40:00 03A BOSC AI-readiness for biomedical data: Bridge2AI recommendations Monica Munoz-Torres Monica Munoz-Torres The convergence of biomedical research and artificial intelligence (AI) promises unprecedented insights into complex biological systems. However, realizing this potential demands datasets meticulously designed for AI/ML analysis, addressing challenges from data quality to the critical imperatives of explainable AI (XAI) and ethical, legal, and social implications (ELSI). The NIH Bridge2AI program is at the forefront of this effort, creating flagship biomedical datasets and establishing best practices for AI/ML data science. This paper, authored by the Bridge2AI Standards Working Group, introduces foundational criteria for assessing the AI-readiness of biomedical data. We present actionable methods and data standards perspectives developed within the program, emphasizing their crucial role in fostering scientific rigor and responsible innovation. These AI-readiness criteria encompass essential considerations for XAI – ensuring the interpretability of AI-driven discoveries – and proactive integration of ELSI principles. While the landscape of biomedical AI rapidly evolves, these guidelines provide a vital framework for scientific rigor, enabling the creation and utilization of high-quality, ethically sound data resources that will drive impactful advancements in bioinformatics and beyond. During this presentation, we will examine these foundational standards that are shaping the future of AI in molecular biology and medicine.
2025-07-21 17:40:00 17:50:00 03A BOSC Bridging the gap: advancing aging & dementia research through the open-access AD Knowledge Portal Susheel Varma Jo Scanlan, Amelia Kallaher, Zoe Leanza, Jessica Britton, Jaclyn Beck, Beatriz Saldana, Anthony Pena, William Poehlman, Victor Baham, Trisha Zintel, Jesse Wiley, Karina Leal, Jessica Malenfant, Laura Heath, Susheel Varma The AD Knowledge Portal (adknowledgeportal.org) is an NIA-funded resource developed by Sage Bionetworks to facilitate Alzheimer's Disease research through open data sharing. The secure Synapse platform enables researchers to share data with proper attribution while ensuring compliance with FAIR principles. The Portal aggregates resources from 14 NIH-funded research programs and 97 aging-related grants, housing approximately 800TB of data from over 11,000 individuals. This multimodal data encompasses genomics, transcriptomics, epigenetics, imaging, proteomics, metabolomics, and behavioural assays from various sources, including brain banks, longitudinal cohorts, cell lines, and animal models. Recent additions include 290 TB of single-cell and nucleus expression data, alongside experimental tools and computational resources. All content is available under Creative Commons BY 4.0 licenses, with software under open-source licenses such as Apache 2.0. The Portal's code is publicly available on GitHub with comprehensive documentation. The Community Data Contribution Program extends the Portal's scope beyond NIA-funded projects. Since January 2022, over 6,000 unique users have downloaded 12.57 PB of data, with monthly downloads doubling between 2023 and 2024. Portal data has been cited in over 1,000 publications since 2019, with more than half of these representing the reuse of secondary data. Integration with platforms like CAVATICA and Terra enhances accessibility. Future developments include interoperability with AD Workbench, NACC, NIAGADS, and LONI, as well as new data types such as spatial transcriptomics and longitudinal data from Alzheimer's disease models.
2025-07-21 17:50:00 18:00:00 03A BOSC The ELITE Portal: A FAIR Data Resource For Healthy Aging Over The Life Span Susheel Varma Milan Vu, Tanveer Talukdar, Amelia Kallaher, Melissa Klein, Natosha Edmonds, Jessica Malenfant, Christine Suver, Laura Heath, Alberto Pepe, Luca Foschini, Solly Sieberts, Susheel Varma Exceptional longevity (EL) is a rare phenotype characterized by an extended health span and sustained physiological function. Various domain-specific factors contribute to EL, influencing the maintenance of key physiological systems (e.g., respiratory, cardiovascular, immune) and functional domains (e.g., mobility, cognition). Studying the impacts of protective genetic variants and cellular mechanisms associated with EL facilitates the identification of novel therapeutic targets that replicate their beneficial effects. The Exceptional Longevity Translational Resources (ELITE) Portal (eliteportal.synapse.org) is a new, open-access repository for disseminating data and other research resources from translational longevity research projects. The portal supports diverse data types, including genetic, transcriptomic, epigenetic, proteomic, metabolomic, and phenotypic data from longitudinal human cohort studies and cross-species comparative biology studies of tens to hundreds of nonhuman species; data from longevity-enhancing intervention studies in mouse and cell models; access to web applications and software tools to support exploration of EL-related research outcomes; and a catalog of publications associated with the National Institute on Aging (NIA)-funded translational longevity research projects. The portal also integrates with the external Trusted Research Environment (TRE) CAVATICA and is poised to support future integrations with additional data resources like Terra. All resources hosted in the ELITE Portal are distributed under FAIR (Findable, Accessible, Interoperable, and Reusable) principles. The ELITE Portal is funded by NIA grants 5U24AG078753 and 2U24AG061340.
2025-07-22 11:20:00 12:20:00 03A BOSC Open Knowledge Bases in the Age of Generative AI Chris Mungall Dr. Chris Mungall is a Senior Scientist at Berkeley Lab, where he heads the Biosystems Data Science department in the Environmental Genomics and Systems Biology Division. Chris’s research interests center around the capture, computational integration, and dissemination of biological research data, and the development of methods for using this data to elucidate biological mechanisms underpinning the health of humans and of the planet. He and his team have led the creation of key biological ontologies for the integration of resources covering gene function, anatomy, phenotypes and the environment, including the Uberon anatomy ontology, the Cell Ontology (CL), and the Mondo disease ontology. He is also one of the cofounders of the OBO Foundry. For decades, he has been a strong advocate for open-source bioinformatics software, open standards, and open science.
2025-07-22 12:20:00 12:40:00 03A BOSC textToKnowledgeGraph: Generation of Molecular Interaction Knowledge Graphs Using Large Language Models for Exploration in Cytoscape Favour James Favour James, Christopher Churas, Trey Ideker, Dexter Pratt, Augustin Luna Knowledge graphs (KGs) are powerful tools for structuring and analyzing biological information due to their ability to represent data and improve queries across heterogeneous datasets. However, constructing KGs from unstructured literature remains challenging due to the cost and expertise required for manual curation. Prior works have explored text-mining techniques to automate this process, but have limitations that impact their ability to capture complex relationships fully. Traditional text-mining methods struggle with understanding context across sentences. Additionally, these methods lack expert-level background knowledge, making it difficult to infer relationships that require awareness of concepts indirectly described in the text. Large Language Models (LLMs) present an opportunity to overcome these challenges. LLMs are trained on diverse literature, equipping them with contextual knowledge that enables more accurate extraction. Additionally, LLMs can process the entirety of an article, capturing relationships across sections rather than analyzing single sentences; this allows for more precise extraction. We present textToKnowledgeGraph, an artificial intelligence tool using LLMs to extract interactions from individual publications directly in Biological Expression Language (BEL). BEL was chosen for its compact and detailed representation of biological relationships, allowing for structured and computationally accessible encoding. This work makes several contributions: (1) development of the open-source Python textToKnowledgeGraph package (pypi.org/project/texttoknowledgegraph) for BEL extraction from scientific articles, usable from the command line and within other projects; (2) an interactive application within Cytoscape Web to simplify extraction and exploration; and (3) a dataset of extractions, both computationally and manually reviewed, to support future fine-tuning efforts.
2025-07-22 12:40:00 13:00:00 03A BOSC BMFM-RNA: An Open Framework for Building and Evaluating Transcriptomic Foundation Models built on Biomed-Multi-Omic Bharath Dandala Bharath Dandala, Michael M Danziger, Ching-Huei Tsou, Akira Koseki, Viatcheslav Gurev, Tal Kozlovski, Ella Barkan, Matthew Madgwick, Akihiro Kosugi, Tanwi Biswas, Liran Szalk, Matan Ninio High-throughput sequencing has revolutionized transcriptomic studies, and the synthesis of these diverse datasets holds significant potential for a deeper understanding of cell biology. Recent advancements have introduced several promising techniques for building transcriptomic foundation models (TFMs), each emphasizing unique modeling decisions and demonstrating potential in handling the inherent challenges of high-dimensional, sparse data. However, despite their individual strengths, current TFMs still struggle to fully capture biologically meaningful representations, highlighting the need for further improvements. Recognizing that existing TFM approaches possess complementary strengths and weaknesses, a promising direction lies in the systematic exploration of various combinations of design, training, and evaluation methodologies. Thus, to accelerate progress in this field, we present bmfm-rna, a comprehensive framework that not only facilitates this combinatorial exploration but is also inherently flexible and easily extensible to incorporate novel methods as the field continues to advance. This framework enables scalable data processing and features extensible transformer architectures. It supports a variety of input representations, pretraining objectives, masking strategies, domain-specific metrics, and model interpretation methods. Furthermore, it facilitates downstream tasks such as cell type annotation, perturbation prediction, and batch effect correction on benchmark datasets. Models trained with the framework achieve performance comparable to scGPT, Geneformer and other TFMs on these downstream tasks. By open-sourcing this framework, we aim to lower the barriers to developing TFMs and invite the community to build more effective ones. bmfm-rna is available via Apache license at https://github.com/BiomedSciAI/biomed-multi-omic
2025-07-22 12:40:00 13:00:00 03A BOSC DOME Registry - Supporting ML transparency and reproducibility in the life sciences Gavin Farrell Gavin Farrell, Omar Attafi, Silvio Tosatto The adoption of machine learning (ML) methods in the life sciences has been transformative, solving landmark challenges such as accurate protein structure prediction, improving bioimaging diagnostics and accelerating drug discovery. However, researchers face a reuse and reproducibility crisis in ML publications: authors are publishing ML methods that lack the core information needed to transfer value back to the reader. Commonly absent are links to code, data and models, eroding trust in the methods. In response, ELIXIR Europe developed a practical checklist of recommendations covering the key aspects of ML methods that should be disclosed: data, optimisation, model and evaluation. These are now known collectively as the DOME Recommendations, published in Nature Methods by Walsh et al. in 2021. Building on this successful first step towards addressing the ML publishing crisis, ELIXIR has developed a technological solution to support the implementation of the DOME Recommendations. This solution is known as the DOME Registry and was published in GigaScience by Attafi et al. in late 2024. This talk will cover the DOME Registry technology, which serves as a curated database of ML methods for life science publications by allowing researchers to annotate and share their methods. The service can also be adopted by publishers during their ML publishing workflow to increase a publication’s transparency and reproducibility. An overview of the next steps for the DOME Registry will also be provided, considering new ML ontologies, metadata formats and integrations building towards a stronger ML ecosystem for the life sciences.
2025-07-22 12:40:00 13:00:00 03A BOSC AutoPeptideML 2: An open source library for democratizing machine learning for peptide bioactivity prediction Raúl Fernández-Díaz Raúl Fernández-Díaz, Thanh Lam Hoang, Vanessa Lopez, Denis Shields Peptides are a rapidly growing drug modality with diverse bioactivities and accessible synthesis, particularly for canonical peptides composed of the 20 standard amino acids. However, enhancing their pharmacological properties often requires chemical modifications, increasing synthesis cost and complexity. Consequently, most existing data and predictive models focus on canonical peptides. To accelerate the development of peptide drugs, there is a need for models that generalize from canonical to non-canonical peptides. We present AutoPeptideML, an open-source, user-friendly machine learning platform designed to bridge this gap. It empowers experimental scientists to build custom predictive models without specialized computational knowledge, enabling active learning workflows that optimize experimental design and reduce sample requirements. AutoPeptideML introduces key innovations: (1) preprocessing pipelines for harmonizing diverse peptide formats (e.g., sequences, SMILES); (2) automated sampling of negative peptides with matched physicochemical properties; (3) robust test set selection with multiple similarity functions (via the Hestia-GOOD framework); (4) flexible model building with multiple representation and algorithm choices; (5) thorough model evaluation for unseen data at multiple similarity levels; and (6) FAIR-compliant, interpretable outputs to support reuse and sharing. A webserver with GUI enhances accessibility and interoperability. We validated AutoPeptideML on 18 peptide bioactivity datasets and found that automated negative sampling and rigorous evaluation reduce overestimation of model performance, promoting user trust. A follow-up investigation also highlighted the current limitations in extrapolating from canonical to non-canonical peptides using existing representation methods. AutoPeptideML is a powerful platform for democratizing machine learning in peptide research, facilitating integration with experimental workflows across academia and industry.
2025-07-22 14:00:00 14:20:00 03A BOSC BioPortal: a rejuvenated resource for biomedical ontologies J. Harry Caufield J. Harry Caufield, Jennifer Vendetti, Nomi Harris, Michael Dorf, Alex Skrenchuk, Rafael Gonçalves, John Graybeal, Harshad Hegde, Timothy Redmond, Chris Mungall, Mark Musen BioPortal is an open repository of biomedical ontologies that supports data organization, curation, and integration across various domains. Serving as a fundamental infrastructure for modern information systems, BioPortal has been an open-source project for 20 years and currently hosts over 1,500 ontologies, with 1,192 publicly accessible. Recent enhancements include tools for creating cross-ontology knowledge graphs and a semi-automated process for ontology change requests. Traditionally, ontology updates required expertise and were time-consuming, as users had to submit requests through developers. BioPortal's new service expedites this process using the Knowledge Graph Change Language (KGCL). A user-friendly interface accepts change requests via forms, which are then converted to GitHub issues with KGCL commands. The new BioPortal Knowledge Graph (KG-Bioportal) tool merges user-selected ontology subsets using a common graph format and the Biolink Model. An open-source pipeline translates ontologies into the KGX graph format, facilitating interoperability with other biomedical knowledge sources. KG-Bioportal enables more integrated and flexible querying of ontologies, allowing researchers to connect information across domains. Future improvements include enhanced ontology pages, automated metadata updates, and KG features with graph-based search and large language model integration. These enhancements aim to position BioPortal as an interoperable resource that meets the community's evolving needs.
2025-07-22 14:20:00 14:40:00 03A BOSC Formal Validation of Variant Classification Rules Using Domain-Specific Language and Meta-Predicates Michael Bouzinier Michael Bouzinier, Dmitry Etin, Giorgi Shavtvalishvili, Eugenia Lvova This talk aims to initiate a community discussion on strategies for validating the selection and curation of genetic variants for clinical and research purposes. We present our approach using a Domain-Specific Language (DSL), first introduced with the AnFiSA platform at BOSC 2019. Since our 2022 publication, we have continued developing this methodology. At BOSC 2023, we presented two extensions: the strong typing of genetic variables in the DSL, and the application of our framework beyond genetics, into population and environmental health. This year, we focus on validating the provenance and evidentiary support of annotation elements based on purpose, knowledge domain, method of derivation, and scale — an ontology we introduced in 2023. We aim to support two key use cases: (1) logical validation during rule development, and (2) ensuring rule portability when existing rules are adapted for new clinical or laboratory settings. We present a proof of concept using meta-predicates — embedded assertions in DSL scripts that validate specific properties of genetic annotations used in variant curation. This technique draws inspiration from Invariant-based Programming. Finally, we frame our work in the context of AI-assisted code synthesis. Recent studies highlight the advantages of deep learning-guided program induction over test-time training and fine-tuning (TTT/TTFT) for structured reasoning tasks. This reinforces the promise of DSL-based approaches as transparent, verifiable complements to generative AI in modern computational genomics.
2025-07-22 14:40:00 15:00:00 03A BOSC BioChatter: An open-source framework integrating knowledge graphs and large language models for Accessible Biomedical AI Sebastian Lobentanzer Sebastian Lobentanzer The integration of large language models (LLMs) with structured biomedical knowledge remains a key challenge for building robust, trustworthy, and reproducible AI applications in biomedicine. We present BioChatter (https://biochatter.org), an open-source Python framework that bridges ontology-driven knowledge graphs (KGs) and LLMs through a modular, extensible architecture. Built as a companion to the BioCypher ecosystem for constructing biomedical KGs (https://biocypher.org), BioChatter allows researchers to easily build LLM-powered applications grounded in domain knowledge and interoperable data standards. BioChatter emphasises transparent, community-driven development, supported by extensive documentation, real-world usage examples, and active support channels. Its design supports multiple modes of use from lightweight prototyping to server-based deployment and integrates naturally with open LLM ecosystems (e.g., Ollama, LangChain), knowledge graphs, and the Model Context Protocol (MCP) for LLM tool usage. We highlight ongoing applications across biomedical domains, including automated knowledge integration pipelines for drug discovery (Open Targets), clinical decision support prototypes, and data sharing platforms within the German research infrastructure. The open-source nature of BioChatter, together with its benchmark-first approach for validating biomedical LLM applications, facilitates broad adoption and collaboration. By lowering the entry barrier for building trustworthy biomedical AI systems, BioChatter contributes to the growing open-source ecosystem supporting reproducible, transparent, and community-driven AI development in the life sciences.
2025-07-22 15:00:00 15:20:00 03A BOSC Applications of Bioschemas in FAIR, AI and knowledge representation Nick Juty Nick Juty, Phil Reed, Helena Schnitzer, Leyla Jael Castro, Alban Gaignard, Carole Goble Bioschemas.org defines domain-specific metadata schemas based on schema.org extensions, which expose key metadata properties from resource records. This provides a lightweight and easily adoptable means to incorporate key metadata on web records, and a mechanism to link to domain-specific ontology/vocabulary terms. As an established community effort focused on improving the FAIRness of resources in the Life Sciences, we now aim to extend the impact of Bioschemas beyond improvements to ‘findability’. Bioschemas has been used to aggregate data in a distributed environment through federation, using Bioschemas metadata markup. More recently, we are leveraging Bioschemas deployments on resource websites, harvesting directly to populate SPARQL endpoints and subsequently creating queryable knowledge graphs. An improved Bioschemas validation process will assess the ‘FAIR’ level of the user’s web records and suggest the most appropriate Bioschemas profile based on similarity to those in the Bioschemas registry. Our experience in operating this community will be extended into non-‘bio’ domains wishing to more easily incorporate ontologies and metadata in their web-based records. To that end, we have a sister site dedicated to hosting the many domain-agnostic types/profiles that have already emerged from our work (so far 7 profiles aligned to digital objects in research, e.g., workflows, datasets, training materials): https://schemas.science/. Through this infrastructure we will develop a sustainable, cross-institutional collaborative space for long-term and wide-ranging impact, supporting our existing community engagement with global AI, ML, and Training communities, and others in the future.
2025-07-22 15:20:00 15:40:00 03A BOSC RO-Crate: Capturing FAIR research outputs in bioinformatics and beyond Phil Reed Eli Chadwick, Stian Soiland-Reyes, Phil Reed, Claus Weiland, Dag Endresen, Felix Shaw, Timo Mühlhaus, Carole Goble RO-Crate is a mechanism for packaging research outputs with structured metadata, providing machine-readability and reproducibility following the FAIR principles. It enables interlinking methods, data, and outputs with the outcomes of a project or a piece of work, even where distributed across repositories. Researchers can distribute their work as an RO-Crate to ensure their data travels with its metadata, so that key components are correctly tracked, archived, and attributed. Data stewards and infrastructure providers can integrate RO-Crate into the projects and platforms they support, to make it easier for researchers to create and consume RO-Crates without requiring technical expertise. Community-developed extensions called “profiles” allow the creation of tailored RO-Crates that serve the needs of a particular domain or data format. Current uses of RO-Crate in bioinformatics include: ∙ Describing and sharing computational workflows registered with WorkflowHub ∙ Creating FAIR exports of workflow executions from workflow engines and biodiversity digital twin simulations ∙ Enabling an appropriate level of credit and attribution, particularly in currently under-recognised roles (e.g., sample gathering, processing, sample distribution) ∙ Capturing plant science experiments as Annotated Research Contexts (ARC), complex objects which include workflows, workflow executions, inputs, and results ∙ Defining metadata conventions for biodiversity genomics This presentation will outline the RO-Crate project and highlight its most prominent applications within bioinformatics, with the aim of increasing awareness and sparking new conversations and collaborations within the BOSC community.
2025-07-22 15:20:00 15:40:00 03A BOSC PheBee: A Graph-Based System for Scalable, Traceable, and Semantically Aware Phenotyping David Gordon David Gordon, Max Homilius, Austin Antoniou, Connor Grannis, Grant Lammi, Adam Herman, Ashley Kubatko, Peter White The association of phenotypes and disease diagnoses is a cornerstone of clinical care and biomedical research. Significant work has gone into standardizing these concepts in ontologies like the Human Phenotype Ontology and Mondo, and in developing interoperability standards such as Phenopackets. Managing subject-term associations in a traceable and scalable way that enables semantic queries and bridges clinical and research efforts remains a significant challenge. PheBee is an open-source tool designed to address this challenge by using a graph-based approach to organize and explore data. It allows users to perform powerful, meaning-based searches and supports standardized data exchange through Phenopackets. The system is easy to deploy and share thanks to reproducible setup templates. The graph model underlying PheBee captures subject-term associations along with their provenance and modifiers. Queries leverage ontology structure to traverse semantic term relationships. Terms can be linked at the patient, encounter, or note level, supporting temporal and contextual pattern analysis. PheBee accommodates both manually assigned and computationally derived phenotypes, enabling use across diverse pipelines. When integrated downstream of natural language processing pipelines, PheBee maintains traceability from extracted terms to the original clinical text, enabling high-throughput, auditable term capture. PheBee is currently being piloted in internal translational research projects supporting phenotype-driven pediatric care. Its graph foundation also empowers future feature development, such as natural language querying using retrieval augmented generation or genomic data integration to identify subjects with variants in phenotypically relevant genes. PheBee advances open science in biomedical research and clinical support by promoting structured, traceable phenotype data.
2025-07-22 15:20:00 15:40:00 03A BOSC The role of the Ontology Development Kit in supporting ontology compliance in adverse legal landscapes Damien Goutte-Gattat Damien Goutte-Gattat Ontologies, like code, are a form of speech. As such, they can be subject to laws and other regulations that attempt to control how freedom of speech is exercised, and ontology editors may find themselves in the position of being legally compelled to introduce some changes in their ontologies for the sole purpose of complying with the laws that apply to them. Therefore, developers of tools used for ontology editing and maintenance need to ponder whether their tools should provide features to facilitate the introduction of such legally mandated changes, and how. As developers of the Ontology Development Kit (ODK), one of the main tools used to maintain ontologies of the OBO Foundry, we will consider both the moral and technical aspects of allowing ODK users to comply with arbitrary legal restrictions. The overall approach we are envisioning, in order to contain the impacts of such restrictions to the jurisdictions that mandate them, is a “split world” system, where the ODK would facilitate the production of slightly different editions of the same ontology.
2025-07-22 15:40:00 16:00:00 03A BOSC 10 years of the AberOWL ontology repository: moving towards federated reasoning and natural language access Robert Hoehndorf Fernando Zhapa-Camacho, Olga Mashkova, Maxat Kulmanov, Robert Hoehndorf AberOWL is a framework for ontology-based data access in biology that has provided reasoning services for bio-ontologies since 2015. Unlike other ontology repositories in the life sciences such as BioPortal, OLS, and OntoBee, AberOWL uniquely focuses on providing access to Description Logic querying through a Description Logic reasoner. The system comprises a reasoning service using OWLAPI and the Elk reasoner, an ElasticSearch service for natural language queries, and a SPARQL endpoint capable of embedding Description Logic queries within SPARQL queries. AberOWL contains all ontologies from BioPortal and the OBO library, enabling lightweight reasoning over the OWL 2 EL profile and implementing the Ontology-Based Data Access paradigm. This allows query enhancement through reasoning to infer implicit knowledge not explicitly stated in data. After a decade of operation, AberOWL is evolving in three key directions: (1) introducing a lightweight, containerized version enabling local deployment for single ontologies with the ability to register with the central repository for federated reasoning access; (2) integrating improved natural language processing through Large Language Models to facilitate Description Logic querying without requiring strict syntax adherence; and (3) implementing a FAIR API that standardizes access to ontology querying and repositories, improving interoperability. These advancements will transform AberOWL into a more federated system with FAIR API access and LLM integration for enhanced ontology interaction.
2025-07-22 16:40:00 16:50:00 03A BOSC The global biodata infrastructure: how, where, who, and what? Guy Cochrane Chuck Cook, Guy Cochrane Life science and biomedical research around the world is critically dependent on a global infrastructure of biodata resources that store and provide access to research data, and to tools and services that allow users to interrogate, combine and re-use these data to generate new insights. These resources, most of which are open and freely available, form a critical, globally distributed, and highly-connected infrastructure that has grown organically over time. Funders and managers of biodata resources are keenly aware that the long-term sustainability of this infrastructure, and of the individual resources it comprises, is under threat. The infrastructure has not been well described and there is a need to understand how many resources there are, where they are located, who funds them, and which are of the greatest importance for the scientific community. The Global Biodata Coalition has worked to describe the infrastructure by undertaking an inventory of global biodata resources and by running a selection process to identify a set of—currently 52—Global Core Biodata Resources (GCBRs) that are of fundamental importance to global life sciences research. We will present an overview of the location and funders of the GCBRs, and will summarise the findings of the two rounds of the global inventory of biodata resources, which identified over 3,700 resources. The results of these analyses provide an overview of the infrastructure and will allow the GBC to identify major funders of biodata resources that are not currently engaged in the discussion of issues of sustainability.
2025-07-22 16:50:00 17:50:00 03A BOSC Panel: Data Sustainability Chris Mungall, Susanna Sansone, Varsha Khodiyar This BOSC 2025 panel will tackle the essential challenge of Data Sustainability, defined as the proactive and principled approach to ensuring bioinformatics research data remains FAIR, ethically managed, and valuable for future generations through sufficient infrastructure, funding, expertise, and governance. In light of current funding pressures and the risk of data loss that impedes scientific progress and wastes resources, establishing sustainable practices has become more urgent than ever. This discussion will incorporate diverse perspectives to examine practical strategies and solutions across key areas, including FAIR/CARE principles, funding models, open science, data lifecycle management, technical scalability, and ethical considerations.
2025-07-22 17:50:00 18:00:00 03A BOSC Closing Remarks Nomi Harris
2025-07-23 11:20:00 12:20:00 01C CAMDA Genome-based prediction of microbial traits Thomas Rattei Thomas Rattei The prediction of phenotypic traits from genomic information is an ongoing challenge in computational biology. Although the fundamental principles of information encoding in genomes have been studied for decades and have enabled the first directed modifications, the expression of phenotypic traits is often the result of complex interactions. Predictive approaches in bioinformatics therefore focus on machine learning from labeled genomic data. In recent years, we have focused on the computational prediction of microbial phenotypic traits from metagenomic data. These data have been collected on a large scale to explore the diversity and composition of microbial communities and to correlate them with environmental factors (e.g. human health and disease). The prediction of traits for these millions of genomes, based on neural networks that use protein families as features, goes one step further and can be used in first applications.
2025-07-23 12:20:00 12:40:00 01C CAMDA The Anti-Microbial Resistance Prediction Challenge - Introduction Leonid Chidelevitch Leonid Chidelevitch The AMR prediction challenge at CAMDA is now in its third year. This year's challenge on predicting AMR quantitatively (MIC values) as well as qualitatively (resistance vs susceptibility) has been developed in conjunction with our CABBAGE project. CABBAGE, which stands for a Comprehensive Assessment of Bacterial-Based AMR prediction from GEnotypes, involves the collection, curation, and exploitation of all the publicly available data on AMR genotypes and phenotypes, not only from databases, but also from individual publications. In this introductory talk I will describe the process by which we arrived at the selected datasets for this year's challenge, discuss other progress we've made on CABBAGE so far, and preview the plans for next year's challenge.
2025-07-23 12:40:00 13:00:00 01C CAMDA A Hybrid Pipeline for Feature Reduction and Ordinal Classification to Predict Antimicrobial Resistance from Genetic Profiles Anton Pashkov Adriana Haydeé Contreras Peruyero, Yesenia Villicaña Molina, Nelly Sélem Mojica, Francisco Santiago Nieto de la Rosa, Victor Muñiz Sánchez, Anton Pashkov, Johanna Atenea Carreón Baltazar, Luis Raúl Figueroa Martínez, Evelia Lorena Coss Navarrete, César Augusto Aguilar Martínez One of the three challenges proposed by the Community of Interest Critical Assessment of Massive Data Analysis (CAMDA) involves predicting antimicrobial resistance or susceptibility for nine bacterial species and four antibiotics of interest. The dataset underwent a cleaning process to remove duplicate IDs with differing MIC values or phenotypes. After data cleaning and preprocessing, three distinct strategies were implemented to perform the predictions. The first strategy focused on predicting minimum inhibitory concentration (MIC) values. We adapted machine learning models for ordinal classification, assuming MIC as an ordinal variable. Two main approaches were used: multiple binary models (logistic regression, CART, random forests) and threshold models (neural networks). Due to the high dimensionality and sparsity of the AMR gene count data, we applied preprocessing techniques including a TF-IDF-like transformation (GF-IAF) and dimensionality reduction (truncated SVD and NMF). In the second strategy, we tested several classical machine learning models to predict the phenotype directly and used a grid search to find the optimal set of parameters, without using MIC values. In the third, we applied dimensionality reduction methods such as TF-IDF, along with a biological filtering step, before predicting the phenotype. Finally, as a preliminary result, ANI and pangenome analyses of E. coli isolates revealed divergence in gene content among some strains. Accessory regions potentially linked to antibiotic resistance suggest that key resistance determinants may lie outside the core genome.
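A minimal sketch of the ordinal-classification strategy outlined above, assuming synthetic placeholder data: one binary classifier per MIC threshold (the classic one-model-per-threshold decomposition) trained on a TF-IDF-transformed, SVD-reduced gene count matrix. Model choices, dimensions, and thresholds are illustrative, not the authors' pipeline.

    # Ordinal classification via one binary model per MIC threshold (sketch)
    import numpy as np
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_counts = rng.poisson(1.0, size=(200, 500))  # toy AMR gene count matrix
    y = rng.integers(0, 4, size=200)              # toy ordered MIC classes 0 < 1 < 2 < 3

    X = TfidfTransformer().fit_transform(X_counts)                      # reweight sparse counts
    X = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)  # reduce dimensionality

    # One binary model per threshold, estimating P(y > k) for k = 0..K-2
    models = [LogisticRegression(max_iter=1000).fit(X, (y > k).astype(int))
              for k in range(3)]

    # Predicted class = number of thresholds a sample is predicted to exceed
    probs = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
    y_pred = (probs > 0.5).sum(axis=1)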
2025-07-23 14:00:00 14:40:00 01C CAMDA Predicting Antimicrobial Resistance Using Microbiome-Pretrained DNABERT2 and DBGWAS-Derived Genomic Features Jack Vaska Jack Vaska, Pratik Dutta, Max Chao, Rekha Sathian, Zhihan Zhou, Han Liu, Ramana Davuluri Antimicrobial resistance (AMR) is an escalating public health threat, especially in hospitals where diverse resistance gene reservoirs have emerged. With the increasing availability of metagenomic and whole-genome sequencing data from AMR pathogens, there is a timely opportunity to develop predictive models. Given the complexity of these genomic datasets, large language models (LLMs) offer a promising approach due to their ability to capture long-range sequence patterns. DNABERT2, an LLM pretrained on diverse DNA sequences, has shown strong performance in various genomic tasks and is well-suited for AMR prediction (Zhou et al., 2023). We present a novel method to predict AMR across nine pathogenic bacterial species treated with four common antibiotics. Four custom DNABERT2 models, pretrained on human microbiome-derived genomic sequences, were fine-tuned on sequences obtained from de novo assembled bacterial genomes. To extract phenotype-associated features, we employed De Bruijn Graph-based Genome-Wide Association Study (DBGWAS) in an alignment-free manner (Jaillard et al., 2018). Statistically significant sequences (p < 0.05) were aligned back to assemblies using BLAST (≥80% identity), and 1,000 bp flanking subsequences were extracted. Resistant samples showed a markedly higher number of BLAST hits than susceptible ones. Data were grouped by antibiotic and each group was fine-tuned using a DNABERT2 model incorporating species and BLAST hit count as additional features. Consensus predictions across sequences achieved 84.5% accuracy and a macro F1 score of 0.84. Our findings demonstrate that resistant bacteria contain distinct genomic features absent in susceptible strains, highlighting the promise of LLM-based methods for AMR prediction.
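Embedding DNA sequences with a pretrained DNABERT-2 checkpoint, the starting point for the kind of fine-tuning described above, follows the pattern below. This is a sketch using the public zhihan1996/DNABERT-2-117M checkpoint on Hugging Face; the mean-pooling step and the idea of appending species and BLAST-hit-count features are illustrative assumptions, not the authors' models.

    # Sketch: embed a DNA sequence with the public DNABERT-2 checkpoint
    import torch
    from transformers import AutoTokenizer, AutoModel

    name = "zhihan1996/DNABERT-2-117M"  # public checkpoint; ships custom model code
    tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    model = AutoModel.from_pretrained(name, trust_remote_code=True)

    dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
    inputs = tokenizer(dna, return_tensors="pt")["input_ids"]

    with torch.no_grad():
        hidden = model(inputs)[0]        # (1, seq_len, 768) token embeddings
    seq_embedding = hidden.mean(dim=1)   # mean-pool into one fixed-size vector
    # seq_embedding could then feed a classifier alongside extra tabular features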
2025-07-23 14:40:00 15:00:00 01C CAMDA The Antimicrobial Resistance Prediction Challenge Alper Yurtseven Alper Yurtseven, Dilfuza Djamalova, Marco Galardini, Olga V. Kalinina Antimicrobial Resistance (AMR) is an urgent threat to human health worldwide as microbes have developed resistance to even the most advanced drugs. In this year’s CAMDA challenge, we focused on predicting antimicrobial resistance of 5,346 bacterial strains that belong to 9 different species (Acinetobacter baumannii, Campylobacter jejuni, Escherichia coli, Klebsiella pneumoniae, Neisseria gonorrhoeae, Pseudomonas aeruginosa, Salmonella enterica, Staphylococcus aureus, Streptococcus pneumoniae) using two machine learning algorithms.
2025-07-23 15:00:00 15:20:00 01C CAMDA Antimicrobial Resistance Prediction via Binary Ensemble Classifier and Assessment of Variable Importance Owen Visser Owen Visser, Victor Agboli, Somnath Datta Antimicrobial resistance (AMR) presents a growing challenge to global health, driven by antibiotic overuse and the rapid evolution of resistant bacteria. Predicting whether an isolate is resistant or susceptible to a drug remains difficult due to genomic variability. As part of the 2025 CAMDA Challenge, we adapted a standard bioinformatics pipeline to preprocess the variable raw sequencing data, deriving features from strain-specific markers and AMR gene classes. Three machine learning methods that have shown high accuracy in recent AMR prediction research were trained and combined into an ensemble to predict binary resistance phenotypes for nine bacterial pathogens and four antibiotics. The ensemble performed well across most species, notably achieving 96.8% accuracy for C. jejuni and 98.2% for A. baumannii. Permutation-based variable importance analysis identified relevant resistance genes and strains, such as sulphonamide and aminoglycoside genes and the LAC-4 strain in A. baumannii. These results demonstrate the utility of ensemble models for AMR prediction on large, heterogeneous genomic datasets.
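Permutation-based variable importance of the kind used above can be computed with scikit-learn's model-agnostic utility: shuffle one feature at a time and measure the drop in held-out performance. The soft-voting ensemble and synthetic data below are illustrative stand-ins, not the authors' models or features.

    # Permutation importance over an ensemble classifier (sketch)
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=400, n_features=30, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    ensemble = VotingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("rf", RandomForestClassifier(random_state=0))],
        voting="soft",
    ).fit(X_tr, y_tr)

    # Shuffle one feature at a time, record the drop in held-out accuracy
    result = permutation_importance(ensemble, X_te, y_te, n_repeats=10,
                                    random_state=0)
    top = np.argsort(result.importances_mean)[::-1][:5]
    print("most important feature indices:", top)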
2025-07-23 15:00:00 15:20:00 01C CAMDA A Highly Accurate Workflow for Inference of Antimicrobial Resistance from Genetic Data Based on Machine Learning and Global Data Curation David Danko Gabor Fidler, Heather Wells, Ford Combs, John Papciak, Mara Couto-Rodriguez, Sol Rey, Tiara Rivera, Lorenzo Uccellini, Christopher Mason, Niamh O'Hara, Dorottya Nagy-Szakal, David Danko Note: This abstract is paired with the prediction submission “Base Model, 2nd Submission (Biotia)” made by user gfidler from team Biotia on May 15, 2025. We present BIOTIA-DX Resistance, our submission to the CAMDA AMR Challenge. This tool builds on our clinically validated metagenomic workflow to provide broad domain predictions for antimicrobial resistance from microbial sequencing data. We achieved an F1 score of 84 on the CAMDA challenge test set. Our technique is based on curation of global datasets, machine learning-based predictions from input data, and highly stringent preprocessing of input data and databases.
2025-07-23 15:20:00 15:40:00 01C CAMDA The Gut Microbiome Health Index Challenge - Introduction Kinga Zielińska Kinga Zielińska
2025-07-23 15:40:00 16:00:00 01C CAMDA Integrating Taxonomic and Functional Features for Gut Microbiome Health Indexing Rafael Pérez Estrada Shaday Guerrero Flores, Rafael Pérez Estrada, Juan Francisco Espinosa Maya, Nelly Selem Mojica, David Alberto García Estrada, Orlando Camargo Escalante, Mario Jardón Santos, Jose Daniel Chavez Gonzalez Accurate characterization of the gut microbiome is essential for understanding its role in health and disease; however, while current indices such as GMHI and hiPCA rely on taxonomic profiles to associate microbiome composition with health states, they do not consider underlying functional variability. Here, we integrate species-level (MetaPhlAn) and pathway-level (HUMAnN) data from 4,398 samples provided by CAMDA 2025 to understand key organisms and pathways in different groups of diseases and to develop and evaluate composite health indices. We first built co-occurrence networks, identifying keystone taxa. We then recalibrated GMHI and hiPCA for both taxonomic and functional data and developed three ensemble models. The best-performing model, the Optimized Pathway Ensemble, reached an F1-score of 0.76. We extended GMHI to distinguish between disease groups and tested pairwise classifiers across conditions—including healthy, gastrointestinal, metabolic, psychiatric, and neurological disorders. Additionally, we developed the Gut Microbiome Health Calculator, a web tool for computing and comparing these indices. Our results show that combining taxonomic and functional features enhances classification and reveals biologically relevant patterns in disease.
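At their core, indices like GMHI score a sample by a log-ratio of the collective abundance of health-prevalent versus health-scarce taxa. The simplified sketch below uses placeholder abundances and marker sets and omits the published GMHI weighting terms; it illustrates the general shape of the calculation, not the authors' recalibrated indices.

    # Simplified GMHI-style log-ratio health score (sketch)
    import numpy as np

    rng = np.random.default_rng(0)
    abundance = rng.dirichlet(np.ones(100), size=10)  # 10 samples, 100 species, relative abundances

    # Placeholder marker sets; real indices use curated health-prevalent /
    # health-scarce species lists learned from labeled cohorts
    health_prevalent = list(range(0, 10))
    health_scarce = list(range(10, 20))

    eps = 1e-8
    collective_hp = abundance[:, health_prevalent].sum(axis=1)
    collective_hs = abundance[:, health_scarce].sum(axis=1)

    # Positive score = healthy-like, negative = dysbiotic-like
    score = np.log10((collective_hp + eps) / (collective_hs + eps))
    print(score.round(3))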
2025-07-23 16:40:00 17:20:00 01C CAMDA Building a Rare-Disease Microbiome Health Index: Integrating Gut Metagenomes, Synthetic PKU EHRs and Rare-Variant Profiles to Forecast Phenylalanine Crises Khartik Uppalapati, Bora Yimenicioglu, Shakeel Abdulkareem, Adan Eftekhari Phenylketonuria (PKU) is an autosomal recessive metabolic disorder characterized by deficient phenylalanine hydroxylase activity, leading to episodic neurotoxic elevations in plasma phenylalanine (Phe) despite strict dietary management. However, existing gut health metrics fail to capture rare-disease–specific dysbiosis. To address these concerns, we developed a Rare-Disease Microbiome Health Index (RDMHI) that integrates MetaPhlAn-derived species abundances, HUMAnN functional pathways, synthetic electronic health record timelines, and rare-variant burdens to forecast imminent Phe crises. We curated 4,398 metagenomic profiles from the CAMDA dataset alongside three external PKU cohorts (n < 100), applied centered log-ratio transformation and batch correction, and generated 5,000 patient-month windows via Synthea-augmented GAN models to simulate clinical and laboratory events. Rare-variant burdens for PAH and BH₄-pathway genes were collapsed into gene-level indicators. A LightGBM-DART classifier was trained under nested five-fold, leave-one-dataset-out cross-validation and evaluated by AUROC, AUPRC, and Matthews correlation coefficient with 1,000-sample bootstrap CIs. RDMHI achieved an AUROC of 0.91 (95% CI 0.88–0.94) and an MCC of 0.64, outperforming clinical-only (AUROC 0.78; MCC 0.38) and microbiome-only (AUROC 0.81; MCC 0.45) baselines. External validation on 50 registry windows yielded an AUROC of 0.85 (0.81–0.89) and 78% sensitivity at a 22% false-positive rate. By outperforming existing gut-health indices (GMHI and hiPCA), RDMHI demonstrates the impact of tailoring health indices to rare diseases and establishes a new standard of microbiome-based prognostic modeling for precision risk stratification in rare metabolic disorders.
2025-07-23 17:20:00 17:40:00 01C CAMDA Toward the Development of a Novel and Comprehensive Gut Health Index: An Ensemble Model Integrating Taxonomic and Functional Profiles Vincent Mei Vincent Mei, Yulin Li, Somnath Datta Diseases linked to the gut microbiome have been on the rise, which contributes to the rising cost of healthcare and worsening patient outcomes. Since stool samples provide an accurate representation of the gut microbiome and can be collected frequently and non-invasively, it is of clinical interest to create an index that can accurately classify samples as healthy or non-healthy. Several indices already exist to assess microbiome health, such as the Gut Microbiome Health Index (GMHI), health index with PCA (hiPCA), and Shannon entropy measures, but their reliance solely on species abundance limits their ability to distinguish between healthy and non-healthy individuals. To improve upon these indices, we proposed a novel ensemble-based index that integrates both taxonomic and metabolic pathway abundance data from stool samples to predict individual health status. From the provided data with 1211 species features and 619 pathway features, 61 species and 21 pathways were identified and used to train the ensemble model. The best threshold for the index generated from the ensemble model was selected using Youden’s index, resulting in a balanced accuracy of 0.7234 compared to values below 0.5 for GMHI, hiPCA, and Shannon entropy measures. Feature importance was also calculated simultaneously with the ensemble model training by permuting one feature at a time, leading to the identification of the 20 most important species and pathways when determining gut microbiome health.
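Threshold selection with Youden's index, as used above, maximizes J = sensitivity + specificity - 1 across ROC cutoffs. A self-contained sketch on synthetic index values (not the authors' ensemble or data):

    # Cutoff selection via Youden's J statistic (sketch)
    import numpy as np
    from sklearn.metrics import roc_curve

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=500)                # 1 = healthy, 0 = non-healthy
    index = y_true * 0.8 + rng.normal(0, 0.6, size=500)  # toy continuous health index

    fpr, tpr, thresholds = roc_curve(y_true, index)
    j = tpr - fpr                                        # Youden's J at each threshold
    best = thresholds[np.argmax(j)]
    print(f"optimal cutoff by Youden's J: {best:.3f}")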
2025-07-23 17:40:00 18:00:00 01C CAMDA Topology-Enabled Integration of Taxonomic and Functional Microbiome Profiles Reveals Distinct Subgroups in Healthy Individuals Doroteya Staykova Doroteya Staykova High-throughput sequencing technologies have enabled detailed taxonomic and functional profiling of the human gut microbiome. However, integrating these diverse, high-dimensional data sources remains a major challenge - particularly in defining robust, cross-modal indicators of gut health - due to significant inter-individual variability observed even within healthy populations. In this study, I applied Topological Data Analysis (TDA) to the CAMDA 2025 Microbiome Challenge dataset to integrate taxonomic and functional profiles from healthy individuals. My primary aim was to establish a baseline for human gut health by identifying microbial patterns within a large, healthy cohort. A cross-modal network representation of over 1,600 microbiome samples was constructed using the Mapper algorithm with PHATE-based topological lenses. The derived topological shape revealed two distinct subgroups within the landscape of the healthy gut microbiome. Subsequent statistical analyses identified characteristic taxonomic and functional signatures associated with each subgroup, demonstrating the utility of TDA in uncovering intrinsic patterns and providing a data-driven framework for more precise stratification of gut health.
2025-07-23 17:40:00 18:00:00 01C CAMDA Ensemble-Based Topic Selection for Text Classification via a Grouping, Scoring, and Modeling Approach Daniel Voskergian, Burcu Bakir-Gungor, Malik Yousef The exponential growth in scientific literature, especially in biomedical domains, has intensified the need for effective automatic text classification (ATC) systems. TextNetTopics is a recent approach that classifies documents using topic-based features derived from Latent Dirichlet Allocation (LDA), reducing dimensionality while maintaining semantic richness. However, TextNetTopics’ reliance on single topic models introduces performance variability across datasets, limiting its generalizability. This study introduces ENTM-TS (Ensemble Topic Modeling for Topic Selection), a novel framework that enhances TextNetTopics by integrating multiple topic models through a three-stage Grouping, Scoring, and Modeling (GSM) approach. First, topics are extracted from various models and merged based on semantic similarity to reduce redundancy and generate discriminative topic groups. These groups are then scored using internal and external evaluation strategies, ensuring normalized comparison and identifying top-performing subsets. Finally, a modeling phase iteratively aggregates and evaluates these groups to build an optimal feature set for classification. ENTM-TS was evaluated on two biomedical text datasets: the DILI dataset and the WOS-5736 dataset of scientific abstracts. Results demonstrate that ENTM-TS consistently meets or exceeds the performance of single-model configurations, improving classification accuracy and reducing variability. This ensemble-based approach not only preserves semantic richness but also ensures robustness across diverse datasets. ENTM-TS offers a generalizable and interpretable solution for biomedical text mining, with future work aimed at automating parameter selection for greater usability.
2025-07-24 08:40:00 09:40:00 01C CAMDA Data, Diagnoses, and Discovery: Improving Healthcare through Electronic Health Records Spiros Denaxas Spiros Denaxas Electronic health records (EHRs) represent rich, multidimensional data generated through routine interactions within the healthcare system. These records have transformed biomedical research, shifting the traditional approach of studying diseases in isolation toward the simultaneous analysis of thousands of conditions. This talk will explore the unique opportunities and challenges that EHRs present to researchers and highlight best practices through examples.
2025-07-24 09:40:00 10:00:00 01C CAMDA Stage-Disease Grouping, Scoring, and Modeling for Predicting Diabetes Complications from Electronic Health Records Daniel Voskergian Daniel Voskergian, Burcu Bakir-Gungor, Malik Yousef Diabetes mellitus remains a major global health challenge, contributing significantly to morbidity, disability, and mortality. Accurate prediction of diabetes-related complications from electronic health records (EHRs) is essential for early intervention and personalized care. This study proposes a predictive framework that combines novel feature engineering with XGBoost-based feature selection and a Grouping–Scoring–Modeling (GSM) approach to improve predictive performance. Rather than relying on individual features, the proposed method constructs Stage-Disease Groups—sets of clinically related variables grouped by disease category (e.g., cardiovascular, renal) and typical onset stage (e.g., early, mid, late) following diabetes diagnosis. Each group captures interactions between variables such as age range and chronic conditions, reflecting real-world progression patterns. Predictive models were developed for four critical diabetes complications: retinopathy, chronic kidney disease, ischemic heart disease, and amputations. These models were trained on a large-scale dataset of synthetic EHRs representing nearly 1 million patients, generated using dual-adversarial autoencoders to preserve realistic temporal and clinical patterns. Results demonstrate that leveraging structured, group-based features improves both classification accuracy and model interpretability. Final models achieved accuracies between 70% and 77% and AUC scores between 76% and 84%, underscoring the effectiveness of the GSM framework in clinical risk prediction.
2025-07-24 11:20:00 12:00:00 01C CAMDA Benchmarking for Better Private Algorithms Antti Honkela Antti Honkela Responsible application of machine learning (ML) on sensitive health and genetic data requires privacy-preserving algorithms to ensure that the data are not exposed. There is even legislative pressure, especially in Europe, requiring privacy in trained ML models. My talk will discuss how to organise a challenge for privacy-preserving ML to stimulate the development of better private algorithms. This is significantly more difficult than organising regular ML challenges, because there are no straightforward means of reliably evaluating privacy, and fair comparison of solutions requires specifying a comparable privacy-utility trade-off for all participants. Building on experience from running multiple privacy-preserving ML challenges, I will review good and not so good solutions to these issues, hoping to encourage others to include a privacy component in their challenges.
2025-07-24 12:00:00 12:20:00 01C CAMDA The Health Privacy Challenge - Introduction Hakime Öztürk Hakime Öztürk The Health Privacy Challenge, which is organized in the context of the European Lighthouse on Safe and Secure AI (ELSA, https://elsa-ai.eu/), explores the privacy-preserving aspect of synthetic data generation models in the context of biological datasets. The challenge, through a
2025-07-24 12:20:00 13:00:00 01C CAMDA The Health Privacy Challenge - panel discussion Spiros Denaxas, Oliver Stegle, Antti Honkela, David Kreil, Wenzhong Xiao, Joaquin Dopazo
2025-07-24 14:00:00 14:20:00 01C CAMDA Synthetic genomic data generation through Differential Privacy-enhanced Non-Negative Matrix Factorization (DP-NMF) Andrew Wicks Andrew Wicks, Kyle Fogerty Generation of synthetic genomics data is increasingly considered a routine approach for safely sharing sensitive genomic datasets. While traditional data-sharing methods often expose participants to privacy risks such as membership-inference attacks, the necessity of such methods may be reconsidered in favor of privacy-preserving alternatives. In this work, we outline scenarios in genomics research where synthetic data generation via non-negative matrix factorization (NMF) can effectively replace direct data sharing, thereby significantly enhancing privacy. We introduce a simple yet robust heuristic leveraging differential privacy (DP) integrated into NMF-based clustering, combined with a zero-inflated negative binomial or Poisson sampling strategy. We demonstrate the utility and viability of this method through proof-of-concept evaluations on real genomic data, discuss practical use-cases, and highlight broader implications for secure and privacy-compliant genomic data dissemination.
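The general recipe the abstract outlines can be illustrated in a few lines: factorize counts with NMF, perturb the model with noise, then resample counts from a Poisson. The Laplace noise scale below is a stand-in; a real differentially private mechanism requires sensitivity analysis and privacy accounting, which this toy sketch omits, and it is not the authors' DP-NMF method.

    # Toy noise-perturbed NMF + Poisson resampling (sketch, no formal DP guarantee)
    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    X = rng.poisson(2.0, size=(100, 300)).astype(float)  # toy genomic count matrix

    nmf = NMF(n_components=10, init="nndsvda", max_iter=500, random_state=0)
    W = nmf.fit_transform(X)
    H = nmf.components_

    # Perturb the low-rank reconstruction with Laplace noise; the scale 0.1
    # stands in for a properly calibrated sensitivity / epsilon ratio
    mean = W @ H + rng.laplace(0.0, 0.1, size=X.shape)
    mean = np.clip(mean, 0.0, None)                      # keep Poisson rates valid

    X_synth = rng.poisson(mean)                          # synthetic count matrix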
2025-07-24 14:20:00 14:40:00 01C CAMDA Synthetic Data Generation for bulk RNA-seq Data: A CAMDA Health Challenge Analysis Steven Golob Shane Menzies, Sikha Pentyala, Daniil Filienko, Steven Golob, Jineta Banerjee, Luca Foschini, Martine De Cock One of the major barriers to AI-driven medical discoveries is the limited availability of high-quality, accessible healthcare data. This is because medical data is inherently sensitive, necessitating strict privacy protections that often lead to data being siloed across clinical sites and research institutions. Lack of access to such data hinders reproducibility and slows down AI adoption. To address this bottleneck, we investigate the use of Synthetic Data Generation (SDG) algorithms, capable of generating realistic data with formal privacy guarantees. Here, we investigate the extent to which state-of-the-art SDG algorithms can be applied to bulk RNA-seq data to generate high-quality genomics data suitable for downstream analysis.
2025-07-24 14:40:00 15:00:00 01C CAMDA Comparison of Single Cell RNA Synthetic Data Generators: A CAMDA Health Challenge Analysis Patrick McKeever Patrick McKeever, Daniil Filienko, Steven Golob, Shane Menzies, Sikha Pentyala, Jineta Banerjee, Luca Foschini, Martine De Cock Single cell RNA sequencing has a wide range of applications in medical research, allowing researchers to identify distinct cell types and consider the impact of experimental conditions on a per-cell-type basis. However, the scarcity of counts data for rare cell types or experimental conditions poses considerable difficulties in the analysis of single-cell expression data. As such, a large literature has developed around the generation of synthetic single-cell data. Synthetic single-cell expression data allows biologists to model rare cell states, test new statistical methods against a known ground truth, perform in-silico gene perturbations, and guide the development of sequencing experiment structure in advance. However, while several comparative benchmarks of single cell data exist, much less literature has considered the privacy-preserving aspects of these algorithms. In this extended abstract, we explore and compare multiple types of synthetic data generators (SDGs) to generate single-cell RNA-seq (scRNA-seq) data using the OneK1K dataset provided by the CAMDA Healthcare Challenge. Specifically, we evaluate both the statistical methods scDesign2 and Private-PGM (which also provides formal differential privacy guarantees) as well as the recent diffusion-based model cfDiffusion. Our analysis follows the evaluation pipeline and metrics defined by the challenge organizers. We find that scDesign2 far exceeds the other generators in terms of data quality.
2025-07-24 14:40:00 15:00:00 01C CAMDA NoisyDiffusion: Privacy Preserving Synthetic Gene Expression Data Generation Jules Kreuer Jules Kreuer Generating synthetic gene expression data has the potential to advance computational biology and health research by enabling broader access to data. However, creating synthetic data that is both highly faithful to the original and useful from a biological perspective while also ensuring privacy is a significant challenge. While diffusion models are powerful generative tools, their application to sensitive genomic data requires careful consideration of privacy implications, especially regarding their susceptibility to memorisation and membership inference attacks (MIAs). This project presents NoisyDiffusion: a conditional diffusion model designed to generate synthetic gene expression data while incorporating mechanisms for differential privacy to mitigate MIAs. As this project is part of the CAMDA 2025 - Health Privacy Challenge, it was evaluated on the TCGA-COMBINED and TCGA-BRCA datasets. NoisyDiffusion demonstrated strong utility, with classifiers trained on its synthetic data achieving high accuracy (e.g., 96.92% on TCGA-COMBINED) and AUPR, rivaling top non-private baselines (Multivariate, CVAE) and significantly outperforming other generative models, including those with explicit DP (DP-CVAE, CTGAN). Crucially, for privacy, Membership Inference Attack (MIA) AUCs were close to 0.5, suggesting good resilience and performance comparable to the Multivariate baseline. This work demonstrates that diffusion models can effectively generate high-quality, privacy-respecting synthetic genomic data, offering a promising pathway for advancing research while safeguarding sensitive information.
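An "MIA AUC close to 0.5" claim, as in the abstract above, is typically backed by an attack of the following general shape: score each record by its proximity to the synthetic data and test whether training members separate from held-out non-members. The toy sketch below uses Gaussian placeholder data and a simple distance-to-nearest-synthetic-record attack; challenge evaluations use stronger attacks.

    # Toy membership inference attack and its AUC (sketch)
    import numpy as np
    from sklearn.metrics import roc_auc_score
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    members = rng.normal(0, 1, size=(200, 50))      # records used to train the generator
    non_members = rng.normal(0, 1, size=(200, 50))  # held-out records
    synthetic = rng.normal(0, 1, size=(500, 50))    # generator output

    nn = NearestNeighbors(n_neighbors=1).fit(synthetic)

    def attack_score(x):
        # Closer to the synthetic data => guessed more likely to be a member
        dist, _ = nn.kneighbors(x)
        return -dist.ravel()

    scores = np.concatenate([attack_score(members), attack_score(non_members)])
    labels = np.concatenate([np.ones(200), np.zeros(200)])
    print("MIA AUC:", roc_auc_score(labels, scores))  # ~0.5 = no better than chance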
2025-07-24 15:00:00 15:20:00 01C CAMDA Reusability of Public Omics Data Across 6 Million Publications Serghei Mangul Serghei Mangul, Viorel Munteanu, Dumitru Ciorbă, Viorel Bostan, Mihai Dimian, Nicolae Drabcinski Over the past two decades, public repositories like GEO and SRA have accumulated vast omics datasets, sparking a crucial discussion on secondary data analysis. Access to this data is vital for reproducibility, novel experiments, meta-analyses, and new discoveries. However, the extent and factors influencing reuse have been unclear. A large-scale study analyzed over six million open-access publications from 2001 to 2025 to quantify reuse patterns and identify influencing factors. The analysis identified 213,213 omics-based publications, with approximately 65% based on secondary analysis, marking a significant shift. Since 2015, studies reusing existing gene expression data, particularly microarray data, have increasingly outnumbered those with new data. Despite this, a large portion of datasets, especially RNA-seq, remain underutilized, with over 72% of RNA-seq datasets in GEO and SRA not reused even once. Reusability varies by data type; microarray data shows the highest average Reusability Index (RI), while RNA-seq and other sequencing data have lower RIs. Human datasets consistently exhibit higher reusability than non-human ones. Significant barriers to reuse persist, including incomplete metadata, lack of standardization, and the complexity of raw data formats. Many researchers also lack the necessary computational tools or expertise. The study proposes solutions: enforcing metadata standards, integrating automated data processing tools into repositories, formally recognizing data contributions with metrics like RI and Normalized Reusability Index (NRI), and incentivizing reuse through journals and funding agencies. Addressing these challenges is crucial to unlock the full potential of existing omics data.
2025-07-24 15:00:00 15:20:00 01C CAMDA Pre-publication sharing of omics data improves paper citations Serghei Mangul Serghei Mangul, Dhrithi Deshpande, Viorel Munteanu, Mihai Dimian, Grigore Boldirev, Alexander Zelikovsky Advancements in omics technologies generate vast datasets, while public repositories facilitate their sharing, crucial for accelerating discovery, enhancing reproducibility, and meeting funder/journal mandates. Pre-publication data sharing, particularly alongside preprints, is increasingly beneficial, enabling early re-analysis and proving vital during public health crises like COVID-19, where data access is critical for verifying rapid findings and maintaining scientific integrity. However, a key question is whether raw omics data is consistently deposited when preprints are posted. Our study presents the first comprehensive analysis of pre-publication data sharing practices and their impact on citations in biomedical research. We analyzed 106,000 bioRxiv/medRxiv preprints and 72,715 publications with primary Gene Expression Omnibus (GEO) datasets, identifying 6,819 preprints mentioning GEO IDs and matching 2,022 preprint-publication pairs. Analysis revealed significant variability; only 29.7% of matched pairs had identical, single GEO IDs. While 71-87% of datasets were available before publication, only 9-23% were available at preprint posting. We examined the relationship between dataset release timing and citation counts, revealing statistically significant findings (Kolmogorov-Smirnov test, p = 8.596 x 10⁻⁶) indicating a discernible impact of early data availability on citation benefit. We also found over 1,600 cases where data IDs were in publications but not their preprints. Our findings reveal a fragmented landscape of pre-publication omics data sharing, challenging reproducibility and transparency.
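The citation comparison above hinges on a two-sample Kolmogorov-Smirnov test. A minimal sketch of that statistical step, with synthetic citation counts standing in for the study's data:

```python
# Two-sample KS test comparing citation-count distributions for papers whose
# data was released early vs. late (synthetic negative-binomial counts).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
early_release = rng.negative_binomial(5, 0.25, size=800)  # data shared early
late_release = rng.negative_binomial(4, 0.25, size=800)   # data shared late

stat, p = ks_2samp(early_release, late_release)
print(f"KS statistic = {stat:.3f}, p = {p:.2e}")
```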
2025-07-24 15:20:00 15:40:00 01C CAMDA HI-MGSyn: A Hypergraph and Interaction-aware Multi-Granularity Network for Predicting Synergistic Drug Combinations Yuexi Gu Yuexi Gu, Jian Zu, Yongheng Sun, Louxin Zhang Motivation: Drug combinations can not only enhance drug efficacy but also effectively reduce toxic side effects and mitigate drug resistance. With the advancement of drug combination screening technologies, large amounts of data have been generated. The availability of such large datasets enables researchers to develop deep learning methods for predicting synergistic drug combinations. However, these methods still lack sufficient accuracy for practical use, and most overlook the biological significance of their models. Results: We propose the HI-MGSyn (Hypergraph and Interaction-aware Multi-granularity Network for Drug Synergy Prediction) model, which integrates a coarse-granularity module and a fine-granularity module to predict drug combination synergy. The former utilizes a hypergraph to capture global features, while the latter employs interaction-aware attention to simulate biological processes by modeling substructure-substructure and substructure-cell line interactions. HI-MGSyn outperforms state-of-the-art machine learning models on our validation datasets extracted from the DrugComb and GDSC2 databases. Furthermore, the fact that five of the 12 novel synergistic drug combinations predicted by HI-MGSyn are strongly supported by experimental evidence in the literature underscores its practical potential.
2025-07-24 15:40:00 16:00:00 01C CAMDA CAMDA Trophy David Kreil
2025-07-24 15:40:00 16:00:00 01C CAMDA Closing remarks David Kreil
2025-07-21 14:00:00 14:10:00 01C ISCB-China Workshop Welcome Address
2025-07-21 14:10:00 14:35:00 01C ISCB-China Workshop Single Cell Spatial Transcriptomics: Decoding Cellular Heterogeneity in Spatial Dimensions Hongyu Zhao
2025-07-21 14:35:00 14:55:00 01C ISCB-China Workshop Seq2Image: Computational Paradigm and Genomic Applications Kei Ye
2025-07-21 14:55:00 15:15:00 01C ISCB-China Workshop Single Cell Spatial Transcriptomics: Decoding Cellular Heterogeneity in Spatial Dimensions Xun Xu
2025-07-21 15:15:00 16:00:00 01C ISCB-China Workshop Bioinformatics @ China
2025-07-21 16:40:00 17:10:00 01C ISCB-China Workshop Learning Multiscale Cellular Organization and Interaction Jian Ma
2025-07-21 17:10:00 17:30:00 01C ISCB-China Workshop Language-guided biology James Zou
2025-07-21 17:30:00 18:00:00 01C ISCB-China Workshop AI and Bioinformatics: The Next Era
2025-07-23 11:20:00 13:00:00 02F CollaborationFest CollaborationFest
2025-07-23 14:00:00 16:00:00 02F CollaborationFest CollaborationFest
2025-07-23 16:40:00 18:00:00 02F CollaborationFest CollaborationFest
2025-07-24 08:40:00 10:00:00 02F CollaborationFest CollaborationFest
2025-07-24 11:20:00 13:00:00 02F CollaborationFest CollaborationFest
2025-07-24 14:00:00 16:00:00 02F CollaborationFest CollaborationFest
2025-07-22 11:20:00 12:00:00 02F CompMS The elephant in the (metabolomics) room: computational approaches for small molecule structural annotation from single biological study to data repositories Warwick Dunn Warwick Dunn Multiple omics research strategies (metabolomics, lipidomics, exposomics) focus on the reporting of small molecules in biological systems in relation to human diseases, biotechnology, microbiomes and environmental impact, to name a few examples. Many small molecule omics studies apply liquid chromatography-mass spectrometry (LC-MS) to simultaneously collect data reporting on hundreds to low thousands of small molecules. The availability of these studies in data repositories (Metabolights, Metabolomics Workbench, GNPS) is rapidly increasing and provides the opportunity for large-scale data reuse. LC-MS data contain thousands of signals, with one small molecule being detected as multiple complementary signals. These signals are applied to structurally annotate small molecules, a required process to derive biological knowledge. There have been significant advances in both the volumes of data available in metabolomics data repositories and in the development of computational tools to structurally annotate and biologically interpret. However, the complexity of the data collected, the inability to sequence all common metabolomes, and the sparse metabolite coverage of the libraries applied present significant hurdles. In this presentation I will (1) describe the different types of data collected in LC-MS small molecule studies, (2) review the current strategies applied to these data to convert from signal to chemical structure and the hurdles to overcome computationally and (3) discuss moving from single lab/single study to the integration of studies available across multiple data repositories. By moving from small-scale to big data through reuse of publicly available data, we can rapidly advance our biological understanding across species and geography.
2025-07-22 12:00:00 12:20:00 02F CompMS MetaboT: AI-based agent for natural language-based interaction with metabolomics knowledge graphs Madina Bekbergenova Madina Bekbergenova, Lucas Pradi, Emma Tysinger, Franck Michel, Florence Mehl, Marco Pagni, Wout Bittremieux, Jean-Luc Wolfender, Fabien Gandon, Louis-Félix Nothias Long abstract submitted separately as a PDF.
2025-07-22 12:20:00 12:40:00 02F CompMS Reference data-driven analysis for joint metabolome–microbiome readout from untargeted mass spectrometry data Alejandro Mendoza Cantu Alejandro Mendoza Cantu, Julia Gauglitz, Wout Bittremieux Untargeted tandem mass spectrometry (MS/MS) metabolomics enables broad chemical profiling of complex biological samples but is limited by low annotation rates and interpretability challenges. Reference data-driven (RDD) analysis addresses these limitations by leveraging metadata-rich MS/MS reference datasets to contextualize untargeted experiments. RDD improves spectrum interpretation by matching experimental MS/MS spectra to comprehensive reference libraries enriched with hierarchical metadata. The workflow begins with raw MS/MS files from both study samples and reference materials (e.g., foods, microbes). Reference samples are annotated with structured ontologies (e.g., plant → fruit → pome → apple), allowing multi-level biological interpretation. Both datasets are analyzed using GNPS molecular networking, which clusters spectra based on similarity. Clusters shared between study and reference samples are treated as spectral matches. A spectral count table is constructed by aggregating shared clusters across samples and reference files. These counts can be aggregated across different ontology levels to support flexible downstream analysis. RDD was first demonstrated with the Global FoodOmics Project dataset, enabling dietary pattern reconstruction and increasing spectral usage by 5.1 ± 3.3-fold. We now expand this approach to microbiome analysis using MS/MS data from 488 American Gut Project participants, matched to a curated subset of the microbeMASST reference database. This enabled microbial profiling from metabolic data, capturing key gut taxonomic trends. The full framework is available as an open-source Python library and web application, enabling custom dataset analysis without coding. RDD offers a generalizable strategy to enhance annotation and biological insight from untargeted MS/MS data.
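To illustrate the ontology roll-up step described above (with hypothetical column names, not the published library's API), spectral counts shared with reference samples can be aggregated at any level of the ontology path:

```python
# Roll spectral counts up an ontology path such as "plant|fruit|pome|apple".
import pandas as pd

counts = pd.DataFrame({
    "sample": ["s1", "s1", "s2"],
    "ontology_path": ["plant|fruit|pome|apple",
                      "plant|fruit|berry|grape",
                      "plant|fruit|pome|apple"],
    "shared_clusters": [12, 5, 7],
})
level = 2  # 0 = plant, 1 = fruit, 2 = pome/berry, 3 = apple/grape
counts["term"] = counts["ontology_path"].str.split("|").str[level]
print(counts.groupby(["sample", "term"])["shared_clusters"].sum())
```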
2025-07-22 12:40:00 13:00:00 02F CompMS Combined MS and MS/MS deconvolution of SWATH DIA data with the DIA-NMF software for comprehensive annotation in metabolomics Diana Karaki Diana Karaki, Annelaure Damont, Antoine Souloumiac, Francois Fenaille, Etienne Thevenot, Sylvain Dechaumet Data-independent acquisition (DIA), particularly Sequential Window Acquisition of All Theoretical Mass Spectra (SWATH-MS), is gaining momentum in untargeted metabolomics due to its ability to fragment all detected ions within large consecutive isolation windows in a single run. The main challenge lies in processing the resulting hybrid fragmentation data and extracting pure MS/MS spectra based on the similarity of retention time profiles from precursors and their fragment ions. We recently demonstrated the value of a Non-Negative Matrix Factorization (NMF) approach for DIA deconvolution, compared to existing peak modeling methods such as MS-DIAL and DecoMetDIA. Here, we first extended our strategy to the simultaneous deconvolution of MS and MS/MS DIA data. This is not only more rigorous—since fragment ions of distinct ion species from the same molecule often share retention time profiles—but also more efficient, as MS1 pure spectra are now provided. Second, we redesigned the deconvolution strategy to extract all pure components from each retention time window in a single step, reducing redundancy and decreasing computation time. Post-processing quality filters were also included to discard weak or redundant components by analyzing their contribution to MS1 signals. In SWATH-DIA mode, we applied the DIA-NMF software to human plasma samples spiked with 47 chemical compounds at eight known concentrations (0–10 ng/mL). DIA-NMF identified more spiked compounds than MS-DIAL and DecoMetDIA at all concentrations. It also achieved higher reverse dot-product scores, indicating a better grouping of relevant fragments. These results highlight the value of the DIA-NMF method and software for integrated metabolomics workflows.
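As a generic illustration of the underlying factorization (not the DIA-NMF implementation itself): non-negative matrix factorization decomposes an intensity matrix from one retention-time window into component spectra and shared elution profiles.

```python
# Sketch: NMF on a hypothetical window of stacked MS1+MS2 intensities;
# rows = m/z bins, columns = scans across the retention-time window.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
X = rng.random((300, 40))  # nonnegative intensities (synthetic placeholder)

model = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
spectra = model.fit_transform(X)  # (bins x components): pure spectra
profiles = model.components_      # (components x scans): elution profiles
print(spectra.shape, profiles.shape)
```

Components whose elution profiles match between precursor and fragment signals would then be grouped into pure MS1/MS2 spectra, the step the abstract describes.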
2025-07-22 12:40:00 13:00:00 02F CompMS Towards mzTab-M 2.1 - Evolving the HUPO-PSI standard format for reporting of small molecule mass spectrometry results Nils Hoffmann Nils Hoffmann, Bo Burla, Yasin El Abiead, Janik Kokot, Philippine Louail, Steffen Neumann, Kozo Nishida, Thomas Payne, Johannes Rainer, Juan Antonio Vizcaíno, Ozgur Yurekten Mass spectrometry (MS) is central to modern large-scale metabolomics, but a lack of data format standardization for intermediate and final MS data analysis results still limits data sharing, database deposition, and reanalysis. To address this, the Human Proteome Organization’s Proteomics Standards Initiative (HUPO-PSI) and the Metabolomics Standards Initiative (MSI) originally developed mzTab-M 2.0.0 (published in 2019) as an open standard for reporting MS-based metabolomics data. mzTab-M uses a simple, tab-separated text format designed for both human readability and computational processing, based on a JSON schema and complemented by controlled-vocabulary-defined metadata. The format is detailed in a specification document, while a reference implementation and validator ensure data quality and consistency. The format comprehensively represents metabolomics results, including final quantification values and the identification evidence linking these values back to the raw MS features. Importantly, mzTab-M explicitly accommodates ambiguity in molecule identification, allowing researchers to clearly communicate levels of confidence. mzTab-M aims to be flexible by supporting CV-term controlled optional columns, thereby adapting to different experimental setups, applications and workflows. Initial implementations of mzTab-M in software like xcms, mzmine and OpenMS, and for submission to repositories like MetaboLights and GNPS, have revealed the need for significant updates and extensions to the format, its documentation and its implementations. mzTab-M 2.1.0 will therefore support these, along with new MS technologies, improved integration with other HUPO-PSI formats for sample metadata, QC results and cross-links to mass spectra in public databases, and more efficient and faster serialization and deserialization options.
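A toy snippet patterned on the format's prefixed-line, tab-separated design (MTD metadata lines, an SMH small-molecule header, SML data rows); consult the specification document for the actual required columns:

```python
# Pull one table section out of mzTab-M-style text by its line prefix.
import io

mztab_text = (
    "MTD\tmzTab-version\t2.0.0-M\n"
    "SMH\tSML_ID\tchemical_name\n"
    "SML\t1\tglucose\n"
)
rows = [line.rstrip("\n").split("\t") for line in io.StringIO(mztab_text)]
header = next(r for r in rows if r[0] == "SMH")  # small-molecule header row
molecules = [dict(zip(header[1:], r[1:])) for r in rows if r[0] == "SML"]
print(molecules)  # [{'SML_ID': '1', 'chemical_name': 'glucose'}]
```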
2025-07-22 14:00:00 14:20:00 02F CompMS Enhanced spectrum clustering for interpretable molecular networking Janne Heirman Janne Heirman, Yasin El Abiead, Wout Bittremieux Small molecule tandem mass spectrometry (MS/MS) produces vast datasets, making interpretation challenging. Molecular networking aids analysis by linking spectra based on similarity, identifying groups of similar compounds. Clustering is a crucial preprocessing step to reduce data redundancy and enhance network interpretability. Here we introduce enhanced clustering approaches integrated into Falcon, an efficient clustering tool available via GNPS. Low-quality spectra were removed, and noise peaks filtered before clustering spectra using the cosine similarity. Consensus spectra were generated with a novel noise-rejection algorithm. Clustering performance was evaluated using cluster completeness, the proportion of clustered spectra, and incorrect clustering rates. Clustering algorithms were evaluated using 502,993 MS/MS spectra acquired on a Thermo Astral mass spectrometer, with ground truth labels derived from MZmine-based feature detection. We evaluated hierarchical clustering (single, average, and complete linkage) alongside density-based clustering with DBSCAN, which was previously used in Falcon. At a maximum threshold of 14% incorrectly clustered spectra, hierarchical clustering with complete linkage significantly outperforms DBSCAN, clustering 89.7% of spectra (11.8% incorrect), while DBSCAN clusters only 39.8% (12.9% incorrect). While DBSCAN yielded a higher completeness score (0.856), complete linkage maintained strong completeness (0.826) with better accuracy. Despite the overall high incorrect clustering rates—partly due to strict ground truth criteria—hierarchical clustering offers a substantial improvement in clustering performance. By reducing redundancy and enhancing the interpretability of molecular networks, optimized clustering strategies like hierarchical clustering accelerate metabolite annotation and drive discovery in metabolomics research.
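A minimal sketch of the winning configuration reported above, complete-linkage hierarchical clustering on cosine distances, using generic vectorized spectra rather than Falcon's actual preprocessing:

```python
# Complete-linkage hierarchical clustering of (placeholder) spectrum vectors,
# cut at a cosine-distance threshold; compare to DBSCAN on the same distances.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
vectors = rng.random((100, 64))            # hypothetical binned/embedded spectra
dists = pdist(vectors, metric="cosine")    # condensed pairwise cosine distances

tree = linkage(dists, method="complete")   # complete-linkage agglomeration
labels = fcluster(tree, t=0.3, criterion="distance")  # threshold is illustrative
print(f"{labels.max()} clusters over {len(labels)} spectra")
```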
2025-07-22 14:20:00 14:40:00 02F CompMS Deep Learning for Small Molecule Analog Discovery From Untargeted Mass Spectrometry Juan Sebastian Piedrahita Giraldo Juan Sebastian Piedrahita Giraldo, Manuela Da Silva, Reza Shahneh, Mingxun Wang, Thomas De Vijlder, Kris Laukens, Wout Bittremieux Tandem mass spectrometry (MS/MS) is a key tool for analyzing the small molecule composition of biological samples. Untargeted metabolomics data analysis entails matching the experimental MS/MS spectra to spectral libraries based on spectrum similarity, typically using the cosine similarity. However, such heuristic techniques often fail to capture the structural similarity between molecules. With the purpose of discovering novel structural analogs, we developed a neural network model that has learned the relationship between MS/MS data and chemical structures. Our approach, called SIMBA, consists of a twin transformer encoder that receives pairs of MS/MS spectra and predicts the structural similarity of their molecules. The model was trained in a multi-task setting to predict both the “substructure edit distance,” a novel domain-inspired metric that reflects the number of modifications between molecules, and the maximum common edge subgraph (MCES), thus learning both the number of structural differences and their atomic cardinality. Harnessing the learning capabilities of transformers, the model was trained on 200 million spectrum pairs from the NIST20 and MassSpecGym spectral libraries. SIMBA significantly surpasses the state of the art for analog discovery, predicting the MCES with a higher Spearman correlation (r=0.93) compared to modified cosine (r=0.46). Likewise, SIMBA identifies analogs with high performance on the CASMI2022 dataset: SIMBA is able to retrieve analogs with a lower normalized MCES distance of 0.17 compared to traditional modified cosine (0.23) as well as deep learning approaches such as Spec2Vec (0.19) and MS2DeepScore (0.20).
2025-07-22 14:40:00 15:00:00 02F CompMS Mass Spectrometry and Machine Learning Reveal Stool-Based Multi-Signatures for Diagnosis and Longitudinal Monitoring of Inflammatory Bowel Disease Elmira Shajari Elmira Shajari, David Gagné, Patricia Roy, Mandy Malick, Maxime Delisle, François-Michel Boisvert, Marie Brunet, Jean-François Beaulieu Background: Monitoring inflammation activity in Inflammatory Bowel Disease (IBD) is essential for guiding treatment and preventing long-term complications. While fecal calprotectin is a common non-invasive biomarker, its diagnostic reliability declines significantly within the “gray zone” (50–300 µg/g), limiting its clinical utility. To address this challenge, we developed a stool-based proteomic biomarker panel for precise classification of inflammation activity in this diagnostically ambiguous range. Methods: We analyzed 155 stool samples from IBD patients for model training and reserved 53 samples for blind testing. The proteomic profiling was performed using SWATH-MS, a data-independent acquisition (DIA) mass spectrometry technique known for its reproducibility and depth. Protein- and peptide-level datasets were preprocessed separately. Feature selection was conducted using Boruta, LASSO, RF, and RFE. Features identified consistently across both data levels were prioritized. Six machine learning models (SVM, Random Forest, Naïve-Bayes, KNN, GLMnet, and XGBoost) were evaluated with 10-fold cross-validation, focusing on gray zone performance. Model interpretability was assessed using SHAP values and GO enrichment analysis explored the biological relevance of selected features. Results: We identified 19 protein-level and 14 peptide-level discriminatory features, with five robust overlapping markers selected for final modeling. The Support Vector Machine (SVM) model achieved the highest performance: 0.96 precision and 0.88 recall during training, and 1.00 precision and 0.86 recall in blind testing. SHAP analysis confirmed biomarker contribution, and enriched GO terms were linked to immune and inflammatory pathways. Conclusion: This proteomic signature offers a promising non-invasive tool for resolving diagnostic uncertainty in IBD monitoring within the gray zone.
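As a schematic of the modeling step (placeholder data, not the study's proteomic features): an SVM over a handful of consensus markers, evaluated with 10-fold cross-validation, might look like this:

```python
# Toy SVM pipeline mirroring the described setup: a few consensus markers,
# standardization, RBF-kernel SVM, 10-fold CV scored on precision.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(155, 5))       # 155 samples x 5 overlapping markers (synthetic)
y = rng.integers(0, 2, size=155)    # active vs. quiescent inflammation (synthetic)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, X, y, cv=10, scoring="precision")
print(f"10-fold CV precision: {scores.mean():.2f}")  # placeholder output
```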
2025-07-22 14:40:00 15:00:00 02F CompMS Rapid Deployment of Interactive and Visual Web Applications for Computational Mass Spectrometry Tom David Müller Tom David Müller, Arslan Siraj, Justin Cyril Sing, Joshua Charkow, Axel Walter, Samuel Wein, Ayesha Feroz, Matteo Pilz, Kyowon Jeong, Mingxuan Gao, Wout Bittremieux, Hannes Luc Röst, Oliver Kohlbacher, Timo Sachsenberg Mass Spectrometry (MS) is a highly versatile bioanalytical technique with a myriad of experimental approaches, instrumentation, and computational tools. If a desired analysis is not already supported by existing desktop applications, bioinformaticians often integrate scripts and tools to produce custom analyses and visualizations. While effective, this approach requires specialized expertise and limits accessibility for non-technical users. Traditional workflow systems can standardize and scale such analyses, but often lack user-friendly interfaces, visualization capabilities, and support for interactive decision-making during execution. To address these limitations, we present two freely available open-source solutions designed to streamline MS workflow development and deployment. pyOpenMS-viz enables rapid creation of publication-ready visualizations, such as spectra, chromatograms, and peak maps, with a single line of code, directly from pandas DataFrames, a common data structure in Python-based MS tools. Straightforward use cases that do not require complex development are well supported by Jupyter notebooks enhanced with pyOpenMS-viz. For broader accessibility and reuse, the OpenMS WebApp template offers a lightweight framework to develop interactive web applications with minimal effort. These apps guide users through uploading files, setting parameters, executing workflows involving arbitrary scripts and command line tools, and visualizing results interactively. Visualizations from pyOpenMS-viz and other libraries are fully supported. Applications can be deployed online, allowing users to share results (e.g., with collaborators) via website URLs, or offline via automatically generated Windows executables. Together, pyOpenMS-viz and the OpenMS WebApp template empower rapid prototyping, streamline deployment, and make MS workflows accessible to a wider scientific audience, promoting collaboration and reproducibility.
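A hedged sketch of the advertised one-line plotting from a pandas DataFrame; the backend identifier ("ms_matplotlib") and plot kind ("spectrum") follow the project's documented naming but should be checked against the pyopenms-viz docs before use:

```python
# pip install pyopenms-viz  (registers pandas plotting backends)
import pandas as pd

pd.options.plotting.backend = "ms_matplotlib"  # assumed backend name

spectrum = pd.DataFrame({"mz": [100.0, 250.5, 399.9],
                         "intensity": [1.0e4, 5.0e4, 2.0e4]})
spectrum.plot(kind="spectrum", x="mz", y="intensity")  # the one-line plot
```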
2025-07-22 15:00:00 15:20:00 02F CompMS PepSi-Print: Unraveling Protein Fingerprints through Pairwise Intensity Ratios with a Peptide Siamese Network Zixuan Xiao Zixuan Xiao, Mathias Wilhelm In MS-based bottom-up proteomics, proteins are enzymatically digested into peptides, which are identified and quantified to infer the parent proteins. While MS2 spectra provide peptide sequence information, the stochastic nature of peptide sampling and fragmentation introduces ambiguity in protein inference. Inspired by fragment ion intensity patterns (FIIP) used in MS2-based identification, we propose using peptide ion intensity patterns (PIIP) in MS1-based identification. Combined with isotope pattern, retention time, and ion mobility, PIIP defines protein fingerprints facilitating direct protein identification. We present a deep learning approach to model PIIP and integrate it into an MS1-based workflow. In this pursuit, we leveraged a large bacterial dataset comprising 343 raw files. Exploratory analysis revealed highly consistent fingerprint patterns across measurements (median Pearson’s correlation = 0.9), suggesting a stable signal suitable for learning. Based on this, we developed PepSi-Print, a Siamese network architecture with Long Short-Term Memory arms and a regression head that predicts pairwise logarithmic intensity ratios between peptides from the same protein, thus avoiding the need for absolute intensity ground truths. PepSi-Print achieves a median absolute error of 0.85 on pairwise predictions and 0.73 after aggregation at the sequence level. When applied to unseen raw files, predicted fingerprints correlate with observed ones with a median Pearson’s r of 0.75. Integrated into the DirectMS1 workflow, PepSi-Print improves peptide-feature-match (PFM) discrimination, reducing uncontrolled peptide-level false discovery rates (FDR) from >30% to as low as 2–3%. These learned fingerprints offer a novel protein-specific signal to enhance identification, enable isoform resolution, and improve quantification in MS1-only proteomics.
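The pairwise-ratio trick that frees the model from absolute intensity ground truths can be illustrated with a few toy peptides: the regression target is simply the log ratio of intensities within one protein.

```python
# Build pairwise log2-intensity-ratio targets for peptides of one protein
# (toy intensities; the real model predicts these ratios from sequence).
import itertools
import numpy as np

intensities = {"PEPTIDEA": 2.0e6, "PEPTIDEK": 5.0e5, "PEPTIDER": 8.0e6}

for (p1, i1), (p2, i2) in itertools.combinations(intensities.items(), 2):
    target = np.log2(i1 / i2)  # one regression target per peptide pair
    print(f"{p1} vs {p2}: log2 ratio = {target:+.2f}")
```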
2025-07-22 15:20:00 15:40:00 02F CompMS Identification of novel proteins by integrating ribosome profiling data into a transcriptomic language model for deeper mass spectrometry analyses Nicolas Provencher Nicolas Provencher, Sebastien Leblanc, Jean-Francois Jacques, Xavier Roucou Background – The human transcriptome contains millions of open reading frames (ORFeome) potentially coding for currently unknown or unannotated proteins. Typical mass spectrometry (MS)-based proteomics pipelines use protein databases to identify known proteins in a biological sample. The detection of proteins encoded in the human ORFeome would require a customized database including millions of proteins. However, proteomics analyses cannot be performed with millions of proteins predicted from the human ORFeome because of the unacceptably high false detection rates caused by large protein databases. Goal – Identify the functional human ORFeome by excluding random ORFs, enabling deeper proteome coverage through the detection of unannotated proteins in addition to previously annotated ones. Method – 78 ribosome profiling (Ribo-seq) studies from 47 unique tissue or cell line samples were reanalysed to curate a set of ORFs showing ribosomal activity. This set was added to the training set of TIS transformer, a transcriptomic language model used to predict functional ORFs across the human transcriptome. Results – The retraining and inference of the TIS transformer model allowed us to obtain a database containing 210,000 new unique protein sequences. In a preliminary experiment, our reanalysis of the ‘Deep HeLa proteome’ confirmed the expression of previously undetected proteins. Conclusion – Combining Ribo-seq with a transcriptomic language model can sort out relevant ORFs for the discovery of unannotated proteins. Reanalysis of a larger set of MS studies should identify proteins with a high potential to be biologically relevant.
2025-07-22 15:20:00 15:40:00 02F CompMS Zero-shot retention time prediction for unseen post-translational modifications with molecular structure encodings Ceder Dens Ceder Dens, Darien Yeung, Oleg Krokhin, Kris Laukens, Wout Bittremieux Identifying proteoforms with diverse post-translational modifications (PTMs) remains challenging in mass spectrometry-based proteomics. PTMs regulate protein activity and interactions and impact stability and localization. Limited knowledge of PTMs and their impact on liquid chromatography (LC) behavior hinders peptide identification, particularly for modified peptides. We introduce MoSTERT, a transformer-based model for retention time (RT) prediction of peptides with any PTM. MoSTERT encodes amino acids and their PTMs as a molecular structure, processed by a molecule encoder to generate residue-specific embeddings. A transformer then predicts RTs, even for peptides with unseen PTMs. We enhance accuracy by introducing a two-step model, MoSTERT-2S. First, a regular transformer encoder predicts the RT of the unmodified sequence. Then, MoSTERT predicts the RT shift induced by the modification. This strategy leverages the high prediction accuracy for unmodified peptides and the superior input representation for modified peptides. Trained on a dataset of 1.3M unmodified and 913K modified peptides (with 9 unique PTMs), MoSTERT was tested on external datasets, including ProteomeTools, with 70K peptides containing 16 unseen PTMs. Compared to DeepLC (MAE: 24.07 ± 13.44), MoSTERT-2S significantly improves RT prediction (MAE: 12.83 ± 11.83), setting a new state-of-the-art for peptides with novel PTMs.
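The two-step decomposition behind MoSTERT-2S can be sketched with stand-in functions (toy numbers, not the trained models): predict the unmodified peptide's retention time, then add a learned modification-specific shift.

```python
# Minimal sketch of the two-step RT prediction idea with toy stand-ins.
def base_rt(sequence: str) -> float:
    """Stand-in for the unmodified-sequence model (toy: RT grows with length)."""
    return 0.8 * len(sequence)

def rt_shift(sequence: str, mod: str) -> float:
    """Stand-in for the structure-aware shift model (toy lookup table)."""
    return {"Phospho": -1.5, "Acetyl": +0.7}.get(mod, 0.0)

def predict_rt(sequence: str, mod: str) -> float:
    # Step 1: base RT for the plain sequence; step 2: add the PTM-induced shift.
    return base_rt(sequence) + rt_shift(sequence, mod)

print(predict_rt("PEPTIDEK", "Phospho"))  # 6.4 - 1.5 = 4.9 (toy units)
```

The appeal of this split is that the first model can exploit the abundance of unmodified training peptides, while the second only has to learn the (smaller) shift induced by the modification.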
2025-07-22 15:40:00 16:00:00 02F CompMS Open modification proteogenomics fosters reproducible detection of non-canonical proteins Valeriia Vasylieva Valeriia Vasylieva, Enrico Massignani, Francis Bourassa, Tine Claeys, Lennart Martens, Marie A. Brunet MS-based proteomics enables the identification of thousands of proteins within a single sample. Analysis of MS/MS spectra involves database search engines, which match experimental spectra to theoretical spectra generated from in silico digestion of a reference proteome. A key limitation of this method lies in its dependence on the size of the search space. As it grows, the false discovery rate (FDR) can increase and the overall identification rate can decline. Proteogenomic databases containing non-canonical proteins are notoriously large and prone to FDR inflation. Unrestricted searches that consider all possible modifications further expand the search space. Ionbot is a fast, semi-supervised machine learning-based search engine. It handles large search spaces efficiently through a data-driven sequence tag-based approach. Here, we leveraged ionbot to enhance the identification of non-canonical proteins through unrestricted searches. We showed that the identification rate is increased by 17% under a controlled FDR when using open modification search with ionbot. Open search with ionbot increased the detection of non-canonical proteins, with 51% supported by more than one PSM, compared to a mere 5% in a standard TPP search. Similarly, 40% of non-canonical proteins were detected in at least 2 samples with ionbot, compared to only 5% with TPP. 86% of non-canonical proteins were identified with at least one modified peptide, and 78% of peptides unique to non-canonical proteins were identified only in their modified form. Our study highlights the importance of open searches for robust and reliable detection of non-canonical proteins.
2025-07-22 15:40:00 16:00:00 02F CompMS PCI-DB: A novel primary tissue immunopeptidome database to guide next-generation peptide-based immunotherapy development Steffen Lemke Anna Dengler, Juliane S. Walz, Sven Nahnsen, Jonas S. Heitmann, Tatjana Bilich, Sven Fillinger, Cécile Gouttefangeas, Hans-Georg Rammensee, Yacine Maringer, Steffen Lemke, Susanne Jung, Marcel Wacker, Jonas Scheid, Naomi Hoenisch-Gravel, Annika Nelde, Jens Bauer, Patrick Zimmermann, Marissa L. Dubbelaar Various cancer immunotherapies rely on the T cell-mediated recognition of peptide antigens presented on human leukocyte antigens (HLA). However, the identification and selection of naturally presented peptide targets for the development of personalized as well as off-the-shelf immunotherapy approaches remains challenging. Here, we introduce the open-access Peptides for Cancer Immunotherapy Database (PCI-DB, https://pci-db.org/), a comprehensive resource of immunopeptidome data originating from various malignant and benign primary tissues that provides the research community with a convenient tool to facilitate the identification of peptide targets for immunotherapy development. The PCI-DB includes > 6.6 million HLA class I and > 3.4 million HLA class II peptides from over 40 tissue types and cancer entities. A first application of the database provided insights into the presentation of cancer-testis antigens across malignant and benign tissues, enabling the identification and characterization of pan-tumor and entity-specific tumor-associated antigens as well as naturally presented neoepitopes from frequent cancer mutations. Further, we used the PCI-DB to design personalized peptide vaccines for two patients suffering from metastatic cancer. In a retrospective analysis, the PCI-DB enabled validation of each patient's vaccine composition: a multi-peptide vaccine comprising non-mutated, highly frequent tumor-associated antigens matching the immunopeptidome of the individual patient's tumor, and a neoepitope-based vaccine matching the mutational profile of the cancer patient. Both vaccines induced potent and long-lasting T-cell responses, accompanied by long-term survival of these advanced cancer patients. The PCI-DB is a highly versatile tool to broaden the understanding of cancer-related antigen presentation and, ultimately, supports the development of novel immunotherapies.
2025-07-22 16:40:00 17:00:00 02F CompMS CellPick: a cell selection toolkit for spatial proteomics Lucas Miranda Paolo Pellizzoni, Lucas Miranda, Caroline Weiss, Matthias Mann, Karsten Borgwardt We present CellPick: a computational tool for facilitating the selection of cells for laser microdissection in spatial proteomics applications. Laser microdissection is a technique that allows the dissection of single cells from tissues via a high-powered laser. A naïve selection of the cells to be cut, such as random selection, is usually effective in contexts with abundant cells. However, it often results in the selection of contiguous shapes when applied to tissue regions with sparse cellular distribution, thereby risking damage to the cells during microdissection. To overcome this, we introduce a custom shape selection technique that employs a combinatorial optimization procedure for the selection of cells, ensuring the selection of non-contiguous shapes while approximately maximizing coverage within a restricted tissue area. Often, in spatial proteomics, one seeks a statistical correlation between protein intensities and the positions of the cells at hand. Our tool allows the specification of two points of interest, such as two types of veins, establishing a gradient along a relevant axis. The selected shapes are then automatically endowed with a value indicating how close they are to the two points of interest, locating each shape along the gradient and making it possible to correlate protein intensity levels with proximity to the points of interest. We showcase an application of our tool in single cell spatial proteomics on liver samples.
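The gradient positioning described above amounts to projecting each selected shape's centroid onto the axis between the two points of interest; a minimal numpy sketch with hypothetical coordinates:

```python
# Project shape centroids onto the axis between two points of interest;
# the scalar t is 0 at point A, 1 at point B (coordinates are made up).
import numpy as np

point_a = np.array([120.0, 80.0])    # e.g., one vein type
point_b = np.array([420.0, 310.0])   # e.g., the other vein type
centroids = np.array([[150.0, 110.0], [300.0, 200.0], [400.0, 290.0]])

axis = point_b - point_a
t = (centroids - point_a) @ axis / (axis @ axis)  # position along the gradient
print(t)  # values near 0 are close to A, near 1 close to B
```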
2025-07-22 17:00:00 17:20:00 02F CompMS A living proteomics benchmark for comprehensive evaluation of deep learning-based de novo peptide sequencing tools Marina Pominova Marina Pominova, Wout Bittremieux, Charlotte Adams, Ceder Dens Mass spectrometry-based proteomics is essential for understanding protein composition and function, yet traditional sequence database-based methods face challenges in identifying novel peptides, post-translational modifications, and diverse proteomes. De novo peptide sequencing, which operates independently of sequence databases, offers a powerful approach for uncovering these unknown peptides. However, the lack of consistent evaluation frameworks for the growing array of deep learning-based de novo sequencing tools limits their adoption and effective application. Here, we introduce a comprehensive, community-driven benchmarking resource designed to assess the performance of various de novo sequencing tools across a broad range of experimental conditions and proteomic applications. Our benchmark employs heterogeneous datasets to establish a standardized evaluation framework, providing a transparent, evolving resource accessible through an interactive online dashboard. This benchmark is anticipated to offer key insights into tool performance, aiding researchers in selecting suitable tools and identifying areas for future refinement and development in de novo peptide sequencing.
2025-07-22 17:00:00 17:20:00 02F CompMS Mavis: An Ensemble of Methods for Mean-Variance Trend Modeling and Bayesian Decision in Comparative Proteomics George Popescu, Philip Berg Motivation: Comparative methods that use dataset-level information such as mean-variance trends have been highly successful for several types of -omics data. While a large number of software pipelines exist for the analysis of mass spectrometry data, tools for statistical modeling of dataset-level properties after quantification are lacking. We address this gap by introducing Mavis, an ensemble of statistical methods implemented in R for mean-variance trend modeling and Bayesian decision in comparative proteomics. Results: Mavis facilitates dataset-specific modeling, particularly emphasizing models that utilize mean-variance (M-V) trend properties. Mavis builds on recent methodologies that model proteomics M-V trends with gamma regression. It proposes a new M-V trend clustering method, coined Gamma Cluster Regression (GCR). Mavis implements two imputation strategies: a random forest imputation and a trend-based multiple imputation. Finally, we present a new ensemble method that makes statistical decisions by aggregating p-values of component methods. We evaluated Mavis across several label-free proteomics benchmark datasets; GCR paired with Baldur consistently delivered the best precision, and weighted limma outperformed limma-trend and t-test on most datasets, while the ensemble gave a robust decision across all data. The pipeline supports the development of other mean-variance trend models within the same comparative proteomics software framework. Mavis is available as an R (R Core Team, 2021) package on GitHub (https://github.com/PhilipBerg/mavis).
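One classical way to aggregate p-values from component methods is Fisher's combination, shown here with SciPy; Mavis's own aggregation rule may differ, so this is only an illustration of the ensemble idea.

```python
# Aggregate per-method p-values for one protein into a single decision.
from scipy.stats import combine_pvalues

pvals = [0.031, 0.120, 0.008]  # hypothetical p-values from three component tests
stat, p_combined = combine_pvalues(pvals, method="fisher")
print(f"combined p = {p_combined:.4f}")
```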
2025-07-22 17:20:00 18:00:00 02F CompMS TBD TBD
2025-07-22 11:20:00 12:00:00 04AB Computational Systems Immunology Knowledge-based machine learning to study cellular regulation from spatial multi-omics data Julio Saez-Rodriguez Julio Saez-Rodriguez Multi-omics technologies, especially with single-cell and spatial resolution, provide unique opportunities to study the key intra- and inter-cellular processes that drive immunological systems and their deregulation in disease. The use of prior biological knowledge allows us to reduce the dimensionality and increase the interpretability of the data, in particular by extracting from the data features describing the activity of molecular processes such as signaling pathways, gene regulatory networks, and cell-cell communication events. In this talk, I will present resources and methods that combine multi-omic single cell and spatial data with biological knowledge and illustrate them on medically relevant cases.
2025-07-22 12:00:00 12:20:00 04AB Computational Systems Immunology Flexible and robust cell type annotation for highly multiplexed tissue images Robert F. Murphy Robert F. Murphy, Huangqingbo Sun, Anna Martinez Casals, Anna Bäckström, Yuxin Lu, Cecilia Lindskog, Matthew Ruffalo, Emma Lundberg Identifying cell types in highly multiplexed images is essential for understanding tissue spatial organization. Current cell type annotation methods often rely on extensive reference images and manual adjustments. We have developed an open-source tool, Robust Image-Based Cell Annotator (RIBCA), that enables accurate, automated, unbiased, and fine-grained cell type annotation for multichannel tissue images without requiring additional model training or human intervention. It can be used with a wide range of antibody panels, initially focused on immune cell types. The design has two novel aspects. The first is an ensemble approach that merges a number of distinct deep learning models, each annotating a different subset of cell types using a different set of markers. That design allows for easy extension to additional cell types without the need to retrain the existing models. The second is the use of auxiliary models to allow prediction of markers missing from a given panel. This provides much better assignments than replacing missing markers with a blank channel. Our tool has successfully annotated over 3 million cells, revealing the spatial organization of various cell types across more than 40 different human tissues.
2025-07-22 12:20:00 12:40:00 04AB Computational Systems Immunology Dual-Graph Attention Network for Protein Imputation from Spatial Transcriptomics Haoyu Wang Haoyu Wang, Brittany Cody, Hatice Osmanbeyoglu Cell function in multicellular systems is shaped by its spatial context, making the study of cellular interactions within tissues essential for understanding development and disease. While spatial transcriptomic (ST) technologies capture genome-wide mRNA expression with spatial resolution, they do not provide protein-level insights, which are crucial for understanding cellular function and therapeutic targeting. Recent advancements like spatial CITE-seq enable simultaneous profiling of gene and protein expression. Here, we present a computational framework that imputes protein abundances from ST data by leveraging RNA–protein relationships learned from spatial CITE-seq using a Dual-Graph Attention Network (DGAT). Our method constructs heterogeneous graphs based on mRNA/protein expression and spatial coordinates, aligning these representations with Graph Attention Network encoders. These representations are decoded using a fully connected network for mRNA reconstruction and a multi-branch decoder for protein prediction. We applied DGAT to publicly available and in-house spatial CITE-seq datasets, including tonsil, breast cancer, glioblastoma, and malignant mesothelioma samples, demonstrating superior accuracy in imputing protein expression compared to methods that do not incorporate spatial information. We further applied DGAT to ST datasets from tonsil and breast cancer tissues, revealing deeper insights into cellular states, immune phenotypes, and spatial domains, such as germinal centers. Our approach enhances cell-type assignments, spatial domain detection, and the interpretation of ST datasets, advancing the understanding of tissue architecture and immune responses.
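The graph-construction step can be sketched generically (hypothetical coordinates, not the DGAT code): a k-nearest-neighbor adjacency over spot positions provides the spatial graph, and an analogous graph built over expression profiles would supply the second view fed to the attention encoders.

```python
# Build a spatial k-NN adjacency over (synthetic) spot coordinates; the same
# call on an expression matrix would give the expression-similarity graph.
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(4)
coords = rng.uniform(0, 1000, size=(500, 2))  # 500 spots in image coordinates

spatial_adj = kneighbors_graph(coords, n_neighbors=6, mode="connectivity")
print(spatial_adj.shape, spatial_adj.nnz)  # sparse 500x500 adjacency, 3000 edges
```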
2025-07-22 12:40:00 13:00:00 04AB Computational Systems Immunology Single-cell and spatial atlas of the Human Ageing Thymus Veronika Kedlian Veronika Kedlian, Lisa Marie Milchsack, Marita Bosticardo, Nadav Yayon, Francesca Pala, Luigi D. Notarangelo, Sarah A. Teichmann The thymus is a primary immune organ responsible for the production of naive T cells which experiences age-related changes, also termed involution, very early in the lifespan. Despite the thymus's crucial role in the adaptive immune system, we lack a complete picture of the cell types and mechanisms involved in the process. Here, we present the largest collection to date of thymus single-cell (~1M cells, 63 donors) and spatial sequencing data (Visium and Xenium spatial transcriptomics) across the human lifespan and different stages of involution by integrating newly generated as well as publicly available datasets. This resource has allowed us to identify and clarify changes in major cellular and spatial compartments of the thymus as well as their temporal order in the involution process. Broadly, we observe a loss in developing thymocytes and cTECs, and an expansion of the set of resident stromal populations in the adult thymus, causing a decline in T cell production. More specifically, we highlight two major bottlenecks in T cell development that are implicated in the thymocyte decline. We also observe the expansion of dysfunctional mcTEC “progenitors” with age that is progressively biased towards mTEC generation. In the stromal compartment, we identify putative fibroblast states implicated in adipose and fibrotic changes of the thymus. Beyond these and other biological discoveries, we spatially position these populations using high-resolution spatial transcriptomics data, providing a valuable resource to study thymus and immune ageing for the community.
2025-07-22 14:00:00 14:40:00 04AB Computational Systems Immunology TBD Harinder Singh
2025-07-22 14:40:00 15:00:00 04AB Computational Systems Immunology Network-based integration of epigenetic and transcriptomic landscapes unveils molecular programs underlying T follicular helper cell differentiation Alisa Omelchenko Alisa Omelchenko, Rebecca Elsner, Syed Rahman, Vinay Mahajan, Jishnu Das Network approaches allow us to integrate multi-modal datasets and view immune system pathways through a multi-scale lens. Designing interpretable methods is therefore necessary for a holistic view of the immune system. We developed a novel integrated network framework to study T follicular helper (Tfh) cell differentiation. Tfh cell differentiation is a highly heterogeneous process that remains poorly understood and difficult to study due to experimental limitations. As a result, existing Tfh network diagrams are incomplete, with each study providing valuable but often disconnected information. Specifically, we identified subnetworks by integrating epigenomic or transcriptomic signals with a protein-protein interaction (PPI) network through network propagation. We also detected key transcription factors (TFs) by merging epigenomic signals with the transcriptional regulatory network using Personalized PageRank. In addition to capturing well-known circuits, we unveiled novel modules integral to Tfh cell differentiation, including the IL-12/IL-23/IL-27 pathway and SHP2 signaling. IL-12 has been a controversial element, and its role is highly debated; our analysis shows, in an unbiased manner, the function of IL-12 within this network. Further, while the functions of Tfh cells and their cooperative roles with B cells are broadly similar in mice and humans, several differences have previously been reported which have also been contested. We underscore significant similarities between the human and murine Tfh networks, suggesting a higher degree of conservation than previously reported. These insights provide a more cohesive understanding of the regulatory mechanisms governing Tfh cell differentiation and pave the way for therapeutic interventions targeting humoral immunity.
2025-07-22 15:00:00 15:20:00 04AB Computational Systems Immunology Tissue first single cell RNA seq strategy reveals renal tumour specific expansion of regulatory DN1 B cells Isabella Withnell Isabella Withnell, Zara Baig, Joseph Ng, Franca Fraternali, Claudia Mauri Pan-cancer atlases often integrate cells across tissues, obscuring context-specific immune states. We developed a tissue-first single-cell RNA-seq pipeline that (i) clusters B cells within each cancer type and then (ii) embeds the resulting clusters into a shared latent space learned by a variational autoencoder (VAE). This approach preserves rare phenotypes and enables quantitative inter-cluster distance measurements. Across four tumour types (breast, colorectal, lung, renal; 97,450 B cells), our approach separated tumour-conserved from tissue-restricted programs. Renal tumours were enriched for a Double Negative 1 (DN1) B cell subset (CD27⁻ IGHD⁻ CR2⁺) with high IL-10, IL-23A, TIGIT and hypoxia/osmotic stress signatures, suggesting environment-driven expansion. Spatial immunohistochemistry and flow cytometry confirmed enrichment of these cells in the renal tumour, where they reside in dysfunctional tertiary lymphoid structures. They suppressed CD8⁺ T cell cytotoxicity in vitro, and their enrichment correlates with poor overall survival. Our framework recovers immune diversity hidden by global integrations and uncovers a kidney-enriched B cell population with therapeutic relevance.
2025-07-22 15:20:00 15:40:00 04AB Computational Systems Immunology ImmunoMatch learns and predicts cognate pairing of heavy and light immunoglobulin chains Dongjun Guo Dongjun Guo, Deborah Dunn-Walters, Franca Fraternali, Joseph Ng The development of stable antibodies formed by compatible heavy (H) and light (L) chain pairs is crucial in both the in vivo maturation of antibody-producing cells and the ex vivo designs of therapeutic antibodies. We present here a novel machine learning framework, ImmunoMatch, for deciphering the molecular rules governing the pairing of antibody chains. Fine-tuned on an antibody-specific language model, ImmunoMatch learns from paired H and L sequences from single human B cells to distinguish cognate H-L pairs and randomly paired sequences. We find that the predictive performance of ImmunoMatch can be augmented by training separate models on the two types of antibody L chains in humans, κ and λ, in line with the in vivo mechanism of B cell development in the bone marrow. Using ImmunoMatch, we illustrate that refinement of H-L chain pairing is a hallmark of B cell maturation in both healthy and disease conditions. We find further that ImmunoMatch is sensitive to sequence differences at the H-L interface. ImmunoMatch focusses on H-L chain pairing as a specific, under-explored problem in antibody developability, and facilitates the computational assessment and modelling of stably assembled immunoglobulins towards large-scale optimisation of efficacious antibody therapeutics.
2025-07-22 15:40:00 16:00:00 04AB Computational Systems Immunology Taxon-specific linear B-cell epitope prediction with phylogeny-aware transfer learning Felipe Campelo Lindeberg Leite, Teófilo de Campos, Francisco Lobo, Felipe Campelo The identification of linear B-cell epitopes (LBCEs) is an important step in the development of immunodiagnostic tests and vaccines. Most existing computational methods for LBCE prediction are generalist models and do not incorporate explicit information on the target pathogen or its evolutionary relationships with other organisms present in the training data. This can lead to biases toward well-studied pathogens or taxa, with poorer performance for emerging or neglected infectious agents. To address this limitation, we present a phylogeny-aware framework that enhances LBCE prediction by incorporating evolutionary relationships into model training. Our approach employs taxonomy as a proxy to phylogeny and uses it to curate the training data. We introduce a transfer learning strategy that fine-tunes large protein language models using data available for higher-level taxa before deploying them to create a pathogen- or taxon-specific predictive model. This phylogeny-aware feature embedding substantially improves predictive accuracy compared to state-of-the-art methods, particularly but not exclusively for data-scarce or understudied pathogens. By leveraging evolutionary relationships, our framework optimises the use of available epitope data and provides more accurate LBCE prediction for emerging or neglected infectious agents. We report computational results for 20 target taxa including viral, bacterial and eukaryotic pathogens, which indicate median AUC gains between 0.15 and 0.2 in relation to current methods. Reference: The results presented in this work are described in greater detail in our paper "EpitopeTransfer: a Phylogeny-aware transfer learning framework for taxon-specific linear B-cell epitope prediction", currently under review.
2025-07-22 16:40:00 17:00:00 04AB Computational Systems Immunology Iterative Attack-and-Defend Framework for Improving TCR-Epitope Binding Prediction Models Pengfei Zhang Pengfei Zhang, Hao Mei, Seojin Bang, Heewook Lee Reliable TCR-epitope binding prediction models are essential for the development of adoptive T cell therapy and vaccine design. These models often struggle with false positives, which can be attributed to the limited data coverage in existing negative sample datasets. Common strategies for generating negative samples, such as pairing with background TCRs or shuffling within pairs, fail to account for model-specific vulnerabilities or biologically implausible sequences. To address these challenges, we propose an iterative attack-and-defend framework that systematically identifies and mitigates weaknesses in TCR-epitope prediction models. During the attack phase, a Reinforcement Learning from AI Feedback (RLAIF) framework is used to attack a prediction model by generating biologically implausible sequences that can easily deceive the model. During the defense phase, these identified false positives are incorporated into the fine-tuning dataset, enhancing the model's ability to detect false positives. A comprehensive negative control dataset can be obtained by iteratively attacking and defending the model. This dataset can be directly used to improve model robustness, eliminating the need for users to conduct their own attack-and-defend cycles. We apply our framework to five existing binding prediction models, spanning diverse architectures and embedding strategies, to show its efficacy. Experimental results show that our approach significantly improves these models' ability to detect adversarial false positives. The combined dataset constructed from these experiments also provides a benchmarking tool to evaluate and refine prediction models.
2025-07-22 17:00:00 17:20:00 04AB Computational Systems Immunology NeoPrecis: A Computational Framework for Assessing Neoantigen Immunogenicity to Advance Cancer Immunotherapy Ko-Han Lee Ko-Han Lee, Timothy Sears, Maurizio Zanetti, Hannah Carter Cancer immunotherapy, including immune checkpoint inhibitors (ICIs) and personalized cancer vaccines, has transformed cancer treatment. However, response rates remain suboptimal, highlighting the need for more effective strategies. Accurate identification of immunogenic neoantigens is critical to improving therapeutic outcomes, yet current prediction methods have key limitations—including insufficient modeling of T-cell recognition, limited incorporation of major histocompatibility complex (MHC) class II pathways, and a lack of tumor clonality integration. To address these challenges, we developed NeoPrecis, a computational framework that combines enhanced T-cell recognition modeling with comprehensive immunogenicity and tumor clonality analysis. It comprises two modules: NeoPrecis-Immuno, which predicts T-cell recognition by modeling cross-reactivity distances between wild-type and mutant peptides, and NeoPrecis-Landscape, which integrates immunogenicity predictions with clonal architecture to derive a tumor-centric immunogenicity score. NeoPrecis-Immuno achieves superior accuracy on an independent gastrointestinal cancer dataset with validated CD4+ and CD8+ T-cell assays, outperforming state-of-the-art predictors including PRIME, DeepNeo, and ICERFIRE. The MHC-binding motif enrichment step enhances the model’s ability to capture features relevant to peptide-MHC (pMHC) and T-cell receptor (TCR) interaction. A derived motif benefit score—even in the absence of specific mutations—shows significant association with patient survival in melanoma (p = 0.04) and non-small cell lung cancer (NSCLC, p = 0.01). NeoPrecis-Landscape further outperforms tumor mutation burden (TMB) in predicting ICI response, particularly in melanoma and in heterogeneous NSCLC tumors with low immunoediting. Together, these findings highlight NeoPrecis as a robust, interpretable tool for neoantigen prioritization and personalized cancer immunotherapy guidance.
2025-07-22 17:20:00 17:40:00 04AB Computational Systems Immunology SHISMA: Shape-driven inference of significant celltype-specific subnetworks from time series single-cell transcriptomics Antonio Collesei Antonio Collesei, Francesco Spinnato, Pierangela Palmerini, Emilia Vigolo Recent advances in RNA sequencing technologies and the gradual decrease in costs have made it possible to design serial experiments with time embeddings, even at single cell resolution (scRNAseq). This possibility unlocks a finer level of detail, as well as a huge amount of noisy information to decode. Tools inferring regulatory networks, or patterns, from this type of data often focus on trajectories, disregarding local shapes and fundamental time series primitives. Moreover, they fail to focus the analysis on a few meaningful results, reporting large, noisy outputs that require further downstream analysis, especially given the intricate protein interplay within, for example, the immune context. We describe SHISMA, a novel tool to infer significant cell type-specific regulatory patterns, or subnetworks, with strong statistical guarantees in terms of p-value. SHISMA exploits a time series primitive, the Bag-of-Patterns, adapted to discretize shorter temporal data (that is, with few time embeddings) and retain local shapes. SHISMA extracts significant patterns by performing a random walk approach on a protein-protein interaction network, with nodes identified by genes and scores derived from the shape-induced representation of the data, while properly validating via permutation and correcting for multiple hypothesis testing. Our extensive experimental evaluation on synthetic data shows that our tool is able to retrieve specific and significant subnetworks from single cell time series transcriptomic data. Moreover, when tested on a real-world B-cell scRNAseq dataset, SHISMA recovered known cell type-specific immunological processes, as well as potentially novel patterns and regulatory mechanisms.
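The Bag-of-Patterns primitive SHISMA adapts can be illustrated with a toy discretization (a SAX-style sketch under simplifying assumptions, not the tool's implementation): z-normalize a short series, bin values into symbols, and collect sliding words that capture local shapes.

```python
# Toy Bag-of-Patterns step for a short time series.
import numpy as np

def bag_of_patterns(series, n_symbols=4, word_len=3):
    """Z-normalize, bin into symbols via quantile breakpoints, and return
    the sliding words of consecutive symbols (the "local shapes")."""
    z = (series - series.mean()) / (series.std() + 1e-9)
    breakpoints = np.quantile(z, np.linspace(0, 1, n_symbols + 1)[1:-1])
    symbols = np.digitize(z, breakpoints)              # values in 0..n_symbols-1
    alphabet = np.array(list("abcdefgh"))[:n_symbols]
    letters = alphabet[symbols]
    return ["".join(letters[i:i + word_len])
            for i in range(len(letters) - word_len + 1)]

# Hypothetical expression of one gene over six time points in one cell type.
print(bag_of_patterns(np.array([0.1, 0.4, 1.2, 0.9, 0.2, 0.05])))
```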
2025-07-22 17:40:00 17:45:00 04AB Computational Systems Immunology Unraveling Immune Signatures of Whole-Cell vs. Acellular Pertussis Vaccine Priming through Multi-Omics Feature Fusion Divya Sitani Nico Henschel, Vaishnavee Thote, Pia Grundschoettel, Thomas Ulas, Divya Sitani, Joachim L. Schultze Early-life vaccination with whole-cell or acellular pertussis vaccines shapes long-term immune trajectories that influence responses to booster immunizations. In this study, we analyzed immune responses following tetanus, diphtheria, and acellular pertussis booster vaccination to investigate how infancy vaccination influences recall responses later in life. We applied machine learning to immune data from the CMI-PB Challenge, including gene expression, antibody titers, cytokine levels, and cell frequencies from annual donor cohorts collected from 2020 to 2022. Each measurement type was treated as a separate modality. We applied cohort-wise normalization and SHAP-based feature selection within each modality, followed by feature-level fusion to integrate selected features across modalities. A range of classifiers, including random forests, SVM, KNN, logistic regression, multilayer perceptrons, and XGBoost, were applied on individual modalities, pairs, and fused datasets to distinguish between whole-cell and acellular priming. SHAP analysis identified IgG4 antibodies to filamentous hemagglutinin and pertussis toxin, along with cytokines such as CCL8, CCL2, IL-1 alpha, and CXCL9, as key predictors, suggesting that repeated boosting may shape both antibody profiles and cytokine-driven immune responses. In model evaluations, no individual modality consistently outperformed others across cohorts. For example, when training on 2021 and 2022 and testing on 2020, gene expression achieved AUROC 0.859 while the multimodal model reached 0.958. When testing on 2022, antibody features yielded AUROC 0.792 and the multimodal model achieved 0.866, potentially reflecting immune signatures shaped by COVID-19 vaccination. These findings highlight the importance of multimodal fusion for cohort-generalizable immune response prediction.
2025-07-22 17:45:00 17:50:00 04AB Computational Systems Immunology The Integrated Cellular and Molecular Landscape of Autoimmunity Romina Appierdo, Pier Federico Gherardini, Francesco Vallania, Marina Sirota, Manuela Helmer-Citterich, Gerardo Pepe Autoimmune diseases are heterogeneous and multifactorial, making it challenging to identify unifying mechanisms or clinically useful biomarkers. Despite abundant transcriptomic data, there is no integrated framework that systematically captures shared and tissue-specific immune dysregulation across diseases. Here, we curated a large-scale transcriptomic compendium of 13,263 samples across 10 autoimmune diseases, integrating 6,238 blood and 7,025 tissue profiles from 156 public studies. Using meta-analysis, we derived robust, reproducible gene signatures for each disease, providing a foundational resource for downstream applications such as biomarker discovery and computational drug repurposing. To enhance interpretability, we used computational strategies to extract 700 immune-relevant features—including pathway activity, cytokine expression, immune cell proportions, and regulatory signatures from transcription factors and miRNAs. We organized these features into 15 biologically coherent modules based on correlation patterns across diseases. Each module summarizes key immunological processes, such as interferon responses or adaptive immune triggering, enabling systematic and interpretable comparisons across diseases. Subsequently, to address a major gap in the field, we systematically compared immune activity in blood and tissue across diseases. This revealed modest gene-level overlap but strong compartment-spanning coordination for specific pathways, particularly interferon signaling. Finally, we demonstrated the translational relevance of this modular framework through multiple clinical applications. Modules predicted treatment response in inflammatory bowel disease (AUC up to 0.80), correlated with clinical severity in psoriasis (R > 0.7 with PASI score), and differentiated responders from non-responders before treatment. Modules also responded as expected to immune perturbation in controlled stimulation experiments, confirming biological validity.
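
The module-construction step described here (grouping correlated features into coherent modules) admits a compact sketch: cluster features on a 1 - |correlation| distance and cut the tree into a fixed number of modules. The data, feature count, and number of clusters below are placeholders; the study itself derives 15 modules from 700 features.

```python
# Toy sketch of grouping features into correlated "modules" via hierarchical clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
features = rng.normal(size=(50, 40))              # 50 samples x 40 immune-relevant features

corr = np.corrcoef(features, rowvar=False)        # feature-feature correlation
dist = 1 - np.abs(corr)                           # strongly correlated features are "close"
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
modules = fcluster(Z, t=5, criterion="maxclust")  # cut into 5 modules (15 in the study)
print({m: int((modules == m).sum()) for m in set(modules)})
```
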
2025-07-22 17:50:00 17:55:00 04AB Computational Systems Immunology BepiCon: A Geometric Deep Learning Framework for Conformational B Cell Epitope Prediction Bünyamin Şen Bünyamin Şen, Tunca Doğan Accurate and reliable prediction of B cell epitopes holds critical importance in immunology and vaccine development. While traditional experimental methods offer high accuracy in identifying epitope regions, they are often laborious, time-consuming, and costly. Computational approaches are therefore used to increase the efficiency of experimental characterization. Since approximately 90% of epitopes are conformational, the prediction processes must account for three-dimensional protein structures and the geometric details of antigen-antibody interactions. In response to these requirements, our study introduces BepiCon (B-cell EPItope Prediction Using CONtrastive learning), a two-step geometric deep learning framework that models antigen proteins as graph structures, incorporating structural and physicochemical properties and protein language model embeddings to automatically predict epitope regions on antigen proteins. In the first stage, the model was trained using a graph contrastive learning approach to learn high-quality representations of epitope and non-epitope residues. In the second stage, the pre-trained model was fine-tuned using supervised learning to perform conformational epitope prediction. The developed framework has demonstrated effective and generalizable performance when applied to both experimentally determined protein structures and predicted structures. Comparative analysis revealed that our approach distinguishes itself from existing B cell epitope prediction methods by exhibiting a lower false-positive rate and generating more reliable predictions. Our work contributes significantly to scientific research and therapeutic design processes by showcasing the advantages of geometric deep-learning approaches in B-cell epitope prediction.
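
The two-stage recipe (contrastive pre-training, then supervised fine-tuning) rests on a contrastive objective over residue embeddings. Below is a generic NT-Xent loss of the kind commonly used in graph contrastive learning; it is a sketch of the family of objectives involved, not BepiCon's actual loss or model.

```python
# Minimal NT-Xent contrastive loss over paired embeddings (generic sketch).
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    # z1, z2: (N, d) embeddings of two augmented views of the same residues.
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = z @ z.t() / tau                         # temperature-scaled cosine similarities
    n = z1.size(0)
    sim.fill_diagonal_(float("-inf"))             # never match a view with itself
    # Each row's positive is the other view of the same residue.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 32), torch.randn(8, 32))
```
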
2025-07-22 17:55:00 18:00:00 04AB Computational Systems Immunology NanoAIRR: full-length adaptive immune receptor profiling from Nanopore long-read sequencing Jonas Schuck Jonas Schuck, Samira Ortega Iannazzo, Zeina Yasser Mahmoud, Lucie Marie Hasse, Katharina Imkeller Characterizing the antigen receptor repertoire of adaptive immune cells in solid tumors is essential for understanding the dynamics of immune responses across diverse cancer types. Adaptive immune receptor repertoire (AIRR) studies commonly rely on short-read sequencing, primarily capturing complementarity-determining region 3 (CDR3) sequences. However, full-length immunoglobulin and T cell receptor sequences offer richer structural information, including allelic variants, somatic hypermutations, and constant regions that define receptor isotypes. Transcripts encoding B and T cell receptors in 10x Genomics Visium-based spatial transcriptomics experiments range from 500 to 3500 base pairs, which requires long-read sequencing to resolve their full sequence architecture. While Oxford Nanopore Technologies (ONT) offers a cost-effective solution for long-read sequencing, robust bioinformatic tools for processing and annotating full-length antigen receptor sequences are lacking. To fill this gap, we introduce NanoAIRR. This bioinformatic toolbox combines established software with novel functionality to enable accurate annotation of productive full-length immunoglobulin and T cell receptors at spatial tissue resolution. NanoAIRR is implemented as a modular bash tool comprising six main functionalities. These can be used independently or integrated into a streamlined end-to-end pipeline, e.g. by utilizing Snakemake. By integrating ultra-high-accuracy basecalling and error correction through Unique Molecular Identifiers, we improve long-read sequencing quality, enhancing the reliability of the downstream analysis. We demonstrate how NanoAIRR enables us to study the spatial organization of adaptive immune receptors, track clonal lineage relationships, and assess the presence and structure of tertiary lymphoid structures (TLS) across various tumor types.
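
As the abstract notes, the six modules can be chained into an end-to-end pipeline, e.g. with Snakemake. A hypothetical two-rule Snakefile illustrates the pattern; the `nanoairr` CLI calls, rule names, and file layout are all invented for illustration, not NanoAIRR's documented interface.

```python
# Hypothetical Snakefile chaining two NanoAIRR-style stages.
SAMPLES = ["tumorA", "tumorB"]

rule all:
    input:
        expand("annotated/{sample}.airr.tsv", sample=SAMPLES)

rule umi_correct:
    input: "basecalled/{sample}.fastq.gz"
    output: "corrected/{sample}.fastq.gz"
    shell: "nanoairr correct-umi {input} -o {output}"   # hypothetical CLI call

rule annotate:
    input: "corrected/{sample}.fastq.gz"
    output: "annotated/{sample}.airr.tsv"
    shell: "nanoairr annotate {input} -o {output}"      # hypothetical CLI call
```
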
2025-07-21 14:00:00 14:15:00 12 Dream Challenges Benchmarking foundation models in biology: where we are, and where we want to go with the community Julio Saez-Rodriguez Julio Saez-Rodriguez The AI promise of powerful solutions to solve biomedical and healthcare-related problems needs to be accompanied by a transparent evaluation and proof of reproducibility of the corresponding algorithms. The evaluation of algorithms in machine learning (ML) is typically done by assessing their performance on prediction tasks. The application of this benchmarking paradigm to foundation models (FM) is not straightforward. FMs are typically trained using self-supervised methods that don’t need labeled ground truth data, and therefore the models are embodiments of the phenomena that gave rise to the data. Usually, the training of FMs is followed by fine-tuning the models for specific tasks using more traditional ML methodologies, and the fine-tuned models can be benchmarked as such. However, such an assessment would fail to elucidate whether possible failures reside in the FM or in the fine-tuned model. In this era of FMs, there is a need to rethink rigorous evaluation, both in what it means to validate and in how we validate. One strategy could be a concerted effort of the communities developing and using these models: continuously crowdsourcing microtasks that test the limits of these models from every possible perspective within their domain of competence. The aim of this DREAM session at ISMB is to explore this strategy together and define as a community a roadmap to move forward with such a critically needed benchmark of foundation models.
2025-07-21 14:15:00 14:45:00 12 Dream Challenges Building Foundation Models for Single-cell Omics and Imaging Bo Wang Keynote: This talk delves into the innovative utilization of generative AI in propelling biomedical research forward. By harnessing single-cell sequencing data, we developed scGPT, a foundational model that extracts biological insights from an extensive dataset of over 33 million cells. Analogous to how words form text, genes define cells, effectively bridging the technological and biological realms. The strategic application of scGPT via transfer learning significantly boosts its efficacy in diverse applications such as cell-type annotation, multi-batch integration, and gene network inference. Additionally, the talk will spotlight MedSAM, a state-of-the-art segmentation foundational model. Designed for universal application, MedSAM excels across various medical imaging tasks and modalities. It showcased unprecedented advancements in 30 segmentation tasks, outperforming existing models considerably. Notably, MedSAM possesses the unique ability for zero-shot and few-shot segmentation, enabling it to identify previously unseen tumor types and swiftly adapt to novel imaging modalities. Collectively, these breakthroughs emphasize the importance of developing versatile and efficient foundational models. These models are poised to address the expanding needs of imaging and omics data, thus driving continuous innovation in biomedical analysis.
2025-07-21 14:45:00 15:00:00 12 Dream Challenges Predicting Perturbation Effects: Are We Really There? Maria Brbic Maria Brbic TBA
2025-07-21 15:00:00 15:15:00 12 Dream Challenges AI Alliance: Benchmarking foundation models for drug discovery Pablo Meyer-Rojas Pablo Meyer-Rojas The AI Alliance is focused on fostering an open community and enabling developers and researchers to accelerate responsible innovation in AI while ensuring scientific rigor, trust, safety, security, diversity and economic competitiveness. We bring together a critical mass of compute, data, tools, and talent to accelerate and advocate for open innovation in AI. Together with DREAM challenges, we aim to create a world-class research community that harnesses the potential of AI foundation models, transforms the field of drug discovery, and accelerates scientific progress by driving interdisciplinary collaboration on AI-powered drug discovery projects in the open. IBM Research biomedical foundation model (BMFM) technologies leverage multi-modal data of different types, including drug-like small molecules and proteins (covering a total of more than a billion molecules), as well as DNA and single-cell RNA sequences.
2025-07-21 15:15:00 15:30:00 12 Dream Challenges Benchmarking in Service of Virtual Cell Models: Challenges, Opportunities, and a Path Forward Elizabeth Fahsbender Elizabeth Fahsbender Realizing the vision of AI-powered Virtual Cells demands robust and biologically meaningful benchmarks that ensure models are reliable, reproducible, and relevant. This talk will present key insights from a recent community workshop convened by the Chan Zuckerberg Initiative, which identified critical challenges to benchmarking in this space—ranging from data heterogeneity and systemic bias to evaluation metric gaps and ecosystem fragmentation. We will highlight a set of community-driven recommendations and describe how CZI is beginning to address these through targeted investments in infrastructure, high-quality data generation, and community coordination. These efforts aim to catalyze progress toward a trustworthy benchmarking ecosystem that can accelerate foundational model development for cell biology.
2025-07-21 15:30:00 16:00:00 12 Dream Challenges Deep learning models of regulatory DNA: A critical analysis of model design choices Anshul Kundaje Anshul Kundaje Keynote: Gene expression is tightly regulated by complexes of proteins that interpret complex sequence syntax encoded in regulatory DNA. Genetic variants influencing traits and diseases often disrupt this syntax. Several deep learning models have been developed to decipher regulatory DNA and identify functional variants. Most models use supervised learning to map sequences to cell-specific regulatory activity measured by genome-wide molecular profiling experiments. The general trend in model design is towards larger, multi-task, supervised models with expansive receptive fields. Further, emerging self-supervised DNA language models (DNALMs) promise foundational representations for probing and fine-tuning on limited datasets. However, rigorous evaluation of these models against lightweight alternatives on biologically relevant tasks has been lacking. In this talk, I will demonstrate that lightweight, single-task CNNs are competitive with or significantly outperform massive supervised transformer models and fine-tuned DNALMs on critical prediction tasks. Additionally, I will show that the multi-task, supervised models learn causally inconsistent features, impairing counterfactual prediction, interpretation, and design. In contrast, our lightweight, single-task models are causally consistent and provide robust, interpretable insights into regulatory syntax and genetic variation, enabling scalable novel discoveries.
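
To make the "lightweight" claim concrete, a single-task CNN of the kind the talk argues for can be only a few lines of PyTorch. The layer sizes below are arbitrary and the model is untrained; it is meant only to show the scale of architecture being compared against large multi-task transformers.

```python
# A deliberately lightweight single-task CNN over one-hot DNA (illustrative sizes).
import torch
import torch.nn as nn

class TinyRegulatoryCNN(nn.Module):
    def __init__(self, n_filters=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(4, n_filters, kernel_size=21, padding=10),  # motif-scanning filters
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),          # global max-pool over sequence positions
        )
        self.head = nn.Linear(n_filters, 1)   # single task: one regulatory activity track

    def forward(self, x):                     # x: (batch, 4, seq_len) one-hot DNA
        return self.head(self.body(x).squeeze(-1))

pred = TinyRegulatoryCNN()(torch.randn(2, 4, 1000))
```
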
2025-07-21 16:40:00 16:55:00 12 Dream Challenges Benchmarking Multi-Modal Large Language Models for Metastatic Breast Cancer Prognosis Justin Guinney Justin Guinney Inputs into cancer prognostic models are primarily structured data such as demographic and clinicopathological features, and lack the richer, temporal context often found in unstructured clinical notes. We hypothesize that creating a temporal clinical patient note from structured data that preserves longitudinal and clinical contextual information, and coupling it with a large language model (LLM) that is trained to prognosticate overall survival (OS), may improve model accuracy with an interpretable embedding space. In this study, we benchmark different LLMs and fine-tuning strategies to develop optimal models for predicting overall survival from time of metastasis in a large cohort of de-identified patients with metastatic breast cancer.
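
The core preprocessing idea here, serializing structured records into a temporal patient note, might look like the sketch below. The event fields and wording are invented for illustration; a real pipeline would draw on de-identified EHR data.

```python
# Sketch: turn structured oncology events into a longitudinal "patient note".
from datetime import date

events = [
    {"date": date(2019, 3, 1), "kind": "diagnosis", "detail": "ER+/HER2- breast cancer"},
    {"date": date(2021, 6, 15), "kind": "metastasis", "detail": "bone"},
    {"date": date(2021, 7, 2), "kind": "therapy", "detail": "CDK4/6 inhibitor started"},
]

def to_note(events):
    # Preserve chronological order so the LLM sees the clinical timeline.
    lines = [f"{e['date'].isoformat()}: {e['kind']} - {e['detail']}"
             for e in sorted(events, key=lambda e: e["date"])]
    return "Patient history:\n" + "\n".join(lines)

print(to_note(events))
```
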
2025-07-21 16:55:00 17:30:00 12 Dream Challenges Crowdsourcing Experiment Gustavo Stolovitzky We will collectively conduct a small-scale community experiment to simulate a large-scale crowdsourcing initiative to benchmark foundation models.
2025-07-21 17:30:00 18:00:00 12 Dream Challenges Evaluating and Benchmarking Foundation Models All Speakers, Bo Wang, Anshul Kundaje, Justin Guinney, Katrina Kalantar, Maria Brbic, Luca Foschini Speakers will give their opinions about best practices for evaluating foundation models in biomedicine, and engage in conversation with attendees.
2025-07-23 11:20:00 12:00:00 11A Education Cross-Sector Collaboration in Bioinformatics and Data Science: Tackling Skill Development Challenges in Academia and Industry Gabriela Rustici Gabriela Rustici The rapid evolution of bioinformatics and data science has created shared challenges for both academia and industry, particularly in developing essential skills among emerging scientists and upskilling the experienced workforce. Despite improved mobility and collaboration, persistent barriers remain, including questions about where and how training is delivered, and what competencies are necessary for success. These hurdles not only impact knowledge exchange but also make transitions into industry daunting for young scientists unaware of workplace expectations and skill requirements. This talk will focus on ways to foster cross-sector cooperation, identify shared opportunities, and strengthen support for young scientists preparing for careers in the industrial sector.
2025-07-23 12:00:00 12:20:00 11A Education Building Omics Skills through the CFDE Training Center Allissa Dillman Allissa Dillman, David Burns, Kristi Sadowski, Kelli Bursey, Diane Krause, Jennifer Burnette, LaFrancis Gibson The Common Fund Data Ecosystem (CFDE) enables broad use of Common Fund data to advance scientific discovery. Five Centers integrate data, resources, and knowledge from many Common Fund Programs to empower the research community to pursue novel investigations that were previously not possible. The CFDE Training Center (TC) serves as a central hub supporting current and potential CFDE users through a comprehensive, learner-centered approach. An in-depth landscape analysis was conducted to assess, identify, and address the training opportunities and needs of the CFDE community. Key findings were broken into training barriers, mentoring challenges, and access issues. In response to these key findings and needs, the TC provides training in basic and advanced bioinformatics skills crucial for working with CFDE data. Meaningful engagement with CFDE data and tools is fostered with existing users and utilized to attract new users from the bioinformatics, data science, and research communities. The TC implemented a Learning Management System to provide a singular and seamlessly accessible location for all TC-produced trainings. This includes a foundational seminar series that defines omics-related research areas in the context of available CFDE data, Decoding the Data Ecosystem: A CFDE Training Center Podcast dedicated to unraveling the complexities and exploring the depths of omics research, and a FAIR and open-source Hackathon at the Bio-IT World conference that focused on integrating CFDE tools and data. By fostering a more knowledgeable and connected research community, the TC significantly accelerates scientific discovery and amplifies the CFDE's contribution to biomedical research.
2025-07-23 12:20:00 12:40:00 11A Education From live webinar series to self-paced learning resource: Creating structured bioinformatics learning pathways Ajay Mishra Ajay Mishra, Flaminia Zane, Anna Swan, Prakash Singh Gaur, Adam Broadbent, Aziz Mithani, Cath Brooksbank As bioinformatics continues to advance and expand across life sciences research, learners often struggle to navigate complex topics without structured learning paths. This presentation introduces our strategy to address this issue by organising thematic webinar series that focus on bioinformatics applications within specific life science domains, subsequently repurposing these, along with related tutorials, to build structured learning pathways. Webinars in a series are organised to methodically cover bioinformatic approaches within focused areas such as microbial ecosystems, plant sciences, or fundamental bioinformatics methods, offering participants a coherent learning progression. The live sessions enhance engagement through real-time interaction while serving a dual purpose: they are recorded and made available as stand-alone lectures as well as being repurposed into curated, on-demand training collections. By combining the recorded sessions with supplementary tutorials and relevant resources, we create comprehensive self-paced learning pathways. This hybrid approach allows learners to review training materials, enhance understanding, and access content according to their own schedule, accommodating various learning preferences and time constraints. Additionally, the resources are openly accessible under a CC-BY-4.0 Creative Commons license, allowing both learners and educators to reuse and adapt the materials to suit their individual needs and perspectives. In this presentation, we will showcase case studies developed around this structured training approach, share practical insights and lessons learned, and highlight user feedback and engagement outcomes.
2025-07-23 12:40:00 13:00:00 11A Education Breaking Down Barriers to Learning: Bioinformatics for Biologists Massive Open Online Courses Dusanka Nikolic Dusanka Nikolic, Fatma Guerfali, Martin Aslett, Victoria Offord, Ruth Nanjala, Andries Van Tonder, Jorge Batista da Rocha, Katherine Kaldeli, Treasa Creavin Background: To meet growing demand for training in core bioinformatics skills, we designed and delivered a free, two-part Massive Open Online Course (MOOC)-style course series: Bioinformatics for Biologists (B4B1 and B4B2). These courses provide free, introductory- and advanced-level learning pathways, catering to students and professionals working in genomics research or bioinformatics, and learners aspiring to a career in data science. Methodology: The first iteration of each course underwent formative evaluation using a mixed-methods approach to identify recurring barriers to learning, informing the subsequent runs’ improvements. Results: The main challenges identified included time constraints, technical difficulties, familiarity with the subject matter, the level of course content, and accessibility issues, all of which could have affected overall engagement with the material. To address these challenges, several improvements were implemented. Technical challenges were mitigated through active facilitation, peer learning support and alternative setup methods, whereas course content was enhanced with downloadable resources, formative quizzes, glossaries and refresher materials, to accommodate varying levels of prior knowledge. Accessibility was improved through transcripts, fallback resources and platform features, ensuring a more inclusive learning experience. The courses were offered at different times of the year and eventually transitioned to a free on-demand format, allowing greater flexibility for learners balancing professional and personal commitments. Conclusion: These enhancements focused on providing an adaptable and supportive environment for core bioinformatics training and resulted in smoother subsequent runs, reaching a global audience of around 45,000 learners from more than 180 countries, illustrating the potential of MOOCs to bridge bioinformatics skills gaps.
2025-07-23 14:00:00 14:20:00 11A Education An educator framework for organizing Wikipedia editathons for computational biology Farzana Rahman Dan DeBlasio, Farzana Rahman, Alastair Kilpatrick, Lonnie Welch, Juan Vázquez-Martínez, Varinia López-Ramírez, Divanery Rodriguez-Gomez, Cynthia Paola Rangel-Chavez, Jorge Noé García-Chávez, Nelly Sélem-Mojica, Pradeep Eranti, Audra Anjum, Nicolas C Näpflin, Megha Hegde, Tülay Karakulak, Aarón Gallego-Crespo, Toni Hermoso Pulido, Tiago Lubiana Motivation: Wikipedia is a vital open educational resource in computational biology; however, a significant knowledge gap exists between English and non-English Wikipedias. Reducing this knowledge gap via intensive editing events, or ‘editathons’, would be beneficial in reducing language barriers that disadvantage learners whose native language is not English. Results: We present a framework to guide educators in organizing editathons for learners to improve and create relevant Wikipedia articles. As a case study, we present the results of an editathon held at the 2024 ISCB Latin America conference, in which ten new articles were created in Spanish Wikipedia. We also present a web tool, ‘compbio-on-wiki’, which identifies relevant English Wikipedia articles missing in other languages. We demonstrate the value of editathons to expand the accessibility and visibility of computational biology content in multiple languages. Availability and Implementation: Source code for the compbio-on-wiki Toolforge site is available at: https://github.com/lubianat/compbio-on-wiki
2025-07-23 14:20:00 14:40:00 11A Education Automated Assignment Grading with Large Language Models: Insights From a Bioinformatics Course Pavlin G. Poličar Pavlin G. Poličar, Martin Špendl, Tomaž Curk, Blaž Zupan Providing students with individualized feedback through assignments is a cornerstone of education that supports their learning and development. Studies have shown that timely, high-quality feedback plays a critical role in improving learning outcomes. However, providing personalized feedback on a large scale in classes with large numbers of students is often impractical due to the significant time and effort required. Recent advances in natural language processing and large language models (LLMs) offer a promising solution by enabling the efficient delivery of personalized feedback. These technologies can reduce the workload of course staff while improving student satisfaction and learning outcomes. Their successful implementation, however, requires thorough evaluation and validation in real classrooms. We present the results of a practical evaluation of LLM-based graders for written assignments in the 2024/25 iteration of the Introduction to Bioinformatics course at the University of Ljubljana. Over the course of the semester, more than 100 students answered 36 text-based questions, most of which were automatically graded using LLMs. In a blind study, students received feedback from both LLMs and human teaching assistants without knowing the source, and later rated the quality of the feedback. We conducted a systematic evaluation of six commercial and open-source LLMs and compared their grading performance with human teaching assistants. Our results show that with well-designed prompts, LLMs can achieve grading accuracy and feedback quality comparable to human graders. Our results also suggest that open-source LLMs perform as well as commercial LLMs, allowing schools to implement their own grading systems while maintaining privacy.
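
A central engineering piece in a setup like this is the rubric-grounded grading prompt. The sketch below shows one plausible, model-agnostic way to construct such a prompt; the question, rubric, and wording are invented and do not come from the course described.

```python
# Sketch of a rubric-grounded grading prompt (illustrative content only).
def grading_prompt(question, rubric, answer):
    return (
        "You are a teaching assistant for an introductory bioinformatics course.\n"
        f"Question: {question}\n"
        f"Rubric (max 5 points): {rubric}\n"
        f"Student answer: {answer}\n"
        "Return a score out of 5 and two sentences of constructive feedback."
    )

print(grading_prompt(
    "Why do we mask repeats before aligning reads?",
    "2 pts: spurious multi-mapping; 2 pts: runtime; 1 pt: concrete example",
    "Because repeats cause reads to map to many places.",
))
```
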
2025-07-23 14:40:00 15:00:00 11A Education Teaching LLM literacy improves AI-aided data analysis in a bioinformatics course Aparna Nathan Aparna Nathan, Nils Gehlenborg Large Language Model (LLM) tools (e.g., ChatGPT) are increasingly helping bioinformatics courses foster self-efficacy, personalize learning, and make biological data analysis accessible to students with less coding training. However, students with inadequate understanding of how LLM tools work may use them counterproductively, thus hindering their learning and problem-solving abilities. To address this, we developed and evaluated an interactive LLM Literacy curriculum to help bioinformatics students (1) learn LLM fundamentals, (2) develop best practices for using LLM tools as computational aids, and (3) explore the limitations and ethics of these technologies. At the end of the curriculum, students developed their own guidelines on how to use these tools in bioinformatic analyses. The lessons focus on debugging and statistical design, integrating literature on best practices in these fields with best practices for the use of LLMs as learning aids. The curriculum is tool-agnostic and adaptable to evolving LLM tools. We incorporated the curriculum into a graduate biological data analysis course. Based on a pre-test and post-test, students displayed significant improvements in LLM prompt-writing practices after completing the LLM Literacy curriculum. They were able to solve more coding and statistics problems correctly with fewer LLM interactions due to better-designed LLM prompts. Students also reported increased confidence in their computational skills, both in general and with LLM tools’ assistance. These findings show that LLM Literacy training promotes self-confidence, self-efficacy, and critical evaluation of computing tools. This underscores the importance of LLM literacy training as a necessary part of modern bioinformatics education.
2025-07-23 15:00:00 15:20:00 11A Education Integrating Bioinformatics into Undergraduate Biology Education: Innovation, Experiential Learning, and Sustainable Program Design Inimary Toby-Ogundeji Inimary Toby-Ogundeji As biology becomes increasingly data-driven, the integration of computational and quantitative skills into undergraduate life sciences education is essential for preparing students for research environments and emerging career pathways. Over a five-year period, a longitudinal survey was conducted to evaluate bioinformatics competencies among undergraduate Biology majors, with the aim of integrating computational and quantitative skills into the biology curriculum. The survey assessed students' proficiency, confidence, and awareness of the practical applications of these skills in biological research. Findings revealed a consistent trend: while many students enter with limited experience in computational methods, there is growing interest and recognition of their importance in modern biology. Key areas of deficiency included: coding literacy, data analysis, and algorithmic thinking. In response to these findings, the department introduced a tiered integration of bioinformatics across introductory biology courses and developed a summer program in bioinformatics to provide immersive, hands-on training. These initiatives, along with research-based course modules and collaborative workshops, significantly enhanced student confidence and practical skills application. Students who participated in these immersive experiences reported an increased understanding of the role of computational biology and a stronger ability to solve biological problems using quantitative tools. This educational model is not only responsive to current skill gaps but is also adaptable across diverse institutions and learning environments. These outcomes highlight the value of early, structured exposure to bioinformatics and reinforce the need for ongoing curriculum innovation to better prepare undergraduates for the interdisciplinary demands of modern biological research.
2025-07-23 15:20:00 15:40:00 11A Education GeneLab for Colleges and Universities (GL4U): On-Demand Bioinformatics Training Using Space Biology Omics Data Amanda Saravia-Butler Amanda Saravia-Butler, Lauren Sanders, Alexis Torres, Crystal Han, Samrawit Gebre The NASA GeneLab project provides open access to space-relevant multi-omics data, hosted on the Open Science Data Repository (OSDR), which can be mined to understand the impacts of spaceflight on biological systems. To engage the scientific community with the Space Biology field and increase the number of scientists who understand and utilize GeneLab data, GeneLab created GeneLab for Colleges and Universities (GL4U). GL4U offers space biology-relevant training in bioinformatics to prospective students, educators, and citizen scientists through various approaches. Since its inception in 2021, the GL4U program has conducted live annual bootcamps for students and educators where participants complete introductory and omics-specific module sets. The GL4U: Introduction modules include lecture-style overviews of NASA, Space Biology, and OSDR and hands-on training in basic Unix and R commands. The GL4U: Omics-specific module sets include a mix of lectures and hands-on training for a particular type of omics data using GeneLab’s standard processing pipelines. To date, GL4U has hosted 4 live bootcamps, training over 75 students and 12 educators across 8 institutions. Pre- versus post-bootcamp surveys revealed a 115% increase, on average, in both participant understanding of omics data processing and in familiarity with NASA Space Biology resources, showing the overwhelming success of these bootcamps. GeneLab has recently expanded GL4U into a series of open-access on-demand training modules. The GL4U: Introduction and GL4U: RNAseq modules, featuring recorded lectures and hands-on Jupyter Notebook exercises, were launched in December 2024. The authors will present an overview of the GL4U on-demand platform and discuss initial user feedback.
2025-07-23 15:40:00 16:00:00 11A Education A Scalable Curriculum Model to Empower Rural Youth in Open Science Through Secondary Research and Peer-to-Peer Collaboration Nadiia Kasianchuk Nadiia Kasianchuk, Vladyslav Ostash, Mariia Yakovenko, Daria Nishchenko, Tetiana Povshyk, Kvitoslava-Olha Yarish, Dmytro Hinaliuk, Serhiy Kornyliuk, Oleksandra Konopatska The Reuse Science School is a scalable, peer-oriented educational program designed to empower youth from rural and displaced communities in Ukraine through open science and computational biology. In response to systemic educational disruptions caused by war, the program equips learners with data analysis skills using open datasets—an accessible alternative to lab-based research in crisis-affected regions. The curriculum was co-developed by NGO Youth Vitryla, NGO Genetically Modified Organisation, and the Kyiv School of Economics, with support from GIZ and the EU4Youth Project. It combines soft skills, community-building, and a hands-on Python track (15+ hours) covering fundamentals, data processing (NumPy, Pandas), and visualization (Matplotlib). Students apply these skills in mini-projects analyzing publicly available biological and medical datasets. The top 40 participants are invited to a 3-day bootcamp covering statistics, research ethics, and science communication. During the bootcamp, students also begin developing original group projects, on which they will work over the following three months, culminating in presentations at a closing conference. 343 students registered from across Ukraine, the majority aged 14–19 and with no prior coding experience. Despite wartime power outages and air raid disruptions, each session had 140+ live Zoom attendees, with others following via recordings. Interim self-assessments (n=65) showed a 47% average increase in confidence with data analysis. Over 75% of respondents expressed interest in mentoring, supporting a Training-of-Trainers (ToT) module piloted in Western Ukraine. Participants described the course as “fun,” “accessible,” and “inspiring.” Interests aligned strongly with computational biology–relevant fields such as bioinformatics, life sciences, and data science. The program demonstrates how open science education can be adapted to empower marginalized learners globally.
2025-07-23 16:40:00 17:00:00 11A Education Capacity Building for Pathogen Surveillance through Pathogen Genomics and Bioinformatics Training in Africa Nicola Mulder Nicola Mulder, Siddiqah George, Kirsty Lee Garson, Tony Li, Perceval Maturure The recent emergence and re-emergence of infectious diseases in Africa highlight the critical need for robust pathogen genomic surveillance systems across the continent. Effective surveillance depends on comprehensive training and capacity development in pathogen genomics and bioinformatics, as rapid public health responses to disease outbreaks rely on continuously enhancing these skills. Over the past four years, we have delivered hybrid training in pathogen genomic surveillance and bioinformatics to >290 participants from 36 African countries. These initiatives, tailored to diverse personas in national public health institutions, leveraged trainers and facilitators from across the continent to address varying competency levels. We have also developed and implemented resources to support our training initiatives, including a user-friendly helpdesk ticketing system, a robust trainer database, and intuitive websites hosting training materials. These tools work jointly to ensure that training and related resources are widely accessible, while also providing participants with support and engagement opportunities long after receiving training. To ensure consistency in the training of public health staff in Africa, a standardised pathogen genomics surveillance training curriculum has been developed. The curriculum is designed to serve as a comprehensive resource for trainers, encompassing content that ranges from foundational courses in generic, wet-lab, and bioinformatics topics to advanced pathogen-specific courses that include tailored genomic surveillance workflows. Currently, we are exploring the integration of AI in pathogen genomics curriculum development and training. We have benchmarked AI tools for curriculum design, content generation, skills and knowledge assessment and the implementation of a chatbot for trainee support.
2025-07-23 17:00:00 17:20:00 11A Education The emerging ecosystem of competitive educational programs in bioinformatics in Ukraine Alina Frolova Alina Frolova, Serhiy Naumenko, Anna Diamant, Ihor Arefiev, Nadiia Kasianchuk, Daryna Yakymenko, Taras K. Oleksyk, Walter Wolfsberger, Viorel Munteanu, Mangul Serghei, Valeriia Vasylieva Competitive education in the areas of bioinformatics, computational biology, and biological data analysis has become a pass to the world of modern biotechnology for any developed country. In Ukraine, many efforts are underway to bridge the gap between the availability of highly talented and motivated students and the scarcity of high-quality educational programs. As active ambassadors of bioinformatics in Ukraine, we report on our educational projects. The non-governmental organization Genomics UA leads yearly courses in RNA-seq data analysis, multi-omics, and spatial transcriptomics data analysis, and maintains a community portal and a discussion forum. Three yearly competitive international science schools - Bioinformatics For Ukraine, Ukrainian Biological Data Science Summer School (UBDS^3), and LifeScienceCourse - are entering their third season, providing intensive, research-oriented training to early career scientists. In parallel, academic institutions are expanding their role: four Ukrainian universities are developing Master’s programs in bioinformatics, two offer Bachelor’s degrees, and five have incorporated bioinformatics courses into broader life science curricula. The growing bioinformatics community in Ukraine has received overwhelming support from scientists abroad: many of them contribute their time and expertise to participate in scientific schools, while others provide hiring and educational opportunities. Finally, we describe the challenges faced by the community. The number of qualified instructors is insufficient and basic textbooks in Ukrainian are scarce, while the students pursuing degrees in bioinformatics or reorienting into the field come from immensely diverse backgrounds, pushed toward it in part by the limited funding opportunities in laboratory-based research.
2025-07-23 17:20:00 18:00:00 11A Education Fostering communities of practice in bioinformatics education and training Patricia Carvajal-López Patricia Carvajal-López Communities of practice in bioinformatics education and training are essential, both for the technical training and professional development of bioinformatics practitioners, and for us, as bioinformatics educators and trainers, to develop our field, our recognition as a vital part of the broader bioinformatics community, and ourselves as individuals. I will illustrate my point using two main examples: first, the community that we built together with the BioInfoCore COSI to extend the ISCB's Competency Framework to reflect the career progression of bioinformatics core facility scientists. Then, I'll delve into the most recent Global Bioinformatics Education Summit, which took place in May 2025 as a hybrid meeting hosted in México City. Both illustrate our tremendous collective potential to do meaningful work that builds capacity and advances our field.
2025-07-23 11:20:00 11:40:00 12 NIH/Elixir Disease Ontology Knowledgebase: A Global BioData hub for FAIR disease data discovery Lynn Schriml Lynn Schriml Development of long-term biodata resources, by design, depends on a stable data model with persistent identifiers, regular data releases, and reliable responsiveness to ongoing community needs. Addressing evolving needs while continually advancing our data representation has facilitated the sustained 20-year growth and utility of the Human Disease Ontology (DO, https://www.disease-ontology.org/). Biodata resources must maintain their relevance, adapting to address and fulfill persistent, evolving needs. Strategically, the DO actively identifies and connects with our expanding user community, thus driving the DO’s integration of diverse disease factors (e.g., molecular, environmental and mechanistic) into a singular framework. Serving a vast user community since 2003 (> 415 biomedical resources across 45 countries), the DO’s continual content and classification expansion is driven by the ever-evolving disease knowledge ecosystem. The DO, a designated Global Core Biodata Resource (https://globalbiodata.org/), empowers disease data integration, standardization, and analysis across the interconnected web of biomedical information. A focus on modernizing infrastructure is imperative to provide new mechanisms for data interoperability and accessibility. Our strategic approach includes following community best practices (e.g., OBO Foundry, FAIR principles), adapting established technical approaches (e.g., Neo4j; Swagger for API), and openly sharing project-developed tooling, which reduces technical debt while maximizing data delivery opportunities. The DO Knowledgebase (DO-KB) tools (DO-KB SPARQL service and endpoint, Faceted Search Interface, advanced API service, DO.utils) have been developed to enhance data discovery, delivering an integrated data system that exposes the DO’s semantic knowledge and connects disease-related data across Open Linked Data resources.
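
A query against the DO-KB SPARQL service might look like the sketch below, using the SPARQLWrapper package. The endpoint URL is an assumption (the abstract announces the service without giving its address), and the query simply pulls a few disease labels for illustration.

```python
# Hypothetical DO-KB SPARQL lookup; endpoint URL is an assumed placeholder.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://sparql.disease-ontology.org/sparql")  # assumed URL
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?disease ?label WHERE {
        ?disease rdfs:label ?label .
        FILTER(CONTAINS(LCASE(STR(?label)), "carcinoma"))
    } LIMIT 5
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["disease"]["value"], row["label"]["value"])
```
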
2025-07-23 11:40:00 12:00:00 12 NIH/Elixir Integrating Data Treasures: Knowledge graphs of the DSMZ Digital Diversity Julia Koblitz Julia Koblitz The DSMZ (German Collection of Microorganisms and Cell Cultures) hosts a wealth of biological data, covering microbial traits (BacDive), taxonomy (LPSN), enzymes and ligands (BRENDA), rRNA genes (SILVA), cell lines (CellDive), cultivation media (MediaDive), strain identity (StrainInfo), and more. To make these diverse datasets accessible and interoperable, the DSMZ Digital Diversity initiative provides a central hub for integrated data and establishes a framework for linking and accessing these resources (https://hub.dsmz.de). At its core lies the DSMZ Digital Diversity Ontology (D3O), an upper ontology designed to unify key concepts across all databases, enabling seamless integration and advanced exploration. This ontology is complemented by well-established ontologies such as ChEBI, ENVO, and NCIT, among others. By standardizing all resources within a defined vocabulary, we enhance their interoperability, both internally and with the Linked Open Data community. Where necessary, we will also develop and curate our own ontologies, such as the well-known BRENDA tissue ontology (BTO), a comprehensive ontology for LPSN taxonomy and nomenclature, and the Microbial Isolation Source Ontology (MISO), which has already been applied to annotate more than 80,000 microbial strains. D3O also provides a stable foundation for transforming our databases into RDF (resource description framework) and providing the knowledge graphs via open SPARQL endpoints. The first knowledge graphs of BacDive and MediaDive are already available at https://sparql.dsmz.de, enabling researchers to query and analyze microbial trait data and cultivation media. These initial steps lay the groundwork for integrating additional databases, such as BRENDA and StrainInfo, into unified, queryable knowledge graphs.
2025-07-23 12:00:00 12:20:00 12 NIH/Elixir Metabolomics Workbench: Data Sharing, Analysis and Integration at the National Metabolomics Data Repository Mano Maurya Mano Maurya The National Metabolomics Data Repository (NMDR) was developed as part of the National Institutes of Health (NIH) Common Fund Metabolomics Program to facilitate the deposition and sharing of metabolomics data and metadata from researchers worldwide. The NMDR, housed at the San Diego Supercomputer Center (SDSC), University of California, San Diego, has developed the Metabolomics Workbench (MW). The Metabolomics Workbench also provides analysis tools and access to metabolite standards, including RefMet, protocols, tutorials, training, and more. RefMet facilitates metabolite name harmonization, an essential step in data integration across different studies and collaboration across different research centers. Thus, the MW-NMDR serves as a one-stop infrastructure for metabolomics research and is widely regarded as one of the most FAIR (findable, accessible, interoperable, reusable) data resources. In this work, we will present some of the key aspects of the MW-NMDR, such as continuous curation to maintain quality, use of controlled vocabularies and ontologies to promote interoperability, development of tools to contribute to driving scientific innovation, and integration of tools developed by the community into the MW. We will also discuss our involvement in other data sharing, reuse, and integration efforts, namely the NIH Common Fund Data Ecosystem (CFDE) and a collaboration with the European Bioinformatics Institute (EBI)’s MetabolomeXchange as part of the Chan Zuckerberg initiative.
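
RefMet name harmonization is exposed programmatically, and a minimal lookup might look like the sketch below. The REST route shown is an assumption about the service's URL scheme rather than a documented guarantee, so check the Metabolomics Workbench API documentation for the real contract.

```python
# Sketch: harmonize a metabolite name against RefMet (route is an assumption).
import requests

name = "glucose"
url = f"https://www.metabolomicsworkbench.org/rest/refmet/match/{name}/name/"  # assumed route
resp = requests.get(url, timeout=30)
resp.raise_for_status()
print(resp.json())   # standardized RefMet name and classification, if matched
```
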
2025-07-23 12:20:00 12:40:00 12 NIH/Elixir Building sustainable solutions for federally-funded open-source biomedical tools and technologies Karamarie Fecho Karamarie Fecho Federally-funded, open-source, biomedical tools and technologies often fail due to a lack of a business model for sustainability, which quickly leads to technical obsolescence and is often preceded by insufficient scientific impact and the failure to create a thriving Community of Practice. The open-source ROBOKOP (Reasoning Over Biomedical Objects linked in Knowledge Oriented Pathways) knowledge graph (KG) system is jointly funded by the National Institute of Environmental Health Sciences and the Office of Data Science Strategy within the National Institutes of Health as a modular biomedical KG system designed to explore relationships between biomedical entities. The ROBOKOP system includes the aggregated ROBOKOP KG composed of integrated and harmonized “knowledge” derived from dozens of “knowledge sources”, a user interface to the ROBOKOP KG, and a collection of supporting tools and resources. ROBOKOP has demonstrated its utility in a variety of use cases, including suggesting “adverse outcome pathways” to explain the biological relationships between chemical exposures and disease outcomes and the related concept of “clinical outcome pathways” to explain the biological mechanisms underlying the therapeutic effects of drug exposures. We have been evaluating approaches to ensure the long-term sustainability of ROBOKOP, independent of federal funding. One approach is to adopt and adapt the best practices of, and lessons learned by, successful open-source biomedical Communities of Practice, with engaged scientific end users and technical contributors. This presentation will provide an overview of our evaluation results and detail our proposed solution for transitioning ROBOKOP from federal funding to independent long-term sustainability.
2025-07-23 12:40:00 13:00:00 12 NIH/Elixir SEA CDM: An Ontology-Based Common Data Model for Standardizing and Integrating Biomedical Experimental Data in Vaccine Research Yongqun He Yongqun He With the increasing volume of experimental data across biomedical fields, standardizing, sharing, and integrating heterogeneous experimental data has become a major challenge. Our systematic VIOLIN vaccine database has collected and annotated over 4,700 vaccines against 217 infectious and non-infectious diseases such as cancer, and vaccine components such as over 100 vaccine adjuvants and over 1,700 vaccine-induced host immune factors. To support standardization, we developed the community-based Vaccine Ontology (VO) to represent vaccine knowledge and associated metadata. To support interoperable standardization, annotation, and integration of various biomedical experimental datasets, we have developed an ontology-supported Study-Experiment-Assay (SEA) common data model (CDM), consisting of 12 core classes (called tables in a relational database setting), such as Organism, Sample, Intervention, and Assay. The SEA CDM was evaluated systematically using the vaccine-induced host gene immune response data from our VIOLIN VIGET (Vaccine Induced Gene Expression Analysis Tool) system. We also developed a MySQL database and a Neo4J knowledge graph based on the SEA CDM, to systematically represent the VIGET data and influenza-related host gene expression data from two large-scale data resources: ImmPort and CELLxGENE. Our results show that ontologies such as VO can greatly support interoperable data annotation and provide additional semantic knowledge (e.g., vaccine hierarchy). This proof-of-concept study demonstrated the feasibility and validity of the SEA CDM for standardizing and integrating heterogeneous datasets and highlights its potential for application to other big bioresources. The novel SEA CDM lays a foundation for building a FAIR and AI-ready Biodata Ecosystem, leading to advanced AI research.
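
Four of the twelve named core classes can be rendered as a toy schema to convey the flavor of the model. Only the four class names come from the abstract; every field below is illustrative, not the SEA CDM's actual column set.

```python
# Toy rendering of four SEA CDM core classes as Python dataclasses.
from dataclasses import dataclass

@dataclass
class Organism:
    organism_id: str
    species: str                 # e.g., "Homo sapiens"

@dataclass
class Sample:
    sample_id: str
    organism_id: str             # foreign key into Organism

@dataclass
class Intervention:
    intervention_id: str
    sample_id: str
    agent: str                   # e.g., a VO-annotated vaccine term

@dataclass
class Assay:
    assay_id: str
    sample_id: str
    assay_type: str              # e.g., "gene expression"
```
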
2025-07-23 14:00:00 14:20:00 12 NIH/Elixir The Evolution of Ensembl: Scaling for Accessibility, Performance, and Interoperability Mallory Freeberg Mallory Freeberg Ensembl is an open platform that integrates publicly available genomics data across the tree of life, enabling activities spanning research to clinical and agricultural applications. Ensembl provides a comprehensive collection of data including genomes, genomic annotations, and genetic variants, as well as computational outputs such as gene predictions, functional annotations, regulatory region predictions, and comparative genomic analyses. In its 25-year history, Ensembl has grown to support all domains of life - from vertebrates to plants to bacteria - releasing new data roughly quarterly. Initially developed for the human genome, Ensembl expanded to include additional key vertebrates totalling a few hundred genomes. With the advent of global biodiversity and pangenome projects, Ensembl now contains thousands of genomes and is anticipated to grow to tens of thousands of genomes in the coming years. This explosion in data size necessitates a more scalable and rapidly deployable mechanism to ensure timely release of new high-quality genomes for immediate use by the community. Ensembl is evolving to meet increasing scalability demands to ensure continued accessibility, performance, and interoperability. We have developed a new service-oriented infrastructure, deployed as a set of orchestrated microservices. Our new refget implementation enables rapid, unambiguous sequence retrieval using checksum-based identifiers. Our GraphQL service has been expanded to support genome metadata queries, facilitating programmatic access to assembly composition and linked datasets. With streamlined components and more modern technologies, Ensembl will be easier to maintain, delivering high-quality data quickly and benefiting the global scientific communities that rely on this key resource.
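
The refget mechanism mentioned here retrieves sequences by checksum rather than by name. The sketch below follows the published GA4GH refget route shape (`/sequence/{id}` with optional `start`/`end` query parameters), but the base URL and checksum are placeholders, not Ensembl's actual deployment details.

```python
# Sketch of a GA4GH refget-style sequence lookup (placeholder server and checksum).
import requests

BASE = "https://refget.example.org"              # hypothetical refget deployment
checksum = "6681ac2f62509cfc220d78751b8dc524"    # placeholder sequence digest

resp = requests.get(f"{BASE}/sequence/{checksum}",
                    params={"start": 0, "end": 80},   # first 80 bases of the sequence
                    timeout=30)
resp.raise_for_status()
print(resp.text)
```
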
2025-07-23 14:20:00 14:40:00 12 NIH/Elixir Insights from GlyGen in Developing Sustainable Knowledgebases with Well-Defined Infrastructure Stacks Kate Warner Kate Warner GlyGen is a data integration and dissemination project for glycan and glycoconjugate related data, which retrieves information from multiple international data sources to form a central knowledgebase for glycoscience data exploration. To maintain our high-quality service while meeting the needs of our users, we have structured GlyGen into related but distinct spaces - the Web portal, Data portal, API, Wiki, and SPARQL - which allows clear delineation of tasks for maintenance and innovation while providing different mechanisms for data access. General users can use the interactive GlyGen web portal to search and explore GlyGen data using our various tools and search functionalities. For programmatic access, users can use the API (https://api.glygen.org) to access GlyGen data objects for glycans and proteins, while the SPARQL endpoint (https://sparql.glygen.org) is built to provide an alternative programmatic access to the GlyGen data using semantic web technologies. For users interested in using the datasets in research, data mining or machine learning projects, versioned dataset flat files can be downloaded at our Data portal (https://data.glygen.org), along with the dataset’s Biocompute Object (BCO) (https://biocomputeobject.org) which documents the metadata of the dataset for proper attribution, reproducibility and data sharing. All components of the GlyGen ecosystem are built using well-established web technology stacks, enabling rapid development and deployment on both on-premise infrastructure and commercial cloud platforms, while also ensuring straightforward maintenance. Finally, we will discuss how being freely accessible under the Creative Commons Attribution 4.0 International (CC BY 4.0) license helps to encourage FAIR data, open science, and collaboration.
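
Programmatic access via the API base given above might look like the following. The base URL comes from the abstract, but the specific route and accession are assumptions for illustration; consult the API documentation at https://api.glygen.org for the real contract.

```python
# Minimal GlyGen API call (route and accession are illustrative assumptions).
import requests

glytoucan_ac = "G17689DH"    # example-style GlyTouCan accession (placeholder)
resp = requests.get(f"https://api.glygen.org/glycan/detail/{glytoucan_ac}",  # assumed route
                    timeout=30)
resp.raise_for_status()
data = resp.json()
print(list(data)[:5])        # inspect the first few available fields
```
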
2025-07-23 14:40:00 15:00:00 12 NIH/Elixir Philip Blood
2025-07-23 15:00:00 15:20:00 12 NIH/Elixir Production workflows and orchestration at MGnify, ELIXIR’s Core Data Resource for metagenomics Martin Beracochea Martin Beracochea MGnify is a key resource for the assembly, analysis and archiving of microbiome-derived sequencing datasets. Designed to be interoperable with the European Nucleotide Archive (ENA) for data archiving, MGnify’s analyses can be initiated from various ENA sequence data products, including private datasets. Accessioned data outputs are produced in commonly used formats and available via web visualisation and APIs. The rapid evolution of the field of microbiome research over the past decade has brought significant challenges: exponential dataset growth; increased sample diversity; additional data analyses and new sequencing technologies. To address these challenges, MGnify’s latest pipelines have transitioned from the Common Workflow Language to Nextflow, nf-core, and a new automation system. This enhances resource management and supports heterogeneous computing including cloud environments, handles large-scale data production, and reduces manual intervention. Key MGnify outputs include taxonomic and functional analyses of metagenomes, covering >600,000 datasets. The service produces and organises metagenome assemblies and metagenome-assembled genomes, totaling >480,000, as well as nearly 2.5 billion protein sequences. The available annotations have broadened to include the mobilome and virome, as well as increased taxonomic specificity via additional amplicon sequence variant analyses. While these developments have positioned MGnify to efficiently take advantage of elastic compute resources, the volume of demand still outstrips the available resources. As such, we have started to evaluate how analyses can be federated through the use of our Nextflow pipelines (and community produced Galaxy versions), in combination with Research Objects, to provide future scalability while retaining a centralised point of discovery.
2025-07-23 15:20:00 15:40:00 12 NIH/Elixir A SCALE-Able Approach to Building “Hybrid” Repositories to Drive Sustainable, Data Ecosystems Robert Schuler Robert Schuler Scientific discovery increasingly relies on the ability to acquire, curate, integrate, analyze, and share vast and varied datasets. For instance, advancements like AlphaFold, an AI-based protein prediction tool, and ChatGPT, a large language model-based chatbot, have generated immense excitement in science and industry for harnessing data and computation to solve significant challenges. However, it’s easy to overlook that these remarkable achievements were only made possible after the accumulation of a critical mass of AI-ready data. Both examples relied on open data sources meticulously generated by user communities over several decades. We argue that scalable, sustainable data repositories that bridge the divide between domain-specific and generalist repositories and that actively engage communities of investigators in the task of organizing and curating data will be required to meet the challenge of producing a future critical mass of data to unlock new discoveries. Such resources must move beyond the label of “repository” and instead employ a socio-technical approach that inculcates a culture and skill set for data management, sharing, reuse, and reproducibility. In this talk, we will discuss our efforts toward developing FaceBase as a “SCALE-able” data resource built on the principles of Self-service Curation, domain-Agnostic data-centric platforms, Lightweight information models, and Evolvable systems. Working with this approach within the dental, oral, craniofacial, and biologically relevant research community, we have seen several hundred studies, encompassing data from many thousands of subjects and specimens across multiple imaging modalities and sequencing assay types, contributed and curated by the community.
2025-07-23 15:40:00 15:50:00 12 NIH/Elixir From Platforms to Practice: How the ELIXIR Model Enables Impactful, Sustainable Biodata Resources Fabio Liberante Fabio Liberante Biodata resources are only as impactful as the ecosystems in which they operate. ELIXIR provides a coordinated European infrastructure that supports the sustainability, discoverability, and effective reuse of life science data — enabling biodata resources to thrive in an increasingly complex global research environment. This talk will provide an overview of how ELIXIR delivers this support through its Core Data Resources, five Platforms — including Data and Interoperability — and an active network of Communities. Together, these elements underpin the long-term value and resilience of biodata infrastructures by helping resources implement FAIR practices, link across scientific domains, and plan for the full biodata resource lifecycle. We will highlight the role of registries and standards, the monitoring and periodic review of Core Data Resources, and the importance of both qualitative and quantitative indicators in tracking impact. Recent challenges — including the effects of large-scale data scraping — will also be discussed, alongside the need to balance openness with sustainability. Finally, we will share some insights from ELIXIR’s international collaborations, including with the NIH, to illustrate how global coordination enhances the visibility, value, and future-proofing of open data infrastructures.
2025-07-23 15:50:00 16:00:00 12 NIH/Elixir NIH-ODSS Ishwar Chandramouliswaran
2025-07-23 16:40:00 17:00:00 12 NIH/Elixir Melissa Harrison
2025-07-23 17:00:00 17:20:00 12 NIH/Elixir Mark Hahnel
2025-07-23 17:20:00 17:40:00 12 NIH/Elixir Evaluating the Impact of Biodata Resources: Insights from EMBL-EBI’s Impact Assessments Eleni Tzampatzopoulou Eleni Tzampatzopoulou The provision of open access data through biodata resources is a critical driver of breakthroughs in life sciences research, advances in clinical practice and industry innovations that benefit humankind. However, understanding their long-term economic and societal impacts remains a challenge. As part of ongoing efforts to establish a framework and evidence base for demonstrating the value of open data resources, EMBL-EBI employs a combination of qualitative and quantitative approaches, such as service monitoring metrics, cost-benefit analyses, large-scale user surveys, data resource usage analysis and in-depth case studies. Service monitoring metrics, including unique visitors, data submission volumes and citation of datasets, indicate the breadth and diversity of user engagement with FAIR data resources. The 2024 user survey showcased the depth of utility users derive from resources, such as research years saved and reduced duplication of effort. Surveys and other user engagement also highlight EMBL-EBI’s contribution to downstream products and AI model development. Economic impact analyses, focused on the impact of direct increases in research efficiency, do not quantify these secondary or indirect impacts through data reuse, even though qualitative data suggests they are likely to be significant. Here we explore how mixed methods can characterise the impact of data reuse, considering methodologies such as in-depth case studies, data mining, administrative data and other novel approaches. We consider different methodologies EMBL-EBI has explored and propose how future impact monitoring could capture a fuller extent of the direct and indirect impacts of biodata resources, informing priority setting for life sciences funders.
2025-07-23 17:40:00 18:00:00 12 NIH/Elixir Alex Bateman
2025-07-21 11:20:00 11:40:00 11A EvolCompGen Fair molecular feature selection unveils universally tumor lineage-informative methylation sites in colorectal cancer Cenk Sahinalp Xuan Li, Yuelin Liu, Alejandro Schäffer, Stephen Mount, Cenk Sahinalp In the era of precision medicine, performing comparative analysis over diverse patient populations is a fundamental step towards tailoring healthcare interventions. However, the critical aspect of fairly selecting molecular features across multiple patients is often overlooked. To address this challenge, we introduce FALAFL (FAir muLti-sAmple Feature seLection), an algorithmic approach based on combinatorial optimization. FALAFL performs feature selection in sequencing data while guaranteeing a balanced selection of features from all patient samples in a cohort. We have applied FALAFL to the problem of selecting lineage-informative CpG sites within a cohort of colorectal cancer patients subjected to low read coverage single-cell methylation sequencing. Our results demonstrate that FALAFL can rapidly and robustly determine the optimal set of CpG sites, which are each well covered by cells across the vast majority of the patients, while ensuring that in each patient a large proportion of these sites have high read coverage. An analysis of the FALAFL-selected sites reveals that their tumor lineage-informativeness exhibits a strong correlation across a spectrum of diverse patient profiles. Furthermore, these universally lineage-informative sites are highly enriched in the inter-CpG island regions. FALAFL brings unsupervised fairness considerations into the molecular feature selection from single-cell sequencing data obtained from a patient cohort. We hope that it will aid in designing panels for diagnostic and prognostic purposes and help propel fair data science practices in the exploration of complex diseases.
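FALAFL itself solves a combinatorial optimization problem; as a rough intuition for what a "balanced" multi-sample selection objective means (and emphatically not the authors' algorithm), the sketch below greedily picks, at each step, the candidate site that most helps the worst-covered patient. The coverage matrix and all names are hypothetical.

```python
import numpy as np

def greedy_balanced_selection(coverage, n_sites):
    """Greedily pick sites so the worst-covered patient is helped first.
    coverage[p, j] is True if site j is well covered in patient p.
    A hypothetical stand-in for FALAFL's combinatorial optimization."""
    n_patients, n_total = coverage.shape
    per_patient = np.zeros(n_patients, dtype=int)   # selected, well-covered sites per patient
    selected, remaining = [], set(range(n_total))
    for _ in range(n_sites):
        best_site, best_key = None, None
        for j in remaining:
            gained = per_patient + coverage[:, j]
            key = (gained.min(), gained.sum())      # lexicographic: max-min first, then total
            if best_key is None or key > best_key:
                best_site, best_key = j, key
        selected.append(best_site)
        per_patient += coverage[:, best_site]
        remaining.remove(best_site)
    return selected

rng = np.random.default_rng(0)
cov = rng.random((8, 200)) > 0.4                    # 8 patients, 200 candidate CpG sites
print(greedy_balanced_selection(cov, 10))
```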
2025-07-21 11:40:00 12:00:00 11A EvolCompGen Fast tumor phylogeny regression via tree-structured dual dynamic programming Henri Schmidt Henri Schmidt, Yuanyuan Qi, Ben Raphael, Mohammed El-Kebir Reconstructing the evolutionary history of tumors from bulk DNA sequencing of multiple tissue samples remains a challenging computational problem, requiring simultaneous deconvolution of the tumor tissue and inference of its evolutionary history. Recently, phylogenetic reconstruction methods have made significant progress by breaking the reconstruction problem into two parts: a regression problem over a fixed topology and a search over tree space. While effective techniques have been developed for the latter search problem, the regression problem remains a bottleneck in both method design and implementation due to the lack of fast, specialized algorithms. Here, we introduce fastppm, a fast tool to solve the regression problem via tree-structured dual dynamic programming. fastppm supports arbitrary, separable convex loss functions including the L2, piecewise linear, binomial and beta-binomial loss and provides asymptotic improvements for the L2 and piecewise linear loss over existing algorithms. We find that fastppm empirically outperforms both specialized and general-purpose regression algorithms, obtaining 50-450x speedups while providing solutions as accurate as those of existing approaches. Incorporating fastppm into several phylogeny inference algorithms immediately yields up to 400x speedups, requiring only a small change to the program code of existing software. Finally, fastppm enables analysis of low-coverage bulk DNA sequencing data in both simulations and a patient-derived mouse model of colorectal cancer, outperforming state-of-the-art phylogeny inference algorithms in terms of both accuracy and runtime.
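For intuition, the L2 instance of the regression-over-a-fixed-topology problem can be stated as a non-negative least squares fit of clone usages to observed mutation frequencies on a given clone tree. The sketch below solves a toy instance naively with SciPy's dense NNLS solver; fastppm's contribution is a specialized tree-structured dual dynamic program that replaces exactly this kind of generic solver. The tree and frequencies here are made up.

```python
import numpy as np
from scipy.optimize import nnls

def ancestry_matrix(parent):
    """A[i, j] = 1 if clone j lies in the subtree rooted at clone i."""
    n = len(parent)
    A = np.eye(n)
    for j in range(n):
        k = parent[j]
        while k != -1:
            A[k, j] = 1.0
            k = parent[k]
    return A

# Toy clone tree: node 0 is the root; clones 1 and 2 are its children,
# clone 3 sits below clone 1. All values are hypothetical.
parent = [-1, 0, 0, 1]
A = ancestry_matrix(parent)
f = np.array([1.0, 0.6, 0.3, 0.25])   # observed per-clone mutation frequencies
u, residual = nnls(A, f)              # L2 regression under the constraint u >= 0
print(np.round(u, 3), round(residual, 3))
```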
2025-07-21 12:00:00 12:20:00 11A EvolCompGen Bayesian inference of fitness landscapes via tree-structured branching processes Xiang Ge Luo Xiang Ge Luo, Jack Kuipers, Kevin Rupp, Koichi Takahashi, Niko Beerenwinkel Motivation: The complex dynamics of cancer evolution, driven by mutation and selection, underlies the molecular heterogeneity observed in tumors. The evolutionary histories of tumors of different patients can be encoded as mutation trees and reconstructed in high resolution from single-cell sequencing data, offering crucial insights for studying fitness effects of and epistasis among mutations. Existing models, however, either fail to separate mutation and selection or neglect the evolutionary histories encoded by the tumor phylogenetic trees. Results: We introduce FiTree, a tree-structured multi-type branching process model with epistatic fitness parameterization and a Bayesian inference scheme to learn fitness landscapes from single-cell tumor mutation trees. Through simulations, we demonstrate that FiTree outperforms state-of-the-art methods in inferring the fitness landscape underlying tumor evolution. Applying FiTree to a single-cell acute myeloid leukemia dataset, we identify epistatic fitness effects consistent with known biological findings and quantify uncertainty in predicting future mutational events. The new model unifies probabilistic graphical models of cancer progression with population genetics, offering a principled framework for understanding tumor evolution and informing therapeutic strategies.
2025-07-21 12:20:00 12:40:00 11A EvolCompGen MiClone: A Probabilistic Method for Inferring Cell Phylogenies from Mitochondrial Variants Emilia Hurtado Emilia Hurtado, Andrew Roth Cancer development and progression are largely fuelled by somatic mutations that give rise to clones – distinct subpopulations of malignant cells that emerge as a result of differential mutation and proliferation within tumours. Given that clones may exhibit selective survivorship in response to treatment, characterising the evolutionary history of a tumour’s clones is a critical task in cancer research. With the advent of single-cell resolution sequencing technologies, bulk clonal deconvolution is no longer strictly required – as the samples themselves are already separate cellular representations. However, even at single-cell resolution, challenges remain in the study of human disease. Interestingly, there does exist an adjacent source of potential phylogenetic signal in these single-cell measures, in the form of the mitochondrial genome. In the context of the cancer evolution characterisation problem, the use of mitochondrial variants could allow for improved study of cancers that are copy-number stable or that depend on few nuclear-genomic somatic variants. In this work we present MiClone, a Bayesian method to infer the phylogenetic relationship of single-cell genomes using the signal available in the mitochondrial genome. MiClone uses the proportions of mitochondrial variants across cells as input, treating each individual single-cell sample as a bulk mixture of mitochondrial genomes. Leveraging the PhyClone phylogenetic machinery, MiClone is able to scalably process thousands of single-cell samples to produce fine-grained and accurate clonal-prevalence estimates for each cell. We demonstrate MiClone’s performance using real-world datasets from 20 patients across a variety of cancer types, each consisting of thousands of single-cell genomes.
2025-07-21 12:40:00 12:50:00 11A EvolCompGen scVarID: Mapping Genetic Variants at Single-Cell Resolution to Uncover Precursor Cells in Cancer Jonghyun Lee Jonghyun Lee, Juyeon Cho, Byungjo Lee, Dongkwan Shin Next-generation sequencing technologies enable the identification of genetic variants and gene expression; however, measuring both features simultaneously within the same cell has remained challenging. Difficulties in co-isolating DNA and RNA have limited our ability to directly connect somatic mutations with transcriptional consequences. Additionally, variant detection from bulk sequencing cannot resolve haplotype-specific or cell-to-cell genetic heterogeneity—information that is crucial for understanding the progression of genetic diseases such as cancer. To address this, we developed scVarID, an algorithm that maps variant calls onto cell-level transcriptomes, producing cell-by-variant matrices of both variant and reference allele counts for each transcript across single cells. Using scVarID, we uncovered widespread allelic imbalance in peripheral blood mononuclear cells (PBMCs), particularly in subsets of cells exhibiting strong allele-specific expression (ASE) of HLA-related genes. This allelic imbalance was also observed in both normal and tumor epithelial cells from colorectal cancer patients. Most notably, ASE analysis of paired normal samples revealed a subset of cells with altered variant ratios in HLA-A, a gene frequently associated with ASE loss in tumors. This suggests the presence of potential precursor cell states marked by early post-transcriptional imbalance, which may precede tumorigenesis. These findings demonstrate scVarID’s ability to resolve genotype–phenotype relationships at single-cell resolution and to identify early-stage cellular alterations that may contribute to cancer development. This approach opens new avenues for early cancer detection and for studying the functional impact of somatic variation in pre-malignant cell populations.
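As a rough illustration of the kind of cell-by-variant matrices scVarID produces (the per-read input format here is hypothetical; the real tool works from aligned reads), one can pivot per-read allele calls into reference and alternate count matrices and derive an alt-allele fraction per cell:

```python
import pandas as pd

# Toy per-read calls: one row per (cell, variant, allele) observation.
reads = pd.DataFrame({
    "cell":    ["AAC", "AAC", "TTG", "TTG", "TTG", "GCA"],
    "variant": ["chr6:29943:A>G"] * 3 + ["chr17:7674:C>T"] * 3,
    "allele":  ["ref", "alt", "alt", "ref", "alt", "ref"],
})

# Count reads per (cell, variant, allele), then spread into matrices.
counts = (reads.groupby(["cell", "variant", "allele"]).size()
               .unstack("allele", fill_value=0))
ref_matrix = counts["ref"].unstack("variant", fill_value=0)
alt_matrix = counts["alt"].unstack("variant", fill_value=0)

# Alt-allele fraction per cell and variant (clip avoids division by zero).
alt_frac = alt_matrix / (alt_matrix + ref_matrix).clip(lower=1)
print(alt_frac.round(2))
```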
2025-07-21 12:50:00 13:00:00 11A EvolCompGen High-Resolution Discovery of Lineage-Specific SVs in Pan Genus Through Assembly Comparisons Aisha, Hua Chen The two sister species in the Pan genus, chimpanzees (Pan troglodytes) and bonobos (Pan paniscus), exhibit lineage-specific differences in several behavioral and physiological traits. In this work, our goal was to identify the genome regions within the Pan genus that have experienced structural rearrangements in a lineage-specific manner. Using the high-quality genome assemblies of chimpanzee, bonobo and human, we performed genome assembly comparisons using the human genome (GRCh38) as a reference. Lineage-specific structural variants (LSSVs) in the Pan genus provided enhanced insights into the genomic rearrangements that are likely to affect gene function and phenotypes. We observed considerable variation in SV distribution between the two species, with SVs widespread in the chimpanzee assembly and scarce in the bonobo assembly. Focusing on the SVs harbored in CDS regions of protein-coding genes, we found 22 genes in bonobos highly impacted by SVs that lead to either feature truncation or transcript ablation. Notably, a total of 232 SV-impacted genes experienced transcript ablation and were found to be involved in olfactory transduction, keratinization and transcriptional regulation. Functional enrichment analysis of LSSV-impacted genes revealed enrichment for body growth, brain function, and neurological disorders in the bonobo lineage, while metabolism and transcriptional regulation showed enrichment in the chimpanzee lineage. Additionally, it was discovered that a small number of the SV-affected genes were responsible for the distinctive behavioral differences between the two lineages, indicating their role in determining the lineage-specific characteristics present in the Pan genus.
2025-07-21 14:00:00 14:20:00 11A EvolCompGen Tracing the functional divergence of duplicated genes Irene Julca Alex Warwick Vesztrocy, Natasha Glover, Paul D. Thomas, Christophe Dessimoz, Irene Julca Gene duplication is a fundamental driver of functional innovation in evolution. Following duplication, paralogous genes may be retained, diverge in function (through sub- or neo-functionalisation), or be lost. When paralogues are retained, identifying which copy preserves the ancestral function can be challenging. The “least diverged orthologue” (LDO) conjecture proposes that the paralogue evolving more slowly at the sequence level is more likely to retain the ancestral function. In this study, we systematically test this hypothesis using a novel method that detects asymmetric sequence evolution in gene families. We applied this approach to all gene trees from the PANTHER database, encompassing gene duplications across the Tree of Life. We further integrated structural data for over one million proteins and gene expression data from 16 animal and 20 plant species. Our analysis, spanning thousands of gene families, reveals that although many paralogues evolve at comparable rates, a substantial fraction exhibits marked asymmetry in sequence divergence. This asymmetry correlates with differences in expression profiles and predicted functional annotations. Together, the results strongly support the LDO conjecture: the least diverged paralogue tends to retain ancestral function, while the more rapidly evolving copy may acquire specialised or novel roles. These findings have significant implications for orthology prediction and functional annotation in comparative genomics.
2025-07-21 14:20:00 14:40:00 11A EvolCompGen Duplication Episode Clustering in Phylogenetic Networks Agnieszka Mykowiecka Pawel Gorecki, Jarosław Paszek, Agnieszka Mykowiecka Phylogenetic networks provide a powerful framework for representing complex evolutionary histories that traditional tree-based models cannot adequately capture. In particular, phylogenetic networks allow modeling evolutionary processes that involve reticulate events such as hybridization, horizontal gene transfer, and introgression. At the same time, macro-evolutionary events such as genomic and whole-genome duplications add another layer of complexity, leading to gene family expansions, functional divergence, and lineage-specific innovation. While each process has been studied extensively in isolation, recent advances highlight the need to consider them jointly, as they often co-occur in shaping genomic landscapes. We present two novel problems that aim to infer genomic duplication episodes by duplication clustering in the phylogenetic network using a collection of gene family trees. First, we propose a polynomial-time dynamic programming (DP) formulation that verifies the existence of a set of episodes from a predefined set of episode candidates. We then demonstrate how to use DP to design an algorithm that solves a general inference problem. To evaluate our method, we perform computational experiments on empirical data containing whole genome duplication events in a network of Pandanales species, showing that our algorithms can accurately verify genomic duplication hypotheses.
2025-07-21 14:40:00 15:00:00 11A EvolCompGen PhytClust: Accurate and Fast Clustering in Phylogenetic Trees Katyayni Ganesan Katyayni Ganesan, Elisa Billard, Tom L Kaufmann, Cody B Duncan, Maja C Cwikla, Adrian Altenhoff, Christophe Dessimoz, Roland F Schwarz Phylogenetic trees play an important role in disentangling the evolutionary relationships between taxa, across diverse fields. A key question is the identification of distinct subpopulations within a phylogenetic tree. Several methods have been developed to classify taxa in phylogenetic trees into clusters based on their evolutionary distance. However, these approaches tend to rely on arbitrary thresholds that vary across studies, making meaningful interpretation and comparison between results challenging. Additionally, they often rely on heuristics to limit their cluster search, as enumerating all possible clusters within a tree is prohibitive for large trees. Here, we present PhytClust, a novel tool that provides a standardized approach to clustering taxa in phylogenetic trees into monophyletic clusters, bypassing the use of subjective parameters. PhytClust uses a score derived from the cumulative intra-cluster branch lengths to (i) find the optimal set of clusters for a given number of clusters, and (ii) from these candidate cluster sets, select the one that optimally represents the tree’s topology and genetic distances. PhytClust provides an exact and efficient solution to the clustering problem based on dynamic programming, making it suitable for large trees with up to 100k taxa. Compared to existing methods, PhytClust is faster and more accurate in recovering ground truth clusters. We apply PhytClust to data spanning various biological domains, including cancer genomics, avian phylogenomics, bacterial and archaea phylogenetics, and plant genomics. By providing a standardized method for node clustering, PhytClust can help infer optimal clusters for a phylogenetic tree without any additional parameters.
2025-07-21 15:00:00 15:20:00 11A EvolCompGen Community detection at unprecedented scales with ExoLabel Erik Wright Aidan Lakshman, Erik Wright Many approaches in comparative genomics rely on clusters of orthologous genes (COGs). Methods for constructing COGs often employ community detection algorithms to identify clusters within a network of pairwise similarities among genes. As the number of available genome sequences continues to grow exponentially, this community detection step has proven to be the limiting factor for scaling COGs to more genomes – both in terms of memory and time required. In this study, we developed ExoLabel, a community detection program that can scale to enormous graphs by applying a linear-time algorithm to data outside of memory (i.e., on disk). We show that ExoLabel's accuracy rivals that of popular programs for identifying COGs but is orders of magnitude faster and more memory efficient than existing programs. We demonstrate ExoLabel's performance by clustering a graph with 16.2 million nodes (genes) and 18.3 billion edges (pairwise similarities) in less than a day using only a few gigabytes of RAM. ExoLabel democratizes comparative genomics in settings without access to supercomputers and scales COG detection to new heights.
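The abstract does not name the linear-time algorithm, but label propagation is a plausible member of that family (and the tool's name suggests it). Below is a toy, in-memory weighted label propagation sketch under that assumption; ExoLabel's actual contribution is running this class of update out of memory, streaming the edge list from disk rather than holding the graph in RAM.

```python
import random
from collections import Counter, defaultdict

def label_propagation(edges, n_iter=10, seed=42):
    """Toy in-memory label propagation over a weighted edge list.
    Each node repeatedly adopts the label with the largest weighted
    vote among its neighbors, until labels stabilize."""
    rng = random.Random(seed)
    graph = defaultdict(list)
    for u, v, w in edges:
        graph[u].append((v, w))
        graph[v].append((u, w))
    labels = {node: node for node in graph}        # every node starts alone
    nodes = list(graph)
    for _ in range(n_iter):
        rng.shuffle(nodes)
        changed = False
        for node in nodes:
            votes = Counter()
            for nbr, w in graph[node]:
                votes[labels[nbr]] += w            # neighbors vote by edge weight
            best = max(votes, key=votes.get)
            if best != labels[node]:
                labels[node], changed = best, True
        if not changed:
            break
    return labels

edges = [("a", "b", 2.0), ("b", "c", 2.0), ("a", "c", 1.5),
         ("d", "e", 3.0), ("e", "f", 2.5), ("c", "d", 0.1)]
print(label_propagation(edges))
```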
2025-07-21 15:20:00 15:40:00 11A EvolCompGen Accurate multiple sequence alignment at protein-universe scale with FAMSA 2 Adam Gudys Adam Gudys, Andrzej Zielezinski, Cedric Notredame, Sebastian Deorowicz Multiple sequence alignment (MSA) is a crucial analysis in computational biology applied, e.g., in phylogeny reconstruction or protein function prediction. Within a few years, large-scale sequencing efforts such as the Earth BioGenome Project will produce billions of sequences representing the full diversity of the protein universe. However, state-of-the-art MSA algorithms do not keep pace with the exponentially increasing size of sequence repositories. To address this issue, we present FAMSA 2. Compared to its predecessor, it offers higher accuracy and speed, enhanced robustness to non-homologous sequence contamination, and a number of usability features like alignment trimming. The algorithmic improvements include a novel guide tree heuristic based on medoid clustering, particularly suited for ultra-scale analyses. The performance of FAMSA 2 has been evaluated on several data sets. They included Pfam families enriched with Homstrad reference alignments, AliSim-simulated alignments, and AlphaFold clusters. The experiments confirmed that the presented algorithm matches or exceeds the accuracy of Muscle5, Clustal Omega, and T-Coffee's regressive method while being orders of magnitude faster. FAMSA 2 was the only algorithm to align a set of over 12 million sequences. This was done in 40 minutes on a 64 GB RAM workstation. The first version of FAMSA, with almost 60,000 downloads and applications in milestone projects like AlphaFold and Pfam, confirmed its usefulness to the community. We believe that FAMSA 2, by enabling evolutionary and structural analyses at a scale beyond the reach of competing tools, will have an even larger impact in the field.
2025-07-21 15:40:00 16:00:00 11A EvolCompGen Learning the Language of Phylogeny with MSA Transformer Ruyi Chen Ruyi Chen, Gabriel Foley, Mikael Boden Classical phylogenetic inference assumes independence between sites, potentially undermining the accuracy of evolutionary analyses in the presence of epistasis. Some protein language models have the capacity to encode dependencies between sites in conserved structural and functional domains across the protein universe. We employ the MSA Transformer, which takes a multiple sequence alignment (MSA) as input and is trained with masked language modeling objectives, to investigate if and how effects of epistasis can be captured to enhance the analysis of phylogenetic relationships. We test whether the MSA Transformer internally encodes evolutionary distances between the sequences in the MSA despite this information not being explicitly available during training. We investigate the model's reliance on information available in columns as opposed to rows in the MSA, by systematically shuffling sequence content. We then use MSA Transformer on both natural and simulated MSAs to reconstruct entire phylogenetic trees with implied ancestral branchpoints, and assess their consistency with trees from maximum likelihood inference. We demonstrate how both previously known and novel evolutionary relationships are available from a “non-classical” approach with very different computational requirements, by reconstructing phylogenetic trees for the RNA virus RNA-dependent RNA polymerase and the nucleo-cytoplasmic large DNA virus domain. We anticipate that MSA Transformer will not replace but rather complement classical phylogenetic inference, to accurately recover the evolutionary history of protein families.
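The row- versus column-shuffling control described above is easy to picture: permuting within columns preserves per-column composition but destroys each sequence, while permuting within rows preserves each sequence's composition but destroys the alignment's column structure. A minimal numpy sketch on a toy MSA (not the authors' exact protocol):

```python
import numpy as np

rng = np.random.default_rng(1)
msa = np.array([list(s) for s in
                ["MKV-LLA", "MKVALLA", "MRV-LIA", "MKVSLLA"]])  # toy aligned sequences

def shuffle_columns(msa):
    """Permute residues within each column: per-column composition is
    kept, but each row stops being a real sequence."""
    out = msa.copy()
    for j in range(out.shape[1]):
        out[:, j] = rng.permutation(out[:, j])
    return out

def shuffle_rows(msa):
    """Permute residues within each row: sequences keep their
    composition, but the column structure is destroyed."""
    out = msa.copy()
    for i in range(out.shape[0]):
        out[i, :] = rng.permutation(out[i, :])
    return out

print("".join(shuffle_columns(msa)[0]))
print("".join(shuffle_rows(msa)[0]))
```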
2025-07-21 16:40:00 17:00:00 11A EvolCompGen EdgeHOG: Scalable and Fine-Grained Ancestral Gene Order Inference Across the Tree of Life Charles Bernard Charles Bernard, Yannis Nevers, Alex Warwick Vesztrocy, Natasha Glover, Adrian Altenhoff, Christophe Dessimoz Ancestral genomes are essential for studying the diversification of life from the last universal common ancestor to modern organisms. Methods have been proposed to infer ancestral gene order, but they lack scalability, limiting the depth to which gene neighborhood evolution can be traced back. We introduce edgeHOG, a tool designed for accurate ancestral gene order inference with linear time complexity. Validated on various benchmarks, edgeHOG was applied to the entire OMA orthology database, encompassing 2,845 extant genomes across all domains of life. This represents the first tree-of-life scale inference, resulting in 1,133 ancestral genomes. In particular, we reconstructed ancestral contigs for the last common ancestor of eukaryotes, dating back around 1.8 billion years, and observed significant functional association among neighboring genes. The method also dates gene adjacencies, revealing conserved histone clusters and rapid chromosome rearrangements, enabling computational inference of these features.
2025-07-21 17:00:00 17:20:00 11A EvolCompGen JIGSAW: Accurate inference of exact copy numbers from targeted single-cell DNA sequencing Sophia Chirrane Sophia Chirrane, Simone Zaccaria Tumorigenesis is driven by the interplay between somatic single-nucleotide variants (SNVs) and larger structural alterations, like copy number alterations (CNAs), that are simultaneously accumulated in cancer cell genomes. This process results in highly heterogeneous tumours composed of distinct subpopulations of cells, or tumour clones, with different SNV and CNA combinations driving cancer progression and the development of treatment resistance. Recent targeted single-cell sequencing technologies (tag-scDNA-seq, e.g., the Mission Bio Tapestri platform) provide ideal data to study this interplay because the deep, unbiased sequencing coverage of a targeted gene panel allows the assessment of both SNVs and CNAs in each cell. However, while analyzing SNVs is relatively straightforward, no method currently exists to accurately infer CNAs from tag-scDNA-seq due to the extremely high level of random variance caused by the very low number of reads sequenced from only a minimal fraction of the genome. Here, we introduce JIGSAW (Joint Inference by Grouping Single-cell-clones of Amplicon-copy-numbers without Whole-genome), the first algorithm to infer CNAs from tag-scDNA-seq data. JIGSAW overcomes tag-scDNA-seq sparsity challenges by jointly grouping amplicons that share the same CNA state and clustering cells into clones using a Bayesian framework. Through extensive realistic simulations, we demonstrated that JIGSAW not only accurately retrieves CNAs but is also robust to increasing levels of CNA heterogeneity, cell-specific noise, and both clonal and subclonal whole-genome doublings. Applied to 2,153 pancreatic ductal adenocarcinoma cells and 12,000 AML cells, JIGSAW uncovered novel CNAs affecting cancer driver genes in conjunction with SNVs.
2025-07-21 17:20:00 17:40:00 11A EvolCompGen EASYstrata: a fully integrated workflow to infer evolutionary strata along sex chromosomes and other supergenes Ricardo C. Rodriguez de la Vega Quentin Rougemont, Elise Lucotte, Loreleï Boyer, Alexandra Jalaber de Dinechin, Alodie Snirc, Tatiana Giraud, Ricardo C. Rodriguez de la Vega New reference-level genomes are becoming increasingly available across the tree of life, opening new avenues for addressing exciting evolutionary questions. However, challenges remain in genome annotation, sequence alignment, evolutionary inference and a general lack of methodological standardization. Here, we present a new workflow designed to overcome these challenges in evolutionary analyses, facilitating the detection of recombination suppression and its consequences, such as structural rearrangements, transposable element accumulation and coding sequence degeneration. To achieve this, we integrate multiple bioinformatic steps into a single, reproducible and user-friendly pipeline. This workflow combines state-of-the-art tools to efficiently detect transposable elements, annotate newly assembled genomes, infer gene orthology, compute sequence divergence, as well as objectively identify stepwise extensions of recombination suppression (i.e., evolutionary strata) and their associated structural changes, while visualizing results throughout the process. We demonstrate how this Evolutionary analysis with Ancestral SYnteny for strata identification (EASYstrata) workflow was used to re-annotate 42 published Microbotryum genomes and a pair of giant plant sex chromosomes. We recovered all previously described strata and identified several that had gone unnoticed. While primarily developed to infer divergence between sex or mating-type chromosomes, EASYstrata can also be applied to any pair of haplotypes with diverging regions of interest, such as autosomal supergenes. This workflow will facilitate the study of the many non-model species for which newly sequenced, phased diploid genomes are now becoming available. EASYstrata and detailed use cases can be found at https://github.com/QuentinRougemont/EASYstrata Preprint: https://www.biorxiv.org/content/10.1101/2025.01.06.631483v1.full
2025-07-21 17:40:00 18:00:00 11A EvolCompGen The Phylogenetic Dynamic Regulatory Module Networks (P-DRMN) study infers Cis-regulatory features responsible for evolution of mammalian gene regulatory programs in aortic endothelium Suvojit Hazra Suvojit Hazra, Sara A Knaack, Erika Da-Inn Lee, Liangxi Wang, Mohamed Hawash, Huayun Hou, Michael Wilson, Sushmita Roy Cis-regulatory elements (CREs), such as promoters and enhancers, interact with transcription factors (TFs) to drive gene regulatory programs and contribute to morphological diversity across species. Comparative regulatory genomics, which integrates omic measurements across species, offers a powerful framework for studying the evolution of gene regulation. While multi-omic profiling, combining transcriptomic and epigenomic data, has advanced, computational tools that are both phylogenetically aware and capable of analyzing high-dimensional, cross-species data remain scarce. To address this, we introduce Phylogenetic Dynamic Regulatory Module Networks (P-DRMN), a novel multi-task regression-based algorithm that models dynamic gene module regulatory networks using RNA-seq, ATAC-seq, and ChIP-seq data while incorporating phylogenetic relationships. P-DRMN clusters genes into similarly expressing, discrete gene modules based on expression levels and uses a regression function of upstream CREs to infer species-specific module-TF regulatory programs. We applied P-DRMN to aortic endothelial cell data, including gene expression, promoter/motif accessibility, and five histone modifications (H3K27ac, H3K36me3, H3K4me3, H3K4me2, H3K27me3), from five mammals (human, rat, cow, pig, and dog). P-DRMN inferred 19-65% conservation of gene modules across species, with high- and low-expression modules being the most conserved and diverged, respectively. We identified 103 transitioning gene sets with species- or clade-specific expression patterns, many regulated by distinct TFs and chromatin marks, for example, CTCF in human-specific high-expression modules, and SHOX2, H3K27me3, and H3K4me3 in pig/cow-specific high-expression modules. These results demonstrate how CREs and chromatin states shape species-specific gene expression. Overall, P-DRMN provides a powerful framework for integrating multi-omics data to study the evolutionary dynamics of gene regulation.
2025-07-22 11:20:00 11:30:00 11A EvolCompGen Introduction Iddo Friedberg Introduction to the joint session Function and EvolCompGen
2025-07-22 11:30:00 12:10:00 11A EvolCompGen Evolution of function in light of gene expression Marc Robinson-Rechavi Marc Robinson-Rechavi One of the fundamental questions of genome evolution is how gene function changes or is constrained, whether between species (orthologs) or inside gene families (paralogs). While computational prediction is making major progress on function in a broad sense, most evolutionary changes concern details that are small in the big picture, yet very significant for organismal function. For example, new organs or new physiological adaptations often come from repurposing genes whose basic molecular function is conserved while taking a novel role. Gene expression provides a unique window into such fine details of gene function. I will present how gene expression of diverse species, bulk and single-cell, is integrated into Bgee; how gene expression can be used to test hypotheses of functional change after duplication (the
2025-07-22 12:10:00 12:20:00 11A EvolCompGen Convergent evolution to similar proteins confounds structure search Erik Wright Erik Wright Advances in protein structure prediction and structural search tools (e.g., FoldSeek and PLMSearch) have enabled large-scale comparison of protein structures. It is now possible to quickly identify structurally similar proteins ("structurlogs"), but it remains unclear whether these similarities reflect homology (common ancestry) or analogy (convergent evolution). In this study, we found that ~2.6% of FoldSeek clusters lack sequence-level support for homology, including about 1% of matches with high TM-score (>= 0.5). The lack of sequence homology could be due to extreme protein divergence or independent evolution to a similar structure. Here, we show that tandem repeats provide strong evidence for the presence of analogous protein structures. Our results suggest that analogs infiltrate structure search results and that care should be taken when relying on structural similarity alone if homology is desired. This problem may extend beyond repeat proteins to other low complexity folds, and structure search tools could be improved by masking these regions in the same manner as is done by sequence search programs.
2025-07-22 12:20:00 12:30:00 11A EvolCompGen Evolution of the Metazoan Protein Domain Toolkit Revealed by a Birth-Death-Gain Model Maureen Stolzer Maureen Stolzer, Yuting Xiao, Dannie Durand Domains, sequence fragments that encode protein modules with a distinct structure and function, are the basic building blocks of proteins. The set of domains encoded in the genome serves as the functional toolkit of the species. Here, we use a phylogenetic birth-death-gain model to investigate the evolution of this protein toolkit in metazoa. Given a species tree and the set of protein domain families in each present-day species, this approach estimates the most likely rates of domain origination, duplication, and loss. Statistical hierarchical clustering of domain family rates reveals sets of domains with similar rate profiles, consistent with groups of domains evolving in concert. Moreover, we find that domains with similar functions tend to have similar rate profiles. Interestingly, domains with functions associated with metazoan innovations, including immune response, cell adhesion, tissue repair, and signal transduction, tend to have the fastest rates. We further infer the expected ancestral domain content and the history of domain family gains, losses, expansions, and contractions on each branch of the species tree. Comparative analysis of these events reveals that a small number of evolutionary strategies, corresponding to toolkit expansion, turnover, specialization, and streamlining, are sufficient to describe the evolution of the metazoan protein domain complement. Thus, the use of a powerful, probabilistic birth-death-gain model reveals a striking harmony between the evolution of domain usage in metazoan proteins and organismal innovation.
2025-07-22 12:30:00 12:40:00 11A EvolCompGen Deep Phylogenetic Reconstruction Reveals Key Functional Drivers in the Evolution of B1/B2 Metallo-β-Lactamases Samuel Davis Samuel Davis, Pallav Joshi, Ulban Adhikary, Julian Zaugg, Phil Hugenholtz, Marc Morris, Gerhard Schenk, Mikael Boden Metallo-β-lactamases (MBLs) comprise a diverse family of antibiotic-degrading enzymes. Despite their growing implication in drug-resistant pathogens, no broadly effective clinical inhibitors against MBLs currently exist. Notably, β-lactam-degrading MBLs appear to have emerged twice from within the broader, catalytically diverse MBL-fold protein superfamily, giving rise to two distinct monophyletic groups: B1/B2 and B3 MBLs. Comparative analyses have highlighted distinct structural hallmarks of these subgroups, particularly in metal-coordinating residues. However, the precise evolutionary events underlying their emergence remain unclear due to challenges presented by extensive sequence divergence. Understanding the molecular determinants driving the evolution of β-lactamase activity may inform design of broadly effective inhibitors. We sought to infer the evolutionary features driving the emergence of B1/B2 MBLs via phylogenetics and ancestral reconstruction. To overcome challenges associated with evolutionary analysis at this scale, we developed a phylogenetically aware sequence curation framework centred on iterative profile HMM refinement. This framework was applied over several iterations to construct a comprehensive phylogeny encompassing the B1/B2 MBLs and several other recently diverged clades. The resulting tree represents the most robust hypothesis to date regarding the emergence of B1/B2 MBLs and implies a parsimonious evolutionary history of key features, including variation in active site architecture and insertions and deletions of distinct structural elements. Ancestral proteins inferred at key internal nodes were experimentally characterised, revealing distinct activity profiles that reflect underlying evolutionary transitions. These findings give rise to testable hypotheses regarding the molecular basis and evolutionary drivers of functional diversification, as well as potential targets for MBL inhibitor design.
2025-07-22 12:40:00 12:50:00 11A EvolCompGen A compendium of human gene functions derived from evolutionary modeling Paul D. Thomas Marc Feuermann, Huaiyu Mi, Pascale Gaudet, Anushya Muruganujan, Suzanna Lewis, Dustin Ebert, Tremayne Mushayahama, Gene Ontology Consortium, Paul D. Thomas A comprehensive, computable representation of the functional repertoire of all macromolecules encoded within the human genome is a foundational resource for biology and biomedical research. We have recently published a paper (Feuermann et al., Nature 640:146, 2025) describing our initial release of a human gene “functionome,” a comprehensive set of human gene function descriptions using Gene Ontology (GO) terms, supported by experimental evidence. This work involved integration of all applicable experimental Gene Ontology (GO) annotations for human genes and their homologs, using a formal, explicit evolutionary modeling framework. We will review this work and its major findings, and describe subsequent progress on an updated version. In more detail, we will describe the results of a large, international effort to integrate experimental findings from more than 100,000 publications to create a representation of human gene functions that is as complete and accurate as possible. Specifically, we applied an expert-curated, explicit evolutionary modeling approach to all human protein-coding genes, which integrates available experimental information across families of related genes into models reconstructing the gain and loss of functional characteristics over evolutionary time. The resulting set of integrated functions covers ~82% of human protein-coding genes, and the evolutionary models provide insights into the evolutionary origins of human gene functions. We show that our set of function descriptions can improve the widely used genomic technique of GO enrichment analysis. The experimental evidence for each functional characteristic is recorded, enabling the scientific community to help review and improve the resource, available at https://functionome.geneontology.org.
2025-07-22 12:50:00 13:00:00 11A EvolCompGen pLM in functional annotation: relationship between sequence conservation and embedding similarity Ana Rojas Ana Rojas, Ildefonso Cases, Rosa Fernandez, Gemma Martínez-Redondo, Francisco M. Perez-Canales Functional annotation of protein sequences remains a bottleneck for understanding the biology of both model and non-model organisms, as conventional homology-based tools often fail to assign functions to the majority of newly sequenced genes. We first benchmarked several protein language models (pLMs) on well-characterized model organisms, demonstrating superior recovery of functional signals from transcriptomic datasets compared to traditional methods. We then applied our pipeline to annotate ~1,000 animal proteomes, encompassing 23 million genes, and discovered candidate genes involved in gill regeneration in a non-model insect. To elucidate how pLM embeddings relate to primary-sequence conservation, we computed cosine distances between embeddings and aligned sequences to derive percent identity. Statistical analyses—including Pearson correlation, polynomial regression, and quantile regression—revealed complex, non-linear relationships between embedding similarity and sequence identity that vary markedly across models. These findings indicate that pLM embeddings capture orthogonal functional features beyond simple residue conservation. Altogether, our work highlights the power of pLM-based annotation for expanding functional insights in biodiversity projects and underscores the need to interpret embedding distances in light of each model’s unique representational biases.
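A minimal sketch of the comparison described above, relating embedding cosine similarity to alignment percent identity, might look as follows. The embeddings and alignments here are random placeholders; in the study they come from real pLMs and real sequence pairs, and the analysis also fits non-linear trends.

```python
import numpy as np
from scipy.stats import pearsonr

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def percent_identity(s1, s2):
    """Identity over ungapped columns of two aligned sequences."""
    pairs = [(x, y) for x, y in zip(s1, s2) if x != "-" and y != "-"]
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

# Placeholder data: mean-pooled per-protein embeddings and pairwise alignments.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 128))
alignments = [("ACDEFG", "ACDEFG"), ("ACDEFG", "ACDQFG"),
              ("ACDEFG", "AMDQFA"), ("ACDEFG", "GMWQHA")]

cos = [cosine_similarity(emb[2 * i], emb[2 * i + 1]) for i in range(4)]
pid = [percent_identity(a, b) for a, b in alignments]
r, p = pearsonr(cos, pid)   # with real data, quantile/polynomial fits follow
print(round(r, 3), round(p, 3))
```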
2025-07-22 14:00:00 14:20:00 11A EvolCompGen Disentangling SARS-CoV-2 Lineage Importations and the Role of NPIs Using Bayesian Phylogeography of 1.8 Million Genomes Sama Goliaei Sama Goliaei, Mohammad-Hadi Foroughmand-Araabi, Aideen Roddy, Ariane Weber, Sanni Översti, Denise Kühnert, Alice McHardy Nonpharmaceutical interventions (NPIs) were key to limiting SARS-CoV-2 transmission before vaccines, though their effectiveness—especially regarding mask use and socioeconomic trade-offs—remains under discussion. Leveraging a Bayesian phylogeographic framework, we analyzed 1.8 million globally sampled SARS-CoV-2 genomes to quantify lineage importations into Germany during the third pandemic wave (late 2020–early 2021). Across three sampling strategies, we observed a consistent decline in importations following key NPIs, notably the provision of free rapid antigen tests and mandates for surgical/FFP2 mask usage. While mask efficacy has been debated, our data show that upgrading from cloth to medical-grade masks coincided with sharp reductions in importation frequency—particularly in densely populated states. We introduce a novel metric, the Smoothed Importation Frequency (SIF), and a daily effectiveness measure that allows more precise, real-time assessment of NPIs by smoothing fluctuations in importation data, thus overcoming limitations of previous methods that lacked temporal resolution and clarity. Our findings reveal that major lineage importations clustered around the Christmas holiday period, and spread disproportionately from populous states, identifying these as critical nodes in national transmission dynamics. These results demonstrate the importance of integrating phylogenetic data with real-world intervention timelines to decode the drivers of pathogen spread. Beyond confirming the effectiveness of masks and rapid testing, our study highlights the notable impact of restricting gatherings and movements, supporting a data-driven, targeted approach to pandemic response. The data suggests that scalable, low-socioeconomic-cost measures like rapid testing and surgical-grade masking, when accessible, may be especially valuable early in outbreaks, when vaccines are not yet available.
2025-07-22 14:20:00 14:40:00 11A EvolCompGen SARS-CoV-2 Intra-Host Evolution in Immuno-Compromised Individuals: A Fractal Perspective on Genome Geometry Nicole A. Rogowski Nicole A. Rogowski, Kees Mourik, Nithya Kuttiyarthu Veetil, Stefan A. Boers, Anna H.E. Roukens, Simon P. Jochems, Louis A.C.M. Kroes, Igor A. Sidorov, Jelle J. Goeman, Jutte J.C. de Vries Studies have associated the punctuated evolution of SARS-CoV-2 variants with prolonged infections and subsequent transmission. We describe the genetic signatures of SARS-CoV-2 intra-host evolution in 10 immuno-compromised (IC) patients and 5 competent controls, in 55 longitudinal samples. We included two types of IC: induced (immune suppressants) and innate (haematological disease). The mutational profile was analysed between IC types, over time, and in response to treatment (host directed and antiviral). However, almost all studies on viral evolution consider only a ‘consensus’ sequence for a virus (mutations >50% frequency represented as ambiguous nucleotides) – ignoring the diverse viral pool (quasi-species) arising from replication errors. Including the full viral quasi-species profile is essential to understanding how resistance mutations arise. When making phylogenetic trees for all variants, the high levels of ambiguous positions caused tree construction to fail. Here we report a novel approach based on the chaos game, which can leverage viral quasi-species and produce phylogenetic trees regardless of ambiguity. Graphical representations were generated using Chaos Game Representation (CGR), which draws a “walk” to encode genetic information. Each walk has a set of independent mutations, and by compiling thousands of walks for each sample, covering most combinations of mutations, Frequency CGR (FCGR) objects were created. Due to the collection of walks, positional ambiguity and complex mutations can be easily incorporated in phylogeny (using topology-based methods), resulting in closer relationships between patient samples. The same topology-based methods produced a 3D visualization of the genome space, similar to an antigen map, highlighting distinct signatures visible in IC patients.
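The underlying chaos game construction is standard and compact: each nucleotide pulls the current point halfway toward its assigned corner of the unit square, and binning the resulting walk yields a frequency CGR that is equivalent to a geometrically arranged k-mer table. A minimal sketch follows; the study's handling of ambiguous positions and its compilation of thousands of walks per sample are more involved.

```python
import numpy as np

CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def cgr_walk(seq):
    """Each base pulls the current point halfway toward its corner."""
    pts, x, y = [], 0.5, 0.5
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        pts.append((x, y))
    return pts

def fcgr(seq, k=3):
    """Frequency CGR: bin the walk into a 2^k x 2^k grid; each cell
    corresponds to one k-mer, so this is a geometric k-mer table."""
    grid = np.zeros((2**k, 2**k))
    for x, y in cgr_walk(seq)[k - 1:]:     # skip the k-1 warm-up points
        grid[int(y * 2**k), int(x * 2**k)] += 1
    return grid / grid.sum()

print(fcgr("ATGCGCGTATAGCGCATAT", k=2))
```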
2025-07-22 14:40:00 15:00:00 11A EvolCompGen Antarctica as a Viral Reservoir: Insights from Comparative Genomics and Metagenomics Caroline Martiniuc Caroline Martiniuc, Igor Taveira, Fernanda Abreu, Anderson Cabral, Rodolfo Paranhos, Deborah Leite, Lucy Seldin, Diogo Jurelevicius Two bioinformatics approaches stand out in the study of viromes in extreme environments: prophage comparative genomics and viral metagenomic analyses. The bacterium Rummeliibacillus stabekisii emerges as an interesting model for investigating extremophilic prophages, as it has been isolated from spacecraft surfaces and Antarctic soils, raising questions about the role of prophages in its environmental resilience. Additionally, Antarctica faces hydrocarbon contamination, making these regions even more hostile. Understanding ecological and metabolic interactions in this context can help elucidate microbial relationships in such environments. For the comparative genomics study, genomes of R. stabekisii from spacecraft surfaces and Antarctic soil were analyzed. PHASTER was used to identify prophages within the genomes, followed by annotation with BLAST. Furthermore, metagenomic analyses were performed on five hydrocarbon-contaminated Antarctic soil samples. The samples were sequenced using Illumina and assembled with MEGAHIT. Viral contigs were identified using VirSorter, and taxonomy was classified with PhaGCN. Viral hosts were assigned based on data from the International Committee on Taxonomy of Viruses (ICTV) and the CHERRY software. Comparative genomic analysis revealed that Antarctic R. stabekisii harbored the highest number of intact prophages, with genes suggesting adaptive advantages and regions acting as hotspots for recombination. In contaminated soils, the class Caudoviricetes exhibited the highest abundance. Most detected viral hosts belonged to hydrocarbon-degrading bacterial genera within the phyla Pseudomonadota and Actinomycetota. Additionally, auxiliary viral metabolic genes associated with nitrogen and phosphorus cycles were identified. Both results reinforce the relevance of viruses as agents of genetic and ecological modulation in Antarctica.
2025-07-22 15:00:00 15:10:00 11A EvolCompGen Computational Genomics and Biosynthetic Potential Analysis of a Dead Sea Penicillium sp. Milana Frenkel-Morgenstern Dylan Dsouza, Milana Frenkel-Morgenstern Extreme environments harbor unique microbial life with biotechnological potential. Here, we characterize a novel Penicillium sp. isolated from the hypersaline Dead Sea, capable of thriving at 70‰ salt concentration. Whole-genome and transcriptome sequencing were performed, followed by de novo assembly and quality assessment using QUAST and BUSCO. Functional annotation of predicted peptides was conducted using InterProScan, UPIMAPI, and Blast+ with NCBI-RefSeq and UniProtKB, validating spectral data from LC-MS/MS (nanoAcquity coupled with Q Exactive HFX) analyzed via Proteome Discoverer v2.4, SequestHT, and MS Amanda 2.0. Key enzymes in penicillin biosynthesis were confirmed. Biosynthetic potential was assessed using AntiSMASH and dbCAN3, with SignalP 6.0 machine learning predicting secretory proteins. Phylogenetic analysis of single-copy orthologs was performed using OrthoFinder. The genome revealed biosynthetic gene clusters for valuable bioactives, including mellein, lovastatin, sorbicillin, and roquefortine. Strong antimicrobial inhibition was observed in E. coli NEB+ STABL from extracts grown in a high-nitrogen medium with phenylacetate and 20% Dead Sea water. At the transcript level, RFAM annotation identified THI4 and THI5 riboswitches, with secondary structures predicted via RNAfold and R2DT. Conservation analysis using LocARNA provided insights into regulatory mechanisms. These findings highlight the computationally driven discovery of biosynthetic pathways and stress-adaptive mechanisms in Penicillium sp., demonstrating its potential for industrial applications in extreme environments.
2025-07-22 15:10:00 15:20:00 11A EvolCompGen Unravelling the pangenome of autotrophic bacteria: Metabolic commonalities, evolutionary relationships, and industrially relevant traits Dr. Karan Kumar Dr. Karan Kumar, Tobias B. Alter, Lars M. Blank Atmospheric CO₂ fixation by microbial autotrophs presents a sustainable alternative to energy-intensive chemical processes, offering significant potential for biotechnological applications. However, understanding the genetic diversity, evolutionary adaptations, and metabolic capabilities of autotrophic carbon-fixing lineages (ACL) requires a comparative genomic approach. This study employs pangenome analysis to systematically assess the core, accessory, and unique genetic components across diverse ACL bacteria, with a particular focus on the recently revised genus Xanthobacter and the newly proposed Roseixanthobacter. By integrating phylogenetic, functional, and metabolic insights, we aim to elucidate conserved and variable genetic traits that contribute to CO₂ fixation efficiency and industrial relevance. A total of 546 high-quality genomes spanning 121 ACL microbial species were selected for analysis, following rigorous genome quality control measures based on CheckM contamination thresholds, contig limits, and genome size variation criteria. Initial phylogenomic analyses identified 16 microbial genera closely related to Xanthobacter, including Ancylobacter, Azorhizobium, Cupriavidus, Hydrogenophaga, Moorella, and Synechococcus, among others. Genomes were uniformly re-annotated to ensure consistency in gene identification. Pangenome reconstruction, core-genome diversity assessments, orthologous group clustering, and essential metabolic pathway mapping were performed to identify key functional traits enabling inorganic carbon assimilation, H₂ utilization, and N₂ fixation. Among these traits are RuBisCO for CO₂ fixation, hydrogenases for H₂ metabolism, and nitrogenase complexes for converting atmospheric N₂ into bioavailable forms. The findings of this study will contribute to metabolic engineering efforts, facilitating the development of optimized microbial strains for sustainable biotechnology applications such as alternative protein production, biofuel production, carbon sequestration, and synthetic biology.
2025-07-22 15:20:00 15:30:00 11A EvolCompGen Spatiotemporal patterns in the human gut dysbiosis contrasted to healthy families Falk Hildebrand Falk Hildebrand, Katarzyna Sidorczuk, Rebecca Ansorge The gut microbiome is essential to the wellbeing and health of its human host, yet most studies to date resolve the gut microbial community only at genus or species level. However, we know that two bacterial strains of the same species can differ by more than half their genome and that pathogenicity is encoded at the strain - not species - level. Therefore, my group develops technologies to track bacterial strains in metagenomic time series and to investigate evolutionary pressures. Our studies have uncovered the extreme persistence of bacterial strains in individual human hosts (doi: 10.1016/j.chom.2021.05.008). Using strain tracking, we can uncover the colonization of multiple family members, creating a “family-specific microbiome”. In disease, too, we find significant shifts in microbial strains: using a meta-analysis of >5,000 metagenomes, I will show typical strain enrichments associated with IBD and their temporal patterns during episodes of inflammatory flares. These research lines demonstrate the importance of increasing both taxonomic and genomic resolution in microbiome studies to uncover the microbial patterns prevalent in disease and health.
2025-07-22 15:30:00 15:40:00 11A EvolCompGen Marker discovery in the large Beatriz Vieira Mourato Beatriz Vieira Mourato, Ivan Tsers, Svenja Denker, Fabian Klötzl, Bernhard Haubold Pathogen outbreaks are now routinely tracked by whole genome sequencing. This leads to ever-increasing opportunities for marker discovery beyond the traditional candidate gene approach. Ideal genetic markers are present in all target organisms and nowhere else. Such markers have maximal sensitivity and specificity. Evolutionary biology implies that the vast majority of potentially non-specific sequences are present in the closest distinct relatives of the targets, their neighbors. We have implemented this insight in our software for finding unique genomic regions, Fur. Fur takes as input a set of target and neighbor genomes and returns the regions present in all targets that are absent from all neighbors. The resulting list of regions is highly enriched for diagnostic markers. Fur is based on suffix array algorithms, making it fast. However, its original version required memory proportional to the size of the neighborhood. Here we present the new Fur, which requires memory proportional to the longest neighbor sequence. This allows marker discovery from whole genome sequences on consumer-grade hardware. For example, the analysis of 178 target and 1,074 neighbor genomes of Streptococcus pneumoniae took 9m 16s and used 11.6GB RAM. We applied Fur to 120 diverse bacterial taxa and tested the marker candidates by comparison to nt. We found that the marker candidates had excellent in silico sensitivity and specificity making them ideal starting material for developing diagnostic genetic markers in vitro.
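The target-minus-neighbor idea behind Fur can be caricatured with k-mer sets: keep stretches whose k-mers occur in every target and in no neighbor. The sketch below is a naive in-memory illustration with toy sequences; Fur's actual implementation uses suffix arrays and processes neighbors so that memory scales with the longest neighbor sequence rather than the whole neighborhood.

```python
def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def unique_regions(targets, neighbors, k=8):
    """Keep stretches of the first target whose k-mers occur in every
    target and in no neighbor. Naive and in-memory, unlike Fur."""
    shared = set.intersection(*(kmers(t, k) for t in targets))
    marker_kmers = shared - set().union(*(kmers(n, k) for n in neighbors))
    t, regions, start = targets[0], [], None
    for i in range(len(t) - k + 1):
        if t[i:i + k] in marker_kmers:
            start = i if start is None else start   # open or extend a run
        elif start is not None:
            regions.append(t[start:i + k - 1])      # close the current run
            start = None
    if start is not None:
        regions.append(t[start:])
    return regions

targets = ["AAACCCGGGTTTACGTACGT", "AAACCCGGGTTTACGAACGT"]   # toy target genomes
neighbors = ["AAACCCGGGAAATTTTCCCC"]                          # toy neighbor genome
print(unique_regions(targets, neighbors, k=6))
```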
2025-07-22 15:40:00 15:50:00 11A EvolCompGen Whole-genome detection and origin identification of orphan genes in plant-parasitic nematodes Ercan Seçkin Ercan Seçkin, Etienne Danchin, Dominique Colinet, Edoardo Sarti Genes with no known homologs constitute 5% to 30% of every organism’s genome. These orphan genes have either rapidly diverged from a family or have appeared de novo from a previously non-coding region. Their detection, origin identification, and structural characterization are challenging, and evidence about the nature of de novo genes seems to be strongly species-dependent. In root-knot nematodes (Meloidogyne), orphan genes have been linked to parasitic functions, and are thus of great agronomical interest. Starting from recently sequenced whole genomes of eight species of Meloidogyne, we use comparative homology, transcriptomics and proteomics for robust detection of orphan genes. Then, we rely on ancestral sequence reconstruction strategies and synteny approaches for identifying their origin. We find that 19% of all orphan genes are most likely to be de novo, and 30% divergent. Taking a balanced subset, we perform protein structure prediction with AlphaFold2, ESMFold and OmegaFold, and find that all three protein language models produce low-confidence predictions. This result does not seem caused by an increased intrinsic disorder in orphan proteins (that we calculated with AIUPred and flDPnn), but rather by the low similarity between the query orphan sequences and the training sets of the structure predictors. The dataset is thus a challenging, homology-free benchmark for structure, disorder, and emergence prediction.
2025-07-22 15:50:00 16:00:00 11A EvolCompGen Construction and Analysis of the Moniliophthora roreri pangenome Isabella Gallego Isabella Gallego, Diego Mauricio Riaño-Pachón Moniliophthora roreri, the causal agent of frosty pod rot, is a devastating fungal pathogen affecting cacao production across Latin America. Its broad host range, ecological adaptability, and high pathogenicity underscore the need to understand its genomic diversity to inform disease management strategies. Here, we present a comprehensive pangenome analysis of 24 publicly available M. roreri genomes using two state-of-the-art graph-based methods: Minigraph-Cactus and PGGB. Graph-based approaches allow us to integrate structural variation and genome-wide sequence diversity into a unified representation. The resulting pangenomes were used to classify genes into core, accessory, and strain-specific categories, revealing genomic features likely associated with adaptation and pathogenicity. Functional annotation was performed with HMMER and PANNZER2, and enriched Gene Ontology terms were identified for each gene category using the topGO and REVIGO tools, offering insight into biological processes specific to different parts of the genome. The study also includes a comparative analysis between our graph-based pangenomes and a previously constructed orthology-based version. This evaluation uses metrics such as genome completeness, representation of structural variants, core/accessory gene content, and computational performance. Our findings demonstrate the value of graph-based methods in capturing the genomic complexity of fungal pathogens and provide a foundation for future research into the molecular basis of virulence and host adaptation in M. roreri.
2025-07-22 16:40:00 17:00:00 11A EvolCompGen Recomb-Mix: fast and accurate local ancestry inference Yuan Wei Yuan Wei, Degui Zhi, Shaojie Zhang Motivation: The availability of large genotyped cohorts brings new opportunities for revealing the high-resolution genetic structure of admixed populations via local ancestry inference (LAI), the process of identifying the ancestry of each segment of an individual haplotype. Though current methods achieve high accuracy in standard cases, LAI is still challenging when reference populations are closely related (e.g., intra-continental), when reference populations are numerous, or when admixture events are deep in time, all of which are increasingly unavoidable in large biobanks. Results: In this work, we present a new LAI method, Recomb-Mix. Recomb-Mix integrates elements of existing methods built on the site-based Li and Stephens model and introduces a new graph-collapsing trick to simplify counting paths with the same ancestry-label readout. Through comprehensive benchmarking on various simulated datasets, we show that Recomb-Mix is more accurate than existing methods in diverse sets of scenarios while being competitive in terms of resource efficiency. We expect that Recomb-Mix will be a useful method for advancing genetics studies of admixed populations. Availability and Implementation: The implementation of Recomb-Mix is available at https://github.com/ucfcbb/Recomb-Mix.
2025-07-22 17:00:00 17:20:00 11A EvolCompGen WINDEX: A hierarchical integration of site- and window-based statistics for modeling the footprint of positive selection Hannah Snell Hannah Snell, Scott McCallum, Dhruv Raghavan, Ritambhara Singh, Sohini Ramachandran, Lauren Sugden In genetics studies, scientists search for mutations that explain changes in phenotype or population diversity. Adaptive mutations, or mutations that increase in frequency by conferring a fitness benefit, leave behind statistical signals in genetic data that genome-wide scans for selection can reveal. Computational methods have improved the localization of adaptive mutations in genetic samples using machine learning techniques. However, these methods fail to account for the effect of linkage disequilibrium on localization and miss the opportunity to incorporate statistics at varying resolutions. Leveraging statistics at both individual sites and in local genetic windows allows us to capture features of positive selection footprints due to changes in allele frequencies, haplotypes, or site-frequency spectra (SFS). Our proposed method, WINDEX, aims to combine these differing resolutions of statistics within a hierarchical hidden Markov model architecture to improve the prediction of positively selected loci among hitchhiking signals. WINDEX contains site- and window-dependent latent states corresponding to neutral, linked, and adaptive regions. This structure uses the information provided by both statistical resolutions to make classifications, capturing a broader range of signals left by a positive selective sweep. WINDEX shows strong performance, with 99% accuracy on artificially generated sequences, and competitive performance against baselines such as SWIF(r) and a multi-layer perceptron (MLP). WINDEX is currently being tested on canonical positive selection sites in the human genome using data from the 1000 Genomes Project. Overall, WINDEX provides the opportunity to incorporate the full range of existing selection statistics to improve localization and understand the footprint of positive selection.
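A minimal flat-HMM sketch of the neutral/linked/adaptive latent-state idea follows (WINDEX's actual model is hierarchical and combines site- and window-level emissions; all probabilities here are illustrative placeholders):

```python
import numpy as np

# Forward pass of a flat three-state HMM over per-site emission
# likelihoods P(observed statistics | state); the states mirror the
# abstract's neutral / linked / adaptive regions.
STATES = ("neutral", "linked", "adaptive")
START = np.array([0.98, 0.015, 0.005])
TRANS = np.array([[0.98, 0.015, 0.005],
                  [0.05, 0.90, 0.05],
                  [0.01, 0.09, 0.90]])

def forward_weights(emissions):
    """emissions: (n_sites, 3) array of per-state emission likelihoods."""
    alpha = START * emissions[0]
    alpha /= alpha.sum()
    for e in emissions[1:]:
        alpha = (alpha @ TRANS) * e
        alpha /= alpha.sum()          # rescale to avoid numeric underflow
    return dict(zip(STATES, alpha))   # filtered state weights at the last site

print(forward_weights(np.random.dirichlet(np.ones(3), size=100)))
```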
2025-07-22 17:20:00 17:30:00 11A EvolCompGen Position-specific evolution in transcription factor binding sites, and a fast likelihood calculation for the F81 model Pavitra Selvakumar Pavitra Selvakumar, Rahul Siddharthan Transcription factor binding sites (TFBS), like other DNA sequences, evolve via mutation and selection related to their function. Models of nucleotide evolution describe DNA evolution via single-nucleotide mutation. A stationary vector of such a model is the long-term distribution of nucleotides, which remains unchanged under the model. Neutrally evolving sites may have uniform stationary vectors, but one expects that sites within a TFBS instead have stationary vectors reflecting the fitness of the various nucleotides at those positions. We introduce 'position-specific stationary vectors' (PSSVs), the collection of stationary vectors at each site in a TFBS locus, analogous to the position weight matrix (PWM) commonly used to describe TFBS. We infer PSSVs for human TFs using two evolutionary models (Felsenstein 1981 and Hasegawa-Kishino-Yano 1985). We find that PSSVs reflect the nucleotide distribution from PWMs, but with reduced specificity. We infer ancestral nucleotide distributions at individual positions and calculate 'conditional PSSVs' conditioned on specific choices of majority ancestral nucleotide. We find that certain ancestral nucleotides exert a strong evolutionary pressure on neighbouring sequence while others have a negligible effect. Finally, we present a fast likelihood calculation for the F81 model on moderate-sized trees that makes this approach feasible for large-scale studies along these lines.
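For reference, the F81 transition probabilities have a closed form, which is what a fast pruning-based likelihood calculation can exploit (standard parameterization; the talk's exact normalization may differ):

```latex
% F81: probability that nucleotide i is replaced by j over branch length t,
% given stationary vector \pi:
P_{ij}(t) = e^{-\beta t}\,\delta_{ij} + \bigl(1 - e^{-\beta t}\bigr)\,\pi_j,
\qquad
\beta = \frac{1}{1 - \sum_k \pi_k^2},
% with \beta chosen so that t is measured in expected substitutions per site.
```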
2025-07-22 17:30:00 18:00:00 11A EvolCompGen Concluding remarks
2025-07-22 11:20:00 11:30:00 12 Function Introduction Iddo Friedberg Introduction to the joint session of Function and EvolCompGen
2025-07-22 11:30:00 12:10:00 12 Function Evolution of function in light of gene expression Marc Robinson-Rechavi Marc Robinson-Rechavi One of the fundamental questions of genome evolution is how gene function changes or is constrained, whether between species (orthologs) or inside gene families (paralogs). While computational prediction is making major progress on function in a broad sense, most evolutionary changes concern details that are small in the big picture, yet very significant for organismal function. For example, new organs or new physiological adaptations often come from repurposing genes whose basic molecular function is conserved while taking a novel role. Gene expression provides a unique window into such fine details of gene function. I will present how gene expression of diverse species, bulk and single-cell, is integrated into Bgee, and how gene expression can be used to test hypotheses of functional change after duplication.
2025-07-22 12:10:00 12:20:00 12 Function Convergent evolution to similar proteins confounds structure search Erik Wright Erik Wright Advances in protein structure prediction and structural search tools (e.g., FoldSeek and PLMSearch) have enabled large-scale comparison of protein structures. It is now possible to quickly identify structurally similar proteins ("structurlogs"), but it remains unclear whether these similarities reflect homology (common ancestry) or analogy (convergent evolution). In this study, we found that ~2.6% of FoldSeek clusters lack sequence-level support for homology, including about 1% of matches with high TM-scores (≥ 0.5). The lack of sequence homology could be due to extreme protein divergence or independent evolution to a similar structure. Here, we show that tandem repeats provide strong evidence for the presence of analogous protein structures. Our results suggest that analogs infiltrate structure search results and that care should be taken when relying on structural similarity alone if homology is desired. This problem may extend beyond repeat proteins to other low-complexity folds, and structure search tools could be improved by masking these regions in the same manner as sequence search programs do.
2025-07-22 12:20:00 12:30:00 12 Function Evolution of the Metazoan Protein Domain Toolkit Revealed by a Birth-Death-Gain Model Maureen Stolzer Maureen Stolzer, Yuting Xiao, Dannie Durand Domains, sequence fragments that encode protein modules with a distinct structure and function, are the basic building blocks of proteins. The set of domains encoded in the genome serves as the functional toolkit of the species. Here, we use a phylogenetic birth-death-gain model to investigate the evolution of this protein toolkit in metazoa. Given a species tree and the set of protein domain families in each present-day species, this approach estimates the most likely rates of domain origination, duplication, and loss. Statistical hierarchical clustering of domain family rates reveals sets of domains with similar rate profiles, consistent with groups of domains evolving in concert. Moreover, we find that domains with similar functions tend to have similar rate profiles. Interestingly, domains with functions associated with metazoan innovations, including immune response, cell adhesion, tissue repair, and signal transduction, tend to have the fastest rates. We further infer the expected ancestral domain content and the history of domain family gains, losses, expansions, and contractions on each branch of the species tree. Comparative analysis of these events reveals that a small number of evolutionary strategies, corresponding to toolkit expansion, turnover, specialization, and streamlining, are sufficient to describe the evolution of the metazoan protein domain complement. Thus, the use of a powerful, probabilistic birth-death-gain model reveals a striking harmony between the evolution of domain usage in metazoan proteins and organismal innovation.
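For orientation, one common parameterization of a phylogenetic birth-death-gain process is sketched below (the abstract does not specify the exact rates used, so this parameterization is an assumption):

```latex
% A family with n domain copies evolves along each branch as a linear
% birth-death process with gain (origination \kappa, duplication \lambda,
% loss \mu):
\Pr\{n \to n+1 \text{ in } dt\} = (\kappa + n\lambda)\,dt,
\qquad
\Pr\{n \to n-1 \text{ in } dt\} = n\mu\,dt .
```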
2025-07-22 12:30:00 12:40:00 12 Function Deep Phylogenetic Reconstruction Reveals Key Functional Drivers in the Evolution of B1/B2 Metallo-β-Lactamases Samuel Davis Samuel Davis, Pallav Joshi, Ulban Adhikary, Julian Zaugg, Phil Hugenholtz, Marc Morris, Gerhard Schenk, Mikael Boden Metallo-β-lactamases (MBLs) comprise a diverse family of antibiotic-degrading enzymes. Despite their growing implication in drug-resistant pathogens, no broadly effective clinical inhibitors against MBLs currently exist. Notably, β-lactam-degrading MBLs appear to have emerged twice from within the broader, catalytically diverse MBL-fold protein superfamily, giving rise to two distinct monophyletic groups: B1/B2 and B3 MBLs. Comparative analyses have highlighted distinct structural hallmarks of these subgroups, particularly in metal-coordinating residues. However, the precise evolutionary events underlying their emergence remain unclear due to challenges presented by extensive sequence divergence. Understanding the molecular determinants driving the evolution of β-lactamase activity may inform design of broadly effective inhibitors. We sought to infer the evolutionary features driving the emergence of B1/B2 MBLs via phylogenetics and ancestral reconstruction. To overcome challenges associated with evolutionary analysis at this scale, we developed a phylogenetically aware sequence curation framework centred on iterative profile HMM refinement. This framework was applied over several iterations to construct a comprehensive phylogeny encompassing the B1/B2 MBLs and several other recently diverged clades. The resulting tree represents the most robust hypothesis to date regarding the emergence of B1/B2 MBLs and implies a parsimonious evolutionary history of key features, including variation in active site architecture and insertions and deletions of distinct structural elements. Ancestral proteins inferred at key internal nodes were experimentally characterised, revealing distinct activity profiles that reflect underlying evolutionary transitions. These findings give rise to testable hypotheses regarding the molecular basis and evolutionary drivers of functional diversification, as well as potential targets for MBL inhibitor design.
2025-07-22 12:40:00 12:50:00 12 Function A compendium of human gene functions derived from evolutionary modeling Paul D. Thomas Marc Feuermann, Huaiyu Mi, Pascale Gaudet, Anushya Muruganujan, Suzanna Lewis, Dustin Ebert, Tremayne Mushayahama, Gene Ontology Consortium, Paul D. Thomas A comprehensive, computable representation of the functional repertoire of all macromolecules encoded within the human genome is a foundational resource for biology and biomedical research. We have recently published a paper (Feuermann et al., Nature 640:146, 2025) describing our initial release of a human gene “functionome,” a comprehensive set of human gene function descriptions using Gene Ontology (GO) terms, supported by experimental evidence. This work involved integration of all applicable experimental Gene Ontology (GO) annotations for human genes and their homologs, using a formal, explicit evolutionary modeling framework. We will review this work and its major findings, and describe subsequent progress on an updated version. In more detail, we will describe the results of a large, international effort to integrate experimental findings from more than 100,000 publications to create a representation of human gene functions that is as complete and accurate as possible. Specifically, we applied an expert-curated, explicit evolutionary modeling approach to all human protein-coding genes, which integrates available experimental information across families of related genes into models reconstructing the gain and loss of functional characteristics over evolutionary time. The resulting set of integrated functions covers ~82% of human protein-coding genes, and the evolutionary models provide insights into the evolutionary origins of human gene functions. We show that our set of function descriptions can improve the widely used genomic technique of GO enrichment analysis. The experimental evidence for each functional characteristic is recorded, enabling the scientific community to help review and improve the resource, available at https://functionome.geneontology.org.
2025-07-22 12:50:00 13:00:00 12 Function pLM in functional annotation: relationship between sequence conservation and embedding similarity Ana Rojas Ana Rojas, Ildefonso Cases, Rosa Fernandez, Gemma Martínez-Redondo, Francisco M. Perez-Canales Functional annotation of protein sequences remains a bottleneck for understanding the biology of both model and non-model organisms, as conventional homology-based tools often fail to assign functions to the majority of newly sequenced genes. Protein language models (pLMs) offer an alternative. We first benchmarked each pLM on well-characterized model organisms, demonstrating superior recovery of functional signals from transcriptomic datasets compared to traditional methods. We then applied our pipeline to annotate ~1,000 animal proteomes, encompassing 23 million genes, and discovered candidate genes involved in gill regeneration in a non-model insect. To elucidate how pLM embeddings relate to primary-sequence conservation, we computed cosine distances between embeddings and aligned sequences to derive percent identity. Statistical analyses, including Pearson correlation, polynomial regression, and quantile regression, revealed complex, non-linear relationships between embedding similarity and sequence identity that vary markedly across models. These findings indicate that pLM embeddings capture orthogonal functional features beyond simple residue conservation. Altogether, our work highlights the power of pLM-based annotation for expanding functional insights in biodiversity projects and underscores the need to interpret embedding distances in light of each model’s unique representational biases.
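The embedding-similarity versus sequence-identity comparison can be sketched as follows (the `embed` function is a hypothetical stand-in for a real pLM call, and the sequence pairs are placeholders, not the authors' pipeline):

```python
import numpy as np
from scipy.stats import pearsonr

def cosine_distance(a, b):
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed(seq):
    # Hypothetical stand-in for a mean-pooled pLM embedding of `seq`.
    rng = np.random.default_rng(abs(hash(seq)) % 2**32)
    return rng.normal(size=1280)

# (sequence A, sequence B, percent identity from an alignment)
pairs = [("MKTAYIAK", "MKTAYIAS", 92.1),
         ("MGLSDGEW", "AQWLKDGG", 18.4),
         ("MVLSPADK", "MVLSGEDK", 75.0)]
dists = [cosine_distance(embed(a), embed(b)) for a, b, _ in pairs]
idents = [pid for _, _, pid in pairs]
# With real data one would probe non-linear structure with polynomial
# or quantile regression, as the abstract describes.
r, p = pearsonr(dists, idents)
```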
2025-07-22 14:00:00 14:20:00 12 Function GOAnnotator: Accurate protein function annotation using automatically retrieved literature Huiying Yan Huiying Yan, Hancheng Liu, Shaojun Wang, Shanfeng Zhu Automated protein function prediction/annotation (AFP) is vital for understanding biological processes and advancing biomedical research. Existing text-based AFP methods, including the state-of-the-art method GORetriever, rely on expert-curated relevant literature, which is costly and time-consuming and covers only a small portion of the proteins in UniProt. To overcome this limitation, we propose GOAnnotator, a novel framework for automated protein function annotation. It consists of two key modules: PubRetriever, a hybrid system for retrieving and re-ranking relevant literature, and GORetriever+, an enhanced module for identifying Gene Ontology (GO) terms from the retrieved texts. Extensive experiments over three benchmark datasets demonstrate that GOAnnotator delivers high-quality functional annotations, surpassing GORetriever by uncovering unique literature and predicting additional functions. These results highlight its great potential to streamline and enhance the annotation of protein functions without relying on manual curation.
2025-07-22 14:20:00 14:40:00 12 Function Semi-Supervised Data-Integrated Feature Importance Enhances Performance and Interpretability of Biological Classification Tasks Jun Kim Jun Kim, Russ Altman Accurate model performance on training data does not ensure alignment between the model’s feature weighting patterns and human knowledge, which can limit the model’s relevance and applicability. We propose Semi-Supervised Data-Integrated Feature Importance (DIFI), a method that numerically integrates a priori knowledge, represented as a sparse knowledge map, into the model’s feature weighting. By incorporating the similarity between the knowledge map and the feature map into a loss function, DIFI causes the model’s feature weighting to correlate with the knowledge. We show that DIFI can improve the performance of neural networks using two biological tasks. In the first task, cancer type prediction from gene expression profiles was guided by identities of cancer type-specific biomarkers. In the second task, enzyme/non-enzyme classification from protein sequences was guided by the locations of the catalytic residues. In both tasks, DIFI leads to improved performance and feature weighting that is interpretable. DIFI is a novel method for injecting knowledge to achieve model alignment and interpretability.
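One way DIFI's knowledge-alignment term could be realized is sketched below (a hedged reconstruction from the abstract, not the authors' code; the attribution method and the weight λ are assumptions):

```python
import torch
import torch.nn.functional as F

def difi_loss(logits, labels, feature_attribution, knowledge_map, lam=0.1):
    """Task loss plus a penalty for disagreement between the model's
    per-feature attribution and a sparse a priori knowledge map."""
    task = F.cross_entropy(logits, labels)
    align = F.cosine_similarity(feature_attribution.flatten(),
                                knowledge_map.flatten(), dim=0)
    return task + lam * (1.0 - align)

logits = torch.randn(8, 4)            # 8 samples, 4 cancer types (toy)
labels = torch.randint(0, 4, (8,))
attribution = torch.rand(100)         # e.g. mean |input gradient| per gene
knowledge = torch.zeros(100)
knowledge[:10] = 1.0                  # sparse map: 10 known biomarker genes
loss = difi_loss(logits, labels, attribution, knowledge)
```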
2025-07-22 14:40:00 15:00:00 12 Function On the completeness, coherence, and consistency of protein function prediction: lifting function prediction from isolated proteins to biological systems Rund Tawfiq Rund Tawfiq, Maxat Kulmanov, Robert Hoehndorf The Critical Assessment of Functional Annotation (CAFA) defines protein function prediction as the task of assigning Gene Ontology (GO) terms to individual proteins, and evaluates performance using ontology-based metrics. However, proteins rarely function in isolation; instead, they act within biological systems that impose genome-wide constraints. With the increasing availability of complete genomes, we define a new computational problem that extends the CAFA approach to genome-scale protein function prediction. Defining this task allows us to evaluate the biological plausibility of a set of predicted functions. We propose three evaluation criteria: completeness, coherence, and consistency. Completeness requires that all biologically essential functions are predicted for at least one protein in a genome. Coherence ensures that all necessary dependencies between functions are satisfied. Consistency is the absence of mutually exclusive functions within a genome or protein. We formalize these criteria as logical constraints using GO axioms, inter-ontology mappings, and curated biological knowledge. We implemented an evaluation framework based on the constraints we define, and applied it to six function prediction methods (DeepGOMeta, InterProScan, DeepFRI, TALE, DeepGraphGO, SPROF-GO) across 1,000 complete bacterial genomes. We also applied it to annotations from six well-annotated bacterial model organisms. The methods were not specifically designed to perform our genome-scale function prediction task, and our results revealed limitations in all methods when assessed against the metrics. Our results demonstrate that current methods, while effective at the protein level, do not produce biologically plausible proteome annotations, motivating new frameworks for function prediction grounded in system-level biological constraints.
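A toy rendering of the three criteria as set checks (the actual framework formalizes them with GO axioms and inter-ontology mappings; the dependency below is illustrative and the exclusive pair is hypothetical):

```python
def evaluate_genome(predictions, essential, dependencies, exclusive):
    """predictions: protein -> set of predicted GO terms for one genome."""
    annotated = set().union(*predictions.values())
    complete = essential <= annotated                 # completeness
    coherent = all(prereqs <= annotated               # coherence
                   for term, prereqs in dependencies.items()
                   if term in annotated)
    consistent = not any(pair <= annotated            # consistency
                         for pair in exclusive)
    return {"complete": complete, "coherent": coherent,
            "consistent": consistent}

preds = {"protA": {"GO:0006412"},                     # translation
         "protB": {"GO:0006096", "GO:0006099"}}       # glycolysis, TCA cycle
print(evaluate_genome(preds,
                      essential={"GO:0006412"},
                      dependencies={"GO:0006099": {"GO:0006096"}},  # illustrative
                      exclusive=[{"GO:0006096", "GO:9999999"}]))    # hypothetical pair
```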
2025-07-22 15:00:00 15:20:00 12 Function Contextual Gene Set Analysis with Large Language Models Chih-Hsuan Wei Zhizheng Wang, Chi-Ping Day, Chih-Hsuan Wei, Qiao Jin, Robert Leaman, Yifan Yang, Shubo Tian, Aodong Qiu, Yin Fang, Qingqing Zhu, Xinghua Lu, Zhiyong Lu Gene set analysis (GSA) is a foundational technique in genomics research, enabling the identification of biological processes and disease mechanisms associated with genes. Traditional GSA methods typically rely on predefined, manually curated biological databases to identify statistically enriched functions from gene sets created by high-throughput studies. However, these approaches as well as the recent large language model (LLM)-based methods generally overlook the biological and experimental contexts in which the gene sets were derived. Consequently, they often produce extensive lists of enriched pathways that are generic, redundant, or misaligned with the study objectives. In addition, conventional GSA methods do not account for gene interactions within the input set, frequently resulting in the overrepresentation of central hub genes. This lack of context-awareness limits the biological relevance of the findings and hinders the accurate interpretation of results, thereby reducing the potential to derive meaningful insights or generate hypothesis-driven conclusions.
2025-07-22 15:20:00 15:40:00 12 Function Fine-tuning protein language models with a disorder-aware vocabulary improves intrinsic disorder classification and function prediction Harsh Srivastava Harsh Srivastava, Daniel Berenberg, Omar Qassab, Jane M. Carlton, Richard Bonneau Intrinsically disordered proteins (IDPs) and intrinsically disordered regions (IDRs) are essential to cellular processes but lack stable 3D conformations amenable to experimental structure determination. However, identifying key disorder-driving residues and their disorder-related functions remains challenging. Although protein language models (pLMs) generate rich sequence embeddings for many classification tasks, their explicit application to IDPs/IDRs is underexplored. Drawing from prior protein structure tokenization approaches, we hypothesize that fine-tuning pLM embeddings with disorder-aware tokens can substantially enhance downstream performance while preserving pretrained model representations. Here, we introduce a unified framework for predicting disordered residues, disordered binding regions, and disordered linker regions. (1) We developed DisToken, a disorder-aware per-residue vocabulary generated using a VQ-VAE trained on relevant intrinsic disorder annotations from MobiDB. DisTokens encode a meaningful composite of annotations, capture nuanced residue context, and distinguish intrinsic disorder from broader features, providing an alternative to conventional one-hot encodings used previously for fine-tuning pLMs. (2) We fine-tuned a low-parameter ESM-2 model with DisTokens, resulting in ESM-DisTok, which learned disorder-aware representations. (3) Minimal 1-D CNN classifiers trained on ESM-DisTok embeddings significantly outperformed those using baseline ESM embeddings and structure-aware ESM-3Di embeddings in disorder-residue classification, disorder-binding, and disorder-linker tasks. On CAID-2 benchmarks, our minimal ESM-DisTok-based classifiers ranked 1st by AUC and AUPR in predicting disorder-PDB, disorder-binding, and disorder-NOX, and 2nd for disorder-linker tasks relative to previously published methods. Overall, we demonstrate that integrating a disorder-aware vocabulary into pLM embeddings drastically enhances downstream intrinsic disorder-related predictive tasks.
2025-07-22 15:40:00 15:50:00 12 Function A Novel Computational Pipeline for the Functional Characterization and Deorphanization of G-Protein Coupled Receptors Catherine Zhou Catherine Zhou G protein-coupled receptors (GPCRs) are integral membrane proteins central to cellular signaling and intercellular communication, with Class A GPCRs playing key roles in many physiological processes and diseases. Despite their therapeutic potential, many remain orphan receptors, lacking identified endogenous ligands. Traditional de-orphaning methods are labor- and resource-intensive, highlighting the need for more efficient strategies. Here, we describe ongoing development of a multi-omics pipeline combining GPCR and ligand features, AI structural predictions, binding pocket analyses, and genomic and transcriptomic sequencing data to streamline the discovery of ligand pairings with orphan GPCRs. The pipeline analyzes tissue-specific gene expression data to identify co-expressed GPCR-ligand pairs, which are positioned to interact. Receptor and candidate ligand sequences and motif analyses inform potential ligand binding regions, while coevolution, conservation, and binding site similarity analysis refine interaction predictions. To model GPCR-ligand complexes, structural predictions (AlphaFold2/3, Boltz-1/2) are generated using a high-throughput pipeline optimized for parallelized batch execution on high performance servers. Models are evaluated using novel metrics to assess ligand binding feasibility, such as distance measurements between ligand and receptor domains and aggregated interaction scores across different types of contacts. The computational predictions are validated using experimental techniques. Initial application of this integrated approach has successfully identified novel ligand-receptor interactions, with ongoing efforts to develop a recurrent neural network for improved interaction classification. The pipeline’s success in deorphanizing GPCRs will lead to initiatives to expand its use for drug discovery, accelerating the identification of therapeutic targets for complex diseases.
2025-07-22 15:50:00 16:00:00 12 Function VaLPAS: Leveraging variation in experimental multi-omics data to elucidate protein function Jason McDermott Yannick Mahlich, Lummy Monteiro, Jason McDermott Despite continuing advances in sequencing and computational function determination, large parts of the studied gene, protein and metabolite space remain functionally undetermined. Most function assignment is driven by homology searches and annotation transfer from known and extensively studied proteins but often fails to leverage available experimental omics data generated via technologies like mass spectrometry. The VaLPAS (Variation-Leveraged Phenomic Association Study) framework is an approach combining experimental multi-omics readouts with computational methods to establish functional relationships between different omics modalities. The goal of this approach is to shed light on the functional dark matter of protein space by elucidating previously unknown functions of proteins and metabolites via association metrics (e.g. protein-metabolite correlation) and graph algorithms. We demonstrate that the framework can reliably recapitulate known functional relationships by applying VaLPAS to multi-omic data from Rhodosporidium toruloides and Yarrowia lipolytica cultured under different growth and stress conditions. We used KEGG Ortholog annotations for detected proteins and KEGG Compound annotations for metabolites, evaluating the resulting association scores in the context of chemical reactions (KEGG Reactions) and metabolic pathways (KEGG modules & pathways) using network analysis approaches. The resulting performance metrics detail the applicability of using experimental abundance data from detectable metabolites and proteins (extendable to other modes of experimental data) to infer protein functionality and metabolite annotation for as-yet unannotated data. Finally, the results imply that the approach can also aid in guiding experimental design to validate functional annotations.
2025-07-22 16:40:00 17:00:00 12 Function Accelerating protein family classification in InterPro with AI innovations Matthias Blum Matthias Blum, Alessandro Polignano, Irina Ponamareva, Alex Bateman InterPro is a freely accessible resource for classifying protein sequences into families, domains, and functional sites, integrating predictive signatures from member databases such as Pfam, CDD, and PROSITE. However, generating descriptive abstracts for unannotated signatures is a time-consuming manual task. To address this, we employed large language models (LLMs) to generate high-quality family descriptions. Using GPT-4 with Swiss-Prot-derived context, we automatically produced abstracts for over 5,000 PANTHER families. Nearly 3,900 of these were used to create new InterPro entries, completing in days what previously took months of curation. Since 2021, in collaboration with Dr Lucy Colwell's team at Google DeepMind, we have also explored deep learning for protein domain classification. This led to the development of InterPro-N, a novel model inspired by computer vision techniques and trained on all 13 InterPro member databases. InterPro-N significantly expands annotation coverage, assigning at least one annotation to ~90% of UniProtKB 2025_02 sequences, up from 84% using traditional methods. Predictions are accessible via the InterPro website, REST API, and FTP. Additionally, we have integrated over 300,000 structure predictions from the Big Fantastic Virus Database (BFVD) and domain boundaries from The Encyclopedia of Domains (TED), derived from AlphaFold models. These structure-based insights are now shown alongside conventional InterPro and InterPro-N results, enabling users to compare annotations across methodologies. Together, these AI-driven advances accelerate curation, expand functional coverage, and enrich protein classification, supporting faster and more comprehensive annotation of the rapidly growing protein sequence universe.
2025-07-22 17:00:00 17:20:00 12 Function Thousands of confident genetic interactions in an Escherichia coli mutant collection elucidate numerous gene functions Simon Jeanneau Simon Jeanneau, Mathias Martin Silva, Antoine Champie, Amélie De Grandmaison, Antoine Castonguay, Jean-Philippe Côté, Sébastien Rodrigue, Pierre-Étienne Jacques Despite extensive research, nearly one-third of Escherichia coli genes remain uncharacterized. Understanding how these genes interact to support cellular viability is essential not only for fundamental biology but also for identifying vulnerabilities that may guide novel antimicrobial strategies. While resources such as the Keio collection, which includes a comprehensive set of single-gene deletion mutants, have significantly advanced our knowledge of essential genes, the combinatorial nature of gene interactions remains largely unexplored at the genome scale, particularly in the context of synthetic lethality. We recently developed High-Throughput Transposon Mutagenesis (HTTM), an optimized, high-resolution method for the systematic exploration of genetic interactions. By applying HTTM across thousands of mutants, we probed nearly 16 million gene pairs for synthetic lethality, resulting in the most comprehensive interaction screen conducted in E. coli to date. Our analysis successfully recovered known synthetic lethal pairs and identified thousands of previously unreported interactions, including many involving poorly annotated or uncharacterized genes. Within this dataset, we identified densely connected regions of the interaction network, revealing genes that participate in numerous critical interactions. These interaction hubs represent vulnerable nodes in bacterial survival networks. Furthermore, the recurring association of uncharacterized genes with well-annotated functional clusters supports the concept of functional propagation—a process by which gene function can be inferred from shared interaction patterns. This extensive interaction map enhances the functional annotation of the E. coli genome and highlights combinatorial genetic vulnerabilities. These findings provide a valuable foundation for investigating bacterial physiology and for identifying new targets in the pursuit of antimicrobial development.
2025-07-22 17:20:00 17:40:00 12 Function Present and future of the critical assessment of protein function annotation algorithms (CAFA) M. Clara De Paolis Kaluza M. Clara De Paolis Kaluza, Rashika Ramola, Parnal Joshi, An Phan, Priyanka Banarjee, Damiano Piovesan, Walter Reade, Maggie Demkin, Addison Howard, Nate Keating, Paul Thomas, Maria Martin, Sandra Orchard, Iddo Friedberg, Predrag Radivojac Since its launch in 2010, the Critical Assessment of Functional Annotation (CAFA) has brought together computational biologists, biocurators, and experimental biologists to benchmark the state of computational prediction of protein function. It has served as a forum for discussion and collaboration to drive innovation in the field. Recent advances in protein representation, coupled with a growing interest from the machine learning community in biological applications, motivated CAFA organizers to expand their reach and invite a broader range of model developers to participate. To this end, the fifth CAFA experiment (CAFA 5) was conducted in partnership with Kaggle, a platform for data science competitions and collaborative model development. The reach and technology of this format resulted in a 22-fold increase over previous CAFAs in the number of participating teams, composed of entrants from 77 countries and various scientific and technical backgrounds. In this talk, we present an expanded analysis of the prediction models in CAFA 5 and discuss plans for CAFA 6. Our analysis of CAFA 5 shows marked improvements in the performance of predictions on Gene Ontology (GO) term annotations compared to models from past CAFA evaluations. We present a new setting for evaluating predictions of function annotations added to proteins with previously incomplete annotations and we suggest new directions for future computational prediction improvements based on these evaluations. Finally, we turn our attention to the future and discuss the planned challenges and assessments for CAFA 6, which will be launched in 2025.
2025-07-22 17:40:00 18:00:00 12 Function ProtHGT: Heterogeneous Graph Transformers for Automated Protein Function Prediction Using Biological Knowledge Graphs and Language Models Erva Ulusoy Erva Ulusoy, Tunca Dogan Accurate functional annotation of proteins is crucial for understanding complex biological systems. As protein sequence data grows rapidly, experimental methods cannot keep pace, underscoring the need for scalable computational approaches. In this study, we present ProtHGT, a heterogeneous graph transformer-based model designed to predict protein functions by integrating diverse biological datasets, including protein-protein interactions, pathways, domains, and phenotypic data. ProtHGT constructs a comprehensive heterogeneous graph with over 542,000 nodes and 3.7 million edges to capture complex biological relationships and employs relationship-specific attention mechanisms to refine node embeddings into biologically meaningful representations. It achieves state-of-the-art performance on benchmark datasets, consistently outperforming graph-based and sequence-based approaches. Advanced pretrained embeddings further enhance predictive accuracy by providing rich feature representations. Ablation analyses highlight the critical role of heterogeneous data integration, demonstrating the value of incorporating multiple node types, such as pathways and domains, to improve predictions. To ensure accessibility, ProtHGT is available as a programmatic tool on https://github.com/HUBioDataLab/ProtHGT and as a user-friendly web service on https://huggingface.co/spaces/HUBioDataLab/ProtHGT, enabling researchers with varying expertise to easily utilize the model. By integrating diverse data sources and leveraging cutting-edge graph transformer architecture, ProtHGT establishes itself as a powerful and accessible tool for advancing bioinformatics research.
2025-07-21 11:20:00 11:40:00 02N General Computational Biology Harnessing Deep Learning for Proteome-Scale Detection of Amyloid Signaling Motifs Witold Dyrka Krzysztof Pysz, Jakub Gałązka, Witold Dyrka Amyloid signaling sequences adopt the cross-β fold that is capable of self-replication in the templating process. Propagation of the amyloid fold from the receptor to the effector protein is used for signal transduction in the immune response pathways in animals, fungi and bacteria. So far, a dozen families of amyloid signaling motifs (ASMs) have been classified. Unfortunately, due to the wide variety of ASMs, it is difficult to identify them in the large protein databases available, which limits the possibility of conducting experimental studies. To date, various deep learning (DL) models have been applied across a range of protein-related tasks, including domain family classification and the prediction of protein structure and protein-protein interactions. In this study, we develop tailor-made bidirectional LSTM and BERT-based architectures to model ASM, and compare their performance against a state-of-the-art machine learning grammatical model. Our research is focused on developing a discriminative model of generalized amyloid signaling motifs, capable of detecting ASMs in large data sets. The DL-based models are trained on a diverse set of motif families and a global negative set, and used to identify ASMs from remotely related families. We analyze how both models represent the data and demonstrate that the DL-based approaches effectively detect ASMs, including novel motifs, even at the genome scale.
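A minimal bidirectional LSTM classifier in the spirit of the model described above (dimensions and pooling are illustrative; the authors' architecture and training setup will differ):

```python
import torch
import torch.nn as nn

class BiLSTMMotifClassifier(nn.Module):
    """Scores an amino-acid token sequence as ASM vs. non-ASM."""
    def __init__(self, vocab_size=25, embed_dim=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        x = self.embed(tokens)
        out, _ = self.lstm(x)                   # (batch, seq_len, 2*hidden)
        return self.head(out.mean(dim=1)).squeeze(-1)  # one logit per sequence

model = BiLSTMMotifClassifier()
logits = model(torch.randint(1, 25, (4, 120)))  # 4 toy sequences of length 120
```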
2025-07-21 11:40:00 12:00:00 02N General Computational Biology From High-Throughput Evaluation to Wet-Lab Studies: Advancing Mutation Effect Prediction with a Retrieval-Enhanced Model Bingxin Zhou Yang Tan, Ruilin Wang, Banghao Wu, Liang Hong, Bingxin Zhou Enzyme engineering is a critical approach for producing enzymes that meet industrial and research demands by modifying wild-type proteins to enhance properties such as catalytic activity and thermostability. Beyond traditional methods like directed evolution and rational design, recent advancements in deep learning offer cost-effective and high-performance alternatives. By encoding implicit coevolutionary patterns, these pre-trained models have become powerful tools for mutation effect prediction, with the central challenge being to uncover the intricate relationships among protein sequence, structure, and function. In this study, we present VenusREM, a retrieval-enhanced protein language model designed to capture local amino acid interactions across both spatial and temporal scales. VenusREM achieves state-of-the-art performance on 217 assays from the ProteinGym benchmark. Beyond high-throughput open benchmark validations, we conducted a low-throughput post-hoc analysis on more than 30 mutants to verify the model’s ability to improve the stability and binding affinity of a VHH antibody. We also validated the practical effectiveness of VenusREM by designing 10 novel mutants of a DNA polymerase and performing wet-lab experiments to evaluate their enhanced activity at elevated temperatures. Both in silico and experimental evaluations not only confirm the reliability of VenusREM as a computational tool for enzyme engineering but also demonstrate a comprehensive evaluation framework for future computational studies in mutation effect prediction. The implementation is publicly available at https://github.com/tyang816/VenusREM.
2025-07-21 12:00:00 12:20:00 02N General Computational Biology BE3D: A Computational Workflow for Integrative Structure-Function Analysis of Base-Editor Tiling Mutagenesis Data Yoochan Myung Yoochan Myung, Calvin Hu, Surya Mani, Annie Chen, Vivian Lu, Brian Liau, Guillaume Poncet-Montange, Gabriel Griffin, Sumaiya Iqbal Understanding functional consequences of single-nucleotide variants is critical for elucidating the genetic basis of diseases, yet current variant screening technologies have limitations. CRISPR base editors (BEs) efficiently generate transition mutations, enabling targeted variant screens. However, interpreting these screens in the context of protein structure-function relationships remains challenging due to technical constraints and biological variability. We introduce BE3D, an integrated workflow to systematically analyze BE tiling mutagenesis data within protein structural contexts. BE3D comprises three modules: (A) BE-QA, assessing screening quality based on biological hypotheses (e.g., knockout vs. neutral guides); (B) BE-Clust3D, identifying hits from BE screening with an expanded coverage using protein 3D structures and highlighting their clusters; and (C) BE-MetaClust3D, aggregating data from multiple screens, enhancing detection of functionally relevant sites across cell lines and species. Applying BE3D to published BE screens on DNMT3A and MEN1, we show that the BE-Clust3D method increased the coverage of functional residues by integrating structural data, yielding up to 3.5-fold improved detection of critical domains in DNMT3A and highlighting crucial drug-binding MEN1 residues (e.g., Met327, Trp346), inaccessible and unidentifiable by BEs due to PAM limitations. Meta-aggregation of MEN1 BE screen readouts from two cell lines (MOLM-13, MV4-11) using BE-MetaClust3D further emphasized a drug-resistant mutational hotspot, achieving a stronger drug-binding site enrichment (3.43-fold) compared to individual screens (average odds ratio 2.2). In summary, BE3D is an open-source, scalable tool for integrative structure-function analysis and interpretation of BE tiling mutagenesis data (Github: https://github.com/broadinstitute/beclust3d-public). BE3D is expected to accelerate variant-to-function investigation and the discovery of drug-targetable sites.
2025-07-21 12:20:00 12:40:00 02N General Computational Biology Enhanced protein evolution with inverse folding models using structural and evolutionary constraints Yunjia Li Yunjia Li, Hongyuan Fei, Caixia Gao Protein engineering enables artificial protein evolution through iterative sequence changes, but current methods often suffer from low success rates and limited cost-effectiveness. Here, we present AiCE (AI-informed Constraints for protein Engineering), an approach that facilitates efficient protein evolution using generic protein inverse folding models, reducing dependence on human heuristics and task-specific models. By sampling sequences from inverse folding models and integrating structural and evolutionary constraints, AiCE identifies high-fitness single- and multi-mutations. We applied AiCE to eight protein engineering tasks, including deaminases, a nuclear localization sequence, nucleases, and a reverse transcriptase, spanning proteins from tens to thousands of residues, with success rates of 11%-88%. We also developed base editors for precision medicine and agriculture, including enABE8e (5 bp window), enSdd6-CBE (1.3-fold improved fidelity), and enDdd1-DdCBE (up to 14.3-fold enhanced mitochondrial activity). These results demonstrate that AiCE is a versatile, user-friendly mutation-design method that outperforms conventional approaches in efficiency, scalability, and generalizability.
2025-07-21 12:40:00 13:00:00 02N General Computational Biology Precise Prediction of Hotspot Residues in Protein-RNA Complexes Using Graph Attention Networks and Pre-trained Protein Language Models Siyuan Shen Siyuan Shen, Jie Chen, Zhijian Huang, Yuanpeng Zhang, Ziyu Fan, Yuting Kong, Lei Deng Motivation: Protein-RNA interactions play a pivotal role in biological processes and disease mechanisms, with hotspot residues being critical for targeted drug design. Traditional experimental methods for identifying hotspot residues are often inefficient and expensive. Moreover, many existing prediction methods rely heavily on high-resolution structural data, which may not always be available. Consequently, there is an urgent need for an accurate and efficient sequence-based computational approach for predicting hotspot residues in protein-RNA complexes. Results: In this study, we introduce DeepHotResi, a sequence-based computational method designed to predict hotspot residues in protein-RNA complexes. DeepHotResi leverages a pre-trained protein language model to predict protein structure and generate an amino acid contact map. To enhance feature representation, DeepHotResi integrates the Squeeze-and-Excitation (SE) module, which processes diverse amino acid-level features. Next, it constructs an amino acid feature network from the contact map and SE-Module-derived features. Finally, DeepHotResi employs a Graph Attention Network (GAT) to model hotspot residue prediction as a graph node classification task. Experimental results demonstrate that DeepHotResi outperforms state-of-the-art methods, effectively identifying hotspot residues in protein-RNA complexes with superior accuracy on the test set.
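Framing hotspot prediction as graph node classification with a GAT can be sketched as follows (node and edge counts are placeholders, not DeepHotResi's actual configuration):

```python
import torch
from torch_geometric.nn import GATConv

class ResidueGAT(torch.nn.Module):
    """Two GAT layers over a residue contact graph; one logit pair
    (hotspot / non-hotspot) per residue node."""
    def __init__(self, in_dim=128, hidden=64, heads=4):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden, heads=heads)
        self.gat2 = GATConv(hidden * heads, 2)

    def forward(self, x, edge_index):
        h = torch.relu(self.gat1(x, edge_index))
        return self.gat2(h, edge_index)

x = torch.randn(200, 128)                     # 200 residues, 128 features each
edge_index = torch.randint(0, 200, (2, 600))  # edges from a contact map
logits = ResidueGAT()(x, edge_index)          # (200, 2) per-residue scores
```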
2025-07-21 14:00:00 14:20:00 02N General Computational Biology Trustworthy Causal Biomarker Discovery: A Multiomics Brain Imaging Genetics based Approach Jin Zhang Jin Zhang, Yan Yang, Muheng Shang, Lei Guo, Daoqiang Zhang Discovering genetic variations underpinning brain disorders is important to understand their pathogenesis. Indirect associations or spurious causal relationships pose a threat to the reliability of biomarker discovery for brain disorders, potentially misleading or incurring bias in subsequent decision-making. Unfortunately, the stringent selection of reliable biomarker candidates for brain disorders remains a predominantly unexplored challenge. In this paper, to fill this gap, we propose a fresh and powerful scheme, referred to as the Causality-aware Genotype intermediate Phenotype Correlation Approach (Ca-GPCA). Specifically, we design a bidirectional association learning framework, integrated with a parallel causal variable decorrelation module and sparse variable regularizer module, to identify trustworthy causal biomarkers. A disease diagnosis module is further incorporated to ensure accurate diagnosis and identification of causal effects for pathogenesis. Additionally, considering the large computational burden incurred by high-dimensional genotype-phenotype covariances, we develop a fast and efficient strategy to reduce the runtime and promote practical availability and applicability. Extensive experimental results on four simulated datasets and real neuroimaging genetic data clearly show that Ca-GPCA outperforms state-of-the-art methods with excellent built-in interpretability. This can provide novel and reliable insights into the underlying pathogenic mechanisms of brain disorders.
2025-07-21 14:20:00 14:40:00 02N General Computational Biology SVQ-MIL: Small-Cohort Whole Slide Image Classification via Split Vector Quantization Yao-Zhong Zhang Dawei Shen, Yao-Zhong Zhang, Keita Tamura, Yohei Okubo, Seiya Imoto Whole Slide Images (WSIs) are high-resolution digital scans of microscope slides that play important roles in pathological analysis. Recent advancements in deep learning have significantly improved WSI classification. However, challenges persist, particularly in small cohorts with limited training samples. Multiple Instance Learning (MIL) has emerged as a leading framework for WSI classification. In MIL, each WSI is divided into image tiles, and each tile is represented by an embedding generated by a pretrained vision foundation model. Nevertheless, these embeddings are general-purpose and typically exhibit high variability, rendering them suboptimal for specific classification tasks. In this study, we introduce SVQ-MIL, a generalized framework that leverages Split Vector Quantization (SVQ) with a learnable codebook to quantize instance embeddings. The learned codebook reduces embedding variability and abbreviates the input for the MIL model, making it advantageous for small-cohort datasets. Additionally, SVQ-MIL enhances model interpretability by providing a profile of the WSI instances through the learned codebook. Experimental evaluations demonstrate that SVQ-MIL achieves competitive performance compared with state-of-the-art methods on two benchmark datasets. The source code is available at https://github.com/aCoalBall/SVQMIL.
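The split vector-quantization step at the heart of SVQ can be sketched as below (sizes are illustrative, and the commitment/codebook losses of a full VQ objective are omitted):

```python
import torch

def split_vector_quantize(embeddings, codebook, num_splits=4):
    """Split each instance embedding into sub-vectors, replace each by its
    nearest learnable codeword, and use the straight-through estimator so
    gradients still reach the encoder."""
    n, d = embeddings.shape                   # codebook: (codes, d // num_splits)
    parts = embeddings.view(n * num_splits, d // num_splits)
    idx = torch.cdist(parts, codebook).argmin(dim=1)   # nearest codeword per part
    quantized = codebook[idx].view(n, d)
    return embeddings + (quantized - embeddings).detach()

emb = torch.randn(32, 256, requires_grad=True)    # 32 tile embeddings
codebook = torch.nn.Parameter(torch.randn(512, 64))
quantized = split_vector_quantize(emb, codebook)  # same shape, reduced variability
```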
2025-07-21 14:40:00 15:00:00 02N General Computational Biology Genetic Confounding and Comorbidity: Re-evaluating Causal Inference in Disease Associations Hadasa Kaufman Hadasa Kaufman, Nadav Rapoport, Michal Linial Comorbidity analyses indicate that ~35% of common disease pairs tend to occur sequentially within the same individual. Understanding whether this comorbidity pairing reflects a causal relationship or results from shared (often unknown) external factors is crucial for clinical decisions. Although causal inference methods are increasingly used in clinical research, most methods fail to incorporate genetic information, despite the well-documented pleiotropy of single-nucleotide polymorphisms (SNPs). Herein, we develop methodologies aimed at addressing this knowledge gap. In an extensive analysis of 440×440 disease pairs (with ≥500 cases each) from the UK Biobank (UKB), we found that approximately 58% of disease pairs share at least one associated SNP. We compared and evaluated two complementary approaches for addressing genetic confounding effects. In the first scheme (coined EXPO for Exclude Population), we removed all individuals that displayed shared associated SNPs for both diseases. When EXPO was applied to the 440×440 disease pairs, the method showed a significant shift in p-value distributions (p-value 6e-4), but it did not yield pairs for which the elimination of residual genetic signals could be confirmed. The second approach relied on a propensity score matching (PSM) protocol to balance genetic risk between matched groups. In a pilot test of 5×5 abundant disease pairs, we combined the PSM with polygenic risk scores (PRS). The PRS-PSM and classical PSM yielded consistent results in 80% of cases, and for another 15%, significant results were confirmed only in PRS-PSM. These findings suggest that incorporating genetic information via PRS-PSM will enhance genetic interpretation and help validate genuine causal relationships between disease outcomes.
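The PRS-PSM idea can be sketched with standard tooling (simulated placeholder data; the UKB covariates, matching scheme, and model details are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n = 1000
covars = rng.normal(size=(n, 3))                   # e.g. age, sex, BMI (toy)
prs = rng.normal(size=(n, 1))                      # polygenic risk score
exposed = rng.integers(0, 2, size=n).astype(bool)  # first-disease status

# Propensity for the first disease from covariates plus PRS, then
# 1:1 nearest-neighbor matching of exposed to unexposed individuals.
X = np.hstack([covars, prs])
propensity = LogisticRegression().fit(X, exposed).predict_proba(X)[:, 1]
nn = NearestNeighbors(n_neighbors=1).fit(propensity[~exposed, None])
_, match_idx = nn.kneighbors(propensity[exposed, None])
controls = np.flatnonzero(~exposed)[match_idx.ravel()]  # matched control indices
```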
2025-07-21 15:00:00 15:20:00 02N General Computational Biology Sparse modeling of interactions enables fast detection of genome-wide epistasis in biobank-scale studies Julian Stamp Julian Stamp, Samuel Pattillo Smith, Daniel Weinreich, Lorin Crawford The lack of computational methods capable of detecting epistasis in biobanks has led to uncertainty about the role of non-additive genetic effects on complex trait variation. The marginal epistasis framework is a powerful approach because it estimates the likelihood of a SNP being involved in any interaction, thereby reducing the multiple testing burden. Current implementations of this approach have failed to scale to large human studies. To address this, we present the sparse marginal epistasis (SME) test, which concentrates the scans for epistasis to regions of the genome that have known functional enrichment for a trait of interest. By leveraging the sparse nature of this modeling setup, we develop a novel statistical algorithm that allows SME to run 10 to 90 times faster than state-of-the-art epistatic mapping methods. In a study of blood traits measured in 349,411 individuals from the UK Biobank, we show that restricting epistasis scans to variants in accessible chromatin regions facilitates the identification of genetic interactions associated with regulatory genomic elements.
2025-07-21 15:20:00 15:40:00 02N General Computational Biology Pan-cancer analysis in the real-world setting uncovers immunogenomic drivers of acquired resistance post-immunotherapy Mohamed Reda Keddar Mohamed Reda Keddar, Martin Miller Immune checkpoint blockade (ICB) has transformed cancer care, providing long-lasting benefit to patients across various cancer types. However, >80% of patients fail to respond to ICB (primary resistant) or eventually develop resistance after initial clinical benefit (acquired resistant). Due to difficulty in accessing post-progression clinical samples, remarkably little is known about which immunogenomic features emerge as patients progress on therapy. Here, we use the Tempus AI real-world clinicogenomic database to build a pan-cancer and multimodal dataset of clinical and pre/post-treatment RNA/DNA-seq data from >5,000 patients across NSCLC, HNC, and TNBC. Using a systematic bioinformatics approach, we characterise and compare the clinical and molecular features of acquired vs. primary resistant patients in the post-progression setting. We find that acquired resistant patients consistently derive an ICB-specific prognostic advantage, as they survive significantly longer than their primary counterparts even after progressing. At the molecular level, acquired resistant tumours show a universally inflamed tumour microenvironment (TME) post-progression, specifically maintained or induced by ICB. Using dN/dS to evaluate mutation selection from pre- to post-treatment, we identify ICB-specific mutations selected for post-acquired resistance. These mutations were involved in functionally-relevant molecular processes, including loss of antigen processing and presentation, dysregulated metabolism, and putative immune escape via oncogenic signalling pathways. Altogether, our analysis of post-progression samples mapped out the molecular underpinnings of acquired vs. primary ICB resistance and offers an opportunity for improved patient selection strategies and positioning of next-generation immunotherapies to re-activate an effective anti-tumour response and optimise outcomes.
2025-07-21 15:40:00 16:00:00 02N General Computational Biology Beyond Mutation Frequency: A Bayesian Framework for Identifying Functional Cancer Drivers from Single-Cell Data Komlan Atitey Komlan Atitey, Benedict Anchang Cancer is driven by genetic alterations, especially gain-of-function mutations in oncogenes (OGs) and loss-of-function mutations in tumor suppressor genes (TSGs). Traditional approaches to identifying cancer driver genes (CDGs) rely heavily on mutation frequency across patient cohorts. While effective at detecting common drivers, these methods often miss rare but functionally significant mutations, and they struggle with the complexity introduced by tumor heterogeneity. To address these limitations, we present PICDGI (Predict Immunosuppressive Cancer Driver Genes using gene-gene Interaction features), a Bayesian framework that integrates time-series single-cell RNA sequencing (scRNA-seq) data with gene-gene interaction dynamics. PICDGI moves beyond mutation frequency by modeling gene regulatory influence and functional impact within evolving tumor cell populations. PICDGI begins by identifying cancer progenitor cells across tumor stages and reconstructs gene expression trajectories during tumor development. It then uses variational Bayesian inference to infer dynamic gene interaction networks and introduces the gene driver coefficient, a novel metric that quantifies each gene’s regulatory influence on downstream targets. This enables the identification of both known and previously unrecognized driver genes based on their functional roles in tumor progression and immune evasion. When applied to scRNA-seq data from nine samples across three lung adenocarcinoma (LUAD) patients, PICDGI successfully recovered established OGs and TSGs (62%) and revealed novel candidate drivers (38%) with strong expression patterns and relevance to tumor evolution, as confirmed by Moran’s I test. Overall, PICDGI provides a biologically grounded, interaction-driven strategy for identifying functional cancer drivers from single-cell data, offering a powerful tool for advancing personalized cancer genomics.
2025-07-21 16:40:00 17:00:00 02N General Computational Biology AdaGenes: A streaming processor for high-throughput annotation and filtering of sequence variant data Nadine S. Kurz Nadine S. Kurz, Klara Drofenik, Kevin Kornrumpf, Kirsten Reuter-Jessen, Jürgen Dönitz The amount of sequencing data resulting from whole exome or genome sequencing (WES / WGS) presents challenges for annotation, filtering, and analysis. We introduce the Adaptive Genes processor (AdaGenes), a sequence variant streaming processor designed to efficiently annotate, filter, LiftOver, and transform large-scale VCF files. AdaGenes provides a unified solution for researchers to streamline VCF processing workflows and address common challenges in genomic data processing, e.g., filtering out irrelevant variants so that further processing can focus on the relevant positions. AdaGenes integrates genomic, transcript and protein data annotations, while maintaining scalability and performance for high-throughput workflows. Leveraging a streaming architecture, AdaGenes processes variant data incrementally, enabling high performance on large files due to low memory consumption and seamless handling of whole genome files. The interactive front end lets users dynamically filter variants based on user-defined criteria. It allows researchers and clinicians to efficiently analyze large genomic datasets, facilitating variant interpretation in diverse genomics applications, such as population studies, clinical diagnostics, and precision medicine. AdaGenes is able to parse and convert multiple file formats while preserving metadata, and provides a report of the changes made to the variant file. AdaGenes is available at https://mtb.bioinf.med.uni-goettingen.de/adagenes.
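The streaming pattern described above keeps memory proportional to one record rather than the whole file; a minimal stand-alone illustration follows (the QUAL cutoff is an arbitrary example, not an AdaGenes API):

```python
def stream_filter_vcf(in_path, out_path, min_qual=30.0):
    """Read, filter, and write VCF records one line at a time."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith("#"):       # preserve all header/metadata lines
                dst.write(line)
                continue
            fields = line.rstrip("\n").split("\t")
            qual = fields[5]               # VCF column 6: QUAL
            if qual != "." and float(qual) >= min_qual:
                dst.write(line)

# stream_filter_vcf("sample.vcf", "filtered.vcf", min_qual=30.0)
```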
2025-07-21 17:00:00 17:20:00 02N General Computational Biology Comprehensive framework for assessing discrepancies in genomic content and species-level annotations across microbial reference genomes Serghei Mangul Grigore Boldirev, Mohammed Alser, Peace Aguma, Viorel Munteanu, Mihai Dimian, Alex Zelikovsky, Serghei Mangul Metagenomics research provides insights into the composition, diversity, and functions of microbial communities in various environments. To identify bacterial species, sequencing reads from samples are typically mapped to reference genomes found in bacterial reference databases. However, multiple references may share the same taxonomic identifiers while containing different genomic information, which can lead to inconsistencies in downstream analyses. We have developed a novel comprehensive framework for assessing discrepancies in genomic content and species-level annotations across microbial reference genomes, and applied it to evaluate the two most widely used bacterial reference databases: PATRIC and RefSeq. NCBI’s taxonomic identifiers were used to assess the agreement between databases at the species level. Species found in both databases were identified by matching taxIDs. To compare genomic representation, the BLAST tool was used to align all contigs from one database to all contigs of the corresponding strain in the other database. This analysis was extended to all overlapping species where strain-level information was available. The study revealed substantial discrepancies between databases. Among single-contig genomes, 85.5% exhibited 100% genomic similarity, 14.4% demonstrated an average similarity of 94.3%, and 17 genomes showed less than 75% similarity. For genomes with 2–10 contigs, 82.6% had 100% similarity, 17% averaged 94.79% similarity, and 128 genomes fell below the 75% threshold. Our results emphasize significant variability in genome representation across reference databases, especially for multi-contig genomes. Our framework will provide a foundation for building a more consistent and comprehensive reference database, which will improve the accuracy, rigor, and reproducibility of metagenomics research.
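The genomic-content comparison step lends itself to a short sketch (a simplified reading of the described analysis, not the authors' pipeline; the file name and the genome-ID convention in query IDs are assumptions): given BLAST tabular output (-outfmt 6) from aligning one database's contigs against the other's, an alignment-length-weighted percent identity can be summarized per genome.

```python
import csv
from collections import defaultdict

def weighted_identity(blast_tsv):
    """Alignment-length-weighted percent identity per query genome from BLAST -outfmt 6.
    Assumes query IDs carry a genome prefix, e.g. 'genomeA|contig1' (a hypothetical
    naming convention), and that one database's contigs were aligned to the other's."""
    num = defaultdict(float)   # sum of pident * alignment length
    den = defaultdict(float)   # sum of alignment lengths
    with open(blast_tsv) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            genome = row[0].split("|")[0]
            pident, length = float(row[2]), int(row[3])   # outfmt 6: columns 3 and 4
            num[genome] += pident * length
            den[genome] += length
    return {g: num[g] / den[g] for g in num}

# similarities = weighted_identity("patric_vs_refseq.tsv")  # hypothetical file
```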
2025-07-21 17:20:00 17:40:00 02N General Computational Biology Building Ultralarge Pangenomes Using Scalable and Compressive Techniques Sumit Walia Sumit Walia, Harsh Motwani, Yu-Hsiang Tseng, Kyle Smith, Russell Corbett-Detig, Yatish Turakhia Pangenomics studies intra-species genetic diversity by analyzing collections of genomes from the same species. As pangenomics scales to millions of sequences, efficient data formats become crucial to enabling future applications and ensuring efficient computational and memory performance for pangenomic analysis. Current pangenomic formats primarily store variation across genomes but fail to capture shared evolutionary and mutational histories, limiting their applicability. They also face scalability issues due to storage and computational inefficiencies. To address these limitations, we present PanMAN (Pangenome Mutation-Annotated Network), a novel pangenomic format that is the most compact, scalable, and information-rich among all variation-preserving formats. PanMAN encodes not only genome alignments and variations but also shared mutational and evolutionary histories inferred across genomes, making it the first format to unify multiple whole-genome alignment, phylogeny, and mutational histories into a single unified framework. By leveraging "evolutionary compression," PanMAN achieves 3.5X to 1391X compression over other formats (GFA, VG, GBZ, PanGraph, AGC, and tskit) across microbial datasets. To demonstrate scalability, we built the largest pangenome in terms of number of sequences —a PanMAN with 8 million SARS-CoV-2 genomes—requiring just 366MB of disk space. Using SARS-CoV-2 as a case study, we show that PanMAN offers a detailed and accurate portrayal of the pathogen's evolutionary and mutational history, facilitating the discovery of new biological insights. We also present panmanUtils, a software toolkit for constructing, analyzing, and integrating PanMANs with existing pangenomic workflows. PanMANs are poised to enhance the scale, speed, resolution, and overall scope of pangenomic analyses and data sharing.
2025-07-21 17:40:00 18:00:00 02N General Computational Biology Information Content as a metric to evaluate and compare DNA Language Models Melissa Sanabria Melissa Sanabria, Anna R. Poetsch Large language models have transformed the field of natural language processing by enabling the generation of coherent and meaningful text. This success has inspired researchers to apply similar approaches to biological sequences, particularly DNA, where the underlying "language" of the genome holds biological insight. DNA language models, such as GROVER, offer a promising avenue for advancing genomic analysis. Despite their potential, evaluating and comparing these models remains a significant challenge. Existing metrics often rely on genome-specific motifs, biological annotations, or the number of parameters used during model training. These limitations make it difficult to perform consistent and generalizable assessments across different models or genomic contexts. We propose the use of entropy and information content as general-purpose metrics to evaluate DNA language models. By computing these measures over whole-genome predictions, we can quantify how much information the model captures during training. This allows us to compare not only between different types of genomic elements and regions—such as coding vs. non-coding sequences or promoters vs. intergenic regions—but also across different versions of the human genome. We also introduce a set of pretrained DNA language models for three major human genome builds: hg19, hg38, and telomere-to-telomere (T2T). Our analysis reveals that, although T2T includes a substantially greater proportion of repetitive sequences, this increase does not adversely affect the information content observed in other genomic regions. Our approach provides a more interpretable and genome-agnostic framework for evaluating DNA language models and offers new insights into how different genome assemblies influence model learning and performance.
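The metric itself is straightforward to compute once a model emits per-position token probabilities; the following toy sketch (assuming generic inputs, not GROVER's API) averages the information content of the true token and the predictive entropy over positions, in bits.

```python
import numpy as np

def information_metrics(probs, true_idx):
    """probs: (N, V) per-position token probabilities from a DNA language model;
    true_idx: (N,) indices of the observed tokens. Returns the mean information
    content of the true token, -log2 p(true), and the mean predictive entropy (bits)."""
    p = np.clip(probs, 1e-12, 1.0)
    info = -np.log2(p[np.arange(len(true_idx)), true_idx]).mean()
    entropy = -(p * np.log2(p)).sum(axis=1).mean()
    return info, entropy

# Toy example: 3 positions over the alphabet (A, C, G, T)
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.25, 0.25, 0.25, 0.25],
                  [0.05, 0.05, 0.85, 0.05]])
print(information_metrics(probs, np.array([0, 2, 2])))
```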
2025-07-24 08:40:00 09:00:00 03A General Computational Biology ADME-Drug-Likeness: Enriching Molecular Foundation Models via Pharmacokinetics-Guided Multi-Task Learning for Drug-likeness Prediction Dongmin Bang Dongmin Bang, Juyeon Kim, Haerin Song, Sun Kim Recent breakthroughs in AI-driven generative models enable the rapid design of extensive molecular libraries, creating an urgent need for fast and accurate drug-likeness evaluation. Traditional approaches, however, rely heavily on structural descriptors and overlook pharmacokinetic (PK) factors such as absorption, distribution, metabolism, and excretion (ADME). Furthermore, existing deep-learning models neglect the complex interdependencies among ADME tasks, which play a pivotal role in determining clinical viability. We introduce ADME-DL (drug likeness), a novel two-step pipeline that first enhances a diverse range of Molecular Foundation Models (MFMs) via sequential ADME multi-task learning. By enforcing an A→D→M→E flow—grounded in a data-driven task dependency analysis that aligns with established pharmacokinetic principles—our method more accurately encodes PK information into the learned embedding space. In the second step, the resulting ADME-informed embeddings are leveraged for drug-likeness classification, distinguishing approved drugs from negative sets drawn from chemical libraries. Through comprehensive experiments, our sequential ADME multi-task learning achieves up to a +2.4% improvement over state-of-the-art baselines and enhances performance across the tested MFMs by up to +18.2%. Case studies with clinically annotated drugs validate that respecting the PK hierarchy produces more relevant predictions, reflecting drug discovery phases. These findings underscore the potential of ADME-DL to significantly enhance the early-stage filtering of candidate molecules, bridging the gap between purely structural screening methods and PK-aware modeling.
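The sequential A→D→M→E scheme can be sketched as follows (a toy illustration under stated assumptions: a stand-in encoder and random tensors replace a real molecular foundation model and ADME datasets; this is not the ADME-DL code). Each task is fine-tuned in turn on top of the shared encoder, so later tasks inherit representations shaped by earlier ones.

```python
import torch
import torch.nn as nn

# Minimal sketch of sequential A->D->M->E fine-tuning on a shared encoder.
encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU())  # placeholder for a molecular foundation model
heads = {t: nn.Linear(64, 1) for t in "ADME"}           # one task head per ADME property

x = torch.randn(256, 128)                                # toy molecular embeddings
y = {t: torch.randn(256, 1) for t in "ADME"}             # toy task labels

for task in "ADME":                                      # enforce the A->D->M->E ordering
    params = list(encoder.parameters()) + list(heads[task].parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    for _ in range(10):                                  # a few steps per task, illustration only
        opt.zero_grad()
        loss = nn.functional.mse_loss(heads[task](encoder(x)), y[task])
        loss.backward()
        opt.step()

adme_embeddings = encoder(x).detach()                    # ADME-informed features for drug-likeness
```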
2025-07-24 09:00:00 09:20:00 03A General Computational Biology Understanding the Sources of Performance in Deep Drug Response Models Reveals Insights and Improvements Nikhil Branson Nikhil Branson, Pedro Rodriguez Cutillas, Conrad Bessant Anti-cancer drug response prediction (DRP) using cancer cell lines (CLs) is crucial in stratified medicine and drug discovery. Recently, new deep learning models for DRP have improved performance over their predecessors. However, different models use different input data types and architectures, making it hard to pinpoint the source of these improvements. Here we consider published DRP models that report state-of-the-art performance in predicting continuous response values. These models take chemical structures of drugs and omics profiles of CLs as input. By experimenting with these models and comparing them with our simple benchmarks, we show that none of their performance comes from the drug features; instead, it is due to the transcriptomic CL profiles. Furthermore, we show that, depending on the testing type, much of the currently reported performance is a property of the training target values. We address these limitations by creating BinaryET and BinaryCB, which predict binary drug response values, guided by the hypothesis that this reduces the noise in the drug efficacy data and thus better aligns the task with biochemistry that can be learnt from the input data. BinaryCB leverages a chemical foundation model, while BinaryET is trained from scratch using a transformer-type architecture. We show that these models learn useful chemical drug features, which to our knowledge is the first time this has been demonstrated across multiple testing types, and that binarising the drug response values is what causes the models to learn them. We also show that BinaryET improves performance over BinaryCB and over the published models that report state-of-the-art performance.
2025-07-24 09:20:00 09:40:00 03A General Computational Biology FACT: Feature Aggregation and Convolution with Transformers for predicting drug classification code Gwang-Hyeon Yun Gwang-Hyeon Yun, Jong-Hoon Park, Young-Rae Cho Motivation: Drug repositioning, identifying new therapeutic applications for existing drugs, can significantly reduce the time and cost involved in drug development. Recent studies have explored the use of Anatomical Therapeutic Chemical (ATC) codes in drug repositioning, offering a systematic framework to predict ATC codes for a drug. The ATC classification system organizes drugs according to their chemical properties, pharmacological actions, and therapeutic effects. However, its complex hierarchical structure and the limited scalability at higher levels present significant challenges for achieving accurate ATC code prediction. Results: We propose a novel approach to predict ATC codes of drugs, named Feature Aggregation and Convolution with Transformer models (FACT). This method computes three types of drug similarities, incorporating ATC code similarity with hierarchical weights and masked drug-ATC code associations. These features are then aggregated for each target drug-ATC code pair and processed through a convolution-transformer encoder to generate three embeddings. The embeddings are finally used to estimate the probability of an association between the target pair. The experimental results demonstrate that the proposed method achieves an AUROC of 0.9805 and an AUPRC of 0.9770 at level 4 of the ATC codes, outperforming the previous methods by 15.05% and 18.42%, respectively. This study highlights the effectiveness of integrating diverse drug features and the potential of transformer-based models in ATC code prediction.
2025-07-24 09:40:00 10:00:00 03A General Computational Biology Efficient 3D kernels for molecular property prediction Ankit Ankit, Sahely Bhadra, Juho Rousu This paper addresses the challenge of incorporating 3-dimensional structural information in graph kernels for machine learning-based virtual screening, a crucial task in drug discovery. Existing kernels that capture 3D information often suffer from high computational complexity, which limits their scalability. To overcome this, we propose the 3-dimensional chain motif graph kernel (c-MGK), which effectively integrates essential 3D structural properties—bond length, bond angle, and torsion angle—within the three-hop neighborhood of each atom in a molecule. In addition, we introduce a more computationally efficient variant, the 3-dimensional graph hopper kernel (3DGHK), which reduces the complexity from the state-of-the-art $\mathcal{O}(n^{6})$ (for the 3D pharmacophore kernel) to $\mathcal{O}(n^{2}(m + \log(n) + \delta^{2} + dT^{6}))$. Here, $n$ is the number of nodes, $T$ is the maximum node degree, $m$ is the number of edges, $\delta$ is the diameter of the graph, and $d$ is the dimension of the node attributes. We conducted experiments on 21 datasets, demonstrating that 3DGHK not only outperforms state-of-the-art 2D and 3D graph kernels, but also surpasses deep learning models in classification accuracy, offering a powerful and scalable solution for virtual screening tasks.
2025-07-24 11:20:00 11:40:00 03A General Computational Biology Haplotype-specific copy number profiling of cancer genomes from long reads sequencing data Tanveer Ahmad Tanveer Ahmad, Ayse Keskus, Mikhail Kolmogorov, Sergey Aganezov, Midhat Farooqi, Anton Goretsky, Ataberk Donmez, Michael Dean Attached as PDF
2025-07-24 11:40:00 12:00:00 03A General Computational Biology Multi-omics and liquid biopsy profiling of rapid autopsies reveals evolutionary dynamics and heterogeneity in metastatic bladder cancer Pushpa Itagi Pushpa Itagi, Gavin Ha, Andrew Hsieh, Hung-Ming Lam, Samantha Schuster, Sonali Arora, Patricia Galipeau The extensive molecular, transcriptomic and genomic complexity of metastatic bladder cancer (mBLCA) significantly complicates clinical management. Approximately 75% of mBLCA cases are conventional urothelial carcinoma, while 25% display variant histologies, which have a poorer prognosis. We characterized heterogeneity and clonal evolution in a rapid autopsy cohort of 20 patients using tumor tissues, matched normal samples, and cell-free DNA (cfDNA). Clonal evolution and metastatic seeding and migration patterns were inferred from mutation data for all patients. We used COSMIC signatures to link mutation profiles to histological and clinical features for various subtypes. Custom approaches and frameworks were developed for analyzing mutations, copy number alterations (CNAs), and structural variants (SVs) in tumors and cfDNA. Mutational clonal evolution analyses and RNA-seq highlighted cisplatin resistance in the plasmacytoid urothelial carcinoma (PUC) subtype, driven by enhanced DNA damage response pathways. Most patients showed significant heterogeneity in mutations (~20–30% subclonal) and in CNAs/SVs (>40% subclonal), potentially driving therapy resistance and elevating tumor heterogeneity. cfDNA detected about 90% of founder, 85% of shared, and 25% of private mutations from matched tumors. Nucleosome profiling from cfDNA differentiated mBLCA from healthy controls and identified variant-specific transcription factors that are active in mBLCA. Integrating multi-omics with cfDNA effectively captures intra-patient and inter-patient tumor heterogeneity, providing a comprehensive view of clonal dynamics. Insights and findings from this work pave the way for targeted therapies against evolving tumor clones and offer strategies to overcome resistance mechanisms in mBLCA.
2025-07-24 12:00:00 12:20:00 03A General Computational Biology Using spatial transcriptomics to elucidate the primary site of Cancers of Unknown Primary (CUPs) Oscar González Velasco Oscar González Velasco, Siao-Han Wong, Marta Casado, Veronica Davalos, Javier De Las Rivas Sanz, Manell Steller, Benedikt Brors Cancers of unknown primary (CUP) are a challenging group of poorly differentiated metastatic cancers for which, owing to their nature, limited treatment options are available, resulting in a poor prognosis and overall survival. Recently, novel predictive models to characterize CUP patients showed encouraging results and suggested relevant therapeutic interventions, yet lacked the consistency and interpretability to be widely adopted in clinical care. We have developed a state-of-the-art convolutional neural network (CNN) using bulk RNA-Seq gene expression and prior knowledge in the form of curated gene signatures of transcription factors and their associated gene targets. The training corpus consists of more than 27,000 samples from cancer patients and healthy donors, targeting 28 primary sites. The model displayed an accuracy of 97.17% on validation data at predicting the primary sites. Additionally, we analyzed 40 spatial transcriptomics samples from a wide range of known primary sites, including distant metastases, and unambiguously located the correct primary site in 39 of them. We also analyzed 20 novel CUP spatial transcriptomics samples. Results show that, by using annotations from pathologists, our suggested primaries could help identify plausible origins, yielding coherent results (in contrast with the homing tissue) for those cases without any clinically derived hypothesis. By identifying the true primary site of metastatic CUPs, we hope to provide future clinical benefits from site-specific therapies, opening the possibility for many existing treatment options.
2025-07-24 12:20:00 12:40:00 03A General Computational Biology Inherited genetic risk factors associated with young adult versus late-onset lung cancers Zeynep H. Gümüş Myvizhi Esai Selvan, Robert J. Klein, Zeynep H. Gümüş Genetics plays a key role in lung cancer risk. While lung cancer primarily affects older adults, incidence among young adults is increasing. However, whether the germline genetics differs between young adults (<45 years) and older lung cancer patients (≥45 years) remains unclear. We performed whole-genome sequencing on 171 predominantly young lung cancer patients and integrated germline whole-exome sequencing datasets from existing lung cancer cohorts and biobanks, totaling 9,065 participants—the largest analysis of lung cancer patients to date, with 186 young adults and 6,359 older cases after sample QC. We compared the prevalence of rare pathogenic and likely pathogenic (P/LP) variants in cancer-related genes and 33,591 pathways from the Human Molecular Signatures Database (MSigDB) between the two age groups using Fisher’s exact test, accounting for histology, gender and smoking status. Young adult lung cancer patients harbored significantly more rare P/LP variants in DNA damage response genes compared to older patients, especially in lung squamous cell carcinoma patients and females. This association persisted in lung adenocarcinoma patients after controlling for smoking status. Young adult patients showed enrichment of rare P/LP variants in cancer driver, Fanconi Anemia and complement pathway genes. Notably, rare P/LP variants in BRIP1, ERCC6 and MSH5 were significantly more prevalent in young adult patients. Our results demonstrate that the inherited genetics of early-onset lung cancer differs significantly from late-onset lung cancer. These findings can inform age-specific risk assessment and guide precision prevention, screening and targeted treatment strategies for young adult individuals harboring these variants.
2025-07-24 12:40:00 13:00:00 03A General Computational Biology pC-SAC: Method for High-Resolution 3D Genome Reconstruction from Low-Resolution Hi-C Data Carlos Angel Carlos Angel, Narjis El Amraoui, Gamze Gürsoy The three-dimensional (3D) organization of the genome is crucial for gene regulation, with disruptions linked to various diseases. High-throughput Chromosome Conformation Capture (Hi-C) and related technologies have advanced our understanding of genome architecture by mapping interactions between distant genomic regions. However, capturing enhancer-promoter interactions at high resolution remains challenging due to the high sequencing depth required. We introduce pC-SAC (probabilistically Constrained-Self-Avoiding-Chromatin), a novel computational method for producing accurate high-resolution Hi-C matrices from low-resolution data. pC-SAC uses adaptive importance sampling with sequential Monte Carlo to generate ensembles of 3D chromatin chains that satisfy physical constraints derived from low-resolution Hi-C data. Our method achieves over 95% accuracy in reconstructing high-resolution chromatin maps and identifies novel interactions enriched with candidate cis-regulatory elements (cCREs) and expression Quantitative Trait Loci (eQTLs). Benchmarking against state-of-the-art deep learning models demonstrates pC-SAC's superior performance in both short- and long-range interaction reconstruction. pC-SAC offers a cost-effective solution for enhancing the resolution of Hi-C data, thus enabling deeper insights into genome organization and its role in gene regulation and disease. Our tool can be found at https://github.com/G2Lab/pCSAC.
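The underlying sampling principle, growing self-avoiding chains monomer by monomer with importance weights and resampling, can be shown in a toy 2D version (an illustration of sequential Monte Carlo for self-avoiding chains only; pC-SAC additionally constrains chains with probabilities derived from low-resolution Hi-C).

```python
import random

def grow_chains(n_steps=20, n_particles=200, seed=0):
    """Toy sequential Monte Carlo: grow 2D self-avoiding chains one monomer at a time.
    Weights count the surviving extensions (Rosenbluth-style); chains are resampled
    each step so computation concentrates on viable conformations."""
    random.seed(seed)
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    chains = [[(0, 0)] for _ in range(n_particles)]
    for _ in range(n_steps):
        weights = []
        for chain in chains:
            x, y = chain[-1]
            occupied = set(chain)
            free = [(x + dx, y + dy) for dx, dy in moves if (x + dx, y + dy) not in occupied]
            if free:
                chain.append(random.choice(free))
            weights.append(float(len(free)))      # importance weight; 0 = dead end
        if sum(weights) == 0:
            break
        # Multinomial resampling proportional to the importance weights
        chains = [list(c) for c in random.choices(chains, weights=weights, k=n_particles)]
    return chains

chains = grow_chains()
```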
2025-07-24 14:00:00 14:20:00 03A General Computational Biology HIDE: Hierarchical cell-type Deconvolution Franziska Görtler Dennis Völkl, Malte Mensching-Buhr, Thomas Sterr, Sarah Bolz, Andreas Schäfer, Nicole Seifert, Jana Tauschke, Austin Rayford, Oddbjørn Straume, Helena U. Zacharias, Sushma Nagaraja Grellscheid, Tim Beissbarth, Michael Altenbuchinger, Franziska Görtler Motivation: Cell-type deconvolution is a computational approach to infer cellular distributions from bulk transcriptomics data. Several methods have been proposed, each with its own advantages and disadvantages. Reference-based approaches make use of archetypic transcriptomic profiles representing individual cell types. Those reference profiles are ideally chosen such that the observed bulks can be reconstructed as a linear combination thereof. This strategy, however, ignores the fact that cellular populations arise through the process of cellular differentiation, which entails the gradual emergence of cell groups with diverse morphological and functional characteristics. Results: Here, we propose Hierarchical cell-type Deconvolution (HIDE), a cell-type deconvolution approach which incorporates a cell hierarchy for improved performance and interpretability. This is achieved by a hierarchical procedure that preserves estimates of major cell populations while inferring their respective subpopulations. We show in simulation studies that this procedure produces more reliable and more consistent results than other state-of-the-art approaches. Finally, we provide an example application of HIDE to explore breast cancer specimens from TCGA. Availability: A Python implementation of HIDE is available at zenodo: doi:10.5281/zenodo.14724906.
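The hierarchical procedure can be sketched in a few lines (a minimal sketch assuming known reference profiles and NNLS as the per-stage solver; not the HIDE implementation): major cell-type proportions are estimated first and then distributed among each major type's subtypes, so the major-level estimates are preserved by construction.

```python
import numpy as np
from scipy.optimize import nnls

def hierarchical_deconvolution(bulk, major_refs, sub_refs):
    """Two-stage sketch: estimate major cell-type proportions, then split each major
    proportion among its subtypes. Assumes the dict order of sub_refs matches the
    columns of major_refs (an illustrative convention)."""
    major_w, _ = nnls(major_refs, bulk)              # stage 1: major populations
    major_w = major_w / major_w.sum()
    fractions = {}
    for j, (major, refs) in enumerate(sub_refs.items()):
        sub_w, _ = nnls(refs, bulk)                  # stage 2: within-major split
        if sub_w.sum() > 0:
            sub_w = sub_w / sub_w.sum()
        for k, w in enumerate(sub_w):
            fractions[(major, k)] = major_w[j] * w   # major-level estimate is preserved
    return major_w, fractions

# Toy example: 100 genes, 2 major types with 2 subtypes each
rng = np.random.default_rng(0)
major_refs = rng.random((100, 2))
sub_refs = {"T_cell": rng.random((100, 2)), "myeloid": rng.random((100, 2))}
bulk = 0.6 * major_refs[:, 0] + 0.4 * major_refs[:, 1]
print(hierarchical_deconvolution(bulk, major_refs, sub_refs))
```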
2025-07-24 14:20:00 14:40:00 03A General Computational Biology RVINN: A Flexible Modeling for Inferring Dynamic Transcriptional and Post-Transcriptional Regulation Using Physics-Informed Neural Networks Osamu Muto Osamu Muto, Zhongliang Guo, Rui Yamaguchi Dynamic gene expression is controlled by transcriptional and post-transcriptional regulation. Recent studies on transcriptional bursting and buffering have increasingly highlighted the dynamic gene regulatory mechanisms. However, direct measurement techniques still face various constraints and require complementary methodologies, which are both comprehensive and versatile. To address this issue, inference approaches based on transcriptome data and differential equation models representing the messenger RNA lifecycle have been proposed. However, the inference of complex dynamics under diverse experimental conditions and biological scenarios remains challenging. In this study, we developed a flexible modeling framework using Physics-Informed Neural Networks and demonstrated its performance using simulation and experimental data. Our model can computationally revalidate and visualize dynamic biological phenomena, such as transcriptional ripple, co-bursting, and buffering in a breast cancer cell line. Furthermore, our results suggest putative molecular mechanisms underlying these phenomena. We propose a novel approach for inferring transcriptional and post-transcriptional regulation and expect it to offer valuable insights for experimental and systems biology.
2025-07-24 14:40:00 15:00:00 03A General Computational Biology A deep learning framework for predicting single gene expression from cell-free DNA Robert Patton Robert Patton, Alexander Netzley, Thomas Persse, Akira Nair, Peter Nelson, Gavin Ha Liquid biopsy derived circulating tumor DNA (ctDNA) profiling is increasingly used as a minimally invasive alternative to traditional biopsies. Epigenetic inference from ctDNA has made considerable strides, but current methods struggle with single gene resolution and require specialized assays or ultra-deep, targeted sequencing. Herein we jointly introduce Triton, a tool for comprehensive fragmentomic and nucleosome profiling of cell-free DNA (cfDNA), and Proteus, a multi-modal deep learning framework for predicting single gene expression as a direct RNA-Seq analog, using standard depth whole genome sequencing of cfDNA. By synthesizing fragmentation and inferred nucleosome positioning patterns in the promoter and gene body, Proteus is capable of reproducing expression profiles from patient-derived xenograft (PDX) pure ctDNA with an accuracy similar to RNA-Seq technical replicates. Applying Proteus to cfDNA from four patient cohorts with matched tumor RNA-Seq, we show that the model can accurately predict the expression of specific prognostic and phenotype markers and therapeutic targets at as low as 3% tumor fraction. As a direct analog to RNA-Seq, we further confirm this method’s immediate applicability to existing tools through accurate prediction of gene set and pathway enrichment scores. Our results demonstrate the potential clinical utility of Triton and Proteus as minimally invasive tools for cancer monitoring and therapeutic guidance, without requiring specialized assays or targeted panels.
2025-07-24 15:00:00 15:20:00 03A General Computational Biology MAGPIE: Multi-modal alignment of genes and peaks for integrated exploration of spatial transcriptomics and spatial metabolomics data Marco Vicari, Irina Mohorianu, Jennifer Tan, Anna Ollerstam, Patrik Ståhl, Marianna Stamou, Jorrit Hornberg, Aleksandr Zakirov, Laura Setyo, Joakim Lundeberg, Eleanor Williams, Javier Escudero Morlanes, Muntasir Mamun Majumder, Azam Hamidinekoo, James Denholm, Steven Oag, Gregory Hamm, Martina Olsson Lindvall, Lovisa Franzén Recent developments in spatially resolved -omics have enabled studies linking gene expression and metabolite levels to tissue morphology, offering new insights into biological pathways. By capturing multiple modalities on matched tissue sections, one can better probe how different biological entities interact in a spatially coordinated manner. However, such cross-modality integration presents experimental and computational challenges. To align multimodal datasets into a shared coordinate system and facilitate enhanced integration and analysis, we propose MAGPIE (Multi-modal Alignment of Genes and Peaks for Integrated Exploration), a framework for co-registering spatially resolved transcriptomics, metabolomics, and tissue morphology from the same or consecutive sections. We illustrate the generalisability and scalability of MAGPIE on spatial multi-omics data from multiple tissues, combining Visium with both MALDI and DESI mass spectrometry imaging. MAGPIE was also applied to newly generated multimodal datasets, created using a specialised experimental sampling strategy, to characterise the metabolic and transcriptomic landscape in an in vivo model of drug-induced pulmonary fibrosis and to showcase the linking of small-molecule co-detection with endogenous responses in lung tissue. MAGPIE highlights the refined resolution and increased interpretability of spatial multimodal analyses in studying tissue injury, particularly in pharmacological contexts, and offers a modular, accessible computational workflow for data integration.
2025-07-24 15:20:00 15:40:00 03A General Computational Biology Randomized Spatial PCA (RASP): a computationally efficient method for dimensionality reduction of high-resolution spatial transcriptomics data Ian Gingerich Ian Gingerich, Brittany Goods, H. Robert Frost Spatial transcriptomics (ST) provides critical insights into the complex spatial organization of gene expression in tissues, enabling researchers to unravel the intricate relationship between cellular environments and biological function. Identifying spatial domains within tissues is essential for understanding tissue architecture and the mechanisms underlying various biological processes, including development and disease progression. Here, we present Randomized Spatial PCA (RASP), a novel spatially aware dimensionality reduction method for spatial transcriptomics (ST) data. RASP is designed to be orders-of-magnitude faster than existing techniques, scale to ST data with hundreds-of-thousands of locations, support the flexible integration of non-transcriptomic covariates, and enable the reconstruction of de-noised and spatially smoothed expression values for individual genes. To achieve these goals, RASP uses a randomized two-stage principal component analysis (PCA) framework that leverages sparse matrix operations and configurable spatial smoothing. We compared the performance of RASP against five alternative methods (BASS, GraphST, SEDR, spatialPCA, and STAGATE) on four publicly available ST datasets generated using diverse techniques and resolutions (10x Visium, Stereo-Seq, MERFISH, and 10x Xenium) on human and mouse tissues. Our results demonstrate that RASP achieves tissue domain detection performance comparable or superior to existing methods with a several orders-of-magnitude improvement in computational speed. The efficiency of RASP enhances the analysis of complex ST data by facilitating the exploration of increasingly high-resolution subcellular ST datasets that are being generated.
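The two-stage construction can be sketched with standard sparse tooling (parameter values, the row-normalized k-NN smoother, and the toy data are illustrative assumptions, not the RASP implementation): expression is smoothed over a spatial neighbor graph using sparse operations, then reduced with randomized SVD.

```python
import numpy as np
from scipy.sparse import identity, random as sparse_random
from sklearn.neighbors import kneighbors_graph
from sklearn.utils.extmath import randomized_svd

# Smooth expression over a spatial k-NN graph, then apply randomized PCA.
n_spots, n_genes = 2000, 500
X = sparse_random(n_spots, n_genes, density=0.05, format="csr")    # toy sparse counts
coords = np.random.default_rng(0).random((n_spots, 2))             # toy spot coordinates

W = kneighbors_graph(coords, n_neighbors=15, mode="connectivity")  # spatial k-NN graph
W = W + identity(n_spots, format="csr")                            # keep each spot's own signal
inv_deg = 1.0 / np.asarray(W.sum(axis=1))                          # row sums for normalization
W = W.multiply(inv_deg).tocsr()                                    # row-stochastic smoother

X_smooth = W @ X                                                   # sparse spatial smoothing
U, S, Vt = randomized_svd(X_smooth, n_components=30, random_state=0)
embedding = U * S                                                  # spot embedding (n_spots x 30)
```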
2025-07-24 15:40:00 16:00:00 03A General Computational Biology CAdir: Fast Clustering and Visualization of Single-Cell Transcriptomics Data by Direction in CA Space Clemens Kohl Clemens Kohl, Martin Vingron Clustering for single-cell RNA-seq aims at finding similar cells and grouping them into biologically meaningful clusters. However, many available clustering algorithms do not provide the cluster-defining marker genes, are unable to infer the number of clusters in an unsupervised manner, or lack tools to easily determine the quality of the label assignments. Therefore, clustering quality is commonly evaluated by visually inspecting low-dimensional embeddings as produced by e.g. UMAP or t-SNE. These embeddings can, however, distort the true cluster structure and are known to produce radically different embeddings depending on the chosen hyperparameters. In order to improve the interpretability of clustering results, we developed CAdir, a clustering algorithm that infers the number of clusters in the data, determines cluster-specific genes, and provides easy-to-interpret diagnostic plots. CAdir exploits the geometry induced by correspondence analysis (CA) to cluster cells as well as cluster-associated genes based on their direction in CA space. Using the angle between the cluster directions, it is able to automatically infer the number of clusters in the data by merging and splitting clusters. A comprehensive set of diagnostic and explanatory plots provides users with valuable feedback about the clustering decisions and the quality of the final as well as intermediary clusters. CAdir is scalable to even the largest datasets and provides clustering performance similar to other state-of-the-art cell clustering algorithms in our benchmarking. CAdir can be downloaded from GitHub: https://github.com/VingronLab/CAdir
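The direction-based merging idea can be illustrated compactly (a sketch with an assumed angle threshold and plain pairwise merging; not CAdir's actual criterion): clusters whose direction vectors in the embedding space subtend a small angle are merged.

```python
import numpy as np

def merge_by_direction(directions, max_angle_deg=25.0):
    """Each row is a cluster's direction vector (e.g. its centroid direction in CA
    space); clusters whose directions differ by less than the threshold angle are
    merged. Threshold and inputs are illustrative."""
    D = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    cos_thresh = np.cos(np.deg2rad(max_angle_deg))
    n = len(D)
    labels = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if D[i] @ D[j] >= cos_thresh:          # small angle -> same direction
                old, new = labels[j], labels[i]
                labels = [new if l == old else l for l in labels]
    return labels

# Clusters 0 and 1 point the same way and merge; cluster 2 stays separate.
print(merge_by_direction(np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])))
```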
2025-07-23 11:20:00 12:20:00 01A HiTSeq Learning variant effects on chromatin accessibility and 3D structure without matched Hi-C data Valentina Boeva Valentina Boeva Chromatin interactions provide insights into which DNA regulatory elements connect with specific genes, informing the activation or repression of gene expression. Understanding these interactions is crucial for assessing the role of non-coding mutations or changes in chromatin organization due to cell differentiation or disease. Hi-C and single-cell Hi-C experiments can reveal chromatin interactions, but these methods are costly and labor-intensive. Here, I will introduce our computational approach, UniversalEPI, an attention-based deep ensemble model that predicts regulatory interactions in unseen cell types with a receptive field of 2 million nucleotides, relying solely on DNA sequence data and chromatin accessibility profiles. Demonstrating significantly better performance than state-of-the-art methods, UniversalEPI—with a much lighter architecture—effectively predicts chromatin interactions across malignant and non-malignant cell lines (Spearman’s Rho > 0.9 on unseen cell types). To further expand its applicability, we integrate ASAP, our deep learning toolset that predicts the effects of genomic variants on ATAC-seq profiles. These predicted accessibility profiles can serve as input to UniversalEPI. Importantly, the accuracy of Hi-C interaction prediction remains virtually unchanged when replacing experimental ATAC-seq profiles with those generated by ASAP, indicating strong robustness and enabling predictions even in the absence of experimental accessibility data. This combined framework represents an advancement in in-silico 3D chromatin modeling, essential for exploring genetic variant impacts on disease and monitoring chromatin architecture changes during development.
2025-07-23 12:20:00 12:40:00 01A HiTSeq Spatial transcriptomics deconvolution methods generalize well to spatial chromatin accessibility data Laura D. Martens Sarah Ouologuem, Laura D. Martens, Anna C. Schaar, Maiia Shulman, Julien Gagneur, Fabian J. Theis Motivation: Spatially resolved chromatin accessibility profiling offers the potential to investigate gene regulatory processes within the spatial context of tissues. However, current methods typically work at spot resolution, aggregating measurements from multiple cells, thereby obscuring cell-type-specific spatial patterns of accessibility. Spot deconvolution methods have been developed and extensively benchmarked for spatial transcriptomics, yet no dedicated methods exist for spatial chromatin accessibility, and it is unclear if RNA-based approaches are applicable to that modality. Results: Here, we demonstrate that these RNA-based approaches can be applied to spot-based chromatin accessibility data by a systematic evaluation of five top-performing spatial transcriptomics deconvolution methods. To assess performance, we developed a simulation framework that generates both transcriptomic and accessibility spot data from dissociated single-cell and targeted multiomic datasets, enabling direct comparisons across both data modalities. Our results show that Cell2location and RCTD, in contrast to other methods, exhibit robust performance on spatial chromatin accessibility data, achieving accuracy comparable to RNA-based deconvolution. Generally, we observed that RNA-based deconvolution exhibited slightly better performance compared to chromatin accessibility-based deconvolution, especially for resolving rare cell types, indicating room for future development of specialized methods. In conclusion, our findings demonstrate that existing deconvolution methods can be readily applied to chromatin accessibility-based spatial data. Our work provides a simulation framework and establishes a performance baseline to guide the development and evaluation of methods optimized for spatial epigenomics. Availability: All methods, simulation frameworks, peak selection strategies, analysis notebooks and scripts are available at https://github.com/theislab/deconvATAC.
2025-07-23 12:40:00 13:00:00 01A HiTSeq Towards Personalized Epigenomics: Learning Shared Chromatin Landscapes and Joint De-Noising of Histone Modification Assays Tanmayee Narendra Tanmayee Narendra, Giovanni Visonà, Crhistian de Jesus Cardona, James Abbott, Gabriele Schweikert Epigenetic mechanisms enable cellular differentiation and the maintenance of distinct cell-types. They enable rapid responses to external signals through changes in gene regulation and their registration over longer time spans. Consequently, chromatin environments exhibit cell-type and individual specificity contributing to phenotypic diversity. Their genomic distributions are measured using ChIP-Seq and related methods. However, the chromatin landscape introduces significant biases into these measurements. Here, we introduce DecoDen to simultaneously learn shared chromatin landscapes while de-biasing individual measurement tracks. We demonstrate DecoDen's effectiveness on an integrative analysis of histone modification patterns across multiple tissues in personal epigenomes.
2025-07-23 14:00:00 14:20:00 01A HiTSeq Alevin-fry-atac enables rapid and memory frugal mapping of single-cell ATAC-Seq data using virtual colors for accurate genomic pseudoalignment Noor Pratap Singh Noor Pratap Singh, Jamshed Khan, Rob Patro Ultrafast mapping of short reads via lightweight mapping techniques such as pseudoalignment has significantly accelerated transcriptomic and metagenomic analyses with minimal accuracy loss compared to alignment-based methods. However, applying pseudoalignment to large genomic references, like chromosomes, is challenging due to their size and repetitive sequences. We introduce a new and modified pseudoalignment scheme that partitions each reference into “virtual colors”. These are essentially overlapping bins of fixed maximal extent on the reference sequences that are treated as distinct “colors” from the perspective of the pseudoalignment algorithm. We apply this modified pseudoalignment procedure to process and map single-cell ATAC-seq data in our new tool alevin-fry-atac. We compare alevin-fry-atac to both Chromap and Cell Ranger ATAC. Alevin-fry-atac is highly scalable and, when using 32 threads, is approximately 2.8 times faster than Chromap (the second fastest approach) while using approximately 3 times less memory and mapping slightly more reads. The resulting peaks and clusters generated from alevin-fry-atac show high concordance with those obtained from both Chromap and the Cell Ranger ATAC pipeline, demonstrating that virtual color-enhanced pseudoalignment directly to the genome provides a fast, memory-frugal, and accurate alternative to existing approaches for single-cell ATAC-seq processing. The development of alevin-fry-atac brings single-cell ATAC-seq processing into a unified ecosystem with single-cell RNA-seq processing (via alevin-fry) to work toward providing a truly open alternative to many of the varied capabilities of CellRanger.
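The virtual-color idea, partitioning a reference into overlapping fixed-extent bins so that pseudoalignment can treat each bin as a distinct color, can be sketched as follows (the bin extent and overlap values are illustrative assumptions, not alevin-fry-atac's defaults).

```python
def virtual_color(pos, bin_extent=25_000, overlap=5_000):
    """Map a reference position to the IDs of the virtual-color bins containing it.
    Bins have a fixed maximal extent and overlap, so positions near a boundary
    belong to two bins; bin b covers [b * stride, b * stride + bin_extent)."""
    stride = bin_extent - overlap
    first = max(0, (pos - bin_extent) // stride + 1)   # earliest bin that can contain pos
    colors = []
    b = first
    while b * stride <= pos:                           # all bins whose window covers pos
        if pos < b * stride + bin_extent:
            colors.append(b)
        b += 1
    return colors

print(virtual_color(24_000))   # near a boundary -> two colors: [0, 1]
print(virtual_color(10_000))   # interior -> one color: [0]
```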
2025-07-23 14:20:00 14:40:00 01A HiTSeq Oarfish: Enhanced probabilistic modeling leads to improved accuracy in long read transcriptome quantification Zahra Zare Jousheghani Zahra Zare Jousheghani, Noor Pratap Singh, Rob Patro Motivation: Long read sequencing technology is becoming an increasingly indispensable tool in genomic and transcriptomic analysis. In transcriptomics in particular, long reads offer the possibility of sequencing full-length isoforms, which can vastly simplify the identification of novel transcripts and transcript quantification. However, despite this promise, the focus of much long read method development to date has been on transcript identification, with comparatively little attention paid to quantification. Yet, due to differences in the underlying protocols and technologies, lower throughput (i.e. fewer reads sequenced per sample compared to short read technologies), as well as technical artifacts, long read quantification remains a challenge, motivating the continued development and assessment of quantification methods tailored to this increasingly prevalent type of data. Results: We introduce a new method and corresponding user-friendly software tool for long read transcript quantification called oarfish. Our model incorporates a novel coverage score, which affects the conditional probability of fragment assignment in the underlying probabilistic model. We demonstrate, in both simulated and experimental data, that by accounting for this coverage information, oarfish is able to produce more accurate quantification estimates than existing long read quantification tools. Availability and Implementation: Oarfish is implemented in the Rust programming language, and is made available as free and open-source software under the BSD 3-clause license. The source code is available at https://www.github.com/COMBINE-lab/oarfish.
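Folding a coverage term into the conditional probability of fragment assignment can be illustrated with a toy EM loop (a sketch of the general idea with an assumed multiplicative coverage score; not oarfish's actual model).

```python
import numpy as np

def em_quantify(compat, cov_score, n_iters=100):
    """Toy EM for transcript abundance from long reads. compat[i, j] = base probability
    that read i arises from transcript j (0 if incompatible); cov_score[j] in (0, 1]
    down-weights transcripts with implausible read coverage. Assumes every read is
    compatible with at least one transcript."""
    n_reads, n_tx = compat.shape
    weighted = compat * cov_score                      # coverage modulates assignment prob
    theta = np.full(n_tx, 1.0 / n_tx)
    for _ in range(n_iters):
        resp = weighted * theta                        # E-step: unnormalized responsibilities
        resp /= resp.sum(axis=1, keepdims=True)
        theta = resp.sum(axis=0) / n_reads             # M-step: update abundances
    return theta

compat = np.array([[1.0, 1.0, 0.0],
                   [0.0, 1.0, 1.0],
                   [0.0, 0.0, 1.0]])
print(em_quantify(compat, np.array([0.9, 0.8, 1.0])))
```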
2025-07-23 14:40:00 15:00:00 01A HiTSeq Identification of interactions defining 3D chromatin folding from micro to meso-scale Leonardo Morelli Leonardo Morelli, Stefano Cretti, Davide Cittaro, Tiago P. Peixoto, Alessio Zippo Understanding the structural principles of chromatin organization is a central challenge in computational epigenomics, largely due to the sparse, noisy, and complex nature of Hi-C data. Existing methods tend to focus either on local features, such as topologically associating domains (TADs), or on global structures, like compartments. This methodological split often leads to poor agreement between models, limiting our ability to obtain a unified view of genome architecture. We introduce HiCONA, a novel graph-based framework that directly infers global 3D chromatin folding from both Hi-C contact maps and super-resolution microscopy data. Unlike existing approaches, HiCONA optimizes a nested hierarchical representation of chromatin architecture by minimizing the entropy of the partition, thereby capturing the most informative and functionally relevant interactions. HiCONA enables simultaneous identification of TADs and subcompartments using a single unified model, and performs robustly across gold-standard datasets. In benchmarking experiments, HiCONA recovers key chromatin contacts under both wild-type and cohesin-deficient conditions, offering insight into the structural consequences of architectural protein depletion. Furthermore, HiCONA provides a shared representation that facilitates direct comparison between imaging and sequencing-based data, bridging a major methodological gap in chromatin biology. By capturing chromatin folding from micro to mesoscale, HiCONA opens new avenues for understanding genome organization and its functional implications. This integrative and interpretable framework marks a significant advance in uncovering the forces that shape nuclear architecture, with potential applications in development, disease, and synthetic genome design.
2025-07-23 15:00:00 15:20:00 01A HiTSeq SpliSync: Genomic language model-driven splice site correction of long RNA reads Liliana Florea Wui Wang Lui, Liliana Florea We developed SpliSync, a deep learning method for accurate splice site correction in long read alignments. It combines a genomic language model, HyenaDNA, and a 1D U-net segmentation head, integrating genome sequence and alignment embeddings. SpliSync improves the detection of splice sites and introns and, when integrated with a short read transcript assembler, allows for improved transcript reconstruction, matching or outperforming reference methods like IsoQuant and FLAIR. The method shows promise for transcriptomic applications, especially in species with incomplete gene annotations or for discovering novel transcript variations.
2025-07-23 15:20:00 15:40:00 01A HiTSeq adverSCarial: a toolkit for exposing classifier vulnerabilities in single-cell transcriptomics Ghislain Fievet Ghislain Fievet, Julien Broséus, David Meyre, Sébastien Hergalant Adversarial attacks pose a significant risk to machine learning (ML) tools designed for classifying single-cell RNA-sequencing (scRNA-seq) data, with potential implications for biomedical research and future clinical applications. We present adverSCarial, a novel R package that evaluates the vulnerability of scRNA-seq classifiers to various adversarial perturbations, ranging from barely detectable, subtle changes in gene expression to large-scale modifications. We demonstrate how five representative classifiers, spanning marker-based, hierarchical, support vector machine, random forest, and neural network algorithms, respond to these attacks on four hallmark scRNA-seq datasets. Our findings reveal that all classifiers eventually fail under different amplitudes of perturbations, which depend on the ML algorithm they are based on and on the nature of the modifications. Beyond security concerns, adversarial attacks help uncover the inner decision-making mechanisms of the classifiers. The various attack modes and customizable parameters proposed in adverSCarial are useful to identify which gene or set of genes is crucial for correct classification and to highlight the genes that can be substantially altered without detection. These functionalities are critical for the development of more robust and interpretable models, a step toward integrating scRNA-seq classifiers into routine research and clinical workflows. The R package is freely available on Bioconductor (10.18129/B9.bioc.adverSCarial) and helps evaluate the vulnerabilities of scRNA-seq-based ML models in a computationally cheap and time-efficient framework.
2025-07-23 15:40:00 16:00:00 01A HiTSeq Transcriptome Assembly at Single-Cell Resolution with Beaver Qian Shi Qian Shi, Qimin Zhang, Mingfu Shao Motivation: Established single-cell RNA sequencing (scRNA-seq) technologies have revolutionized biological and biomedical research by enabling the measurement of gene expression at single-cell resolution. However, the fundamental challenge of reconstructing full-length transcripts for individual cells remains unresolved. Existing single-sample assembly approaches cannot leverage shared information across cells, while meta-assembly approaches often fail to strike a balance between consensus assembly and preserving cell-specific expression signatures. Results: We present Beaver, a cell-specific transcript assembler designed for short-read scRNA-seq data. Beaver implements a transcript fragment graph to organize individual assemblies and designs an efficient dynamic programming algorithm that searches for candidate full-length transcripts from the graph. Beaver incorporates two random forest models trained on 51 meticulously engineered features that accurately estimate the likelihood of each candidate transcript being expressed in individual cells. Our experiments, performed using both real and simulated Smart-seq3 scRNA-seq data, firmly show that Beaver substantially outperforms existing meta-assemblers and single-sample assemblers. At the same level of sensitivity, Beaver achieved 32.0%-64.6%, 13.5%-36.6%, and 9.8%-36.3% higher precision on average compared to meta-assemblers Aletsch, TransMeta, and PsiCLASS, respectively, with similar improvements over single-sample assemblers Scallop2 (10.1%-43.6%) and StringTie2 (24.3%-67.0%). Availability: Beaver is freely available at https://github.com/Shao-Group/beaver. Scripts that reproduce the experimental results of this manuscript are available at https://github.com/Shao-Group/beaver-test.
2025-07-23 16:40:00 17:00:00 01A HiTSeq Bioinformatics analysis for long-read RNA sequencing: challenges and promises Elizabeth Tseng Pacific Biosciences, TBA, Elizabeth Tseng Long-read RNA sequencing has emerged as a powerful tool in transcriptomics, offering the ability to sequence full-length cDNAs—often exceeding 10 kb—without the need for transcript assembly. This capability shifted early bioinformatics efforts toward the discovery of novel isoforms, enabling the development of new nomenclature to describe isoform features previously undetectable by short reads. Renewed focus was also placed on identifying and filtering potential cDNA artifacts. With long read lengths and high accuracy, PacBio’s Iso-Seq data prompted new tool developments covering cancer fusion detection, direct open reading frame predictions, allele-specific isoform expression, and finally, differential isoform expression analyses. However, gaps remain in the tool space that need to be addressed with the advent of large, population-scale long-read RNA-Seq projects. In this talk, I will explore how Iso-Seq has propelled the long-read sequencing field forward, highlight the current challenges in tool development and data analysis, and discuss the promising avenues for discovery that lie ahead.
2025-07-23 17:00:00 17:20:00 01A HiTSeq Quality assessment of long read data in multisample lrRNA-seq experiments using SQANTI-reads Netanya Keil Netanya Keil, Carolina Monzó, Lauren McIntyre, Ana Conesa SQANTI-reads leverages SQANTI3, a tool for the analysis of the quality of transcript models, to develop a read-level quality control framework for replicated long-read RNA-seq experiments. The number and distribution of reads, as well as the number and distribution of unique junction chains (transcript splicing patterns), in SQANTI3 structural categories are informative of raw data quality. Multisample visualizations of QC metrics are presented by experimental design factors to identify outliers. We introduce new metrics for 1) the identification of potentially under-annotated genes and putative novel transcripts and for 2) quantifying variation in junction donors and acceptors. We applied SQANTI-reads to two different datasets, a Drosophila developmental experiment and a multiplatform dataset from the LRGASP project and demonstrate that the tool effectively reveals the impact of read coverage on data quality, and readily identifies strong and weak splicing sites. SQANTI-reads is open source and is available in versions ≥ 5.3.0 in the SQANTI3 GitHub repository.
2025-07-23 17:20:00 18:20:00 01A HiTSeq Quantifying RNA Expression and Modifications using Long Read RNA-Seq Jonathan Göke Jonathan Göke The human genome contains instructions to transcribe more than 200,000 RNAs. However, many RNA transcripts are generated from the same gene, resulting in alternative isoforms that are highly similar. Furthermore, the addition of post-transcriptional RNA modifications further impacts their function. The availability of long read RNA-Seq provides an opportunity to sequence entire RNA transcripts, enabling the analysis of individual RNA isoforms and their modifications. In this presentation I will show how the raw nanopore signal data can be used to identify and distinguish multiple RNA modifications from direct RNA-Seq data, I will summarise new results from the Singapore Nanopore Expression Project (SG-NEx) and describe computational methods that analyse long read RNA-Seq data to estimate isoform expression, track full length reads, and identify novel isoforms at single cell and spatial resolution.
2025-07-24 08:40:00 09:40:00 01A HiTSeq Pangenome based analysis of structural variation Tobias Marschall Tobias Marschall Breakthroughs in long-read sequencing technology and assembly methodology enable the routine de novo assembly of human genomes to near completion. Such assemblies open a door to exploring structural variation (SV) in previously inaccessible regions of the genome. The Human Pangenome Reference Consortium (HPRC) and the Human Genome Structural Variation Consortium (HGSVC) have produced high quality genome assemblies, which provide a basis for comparative genome analysis using pangenome graphs. First, we will ask how a pangenomic resource like this can be leveraged in order to better analyze structural variants in samples with short-read whole-genome sequencing (WGS) data. In a process called genome inference, implemented in the PanGenie software, we can use a pangenome reference to infer the haplotype sequences of individual genomes to a quality clearly superior to standard variant calling workflows. This process allows us to detect more than twice the number of structural variants per genome from short-read WGS and therefore provides an opportunity for genome-wide association studies to include these SVs. Second, we introduce Locityper, a tool specifically designed for targeted genotyping of complex loci using short and long-read whole genome sequencing. For each target, Locityper recruits and aligns reads to local haplotypes and finds the likeliest haplotype pair by optimizing read alignment, insert size and read depth profiles. Locityper accurately genotypes up to 194 of 256 challenging medically relevant loci (95% haplotypes at QV33), an 8.8-fold gain compared to 22 genes achieved with standard variant calling pipelines. Furthermore, Locityper provides access to hyperpolymorphic HLA genes and other gene families, including KIR, MUC and FCGR.
2025-07-24 09:40:00 10:00:00 01A HiTSeq Resolving Paralogues and Multi-Copy Genes with Nanopore Long-Read Sequencing Sergey Nurk Oxford Nanopore Technologies, TBA, Sergey Nurk
2025-07-24 11:20:00 11:40:00 01A HiTSeq GreedyMini: Generating low-density DNA minimizers Arseny Shur Shay Golan, Ido Tziony, Matan Kraus, Yaron Orenstein, Arseny Shur Motivation: Minimizers are the most popular k-mer selection scheme in algorithms and data structures analyzing high-throughput sequencing (HTS) data. In a minimizer scheme, the smallest k-mer by some predefined order is selected as the representative of a sequence window containing w consecutive k-mers, which results in overlapping windows often selecting the same k-mer. Minimizers that achieve the lowest frequency of selected k-mers over a random DNA sequence, termed the expected density, are desired for improved performance of HTS analyses. Yet, no method to date exists to generate minimizers that achieve minimum expected density. Moreover, for k and w values used by common HTS algorithms and data structures there is a gap between densities achieved by existing selection schemes and the theoretical lower bound. Results: We developed GreedyMini, a toolkit of methods to generate minimizers with low expected or particular density, to improve minimizers, to extend minimizers to larger alphabets, k, and w, and to measure the expected density of a given minimizer efficiently. We demonstrate over various combinations of k and w values, including those of popular HTS methods, that GreedyMini can generate DNA minimizers that achieve expected densities very close to the lower bound, and both expected and particular densities much lower compared to existing selection schemes. Moreover, we show that GreedyMini's k-mer rank-retrieval time is comparable to common k-mer hash functions. We expect GreedyMini to improve the performance of many HTS algorithms and data structures and advance the research of k-mer selection schemes.
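Expected density, the fraction of k-mer positions a scheme selects on random DNA, can be estimated empirically with a few lines (a Monte Carlo sketch using Python's built-in hash as an arbitrary k-mer order; the abstract notes GreedyMini includes its own efficient density-measurement method, so this sampling loop is only illustrative).

```python
import random

def density(seq, k, w, order=hash):
    """Fraction of k-mer positions selected as minimizers: in each window of w
    consecutive k-mers, select the smallest under `order` (leftmost on ties)."""
    selected = set()
    for i in range(len(seq) - (w + k - 1) + 1):
        kmers = [seq[i + j:i + j + k] for j in range(w)]
        j_min = min(range(w), key=lambda j: order(kmers[j]))
        selected.add(i + j_min)
    return len(selected) / (len(seq) - k + 1)

# Monte Carlo estimate of expected density on random DNA (toy parameters)
random.seed(0)
seqs = ["".join(random.choice("ACGT") for _ in range(10_000)) for _ in range(5)]
est = sum(density(s, k=8, w=10) for s in seqs) / len(seqs)
print(f"estimated expected density: {est:.4f} (trivial lower bound 1/w = {1/10:.4f})")
```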
2025-07-24 11:40:00 12:00:00 01A HiTSeq Exploiting uniqueness: seed-chain-extend alignment on elastic founder graphs Nicola Rizzo Nicola Rizzo, Manuel Cáceres, Veli Mäkinen Sequence-to-graph alignment is a central challenge of computational pangenomics. To overcome the theoretical hardness of the problem, state-of-the-art tools use seed-and-extend or seed-chain-extend heuristics for alignment. We implement a complete seed-chain-extend alignment workflow based on indexable elastic founder graphs (iEFGs), which support linear-time exact searches unlike general graphs. We show how to construct iEFGs, find high-quality seeds, chain, and extend them at the scale of a telomere-to-telomere assembled human chromosome. Our sequence-to-graph alignment tool and the scripts to replicate our experiments are available at https://github.com/algbio/SRFAligner.
2025-07-24 12:00:00 12:20:00 01A HiTSeq FroM Superstring to Indexing: a space-efficient index for unconstrained k-mer sets using the Masked Burrows-Wheeler Transform (MBWT) Ondřej Sladký Ondřej Sladký, Pavel Veselý, Karel Brinda The exponential growth of DNA sequencing data calls for efficient solutions for storing and querying large-scale k-mer sets. While recent indexing approaches use spectrum-preserving string sets (SPSS), full-text indexes, or hashing, they often impose structural constraints or demand extensive parameter tuning, limiting their usability across different datasets and data types. Here, we propose FMSI, a minimally parametrized, highly space-efficient membership index and compressed dictionary for arbitrary k-mer sets. FMSI combines approximated shortest superstrings with the Masked Burrows-Wheeler Transform (MBWT). Unlike traditional methods, FMSI operates without predefined assumptions on k-mer overlap patterns but exploits them when available. We demonstrate that FMSI offers superior memory efficiency for processing queries over established indexes such as SSHash, Spectral Burrows-Wheeler Transform (SBWT), and Conway-Bromage-Lyndon (CBL), while supporting fast membership and dictionary queries. Depending on the dataset, k, or sampling, FMSI offers 2–3x space savings for processing queries over all state-of-the-art indexes; only a space-optimized SBWT (without indexing reverse complement) matches its memory efficiency in some cases but is 2–3x slower. Overall, this work establishes superstring-based indexing as a highly general, flexible, and scalable approach for genomic data, with direct applications in pangenomics, metagenomics, and large-scale genomic databases.
2025-07-24 12:20:00 12:40:00 01A HiTSeq The Alice assembler: dramatically accelerating genome assembly with MSR sketching Roland Faure Roland Faure, Jean-François Flot, Dominique Lavenier The PacBio HiFi technology and the R10.4 Oxford Nanopore flowcells are transforming the genomic world by producing for the first time long and accurate sequencing reads. The low error rate of these reads opens new avenues for computational optimization. However, genome and particularly metagenome assembly using high-fidelity reads still faces challenges. Current assemblers (e.g., Flye, hifiasm, metaMDBG) struggle to efficiently resolve highly similar haplotypes (homologous chromosomes, bacterial strains, repeats) while maintaining computational speed, creating a gap between rapid and haplotype-resolved methods. We investigated this issue on several datasets, including a human gut microbiome sequencing run and a diploid genome, finding that hifiasm_meta and metaFlye required over a month of CPU time to produce an assembly, while metaMDBG, which collapses similar strains, assembled the same dataset in four days. We present Alice, a new assembler that introduces a sequence sketching method called MSR sketching to bridge this gap and efficiently produce haplotype-resolved assemblies for both genomic and metagenomic datasets. On the aforementioned human gut dataset, Alice completed the assembly in just 7 CPU hours. Furthermore, the analysis of the assemblies revealed that Alice missed <1% of abundant 31-mers (≥20x coverage), compared to >15% missed by both metaMDBG and hifiasm_meta. Overall, our results indicate that Alice accelerates assembly dramatically while providing high-quality assemblies, offering a powerful new tool for the field.
2025-07-24 12:40:00 13:00:00 01A HiTSeq BINSEQ: A Family of High-Performance Binary Formats for Nucleotide Sequences Noam Teyssier Noam Teyssier, Alexander Dobin Modern genomics routinely generates billions of sequencing records per run, typically stored as gzip-compressed FASTQ files. This format's inherent limitations—single-threaded decompression and sequential parsing of irregularly sized records—create significant bottlenecks for bioinformatics applications that would benefit from parallel processing. We present BINSEQ, a family of simple binary formats designed for high-throughput parallel processing of sequencing data. The family includes BINSEQ, optimized for fixed-length reads with true random access capability through two-bit encoding, and VBINSEQ, supporting variable-length sequences with optional quality scores and block-based organization. Both formats natively handle paired-end reads, eliminating the need for synchronized files. Our comprehensive evaluation demonstrates that BINSEQ formats deliver substantial performance improvements across bioinformatics workflows while maintaining competitive storage efficiency. Both formats achieve up to 32x faster processing than compressed FASTQ and continue to scale with increasing thread counts where traditional formats quickly plateau due to I/O bottlenecks. These advantages extend to complex workflows like alignment, with BINSEQ formats showing 2-5x speedups at higher thread counts when tested with tools like minimap2 and STAR. Storage requirements remain comparable to or better than existing formats, with BINSEQ (610.35 MB) similar to gzip-compressed FASTA (647.29 MB) and VBINSEQ (509.89 MB) approaching CRAM (491.85 MB) efficiency. To facilitate adoption, we provide high-performance libraries, parallelization APIs, and conversion tools as free, open-source implementations. BINSEQ addresses fundamental inefficiencies in genomic data processing by considering modern parallel computing architectures.
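The two-bit encoding underlying BINSEQ's fixed-length records can be sketched as follows; the bit layout here is an assumption for illustration, not the published file specification.

```python
# Two-bit packing: A=00, C=01, G=10, T=11, four bases per byte.
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}
DEC = "ACGT"

def pack(seq: str) -> bytes:
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= ENC[base] << (2 * (i % 4))
    return bytes(out)

def unpack(buf: bytes, n: int) -> str:
    return "".join(DEC[(buf[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n))

read = "ACGTACGTGGTT"
packed = pack(read)
assert unpack(packed, len(read)) == read
# With fixed-length records, read i lives at byte offset i * record_bytes:
# true random access, no stream decompression, trivially parallelizable.
print(len(packed), "bytes for", len(read), "bases")
```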
2025-07-24 14:00:00 14:20:00 01A HiTSeq Ultrafast and Ultralarge Multiple Sequence Alignments using TWILIGHT Yu-Hsiang Tseng Yu-Hsiang Tseng, Sumit Walia, Yatish Turakhia Motivation: Multiple sequence alignment (MSA) is a fundamental operation in bioinformatics, yet existing MSA tools are struggling to keep up with the speed and volume of incoming data. This is because the runtimes and memory requirements of current MSA tools become untenable when processing large numbers of long input sequences and they also fail to fully harness the parallelism provided by modern CPUs and GPUs. Results: We present TWILIGHT (Tall and Wide Alignments at High Throughput), a novel MSA tool optimized for speed, accuracy, scalability, and memory constraints, with both CPU and GPU support. TWILIGHT incorporates innovative parallelization and memory-efficiency strategies that enable it to build ultralarge alignments at high speed even on memory-constrained devices. On challenging datasets, TWILIGHT outperformed all other tools in speed and accuracy. It scaled beyond the limits of existing tools and performed an alignment of 1 million RNASim sequences within 30 minutes while utilizing less than 16 GB of memory. TWILIGHT is the first tool to align over 8 million publicly available SARS-CoV-2 sequences, setting a new standard for large-scale genomic alignment and data analysis. Availability: TWILIGHT’s code is freely available under the MIT license at https://github.com/TurakhiaLab/TWILIGHT. The test datasets and experimental results, including our alignment of 8 million SARS-CoV-2 sequences, are available at https://zenodo.org/records/14722035.
2025-07-24 14:20:00 14:40:00 01A HiTSeq CREMSA: Compressed Indexing of (Ultra) Large Multiple Sequence Alignments Mikaël Salson Mikaël Salson, Arthur Boddaert, Awa Bousso Gueye, Laurent Bulteau, Yohan Hernandez-Courbevoie, Camille Marchet, Nan Pan, Sebastian Will, Yann Ponty Recent viral outbreaks motivate a systematic collection of pathogenic genomes, including a strong focus on genomic RNA, in order to accelerate their study and monitor the appearance and spread of variants. Due to their limited length and the temporal proximity of their collection, viral genomes are usually organized and analyzed as oversized Multiple Sequence Alignments (MSAs). Such MSAs are largely ungapped and mostly homogeneous at the column level, but not at the sequential level due to local variations, hindering the performance of sequential compression algorithms. In order to enable an efficient manipulation of MSAs, including subsequent statistical analyses, we introduce CREMSA (Column-wise Run-length Encoding for Multiple Sequence Alignments), a new index that builds on sparse bitvector representations to compress an existing or streamed MSA, all the while allowing for an expressive set of accelerated requests to query the alignment without prior decompression. Using CREMSA, a 65GB MSA consisting of 1.9M SARS-CoV-2 genomes could be compressed into 22MB using less than half a gigabyte of main memory, while supporting access requests on the order of 100 ns. Such a speed-up enables a comprehensive analysis of covariation over this very large MSA. We further assess the impact of the sequence ordering on the compressibility of MSAs and propose a resorting strategy that, despite the proven NP-hardness of an optimal sort, yields greatly increased compression ratios at a marginal computational cost.
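The core idea, run-length encoding each MSA column and answering accesses by binary search over run boundaries, can be sketched as follows; CREMSA realizes this with sparse bitvectors, and the class and method names here are illustrative.

```python
from bisect import bisect_right

class ColumnRLE:
    """Toy column-wise run-length encoding of an MSA with O(log r)
    random access per column, r being the number of runs."""
    def __init__(self, msa):
        self.cols = []
        for j in range(len(msa[0])):
            starts, chars = [], []
            prev = None
            for i, row in enumerate(msa):
                if row[j] != prev:          # a new run begins at row i
                    starts.append(i)
                    chars.append(row[j])
                    prev = row[j]
            self.cols.append((starts, chars))

    def access(self, row, col):
        starts, chars = self.cols[col]
        return chars[bisect_right(starts, row) - 1]

msa = ["ACGT-A",
       "ACGT-A",
       "ACGA-A",
       "ACGA-A"]
idx = ColumnRLE(msa)
assert idx.access(1, 3) == "T" and idx.access(2, 3) == "A"
```

Columns with few runs (the homogeneous case described above) compress to a handful of entries, which is also why resorting rows to lengthen runs improves compression.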
2025-07-24 14:40:00 15:00:00 01A HiTSeq LYCEUM: Learning to call copy number variants on low coverage ancient genomes Mehmet Alper Yilmaz Mehmet Alper Yilmaz, Ahmet Arda Ceylan, Gun Kaynar, A. Ercument Cicek Motivation: Copy number variants (CNVs) are pivotal in driving phenotypic variation that facilitates species adaptation. They are significant contributors to various disorders, making ancient genomes crucial for uncovering the genetic origins of disease susceptibility across populations. However, detecting CNVs in ancient DNA (aDNA) samples poses substantial challenges due to several factors: (i) aDNA is often highly degraded; (ii) contamination from microbial DNA and DNA from closely related species introduces additional noise into sequencing data; and (iii) the typically low coverage of aDNA renders accurate CNV detection particularly difficult. Conventional CNV calling algorithms, which are optimized for high-coverage read-depth signals, underperform under such conditions. Results: To address these limitations, we introduce LYCEUM, the first machine learning-based CNV caller for aDNA. To overcome challenges related to data quality and scarcity, we employ a two-step training strategy. First, the model is pre-trained on whole-genome sequencing data from the 1000 Genomes Project, teaching it CNV-calling capabilities similar to conventional methods. Next, the model is fine-tuned using high-confidence CNV calls derived from only a few existing high-coverage aDNA samples. During this stage, the model adapts to making CNV calls based on the downsampled read-depth signals of the same aDNA samples. LYCEUM achieves accurate detection of CNVs even in typically low-coverage ancient genomes. We also observe that the segmental deletion calls made by LYCEUM correlate with the demographic history of the samples and exhibit patterns consistent with negative selection. Availability: LYCEUM is available at https://github.com/ciceklab/LYCEUM.
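The two-step strategy can be sketched as a generic pretrain-then-fine-tune loop over read-depth windows; this is a minimal PyTorch stand-in with assumed shapes, loaders, and layer sizes, not LYCEUM's published architecture.

```python
import torch
import torch.nn as nn

# Toy read-depth CNV classifier: pretrain on modern WGS windows, then
# fine-tune on downsampled aDNA windows at a lower learning rate.
model = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(16, 3),                  # deletion / neutral / duplication
)

def run_epochs(loader, lr, epochs):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for depth, label in loader:    # depth: (batch, 1, window_bins)
            opt.zero_grad()
            loss_fn(model(depth), label).backward()
            opt.step()

# 1) pretrain on high-coverage 1000 Genomes read depth (loader assumed):
# run_epochs(wgs_loader, lr=1e-3, epochs=10)
# 2) fine-tune on downsampled high-confidence aDNA calls:
# run_epochs(adna_loader, lr=1e-4, epochs=3)
```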
2025-07-24 15:00:00 15:20:00 01A HiTSeq POPSICLE: a probabilistic method to capture uncertainty in single-cell copy-number calling Lucrezia Patruno Lucrezia Patruno, Sophia Chirrane, Simone Zaccaria During tumour evolution, cancer cells acquire somatic copy-number alterations (CNAs), which are frequent genomic alterations resulting in the amplification or deletion of large genomic regions. Recent single-cell technologies allow the accurate investigation of CNA rates and their underlying mechanisms by performing whole-genome sequencing of thousands of individual cancer cells in parallel (scWGS-seq). While several methods have been developed to identify the most likely CNAs from scWGS-seq data, the high levels of variability in these data make the accurate inference of point estimates for CNAs (i.e., a single value for the most likely copy number) challenging. Moreover, given that variability increases with increasing copy numbers, this is especially true when considering high amplifications and highly aneuploid cells, which play a key role in cancer. However, to date, existing methods are limited to the inference of point estimates for CNAs in single cells and do not capture their related uncertainty. To address these limitations, we introduce POPSICLE, a novel probabilistic approach that computes the probability of having different copy numbers for every genomic region in each single cell. Using simulations, we show that POPSICLE improves ploidy and CNA inference for up to 20% of the genome in 90% of cells. Using a dataset comprising more than 60,000 breast and ovarian cancer cells, we show how POPSICLE leverages uncertainty to improve the identification of genes that are recurrently highly amplified and might play a key role in tumour progression.
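The contrast between a point estimate and a full posterior over copy numbers can be illustrated with a one-bin Poisson model; this is a deliberate simplification, and POPSICLE's actual likelihood and priors are not reproduced here.

```python
import numpy as np
from scipy.stats import poisson

def copy_number_posterior(read_count, per_copy_rate, max_cn=8, prior=None):
    """Posterior P(copy number | read count) for one genomic bin of one
    cell, assuming Poisson counts with mean proportional to copy number.
    A point estimate keeps only the argmax; the full vector keeps the
    uncertainty, which widens as the copy number grows."""
    cns = np.arange(max_cn + 1)
    if prior is None:
        prior = np.ones_like(cns, dtype=float) / len(cns)   # flat prior
    like = poisson.pmf(read_count, mu=np.maximum(cns, 1e-3) * per_copy_rate)
    post = like * prior
    return post / post.sum()

post = copy_number_posterior(read_count=47, per_copy_rate=10.0)
print(post.round(3))   # mass spread over CN ~ 4-5 rather than one value
```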
2025-07-24 15:20:00 15:40:00 01A HiTSeq MutSuite: A Toolkit for Simulating and Evaluating Mutations in Aligned Sequencing Reads Kendell Clement Kendell Clement Simulated sequencing reads containing known mutations are essential for developing, testing, and benchmarking mutation detection tools. Most existing simulation tools introduce mutations into synthetic reads and then realign them to a reference genome prior to downstream analysis. However, this realignment step can obscure the true position of insertions and deletions, introducing ambiguity and potential error in evaluation. In particular, the alignment process can shift the apparent location of insertions and deletions, complicating efforts to assess recall and precision of variant callers. To address this limitation and support the development of more accurate and sensitive mutation detection algorithms, we developed MutSim, a tool that introduces substitutions, insertions, and deletions directly into aligned reads (e.g., in BAM files). By avoiding realignment, MutSim ensures that each simulated mutation remains at its exact specified position, enabling precise evaluation of variant caller performance. MutSim is part of a larger toolkit we call MutSuite, which also includes MutRun, a companion tool that automates the execution of variant calling software on simulated datasets, and MutAgg, which aggregates and summarizes results across multiple variant callers for performance comparison. Together, these tools provide a robust and flexible framework for mutation simulation and benchmarking. MutSuite is open-source and freely available at: https://github.com/clementlab/mutsuite.
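Injecting a substitution directly into aligned reads, with no realignment, might look like the following pysam sketch. The function name and parameters are illustrative, not MutSim's interface; the input BAM is assumed indexed, and for brevity only reads overlapping the site are written out, whereas a real tool would stream the whole file.

```python
import random
import pysam

def inject_substitution(in_bam, out_bam, contig, ref_pos, alt_base, vaf=0.5):
    """Write `alt_base` at reference coordinate `ref_pos` directly into a
    fraction `vaf` of overlapping reads, so the simulated variant stays
    exactly where it was placed."""
    with pysam.AlignmentFile(in_bam, "rb") as src, \
         pysam.AlignmentFile(out_bam, "wb", template=src) as dst:
        for read in src.fetch(contig, ref_pos, ref_pos + 1):
            if not read.is_unmapped and random.random() < vaf:
                # Map the reference coordinate to a query offset.
                for qpos, rpos in read.get_aligned_pairs(matches_only=True):
                    if rpos == ref_pos:
                        quals = read.query_qualities   # preserved; pysam
                        seq = list(read.query_sequence)  # drops them on
                        seq[qpos] = alt_base             # sequence update
                        read.query_sequence = "".join(seq)
                        read.query_qualities = quals
                        break
            dst.write(read)

# inject_substitution("sample.bam", "mutated.bam", "chr1", 1_000_000, "T")
```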
2025-07-24 15:40:00 16:00:00 01A HiTSeq Landscape of The Dark Genome’s variants and their influence on cancer Joao P. C. R. Mendonca Joao P. C. R. Mendonca, Kristoffer Staal Rohrberg, Peter Holst, Frederik Otzen Bagger Human endogenous retroviruses (HERVs) are remnants of ancient viral infections that now make up ~8% of the human genome. Although typically silenced, HERVs can become reactivated in cancer and are emerging as biomarkers and immunotherapeutic targets. However, their clinical utility is limited by challenges in resolving individual loci due to high sequence similarity, incomplete genome annotations, and an overreliance on linear reference genomes. To address this, we constructed a variational pangenome using long-read sequencing data from Genome in a Bottle and the Platinum Pedigree projects. This approach enables accurate detection of single nucleotide variants (SNVs), insertions/deletions (indels), and structural variants (SVs) in a reference-free manner, revealing polymorphic HERV insertions absent from the human reference genome. By integrating data from the Copenhagen Prospective Personalized Oncology (CoPPO) biobank, we link these variants to HERV expression in cancer, distinguishing potentially pathogenic variants from benign ones. We combine pangenome-informed annotations with locus-specific expression quantification tools to resolve HERV transcription at individual loci and connect specific sequence variants to tumorigenesis and immune modulation. Our findings enhance the resolution of HERV mapping across individuals and cancer types, uncovering previously inaccessible variation in a historically overlooked portion of the genome. This work not only improves our understanding of HERV-driven disease mechanisms but also lays the groundwork for variant-informed biomarker discovery and therapeutic targeting in precision oncology.
2025-07-23 11:20:00 12:10:00 03B The Innovation Pipeline: How Industry & Academia Can Work Together in Computational Biology Opening Remarks & Framing the Ecosystem • Introduction of the ISCB Industry Advisory Council and session overview • Overview of the academia–industry–startup pipeline • Framing questions: How do discoveries become products? What does meaningful collaboration look like?
2025-07-23 12:10:00 13:00:00 03B The Innovation Pipeline: How Industry & Academia Can Work Together in Computational Biology Startup Ecosystems & Translational Science Explore the early stages of innovation where research meets entrepreneurship. Speakers will discuss: • What makes a scientific idea “translatable” in the eyes of investors or biotech accelerators • How academics can navigate the startup ecosystem
2025-07-23 14:00:00 16:00:00 03B The Innovation Pipeline: How Industry & Academia Can Work Together in Computational Biology Public-Private Partnerships and Career Navigation Focus on later-stage collaborations and real-world applications. Panelists will cover: • How computational biology fuels industry R&D pipelines • How to start an open-source software company • Collaboration models between academia, industry, and public funders • Career navigation
2025-07-23 16:40:00 18:00:00 03B The Innovation Pipeline: How Industry & Academia Can Work Together in Computational Biology Funding & Government Innovation Strategies Review some key areas of funding strategy aligning with innovation pipelines such as: • Navigating government and cross-sector funding channels • Emerging biotech/AI policy in the UK and globally • Role of public datasets and infrastructure in enabling industry research • Strategies to foster equitable, sustainable academic-industry partnerships
2025-07-23 11:20:00 11:30:00 02N iRNA Introduction to iRNA Michelle Scott, Athma Pai
2025-07-23 11:30:00 12:10:00 02N iRNA Sequential verification of transcription by Integrator and Restrictor Steven West Steven West The decision between productive elongation and premature termination of promoter-proximal RNA polymerase II (RNAPII) is fundamental to metazoan gene regulation. Integrator and Restrictor complexes are implicated in promoter-proximal termination, but why metazoans utilise two complexes and how they are coordinated remains unknown. Here, we show that Integrator and Restrictor act sequentially and non-redundantly to monitor distinct stages of transcription. Integrator predominantly engages with promoter-proximally paused RNAPII to trigger premature termination, which is prevented by cyclin-dependent kinase 7/9 activity. After pause release, RNAPII enters a previously unrecognised “restriction zone” universally imposed by Restrictor. Unproductive RNAPII terminates within this zone, while progression through it is promoted by U1 small nuclear ribonucleoprotein (snRNP), which antagonises Integrator and Restrictor in a U1-70K-dependent manner. These findings reveal the principles of a sequential verification mechanism governing the balance between productive and attenuated transcription, rationalising the necessity of Integrator and Restrictor complexes in metazoans.
2025-07-23 12:10:00 12:20:00 02N iRNA CIRI-Deep Enables Single-Cell and Spatial Transcriptomic Analysis of Circular RNAs with Deep Learning Yuan Gao Zihan Zhou, Yuan Gao Circular RNAs (circRNAs) are a crucial yet relatively unexplored class of transcripts known for their tissue- and cell-type-specific expression patterns. Despite the advances in single-cell and spatial transcriptomics, these technologies face difficulties in effectively profiling circRNAs due to inherent limitations in circRNA sequencing efficiency. To address this gap, a deep learning model, CIRI-deep, is presented for comprehensive prediction of circRNA regulation on diverse types of RNA-seq data. CIRI-deep is trained on an extensive dataset of 25 million high-confidence circRNA regulation events and achieves high performance on both test and leave-out data, ensuring its accuracy in inferring differential events from RNA-seq data. It is demonstrated that CIRI-deep and its adapted version enable various circRNA analyses, including cluster- or region-specific circRNA detection, BSJ ratio map visualization, and trans and cis feature importance evaluation. Collectively, CIRI-deep’s adaptability extends to all major types of RNA-seq datasets, including single-cell and spatial transcriptomic data, which will undoubtedly broaden the horizons of circRNA research.
2025-07-23 12:20:00 12:40:00 02N iRNA Enhancing circRNA–miRNA Interaction Prediction with Structure-aware Sequence Modeling Juseong Kim Juseong Kim, Sanghun Sel, Giltae Song Circular RNAs (circRNAs) function as key post-transcriptional regulators by interacting with microRNAs (miRNAs) to modulate gene expression. These interactions play a central role in gene regulatory networks and are implicated in various diseases. Accurate prediction of circRNA–miRNA interactions is therefore essential for understanding regulatory mechanisms and advancing therapeutic development. Notably, sequence variability among circRNA isoforms sharing the same back-splice junction can result in distinct miRNA binding profiles, highlighting the importance of isoform-level modeling. However, existing computational methods, including rule-based approaches (e.g., Miranda) and graph-based neural architectures, often fail to incorporate structural information and cannot effectively capture isoform-specific characteristics, thereby limiting their predictive performance. To address these challenges, we propose Thymba, a hybrid deep learning framework for structure-informed prediction of circRNA–miRNA interactions. Thymba combines Mamba modules, self-attention mechanisms, and one-dimensional convolutions to jointly model local sequence motifs and long-range dependencies. Furthermore, it employs a structure-aware pretraining strategy that concurrently optimizes masked language modeling and RNA secondary structure learning, enabling the model to generate representations that encode both sequential and structural contexts. We additionally construct a high-quality isoform-level dataset by integrating AGO-supported interaction data from public repositories and generating hard negative pairs via RNAhybrid-based thermodynamic and alignment filtering. This dataset supports both interaction prediction and binding site prediction tasks. Experimental results show that Thymba consistently outperforms existing methods, particularly on isoform-specific benchmarks, and demonstrates strong generalizability to related RNA–RNA interaction tasks such as circRNA–RBP binding prediction.
2025-07-23 12:40:00 13:00:00 02N iRNA Flash talks Multiple 1-minute flash talks advertising iRNA posters
2025-07-23 14:00:00 14:20:00 02N iRNA Predicting relevant snoRNA genes across any eukaryote genome using SnoBIRD Étienne Fafard-Couture Étienne Fafard-Couture, Pierre-Étienne Jacques, Michelle S Scott Small nucleolar RNAs (snoRNAs) are a group of noncoding RNAs identified in all eukaryotes. In humans, C/D box snoRNAs are the most prevalent class, displaying crucial functions like regulating ribosome biogenesis and splicing. We have recently reported that less than a third of all annotated snoRNA genes are expressed in human. The remaining two-thirds, named the snoRNA pseudogenes, present features that are incompatible with their expression (e.g., mutations in their boxes). However, current annotations are often incomplete and overlook these snoRNA pseudogenes. To address this, we developed SnoBIRD. Based on DNABERT, SnoBIRD identifies C/D box snoRNA genes from any input sequence and classifies them as expressed or pseudogenes using sequence features (e.g., mutations in boxes). We show that SnoBIRD outperforms its competitor tools on a test set representative of all eukaryote kingdoms by using relevant biological signal in the input sequence. By applying SnoBIRD to different genomes, we find that its runtime is adequate on the small Schizosaccharomyces pombe genome and that it substantially outperforms the other tools on the large human genome (<13h compared to >3.5 days). Moreover, SnoBIRD identifies most of the already annotated snoRNAs in these two species (19/32 and 358/403, respectively), as well as 8 and 22 novel expressed C/D box snoRNAs in their respective genomes. Finally, we applied SnoBIRD to the genomes of varied eukaryote species and show that it is an efficient and generalizable snoRNA predictor, as it identifies the known C/D box snoRNAs as well as dozens of novel expressed snoRNAs in these species.
2025-07-23 14:20:00 14:40:00 02N iRNA Charting the dynamics of the tRNAome in health and disease with AMaNITA Xanthi Lida Katopodi Xanthi Lida Katopodi, Laia Llovera Nadal, Alexane Ollivier, Leszek Pryszcz, Cornelius Pauli, Daniel Heid, Thomas Muley, Marc Schneider, Laura Klotz, Michael Allgäuer, Michaela Frye, Carsten Müller-Tidow, Oguzhan Begik, Eva Maria Novoa Transfer RNAs (tRNAs) play a pivotal role in decoding genetic information, determining which transcripts are highly and poorly translated at a given moment. Dysregulation of tRNA abundances and their RNA modifications is a well-known feature of cancer cells, which leads to enhanced expression of specific oncogenic transcripts and proteins or, conversely, to the depletion of proteins essential to proper cell function. A novel protocol named Nano-tRNAseq was recently developed to study tRNA populations using native RNA nanopore sequencing technologies, providing tRNA abundance and modification information from the same individual molecules. To analyze information-rich nanopore native tRNA sequencing datasets, here we have developed AMaNITA (Abundance, Modifications, and Nanopore Intensity Toolbox/Application), a toolkit that facilitates Nano-tRNAseq analysis and provides a simple and user-friendly computational framework for the analysis of Nano-tRNAseq data. AMaNITA performs several steps, including filtering, quality control, batch effect estimation and automated correction, differential tRNA expression, and differential modification analyses, thus providing a start-to-end analysis of the data. Harnessing the data produced by Nano-tRNAseq with AMaNITA, we then examine whether tRNAs can be used to distinguish biological states, tissue of origin, and disease state. We find that our method clusters tumor and normal samples separately and identifies individual tRNA molecules that are dysregulated in cancer, with potential diagnostic and therapeutic applications in the clinic. When applied to a lung cancer cohort consisting of 69 matched tumor/normal samples, our method reveals that tRNA information can segregate healthy and tumor samples with high accuracy.
2025-07-23 14:40:00 15:00:00 02N iRNA Identification and characterization of chromatin-associated long non-coding RNAs in human Lina Ma Zhao Li, Zhang Zhang, Lina Ma Chromatin-associated long non-coding RNAs (ca-lncRNAs) play crucial regulatory roles within the nucleus by preferentially binding to chromatin. Despite their importance, systematic identification and functional studies of ca-lncRNAs have been limited. Here, we identified and characterized human ca-lncRNAs genome-wide, utilizing 323,950 lncRNAs from LncBook 2.0 and integrating high-throughput sequencing datasets that assess RNA-chromatin association. We identified 14,138 high-confidence ca-lncRNAs enriched on chromatin across six cell lines, comprising nearly 80% of analyzed chromatin-associated RNAs, highlighting their significant role in chromatin localization. To explore the sequence basis for chromatin localization, we applied the LightGBM machine learning model to identify contributing nucleotide k-mers and derived 12 sequence elements through k-mer assembly and feature ablation. These sequence elements are frequently found within Alu repeats, with more Alu repeats enhancing chromatin localization. Meta-profiling of chromatin-binding sequencing segments further demonstrated that ca-lncRNAs bind to chromatin through Alu repeats. To delve deeper into the molecular mechanisms underlying the binding, we conducted integrative interactome analysis and computational prediction, revealing that Alu repeats primarily tether to chromatin through dsDNA-RNA triplex formation. Finally, to address sample constraints in ca-lncRNA identification, we developed a machine learning model based on sequential feature selection for large-scale prediction. This approach yielded 201,959 predicted ca-lncRNAs, approximately 70% of which are predicted to be preferentially located in the nucleus. Collectively, these high-throughput-identified and machine-learning-predicted ca-lncRNAs together form a robust resource for further functional studies.
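A k-mer-features-plus-LightGBM workflow of the kind described above can be sketched as follows; the feature construction and hyperparameters are assumptions, and `lncrna_seqs` and `chromatin_labels` are hypothetical placeholders for the LncBook-derived training data.

```python
from itertools import product
import numpy as np
from lightgbm import LGBMClassifier

def kmer_counts(seqs, k=4):
    """Normalized k-mer count features for a list of nucleotide sequences."""
    vocab = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    X = np.zeros((len(seqs), len(vocab)))
    for r, s in enumerate(seqs):
        for i in range(len(s) - k + 1):
            km = s[i:i + k]
            if km in vocab:                 # skip ambiguous bases
                X[r, vocab[km]] += 1
        X[r] /= max(len(s) - k + 1, 1)
    return X, sorted(vocab, key=vocab.get)

# X, names = kmer_counts(lncrna_seqs)
# model = LGBMClassifier(n_estimators=300).fit(X, chromatin_labels)
# ranked = sorted(zip(model.feature_importances_, names), reverse=True)
# Top-ranked k-mers would then seed the element-assembly step described above.
```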
2025-07-23 15:00:00 15:10:00 02N iRNA Toward a Computational Pipeline for Prokaryotic miRNAs: The Case of Pseudomonas aeruginosa in Lung Disease Laura Veschetti Cristina Cigana, Elisa Lovo, Alessandra Bragonzi, Giovanni Malerba, Laura Veschetti Background: miRNAs are key regulators in eukaryotes, yet little is known about their existence and function in bacteria. Although various noncoding RNAs have been identified in prokaryotes, only a few bacterial miRNAs have been validated. Given the clinical impact of Pseudomonas aeruginosa (PA) in chronic respiratory diseases, we investigated PA-derived miRNAs and their potential interactions with human genes. Motivation: Research has mainly focused on eukaryotic miRNAs, and the lack of computational tools for prokaryotic miRNA prediction has slowed progress in microbial miRNA research. Our study aims to propose a computational framework for bacterial miRNA prediction, with an application to PA. Methods: We analyzed 36 RNA-seq datasets from clinical PA isolates. Precursor miRNAs were predicted and filtered for structural stability. Mature miRNAs were identified through read mapping. Phylogenetic comparison was performed across organisms, and interactions with human UTRs were predicted. In silico validation across 4 PA reference strains was carried out through genome mapping, expression profiling, and de novo predictions. Results: We identified a mean of 422 precursors and 247 mature miRNAs per sample. Some candidates showed homology with human miRNAs and were conserved across species. Predicted targets were enriched in immune, metabolic, and signaling pathways. Fifty-six miRNAs scored highly in the integrative in silico validation. Experimental confirmation is ongoing. Conclusions: We propose a computational framework for identifying bacterial miRNAs with potential roles in host-pathogen interactions. Significance: The knowledge generated through this study advances the characterization of currently under-studied microbial miRNAs, paving the way for therapeutic interventions in chronic respiratory disease.
2025-07-23 15:10:00 15:20:00 02N iRNA Characterisation of the role of SNORD116 in RNA processing during cardiomyocyte differentiation Sofia Kudasheva Sofia Kudasheva, Wilfried Haerty, Terri Holmes, James Smith, Vanda Knitlhoffer Deletions of the SNORD116 small nucleolar RNA cluster result in Prader-Willi syndrome (PWS), a developmental disorder with a complex multisystem phenotype. Emerging clinical data highlight a high incidence of congenital cardiac defects in individuals with PWS, whilst SNORD116 was found to be elevated in a human pluripotent stem cell (hPSC) model of cardiomyopathy. While previous research in neuronal cells has implicated SNORD116 in regulation of RNA processing, its molecular targets and function in the heart remain unclear. To investigate this, we used an hPSC-derived cardiomyocyte model with SNORD116 knockout. We performed Oxford Nanopore long-read sequencing at three differentiation stages to simultaneously detect effects of SNORD116 knockout on alternative splicing, cleavage and polyadenylation (APA), and poly(A) tail length. We identified 40,018 novel isoforms; 174 of which were involved in significant isoform switches between control and SNORD116 knockout. Analysis of functional changes resulting from these switches revealed a developmental stage-dependent shift in 3’UTR usage in knockout cells, characterised by increased distal poly(A) site usage at day 2 and a reversal by day 30. Transcriptome-wide APA analysis confirmed these trends and revealed significant enrichment for predicted SNORD116 binding sites among APA-regulated genes. Notably, genes showing consistent poly(A) tail shortening in SNORD116 KO cells were enriched for ribosomal components, suggesting coordinated regulation of RNA stability and translation. These findings highlight a previously unrecognised role for SNORD116 in modulating APA and poly(A) tail length during cardiomyocyte differentiation, with implications for understanding the molecular underpinnings of PWS-associated cardiac phenotypes.
2025-07-23 15:20:00 15:30:00 02N iRNA EpiCRISPR: Improving CRISPR/Cas9 on-target efficiency prediction by multiple epigenetic marks, high-throughput datasets, and flanking sequences Yaron Orenstein Michal Rahimi, Yaron Orenstein CRISPR/Cas9 has transformed gene editing, enabling targeted modification of genomic loci using a 20-nt guide RNA followed by an NGG motif. However, editing efficiency varies due to target sequence, flanking regions, and epigenetic context. Measuring endogenous efficiency experimentally is labor-intensive, prompting the development of predictive models. Prior models were trained on small datasets, limiting generalizability. Leenay et al. recently released a dataset of ~1,600 endogenous efficiency measurements in T cells. We present EpiCRISPR, a neural network trained on this dataset that integrates guide RNA sequence, flanking regions, epigenetic marks, and high-throughput predictions. We found that incorporating downstream flanking sequences improved prediction (Spearman correlation from 0.309 to 0.375). Including epigenetic features—especially open chromatin, H3K4me3, and H3K27ac—boosted performance to 0.496. Adding high-throughput-based predictions further raised correlation to 0.514. Importantly, EpiCRISPR generalized well across cell types and revealed biologically meaningful feature importance via saliency maps. EpiCRISPR is publicly available at github.com/OrensteinLab/EpiCRISPR.
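A model fusing one-hot guide-plus-flank sequence with epigenetic features and a high-throughput-based prediction, in the spirit described above, might look like this PyTorch sketch; the layer sizes and input choices are assumptions, not the published EpiCRISPR architecture.

```python
import torch
import torch.nn as nn

class SeqEpiNet(nn.Module):
    """Toy fusion of one-hot sequence with per-guide epigenetic features
    (e.g., open chromatin, H3K4me3, H3K27ac) and a high-throughput
    efficiency prediction."""
    def __init__(self, seq_len=30, n_epi=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(32 + n_epi + 1, 64), nn.ReLU(), nn.Linear(64, 1),
        )

    def forward(self, seq_onehot, epi, ht_pred):
        # seq_onehot: (batch, 4, seq_len); epi: (batch, n_epi); ht_pred: (batch, 1)
        x = torch.cat([self.conv(seq_onehot), epi, ht_pred], dim=1)
        return self.head(x).squeeze(-1)     # predicted editing efficiency

model = SeqEpiNet()
out = model(torch.rand(8, 4, 30), torch.rand(8, 3), torch.rand(8, 1))
print(out.shape)   # torch.Size([8])
```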
2025-07-23 15:30:00 15:40:00 02N iRNA Enhancing CRISPR/Cas9 Guide RNA Design Using Active Learning Techniques Stefano Roncelli Stefano Roncelli, Gül Sude Demircan, Christian Anthon, Lars Juhl Jensen, Jan Gorodkin CRISPR/Cas systems have significantly advanced genome editing, yet the precise design of guide RNAs (gRNAs) for optimal efficiency and specificity remains a persistent challenge. The CRISPRnet project seeks to enhance model performance and predictive accuracy by generating new data from gRNAs that are strategically selected to enrich existing datasets. To determine which gRNAs should be validated experimentally, we utilize methods for estimating prediction uncertainty: the gRNAs for which the efficiency prediction models are most uncertain are the ones that would be most valuable to validate experimentally. A key difficulty in this effort lies in the absence of definitive ground truth for model uncertainty. To address this, we modified the state-of-the-art CRISPRon model, a deep neural network that predicts editing efficiency from 30-mer gRNA targets with context sequence and the binding energy between the gRNA spacer and the target DNA. We implemented two approaches: (1) an ensemble of CRISPRon models trained with nested cross-validation to quantify prediction variance, and (2) an ensemble of modified CRISPRon models, extended with an additional classifier head and a customized loss function for uncertainty estimation. The effectiveness of these methods is evaluated through benchmarking against a curated set of candidate gRNAs, enabling data augmentation based on the recommendations made by the models.
2025-07-23 15:40:00 16:00:00 02N iRNA Single-base tiled screen reveals design principles of PspCas13b-RNA targeting and informs automated screening of potent targets Syed Faraz Ahmed Syed Faraz Ahmed, Mohamed Fareh, Wenxin Hu, Matthew R McKay The advancement of RNA therapeutics hinges on developing precise RNA-editing tools with high specificity and minimal off-target effects. We present a framework for optimizing CRISPR PspCas13b, a programmable RNA nuclease with a 30-nucleotide spacer sequence that offers potentially superior targeting specificity. Through single-base tiled screening and computational analyses, we identified critical design principles governing effective RNA recognition and cleavage in human cells. Our analyses revealed position-specific nucleotide preferences that significantly impact crRNA efficiency. Specifically, guanosine bases at positions 1-2 enhance catalytic activity, while cytosine bases at positions 1-4 and 11-17 dramatically reduce efficiency. This positional weighting system forms the foundation of our algorithm, which predicts highly effective crRNAs with ~90% accuracy. Comprehensive spacer-target mutagenesis analysis, implemented through computational modeling, demonstrated that PspCas13b requires ~26-nucleotide base pairing and tolerates only up to four mismatches to activate its nuclease domains. This computational insight explains PspCas13b's superior specificity compared to other RNA interference tools and predicts an extremely low probability of off-target effects, subsequently validated through proteomic analysis. We developed an open-source, R-based computational tool (https://cas13target.azurewebsites.net/) that implements these design principles to generate optimized crRNAs for any target sequence. The tool scores potential crRNAs based on nucleotide composition and position. Additionally, it performs off-target analysis by assessing sequence complementarity with human transcriptome data. This computational approach represents a significant advancement in RNA targeting technology and offers a powerful platform for the development of more effective RNA therapeutics with minimized off-target effects.
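The position-specific preferences reported above lend themselves to a simple position-weighted scorer; the weights below are illustrative placeholders, not the published algorithm's coefficients.

```python
# G at spacer positions 1-2 boosts activity; C at positions 1-4 and
# 11-17 reduces it (positions are 1-based along the 30-nt spacer).
BONUS_G = {1, 2}
PENALTY_C = set(range(1, 5)) | set(range(11, 18))

def score_crRNA(spacer: str) -> float:
    assert len(spacer) == 30, "PspCas13b uses a 30-nt spacer"
    score = 0.0
    for pos, base in enumerate(spacer, start=1):
        if base == "G" and pos in BONUS_G:
            score += 1.0
        if base == "C" and pos in PENALTY_C:
            score -= 1.0
    return score

for spacer in ("GG" + "A" * 28, "CC" + "A" * 28):
    print(spacer[:6] + "...", score_crRNA(spacer))
```

A real scorer would also fold in the mismatch-tolerance finding (~26 nt of pairing required, at most four mismatches tolerated) when screening the transcriptome for off-targets.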
2025-07-23 16:40:00 17:20:00 02N iRNA Gene regulation of human cell systems Roser Vento-Tormo The study of human tissues requires a systems biology approach. Their development starts in utero and during adulthood, they change their organization and cell composition. Our team has integrated comprehensive maps of human developing and adult tissues generated by us and others using a combination of single-cell and spatial transcriptomics, chromatin accessibility assays and fluorescent microscopy. We utilise these maps to guide the development and interpretability of in vitro models. To do so, we develop and apply bioinformatic tools that allow us to quantitatively compare both systems and predict changes.
2025-07-23 17:20:00 17:40:00 02N iRNA EdiSetFlow: A robust pipeline for RNA editing detection and differential analysis in bulk RNA-seq Jacob Munro Jacob Munro, Melanie Bahlo, Brendan Ansell Adenosine-to-inosine (A-to-I) RNA editing is a post-transcriptional modification catalyzed by ADAR enzymes that can alter codons, splicing patterns, and RNA secondary structures. This process is essential for neuronal development and immune function, with dysregulation implicated in neurological disorders, cancers, and autoimmune diseases. Despite its biological importance, accurate detection of RNA editing from RNA-seq data remains technically challenging, and robust inference of differential editing between experimental conditions is not straightforward. To address these challenges, we have developed EdiSetFlow, a reproducible and scalable pipeline for transcriptome-wide A-to-I RNA editing analysis from bulk RNA-seq data. EdiSetFlow is implemented in Nextflow; it takes raw FASTQ files as input, performs read trimming and quality filtering, aligns reads to the reference genome, and identifies editing sites with JACUSA. Common genetic variants are excluded based on the gnomAD population database. Identified sites are annotated for gene context and predicted functional consequences, with results summarized in a user-friendly HTML report. The pipeline is designed to efficiently scale to hundreds or thousands of samples, making it suitable for large datasets such as GTEx. An accompanying R package enables advanced analyses, including model fitting, hypothesis testing, false discovery rate control, and visualisations, facilitating reliable statistical comparisons of editing between experimental groups. Applying EdiSetFlow to GTEx brain RNA-seq data, we uncovered distinct RNA editing signatures across brain regions, identifying both known and previously uncharacterized regional editing patterns. EdiSetFlow provides researchers with a robust, end-to-end solution to efficiently discover and interpret biologically meaningful RNA editing events in diverse transcriptomic datasets.
2025-07-23 17:40:00 18:00:00 02N iRNA Statistical modeling of single-cell epitranscriptomics enabled trajectory and regulatory inference of RNA methylation Jia Meng As a fundamental mechanism for gene expression regulation, post-transcriptional RNA methylation plays versatile roles in various biological processes and disease mechanisms. Recent advances in single-cell technology have enabled simultaneous profiling of transcriptome-wide RNA methylation in thousands of cells, holding the promise to provide deeper insights into the dynamics, functions, and regulation of RNA methylation. However, it remains a major challenge to determine how to best analyze single-cell epitranscriptomics data. In this study, we developed SigRM, a computational framework for effectively mining single-cell epitranscriptomics datasets with a large cell number, such as those produced by the scDART-seq technique from the SMART-seq2 platform. SigRM not only outperforms state-of-the-art models in RNA methylation site detection on both simulated and real datasets but also provides rigorous quantification metrics of RNA methylation levels. This facilitates various downstream analyses, including trajectory inference and regulatory network reconstruction concerning the dynamics of RNA methylation.
2025-07-24 08:40:00 09:00:00 02N iRNA Prediction and validation of Split Open Reading Frames across cell types Christina Kalk Christina Kalk, Marcel Schulz, Michaela Müller-McNicoll, Vladimir Despic, Mauro Siragusa, Justin Murtagh Background: Split Open Reading frames (Split-ORFs) exist on transcripts containing at least two open reading frames, each of which encodes a part of the same full-length protein. These multiple open reading frames arise from alternatively spliced transcript isoforms. The phenomenon of Split-ORFs has been observed for the SR protein family of splicing factors, where the Split-ORF proteins play important autoregulatory roles. Aims/purpose: The aim of this study was to investigate the translation and expression of Split-ORFs. Methods: We built a pipeline that predicts potential Split-ORFs for a user supplied set of transcripts and determines the regions unique to the potential Split-ORFs. These unique regions are absent from protein coding transcripts. The translation of the predicted Split-ORFs can be validated by finding their unique regions in Ribo-seq or proteomics data. Results: The Split-ORF pipeline was applied to a set of transcripts containing premature termination codons or retained introns. Novel Split-ORF transcripts and their unique regions were predicted and a substantial fraction had significant Ribo-seq coverage in data from different cell types. Additionally, the Split-ORF candidate start sites had a significantly higher probability of being translation initiation sites than background sites as predicted by a deep neural network. Outlook: These results suggest that the occurrence of Split-ORFs is more widespread than previously assumed and that they are expressed across different cell types. This paves the road for further functional investigations of the validated Split-ORF candidates and mechanisms of their biogenesis.
2025-07-24 09:00:00 09:20:00 02N iRNA Bridging the Gap: Recalibrating In-vitro Models for Accurate In-vivo RBP Binding Predictions Ilyes Baali Ilyes Baali, Alexander Sasse, Quaid Morris Accurate identification of RNA-binding protein (RBP) binding sites is essential for understanding post-transcriptional gene regulation. However, current models face two major challenges: the limited availability of in vivo data and the poor generalization of models trained solely on in vitro assays. These limitations hinder our ability to make reliable in vivo predictions and obscure the true regulatory roles of RBPs in cellular contexts. This study aims to understand the root causes of discrepancies between these two assay types. By analyzing data from both assays, we investigate whether differences arise from biological context, experimental artifacts, or model limitations. To address these challenges, we introduce a recalibration model that integrates in vitro and in vivo data to improve prediction accuracy and interpretability. We evaluate model performance across multiple generalization tasks—including chromosome, cell-type, and RBP-wise splits—and find that in-vitro-only models generalize poorly to in-vivo settings. In contrast, the recalibrated model significantly improves performance and even outperforms in-vivo-only models, demonstrating the added value of recalibrated in-vitro data. Feature importance analysis shows that the recalibration model corrects for incomplete binding preferences in in vitro assays and adjusts for assay-specific artifacts, such as G-rich motif enrichment in eCLIP. These findings suggest that many observed differences between assays are driven by technical biases rather than fundamental biological divergence and highlight the importance of accounting for such factors when modeling RBP binding in vivo.
2025-07-24 09:20:00 09:40:00 02N iRNA Multi-Tool Intron Retention Analysis in Autism Adi Gershon Adi Gershon, Saira Jabeen, Asa Ben Hur, Maayan Salton Intron retention (IR) is an alternative splicing event in which introns remain in mature mRNA, altering protein isoforms or triggering transcript decay. Recent evidence highlights IR’s involvement in key biological processes, including neurodevelopment. However, quantifying IR remains difficult due to intronic complexity and ambiguous read mapping. We systematically analyzed IR in autism spectrum disorder (ASD) using three computational tools with distinct strategies: rMATS (junction-based modeling), IRFinder (intron/spliced read ratios), and iDiffIR (log fold-change in intron coverage). Our focus was on six splicing factors (NOVA2, RBFOX1, SRRM2, SART3, U2AF2, WBP4) implicated in syndromic ASD, alongside idiopathic ASD brain tissue. We aimed to identify shared IR events that might reflect underlying splicing dysregulation in ASD. All tools revealed hundreds of significantly altered introns in ASD and splicing factor models, consistently showing increased retention in ASD or mutant conditions. Despite tool-specific differences, we identified 574 genes with significant intron retention in both splicing factor models and ASD brain, enriched for neurodevelopmental pathways and known autism genes. At the event level, 21 introns were detected across multiple splicing factor models and ASD brains, enriched for transcription factor motifs such as TFAP2A and PLAGL2, suggesting shared regulatory mechanisms. Notably, rMATS and IRFinder detected more events and showed pronounced associations with intron length and GC content, whereas iDiffIR displayed greater variability. Our multi-tool approach highlights the complexity of IR detection and underscores the value of integrating complementary strategies to elucidate splicing dysregulation in ASD. These findings provide prioritized IR candidates for future functional studies in neurodevelopmental disorders.
2025-07-24 09:40:00 10:00:00 02N iRNA Detection of statistically robust interactions from diverse RNA-DNA ligation data Timothy Warwick Simonida Zehr, Ralf Brandes, Marcel Schulz, Timothy Warwick Background: Chromatin-localized RNAs play key roles in gene regulation and nuclear architecture. Genome-wide RNA-DNA interactions can be mapped using molecular methods like RADICL-seq, GRID-seq, Red-C, and ChAR-seq, which utilize bridging oligonucleotides for RNA-DNA ligation. Despite advancements in these methods, a computational tool for reliably identifying biologically meaningful RNA-DNA interactions is lacking. Approach: Herein, we present RADIAnT, a reads-to-interactions pipeline for analysing RNA-DNA ligation data. These data are often confounded by multiple factors, including nascent transcription and expression differences. To manage these confounders, RADIAnT calls interactions against a dataset-specific, unified background which considers RNA binding site-TSS distance, genomic region bias and relative RNA abundance. Results: By calling interactions against the multifactor background described above, RADIAnT is sensitive enough to detect specific interactions of lowly expressed transcripts, while remaining specific enough to discount false positive interactions of highly abundant RNAs. In addition to calling consistent interactions between different molecular methodologies, RADIAnT outperforms previously proposed methods in the accurate identification of genome-wide Malat1-DNA interactions in murine data, and NEAT1-DNA interactions in human cells, with orthogonal one-to-all data used to classify binding regions in each case. In a further use case, RADIAnT was utilized to identify dynamic chromatin-associated RNAs in the physiologically- and pathologically-relevant process of endothelial-to-mesenchymal transition. Conclusion: RADIAnT represents a reproducible, generalisable approach for analysis of RNA-DNA ligation data, and provides users with statistically stratified RNA-DNA interactions which can be probed for biological function.
2025-07-24 11:20:00 11:50:00 02N iRNA Building the future of RNA tools Blake Sweeney Blake Sweeney From the epitranscriptome and 3D structure prediction to large language models, RNA science is experiencing a transformative shift. Recent advances in RNA 3D structure prediction and RNA-focused language models represent early milestones in what's possible. The explosion in data availability and computational power will fundamentally change how we approach RNA research. This computational revolution will be shaped by the tools we build today. This talk serves as an introduction to our special section and panel discussion, where we'll discuss frontiers in RNA tool development. This talk will outline the key themes that our following speakers and panel will explore in detail. In the panel, we aim to tackle questions like: What are the highest-impact tools missing from our current toolkit? What problems can machine learning solve, and what limitations does it face in RNA science? How can these limitations be overcome? What would it take to make sophisticated RNA analysis accessible to every researcher? We encourage anyone interested in RNA research or seeking new computational frontiers to attend this section and contribute to the following panel discussion.
2025-07-24 11:50:00 12:00:00 02N iRNA Sci-ModoM: a quantitative database of transcriptome-wide high-throughput RNA modification sites promoting cross-disciplinary collaborative research Etienne Boileau Etienne Boileau, Harald Wilhelmi, Anne Busch, Andrea Cappannini, Andreas Hildebrand, Janusz M Bujnicki, Christoph Dieterich We recently presented Sci-ModoM [1], the first next-generation RNome database offering a one-stop source for RNA modifications originating from state-of-the-art high-resolution detection methods. Sci-ModoM provides quantitative measurements per site and dataset, enabling researchers, including non-experts, to assess the confidence level of the reported modifications across datasets. Currently, users can Search and Compare over seven million modifications across 162 datasets, Browse or download datasets, and retrieve metadata; and these figures keep growing as data is continuously added. Sci-ModoM addresses critical challenges that are foundational to open science such as the need for standardized nomenclatures, common standards and guidelines for data sharing. It promotes data reuse, as it relies solely on the authors' published results; data are accessible in a human-readable, interoperable format, developed in consultation with the community [2]. In this talk, we will present Sci-ModoM in the context of a broader pan-European roadmap to (i) facilitate access to and sharing of high-throughput transcriptome-wide RNA modification data, and (ii) to promote data-driven sustainability in the development of reliable methods to map and identify RNA modifications. Our current work aims to expand the different RNA types (mRNA, non-coding RNA, tRNA, rRNA) in Sci-ModoM, to further establish FAIR data treatment, and to improve guidelines for data analysis and exchange, under the umbrella of the Human RNome project [3]. [1] Etienne Boileau, Harald Wilhelmi, Anne Busch, Andrea Cappannini, Andreas Hildebrand, Janusz M. Bujnicki, Christoph Dieterich. Sci-ModoM: a quantitative database of transcriptome-wide high-throughput RNA modification sites Nucleic Acids Research, 2024, gkae972. [2] https://dieterich-lab.github.io/euf-specs [3] https://humanrnomeproject.org
2025-07-24 12:00:00 12:10:00 02N iRNA RNAtranslator: A Generative Language Model for Protein-Conditional RNA Design A. Ercument Cicek Sobhan Shukueian Tabrizi, Sina Barazandeh, Helya Hashemi Aghdam, A. Ercument Cicek Protein-RNA interactions are essential in gene regulation, splicing, RNA stability, and translation, making RNA a promising therapeutic agent for targeting proteins, including those considered undruggable. However, designing RNA sequences that selectively bind to proteins remains a significant challenge due to the vast sequence space and the limitations of current experimental and computational methods. Traditional approaches rely on in vitro selection techniques or computational models that require post-generation optimization, restricting their applicability to well-characterized proteins. We introduce RNAtranslator, a generative language model that formulates protein-conditional RNA design as a sequence-to-sequence natural language translation problem for the first time. By learning a joint representation of RNA and protein interactions from large-scale datasets, RNAtranslator directly generates binding RNA sequences for any given protein target without the need for additional optimization. Our results demonstrate that RNAtranslator produces RNA sequences with natural-like properties, high novelty, and enhanced binding affinity compared to existing methods. This approach enables efficient RNA design for a wide range of proteins, paving the way for new RNA-based therapeutics and synthetic biology applications. The model and the code are released at github.com/ciceklab/RNAtranslator.
2025-07-24 12:10:00 12:20:00 02N iRNA miRXplain: transformer-driven explainable microRNA target prediction leveraging isomiR interactions Giulia Cantini Ranjan Kumar Maji, Giulia Cantini, Hui Cheng, Annalisa Marsico, Marcel Schulz microRNAs (miRNAs) are short (~22 nt) RNA sequences that act as key regulators of transcript expression. miRNAs bind to target mRNA sites to repress genes. isomiRs, generated by alternative processing of miRNA hairpins during biogenesis, exhibit variations that shift the seed position relative to their canonical forms. This results in the selection of a different target transcript repertoire than the canonical form, diversifying miRNA regulation. However, the mRNA configurations that enable miRNA target selection are still undetermined. isomiR targets, together with canonical miRNA targets, have not been studied due to the lack of high-throughput experiments that capture the exact miRNAs bound to their targets. Deep learning (DL) approaches have neither used such datasets nor investigated isomiR target interactions. To address this gap, we developed a new transformer model, miRXplain, that predicts miRNA target interactions using miRNA and target sequences from CLIP-L chimeras. We analyzed CLIP-L experiments, which tether exact miRNA variants to their mRNA target sites. We annotated these interactions and revealed nucleotide biases at the 5’ end of the target region. We addressed these biases and constructed miRNA and interacting-site pairs to learn how isomiRs differ from their canonical forms in target interaction. miRXplain outperformed all benchmarked models and performed on par with TEC-miTarget while training ~2 times faster per epoch. Model attention weights revealed distinct importance of nucleotide positions for canonical and isomiR types. miRXplain can contribute to the discovery of isomiR targeting rules to enhance our understanding of miRNA biology. Code availability: https://github.com/marsico-lab/miRXplain.
2025-07-24 12:20:00 12:30:00 02N iRNA Designing functional RNA sequences using conditional diffusion models Cho Joohyun Cho Joohyun, Daniil Melnichenko, Jongmin Lim, Dongsup Kim, Young-suk Lee The function of RNA is largely determined by its networks of protein-RNA interactions. A key challenge in RNA engineering is designing the sequence in a manner that controls its interacting partners. Toward this goal, we built an RNA sequence generator using conditional diffusion models that automatically designs sequences based on the structure of a given RNA-binding protein. The RNA generator is a single unified deep-learning framework of 64 million parameters and is trained on high-quality structure data of 1,190 distinct protein-RNA complexes. The model’s cross-attention mechanism suggests that it learns the evolutionary homology of protein-RNA interactions. When benchmarking on RoseTTAFoldNA’s training and test dataset, we find that our model generates RNA sequences with AlphaFold3 confidence scores comparable to the bound RNA sequence. In all, these results call for experimental confirmation from a complementary source of protein-RNA interaction and expand the possibility of automatically designing functional RNAs for biomedical applications.
2025-07-24 12:30:00 13:00:00 02N iRNA Panel: The future of RNA tools
2025-07-24 14:00:00 14:40:00 02N iRNA Decoding RNA language in plants Yiliang Ding RNA structure plays an important role in the post-transcriptional regulation of gene expression. Using in vivo RNA structure profiling methods, we have determined the functional roles of RNA structure in diverse biological processes such as mRNA processing (splicing and polyadenylation), translation and RNA degradation in plants. We also developed a new method to reveal the existence of tertiary RNA G-quadruplex structures in eukaryotes and uncovered that the RNA G-quadruplex structure serves as a molecular marker to facilitate plant adaptation to the cold during evolution. Additionally, we have developed a single-molecule RNA structure profiling method and revealed the functional importance of RNA structure in long noncoding RNAs. Recently, we established a powerful RNA foundation model, PlantRNA-FM, that facilitates the exploration of functional RNA structure motifs across transcriptomes.
2025-07-24 14:40:00 15:00:00 02N iRNA EnsembleDesign: Messenger RNA Design Minimizing Ensemble Free Energy via Probabilistic Lattice Parsing Liang Huang Ning Dai, Tianshuo Zhou, Wei Yu Tang, David Mathews, Liang Huang The task of designing optimized messenger RNA (mRNA) sequences has received much attention in recent years thanks to breakthroughs in mRNA vaccines during the COVID-19 pandemic. Most previous work aimed to minimize the minimum free energy (MFE) of the mRNA in order to improve stability and protein expression; however, MFE considers only one particular structure per mRNA sequence, neglecting the millions of alternative conformations in equilibrium. More importantly, we prefer an mRNA to populate multiple stable structures and be flexible among them during translation when the ribosome unwinds it. Therefore, we consider a new objective: minimizing the ensemble free energy of an mRNA, which accounts for all possible structures in its Boltzmann ensemble. However, this new problem is much harder to solve than the original MFE optimization. To address the increased complexity of this problem, we introduce EnsembleDesign, a novel algorithm that employs continuous relaxation to optimize the expected ensemble free energy over a distribution of candidate sequences. EnsembleDesign extends both the lattice representation of the design space and the dynamic programming algorithm from LinearDesign to their probabilistic counterparts. Our algorithm consistently outperforms LinearDesign in terms of ensemble free energy, especially on long sequences. Interestingly, as byproducts, our designs also enjoy lower average unpaired probabilities (AUP, which correlates with degradation) and flatter Boltzmann ensembles (more flexibility between conformations). Our code is available at: https://github.com/LinearFold/EnsembleDesign.
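For readers new to the objective, the standard thermodynamic definitions make the MFE-versus-ensemble contrast explicit; the sum over all structures y in the Boltzmann ensemble Y(x) is exactly what the probabilistic lattice parsing must keep tractable:

```latex
% MFE design scores a sequence x by its single best structure, whereas
% ensemble design scores it by the partition function Z(x) over all
% structures y in the Boltzmann ensemble Y(x):
\Delta G_{\mathrm{MFE}}(x) = \min_{y \in \mathcal{Y}(x)} \Delta G(x, y),
\qquad
\Delta G_{\mathrm{ens}}(x) = -RT \ln Z(x),
\quad
Z(x) = \sum_{y \in \mathcal{Y}(x)} e^{-\Delta G(x, y)/RT}
```

Because Z(x) includes the MFE structure's Boltzmann weight, the ensemble free energy is always at most the MFE, and it is lowered further when many stable conformations share the probability mass.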
2025-07-24 15:00:00 15:20:00 02N iRNA Machine learning-guided isoform quantification in bulk and single-cell RNA-seq using joint short- and long-read modeling Hamed Najafabadi Michael Apostolides, Jichen Wang, Ali Saberi, Benedict Choi, Hani Goodarzi, Hamed Najafabadi Accurate quantification of transcript isoforms is crucial for understanding gene regulation, functional diversity, and cellular behavior. Existing RNA sequencing methods have important limitations: short-read (SR) sequencing provides high depth but struggles with isoform deconvolution, especially in single-cell data where substantial positional biases are common; long-read (LR) sequencing offers isoform resolution but suffers from lower depth, higher noise, and technical biases. To address these challenges, we introduce Multi-Platform Aggregation and Quantification of Transcripts (MPAQT), a generative model that combines the complementary strengths of multiple RNA-seq platforms to achieve state-of-the-art isoform-resolved quantification. MPAQT explicitly models platform-specific biases, including positional biases in short-read single-cell data and sequence-dependent biases in LR data. We show that MPAQT enables state-of-the-art gene- and isoform-level quantification both in SR-only single-cell data and in bulk datasets integrating SR and LR reads. By applying MPAQT to an in vitro model of human embryonic stem cell differentiation into cortical neurons, followed by machine learning-based modeling of transcript abundances, we show that untranslated regions (UTRs) are major determinants of isoform proportion and exon usage. This effect is mediated through isoform-specific sequence features embedded in UTRs, which interact with RNA-binding proteins that modulate mRNA stability. We further demonstrate that machine learning-based predictions can be fed back into MPAQT to resolve ambiguities in read-to-isoform assignment, resulting in more accurate abundance estimates. These findings highlight MPAQT’s potential to enhance our understanding of transcriptomic complexity across platforms and cell types, while bridging statistical quantification with machine learning models of isoform regulation.
2025-07-24 15:20:00 15:40:00 02N iRNA Long-read RNA sequencing unveils a novel cryptic exon in MNAT1 along with its full-length transcript structure in TDP-43 proteinopathy Yoshihisa Tanaka Yoshihisa Tanaka, Naohiro Sunamura, Rei Kajitani, Marie Ikeguchi, Ryo Kunimoto Understanding the role of transcript isoforms is crucial for dissecting disease mechanisms. TAR DNA binding protein-43 (TDP-43) is a key regulator of RNA splicing, and its dysfunction in neurons is a hallmark of some neurodegenerative diseases such as amyotrophic lateral sclerosis (ALS) and frontotemporal degeneration (FTD). Specifically, TDP-43 maintains proper splicing by preventing the aberrant inclusion of cryptic exons into mRNA, thereby preserving normal transcript isoforms. Although TDP-43-dependent cryptic exons have been implicated in disease pathogenesis, an approach to investigate how cryptic exons disrupt transcript isoforms has yet to be established. To address this, we developed IsoRefiner, a novel method for identifying full-length transcript structures using long-read RNA-seq. Our results show that IsoRefiner outperforms existing long-read analysis tools. Leveraging this method, we conducted long-read RNA-seq, guided by prior short-read RNA-seq, to comprehensively resolve the full-length structures of aberrant transcripts caused by TDP-43 depletion in human induced pluripotent stem cell (iPSC)-derived motor neurons. This led to the discovery of a novel TDP-43-dependent cryptic exon in the MNAT1 gene, along with its full-length transcript structure. Furthermore, we confirmed the presence of the MNAT1 cryptic exon in tissues derived from patients with ALS and FTD. Our findings deepen understanding of TDP-43 proteinopathy, and our approach provides a powerful framework for investigating splicing mechanisms across diverse cellular and disease contexts.
2025-07-24 15:40:00 15:50:00 02N iRNA Transcriptome Universal Single-isoform COntrol (TUSCO): A Framework for Evaluating Transcriptome Quality Tianyuan Liu Tianyuan Liu, Adam Frankish, Ana Conesa, Alejandro Paniagua, Fabian Jetzinger Long-read sequencing (LRS) platforms, such as Oxford Nanopore (ONT) and Pacific Biosciences (PacBio), enable comprehensive transcriptome analysis but face challenges such as sequencing errors, sample quality variability, and library preparation biases. Current benchmarking approaches address these issues insufficiently: BUSCO assesses transcriptome completeness using conserved single-copy orthologs but can misinterpret alternative splicing as gene duplications, while spike-ins (SIRVs, ERCCs) oversimplify real-sample complexity, neglecting RNA degradation and RNA extraction artifacts, thus inflating performance metrics. To overcome these limitations, we introduce the Transcriptome Universal Single-isoform COntrol (TUSCO), a curated internal reference set of conserved genes lacking alternative isoforms. TUSCO evaluates precision by identifying transcripts deviating from reference annotations and assesses sensitivity by verifying detection completeness in human and mouse samples. Our validation demonstrates that TUSCO provides accurate and reliable benchmarking without external controls, significantly improving quality control standards for transcriptome reconstruction using LRS.
2025-07-24 15:50:00 16:00:00 02N iRNA Concluding remarks and poster prizes Maayan Salton
2025-07-20 18:30:00 19:30:00 01A Distinguished Keynotes John Jumper
2025-07-21 08:40:00 09:00:00 01A Distinguished Keynotes Morning Welcome and Keynote Introduction
2025-07-21 09:00:00 10:00:00 01A Distinguished Keynotes Plus ça change, plus c'est la même chose Amos Bairoch
2025-07-22 08:40:00 09:00:00 01A Distinguished Keynotes Morning Welcome and Keynote Introduction
2025-07-22 09:00:00 10:00:00 01A Distinguished Keynotes James Zou
2025-07-23 08:40:00 09:00:00 01A Distinguished Keynotes Morning Welcome and Keynote Introduction
2025-07-23 09:00:00 10:00:00 01A Distinguished Keynotes Charlotte Deane
2025-07-24 16:20:00 18:00:00 01A Distinguished Keynotes Decoding cellular systems: From observational atlases to generative interventions David Baker, David Baker, Fabian Theis Over the past decade, the field of computational cell biology has undergone a transformation — from cataloging cell types to modeling how cells behave, interact, and respond to perturbations. In this talk, I will review and explore how machine learning is enabling this shift, focusing on two converging frontiers: integrated cellular mapping and actionable generative models. I’ll begin with a brief overview of recent advances in representation learning for atlas-scale integration, highlighting work across the Human Cell Atlas and beyond. These efforts aim to unify diverse single-cell and spatial modalities into shared manifolds of cellular identity and state. As one example, I will present our recent multimodal atlas of human brain organoids, which integrates transcriptomic variation across development and lab protocols. From there, I’ll review the emerging landscape of foundation models in single-cell genomics, including our work on Nicheformer, a transformer trained on millions of spatial and dissociated cells. These models offer generalizable embeddings for a range of tasks — but more importantly, they set the stage for predictive modeling of biological responses. I’ll close by introducing perturbation models that leverage generative AI to model interventions on these systems. As an example, I will show Cellflow, a generative framework that learns how perturbations such as drugs, cytokines, or gene edits shift cellular phenotypes. It enables virtual experimental design, including in silico protocol screening for brain organoid differentiation. This exemplifies a move toward models that not only interpret biological systems, but help shape them.
2025-07-23 14:40:00 15:20:00 01B MICROBIOME Microbiome multitudes and metadata madness Fiona Brinkman Fiona Brinkman Microbiome analysis is increasingly becoming a critical component of a wide range of health, agri-foods, and environmental studies. I will present case studies showing the benefit of integrating very diverse metadata into such analyses - and also pitfalls to watch out for. The results of one such cohort study will be further presented, illustrating the need for analyses that allow one to flexibly view metadata in the context of microbiome data. The results support the multigenerational importance of “healthy
2025-07-23 15:20:00 15:30:00 01B MICROBIOME Species-level taxonomic profiling of Earth’s microbiomes with mOTUs4 Marija Dmitrijeva Marija Dmitrijeva, Hans-Joachim Ruscheweyh, Lilith Feer, Kang Li, Samuel Miravet-Verde, Anna Sintsova, Andrew Abi Younes, Wolf-Dietrich Hardt, Daniel Mende, Georg Zeller, Shinichi Sunagawa Microbial communities are crucial to the health and functioning of diverse ecosystems on Earth. A key step in their analysis is taxonomic profiling, i.e., the identification and quantification of microbial community composition, typically done by comparing environmental samples to reference genome collections. However, species from underexplored ecosystems are poorly represented in public databases, limiting the accuracy of taxonomic profiling tools. Here, we present mOTUs4 and its accompanying online database, accessible at https://motus-db.org/. This resource comprises 2.83 million metagenome-assembled genomes (MAGs) recovered from over 50 environments using a unified genome reconstruction workflow. The MAGs are accompanied by 919,090 genomes from reference databases, totalling 3.75 million prokaryotic genomes. mOTUs4 can profile 124,295 species, expanding taxonomic coverage of underrepresented ecosystems. The associated genomic data can be interactively browsed online and filtered based on taxonomy, mOTUs identifiers, and genome quality metrics; the user-friendly interface minimises the need for programming skills to link profiling results with genomic context. In addition, the output produced by mOTUs4 can serve as a proxy for the number of cells within a sample, allowing its use as a scaling factor for normalising gene counts. This makes it possible to use the profiler output to calculate per-cell copy numbers of diverse gene functional groups, such as antimicrobial resistance genes. By improving accuracy and interpretability in taxonomic profiling across diverse ecosystems and standardising quantification of gene functional groups, mOTUs4 offers a scalable approach to microbial community analysis.
2025-07-23 15:30:00 15:40:00 01B MICROBIOME Accurate profiling of microbial communities for shotgun metagenomic sequencing with Meteor2 Amine Ghozlane Amine Ghozlane, Florence Thirion, Florian Plaza Oñate, Franck Gauthier, Emmanuelle Le Chatelier, Anita Annamalé, Mathieu Almeida, Stanislav Ehrlich, Nicolas Pons The characterization of complex microbial communities is a critical challenge in microbiome research. Metagenomic profiling has advanced to include taxonomic, functional, and strain-level profiling (TFSP) of microbial communities. We present Meteor2, a tool that leverages compact, environment-specific microbial gene catalogues to deliver comprehensive TFSP insights from metagenomic samples. Meteor2 currently supports ten ecosystems, with 63,494,365 microbial genes clustered into 11,653 metagenomic species pangenomes (MSPs). In benchmark tests, Meteor2 demonstrated strong performance in TFSP, excelling in detecting low-coverage species. It improved species detection sensitivity by at least 45% compared to other tools, such as MetaPhlAn4 and sylph, in human and mouse gut microbiota simulations. For functional profiling, Meteor2 improved abundance estimation accuracy by at least 35% compared to HUMAnN3. Additionally, Meteor2 tracked more strain pairs than StrainPhlAn, capturing an additional 9.8% on the human dataset and 19.4% on the mouse dataset. In its fast configuration, Meteor2 emerges as one of the fastest available tools for profiling, requiring only 2.3 minutes for taxonomic analysis and 10 minutes for strain-level analysis against the human microbial gene catalogue when processing 10M paired reads — operating within a modest 5GB RAM footprint. We further validated Meteor2 using a published faecal microbiota transplantation (FMT) dataset, demonstrating its ability to deliver extensive and actionable metagenomic analysis. As an open-source, easy-to-install, and accurate analysis platform, Meteor2 is highly accessible to researchers, facilitating the exploration of complex microbial ecosystems. Meteor2 is available on GitHub (https://github.com/metagenopolis/meteor) and bioconda (bioconda/meteor). A preprint is currently available here (DOI: 10.21203/rs.3.rs-6122276/v1).
2025-07-23 15:40:00 15:50:00 01B MICROBIOME Benchmarking metagenomic binning tools on real datasets across sequencing platforms and binning modes Shanfeng Zhu Haitao Han, Ziye Wang, Shanfeng Zhu Metagenomic binning is a culture-free approach that facilitates the recovery of metagenome-assembled genomes by grouping genomic fragments. However, there remains a lack of a comprehensive benchmark to evaluate the performance of metagenomic binning tools across various combinations of data types and binning modes. In this study, we benchmark 13 metagenomic binning tools using short-read, long-read, and hybrid data under co-assembly, single-sample, and multi-sample binning, respectively. The benchmark results demonstrate that multi-sample binning exhibits optimal performance across short-read, long-read, and hybrid data. Moreover, multi-sample binning outperforms other binning modes in identifying potential antibiotic resistance gene hosts and near-complete strains containing potential biosynthetic gene clusters across diverse data types. This study also recommends three efficient binners across all data-binning combinations, as well as high-performance binners for each combination.
2025-07-23 15:50:00 16:00:00 01B MICROBIOME Metagenomics-Toolkit: The Flexible and Efficient Cloud-Based Metagenomics Workflow Nils Kleinbölting Nils Kleinbölting, Peter Belmann, Benedikt Osterholz The metagenome analysis of complex environments with thousands of datasets, such as those available in the Sequence Read Archive, requires immense computational resources to complete the work within an acceptable time frame. Such large-scale analyses require that the underlying infrastructure is used efficiently. In addition, any analysis should be fully reproducible and the workflow must be publicly available to allow other researchers to understand the reasoning behind computed results. To address this challenge, we have developed and would like to present the Metagenomics-Toolkit, a scalable, data-agnostic workflow that automates the analysis of short and long metagenomic reads obtained from Illumina or Oxford Nanopore Technology devices, respectively. The Metagenomics-Toolkit offers not only standard features expected in a metagenome workflow, such as quality control, assembly, binning, and annotation, but also distinctive features, such as plasmid identification based on various tools, the recovery of unassembled microbial community members and the discovery of microbial interdependencies through a combination of dereplication, co-occurrence, and genome-scale metabolic modeling. Furthermore, the Metagenomics-Toolkit includes a machine learning-optimized assembly step that tailors the peak RAM value requested by a metagenome assembler to match actual requirements, thereby minimizing the dependency on dedicated high-memory hardware. While the Metagenomics-Toolkit can be executed on user workstations, it also offers several optimizations for efficient cloud-based cluster execution.
2025-07-23 16:40:00 17:20:00 01B MICROBIOME TBA Rob Knight
2025-07-23 17:20:00 17:40:00 01B MICROBIOME DNABERT-S: Pioneering Species Differentiation with Species-Aware DNA Embeddings Weimin Wu Zhihan Zhou, Weimin Wu, Harrison Ho, Jiayi Wang, Lizhen Shi, Ramana Davuluri, Zhong Wang, Han Liu We introduce DNABERT-S, a tailored genome model that develops species-aware embeddings to naturally cluster and segregate DNA sequences of different species in the embedding space. Differentiating species from genomic sequences (i.e., DNA and RNA) is vital yet challenging, since many real-world species remain uncharacterized, lacking known genomes for reference. Embedding-based methods are therefore used to differentiate species in an unsupervised manner. DNABERT-S builds upon a pre-trained genome foundation model named DNABERT-2. To encourage effective embeddings for error-prone long-read DNA sequences, we introduce Manifold Instance Mixup (MI-Mix), a contrastive objective that mixes the hidden representations of DNA sequences at randomly selected layers and trains the model to recognize and differentiate these mixed proportions at the output layer. We further enhance it with the proposed Curriculum Contrastive Learning (C²LR) strategy. Empirical results on 23 diverse datasets show DNABERT-S's effectiveness, especially in realistic label-scarce scenarios. For example, it identifies twice as many species from a mixture of unlabeled genomic sequences, doubles the Adjusted Rand Index (ARI) in species clustering, and outperforms the top baseline in 10-shot species classification with just 2-shot training.
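A compact sketch of the MI-Mix idea as described in the abstract: run two tokenized DNA sequences through the same encoder, mix their hidden states at a randomly chosen layer, and train a head to recover the mixing proportion. The architecture and loss below are simplified assumptions, not the exact DNABERT-S training code.

```python
# Simplified sketch of Manifold Instance Mixup (MI-Mix): mix the hidden
# states of two DNA sequences at a random encoder layer and train the model
# to recover the mixing proportion. Architecture and loss are assumptions,
# not the exact DNABERT-S objective.
import random
import torch
import torch.nn as nn

class MixupEncoder(nn.Module):
    def __init__(self, vocab=8, d=64, n_layers=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
            for _ in range(n_layers))
        self.lam_head = nn.Linear(d, 1)  # predicts the mixing proportion

    def forward(self, a, b):
        ha, hb = self.emb(a), self.emb(b)
        k = random.randrange(len(self.layers))  # randomly selected mix layer
        lam = torch.rand(())                    # mixing proportion in [0, 1)
        for layer in self.layers[:k + 1]:       # run both streams to layer k
            ha, hb = layer(ha), layer(hb)
        h = lam * ha + (1 - lam) * hb           # mix hidden representations
        for layer in self.layers[k + 1:]:       # continue on the mixture
            h = layer(h)
        return self.lam_head(h.mean(dim=1)).squeeze(-1), lam

enc = MixupEncoder()
a = torch.randint(0, 8, (2, 32))  # two mini-batches of tokenized DNA
b = torch.randint(0, 8, (2, 32))
pred, lam = enc(a, b)
loss = ((pred - lam) ** 2).mean()  # learn to recognize the mixed proportion
loss.backward()
```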
2025-07-23 17:40:00 17:50:00 01B MICROBIOME MGnify Genomes: generating richly annotated, searchable biome-specific genome catalogues Tatiana Gurbich Tatiana Gurbich, Germana Baldi, Martin Beracochea, Alejandra Escobar-Zepeda, Varsha Kale, Jennifer Lu, Lorna Richardson, Alexander Rogers, Ekaterina Sakharova, Mahfouz Shehu, Robert Finn The generation of metagenome-assembled genomes (MAGs) has become a routine method for studying microbiomes. With the growing availability of MAGs in public repositories, MGnify, a free platform for metagenomic data assembly, analysis, and archiving, has introduced MGnify Genomes. This resource serves as a hub for systematically organising and annotating publicly available MAGs and isolate genomes into non-redundant, biome-specific catalogues. The resource includes over half a million genomes and has recently expanded to incorporate eukaryotic genomes in addition to prokaryotic ones. These genomes are sourced from a wide range of biomes, including both host-associated and environmental contexts. Within each biome, genomes are organised into species-level clusters, with the highest-quality genome selected as the representative, prioritising isolate genomes over MAGs. Each representative genome is richly annotated with comprehensive functional information, including antimicrobial resistance. Additional annotations cover biosynthetic gene clusters, carbohydrate metabolism (including polysaccharide utilisation loci), non-coding RNAs, CRISPRs, phage sequences, plasmids, and integrative mobile elements. An open-source Nextflow pipeline is maintained for generating new catalogues and updating existing ones. The platform offers multiple ways to utilise these references: each biome-specific catalogue is accompanied by Kraken2, protein, and gene databases. A fast, k-mer-based search tool is available on the MGnify Genomes website, allowing users to quickly compare their genomes against the reference catalogues. The resource supports a wide range of applications, including the identification of novel genomes, analysis of species-level adaptation across environments, and research in agricultural, environmental, and health and disease fields.
2025-07-23 17:50:00 18:00:00 01B MICROBIOME Rapid and Consistent Genome Clustering for Navigating Bacterial Diversity with Millions of Genomes Johanna von Wachsmann Johanna von Wachsmann, John A. Lees, Robert D. Finn The exponential growth of bacterial genomic databases presents unprecedented challenges for researchers, with isolate genomes increasing from 661,405 samples in 2021 to 2,440,377 samples by August 2024, alongside expanding MAG repositories like those provided by MGnify. While removing genome redundancy at species or strain levels is essential for navigating this vast landscape, current gold-standard tools like dRep have become computationally infeasible for datasets exceeding 50,000 genomes - illustrated by the human gut MAG catalogue in MGnify requiring artificial splitting into multiple chunks for processing, risking taxonomic inconsistencies and demanding extensive manual intervention. We present a novel sketching-based clustering approach that dramatically improves scalability while maintaining high biological accuracy. Our method is built on sketchlib.rust (approximately 100× faster than MASH) for sketching genomes and constructing genome similarity networks that effectively partition millions of genomes into species clusters. When benchmarked against dRep on a 1,125-genome dataset, our approach clusters the genomes in just 0.2 CPU hours compared to dRep's 92 CPU hours. More importantly, our method successfully processes 219,000 genomes in only 17.1 CPU hours - a task impossible for dRep. Quality assessment across multiple datasets demonstrates excellent taxonomic coherence, with monophyletic scores >99%. This breakthrough enables researchers to effectively navigate and utilise the unprecedented scale of available bacterial genomic data, facilitating analyses previously considered impracticable or even impossible.
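The clustering step described above is conceptually simple: thresholded pairwise genome distances define a graph whose connected components become species clusters. In the sketch below, the distance matrix is simulated (real distances would come from sketchlib.rust sketches), and the 95%-ANI-style cutoff is an assumed, commonly used species boundary.

```python
# Stand-in for the network-based clustering step: simulate pairwise genome
# distances, threshold them into a graph, and treat connected components as
# species clusters. Real distances would come from sketchlib.rust.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 20, size=n)        # 20 hypothetical true species
within = rng.uniform(0.00, 0.04, (n, n))    # within-species distances
between = rng.uniform(0.10, 0.50, (n, n))   # between-species distances
dist = np.where(labels[:, None] == labels[None, :], within, between)
dist = np.triu(dist, 1) + np.triu(dist, 1).T  # symmetrize; zero diagonal

THRESHOLD = 0.05  # assumed ~95% ANI-style species boundary

G = nx.Graph()
G.add_nodes_from(range(n))
i, j = np.where(np.triu(dist <= THRESHOLD, k=1))
G.add_edges_from(zip(i.tolist(), j.tolist()))

clusters = list(nx.connected_components(G))
print(f"{n} genomes -> {len(clusters)} species clusters")
```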
2025-07-24 08:40:00 09:00:00 01B MICROBIOME GiantHunter: Accurate detection of giant viruses in metagenomic data using reinforcement learning and Monte Carlo tree search Jiayu Shang Fuchuan Qu, Cheng Peng, Jiaojiao Guan, Donglin Wang, Yanni Sun, Jiayu Shang Motivation: Nucleocytoplasmic large DNA viruses (NCLDVs) are notable for their large genomes and extensive gene repertoires, which contribute to their widespread environmental presence and critical roles in processes such as host metabolic reprogramming and nutrient cycling. Metagenomic sequencing has emerged as a powerful tool for uncovering novel NCLDVs in environmental samples. However, identifying NCLDV sequences in metagenomic data remains challenging due to their high genomic diversity, limited reference genomes, and shared regions with other microbes. Existing alignment-based and machine learning methods struggle with achieving optimal trade-offs between sensitivity and precision. Results: In this work, we present GiantHunter, a reinforcement learning-based tool for identifying NCLDVs from metagenomic data. By employing a Monte Carlo tree search strategy, GiantHunter dynamically selects representative non-NCLDV sequences as the negative training data, enabling the model to establish a robust decision boundary. Benchmarking on rigorously designed experiments shows that GiantHunter achieves high precision while maintaining competitive sensitivity, improving the F1-score by 10% and reducing computational cost by 90% compared to the second-best method. To demonstrate its real-world utility, we applied GiantHunter to 60 metagenomic datasets collected from six cities along the Yangtze River, located both upstream and downstream of the Three Gorges Dam. The results reveal significant differences in NCLDV diversity correlated with proximity to the dam, likely influenced by reduced flow velocity caused by the dam. These findings highlight GiantHunter's potential to advance our understanding of NCLDVs and their ecological roles in diverse environments.
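For background, Monte Carlo tree search typically balances exploitation and exploration of candidate choices with the standard UCT rule below; the abstract does not state GiantHunter's exact selection criterion, so this is generic context rather than the paper's formula.

```latex
% UCT score for child i of a node visited N times: \bar{X}_i is the child's
% mean reward, n_i its visit count, and c the exploration constant.
\mathrm{UCT}(i) = \bar{X}_i + c \sqrt{\frac{\ln N}{n_i}}
```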
2025-07-24 09:00:00 09:10:00 01B MICROBIOME CAMI Benchmarking Portal: online evaluation and ranking of metagenomic software Fernando Meyer Fernando Meyer, Gary Robertson, Zhi-Luo Deng, David Koslicki, Alexey Gurevich, Alice C. McHardy Finding appropriate software and parameter settings to process shotgun metagenome data is essential for meaningful metagenomic analyses. To enable objective and comprehensive benchmarking of metagenomic software, the community-led initiative for the Critical Assessment of Metagenome Interpretation (CAMI) promotes standards and best practices. Since 2015, CAMI has provided comprehensive datasets, benchmarking guidelines, and challenges. However, benchmarking had to be conducted offline, requiring substantial time and technical expertise and leading to gaps in results between challenges. We present the CAMI Benchmarking Portal — a central repository of CAMI resources and web server for the evaluation and ranking of metagenome assembly, binning, and taxonomic profiling software. The portal simplifies evaluation, enabling users to easily compare their results with previous and other users’ submissions through a variety of metrics and visualizations. The portal currently hosts 28,675 results and is freely available at https://cami-challenge.org/.
2025-07-24 09:10:00 09:20:00 01B MICROBIOME CAMI community exchange Alice McHardy Alice McHardy
2025-07-24 09:10:00 09:20:00 01B MICROBIOME NanoGraph: Mapping Nanopore Squiggles to Graphs Enables Accurate Taxonomic Assignment Wenhuan Zeng Wenhuan Zeng, Daniel H. Huson Nanopore sequencing technology offers long sequencing reads and real-time analysis capabilities, making it a powerful tool for addressing diverse questions in the life sciences. This technology detects raw electrical signals from samples, which are converted into nucleotide sequences (A, T, G, and C) through a process known as basecalling. These sequences can subsequently be used for various types of analysis. To enhance the efficiency of taxonomic classification in Nanopore sequencing and explore the challenges of applying deep learning algorithms to ultra-long sequences, we developed NanoGraph, which is a graph-based deep learning framework designed to classify samples based on their taxonomic lineages. NanoGraph processes raw signals (of substantial length) by transforming them into topological graph structures using novel methods. We evaluated NanoGraph’s performance using a customized simulated dataset and benchmarked it against a previous study on public datasets, demonstrating superior results. Additionally, we assessed its practical usability after fine-tuning the trained model on real raw signal datasets generated in our wet lab. In summary, NanoGraph provides a robust and effective approach for the taxonomic classification of Nanopore-sequenced samples, offering insights that advance the application of graph neural networks to raw signal data and help bridge the gap between computational efficiency and ultra-long sequencing reads.
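One simple way to picture a squiggle-to-graph transformation is to quantize the raw current into discrete levels and record level transitions as a weighted directed graph; the sketch below is purely an assumed illustration of that general idea, since NanoGraph's actual construction is novel and more sophisticated.

```python
# Assumed illustration of a squiggle-to-graph idea: quantize raw current
# values into discrete levels and count level-to-level transitions as
# weighted edges. NanoGraph's actual construction is more sophisticated.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
signal = rng.normal(90, 10, size=5000)  # fake pA-scale raw nanopore signal

# 17 quantized levels spanning a typical current range
levels = np.digitize(signal, np.linspace(60, 120, 16))

G = nx.DiGraph()
for a, b in zip(levels[:-1], levels[1:]):
    if G.has_edge(a, b):
        G[a][b]["weight"] += 1
    else:
        G.add_edge(a, b, weight=1)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```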
2025-07-24 09:20:00 09:30:00 01B MICROBIOME MEGAN7: Enhanced Optimization and Advanced Functionality for Metagenomic Analysis Anupam Gautam Anupam Gautam, Daniel H. Huson MEGAN is a widely used, user-friendly tool for metagenomic analysis, suitable for long- and short-read data, and remains the only such tool with a graphical user interface. MEGAN7 introduces optimized workflows and enhanced functionality. By utilizing smaller, clustered reference databases, MEGAN7 improves computational efficiency while maintaining high-quality taxonomic and functional assignments, making it a scalable solution for diverse datasets. This study presents current work on MEGAN7, a major update of our MEGAN software, and highlights the impact of utilizing smaller reference databases on the computational efficiency and effectiveness of metagenomic sequencing data analysis, as integrated into MEGAN7. Metagenomic analysis was conducted on short and long reads from ten diverse datasets. Reads were aligned to various resolutions of the UniRef database (100%, 90%, and 50%) and clustered NCBI-nr databases (90% and 50% identity) using DIAMOND. Taxonomic and functional binning of the aligned reads was carried out using MEGAN7. Smaller reference databases, particularly at 90% and 50% identity, significantly accelerated processing times while maintaining high-quality alignment and assignment rates. The integration of DIAMOND's clustering capabilities further enhanced efficiency, demonstrating improved performance across all downsized databases. MEGAN7 achieved consistently good assignment rates for both taxonomic and functional binning, even with reduced database sizes. These findings illustrate that downsizing reference databases effectively reduces the computational burden of metagenomic analysis without compromising result quality. The incorporation of DIAMOND's clustering features offers additional efficiency gains. With these optimized workflows, MEGAN7 presents a scalable and efficient tool for metagenomic data analysis, offering enhanced functionality for diverse datasets.
2025-07-24 09:30:00 09:40:00 01B MICROBIOME TaxSEA: Rapid Interpretation of Microbiome Alterations Using Taxon Set Enrichment Analysis and Public Databases Feargal Ryan Feargal Ryan Microbial communities are essential regulators of ecosystem function, with their composition commonly assessed through DNA sequencing. Most current tools focus on detecting changes among individual taxa (e.g., species or genera); however, in other omics fields, such as transcriptomics, enrichment analyses like Gene Set Enrichment Analysis (GSEA) are commonly used to uncover patterns not seen with individual features. Here, we introduce TaxSEA, a taxon set enrichment analysis tool available as an R package, a web portal (https://shiny.taxsea.app), and a Python package. TaxSEA integrates taxon sets from five public microbiota databases (BugSigDB, MiMeDB, GutMGene, mBodyMap, and GMRepoV2) while also allowing users to incorporate custom sets such as taxonomic groupings. In-silico assessments show TaxSEA is accurate across a range of set sizes. When applied to differential abundance analysis output from Inflammatory Bowel Disease and Type 2 Diabetes metagenomic data, TaxSEA can rapidly identify changes in functional groups corresponding to known associations. We also show that TaxSEA is robust to the choice of differential abundance (DA) analysis package. In summary, TaxSEA enables researchers to efficiently contextualize their findings within the broader microbiome literature, facilitating rapid interpretation and advancing understanding of microbiome–host and environmental interactions.
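The statistical core of a taxon set enrichment can be illustrated with a hypergeometric over-representation test; this generic sketch shows the idea only, and is not TaxSEA's actual API or necessarily its chosen statistic.

```python
# Generic taxon set over-representation test (hypergeometric), illustrating
# the statistical idea behind taxon set enrichment. Not TaxSEA's API; set
# names and sizes below are toy assumptions.
from scipy.stats import hypergeom

def taxon_set_enrichment(hits, taxon_set, background):
    """P-value that `taxon_set` is over-represented among `hits`."""
    M = len(background)              # all taxa tested
    n = len(taxon_set & background)  # set members in the background
    N = len(hits)                    # differentially abundant taxa
    k = len(hits & taxon_set)        # set members among the hits
    return hypergeom.sf(k - 1, M, n, N)  # P(X >= k) without replacement

background = {f"taxon_{i}" for i in range(500)}
butyrate_producers = {f"taxon_{i}" for i in range(30)}       # toy taxon set
da_taxa = {f"taxon_{i}" for i in range(15)} | {"taxon_400"}  # toy DA hits
print(taxon_set_enrichment(da_taxa, butyrate_producers, background))
```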
2025-07-24 09:50:00 10:00:00 01B MICROBIOME SinProVirP: a Signature Protein-based Approach for Accurate and Efficient Profiling of the Human Gut Virome Junhua Li Junhua Li, Fangming Yang, Liwen Xiong, Min Li, Xuyang Feng, Huahui Ren, Zhun Shi, Huanzi Zhong The human gut virome represents a critical yet underexplored microbial component that regulates bacterial communities, modulates host immunity, and maintains gut health. However, virome analysis remains challenging due to the vast diversity and genomic variability of viruses. Existing profiling methods often struggle with accuracy and efficiency, hindering their ability to detect novel viral species and perform large-scale analyses. Here, we present SinProVirP, a genus-level virome profiling tool based on signature proteins. By analyzing 275,202 phage genomes to establish a curated database of 109,221 signature proteins across 6,780 viral clusters (VCs), SinProVirP achieves genus-level phage quantification with precision and recall comparable to the benchmark method while reducing computational demands by over 80%. Crucially, SinProVirP significantly outperforms existing tools in detecting novel viruses, achieving over 80% recall by using a signature protein-based identification strategy. Applied to inflammatory bowel disease (IBD) cohorts, SinProVirP revealed disease-specific virome dysbiosis, identified phage-host interactions, and improved the performance of bacteria-only disease classification models. This approach enables robust, large-scale virome analysis, facilitates the integrative analysis of viral and bacterial communities, and improves our understanding of the virome’s role in health.
2025-07-24 11:20:00 11:40:00 01B MICROBIOME Leveraging Large Language Models to Predict Antibiotic Resistance in Mycobacterium tuberculosis Conrad Testagrose Conrad Testagrose, Sakshi Pandey, Mohammadali Serajian, Simone Marini, Mattia Prosperi, Christina Boucher Antibiotic resistance in Mycobacterium tuberculosis (MTB) poses a significant challenge to global public health. Rapid and accurate prediction of antibiotic resistance can inform treatment strategies and mitigate the spread of resistant strains. In this study, we present a novel approach leveraging large language models (LLMs) to predict antibiotic resistance in MTB (LLMTB). Our model is trained on a large dataset of genomic data and associated resistance profiles, utilizing natural language processing techniques to capture patterns and mutations linked to resistance. The model's architecture integrates state-of-the-art transformer-based LLMs, enabling the analysis of complex genomic sequences and the extraction of critical features relevant to antibiotic resistance. We evaluate our model's performance using a comprehensive dataset of MTB strains, demonstrating its ability to achieve high performance in predicting resistance to various antibiotics. Unlike traditional machine learning methods, fine-tuning or few-shot learning opens avenues for LLMs to adapt to new or emerging drugs, thereby reducing reliance on extensive data curation. Beyond predictive accuracy, LLMTB uncovers deeper biological insights, identifying critical genes, intergenic regions, and novel resistance mechanisms. This method marks a transformative shift in resistance prediction and offers significant potential for enhancing diagnostic capabilities and guiding personalized treatment plans, ultimately contributing to the global effort to combat tuberculosis and antibiotic resistance. All source code is publicly available at https://github.com/ctestagrose/LLMTB.
2025-07-24 11:40:00 11:50:00 01B MICROBIOME De novo discovery of conserved gene clusters in microbial genomes with Spacedust Johannes Soeding Ruoshi Zhang, Johannes Soeding, Milot Mirdita Metagenomics has revolutionized environmental and human-associated microbiome studies. However, the limited fraction of proteins with known biological process and molecular functions presents a major bottleneck. In prokaryotes and viruses, evolution favors keeping genes participating in the same biological processes co-localized as conserved gene clusters. Conversely, conservation of gene neighborhood indicates functional association. Spacedust is a tool for systematic, de novo discovery of conserved gene clusters. To find homologous protein matches it uses fast and sensitive structure comparison with Foldseek. Partially conserved clusters are detected using novel clustering and order conservation P-values. We demonstrate Spacedust's sensitivity with an all-vs-all analysis of 1,308 bacterial genomes, identifying 72,843 conserved gene clusters containing 58% of the 4.2 million genes. It recovered 95% of antiviral defense system clusters annotated by a specialized tool. Spacedust's high sensitivity and speed will facilitate the large-scale annotation of the huge numbers of sequenced bacterial, archaeal and viral genomes.
2025-07-24 11:50:00 12:00:00 01B MICROBIOME Nerpa 2: linking biosynthetic gene clusters to nonribosomal peptide structures Ilia Olkhovskii Ilia Olkhovskii, Azat Tagirdzhanov, Alexey Gurevich, Aleksandra Kushnareva, Petr Popov Nonribosomal peptides (NRPs) are clinically important molecules produced by microbial specialized enzymes encoded in biosynthetic gene clusters (BGCs). Linking BGCs to their products is crucial for predicting and manipulating NRP production, yet BGC-to-NRP biosynthesis is often complex and non-unique, making prediction from the genome challenging. Here, we present Nerpa 2, a high-throughput BGC–NRP matching tool. Compared to its predecessor, Nerpa 2 improves prediction of the NRP monomers selected during synthesis, introduces a hidden Markov model–based alignment strategy for handling complex biosynthetic paths, and adds interactive visualizations for result interpretation. We evaluated Nerpa 2 on 191 BGCs and 1,205 NRP structures, demonstrating a notable accuracy improvement over both Nerpa 1 and the related tool BioCAT (50% vs. 42% and 8%). In addition to higher overall precision, Nerpa 2 performs significantly better on especially challenging cases. Nerpa 2 streamlines a range of tasks in NRP research, including annotation of computationally predicted BGCs, prioritization of BGCs more likely to yield novel NRPs, and guiding bioengineering experiments by identifying BGCs that yield molecules close to user-specified target structures. The software is freely available at https://github.com/gurevichlab/nerpa.
2025-07-24 12:00:00 12:10:00 01B MICROBIOME Phylo-Spec: a phylogeny-fusion deep learning model advances microbiome status identification Xiaoquan Su Junhui Zhang, Fan Meng, Yangyang Sun, Wenfei Xu, Shunyao Wu, Xiaoquan Su Motivation: The human microbiome is crucial for health regulation and disease progression, presenting a valuable opportunity for health state classification. Traditional microbiome-based classification relies on pre-trained machine learning (ML) or deep learning (DL) models, which typically focus on microbial distribution patterns, neglecting the underlying relationships between microbes. As a result, model performance can be significantly affected by data sparsity, misclassified features, or incomplete microbial profiles. Methods: To overcome these challenges, we introduce Phylo-Spec, a phylogeny-driven deep learning algorithm that integrates multi-aspect microbial information for improved status recognition. Phylo-Spec fuses convolutional features of microbes within a phylogenetic hierarchy via a bottom-up iteration, significantly alleviating the challenges posed by sparse data and inaccurate profiling. Additionally, the model dynamically assigns unclassified species to virtual nodes on the phylogenetic tree based on higher-level taxonomy, minimizing interference from uncertain microbes. Phylo-Spec also captures feature importance via an information gain-based mechanism through phylogenetic structure propagation, enhancing the interpretability of classification decisions. Results: Phylo-Spec demonstrated superior efficacy in microbiome status classification across two in-silico synthetic datasets that simulate the aforementioned cases, outperforming existing ML and DL methods. Validation with real-world metagenomic and amplicon data further confirmed the model’s performance in multiple status classification tasks, establishing a powerful framework for microbiome-based health state identification and microbe-disease association.
2025-07-24 12:10:00 12:20:00 01B MICROBIOME Beyond Taxonomy and Function: Protein Language Models for Scalable Microbial Representations Petra Matyskova Petra Matyskova, Gijs Selten, Sanne Abeln, Ronnie de Jonge Traditional microbial representations based on taxonomy or functional annotations like KEGG Orthology (KOs) and OrthoFinder groups (OGs) suffer from low coverage, high dimensionality, or long computation times. In this work, we explore the use of protein large language models (PLLMs), specifically ESM-2, to generate compact and informative microbial embeddings. We benchmark these embeddings against KOs and OGs using a dataset of 988 microbial genomes. We compare the three approaches in terms of protein coverage, feature dimensionality, runtime, and predictive performance in a biologically relevant task: predicting the root competence of microbes on Arabidopsis thaliana. ESM-2 embeddings achieved full protein coverage and required less runtime than OGs or KOs while producing compact 320-dimensional feature sets. In the classification task, random forest and multi-layer perceptron classifiers based on ESM-2 embeddings outperformed traditional methods. Additionally, the results were replicated on external synthetic community datasets. Importantly, ESM-2 embeddings preserved relevant taxonomic and functional information, as confirmed through hierarchical clustering and PCA. By analysing the embedding weights, we also identified key proteins predictive of root competence, including known and novel candidates. Our results suggest that PLLM-based microbial representations offer an efficient and scalable alternative to conventional functional annotation-based approaches, especially for small datasets common in microbiome studies. This approach lays the foundation for more advanced applications such as multi-modal, embedding-based data integration and the discovery of new biologically meaningful traits beyond taxonomic labels or annotated proteins.
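A minimal sketch of how such embeddings can be produced with the fair-esm package: the 8M-parameter ESM-2 checkpoint has a hidden size of 320, matching the 320-dimensional feature sets mentioned above. The mean-pooling from residues to proteins to a genome-level vector is our assumption, not necessarily the paper's exact recipe.

```python
# Sketch: 320-dimensional protein embeddings from the 8M-parameter ESM-2
# checkpoint (hidden size 320), mean-pooled into one genome-level vector.
# The pooling scheme is an assumption, not necessarily the paper's recipe.
import torch
import esm  # pip install fair-esm

model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

proteins = [("protA", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
            ("protB", "MSILVTRPSPAGEELVSRLRTLGQVAWHFPLIEF")]
_, _, tokens = batch_converter(proteins)

with torch.no_grad():
    out = model(tokens, repr_layers=[6])
reps = out["representations"][6]  # (n_proteins, seq_len, 320)

per_protein = []
for i, (_, seq) in enumerate(proteins):
    # average over real residues, skipping BOS (position 0) and EOS/padding
    per_protein.append(reps[i, 1:len(seq) + 1].mean(dim=0))

genome_embedding = torch.stack(per_protein).mean(dim=0)  # shape: (320,)
print(genome_embedding.shape)
```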
2025-07-24 12:20:00 12:30:00 01B MICROBIOME Guided tokenizer enhances metagenomic language models performance Ali Rahnavard Ali Rahnavard, Vedant Mahangade, Keith Crandall Tokenization is a critical step in adapting language models for genomic and metagenomic sequence analysis. Traditional tokenization methods—such as fixed-length k-mers or statistical compression algorithms like byte-pair encoding (BPE)—often fail to capture the biological relevance embedded in DNA sequences. We introduce Guided Tokenization (GT), a novel, adaptive strategy that prioritizes biologically meaningful subsequences by leveraging importance scores derived from functional annotations, class distributions, and model attention mechanisms. Unlike conventional approaches, GT dynamically selects high-importance tokens by integrating (1) class-specific unique k-mers, (2) frequently observed informative subsequences, (3) model-informed weighted tokens after fine-tuning, and (4) biologically annotated fragments such as promoters or coding regions. This token prioritization strategy is applied during pretraining, fine-tuning, and prediction phases of genomic language models (gLMs), enabling more efficient learning with fewer parameters and reduced sequence inflation. We evaluated GT across a range of metagenomic classification and sequence modeling tasks, including taxonomic profiling, antibiotic resistance gene classification, and read classification (e.g., host vs. microbial and plasmid vs. chromosome). Results consistently demonstrate that GT improves model performance, especially for small and mid-sized models, by enhancing classification accuracy, representation quality, and computational efficiency. These findings position guided tokenization as a scalable and biologically aware framework for advancing the next generation of metagenomic language models.
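One ingredient named above, class-specific unique k-mers, is easy to sketch: collect the k-mers seen in exactly one class as high-priority token candidates. The scoring and the merge into a tokenizer vocabulary are simplified away here, so treat this as an illustration of the idea rather than the GT implementation.

```python
# Sketch of one GT ingredient: class-specific unique k-mers as high-priority
# token candidates. Scoring and vocabulary merging are omitted assumptions.
from collections import defaultdict

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def class_unique_kmers(labeled_seqs, k=6):
    """Return, per class, the k-mers seen in that class and no other."""
    seen = defaultdict(set)
    for label, seq in labeled_seqs:
        seen[label] |= kmers(seq, k)
    unique = {}
    for label, ks in seen.items():
        others = set().union(*(v for l, v in seen.items() if l != label))
        unique[label] = ks - others
    return unique

data = [("plasmid", "ACGTACGGTTACGT"), ("chromosome", "TTGACGTACCAGTA")]
for label, toks in class_unique_kmers(data, k=4).items():
    print(label, sorted(toks)[:5])
```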
2025-07-24 12:30:00 12:40:00 01B MICROBIOME REMAG: recovery of eukaryotic genomes from metagenomes using reference-free contrastive learning Daniel Gómez-Pérez Daniel Gómez-Pérez, Sebastién Raguideau, Falk Hildebrand, Christopher Quince Assembly-based metagenomic approaches, including generation of metagenome-assembled genome (MAG) catalogues, are pivotal for exploring and understanding microbial communities. Yet, despite the relevance of protists and fungi for ecological communities, eukaryotic MAG recovery lags behind that of prokaryotes. State-of-the-art binning pipelines rely on reference databases of single-copy core genes that are sparse for eukaryotes. This problem is further complicated because reference databases scale poorly as sequence diversity and dataset size increase. Here, we present REMAG (Recovery of Eukaryotic MAGs), a tool that learns from individual metagenomic datasets to recover eukaryotic bins. By embedding contig-level composition and coverage features into a shared latent space optimized by contrastive learning, followed by hierarchical clustering, the method accurately extracts representative bins. In benchmarks based on real and simulated synthetic community datasets of varying sizes (including prokaryotes and eukaryotes), we show its ability to recover eukaryotic genomes with higher completeness and less contamination than similar state-of-the-art tools, which often result in high fragmentation of eukaryotic bins. Overall, our approach provides a reference-free method for eukaryotic binning that scales well with the increased growth and higher depth of diverse metagenomic datasets.
2025-07-24 12:40:00 12:50:00 01B MICROBIOME Flexible Log-odds Homology Features for Plasmid Identification Tomas Vinar Brona Brejova, Veronika Tordova, Kristian Andrascik, Cedric Chauve, Tomas Vinar We study the problem of plasmid identification in short-read assemblies of bacterial isolates. The goal is to classify individual contigs as coming from a chromosome or a plasmid. This problem is typically addressed by machine learning methods combining features derived from input contigs. Some methods also use additional features based on homology to sequences typical for known plasmids or chromosomes. In this work we propose a method for creating such features using log-odds scores based on ideas similar to those traditionally used in sequence alignment scoring. The framework is flexible as it can handle both close homologs as well as protein domains capturing distant homology. Inclusion of these features into the plASgraph2 graph neural network significantly improves its accuracy.
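The underlying score is the classical alignment-style log-odds ratio; how the probabilities are estimated (pseudocounts, close homologs versus protein domains for distant homology) is the paper's contribution and is not reproduced here:

```latex
% Log-odds score of a homology feature f observed on a contig: the likelihood
% under the plasmid model against the chromosome (background) model, with
% probabilities estimated from labeled training contigs (plus pseudocounts).
s(f) = \log \frac{P(f \mid \mathrm{plasmid})}{P(f \mid \mathrm{chromosome})}
```

A positive score votes for plasmid origin and a negative one for chromosomal origin; in this work the resulting feature values are supplied to the plASgraph2 graph neural network rather than used directly as a classifier.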
2025-07-24 12:50:00 13:00:00 01B MICROBIOME Accurate plasmid reconstruction from metagenomics data using assembly-alignment graphs and contrastive learning Pau Piera Lindez Pau Piera Lindez, Jakob Nissen, Simon Rasmussen Plasmids are extrachromosomal DNA molecules that enable horizontal gene transfer in bacteria, often conferring advantages such as antibiotic resistance. Despite their significance, plasmids are underrepresented in genomic databases due to challenges in assembling them, caused by mosaicism and micro-diversity. Current plasmid assemblers rely on detecting circular paths in single-sample assembly graphs, but face limitations due to graph fragmentation and entanglement, and low coverage. We introduce PlasMAAG (Plasmid and organism Metagenomic binning using Assembly Alignment Graphs), a framework to recover plasmids and organisms from metagenomic samples that leverages an approach that we call “assembly-alignment graphs” alongside common binning features. On synthetic benchmark datasets, PlasMAAG reconstructed 50–121% more near-complete plasmids than competing methods and improved the Matthews Correlation Coefficient of geNomad contig classification by 28–106%. On hospital sewage samples, PlasMAAG outperformed all other methods, reconstructing 33% more plasmid sequences. PlasMAAG enables the study of organism-plasmid associations and intra-plasmid diversity across samples, offering state-of-the-art plasmid reconstruction with reduced computational costs.
2025-07-24 14:00:00 14:20:00 01B MICROBIOME Predicting coarse-grained representations of biogeochemical cycles from metabarcoding data Arnaud Belcour Arnaud Belcour, Loris Megy, Sylvain Stephant, Caroline Michel, Sétareh Rad, Petra Bombach, Nicole Dopffel, Hidde de Jong, Delphine Ropers Motivation: Taxonomic analysis of environmental microbial communities is now routinely performed thanks to advances in DNA sequencing. Determining the role of these communities in global biogeochemical cycles requires the identification of their metabolic functions, such as hydrogen oxidation, sulfur reduction, and carbon fixation. These functions can be directly inferred from metagenomics data, but in many environmental applications metabarcoding is still the method of choice. The reconstruction of metabolic functions from metabarcoding data and their integration into coarse-grained representations of biogeochemical cycles remains a difficult bioinformatics problem today. Results: We developed a pipeline, called Tabigecy, which exploits taxonomic affiliations to predict the metabolic functions constituting biogeochemical cycles. In a first step, Tabigecy uses the tool EsMeCaTa to predict consensus proteomes from input affiliations. To optimise this process, we generated a precomputed database containing information about 2,404 taxa from UniProt. The consensus proteomes are searched using bigecyhmm, a newly developed Python package relying on Hidden Markov Models to identify key enzymes involved in the metabolic functions of biogeochemical cycles. The metabolic functions are then projected onto coarse-grained representations of the cycles. We applied Tabigecy to two salt cavern datasets and validated its predictions with microbial activity and hydrochemistry measurements performed on the samples. The results highlight the utility of the approach to investigate the impact of microbial communities on biogeochemical processes. Availability: The Tabigecy pipeline is available at https://github.com/ArnaudBelcour/tabigecy. The Python package bigecyhmm and the precomputed EsMeCaTa database are also separately available at https://github.com/ArnaudBelcour/bigecyhmm and https://doi.org/10.5281/zenodo.13354073, respectively.
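The profile-HMM search step such a pipeline performs can be sketched with the pyhmmer package: scan a consensus proteome against HMMs of marker enzymes and keep significant hits. File names and the E-value cutoff are placeholder assumptions, and this is not the bigecyhmm API itself.

```python
# Sketch of a profile-HMM search step using pyhmmer. File names and the
# E-value cutoff are placeholders; this is not the bigecyhmm API itself.
import pyhmmer

with pyhmmer.plan7.HMMFile("marker_enzymes.hmm") as hmm_file:
    hmms = list(hmm_file)  # profile HMMs of key biogeochemical enzymes

with pyhmmer.easel.SequenceFile("consensus_proteome.faa", digital=True) as sf:
    proteins = sf.read_block()

for top_hits in pyhmmer.hmmsearch(hmms, proteins):
    for hit in top_hits:
        if hit.evalue < 1e-5:  # assumed significance threshold
            print(hit.name.decode(), hit.evalue)
```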
2025-07-24 14:20:00 14:30:00 01B MICROBIOME CroCoDeEL: accurate control-free detection of cross-sample contamination in metagenomic data Florian Plaza Oñate Lindsay Goulet, Florian Plaza Oñate, Pauline Barbet, Alexandre Famechon, Benoît Quinquis, Eugeni Belda, Edi Prifti, Emmanuelle Le Chatelier, Guillaume Gautreau Metagenomic sequencing provides profound insights into microbial communities, but it is often compromised by technical biases, including cross-sample contamination. This phenomenon arises when microbial content is inadvertently exchanged among concurrently processed samples, distorting microbial profiles and compromising the reliability of metagenomic data and downstream analyses. Existing detection methods often rely on negative controls, which are inconvenient and do not detect contamination within real samples. Meanwhile, strain-level bioinformatics approaches fail to distinguish contamination from natural strain sharing and lack sensitivity. To fill this gap, we introduce CroCoDeEL, a decision-support tool for detecting and quantifying cross-sample contamination. Leveraging linear modeling and a pre-trained supervised model, CroCoDeEL identifies specific contamination patterns in species abundance profiles. It requires no negative controls or prior knowledge of sample processing positions, offering improved accuracy and versatility. Benchmarks across three public datasets demonstrate that CroCoDeEL accurately detects contaminated samples and identifies their contamination sources, even at low rates (<0.1%), provided sufficient sequencing depth. Notably, we discovered critical contamination cases in highly cited studies, calling some of their results into question. Our findings suggest that cross-sample contamination is a widespread yet underexplored issue in metagenomics and emphasize the necessity of systematically integrating contamination detection into sequencing quality control. Future work will consist of developing an innovative approach to remove the contamination signal detected by CroCoDeEL. CroCoDeEL is freely available at https://github.com/metagenopolis/CroCoDeEL. Reference: Goulet, L. et al. “CroCoDeEL: accurate control-free detection of cross-sample contamination in metagenomic data” bioRxiv (2025). https://doi.org/10.1101/2025.01.15.633153.
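A toy version of the signature CroCoDeEL exploits: species introduced by contamination sit on a line of slope one, offset by the contamination rate, in log-log abundance space between source and contaminated samples. The detection rule below (median offset among source-specific species) is a deliberate simplification of the paper's linear model.

```python
# Toy illustration of the contamination signature: species introduced by
# contamination fall on a slope-one line offset by log10(rate) in log-log
# abundance space. This is a simplification of CroCoDeEL's actual model.
import numpy as np

rng = np.random.default_rng(1)
source = rng.dirichlet(np.ones(300) * 0.05)       # contamination source
truth = rng.dirichlet(np.ones(300) * 0.05)        # true target profile
truth[rng.choice(300, 150, replace=False)] = 0.0  # species absent from target
truth /= truth.sum()

rate = 0.002  # 0.2% cross-sample contamination
contaminated = (1 - rate) * truth + rate * source

# Species the target should not contain but the source does:
introduced = (truth == 0) & (source > 0)
offsets = np.log10(contaminated[introduced]) - np.log10(source[introduced])
print(f"estimated contamination rate ~ {10 ** np.median(offsets):.4f}")
```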
2025-07-24 14:30:00 14:40:00 01B MICROBIOME Longflow: A comprehensive end-to-end solution for long-read metagenomics. Sebastien Raguideau Sebastien Raguideau Transitioning from short-read to long-read sequencing in metagenomics requires methodological refinements. We present Longflow, a versatile pipeline tailored for long-read data, supporting analysis from raw FASTQ/BAM files to annotated metagenome-assembled genomes (MAGs). Built with Snakemake and containerised for reproducibility, Longflow is robust and easily deployed on HPC systems. Longflow enables flexible analysis, including per-sample or co-assembly schemes, and co-binning, leveraging samples not part of the assembly to enhance binning performance. It integrates tools for taxonomy (e.g., Silva, NR), functional annotation (KEGG, InterProScan), viral detection (GeNomad), and SNV calling (Longshot). MAGs are curated using a consensus approach from multiple binners and classified via GTDB-Tk. To address the issue of chimeric contigs, particularly problematic in long-read assemblies due to larger contig sizes, we created a visualisation tool to detect these artefacts and implemented a fragmentation heuristic, thus improving MAG recovery and removing one source of contamination. Longflow also facilitates the incorporation of short-read data for co-binning. We improved read assignment and overall binning results by using a novel k-mer coverage estimation method to handle ambiguous mappings. Longflow is a reliable and flexible tool for contemporary metagenomic research, and it is constantly being developed and maintained to increase its functionality.
2025-07-24 14:40:00 14:50:00 01B MICROBIOME Long-reads metagenome-assembled genomes can be higher quality than reference genomes: the case of the Shanghai pet dog microbiome catalog Luis Pedro Coelho Anna Cuscó, Yiqian Duan, Fernando Gil, Shaojun Pan, Nithya Kruthi, Alexei Chklovski, Xing-Ming Zhao, Luis Pedro Coelho We present a comprehensive analysis of the gut microbiome of 50 pet dogs living in Shanghai (China). Both long-read and short-read sequencing methods were employed to deeply sequence fecal samples, enabling high-quality metagenome-assembled genome (MAG) recovery. Polishing long-read assemblies with short reads notably improved MAG quality, particularly for genomes with lower sequencing coverage. The final MAG collection comprises 2,676 MAGs (72% high-quality), representing 320 bacterial species, and captures global microbial diversity, evidenced by high read mapping rates (>90%) to external datasets from multiple countries. The predominant phyla were Bacillota, Bacteroidota, and Fusobacteriota. Many of the resulting MAGs are of higher quality than reference genomes available for the same species. In particular, our MAGs more consistently contain ribosomal genes, tRNAs, and mobilome-associated genes, all of which are known to be difficult to recover (even when sequencing isolates) using short reads. Extra-chromosomal elements (e.g., plasmids or viruses) are another blind spot when using short reads. We recovered 185 circular elements (comprising 58 plasmids, 30 viruses, and 97 elements that cannot be confidently assigned). Several of these contain antibiotic resistance genes, including beta-lactamases. One-third of identified bacterial species were novel, particularly within genera such as CAG-269 and Dysosmobacter. Additionally, this study demonstrated clear microbiome differences between pet dogs and colony-living dogs, the latter showing higher microbial diversity and higher abundance of probiotic-associated species. Overall, this study provides the best known resource for pet dog microbiome studies and demonstrates the value of hybrid sequencing to build the highest quality resources.
2025-07-24 14:50:00 15:00:00 01B MICROBIOME Use of Long-Read SMRT PacBio Sequencing for Detailed Genomic and Epigenetic Studies of Complex Microbial Communities in the Wheat Rhizosphere in Response to Abiotic Stress Oleg Reva Siphiwe Maseko, Nwabisa Ngwentle, Teresa Coutinho, Ngwekazi Mehlomakulu, Oleg Reva The wheat rhizosphere harbours complex microbial communities essential for plant health and soil fertility. Traditional sequencing reveals microbial diversity but often misses genomic and epigenetic interactions. Here, long-read SMRT PacBio sequencing was applied to a wheat field in South Africa (34.08551°S, 20.26628°E) to profile microbial communities across varying environmental conditions from August to November 2023, spanning seasons from heavy rains to extreme drought. This approach enabled a detailed reconstruction of microbial interactions and the identification of key taxa influencing soil fertility. Network analysis revealed species-specific associations shaping the microbial community. Epigenetic analysis of metagenome-assembled contigs demonstrated that Pseudomonas fluorescens, Flavobacterium pectinovorum, and Flavobacterium aquicola thrived in wet conditions but suffered during drought, evidenced by increased oxidized guanine residues in their genomes under unfavourable conditions. Conversely, Amycolatopsis camponoti and some uncultured Alpha-proteobacteria and Actinomycetota struggled in floods but flourished in arid conditions. These findings demonstrate the varying responses of rhizobacterial community members to environmental stressors, highlighting the need for a strategic selection of beneficial bacteria used in agro-biopreparations. Selecting microbial inoculants based on their optimal environmental conditions can enhance their efficacy in improving soil fertility and crop resilience. Long-read SMRT sequencing enables species-level identification and detailed genomic and epigenetic insights that could not be achieved before. Additionally, novel computational tools were developed for modelling microbial networks and predicting oxidized guanine distribution along metagenome-assembled contigs. This study was conducted for the TRIBIOME Project (https://www.tribiome.eu/) and funded by the Horizon Europe research and innovation program (grant Nº 101084485).
2025-07-24 15:00:00 15:10:00 01B MICROBIOME proMGEflow: recombinase-based detection of mobile genetic elements in bacterial (meta)genomes Anastasiia Grekova Anastasiia Grekova, Supriya Khedkar, Christian Schudoma, Chan Yeong Kim, Daniel Podlesny, Anthony Noel Fullam, Jonas Richter, Thomas Sebastian Schmidt, Daniel Mende, Suguru Nishijima, Askarbek Orakov, Michael Kuhn, Ivica Letunic, Peer Bork Mobile Genetic Elements (MGEs) are drivers of bacterial adaptation and can increase the fitness of microbial communities in changing environments. Yet the identification of MGEs remains challenging due to the fuzzy boundaries between different MGE types and the incompleteness of metagenome-assembled genomes (MAGs). Here we present proMGEflow, a Nextflow pipeline designed to annotate full genomes and MAGs with discrete MGE categories: plasmids, integrons, phages, and transposable elements. In comparison to other tools, proMGEflow takes a top-down approach to harmonize all MGEs in one go from a given (meta)genome. Our pipeline uses subfamilies of recombinases as universal MGE markers, as well as MGE type-specific mobility machinery, e.g. structural phage genes, for fine-grained assignment. MGE boundary estimation is based on the joint bacterial species pangenome built from the MAG and the species cluster of high-quality reference genomes from the ProGenomes3 database. By decoupling the MGE boundary determination step into the Python package MGExpose, we can further annotate MGEs in user-provided genomic regions by rule-based classification of their machinery and recombinases. In total, we applied proMGEflow to around 200,000 MAGs from the Searchable Planetary-scale mIcrobiome REsource (SPIRE). This not only resulted in the discovery of around 3 million MGEs of different types but also provided the first functional insights into a global environmental mobilome. The availability of scalable and reproducible pipelines for unified MGE annotation from metagenomes will improve our understanding of the mechanisms of gene mobility as well as their cross-talk with prokaryotic hosts.
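The rule-based classification step can be illustrated with a toy sketch. The rules and gene labels below are hypothetical stand-ins to show the flavour of recombinase-plus-machinery logic, not MGExpose's actual rules:

def classify_mge(genes):
    # genes: annotated gene labels found in a candidate genomic region.
    # Each branch pairs a recombinase signal with type-specific machinery.
    g = set(genes)
    if 'integron_integrase' in g and 'attC_site' in g:
        return 'integron'
    if 'recombinase' in g and {'capsid', 'terminase', 'tail_fiber'} & g:
        return 'phage'
    if {'rep_initiator', 'mob_relaxase'} & g:
        return 'plasmid'
    if {'transposase', 'tnpA'} & g:
        return 'transposable element'
    return 'unassigned'

print(classify_mge({'recombinase', 'capsid', 'terminase'}))  # -> phage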
2025-07-24 15:10:00 15:20:00 01B MICROBIOME Extracting host-specific developmental signatures from longitudinal microbiome data Balazs Erdos Balazs Erdos, Christos Chatzis, Jonathan Thorsen, Jakob Stokholm, Age K. Smilde, Morten A. Rasmussen, Evrim Acar Longitudinal microbiome studies offer critical insights into microbial community dynamics, helping to distinguish true biological signals from interindividual variability. Tensor decompositions, such as CANDECOMP/PARAFAC (CP), have been applied to analyze longitudinal microbiome data by arranging temporal measurements as a third-order tensor with modes representing taxa, time, and hosts. While these methods have proven useful in revealing the underlying structures in such data, they are limited in their ability to capture host-specific microbial dynamics, such as phenomena that are accelerated or delayed in individual hosts. To address this limitation, we use the PARAFAC2 model, a more flexible tensor model that can account for host-specific differences in the temporal trajectories of microbial communities. We analyze longitudinal microbiome data from the COPSAC2010 (Copenhagen Prospective Studies on Asthma in Childhood) cohort, tracking gut microbiome maturation in children over their first six years of life, along with data from the FARMM (Food and Resulting Microbial Metabolites) study, examining dietary effects before and after microbiota depletion. We show that both CP and PARAFAC2 decompositions reveal meaningful microbial signatures, including compositional shifts associated with birth mode, presence of older siblings, and dietary interventions. However, while CP captures the main microbial trends in time, PARAFAC2 uncovers host- and subgroup-specific developmental trajectories, offering a more nuanced view of microbiome maturation and highlighting its potential to enhance longitudinal microbiome data analysis. In addition, we discuss the interpretability of the extracted patterns, facilitated by the uniqueness properties of CP and PARAFAC2, and consider potential challenges related to the generalization of the patterns through the concept of replicability.
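For reference, the two decompositions mentioned above differ as follows (standard formulations from the tensor literature, with modes matched to this study's taxa x time x host arrangement):

$$ \text{CP:}\qquad x_{ijk} \;\approx\; \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr}, $$

$$ \text{PARAFAC2:}\qquad X_k \;\approx\; B_k\, \mathrm{diag}(c_k)\, A^{\top}, \qquad B_k^{\top} B_k = \Phi \;\; \text{for every host } k, $$

where $X_k$ is the time-by-taxa slice for host $k$. CP shares one time factor matrix $B$ across all hosts, whereas PARAFAC2's host-specific $B_k$ can encode accelerated or delayed trajectories, with the constant cross-product constraint $\Phi$ preserving uniqueness of the solution.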
2025-07-24 15:20:00 15:30:00 01B MICROBIOME Complex SynCom inoculations to study root community assembly Gijs Selten Gijs Selten, Florian Lamouche, Adrian Gomez Repolles, Simona Radutoiu, Ronnie de Jonge The root microbiome is a complex system composed of millions of interacting microbes, some of which have plant-beneficial traits such as priming the plant’s defenses or promoting growth. To apply these traits for sustainable agriculture, however, an understanding of root microbiome assembly, dynamics, and functioning is required. To gain this understanding, we isolated hundreds of rhizobacterial strains from Arabidopsis, barley, and Lotus grown in natural soil. These strains were then cultured and used to reconstitute highly complex Synthetic Communities (SynComs) comprising between 175 and 1,000 strains, which were subsequently inoculated onto the three hosts. After cultivation, the roots were harvested, DNA was extracted, and the samples were subjected to shotgun metagenomics to identify and quantify the SynCom members. Using the genomic sequences of the bacterial strains, we examined both communal functions (i.e., bacterial functions enriched in the root microbiome) and individual traits that enhance a strain’s competitiveness. Community analyses revealed that all three hosts select for functions irrespective of taxonomic origin, with an enriched selection of functions related to amino acid and vitamin metabolism, quorum sensing, and flagellar assembly. Furthermore, metagenome-wide association analysis of the most successful strains highlighted metabolic diversity, motility, and secretion systems as key traits driving a strain’s competitiveness. Although root competence functions in rhizobacteria have been studied extensively, our dataset enables the investigation of these traits within a complex microbiome context that closely resembles natural communities. This unique feature offers new insights into how plant-microbe and microbe-microbe interactions shift across different environmental and community contexts.
2025-07-24 15:30:00 15:40:00 01B MICROBIOME Spatial and temporal variation of marine microbial interactions around the west Antarctic Peninsula Julia C Engelmann Julia C Engelmann, Swan Ls Sow, Willem H van de Poll, Rachel Eveleth, Jeremy J Rich, Hugh W Ducklow, Patrick D Rozema, Catherine M Luria, Henk Bolhuis, Michael P Meredith, Linda Amaral-Zettler The west Antarctic Peninsula (WAP) has experienced more dramatic increases in temperature due to climate change than the rest of the continent and the global average. Moreover, the northern region of the WAP, hosting the research station Palmer, has seen higher temperatures and lower sea ice extent than the South, where the Rothera research station is located. We assessed bacterial and microbial eukaryote communities and their seasonal variation at the Palmer and Rothera time-series sites between July 2013 and April 2014 and predicted inter- and intra-domain causal effects. We found that microbial communities were considerably different between the two sites, with differences being attributed to seawater temperature and sea ice coverage in combination with sea ice type differences. We predicted microbial interactions with causal effect modelling, which corrects for spurious correlations and takes the direction of information flow into account (using a directed acyclic graph reconstruction approach to identify confounders). Causal effect analysis suggested that bacteria were stronger drivers of ecosystem dynamics at Palmer, while microbial eukaryotes played a stronger role at Rothera. The parasitic taxa Syndiniales persisted at both sites across the seasons, with Palmer and Rothera harbouring different key groups. However, at Rothera, Syndiniales dominated the set of negative causal effects, while this was not the case at Palmer, suggesting that parasitism drives community dynamics more strongly at Rothera than at Palmer. Our research sheds light on the dynamics of microbial community composition and potential microbial interactions at two sampling locations that represent different climate regimes along the WAP.
2025-07-24 15:40:00 15:50:00 01B MICROBIOME Associations between Microbiome-Associated Variants and Diseases Tess Cherlin Tess Cherlin, Jagyshila Das, Colleen Morse, Regeneron Genetics Center, Penn Medicine Biobank, Seth Bordenstein, Anurag Verma, Shefali Setia-Verma With high-throughput sequencing, studies have investigated the microbiome's associations with diseases and genetic variants. We aimed to 1) extend previously identified microbiome-associated variants (MAVs) from microbiome GWAS (mbGWAS) to include newer non-European population studies and 2) leverage biobank data with large sample sizes for genetically diverse population groups. We performed a phenome-wide association study (PheWAS) in the Penn Medicine Biobank (PMBB), which included 41,102 patients from two genetically inferred ancestry groups. Next, we mined MAV associations from PheWAS data in the NIH's All of Us Biobank (n = 205,237) and the Million Veterans Program (MVP) Biobank (n = 630,969). We then meta-analyzed the MAV PheWAS results from these three datasets, as well as the results from each ancestry-specific meta-analysis. We found 13 significant associations from the AFR meta-analysis (p-value ≤ 4.6e-08), 205 significant associations from the EUR meta-analysis (p-value ≤ 3.4e-08), and 122 significant associations for 25 unique MAVs from the meta-analysis across all ancestries (p-value ≤ 6.6e-08). To extend the findings from our PheWAS, we performed QTL and causal inference testing on these 25 loci using the SMR portal to determine whether these loci show evidence of shared genetic signals across traits. We found several significant relationships, especially for strongly associated traits such as psoriasis, venous thromboembolism, type 2 diabetes, and gout. Future work will investigate microbiome-QTLs to understand the potential causal relationship between MAVs, the microbiome, and disease phenotypes. This research sets the stage for further investigations aiming to uncover the mechanisms and clinical implications of microbiome-disease associations.
2025-07-24 15:50:00 16:00:00 01B MICROBIOME Detecting Synergistic Associations in Microbial Communities via Multi-Dimensional Feature Selection Witold Rudnicki Sajad Shahbazi, Piotr Stomma, Tara Zakerali, Balakrishnan Subramanian, Kinga Zielinska, Paweł Łabaj, Izabela Święcicka, Marek Bartoszewicz, Krzysztof Mnich, Witold Rudnicki The gut microbiome regulates host immunity, barrier function, and inflammatory processes. While many studies have identified individual taxa associated with disease, they often overlook higher-order dependencies within microbial communities. We present a methodology that combines information-theoretic feature selection and machine learning to identify taxa whose predictive relevance may depend on synergy with other community members. We apply this framework to data from the American Gut Project, focusing on the presence or absence of self-reported food allergy. Taxonomic profiles were normalised and binarised using two thresholding strategies. After quality filtering, the dataset included samples from 1694 healthy and 1847 allergic individuals. We used the Multi-Dimensional Feature Selection (MDFS) algorithm to evaluate information gain in both univariate (1D) and pairwise (2D) settings. As a baseline, we performed U-tests to identify taxa with significantly different abundances between groups. Predictive models were built using Random Forest classifiers trained separately on features selected via each method. MDFS outperformed the U-test in both sensitivity and robustness. Fifteen taxa were consistently selected by all methods, while MDFS variants uniquely recovered 42. The 2D analysis revealed 18 taxa that carried no predictive value alone but contributed significant information in combination with others, suggesting synergistic structure. Conventional univariate approaches would have overlooked these taxa. The results demonstrate the utility of synergy-aware feature selection for capturing complex, non-additive associations in microbial communities. Similar patterns observed across other cohorts indicate the potential generalizability of this approach.
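The pairwise (2D) criterion can be made concrete with a short sketch. This is a simplified illustration of synergy between binarised taxa, not the MDFS implementation itself:

import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def info_gain(y, *features):
    # IG(Y; X) = H(Y) - H(Y | X), where X is the joint state of the
    # given binary features (one feature in 1D, a pair in 2D).
    keys = np.stack(features, axis=1)
    ig = entropy(y)
    for key in np.unique(keys, axis=0):
        mask = (keys == key).all(axis=1)
        ig -= mask.mean() * entropy(y[mask])
    return ig

def synergy(y, x1, x2):
    # Positive when the pair carries more information about the label
    # than the sum of its parts, i.e., the taxa act synergistically.
    return info_gain(y, x1, x2) - info_gain(y, x1) - info_gain(y, x2)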
2025-07-21 11:20:00 12:20:00 01A MLCSB Is distribution shift still an AI problem? Sanmi Koyejo Sanmi Koyejo Distribution shifts describe the phenomena where the deployment performance of an AI model exhibits differences from training. On the one hand, some claim that distribution shifts are ubiquitous in real-world deployments. On the other hand, modern implementations (e.g., foundation models) often claim to be robust to distribution shifts by design. Similarly, phenomena such as “accuracy on the line” promise that standard training produces distribution-shift-robust models. When are these claims valid, and do modern models fail due to distribution shifts? If so, what can be done about it? This talk will outline modern principles and practices for understanding the role of distribution shifts in AI, discuss how the problem has changed, and present recent methods for engaging with distribution shifts, offering comprehensive and practical insights. Some highlights include a taxonomy of shifts, the role of foundation models, and finetuning. This talk will also briefly discuss how distribution shifts might interact with AI policy and governance. Bio: Sanmi Koyejo is an assistant professor in the Department of Computer Science at Stanford University and a co-founder of Virtue AI. At Stanford, Koyejo leads the Stanford Trustworthy Artificial Intelligence (STAIR) lab, which works to develop the principles and practice of trustworthy AI, focusing on applications to science and healthcare. Koyejo has been the recipient of several awards, including a Skip Ellis Early Career Award, a Presidential Early Career Award for Scientists and Engineers (PECASE), and a Sloan Fellowship. Koyejo serves on the Neural Information Processing Systems Foundation Board, the Association for Health Learning and Inference Board, and as president of the Black in AI Board.
2025-07-21 12:20:00 12:40:00 01A MLCSB Locality-aware pooling enhances protein language model performance across varied applications Minh Hoang Minh Hoang, Mona Singh Protein language models (PLMs) are amongst the most exciting recent advances for characterizing protein sequences, and have enabled a diverse set of applications including structure determination, functional property prediction, and mutation impact assessment, all from single protein sequences alone. State-of-the-art PLMs leverage transformer architectures originally developed for natural language processing, and are pre-trained on large protein databases to generate contextualized representations of individual amino acids. To harness the power of these PLMs to predict protein-level properties, these per-residue embeddings are typically "pooled" into fixed-size vectors that are further utilized in downstream prediction networks. Common pooling strategies include Cls-Pooling and Avg-Pooling, but neither of these approaches can capture the local substructures and long-range interactions observed in proteins. To address these weaknesses in existing PLM pooling strategies, we propose the use of attention pooling, which can naturally capture these important features of proteins. To make the expensive attention operator (quadratic in length of the input protein) feasible in practice, we introduce bag-of-mer pooling (BoM-Pooling), a locality-aware hierarchical pooling technique that combines windowed average pooling with attention pooling. We empirically demonstrate that both full attention pooling and BoM-Pooling outperform previous pooling strategies on three important, diverse tasks: (1) predicting the activities of two proteins as they are varied; (2) detecting remote homologs; and (3) predicting signaling interactions with peptides. Overall, our work highlights the advantages of biologically inspired pooling techniques in protein sequence modeling and is a step towards more effective adaptations of language models in biological settings.
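A minimal sketch of the locality-aware idea follows, assuming a single learned attention query in place of the full attention operator and an illustrative window size:

import torch
import torch.nn as nn

class BoMPooling(nn.Module):
    # Sketch of bag-of-mer pooling: average residue embeddings inside
    # fixed windows ("mers"), then attend over the window embeddings.
    def __init__(self, dim, window=16):
        super().__init__()
        self.window = window
        self.query = nn.Parameter(torch.randn(dim))  # learned pooling query

    def forward(self, x):                            # x: (L, d) per-residue embeddings
        L, d = x.shape
        pad = (-L) % self.window
        if pad:                                      # pad so L divides into whole windows
            x = torch.cat([x, x.new_zeros(pad, d)])
        bags = x.view(-1, self.window, d).mean(dim=1)   # (L/w, d) windowed averages
        attn = torch.softmax(bags @ self.query, dim=0)  # attention over bags
        return attn @ bags                              # (d,) protein-level embedding

Averaging within windows first reduces the attention cost from quadratic in L towards quadratic in L/w, while the attention step still lets informative windows dominate the protein-level embedding.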
2025-07-21 12:40:00 13:00:00 01A MLCSB NEAR: Neural Embeddings for Amino acid Relationships Daniel Olson Daniel Olson, Thomas Colligan, Daphne Demekas, Jack Roddy, Ken Youens-Clark, Travis Wheeler Protein language models (PLMs) have recently demonstrated potential to supplant classical protein database search methods based on sequence alignment, but are slower than common alignment-based tools and appear to be prone to a high rate of false labeling. Here, we present NEAR, a method based on neural representation learning that is designed to improve both speed and accuracy of search for likely homologs in a large protein sequence database. NEAR’s ResNet embedding model is trained using contrastive learning guided by trusted sequence alignments. It computes per-residue embeddings for target and query protein sequences, and identifies alignment candidates with a pipeline consisting of residue-level k-NN search and a simple neighbor aggregation scheme. Tests on a benchmark consisting of trusted remote homologs and randomly shuffled decoy sequences reveal that NEAR substantially improves accuracy relative to state-of-the-art PLMs, with lower memory requirements and faster embedding / search speed. While these results suggest that the NEAR model may be useful for standalone homology detection with increased sensitivity over standard alignment-based methods, in this manuscript we focus on a more straightforward analysis of the model's value as a high-speed pre-filter for sensitive annotation. In that context, NEAR is at least 5x faster than the pre-filter currently used in the widely-used profile hidden Markov model (pHMM) search tool, HMMER3, and also outperforms the pre-filter used in our fast pHMM tool, nail.
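The residue-level k-NN search and neighbor aggregation can be sketched as follows, using brute-force NumPy for clarity where the actual tool relies on its trained ResNet embedder and an efficient vector index:

import numpy as np
from collections import defaultdict

def near_search(query_emb, db_emb, db_seq_ids, k=8):
    # query_emb:  (Lq, d) residue embeddings of one query (L2-normalised rows)
    # db_emb:     (N, d)  residue embeddings of the whole target database
    # db_seq_ids: (N,)    index of the target sequence each residue belongs to
    sims = query_emb @ db_emb.T                       # residue-vs-residue similarity
    topk = np.argpartition(-sims, k, axis=1)[:, :k]   # k nearest db residues per query residue
    scores = defaultdict(float)
    for qi, hits in enumerate(topk):
        for j in hits:
            scores[db_seq_ids[j]] += sims[qi, j]      # aggregate hits per target sequence
    return sorted(scores.items(), key=lambda kv: -kv[1])  # ranked alignment candidates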
2025-07-21 14:00:00 14:20:00 01A MLCSB LoRA-DR-suite: adapted embeddings predict intrinsic and soft disorder from protein sequences Gianluca Lombardi Gianluca Lombardi, Beatriz Seoane, Alessandra Carbone Intrinsically disordered regions (IDRs) and soft disorder regions (SDRs) provide crucial information on how a protein's structure underpins its function, its interactions with other molecules, and its assembly path. Circular dichroism experiments are used to identify intrinsically disordered residues, while SDRs are characterized using B-factors, missing residues, or a combination of both in alternative X-ray crystal structures of the same molecule. These flexible regions in proteins are particularly significant in diverse biological processes and are often implicated in pathological conditions. Accurate computational prediction of these disordered regions is thus essential for advancing protein research and understanding their functional implications. To address this challenge, LoRA-DR-suite employs a simple adapter-based architecture that utilizes protein language model embeddings as protein sequence representations, enabling the precise prediction of IDRs and SDRs directly from primary sequence data. Alongside the fast LoRA-DR-suite implementation, we release SoftDis, a unique soft disorder database constructed for approximately 500,000 PDB chains. SoftDis is designed to facilitate new research, testing, and applications on soft disorder, advancing the study of protein dynamics and interactions.
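As background on the adapter idea in the suite's name, the standard LoRA update (the generic formulation, not necessarily this work's exact configuration) freezes the pretrained weight matrix and trains only a low-rank correction:

$$ h \;=\; W_0 x \;+\; \frac{\alpha}{r}\, B A x, \qquad A \in \mathbb{R}^{r \times d},\quad B \in \mathbb{R}^{d' \times r},\quad r \ll \min(d, d'), $$

so each adapted layer trains $r(d + d')$ parameters instead of $d'd$, which keeps fine-tuning of large protein language models cheap.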
2025-07-21 14:20:00 14:40:00 01A MLCSB TCR-epiDiff: Solving Dual Challenges of TCR Generation and Binding Prediction Se Yeon Seo Se Yeon Seo, Je-Keun Rhee Motivation: T-cell receptors (TCRs) are fundamental components of the adaptive immune system, recognizing specific antigens for targeted immune responses. Understanding their sequence patterns is essential for designing effective vaccines and immunotherapies. However, the vast diversity of TCR sequences and complex binding mechanisms pose significant challenges in generating TCRs that are specific to a particular epitope. Results: Here, we propose TCR-epiDiff, a diffusion-based deep learning model for generating epitope-specific TCRs and predicting TCR-epitope binding. TCR-epiDiff integrates epitope information during TCR sequence embedding using ProtT5-XL and employs a denoising diffusion probabilistic model for sequence generation. Using external validation datasets, we demonstrate the model's ability to generate biologically plausible, epitope-specific TCRs. Furthermore, we leverage the model's encoder to develop a TCR-epitope binding predictor that shows robust performance on the external validation data. Our approach provides a comprehensive solution for both de novo generation of epitope-specific TCRs and TCR-epitope binding prediction. This capability provides valuable insights into immune diversity and has the potential to advance targeted immunotherapies. Availability and implementation: The data and source code for our experiments are available at https://github.com/seoseyeon/TCR-epiDiff
2025-07-21 14:40:00 15:00:00 01A MLCSB Incorporating Hierarchical Information into Multiple Instance Learning for Patient Phenotype Prediction with scRNA-seq Data Chau Do Chau Do, Harri Lähdesmäki Multiple Instance Learning (MIL) provides a structured approach to patient phenotype prediction with single-cell RNA-sequencing (scRNA-seq) data. However, existing MIL methods tend to overlook the hierarchical structure inherent in scRNA-seq data, especially the biological groupings of cells, or cell types. This limitation may lead to suboptimal performance and poor interpretability at higher levels of the cellular hierarchy. To address this gap, we present a novel approach to incorporate hierarchical information into the attention-based MIL framework. Specifically, our model applies the attention-based aggregation mechanism over both cells and cell types, thus enforcing a hierarchical structure on the flow of information throughout the model. Across extensive experiments, our proposed approach consistently outperforms existing models and demonstrates robustness in data-constrained scenarios. Moreover, ablation results show that simply applying the attention mechanism on cell types instead of cells leads to improved performance, underscoring the benefits of incorporating the hierarchical groupings. By identifying the critical cell types that are most relevant for prediction, we show that our model is capable of capturing biologically meaningful associations, thus facilitating biological discoveries.
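The two-level attention idea can be sketched compactly. This is an illustrative PyTorch fragment with simple linear attention scores; the published architecture's details may differ:

import torch
import torch.nn as nn

class HierarchicalMIL(nn.Module):
    # Attention pools cells within each cell type, then a second attention
    # pools the cell-type embeddings into a patient-level representation.
    def __init__(self, dim, n_classes):
        super().__init__()
        self.cell_attn = nn.Linear(dim, 1)
        self.type_attn = nn.Linear(dim, 1)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, cells, cell_types):    # cells: (N, d); cell_types: (N,)
        type_embs = []
        for t in cell_types.unique():
            group = cells[cell_types == t]                   # cells of one type
            a = torch.softmax(self.cell_attn(group), dim=0)  # attention over cells
            type_embs.append((a * group).sum(dim=0))
        types = torch.stack(type_embs)                       # (T, d)
        b = torch.softmax(self.type_attn(types), dim=0)      # attention over types
        return self.head((b * types).sum(dim=0))             # patient-level logits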
2025-07-21 15:00:00 15:10:00 01A MLCSB AI-Guided Multi-Objective Discovery of Antimicrobial Peptides via Self-Play Reinforcement Learning Chia-Ru Chung, Tzong-Yi Lee, Yun Tang The rising crisis of antimicrobial resistance has created an urgent demand for innovative therapeutic solutions, and antimicrobial peptides (AMPs) present a promising option due to their broad-spectrum effectiveness and unique mechanisms of action. However, current experimental and computational discovery pipelines for AMPs are often slow and limited in the diversity of sequences and properties they can explore. Although AI-driven generative models for peptide design are gaining traction, the field still lacks methods that offer adequate sequence diversity, simultaneous optimization of multiple therapeutic properties, and biologically contextualized control over safety and structural attributes. To address these shortcomings, we propose a strategic reinforcement learning framework for multi-objective AMP discovery, employing policy-guided Monte Carlo tree search within a self-play environment. This AI-driven approach utilizes surrogate models for antimicrobial activity, structural stability, and hemolytic toxicity to inform the generation process while incorporating biologically inspired filtering rules to ensure safety and structural constraints. Candidate peptide sequences were further assessed through an ensemble classifier consensus to validate their predicted efficacy and safety robustly. Our results demonstrate that this framework effectively generates structurally reliable, potent, and non-toxic AMPs, surpassing existing design strategies in the diversity of sequences produced and the favorable trade-offs achieved among activity, stability, and toxicity objectives. The AI-designed peptides display stable secondary structures and strong antimicrobial potency with minimal hemolytic activity, highlighting the advantages of our multi-objective optimization strategy. In summary, this work establishes a powerful AI-driven paradigm for therapeutic peptide design.
2025-07-21 15:10:00 15:20:00 01A MLCSB NetStart 2.0: Prediction of Eukaryotic Translation Initiation Sites Using a Protein Language Model Line Sandvad Nielsen Line Sandvad Nielsen, Anders Gorm Pedersen, Ole Winther, Henrik Nielsen Background: Accurate identification of translation initiation sites is essential for the proper translation of mRNA into functional proteins. In eukaryotes, the choice of the translation initiation site is influenced by multiple factors, including its proximity to the 5' end and the local start codon context. Translation initiation sites mark the transition from non-coding to coding regions. This fact motivates the expectation that the upstream sequence, if translated, would yield a nonsensical order of amino acids, while the downstream sequence would correspond to the structured beginning of a protein. This distinction suggests potential for predicting translation initiation sites using a protein language model. Results: We present NetStart 2.0, a deep learning-based model that integrates the ESM-2 protein language model with the local sequence context to predict translation initiation sites across a broad range of eukaryotic species. NetStart 2.0 was trained as a single model across multiple species, and despite the broad phylogenetic diversity represented in the training data, it consistently relied on features marking the transition from non-coding to coding regions. Conclusion: By leveraging "protein-ness", NetStart 2.0 achieves state-of-the-art performance in predicting translation initiation sites across a diverse range of eukaryotic species. This success underscores the potential of protein language models to bridge transcript- and peptide-level information in complex biological prediction tasks. The NetStart 2.0 webserver is available at: https://services.healthtech.dtu.dk/services/NetStart-2.0/
2025-07-21 15:20:00 15:30:00 01A MLCSB Towards a more inductive world for drug repurposing approaches Uxía Veleiro Jesus De la Fuente Cedeño, Guillermo Serrano, Uxía Veleiro, Mikel Casals, Laura Vera, Marija Pizurica, Nuria Gomez-Cebrian, Leonor Puchades-Carrasco, Antonio Pineda-Lucena, Idoia Ochoa, Silve Vicent, Olivier Gevaert, Mikel Hernaez Drug–target interaction (DTI) prediction is a challenging albeit essential task in drug repurposing. Learning on graph models has drawn special attention, as such models can substantially reduce drug repurposing costs and time. However, many current approaches require additional, hard-to-obtain information besides DTIs, which complicates their evaluation and usability. Additionally, structural differences in the learning architectures of current models hinder their fair benchmarking. In this work, we first perform an in-depth evaluation of current DTI datasets and prediction models through a robust benchmarking process and show that DTI methods based on transductive models lack generalization and lead to inflated performance when traditionally evaluated, making them unsuitable for drug repurposing. We then propose a biologically driven strategy for negative-edge subsampling and, through in vitro validation, uncover previously unknown interactions missed by traditional subsampling. Finally, we provide a toolbox built from all generated resources, crucial for fair benchmarking and robust model design.
2025-07-21 15:30:00 15:40:00 01A MLCSB Hierarchical Multi-Agent Reinforcement Learning For Optimizing CRISPR-Based Polygenic Therapeutic Design Nhung Duong Nhung Duong, Tuan Do, Anh Truong, Ngoc Do, Lap Nguyen Background. CRISPR therapies for polygenic disorders require simultaneous optimization of multiple guides and delivery constraints. Current approaches optimize guides individually, neglecting target interactions. We developed a hierarchical multi-agent reinforcement learning (MARL) framework to optimize CRISPR strategies for polygenic diseases while balancing efficiency, synergy, vector capacity, and immunogenicity. Methods. We implemented a three-layer MARL architecture whose layers (1) optimize guide RNA sequences for each target gene; (2) select optimal editing modes and guide combinations while maximizing synergy; and (3) ensure that vector capacity and immunogenicity constraints are satisfied. For computational feasibility, we used a simplified 1 Mb mini-genome for off-target analysis rather than the full human genome, and employed approximate versions of RuleSet2 scoring, synergy estimation, and immunogenicity prediction. We trained the framework using iterative policy optimization and validated it on a model polygenic retinal disease involving five genes with known pathogenic mutations. Results. The framework generated optimized guides with high on-target efficiency (0.70-1.00) and minimal off-target effects. The system selected optimal editing modes (prime editing for PDE6B, RHO, USH2A; base editing for RP1) while maximizing synergistic effects. The final design utilized Cas9 with a total construct size within the 4,500 bp vector capacity and zero immunogenic epitopes, requiring only two rework iterations. Conclusion. Our MARL approach demonstrates AI's potential for solving complex therapeutic design challenges. The framework navigates multi-dimensional optimization involving sequence, strategy, and clinical constraints simultaneously. While the current implementation uses simplified biological modeling, the foundation is robust and provides a proof-of-concept for AI-guided design of personalized CRISPR therapeutics for polygenic diseases.
2025-07-21 15:40:00 15:50:00 01A MLCSB Sliding Window INteraction Grammar (SWING): a generalized interaction language model for peptide and protein interactions Jishnu Das Jane Siwek, Alisa Omelchenko, Prabal Chhibbar, Alok Joglekar, Jishnu Das Protein language models (pLMs) can embed protein sequences for different proteomic tasks. However, these methods are suboptimal at learning the language of protein interactions. We developed an interaction LM (iLM), the Sliding Window Interaction Grammar (SWING), which leverages differences in amino acid properties to generate an interaction vocabulary. This vocabulary is embedded by an LM, and supervised learning is performed on the embeddings. SWING was used across a range of tasks. Using only sequence information, it predicted both class I and class II pMHC interactions as accurately as state-of-the-art approaches. Further, the Class I SWING model could uniquely cross-predict Class II interactions, a complex prediction task not attempted by existing methods. A unique Mixed Class model effectively predicted interactions for both classes. Using only human Class I or Class II data, SWING accurately predicted novel murine Class II pMHC interactions involving risk alleles in SLE and T1D. SWING also accurately predicted how Mendelian and population variants can disrupt specific protein-protein interactions, based on sequence information alone. Across these tasks, SWING outperformed passive uses of pLM embeddings, demonstrating the value of the unique iLM architecture. Overall, SWING is a first-in-class generalizable zero-shot iLM that learns the language of PPIs.
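To make the interaction-vocabulary idea concrete, here is a toy sketch using Kyte-Doolittle hydropathy differences as the residue property; the property encoding and tokenization used by SWING differ in detail:

KD = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5,
      'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9,
      'M': 1.9, 'F': 2.8, 'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9,
      'Y': -1.3, 'V': 4.2}

def interaction_words(receptor, peptide, k=4):
    # Slide the peptide along the receptor; at each offset, encode the
    # per-position property differences as digits, then cut the digit
    # string into k-mer "words" forming an interaction sentence.
    words = []
    for offset in range(len(receptor) - len(peptide) + 1):
        window = receptor[offset:offset + len(peptide)]
        diffs = ''.join(str(min(9, int(abs(KD[a] - KD[b]))))
                        for a, b in zip(window, peptide))
        words += [diffs[i:i + k] for i in range(len(diffs) - k + 1)]
    return ' '.join(words)   # the "sentence" handed to an LM tokenizer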
2025-07-21 15:50:00 16:00:00 01A MLCSB Evolutionary constraints guide AlphaFold2 in predicting alternative conformations and inform rational mutation design Francesca Cuturello Valerio Piomponi, Alberto Cazzaniga, Francesca Cuturello Investigating structural variability is essential for understanding protein biological functions. Although AlphaFold2 accurately predicts static structures, it fails to capture the full spectrum of functional states. Recent methods have used AlphaFold2 to generate diverse structural ensembles, but they offer limited interpretability and overlook the evolutionary signals underlying predictions. In this work, we enhance the generation of conformational ensembles and identify sequence patterns that influence alternative fold predictions for several protein families. Building on prior research that clustered Multiple Sequence Alignments to predict fold-switching states, we introduce a refined clustering strategy that integrates protein language model representations with hierarchical clustering, overcoming limitations of density-based methods. Our strategy effectively identifies high-confidence alternative conformations and generates abundant sequence ensembles, providing a robust framework for applying Direct Coupling Analysis (DCA). Through DCA, we uncover key coevolutionary signals within the clustered alignments, leveraging them to design mutations that stabilize specific conformations, which we validate using alchemical free energy calculations from molecular dynamics. Notably, our method extends beyond fold-switching, effectively capturing a variety of conformational changes.
2025-07-21 16:40:00 16:50:00 01A MLCSB Integrating Machine Learning and Systems Biology to rationally design operational conditions for in vitro / in vivo translation of microphysiological systems Nikolaos Meimetis Nikolaos Meimetis, Jose Cadavid, Linda Griffith, Douglas Lauffenburger Preclinical models are used extensively to study diseases and potential therapeutic treatments. Complex in vitro platforms incorporating human cellular components, known as microphysiological systems (MPS), can model cellular and microenvironmental features of diseased tissues. However, determining the experimental conditions, particularly biomolecular cues such as growth factors, cytokines, and matrix proteins, that provide the most effective translatability of MPS-generated information to in vivo human subject contexts is a major challenge. Here, using metabolic dysfunction-associated fatty liver disease (MAFLD) studied on the CNBio PhysioMimix as a case study, we developed a machine learning framework called Latent In Vitro to In Vivo Translation (LIV2TRANS) to ascertain how MPS data map to in vivo data, first sharpening translation insights and consequently elucidating experimental conditions that can further enhance translation capability. Our findings in this case study highlight TGFβ as a crucial cue for MPS translatability and indicate that adding JAK-STAT pathway perturbations via interferon stimuli could increase the predictive performance of this MPS in MAFLD studies. Finally, we developed an optimization approach that identified androgen and EGFR signaling as key for maximizing the capacity of this MPS to capture in vivo human biological information germane to MAFLD. More broadly, this work establishes a mathematically principled approach for identifying experimental conditions that most beneficially capture in vivo human-relevant molecular pathways and processes, generalizable to preclinical studies for a wide range of diseases and potential treatments.
2025-07-21 16:50:00 17:00:00 01A MLCSB Data Splitting Against Information Leakage with DataSAIL Roman Joeres Roman Joeres, David B. Blumenthal, Olga Kalinina Information leakage (IL) is an increasingly important topic in machine learning (ML) research, especially in biomedical applications. When IL happens during a model's development, the model is prone to memorizing the training data instead of learning generalizable properties. This can lead to inflated performance metrics that do not reflect the actual performance at inference time. Therefore, we present DataSAIL, a versatile Python package to facilitate leakage-reduced data splitting to enable realistic evaluation of ML models for biological data that are intended to be applied in out-of-distribution scenarios. DataSAIL is based on formulating the problem to find leakage-reduced data splits as a combinatorial optimization problem. We prove that this problem is NP-hard and provide a scalable heuristic based on clustering and integer linear programming. DataSAIL uses similarities between samples to compute leakage-reduced splits for classical property prediction tasks, stratified splits, and drug-target interaction datasets where information can be leaked along two dimensions. DataSAIL is accepted in principle by Nature Communications. We empirically demonstrate DataSAIL's impact on evaluating biomedical ML models. We compare DataSAIL to seven other algorithms on 14 datasets from the MoleculeNet benchmark and LP-PDBBind. We show that DataSAIL is consistently amongst the best algorithms in removing IL. Furthermore, we train 6 different ML models on each split to evaluate how information leakage affects different models. We observe that deep learning models generally perform better than statistical models and that higher IL leads to better performance estimates. Another ablation study shows that DataSAIL reduces IL better than the PLINDER benchmark.
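The cluster-then-assign idea behind such splits can be illustrated in a few lines (a conceptual greedy sketch, not DataSAIL's API or its integer-linear-programming formulation):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def leakage_reduced_split(condensed_dist, frac_train=0.8, t=0.6):
    # condensed_dist: condensed pairwise distance vector of the samples,
    # as produced by scipy.spatial.distance.pdist.
    labels = fcluster(linkage(condensed_dist, method='average'),
                      t=t, criterion='distance')
    n = len(labels)
    sizes = np.bincount(labels)[1:]          # cluster sizes (labels start at 1)
    train, test = [], []
    for c in np.argsort(-sizes) + 1:         # assign largest clusters first
        members = np.where(labels == c)[0].tolist()
        if len(train) + len(members) <= frac_train * n:
            train.extend(members)
        else:
            test.extend(members)
    return train, test

Because similar samples are merged into clusters before assignment, no highly similar pair can end up on both sides of the train/test boundary.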
2025-07-21 17:00:00 18:00:00 01A MLCSB Generative AI for Unlocking the Complexity of Cells Maria Brbic We are witnessing an AI revolution. At the heart of this revolution are generative AI models that, powered by advanced architectures and large datasets, are transforming AI across a variety of disciplines. But how can AI facilitate and eventually enable discoveries in life sciences? How can it bring us closer to understanding biology, the functions of our cells and relationships across different molecular layers? In this talk, I will present AI methods that can extract meaningful differences between classes from representations of foundation models with minimal or no supervision. I will then introduce generative AI methods designed to uncover relationships across different omics layers. I will demonstrate how these approaches enable the reassembly of tissues from dissociated single cells and how AI-driven tissue reconstruction can overcome existing technological limitations.
2025-07-22 11:20:00 12:20:00 01A MLCSB Where does it hurt (in your genome)? Julien Gagneur The identification of genetic variants strongly affecting phenotypes remains an unsolved problem, with major relevance for rare disease diagnostics, oncology, and the identification of effector genes of complex traits and diseases. I will present a series of published and ongoing work from my lab tackling this issue, with a focus on non-coding variants. This will span variant scoring based on genomic language models [1], methods to predict aberrant expression [2] and splicing [3], all the way to integrative deep learning models for rare variant association analyses demonstrated on UK Biobank [4]. 1. Tomaz da Silva, et al. Nucleotide dependency analysis of DNA language models reveals genomic functional elements. bioRxiv, 2024. 2. Hölzlwimmer et al. Aberrant gene expression prediction across human tissues. Nature Communications, 2025. 3. Wagner et al. Aberrant splicing prediction across human tissues. Nature Genetics, 2023. 4. Clarke, Holtkamp, et al. Integration of variant annotations using deep set networks boosts rare variant association genetics. Nature Genetics, 2024.
2025-07-22 12:20:00 12:40:00 01A MLCSB Deep learning models for unbiased sequence-based PPI prediction plateau at an accuracy of 0.65 Judith Bernett Timo Reim, Anne Hartebrodt, David B. Blumenthal, Judith Bernett, Markus List As most proteins interact with other proteins to perform their respective functions, methods to computationally predict these interactions have been developed. However, flawed evaluation schemes and data leakage in test sets have obscured the fact that sequence-based protein-protein interaction (PPI) prediction is still an open problem. Recently, methods achieving better-than-random performance on leakage-free PPI data have been proposed. Here, we show that the use of ESM-2 protein embeddings explains this performance gain irrespective of model architecture. We compared the performance of models with varying complexity, per-protein, and per-token embeddings, as well as the influence of self- or cross-attention, where all models plateaued at an accuracy of 0.65. Moreover, we show that the tested sequence-based models cannot implicitly learn a contact map as an intermediate layer. These results imply that other input types, such as structure, might be necessary for producing reliable PPI predictions.
2025-07-22 12:40:00 13:00:00 01A MLCSB Accurate PROTAC targeted degradation prediction with DegradeMaster Jie Liu Jie Liu, Michael Roy, Luke Isbel, Fuyi Li Motivation: Proteolysis-targeting chimeras (PROTACs) are heterobifunctional molecules that can degrade 'undruggable' proteins of interest (POIs) by recruiting E3 ligases and hijacking the ubiquitin-proteasome system. Some efforts have been made to develop deep learning-based approaches to predict the degradation ability of a given PROTAC. However, existing deep learning methods either simplify proteins and PROTACs as 2D graphs by disregarding crucial 3D spatial information or exclusively rely on limited labels for supervised learning without considering the abundant information in unlabeled data. Nevertheless, considering the potential to accelerate drug discovery, developing more accurate computational methods for PROTAC-targeted protein degradation prediction is critical. Results: This study proposes DegradeMaster, a semi-supervised E(3)-equivariant graph neural network-based predictor for targeted degradation prediction of PROTACs. DegradeMaster leverages an E(3)-equivariant graph encoder to incorporate 3D geometric constraints into the molecular representations and utilizes a memory-based pseudo-labeling strategy to enrich annotated data during training. A mutual attention pooling module is also designed for interpretable graph representation. Experiments on both supervised and semi-supervised PROTAC datasets demonstrate that DegradeMaster outperforms state-of-the-art baselines, substantially improving AUROC by 10.5%. Case studies show DegradeMaster achieves 88.33% and 77.78% accuracy in predicting the degradability of VZ185 candidates on BRD9 and ACBI3 on KRAS mutants. Visualization of attention weights on the 3D molecular graph demonstrates that DegradeMaster recognises the linking and binding regions of warheads and E3 ligands and emphasizes the importance of structural information in these areas for degradation prediction. Together, this shows the potential for cutting-edge tools to highlight functional PROTAC components, thereby accelerating novel compound generation.
2025-07-22 14:00:00 14:20:00 01A MLCSB GPO-VAE: Modeling Explainable Gene Perturbation Responses utilizing GRN-Aligned Parameter Optimization Seungheun Baek Seungheun Baek, Soyon Park, Yan Ting Chok, Mogan Gim, Jaewoo Kang Predicting cellular responses to genetic perturbations is essential for understanding biological systems and developing targeted therapeutic strategies. While variational autoencoders (VAEs) have shown promise in modeling perturbation responses, their limited explainability poses a significant challenge, as the learned features often lack clear biological meaning. Nevertheless, model explainability is one of the most important aspects in the realm of biological AI. One of the most effective ways to achieve explainability is to incorporate the concept of gene regulatory networks (GRNs) when designing deep learning models such as VAEs. GRNs capture the underlying causal relationships between genes and can explain the transcriptional responses caused by genetic perturbation treatments. We propose GPO-VAE, an explainable VAE enhanced by GRN-aligned Parameter Optimization that explicitly models gene regulatory networks in the latent space. Our key approach is to optimize the learnable parameters related to latent perturbation effects towards GRN-aligned explainability. Experimental results on perturbation prediction show our model achieves state-of-the-art performance in predicting transcriptional responses across multiple benchmark datasets. Furthermore, additional results on the GRN inference task reveal our model's ability to generate meaningful GRNs compared to other methods. According to qualitative analysis, GPO-VAE possesses the ability to construct biologically explainable GRNs that align with experimentally validated regulatory pathways.
2025-07-22 14:20:00 14:40:00 01A MLCSB Fast and scalable Wasserstein-1 neural optimal transport solver for single-cell perturbation prediction Yanshuo Chen Yanshuo Chen, Zhengmian Hu, Wei Chen, Heng Huang Predicting single-cell perturbation responses requires mapping between two unpaired single-cell data distributions. Optimal transport (OT) theory provides a principled framework for constructing such mappings by minimizing transport cost. Recently, Wasserstein-2 (W2) neural optimal transport solvers (e.g., CellOT) have been employed for this prediction task. However, W2 OT relies on the general Kantorovich dual formulation, which involves optimizing over two conjugate functions, leading to a complex min-max optimization problem that converges slowly. To address these challenges, we propose a novel solver based on the Wasserstein-1 (W1) dual formulation. Unlike W2, the W1 dual simplifies the optimization to a maximization problem over a single 1-Lipschitz function, thus eliminating the need for time-consuming min-max optimization. While solving the W1 dual only reveals the transport direction and does not directly provide a unique optimal transport map, we incorporate an additional step using adversarial training to determine an appropriate transport step size, effectively recovering the transport map. Our experiments demonstrate that the proposed W1 neural optimal transport solver can mimic W2 OT solvers in finding a unique and "monotonic" map on 2D datasets. Moreover, the W1 OT solver achieves performance on par with or surpassing W2 OT solvers on real single-cell perturbation datasets. Furthermore, we show that the W1 OT solver achieves a 25-45x speedup, scales better on high-dimensional transportation tasks, and can be directly applied to single-cell RNA-seq datasets with highly variable genes. Our implementation and experiments are open-sourced at https://github.com/poseidonchan/w1ot.
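For context, the Kantorovich-Rubinstein dual that such a solver maximizes involves a single 1-Lipschitz potential (standard optimal transport theory, not notation specific to this paper):

$$ W_1(\mu, \nu) \;=\; \sup_{\|f\|_{\mathrm{Lip}} \le 1} \; \mathbb{E}_{x \sim \mu}[f(x)] \;-\; \mathbb{E}_{y \sim \nu}[f(y)], $$

whereas the W2 dual couples two conjugate potentials in a min-max problem. The gradient of the learned potential f supplies the transport direction; the adversarially trained step size along it then recovers the map.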
2025-07-22 14:40:00 15:00:00 01A MLCSB Recovering Time-Varying Networks From Single-Cell Data Euxhen Hasanaj Euxhen Hasanaj, Barnabás Póczos, Ziv Bar-Joseph Gene regulation is a dynamic process that underlies all aspects of human development, disease response, and other key biological processes. The reconstruction of temporal gene regulatory networks has conventionally relied on regression analysis, graphical models, or other types of relevance networks. With the large increase in time series single-cell data, new approaches are needed to address the unique scale and nature of this data for reconstructing such networks. Here, we develop a deep neural network, Marlene, to infer dynamic graphs from time series single-cell gene expression data. Marlene constructs directed gene networks using a self-attention mechanism where the weights evolve over time using recurrent units. By employing meta learning, the model is able to recover accurate temporal networks even for rare cell types. In addition, Marlene can identify gene interactions relevant to specific biological responses, including COVID-19 immune response, fibrosis, and aging, paving the way for potential treatments. The code used to train Marlene is available at https://github.com/euxhenh/Marlene.
2025-07-22 15:00:00 15:10:00 01A MLCSB Developing a Deep Learning Model for Single-Cell RNA Splicing Analysis Luyang Li Luyang Li, You Zhou Approximately 95% of human genes undergo alternative splicing (AS), a process that allows a single gene to produce multiple proteins with distinct functions. This mechanism enormously increases the complexity of our genome and plays an important role in maintaining health. Disruptions in normal splicing can lead to various diseases, and predicting AS events and understanding their regulatory mechanisms at the single-cell level can open the door to discovering new therapeutic targets. Despite the importance of RNA splicing, current single-cell RNA sequencing (scRNA-seq) efforts primarily focus on gene expression profiling, and very few scRNA-seq computational tools are available for identifying and quantifying RNA splicing. Inspired by the successful application of large language models in biomedical research, we developed a new state-space-model-based framework for alternative splicing prediction, named SAS, trained on long-read RNA sequencing data. SAS employs a stacked selective state space model architecture to generate latent state representations of transcript sequences, enabling accurate predictions of diverse AS events, even in data-limited conditions. Furthermore, this model is specifically tailored for single-cell splicing prediction. Our results show that SAS outperforms existing methods, achieving an accuracy of 0.97, PR-AUC of 0.99, and F1 score of 0.97. This innovative framework provides valuable insights into identifying splicing events at single-cell resolution, guiding experimental efforts to uncover novel splicing mechanisms and therapeutic targets.
2025-07-22 15:10:00 15:20:00 01A MLCSB Benchmarking foundation cell models for post-perturbation RNA-seq prediction Gerold Csendes Gerold Csendes, Gema Sanz, Krisóf Szalay, Bence Szalai Accurately predicting cellular responses to perturbations is essential for understanding cell behaviour in both healthy and diseased states. While perturbation data is ideal for building such predictive models, its availability is considerably lower than baseline (non-perturbed) cellular data. To address this limitation, several foundation cell models have been developed using large-scale single-cell gene expression data. These models are fine-tuned after pre-training for specific tasks, such as predicting post-perturbation gene expression profiles, and are considered state-of-the-art for these problems. However, proper benchmarking of these models remains an unsolved challenge. In this study, we benchmarked two recently published foundation models, scGPT and scFoundation, against baseline models. Surprisingly, we found that even the simplest baseline model - taking the mean of training examples - outperformed scGPT and scFoundation. Furthermore, basic machine learning models that incorporate biologically meaningful features outperformed scGPT by a large margin. Additionally, we identified that the current Perturb-Seq benchmark datasets exhibit low perturbation-specific variance, making them suboptimal for evaluating such models. Our results highlight important limitations in current benchmarking approaches and provide insights into more effectively evaluating post-perturbation gene expression prediction models.
2025-07-22 15:20:00 15:30:00 01A MLCSB scPRINT: pre-training on 50 million cells allows robust gene network predictions Jeremie Kalfon Jeremie Kalfon, Gabriel Peyré, Laura Cantini A cell is governed by the interaction of myriads of macromolecules. Inferring such a network of interactions has remained an elusive milestone in cellular biology. Building on recent advances in large foundation models and their ability to learn without supervision, we present scPRINT, a large cell model for the inference of gene networks pre-trained on more than 50 million cells from the cellxgene database. Using innovative pretraining tasks and model architecture, scPRINT pushes large transformer models towards more interpretability and usability when uncovering the complex biology of the cell. Based on our atlas-level benchmarks, scPRINT demonstrates superior performance in gene network inference to the state of the art, as well as competitive zero-shot abilities in denoising, batch effect correction, and cell label prediction. On an atlas of benign prostatic hyperplasia, scPRINT highlights the profound connections between ion exchange, senescence, and chronic inflammation.
2025-07-22 15:30:00 15:40:00 01A MLCSB MGCL-ST: Multi-view Graph Self-supervised Contrastive Learning for Spatial Transcriptomics Enhancement Hongmin Cai Hongmin Cai, Siqi Ding, Weitian Huang Spatial transcriptomics enables the investigation of gene expression within its native spatial context, but existing technologies often suffer from low resolution and sparse sampling. These limitations hinder the accurate delineation of fine tissue structures and reduce robustness to noise. To address these challenges, we propose MGCL-ST, a spatial transcriptomics super-resolution framework that integrates multi-view contrastive learning with a dual-metric neighbor selection strategy. By combining spatial structure and histological image features, MGCL-ST achieves robust, pixel-level gene expression imputation. Experimental results on both simulated and real datasets demonstrate its superior reconstruction accuracy and generalization capability, supporting advanced analysis of the tumor microenvironment.
2025-07-22 15:40:00 15:50:00 01A MLCSB Characterizing cell-type spatial relationships across length scales in spatially resolved omics data Rafael dos Santos Peixoto Rafael dos Santos Peixoto, Brendan Miller, Maigan Brusko, Gohta Aihara, Lyla Atta, Manjari Anant, Adina Jailova, Mark Atkinson, Todd Brusko, Clive Wasserfall, Jean Fan Spatially resolved omics (SRO) technologies enable the identification of cell types while preserving their organization within tissues. Application of such technologies offers the opportunity to delineate cell-type spatial relationships, particularly across different length scales, and enhance our understanding of tissue organization and function. To quantify such multi-scale cell-type spatial relationships, we present CRAWDAD, Cell-type Relationship Analysis Workflow Done Across Distances, as an open-source R package. To demonstrate the utility of such multi-scale characterization, recapitulate expected cell-type spatial relationships, and evaluate against other cell-type spatial analyses, we apply CRAWDAD to various simulated and real SRO datasets of diverse tissues assayed by different SRO technologies. We further demonstrate how such multi-scale characterization enabled by CRAWDAD can be used to compare cell-type spatial relationships across multiple samples. Finally, we apply CRAWDAD to SRO datasets of the human spleen to identify consistent as well as patient- and sample-specific cell-type spatial relationships. In general, we anticipate that such multi-scale analysis of SRO data enabled by CRAWDAD will provide useful quantitative metrics to facilitate the identification, characterization, and comparison of cell-type spatial relationships across axes of interest.
2025-07-22 15:50:00 16:00:00 01A MLCSB Segger: Fast and accurate cell segmentation of imaging-based spatial transcriptomics data Elyas Heidari Elyas Heidari, Andrew Moorman, Moritz Gerstung, Dana Pe'Er, Oliver Stegle, Tal Nawy Accurate cell segmentation is a critical first step in the analysis of imaging-based spatial transcriptomics (iST). Despite decades of research in cell segmentation, current methods fail to address this task with adequate accuracy: they tend to over- or under-segment, create false-positive transcript assignments, and often fail to scale to large datasets with hundreds of millions of transcripts. To address these limitations, we introduce segger, a versatile graph neural network (GNN) that frames cell segmentation as a transcript-to-cell link prediction task. Segger employs a heterogeneous graph representation of individual transcripts and cells, and can optionally leverage single-cell RNA-seq information to enhance transcript assignments. In benchmarks on multiple iST datasets, including a lung adenocarcinoma dataset with membrane staining for validation, segger demonstrates superior sensitivity and specificity compared to existing methods such as Baysor and BIDCell. At the same time, segger requires orders of magnitude less compute time than existing approaches. The Segger software features adaptive tiling and efficient task scheduling, supporting multi-GPU processing and multi-threading for scalability. Segger also includes a new workflow to cluster unassigned transcripts into ‘fragments’, enabling the recovery of information missed by nucleus or membrane marker-dependent methods. Segger is implemented as user-friendly open-source software (https://github.com/PMBio/segger), comes with extensive documentation, and integrates seamlessly into existing workflows, enabling atlas-scale applications with high accuracy and speed.
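To make the link-prediction framing concrete, here is a minimal, hypothetical sketch (not segger's implementation, which trains a heterogeneous GNN): given latent embeddings for transcripts and candidate cells, score every transcript-cell edge, assign each transcript to its best-scoring cell, and leave low-confidence transcripts unassigned, analogous to the 'fragments' mentioned above.

```python
import numpy as np

# Toy embeddings: rows are transcripts / cells in a shared latent space.
# In a GNN setting these would come from message passing over a heterogeneous
# graph of transcript and cell (nucleus) nodes; here they are random stand-ins.
rng = np.random.default_rng(0)
z_transcript = rng.normal(size=(1000, 16))   # one row per detected transcript
z_cell = rng.normal(size=(50, 16))           # one row per candidate cell/nucleus

# Link score for every transcript-cell pair; softmax-normalized per transcript.
scores = z_transcript @ z_cell.T
probs = np.exp(scores - scores.max(1, keepdims=True))
probs /= probs.sum(1, keepdims=True)

# Assign each transcript to its best cell, leaving low-confidence transcripts
# unassigned (-1): these are candidates for 'fragment' clustering.
best = probs.argmax(1)
assigned = np.where(probs.max(1) > 0.5, best, -1)     # 0.5 is an arbitrary cutoff
print(f"{(assigned >= 0).mean():.1%} of transcripts confidently assigned")
```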
2025-07-22 16:40:00 16:50:00 01A MLCSB Dissecting cellular and molecular mechanisms of pancreatic cancer with deep learning Aarthi Venkat Aarthi Venkat, Cathy Garcia, Daniel McQuaid, Smita Krishnaswamy, Mandar Muzumdar Pancreatic endocrine-exocrine crosstalk plays a key role in normal physiology and disease and is perturbed by altered host metabolic states. For example, obesity imparts a stress-induced endocrine secretion of cholecystokinin (CCK), which promotes pancreatic ductal adenocarcinoma (PDAC), an exocrine tumor. However, the mechanisms governing endocrine-exocrine signaling in obesity-driven tumorigenesis remain unclear. Here, we design a suite of machine learning tools (TrajectoryNet, AAnet, scMMGAN, DiffusionEMD) to reveal from single-cell RNA-seq data the cellular and molecular mechanisms by which beta cells express CCK and promote obesity-driven PDAC. AAnet identifies an immature beta cell state characterized by low insulin and maturation marker expression and high dedifferentiation and immaturity marker expression. TrajectoryNet predicts that obesity stimulates this immature state to expand and adapt toward a pro-tumorigenic CCK-hi state, which we validate with in vivo genetic lineage tracing. TrajectoryNet-based gene regulatory network inference predicts cJun regulates CCK, validated by JNK inhibition and CUT&RUN sequencing showing cJun mediates CCK expression by binding to a novel conserved 3’ enhancer ~3kb downstream of the Cck gene. Finally, mapping beta cells from diverse physiologic and pharmacologic stressors, developmental stages, and species to our dataset with scMMGAN and DiffusionEMD reveals concordance between adult beta cell dedifferentiation and embryonic beta cells, as well as shared stress induction mechanisms between obesity and type II diabetes in mice and humans. Together, this work uncovers new avenues to target the endocrine pancreas to subvert exocrine tumorigenesis and highlights the utility of developing biological and computational models in a wet-to-dry and dry-to-wet fashion toward mechanistic discovery.
2025-07-22 16:50:00 17:00:00 01A MLCSB SpliceSelectNet: A Hierarchical Transformer-Based Deep Learning Model for Splice Site Prediction Yuna Miyachi Yuna Miyachi, Kenta Nakai RNA splicing is a critical post-transcriptional process that enables the generation of diverse protein isoforms. Aberrant splicing is implicated in a wide range of genetic disorders and cancers, making accurate prediction of splice sites and mutation effects essential. Convolutional neural network-based models such as SpliceAI and Pangolin have achieved high accuracy but often lack interpretability. Recently, Transformer-based models like DNABERT and SpTransformer have been applied to genomic sequences, yet they typically inherit input length limitations from natural language processing models, restricting context to a few thousand base pairs, which is insufficient for capturing long-range regulatory signals. To overcome these challenges, we propose SpliceSelectNet (SSNet), a hierarchical Transformer model that integrates local and global attention mechanisms to handle up to 100 kb of input while maintaining nucleotide-level interpretability. Trained on multiple datasets, including those incorporating splice site usage derived from RNA-seq data, SSNet outperforms SpliceAI and Pangolin on the Gencode test dataset, a clinically curated BRCA variant dataset, and a deep intronic variant benchmark. It demonstrates improved performance, particularly in regions characterized by complex splicing regulation, such as long exons and deep introns, as measured by area under the precision-recall curve. Furthermore, SSNet’s attention maps provide direct insight into sequence context. In the case of a pathogenic variant in BRCA1 exon 10, the model highlighted an upstream region that may contribute to cryptic splice site activation. These results demonstrate that SSNet combines high predictive performance with biological interpretability, offering a powerful tool for splicing analysis in both research and clinical settings.
2025-07-22 17:00:00 18:00:00 01A MLCSB Toward Mechanistic Genomics: Advances in Sequence-to-Function Modeling Maria Chikina Recent advances have firmly established sequence-to-function models as essential tools in modern genomics, enabling unprecedented insights into how genomic sequences drive molecular and cellular phenotypes. As these models have matured—with increasingly robust architectures, improved training strategies, and the emergence of standardized software frameworks—the field has rapidly evolved from proof-of-concept demonstrations to widespread practical applications across a variety of biological systems. With the core methodologies now widely adopted and infrastructure in place, the community's focus is shifting toward ambitious new frontiers. There is growing momentum around developing models that are biologically interpretable, capable of uncovering causal mechanisms of gene regulation, and generalizable to novel contexts—such as predicting the effects of perturbing a regulatory protein rather than simply altering a DNA sequence. These efforts reflect a broader aspiration: to create models that serve not just as black-box predictors, but as scientific instruments that deepen our understanding of genome function. In this talk, we will explore how such models can move us from descriptive genomics to mechanistic insight, highlighting recent innovations in architecture and training that support interpretability, modularity, and reusability. We will examine the contexts in which these models offer clear advantages, the limitations that remain, and practical considerations for their training. Ultimately, we will consider how advancing these models may refine the role of machine learning in biology, supporting not only accurate prediction but also the generation of more detailed and mechanistically informed hypotheses.
2025-07-22 11:20:00 12:00:00 01B NetBio Multi-modal learning for single-cell data integration Laura Cantini Laura Cantini Single-cell RNA sequencing (scRNAseq) is revolutionizing biology and medicine. The possibility of assessing cellular heterogeneity at a previously inaccessible resolution has profoundly impacted our understanding of development, immune system function, and many diseases. While scRNAseq is now mature, single-cell technological development has shifted to other large-scale quantitative measurements, a.k.a. ‘omics’, and even spatial positioning. Each single-cell omics presents intrinsic limitations and provides different and complementary information on the same cell. The current main challenge in computational biology is to design appropriate methods to integrate this wealth of information and translate it into actionable biological knowledge. In this talk, I will discuss three main computational directions currently explored in my team: (i) dimensionality reduction to study cellular heterogeneity simultaneously from multiple omics; (ii) gene network inference to integrate a large range of interactions between the features of various omics and isolate the regulators underlying cellular heterogeneity; and (iii) spatially-informed trajectory inference to reconstruct the spatiotemporal landscape underlying cell dynamics.
2025-07-22 12:00:00 12:20:00 01B NetBio Nichesphere: A method to identify disease-specific physical cell-cell interactions and underlying cellular communication networks Mayra Luisa Ruiz Tejada Segura Hélène Gleitz, Mayra Luisa Ruiz Tejada Segura, James Nagai, Giulia Cesaro, Ivan G. Costa, Rebekka Schneider Understanding disease-specific cellular crosstalk is crucial for therapeutic targeting but challenging, as signaling pathways dependent on direct physical interactions are lost in current single-cell sequencing protocols. Some technologies, such as Physically Interacting Cells sequencing (PIC-seq) and spatial transcriptomics, provide information on the spatial context of cells, with potential for constructing physical interaction maps. However, computational analysis of cell-cell interaction data remains challenging, as current methods are unable to compare cell-cell interactions between two conditions, such as homeostasis and disease. Furthermore, although ligand-receptor-based cell communication analysis provides an opportunity to functionally characterize these interactions, no approach has yet linked cell-cell physical interaction and cell communication mechanisms. To address this gap, we introduce Nichesphere, a computational tool to identify disease-related physical cell-cell interactions and the underlying cellular communication networks. We apply Nichesphere to analyze bone marrow (BM) fibrosis PIC-seq data. This analysis revealed molecular niches with increased interactions in BM fibrosis and enabled the characterization of cellular processes, such as extracellular matrix remodeling and immune recruitment, associated with these interactions.
2025-07-22 12:20:00 12:40:00 01B NetBio NetREm: Network Regression Embeddings reveal cell-type transcription factor coordination for gene regulation Saniya Khullar Saniya Khullar, Xiang Huang, Raghu Ramesh, John Svaren, Daifeng Wang Background: Transcription factor (TF) coordination plays a key role in gene regulation via direct and/or indirect protein–protein interactions (PPIs) and co-binding to regulatory elements on DNA. Single-cell technologies enable gene expression measurement for individual cells and identification of distinct cell types, yet the link between TF-TF coordination and target gene (TG) regulation across diverse cell types remains poorly understood. Method: To address this, we introduce Network Regression Embeddings (NetREm), an innovative computational approach to uncover cell-type-specific TF-TF coordination activities driving TG regulation. NetREm leverages network-constrained regularization, integrating prior knowledge of TF-TF PPIs with single-cell (or bulk-level) gene expression data. It identifies transcriptional regulatory modules (TRMs) composed of antagonistic/cooperative TF-TF PPIs and predicts novel TF-TG regulatory links complementing state-of-the-art gene regulatory networks (GRNs). Results: We validate NetREm’s performance through simulation studies and benchmark it across multiple datasets in humans, mice, and yeast. NetREm prioritizes biologically meaningful TF-TF coordination networks in 9 peripheral blood mononuclear cell types and 42 immune cell subtypes. Additionally, we apply NetREm to neural cell types (e.g., neurons, glia, Schwann cells) from the central and peripheral nervous systems, and to Alzheimer’s disease versus control brains. Top predictions are supported by orthogonal experimental validation data, including ChIP-seq, CUT&RUN, scATAC-seq, knockout studies, expression QTLs, and genome-wide association studies. We further link disease-associated variants to our inferred TRMs and GRNs. Conclusion: NetREm provides a powerful and interpretable framework for predicting GRNs and TF-TF coordination networks in a cell-type-specific manner. Our tool is available on GitHub to help propel functional genomics and therapeutic discovery.
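NetREm's exact objective is not reproduced here, but network-constrained regularization of the kind the abstract describes can be illustrated with a small sketch: regress a target gene on TF expression while a Laplacian penalty derived from a prior TF-TF PPI graph encourages interacting TFs to receive similar coefficients. All matrices below are toy data.

```python
import numpy as np

def network_regularized_fit(X, y, L, lam_net=1.0, lam_ridge=0.1):
    """Solve min_b ||y - X b||^2 + lam_net * b^T L b + lam_ridge * ||b||^2,
    where L is the Laplacian of a prior TF-TF network (e.g. from PPIs).
    The Laplacian term pulls coefficients of connected TFs toward each other."""
    p = X.shape[1]
    A = X.T @ X + lam_net * L + lam_ridge * np.eye(p)
    return np.linalg.solve(A, X.T @ y)

# Toy example: 5 TFs; TFs 0, 1, 2 form a PPI clique, TFs 3 and 4 are isolated.
W = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (0, 2)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(1)) - W                      # graph Laplacian

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # TF expression across cells
y = X @ np.array([1.0, 1.0, 1.0, 0.0, 0.0]) + 0.1 * rng.normal(size=200)
print(network_regularized_fit(X, y, L).round(2))   # coordinated TFs get similar weights
```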
2025-07-22 12:40:00 13:00:00 01B NetBio Cell-specific Graph Operation Strategy on Signaling Intracellular Pathways Giulia Cesaro Giulia Cesaro, James Nagai, Giacomo Baruzzo, Barbara Di Camillo, Ivan Costa Recent advances in single-cell RNA sequencing have enabled a detailed exploration of cell-cell communication. Several computational tools infer extracellular signaling via ligand-receptor interactions and associate them with downstream transcription factors and target genes using prior knowledge of signaling pathways. However, most approaches overlook the expression of intermediate signaling genes within individual cells, limiting their ability to reflect cell-specific signal transduction. Furthermore, the high dimensionality and technical noise in single-cell RNA sequencing data, particularly dropout events, make capturing and identifying changes in intracellular pathways difficult. We introduce CellGOSSIP, a novel framework that integrates single-cell RNA sequencing data with curated biological signaling pathway networks to estimate cell-specific intracellular signaling activity. CellGOSSIP employs a personalized network propagation algorithm over pathway-specific gene graphs, using ligand-receptor interactions as seeds for initiating signal propagation. This approach smooths expression noise and captures pathway dynamics by taking into account gene expression levels and pathway topology. Our evaluation shows that CellGOSSIP outperforms traditional network propagation-based denoising methods in terms of stability of the reconstructed single-cell matrix to increasing levels of dropout noise in the single-cell RNA sequencing data. In a controlled perturbation experiment of ligand-receptor signaling, CellGOSSIP successfully reconstructs transcription factor activity and identifies distinct pathway activation patterns across experimental conditions. Moreover, embeddings based on CellGOSSIP-inferred signaling profiles uncover functional cell subpopulations that are not discernible using raw gene expression data.
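The core operation here, personalized network propagation seeded at ligand-receptor genes, can be sketched as a random walk with restart over a pathway gene graph. This is a generic illustration under assumed inputs, not CellGOSSIP's exact algorithm.

```python
import numpy as np

def personalized_propagation(W, seeds, alpha=0.85, tol=1e-8):
    """Random walk with restart on a pathway gene graph.
    W: symmetric adjacency over pathway genes; seeds: per-gene restart
    weights (e.g. ligand/receptor signal reaching one cell)."""
    d = W.sum(1)
    P = W / np.where(d > 0, d, 1)[:, None]        # row-stochastic transitions
    s = seeds / seeds.sum()
    x = s.copy()
    while True:
        x_new = alpha * P.T @ x + (1 - alpha) * s
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new

# Toy pathway: a receptor (gene 0) feeds a 4-gene downstream cascade.
W = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    W[i, j] = W[j, i] = 1.0
seeds = np.array([1.0, 0.0, 0.0, 0.0, 0.0])      # signal enters at the receptor
print(personalized_propagation(W, seeds).round(3))
```

The propagated scores decay with distance from the receptor along the pathway topology, which is what lets this style of smoothing dampen dropout noise while respecting pathway structure.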
2025-07-22 14:00:00 14:20:00 01B NetBio Prediction of Gene Regulatory Connections with Joint Single-Cell Foundation Models and Graph-Based Learning Sindhura Kommu Sindhura Kommu, Yizhi Wang, Yue Wang, Xuan Wang Motivation: Single-cell RNA sequencing (scRNA-seq) data offers unprecedented opportunities to infer gene regulatory networks (GRNs) at a fine-grained resolution, shedding light on cellular phenotypes at the molecular level. However, the high sparsity, noise, and dropout events inherent in scRNA-seq data pose significant challenges for accurate and reliable GRN inference. The rapid growth in experimentally validated transcription factor-DNA binding data has enabled supervised machine learning methods, which rely on known regulatory interactions to learn patterns, to achieve high accuracy in GRN inference by framing it as a gene regulatory link prediction task. This study addresses the gene regulatory link prediction problem by learning vectorized representations at the gene level to predict missing regulatory interactions. However, strong performance from supervised learning methods requires a large amount of known TF-DNA binding data, which is experimentally expensive to obtain and therefore limited. Advances in large-scale pre-training and transfer learning provide a transformative opportunity to address this challenge. In this study, we leverage large-scale pre-trained models, trained on extensive scRNA-seq datasets and known as single-cell foundation models (scFMs). These models are combined with joint graph-based learning to establish a robust foundation for gene regulatory link prediction. Results: We propose scRegNet, a novel and effective framework that leverages scFMs with joint graph-based learning for gene regulatory link prediction. scRegNet achieves state-of-the-art results in comparison with nine baseline methods on seven scRNA-seq benchmark datasets. Additionally, scRegNet is more robust than the baseline methods on noisy training data. Availability: The source code is available at https://github.com/sindhura-cs/scRegNet
2025-07-22 14:20:00 14:40:00 01B NetBio Enhanced Gaussian noise augmentation-based contrastive learning for predicting the longevity effects of genes using protein-protein interaction networks Ibrahim Alsaggaf Ibrahim Alsaggaf, Alex A. Freitas, Cen Wan Protein-protein interaction (PPI) networks are an informative feature source widely used in bioinformatics research. However, the enormous number of proteins in PPI networks poses a natural analytical challenge. Although network embedding methods (e.g. node2vec [1]) have recently shown good performance in reducing the extremely high dimensionality of PPI network datasets, the predictive performance of network embeddings still leaves room for improvement. In this abstract, we introduce a recently proposed contrastive learning algorithm, namely Sup-EGsCL, which exploits an enhanced Gaussian noise augmentation approach to learn a better feature representation space, leading to improved accuracy in predicting the pro-/anti-longevity effects of genes in different model organisms using PPI network-based features. The content of this abstract was published in [2].
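A minimal sketch of the underlying recipe, assuming node2vec-style PPI embeddings as input: build two views of each embedding by adding Gaussian noise and train a small projector with a contrastive NT-Xent loss. Sup-EGsCL itself is supervised and uses an enhanced augmentation scheme, so treat this purely as the basic, self-supervised flavor of the idea.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """InfoNCE/NT-Xent loss over two noise-augmented views of the same nodes."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    n = z1.size(0)
    sim = z @ z.T / tau
    sim.fill_diagonal_(-1e9)                      # exclude self-similarity
    # Row i's positive is its other view: i <-> i + n.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Node2vec-style PPI embeddings (random stand-ins) and a projection head.
emb = torch.randn(256, 128)
proj = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(),
                           torch.nn.Linear(64, 32))
opt = torch.optim.Adam(proj.parameters(), lr=1e-3)

for step in range(100):
    noise_scale = 0.1                             # augmentation strength (assumed)
    v1 = emb + noise_scale * torch.randn_like(emb)   # Gaussian-noise view 1
    v2 = emb + noise_scale * torch.randn_like(emb)   # Gaussian-noise view 2
    loss = nt_xent(proj(v1), proj(v2))
    opt.zero_grad(); loss.backward(); opt.step()
```

The learned projection can then feed a downstream pro-/anti-longevity classifier in place of the raw embeddings.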
2025-07-22 14:40:00 15:00:00 01B NetBio Uncovering the systems-level mutational landscape of intrinsically disordered regions in cancer Kivilcim Ozturk Kivilcim Ozturk, Hannah Carter Biological functions and cellular behaviors arise from interactions among proteins and other molecules within cells, and cancers often act to perturb these interactions, resulting in disease phenotypes. Many proteins contain intrinsically disordered regions (IDRs) that perform biological functions without relying on a single well-defined conformation. While IDRs of several cancer drivers have emerged as central mediators of oncogenic signaling and post-translational modifications, their role in protein-protein interactions (PPIs) is less clear. Here, we set out to characterize the mutational landscape of IDRs and how they contribute to perturbation of the underlying protein interaction networks in cancer. A comprehensive analysis of our structurally resolved PPI network showed that IDRs are targeted by cancer missense mutations. Furthermore, proteins containing IDRs are more centrally located in the network, especially cancer drivers, with disordered drivers significantly more central than ordered ones. This suggests that the inherent conformational heterogeneity of IDRs might enable them to interact with a wider range of molecular partners, allowing them to propagate signals readily through the cell and allowing the mutations that target them to exert a larger impact on cellular activity and phenotypes. Finally, using a domain-level cell fitness screen, we discovered that cancer drivers contain IDRs on their interaction interface regions corresponding to depletion of cell fitness, pointing to potential cancer cell vulnerabilities. Overall, our work demonstrates the importance of uncovering the systems-level mutational landscape of IDRs to identify mechanisms driving cancer development and progression, enabling more effective selection and development of cancer therapeutics.
2025-07-22 15:00:00 15:20:00 01B NetBio Advancing Network Biology with FunCoup 6 Erik Sonnhammer Erik Sonnhammer We recently released FunCoup 6, a major update to the FunCoup network database, providing researchers with a significantly improved and redesigned platform for exploring the functional coupling interactome. The FunCoup network database (https://FunCoup.org) contains some of the most comprehensive functional association networks of genes/proteins available. Functional associations are inferred by integrating different types of evidence combined with orthology transfer. FunCoup’s high coverage comes from using ten different types of evidence and extensive transfer of information between species. Key innovations in release 6:
- Enhanced regulatory link coverage: FunCoup 6 now includes over half a million directed gene regulatory links in the human network alone, and 13 species in FunCoup now contain regulatory links.
- New website: we completely redesigned the FunCoup website and updated its API functionalities, enhancing user accessibility and experience.
- Integrated advanced online tools for network analysis: the integration of TOPAS for disease and drug target module identification, along with network-based KEGG pathway enrichment analysis using ANUBIX, expands the utility of FunCoup 6 for biomedical research.
- New training framework: applied to produce comprehensive networks for 23 primary species and 618 additional orthology-transferred species.
- Cytoscape app: FunCoup 6 is also available as a Cytoscape app.
A unique feature of both the FunCoup website and the Cytoscape app is the possibility to perform ‘comparative interactomics’, in which subnetworks of different species are aligned using orthologues. FunCoup further demonstrates superior performance compared to other functional association networks, offering researchers enhanced capabilities for studying gene regulation, protein interactions, and disease-related pathways.
2025-07-22 15:20:00 15:40:00 01B NetBio SPACE: STRING proteins as complementary embeddings Dewei Hu Dewei Hu, Damian Szklarczyk, Christian Von Mering, Lars Juhl Jensen Representation learning has revolutionized sequence-based prediction of protein function and subcellular localization. Protein networks are an important source of information complementary to sequences, but their use has proven challenging in the context of machine learning, especially in a cross-species setting. To address this, we leveraged the STRING database of protein networks and orthology relations for 1,322 eukaryotes to generate network-based cross-species protein embeddings. We did this by first creating species-specific network embeddings and subsequently aligning them based on orthology relations to facilitate direct cross-species comparisons. We show that these aligned network embeddings ensure consistency across species without sacrificing quality compared to species-specific network embeddings. We also show that the aligned network embeddings are complementary to sequence embedding techniques, despite the use of sequence-based orthology relations in the alignment process. Finally, we validated the embeddings by using them for two well-established tasks: subcellular localization prediction and protein function prediction. Training logistic regression classifiers on aligned network embeddings and sequence embeddings improved accuracy over using sequence alone, reaching performance close to state-of-the-art deep-learning methods. The precomputed cross-species network embeddings and ProtT5 embeddings for all eukaryotic proteins have been included in STRING version 12.0.
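One standard way to align species-specific embeddings on orthologs, shown here as an assumed illustration rather than SPACE's published procedure, is orthogonal Procrustes: fit a rotation on ortholog anchor pairs, then apply it to every protein of one species so the two spaces become directly comparable.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)

# Stand-ins for species-specific network embeddings (computed independently
# per species, so their axes are arbitrary and not comparable across species).
human = rng.normal(size=(5000, 64))
hidden_rotation = np.linalg.qr(rng.normal(size=(64, 64)))[0]   # unknown misalignment
mouse = rng.normal(size=(4000, 64))

# Pretend the first 800 rows of each matrix are 1:1 orthologs.
anchors = 800
mouse[:anchors] = human[:anchors] @ hidden_rotation + 0.05 * rng.normal(size=(anchors, 64))

# Learn a rotation on the ortholog anchors, then rotate *all* mouse proteins.
R, _ = orthogonal_procrustes(mouse[:anchors], human[:anchors])
mouse_aligned = mouse @ R

before = np.linalg.norm(mouse[:anchors] - human[:anchors], axis=1).mean()
after = np.linalg.norm(mouse_aligned[:anchors] - human[:anchors], axis=1).mean()
print(f"mean ortholog distance: {before:.2f} -> {after:.2f}")
```

Because the transformation is a pure rotation, within-species distances are preserved, consistent with the abstract's point that alignment need not sacrifice embedding quality.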
2025-07-22 15:40:00 16:00:00 01B NetBio Disentangling the genetic and non-genetic origin of disease co-occurrences Beatriz Urda-García Beatriz Urda-García, Davide Cirillo, Alfonso Valencia Numerous diseases co-occur more than expected by chance, likely due to a combination of genetic and environmental factors. However, the extent to which these influences shape disease relationships remains unclear. Here, we integrate large-scale RNA-seq data and heritability measures from human diseases with genomic data from the UK Biobank to disentangle the genetic and non-genetic origins of disease co-occurrences (DCs). Our findings show that gene expression not only recovers but also expands upon genomically explained DCs, capturing disease relationships beyond genetic variation. Approximately 60% of transcriptomically inferred DCs have a detectable genomic component, whereas the remaining 40% are not explained by known genomic layers, suggesting contributions from regulatory or environmental mechanisms. Consistent with this interpretation, the relative contributions of transcriptomics and genomics reconstruct disease etiology and correlate with comorbidity burden, revealing key aspects of disease mechanisms. Complex diseases with strong genetic predispositions tend to be captured by both omics, whereas those primarily influenced by non-genetic factors are better explained by transcriptomics. Additionally, we find that diseases do not generally co-occur based on their heritability, except when sharing SNPs. However, highly heritable diseases tend to have genetically driven co-occurrences, even with lowly heritable diseases. In contrast, transcriptomics explains DCs regardless of heritability, at least partly due to non-heritable mechanisms such as regulatory or environmental effects. Integrating transcriptomic and genomic data provides near-complete coverage of DCs among the analyzed diseases, with a considerable portion likely rooted in factors beyond DNA sequence and, therefore, potentially modifiable.
2025-07-22 16:40:00 17:00:00 01B NetBio GRACKLE: An interpretable matrix factorization approach for biomedical representation learning Lucas Gillenwater Lucas Gillenwater, Lawrence Hunter, James Costello Motivation: Disruption in normal gene expression can contribute to the development of diseases and chronic conditions. However, identifying disease-specific gene signatures can be challenging due to the presence of multiple co-occurring conditions and limited sample sizes. Unsupervised representation learning methods, such as matrix decomposition and deep learning, simplify high-dimensional data into understandable patterns, but often do not provide clear biological explanations. Incorporating prior biological knowledge directly can enhance understanding and address small sample sizes. Nevertheless, current models do not jointly consider prior knowledge of molecular interactions and sample labels. Results: We present GRACKLE, a novel non-negative matrix factorization approach that applies Graph Regularization Across Contextual KnowLedgE. GRACKLE integrates sample similarity and gene similarity matrices based on sample metadata and molecular relationships, respectively. Simulation studies show GRACKLE outperformed other NMF algorithms, especially with increased background noise. GRACKLE effectively stratified breast tumor samples and identified condition-enriched subgroups in individuals with Down syndrome. The model's latent representations aligned with known biological patterns, such as autoimmune conditions and sleep apnea in Down syndrome. GRACKLE's flexibility allows application to various data modalities, offering a robust solution for identifying context-specific molecular mechanisms in biomedical research. Availability and implementation: GRACKLE is available at: https://github.com/lagillenwater/GRACKLE
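For readers unfamiliar with graph-regularized NMF, the sketch below shows a standard multiplicative-update formulation with both a sample-similarity graph (smoothing the sample factor) and a gene-similarity graph (smoothing the gene factor), in the style of Cai et al.'s graph-regularized NMF; GRACKLE's actual objective and updates may differ in detail.

```python
import numpy as np

def graph_regularized_nmf(X, As, Ag, k=4, lam_s=1.0, lam_g=1.0, n_iter=300, seed=0):
    """NMF X ~ W @ H with graph-Laplacian penalties: a sample-similarity
    graph As (n x n) smooths H, and a gene-similarity graph Ag (m x m)
    smooths W. Multiplicative updates keep both factors non-negative."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W, H = rng.random((m, k)), rng.random((k, n))
    Ds, Dg = np.diag(As.sum(1)), np.diag(Ag.sum(1))   # graph degree matrices
    eps = 1e-9
    for _ in range(n_iter):
        W *= (X @ H.T + lam_g * Ag @ W) / (W @ (H @ H.T) + lam_g * Dg @ W + eps)
        H *= (W.T @ X + lam_s * H @ As) / ((W.T @ W) @ H + lam_s * H @ Ds + eps)
    return W, H

rng = np.random.default_rng(1)
X = rng.random((100, 60))                              # genes x samples
As = (rng.random((60, 60)) > 0.9).astype(float)        # toy sample-metadata graph
As = np.triu(As, 1); As += As.T
Ag = (rng.random((100, 100)) > 0.95).astype(float)     # toy molecular-interaction graph
Ag = np.triu(Ag, 1); Ag += Ag.T
W, H = graph_regularized_nmf(X, As, Ag)
print("relative reconstruction error:", np.linalg.norm(X - W @ H) / np.linalg.norm(X))
```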
2025-07-22 17:00:00 17:20:00 01B NetBio Quantum Random Walks for Biomarker Discovery in Biomolecular Networks Aritra Bose Viacheslav Dubovitskii, Aritra Bose, Filippo Utro, Laxmi Parida Biomolecular networks, such as protein–protein interactions, gene–gene associations, and cell–cell interactions, offer valuable insights into the complex organization of biological systems. These networks are key to understanding cellular functions, disease mechanisms, and identifying therapeutic targets. However, their analysis is challenged by the high dimensionality, heterogeneity, and sparsity of multi-omics data. Random walk algorithms are widely used to propagate information through disease modules, helping to identify disease-associated genes and uncover relevant biological pathways. In this work, we investigate the limitations of classical random walks and explore the potential of quantum random walks (QRWs) for biomolecular network analysis. We evaluate QRWs in a gene–gene interaction network associated with asthma, autism, and schizophrenia. QRWs rank disease-associated genes more accurately than the classical random walk with restart. Our findings suggest that quantum random walks offer a promising alternative to classical approaches for biomarker discovery, with improved sensitivity to network structure and better performance in identifying biologically relevant features. This highlights their potential in advancing network medicine and systems biology.
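As a point of comparison, a continuous-time quantum walk evolves complex amplitudes under exp(-iAt) rather than diffusing probability mass; the sketch below contrasts time-averaged quantum visiting probabilities with a classical random walk with restart on a toy network. The paper's specific QRW formulation (continuous vs. discrete time, coin operators, etc.) may differ.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)

# Toy gene-gene interaction network (symmetric adjacency).
n = 30
A = (rng.random((n, n)) > 0.85).astype(float)
A = np.triu(A, 1); A = A + A.T

# Continuous-time quantum walk: psi(t) = exp(-iAt) psi(0); node scores are
# time-averaged visiting probabilities |psi_v(t)|^2, seeded at known disease genes.
seeds = [0, 1, 2]
psi0 = np.zeros(n, dtype=complex)
psi0[seeds] = 1 / np.sqrt(len(seeds))
qrw = np.mean([np.abs(expm(-1j * A * t) @ psi0) ** 2
               for t in np.linspace(0.5, 5.0, 10)], axis=0)

# Classical baseline: random walk with restart from the same seed genes.
P = A / np.maximum(A.sum(1), 1)[:, None]
s = np.zeros(n); s[seeds] = 1 / len(seeds)
x = s.copy()
for _ in range(200):
    x = 0.85 * P.T @ x + 0.15 * s

print("QRW top genes:", np.argsort(-qrw)[:5])
print("RWR top genes:", np.argsort(-x)[:5])
```

Because the quantum evolution is unitary, amplitudes interfere rather than simply diffuse, which is the structural sensitivity the abstract alludes to.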
2025-07-22 17:20:00 18:00:00 01B NetBio Quantum computing for network medicine-based epistatic disease mechanism mining - Fake it till you make it? Jan Baumbach Jan Baumbach Most heritable diseases are polygenic and yield complex disease mechanisms. To comprehend the underlying genetic architecture, it is crucial to discover the clinically relevant epistatic interactions (EIs) between genomic single nucleotide polymorphisms (SNPs). Existing statistical methods for EI detection are mostly limited to pairs of SNPs due to the combinatorial explosion of higher-order EIs. With NeEDL (network-based epistasis detection via local search), we leverage network medicine to inform the selection of EIs that are an order of magnitude more statistically significant compared to existing tools and consist, on average, of five SNPs / disease genes in a candidate mechanism. We further show that this computationally demanding task can be accelerated with quantum computing. We apply NeEDL to eight different diseases and discover genes (affected by EIs of SNPs) that are partly known to affect the disease; additionally, these results are reproducible across independent cohorts. EIs for these eight diseases can be interactively explored in the Epistasis Disease Atlas (https://epistasis-disease-atlas.com). In summary, we demonstrate the potential of seamlessly integrated quantum computing techniques to accelerate mechanism mining. Our network medicine approach detects higher-order EIs with unprecedented statistical and biological evidence, yielding unique insights into polygenic diseases and providing a basis for the development of drug repurposing candidates and improved combination therapies.
2025-07-23 11:20:00 11:40:00 01B NetBio MixingDTA: Improved Drug-Target Affinity Prediction by Extending Mixup with Guilt-By-Association Dongmin Bang Youngoh Kim, Dongmin Bang, Bonil Koo, Jungseob Yi, Changyun Cho, Jeonguk Choi, Sun Kim Drug–Target Affinity (DTA) prediction is an important regression task for drug discovery, providing richer information than traditional binary drug-target interaction prediction. Accurate DTA prediction requires a large amount of data for each drug, which is not yet available, so data scarcity and sparsity are major challenges. Another important task is 'cold-start' DTA prediction for unseen drugs or proteins. In this work, we introduce MixingDTA, a novel framework to tackle data scarcity by incorporating domain-specific pre-trained language models for molecules and proteins with our MEETA (MolFormer and ESM-based Efficient aggregation Transformer for Affinity) model. We further address the label sparsity and cold-start challenges through a novel data augmentation strategy named GBA-Mixup, which interpolates embeddings of neighboring entities based on the Guilt-By-Association (GBA) principle, to improve prediction accuracy even in sparse regions of the DTA space. Our experiments on benchmark datasets demonstrate that the MEETA backbone alone provides up to a 19% improvement in mean squared error over the current state-of-the-art baseline, and the addition of GBA-Mixup contributes a further 8.4% improvement. Importantly, GBA-Mixup is model-agnostic, delivering performance gains of up to 16.9% across all tested backbone models. Case studies show how MixingDTA interpolates between drugs and targets in the embedding space, demonstrating generalizability for unseen drug–target pairs while effectively focusing on functionally critical residues. These results highlight MixingDTA’s potential to accelerate drug discovery by offering accurate, scalable, and biologically informed DTA predictions. The code for MixingDTA is available at https://github.com/rokieplayer20/MixingDTA.
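The GBA-Mixup idea, interpolating embeddings and affinity labels only between guilt-by-association neighbors, can be sketched as follows; the function, neighbor lists, and embedding dimensions are hypothetical stand-ins, not MixingDTA's code.

```python
import numpy as np

def gba_mixup(drug_emb, target_emb, y, drug_neighbors, alpha=0.4, seed=0):
    """Mixup restricted to guilt-by-association neighbors: each training pair
    (drug, target, affinity) is interpolated with a pair whose drug is a known
    neighbor (e.g. high chemical or interaction similarity), using a
    Beta(alpha, alpha) mixing coefficient as in standard mixup."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha, size=len(y))
    partner = np.array([rng.choice(drug_neighbors[i]) for i in range(len(y))])
    d_mix = lam[:, None] * drug_emb + (1 - lam)[:, None] * drug_emb[partner]
    t_mix = lam[:, None] * target_emb + (1 - lam)[:, None] * target_emb[partner]
    y_mix = lam * y + (1 - lam) * y[partner]
    return d_mix, t_mix, y_mix

rng = np.random.default_rng(1)
n = 500
drug_emb = rng.normal(size=(n, 768))     # e.g. MolFormer-style drug embeddings
target_emb = rng.normal(size=(n, 1280))  # e.g. ESM-style protein embeddings
y = rng.uniform(4, 10, size=n)           # binding affinities (pKd-like scale)
drug_neighbors = [rng.choice(n, size=5, replace=False) for _ in range(n)]
d_mix, t_mix, y_mix = gba_mixup(drug_emb, target_emb, y, drug_neighbors)
```

Restricting partners to GBA neighbors keeps the synthetic pairs near plausible regions of the embedding space, which is what lets the augmentation densify sparse regions without injecting biologically implausible labels.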
2025-07-23 11:40:00 12:00:00 01B NetBio Interactome-based computational solutions to support personalized drug therapy decisions in glioblastoma Nicoleta Siminea Nicoleta Siminea, Victor Bogdan Popescu, Ion Petre, Andrei Păun Glioblastoma is an aggressive cancer with a poor survival rate, and standard treatments often yield limited results. To explore personalized treatment options, we employed network-based analyses. Our aim was to investigate how drug recommendations might vary for individual patients compared to the conventional treatment approaches for glioblastoma. We began by identifying patient-specific proteins from glioblastoma cases in The Cancer Genome Atlas (TCGA). Using these, we constructed protein–protein interaction networks that incorporated not only the patient-specific proteins but also those encoded by cancer-related genes and known targets of antineoplastic or immunomodulatory drugs. We then applied controllability analysis using NetControl4BioMed to identify proteins that could potentially be targeted with drugs. For each case, we compared the drugs identified through in silico analysis with those predicted to help restore the disease state to a healthy condition. We also examined the differences between using personalized versus generic networks. Notably, 12% of drugs identified via the generic network appeared in fewer than half of the individual networks. We also found some drugs that, while absent in the generic network, were predicted to offer therapeutic value in individual patients. This approach enables the construction and analysis of novel individual networks based on proteins identified in new glioblastoma samples. Moreover, the methodology can be adapted for other diseases. For conditions with poor prognoses, enhancing individualized network analyses is essential to improve treatment outcomes.
2025-07-23 12:00:00 12:20:00 01B NetBio Improving Target-Adverse Event Association Prediction by Mitigating Topological Imbalance in Knowledge Graphs Terence Egbelo Terence Egbelo, Zeyneb Kurt, Charlie Jeynes, Mike Bodkin, Val Gillet Drug discovery faces high clinical failure rates due to adverse events (AEs) from both on- and off-target interactions. Biomedical knowledge graphs (KGs) integrate domain knowledge in network form. KG completion is the classification task of predicting new relations based on the existing graph. Our study predicts novel target-AE associations as KG completion using a large-scale biomedical knowledge graph. We incorporated ground truth target-AE associations from multiple sources, including the T-ARDIS database (Galletti et al 2021), into the Drug Repurposing Knowledge Graph (DRKG) by Ioannidis et al (2020). Rather than "black-box" deep learning approaches to KG completion, we employ interpretable "metapath"-based predictive features that maintain direct reference to domain semantics, following precedents set by Fu et al (2016) and Himmelstein et al (2017). We introduce a novel approach to address the problem of topological imbalance in KGs, a type of graph data bias. This bias occurs when high-degree entities (nodes in the KG) are overrepresented in the positive ground truth associations, leading to poor prediction performance on less-studied (and thus low-degree) entities, precisely where good inference is most critical. Our bias mitigation method transforms degree sparsity into a useful signal when learning associations of sparsely connected protein targets. Our approach demonstrated significantly improved AE prediction for the least-studied targets (accuracy increase of ~15% for the bottom 15% of targets by number of AE associations) compared to the well-known Degree-Weighted Path Count (DWPC) method by Himmelstein et al (2017). Finally, we demonstrate prediction interpretability, including in cases where alternative methods produce errors.
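For context, the DWPC baseline mentioned above damps metapath walk counts through high-degree hub nodes; a simplified matrix version (ignoring the full DWPC's exclusion of paths that revisit a node) looks like this.

```python
import numpy as np

def degree_weighted_adj(A, w=0.4):
    """Damp walks through hubs: scale A by (row degree * column degree)^-w."""
    dr = np.maximum(A.sum(1), 1) ** -w
    dc = np.maximum(A.sum(0), 1) ** -w
    return dr[:, None] * A * dc[None, :]

def dwpc(metapath_adjs, w=0.4):
    """Degree-weighted path count for a metapath given as a list of adjacency
    matrices, e.g. [target-pathway, pathway-AE] for target->pathway->AE paths.
    (Simplified: repeated-node paths are not excluded, unlike the full DWPC.)"""
    M = degree_weighted_adj(metapath_adjs[0], w)
    for A in metapath_adjs[1:]:
        M = M @ degree_weighted_adj(A, w)
    return M

rng = np.random.default_rng(0)
target_pathway = (rng.random((50, 20)) > 0.9).astype(float)   # toy bipartite edges
pathway_ae = (rng.random((20, 30)) > 0.9).astype(float)
scores = dwpc([target_pathway, pathway_ae])   # (target x AE) metapath feature matrix
print(scores.shape, scores.max().round(3))
```

Each metapath yields one such feature matrix; stacking features from several metapaths gives the interpretable feature vectors that the abstract's classifier consumes.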
2025-07-23 12:20:00 12:40:00 01B NetBio A Nextflow Pipeline for Network-Based Disease Module Identification and Drug Repurposing Johannes Kersting Johannes Kersting, Lisa Spindler, Joaquim Aguirre-Plans, Chloé Bucheron, Quirin Manz, Tanja Pock, Mo Tan, Fernando Delgado-Chaves, Cristian Nogales, Jan Baumbach, Jörg Menche, Emre Guney, Markus List Disease modules provide unique insights into the mechanisms of complex diseases and lay the foundation for mechanistic drug repurposing. Algorithms for their identification leverage biological networks to extend an initial set of disease-associated genes (seeds) into subnetworks reflecting biological processes likely to be integral components of the investigated disease. These subnetworks can unveil causal pathways and provide drug repurposing efforts with promising new targets for therapeutics. Various computational methods have been developed for disease module identification. Since these methods differ in their modeling assumptions and techniques, evaluating various tools across different parameters to optimize for a specific use case is advisable. However, this can be tedious since the individual tools require specific installation and input preparation procedures. Moreover, identifying the best modules is not straightforward and requires topological and biological validation strategies. To mitigate this, we developed a comprehensive pipeline for disease module identification, evaluation, and subsequent drug prioritization utilizing the workflow system Nextflow. Our pipeline automatically deploys software dependencies using Docker, making installation easy. It prepares the inputs for and runs six popular module detection tools, including DIAMOnD, DOMINO, and ROBUST. The generated outputs are annotated with biological context information, converted into a unified BioPAX format, and extensively evaluated. The latter includes assessing the biological relevance based on overrepresentation analysis and the tool DIGEST, as well as robustness and consistency analyses. With our contribution, we allow the community to systematically compare different approaches for disease module discovery, thus contributing to robustness and reproducibility in systems and network medicine.
2025-07-23 12:40:00 13:00:00 01B NetBio Graph Antiviral Target Explorer (GATE): predicting disease genes in viral infections with Message Passing Neural Networks Samuele Firmani Samuele Firmani, Valter Bergant, Corinna Grünke, Yang An, Alexander Henrici, Francesco Paolo Casale, Andreas Pichlmair, Annalisa Marsico Recent outbreaks of COVID‑19 and monkeypox (Mpox) underscore the need for scalable tools that can disentangle complex host‑virus‑drug interactions and reveal potential therapeutic targets. Although multi‑omics technologies and high‑throughput screens generate rich datasets, integrating these heterogeneous signals to prioritise disease genes remains difficult, especially for poorly understood viruses. We present GATE, a graph message‑passing neural network (MPNN) that ranks host factors and candidate antiviral drug targets across viral infections. Thanks to its pre‑training phase, GATE extracts maximum value from sparse, weakly labelled data by learning to recognise disease‑related proteins and drug targets within a protein–protein interaction (PPI) network enriched with Gene Ontology functional embeddings, PPI‑derived positional encodings (PEs) and ESM2 language‑model embeddings. The model is then fine‑tuned with virus‑specific multi-omics data. On both SARS‑CoV‑2 and Mpox tasks, GATE outperformed state‑of‑the‑art models and simpler baselines. Of the architectures evaluated, the Principal Neighbourhood Aggregation (PNA) layer propagated protein features most effectively. Positional encodings were the most informative inputs, while for SARS‑CoV‑2 the protein–viral interactome and effectome data also proved effective. Pre‑training further boosted performance, particularly in the data‑sparse Mpox setting. Predicted host‑factor genes significantly overlapped hits from independent CRISPR‑KO screens and matched validated antiviral targets for both viruses. In addition, GATE’s explanations highlight underlying biological mechanisms and help prioritise candidates for experimental validation. GATE is task‑agnostic, scalable and accommodates future omics modalities through modular input feature sets, accelerating discovery and repurposing of antiviral therapeutics.
2025-07-23 14:00:00 14:20:00 01B NetBio Multilayer Networks Identify Clinically Relevant Patient Endotypes in COVID-19 and Sepsis Piotr Sliwa Piotr Sliwa, Heather Harrington, Gesine Reinert, Julian C. Knight Integrative analyses of multi-omic patient datasets are crucial to uncover disease subtypes, yet challenges arise from modality‑specific variability and missing data. We propose MLModNet, a multilayer network framework for robust patient stratification. MLModNet employs an extended resampling‑based method (Pareto‑COGENT) to build stable, informative, and modality‑specific patient similarity networks, integrates them into a multiplex network including patients missing individual assays, and detects patient stratification via multiplex‑adapted Leiden clustering. We applied MLModNet to the COVID‑19 Multi‑omic Blood Atlas (COMBAT) dataset, integrating proteomics, transcriptomics, and cytometry. MLModNet discovered five patient endotypes that refine clinical WHO severity categories, each exhibiting distinct immune‑metabolic signatures involving IL‑33, TREM1, interferon response pathways, and shifts in cell proportions. Clinically, MLModNet clusters significantly stratified ICU‑free survival, and early cluster assignment probabilities predicted subsequent clinical markers (CRP, D‑dimer, Acuity scores). External validation on an independent Olink dataset confirmed the reproducibility of these endotypes and their prognostic relevance. Extensive ablation analyses further supported the robustness of the identified clusters. MLModNet thus provides a scalable strategy to translate heterogeneous, incomplete multi‑modal data into biologically meaningful, clinically actionable patient stratifications.
2025-07-23 14:20:00 14:40:00 01B NetBio Cytoscape Visualization Competition Results Chad Myers Chad Myers
2025-07-23 14:40:00 15:20:00 01B NetBio Microbiome multitudes and metadata madness Fiona Brinkman Fiona Brinkman Microbiome analysis is increasingly becoming a critical component of a wide range of health, agri-foods, and environmental studies. I will present case studies showing the benefit of integrating very diverse metadata into such analyses - and also pitfalls to watch out for. The results of one such cohort study will be further presented, illustrating the need for analyses that allow one to flexibly view metadata in the context of microbiome data. The results support the multigenerational importance of “healthy
2025-07-21 11:20:00 11:25:00 02F NIH Cyberinfrastructure and Emerging Technologies Sessions Opening Remarks for NIH Track Susan Gregurick Susan Gregurick, PhD
2025-07-21 11:25:00 11:44:00 02F NIH Cyberinfrastructure and Emerging Technologies Sessions Digital Twins: Functional and Mechanistic Reconstruction of Multiple Myeloma Tumors Ariosto S. Silva Ariosto S. Silva, PhD
2025-07-21 11:44:00 12:03:00 02F NIH Cyberinfrastructure and Emerging Technologies Sessions Network Science for Cyber-physical Twinning of Human Heart Timothy Kuo
2025-07-21 12:03:00 12:22:00 02F NIH Cyberinfrastructure and Emerging Technologies Sessions A Digital Twins Prototype for Monitoring and Predicting Dynamic Diet-related Health Conditions Honggang Wang Hua Fang, Honggang Wang A digital twin system aims to create a virtual representation of a physical subject by modeling both its intrinsic attributes and the external factors that influence them. In this work, we present a prototype digital twin system composed of three integrated components: (1) a non-parametric machine learning algorithm for modeling temporal and spatial data; (2) a translation module that converts predicted health outcomes and risks into natural language descriptions of physical and mental states; and (3) a generative 3D visualization engine that dynamically illustrates health changes over time. At the core of the system is a novel random forest learning model enhanced with Choquet LASSO feature selection, designed to capture complex, nonlinear interactions among high-dimensional features. This approach improves predictive accuracy while maintaining computational efficiency. We compare the performance of this model with both traditional and emerging methods using standard evaluation metrics. Predicted health outcomes are translated into interpretable natural language narratives, such as estimated body shape or biomarker trajectories, which are then used to drive the generation of 3D digital twins that visually reflect the subject’s evolving physical state. The prototype is designed to monitor and forecast the progression of health and chronic conditions based on individual-level data inputs, including food intake, electronic health records, and user-reported variables such as age, gender, weight, and waist circumference. Dietary data are processed into personalized diet quality scores using the Alternative Healthy Eating Index (AHEI), and further decomposed into macro- and micronutrient components to support granular nutritional tracking. In our case studies, we demonstrate the system using real-world longitudinal datasets spanning up to 35 years, integrating historical data on biomarkers, dietary patterns, and disease progression. Although the digital twins in our current study are generated retrospectively, the system architecture supports real-time monitoring and simulation. This enables users to intuitively explore how dietary and lifestyle factors impact their health, and allows healthcare professionals to deliver personalized, real-time recommendations based on individuals’ behavioral, lifestyle, and environmental contexts.
2025-07-21 12:22:00 12:41:00 02F NIH Cyberinfrastructure and Emerging Technologies Sessions Multiscale Digital-Twin Modeling and Estimation with Indirect, Neurological Data Matthew F. Singh Matthew F. Singh, PhD Linking data and predictions across spatial scales has been a key hurdle to digital twin applications in medicine. Noninvasive measurements, such as electrophysiology, provide key temporal insight but are indirect and spatially coarse. By contrast, many disease mechanisms involve feedback between local cellular signaling (spatially fine) and the larger organ system (spatially coarse). These issues are inherent to neurology, as brain function relies upon a hierarchy of networks which span from local circuits defined by cell type to the long-range “wiring” between brain regions. However, this complexity is at a mismatch with the much more limited resolution of human brain data. Thus, there is a critical need to estimate digital twins which span scales. Methodology: We present an algorithm for estimating physiologically detailed digital twins using indirect (noninvasive) measurements. In this scenario, both the model states and parameters are unknown, leading to a dual-estimation problem that challenges current paradigms. We propose a solution in which detailed biological digital twins are “trained” by converting the estimation problem into a form of recurrent neural network, which we term the generalized Backpropagation Kalman Filter (gBPKF). This reformulation retains the original physiological model (there are no “black boxes”) but enables AI approaches to solve the challenging optimization problem. We benchmark our approach and demonstrate its power for digital twins in neurology using two electrophysiology datasets: the Human Connectome Project (HCP) and a study comparing Transcranial Magnetic Stimulation (TMS) protocols (waveforms) within-patient. We establish the accuracy of person-specific predictions of key neurological markers (frequency-domain statistics), which we trace back to biological mechanisms through bifurcation analysis (HCP data). Using TMS neurostimulation data, we further tested the ability to track the timecourse of delayed biological changes (plasticity). We hypothesized that these changes would mirror the delayed time course of neuroplasticity and that TMS treatment protocols would differentially alter specific microcircuits within digital-twin models. Key Results: Our algorithm presents state-of-the-art accuracy and efficiency (lower complexity) in benchmarking with a simulated ground truth. Applied to real data, we demonstrate highly reliable digital-twin estimates even with detailed brain models (>1,000 unknown parameters). Utilizing genetic twins (monozygotic vs. dizygotic), we demonstrate high heritability of model parameters. Models correctly forecast short-term changes in brain activity (sub-second oscillations) and long-term temporal patterns that differentiate individuals. We identify a specific bifurcation mechanism, arising from microcircuits but not macroscopic connections, which drives human variability in brain dynamics. This finding illustrates the power of linking multiple spatial scales. Fit to windows of post-TMS data, digital twins identified a sequence of slow microcircuit changes whose delayed timecourse mirrored that of brain plasticity. The direction of effect depended upon TMS protocol, in agreement with theorized mechanisms.
Significance: Our research advances model estimation and prediction for digital twins operating at multiple spatial scales. The developed gBPKF architecture provides an efficient, accurate solution to estimating biological models from noisy, indirect measurements. It also expands digital-twin technology to complex systems, such as the brain, in which current algorithms prove intractable due to the high number of dimensions. Our applications to neurology demonstrate the power of digital twins to predict individual outcomes and track treatment responses in terms of local circuits which are not directly accessible with noninvasive technology.
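The dual-estimation problem the abstract describes (states and parameters both unknown) is classically handled by augmenting the state vector with the parameters and filtering jointly; the toy sketch below does this with an extended Kalman filter on a scalar system. gBPKF goes further by training the estimator as a recurrent network, which is not reproduced here; all dynamics and noise levels below are assumed for illustration.

```python
import numpy as np

# Toy dual estimation: infer both the hidden state x_t and an unknown
# dynamics parameter a from indirect, noisy observations y_t, by augmenting
# the state with the parameter and running an extended Kalman filter.
rng = np.random.default_rng(0)
a_true, q, r, T = 0.9, 0.1, 0.5, 2000
x, ys = 0.0, []
for _ in range(T):                       # simulate x_{t+1} = a*x_t + noise
    x = a_true * x + q * rng.normal()
    ys.append(x + r * rng.normal())      # indirect observation y_t = x_t + noise

z = np.array([0.0, 0.5])                 # augmented state [x, a]; a starts wrong
P = np.diag([1.0, 1.0])                  # state covariance
Q = np.diag([q**2, 1e-6])                # tiny noise lets the parameter adapt
H = np.array([[1.0, 0.0]])               # we only observe x, never a directly
for y in ys:
    # Predict step: f(z) = [a*x, a], with Jacobian F.
    F = np.array([[z[1], z[0]], [0.0, 1.0]])
    z = np.array([z[1] * z[0], z[1]])
    P = F @ P @ F.T + Q
    # Update step with the scalar observation.
    S = (H @ P @ H.T)[0, 0] + r**2
    K = (P @ H.T)[:, 0] / S
    z = z + K * (y - z[0])
    P = P - np.outer(K, H @ P)

print(f"estimated parameter a: {z[1]:.3f} (true value {a_true})")
```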
2025-07-21 12:41:00 13:00:00 02F NIH Cyberinfrastructure and Emerging Technologies Sessions Towards a Digital Twin Initiative for Neurodegenerative Diseases Karuna P Joshi Karuna P Joshi, PhD
2025-07-21 14:00:00 14:20:00 02F NIH Cyberinfrastructure and Emerging Technologies Sessions Advancing Discovery through GenAI and Scalable Infrastructure Sean D. Mooney, PhD Opening remarks from the NIH Center for Information Technology (CIT) will highlight how the STRIDES Initiative and Cloud Lab program are accelerating biomedical research through expanded access to scalable, secure cloud infrastructure. The Director will frame NIH’s vision for enabling FAIR data practices, fostering innovation through generative AI, and lowering technical barriers to empower the research community with next-generation tools and resources.
2025-07-21 14:20:00 14:40:00 02F NIH Cyberinfrastructure and Emerging Technologies Sessions A Global Model for FAIR and Open Research: Scalable, Collaborative Infrastructure in Action Kristi Holmes, PhD Kristi Holmes, PhD Learn how Zenodo and InvenioRDM are advancing global biomedical research through scalable, AI-integrated, and FAIR-aligned infrastructure. This session highlights a powerful open-source model enabling reproducible science, seamless data sharing, and next-gen curation workflows—built by and for the research community.
2025-07-21 14:40:00 15:00:00 02F NIH Cyberinfrastructure and Emerging Technologies Sessions Beyond Data Sharing: AI-Powered Solutions for Effective Biomedical Data Reuse Luca Foschini Luca Foschini This session will showcase AI-powered innovations that move biomedical research beyond data sharing toward meaningful data reuse. Using real-world platforms like Synapse.org and the Neurofibromatosis Data Portal, this session will highlight new tools for automating metadata harmonization, accelerating data discovery through conversational AI, and preparing datasets for machine learning. The talk will also explore emerging standards and open science challenges that are shaping the future of AI-ready biomedical data.
2025-07-21 15:00:00 15:20:00 02F NIH Cyberinfrastructure and Emerging Technologies Sessions Reusable Cyberinfrastructure and Use Cases for the Cancer Research Data Commons (CRDC) Tanja Davidsen Tanja Davidsen This session highlights how NIH’s Cancer Research Data Commons (CRDC) and supporting cyberinfrastructure are transforming cancer research through scalable, secure, and interoperable platforms. Learn how modular technologies, cloud-native services, and AI-driven tools are accelerating multi-modal data integration, enabling predictive analytics, and advancing personalized cancer care across the biomedical research ecosystem.
2025-07-21 15:20:00 15:40:00 02F NIH Cyberinfrastructure and Emerging Technologies Sessions Power Your Kids First or INCLUDE Data Analysis on The Interoperable CAVATICA Cloud Analytics Workspace Jared Rozowsky Jared Rozowsky The NIH-funded Gabriella Miller Kids First Data Resource Center (KF-DRC) and the INCLUDE Data Coordinating Center (INCLUDE DCC) provide harmonized datasets for researchers to investigate pediatric cancer, structural birth defects, and co-occurring conditions of Down Syndrome. Broadly, the goals of the two programs are to accelerate discovery, enhance healthcare, and change lives. CAVATICA is a data analysis and sharing platform designed to accelerate discovery in a scalable, cloud-based compute environment that is shared by both programs. CAVATICA supports a unique integration with STRIDES, allowing all academic users on the platform to leverage the STRIDES discount without having to set up individual accounts. This setup means research dollars can go farther and drive us closer to a cure. Additionally, STRIDES has funded the KF and INCLUDE Cloud Credit program. While researchers can use primary files from the data portals without incurring storage fees, data analysis and storage of secondary files do incur charges. To aid researchers, the Cloud Credit Program supports data generators and secondary data users who want to analyze data in the cloud, leverage existing tools, or develop their own tools for data analysis. To date, Kids First has approved 31 research projects and allocated $49,000 of funding. INCLUDE has approved 12 projects and allocated $22,000 of funding. Both programs have supported researchers, leading to multiple abstracts, presentations, and manuscripts. Some of the tools generated with the support of the Cloud Credit program are also available on CAVATICA for others to use in the public apps gallery and are referenced in publications. Applications to the Cloud Credit Program are open, and the program continues to support researchers in their endeavor to accelerate discovery, enhance healthcare, and change lives. We have open office hours to help users get started twice a week (https://www.cavatica.org/contact-us) and a 24/7 helpdesk staffed by our support staff. As part of the KF and INCLUDE data ecosystems, CAVATICA not only allows researchers to leverage the cloud-based platform to access and analyze data from their respective data portals; researchers can also integrate their own data or utilize the platform's interoperability with the Cancer Research Data Commons, BioData Catalyst, or NCBI’s Sequence Read Archive, giving access to all data controlled by dbGaP. CAVATICA uses the Researcher Auth Service (RAS) to ensure proper authorization of files. All analyses can be shared with other users with appropriate permission controls. CAVATICA supports workflow languages (CWL and Nextflow) for ‘tasks’ or ‘interactive analysis’ using JupyterLab or RStudio, either through the graphical user interface or the API. Put together, CAVATICA allows researchers to securely access and analyze controlled data, accelerating discovery and driving cures.
2025-07-21 15:40:00 16:00:00 02F NIH Cyberinfrastructure and Emerging Technologies Sessions The Gene Set Browser: An interoperable and AI/ML-ready tool for gene set analysis in the Common Fund Data Ecosystem (CFDE) Julie Jurgens Julie Jurgens Summary: This session introduces the NIH Common Fund Data Ecosystem (CFDE) Gene Set Browser, an AI/ML-ready tool that connects diverse biomedical datasets to uncover novel gene-disease associations. Learn how this interoperable resource leverages Bayesian modeling and LLM-driven insights to power cross-program analysis, enable hypothesis generation, and drive discovery through FAIR, integrated data. Abstract: In an AI/ML-ready world, data interoperability and integration are becoming increasingly critical. The US National Institutes of Health (NIH) has risen to address these needs through major initiatives including the Common Fund Data Ecosystem (CFDE), which promotes accessibility, (re)use, and integration of NIH Common Fund programs' data and resources through a cohesive ecosystem. By establishing common standards, data, tools, and infrastructure, CFDE serves as a model for data accessibility and interoperability. As a compelling use case of how increased interoperability can drive data utility and scientific discovery, we present the CFDE Gene Set Browser, available through https://cfdeknowledge.org. This open-access web resource performs cross-program analyses of gene sets (lists of genes) and their relationship to additional genes, human phenotypes, and mechanisms. Importantly, this tool connects multiple disparate CFDE and non-CFDE programs, phenotypes, and data types. Through the Gene Set Browser, users can learn a) which gene sets capture important biological mechanisms, and b) which mechanisms are relevant to human health. Gene sets are derived from six CFDE programs (GlyGen, GTEx, IDG, IMPC/KOMP2, LINCS, and MoTrPAC); intersections between CFDE programs; and differential expression analyses of CFDE transcriptomic data. Phenotypes include rare diseases from Orphanet (n=2,927) and common phenotypes/traits from the NHGRI Association to Function Knowledge Portal (n=1,237) and the EBI GWAS Catalog (n=2,213). Relationships between phenotypes and gene sets were computed using PIGEAN (Priors Inferred from GEne ANnotations), a novel Bayesian method. PIGEAN jointly models the probability that each gene is involved in each phenotype, given the gene sets that contain the gene and the genome-wide association study (GWAS) statistics for variants near the gene. We applied PIGEAN to the above common and rare disease phenotypes/traits, in each case fitting a model using all CFDE gene sets, intersections of CFDE gene sets, and gene sets from the Mouse Genome Informatics database (MGI; >11,000 mouse model phenotypes) and MSigDB (pathway analyses). Users can obtain the estimated probability that the genes within each gene set are involved in disease. Additionally, the estimated probability that each gene is involved in disease is provided. For each result, an LLM enables users to explore hypotheses underlying each gene set-to-disease connection. The Gene Set Browser has unearthed a wide range of known and novel candidate genes and mechanisms for human biological processes and diseases. For example, a gene set from MoTrPAC, a CFDE program that studies the molecular effects of exercise, links genes upregulated in the blood of male rats after 2 weeks of exercise to reticulocyte count.
Through the Gene Set Browser, users can discover gene sets relevant to a wide range of research questions, explore connections between gene sets and other biological information (e.g., pathways and disease associations from external databases), and generate new hypotheses that might not be apparent from individual resources. Connecting CFDE gene sets to external resources is a powerful demonstration of how leveraging interoperability can foster scientific discovery.
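To make the PIGEAN idea above concrete, here is a deliberately simplified sketch of how gene-set membership can be combined with per-set effect sizes into a per-gene probability of phenotype involvement. This is not the PIGEAN implementation; the gene sets, log-odds values, and prior are all hypothetical placeholders.

```python
import numpy as np

# Toy stand-in for the joint Bayesian model described above: score genes for a
# phenotype by combining gene-set membership with per-set log-odds effects.

genes = [f"gene{i}" for i in range(10)]
gene_sets = {                      # hypothetical gene sets
    "exercise_upregulated": {"gene0", "gene1", "gene2"},
    "mouse_phenotype_anemia": {"gene1", "gene3", "gene4"},
}
set_log_odds = {                   # hypothetical inferred per-set effects
    "exercise_upregulated": 1.2,
    "mouse_phenotype_anemia": 0.8,
}
base_log_odds = -3.0               # prior log-odds that a random gene is involved

def gene_probability(gene: str) -> float:
    """Posterior-style probability that `gene` is involved in the phenotype."""
    logit = base_log_odds + sum(
        eff for s, eff in set_log_odds.items() if gene in gene_sets[s]
    )
    return 1.0 / (1.0 + np.exp(-logit))

for g in genes[:5]:
    print(g, round(gene_probability(g), 3))
```

In the real method, the per-set effects are themselves inferred jointly from GWAS statistics rather than fixed as here.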
2025-07-21 16:40:00 16:59:00 02F NIH Cyberinfrastructure and Emerging Technologies Sessions Quantum Approximate Optimization for K-Area Clustering of Biological Data Yong Chen Fei Li, Yong Chen
2025-07-21 16:59:00 17:18:00 02F NIH Cyberinfrastructure and Emerging Technologies Sessions Efficient quantum algorithm to simulate open systems through a single environmental qubit Vischi Michele Vischi Michele Simulating the dynamics of open quantum systems allows us to understand real-world quantum phenomena, a crucial task in a variety of fields. Recently, the idea that open system dynamics can be understood not only as a model of physical systems, but also as a general-purpose algorithmic framework for preparing target quantum states, has been explored. For example, just as Hamiltonian dynamics are often used in simulation without specifying a physical system, open system dynamics can be designed without referencing an actual system-environment interaction. The environment can be viewed as a fictitious and engineered resource akin to artificial thermostats in classical Monte Carlo or molecular dynamics simulations. Such an approach is relevant for many biomedical problems such as large-scale molecular simulations and optimizations in systems biology (which can be interpreted as a specific class of state preparation problems). The approach can be implemented on quantum computers, provided accurate and efficient quantum algorithms are developed. Such algorithms have to efficiently encode the environment as well as approximate the open system dynamics with a suitable system-environment interaction to drive the system evolution. Recently, many proposals have appeared in the literature to achieve these goals. In this talk I will present an efficient quantum algorithm for simulating open quantum systems dynamics described by the Markovian Lindblad master equation. In contrast to existing approaches, the proposed method achieves two significant advancements. First, it employs a repetition of unitary gates on a set of n system qubits and, remarkably, only a single ancillary bath qubit to represent the environment. It follows that, for the typical case of m-local Lindblad operators, we achieve an exponential improvement in the number of ancillae in terms of m, and up to a polynomial improvement in ancilla overhead for large n, with respect to other approaches. Although stochasticity is introduced, requiring multiple circuit realizations, the sampling overhead is independent of the system size. Second, we show that, under fixed accuracy conditions, our algorithm enables a reduction in the number of Trotter steps compared to other approaches, substantially decreasing circuit depth. These advancements hold particular significance for running the algorithm on near-term quantum computers, where minimizing both width and depth is critical due to inherent noise in their dynamics. I will further discuss how this approach can be extended to simulate non-Markovian evolution, thus including memory effects of the environment.
2025-07-21 17:18:00 17:37:00 02F NIH Cyberinfrastructure and Emerging Technologies Sessions Advancing quantum algorithms for elementary mode and metabolic flux analysis Chi Zhang Chi Zhang Metabolic networks play a central role in cellular function, supporting energy production, biosynthesis, and adaptation to environmental conditions. Elementary flux modes (EFMs) represent minimal sets of reactions that support steady-state flux distributions, and they form the basis for understanding metabolic capabilities and constraints. However, identifying biologically feasible EFMs in genome-scale metabolic networks remains a fundamental challenge, as the number of possible EFMs grows exponentially with network complexity, rendering full enumeration computationally infeasible. To address these limitations, we propose a quantum-based framework for efficiently exploring biologically plausible EFM distributions and predicting sample-specific metabolic fluxes. Our method formulates both tasks as Quadratic Unconstrained Binary Optimization (QUBO) problems, which we solve using quantum annealing. By leveraging the parallel sampling capability of quantum computing, this approach enables scalable and efficient search over high-dimensional solution spaces under biological constraints. To accommodate large genome-scale models, we incorporate tensor decomposition techniques that reduce model dimensionality and enable tractable QUBO formulations. Preliminary experiments on simulated metabolic networks with up to 25 reactions demonstrate that our method recovers diverse and structurally feasible EFMs. We observe that EFMs satisfying key properties—including stoichiometric balance, support minimality, and irreducibility—consistently appear with higher occurrence percentages across repeated sampling runs, whereas EFMs that violate these constraints are sampled with very low frequency. Furthermore, when applying different sample-specific constraints, we find that the high-frequency EFM sets vary across samples, indicating that the framework can distinguish condition-specific flux distributions without requiring full enumeration. Our framework offers a new direction for integrating omics data with constraint-based modeling using quantum-enhanced computation. This approach can be applied to a range of applications, including identifying altered pathways in disease, prioritizing therapeutic metabolic interventions, and uncovering condition-specific metabolic strategies. By bridging quantum optimization and systems biology, this method contributes a practical and interpretable tool for personalized metabolic analysis and hypothesis generation across diverse biological contexts.
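To illustrate the QUBO formulation described in this abstract, here is a minimal sketch: binary variables indicate which reactions are included in a candidate mode, and the energy penalizes stoichiometric imbalance plus a size term. The stoichiometric matrix and weights are hypothetical, and brute-force enumeration stands in for the quantum annealer on this tiny example.

```python
import itertools
import numpy as np

# Toy QUBO: x_i = 1 if reaction i is included in a candidate flux mode.
# Energy = penalty * ||S x||^2 (steady-state violation) + size_weight * sum(x).

S = np.array([[1, -1, 0, 0],       # hypothetical stoichiometric matrix
              [0, 1, -1, 0],       # (metabolites x reactions)
              [0, 0, 1, -1]])
penalty, size_weight = 10.0, 1.0

# For binary x, sum(x) = x^T I x, so the whole objective is one QUBO matrix.
Q = penalty * (S.T @ S) + size_weight * np.eye(S.shape[1])

best = min(
    itertools.product([0, 1], repeat=S.shape[1]),
    key=lambda x: np.array(x) @ Q @ np.array(x) if any(x) else np.inf,
)
print("lowest-energy nonempty reaction set:", best)  # the full linear pathway
```

A quantum annealer would repeatedly sample low-energy solutions of `Q` instead of enumerating, which is what makes the approach scale to larger networks.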
2025-07-21 17:37:00 17:56:00 02F NIH Cyberinfrastructure and Emerging Technologies Sessions Quantum Computing for Modeling Epigenetic Plasticity in Cancer Evolution Ariosto S. Silva, Ariosto S. Silva, PhD
2025-07-21 17:56:00 18:00:00 02F NIH Cyberinfrastructure and Emerging Technologies Sessions Closing Remarks Sean D. Mooney, PhD, Sean D. Mooney, PhD
2025-07-21 11:20:00 11:22:00 12 Publications - Navigating Journal Submissions Welcome and Introductions Ragothaman Yennamalli, Yana Bromberg, Sergio Pantano, Ragothaman Yennamalli
2025-07-21 11:22:00 11:40:00 12 Publications - Navigating Journal Submissions Effective cover letter writing and manuscript preparation for submitting manuscripts in crowded research areas Thomas Lengauer, Max Planck Institute for Informatics, Germany, EiC Bioinformatics Advances Thomas Lengauer, Max Planck Institute for Informatics, Germany, EiC Bioinformatics Advances I am one of the two Editors-in-Chief of the ISCB Society Journal Bioinformatics Advances, which is published jointly with Oxford University Press. After a short introduction to the profile of the journal, I will describe how our journal handles papers editorially and the recommendations that can be derived from that process for preparing a submission so that the authors' contribution is placed as clearly as possible. Careful placement of the original contribution is especially critical for topics in research areas that gather a large number of publications.
2025-07-21 11:40:00 11:50:00 12 Publications - Navigating Journal Submissions Choosing journals for submission in popular topics Laura Mesquita, Elsevier Laura Mesquita, Elsevier There are many factors that authors may take into account when submitting to a journal: scope fit, journal metrics, speed, names on the editorial board, business model (e.g. Open Access) and a journal's reputation. With the number of journals increasing at a rapid pace and the increasing prevalence of broad-scope journals, how authors make decisions about where to publish becomes increasingly complex. This presentation will cover successful strategies for choosing the right journal, and ways to pivot if a manuscript is rejected by the first-choice journal.
2025-07-21 11:50:00 12:00:00 12 Publications - Navigating Journal Submissions How do editors make decisions on submissions? Feilim Mac Gabhann Jason Papin, University of Virginia, PLOS Computational Biology, USA, Feilim Mac Gabhann
2025-07-21 12:00:00 12:10:00 12 Publications - Navigating Journal Submissions How do editors decide on accepting papers on highly similar topics? Michael J E Sternberg, Department of Life Sciences, Imperial College London Michael J E Sternberg, Journal of Molecular Biology, Imperial College, UK, Michael J E Sternberg, Department of Life Sciences, Imperial College London I will base my talk on my experience of being an Editor for Journal of Molecular Biology, focussing on computational biology. In particular, every year we publish a special issue entitled "Computational Resources for Molecular Biology". A major criterion for accepting resource papers is whether the resource can be delivered to the community via a web server; if this is not provided, we are unlikely to accept the paper. Another criterion is to demonstrate a method on an appropriately large number of examples rather than just applying it to one particular study - as the saying goes, "one swallow does not make a summer". Any paper reporting supervised machine learning must ensure that there is no data leakage between training and testing; this often arises when there are homologues spanning the two sets.
2025-07-21 12:10:00 12:20:00 12 Publications - Navigating Journal Submissions How and why eLife selects papers for peer review Michael Markie Michael Markie
2025-07-21 12:20:00 12:55:00 12 Publications - Navigating Journal Submissions Panel Discussion Thomas Lengauer, Laura Mesquita, Michael Markie, Michael J E Sternberg & Feilim Mac Gabhann
2025-07-23 11:20:00 12:00:00 11BC RegSys Exploring cellular plasticity: 4D epigenomes in the context of the tumour microenvironment Vera Pancaldi Vera Pancaldi Oncogenesis is characterized by alterations in chromatin organization and the reactivation of unicellular phenotypes at both metabolic and transcriptional levels. The underlying mechanisms remain largely unexplored, despite their critical relevance in cancer biology. We studied the spatial organization of genes in relation to their evolutionary origins, as well as changes occurring during cell differentiation and oncogenesis. We reveal significant topological changes in chromatin organization during cell differentiation, with patterns in specific regulatory marks involving Polycomb repression and RNA Polymerase II pausing being reversed during oncogenesis. Recent findings regarding epigenomic routes to oncogenesis led us to consider the importance of the tumour microenvironment in determining the plasticity of cancer cells in different environments, which we are studying through data-driven inference of regulatory networks in simplified in vitro culture systems. We will discuss our recent results and frame them in the context of changing oncogenesis paradigms.
2025-07-23 12:00:00 12:20:00 11BC RegSys Leveraging Transcription Factor Physical Proximity for Enhancing Gene Regulation Inference Yijie Wang Xiaoqing Huang, Aamir Raza Muneer Ahemad Hullur, Elham Jafari, Kaushik Shridhar, Kun Huang, Yijie Wang, Kenneth Mackie, Mu Zhou Motivation: Gene regulation inference, a key challenge in systems biology, is crucial for understanding cell function, as it governs processes such as differentiation, cell state maintenance, signal transduction, and stress response. Leading methods utilize gene expression, chromatin accessibility, Transcription Factor (TF) DNA binding motifs, and prior knowledge. However, they overlook the fact that TFs must be in physical proximity to facilitate transcriptional gene regulation. Results: To fill the gap, we develop GRIP – Gene Regulation Inference by considering TF Proximity – a gene regulation inference method that directly considers the physical proximity between regulating TFs. Specifically, we use the distance in a protein-protein interaction (PPI) network to estimate the physical proximity between TFs. We design a novel Boolean convex program, which can identify TFs that not only can explain the gene expression of target genes (TGs) but also stay close in the PPI network. We propose an efficient algorithm to solve the Boolean relaxation of the proposed model with a theoretical tightness guarantee. We compare GRIP with state-of-the-art methods (SCENIC+, DirectNet, Pando, and CellOracle) on inferring cell-type-specific (CD4, CD8, and CD14) gene regulation using the PBMC 3k scMultiome-seq data and demonstrate its superior performance in terms of the predictive power of the inferred TFs, the physical distance between the inferred TFs, and the agreement between the inferred gene regulation and PCHiC ground-truth data.
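A minimal sketch of the proximity signal GRIP exploits: how close a candidate set of regulating TFs sits in a PPI network, measured by shortest-path distances. The network edges and TF names below are hypothetical; this is an illustration of the distance metric, not the Boolean convex program itself.

```python
import itertools
import networkx as nx

# Hypothetical PPI network over a few TFs and one scaffold protein.
ppi = nx.Graph()
ppi.add_edges_from([
    ("TF_A", "TF_B"), ("TF_B", "TF_C"),
    ("TF_C", "TF_D"), ("TF_A", "SCAFFOLD"), ("SCAFFOLD", "TF_D"),
])

def mean_pairwise_distance(tfs):
    """Average shortest-path distance over all TF pairs (lower = more proximal)."""
    pairs = list(itertools.combinations(tfs, 2))
    return sum(nx.shortest_path_length(ppi, u, v) for u, v in pairs) / len(pairs)

print(mean_pairwise_distance(["TF_A", "TF_B", "TF_C"]))  # compact module
print(mean_pairwise_distance(["TF_A", "TF_D"]))          # more distant pair
```

GRIP couples a term like this with an expression-fit objective, so the selected TF set must both explain the target gene's expression and remain compact in the PPI network.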
2025-07-23 12:20:00 12:40:00 11BC RegSys miRBench: novel benchmark datasets for microRNA binding site prediction that mitigate against prevalent microRNA Frequency Class Bias Panagiotis Alexiou Stephanie Sammut, Katarina Gresova, Dimosthenis Tzimotoudis, Eva Marsalkova, David Cechak, Panagiotis Alexiou Motivation: MicroRNAs (miRNAs) are crucial regulators of gene expression, but the precise mechanisms governing their binding to target sites remain unclear. A major contributing factor to this is the lack of unbiased experimental datasets for training accurate prediction models. While recent experimental advances have provided numerous miRNA-target interactions, these are solely positive interactions. Generating negative examples in silico is challenging and prone to introducing biases, such as the miRNA frequency class bias identified in this work. Biases within datasets can compromise model generalization, leading models to learn dataset-specific artifacts rather than true biological patterns. Results: We introduce a novel methodology for negative sample generation that effectively mitigates the miRNA frequency class bias. Using this methodology, we curate several new, extensive datasets and benchmark several state-of-the-art methods on them. We find that a simple convolutional neural network model, retrained on some of these datasets, is able to outperform state-of-the-art methods. This highlights the potential for leveraging unbiased datasets to achieve improved performance in miRNA binding site prediction. To facilitate further research and lower the barrier to entry for machine learning researchers, we provide an easily accessible Python package, miRBench, for dataset retrieval, sequence encoding, and the execution of state-of-the-art models. Availability: The miRBench Python Package is accessible at https://github.com/katarinagresova/miRBench/releases/tag/v1.0.0 Contact: panagiotis.alexiou@um.edu.mt
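The frequency class bias described above arises when negatives are drawn without regard to how often each miRNA appears among positives. Below is a hedged sketch of the general idea of frequency-matched negative sampling (not the exact miRBench procedure): each miRNA contributes negatives in proportion to its positives, so a model cannot exploit per-miRNA class frequencies. All miRNA and site names are placeholders.

```python
import random
from collections import defaultdict

random.seed(0)

positives = [("miR-1", "siteA"), ("miR-1", "siteB"), ("miR-21", "siteC")]
all_sites = ["siteA", "siteB", "siteC", "siteD", "siteE", "siteF"]

# Group positive sites per miRNA.
pos_per_mirna = defaultdict(set)
for mirna, site in positives:
    pos_per_mirna[mirna].add(site)

negatives = []
for mirna, pos_sites in pos_per_mirna.items():
    candidates = [s for s in all_sites if s not in pos_sites]
    # One negative per positive keeps each miRNA's class ratio at 1:1,
    # removing miRNA identity as a shortcut feature.
    negatives += [(mirna, s) for s in random.sample(candidates, len(pos_sites))]

print(negatives)
```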
2025-07-23 12:40:00 13:00:00 11BC RegSys Flash Talk Session 1 Aryan Kamal, Damla Baydar, Laura Hinojosa, Charles-Henri Lecellier Session with 4 short talks: Aryan Kamal - Transcriptional regulation of cell fate plasticity in hematopoiesis Damla Övek Baydar - Enhancing JASPAR and UniBind databases with deep learning models for transcription factor-DNA interactions Laura Hinojosa - Master Transcription Factors Regulate Replication Timing Charles-Henri Lecellier - DNA replication timing and Copy Number Variations are confounders of RNA-DNA interaction data
2025-07-23 14:00:00 14:20:00 11BC RegSys Unicorn: Enhancing Single-Cell Hi-C Data with Blind Super-Resolution for 3D Genome Structure Reconstruction Oluwatosin Oluwadare Mohan Kumar Chandrashekar, Rohit Menon, Samuel Olowofila, Oluwatosin Oluwadare Motivation: Single-cell Hi-C (scHi-C) data provide critical insights into chromatin interactions at individual cell levels, uncovering unique genomic 3D structures. However, scHi-C datasets are characterized by sparsity and noise, complicating efforts to accurately reconstruct high-resolution chromosomal structures. In this study, we present ScUnicorn, a novel blind Super-Resolution framework for scHi-C data enhancement. ScUnicorn employs an iterative degradation kernel optimization process, unlike traditional Super-resolution approaches, which rely on downsampling, predefined degradation ratios, or constant assumptions about the input data to reconstruct high-resolution interaction matrices. Hence, our approach more reliably preserves critical biological patterns and minimizes noise. Additionally, we propose 3DUnicorn, a maximum likelihood algorithm that leverages the enhanced scHi-C data to infer precise 3D chromosomal structures. Result: Our evaluation demonstrates that ScUnicorn achieves superior performance over the state-of-the-art methods in terms of Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and GenomeDisco scores. Moreover, 3DUnicorn’s reconstructed structures align closely with experimental 3D-FISH data, underscoring its biological relevance. Together, ScUnicorn and 3DUnicorn provide a robust framework for advancing genomic research by enhancing scHi-C data fidelity and enabling accurate 3D genome structure reconstruction. Code Availability: Unicorn implementation is publicly accessible at https://github.com/OluwadareLab/Unicorn
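One of the evaluation metrics cited above, PSNR, is easy to state concretely. The sketch below computes it between an enhanced scHi-C contact matrix and a high-coverage reference; the matrices here are random placeholders, not real Hi-C data.

```python
import numpy as np

rng = np.random.default_rng(0)
reference = rng.random((64, 64))                          # stand-in contact map
enhanced = reference + rng.normal(scale=0.05, size=(64, 64))  # noisy estimate

def psnr(ref: np.ndarray, est: np.ndarray) -> float:
    """Peak signal-to-noise ratio in decibels (higher = closer to reference)."""
    mse = np.mean((ref - est) ** 2)
    return 20 * np.log10(ref.max() / np.sqrt(mse))

print(f"PSNR: {psnr(reference, enhanced):.1f} dB")
```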
2025-07-23 14:20:00 14:40:00 11BC RegSys Predicting gene-specific regulation with transcriptomic and epigenetic single-cell data Laura Rumpf Laura Rumpf, Fatemeh Behjati, Dennis Hecker, Marcel Schulz To gain insights into phenotype-specific gene regulation, we present our integrative analysis approach MetaFR harnessing single-cell epigenetic and transcriptomic data. MetaFR generates random forest regression models in a gene-specific manner utilizing both scATAC-seq and scRNA-seq data to predict gene expression in a large window around a target gene. The gene window is partitioned into bins of equal size which correspond to the model features holding the epigenetic signal counts. The importance of model features can be leveraged to prioritize enhancer-gene interactions. The inherent sparsity problem of single-cell data is addressed by aggregating the scRNA-seq and scATAC-seq signal into metacells based on gene activity similarities. MetaFR enables large-scale analysis of scATAC-seq and scRNA-seq data in an automated fashion. The automated pipeline has been successfully applied to a human PBMC dataset to identify immune cell-specific enhancer-gene interactions. We validated our findings with experimentally measured interactions (CRISPRi regions) and fine-mapped eQTLs. We benchmarked our performance against the state-of-the-art method SCARlink. We were able to outperform SCARlink in both accuracy and runtime. Our pipeline allows time-efficient analysis and obtains reliable models of gene expression, which can be used to study gene regulatory elements in any organism for which scRNA-seq and scATAC-seq data becomes available.   
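A minimal sketch of the per-gene model MetaFR describes: predict a gene's (metacell-aggregated) expression from binned scATAC-seq counts in a window around the gene, then read candidate enhancer bins off the feature importances. The simulated data, bin count, and true signal bins below are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_metacells, n_bins = 200, 50
X = rng.poisson(2.0, size=(n_metacells, n_bins)).astype(float)   # accessibility bins
y = X[:, 20] * 0.8 + X[:, 25] * 0.5 + rng.normal(size=n_metacells)  # expression

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out R^2:", round(model.score(X_te, y_te), 3))

# Feature importances prioritize candidate enhancer bins for this gene.
top_bins = np.argsort(model.feature_importances_)[::-1][:3]
print("top bins:", top_bins)  # should recover bins 20 and 25
```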
2025-07-23 14:40:00 15:00:00 11BC RegSys Biophysical deep learning resolves how TF and DNA sequence specify the genome state of every cell population in human embryogenesis Vitalii Kleshchevnikov Vitalii Kleshchevnikov, Oliver Stegle, Omer Bayraktar Understanding how interactions between transcription factors (TFs) and DNA sequence are orchestrated and give rise to the vast complexity of cell types is a major challenge of regulatory developmental biology. Large-scale multimodal single-cell RNA-seq and ATAC-seq atlases enable reconstructing the regulatory mechanisms across cell types from data, laying the foundation for cell programming and design of synthetic regulatory elements. Despite significant progress, current DNA sequence models fail to account for cellular context, TF-DNA sequence relationships and TF combinatorics in a principled manner, limiting their causal expressiveness and generalization capacity across cell types. To overcome this, we developed cell2state, an end-to-end deep learning model with biophysical constraints on how TFs specify the genome accessibility state in every cell population. Cell2state leverages known TF-motif interactions while accounting for biophysical constraints, employs an interpretable neural network based on the HyenaDNA architecture, and captures TF-TF synergy and antagonism, enabling the model to integrate DNA sequence and transcription factor (TF) abundance. We demonstrated cell2state's generalisation capabilities by predicting ATAC-seq signals for new chromosomes and cell types. To link regulatory TF interactions to developmental processes at whole embryo scale, we applied cell2state to an unpublished multimodal single-cell and spatial transcriptomics atlas covering over 1,000 human developmental cell states (n=4,000 pseudobulk replicates, n=5 embryos). At critical developmental junctions, such as the dorsal-ventral patterning of the spinal cord/hindbrain and anterior-posterior patterning of the forebrain, cell2state revealed how enhancer DNA sequences integrate activities of cell-type-defining TFs (LHX2, PAX6) with cell communication pathway TFs (GLI, TCF).
2025-07-23 15:00:00 15:20:00 11BC RegSys Nona: A unifying multimodal masked modeling framework for functional genomics Surag Nair Surag Nair, Alex Tseng, Ehsan Hajiramezanali, Nathaniel Diamant, Avantika Lal, Tommaso Biancalani, Gabriele Scalia, Gokcen Eraslan We present Nona, a unifying multimodal masked modeling paradigm for functional genomics. Nona is a neural network model that operates on both DNA sequence and epigenetic tracks such as DNase-seq, ChIP-seq, and RNA-seq at base-pair resolution. By leveraging a flexible masking strategy, Nona can predict any subset of masked DNA and/or tracks from the unmasked subset. As a result, Nona encompasses versatile existing and novel use cases that were hitherto addressed using separate models. In addition to vanilla sequence-to-function prediction and DNA language modeling, Nona enables multiple novel application modes, of which we highlight 3: 1) context-aware prediction, where the model predicts epigenetic tracks in a local genomic window by taking into account the observed epigenetic tracks in adjacent windows, in addition to the DNA sequence, 2) sequence generation, where a conditional language model is used to iteratively generate a DNA sequence with desired epigenetic profiles across cellular states, 3) functional genotyping, where a conditional language model trained on base resolution ATAC-seq is used to infer the genotype of the sample donors. Beyond these applications, Nona can enable use cases such as functional perturbations and denoising functional measurements. Altogether, Nona is a versatile paradigm that extends sequence-to-function and masked language modeling to novel applications in regulatory genomics.
2025-07-23 15:20:00 15:40:00 11BC RegSys SCRIMPy: Single Cell Replication Inference from Multiome data using Python Tatevik Jalatyan Tatevik Jalatyan, Jennifer Herrmann, Antonio Rodriguez-Romera, Beth Psaila, Jim Hughes, Simone Riva, Robert Beagrie The cell cycle is a fundamental biological process crucial for an organism’s growth and development. Dysregulation of the cell cycle can lead to diseases such as cancer and neurodegenerative, cardiovascular, or autoimmune disorders. Thus, accurate characterization of cell cycle dynamics in healthy and disease states is important for understanding disease mechanisms. Existing methods for cell cycle state prediction from single-cell data use the expression of marker genes in individual cells. However, these approaches perform poorly on single-cell multiome (ATAC+GEX) data, likely due to the increased data sparsity and nuclear RNA bias. To address these limitations, we propose a novel method for cell cycle state inference that uses replication-driven DNA copy number signals from scATAC-seq data. Our approach is based on two complementary metrics that reflect the replication state of individual cells. First, we capture the imbalance of ATAC fragment depth between early- and late-replicating regions of the genome to identify S-phase cells with higher DNA copy number in early replicating domains. Second, we introduce a novel metric for DNA copy number in ATAC-seq data to differentiate G1-phase cells from G2/M-phase cells, since the latter have duplicated DNA content. We apply this method to multiome data from mouse embryonic stem cells sorted by cell cycle state (G1, S, G2/M) and show that SCRIMPy outperforms the commonly used expression-based classifier Seurat. With the increasing availability of multiome datasets, this approach holds promise for deriving novel insights into cell cycle mechanisms in diseases and identifying potential therapeutic targets.
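A toy sketch of the first metric described above: per cell, compare ATAC fragment depth in early- versus late-replicating domains. S-phase cells, having already duplicated early domains, show a positive log-ratio. The counts and replication-timing labels below are simulated assumptions, not SCRIMPy's actual inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins = 1000
is_early = rng.random(n_bins) < 0.5            # hypothetical replication-timing labels

def early_late_log_ratio(fragments_per_bin: np.ndarray) -> float:
    """Positive values suggest extra copies of early-replicating DNA (S phase)."""
    early = fragments_per_bin[is_early].sum()
    late = fragments_per_bin[~is_early].sum()
    return np.log2((early + 1) / (late + 1))   # pseudocounts avoid division by zero

g1_cell = rng.poisson(2.0, n_bins)                  # uniform copy number
s_cell = rng.poisson(np.where(is_early, 3.0, 2.0))  # early domains replicated

print("G1-like cell:", round(early_late_log_ratio(g1_cell), 3))
print("S-like cell: ", round(early_late_log_ratio(s_cell), 3))
```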
2025-07-23 15:40:00 16:00:00 11BC RegSys Soffritto: a deep-learning model for predicting high-resolution replication timing Dante Bolzan Dante Bolzan, Ferhat Ay Motivation: Replication Timing (RT) refers to the order by which DNA loci are replicated during S phase. RT is cell-type specific and implicated in cellular processes including transcription, differentiation, and disease. RT is typically quantified genome-wide using two-fraction assays (e.g., Repli-Seq) which sort cells into early and late S phase fractions followed by DNA sequencing yielding a ratio as the RT signal. While two-fraction RT data is widely available in multiple cell lines, it is limited in its ability to capture high-resolution RT features. To address this, high-resolution Repli-Seq, which quantifies RT across 16 fractions, was developed, but it is costly and technically challenging with very limited data generated to date. Results: Here we developed Soffritto, a deep learning model that predicts high-resolution RT data using two-fraction RT data, histone ChIP-seq data, GC content, and gene density as input. Soffritto is composed of a Long Short Term Memory (LSTM) module and a prediction module. The LSTM module learns long- and short-range interactions between genomic bins while the prediction module is composed of a fully connected layer that outputs a 16-fraction probability vector for each bin using the LSTM module’s embeddings as input. By performing both within cell line and cross cell line training and testing for five human and mouse cell lines, we show that Soffritto is able to capture experimental 16-fraction RT signals with high accuracy and the predicted signals allow detection of high-resolution RT patterns.
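The Soffritto abstract specifies its architecture precisely enough to sketch: an LSTM over genomic bins followed by a fully connected layer that outputs a 16-fraction probability vector per bin. The sketch below follows that description but is not the released code; the input feature count and hidden size are placeholders.

```python
import torch
import torch.nn as nn

class SoffrittoLike(nn.Module):
    def __init__(self, n_features: int = 10, hidden: int = 64, n_fractions: int = 16):
        super().__init__()
        # Bidirectional LSTM captures long- and short-range interactions between bins.
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_fractions)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, bins, features) -> (batch, bins, 16) per-bin probabilities
        embeddings, _ = self.lstm(x)
        return torch.softmax(self.head(embeddings), dim=-1)

model = SoffrittoLike()
bins = torch.randn(1, 500, 10)        # one region of 500 genomic bins
probs = model(bins)
print(probs.shape, float(probs[0, 0].sum()))  # (1, 500, 16); each bin sums to 1.0
```

The features per bin would be the inputs named in the abstract: two-fraction RT signal, histone ChIP-seq tracks, GC content, and gene density.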
2025-07-23 16:40:00 17:00:00 11BC RegSys Ledidi: Programmatic design and editing of cis-regulatory elements Jacob Schreiber Jacob Schreiber, Franziska Lorbeer, Monika Heinzl, Yang Lu, Alexander Stark, William Noble The development of modern genome editing tools has enabled researchers to make such edits with high precision, but has left unsolved the problem of designing these edits. As a solution, we propose Ledidi, a computational approach that rephrases the design of genomic edits as a continuous optimization problem where the goal is to produce the desired outcome as measured by one or more predictive models using as few edits from an initial sequence as possible. Ledidi can be paired with any pre-trained machine learning model, and when applied across dozens of such models, we find that Ledidi can quickly design edits to precisely control transcription factor binding, chromatin accessibility, transcription, and enhancer activity across several species. Ledidi can achieve its target objective using surprisingly few edits by converting weak affinity TF binding sites into stronger affinity ones, and can do so almost an order of magnitude faster than other approaches. Unlike other approaches, Ledidi can use several models simultaneously to programmatically design edits that exhibit multiple desired characteristics. We demonstrate this capability by designing uniformly accessible regions with controllable patterns of TF binding, by designing cell type-specific enhancers, and by showing how one can use multiple models that predict the same thing to more robustly design edits. Finally, we introduce the concept of an affinity catalog, in which multiple sets of edits are designed that induce a spectrum of outcomes, and demonstrate the practical benefits of this approach for design tasks and scientific understanding.
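Ledidi's core move, rephrasing edit design as continuous optimization, can be sketched in a few lines: treat the sequence as trainable logits and minimize (distance to the desired model output) plus a penalty on deviation from the original sequence. The predictor below is a differentiable stand-in, not a trained model, and the loss weights are arbitrary; this is a conceptual sketch of the relaxation, not Ledidi itself.

```python
import torch

torch.manual_seed(0)
L = 50
original = torch.nn.functional.one_hot(torch.randint(0, 4, (L,)), 4).float()

def predictor(seq: torch.Tensor) -> torch.Tensor:
    # Placeholder differentiable model: weighted sum of base content.
    w = torch.linspace(-1, 1, 4)
    return (seq @ w).mean()

logits = original.clone().mul(5.0).requires_grad_(True)  # start near the original
target, lam = torch.tensor(0.9), 0.1
opt = torch.optim.Adam([logits], lr=0.05)

for _ in range(200):
    opt.zero_grad()
    seq = torch.softmax(logits, dim=-1)           # relaxed (soft) one-hot sequence
    edit_penalty = (seq - original).abs().sum()   # stay close to the original
    loss = (predictor(seq) - target) ** 2 + lam * edit_penalty / L
    loss.backward()
    opt.step()

designed = torch.argmax(logits, dim=-1)
print("edited positions:", int((designed != original.argmax(-1)).sum()))
```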
2025-07-23 17:00:00 17:20:00 11BC RegSys Lilliput: Compact native regulatory element design with machine learning-guided miniaturization Laura Gunsalus Laura Gunsalus, Avantika Lal, Tommaso Biancalani, Gokcen Eraslan Size-limited gene therapy vectors require compact cell type-specific regulatory elements. Existing miniaturized sequences have been hand-selected and curated, relying on costly experimental iteration. We present Lilliput, a method for designing compact and specific regulatory elements by nominating and iteratively editing endogenous elements with state-of-the-art DNA sequence-to-function models. Our approach involves scoring elements in silico, removing subsequences with limited predicted impact, and introducing minimal mutations to increase specificity. We demonstrate the effectiveness of our approach by reducing a 10kb heart-specific locus to under 300bp. Our method offers a generalizable framework for engineering mini-elements across diverse target cell types. More broadly, we identify core sequence features sufficient to determine cell-type specific expression patterns, advancing our understanding of the mechanisms underlying precise control of gene expression.
2025-07-23 17:20:00 18:00:00 11BC RegSys Learning the Regulatory Genome by Destruction and Creation Luca Pinello Luca Pinello The regulatory genome operates through a complex DNA language that controls gene expression. In this keynote, I will present two complementary approaches to decode this language: learning by precise perturbation and learning by generative design. First, I will introduce CRISPR-CLEAR, which combines dense base editing with sequencing of resulting mutations to map regulatory elements at single-nucleotide resolution. We systematically "destroy" regulatory sequences through targeted mutations to identify functional nucleotides. Applied to the CD19 enhancer, we pinpoint exact bases whose alteration confers resistance to CAR-T therapy—revealing clinically actionable insights. Second, I will present DNA-Diffusion, which uses generative AI to "create" novel regulatory elements by learning from thousands of cell-type-specific sequences. This diffusion model generates synthetic 200bp elements that often exceed the activity of endogenous enhancers. We validated 5,850 sequences through reporter assays and pioneered direct genomic replacement to show these synthetic elements can precisely modulate therapeutic targets like AXIN2 in leukemia cells. Together, systematic perturbation and generative design provide complementary lenses for understanding regulatory logic. CRISPR-CLEAR reveals which nucleotides matter; DNA-Diffusion demonstrates we can engineer better solutions. This dual framework opens new avenues for precision gene therapy, where understanding and designing regulatory elements become two sides of the same coin.
2025-07-24 08:40:00 09:20:00 11BC RegSys TBA Roser Vento-Tormo
2025-07-24 09:20:00 09:40:00 11BC RegSys Anomaly Detection in Spatial Transcriptomics via Spatially Localized Density Comparison Gary Hu Gary Hu, Julian Gold, Uthsav Chitra, Sunay Joshi, Benjamin Raphael Motivation: Perturbations in biological tissues – e.g. due to inflammation, disease, or drug treatment – alter the composition of cell types and cell states in the tissue. These alterations are often spatially localized in different regions of a tissue, and can be measured using spatial transcriptomics technologies. However, current methods to analyze differential abundance in cell types or cell states either do not incorporate spatial information – and thus cannot identify spatially localized alterations – or use heuristic and inaccurate approaches. Results: We introduce Spatial Anomaly Region Detection in Expression Manifolds (Sardine), a method to estimate spatially localized changes in spatial transcriptomics data obtained from tissue slices from two or more conditions. Sardine estimates the probability of a cell state being at the same (relative) spatial location between different conditions using spatially localized density estimation. On simulated data, Sardine recapitulates the spatial patterning of expression changes more accurately than existing approaches. On a Visium dataset of the mouse cerebral cortex before and after injury response, as well as on a Visium dataset of a mouse spinal cord undergoing electrotherapy, Sardine identifies regions of spatially localized expression changes that are more biologically plausible than alternative approaches.
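A toy sketch in the spirit of the spatially localized density comparison described above: estimate, at each spatial location, the relative density of a cell state under two conditions and flag where it changes most. The coordinates are simulated and the kernel density estimator is a simplification; this is not the Sardine algorithm.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
control = rng.uniform(0, 10, size=(2, 300))             # (x, y) cell positions
treated = np.concatenate(                               # extra cells in one corner
    [rng.uniform(0, 10, size=(2, 240)), rng.uniform(5, 10, size=(2, 60))], axis=1
)

kde_control, kde_treated = gaussian_kde(control), gaussian_kde(treated)

# Compare densities on a grid; large |log-ratio| marks localized abundance shifts.
grid = np.mgrid[0:10:20j, 0:10:20j].reshape(2, -1)
log_ratio = np.log2(kde_treated(grid) / kde_control(grid))
print("most depleted location:", grid[:, np.argmin(log_ratio)])
print("most enriched location:", grid[:, np.argmax(log_ratio)])
```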
2025-07-24 09:40:00 10:00:00 11BC RegSys Flash Talk Session 2 Maxime Christophe, Gabriela A Merino, Erick Isaac Navarro Delgado, Tomas Rube Session with 4 short talks: Maxime Christophe - Interpretable deep learning reveals sequence determinants of nucleosome positioning in mammalian genomes Gabriela A Merino - Ensembl’s multispecies catalogue of regulatory elements Erick Isaac Navarro Delgado - RAMEN: A reproducible framework for dissecting individual, additive and interactive gene-environment contributions in genomic regions with variable DNA methylation Tomas Rube - Accurate affinity models for SH2 domains from peptide binding assays and free-energy regression
2025-07-24 11:20:00 11:40:00 11BC RegSys GASTON-Mix: a unified model of spatial gradients and domains using spatial mixture-of-experts Uthsav Chitra Uthsav Chitra, Shu Dan, Fenna Krienen, Ben Raphael Motivation: Gene expression varies across a tissue due to both the organization of the tissue into spatial domains, i.e. discrete regions of a tissue with distinct cell type composition, and continuous spatial gradients of gene expression within different spatial domains. Spatially resolved transcriptomics (SRT) technologies provide high-throughput measurements of gene expression in a tissue slice, enabling the characterization of spatial gradients and domains. However, existing computational methods for quantifying spatial variation in gene expression either model only spatial domains – and do not account for continuous gradients of expression – or require restrictive geometric assumptions on the spatial domains and spatial gradients that do not hold for many complex tissues. Results: We introduce GASTON-Mix, a machine learning algorithm to identify both spatial domains and spatial gradients within each domain from SRT data. GASTON-Mix extends the mixture-of-experts (MoE) deep learning framework to a spatial MoE model, combining the clustering component of the MoE model with a neural field model that learns a separate 1-D coordinate (“isodepth”) within each domain. The spatial MoE is capable of representing any geometric arrangement of spatial domains in a tissue, and the isodepth coordinates define continuous gradients of gene expression within each domain. We show using simulations and real data that GASTON-Mix identifies spatial domains and spatial gradients of gene expression more accurately than existing methods. GASTON-Mix reveals spatial gradients in the striatum and lateral septum that regulate complex social behavior, and GASTON-Mix reveals localized spatial gradients of hypoxia and TNF-α signaling in the tumor microenvironment.
2025-07-24 11:40:00 12:00:00 11BC RegSys Refinement Strategies for Tangram for Reliable Single-Cell to Spatial Mapping Merle Stahl Merle Stahl, Lena J. Straßer, Chit Tong Lio, Judith Bernett, Richard Röttger, Markus List Motivation: Single-cell RNA sequencing (scRNA-seq) provides comprehensive gene expression data at a single-cell level but lacks spatial context. In contrast, spatial transcriptomics captures both spatial and transcriptional information but is limited by resolution, sensitivity, or feasibility. No single technology combines both the high spatial resolution and deep transcriptomic profiling at the single-cell level without trade-offs. Spatial mapping tools that integrate scRNA-seq and spatial transcriptomics data are crucial to bridge this gap. However, we found that Tangram, one of the most prominent spatial mapping tools, provides inconsistent results over repeated runs. Results: We refine Tangram to achieve more consistent cell mappings and investigate the challenges that arise from data characteristics. We find that the mapping quality depends on the gene expression sparsity. To address this, we (1) train the model on an informative gene subset, (2) apply cell filtering, (3) introduce several forms of regularization, and (4) incorporate neighborhood information. Evaluations on real and simulated mouse datasets demonstrate that this approach improves both gene expression prediction and cell mapping. Consistent cell mapping strengthens the reliability of the projection of cell annotations and features into space, gene imputation, and correction of low-quality measurements. Our pipeline, which includes gene set and hyperparameter selection, can serve as guidance for applying Tangram on other datasets, while our benchmarking framework with data simulation and inconsistency metrics is useful for evaluating other tools or Tangram modifications. Availability: The refinements for Tangram and our benchmarking pipeline are available at https://github.com/daisybio/Tangram_Refinement_Strategies.
2025-07-24 12:00:00 12:20:00 11BC RegSys Encoding single-cell chromatin landscapes as probability distributions with optimal transport Cassandra Burdziak Cassandra Burdziak, Danielle Maydan, Doron Haviv, Ronan Chaligne, Dana Pe'Er Single-cell measurement of paired epigenetic and transcriptomic features is becoming routine, and promises to license more sophisticated models of gene regulation. Still, most existing models are limited to the cis-regulatory element representation (typically, averaged signal at pre-defined accessibility “peaks”), which shrouds much of the chromatin molecule’s fine-grained structure. To maximize chromatin’s explanatory power for cell-state (and fate) prediction, we sought to achieve a more unbiased, quantitative representation of the chromatin molecule by treating the accessibility landscape as a discrete (per-base pair) probability distribution. Given single-cell accessibility data, our approach embeds the chromatin landscape of each cell state according to the optimal transport (OT) distance between the empirical distribution of accessibility at particular loci, whilst controlling for sequence-related biases in DNA tagmentation. The resulting embeddings capture the precise shape of the accessibility distribution, which itself reflects transcription factor binding footprints, nucleosome positions, and RNA polymerase movement. Application of this model in the well-studied hematopoiesis system highlights its superior ability to explain cell-state: the latent accessibility distribution is more universally predictive of gene expression than promoter accessibility, and can define transcription factor binding modes active in specific branches of development. Most excitingly, position in latent space may closely correspond with the presence of certain activating or repressive chromatin marks, despite the model lacking such information during training. This representation may thus empower future models of gene regulation with a richer representation of epigenetic data with stronger ties to cellular phenotypes.
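The core comparison described above is concrete enough to sketch: treat per-base-pair accessibility at a locus as a probability distribution and compare cell states by the 1-D optimal transport (Wasserstein) distance. The simulated profiles below are placeholders, and this ignores the tagmentation-bias correction the abstract mentions.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
positions = np.arange(2000)                       # base pairs across a locus

def profile(center: int) -> np.ndarray:
    """Simulated accessibility landscape with a peak at `center`."""
    p = np.exp(-0.5 * ((positions - center) / 150.0) ** 2) + 0.01
    return p / p.sum()

state_a, state_b, state_c = profile(800), profile(850), profile(1500)

# OT distance reflects how far accessibility mass must move between cell states.
print(wasserstein_distance(positions, positions, state_a, state_b))  # small shift
print(wasserstein_distance(positions, positions, state_a, state_c))  # large shift
```

Unlike averaging signal over a peak, this distance is sensitive to the shape and position of the accessibility distribution, which is what lets it reflect footprints and nucleosome positioning.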
2025-07-24 12:20:00 12:40:00 11BC RegSys scooby: Modeling multi-modal genomic profiles from DNA sequence at single-cell resolution Laura D. Martens Laura D. Martens, Johannes C. Hingerl, Alexander Karollus, Trevor Manz, Jason D. Buenrostro, Fabian J. Theis, Julien Gagneur Understanding how regulatory sequences shape gene expression across individual cells is a fundamental challenge in genomics. Joint RNA-seq and epigenomic profiling provides opportunities to build models capturing sequence determinants across steps of gene expression. However, current models, developed primarily for bulk omics data, fail to capture the cellular heterogeneity and dynamic processes revealed by single-cell multi-modal technologies. Here, we introduce scooby, a framework to model genomic profiles of scRNA-seq coverage and scATAC-seq insertions from sequence at single-cell resolution. For this, we leverage the pre-trained multi-omics profile predictor Borzoi and equip it with a cell-specific decoder. Scooby recapitulates cell-specific expression levels of held-out genes and identifies regulators and their putative target genes. Moreover, scooby allows resolving single-cell effects of bulk eQTLs and delineating their impact on chromatin accessibility and gene expression. We anticipate scooby to aid unraveling the complexities of gene regulation at the resolution of individual cells.
2025-07-24 12:40:00 13:00:00 11BC RegSys Uncovering Novel Cellular Programs and Regulatory Circuits Underlying Bifurcating Human B Cell States Jishnu Das Zarifeh Rarani, Swapnil Keshari, Akanksha Sachan, Nicholas Pease, Jingyu Fan, Peter Gerges, Harinder Singh, Jishnu Das B cells upon antigen encounter undergo activation followed by a bifurcation either into extrafollicular plasmablasts (PB) or into germinal center (GC) B cells. We have assembled gene regulatory networks (GRNs) underlying this bifurcation using temporally resolved single cell multiomics. To complement this, we analyzed transcriptomic states of GC and PB cells using SLIDE, a novel interpretable machine learning approach, to infer a small set of cellular programs (latent factors/LFs) necessary and sufficient to distinguish GC and PB cells. These LFs provide stronger discrimination between the two emergent cell states than DEG analyses. Interestingly, when the LF genes were cross-referenced with state-specific GRNs, the LFs recapitulated aspects of GRN architecture orchestrating the bifurcation. Intriguingly, the LFs also captured gene programs reflective of cell-fate propensity prior to the bifurcation in activated B cells. These programs were validated using perturbation of key TFs. To move beyond high-resolution static state-specific GRNs, we used a stochastic ODE-based framework to construct a dynamic GRN across the 5 states. In addition to recapitulating previously known lineage-defining TFs and their regulons, we identify novel regulons as driving divergent gene activity across the bifurcation trajectory. We also combined the dynamic GRN with the inferred cellular programs to predict TF pairs that combinatorially control B cell fate dynamics. Intriguingly, several of these inferred TF pairs are not detected by conventional network topological metrics. Overall, our framework is generalizable and applicable across contexts to identify cellular programs and regulatory circuits underlying diverse cell fate bifurcations.
2025-07-24 14:00:00 14:20:00 11BC RegSys Detection of Cell-type-specific Differentially Methylated Regions in Epigenome-Wide Association Studies Yingying Wei Ruofan Jia, Yingying Wei DNA methylation at cytosine-phosphate-guanine (CpG) sites is one of the most important epigenetic markers. Therefore, epidemiologists are interested in investigating DNA methylation in large cohorts through epigenome-wide association studies (EWAS). However, the observed EWAS data are bulk data with signals aggregated from distinct cell types. Deconvolution of cell-type-specific signals from EWAS data is challenging because phenotypes can affect both cell-type proportions and cell-type-specific methylation levels. Recently, there has been active research on detecting cell-type-specific risk CpG sites for EWAS data. However, existing methods all assume that the methylation levels of different CpG sites are independent and perform association detection for each CpG site separately. Although they significantly improve detection at the aggregate level (identifying a CpG site as a risk CpG site as long as it is associated with the phenotype in any cell type), they have low power in detecting cell-type-specific associations for EWAS with typical sample sizes. Here, we develop a new method, Fine-scale inference for Differentially Methylated Regions (FineDMR), to borrow strengths of nearby CpG sites to improve the cell-type-specific association detection. Via a Bayesian hierarchical model built upon Gaussian process functional regression, FineDMR takes advantage of the spatial dependencies between CpG sites. FineDMR can provide cell-type-specific association detection as well as output subject-specific and cell-type-specific methylation profiles for each subject. Simulation studies and real data analysis show that FineDMR substantially improves the power in detecting cell-type-specific associations for EWAS data. FineDMR is freely available at https://github.com/JiaRuofan/Detection-of-Cell-type-specific-DMRs-in-EWAS.
2025-07-24 14:20:00 14:40:00 11BC RegSys MutBERT: Probabilistic Genome Representation Improves Genomics Foundation Models Weicai Long Weicai Long, Houcheng Su, Jiaqi Xiong, Yanlin Zhang Motivation: Understanding the genomic foundation of human diversity and disease requires models that effectively capture sequence variation, such as single nucleotide polymorphisms (SNPs). While recent genomic foundation models have scaled to larger datasets and multi-species inputs, they often fail to account for the sparsity and redundancy inherent in human population data, such as those in the 1000 Genomes Project. SNPs are rare in humans, and current masked language models (MLMs) trained directly on whole-genome sequences may struggle to efficiently learn these variations. Additionally, training on the entire dataset without prioritizing regions of genetic variation results in inefficiencies and negligible gains in performance. Results: We present MutBERT, a probabilistic genome-based masked language model that efficiently utilizes SNP information from population-scale genomic data. By representing the entire genome as a probabilistic distribution over observed allele frequencies, MutBERT focuses on informative genomic variations while maintaining computational efficiency. We evaluated MutBERT against DNABERT-2, various versions of Nucleotide Transformer, and modified versions of MutBERT across multiple downstream prediction tasks. MutBERT consistently ranked as one of the top-performing models, demonstrating that this novel representation strategy enables better utilization of biobank-scale genomic data in building pretrained genomic foundation models. Availability: https://github.com/ai4nucleome/mutBERT Contact: yanlinzhang@hkust-gz.edu.cn
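A hedged sketch of the probabilistic input representation described above (not MutBERT's exact encoding): each position is a distribution over A/C/G/T, with reference bases as point masses and SNP sites spread according to population allele frequencies. The locus and frequencies below are hypothetical.

```python
import numpy as np

BASES = "ACGT"

def probabilistic_sequence(ref: str, snps: dict) -> np.ndarray:
    """ref: reference sequence; snps: {position: {base: allele_frequency}}."""
    mat = np.zeros((len(ref), 4))
    for i, base in enumerate(ref):
        if i in snps:
            # SNP site: distribute probability mass by observed allele frequency.
            for alt, freq in snps[i].items():
                mat[i, BASES.index(alt)] = freq
        else:
            # Monomorphic site: point mass on the reference base.
            mat[i, BASES.index(base)] = 1.0
    return mat

# Hypothetical locus: position 2 carries a common G/A polymorphism.
encoded = probabilistic_sequence("ACGTA", {2: {"G": 0.7, "A": 0.3}})
print(encoded)
```

Feeding such soft distributions to a masked language model concentrates learning on the rare variable positions instead of the overwhelmingly invariant reference.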
2025-07-24 14:40:00 15:00:00 11BC RegSys Detecting and avoiding homology-based data leakage in genome-trained sequence models Abdul Muntakim Rafi Abdul Muntakim Rafi, Brett Kiyota, Nozomu Yachie, Carl de Boer Models that predict function from DNA sequence have become critical tools in deciphering the roles of genomic sequences and genetic variation within them. However, traditional approaches for dividing the genomic sequences into training data, used to create the model, and test data, used to determine the model’s performance on unseen data, fail to account for the widespread homology that permeates the genome. Using models that predict human gene expression from DNA sequence, we demonstrate that model performance on test sequences varies by their similarity with training sequences, consistent with homology-based ‘data leakage’ that influences model performance by rewarding overfitting of homologous sequences. Because the sequence and its function are inexorably linked, even a maximally overfit model with no understanding of gene regulation can predict the expression of sequences that are similar to its training data. To prevent leakage in genome-trained models, we introduce ‘hashFrag’, a scalable solution for partitioning data with minimal leakage. hashFrag improves estimates of model performance and can actually increase model performance by providing improved splits for model training. Altogether, we demonstrate how to account for homology-based leakage when partitioning genomic sequences for model training and evaluation, and highlight the consequences of failing to do so.
2025-07-24 15:00:00 15:20:00 11BC RegSys Predicting gene expression using millions of yeast promoters reveals cis-regulatory logic Susanne Bornelöv Tirtharaj Dash, Susanne Bornelöv Gene expression is largely controlled by transcription factors and their binding and interactions in gene promoter regions. Early attempts to use deep learning to learn about this gene-regulatory logic were limited to training sets containing naturally occurring promoter sequences. However, using massive parallel reporter assays, potential training data can now be expanded by orders of magnitude, going beyond naturally occurring sequences. Nevertheless, a clear understanding of how to best use deep learning to study gene regulation is still lacking. Here we investigate the complex association between promoters and gene expression in S. cerevisiae using Camformer, a residual convolutional neural network that ranked 4th in the Random Promoter DREAM Challenge 2022. We explore the original Camformer model trained on 6.7 million random promoter sequences and investigate 270 alternative models to determine what factors contribute most to model performance. We show that Camformer accurately decodes the association between promoters and gene expression (r2 = 0.914 ± 0.003, ρ = 0.962 ± 0.002) and provides a substantial improvement over previous state of the art. Using explainable AI techniques, such as in silico mutagenesis, we demonstrate that the model learns both individual motifs and their hierarchy. For example, while an IME1 motif on its own increases gene expression, the co-occurrence of IME1 and UME6 motifs strongly reduces gene expression, beyond the repressive effect of UME6 on its own. Thus, we demonstrate that Camformer can be used to provide detailed insights into cis-regulatory logic.
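In silico mutagenesis, the interpretation technique named in the Camformer abstract, is straightforward to sketch: score every single-base substitution of a promoter with the trained model and inspect the resulting effect matrix. Here `model_predict` is a trivial placeholder for a trained Camformer-style predictor, and the motif it rewards is invented for illustration.

```python
import numpy as np

BASES = "ACGT"

def model_predict(seq: str) -> float:
    # Stand-in for a trained expression model; rewards a hypothetical "TGCAT" motif.
    return float("TGCAT" in seq)

def ism(seq: str) -> np.ndarray:
    """effects[i, b] = prediction change from substituting base b at position i."""
    baseline = model_predict(seq)
    effects = np.zeros((len(seq), 4))
    for i in range(len(seq)):
        for b, base in enumerate(BASES):
            if base != seq[i]:
                mutant = seq[:i] + base + seq[i + 1:]
                effects[i, b] = model_predict(mutant) - baseline
    return effects

effects = ism("AATGCATAA")
# Positions whose mutation most reduces the prediction mark functional motifs.
print("largest disruption at (position, base):",
      np.unravel_index(effects.argmin(), effects.shape))
```

Run genome-wide with a real model, matrices like this reveal both individual motifs and interactions such as the IME1/UME6 co-occurrence effect described above.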
2025-07-24 15:20:00 16:00:00 11BC RegSys What can the diversity of life on Earth teach us about disease? Mafalda Dias Mafalda Dias Biological sequences across the tree of life reflect the cumulative effects of millions of years of evolution. Modelling variation in these sequences offers a powerful window into the sequence constraints that shape protein function and genome regulation — and holds great promise for uncovering the genetic basis of human disease. In this talk, I will explore how recent advances in deep learning are enabling us to decode these evolutionary signatures at scale. I will highlight how such models are already improving the diagnostic yield of patient sequencing by providing evidence for hundreds of new disorders, and offer new avenues to assess disease risk before symptoms arise.
2025-07-24 08:40:00 09:00:00 03B Stewardship Critical Infrastructure Introduction Mihai Pop
2025-07-24 09:00:00 09:20:00 03B Stewardship Critical Infrastructure Beyond Citations: Measuring the Economic and Scientific Impact of UniProt in the Biodata Ecosystem Alex Bateman Alex Bateman This talk presents a comprehensive cost-benefit analysis of UniProt, the universal protein resource that serves as a crucial catalogue for protein data in the scientific community. The analysis was carried out by CSIL as part of the Pathos project funded by the European Union's Horizon Europe framework programme. Drawing from extensive quantitative and qualitative research, we examine UniProt's economic impact across multiple dimensions including transaction cost savings, access cost savings, and labor cost savings for its diverse global user base. The analysis establishes a counterfactual scenario to evaluate what the research landscape would look like without UniProt, revealing significant efficiency gains and economic benefits that substantially outweigh operational costs. Beyond direct economic impacts, we explore UniProt's broader influence through citation and patent analysis, demonstrating its critical role in enabling scientific advancements across multiple fields and supporting sustainable development goals. The assessment methodology combines survey data, bibliometric analysis, and stakeholder interviews to provide a holistic view of how this essential resource facilitates knowledge dissemination and scientific innovation. Our findings offer valuable insights for research infrastructure evaluation and underscore UniProt's position as a foundational element of the global bioinformatics ecosystem.
2025-07-24 09:10:00 10:00:00 03B Stewardship Critical Infrastructure TBD Deepak Nair, Deepak Nair
2025-07-24 09:20:00 09:40:00 03B Stewardship Critical Infrastructure Challenges in biological data/infrastructure stewardship from an Asia-Pacific perspective Shoba Ranganathan Shoba Ranganathan The Asia-Pacific (APAC) region covers countries and territories in Australasia, East Asia, and Southeast Asia; in a wider context, Central Asia, North Asia, the Pacific Islands, South Asia, and West Asia (including the Arabian Peninsula and the Levant) are often included. The region provides striking contrasts between high-income countries and low- and middle-income countries. Bioinformatics data and infrastructure stewardship in this region will need to address challenges in bridging the existing gaps. APAC data challenges include data privacy, security, and the immense quantity of data generated by modern sequencing technologies. Linking quality data generation to bioinformatics analysis tools in clinical settings, comprehensive analysis of large datasets, and data security and confidentiality are primary hurdles to be crossed. Bioinformatics infrastructure stewardship covers challenges related to data accessibility, interoperability, and sustainability. Better data management practices, infrastructure investments, and global collaboration to make bioinformatics resources more readily available and usable for research and development are critical. However, the high cost of specialized tools and technologies, limited computing resources, and network woes limit progress in e-science. I will present the current state of APAC data and infrastructure and how to address these challenges.
2025-07-24 11:20:00 11:40:00 03B Stewardship Critical Infrastructure A Proposal on top of FAIR: Quality of Knowledge Representation (QKR) Julio Collado Vides Julio Collado Vides The FAIR (Findable, Accessible, Interoperable, and Reusable) principles define the current standard for data representation. However, there is still room to go beyond them and improve the representation of knowledge itself through explicit criteria for Quality of Knowledge Representation (QKR). The proposal is based on universal properties of the representation of knowledge, especially scientific knowledge. Any piece of knowledge, explicitly or implicitly, has one or more pieces of evidence and their corresponding sources, a confidence level, and one or multiple contexts. These four criteria define QKR version 1.0. Additionally, any knowledge can be described at different levels of detail, which offers a way to organize it. I will focus on how confidence level is defined and used in RegulonDB, the biodata resource of transcriptional regulation in E. coli, and will show what could be achieved with different levels of detail. The vision is for QKR to become a natural extension of the FAIR principles. This is all the more relevant now given the impact of the quality of knowledge on the output of AI systems. The societal impact is evident, since representation and sharing are inseparable: the quality of representation determines the quality of communication of knowledge in broader contexts as well.
2025-07-24 11:40:00 12:00:00 03B Stewardship Critical Infrastructure Perspectives on biological knowledgebase management and the advent of AI Maria Martin
2025-07-24 12:00:00 12:20:00 03B Stewardship Critical Infrastructure TBD Peter McCallum
2025-07-24 12:20:00 12:40:00 03B Stewardship Critical Infrastructure Technical Discussion
2025-07-24 14:00:00 14:20:00 03B Stewardship Critical Infrastructure The missing link in FAIR data policy: data resources Christophe Dessimoz Christophe Dessimoz Over the past decade, the FAIR principles, which provide guidance on making data Findable, Accessible, Interoperable, and Reusable, have transformed the way research is funded and evaluated. In a paradigm shift, funders now routinely require data management plans, which involve researcher training in FAIR practices and data deposition. The FAIR movement has also led to pronounced behavioural changes among researchers, while largely overlooking the essential role of infrastructure: the biodata resources (deposition databases and knowledgebases) that turn scattered datasets into readily available, coherent knowledge. Without infrastructure, FAIR data policy risks becoming a compliance exercise in which data might be shared but remain fragmented, inconsistently annotated, or practically inaccessible. Achieving FAIR at a global scale and reaping its benefits for discovery, artificial intelligence (AI), and innovation depends on infrastructure designed to capture, curate, and connect research data systematically. In the life sciences, such infrastructure is referred to as “biodata resources”. In this talk, I will argue that investing in biodata resources provides some of the most effective and cost-efficient means of achieving the goals of the FAIR principles. I will call on funders and institutions to provide stable, competitive support for these vital resources, at a level of at least 1% of research budgets, to secure the foundations of FAIR data, accelerate AI-driven discovery, and maximise the impact of public investment in science.
2025-07-24 14:20:00 14:40:00 03B Stewardship Critical Infrastructure Sustaining global biodata: from resources to sustained infrastructure Guy Cochrane Guy Cochrane Just as scientific data are essential for life science and biomedical research, the databases, services and tools that enable scientists to safeguard, share and access data provide a critical foundation for scientific advance. However, unlike many other scientific infrastructures, biodata resources exist as a globally distributed open ecosystem of independently managed activities that lack a coordinated, holistic approach to sustained operation and development. Individual resources are often at risk, and many survive on short-term research grant funding, hampering long-range strategic planning. The Global Biodata Coalition, an initiative that brings together funding organisations working towards greater sustainability in the biodata infrastructure, has, through stakeholder consultation, developed a set of nine principles to guide the development of models for greater biodata resource sustainability, and is exploring models through which funders can cooperate to achieve this. In the talk, I will present the principles and outline a number of models under exploration.
2025-07-24 14:40:00 15:00:00 03B Stewardship Critical Infrastructure TBD Susan Gregurick
2025-07-24 15:00:00 16:00:00 03B Stewardship Critical Infrastructure Open Discussion
2025-07-20 08:30:00 08:45:00 02N Student Council Symposium Student Council Symposium
2025-07-20 08:45:00 09:30:00 02N Student Council Symposium Segun Fatumo
2025-07-20 09:30:00 09:45:00 02N Student Council Symposium Nutri-omics: how omics investigation can help design personalized nutrition research Mirko Treccani Federica Bergamo, Pedro Mena, Davide Martorana, Daniele Del Rio, Giovanni Malerba, Valeria Barili, Riccardo Bonadonna, Alessandra Dei Cas, Marco Ventura, Francesca Turroni, Letizia Bresciani, Mirko Treccani, Cristiano Negro, Alice Rosi, Cristina Del Burgo-Gutiérrez, Maria Sole Morandini, Nicola Luigi Bragazzi, Claudia Favari, José Fernando Rinaldi de Alvarenga, Lucia Ghiretti, Cristiana Mignogna (Poly)phenols (PPs) are a group of bioactive compounds found in plant-based foods and widely consumed in the diet. Several studies have reported the beneficial effects of PPs in preventing chronic diseases through a myriad of mechanisms of action. However, the bioavailability and effects of these compounds differ greatly across individuals, causing uneven physiological responses. To understand this inter-individual variability, we present a multi-omics investigation comprising genomics, metagenomics and metabolomics. We recruited 300 healthy individuals and collected biological samples (blood, urine, and faeces), anthropometric measurements, health status and lifestyle/dietary information. After identification by UPLC-IMS-HRMS and quantification by UPLC-QqQ-MS/MS, the large set of phenolic metabolites underwent dimensionality reduction and clustering to group individuals with similar metabolic profiles (metabotypes), distinguishing high and low PP producers. Then, genomic and metagenomic investigations were performed to gain insights into inter-individual differences and unravel the potential pathophysiological impact of these molecules, with particular regard to cardiometabolic diseases. In detail, genome-wide association studies followed by computational functional analyses of genetic variants, together with taxonomic and functional investigations of the gut microbiome, revealed suggestive associations in genes and microbial species related to PP metabolism, along with previously unreported genetic associations. Genomic data were further investigated in terms of gene networks and computational functional analyses, identifying differentially expressed genes, gene set enrichments, candidate regulatory regions, interacting loci and chromatin states, and associations with metabolic traits and diseases. Overall, we demonstrated the benefits of omics research in nutrition, advancing the field of personalised nutrition and health.
2025-07-20 09:45:00 10:00:00 02N Student Council Symposium Nocardia Genomes are a Large Reservoir of Diverse Gene Content, Biosynthetic Gene Clusters, and Species-specific Genes Kiran Kumar Eripogu Kiran Kumar Eripogu, Wen-Hsiung Li Nocardia, an opportunistic pathogenic bacterial genus, remains underexplored in terms of biosynthetic potential, gene content, and evolutionary history. By analyzing 263 genomes across 88 species, we found that Nocardia varies greatly in genome size and gene content. It exhibits an open pangenome, with a small core genome (< 900 genes), and high genomic fluidity (0.76), indicating high gene turnover. A large proportion (75%) of its genes are species-specific, indicating its high genomic plasticity and dynamic evolutionary adaptation. Average Nucleotide Identity (ANI) analysis confirmed taxonomic relationships among Nocardia species, with most exhibiting between-species ANI values of 80-85%. N. globerula showed a high ANI of ~84% with Rhodococcus erythropolis, strongly supporting its reclassification under Rhodococcus. The biosynthetic capabilities of the Nocardia genus are striking, with >8,000 biosynthetic gene clusters (BGCs), dominated by type 1 polyketide synthases, terpenes, and non-ribosomal peptide synthetases. This establishes Nocardia as the Actinomycetota genus with the largest biosynthetic repertoire. Our study is the first to identify a prodigiosin BGC in Nocardia. Network analysis revealed complex evolutionary connections between Nocardia’s gene cluster families (GCFs) and MIBiG reference BGCs, suggesting evolutionary changes, including gene gains and losses, that may have influenced the genus’s BGC diversity and composition. Synteny analysis uncovered conserved and unique gene arrangements across Nocardia and related genera, mostly with core genes conserved in Actinomycetota. The findings from our study contribute to advancing microbial genomics, evolution, and biotechnology by uncovering the potential of Nocardia to address challenges in infectious diseases and natural product discovery.
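The genomic fluidity figure quoted in the abstract above has a simple pairwise definition (Kislyuk et al., 2011): the average, over all genome pairs, of the number of genes unique to either genome divided by the pair's total gene count. A minimal sketch with toy gene sets, not the authors' Nocardia data:

```python
# Genomic fluidity: mean over genome pairs of unique genes / total genes.
# The three gene sets below are invented toy genomes for illustration.
from itertools import combinations

def genomic_fluidity(gene_sets):
    pairs = list(combinations(gene_sets, 2))
    total = 0.0
    for a, b in pairs:
        unique = len(a - b) + len(b - a)     # genes found in only one genome
        total += unique / (len(a) + len(b))
    return total / len(pairs)

genomes = [
    {"gA", "gB", "gC", "gD"},
    {"gA", "gB", "gE", "gF"},
    {"gA", "gC", "gG", "gH"},
]
print(round(genomic_fluidity(genomes), 2))   # ≈ 0.58 for these toy genomes
```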
2025-07-20 10:00:00 10:05:00 02N Student Council Symposium Lifting the veil on Challenging Medically Relevant Genes Victor Grentzinger Victor Grentzinger, Leonor Palmeira, Keith Durkin, Maria Artesi, Vincent Bours While DNA sequencing has never been cheaper, a number of genetic diseases remain difficult to diagnose. Nearly 400 medically relevant genes are still challenging to characterize due to the complex nature of their sequence. This complexity can arise from a variety of factors, such as the existence of pseudogenes, large short tandem repeat (STR) regions, or variable number tandem repeat (VNTR) regions. As a result, access to reliable and cost-effective genetic tests is limited. To resolve this issue, we decided to focus on improving the characterization of the following genes using long-read sequencing: PKD1/PKD2, responsible for Autosomal Dominant Polycystic Kidney Disease (ADPKD), and FLG, involved in Atopic Dermatitis. For the PKD genes, we amplified their sequences by long-range PCR before sequencing the products by Oxford Nanopore sequencing. We were able to retrieve all variants previously confirmed by Sanger sequencing in 34 samples with ADPKD. For FLG, while investigating the 23 publicly available PacBio HiFi datasets of the 1000 Genomes Project, we identified new, previously undescribed alleles in African samples. To determine if these variations are population-specific, we analyzed 1,111 additional public samples with long-read data. We discovered 5 novel alleles, mostly from Sub-Saharan populations. We also investigated, in our cohort of public data, the MUC1 and SMN1/SMN2 genes, responsible respectively for Autosomal Dominant Tubulointerstitial Kidney Disease and Spinal Muscular Atrophy. Our next goal is to design cost-efficient techniques to improve the sequencing of these challenging medically relevant genes in a clinical setting.
2025-07-20 10:05:00 10:10:00 02N Student Council Symposium AccuRate: A Tool Supporting Genotype–Phenotype Analysis and Causal Mutation Discovery in Soybean Alžbeta Rástocká Alžbeta Rástocká, Jana Biová, Mária Škrabišová Soybean is one of the world’s most significant crops, serving as an indispensable source of high-quality plant protein and oil for both human and livestock consumption. Advances in soybean research support genomics-assisted breeding, guiding the development of more resilient, nutritious, and high-yielding varieties. Soybean also possesses an extensive collection of genomic and phenotypic data, including a large database of phenotypic traits. This enables the creation of new strategies for analysing genotype-phenotype associations. While association studies are important for identifying genomic loci linked to phenotypic traits, pinpointing causal mutations remains a challenge due to many factors. Building on these resources, this study presents new algorithms for analysing, visualizing, and automatically categorizing quantitative and categorical phenotypes. Given that most functional mutations are biallelic, and that quantitative traits often arise from the combined effects of multiple genes, phenotype binarization provides a practical basis for further analysis. Since many traits exist on a spectrum, various categorization methods are applied to transform them into binary form. This step is essential for calculating an accuracy parameter that quantifies genotype-phenotype correlation and facilitates the identification of causal mutations. The algorithm AccuRate was tested on well-characterized genes influencing protein and oil content in soybean. Results confirmed its ability to identify genotype-phenotype correlations. Additionally, two candidate genes were analysed, and a causal mutation was confirmed in one of them (Glyma.06G205800), linked to flowering and maturation time. AccuRate is a promising tool for uncovering genotype-phenotype relationships in soybean and, after optimizing for high-throughput testing, may be extended to other crops.
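The binarization-plus-accuracy idea in the AccuRate abstract above can be sketched in a few lines. This is a hedged toy illustration, assuming a median cut-off and a majority-vote accuracy score; the function names and data are invented, not AccuRate's actual implementation:

```python
# Toy sketch: binarize a quantitative trait, then score how well a biallelic
# SNP's genotypes (0 = ref, 1 = alt) separate the two phenotype classes.
import numpy as np

def binarize_by_threshold(values, threshold=None):
    """Split a quantitative phenotype into two classes; default cut-off is the median."""
    values = np.asarray(values, dtype=float)
    if threshold is None:
        threshold = np.median(values)
    return (values > threshold).astype(int)

def genotype_phenotype_accuracy(genotypes, binary_phenotype):
    """Fraction of accessions matching the majority phenotype of their allele group."""
    genotypes = np.asarray(genotypes)
    matches = 0
    for allele in (0, 1):
        mask = genotypes == allele
        if mask.sum() == 0:
            continue
        majority = round(binary_phenotype[mask].mean())
        matches += (binary_phenotype[mask] == majority).sum()
    return matches / len(genotypes)

# toy example: oil content (%) in 8 accessions and a candidate SNP
oil = [18.1, 17.9, 21.5, 21.0, 18.3, 21.8, 17.5, 21.2]
snp = [0, 0, 1, 1, 0, 1, 0, 1]
pheno = binarize_by_threshold(oil)
print(genotype_phenotype_accuracy(snp, pheno))   # 1.0 for a perfectly predictive SNP
```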
2025-07-20 10:10:00 10:15:00 02N Student Council Symposium Early colorectal cancer detection with deep learning on ultra-shallow whole genome sequencing of cell-free DNA Ritchie Yu Ritchie Yu, Jasmin Coulombe-Huntington, Yu Xia Early detection of cancer can mitigate adverse patient outcomes by reducing the time to intervention and treatment. Cell-free DNA (cfDNA) circulating in the bloodstream contains signatures of cancer that can be obtained and sequenced through liquid biopsy. Given a large collection of sequencing reads, features can be extracted and used to develop predictive models for patient cancer classification. However, current techniques for early cancer detection rely on tens of millions of sequencing reads, which can increase the cost of diagnosis. In our work, using whole genome sequencing data obtained from the Sequence Read Archive (SRA), we adapted convolutional neural networks to predict colorectal cancer. We found that the number of reads used by the model can be scaled down from approximately 60 million reads to 1 million reads. Our model achieved a classification performance of 0.902 AUC. This result suggests that the blood sample size required for liquid biopsy could be significantly reduced, thereby reducing the cost of diagnosis. Furthermore, through an ablation study, we showed that the fragment end distribution by itself produced a classification performance of 0.904 AUC. Meanwhile, relying only on the fragment length distribution or the end motif distribution produced 0.771 and 0.790 AUC, respectively. This suggests that the fragment end distribution is a much more predictive feature for classification. In future work, we intend to incorporate fragment end features into transformer-based models to improve classification performance.
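For readers unfamiliar with the setup, a minimal 1-D CNN over a binned fragment-end profile looks roughly like the sketch below; the architecture, channel counts, and 1,024-bin input are illustrative assumptions, not the authors' model:

```python
# Minimal 1-D CNN sketch for classifying a sample from a genome-wide
# fragment-end coverage profile binned into 1024 bins (assumed input shape).
import torch
import torch.nn as nn

class FragmentEndCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(32, 1)   # logit for cancer vs. healthy

    def forward(self, x):                     # x: (batch, 1, n_bins)
        z = self.features(x).squeeze(-1)
        return self.classifier(z)

model = FragmentEndCNN()
profile = torch.rand(8, 1, 1024)              # 8 samples of binned end counts
print(model(profile).shape)                   # torch.Size([8, 1])
```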
2025-07-20 10:15:00 10:30:00 02N Student Council Symposium DNA-DistilBERT: A small language model for non-coding variant effect prediction from human DNA sequences Megha Hegde Megha Hegde, Jean-Christophe Nebel, Farzana Rahman Genetic variants have been associated with changes in disease risk. Historically, research has focused on coding variants; however, emerging research shows that non-coding variants also have strong links to disease causality, via transcription and gene regulation. Next-generation sequencing has exponentially increased genomic data availability, necessitating scalable computational approaches for accurate variant effect prediction. Transformer-based LLMs, such as BERT (Bidirectional Encoder Representations from Transformers), have achieved good results on coding variants; however, results on non-coding variants remain inconsistent. Moreover, the quadratic computational complexity of attention mechanisms with sequence length imposes substantial resource demands, restricting innovation in this area to a few institutions with high-end infrastructure. Arguably, BERT is the most successful of such architectures, as it excels in context-aware modelling of genomic sequences due to its bidirectional nature. However, to substantially decrease computational costs, it is proposed to exploit DistilBERT, which uses knowledge distillation during pretraining to reduce the number of model parameters. While small language models (SLMs) such as DistilBERT are established in natural language processing, they remain underexplored in genomics. Experiments show that, when pretrained on human reference genome sequences and fine-tuned for variant effect prediction, the SLM approach can match state-of-the-art LLMs such as DNABERT-2 in accuracy, while significantly reducing resource requirements. This innovative, energy-efficient approach not only makes variant effect prediction more scalable but also advances equitable research by enabling training on a single GPU, eliminating the need for high-performance computing.
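The small-model argument can be made concrete with the Hugging Face transformers API; the 6-mer vocabulary size and window length below are assumptions, and this is a from-scratch configuration rather than the authors' pretrained model:

```python
# Sketch: a DistilBERT-style encoder with a small DNA vocabulary (assumed
# 6-mer tokens of A/C/G/T plus special tokens) and a binary classification
# head for variant effect prediction.
from transformers import DistilBertConfig, DistilBertForSequenceClassification
import torch

config = DistilBertConfig(
    vocab_size=4**6 + 5,            # all 6-mers plus special tokens (assumed tokenizer)
    max_position_embeddings=512,
    n_layers=6, n_heads=12, dim=768,
    num_labels=2,                   # benign vs. pathogenic
)
model = DistilBertForSequenceClassification(config)

# dummy token ids standing in for tokenized reference/alternate sequence windows
input_ids = torch.randint(5, config.vocab_size, (4, 512))
out = model(input_ids=input_ids, labels=torch.tensor([0, 1, 0, 1]))
print(out.loss.item(), out.logits.shape)   # scalar loss, (4, 2)
```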
2025-07-20 10:30:00 10:45:00 02N Student Council Symposium Generative AI for Childhood and Adult Cancer Research Guillermo Prol Castelo Guillermo Prol Castelo, Davide Cirillo, Alfonso Valencia Cancer is one of the most common causes of death worldwide, and its complexity makes it especially challenging to study. Despite ongoing progress in cancer research, a significant challenge is the scarcity of detailed data on disease subgroups and stages. To overcome this problem, Generative AI techniques and, specifically, the Variational Autoencoder (VAE), have been widely used to handle high-dimensional data. We propose a robust, explainable Synthetic Data Generation (SDG) pipeline based on the VAE using cancer transcriptomics data. Here, two main scenarios are presented, where we use our SDG pipeline to study different cancer types, addressing data scarcity limitations effectively. First, we present the case of Medulloblastoma, a rare, childhood brain tumor traditionally classified into four molecular subgroups, where we provide evidence supporting the existence of an additional subgroup with distinct molecular features. Additionally, we apply explainability techniques to the VAE, uncovering key relationships between gene expression and disease subgroups. Second, we tackle cancer's dynamic nature to link the most similar patients and leverage our SDG pipeline to direct the process of data generation along a trajectory between patients at different stages of the disease. Our pipeline generates stage-separable patients, revealing actionable molecular insights at intermediate reconstructed steps. These studies demonstrate the potential of synthetic data generation in highly specific contexts, shed light on the temporal aspects of cancer, and advance our understanding of the underlying biological mechanisms.
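A minimal VAE for expression-like data, in the spirit of the synthetic data generation pipeline described above; the layer sizes and 2,000-gene input are illustrative assumptions, not the authors' architecture:

```python
# Minimal VAE sketch for transcriptomics-like data: encode to a latent
# Gaussian, sample with the reparameterization trick, decode, and train
# with reconstruction + KL loss. Interpolating between two patients'
# latent codes gives the trajectory-style generation the abstract describes.
import torch
import torch.nn as nn

class ExpressionVAE(nn.Module):
    def __init__(self, n_genes=2000, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_genes, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent)
        self.logvar = nn.Linear(256, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(),
                                 nn.Linear(256, n_genes))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    mse = ((recon - x) ** 2).sum(dim=1).mean()
    kld = (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)).mean()
    return mse + kld

model = ExpressionVAE()
x = torch.rand(16, 2000)             # 16 patients, normalized expression
recon, mu, logvar = model(x)
print(vae_loss(recon, x, mu, logvar).item())
```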
2025-07-20 11:00:00 11:05:00 02N Student Council Symposium AutoPeptideML 2: An open source library for democratizing machine learning for peptide bioactivity prediction Raúl Fernández-Díaz Raúl Fernández-Díaz, Thanh Lam Hoang, Vanessa Lopez, Denis Shields Peptides are a rapidly growing drug modality with diverse bioactivities and accessible synthesis, particularly for canonical peptides composed of the 20 standard amino acids. However, enhancing their pharmacological properties often requires chemical modifications, increasing synthesis cost and complexity. Consequently, most existing data and predictive models focus on canonical peptides. To accelerate the development of peptide drugs, there is a need for models that generalize from canonical to non-canonical peptides. We present AutoPeptideML, an open-source, user-friendly machine learning platform designed to bridge this gap. It empowers experimental scientists to build custom predictive models without specialized computational knowledge, enabling active learning workflows that optimize experimental design and reduce sample requirements. AutoPeptideML introduces key innovations: (1) preprocessing pipelines for harmonizing diverse peptide formats (e.g., sequences, SMILES); (2) automated sampling of negative peptides with matched physicochemical properties; (3) robust test set selection with multiple similarity functions (via the Hestia-GOOD framework); (4) flexible model building with multiple representation and algorithm choices; (5) thorough model evaluation on unseen data at multiple similarity levels; and (6) FAIR-compliant, interpretable outputs to support reuse and sharing. A web server with a GUI enhances accessibility and interoperability. We validated AutoPeptideML on 18 peptide bioactivity datasets and found that automated negative sampling and rigorous evaluation reduce overestimation of model performance, promoting user trust. A follow-up investigation also highlighted the current limitations in extrapolating from canonical to non-canonical peptides using existing representation methods. AutoPeptideML is a powerful platform for democratizing machine learning in peptide research, facilitating integration with experimental workflows across academia and industry.
2025-07-20 11:05:00 11:10:00 02N Student Council Symposium ENQUIRE automatically reconstructs, expands, and drives enrichment analysis of gene and MeSH co-occurrence networks from context-specific biomedical literature Luca Musella Luca Musella, Alejandro Afonso Castro, Xin Lai, Max Widmann, Julio Vera The accelerating growth of scientific literature overwhelms our capacity to manually distil complex phenomena like molecular networks linked to diseases. Moreover, biases in biomedical research and database annotation limit our interpretation of facts and generation of hypotheses. ENQUIRE (Expanding Networks by Querying Unexpectedly Inter-Related Entities) offers a time- and resource-efficient alternative to manual literature curation and database mining. ENQUIRE reconstructs and expands co-occurrence networks of genes and biomedical ontologies from user-selected input corpora and network-inferred PubMed queries. Its modest resource usage and its integration of text mining, automatic querying, and network-based statistics that mitigate literature biases make ENQUIRE unique in the breadth of its applications. We benchmarked and illustrated ENQUIRE's capabilities in several case scenarios and published the results earlier this year (Musella L. et al., 2025, PLoS Comput Biol). At ISMB/ECCB 2025, we showcase how ENQUIRE can support biomedical researchers, using melanoma resistance to immunotherapy as an example case study. The frameworks enabled by ENQUIRE include gene set reconstruction, pathway enrichment analysis, and knowledge graph annotation, which can ease literature annotation, boost hypothesis formulation, and facilitate the identification of molecular targets for subsequent experimentation.
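The co-occurrence step at ENQUIRE's core can be sketched with a toy corpus; the entities and threshold below are invented, and ENQUIRE adds automatic PubMed querying and bias-aware network statistics on top of this basic counting:

```python
# Toy sketch: count gene/MeSH pairs co-occurring in the same abstract and
# keep pairs seen at least twice as weighted network edges.
from itertools import combinations
from collections import Counter
import networkx as nx

abstracts = [
    {"BRAF", "MAPK1", "Melanoma"},
    {"BRAF", "Melanoma", "Drug Resistance"},
    {"MAPK1", "Drug Resistance"},
    {"BRAF", "MAPK1", "Melanoma"},
]

pair_counts = Counter()
for entities in abstracts:
    pair_counts.update(combinations(sorted(entities), 2))

G = nx.Graph()
G.add_edges_from((a, b, {"weight": n})
                 for (a, b), n in pair_counts.items() if n >= 2)
print(list(G.edges(data=True)))
```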
2025-07-20 11:10:00 11:15:00 02N Student Council Symposium Automating Linear Motif Predictions to Map Human Signaling Networks Yitao (Eric) Sun Yitao (Eric) Sun, Yu Xia, Jasmin Coulombe-Huntington Short linear motifs (SLiMs) are critical mediators of transient protein-protein interactions (PPIs), yet only 0.2% of human SLiMs are experimentally verified. Their short length (3–11 residues), rapid evolution, and frequent location in intrinsically disordered regions make them difficult to systematically uncover using conventional approaches. We present an automated computational framework for proteome-wide SLiM discovery that integrates structural, evolutionary, and machine learning attributes to overcome limitations in current resources (e.g., MEME Suite, ELM). Our method combines Gibbs sampling for de novo motif discovery with hidden Markov models (HMMs) that explicitly model insertions and deletions, enabling a more realistic representation of motif variation. To improve specificity, we incorporate four discriminative features: ProtT5-derived motif propensity scores, AlphaFold-based intrinsic disorder (pLDDT), solvent accessibility, and cross-species conservation from multiple sequence alignments. Together, these features enable robust motif characterization even in noisy biological contexts. Biological relevance is ensured by searching the interactors of the SLiM-binding domain protein through BioGRID PPIs and motif clustering via HMM similarity (HH-suite). Our framework validated the MAPK1 (ERK2)-mediated phosphorylation motif in RUNX1, which exhibits high feature scores and is supported by independent phosphoproteomic data. This site, previously biochemically characterized but not recognized as an SLiM, shows the power of our approach in identifying functional motifs missed by traditional tools. Our database allows biologists to browse through validated motifs alongside high-quality predictions. This work lays the foundation for systematic reconstruction of motif-mediated signaling networks and advances the discovery of novel regulatory mechanisms and therapeutic targets.
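A compact site-sampler version of the Gibbs motif discovery step named above, without the indel-aware HMM extension the authors describe; the sequences and motif width are toy values:

```python
# Gibbs site sampler sketch: hold one sequence out, build a pseudocount
# profile from the others' current sites, and resample the held-out site
# proportionally to its profile score.
import random
from collections import Counter

def profile_score(kmer, others, w):
    score = 1.0
    for j in range(w):
        counts = Counter(s[j] for s in others)
        score *= (counts[kmer[j]] + 1) / (len(others) + 4)   # DNA alphabet of 4
    return score

def gibbs_motif(seqs, w=6, iters=2000, seed=0):
    random.seed(seed)
    starts = [random.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iters):
        i = random.randrange(len(seqs))                       # hold one sequence out
        others = [s[p:p+w] for k, (s, p) in enumerate(zip(seqs, starts)) if k != i]
        positions = list(range(len(seqs[i]) - w + 1))
        weights = [profile_score(seqs[i][p:p+w], others, w) for p in positions]
        starts[i] = random.choices(positions, weights=weights)[0]
    return [s[p:p+w] for s, p in zip(seqs, starts)]

seqs = ["TTACCGGATGTT", "CCCGGATGAAAA", "GGGTCGGATGCA"]
print(gibbs_motif(seqs))   # should converge near the shared CGGATG site
```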
2025-07-20 11:15:00 11:20:00 02N Student Council Symposium TCRBench: A Unified Benchmark for TCR–Antigen Binding Prediction and Clustering Muhammed Hunaid Topiwala Muhammed Hunaid Topiwala, Pengfei Zhang, Heewook Lee T-cell receptor (TCR) recognition of antigenic peptides presented by major histocompatibility complex (MHC) molecules is central to adaptive immunity, driving pathogen-specific responses and informing therapeutic vaccine development. Computational tasks such as predicting TCR-antigen binding affinity (NetTCR, Montemurro et al., 2021; ImRex, Moris et al., 2021) and clustering TCR sequences by epitope specificity (GLIPH, Glanville et al., 2017; TCRdist, Dash et al., 2017) have emerged as key challenges to decoding immune specificity. While recent models leveraging convolutional neural networks, transformers (e.g., ATM-TCR, Xu et al., 2021), and multimodal embeddings (ERGO, Springer et al., 2020; TCRMatch, Chronister et al., 2021) have significantly advanced performance, fragmented datasets and inconsistent evaluation methods have limited direct model comparisons and generalization. We propose a unified benchmark dataset integrating rigorously curated TCR sequences from human, mouse, and macaque responses to major pathogens (Influenza A, CMV, EBV, SARS-CoV-2) sourced from comprehensive databases such as VDJdb (Shugay et al., 2018) and IEDB (Vita et al., 2019). The benchmark incorporates standardized evaluation splits, structural representations enabled by AlphaFold2 predictions (Jumper et al., 2021), and robust evaluation metrics to ensure fair, reproducible comparisons. By consolidating disparate data and evaluation practices, our benchmark provides clarity on current progress, facilitating future innovation in computational TCR-antigen interaction modeling.
2025-07-20 11:20:00 11:35:00 02N Student Council Symposium Fold first, ask later: structure-informed function prediction in Pseudomonas phages Hannelore Longin Hannelore Longin, George Bouras, Susanna Grigson, Robert Edwards, Hanne Hendrix, Rob Lavigne, Vera van Noort Phages, the viruses of bacteria, are the most abundant biological entities on earth. In general, phage genomes are densely coding and contain many open reading frames, yet up to 70% encode proteins of unknown function. Despite clinical, biotechnological and fundamental interest in unravelling these proteins’ functions, phage proteins are absent from recent large-scale structure-based efforts (such as the AlphaFold database). Here, we investigate the efficacy of structure-based protein annotation for Pseudomonas-infecting phages, comparing different post-processing strategies to obtain function annotations from FoldSeek output. Briefly, we collected every protein of at least 100 amino acids annotated as ‘hypothetical/phage protein’ in NCBI from 887 Pseudomonas-infecting phages. These 38,025 proteins (31% of all proteins) were then clustered into 10,453 groups of homologs. Protein structures were predicted with ColabFold and structural similarity to the PDB and AlphaFold database was assessed with FoldSeek. Of all proteins, 59% displayed significant similarity to at least one structure in these databases. We benchmarked various strategies for extracting function from these FoldSeek hits, integrating different information resources, hit selection methods, and structure-based clustering of the hits. The resulting annotations were then compared with the state-of-the-art sequence- and structure-based phage annotation tools Pharokka and Phold. On average, up to 42% of the phage proteins of unknown function could be annotated using structure-based methods, depending on the post-processing strategies applied. While caution is warranted when transferring protein annotations based on similarity, these methods can significantly speed up research into new antimicrobials and biotechnological applications inspired by nature’s finest bioengineers: phages.
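The FoldSeek search-and-filter step might look like the sketch below; `easy-search` is a real FoldSeek subcommand, but the paths, database name, and E-value cutoff are placeholders (check your local installation for exact options):

```python
# Sketch: run FoldSeek against a reference structure database, then keep the
# best hit per query below an assumed E-value threshold.
import subprocess
import csv

subprocess.run(
    ["foldseek", "easy-search",
     "predicted_structures/",      # ColabFold models of hypothetical proteins
     "pdb_afdb_db",                # pre-built PDB/AFDB FoldSeek database (placeholder)
     "hits.m8", "tmp/",
     "-e", "1e-3"],                # assumed E-value cutoff
    check=True,
)

best_hit = {}
with open("hits.m8") as fh:        # BLAST-tab formatted output; E-value in column 11
    for row in csv.reader(fh, delimiter="\t"):
        query, target, evalue = row[0], row[1], float(row[10])
        if query not in best_hit or evalue < best_hit[query][1]:
            best_hit[query] = (target, evalue)

print(len(best_hit), "queries with at least one structural hit")
```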
2025-07-20 11:35:00 11:50:00 02N Student Council Symposium Exploring capabilities of protein language models for cryptic binding site prediction Vít Škrhák Vít Škrhák, David Hoksza Identifying protein-ligand binding sites is essential for understanding biological mechanisms and supporting drug discovery. However, accurate prediction remains challenging - particularly in the case of cryptic binding sites (CBSs), which require significant conformational changes to form upon ligand binding. Structure-based prediction methods typically rely on a specific conformation (apo vs. holo), making them less effective for identifying CBSs. A promising alternative is the use of sequence-based approaches, enabled by the emergence of protein language models (pLMs). In this work, we explored the capabilities of various pLMs for predicting CBSs. As a baseline, we created a simple model trained using transfer learning. We then experimented with several fine-tuning strategies to further improve performance. Specifically, we applied multitask learning - not only to predict whether a residue is part of a CBS, but also to estimate its flexibility. This additional task enhanced the model’s awareness of protein dynamics, which is critical for accurate CBS identification. Our primary data source is the recently published CryptoBench dataset, which contains annotations of cryptic sites, although additional data sources were also considered. The combination of novel fine-tuning strategies and various training data improved performance across all key metrics, including a gain of over 2% in AUC. To better understand model limitations, we also conducted an analysis of common prediction errors. Finally, we introduced a simple post-processing method designed to refine and smooth the model’s outputs.
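The multitask fine-tuning idea above can be sketched as two heads over shared per-residue embeddings; the 1024-dimensional embeddings (a ProtT5-sized pLM) and the loss weighting are assumptions, not the authors' exact setup:

```python
# Sketch: one shared projection over frozen per-residue pLM embeddings feeds
# a cryptic-binding-site classifier and a flexibility regressor, trained with
# a weighted sum of both losses.
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    def __init__(self, d_emb=1024):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d_emb, 256), nn.ReLU())
        self.cbs = nn.Linear(256, 1)      # per-residue CBS logit
        self.flex = nn.Linear(256, 1)     # per-residue flexibility estimate

    def forward(self, emb):               # emb: (batch, length, d_emb)
        h = self.shared(emb)
        return self.cbs(h).squeeze(-1), self.flex(h).squeeze(-1)

head = MultiTaskHead()
emb = torch.rand(2, 300, 1024)            # embeddings for 2 proteins of length 300
cbs_logits, flex = head(emb)
loss = (nn.functional.binary_cross_entropy_with_logits(
            cbs_logits, torch.randint(0, 2, (2, 300)).float())
        + 0.5 * nn.functional.mse_loss(flex, torch.rand(2, 300)))
print(loss.item())
```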
2025-07-20 11:50:00 11:55:00 02N Student Council Symposium Coarse-grained and Multi-Scale Modeling of Lytic Polysaccharide Monooxygenases: Insights into Family-Specific Dynamics and Protein Frustration Nisha Nandhini Shankar Nisha Nandhini Shankar, Ragothaman M Yennamalli Lytic polysaccharide monooxygenases (LPMOs) are copper-dependent redox enzymes that catalyze the oxidative cleavage of C1 and/or C4 bonds in recalcitrant polysaccharides, playing a vital role in biomass conversion. The CAZy database classifies LPMOs into eight families (AA9, AA10, AA11, AA13, AA14, AA15, AA16, and AA17). These families exhibit diversity in their structure as well as catalytic features. This study focuses on analyzing the structure, dynamics and energetic landscapes of LPMO families using FrustratometeR, SignDy, and multiscale modeling approaches. FrustratometeR quantifies configurational and mutational frustration, identifying energetically unfavorable interactions. AA9 exhibited high local frustration in the residue range 100-230, while AA10 showed a more stable profile. SignDy was employed to explore slow collective motions, revealing significant conformational changes in AA9 linked to enzymatic adaptability, with the first six modes indicating notable flexibility. In contrast, AA10 displayed lower mobility in its first three modes, suggesting greater rigidity and substrate specificity. Protein models from AlphaFold2 were used for proteins with missing residues. These models were prepared and subjected to 100 ns all-atom molecular dynamics simulations using the OPLS-AA/L force field. The increase in RMSD over the course of the simulations indicates conformational changes. RMSF and energy analyses revealed flexible regions consistent with the mode analysis, with average potential energies stabilizing at -6.25×10^5 kJ/mol. The radius of gyration (Rg) remained stable around 1.65-1.75 nm. Analysis of the coarse-grained Gō model simulations, run using SMOG for 200 million steps, will provide further insights into the folding and long-range dynamic behavior of these enzymes.
2025-07-20 11:55:00 12:00:00 02N Student Council Symposium Identification and structural modeling of the novel TTC33-associated core (TANC) complex involved in DNA damage response Małgorzata Drabko Małgorzata Drabko, Rafał Tomecki, Małgorzata Siek, Aneta Jurkiewicz, Miłosz Ludwinek, Kamil Kobyłecki, Dominik Cysewski, Agata Malinowska, Magdalena Bakun, Łukasz S. Borowski, Roman J. Szczęsny, Rafał Płoski, Agnieszka Tudek Of the ~20,200 human proteins, ~9% remain functionally uncharacterized, highlighting a gap in our understanding of cell physiology. Structural proteins without enzymatic activity are particularly difficult to study. Here, we applied a “function by proximity” approach to TTC33, a nuclear structural tetratricopeptide repeat (TPR) protein conserved in bony vertebrates. Using comparative label-free mass spectrometry, we identified the TTC33-associated network (TAN), which includes WDR61, CCDC97, UNG, PP2A-B55α, PHF5A, and the SF3B subcomplex of U2. At the core of TAN is a novel trimeric complex (TANC) formed by TTC33:WDR61:PHF5A, a claim supported by co-purification and size-exclusion chromatography. Structural predictions performed with AlphaFold 3 and their experimental validation showed that WDR61 and PHF5A bind opposite sides of TTC33’s TPR4, while TPR1-3 recruit other TAN factors. To expand the structural model, we employed molecular dynamics to identify the most stable amino acid contact pairs between complex subunits. Although TTC33 forms a complex with WDR61 and PHF5A, both of which are involved in RNA metabolism, our RNA-seq assays revealed only a subtle impact on mRNA levels and splicing patterns. In contrast, TTC33 appears more involved in DNA repair through interaction with UNG1/2. TTC33 loss led to increased DNA double-strand breaks, a phenotype previously associated with UNG1/2 knock-down. We showed that TTC33 protein levels are regulated in vivo, and that changes to TTC33 abundance reduced cellular proliferation rate and resistance to hydrogen peroxide. Moreover, the depletion or loss of either TTC33 or CCDC97 induced redistribution of p53-S15P, a marker of DNA damage.
2025-07-20 12:00:00 12:05:00 02N Student Council Symposium Functional Interfaces at Ordered–Disordered Transitions: Conserved Linear Motifs and Flanking Regions in Modular Proteins Carla Luciana Padilla Franzotti Carla Luciana Padilla Franzotti, Nicolas Palopoli, Gustavo Pierdominici-Sottile, Miguel Andrade Multidomain proteins integrate ordered domains, structured tandem repeats (STRs), and intrinsically disordered regions (IDRs) to generate modular architectures optimized for dynamic and specific protein-protein interactions. In this study, we analyze the role of short linear motifs (SLiMs) located at the interface between ordered and disordered segments, focusing on their contribution to structural connectivity and interaction regulation. Two model systems are examined: (1) the large T antigen from simian virus 40 (LTSV40), in which the LxCxE motif—positioned at the junction between a folded domain and an IDR—mediates binding to the retinoblastoma protein (pRb), and (2) the regulatory complex between protein phosphatase 1 delta (PP1δ) and its MYPT1 subunit, where ankyrin repeats (ANKs) are connected to DOC-type docking motifs through an intervening IDR. In both cases, the regions flanking the SLiMs exhibit high sequence conservation and specific biophysical properties, consistent with a modulatory role. Molecular dynamics simulations demonstrate that these flanking regions promote extended conformations upon complex formation, facilitating physical occlusion of critical interaction interfaces (such as the E2F-binding pocket in pRb) without requiring large-scale allosteric rearrangements. In the PP1-MYPT1 complex, ANK repeats and IDRs exhibit cooperative behavior that contributes to the stabilization of the bound conformation and enhances interaction specificity. These findings support the existence of a conserved ordered–motif–disordered architectural module recurrently employed in both viral and cellular regulatory systems. This topological arrangement constitutes a potential target for therapeutic intervention in diseases involving aberrant protein-protein interactions mediated by SLiMs at ordered–disordered interfaces.
2025-07-20 12:10:00 12:25:00 02N Student Council Symposium Deep Phylogenetic Reconstruction Reveals Key Functional Drivers in the Evolution of B1/B2 Metallo-β-Lactamases Samuel Davis Samuel Davis, Pallav Joshi, Ulban Adhikary, Julian Zaugg, Phil Hugenholtz, Marc Morris, Gerhard Schenk, Mikael Boden Metallo-β-lactamases (MBLs) comprise a diverse family of antibiotic-degrading enzymes. Despite their growing implication in drug-resistant pathogens, no broadly effective clinical inhibitors against MBLs currently exist. Notably, β-lactam-degrading MBLs appear to have emerged twice from within the broader, catalytically diverse MBL-fold protein superfamily, giving rise to two distinct monophyletic groups: B1/B2 and B3 MBLs. Comparative analyses have highlighted distinct structural hallmarks of these subgroups, particularly in metal-coordinating residues. However, the precise evolutionary events underlying their emergence remain unclear due to challenges presented by extensive sequence divergence. Understanding the molecular determinants driving the evolution of β-lactamase activity may inform design of broadly effective inhibitors. We sought to infer the evolutionary features driving the emergence of B1/B2 MBLs via phylogenetics and ancestral reconstruction. To overcome challenges associated with evolutionary analysis at this scale, we developed a phylogenetically aware sequence curation framework centred on iterative profile HMM refinement. This framework was applied over several iterations to construct a comprehensive phylogeny encompassing the B1/B2 MBLs and several other recently diverged clades. The resulting tree represents the most robust hypothesis to date regarding the emergence of B1/B2 MBLs and implies a parsimonious evolutionary history of key features, including variation in active site architecture and insertions and deletions of distinct structural elements. Ancestral proteins inferred at key internal nodes were experimentally characterised, revealing distinct activity profiles that reflect underlying evolutionary transitions. These findings give rise to testable hypotheses regarding the molecular basis and evolutionary drivers of functional diversification, as well as potential targets for MBL inhibitor design.
2025-07-20 12:25:00 12:30:00 02N Student Council Symposium Multilingual model improves zero-shot prediction of disease effects on proteins Ruyi Chen Ruyi Chen, Nathan Palpant, Gabriel Foley, Mikael Boden Models for mutation effect prediction in coding sequences rely on sequence-, structure-, or homology-based features. Here, we introduce a novel method that combines a codon language model with a protein language model, providing a dual representation for evaluating the effects of mutations on disease. By capturing contextual dependencies at both the genetic and protein levels, our approach achieves a 3% increase in ROC-AUC when classifying disease effects for 137,350 ClinVar missense variants across 13,791 genes, outperforming two single-sequence-based language models. Notably, the codon language model can uniquely differentiate synonymous from nonsense mutations at the genomic level. Our strategy uses information at complementary biological scales (akin to human multilingual models) to enable protein fitness landscape modeling and evolutionary studies, with potential applications in precision medicine, protein engineering, and genomics.
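The dual-representation scoring reduces to a weighted combination of log-likelihood ratios. A toy sketch with invented numbers and an assumed equal weighting; note how a synonymous variant leaves the protein term at zero, so only the codon model contributes:

```python
# Combine mutant-vs-reference log-likelihood ratios from a codon language
# model and a protein language model into one zero-shot score. The inputs
# stand in for real model scores; the 0.5 weight is an assumption.
def dual_score(codon_ll_ref, codon_ll_alt, prot_ll_ref, prot_ll_alt, w=0.5):
    """Higher (less negative) scores suggest a better-tolerated variant."""
    codon_lr = codon_ll_alt - codon_ll_ref       # genetic-level evidence
    protein_lr = prot_ll_alt - prot_ll_ref       # protein-level evidence
    return w * codon_lr + (1.0 - w) * protein_lr

# a damaging missense variant penalized by both models
print(dual_score(-120.0, -128.0, -310.0, -319.5))   # -8.75
# a synonymous variant: protein likelihood unchanged, codon model decides
print(dual_score(-120.0, -121.0, -310.0, -310.0))   # -0.5
```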
2025-07-20 12:30:00 12:45:00 02N Student Council Symposium Integrated analysis of bulk and single-nuclei RNA sequencing data of primary and metastatic pediatric Medulloblastoma. Ana Isabel Castillo Orozco Ana Isabel Castillo Orozco, Geoffroy Danieau, Livia Garzia Medulloblastoma (MB) is a highly aggressive tumor and the most common brain tumor of childhood. MB presents high intertumoral heterogeneity, with at least four molecular subgroups identified (SHH, WNT, Group 3, and Group 4). Metastatic MB, or leptomeningeal disease (LMD), is predominantly found in Group 3 MB. Although LMD represents a major clinical challenge, its molecular mechanisms remain poorly characterized. Recent research has shown that primary MB and its metastases diverge dramatically. Our work has focused on establishing therapy-naïve Group 3 patient-derived xenograft models of primary and metastatic medulloblastoma to conduct transcriptomic profiling at the bulk and single-nuclei RNA-seq levels, with the aim of identifying genetic drivers/pathways that sustain the leptomeningeal disease compartment. Our results show various signaling pathways enriched across LMD models, such as MYC targets, unfolded protein response, and fatty acid metabolism. Using single-sample GSEA (ssGSEA) and deconvolution approaches, we have also identified that our PDX models retain neoplastic subpopulations previously identified in MB single-cell sequencing studies. Similarly, we have identified slight differences in cell subpopulation proportions between primary and leptomeningeal compartments. Our single-nuclei studies have confirmed these results and the differentially expressed genes previously found in bulk RNA-seq analyses. These results suggest the presence of cell populations enriched in the metastatic compartment with an aberrant transcription phenotype and adaptations in metabolism to survive in the leptomeningeal space. Our findings suggest that LMD should be treated differently from primary brain tumors and that the identified metabolic pathways may be potential targets for therapeutics to treat or prevent this devastating disease.
2025-07-20 12:45:00 12:50:00 02N Student Council Symposium Investigating novel transcriptional regulators in symbiotic nodule development of Medicago truncatula Sara Eslami Sara Eslami, Mahboobeh Azarakhsh Biological nitrogen fixation is a crucial process for sustainable agriculture, allowing leguminous plants to convert atmospheric nitrogen into bioavailable forms through a symbiotic relationship with rhizobia. This interaction results in the formation of specialized root structures called nodules, where nitrogen fixation takes place. A deeper understanding of the molecular mechanisms governing nodule formation is essential for enhancing plant-microbe interactions and improving agricultural productivity. In this study, we investigate key transcription factors (TFs) involved in the nodulation process of Medicago truncatula, including MtIPD3, MtNSP1, MtNSP2, MtNIN, and MtERNs. Using co-expression analysis (Phytozome database) and interaction network studies (STRING database), we identify novel regulatory elements that potentially play a role in nodule organogenesis. Our findings suggest a strong interaction between IPD3 and splicing factors, implicating its involvement in RNA processing and cell cycle regulation during nodule formation. Additionally, we identify the cytokinin transporter gene ABCG38 as significantly upregulated in nodules, suggesting its role in cytokinin-mediated regulation. Moreover, our analysis indicates that the auxin response factor Medtr2g043250 is a likely transcriptional target of NIN, highlighting a possible cross-talk between auxin and cytokinin signaling in nodulation. These insights contribute to a deeper understanding of the transcriptional and hormonal regulation of nodule development, offering potential strategies for enhancing biological nitrogen fixation in legumes.
2025-07-20 12:50:00 12:55:00 02N Student Council Symposium Meta-Analysis of Bovine Transcriptome Reveals Key Immune Gene Profiles and Signaling Pathways Vennila Kanchana Devi Marimuthu Vennila Kanchana Devi Marimuthu, Kishore Matheswaran, Menaka Thambiraja, Ragothaman M Yennamalli Understanding immune mechanisms in cattle is crucial for improving disease resistance through informed breeding decisions. Meta-analysis is a powerful approach for integrating findings from multiple transcriptomic studies: it uncovers significant gene expression patterns across experimental conditions and increases statistical power. In this study, we conducted a meta-analysis of four bovine transcriptomic datasets (GSE45439, GSE62048, GSE125964, and GSE247921) to identify immune-related differentially expressed genes (DEGs) in Bos taurus. These datasets encompassed a range of immune-challenging conditions, including infections caused by Mycobacterium bovis and Mycobacterium avium subsp. paratuberculosis, comparing transcriptomic profiles between diseased and healthy cattle. We implemented a comprehensive transcriptome analysis pipeline involving FastQC, Trimmomatic, Bowtie2, SAMtools, FeatureCounts, DESeq2, and metaRNASeq, which resulted in the identification of 28 significant DEGs, comprising 12 upregulated and 16 downregulated genes. Comparison with an innate immune gene database revealed five immune-related genes, including IL1A, RGS2, RCAN1, and ZBP1, known to play important regulatory roles in immune responses. KEGG pathway enrichment analysis showed that these genes were involved in four critical immune-related pathways: necroptosis, osteoclast differentiation, oxytocin signaling, and cGMP-PKG signaling. These pathways are associated with various immune functions, including inflammatory cell death, cytokine signaling, immune cell differentiation, and leukocyte trafficking. Overall, this meta-analysis provides a deeper understanding of conserved immune signaling mechanisms in cattle and highlights key genes that could serve as biomarkers for immune competence, disease susceptibility, or vaccine responsiveness. The findings offer valuable insights for future functional studies and applications in bovine immunogenomics.
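The p-value combination step of such a meta-analysis is short; the sketch below uses Fisher's method (one of the combination rules metaRNASeq offers) on four illustrative per-study p-values for a single gene:

```python
# Fisher's method: -2 * sum(ln p_i) follows a chi-squared distribution with
# 2k degrees of freedom under the null, for k independent studies.
import numpy as np
from scipy.stats import chi2

def fisher_combine(pvals):
    stat = -2.0 * np.sum(np.log(pvals))
    return chi2.sf(stat, df=2 * len(pvals))

# illustrative per-study DESeq2 p-values for one gene across four datasets
print(fisher_combine([0.01, 0.04, 0.20, 0.03]))   # combined p ≈ 0.0011
```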
2025-07-20 12:55:00 13:00:00 02N Student Council Symposium Post-translational regulation of stemness under DNA damage response contributes to the gingivobuccal oral squamous cell carcinoma relapse and progression Sachendra Kumar Sachendra Kumar, Annapoorni Rangarajan, Debnath Pal Tobacco consumption (smoking and, particularly, smokeless forms) contributes to a high prevalence of gingivobuccal oral squamous cell carcinoma (OSCC-GB) in India. OSCC-GB patients exhibit high rates of locoregional relapse and therapeutic failure, often attributed to the involvement of cancer stem cells (CSCs). This study aims to leverage the generalizability of a machine learning prediction model for ‘Tumor Status’ to conduct a comparative somatic mutation analysis between ‘With Tumor’ (recurred/relapsed/progressed) and ‘Tumor Free’ (disease-free/complete remission) OSCC-GB patients. Our results revealed that support vector machines (SVMs) classified the ‘Tumor Status’ classes with a mean accuracy of 89% based on clinical features. Furthermore, RNA-seq-based somatic mutation analysis using the classified groups revealed molecular mechanisms underlying tumor relapse and progression within OSCC-GB subgroups. The identified mutational signature (C>T mutations) linked to DNA damage suggests a role for tobacco-related carcinogens in OSCC-GB subgroups. The analysis of distinct somatic variants, functional impact predictions, protein-protein interactions, and survival data highlights the involvement of DNA damage response (DDR)-related genes in the ‘With Tumor’ subgroup. This analysis particularly emphasizes the significant role of the Mitogen-activated protein kinase associated protein 1 (MAPKAP1) gene, a key player in the mTORC2 signaling pathway. The study suggests that loss of function in the identified MAPKAP1 somatic variant may promote stemness and elevate the risk of disease relapse and progression in ‘With Tumor’ OSCC-GB patients under DDR conditions, potentially contributing to higher mortality rates among Indian OSCC-GB patients.
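The 'Tumor Status' classifier can be reproduced in spirit with scikit-learn; the features and labels below are synthetic placeholders, not the study's clinical data:

```python
# SVM with standardized features and 5-fold cross-validated accuracy,
# mirroring the reported clinical-feature classification setup.
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))                  # 120 patients, 10 clinical features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # synthetic 'With Tumor' labels

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(scores.mean())                            # mean cross-validated accuracy
```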
2025-07-20 14:00:00 15:00:00 02N Student Council Symposium
2025-07-20 15:00:00 15:45:00 02N Student Council Symposium Prof. Dame Janet Thornton
2025-07-20 15:45:00 16:00:00 02N Student Council Symposium Closing remarks
2025-07-20 09:00:00 16:30:00 Special Track ISCB Board of Directors Meeting
2025-07-20 15:30:00 18:00:00 Special Track Fellows Workshop
2025-07-20 16:00:00 18:00:00 Special Track Career Fair
2025-07-20 19:30:00 21:00:00 Special Track Welcome Networking Reception
2025-07-21 18:00:00 19:30:00 Special Track Bioinformatics in the UK Reception - ticketed event
2025-07-22 18:00:00 19:30:00 Special Track Success Circles - Reinvented Networking where participants are grouped into small circles, each led by a knowledgeable facilitator
2025-07-22 19:30:00 23:00:00 Special Track INVITE ONLY - President's Reception
2025-07-23 18:00:00 18:45:00 Special Track JPI Meet-Up
2025-07-23 19:00:00 21:00:00 Special Track All Conference Networking Event at Punch Tarmeys and ArCains - Ticketed Event
2025-07-22 11:20:00 11:30:00 01C SysMod Opening Talk Matteo Barberis The community of special interest (COSI) in systems modeling (SysMod) organizes annual one-day gatherings. In 2025, the meeting comprises three sessions covering a broad variety of topics, beginning with metabolic modeling, followed by an afternoon session on multiscale modeling, and concluding with inference of cellular processes. This year's meeting features two keynote speakers, Ronan Fleming and Jasmin Fisher. The event is hosted by Chiara Damiani and Matteo Barberis on behalf of the eight COSI organizers. This brief talk introduces all speakers, organizers, and main topics of the 2025 meeting.
2025-07-22 11:30:00 12:10:00 01C SysMod Ronan Fleming
2025-07-22 12:10:00 12:30:00 01C SysMod A dynamic multi-tissue metabolic reconstruction reveals interindividual variation in postprandial metabolic fluxes Shauna O'Donovan Lisa Corbeij, Natal van Riel, Shauna O'Donovan Genome-scale metabolic models (GEMs) are large network-based metabolic reconstructions that can predict the flux of numerous metabolites, making them valuable for analysing metabolism across a wide variety of human tissues and microbial species. However, the steady-state assumption needed to solve these GEMs limits their utility for studying disturbances in metabolic resilience. In this study, we embed GEMs of the liver, skeletal muscle, and adipocyte into the Mixed Meal Model (MMM), a physiology-based computational model describing the interplay between glucose, insulin, triglycerides and non-esterified fatty acids (NEFAs). We implement dynamically updating objective functions for each GEM, where the cellular objective depends on the model-calculated insulin values. The MMM is simulated using a fixed-step ODE solver; at each time step the exchange reaction bounds for glucose, triglycerides, and NEFAs in each GEM are updated according to the MMM outputs and flux balance analysis is used to determine the metabolic fluxes. The insulin-dependent objective function allowed the GEMs to accurately simulate the transition from glucose and NEFA secretion in the fasting state to nutrient storage post meal. Moreover, the dynamic tissue-specific GEMs also correctly simulated postprandial changes in metabolites such as lactate and glycerol that are not directly modulated by the differential equations. Personalised hybrid multi-tissue Meal Models, derived from meal response data, reveal changes in tissue-specific flux associated with insulin resistance and liver fat accumulation. This research demonstrates the potential of merging GEMs with physiological models to deepen our understanding of metabolic dynamics, offering promising avenues for personalized medicine in metabolic disorders.
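The ODE-FBA coupling loop described above can be sketched with COBRApy; the bundled textbook E. coli model stands in for the tissue-specific GEMs, and the insulin switch and bound updates are illustrative, not the Mixed Meal Model's equations:

```python
# Sketch: at each fixed ODE time step, outputs of the dynamic layer set the
# exchange-reaction bounds, an FBA solve returns fluxes, and the realized
# uptake feeds back into the dynamic state.
from cobra.io import load_model

model = load_model("textbook")            # small demo model as a stand-in GEM
dt, t_end = 0.1, 2.0
glucose_available = 10.0                  # toy interstitial glucose pool (a.u.)

t = 0.0
while t < t_end:
    insulin = 1.0 if t > 0.5 else 0.2     # toy postprandial insulin signal
    # update exchange bounds from the ODE layer before each FBA solve
    model.reactions.get_by_id("EX_glc__D_e").lower_bound = -glucose_available * insulin
    sol = model.optimize()
    uptake = -sol.fluxes["EX_glc__D_e"]   # realized glucose uptake flux
    glucose_available = max(glucose_available - uptake * dt, 0.0)
    t += dt

print(round(glucose_available, 2))        # remaining glucose after the meal window
```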
2025-07-22 12:30:00 12:50:00 01C SysMod Decoding organ-specific breast cancer metastasis through single-cell metabolic modeling Garhima Arora Garhima Arora, Samrat Chatterjee Breast organotropism, the preferential metastasis of breast cancer cells to specific organs, remains a critical challenge and a clinically significant phenomenon, with limited understanding of the metabolic factors driving site-specific colonization. In this study, we employed genome-scale metabolic models (GSMMs) integrated with single-cell RNA sequencing data from patient-derived xenograft models to investigate the metabolic basis of breast cancer organotropism. We constructed 14 tissue-specific metabolic models from primary breast tumors and their corresponding metastatic sites in the liver, bone, and brain and systematically explored metabolic perturbations associated with disease progression. Our analysis revealed distinct metabolic adaptations in metastatic tissues, characterized by upregulation of lipid metabolism, vitamin and cofactor metabolism, and amino acid pathways, particularly in bone and brain metastases compared to the liver. Furthermore, flux-based comparisons of primary tumors predisposed to different metastatic destinations identified metabolic signatures predictive of organotropism. Using the robust Metabolic Transformation Algorithm (rMTA), we simulated gene over-expression and knock-out strategies, identifying candidate metabolic genes capable of driving the transition from primary to metastatic phenotypes during breast organotropism. This systems-level approach not only advances our understanding of the metabolic determinants of breast cancer organotropism but also highlights potential metabolic targets for therapeutic intervention aimed at halting metastatic progression.
2025-07-22 12:50:00 12:55:00 01C SysMod Enzyme activation network facilitates regulatory crosstalk between metabolic pathways Sultana Al Zubaidi, Muhammad Ibtisam Nasar, Richard Notebaart, Markus Ralser, Mohammad Tauqeer Alam The metabolic network, the largest interconnected system in the cell, is constantly regulated by a range of regulatory interactions. To characterize metabolite-enzyme activatory interactions, we reconstructed the cell-intrinsic enzyme-metabolite activation-interaction network ("activation network") using the Saccharomyces cerevisiae metabolic model (Yeast9) as the basis. We integrated the Yeast9 metabolic network with the list of cross-species activating compounds from the BRaunschweig ENzyme DAtabase (BRENDA). The cell-intrinsic activation network comprises 1,499 activatory interactions involving 344 enzymes and 286 cellular metabolites. Although only 54% of yeast metabolic enzymes (344 out of 635) are intracellularly activated, these enzymes are distributed across nearly all pathways, underscoring the widespread role of activation in cellular metabolism. Notably, in 94% of pathways at least one initial reaction is intracellularly activated. These initial reactions are typically non-equilibrium, flux-generating steps that must be regulated to control the overall pathway flux. Moreover, our analysis shows that highly activating metabolites are predominantly essential, whereas highly activated enzymes tend to be non-essential for growth. Additionally, we find that activator metabolites are produced in fewer steps compared to non-activators, suggesting a streamlined synthesis for regulatory compounds. We further examined cross-pathway activation and found a significant degree of trans-activation, emphasizing the interconnected nature of cellular metabolism. This coordination ensures that metabolic pathways are selectively activated and dynamically adjusted to meet cellular demands.
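The network reconstruction described above — mapping activator metabolites onto the enzymes they act on — is naturally represented as a directed metabolite-to-enzyme graph. A minimal sketch with networkx follows; the activator/enzyme pairs are illustrative placeholders, not entries from BRENDA or Yeast9.

```python
import networkx as nx

# Toy activator -> enzyme edges in the style of BRENDA activation
# annotations mapped onto a metabolic model; names are illustrative only.
activations = [
    ("fructose-1,6-bisphosphate", "pyruvate kinase"),
    ("AMP", "phosphofructokinase"),
    ("citrate", "acetyl-CoA carboxylase"),
    ("AMP", "glycogen phosphorylase"),
]

g = nx.DiGraph()
for metabolite, enzyme in activations:
    g.add_node(metabolite, kind="metabolite")
    g.add_node(enzyme, kind="enzyme")
    g.add_edge(metabolite, enzyme, interaction="activation")

# Degree-based view of "highly activating" metabolites: out-degree counts
# how many enzymes each metabolite activates.
ranked = sorted(g.out_degree, key=lambda kv: kv[1], reverse=True)
print(ranked[0])  # ('AMP', 2) in this toy example
```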
2025-07-22 12:55:00 13:00:00 01C SysMod Cell-cycle dependent DNA repair and replication unifies patterns of chromosome instability Bingxin Lu Bingxin Lu, Samuel Winnall, Will Cross, Chris Barnes Chromosomal instability (CIN) is pervasive in human tumours and often leads to structural or numerical chromosomal aberrations. Somatic structural variants (SVs) are intimately related to copy number alterations, but the two types of variant are often studied independently. Additionally, despite numerous studies on detecting various SV patterns, there are still no general quantitative models of SV generation. To address this issue, we develop a computational cell-cycle model for the generation of SVs from end-joining repair and replication after double-strand break formation. Our model provides quantitative information on the relationship between breakage-fusion-bridge cycles, chromothripsis, seismic amplification, and extra-chromosomal circular DNA. Given whole-genome sequencing data, the model also allows us to infer important parameters in SV generation with Bayesian inference. Our quantitative framework unifies disparate genomic patterns resulting from CIN, provides a null mutational model for SVs, and reveals deeper insights into the impact of genome rearrangement on tumour evolution.
2025-07-22 14:00:00 14:40:00 01C SysMod Virtual Tumours for Predictive Precision Oncology Jasmin Fisher Jasmin Fisher Cancer is a complex systemic disease driven by genetic and epigenetic aberrations that impact a multitude of signalling pathways operating in different cell types. The dynamic, evolving nature of the disease leads to tumour heterogeneity and an inevitable resistance to treatment, which poses considerable challenges for the design of therapeutic strategies to combat cancer. In this talk, I will discuss some of the progress made towards addressing these challenges, using the design of computational models of cancer signalling programs (i.e., virtual tumours). I will showcase a growing library of mechanistic, data-driven computational models, focused on the intra- and inter-cellular signalling in various types of cancer (namely triple-negative breast cancer, non-small cell lung cancer, melanoma and glioblastoma). These computational models are predictive and mechanistically interpretable, enabling us to understand and anticipate emergent resistance mechanisms and to design patient-specific treatment strategies to improve outcomes for patients with hard-to-treat cancers.
2025-07-22 14:40:00 15:00:00 01C SysMod A community benchmark of off-lattice multiscale modelling tools reveals differences in methods and across-scales integrations Arnau Montagud Thaleia Ntiniakou, Othmane Hayoun-Mya, Marco Ruscone, Alejandro Madrid Valiente, Adam Smelko, Jose Luis Estragués Muñoz, Jose Carbonell-Caballero, Alfonso Valencia, Arnau Montagud The emergence of virtual human twins (VHT) in biomedical research has sparked interest in multiscale modelling frameworks, particularly in their application bridging cellular to tissue levels. Among these tools, off-lattice agent-based models (O-ABM) offer a promising approach due to their depiction of cells in 3D space, closely resembling biological reality. Despite the proliferation of O-ABM tools addressing various biomedical challenges, comprehensive and systematic comparisons among them have been lacking. This paper presents a community-driven benchmark initiative aimed at evaluating and comparing O-ABM for biomedical applications, akin to successful efforts in other scientific domains such as CASP. Enlisting developers from leading tools like BioDynaMo, Chaste, PhysiCell, and TiSim, we devised a benchmark scope, defined metrics, and established reference datasets to ensure a meaningful and equitable evaluation. Unit tests targeting different solvers within these tools were designed, ranging from diffusion and mechanics to cell cycle simulations and growth scenarios. Results from these tests demonstrate varying tool performances in handling diffusion, mechanics, and cell cycle equations, emphasising the need for standardised benchmarks and interoperability. Discussions among the community underscore the necessity for defining gold standards, fostering interoperability, and drawing lessons from analogous benchmarking experiences. The outcomes, disseminated through a public platform in collaboration with OpenEBench, aim to catalyse advancements in computational biology, offering a comprehensive resource for tool evaluation and guiding future developments in cell-level simulations. This initiative endeavours to strengthen and expand the computational biology simulation community through continued dissemination and performance-oriented benchmarking efforts to enable the use of VHT in biomedicine.
2025-07-22 15:00:00 15:20:00 01C SysMod Multi-objective Reinforcement Learning for Optimizing JAK/STAT Pathway Interventions: A Quantitative System Pharmacology Study Tien Nguyen Nhung Duong, Tuan Do, Tien Nguyen, Hoa Vu, Lap Nguyen Background and Aims: JAK/STAT cancer pathway-oriented treatment optimization poses challenges due to pathway complexity, feedback loops, resistance development, and the need to balance efficacy, immunity preservation, and toxicity minimization. We aimed to develop and utilize a comprehensive in silico framework based on multi-objective reinforcement learning (MORL) to identify optimal intervention strategies targeting the JAK/STAT pathway. Methods: We constructed a multi-scale mechanistic model integrating JAK/STAT intracellular signaling dynamics (ODEs), tumor-immune cell interactions, adaptive resistance evolution, and pharmacokinetics/pharmacodynamics/toxicity of inhibitors (JAKi, STATi, cytokine blockers). We trained MORL agents (multi-objective PPO), employing this model as an environment, to discover treatment schedules balancing four objectives: tumor reduction, immune preservation, resistance prevention, and toxicity minimization. Pareto-optimal strategies were then identified. Results: MORL successfully identified a diverse set of non-dominated intervention strategies, revealing inherent trade-offs. Distinct treatment paradigms emerged, such as an efficacy-focused strategy yielding ~28% tumor reduction but incurring higher toxicity/resistance, contrasted with a resistance-prevention strategy achieving excellent resistance/toxicity scores (>0.94, >0.25) but limited tumor control (~6.4%). Sensitivity analyses highlighted SOCS3 regulation, STAT kinetics, and resistance parameters as critical determinants of outcomes. Conclusion: This study introduces a MORL-empowered framework for navigating complex therapeutic trade-offs in JAK/STAT targeting. Our findings reveal diverse optimal strategies and underscore key biological factors influencing treatment success, offering a computational basis for the rational design and personalization of JAK/STAT-targeted therapies and showcasing the potential of MORL in quantitative systems pharmacology.
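One common way to trace out trade-off strategies of the kind this abstract reports is weighted-sum scalarization, where a multi-objective reward vector is collapsed into a scalar reward and the weight simplex is swept. The sketch below shows that idea only; it is not the authors' MORL-PPO implementation, and the objective values are placeholders.

```python
import numpy as np

def scalarize(objectives, weights):
    """Weighted-sum scalarization of a multi-objective reward vector.

    objectives: [tumor_reduction, immune_preservation,
                 resistance_prevention, toxicity_minimization]
    Sweeping `weights` over the simplex and training one policy per
    weight vector is a simple way to approximate a Pareto front.
    """
    w = np.asarray(weights, dtype=float)
    return float(np.dot(w / w.sum(), np.asarray(objectives, dtype=float)))

# Placeholder objective values for one candidate schedule
print(scalarize([0.28, 0.70, 0.94, 0.25], weights=[0.4, 0.2, 0.2, 0.2]))
```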
2025-07-22 15:20:00 15:40:00 01C SysMod Decoding CXCL9 regulatory mechanisms by integrating perturbation screenings with active learning of mechanistic logic-ODE models Bi-Rong Wang Bi-Rong Wang, Maaruthy Yelleswarapu, Federica Eduati Despite advances in immunotherapy, pancreatic cancer remains highly lethal due to an immunosuppressive tumor microenvironment characterized by immune exclusion. Inducing chemokines like CXCL9 with targeted therapy can promote cytotoxic T cell recruitment. However, systematic approaches are needed to better understand chemokine regulation and identify drug combinations upregulating CXCL9 expression. We combined logic-ODE models with wet lab experiments for two pancreatic cancer cell lines (AsPC1 and BxPC3), investigating CXCL9 responses to combinations of cytokines (IFNγ, TNFα) and inhibitors targeting JAK, IKK, RAS, MEK, or PI3K. Analyzing model parameters fitted to the screening data revealed both shared and cell line-specific mechanisms. To efficiently navigate the large space of possible drug combinations, we developed a pipeline for experimental design, integrating active learning with mechanistic modeling. In this pipeline, an acquisition function iteratively selects the most informative conditions from a pool of unseen drug combinations; these are added to the training data to update the model. We benchmarked different acquisition functions on in silico data: “greedy” selects conditions predicted to yield the highest CXCL9 levels, while “uncertainty” prioritizes those where the model is least confident. Both strategies outperformed random sampling: “greedy” most efficiently identified high-CXCL9 conditions, while “uncertainty” improved overall model generalizability. We are currently performing wet lab experiments to validate in silico predictions. Our framework demonstrates how active learning can be combined with dynamic logic-based models to accelerate the discovery of immunomodulatory drug combinations, offering a generalizable approach for hypothesis-driven experimental design in systems biology.
2025-07-22 15:40:00 16:00:00 01C SysMod ARTEMIS integrates autoencoders and Schrödinger Bridges to predict continuous dynamics of gene expression, cell population and perturbation from time-series single-cell data Sayali Anil Alatkar Cellular processes like development, differentiation, and disease progression are highly complex and dynamic (e.g., gene expression). These processes often undergo cell population changes driven by cell birth, proliferation, and death. Single-cell sequencing enables gene expression measurement at the cellular resolution, allowing us to decipher cellular and molecular dynamics underlying these processes. However, the high costs and destructive nature of sequencing restrict observations to snapshots of unaligned cells at discrete timepoints, limiting our understanding of these processes and complicating the reconstruction of cellular trajectories. To address this challenge, we propose ARTEMIS, a generative model integrating a variational autoencoder (VAE) with an unbalanced Diffusion Schrödinger Bridge (uDSB) to model cellular processes by reconstructing cellular trajectories, revealing gene expression dynamics, and recovering cell population changes. The VAE maps input time-series single-cell data to a continuous latent space, where trajectories are reconstructed by solving the Schrödinger bridge problem using forward-backward stochastic differential equations (SDEs). A drift function in the SDEs captures deterministic gene expression trends. An additional neural network estimates time-varying kill rates for single cells along trajectories, enabling recovery of cell population changes. Using three scRNA-seq datasets—pancreatic β-cell differentiation, zebrafish embryogenesis, and epithelial-mesenchymal transition (EMT) in cancer cells—we demonstrate that ARTEMIS: (i) outperforms state-of-the-art methods to predict held-out timepoints, (ii) recovers relative cell population changes over time, and (iii) identifies “drift” genes driving deterministic expression trends in cell trajectories. Furthermore, in silico perturbations show that these genes influence processes like EMT. The code for ARTEMIS is available at: https://github.com/daifengwanglab/ARTEMIS.
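The trajectory machinery this abstract builds on — a latent SDE of the form dz = f(z, t) dt + σ dW, simulated forward (and backward) in time — is commonly integrated with the Euler-Maruyama scheme. Below is a generic numpy sketch of that scheme with a toy drift; ARTEMIS's learned drift network, backward pass, and kill-rate estimation are not reproduced here.

```python
import numpy as np

def euler_maruyama(z0, drift, sigma, t0=0.0, t1=1.0, n_steps=100, rng=None):
    """Simulate dz = drift(z, t) dt + sigma dW forward in time."""
    if rng is None:
        rng = np.random.default_rng(0)
    dt = (t1 - t0) / n_steps
    z = z0.copy()
    path = [z.copy()]
    for i in range(n_steps):
        t = t0 + i * dt
        dw = rng.normal(scale=np.sqrt(dt), size=z.shape)  # Brownian increment
        z = z + drift(z, t) * dt + sigma * dw
        path.append(z.copy())
    return np.stack(path)

# Toy drift pulling latent states toward the origin; in ARTEMIS the drift
# would be a trained neural network over the VAE latent space.
path = euler_maruyama(np.ones(8), drift=lambda z, t: -z, sigma=0.1)
print(path.shape)  # (101, 8): one latent trajectory across 100 steps
```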
2025-07-22 16:40:00 17:00:00 01C SysMod Calibrating agent‐based models of colicin-mediated inhibition in microfluidic traps using single-cell time-lapse microscopy Ati Ahmadi Ati Ahmadi, Samantha Schwartz, Brian Ingalls
2025-07-22 17:00:00 17:20:00 01C SysMod Inferring metabolic activities from single-cell and spatial transcriptomic atlases Erick Armingol Erick Armingol, James Ashcroft, Magda Mareckova, Martin Prete, Valentina Lorenzi, Cecilia Icoresi Mazzeo, Jimmy Tsz Hang Lee, Marie Moullet, Christian Becker, Krina Zondervan, Omer Ali Bayraktar, Luz Garcia-Alonso, Nathan E. Lewis, Roser Vento-Tormo Metabolism is fundamental to cellular function, supporting macromolecule synthesis, signaling, growth, and cell-cell communication. While single-cell and spatial metabolomics technologies have advanced, large-scale applications remain challenging. In contrast, transcriptomics provides vast datasets to infer metabolic states. Here, we present scCellFie, a computational tool that predicts metabolic activities from transcriptomic data at single-cell and spatial resolutions. scCellFie enables scalable analysis of large cell atlases, leverages metabolic tasks for interpretable results, and includes modules for identifying metabolic markers, condition-specific changes, and cell-cell communication. We applied scCellFie to ~30 million human cells, generating a comprehensive metabolic atlas across organs while demonstrating our tool’s scalability. Additionally, we used scCellFie to study the human endometrium, the uterine lining that undergoes substantial remodeling throughout the menstrual cycle due to sex hormones, and identified cell type-specific metabolic programs supporting cyclical changes. Epithelial cells exhibited metabolic regulation covering pathways supporting proliferation and mitigating oxidative stress. Endometrial diseases, including endometriosis and endometrial carcinoma, often arise from metabolic dysregulation. By inspecting eutopic endometrium from donors with endometriosis, we identified altered metabolic programs that likely drive atypical proliferation and inflammation of the distinct cell types. In endometrial carcinoma, malignant cells displayed metabolic rewiring, including increased glucose-to-lactate conversion and dysregulated kynurenine and estrogen signaling. These shifts suggest shared mechanisms promoting aberrant proliferation and may reveal therapeutic targets. Together, our findings demonstrate scCellFie as a scalable, interpretable tool for characterizing metabolism in health and disease. By linking metabolic functions to cellular processes, scCellFie provides deeper insights into metabolic regulation across diverse biological systems.
2025-07-22 17:20:00 17:40:00 01C SysMod Spatiotemporal Variational Autoencoders for Continuous Single-Cell Tissue Dynamics Koichiro Majima Koichiro Majima, Teppei Shimamura Single-cell spatial genomics provides unprecedented molecular insights, yet it still struggles to track both the spatial and temporal progression of tissues under native conditions. Experimental constraints and destructive sampling yield discrete snapshots rather than the continuous record required to fully understand how cells organize and function over time. Optimal transport (OT) approaches have attempted to bridge these snapshots across different assays, but they typically rely on unimodal data, scale poorly, and oversimplify the complexity of ongoing morphogenetic events. We introduce a spatiotemporal variational autoencoder (VAE) that models the continuous evolution of tissue pixels—capturing dynamic changes in both spatial location and gene-expression patterns. By embedding these pixels into a latent space governed by a learned dynamics network, our method reveals how tissues grow, reorganize, and express key genes over time. At each time point, behavioral parameters are decoded from the latent state via a neural decoder, quantifying probabilities of growth, disappearance, seeking, displacement, attraction, and clustering. A differentiable growth module further refines this process by modeling region appearance and disappearance, allowing gradient-based optimization of tissue occupancy patterns. We demonstrate the power of our approach by analyzing a mouse embryogenesis dataset. The model uncovers unobserved developmental trajectories, pinpoints morphogenetic transitions, and aligns coronal sections at scale—capabilities that standard OT-based methods find intractable. These results highlight how a spatiotemporal VAE can reconstruct the story of tissue formation in both forward and backward directions, opening up new avenues to interpret single-cell and spatial data as a cohesive, dynamic narrative rather than disjointed snapshots.
2025-07-22 17:40:00 17:45:00 01C SysMod Computational Modeling of Shortening and Reconstruction of Telomeres Marek Kimmel Marek Kimmel, Marie Doumic, Leonard Mauvernay, Teresa Teixeira We discuss a stochastic model of growth of a population of cultured yeast cells with gradually decaying chromosome endings called telomeres, as well as models of telomere reconstruction using the so-called ALT (alternative lengthening of telomeres) mechanism. Telomeres play a major role in aging and carcinogenesis in humans. Our models correspond in part to the experiments of one of us (Teixeira). For telomere shortening, we modify the method of Olofsson and Kimmel, who considered properties of the branching process of telomere shortening; this leads to consideration of a random walk on a two-dimensional grid. We derive an integral equation for the probability generating functions (pgf's) characterizing the dynamics of telomere shortening. We find that the general solutions have the form of exponential polynomials. Stochastic simulations lead to interesting and non-obvious effects if cell death is included. We further consider more complex models, involving cell death and the ALT mechanism of telomere reconstruction. These are based on our works and are intended to address the experiments in Kockler et al. (2021). In one version of the ALT mechanism model, we consider expectations conditional on non-extinction, since only a fraction of ALT telomeres is stably elongated (see Kockler et al. 2021). In conclusion, the multitype branching processes produce realistic predictions consistent with complex biological experiments involving telomeres. Our aim is to use the models for longer-term prognoses for human telomeres.
2025-07-22 17:45:00 17:50:00 01C SysMod TFvelo: gene regulation inspired RNA velocity estimation Jiachen Li Jiachen Li, Xiaoyong Pan, Ye Yuan, Hong-Bin Shen RNA velocity is closely related to cell fate and is an important indicator for the prediction of cell states, with an elegant physical interpretation derived from single-cell RNA-seq data. Most existing RNA velocity models aim to extract dynamics from the phase delay between unspliced and spliced mRNA for each individual gene. However, unspliced/spliced mRNA abundance may not provide sufficient signal for dynamic modeling, leading to poor fits in phase portraits. Motivated by the idea that RNA velocity could be driven by transcriptional regulation, we propose TFvelo, which expands the RNA velocity concept to various single-cell datasets without relying on splicing information, by introducing gene regulatory information. Our experiments on synthetic data and multiple scRNA-seq datasets show that TFvelo can accurately fit gene dynamics on phase portraits and effectively infer cell pseudo-time and trajectories from RNA abundance data. TFvelo opens a novel, robust and accurate avenue for modeling RNA velocity for single-cell data.
2025-07-22 17:50:00 18:00:00 01C SysMod Closing remarks Chiara Damiani This concluding talk aims to briefly discuss the diversity of topics presented at the “Computational Modeling of Biological Systems” (SysMod) COSI track. This diversity illustrates the importance of the field and the broad range of applications in systems biology and disease. Then, forthcoming meetings of interest will be announced, and the three poster awards will be delivered as a closing event.
2025-07-21 11:20:00 11:40:00 11BC Tech Track UniProt: Evolving Tools and Data for Protein Science Daniel Rice Daniel Rice The Universal Protein Resource (UniProt) is a cornerstone of molecular biology and bioinformatics, delivering high-quality, freely accessible protein sequence and functional information for over 20 years. This session presents a guided tour of UniProt’s latest features, datasets, and tools, reflecting its continued evolution to meet the needs of the scientific community. We will highlight data integrations—including AlphaFold structural predictions, AlphaMissense variant effect predictions, RNA editing, post-translational modifications (PTMs), and Human Proteome Project (HPP) datasets—and demonstrate embedded visualizations developed by UniProt and third-party contributors. Attendees will learn about improved tools for browsing, analyzing, and exporting data, along with recent enhancements to UniProt’s API and new Swagger documentation that streamline programmatic access and data integration. Whether you're a student or a seasoned researcher, this session will help you better leverage UniProt in your work. We will emphasize practical applications and encourage engagement with UniProt’s expanding capabilities. Attendees will leave with a deeper understanding of how to integrate UniProt resources into their workflows—and how to contribute feedback to guide its future development.
2025-07-21 11:40:00 12:20:00 11BC Tech Track Genomics 2 Proteins portal: A resource and discovery platform for linking genetic screening outputs to protein sequences and structures Sumaiya Iqbal, Jordan Safer Recent advances in AI-based methods have revolutionized the field of structural biology. Concomitantly, high-throughput sequencing and functional genomics have generated genetic variants at an unprecedented scale. However, efficient tools and resources are needed to link disparate data types – to “map” variants onto protein structures, to better understand how the variation causes disease, and thereby design therapeutics. Here we present the Genomics 2 Proteins Portal (G2P; g2p.broadinstitute.org): a human proteome-wide resource that maps 49,500,857 genetic variants onto 42,481 protein sequences and 84,318 structures (according to Dec 2024 release), with a comprehensive set of structural and functional features. Additionally, the G2P portal allows users to interactively upload protein residue-wise annotations (variants, scores, etc.) as well as the protein structure beyond databases to establish the connection between genomics to proteins. The portal serves as an easy-to-use discovery tool for researchers and scientists to hypothesize the structure-function relationship between natural or synthetic variations and their molecular phenotypes.
2025-07-21 12:20:00 13:00:00 11BC Tech Track ModCRE: modelling protein–DNA interactions and transcription-factor co-operativity in cis-regulatory elements Patrick Gohl Patrick Gohl, Patricia Bota ModCRE is a web server that uses a structural approach to predict transcription factor binding preferences and automate the modelling of higher-order regulatory complexes with DNA.
2025-07-21 14:00:00 14:20:00 11BC Tech Track Orchestrating Microbiome Analysis with Bioconductor Tuomas Borman Tuomas Borman, Leo Lahti Microbes play a crucial role in human health, disease, and the environment. Despite their significant impact, the mechanisms underlying microbiome interactions remain largely unknown due to their complexity. While microbiome research has heavily relied on sequencing data, understanding these interactions requires multi-omics approaches and computational methods. R/Bioconductor is a well-established platform for biological data analysis, providing high-quality, open-source software. It is driven by a global community of researchers who collaborate through software development, shared standards, and active forums. The software is built on standardized data containers, with SummarizedExperiment being the most widely used, adopted across a wide range of biological fields. Shared data containers enable interoperability and facilitate advanced data integration. This session will showcase Bioconductor tools for microbiome data science, with a particular focus on the mia (Microbiome Analysis) framework through a practical case study. By the end of the session, participants will gain insights into the latest advances in microbiome research within Bioconductor, including the TreeSummarizedExperiment data container along with essential methods. They will also be prepared to further explore the data science ecosystem, supported by the online book Orchestrating Microbiome Analysis with Bioconductor.
2025-07-21 14:20:00 14:40:00 11BC Tech Track Smart Turkana Beads: A Culturally Embedded IoT Innovation for Health Monitoring and Drug Adherence Meya Brian Access to quality healthcare remains a persistent challenge in marginalized regions such as Turkana County in northern Kenya, where traditional beliefs, nomadic lifestyles, and weak infrastructure significantly hinder health service delivery and uptake, particularly in chronic disease management and drug adherence. This proposal introduces an innovative solution: culturally embedded Internet of Things (IoT) technology in the form of Smart Turkana Beads—a wearable health-monitoring device seamlessly integrated into the community’s traditional attire to enhance health surveillance and promote drug adherence. Leveraging Indigenous Knowledge Systems (IKS), the project seeks to reduce the cultural dissonance typically encountered by conventional biomedical technologies, while addressing key barriers to healthcare access in a way that aligns with local values and practices (Mwakalinga et al., 2021)
2025-07-21 14:40:00 15:00:00 11BC Tech Track AI and Quantum in Healthcare and Life Sciences Filippo Utro The advent of foundation models (FM) and quantum computing (QC) has ushered in a new paradigm for tackling complex problems, igniting significant interest across diverse sectors, particularly within healthcare and life sciences. This talk will provide an exploration of the latest efforts at IBM Research dedicated to leveraging FM and QC for accelerating discovery in healthcare and life sciences. The discussion will span a range of applications in omics data, clinical trials, and drug discovery. Finally, in this presentation, I will discuss technical challenges, envisioning the new era of FM and QC in healthcare and life sciences.
2025-07-21 15:00:00 15:20:00 11BC Tech Track Integrating Long-Read Sequencing and Multiomics for Precision Cell Line Engineering Daniel Fabian Optimizing the biomanufacturing of therapeutic molecules, such as monoclonal antibodies, is critical for delivering efficient and scalable patient treatment. A key early step in this production pipeline is the development of Chinese Hamster Ovary (CHO) cell lines that produce these biologic molecules. To address the rising demand for therapeutic biologics, it is essential to enhance cell line expression to achieve consistently high and stable product titers. Lonza has recently integrated nanopore long-read sequencing into its multiomics pipelines, providing unprecedented molecular insights into the genetic and epigenetic landscapes of CHO cell lines, advancing both genetic engineering and biomarker discovery. This presentation will highlight recent progress in improving omics data accuracy through de novo genome assembly, and the integration of nanopore whole genome sequencing and DNA methylation analysis with other next-generation sequencing technologies.
2025-07-21 15:20:00 15:40:00 11BC Tech Track Title yet to be given by the sponsor
2025-07-21 15:40:00 16:00:00 11BC Tech Track Title yet to be given by the sponsor
2025-07-22 11:20:00 11:40:00 11BC Tech Track Scale with Seqera: Accelerate, Expand, and Collaborate Adam Talbot, Geraldine Van der Auwera Turning a promising research project into a robust, real-world solution requires tools that support both early experimentation and enterprise-scale deployment. When reproducibility and reliability are non-negotiable, you need a platform that's flexible during ideation and powerful enough to meet the demands of mega-scale computation and collaborative research. Too often, scaling up means switching tools, rewriting pipelines, or reprovisioning infrastructure — an expensive, frustrating process that can introduce errors and undermine scientific reproducibility. In this talk, we will explore how Seqera's integrated suite of products empowers you to scale and accelerate your scientific research.
2025-07-22 11:40:00 12:00:00 11BC Tech Track SimpleVM - Effortless Cloud Computing for Research Viktor Rudko Viktor Rudko, Peter Belmann, Nils Hoffmann, David Weinholz, Alexander Sczyrba SimpleVM empowers life science researchers to harness cloud resources, regardless of their expertise in cloud computing. As a multi-cloud application, SimpleVM is optimized for seamless integration with multiple OpenStack® installations. From an OpenStack administrator's perspective, all that is required is an OpenStack project, without the need for additional admin privileges. SimpleVM features enhanced security components that scan connection attempts to virtual machines and automatically block suspicious access attempts. By combining Keycloak with a Django-based service layer, SimpleVM provides comprehensive user management and customizable role-based access control. This facilitates the integration of Authentication and Authorisation Infrastructure (AAI) for seamless use of various Identity Providers (IdPs), including LifeScience AAI or a local university IdP. In addition to the straightforward launch of virtual machines, the emphasis is placed on advanced features intended to streamline and enhance the user experience when working with cloud resources. For example, Virtual Research Environments (VREs) can be deployed with just a few clicks, providing access to powerful applications such as Visual Studio Code® or RStudio directly from the browser. The SimpleVM workshop mode supports high-attendance teaching sessions: workshop instructors can easily pre-configure, launch and assign machines to attendees in no time. Finally, SimpleVM improves the use of cloud resources with features such as auto-scalable SLURM clusters.
2025-07-22 12:00:00 12:20:00 11BC Tech Track Overture Prelude: A toolkit for small teams with big data problems Mitchell Shiell Mitchell Shiell, Melanie Courtot, Brandon Chan, Jon Eubank, Robin Haw, Justin Richardsson, Leonardo Rivera, Lincoln Stein, Overture Team Overture is used to build platforms that enable researchers to organize and share their data quickly, flexibly and at multiple scales. While Overture successfully powers major international platforms like ICGC-ARGO (100,000+ participants) and VirusSeq (500,000+ genomes), smaller teams generating massive data face prohibitive technical requirements during implementation. How then can we enable teams to build their data platform efficiently and with fewer resources? Prelude addresses this challenge by breaking down platform development into incremental phases, reducing the technical overhead during development and allowing teams to systematically verify requirements through hands-on testing, gaining insights into workflows, data needs, and platform fit. Prelude guides teams through three progressive phases of data platform development each building upon the previous one's foundation: - Phase one focuses on data exploration and theming, enabling teams to visualize and search their data through a customizable UI; - Phase two expands capabilities to enable tabular data management and validation with persistent storage; - Phase three adds file management and object storage. These phases are supported by comprehensive documentation, deployment automations, and utilities that generate key configuration files, reducing unnecessary time spent on tedious manual configurations. Prelude represents a practical step toward making data platform development accessible to all teams of all sizes. By providing a widely accessible platform we hope to encourage community requests and feedback such that we can improve and iterate on the platform making it the best it can be for advancing data sharing and reuse across the scientific community.
2025-07-22 14:00:00 14:20:00 11BC Tech Track GPCRVS - AI-driven decision support system for GPCR virtual screening Dorota Latek Dorota Latek GPCRVS is an efficient, simple, easily accessible, and open-source web service that, as a decision support system, aims to facilitate the preclinical testing of drug candidates targeting peptide- and small protein-binding G protein-coupled receptors. There are three major areas of drug discovery that GPCRVS could facilitate: prediction of drug selectivity; prediction of drug efficacy, approximated by AutoDock Vina docking scores, by the activity class assigned by the TensorFlow multiclass classifier, or by pChEMBL predictions from the LightGBM regressor; and prediction of the drug binding mode, showing the most crucial amino acids involved in drug-receptor interactions. A comparison with precomputed results for known active compounds enables the prioritization of drug candidates, thereby significantly reducing the cost and length of experimental screening. In addition, a novel approach was proposed for using peptide ligand data sets as SMILES-based fingerprints in conjunction with small-molecule ligand data sets in the training of DNN and GBM models. This makes it possible to benefit from all GPCR-like ligand data sets deposited in ChEMBL, and to design new drugs that could include both peptide and non-peptide scaffolds of increased, unified activity and selectivity. Currently, two groups of peptide/small protein-binding GPCRs are included in GPCRVS, allowing comparative predictions for class A and B receptors at the same time. An evaluation of GPCRVS performed using the patent compound data set from Google Patents showed that LightGBM provides the most accurate results among the three classifiers implemented in GPCRVS.
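The pChEMBL-regression component described above pairs SMILES-derived fingerprints with a gradient-boosting regressor. A minimal sketch of that pattern with RDKit and LightGBM follows; the SMILES strings and pChEMBL values are made up, and GPCRVS's actual featurization and training setup may differ.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from lightgbm import LGBMRegressor

def fingerprint(smiles, n_bits=2048):
    """Morgan (ECFP-like) bit fingerprint as a numpy array."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

# Toy training set: SMILES with invented pChEMBL activity values
smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1", "CCN(CC)CC",
          "CC(C)Cc1ccc(cc1)C(C)C(=O)O", "CN1CCCC1c1cccnc1",
          "OC(=O)c1ccccc1O", "CCCCCC"]
y = [5.1, 6.3, 7.0, 4.8, 6.8, 5.5, 6.0, 4.2]
X = np.stack([fingerprint(s) for s in smiles])

# min_child_samples=1 only because the toy set is tiny
model = LGBMRegressor(n_estimators=50, min_child_samples=1).fit(X, y)
print(model.predict(fingerprint("CCOC(=O)C")[None, :]))
```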
2025-07-22 14:20:00 14:40:00 11BC Tech Track Self-supervised generative AI enables conversion of two non-overlapping cohorts Supratim Das Supratim Das, Mahdie Rafiei, Andreas Maier, Linda Baumbach, Jan Baumbach Prognostic models in healthcare often rely on big data, which is typically distributed across multiple medical cohorts. Even if collected for similar purposes (e.g., capturing symptoms, treatments, and outcomes in osteoarthritis), they frequently differ in acquisition methods, structures, and variable definitions used. These discrepancies impede their integration into a unified, multi-cohort database for joint prognostic model training and pose significant challenges to model transferability, meaning a model trained on one cohort needs to be applied to data of a similar cohort with an incompatible data structure. Current cohort conversion approaches rely on AIs trained on linked, overlapping samples, which many healthcare cohorts lack. Here, we present DB-converter, a self-supervised deep learning architecture leveraging category theory and designed to convert data from different cohorts with different data structures into each other. We demonstrate the power and robustness of the DB-converter using synthetic and real health survey data. Our approach opens new avenues for multi-cohort analyses operating under the assumption that all cohorts to be integrated have been acquired for at least a similar real-world purpose.
2025-07-22 14:40:00 15:00:00 11BC Tech Track The Bioverse - Biomolecule data processing for AI made easy Tim Kucera Tim Kucera, Karsten Borgwardt We introduce the bioverse, a free and open-source Python package that streamlines biological data preparation for machine learning. Focused on structural biology, it standardizes diverse biomolecular formats for flexible, high-performance workflows. Our demonstration will showcase key features, code examples, and how to launch your own ML projects in minutes.
2025-07-22 15:00:00 15:20:00 11BC Tech Track Data, We Need to Chat Alberto Pepe Susheel Varma, Jineta Banerjee, Robert Allaway, John Hill, Jay Hodgson, Alberto Pepe, Christine Suver, Luca Foschini The exponential growth of biomedical datasets presents unprecedented opportunities for scientific discovery, yet researchers struggle to find and explore relevant data. Traditional search methods fall short when navigating complex, highly regulated biomedical data repositories. This paper examines these limitations and proposes AI-powered conversational interfaces as a solution. Key obstacles to effective data discovery include repository fragmentation, inconsistent metadata, vocabulary mismatches, complex search requirements, and inadequate interface design. These challenges are intensified in biomedical research by regulatory restrictions on accessing sensitive data. Conversational AI systems offer a promising alternative by enabling natural language dialogue with data repositories. Unlike keyword searches, these interfaces understand user intent, ask clarifying questions, and guide researchers to relevant datasets. Synapse.org's experimental chatbot implementation demonstrates how AI-assisted discovery processes complex queries (e.g., "datasets related to people over 60 with Alzheimer's disease and Type 2 diabetes") without requiring database expertise. This approach leverages Retrieval-Augmented Generation (RAG) while respecting authorization levels and regulatory compliance. Such systems facilitate "metadata spelunking," allowing researchers to explore dataset composition, methodology, and potential utility without needing to access sensitive raw data. The paper addresses ethical considerations related to privacy, bias, and trust, while outlining future possibilities for interdisciplinary data discovery. By bridging the gap between vast biomedical data repositories and researchers, conversational AI interfaces promise to democratize data access, accelerate discovery, and ultimately improve human health.
2025-07-22 15:20:00 15:40:00 11BC Tech Track BioInfore: A No-Code Genome Data Management System Based On AI Agents Zheng Chen Zheng Chen, Ziwei Yang, Xihao Piao, Peng Gao, Yasushi Sakurai, Yasuko Matsubara In many genomic projects, selecting and preparing assemblies requires complex database queries, manual metadata curation, and bespoke code scripting. We introduce a no-code AI agent workflow that replaces all of these steps with a single plain natural language request. Behind the scenes, five specialized AI agents handle retrieval, quality filtering, ranking, and format conversion. Users receive analysis-ready genome datasets in minutes, freeing them from programming barriers and manual errors so they can focus on biological discovery.
2025-07-22 15:40:00 16:00:00 11BC Tech Track Omi: Bridging the Informatics to Bio Gap with a Natural Language Co-pilot Prashant Bharadwaj Kalvapalle, Eddie Kim, Marko Tanevski, Sahil Joshi, Benjamin Mao, Anshumali Shrivastava, Todd Treangen Omi facilitates bioinformatics analysis by replacing complex command-line processes with a natural language bioinformatics co-pilot. We codify bioinformatics best practices into the LLM to select appropriate pipelines, provide explanations before running them, and return results after pipeline execution. Further democratization through coding bespoke statistical and data visualizations is underway.
2025-07-24 08:40:00 09:00:00 12 Text Mining Opening remarks
2025-07-24 09:00:00 09:40:00 12 Text Mining Keynote - TBA Chris Mungall
2025-07-24 09:40:00 10:00:00 12 Text Mining Poster lightning talks
2025-07-24 11:20:00 11:40:00 12 Text Mining Representations of Cells in the Biomedical Literature: First Look at the NLM CellLink Corpus Noam H. Rotenberg Noam H. Rotenberg, Robert Leaman, Rezarta Islamaj, Brian Fluharty, Helena Kuivaniemi, Savannah Richardson, Gerard Tromp, Zhiyong Lu, Richard H. Scheuermann Single-cell technologies are enabling the discovery of many novel cell phenotypes, but this growing body of knowledge remains fragmented across the scientific literature. Natural language processing (NLP) offers a promising approach to extract this information at scale; however, the existing annotated datasets required for system development and evaluation do not reflect the complex assortment of cell phenotypes described in recent studies. We present a new corpus of excerpts from recent articles, manually annotated with mentions of human and mouse cell populations. The corpus distinguishes three types of mentions: (1) specific cell phenotypes (cell types and states), (2) heterogeneous cell populations, and (3) vague cell population descriptions. Mentions of the first two types were linked to Cell Ontology identifiers, using their meaning in context, with matches labeled as exact or related, where possible. Annotation was performed by four cell biologists using a multi-round process, with automated pre-annotation. The corpus contains over 22,000 annotations across more than 3,000 passages selected from 2,700 articles, covering nearly half the concepts in the current Cell Ontology. Fine-tuning BiomedBERT in a simplified named entity recognition task on this corpus resulted in substantially higher performance than the same configuration fine-tuned on previously annotated datasets. Our corpus will be a valuable resource for developing automated systems to identify cell phenotype mentions in the biomedical literature, a challenging benchmark for evaluating biomedical NLP systems, and a foundation for the future extraction of relationships between cell types and key biomedical entities, including genes, anatomical structures, and diseases.
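The fine-tuning setup this abstract describes — token classification over BIO-style cell-mention tags — requires aligning word-level labels to subword tokens. Below is a generic Hugging Face transformers sketch of that alignment step; the checkpoint name is an assumption (BiomedBERT was previously distributed under the PubMedBERT name), and the simplified B-CELL/I-CELL label scheme is invented for illustration, not the CellLink corpus schema.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Assumed checkpoint name; substitute the BiomedBERT variant you use.
ckpt = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
labels = ["O", "B-CELL", "I-CELL"]  # illustrative simplified NER scheme
model = AutoModelForTokenClassification.from_pretrained(
    ckpt, num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

words = ["CD8+", "tissue-resident", "memory", "T", "cells", "expanded"]
word_labels = ["B-CELL", "I-CELL", "I-CELL", "I-CELL", "I-CELL", "O"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Align word-level BIO tags to subword tokens (-100 = ignored by the loss)
aligned = [-100 if wid is None else labels.index(word_labels[wid])
           for wid in enc.word_ids()]
print(aligned)  # ready to pass as `labels` during fine-tuning
```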
2025-07-24 11:40:00 12:00:00 12 Text Mining Contextualizing phenotypes in medical notes with small language models Connor Grannis Connor Grannis, Max Homilius, Austin A. Antoniou, David M. Gordon, Ashley Kubatko, Peter White Accurate phenotypic extraction from clinical notes is essential for precision medicine. While manual approaches are time-consuming and prone to bias, automated phenotype recognition tools often misinterpret contextual attributes—such as negation, temporality, or familial association—due to variability in documentation styles. Evaluation is further hindered by the lack of gold-standard datasets with context attributes. To address this gap, we 1) annotated the ID-68 dataset (an open-source dataset of 68 clinical notes from a cohort of patients with intellectual disabilities) with context attributes using a large language model (LLM) followed by manual review; 2) generated 50 synthetic clinical notes modeled on the ID-68 dataset using an LLM, seeding them with phenotypes associated with OMIM diseases and diverse contextual attributes; and 3) fine-tuned small language models (SLMs) to perform binary classification of whether a phenotype is negated, hypothetical, or associated with a family member. We evaluated several phenotype concept recognition models on a span-level NER task, including the correct classification of negated, family-related, or hypothetical phenotype mentions. Our results demonstrate that existing phenotype recognition tools are effective for extracting phenotypes that are mostly patient-related (i.e., ID-68), but insufficient for more complex contexts. By augmenting extracted phenotypes with SLMs, we boosted context accuracy in the synthetic dataset from 57% to 89%. These findings highlight the importance of accurate contextual awareness in phenotype extraction pipelines. Our synthetic dataset and evaluation framework offer a foundation for benchmarking future tools and advancing scalable, high-fidelity phenotype extraction for precision medicine applications.
2025-07-24 12:00:00 12:20:00 12 Text Mining CSpace: A concept embedding space for bio-medical applications Danilo Tomasoni Danilo Tomasoni, Luca Marchetti Motivation: The rise of transformer-based architectures dramatically improved our ability to analyze natural language. However, the power and flexibility of these general-purpose models come at the cost of highly complex model architectures with billions of parameters that are not always needed. Results: In this work, we present CSpace: a concise word embedding of biomedical concepts that outperforms alternatives in out-of-vocabulary ratio and on the semantic textual similarity task, and has comparable performance to transformer-based alternatives on the sentence similarity task. This ability can serve as the foundation for semantic search by enabling efficient retrieval of conceptually related terms. Additionally, CSpace incorporates ontological identifiers (MeSH, NCBI gene and taxonomy IDs), enabling computationally efficient cross-ontology relatedness measurement, potentially unlocking previously unknown disease-condition associations. Method: CSpace was trained with the FastText algorithm on full-text articles from PubMed, PubMed Central, ClinicalTrials.gov (US) and preprints from BioRxiv and MedRxiv published in 2024. CSpace encodes concepts rather than words: it combines multiple words pertaining to the same concept using both PubTator3 annotations and statistical word co-occurrence. Conclusion: CSpace outperforms other embedding models in both concept and sentence similarity tasks. It also surpasses the transformer-based OpenAI ada-v2 model in the concept similarity task, with a performance trade-off of less than 5% in the sentence similarity task. Additionally, CSpace can effectively measure associations among diseases, genes, and clinical conditions, even across different ontologies, using less than 10% of the embedding dimensions required by ada-v2, making it a highly efficient and accessible tool for democratizing advanced embedding technologies.
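The core training recipe here — FastText over a corpus in which multi-word concepts are merged into single tokens — can be sketched with gensim. The toy corpus and identifiers below are illustrative, not CSpace's actual vocabulary or hyperparameters; the point is that subword n-grams keep even misspelled or unseen tokens embeddable, which is what drives the low out-of-vocabulary ratio.

```python
from gensim.models import FastText

# Toy corpus where multi-word concepts are pre-merged into single tokens
# and ontology IDs appear as tokens, mimicking CSpace's concept vocabulary.
sentences = [
    ["MESH:D003920", "is", "associated", "with", "insulin_resistance"],
    ["metformin", "treats", "MESH:D003920", "in", "type_2_diabetes"],
    ["insulin_resistance", "precedes", "type_2_diabetes"],
] * 50  # repeated so the toy model has enough co-occurrence signal

model = FastText(sentences, vector_size=32, window=3, min_count=1, epochs=10)
print(model.wv.most_similar("type_2_diabetes", topn=2))
# Subword n-grams give a vector even for out-of-vocabulary spellings:
print(model.wv["insulin_resistnce"][:4])  # misspelled, still embeddable
```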
2025-07-24 12:20:00 12:40:00 12 Text Mining VectorSage: Enhancing PubMed Article Retrieval with Advanced Semantic Search Rahul Brahma Yasas Wijesekara, Rahul Brahma, Mehdi Lotfi, Marcus Vollmer, Lars Kaderali The exponential growth of academic literature has presented unprecedented opportunities and highlighted the need for advanced search methodologies for efficient knowledge discovery. While effective for structured queries, traditional keyword-based search engines often struggle with the inherent variability of language, where the same concept can be expressed in many ways, leading to imprecise retrieval of relevant articles. Recent advancements in natural language processing (NLP) have facilitated the development of semantic similarity techniques that extend beyond simple text matching, enabling more contextually aware search capabilities. To address the limitations of the traditional approach, we introduce VectorSage, a hybrid search framework that integrates semantic similarity search and keyword-based retrieval to enhance academic literature discovery in peer-reviewed articles. VectorSage employs a multi-step ranking mechanism executed in parallel: (1) a semantic similarity search using FAISS with Stella-400M embeddings to retrieve conceptually related articles; and (2) a keyword-based search leveraging BM25S for probabilistic text ranking. The results are independently ranked and merged into a globally optimized ranked list using a weighted scoring function, balancing semantic relevance with keyword specificity. This hybrid approach is particularly useful where terminology varies, allowing researchers to retrieve articles that traditional search techniques might otherwise miss. Tested on over 26 million PubMed abstracts, VectorSage significantly improves retrieval of relevant articles, facilitating more effective literature exploration. As a freely accessible web tool, VectorSage enhances high-quality academic literature search across disciplines. VectorSage is live at: https://vectorsage.nube.uni-greifswald.de/.
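The weighted merge of semantic and lexical rankings described above can be sketched in a few lines. This is a generic illustration, not VectorSage's implementation: random vectors stand in for Stella-400M embeddings, and the rank_bm25 package stands in for the BM25S library named in the abstract.

```python
import numpy as np
import faiss
from rank_bm25 import BM25Okapi  # stand-in for BM25S

docs = ["myocardial infarction biomarkers", "deep learning for protein folding",
        "heart attack risk prediction", "zebrafish embryogenesis atlas"]
bm25 = BM25Okapi([d.split() for d in docs])

rng = np.random.default_rng(0)
emb = rng.normal(size=(len(docs), 64)).astype("float32")  # placeholder embeddings
faiss.normalize_L2(emb)                 # so inner product = cosine similarity
index = faiss.IndexFlatIP(64)
index.add(emb)

def hybrid(query, q_emb, alpha=0.6, k=4):
    """Weighted merge of semantic (FAISS) and lexical (BM25) scores."""
    q = np.asarray(q_emb, dtype="float32")[None, :]
    faiss.normalize_L2(q)
    sem, ids = index.search(q, k)
    sem_scores = dict(zip(ids[0].tolist(), sem[0].tolist()))
    lex = bm25.get_scores(query.split())
    lex = lex / (lex.max() + 1e-9)      # scale lexical scores before mixing
    merged = {i: alpha * sem_scores.get(i, 0.0) + (1 - alpha) * lex[i]
              for i in range(len(docs))}
    return sorted(merged, key=merged.get, reverse=True)

print([docs[i] for i in hybrid("heart attack", rng.normal(size=64))])
```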
2025-07-24 12:40:00 13:00:00 12 Text Mining Large Language Model Applications on the Uniprot Protein Sequence and Annotation Database Melike Akkaya Melike Akkaya, Rauf Yanmaz, Sezin Yavuz, Vishal Joshi, Maria-Jesus Martin, Tunca Doğan Efficiently accessing and analyzing comprehensive biological datasets remains challenging due to the complexity of traditional querying. To address this, we developed an intuitive, scalable query interface using advanced large language models. Our system enables users, regardless of computational expertise, to formulate natural-language queries that are automatically translated into precise Solr database searches, significantly simplifying interaction with UniProtKB. Additionally, we implemented a semantic vector search for rapid protein similarity analyses, using protein embeddings generated by the ProtT5 protein language model within an optimized approximate nearest-neighbor search framework (Annoy). This method significantly outperforms conventional BLAST searches, offering a speed increase of up to tenfold on GPU hardware. Functional insights are further enriched through integrated Gene Ontology analyses, providing biologically meaningful context to similarity searches. Currently, we are expanding the system using Retrieval-Augmented Generation, integrating real-time annotations from UniProt flat files to enhance contextual relevance and accuracy of generated responses. Evaluations using diverse biological queries demonstrated the robustness of our interface, highlighting its ability to mitigate intrinsic variability in LLM outputs through controlled prompt engineering and query retry mechanisms. Overall, our system substantially streamlines the retrieval process, facilitating quicker, more accurate exploration of protein functions, evolutionary relationships, and annotations.
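The approximate nearest-neighbor component described above follows a standard Annoy pattern: build a tree index over fixed-size embeddings, then query by vector. A minimal sketch is below; the random vectors are placeholders for real ProtT5 per-protein embeddings, and the tree count is an arbitrary example value.

```python
import numpy as np
from annoy import AnnoyIndex

DIM = 1024  # ProtT5 per-protein embedding size
rng = np.random.default_rng(0)

index = AnnoyIndex(DIM, "angular")  # angular distance ~ cosine similarity
for i in range(1_000):              # placeholder embeddings, not real ProtT5 output
    index.add_item(i, rng.normal(size=DIM).tolist())
index.build(50)                     # 50 trees: more trees -> better recall, bigger index

query = rng.normal(size=DIM).tolist()
ids, dists = index.get_nns_by_vector(query, 10, include_distances=True)
print(ids[:3], [round(d, 3) for d in dists[:3]])
```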
2025-07-24 14:00:00 14:20:00 12 Text Mining Human-AI Collaboration for Cancer Knowledge Verification: Insights from the CIViC-Fact Dataset Caralyn Reisle Caralyn Reisle, Cameron J. Grisdale, Kilannin Krysiak, Arpad M. Danos, Mariam Khanfar, Erin Pleasance, Jason Saliba, Melika Bonakdar, Nilan V. Patel, Joshua F. McMichael, Malachi Griffith, Obi L. Griffith, Steven J.M. Jones Interpretation of genomic findings remains one of the largest barriers to automation in processing precision oncology patient data due to the high level of expertise required in cancer biology, genomics, and bioinformatics. Efforts to streamline this process include creating cancer knowledge bases (KB) to store annotations of individual genes and variants, but creating such resources is time-consuming. The open-data cancer KB CIViC (civicdb.org) adopted crowd-sourcing to curate content efficiently. However, these submissions still require expert review, leading to a new bottleneck. To address this, we introduce CIViC-Fact, a novel benchmark designed to support automated fact-checking and claim verification in the biomedical domain. CIViC-Fact augments thousands of curated entries in the CIViC knowledge base with sentence-level evidence provenance, linking each claim to the specific sentences that support or contradict it. We evaluate the performance of several open large language models (LLMs) on this dataset. Existing LLMs struggle (up to 60% accuracy) and require fine-tuning to achieve reasonable performance. While fine-tuned language models perform well (up to 88% accuracy), there is significant room for improvement in the quality of their reasoning. Despite these remaining challenges, we have applied the current pipeline to the entirety of CIViC, flagging any errors detected. Flagged entries were returned to the CIViC curation team for follow-up, resulting in corrections to the KB content. This demonstrates the practical utility of CIViC-Fact, not only as a new benchmark for NLP research, but as a tool for semi-automated auditing of scientific knowledge bases.
2025-07-24 14:20:00 14:40:00 12 Text Mining Named entity recognition and relationship extraction to mine minimum inhibitory concentration of antibiotics from biomedical text Tiffany Ta Tiffany Ta, Arman Edalatmand, Andrew McArthur Antimicrobial resistance (AMR) poses a global public health threat, undermining modern medicine by diminishing the effectiveness of antibiotics for treating bacterial infections. The minimum inhibitory concentration (MIC) is the lowest concentration at which an antibiotic inhibits bacterial growth. Based on pre-established MIC cutoff values, MIC can be used to determine if an isolate will be susceptible or resistant to an antibiotic. To interpret MIC, various metadata (i.e., infection site, bacterial isolate, etc.) must also be collected. This information is spread piecemeal throughout the scientific literature available at PubMed Central (PMC) but has yet to be mined. The Comprehensive Antibiotic Resistance Database (CARD) is a globally used, expert-curated resource and knowledgebase of AMR determinants and antibiotics. Yet CARD lacks MIC information, which can provide insights into the phenotypic risk profile of individual ARGs. We are leveraging natural language processing to extract MIC values and relevant information from 5,704,429 PMC articles. We have trained a text classifier to identify PMC articles associated with bacterial drug resistance (F1 = 0.9699) and filtered the papers via Regular Expressions to identify papers with MIC data, yielding 10,086 papers. Afterwards, named entity recognition (NER) was used to mine relevant MIC information, generating 1,082,026 annotations. We are now working towards employing generative models to extract MIC values from PMC articles. Once CARD has MIC data, we can track MIC across time, as increasing MIC values can be a forewarning for pathogens that may develop resistance.
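The regular-expression filtering step this abstract mentions can be illustrated with a toy MIC pattern. This is only a sketch: real MIC mentions come in many more unit spellings, ranges, and comparator forms than the single pattern below covers.

```python
import re

# Illustrative pattern for MIC mentions (value, optional comparator, unit).
MIC_RE = re.compile(
    r"MIC\s*(?:of|=|:)?\s*(?P<cmp>[<>]=?)?\s*(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>[µu]g/m[lL]|mg/L)"
)

text = ("Against E. coli, ciprofloxacin MIC = 0.25 µg/mL was recorded, "
        "while meropenem showed MIC > 8 mg/L.")
for m in MIC_RE.finditer(text):
    # Default to "=" when no explicit comparator is present
    print(m.group("cmp") or "=", m.group("value"), m.group("unit"))
```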
2025-07-24 14:40:00 15:00:00 12 Text Mining Enhancing Biomedical Relation Extraction with Directionality Chih-Hsuan Wei Po-Ting Lai, Chih-Hsuan Wei, Shubo Tian, Robert Leaman, Zhiyong Lu Biological relation networks contain rich information for understanding the biological mechanisms behind the relationships of entities such as genes, proteins, diseases, and chemicals. The vast growth of biomedical literature poses significant challenges for keeping this network knowledge up to date. The recent Biomedical Relation Extraction Dataset (BioRED) provides valuable manual annotations, facilitating the development of machine-learning and pre-trained language model approaches for automatically identifying novel document-level (inter-sentence context) relationships. Nonetheless, its annotations lack directionality (subject/object) for the entity roles, which is essential for studying complex biological networks. Herein we annotate the entity roles of the relationships in the BioRED corpus and subsequently propose a novel multi-task language model with soft-prompt learning to jointly identify the relationship, novel findings, and entity roles. Our results include an enriched BioRED corpus with 10,864 directionality annotations. Moreover, our proposed method outperforms existing large language models such as the state-of-the-art GPT-4 and Llama-3 on two benchmarking tasks.
2025-07-24 15:00:00 15:20:00 12 Text Mining Metadata extraction: Large Language Models (LLMs) to the rescue Daniela Gaio Daniela Gaio In this project, our research group embarked on an extensive effort to download and re-analyze all globally and publicly accessible metagenomic samples from the NCBI database, culminating in the creation of MicrobeAtlas.org—a resource for the scientific community. We employed Large Language Models (LLMs) to efficiently extract relevant information from the often chaotic and submitter-dependent metadata files. This approach represents a significant leap over traditional methods such as conventional natural language processing tools, offering unparalleled efficiency in metadata mining. Clean, accessible metadata are critical in microbial -omics for the analysis of metagenomic samples. Despite advancements, the challenge of disorganized metadata remains, limiting dataset utility. The application of LLMs greatly enhanced the extraction of keywords, geographical data, sample nature, and host. This improvement unveiled signals within the metagenomic data previously masked by the limitations of conventional NLP tools, thus increasing dataset value and access. Our approach included developing and validating a pipeline for processing metadata from 3.8 million samples. I will outline the encountered challenges and the implemented solutions, including a comparison of paid and free LLM models. In conclusion, our efforts in improving metagenomic dataset accessibility and utility not only enable the reuse of existing data for comparative analysis and new discoveries, but also establish a new benchmark in metadata analysis within microbial ecology. The advancements in metadata extraction foster more detailed and comprehensive research, significantly enhancing our understanding of microbial ecosystems.
2025-07-24 15:20:00 15:40:00 12 Text Mining Large-scale semantic indexing of Spanish biomedical literature using contrastive transfer learning Shanfeng Zhu Ronghui You, Tianyang Huang, Ziye Wang, Yuxuan Liu, Hong Zhou, Shanfeng Zhu The exponential growth of biomedical literature has made automatic indexing essential for advancing biomedical research. While automatic indexing has made strides for English biomedical literature, there has been limited research on non-English biomedical texts due to insufficient high-quality training data. We propose BERTDeCS, a novel deep learning framework for automatically indexing Spanish biomedical literature using contrastive transfer learning. BERTDeCS utilizes a multilingual BERT (M-BERT) to generate multi-language representations and adapts M-BERT to the Spanish biomedical literature domain through contrastive learning. Additionally, BERTDeCS enhances its semantic indexing capabilities for Spanish biomedical literature by leveraging the abundant annotated English literature in MEDLINE through transfer learning. Experimental results on Spanish datasets demonstrate that BERTDeCS outperforms state-of-the-art indexing methods, achieving top performance in the MESINESP and MESINESP2 tasks on medical semantic indexing in Spanish within the BioASQ challenge. Notably, when extended to other languages (e.g., Portuguese) or applied in settings lacking manual indexing, BERTDeCS maintains exceptional performance, affirming its robustness for non-English biomedical semantic indexing.
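The contrastive adaptation step can be illustrated with a generic InfoNCE-style objective over paired representations (e.g., two views of the same article); this is a sketch of the technique in PyTorch, not the BERTDeCS training code, and the pairing scheme is an assumption.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_b, temperature=0.05):
    """InfoNCE-style loss: row i of z_a should match row i of z_b."""
    z_a = F.normalize(z_a, dim=1)          # (batch, dim), unit-normalised
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.T / temperature     # cosine similarities as logits
    targets = torch.arange(z_a.size(0))    # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

z_a, z_b = torch.randn(8, 128), torch.randn(8, 128)   # stand-in embeddings
print(contrastive_loss(z_a, z_b))
```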
2025-07-24 15:40:00 16:00:00 12 Text Mining Reading papers: Extraction of molecular interaction networks with large language models Enio Gjerga Enio Gjerga, Philipp Wiesenbach, Christoph Dieterich Motivation: Signalling occurs within and across cells and orchestrates essential cellular processes in complex tissues. Cell signalling involves several different components, including protein-protein interactions (PPIs). Dynamically changing conditions often lead to the rewiring of cellular communication networks. Computational modelling approaches typically rely on databases of molecular interactions. Manual curation of such databases is time-consuming, and automatic relation extraction (RE) from the scientific literature would greatly support our efforts to understand molecular mechanisms. To ease this process, we reason that prompt-based data mining with Large Language Models (LLMs) can be used to extract information from relevant scientific publications. Approach: We focus on the extraction of entity relations between proteins, as exemplified in protein-protein interaction networks, over a corpus of annotated short scientific texts. We analyze vanilla and fine-tuned models with different prompt setups, either giving no examples at all or following specific patterns, e.g., giving only correct examples, only incorrect examples, or both. Results: On the RegulaTome corpus of annotated abstracts and short scientific texts, we obtain promising evaluation results for the extraction of PPI relations: 79% precision, 70% recall, and 71% F1-score. Our workflow also ingests entire manuscripts and yields 96% precision, 65% recall, and 77% F1-score for PPI relation extraction over a corpus of manually annotated cardiac manuscripts. Availability: Code, scripts, and results are available at: https://github.com/dieterich-lab/LLM_Relations.
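The zero-shot versus example-conditioned prompt setups described above can be sketched as a small prompt builder; the wording is illustrative, not the authors' exact prompts.

```python
# Illustrative few-shot prompt construction for PPI relation extraction:
# zero-shot when no examples are given, otherwise conditioned on correct
# (positive) and/or incorrect (negative) example texts.
BASE = "List all protein-protein interactions stated in the text as (A, B) pairs.\n"

def build_prompt(text, positives=(), negatives=()):
    parts = [BASE]
    for example, relations in positives:
        parts.append(f"Example text: {example}\nInteractions: {relations}\n")
    for example in negatives:
        parts.append(f"Example text: {example}\nInteractions: none\n")
    parts.append(f"Text: {text}\nInteractions:")
    return "\n".join(parts)

print(build_prompt("BRCA1 binds BARD1.",
                   positives=[("TP53 interacts with MDM2.", "(TP53, MDM2)")]))
```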
2025-07-22 11:20:00 12:00:00 02N TransMed Melanie Schirmer
2025-07-22 12:00:00 12:20:00 02N TransMed How can (poly)phenols shape a healthier life? A nutri-omics investigation of their cardiometabolic health effects Mirko Treccani Federica Bergamo, Pedro Mena, Davide Martorana, Daniele Del Rio, Giovanni Malerba, Valeria Barili, Riccardo Bonadonna, Alessandra Dei Cas, Marco Ventura, Francesca Turroni, Letizia Bresciani, Mirko Treccani, Cristiano Negro, Alice Rosi, Cristina Del Burgo-Gutiérrez, Maria Sole Morandini, Nicola Luigi Bragazzi, Claudia Favari, José Fernando Rinaldi de Alvarenga, Lucia Ghiretti, Cristiana Mignogna (Poly)phenols (PPs) are a group of bioactive compounds found in plant-based foods and widely consumed in the diet. Several studies have reported the beneficial effects of PPs in preventing chronic diseases through a myriad of mechanisms of action. However, the bioavailability and effects of these compounds differ greatly across individuals, causing uneven physiological responses. To understand this inter-individual variability, we present a multi-omics investigation comprising genomics, metagenomics and metabolomics. We recruited 300 healthy individuals and collected biological samples (blood, urine, and faeces), anthropometric measurements, health status and lifestyle/dietary information. After identification by UPLC-IMS-HRMS and quantification by UPLC-QqQ-MS/MS, the large set of phenolic metabolites underwent dimensionality reduction and clustering to identify individuals with similar metabolic profiles (metabotypes), distinguishing high and low PP producers. Then, genomic and metagenomic investigations were performed to gain insights into inter-individual differences and unravel the potential pathophysiological impact of these molecules, with particular regard to cardiometabolic diseases. In detail, genome-wide association studies followed by computational functional analyses of genetic variants, and taxonomic and functional investigations of the gut microbiome, were performed, showing hints of associations in genes and microbial species related to PP metabolism, together with unprecedented genetic associations. Genomic data were further investigated in terms of gene networks and computational functional analyses, identifying differentially expressed genes, gene set enrichments, candidate regulatory regions, interacting loci and chromatin states, and associations with metabolic traits and diseases. Overall, we demonstrate the benefits of omics research in nutrition, advancing the field of personalised nutrition and health.
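The metabotyping step (dimensionality reduction plus clustering of phenolic-metabolite profiles) can be sketched generically as below; the matrix shape, the log transform, and k = 2 clusters (high versus low producers) are assumptions for illustration, not the study's actual pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
metabolites = rng.lognormal(size=(300, 120))   # 300 subjects x 120 metabolites

# Reduce the metabolite matrix, then cluster subjects into metabotypes.
scores = PCA(n_components=10).fit_transform(np.log1p(metabolites))
metabotype = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
print(np.bincount(metabotype))                 # subjects per metabotype
```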
2025-07-22 12:20:00 12:30:00 02N TransMed Genomic Scars of Survival: Translating Therapy-Induced Mutagenesis into Clinical Insights in Childhood Cancer Mehdi Layeghifard Mehdi Layeghifard, Marcos Díaz-Gay, Erik N. Bergstrom, Mathepan J. Mahendralingam, Sasha Blay, Pedro L. Ballester, Elli Papaemmanuil, Mark J. Cowley, Anita Villani, Ludmil B. Alexandrov, Adam Shlien Children with cancer endure numerous short- and long-term side effects of treatment, but the extent of DNA damage associated with chemotherapy exposure remains largely unclear. We used mutational signatures to measure this damage directly in 611 whole-genome sequenced tumours from a multi-institutional pediatric cancer cohort enriched with treated tumours. Compared to treatment-naïve tumours, post-therapy cancers harbored nearly three times as many private signatures (p = 0.0001), twice the total burden of somatic mutations (p = 0.023), and 10% more oncogenic drivers (p = 0.016). Our analysis uncovered 15 therapy-associated signatures and revealed patterns of drug-specific and tissue-specific mutagenesis. Platinum-based treatments, for which we more than doubled the number of known associated signatures, contributed the highest mutation burden in most patients. By integrating clinical exposure timelines with signature evolution, we defined a minimum latency of 91 days and a burden threshold of ~1500 mutations for the emergence of platinum signatures. Notably, over one-third of platinum-treated tumours exhibited measurable resistance within 12 months. To further explore therapy-associated genomic alterations beyond signatures, we employed machine learning (ML) models, which identified additional, non-canonical genomic features predictive of platinum, anthracycline, and antimetabolite exposure, suggesting a broader landscape of therapy-induced damage. This comprehensive genomic analysis provides critical insights into the complex mutagenic legacy of chemotherapy in childhood cancer. The identified therapy-specific signatures, their temporal dynamics, and the novel genomic features uncovered by ML offer potential biomarkers for monitoring treatment response, predicting resistance, and informing future interventions aimed at treatment de-escalation and early detection of resistant clones.
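Mutational-signature analyses of the kind described here typically factorise a tumour-by-context mutation catalogue into signatures and per-tumour exposures; a generic NMF sketch follows (cohort size, rank, and the random catalogue are illustrative assumptions, not the study's data or pipeline).

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
# Stand-in catalogue: tumours x 96 trinucleotide mutation contexts.
catalogue = rng.poisson(5, size=(611, 96)).astype(float)

model = NMF(n_components=15, init="nndsvda", max_iter=500, random_state=0)
exposures = model.fit_transform(catalogue)   # per-tumour signature activities
signatures = model.components_               # 15 signatures x 96 contexts
print(exposures.shape, signatures.shape)
```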
2025-07-22 12:30:00 12:40:00 02N TransMed MOSAIC-AD: Multilayered Patient Similarity Analysis Integrating Omics and Clinical Data for Patient Stratification in Atopic Dermatitis Lena Möbus Lena Möbus, Angela Serra, Stephan Weidinger, Dario Greco Atopic dermatitis (AD) presents with high inter-individual variability in both disease progression and treatment response. Understanding how patients evolve over time—clinically and molecularly—is critical for enabling more personalised care. This study aims to stratify AD patients through multi-layered patient similarity analysis by integrating rich longitudinal clinical records with transcriptomic and other omics data. By quantifying patient-to-patient similarity across multiple data types and timepoints, we seek to uncover dynamic patterns linking clinical trajectories with underlying molecular profiles. We analyse a prospective, observational cohort of 419 moderate-to-severe AD patients receiving routine care, with repeated clinical assessments and optional biosampling at approximately 3, 6, and 12 months post-treatment initiation. We compute patient-pairwise distance matrices within and across data layers: clinical variables (numeric, ordinal, categorical), skin and blood transcriptomics, and genomic profiles (rank-based distance metrics and other custom measures). A key challenge is the incomplete overlap of data layers across patients; thus, distances are computed using only shared features between patient pairs. These are integrated into a composite similarity matrix to support longitudinal stratification and comparative analyses. Preliminary analyses reveal structured shifts in patient similarity over time, both in clinical scores and transcriptomic profiles. Clinical distance matrices at different timepoints show evolving clustering patterns, while transcriptomic data indicate increased molecular divergence following treatment. These dynamic shifts are reflected in the density distributions of patient-pairwise distances, supporting temporal changes in both clinical and molecular disease states. This integrative approach supports precision medicine in AD by accommodating data sparsity while capturing meaningful patient-level similarities.
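The shared-feature distance described above (comparing each patient pair only on features observed in both records) can be sketched as follows, with NaN marking a missing measurement; the RMS distance is an illustrative choice, not the study's exact metric.

```python
import numpy as np

def pairwise_shared_distance(X):
    """Distance over only the features present in both patients of a pair."""
    n = X.shape[0]
    D = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(i, n):
            shared = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if shared.any():
                diff = X[i, shared] - X[j, shared]
                D[i, j] = D[j, i] = np.sqrt(np.mean(diff ** 2))
    return D

X = np.array([[1.0, 2.0, np.nan],
              [1.5, np.nan, 0.5],
              [0.0, 2.5, 0.4]])
print(pairwise_shared_distance(X))
```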
2025-07-22 12:40:00 12:50:00 02N TransMed Gene mutant dosage determines prognosis and metastatic tropism in 60,000 clinical cancer samples Nicola Calonaci Nicola Calonaci, Stefano Scalera, Giulio Caravagna, Eriseld Krasniqi, Giorgia Gandolfi, Biagio Ricciuti, Daniel Colic, Marcello Maugeri-Saccà, Salvatore Milite The intricate interplay between somatic mutations and copy number alterations critically influences tumour evolution and clinical outcomes. Yet, conventional genomic analyses often treat these biomarkers independently, overlooking the role of mutant gene dosage—a key mechanistic consequence of their interaction. We developed INCOMMON, an innovative computational model for rapidly inferring allele-specific copy numbers directly from read count data. We applied INCOMMON to 500,000 mutations in 60,000 publicly available clinical samples spanning 39 cancer types. We found 11 genes and 3 mutational hotspots exhibiting recurrent tumour-specific patterns associated with high mutant dosage across 17 tumours. By stratifying more than 24,000 patients based on mutant dosage across actionable oncogenes and tumour suppressors, we identified 6 groups with distinct prognostic significance across mutant dosage classes, and 4 novel biomarkers not detectable in a standard mutation-centric stratification. Additionally, 11 mutant dosage-defined subgroups showed increased metastatic propensity, with 6 enriched for site-specific dissemination patterns. By eliminating reliance on controlled-access raw sequencing data, our method offers a practical and scalable path for integrating dosage-aware biomarkers into clinical research. This augmented insight into genomic drivers enhances our understanding of cancer progression and metastasis, holding the potential to significantly foster biomarker discovery.
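The read-count reasoning behind allele-specific dosage calls can be illustrated with the standard expected variant allele frequency (VAF) formula; this is a generic sketch of the underlying arithmetic, not the INCOMMON model itself.

```python
# Expected VAF of a mutation present in `multiplicity` of `total_cn` tumour
# copies, at tumour purity `purity`, against a diploid normal background.
def expected_vaf(multiplicity: int, total_cn: int, purity: float) -> float:
    mutant_alleles = purity * multiplicity
    total_alleles = purity * total_cn + 2 * (1 - purity)
    return mutant_alleles / total_alleles

# A clonal mutation on one of two copies at 80% purity:
print(round(expected_vaf(1, 2, 0.8), 3))   # 0.4
# The same mutation amplified to two of three copies shifts the VAF markedly:
print(round(expected_vaf(2, 3, 0.8), 3))   # ~0.571
```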
2025-07-22 12:50:00 13:00:00 02N TransMed
2025-07-22 14:00:00 14:20:00 02N TransMed
2025-07-22 14:20:00 14:30:00 02N TransMed Unraveling Early Changes in Alzheimer's Disease: Causal Relationships Among Sleep Behavior, Immune Dynamics, and Cognitive Performance Through Multimodal Data Fusion Sophia Krix Sophia Krix, Neus Falgàs, Andrea del Val-Guardiola, Sarah Hücker, Raquel Sanchez-Valle, Kuti Baruch, Stefan Kirsch, Holger Fröhlich In the early stages of Alzheimer’s Disease (AD), significant changes in sleep behavior and immune system dynamics occur prior to the onset of pathological alterations in the brain, ultimately leading to cognitive decline. To investigate the intricate relationships between these changes, we employ multimodal data fusion and modern nonlinear gradient-based Bayesian Network structure learning techniques. At the molecular level, our ADIS study (https://adis-project.eu/) employs single-cell RNA sequencing data from peripheral blood mononuclear cells of 75 patients (25 cognitively unimpaired, 25 with mild cognitive impairment, 25 with Alzheimer’s dementia), representing cell-type-specific immune pathway alterations, which we embed into a lower-dimensional space via a conditional variational autoencoder. In addition, we incorporate neuroinflammatory cytokine levels measured in the cerebrospinal fluid (CSF). We combine these molecular data with questionnaire-based sleep assessments, standardized and app-based tests of different cognitive functions, and MRI-derived brain volume measures. To uncover the interdependencies among these data modalities, we utilize a recent neural-network-based Bayesian Network structure learning method, DAGMA (Directed Acyclic Graphs via M-matrices for Acyclicity). Our approach allows us to observe that different stages of AD exhibit different dependencies between immune system dysregulation, brain volume changes, cognitive function and sleep. Altogether, and in agreement with earlier findings, our results indicate that the peripheral immune system plays a pivotal role in the development of AD pathology, opening perspectives for innovative treatment options, which are currently being tested in clinical trials by one of our project partners.
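DAGMA's key ingredient is a log-determinant acyclicity function over a weighted adjacency matrix W: h(W) = -logdet(sI - W∘W) + d·log(s) is zero exactly when W encodes a DAG (for s above the spectral radius of W∘W). A minimal numerical sketch of that characterisation, not the full structure-learning pipeline:

```python
import numpy as np

def dagma_h(W, s=1.0):
    """DAGMA log-det acyclicity score: 0 iff W is acyclic (suitable s)."""
    d = W.shape[0]
    M = s * np.eye(d) - W * W            # W∘W = elementwise square
    sign, logdet = np.linalg.slogdet(M)
    return -logdet + d * np.log(s)

acyclic = np.array([[0.0, 0.9],
                    [0.0, 0.0]])         # single edge 0 -> 1
cyclic = np.array([[0.0, 0.9],
                   [0.9, 0.0]])          # 0 <-> 1 cycle
print(dagma_h(acyclic))                  # 0.0
print(dagma_h(cyclic))                   # > 0, penalising the cycle
```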
2025-07-22 14:30:00 14:40:00 02N TransMed Combining Clinical Embeddings with Multi-Omic Features for Improved Interpretability in Parkinson’s Disease Patient Classification Barry Ryan Barry Ryan This study demonstrates the integration of Large Language Model (LLM)-derived clinical text embeddings from the Movement Disorder Society Unified Parkinson’s Disease Rating Scale (MDS-UPDRS) questionnaire with molecular genomics data to enhance interpretability in Parkinson’s disease (PD) classification. By combining genomic modalities encoded using an interpretable biological architecture with a patient similarity network constructed from clinical text embeddings, our approach leverages both clinical and genomic information to provide a robust, interpretable model for disease classification and molecular insights. We benchmarked our approach using the baseline time point from the Parkinson’s Progression Markers Initiative (PPMI) dataset, identifying the Llama-3.2-1B text embedding model on Part III of the MDS-UPDRS as the most informative. We further validated the framework at years 1, 2, and 3 post-baseline, achieving significance in identifying PD-associated genes against a random null set by year 2 and replicating the association of MAPK with PD in a heterogeneous cohort. Our findings demonstrate that LLM text embeddings enable robust, interpretable genomic analysis, revealing molecular signatures associated with PD progression.
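Constructing a patient similarity network from text embeddings can be sketched generically: embed each patient's questionnaire responses, then connect each patient to their k most similar peers. This is an illustrative k-nearest-neighbour construction under assumed shapes, not the study's exact graph-building procedure.

```python
import numpy as np

def knn_similarity_network(embeddings, k=5):
    """Undirected kNN graph from cosine similarity of patient embeddings."""
    Z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = Z @ Z.T
    np.fill_diagonal(sim, -np.inf)           # exclude self-edges
    edges = set()
    for i in range(sim.shape[0]):
        for j in np.argsort(sim[i])[-k:]:    # k most similar patients
            edges.add((min(i, j), max(i, j)))
    return edges

emb = np.random.default_rng(2).normal(size=(20, 64))   # 20 patients, 64-dim
print(len(knn_similarity_network(emb, k=3)))
```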
2025-07-22 14:40:00 15:00:00 02N TransMed
2025-07-22 15:00:00 15:20:00 02N TransMed
2025-07-22 15:20:00 15:40:00 02N TransMed Spatial Regulatory Landscape of the Glioblastoma Tumor Immune Microenvironment Hatice Osmanbeyoglu Linan Zhang, Matthew Lu, Hatice Osmanbeyoglu Glioblastoma (GBM) is the most aggressive primary brain tumor, with poor prognosis and limited treatment options. Its tumor microenvironment (TME) plays a critical role in driving cancer progression, immune evasion, and therapy resistance. Spatial transcriptomics (ST) technologies now allow for the investigation of gene regulation and cell-cell interactions in their spatial context. We developed STAN (Spatially informed Transcription Factor Activity Network), a computational method to infer spot-specific transcription factor (TF) activity from ST data and cis-regulatory information. STAN enables identification of TFs associated with specific cell types, spatial domains, and ligand-receptor signaling events. We further extend this with SPAN (Spatially informed Pathway Activity Network) to predict localized pathway activity. Applying STAN and SPAN to GBM ST datasets (n=26), we uncovered spatial regulatory networks and key ligand-receptor interactions in the TME. Notably, we observed strong correlation between STAN-predicted SOX2 activity and expression of CD44 and VIM, which we validated at the protein level in independent GBM specimens. To identify therapeutic opportunities, we integrated our regulatory maps with Drug2Cell, which links drugs to targetable expression patterns at cellular resolution. This revealed cell type- and region-specific drug-target relationships, nominating compounds with potential to modulate malignant or immunosuppressive cell populations in GBM. Together, STAN, SPAN, and Drug2Cell form a comprehensive framework to connect spatial gene regulation with therapeutic insights. This approach not only advances understanding of GBM biology but is also broadly applicable to other diseases and tissue contexts.
2025-07-22 15:40:00 15:50:00 02N TransMed SpatialPathomicsToolkit: A Comprehensive Framework for Pathomics Feature Analysis and Integration Shilin Zhao Yu Wang, Yuechen Yang, Jiayuan Chen, Ruining Deng, Mengmeng Yin, Haichun Yang, Yuankai Huo, Shilin Zhao We present SpatialPathomicsToolkit, a modular and platform-agnostic toolkit for comprehensive analysis of pathomics features from whole-slide images and their integration with spatial transcriptomics data. The toolkit supports diverse data sources including CellProfiler, PySpatial, and AI-based feature extractors, and enables feature summarization, group comparison, dimensionality reduction, and correlation with transcriptomic and clinical data. We evaluated its utility across multiple renal pathology studies involving both human and mouse samples. Our results demonstrate the toolkit’s generalizability and value for spatially resolved, multimodal pathology analysis.
2025-07-22 15:50:00 16:00:00 02N TransMed CellTFusion: A novel approach to unravel cell states via cell type deconvolution and TF activity estimated from bulk RNAseq data identifies clinically relevant cell niches Marcelo Hurtado Marcelo Hurtado, Abdelmounim Essabbar, Leila Khajavi, Vera Pancaldi The tumor microenvironment (TME) plays a key role in cancer development by influencing physiopathological processes. Despite significant progress in understanding this complex system, it remains unclear why some patients respond to specific therapies while others experience recurrence. Computational methods for cell type deconvolution from bulk RNA-seq data have been developed, yet their high feature complexity and variability limit their effectiveness for patient stratification. This project introduces CellTFusion, a novel framework for characterizing TME patient profiles by constructing transcriptional regulatory networks (TRNs) based on inferred transcription factor (TF) activity and cell type deconvolution from bulk RNA-seq data. This approach can capture multiple possible cell phenotypes and states within patient samples. We applied CellTFusion to several publicly available cancer datasets, including melanoma, neuroblastoma, lung cancer, and bladder cancer. Using existing algorithms, we inferred TF activity from gene expression data and constructed TF networks from highly correlated features. These networks were integrated with cell type proportion estimates derived from both bulk and single-cell reference signatures to generate cell group scores that reflect their association. Additionally, we incorporated a robust machine learning pipeline to identify whether these potential cell states are significantly associated with clinical outcomes, including survival, recurrence, and response to immunotherapy. CellTFusion provides a novel framework to identify clusters of deconvolution features based on regulatory activities that can be used as TME profiles to enable better patient stratification. It also addresses the heterogeneity and high dimensionality of current deconvolution methods through the integration of prior-knowledge networks.
2025-07-22 16:40:00 16:50:00 02N TransMed TRESOR: a disease signature integrating GWAS and TWAS for therapeutic target discovery in rare diseases Satoko Namba Satoko Namba, Michio Iwata, Shin-Ichi Nureki, Noriko Yuyama Otani, Yoshihiro Yamanishi Identifying therapeutic targets for diseases is important in drug discovery. However, the depletion of viable therapeutic targets has become a major bottleneck, contributing to the recent stagnation in drug development, especially for rare and orphan diseases. Here, we propose a disease signature, TRESOR, which characterizes the functional mechanisms of each disease through genome-wide association study (GWAS) and transcriptome-wide association study (TWAS) data. Based on TRESOR and target perturbation signatures (TGPs), i.e., gene knockdown and overexpression profiles of target-coding genes, we develop machine learning methods for predicting inhibitory and activatory therapeutic targets for various diseases. TRESOR enables highly accurate identification of candidate target proteins based on inverse correlations between TRESOR and TGPs. Furthermore, a Bayesian integrative method combines TRESOR-based inverse correlations with omics-based disease similarities, providing more reliable predictions for rare and orphan diseases with few or no known targets. We make comprehensive predictions for 284 diseases with 4,345 inhibitory target candidates and 151 diseases with 4,040 activatory target candidates. These predictions were validated through literature-based cross-referencing and independently assessed using human cohort data, supporting their therapeutic potential. For instance, multiple endocrine neoplasia, a rare disease with only one known inhibitory target, was predicted to have new candidate targets such as FHL2, whose downregulation correlated with improved survival in independent cohorts. The proposed method was also applied to orphan diseases lacking any known targets, such as tauopathies, identifying promising candidates including RAB1B. The proposed methods are expected to be useful for understanding disease–disease relationships and identifying therapeutic targets for rare and orphan diseases.
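The inverse-correlation ranking can be sketched generically: candidate inhibitory targets are those whose knockdown profiles anti-correlate most strongly with the disease signature. Profiles below are random stand-ins under assumed shapes, not TRESOR data.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(4)
disease_signature = rng.normal(size=500)     # one value per gene
knockdown_profiles = {f"TARGET_{i}": rng.normal(size=500) for i in range(100)}

# Rank targets by Spearman correlation; the most negative come first.
scores = {t: spearmanr(disease_signature, prof).correlation
          for t, prof in knockdown_profiles.items()}
print(sorted(scores, key=scores.get)[:5])    # top inhibitory candidates
```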
2025-07-22 16:50:00 17:00:00 02N TransMed GENIUS: Genomic Evaluation using Next-generation Intelligence for Understanding & Swift Diagnosis Peter White Peter White, Bimal Chaudhari, Austin Antoniou, David Gordon, Ashley Kubatko, Ben Knutson More than 350 million individuals globally suffer from approximately 10,000 known rare diseases. Despite genomic advances, patients today experience prolonged diagnostic journeys, averaging six years, and roughly 50% remain undiagnosed. This delay often results in inappropriate care, irreversible disease progression, and increased medical costs. To address these challenges, we developed GENIUS, a comprehensive framework targeting patient identification, variant prioritization, and continuous genomic data reanalysis to accelerate diagnoses and improve patient outcomes. GENIUS integrates three innovative machine learning algorithms: NeoGX identifies undiagnosed patients through phenotypic features extracted via NLP from electronic health records, facilitating timely genetic testing referrals, particularly in NICU settings. CAVaLRi employs an advanced likelihood-ratio framework incorporating variant pathogenicity, phenotype overlap, parental genotypes, and segregation data, effectively prioritizing diagnostic genetic variants amidst noisy phenotype data. PARADIGM automates genomic data reanalysis by continuously updating clinicians as new gene-disease associations emerge. GENIUS demonstrated remarkable performance: NeoGX accurately predicted the need for genetic testing (ROC AUC = 0.855), halving testing initiation time from 62 to 31 days; CAVaLRi significantly outperformed existing methods (PR AUC = 0.701), ranking diagnostic variants first in over 70% of cases; and PARADIGM, the automated genomic reanalysis component, achieved a 40% diagnostic yield, substantially surpassing conventional methods. GENIUS exemplifies a scalable computational framework integrating predictive analytics, precise variant prioritization, and dynamic genomic reanalysis. By automating complex diagnostic workflows, GENIUS substantially accelerates diagnosis, optimizes clinical decision-making, and demonstrates the transformative potential of machine learning to advance personalized genomic medicine in rare genetic disorders.
2025-07-22 17:00:00 17:10:00 02N TransMed SIDISH Identifies High-Risk Disease-Associated Cells and Biomarkers by Integrating Single-Cell Depth and Bulk Breadth Yasmin Jolasun Yasmin Jolasun, Yumin Zheng, Kailu Song, Jingtao Wang, David H. Eidelman, Jun Ding Single-cell RNA sequencing (scRNA-seq) offers unparalleled resolution for studying cellular heterogeneity but is costly, restricting its use to small cohorts that often lack comprehensive clinical data, limiting translational relevance. In contrast, bulk RNA sequencing is scalable and cost-effective but obscures critical single-cell insights. We introduce SIDISH, a neural network framework that integrates the granularity of scRNA-seq with the scalability of bulk RNA-seq. Using a Variational Autoencoder, deep Cox regression, and transfer learning, SIDISH identifies High-Risk cell populations while enabling robust clinical predictions from large-cohort data. Its in silico perturbation module identifies therapeutic targets by simulating interventions that reduce High-Risk cells associated with adverse outcomes. Applied across diverse diseases, SIDISH establishes the link between cellular dynamics and clinical phenotypes, facilitating biomarker discovery and precision medicine. By unifying single-cell insights with large-scale clinical data, SIDISH advances computational tools for disease risk assessment and therapeutic prioritization, offering a transformative approach to precision medicine.
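The deep Cox regression component can be illustrated with the standard negative Cox partial log-likelihood (ignoring ties); a minimal PyTorch sketch of that loss, not SIDISH's implementation:

```python
import torch

def cox_loss(risk, time, event):
    """Negative Cox partial log-likelihood (no ties).

    risk: model-predicted log-risk per patient; time: follow-up time;
    event: 1.0 if the event occurred, 0.0 if censored.
    """
    order = torch.argsort(time, descending=True)   # risk sets by decreasing time
    risk, event = risk[order], event[order]
    log_cumsum = torch.logcumsumexp(risk, dim=0)   # log sum over each risk set
    return -((risk - log_cumsum) * event).sum() / event.sum()

risk = torch.tensor([0.2, -0.5, 1.1, 0.0])
time = torch.tensor([5.0, 8.0, 2.0, 10.0])
event = torch.tensor([1.0, 0.0, 1.0, 1.0])
print(cox_loss(risk, time, event))
```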
2025-07-22 17:10:00 17:50:00 02N TransMed Jonathan Carlson
2025-07-22 17:50:00 18:00:00 02N TransMed
2025-07-20 09:00:00 10:45:00 11A Tutorials Tutorial IP1: Machine Learning for Omics: Best practices and Real-Life Insights with TidyModels Omics data analysis presents unique challenges due to its high dimensionality and complexity. Supervised machine learning (ML) offers powerful tools for gaining insights from these data but currently faces a crisis of reproducibility due to poor adherence to best practices in feature selection, model evaluation, and interpretability. This full-day tutorial introduces participants to the common pitfalls and best practices of applying ML to omics research. It exemplifies good practice through examples using the Tidymodels framework for ML workflows in R, tailored to omics applications. The course will feature a mixture of lectures, quizzes, real-life coding tutorials and hands-on practicals with 1-1 support. Example applications will illustrate regression analysis with methylation clocks, gene prioritisation, and classification with cancer biomarker discovery. Special attention will be paid to the challenges of working with highly multivariate data and integrating various data types, as well as tips for extracting meaningful insights from complex data. Beginner-level R skills are required, and attendees will leave with practical skills to apply Tidymodels to their own datasets.
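One of the best practices this tutorial covers, keeping feature selection inside the cross-validation loop so high-dimensional omics models are not evaluated over-optimistically, can be sketched as nested cross-validation. The sketch below is in Python/scikit-learn rather than the tutorial's R/Tidymodels, purely as an illustration of the principle.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic "omics-like" data: far more features than samples.
X, y = make_classification(n_samples=200, n_features=2000, n_informative=20,
                           random_state=0)

# Feature selection lives inside the pipeline, so each CV fold reselects
# features on its own training split (no information leakage).
pipe = Pipeline([("select", SelectKBest(k=50)),
                 ("clf", LogisticRegression(max_iter=1000))])
inner = GridSearchCV(pipe, {"select__k": [20, 50, 100]}, cv=3)
print(cross_val_score(inner, X, y, cv=5).mean())   # unbiased outer estimate
```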
2025-07-20 09:00:00 10:45:00 03A Tutorials Tutorial IP2: Massively parallel reporter assays in functional regulatory genomics and as part of the IGVF data resource This tutorial is designed to empower bioinformatics researchers with the knowledge and skills to effectively utilize Massively Parallel Reporter Assay (MPRA) data in their work. MPRAs are gaining wider application across the functional genomics community and are used as part of the Impact of Genomic Variation on Function (IGVF) Consortium. IGVF is a collaborative research initiative funded by the NHGRI that aims to systematically study how genomic variations affect genome function and, consequently, phenotypes. By integrating experimental and computational approaches, IGVF seeks to map and predict the functional impacts of genetic variants, providing a comprehensive catalog of these effects. This tutorial provides a thorough introduction to MPRAs and IGVF data resources, practical training on MPRA data, and insights into advanced analysis methods for such data. Participants will gain an understanding of MPRA experiments, including their various experimental designs and the rationale for using them in functional genomics. This will involve learning the process of associating tags/barcodes with the sequences incorporated in the reporter constructs from raw sequencing reads, and counting barcodes from DNA sequencing and RNA expression. The tutorial will guide participants through data processing using MPRAsnakeflow, a streamlined Snakemake workflow developed with IGVF for efficient MPRA data handling and QC reporting. Statistical analysis for sequence-level and variant-level effect testing of MPRA count data will be introduced using BCalm, a barcode-level MPRA analysis package developed as part of our IGVF efforts. Further, the tutorial will provide a starting point for training (deep learning) sequence models on MPRA data and related functional genomics datasets. Participants will learn how to extract meaningful insights from their datasets by investigating the sequence-activity relationship and extracting important sequence motifs. By integrating these topics and methods, participants will leave the tutorial equipped with both the theoretical knowledge and practical skills necessary for analyzing and using MPRA data effectively.
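The barcode-counting step described above can be sketched as a simple FASTQ tally; the barcode position and length are assumptions for illustration (real designs vary, which is what MPRAsnakeflow handles).

```python
import gzip
from collections import Counter

def count_barcodes(fastq_gz, start=0, length=15):
    """Tally fixed-position barcodes from a gzipped FASTQ file."""
    counts = Counter()
    with gzip.open(fastq_gz, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:                         # sequence lines only
                counts[line[start:start + length]] += 1
    return counts

# Conceptually, the per-barcode RNA/DNA count ratio then approximates the
# regulatory activity of the associated construct:
#   activity[bc] = rna_counts[bc] / dna_counts[bc]
```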
2025-07-20 09:00:00 10:45:00 04AB Tutorials Tutorial IP3: Genomic Variant Interpretation & prioritisation for clinical research The interpretation of genetic variation is important for understanding human health and disease. Increased knowledge leads to societal benefits including faster disease diagnosis, a better understanding of disease progression, and more efficient identification and prioritisation of drug targets for testing, resulting in overall better health outcomes for a population. Whilst sequencing has become faster and cheaper, the complexity of variant interpretation remains a bottleneck to understanding. This tutorial will explore the variety of annotations and techniques available to assess human variation and the implications of variant effects on human health and disease.
2025-07-20 09:00:00 10:45:00 03B Tutorials Tutorial IP4: Quantum Machine Learning for multi-omics analysis Single-cell and population-level multi-omics analyses have greatly enhanced our understanding of biological complexity. By integrating various types of biological data—such as genomics, proteomics, and transcriptomics, collectively known as multi-omics—these approaches have provided deep insights into the molecular mechanisms underlying complex diseases, both at the cellular level and across patient populations. As the size and complexity of multi-omics data continue to grow, so does the need to leverage emerging technologies such as artificial intelligence (AI) and quantum computing (QC). Recently, advances in QC have shown promise in solving real-world problems in machine learning and optimization in biomedicine, drug discovery, biomarker discovery, clinical trials, and other healthcare and life sciences objectives [1,2,3,4,5]. In this tutorial, participants will learn the fundamental concepts of QC and engage in hands-on experiments that apply classical machine learning (ML) techniques. They will also learn best practices for pre-processing multi-omics data in preparation for quantum machine learning (QML) tasks. Through a systematic evaluation of various data complexity measures and their impact on the performance of different ML and QML models, participants will gain insights into when to use QML models effectively. Additionally, they will explore quantum-classical hybrid workflows for ML, with a focus on biomedical data analysis.
2025-07-20 09:00:00 10:45:00 12 Tutorials Tutorial IP5: Introduction to Causal Analysis using Mendelian Randomisation Mendelian randomisation (MR) is a method that uses genetic variation associated with an exposure (e.g., behaviours, biomarkers) to infer its causal effect on an outcome (e.g., health status). In statistical terms, it functions as an "instrumental variable" approach. By mimicking the design of a randomised controlled trial through genetic inheritance, MR provides a framework for addressing confounding and reverse causation, making it a valuable tool in epidemiological and biomedical research. This workshop offers a beginner-friendly introduction to the key concepts and assumptions underlying MR, such as the use of genome-wide association study (GWAS) data and the three key assumptions for valid instrumental variables: relevance, independence, and exclusion restriction. Participants will explore common challenges in MR analysis, including pleiotropy, population stratification, and measurement error, while learning strategies to overcome these using advanced methods. The workshop also includes a two-hour hands-on session in which attendees will work with real-world data to conduct MR analyses using R. By the end of the session, participants will have a clear understanding of MR principles, the ability to critically evaluate MR studies, and practical skills to apply MR methods in their own research.
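The workhorse MR estimator covered in such introductions is the inverse-variance-weighted (IVW) combination of per-variant Wald ratios. A minimal sketch follows (in Python rather than the workshop's R, with made-up summary statistics purely for illustration):

```python
import numpy as np

def ivw_estimate(beta_exp, beta_out, se_out):
    """Fixed-effects IVW causal estimate from GWAS summary statistics."""
    w = beta_exp ** 2 / se_out ** 2        # precision weights of Wald ratios
    ratios = beta_out / beta_exp           # per-SNP Wald ratio estimates
    estimate = np.sum(w * ratios) / np.sum(w)
    se = np.sqrt(1.0 / np.sum(w))
    return estimate, se

beta_exp = np.array([0.10, 0.08, 0.12])    # SNP -> exposure effects
beta_out = np.array([0.05, 0.03, 0.07])    # SNP -> outcome effects
se_out = np.array([0.01, 0.01, 0.02])      # SEs of the outcome effects
print(ivw_estimate(beta_exp, beta_out, se_out))
```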
2025-07-20 09:00:00 10:45:00 11BC Tutorials Tutorial IP6: Hello Nextflow: Getting started with workflows for bioinformatics Nextflow is a powerful and flexible open-source workflow management system that simplifies the development, execution, and scalability of data-driven computational pipelines. It is widely used in bioinformatics and other scientific fields to automate complex analyses, making it easier to manage and reproduce large-scale data analysis workflows. This training workshop is intended as a “getting started” course for students and early-career researchers who are completely new to Nextflow. It aims to equip participants with foundational knowledge and skills in three key areas: (1) understanding the logic of how data analysis workflows are constructed, (2) Nextflow language proficiency and (3) command-line interface (CLI) execution. Participants will be guided through hands-on, goal-oriented exercises that will allow them to practice the following skills: Use core components of the Nextflow language to construct simple multi-step workflows effectively. Launch Nextflow workflows locally, navigate output directories to access results, interpret log outputs for insights into workflow execution, and troubleshoot basic issues that may arise during workflow execution. By the end of the workshop, participants will be well-prepared for tackling the next steps in their journey to develop and apply reproducible workflows for their scientific computing needs. Additional study-at-home materials will be provided for them to continue learning and developing their skills further.
2025-07-20 11:00:00 13:00:00 11A Tutorials Tutorial IP1: Machine Learning for Omics: Best practices and Real-Life Insights with TidyModels Omics data analysis presents unique challenges due to its high dimensionality and complexity. Supervised machine learning (ML) offers powerful tools for gaining insights from these data but currently faces a crisis of reproducibility due to poor adherence to best practices in feature selection, model evaluation, and interpretability. This full-day tutorial introduces participants to the common pitfalls and best practices of applying ML to omics research. It exemplifies good practice through examples using the Tidymodels framework for ML workflows in R, tailored to omics applications. The course will feature a mixture of lectures, quizzes, real-life coding tutorials and hands-on practicals with 1-1 support. Example applications will illustrate regression analysis with methylation clocks, gene prioritisation, and classification with cancer biomarker discovery. Special attention will be paid to the challenges of working with highly multivariate data and integrating various data types, as well as tips for extracting meaningful insights from complex data. Beginner-level R skills are required, and attendees will leave with practical skills to apply Tidymodels to their own datasets.
2025-07-20 11:00:00 13:00:00 03A Tutorials Tutorial IP2: Massively parallel reporter assays in functional regulatory genomics and as part of the IGVF data resource This tutorial is designed to empower bioinformatics researchers with the knowledge and skills to effectively utilize Massively Parallel Reporter Assay (MPRA) data in their work. MPRAs are gaining wider application across the functional genomics community and are used as part of the Impact of Genomic Variation on Function (IGVF) Consortium. IGVF is a collaborative research initiative funded by the NHGRI that aims to systematically study how genomic variations affect genome function and, consequently, phenotypes. By integrating experimental and computational approaches, IGVF seeks to map and predict the functional impacts of genetic variants, providing a comprehensive catalog of these effects. This tutorial provides a thorough introduction to MPRAs and IGVF data resources, practical training on MPRA data, and insights into advanced analysis methods for such data. Participants will gain an understanding of MPRA experiments, including their various experimental designs and the rationale for using them in functional genomics. This will involve learning the process of associating tags/barcodes with the sequences incorporated in the reporter constructs from raw sequencing reads, and counting barcodes from DNA sequencing and RNA expression. The tutorial will guide participants through data processing using MPRAsnakeflow, a streamlined Snakemake workflow developed with IGVF for efficient MPRA data handling and QC reporting. Statistical analysis for sequence-level and variant-level effect testing of MPRA count data will be introduced using BCalm, a barcode-level MPRA analysis package developed as part of our IGVF efforts. Further, the tutorial will provide a starting point for training (deep learning) sequence models on MPRA data and related functional genomics datasets. Participants will learn how to extract meaningful insights from their datasets by investigating the sequence-activity relationship and extracting important sequence motifs. By integrating these topics and methods, participants will leave the tutorial equipped with both the theoretical knowledge and practical skills necessary for analyzing and using MPRA data effectively.
2025-07-20 11:00:00 13:00:00 04AB Tutorials Tutorial IP3: Genomic Variant Interpretation & prioritisation for clinical research The interpretation of genetic variation is important for understanding human health and disease. Increased knowledge leads to societal benefits including faster disease diagnosis, a better understanding of disease progression, and more efficient identification and prioritisation of drug targets for testing, resulting in overall better health outcomes for a population. Whilst sequencing has become faster and cheaper, the complexity of variant interpretation remains a bottleneck to understanding. This tutorial will explore the variety of annotations and techniques available to assess human variation and the implications of variant effects on human health and disease.
2025-07-20 11:00:00 13:00:00 03B Tutorials Tutorial IP4: Quantum Machine Learning for multi-omics analysis Single-cell and population-level multi-omics analyses have greatly enhanced our understanding of biological complexity. By integrating various types of biological data—such as genomics, proteomics, and transcriptomics, collectively known as multi-omics—these approaches have provided deep insights into the molecular mechanisms underlying complex diseases, both at the cellular level and across patient populations. As the size and complexity of multi-omics data continue to grow, so does the need to leverage emerging technologies such as artificial intelligence (AI) and quantum computing (QC). Recently, advances in QC have shown promise in solving real-world problems in machine learning and optimization in biomedicine, drug discovery, biomarker discovery, clinical trials, and other healthcare and life sciences objectives [1,2,3,4,5]. In this tutorial, participants will learn the fundamental concepts of QC and engage in hands-on experiments that apply classical machine learning (ML) techniques. They will also learn best practices for pre-processing multi-omics data in preparation for quantum machine learning (QML) tasks. Through a systematic evaluation of various data complexity measures and their impact on the performance of different ML and QML models, participants will gain insights into when to use QML models effectively. Additionally, they will explore quantum-classical hybrid workflows for ML, with a focus on biomedical data analysis.
2025-07-20 11:00:00 13:00:00 12 Tutorials Tutorial IP5: Introduction to Causal Analysis using Mendelian Randomisation Mendelian randomisation (MR) is a method that uses genetic variation associated with an exposure (e.g., behaviours, biomarkers) to infer its causal effect on an outcome (e.g., health status). In statistical terms, it functions as an "instrumental variable" approach. By mimicking the design of a randomised controlled trial through genetic inheritance, MR provides a framework for addressing confounding and reverse causation, making it a valuable tool in epidemiological and biomedical research. This workshop offers a beginner-friendly introduction to the key concepts and assumptions underlying MR, such as the use of genome-wide association study (GWAS) data and the three key assumptions for valid instrumental variables: relevance, independence, and exclusion restriction. Participants will explore common challenges in MR analysis, including pleiotropy, population stratification, and measurement error, while learning strategies to overcome these using advanced methods. The workshop also includes a two-hour hands-on session in which attendees will work with real-world data to conduct MR analyses using R. By the end of the session, participants will have a clear understanding of MR principles, the ability to critically evaluate MR studies, and practical skills to apply MR methods in their own research.
2025-07-20 11:00:00 13:00:00 11BC Tutorials Tutorial IP6: Hello Nextflow: Getting started with workflows for bioinformatics Nextflow is a powerful and flexible open-source workflow management system that simplifies the development, execution, and scalability of data-driven computational pipelines. It is widely used in bioinformatics and other scientific fields to automate complex analyses, making it easier to manage and reproduce large-scale data analysis workflows. This training workshop is intended as a “getting started” course for students and early-career researchers who are completely new to Nextflow. It aims to equip participants with foundational knowledge and skills in three key areas: (1) understanding the logic of how data analysis workflows are constructed, (2) Nextflow language proficiency and (3) command-line interface (CLI) execution. Participants will be guided through hands-on, goal-oriented exercises that will allow them to practice the following skills: Use core components of the Nextflow language to construct simple multi-step workflows effectively. Launch Nextflow workflows locally, navigate output directories to access results, interpret log outputs for insights into workflow execution, and troubleshoot basic issues that may arise during workflow execution. By the end of the workshop, participants will be well-prepared for tackling the next steps in their journey to develop and apply reproducible workflows for their scientific computing needs. Additional study-at-home materials will be provided for them to continue learning and developing their skills further.
2025-07-20 14:00:00 16:00:00 11BC Tutorials Tutorial IP7: AI large cellular models and in-silico perturbation Transformer-based large language models (LLMs) are changing the world. The capabilities they have demonstrated in sophisticated natural language, vision and multi-modal tasks have inspired the development of large cellular models (LCMs) for single-cell transcriptomic data, such as scBERT, Geneformer, scGPT, scFoundation, GeneCompass, and scMulan. After pretraining on massive amounts of single-cell RNA-seq data agnostic to any downstream task, these transformer-based models have demonstrated exceptional performance in various tasks such as cell type annotation, data integration, gene network inference, and the prediction of drug sensitivity or perturbation responses. These advancements, albeit still at an early stage, suggest promising approaches for leveraging AI to understand the complex system of the cell from datasets beyond human analytical capacity. In particular, such models have made it possible to conduct in-silico perturbation of cells of various types, predicting their responses to gene perturbations without performing experiments on the cells. These models provide prototypes of digital virtual cells that can be used to reconstruct and simulate live cells, which may revolutionize many aspects of future biomedical studies. Although the community is highly enthusiastic about this progress, the structures and algorithms of LCMs and other similar-scale AI models remain mysterious to many people without the relevant background. This tutorial aims to fill that gap. We will begin with an introduction to the basic principles of deep neural networks, then explain the basic structure and algorithm of the original Transformer for natural language tasks. We will show attendees how to build such models on current machine learning platforms. We will then introduce several successful ways to build large cellular models on top of the basic Transformer model, and review how such models are pretrained on single-cell RNA-seq data. We will show, and let attendees practice, how to use LCMs for basic tasks such as cell type annotation, and look into the specific application of LCMs to in-silico perturbation tasks. Attendees will engage in hands-on activities such as building basic transformer models and executing downstream single-cell tasks, including cell type annotation and in-silico perturbation. These activities will demystify LCMs and help attendees better understand how LCMs can be built and applied.
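The in-silico perturbation idea covered in this tutorial can be sketched model-agnostically: delete a gene from the input expression profile, re-embed the cell with the pretrained model, and measure how far the cell state shifts. In this sketch, `model.embed` is a hypothetical stand-in for any LCM encoder; real LCMs operate on tokenised gene inputs rather than raw vectors.

```python
import numpy as np

def perturbation_shift(model, expression, gene_index):
    """Magnitude of the cell-state change after an in-silico knockout.

    model: any object exposing embed(expression_vector) -> latent vector
           (hypothetical stand-in for a pretrained large cellular model).
    """
    perturbed = expression.copy()
    perturbed[gene_index] = 0.0            # simulate knockout of one gene
    z_before = model.embed(expression)
    z_after = model.embed(perturbed)
    return np.linalg.norm(z_after - z_before)
```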
2025-07-20 14:00:00 16:00:00 12 Tutorials Tutorial IP8: Representation Learning and Feature Engineering for Genomic Sequences Analysis Machine learning (ML) has been successfully applied to different omics problems, such as sequence classification in the field of genomics. The effectiveness of ML methods relies greatly on the selection of the data representation, or features, that extract meaningful information from sequences. Genomic sequences can be viewed as one-dimensional strings of successive letters representing nucleotides. However, to make these sequences compatible with ML methods, they must first be transformed into structured numerical representations, such as vectors or matrices. Traditional methods for sequence classification often rely on manually crafted or pre-defined features, which require domain expertise and may not fully capture the complexity of the underlying biological information. Recently, representation learning has emerged as a powerful alternative, enabling the automatic extraction of latent patterns directly from raw data and reducing the dependence on manually crafted features. In genomics, techniques like Word2Vec, Convolutional Neural Networks (CNNs) and Large Language Models (LLMs) have demonstrated the ability to learn optimal sequence representations that effectively capture both local and global patterns in DNA and RNA sequences. This tutorial provides a comprehensive introduction to feature engineering and representation learning for genomic sequences (DNA/RNA). Participants will explore traditional techniques for extracting features from genomic sequences, building a foundation in classical approaches. The tutorial will then cover representation learning, introducing concepts such as embeddings and their applications, including methods such as Word2Vec and LLMs for obtaining meaningful representations of genomic sequences. Through hands-on exercises and comparative analyses, attendees will learn to combine traditional feature engineering with representation learning approaches, developing practical skills and insights that are adaptable to diverse genomic research challenges. The goal is to equip participants with the knowledge and tools to enhance genomic sequence analysis using different techniques for sequence representation.
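The Word2Vec-on-k-mers approach mentioned above can be sketched in a few lines: tokenise each sequence into overlapping k-mers and train embeddings so k-mers occurring in similar contexts get similar vectors. The sequences, k = 6, and hyperparameters below are toy choices for illustration.

```python
from gensim.models import Word2Vec

def kmers(seq, k=6):
    """Overlapping k-mer tokens of a nucleotide sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

sequences = ["ATGCGTACGTTAGCATCGA", "ATGCGTACGTAAGCTTCGA", "TTTTACGCGCGATATATCG"]
corpus = [kmers(s) for s in sequences]

model = Word2Vec(corpus, vector_size=32, window=5, min_count=1, epochs=50)
print(model.wv["ATGCGT"][:4])   # learned embedding of one 6-mer
```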
2025-07-20 14:00:00 16:00:00 11A Tutorials Tutorial IP1: Machine Learning for Omics: Best practices and Real-Life Insights with TidyModels Omics data analysis presents unique challenges due to its high dimensionality and complexity. Supervised machine learning (ML) offers powerful tools for gaining insights from these data but currently faces a crisis of reproducibility due to poor adherence to best practices in feature selection, model evaluation, and interpretability. This full-day tutorial introduces participants to the common pitfalls and best practices of applying ML to omics research. It exemplifies good practice through examples using the Tidymodels framework for ML workflows in R, tailored to omics applications. The course will feature a mixture of lectures, quizzes, real-life coding tutorials and hands-on practicals with 1-1 support. Example applications will illustrate regression analysis with methylation clocks, gene prioritisation, and classification with cancer biomarker discovery. Special attention will be paid to the challenges of working with highly multivariate data and integrating various data types, as well as tips for extracting meaningful insights from complex data. Beginner-level R skills are required, and attendees will leave with practical skills to apply Tidymodels to their own datasets.
2025-07-20 14:00:00 16:00:00 03A Tutorials Tutorial IP2: Massively parallel reporter assays in functional regulatory genomics and as part of the IGVF data resource This tutorial is designed to empower bioinformatics researchers with the knowledge and skills to effectively utilize Massively Parallel Reporter Assay (MPRA) data in their work. MPRAs are gaining wider application across the functional genomics community and are used as part of the Impact of Genomic Variation on Function (IGVF) Consortium. IGVF is a collaborative research initiative funded by the NHGRI that aims to systematically study how genomic variations affect genome function and, consequently, phenotypes. By integrating experimental and computational approaches, IGVF seeks to map and predict the functional impacts of genetic variants, providing a comprehensive catalog of these effects. This tutorial provides a thorough introduction to MPRAs and IGVF data resources, practical training on MPRA data, and insights into advanced analysis methods for such data. Participants will gain an understanding of MPRA experiments, including their various experimental designs and the rationale for using them in functional genomics. This will involve learning the process of associating tags/barcodes with the sequences incorporated in the reporter constructs from raw sequencing reads, and counting barcodes from DNA sequencing and RNA expression. The tutorial will guide participants through data processing using MPRAsnakeflow, a streamlined Snakemake workflow developed with IGVF for efficient MPRA data handling and QC reporting. Statistical analysis for sequence-level and variant-level effect testing of MPRA count data will be introduced using BCalm, a barcode-level MPRA analysis package developed as part of our IGVF efforts. Further, the tutorial will provide a starting point for training (deep learning) sequence models on MPRA data and related functional genomics datasets. Participants will learn how to extract meaningful insights from their datasets by investigating the sequence-activity relationship and extracting important sequence motifs. By integrating these topics and methods, participants will leave the tutorial equipped with both the theoretical knowledge and practical skills necessary for analyzing and using MPRA data effectively.
2025-07-20 14:00:00 16:00:00 04AB Tutorials Tutorial IP3: Genomic Variant Interpretation & prioritisation for clinical research The interpretation of genetic variation is important for understanding human health and disease. Increased knowledge leads to societal benefits including faster disease diagnosis, a better understanding of disease progression, and more efficient identification and prioritisation of drug targets for testing, resulting in overall better health outcomes for a population. Whilst sequencing has become faster and cheaper, the complexity of variant interpretation remains a bottleneck to understanding. This tutorial will explore the variety of annotations and techniques available to assess human variation and the implications of variant effects on human health and disease.
2025-07-20 14:00:00 16:00:00 03B Tutorials Tutorial IP4: Quantum Machine Learning for multi-omics analysis Single-cell and population-level multi-omics analyses have greatly enhanced our understanding of biological complexity. By integrating various types of biological data—such as genomics, proteomics, and transcriptomics, collectively known as multi-omics—these approaches have provided deep insights into the molecular mechanisms underlying complex diseases, both at the cellular level and across patient populations. As the size and complexity of multi-omics data continue to grow, so does the need to leverage emerging technologies such as artificial intelligence (AI) and quantum computing (QC). Recently, advances in QC have shown promise in solving real-world problems in machine learning and optimization in biomedicine, drug discovery, biomarker discovery, clinical trials, and other healthcare and life sciences objectives [1,2,3,4,5]. In this tutorial, participants will learn the fundamental concepts of QC and engage in hands-on experiments that apply classical machine learning (ML) techniques. They will also learn best practices for pre-processing multi-omics data in preparation for quantum machine learning (QML) tasks. Through a systematic evaluation of various data complexity measures and their impact on the performance of different ML and QML models, participants will gain insights into when to use QML models effectively. Additionally, they will explore quantum-classical hybrid workflows for ML, with a focus on biomedical data analysis.
2025-07-20 16:15:00 18:00:00 11BC Tutorials Tutorial IP7: AI large cellular models and in-silico perturbation Transformer-based large language models (LLMs) are changing the world. The capabilities they have demonstrated in sophisticated natural language, vision and multi-modal tasks have inspired the development of large cellular models (LCMs) for single-cell transcriptomic data, such as scBERT, Geneformer, scGPT, scFoundation, GeneCompass, and scMulan. After pretraining on massive amounts of single-cell RNA-seq data agnostic to any downstream task, these transformer-based models have demonstrated exceptional performance in various tasks such as cell type annotation, data integration, gene network inference, and the prediction of drug sensitivity or perturbation responses. These advancements, albeit still at an early stage, suggest promising approaches for leveraging AI to understand the complex system of the cell from datasets beyond human analytical capacity. In particular, such models have made it possible to conduct in-silico perturbation of cells of various types, predicting their responses to gene perturbations without performing experiments on the cells. These models provide prototypes of digital virtual cells that can be used to reconstruct and simulate live cells, which may revolutionize many aspects of future biomedical studies. Although the community is highly enthusiastic about this progress, the structures and algorithms of LCMs and other similar-scale AI models remain mysterious to many people without the relevant background. This tutorial aims to fill that gap. We will begin with an introduction to the basic principles of deep neural networks, then explain the basic structure and algorithm of the original Transformer for natural language tasks. We will show attendees how to build such models on current machine learning platforms. We will then introduce several successful ways to build large cellular models on top of the basic Transformer model, and review how such models are pretrained on single-cell RNA-seq data. We will show, and let attendees practice, how to use LCMs for basic tasks such as cell type annotation, and look into the specific application of LCMs to in-silico perturbation tasks. Attendees will engage in hands-on activities such as building basic transformer models and executing downstream single-cell tasks, including cell type annotation and in-silico perturbation. These activities will demystify LCMs and help attendees better understand how LCMs can be built and applied.
2025-07-20 16:15:00 18:00:00 12 Tutorials Tutorial IP8: Representation Learning and Feature Engineering for Genomic Sequences Analysis Machine learning (ML) has been successfully applied to different omics problems, such as sequence classification in the field of genomics. The effectiveness of ML methods relies greatly on the selection of the data representation, or features, that extract meaningful information from sequences. Genomic sequences can be viewed as one-dimensional strings of successive letters representing nucleotides. However, to make these sequences compatible with ML methods, they must first be transformed into structured numerical representations, such as vectors or matrices. Traditional methods for sequence classification often rely on manually crafted or pre-defined features, which require domain expertise and may not fully capture the complexity of the underlying biological information. Recently, representation learning has emerged as a powerful alternative, enabling the automatic extraction of latent patterns directly from raw data and reducing the dependence on manually crafted features. In genomics, representation learning techniques such as Word2Vec, Convolutional Neural Networks (CNNs) and Large Language Models (LLMs) have demonstrated the ability to learn sequence representations that effectively capture both local and global patterns in DNA and RNA sequences. This tutorial provides a comprehensive introduction to feature engineering and representation learning for genomic sequences (DNA/RNA). Participants will explore traditional techniques for extracting features from genomic sequences, building a foundation in classical approaches. Furthermore, the tutorial will cover representation learning, introducing concepts such as embeddings and their applications, including methods such as Word2Vec and LLMs for obtaining meaningful representations from genomic sequences. Through hands-on exercises and comparative analyses, attendees will learn to combine traditional feature engineering with representation learning approaches, developing practical skills and insights that are adaptable to diverse genomic research challenges. The goal is to offer participants the knowledge and tools to enhance genomic sequence analysis using different techniques for sequence representation.
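As a concrete illustration of the "traditional feature engineering" this tutorial contrasts with learned embeddings, the sketch below maps a DNA sequence to a normalised k-mer frequency vector, one of the simplest hand-crafted numerical representations. This is a minimal example written for this summary, not the tutorial's own code.

```python
# Minimal sketch: turn a DNA string into a k-mer frequency vector that
# ML methods can consume. Ambiguous bases such as 'N' are skipped.
from itertools import product
import numpy as np

def kmer_vector(seq, k=3):
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    v = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:                 # drop windows containing non-ACGT bases
            v[index[kmer]] += 1
    return v / max(v.sum(), 1)            # normalise to frequencies

print(kmer_vector("ACGTACGTNACGT")[:8])   # first 8 of the 64 3-mer frequencies
```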
2025-07-20 16:15:00 18:00:00 11A Tutorials Tutorial IP1: Machine Learning for Omics: Best practices and Real-Life Insights with TidyModels Omics data analysis presents unique challenges due to its high dimensionality and complexity. Supervised machine learning (ML) offers powerful tools for gaining insights from these data but currently faces a crisis of reproducibility due to poor adherence to best practices in feature selection, model evaluation, and interpretability. This full-day tutorial introduces participants to the common pitfalls and best practices of applying ML to omics research. It exemplifies good practice through worked examples using the Tidymodels framework for ML workflows in R, tailored to omics applications. The course will feature a mixture of lectures, quizzes, real-life coding tutorials and hands-on practicals with 1-1 support. Example applications will illustrate regression analysis with methylation clocks, gene prioritisation, and classification with cancer biomarker discovery. Special attention will be paid to the challenges of working with highly multivariate data and integrating various data types, as well as tips for extracting meaningful insights from complex data. Beginner-level R skills are required, and attendees will leave with practical skills to apply Tidymodels to their own datasets.
2025-07-20 16:15:00 18:00:00 03A Tutorials Tutorial IP2: Massively parallel reporter assays in functional regulatory genomics and as part of the IGVF data resource This tutorial is designed to empower bioinformatics researchers with the knowledge and skills to effectively utilize Massively Parallel Reporter Assay (MPRA) data in their work. MPRAs are gaining wider application across the functional genomics community and are used as part of the Impact of Genomic Variation on Function (IGVF) Consortium. IGVF is a collaborative research initiative funded by the NHGRI that aims to systematically study how genomic variation affects genome function and, consequently, phenotypes. By integrating experimental and computational approaches, IGVF seeks to map and predict the functional impacts of genetic variants, providing a comprehensive catalog of these effects. This tutorial provides a thorough introduction to MPRAs and IGVF data resources, practical training on MPRA data, and insights into advanced analysis methods for such data. Participants will gain an understanding of MPRA experiments, including their various experimental designs and the rationale for using them in functional genomics. This will involve learning how to associate tags/barcodes with the sequences incorporated in the reporter constructs from raw sequencing reads, and how to count barcodes from DNA sequencing and RNA expression. The tutorial will guide participants through data processing using MPRAsnakeflow, a streamlined snakemake workflow developed with IGVF for efficient MPRA data handling and QC reporting. Statistical analysis for sequence-level and variant-level effect testing of MPRA count data will be introduced using BCalm, a barcode-level MPRA analysis package developed as part of our IGVF efforts. Further, the tutorial will provide a starting point for training (deep learning) sequence models on MPRA data and related functional genomics datasets. Participants will learn how to extract meaningful insights from their datasets by investigating the sequence-activity relationship and extracting important sequence motifs. By integrating these topics and methods, participants will leave the tutorial equipped with both the theoretical knowledge and practical skills necessary for analyzing and using MPRA data effectively.
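The core counting step described above reduces to tallying barcodes in the DNA and RNA libraries and scoring each element's activity. The toy sketch below illustrates the idea with a log2(RNA/DNA) score; real pipelines such as MPRAsnakeflow add barcode-to-element mapping, error correction and QC, so treat this purely as an illustration with invented reads.

```python
# Minimal sketch of MPRA barcode counting: count each barcode in the DNA and
# RNA libraries, then score activity as log2(RNA/DNA) with a pseudocount.
from collections import Counter
import math

dna_reads = ["AAAC", "AAAC", "GGTT", "GGTT", "GGTT", "CCAG"]   # toy barcodes
rna_reads = ["AAAC", "AAAC", "AAAC", "AAAC", "GGTT", "CCAG"]

dna, rna = Counter(dna_reads), Counter(rna_reads)
for barcode in sorted(dna):
    # +1 pseudocount keeps the ratio defined for dropout barcodes
    activity = math.log2((rna[barcode] + 1) / (dna[barcode] + 1))
    print(barcode, f"log2(RNA/DNA) = {activity:+.2f}")
```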
2025-07-20 16:15:00 18:00:00 04AB Tutorials Tutorial IP3: Genomic Variant Interpretation & prioritisation for clinical research The interpretation of genetic variation is important for understanding human health and disease. Increased knowledge leads to societal benefits including faster disease diagnosis, a better understanding of disease progression, and more efficient identification and prioritisation of drug targets for testing, resulting in overall better health outcomes for a population. Whilst sequencing has become faster and cheaper, the complexity of variant interpretation remains a bottleneck to understanding. This tutorial will explore the variety of annotations and techniques available to assess human variation and the implications of variant effects on human health and disease.
2025-07-20 16:15:00 18:00:00 03B Tutorials Tutorial IP4: Quantum Machine Learning for multi-omics analysis Single-cell and population-level multi-omics analyses have greatly enhanced our understanding of biological complexity. By integrating various types of biological data—such as genomics, proteomics, and transcriptomics, collectively known as multi-omics—these approaches have provided deep insights into the molecular mechanisms underlying complex diseases, both at the cellular level and across patient populations. As the size and complexity of multi-omics data continue to grow, so does the need to leverage emerging technologies such as artificial intelligence (AI) and quantum computing (QC). Recently, advances in QC have shown promise in solving real-world problems in machine learning and optimization in biomedicine, drug discovery, biomarker discovery, clinical trials, and other healthcare and life sciences applications [1,2,3,4,5]. In this tutorial, participants will learn the fundamental concepts of QC and engage in hands-on experiments that apply classical machine learning (ML) techniques. They will also learn best practices for pre-processing multi-omics data in preparation for quantum machine learning (QML) tasks. Through a systematic evaluation of various data complexity measures and their impact on the performance of different ML and QML models, participants will gain insights into when QML models can be used effectively. Additionally, they will explore quantum-classical hybrid workflows for ML, with a focus on biomedical data analysis.
2025-07-21 11:20:00 12:00:00 01B Bioinformatics in the UK Molecular Digitisation and Biodiversity Bioinformatics Paul Kersey Paul Kersey Biological collections (such as herbarium and fungarium specimens) are the precursors of modern biobanks; the defining types of taxonomic concepts; together with their associated metadata, a record of what lifeforms were found where and when; and, increasingly, a physical reference from which molecular data can be extracted. Digitisation of specimen images and metadata, and molecular characterisation through DNA sequencing, are making historic collections newly relevant to contemporary scientific questions. Although DNA degrades with age, it is still possible to obtain significant information about the phylogenetic placement and gene content of many specimens. In this talk, Dr. Kersey will present three large-scale sequencing projects that utilise the collections of the Royal Botanic Gardens, Kew: the Plant and Fungal Trees of Life project, the Darwin Tree of Life project, and the Fungarium Sequencing project. He will discuss the data they are generating, the challenges these raise and the opportunities these present, and the changing role of collections in the scientific community as biodiversity science becomes a big data field.
2025-07-21 12:00:00 12:20:00 01B Bioinformatics in the UK KnetMiner for Smarter Science: Leveraging Knowledge Graphs & LLMs for Productive Gene Research Arne De Klerk Arne De Klerk, Marco Brandizi, Sam Holegar, Alex Warr, Sardor Asatillaev, Keywan Hassani-Pak In the interpretation of high‑throughput genomic data, the identification of candidate genes underlying differential expression or genome‑wide association study (GWAS) signals remains a major challenge. Here, we describe recent enhancements to the KnetMiner platform, which integrates knowledge mining, large language models (LLMs) and retrieval‑augmented generation (RAG) to accelerate gene discovery. KnetMiner constructs a comprehensive knowledge graph by integrating curated ontologies, structured databases and literature‑derived relationships. Upon input of a gene list or genomic loci, semantic queries extract relevant subgraphs that are transformed into context‑aware prompts for an LLM. Through RAG, the model retrieves supporting evidence from external sources - including publications and functional annotations - to produce gene summaries and prioritisation scores. We will present the platform's modular architecture and real use cases of KnetMiner assisting scientists in mining for candidate genes for complex traits in wheat and other crops.
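The retrieval-augmented step described above, turning a retrieved knowledge-graph subgraph into a context-aware prompt, can be sketched in a few lines. The triples, gene identifiers and prompt template below are invented for illustration; KnetMiner's actual graph schema and prompts will differ.

```python
# Minimal sketch of RAG prompt assembly from knowledge-graph triples.
# All identifiers and relations here are invented examples.
triples = [
    ("TraesCS1A02G123456", "associated_with", "drought tolerance"),
    ("TraesCS1A02G123456", "homologous_to", "AT1G01060"),
    ("AT1G01060", "described_in", "PMID:12345678"),
]

def build_prompt(gene, subgraph):
    evidence = "\n".join(f"- {s} {p.replace('_', ' ')} {o}" for s, p, o in subgraph)
    return (
        f"Using only the evidence below, summarise why {gene} may be a "
        f"candidate gene for the trait of interest and give a prioritisation "
        f"score from 0 to 1.\nEvidence:\n{evidence}"
    )

print(build_prompt("TraesCS1A02G123456", triples))  # prompt passed to an LLM
```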
2025-07-21 12:20:00 12:40:00 01B Bioinformatics in the UK Lichen Cell Atlas: Tools for exploring photosymbiotic associations Ellen Cameron Ellen Cameron, Gulnara Tagirdzhanova, Nicholas Talbot, Robert Finn, Mark Blaxter, Irene Papatheodorou Photosymbiotic associations are partnerships where one partner is photosynthetic, termed the photobiont. Such associations span the eukaryotic tree of life from fungi to metazoans. Certain fungi (lichens) have evolved to live as a symbiont dependent on energy from a photobiont. Lichens uniquely produce an anatomically complex structure through the symbiotic association between fungi and algae which resembles neither symbiont independently but instead resembles a multicellular organism. Co-evolution of such symbioses is likely underpinned by molecular interactions which has previously been characterized using bulk sequencing approaches. However, bulk sequencing fails to capture diversity of individual symbionts preventing the exploration of functional differentiation and cell types found in symbiotic associations. Single-nucleus RNA sequencing (snRNAseq) provides a high-resolution tool to investigate symbiont cellular heterogeneity and further inform roles of symbionts, cell-cell communication, and complex tissue differentiation. Many questions and challenges remain when working with biodiverse data types including: how to define ‘cell types’ in biological systems that form through interaction of simple microbial symbionts? In this project, we present the first cell atlas for a lichen, Xanthoria parietina, using snRNAseq data and a computational framework for studying symbiotic partnerships at the single nuclear level. Functional clusters for fungal and algal symbionts corresponding to key biological functions including carbohydrate transport and photosynthesis were generated suggestive of ‘cell types’. As more cell atlases are generated for mutualistic symbiotic associations (e.g., corals), future cross-species comparisons will enable identification of potential functional conservation and new understanding of the evolution of symbioses across the eukaryotic tree of life.
2025-07-21 12:40:00 13:00:00 01B Bioinformatics in the UK How can emerging AI technologies benefit multi-omic analysis for crop and soil sustainability Laura-Jayne Gardiner TBC, Laura-Jayne Gardiner The field of AI is rapidly evolving. At IBM Research (UK), our work in molecular biology has developed from a focus on bioinformatics and computational biology to include classic Machine Learning (ML), Deep Learning (DL), and now emerging AI-based technologies such as foundation models (FMs) and agentic AI. FMs are large, pre-trained models that can be adapted to various tasks; they have revolutionised AI and provide opportunities to accelerate discovery in the multi-omics domain. Here we take you on our technological journey via a range of application use-cases relating to our sustainability work in crop and soil multi-omics, where we are harnessing AI-based technology to identify genetic and molecular targets for disease, disease management and other key phenotypes. Our current toolkit for the analysis of crop genomes and soil metagenomes includes this full stack of bioinformatics and AI-based technologies, and we show that, in combination, these approaches can more effectively guide biological discovery for target identification. What sets us apart is our combination of: (1) cutting-edge AI and multi-modal datasets; (2) the interdisciplinary nature of our research team, including biologists, computational biologists, bioinformaticians, mathematicians, computational scientists and AI specialists; and (3) our end-user translational focus that provides a testbed for AI application. In the UK we work with a series of industrial partners focused on healthcare and life sciences within a unique collaborative centre led by IBM Research and STFC, called the Hartree National Centre for Digital Innovation.
2025-07-21 14:00:00 14:20:00 01B Bioinformatics in the UK Civic Data-Driven Innovation for Global Health and AI for All Iain Buchan Iain Buchan Professor Buchan will explore how some of the world's most pressing public health challenges might best be tackled with a global network of learning health systems, linking patient-, provider- and population-level insights and actions across three levels. He will give examples from Liverpool's Covid-19 responses of how the data-action gap was narrowed sufficiently to create the flywheel effect needed for 'learning health systems', for example the data-driven deployment and rapid evaluation of the world's first voluntary mass testing with lateral flow devices, which reduced Covid-19 cases by a fifth and hospitalisations by a quarter. The underlying data science and engineering has persisted in Liverpool as core business after the pandemic via an NHS "Data into Action" programme and the University-hosted Civic Health Innovation Labs (CHIL). Professor Buchan will argue that society needs civic clusters of health and care services, academic and industry partners to train and test AIs in a systemic way, networked to allow health systems to borrow strength from each other automatically. Further, Professor Buchan will argue the need for a standard, interactive form of digital self – the Health Avatar – that improves self-care, provider services and science synergistically. He will explore ways that avatar/AI-driven bio-sampling might work at scale, showing that prevention, precision and payer value need common data and interoperable AI to radically improve healthcare.
2025-07-21 14:20:00 14:30:00 01B Bioinformatics in the UK Health data organisation and landscape across the UK Emily Jefferson Emily Jefferson Professor Jefferson will explore the key challenges and opportunities in the field of clinical and health data science. Drawing on her transition from bioinformatics to health data science, she will share insights from her career journey. Her presentation will highlight how the Five Safes framework is used to protect data confidentiality and explain the role of Trusted Research Environments (TREs) and Secure Data Environments (SDEs) in enabling secure access to sensitive data. She will also outline essential resources for navigating the health data ecosystem, including Data Discovery tools and Information Governance processes. As bioinformatics and health informatics increasingly converge, there is a growing need for bioinformaticians to engage in this interdisciplinary domain.
2025-07-21 14:30:00 15:00:00 01B Bioinformatics in the UK The COALESCE study Angela Wood Angela Wood Professor Wood will present a flagship health data research programme, showcasing how national-scale data assets and analytical infrastructure have enabled transformative COVID-19 research in the UK. Her talk will focus on the integration of linked electronic health records (EHRs) covering more than 67 million individuals in England. These datasets span primary and secondary care, mortality records, medication dispensing, COVID-19 vaccinations, specialist audits, and environmental exposures. The presentation will also delve into the technical and governance foundations that support this work, including reproducible data curation and analysis pipelines, use of open-source tools, and patient and public involvement to promote transparent and trustworthy population-scale research. Professor Wood will highlight the impact of the COALESCE study, a collaborative research initiative that conducted the first UK-wide meta-analysis of COVID-19 undervaccination. This study revealed a clear association between undervaccination and increased risks of hospitalisation and death from COVID-19. This session is designed to inform and inspire researchers, analysts, and policymakers about the future of health data science in the UK, emphasising its transformative potential, ethical foundations, and collaborative spirit.
2025-07-21 15:00:00 15:20:00 01B Bioinformatics in the UK TRE-FX platform for federated analytics of sensitive data in Trusted Research Environments Tim Beck Tim Beck, Justin Biddle, Jonathan Couldridge, Grazziela Figueredo, Alexander Hambley, Alexandra Lee, Hazel Lockhart-Jones, Douglas Lowe, Chris Orton, Vasiliki Panagi, Andy Rae, Stian Soiland-Reyes, Simon Thompson, Carole Goble, Philip Quinlan A Trusted Research Environment (TRE) is a highly secure computer system that allows sensitive data from different sources to be combined, de-personalised and then made available for approved researchers to analyse within a secure virtual environment. The TRE-FX platform enables secure cross-TRE federated analytics. Federated analytics is where the data does not move, but instead the computer code the researchers write is sent to the data. There are many different software tools available for federated analytics, but most have not been designed for use within the considerable technical constraints of TREs. TRE-FX separates the different logical stages of a federated project, to support the use of analytics software within these environments. TRE-FX also uses international open standards from the Global Alliance for Genomics and Health (GA4GH) and ELIXIR to provide a solution for the remote execution of reproducible federated analyses. The Task Execution Service from GA4GH is used to receive federated analysis requests into the TRE. RO-Crate from ELIXIR is used in a standards framework to create the FAIR (Findable, Accessible, Interoperable, Reusable) metadata needed for reproducing analyses across TRE networks. The Five Safes RO-Crate encapsulates the metadata required for the exchange and review of analysis requests and results. This supports the disclosure control processes of TREs, where all outputs/results must be checked to ensure no disclosive data leaves a TRE. TRE-FX tools are being adapted and used across diverse UK and international projects, including DARE UK TREvolution, EOSC-ENTRUST, HDR UK Federated Analytics and ELIXIR Fed-A-Crate.
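To make the submission flow concrete: TRE-FX receives federated analysis requests via the GA4GH Task Execution Service (TES) API, so a request ultimately arrives as a TES task. The sketch below posts a minimal TES v1 task with Python; the endpoint URL and container image are placeholders I have invented, and a real TRE-FX submission would also carry the Five Safes RO-Crate metadata and pass the TRE's approval and disclosure-control processes.

```python
# Minimal sketch, assuming a TES v1 endpoint: submit an analysis container
# as a task. URL and image below are hypothetical placeholders.
import requests

task = {
    "name": "example-federated-analysis",
    "executors": [
        {
            "image": "example.org/approved-analysis:1.0",    # hypothetical image
            "command": ["python", "/app/run_analysis.py"],   # runs inside the TRE
        }
    ],
}

resp = requests.post("https://tre.example.org/ga4gh/tes/v1/tasks", json=task)
resp.raise_for_status()
print("Submitted task id:", resp.json()["id"])   # TES returns the new task's id
```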
2025-07-21 15:20:00 15:40:00 01B Bioinformatics in the UK SurvivEHR: A Primary Care Foundation Prediction Model for Multiple Long-Term Conditions Charles Gadd Charles Gadd, Francesca Crowe, Krishnarajah Nirantharakumar, Christopher Yau We present SurvivEHR, a foundation model for time-to-event prediction using Electronic Health Records (EHR), based on the Generative Pre-trained Transformer (GPT) architecture. The model is trained on 23 million patient records from the UK Clinical Practice Research Datalink (CPRD), encompassing longitudinal primary care data. In total, 7.6 billion recorded events across patient timelines are used, each represented as a tuple comprising: (i) a categorical event index (a unique combination of ICD-10 codes), (ii) an associated numerical value (e.g. a measurement), and (iii) the event time (days to/since birth). SurvivEHR follows a pretrain-finetune paradigm: it first learns generalisable clinical representations from large-scale EHR data, and is then fine-tuned for specific prediction tasks such as forecasting future diagnoses, lab values, or mortality risk. This enables SurvivEHR to perform time-to-event forecasting, providing personalised forecasts of the risk of future diagnoses, measurements, tests, and death. We further demonstrate that SurvivEHR supports strong transfer learning and can be used as a foundation model for clinical prediction modelling, as shown on a number of case-study examples. This work is motivated by the growing burden of Multiple Long-Term Conditions (MLTCs), also referred to as multimorbidity, as the prevalence of individuals living with two or more chronic conditions continues to rise. This shift is largely driven by an ageing population and advances in medical care that have extended life expectancy, resulting in more people living longer with chronic diseases. MLTCs are associated with poorer health outcomes, reduced quality of life, increased healthcare costs, and higher rates of hospitalisation and mortality.
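The event-tuple representation described above translates directly into code. The sketch below shows one plausible way to hold such records and flatten a patient timeline into the parallel sequences a GPT-style model consumes; the field names, vocabulary indices and values are invented for illustration and are not SurvivEHR's actual data model.

```python
# Minimal sketch of the (event index, value, time) tuple representation.
from dataclasses import dataclass

@dataclass
class Event:
    event_idx: int      # categorical event (e.g. an ICD-10-derived code index)
    value: float        # associated measurement, or NaN if none
    t_days: int         # event time in days since birth

timeline = [
    Event(event_idx=101, value=7.2, t_days=18250),          # e.g. a lab result
    Event(event_idx=57, value=float("nan"), t_days=18400),  # e.g. a diagnosis
]

# Flatten into parallel sequences, the usual input layout for a transformer.
tokens = [e.event_idx for e in timeline]
values = [e.value for e in timeline]
times  = [e.t_days for e in timeline]
print(tokens, values, times)
```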
2025-07-21 15:40:00 15:50:00 01B Bioinformatics in the UK Pathogen Analysis System (PAS): A Scalable Genomic Data Processing Framework Integrated with the European Nucleotide Archive David Yuan Eugene Ivanov, Senthilnathan Vijayaraja, Tony Burdett, David Yuan The Pathogen Data Network (PDN) consists of two interconnected networks: local private data hubs for public health and global public knowledge bases for genomic research. The Pathogen Analysis System (PAS) in PDN is a computing platform for processing pathogen data through integration with the European Nucleotide Archive (ENA). PAS retrieves input data from ENA, processes it using customizable bioinformatics pipelines, and submits results back to ENA; these may then be highlighted in a Pathogens Portal. PAS supports four use cases: 1. Public data on PDN's priority list (ENA-sourced). 2. Public data not on the priority list (ENA-sourced). 3. Private data stored in ENA. 4. Data uploaded directly to the Galaxy Community Hub, bypassing ENA. The system leverages tools like the ENA File Downloader and Webin CLI for submission, wrapped as Galaxy tools, to automate workflows. A reference implementation using the Tuberculosis (TB) Variant Analysis Workflow demonstrates how researchers can retrieve data, analyze it, and submit results seamlessly. Developed in collaboration with GA4GH, ENA, EVORA, and BRC Analytics, PAS operates on a layered architecture, enabling scalable, automated analysis. Researchers can plug in their own pipelines or reuse existing ones, benefiting from Galaxy's computing power without manual data handling. While UC1 is pathogen-specific, UC2-UC4 are generic, making PAS adaptable for any genomic data analysis integrated with ENA. The system also aligns with FAIR principles, with plans to share workflows via WorkflowHub and package them in RO-Crates for improved reproducibility and accessibility. This framework enhances collaborative research, enabling rapid data sharing and analysis.
2025-07-21 15:50:00 16:00:00 01B Bioinformatics in the UK AI in Histopathology Explorer for comprehensive analysis of the evolving AI landscape in histopathology Heba Sailem Yingrui Ma, Shivprasad Jamdade, Lakshmi Konduri, Heba Sailem Digital pathology and artificial intelligence (AI) hold immense transformative potential to revolutionize cancer diagnostics, treatment outcomes, and biomarker discovery. Gaining a deeper understanding of the deep learning methods applied to histopathological data, and evaluating their performance on different tasks, is crucial for developing the next generation of AI technologies. To this end, we developed the AI in Histopathology Explorer (HistoPathExplorer), an interactive dashboard with intelligent tools available at www.histopathexpo.ai. This real-time online resource enables users, including researchers, decision-makers, and various stakeholders, to assess the current landscape of AI applications for specific clinical tasks, analyze their performance, and explore the factors influencing their translation into practice. Moreover, a quality index was defined to evaluate the comprehensiveness of methodological details in published AI methods. HistoPathExplorer highlights opportunities and challenges for AI in histopathology, and offers a valuable resource for creating more effective methods and shaping strategies and guidelines for translating digital pathology applications into clinical practice.
2025-07-21 16:40:00 17:00:00 01B Bioinformatics in the UK Generative machine learning to model cellular perturbations Mo Lotfollahi The field of cellular biology has long sought to understand the intricate mechanisms that govern cellular responses to various perturbations, be they chemical, physical, or biological. Traditional experimental approaches, while invaluable, often face limitations in scalability and throughput, especially when exploring the vast combinatorial space of potential cellular states. Enter generative machine learning, which has shown exceptional promise in modelling complex biological systems. This talk will highlight recent successes, address the challenges and limitations of current models, and discuss the future direction of this exciting interdisciplinary field. Through examples of practical applications, we will illustrate the transformative potential of generative ML in advancing our understanding of cellular perturbations and in shaping the future of biomedical research.
2025-07-21 17:00:00 17:10:00 01B Bioinformatics in the UK Building the world's largest, ethically-sourced database of biological information to pioneer a new class of foundational AI models Carla Greco Carla Greco, William Chow Foundational AI models are revolutionizing many fields, but their effectiveness hinges on the availability of high-quality, diverse data. Without a robust and varied dataset, even the most sophisticated models can fall short of their potential, lacking the necessary breadth to generalize accurately and provide meaningful insights. Whilst public databases provide an immeasurably valuable resource, they harbour geographical and taxonomic biases, as well as limitations in environmental and genomic context, that can hinder the models built on them. Here we present BaseData™, Basecamp Research's central data asset, which is far larger, more diverse, and more richly contextualized than any public dataset, and fully grounded in ethical access and benefit-sharing agreements that ensure clear commercial use for the data. This dataset continues to grow at pace, supported by a self-reinforcing data supply chain that spans over 120 locations in over 25 countries, reaching over half of all known biomes, from island volcanoes to Antarctic ice fields. BaseData™ has expanded classes of biological systems and biomolecules by 100 times compared to public databases, and this unprecedented novelty and diversity significantly strengthens AI models, enabling them to tackle highly complex tasks. BaseData™ has already been shown to help build tools that improve predictive and generative tasks, such as solving protein structures (BaseFold), functional enzyme annotation (Hifi-NN), and making enzyme design programmable (ZymCTRL). We believe this novel and diverse dataset opens the door to truly revolutionary advances in biological AI in the near future.
2025-07-21 17:10:00 17:20:00 01B Bioinformatics in the UK Multimodal generative machine learning for non-clinical safety evaluations in drug discovery and development Arijit Patra Arijit Patra In the evolving landscape of pharmaceutical drug development and a constant reimagination of the Ideas to Patient journey, the integration of multimodal foundation models, generative machine learning, and AI-driven chatbot interfaces has marked a significant leap forward. This talk will present our recent work in building and deploying multimodal foundation models specifically designed for drug toxicity evaluations, as well as the production of intuitive chatbot interfaces to facilitate knowledge discovery and human-in-the-loop assessments throughout the process. Our multimodal foundation models are engineered to process and interpret a variety of data types, including molecular structures, biological assay results, and textual data from scientific literature, along with imaging and computational toxicity data from non-clinical safety studies and toxicologic pathology processes. By harnessing these models, we have enhanced our ability to predict and evaluate potential drug toxicities early in the development pipeline. This not only accelerates the identification of safer drug candidates but also substantially reduces the costs and time associated with traditional toxicity testing methods. In parallel, we have developed and productionized sophisticated chatbot interfaces that serve as powerful tools for knowledge discovery. These chatbots enable teams to interact seamlessly with complex datasets and analytical tools, democratizing access to critical insights and fostering a more collaborative research environment. The chatbots are designed to understand and respond to natural language queries, making advanced data analysis accessible to users regardless of their technical expertise. This talk will showcase applications and case studies where our multimodal foundation models and chatbot interfaces have been successfully implemented. We will discuss the potential impact of these technologies on non-clinical safety evaluations, and improvements in accuracy, efficiency, and decision-making processes. Additionally, we will explore the challenges encountered during development and deployment, as well as the future directions and potential expansions of these innovative tools.
2025-07-21 17:20:00 17:30:00 01B Bioinformatics in the UK CUPiD: a machine learning approach for determining tissue-of-origin in cancers of unknown primary from cell-free DNA methylation profiles Steven M. Hill Alicia-Marie Conway, Simon P. Pearce, Alexandra Clipson, Steven M. Hill, Holly Cassell, Vsevolod J. Makeev, Caroline Dive, Natalie Cook, Dominic G. Rothwell Patients with cancer of unknown primary (CUP) present with metastases, but without an identifiable primary tumour and typically have poor outcomes. Determining the primary site enables access to type-specific therapies, potentially leading to therapeutic benefit. We developed CUPiD, a machine learning classifier for non-invasively and accurately determining CUP tissue-of-origin (TOO) from cell-free DNA (cfDNA) methylation profiles obtained from a blood sample. We used a data augmentation strategy that enabled data arising from over 9,000 TCGA tumour tissue samples (rather than cfDNA samples) to be leveraged for classifier training. To generate the training dataset, we performed in silico mixing of reads from tumour tissue DNA with reads from non-cancer control (NCC) cfDNA, to mimic low tumour content of cfDNA, resulting in 276,108 in silico mixture samples, across 29 cancer types. These were used to train an ensemble of 100 XGBoost classifiers for TOO determination from cfDNA, with an AUROC of 0.98 on held-out mixture samples. We further tested CUPiD on 143 cfDNA samples from patients with metastatic cancer of known type, giving an overall multi-class sensitivity of 85% and TOO accuracy of 97%. In an additional cohort of 41 patients with CUP, CUPiD predictions were made in 32/41 (78%) cases, with 88% of the predictions clinically consistent with a subsequent or suspected primary tumour diagnosis. Our data demonstrate that CUPiD can accurately predict TOO from a single liquid biopsy and has potential to support treatment stratification and improve clinical outcomes for a significant proportion of patients with CUP.
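The training strategy behind CUPiD, diluting tumour signal with non-cancer cfDNA background to mimic low tumour fractions and then fitting gradient-boosted classifiers, can be sketched compactly. CUPiD itself mixes sequencing reads and trains an ensemble of 100 XGBoost models; the toy version below instead mixes region-level methylation values on simulated data, so everything here is an illustrative simplification assuming numpy and xgboost are installed.

```python
# Minimal sketch of in silico mixture training for a cfDNA-style classifier.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n_regions = 300
tumour = rng.beta(5, 2, size=(100, n_regions))    # tumour methylation levels
ncc    = rng.beta(2, 5, size=(100, n_regions))    # non-cancer cfDNA background

frac = rng.uniform(0.01, 0.2, size=100)[:, None]  # low tumour content per sample
mixtures = frac * tumour + (1 - frac) * ncc       # in silico mixtures
X = np.vstack([mixtures, ncc])
y = np.array([1] * 100 + [0] * 100)               # 1 = tumour class, 0 = NCC

clf = XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)
print("training accuracy:", clf.score(X, y))
```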
2025-07-21 17:30:00 17:35:00 Bioinformatics in the UK Introduction to AIBIO UK Charlie Harrison AIBio is a UKRI BBSRC-funded network to support and enhance engagement between the Biosciences and AI communities in the UK. The network supports community events, funding of pilot projects and creation of resources for supporting use of AI in the biosciences.
2025-07-21 17:35:00 18:00:00 01B Bioinformatics in the UK AI: Careers and trajectories Mo Lotfollahi, Lucia Marucci, Gabriella Rustici, Harpreet Saini TBC
2025-07-21 11:20:00 12:00:00 04AB VarI Enhancing Multi-Task CNNs for Regulatory Genomics Through Allelic and High-Resolution Training Alexander Sasse Alexander Sasse Multi-task Convolutional Neural Networks (CNNs) have emerged as powerful tools for deciphering how genomic sequence determines gene regulatory responses such as chromatin accessibility or transcript abundance. These models can learn the sequence patterns recognized by regulatory factors from the variation across hundreds of thousands of loci in the genome. Their understanding of gene regulatory syntax enables them to be used to predict individual genomic variant effects across the cell types they were trained on, and to point to the affected biological mechanisms. However, our recent study and that of another group (Sasse et al. 2023) revealed in parallel that, despite strong performance on various variant effect prediction benchmarks (Avsec et al. 2021), these models fail to correctly determine how variants affect the direction of gene expression across individuals, an essential capability for associating variants with phenotypes or diseases. To address these limitations and improve model learning from available data, I present two strategies. First, training with sequence variation: we developed a modeling approach that directly contrasts sequence differences to predict allele-specific and personalized functional measurements from RNA-seq, ATAC-seq, and ChIP-seq (Tu, Sasse, and Chowdharry et al. 2025; Spiro and Tu et al. 2025). We applied this approach to data from F1 hybrid mice and from humans with personal whole genome information, with varying degrees of success: while training on allele-resolved data improved predictions of differential signals, training on hundreds of personal genomes did not generalize variant effects to unseen genes. Second, training at higher resolution: we created models that analyze ATAC-seq at base-pair resolution, capturing both overall chromatin accessibility and the distribution of Tn5 transposase insertions (Chandra et al. 2025). Our results demonstrate that additionally modeling the ATAC-seq profile consistently improves predictions of differential chromatin accessibility. Systematic analysis of the models' sequence attributions confirms that base-pair resolution training enables the model to learn a more sensitive representation of the regulatory syntax that drives differences between immunocytes.
2025-07-21 12:00:00 12:20:00 04AB VarI Combining massively parallel reporter assays and graph genomics to assay the regulatory effects of indels and structural variants Lindsey Plenderleith Lindsey Plenderleith, Rachel Owen, Timothy Connelley, Musa Hassan, Liam Morrison, James Prendergast Many important phenotypes are driven by differences in gene expression caused by variation in regulatory sequences between individuals. Among such variants, the effects of larger changes such as insertion-deletion mutations (indels) and structural variants (SVs) remain understudied relative to single nucleotide variants (SNVs), even though they may often have larger regulatory impacts. We have used the Survey of Regulatory Effects (SuRE) approach, a genome-wide massively parallel reporter assay, to screen the cattle and human genomes to identify SNVs with regulatory effects, and are now leveraging this approach to study the effects of larger variants. The SuRE method, which tests the ability of individual genomic DNA fragments to initiate transcription in an otherwise promoterless plasmid, allows the effects of individual variants to be tested, considerably reducing the confounding impact of linkage disequilibrium. By combining SuRE with a novel graph genomics pipeline we have been able to improve the detection of regulatory effects of indels and SVs. We successfully tested almost 1.4 million indels and SVs, ranging in size from 1 bp to 1.5 kb, and identified around 13,000 with a significant effect on gene expression in primary cattle cells. Work is ongoing to further characterise these potential regulatory variants and their relevance to understanding how indels and SVs shape important phenotypes. These results validate our method as a new tool for evaluating the functional effects of longer variants.
2025-07-21 12:20:00 12:30:00 04AB VarI Multilingual model improves zero-shot prediction of disease effects on proteins Ruyi Chen Ruyi Chen, Nathan Palpant, Gabriel Foley, Mikael Boden Models for mutation effect prediction in coding sequences rely on sequence-, structure-, or homology-based features. Here, we introduce a novel method that combines a codon language model with a protein language model, providing a dual representation for evaluating the effects of mutations on disease. By capturing contextual dependencies at both the genetic and protein level, our approach achieves a 3% increase in ROC-AUC classifying disease effects for 137,350 ClinVar missense variants across 13,791 genes, outperforming two single-sequence-based language models. Notably, the codon language model can uniquely differentiate synonymous from nonsense mutations at the genomic level. Our strategy uses information at complementary biological scales (akin to human multilingual models) to enable protein fitness landscape modeling and evolutionary studies, with potential applications in precision medicine, protein engineering, and genomics.
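The point about codon-level information is worth making concrete: two variants can be identical at the protein position yet be synonymous, missense, or nonsense, and only the codon sequence distinguishes them. The minimal sketch below classifies a codon change using Biopython's standard translation table (assumed installed); the example codons are invented.

```python
# Minimal sketch: why codon context matters. Classify a single-codon change
# as synonymous, missense, or nonsense via translation.
from Bio.Seq import Seq

def mutation_class(codon_ref, codon_alt):
    aa_ref = str(Seq(codon_ref).translate())
    aa_alt = str(Seq(codon_alt).translate())
    if aa_alt == aa_ref:
        return "synonymous"
    if aa_alt == "*":            # '*' denotes a stop codon
        return "nonsense"
    return "missense"

print(mutation_class("TAC", "TAT"))  # synonymous (both Tyr)
print(mutation_class("TAC", "TAA"))  # nonsense  (Tyr -> stop)
print(mutation_class("TAC", "TGC"))  # missense  (Tyr -> Cys)
```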
2025-07-21 12:30:00 12:40:00 04AB VarI X-MAP: Explainable AI Platform for Genetic Variant Interpretation Marco Anteghini Marco Anteghini, Andrea Zauli, Emidio Capriotti Genetic variants, particularly missense mutations, can significantly affect protein function and contribute to disease development. Methods like CADD and AlphaMissense are widely used for pathogenicity prediction; however, their integration into existing resources remains limited due to compatibility issues and high computational demands. We introduce X-MAP, an integrated platform that leverages protein language models to enhance variant effect prediction through a novel embedding-based strategy. This approach captures both local and global protein features, enabling more accurate interpretation of mutation impacts. Our method generates embeddings for entire protein sequences using multiple state-of-the-art models—ESM2, ESMC, and ESM1v—and extracts contextual information around mutation sites using a dynamic window of four residues on each side. This window size was empirically optimized to balance detailed local structure with computational efficiency. We evaluated both concatenation and difference-based embedding strategies using rigorous 10-fold cross-validation with XGBoost classifiers on a large dataset of 71,595 genetic variants across 12,666 human proteins. Among all methods, the ESMC concatenation strategy with the 4-residue window achieved the highest performance (Accuracy: 0.84, MCC: 0.66, AUC: 0.90), outperforming the Esnp baseline (Accuracy: 0.82, MCC: 0.64, AUC: 0.82), which relies on full sequence concatenation. By concentrating on regions directly affected by mutations while retaining global sequence context, X-MAP achieves both accuracy and computational efficiency. We are currently developing a hybrid Transformer-CNN model to further enhance prediction accuracy and interpretability. X-MAP represents a powerful and scalable framework for variant analysis with direct applications in precision medicine and disease research.
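The two embedding strategies compared above can be sketched in a few lines of numpy. In the sketch below the embeddings are random stand-ins for real ESM2/ESMC/ESM1v outputs, and mean-pooling the window is my assumption rather than X-MAP's documented choice; only the window size of four residues per side comes from the abstract.

```python
# Minimal sketch: concatenation vs difference features from a +/-4-residue
# window around a mutation site, given per-residue embeddings.
import numpy as np

rng = np.random.default_rng(0)
L, d, pos, w = 120, 1152, 60, 4           # seq length, embed dim, site, window
emb_wt  = rng.normal(size=(L, d))         # stand-in for wild-type embeddings
emb_mut = rng.normal(size=(L, d))         # stand-in for mutant embeddings

win_wt  = emb_wt[pos - w: pos + w + 1].mean(axis=0)    # pool the 9-residue window
win_mut = emb_mut[pos - w: pos + w + 1].mean(axis=0)

concat_features = np.concatenate([win_wt, win_mut])    # concatenation strategy
diff_features   = win_mut - win_wt                     # difference strategy
print(concat_features.shape, diff_features.shape)
```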
2025-07-21 12:40:00 12:50:00 04AB VarI StructGuy: Data leakage free prediction of functional effects of genetic variants. Alexander Gress Alexander Gress, Johanna Becher, Dominique Mias-Lucquin, Sebastian Keller, Olga Kalinina In recent years, machine learning models for predicting variant effects on protein function have been dominated by unsupervised models performing zero-shot predictions. In their development, multiplexed assay of variant effect (MAVE) data played only a secondary role, used for model evaluation, most prominently in the ProteinGym benchmark. Yet the rapidly increasing amount of available MAVE data should be able to fuel novel supervised prediction models; this is hindered by the data leakage that results when MAVE data are used to train a supervised model. Such models are not able to generalize their predictions to proteins not present in the training data; hence, so far they have only been used in protein design tasks. Here, we present the novel random forest-based prediction method StructGuy, which overcomes the problem of data leakage by applying sophisticated splits in hyperparameter optimization and feature selection. By removing proteins similar to any proteins in our training data set from ProteinGym, we constructed a dedicated benchmark that aims to evaluate the ability of a supervised model to generalize to proteins not seen in the training data. In this benchmark, we could perform a direct and fair comparison of our StructGuy model with all models that are part of the zero-shot substitutions track of ProteinGym, and demonstrated a slightly higher average Spearman's correlation coefficient (0.45 vs. the second highest, ProtSSN, at 0.43). StructGuy applied directly to ProteinGym achieves an average Spearman's correlation coefficient of 0.6.
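The core idea of leakage-aware splitting, never letting variants of the same protein appear in both training and evaluation, is easy to demonstrate. The sketch below groups variants by protein identifier with scikit-learn's GroupKFold; StructGuy's actual splits also account for sequence similarity between proteins, so grouping by identifier alone is the simplest version of the idea, with simulated data throughout.

```python
# Minimal sketch: grouped cross-validation so each fold holds out whole
# proteins, preventing per-protein data leakage.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.default_rng(0).normal(size=(12, 5))   # variant features (toy)
y = np.random.default_rng(1).normal(size=12)        # MAVE scores (toy)
proteins = np.array(["P1"] * 4 + ["P2"] * 4 + ["P3"] * 4)

for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=proteins):
    # No protein may appear on both sides of the split.
    assert not set(proteins[train_idx]) & set(proteins[test_idx])
    print("held-out protein(s):", sorted(set(proteins[test_idx])))
```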
2025-07-21 12:50:00 13:00:00 04AB VarI Functional characterization of standing variation around disease-associated genes using Massively Parallel Reporter Assays Kilian Salomon Kilian Salomon, Chengyu Deng, Jay Shendure, Max Schubach, Nadav Ahituv, Martin Kircher A substantial reservoir of disease-associated variants resides in non-coding sequences, particularly in proximal and distal gene regulatory sequences. As part of the NIH Impact of Genomic Variation on Function (IGVF) consortium, we investigated functional genetic variation using Massively Parallel Reporter Assays (MPRAs). We tested >28,000 candidate cis-regulatory regions (cCREs) in the proximity (50kb) of 526 neural, cardiac or clinically actionable genes as well as a random gene set. Within these cCREs, we included >46,000 variants from gnomAD. This included all single nucleotide variants (SNVs) with allele frequency (AF) ≥1% as well as 35,000 rare and singleton variants. Rare variants were prioritized using Enformer (Avsec Ž et al. 2021) to select 70% potentially activating, 15% repressing, and 15% random variants. Performing this MPRA in NGN2-derived neurons from WTC-11 cells showed that 16% (4045) of cCREs have significantly different activity from negative controls, while 6% (1540) of elements exhibit distinct activity from scrambled controls (dCREs). Among the dCREs, 3.3% are significantly more active and 2.7% were less active. About 3% (1304) of the tested variants showed a significant allelic effect. We observed both common and rare variants with significant allelic effects, with rare variants contributing the larger proportion. Examples of significant common and singleton SNVs include rs11635753 and rs1257445811 affecting SMAD3 and TRIO, respectively, and both associated with complex neurological phenotypes. Using Enformer for prioritization resulted in an enrichment in the selected rare variants but also failed to effectively capture regulatory grammar at base resolution.
2025-07-21 14:00:00 14:40:00 04AB VarI Variant Interpretation at Scale, for safer and more effective disease treatment Ellen McDonagh Ellen McDonagh
2025-07-21 14:40:00 15:00:00 04AB VarI scFunBurd: Quantifying the cellular liability for complex disorders of all rare gene-disrupting variants. Thomas Renne Thomas Renne, Guillaume Huguet, Tomasz Nowakowski, Sébastien Jacquemont Neurodevelopmental disorders are examples of complex disorders with multidimensional etiologies. This study focuses on Autism Spectrum Disorder (ASD), a prevalent and highly heritable disorder, to illustrate the challenges of identifying rare variants associated with a complex disorder. Previous research has linked only a hundred genes to ASD. However, the majority of gene-disrupting variants and their functions remain unknown. This study aims to develop a cellular burden analysis to associate the remaining rare gene-disrupting variants with complex disorders on a function-wide scale, with the help of transcriptomic datasets. The study relied on 100,000 phenotyped and sequenced individuals from the SPARK dataset. Transcriptomic data are single-nuclei RNAseq of 150,000 cortical cells from 40 individuals, clustered into 91 developmental cell types. Cell type burdens were computed with logistic regression models of the most cell-type-specific genes. Our results showed that loss-of-function (LoF) variants and CNVs had significant liabilities in neuronal cell types. Interestingly, we also identified significant liabilities for ASD in non-neuronal cell types for LoF variants, which had not previously been reported. Moreover, each variant type exhibited unique patterns of cellular liability, highlighting the need to study them individually. Finally, we observed that the cellular burden mostly resulted from genes never associated with ASD. The scFunBurd method effectively identified new functional processes associated with complex disorders and offers insights into rare variants not yet linked to ASD. This method could therefore be applied to other complex disorders to uncover their functional liabilities.
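The "cell type burden" computed by logistic regression, as described above, has a simple skeleton: regress case/control status on each individual's count of rare gene-disrupting variants falling in that cell type's most specific genes. The sketch below simulates this for one cell type; a real analysis would add covariates (sex, ancestry, total variant burden, etc.), and all parameters here are invented.

```python
# Minimal sketch of a per-cell-type rare-variant burden test.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
burden = rng.poisson(1.0, size=n)          # LoF count in cell-type-specific genes
logit_p = -1.0 + 0.3 * burden              # simulated true burden effect
status = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))   # case/control labels

model = sm.Logit(status, sm.add_constant(burden)).fit(disp=0)
print("burden log-odds:", model.params[1], "p =", model.pvalues[1])
```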
2025-07-21 15:00:00 15:20:00 04AB VarI Biostatistical approaches to single-cell perturbation screens to create a prospective map of mutational impact Magdalena Strauss Magdalena Strauss, Sarah Cooper, Matthew Coelho, Aleksander Gontarczyk Gontarczyk, Qianxin Wu, Alex Watterson, John Marioni, Mathew Garnett, Andrew Bassett DNA single nucleotide variants are a major cause of drug resistance in cancer, but for most variants their effects on drug response are as yet unknown. While new SNVs are discovered at an increasing rate, the interpretation of their impacts presents a major bottleneck in clinical use. To address this bottleneck, we developed a suite of statistical analysis tools that allowed the creation of a prospective map of mutational impact from new experimental techniques that combine gene editing data with RNA and DNA sequencing readout at the single-cell level. Our tools shed light on the degree of malignancy of individual mutations, on changes in gene regulation resulting from mutations, and on potential drug targets, and also include methods to model the specific noise structure of single-cell data in the gene editing context. First, we studied IFNγ response across different mutations to the JAK1 gene in colon cancer cells[1], and demonstrated the accuracy of our computational tools by linking genotype with transcriptional phenotype in 9,908 cells for scDNA-seq and 18,978 cells for scRNA-seq, encompassing 97 unique genotypes with low error rates for known genotype-phenotype relationships. In a second application[2], we studied the transcriptional profiles of drug-resistant colon cancer cells at scale, following exposure to the drugs dabrafenib and cetuximab. Our approach shed light on transcriptional differences between different types of drug resistance, including drug addiction. References: 1. Cooper*, Coelho*, Strauss*, et al. Genome Biol 25, 20 (2024). 2. Coelho, Strauss, Watterson, et al. Nat. Genet. (2024).
2025-07-21 15:20:00 15:30:00 04AB VarI SpliceTransformer predicts tissue-specific splicing linked to human diseases Ning Shen Ningyuan You, Chang Liu, Ning Shen We present SpliceTransformer (SpTransformer), a deep-learning framework that predicts tissue-specific RNA splicing alterations linked to human diseases based on genomic sequence. SpTransformer outperforms all previous methods on splicing prediction. Application to approximately 1.3 million genetic variants in the ClinVar database reveals that splicing alterations account for 60% of intronic and synonymous pathogenic mutations, and occur at different frequencies across tissue types. Importantly, tissue-specific splicing alterations match their clinical manifestations independent of gene expression variation. We validate the enrichment in three brain disease datasets involving over 164,000 individuals. Additionally, we identify single nucleotide variations that cause brain-specific splicing alterations, and find disease-associated genes harboring these single nucleotide variations with distinct expression patterns involved in diverse biological processes. Finally, SpTransformer analysis of whole-exome sequencing data from blood samples of patients with diabetic nephropathy predicts kidney-specific RNA splicing alterations with 83% accuracy, demonstrating the potential to infer disease-causing tissue-specific splicing events. SpTransformer provides a powerful tool to guide biological and clinical interpretations of human diseases.
2025-07-21 15:30:00 15:40:00 04AB VarI Cell type-specific epigenetic regulatory circuitry of coronary artery disease loci Dennis Hecker Dennis Hecker, Xiaoning Song, Nina Baumgarten, Anastasiia Diagel, Nikoletta Katsaouni, Ling Li, Shuangyue Li, Ranjan Kumar Maji, Fatemeh Behjati Ardakani, Lijiang Ma, Daniel Tews, Martin Wabitsch, Johan L.M. Björkegren, Heribert Schunkert, Zhifen Chen, Marcel H. Schulz Coronary artery disease (CAD) is the leading cause of death worldwide. Recently, hundreds of genomic loci have been shown to increase CAD risk, however, the molecular mechanisms underlying signals from CAD risk loci remain largely unclear. We sought to pinpoint the candidate causal coding and non-coding genes of CAD risk loci in a cell type-specific fashion. We integrated the latest statistics of CAD genetics from over one million individuals with epigenetic data from 45 relevant cell types to identify genes whose regulation is affected by CAD-associated single nucleotide variants (SNVs) via epigenetic mechanisms. We pursue two approaches. Firstly, we aggregate variations in gene bodies and combine their significance levels while accounting for their linkage disequilibrium structure. Secondly, we focus on variations that affect transcription factor binding in enhancers. We identified 1,580 genes likely involved in CAD, about half of which have not been associated with the disease so far. Of all the candidate genes, 23.5% represented non-coding RNAs, which are underrepresented in transcriptome-based gene prioritization. Enrichment analysis and phenome-wide association studies linked the novel candidate genes to disease-specific pathways and CAD risk factors, corroborating their disease relevance. We showed that CAD-SNVs affect the binding of transcription factors with cellular specificity. Finally, we conducted a proof-of-concept biological validation for the novel CAD non-coding RNA gene IQCH-AS1. Our study not only pinpoints CAD candidate genes in a cell type-specific fashion but also spotlights the roles of the understudied non-coding RNA genes in CAD genetics.
2025-07-21 15:40:00 15:50:00 04AB VarI MultiPopPred: A Trans-Ethnic Disease Risk Prediction Method, and its Evaluation on Low Resource Populations Ritwiz Kamal Ritwiz Kamal, Manikandan Narayanan Genome-wide association studies (GWAS) aimed at estimating the disease risk of genetic factors have long focused on homogeneous Caucasian populations, at the expense of understudied non-Caucasian populations. Active efforts are therefore underway to understand the differences and commonalities in exhibited disease risk across different populations or ethnicities. There is, consequently, a pressing need for computational methods and associated probabilistic models that efficiently exploit these population-specific vs. shared aspects of the genotype-phenotype relation. We propose MultiPopPred, a novel trans-ethnic polygenic risk score (PRS) estimation method that taps into the shared genetic risk across populations and transfers information learned from multiple well-studied auxiliary populations to a less-studied target population. MultiPopPred employs a specially designed penalized shrinkage model of regression and a Nesterov-smoothed objective function optimized via an L-BFGS routine. We present five variants of MultiPopPred based on the availability of individual-level vs. summary-level data and the weighting of each auxiliary population. Extensive comparative analyses performed on simulated genotype-phenotype data reveal that MultiPopPred improves PRS prediction in the South Asian population by 67% in settings with low target sample sizes, by 18% overall across all simulation settings, and by 73% overall across all semi-simulated settings, when compared to state-of-the-art trans-ethnic PRS estimation methods. This performance trend is promising and encourages application and further assessment of MultiPopPred in real-world settings.
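The optimisation idea named above, a Nesterov-smoothed objective minimised with L-BFGS, can be illustrated on a toy penalized regression: the non-smooth L1 penalty is replaced by its smoothed (Huber-like) surrogate so that gradient-based L-BFGS applies. This is a generic sketch of the technique, not MultiPopPred's actual model; data, penalty weight and smoothing parameter are all invented.

```python
# Minimal sketch: Nesterov/Huber smoothing of an L1 penalty, minimised with
# L-BFGS via scipy, on simulated sparse-regression data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
w_true = np.zeros(50)
w_true[:5] = 1.0                                   # 5 truly nonzero effects
y = X @ w_true + rng.normal(scale=0.5, size=500)

lam, mu = 1.0, 1e-3                                # penalty weight, smoothing

def smoothed_abs(w):                               # smooth surrogate of |w|
    return np.where(np.abs(w) <= mu, w**2 / (2 * mu), np.abs(w) - mu / 2)

def objective(w):
    r = X @ w - y
    return 0.5 * r @ r + lam * smoothed_abs(w).sum()

def gradient(w):
    g_pen = np.where(np.abs(w) <= mu, w / mu, np.sign(w))
    return X.T @ (X @ w - y) + lam * g_pen

res = minimize(objective, np.zeros(50), jac=gradient, method="L-BFGS-B")
print("coefficients above 0.1 in magnitude:", int(np.sum(np.abs(res.x) > 0.1)))
```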
2025-07-21 15:50:00 16:00:00 04AB VarI Rethink gender as confounder in non linear PRS for human height prediction Huijiao Yang Huijiao Yang Polygenic risk score (PRS) models often include gender as a fixed covariate, implicitly assuming a direct and additive effect on traits such as height. While this approach simplifies modeling, it may obscure the more intricate role gender plays—especially when sex-related genetic variation is partially correlated with trait-associated loci. In this study, we revisit the conventional treatment of gender by conceptualizing it not as a purely causal covariate, but as a potential confounder whose genetic basis overlaps with that of the target phenotype. We propose a representation learning framework that: (1) learns gender-specific genetic patterns directly from genome-wide association study (GWAS) data, and (2) disentangles these patterns from height-related signals using contrastive learning. Our empirical findings show that traditional linear models—particularly LASSO regression—may attribute disproportionate predictive weight to sex chromosome variants, due to their alignment with both gender labels and phenotypic variation. In contrast, our framework yields disentangled embeddings that more clearly separate sex-related genetic structure from height-specific architecture, suggesting that some observed "gender effects" may instead reflect underlying genetic correlations. The framework provides three main contributions: First, it replaces fixed covariate adjustment with a learnable gender encoder trained directly on genotype data. Second, it constructs a non-linear PRS model that improves both predictive accuracy and interpretability. Third, it offers a new lens on sex-informed genetic modeling—highlighting how treating gender as a confounder in latent space can enhance both prediction and biological insight. These findings may extend to other traits with sex-dimorphic genetic architecture.
2025-07-21 16:40:00 17:20:00 04AB VarI Somatic mutations in normal tissues Andrew Lawson Andrew Lawson As we age, all cells in our bodies continuously acquire somatic mutations. Despite appearing histologically normal, many tissues become progressively colonised by microscopic clones carrying somatic driver mutations. Some of these clones represent a first step towards cancer, whereas others may contribute to ageing and other diseases. However, our understanding of the clonal landscapes of human tissues, and their impact on cancer risk, ageing and disease, remains limited due to the challenge of detecting somatic mutations present in small numbers of cells. In this presentation, I will summarise the methodological advances of the last decade that have enabled us to first discover and subsequently interrogate these microscopic clones. In particular, I will focus on our recent development of nanorate sequencing (NanoSeq), a duplex sequencing method with error rates of <5 per billion base pairs, which is compatible with whole-exome and targeted gene sequencing. Deep sequencing of polyclonal samples with single-molecule sensitivity enables the simultaneous detection of mutations in large numbers of clones, yielding accurate somatic mutation rates, mutational signatures and driver mutation frequencies in any tissue. Applying Targeted NanoSeq to 1,042 non-invasive samples of oral epithelium and 371 samples of blood from a twin cohort, we found an unprecedentedly rich landscape of selection, with 46 genes under positive selection driving clonal expansions in the oral epithelium, over 62,000 driver mutations, and evidence of negative selection in some genes. The high number of positively selected mutations in multiple genes provides high-resolution maps of selection across coding and non-coding sites, a form of in vivo saturation mutagenesis.
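The error rate quoted above is what makes single-molecule burden estimates interpretable: the number of expected artefactual calls can be bounded directly. A back-of-envelope Python illustration follows; the counts are invented for illustration, not NanoSeq results.

from scipy import stats

mutations = 240          # hypothetical duplex variant calls in one sample
duplex_bases = 1.2e9     # hypothetical callable duplex base pairs
error_rate = 5e-9        # upper bound on the NanoSeq error rate (from the talk)

expected_errors = error_rate * duplex_bases          # <= 6 artefactual calls
burden = mutations / duplex_bases                    # mutations per base pair
lo, hi = stats.poisson.interval(0.95, mutations)     # Poisson 95% CI on the count
print(f"burden ~ {burden:.2e}/bp "
      f"(95% CI {lo / duplex_bases:.2e}-{hi / duplex_bases:.2e}); "
      f"expected error calls <= {expected_errors:.1f}")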
2025-07-21 17:20:00 17:40:00 04AB VarI Revisiting Cancer Predisposition: Identifying Altered Genes with Protective and Recessive Effects Michal Linial Michal Linial, Roei Zucker, Shirel Schreiber Identifying the genetic basis of cancer predisposition is essential for both preventive and personalized medicine. Genetic studies of cancer predisposition typically identify significant genomic regions through family-based cohorts or genome-wide association studies (GWAS). However, these approaches often lack biological insight and functional interpretation. In this study, we performed a comprehensive analysis of cancer predisposition in the UK Biobank (UKB) cohort using a novel gene-based method to identify interpretable protein-coding genes associated with ten major cancer types. Specifically, we applied proteome-wide association studies (PWAS) to detect genetic associations driven by alterations in protein function. Through PWAS, we identified 110 significant gene-cancer associations across the ten cancer types. Notably, in 44% of these associations, the damaged gene was linked to reduced rather than elevated cancer risk, suggesting a protective effect. Together with classical GWAS, we identified 145 unique genomic loci associated with cancer risk. While many of these regions are supported by external evidence, 51 are novel loci. Additionally, leveraging the ability of PWAS to detect non-additive genetic effects, we found that 46% of PWAS-significant cancer regions exhibited exclusively recessive inheritance, underscoring the importance of overlooked recessive genetic effects. These findings support a refreshed view of cancer predisposition that highlights recessive effects, protective genes, and the interrelation of genes across cancer types. We provide PWAS Hub as an interactive tool to navigate among genes, cancer phenotypes and inheritance modes. We conclude that expanding the list of cancer predisposition genes will benefit early diagnosis, genetic counseling, and personalized risk assessment.
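The dominant vs. recessive distinction that PWAS exploits can be sketched in a few lines. In the hypothetical Python example below, per-allele damage probabilities are aggregated into a dominant score (at least one damaged allele) and a recessive score (both alleles damaged), and each is tested against a toy phenotype; the scoring is a simplified stand-in for the published PWAS machinery, not a reimplementation of it.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 2000
q = rng.beta(0.5, 8, size=(n, 2))   # toy P(allele k of the gene is damaged)

dominant  = 1 - (1 - q[:, 0]) * (1 - q[:, 1])   # >= 1 damaged allele
recessive = q[:, 0] * q[:, 1]                   # both alleles damaged

cancer = rng.binomial(1, 0.1 + 0.3 * recessive) # toy recessive-risk phenotype

for name, score in [("dominant", dominant), ("recessive", recessive)]:
    t, p = stats.ttest_ind(score[cancer == 1], score[cancer == 0])
    print(f"{name:9s} t={t:+.2f} p={p:.2e}")

A protective association of the kind the abstract describes would simply show the opposite sign: lower damage scores among cases than controls.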
2025-07-21 17:40:00 17:50:00 04AB VarI COBT: A gene-based rare variant burden test for case-only study designs using aggregated genotypes from public reference cohorts Antoine Favier Alexandre Benmerah, Sophie Saunier, Tania Attie-Bitach, Mohamad Zaidan, Céline Huber, Valérie Cormier-Daire, Isabelle Perrault, Jean-Michel Rozet, Katy Billot, Yoann Martin, Antoine Favier, Agathe Guilloux, Manuel Higueras Hernáez, Anita Burgun, Nicolas Garcelon, Xiaoyi Chen, Fabienne Jabot-Hanin, Alejandro García Sánchez, Stefania Chounta, Antonio Rausell More than 4,000 rare genetic diseases have been reported, affecting 1 in 16 people. Yet around 50% of patients remain undiagnosed after genetic testing. Identifying genotype-phenotype associations remains challenging due to limited cohort sizes and high clinical and genetic heterogeneity. Burden tests of rare variants increase statistical power in case-control designs but are of limited use in rare disease studies due to the lack of matched controls. Case-only aggregation tests have recently emerged; however, most rely on assessing the number of individuals carrying variants under dominant or recessive models rather than the aggregated number of variants across the cohort, overlooking hypomorphic and modifier variants or heterogeneous inheritance modes requiring additive models. We present the Case-Only Burden Test (COBT), a gene-based rare variant burden test for case-only designs. COBT uses a Poisson parametric test to evaluate the excess of variants in a gene observed in a patient cohort, compared to expectations inferred from general population mutation rates. We validated the statistical assumptions and goodness-of-fit of the method on non-Finnish European individuals from the 1000 Genomes Project. COBT showed low false positive rates, contrasting with the high p-value inflation of alternative case-only rare variant tests. Applied to 478 ciliopathy patients, COBT successfully re-identified known disease genes previously annotated via expert review and uncovered novel candidate genes in undiagnosed patients, consistent with clinical phenotypes. Our results show that COBT can uncover genotype-phenotype associations in case-only retrospective studies of rare-disease cohorts driven by primary as well as secondary hits with major or modifier clinical roles.
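The core of COBT as described, a Poisson test for an excess of rare variants over a population-derived expectation, fits in a few lines of Python. The per-individual expected rate below is an invented placeholder for the mutation-rate estimate a real analysis would derive from general-population reference data:

from scipy import stats

observed = 14                 # hypothetical rare variants in one gene across the cohort
rate_per_individual = 0.004   # hypothetical expected rare variants per individual
n_patients = 478              # cohort size from the abstract

expected = rate_per_individual * n_patients
p_value = stats.poisson.sf(observed - 1, expected)   # P(X >= observed)
print(f"expected {expected:.2f}, observed {observed}, one-sided p = {p_value:.2e}")

In a genome-wide screen this p-value would then be corrected for the number of genes tested, which is where the abstract's emphasis on low false positive rates and the absence of p-value inflation matters.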
2025-07-21 17:50:00 18:00:00 04AB VarI Concluding remarks and prizes VarICOSI Co-chairs
2025-07-24 11:20:00 13:00:00 11A WEB 2025 Exchanging experience and use cases of working cross-sector Gabriella Rustici, Sarah Morgan, TBC The panel will share and explore their experiences of organising and delivering training across the industry, clinical and academic sectors, highlighting both the challenges and opportunities that cross-sector working can produce.
2025-07-24 14:00:00 16:00:00 11A WEB 2025 Defining principles for working together on training and skills development This session will be split into two parts and run as a workshop, gathering ideas and ways forward from the wider ISMB training community (including industry panel members and students). Part 1 will focus on defining principles to enable closer working relationships between industry and academia for skills development. Part 2 will then look at how ISCB can support this cross-sector working for training.
2025-07-20 09:00:00 10:45:00 02F Youth Bioinformatics Symposium Youth Bioinformatics Symposium
2025-07-20 11:00:00 13:00:00 02F Youth Bioinformatics Symposium Youth Bioinformatics Symposium
2025-07-20 14:00:00 16:00:00 02F Youth Bioinformatics Symposium Youth Bioinformatics Symposium
2025-07-20 16:15:00 18:00:00 02F Youth Bioinformatics Symposium Youth Bioinformatics Symposium
