Schedule subject to change
All times listed are in EDT
Monday, July 15th
10:40-11:00
Opening Remarks
Room: 521
Format: In Person

Moderator(s): Julien Gagneur


Authors List: Show

Exploring the landscape of regulatory uORFs in BMPR2 and their potential as therapeutic targets
Confirmed Presenter: Danielle Gutman, University of Pennsylvania, Department of Genetics, United States

Room: 521
Format: In Person

Moderator(s): Antonio Raussel


Authors List: Show

  • Danielle Gutman, University of Pennsylvania, Department of Genetics, United States
  • Isaac Hoskins, University of Pennsylvania, Department of Genetics, United States
  • San Jewell, University of Pennsylvania, Department of Genetics, United States
  • David Lee, Northwestern Medicine, United States
  • Louis R. Ghanem, Perelman School of Medicine at the University of Pennsylvania, Department of Pediatrics, United States
  • Nicholas J. Hand, University of Pennsylvania, Department of Genetics, Institute of Translational Medicine and Therapeutics, United States
  • Yoseph Barash, University of Pennsylvania, Department of Genetics, Department of Computer and Information Science, United States

Presentation Overview: Show

About 50% of human genes harbor upstream open reading frames (uORFs), with a start codon in the 5’ untranslated region (5’UTR) occurring before the coding sequence (CDS) start, and an in-frame stop codon occurring before or after the CDS start. Previous work demonstrated functional roles for uORFs in regulating their CDS expression, specifically when altering uORF start/stop codons. To detect functional uORFs, we created a database that combines experimentally detected uORFs and transcriptome-based uORF predictions. As a test case for our selection pipeline, we chose the bone morphogenetic protein receptor type II (BMPR2) gene, which is strongly associated with Pulmonary Arterial Hypertension (PAH). The 5’UTR of BMPR2 harbors over 30 uORFs, making it well suited to functional validation. We cloned 921 bp of the BMPR2 5’UTR into a novel bi-cistronic dual luciferase reporter vector, mutated start/stop codons of selected uORFs, and demonstrated significant down- and up-regulation of reporter activity by a subset of regulatory variants compared to WT. We designed ASOs targeting two variant-identified regions of the WT 5’UTR and used those to treat a BMP-responsive reporter cell line. Both ASOs showed significant up-regulation of the BMP pathway compared to control. We are currently testing the ASOs in PAH patient-derived cells and WT mice, and exploring additional uORFs in multiple genes implicated in different diseases. In conclusion, we generated an effective pipeline to assess uORFs’ function and targetability. We aim to exploit this pipeline to enrich our understanding of 5’UTR variation and ultimately nominate potentially actionable therapeutic targets.
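The uORF definition given in this abstract (a start codon inside the 5’UTR and an in-frame stop codon before or after the CDS start) can be illustrated with a short scan. This is only a sketch and not the authors' database or selection pipeline: the function name `find_uorfs` is hypothetical, only canonical ATG starts are considered, and the transcript is assumed to be written in the DNA alphabet with a known 0-based CDS start.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(transcript, cds_start):
    """Return (start, end, overlaps_cds) for each ATG-initiated uORF.

    transcript: mRNA sequence written in the DNA alphabet (A/C/G/T).
    cds_start:  0-based index of the first base of the main CDS start codon.
    Assumption: non-canonical start codons are ignored.
    """
    seq = transcript.upper()
    uorfs = []
    # The uORF start codon must lie entirely inside the 5'UTR.
    for start in range(cds_start - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk downstream codon by codon, staying in the uORF's own frame.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                end = pos + 3  # exclusive end, stop codon included
                uorfs.append((start, end, end > cds_start))
                break
    return uorfs


# A uORF that terminates inside the 5'UTR (main CDS starts at index 13).
contained = find_uorfs("CCATGAAATAGCCATGGCCTAA", 13)
# A uORF whose stop codon lies past the CDS start (main CDS starts at index 5).
overlapping = find_uorfs("ATGACATGCTGAAA", 5)
```

The third tuple element separates uORFs that end within the 5’UTR from those that run into the CDS, which are the two cases the definition allows.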

11:00-11:40
Invited Presentation: Beyond the sequence: interpreting missense variants with structure context
Confirmed Presenter: Jun Cheng

Room: 521
Format: Live Stream

Moderator(s): Julien Gagneur


Authors List: Show

  • Jun Cheng

Presentation Overview: Show

The vast majority of missense variants observed in the human genome are of unknown clinical significance. Machine learning approaches could close this variant interpretation gap by exploiting patterns in biological data to predict the pathogenicity of unannotated variants. I will discuss AlphaMissense, which combines advances of the highly-accurate structure prediction model, AlphaFold, and population variant data to predict missense variant pathogenicity. We demonstrate state-of-the-art predictions on clinically-ascertained labels and experimental benchmarks, without explicitly training on such data. Due to higher predictive performance, the fraction of ClinVar test variants that we can confidently classify with 90% precision has increased by 25.8 percentage points (from 67.1% to 92.9%) compared to the recent well-performing unsupervised model EVE. I will also cover aspects of model evaluation, interpretation and utility. For instance, we find that gene level AlphaMissense scores are predictive of genes essential to cell survival, and this property holds amongst the 22% of smaller genes, which methods based only on population cohort data lack statistical power to detect reliably.

11:40-12:00
Capturing biophysical and protein language model constraints for an improved assessment of the impact of mutations on protein function and stability
Confirmed Presenter: Wim Vranken, Vrije Universiteit Brussel, Belgium

Room: 521
Format: In Person

Moderator(s): Julien Gagneur


Authors List: Show

  • Wim Vranken, Vrije Universiteit Brussel, Belgium
  • Konstantina Tzavella, Vrije Universiteit Brussel, Belgium
  • Catharina Olsen, Vrije Universiteit Brussel, Belgium

Presentation Overview: Show

Our understanding of how proteins operate and how evolution shapes them is mainly based on their overall fold in relation to their amino acid sequence. The direct relation between these is now largely solved by methods such as AlphaFold2. However, capturing the subtle effect of mutations on protein behavior, especially in dynamic and structurally ambiguous regions, remains difficult. Protein language models (pLMs) only require sequence information and are able to capture common patterns across protein families. pLMs, however, lack the protein-specific nuances required for mutational analysis, while producing an enormous number of difficult-to-interpret features for each sequence position in a protein. We here introduce the D2D model, which captures protein-specific constraints on the pLM features through the use of evolutionary data in combination with a Gaussian mixture model (GMM). This combination can, out of the box, perform a diverse set of tasks such as predicting single and multiple mutation pathogenicity, protein thermostability and the effect of mutations on binding. When further fine-tuned by supervised training, the D2D model significantly outperforms state-of-the-art predictors, such as in the context of passenger and driver mutations in cancer. For interpretation of the effect of these mutations, we can similarly define ‘biophysical constraints’ based on our suite of predictors of biophysical features of proteins, such as backbone dynamics or early folding regions. The combination of these approaches thus provides high-accuracy predictions for a variety of problems while still enabling physical interpretability, including for protein regions that do not have a single well-defined fold.

12:00-12:20
VespaG: Expert-guided protein Language Models enable accurate and blazingly fast fitness prediction
Confirmed Presenter: Burkhard Rost, TUM Munich, Germany

Room: 521
Format: In Person

Moderator(s): Julien Gagneur


Authors List: Show

  • Celine Marquet, TUM, Germany
  • Julius Schlensok, TUM, Germany
  • Marina Abakarova, LCQB, Sorbonne University Paris, France
  • Burkhard Rost, TUM Munich, Germany
  • Elodie Laine, LCQB, Sorbonne University Paris, France

Presentation Overview: Show

Exhaustively annotating the experimental effect of all known protein variants upon molecular protein function remains daunting and expensive. In response, we present VespaG, an effect predictor that uses expert-guided protein Language Models (pLMs) to couple high accuracy with unprecedented speed in fitness prediction. We addressed the scarcity of experimental data by “curating” a dataset comprising 39 million Single Amino Acid Variants (SAVs) from Homo sapiens by explicitly modeling the evolutionary history of natural sequences based on multiple sequence alignments with the state-of-the-art effect prediction method, GEMME. This established a large training set.

VespaG is a minimalist deep learning model directly mapping pLM embeddings to comprehensive SAV mutational landscapes. Evaluation against the ProteinGym Substitution Benchmark with 2.5 million SAVs from 217 multiplex assays of variant effect (MAVE) demonstrated VespaG's efficacy (mean Spearman correlation 0.495±0.04, 95% confidence interval). This performance was comparable to that of methods such as VespaG's teacher GEMME, VESPA, and the much more sophisticated TranceptEVE and AlphaMissense. VespaG reached its top-level performance several orders of magnitude faster, predicting entire mutational landscapes for 20,000 proteins in under an hour on a desktop 32-core CPU, unlocking new possibilities for rapid variant assessment. VespaG performed much better for eukaryotes and prokaryotes than for viruses within ProteinGym (mean Spearman 0.508±0.02 vs. 0.414±0.02 for viral), underlining peculiarities of viral evolution and/or potential limitations of current pLMs.

VespaG is available freely at https://github.com/JSchlensok/VespaG
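
The benchmark figures in this abstract are Spearman correlations between predicted and measured variant effects. As a reminder of what that metric computes, here is a minimal standard-library sketch (not code from the VespaG repository; the function names and the toy numbers are illustrative assumptions): both lists are converted to ranks, with ties sharing their average rank, and the Pearson correlation of the ranks is returned.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(predicted, measured):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(measured))


# Toy example: four variants, hypothetical predicted scores and assay values.
rho = spearman([0.1, 0.4, 0.7, 0.9], [1.0, 2.0, 4.0, 3.0])
```

Because only ranks matter, the metric rewards a predictor for ordering variants correctly even when its scores are on a different scale from the assay.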

Addressing biases in large language models for variant impact prediction in macro proteins
Confirmed Presenter: Oriol Gracia I Carmona, King's College London and University College London, United Kingdom

Room: 521
Format: In Person

Moderator(s): Julien Gagneur


Authors List: Show

  • Oriol Gracia I Carmona, King's College London and University College London, United Kingdom
  • Timir Weston, King's College London, United Kingdom
  • Mathias Gautel, King's College London, United Kingdom
  • Aleksej Zelezniak, King's College London, United Kingdom
  • Franca Fraternali, University College London, United Kingdom

Presentation Overview: Show

Large multi-domain proteins pose significant challenges for both experimental studies and computational modelling due to their extensive size, often spanning thousands of amino acids. While Large Language Models (LLMs) offer promise in studying variant effects, their token limit falls short of accommodating these macro proteins. One solution consists in dividing proteins into domain groups. However, since LLMs were trained on complete proteins, biases may arise when analysing protein fragments.
Our study delves deeper into these biases by assessing the performance of ESM2 across a diverse set of proteins containing prevalent domains found in macro proteins. Predictions performed on domain sequences displayed significantly lower confidence than those obtained using whole sequences as context, particularly near the N-terminal end positions.
To mitigate this bias, we fine-tuned ESM2 using a dataset of domain sequences. The resulting model exhibited no significant differences between full-length and domain sequences, yielding a narrower distribution of discrepancies compared to the original ESM2. Moreover, the final predictions from the fine-tuned model aligned closely with those of ESM2 using the whole sequence as context, indicating successful bias reduction without compromising prediction quality. Furthermore, we evaluated the fine-tuned model's efficacy in testing variant effects on Titin, a macro protein with over 35,000 amino acids. Our model outperformed both the original ESM2 and other state-of-the-art variant effect prediction methods across all tested metrics.
In summary, our findings highlight the effectiveness of fine-tuning LLMs on domain sequences to alleviate biases and improve accuracy in studying variant effects on large multi-domain proteins like Titin.

14:20-15:00
Invited Presentation: Clinical classification of variation for disease causality
Confirmed Presenter: Heidi Rehm

Room: 521
Format: In Person

Moderator(s): Antonio Raussel


Authors List: Show

  • Heidi Rehm

15:00-15:20
Ensemble Prediction of the Clinical Impact of Missense Variants Substantially Decreases VUS Rate in Genetic Testing
Confirmed Presenter: Robert Kueffner, GeneDx, United States

Room: 521
Format: In Person

Moderator(s): Antonio Raussel


Authors List: Show

  • Robert Kueffner, GeneDx, United States
  • Maria Guillen Sacoto, GeneDx, United States
  • Gustavo Stolovitzky, None, United States

Presentation Overview: Show

About 90% of the missense variants identified in clinical genetic testing are classified as Variants of Uncertain Significance (VUS) based on current American College of Medical Genetics and Genomics (ACMG) guidelines. In ACMG’s earlier framework for variant calling evidence, computational approaches were considered only supporting evidence (the lowest level in the ACMG framework). Since then, newer missense impact prediction algorithms have been able to improve performance substantially.

We find that even the best current computational predictors, such as AlphaMissense and REVEL, offer scores (a score being the degree of evidence for pathogenicity) that are divergent across genes and complementary across methods, which impedes thresholding of pathogenic or benign variants. We analyze and normalize these effects by a calibration that models the probability of pathogenicity based on a maximum entropy assumption and then use this calibrated probability in a dedicated ensemble strategy. At a very stringent cutoff of 99% precision, we recall 80% of a ground truth set consisting of manually vetted pathogenic and benign variants, substantially improving on REVEL (47% recall) and AlphaMissense (54% recall). At the same cutoff, our method suggests that 51% of VUSes can be moved to VUS leaning benign or VUS leaning pathogenic, at a benign:pathogenic proportion of 2:1.

Our ensemble strategy improves computational predictions of missense variants dramatically over AlphaMissense or REVEL alone, potentially disambiguating up to 50% of the missense VUSes encountered in current clinical tests. The presented approach may be the first optimal strategy for score calibration and ensemble classification, which are important steps in many bioinformatics and machine learning applications.
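
The headline numbers in this abstract are recall values at a fixed precision cutoff. The sketch below shows how such a figure can be read off a scored, labeled set. It illustrates the evaluation metric only, not the calibration or ensemble method described above; the function name and the toy scores are hypothetical.

```python
def recall_at_precision(scores, labels, min_precision):
    """Best recall over all score thresholds whose precision >= min_precision.

    scores: higher means more evidence for pathogenicity.
    labels: 1 for pathogenic, 0 for benign (ground truth).
    Returns (recall, threshold); threshold is None if no cutoff qualifies.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one pathogenic variant")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    tp = fp = 0
    best_recall, best_threshold = 0.0, None
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Variants with identical scores are called together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= min_precision and recall > best_recall:
            best_recall, best_threshold = recall, threshold
    return best_recall, best_threshold


# Toy data: five variants with made-up scores.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 1, 0, 1, 0]
strict = recall_at_precision(toy_scores, toy_labels, 0.99)
relaxed = recall_at_precision(toy_scores, toy_labels, 0.75)
```

Raising the required precision can only keep or lower the achievable recall, which is why a predictor that retains 80% recall at a 99% cutoff compares well against ones that retain about half.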

MAJIQ-CLIN: A novel tool for the identification of Mendelian disease-causing variants from RNA-seq data
Confirmed Presenter: Dina Issakova, University of Pennsylvania, United States

Room: 521
Format: In Person

Moderator(s): Antonio Raussel


Authors List: Show

  • Joseph Aicher, University of Pennsylvania, United States
  • Dina Issakova, University of Pennsylvania, United States
  • Barry Slaff, University of Pennsylvania, United States
  • San Jewell, University of Pennsylvania, United States
  • Gregory Grant, University of Pennsylvania, United States
  • Nicholas Lahens, University of Pennsylvania, United States
  • Elizabeth Bhoj, Children's Hospital of Philadelphia, United States
  • Yoseph Barash, University of Pennsylvania, United States

Presentation Overview: Show

Exome sequencing (ES) is the current standard of care for patients with suspected Mendelian genetic disorders. However, the diagnostic rate is only 25 to 58%. One key regulatory process of gene expression that is not captured well by ES is RNA splicing. Changes in splicing, or alternative splicing, naturally occur in up to 95% of human genes, and 38-50% of human pathogenic variants are estimated to alter RNA splicing, including in notable Mendelian disorders. It is therefore crucial to develop reliable tools to detect splicing aberrations from patient RNA-seq to improve current diagnostic rates. To be considered reliable for this purpose, a tool should detect splicing aberrations in previously solved cases, be easy to use, and use resources realistic for a clinical setting. In recent years a few tools were developed to address this need, specifically LeafCutterMD and FRASER. While both served as good proofs of concept for enhancing clinical diagnostics using RNA-Seq, our analysis indicates that several challenges remain. To address these, we developed MAJIQ-CLIN, a pipeline for detecting splicing aberrations in a patient’s RNA-Seq sample compared to a large cohort of controls. We evaluate existing tools against MAJIQ-CLIN using both synthetic data with spiked-in splice variations and several datasets of solved test cases, demonstrating that it compares favorably to both LeafCutterMD and FRASER, with significant improvements in time, memory, usability, and accuracy. We hope to establish MAJIQ-CLIN as a tool routinely used in clinical practice to improve patient outcomes.

15:20-15:40
Reclassifying variants of uncertain significance with transcriptional profiling
Confirmed Presenter: Kivilcim Ozturk, University of California San Diego, United States

Room: 521
Format: Live Stream

Moderator(s): Antonio Raussel


Authors List: Show

  • Kivilcim Ozturk, University of California San Diego, United States
  • Rebecca Panwala, University of California San Diego, United States
  • Jeanna Sheen, University of California San Diego, United States
  • Kyle Ford, University of California San Diego, United States
  • Nathan Jayne, University of California San Diego, United States
  • Dong-Er Zhang, University of California San Diego, United States
  • Stephan Hutter, MLL Munich Leukemia Laboratory, Munich, Germany
  • Torsten Haferlach, MLL Munich Leukemia Laboratory, Munich, Germany
  • Trey Ideker, University of California San Diego, United States
  • Prashant Mali, University of California San Diego, United States
  • Hannah Carter, University of California San Diego, United States

Presentation Overview: Show

Understanding the functional impact of single amino acid substitutions in cancer driver genes remains an unmet need. Perturb-seq provides a tool to investigate the effects of individual mutations on cellular programs by measuring their transcriptional consequences. Here, we develop an approach to functionally assess variant impact in single cells. We deploy ScalablE fUnctional Screening by Sequencing (SEUSS), a Perturb-seq style technique, to generate and assay mutations of the Runt-related transcription factor 1 (RUNX1). We measured the impact of 115 mutations on RNA profiles in single myelogenous leukemia cells and used the profiles to categorize mutations into three functionally distinct groups: wild-type-like, loss-of-function-like and hypomorphic, that were validated in orthogonal assays. Using these profiles, we identified the functional impact of 16 documented variants of uncertain significance. Next, we trained a Random Forest classifier with our variant library serving as the training data and predicted functional effects of all remaining RUNX1 mutations (n=2582), resulting in predictions for a further 103 variants of uncertain significance and achieving an auROC of 0.79 and an auPR of 0.82 on a gold standard dataset. Overall, our work demonstrates the power of transcriptional profiling in single cells to assess the functional impact of missense mutations on cellular programs and provides a scalable method for coding variant impact phenotyping.

15:40-16:00
Metacell burden: A method to quantify the effects on neurodevelopmental disorders of rare genomic variants aggregated across brain cells.
Confirmed Presenter: Thomas Renne, Université de Montréal, Canada

Room: 521
Format: In Person

Moderator(s): Antonio Raussel


Authors List: Show

  • Thomas Renne, Université de Montréal, Canada
  • Cécile Poulain, Université de Montréal, Canada
  • Alma Dubuc, ENS Lyon, France
  • Raphael Bourque, Université de Montréal, Canada
  • Guillaume Huguet, CHU Sainte Justine, Canada
  • Tomasz Nowakowski, UCSF, United States
  • Sébastien Jacquemont, CHU Sainte Justine, Canada

Presentation Overview: Show

Neurodevelopmental disorders (NDs), such as autism or intellectual disabilities, affect up to 5% of the population. Despite the high heritability of these disorders, their genetic contribution remains largely unidentified. WES has become pivotal in identifying structural variants and SNVs associated with NDs. However, their functional implications remain undetermined. Recent scRNAseq datasets give new insight into the relations between cell types and NDs. Still, the majority of variants contributing to NDs are too rare and understudied, so their impact and the underlying biological functions remain unknown.

Our study introduces a novel metacell burden analysis to investigate the association between rare variants carried by 500k individuals, their phenotypes, and the underlying brain cell types. We aggregate genes linked to biological functions (e.g. cell types or metacells) based on the expression extracted from large-scale single-cell RNA-seq data. The burden of each gene set on a phenotype is then computed through a linear regression on the number of the metacell’s genes disrupted by a variant.

We found that rare LoF variants disrupting genes associated with excitatory and inhibitory neurons are associated with cognitive ability. The analysis of structural variants highlighted other cell types such as microglia, showing that cognitive ability results from multiple pathways. Some phenotypes have correlated effect sizes for specific cell types while others do not, suggesting that some traits share developmental pathways. These findings, consistent with hypotheses in the literature, give insight into the relation between rare variants, neurodevelopmental traits and cell profiles. This new approach offers a promising opportunity for further exploration of cell types and phenotypes.
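
The burden computation described in this abstract (a linear regression of a phenotype on the number of a metacell's genes disrupted in each individual) can be sketched as follows. This is a deliberately reduced illustration, not the authors' implementation: it fits a single-predictor ordinary least squares line with no covariates, and the function names, gene names, and phenotype values are all hypothetical.

```python
def burden_scores(disrupted_genes_per_person, gene_set):
    """Count, per individual, how many genes of the gene set are disrupted."""
    return [len(set(genes) & gene_set) for genes in disrupted_genes_per_person]


def ols_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("burden is constant across individuals")
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical gene set for one metacell and four individuals.
metacell_genes = {"GENE_A", "GENE_B"}
carriers = [{"GENE_A", "GENE_B"}, {"GENE_A"}, set(), {"GENE_C"}]
phenotype = [-1.0, -0.5, 0.0, 0.0]  # made-up standardized trait values

burden = burden_scores(carriers, metacell_genes)
effect_size, intercept = ols_fit(burden, phenotype)
```

Here the slope plays the role of the gene set's effect size on the trait. A real analysis would presumably adjust for covariates such as age, sex, and ancestry, which this sketch omits.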

16:40-17:00
Proceedings Presentation: Representing Mutations for Predicting Cancer Drug Response
Confirmed Presenter: Patrick Wall, UC San Diego, United States

Room: 521
Format: In Person

Moderator(s): Hannah Carter


Authors List: Show

  • Patrick Wall, UC San Diego, United States
  • Trey Ideker, UC San Diego, United States

Presentation Overview: Show

Motivation. Predicting cancer drug response requires a comprehensive assessment of many mutations present across a tumor genome. While current drug response models generally use a binary mutated/unmutated indicator for each gene, not all mutations in a gene are equivalent.

Results. Here, we construct and evaluate a series of predictive models based on leading methods for quantitative mutation scoring. These methods include VEST4 and CADD, which score the likely impact of a mutation on normal gene function, and CHASMplus, which scores the likelihood the mutation drives cancer. These models capture cellular responses to dabrafenib, which specifically targets BRAF V600 mutations, whereas models based on binary mutation status do not. These performance improvements generalize to other drug responses, extending genetic indications for PIK3CA, ERBB2, EGFR, PARP1, and ABL1 inhibitors.

Conclusion. Introducing quantitative mutation features in drug response models increases predictive performance and mechanistic understanding.

Availability. Source code and a sample input dataset are available at https://github.com/pgwall/qms.

17:00-17:20
Assessing lethal missense mutations and polymorphism in Drosophila melanogaster with an evolutionary-informed model
Confirmed Presenter: Marina Abakarova, Sorbonne Université, France

Room: 521
Format: In Person

Moderator(s): Hannah Carter


Authors List: Show

  • Marina Abakarova, Sorbonne Université, France
  • Michael Rera, Université Paris Cité, France
  • Elodie Laine, Sorbonne Université, France

Presentation Overview: Show

This study investigates the impact of missense mutations on the Drosophila melanogaster proteome and contributes to our understanding of the genotype-phenotype relationship, with implications for targeted protein editing. We applied a computationally efficient approach we developed previously to predict the impact of all possible mutations in the fly proteome. It combines fast protein sequence search and alignment with evolutionary-informed mutational effect predictions. Leveraging resources such as FlyBase and the Drosophila Genetic Reference Panel (DGRP), we assessed the discriminative power of the predictions and investigated the interplay between polymorphism, including isoforms, evolutionary conservation, and pathogenicity at the organismal level. The approach accurately distinguishes benign from pathogenic mutations, achieving a balanced accuracy of 0.856. Beyond predictive capability, we found that invariant genes in the DGRP population demonstrate a greater variability across the kingdom of Life. Specifically, non-polymorphic and lethality-induced genes present a 3.8-fold enrichment in the high fraction of observed substitutions in the protein homologs (>85%). Additionally, we showcase the importance of the context for variant effect prediction on the proteoforms of Mef2, a muscle-specific transcription factor.
We provide the community with full variant effect predictions for the entire fly proteome, accessible at https://doi.org/10.5281/zenodo.10995110. Since the approach relies on the quality of the input MSA, we provide both global and local confidence metrics to guide users.
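
The balanced accuracy reported in this abstract is the mean of sensitivity and specificity, which keeps the score meaningful when benign and pathogenic mutations are unequally represented. A minimal sketch of the metric follows; the function name and the toy labels are illustrative and are not taken from the study's data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = pathogenic)."""
    pos = sum(1 for t in y_true if t)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return 0.5 * (tp / pos + tn / neg)


# Toy example: three pathogenic and one benign mutation, one pathogenic missed.
score = balanced_accuracy([1, 1, 1, 0], [1, 1, 0, 0])
```

In this toy case plain accuracy would be 0.75, while balanced accuracy weights the single benign example as heavily as the three pathogenic ones.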

A phylogenetic mutation-selection model predicts fitness effects of mutations in extant mammals
Confirmed Presenter: Thibault Latrille, Department of Computational Biology, University of Lausanne, Switzerland

Room: 521
Format: In Person

Moderator(s): Hannah Carter


Authors List: Show

  • Thibault Latrille, Department of Computational Biology, University of Lausanne, Switzerland
  • Julien Joseph, Laboratoire de Biométrie et Biologie Evolutive, UMR5558, Université Lyon 1, France
  • Diego A. Hartasánchez, Department of Computational Biology, University of Lausanne, Switzerland
  • Nicolas Salamin, Department of Computational Biology, University of Lausanne, Switzerland

Presentation Overview: Show

At the phylogenetic scale, sequence variation informs us about the selective effects of mutations. Indeed, mutations can be either beneficial, deleterious or neutral for their bearer, influencing the likelihood for a mutation to reach fixation. In this study, we first estimated the selective effects of point mutations inside mammalian protein coding sequences, assuming a nearly-neutral model of evolution at the phylogenetic scale. Confronting phylogenetic and population genomics datasets, we then confirmed that mutations predicted to be deleterious from the phylogenetic analysis are currently purified away in extant populations. Conversely, mutations predicted to repair previous deleterious changes are indeed shown to be beneficial in extant populations. This study confirms that deleterious substitutions are accumulating in mammals and are being reverted, generating a balance in which genomes are damaged and restored simultaneously at different loci. At the interface between population genomics and phylogenetic analysis, our work supports a nearly-neutral model of evolution at the phylogenetic scale, informing us about the effect of point mutations for extant populations and individuals. We observe that in 24 out of 28 populations analyzed, between 15% and 45% of all beneficial mutations that are currently segregating in the population are not due to an environmental change. Thus a substantial part of ongoing positive selection is not driven solely by adaptation to environmental change in mammals. Finally, we show that we can also use this nearly-neutral model of evolution as a null model above which we can detect adaptation in protein-coding DNA sequences.

17:20-18:00
DNA language models reveal the architecture of nucleotide dependencies in genomes
Confirmed Presenter: Pedro Tomaz da Silva, Technical University of Munich, Germany

Room: 521
Format: In Person

Moderator(s): Hannah Carter


Authors List: Show

  • Pedro Tomaz da Silva, Technical University of Munich, Germany
  • Alexander Karollus, Technical University of Munich, Germany
  • Johannes Hingerl, Technical University of Munich, Germany
  • Xavier Hernandez-Alias, Mechanisms of Protein Biogenesis, Max Planck Institute of Biochemistry, Germany
  • Gihanna Galindez, Technical University of Munich, Germany
  • Nils Wagner, Technical University of Munich, Germany
  • Danny Incarnato, University of Groningen, Netherlands
  • Julien Gagneur, Technical University of Munich, Germany

Presentation Overview: Show

While the genome is composed of individual nucleotides, functional elements such as cis-regulatory elements and structural interactions are formed from sets of interdependent nucleotides. In principle, these dependencies are reflected in coevolutionary relationships. However, their detection beyond coding sequences is challenging with classical approaches.

DNA language models (LMs), which are trained by predicting nucleotides given their sequence context, have recently been proposed as foundational models for sequence-based prediction problems. DNA LMs implicitly capture functional elements from genomic sequences alone. However, which dependencies DNA LMs learn and whether they reflect known or even novel biology remains an open question.

Here we introduce nucleotide dependency maps to systematically study nucleotide dependencies captured by DNA LMs in a purely unsupervised setup.
We compute these maps genome-wide and show that they reveal and clearly delineate known functional genomic features such as transcription factor binding motifs, functional interactions between splice sites, RNA tertiary structures, and coding sequences. Additionally we uncover novel and conserved dependency structures suitable for experimental validation.

We furthermore investigate dependency maps from in silico manipulated sequences, revealing the ability of DNA LMs to capture operations such as copying and reverse complementarity without memorization.

Lastly, we compare dependency maps from openly available DNA LMs, showcasing the drawbacks and advantages of different models. We find stark differences in the ability of models to accurately learn conserved but infrequent features.

Altogether, by leveraging the flexibility of DNA language models, nucleotide dependency mapping emerges as a general methodology to discover and study functional interactions in genomes.