Schedule subject to change
All times listed are in CDT

Thursday, May 15th

10:30-10:35

Opening Remarks

Format: In person

Moderator(s): Nic Fisk

10:35-11:20

Invited Presentation: From forecasting to design: a retrospective case study in modeling SARS-CoV-2 immunity

Format: In person

Moderator(s): Nic Fisk

11:20-11:40

Data-Centric Artificial Intelligence for Environmental Toxicology

Confirmed Presenter: Yimin Wang, Wake Forest University, United States

Format: In Person

Evolutionary Coreset Selection for Rapid Classification of Endocrine Disrupting Chemicals

Endocrine Disrupting Chemicals (EDCs) are major hazards to environmental and public health because they can disrupt hormonal systems. Recent developments in high-throughput screening and computational modeling have made it possible to detect these chemicals for regulatory and mitigation purposes. Computational modeling, in particular, has contributed greatly to predicting the interactions of EDCs with pharmacologically important proteins. However, most prior work has taken a model-centric approach, focusing on fine-tuning machine learning algorithms and architectures. In contrast, this work proposes a data-centric approach that focuses on methods for constructing training datasets of the highest quality. In particular, we developed and implemented an evolutionary coreset selection workflow and validated it with data released by the U.S. Environmental Protection Agency's Endocrine Disruptor Screening Program (EDSP) [1].

Starting from Simplified Molecular Input Line Entry System (SMILES) strings, we computed over 200 molecular descriptors and performed standard data cleaning and preparation steps, including the removal of highly correlated descriptors and dimensionality reduction. Next, we implemented a genetic algorithm (GA)-based coreset selection, which extracts smaller but representative samples from the entire training dataset. Using Falkenauer's grouping methods [2], our GA implementation uses a domain-specific recall fitness function that minimizes false negatives. We also used a grid search along with high-performance computing to optimize GA parameters. Through repeated experimentation, we found that coresets retained about 50% of the training data while preserving the original proportion of class labels.
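The selection step can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a plain stratified-subset GA rather than Falkenauer's grouping operators, a 1-nearest-neighbour classifier as a stand-in model, and a held-out validation set for the fitness evaluation. All function names and default parameters (`select_coreset`, `frac`, `pop_size`, and so on) are hypothetical.

```python
import random

def knn1_predict(train_X, train_y, x):
    """Label of the single nearest training point (squared Euclidean distance)."""
    nearest = min(range(len(train_X)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return train_y[nearest]

def recall(y_true, y_pred, positive=1):
    """TP / (TP + FN): penalises false negatives, ignores false positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if tp + fn else 0.0

def random_individual(class_idx, frac, rng):
    # Sample the same fraction from every class so label proportions are kept.
    return {c: set(rng.sample(idx, max(1, int(len(idx) * frac))))
            for c, idx in class_idx.items()}

def crossover(a, b, rng):
    # Per class, draw the child's members from the union of both parents.
    return {c: set(rng.sample(sorted(a[c] | b[c]), len(a[c]))) for c in a}

def mutate(ind, class_idx, rate, rng):
    # Swap a selected sample for an unselected one of the same class.
    for c, chosen in ind.items():
        unused = [i for i in class_idx[c] if i not in chosen]
        for i in sorted(chosen):
            if unused and rng.random() < rate:
                j = unused.pop(rng.randrange(len(unused)))
                chosen.remove(i)
                chosen.add(j)
                unused.append(i)
    return ind

def select_coreset(X, y, X_val, y_val, frac=0.5, pop_size=12,
                   generations=15, rate=0.1, seed=0):
    """Return (sorted coreset indices, validation recall) of the best individual."""
    rng = random.Random(seed)
    class_idx = {}
    for i, label in enumerate(y):
        class_idx.setdefault(label, []).append(i)

    def members(ind):
        return sorted(i for chosen in ind.values() for i in chosen)

    def fitness(ind):
        idx = members(ind)
        sub_X, sub_y = [X[i] for i in idx], [y[i] for i in idx]
        return recall(y_val, [knn1_predict(sub_X, sub_y, x) for x in X_val])

    pop = [random_individual(class_idx, frac, rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), class_idx, rate, rng))
        pop = parents + children
    best = max(pop, key=fitness)
    return members(best), fitness(best)
```

Because every operator works class by class, each candidate coreset keeps the same size and label proportions as the initial 50% draw, which mirrors the class-balance property reported above.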

Notably, our experiments demonstrated that there are no statistically significant differences in the performance of models trained with coresets selected by different classification methods (i.e., k-nearest neighbors, random forest, and XGBoost) compared to the baseline model trained with the entire dataset. GA-based coresets successfully reduced data size and maintained original class balance, demonstrating their applicability to resource-limited model training.

Overall, these findings emphasize the importance of data-driven approaches to computational screening and point to the need for more nuanced data engineering and algorithmic pipelines. While neither standard data cleaning nor GA-aided coreset selection [3] significantly improved recall on the EDSP dataset, future research may combine these techniques with larger datasets, advanced feature engineering, or model-based fine-tuning to better classify EDCs.

References:
[1] US Environmental Protection Agency. Endocrine Disruptor Screening Program (EDSP) in the 21st Century. https://www.epa.gov/endocrine-disruption/endocrine-disruptor-screening-program-edsp-21st-century, 2024. Accessed: 2024-07-01.
[2] Falkenauer E. A new representation and operators for genetic algorithms applied to grouping problems. Evolutionary Computation. 1994;2(2):123–144.
[3] Shapiro J. Genetic algorithms in machine learning. In: Advanced Course on Artificial Intelligence, pages 146–168. Springer, 1999.

11:40-12:00

Systematic contextual biases in SegmentNT relevant to all nucleotide transformer models

Confirmed Presenter: Justin B. Miller, University of Kentucky, United States

Format: In Person

Large language models (LLMs), such as the Nucleotide Transformer and its SegmentNT U-Net extension, provide nucleotide-resolution predictions of exons and introns that can be used to identify novel transcript isoforms and unravel complex genetic interactions associated with disease. However, algorithmic biases introduced by applying LLMs to genomic sequences are not well understood and may affect downstream predictions. Here, we demonstrate that cyclical biases exist within the prediction model, significantly impacting reported scores in a predictable manner. These biases are influenced by the length of the input sequence and the nucleotide position within that sequence.

Using five genes and a non-genic control region, we demonstrate consistent and significant biases in SegmentNT’s exon and intron predictions. In APOE, which harbors the strongest known genetic risk factor for Alzheimer’s disease, calculated exon scores within RefSeq canonical exons were significantly higher (P<2.23e-308) for the middle nucleotide in a context sequence of 24,576 nucleotides (mean=0.9725±0.0933; median=0.993) than for first- (mean=0.6733±0.1336; median=0.6966) or last-positioned nucleotides (mean=0.0412±0.0201; median=0.0382). We further demonstrate that we can account for these biases by subtracting the intron prediction from the exon prediction at a given nucleotide. With a threshold of 0 to differentiate introns from exons, all three positions within the input sequence provide strong predictive value, with the middle (accuracy=98.2%; AUC=0.9990) still outperforming the first (accuracy=95.0%; AUC=0.9772) and last (accuracy=87.9%; AUC=0.9667) positions.
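The correction described above amounts to a simple per-nucleotide rule, sketched here under the assumption that exon and intron probabilities are available as two equal-length lists. The function names and the example probabilities are illustrative only and are not taken from the study.

```python
def exon_calls(exon_probs, intron_probs, threshold=0.0):
    """Call a nucleotide exonic when (exon score - intron score) exceeds the threshold."""
    if len(exon_probs) != len(intron_probs):
        raise ValueError("exon and intron score lists must have the same length")
    return [(e - i) > threshold for e, i in zip(exon_probs, intron_probs)]

def accuracy(calls, truth):
    """Fraction of nucleotides whose call matches the annotation."""
    return sum(c == t for c, t in zip(calls, truth)) / len(truth)

# Illustrative values: the second nucleotide has a low absolute exon score
# (as can happen near the end of the input) but an even lower intron score.
calls = exon_calls([0.90, 0.04, 0.20], [0.10, 0.01, 0.60])
print(calls, accuracy(calls, [True, True, False]))
```

Using the difference rather than the raw exon score means a nucleotide with a depressed exon probability can still be called correctly as long as its intron probability is lower.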

We also examined the effect of input sequence length on predictions, finding that longer sequences improved exon probability estimates but with diminishing returns. In APOE, the median exon probability increased from 0.3443 (24 nucleotides) to 0.993 (24,576 nucleotides) with most accuracy gains plateauing beyond a context of 1,536 nucleotides (median exon probability=0.9898).

We formally assessed how each position in the input sequence affects SegmentNT’s probabilities and discovered a 24-nucleotide oscillatory bias in SegmentNT’s predictions. We found that shifting a nucleotide by just three positions in the input sequence altered its intron probability from 0.079 to 0.310 and its exon probability from 0.938 to 0.978 at position 850 in APOE exon 2. We postulate that these biases may result from SegmentNT’s 6-mer tokenization combined with the U-Net’s positional encoding strategy, which is similarly used by other genomic LLMs. Positions near the beginning and end of the input sequence exhibited greater variability in SegmentNT’s probabilities. These results highlight key biases in genomic foundation models that can be mitigated by normalizing predictions based on the difference between introns/exons across the middle 24 nucleotides in an input sequence of at least 1,536 nucleotides, thus improving the reliability of nucleotide-resolution functional annotations in genomic LLMs.
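One possible reading of this mitigation is sketched below: keep only the exon-minus-intron differences for the central 24 positions of a context window that is at least 1,536 nucleotides long. The function name, its parameters, and the choice to return per-position differences (rather than, say, their average) are assumptions for illustration, not the authors' published procedure.

```python
def middle_block_differences(exon_probs, intron_probs, core=24, min_context=1536):
    """Return (start index, exon - intron differences) for the central `core` positions."""
    n = len(exon_probs)
    if n != len(intron_probs):
        raise ValueError("exon and intron score lists must have the same length")
    if n < min_context:
        raise ValueError(f"context of {n} nt is shorter than {min_context} nt")
    start = (n - core) // 2  # first index of the central block
    diffs = [e - i for e, i in zip(exon_probs[start:start + core],
                                   intron_probs[start:start + core])]
    return start, diffs
```

Annotating a longer region would then mean sliding the context window in steps of 24 nucleotides, so that every position of interest is scored once while it sits in the central block.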

13:30-14:10

Invited Presentation: A soft matter approach in the study of bacteriophage viruses and condensed DNA

Format: In person

Moderator(s): Peter Liu

14:10-14:30

Decoding Tumor Progression: Spatially Resolved Insights into PDAC Epithelial And Stromal Co-evolution

Confirmed Presenter: Ahmed M. Elhossiny, Department of Computational Medicine and Bioinformatics, University of Michigan, United States

Format: In Person

Pancreatic ductal adenocarcinoma (PDAC) stands as one of the most lethal cancers, marked by a dismal five-year survival rate of just 13%. The tumor microenvironment (TME) in PDAC is composed of a diverse array of cell types. In genetically engineered mouse models, the TME co-evolves with epithelial precursor lesions as they progress to cancer. However, the characterization of TME evolution in human samples is not well-documented, primarily due to the scarcity of healthy donor pancreatic tissues. Through collaboration with Gift of Life Michigan, an organ donor organization, our previous findings reported the abundance of PDAC precursor lesions, known as PanINs, in the normal human pancreas (Carpenter et al., Cancer Discovery, 2023). This offers a unique chance to model the concurrent evolution of stroma and neoplastic epithelium during tumor progression.

Here, we employed spatial transcriptomics technology using the 10x Visium platform to analyze normal, tumor-adjacent, and tumor samples. To integrate the spatial data with our pre-established single-cell RNA-seq atlas, we utilized cell-type deconvolution combined with a spatially aware clustering strategy. This approach allowed us to delineate distinct tissue domains across these samples, including epithelial structures (acinar cells, ducts, PanINs, and tumor regions) and stromal domains such as tertiary lymphoid structures (TLS) and fibrotic regions.

We traced the trajectory of spontaneous PanINs in healthy individuals as they progressed toward PDAC using pseudotime analysis. This revealed critical gene programs that dynamically shift over time and pinpointed specific transcriptional changes distinguishing early PanINs from more advanced tumor states. To complement this, we investigated how the surrounding stroma evolves in parallel with epithelial transformation. Specifically, we modeled spatial interactions by performing bivariate spatial autocorrelation analyses between the spatial distributions of epithelial structures—derived from PanIN and tumor regions—and distinct stromal subtypes. This uncovered remodeling events in the stroma, illustrating how stromal populations become spatially reprogrammed in response to neoplastic progression. Together, these insights shed light on the intricate and coordinated evolution of the tumor microenvironment in PDAC.
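As a rough illustration of bivariate spatial autocorrelation, the sketch below computes a global bivariate Moran's I between two per-spot signals: each spot's standardized value of the first variable is multiplied by the neighbour-averaged standardized value of the second. It assumes a simple neighbour weight matrix with row standardization and population standard deviations; the abstract does not state which statistic or software was used, so the function name and these choices are hypothetical.

```python
import math

def standardize(values):
    """Z-scores using the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        raise ValueError("variable has zero variance")
    return [(v - mean) / sd for v in values]

def bivariate_morans_i(x, y, weights):
    """Mean over spots of z_x[i] times the neighbour average of z_y.

    `weights[i][j]` is a non-negative weight between spots i and j
    (for example 1 for adjacent Visium spots, 0 otherwise).
    """
    zx, zy = standardize(x), standardize(y)
    total = 0.0
    for i, row in enumerate(weights):
        row_sum = sum(row)
        if row_sum == 0:
            continue  # isolated spot contributes nothing
        lag = sum(w * zy[j] for j, w in enumerate(row)) / row_sum
        total += zx[i] * lag
    return total / len(x)

# Four spots in a line; x could be an epithelial score, y a stromal subtype score.
chain = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
print(bivariate_morans_i([1, 2, 3, 4], [1, 2, 3, 4], chain))
```

Positive values indicate that high values of one signal tend to neighbour high values of the other; negative values indicate spatial exclusion.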

Future directions will involve conducting cell-cell interaction analyses to further understand epithelial-stromal communication, supported by in vitro functional experiments to validate the roles—whether pro- or anti-tumorigenic—of identified markers. This research establishes a foundation for spatially resolved insights into PDAC biology, guiding therapeutic strategies that target the TME.

14:30-14:50

Deep Learning in AML: Evaluating In Silico Perturbations for Therapeutic Shifts and Driver Gene Inference

Confirmed Presenter: Karen Sachs, Next Generation Analytics, United States

Format: In Person

14:50-15:10

Leveraging Genomic Foundation Models to Predict Short Nucleotide Variants That Impact Non-Coding Functional Sites in Cancer Genomes

Confirmed Presenter: Pratik Dutta, Department of Biomedical Informatics, Stony Brook University, United States

Format: In Person

Data portals, such as the Genomic Data Commons, provide catalogs of somatic short nucleotide variants (SNVs) identified from whole-genome sequencing (WGS) of tumor and matched normal DNA from thousands of samples. However, computational identification and prioritization of functionally relevant driver mutations remain far greater challenges in non-coding regions than in coding genes due to incomplete annotation of regulatory regions. To address this, we developed gene regulatory region predictive models by applying DNABERT, a deep learning model pre-trained on genomic sequences, to systematically characterize the functional impact of somatic mutations in key noncoding regions, including splice sites and transcription factor binding sites (TFBS). We apply these models to Glioblastoma multiforme (GBM), an aggressive brain cancer with limited therapeutic options. We analyzed WGS data from 189 GBM patients (SNVs from CaVEMan and indels from Pindel) obtained from the TCGA-GDC portal. This dataset included 19,968 SNVs and 34,718 indels near the acceptor sites and 23,656 SNVs and 20,171 indels near the donor sites. Additionally, we examined 700 TFBS datasets (33 histone markers and 667 TF ChIP-seq markers) from ENCODE, covering 4,228 ChIP-seq experiments across 91 cell lines, 21 in vitro differentiated cells, 53 primary cells, and 77 tissues. The histone markers contained 540,742 indel variants across 285,204 regions and 734,067 SNVs across 346,600 regions, while TF ChIP-seq regions comprised 9,615,683 indels spanning 3,796,449 regions and 7,827,556 SNVs across 6,869,598 regions.
To assess the functional impact of these variants, we developed two DNABERT-based splice site models and fine-tuned 700 TFBS-specific models. These models computed probability scores for reference and alternative sequences and assessed functional disruption via log-odds ratios and score-change values. The splice site models predicted 299 candidate SNVs and 1,822 indels in acceptor sites and 673 SNVs and 504 indels in donor sites, while histone models identified 4,171 indels and 763 SNVs as functionally disruptive. In TF ChIP-seq regions, DNABERT prioritized 61,731 candidate indels and 30,539 SNVs. We identified frequent mutations (≥10% of patients) in TFBSs of SPI1, ZBTB33, and RAD51 TFs implicated in oncogenesis and immune regulation. Survival analysis revealed mutations in TFBSs of TRIM22, QKI, RELB, and BCL3 as potential prognostic biomarkers, while DNABERT-predicted splice site mutations in PARPBP and MAPT correlated with patient survival, suggesting their role in treatment response. These findings highlight the power of genomic foundation models in identifying biologically significant and clinically relevant variants in GBM. Our approach provides a novel computational framework for prioritizing non-coding regulatory SNVs, contributing to WGS data analyses and precision medicine. To facilitate further exploration, we have developed an interactive dashboard at https://davuluri-lab-brainved.streamlit.app.
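The two disruption measures mentioned above can be sketched as follows, given a model probability for the reference sequence and one for the alternative (mutated) sequence. The exact definitions, sign conventions, and log base used in the study are not stated in the abstract; the choices here (alternative minus reference for the score change, reference log-odds minus alternative log-odds in base 2, and clipping with `eps`) are assumptions.

```python
import math

def score_change(p_ref, p_alt):
    """Difference in predicted probability caused by the variant (alt - ref)."""
    return p_alt - p_ref

def log_odds_ratio(p_ref, p_alt, eps=1e-6):
    """log2 odds of the reference minus log2 odds of the alternative.

    Probabilities are clipped to [eps, 1 - eps] to avoid division by zero.
    A positive value means the variant lowers the predicted site probability.
    """
    p_ref = min(max(p_ref, eps), 1 - eps)
    p_alt = min(max(p_alt, eps), 1 - eps)
    return math.log2(p_ref / (1 - p_ref)) - math.log2(p_alt / (1 - p_alt))

# Hypothetical variant that weakens a predicted binding site.
print(score_change(0.8, 0.2), log_odds_ratio(0.8, 0.2))
```

A variant would then be flagged as a candidate when both measures pass chosen cutoffs; the cutoffs themselves are not given in the abstract and are left out of the sketch.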

15:10-15:30

Panel: Lessons Learned: EBRDED Town Hall

Format: In person

Moderator(s): Peter Liu
