Schedule subject to change
All times listed are in EDT
Monday, July 15th
10:40-11:30
Invited Presentation: How generative AI can transform biomedical research
Room: 517d
Format: In person


Authors List:

  • James Zou

Presentation Overview:

This talk will explore how we can develop and use generative AI to help researchers. I will first discuss how generative AI can act as research co-advisors. Then I will discuss how we use generative AI to expand researchers' creativity by designing and experimentally validating new drugs. Finally, I will explore the role of language as the foundational data modality for biomedicine.

11:30-11:40
Proceedings Presentation: SPRITE: improving spatial gene expression imputation with gene and cell networks
Confirmed Presenter: Eric Sun, Stanford University, United States

Room: 517d
Format: In Person


Authors List:

  • Eric Sun, Stanford University, United States
  • Rong Ma, Harvard T.H. Chan School of Public Health, United States
  • James Zou, Stanford University, United States

Presentation Overview:

Spatially resolved single-cell transcriptomics have provided unprecedented insights into gene expression in situ, particularly in the context of cell interactions or organization of tissues. However, current technologies for profiling spatial gene expression at single-cell resolution are generally limited to the measurement of a small number of genes. To address this limitation, several algorithms have been developed to impute or predict the expression of additional genes that were not present in the measured gene panel. Current algorithms do not leverage the rich spatial and gene relational information in spatial transcriptomics. To improve spatial gene expression predictions, we introduce SPRITE (Spatial Propagation and Reinforcement of Imputed Transcript Expression) as a meta-algorithm that processes predictions obtained from existing methods by propagating information across gene correlation networks and spatial neighborhood graphs. SPRITE improves spatial gene expression predictions across multiple spatial transcriptomics datasets. Furthermore, SPRITE-predicted spatial gene expression leads to improved clustering, visualization, and classification of cells. SPRITE is available as a software package and can be used in spatial transcriptomics data analysis to improve inferences based on predicted gene expression.
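
The idea of propagating predictions across a spatial neighborhood graph can be illustrated with a minimal sketch. This is not the SPRITE implementation: the update rule, the function names `row_normalize` and `propagate`, and the parameters `alpha` and `n_iter` are assumptions chosen for illustration, using a generic smoothing iteration X <- alpha * S X + (1 - alpha) * X0 on a row-normalized adjacency matrix S.

```python
import numpy as np

def row_normalize(adj):
    """Turn a non-negative adjacency matrix into a row-stochastic one."""
    adj = np.asarray(adj, dtype=float)
    sums = adj.sum(axis=1, keepdims=True)
    sums[sums == 0] = 1.0  # cells without neighbors keep an all-zero row
    return adj / sums

def propagate(pred, adj, alpha=0.5, n_iter=50):
    """Smooth predicted expression over a graph (illustrative assumption).

    pred : (cells x genes) predictions from any existing imputation method
    adj  : (cells x cells) spatial neighborhood graph
    Iterates X <- alpha * S @ X + (1 - alpha) * X0.
    """
    s = row_normalize(adj)
    x0 = np.asarray(pred, dtype=float)
    x = x0.copy()
    for _ in range(n_iter):
        x = alpha * (s @ x) + (1 - alpha) * x0
    return x

# Toy example: three cells on a line, one gene.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
pred = np.array([[1.0], [0.0], [1.0]])
smoothed = propagate(pred, adj)
print(smoothed.ravel())
```

In this toy case the middle cell borrows signal from its two neighbors while the end cells are pulled toward it; `alpha` controls how much weight the graph receives relative to the original prediction.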

11:40-12:00
Proceedings Presentation: Deciphering High-order Structures in Spatial Transcriptomes with Graph-guided Tucker Decomposition
Confirmed Presenter: Charles Broadbent, University of Minnesota Twin Cities, United States

Room: 517d
Format: In Person


Authors List:

  • Charles Broadbent, University of Minnesota Twin Cities, United States
  • Tianci Song, University of Minnesota Twin Cities, United States
  • Rui Kuang, University of Minnesota Twin Cities, United States

Presentation Overview:

Spatial transcriptome (ST) profiling can reveal cells’ structural organizations and functional roles in tissues. However, deciphering the spatial context of gene expression in ST data is a challenge: the high-order structure hidden in the whole-transcriptome space over 2D/3D spatial coordinates requires modeling and detection of interpretable high-order elements and components for further functional analysis and interpretation. This paper presents a new method, GraphTucker, a graph-regularized Tucker tensor decomposition for learning high-order factorization in ST data. GraphTucker is based on a non-negative Tucker decomposition algorithm regularized by a high-order graph that captures spatial relations among spots and functional relations among genes. In experiments on several Visium and Stereo-seq datasets, the novelty and advantage of modeling multi-way multilinear relationships among the components in Tucker decomposition, as opposed to the Canonical Polyadic Decomposition (CPD) and conventional matrix factorization models, are demonstrated by evaluating the detection of spatial components of gene modules, the clustering of spatial coefficients for tissue segmentation, and the imputation of complete spatial transcriptomes. The visualization results show strong evidence that GraphTucker detects more interpretable spatial components in the context of the spatial domains in the tissues. Availability: https://github.com/kuanglab/GraphTucker.
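
The contrast between Tucker decomposition and CPD drawn in the abstract can be made concrete with a short sketch. It shows only the generic reconstruction formulas, not GraphTucker's graph regularization or fitting procedure; the helper names `tucker_reconstruct` and `cpd_reconstruct`, the tensor sizes, and the interpretation of the modes are assumptions for illustration.

```python
import numpy as np

def tucker_reconstruct(core, a, b, c):
    """X[i,j,k] = sum_{p,q,r} core[p,q,r] * a[i,p] * b[j,q] * c[k,r]."""
    return np.einsum("pqr,ip,jq,kr->ijk", core, a, b, c)

def cpd_reconstruct(weights, a, b, c):
    """X[i,j,k] = sum_r weights[r] * a[i,r] * b[j,r] * c[k,r]."""
    return np.einsum("r,ir,jr,kr->ijk", weights, a, b, c)

rng = np.random.default_rng(0)
rank = 3
a = rng.random((5, rank))  # hypothetical mode 1, e.g. x-coordinates
b = rng.random((4, rank))  # hypothetical mode 2, e.g. y-coordinates
c = rng.random((6, rank))  # hypothetical mode 3, e.g. genes
weights = rng.random(rank)

# A superdiagonal core reduces Tucker to CPD: components only pair one-to-one.
diag_core = np.zeros((rank, rank, rank))
idx = np.arange(rank)
diag_core[idx, idx, idx] = weights
x_tucker = tucker_reconstruct(diag_core, a, b, c)
x_cpd = cpd_reconstruct(weights, a, b, c)

# A dense core lets every combination of components interact (multi-way relations).
dense_core = rng.random((rank, rank, rank))
x_dense = tucker_reconstruct(dense_core, a, b, c)
print(np.allclose(x_tucker, x_cpd), x_dense.shape)
```

The dense core is what allows any spatial component to be combined with any gene component, which is the multi-way multilinear relationship the abstract refers to.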

12:00-12:10
CellPie: a fast spatial transcriptomics factor discovery method via joint factorization of gene expression and imaging data
Room: 517d
Format: In person


Authors List:

  • Sokratia Georgaka, University of Manchester, United Kingdom
  • William Geraint Morgans, University of Manchester, United Kingdom
  • Qian Zhao, University of Manchester, United Kingdom
  • Diego Sanchez, Cancer Research UK Manchester Institute, Manchester, United Kingdom
  • Mohamed Ghafoor, Division of Informatics, Imaging and Data Sciences, Faculty of Biology, Medicine and Health, University of Manchester, United Kingdom
  • Syed Murtuza Baker, University of Manchester, United Kingdom
  • Mudassar Iqbal, The University of Manchester, United Kingdom
  • Magnus Rattray, The University of Manchester, United Kingdom

Presentation Overview:

Spatially resolved transcriptomics has revolutionised the study of gene expression within tissues, allowing researchers to maintain the spatial context. Accompanying these spatial transcriptomics datasets are often histology images, providing rich information on tissue architecture, organisation and pathology, complementing the spatial gene expression. However, in traditional pipelines, histological information is typically discarded during tasks such as dimensionality reduction of the spatial transcriptomics data.
To address this limitation, we propose CellPie, a novel approach based on fast, joint non-negative matrix factorisation (NMF). CellPie simultaneously decomposes spatial gene expression and histology image features into interpretable components. Through joint NMF, CellPie generates non-negative factor matrices representing parts-based representations (factors) of the data, facilitating the identification of biologically relevant patterns of variation. In addition, CellPie extracts the corresponding leading genes and image features that are strongly associated with each factor. These genes and features serve as marker genes and morphological characteristics, respectively, providing insights into the biological processes underlying the observed patterns in the spatial gene expression data. Furthermore, they enable the discovery of links between molecular signalling and tissue morphology.
We demonstrated CellPie on two distinct tissue types, showcasing its improved accuracy in downstream analysis tasks compared to published dimensionality reduction methods.
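
A joint factorisation with a shared spot-by-factor matrix can be sketched as follows. This is a generic illustration, not the CellPie algorithm: it applies standard multiplicative NMF updates to the column-wise concatenation of the two modalities, and the function name `joint_nmf`, the toy data, and all parameter values are assumptions.

```python
import numpy as np

def joint_nmf(x_expr, x_img, k, n_iter=200, seed=0, eps=1e-9):
    """Factorise [x_expr | x_img] ~ W @ [H_expr | H_img] with a shared W.

    x_expr : (spots x genes) non-negative expression matrix
    x_img  : (spots x image features) non-negative feature matrix
    Uses standard multiplicative updates for the Frobenius loss.
    """
    rng = np.random.default_rng(seed)
    x = np.hstack([x_expr, x_img])
    n, m = x.shape
    w = rng.random((n, k)) + eps
    h = rng.random((k, m)) + eps
    for _ in range(n_iter):
        h *= (w.T @ x) / (w.T @ w @ h + eps)
        w *= (x @ h.T) / (w @ h @ h.T + eps)
    n_genes = x_expr.shape[1]
    return w, h[:, :n_genes], h[:, n_genes:]

# Toy data: both modalities are generated from the same two spatial factors.
rng = np.random.default_rng(1)
true_w = rng.random((30, 2))
x_expr = true_w @ rng.random((2, 12))
x_img = true_w @ rng.random((2, 8))

w, h_expr, h_img = joint_nmf(x_expr, x_img, k=2)
rel_err = np.linalg.norm(x_expr - w @ h_expr) / np.linalg.norm(x_expr)
top_genes = np.argsort(-h_expr, axis=1)[:, :3]  # leading genes per factor
print(round(float(rel_err), 3), top_genes)
```

Because `w` is shared, each factor is described by a gene loading vector and an image-feature loading vector at the same time, which is what makes it possible to read off leading genes and leading image features per factor.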

12:10-12:30
Proceedings Presentation: Integrating patients in time series clinical transcriptomics data
Confirmed Presenter: Sachin Mathur, Sanofi, United States

Room: 517d
Format: In Person


Authors List:

  • Euxhen Hasanaj, Carnegie Mellon University, United States
  • Sachin Mathur, Sanofi, United States
  • Ziv Bar-Joseph, Carnegie Mellon University, United States

Presentation Overview:

Motivation: Analysis of time series transcriptomics data from clinical trials is challenging. Such studies usually profile very few time points from several individuals with varying response patterns and dynamics. Current methods for these datasets are mainly based on linear, global orderings using visit times which do not account for the varying response rates and subgroups within a patient cohort.

Results: We developed a new method that utilizes multi-commodity flow algorithms for trajectory inference in large scale clinical studies. Recovered trajectories satisfy individual-based timing restrictions while integrating data from multiple patients. Testing the method on multiple drug datasets demonstrated an improved performance compared to prior approaches suggested for this task, while identifying novel endotypes that correspond to heterogeneous patient response patterns.

Availability: The source code and instructions to download the data have been deposited on GitHub at https://github.com/euxhenh/Truffle

14:20-15:10
Invited Presentation: Learning the Language of Biology: Transforming Biomedical Discovery with Foundation Models and Causal Inference
Confirmed Presenter: David van Dijk, Yale University, USA

Room: 517d
Format: In person


Authors List:

  • David van Dijk, Yale University, USA

Presentation Overview:

In this talk, I will showcase the work of my lab in revolutionizing biomedical data analysis through foundation models and large language models (LLMs). First, we introduce CINEMA-OT, a causal-inference-based approach using optimal transport for single-cell perturbation analysis. CINEMA-OT allows individual treatment-effect analysis, response clustering, and synergy analysis, revealing potential mechanisms in airway antiviral response and immune cell recruitment. Next, we present CaLMFlow, combining flow matching with integral equations and causal language models. By fine-tuning LLMs on flow matching and conditioning on natural language prompts, CaLMFlow predicts single-cell perturbation responses and performs protein backbone generation. We then explore "Cell2Sentence" (C2S), a technique translating single-cell transcriptomics into a language for LLMs. C2S automates the generation of natural language insights directly from biological data and generates cells based on textual prompts, enhancing data interpretation and synthesis. Additionally, I will discuss "BrainLM," the first fMRI foundation model to decode brain activity, predict clinical variables, and improve our understanding of brain function and disease. Finally, I will present some of our efforts to integrate foundation models with graphs with the aim to leverage pre-trained textual and non-textual foundation models for graph-based tasks.

15:10-15:20
Deep Reinforcement Learning for Controlled Traversing of the Attractor Landscape of Boolean Models in the Context of Cellular Reprogramming
Confirmed Presenter: Jakub Zarzycki, IDEAS NCBR & University of Warsaw, Poland

Room: 517d
Format: In Person


Authors List:

  • Andrzej Mizera, IDEAS NCBR & University of Warsaw, Poland
  • Jakub Zarzycki, IDEAS NCBR & University of Warsaw, Poland

Presentation Overview:

Cellular reprogramming can be used for both the prevention and cure of different diseases. However, the efficiency of discovering reprogramming strategies with classical wet-lab experiments is hindered by lengthy time commitments and high costs. In this study, we develop a novel computational framework based on deep reinforcement learning that facilitates the identification of reprogramming strategies. To this end, we formulate a control problem in the context of cellular reprogramming for the frameworks of Boolean networks (BNs) and probabilistic Boolean networks (PBNs) under the asynchronous update mode. Furthermore, we introduce the notion of a pseudo-attractor and a procedure for identifying pseudo-attractor states during training. Finally, we devise a computational framework for solving the control problem, which we test on a number of different models.

15:20-15:40
Proceedings Presentation: AttentionPert: Accurately Modeling Multiplexed Genetic Perturbations with Multi-scale Effects
Confirmed Presenter: Ding Bai, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), United Arab Emirates

Room: 517d
Format: In Person


Authors List:

  • Ding Bai, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), United Arab Emirates
  • Caleb Ellington, Carnegie Mellon University, United States
  • Shentong Mo, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), United Arab Emirates
  • Le Song, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), United Arab Emirates
  • Eric Xing, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) & Carnegie Mellon University, United Arab Emirates

Presentation Overview:

Genetic perturbations (e.g. knockouts, variants) have laid the foundation for our understanding of many diseases, implicating pathogenic mechanisms and indicating therapeutic targets. However, experimental assays are fundamentally limited by the number of measurable perturbations. Computational methods can fill this gap by predicting perturbation effects under novel conditions, but accurately predicting the transcriptional responses of cells to unseen perturbations remains a significant challenge. We address this by developing a novel attention-based neural network, AttentionPert, which accurately predicts gene expression under multiplexed perturbations and generalizes to unseen conditions. AttentionPert integrates global and local effects in a multi-scale model, representing both the non-uniform system-wide impact of the genetic perturbation and the localized disturbance in a network of gene-gene similarities, enhancing its ability to predict nuanced transcriptional responses to both single and multi-gene perturbations. In comprehensive experiments, AttentionPert demonstrates superior performance across multiple datasets, outperforming the state-of-the-art method in predicting differential gene expression and revealing novel gene regulations. AttentionPert marks a significant improvement over current methods, particularly in handling the diversity of gene perturbations and in predicting out-of-distribution scenarios.

15:40-16:00
Proceedings Presentation: Predicting single-cell cellular responses to perturbations using cycle consistency learning
Confirmed Presenter: Wei Huang, College of Computer and Information Engineering, Nanjing Tech University, 211816, Jiangsu, China

Room: 517d
Format: Live Stream


Authors List:

  • Wei Huang, College of Computer and Information Engineering, Nanjing Tech University, 211816, Jiangsu, China
  • Hui Liu, College of Computer and Information Engineering, Nanjing Tech University, 211816, Jiangsu, China

Presentation Overview:

Phenotype-based screening has emerged as a powerful approach for identifying compounds that actively interact with cells. Transcriptional and proteomic profiling of cell populations and single cells provides insights into the cellular changes that occur at the molecular level in response to external perturbations, such as drugs or genetic manipulations. In this paper, we propose cycleCDR, a novel deep learning framework to predict cellular response to drugs or gene perturbations. We leverage the power of autoencoders to map the unperturbed cellular states to a latent space, in which we postulate that the effects of drug perturbations on cellular states follow a linear additive model. Next, we introduce cycle consistency constraints to ensure that an unperturbed cellular state subjected to drug perturbation in the latent space produces the perturbed cellular state through the decoder. Conversely, removal of perturbations from the perturbed cellular states should restore the unperturbed cellular state. The cycle consistency constraints and linear modeling in latent space enable the model to learn transferable representations of external perturbations, so that it can generalize well to unseen drugs. We validate our model on four different types of datasets, including bulk transcriptional responses, bulk proteomic responses, and single-cell transcriptional responses to drug/gene perturbations. The experimental results demonstrate that our model consistently outperforms existing state-of-the-art methods, indicating that our method is highly versatile and applicable to a wide range of scenarios.

16:40-17:00
The role of chromatin state in intron retention: a case study in leveraging large scale deep learning models
Confirmed Presenter: Asa Ben-Hur, Colorado State University, United States

Room: 517d
Format: In Person


Authors List:

  • Asa Ben-Hur, Colorado State University, United States
  • Ahmed Daoud, Colorado State University, United States

Presentation Overview:

Complex deep learning models trained on very large datasets have become key enabling tools for current research in natural language processing and computer vision.
By providing pre-trained models that can be fine-tuned for specific applications, they enable researchers to create accurate models with minimal effort and computational resources.
Large scale genomics deep learning models come in two flavors: the first are large language models of DNA sequences trained in a self-supervised fashion, similar to the corresponding natural language models; the second are supervised learning models that leverage large scale genomics datasets from ENCODE and other sources.
We argue that these models are the equivalent of foundation models in natural language processing in their utility, as they encode within them chromatin state in its different aspects, providing useful representations that allow quick deployment of accurate models of gene regulation.
We demonstrate this premise by leveraging the recently created Sei model to develop simple, interpretable models of intron retention, and demonstrate their advantage over models based on the DNA language model DNABERT-2.
Our work also demonstrates the impact of chromatin state on the regulation of intron retention.
Using representations learned by Sei, our model is able to discover the involvement of transcription factors and chromatin marks in regulating intron retention, providing better accuracy than a recently published model trained from scratch for this purpose.

Predicting interchromosomal Hi-C contacts from DNA sequence with TwinC
Confirmed Presenter: Anupama Jha, University of Washington, United States

Room: 517d
Format: In Person


Authors List:

  • Anupama Jha, University of Washington, United States
  • Borislav Hristov, University of Washington, United States
  • Xiao Wang, University of Washington, United States
  • Sheng Wang, University of Washington, United States
  • William Greenleaf, Stanford University, United States
  • Anshul Kundaje, Stanford University, United States
  • Erez Lieberman Aiden, Baylor College of Medicine, United States
  • William Stafford Noble, University of Washington, United States

Presentation Overview:

The 3D nuclear DNA architecture is composed of intrachromosomal and interchromosomal contacts. Despite the functional relevance of interchromosomal contacts, existing predictive models for 3D genome folding have focused on modeling intrachromosomal contacts from nucleotide sequences, mainly ignoring the contributions of interchromosomal contacts. To remedy this, we propose TwinC, an interpretable convolutional neural network model that uses a paired sequence design to model Hi-C interchromosomal contacts from replicate Hi-C experiments. TwinC accepts two 100 kb nucleotide sequences as input and predicts interchromosomal contacts between them. We use Hi-C experiments from 20 human donor heart samples from the ENCODE project to show that TwinC achieves high predictive accuracy (AUROC=0.80) on a cross-chromosomal test set. Furthermore, despite TwinC's computational simplicity and faster training time, it performs on par with the state-of-the-art Orca model. Subsequently, we show that TwinC learns the importance of local chromatin accessibility features in the formation of interchromosomal contacts and identifies transcription factories located on different chromosomes that cluster in the nucleus. Our results suggest that by leveraging pooled contacts from multiple donors and employing a twin sequence design, TwinC can learn to accurately predict interchromosomal contacts and identify sequence signatures relevant to their 3D structure in the nucleus.

17:00-17:20
Proceedings Presentation: MolPLA: A Molecular Pre-training Framework for Learning Cores, R-Groups and their Linker Joints
Confirmed Presenter: Mogan Gim, Korea University, South Korea

Room: 517d
Format: In Person


Authors List:

  • Mogan Gim, Korea University, South Korea
  • Jueon Park, Korea University, South Korea
  • Soyon Park, Korea University, South Korea
  • Sanghoon Lee, Korea University, South Korea
  • Seungheun Baek, Korea University, South Korea
  • Junhyun Lee, Korea University, South Korea
  • Ngoc-Quang Nguyen, Korea University, South Korea
  • Jaewoo Kang, Korea University, South Korea

Presentation Overview:

Motivation: Molecular core structures and R-groups are essential concepts especially in compound analysis and lead optimization. Integration of these concepts with conventional graph pre-training approaches can promote deeper understanding in both local and global properties of molecules. We propose MolPLA, a dual molecular pre-training framework that promotes understanding in a molecule's core structure with peripheral R-groups and extends it with the ability to help chemists find replaceable R-groups in lead optimization scenarios.
Results: Experimental results on molecular property prediction show that MolPLA exhibits predictability comparable to current state-of-the-art models. Qualitative analysis indicates that MolPLA is capable of distinguishing core and R-group sub-structures, identifying decomposable regions in molecules and contributing to lead optimization scenarios by rationally suggesting R-group replacements given various query core templates.

17:20-17:40
Deep generative models for RNA splicing predictions and design
Confirmed Presenter: Yoseph Barash, University of Pennsylvania, United States

Room: 517d
Format: In Person


Authors List:

  • Di Wu, University of Pennsylvania, United States
  • Natalie Maus, University of Pennsylvania, United States
  • Anupama Jha, University of Washington, United States
  • San Jewell, University of Pennsylvania, United States
  • Jacob Gardner, University of Pennsylvania, United States
  • Yoseph Barash, University of Pennsylvania, United States

Presentation Overview:

Alternative splicing (AS) of pre-mRNA is a highly regulated process with significant splicing changes occurring across human tissues. The tissue-specific changes in splicing, combined with the fact that splicing defects are related to numerous diseases, have made the ability to predict or manipulate AS a long-term goal, with applications ranging from identifying novel regulatory mechanisms to designing therapeutic targets. Here, we take advantage of generative model architectures to address the prediction and design of tissue-specific RNA splicing outcomes. First, we construct a predictive model, TrASPr, which combines multiple localized transformers to predict splicing in a tissue-specific manner. Then, we exploit TrASPr as an oracle to produce labeled data for a Bayesian Optimization (BO) algorithm with a custom loss function for RNA splicing outcome design. We demonstrate that TrASPr significantly outperforms recently published models and identifies relevant regulatory features also captured by the BO generative process.

NEAR: Neural Embeddings for Amino acid Relationships
Confirmed Presenter: Daniel Olson, Department of Computer Science, University of Montana, United States

Room: 517d
Format: In Person


Authors List:

  • Daniel Olson, Department of Computer Science, University of Montana, United States
  • Daphne Demekas, R. Ken Coit College of Pharmacy, University of Arizona, Arizona, United States
  • Thomas Colligan, R. Ken Coit College of Pharmacy, University of Arizona, Arizona, United States
  • Travis Wheeler, R. Ken Coit College of Pharmacy, University of Arizona, Arizona, United States

Presentation Overview:

The homology search tool HMMER is extremely sensitive and can identify homologous protein pairs even when the percent identity between them is very low. Search tools such as MMseqs2 and DIAMOND are extremely efficient and capable of rapidly searching extremely large protein databases like TrEMBL, but are less sensitive than HMMER. The varying performance (speed and accuracy) of these tools is largely influenced by the choice of filtering strategies that are used to eliminate candidate alignments before running more expensive alignment algorithms. Motivated by a desire to retain full HMM sensitivity with greater speed, we have developed a new filtering method, called NEAR (Neural Embeddings for Amino acid Relationships). NEAR is a method based on representation learning that is designed to rapidly identify good sequence alignment candidates from a large protein database. NEAR's neural embedding model computes per-residue embeddings for target and query protein sequences and identifies alignment candidates with a pipeline consisting of k-NN search, filtration, and neighbor aggregation.
NEAR’s ResNet embedding model is trained using an N-pairs loss function guided by sequence alignments generated by the widely used HMMER3 tool.
Benchmarking results reveal improved performance relative to state-of-the-art neural embedding models specifically developed for protein sequences, as well as enhanced speed relative to the alignment-based filtering strategy used in HMMER3’s sensitive alignment pipeline. We present NEAR as a standalone filter, but have plans to integrate NEAR into our search tool NAIL.
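
The pipeline of per-residue k-NN search followed by neighbor aggregation can be sketched in a few lines. This is an illustrative toy, not NEAR itself: the embeddings are random vectors rather than outputs of a trained ResNet, the search is exhaustive rather than an approximate index, the filtration step is omitted, and the function names `normalize` and `rank_targets` are assumptions.

```python
import numpy as np

def normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def rank_targets(query_emb, target_embs, k=2):
    """Rank target sequences by aggregated per-residue nearest-neighbor hits.

    query_emb   : (query length x d) per-residue embeddings
    target_embs : dict mapping target name -> (target length x d) embeddings
    """
    names = list(target_embs)
    all_emb = normalize(np.vstack([target_embs[n] for n in names]))
    # owner[j] is the index of the target sequence that residue j belongs to.
    owner = np.concatenate(
        [np.full(len(target_embs[n]), i) for i, n in enumerate(names)]
    )
    sims = normalize(query_emb) @ all_emb.T
    top = np.argsort(-sims, axis=1)[:, :k]  # k nearest target residues per query residue
    scores = np.zeros(len(names))
    for q in range(sims.shape[0]):
        for j in top[q]:
            scores[owner[j]] += sims[q, j]  # aggregate hits per target sequence
    order = np.argsort(-scores)
    return [(names[i], float(scores[i])) for i in order]

rng = np.random.default_rng(0)
query = rng.normal(size=(10, 16))
targets = {
    "unrelated": rng.normal(size=(12, 16)),
    "homolog": query + 0.01 * rng.normal(size=(10, 16)),  # near copy of the query
}
ranking = rank_targets(query, targets)
print(ranking[0][0])
```

Targets that accumulate many high-similarity residue hits rise to the top of the ranking and would be the candidates passed on to a more expensive alignment step.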

17:40-18:00
Machine learning-enabled highly multiplexed monitoring of subcellular protein localization in live cells
Confirmed Presenter: Jiri Reinis, CeMM, Austria

Room: 517d
Format: In Person


Authors List:

  • Jiri Reinis, CeMM, Austria
  • Andreas Reicher, CeMM, Austria
  • Monika Malik, CeMM, Austria
  • Pavel Ruzicka, CeMM, Austria
  • Maria Ciobanu, CeMM, Austria
  • Stefan Kubicek, CeMM, Austria
  • Andre Rendeiro, CeMM, Austria
  • Victoria Kartysh, CeMM, Austria
  • Tatjana Tomek, CeMM, Austria
  • Marton Siklos, CeMM, Austria

Presentation Overview:

Imaging-based methods are widely used for studying the subcellular localization of proteins in living cells. While routine for individual proteins, global monitoring of protein dynamics following chemical or genetic perturbation typically relies on arrayed panels of fluorescently tagged cell lines, limiting throughput and scalability. Here, we describe a strategy that combines high-throughput microscopy, computer vision, and machine learning to detect perturbation-induced changes in multicolor tagged visual proteomics cell (vpCell) pools.

We use genome-wide and cancer-focused intron-targeting sgRNA libraries to generate vpCell pools and a large arrayed collection of clones (4,576 clones, 1,158 unique fluorescently tagged proteins). Each vpCell clone expresses two different endogenously tagged fluorescent proteins. Individual clones can be identified in the pool by image analysis alone, training a machine learning model on localization patterns and expression levels of the tagged proteins. This enables simultaneous live-cell monitoring of large sets of proteins.

To demonstrate broad applicability and scale, we test the effects of antiproliferative compounds on a pool with cancer-related proteins, on which we identify widespread protein localization changes and novel inhibitors of the nuclear import/export machinery. The time-resolved characterization of changes in subcellular localization and abundance of proteins upon perturbation in pooled format highlights the power of the vpCell approach for drug discovery and mechanism of action studies.

Finally, we present an interactive online web atlas of 1,158 fluorescently labeled proteins in clonal cell lines, available at https://vpcells.cemm.at.

PTM-Mamba: A PTM-Aware Protein Language Model with Bidirectional Gated Mamba Blocks
Confirmed Presenter: Zhangzhi Peng, Duke University, United States

Room: 517d
Format: In Person


Authors List:

  • Zhangzhi Peng, Duke University, United States
  • Benjamin Schussheim, Duke University, United States
  • Pranam Chatterjee, Duke University, United States

Presentation Overview:

Proteins serve as the workhorses of living organisms, orchestrating a wide array of vital functions. Post-translational modifications (PTMs) of their amino acids greatly influence the structural and functional diversity of different protein types and uphold proteostasis, allowing cells to swiftly respond to environmental changes and intricately regulate complex biological processes. To this point, efforts to model the complex features of proteins have involved the training of large and expressive protein language models (pLMs) such as ESM-2 and ProtT5, which accurately encode structural, functional, and physicochemical properties of input protein sequences. However, the over 200 million sequences that these pLMs were trained on merely scratch the surface of proteomic diversity, as they neither input nor account for the effects of PTMs. In this work, we fill this major gap in protein sequence modeling by introducing PTM tokens into the pLM training regime. We then leverage recent advancements in structured state space models (SSMs), specifically Mamba, which utilizes efficient hardware-aware primitives to overcome the quadratic time complexities of Transformers. After adding a comprehensive set of PTM tokens to the model vocabulary, we train bidirectional Mamba blocks whose outputs are fused with state-of-the-art ESM-2 embeddings via a novel gating mechanism. We demonstrate that our resultant PTM-aware pLM, PTM-Mamba, improves upon ESM-2's performance on various PTM-specific tasks. PTM-Mamba is the first and only pLM that can uniquely input and represent both wild-type and PTM sequences, motivating downstream modeling and design applications specific to post-translationally modified proteins.

Tuesday, July 16th
8:40-9:00
Domain adaptation for cell-free DNA fragmentomics
Confirmed Presenter: Natalie Davidson, University of Colorado, Anschutz Medical Campus, United States

Room: 517d
Format: In Person


Authors List:

  • Natalie Davidson, University of Colorado, Anschutz Medical Campus, United States
  • Casey Greene, University of Colorado, Anschutz Medical Campus, United States

Presentation Overview:

Cell-free DNA (cfDNA) is an emerging minimally-invasive biomarker that could detect cancer, indicate transplant rejection, and predict autoimmune disease severity. A critical application of cfDNA is identifying the cfDNA’s tissue-of-origin, a presumed disease source. The most established cfDNA strategies rely on identifying disease-specific mutations but can only be applied to diseases with a known variant. However, the recent discovery that cfDNA fragmentation patterns reflect nucleosome positioning and active transcription factor binding sites (TFBSs) indicates that the fragmentation patterns alone can predict the tissue of origin and open the door for applications to a broader range of diseases.

Currently, to predict tissue-of-origin, one needs to gather large cohorts to sample their cfDNA, which is commonly infeasible. In contrast, we propose that we instead use domain adaptation to train a model on a complementary data type, ATAC-Seq, such that it can also be used on cfDNA.

To do this, we must address two key problems: 1) generating a tissue prediction model that can translate across the domains of cfDNA and ATAC-Seq; 2) handling the fact that the majority of cfDNA reads will come from blood.

We address both problems through the use of data augmentation strategies and the utilization of our previously preprinted domain invariant method, BuDDI. We apply this approach first to ATAC-Seq alone, to ensure our model can detect the tissue of origin, even when 99% of total reads come from blood and not the tissue of interest. Finally, we apply our approach to real cfDNA.

DeepROCK: Error-controlled interaction detection in deep neural networks
Confirmed Presenter: Yang Lu, University of Waterloo, Canada

Room: 517d
Format: In Person


Authors List:

  • Winston Chen, University of Michigan, United States
  • William Noble, University of Washington, United States
  • Yang Lu, University of Waterloo, Canada

Presentation Overview: Show

The complexity of deep neural networks (DNNs) makes them powerful but also makes them challenging to interpret, hindering their applicability in error-intolerant domains. Existing methods attempt to reason about the internal mechanism of DNNs by identifying feature interactions that influence prediction outcomes. However, such methods typically lack a systematic strategy to prioritize interactions while controlling confidence levels, making them difficult to apply in practice for scientific discovery and hypothesis validation. In this paper, we introduce a method, called DeepROCK, to address this limitation by using knockoffs, which are dummy variables that are designed to mimic the dependence structure of a given set of features while being conditionally independent of the response. Together with a novel DNN architecture involving a pairwise-coupling layer, DeepROCK jointly controls the false discovery rate (FDR) and maximizes statistical power. In addition, we identify a challenge in correctly controlling FDR using off-the-shelf feature interaction importance measures. DeepROCK overcomes this challenge with a calibration procedure, applied to existing interaction importance measures, so that the FDR is controlled at a target level. Finally, we validate the effectiveness of DeepROCK through extensive experiments on simulated and real datasets.
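For readers unfamiliar with knockoff-based FDR control, the sketch below shows the generic knockoff+ selection rule: given one statistic W_j per candidate (large positive values favor the real feature over its knockoff), pick the smallest threshold t whose estimated false discovery proportion, (1 + #{W_j <= -t}) / max(1, #{W_j >= t}), is at most the target q. This is the standard rule from the knockoff filter literature, not DeepROCK's specific interaction statistics or calibration step; the function names and example values are hypothetical.

```python
def knockoff_threshold(w, q=0.1, offset=1):
    """Smallest t with (offset + #{w_j <= -t}) / max(1, #{w_j >= t}) <= q."""
    # Only the observed nonzero magnitudes need to be tried as thresholds.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)
        positives = sum(1 for x in w if x >= t)
        if (offset + negatives) / max(1, positives) <= q:
            return t
    return float("inf")  # no threshold meets the target; select nothing


def select(w, q=0.1):
    """Indices of candidates whose statistic reaches the threshold."""
    t = knockoff_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Made-up statistics for ten candidates.
w_stats = [5.0, 4.0, 3.0, 2.5, 2.0, 1.5, -1.0, 0.5, 0.2, -0.1]
print(knockoff_threshold(w_stats, q=0.2))  # 1.5
print(select(w_stats, q=0.2))              # [0, 1, 2, 3, 4, 5]
```

With a stricter target such as q = 0.1, no threshold qualifies for these toy values and the selection is empty, which illustrates why power depends on how well the importance measure separates real features from knockoffs.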

9:00-9:20
Proceedings Presentation: CODEX: COunterfactual Deep learning for the in-silico EXploration of cancer cell line perturbations
Confirmed Presenter: Stefan Schrod, University Medical Center Göttingen, Germany

Room: 517d
Format: In Person


Authors List: Show

  • Stefan Schrod, University Medical Center Göttingen, Germany
  • Helena Zacharias, Hannover Medical School, Germany
  • Tim Beissbarth, University Medical Center Göttingen, Germany
  • Anne-Christin Hauschild, University Medical Center Göttingen, Germany
  • Michael Altenbuchinger, University Medical Center Göttingen, Germany

Presentation Overview: Show

Motivation: High-throughput screens (HTS) provide a powerful tool to decipher the causal effects of chemical and genetic perturbations on cancer cell lines. Their ability to evaluate a wide spectrum of interventions, from single drugs to intricate drug combinations and CRISPR-interference, has established them as an invaluable resource for the development of novel therapeutic approaches. Nevertheless, the combinatorial complexity of potential interventions makes a comprehensive exploration intractable. Hence, prioritizing interventions for further experimental investigation becomes of utmost importance.
Results: We propose CODEX as a general framework for the causal modeling of HTS data, linking perturbations to their downstream consequences. CODEX relies on a stringent causal modeling strategy based on counterfactual reasoning. As such, CODEX predicts drug-specific cellular responses, comprising cell survival and molecular alterations, and facilitates the in-silico exploration of drug combinations. This is achieved for both bulk and single-cell HTS. We further show that CODEX provides a rationale to explore complex genetic modifications from CRISPR-interference in silico in single cells.
Availability and Implementation: Our implementation of CODEX is publicly available at https://github.com/sschrod/CODEX. All data used in this article are publicly available.
Supplementary information: Supplementary materials are available at Bioinformatics online.

9:20-9:40
A statistical method for migration history inference reveals alternative patterns of metastatic dissemination, clonality and phyleticity
Confirmed Presenter: Divya Koyyalagunta, Weill Cornell + MSKCC, United States

Room: 517d
Format: In Person


Authors List: Show

  • Divya Koyyalagunta, Weill Cornell + MSKCC, United States
  • Quaid Morris, MSKCC, United States

Presentation Overview: Show

Although metastasis is the cause of 90% of cancer deaths, little is known about its clonal evolution, genetic drivers, and seeding patterns. Identifying these patterns from DNA sequencing data requires solving a challenging mixed-variable combinatorial optimization problem to reconstruct the history of metastatic migrations. Current methods, based on integer linear programs, are slow, restricted to unrealistic assumptions, and cannot report uncertainty in their reconstructions. Furthermore, a fundamental problem with these methods is their inability to choose between multiple equally or similarly likely metastatic migration histories. To address these challenges, we propose a novel statistical framework for migration history inference, Metient, which uses recent machine learning advancements in discrete variable gradient estimation and metastasis-specific priors. Rather than requiring a metastatic seeding dissemination model to be known a priori, Metient aims to answer this question by evaluating all possible migration history hypotheses and choosing the best model as informed by biologically motivated data. On simulated data, Metient outperforms the state of the art, and can sample up to 64 possible solutions in 1% of the time. The migration histories inferred by Metient on 167 patients with four cancer types recover expert-assigned parsimony models in 84% of cases, but find notable differences where more plausible histories are proposed. We find that parallel gains of metastatic potential are much less common than previously proposed, and that polyclonal seeding occurs more in lymph nodes than in distant metastases. Along with significantly improving existing methodology, Metient provides a means to better model metastasis across different cancer types.

A deep learning model of tumor cell architecture elucidates response and resistance to CDK4/6 inhibitors
Confirmed Presenter: Sungjoon Park, University of California, San Diego, United States

Room: 517d
Format: In Person


Authors List: Show

  • Sungjoon Park, University of California, San Diego, United States
  • Erica Silva, University of California, San Diego, United States
  • Akshat Singhal, University of California, San Diego, United States
  • Marcus Kelly, University of California, San Diego, United States
  • Kate Licon, University of California, San Diego, United States
  • Isabella Panagiotou, University of California, San Diego, United States
  • Catalina Fogg, University of California, San Diego, United States
  • Samson Fong, University of California, San Diego, United States
  • John Lee, University of California, San Diego, United States
  • Xiaoyu Zhao, University of California, San Diego, United States
  • Robin Bachelder, University of California, San Diego, United States
  • Barbara Parker, University of California, San Diego, United States
  • Kay Yeung, University of California, San Diego, United States
  • Trey Ideker, University of California, San Diego, United States

Presentation Overview: Show

Cyclin-dependent kinase 4 and 6 inhibitors (CDK4/6is) have revolutionized breast cancer therapy. However, <50% of patients have an objective response, and nearly all patients develop resistance during therapy. To elucidate the underlying mechanisms, we constructed an interpretable deep learning model of the response to palbociclib, a CDK4/6i, based on a reference map of multiprotein assemblies in cancer. The model identifies eight core assemblies that integrate rare and common alterations across 90 genes to stratify palbociclib-sensitive versus palbociclib-resistant cell lines. Predictions translate to patients and patient-derived xenografts, whereas single-gene biomarkers do not. Most predictive assemblies can be shown by CRISPR–Cas9 genetic disruption to regulate the CDK4/6i response. Validated assemblies relate to cell-cycle control, growth factor signaling and a histone regulatory complex that we show promotes S-phase entry through the activation of the histone modifiers KAT6A and TBL1XR1 and the transcription factor RUNX1. This study enables an integrated assessment of how a tumor’s genetic profile modulates CDK4/6i resistance.

9:40-10:00
Proceedings Presentation: oncotree2vec – A method for embedding and clustering of tumor mutation trees
Confirmed Presenter: Monica-Andreea Baciu-Dragan, ETHZ, Switzerland

Room: 517d
Format: In Person


Authors List: Show

  • Monica-Andreea Baciu-Dragan, ETHZ, Switzerland
  • Niko Beerenwinkel, ETHZ, Switzerland

Presentation Overview: Show

Understanding the genomic heterogeneity of tumors is an important task in computational oncology, especially in the context of finding personalized treatments based on the genetic profile of each patient’s tumor. Tumor clustering that takes into account the temporal order of genetic events, as represented by tumor mutation trees, is a powerful approach for grouping together patients with genetically and evolutionarily similar tumors and can provide insights into discovering tumor subtypes, for more accurate clinical diagnosis and prognosis. Here, we propose oncotree2vec, a method for clustering tumor mutation trees by learning vector representations of mutation trees that capture the different relationships between subclones in an unsupervised manner. Learning low-dimensional tree embeddings facilitates the visualization of relations between trees in large cohorts and can be used for downstream analyses, such as deep learning approaches for single-cell multi-omics data integration. We assessed the performance and the usefulness of our method in three simulation studies, and on two real datasets: a cohort of 43 trees from six cancer types with different branching patterns corresponding to different modes of spatial tumor evolution and a cohort of 123 AML mutation trees.

10:40-11:30
Invited Presentation: Deep learning of personal genomes
Confirmed Presenter: Sara Mostafavi, University of Washington, USA

Room: 517d
Format: In Person


Authors List: Show

  • Sara Mostafavi, University of Washington, USA

11:30-11:40
ConfuseNN: Interpreting convolutional neural network inferences in population genomics with data shuffling
Confirmed Presenter: Linh Tran, University of Arizona, United States

Room: 517d
Format: In Person


Authors List: Show

  • Linh Tran, University of Arizona, United States
  • David Castellano, University of Arizona, United States
  • Ryan Gutenkunst, University of Arizona, United States

Presentation Overview: Show

Convolutional neural networks (CNNs) are an increasingly popular supervised machine learning approach that has been applied to many inference tasks in population genomics. Under this framework, population genomic variation data are typically represented as 2D images with sampled haplotypes as rows and segregating sites as columns. While many published studies have reported promising performance of CNNs on various inference tasks, understanding which features in the data meaningfully contribute to that performance remains challenging. Here we propose a novel approach to interpreting CNN performance on genomic data, motivated by population genetic theory. Specifically, we designed a suite of scramble tests where each test deliberately disrupts a feature in the genomic image data (e.g. allele frequency or linkage disequilibrium) to assess how each feature affects CNN performance. We applied these tests to three networks designed to infer demographic history and natural selection from genetic variation data, identifying the fundamental population genomic features that drive inference for each network.
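As an illustration of what a scramble test can look like, the sketch below permutes alleles independently within each column of a haplotype matrix (rows are haplotypes, columns are segregating sites). Each site keeps its allele count, so the allele frequency spectrum is preserved, while associations between sites (linkage disequilibrium) are broken. This is one assumed form of such a test, not necessarily one of the tests in ConfuseNN; the function names and the toy matrix are hypothetical.

```python
import random


def shuffle_within_sites(haplotypes, seed=0):
    """Permute alleles independently within each column (site)."""
    rng = random.Random(seed)
    n_rows = len(haplotypes)
    n_cols = len(haplotypes[0]) if n_rows else 0
    columns = []
    for j in range(n_cols):
        column = [haplotypes[i][j] for i in range(n_rows)]
        rng.shuffle(column)  # breaks row-wise association with other sites
        columns.append(column)
    # Transpose back to haplotypes-as-rows.
    return [[columns[j][i] for j in range(n_cols)] for i in range(n_rows)]


def site_allele_counts(haplotypes):
    """Derived-allele count at each site (column sums of a 0/1 matrix)."""
    return [sum(column) for column in zip(*haplotypes)]


# Toy 0/1 matrix: 4 haplotypes by 3 sites, with sites 0 and 1 in perfect LD.
original = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [0, 0, 1],
]
scrambled = shuffle_within_sites(original, seed=1)
print(site_allele_counts(original))
print(site_allele_counts(scrambled))
```

Comparing a trained network's accuracy on original versus scrambled inputs then indicates how much it relies on the disrupted feature.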

11:40-12:30
Panel: Trustworthy AI in the life sciences
Room: 517d
Format: In person

Moderator(s): Peter Koo


14:20-15:10
Invited Presentation: Towards spatiotemporal design principles in multicellular systems
Confirmed Presenter: Mor Nitzan, The Hebrew University, Israel

Room: 517d
Format: In Person


Authors List: Show

  • Mor Nitzan, The Hebrew University, Israel

Presentation Overview: Show

Gene expression profiles of a cellular population, generated by single-cell RNA sequencing, contain rich, 'hidden' information about biological state and collective multicellular behavior that is lost during the experiment or not directly accessible, including cell type, cell cycle phase, gene regulatory patterns, cell-cell communication, and location within the tissue-of-origin. In this talk I will discuss several methods, based on a combination of spectral, machine learning, and dynamical systems approaches, to disentangle and enhance particular spatiotemporal signals that cellular populations encode and interpret their manifestation across space and time in tissues. We will further discuss how we can computationally transfer knowledge across biological datasets and systematically identify gaps in our knowledge.

15:10-15:30
Proceedings Presentation: Probabilistic Pathway-based Multimodal Factor Analysis
Confirmed Presenter: Alexander Immer, Biomedical Informatics Group, Department of Computer Science, ETH Zurich, Zurich, Switzerland

Room: 517d
Format: In Person


Authors List: Show

  • Alexander Immer, Biomedical Informatics Group, Department of Computer Science, ETH Zurich, Zurich, Switzerland
  • Stefan G. Stark, Biomedical Informatics Group, Department of Computer Science, ETH Zurich, Zurich, Switzerland
  • Francis Jacob, Ovarian Cancer Research, Department of Biomedicine, University Hospital Basel and University of Basel, Switzerland
  • Ximena Bonilla, Biomedical Informatics Group, Department of Computer Science, ETH Zurich, Zurich, Switzerland
  • Tinu Thomas, Biomedical Informatics Group, Department of Computer Science, ETH Zurich, Zurich, Switzerland
  • Andre Kahles, Biomedical Informatics Group, Department of Computer Science, ETH Zurich, Zurich, Switzerland
  • Sandra Goetze, Institute of Translational Medicine, Dep. of Health Sciences and Technology, ETH Zurich, Zurich, Switzerland
  • Emanuela S. Milani, Institute of Translational Medicine, Dep. of Health Sciences and Technology, ETH Zurich, Zurich, Switzerland
  • Bernd Wollscheid, Institute of Translational Medicine, Dep. of Health Sciences and Technology, ETH Zurich, Zurich, Switzerland
  • Gunnar Rätsch, Biomedical Informatics Group, Department of Computer Science, ETH Zurich, Zurich, Switzerland
  • Kjong-Van Lehmann, Cancer Research Center Cologne-Essen, Uniklinik Koeln, Germany

Presentation Overview: Show

Multimodal profiling strategies promise to produce more informative insights into biomedical cohorts via the integration of the information each modality contributes. In order to perform this integration, however, the development of novel analytical strategies is needed. Multimodal profiling strategies often come at the expense of lower sample numbers, which can challenge methods to uncover shared signals across a cohort. Factor analysis approaches are commonly used for the analysis of high-dimensional data in molecular biology; however, they typically do not yield representations that are directly interpretable, whereas many research questions center around the analysis of pathways associated with specific observations.

We develop PathFA, a novel approach for multimodal factor analysis over the space of pathways. PathFA produces integrative and interpretable views across multimodal profiling technologies, which allow for the derivation of concrete hypotheses. PathFA combines a pathway-learning approach with integrative multimodal capability under a Bayesian procedure that is efficient, hyper-parameter free, and able to automatically infer observation noise from the data. We demonstrate strong performance on small sample sizes within our simulation framework and on matched proteomics and transcriptomics profiles from real tumor samples taken from the Swiss Tumor Profiler consortium. On a subcohort of melanoma patients, PathFA recovers pathway activity that has been independently associated with poor outcome. We further demonstrate the ability of this approach to identify pathways associated with the presence of specific cell-types as well as tumor heterogeneity. Our results show that PathFA captures known biology, making it well suited for analyzing multimodal sample cohorts.

15:30-15:40
SLIDE: Significant Latent Factor Interaction Discovery and Exploration across biological domains
Confirmed Presenter: Jishnu Das, University of Pittsburgh, United States

Room: 517d
Format: In Person


Authors List: Show

  • Javad Rahimikollu, University of Pittsburgh, United States
  • Hanxi Xiao, University of Pittsburgh, United States
  • Annaelaine Rosengart, University of Pittsburgh, United States
  • Aaron Rosen, University of Pittsburgh, United States
  • Tracy Tabib, University of Pittsburgh, United States
  • Paul Zdinak, University of Pittsburgh, United States
  • Kun He, University of Pittsburgh, United States
  • Xin Bing, University of Toronto, Canada
  • Florentina Bunea, Cornell University, United States
  • Marten Wegkamp, Cornell University, United States
  • Amanda Poholek, University of Pittsburgh, United States
  • Alok Joglekar, University of Pittsburgh, United States
  • Robert Lafyatis, University of Pittsburgh, United States
  • Jishnu Das, University of Pittsburgh, United States

Presentation Overview: Show

Modern multi-omic technologies can generate deep multi-scale profiles. However, differences in data modalities, multicollinearity, and large numbers of irrelevant features make the analyses and integration of high-dimensional omic datasets challenging. Here, we present Significant Latent factor Interaction Discovery and Exploration (SLIDE), a first-in-class interpretable machine learning technique for identifying significant interacting latent factors underlying outcomes of interest from high-dimensional omic datasets. SLIDE makes no assumptions regarding data-generating mechanisms, comes with theoretical guarantees regarding identifiability of the latent factors/corresponding inference, and has rigorous FDR control. SLIDE outperforms a wide range of state-of-the-art approaches, including other latent factor approaches, in terms of prediction. More importantly, it provides biological inference beyond prediction that other methods do not afford. Using SLIDE on scRNA-seq data from systemic sclerosis (SSc) patients, we first uncovered significant interacting latent factors underlying SSc pathogenesis. In addition to outperforming existing benchmarks for prediction, SLIDE uncovered significant factors that included well-elucidated altered transcriptomic states in myeloid cells and fibroblasts and a novel keratinocyte-centric signature validated by protein staining. SLIDE also worked well on a wide range of spatial modalities spanning transcriptomic and proteomic data and was able to accurately identify significant interacting latent factors underlying immune cell partitioning by 3D location in different contexts. Finally, SLIDE leveraged paired scRNA-seq and TCR-seq data to elucidate novel latent factors underlying extents of clonal expansion of CD4 T cells in a nonobese diabetic model of T1D. Overall, SLIDE is a versatile engine for biological discovery from modern multi-omic datasets.