9:45-10:00

Combining Machine Learning and Causal Inference to reconstruct condition-related changes in regulatory networks

Format: Live from venue

Moderator(s): Sushmita Roy

Payam Dibaeinia, University of Illinois at Urbana-Champaign, United States
Saurabh Sinha, Georgia Institute of Technology, United States

Transcriptional regulatory networks (TRNs) are a powerful conceptual framework to describe mechanisms of gene expression changes accompanying biological processes. It is increasingly apparent that TRNs can change between conditions, e.g., due to mutations or epigenomic changes associated with disease progression. Several statistical methods have been proposed for inference of TRN changes from transcriptomic data under two biological conditions or sample groups. These methods generally report inter-group changes in correlation between transcription factors’ (TFs) and target genes’ expression, revealed either by pairwise analysis or multi-variable regression and machine learning models. However, correlation-based discovery of a TF’s regulatory influence on a gene and of changes in such influence between biological groups suffers in the presence of confounders such as other TFs and group identity. This results in the reconstructed TRN and differential TRN deviating from the true causal network of interest.

We present a causal inference approach to the reconstruction of TF-gene relationships that change between two biological groups. We define the desired differential relationship as an estimand of “do calculus”, tied to a causal diagram that explicitly encodes the confounding effects of other TFs and of the biological group or condition. We develop a computational tool called CIMLA (Counterfactual Inference by Machine Learning and Attribution Models) to estimate this quantity. CIMLA uses non-linear models such as random forests and neural networks to fit gene expression as a function of TFs’ expression, uses “SHapley Additive exPlanations” (SHAP) scores to assess each TF’s influence on a gene in each sample, and aggregates inter-group differences of this influence across all samples.
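
The attribute-then-aggregate idea can be illustrated with a minimal sketch. It is not the CIMLA implementation: the two hand-written functions stand in for models fitted per group, Shapley values are computed exactly by enumerating feature orderings against an all-zero baseline (feasible only for a handful of TFs), and the root-mean-square aggregation is an assumption, since the abstract does not state the aggregation rule.

```python
from itertools import permutations
from math import sqrt


def shapley_values(model, x, baseline):
    """Exact Shapley attributions of model(x) relative to model(baseline)."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        z = list(baseline)          # start with every feature at its baseline value
        prev = model(z)
        for i in order:
            z[i] = x[i]             # reveal feature i
            cur = model(z)
            phi[i] += cur - prev    # marginal contribution of feature i
            prev = cur
    return [p / len(orders) for p in phi]


def differential_scores(model_a, model_b, samples, baseline):
    """RMS (assumed aggregation) of per-sample attribution differences, per TF."""
    n = len(baseline)
    sq = [0.0] * n
    for x in samples:
        pa = shapley_values(model_a, x, baseline)
        pb = shapley_values(model_b, x, baseline)
        for i in range(n):
            sq[i] += (pa[i] - pb[i]) ** 2
    return [sqrt(s / len(samples)) for s in sq]


# Hypothetical stand-ins for models fitted in each group: TF0 regulates the
# gene in group A only; TF1 and TF2 act identically in both groups.
def model_a(tf):
    return 2.0 * tf[0] + 1.0 * tf[1] + 0.5 * tf[1] * tf[2]


def model_b(tf):
    return 0.0 * tf[0] + 1.0 * tf[1] + 0.5 * tf[1] * tf[2]


samples = [[1.0, 0.5, 2.0], [0.2, 1.5, 0.3], [2.0, 1.0, 1.0]]
baseline = [0.0, 0.0, 0.0]
scores = differential_scores(model_a, model_b, samples, baseline)
print(scores)  # TF0 receives the only non-negligible differential score
```

In practice the SHAP library approximates these attributions for real fitted models; the exhaustive enumeration here is only meant to make the definition concrete.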

We compared CIMLA with leading methods for differential GRN inference (identified in a recent benchmarking study), using data simulated by our biophysics-inspired SERGIO simulator, and showed that CIMLA significantly outperforms these methods, especially when confounders are present. We also showed that CIMLA can identify differential regulatory relationships that do not exhibit significant differential correlation – a capability at which correlation-based methods commonly fall short.

We employed CIMLA to analyze a previously published single-cell RNA-seq dataset comprising more than 60,000 human brain cells collected from subjects with and without Alzheimer’s disease (AD). Our analysis points to several potential regulators of AD, some of which are strongly supported by the literature.

Finally, we note that the applications of CIMLA go beyond differential GRN inference; with simple modifications it can be extended to the inference of condition-related changes in a variety of causal associations from observational data.

10:00-10:15

scDoRI: Gene regulatory inference from single cell multi-omics data using interpretable deep learning

Format: Live from venue

Moderator(s): Sushmita Roy

Manu Saraswat, German Cancer Research Centre (DKFZ), European Molecular Biology Laboratory (EMBL), Germany
Moritz Mall, German Cancer Research Centre (DKFZ), Germany
Oliver Stegle, German Cancer Research Centre (DKFZ), European Molecular Biology Laboratory (EMBL), Germany

Linear and nonlinear latent variable models to infer cellular manifolds are a cornerstone of the analysis of single-cell (sc) RNA-seq data. Experimental advances have enabled joint profiling of gene expression and chromatin accessibility in the same physical cells, which yields a wealth of additional information that is not fully exploited by existing approaches. In particular, the integrative analysis of multi-ome data promises to enable the comprehensive recovery of cis-regulatory logic, which could improve the interpretability and the regulatory relevance of inferred latent space components.
Here, we present single-cell Deep multi-Omic Regulatory Inference (scDoRI), an end-to-end deep learning (DL) model that decomposes variation in gene expression and chromatin accessibility into continuous states (topics) of regulatory modules (gene-regulatory networks, GRNs). For each topic, scDoRI learns a set of co-accessible cis-regulatory elements and links them to downstream genes in a predictive framework. By modelling transcription factor (TF) binding at cis-regulatory elements using evidence from pretrained sequence-based DL models, accessibility and expression, scDoRI learns topic-specific links between TFs and their downstream target genes. This yields directly interpretable latent space components in terms of GRNs, aiding downstream analyses and exploration of multi-omic scRNA-ATAC datasets. To illustrate the model, we apply scDoRI to a multi-omic scRNA-ATAC atlas of mouse gastrulation to discover continuous changes in regulatory modules and infer fine-grained TF activities across different differentiation trajectories. By exploiting the interpretability of the latent space, scDoRI recovers context-specific changes in GRNs upon TF perturbation and accurately predicts the effect of perturbations in unseen contexts. Finally, we apply scDoRI to a multi-omic scRNA-ATAC and spatial transcriptomics dataset of glioblastoma (GBM) patients to discover changes in GRNs in heterogeneous GBM cell states across patients. We develop a transfer learning approach to map regulatory modules onto spatial data and find colocalization of specific GBM GRNs with immune GRNs in a patient-specific manner.

10:15-10:30

Measuring and modeling cooperative DNA binding by transcription factor proteins

Format: Live from venue

Moderator(s): Sushmita Roy

Vincentius Martin, Duke University - Center for Genomic & Computational Biology; Department of Computer Science, United States
Kyle Pinheiro, Duke University - Center for Genomic & Computational Biology; Department of Computer Science, United States
Farica Zhuang, Duke University - Center for Genomic & Computational Biology; Department of Computer Science, United States
Yuning Zhang, Duke University - Center for Genomic & Computational Biology, United States
Raluca Gordan, Duke University - Center for Genomic & Computational Biology; Department of Biostatistics & Bioinformatics, United States

Transcription factor-DNA binding is an important aspect of gene regulation, but the effect of cooperativity between TFs at neighboring binding sites has not been sufficiently characterized. In the present work we describe a protein-DNA binding microarray (PBM) methodology to measure TF-TF cooperativity and demonstrate that the resulting data can be used to train (via supervised machine learning) accurate classification and regression models to predict cooperativity. We additionally show that analysis of our data and models aligns with low-throughput results regarding mechanism of cooperativity for the two TFs we selected for our case-studies, ETS1 and RUNX1.

We begin by selecting a number of probe sequences from the genome that have putative, neighboring binding sites for both TFs of interest. Following a standard PBM protocol, we measure TF binding levels when the DNA library is incubated with each TF individually or with both TFs simultaneously. We then compute the cooperativity as the difference in TF binding levels between the two experiments. A given probe sequence is classified as exhibiting cooperative or independent behavior based on whether the presence of the co-factor TF leads to a statistically significant increase in the TF binding level.
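
As a rough illustration of the scoring and classification step just described, the sketch below computes cooperativity as a difference in mean binding level and labels a probe with a one-sided permutation test. The abstract does not name the statistical test, so the permutation test, the replicate values, and the 0.05 threshold are all assumptions made for illustration.

```python
import random
from statistics import mean


def cooperativity(alone, together):
    """Difference in mean binding level: both TFs present minus TF alone."""
    return mean(together) - mean(alone)


def permutation_pvalue(alone, together, n_perm=5000, seed=0):
    """One-sided permutation p-value for an increase in binding (assumed test)."""
    rng = random.Random(seed)
    observed = cooperativity(alone, together)
    pooled = list(alone) + list(together)
    k = len(alone)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mean(pooled[k:]) - mean(pooled[:k]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one correction avoids p = 0


def classify(alone, together, alpha=0.05):
    """Label a probe 'cooperative' if the increase is significant, else 'independent'."""
    p = permutation_pvalue(alone, together)
    return "cooperative" if p < alpha else "independent"


# Hypothetical replicate binding intensities for two probes.
tf_alone = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0]
with_cofactor_strong = [2.0, 2.1, 1.9, 2.2, 2.05, 1.95]
with_cofactor_flat = [1.02, 0.98, 1.1, 0.92, 1.0, 1.03]

print(classify(tf_alone, with_cofactor_strong))
print(classify(tf_alone, with_cofactor_flat))
```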

Next, we demonstrate that the cooperativity data collected via our novel experimental approach can be used to train accurate classification and regression models. We found that the random forest classifier is effective for distinguishing whether TFs with neighboring binding sites on particular sequences will exhibit independent versus cooperative binding. For the regression task, we investigated the efficacy of random forest regression (RFR), support vector regression (SVR), and convolutional neural networks (CNN). For RFR and SVR, many of the same features were used, such as predicted binding affinity of each TF, the distance between binding site centers, the relative orientation of the two binding sites, as well as DNA sequence and DNA shape features for the region between and outside of the binding site cores. RFR performed well, but error rates were further reduced when CNNs were used for modeling, especially when using a combination of one-hot encoding (to capture the DNA sequence) and the predicted binding site strengths. This suggests that while important information can be gleaned directly from the sequence data, there is information on binding site strength that the CNN is unable to learn directly from sequence.
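
The combined input mentioned above (one-hot encoded sequence plus predicted binding site strengths) might be assembled as in the following sketch. How the authors actually feed the two sources into the CNN is not stated; simple concatenation, the function names, and the example values are assumptions.

```python
BASES = "ACGT"


def one_hot(sequence):
    """One row of four 0/1 indicators per base; unknown bases (e.g. N) become all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in sequence.upper()]


def build_input(sequence, strength_tf1, strength_tf2):
    """Flattened one-hot sequence followed by the two predicted site strengths (assumed layout)."""
    flat = [value for row in one_hot(sequence) for value in row]
    return flat + [strength_tf1, strength_tf2]


# Hypothetical probe fragment and predicted strengths for the two TFs.
features = build_input("ACGTN", 0.8, 0.3)
print(len(features))  # 5 bases * 4 channels + 2 strengths
```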

10:30-10:45

Forkhead transcription factors diversify their DNA-binding targets via differential abilities to engage inaccessible chromatin

Format: Live from venue

Moderator(s): Sushmita Roy

Sonny Arora, Penn State University, United States
Tomohiko Akiyama, Keio University, Japan
Jianyu Yang, Penn State University, United States
Daniela James, Penn State University, United States
Thomas Blanda, Penn State University, United States
Nitika Badjatia, Penn State University, United States
William K.M. Lai, Cornell University, United States
B. Franklin Pugh, Cornell University, United States
Minoru Ko, Keio University, Japan
Shaun Mahony, Penn State University, United States

11:15-11:30

DNA sequence is the primary determinant of R-loop formation across genomes

Format: Live from venue

Moderator(s): Shaun Mahony

Lacey Walker, University of California, Davis, United States
Albert Stanley, University of California, Davis, United States
Gerald Quon, University of California, Davis, United States

R-loops are hybrid structures of DNA and nascent RNA that form widely across eukaryotic and bacterial genomes during transcription and can lead to transcriptional repression and genome instability. Factors such as primary DNA sequence, nascent RNA levels, DNA and RNA binding factors, and chromatin state are thought to influence R-loop metabolism. What remains unknown is the relative contribution of each of these factors to R-loop formation, and to what extent the underlying rules governing R-loop formation are conserved across species.
To address this, we constructed a deep neural network to predict R-loop formation along the genome that is based on primary DNA sequence and other genomic and epigenomic annotations. Our genomic annotations included those related to nascent and mature transcription, chromatin state (histone marks, methylation), and chromatin accessibility. Surprisingly, and despite R-loop formation’s dependence on transcription, we found that DNA sequence was the most predictive factor for R-loop formation and frequency; performance of the DNA sequence-only model was 41% higher than the nascent transcription-only model. When using a DNA sequence-only model as a baseline, we found that adding epigenetic state and nascent transcription data provided modest additional prediction power (5% more). This suggests that the role of DNA sequence is distinct from its influence on transcription and epigenetic state. When interrogating the neural network to identify the most salient DNA sequence features that predicted R-loop formation, we found that local islands of G’s and T’s provided strikingly high contributions (positively and negatively, respectively). Simulations removing those islands yielded a substantial decrease in predicted R-loop formation.
Finally, an open question is the extent to which the underlying molecular rules governing R-loop formation are conserved across species. To probe this, we trained individual models on human, mouse, zebrafish, chicken, fruit fly, and roundworm datasets and evaluated how well a model trained on one species could predict R-loop data of another species. We found that models trained on human data were 84% as predictive of yeast R-loop formation as other models trained directly on yeast, despite the large variance in GC content and gene density between their genomes. Our results suggest high conservation of sequence rules governing R-loop formation.

11:30-11:45

A framework for summarizing chromatin state annotations within and identifying differential annotations across groups of samples

Format: Live from venue

Moderator(s): Shaun Mahony

Ha Vu, University of California Los Angeles, United States
Zane Koch, University of California Los Angeles, United States
Petko Fiziev, University of California Los Angeles, United States
Jason Ernst, University of California Los Angeles, United States

11:45-12:00

Detecting dynamic 3D genome organization with multi-task matrix factorization

Format: Live from venue

Moderator(s): Shaun Mahony

Da-Inn Lee, University of Wisconsin-Madison, United States
Sushmita Roy, University of Wisconsin-Madison, United States

The three-dimensional (3D) organization of the genome, which determines how the DNA is packaged inside the nucleus, has emerged as a key regulatory mechanism of cellular function and malfunction. High-throughput chromosomal conformation capture (Hi-C) technologies have enabled the study of 3D genome organization by experimentally measuring the tendency of genomic regions to interact with one another in 3D space. Changes or disruptions to the 3D genome organization have been associated with disease, developmental, and evolutionary processes. Therefore, a key problem in regulatory genomics is to systematically detect higher-order structural changes across Hi-C datasets from multiple timepoints and conditions. Existing methods to detect changes in 3D genome organization are limited in one or more ways: they identify changes at the individual interaction level, they offer only pairwise comparison between two datasets at a time, or they do not account for more complex nested relationships among the conditions represented by the datasets. We address these limitations with a new approach, Tree-Guided Integrated Factorization (TGIF), to examine the dynamics of topologically associating domain (TAD) boundaries. TGIF is based on hierarchical multi-task Non-negative Matrix Factorization (NMF), a dimensionality reduction method particularly useful for co-clustering the row and the column entities of a matrix. TGIF takes as input Hi-C matrices from multiple related biological conditions, for example, different cell types from a cell lineage. TGIF constrains the lower-dimensional factors from closely related tasks (e.g. cell types derived from the same “parent” of the cell lineage) to be similar. The output factor matrices represent the lower-dimensional view of the chromosomal architecture. We use these low-dimensional views to calculate a TAD boundary score for each region and identify differential or common boundaries across all conditions.
Compared to existing approaches, TGIF recovers ground-truth differential TAD boundaries with higher precision in simulated data and is also less susceptible to identifying spurious boundaries due to variation in dataset depth. We applied TGIF to study 3D genome organization across multiple cell types as well as across a time course. From cardiomyocyte differentiation time-course data, TGIF identified embryonic-stage-specific TAD boundaries associated with transcriptionally active HERV-H retrotransposons; such sites are known to demarcate TAD boundaries in human and ape pluripotent cells. We further identify transcription factor motifs enriched in embryonic-stage-specific boundary regions and demonstrate that cardiovascular-disease associated GWAS SNPs are depleted in such boundary regions. Together, these results demonstrate the utility of TGIF to identify biologically meaningful topological shifts in genome organization.
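
For readers unfamiliar with NMF, the building block named in this abstract, the sketch below factorizes a single toy contact matrix with the standard multiplicative update rules for the squared-error objective. It is plain single-task NMF only: the tree-guided coupling between related conditions and the TAD boundary score that define TGIF are not shown, and the toy matrix is invented.

```python
import random


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def squared_error(x, w, h):
    """Squared Frobenius norm of x - w h."""
    wh = matmul(w, h)
    return sum((x[i][j] - wh[i][j]) ** 2
               for i in range(len(x)) for j in range(len(x[0])))


def nmf(x, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize x ~ w h with non-negative factors via multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    errors = [squared_error(x, w, h)]
    for _ in range(n_iter):
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
        errors.append(squared_error(x, w, h))
    return w, h, errors


# Toy symmetric "contact matrix": two domains of three regions each, with
# strong contacts inside a domain and weak contacts between domains.
contacts = [[1.0 if (i < 3) == (j < 3) else 0.1 for j in range(6)] for i in range(6)]
w, h, errors = nmf(contacts, rank=2)
print(errors[0], errors[-1])  # reconstruction error before and after fitting
```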

12:00-12:15

Loss of chromosome Y is associated with changes in genome-wide DNA methylation profiles of immune cells in patients with Alzheimer's disease

Format: Live from venue

Moderator(s): Shaun Mahony

Marcin Jąkalski, 3P-Medicine Laboratory, Medical University of Gdańsk, Poland
Edyta Rychlicka-Buniowska, 3P-Medicine Laboratory, Medical University of Gdańsk, Poland
Hanna Bruhn-Olszewska, Department of Immunology, Genetics and Pathology and Science for Life Laboratory, Uppsala University, Sweden
Natalia Filipowicz, 3P-Medicine Laboratory, Medical University of Gdańsk, Poland
Maciej Siedlar, Department of Clinical Immunology, Institute of Paediatrics, Jagiellonian University, Collegium Medicum, Poland
Jaroslaw Baran, Department of Clinical Immunology, Institute of Paediatrics, Jagiellonian University, Collegium Medicum, Poland
Kazimierz Węglarczyk, Department of Clinical Immunology, Institute of Paediatrics, Jagiellonian University, Collegium Medicum, Poland
Lars Forsberg, Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala University, Sweden
Jonatan Halvardson, Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala University, Sweden
Alicja Klich-Rączka, Department and Clinic of Internal Medicine and Gerontology, Jagiellonian University, Collegium Medicum, Poland
Martin Ingelsson, Department of Public Health and Caring Sciences/Geriatrics, Uppsala University, Sweden
Jan Dumanski, Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala University, Sweden
Jakub Mieczkowski, 3P-Medicine Laboratory, Medical University of Gdańsk, Poland

Loss of chromosome Y (LOY) is the most common form of clonal mosaicism in males. Presence of this phenomenon is most pronounced in leukocytes and increases with age. LOY is also associated with an increased risk for disease and mortality.

Within a single cell, LOY is an event leading to a complete loss of chromosome Y and causes disappearance of almost 2% of the male haploid nuclear genome. This should in turn lead to changes in DNA’s compaction and its rearrangement in the cell’s nucleus and hence alter the epigenetic processes that influence gene expression. Among these are DNA methylation, histone modifications, as well as higher level 3-dimensional structures such as topologically associating domains.

Significantly reduced DNA methylation levels were found to coincide with low Y levels. Our previous research showed that LOY has a clear transcriptional effect in leukocytes, and it is associated with dysregulation of multiple genes. Many of these are involved in immune functions. Moreover, we found that levels of LOY in leukocytes varied depending on the investigated cell type and had the highest range of values in granulocytes and monocytes. These two cell types make up the major human blood fractions and play key roles in inflammatory and anti-microbial defense processes.

Alzheimer’s disease (AD), the most common chronic neurodegenerative disorder and cause of dementia in the elderly, is a major public health problem worldwide. Losing chromosome Y is associated with a higher risk of AD. Disruption in inflammatory signals has also been documented to be implicated in AD.

Here we aimed to investigate the epigenetic effect of LOY in leukocytes in Alzheimer’s disease. We generated a broad set of DNA methylation data representing AD patients and healthy controls. LOY levels were estimated for all samples, which allowed us to further divide them into LOY and non-LOY groups. We confirmed the involvement of dysregulated genes (both at the level of DNA methylation, as well as bulk and single-cell RNA expression) in immune processes and found these to be abnormally activated in LOY. Additionally, we show that LOY-associated epigenetic changes are much more pronounced in AD when compared to controls.

14:00-15:00

Keynote Presentation: Dissecting the role of the repeat-ome in gene regulation and complex traits

Format: Live from venue

Moderator(s): Jason Ernst

Melissa Gymrek

15:00-15:15

A new statistical model and R package to infer ligand-receptor interactions from bulk or spatial transcriptomics data

Format: Live-stream

Moderator(s): Jason Ernst

Jean-Philippe Villemin, IRCM Inserm U1194, University of Montpellier, ICM, France
Laia Bassaganyas, IRCM Inserm U1194, University of Montpellier, ICM, France
Jacques Colinge, IRCM Inserm U1194, University of Montpellier, ICM, France

15:15-15:30

Intra-cellular Spatial Transcriptomics Toolkit (InSTAnT) reveals widespread co-localization of genes

Format: Live from venue

Moderator(s): Jason Ernst

Anurendra Kumar, Georgia Tech, United States
Ali Ebrahimpour, UIUC, United States
Alex Schrader, UIUC, United States
Juyeon Lee, UIUC, United States
Dave Zhao, UIUC, United States
Hee Sun Han, UIUC, United States
Saurabh Sinha, Georgia Tech, United States

Imaging-based spatial transcriptomics technologies such as MERFISH and seqFISH provide an extraordinarily detailed view of biological processes, including the locations of individual transcripts within a cell, across thousands of cells, for hundreds to thousands of genes. We present InSTAnT (Intra-cellular Spatial Transcriptomics Analysis Toolkit), a toolkit for extracting molecular relationships from spatial transcriptomics data at the intra-cellular resolution. InSTAnT detects gene pairs and modules with unusual patterns of co-localization within and across cells. Intra-cellular spatial patterns discovered by InSTAnT may have diverse biological interpretations, including RNA-RNA interactions with regulatory functions, formation of condensates, shared sub-cellular localization, etc., and thus provide a rich compendium of testable hypotheses regarding molecular functions.

At the heart of the InSTAnT suite is a statistical test to detect “proximal pairs” of genes in a single cell’s spatial transcriptome. This test determines if transcripts of a gene pair are located within a distance d significantly more often than expected by chance. We define a gene pair to be “d-colocalized” if it is detected as a proximal pair in many cells, thereby increasing our confidence in a spatial relationship between the two genes. InSTAnT detects d-colocalization using the “Conditional Poisson Binomial” (CPB) test, which is based on a Poisson Binomial distribution and is specially designed to (a) allow for the fact that different cells can have varying numbers of proximal pairs and (b) deemphasize abundant genes, thus yielding a diverse pool of gene pairs. It then annotates the detected gene pairs by the cellular regions (e.g., nuclear, perinuclear, cytoplasmic or perimembrane) where they tend to colocalize, and provides specialized statistical tests to identify d-colocalized pairs specific to a cell type. InSTAnT also implements a probabilistic graphical model that detects special cases where an intra-cellular colocalization signal exhibits a non-random spatial pattern at the inter-cellular level. Finally, it relies on a frequent sub-graph mining algorithm to discover gene colocalization modules that are supported by many cells.
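
To make the Poisson Binomial ingredient concrete, the sketch below computes the distribution of a sum of independent, non-identical Bernoulli trials by dynamic programming and uses its upper tail as a p-value for seeing a gene pair called proximal in at least a given number of cells. It is only the unconditional distribution, not the CPB test itself; the per-cell probabilities and the observed count are invented for illustration.

```python
def poisson_binomial_pmf(probs):
    """PMF of the number of successes among independent trials with the given probabilities."""
    pmf = [1.0]                       # zero trials: certainly zero successes
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)    # this trial fails
            nxt[k + 1] += mass * p        # this trial succeeds
        pmf = nxt
    return pmf


def upper_tail(probs, observed):
    """P(number of successes >= observed)."""
    return sum(poisson_binomial_pmf(probs)[observed:])


# Hypothetical per-cell chances that a gene pair is called proximal by chance,
# and a hypothetical count of cells in which it actually was.
per_cell_probability = [0.05, 0.10, 0.02, 0.08, 0.05, 0.03]
cells_with_proximal_pair = 4
print(upper_tail(per_cell_probability, cells_with_proximal_pair))
```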

We demonstrate the usefulness of the InSTAnT toolkit through extensive analysis of two MERFISH datasets, profiling a human U2OS cell line and the mouse hypothalamic preoptic region respectively. We report numerous colocalization patterns revealed by these analyses, along with rigorous statistical assessment and supporting evidence from databases and predicted RNA interactomes. We identify several novel cell type-specific gene colocalizations in the brain.

In summary, InSTAnT is a powerful toolkit for unbiased spatial pattern discovery and analysis for spatial transcriptomics technologies of the future.

15:30-15:45

STAN, a computational framework for inferring spatially informed transcription factor activity networks

Format: Live from venue

Moderator(s): Jason Ernst

April Sagan, University of Pittsburgh, United States
Hatice Osmanbeyoglu, University of Pittsburgh, United States

Transcription factors (TFs) are important modulators of cell fate and function, responsible for large-scale changes in response to the environment or intercellular communication. Hence, other types of cells in proximity are critical for instructing cell context-specific TF activities. Since TFs are typically expressed at low levels and regulated post-transcriptionally and/or post-translationally, estimating TF activity in specific contexts is a formidable challenge. TF activity can be indirectly assessed by target gene expression. Emerging spatial transcriptomics technologies measure genome-wide mRNA expression across thousands of spots on a tissue slice while preserving information about the location of spots and allowing characterization of the microenvironment. To date, these datasets have not been used to systematically infer the underlying context-specific TFs defining cell identity. Here, we present STAN (Spatially informed Transcription Factor Activity Network), a computational method to predict spot-specific TF activities by utilizing spatial transcriptomics datasets and cis-regulatory information. Specifically, we develop a linear mixed effect model that integrates curated TF target-gene interactions, mRNA expression, spatial coordinates, and imaging data to learn gene regulatory programs that predict the expression of target genes. Spatial coordinates and morphological features extracted from corresponding imaging data are used to promote spatially cohesive gene regulatory programs. Importantly, by using cell-type deconvolution approaches, our model can identify TFs whose activity is primarily dependent on cell type proportions and TFs whose cell-type-specific activity varies spatially.

We apply STAN to breast cancer spatial transcriptomics data from 15 patients. For statistical evaluation, we computed the mean Spearman correlation between predicted and measured gene expression profiles on held-out samples and obtained significantly better performance with models utilizing spatial coordinates and imaging data. Clustering of spots by STAN-predicted TF activities largely recovered the distinct spatial domains. For example, we show that STAN yields biological insights into TFs associated with structures such as tertiary lymphoid structures (TLSs). TLSs are well recognized and consist of T cell-rich areas containing dendritic cells and B cell-rich areas containing germinal centers. STAN-predicted BCL6 activity is associated with B cells in samples with TLSs. Indeed, BCL6 is a transcriptional repressor that has emerged as a critical regulator of germinal centers. The corresponding mRNA levels for BCL6 are low in abundance in the transcriptomics data, which highlights the advantage of our method for quantifying the effect of TFs. Taken together, STAN enhances the utility of spatial transcriptomics datasets to uncover TF and spatial relationships in diverse cellular states.
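
The evaluation metric mentioned above, a mean Spearman correlation between predicted and measured expression profiles, can be computed as in this sketch. It assumes each profile is a list of per-gene values for one held-out sample and that ties receive average ranks; the example numbers are invented.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_spearman(predicted_profiles, measured_profiles):
    """Average Spearman correlation over paired profiles."""
    pairs = list(zip(predicted_profiles, measured_profiles))
    return sum(spearman(p, m) for p, m in pairs) / len(pairs)


# Hypothetical predicted and measured expression for two held-out samples.
predicted = [[0.1, 0.4, 0.9, 1.5], [2.0, 1.0, 0.5, 0.2]]
measured = [[1.0, 4.0, 9.0, 16.0], [0.3, 0.6, 0.9, 1.2]]
print(mean_spearman(predicted, measured))
```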

16:15-16:30

A multiscale functional map of somatic mutations in cancer integrating protein structures and network topology

Format: Live from venue

Moderator(s): Andreas Pfenning

Yingying Zhang, Cornell University, United States
Alden Leung, Cornell University, United States
Le Li, Cornell University, United States
Tian Qiu, Cornell University, United States
Shayne Wierbowski, Cornell University, United States
Haiyuan Yu, Cornell University, United States

A major goal of cancer biology is to fill in the gap between somatically acquired protein-altering mutations and malignant transformation of cells. Two central approaches—identification of spatially clustered mutations on protein structures and inference of commonly dysregulated cellular pathways—interpret the potential function of mutations at different scales. Integrating these two approaches is a promising way to decipher the multiscale functional effects spread from mutated amino acid residues to cellular systems. However, limited 3D structural coverage of the human proteome and interactome has hindered understanding of how the local mutation clusters are organized in the protein-protein interaction network. Here we compiled a complete repository of the structures of every single protein as well as interfaces for all known interacting protein pairs in humans which are determined by either experimental or deep learning-based approaches, and developed a computational tool integrating spatial cluster identification with 3D structurally-informed protein network analysis to create a multiscale functional map of somatic mutations in cancer. By applying our methodology to 5,950 TCGA tumors across 19 cancer types, we identified 1,656 intra- and 3,343 inter-protein mutation clusters, of which ~50% would not have been found if using only experimentally-determined protein structures. Moreover, studying the global organization of local mutation clusters led to a 5.5-fold increase in the number of significantly dysregulated protein subnetworks, the majority of which were previously blurred by non-clustered background mutations using standard network analyses. A notable discovery was a group of local mutation clusters converging on non-canonical PRC1, a protein complex repressing cell cycle and metabolic genes. 
Altogether, this study demonstrates that charting the functional landscape of cancer mutations requires a combination of their local spatial organization within protein structures and their global organization at the network level. Our multiscale functional map of somatic mutations can ultimately spark the discovery of novel molecular mechanisms underlying tumorigenesis.

16:30-16:45

Evaluating pathway analysis methods: from benchmark to best practices

Format: Live from venue

Moderator(s): Andreas Pfenning

Luopin Wang, Purdue University, United States
Aryamav Pattnaik, Purdue University, United States
Subhransu Sahoo, Purdue University, United States
Annaleigh Powell, Purdue University, United States
Srishti Chakravorti, Purdue University, United States
Md Tajmul, Purdue University, United States
Ella Stone, Purdue University, United States
Deepika Dhawan, Purdue University, United States
Isabella Sirit, Purdue University, United States
Deborah Knapp, Purdue University, United States
Jason Hanna, Purdue University, United States
Matthew Olson, Purdue University, United States
Behdad Afzali, National Institutes of Health, United States
Majid Kazemian, Purdue University, United States

Genome-scale sequencing and screening data provide exceptionally detailed genetic and transcriptional profiles of cells. These can potentially be harnessed for early cancer detection and targeted treatment. One of the most critical steps in interpreting such data is biological pathway analysis. Pathway analysis identifies statistical relationships between experimentally derived groups of genes and groups of genes (gene sets) with known biological functions. Several methods are available for pathway analysis (e.g., GSEA and Enrichr). However, the performance of these methods in identifying correct biological pathways from either bulk or single-cell data and in ranking them by importance remains unknown, largely due to the absence of evaluation platforms using real biological data.
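
One common form of the statistical relationship described above is an over-representation test, which asks whether a query gene list overlaps a pathway gene set more than expected by chance. The sketch below uses a hypergeometric upper tail for this. It is a generic illustration, not the benchmark or the ensemble method presented in the talk, and the gene counts are invented.

```python
from math import comb


def enrichment_pvalue(n_universe, n_pathway, n_query, n_overlap):
    """P(overlap >= n_overlap) when n_query genes are drawn at random from the universe."""
    total = comb(n_universe, n_query)
    upper = min(n_pathway, n_query)
    tail = sum(comb(n_pathway, k) * comb(n_universe - n_pathway, n_query - k)
               for k in range(n_overlap, upper + 1))
    return tail / total


# Hypothetical numbers: 20,000 genes in the universe, a 100-gene pathway,
# and a 200-gene query list sharing 8 genes with the pathway.
print(enrichment_pvalue(20000, 100, 200, 8))
```

With these numbers the expected overlap by chance is one gene, so an overlap of eight gives a small p-value; in a real analysis the p-values would also need multiple-testing correction across pathways.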

Here, we develop a comprehensive benchmark for evaluating pathway analysis methods. This benchmark is composed of experimental data collected from a range of cell types, assays and species. Using this benchmark, we assessed the performance of commonly used pathway analysis methods at correctly identifying true biological pathways in bulk RNA-seq data. We specifically looked at true and false positive rates, robustness to noise and key parameters in analysis such as gene set size and ranking of genes in input gene sets. Our benchmark determined best practices for conducting pathway analysis and a simple statistical stratification to reduce the number of false positive pathways from these analyses. Additionally, we developed a new ensemble pathway enrichment method that combines multiple approaches and greatly outperforms underlying methods in all key parameters. Importantly, we have applied best practices to predict pathways associated with poor and good survival in early-stage TCGA cancer patients. We show that this approach is able to identify novel prognostic markers with strong predictive power for patient survival. Finally, we predicted drugs that modulate the expression of prognostic markers for further therapeutic use. In summary, we have created novel benchmarking and pathway analysis tools to enable unlocking the full potential of genome-scale studies.

16:45-17:00

Influence network model uncovers relations between biological processes and mutational signatures

Format: Live from venue

Moderator(s): Andreas Pfenning

Bayarbaatar Amgalan, NCBI/NLM/NIH, United States
Damian Wojtowicz, NCBI/NLM/NIH, United States
Yoo-Ah Kim, NCBI/NLM/NIH, United States
Teresa Przytycka, NCBI/NLM/NIH, United States

Presentation Overview: Show

Understanding the mutagenic processes acting on cancer cells is fundamental to a better understanding of the etiology of carcinogenesis and to designing treatment strategies. There is a growing appreciation that mutagenic processes can be studied through the lens of mutational signatures, which represent characteristic mutation patterns attributed to individual mutagens. However, the causal link between mutagens and observed mutation patterns is not fully understood, limiting the utility of mutational signatures.

To gain insights into these relationships, we developed a network-based method, named GeneSigNet, that uncovers dominant influences among genes and mutational signatures. Without using perturbation data or prior knowledge, GeneSigNet relies solely on the observed activities of nodes and infers weighted edges and their directions among genes and signatures. As input, the patient-specific activity of a node corresponding to a gene is measured by its gene expression, and the exposure of a mutational signature is taken as a measure of the activity of the corresponding mutagenic process. The network construction consists of two main steps. First, a sparse partial correlation technique is used to obtain an initial sparse weighted directed network. This graph contains both unidirectional and bidirectional edges. Next, where applicable, a partial higher moment strategy is used to orient bidirectional edges between nodes.

In general, inferring the direction of influence relations from statistical dependencies, without additional prior knowledge about regulatory mechanisms or dedicated perturbation experiments, is highly challenging. However, it has the potential advantage of capturing causative dependencies. GeneSigNet provides an important step in this direction that is independent of the specific application considered in this study.

Applying GeneSigNet to cancer data sets, we uncovered important relations between mutational signatures and several cellular processes that can shed light on cancer-related mutagenic processes. Our results are consistent with previous findings such as the impact of homologous recombination deficiency on clustered APOBEC mutations in breast cancer. The network identified by GeneSigNet also suggests an interaction between APOBEC hypermutation and activation of FOXP3-expressing regulatory T cells that suppress anti-tumor immune response. It also inferred a relation between APOBEC mutations and changes in DNA conformation, supporting the view that DNA conformational changes sensitize DNA to APOBEC mutations. GeneSigNet also exposed a possible link between the SBS8 signature of unknown aetiology and the nucleotide excision repair pathway. These results demonstrate that GeneSigNet provides a new and powerful method to reveal the relation between mutational signatures and gene expression.
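
The first network-construction step in this abstract relies on partial correlation. The sketch below shows only the basic idea, using a plain first-order partial correlation between two nodes given a third; it does not implement GeneSigNet's sparse estimation or its higher-moment edge orientation, and the function names and toy activity vectors are assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy activities (made up): x and y are both driven by a shared node z
# plus independent variation, so they correlate only through z.
z = [1, 1, -1, -1]
x = [3, 1, -1, -3]   # 2*z + [1, -1, 1, -1]
y = [3, 1, -3, -1]   # 2*z + [1, -1, -1, 1]

raw = pearson(x, y)
adjusted = partial_corr(x, y, z)
print(round(raw, 3), round(adjusted, 3))
```

Here the raw correlation between x and y is high, while the partial correlation given z is essentially zero, which is why partial correlation is useful for removing edges that are explained by a shared neighbor.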

Friday, November 11th

9:30-9:45

RSG Welcome - Day 2

Format: Live from venue

Jason Ernst

9:45-10:00

Inferring absolute developmental potential in single cells

Format: Live from venue

Moderator(s): Jason Ernst

Minji Kang, Stanford University, United States
Zhenqin Wu, Stanford University, United States
Gunsagar S. Gulati, Brigham and Women's Hospital, United States
José J. A. Armenteros, Stanford University, United States
James Zou, Stanford University, United States
Aaron M. Newman, Stanford University, United States

10:00-10:15

LinRace: single cell lineage reconstruction using paired lineage barcode and gene expression data

Format: Live from venue

Moderator(s): Jason Ernst

Xinhai Pan, Georgia Institute of Technology, United States
Xiuwei Zhang, Georgia Institute of Technology, United States
Hechen Li, Georgia Institute of Technology, United States
Pranav Putta, Georgia Institute of Technology, United States

10:15-10:30

Single-cell Ca2+ parameter inference reveals how transcriptional states inform dynamic cell responses

Format: Live from venue

Moderator(s): Jason Ernst

Xiaojun Wu, University of Southern California, United States
Roy Wollman, University of California, Los Angeles, United States
Adam MacLean, University of Southern California, United States

10:30-10:45

Mapping genotype to phenotype through joint probabilistic modeling of single-cell gene expression and chromosomal copy number variation

Format: Live from venue

Moderator(s): Jason Ernst

Linyue Fan, Columbia University, United States
Isha Arora, Cornell University, United States
Alexander Preau, Columbia University, United States
Nicolas Beltran-Velez, Fero Labs, United States
Yiping Wang, Columbia University, United States
Johannes Melms, Columbia University, United States
Amit Dipak Amin, Columbia University, United States
Yohanna Georgis, Columbia University, United States
Patricia Ho, Columbia University, United States
Lindsay Caprio, Columbia University, United States
Antoni Ribas, UCLA, United States
Alison Taylor, Columbia University, United States
Benjamin Izar, Columbia University, United States
Elham Azizi, Columbia University, United States

11:15-11:30

DeepGAMI: Deep biologically guided auxiliary learning for multimodal integration and imputation to improve phenotype prediction

Format: Live from venue

Moderator(s): Jennifer Mitchell

Pramod Bharadwaj Chandrashekar, University of Wisconsin Madison, United States
Chenfeng He, University of Wisconsin Madison, United States
Ting Jin, University of Wisconsin Madison, United States
Sayali Alatkar, University of Wisconsin Madison, United States
Saniya Khullar, University of Wisconsin Madison, United States
Daifeng Wang, University of Wisconsin Madison, United States

Presentation Overview: Show

Genotype-phenotype association is found in many biological systems, such as brain-related diseases and behavioral traits. Despite recent improvements in the prediction of phenotypes from genotypes, prediction accuracy can be further improved, and the explainability of these predictions remains challenging, primarily due to complex underlying molecular and cellular mechanisms. Emerging multimodal data enable studying such mechanisms at different scales, from genotype to phenotype, involving intermediate phenotypes such as gene expression. However, due to the black-box nature of many machine learning techniques, it is challenging to integrate these modalities and interpret the biological insights in prediction, especially when some modality is missing. Biological knowledge has recently been incorporated into machine learning modeling to help understand the reasoning behind the choices made by these models.

To this end, we developed DeepGAMI, an interpretable deep learning model to improve genotype-phenotype prediction from multimodal data. DeepGAMI uses prior biological knowledge to define the neural network architecture. Notably, it embeds an auxiliary-learning layer for cross-modal imputation while training the model from multimodal data. Using this pre-trained layer, we can impute latent features of additional modalities and thus enable predicting phenotypes from a single modality only. Finally, the model uses integrated gradients to prioritize multimodal features and links for phenotypes. We applied DeepGAMI to multiple emerging multimodal datasets: (1) population-level genotype and bulk-tissue gene expression data for predicting schizophrenia, (2) population-level genotype and gene expression data for predicting clinical phenotypes in Alzheimer's Disease, (3) gene expression and electrophysiological data of single neuronal cells in the mouse visual cortex, and (4) cell-type gene expression and genotype data for predicting schizophrenia. We found that DeepGAMI outperforms existing state-of-the-art methods and provides a profound understanding of gene regulatory mechanisms from genotype to phenotype, especially at cellular resolution. DeepGAMI is an open-source tool and is available at https://github.com/daifengwanglab/DeepGAMI.

11:30-11:45

Improving the power of fine-mapping regulatory variants by combining cell type-specific epigenetics with comparative genomics

Format: Live from venue

Moderator(s): Jennifer Mitchell

Badoi N Phan, Carnegie Mellon University, United States
Alyssa Lawler, Carnegie Mellon University, United States
Jing He, University of Pittsburgh School of Medicine, United States
Ashley R Brown, Carnegie Mellon University, United States
Irene Kaplow, Carnegie Mellon University, United States
Amanda Kowalczyk, Carnegie Mellon University, United States
Chaitanya Srinivasan, Carnegie Mellon University, United States
Grant A Fox, Carnegie Mellon University, United States
Ziheng Chen, Carnegie Mellon University, United States
Morgan E Wirthlin, Carnegie Mellon University, United States
William R Stauffer, University of Pittsburgh School of Medicine, United States
Andreas Pfenning, Carnegie Mellon University, United States

Presentation Overview: Show

Measures of nucleotide sequence conservation across species are useful for identifying functional genomic loci. For example, phyloP scores often provide a substantial boost in fine-mapping candidate loci from genome-wide association studies (GWAS). However, these methods can fail when regulatory function is maintained, often in a cell type-specific manner, even when the genome sequence itself shows minimal conservation. To overcome these limitations, we introduce Cell-TACIT, the Cell Type-Aware Conservation Inference Toolkit, to identify human trait-associated regulatory variants. In Cell-TACIT, we train convolutional neural network models corresponding to each cell type on matched single-nucleus ATAC-Seq datasets across at least 2 species. Then, we use those models to impute cell type-specific open chromatin across hundreds of orthologous candidate regulatory elements, which provides an estimate of the extent to which cell type-specific function is conserved. Finally, we compare the degree of cell type-specific conservation to GWAS summary statistics using stratified heritability enrichment scores.

Applying Cell-TACIT to dozens of neuropsychiatric trait loci identified higher heritability enrichment and more fine-mapped variants than nucleotide conservation and human open chromatin data alone. In the case of schizophrenia, a highly powered GWAS, we found that conservation based on phyloP scores yielded a heritability enrichment score of only 3.8 relative to other neural cell types. Using open chromatin annotations across the human population for a specific subtype of neuron implicated in schizophrenia (D1-D2-Hybrid) raised the heritability enrichment to 53.3. Using only highly cell type-specific open chromatin with Cell-TACIT raised the heritability enrichment to 116.7 (adj. P = 1.32E-8). These stronger enrichments were more likely to be found for neural cell types implicated in the disorder. For example, open chromatin regions in human oligodendrocyte precursor cells did show an enrichment for schizophrenia-associated mutations (heritability enrichment 21.0), but these loci did not tend to have orthologous regions also predicted to be active in oligodendrocyte precursors (heritability enrichment 10.9).

We experimentally validate predictions for enhancers with risk variants near the DRD2 schizophrenia risk locus using in vivo reporter assays. By integrating genome conservation and multi-species open chromatin data, Cell-TACIT prioritizes variants within regions of conserved regulatory function for in vivo characterization, and addresses a major challenge in translating disease associations to mechanistic understanding.

11:45-12:00

Comparative genomics with 240 mammals offers powerful new insights into mammalian genome and phenotype evolution

Format: Live from venue

Moderator(s): Jennifer Mitchell

Irene Kaplow, Carnegie Mellon University, United States
Matthew Chrismas, Uppsala University, Sweden
Diane Genereux, Broad Institute and University of Massachusetts Chan Medical School, United States
Michael Dong, Uppsala University, Sweden
Graham Hughes, University College Dublin, Ireland
Kerstin Lindblad-Toh, Broad Institute and Uppsala University (Sweden), United States
Elinor Karlsson, University of Massachusetts Medical School, United States
Allyson Hindle, University of Nevada Las Vegas, United States
Gregory Andrews, University of Massachusetts Chan Medical School, United States
Joel Armstrong, University of California Santa Cruz, United States
Mark Diekhans, University of California Santa Cruz, United States
Cornelia Fanter, University of Nevada Las Vegas, United States
Nicole Foley, Texas A&M University, United States
Linda Goodman, Fauna Bio, United States
Kathleen Keough, University of California San Francisco, United States
Bogdan Kirilenko, Senckenberg Research Institute, Germany
Amanda Kowalczyk, Carnegie Mellon University, United States
Ruby Redlich, Carnegie Mellon University, United States
Katherine Pollard, University of California San Francisco, United States
Alyssa Lawler, Carnegie Mellon University, United States
Daniel Schäffer, Carnegie Mellon University, United States
Megan Supple, University of California Santa Cruz, United States
Aryn Wilder, San Diego Zoo Wildlife Alliance, United States
Andreas Pfenning, Carnegie Mellon University, United States
Irina Ruf, Senckenberg Research Institute, Germany
Wynn Meyer, Lehigh University, United States
Beth Shapiro, University of California Santa Cruz, United States
Zhiping Weng, University of Massachusetts Chan Medical School, United States
Michael Hiller, Senckenberg Research Institute, Germany
William Murphy, Texas A&M University, United States

Presentation Overview: Show

Evolutionary constraint is a powerful predictor of genome function. Previous studies of mammalian constraint were limited by species number and reliance on human-referenced alignments. The Zoonomia Consortium has created a reference-free whole-genome alignment of 240 species to explore the evolution of placental mammals.

Using this alignment, we computed new constraint scores, achieving single-base resolution of constraint and acceleration. Within constrained transcription factor binding sites, constraint scores are highly correlated with motif information content. We identified regions of contiguous constraint (RoCC), defined as twenty or more consecutive constrained bases. The longest RoCC is in an intron of METAP1D, encompassing four distal enhancer-like ENCODE candidate cis-regulatory elements. METAP1D encodes an essential mitochondrial protein conserved at least back to the common ancestor of human and zebrafish. In addition to the METAP1D transcription start site (TSS), this RoCC physically interacts with the TSSs of TLK1 and HAT1 in the human adult cerebral cortex, which regulate chromatin structure and histone production and acetylation, respectively, and are expressed in the cortex of diverse mammals. Some of the most accelerated regions in human and chimpanzee overlap topologically associated domain boundaries near neurodevelopmental genes, suggesting that they alter these genes’ regulatory landscapes.

Intriguingly, 48.5% of constrained bases are unannotated by resources like ENCODE; we call such regions UNannotated Intergenic COnstrained RegioNs (UNICORNs). Most UNICORNs are within 500 kb of the transcription start site for a protein-coding gene. UNICORNs tend to contain fewer variants than other intergenic regions, and those variants tend to have lower allele frequencies. 17% of UNICORNs overlap open chromatin regions identified in recently published datasets from the developing brain, specific adult brain regions, and specific cell types within the motor cortex, suggesting that UNICORNs may have functions that will be revealed as additional regulatory genomics data are generated.

To associate genetic variation across species with mammalian traits, we developed new computational methods for pairing our alignment with recently curated phenotype annotations. We found that the number of olfactory receptor genes is associated with the number of olfactory turbinals, genes evolving more quickly in hibernators are associated with stress adaptation, and motor cortex regulatory elements near genes associated with microcephaly and macrocephaly are associated with the evolution of brain size. As more phenotype annotations, along with regulatory element data from relevant tissues and cell types, become available, we anticipate that our alignment and new methods will provide additional exciting insights into the evolution of placental mammals.
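
The RoCC definition in this abstract (twenty or more consecutive constrained bases) can be sketched as a simple run-length scan. The abstract does not say how individual bases are called constrained, so this example assumes that call has already been made and takes one boolean per base; the function name and the toy track are hypothetical.

```python
def find_roccs(constrained, min_len=20):
    """Return half-open (start, end) intervals of runs of constrained bases.

    `constrained` is a sequence of booleans, one per base (assumed input).
    Only runs of at least `min_len` consecutive True values are reported.
    """
    runs = []
    start = None
    for i, flag in enumerate(constrained):
        if flag and start is None:
            start = i                      # a new run begins here
        elif not flag and start is not None:
            if i - start >= min_len:       # run ended; keep it if long enough
                runs.append((start, i))
            start = None
    # Handle a run that extends to the end of the sequence.
    if start is not None and len(constrained) - start >= min_len:
        runs.append((start, len(constrained)))
    return runs


# Toy track: a 19-base run (too short), a 25-base run, and a 20-base run
# that reaches the end of the sequence.
track = [False] * 5 + [True] * 19 + [False] + [True] * 25 + [False] * 3 + [True] * 20
roccs = find_roccs(track)
print(roccs)
```

Intervals are half-open, so the length of each reported region is simply `end - start`.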

12:00-12:15

ExplaiNN: interpretable and transparent neural networks for genomics

Format: Live-stream

Moderator(s): Jennifer Mitchell

Gherman Novakovsky, University of British Columbia, Canada
Oriol Fornes, University of British Columbia, Canada
Manu Saraswat, German Cancer Research Center, Germany
Sara Mostafavi, University of Washington, United States
Wyeth W. Wasserman, University of British Columbia, Canada

14:00-15:00

Keynote Presentation: Inferring causal cell types driving human disease and complex traits

Format: Live from venue

Moderator(s): Sushmita Roy

Tiffany Amariuta-Bartell

15:00-15:15

Genome-wide detection of human variants that disrupt intronic branchpoints

Format: Live from venue

Moderator(s): Sushmita Roy

Peng Zhang, The Rockefeller University, United States
Jean-Laurent Casanova, The Rockefeller University, United States

15:15-15:30

Power of Inclusion: up to a 50-fold increase in polygenic score transferability with admixed individuals

Format: Live from venue

Moderator(s): Sushmita Roy

Yosuke Tanigawa, MIT, United States
Manolis Kellis, MIT, United States
