General Computational Biology

Schedule subject to change
All times listed are in CDT
Monday, July 11th
10:40-11:00
CosTaL: An accurate and scalable graph-based clustering algorithm for high-dimensional single-cell data analysis
Room: Madison CD
Format: Live from venue

Moderator(s): Dan DeBlasio

  • Yijia Li, University of Minnesota, United States
  • Edgar Arriaga, University of Minnesota, United States
  • David Anastasiu, Santa Clara University, United States


To analyze the large, multidimensional single-cell datasets that are increasingly common, we report CosTaL (Cosine-based Tanimoto similarity-pruned graph for community detection by Leiden algorithm), a clustering strategy. Like its predecessors PhenoGraph and PARC, CosTaL transforms cells with high-dimensional features from omics data into a weighted k-nearest-neighbor (kNN) graph. Cells become the vertices of the graph, and the relatedness between similar cells is preserved as the weight of the edges between vertices. Specifically, CosTaL builds an exact kNN graph using cosine similarity and uses the Tanimoto coefficient as the pruning strategy to re-weight the edges of the graph and improve clustering accuracy. We demonstrate that CosTaL generally achieves higher accuracy scores than other graph-based clustering methods, including PhenoGraph, Scanpy, and PARC, on seven benchmark cytometry datasets and six single-cell RNA-sequencing datasets using six different evaluation metrics. Additionally, CosTaL has the shortest computation time on large datasets, suggesting that it generally scales better than the other methods, which is beneficial for processing large datasets.
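
The graph-construction step described above can be sketched in a few lines. This is a minimal illustration, not the CosTaL implementation: the function names are hypothetical, the Tanimoto coefficient is assumed here to be computed on the neighbor sets of the two endpoints (the abstract does not specify this), and the Leiden community-detection step is left out.

```python
import numpy as np

def cosine_knn(X, k):
    """Exact k nearest neighbors under cosine similarity (brute force).

    Assumes every row of X has a nonzero norm.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)  # a cell is not its own neighbor
    # Indices of the k most similar cells for each row
    return np.argsort(-sim, axis=1)[:, :k]

def tanimoto_reweight(neighbors):
    """Re-weight each kNN edge (i, j) by the Tanimoto coefficient of the
    two neighbor sets (shared neighbors / all neighbors), and prune edges
    whose weight is zero. This set-based definition is an assumption.
    """
    sets = [set(int(j) for j in row) for row in neighbors]
    edges = {}
    for i, row in enumerate(neighbors):
        for j in row:
            j = int(j)
            shared = len(sets[i] & sets[j])
            total = len(sets[i] | sets[j])
            weight = shared / total
            if weight > 0:
                edges[(min(i, j), max(i, j))] = weight
    return edges
```

The resulting weighted edge list could then be handed to a community-detection routine such as the Leiden algorithm to obtain clusters.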

11:00-11:20
spSeudoMap: cell type mapping in spatial transcriptomics using unmatched single-cell RNA-seq data
Room: Madison CD
Format: Live from venue

Moderator(s): Dan DeBlasio

  • Sungwoo Bae, Institute of Radiation Medicine, Medical Research Center, Seoul National University, South Korea
  • Hongyoon Choi, Department of Nuclear Medicine, Seoul National University Hospital, South Korea
  • Dong Soo Lee, Department of Nuclear Medicine, Seoul National University Hospital, South Korea


Cell topography can now be tracked in various tissues by jointly analyzing spatial and single-cell transcriptomics. However, because many scRNA-seq datasets are acquired after cell sorting, integrating them with spatial data is limited by inconsistencies in the cell types that make up the two datasets. Here, we present spSeudoMap, which generates synthetic cell mixtures that closely resemble the gene expression profiles of spatial transcriptomic data and trains a model to predict spatial cell compositions. To overcome the inconsistency problem, we defined a pseudotype: the cell types that exist exclusively in the spatial data. The fraction of the pseudotype and its synthetic expression profile in each cell mixture were assigned based on pseudobulk transcriptomes. The simulated cell mixtures served as a reference, and the model estimating the cell fractions of the mixtures was adapted to predict the spatial cell composition using domain adaptation. spSeudoMap was assessed in brain tissues and accurately mapped region-specific neuron types extracted from single-cell data to their expected anatomical locations. In addition, the method could describe the spatial landscape of immune cell subtypes and their interactions in heterogeneous breast cancer tissue. In short, spSeudoMap can predict the spatial distribution of cell subpopulations using sorted scRNA-seq data and can help clarify the role of minority but significant cell types.
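
As a rough illustration of the mixture-generation idea, the sketch below blends randomly sampled single cells with a pseudotype profile. It is a simplification under stated assumptions, not the spSeudoMap code: `make_mixture` and its parameters are hypothetical names, the pseudotype fraction and profile are taken as given inputs (the abstract derives them from pseudobulk transcriptomes), and the domain-adaptation model is not shown.

```python
import numpy as np

def make_mixture(sc_expr, sc_labels, pseudo_profile, pseudo_frac, n_cells, rng):
    """Build one synthetic mixture and its cell-type fractions.

    sc_expr: (cells x genes) expression matrix from sorted scRNA-seq.
    sc_labels: cell-type label per cell.
    pseudo_profile: (genes,) assumed expression of the pseudotype.
    pseudo_frac: assumed fraction of the mixture made up by the pseudotype.
    """
    # Sample cells with replacement to form the known part of the mixture
    idx = rng.choice(len(sc_expr), size=n_cells, replace=True)
    types = np.unique(sc_labels)
    counts = np.array([(sc_labels[idx] == t).sum() for t in types])
    # Known cell types share what is left after the pseudotype fraction
    known_frac = (1.0 - pseudo_frac) * counts / n_cells
    known_expr = sc_expr[idx].mean(axis=0)
    expr = (1.0 - pseudo_frac) * known_expr + pseudo_frac * pseudo_profile
    # Fractions are ordered as np.unique(sc_labels), with the pseudotype last
    return expr, np.append(known_frac, pseudo_frac)
```

Many such mixtures, each paired with its fraction vector, would form the labeled reference on which a composition-predicting model is trained.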

11:20-11:40
One Cell At A Time: a unified framework to integrate and analyze single-cell RNA-seq data
Room: Madison CD
Format: Live-stream

Moderator(s): Dan DeBlasio

  • Lin Zhang, University Health Network, Canada
  • Chloe Wang, University Health Network, Canada
  • Bo Wang, University Health Network, Canada


The surge of single-cell RNA sequencing (scRNA-seq) technologies has given rise to an abundance of large scRNA-seq datasets at the scale of hundreds of thousands of single cells. Integrative analysis of large-scale scRNA-seq datasets can aggregate complementary biological information from different datasets and has the potential to reveal de novo cell types. However, most existing methods fail to integrate multiple large-scale scRNA-seq datasets in a computationally and memory-efficient way. Our recent work, OCAT (One Cell At A Time), is a machine learning method that sparsely encodes single-cell gene expression to integrate data from heterogeneous sources without highly variable gene selection or explicit batch effect correction. We have demonstrated that OCAT efficiently integrates multiple scRNA-seq datasets and achieves state-of-the-art performance in cell type clustering, especially in challenging scenarios with non-overlapping cell types. In addition, OCAT facilitates a variety of downstream analyses, such as differential gene analysis, trajectory inference, pseudotime inference, and cell type inference. OCAT is a unifying tool that simplifies and expedites the analysis of large-scale scRNA-seq data from heterogeneous sources.

11:40-12:00
ATHENA: Analysis of Tumor Heterogeneity from Spatial Omics Measurements
Room: Madison CD
Format: Live from venue

Moderator(s): Dan DeBlasio

  • Adriano Martinelli, ETH Zurich, Switzerland
  • Pushpak Pati, IBM Research Zurich, Switzerland
  • Maria Anna Rapsomaniki, IBM Research Zurich, Switzerland


Tumor heterogeneity has emerged as a fundamental property of most human cancers, and its accurate and biologically meaningful quantification has the potential to translate biological complexity into clinically actionable insight. Currently, spatial omics technologies are revolutionizing our understanding of tumor ecosystems, enabling their deep phenotypic profiling at an unprecedented resolution while preserving the tumor topology. Although several spatial omics data analysis tools have started to emerge, a dedicated resource that enables tumor heterogeneity quantification is largely missing. Here we introduce ATHENA, a computational framework that brings together a large collection of established and novel heterogeneity scores, borrowing ideas from spatial statistics, graph theory, and information theory, to capture the heterogeneity of the tumor ecosystem. ATHENA supports any spatial omics dataset, as well as standard tissue imaging data. Using a publicly available imaging mass cytometry dataset, we show how ATHENA can highlight tumor regions of high spatial heterogeneity and quantify spatial properties, cell interactions, and immune infiltration patterns present in the tumor ecosystem. ATHENA is implemented in a highly modular, extendable, and scalable fashion, with emphasis on visualization and interoperability with other popular computational frameworks, and it is available as a Python package under an open-source license here: https://github.com/AI4SCR/ATHENA.

12:00-12:20
Widespread redundancy in -omics profiles of cancer mutation states
Room: Madison CD
Format: Live from venue

Moderator(s): Dan DeBlasio

  • Jake Crawford, University of Pennsylvania, United States
  • Maria Chikina, University of Pittsburgh, United States
  • Brock Christensen, Geisel School of Medicine at Dartmouth College, United States
  • Casey Greene, University of Colorado School of Medicine, United States


Although DNA sequencing identifies cancer mutations, other -omics assays can provide a fuller picture of the cellular dysregulation underlying cancer pathology. However, for a given mutation, it is not always clear which -omics layer will best capture cancer-relevant signal. To evaluate the information content of different -omics types, we use them as input to classifiers trained to distinguish between samples with and without mutations in key cancer genes. Using data from the TCGA Pan-Cancer Atlas, we focus on RNA sequencing, DNA methylation arrays, reverse phase protein arrays, microRNA sequencing, and somatic mutational signatures as readouts of mutational state.

Across a collection of 217 cancer-related genes, RNA sequencing tends to be the most effective predictor of mutational state. Surprisingly, we found that other -omics layers are equally effective predictors for many genes. Mutations in most genes predicted accurately by at least one readout (52/86, or 60.5%) were predicted accurately by two or more independent readouts from the six we considered. We also found that multi-omics models provided little or no predictive improvement over the best single-omics model for six well-studied cancer genes. Our results will inform the future design of studies focused on the functional outcomes of cancer mutations.

14:30-14:50
Amino acid sequence assignment from single molecule peptide sequencing data using a two-stage classifier
Room: Madison CD
Format: Live from venue

Moderator(s): Sara Mostafavi

  • Matthew Smith, Oden Institute, The University of Texas at Austin, United States
  • Edward Marcotte, Department of Molecular Biosciences, The University of Texas at Austin, United States (COI: Erisyon co-founder, shareholder, SAB member)


Tools for protein identification and quantification lag behind DNA and RNA sequencing techniques in sensitivity and throughput. To address this, our group invented fluorosequencing, a single-molecule protein sequencing technology. In fluorosequencing, proteins are proteolytically digested into peptides, and specific amino acids are labeled with fluorescent dyes. Labeled peptides are immobilized in a flow cell where, using Edman degradation chemistry, they are sequenced in parallel while being imaged by single-molecule microscopy. Fluorosequencing produces sequencing reads from many individual molecules simultaneously, but with significantly elevated noise and error rates that must be addressed in subsequent computational analysis.

We found that Hidden Markov Models representing the state changes of a peptide undergoing sequencing can provide excellent measures of the probability of a fluorosequencing read given that peptide, which we can in turn use for Bayesian classification. Naïve models did not scale to larger peptides with more labels, so we developed a number of novel algorithmic adjustments to our Hidden Markov Model implementation tailored to fluorosequencing data. Additionally, we combined our brute-force Bayesian classifier with a k-Nearest-Neighbors classifier that reduces the number of Hidden Markov Models that need to be built and run.
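
The general pattern of scoring a read against one Hidden Markov Model per candidate and then applying Bayes' rule can be sketched as follows. This is a generic discrete-emission toy, not the authors' fluorosequencing model: the function names, the state and emission layout, and the priors are all illustrative assumptions, and none of the scaling adjustments or the k-Nearest-Neighbors pre-filter mentioned above are included.

```python
import numpy as np

def forward_likelihood(init, trans, emit, obs):
    """P(obs | model) for a discrete-emission HMM via the forward algorithm.

    init: (states,) initial state probabilities.
    trans: (states x states) transition matrix, rows sum to 1.
    emit: (states x symbols) emission probabilities, rows sum to 1.
    obs: sequence of observed symbol indices.
    """
    alpha = init * emit[:, obs[0]]
    for symbol in obs[1:]:
        # Propagate one step, then weight by the emission probability
        alpha = (alpha @ trans) * emit[:, symbol]
    return alpha.sum()

def classify(models, priors, obs):
    """Posterior probability of each candidate model given one read."""
    likelihoods = np.array([forward_likelihood(*m, obs) for m in models])
    posterior = likelihoods * priors
    return posterior / posterior.sum()
```

For long reads, a practical version would work in log space or rescale `alpha` at each step to avoid numerical underflow.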

14:50-15:10
Neural relational inference to learn long-range allosteric interactions in proteins from molecular dynamics simulations
Room: Madison CD
Format: Live from venue

Moderator(s): Sara Mostafavi

  • Juexin Wang, University of Missouri, United States
  • Jingxuan Zhu, Jilin University, China
  • Weiwei Han, Jilin University, China
  • Dong Xu, Univ. of Missouri-Columbia, United States


Protein allostery is a biological process facilitated by spatially long-range intra-protein communication, whereby ligand binding or an amino acid change at a distant site affects the active site remotely. Molecular dynamics (MD) simulation provides a powerful computational approach to probe the allosteric effect. However, current MD simulations cannot reach the time scales of whole allosteric processes. The advent of deep learning has made it possible to evaluate both spatially short- and long-range communications for understanding allostery. For this purpose, we developed and applied a neural relational inference model based on a graph neural network, which adopts an encoder-decoder architecture to simultaneously infer latent interactions for probing protein allosteric processes as dynamic networks of interacting residues. From the MD trajectories, this model successfully learned the long-range interactions and pathways that can mediate the allosteric communications between distant sites in the Pin1, SOD1, and MEK1 systems. Furthermore, the model can discover allostery-related interactions earlier in the MD simulation trajectories and predict relative free energy changes upon mutations more accurately than other methods. The software is open source and available at https://github.com/juexinwang/NRI-MD.

15:10-15:30
The effect of genomic 3D structure on CRISPR cleavage efficiency
Room: Madison CD
Format: Live-stream

Moderator(s): Sara Mostafavi

  • Shaked Bergman, Tel Aviv University, Israel
  • Tamir Tuller, Tel Aviv University, Israel


CRISPR is a gene editing technology that enables precise in vivo genome editing, but its potential is hampered by its relatively low specificity and sensitivity. Improving CRISPR’s on-target and off-target effects requires a better understanding of its mechanism and determinants. Here we demonstrate, for the first time, the effect of chromosomal 3D spatial structure on CRISPR’s cleavage efficiency, and its predictive capabilities.

We used high-resolution Hi-C data to estimate the 3D distance between different regions in the human genome and used these spatial properties to generate 3D-based features characterizing each region’s density. We evaluated these features on empirical, in vivo CRISPR efficiency data and compared them to 5 state-of-the-art CRISPR efficiency models. The 3D features improved the models’ combined R² by 24.2%, and their correlation with the empirical CRISPR efficiency was higher than that of 3 of the models.
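
One simple way to turn pairwise 3D distances into a per-region density feature is to count how many other regions fall within a fixed radius. The sketch below shows only that idea; the function name, the radius parameter, and this particular definition of density are assumptions for illustration, since the abstract does not say how its features are defined.

```python
import numpy as np

def spatial_density(dist, radius):
    """Count, for each region, the other regions within `radius`.

    dist: (regions x regions) symmetric matrix of estimated 3D distances,
    for example derived from Hi-C contact data.
    """
    within = dist <= radius
    np.fill_diagonal(within, False)  # do not count a region as its own neighbor
    return within.sum(axis=1)
```

A feature like this could be added as an extra input column alongside the sequence-based features of an existing efficiency model.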

The features indicated a uniform relation between the 3D properties of the target site and its CRISPR efficiency: sites with lower spatial density demonstrated higher efficiency. Understanding how CRISPR is affected by the 3D DNA structure provides insight into CRISPR’s mechanism in general and improves our ability to correctly predict CRISPR’s cleavage as well as design gRNAs for therapeutic and scientific use.

16:00-16:20
Systematic discovery of regulatory motifs associated with the insulator function of human enhancer-promoter interactions
Room: Madison CD
Format: Live-stream

Moderator(s): Sara Mostafavi

  • Naoki Osato, Waseda University, Japan
  • Michiaki Hamada, Waseda University, Japan


Chromatin interactions are essential in enhancer-promoter interactions (EPIs) and transcriptional regulation. CTCF and cohesin proteins are located at chromatin interaction anchors. However, there is still no overall understanding of the proteins associated with chromatin interactions and insulator functions. Here, we describe a systematic and comprehensive deep-learning-based approach for discovering DNA-binding motifs of transcription factors (TFs) associated with insulator function, EPIs, and gene expression. This analysis identified 98 directional and non-directional motifs that significantly affected the expression level of putative transcriptional target genes in human foreskin fibroblast cells, and included the following known TFs that are associated with insulator function or interact with an insulator TF: CTCF, cohesin (RAD21 and SMC3), BATF, BCL6, FOS, FOXA3, HNF4A, JUN, MAZ, MECP2, MYB, MYOD1, PAX5, PRDM9, SIN3A, SMAD2, SMAD3, SPI1, TRIM28, USF1, VDR, and ZNF143. Most of the known TFs are associated with CTCF, but MAZ is reported to have insulator function independently. These findings and methods contribute to revealing novel functions of TFs and gene regulation.

16:20-16:40
High resolution chromatin loop mapping from sparse Hi-C data based on deep learning
Room: Madison CD
Format: Live from venue

Moderator(s): Sara Mostafavi

  • Shanshan Zhang, Case Western Reserve University, United States
  • Dylan Plummer, Case Western Reserve University, United States
  • Fulai Jin, Case Western Reserve University, United States
  • Jing Li, Case Western Reserve University, United States


Mapping chromatin loops from noisy Hi-C heatmaps remains a major challenge. Here we present DeepLoop, which performs rigorous bias correction followed by deep-learning-based signal enhancement for robust chromatin interaction mapping from low-depth Hi-C data. DeepLoop enables loop-resolution single-cell Hi-C analysis. It also achieves cross-platform convergence between different Hi-C protocols and micro-C. DeepLoop allowed us to map the genetic and epigenetic determinants of allele-specific (AS) chromatin interactions in the human genome. We nominate new loci with AS interactions governed by imprinting or allelic DNA methylation. We also discovered that in the inactivated X chromosome (Xi), local loops at the DXZ4 “megadomain” boundary escape X-inactivation, but the FIRRE “superloop” locus does not. Importantly, DeepLoop can pinpoint heterozygous SNPs and large structural variants (SVs) that cause allelic chromatin loops, many of which rewire enhancers with transcriptional consequences. Taken together, DeepLoop expands the use of Hi-C to provide loop-resolution insights into the genetics of the 3D genome.