General Computational Biology

Attention Presenters - please review the Speaker Information Page
Schedule subject to change
All times listed are in EDT
Tuesday, May 14th
10:30-10:45
Characterizing Landscapes of Somatic Mutability using a Graph Based Approach
Room: Clapp Hall Auditorium
Format: Live from venue

  • Arjun Srivatsa, Carnegie Mellon University, United States
  • Russell Schwartz, Carnegie Mellon University, United States


Presentation Overview:

Rapidly developing biotechnology tools are allowing scientists to examine the genetic differences and population structure among cells within an individual. This so-called somatic mosaicism involves unique biological and mutational processes that are not yet fully understood. Cancers represent a particularly important case of an often decades-long somatic evolutionary process, whereby healthy and precancerous cells accrue (epi)mutations that eventually confer a selective advantage and lead to clonal disorder. Heterogeneity in cancers represents a fundamental challenge: each cancer is a unique output of somatic biology, containing clones with differing mutations, evolving through different mutational processes, with unique distributions in three-dimensional space. As such, the future of computational cancer care will involve effectively characterizing the heterogeneous landscape of somatic mutability. In this abstract, we propose a graph-based framework to characterize patterns of somatic mutability within somatic cell populations. We create a k-mer read graph from raw read data. We then annotate this graph according to its graph motif patterns and their associated read frequencies. Using this framework, we can detect patterns within the graphs that correspond to complex and layered structural variation at distinct clonal frequencies. The properties of these graphs and their motifs can be used to characterize and cluster different instances of somatic biology (e.g. similar cancer types). We further plan to use a more refined version of this computational framework to measure the differences in somatic biology between different cancer instances, as well as between cancer instances and precancerous or healthy instances.
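A minimal sketch of the kind of construction described, assuming a de Bruijn-style graph (the authors' exact graph definition and motif vocabulary may differ): nodes are k-mers, edges connect consecutive k-mers weighted by read support, and branch points act as simple motifs flagging variation among reads.

```python
from collections import defaultdict

def build_kmer_graph(reads, k):
    """de Bruijn-style graph: nodes are k-mers, edges link consecutive
    k-mers within a read, weighted by how many reads support them."""
    edges = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k):
            edges[(read[i:i + k], read[i + 1:i + k + 1])] += 1
    return edges

def branch_nodes(edges):
    """k-mers with more than one outgoing edge: a simple graph motif
    signalling variation among the reads (e.g. a subclonal variant)."""
    out = defaultdict(set)
    for u, v in edges:
        out[u].add(v)
    return {u for u, vs in out.items() if len(vs) > 1}

# Two read populations differing at one base create a bubble:
reads = ["ACGTACGT", "ACGTACGT", "ACGAACGT"]
g = build_kmer_graph(reads, k=3)
print(branch_nodes(g))  # -> {'ACG'}
```

In a real dataset, the frequency of reads traversing each side of such a bubble estimates the clonal frequency of the underlying variant.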

10:45-11:00
Exploring the Proteome-wide Impact of Mutational Processes in Cancer Genomes
Room: Clapp Hall Auditorium
Format: Live from venue

  • Jigyansa Mishra, 1. University of Toronto, Toronto, ON 2. Ontario Institute for Cancer Research, Toronto, ON, Canada
  • Bayati Masroor, 1. University of Toronto, Toronto, ON 2. Ontario Institute for Cancer Research, Toronto, ON, Canada
  • Nina Adler, 1. University of Toronto, Toronto, ON 2. Ontario Institute for Cancer Research, Toronto, ON, Canada
  • Kevin C. L. Cheng, 1. University of Toronto, Toronto, ON 2. Ontario Institute for Cancer Research, Toronto, ON, Canada
  • Jüri Reimand, 1. Ontario Institute for Cancer Research, Toronto, ON 2. University of Toronto, Toronto, ON, Canada


Presentation Overview:

Somatic mutations in cancer genomes are accumulated by various mutational processes (Greenman et al., 2007; Stratton et al., 2009), such as intrinsic DNA repair defects (Rosenthal et al., 2016) or extrinsic exposure to carcinogens (Alexandrov et al., 2016). Mutational processes leave characteristic mutation patterns in the genome, known as mutational signatures (Alexandrov et al., 2020; Alexandrov et al., 2013). While genomic studies have extensively characterized these signatures (Alexandrov et al., 2016; Burns et al., 2013; Pfeifer et al., 2005; Roberts et al., 2013), their impact on the proteome remains largely unexplored. Our lab recently uncovered that certain mutational processes predominantly induce nonsense mutations in hallmark cancer genes and pathways, leading to protein truncation and loss-of-function effects (Adler et al., 2023). Here, we have conducted a proteogenomic analysis to delineate the impact on protein interaction networks of over 700,000 missense variants (causing amino acid substitutions) collected from 12,000 genomes spanning 24 cancer types from TCGA (Weinstein et al., 2013). We mapped the variants to 150,000 experimentally validated phosphorylation sites in the proteome, catalogued by ActiveDriverDB (Krassowski et al., 2018), and applied MIMP (Wagih et al., 2015), a machine learning model, to predict their potential to rewire phosphorylation sites. Specific single-base substitution (SBS) signatures were revealed to selectively alter phosphorylation-associated conserved short linear protein motifs, i.e., SLiMs (Davey et al., 2012). The UV signature SBS7b was found to disrupt CDK kinase-substrate interactions regulating cell cycle and proliferation in melanoma. The APOBEC-driven DNA editing signature SBS2 impacted CK2 kinase-associated motifs in lung cancers, leading to tumor progression.
Hallmark cancer genes like BRAF in melanoma and U2AF1 in lung adenocarcinoma were identified as hotspots for these kinase motif-altering variants, arising from distinct signatures. In BRAF, the V600E substitution, primarily triggered by the ageing-correlated SBS5 and UV-linked SBS7b signatures, induces a new phosphorylation site recognised by the Polo-Like Kinase (PLK) family. This putatively explains why elevated expression of PLK kinases in patients carrying the BRAF-V600E substitution is associated with resistance to vemurafenib treatment (Babagana et al., 2020) and significantly worse overall survival (Uebel et al., 2023). Thus, studying how mutational processes rewire the protein interactome provides vital insights into how endogenous factors and lifestyle variables, such as tobacco consumption and inadequate sun protection, reconfigure signaling events in cancer. These alterations profoundly impact cancer risk, progression, and tumor heterogeneity. Moreover, our study holds promise to identify therapeutic targets and biomarkers, advancing precision oncology.

11:00-11:15
An integrated cumulative scoring function to uncover novel acute leukemia-related genes based on functional effects of non-synonymous single nucleotide variants
Room: Clapp Hall Auditorium
Format: Live from venue

  • Amanda Bataycan, University of Texas at El Paso, United States
  • Jonathon Mohl, University of Texas at El Paso, United States
  • Ming-Ying Leung, University of Texas at El Paso, United States


Presentation Overview:

We have devised a quantitative scoring function, Q(Gene), to assess the cumulative effects of nonsynonymous single nucleotide variants (SNVs) on protein-coding genes in acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). The goal is to find novel candidate leukemia-related genes for further bioinformatics analyses and wet-lab studies. With the Genomic Data Commons as the primary resource for this analysis, SNV information was obtained from whole-exome data of 149 patients with AML and 603 patients with ALL. In total, 136,084 and 181,998 distinct SNVs were found in the AML and ALL datasets, respectively, with the vast majority occurring only in tumor samples.

The idea of Q(Gene) scoring is to sum up the deleterious effects of the individual SNVs with respect to the transcripts in which they occur, weighted by the occurrence frequency difference between tumor and normal samples among patients and accounting for transcript lengths, to obtain an overall cumulative pathogenic score for each gene. Among the various analyzers that quantify the functional effects caused by SNVs, we employed four well-established ones: (i) FATHMM-XF, based on supervised machine learning; (ii) SIFT, based on sequence homology and incorporating physical properties of amino acids; (iii) PolyPhen, which uses sequence, phylogenetic, and structural characteristics; and (iv) CADD, which includes conservation and functional information. With each analyzer, we calculated the Q(Gene) score for every gene containing a deleterious nonsynonymous SNV and generated a ranked list of pathogenic genes.
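The abstract does not give the exact form of Q(Gene), so the following is only a hypothetical sketch of such a cumulative score: per-SNV deleteriousness, weighted by the tumor-minus-normal occurrence frequency difference and normalized by transcript length. All numbers are made up for illustration.

```python
def q_gene(snvs, transcript_len):
    """Hypothetical cumulative score: per-SNV deleteriousness (already
    averaged across analyzers in the integrated version), weighted by
    the tumor-vs-normal frequency difference, normalized by length."""
    return sum(d * (f_tumor - f_normal)
               for d, f_tumor, f_normal in snvs) / transcript_len

# (deleteriousness in [0,1], tumor frequency, normal frequency) per SNV;
# illustrative values only
snvs = [(0.9, 0.30, 0.01), (0.6, 0.10, 0.02)]
score = q_gene(snvs, transcript_len=1.5)  # hypothetical length unit
print(round(score, 3))  # -> 0.206
```

Averaging analyzer outputs before this step, as in the integrated score, only changes the `d` values fed in.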

The performance of each analyzer with Q(Gene) scoring was assessed by the percentage of overlap between its top-scoring genes and lists of known AML- and ALL-related genes from published literature and public databases. As expected, the overlap percentages vary substantially among the analyzers. We therefore developed an integrated Q(Gene) score by averaging the deleteriousness values given by the individual analyzers for each SNV before inputting them to Q(Gene). This integration procedure improved the overlap percentages. The top novel gene predicted using the integrated Q(Gene) score is GOLM1 for AML and MXRA5 for ALL. Although not yet reported in the context of leukemia, GOLM1 has previously been linked to liver cancer and MXRA5 to lung and colorectal cancers.

We are currently examining the performance of the integrated Q(Gene) scoring function on other cancer datasets with additional functional effect analyzers, aiming to establish a general scoring scheme that can reliably identify potential cancer-related genes based on SNV data.

11:15-11:30
Reconstructing tumor evolutionary histories from single-cell sequencing data incorporating single nucleotide variants (SNV), copy number aberrations (CNA) and structural variants (SV)
Room: Clapp Hall Auditorium
Format: Live from venue

  • Nishat Anjum Bristy, Graduate Student, Computational Biology Department, Carnegie Mellon University, United States
  • Xuecong Fu, Graduate Student, Department of Biological Sciences, Carnegie Mellon University, United States
  • Minhang Xu, Graduate Student, Baylor College of Medicine, United States
  • Lanting Li, Graduate Student, Computational Biology Department, Carnegie Mellon University, United States
  • Russell Schwartz, Carnegie Mellon University, United States


Presentation Overview:

Tumor progression is a process of clonal evolution, in which a healthy cell population repeatedly acquires mutations, giving rise to more aggressive cell populations (clones) over time. Understanding tumor progression through the lens of evolution can be a powerful way to unveil the intricate process by which somatic alterations accumulate within tumor cells as a cancer progresses. This is commonly done by inferring evolutionary trees of clonal evolution, known as ‘tumor phylogenies’. Alterations driving tumor progression include single nucleotide variations (SNV), copy number aberrations (CNA) and structural variations (SV). However, most prevalent methods for reconstructing tumor evolution use only a subset of these variants, potentially overlooking important mutational information. We have developed a method for reconstructing tumor evolutionary histories from single-cell sequencing data that, for the first time, leverages all three variant types. The key insight of our approach is that these variant types offer complementary information about tumor evolution, and incorporating all of them can lead to more accurate inferences and a more comprehensive understanding of how distinct mutational events give rise to different subclones. Through simulations and real biological datasets, we have demonstrated that incorporating each of these variant types improves the accuracy of clonal phylogeny reconstruction. In ongoing work, we are extending our method to take advantage of diverse data modalities, including extensions from DNA to RNA and hybrid data sets, in order to characterize more comprehensively the clonal structure and the mutations establishing it, and to better understand the interplay of mutation events and altered function mediated by gene expression in driving tumor evolution.

11:30-11:45
From protein domains to cancer subtypes: harnessing the power of classification algorithms to advance biology
Room: Clapp Hall Auditorium
Format: Live from venue

  • Kirill Medvedev, Department of Biophysics, University of Texas Southwestern Medical Center, United States
  • R. Dustin Schaeffer, Department of Biophysics, University of Texas Southwestern Medical Center, United States
  • Anna Savelyeva, Department of Urology, University of Texas Southwestern Medical Center, United States
  • Kenneth Chen, Department of Pediatrics, University of Texas Southwestern Medical Center, United States
  • Aditya Bagrodia, Department of Urology, University of California San Diego Health, United States
  • Liwei Jia, Department of Pathology, University of Texas Southwestern Medical Center, United States
  • Nick Grishin, Departments of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, United States


Presentation Overview:

Classification is a fundamental concept in scientific research, particularly in fields such as biology, machine learning, and statistics. In the context of biological objects, classification involves creating simplified models of the real world, facilitating the analysis and interpretation of large amounts of information. Our team develops and maintains the Evolutionary Classification of protein Domains (ECOD) database, which distinguishes itself from other structural classifications by primarily grouping domains based on evolutionary relationships (homology) rather than topology. AlphaFold (AF), a recently developed deep learning method, has demonstrated the capability to predict protein structure with atomic-level accuracy. Using AF protein models, we studied and evaluated the pan-cancer structurome - the structural space of proteins over- and underexpressed in 21 cancer types from The Cancer Genome Atlas (TCGA) - using domains from the ECOD classification. Our analysis revealed significant overrepresentation of beta-sandwich domains, due to high levels of immunoglobulins, and underrepresentation of proteins with exclusively alpha-helical domains, due to their involvement in homeostasis, apoptosis and transmembrane transport. We also used AF models for evolutionary classification of proteins from the pangenome of Salmonella enterica, a pathogenic bacterium known for causing severe typhoid fever in humans. We classified 17,238 domains from 13,147 proteins across 79,758 Salmonella enterica strains and studied in detail the domains of 272 proteins from fourteen characterized Salmonella pathogenicity islands. Analysis of the potentially pathogenic proteins indicates that they form 119 clusters within the Salmonella genome, suggesting their potential contribution to the bacterium's virulence. The classification of cancer subtypes is of immense importance in the field of oncology.
Testicular germ cell tumors (TGCT) are the most prevalent solid malignancy among adolescents and young men, ranking second in average life-years lost per person dying of cancer. We conducted a computational study of 64 pure seminomas available at TCGA. A consensus clustering approach applied to these 64 pure seminoma samples, based on transcriptomic data, identified two distinct subtypes. Our analysis revealed that the two seminoma subtypes differ in pluripotency stage, activity of double-stranded DNA break repair mechanisms, rates of loss of heterozygosity and telomere elongation, and expression of lncRNAs associated with cisplatin resistance. Using all available histopathological slides of pure seminoma at TCGA, we developed a deep learning decision-making tool that identifies seminoma subtypes from slide images alone. Overall, our findings suggest that seminoma subtype 2 exhibits similarities with non-seminomatous TGCTs, which are characterized by increased aggressiveness and necessitate an alternative treatment approach.

11:45-12:00
Development of an accurate and robust pan-cancer RNAseq Expression-based Machine Learning classifier for pediatric malignancies
Room: Clapp Hall Auditorium
Format: Live from venue

  • Daniel Putnam, St Jude Children's Research Hospital, United States
  • Alexander Gout, St Jude Children's Research Hospital, United States
  • Delaram Rahbarinia, St Jude Children's Research Hospital, United States
  • Meiling Jin, St Jude Children's Research Hospital, United States
  • Xiaotu Ma, St Jude Children's Research Hospital, United States
  • Yen-Chun Liu, St Jude Children's Research Hospital, United States
  • Jinghui Zhang, St Jude Children's Research Hospital, United States
  • David Wheeler, St Jude Children's Research Hospital, United States
  • Larissa Furtado, St Jude Children's Research Hospital, United States
  • Xiang Chen, St Jude Children's Research Hospital, United States


Presentation Overview:

Machine learning classifiers based on molecular features of tumors are increasingly used in the diagnosis of pediatric cancer and have become a key component of personalized therapy. Although classifiers using methylation data have been broadly adopted in pediatric central nervous system (CNS) tumor diagnostics, whole transcriptome sequencing (RNAseq) has been less commonly utilized as a diagnostic approach. Here we report the development of a supervised machine learning pipeline to classify pediatric tumors that has shown consistently high accuracy on multiple RNAseq datasets.

For broad generalizability, mRNA-seq and total RNA-seq data, generated from both research cohorts and routine clinical testing, were integrated for this development. We harmonized diagnosis subtype annotations for 1765 (1129 total RNA-seq, 636 mRNA-seq) hematologic malignancy (B-ALL (n=16), T-ALL (n=8), and AML (n=13)) and 696 (616 total RNA-seq, 80 mRNA-seq) non-CNS solid-tumor samples (n=13), guided mainly by WHO criteria and supplemented by key molecular signatures from the literature and input from St. Jude pathologists.

We divided the samples into 70% training and 30% test cohorts by patient ID and tumor subtype. We retained 17,975 protein-coding genes after removing genes with consistent and large differences (abs(logFC) > 5, FDR < 0.001 and average expression > 0) between samples obtained on the mRNA and total RNA platforms. A subset quantile normalization model and a frozen surrogate variable analysis (fSVA) model were fit on the training samples and subsequently applied to the test samples for normalization and batch correction, respectively. Top PCA features explaining 90% (non-CNS solid tumors) or 95% (hematologic malignancies) of the variance were used to train a stacking ensemble classifier from five base class predictors (Linear Discriminant Analysis, Support Vector Machine, Neural Network, Multi-class Logistic Regression and Random Forests), which returned a final class prediction with a calibrated confidence score for test samples.
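A toy illustration of the stacking step, with trivial stand-in base predictors in place of the five trained models and a majority vote in place of the fitted meta-learner (the real pipeline trains a meta-model on the base predictions and calibrates its confidence):

```python
from collections import Counter

def stack_predict(base_models, meta_model, x):
    """Stacking: collect each base model's class prediction, then let a
    meta-model turn them into the final call."""
    return meta_model([m(x) for m in base_models])

# Trivial stand-ins for the trained base predictors
base_models = [
    lambda x: "AML" if x[0] > 0.5 else "B-ALL",
    lambda x: "AML" if x[1] > 0.5 else "B-ALL",
    lambda x: "AML" if x[0] + x[1] > 1.0 else "B-ALL",
]

def meta_model(preds):
    """Majority vote with the agreement fraction as a stand-in for the
    calibrated confidence score."""
    label, n = Counter(preds).most_common(1)[0]
    return label, n / len(preds)

# Predicts 'AML' with 2/3 agreement for this toy feature vector
print(stack_predict(base_models, meta_model, (0.8, 0.3)))
```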

Our classifier achieved an overall accuracy of 92% and 99% on the hematologic and solid-tumor test samples, respectively. We further validated the model using the NCI-TARGET dataset, an external pediatric pan-cancer cohort, achieving an overall accuracy of 91% in hematologic malignancies (n=594, mRNA-seq) and 99% in solid tumors (n=451, mRNA-seq). Taken together, we have generated an accurate pan-cancer classification scheme for pediatric malignancies that is based on predefined clinically important subtypes and robust to potential confounding effects introduced by different sequencing platforms and cohorts. This classifier prototype represents a valuable approach to support tumor diagnosis and clinically meaningful stratification of tumor types.

14:30-14:45
Isoform and fusion detection on bulk and single-cell long-read RNA-seq data
Room: Clapp Hall Auditorium
Format: Live from venue

  • Wenjia Wang, University of Pittsburgh, United States
  • Yuzhen Li, University of Pittsburgh, United States
  • Baoguo Ren, University of Pittsburgh, United States
  • Yan P. Yu, University of Pittsburgh, United States
  • Jian-Hua Luo, University of Pittsburgh, United States
  • George Tseng, University of Pittsburgh, United States
  • Silvia Liu, University of Pittsburgh, United States


Presentation Overview:

Advancements in long-read transcriptome sequencing (long-RNA-seq) technology have revolutionized the study of isoform diversity. Alternative splicing allows a single gene to produce multiple protein isoforms; it plays a crucial role in increasing the diversity of proteins encoded by the genome, thereby contributing to the complexity and functionality of higher organisms. Fusion transcripts are fused RNAs that can either translate into chimeric proteins or alter gene expression. Studies show that fusion transcripts have high concurrence rates across multiple types of cancer samples and are closely correlated with cancer recurrence. Full-length transcripts generated by long-RNA-seq can enhance the detection of such transcriptome structural variations.

In this project, we propose IFDlong, a bioinformatic tool to detect isoform and fusion transcripts from long-read transcriptome sequencing data. The software first annotates long reads with genes and isoforms and quantifies isoform expression by a novel expectation-maximization algorithm; it then discovers and quantifies fusion transcripts at the isoform level. In evaluations, the IFDlong pipeline showed overall the best performance compared with several existing tools on multiple in-silico simulation data sets with different settings of sequencing error rate, read length, and novel isoform and fusion combinations. Applied to in-house bulk long-read RNA-seq data from 8 colon cancer samples and single-cell long-read RNA-seq data from liver cancer samples, our pipeline detected novel fusion transcripts. In addition, analyses of multiple public long-read data sets demonstrated that our tool is compatible with different long-read platforms and can accurately profile alternative splicing and fusion events. Novel isoforms and fusions detected by our pipeline may serve as biomarkers for disease diagnosis and as therapeutic targets. These insights into isoform-specific biomarkers promise significant translational impact for the proposed software.
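A minimal toy version of EM-based isoform quantification (not IFDlong's actual algorithm): reads compatible with several isoforms are fractionally assigned in proportion to the current abundance estimates, which are then re-estimated from the soft counts.

```python
def em_isoform_abundance(compat, n_iso, iters=100):
    """Toy EM for isoform abundance. E-step: split each read among the
    isoforms it is compatible with, proportionally to the current
    abundance estimates. M-step: re-estimate abundances."""
    theta = [1.0 / n_iso] * n_iso
    for _ in range(iters):
        counts = [0.0] * n_iso
        for iso_set in compat:
            z = sum(theta[i] for i in iso_set)
            for i in iso_set:
                counts[i] += theta[i] / z
        theta = [c / len(compat) for c in counts]
    return theta

# Three reads unique to isoform 0, one read ambiguous between 0 and 1:
reads = [[0], [0], [0], [0, 1]]
theta = em_isoform_abundance(reads, n_iso=2)
print([round(t, 2) for t in theta])  # -> [1.0, 0.0]
```

The unique reads pull all of the ambiguous read's mass toward isoform 0, which is the maximum-likelihood solution here.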

14:45-15:00
Sparse Approaches to Differential Abundance and Expression Analyses: Potential and Pitfalls
Room: Clapp Hall Auditorium
Format: Live from venue

  • Won Gu, Penn State University, United States
  • Justin Silverman, Penn State University, United States


Presentation Overview:

The problem of compositions in gene expression and microbiome data analysis has become well known. Naive approaches to differential analysis can lead to spurious conclusions, e.g., biased estimates and elevated type-I and type-II error rates. Many authors have proposed tools that attempt to overcome the problem of compositions by assuming that only a minority of genes or taxa are differentially abundant or differentially expressed. Recently, we have shown that many of these tools assume the wrong type of sparsity and, as a result, can lead to type-I error rates exceeding 80%. In this talk, I will discuss these pitfalls and potential solutions using Bayesian Partially Identified Models.
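A small illustration of the compositional problem: if only gene A changes in absolute terms, a naive proportion-based analysis still sees every other gene change.

```python
# Absolute transcript counts in two conditions: only gene A changes.
control = {"A": 100, "B": 100, "C": 100}
treatment = {"A": 400, "B": 100, "C": 100}

def proportions(counts):
    """What sequencing effectively measures: relative, not absolute,
    abundance."""
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

p0, p1 = proportions(control), proportions(treatment)
# Gene B is not differentially expressed, yet its proportion halves,
# so a naive proportion-based test would flag it as "down".
print(p0["B"], p1["B"])
```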

15:00-15:15
How low can you go? Using deep learning diffusion models for denoising of low-depth RNA-seq data
Room: Clapp Hall Auditorium
Format: Live from venue

  • Carl Munoz, University of Montreal, Canada
  • Sebastien Lemieux, IRIC / Université de Montréal, Canada


Presentation Overview:

Bulk RNA-seq is often done at a depth of around 20-50 million reads per sample. However, RNA-seq becomes prohibitively expensive for larger-scale projects at this level of coverage. One solution to increase the number of samples without increasing the budget is to reduce the sequencing depth per sample. However, this comes at the cost of reducing the quality of the data. As such, there is a trade-off that must be made between the number of samples and the quality of each transcriptomic profile.
There exist imputation tools developed for single-cell RNA-seq and spatial transcriptomics data that circumvent this dilemma by denoising the low-depth RNA-seq data. However, none use the number of reads per sample as information for imputation, which could be crucial in determining exactly the amount of denoising that must be done. Additionally, these methods have not been designed for general usage outside of their respective sequencing technologies.
One method with the potential to address both of these issues, though not yet used in this context, is the denoising diffusion probabilistic model (DDPM) (Ho et al., 2020). DDPMs were originally designed to generate images; we intend to use the idea behind them to denoise RNA-seq data. Specifically, in the Poisson-JUMP version (Chen & Zhou, 2023), which instead starts with all-zero values and adds counts at each step, the intermediate timesteps bear a striking resemblance to low-coverage RNA-seq. This suggests a compatibility between this model and RNA-seq that would allow us to use a low-coverage sample as input at an intermediate timestep to generate a denoised sample.
As such, we are currently developing a DDPM capable of denoising RNA-seq data while conserving important biological information (such as sample type and differentially expressed genes). Using The Cancer Genome Atlas (TCGA) (Weinstein et al., 2014), we have established benchmarks on prediction accuracy and differentially expressed genes, which suggest that sequencing depth can be reduced substantially before performance degrades significantly. Once completed, we expect the model to denoise RNA-seq from nearly any coverage level, allowing cost reductions for RNA-seq in both medical and experimental contexts.
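Low-depth profiles for benchmarks like these are commonly simulated by binomial thinning of deep profiles; a minimal sketch of that standard technique (not necessarily the authors' exact procedure):

```python
import random

def downsample(counts, rate, seed=0):
    """Binomial thinning: keep each original read independently with
    probability `rate` to mimic shallower sequencing of the same sample."""
    rng = random.Random(seed)
    return [sum(rng.random() < rate for _ in range(c)) for c in counts]

deep = [500, 120, 0, 3000]             # per-gene counts at full depth
shallow = downsample(deep, rate=0.05)  # roughly 5% of the original depth
print(shallow)
```

Thinned counts are never larger than the originals, and zero-count genes stay at zero, mirroring the dropout seen in genuinely shallow libraries.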

15:15-15:30
Automating the identification of public transcriptomic datasets that include particular metadata attributes
Room: Clapp Hall Auditorium
Format: Live from venue

  • Grace S. Brown, Brigham Young University, United States
  • Tolulope Barbara Akinbo, Brigham Young University, United States
  • Stephen Piccolo, Brigham Young University, United States


Presentation Overview:

Funding agencies and journals require researchers to deposit high-throughput molecular data in public repositories so that others can validate their findings and reuse their data for new studies. However, experimental data are of little use without metadata, which describe individual samples (e.g., self-described race, biological sex, or tumor stage). Researchers often use particular metadata variables as the primary variable of interest or use metadata to control for confounding effects. Metadata also matter to society: with them, researchers can assess whether their findings are relevant to diverse populations. Historically, males and people of European descent have been overrepresented in biomedical research. However, current representation is hard to quantify because databases lack standard reporting vocabularies, researchers describe metadata differently, and many datasets are missing metadata attributes. Accurately identifying which datasets include a particular attribute requires manual review and is thus time- and labor-intensive.

We addressed this issue using a machine-learning approach. After downloading metadata for 27,000 transcriptomic datasets from BioProject (NCBI), we manually reviewed 2,000 datasets for the presence of metadata describing race/ethnicity/ancestry, sex, or tumor stage. We characterized each metadata attribute as an n-gram profile based on its name and associated data values. We then trained Random Forest models on these profiles and generated predictions for the remaining 25,000 datasets. By manually reviewing 3,000 of these datasets at random, we found that our models identified race/ethnicity/ancestry, sex, and tumor stage with areas under the precision-recall curve of 0.90, 0.98, and 0.67, respectively. Additionally, we reviewed all datasets predicted with a confidence score above 10% and observed precision values of 0.54, 0.91, and 0.28, respectively.
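A sketch of the n-gram featurization step (the exact n, tokenization, and concatenation the authors used are not stated; these choices are illustrative):

```python
def ngram_profile(text, n=3):
    """Character n-gram counts for a metadata attribute's name and
    values; feature rows like this can be fed to a Random Forest."""
    text = text.lower()
    profile = {}
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        profile[gram] = profile.get(gram, 0) + 1
    return profile

# One attribute: its name plus a few observed values, concatenated
attr = "sex male female female male"
profile = ngram_profile(attr)
print(sorted(profile.items())[:3])
```

Attributes with the same meaning but different labels (e.g. "gender" vs. "sex") end up with overlapping n-gram profiles, which is what lets a classifier generalize across datasets.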

With this promising ability to identify datasets that include metadata attributes, especially race/ethnicity/ancestry and sex, we estimated subgroup representation for all 27,000 datasets. We found that 52% of transcriptomic study subjects were male and 48% were female. Reported race/ethnicity/ancestry values included 59% European (or European ancestry), 18% African (or African ancestry), 12% Asian (or Asian ancestry), 0.8% Native American, 0.7% Oceanian, and 0.2% Middle Eastern. Our method can be applied to identify datasets containing other metadata attributes of interest to researchers. By simplifying the assessment of demographic data, our project enhances the efficiency and inclusivity of research that relies on public data repositories.

16:00-16:15
Compound Identification: Generalization of Shannon Entropy Similarity Measure via Tsallis Entropy and Comparison of Spectrum Preprocessing Pipelines
Room: Clapp Hall Auditorium
Format: Live-stream

  • Hunter Dlugas, Wayne State University, United States
  • Xiang Zhang, University of Louisville, United States
  • Seongho Kim, Wayne State University, United States


Presentation Overview:

Spectral library matching is a common method for identifying compounds from their mass spectrometry data. It identifies a given query compound by computing a similarity score between the mass spectrum of the query compound and that of every compound in a reference library; the identity of the query compound is then taken to be the reference library compound with the largest similarity score. Spectrum preprocessing prior to computing similarity scores typically involves combining ion fragments of near-identical size, matching ion fragments between the query and reference library spectra, and removing low-intensity ion fragments. Another standard preprocessing step is the weight factor transformation, which aims to find the optimal contribution of fragment ion size to the similarity scores and subsequent compound identification. The standard similarity measure used in spectral library matching is the Cosine similarity measure; in 2021, the Shannon Entropy similarity measure was introduced and demonstrated to outperform it.

We generalize the Shannon Entropy similarity measure via Tsallis Entropy by introducing an ‘entropy dimension’ parameter. We also compare the three spectrum preprocessing pipelines obtained by performing the weight factor transformation at different locations within the standard spectrum preprocessing pipeline. The novel Tsallis Entropy similarity measure is validated on (1) liquid chromatography – tandem mass spectrometry (LC-MS/MS) data from the Global Natural Products Social Networking (GNPS) database and (2) gas chromatography – mass spectrometry (GC-MS) data from National Institute of Standards and Technology (NIST) databases. Because the standard spectrum preprocessing transformations do not apply to GC-MS data, the comparison of spectrum preprocessing pipelines is validated on the LC-MS/MS data only. Results show that, for the LC-MS/MS data, the novel Tsallis Entropy similarity measure slightly outperformed the Shannon Entropy similarity measure in both accuracy and AUC regardless of where the weight factor transformation is performed relative to combining ion fragments of near-identical size, removing low-intensity ion fragments, and matching ion fragments between the query and reference spectra. The accuracy and AUC of the Shannon and Tsallis Entropy similarity measures were nearly identical in the GC-MS case. The Cosine similarity measure outperformed both the Shannon and Tsallis Entropy similarity measures in accuracy and AUC when the weight factor transformation was performed (1) after combining ion fragments of near-identical size and removing low-intensity ions in the LC-MS/MS case and (2) in the GC-MS case.
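A sketch of the two entropy measures on toy spectra. The q=1 case reproduces the published Shannon Entropy similarity; the normalization used for q != 1 below is a guess, not the authors' Tsallis formula.

```python
import math

def tsallis_entropy(p, q):
    """Tsallis entropy; recovers the Shannon entropy in the limit q -> 1."""
    if q == 1:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (1 - sum(x ** q for x in p if x > 0)) / (q - 1)

def entropy_similarity(a, b, q=1):
    """Entropy-based similarity between two intensity vectors aligned on
    the same fragment bins. q=1 gives the Shannon Entropy similarity;
    the q != 1 branch is an unnormalized sketch only."""
    ta, tb = sum(a), sum(b)
    a = [x / ta for x in a]
    b = [x / tb for x in b]
    mix = [(x + y) / 2 for x, y in zip(a, b)]
    gain = (2 * tsallis_entropy(mix, q)
            - tsallis_entropy(a, q) - tsallis_entropy(b, q))
    return 1 - gain / math.log(4) if q == 1 else 1 - gain

identical = entropy_similarity([1, 2, 3], [1, 2, 3])  # -> 1.0
disjoint = entropy_similarity([1, 0, 0], [0, 0, 1])   # -> 0.0
```

Identical spectra gain no entropy when mixed, so the score is 1; spectra with no shared fragments gain the maximum, so the score is 0.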

16:15-16:30
Statistical challenges in detection of noncanonical proteins in mass spectrometry data
Room: Clapp Hall Auditorium
Format: Live from venue

  • Aaron Wacholder, Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, United States
  • Anne-Ruxandra Carvunis, Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, United States


Presentation Overview: Show

Research over the past decade has demonstrated pervasive translation of eukaryotic genomes outside of annotated coding sequences. It is unknown how many of these “noncanonical” translated open reading frames (ORFs) encode stable proteins with the potential to participate in biological processes. A common approach to finding noncanonical proteins that are stable in the cell is to search for predicted noncanonical proteins in “shotgun” proteomics mass spectrometry (MS) data. However, studies employing this strategy report a wide range of findings, with some reporting hundreds of noncanonical proteins detected and others only a few. To investigate this apparent discrepancy, we reanalyzed MS datasets in yeast and human.

Using a standard target-decoy approach to obtain a 1% false discovery rate (FDR) among noncanonical detections, we find that predicted noncanonical proteins are detected at very low rates. In yeast, only four of over 18,000 predicted noncanonical proteins could be detected across three large MS datasets. To better understand why noncanonical proteins are so challenging to detect in MS data, we developed a logistic regression model to predict detectability based on the data from canonical proteins. We find that the low detectability of noncanonical proteins is expected from their typically short lengths and low expression levels.

We next examined why, in contrast to our results, several studies report relatively high rates of noncanonical protein detection in MS data. We find that there is considerable variation among studies in how FDR is controlled. For example, studies may control FDR at the whole proteome level or for the noncanonical class specifically, and studies may control FDR for each experiment individually or for the entire study in aggregate. These choices lead to large differences in the number of noncanonical proteins inferred to be detected. When established guidelines for FDR estimation are followed, noncanonical detections are found to be rare. Our results point towards the need for careful consideration of statistical approach in searching for noncanonical proteins in mass spectrometry data.
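The target-decoy logic, and the sensitivity of the reported count to where FDR is controlled, can be sketched with a toy example (all scores and counts below are invented for illustration; this is not the authors' pipeline):

```python
def fdr_at_threshold(hits, threshold):
    """Target-decoy FDR estimate at a score threshold: #decoys / #targets
    among hits scoring at or above it. `hits` is a list of
    (score, is_decoy) pairs from a combined target+decoy search."""
    targets = sum(1 for s, d in hits if s >= threshold and not d)
    decoys = sum(1 for s, d in hits if s >= threshold and d)
    return decoys / max(targets, 1)

def accepted_at_fdr(hits, alpha=0.01):
    """Target hits at the most permissive score threshold whose
    estimated FDR stays at or below alpha."""
    accepted = []
    for s, _ in sorted(hits, reverse=True):
        if fdr_at_threshold(hits, s) <= alpha:
            accepted = [h for h in hits if h[0] >= s and not h[1]]
    return accepted

# Invented scores: many strong canonical hits, a few weak noncanonical
# hits that score in the same range as decoys.
canonical = [(10.0 - 0.01 * i, False) for i in range(500)] + [(2.0, True)] * 5
noncanonical = [(3.0, False), (2.9, False), (2.8, False), (2.7, False),
                (2.6, False), (2.95, True), (2.85, True), (2.75, True),
                (2.65, True)]

# Whole-proteome control: the many canonical hits dilute the decoy rate,
# so all five noncanonical targets pass at 1% FDR.
pooled = accepted_at_fdr(canonical + noncanonical)
# Class-specific control over the noncanonical subset alone accepts one.
class_specific = accepted_at_fdr(noncanonical)
```

In this toy example, moving FDR control from the pooled search to the noncanonical class alone changes the reported noncanonical count from five to one, mirroring the discrepancy the abstract describes.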

16:30-16:45
SWING (Sliding Window INteraction Grammar) - a generalizable interaction-language model for protein interactions across biological contexts
Room: Clapp Hall Auditorium
Format: Live from venue

  • Alisa Omelchenko, University of Pittsburgh, United States
  • Jane Siwek, University of Pittsburgh, United States
  • Prabal Chhibbar, University of Pittsburgh, United States
  • Anna Rosengart, Carnegie Mellon University, United States
  • Kiran Nazarali, Georgia Institute of Technology, United States
  • Iliyan Nazarali, Center for Systems Immunology, United States
  • Javad Rahimikollu, University of Pittsburgh, United States
  • David Koes, University of Pittsburgh, United States
  • Alok Joglekar, University of Pittsburgh, United States
  • Jishnu Das, University of Pittsburgh, United States


Presentation Overview: Show

The explosion of sequence data has allowed the rapid growth of protein LMs (pLMs). pLMs have now been employed in many frameworks including variant-effect prediction (ESM1b/AlphaMissense), and peptide-specificity prediction (BertMHC). Traditionally for protein-protein or peptide-protein interactions (PPIs), corresponding sequences are either co-embedded followed by post-hoc integration, or the sequences are concatenated prior to embedding. Interestingly, no method utilizes a language representation of the interaction itself. We developed an interaction LM (iLM) which uses a novel language to represent interactions between protein/peptide sequences. Sliding Window Interaction Grammar (SWING) leverages differences in amino acid properties to generate an interaction vocabulary. This vocabulary is an input into an LM followed by a supervised prediction step where the LM’s representations are used as features.
SWING was first applied to predicting peptide:MHC (pMHC) interactions. Three pMHC prediction models (Class I, Class II, and Mixed Class) were trained using a large ensemble of human immunopeptidome datasets. Existing approaches have separate Class I and Class II pMHC prediction models, as the structural and functional aspects of Class I and Class II pMHC interactions are completely distinct. SWING, however, was not only successful at generating Class I and Class II models with predictive performance comparable to state-of-the-art approaches; the unique Mixed Class model was also successful at jointly predicting both classes. Further, we generated a SWING model trained only on Class I alleles that was predictive for Class II, a complex prediction task not attempted by any existing approach. Additionally, the human Class I model alone was successful at predicting peptide binding for mouse Class II receptors, the first instance in which a model trained only on Class I MHCs is informative of Class II MHCs across species. This demonstrates that SWING is a highly generalizable few-shot model that learns the language of PPIs.
To further evaluate SWING’s generalizability, we tested its ability to predict the disruption of specific protein-protein interactions by missense mutations. Although modern methods like AlphaMissense and ESM1b can predict interfaces and variant effects/pathogenicity per mutation, they are unable to predict interaction-specific disruptions. SWING was successful at accurately predicting (relative to experimental benchmarks) the impact of both Mendelian mutations and population variants on protein-protein interactions. This is the first generalizable approach that can accurately predict interaction-specific disruptions by missense mutations with only sequence information. Overall, SWING is a first-in-class iLM that learns the language of PPIs.

16:45-17:15
Proceedings Presentation: A unified hypothesis-free feature extraction framework for diverse epigenomic data
Room: Clapp Hall Auditorium
Format: Live from venue

  • Ali Tugrul Balci, University of Pittsburgh, United States
  • Maria Chikina, Mount Sinai School of Medicine, United States


Presentation Overview: Show

Motivation: Epigenetic assays using next-generation sequencing (NGS) have furthered our understanding of functional genomic regions and the mechanisms of gene regulation. However, a single assay produces billions of data points represented as nucleotide-resolution signal tracks. The signal strength at a given nucleotide is subject to numerous sources of technical and biological noise and thus conveys limited information about the underlying biological state. To draw biological conclusions, the data is typically summarized into higher-order patterns. Numerous specialized algorithms for summarizing epigenetic signal have been proposed, including methods for peak calling and for finding differentially methylated regions. A key unifying principle underlying these approaches is that they all leverage the strong prior that signal must be locally consistent.

Results: We propose L0 segmentation as a universal framework for extracting locally coherent signals from diverse epigenetic sources. L0 serves to both compress and smooth the input signal by approximating it as piecewise constant. We implement a highly scalable L0 segmentation with additional loss functions designed for NGS epigenetic data types, including Poisson loss for single tracks and binomial loss for methylation/coverage data. We show that the L0 segmentation approach retains the salient features of the data yet can identify subtle features, such as transcription end sites, missed by other analytic approaches.

Availability: Our approach is implemented as an R package, “l01segmentation”, with a C++ backend. Available at https://github.com/boooooogey/l01segmentation.
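The piecewise-constant idea behind L0 segmentation can be sketched with a minimal O(n²) dynamic program using squared-error loss. This is an illustrative simplification: the package described above uses Poisson/binomial losses and a far more scalable solver.

```python
import numpy as np

def l0_segment(y, lam):
    """Piecewise-constant fit minimizing sum((y - f)^2) + lam * (#jumps).

    Naive O(n^2) dynamic program over changepoints; each segment is
    fitted by its mean. Returns the fitted signal.
    """
    y = np.asarray(y, float)
    n = len(y)
    # Prefix sums let us evaluate the squared error of y[i:j] in O(1).
    c1 = np.concatenate([[0.0], np.cumsum(y)])
    c2 = np.concatenate([[0.0], np.cumsum(y * y)])

    def sse(i, j):  # squared error of segment y[i:j] around its mean
        s, s2, m = c1[j] - c1[i], c2[j] - c2[i], j - i
        return s2 - s * s / m

    cost = np.full(n + 1, np.inf)
    cost[0] = -lam  # so the first segment incurs no jump penalty
    back = np.zeros(n + 1, int)
    for j in range(1, n + 1):
        for i in range(j):
            c = cost[i] + lam + sse(i, j)
            if c < cost[j]:
                cost[j], back[j] = c, i
    # Trace back the optimal changepoints and emit segment means.
    fit, j = np.empty(n), n
    while j > 0:
        i = back[j]
        fit[i:j] = (c1[j] - c1[i]) / (j - i)
        j = i
    return fit
```

A small penalty keeps genuine jumps (e.g. a step from 1 to 5), while a large penalty smooths the whole track to a single constant, which is the compress-and-smooth trade-off the abstract describes.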

17:15-17:30
SLIDE: Significant Latent Factor Interaction Discovery and Exploration across biological domains
Room: Clapp Hall Auditorium
Format: Live from venue

  • Javad Rahimikollu, University of Pittsburgh, United States
  • Hanxi Xiao, University of Pittsburgh, United States
  • Annaelaine Rosengart, University of Pittsburgh, United States
  • Aaron Rosen, University of Pittsburgh, United States
  • Tracy Tabib, University of Pittsburgh, United States
  • Paul Zdinak, University of Pittsburgh, United States
  • Kun He, University of Pittsburgh, United States
  • Xin Bing, University of Toronto, Canada
  • Florentina Bunea, Cornell University, United States
  • Marten Wegkamp, University of Pittsburgh, United States
  • Amanda Poholek, University of Pittsburgh, United States
  • Alok Joglekar, University of Pittsburgh, United States
  • Robert Lafyatis, University of Pittsburgh, United States
  • Jishnu Das, University of Pittsburgh, United States


Presentation Overview: Show

Modern multi-omic technologies can generate deep multi-scale profiles. However, differences in data modalities, multicollinearity of the data, and large numbers of irrelevant features make the analysis and integration of high-dimensional omic datasets challenging. Here, we present Significant Latent factor Interaction Discovery and Exploration (SLIDE), a first-in-class interpretable machine learning technique for identifying significant interacting latent factors underlying outcomes of interest in high-dimensional omic datasets. SLIDE makes no assumptions regarding data-generating mechanisms, comes with theoretical guarantees regarding identifiability of the latent factors and the corresponding inference, and has rigorous FDR control. SLIDE outperforms or performs at least as well as a wide range of state-of-the-art approaches, including other latent factor approaches, in terms of prediction. More importantly, it provides biological inference beyond prediction that other methods do not offer.
Using SLIDE, we first sought to uncover altered cell-type-specific regulatory mechanisms underlying diffuse systemic sclerosis (SSc) pathogenesis. Using scRNA-seq profiles from skin biopsies of SSc subjects, SLIDE was able to accurately predict disease severity, and outperformed or performed as well as several benchmarks including LASSO, principal components regression (PCR), partial least squares regression (PLSR) and PHATE regression. Further, the interacting latent factors uncovered by SLIDE pointed to three distinct mechanisms. The first encompassed altered transcriptomic states in myeloid cells and fibroblasts, a well-elucidated basis of SSc disease severity. The second included an unexplored keratinocyte-centric signature, which we validated using protein staining. Finally, SLIDE uncovered a novel mechanism involving an interaction between the altered transcriptomic states in myeloid cells and fibroblasts and HLA signaling in macrophages. This mechanism has strong support in recent genetic association analyses and demonstrates the power of SLIDE in unveiling novel biological mechanisms.
SLIDE also worked well on a wide range of spatial modalities spanning transcriptomic and proteomic data (10X Visium, SLIDE-seq, MERFISH and CODEX), and was able to accurately identify significant interacting latent factors underlying immune and neuronal cell partitioning by 3D location in a range of contexts. Finally, SLIDE leveraged paired scRNA-seq and TCR-seq data to elucidate latent factors underlying extents of clonal expansion of CD4 T cells in a nonobese diabetic model of T1D. The latent factors uncovered by SLIDE included well-known activation markers, inhibitory receptors and intracellular regulators of receptor signaling, but also homed in on several novel naïve and memory states that standard analyses missed. Overall, SLIDE is a versatile engine for biological discovery from modern multi-omic datasets.

17:30-17:45
Atomic elementary flux modes explain the steady state flow of metabolites in flux networks
Room: Clapp Hall Auditorium
Format: Live from venue

  • Justin G. Chitpin, University of Ottawa, Canada
  • Theodore J. Perkins, University of Ottawa, Canada


Presentation Overview: Show

Elementary flux modes (EFMs) are minimal pathways that characterize the steady-state flow of molecules in a metabolic flux network. EFM theory, first proposed by Schuster and Hilgetag¹, states that any set of steady state fluxes in a metabolic network may be explained as a positive linear combination of EFMs. However, decomposing fluxes onto EFMs is generally not unique, and there are typically exponentially many EFMs, limiting their application to small-scale networks. We recently addressed this flux decomposition problem for the special case of networks consisting exclusively of unimolecular reactions². By imposing a Markovian constraint on EFMs, we modelled steady state fluxes as a cycle-history Markov chain (CHMC) to compute a unique set of EFM weights that reconstructs any network's fluxes. Here, we describe a method generalizing our CHMC analysis to networks containing multispecies reactions. By defining EFMs with respect to individual atoms, we propose the concept of atomic EFMs: biologically plausible, steady state paths tracing the flow of individual atoms through a network. We employ a state-of-the-art atom mapping algorithm³ to model the flow of source metabolite atoms under our atomic CHMC model. We apply our method to enumerate carbon and nitrogen EFMs from genome-scale networks of four different organisms from the BiGG database⁴. We further compute atomic EFM weights for a HepG2 liver cancer cell line dataset constructed from the Human Metabolic Reaction (HMR 2.0) database, with steady state fluxes estimated from protein and metabolite abundances⁵. Our results show that enumerating atomic EFMs is computationally tractable in large-scale networks, which have several orders of magnitude fewer carbon and nitrogen EFMs than standard multispecies-reaction EFMs.
We further show that constructing atomic CHMCs leads to more meaningful pathways that explain how nutrient sources are remodelled within the network. By characterizing these structural properties, we study the mass flow of source metabolite atoms and quantify differential atom metabolism. Our atomic CHMC method is a powerful tool to quantify metabolic remodelling as technologies to generate large-scale metabolic flux networks advance.
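The decomposition at the heart of EFM theory can be written as follows (the notation is assumed for illustration; the abstract does not fix symbols):

```latex
% Steady state and the EFM decomposition (Schuster and Hilgetag, 1994):
% S is the stoichiometric matrix, v the steady-state flux vector,
% e_i the elementary flux modes, w_i their non-negative weights.
\[
  \mathbf{S}\,\mathbf{v} = \mathbf{0},
  \qquad
  \mathbf{v} = \sum_{i} w_i\, \mathbf{e}_i,
  \qquad
  w_i \ge 0 .
\]
```

The weights \(w_i\) are in general not unique; the CHMC construction described above singles out one decomposition, and the atomic variant applies the same idea to per-atom flux networks.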

1. Schuster and Hilgetag. J. Biol. Syst. 2(2), 1994.
2. Chitpin and Perkins. J. Theor. Biol. 575, 2023.
3. Schwaller et al. Sci. Adv. 7(15), 2021.
4. King et al. Nucleic Acids Res. 44(D1), 2016.
5. Nilsson et al. Proc. Natl. Acad. Sci. U. S. A. 117(19), 2020.

17:45-18:00
Graph-based Linear Optimization for Spatial-temporal Population Immunity Calculation
Room: Clapp Hall Auditorium
Format: Live-stream

  • Ferdous Nasri, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, DE, Germany
  • Simon Cyrani, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, DE, Germany
  • Lukas Wenner, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, DE, Germany
  • Bernhard Y. Renard, Windreich Department of AI and Human Health, Icahn School of Medicine at Mount Sinai, NY; Hasso Plattner Institute, DE, USA, Germany


Presentation Overview: Show

Understanding population immunity against diseases is pivotal for predicting their spread and implementing effective pharmaceutical and non-pharmaceutical interventions. However, obtaining accurate representations of population immunity amidst dynamic disease spread and varying administration of different kinds of vaccinations and boosters is a complex challenge. Pathogens can also evolve over time to escape vaccine- or infection-induced neutralization, increasing incidence even in an immunized society and further complicating immunity calculations. Additionally, infections may stem from different variants of the same pathogen, each offering limited immunity against the others. Obtaining clean and comprehensive data for immunity approximation is often hindered by factors such as vaccination records based solely on the location of administration rather than the individual's residence, leading to distortions such as overestimated vaccination rates in areas with vaccination hubs ("vaccination tourism").
To address these challenges, we propose a novel and streamlined method to calculate spatially and temporally refined immunity scores. We integrate vaccination numbers from administration locations, disease incidence reports, and variant information, while accounting for the waning of immunity against reinfection over time that reflects the natural decay of pathogen-neutralizing antibodies, demonstrated on COVID-19 in Germany. We develop a graph-based linear optimization approach to normalize against vaccination tourism at the county level for the six successive vaccinations administered. The spatial vaccination rebalancing is evaluated against ground-truth data that was available at a single point in time. Additionally, we calculate spatial lower and upper bounds for the immunity index and show that the difference between these bounds, and therefore the highest possible error, is minimal, demonstrating the robustness of our assessments. Finally, to facilitate time-series studies and spread predictions, we simulate realistic immunity indices for each county and each day of the pandemic, which explain the connection between events and incidence numbers.
This enables comprehensive analyses of immunity trends and their impact on disease spread dynamics. By keeping the input data minimal and the model easily generalizable, our method enables the prompt assessment of immunity against other diseases in diverse geographical locations. This empowers policymakers and public health officials to make informed decisions and effectively manage disease outbreaks, such as more effective placement of vaccination hubs, emergency bed availability in clinics, or localised temporary lockdown mandates, ultimately reducing their societal and economic impact.

18:15-18:30
Creating An Interdisciplinary and Multicollegiate Curriculum for Bioinformatics at A HBCU
Room: Clapp Hall Auditorium
Format: Live from venue

  • Xianfa Xie, Virginia State University, United States
  • Wei-Bang Chen, Virginia State University, United States
  • Nuha Farhat, Virginia State University, United States


Presentation Overview: Show

Black or African Americans have been underrepresented in the profession of bioinformatics. Students at Historically Black Colleges/Universities (HBCUs) have been at a particular disadvantage in receiving training in this field due to a lack of relevant faculty expertise, a lack of student awareness, and departmental and collegiate boundaries serving as barriers to interdisciplinary education. Funded by the National Science Foundation and through collaborations among faculty members from three departments (biology, mathematics, and computer science) belonging to two different colleges, we have created a unified curriculum that provides training in all three disciplines, including biology (particularly genomics), statistics, and computer science, together with a set of workshops and a bioinformatics-focused capstone research project. While the curriculum provides the same training to students from different undergraduate majors, it is flexible enough to allow students from different majors to take the required courses on schedules that best suit their individual major programs, particularly in the first two years. The curriculum, including core courses, workshops, and the capstone research project, provides up-to-date, practical, comprehensive, and rigorous training in bioinformatics that fills a major gap in education at HBCUs, better prepares HBCU students to become productive professionals in the field, and can serve as a model for bioinformatics education at other HBCUs, primarily undergraduate institutions (PUIs), and even research universities.

18:30-18:45
Invited Presentation: The MetaSchool: a just-in-time professional development series for PhD students
Room: Clapp Hall Auditorium
Format: Live from venue

  • Joseph Ayoob
18:45-19:00
Invited Presentation: TBD
Room: Clapp Hall Auditorium
Format: Live from venue

  • Ben Busby
19:00-19:15
Invited Presentation: International Informatics Masters’ Students’ Challenges Post-Covid
Room: Clapp Hall Auditorium
Format: Live from venue

  • Guenter Tusch


Presentation Overview: Show

Many bioinformatics graduate programs in the U.S. see a growing number of international students enrolling, reflecting a trend towards globalization and internationalization. The aim of this study was to investigate the relationship between student background, technological proficiency, and class engagement, as well as the impact these factors have on the academic achievement of international and domestic students before, during, and after the COVID-19 pandemic. Data were from 419 students enrolled in a two-year graduate program in Health Informatics and Bioinformatics at Grand Valley State University.

The largest group of international students came from Southeast Asia, followed by Africa. Various criteria were analyzed, including assessments of presentations, midterm exams, group projects, discussion participation, lab assignments, and final exams. The impact of the COVID-19 pandemic was significant, particularly evident in midterm exam performance, where most students experienced a decline in grades. Additionally, domestic students consistently outperformed international students across all three phases of the pandemic. While domestic students mostly struggled with exams and the project, international students struggled across almost all assignment types. The results showed that students' grades declined significantly during the pandemic and are recovering, although not to their original levels. For students who take hybrid or online classes, differences before and during the pandemic appear not to be significant. These findings have practical implications for educators and administrators seeking to develop effective interventions that can enhance the academic success of international students in their programs.

19:15-19:30
Invited Presentation: Teaching a Research-Oriented PreCollege Program in Computational Biology
Room: Clapp Hall Auditorium
Format: Live from venue

  • Josh Kangas
  • Phillip Compeau
Wednesday, May 15th
10:30-10:45
Efficient Enumeration of Tree Topology Statistics for Population Phylogenomics
Room: Clapp Hall Auditorium
Format: Live from venue

  • Maureen Stolzer, Carnegie Mellon University, United States
  • Yuting Xiao, Carnegie Mellon University, United States
  • Dannie Durand, Carnegie Mellon University, United States


Presentation Overview: Show

Advances in phylogenetic population modeling, combined with sequencing of large collections of closely related taxa, have enabled unprecedented studies of population processes in evolutionary and ecological contexts, e.g., the discovery of Neanderthal DNA in human genomes and the role of adaptive introgression in Heliconius butterflies. Population processes, including incomplete lineage sorting (ILS) and introgression, can result in gene trees that disagree with the species tree. For example, the history of a gene sampled from three species with phylogeny A|BC may agree with the species tree or have alternate branching order B|AC or C|AB. The resulting distribution of gene tree topologies provides a wealth of information for testing alternate hypotheses, distinguishing between ILS and introgression, and estimating population parameters.

Given a species tree with three leaves, this gene tree distribution is easily obtained by counting the three topologies. This approach does not scale up gracefully because the number of possible topologies grows exponentially with the number of taxa. A common solution is to sub-sample rooted triplets; i.e., sets of three species that probe the ancestral population of interest. The gene-tree distribution for a rooted triplet reflects the combined incongruence from ILS and all episodes of introgression, recent or ancient, between non-sister lineages in the triplet. Sampling multiple triplets can help to tease these episodes apart, but combining the results is labor-intensive and the number of triplets also grows exponentially. A second problem is screening out incongruent branching patterns that result from gene duplication and loss. This is frequently achieved by restricting the analysis to single-copy orthologs in a preprocessing step. However, sequence-based ortholog identification is error-prone, especially in duplication-rich and poorly assembled genomes.
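The triplet tabulation described above can be illustrated with a minimal sketch (the nested-tuple tree encoding and function names are hypothetical, for illustration only; this is not the authors' algorithm):

```python
from collections import Counter

def sister_pair(tree, trio):
    """Return the sister pair of the rooted triplet induced by `trio`
    in a binary gene tree encoded as nested 2-tuples of leaf names,
    e.g. (("A", "B"), "C"). For trio (A, B, C), topology C|AB means
    {A, B} is the sister pair."""
    a, b, c = trio
    subtrees = []
    def walk(t):
        s = {t} if isinstance(t, str) else walk(t[0]) | walk(t[1])
        subtrees.append(s)
        return s
    walk(tree)
    for x, y, z in ((a, b, c), (a, c, b), (b, c, a)):
        # {x, y} is the sister pair if some clade contains both but not z.
        if any({x, y} <= s and z not in s for s in subtrees):
            return frozenset((x, y))
    return None  # trio unresolved or absent from this tree

def topology_distribution(gene_trees, trio):
    """Tabulate the three rooted-triplet topologies across gene trees."""
    return Counter(sister_pair(t, trio) for t in gene_trees)

trees = ([(("A", "B"), "C")] * 7
         + [(("A", "C"), "B")] * 2
         + [(("B", "C"), "A")] * 1)
dist = topology_distribution(trees, ("A", "B", "C"))
```

With a species tree A|BC, the majority topology here matches the species tree and the two minority topologies carry the signal used to test for ILS versus introgression.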

Here, we present a novel algorithm that addresses these challenges. Our algorithm tabulates the distribution of gene tree topologies associated with an ancestral branch while discarding incongruence resulting from introgression between distant descendants. This will support studies focused on quantifying ILS alone, as well as studies of introgression seeking to exclude ILS. It also screens out incongruence due to gene duplication and hence can be applied to both multigene families and single-copy orthologs. Gene tree statistics are obtained for all internal branches in a single traversal of the species tree, supporting clade-level analyses. Our algorithm is polynomial in tree size and hence applicable to very large species trees. We demonstrate our approach through the reanalysis of several phylogenomic datasets from recent studies.

10:45-11:00
Friend or foe: a case study of rapid proto-gene elongation mediated by tandem repeat evolution
Room: Clapp Hall Auditorium
Format: Live from venue

  • Lin Chou, University of Pittsburgh, United States
  • Shu-Ting Cho, University of Pittsburgh, United States
  • Jiwon Lee, University of Pittsburgh, United States
  • Anne-Ruxandra Carvunis, University of Pittsburgh, United States


Presentation Overview: Show

Recent studies have shown that historically non-genic sequences can give birth to new genes through an intermediary state termed a proto-gene, which exhibits properties between those of non-genic sequences and genes, such as shorter length. Proto-genes have been observed to frequently overlap with other genetic elements in genomes; however, how these elements affect proto-gene evolution is not well understood. Here we report a case study of rapid proto-gene elongation through tandem repeat (TR) expansion. We focused on the human proto-gene orf143, which is encoded by the long noncoding RNA (lncRNA) MYU and overlaps the TR array that confers its oncogenicity. The start codon of orf143 was found to have emerged de novo in the ancestor of simians. In New and Old World monkeys, the orthologs do not harbor the TR locus and remain relatively short in extant species compared to the human ortholog. In line with this finding, ancestral sequence reconstruction showed that the ancestor of apes also harbored this short, TR-lacking ORF. However, a sequence of events led to the rapid elongation of orf143 in apes. First, a genomic region downstream of the stop codon rapidly duplicated, giving rise to the TR locus in the ancestor of humans, chimpanzees, gorillas, and bonobos. A subsequent point mutation in the ancestral stop codon led to the incorporation of the TRs into the ORF, marking the start of rapid elongation. Phylogenetic analysis of the TR units indicated that, while the incorporation of the TRs happened in the ancestor of humans, chimpanzees, and bonobos, the TRs continued to expand in each species. Our findings reveal TR expansion as a mechanism for proto-gene elongation and a potential source of new genetic material for proto-genes.

11:00-11:15
Characterizing regulatory conservation at paralogous gene families utilizing multi-mapped reads and the T2T genome assembly
Room: Clapp Hall Auditorium
Format: Live from venue

  • Alexis Morrissey, The Pennsylvania State University, United States
  • Alan Brown, The Pennsylvania State University, United States
  • Shaun Mahony, The Pennsylvania State University, United States


Presentation Overview: Show

A majority of annotated genes in the human genome have at least one paralog and many are part of larger gene families. Gene duplication is a common mechanism by which new genes are created and is thus an important driver of genome complexity and evolution. Previous studies have shown that paralogs have high rates of contact within the three-dimensional genome. This suggests that there may be shared gene regulatory networks among gene family members, including shared programs of transcription factor binding. Shared gene regulatory networks among paralogs may arise due to the duplication of cis-regulatory regions along with the ancestral gene body. Duplication of paralogous regulatory regions will result in high sequence similarity, which can pose mapping ambiguity issues for regulatory genomics assays that rely on short read sequencing such as ChIP-seq and DNase-seq. In this work, we deploy our multi-mapped read rescue method, Allo, to remap regulatory genomics data at paralogous genes in the telomere-to-telomere (T2T) project human genome assembly. We also extend the Allo framework to solve the issue of multi-mapped reads in RNA-seq data, as paralogs themselves share a high sequence similarity. By disambiguating multi-mapped reads and incorporating the new paralogous gene sets identified in the T2T assembly, we can more comprehensively study gene regulatory networks at paralogs. Using a diverse set of matched ENCODE DNase-seq and RNA-seq datasets across many cell types, we identify possible regulatory networks responsible for paralogous gene regulation. Furthermore, utilizing motif identification within DNase-seq peaks as well as ENCODE ChIP-seq data, we identify multiple previously uncharacterized interactions between transcription factors and paralogous gene families.

11:15-11:30
Large tandem duplications in cancer result from transcription and DNA replication collisions
Room: Clapp Hall Auditorium
Format: Live from venue

  • Yang Yang, University of Chicago, United States
  • Lixing Yang, University of Chicago, United States
  • Jonathan Chou, University of California, San Francisco, United States


Presentation Overview: Show

Genome instability is a hallmark of cancer, with somatic structural variations (SVs) being prevalent in human cancers. However, the underlying molecular mechanisms of their formation remain unclear. DNA replication stress is recognized as a major contributor to genome instability. In this study, we analyzed 6,193 whole-genome sequenced tumors to investigate the impact of transcription and DNA replication collisions on genome instability. Through comprehensive analysis of robust simple SV signatures deconvoluted from three cohorts (PCAWG, Hartwig, and POG570), we detected transcription-dependent replicated-strand bias, indicative of transcription-replication collisions (TRCs), specifically in large tandem duplications (TDs). This replicated-strand bias in large TDs was transcription-dependent and could not be attributed to other genomic factors, and the observation was reproducible across several other independent cohorts. Large TDs were abundant in several tumor types, including uterine, breast, ovarian, prostate, esophageal, stomach, and head and neck cancers. They are associated with poor patient survival and are frequently accompanied by mutations in TP53, CDK12, and SPOP. Functional assays conducted in two prostate cancer cell lines (C42B and 22Rv1) and an immortalized normal prostate epithelial cell line (PNT2) revealed that CDK12 inactivation led to a significant increase in TRCs and R-loops. Furthermore, the proportion of large TDs was significantly higher in CDK12-deficient cell lines than in wild-type C42B cell lines after one year of culture. Moreover, a drug sensitivity analysis of 203 cell lines from the Cancer Cell Line Encyclopedia (CCLE) showed that cancer cell lines with abundant large TDs were significantly more sensitive to two DNA repair inhibitors, a WEE1 inhibitor and a PARP inhibitor.
Pharmacological inhibition of G2/M checkpoint proteins, including WEE1, CHK1, and ATR, selectively inhibited the growth of cells deficient in CDK12. Collectively, our findings suggest that TRCs contribute to the formation of large TDs in cancer, and their detection could serve as a potential biomarker for prognosis and therapeutic targeting.

11:30-11:45
Smoothing functions for data-driven evolutionary models of multidomain protein architectures
Room: Clapp Hall Auditorium
Format: Live from venue

  • Xiaoyue Cui, Carnegie Mellon University, United States
  • Akshat Gupta, Carnegie Mellon University, United States
  • Maureen Stolzer, Carnegie Mellon University, United States
  • Gautam Iyer, Carnegie Mellon University, United States
  • Dannie Durand, Carnegie Mellon University, United States


Presentation Overview: Show

Multidomain proteins are encoded by mosaics of sequence fragments called domains. Domains act as protein modules; they are associated with specific molecular functions, are found in otherwise unrelated proteins, and fold into a stable structure independent of the surrounding context.
The architecture of a multidomain protein, that is, the sequence of its constituent domains in N- to C-terminal order, evolves through domain insertions, duplications, and deletions. In principle, these processes could generate any combination of domains. However, the number of domain combinations observed in sequenced genomes is vastly smaller than expected, suggesting that domain architecture evolution is stringently constrained. The exact nature of these constraints is poorly understood.
Comparing domain architectures that are observed in nature with those that are not provides a framework for inferring the "design rules" of multidomain architectures. This approach rests on the assumption that domain architectures that are not observed have features that are disfavored by evolution, providing clues to the unknown constraints. However, some domain architectures that have not been encountered in genomic data may, in fact, be viable. Architectures that are favored by evolution may not be observed due to incomplete sequencing or annotation error. More fundamentally, some viable domain architectures may not be observed because evolution has not yet generated and tested them.
One solution is to construct a generative, data-driven Markov model that emits viable domain architectures, including architectures not yet encountered. However, to obtain realistic sets of positive examples, this model must be smoothed in a manner that corrects for both sampling error and missing values in the training data. Many bioinformatic applications introduce smoothing via a pseudocount, but this limits the amount of prior knowledge that can be captured.
Here, we introduce and evaluate smoothing algorithms that exploit implicit contextual information to approximate domain combination frequencies while addressing data incompleteness. Each of these algorithms reflects a different set of assumptions about the underlying process of multidomain evolution. We compare the effectiveness of these algorithms on synthetic and real data.
We obtain a robust and reliable tool for investigating multidomain protein evolution. More broadly, our results leverage sequence context for more accurate inference and hence are relevant to a wide class of sequence analysis problems, such as gene finding, intron/exon boundary detection, and taxonomic identification in metagenomics.
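The pseudocount baseline that the abstract contrasts its methods against can be sketched in a few lines. This is an illustrative add-alpha-smoothed first-order (bigram) Markov model over domain successions; the domain names and the bigram framing are assumptions for the example, not the authors' model:

```python
from collections import Counter, defaultdict

def bigram_model(architectures, alpha=1.0, vocab=None):
    """Estimate P(next domain | current domain) from observed
    architectures, with add-alpha (pseudocount) smoothing so that
    unseen domain pairs still receive nonzero probability."""
    vocab = sorted(vocab if vocab is not None
                   else {d for arch in architectures for d in arch})
    counts = defaultdict(Counter)
    for arch in architectures:
        for a, b in zip(arch, arch[1:]):
            counts[a][b] += 1
    model = {}
    for a in vocab:
        total = sum(counts[a].values()) + alpha * len(vocab)
        model[a] = {b: (counts[a][b] + alpha) / total for b in vocab}
    return model

# Hypothetical domain architectures (domains listed N- to C-terminal)
archs = [["PH", "SH2", "Kinase"], ["SH3", "SH2", "Kinase"]]
m = bigram_model(archs, alpha=1.0)
```

Note how a single scalar alpha spreads probability mass uniformly over all unseen pairs; the context-aware smoothing functions described in the talk aim to replace this uniform prior with one informed by the evolutionary process.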

11:45-12:00
Sensitive, specific association of microbial functions with host phenotypes using Phylogenize2
Room: Clapp Hall Auditorium
Format: Live from venue

  • Kathryn Kananen, Dept. of Microbiology, The Ohio State University, United States
  • Stephanie Majernik, Dept. of Microbiology, The Ohio State University, United States
  • Patrick Bradley, Dept. of Microbiology, The Ohio State University, United States


Presentation Overview: Show

In metagenomics, a key challenge is to explain the differences in microbial populations that we observe in terms of gene function. Many common approaches to this problem do not account for the fact that related species tend to share both genes and phenotypes. This makes them susceptible to discovering clade markers for the microbes that are changing most in abundance, and these markers tend to have weak evidence for functional relevance. However, existing tools to account for phylogeny are cumbersome and require an accurate and complete genome database for the environment being studied; this can be especially problematic outside of highly-sequenced biomes like the human gut. We have developed a substantially revised tool, Phylogenize2, which is easier for end-users to install and leverages large, modern collections of isolate and metagenome-assembled genomes. We have also substantially improved Phylogenize2's statistical power by combining microbiome-specific differential abundance methods with adaptive shrinkage. As a test case, we apply Phylogenize2 to a human cohort with liver cirrhosis, and discover a link between abundances of the Lachnospiraceae, a prevalent group of commensal Clostridia, and the anaerobic oxidative stress response. In summary, Phylogenize2 is a publicly available, open source tool that can help extract specific functional information from a wide variety of environmental and host-associated microbiomes.

14:30-14:45
Exploring the diversity of Staphylococcus epidermidis in the female microbiome
Room: Clapp Hall Auditorium
Format: Live from venue

  • Sandra Jablonska, Loyola University Chicago, United States
  • Grace Finger, Loyola University Chicago, United States
  • Alex Kula, Loyola University Chicago, United States
  • Catherine Putonti, Loyola University Chicago, United States


Presentation Overview: Show

Staphylococcus epidermidis is a prominent member of the human microbiota. While it predominantly colonizes the skin, it can also be found in the urinary, gastrointestinal, and respiratory tracts and the oral cavity. Comparative studies of S. epidermidis strains isolated from different anatomical niches on the same individual have primarily focused either on strains from different skin surfaces, or on strains from skin surfaces together with an oral swab and a nasal swab; other anatomical sites have yet to be considered. These observations led to our study, which focuses on isolating and comparing S. epidermidis strains inhabiting the same individual’s urinary tract, nasal cavity, oral cavity, and skin. In an IRB-approved study, samples were collected from healthy females aged 18-25, and S. epidermidis was isolated using selective media. Strains were purified and then sequenced. In total, over 90 S. epidermidis strains were sequenced from the 50 participants sampled. Pan-genome analysis was performed using Anvi’o to generate a single-copy core genome tree. Furthermore, functional enrichment analysis was performed to identify functional associations of gene clusters and quantify the distribution of genes that appear more frequently in strains from a particular niche. Understanding S. epidermidis abundance and interactions across different anatomical sites provides insight into the relationship between the human host and its microbiota.

14:45-15:00
A mechanistic model of Eastern Equine Encephalitis Virus replication reveals an efficient, yet tightly-constrained strategy of genome replication for infectious particle production
Room: Clapp Hall Auditorium
Format: Live from venue

  • Caroline I. Larkin, University of Pittsburgh, United States
  • Jason E. Shoemaker, University of Pittsburgh, United States
  • William B. Klimstra, University of Pittsburgh, United States
  • James R. Faeder, University of Pittsburgh, United States


Presentation Overview: Show

Eastern Equine Encephalitis Virus (EEEV) is an arthropod-borne, single-stranded positive-sense RNA Alphavirus that poses a significant threat to public health and national security. Unlike similar viruses such as SARS-CoV-2 or Hepatitis C virus, EEEV can invade the nervous system, causing severe encephalitis with a human mortality rate of ~30-70%. Moreover, there are no preventative or standardized therapies, leaving patients to rely solely on supportive care. In addition, studies have shown that EEEV is easily aerosolized, making it an ideal potential biowarfare agent. Although the molecular components and interactions of infection, replication, and amplification of EEEV within the host cell are well-studied, how these mechanisms integrate to determine the dynamics of RNA viral replication and host immune responses remains unclear, limiting our ability to advance therapeutic development.

Computational models provide a powerful tool for probing both quantitative and qualitative effects arising from the modulation of viral infections. Here, we present a novel mechanistic model describing the intracellular dynamics of the EEEV viral lifecycle. The model describes attachment, entry, uncoating, replication, assembly, and export of both infectious virions and virus-like particles within mammalian cells. The model incorporates the assembly mechanism defining the structural difference between infectious and non-infectious progeny viral particles: the presence of a functional viral genome. Additionally, it quantitatively characterizes host ribosome activity in EEEV replication via a model parameter that describes ribosome density on viral RNA.

To calibrate model parameters, we generated experimental data over the course of EEEV infection characterizing the strand-specific dynamics of viral RNA, protein, and infectious particle production. The model recapitulates the measured timing and amplitude of virion production, as well as the corresponding viral RNA and protein dynamics. Furthermore, the model accurately infers relative promoter affinities of the subgenomic and genomic promoters on the negative-sense RNA template. The model also predicts that EEEV is remarkably efficient in producing infectious viral particles with an average ratio of 15 non-infectious particles per infectious particle over 24 hours post-infection, which agrees with observed ratios in other Alphaviruses. Sensitivity analysis of the model identifies genome replication as the critical step for infectious particle production, making it an ideal target for future therapeutic development. In the future, we plan to integrate the innate immune response of the host cell into the model, which will enable the identification of therapeutic strategies for boosting immune system activation to maximize suppression of viral infection.

15:00-15:15
Metagenome profiling and containment estimation through abundance-corrected k-mer sketching with sylph
Room: Clapp Hall Auditorium
Format: Live from venue

  • Jim Shaw, University of Toronto, Canada
  • Yun William Yu, Carnegie Mellon University, United States


Presentation Overview: Show

Profiling metagenomes against databases allows for the detection and quantification of microbes, even at low abundances where assembly is not possible. We introduce sylph (https://github.com/bluenote-1577/sylph), a metagenome profiler that estimates genome-to-metagenome containment average nucleotide identity (ANI) through zero-inflated Poisson k-mer statistics, enabling ANI-based taxa detection. Sylph is the most accurate method on the CAMI2 marine dataset, and compared to Kraken2 for multi-sample profiling, sylph takes 10× less CPU time and uses 30× less memory. Sylph’s ANI estimates provide an orthogonal signal to abundance, allowing for an ANI-based metagenome-wide association study for Parkinson’s disease against 289,323 genomes, confirming known butyrate-PD associations at the strain level. Sylph takes < 1 minute and 16 GB of RAM to profile against 85,205 prokaryotic and 2,917,521 viral genomes, detecting 30× more viral sequences in the human gut compared to RefSeq. Sylph offers precise, efficient profiling with accurate ANI estimation for even low-coverage genomes.
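The containment-to-ANI relation underlying this style of profiling can be sketched as follows. This naive version uses the standard k-mer containment identity (containment ≈ ANI^k) and omits sylph's zero-inflated Poisson coverage correction; the k-mer sets are assumed to be precomputed by sketching:

```python
def containment_ani(genome_kmers, metagenome_kmers, k=31):
    """Estimate genome-to-metagenome ANI from k-mer containment.
    Uses the standard relation containment ~ ANI^k; sylph's
    zero-inflated Poisson correction for low-coverage genomes
    is deliberately omitted in this sketch."""
    c = len(genome_kmers & metagenome_kmers) / len(genome_kmers)
    return c ** (1.0 / k) if c > 0 else 0.0
```

Without the coverage correction, low-abundance genomes lose k-mers to sampling and this estimator is biased downward — which is precisely the failure mode sylph's statistical model addresses.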

15:15-15:30
Transfer learning for cross-species bacterial sgRNA/Cas9 activity prediction
Room: Clapp Hall Auditorium
Format: Live from venue

  • Tyler Browne, Western University (Ontario), Canada


Presentation Overview: Show

The CRISPR-Cas9 bacterial adaptive immune system is a genetic engineering tool with potential as a next-generation antimicrobial agent. The Cas9 nuclease is targeted to a genomic site by a single guide RNA molecule (sgRNA), where the complex introduces a double-strand DNA break. To ensure the efficacy of a CRISPR system-based antimicrobial, we need confidence that the system will be precise in where it cleaves. A still-unsolved problem is the impact of nucleotide sequence variation on Cas9 complex on-target cleavage activity in bacteria. Through the proper application of machine learning and accurate data, sequence features impacting sgRNA activity can be extracted, allowing for accurate predictions of bacterial sgRNA-associated Cas9-complex activity. Using a deep learning approach, we developed the bacterial sgRNA/Cas9 activity prediction model crisprHAL. This new model is designed to enhance predictive performance through transfer learning with smaller amounts of high-quality data. Our dual-branch architecture, composed of convolutional and recurrent neural network layers, optimizes the transfer of information from the initial model — trained on a larger prior sgRNA activity dataset — to smaller datasets used for fine-tuning. Variants of crisprHAL were constructed for the enzymes TevSpCas9 and SpCas9. Both variants of the model exceeded prior models on our data, as measured by 5-fold cross-validation. To test model generalization, we used four datasets: one prior, and three of our own. A notable result is the successful performance of crisprHAL on a killing efficiency-based Cas9-sgRNA activity dataset we generated in Citrobacter rodentium; until this point, all bacterial datasets had been generated in Escherichia coli, preventing tests of cross-species model performance.
Based upon our results, and unlike in eukaryotes, bacterial sgRNA/Cas9 activity predictions appear to generalize across bacterial species. We present crisprHAL, an sgRNA/Cas9 activity prediction model that recapitulates known sgRNA/Cas9-target DNA interactions and provides a pathway to a generalizable bacterial sgRNA activity prediction tool.

16:00-16:30
Proceedings Presentation: Investigating Alignment-Free Machine Learning Methods for HIV-1 Subtype Classification
Room: Clapp Hall Auditorium
Format: Live from venue

  • Kaitlyn Wade, Western University, Canada
  • Lianghong Chen, Western University, Canada
  • Alana Deng, Western University, Canada
  • Gen Zhou, Western University, Canada
  • Pingzhao Hu, Western University, Canada


Presentation Overview: Show

Many viruses are organized into taxonomies of subtypes based on their genetic similarities. For human immunodeficiency virus 1 (HIV-1), subtype classification plays a crucial role in infection management. Sequence alignment-based methods for subtype classification are impractical for large datasets because they are costly and time-consuming. Alignment-free methods involve creating numerical representations for genetic sequences and applying statistical or machine learning methods. Despite their high overall accuracy, existing models perform poorly on less common subtypes. Furthermore, there is limited work investigating the impact of sequence vectorization methods, in particular natural language-inspired embedding methods, on HIV-1 subtype classification. Our study presents a comprehensive analysis of sequence vectorization methods across a variety of machine learning methods. We report a k-mer-based XGBoost model with a balanced accuracy of 0.84, indicating good overall performance for both common and uncommon HIV-1 subtypes. We also report a Word2Vec-based logistic regression model that achieves promising precision and recall. Our study sheds light on the effect of sequence vectorization methods on HIV-1 subtype classification and suggests that natural language-inspired encoding methods show much promise. Our results could help develop improved HIV-1 subtype classification methods, leading to improved individual patient outcomes and the development of subtype-specific treatments.
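A k-mer frequency vectorization of the kind a k-mer-based XGBoost model would consume can be sketched as follows; the choice of k = 3 and the frequency normalization are illustrative assumptions, not the study's exact settings:

```python
from itertools import product

def kmer_vector(seq, k=3):
    """Represent a nucleotide sequence as a normalized vector of
    k-mer frequencies (4^k dimensions, lexicographic order)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    v = [0.0] * len(kmers)
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in index:          # skip windows with ambiguous bases
            v[index[km]] += 1.0
    total = sum(v)
    return [x / total for x in v] if total else v
```

Each sequence becomes a fixed-length numeric vector regardless of its length, which is what makes alignment-free classifiers practical at scale.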

16:30-16:45
A novel bioinformatic approach for strain specific virus detection using NGS
Room: Clapp Hall Auditorium
Format: Live from venue

  • Alexandra Bridgeland, MilliporeSigma, United States
  • Bradley Hasson, MilliporeSigma, United States
  • Yanfei Zhou, MilliporeSigma, United States
  • Rebecca Bova, MilliporeSigma, United States
  • Hiral Desai, MilliporeSigma, United States
  • William Dolan, MilliporeSigma, United States


Presentation Overview: Show

Viral detection through next-generation sequencing (NGS) is a powerful method for identifying the presence of specific viruses in biologically derived samples. However, obtaining reliable reference sequences for downstream analysis can be challenging, as they are not always available or accurate. This abstract presents a novel bioinformatics method to address this issue through the creation of reference sequences and the generation of a comprehensive database for strain-specific virus detection. Although the example primarily focuses on Avian Leukosis Virus (ALV) and the differentiation of specific strains, the methods described are applicable to a wide range of viral research applications, including human health and agricultural concerns.

Illumina short-read sequence data (2x300 configuration) from each strain of ALV procured from ATCC was used to run a robust in-house variant calling algorithm, which leverages an alignment-based pileup technique to conduct variant calling and generate consensus sequences. Preliminary reference sequences for each strain were selected from NCBI GenBank or RefSeq, as applicable, based on completeness of the sequence and similarity to the sequenced data. The consensus sequences generated from the initial variant calling analysis became the references for realigning the pre-processed reads. The aligned reads were then extracted and used in a de novo assembly using metaSPAdes. The resulting contigs were clustered at 99% identity, manually examined, and annotated to reduce redundancy and validate the assembly. The contigs were further analyzed using the Basic Local Alignment Search Tool (BLAST) to generate a list of the most similar NCBI accessions for downstream analysis. The BLAST results were parsed and clustered to identify the most representative sequences. Using the same pre-processed sequencing data, the variant calling algorithm was rerun on these selected genomes. The resulting consensus sequences were manually curated based on the results obtained from the algorithm. The algorithm was subsequently run several times with different parameters to refine the final consensus sequences and increase confidence in the results.



The culmination of this analysis resulted in a 37-sequence database comprising 20 NCBI accessions, 6 pipeline-generated consensus sequences, and 11 contigs. This comprehensive database will be instrumental in detecting the virus of interest in host samples, allowing for differentiation of the six ALV strains. This method successfully mitigates the challenge of obtaining accurate and representative reference sequences. The resulting database will enhance future work on the virus of interest and facilitate more accurate identification and viral serotype classification in host organisms.

16:45-17:00
HaloClass: State-of-the-art salt-tolerant protein classification with protein language models
Room: Clapp Hall Auditorium
Format: Live from venue

  • Kush Narang, College of Biological Sciences, University of California, Davis, United States
  • Abhigyan Nath, Pt. J.N.M Medical College, India
  • William Hemstrom, Department of Biological Sciences, Purdue University, United States
  • Simon Chu, Biophysics Graduate Program, University of California, Davis, United States


Presentation Overview: Show

Protein folding and function are known to be sensitive to environmental factors, including solvent salinity. It is well understood that most proteins cannot function in extreme salinity, far from their native solvent conditions. Accurately predicting a protein’s salt tolerance in silico is important for microbiologists seeking to understand adaptations in salt-tolerant organisms. At the same time, salt-tolerant enzymes have applications in industrial processes, including the treatment of saline water and the production of biofuels, making their identification and design an important objective.

Here, we present HaloClass, an algorithm that utilizes features from ESM-2, a protein language model (pLM), to accurately classify whether a protein is salt-tolerant. HaloClass establishes a new state of the art on multiple benchmark datasets, significantly outperforming existing approaches in classification performance and generalization. On 8 pairs of homologous salt-tolerant and non-salt-tolerant proteins independent of the training data, HaloClass misclassifies only 1 pair. Closer structural analysis suggests that surface-exposed charged residues may underlie HaloClass’s predictions. Next, we leverage dimensionality reduction techniques to illustrate how protein representations from language models surpass manually engineered features in halophilicity classification.
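The role of charged residues can be probed with a simple sequence-composition baseline. This is not HaloClass's pLM featurization — just the classical compositional signal that halophilic proteins tend to be enriched in acidic residues:

```python
def acidic_bias(seq):
    """Per-residue excess of acidic (D, E) over basic (K, R) residues.
    Halophilic proteins are typically enriched in acidic surface
    residues relative to mesophilic homologs, so a higher score is a
    crude indicator of salt tolerance."""
    acidic = seq.count("D") + seq.count("E")
    basic = seq.count("K") + seq.count("R")
    return (acidic - basic) / len(seq)
```

The contrast between such hand-built features and learned pLM embeddings is exactly what the dimensionality-reduction comparison in the abstract examines.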

Finally, we simulated a guided mutation study using previously published experimental data. The experimental study reported changes in the salt tolerance of nine multiple-point mutants with 2 to 8 mutated sites; HaloClass accurately predicted all changes in salt tolerance in congruence with the experimental data, and it is the first protein classifier shown to evaluate mutants accurately on near-identical AlphaFold structures (RMSD < 0.6 Å).

These data suggest that HaloClass can facilitate the discovery and guided design of salt-tolerant enzymes for industrial and microbiological applications. All code for HaloClass will be publicly available on GitHub and Google Colab, as linked in our preprint.

17:00-17:15
IRnet: Immunotherapy response prediction using pathway knowledge-informed graph neural network
Room: Clapp Hall Auditorium
Format: Live from venue

  • Yuexu Jiang, University of Missouri, Columbia, United States
  • Manish Immadi, University of Missouri, Columbia, United States
  • Duolin Wang, University of Missouri, Columbia, United States
  • Shuai Zeng, University of Missouri, Columbia, United States
  • Yen On Chan, University of Missouri, Columbia, United States
  • Jing Zhou, University of Missouri, Columbia, United States
  • Dong Xu, University of Missouri, Columbia, United States
  • Trupti Joshi, University of Missouri, Columbia, United States


Presentation Overview: Show

Introduction: Immunotherapies, specifically immune checkpoint inhibitors (ICIs), are powerful and precise therapies for many cancer types and have improved the survival of patients who respond positively to them. However, only a minority of patients respond to ICI treatments.

Objectives: Determining ICI responders before treatment would dramatically save medical resources, avoid potential drug side effects, and save valuable time by exploring alternative therapies. Here, we aim to present a novel deep-learning method that can predict ICI treatment response in cancer patients.

Methods: The proposed deep-learning framework leverages graph neural networks and biological pathway knowledge. We trained and tested our method using data from ICI-treated patients in several clinical trials covering melanoma, gastric cancer, and bladder cancer.

Results: The results indicate that the prediction performance is superior to other currently available state-of-the-art methods or tumor microenvironment-based predictions. Moreover, the model quantifies the importance of pathways, pathway interactions, and genes involved in the prediction.

Conclusion: IRnet is a competitive method and tool for predicting patients’ response to immunotherapy, specifically immune checkpoint inhibitors (ICIs). The interpretability of IRnet provides insights into the mechanisms involved in the different ICI treatments. It is publicly available at https://irnet.missouri.edu

17:15-17:30
A Quantitative Analysis of Mitochondrial Morphology
Room: Clapp Hall Auditorium
Format: Live from venue

  • Huangqingbo Sun, Carnegie Mellon University, United States
  • Yuxin Lu, Carnegie Mellon University, United States
  • Ju-chun Huang, Carnegie Mellon University, United States
  • Robert F. Murphy, Carnegie Mellon University, United States


Presentation Overview: Show

Mitochondria play an essential role in various cellular processes, with dynamic morphology and subcellular distributions. Notably, the release of proteins such as cytochrome c from mitochondria can trigger apoptosis, as demonstrated by the work of Gomes et al. [1]. Contrary to the textbook depiction of mitochondria as bean-shaped entities with a folded inner membrane, electron microscopy work by Karbowski and Youle [2] reveals significant morphological variability across cell lines and within individual cells. Therefore, a systematic way to quantify mitochondrial morphological changes is crucial to understanding mitochondrial structure and its role in metabolism.
The spherical harmonic (SPHARM) transform maps the curvature map of an object shape to a sphere, representing the shape as a mesh of equal area elements converted from a voxel image in an orthogonal space. Ruan et al. [3, 4] applied this transform for statistical shape analysis via SPHARM-RPDM, as integrated into the CellOrganizer (http://www.cellorganizer.org/) MATLAB package.
In this work, we analyzed mitochondrial shape using volume electron microscopy datasets from OpenOrganelle (https://openorganelle.janelia.org/). Each mitochondrion was segmented from three-dimensional (3D) microscopy images as a distinct entity. Subsequently, we aligned each mitochondrial 3D shape using the first-order ellipse (FOE) method, as outlined in the work of Styner et al. [5]. To construct a mitochondrial shape space, we parameterized each shape using SPHARM-RPDM, yielding a complex vector of 3072 SPHARM coefficients. By concatenating the real and imaginary parts of the SPHARM coefficients associated with one mitochondrial shape, we obtained the shape descriptor of each individual mitochondrion. We performed PCA to obtain the shape modes (i.e., PCs) in a 3D space and revealed the shape variation of the mitochondria. The shape modes presented are the result of PCA on integrated electron microscopy datasets from various cell lines, including interphase HeLa cells, macrophages, immortalized T cells, and A431 cells, highlighting the extensive morphological diversity of mitochondria.
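The descriptor construction and PCA step described above can be sketched in pure Python — a minimal stand-in for the SPHARM-RPDM pipeline in CellOrganizer, using power iteration to extract only the leading shape mode; the toy coefficient vectors are illustrative:

```python
def shape_descriptor(coeffs):
    """Concatenate real and imaginary parts of one shape's complex
    SPHARM coefficients into a real-valued descriptor vector."""
    return [c.real for c in coeffs] + [c.imag for c in coeffs]

def first_shape_mode(descriptors, iters=200):
    """Leading PCA direction (first shape mode) of mean-centered
    descriptors, computed by power iteration on the covariance."""
    n, d = len(descriptors), len(descriptors[0])
    mean = [sum(col) / n for col in zip(*descriptors)]
    Xc = [[x - m for x, m in zip(row, mean)] for row in descriptors]
    v = [1.0] * d
    for _ in range(iters):
        # w = (Xc^T Xc) v, computed as Xc^T (Xc v)
        proj = [sum(r[j] * v[j] for j in range(d)) for r in Xc]
        w = [sum(Xc[i][j] * proj[i] for i in range(n)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v
```

Projecting each descriptor onto the top few such modes gives the low-dimensional shape space in which morphological variation is visualized.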


References
[1] Ligia C Gomes and Luca Scorrano. Biochimica et Biophysica Acta (BBA) - Molecular Cell Research, 1833(1):205–212, 2013.
[2] M Karbowski and R J Youle. Cell Death & Differentiation, 10(8):870–880, 2003.
[3] Timothy D Majarian, Ivan Cao-Berg, Xiongtao Ruan, and Robert F Murphy. Modeling Biomolecular Site Dynamics: Methods and Protocols, pages 251–264, 2019.
[4] Xiongtao Ruan and Robert F Murphy. Bioinformatics, 35(14):2475–2485, 2019.
[5] Martin Styner et al. The Insight Journal, (1071):242, 2006.

17:30-18:00
Proceedings Presentation: SlowMoMan: A web app for discovery of important features along user-drawn trajectories in 2D embeddings
Room: Clapp Hall Auditorium
Format: Live from venue

  • Kiran Deol, University of Alberta, Canada
  • Griffin Weber, Harvard University, United States
  • Yun William Yu, Carnegie Mellon University, United States


Presentation Overview: Show

Nonlinear low-dimensional embeddings allow humans to visualize high-dimensional data, as is often seen in bioinformatics, where data sets may have tens of thousands of dimensions. However, relating the axes of a nonlinear embedding to the original dimensions is a nontrivial problem. In particular, humans may identify patterns or interesting subsections in the embedding, but cannot easily identify what those patterns correspond to in the original data. Thus, we present SlowMoMan (SLOW Motions on MANifolds), a web application which allows the user to draw a 1-dimensional path onto a 2-dimensional embedding. Then, by back-projecting the manifold to the original, high-dimensional space, we sort the original features such that those most discriminative along the manifold are ranked highly. We show a number of pertinent use cases for our tool, including trajectory inference, spatial transcriptomics, and automatic cell classification.
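The feature-ranking idea can be caricatured in a few lines: given the high-dimensional feature vectors of points ordered along the user-drawn path, rank features by how much they change along it. SlowMoMan's actual back-projection ranking is more sophisticated; the gene names here are placeholders:

```python
def rank_features_along_path(path_features, feature_names):
    """Rank features by total absolute change along an ordered path
    of points — a crude proxy for back-projection-based ranking of
    which original dimensions vary most along the drawn trajectory."""
    d = len(feature_names)
    scores = []
    for j in range(d):
        change = sum(abs(path_features[i + 1][j] - path_features[i][j])
                     for i in range(len(path_features) - 1))
        scores.append((change, feature_names[j]))
    return [name for change, name in sorted(scores, reverse=True)]
```

A feature that is constant along the path scores zero, while one that varies monotonically with the trajectory ranks highly, matching the intuition of "most discriminative along the manifold."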

Software availability: https://yunwilliamyu.github.io/SlowMoMan/
Code availability: https://github.com/yunwilliamyu/SlowMoMan

Thursday, May 16th
10:30-11:00
Proceedings Presentation: Differential Equation Modeling of Cell Population Dynamics in Skeletal Muscle Regeneration from Single-Cell Transcriptomic Data
Room: Clapp Hall Auditorium
Format: Live from venue

  • Renad Al-Ghazawi, University of Ottawa, Canada
  • Xiaojian Shao, National Research Council of Canada, Canada
  • Theodore Perkins, Ottawa Hospital Research Institute, Canada


Presentation Overview: Show

Satellite cell-mediated skeletal muscle regeneration is a complex process orchestrated by diverse cell populations within a dynamic niche. To better understand how crosstalk between cell types governs cell fate decisions and controls cell population dynamics, we developed a novel non-linear ordinary differential equation (ODE) model guided by single-cell RNA sequencing (scRNA-seq) data. The emergence of scRNA-seq studies of muscle regeneration offers a significant opportunity to refine models of regeneration and enhance our understanding of cellular interactions. Our model consists of 10 variables and 22 parameters, capturing the dynamics of key myogenic lineage and immune cell types, which undergo cell fate and migration decisions such as quiescence, activation, proliferation, differentiation, infiltration, apoptosis, and exfiltration in response to muscle damage and intercellular signaling. We calibrated time-series scRNA-seq data to units of cells per cubic millimeter of tissue, and then fit our model's parameters to capture the observed dynamics, validating on an independent time series. We find that the model is able to capture the dynamics of each cell type, especially after incorporating several regulatory cell-cell interactions that, although previously hypothesized in the literature, have been underexplored. In addition, we performed sensitivity analysis to assess the influence of individual parameters in governing the regeneration trajectory. Our model lays a foundation for future computational explorations of muscle regeneration and therapeutic strategies.
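The modeling style can be illustrated with a toy two-population ODE integrated by forward Euler — not the authors' 10-variable, 22-parameter model; the populations, rates, and initial conditions below are invented for the sketch:

```python
def simulate(t_end=20.0, dt=0.01):
    """Forward-Euler integration of a toy two-population model:
    quiescent satellite cells Q activate at rate a after damage;
    activated cells A proliferate logistically (rate p, capacity K)
    and return to quiescence at rate r. All values illustrative."""
    a, p, K, r = 0.5, 1.0, 1000.0, 0.1
    Q, A = 500.0, 0.0   # cells per mm^3 of tissue
    steps = int(t_end / dt)
    for _ in range(steps):
        dQ = -a * Q + r * A
        dA = a * Q + p * A * (1.0 - A / K) - r * A
        Q += dt * dQ
        A += dt * dA
    return Q, A
```

Fitting such a system to time-course cell counts (here, scRNA-seq abundances calibrated to cells per mm^3) amounts to choosing the rate parameters that minimize the mismatch between simulated and observed trajectories.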

11:00-11:15
Advanced Probabilistic Models and Methods for Comparative Analysis of scRNA-seq
Room: Clapp Hall Auditorium
Format: Live from venue

  • Alicia Petrany, Rowan University, United States
  • Ruoyu Chen, Moorestown High School, United States
  • Shaoqiang Zhang, Tianjin Normal University, China
  • Yong Chen, Rowan University, United States


Presentation Overview: Show

High-throughput experiments utilizing next-generation sequencing (NGS) technology have been instrumental in investigating biological questions at bulk and single-cell levels. Comparative analysis of two NGS datasets often relies on testing the statistical significance of the difference of two Negative Binomial distributions (DOTNB). However, although NB distributions are well studied, theoretical results for DOTNB remain largely unexplored. Here, we derived basic analytic results for DOTNB and examined its asymptotic properties. As a state-of-the-art application of DOTNB, we introduce DEGage, a computational method for detecting differentially expressed genes (DEGs) in scRNA-seq data. Extensive validation using simulated and real scRNA-seq datasets demonstrates that DEGage outperforms five popular DEG analysis tools: DESeq2, DEsingle, edgeR, Monocle3, and scDD. DEGage proves robust against high dropout levels and exhibits superior sensitivity when applied to rare cell types with small cell numbers. These theoretical results can improve the performance and reproducibility of comparative analyses for dispersed count data in NGS and beyond.
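As a generic nonparametric point of comparison for this kind of two-sample count test (the DOTNB test in the abstract is analytic, not resampling-based), a permutation test on two count samples can be sketched as:

```python
import random

def perm_test(x, y, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in mean counts
    between two samples. Returns a p-value with the standard +1
    correction so it is never exactly zero."""
    rng = random.Random(seed)
    obs = abs(sum(x) / len(x) - sum(y) / len(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        px, py = pooled[:len(x)], pooled[len(x):]
        if abs(sum(px) / len(px) - sum(py) / len(py)) >= obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

An analytic test like DOTNB avoids the resampling cost entirely and can exploit the dispersion structure of NB counts, which is what makes closed-form results for the difference distribution valuable.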

11:15-11:30
Probing the Significance of Heterophily on Cell Type Prediction from scRNA-Seq Data by Leveraging Graph Neural Network Models
Room: Clapp Hall Auditorium
Format: Live from venue

  • Mahshad Hashemi, University Of Windsor, Canada
  • Lian Duan, University of Windsor, Canada
  • Luis Rueda, University of Windsor, Canada


Presentation Overview: Show

Graph Neural Networks (GNNs) have emerged as powerful tools for analyzing structured data, particularly in domains where relationships and interactions between entities are key. By leveraging the inherent graph structure in datasets, GNNs excel at capturing complex dependencies and patterns that traditional neural networks might miss. This advantage is especially pronounced in computational biology, where intricate connections between biological entities play a crucial role. In this context, we investigate the application of GNNs to single-cell RNA sequencing data, a domain rich with complex and heterogeneous relationships. While standard GNN models such as Graph Convolutional Networks (GCN), GraphSAGE, Graph Attention Networks (GAT), and MixHop often assume homophily (similar nodes are more likely to be connected), this assumption does not always hold in biological networks. To address this, we explore advanced GNN methods, such as H2GCN and Gated Bi-Kernel GNNs (GBK-GNN), that are specifically designed to handle heterophilic data. Our study spans six diverse datasets, enabling a thorough comparison between heterophily-aware GNNs and traditional homophily-assuming models, as well as an MLP baseline that disregards graph structure entirely. The findings of this research not only underscore the significance of considering data-specific characteristics in GNN applications but also illuminate the potential of heterophily-focused methods in deciphering the complex patterns within single-cell RNA data, paving the way for more accurate and insightful analyses in computational biology.
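The homophily assumption the abstract probes is commonly quantified with the edge homophily ratio: the fraction of edges that join nodes sharing a label. A minimal sketch, with hypothetical cell-type labels on a toy cell-cell graph:

```python
import numpy as np

# Edge homophily ratio: fraction of edges whose endpoints share a label.
def edge_homophily(edges, labels):
    edges = np.asarray(edges)            # shape (num_edges, 2)
    labels = np.asarray(labels)
    same = labels[edges[:, 0]] == labels[edges[:, 1]]
    return same.mean()

# Toy graph: labels are hypothetical cell-type assignments.
labels = [0, 0, 1, 1]
homophilic = [(0, 1), (2, 3)]            # edges within a cell type
heterophilic = [(0, 2), (1, 3)]          # edges across cell types
h_high = edge_homophily(homophilic, labels)    # 1.0
h_low = edge_homophily(heterophilic, labels)   # 0.0
```

A ratio near 1 favors homophily-assuming models like GCN; values well below 0.5 are the regime where methods like H2GCN and GBK-GNN are designed to help.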

11:30-11:45
Single cell transcriptomics-level Cytokine Activity Prediction and Estimation (SCAPE) for human and mouse models.
Room: Clapp Hall Auditorium
Format: Live from venue

  • Azka Javaid, Dartmouth College, United States
  • Hildreth Frost, Dartmouth College, United States


Presentation Overview: Show

Cytokine interaction activity modeling is a pressing problem since uncontrolled cytokine influx is implicated in a variety of medical conditions, including COVID-19 and cancer. Accurate knowledge of cytokine activity levels can be leveraged to provide tailored treatment recommendations. Here, we describe a novel method named Single cell transcriptomics-level Cytokine Activity Prediction and Estimation (SCAPE) that can predict cell-level cytokine activity from single cell RNA-sequencing (scRNA-seq) data. We describe two versions of the SCAPE method that support the analysis of either human or mouse scRNA-seq data. The human version of SCAPE generates activity estimates using cytokine-specific gene sets constructed using information from the CytoSig and Reactome databases and scored with a modified version of the Variance-adjusted Mahalanobis (VAM) method that adjusts for negative weights. We validate this version of SCAPE using both simulated and real scRNA-seq data and compare our technique against CytoSig’s and NicheNet’s ligand activity prediction frameworks. For the simulation study, we perturb real human scRNA-seq data to reflect the expected stimulation signature of up to 41 cytokines. For the real data evaluation, we use publicly accessible human scRNA-seq datasets that capture cytokine stimulation and blockade experimental conditions and a COVID-19 transcriptomics dataset.
To build a mouse version of SCAPE, we leverage the recently published Immune Dictionary by Cui et al., which represents the single-cell transcriptomics profiles of up to 86 cytokines via experiments performed in vivo in mice, with affected immune cells collected from draining lymph nodes. While the Immune Dictionary has significant utility as a data source, the associated Immune Response Enrichment Analysis (IREA) companion software is currently limited as it is only functional via an online web application. In addition, given an input scRNA-seq dataset and a specified cell type, the IREA web tool only outputs aggregated scores over each cytokine as opposed to returning the cell-level cytokine activity scores. Lastly, the authors use differential expression to construct gene sets that aim to find cytokine-specific markers against just the unstimulated control. In comparison, our mouse version of SCAPE creates custom gene sets leveraging differential expression analysis for each cytokine against the unstimulated control and against all other cytokines, followed by scoring the gene sets using our modified VAM approach. We use cross-validation to assess whether our model correctly predicts the cytokine whose activity is being stimulated. As demonstrated by these evaluations, the SCAPE method can accurately estimate cell-level cytokine activity from human and mouse scRNA-seq data.
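To give a flavor of Mahalanobis-style gene set scoring, here is a simplified per-cell score in the spirit of VAM: squared positive z-score contributions summed over a gene set, with a crude sign flip to mimic handling of negative weights. This is an illustrative toy, not the SCAPE or VAM implementation, and the gene set and weights are hypothetical.

```python
import numpy as np

# Simplified, hypothetical sketch of per-cell gene set scoring (not VAM/SCAPE):
# per-gene z-scores, sign-flipped for negatively weighted genes, then the sum of
# squared positive contributions per cell.
def set_score(expr, gene_idx, weights):
    z = (expr - expr.mean(axis=0)) / (expr.std(axis=0) + 1e-9)  # cells x genes
    zs = z[:, gene_idx] * np.sign(weights)       # flip genes with negative weights
    return (np.clip(zs, 0, None) ** 2).sum(axis=1)

rng = np.random.default_rng(1)
expr = rng.poisson(2.0, size=(50, 10)).astype(float)   # 50 cells, 10 genes
scores = set_score(expr, gene_idx=[0, 2, 5], weights=[1.0, -0.5, 2.0])
```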

11:45-12:00
Single-cell multiomic analysis of gene regulation across contexts dissects the molecular underpinnings of Gene-by-Environment interactions in inflammatory disease
Room: Clapp Hall Auditorium
Format: Live from venue

  • Julong Wei, Center for Molecular Medicine and Genetics, Wayne State University, United States
  • Mohammed Husain Bharmal, Center for Molecular Medicine and Genetics, Wayne State University, United States
  • Adnan Alazizi, Center for Molecular Medicine and Genetics, Wayne State University, United States
  • Henriette Mair-Meijers, Center for Molecular Medicine and Genetics, Wayne State University, United States
  • Richard Slatcher, Department of Psychology, University of Georgia, United States
  • Samuele Zilioli, Department of Psychology, Wayne State University, United States
  • Xiaoquan Wen, Department of Biostatistics, University of Michigan, Ann Arbor, United States
  • Roger Pique-Regi, Center for Molecular Medicine and Genetics, Wayne State University, United States
  • Francesca Luca, Center for Molecular Medicine and Genetics, Wayne State University, United States


Presentation Overview: Show

Modern lifestyle factors, including diet, coffee drinking, and smoking, can have a strong influence on the immune system and inflammatory response. These environmental factors can also modulate the genetic effects on disease phenotypes; yet we have a very limited understanding of the underlying gene regulatory mechanisms. Only a few studies have analyzed genome-wide chromatin accessibility together with gene expression across multiple environments in a limited number of cell types. Advances in single cell technology allow us to profile more complex biological systems involving the interaction of multiple cell types orchestrating the cellular response to environmental stimuli. We hypothesized that annotating context-specific gene regulatory variants in peripheral blood improves fine-mapping and interpretation of genetic risk for inflammatory traits.
Here we considered six environmental factors (caffeine, nicotine, vitamins A, D and E, and zinc) and paired controls to stimulate PBMCs isolated from 12 participants and activated with phytohemagglutinin. Using single cell multiome technology, we simultaneously profiled gene expression and chromatin accessibility in the same cell for 96 samples (128). A total of 60,551 high quality nuclei were annotated into 11 immune cell types. We characterized the cell type-specific gene expression and chromatin accessibility responses for a total of 7,241 differentially expressed genes (DEGs) and 17,774 differentially accessible regions (DARs) (10% FDR and |LFC|>0.5). We observed a strong positive correlation between gene expression and chromatin accessibility responses and identified 294 transcription factors (TFs) with significant motif activity changes (response motifs). We also show that TF gene expression changes and their combinations drive TF activity changes for some motifs, such as the Interferon-regulatory factor family.
We annotated genetic variants in chromatin accessibility regions and response motifs and then performed computational fine-mapping of eQTLs in GTEx whole blood tissue. We detected response motif annotations for 22 contexts that were significantly enriched for eQTL signals, thus representing potential latent environments in GTEx. By integrating the fine-mapped eQTLs with GWAS of 11 inflammatory diseases, we identified putative causal genes and refined the credible set of 869 risk loci. For 43% of these loci, the causal variant was annotated in a response motif. These risk variants were enriched in response motifs for specific contexts, indicating the role of Gene-by-Environment (GxE) interactions in the etiology of disease. In conclusion, we developed an approach that combines single cell genomics and statistical genetic analysis to interpret the molecular underpinnings of GxE in disease.

13:30-13:45
TransTEx: Novel tissue-specificity scoring method for grouping human transcriptome into different expression groups
Room: Clapp Hall Auditorium
Format: Live from venue

  • Pallavi Surana, PhD student, SUNY-Stony Brook, United States
  • Pratik Dutta, Research Scientist, SUNY-Stony Brook, United States
  • Ramana Davuluri, Professor, SUNY-Stony Brook, United States


Presentation Overview: Show

Motivation: Although human tissues carry out common molecular processes, gene expression patterns can distinguish different tissues. Traditional informatics methods, primarily at the gene level, overlook the complexity of alternative transcript variants and protein isoforms produced by most genes, changes in which are linked to disease prognosis and drug resistance.
Results: We developed TransTEx (Transcript-level Tissue Expression), a novel tissue-specificity scoring method, for grouping transcripts into four expression groups. TransTEx applies sequential cutoffs to tissue-wise transcript probability estimates, subsampling-based p-values and fold-change estimates. Application of TransTEx on GTEx mRNA-seq data divided 199,166 human transcripts into 17,999 tissue-specific (TSp), 7,436 tissue-enhanced (TEn), 36,783 widely expressed (Wide), 133,242 lowly expressed (Low) and 3,706 no expression (Null) transcripts. Testis has the most (13,466) TSp isoforms, followed by liver (890), brain (701), pituitary (435), and muscle (420). We found that the tissue specificity of alternative transcripts of a gene is predominantly influenced by alternate promoter usage. By overlapping brain-specific transcripts with the cell-type gene-markers in the scBrainMap database, we found that 63% of the brain-specific transcripts were enriched in non-neuronal cell types, predominantly astrocytes followed by endothelial cells and oligodendrocyte precursor cells. In addition, we found 61 brain cell-type marker genes encoding a total of 176 alternative transcripts as brain-specific and 22 alternative transcripts as testis-specific, highlighting the complex tissue-specific and cell-type-specific gene regulation and expression at the isoform level. TransTEx can be adopted for the analysis of bulk RNA-seq or scRNA-seq datasets to find tissue- and/or cell-type-specific isoform-level gene markers.
Availability and Implementation: TransTEx database: https://bmi.cewit.stonybrook.edu/transtexdb/ and the R package is available via GitHub: https://github.com/pallavisurana1/TransTEx
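The sequential-cutoff idea behind the four expression groups can be sketched as a simple decision rule. The thresholds and rules below are illustrative assumptions, not the published TransTEx criteria (which use probability estimates and subsampling-based p-values).

```python
import numpy as np

# Hypothetical sequential-cutoff grouping in the spirit of TransTEx.
# expr_cutoff and fold_cutoff are illustrative, not the published thresholds.
def classify_transcript(tpm, expr_cutoff=1.0, fold_cutoff=4.0):
    tpm = np.asarray(tpm, dtype=float)       # expression across tissues
    if tpm.max() < expr_cutoff:
        return "Null"
    top = tpm.max()
    rest = np.sort(tpm)[:-1]                 # all tissues except the top one
    others = rest.max() if rest.size else 0.0
    if top >= fold_cutoff * max(others, expr_cutoff) and (tpm >= expr_cutoff).sum() == 1:
        return "Tissue-specific"
    if top >= fold_cutoff * np.median(tpm):
        return "Tissue-enhanced"
    if (tpm >= expr_cutoff).mean() > 0.5:
        return "Wide"
    return "Low"

g_tsp = classify_transcript([50, 0.1, 0.2, 0.1])   # one dominant tissue
g_null = classify_transcript([0.1, 0.2])           # below expression cutoff everywhere
g_wide = classify_transcript([5, 6, 7, 8])         # expressed broadly
```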

13:45-14:00
Coordinated Translation Efficiency as a Functional Organizing Principle of Mammalian Transcripts Across Cell Types
Room: Clapp Hall Auditorium
Format: Live from venue

  • Yue Liu, University of Texas at Austin, United States
  • Can Cenik, University of Texas at Austin, United States
  • Ian Hoskins, University of Texas at Austin, United States
  • Jonathan Chacko, University of Texas at Austin, United States
  • Hakan Ozadam, University of Texas at Austin, United States
  • Michael Geng, University of Texas at Austin, United States


Presentation Overview: Show

Characterization of shared patterns of RNA expression between genes across conditions has led to the discovery of novel biological functions and regulatory networks. These co-expression relationships have illuminated gene function and higher-order organization of transcriptomes in a vast array of biological contexts. However, we currently do not know if similar patterns of co-expression extend to translation. Coordinated translational networks have remained unexplored, primarily due to the scarcity of comprehensive translational measurements across a large compendium of biological contexts. Here, we uniformly analyzed thousands of matched ribosome profiling and RNA-seq datasets from more than a hundred tissues and cell lines across human and mouse studies. We introduce the concept of Coordinated Translational Efficiency (CoTE), pinpointing mRNAs that demonstrate coordinated translation patterns across cell types. We uncover novel gene functions revealed by translational similarity information alone, a signal not previously considered. Moreover, our observations indicate that proteins exhibiting positive similarity at both translational and transcriptional levels are significantly more prone to physical interaction. We finally discover CoTE patterns indicative of RNA-binding protein (RBP) involvement, suggesting potential mechanisms of translational regulation. Our findings establish coordinated translation across various conditions as a pervasive and conserved organizing principle of mammalian transcriptomes.
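The core quantity can be sketched in a few lines: translation efficiency (TE) as the ratio of ribosome profiling to RNA-seq counts per cell type, then gene-gene correlation of log-TE profiles. This is an illustrative reduction of the idea, not the authors' pipeline; the data below are simulated, with genes 0 and 1 sharing a TE program.

```python
import numpy as np

# Illustrative CoTE-style computation: log translation efficiency per gene per
# cell type, then gene-gene correlation of TE profiles across cell types.
def cote_correlation(ribo, rna, eps=0.5):
    te = np.log2((ribo + eps) / (rna + eps))   # genes x cell types
    return np.corrcoef(te)                     # gene-gene correlation matrix

rng = np.random.default_rng(2)
rna = rng.poisson(100, size=(4, 12)).astype(float)   # 4 genes, 12 cell types
shared = rng.normal(0, 1, size=12)                   # shared TE program (genes 0, 1)
ribo = rna * 2 ** np.vstack([shared, shared, rng.normal(0, 1, size=(2, 12))])
corr = cote_correlation(ribo, rna)                   # corr[0, 1] is high
```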

14:00-14:15
Modeling the depth of cellular dormancy from RNA-sequencing data
Room: Clapp Hall Auditorium
Format: Live from venue

  • Michelle Wei, University of Arizona, United States
  • Guang Yao, University of Arizona, United States


Presentation Overview: Show

The vast majority of cells in the human body are not actively dividing but dormant. Among these dormant cells, some are reversible (quiescent) and can reenter the cell cycle upon growth stimulation, whereas others are irreversibly arrested in senescence or terminal differentiation. Often referred to as G0, cellular dormancy states exhibit significant heterogeneity in "depth". In quiescence, deeper quiescent cells, although fully reversible, require stronger growth stimulation and take a longer time to reenter the cell cycle than shallower quiescent cells. Examples of quiescence deepening have long been observed in cells under extended periods of serum starvation or contact inhibition in culture and with aging in vivo. In contrast, subpopulations of quiescent muscle, neural, and hematopoietic stem cells, upon tissue injury or in response to systemic circulating factors, move to shallower (primed) quiescence that bears a closer resemblance to active cells. Similarly, senescent cells also exhibit different "depths" with different cellular characteristics.

Our recent work, which profiled transcriptomic changes over time in quiescent fibroblasts, suggests that quiescence deepening represents a transition trajectory from active proliferation to permanent senescence, and is regulated by cellular dimmer switch-like mechanisms. To model this trajectory, we trained an elastic net model that predicts “quiescence depth scores” (QDS). When tested on publicly available RNA-sequencing datasets, the model predicted QDS that accurately reflected the relative dormancy depth of a wide array of cell types. Based on this analysis process, we created QDSWorkflow, an R package for predicting dormancy depth from RNA-sequencing data. The package provides a framework of simple “wrapper” functions for preprocessing data, building models, and visualizing results. It accepts both bulk and single-cell RNA-sequencing data as input and has additional quality control and bootstrapping functionality for the latter. By capturing a broad gene signature with elastic net, it can predict QDS that reflect the relative dormancy depth of cells in various conditions, which we demonstrate with two example datasets. For a DNA-damaged fibroblast dataset, the QDS predictions increase monotonically as fibroblasts progress from proliferation to senescence, reflecting the cell cycle arrest that results from DNA damage. For a neural stem cell (NSC) dataset, the QDS predictions decrease monotonically as the NSCs become activated, reflecting the increase in proliferative potential associated with exiting G0. As a robust tool for extracting dormancy depth from RNA-sequencing data, QDSWorkflow has various potential applications in cancer and aging research.
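The elastic net step can be sketched generically: regress a continuous dormancy-depth label on expression and score new samples with the fitted model. This toy uses simulated data with a hypothetical sparse "dormancy signature"; it is not the QDSWorkflow package (which is in R), just the modeling idea in Python.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Toy sketch of the quiescence-depth-score idea: elastic net regression of a
# depth label on expression, then prediction of scores for samples.
rng = np.random.default_rng(3)
n_samples, n_genes = 60, 200
X = rng.normal(size=(n_samples, n_genes))        # expression matrix
true_w = np.zeros(n_genes)
true_w[:10] = 1.0                                # hypothetical sparse signature
depth = X @ true_w + rng.normal(scale=0.1, size=n_samples)

model = ElasticNet(alpha=0.01, l1_ratio=0.5).fit(X, depth)
qds = model.predict(X)                           # quiescence depth scores
```

The l1_ratio term is what lets the model keep a broad but sparse gene signature rather than a single marker.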

14:15-14:30
Modeling islet enhancers using deep learning identifies candidate causal variants at loci associated with T2D and glycemic traits
Room: Clapp Hall Auditorium
Format: Live from venue

  • Sanjarbek Hudaiberdiev, NIH
  • Leland Taylor, NIH
  • Wei Song, NIH
  • Narisu Narisu, NIH
  • Redwan Bhuiyan, University of Connecticut
  • Henry Taylor, NIH
  • Xuming Tang, Weill Cornell Medicine
  • Tingfen Yan, NIH
  • Amy Swift, NIH
  • Lori Bonnycastle, NIH
  • Shuibing Chen, Weill Cornell Medicine
  • Michael Stitzel, University of Connecticut
  • Michael Erdos, NIH
  • Ivan Ovcharenko, NIH, United States
  • Francis Collins, NIH


Presentation Overview: Show

Genetic association studies have identified hundreds of independent signals associated with type 2 diabetes (T2D) and related traits. Despite these successes, the identification of specific causal variants underlying a genetic association signal remains challenging. In this study, we describe a deep learning (DL) method to analyze the impact of sequence variants on enhancers. Focusing on pancreatic islets, a T2D-relevant tissue, we show that our model learns islet-specific transcription factor (TF) regulatory patterns and can be used to prioritize candidate causal variants. At 101 genetic signals associated with T2D and related glycemic traits where multiple variants occur in linkage disequilibrium, our method nominates a single causal variant for each association signal, including three variants previously shown to alter reporter activity in islet-relevant cell types. For another signal associated with blood glucose levels, we biochemically test all candidate causal variants from statistical fine-mapping using a pancreatic islet beta cell line and show biochemical evidence of allelic effects on TF binding for the model-prioritized variant. To aid in future research, we publicly distribute our model and islet enhancer perturbation scores across ~67 million genetic variants. We anticipate that DL methods like the one presented in this study will enhance the prioritization of candidate causal variants for functional studies.
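Variant prioritization with a sequence model typically follows an in-silico mutagenesis pattern: score the reference sequence, substitute the alternate allele, and take the difference in predicted enhancer activity. The sketch below shows that generic scheme with a stand-in scoring function; the authors' actual model and scoring are not reproduced here, and `toy_predict` is a hypothetical placeholder.

```python
import numpy as np

# Generic in-silico mutagenesis: variant effect = prediction(alt) - prediction(ref).
def one_hot(seq):
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, idx[base]] = 1.0
    return x

def variant_effect(predict, seq, pos, alt):
    ref_score = predict(one_hot(seq))
    alt_seq = seq[:pos] + alt + seq[pos + 1:]    # pos is 0-based
    return predict(one_hot(alt_seq)) - ref_score

# Hypothetical stand-in "model": rewards a G at position 2.
toy_predict = lambda x: float(x[2, 2])
effect = variant_effect(toy_predict, "AAGAA", pos=2, alt="T")   # disrupts the G
```

With a trained enhancer model in place of `toy_predict`, repeating this over candidate variants in a credible set yields the per-variant perturbation scores used for prioritization.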

14:30-14:45
Jointly characterizing the sequence and chromatin binding preferences of transcription factors using neural networks
Room: Clapp Hall Auditorium
Format: Live from venue

  • Jianyu Yang, Penn State, United States
  • Shaun Mahony, Penn State, United States


Presentation Overview: Show

To understand the cell-specific determinants of TF DNA-binding specificity, we need to examine how newly activated TFs interact with sequence and preexisting chromatin landscapes to select their binding sites. We have developed neural networks that jointly model sequence and prior chromatin data to interpret the binding specificity of TFs that have been induced in well-characterized chromatin environments. Feature attribution approaches allow us to quantify the degree to which sequence and prior chromatin features explain induced TF binding, both at individual sites and genome-wide. In this presentation, we will discuss the challenges we have faced in jointly modeling sequence and chromatin features with neural networks, including training strategies that attempt to separate the interdependencies between data modalities. We will also discuss how we have applied feature attribution approaches to interpret the chromatin preferences of induced TFs.

We will demonstrate our approaches by analyzing differential binding activities across a selection of Forkhead-domain TFs (FoxA1, FoxC1, FoxG1, FoxL2, and FoxP3) when each is expressed in mouse embryonic stem cells. Despite having similar in vitro DNA-binding preferences, the various Fox TFs bind different DNA targets and drive differential gene expression patterns, even when expressed in the same preexisting chromatin environment. Using our neural networks and associated feature attribution approaches, we will show that differential Fox binding activities are explained by a mixture of differential DNA sequence preferences and differential abilities to engage relatively inaccessible chromatin. Thus, jointly modeling sequence and chromatin TF preferences will be an important strategy for characterizing cell-specific regulatory patterns and understanding how paralogous TFs diverge in their activities.

14:45-15:00
A Hybrid Constrained Continuous Optimization Approach for Optimal Causal Discovery from Biological Data.
Room: Clapp Hall Auditorium
Format: Live from venue

  • Yuehua Zhu, Tsinghua University, University of Pittsburgh, United States
  • Maria Chikina, University of Pittsburgh, United States


Presentation Overview: Show

Understanding causal effects is a fundamental goal of science and underpins our ability to make predictions in unseen settings and conditions. While direct experimentation is the gold standard for measuring and validating causal effects, the field of causal discovery theory offers a tantalizing alternative: extracting causal insights from observational data. Theoretical analysis demonstrates that this is indeed possible if the dataset is large and certain conditions are met. However, biological datasets, with their complex structures and latent variables, present unique challenges distinct from standard benchmarking datasets and simulations.

In this work we perform comprehensive benchmarking using large scale biological datasets to both construct the causal truth and perform causal discovery from observational data. We test a diverse suite of methods across a large combination of ground truths, observational datasets, and evaluation metrics. Among the methods tested, the PC algorithm stands out for its robust performance in constructing causal networks. However, its limitation lies in producing graphs without detailed causal models. We propose PC-NOTEARS (PCnt), a hybrid solution that combines the PC algorithm's strengths with the ability of NOTEARS to estimate causal effects accurately.

The PC-NOTEARS implementation is available at https://github.com/zhu-yh1/PC-NOTEARS
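The hybrid's second stage can be sketched in isolation: once a PC-style algorithm has fixed the graph structure, continuous edge weights can be estimated by least squares restricted to each node's parents. This is a simplified illustration of the combination, not the PCnt implementation (which uses NOTEARS' continuous optimization), and the graph here is a hypothetical two-variable example.

```python
import numpy as np

# Given a fixed parent structure (as a PC-style algorithm would provide),
# estimate linear edge weights by least squares restricted to each node's parents.
def fit_edge_weights(X, parents):
    d = X.shape[1]
    W = np.zeros((d, d))                 # W[i, j]: weight of edge i -> j
    for j, pa in enumerate(parents):
        if pa:
            coef, *_ = np.linalg.lstsq(X[:, pa], X[:, j], rcond=None)
            W[pa, j] = coef
    return W

rng = np.random.default_rng(4)
n = 2000
x0 = rng.normal(size=n)
x1 = 2.0 * x0 + rng.normal(scale=0.1, size=n)   # ground truth: x0 -> x1, weight 2
X = np.column_stack([x0, x1])
W = fit_edge_weights(X, parents=[[], [0]])      # structure assumed known here
```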