Scalable Analysis for Big Biological Data

The modern era of bioinformatics is defined by a deluge of data coupled with powerful computing. Providing tools and data that are FAIR (Findable, Accessible, Interoperable, and Reusable), although not yet consistently practiced, is a sensible goal for the bioinformatics community. Applying big biological data to scientific and medical questions can yield meaningful improvements in quality of life and drive future discoveries. To realize that potential, however, tools and methodology must be honed for both the scale of the data and the quality of the results. How do we adapt algorithms to efficiently handle terabytes or even petabytes of genomic data? What problems arise only when working at large scale? How do we balance the trade-off between scalability and accuracy?

Available biological datasets have grown exponentially over the past decades thanks to technological improvements. As modern databases grow, so do the cost of maintaining them and the complexity of the data they hold. Growing biological data demands additional computational expertise and infrastructure, as well as the biological expertise needed to interpret the data in the burgeoning field of bioinformatics. As a result, efficient and scalable analyses are urgently needed if the scientific community is to leverage the opportunities of “big data” while ensuring that results are accurate and accessible. Replicating results at scale is especially difficult because of the computing power required, and differences between cloud computing environments compound the problem. The era of big data has the potential to catalyze incredible progress toward answering scientific questions, but a concerted effort is needed to ensure that the methods are robust and reliable.

Presentations highlighting solutions to these challenges are the theme of this session. Analyses of big datasets and scalable algorithmic approaches are especially relevant. We are also interested in talks that develop best practices for replicability and apply them when comparing algorithms and results. Talks may focus on algorithm development, large-scale applications, and/or novel, understudied problems involving big biological data. We look forward to proposals covering FAIR, challenging, and pressing issues in computational biology, as well as high-throughput and high-performance computing.

Schedule subject to change
All times listed are in EDT
Thursday, May 16th
10:30-11:00
Moving from Data Enclaves through Knowledge and Genome Graphs into Complex Models
Room: Cathedral of Learning, G24
Format: Live from venue

  • Ben Busby, DNAnexus, United States


Presentation Overview:

This talk will discuss how to think about and implement analyses on and across multimodal data enclaves, such as the UK Biobank-RAP. It will also cover data management techniques for massive-scale data and which types of models are appropriate for various analyses, particularly the issues to consider when implementing transformer models and other deep learning models. Knowledge graphs and genome graphs can massively reduce the time and computational power necessary for indexing data for large models, particularly for multi-locus analysis, and this will be covered as well. The implementation portion of the session will include an overview of how to use data enclaves, including common data processing operations and leveraging models both within and between data enclaves.

11:00-11:15
Lessons from integration of 168,000 human gut microbiome samples
Room: Cathedral of Learning, G24
Format: Live from venue

  • Richard Abdill, University of Chicago, United States
  • Samantha Graham, University of Minnesota, United States
  • Vincent Rubinetti, University of Colorado, United States
  • Frank Albert, University of Minnesota, United States
  • Casey Greene, University of Colorado, United States
  • Sean Davis, University of Colorado, United States
  • Ran Blekhman, University of Chicago, United States


Presentation Overview:

As evidence accumulates of the complex interactions between microbiota and their hosts, it has become clear that a holistic characterization of human health and disease requires understanding the factors that shape variation in the human microbiome. While other genomics fields have used large, pre-compiled compendia to extract systematic insights requiring otherwise impractical sample sizes, there has been no comparable resource for the 16S rRNA sequencing data most commonly used to quantify microbiome composition. To help close this gap, we assembled a set of 168,464 publicly available human gut microbiome samples, processed with a single pipeline and combined into the largest unified microbiome dataset to date. We used this resource, which is freely available at microbiomap.org, to shed light on global variation in the human gut microbiome. Here, I describe the computational approach to building and maintaining this resource and discuss the technical hurdles we've encountered, with a focus on metadata and inconsistencies between projects.

11:15-11:30
Scalable Community Detection for Large Networks
Room: Cathedral of Learning, G24
Format: Live from venue

  • Aidan Lakshman, University of Pittsburgh, United States
  • Erik Wright, University of Pittsburgh, United States


Presentation Overview:

Community detection in graphs has numerous applications, from social networks to biology. Bioinformatic analyses especially depend on graph community detection algorithms to infer homology groups from sequence similarity networks. However, the immense size of modern graphs makes it challenging to detect communities accurately. Approaches with linear-time scalability typically perform worse than approaches with higher time complexity. Here, we set out to compare popular methods for community detection on synthetic graphs with known communities and biological graphs with unknown communities. We found that faster algorithms often detect communities less accurately than less scalable algorithms. To address this issue, we introduce two new variants of the Fast Label Propagation algorithm for clustering extremely large sequence similarity networks. Our approach offers accuracy comparable to less scalable approaches while providing linear-time scalability. Furthermore, we made it possible to run our community detection algorithms outside of main memory, which permits community detection on huge graphs with limited RAM. This advance democratizes community detection because access to expensive supercomputer resources is not required. We compared the communities automatically detected by different algorithms on both synthetic and real biological networks, and we discuss the accuracy of different community detection algorithms in the context of their relative time and memory complexities. Our implementation of community detection is available in the open-source SynExtend package for R.
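For readers unfamiliar with the underlying technique, here is a minimal sketch of classic asynchronous label propagation in Python. It illustrates the general idea only; it is not the Fast Label Propagation variants or the SynExtend implementation described above, and the toy graph is invented for the example.

```python
import random
from collections import Counter

def label_propagation(adjacency, max_iters=100, seed=0):
    """Classic asynchronous label propagation on an adjacency-list graph.
    adjacency: dict mapping each node to an iterable of its neighbours."""
    rng = random.Random(seed)
    labels = {node: node for node in adjacency}  # every node starts as its own community
    nodes = list(adjacency)
    for _ in range(max_iters):
        rng.shuffle(nodes)
        changed = False
        for node in nodes:
            neighbours = list(adjacency[node])
            if not neighbours:
                continue
            counts = Counter(labels[n] for n in neighbours)
            top = max(counts.values())
            best = sorted(lab for lab, c in counts.items() if c == top)
            new_label = rng.choice(best)  # break ties at random
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels

# toy usage: two triangles joined by a single bridge edge
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
print(label_propagation(graph))
```

Each node repeatedly adopts the most common label among its neighbours until labels stop changing, which is what gives label-propagation approaches their near-linear scaling in the number of edges.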

11:30-11:45
G2PDeep-v2: a web-based deep-learning framework for phenotype prediction and biomarker discovery using multi-omics data.
Room: Cathedral of Learning, G24
Format: Live from venue

  • Sania Zafar Awan, University of Missouri - Columbia, United States
  • Shuai Zeng, University of Missouri - Columbia, United States
  • Trinath Adusumilli, University of Missouri - Columbia, United States
  • Manish Sridhar Immadi, University of Missouri - Columbia, United States
  • Trupti Joshi, University of Missouri - Columbia, United States
  • Dong Xu, University of Missouri - Columbia, United States


Presentation Overview:

The G2PDeep-v2 server is a web-based platform, powered by deep learning, for phenotype prediction and marker discovery from multi-omics data in any organism, including humans, plants, animals, and viruses. The server provides multiple services for researchers to create deep-learning models through an interactive interface and train these models using an automated hyperparameter tuning algorithm on high-performance computing resources.
Unlike the previous version of G2PDeep [1], the new version, G2PDeep-v2, supports multiple inputs for multi-omics data, offers a broader array of model selection options and advanced settings for tuning model hyperparameters, and includes comprehensive Gene Set Enrichment Analysis (GSEA) functionality. Notably, compared with other available applications, G2PDeep-v2 provides end-to-end management of machine learning and deep learning projects, from multi-omics dataset creation all the way to model interpretation, and supports predictions from individual omics types or any combination of up to three omics data types. It is equipped with a fully automated pipeline to process and organize multi-omics data such as gene expression, miRNA expression, DNA methylation, protein expression, SNP, and CNV data.
To accelerate scientific research on survival analysis in cancer studies, we utilized G2PDeep-v2 for long-term survival prediction and identified candidate biomarkers associated with survival in 23 cancer studies using The Cancer Genome Atlas (TCGA) datasets. Various models, including our proposed multi-CNN, Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), and Random Forest (RF) models, were employed for predictions. To ensure reproducibility, the data for each cancer study were systematically divided into a training dataset (60% of the data) for model training, a validation dataset (20%) for hyperparameter tuning, and a test dataset (20%) to evaluate model performance. Predictive performance was quantified as the mean area under the curve (AUC) over a 5-fold cross-validation framework. G2PDeep-v2 with our proposed multi-CNN outperforms all other machine learning models in predicting phenotypes for the Skin Cutaneous Melanoma (SKCM) study with uniform multi-omics data.
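As an illustration of the evaluation protocol described above (not the G2PDeep-v2 code itself), mean AUC over 5-fold cross-validation can be computed along these lines with scikit-learn; the synthetic matrix and the logistic regression model are stand-ins for a real multi-omics feature table and the models listed above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# synthetic stand-in for a (samples x features) multi-omics matrix with a binary phenotype
X, y = make_classification(n_samples=300, n_features=500, n_informative=30, random_state=0)

aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    model = LogisticRegression(max_iter=5000)
    model.fit(X[train_idx], y[train_idx])
    prob = model.predict_proba(X[test_idx])[:, 1]   # predicted probability of the positive class
    aucs.append(roc_auc_score(y[test_idx], prob))

print(f"mean AUC over 5 folds: {np.mean(aucs):.3f}")
```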
The G2PDeep server is publicly available at http://g2pdeep.org. The Python-based deep-learning model is available at https://github.com/shuaizengMU/G2PDeep_model

11:45-12:00
A novel algorithm for full-resolution registration of gigapixel images with sub-cellular accuracy
Room: Cathedral of Learning, G24
Format: Live from venue

  • Rajdeep Pawar, Department of Computational and Systems Biology, University of Pittsburgh, United States
  • Aatur Singhi, Department of Pathology, University of Pittsburgh, United States
  • Shikhar Uttam, Department of Computational and Systems Biology, University of Pittsburgh, United States


Presentation Overview:

Introduction:
Two-dimensional (2D) analysis of Hematoxylin and Eosin (H&E)-stained tissue sections has long been the cornerstone of anatomic pathology. However, it is increasingly being recognized that extending this analysis to 3D microenvironments is required to fully understand the morphological and architectural complexity of tumor pathobiology. Although significant algorithmic advances are being made toward this goal, accurate registration (alignment) of H&E images of serially adjacent 2D sections into a coherent 3D volume at full image resolution remains a challenge. This difficulty stems from various factors, including inherent differences between adjacent tissue sections and tissue deformation induced during the sectioning process. In response to these challenges, we introduce a new registration method for serial whole-slide images, based on sparse representation, that facilitates high-precision 3D reconstruction of the tumor microenvironment.

Methods:
Our approach relies on a sparse representation of gigapixel whole-slide images, effectively reducing the data from approximately 6.5 billion pixels to a mere 50,000 pixels – compression by a factor of 1300 – without compromising accuracy. This reduction in data volume allows us to apply matrix decomposition algorithms without the prohibitive computational cost typically associated with high-resolution images. In addition, we use the scale-invariant feature transform (SIFT) and the random sample consensus (RANSAC) algorithm to precisely align the 2D serial sections and reconstruct the 3D tumor microenvironment, while preserving sub-cellular details that are crucial for accurate analysis.
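The feature-matching step can be sketched as follows with OpenCV; this shows generic SIFT keypoint matching with RANSAC outlier rejection between two grayscale section images and is not the authors' sparse-representation pipeline.

```python
import cv2
import numpy as np

def estimate_section_alignment(fixed_gray: np.ndarray, moving_gray: np.ndarray):
    """Estimate a similarity transform between two grayscale (uint8) section images
    using SIFT keypoints, Lowe's ratio test, and RANSAC."""
    sift = cv2.SIFT_create()
    kp_f, des_f = sift.detectAndCompute(fixed_gray, None)
    kp_m, des_m = sift.detectAndCompute(moving_gray, None)

    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des_m, des_f, k=2)
    good = [pair[0] for pair in matches
            if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance]

    src = np.float32([kp_m[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_f[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    # RANSAC discards mismatched keypoints before fitting the transform
    matrix, inlier_mask = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)
    return matrix  # 2x3 matrix usable with cv2.warpAffine to map `moving` onto `fixed`
```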

Results:
The proposed method enables high registration accuracy at full resolution, achieving a normalized cross-correlation (NCC) of +0.62. This performance markedly exceeds that of leading methodologies such as CODA (Nature Methods, 2022), which achieves +0.38. Furthermore, our algorithm recovers the correct ordering of the serial sections with 100% accuracy.
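For reference, the NCC metric quoted above is the zero-normalised cross-correlation, which can be computed for any pair of same-sized registered images as follows.

```python
import numpy as np

def normalized_cross_correlation(a: np.ndarray, b: np.ndarray) -> float:
    """Zero-normalised cross-correlation of two same-shape images; +1 is a perfect match."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float(np.mean(a * b))
```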

Conclusion:
We have developed an algorithm for reconstructing 3D tumor microenvironments from 2D whole-slide images at full resolution and with sub-cellular accuracy, without requiring image down-sampling. Our algorithm demonstrates significantly better performance than current state-of-the-art methods. We anticipate that our method will substantially improve 3D tumor microenvironment analysis, enabling deeper insights into the spatial biology of the tumor microenvironment and the development of more effective diagnostic and therapeutic strategies.

References:
1. Lowe, D.G. "Distinctive image features from scale-invariant keypoints." International Journal of Computer Vision 60.2 (2004): 91-110.
2. Fischler, Martin A., et al. "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography." Communications of the ACM 24.6 (1981).
3. Kiemen, Ashley L., et al. "CODA: quantitative 3D reconstruction of large tissues at cellular resolution." Nature Methods (2022).

13:30-13:45
Exploring the Role of Mycoviruses in Fungal Pathogenesis: Implications for Precision Antifungal Therapies
Room: Cathedral of Learning, G24
Format: Live from venue

  • Sergei Lenskii, University of Minnesota, Canada
  • Nadezhda Lenskaia, Independent Researcher, Canada


Presentation Overview:

Infectious diseases, including drug-resistant fungal infections, pose a significant threat to global health, with substantial mortality rates. Recognizing this challenge, the World Health Organization (WHO) has established the Fungal Priority Pathogens List (FPPL) to tackle the growing threat of invasive fungal infections, especially among people with weakened immune systems. Our study delves into the intricate interactions between fungal pathogens and their viruses (mycoviruses) to advance understanding of these complex relationships for future treatment interventions. Mycoviruses exhibit multifaceted impacts on fungal virulence, growth, and reproduction. While some mycoviruses attenuate fungal virulence, others inhibit growth or increase fungal susceptibility to stress, e.g., the virus AfuPmV-1 in Aspergillus fumigatus. However, not all mycoviruses are detrimental; some establish neutral or even beneficial relationships with their fungal hosts.

Recognizing the potential of mycoviruses, akin to that of bacteriophages in phage therapy, we propose a computational approach to explore a novel frontier in therapeutics against fungal infections. In our view, mycoviruses offer targeted strategies to combat specific fungal pathogens, minimizing collateral damage and resistance development. Although further research and clinical validation are imperative, computational strategies for discovering mycoviruses hold substantial promise for advancing precision-oriented antifungal therapies. Mycoviruses may also have synergistic effects in combination with antifungals, potentially revolutionizing the treatment of fungal infections.

We have developed a scalable computational approach to analyze available resources for mycovirus exploration, including the research literature, databases of known mycovirus sequences, and large public sequencing datasets. We applied this approach to explore the viromes of major fungal pathogens listed on the WHO FPPL. The exponential growth of fungal and environmental sequencing projects presents a wealth of data for mycovirus discovery, and scalable computational approaches to extract virus sequences from these datasets are crucial. Our analysis showcases how integrated computational methods can harness big data for mycovirus exploration and help researchers identify candidate mycoviruses, potentially leading to the development of novel treatments for fungal infections.

We demonstrate how computational approaches enable researchers to navigate vast biomedical datasets efficiently, identifying prospective mycovirus candidates for further experimentation and validation. Additionally, the developed computational framework facilitates the study of virus-host interactions in priority fungal pathogens lacking known viruses. The potential synergistic effects of discovered mycoviruses and antifungals on fungal pathogens represent a promising avenue for exploration. The results of our study underscore the importance of understanding mycovirus-fungal interactions and harnessing computational tools for advancing precision antifungal therapies, offering new insights into combating fungal infections.

13:45-14:00
Deciphering Phage Genomes: A Comprehensive Approach with Hidden Markov Models
Room: Cathedral of Learning, G24
Format: Live from venue

  • Tatiana Lenskaia, University of Toronto, Canada
  • Alan Davidson, University of Toronto, Canada


Presentation Overview:

Phages, viruses that infect bacteria, play pivotal roles in microbial communities, bacterial evolution, and biotechnology. However, a substantial portion of their genetic material remains enigmatic, with many genes annotated as hypothetical proteins lacking known functions. To address this knowledge gap, we propose an integrative approach leveraging Hidden Markov Models (HMMs) for comprehensive phage annotation.
Our methodology begins with constructing several hundred high-quality HMM profiles derived from diverse sets of known phage structural genes. These HMM profiles are manually curated and validated by leading experts in phage research. The developed computational approach captures conserved motifs, structural features, and conserved genome-context characteristics in phage genomes. These features serve as sensitive detectors capable of identifying putative phage genes within genomic sequences, even amidst noise and genetic variability. In tandem with HMM-based detection, clustering algorithms are employed to enhance annotation accuracy and efficiency. These techniques, trained on curated datasets of phage gene functions, augment the predictive power of our approach by inferring gene function from a comprehensive analysis of multiple factors, including sequence similarity, domain architecture, and contextual genome information.
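As a concrete illustration of the profile-scanning step, assuming profiles stored in HMMER format, a curated profile set can be searched against predicted phage proteins with hmmsearch roughly as follows; this is a sketch, not the authors' full pipeline.

```python
import subprocess
from pathlib import Path

def scan_phage_proteins(profiles_hmm: Path, proteins_faa: Path, out_tbl: Path, evalue: float = 1e-5):
    """Run HMMER's hmmsearch with curated profiles against predicted phage proteins
    and return per-target hits parsed from the tabular (--tblout) output."""
    subprocess.run(
        ["hmmsearch", "--tblout", str(out_tbl), "-E", str(evalue),
         "--cpu", "4", str(profiles_hmm), str(proteins_faa)],
        check=True,
    )
    hits = []
    for line in out_tbl.read_text().splitlines():
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        # tblout columns: target name, target acc, query (profile) name, query acc, full-seq E-value, ...
        hits.append({"protein": fields[0], "profile": fields[2], "evalue": float(fields[4])})
    return hits
```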
By systematically annotating structural genes, our approach sheds light on a substantial portion of the genetic repertoire of phages and possible roles of putative genes in phage biology, providing valuable insights into phage-host interactions, microbial evolution, and ecological dynamics. Also, our method offers practical benefits for both basic research and biotechnological applications. The comprehensive phage annotation enables identification of candidate genes for novel virulence factors, defense mechanisms, antibiotic resistance determinants, and other biomedically relevant elements encoded within phage genomes. Moreover, understanding phage gene functions is crucial for the development of phage-based therapies, bioprospecting for novel enzymes, and engineering synthetic phages for various biotechnological purposes.
In conclusion, our comprehensive phage annotation approach represents a powerful tool for unraveling many mysteries encoded within phage genomes. By elucidating the functions of hypothetical genes and deciphering the intricate genetic landscapes of phages, our methodology opens new avenues for exploring the vast diversity and biological significance of these ubiquitous viruses.

14:00-14:15
Generative Model for Gene Expression Samples
Room: Cathedral of Learning, G24
Format: Live from venue

  • Oleksandr Narykov, Computing, Environment and Life Sciences, Argonne National Laboratory, Lemont, IL 60439, United States
  • Alexander Partin, Computing, Environment and Life Sciences, Argonne National Laboratory, Lemont, IL 60439, United States
  • Yitan Zhu, Computing, Environment and Life Sciences, Argonne National Laboratory, Lemont, IL 60439, United States
  • Thomas Brettin, Computing, Environment and Life Sciences, Argonne National Laboratory, Lemont, IL 60439, United States


Presentation Overview:

Genetics-informed translational studies in the medical field are challenging because of the significant variability between individual organisms. In the case of pre-clinical studies, the problem is even more pronounced as there is a need to fill the gap between different biological models that vary in purity and availability. Previous studies addressed the problem of augmenting tumor gene expression data in the limited context of cancer sample classification using Generative Adversarial Networks (GANs). However, training this class of Neural Networks (NNs) is challenging due to vanishing gradients, model collapse, and failures to converge. Diffusion models (DMs) are a cutting-edge advancement in generative AI.
DMs are succeeding adversarial approaches such as GANs in advanced image, audio, and video generation; they rely on an iterative process of degrading data with noise and then denoising the result. The performance of these models is robust; however, the generative process is known to be computationally intensive and slow. Recent OpenAI work addresses the slow, iterative sampling of DMs by introducing a new class of DM, Consistency Models. Ultimately, we aim to leverage this architecture and the Argonne Leadership Computing Facility’s AI Testbed resources to construct a generative model for multiple biological model systems – cell lines, single-cell RNA-Seq samples from patients, and patient-derived xenografts (PDX). Generating robust biological data is an open problem plagued by challenges, such as a lack of training samples, data inconsistency, high dimensionality, and poor interpretability, that are less frequent in traditional AI domains, e.g., image and audio. Overcoming these limitations requires developing new strategies and approaches, so it is vital to be able to test them promptly. Our research provides life science researchers with a way to create synthetic data based on cell line samples and obtain in silico samples for complex, realistic settings for further analysis and refinement.
We present an adaptation of the Consistency Model (CM) [1] that generates synthetic RNA-Seq gene expression profiles and was tested in the context of cancer cell lines. The main idea behind CMs is to learn a consistency function f(x_t, t) → x, where x_t corresponds to the noisy data at an arbitrary timepoint t. This allows us to generate a ground-truth sample x from any point along the diffusion process in a single step. Using unsupervised learning methods, we attempt to distinguish ground-truth examples from synthetic samples to validate the generative model.
[1] Song, Yang, et al. "Consistency Models." arXiv preprint arXiv:2303.01469 (2023).
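A minimal sketch of the consistency-function parameterisation of Song et al. [1], applied here to expression vectors rather than images; the backbone network, its signature, and the hyperparameter values are placeholders rather than the authors' architecture.

```python
import torch
import torch.nn as nn

class ConsistencyWrapper(nn.Module):
    """Sketch of the consistency function f(x_t, t) with the skip/output scalings of
    Song et al. (2023). `backbone` is any module mapping (x_t, t) to a tensor with
    the same shape as x_t (e.g. an MLP over expression vectors)."""

    def __init__(self, backbone: nn.Module, sigma_data: float = 0.5, eps: float = 0.002):
        super().__init__()
        self.backbone = backbone
        self.sigma_data = sigma_data
        self.eps = eps  # smallest time step; the boundary condition is f(x, eps) = x

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        t = t.view(-1, 1)  # one time value per sample, broadcast over features
        # scalings chosen so that c_skip(eps) = 1 and c_out(eps) = 0, enforcing f(x, eps) = x
        c_skip = self.sigma_data**2 / ((t - self.eps) ** 2 + self.sigma_data**2)
        c_out = self.sigma_data * (t - self.eps) / torch.sqrt(self.sigma_data**2 + t**2)
        return c_skip * x_t + c_out * self.backbone(x_t, t)
```

The skip connection is what lets a trained model map a noisy sample at any timepoint back to a clean sample in a single forward pass.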

14:15-14:30
ECTP: an R package for predicting gene targets of environmental chemical exposures using coupled matrix-matrix completion
Room: Cathedral of Learning, G24
Format: Live from venue

  • Kai Wang, University of Michigan, United States
  • Justin Colacino, University of Michigan, United States
  • Dana Dolinoy, University of Michigan, United States
  • Maureen Sartor, University of Michigan, United States


Presentation Overview:

Environmental exposures present a huge health burden. Identifying the target genes of environmental exposures is a key step in identifying detrimental health effects of chemicals with limited toxicity knowledge. Traditional in vivo and in vitro toxicity testing is powerful but requires substantial time and funding. To accelerate environmental chemical safety assessment, computational toxicology methods are widely used to find potential toxic effects of chemicals. Methods such as read-across and quantitative structure-activity relationship (QSAR) modeling are usually used to predict a binary label for chemicals, e.g., whether a chemical is a carcinogen or whether certain biological pathways could be activated. The limitation of these methods is that they cannot provide a comprehensive picture of the biological response to chemical exposure. We previously implemented a novel method, the Coupled Matrix-Matrix Completion (CMMC) algorithm, to predict the gene targets of environmental chemicals and tested it with human exposure-gene interaction data from the Comparative Toxicogenomics Database (CTD). Unlike previous methods, CMMC can integrate environmental chemicals and target genes on a broad scale to predict overall chemical-gene interactions. The input to CMMC consists of three matrices: a main matrix containing the known chemical-gene interactions, a chemical-chemical similarity matrix containing similarity values among all chemicals in the main matrix, and a gene-gene similarity matrix containing similarity values among all genes in the main matrix. After the calculation, missing values in the main matrix are imputed, representing novel predicted chemical-gene interactions. Compared to alternative methods, CMMC achieved the best AUC and stable performance on a series of benchmark datasets with different sizes of input matrices. The first implementation of this method is in C++ using the Armadillo library. To make our method more user-friendly for biologists and toxicologists, we now introduce an R package named ECTP (Environmental Chemical Target Prediction). In this package, we implemented the core CMMC function with the Rcpp and RcppArmadillo packages, providing a runtime comparable to the C++ version. The input to our package can be a single chemical (IUPAC name) or a chemical list if the user would like to predict gene targets for multiple chemicals. RRDKit is used to calculate chemical similarities between the user-provided chemical(s) and the built-in chemical list. The prediction results are returned as an R data frame for further downstream analysis.
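To make the inputs concrete, the sketch below imputes a chemical-gene matrix by low-rank factorisation regularised by the two similarity matrices; it only illustrates the kind of completion problem described above and is NOT the authors' CMMC algorithm or the ECTP package.

```python
import numpy as np

def complete_with_similarities(M, S_chem, S_gene, rank=10, lam=0.1, lr=0.01, iters=500, seed=0):
    """Impute missing chemical-gene interactions (NaNs in M) by low-rank factorisation
    regularised by chemical-chemical (S_chem) and gene-gene (S_gene) similarity graphs.
    Illustrative only; not the CMMC algorithm."""
    rng = np.random.default_rng(seed)
    n_chem, n_gene = M.shape
    U = 0.1 * rng.standard_normal((n_chem, rank))
    V = 0.1 * rng.standard_normal((n_gene, rank))
    mask = ~np.isnan(M)                              # observed entries
    L_chem = np.diag(S_chem.sum(1)) - S_chem         # graph Laplacians of the similarity matrices
    L_gene = np.diag(S_gene.sum(1)) - S_gene
    M0 = np.nan_to_num(M)
    for _ in range(iters):
        R = mask * (U @ V.T - M0)                    # residual on observed entries only
        U -= lr * (R @ V + lam * L_chem @ U)         # gradient steps with Laplacian regularisation
        V -= lr * (R.T @ U + lam * L_gene @ V)
    return U @ V.T                                   # previously-missing cells are the predictions
```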

14:30-14:45
Streamlining High Throughput Kinome Analysis: Introducing KADL, a Comprehensive Kinome Analysis Description Language
Room: Cathedral of Learning, G24
Format: Live from venue

  • Ali Imami, University of Toledo, United States
  • William Ryan, University of Toledo, United States
  • Hunter Eby, University of Toledo, United States
  • Jennifer Nguyen, University of Toledo, United States
  • Taylen Arvay, University of Toledo, United States
  • Robert McCullumsmith, University of Toledo, United States


Presentation Overview:

High-throughput functional kinome analysis using PamChip has significantly advanced our understanding of functional proteomics and kinomics. However, the diversity of methods for analyzing the resulting datasets forces researchers either to pick a single method or to undergo painful procedures to integrate results across multiple methods. To address these issues, we introduce the Kinome Analysis Description Language (KADL), a unified Domain Specific Language (DSL) for analyzing kinome data.

KADL is expressed as a subset of the English language that allows for a declarative description of the desired analysis results, which can then be displayed as an R Shiny application, a PDF document, or individual figures that can be integrated directly into a manuscript. The description conforms to a Parsing Expression Grammar (PEG) representation that maps the description into precise steps, which can then be executed to generate the results. Rust, a memory- and type-safe language, is used to translate the KADL representation into distinct analysis steps in R; the R representation then generates the final analysis results. It integrates established tools such as Upstream Kinase Analysis (UKA), Kinome Random Sampling Analyzer (KRSA), Kinase Enrichment Analysis (KEA3), and PTM Substrate Enrichment Analysis (PTM-SEA).
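As a toy illustration of the declarative-description-to-steps idea only: the syntax below is invented for this example and is not KADL's actual grammar, and a simple regular expression stands in for a real PEG parser.

```python
import re

# Hypothetical declarative description mapped to ordered analysis steps (not KADL syntax).
DESCRIPTION = "compare treated vs control using KRSA and UKA, output pdf"

pattern = re.compile(
    r"compare (?P<case>\w+) vs (?P<ctrl>\w+) using (?P<methods>[\w, and]+), output (?P<fmt>\w+)"
)
m = pattern.match(DESCRIPTION)
steps = [
    {"step": "load_groups", "case": m["case"], "control": m["ctrl"]},
    *[{"step": "run_method", "method": meth.strip()}
      for meth in re.split(r",| and ", m["methods"]) if meth.strip()],
    {"step": "render_report", "format": m["fmt"]},
]
for s in steps:
    print(s)  # each dict is one executable step in the generated analysis plan
```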
KADL offers a comprehensive high throughput kinome analysis platform, supporting multiple visualizations and providing a user-friendly syntax. The combination of Rust and R contributes to its efficiency, accuracy, and adaptability. The language addresses methodological disparities, streamlining the analysis workflow and fostering reproducibility.
In conclusion, the Kinome Analysis Description Language (KADL) represents a significant advancement in high throughput kinome analysis. Its integration of tools, user-friendly syntax, and utilization of PEG parsers, Rust, and R contribute to standardization, efficiency, and the generation of robust results. KADL stands as a powerful tool poised to catalyze advancements in kinome research.

14:45-15:00
Automated identification of radiotherapy courses from US Department of Veterans Affairs administrative data
Room: Cathedral of Learning, G24
Format: Live from venue

  • Max Schreyer, Oregon Health & Science University, VA Portland Health Care System (VAPORHCS), United States
  • Chris Anderson, VA Portland Health Care System (VAPORHCS), United States
  • Ryan Melson, VA Portland Health Care System (VAPORHCS), United States
  • Reid Thompson, Oregon Health & Science University, VA Portland Health Care System (VAPORHCS), United States


Presentation Overview:

Radiotherapy is a critically important cancer treatment, both globally and for the aging population of US Veterans, with over 60% of cancer patients receiving radiotherapy during their disease course. Despite this, radiation courses are not clearly defined within the Veterans Health Administration (VHA), the single largest integrated healthcare system in the United States. We present a supervised machine learning model that utilizes billing and diagnostic codes from VHA and Centers for Medicare & Medicaid Services (CMS) databases to predict radiation course dates with compelling accuracy (average precision of 0.99). A multi-center cohort of 1,982 radiation patients was selected for model training and testing, with ground-truth labels determined through manual chart review. We encoded a set of 304 radiation code-dependent, time-based features and built three machine learning models: a random forest, AdaBoost, and a neural network. All models showed high accuracy when predicting individual course date labels (range 96.3% - 97.5%), with the random forest showing the highest overall performance. The retrospective application of our model to 1,333,286 patients, coupled with a heuristic algorithm for assembling radiation courses, identified 1,526,660 predicted instances of radiotherapy. The identified courses were collected into a shared resource to facilitate future VHA-based studies, and our predictive model is available for application to a wider range of non-VHA datasets, particularly those leveraging CMS data.
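A sketch of the kind of model evaluation described, with a synthetic matrix standing in for the billing- and diagnostic-code-derived, time-based features (this is not the VHA model or data).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# synthetic stand-in: rows are patient-days, columns are 304 code-derived features,
# and y marks whether a day is a radiation course date
rng = np.random.default_rng(0)
X = rng.poisson(0.2, size=(5000, 304))
y = (X[:, :10].sum(axis=1) + rng.normal(0, 1, 5000) > 3).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]
print(f"average precision: {average_precision_score(y_te, scores):.3f}")
```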