Presentation Overview: Show
Cancer genomes accumulate a large number of somatic mutations resulting from imperfect DNA processing during the normal cell cycle as well as from carcinogenic exposures or cancer-related aberrations of the DNA maintenance machinery. These processes often lead to distinctive patterns of mutations, called mutational signatures. Considering these signatures as quantitative traits, we leverage them to study the interactions between mutagenic processes, other cellular processes, and the environment. Untangling these interactions is critical for understanding the processes underlying mutational signatures. To address these challenges, we developed several complementary computational approaches that allow us to link several mutational signatures to their causes. I will discuss selected approaches, focusing on network-based methods.
In biology, graph layout algorithms can reveal comprehensive biological contexts by visually positioning graph nodes in their relevant neighborhoods. A layout algorithm or engine commonly takes a set of nodes and edges and produces layout coordinates of nodes according to edge constraints. However, current layout engines normally do not consider node, edge, or node-set properties during layout and only curate these properties after the layout is created. Here, we propose a new layout algorithm, the distance-bounded energy-field minimization algorithm (DEMA), to natively consider various biological factors, i.e., the strength of gene-to-gene association, the gene's relative contribution weight, and the functional groups of genes, to enhance the interpretation of complex network graphs. In DEMA, we introduce a parameterized energy model where nodes are repelled by the network topology and attracted by a few biological factors, i.e., interaction coefficient (IC), effect coefficient (EC), and fold change (FC) of gene expression. We generalize these factors as gene weights, PPI weights, gene-to-gene correlations, and gene set annotations, the four parameterized functional properties used in DEMA. Moreover, DEMA considers further attraction, repulsion, and grouping coefficients to enable different preferences in generating network views. Applying DEMA, we performed two case studies using genetics data in Autism Spectrum Disorder (ASD) and Alzheimer's disease (AD), respectively, for gene candidate discovery. Furthermore, we implement our algorithm as a plugin to Cytoscape, an open-source software platform for visualizing networks, which makes it convenient to use. Our software and demo can be freely accessed at http://discovery.informatics.uab.edu/dema.
Model organisms are widely used to better understand the molecular causes of human disease. While sequence similarity greatly aids this transfer, sequence similarity does not imply functional similarity, and thus several current approaches incorporate protein-protein interactions (PPIs) to help map findings between species. Existing transfer methods formulate the alignment problem either as a matching problem, which pits network features against known orthology, or, more recently, as a joint embedding problem. Here, we propose a novel state-of-the-art joint embedding solution: Embeddings to Network Alignment (ETNA). More specifically, ETNA generates individual network embeddings based on network topological structures and then uses a Natural Language Processing-inspired cross-training approach to align the two embeddings using sequence orthologs. The final embedding preserves both within-species and between-species gene functional relationships, and we demonstrate that it captures both pairwise and group functional relevance. In addition, ETNA's embeddings can be used to transfer genetic interactions across species and identify phenotypic alignments, laying the groundwork for potential opportunities for drug repurposing and translational studies.
Accurately identifying genes associated with diseases is the key to understanding the disease mechanisms and finding treatment strategies accordingly. The modular nature of disease genes in the human gene interaction network has motivated several network-based disease gene prediction methods, including network embeddings. However, complex diseases are heterogeneous, involving several hundreds of genes, and can manifest differently in various contexts such as tissues and disease states. Overlooking the contexts in which diseases and traits manifest themselves could lead to a less accurate understanding of the human disease genes. Here, we developed a context-specific network embedding method highlighting certain contextual information, including the tissue specificity and gene expression study specificity. We then used an ensemble logistic regression model to combine all the context-specific embeddings to perform disease gene predictions according to the validation scores. Our method significantly improves disease gene prediction performance over the context-naive embeddings. Furthermore, the resulting ensemble model coefficients accurately reflect the biologically meaningful disease-context association. Finally, our method is general and can be applied to a user-defined gene expression dataset to generate the corresponding context-specific embeddings to better understand the context information by finding the top diseases related to the gene expression dataset.
Protein structural classification (PSC) is a supervised problem of assigning proteins into pre-defined structural (e.g., CATH or SCOPe) classes based on the proteins' sequence or 3D structural features. We recently proposed PSC approaches that model protein 3D structures as protein structure networks (PSNs) and analyze PSN-based protein features, which performed better than or comparable to state-of-the-art sequence or other 3D structure-based PSC approaches. However, existing PSN-based PSC approaches model the whole 3D structure of a protein as a static (i.e., single-layer) PSN. Because folding of a protein is a dynamic process, where some parts (i.e., sub-structures) of a protein fold before others, modeling the 3D structure of a protein as a PSN that captures the sub-structures might further help improve the existing PSC performance. Here, we propose to model 3D structures of proteins as multi-layer sequential PSNs that approximate 3D sub-structures of proteins, with the hypothesis that this will improve upon the current state-of-the-art PSC approaches that are based on single-layer PSNs (and thus upon the existing state-of-the-art sequence and other 3D structural approaches). Indeed, we confirm this on 72 datasets spanning ~44,000 CATH and SCOPe protein domains.
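As a rough illustration of the multi-layer idea described above, the following minimal Python sketch builds one contact network per growing sequence prefix. It assumes one 3D point per residue, a simple distance cutoff for contacts, and a fixed prefix step; the function names, the toy coordinates, the cutoff, and the step size are all hypothetical and are not taken from the authors' implementation.

```python
from math import dist


def contact_network(coords, cutoff):
    """Return edges (i, j), i < j, between residues within `cutoff` of each other."""
    n = len(coords)
    return {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if dist(coords[i], coords[j]) <= cutoff
    }


def sequential_layers(coords, cutoff, step):
    """Build one network per sequence prefix: first `step` residues, then 2*step, ..."""
    sizes = list(range(step, len(coords), step)) + [len(coords)]
    return [contact_network(coords[:k], cutoff) for k in sizes]


# Hypothetical toy coordinates (one point per residue), not real protein data.
coords = [(0, 0, 0), (3, 0, 0), (6, 0, 0), (6, 3, 0), (3, 3, 0), (0, 3, 0)]
layers = sequential_layers(coords, cutoff=4.0, step=2)
for layer_index, edges in enumerate(layers):
    print(layer_index, sorted(edges))
```

Each layer contains the edges of the previous one plus the contacts introduced by the newly added residues, so the final layer equals the usual single-layer network for the whole structure.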
Invasive Aspergillosis (IA), a fungal infection of the lungs caused by the pathogen Aspergillus fumigatus, is the most common invasive fungal infection in immunosuppressed individuals. Recent studies have recognized IA as a secondary infection that complicates COVID-19, increasing mortality. Despite the high clinical relevance of A. fumigatus, the molecular mechanisms that underlie IA and co-morbid conditions remain poorly characterized. We present a network-based analysis pipeline that combines gene regulatory network (GRN) inference and network-based interpretation of regulatory modules to characterize the A. fumigatus transcriptional response. Our GRN inference approach incorporates latent transcription factor activity (TFA) estimation to elucidate transcription factors that are post-transcriptionally regulated and for which gene expression may not be informative. We provide an interactive network visualization framework that incorporates statistical and topological tools for investigating context-specific roles of regulators within the network. Our framework can be used to interpret input gene lists to predict associated biological pathways, prioritize regulators based on kernel diffusion, and identify novel subnetwork components using a Steiner tree approximation. Application of our framework to A. fumigatus predicted known and novel regulators of multiple secondary metabolite regulatory pathways. Our approach and resource are broadly applicable for network-based interpretation of clinically significant fungal species.
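The diffusion-based prioritization step can be pictured with a small sketch. The code below uses a random walk with restart, a common stand-in for diffusion kernels, and is not the authors' kernel; the graph, node names, restart probability, and iteration count are all hypothetical, and every node is assumed to have at least one neighbor.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, iters=100):
    """Spread score from `seeds` over an undirected graph given as {node: [neighbors]}."""
    nodes = sorted(adj)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(start)
    for _ in range(iters):
        # Each step: restart mass returns to the seeds, the rest moves to neighbors.
        nxt = {n: restart * start[n] for n in nodes}
        for u in nodes:
            share = (1.0 - restart) * scores[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        scores = nxt
    return scores


# Hypothetical path-shaped regulatory neighborhood: A - B - C - D.
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = random_walk_with_restart(adj, seeds={"A"})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Nodes closer to the seed set receive higher scores, which is the intuition behind ranking candidate regulators by their network proximity to an input gene list.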
A general principle of biology is the self-assembly of proteins into functional complexes. Characterizing their composition is, therefore, required for our understanding of cellular functions. Unfortunately, we lack a comprehensive catalog of the protein complexes in human cells. To address this gap, we developed a machine learning framework to identify protein complexes in over 15,000 mass spectrometry experiments, which resulted in the identification of nearly 7,000 physical assemblies. We show that our resource, hu.MAP 2.0, is more accurate and comprehensive than previous state-of-the-art high-throughput protein complex resources and gives rise to many new hypotheses, including for 274 completely uncharacterized proteins. Further, we identify 259 promiscuous proteins that participate in multiple complexes, pointing to possible moonlighting roles. We have made hu.MAP 2.0 easily searchable in a web interface (http://humap2.proteincomplexes.org/), which will be a valuable resource for researchers across a broad range of interests, including systems biology, structural biology, and molecular explanations of disease.
Protein-protein interactions (PPIs) are key drivers of cell function. While it is widely assumed that permanent PPIs tend to be important for cellular function and therefore not dispensable, it remains unclear whether or not transient PPIs are more dispensable than permanent PPIs. Here, we estimate and compare dispensable content among transient and permanent PPIs in the human interactome, by calculating the fractions of transient and permanent interactions that are neutral upon disruption. Starting with a human reference interactome mapped by experiments, we construct a human structural interactome by building three-dimensional structural models for PPIs using homology modeling, and then distinguish transient interactions from permanent interactions using several structural and biophysical properties. Next, we map common mutations from healthy individuals and disease-causing mutations onto the structural interactome, and perform structure-based calculations of the probabilities for common mutations (assumed to be neutral) and disease mutations (assumed to be mildly deleterious) to disrupt transient interactions and permanent interactions. Using Bayes’ theorem, we estimate that a similarly small fraction (<~20%) of both transient and permanent PPIs are completely dispensable, i.e., effectively neutral upon disruption by mutation. Hence, transient and permanent interactions are subject to similarly strong selective constraints in the human protein interactome.
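The Bayesian step above can be made concrete with a small worked example. The sketch below computes the probability that a mutation is neutral given that it disrupts an interaction, under the assumption that mutations fall into a fixed set of fitness classes with known priors and class-specific disruption probabilities. The two classes and every number here are hypothetical placeholders, not values from the study.

```python
def posterior_neutral(prior, p_disrupt):
    """P(neutral | mutation disrupts a PPI) by Bayes' theorem.

    prior:     {class: P(class)} over mutation fitness classes (must sum to 1)
    p_disrupt: {class: P(disrupts a PPI | class)}
    """
    evidence = sum(prior[c] * p_disrupt[c] for c in prior)
    return prior["neutral"] * p_disrupt["neutral"] / evidence


# Hypothetical inputs chosen only to illustrate the arithmetic.
prior = {"neutral": 0.25, "deleterious": 0.75}
p_disrupt = {"neutral": 0.02, "deleterious": 0.10}

dispensable_fraction = posterior_neutral(prior, p_disrupt)
print(round(dispensable_fraction, 4))
```

Running the same calculation separately with the disruption probabilities for transient and for permanent interactions would yield the two fractions that the abstract compares.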
Motivation: A factory in a metabolic network specifies how to produce target molecules from source compounds through biochemical reactions, properly accounting for reaction stoichiometry to conserve or not deplete intermediate metabolites. While finding factories is a fundamental problem in systems biology, available methods do not consider the number of reactions used, nor address negative regulation.
Methods: We introduce the new problem of finding optimal factories that use the fewest reactions, for the first time incorporating both first- and second-order negative regulation. We model this problem with directed hypergraphs, prove it is NP-complete, solve it via mixed-integer linear programming, and accommodate second-order negative regulation by an iterative approach that generates next-best factories.
Results: This optimization-based approach is remarkably fast in practice, typically finding optimal factories in a few seconds, even for metabolic networks involving tens of thousands of reactions and metabolites, as demonstrated through comprehensive experiments across all instances from standard reaction databases.
Availability and implementation: Source code for an implementation of our new method for optimal factories with negative regulation in a new tool called Odinn, together with all datasets, is available free for non-commercial use at http://odinn.cs.arizona.edu.
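To give a feel for the minimum-reaction objective described in the Methods, here is a deliberately simplified toy in Python. It treats each reaction as a hyperedge from a set of inputs to a set of outputs and finds a smallest set of reactions that reaches the targets from the sources by brute-force enumeration. This is not the mixed-integer linear program used by Odinn: it ignores stoichiometry and negative regulation, takes exponential time, and uses hypothetical reaction and metabolite names.

```python
from itertools import combinations


def reachable(reactions, sources):
    """Forward closure: metabolites producible from `sources` using `reactions`."""
    have = set(sources)
    changed = True
    while changed:
        changed = False
        for inputs, outputs in reactions:
            # A hyperedge fires only when all of its inputs are available.
            if inputs <= have and not outputs <= have:
                have |= outputs
                changed = True
    return have


def smallest_reaction_set(reactions, sources, targets):
    """Try subsets in order of increasing size; return the first that reaches all targets."""
    names = sorted(reactions)
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            chosen = [reactions[name] for name in subset]
            if targets <= reachable(chosen, sources):
                return subset
    return None


# Hypothetical toy network: name -> (inputs, outputs).
reactions = {
    "r1": (frozenset({"A"}), frozenset({"B"})),
    "r2": (frozenset({"B"}), frozenset({"C"})),
    "r3": (frozenset({"A", "X"}), frozenset({"C"})),
    "r4": (frozenset({"C"}), frozenset({"T"})),
}
print(smallest_reaction_set(reactions, sources={"A"}, targets={"T"}))
print(smallest_reaction_set(reactions, sources={"A", "X"}, targets={"T"}))
```

The second call shows why hyperedges matter: the shortcut reaction r3 becomes usable only when both of its inputs are available as sources.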
The Metabolic Network Explorer is a new addition to the BioCyc.org website and Pathway Tools software that supports interactive exploration of metabolic networks. Any metabolic network visualization tool must by necessity show only a subset of all possible metabolite connections, or the results will be visually overwhelming. Other tools limit the set of displayed connections based on predefined pathways or other preselected criteria. We sought instead to provide a tool that would give the user dynamic control over which connections to follow. The Metabolic Network Explorer is a web-based software tool that allows the user to specify a starting metabolite of interest and interactively explore its immediate metabolic neighborhood in both directions, letting the user select from the full set of connected reactions. Although only a small portion of the metabolic network is visible at a time, that portion is selected by the user, based on the full reaction complement, and it is easy to switch among alternate paths of interest. The display is intuitive, customizable, and provides copious links to more detailed information pages. The Metabolic Network Explorer fills a gap in the set of metabolic network visualization tools and complements other modes of exploration.
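The core exploration step, listing every reaction connected to a chosen metabolite in both directions, can be sketched in a few lines of Python. This is only an illustration of the idea and not the BioCyc or Pathway Tools API; the data structure and function name are hypothetical, and the three reactions are simplified (cofactors omitted).

```python
def neighborhood(reactions, metabolite):
    """Return (producing, consuming) reaction names for `metabolite`, each sorted."""
    producing = sorted(
        name for name, rxn in reactions.items() if metabolite in rxn["products"]
    )
    consuming = sorted(
        name for name, rxn in reactions.items() if metabolite in rxn["substrates"]
    )
    return producing, consuming


# Simplified toy reaction set; cofactors such as ATP and NADP+ are left out.
reactions = {
    "hexokinase": {
        "substrates": {"glucose"},
        "products": {"glucose-6-phosphate"},
    },
    "phosphoglucose_isomerase": {
        "substrates": {"glucose-6-phosphate"},
        "products": {"fructose-6-phosphate"},
    },
    "g6p_dehydrogenase": {
        "substrates": {"glucose-6-phosphate"},
        "products": {"6-phosphogluconolactone"},
    },
}

upstream, downstream = neighborhood(reactions, "glucose-6-phosphate")
print("produced by:", upstream)
print("consumed by:", downstream)
```

An interactive tool would let the user pick one of the listed reactions and then repeat the same lookup on the metabolite at its other end.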
The interpretation of disease-associated genetic variants in non-coding genomic regions remains challenging in the post-GWAS era, and enhancers have emerged as key players in mediating the effect of genetic variants on complex traits and diseases. Their activity is often regulated via transcription factors (TFs), epigenetic changes, and genetic variants. While existing approaches link enhancers to their target genes and infer TF-gene connections, we currently lack a framework that systematically integrates enhancers into TF-gene regulatory networks. Furthermore, we lack an unbiased way of assessing the biological meaningfulness of inferred regulatory interactions. Here we present two methods, implemented as user-friendly R packages, for building and evaluating enhancer-mediated gene regulatory networks (eGRNs), called GRaNIE (Gene Regulatory Network Inference including Enhancers - https://git.embl.de/grp-zaugg/GRaNIE) and GRaNPA (Gene Regulatory Network Performance Analysis - https://git.embl.de/grp-zaugg/GRaNPA), respectively. GRaNIE jointly infers TF-enhancer, enhancer-gene, and TF-gene interactions by integrating open chromatin data (e.g., ATAC-Seq or H3K27ac) with RNA-seq across samples (e.g., individuals), and optionally also Hi-C. GRaNPA is a general framework for evaluating the biological relevance of TF-gene GRNs by assessing their performance in predicting cell-type-specific differential expression. We demonstrate their power by investigating gene regulatory mechanisms in macrophages that underlie their response to infection, and their involvement in common genetic (autoimmune) diseases.
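One simple way to picture how chromatin and expression data measured across the same samples can suggest enhancer-gene links is a correlation screen. The Python sketch below assumes Pearson correlation with a fixed threshold; it is an illustration only and not the GRaNIE R package or its API, and the function names, threshold, and data values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def link_enhancers(accessibility, expression, threshold):
    """Keep (enhancer, gene) pairs whose across-sample correlation passes `threshold`."""
    links = []
    for enhancer, acc in sorted(accessibility.items()):
        for gene, expr in sorted(expression.items()):
            if abs(pearson(acc, expr)) >= threshold:
                links.append((enhancer, gene))
    return links


# Hypothetical measurements for four samples (same sample order in both tables).
accessibility = {"enh1": [1.0, 2.0, 3.0, 4.0]}
expression = {"geneA": [2.0, 4.0, 6.0, 8.0], "geneB": [5.0, 1.0, 4.0, 2.0]}

links = link_enhancers(accessibility, expression, threshold=0.8)
print(links)
```

A real analysis would also restrict candidate pairs by genomic distance and control for multiple testing; both are omitted here for brevity.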
Protein networks are commonly used for understanding the interplay between proteins in the cell as well as for visualizing omics data. Unfortunately, existing networks such as STRING are heavily biased by data availability, in the sense that well-studied proteins have many more interactions than understudied proteins. To create networks for the latter as well, we need to use high-throughput data, such as single-cell RNA-seq (scRNA-seq) and proteomics, which do not have this literature bias. However, due to the sparseness (i.e., many proteins not observed in each cell or sample) and redundancy (many similar cells or samples) of such data, simple correlation analysis does not result in high-quality networks. We present FAVA, Functional Associations using Variational Autoencoders, which deals with these issues by compressing the high-dimensional data into a meaningful, dense, low-dimensional latent space. We demonstrate that calculating correlations in this latent space results in much improved networks compared to the original representation for massive scRNA-seq and proteomics data from the Human Protein Atlas and PRIDE, respectively. We show that these networks, which given the nature of the input data should be free of literature bias, indeed have much better coverage of understudied proteins than existing networks.
Despite our ability to efficiently sequence human genomes, we remain far from accurately predicting phenotypes from sequence. There are a variety of reasons for this gap, but one is the potential for genetic interactions among variants. Efforts using reverse genetic approaches in the yeast model system have shed light on this problem. Combinations of mutations in nearly all possible yeast genes were constructed and phenotyped, producing a global genetic network that has been a valuable resource for understanding yeast biology. While technical challenges have previously limited similar endeavors in human cells, CRISPR/Cas9-based genome editing technology now makes this powerful combinatorial mutation approach possible.
I will discuss our recent efforts to map a global genetic interaction network for human cells based on genome-wide CRISPR/Cas9 screens in a reference human cell line. We have identified several challenges associated with interpreting data from differential CRISPR screens and have developed a novel computational pipeline for accurate scoring of quantitative genetic interactions in this context. I will describe these lessons learned and other insights from our growing reference human genetic interaction map.