GLBIO 2019 | May 19-22, 2019 | Univ. of Wisconsin at Madison | Talk Abstracts




The Marquee on Monday, May 20, 2019

Start Time Title Author(s)
9:30 AM Keynote #1 - Data-enabled integrative analysis for fine mapping and interpretation of associated genetic variants Sunduz Keles
Introduction by Sushmita Roy
While there have been many advances in incorporating prior information into the prioritization of associated variants in genome-wide and molecular association studies, functional annotation data have rarely played more than an indirect role in assessing evidence for association in these approaches. This talk is organized around our recent work on generating large-scale annotation data for single nucleotide variants and their model-based integration into fine mapping of quantitative and molecular traits.
General Track - Macromolecular Structure & Function
Chair: Catherine Welsh
11:00 AM Improved protein structure refinement using machine learning based restrained relaxation Debswapna Bhattacharya
Protein structure refinement aims to bring moderately accurate template-based protein models closer to the native state through conformational sampling. However, guiding the sampling towards the native state by effectively using restraints remains a major issue in structure refinement. Here, we present refineD [1], which uses deep learning to predict multi-resolution probabilistic restraints from the starting structure and subsequently converts these restraints into scoring terms to guide conformational sampling during structure refinement. To the best of our knowledge, this is the first study that applies machine learning derived multi-resolution probabilistic restraints in protein structure refinement.

We use Deep Convolutional Neural Fields (DeepCNF) [2, 3], a deep discriminative learning classifier, to predict the likelihood that the Cα atom of any residue of the starting structure lies within r Å of its position in the native structure. Following the Global Distance Test (GDT-HA) score [4], we use an ensemble of four DeepCNF classifiers after fixing r to four different distance thresholds (0.5, 1, 2, and 4 Å). Each DeepCNF classifier combines several centroid scoring functions of Rosetta [5, 6], sequence profile based residue conservation features, and consistency between structural features extracted from the starting structure and predicted values from its sequence. Outputs from the DeepCNF classifiers are subsequently converted to multi-resolution probabilistic restraints to perform restrained relaxation using the FastRelax application of Rosetta [7, 8].
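
The four distance thresholds mirror how a GDT-HA-style score is assembled, which the following minimal sketch illustrates. It assumes the per-residue Cα deviations have already been measured under one fixed superposition (the full GDT procedure searches over superpositions), and the function names are hypothetical, not part of refineD.

```python
# Thresholds in Å, as listed in the abstract.
THRESHOLDS = (0.5, 1.0, 2.0, 4.0)


def gdt_ha(ca_distances, thresholds=THRESHOLDS):
    """Return a GDT-HA-style score (0-100) from per-residue Cα deviations in Å.

    Assumption: distances come from a single, fixed superposition.
    """
    if not ca_distances:
        raise ValueError("need at least one residue distance")
    n = len(ca_distances)
    # Fraction of residues within each threshold, averaged over thresholds.
    fractions = [sum(d <= t for d in ca_distances) / n for t in thresholds]
    return 100.0 * sum(fractions) / len(thresholds)


def residue_labels(ca_distances, thresholds=THRESHOLDS):
    """One binary label vector per threshold: 1 if the residue is within r Å."""
    return {t: [int(d <= t) for d in ca_distances] for t in thresholds}


if __name__ == "__main__":
    deviations = [0.3, 0.8, 1.5, 3.0, 5.0]  # hypothetical values
    print(gdt_ha(deviations))
    print(residue_labels(deviations)[1.0])
```

Label vectors of this kind are what a per-threshold classifier would be trained to predict; how refineD turns the predicted likelihoods into restraints is not specified here.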

refineD produces consistent and substantial structural refinement through the use of cumulative and non-cumulative restraints on a comprehensive benchmark of 150 targets. It outperforms unrestrained relaxation and relaxation restrained to the starting structure using the FastRelax application of Rosetta [7, 8], the atomic-level energy minimization method ModRefiner [9], and the molecular dynamics (MD) simulation based FG-MD [10] protocol. Furthermore, by adjusting restraint resolutions, the method addresses the tradeoff that exists between degree and consistency of refinement. These results demonstrate a promising new avenue for improving the accuracy of template-based protein models by effectively guiding conformational sampling during structure refinement through the use of machine learning based restraints.

The refineD web server is freely available at http://watson.cse.eng.auburn.edu/refineD/.

1. Bhattacharya, D., refineD: Improved protein structure refinement using machine learning based restrained relaxation. Bioinformatics, 2019.
2. Wang, S., S. Sun, and J. Xu. AUC-Maximized deep convolutional neural fields for protein sequence labeling. in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. 2016. Springer.
3. Wang, S., et al., Protein secondary structure prediction using deep convolutional neural fields. Scientific reports, 2016. 6: p. 18962.
4. Zemla, A., LGA: a method for finding 3D similarities in protein structures. Nucleic acids research, 2003. 31(13): p. 3370-3374.
5. Rohl, C.A., et al., Protein structure prediction using Rosetta, in Methods in enzymology. 2004, Elsevier. p. 66-93.
6. Leaver-Fay, A., et al., ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules, in Methods in enzymology. 2011, Elsevier. p. 545-574.
7. Khatib, F., et al., Algorithm discovery by protein folding game players. Proceedings of the National Academy of Sciences, 2011. 108(47): p. 18949-18953.
8. Tyka, M.D., et al., Alternate states of proteins revealed by detailed energy landscape mapping. Journal of molecular biology, 2011. 405(2): p. 607-618.
9. Xu, D. and Y. Zhang, Improving the physical realism and structural accuracy of protein models by a two-step atomic-level energy minimization. Biophysical journal, 2011. 101(10): p. 2525-2534.
10. Zhang, J., Y. Liang, and Y. Zhang, Atomic-level protein structure refinement using fragment-guided molecular dynamics conformation sampling. Structure, 2011. 19(12): p. 1784-1795.
11:15 AM Improving the prediction of loops and drug binding in GPCR structure models. Bhumika Arora, Venkatesh Kareenhalli, Denise Wootten and Patrick Sexton.
G protein-coupled receptors (GPCRs) are integral membrane proteins involved in a vast variety of physiological functions and form the largest group of potential drug targets. Thus, knowledge of their three-dimensional structure is important for rational drug design and understanding molecular recognition. However, the limited availability of experimentally determined GPCR structures has led to a need for alternate methods for deriving high-resolution structural information. Homology modeling is a common approach for modeling the transmembrane helical cores of GPCRs; however, these models have varying degrees of inaccuracy that result from the quality of the template used. We have previously explored the extent to which inaccuracies inherent in homology models of the transmembrane helical cores of GPCRs can impact loop prediction [1]. For individual loop modeling, the influence of the presence/absence of other extracellular loops was also probed. We found that loop prediction in GPCR models is much more difficult than loop reconstruction in crystal structures owing to the imprecise positioning of loop anchors, although modeling a particular extracellular loop in the presence of other extracellular loops provides constraints that help in predicting “near-native” loop conformations observed in crystal structures. Therefore, minimizing the errors in loop anchors is likely to be critical for optimal GPCR structure prediction. To address this, we have developed a ligand directed modeling (LDM) method comprising geometric protein sampling and ligand docking, and evaluated its capacity to refine GPCR models built across a range of templates with varying degrees of sequence similarity to the target. The LDM method reduced the errors in loop anchor positions and improved the prediction of ligand binding poses, resulting in much better performance of these models in virtual library screenings. Thus, this ligand directed modeling method is efficient in improving the quality of GPCR structure models.

[1] Arora B, et al. (2016) Prediction of loops in G protein-coupled receptor homology models: effect of imprecise surroundings and constraints. J Chem Inf Model. 56(4):671-686.
11:30 AM Modeling the Functional Activity of Protein Sequence Variants Using Graph Convolutional Neural Networks Sam Gelman, Zhiyuan Duan, Philip Romero and Anthony Gitter
Changes to a protein’s sequence can have a positive, neutral, or negative impact on its fitness for a particular function, such as enzymatic activity. Knowledge of the complex relationship between sequence and function can be used to understand the effects of mutations and engineer proteins with improved functional properties. In recent years, deep mutational scanning (DMS) has enabled scientists to measure the fitness of hundreds of thousands of mutant versions of a protein. The resulting data represent a small sample of a protein’s sequence-to-function relationship and provide useful insights into the effects of mutations on protein function. However, the full space of sequence mutations of a protein is orders of magnitude larger than a DMS sample, especially when considering protein variants with multiple mutations. Computational models are needed to fully leverage DMS data and predict the fitness of experimentally uncharacterized variants.
We describe an approach that uses graph convolutional neural networks to predict the DMS-derived protein function of a mutated protein sequence. Graph convolutional neural networks are similar to the 2D convolutional neural networks for images, but they have been generalized to work on structured graphs rather than 2D grids. This approach has several major advantages. Neural networks are capable of learning complex, non-linear functions and allow us to incorporate biological context to better reflect the underlying input-output mapping. Wild type protein structure is encoded as a graph and incorporated into the neural network, connecting residues that are nearby in 3D space. Parameter sharing via convolutional filters enables the network to more easily generalize the effects of amino acid substitutions to other positions in the protein sequence. Additionally, our approach integrates amino acid properties and structural features via the data encoding.
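As a rough illustration of the idea above, the sketch below builds a residue contact graph from 3D coordinates and applies one graph convolution with weights shared across all positions. The 7 Å cutoff, the mean-pooling aggregation, the ReLU activation, and every name are assumptions made for this example; the abstract does not specify the authors' architecture.

```python
import math


def contact_graph(coords, cutoff=7.0):
    """Neighbor lists (with self-loops) linking residues within `cutoff` Å.

    `cutoff` is an assumed value chosen only for illustration.
    """
    n = len(coords)
    neighbors = [[i] for i in range(n)]  # every residue is its own neighbor
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors


def graph_conv(features, neighbors, weights):
    """One layer: average neighborhood features, apply shared weights, ReLU.

    features: one feature vector per residue (e.g. an amino acid encoding).
    weights:  n_in x n_out matrix shared by every residue position.
    """
    n_in, n_out = len(weights), len(weights[0])
    output = []
    for nbrs in neighbors:
        pooled = [sum(features[j][k] for j in nbrs) / len(nbrs)
                  for k in range(n_in)]
        output.append([max(0.0, sum(pooled[k] * weights[k][m]
                                    for k in range(n_in)))
                       for m in range(n_out)])
    return output


if __name__ == "__main__":
    coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(graph_conv(feats, contact_graph(coords), [[2.0], [0.0]]))
```

Because the same weight matrix is applied at every residue, an effect learned at one position can transfer to structurally similar neighborhoods elsewhere, which is the parameter-sharing benefit the abstract describes.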
Other methods aimed at variant effect prediction typically use multiple sequence alignments (MSAs) of evolutionarily related sequences as training data. This approach can be applied to proteins that lack DMS data. However, deep mutational scans are targeted at specific functional properties that may be different than evolutionary pressure. DMS provides a unique opportunity to train models directly from a sample of a protein’s sequence-function space for a particular function. Existing approaches to directly model DMS data are limited by predicting only on single mutation variants or using models that are not flexible enough to fit the underlying space. Our approach is more appropriate for protein engineering than MSA-based methods and is able to make predictions for variants with multiple mutations.
We tested our graph-based approach on five DMS datasets containing variants with multiple mutations. We achieve better performance than linear regression and fully connected neural network baselines. The fitness of most multi-mutation variants is the additive effect of the corresponding single-mutation variants (i.e. they are not epistatic). Thus, in order to rigorously evaluate models trained to predict on multiple mutations, it is important to look at performance of single mutation variants separately as well as more difficult subsets of multi-mutation variants, such as those that exhibit a high degree of epistasis.
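The additive baseline and the notion of epistasis mentioned above can be written down in a few lines. This sketch assumes fitness scores are on a scale where the wild type is 0 and independent effects add (for example, log enrichment ratios); the mutation strings and scores are hypothetical.

```python
def additive_prediction(variant, single_scores):
    """Predict a multi-mutation score as the sum of its single-mutation scores."""
    return sum(single_scores[mutation] for mutation in variant)


def epistasis(variant, observed_score, single_scores):
    """Deviation of the observed score from the additive expectation.

    Values near 0 indicate an additive (non-epistatic) variant.
    """
    return observed_score - additive_prediction(variant, single_scores)


if __name__ == "__main__":
    singles = {"A12G": -0.5, "L45P": -1.0}  # hypothetical single-mutant scores
    double = ("A12G", "L45P")
    print(additive_prediction(double, singles))
    print(epistasis(double, -2.5, singles))
```

Ranking multi-mutation variants by the absolute value of this deviation is one simple way to pick out the harder, highly epistatic evaluation subset.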
In addition, we consider several factors that impact the difficulty of learning from a particular DMS dataset, including how large the DMS sample is as a percentage of the protein’s total sequence space, the reliability of fitness score estimates for individual variants, and the effect of short read lengths during sequencing. Our results indicate that performance is sensitive to training set size, but using an appropriate model and data encoding that matches the underlying space can overcome limitations of small training sizes to some degree. The advantages of encoding the wild type protein structure in the graph convolutional neural network are strongest when sample sizes are small.
11:45 AM Chemoinformatic signature analysis of aspen seedling root exudates under biotic and abiotic stress conditions Peter Larsen
Between 20 and 30 percent of photosynthetically fixed carbon in plants is lost through root exudates, representing a significant cost to the plant and an environmental sink of photosynthetically fixed carbon. These root exudates are tightly regulated in response to the plant’s biotic and abiotic environment and have a potent effect on the plant’s soil environment and root-associated microorganism community. Understanding how the profile of root exudates, particularly organic acids, changes in response to changes in the environment will provide a deeper understanding of how a plant interacts with and manipulates its biotic and abiotic soil environment. Non-targeted metabolomics analysis of root exudates is a powerful tool for analyzing changing concentrations of metabolites present in root exudates in response to changing environmental conditions. However, the number of Unknown metabolites present in metabolomics data complicates metabolomics analysis of root exudates. While Putatively Annotated metabolites in metabolomics data are those that can be attributed to a reproducible spectral signal with properties consistent with a specific metabolite from a library of standards, Unknown metabolites in metabolomics data are discernible, reproducible, and quantifiable spectral signals that do not match the spectral properties of any compound in a database. Given that there are an estimated 200,000 metabolites in the plant metabolome and the known compounds with spectral signatures in available metabolomics analysis databases can number fewer than a thousand, the Unknown metabolites will likely far outnumber the Annotated metabolites in any metabolomics dataset. We present a computational method utilizing Chemoinformatic Signatures to propose possible identities for Unknown metabolites in metabolomics data collected from aspen seedling root exudates.
A Chemoinformatic Signature is defined as a hypervolume that separates metabolites in a biologically significant group from metabolites not in that group and integrates chemoinformatic data with metabolomics, transcriptomics, and metabolic modeling data. These hypervolumes exist within an n-dimensional Euclidean space defined by a set of n axes, each of which is a chemoinformatic attribute associated with metabolites. Any individual metabolite is a point in the n-dimensional space with coordinates defined by the metabolite’s vector of Chemoinformatic Attributes. This method combines analysis of differentially abundant metabolites in metabolomics data from a biological experiment with chemoinformatics analysis of annotated metabolites. Possible Unknown metabolite identities are selected from a set of metabolites that are predicted to be present in aspen roots by metabolic modeling and transcriptomics data analysis. The predicted identities for Unknown metabolites can then be validated through additional, hypothesis-driven biological experiments, or by introducing new compounds into the database of spectral characteristics for metabolomics analysis. The approach we present here is generalizable and has potential application in the analysis of metabolomics data for a wide variety of biological systems and in the improvement of spectral databases used in metabolomics analysis.
General Track - Gene Regulation I
Chair: Matthew Weirauch
1:30 PM Integrative analysis of epigenetics data identifies gene-specific regulatory elements Florian Schmidt, Alexander Marx, Marie Hebel, Martin Wegner, Nina Baumgarten, Manuel Kaulich, Jonathan Goeke, Jilles Vreeken and Marcel Schulz.
Understanding transcriptional regulation is a major goal of computational biology. Enhancers in particular are essential regulators driving cellular development. Enhancers can be identified experimentally, e.g. using enhancer RNAs, ChIP-seq of histone modifications (HMs), or Hi-C experiments. However, experimental linkage of enhancers to genes is challenging. Therefore, several computational methods have been proposed to create tissue-specific regulatory maps from epigenetics data.

A common strategy to de-novo link tissue-specific enhancer regions to genes is to unify DNase-hypersensitive-sites (DHS) across several samples. Subsequently, the unified regions are linked to nearby genes; either solely distance based or using a correlation test between the epigenetic signal and the expression of the possible target gene. Also, integrative efforts are made to combine known enhancers in curated databases, such as GeneHancer. Via gene-expression modeling, we show that these approaches are limited in accounting for the distinct regulatory landscape of genes and thus lead to suboptimal enhancer-gene associations.

We developed an unbiased, peak-independent method called STITCHIT to identify regulatory regions and link them to genes. We apply STITCHIT to a uniformly reprocessed dataset comprising paired DNase1-seq and RNA-seq data for 215 human primary cell and tissue samples from IHEC.

Within STITCHIT, we consider the epigenetic signal of all samples jointly using the minimum description length principle to identify regions exhibiting a signal variation related to the expression of a distinct gene. In contrast to purely peak-based approaches, no sample-specific information is lost. STITCHIT finds associations over large genomic intervals, e.g. 1 Mb, providing us with an extensive catalog of regulatory elements and their target gene interactions.

In a novel application, we utilize regulatory elements identified in a doxorubicin CRISPR-Cas9 viability screen and associate them with their most likely target genes using STITCHIT. As doxorubicin resistance limits the effectiveness of cancer therapy, it is important to identify both the genes and the regulators that are related to the resistance. Several of the highlighted genes and regulators could be validated by literature, and we identified several novel candidate genes that are supported by ChIA-PET data.

To ensure the quality of our predictions, we compared STITCHIT against the GeneHancer database and against two approaches combining DHS sites in other established ways. Regulatory elements (REMs) called by STITCHIT lead to better performance of gene-expression models than both GeneHancer regions and peak-based approaches. Furthermore, STITCHIT REMs show a higher overlap with chromatin conformation data such as ChIA-PET or promoter capture Hi-C than related approaches. STITCHIT REMs are also more enriched for functional elements such as GWAS hits or eQTLs, illustrating the functional relevance of our predicted REMs. In a concrete example, STITCHIT retrieves more experimentally validated regulatory elements of the gene ERBB2 than all other tested approaches.

In addition, we illustrate that STITCHIT is capable of dissecting superenhancers, large genomic regions carrying out enhancer function. For a distinct superenhancer, we illustrate how STITCHIT associates subregions with various distal target genes. Several of these associations could be verified using ChIA-PET and promoter capture Hi-C data.

STITCHIT is freely available (https://github.com/SchulzLab/STITCHIT). Due to an efficient implementation, large data sets comprising hundreds of samples can be processed easily. Thus, we believe that STITCHIT can pave the way for a better understanding of gene-specific regulation, especially in light of the large amounts of epigenetics data becoming available.
1:45 PM WACS: Improving Peak Calling by Optimally Weighting Controls Aseel Awdeh, Marcel Turcotte and Theodore Perkins.
Chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq), initially introduced more than a decade ago, is widely used by the scientific community to detect protein/DNA binding and histone modifications across the genome. Every experiment is prone to noise and bias, and ChIP-seq experiments are no exception. To alleviate bias, the incorporation of control datasets in ChIP-seq analysis is an essential step. The controls are used to account for the background signal, while the remainder of the ChIP-seq signal captures true binding or histone modification. However, a recurrent issue is that different ChIP-seq experiments exhibit different types of bias. Depending on which controls are used, different aspects of ChIP-seq bias are better or worse accounted for, and peak calling can produce different results for the same ChIP-seq experiment. Consequently, generating "smart" controls, which model the non-signal effect for a specific ChIP-seq experiment, could enhance contrast and increase the reliability and reproducibility of the results. We propose a peak calling algorithm, Weighted Analysis of ChIP-seq (WACS), which is an extension of the well-known peak caller MACS2. There are two main steps in WACS. First, weights are estimated for each control using non-negative least squares regression. The goal is to customize controls to model the noise distribution for each ChIP-seq experiment. This is then followed by peak calling. We demonstrate that WACS significantly outperforms MACS2 and AIControl, another recent algorithm for generating smart controls, in the detection of enriched regions along the genome, in terms of motif enrichment and reproducibility analyses. This ultimately improves our understanding of ChIP-seq controls and their biases, and shows that WACS results in a better approximation of the noise distribution in controls.
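
The weight-estimation step can be pictured as fitting the treatment signal with a non-negative combination of control signals. The sketch below is not the WACS implementation: it solves the non-negative least squares problem with plain projected gradient descent (a simple stand-in for a dedicated NNLS solver), and the binned read-count tracks and function name are hypothetical.

```python
def nnls_weights(controls, treatment, iterations=2000):
    """Estimate non-negative weights w minimizing ||sum_c w[c]*controls[c] - treatment||^2.

    controls:  list of control tracks, each a list of per-bin read counts.
    treatment: per-bin read counts of the ChIP-seq experiment.
    Solver: projected gradient descent (an assumption for this sketch).
    """
    k, n = len(controls), len(treatment)
    frobenius_sq = sum(x * x for track in controls for x in track)
    if frobenius_sq == 0:
        raise ValueError("control tracks are all zero")
    step = 1.0 / frobenius_sq  # safe step: bounds the largest eigenvalue
    w = [0.0] * k
    for _ in range(iterations):
        fitted = [sum(w[c] * controls[c][b] for c in range(k)) for b in range(n)]
        residual = [fitted[b] - treatment[b] for b in range(n)]
        gradient = [sum(controls[c][b] * residual[b] for b in range(n))
                    for c in range(k)]
        # Gradient step followed by projection onto the non-negative orthant.
        w = [max(0.0, w[c] - step * gradient[c]) for c in range(k)]
    return w


if __name__ == "__main__":
    controls = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 2.0]]
    treatment = [2.0, 0.5, 4.0, 1.0]
    print(nnls_weights(controls, treatment))
```

The resulting weighted sum of controls could then serve as the customized background passed to the peak-calling step; how WACS does this internally is not described in the abstract.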
2:00 PM refine.bio: A resource of harmonized public gene expression data sets. Jaclyn N. Taroni, Kurt G. Wheeler, Richard W. W. Jones, Deepashree Venkatesh Prasad, Ariel Rodriguez Romero, Candace L. Savonen and Casey S. Greene.
There are now more than three million genome-wide assays publicly available in government-supported repositories such as EBI’s ArrayExpress and NCBI’s Gene Expression Omnibus. These data are a rich resource for biological research. However, even data that are relatively straightforward to analyze, such as gene expression data, pose challenges. Samples were assayed on many different platforms via both array- and sequencing-based technologies, and data-processing details are often sparse. Here we present refine.bio: a multi-organism collection of genome-wide gene expression data that has been obtained from publicly available repositories and uniformly processed and normalized. refine.bio allows biologists, clinicians, and machine learning researchers to search for experiments from different source repositories all in one place and build custom data sets for their questions of interest.

The volume of publicly available gene expression data is already at the petabyte scale and is projected to reach the exabyte scale in the next few years. Data at these scales impose unique challenges. We built refine.bio using Amazon Web Services (AWS). Using AWS allows us to run at scale; however, cloud architectures also pose certain hurdles. The detachment between the physical hardware and provisioning led to failure modes that required workarounds. We will discuss what has and hasn’t worked when processing sequencing data at this scale.

To date, we have processed over 500,000 microarray and RNA-seq samples from ArrayExpress, Gene Expression Omnibus, and Sequence Read Archive. Samples are processed with standardized pipelines that have been selected based on their wide-ranging utility. The collection is designed to be consistently updated as the system surveys public repositories for new gene expression samples. We use Single Channel Array Normalization (SCAN) for Affymetrix microarray and Illumina BeadArray samples (Piccolo et al., 2012). We use Salmon (Patro et al., 2017) for quantification with a custom transcriptome reference and tximport (Soneson et al., 2015) to summarize estimates to the gene-level in a manner that takes into account transcript length and library size for RNA-seq data. We perform modest standardization of the sample metadata obtained from public repositories by mapping related keys to a single key.

refine.bio is well-suited for quickly assessing if signals are present in particular datasets and for identifying and obtaining data sets for accelerated validation of findings. Users add samples or datasets to their “shopping cart” and can combine selected individual samples into multi-sample gene expression matrices in a process we term aggregation. We allow users to aggregate either all samples from an experiment or all samples from a species. We aim to keep sample distributions as similar as possible by performing quantile normalization (QN). Users can also elect to scale gene values depending on their downstream application. We also provide processed compendia which are intended for machine learning applications.
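
For readers unfamiliar with quantile normalization, the following is a minimal textbook sketch that forces every sample to share the same distribution of values. It normalizes the supplied samples against their own rank-wise means and breaks ties by input order; whether refine.bio uses a separate reference distribution or a different tie rule is not stated in the abstract, so treat these choices as assumptions.

```python
def quantile_normalize(samples):
    """Quantile-normalize expression samples.

    samples: list of equal-length lists, one per sample, genes in the same order.
    Returns new lists in which each sample has the identical set of values.
    """
    n_genes = len(samples[0])
    if any(len(sample) != n_genes for sample in samples):
        raise ValueError("all samples must cover the same genes")
    sorted_samples = [sorted(sample) for sample in samples]
    # Mean expression at each rank across samples: the target distribution.
    rank_means = [sum(s[rank] for s in sorted_samples) / len(samples)
                  for rank in range(n_genes)]
    normalized = []
    for sample in samples:
        order = sorted(range(n_genes), key=lambda gene: sample[gene])
        values = [0.0] * n_genes
        for rank, gene in enumerate(order):
            values[gene] = rank_means[rank]
        normalized.append(values)
    return normalized


if __name__ == "__main__":
    print(quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
```

After normalization the within-sample ordering of genes is preserved, but differences in overall intensity distribution between samples are removed.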

refine.bio is freely available to the entire scientific community. Our web interface (https://www.refine.bio) is designed to make data-processing steps transparent to users. refine.bio is on GitHub at https://github.com/AlexsLemonade/refinebio. We provide examples of downstream analyses in the R programming language at https://github.com/AlexsLemonade/refinebio-examples and provide guidance for using refine.bio data with GenePattern. Although refine.bio is not a substitute for experiments and processing pipelines tailored to answer specific biological questions of interest or for input from relevant experts (e.g., those with statistics expertise), it promises to accelerate biomedical research by quickly making transparently processed gene expression data available to a broad audience.
2:15 PM Leveraging public epigenomic datasets to examine the role of regulatory variation in the three-dimensional organization of the genome Brittany Baur, Jacob Schreiber, Shilu Zhang, Yi Zhang, Mohith Manjunath, Jun Song, William Stafford Noble and Sushmita Roy.
Regulatory sequences such as enhancers can regulate the expression level of a gene hundreds of kilobases away through chromosomal looping that can bring distal regulatory elements in three-dimensional proximity to target genes. Data from high-throughput Chromosome Conformation Capture (3C) technologies that measure the three-dimensional proximity of genomic loci at high-resolution exist only for a few model cell lines due to sequencing costs and the number of cells required to make reliable measurements at high-resolution. To address this lack of data, we developed L-HiC-Reg, which exploits the local structure of the genome to predict interaction counts in new cell lines. L-HiC-Reg uses a random forests regression model and a handful of easy-to-measure one-dimensional regulatory signals to learn a predictive model.

We trained L-HiC-Reg models on a high-resolution (5 kb) Hi-C dataset and applied the models to generate a new resource of contact count predictions in 55 human cell lines and tissues from the Roadmap Epigenomics database. Because not all regulatory signals are measured in all 55 cell lines, we used Avocado, a deep tensor decomposition technique, to impute signals in cell lines where data were missing. Predictions generated using imputed datasets did not show significant deterioration compared to predictions generated with real datasets. This enabled us to substantially increase the number of cell lines to which L-HiC-Reg can be applied. In particular, we were able to generate predictions in 33 more cell lines than would have been possible without imputed data. We validated our interactions using a number of strategies. First, our predictions, when aggregated to a lower resolution, accurately recapitulate the measured low-resolution contact counts for the few Roadmap cell lines where low-resolution measurements are available. Second, we tested our significant and high-scoring interactions for overlap with other complementary datasets (e.g., ChIA-PET) and found significant overlap with these datasets. Third, we used our predicted counts to recover TADs, compared them to TADs inferred from true counts in cell lines where they were available, and found high agreement. We were also able to cluster the 55 cell lines based on their shared 3D genome conformation. Finally, we assessed the expression levels of genes associated with significant interactions and found that these genes show increased expression.

Our compendium of interactions can be used to link individual as well as groups of regulatory SNPs to target genes for diverse sets of complex traits. For example, we used our compendium to show that the breast cancer associated non-coding variant rs3903072 and the CTSW gene have a higher contact count in natural killer cells than in variant human mammary epithelial cells. CTSW, which has the highest expression in natural killer cells and is highly correlated with breast cancer patient survival, could be involved in regulating immune cells within the tumor microenvironment. We also used our compendium of significant interactions to link regulatory SNPs associated with different complex traits from the NHGRI GWAS catalog to gene subnetworks. Briefly, we first identified genes that are predicted to interact with a non-coding SNP via long-range regulation using our compendium of predictions. We overlaid these genes on a physical interaction network and used graph diffusion and spectral clustering to identify groups of interacting genes that might be targets of these SNPs. We applied this approach to SNPs associated with immune diseases and found that the inferred subnetworks are enriched for relevant immune-related biological processes. In summary, we have created a resource of contact count predictions that should be useful to examine long-range gene regulation and regulatory variation in a large number of cell types.
General Track - Comparative Genomics & Phylogenetics
Chair: Dannie Durand
3:00 PM TreeCluster: Clustering Biological Sequences using Phylogenetic Trees Metin Balaban, Niema Moshiri, Uyen Mai and Siavash Mirarab
Clustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, most applications of sequence clustering do not use a phylogenetic tree and instead operate on pairwise sequence distances. Due to advances in large-scale phylogenetic inference, we argue that tree-based clustering is under-utilized. We define a family of optimization problems that, given a (not necessarily ultrametric) tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. We study three specific constraints that limit the diameter of each cluster, the sum of its branch lengths, or chains of pairwise distances. These three versions of the problem can be solved in time that increases linearly with the size of the tree, a fact that has been known by computer scientists for two of these three criteria for decades. We implement these algorithms in a tool called TreeCluster, which we test on three applications: OTU picking for microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster generates more internally consistent clusters than alternatives and improves the effectiveness of downstream applications. TreeCluster is available at https://github.com/niemasd/TreeCluster.
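To make the diameter-constrained variant concrete, here is a sketch of a greedy bottom-up procedure in that spirit. It is not taken from the TreeCluster code: it assumes a rooted binary tree given as nested tuples (a hypothetical encoding), and it favors clarity over the bookkeeping needed for strictly linear running time.

```python
def cluster_by_diameter(tree, threshold):
    """Partition leaves so that no cluster's pairwise path length exceeds `threshold`.

    tree: a leaf is a string; an internal node is a pair
          ((left_child, left_branch_length), (right_child, right_branch_length)).
    Returns a list of clusters, each a list of leaf names.
    """
    clusters = []

    def visit(node):
        # Returns (distance to the farthest still-attached leaf, attached leaves).
        if isinstance(node, str):
            return 0.0, [node]
        (left, left_len), (right, right_len) = node
        left_depth, left_leaves = visit(left)
        right_depth, right_leaves = visit(right)
        left_depth += left_len
        right_depth += right_len
        if left_depth + right_depth > threshold:
            # Joining both sides would violate the limit: cut off the deeper side.
            if left_depth >= right_depth:
                clusters.append(left_leaves)
                return right_depth, right_leaves
            clusters.append(right_leaves)
            return left_depth, left_leaves
        return max(left_depth, right_depth), left_leaves + right_leaves

    _, remaining = visit(tree)
    clusters.append(remaining)
    return clusters


if __name__ == "__main__":
    ab = (("A", 0.1), ("B", 0.1))
    cd = (("C", 0.1), ("D", 0.2))
    root = ((ab, 0.5), (cd, 0.5))
    print(cluster_by_diameter(root, 0.5))
```

Each node is visited once and only the deepest attached leaf distance is carried upward, which is the intuition behind the linear-time claim; the recursion and list concatenation here would need replacing for very large trees.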
3:30 PM JalviewJS: Reintroducing Jalview as JavaScript on the Web Robert Hanson, Geoffrey Barton, James Procter, Mungo Carstairs and Benedict Soares.
The Jalview Desktop Application is an open-source, GPL-licenced, graphical multiple sequence alignment editor and analysis workbench, first developed in 1996. It is installed on over 57,000 computers in more than 100 countries. Publications describing Jalview have been cited more than 6,000 times. In addition to sophisticated colouring, filtering, searching, alignment editing, and annotation functions for DNA, RNA and protein sequences, Jalview provides linked views of trees, three-dimensional structures and RNA secondary structure. Eight standard multiple alignment algorithms, four disorder predictors, seventeen conservation methods, the JPred protein secondary structure predictor, and RNAalifold RNA structure predictor are freely available in Jalview. Users can interactively search PDBe and Uniprot, and retrieve alignments and genomic data from the Interpro and Ensembl data resources at the EMBL-EBI. Recent developments have included adding efficient support for long eukaryotic genes with large numbers of annotations and new features including a linked DNA/RNA/protein multiple alignment window, discovering and displaying genetic variation (SNPs) from Ensembl, extended querying against PDBe and Uniprot EMBL-EBI services and the development of more than 30 training videos that have received more than 50,000 views.

In this presentation, we describe recent advances in the Jalview Desktop Application that capitalize on revolutionary technology we have developed that allows virtually any Java applet or Java application to be run as JavaScript within a browser-based environment without the need for Java. "JalviewJS" is one of the first of a new breed of powerful applications developed with full-featured automated simultaneous real-time production of both Java and JavaScript versions with nearly identical functionality. The technology allows easy embedding of auto-generated JavaScript versions of full Java applications and (formerly Java) applets on web pages, providing interactive modular "Java applet-like" operation within a strictly JavaScript-only environment.
3:45 PM Cross-species transcription factor binding prediction via domain adapted neural networks Kelly Cochran, Divyanshi Srivastava, Akshay Balsubramani, Anshul Kundaje and Shaun Mahony.
The binding patterns of highly conserved transcription factors (TFs) appear to diverge significantly across closely related species. Experiments measuring protein-DNA binding for the same TFs in the same tissue types across different mammals show strikingly little overlap of binding sites. This divergence persists in conserved regulatory regions for homologous genes. On the other hand, the general DNA binding preferences (motifs) of most TFs appear to be strongly conserved, and the same cohorts of TFs appear to drive regulatory activities in the same tissue types across mammalian species. Therefore, the general features of tissue-specific regulatory architecture should be conserved, even if genomic binding sites are divergent.
Several recently published models aim to predict TF binding across cell types based on inputs including genomic sequence and chromatin accessibility metrics. Typically, these models are trained and tested on data generated in multiple cell types from a single species. An assumption underlying these methods is that a generalized logic of TF binding learned from several training cell types should transfer accurately into a previously unseen cell type. We hypothesize that an analogous form of transfer learning may be possible across species, where cell type is controlled. Training a model to meet this objective would provide us with a tool to simultaneously investigate cross-species regulatory divergence and potentially infer TF binding across multiple species without the need for expensive ChIP-seq experiments.
Here we assess the ability of a neural network to perform transfer learning of TF binding logic across species. First, we designed a network that accepts sequence and optional accessibility information for a given genomic window and predicts whether a ChIP-seq peak indicating TF binding is present within that window. We trained this model using data from a given cell type in one species, and then assessed its performance on data from the same cell type in another species. We quantified the relative performance differential between models trained and tested on the same species vs. across species and between models trained with or without accessibility input.
We find that our model is able to generalize much of its predictive capabilities across species. Both within- and across-species TF binding predictions are improved drastically when the model is provided with chromatin accessibility in addition to genomic sequence, even when this input is a single binary value obtained from domain calls on ATAC-seq data. Through investigation of sites differentially predicted between models trained on the same species as the test dataset vs. another species, we find that repeat sequences unique to the test species are the primary source of false positives from cross-species models. Specifically, when comparing how models trained on mouse or human training data perform on human test data, we observe that the mouse-trained model incorrectly predicts many SINE and satellite repeat sequences as TF-bound, whereas the human-trained model does not. These results demonstrate that our model can provide a framework for investigating how species-specific genome differences are relevant for cross-species TF binding prediction. We also explore potential techniques for leveraging this knowledge of species-unique differences to improve cross-species model performance, such as model-integrated domain adaptation.
4:00 PM Evolution of the Metazoan Protein Domain Toolkit Maureen Stolzer, Yuting Xiao and Dannie Durand.
Domains, sequence fragments that encode structural or functional protein modules, are the basic building blocks of proteins. Thus, the set of all domains encoded in a genome is the protein function toolkit of the species. Domain family gain, expansion, and loss drive the evolution of this toolkit. New protein functions can arise via gain of new domains or through the formation of novel combinations of existing domains, while specialization and streamlining of the protein toolkit are effected by domain loss.

Here, we investigate how changes in genomic domain content are linked to genome and organismal evolution in multicellular animals. Using a phylogenetic birth-death-gain model, we inferred the relative contributions of gain, expansion, and loss to genomic domain content on a holozoan species tree. Our results show that the relative importance of gain, expansion, and loss varies across lineages, according to a small number of patterns. In most lineages, one of these patterns dominates. This suggests that metazoan genomes are driven by one of four evolutionary strategies: expansion of the protein toolkit; turnover, in which existing domain families are replaced by new ones; specialization, where the number of families decreases, while the size of the remaining families increases; and streamlining, consistent with overall genome reduction. Our results also reveal characteristic evolutionary patterns among domain families. We observe that sets of protein domain families are evolving in concert, sharing a similar history of events and/or a similar representation in ancestral genomes. In many cases, they also share a functional role, linking protein family evolution to innovations in the immune and nervous systems. In summary, the use of a powerful probabilistic birth-death-gain model reveals organizing principles of protein evolution in metazoan genomes.
4:15 PM Mechanism of biocide resistance of bacterial isolates from hydraulic fracturing-impacted streams from Pennsylvania Lindsey Schenten, Jeremy Chen See, Maria Fernanda Campa, Terry Hazen, Regina Lamendella and Stephen Techtmann.
Hydraulic fracturing (HF), commonly known as fracking, is an increasingly common form of oil and gas extraction. Biocides are used in HF operations in order to control microbial growth, which has the potential to hinder the quality of the extracted oil and gas through microbial degradation and souring. Biocides are also used to prevent microbial clogging and damage to the equipment. Glutaraldehyde and 2,2-dibromo-3-nitrilopropionamide (DBNPA) are two of the most commonly used biocides in HF operations. Recent metagenomic evidence suggests that the microbial communities in streams adjacent to HF sites are altered relative to control streams. Furthermore, there is evidence for increased resistance to glutaraldehyde in these HF-impacted streams.

The goal of this study is to better understand the mechanisms behind biocide resistance to glutaraldehyde and DBNPA. This will be done through a combination of comparative genomics and transcriptomics to identify the genes involved in resistance. To accomplish this, we have isolated microorganisms that have resistance to glutaraldehyde (100 ppm), DBNPA (100 ppm), or a combination of the two at 100 ppm each from HF-impacted streams from Pennsylvania. After treatment of impacted stream water and sediment with the biocides, we have been able to recover approximately 10^2 CFU/ml from glutaraldehyde-treated water and sediment, approximately 10^2 CFU/ml from DBNPA-treated water and sediment, and 10^1 CFU/ml after the combination treatment. This suggests that the cocktail of the two biocides is more effective at microbial control in these settings. However, the high number of colonies obtained indicates a robust population of biocide-resistant microbes in these streams. We are currently performing whole-genome sequencing of these isolates to identify genes previously implicated in biocide or antimicrobial resistance using the Resistance Gene Identifier algorithm associated with the CARD database. Comparative genomics will be performed between these isolates to identify potential genes that confer biocide resistance through analysis of core and pan genomes of the resistant isolates and their closest sequenced relative. Future work will focus on performing transcriptomics of select isolates to generate their expression profiles and compare the active metabolic mechanisms by which these microbes combat these biocides. This will potentially help foster a better understanding of how biocide resistance develops, what mechanisms are used, and whether there are major similarities in either the mechanisms or the paths to resistance across different microorganisms responding to different biocides.
4:30 PM snacc: Sequence Non-Alignment Compression and Comparison for Inferring Microbial Phylogenies Alex Sweeten, Rafal Mostowy and Leonid Chindelevitch.
Classifying, clustering or building a phylogeny on a set of genomes without the expensive computation of sequence alignment involves calculating pairwise distances by an appropriate metric. One such metric is the normalized compression distance (NCD), an approximation of the true information distance between two objects. Despite NCD's universal applicability, it has seen few applications in bioinformatics, with no existing tools applying NCD to whole-genome datasets to the best of our knowledge. We introduce Sequence Non-Alignment Compression and Comparison (snacc), a pipeline specifically tailored for genomic data, and employing NCD with a variety of compressors to compute pairwise distances between whole genomes. We investigate the use of snacc with 5 common compression algorithms, and apply it to several bacterial and viral datasets with varying properties. Our results show that snacc achieves comparable accuracy and running time relative to other metrics, demonstrating a large improvement over previous NCD implementations, and can be successfully used to reconstruct microbial phylogenies. In addition, snacc is flexible enough to incorporate almost any compression algorithm in a simple manner. snacc is an open-source tool and is available at https://github.com/SweetiePi/snacc/.
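The normalized compression distance mentioned above is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The following is a minimal sketch of that formula using Python's standard `zlib` compressor; it is not the snacc pipeline, and the function names and toy sequences are illustrative assumptions.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy data (hypothetical): a random sequence, a lightly mutated copy,
# and an unrelated random sequence.
rng = random.Random(0)
seq_a = "".join(rng.choice("ACGT") for _ in range(2000))
seq_b = "".join("A" if i % 200 == 0 else base for i, base in enumerate(seq_a))
seq_c = "".join(rng.choice("ACGT") for _ in range(2000))

print(ncd(seq_a.encode(), seq_b.encode()))  # related pair: smaller distance
print(ncd(seq_a.encode(), seq_c.encode()))  # unrelated pair: larger distance
```

Note that zlib's 32 KB window limits how much shared content it can detect between long inputs, which is one reason the choice of compressor matters for whole genomes.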
5:00 PM Keynote #2 - Less is More: Extracting features from shallow sequencing data Dan Knights
Introduction by Chad Myers
Microbiomes are complex and highly variable, requiring analysis of massive quantities of microbial DNA from biological samples. Unfortunately, clinical microbiome researchers often have to choose between having high-resolution data, via deep shotgun sequencing, or having larger sample sizes, via affordable but low-resolution marker gene sequencing. Using real examples in clinical microbiome studies, this talk discusses methods for increasing power using larger studies with shallow shotgun metagenomics sequencing.


Fifth Quarter on Monday, May 20, 2019

Start Time Title Author(s)
Precision Medicine I
Chair: Aritro Nath
11:00 AM Imputed gene expression associations identify replicable trans-acting genes enriched in immune traits Heather Wheeler
Regulation of gene expression is an important mechanism through which genetic variation can affect complex traits. A substantial portion of gene expression variation can be explained by both local (cis) and distal (trans) genetic variation. Much progress has been made in uncovering cis-acting expression quantitative trait loci (cis-eQTL), but trans-eQTL have been more difficult to identify and replicate. Here we take advantage of our ability to predict the cis component of gene expression coupled with gene mapping methods such as PrediXcan to identify high confidence candidate trans-acting genes and their targets. That is, we correlate the cis component of gene expression with observed expression of genes in different chromosomes. Leveraging the shared cis-acting regulation across tissues, we combine the evidence of association across all available GTEx tissues and find 2356 trans-acting/target gene pairs (FDR<0.05) with high mappability scores. Reassuringly, trans-acting genes are enriched in transcription and nucleic acid binding pathways and target genes are enriched in known transcription factor binding sites. If trans-acting genes drive complex trait inheritance, we hypothesized that the trans-acting genes we discovered using our cross-tissue model should be more significantly associated with complex traits than both their targets and other background genes. We focused on immune related complex traits because our observed gene expression data are from whole blood. We also used height as a representative complex trait because of the large sample sizes available. For each trait, we found that trans-acting gene associations are more significant than background gene associations. Though attenuated in comparison to trans-acting genes, target genes are also more significant than background genes for several traits. These results are consistent with the omnigenic model of percolating trans effects through the regulatory network. 
Our scripts and summary statistics are publicly available at https://github.com/WheelerLab/trans-PrediXcan for future studies of trans-acting gene regulation.
11:20 AM Testing and controlling for horizontal pleiotropy with the probabilistic Mendelian randomization in transcriptome-wide association studies Xiang Zhou
Two-sample Mendelian randomization analyses have been widely applied in transcriptome-wide association studies to infer the causal relationship between omics phenotypes and complex traits. A key modeling assumption of these Mendelian randomization analyses is that the instrumental variables do not have horizontal pleiotropic effects, an assumption that is challenging to validate or control for in real data applications. Here, we propose a probabilistic version of the commonly used Egger regression to test and control for horizontal pleiotropic effects in two-sample Mendelian randomization studies. Our method is capable of accommodating high-dimensional correlated genetic instrumental variables and providing effective control of horizontal pleiotropic effects. With extensive simulations, we show that our method provides calibrated type I error control under a range of horizontal pleiotropic scenarios and is more powerful than several existing approaches, including PrediXcan, TWAS, and SMR, in detecting causal associations between omics phenotypes and complex traits. Finally, we illustrate the benefits of our method in applications to three large-scale genome-wide association studies including the UK Biobank.
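For orientation, the classical (non-probabilistic) Egger regression that the abstract builds on can be sketched as a weighted regression of SNP-outcome effects on SNP-exposure effects with an intercept. The intercept captures average directional pleiotropy and the slope is the causal effect estimate. This sketch is not the probabilistic method proposed in the talk, it assumes independent instruments, and the function and variable names are illustrative.

```python
import numpy as np


def egger_regression(beta_exposure, beta_outcome, se_outcome):
    """Classical MR-Egger point estimates from summary statistics.

    Returns (intercept, slope): the intercept estimates average
    directional pleiotropy, the slope estimates the causal effect.
    Assumes independent (uncorrelated) instruments.
    """
    bx = np.asarray(beta_exposure, dtype=float)
    by = np.asarray(beta_outcome, dtype=float)
    se = np.asarray(se_outcome, dtype=float)

    # Orient every SNP so that its exposure effect is positive.
    sign = np.where(bx < 0, -1.0, 1.0)
    bx, by = bx * sign, by * sign

    # Inverse-variance weighted least squares with an intercept term.
    root_w = 1.0 / se
    design = np.column_stack([np.ones_like(bx), bx])
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], by * root_w,
                               rcond=None)
    return float(coef[0]), float(coef[1])


# Toy summary statistics (hypothetical values).
beta_x = np.array([0.10, -0.20, 0.30, 0.40, -0.50])
beta_y = np.sign(beta_x) * (0.02 + 0.5 * np.abs(beta_x))
se_y = np.array([0.01, 0.02, 0.01, 0.03, 0.02])
print(egger_regression(beta_x, beta_y, se_y))
```

A non-zero intercept in this regression is the usual signal of directional horizontal pleiotropy.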
11:40 AM Integrative analysis of transcriptomic annotation data and biobank-scale GWAS summary statistics identifies risk factors for Alzheimer’s disease Qiongshi Lu
Despite the findings in genome-wide association studies (GWAS) for late-onset Alzheimer’s disease (LOAD), our understanding of its genetic architecture is far from complete. Transcriptome-wide association analysis that integrates GWAS data with large-scale transcriptomic databases is a powerful method to study the genetic architecture of complex traits. However, it is challenging to effectively utilize transcriptomic information given limited and unbalanced sample sizes in different tissues. Here we introduce and apply UTMOST, a principled framework to jointly impute gene expression across multiple tissues and perform cross-tissue gene-level association analysis using GWAS summary statistics. Compared with single-tissue methods, UTMOST achieved 39% improvement in expression imputation accuracy and generated effective imputation models for 120% more genes in each tissue. A total of 69 genes reached the Bonferroni-corrected significance level in the transcriptome-wide association meta-analysis for LOAD. Among these findings, we identified novel risk genes at known LOAD-associated loci as well as five novel risk loci. Several genes, including IL10 and ADRA1A, also have therapeutic potential to improve neurodegeneration. Cross-tissue conditional analysis further fine-mapped IL10 as the functional gene at the CR1 locus, a well-replicated risk locus for LOAD. Extension of this framework to perform biobank-wide association analysis will also be discussed. Overall, integrated analysis of transcriptomic annotations and biobank information provides insights into the genetic basis of LOAD and may guide functional studies in the future.
Precision Medicine II
Chair: Heather Wheeler
1:30 PM Long non-coding RNAs are crucial pharmacogenomic biomarkers of anticancer agents beyond protein-coding genes Aritro Nath, R. Stephanie Huang
Long non-coding RNAs (lncRNAs) compose a vast majority of the human genome compared to protein-coding genes (PCGs) and influence key regulatory processes in cancer cells. However, it is unknown whether lncRNAs can serve as potential pharmacogenomic biomarkers for cancer drugs. Here we report the results of a comprehensive analysis of lncRNA transcriptome and genome of 1000 cancer cell lines as drug response predictors. Regularized regression-based prediction analysis of more than 600 drugs demonstrated lncRNAs are potent predictors of drug response comparable to PCGs. By adjusting linear models for the effects of cell lineage and the expression of proximal cis-PCGs, we identified significant drug-lncRNA associations that were independent of these critical confounders. Furthermore, we illustrated lncRNAs could augment response prediction even for drugs with established, clinically actionable, PCG biomarkers. As an example, we identified and experimentally validated the role of EGFR-AS1 and MIR205HG as two novel predictors of anti-EGFR therapeutic response independent of EGFR somatic mutation status in lung cancer cells. Our study shows lncRNAs are potent biomarkers of anticancer agents. Novel drug-lncRNA associations are not spurious artifacts of correlations with proximal PCGs, tissue-lineage or established biomarkers. Thus, delineating the pharmacogenomic contribution of lncRNAs will be crucial in improving our understanding of cancer drug response.
1:50 PM Using cancer eQTL profiles and GWAS to prioritize drug targets in cancer Paul Geeleher
Genome-wide association studies (GWAS) have identified hundreds of inherited genetic variants affecting cancer risk. Recently, it was shown that genes with prior genetic evidence are over four times more likely to be successful drug targets. However, most GWAS variants are in non-coding regions of the genome and modulate risk by affecting gene regulation. Thus, determining how inherited genetic variation affects gene expression in cancer is critically important to identifying true target genes. Consequently, by leveraging large genomics datasets like The Cancer Genome Atlas (TCGA), previous studies have mapped expression quantitative trait loci (eQTLs) using tumor expression data. However, tumors are mixtures of both cancer and normal cells, for example, immune cells and stroma. We have developed a new approach that can accurately account for the effect of tumor-infiltrating normal cells on cancer eQTLs. The approach involves first estimating the proportion of tumor-infiltrating normal cells (tumor purity) using a combined estimate from genomics data and H&E staining. Then, we developed a statistical model that can account for the effect of tumor purity on eQTLs by modeling the interaction of the tumor purity estimate and genotype. Intuitively, this models how the magnitude of the association between gene expression and genotype changes as a function of tumor purity and extrapolates this effect to 100% cancer cells.
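One way to picture the genotype-by-purity interaction idea is an ordinary linear model with an interaction term, evaluated at a cancer-cell fraction of 1. This is only a sketch under assumptions: the abstract does not give the exact model, `cancer_fraction` is taken here to mean the fraction of cancer cells in each sample, and all names and simulated values are hypothetical.

```python
import numpy as np


def cancer_cell_eqtl_effect(expression, genotype, cancer_fraction):
    """Fit expression ~ 1 + genotype + fraction + genotype*fraction.

    Returns (effect_in_cancer_cells, effect_in_normal_cells): the
    genotype slope extrapolated to 100% and to 0% cancer cells.
    """
    y = np.asarray(expression, dtype=float)
    g = np.asarray(genotype, dtype=float)
    f = np.asarray(cancer_fraction, dtype=float)

    design = np.column_stack([np.ones_like(g), g, f, g * f])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    _, b_genotype, _, b_interaction = coef

    # The genotype slope at fraction f is b_genotype + b_interaction * f.
    return float(b_genotype + b_interaction), float(b_genotype)


# Simulated, noise-free toy data: the eQTL effect is 1.0 in cancer cells
# and 0.2 in normal cells (hypothetical values).
rng = np.random.default_rng(0)
genotype = rng.integers(0, 3, size=200)
cancer_fraction = rng.uniform(0.2, 0.9, size=200)
expression = (1.0 + 0.5 * cancer_fraction
              + (0.2 + 0.8 * cancer_fraction) * genotype)
print(cancer_cell_eqtl_effect(expression, genotype, cancer_fraction))
```

Without the interaction term, the fitted genotype effect would be an average over the cell mixture rather than the effect in cancer cells.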
2:10 PM Peering into germline and somatic breast cancer genomes in women of African descent Yonglan Zheng
Both mortality and incidence rates of breast cancer vary by race and ethnicity. In West Africa, breast cancer is almost uniformly fatal because of late stage at diagnosis and aggressive behavior of the disease in young women. It is well recognized that women of African ancestry have a remarkably higher proportion of poorly differentiated tumors that lack estrogen receptor expression (ER-) and present in advanced stages. The risk of subtype-specific onset of breast cancer in Blacks is consistent in all geographical locations, suggesting a common suite of genetic influences on risk per se. Plenty of breast cancer loci have been identified in genome-wide association studies (GWAS), which were mainly conducted in populations of European ancestry. Replication studies showed the complexity of directly applying GWAS findings across racial/ethnic groups, as a noticeable portion of GWAS-index polymorphisms could be replicated in women of African ancestry. Polygenic risk scores constructed from the published odds ratios on GWAS-index variants in Whites and Asians did not provide a comparable degree of risk stratification for Blacks. On the other hand, fine-mapping turns out to be a powerful approach to better characterize the breast cancer risk alleles in diverse populations. To date, the largest breast cancer GWAS meta-analysis in women of African descent identified a novel hit (TNFSF10) that was associated with risk of ER- breast cancer. Furthermore, in the post-GWAS era, examining a group of variants within a biological pathway can provide complementary and valuable insights for the genetic architecture of complex diseases. The revolutionizing massive parallel sequencing technologies have largely increased our capability to extend breast cancer germline risk assessment beyond BRCA1 and BRCA2. Hereditary cancer panels have been widely used to identify loss-of-function variants in known and candidate breast cancer genes. Exceedingly high frequency of BRCA1 and BRCA2 germline damaging variants was recently reported in Nigerian women with breast cancer unselected for family history or age. Among them, one in eight cases of invasive breast cancer is a result of inherited damaging variants in BRCA1, BRCA2, PALB2, or TP53, and breast cancer risks associated with these genes are extremely high. Moreover, at the allelic level, the profile is highly heterogeneous. Similar findings were also found in women with breast cancer in two other African countries, Cameroon and Uganda. It is possible that the time is nigh for the initiation of national population-screening in diverse populations. The genetic causes of cancer include both inherited germline variants and somatic alterations.
Through high-depth genome, exome, and RNA sequencing, we examined the molecular features of breast cancers using nearly two hundred patients from Nigeria and more than one thousand patients from The Cancer Genome Atlas. Nigerian breast tumors are characterized by increased HRD signature and pervasive TP53 variants, which indicates aggressive biology. The life history analysis revealed that in the HR-/HER2+ subtype, clonal losses of chromosome 14q are highly enriched in Nigerians but absent in Whites. Also, somatic single nucleotide variant clustering analysis showed that Nigerian cancers have a higher level of intra-tumoral heterogeneity than Whites, which may explain the pronounced aggressiveness of breast cancer in women of African ancestry. In contrast, early drivers (e.g. TP53 and PIK3CA) and whole-genome duplication rates were mostly similar between the groups. These studies underscore the importance of genomic diversity in research and clinical practice, and have opened the field of cancer health disparities. We need more in-depth rigorous genomics research using larger cohorts in diverse geographic regions.

Each subunit of regulatory protein complexes uniquely associates with the genome via protein-DNA or protein-protein interactions. The ChIP-exo protocol precisely characterizes protein-DNA crosslinking patterns by combining ChIP with 5’ to 3’ exonuclease digestion. Within a regulatory complex, the physical distance of a regulatory protein to the DNA affects cross-linking efficiencies. Therefore, analysis of the sequencing read distribution shapes created by the exonuclease can potentially enable greater levels of biological insight by identifying the protein-DNA interaction preferences of proteins or the modes by which they bind.
Here, we present a computational pipeline that simultaneously analyzes ChIP-exo read patterns across multiple experiments and infers spatial organizations of the proteins. Because many proteins bind DNA in a non-sequence specific manner, we directly align the strand separated ChIP-exo read patterns and produce representative ChIP-exo read profiles at a set of binding events. Given a set of aligned ChIP-exo read profiles across multiple proteins, we use a probabilistic mixture model to deconvolve the ChIP-exo read patterns to protein-DNA crosslinking sub-distributions. The method allows consistent measurements of crosslinking strengths of protein-DNA interactions across multiple ChIP-exo experiments. Lastly, we perform MDS to visualize cross-linking preferences of the regulatory proteins.
We have applied the ChIP-exo analysis methods to a set of proteins that organizes the PolIII transcriptional pre-initiation complex (PIC) assembly of yeast tRNA genes. Our results demonstrate that inferred protein organization closely recapitulates the known organization of the tRNA PIC, thereby confirming that the detailed analysis of ChIP-exo reads enables us to understand the precise organization of protein-DNA complexes. We anticipate that ChIP-exo read pattern analysis will offer an economical approach in creating testable hypotheses of protein organization.
Precision Medicine III
Chair: Qiongshi Lu
3:00 PM Patient derived xenografts for precision cancer medicine Arvind Singh Mer, Wail Ba-Alawi, Petr Smirnov, Yi Wang, Ben Brew, Anna Goldenberg, Benjamin Haibe-Kains
One of the key challenges in cancer precision medicine is finding robust biomarkers of drug response. Patient-derived xenografts (PDXs) have emerged as reliable preclinical models since they better recapitulate tumor response to chemo- and targeted therapies. However, the lack of standard tools poses a challenge in the analysis of PDXs with molecular and pharmacological profiles. Efficient storage, access and analysis is key to the realization of the full potential of PDX pharmacogenomic data. We have developed Xeva (XEnograft Visualization & Analysis), an open-source software package for processing, visualization and integrative analysis of a compendium of in vivo pharmacogenomic datasets. The Xeva package follows the PDX minimum information (PDX-MI) standards and can handle both replicate-based and 1x1x1 experimental designs. We used Xeva to characterize the variability of gene expression and pathway activity across passages. We found that only a few genes and pathways have passage specific alterations (median intraclass correlation of 0.53 for genes and positive enrichment score for 92.5% pathways). Activity of the mRNA 3'-end processing and elongation arrest and recovery pathways were strongly affected by model passaging. We leveraged our platform to link the drug response and the pathways whose activity is consistent across passages by mining the Novartis PDX Encyclopedia (PDXE) data containing 1,075 PDXs. We identified 87 pathways significantly associated with response to 51 drugs (FDR < 5%), including associations such as erlotinib response and signaling by EGFR in cancer pathways and MAP kinase activation and binimetinib response. We have also found novel biomarkers based on gene expressions, copy number aberrations (CNAs) and mutations predictive of drug response (concordance index > 0.60; FDR < 0.05). 
Xeva provides a flexible platform for integrative analysis of preclinical in vivo pharmacogenomics data to identify biomarkers predictive of drug response, a major step toward precision oncology.
3:15 PM Genetically regulated gene expression underlies lipid traits in Hispanic cohorts Angela Andaleon, Lauren S. Mogi, Heather E. Wheeler
Plasma lipid levels are risk factors for cardiovascular disease, a leading cause of death worldwide. While many studies have been conducted on lipid genetics, they mainly comprise individuals of European ancestry and thus their transferability to diverse populations is unclear. We performed genome-wide (GWAS) and imputed transcriptome-wide association studies of four lipid traits in the Hispanic Community Health Study/Study of Latinos cohort (HCHS/SoL, n = 11,103), tested the findings for replication in the European and Hispanic populations in the Multi-Ethnic Study of Atherosclerosis (MESA CAU, n = 1,297; MESA HIS, n = 1,297), and compared the results to the larger, predominantly European ancestry meta-analysis by the Global Lipids Genetics Consortium (GLGC, n = 196,475). GWAS revealed both known and not previously implicated SNPs in regions within or near known lipid genes. We used PrediXcan to calculate predicted gene expression in HCHS/SoL, MESA CAU, and MESA HIS using reference transcriptome data. These reference models include 44 GTEx V6 tissues including liver, artery, and adipose in an 85% European and 15% African-American population, and 5 MESA models in monocytes in African-American, European, Hispanic, African-American and Hispanic, and combined populations. In our PrediXcan analyses in multiple tissues and ethnicities, we found 47 significant gene-phenotype associations (P < 4.816e-8) with 9 unique significant genes, many of which occurred across multiple phenotypes, tissues, and multi-ethnic populations. These include well-studied lipid genes such as SORT1, CETP, and PSRC1. We identified genes that associate independently from nearby genes, remain significant after conditioning on the predicted expression of known lipid genes, and colocalize with expression quantitative trait loci (eQTLs), indicating a possible mechanism of gene regulation in lipid level variation. 
We also investigated prediction in multi-ethnic versus European-only models, as well as replication in European vs. Hispanic populations. We found that both effect sizes and P values replicated with higher correlation in the Hispanic population than in the European population when associations were more significant in HCHS/SoL. To fully characterize the genetic architecture of lipid traits in diverse populations, larger studies in non-European ancestry populations are needed.
3:30 PM NeTFactor, a framework for identifying transcriptional regulators of gene expression-based biomarkers Gaurav Pandey, Mehmet Eren Ahsen, Supinda Bunyavanich, Alexandar Grishen, Galina Grishina, Yoojin Chun
With rapid advances in genomic technology, several multi-gene expression-based predictive biomarkers have been identified for diseases such as breast cancer, cerebrovascular disease, and Alzheimer’s disease. Biological and regulatory mechanisms driving the performance of such biomarkers are often not readily evident. Here we describe an innovative framework, NeTFactor, that combines network analyses with gene expression data to identify a minimal set of transcription factors (TFs) that are expected to significantly and maximally regulate such biomarkers. NeTFactor first computationally infers a context-specific gene regulatory network (GRN) from disease-relevant gene expression data generated using technologies like RNA sequencing or microarrays. It then applies statistical enrichment methods to the structure and components of this GRN to rank potential TFs in terms of their disease activity and likelihood of regulating the biomarker. Finally, NeTFactor uses an innovative LASSO-based optimization approach to determine the minimal set of TFs that most significantly and exclusively regulate the genes in the biomarker. Our application of NeTFactor to an accurate gene expression-based asthma biomarker identified ETS translocation variant 4 (ETV4) and peroxisome proliferator-activated receptor gamma (PPARG) as the biomarker’s most significant TF regulators. siRNA-based knockdown of each of these TFs in an airway epithelial cell line model demonstrated significant reduction of cytokine expression relevant to asthma, validating NeTFactor’s top-scoring findings. While PPARG has been associated with airway inflammation, ETV4 has not yet been implicated in asthma, thus indicating the possibility of novel, disease-relevant discovery by NeTFactor. 
These results illustrate that the application of NeTFactor to multi-gene expression-based biomarkers could yield valuable insights into disease-relevant regulatory mechanisms and biological processes, allowing us to gain more from biomarkers beyond their main role as classifiers or predictors.
3:45 PM Systematic characterization of a set of variants from heterogeneous information Xiaoman Xie, Casey Hanson, Saurabh Sinha
Genotype-to-phenotype studies, e.g., GWAS or family-based studies, identify sets of genomic variants associated with diseases. These variants must then be interpreted mechanistically, i.e., in terms of the molecular pathways or regulatory interactions impacted by them. For non-coding variants, the majority of GWAS findings, such mechanistic interpretation is challenging for several reasons, including a) difficulty of predicting their immediate functional impact, e.g., on transcription factor (TF)-DNA binding or chromatin state, b) even greater difficulty in predicting their impact on gene expression, and c) dependence of impact on cellular contexts, e.g., tissues relevant to the phenotype. In light of these formidable challenges in the field of single non-coding variant interpretation, a pragmatic related goal is to discover the system-level insights that a set of phenotypic variants point to, for example, common regulators, driver genes and pathways that several variants in the set are associated with. Such insights are especially useful in studies of complex diseases where no single variant explains etiology. Here we provide a new method for this ‘SNP set characterization’ task, called ‘VarSAn’ (Variant Set Analysis), that uses graph random walk-based methods to identify mechanistic properties such as pathways and regulators relevant to a given set of variants. Instead of annotating individual variants or performing an enrichment test for a single type of annotation, our tool aggregates diverse annotations of a collection of variants, along with prior knowledge about genes, TFs and pathways, to provide systems-level insights into those variants. Furthermore, VarSAn anticipates that most input SNP sets may include a large number of SNPs that are not related to the phenotype under study and only an unknown ‘core’ of SNPs within the set will share common mechanistic associations. 
It addresses this challenge by using an iterative, SNP set-trimming algorithm to find a subset of the input set that reveals the most pronounced shared mechanisms. The first step of VarSAn is to set up a heterogeneous network whose nodes represent SNPs, TFs, genes and a suitable collection of pathways, and edges representing annotations of and relationships among these various entities, including SNP-gene associations (based on eQTL studies, genomic proximity or 3D interactions), SNP-TF connections (based on predicted binding impact), TF-gene relationships (based on regulatory networks, if known), gene-gene associations (based on physical interactions of encoded proteins or genetic interactions) and gene-pathway membership. The second step takes a set of variants/SNPs as the ‘query set’ and uses Random Walk with Restarts on the network to rank nodes for relevance to the query set. This step ranks pathways, TFs, genes and SNPs separately, thereby providing information on various mechanistic features shared by the provided SNP set. To test VarSAn, we analyzed a data set of ~300 lymphoblastoid cell lines for which cytotoxicity measurements are available in response to different treatments, along with genotype and gene expression data, allowing us to construct the above-mentioned heterogeneous network. We first tested the method on semi-synthetic data, where we sampled a set of variants associated with a pathway, added varying levels of noise (random SNPs) to it and then applied VarSAn on those sets to test if the original pathway is recovered. The method is able to accurately rank the expected pathway at the top even when including 10 times as many random SNPs (noise) as truly associated SNPs. We then used VarSAn to characterize GWAS SNPs of Gemcitabine and Radiation treatments. 
This revealed various interleukin signaling pathways and the nectin adhesion pathway as associated with Gemcitabine GWAS SNPs and DNA damage repair-related pathways such as p53 when GWAS SNPs of Radiation were analyzed. The results agree with molecular mechanisms known about the respective treatments.
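The ranking step described above can be illustrated with a minimal sketch of Random Walk with Restarts on a toy heterogeneous network. This is not the VarSAn implementation: the node names, the `type:name` labelling convention, the restart probability of 0.5, and the helper names `random_walk_with_restart` and `rank_nodes` are all assumptions made for illustration, and the SNP set-trimming iteration is not shown.

```python
def random_walk_with_restart(edges, query, restart=0.5, tol=1e-10, max_iter=1000):
    """Return steady-state visit probabilities for a walk that restarts at the query set."""
    # Build an undirected adjacency map from the edge list.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    missing = set(query) - set(neighbors)
    if missing:
        raise ValueError(f"query nodes not in network: {sorted(missing)}")

    nodes = sorted(neighbors)
    # Restart distribution: uniform over the query SNPs.
    restart_vec = {n: (1.0 / len(query) if n in query else 0.0) for n in nodes}
    prob = dict(restart_vec)
    for _ in range(max_iter):
        nxt = {n: restart * restart_vec[n] for n in nodes}
        for u in nodes:
            # Each node passes its non-restart mass equally to its neighbors.
            share = (1.0 - restart) * prob[u] / len(neighbors[u])
            for v in neighbors[u]:
                nxt[v] += share
        delta = sum(abs(nxt[n] - prob[n]) for n in nodes)
        prob = nxt
        if delta < tol:
            break
    return prob


def rank_nodes(scores, prefix):
    """Rank nodes of one type (selected by label prefix) by descending score."""
    picked = [(n, s) for n, s in scores.items() if n.startswith(prefix)]
    return sorted(picked, key=lambda item: item[1], reverse=True)


# Hypothetical toy network: SNP-gene, TF-gene, gene-gene and gene-pathway edges.
toy_edges = [
    ("snp:rs1", "gene:A"), ("snp:rs2", "gene:A"), ("snp:rs2", "gene:B"),
    ("snp:rs3", "gene:C"), ("tf:T1", "gene:A"), ("gene:B", "gene:C"),
    ("gene:A", "pathway:X"), ("gene:B", "pathway:X"), ("gene:C", "pathway:Y"),
]
toy_scores = random_walk_with_restart(toy_edges, {"snp:rs1", "snp:rs2"})
pathway_ranking = rank_nodes(toy_scores, "pathway:")
```

Ranking each node type separately, as `rank_nodes` does by prefix, mirrors the abstract's description of ranking pathways, TFs, genes and SNPs separately for the same query set.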
4:00 PM Characterization of clonal evolution in microsatellite unstable metastatic cancers through multi-regional tumor sequencing Russell Bonneville, Lianbo Yu, Julie Reeser, Thuy Dao, Michele Wing, Hui-Zi Chen, Melanie Krook, Jharna Miya, Eric Samorodnitsky, Amy Smith, Nicholas Nowacki, Sameek Roychowdhury
Microsatellites are short, repetitive segments of DNA, dispersed throughout the human genome. Microsatellite instability (MSI) occurs when cells are unable to regulate the length of their microsatellites during cell division, due to defects in the mismatch repair (MMR) system. MSI has been identified in several human cancer types, most notably colorectal and endometrial cancer in association with Lynch syndrome. Of clinical interest, microsatellite instability-high (MSI-H) tumors have been shown to exhibit increased sensitivity to immune-enhancing therapies such as PD-1 inhibition. Although next generation sequencing (NGS) has permitted advancements in cancer type-agnostic detection of MSI, the heterogeneity and evolution of microsatellite changes in MSI-positive tumors remain poorly described, and can potentially affect the accuracy of MSI-H detection and the efficacy of immunotherapy. Furthermore, recent computational advances have enabled identification and characterization of subclonal cancer cell populations with bulk tumor DNA sequencing, along with deconvolution of subclonal phylogeny to model clonal evolution. Here we aim to infer the distribution of microsatellite lengths within tumor subclones in order to assess evolution of microsatellite instability in the context of tumor heterogeneity. We performed whole exome sequencing on multiple tumor samples acquired from biopsies and/or surgical resections of 2 patients with MSI-H malignancies. Per-sample microsatellite instability status and distributions of microsatellite lengths were determined using MANTIS. Single-nucleotide variations (SNVs) and insertions/deletions (indels) were called using VarScan2, and allele-specific copy number variations (CNVs) determined using FALCON. Mutational signatures were called with deconstructSigs. We used Canopy to infer subclonal phylogeny and per-sample clonal fractions. 
Using the microsatellite distributions from MANTIS and clonal fractions from Canopy, we developed a simulated annealing-based optimization method to estimate the per-subclone distributions of microsatellite lengths. We have sequenced multiple tumor samples from two patients with known MSI-H malignancies, and MANTIS confirmed all samples as microsatellite unstable. Canopy identified four or eight subclones per patient. Each subclone exhibited at least one of four MSI-associated mutational signatures (6, 15, 20, 26), with the presence of these signatures in multiple tree branches indicating that microsatellite instability continues to introduce novel mutations throughout the disease course. Patient 1 (colon cancer) had early somatic mutations in CTNNB1 and KRAS, along with a germline mutation in the MMR gene MLH1 responsible for Lynch syndrome and MSI. Patient 2 (prostate cancer) developed early somatic mutations in LYST, TP53 and the MMR gene MSH6. Within each patient, we modeled the evolution of 663 to 1098 microsatellite loci. We noted a trend of increasing instability with time, with subclonal MANTIS score correlating with mutational load (r^2 = 0.68, p = 0.02). Within each patient, we identified a set of loci relatively unstable within all clones (“ubiquitously unstable”), with other loci only unstable in some clones (“subclonally unstable”) or not unstable in any clone (“stable”). Clones which diverged earlier in tumor evolution demonstrated reduced correlation of standard deviations of microsatellite length compared to more recently diverged clones. We also note that microsatellite lengths tend to shorten relative to the germline (average subclonal median length contraction 0.63 ± 0.26). This study provides the first investigation into microsatellite instability in a subclonal context. 
These results provide new insights into microsatellite instability as a dynamic mutagenic process operative in mismatch repair-deficient malignancies and resulting in tumor heterogeneity. We aim to expand this study to more patients to identify recurrent ubiquitously unstable microsatellite loci, and to include patients with immunotherapy-refractory MSI-H malignancies to assess clonal differences in MSI as a potential contributor to immunotherapy resistance.
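The abstract does not give the details of the simulated annealing method, so the following is only a simplified sketch of the general idea under stated assumptions: each subclone is reduced to a single integer repeat length at one locus (rather than a full length distribution), each sample's observed mean length is modelled as the clonal-fraction-weighted average of the subclone lengths, and the linear cooling schedule and all function names are hypothetical.

```python
import math
import random


def predicted_means(clone_lengths, sample_fractions):
    """Expected mean length per sample: clonal-fraction-weighted average."""
    return [sum(f * l for f, l in zip(fracs, clone_lengths))
            for fracs in sample_fractions]


def cost(clone_lengths, sample_fractions, observed_means):
    """Sum of squared differences between predicted and observed sample means."""
    pred = predicted_means(clone_lengths, sample_fractions)
    return sum((p - o) ** 2 for p, o in zip(pred, observed_means))


def anneal(sample_fractions, observed_means, max_len=40, steps=20000, t0=1.0, seed=0):
    """Search integer per-subclone lengths; returns (best, best_cost, start_cost)."""
    rng = random.Random(seed)
    n_clones = len(sample_fractions[0])
    state = [rng.randint(1, max_len) for _ in range(n_clones)]
    current = cost(state, sample_fractions, observed_means)
    start_cost = current
    best, best_cost = list(state), current
    for step in range(steps):
        # Linear cooling; the small constant avoids division by zero.
        temperature = t0 * (1.0 - step / steps) + 1e-9
        candidate = list(state)
        i = rng.randrange(n_clones)
        # Propose lengthening or shortening one subclone by one repeat unit.
        candidate[i] = min(max_len, max(1, candidate[i] + rng.choice((-1, 1))))
        new_cost = cost(candidate, sample_fractions, observed_means)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if new_cost <= current or rng.random() < math.exp((current - new_cost) / temperature):
            state, current = candidate, new_cost
            if current < best_cost:
                best, best_cost = list(state), current
    return best, best_cost, start_cost


# Hypothetical inputs: three samples, three subclones (each row sums to 1).
fractions = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
observed = predicted_means([12, 9, 15], fractions)
estimate, final_cost, initial_cost = anneal(fractions, observed)
```

In this sketch the per-sample clonal fractions play the role of the Canopy output and the observed means stand in for the MANTIS length distributions. Tracking the best state separately from the current state means the returned solution is never worse than the random starting point, even though the walk itself can move uphill.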
4:15 PM iFunMed: Integrative Functional Mediation Analysis of GWAS and eQTL Studies Constanza Rojo, Qi Zhang, Sunduz Keles
Genome-wide association studies (GWAS) have successfully identified thousands of genetic variants contributing to disease and other phenotypes. However, significant obstacles hamper our ability to elucidate causal variants, identify genes affected by causal variants, and characterize the mechanisms by which genotypes influence phenotypes. The increasing availability of genome-wide functional annotation data through large consortia projects is providing unique opportunities to incorporate prior information into the analysis of GWAS to better understand the impact of variants on disease etiology, either to boost signal-to-noise in association analysis or to prioritize SNPs and leverage sub-threshold variants. Although there have been many advances in incorporating prior information into prioritization of trait-associated variants in GWAS, functional annotation data have played a secondary role in the joint analysis of GWAS and molecular (i.e., expression) quantitative trait loci (eQTL) data in assessing evidence for association. To address this, we develop a novel mediation framework, iFunMed, to integrate GWAS and eQTL data with publicly available epigenome and regulation-based large-scale functional annotation data from consortia projects such as ENCODE and the Roadmap Epigenomics Project, without requiring raw subject-level data. iFunMed extends the scope of standard mediation analysis by incorporating information from multiple genetic variants at a time and leveraging variant-level summary statistics. The iFunMed model is fit in a computationally feasible way by taking advantage of variational methodologies. The key outputs of iFunMed include posterior probabilities of inclusion for each SNP (i.e., the probability that a given SNP has a non-zero effect) for both the direct and the mediation model, and effect size estimates. 
We accompany iFunMed with a dimension reduction approach that screens the annotations before building the data-driven priors, motivated by the fact that a large proportion of the annotations exhibit little to no association with the summary statistics. The screening strategy has a well-calibrated Type I error rate and good power. Data-driven computational experiments show how relevant annotation information improves SNP detection for both the direct and indirect effects in the mediation analysis, increasing the areas under the receiver operating characteristic (AUROC) and precision-recall (PR) curves by up to 20%, and highlight the robustness of iFunMed to the use of irrelevant annotations. Our application to Framingham Heart Study data focused on blood-related phenotypes and compared iFunMed fits that integrate regulatory information with those that do not. Use of a large collection of publicly available annotations identified a number of additional SNPs that are missed in the mediation analysis without annotation but are well supported by independent studies and potentially impact the binding of transcription factors, either by destroying binding sites or by creating new ones.
4:30 PM A Pathway Perspective on Drug Response for Targeted Therapies in Acute Myeloid Leukemia Aurora Blucher, Steve Kurtz, Cristina Tognon, Brian Druker, Guanming Wu, Shannon McWeeney
To match patients with targeted therapies, we require a better understanding of how patient genetic variability affects drug response. In diseases with high levels of heterogeneity, this can be a challenging task because patients harbor very different genetic aberrations. In these cases, we require models that can capture individual patient variability and also allow us to investigate higher level shared dysregulation across patients. Here, we use patient-specific pathway modeling for the BeatAML cohort of acute myeloid leukemia patients to investigate variability in patient drug response. Acute myeloid leukemia is well-known for extreme heterogeneity and patients often show poor response to therapy. We take a pathway perspective on mutational aberrations in AML, and show that pathway mutational status is related to drug response for many patient subgroups. We further expand the pathway perspective by using a probabilistic graphical modeling framework to model how patient gene mutations impact these pathways. This framework allows us to investigate in a more mechanistic manner how mutations in different genes can result in differing systems-level effects. Concurrently, we model how therapeutic drugs impact the same set of pathways by leveraging drug-target information from the Cancer Targetome resource. By uniting mutation-impacted pathways and drug-impacted pathways, we can investigate why certain patient subgroups respond differently to drugs. Importantly, this approach allows us to generate very specific, mechanism-based hypotheses about how patient-level mutations result in dysregulated pathway signaling and affect drug response. Because this framework enables us to investigate how pathway dysregulation is tied to differing patient drug response, it is highly applicable to diseases outside the cancer domain. 
This computational framework is key for translational pipelines to match patients with targeted therapies, such as drug screen development and clinical trials. B.J.D. potential competing interests-- SAB: Aileron Therapeutics, ALLCRON, Cepheid, Vivid Biosciences, Celgene, Gilead Sciences (inactive), Baxalta (inactive), Monojul (inactive); SAB & Stock: Aptose Biosciences, Blueprint Medicines, Beta Cat, Third Coast Therapeutics, GRAIL (inactive), CTI BioPharma (inactive); Scientific Founder: MolecularMD (inactive, acquired by ICON); Board of Directors & Stock: Amgen; Board of Directors: Burroughs Wellcome Fund, CureOne; Joint Steering Committee: Beat AML LLS; Clinical Trial Funding: Novartis, Bristol-Myers Squibb, Pfizer; Royalties from Patent 6958335 (Novartis exclusive license) and OHSU and Dana-Farber Cancer Institute (one Merck exclusive license)
4:45 PM GraPhyC: Using Consensus to Infer Tumor Evolutionary Histories Layla Oesper, Kiya Govek, Camden Sikes
Tumors evolve as part of an evolutionary process where distinct sets of somatic mutations accumulate in different cell lineages descending from an original founder cell. A better understanding of how such tumor lineages evolve over time, which mutations occur together or separately, and in what order these mutations were gained may yield important insight into cancer and how to treat it. Thus, in recent years there has been an increased interest in computationally inferring the evolutionary history of a tumor -- that is, a rooted tree where vertices represent populations of cells that have a unique complement of somatic mutations and edges that represent ancestral relationships between these populations. However, some inference methods may return multiple possible evolutionary histories for a single patient, and different methods run on the same dataset often produce different evolutionary histories. Thus, a method that is able to combine information across multiple distinct tumor evolutionary reconstructions may be able to provide a more accurate reconstruction of a tumor's history. In this work we consider the problem of finding a consensus tumor evolutionary tree from a set of conflicting input trees. In contrast to traditional phylogenetic trees, the tumor evolutionary trees we consider contain features such as mutation labels on internal vertices (in addition to the leaves) and allow multiple mutations to label a single vertex. Mutation labels indicate when a somatic mutation first arose. We recently published our GraPhyC algorithm that solves the consensus problem using a weighted directed graph where vertices are sets of mutations and edges are weighted based on the number of times a parental relationship is observed between their constituent mutations in the input trees (Govek et al., 2018). We return the minimum weight spanning arborescence in this graph as the consensus tree. 
We describe several distance measures between these tumor evolutionary trees, and prove that our GraPhyC algorithm minimizes the total distance to all input trees for one of these distance measures. We also show that GraPhyC can be computed efficiently and describe how GraPhyC can be used in other contexts, such as clustering sets of tumor evolutionary histories that are consistent with a single patient's data, or across different patients. These applications may provide further insight into tumor development. We evaluate GraPhyC using both simulated data and two real sequencing datasets. On simulated data we show that our method outperforms a baseline method that returns the input tree with the smallest total distance to all other input trees. Using a set of tumor trees derived from both whole-genome and deep sequencing data from a Chronic Lymphocytic Leukemia patient we find that GraPhyC identifies a tree not included in the set of input trees, but that contains characteristics supported by other reported evolutionary reconstructions of this tumor. On a set of tumor trees derived by several different methods on a single cell sequencing dataset for a triple-negative breast cancer patient, we find that GraPhyC is able to effectively handle discordance between the set of input trees. Thus, we see that GraPhyC is a versatile approach that can be applied to tumor evolutionary trees in several different contexts. An implementation of GraPhyC is available at https://bitbucket.org/oesperlab/graphyc.
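The consensus construction described above can be sketched on a toy example. This is not the published GraPhyC code (see the repository linked above for that). It makes three labelled assumptions: every vertex carries a single mutation, an edge's weight is the number of input trees minus the number of trees containing that parent-child pair, and the minimum weight spanning arborescence is found by brute-force enumeration, which is only practical for a handful of mutations. The root label `germline` and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

ROOT = "germline"  # hypothetical label for the mutation-free root vertex


def edge_support(trees):
    """Count how many input trees contain each (parent, child) relationship."""
    counts = Counter()
    for tree in trees:
        for child, parent in tree.items():
            counts[(parent, child)] += 1
    return counts


def reaches_root(parent_of, node):
    """True if following parent pointers from node ends at ROOT without a cycle."""
    seen = set()
    while node != ROOT:
        if node in seen:
            return False
        seen.add(node)
        node = parent_of[node]
    return True


def consensus_tree(trees):
    """Brute-force minimum weight spanning arborescence rooted at ROOT."""
    counts = edge_support(trees)
    mutations = sorted({m for tree in trees for m in tree})
    candidates = [ROOT] + mutations
    best, best_weight = None, None
    for parents in product(candidates, repeat=len(mutations)):
        parent_of = dict(zip(mutations, parents))
        if not all(reaches_root(parent_of, m) for m in mutations):
            continue  # not a tree rooted at ROOT
        # Assumed weighting: edges seen in fewer input trees cost more.
        weight = sum(len(trees) - counts[(p, c)] for c, p in parent_of.items())
        if best_weight is None or weight < best_weight:
            best, best_weight = parent_of, weight
    return best, best_weight


# Four conflicting input trees, each given as a child -> parent map.
input_trees = [
    {"A": ROOT, "B": "A", "C": "B"},
    {"A": ROOT, "B": "A", "C": ROOT},
    {"A": ROOT, "B": ROOT, "C": "A"},
    {"A": ROOT, "C": "A", "B": "C"},
]
tree, weight = consensus_tree(input_trees)
# tree == {"A": "germline", "B": "A", "C": "A"}, weight == 4
```

In this toy case the consensus places both B and C under A, a tree that is not among the four inputs. For realistic numbers of mutations, a polynomial-time arborescence algorithm such as Chu-Liu/Edmonds would replace the enumeration.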

Varsity Hall I on Monday, May 20, 2019

Start Time Title Author(s)
Education I
Chair: Anna Ritz
11:00 AM Interdisciplinary biology education: a holistic approach to an intractable problem Kristin Jenkins
Bioinformatics, quantitative biology and other important emerging disciplines like data science are described as interdisciplinary - combining a variety of skills and practices from multiple sources to generate novel approaches to biological problems. Each of these disciplines is a potentially effective route for teaching many core biological concepts, and no biology education would be complete without exposure to these disciplines. However, it is unclear how each of these disciplines can receive the appropriate level of attention in an already over-full biology curriculum. How is a faculty member (or department) to incorporate experience with all these important disciplines to prepare students for the 21st Century workforce? Could extending our interdisciplinary approach to research benefit biology education? Vision and Change provides an overarching guide to what all biology students should know, categorizing this knowledge as core concepts and competencies. These broad categories describe topics and practices required in many disciplines, including bioinformatics and quantitative biology, such as using models and hypothesis testing. Transferring and applying knowledge in different situations is challenging for students, but is a key skill in interdisciplinary sciences. Exposure to multiple disciplines could be leveraged to support both the development of core skills and knowledge and the ability to transfer and apply knowledge in different scenarios. Such an interdisciplinary approach will require communities of practice that can support collaboration and communication between disciplinary educators. The ability to share resources, provide professional development and exchange ideas on effective pedagogical approaches will reduce barriers to interdisciplinary teaching. 
The Quantitative Undergraduate Biology Education (QUBES) project is an example of this type of community of practice, where bioinformatics and quantitative biology faculty have come together to improve biology education.
11:20 AM Expanding Undergraduate Participation in Computational Biology: Resources and Lessons Learned from a Hands-on Workshop Layla Oesper
Computational biology is an exciting and ever-widening interdisciplinary field. Expanding the participation of undergraduate students in this field will help to inspire and train the next generation of scientists necessary to support this growing field. However, students at smaller institutions, such as those focused on undergraduate education, may not have access to faculty or even courses related to computational biology at their home institutions. Providing more opportunities for all undergraduate students to be exposed to the wide variety of subfields within computational biology will be important for ensuring these students are included in the pipeline of scientists contributing to this field. To this end, we hosted a computational biology workshop that brought together undergraduate students from three different Midwest liberal arts colleges. The goal of the workshop was to provide an introduction to how computer science can be used to help answer important problems in biology. A diverse set of six faculty members from different institutions each put together a hands-on module introducing a different area, which they taught to the students at the workshop. In this talk, I will discuss the lessons learned from this undergraduate computational biology workshop, and the workshop materials that are freely available to the larger computational biology community.
11:40 AM Network for Integrating Bioinformatics into Life Sciences Education (NIBLSE): Recent Activities William Morgan
The Network for Integrating Bioinformatics into Life Sciences Education (NIBLSE, pronounced “nibbles”) seeks to integrate bioinformatics education into undergraduate life science curricula. To this end, this NSF-funded project has established a network of educators dedicated to this vision, identified barriers to the integration process, developed a suite of community-endorsed, bioinformatics core competencies necessary for today’s life science graduates, and curated a collection of learning resources that address these competencies. Current efforts to foster bioinformatics education in undergraduate life science education include an active incubator system for nurturing learning resources, a QUBES-supported Faculty Mentoring Network for supporting educators integrating bioinformatics, and an initiative to help instructors assess their bioinformatics educational efforts. The 2nd NIBLSE Conference on Implementation and Sustainability in October 2019 will provide a forum for interested bioinformatics educators to consider how these efforts can be further deepened and sustained.
Education II
Chair: Catie Welsh
1:30 PM Increasing data analysis skills in the pediatric cancer community with the Childhood Cancer Data Lab training workshops Candace Savonen, Deepashree Prasad, Casey Greene and Jaclyn Taroni
The vast amount of genomic data generated each year holds valuable information about the underlying biology of complex diseases. Often biomedical researchers are not readily equipped to use these types of data to answer their biological questions of interest. The Childhood Cancer Data Lab (CCDL) is an initiative of Alex’s Lemonade Stand Foundation (ALSF), an organization devoted to fighting childhood cancer that has funded almost 1000 grants at 135 institutions. The CCDL was founded in late 2017 to empower childhood cancer researchers to harness the power of “big data.” Here, we present our early experiences designing, implementing, and executing short, 3-day training workshops centered on gene expression analysis for pediatric cancer researchers with little to no experience in bioinformatics or programming as part of the CCDL.

We identified RNA-seq analysis as a major area of need in the pediatric cancer community based on an online survey of primarily pediatric cancer-focused researchers and discussions with researchers at Alex’s-focused and national meetings. Accordingly, we constructed our workshop curriculum to prepare researchers to perform processing and analysis of transcriptomic data with an emphasis on reproducibility. The workshop is done in an interactive, small-group setting (20 participants or fewer). Participants are primarily drawn from ALSF-funded research groups and others in the pediatric cancer field.

All analyses are conducted within a Docker container prepared by CCDL staff to promote a reproducible, reusable software stack. We use the download, quality control (FastQC), and quantification of RNA-seq data (Salmon) as an opportunity to introduce the command line and shell scripting for reproducibility. Participants use the R programming language and Bioconductor to perform downstream analyses on childhood cancer-specific data, such as differential expression analyses and hierarchical clustering. We use R Notebooks that are prepared by CCDL staff in part to emphasize documenting computational results at the time that they are obtained.
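As a rough illustration of scripting the quality control and quantification steps for reproducibility, the sketch below builds the FastQC and Salmon command lines for one sample and renders them as a shell script. It is not taken from the CCDL training materials: the file paths, the single-end read layout, the pre-built Salmon index directory, and the helper names are assumptions. The commands are only assembled here, not executed.

```python
import shlex
from pathlib import PurePosixPath


def rnaseq_commands(fastq, index_dir, out_dir):
    """Build QC and quantification commands for one single-end FASTQ file."""
    sample = PurePosixPath(fastq).name.split(".")[0]
    qc_dir = PurePosixPath(out_dir) / "fastqc"
    quant_dir = PurePosixPath(out_dir) / "salmon" / sample
    return [
        # FastQC expects its output directory to exist already.
        ["mkdir", "-p", str(qc_dir)],
        ["fastqc", "--outdir", str(qc_dir), str(fastq)],
        # "-l A" asks Salmon to infer the library type; "-r" marks single-end reads.
        ["salmon", "quant", "-i", str(index_dir), "-l", "A",
         "-r", str(fastq), "-o", str(quant_dir)],
    ]


def as_shell_script(commands):
    """Render the commands as a bash script that stops at the first failure."""
    lines = ["#!/usr/bin/env bash", "set -euo pipefail"]
    lines += [shlex.join(cmd) for cmd in commands]
    return "\n".join(lines) + "\n"


# Hypothetical paths, for illustration only.
script_text = as_shell_script(
    rnaseq_commands("data/sample1.fastq.gz", "index", "results")
)
```

Writing the generated script to a file and keeping it under version control alongside the results is one way to record exactly which commands produced a given set of quantifications.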

A portion of time near the end of the workshop is set aside for researchers to bring in their own data or identify publicly available data relevant to their scientific question of interest. This allows CCDL staff to provide help with situation-specific issues. During registration and a pre-workshop survey for accepted participants, we ask about their research question, what kind of data they have, and what challenges they have been encountering. Participants use the processing and analysis steps that they have learned in prior modules and present their results to the rest of the group.

All participants of the pilot workshop said they would recommend the training to their peers on a post-workshop quality improvement questionnaire. Based on the questionnaire from this pilot, we further refined the curriculum to include introduction to R and single-cell RNA-seq modules. Ideally, future workshops would include modules that emphasize version control and reproducibility beyond data analysis (e.g., effectively sharing wet lab research products such as experimental protocols). The workshop’s curriculum is publicly maintained and updated on Alex’s Lemonade GitHub (https://github.com/AlexsLemonade/training-modules). The larger vision is to train and support individuals at other institutions to use our curricula to host their own workshops. This is a scalable solution to increase bioinformatics skills throughout the childhood-cancer research community. ALSF will encourage the expansion of these workshops by providing financial and administrative support. The CCDL will continue to develop these workshops in efforts to catalyze the search for childhood cancer cures by equipping more researchers with foundational bioinformatic skills.
1:50 PM Bioinformatics in the Library: Bridging the Skills Gap for Biomedical Researchers Pamela Shaw, Matthew Carson, Sara Gonzales, Kristi Holmes, Robin Champieux and Ted Laderas
Computational skills training for the biomedical research workforce can be challenging. Informatics courses for graduate students have to compete for space in crowded biomedical sciences curricula; and established researchers—principal investigators, postdoctorates, research staff, and research faculty—approach training from a variety of computational competency levels. Graduate students, faculty, and staff alike often lack technical skills in basic computer literacy, programming languages, data management and analysis, and data workflow management. There is a need for extracurricular training to bridge these skills gaps. One-time workshops and boot camps provide a large amount of information in a short time, but knowledge gained from these workshops is lost quickly. One solution to overcome these gaps and knowledge losses is to develop extracurricular programs that provide skill-building sessions at a variety of levels of computational competence throughout the year, supplemented by online, self-paced training materials. Such educational variety can best be achieved by collaborative partnerships between campus units and resource centers.

Library-based informatics programs have been in place at several universities and medical schools for over fifteen years. These programs are staffed by Master’s or PhD level science graduates, and provide training and consultation in a variety of computational skills. The library is a perfect partner in providing supplemental computational skills training for researchers: it is a neutral, trusted entity; it is known for knowledge management; and the library has strong collaborative partnerships with other campus centers, core facilities, and research computing services.

We present the library as a primary point of contact for informatics and data management. The library offers consultation and training to researchers for their computational needs. In cases where longer-term or more intensive support is needed, the library provides referral services to core facilities and specialists on campus.

We also present BioData Club: a kit of resources available on GitHub. BioData Club was developed by Oregon Health and Science University under the Clinical Data to Health (CD2H) cooperative agreement. A goal of the CD2H educational initiative is to pilot and implement the BioData Club kit at CTSA institutions and other academic health sciences sites. The kit provides templates and guidance for establishing a BioData Club at an institution, along with links to repositories of developed instructional materials. It is hoped that the kit will expand with each institution’s instance to provide a wide variety of instructional and communication materials for improving computational competence among biomedical researchers.
2:10 PM The ml4bio Workshop: Machine Learning Literacy for Biologists Chris Magnano, Fangzhou Mu, Debora Treu and Anthony Gitter
Machine learning has been incredibly successful in mining large-scale biological datasets. Despite its popularity among computational researchers, machine learning remains elusive to experimental biologists, who form the majority of the life sciences research community, leaving powerful computational tools underappreciated and data generated in wet labs underexplored. Recent years have seen a growing interest among biology trainees to embark on machine learning projects that complement their research. However, most machine learning courses and tutorials require substantial background knowledge in coding and mathematics, which many biologists may lack. On the other hand, bioinformatics workshops for biologists assume less coding experience, but participants are often taught to mechanically run through a software pipeline for certain tasks without learning the best practices in various stages of the workflow. Such an approach, though effective in the short term, can lead to error-prone data analysis, misinterpretation of results, and difficulty in adapting to other tasks in the long run of a scientist’s research effort. The community clearly needs to explore novel educational frameworks in order to address these challenges in teaching machine learning to biologists. Unlike traditional task-centric approaches, our educational objective is to equip biologists with the proper mindset when it comes to applying machine learning in their research and the ability to critically analyze machine learning applications in their domain. Built around this core idea, our ml4bio workshop prioritizes teaching machine learning literacy, that is, the right way to set up learning problems, how to reason about learning algorithms, and how to assess learned models. We have developed interactive software with a graphical interface and a set of accompanying slides and tutorials for use during workshop sessions. 
The software and interactive exercises guide participants through a full cycle of the machine learning workflow while doing proper model training, validation, selection, and testing. By following instructions in the slides and tutorials, participants build intuition about the strengths and weaknesses of various model classes and evaluation metrics by visualizing model behavior under different data distributions and sets of model hyperparameters. We further attempt to mind the gap between theory and practice through illustration of machine learning applications on real biological tasks. Overall, our approach encourages beginners to take a holistic view of the machine learning workflow rather than immediately dive into the technicalities of coding and mathematics. We have successfully offered two pilot workshops attended by graduate students and postdocs with diverse backgrounds and research interests. The feedback we collected provides strong preliminary evidence on the effectiveness of our approach. Moving forward, our short-term plan is to tailor the workshop material to better serve our educational objective and the needs of participants. The current version of the software only supports classification models. For future releases, we will expand the set of models to include those for regression and clustering. We are also looking for new biological case studies that highlight good and bad practices of machine learning in the biological literature. Our long-term software development plan is to more closely link the ml4bio graphical interface and the Python scikit-learn code on which it is built in order to guide participants who wish to later customize their own machine learning pipeline. Our ultimate goal is the national distribution of the workshop. 
As an initial step towards this end, we are working closely with educators and facilitators on and off campus to outline a timetable on future workshop development and to adopt best practices of successful workshops such as Software and Data Carpentry. Our workshop materials are available at https://github.com/gitter-lab/ml-bio-workshop/ under the CC-BY-4.0 license and our ml4bio software is available at https://github.com/gitter-lab/ml4bio/ and PyPI under the MIT license.
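The train, validate, select, and test cycle described in the abstract above can be illustrated with a minimal sketch. This is not the ml4bio software or its scikit-learn code; the toy data, the k-nearest-neighbor classifier, and all function names are assumptions chosen to keep the example dependency-free.

```python
import random


def knn_predict(train, query, k):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


def accuracy(train, data, k):
    """Fraction of examples in `data` classified correctly using `train` as the reference set."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)


def make_toy_data(n, rng):
    """Two overlapping 2D Gaussian classes (hypothetical data, not a biological dataset)."""
    data = []
    for _ in range(n):
        label = rng.choice([0, 1])
        center = 0.0 if label == 0 else 2.0
        data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return data


rng = random.Random(0)
data = make_toy_data(300, rng)

# Split once into three disjoint sets: fit on train, choose the
# hyperparameter on validation, and touch the test set only at the end.
train, validation, test = data[:180], data[180:240], data[240:]

candidate_ks = [1, 3, 5, 9, 15]
val_scores = {k: accuracy(train, validation, k) for k in candidate_ks}
best_k = max(candidate_ks, key=lambda k: val_scores[k])

# The test accuracy is computed once, for the selected model only.
test_accuracy = accuracy(train, test, best_k)
print(best_k, test_accuracy)
```

Keeping the test set out of model selection is the point of the sketch: the validation scores guide the choice of k, and the test score is reported once for the chosen setting.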
Education III
Chair: Pamela Shaw
3:00 PM Conference-based Undergraduate Experiences: Lowering the Barrier for Learning about Computational Biology Anna Ritz
Computational training is becoming critical for undergraduates who wish to pursue careers in biology. While there are notable exceptions, many small schools (such as primarily undergraduate institutions) do not have the resources or staffing to offer computational biology training within biology departments. Conference attendance can broaden undergraduate participation in computational biology for these resource-limited institutions. Attending computational biology conferences can educate students about computer science applications within biology, empower students with a unique opportunity that few undergraduates obtain, and provide a platform for faculty from other institutions to interact with strong interdisciplinary undergraduates. I argue that this opportunity, which is usually reserved for seniors who contribute to a faculty member's research, should be made available to students in their first few years of college as they explore majors and career paths. I will describe how I integrated conference attendance into an upper-level undergraduate course, show preliminary data assessing the conference experience, and share lessons I learned when helping undergraduates navigate conferences.
3:20 PM Teaching introductory bioinformatics with Jupyter notebook-based active learning Colin Dewey
With growing evidence that active learning is more effective than traditional lecturing with respect to student performance in STEM, there has been increasing interest within the bioinformatics community to adopt active learning approaches within our courses. Active learning can take many forms, from brief in-class problems or quizzes interspersed between segments of a traditional lecture, to “flipped” classrooms, in which students watch video lectures outside of class and participate in activities during the class period. Along these lines, within the bioinformatics community there has been work in developing materials supporting active learning, including the creation and sharing of video lectures and programming problem-based activities.

To experiment with such active learning approaches in teaching undergraduate-level bioinformatics, I recently revamped the course “Introduction to Bioinformatics” at the University of Wisconsin-Madison, turning what had previously been a traditional lecture-based class into a largely flipped classroom. With a focus on computer science and statistics foundations, this course covers the topics of sequence assembly, sequence alignment, phylogenetic trees, genome annotation, clustering, and biological network analysis. Prior to each class period, the students were asked to watch one or more short video lectures, complete an assigned reading, take a short online quiz, and submit questions to an online discussion board. After a short discussion of the most common questions from the pre-class material, the bulk of the in-class time was spent with the students completing programming or written problems within cloud-based Jupyter notebooks. The other components of the course, homework and exams, were kept comparable to those used in previous versions of the course.

The most novel aspect of the revamped course was the set of over 30 Jupyter notebooks (Python kernel) that the students completed as part of their in-class activities. For the typical class period, the students were presented with a new notebook template that contained an average of three problems that they were to complete for a small yet non-negligible part of their overall grade. The most common format of a problem was to fill in the definition of a Python function that performed some subtask of an algorithm that had been covered in the pre-class materials. Other common problems involved visualizing the results of an analysis, taking advantage of the interactive plotting features of Jupyter notebooks. Students were encouraged to work with each other to complete the in-class notebooks and were arranged in groups of four within the classroom, which was equipped with laptops for every student. The notebooks were autograded and students were allowed to submit their work multiple times until they passed the autograder tests.
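A problem in the fill-in-the-function format described above might look like the following hypothetical sketch. It is not taken from the course notebooks; the function name, the scoring parameters, and the checks are assumptions, with the assertions standing in for autograder tests.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of x and y (Needleman-Wunsch)."""
    # prev[j] is the best score for aligning the processed prefix of x with y[:j].
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * gap]  # aligning x[:i] against an empty prefix of y
        for j in range(1, len(y) + 1):
            diagonal = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Autograder-style checks that a submission would need to pass.
assert global_alignment_score("", "") == 0
assert global_alignment_score("ACGT", "ACGT") == 4
assert global_alignment_score("ACGT", "") == -4
assert global_alignment_score("ACGT", "AGT") == 2
```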

Course evaluations revealed that the students generally enjoyed the notebook activities and the flipped format of the course. The most common criticism of the course by students was that the in-class activities required too much time, with many students spending hours after the class period to complete the notebooks. Although a direct comparison of grades across semesters is difficult due to numerous varying factors, the median undergraduate score increased by roughly three percentage points in the revamped course compared to my last offering of the course, but this difference was not statistically significant. As an instructor, I enjoyed the fact that the flipped format enabled me to spend more time working one-on-one with students who were struggling in the class. One lesson learned was that three-day-per-week, 50-minute class periods were suboptimal for notebook-based activities and thus the next offering of the course will use a two-day-per-week, 75-minute class period format.
3:40 PM NIBLSE Incubators: A community-based model for the development of bioinformatics learning resources Michael Sierk, Sam Donovan, William Morgan, Hayley Orndorf, Mark Pauley, Sabrina Robertson, Elizabeth Ryder and William Tapprich
The Network for Integrating Bioinformatics into Life Sciences Education (NIBLSE) is an NSF-funded Research Coordination Network that aims to establish bioinformatics as an essential component of undergraduate life sciences education. As part of that effort, the project is working to make existing bioinformatics learning resources more accessible to non-specialists and increase their use across undergraduate biology courses. To this end, NIBLSE has partnered with the Quantitative Undergraduate Biology Education and Synthesis (QUBES) project to develop and implement a novel model, called incubators, for supporting the refinement, publication, and dissemination of high-quality bioinformatics teaching resources such as lab activities, worksheets, or classroom exercises. The incubators bring together the author of an existing resource with experienced users, novice users, and a managing editor from NIBLSE to discuss how to refine and improve the resource to make it more robust and more applicable in various undergraduate settings. The talk will outline the challenges faced in developing high-quality learning resources and describe how the incubator model addresses several of those challenges. Examples of previous incubators will be presented, and attendees will be shown how to volunteer to participate in an incubator.
4:00 PM Next steps for the bioinformatics education community Kristin Jenkins, William Morgan, Layla Oesper, Anna Ritz, Michael Sierk, Russell Schwartz
This session will be a combination panel discussion and community forum intended to identify goals for the bioinformatics education community over the next five years and tentative plans for accomplishing them. Directed questions to a panel of the special session’s invited speakers will be used to help focus the agenda on subtopics, such as major near-term educational goals of the panelist, the major challenges he or she anticipates, and how those might be tackled. The format is intended also to allow ample time for questions and input from the audience on these topics. We intend to bring the major conclusions of the forum back to the International Society for Computational Biology (ISCB) Education Community of Special Interest (COSI) to help set the agenda for the broader international efforts with which the Education COSI and its members are involved.

The Marquee on Tuesday, May 21, 2019

Start Time Title Author(s)
9:00 AM Keynote #3 - Reading the fossil record of a cancer Quaid Morris
Introduction by Shaun Mahony
During carcinogenesis, cells accumulate thousands of somatic DNA mutations. Driver mutations bestow fitness advantages that lead to selective sweeps that increase the frequency of mutated cells compared to those lacking the driver. These sweeps also increase the frequency of passenger mutations accumulated since the last such sweep. These mutation "fossils" have little impact on cell function but reflect the mutational processes that generated them. Both their type (e.g., A to C) and genomic locations depend on what caused the mutation, e.g., UV light, and also on the chromatin state of the cell that acquired it. I will describe machine learning approaches to (i) group mutations into subclones associated with different sweeps, (ii) reconstruct the phylogenies of these subclones, and (iii) analyze these groups to infer properties of the historical cell environment in which these mutations accumulated. The ultimate goal of this work is to reconstruct the dynamic cell environments as a normal cell progressively transforms into a cancerous one.
General Track - Gene Regulation II
Chair: Dennis Kostka
10:30 AM Direct prediction of regulatory elements from partial data without imputation Yu Zhang and Shaun Mahony
Genome segmentation approaches allow us to characterize regulatory states in a given cell type using combinatorial patterns of histone modifications and other regulatory signals. In order to analyze regulatory state differences across cell types, current genome segmentation approaches typically require that the same regulatory genomics assays have been performed in all analyzed cell types. This necessarily limits both the numbers of cell types that can be analyzed and the complexity of the resulting regulatory states, as only a small number of histone modifications have been profiled across many cell types. Data imputation approaches that aim to estimate missing regulatory signals have been applied before genome segmentation. However, this approach is computationally costly and propagates any errors in imputation to produce incorrect genome segmentation results downstream.
We present an extension to the IDEAS genome segmentation platform which can perform genome segmentation on incomplete regulatory genomics dataset collections without using imputation. Instead of relying on imputed data, we use an expectation-maximization approach to estimate marginal density functions within each regulatory state. We demonstrate that our genome segmentation results compare favorably with approaches based on imputation or other strategies for handling missing data. We further show that our approach can accurately impute missing data after genome segmentation, reversing the typical order of imputation/genome segmentation pipelines. Finally, we present a new 2D genome segmentation analysis of 127 human cell types studied by the Roadmap Epigenomics Consortium. By using an expanded set of chromatin marks that have been profiled in subsets of these cell types, our new segmentation results capture a more complex picture of combinatorial regulatory patterns that appear on the human genome.
11:00 AM Discovering structural units of chromosomal organization with matrix factorization and graph regularization Da-Inn Lee and Sushmita Roy.
The three-dimensional (3D) organization of the genome is emerging as an important layer of gene regulation in many developmental, disease, and evolutionary processes (Bonev and Cavalli, 2016; Hug and Vaquerizas, 2018; Rowley et al., 2017; Krijger and de Laat, 2016). The 3D genome configuration can be assayed with high-throughput chromosome conformation capture (3C) techniques like Hi-C (Lieberman-Aiden et al., 2009; Rowley and Corces, 2018). The availability of Hi-C datasets has fueled the development of computational methods to examine 3D organization. One important goal for such methods has been identifying chromosomal structural units, such as compartments and topologically associating domains (TADs), disruptions in which can have drastic consequences on normal phenotypes.
At a high level, discovering such structural units can be thought of as a task of clustering genomic regions based on their interaction patterns. Recently a large number of methods have emerged with different computational frameworks, ranging from community detection within networks (Filippova et al., 2014; Norton et al., 2018) and Gaussian mixture modeling (Dixon et al., 2012; Yu et al., 2017) to signal processing approaches (Crane et al., 2015). However, comparisons of TAD-finding methods (Forcato et al., 2017; Dali and Blanchette, 2017; Zufferey et al., 2018) have found large variability across methods in their sensitivity to the resolution and sequencing depth of the dataset. Low sequencing depth in particular, resulting in sparser Hi-C matrices at high resolutions, poses a significant challenge.
Here we present Graph Regularized Non-negative matrix factorization and Clustering of Hi-C data (GRiNCH), a novel matrix-factorization based method for analysis of Hi-C data to discover chromosomal structural units. GRiNCH is based on non-negative matrix factorization (NMF), a powerful dimensionality reduction tool providing interpretable low-dimensional structure from high-dimensional datasets in genomic and imaging domains (Lee and Seung, 2000; Wu et al., 2018; Stein-O’Brien et al., 2018). An NMF-based approach to examine Hi-C matrices has a number of advantages: (1) NMF can be used to predict missing entries, which can be used to smooth noisy, sparse matrices; (2) factorization enables one to recover the low-dimensional signals and clustering of row and column entities of the input matrix; (3) the non-negativity constraint of the factors is well suited for count datasets (such as Hi-C matrices). However, a straightforward application of NMF to Hi-C data is not sufficient because of the strong distance dependency of Hi-C data, that is, regions that are close to each other tend to have more interactions. To impose the distance dependence within the NMF framework, we employ a graph regularized NMF approach, where the graph captures the distance dependency of contact counts such that the learned factors are smooth over the graph structure (Cai et al., 2011).
We use GRiNCH’s factors to define TAD-like structures and compare them to existing methods for finding TADs. GRiNCH’s TADs are stable and robust to the resolution and depth of the data and are comparable to those of state-of-the-art methods, with significant association with CTCF binding at the TAD boundaries. In addition, compared to existing smoothing approaches, GRiNCH-based smoothing has the best agreement with TADs and significant interactions identified from a higher-depth dataset. Taken together, graph-regularized NMF is a promising approach to discover known and novel types of structural units from Hi-C data as well as for smoothing the matrix from low-depth datasets, which can be important for downstream analysis including TAD recovery.
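The graph-regularized NMF idea described in this abstract can be sketched with the multiplicative updates of Cai et al. (2011), which minimize ||X - U V^T||^2 + lam * Tr(V^T L V) with L = D - W. This is not the GRiNCH implementation: the band-shaped neighborhood graph, the toy contact matrix, the parameter values, and the argmax cluster readout are all assumptions, and plain Python lists are used only to keep the sketch dependency-free.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def band_graph(n, radius):
    """Adjacency W linking bins at most `radius` apart on the chromosome (an assumption)."""
    return [[1.0 if 0 < abs(i - j) <= radius else 0.0 for j in range(n)] for i in range(n)]


def objective(x, u, v, w, lam):
    """Squared reconstruction error plus lam * Tr(V^T L V), with L = D - W."""
    recon = matmul(u, transpose(v))
    n = len(x)
    err = sum((x[i][j] - recon[i][j]) ** 2 for i in range(n) for j in range(n))
    # For symmetric W, Tr(V^T L V) = 0.5 * sum_ij W_ij * ||v_i - v_j||^2.
    smooth = 0.5 * sum(
        w[i][j] * sum((a - b) ** 2 for a, b in zip(v[i], v[j]))
        for i in range(n) for j in range(n)
    )
    return err + lam * smooth


def gnmf(x, k, w, lam=1.0, iters=50, seed=0, eps=1e-9):
    """Factor a non-negative n x n matrix as U V^T with V smooth over graph W."""
    rng = random.Random(seed)
    n = len(x)
    u = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    v = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    degree = [sum(row) for row in w]
    xt = transpose(x)
    history = [objective(x, u, v, w, lam)]
    for _ in range(iters):
        # U <- U * (X V) / (U V^T V)
        num = matmul(x, v)
        den = matmul(u, matmul(transpose(v), v))
        u = [[u[i][c] * num[i][c] / (den[i][c] + eps) for c in range(k)] for i in range(n)]
        # V <- V * (X^T U + lam W V) / (V U^T U + lam D V)
        num = matmul(xt, u)
        wv = matmul(w, v)
        den = matmul(v, matmul(transpose(u), u))
        v = [
            [
                v[i][c] * (num[i][c] + lam * wv[i][c])
                / (den[i][c] + lam * degree[i] * v[i][c] + eps)
                for c in range(k)
            ]
            for i in range(n)
        ]
        history.append(objective(x, u, v, w, lam))
    return u, v, history


# Toy symmetric "contact matrix" with two blocks of six bins (hypothetical data).
n_bins = 12
x = [[5.0 if i // 6 == j // 6 else 1.0 for j in range(n_bins)] for i in range(n_bins)]
w = band_graph(n_bins, radius=1)
u, v, history = gnmf(x, k=2, w=w, lam=0.5, iters=50)

# One simple cluster readout (an assumption): the largest factor in each row of V.
clusters = [max(range(2), key=lambda c: row[c]) for row in v]
print(history[0], history[-1], clusters)
```

The graph term penalizes differences between the factor rows of neighboring bins, which is how the distance dependence enters the factorization in this sketch. The product U V^T can also be read as a smoothed version of the input matrix.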
11:15 AM FreeHi-C enables systematic benchmarking of analysis methods for Hi-C Data and improves FDR control for differential Hi-C analysis Ye Zheng and Sunduz Keles.
The recent maturation of chromosome conformation capture (3C) and Hi-C sequencing technologies has given rise to high-throughput profiling of three-dimensional chromatin architecture and revealed transformative insights into long-range regulation of genes. Alongside the technological breakthroughs, a growing number of computational and analytical methods for analysis of Hi-C and related data have been proposed, yielding an urgent need for realistic Hi-C data simulators that can benchmark model performance and validate the results of such a thriving field of methods. We developed FreeHi-C, an open-source tool for data-driven simulation of Hi-C data. FreeHi-C takes as input raw Hi-C sequencing reads and leverages a data-driven method to empirically learn parameters governing genomic fragment interactions. This is fundamentally different from existing approaches that simulate Hi-C contact matrices under a series of assumptions. Subsequently, FreeHi-C generates pairs of sequencing reads that represent the interacting fragment pairs with embedded random nucleotide mutations and indels. The complete simulation procedure of FreeHi-C imitates the general Hi-C experimental protocol and simulates the desired numbers of reads independently, thereby allowing parallel runs for faster implementation. Additionally, FreeHi-C includes a data processing module, which can be applied to both the input biological replicates and the simulated ones; therefore the contact count files (BED), which are compatible with downstream analysis, can be directly produced.
We illustrated the versatile features of FreeHi-C on Hi-C datasets of two human cell lines, GM12878 and A549, and malaria parasite Plasmodium falciparum 3D7, as a representative of a small genome. Specifically, we performed reproducibility analysis with HiCRep (Yang et al., 2017) and detected significant interactions by Fit-Hi-C (Ay et al., 2014a) with biological replicates and their FreeHi-C simulations and showed that analysis of simulated data displays general characteristics of experimental data. We next utilized FreeHi-C to evaluate false discovery rate and power characteristics of state-of-the-art differential interaction detection methods. Our results highlight that differential interaction detection benefits from incorporation of simulated replicates in a number of ways. First, studies where only one biological replicate is generated for each condition typically fail to meet the input requirements of many methods or software packages, as they require multiple replicates for estimating the within-condition variability. Under this scenario, simulating another technical replicate eliminates such usage limits. Notably, our computational experiments, benchmarked with external RNA-seq, CTCF ChIP-seq, and permutation approaches, highlight that inclusion of simulated replicates in the actual analysis boosts the detection power while exhibiting better false discovery rate control. Aggregating differential analysis results from simulated replicates refines top ranking differential interactions even with the availability of two or more biological replicates per condition and results in a better ranking of significant differential contacts for further quantitative and experimental validation. FreeHi-C is implemented in Python with core calculation accelerated by C and is publicly available at https://github.com/keleslab/FreeHiC.
11:30 AM Using Markov Random Field to Model Gene Expression in the 3D Genome Naihui Zhou, Iddo Friedberg and Mark Kaiser.
Chromatin and its 3D organization play important regulatory roles in cellular function in the eukaryotic cell. With advances in 3C (HiC) technology, more long-range intra-chromosomal and inter-chromosomal interactions between genomic loci have come to light.

This study is an attempt to further explore the 3D spatial mechanisms at play during transcription. By using a probabilistic model for the within-sample variations of gene expression, we directly model gene expression values on a spatial neighborhood network inferred from HiC data. We fit a hierarchical Markov Random Field (MRF) model to estimate the level of spatial dependency among protein-coding genes in the human IMR90 cell. We overcame computational challenges of large matrices using the double Metropolis algorithm to carry out the Markov Chain Monte Carlo (MCMC) simulation for this Bayesian model.

Our study confirms the spatial dependency of gene expression among neighboring genes in the 3D genome organization on a global scale. Further insights were made into the mechanism of differential expression as a response to stimuli involving the chromatin compartments.

This study serves as a model for understanding spatial dependency in gene expression, and can help highlight the location of active transcription factories and hubs in the cell.
11:45 AM Analysis of ChIP-exo read profiles reveals spatial organizations of protein complexes Naomi Yamada, Nina Farrell, B. Franklin Pugh and Shaun Mahony
Each subunit of regulatory protein complexes uniquely associates with the genome via protein-DNA or protein-protein interactions. The ChIP-exo protocol precisely characterizes protein-DNA crosslinking patterns by combining ChIP with 5’ to 3’ exonuclease digestion. Within a regulatory complex, the physical distance of a regulatory protein to the DNA affects cross-linking efficiencies. Therefore, analysis of the sequencing read distribution shapes created by the exonuclease can potentially enable greater levels of biological insight by identifying the protein-DNA interaction preferences of proteins or the modes by which they bind.
Here, we present a computational pipeline that simultaneously analyzes ChIP-exo read patterns across multiple experiments and infers spatial organizations of the proteins. Because many proteins bind DNA in a non-sequence specific manner, we directly align the strand separated ChIP-exo read patterns and produce representative ChIP-exo read profiles at a set of binding events. Given a set of aligned ChIP-exo read profiles across multiple proteins, we use a probabilistic mixture model to deconvolve the ChIP-exo read patterns to protein-DNA crosslinking sub-distributions. The method allows consistent measurements of crosslinking strengths of protein-DNA interactions across multiple ChIP-exo experiments. Lastly, we perform MDS to visualize cross-linking preferences of the regulatory proteins.
We have applied the ChIP-exo analysis methods to a set of proteins that organizes the PolIII transcriptional pre-initiation complex (PIC) assembly of yeast tRNA genes. Our results demonstrate that inferred protein organization closely recapitulates the known organization of the tRNA PIC, thereby confirming that the detailed analysis of ChIP-exo reads enables us to understand the precise organization of protein-DNA complexes. We anticipate that ChIP-exo read pattern analysis will offer an economical approach in creating testable hypotheses of protein organization.
General Track - Networks I
Chair: Anna Ritz

1:30 PM Incorporating noisy priors for estimating transcription factor activities for genome-scale regulatory network inference Alireza Fotuhi Siahpirani, Rupa Sridharan and Sushmita Roy.
Transcriptional regulatory networks model the context-specific expression levels of genes by specifying the targets of regulatory proteins (such as transcription factors and signaling proteins). Expression-based network reconstruction is among the most popular computational approaches to infer genome-scale regulatory networks. Given the mRNA profiles of genes and their potential regulators, these methods infer the regulators of a gene based on the predictive ability of a regulator’s mRNA level to explain the mRNA level of a target gene. However, reconstruction of regulatory networks remains a challenging problem with computational methods failing to recover most known physical interactions identified from ChIP-chip and ChIP-seq assays [Marbach et al. 2012].

Recent work in computational inference of regulatory networks has advanced in two directions: (a) integration of prior information to improve the agreement with physical networks, and (b) inference of transcription factor (TF) activities to overcome possible issues with using mRNA levels of the regulator as a proxy for TF activity [Liao et al., 2003, Ocone and Sanguinetti, 2011, Arrieta-Ortiz et al., 2015]. Both of these directions require an input network that provides an initial assessment of potential regulatory edges. In most systems, the input regulatory network can be very noisy; however, the extent to which the quality of the input network influences results is not known. This is especially an issue for estimation of TF activity (TFA), which directly uses the network structure to infer the TF activity levels.

To address this issue, we developed a new approach, Network Inference with Regularized TF Activities (NIRTA), that uses regularized regression to prune out and down-weight noisy edges and uses these inferred activities to learn a regulatory network. We first used simulated data and showed that the estimated TFA and the resulting inferred network can be sensitive to noise in the input network. Next, we extended the Network Component Analysis (NCA) algorithm, which is used to estimate TFAs, to incorporate our prior confidence in the individual interactions of the noisy input network. On simulated data, we showed that our approach of estimating TFA is more robust to noise, estimates more accurate TFAs and infers better networks compared to an approach that naively uses the noisy input network. We applied our approach to yeast and mammalian expression datasets. For the mammalian study, we considered four well-studied cell lines, mouse embryonic stem cells (mESC), lymphoblastoid cell lines (GM12878), a breast cancer cell line (MCF7) and human embryonic stem cells (hESCs) and collected a large compendium of expression data including RNA-seq and microarray data. In yeast, regularized TFA estimation improves the performance of all network inference methods tested [Huynh-Thu et al. 2010, Greenfield et al. 2013, Roy et al. 2013, Siahpirani & Roy 2017] compared to the TFA estimated using the original NCA method. Furthermore, adding both TFA and the prior has the best performance. We observe similar results in most of the mammalian systems, where incorporation of both the prior and regularized TFA has the best performance. Our inferred networks rank several relevant regulators for a cell line highly and are associated with meaningful biological processes, which are consistent with the cell line of interest.
Taken together, our results show that by handling noisy input prior networks, NIRTA provides a powerful approach for reconstructing gene regulatory networks and is broadly useful across diverse systems.
1:45 PM Building Robust Gene Co-expression Networks from RNA-seq Data Kayla Johnson and Arjun Krishnan.
As the cost of RNA sequencing has continued to fall, the amount of publicly available RNA-seq data has continued to grow; currently, there are over 80,000 publicly available human RNA-seq samples. A predominant method for studying gene function in specific biological contexts is to construct gene co-expression networks using transcriptomes from those contexts. Although many studies have focused on the best preprocessing procedures for using RNA-seq data in differential expression analysis, not enough attention has been given to best practices for processing RNA-seq data for calculating gene co-expression. Constructing an accurate co-expression network depends on several factors, including expression quantification from read counts and the presence of experimental and technical artifacts, which introduce non-biological variation into the data. In this research, we leverage thousands of uniformly aligned RNA-seq samples from various experiments that span diverse tissues, diseases, and conditions to investigate these factors. We construct gene co-expression networks using different within-sample and between-sample normalizations and network transformation methods, and then evaluate the resulting networks based on their ability to recover documented tissue-naive and tissue-specific gene functional relationships. This comprehensive benchmarking provides insight into the best procedures for deriving a robust gene co-expression network from an RNA-seq dataset.
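The basic pipeline described above (within-sample normalization, then gene-gene correlation) can be sketched as follows. This is a minimal illustration only: counts-per-million, a log2(x + 1) transform, and Pearson correlation are assumptions chosen for the example, not necessarily the normalizations or network transformations benchmarked in the talk, and the function names are hypothetical.

```python
import math

def cpm(counts):
    """Within-sample normalization: counts per million for each sample (column).

    `counts` is a list of gene rows, each a list of read counts per sample.
    """
    n_samples = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_samples)]
    if any(t == 0 for t in totals):
        raise ValueError("every sample needs at least one read")
    return [[row[j] * 1e6 / totals[j] for j in range(n_samples)] for row in counts]

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def coexpression_network(counts):
    """Gene-by-gene correlation matrix computed on log2(CPM + 1) values."""
    expr = [[math.log2(v + 1.0) for v in row] for row in cpm(counts)]
    g = len(expr)
    return [[1.0 if i == j else pearson(expr[i], expr[j]) for j in range(g)]
            for i in range(g)]

# Three genes (rows) measured in four samples (columns).
counts = [[10, 20, 30, 40], [10, 20, 30, 40], [40, 30, 20, 10]]
net = coexpression_network(counts)
```

Swapping `cpm` for another normalization, or `pearson` for another similarity measure, is how alternative pipelines would be compared against the same gold-standard gene relationships.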
2:00 PM Network analysis of synonymous codon usage Khalique Newaz, Gabriel Wright, Jun Li, Patricia Clark, Scott Emrich and Tijana Milenkovic.
Most amino acids are encoded by multiple synonymous codons. However, for an amino acid, some of its synonymous codons are used significantly less often than others (i.e., are rarely used in the genome overall and hence are called “rare codons”) and tend to be translated more slowly than their more common synonymous counterparts. By studying positions of rare codons in the 1-dimensional (sequence) structure of a protein, it has been shown that rare codons can have a positive impact on co-translational folding of a protein to its final 3D structure. Moreover, the positions of many rare codons are evolutionarily conserved in homologous gene sequences (such rare codons are henceforth called “conserved codons”). However, these conserved positions are not enriched at obvious structural boundaries, such as between structured protein domains. Studying positions of rare codons in the 3-dimensional (3D) structure of a protein, which is “richer” in biochemical information than the sequence alone, might provide more insight into the importance of rare codons for folding of particular protein structures. So, we ask whether conserved, rare, and common (i.e., frequently used, non-rare) codons occupy different positions in the 3D structure of a protein, as well as whether the relationship between the positions of conserved, rare, or common codons in a protein is linked to the function of the protein.

To explore the above questions, we analyze a recent large data set consisting of ~280,000 proteins spanning 76 species for which codon usage information is available. We consider a non-redundant subset of these proteins that are at most 90% sequence-similar to each other, resulting in ~4,600 proteins. Among these proteins, we only keep those that have at least one conserved codon and that have sufficient 3D protein structural information in the Protein Data Bank. This results in 63 proteins spanning seven species. We model the 3D structure of each protein using the concept of protein structure networks (PSNs). Namely, given a protein, we model its amino acids as nodes of its PSN and join two nodes by an edge if the corresponding amino acids are sufficiently close in the 3D space. To study the 3D structural relationship between conserved, rare, and common codons, we study the network positions of their respective amino acids in the PSNs. We use the notion of network centrality to capture the PSN position of a node. Given a protein and a node centrality measure, we examine all possible kinds of trends between the conserved, rare, or common codons. For example, one possible trend is that node centrality values of amino acids encoded by conserved codons are significantly greater than node centrality values of amino acids encoded by rare and common codons. Another possible trend is that node centrality values of amino acids encoded by common codons are significantly greater than node centrality values of amino acids encoded by rare and conserved codons.
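The PSN construction and centrality comparison described above can be sketched as follows. This is an illustration under stated assumptions: one 3D coordinate per amino acid, a hypothetical 7.0 distance cutoff, and degree centrality standing in for the six (unnamed) centrality measures used in the study. The function names are likewise hypothetical.

```python
import math

def build_psn(coords, cutoff=7.0):
    """Protein structure network: one node per amino acid, with an edge
    between two nodes when their coordinates are within `cutoff`.
    The cutoff value is an assumption made for this example."""
    n = len(coords)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def degree_centrality(adj):
    """Fraction of the other nodes that each node is connected to."""
    n = len(adj)
    if n < 2:
        return {v: 0.0 for v in adj}
    return {v: len(neighbors) / (n - 1) for v, neighbors in adj.items()}

def mean_centrality_by_class(centrality, codon_class):
    """Average centrality per codon class; codon_class[i] labels node i
    as 'conserved', 'rare', or 'common'."""
    groups = {}
    for node, label in enumerate(codon_class):
        groups.setdefault(label, []).append(centrality[node])
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}

# Toy four-residue chain laid out along one axis.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
adj = build_psn(coords)
means = mean_centrality_by_class(degree_centrality(adj),
                                 ["rare", "conserved", "common", "common"])
```

A real analysis would replace the comparison of means with a statistical test per centrality measure to decide which trend, if any, a protein exhibits.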

We find that the 63 proteins show 17 trends, with respect to at least one of the six node centrality measures that we use. Interestingly, there is no single trend that is dominant, meaning that even the most common trend is shared between only 12 proteins. We hypothesize that the 17 protein groups (corresponding to the 17 trends) perform different biological functions. Hence, we analyze enrichment of each group in biological process gene ontology terms. Indeed, when we compare each individual protein group to all proteins with conserved codon(s), we find that the different protein groups are significantly enriched (with corrected p-values from 0.049 to 0.008) in different biological functions. Our results imply the existence of a link between codon usage, protein folding, and protein function.
2:15 PM Network Inference with Granger Causality Ensembles on Single-Cell Transcriptomic Data Atul Deshpande, Li-Fang Chu, Ron Stewart and Anthony Gitter.
Advances in single-cell transcriptomics not only enable us to measure the gene expression of individual cells, but also allow us to order cells by their state along a dynamic biological process. Many ordering algorithms assign 'pseudotimes' to each cell, representing the progress along the biological process. Ordering the expression data according to such pseudotimes can be valuable for understanding the underlying regulator-gene interactions in a biological process, such as differentiation. However, the distribution of cells sampled along a transitional process, and hence that of the pseudotimes assigned to them, is not uniform. This renders many standard mathematical methods such as Granger Causality ineffective for analyzing the ordered gene expression states.

We present Single-Cell Inference of Networks using Granger Ensembles (SCINGE), an algorithm for gene regulatory network inference from ordered single-cell gene expression data. SCINGE uses a kernel-based Granger Causality regression, which smooths over irregular pseudotimes and missing expression values in the ordered single-cell data. It then aggregates the predictions from an ensemble of regression analyses using a modified Borda count method to compile a ranked list of candidate interactions between transcriptional regulators and their target genes. We compare SCINGE against contemporary algorithms for gene network reconstruction in two mouse embryonic stem cell differentiation case studies and observe that SCINGE outperforms the other methods. In this regard, we also comment on the pitfalls of only relying on aggregate statistics like average precision to characterize a method’s performance. We present two different visualizations that provide a deeper understanding of some underlying variables behind the overall performance of a network inference method by assessing performance of individual transcriptional regulators. We observe from these visualizations that network inference methods, including SCINGE, may have near random performance for predicting the targets of many individual regulators even if the aggregate performance is good.
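The ensemble aggregation step can be illustrated with a plain Borda count. Note that SCINGE uses a modified Borda count whose details are not given in this abstract; the sketch below shows only the standard scheme, with hypothetical names, to convey how ranked edge lists from several regressions combine into one ranking.

```python
def borda_aggregate(rankings):
    """Combine ranked lists of candidate edges with a standard Borda count.

    Each ranking lists (regulator, target) edges from most to least confident.
    An edge at position p (0-based) in a list of length n earns n - p points;
    edges absent from a list earn nothing from it.
    Ties in total score are broken by sorting the edges themselves.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, edge in enumerate(ranking):
            scores[edge] = scores.get(edge, 0) + (n - position)
    return sorted(scores, key=lambda edge: (-scores[edge], edge))

# Three hypothetical regressions, each ranking the same three edges.
rankings = [
    [("A", "B"), ("A", "C"), ("B", "C")],
    [("A", "C"), ("A", "B"), ("B", "C")],
    [("A", "B"), ("B", "C"), ("A", "C")],
]
consensus = borda_aggregate(rankings)
# Totals: A->B = 8, A->C = 6, B->C = 4
```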

The smoothing nature of the kernel-based method allows the removal of zero-valued drop-outs from the dataset, and we show preliminary results from this integrated drop-out handling strategy. Our experiments also suggest that in some cases including cells' pseudotime values can hurt the performance of network reconstruction methods. Although SCINGE is currently limited to single trajectory biological processes, we are expanding the study to include branching trajectories and exploring applications to a variety of single-cell datasets representing more complex processes such as cancer progression. A MATLAB implementation of SCINGE is available at https://github.com/gitter-lab/SCINGE.
General Track - Algorithms & Machine Learning
Chair: Lana Garmire

3:00 PM A new resolution function to evaluate tree shape statistics Maryam Hayati, Bita Shadgar and Leonid Chindelevitch
Phylogenetic trees are frequently used in biology to study the relationships between a number of species or organisms. The shape of a phylogenetic tree contains useful information about patterns of speciation and extinction, so powerful tools are needed to investigate the shape of a phylogenetic tree. Tree shape statistics are a common approach to quantifying the shape of a phylogenetic tree by encoding it with a single number.

In this article, we propose a new resolution function to evaluate the power of different tree shape statistics to distinguish between dissimilar trees. We show that the new resolution function requires less time and space in comparison with the previously proposed resolution function for tree shape statistics. We also introduce a new class of tree shape statistics, which are linear combinations of two existing statistics that are optimal with respect to a resolution function, and show evidence that the statistics in this class converge to a limiting linear combination as the size of the tree increases.
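As a concrete illustration of tree shape statistics and of linear combinations of them, the sketch below computes two familiar statistics, the Colless and Sackin indices, on binary trees written as nested tuples. The abstract does not say which statistics are combined, so the choice of these two, the weight parameter, and the function names are all assumptions made for the example.

```python
def leaf_count(tree):
    """Number of leaves; a leaf is any non-tuple, an internal node is a pair."""
    if not isinstance(tree, tuple):
        return 1
    return sum(leaf_count(child) for child in tree)

def colless(tree):
    """Colless index: sum over internal nodes of |left leaves - right leaves|."""
    if not isinstance(tree, tuple):
        return 0
    left, right = tree
    return abs(leaf_count(left) - leaf_count(right)) + colless(left) + colless(right)

def sackin(tree, depth=0):
    """Sackin index: sum of the depths of all leaves."""
    if not isinstance(tree, tuple):
        return depth
    return sum(sackin(child, depth + 1) for child in tree)

def linear_combination(tree, weight):
    """Illustrative combined statistic: weight * Colless + (1 - weight) * Sackin."""
    return weight * colless(tree) + (1 - weight) * sackin(tree)

balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")
# The balanced tree scores lower on both indices than the caterpillar tree.
```

A resolution function would then ask how well such a number separates dissimilar trees; that part is specific to the work presented and is not reproduced here.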
3:30 PM Topic modeling enables identification of regulatory complexes in a comprehensive epigenome Guray Kuzu, Matthew Rossi, Naomi Yamada, Prashant Kuntala, Chitvan Mittal, Nitika Badjatia, Gretta Kellog, Frank Pugh and Shaun Mahony.
Characterizing the composition and organization of protein complexes that form on DNA is key to understanding gene transcription and regulation. Chromatin immunoprecipitation (ChIP) based techniques have been widely applied to characterize protein-DNA binding across numerous systems. However, insights into the organization of regulatory complexes have been limited by three shortcomings: 1) a lack of comprehensiveness in existing compendia of protein-DNA binding profiles; 2) a lack of positional resolution in most existing genome-wide ChIP experiments; and 3) a lack of computational analysis methods for characterizing regulatory complex organization across large collections of genome-wide ChIP experiments.
Under our ongoing Yeast Epigenome Project, we have characterized the genomic occupancy patterns of a comprehensive set of nuclear-localized proteins (~400 proteins) in yeast using the high-resolution ChIP-exo assay. The resulting dataset represents the first comprehensive characterization of any cell type’s genome-wide protein-DNA interaction landscape at a resolution sufficient to define the positional organization of factors. Here, we demonstrate that topic modeling approaches can be used to identify sets of interacting proteins within this regulatory landscape. Topic modeling has been used to discover thematic concepts in large collections of documents. A topic is typically defined as a recurring pattern of co-occurring words and each document is a mixture of topics that are present in the corpus. Our approach, based on the hierarchical Dirichlet process, forms probabilistic topics from co-occurring ChIP-exo signals across the genome. In contrast with hard-clustering or state-based approaches (e.g. Hidden Markov Models), topic models allow multiple topics to contribute to the generation of data in each bin, and are therefore more appropriate for modeling the fine-grained organization of protein-DNA complexes from high-resolution ChIP-exo data.
Here, we provide evidence that the topics estimated by our approach can be interpreted as functional groups of regulatory proteins. Our topics encapsulate subunits of known complexes and sets of proteins from known interacting complexes. Moreover, profiling the distribution of topics on the genome reveals the spatial organization of protein complexes during gene transcription and regulation. We also identified motifs enriched in the genomic regions where certain topics are present, and recapitulated motifs of sequence-specific proteins. Motifs associated with topics containing non-sequence-specific proteins indicate possible mechanisms for the genomic recruitment of protein complexes. Furthermore, analysis of gene clustering based on the presence of topics displays variety in gene regulation mechanisms. Therefore, topic modeling provides a unique framework for understanding the high-resolution organization of large numbers of regulatory proteins within a deeply characterized epigenome.
3:45 PM The k is a lie Gregory Way and Casey Greene.
Generating data has become cheap. Now we are awash in transcriptomic data from discovery-oriented research, data collected as surrogate endpoints on clinical trials, data from clinical records, or genetic data collected in a case-control framework. To improve our understanding of the mechanisms of biology, we need to make such data human-interpretable. Often, our group and others use unsupervised methods that map data into a reduced dimensional space to accomplish this. There are many different methods that do this: autoencoder neural networks, principal components analysis, non-negative matrix factorization, and even k-means clustering. The challenge with these methods is that they map the data to some number of dimensions, and the best number of dimensions is rarely known in advance. We have recently performed multi-method and multi-dimensionality examinations of three benchmark datasets (TCGA, GTEx, and TARGET). We performed sequential compressions for dimensionalities from 2 to 200 in regular increments using Principal Components Analysis (PCA), Independent Components Analysis (ICA), Non-negative Matrix Factorization (NMF), Denoising Autoencoders (DAs), and Variational Autoencoders (VAEs). We sought to assess compressions produced for each method at many dimensionalities, which we refer to here as k though it differs by method. Our evaluations include the reconstruction of held out data, enrichment of features for biological pathways, model stability, the extent to which features capture key characteristics of the samples, and the extent to which features support subsequent supervised machine learning applications. Our primary finding from this work is that, for these complex biological datasets, there is no single, optimal k and thus that the k is a lie. We observe that certain signals, such as cell type proportions or cancer type most effectively sort into their own dimensions when the value of k is relatively low. 
On the other hand, more subtle signatures that make up these broad signals, such as underlying pathway activity differences, are best captured when k is much higher. Consequently, each perspective can be useful and it can be difficult or impossible to select a k that is most generally useful. We also find that performance for supervised learning tasks is maximized when the supervised learner has access to models with many different k and also models generated by many different methods. Thus, as we consider and compare unsupervised learning methods, we may wish to consider their value not as individual players but as contributors to a broader ecosystem of methods. We may then wish to favor those methods that learn features that are biologically aligned and also most distinct from the features learned by other methods.
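To make the idea of sweeping k concrete, the sketch below computes how much variance PCA leaves unexplained at each k. It covers only one of the five methods and only one of the evaluations listed (reconstruction, here on the training data rather than held-out data). The power-iteration implementation and all names are illustrative assumptions, not the authors' code.

```python
import math

def covariance(rows):
    """Sample covariance matrix of a samples-by-features table."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]

def leading_eigenpair(m, iters=300):
    """Largest eigenvalue and eigenvector of a symmetric PSD matrix
    by power iteration; returns eigenvalue 0.0 for a (near-)zero matrix."""
    d = len(m)
    v = [float(i + 1) for i in range(d)]          # arbitrary non-zero start
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            return 0.0, v
        v = [x / norm for x in w]
    mv = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(v[i] * mv[i] for i in range(d)), v

def residual_variance_by_k(rows, max_k):
    """Variance left unexplained by the top-k principal components,
    for k = 1..max_k (max_k should not exceed the number of features)."""
    m = covariance(rows)
    d = len(m)
    total = sum(m[i][i] for i in range(d))
    explained, result = 0.0, {}
    for k in range(1, max_k + 1):
        lam, v = leading_eigenpair(m)
        explained += lam
        result[k] = max(total - explained, 0.0)
        # Deflation: remove the component just found before the next pass.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return result

data = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
curve = residual_variance_by_k(data, 3)
```

The residual can only shrink as k grows, which is one reason reconstruction error alone cannot identify a single best k and other criteria are needed.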
4:00 PM Machine learning is fast and accurate for network-based gene classification Arjun Krishnan, Renming Liu, Christopher Mancuso and Anna Yannakopoulos.
Computationally predicting the roles individual genes play in pathways, traits, and diseases – computational gene classification – is a powerful approach towards bridging the genotype-phenotype gap. Current state-of-the-art techniques for inferring novel gene-annotations of such nature from known ground truths rely on molecular interaction networks to guide the predictions. For example, label propagation is a class of techniques that prioritizes new genes for a specific attribute (e.g. pathway/disease) by diffusing known gene-attribute associations across the gene network. Here, we comprehensively examine popular network-based methods for gene classification, including supervised machine learning (ML) methods that take advantage of either the original gene network or its reduced representation in the form of unsupervised node embeddings. We evaluate these methods on a variety of tasks, including associating genes with hundreds of cellular functions, organismal phenotypes, and complex traits/diseases. We set up evaluations carefully to focus on discovering attributes of under-characterized (or completely uncharacterized) genes. Extensive analysis was carried out across multiple networks (of various sources, scales, and genomic coverage) based on metrics that capture the quality of top predictions prioritized for experimental validation. These analyses establish that approaches that combine molecular network knowledge with ML are superior to label-propagation methods for gene classification based on genome-wide molecular networks, providing accurate novel associations at scale. Since network-based gene classification is a critical task that is going to continue to attract the development of newer methods, we have made a structured, fully-documented code base – along with diverse gene attribute standards – available for computational researchers to rapidly benchmark their methods against the methods tested here in all the networks.
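Label propagation, the baseline family mentioned above, can be sketched as a random walk with restart from the known genes. This is one common variant, chosen here as an assumption; the restart probability, the iteration count, and the names are illustrative and not taken from the benchmark.

```python
def random_walk_with_restart(adj, seeds, restart=0.5, iters=100):
    """Diffuse known gene-attribute labels across an unweighted gene network.

    adj maps each gene to the set of its neighbors; seeds is the set of genes
    already annotated with the attribute. Returns a score per gene, where
    higher scores mean closer network proximity to the seeds.
    """
    nodes = sorted(adj)
    start = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(start)
    for _ in range(iters):
        nxt = {v: restart * start[v] for v in nodes}
        for u in nodes:
            neighbors = adj[u]
            if not neighbors:
                # Isolated gene: treat as a self-loop so no probability is lost.
                nxt[u] += (1 - restart) * p[u]
                continue
            share = (1 - restart) * p[u] / len(neighbors)
            for v in neighbors:
                nxt[v] += share
        p = nxt
    return p

# Path-shaped toy network A - B - C - D with one known gene, A.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
scores = random_walk_with_restart(adj, {"A"})
candidates = sorted((g for g in adj if g != "A"), key=lambda g: -scores[g])
```

The supervised alternative examined in the talk would instead use each gene's network connections (or an embedding of them) as a feature vector for a classifier trained on the known genes.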

Further, we expand the top method – the network-based ML approach – to predict novel disease-associated genes when previously known disease genes, as is often the case, come from a variety of studies and study types (e.g. proteomics, microarrays, (epi-)genome-wide association studies) of varying quality. Our newer ML method incorporates both the reliability and the concordance of multiple data sources to produce an integrated genome-wide ranking of disease genes based on large-scale gene networks. Applying this approach to predict novel protein biomarkers for Alzheimer's disease (ALZ), we show that, compared to learning only from known ALZ protein biomarkers, naively including all available data without regard to quality lessens predictive performance, while incorporating data weighted by its quality improves predictive performance. Experimental evaluation of our top predictions implicates several novel candidates as bona fide biomarkers of ALZ.
4:15 PM Nearest-neighbor Projected-Distance Regression (NPDR) detects network interactions and controls for confounding and multiple testing Brett Mckinney, Trang Le and Bryan Dawkins.
Efficient machine learning methods are needed to detect complex interaction network effects in complicated modeling scenarios in high dimensional data, such as GWAS or gene expression for case-control or continuous outcomes. Many machine learning feature selection methods have limited ability to address the issues of controlling the false discovery rate and adjusting for covariates. To address these challenges, we develop a new feature selection technique called Nearest-neighbor Projected-Distance Regression (NPDR) that uses the generalized linear model (GLM) to perform regression between nearest-neighbor pair distances projected onto predictor dimensions. Motivated by the nearest-neighbor mechanism in Relief-based algorithms, NPDR captures the underlying interaction structure of the data, handles both dichotomous and continuous outcomes and various combinations of predictor data types, statistically corrects for covariates and allows for regularization. Using realistic simulations with main effects and network interactions, we show that NPDR outperforms standard Relief-based methods and random forest at detecting functional variables while also enabling covariate adjustment and multiple testing correction. Using RNA-Seq data from a study of major depressive disorder, we show that NPDR with covariate adjustment effectively removes spurious associations due to confounding. We apply NPDR to a separate RNA-Seq study with a continuous outcome of sleep quality and identify genes important to the phenotype.
4:30 PM Gene Expression Prediction: A Machine Learning Approach Paul Okoro, Ryan Schubert, Amy Luke, Lara Dugas and Heather Wheeler.
Tremendous progress in understanding and unravelling genetic predictors of complex traits has been made through genome-wide association studies (GWAS) and transcriptome association methods like PrediXcan. However, these genomic successes were largely achieved in populations of European ancestry, creating a disparity in the applicability of these results to populations of other ancestries. We have shown that genetic predictors of gene expression built in one continental population do not perform as well when applied to another. The goal of this project is to use machine learning algorithms to build gene expression prediction models that are specially optimized for African-origin populations and thus broaden the applicability of PrediXcan to diverse populations.
Our study cohort includes 78 women of recent African-origin from the US and Ghana selected from Modeling the Epidemiologic Transition Study (METS). In this cohort, we have performed genome-wide SNP genotyping and measured gene expression in whole blood by RNA-seq. We are using the genotype data to predict gene expression using several machine learning algorithms. Although we expect the predictive power of any gene to be dependent on the heritability of the gene expression trait in our study cohort, we have previously shown that the genetic architecture of gene expression is sparse rather than polygenic. Thus, for many genes, a handful of SNPs have large effect associations that can explain most of the heritable component of gene expression traits.
Thus far, we have used nested cross-validation of elastic net to train genotypic predictors of gene expression. We found that 306 of the 2,342 protein-coding genes tested have significant predictive performance (R > 0.1). We will compare elastic net prediction to other models, including random forest, support vector machine, k-nearest neighbor, and Bayesian sparse linear mixed models. We will test the gene expression prediction models fitted with METS cohort data for replication in other available transcriptome data from African-origin cohorts, such as 1000 Genomes and the Multi-Ethnic Study of Atherosclerosis (MESA).
Using METS data and different machine learning algorithms, we will build models to generate accurate estimates of genetically regulated gene expression in the African cohort. The models from this study will be useful to researchers carrying out genomic studies in African populations and will enhance our knowledge of biological mechanisms associated with disease in all populations.
5:00 PM Keynote #4 - Leveraging Large Scale Genome and Transcriptome to Decode the Biology of Complex Traits Hae Kyung Im
Introduction by Tony Gitter
Over the last decade and a half, the field of complex trait genetics has made an unprecedented amount of progress, discovering tens of thousands of variants robustly associated with a broad spectrum of human diseases and traits. Despite this success, the understanding of the mechanisms underlying these discoveries is lagging. In this talk, I will survey what we have learned about the biology of complex traits using genome-wide association studies, biobank-level phenome data, a comprehensive atlas of transcriptome regulation, and a suite of methods tailored to integrate these data.

Fifth Quarter on Tuesday, May 21, 2019

Start Time Title Author(s)
Microbiome I
Chair: Sailendharan Sudakaran
10:35 AM Viral discovery in Freshwater Environments Catherine Putonti, Jonathon Brenner, Thomas Hatzopoulos and Andrea Garretto
In contrast to their bacterial hosts, bacteriophages are severely underrepresented in sequence databases. Viral metagenomic studies have been instrumental in increasing our understanding of the genetic diversity of phages in nature. Nevertheless, freshwater environments remain severely undercharacterized. Analysis of viral metagenomes presents bioinformatic challenges distinct from the analysis of cellular communities, and this has led to the development of tools specific to viral studies. Here we present two new software tools, PhageRage and virMine, developed for the characterization of known viral strains and the discovery of new viral strains, respectively. These tools integrate existing tools with new reporting and visualization functionality. Using these tools, we have conducted an extensive analysis of the viral community within Chicago's Lake Michigan nearshore waters.
11:00 AM Microbial communities within Lake Michigan nearshore waters in Chicago area Carine Mores, Michael Zilliox and Catherine Putonti
Motivation: The Great Lakes are a critical freshwater resource, providing drinking water to millions of residents of both the US and Canada. Furthermore, the Great Lakes are an essential part of commerce and recreation in the area. While city and state-led monitoring initiatives provide data on pathogenic and fecal indicator bacteria (FIB), little is known about other microbial communities present in the nearshore waters. In the present study, samples were collected over the summer of 2013 and 2014 in order to identify the microbial communities that persist within Chicago’s nearshore waters: the interface with recreational users.

Methods: Nearshore surface water samples were collected over the summer of 2013 and 2014. The 2013 sampling focused on two Chicago public beaches – Montrose Beach and 57th Street Beach, which were also sampled in 2014 along with two additional sites – Wilmette Beach and 97th Street Beach. Each sample was surveyed through targeted sequencing of the V4 16S rRNA gene on the Illumina MiSeq bench-top sequencer. Bioinformatic analysis was conducted using Mothur software v.1.40.5. The sequences were aligned against SILVA database v132. Taxonomic classifications were generated to the genus level using the RDP classifier. The taxonomic information was used to assign operational taxonomic units (OTUs) using the clustering algorithm built in to the Mothur software with a 0.03 cutoff level. Samples were subsampled at a read depth of 10,000 for 2013 and 50,000 for 2014 to normalize the data. Alpha diversity calculations were made in Mothur software. PCA plots were generated using STAMP software.
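The normalization and alpha diversity steps can be illustrated as follows. This is a sketch of the general technique, not of Mothur's implementation: it subsamples reads without replacement to a fixed depth and computes the Shannon index, which is assumed here because the abstract does not say which alpha diversity measures were used. The OTU names and counts are made up.

```python
import math
import random

def rarefy(otu_counts, depth, seed=0):
    """Subsample a sample's reads without replacement to a fixed read depth.

    otu_counts maps OTU name -> read count. The seed makes the draw repeatable.
    """
    pool = [otu for otu, count in otu_counts.items() for _ in range(count)]
    if len(pool) < depth:
        raise ValueError("sample has fewer reads than the requested depth")
    rng = random.Random(seed)
    subsample = {}
    for otu in rng.sample(pool, depth):
        subsample[otu] = subsample.get(otu, 0) + 1
    return subsample

def shannon(otu_counts):
    """Shannon diversity index (natural log) from OTU read counts."""
    total = sum(otu_counts.values())
    return -sum((c / total) * math.log(c / total)
                for c in otu_counts.values() if c > 0)

sample = {"OTU1": 500, "OTU2": 300, "OTU3": 200}   # hypothetical counts
normalized = rarefy(sample, depth=100)
diversity = shannon(normalized)
```

Subsampling every sample to the same depth before computing diversity keeps samples with more reads from appearing more diverse simply because they were sequenced more deeply.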

Results: Several taxa were consistently found in both the 2013 and 2014 samples and at all the beaches sampled. Flavobacterium species are the most abundant taxa in the 2014 samples, and species belonging to the family Burkholderiaceae are the most abundant taxa in 2013 and the second most abundant in 2014. Two genera, Fluviicola and Algoriphagus, were detected only in 2014, at all four beaches. Candidatus Methylopumilus was present in 2013 but not in 2014. While FIBs were not found above the detection limit in the 2013 samples, Escherichia/Shigella were abundant at the four beaches sampled in 2014, in every case on the same sampling date, July 22. While variations were observed both from beach to beach and over time, these variations were not found to be statistically significant.

Conclusions: Overall, the nearshore microbial community of the Chicago waters is stable both over time and from one site to the next. With the exception of the one sampling date in 2014, FIBs were not abundant within these waters. The sampling and sequencing efforts were able to provide more knowledge about the microbial communities present in nearshore waters of Lake Michigan.
11:15 AM The microbial communities of bilge water, boat surfaces and external port water: a global comparison Laura Schaerer, Ryan Ghannam, Tim Butler and Stephen Techtmann
Dispersal of many invasive macroscopic eukaryotes has been accelerated by the shipping industry. While the spread of macroscopic eukaryotic organisms aided by the shipping industry has been fairly well-documented, very few studies have been dedicated to studying the potential for microorganisms to be spread in the same ways. Bacteria, due to their small size and ubiquitous nature have the potential to be transported by ships in several ways, including on boat surfaces and in ballast water. While ballast water has been a key focus in understanding the spread of invasive species, little work has been done to determine the extent to which bilge water may be a conduit for dispersal of organisms. In this study, we used 16S rRNA sequencing to compare the microbial community in bilge water and boat surfaces to the microbial community of the external water in 20 different ports in five distinct regions around the world (Asia, Europe, East Coast U.S.A., West Coast U.S.A. and Great Lakes U.S.A.). The overall goal of this study was to determine similarities and differences between the microbial communities living on and in boats and the microbial communities of the overlying port water to better quantify to what extent the microbiome of a boat is determined by the port water. This work provides a detailed analysis of the microbial communities resident on boats throughout the world. We show that the microbial communities of bilge water and the hull are seeded by the microbes found in port water. Using the program SourceTracker we showed that 40% and 52% of the bilge and hull samples respectively were derived from the port microbial community. We also report highly variable abundance of cyanobacteria in the bilge compartment of boats, supporting our hypothesis that the boat microbial community is seeded by the overlying water.
11:30 AM Environmental Monitoring of Crude Oil using Natural Microbial Communities and Machine Learning Stephen Techtmann, Timothy Butler and Paige Webb
Microbes are present in almost every environment and are highly diverse. These microbial communities can rapidly change in response to environmental perturbations. With increasing frequency, microbial communities in the human ecosystem have been used as markers for disease. In the present study, we are interested in using natural microbial communities as sensors for environmental contamination in the form of crude oil. Trends in microbial community composition are often investigated through unsupervised approaches such as cluster analysis and Principal Coordinate Analysis (PCoA). These approaches are useful for observing patterns in the data. However, supervised machine learning can provide key insights into the predictable nature of changes in microbial communities and can be used to identify microbial taxa that are indicative of a particular state.

In this study, we used supervised machine learning to interrogate microbial communities in the Great Lakes. To demonstrate the utility of supervised learning in environmental monitoring, we developed random forest classifiers to predict the presence of oil in samples from the Great Lakes. A series of microcosms was set up with water from sites across the Great Lakes. For each location, two types of crude oil (Bakken and Diluted Bitumen) were added to the microcosms to simulate oil contamination. An additional set of microcosms was set up with no added oil to serve as a control. Microbial communities in these microcosms were profiled using 16S rRNA gene sequencing. Initial diversity analysis indicated a distinct community composition in oil-amended microcosms. Further, community composition differed between the microcosms amended with each oil type. This suggests that the addition of oil, and the type of oil in particular, selects for a distinct set of microbes. The random forest model was able to predict the presence of oil with an accuracy of greater than 95% and a kappa of 0.92 using microbial community composition alone. Another model was constructed to differentiate between the types of oil in the microcosm using microbial information. This model was able to bin samples into control or the two oil types with an accuracy of 78% and a kappa of 0.67. This indicates that machine learning can be used to identify patterns in microbial community data that are indicative of the presence of oil in contaminated water samples.
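The two performance figures quoted above, accuracy and kappa, can be computed from true and predicted labels as sketched below. Cohen's kappa is assumed to be the kappa reported; the labels in the example are invented and do not reproduce the study's results.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement beyond what label frequencies alone would give."""
    n = len(y_true)
    observed = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    expected = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when chance agreement is total; this sketch returns 0.0.
        return 0.0
    return (observed - expected) / (1 - expected)

# Ten hypothetical microcosms: 8 of 10 predictions are correct.
y_true = ["oil"] * 5 + ["control"] * 5
y_pred = ["oil"] * 4 + ["control"] + ["control"] * 4 + ["oil"]
acc = accuracy(y_true, y_pred)        # 0.8
kappa = cohen_kappa(y_true, y_pred)   # (0.8 - 0.5) / (1 - 0.5) = 0.6
```

Kappa is lower than accuracy because it discounts the agreement a classifier would reach by chance, which matters most when the classes are unbalanced.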

This study provides more support for the ability of natural microbial communities to serve as sensors of environmental phenomena. Further investigation will provide insights into the use of these models for biomarker identification and the potential for rapid environmental diagnostic approaches.
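The abstract above reports both accuracy and Cohen's kappa for its classifiers. As a minimal sketch of how the two relate, the snippet below computes kappa from a confusion matrix. The function name and the counts are hypothetical and chosen only for illustration; they are not the study's data.

```python
def cohen_kappa(confusion):
    """Cohen's kappa for a square confusion matrix (rows: truth, columns: prediction)."""
    total = sum(sum(row) for row in confusion)
    k = len(confusion)
    # Observed agreement is the plain accuracy.
    observed = sum(confusion[i][i] for i in range(k)) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal totals alone.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    return (observed - expected) / (1 - expected)

# Hypothetical counts (NOT the study's data): 50 control and 50 oil-amended samples.
toy_confusion = [[48, 2], [2, 48]]
accuracy = (48 + 48) / 100
kappa = cohen_kappa(toy_confusion)
```

Kappa discounts the agreement that would be expected by chance, which is why it is lower than the raw accuracy for the same predictions.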
Microbiome II
Chair: Sailendharan Sudakaran
1:30 PM Zero-Inflated Generalized Dirichlet Multinomial (ZIGDM) Regression Model for Microbiome Compositional Data Zheng Zhen Tang
There is heightened interest in using high-throughput sequencing technologies to quantify abundances of microbial taxa and linking the abundance to human diseases and traits. Proper modeling of multivariate taxon counts is essential to the power of detecting this association. Existing models are limited in handling excessive zero observations in taxon counts and in flexibly accommodating complex correlation structures and dispersion patterns among taxa. We develop a new probability distribution, Zero-Inflated Generalized Dirichlet Multinomial (ZIGDM), that overcomes these limitations in modeling multivariate taxon counts. Based on this distribution, we propose a ZIGDM regression model to link microbial abundances to covariates (e.g. disease status), and develop a fast Expectation-Maximization (EM) algorithm to efficiently estimate parameters in the model. The derived tests enable us to reveal rich patterns of variation in microbial compositions including differential mean and dispersion. The advantages of the proposed methods are demonstrated through simulation studies and an analysis of a gut microbiome dataset.
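To make the idea of a zero-inflated generalized Dirichlet multinomial concrete, here is a hedged sketch of one way to simulate counts from such a distribution. It assumes a stick-breaking construction in which each stick fraction is either a structural zero or a Beta draw; the function name, parameter names, and parameter values are illustrative assumptions, not the authors' implementation.

```python
import random

def sample_zigdm(depth, alphas, betas, zero_probs, rng):
    """Draw one taxon-count vector of length len(alphas) + 1 (illustrative sketch)."""
    remaining = 1.0
    proportions = []
    for a, b, pi in zip(alphas, betas, zero_probs):
        # With probability pi the stick fraction is a structural zero;
        # otherwise it is a Beta(a, b) draw (assumed construction).
        z = 0.0 if rng.random() < pi else rng.betavariate(a, b)
        proportions.append(remaining * z)
        remaining *= 1.0 - z
    proportions.append(remaining)  # the last taxon takes what is left of the stick
    # Multinomial sampling of `depth` reads given the proportions.
    draws = rng.choices(range(len(proportions)), weights=proportions, k=depth)
    counts = [draws.count(j) for j in range(len(proportions))]
    return counts, proportions

rng = random.Random(0)
counts, props = sample_zigdm(1000, [2.0, 1.0, 3.0], [5.0, 4.0, 2.0], [0.3, 0.0, 0.5], rng)
```

A taxon whose stick fraction is a structural zero receives no reads at all, which is the excess-zero behavior the abstract describes.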
2:00 PM Virus-Driven Metabolism of Sulfur Compounds: From Genes to Ecosystems Karthik Anantharaman
Microbial sulfur metabolism plays a critical role in the transformation of organic carbon compounds and nutrients in the environment and in human health and disease. Our current knowledge of the microbial ecology associated with this key element is primarily based on single-gene and cultivation-based studies of microorganisms that overlook viruses and provide no reliable information on comprehensive microbial metabolism. Genome-resolved metagenomics, an approach that can yield near-complete and even finished genomes for organisms, has the potential to fundamentally transform our understanding of ecosystems by enabling organism-specific descriptions of elemental transformations, redox processes and virus-host infection dynamics. In this presentation, I will describe the metabolic analysis of viral genomes to implicate novel viruses in production of hydrogen sulfide and identify new processes that contribute to sulfur transformations in human and environmental systems.
Microbiome III
Chair: Catherine Putonti
3:00 PM An Improved Species-level Taxonomic Classification Method for Marker Gene Sequences Qunfeng Dong, Xiang Gao, Huaiying Lin and Kashi Revanna
Species-level classification for 16S rRNA gene sequences remains a serious challenge for microbiome researchers, because existing taxonomic classification tools for 16S rRNA gene sequences either do not provide species-level classification, or their classification results are unreliable. The unreliable results are due to the limitations in the existing methods which either lack solid probabilistic-based criteria to evaluate the confidence of their taxonomic assignments, or use nucleotide k-mer frequency as the proxy for sequence similarity measurement. We have developed a method that shows significantly improved species-level classification results over existing methods. Our method calculates true sequence similarity between query sequences and database hits using pairwise sequence alignment. Taxonomic classifications are assigned from the species to the phylum levels based on the lowest common ancestors of multiple database hits for each query sequence, and further classification reliabilities are evaluated by bootstrap confidence scores. The novelty of our method is that the contribution of each database hit to the taxonomic assignment of the query sequence is weighted by a Bayesian posterior probability based upon the degree of sequence similarity of the database hit to the query sequence. Our method does not need any training datasets specific for different taxonomic groups. Instead only a reference database is required for aligning to the query sequences, making our method easily applicable for different regions of the 16S rRNA gene or other phylogenetic marker genes. Our software, called BLCA, is freely available at https://github.com/qunfengdong/BLCA.
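The weighting idea in the abstract above can be illustrated with a much-simplified sketch: each database hit votes for its taxon at a given rank, and the vote is scaled by a weight. Here the weights are hypothetical stand-ins for the posterior probabilities described in the abstract, and the function is not part of BLCA; the actual software derives its weights and bootstrap confidence scores from pairwise alignments.

```python
from collections import defaultdict

def weighted_taxon_confidence(hits, rank):
    """Share of total hit weight supporting each taxon at the given rank.

    hits: list of (lineage dict, weight) pairs; the weights are illustrative.
    """
    total = sum(weight for _, weight in hits)
    support = defaultdict(float)
    for lineage, weight in hits:
        support[lineage[rank]] += weight / total
    return dict(support)

# Hypothetical database hits for one query sequence.
hits = [
    ({"genus": "Lactobacillus", "species": "L. crispatus"}, 0.6),
    ({"genus": "Lactobacillus", "species": "L. iners"}, 0.3),
    ({"genus": "Streptococcus", "species": "S. mitis"}, 0.1),
]
species_conf = weighted_taxon_confidence(hits, "species")
genus_conf = weighted_taxon_confidence(hits, "genus")
```

In this toy case the species call is uncertain while the genus call is well supported, which mirrors assigning taxonomy from the species level up toward the phylum level.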
3:25 PM Controlling contaminant sequences in low microbial biomass microbiome studies Lisa Karstens
Microbial communities are commonly studied using culture-independent methods, such as 16S rRNA gene sequencing. However, one challenge that limits our ability to accurately characterize microbial communities is exogenous bacterial DNA introduced during sample processing. This is especially problematic for samples from low microbial biomass environments, such as the urinary tract, lower airway, upper atmosphere, and hospitals. While computational approaches have been proposed as a post-processing step to identify and remove potential contaminant sequences, their performance has not been independently evaluated.

To identify the impact of decreasing microbial biomass on 16S rRNA gene sequencing experiments, we performed a serial dilution on a mock microbial community. We evaluated four computational approaches to identify and remove contaminant sequences: 1) presence in a negative control, 2) low relative abundance, 3) inverse correlation with DNA concentration (Decontam frequency method) and 4) classification based on predefined environmental sources (SourceTracker).

As expected, the proportion of contaminant bacterial DNA increased with decreasing starting microbial biomass, with 79.12% of the most dilute sample arising from contaminant sequences. Inclusion of contaminant sequences led to overinflated diversity estimates and distorted microbiome composition. All methods for contaminant identification successfully identified some contaminant sequences. The accuracy of these methods varied depending on the method parameters used and contaminant prevalence. Notably, removing sequences present in a negative control erroneously removed >64.2% of expected sequences. SourceTracker successfully removed over 98% of contaminants when the experimental environments were well-defined. However, SourceTracker misclassified a subset of expected sequences, and performed poorly when the experimental environment was unknown, failing to remove >99% of contaminants. In contrast, the Decontam frequency method successfully removed 74 - 91% of contaminants and did not remove any expected sequences. Our results indicate that contaminant bacterial DNA is problematic for low microbial biomass samples, but computational approaches can be used to mitigate their impact.
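The frequency-based approach mentioned above rests on the expectation that a contaminant's relative abundance rises as input DNA concentration falls. The sketch below is a simplified illustration of that idea, not the Decontam package's code: it compares a slope of -1 against a slope of 0 on the log scale and reports which fits better. The function name and all numbers are hypothetical.

```python
import math

def looks_like_contaminant(frequencies, concentrations):
    """Compare two fits of log frequency against log DNA concentration.

    Assumes all frequencies and concentrations are strictly positive.
    """
    log_f = [math.log(f) for f in frequencies]
    log_c = [math.log(c) for c in concentrations]
    n = len(log_f)
    # Contaminant model: frequency proportional to 1 / concentration (slope -1).
    offset = sum(f + c for f, c in zip(log_f, log_c)) / n
    rss_contaminant = sum((f + c - offset) ** 2 for f, c in zip(log_f, log_c))
    # Non-contaminant model: frequency independent of concentration (slope 0).
    mean_f = sum(log_f) / n
    rss_sample = sum((f - mean_f) ** 2 for f in log_f)
    return rss_contaminant < rss_sample

concentrations = [100.0, 10.0, 1.0, 0.1]   # hypothetical DNA concentrations
contaminant_like = [0.001, 0.01, 0.1, 0.8]  # rises as biomass falls
sample_like = [0.20, 0.22, 0.19, 0.21]      # roughly constant
```

A real analysis would turn the two residual sums into a score with a tunable threshold rather than a hard yes or no.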
3:50 PM The microbiome in severe ocular surface diseases Michael Zilliox, William Gange, John Thompson, Gina Kuffel, Carine Mores, Cara Joyce and Charles Bouchard
Although the ocular surface microbiome (OSM) has been reported in dry eye disease (DED), the specific changes that occur in other severe ocular surface diseases (OSDs) are currently unknown. We performed a prospective, observational study to characterize the OSM in several chronic OSDs including: Stevens-Johnson Syndrome (SJS), Graft versus Host Disease (GVHD), Floppy Eyelid Syndrome (FES), and Dry Eye Disease (DED), as well as healthy controls. Despite the low biomass of the ocular surface, 47 of the 78 (60%) eyes sampled had positive reads: (10/16 (63%) healthy controls, 12/16 (75%) SJS, 6/14 (43%) GVHD, 8/16 (50%) FES, 11/16 (69%) DED). We observed that nearly half of patients (8/17) had distinct microbiomes in each eye. Most healthy controls had a Lactobacillus/Streptococcus mixture in at least one eye (70%), and 30% had significant amounts of Corynebacterium. Staphylococcus was the dominant bacteria (>50% of the reads) for 4/7 (57%) patients with SJS in at least one eye, compared to 0/10 healthy controls. Patients with GVHD generated relatively few positive samples 6/14 (43%). On the other hand, more than half of the patients with FES and DED had positive samples, which had similar OSMs, with Corynebacterium being the most prevalent bacteria in most eyes. Patients with different OSD had different dominant bacteria: Staphylococcus predominated in SJS, Lactobacillus in GVHD, and Corynebacterium in DED and FES. A majority of healthy eyes had a Lactobacillus/Streptococcus microbiome, which may play a role in maintaining a healthy ocular surface.
4:15 PM The influence of environmental, dietary, and pharmaceutical agents on the human gut microbiome Michael Burns
The importance of the microbiome in human health and disease states is a rapidly developing field of research. Our previous research primarily focused on the role of the human microbiome in cancer, especially colorectal cancer. We have applied multi-omics approaches to investigate the relationship between specific colorectal cancer mutations across the exome and the corresponding microbial landscapes. As follow-up to this work, we are currently engaged in research that attempts to uncover the potential roles of dietary, environmental, and pharmacological elements on the structure and function of the gut microbiome. To execute this research we have collected microbial community samples from normal, healthy donors and used an in vitro model to test the effects of a wide array of compounds on the composition and abundance of microbes as they are selected in these environments. This research is ongoing and has the ultimate goal of generating a model that is able to predict shifts in human microbial communities from one state to another as a function of exposure to specific diets, environments, or treatments.
4:40 PM Dynamic interaction network inference from longitudinal microbiome data Jose Lugo-Martinez, Daniel Ruiz Perez, Giri Narasimhan and Ziv Bar-Joseph
Several studies have focused on the microbiota living in environmental niches including human body sites. In many of these studies researchers collect longitudinal data with the goal of understanding not just the composition of the microbiome but also the interactions between different taxa. However, analysis of such data is challenging and very few methods have been developed to reconstruct dynamic models from time series microbiome data. We propose a computational pipeline that enables the integration of data across individuals for the reconstruction of such models. Our pipeline starts by interpolating the samples using B-splines. Then, the interpolated data for all individuals are aligned by warping the time scale of each sample into the scale of another representative sample. This compensates for the different rates at which biological events happen in the different samples. The aligned profiles are then used to learn a Dynamic Bayesian Network, which represents causal relationships between taxa and clinical variables and is well suited to this task because of its interpretability. We tested our methods on three longitudinal microbiome data sets (infant gut, vagina and oral cavity). The models yielded biological insights, including several known and novel interactions. The extended CGBayesNets package and its documentation are freely available together with the journal paper. Our results provide evidence that microbiome alignments coupled with dynamic Bayesian networks improve predictive performance over previous methods and enhance our ability to infer biological relationships within the microbiome and between taxa and clinical factors.
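The alignment step described above can be sketched as a search for a time warp that best matches one individual's profile to a reference. The example below is a minimal illustration under stated assumptions: it uses piecewise-linear interpolation instead of B-splines, assumes a linear warp of the form t -> (t - shift) / stretch, and searches a small hypothetical grid. None of the names come from the authors' pipeline.

```python
def interpolate(times, values, t):
    """Piecewise-linear interpolation; assumes times are sorted and t is in range."""
    for (t0, v0), (t1, v1) in zip(zip(times, values), zip(times[1:], values[1:])):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t outside the sampled range")

def best_linear_warp(ref_t, ref_v, cand_t, cand_v, stretches, shifts, min_overlap=3):
    """Grid-search the warp t -> (t - shift) / stretch with the lowest mean squared error."""
    best = None
    for a in stretches:
        for b in shifts:
            warped = [((t - b) / a, v) for t, v in zip(cand_t, cand_v)]
            # Keep only points that land inside the reference time range.
            warped = [(w, v) for w, v in warped if ref_t[0] <= w <= ref_t[-1]]
            if len(warped) < min_overlap:
                continue
            mse = sum((v - interpolate(ref_t, ref_v, w)) ** 2 for w, v in warped) / len(warped)
            if best is None or mse < best[0]:
                best = (mse, a, b)
    return best

# Hypothetical abundance profiles: the candidate follows the reference at half speed, one unit late.
ref_t, ref_v = [0, 1, 2, 3, 4], [0.0, 1.0, 4.0, 2.0, 0.5]
cand_t, cand_v = [1, 3, 5, 7, 9], [0.0, 1.0, 4.0, 2.0, 0.5]
mse, stretch, shift = best_linear_warp(
    ref_t, ref_v, cand_t, cand_v, [0.5, 1.0, 1.5, 2.0], [0.0, 0.5, 1.0, 1.5, 2.0]
)
```

The minimum-overlap requirement keeps the search from favoring warps that match only one or two points.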


Varsity Hall I on Tuesday, May 21, 2019

Start Time Title Author(s)
Chair: Sarath Chandra Janga
10:40 AM Synthetic biology approaches to study and exploit RNA regulation Bryan Dickinson
RNA controls information flow through the central dogma and provides unique opportunities for manipulating cells. However, both fundamental understanding and potential translational applications are impeded by a lack of methods to study and exploit the regulation of RNA. Here, I will present two vignettes on our recent protein engineering and molecular evolution efforts focused on understanding and controlling RNA. First, I will unveil a new evolution system for creating reverse transcriptases that encode RNA modifications in mutations, which allow us to catalog the precise locations of a poorly-understood RNA methylation modification in mammalian cells. Second, I will present a new protein engineering strategy for constructing programmable RNA regulatory systems, built entirely from human protein parts. Collectively, our technology development focused around RNA regulation will continue to shed light on how mammalian cells function at a fundamental level, while also opening up new opportunities in molecular evolution and epitranscriptomic biotechnology development.
11:20 AM Deep learning framework for accurate transcriptome-wide identification of Gm RNA modification events at single molecule resolution from direct RNA sequencing data Sasank Vemuri, Swapna Vidhur, Raja Shekar Varma Kadumuri, Sarath Janga
The 2’-O-methylation of nucleotides (Nm) is one of the most abundant RNA modifications across the transcriptome [1]. Nm modifications have been observed to influence the stability, interactions and structure of RNA molecules [2]. Numerous disorders such as asthma, Alzheimer’s disease and cancer have been associated with Nm modifications [2]. Although recent methods have enabled the epitranscriptomic profiling of Nm in various cell lines and tissues, current sequencing-based methods have several major drawbacks, including multifaceted labor-intensive protocols, cross-reactivity to chemical compounds and ambiguous mapping of Nm positions resulting from short read sequencing technologies [3,4]. Most of these challenges can be circumvented by employing direct sequencing and mapping of modifications on RNA molecules. Nanopore direct RNA sequencing enables us to represent the RNA sequence data in the form of a signal that is directly dependent on the chemical composition of a given molecule. Such an accurate delineation of RNA sequence enables us to distinguish between the different chemical compositions constituting a given RNA molecule. 2’-O-methylguanosine (Gm) is an Nm modification in which the guanosine is methylated at the 2’ position of the ribose. Despite the functional significance of RNA modifications like Gm, there is currently no method to detect Gm modifications at single molecule resolution. Here we present Kea, a neural network-based framework developed using an LSTM architecture, for detection of Gm modifications from nanopore-based direct RNA sequencing data. When deployed on direct RNA-seq data from a MinION sequencer, Kea can not only predict the presence of Gm modifications across the RNA molecule but also precisely map the location of the modification at single nucleotide resolution.
Kea is trained and validated on ~18,000 experimentally known 2’-O-methylguanosine (Gm) and unmodified guanosine (G) signal signatures from matched direct RNA sequencing and Nm-seq data generated from the HeLa cell line. Kea takes the Albacore base-called FAST5 files resulting from direct RNA sequencing as input and provides Gm modification predictions at read level with a validation accuracy of ~90%. Independent validation of the model was performed on synthetically designed RNA oligos with polyA tails, with one Gm modification per read, which further supports the robustness of our model.
11:40 AM RNA editing in neural transcriptomes and antiviral immunity Helen Piontkivska, Noel-Marie Plonski, Heather Milliken Mercer, Caroline Nitirahardjo
During the last two decades, the United States and other countries in the Western Hemisphere have experienced a major uptick in the number of novel viral infections that significantly increased the public health burden, including arthropod-borne viruses (arboviruses) such as West Nile virus (WNV) and the latest entrant into the list of major global infectious disease threats, Zika virus (ZIKV). Following the discovery in 2015 of a link between ZIKV infection in pregnancy and microcephaly in infants, this public health crisis continues to present a significant hazard to human health, in major part due to the life-long profound disability that infants born with congenital Zika syndrome face. Nor are adults immune: known consequences of ZIKV infection in adults include Guillain-Barré syndrome (GBS), a debilitating peripheral neuropathy that is often life-threatening. Other arboviruses, such as WNV, are also associated with a diverse set of neurological symptoms in fetuses and adults, although the number of adverse outcomes linked with individual infections is likely underestimated, in part due to a lack of comprehensive surveillance. Thus, it is critical for us to decipher the specific mechanisms of host-pathogen relationships that underlie such pathogenesis. We recently showed that adenosine to inosine deamination, a type of RNA editing catalyzed by members of the adenosine deaminases acting on RNA (ADAR) gene family, plays a prominent role in the molecular evolution of ZIKV as part of the interferon-regulated antiviral response. The viral genomes were shown to harbor a signature of ADAR editing, including enrichment of nucleotides resistant to ADAR editing and underrepresentation of substitutions at resistant sites, thus reflecting both long- and short-term evolutionary consequences of viral interactions with the host innate immune response.
However, ADAR editing also occupies a key regulatory position in the neural transcriptome, where ADAR-mediated transcriptome diversification impacts the expression and function of various neural proteins, including neurotransmitter receptors and transporters. Our findings from transcriptome-based differential gene expression and RNA editing analyses, and from phylogenetic analyses of viral genomes, support the role of ADAR editing as a factor in the molecular evolution of RNA viruses as well as in viral-driven neurotoxicity. In the absence of effective prevention, understanding how RNA editing contributes to viral-mediated neural pathogenesis is critical to the development of treatments that can ameliorate potential sequelae of ZIKV and other arboviral infections.
Chair: Heidi Dvinge
1:30 PM Algorithmic decision-making in single-cell genomic data analyses Jun Li
In recent years, single-cell RNA sequencing (scRNA-seq) has emerged as a powerful technology for surveying cell types and state transitions in a cost-effective manner. ScRNA-seq data, along with a growing array of epigenomic data, can provide fresh insights into the functional heterogeneity in vivo that is not accessible with bulk-tissue analyses. As expected, the rapid adoption of single-cell measurement technologies has created a new pressure point around the computational analyses of these data. As of early 2019, >350 tools have appeared to address >30 scRNA-seq analysis tasks (e.g., normalization, clustering, and imputation). However, the community still struggles to identify the best workflow for any given task. I will briefly review the progress in this nascent field, and discuss the major challenges we still face as we move from the exciting applications in the first-wave to validation and standardization in the second-wave. I will share examples from our experience on learning the statistical properties of each dataset, finding the proper tools for data exploration, and navigating the cascade of algorithmic decisions. Single-cell data are unusually sparse and noisy. Most tools that address sparsity contain hidden assumptions that reflect the initial framing of the problem, with default parameters optimized for specific, first-use situations. Our efforts often begin with extracting the most relevant statistical "biosketch" of real datasets, then creating reusable simulated datasets accordingly, and applying the truth-known simulations in standardized method evaluation. The ability to benchmark data actions (i.e., algorithmic choices) using real-enough simulated data has become essential in developing customized pipelines that match the inherent difficulties in each real study. The lessons we learned can be generalized to other types of complex data.
2:10 PM Evaluation of Computational Methods to Deconvolute Cell Types in Single-Cell Transcriptomics data Qianhui Huang, Yu Liu, Lana Garmire
Single-cell RNA-Seq (scRNA-Seq) is a new technology that has transformed our ability to discover cell types and states in tissues and organisms. Annotating cell types in single-cell atlas data is a critical step in single-cell analysis. The conventional approach of using a set of cell-type-specific markers to manually define cell types is laborious and prone to bias and errors. To avoid these issues, some deconvolution methods have recently emerged to automatically assign cell types based on existing annotations from another dataset. However, their relative performance across a common set of benchmark datasets is unknown. For the first time, we evaluated four newly developed single-cell analysis methods with cell-type deconvolution capacities: Seurat, Garnett, SingleR and scmap. To investigate the potential of leveraging existing deconvolution methods for other omics analysis, we also included three methods originally designed for DNA methylation: CIBERSORT, Robust Partial Correlations and Constrained Projection. We used two benchmark datasets, representing liquid and solid tissues, respectively. The first benchmark dataset is from 10x Genomics and publicly accessible, composed of 94,655 bead-purified PBMCs of 6 major cell types (reference) and 2,467 unsorted fresh PBMCs collected from a healthy donor (query). The second, solid tissue benchmark dataset is human pancreatic islet data, composed of GSE85241 from the CelSeq2 platform (reference) and GSE86469 from Fluidigm C1 (query). To test the robustness of the methods, we randomly down-sampled the genes in the pancreatic dataset into subsets of 5,000, 10,000 and 15,000 genes, following the distribution of log-transformed gene counts. We used multi-class confusion matrices, adjusted Rand index, homogeneity, completeness and V-measure scores to evaluate the accuracy of classification. For the PBMC datasets, Seurat achieved the highest overall classification specificity of 90.23%.
While all the single-cell-specific methods scored higher than 50%, none of the methods designed for methylation deconvolution reached 50% accuracy. Most methods failed to differentiate CD4+ T cells from CD8+ T cells, except Seurat, which recovered 89% of CD8+ T cells and 84% of CD4+ T cells. For the pancreas datasets, Seurat achieved the highest overall classification specificity of 97.02%. All the single-cell-specific methods had above 90% accuracy, whereas the methylation methods reached only around 75% accuracy. When genes were randomly dropped from the dataset, Seurat and SingleR achieved the most stable prediction performance, with accuracies above 93% for all the gene subsets. However, methylation methods such as Constrained Projection and CIBERSORT showed exceptional precision in annotating rare cell types in the datasets. For instance, the Constrained Projection method scored 100% in predicting the labels for epsilon cells, Schwann cells and macrophages, each of which made up less than 1% of the population. In general, Seurat and SingleR had the most accurate and consistent performance in annotating major cell types and differentiating cell subtypes compared to the other methods. Despite their relatively low overall accuracy, methods such as Constrained Projection and CIBERSORT predicted rare cell types well, suggesting their complementary value.
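Among the evaluation scores listed above, the adjusted Rand index is straightforward to compute from label pairs. The following is a minimal standard-library sketch of the usual pair-counting formula; the cell labels in the example are hypothetical and unrelated to the benchmark datasets.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, predicted_labels):
    """Adjusted Rand index from pair counts.

    Undefined (division by zero) when both labelings place every item in a
    single group, or every item in its own group.
    """
    n = len(true_labels)
    contingency = Counter(zip(true_labels, predicted_labels))
    same_both = sum(comb(c, 2) for c in contingency.values())
    same_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    same_pred = sum(comb(c, 2) for c in Counter(predicted_labels).values())
    expected = same_true * same_pred / comb(n, 2)
    maximum = (same_true + same_pred) / 2
    return (same_both - expected) / (maximum - expected)

# Hypothetical annotations for four cells.
truth = ["B", "B", "T", "T"]
perfect = adjusted_rand_index(truth, ["c1", "c1", "c2", "c2"])
mixed = adjusted_rand_index(truth, ["c1", "c2", "c1", "c2"])
```

The score depends only on which cells are grouped together, so a perfect grouping scores 1 regardless of what the predicted labels are called, and a grouping worse than chance can score below 0.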
Chair: Heidi Dvinge
3:00 PM RNA Structure Prediction David Mathews
The folding stability of RNA secondary structure can be estimated using a nearest neighbor model, and this model is in widespread use to predict RNA secondary structures. Nearest neighbor parameters for predicting RNA folding free energy change at 37°C are based on a database of optical melting measurements on small model systems. This work revises and expands the nearest neighbor model by including the latest experimental results on the melting stability of small model systems. The Akaike information criterion (AIC), a statistical model selection method, was applied to select the nearest neighbor parameters that contribute significantly to the stability of loops and to prevent overfitting. Surprisingly, we found that the AU helix-end penalty was removed by AIC model selection for hairpin loops, indicating that the AU end penalty should not be applied to hairpin loops. We also found that the stability of hairpin loops is independent of the first mismatch sequence, which was assumed to be important in the previous 2004 model. We benchmarked on a set of 3856 RNA sequences with known structures by implementing both the 2004 and the new nearest neighbor parameters in the RNAstructure software package for RNA secondary structure prediction. Secondary structure prediction identified 1% more of the known pairs using the new model compared to the 2004 model, and this improvement is statistically significant. Therefore, the new hairpin loop model predicts RNA secondary structure more accurately. We are implementing the complete new set of nearest neighbor parameters and hypothesize that this will improve the accuracy of RNA secondary structure prediction significantly.
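To illustrate how AIC-based selection can drop a parameter, as happened with the AU helix-end penalty above, here is a toy sketch that compares a one-parameter and a two-parameter least-squares fit. It uses the common Gaussian-error form of AIC, defined up to an additive constant; the function names and data are hypothetical and this is not the authors' regression.

```python
import math

def aic_least_squares(rss, n_points, n_params):
    """AIC for a least-squares fit with Gaussian errors, up to an additive constant."""
    return n_points * math.log(rss / n_points) + 2 * n_params

def select_model(x, y):
    """Return 'linear' if adding a slope lowers AIC, otherwise 'constant'."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    rss_constant = sum((yi - y_bar) ** 2 for yi in y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    rss_linear = sum((yi - y_bar - slope * (xi - x_bar)) ** 2 for xi, yi in zip(x, y))
    aic_constant = aic_least_squares(rss_constant, n, 1)
    aic_linear = aic_least_squares(rss_linear, n, 2)
    return "linear" if aic_linear < aic_constant else "constant"

x = [0, 1, 2, 3, 4, 5]
trend = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9]      # hypothetical data with a clear slope
no_trend = [1.0, 1.2, 0.9, 1.1, 1.0, 1.05]  # hypothetical data with no real slope
```

In the second dataset the extra slope term barely reduces the residual error, so its penalty of 2 outweighs the gain and the simpler model is kept.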
3:40 PM RNA structure elucidation at single molecule resolution using an integrated framework of long read sequencing and machine learning Swapna Vidhur Daulatabad, Molly Evans, Quoseena Mir, Julius Lucks, Sarath Chandra Janga
RNA molecules have a wide functional diversity [1], and one of the key attributes governing this diversity is the dynamic nature of RNA structure [2]. Despite rapid progress in identifying novel functions and novel classes of RNAs [3], we still lack a comprehensive understanding of RNA structure. Advances in next generation sequencing and chemical probing methods, and their efficient integration, have enabled us to characterize RNA structure more clearly than ever [4]. One protocol that dissects RNA structure at high throughput is SHAPE-seq [5]. However, SHAPE-seq and similar protocols share common limitations, including the use of short read sequencing as a downstream step, which provides incomplete information about the RNA molecule and lacks isoform-level specificity for the transcript. This hinders the development of a holistic and comprehensive paradigm for the RNA structure problem. Therefore, we propose an efficient integration of Nanopore-based single molecule direct RNA long read sequencing and advanced machine learning frameworks to infer and predict RNA structure accurately. The SHAPE-seq protocol aims to modify the RNA once per molecule ("single-hit" kinetics), and Nanopore enables full-length and strand-specific direct RNA sequencing at single molecule resolution. Combining both approaches helps us reduce the biases induced by reverse transcriptase truncation and non-specificity of the transcripts. Traditional chemical probing combined with short read sequencing and NMR validation has enabled us to develop better model RNAs: the TSL2 hairpin (85 nt) and the HepC IRES (338 nt). Hence, in this study, in vitro transcribed TSL2 hairpin and HepC IRES RNA molecules were probed with the 1M7 reagent, each with a control 'probed' with DMSO. Polyadenylated RNA molecules were sequenced using a MinION from Oxford Nanopore Technologies to generate 200k reads for 5N5C and 605k reads for HepC IRES.
The average base quality for both runs was 8, and the average read lengths were 120 bp and 400 bp for 5N5C and HepC IRES, respectively. Furthermore, we plan to train multiple machine learning models, including deep learning frameworks, to predict the modified bases and thereby the structure of the molecule from the raw direct RNA sequencing data of these molecules, in order to develop de novo RNA structure models at single molecule resolution. Such accurate RNA structure prediction models and corresponding methods can help us gain insights into the role of structure in modulating post-transcriptional regulatory processes.
4:00 PM Integrating thermodynamic and sequence contexts improves protein-RNA binding prediction Yunan Luo, Yufeng Su, Xiaoming Zhao, Yang Liu, Jian Peng
Predicting RNA-binding protein (RBP) specificity is important for understanding gene expression regulation and RNA-mediated enzymatic processes. It is widely believed that RBP binding specificity is determined by both the sequence and structural contexts of RNAs. Existing approaches, including traditional machine learning algorithms and, more recently, deep learning models, have been extensively applied to integrate RNA sequence with predicted or experimental RNA structural probabilities to improve the accuracy of RBP binding prediction. Such models were trained mostly on large-scale in vitro datasets, such as the RNAcompete dataset. However, in RNAcompete, most synthetic RNAs are unstructured, which prevents machine learning methods from effectively extracting RBP-binding structural preferences. Furthermore, RNA structure may be variable or multi-modal according to both theoretical and experimental evidence. In this work, we propose ThermoNet, a thermodynamic prediction model that integrates a new sequence-embedding convolutional neural network model over a thermodynamic ensemble of RNA secondary structures. First, the sequence-embedding convolutional neural network generalizes the existing k-mer based methods by jointly learning convolutional filters and k-mer embeddings to represent RNA sequence contexts. Second, the thermodynamic average of deep learning predictions is able to explore structural variability and improves the prediction, especially for structured RNAs. Extensive experiments demonstrate that our method significantly outperforms existing approaches, including RCK, DeepBind and several other recent state-of-the-art methods, in both in vitro and in vivo prediction. The implementation of ThermoNet will be available on GitHub at the publication time.
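The notion of a thermodynamic average of predictions can be sketched as a Boltzmann-weighted mean over candidate structures. The snippet below is only an illustration under stated assumptions: the free energies and per-structure scores are hypothetical placeholders, and the functions are not taken from ThermoNet.

```python
import math

# Gas constant in kcal/(mol*K) times 310.15 K (37 degrees Celsius).
RT_KCAL_PER_MOL = 0.0019872 * 310.15

def boltzmann_weights(free_energies):
    """Normalized Boltzmann weights for structures with free energies in kcal/mol."""
    lowest = min(free_energies)  # subtract the minimum for numerical stability
    factors = [math.exp(-(g - lowest) / RT_KCAL_PER_MOL) for g in free_energies]
    total = sum(factors)
    return [f / total for f in factors]

def ensemble_average(scores, free_energies):
    """Average per-structure prediction scores, weighted by structure probability."""
    return sum(w * s for w, s in zip(boltzmann_weights(free_energies), scores))

# Hypothetical ensemble of three secondary structures for one RNA.
energies = [-5.0, -4.0, -1.0]  # kcal/mol, illustrative values
scores = [0.9, 0.4, 0.1]       # illustrative binding scores from some predictor
weights = boltzmann_weights(energies)
average = ensemble_average(scores, energies)
```

Low-energy structures dominate the average, but alternative structures still contribute, which is one way to account for structural variability rather than relying on a single predicted fold.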
4:20 PM Integrative analysis of RNA-seq and eCLIP-seq with SURF for elucidating alternative splicing roles of RNA binding proteins Fan Chen, Sunduz Keles
RNA binding proteins (RBPs) regulate post-transcriptional events, during which alternative splicing (AS) of pre-mRNA provides the major source of variation for protein diversity in the human genome. RBPs control inclusion or exclusion of exons by interacting with different genomic locations relative to the splice sites. This leads to the generation of mixtures of transcript variants and thus protein isoforms. Alterations in alternative splicing sites (e.g., the hnRNP A1 binding site in SMN2) are implicated in disease development and are of great value for therapeutic targeting. High throughput technologies such as eCLIP-seq for profiling RBP binding sites in vivo have recently matured. These, combined with complementary RNA-seq experiments from conditions with functioning and knocked-down RBPs, provide an unprecedented opportunity for deciphering the working principles of RBPs. Recently, the Encyclopedia of DNA Elements (ENCODE) project generated such high throughput data for 120 RBPs. We present a statistical framework, named statistical utility for analysis of RBPs’ splicing functions (SURF), to integrate human genome annotation, RNA-seq and eCLIP-seq data. SURF enables identifying positional preferences of RBPs in different classes of alternative splicing events and mining TCGA and GTEx for associations with genomic elements targeted by RBPs.
Specifically, SURF features (i) an automated annotation-based parsing of AS events, including skipped/cassette exon, alternative 5′/3′ splice site, alternative first exon, alternative polyadenylation, and retained intron; (ii) a count-based method that detects the differential alternative splicing (DAS) events between RNA-seq experimental conditions and accounts for biological variation by taking advantage of experimental replicates; (iii) a model-based analysis that links direct RBP interaction with RNA to the resulting DAS events; (iv) a rank-based method that assesses the differential activity of any set of RBP targets (gene or transcript level) between diverse conditions (e.g., cancer versus normal from TCGA) and links phenotypic changes to RBPs. We evaluated SURF with computational experiments and applied it to ENCODE data from 120 RBPs, generating a comprehensive analysis of RBPs in terms of their regulatory rules for different types of alternative splicing events. In addition to recovering splicing functions of previously known RBPs (e.g., SRSF1 and RBM15), SURF identified novel splicing patterns for other splicing factors (e.g., PUM1, PUM2). In summary, SURF allows a comprehensive mapping of RBPs to their AS functions and a systematic compilation of the regulated targets for downstream discovery.
4:40 PM A deep learning-based method for the normalization of data from genome-wide CRISPR-Cas9 screens Henry Ward, Ahm Mahfuzur Rahman, Chad Myers
While CRISPR-Cas9 genome editing technology has enabled quick and flexible genome-wide screens in human cell cultures, the data this technology generates are still poorly understood. A particular area of interest involves genome-wide single-knockout screens, which have previously been performed in model organisms such as yeast to create genetic interaction maps or explore phenotypes for genes under conditions of interest. In human cell cultures, the largest corresponding dataset generated using CRISPR-Cas9 editing thus far is a map of genetic dependencies in cancer produced by the Broad Institute (Tsherniak, A. et al., 2017). Currently, this dataset represents genome-wide single-knockout screens of 17,634 genes conducted across 558 cell lines. Several studies have demonstrated that the genetic dependency profiles, i.e., the signatures of single genes' essentiality across this collection of cell lines, contain rich functional information. However, both the signal driving the effects observed in this large-scale dataset and the experimental and statistical artifacts present in the dataset remain poorly characterized. Interestingly, the size of the dataset makes it amenable to deep learning techniques originally developed for processing and analyzing image datasets. While many genomic datasets are not large enough to support the application of deep learning algorithms, some success has been found using deep learning on single-cell RNA-sequencing data. These approaches learn a dimensionality-reduced latent space representation of the data, using autoencoders or generative networks, in order to augment and normalize the data (Regier et al., 2018; Eraslan et al., 2019). We reason that a similar approach could be taken to learn a latent-space representation of the cancer dependency map and to normalize it, or at least to elucidate the signal in it. Here, we present a novel generative adversarial network (GAN)-based approach for normalizing CRISPR-Cas9 genome-wide screening data.
Specifically, given a dataset of sufficient size, each gene-wise profile across N-squared cell lines can be represented as an N by N image. For the cancer dependency map, we chose N=22 after subsetting the dataset to 484 cell lines, which results in a dataset of 17,804 gene-wise "images." Under the assumption that systematic artifacts in the gene-wise profiles would be low-dimensional, we trained a GAN on the map using a simple architecture and a 2-dimensional latent space to generate synthetic gene-wise images. After generating a set of 50,000 gene-wise images, we applied PCA to identify the principal components of the synthetic dataset. To normalize what we expect are either experimental artifacts or non-specific biological signals obscuring more specific functional relationships, we then projected the real set of 17,804 gene-wise images onto the first several principal components of the synthetic data, and subtracted the corresponding projections from the real dataset to effectively remove these patterns from the data. Our preliminary analyses indicate that the GAN was trained successfully and that, based on evaluations against external functional datasets, the GAN-normalized dataset captures more signal than a dependency map normalized with the same procedure using principal components of the real dataset. GANs are notoriously difficult to train, but we suggest that standard training metrics like discriminator accuracy, together with an examination of the resulting synthetic gene-wise profiles in a heatmap, are sufficient to determine whether a GAN failed to train on non-image data. Moreover, precision-recall and receiver operating characteristic curves using Pearson correlations of gene-wise profiles to predict CORUM co-complex membership reveal a large performance benefit for the GAN-normalized data, compared to both the real dependency map and the aforementioned real PC-normalized dependency map.
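The projection-and-subtraction step described above can be sketched as follows. This is a minimal illustration on random toy matrices, not the authors' pipeline: the function name `remove_top_components`, the choice to center both matrices on the synthetic mean, and the toy sizes are all assumptions, and the GAN itself is not shown.

```python
import numpy as np

def remove_top_components(real, synthetic, n_components):
    """Subtract from `real` its projection onto the top principal
    components of `synthetic` (rows are gene-wise profiles)."""
    # Assumption: center both matrices on the synthetic mean so that the
    # components and the projections share one origin.
    mean = synthetic.mean(axis=0)
    # Rows of vt are the principal axes of the synthetic data.
    _, _, vt = np.linalg.svd(synthetic - mean, full_matrices=False)
    components = vt[:n_components]            # shape (k, n_cell_lines)
    centred = real - mean
    projection = centred @ components.T @ components
    return centred - projection

rng = np.random.default_rng(0)
n_side = 22                                   # N = 22, so 484 cell lines
real = rng.normal(size=(100, n_side * n_side))       # toy stand-in for real profiles
synthetic = rng.normal(size=(500, n_side * n_side))  # toy stand-in for GAN output

# Each 484-long profile can be viewed as a 22 x 22 "image".
images = real.reshape(-1, n_side, n_side)

normalized = remove_top_components(real, synthetic, n_components=3)
```

After this step the normalized profiles carry no variance along the removed synthetic components, which is the sense in which those patterns are "removed".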

The Marquee on Wednesday, May 22, 2019

Start Time Title Author(s)
9:00 AM Keynote #5 - Bridging Pre-Clinical Drug Screening with Patient Molecular Profiles for Biomarker Discovery and Drug Repurposing R. Stephanie Huang
Introduction by Russell Schwartz
Using computational methods developed in our lab, we imputed drug response in very large clinical cancer genomics data sets, such as The Cancer Genome Atlas (TCGA). This yields a new resource of imputed drug response for every drug in each patient. These imputed drug response data are then used for biomarker identification through association analysis with various molecular markers measured in the clinical cohort, for drug repurposing, or both.
General Track - Networks II
Chair: Arjun Krishnan

10:30 AM Connectivity measures for signaling pathway topologies Nicholas Franzese, Adam Groce, T. M. Murali and Anna Ritz
Characterizing cellular responses to different extrinsic signals is an active area of research, and curated pathway databases describe these complex signaling reactions. Here, we revisit a fundamental question in signaling pathway analysis: are two molecules "connected" in a network? This question is the first step towards understanding the potential influence of molecules in a pathway, and the answer depends on the choice of modeling framework. We examined the connectivity of Reactome signaling pathways using four different pathway representations. We find that Reactome is very well connected as a graph, moderately well connected as a compound graph or bipartite graph, and poorly connected as a hypergraph (which captures many-to-many relationships in reaction networks). We present a novel relaxation of hypergraph connectivity that iteratively increases connectivity from a node while preserving the hypergraph topology. This measure, B-relaxation distance, provides a parameterized transition between hypergraph connectivity and graph connectivity. B-relaxation distance is sensitive to the presence of small molecules that participate in many functionally unrelated reactions in the network. We also define a score that quantifies one pathway's downstream influence on another, which can be calculated as B-relaxation distance gradually relaxes the connectivity constraint in hypergraphs. Computing this score across all pairs of 34 Reactome pathways reveals two case studies of pathway influence, and we describe the specific reactions that contribute to the large influence score. Our method lays the groundwork for other generalizations of graph-theoretic concepts to hypergraphs in order to facilitate signaling pathway analysis.
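The contrast between graph and hypergraph connectivity can be sketched on a toy reaction network. The code below is a simplified reading, not the authors' definition: it assumes directed hyperedges given as (tail set, head set) pairs, uses the standard B-connectivity rule (a hyperedge can be crossed only when all of its tail nodes are reached), and assumes that each relaxation round additionally crosses hyperedges that have at least one reached tail node. All function names are hypothetical.

```python
def graph_reachable(hyperedges, source):
    """Reachability when a hyperedge can be crossed from ANY tail node."""
    reached = {source}
    changed = True
    while changed:
        changed = False
        for tail, head in hyperedges:
            if tail & reached and not head <= reached:
                reached |= head
                changed = True
    return reached

def b_connected(hyperedges, sources):
    """Reachability when a hyperedge needs ALL tail nodes to be reached."""
    reached = set(sources)
    changed = True
    while changed:
        changed = False
        for tail, head in hyperedges:
            if tail <= reached and not head <= reached:
                reached |= head
                changed = True
    return reached

def relaxation_distance(hyperedges, source):
    """Assumed relaxation: round k may cross hyperedges with at least one
    tail node in the set reached by round k - 1."""
    dist = {}
    frontier = b_connected(hyperedges, {source})
    k = 0
    while True:
        for node in frontier:
            dist.setdefault(node, k)
        relaxed = set(frontier)
        for tail, head in hyperedges:
            if tail & frontier:
                relaxed |= head
        nxt = b_connected(hyperedges, relaxed)
        if nxt == frontier:
            return dist
        frontier = nxt
        k += 1

# Toy reactions: A -> B, B + C -> D, D + E -> F
reactions = [
    (frozenset("A"), frozenset("B")),
    (frozenset("BC"), frozenset("D")),
    (frozenset("DE"), frozenset("F")),
]
as_graph = graph_reachable(reactions, "A")       # {'A', 'B', 'D', 'F'}
as_hypergraph = b_connected(reactions, {"A"})    # {'A', 'B'}
distances = relaxation_distance(reactions, "A")  # {'A': 0, 'B': 0, 'D': 1, 'F': 2}
```

In this toy network D and F are reachable in the graph view but not in the strict hypergraph view, and the distance records how many relaxation rounds each node needs.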
11:00 AM A computational framework for benchmarking human CRISPR screens Ahm Mahfuzur Rahman, Maximilian Billmann, Michael Costanzo, Matej Usaj, Brenda Andrews, Charles Boone, Jason Moffat and Chad Myers.
Genome-wide loss-of-function studies have been a powerful tool for investigating the functional organization of biological systems. In Saccharomyces cerevisiae, phenotypes of almost all possible single and double mutants have been explored in detail. While technological limitations have prevented similar studies in humans, the emergence of CRISPR-based technologies now enables systematic functional interrogation of every gene in the human genome.

A variety of computational methods for scoring and analyzing phenotypes from CRISPR screens in human cells have now been proposed. These methods can be divided into two major categories: methods that quantify the single gene knock-out (single mutants) effects, and methods that score the effects of knock-outs of gene combinations (usually double mutants) manifested in genetic interactions. Despite the fact that most of these approaches are designed with similar goals in mind, systematic, objective benchmarking of the computational methods has been limited.

We developed a framework for comparative evaluation of computational scoring methods for CRISPR screens. Our approach consists of two main elements: functional standards capturing relationships among genes, and metrics for benchmarking against these standards. Given a scoring method of interest, we use precision/recall-based performance measures to quantify the ability of the approach to produce genetic interactions that are predictive of functionally related genes (in the case of double mutant screens), or alternatively, to produce genetic dependency profiles over which similarity is predictive of functionally related genes.
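A precision/recall evaluation of this kind can be sketched as below. This is an illustrative toy, not the framework itself: the function name, the gene names, the scores, and the gold-standard pairs are all made up, and the scores stand in for either genetic interaction scores or profile similarities.

```python
def precision_recall_curve(scored_pairs, positives):
    """Rank gene pairs by score and report (recall, precision) at each rank.

    scored_pairs: dict mapping a gene pair (tuple) to its score.
    positives: set of gene pairs labeled as functionally related.
    """
    ranked = sorted(scored_pairs, key=scored_pairs.get, reverse=True)
    total_pos = sum(1 for pair in ranked if pair in positives)
    if total_pos == 0:
        raise ValueError("no positive pairs among the scored pairs")
    curve, true_pos = [], 0
    for rank, pair in enumerate(ranked, start=1):
        if pair in positives:
            true_pos += 1
        curve.append((true_pos / total_pos, true_pos / rank))
    return curve

# Hypothetical scores and a hypothetical co-complex standard.
scores = {("g1", "g2"): 0.9, ("g1", "g3"): 0.7, ("g2", "g3"): 0.6, ("g1", "g4"): 0.1}
standard = {("g1", "g2"), ("g2", "g3")}
curve = precision_recall_curve(scores, standard)
```

Restricting `positives` to the pairs of a single functional category gives the kind of per-function assessment that a global curve cannot show.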

Given our incomplete understanding of the human genome and a lack of established 'ground truth' datasets for functional relationships, we leverage a combination of curated and data-driven functional gold standards. We incorporate curated functional information from several sources including the Gene Ontology, Protein Complex (CORUM), and diverse Pathway standards, each of which captures the functional information at different levels of resolution. Additionally, we include data-driven, inferred functional relationships derived from unbiased integration of large-scale genomic datasets. Importantly, our approach assesses overall performance in capturing functional relationships, but also provides a local assessment of performance across a variety of different functions, enabling detection of specific biological signals that may be preferentially captured by a particular scoring approach.

We describe several insights from applications of our benchmarking framework to genome-scale CRISPR-based interaction screens. For example, across a collection of double mutant screens, we find that negative genetic interactions generally contain more specific functional information than positive genetic interactions over all types of functional gold standards. We also demonstrate the potential of our framework in guiding future screens as it enables estimation of the number of genome-wide screens necessary to cover a biological process of interest. Finally, we describe instances where our approach is able to detect and remove biases toward specific biological functions when the performance is dominated by a few specific functional categories. In general, this benchmarking framework provides a concrete basis for comparative evaluations of CRISPR screens and guides the design of improved scoring methods.
11:15 AM Parameter Selection in Biological Pathway Prediction with Graphlet Based Similarity Score Chris Magnano and Anthony Gitter
A common way to integrate and analyze large amounts of biological “omic” data from transcriptomic, proteomic, or metabolomic assays is through network analysis. An important problem in biological network analysis is pathway finding: creating a subnetwork of the known proteome which represents some process or cellular state. Typically, this type of analysis is done by deriving subnetworks algorithmically from a generic background network and the condition-specific input omic data.
A challenge in pathway finding is how to select what kind of network is most useful for hypothesis generation and further experimental analysis. Pathway creation algorithms typically have parameters whose adjustment can produce pathways with drastically different topological properties. However, in common practice, parameters for pathway creation are chosen manually, based on intuition about whether a given algorithmically produced network appears similar to biological pathways.
The pathway finding task does not allow for traditional parameter selection methods. Most pathway finding methods have no overall likelihood or other measure of the goodness of a found pathway to optimize. Because pathway finding is exploratory, there is no ground truth for a particular experiment, so parameter tuning methods used in traditional machine learning cannot be used either. While most methods ship with default parameter settings that their creators found to work well on a particular problem, the diversity of biological pathways and omic data makes it difficult to select a single set of parameters that works well in all situations.
We have developed a method, based on the parameter advising framework, to tune network algorithms to eliminate biologically implausible predictions. While no ground truth exists for de novo pathway creation, by leveraging the background knowledge in manually curated pathway databases we can select pathways whose high-level structure resembles that of known biological pathways. At the core of this method is a graphlet decomposition metric, which measures topological similarity to sets of publicly available curated biological pathways. In order to account for the wide range of topologies in biological pathways, we require a generated pathway to be similar to only a subset of curated pathways to be considered topologically sound.
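The idea of scoring a candidate pathway by topological similarity to a subset of curated pathways can be sketched as follows. This is a deliberately reduced stand-in for the graphlet metric: it counts only the two connected 3-node graphlets (open paths and triangles), uses an L1 distance between frequency vectors, and averages over the `k` nearest curated pathways. The function names, the distance, and the toy networks are assumptions, not the authors' method.

```python
from itertools import combinations

def graphlet_profile(edges):
    """Relative frequency of (open 3-node paths, triangles) in an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    wedges = triangles = 0
    for a, b, c in combinations(sorted(adj), 3):
        n_edges = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
        if n_edges == 3:
            triangles += 1
        elif n_edges == 2:
            wedges += 1
    total = wedges + triangles
    if total == 0:
        return (0.0, 0.0)
    return (wedges / total, triangles / total)

def profile_distance(p, q):
    """L1 distance between two graphlet frequency vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))

def topology_score(candidate_edges, curated_pathways, k):
    """Mean distance to the k most similar curated pathways (lower is better)."""
    cand = graphlet_profile(candidate_edges)
    dists = sorted(profile_distance(cand, graphlet_profile(e)) for e in curated_pathways)
    return sum(dists[:k]) / k

# Hypothetical curated pathways: a short chain and a star.
curated = [
    [("P", "Q"), ("Q", "R")],
    [("H", "X"), ("H", "Y"), ("H", "Z")],
]
# Hypothetical outputs of one algorithm under two parameter settings.
candidates = {
    "sparse": [("A", "B"), ("B", "C"), ("C", "D")],
    "dense": list(combinations("ABCD", 2)),
}
scores = {name: topology_score(e, curated, k=1) for name, e in candidates.items()}
best = min(scores, key=scores.get)
```

Using only the nearest `k` curated pathways mirrors the requirement that a generated pathway resemble some, not all, known topologies.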
We evaluated our method by sampling known biological pathways to simulate omic data. A generated pathway should then ideally match the known pathway from which the data were sampled, so we can calculate its precision and recall. We use our method to select parameters for four existing pathway creation methods on these simulated datasets. Preliminary results suggest that pathways created with parameters chosen through our method recover known pathway interactions better than pathways created with default parameter settings. Typically, the parameters chosen are close to the best possible setting, that is, the one that would be selected if the true pathway were known beforehand. Our parameter selection approach is method-agnostic; it is applicable to any pathway finding algorithm that returns a network.
11:30 AM Rice Genes Prioritization for Cold Tolerance Using Random Walk with Restart on Multiplex Heterogeneous Network Cagatay Dursun, Naoki Shimoyama, Mary Shimoyama, Michael Schlappi and Serdar Bozdag.
Cold stress is a major factor limiting the crop yield of rice (Oryza sativa), a tropical plant, in the northern hemisphere. It is known that some varieties of Oryza sativa are more cold-tolerant than others. However, the genes that are related to cold tolerance in Oryza sativa cultivars remain elusive. To identify genes related to cold tolerance in rice, we conducted Electrolyte Leakage (EL) and Low Temperature Seedling Survivability (LTSS) phenotype experiments on 360 Oryza sativa cultivars at five different temperatures. Using the EL and LTSS outcomes, a genome-wide association study (GWAS) was conducted to extract the potential cold-tolerance alleles. The GWAS analysis reported DNA regions that harbor thousands of genes potentially associated with cold tolerance. Given the sheer number of potential cold tolerance genes, an efficient gene prioritization approach is needed to rank the most promising genes for further experimental validation. However, the incompleteness of existing datasets and the lack of annotated data create a challenge for prioritization of these candidate genes.

Network propagation methods are promising, state-of-the-art methods for gene prioritization that rely on the premise that functionally related genes tend to interact with each other in biological networks. Recently, a new network propagation method called Random Walk with Restart on Multiplex Heterogeneous Networks (RWR-MH) has been developed. RWR-MH performs random walk with restart on multi-layered gene networks that are connected to a single-layer disease similarity network and ranks disease-associated genes based on a set of known disease genes. Although these methods are very effective in gene prioritization, they are known to be biased toward high-degree genes in the network.
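The core iteration behind random walk with restart can be sketched on a single-layer, undirected toy network. This is a simplification and not RWR-MH itself: there are no multiplex layers or heterogeneous links, and the restart probability, node names, and function name are illustrative assumptions. Every node is assumed to have at least one neighbor.

```python
def random_walk_with_restart(adj, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) * W p + r * p0 until convergence.

    adj: dict mapping each node to the set of its neighbors (undirected).
    seeds: set of seed nodes that share the restart probability equally.
    """
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            # A walker at u moves to each neighbor with equal probability.
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p

# Toy network: seed - a - b - c, plus seed - d.
network = {
    "seed": {"a", "d"},
    "a": {"seed", "b"},
    "b": {"a", "c"},
    "c": {"b"},
    "d": {"seed"},
}
proximity = random_walk_with_restart(network, seeds={"seed"})
ranking = sorted(proximity, key=proximity.get, reverse=True)
```

Nodes closer to the seed receive higher steady-state probability, which is what the ranking exploits.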

In this study, we present an improved version of RWR-MH and its application to multi-omics datasets of rice to effectively prioritize cold tolerance related genes. Our method allows multi-layer gene and disease networks. It also calculates empirical p-values of gene rankings using stratified random sampling of genes based on their degree in the network. To prioritize cold tolerance related genes in rice, we applied the improved RWR-MH to a multiplex heterogeneous rice network. We created a three-layer cultivar similarity network, consisting of EL similarity, LTSS similarity, and genotype similarity layers. We also created a three-layer gene interaction network, consisting of co-expression, protein-protein interaction and pathway interaction layers. We connected the cultivar similarity network to the gene network based on the GWAS results. We ran the algorithm on this multiplex heterogeneous network using two known cold tolerance related genes as seeds and ranked all the genes.

To evaluate our results, we performed GO enrichment of the top 200 ranked genes. Our results showed that the top 200 genes were enriched in GO terms such as “cell-wall” and “fatty-acid” production, which are known to be related to cold tolerance in rice. As a negative control, we also performed GO enrichment of the bottom 200 ranked genes and, as expected, observed no GO enrichment. We also observed that candidate genes from the GWAS results were ranked lower overall when the known cold tolerance genes were used as seeds than when random seed genes were used. Top-ranked genes also exhibited significant p-values, suggesting that their rankings were independent of their degree in the network. In conclusion, our results reported several novel cold tolerance genes that can be used for further experimental validation.
11:45 AM Network-driven discovery of influenza virus replication host factors Jason Shoemaker
Host proteins (factors) are essential for virus replication, but identifying host factors for influenza virus replication is a difficult process, and the results of high-throughput screens have varied widely between studies. Here, we present the results of two recent studies in which different network biology approaches were used to predict influenza virus replication host factors. We demonstrate that we can improve the likelihood of identifying virus replication host factors by integrating siRNA screening data onto subnetworks of the human protein interaction network. The proteins within the identified subnetworks can be prioritized using network topology measures, including degree (number of neighbors) and betweenness (degree of bottlenecked-ness). An inhibition experiment found that more than 50% of the subnetwork high-betweenness proteins are virus replication host factors. In a separate study, we apply the engineering concept of controllability to determine if the human proteins which bind the influenza virus are in positions critical to controlling the human protein interaction network. We find that virus-interacting human proteins are highly enriched for proteins essential to network control. Moreover, we find that select proteins critical to control have been identified in multiple, independent siRNA screens. Interestingly, proteins essential to control also correspond to proteins whose betweenness changes drastically when virus interaction data is integrated into the human PPI. Together, these two studies demonstrate that host-virus interaction data can be exploited to improve host factor (i.e., drug target) discovery and provide insight into why viruses evolve to regulate or selectively bind particular host proteins.
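The two topology measures named above, degree and betweenness, can be computed as in the sketch below. It uses Brandes' algorithm for shortest-path betweenness on an unweighted, undirected toy network; the protein names are placeholders, and nothing here reproduces the studies' actual subnetworks or screening data.

```python
from collections import deque

def degree(adj):
    """Number of neighbors of each node."""
    return {v: len(nbrs) for v, nbrs in adj.items()}

def betweenness(adj):
    """Shortest-path betweenness (Brandes' algorithm, unweighted, undirected)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                   # accumulate dependencies, farthest first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}

# Toy chain of placeholder proteins: P1 - P2 - P3 - P4
ppi = {"P1": {"P2"}, "P2": {"P1", "P3"}, "P3": {"P2", "P4"}, "P4": {"P3"}}
deg = degree(ppi)
btw = betweenness(ppi)
prioritized = sorted(ppi, key=lambda v: (btw[v], deg[v]), reverse=True)
```

In the chain, the two interior proteins each lie on two shortest paths between other pairs, so they rank ahead of the end proteins.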
General Track - Genome Informatics
Chair: Jaclyn Taroni

1:30 PM A tool for automatically identifying and correcting plant breeding location data Getiria Onsongo, Samantha Fritsche, Thy Nguyen, Jeffery Thompson and Kevin A.T. Silverstein.
Advances in big data technologies are making it possible to analyze large amounts of data in near real-time. These technologies offer great promise in the area of data-driven plant breeding. To fully realize this promise, disparate sources of data such as genotype, environment, management and socioeconomic data need to be integrated. Collectively, these data could be used to inform genetic predictive models for maize, wheat and other crops. Researchers with plant breeding data might want to integrate them with environmental and climate data for the breeding location for predictive analytics. One of the primary challenges to collectively analyzing these disparate sources is errors in location data. Common errors include flipped latitude and longitude values, missing negative signs and, in some cases, missing location data. We have developed a tool for automatically detecting and correcting errors in location data. Given latitude, longitude and, optionally, administrative level, region name, location name and country, this tool automatically identifies and flags errors and suggests corrections. Users have the option to accept or discard suggested corrections and then store them in a PostgreSQL database. PostgreSQL has a Geographic Information System (GIS) extension that can be leveraged to correct future data. For crop-trial and plant breeding data, where the number of planting locations is finite, use of a database management system is particularly convenient for identifying and correcting data. One of the steps in correcting location data is geocoding to fill in missing latitude and longitude data. Once a location has been geocoded, other data with the same country, region and location name can be geocoded by simply querying the database.

Identifying and correcting potential errors is a multi-step process. For data with latitude, longitude, administrative region and country information, latitude and longitude data are validated by confirming they belong to the corresponding country. Location information is considered valid if the multipolygon bounding its coordinates corresponds to the country as entered in the data. Our tool comes with a shapefile with country boundaries, but users have the option of using their own shapefile. This validation step easily detects common errors such as missing negative signs or flipped latitude and longitude data. To suggest corrections, the possible combinations of the entered latitude and longitude values (for example, with the two values swapped or with a sign changed) are generated. A query is used to determine the country for each of these possible locations. If one of these locations belongs to a country that matches the entered data, it is suggested as the correct latitude and longitude value for that data entry. For entries without latitude and longitude information, the database is queried using location name, administrative region and country name to determine if this information already exists in the database. If it is not present in the database, a Google API is used to geocode the location. Once successfully geocoded, latitude and longitude information are stored in the database for future lookups.
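The validate-then-suggest logic described above can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: a rough, approximate bounding box per country stands in for the multipolygon query against the GIS database, and the function names, box values, and candidate ordering are all hypothetical.

```python
from itertools import product

# Assumption: approximate bounding boxes (min_lat, max_lat, min_lon, max_lon)
# replace the country multipolygons used by the real tool.
COUNTRY_BOXES = {
    "Kenya": (-4.7, 5.1, 33.9, 41.9),
    "Brazil": (-33.8, 5.3, -74.0, -34.8),
}

def in_country(lat, lon, country):
    min_lat, max_lat, min_lon, max_lon = COUNTRY_BOXES[country]
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

def candidate_coordinates(lat, lon):
    """Yield the entered pair plus sign-changed and swapped variants."""
    seen = set()
    for a, b in ((lat, lon), (lon, lat)):
        for sign_a, sign_b in product((1, -1), repeat=2):
            cand = (sign_a * a, sign_b * b)
            if abs(cand[0]) <= 90 and abs(cand[1]) <= 180 and cand not in seen:
                seen.add(cand)
                yield cand

def check_location(lat, lon, country):
    """Return ('valid' | 'corrected' | 'unresolved', coordinates or None)."""
    if in_country(lat, lon, country):
        return ("valid", (lat, lon))
    for cand in candidate_coordinates(lat, lon):
        if in_country(cand[0], cand[1], country):
            return ("corrected", cand)   # a suggestion for the user to review
    return ("unresolved", None)

flipped = check_location(36.82, -1.29, "Kenya")         # latitude and longitude swapped
missing_signs = check_location(15.79, 47.88, "Brazil")  # both negative signs dropped
fine = check_location(-1.29, 36.82, "Kenya")
```

More than one candidate can fall inside a country, for example near the equator, which is one reason to treat the result as a suggestion that a user accepts or discards.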

In addition to identifying and correcting potential errors, this tool also includes a visualization component that plots both original and corrected locations on a map, making it easy for users to validate results. We used this tool on data from over 1400 plant breeding stations around the world and were able to quickly validate or correct errors in over 90% of location data from these stations. Being able to visually examine flagged locations and the corresponding corrected values made the validation process seamless and convenient. In a few instances, latitude and longitude values were flipped, which placed a plant breeding station in the middle of the ocean, while the corrected value coincided with the location of a plant breeding station, as anticipated. This tool is freely available and is being integrated into the GEMS™ Agroinformatics platform.
1:45 PM scds: Computational Annotation of Doublets in Single Cell RNA Sequencing Data Abha Bais and Dennis Kostka.
Motivation: Single cell RNA sequencing (scRNA-seq) technologies enable the study of transcriptional heterogeneity at the resolution of individual cells and have an increasing impact on biomedical research. Specifically, high-throughput approaches that employ micro-fluidics in combination with unique molecular identifiers (UMIs) are capable of assaying many thousands of cells per experiment and are rapidly becoming commonplace. However, it is known that these methods sometimes wrongly consider two or more cells as single cells, and that a number of so-called doublets is present in the output of such experiments. Treating doublets as single cells in downstream analyses can severely bias a study’s conclusions, and therefore computational strategies for the identification of doublets are needed. Here we present single cell doublet scoring (scds), a software tool for the in silico identification of doublets in scRNA-seq data.

Results: With scds, we propose two new and complementary approaches for doublet identification: co-expression based doublet scoring (cxds) and binary classification based doublet scoring (bcds). The co-expression based approach, cxds, utilizes binarized (absence/presence) gene expression data, employs a binomial model for the co-expression of pairs of genes, and yields interpretable doublet annotations. bcds, on the other hand, uses a binary classification approach to discriminate artificial doublets from the original data. We apply our methods and existing doublet identification approaches to four data sets with experimental doublet annotations and find that our methods perform at least as well as the state of the art, but at comparably little computational cost. We also find appreciable differences between methods and across data sets and that no approach dominates all others, and we believe there is room for improvement in computational doublet identification as more data with experimental annotations become available. In the meantime, scds presents a scalable, competitive approach that allows for doublet annotations in thousands of cells in a matter of seconds.
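The training set for a classification-based approach such as bcds can be sketched as below. This is a hypothetical construction, not the scds implementation: it assumes artificial doublets are formed by summing the count vectors of two randomly chosen cells, and the function name and toy counts are made up. The classifier that would be trained on `features` and `labels` is not shown.

```python
import random

def make_artificial_doublets(counts, n_doublets, seed=0):
    """Sum the count vectors of random cell pairs (assumed doublet model)."""
    rng = random.Random(seed)
    doublets = []
    for _ in range(n_doublets):
        i, j = rng.sample(range(len(counts)), 2)   # two distinct cells
        doublets.append([a + b for a, b in zip(counts[i], counts[j])])
    return doublets

# Toy count matrix: 4 cells x 4 genes.
cells = [
    [3, 0, 1, 0],
    [0, 2, 0, 5],
    [1, 1, 0, 0],
    [0, 0, 4, 2],
]
doublets = make_artificial_doublets(cells, n_doublets=3)

# Label original cells 0 and artificial doublets 1; a binary classifier
# trained on this set then yields a doublet score for each original cell.
features = cells + doublets
labels = [0] * len(cells) + [1] * len(doublets)
```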

Availability and Implementation: scds is implemented as an R package and freely available at https://github.com/kostkalab/scds.
2:00 PM Gene Prediction and Genome Functional Annotation of the Almond ‘Nonpareil’ genome Wilberforce Zachary Ouma, Tea Meulia, Thomas Gradziel and Jonathan Fresnedo Ramirez.
Almond is a relevant nut crop whose production in California was valued at ~$21.5 billion, accounting for 82% of the international almond market. Economic projections indicate almond acreage will continue to increase; thus, it is imperative to provide solutions to production and breeding constraints for all involved in the almond supply chain. Despite its relevance, genomic resources specific to almond are lacking. We report here preliminary results of the genome assembly, gene prediction and functional annotation of the ‘Nonpareil’ almond cultivar. A genome assembly with an N90 contiguity of 13.96 Mb in eight scaffolds was produced using a combination of Illumina technology and high-throughput chromosome conformation capture (Hi-C). Next, we implemented an evidence-based gene prediction relying on short-read (Illumina) and long-read (Oxford Nanopore’s MinION) transcriptome data. We generated 7.62 Gb of short-read data and 2.61 Gb of long-read data with an average read length of 1.1 kb. Evidence-based gene prediction and annotation pipelines, implemented on a high-performance computing (HPC) platform, yielded 27,487 gene models. The predicted protein-coding genes were subsequently functionally annotated by a combination of sequence motif-based (Pfam domain search) and homology-based (NCBI and UniProt) methods. We plan to deposit the genome sequence and the associated gene annotations in the Genome Database for Rosaceae.
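For readers unfamiliar with the N90 statistic quoted above, the sketch below computes Nx values from a list of scaffold lengths: N90 is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least 90% of the assembly. The function name and the toy lengths are illustrative and are not the almond assembly's scaffold lengths.

```python
def nx_length(lengths, percent):
    """Return the Nx statistic (e.g. percent=90 for N90) of an assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point edge cases at the threshold.
        if running * 100 >= percent * total:
            return length
    raise ValueError("lengths must be non-empty and percent at most 100")

toy_scaffolds = [40, 30, 20, 10]          # toy scaffold lengths, arbitrary units
n50 = nx_length(toy_scaffolds, 50)        # 30
n90 = nx_length(toy_scaffolds, 90)        # 20
```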
2:15 PM Tracking the popularity and outcomes of all bioRxiv preprints Richard Abdill and Ran Blekhman
Researchers in the life sciences are posting work to preprint servers at an unprecedented and increasing rate, sharing papers online before (or instead of) publication in peer-reviewed journals. Though the increasing acceptance of preprints is driving policy changes for journals and funders, there is little information about their usage. Here, we collected and analyzed data on all 37,648 preprints uploaded to bioRxiv.org, the largest biology-focused preprint server, in its first five years. We find preprints are being read more than ever before (1.1 million downloads in October 2018 alone) and that the rate of preprints being posted has increased to a recent high of 2,100 per month, driven primarily by exponential growth in the number of papers in neuroscience and bioinformatics. We also find that two-thirds of preprints posted before 2017 were later published in peer-reviewed journals, and that preprints with more downloads tend to appear in journals with a higher impact factor. The age of preprints appearing in journals also differs by publication, with journals such as G3 publishing most preprints within four months of their appearance on bioRxiv. Lastly, we developed Rxivist.org, a web application providing multiple ways of interacting with preprint metadata, including an API and website that allows users to sort preprints related to their interests based on Twitter activity or bioRxiv download metrics.
3:00 PM Keynote #6 - Inferring Host-Virus Interactions from Diverse Data Sources Mark Craven
Introduction by Tijana Milkovic
Insight into the mechanisms and context of host-virus interactions can be gained by applying computational methods to a broad range of experimental, observational, and secondary data sources. I will discuss our work in several studies that involve developing and applying predictive methods in order to characterize host-virus interactions. These studies incorporate viral genomic sequences, genome-wide loss-of-function screens, disease phenotypes measured in hosts, the scientific literature, and electronic health records as data sources.

Varsity Hall I on Wednesday, May 22, 2019

Start Time Title Author(s)
General Track - Clinical and Health Informatics I
Chair: Serdar Bozdag

10:30 AM Autoimmune risk allele-dependent human gene regulation by Epstein-Barr Virus EBNA2 Matthew Weirauch, Xiaoting Chen, Ted Hong, Mario Pujato, Daniel Miller, Sreeja Parameswaran, Omer Donmez, Mariana Saint Just Ribeiro, Carmy Forney, Yongbo Huang, Kenneth Kaufman, Bo Zhao, Iouri Chepelev, John Harley and Leah Kottyan.
Decades of research have implicated an etiologic role for the Epstein-Barr Virus (EBV) in several autoimmune diseases, with patients displaying increased infection rates, viral loads, and immune responses. However, the underlying molecular mechanisms behind these associations have remained elusive. We developed an unbiased approach employing a novel computational method called RELI (Regulatory Element Locus Intersector) for cross-referencing publicly available Genome-Wide Association Study (GWAS) and ChIP-seq data. Using RELI, we discovered that the EBV transactivator protein EBNA2 occupies up to half of the genetic risk loci for a set of seven autoimmune diseases, along with dozens of particular human transcription factors (TFs) and co-factors. Application of a second new computational tool called MARIO (Measurement of Allelic Ratios Informatics Operator) revealed over 20 examples of autoimmune risk allele-dependent co-binding of EBNA2 with human TFs. To further examine the mechanisms underlying these phenomena, we performed RNA-seq, ATAC-seq, and EBNA2 ChIP-seq in Ramos B cell lines that are uninfected, infected with a strain of EBV that carries EBNA2 (B95-8), or infected with a strain that lacks EBNA2 (P3HR-1). Global analysis of these data reveals hundreds of regions of the human genome showing EBNA2-dependent human gene expression and chromatin accessibility, many of which correspond to autoimmune risk loci and display risk allele-dependent behavior. Allele-dependent behavior is also observed in experiments performed in cells derived from multiple sclerosis and systemic lupus erythematosus patients. Collectively, these results implicate a critical mechanistic role for EBNA2 in the etiology of multiple autoimmune diseases. More generally, they illustrate an important and relatively unexplored mechanism through which environmental influences (viral infection) and human genetics can synergize to influence human disease.
10:45 AM The Clinical Implications of Subclonal Copy Number Alterations in Chronic Lymphocytic Leukemia Mark Zucker and Kevin Coombes.
Copy number variations (CNVs) play an important role in many cancers, including chronic lymphocytic leukemia (CLL), and are often associated with clinical outcome. Tumors (including CLL tumors) often possess multiple distinct clones, each with its own distinct CNV or set of CNVs. In prior research, we developed a method for analyzing clonal heterogeneity in cancer using copy number information from SNP (single nucleotide polymorphism) array data and assessed the prevalence and clinical implications of clonal heterogeneity in a CLL patient population using SNP array data from the time of diagnosis and clinical outcome data. Here, we examine the clinical implications – specifically time to treatment, overall survival after treatment, and time to progression - of specific subclonal CNVs in CLL. We look both at CNVs known to occur and have clinical implications in CLL, such as trisomy of chromosome 12 and deletions on the p-arm of chromosome 17 and q arm of 11, as well as at CNVs with novel clinical associations found in our data set. In our analysis, we elucidate the precise nature of the relationship between particular CNVs and clinical outcome. First, we found that, as expected, trisomy 12 corresponded to better prognosis (compared to cytogenetically normal patients), while deletions on 17p and 11q corresponded to worse prognosis. We also found that, for most CNVs, the CNV’s association with clinical outcome is mediated merely by its presence or absence. That is, the fraction of cells possessing the CNV did not independently correlate with clinical outcome. For other CNVs, notably deletion 17p and deletion 11q, the effect on clinical outcome appears to increase the larger the clone possessing the alteration is. Patients in which the CNV-possessing clone was large tended to do worse than those for which the clone possessing the CNV was small. 
We also found that overall survival after treatment tended to be more affected by particular CNVs than overall survival in general. Additionally, we find that some CNVs associated with poorer clinical outcome are positively associated with the presence of clonal heterogeneity, including deletion 17p. Finally, we discuss potential biological explanations for our observations, especially in the context of what genes occur on the genomic regions affected by the clinically relevant copy number changes, and find some biologically interesting and clinically relevant genes located on clinically relevant CNV-affected regions. These findings may have important implications for our understanding of the complexities of how tumors develop and change over time during the course of the disease, especially after treatment, and may potentially have implications for predicting prognosis and for optimizing and personalizing treatment strategies for CLL.
11:00 AM Objective risk stratification of prostate cancer using machine learning and radiomics applied to multiparametric magnetic resonance images Bino Varghese, Frank Chen, Darryl Hwang, Suzanne Palmer, Andre Luis De Castro Abreu, Osamu Ukimura, Monish Aron, Manju Aron, Inderbir Gill, Vinay Duddalwar and Gaurav Pandey.
A major issue with the interpretation of tumor imaging data is the discordance between patients’ inferred clinical risk and their imaging findings. The derivation of unbiased content-oriented features from these images (radiomics) and their subsequent use in frameworks based on machine learning (ML) methods can augment radiologists’ role in clinical care by providing more objective risk assessments. Previous work in this direction has been limited to the use of one or a small number of ML, specifically classification, methods, and their evaluation using standard measures like AUC. Developing an automated framework that systematically and rigorously identifies the best classifier(s) for predicting tumor risk with a given set of radiomic features can substantially boost the potential of this approach. We developed such a framework comprised of classification, cross-validation and statistical analyses to identify a classifier that most accurately differentiates high-risk prostate cancer (PCa) patients from lower-risk ones in a sizeable cohort examined using multi-parametric MRI (mpMRI). This framework examined seven commonly used classifiers in terms of their performance and stability for predicting PCa risk using 110 radiomics features extracted from T2- and Diffusion-weighted mpMRI images. Our cohort consisted of 121 PCa patients, which was randomly split into training (n = 68) and validation (n = 53) sets prior to the application of our framework. Using a systematic cross-validation setup and rigorous statistical analysis of the performance of candidate classifiers evaluated on the training set, the framework identified the Quadratic kernel Support Vector Machine (QSVM) classifier as the most effective PCa risk prediction method. Indeed, the QSVM classifier performed well on the independent validation set in terms of a variety of evaluation measures (AUC=0.71, F-measure=0.69, Precision=0.57 and Recall=0.86). 
In particular, it performed better than PI-RADS v2, the imaging score-based method currently used to clinically assess PCa risk, especially in terms of class-specific evaluation measures, namely F-measure (0.52), Precision (0.45) and Recall (0.61). The use of these class-specific measures, also a novelty of our framework, enables the assessment of a classifier’s performance on the hardest class(es) of patients, such as the high-risk PCa patients in our cohort. These results demonstrate the effectiveness of data-driven frameworks like ours for assessing and deriving objective imaging-based risk predictors that can assist radiologists in improving clinical care.
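The F-measure values quoted above can be checked from the reported precision and recall, assuming the F-measure is the usual harmonic mean of the two (F1). The short sketch below does this for both the QSVM classifier and PI-RADS v2; the function name is an assumption of the example.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
qsvm_f = f_measure(0.57, 0.86)
pirads_f = f_measure(0.45, 0.61)

print(round(qsvm_f, 2))    # 0.69
print(round(pirads_f, 2))  # 0.52
```

Both values agree with the reported F-measures once rounded to two decimals.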
11:15 AM Systematic analysis of genetic interactions in Parkinson’s disease reveals interactions with known risk genes Wen Wang, Benjamin Vandersluis and Chad Myers.
Various genomic approaches have been applied to study the genetics of Parkinson's disease (PD), yet despite the wealth of candidate loci produced by existing studies, there remains a substantial disparity between the disease risk explained by discovered loci and the estimated total heritable disease risk based on familial aggregation. The heritability of PD has been reported to be between 20% and 40%, but the loci discovered to date only explain 7% of the genetic burden of the disease. Genetic interactions, which refer to combinations of two or more genes whose contribution to a phenotype cannot be fully explained by their independent effects, may play an important role in closing the missing heritability gap. Previous human genetic studies of PD typically restricted their search for genetic interactions to known candidate genes or genes that are functionally related to known candidates. Detecting genetic interactions systematically with statistical significance remains a major challenge due to the daunting number of variant combinations possible in the human genome. Genetic interactions have been systematically studied in the yeast model system through gene double knockout experiments. One important result from the analysis of the yeast genetic interaction network is that genetic interactions tend to connect two genes with similar functions, or two genes with different but compensatory functions. Because of their functional redundancy, the two genes can compensate for the loss of each other, and thus, only simultaneous perturbation of both would result in a loss of function. Furthermore, when genetic interactions occur, they tend to occur in large coherent sets reflecting the functional modules involved, either connecting many gene pairs within the same functional module or between two functional modules. Guided by this principle, we recently developed a method called BridGE to search for genetic interactions based on functional modules (Wang et al. 2017). More specifically, it identifies within- and between-pathway interactions from human population genetic data. Here, we describe improvements to our approach that increase our power for discovering genetic interactions along with our detailed application of these latest developments to multiple PD cohorts. We identified 32 pathway-level interactions that were statistically significant: 20 were between-pathway interactions (FDR<0.05) and 12 were within-pathway interactions (FDR<0.1). We also tried to validate these 32 interactions in an independent PD cohort and found that 12 of them were also enriched for SNP-SNP interactions, which suggests our discoveries are highly reproducible. Although the mechanistic basis of these interactions will require further study, many of the pathways implicated in interactions have plausible links to the pathology of Parkinson’s disease. For example, we found within-pathway interactions in the Parkinson’s disease gene set (KEGG), PARKIN pathway (Biocarta), TGF-beta signaling (Reactome), and VEGF signaling (Reactome), all of which were linked to PD by previous studies. We also identified several other pathways that interacted with these known PD pathways, such as Botulinum Neurotoxicity (Reactome) and Oxidative Stress-Induced Gene Expression via Nrf2 (Reactome), which suggests that many of the established risk variants are modified by variants in multiple, previously unappreciated pathways. The fact that our approach converged on a set of pathways with high relevance to PD and a high rate of replication in independent populations is encouraging. We expect that further exploration of discovered interactions is likely to be fruitful for understanding the underlying genetic basis of Parkinson's disease.
General Track - Clinical and Health Informatics II
Chair: Debbie Chasman

1:30 PM An integrative computational modeling approach to identify repurposable metabolic drug targets in CD4+ T cells Bhanwar Lal Puniya, Bailee Lichter, Robert Moore, Sydney Townsend, Alex Ciurej, Ab Rauf Shah, Matteo Barberis and Tomas Helikar.
CD4+ T cells play a central role in the immune system to protect against diseases. In peripheral lymphoid organs, naïve CD4+ T cells encounter antigen presenting cells such as macrophages and dendritic cells via T cell receptor (TCR) and CD28. After TCR activation, in the presence of cytokines, CD4+ T cells undergo clonal expansion and differentiation into antigen specific subtypes including Th1, Th2, Th17, and iTregs. During clonal expansion, the increased bioenergetics and biosynthetic demands are fulfilled by metabolic reprogramming. For example, naïve T cells are dependent on oxidative phosphorylation and fatty acid oxidation whereas Th1, Th2, and Th17 cells are highly glycolytic. Aberrant regulation of CD4+ T cell metabolism is associated with several immune disorders (e.g., multiple sclerosis, systemic lupus erythematosus, etc.). Computational modeling has become an integral tool in life sciences research that enables the integration and interrogation of heterogeneous data. In particular, computational modeling has the potential to characterize the complex mechanisms underlying various biological systems and diseases, and to identify novel drug targets and/or re-purpose existing drugs for new applications. In this study, we integrated large-scale transcriptomics and proteomics data, metabolic modeling, and literature mining to identify potential re-purposable candidates for autoimmune disorders. Specifically, we characterized and developed metabolic models of all canonical T cell subtypes; i.e., naïve, Th1, Th2, Th17, and iTregs. Subtype specific models were extracted from a generic human metabolic reconstruction (Recon 3D) by integrating publicly available transcriptomics (159 microarrays) and proteomics (20 samples) datasets. The constructed models were validated using known T cell specific metabolic pathways and gene essentiality. 
Pathway-based validations included the activity of glycolysis, fatty acid synthesis, and glutaminolysis in effector T cells, and of fatty acid oxidation and oxidative phosphorylation in naïve and iTreg cells. Essential genes identified in different cell lines based on siRNA and CRISPR were obtained and compared with essential genes predicted by the models. For all models, 72% to 80% of predicted essential genes were shared with experimentally-determined essential genes. Together, our models achieved ~60% accuracy and ~70% precision. Next, we integrated data from the DrugBank and cMap databases with the models, and mapped 256 to 347 genes (across the models) that could be used as re-purposable targets for existing drugs. For each model, we systematically perturbed these gene targets and identified affected genes using flux ratios (Flux treated/Flux untreated). Finally, to select perturbed genes as potential repurposable drug targets, we leveraged the available large-scale data obtained from healthy individuals and patients for rheumatoid arthritis, multiple sclerosis, systemic lupus erythematosus, Graves’ disease, etc., and investigated whether affected genes are associated with these diseases. For each perturbation, a score was generated using the flux ratios and differentially expressed genes as a measure of perturbation effects. Candidate target genes were selected based on the highest score. Furthermore, enrichment of drugs in the selected genes identified important potential drugs and their targets. For example, for rheumatoid arthritis, 148 genes were identified as most promising drug targets. Of these, eight were targeted by five or more drugs. In these, six (dipyridamole, pentoxifylline, ketotifen, aminophylline, wortmannin, and ibudilast) out of the top eight drugs were already established experimentally to have a negative impact on T cell proliferation, while two novel potential CD4+ cell targets, dyphylline and Ro-20-1724, were identified. 
In summary, here we leveraged a set of CD4+ T cell-specific metabolic models, and a computational approach that integrates modeling with large-scale data and text mining to predict new repurposable drug targets. We were able to obtain a moderate agreement between literature and predictions; however, further laboratory validation is required to confirm these findings.
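The flux ratio used above (flux treated / flux untreated) can be sketched as follows. This is only an illustration under stated assumptions: the reaction identifiers, the flux values, and the two-fold cutoff for calling a reaction "affected" are hypothetical and are not taken from the abstract.

```python
def flux_ratios(untreated, treated, eps=1e-9):
    """Return treated/untreated flux ratio per reaction.

    Reactions with (near-)zero untreated flux are skipped because the
    ratio is undefined for them.
    """
    ratios = {}
    for rxn, base in untreated.items():
        if abs(base) < eps:
            continue
        ratios[rxn] = treated.get(rxn, 0.0) / base
    return ratios


def affected_reactions(ratios, fold=2.0):
    """Reactions whose ratio is >= fold or <= 1/fold (hypothetical cutoff).

    A negative ratio (reversed flux direction) also counts as affected.
    """
    return sorted(r for r, v in ratios.items() if v >= fold or v <= 1.0 / fold)


# Hypothetical fluxes before and after an in silico perturbation.
untreated = {"HEX1": 10.0, "PFK": 8.0, "CS": 4.0, "ACONT": 0.0}
treated = {"HEX1": 2.0, "PFK": 7.5, "CS": 9.0, "ACONT": 1.0}

ratios = flux_ratios(untreated, treated)
hits = affected_reactions(ratios)
print(hits)  # ['CS', 'HEX1']
```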
1:45 PM Linear Drug-Target Interaction Model Predicts Effective Drugs and Also Identifies Disease-Related Proteins and Pathways from Phenotypic Screens Feng Guo
Phenotypic screening is widely used in drug discovery, where it helps identify therapeutic strategies for complex diseases that involve the dysregulation of multiple cellular pathways. The in vitro screening of drugs is usually based on “targets” of drug candidates, i.e. endogenous proteins involved in biological pathways that have molecular interaction with drug compounds. However, compounds selected by their targets are not always effective. A challenge of using phenotypic screens is identifying which drug targets are responsible for the observed phenotypic effects, and whether and how these targets’ contributions to the effects are correlated with other targets or pathways.
Here we develop a method for predicting drug effects from information on canonical interactions between compounds and their targets and pathways. We search the STITCH database (http://stitch.embl.de/) for canonical targets of all drug candidates, and then search the KEGG database (https://www.genome.jp/kegg/pathway.html) for related biological pathways of each target. Based on the interaction between drugs and targets or pathways, we construct binary drug-target and drug-pathway interaction matrices. Then we use the matrices to create a mathematical model in which the effect of a compound can be decomposed as a linear combination of effects of its targets or pathways. Several machine learning algorithms are adopted to solve this linear model.
To train such a model, we use NCI ALMANAC drug screening data, which tests over 100 FDA-approved anti-cancer drugs in NCI-60 cell lines derived from several different human tumors. 10-fold cross validation is applied to train and evaluate the model. Among those algorithms, only Lasso (least absolute shrinkage and selection operator) performs well in feature (target) selection. Lasso reduces the size of the matrices by eliminating unimportant features. With reduced drug-target matrices, the AUCs of single drug effect prediction are significantly improved, compared with models using original full matrices or randomly reduced matrices. The highest AUC is achieved by SVD, which is over 0.8. This indicates that features kept by Lasso are highly associated with the phenotypic effects.
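The linear decomposition and Lasso-based target selection described above can be illustrated with a small sketch. It assumes a binary drug-target matrix X, one effect value per drug in y, and the objective (1/(2n))·||y − Xw||² + alpha·||w||₁, solved here by plain coordinate descent. The toy matrix, the effect values, and alpha are hypothetical, and this solver is a stand-in for illustration, not the authors' implementation.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t; values within [-t, t] become exactly 0."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_sweeps=200):
    """Minimise (1/(2n))*||y - Xw||^2 + alpha*||w||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            if z == 0:
                continue  # target hit by no drug: weight stays 0
            # Correlation of column j with the residual that ignores w[j].
            rho = 0.0
            for i in range(n):
                partial = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
            rho /= n
            w[j] = soft_threshold(rho, alpha) / z
    return w


# Hypothetical binary drug-target matrix: 6 drugs (rows) x 4 targets (columns).
X = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]
# Effects generated from target effects (2, 0, -1, 0): only targets 0 and 2 matter.
y = [2.0, 1.0, 0.0, -1.0, 2.0, -1.0]

weights = lasso_coordinate_descent(X, y, alpha=0.05)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
print(selected)  # [0, 2]
```

In this toy case the L1 penalty drives the weights of targets 1 and 3 to exactly zero, which is the feature-elimination behaviour the abstract attributes to Lasso; the surviving weights are shrunk slightly toward zero relative to the generating values.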
The model has been applied to a compound screening data set for Huntington’s disease. Our drug-target model also has a high AUC in predicting compounds’ neuroprotection, and it identifies key target proteins and pathways involved in disease progression. Our findings suggest that modeling of this type can be used in conjunction with phenotypic screens to identify novel combination therapeutics, as well as targets and pathways that are relevant to complex diseases.
2:00 PM Using histopathology whole-slide images quantitative features to predict liver cancer survival Noshad Hosseini, Fadhl M. Alakwaa, Olivier B. Poiron and Lana X. Garmire.
Hepatocellular carcinoma (HCC) is the most common type of liver cancer, responsible for more than 80 percent of liver cancer cases. Providing a robust prognosis prediction for HCC patients can significantly improve the quality of life for patients. Most of the studies on this subject have used omics data to model survival prediction; however, omics data are not readily available for all patients. On the other hand, histopathology images are routinely analyzed and archived for suitable patients. In this study, we investigated the strength of pathology imaging for accurate prognosis. We obtained 330 hematoxylin and eosin (H&E) stained histopathology whole-slide images of HCC from The Cancer Genome Atlas (TCGA) and extracted 176,223 quantitative image features for each patient, using the software CellProfiler. Furthermore, we used machine-learning methods to select the most relevant features for survival and divided patients into two subgroups of short-term and long-term survivors. Using 10-fold cross-validation tests on two survival prediction methods developed earlier in our group, Cox-nnet and deepProg, we obtained C-index of 0.77578 and 0.743973 respectively, highlighting the predictive value of histopathological features. Furthermore, we used deepProg, a generalized deep-learning based prognosis prediction framework, to integrate histopathological data with three other omics data types, namely RNA-Seq, Methylation, and miRNA data. The new model of 3 omics and pathological data achieved C-index as high as 0.92933, compared to the C-index of 0.85 for the model of 3 omics only. In summary, our results demonstrate that imaging data for HCC can improve survival prediction, whether integrated with other data types or used alone. Our method can be generalized to predict survival in other types of cancer.
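For readers unfamiliar with the C-index reported above, the sketch below computes Harrell's concordance index on a toy cohort. It assumes the common definition in which a pair is comparable when the patient with the shorter follow-up time had an observed event, and ties in predicted risk count as half concordant. The survival times, event flags, and risk scores are hypothetical, and the abstract does not state which C-index variant was used.

```python
def concordance_index(times, events, risk):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    times  : follow-up time per patient
    events : 1 if the event (death) was observed, 0 if censored
    risk   : predicted risk score (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort of four patients (third one censored).
times = [5, 10, 12, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.6, 0.1]

c_index = concordance_index(times, events, risk)
print(c_index)  # 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for the figures quoted in the abstract.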
2:15 PM Machine Learning Classifier for Endometriosis Using Transcriptomics and Methylomics Data Sadia Akter, Dong Xu, Susan Nagel, John Bromfield, Katherine Pelch, Gilbert Wilshire and Trupti Joshi.
Endometriosis is a complex, common, yet poorly understood gynecological disorder affecting about 176 million women worldwide, with a significant impact on their quality of life and a substantial economic burden. Neither a definitive clinical symptom nor a minimally invasive diagnostic method is available, leading to an average diagnostic latency of 10 years. Discovery of relevant biological patterns from microarray expression or NGS data has been advanced over the last several decades by applying various machine learning tools. We performed machine learning analysis using 38 RNA-seq and 77 enrichment-based DNA-methylation (MBD-seq) datasets. We examined how well various supervised machine learning methods such as decision tree, PLSDA, support vector machine and random forest perform in classifying endometriosis versus control samples when trained on both transcriptomics and methylomics data. The assessment was done from three different perspectives for improving classification performances: (a) implication of three different normalization techniques, (b) implication of differential analysis using the generalized linear model (GLM), and (c) application of a newly developed ensemble technique called GenomeForest. The ensemble technique achieved 97.4% accuracy, 93.8% sensitivity, 100% specificity, and 0.968 F1 score for transcriptomics and 90.9% accuracy, 92.9% sensitivity, 88.6% specificity and 0.918 F1 score for methylomics. Several candidate biomarker genes were identified by multiple machine learning experiments including NOTCH3, SNAPC2, B4GALNT1, and GTF3C5 from the transcriptomics data analysis, and TRPM6, RP3-522J7.6 and MFSD14B from the methylomics data analysis. 
We concluded that an appropriate machine learning diagnostic pipeline for endometriosis should use TMM normalization for transcriptomics data, and quantile or voom normalization for methylomics data, GLM for feature space reduction and classification performance maximization, chromosomal partitioning with ensemble of decision trees for greatest increase in classification performance, and F1 score for both ranking of the individual models and generating the composite score.
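The accuracy, sensitivity, specificity, and F1 figures quoted above all derive from a confusion matrix. The sketch below shows the standard formulas. The counts used (15 true positives, 1 false negative, 22 true negatives, 0 false positives, totalling 38 samples) are hypothetical: they were chosen only because they are arithmetically consistent with the reported transcriptomics figures, and the abstract itself does not give the confusion matrix or the class split.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # recall on the positive class
        "specificity": tn / (tn + fp),   # recall on the negative class
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts consistent with the reported transcriptomics results.
metrics = classification_metrics(tp=15, fn=1, tn=22, fp=0)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```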
