RSG with DREAM 2018 | December 8-10, 2018 | New York, USA

Posters

RSG-1: Methods for Identifying Tumor Heterogeneity and Rare Subclones in Single Cell DNA Sequence Data
Topic: Single Cell Genomics Clustering RNASeq AML PCA Tum
  • Sombeet Sahu, Mission Bio, United States

Short Abstract: Background: With advances in single-cell sequencing technologies, it is now possible to interrogate thousands of cells in a single experiment. Single-cell RNA-Seq has been available for several years, but high-throughput single-cell DNA analysis is in its infancy. It is therefore essential to develop new capabilities for assessing the genetic variation present in rare cells and to better understand the role these cells play in tumor evolution and progression. To address these challenges and enable the characterization of genetic diversity in cancer cell populations, we developed a novel approach to identify the mutation signatures that define the subclones present in a tumor population. Methods: We present a two-step clustering and subclone identification method that uses data generated on the Tapestri single-cell DNA platform, which can generate data for up to 10,000 cells, and analyzed with the Tapestri analytical workflow. The pipeline obtains raw reads from the sequencer, removes adapters, aligns and maps the reads, calls individual cells, and identifies genetic variants within each cell. After filtering for high-quality variants, we filter for data completeness so that only high-quality data is used in downstream processing. The variant-cell matrix is then subjected to unsupervised hierarchical clustering in a PCA projection space to identify the top variants defining the subclone structure. The silhouette value is used to select the optimal clustering. Once clusters have been identified, we calculate the average variant allele frequency of every variant in each cluster. We also identify the variants that explain the most variance in the PCA projection. After filtering on these top variants, we perform additional rounds of clustering to obtain the optimal profile. To validate our methodology, we used two model systems: A) a 50:50 mix of the K562 and Raji cell lines and B) a mixture of the MOLM13, PL21 and Raji cell lines present at 33% each. Results: With our two-step clustering process, we recover distinct clusters that correlate with the titration and cell line ratios. We were also able to identify the signature mutations associated with each cluster. These data demonstrate the utility of the Tapestri platform, the analytical pipeline, and the associated data visualization capability. Our approach has the potential to address the key issue of identifying rare subpopulations of cells and to transform our ability to accurately characterize clonal heterogeneity. This high-throughput method of accurately characterizing clonal populations should lead to improved patient stratification and therapy selection for various cancer indications.
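The silhouette-based choice between candidate clusterings described in RSG-1 can be illustrated with a small sketch. This is not the Tapestri pipeline: the toy allele-frequency matrix, the two candidate labelings, the function names, and the use of Euclidean distance are all assumptions made for illustration, and the PCA and hierarchical clustering steps are taken as already done.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_silhouette(points, labels):
    """Mean silhouette value over all points (needs at least two clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical cells (rows) by variant allele frequency of two variants.
cells = [(0.0, 1.0), (0.1, 0.9), (1.0, 0.0), (0.9, 0.1), (0.5, 0.5), (0.45, 0.55)]

# Hypothetical labelings, e.g. from cutting a dendrogram at k = 2 and k = 3.
candidates = {
    2: [0, 0, 1, 1, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
}

best_k = max(candidates, key=lambda k: mean_silhouette(cells, candidates[k]))
print("selected number of clusters:", best_k)
```

In this toy data the heterozygous-looking cells form their own tight group, so the three-cluster labeling scores higher than the two-cluster one.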

RSG-2: PASS2nets: Building Networks of Distantly Related Protein Domains
Topic: Networks Protein Visualization Structure Sequence
  • Sridhar Hariharaputran, National Centre for Biological Sciences (TIFR), Bangalore, and Bharathidasan University, Trichy, India
  • Ramanathan Sowdhamini, National Centre for Biological Sciences (TIFR), India

Short Abstract: PASS2 is an automated database of protein superfamily alignments. It contains alignment information for protein structures at the superfamily level, corresponds directly to the SCOPe release, and is mapped to Gene Ontology information. While analysing the data, we discovered "outliers". These outliers are significant in building networks of distantly related protein domains and superfamilies. Our earlier work and analysis provide details of their behaviour. We viewed and knitted the diverse information together as networks and discovered the properties shared by domains across multiple superfamilies. The core information that binds the domains acts as hubs, and the supporting information as edges. It is possible to navigate across the network or networks and to view the entire data as a global map. This approach provides a network and sub-network view of the connected domain data. Depending on the chosen property, the network and its topology differ globally and locally. Some nodes are core properties of the networks. PASS2 is available at http://caps.ncbs.res.in/pass2/
References:
1. Hariharaputran S, Srinivasan N, Sowdhamini R. Clusters and Networks in Alpha/Beta Hydrolases. Work selected for the Molecular Biophysics Unit, Indian Institute of Science (IISc), in-house Symposium 2018.
2. Hariharaputran S, Srinivasan N, Sowdhamini R. NEWS: Network Exploration With Sequence, Structure, Semantics. (2018)
3. Hariharaputran S, Srinivasan N, Sowdhamini R. Connecting the dots: integrated analysis of Mycobacterium species. Work submitted 2018.
4. Gandhimathi A, Ghosh P, Hariharaputran S, Mathew OK, Sowdhamini R. PASS2 database for the structure-based sequence alignment of distantly related SCOP domain superfamilies: update to version 5 and added features. Nucleic Acids Res. 2016 Jan 4;44(D1):D410-4. doi: 10.1093/nar/gkv1205. Epub 2015 Nov 8.
5. Arumugam G, Nair AG, Hariharaputran S, Ramanathan S. Rebelling for a reason: protein structural "outliers". PLoS One. 2013 Sep 20;8(9):e74416. doi: 10.1371/journal.pone.0074416. eCollection 2013.

RSG-3: Gene coexpression network for voltage-dependent potassium channels Kv2 in Beta cells across developmental stages
Topic: Insulin Beta cells Potassium channels Transcriptom
  • German Bernate, Neuroscience Division, Cognitive Neuroscience Department, Instituto de Fisiología Celular, UNAM, Mexico
  • Francisco Barajas, Immunogenomics and Metabolic Disease Laboratory, Instituto Nacional de Medicina Genómica, SS, Mexico City, Mexico
  • Marcia Hiriart, Neuroscience Division, Cognitive Neuroscience Department, Instituto de Fisiología Celular, UNAM, Mexico
  • Myrian Velasco, Neuroscience Division, Cognitive Neuroscience Department, Instituto de Fisiología Celular, UNAM, Mexico

Short Abstract: Introduction: Insulin is an anabolic hormone that plays a critical role in maintaining glucose homeostasis and energy stores. Pancreatic beta cells are exclusively responsible for insulin secretion, and their main secretagogue is glucose. Glucose-stimulated insulin secretion (GSIS) depends on the activity of several ion channels, such as the ATP-sensitive potassium channel (KATP), voltage-dependent Ca2+ and Na+ channels, and voltage-dependent potassium channels (Kv2). Kv2 channels repolarize the action potential, ending insulin secretion. From birth to postnatal day 20 (pd20), rat pancreatic beta cells release insulin in a monophasic manner and are unable to respond to the extracellular glucose concentration (immature cells). At the adult stage, rat beta cells exhibit robust, biphasic insulin secretion and are sensitive to changes in extracellular glucose concentration (mature cells). This process is known as functional maturation. To understand and characterize beta-cell functional maturation, our lab carried out a transcriptome analysis of pd20 and adult beta cells. The analysis showed significantly greater expression of the Kcnb2 gene in adult than in pd20 rat beta cells. In this work, we ask which genes regulate the expression of Kv2 channels. Methods: We constructed a gene coexpression network for each gene of interest (Kcnb1 and Kcnb2). For network construction, Pearson's correlation coefficient was calculated between Kcnb1 or Kcnb2 and every probe in the transcriptome (n = 29,214), and the genes with a significant correlation (p < 0.01) were selected. The resulting network was analyzed, and gene functional enrichment analysis was performed with the Cytoscape ClueGO app and the Reactome database. Candidate regulatory genes were then identified with a mutual information algorithm (ARACNE), and enrichment analysis was carried out for each of these genes. The most enriched sets were selected and used to construct a Bayesian network. Results and Conclusion: The gene coexpression network analysis and the construction of gene regulatory networks with functional enrichment revealed a wide variety of genes, and corresponding functional pathways, that appear to regulate the expression of both genes of interest (Kcnb1 and Kcnb2). Our results indicate that Foxo1, Cacnb2, Vegfa and Gna12 directly regulate the Kcnb1 gene (Kv2.1), while Bad, Pik3ca, Rab11a and Lamp5 regulate Kcnb2 (Kv2.2) and participate in insulin secretion by pancreatic beta cells.
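The correlation step of the coexpression network in RSG-3 can be sketched as follows. This is a minimal illustration under stated assumptions: the expression values and probe names are invented, and a cutoff on the absolute correlation stands in for the p < 0.01 significance test used in the abstract, since a p-value would need a t-distribution that the Python standard library does not provide.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (sd_x * sd_y)

def coexpressed_probes(target, expression, min_abs_r=0.9):
    """Return {probe: r} for probes whose |r| with the target meets the cutoff."""
    edges = {}
    for probe, values in expression.items():
        if probe == target:
            continue
        r = pearson(expression[target], values)
        if abs(r) >= min_abs_r:  # assumption: |r| cutoff replaces the p-value test
            edges[probe] = r
    return edges

# Hypothetical expression profiles across six samples.
expression = {
    "Kcnb2": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "probeA": [2.0, 4.0, 6.0, 8.0, 10.0, 12.5],  # rises with Kcnb2
    "probeB": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],    # falls as Kcnb2 rises
    "probeC": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],    # no clear relationship
}

edges = coexpressed_probes("Kcnb2", expression)
for probe, r in edges.items():
    print(f"Kcnb2 -- {probe}: r = {r:.3f}")
```

Both positively and negatively correlated probes are kept, because either direction can indicate a regulatory relationship worth following up.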

RSG-4: TimeXNet Web: Identifying cellular response networks from diverse omics time-course data
Topic: gene response networks time-course data analysis m
  • Phit Ling Tan, Institute of Medical Science, University of Tokyo, Japan
  • Yosvany Lopez, Tokyo Medical and Dental University, Japan
  • Kenta Nakai, Institute of Medical Science, University of Tokyo, Japan
  • Ashwini Patil, Institute of Medical Science, University of Tokyo, Japan

Short Abstract: Condition-specific time-course omics profiles are frequently used to study cellular responses to stimuli and to identify the associated signaling pathways. However, few online tools allow users to analyze multiple types of high-throughput time-course data. TimeXNet Web is a web server that extracts a time-dependent gene/protein response network from time-course transcriptomic, proteomic or phospho-proteomic data and an input interaction network. It classifies the given genes/proteins into time-dependent groups based on the time of their highest activity and identifies the most probable paths connecting genes/proteins in consecutive groups. The response sub-network is enriched in activated genes/proteins and contains novel regulators that do not show any observable change in the input data. Users can view the resulting response network and analyze it for functional enrichment. TimeXNet Web supports the analysis of high-throughput data from multiple species by providing high-quality, weighted protein-protein interaction networks for 12 model organisms. TimeXNet Web is available at http://txnet.hgc.jp/.
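The grouping of genes by the time of their highest activity, as described for TimeXNet Web, can be sketched in a few lines. This is only an illustration: the profiles are invented, and taking "highest activity" to mean the largest value in the time course is an assumption, since the abstract does not say how the server defines it.

```python
def group_by_peak_time(profiles, time_points):
    """Assign each gene to the time point at which its profile is largest."""
    groups = {t: [] for t in time_points}
    for gene, values in profiles.items():
        # Index of the maximum value; ties go to the earliest time point.
        peak_index = max(range(len(values)), key=lambda i: values[i])
        groups[time_points[peak_index]].append(gene)
    return groups

# Hypothetical fold-change profiles measured at three time points.
time_points = ["1h", "4h", "12h"]
profiles = {
    "geneA": [5.0, 1.2, 0.8],
    "geneB": [1.0, 4.5, 2.0],
    "geneC": [0.5, 2.0, 6.0],
    "geneD": [1.1, 3.9, 1.5],
}

groups = group_by_peak_time(profiles, time_points)
for t in time_points:
    print(t, groups[t])
```

Paths through the interaction network would then be sought between genes in consecutive groups, for example from the "1h" group to the "4h" group.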

RSG-5: Chromatin accessibility signatures of immune system aging differ between men and women
Topic: Single cell Immune system aging Chromatin accessib
  • Duygu Ucar, Jackson Laboratory, United States

Short Abstract: The immune systems of women and men function and respond to infections and vaccination differently; however, it is unknown whether they age differently. To study this systematically, we profiled the chromatin accessibility and transcriptomes of peripheral blood mononuclear cells (PBMCs) obtained from healthy adults aged 22 to 93 (n=108). Epigenomic and transcriptomic maps of these individuals (50 men, 58 women) revealed that PBMCs from men go through more significant aging-related changes than PBMCs from women. By deconvoluting the changes captured in PBMCs using single-cell RNA-seq data, we showed that the interaction between sex and aging is cell-type-specific: T cells lose chromatin accessibility in both sexes, whereas B and innate cells exhibit male-specific changes. Furthermore, the genomic profiles of PBMCs diverge between the sexes over time: older women have more active epigenomes and transcriptomes at loci associated with B and T cells, potentially explaining health and lifespan differences between the sexes.

RSG-6: Using Protein Localization Studies to Improve Genome-scale Metabolic Models
Topic: COBRA Constraint Based Modeling Metabolism Genome
  • Helen Tung, Imperial College London, United Kingdom
  • Uri David Akavia, McGill University, Canada

Short Abstract: Cancer cells reprogram metabolism to support rapid proliferation and survival. Energy metabolism is particularly important for growth, and genes encoding enzymes involved in energy metabolism are frequently altered in cancer cells. A genome-scale metabolic model (GSMM) is a mathematical formalization of metabolism that allows simulation and hypothesis testing of metabolic strategies. GSMMs have been applied successfully to many microorganisms and are now used to study cancer metabolism. They are based on the accumulated knowledge of metabolic reactions and enzymes in the published literature, with reactions represented mathematically as a stoichiometric matrix. The reactions in a model are annotated with gene-protein-reaction (GPR) relationships, organized into Boolean expressions (AND/OR) that specify the relationship between the associated genes. One of the most widely used generic human metabolic models is Recon 2.2, which contains 7785 reactions, 5324 metabolites and 1675 genes. The accuracy of GSMM predictions depends on accurate gene associations to reactions. Current GSMMs do not consider intracellular protein localization when annotating GPR relationships, which may lead to false-positive interactions. In this study, protein localization data obtained from 10 different studies and Gene Ontology were used to identify the intracellular location of each protein at the organelle level. Mismatches between protein localization and reaction location were used to identify inaccurate gene associations in the human GSMM Recon 2.2. Based on the validation results for 617 gene locations, 479 reactions were added, 90 reactions were removed, and 671 reactions were modified in Recon 2.2. The modified Recon 2.2 includes 8210 reactions, 5656 metabolites and 1723 genes, an addition of 10%, 5% and 2%, respectively. The improved model predicts energy production correctly and performs more known metabolic tasks. Improving the prediction accuracy of Recon 2.2 will facilitate the identification of biomarkers and drug targets using contextualized GSMMs.
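The mismatch check between protein localization and reaction location in RSG-6 can be sketched as below. The reaction identifiers, gene names, compartment labels, and data layout are hypothetical and do not reflect the actual Recon 2.2 file format; the sketch only shows the comparison logic, flagging a gene when none of its observed locations matches the compartment of a reaction it is associated with.

```python
def find_localization_mismatches(reactions, protein_locations):
    """List (reaction, gene, reaction compartment, observed locations) conflicts.

    reactions: {reaction_id: (compartment, [associated genes])}
    protein_locations: {gene: set of observed compartments}
    Genes without localization data are skipped rather than flagged.
    """
    mismatches = []
    for rxn_id, (compartment, genes) in reactions.items():
        for gene in genes:
            observed = protein_locations.get(gene)
            if observed is not None and compartment not in observed:
                mismatches.append((rxn_id, gene, compartment, sorted(observed)))
    return mismatches

# Hypothetical model fragment and localization evidence.
reactions = {
    "R1": ("mitochondrion", ["GENE_A", "GENE_B"]),
    "R2": ("cytosol", ["GENE_C", "GENE_D"]),
}
protein_locations = {
    "GENE_A": {"mitochondrion"},
    "GENE_B": {"cytosol", "nucleus"},  # conflicts with R1's compartment
    "GENE_C": {"cytosol"},
    # GENE_D has no localization data and is therefore not assessed
}

mismatches = find_localization_mismatches(reactions, protein_locations)
for rxn_id, gene, compartment, observed in mismatches:
    print(f"{rxn_id}: {gene} annotated in {compartment}, observed in {observed}")
```

Each flagged pair is a candidate for curation: the gene association could be removed, or the reaction could be added in the compartment where the protein is actually observed.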

RSG-7: Imputing the effects of genetic variation on mRNA splicing in rare human samples
Topic: RNA splicing genetic variation splicing quantitati
  • Ankeeta Shah, Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, IL, United States
  • Yang I Li, Section of Genetic Medicine, Department of Medicine, Department of Human Genetics, University of Chicago, Chicago, IL, United States

Short Abstract: Over 90% of all disease-associated variants identified in genome-wide association studies (GWAS) are located outside protein-coding regions, suggesting that they impact gene regulation. While most efforts to understand disease-associated variants have focused on their effects on gene expression levels, we have previously shown that mRNA splicing is another important mechanism linking genetic variation to disease (Li et al., 2016). Our work therefore suggests that a substantial number of GWAS hits may affect disease risk by impacting RNA splicing. To better understand how variation in RNA splicing affects disease, we investigated the impact of genetic variation on RNA splicing in a large collection of 53 tissues from the Genotype-Tissue Expression (GTEx) Consortium. Using a novel splicing quantification method called LeafCutter (Li et al., 2018), we identified over forty thousand unannotated splicing events across tissues and mapped thousands of splicing quantitative trait loci (sQTL) in individual tissues. We found that 10-20% of splicing events show tissue-specific patterns of inclusion, and that up to 40% of testis-specific splicing events are not observed in any other tissue. We found that sQTL were enriched in nearly all GWAS traits examined, including autoimmune and neurodegenerative diseases. Notably, most sQTL did not affect gene expression levels, confirming the widespread importance of measuring RNA splicing alongside gene expression. Finally, we quantified the concordance of sQTL effects on splicing events that are present across GTEx tissues using flashr, a statistical method that accurately estimates the sharing of genetic effects across multiple phenotypes and/or samples (Wang and Stephens, 2018). Strikingly, our analysis revealed that ~90% of sQTL are shared across samples, compared with ~70% of expression quantitative trait loci. The high level of sharing of genetic effects on RNA splicing across tissues offers the possibility of accurately imputing the effects of genetic variation on RNA splicing in rare or inaccessible human samples with only a single reference transcriptome. By using the imputed effects of genetic variation together with large neuropsychiatric disease GWASs, we can therefore study the impact of RNA splicing variation in the developing fetal brain in the context of multiple neuropsychiatric diseases.

RSG-8: Metabolic network-based predictions of toxicant-induced metabolite changes in the laboratory rat
Topic: constraint-based modeling acetaminophen transcript
  • Anders Wallqvist, US Army Medical Research and Materiel Command, United States
  • Jaques Reifman, US Army Medical Research and Materiel Command, United States
  • Venkat Pannala, BHSAI/TATRC, United States
  • Kalyan Vinnakota, BHSAI/TATRC, United States
  • Martha Wall, Vanderbilt University School of Engineering, United States
  • Shanea Estes, Vanderbilt University, United States
  • Tracy O’Brien, Vanderbilt University, United States
  • Richard Printz, Vanderbilt University, United States
  • Masakazu Shiota, Vanderbilt University, United States
  • Jamey Young, Vanderbilt University School of Engineering, United States
  • Irina Trenary, Vanderbilt University School of Engineering, United States

Short Abstract: To provide timely treatment for organ damage initiated by therapeutic drugs or exposure to environmental toxicants, we first need to identify markers that provide an early diagnosis of potential adverse effects before permanent damage occurs. In particular, the liver, a primary organ prone to toxicant-induced injury, lacks diagnostic markers that are specific and sensitive to the early onset of injury. Here, to identify plasma metabolites as markers of early toxicant-induced injury, we used a constraint-based modeling approach with a genome-scale network reconstruction of rat liver metabolism to incorporate perturbations of gene expression induced by acetaminophen, a known hepatotoxicant. A comparison of the model results against global metabolic profiling data revealed that our approach satisfactorily predicted altered plasma metabolite levels as early as 5 h after exposure to 2 g/kg of acetaminophen, and that at 10 h after treatment the predictions improved significantly when we integrated measured central carbon fluxes. Our approach is driven solely by gene expression and physiological boundary conditions and does not rely on any toxicant-specific model component. As such, it provides a mechanistic model that serves as a first step in identifying a list of putative plasma metabolites that could change due to toxicant-induced perturbations.

RSG-9: Bio Network Visualization and Analysis from Text Mining Data
Topic: Bio-literature mining Text mining Named Entity Rec
  • Soo Jun Park, Electronics and Telecommunications Research Institute, South Korea
  • Soo Young Cho, National Cancer Center, South Korea
  • Young Seek Lee, Seoul National University, South Korea

Short Abstract: We propose a system for visualizing and analyzing biological networks derived from text mining data. The system automatically extracts relations from the biological literature in PubMed using named entity recognition and relation extraction. Articles are preprocessed with part-of-speech (POS) tagging and syntactically parsed, and a maximum entropy model is used to extract entities and their relations. The relations are then visualized as a graph so that knowledge about neighboring entities and their properties can be grasped efficiently. These relations, which form a bio network, can be further analyzed by applying a weight-based association ranking algorithm that uses additional information from the discovered network. The proposed system collects articles through a PubMed search and fetches their abstracts with the Entrez Programming Utilities, a structured interface that uses a fixed URL syntax to translate input parameters into the values that NCBI software components need to search for and retrieve the requested data. The retrieved abstracts first undergo preprocessing: POS tagging and syntactic parsing, combined with a statistical named entity recognition method such as a maximum entropy model. Our named entity recognizer extracts 40 categories of named entities, organized as a hierarchical database. The recognized biological entities are then analyzed syntactically to identify the relations among them: we parse the sentences and extract relations from the parsing results. A relationship is typically expressed by a verb, with the corresponding subject and object words located before and after it. Each relation word has attributes such as polarity, negativity, or weight. Using a weight-based ranking algorithm, we can construct relation pathways among the extracted entities, and with these attributes we can provide a predicted relation between a gene and a disease. The constructed path also works as a sub-network of the mined results. The meaningful relations between extracted entities are visualized as a graph, which we call a network view. Nodes represent entity names, and edges represent the relations recognized by the mining engine. When a target entity is selected, its related entities are displayed together with the sentences that contain them and their meaningful relations. Once the entities are extracted, we can trace the relations between any desired entities.
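The fixed URL syntax of the Entrez Programming Utilities mentioned in RSG-9 can be illustrated by building the two request URLs such a system would need: an ESearch query to find PubMed IDs and an EFetch query to retrieve their abstracts. The base URL and the `db`, `term`, `retmax`, `id`, `rettype` and `retmode` parameters are part of the public E-utilities interface; the search term, the PubMed IDs, and the helper function names are placeholders chosen for this sketch, and no network request is made.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(term, retmax=20):
    """URL that searches PubMed for a term and returns up to retmax IDs."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"

def efetch_abstract_url(pmids):
    """URL that fetches plain-text abstracts for a list of PubMed IDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"

# Placeholder query and IDs, for illustration only.
search_url = esearch_url("BRCA1 AND breast cancer", retmax=5)
fetch_url = efetch_abstract_url(["111", "222"])
print(search_url)
print(fetch_url)
```

A real client would send these URLs with an HTTP library, parse the ID list from the ESearch response, and pass it to EFetch, while respecting NCBI's request-rate guidelines.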

RSG-10: Incorporating genetic networks into case-control association studies with high-dimensional DNA methylation data
Topic: DNA methylation Genetic network Statistical select
  • Hokeun Sun, Pusan National University, South Korea
  • Kipoong Kim, Pusan National University, South Korea

Short Abstract: In human genetic association studies with high-dimensional gene expression data, it is well known that statistical selection methods that use prior biological network knowledge, such as genetic pathways and signaling pathways, can outperform methods that ignore genetic network structure in terms of true positive selection. In recent epigenetic research on case-control association studies, a relatively large number of statistical methods have been proposed to identify cancer-related CpG sites and their corresponding genes from high-dimensional DNA methylation array data. However, most existing methods are not designed to use genetic network information, even though the methylation levels of linked genes in genetic networks tend to be highly correlated. Here, we propose a new approach that combines data dimension reduction techniques with network-based regularization to identify outcome-related genes in the analysis of high-dimensional DNA methylation data. The proposed approach first captures gene-level signals from multiple CpG sites using data dimension reduction techniques and then regularizes them to perform gene selection based on prior genetic network information. In simulation studies, we demonstrated that the proposed approach outperforms other statistical methods that do not use genetic network information in terms of true positive selection. We also applied it to the 450K DNA methylation array data of the four breast invasive carcinoma subtypes from The Cancer Genome Atlas (TCGA) project.

RSG-11: Virus-Directed Disruption of PPI Network Control
Topic: network analysis network topology controllability
  • Emily Ackerman, University of Pittsburgh, United States
  • Jason Shoemaker, University of Pittsburgh, United States

Short Abstract: While yearly seasonal outbreaks of influenza virus create high demand for effective treatment strategies, only three CDC-approved influenza virus treatments are administered today. With an estimated 3-5 million severe cases worldwide each year (which disproportionately affect at-risk populations), effective methods of antiviral drug target discovery are needed. Network analysis methods have identified disease-associated genes and predicted drug targets using network topology measures such as degree and betweenness. However, there has been little exploration of the differential network behavior between the healthy and diseased cell states to predict key regulators of the diseased state. Controllability is the concept that a system can be driven to any possible final state given an appropriate external input. If a host PPI network is viewed as a system of protein states, viral manipulation of healthy protein states drives the system to a final diseased state. Two network controllability methods exist: one determines how the absence of a protein affects the difficulty of controlling network behavior, and the other evaluates a protein’s importance to all configurations of network control. Prior to this work, only the first method had been applied to a human PPI network. Here, we complete a controllability analysis of two PPI networks (a host PPI network and an integrated virus-host PPI network), comparing healthy and infected cell states to identify key regulators of network behavior. Influenza A virus (IAV)-interacting host proteins and “driver” proteins (a non-unique set that must be manipulated for system control) are compared with all proteins to determine whether these groups are enriched for virus replication host factors. Proteins that are both IAV-interacting and drivers show an increase in topology measures post-infection. The analysis suggests that viruses prefer to interact with host proteins that offer the greatest advantage in controlling total system behavior. In total, 24 proteins are predicted as indicators of the infected state, the most prominent being PRMT5. Twelve proteins are identified as having interferon-regulating roles centered around NF-kB. Though the protein set is not enriched for host factors of influenza A replication, the protein HNRPA0 has been identified in three independent virus replication screens. Analyzing the disruption of control structure within the infected-cell PPI network identifies proteins that allow infection to take hold and are therefore possible antiviral drug targets. Deeper knowledge of the host network’s control behavior could provide new insights into inhibiting virus replication and repositioning current drugs.
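The idea of a non-unique set of "driver" proteins in RSG-11 can be illustrated with the maximum-matching formulation of structural controllability, in which the nodes left unmatched by a maximum matching of a directed network form one minimum driver set. The abstract does not state which controllability formalism the authors used, so this choice is an assumption, as are the toy network and node names; the matching is found with a plain augmenting-path search.

```python
def driver_nodes(nodes, edges):
    """Return one minimum driver-node set of a directed network.

    A node is 'matched' when one incoming edge is selected such that every
    source and every target appears in at most one selected edge. Nodes left
    unmatched by a maximum matching must receive external input directly.
    """
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
    matched_by = {}  # target node -> source node of its selected edge

    def try_match(src, visited):
        # Augmenting-path step: find a target for src, re-routing if needed.
        for dst in successors[src]:
            if dst in visited:
                continue
            visited.add(dst)
            if dst not in matched_by or try_match(matched_by[dst], visited):
                matched_by[dst] = src
                return True
        return False

    for src in nodes:
        try_match(src, set())
    unmatched = [n for n in nodes if n not in matched_by]
    # A perfectly matched network still needs one input; pick any node.
    return unmatched or [nodes[0]]

# Hypothetical directed interaction network: A -> B -> C and A -> D.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("A", "D")]

drivers = driver_nodes(nodes, edges)
print("driver nodes:", drivers)
```

Here A can be matched to either B or D but not both, so the set is either {A, D} or {A, B}; both have the same size, which is why the driver set is described as non-unique.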

RSG-12: A supervised method for enhancer identification and linkage to target genes
Topic: Enhancers Transcriptional regulation Machine learn
  • Florian Schmidt, Saarland University, Germany
  • Alexander Marx, Max Planck Institute for Informatics and Saarland University, Germany
  • Jonathan Goeke, Genome Institute of Singapore, Singapore
  • Jilles Vreeken, Max-Planck Institute for Informatics and Saarland University, Germany
  • Marcel Schulz, Saarland University, Germany

Short Abstract: Understanding transcriptional regulation is a major goal of computational biology. Enhancers in particular are essential regulators driving cellular development. Enhancers can be identified experimentally, e.g. using enhancer RNAs, ChIP-seq of histone modifications (HMs), or Hi-C experiments. However, experimentally linking enhancers to genes is challenging. Several computational methods have therefore been proposed to create tissue-specific enhancer maps from epigenetics data. A common strategy for de novo linking of tissue-specific enhancer regions to genes is to unify DNase-hypersensitive sites (DHS) across several samples. The unified regions are then linked to nearby genes, either purely by distance or using a correlation test between the epigenetic signal and the expression of the possible target gene. Integrative efforts have also been made to combine known enhancers in curated databases such as GeneHancer. Via gene-expression modeling, we show that these approaches are limited in accounting for the distinct regulatory landscape of genes and thus lead to suboptimal enhancer-gene associations. We developed an unbiased, peak-independent, supervised method called STITCH to identify regulatory regions and link them to genes. We apply STITCH to a uniformly reprocessed dataset comprising paired DNase1-seq and RNA-seq data for 215 human samples from IHEC. Within STITCH, we consider the epigenetic signal of all samples jointly, using the minimum description length principle to identify regions whose signal variation is related to the expression of a distinct gene. In contrast to purely peak-based approaches, no sample-specific information is lost. STITCH finds associations over large genomic intervals, e.g. 1 Mb, leading to an extensive catalog of enhancer-gene interactions. STITCH is compared against GeneHancer and two approaches that combine DHS sites in other common ways. Enhancers called by STITCH lead to better performance of gene-expression models than both GeneHancer regions and peak-based approaches. Additionally, STITCH enhancer predictions cover about 85% of the GeneHancer database, supporting the quality of our predictions. Including data on HMs revealed that the enhancers identified with STITCH and DNase1-seq data are often surrounded by H3K27ac, an established enhancer mark. Furthermore, we show various downstream applications, e.g. how our identified enhancers can be used to link noncoding DNA mutations in cancer to the genes they regulate. STITCH is freely available (https://github.com/SchulzLab/STITCH). Thanks to an efficient implementation, large data sets comprising hundreds of samples can be processed easily. We therefore believe that STITCH can pave the way for a better understanding of gene-specific regulation, especially in light of the large amounts of epigenetics data becoming available.

RSG-13: NEMO: A new algorithm for cancer subtyping by multi-omic clustering
Topic: Multi-omics Cancer subtypes Integration TCGA
  • Nimrod Rappoport, Tel Aviv University, Israel
  • Ron Shamir, School of Computer Science, Tel Aviv University, Israel

Short Abstract: High throughput experimental methods developed in recent years provide a large number of molecular measurements in a single experiment. Computational analysis of such datasets, termed "omics", has helped in the discovery of more biologically and clinically relevant cancer subtypes. Increasingly, measurements of multiple omic profiles for the same cohort are available. Defining cancer subtypes using multi-omic data may improve our understanding of cancer, and suggest more precise treatment for patients. We present NEMO (NEighborhood based Multi-Omics clustering), a novel algorithm for multi-omics clustering. NEMO is based on integrating the similarities across different omics between patients, while specifically considering patients that are highly similar to one another. It is faster and simpler than extant multi-omics clustering algorithms, and avoids iterative optimization. We performed extensive testing of NEMO on ten cancer datasets from TCGA spanning 3168 patients, with three omics (gene expression, DNA methylation and miRNA expression) measured for each patient. To assess the solution quality, we tested how distinct its clusters are in terms of survival and other reported clinical parameters. In these tests, NEMO outperformed nine state of the art multi-omics clustering algorithms, including multi-view clustering methods developed in the Machine Learning community for a similar task. Many publicly available multi-omic datasets are partial, namely, some patients lack measurements for some omics. To analyze such datasets, current methods either require imputation of missing values or analyze only the subset of patients that have data of all omics. In contrast, NEMO supports clustering of all samples without imputing missing values. We tested performance on partial data of NEMO, of nine multi-omics clustering methods with imputation, and of PVC, a multi-view algorithm that supports partial data. 
Tests included both synthetic data and the ten TCGA cancer datasets redacted to simulate partial data. NEMO outperformed all methods that used imputed data. PVC performed better on some of the tests, but it is limited to two omics and positive data. Finally, we used NEMO to perform a detailed analysis of a TCGA partial dataset of AML patients, and demonstrated the increased statistical power gained from considering patients with partial data. NEMO can be downloaded from GitHub. A full manuscript is available on bioRxiv.
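The neighborhood-based integration described above can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. The Gaussian similarity, the function names, the toy patients, and the choice of k are all assumptions made for illustration. The sketch scores patient pairs within each omic using only each patient's k nearest neighbors, then averages those scores over the omics in which both patients were measured, so a patient missing one omic needs no imputation.

```python
import math


def knn_relative_similarity(profiles, k):
    """Score patient pairs within one omic using only k-nearest neighbors.

    profiles: {patient_id: [feature values]} (hypothetical toy format).
    Returns {(i, j): score}; the score is symmetric in i and j.
    """
    patients = sorted(profiles)
    # Gaussian similarity on squared Euclidean distance (an assumed kernel).
    sim = {}
    for i in patients:
        for j in patients:
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
                sim[(i, j)] = math.exp(-d2)
    # k most similar patients for each patient.
    neighbors = {}
    for i in patients:
        ranked = sorted((j for j in patients if j != i), key=lambda j: -sim[(i, j)])
        neighbors[i] = set(ranked[:k])
    totals = {i: sum(sim[(i, n)] for n in neighbors[i]) for i in patients}
    rel = {}
    for i in patients:
        for j in patients:
            if i == j:
                continue
            score = 0.0
            if j in neighbors[i]:
                score += sim[(i, j)] / totals[i]
            if i in neighbors[j]:
                score += sim[(j, i)] / totals[j]
            rel[(i, j)] = score
    return rel


def integrate(omics, k):
    """Average per-omic scores over the omics where both patients are measured."""
    sums, counts = {}, {}
    for profiles in omics:
        for pair, score in knn_relative_similarity(profiles, k).items():
            sums[pair] = sums.get(pair, 0.0) + score
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: sums[pair] / counts[pair] for pair in sums}


# Toy data: patient p4 has no methylation measurement (partial data).
expression = {"p1": [0.0], "p2": [0.1], "p3": [5.0], "p4": [5.1]}
methylation = {"p1": [1.0], "p2": [1.1], "p3": [9.0]}
integrated = integrate([expression, methylation], k=1)
```

The integrated scores would then be passed to a clustering step, which this sketch omits. In the toy data, p4 still receives scores from the expression omic alone.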

RSG-14: TFInvestigator: A tool for constructing temporal transcriptional cascades applied to cardiac development
Topic: Regulatory networks Time series Data Regulatory Ca
  • Rayan Daou, University Medical Center Goettingen, Germany
  • Edgar Wingender, University Medical Center Goettingen, Germany
  • Martin Haubrock, University Medical Center Goettingen, Germany

Short Abstract: Gene regulatory networks are dynamic by nature, although they are often displayed as static interactions between regulators and their targets. To capture the dynamic aspect of these networks we present TFInvestigator, a tool that combines a reference regulatory network based on binding site predictions with time series gene expression data to generate a stage-specific dynamic network. The dynamic network identifies the regulators that are specific to each time point, resulting in a cascade that shows the emergence and disappearance of these regulators and regulatory interactions across time. Being web-based, user-friendly, and built on a simple approach, the tool requires no expert knowledge in programming or statistics, making it directly usable for experimentalists. In addition to generating the dynamic network, the tool offers multiple interactive visual workflows in which a user can track, tweak, and further investigate different regulators, genes, and interactions, steering the tool toward biologically sensible results. The implemented workflows were applied to two expression datasets based on experiments conducted on human induced pluripotent stem cells (hiPSCs) and human embryonic stem cells (hESCs) undergoing differentiation into mature cardiomyocytes across different time points. They successfully identified previously known and new potential key regulators of cardiac development and the particular time points they are associated with, providing a dynamic regulatory cascade that explains the temporal role of each of those regulators.

RSG-15: Site directed mutagenesis suggests that phosphorylation plays a role in the function of a dominant MED5 mutation in Arabidopsis
Topic: Site directed mutagenesis Phenylpropanoid biosynth
  • Anne Heintzelman, Purdue University Department of Biochemistry, Center for Plant Biology, United States
  • Xiangying Mao, Purdue University Department of Biochemistry, Center for Plant Biology, United States
  • Clint Chapple, Purdue University Department of Biochemistry, Center for Plant Biology, United States

Short Abstract: Among other critical functions, the Mediator protein complex in Arabidopsis maintains homeostasis of phenylpropanoid biosynthesis, largely through the activity of the MED5 subunit. Utilizing a forward genetic screen, plants with mutations in MED5b were identified that have a dwarfed growth phenotype and reduced production of sinapoylmalate, a UV-absorbing pigment in leaf epidermal cells. Because sinapoylmalate fluoresces blue under UV light, these mutants exhibit a reduced epidermal fluorescence (ref) phenotype. The mutant ref4-3 is the result of a point mutation that causes a G383S amino acid substitution in MED5b, leading to constitutive repression of the phenylpropanoid pathway. Complete loss of REF4, however, results in normal growth and sinapoylmalate accumulation. Loss of ref4-related 1 (RFR1/MED5a), a paralog of REF4, also results in a wild-type growth phenotype with a slight increase in sinapoylmalate production over the wild type. Knocking out both REF4 and RFR1, referred to as med5a/b, results in wild-type growth and a greater increase in sinapoylmalate production. The mechanism by which the G383S point mutation in ref4-3 leads to repression of the phenylpropanoid pathway is unknown. We explored the possibility that the effects of ref4-3 are due either to the introduction of a phosphorylation site into the MED5b protein or, more simply, to increased side chain size at position 383. We tested this hypothesis by transforming med5a/b double knockout plants with MED5b constructs containing various G383 mutations. These plants were assessed for growth phenotype, genotype, and sinapoylmalate accumulation by UV fluorescence and HPLC. The results from these experiments suggest that the phenylpropanoid metabolism associated with sinapoylmalate accumulation does not depend on the phosphorylation of MED5b and that the reduction of sinapoylmalate that caused the ref phenotype is more likely due to the increased side chain size at position 383 in MED5.

RSG-16: Improving Gene Prioritization Algorithms Using Diffusion State Distance-Preprocessed Protein Networks
Topic: protein interaction network diffusion state distan
  • Yuelin Liu, Tufts University, United States
  • Lenore Cowen, Tufts University, United States

Short Abstract: The diffusion state distance (DSD) is a novel distance measure for small world networks. Recent studies have found that incorporating a DSD-based kernel into existing methods improves performance for protein function prediction and disease module identification. We were interested in whether the same would be true for network-based gene prioritization algorithms. Given a protein interaction network and a set of known disease-gene associations, network-based gene prioritization algorithms rank candidate genes based on their predicted association scores to a target disease using network information. We considered three ways of preprocessing a protein network based on DSD distance: 1) replacing confidence-weighted edge weights with the inverse of the DSD distance, 2) reweighting the edges by multiplying the given confidence edge weight by the inverse of the DSD distance, and 3) taking a minimum spanning tree in the original network according to DSD distance, and then adding back edges whose DSD distance was beneath a cutoff. We evaluated the performance of two popular network-based gene prioritization algorithms, Kohler’s RWR and the DADA algorithm, on two different human protein-protein interaction networks (STRINGdb and HIPPIE), comparing the performance of the algorithms run directly on the network with that of their cDSD-edge-weighted variants. We observed that the multiplicative reweighting method combining DSD with confidence edge weights (method 2) consistently gave the best performance, outperforming the original implementations of the algorithms as well as the other DSD-modified variants. Method 1 did not perform as well as method 2 but usually improved on the original implementation. Performance for method 3 was always worse than that of the original algorithms.
We conclude that reweighting PPI networks by multiplying the given confidence weight by the inverse of the confidence-weighted DSD distance is a good strategy for improving the performance of these algorithms. We hypothesize that for gene prioritization problems, both the local information in the immediate neighborhood as well as the more global information about the neighborhood captured by DSD must be combined in some way to improve performance.
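As a minimal sketch of preprocessing methods 1 and 2 described above: the function name, the dictionary-based edge format, and the toy scores below are hypothetical, and the DSD values are assumed to have been computed beforehand by a separate tool. Method 3 (spanning tree plus cutoff) is not sketched.

```python
def reweight_edges(confidence, dsd, method=2):
    """Return new edge weights from confidence scores and DSD distances.

    method 1: weight = 1 / DSD
    method 2: weight = confidence * (1 / DSD)
    Both inputs map an edge (pair of protein ids) to a number.
    """
    if method not in (1, 2):
        raise ValueError("only methods 1 and 2 are sketched here")
    weights = {}
    for edge, conf in confidence.items():
        distance = dsd[edge]
        if distance <= 0:
            raise ValueError(f"DSD for {edge} must be positive")
        inverse = 1.0 / distance
        weights[edge] = inverse if method == 1 else conf * inverse
    return weights


# Hypothetical toy network: two edges with made-up scores.
confidence = {("A", "B"): 0.9, ("B", "C"): 0.4}
dsd = {("A", "B"): 2.0, ("B", "C"): 0.5}

method1 = reweight_edges(confidence, dsd, method=1)
method2 = reweight_edges(confidence, dsd, method=2)
```

In this toy case the low-confidence edge B-C ends up with the larger weight under both methods because its DSD is small, but method 2 tempers that weight by the confidence score. This is one way to read the abstract's point about combining local and more global information.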

RSG-17: Variational Autoencoder learns representative chromatin accessibility profiles and makes accurate binding predictions at motif sites
Topic: chromatin accessibility variational autoencoder DN
  • Jennifer Hammelman, Massachusetts Institute of Technology CSAIL, United States
  • David Gifford, gifford@mit.edu, United States

Short Abstract: Current methods for predicting transcription factor binding from DNase-seq- or ATAC-seq-derived DNA accessibility often assume that all bound motif sites share a single accessibility profile. However, accessibility for a given motif, even at bound sites, is decidedly heterogeneous. We suggest a variational autoencoder model that encodes a latent space representing the distribution of accessibility profiles at input genomic sites, and that can be used to generate a set of representative profiles from the centers of the clustered latent space. We choose a variational autoencoder over an autoencoder or other dimensionality reduction methods because it encourages a meaningful compressed representation. Using a variational autoencoder with input and output of normalized read counts of DNase-seq accessibility at 1000bp centered at the site of a motif, the learning objective is a continuous-space latent representation of the accessibility profiles at the set of genomic sites. We demonstrate the ability of the model to separate the accessibility of CTCF motif sites into clusters of distinct accessibility profiles, including directional accessibility profiles, and compare this to a baseline of k-means clustering on the normalized count data. We also explore the ability of the model to cluster tissues for E2F4 motif sites, demonstrating that, even without using information about the known binding accessibility profile, changes to motif site accessibility are informative. We expect the same trends to hold for many other motifs, as their model-learned clusters indicate multiple distinct accessibility profiles. The presence of disparate accessibility profiles for a single DNA binding protein challenges the assumptions of other models that motif sites only carry binary information via the binding of transcription factors.
Finally, we show that, when trained to classify binding sites with supervised learning by adding a classification output layer that takes as input the latent space and motif score, the model has competitive auROC (0.776) and average precision (0.190) compared to two other prediction methods, PIQ (auROC=0.778, average precision=0.402) and DeFCoM (auROC=0.575, average precision=0.167), for RAD21 in GM12878 cells. We expect this variational autoencoder model to be helpful in cases where the input sites may actually represent a family of transcription factors for which we can identify distinct accessibility profiles, in cases where transcription factors bind and directionally open chromatin, or in cases where transcription factors have different accessibility patterns, such as opening closed chromatin or binding to already open chromatin, at different genomic sites or in different cell types.

RSG-18: Chem-seq: Evaluation of Genomewide Binding Effects of DNA Minor Groove-Binding Pyrrole Imidazole Polyamides
Topic: next-generation sequencing minor-groove binder exp
  • Jason Lin, Chiba Cancer Center Research Institute, Japan
  • Atsushi Takatori, Chiba Cancer Center Research Institute, Japan
  • Paul Horton, National Institute of Advanced Industrial Science and Technology (AIST), Japan
  • Hiroki Nagase, Chiba Cancer Center Research Institute, Japan

Short Abstract: Pyrrole-imidazole (PI) polyamides are motif-specific DNA minor groove binders that can strike at oncotargets deemed “undruggable” at the protein level, e.g. Ras. While extensive studies illuminate the potential of PI polyamides at modifying the cancer epigenome and disrupting oncogenes, their short recognition motifs often imply the presence of multiple binding targets, thus highlighting the need to evaluate the impact of PI polyamide binding in a cancer genome for bedside translation, especially in the area of off-target and phenotypic side effects. Investigations in the off-target effects of PI polyamides, however, are particularly challenging, since the relatively short base pairs of < 20 can affect the frequency of genomic binding. These increases may translate to binding at non-intended loci, potentially leading to off-target effects, issues that very few approaches are able to address to-date. We have been developing a next-generation sequencing-based platform, Chem-seq, to address the genome-wide effect of PI polyamides via affinity-based IonTorrent sequencing, expression microarrays and computational analysis. Chem-seq can be used to identify a PI polyamide’s binding targets in vitro from sequencing data to reveal the underlying biochemical changes, and to infer subsequently the possible phenotypic changes in vivo via the prediction of side effects from gene expression profiling by machine learning. This presentation will discuss our method of inferring off-target binding from expression profiling based on the relative impact to various biochemical pathways, as well as an accompanying side effect prediction engine to allow candidate polyamides to be systematically screened. Mouse experiments corroborated some of these predictions and further emphasized the robustness of our prediction model for our polyamide candidates. 
The use of the Chem-seq platform allows us to evaluate PI polyamide candidates with increased throughput and confidence, which may accelerate the bedside use of PI polyamides as cancer therapeutic agents.

RSG-19: quanTIseq: portraying the tumor immune contexture through RNA-seq data deconvolution
Topic: Deconvolution Cancer immunology Gene expression Ne
  • Francesca Finotello, Medical University of Innsbruck, Austria
  • Clemens Mayer, Medical University of Innsbruck, Austria
  • Christina Plattner, Medical University of Innsbruck, Austria
  • Gerhard Laschober, Medical University of Innsbruck, Austria
  • Dietmar Rieder, Medical University of Innsbruck, Austria
  • Hubert Hackl, Medical University of Innsbruck, Austria
  • Anne Krogsdam, Medical University of Innsbruck, Austria
  • Zuzana Loncova, Medical University of Innsbruck, Austria
  • Wilfried Posch, Medical University of Innsbruck, Austria
  • Doris Wilflingseder, Medical University of Innsbruck, Austria
  • Sieghart Sopper, Medical University of Innsbruck, Austria
  • Marieke Ijsselsteijn, Leiden University Medical Centre, Netherlands
  • Douglas Johnson, Vanderbilt University, United States
  • Yaomin Xu, Vanderbilt University, United States
  • Yu Wang, Vanderbilt University, United States
  • Melinda E. Sanders, Vanderbilt University, United States
  • Monica V. Estrada, Vanderbilt University, United States
  • Paula Ericsson-Gonzalez, Vanderbilt University, United States
  • Justin Balko, Vanderbilt University, United States
  • Noel de Miranda, Leiden University Medical Centre, Netherlands
  • Zlatko Trajanoski, Medical University of Innsbruck, Austria

Short Abstract: The tumor immune contexture, namely the type and density of tumor-infiltrating immune cells, has prognostic value in several cancers and influences patients’ responses to immunotherapy with immune checkpoint blockers. Moreover, the study of its pharmacological modulation by conventional and targeted drugs could identify synergistic partners for combinatorial therapies based on immune checkpoint blockers. However, the quantification of the immune contexture is currently hampered by the lack of simple and efficient methods. We developed quanTIseq, a computational pipeline based on gene-expression deconvolution that quantifies the fractions and densities of ten immune cell types from bulk RNA sequencing (RNA-seq) and tissue-imaging data. Unlike previous approaches, quanTIseq is specifically designed for RNA-seq data and implements a full analytical pipeline that consists in: read pre-processing, gene expression quantification and normalization, gene re-annotation, and estimation of cell fractions and densities. We performed an extensive validation using publicly available data sets, as well as ad-hoc generated RNA-seq data from simulations, blood-derived mixtures, and tumor samples from three cancer cohorts. Using cell fractions estimated by flow cytometry and immunohistochemistry as gold standard measures, we proved the high accuracy and robustness of quanTIseq for the quantification of immune cells in blood and tumor samples, respectively. We applied quanTIseq to more than 8,000 solid tumors from The Cancer Genome Atlas (TCGA) revealing that the immune contexture is highly heterogeneous within and across cancer types. Moreover, our pan-cancer analysis provided evidence that the activation of the CXCR3/CXCL9 axis is more strongly associated with the infiltration of cytotoxic T cells than mutational load and other genomic features. 
Furthermore, we demonstrated that Immunoscore and T/B-cell score derived from deconvolution results have prognostic value in several solid cancers. In addition, analysis of the transcriptomes of patients treated with kinase inhibitors revealed profound pharmacological remodeling of the immune contexture. These results suggest that quanTIseq can be used to identify immunogenic effects of conventional and targeted drugs and thereby reveal a mechanistic rationale for the design of combination therapies. Finally, analysis of the baseline RNA-seq data from patients treated with PD1 blockers showed the potential of quanTIseq for the extraction of immunological features that, alone or in combination, might predict the response to checkpoint blockade. quanTIseq is available at: http://icbi.i-med.ac.at/software/quantiseq/doc.
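For readers unfamiliar with expression deconvolution, the following sketch shows one common formulation. It is not quanTIseq's own pipeline. It assumes a bulk profile is modeled as a non-negative mixture of per-cell-type signature profiles and recovers the fractions by projected gradient descent on the least-squares error. The signature values, the function name, and the solver choice are assumptions for illustration only.

```python
def deconvolve(signatures, mixture, iterations=2000):
    """Estimate non-negative cell fractions f minimizing ||S f - m||^2.

    signatures: list of rows, one per gene, each holding the expression of
                that gene in every cell type (hypothetical signature matrix S).
    mixture:    bulk expression per gene (vector m).
    """
    n_genes = len(signatures)
    n_types = len(signatures[0])
    # The squared Frobenius norm bounds the largest eigenvalue of S^T S,
    # so this step size keeps gradient descent stable.
    frob2 = sum(v * v for row in signatures for v in row)
    step = 1.0 / (2.0 * frob2)
    fractions = [1.0 / n_types] * n_types
    for _ in range(iterations):
        residual = [
            sum(signatures[g][t] * fractions[t] for t in range(n_types)) - mixture[g]
            for g in range(n_genes)
        ]
        gradient = [
            2.0 * sum(signatures[g][t] * residual[g] for g in range(n_genes))
            for t in range(n_types)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        fractions = [max(0.0, f - step * grad) for f, grad in zip(fractions, gradient)]
    return fractions


# Made-up signatures for 3 genes x 2 cell types, mixed at fractions 0.3 / 0.7.
signatures = [[10.0, 1.0], [1.0, 10.0], [5.0, 5.0]]
mixture = [3.7, 7.3, 5.0]
estimated = deconvolve(signatures, mixture)
```

This sketch enforces only non-negativity. A fuller treatment would typically also constrain the fractions to sum to at most one, leaving room for uncharacterized cells.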

RSG-20: Gut microbiome composition and small RNA spectra in human stool for colorectal cancer detection
Topic: microbiome miRNA noncoding RNA small noncoding RNA
  • Giulio Ferrero, Department of Computer Science, University of Torino, Italy
  • Francesca Cordero, Department of Computer Science, University of Torino, Italy
  • Sonia Tarallo, Italian Institute for Genomic Medicine (IIGM), Torino, Italy, Italy
  • Gaetano Gallo, Department of Colorectal Surgery, Clinica S. Rita, Vercelli, Italy, Italy
  • Antonio Francavilla, Italian Institute for Genomic Medicine (IIGM), Torino, Italy, Italy
  • Giuseppe Clerico, Department of Colorectal Surgery, Clinica S. Rita, Vercelli, Italy, Italy
  • Alberto Realis Luc, Department of Colorectal Surgery, Clinica S. Rita, Vercelli, Italy, Italy
  • Paolo Manghi, Centre for Integrative Biology (CIBIO), University of Trento, Italy, Italy
  • Andrew Thomas, Centre for Integrative Biology (CIBIO), University of Trento, Italy, Italy
  • Paolo Vineis, Italian Institute for Genomic Medicine (IIGM), Torino, Italy, Italy
  • Nicola Segata, Centre for Integrative Biology (CIBIO), University of Trento, Italy, Italy
  • Barbara Pardini, Italian Institute for Genomic Medicine (IIGM), Torino, Italy, Italy
  • Alessio Naccarati, Italian Institute for Genomic Medicine (IIGM), Torino, Italy, Italy

Short Abstract: Much evidence suggests that gut microbiota dysbiosis contributes to the onset and progression of colorectal cancer (CRC). Human small RNAs (sRNAs) have been shown to be involved in both tumorigenesis and inter-kingdom interactions between the microbiome and human cells. However, little is known about how informative Whole Metagenome Sequencing (WMS) and small RNA-Seq (sRNA-Seq) are when performed on the same stool samples from CRC patients. We performed WMS and sRNA-Seq on 80 stool samples collected from healthy individuals and patients with adenoma or CRC. MetaPhlAn2 was applied to identify bacterial relative abundances from WMS data, while sRNA-Seq reads were analysed using a novel computational pipeline integrating the BWA and Kraken algorithms. Secondary structure was predicted using RNAfold applied to each sRNA-Seq read. Patient classification accuracy was computed using a Random Forest classifier. Our analysis revealed a significant correlation between relative abundances of bacterial DNA and sRNAs (median r=0.89), with a consistent increase in Proteobacteria from healthy individuals to CRC patients. Escherichia coli emerged as a significantly abundant bacterium at both the DNA and sRNA level, but was also associated with a low transcriptional rate. Secondary structure analyses revealed that sRNA reads assigned to bacteria could form more stable structures than reads mapped to human sRNAs, consistent with the more highly structured nature of bacterial RNAs. Analysing the reads assigned to bacterial sRNA annotations led us to a set whose expression significantly increased from healthy individuals to CRC patients. Notably, using the combination of human and bacterial sRNAs with bacterial DNA profiles, we obtained a high sample classification accuracy (AUC=0.87).
Our results support the hypothesis that the integration of WMS and sRNA-Seq data can be efficiently employed to classify samples as well as to extract novel insights into human and bacterial sRNAs involved in a disease.

RSG-21: Unveiling of conserved transcriptomics perturbation signatures in mice and human
Topic: Functional genomics Perturbation signatures Pathwa
  • Christian Holland, Joint Research Center for Computational Biomedicine, RWTH Aachen, Germany, Germany
  • Bence Szalai, Department of Physiology, Semmelweis University Budapest, Hungary, Hungary
  • Luz Garcia-Alonso, EMBL-EBI, Wellcome Genome Campus, Cambridge, UK, United Kingdom
  • Julio Saez-Rodriguez, Institute of Computational Biomedicine, Heidelberg University, Germany, Germany

Short Abstract: Profiling of the whole transcriptome leads to large gene expression matrices which are hard to analyse and interpret beyond differential gene expression analysis. Functional genomic tools have proven to be powerful approaches for the downstream analysis. Summarizing the large space of gene expression values into a smaller number of biologically meaningful features not only helps to reduce experimental noise but can also help to identify the underlying mechanisms of human diseases. We have recently shown that gene sets comprising downstream signatures, which we consider as the footprint of either a pathway or transcription factor (TF) on gene expression, outperform classical knowledge-based gene sets (e.g. KEGG pathways). However, as most of these footprint signatures are curated for application in Homo sapiens, their usability in the model organism Mus musculus is uncertain. Evolutionary conservation of the gene regulatory system between M. musculus and H. sapiens suggests that those footprints, also referred to as transcriptomic perturbation signatures, are evolutionarily conserved as well. This implies that functional genomic tools which operate at the level of transcriptomic perturbation signatures can putatively be applied to mouse data. In this study we performed a comprehensive benchmark to test our hypothesis using two state-of-the-art, perturbation-signature-based functional genomic approaches: PROGENy and DoRothEA. While PROGENy estimates signaling pathway activity, DoRothEA is a resource that enables the estimation of TF activities by applying enrichment analysis methods, where manually curated TF regulons serve as underlying gene sets. We benchmark the performance of both tools on publicly available human and mouse single-gene and single-drug perturbation data using the area under the Receiver Operating Characteristic curve (AUROC) and the area under the Precision Recall curve (AUPRC) as performance measures.
Our results show that PROGENy is globally effective in inferring pathway activity on mice data, as the performance is comparable between mouse and human data. In the case of DoRothEA it could be shown that the regulons applied to mouse data perform like dedicated mouse TF regulons. The usability of PROGENy and DoRothEA is finally demonstrated by recovering known pathway/TF-disease associations exploiting human and mouse transcriptome disease signatures. Ongoing work is expanding PROGENy and DoRothEA for analysis of single-cell RNA data.
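The AUROC used as a benchmark measure above can be computed directly from its pairwise definition: the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one, with ties counted as one half. The sketch below is a generic illustration with made-up scores. It is not the benchmark code of this study, and the function name and data are hypothetical.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: predicted activity scores (higher means more likely perturbed).
    labels: 1 for truly perturbed samples, 0 otherwise.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up pathway activity scores for four experiments.
toy_scores = [0.9, 0.8, 0.35, 0.1]
toy_labels = [1, 0, 1, 0]
toy_auroc = auroc(toy_scores, toy_labels)  # 3 of 4 positive/negative pairs ranked correctly
```

The pairwise form is quadratic in the number of samples, which is fine for illustration. Rank-based implementations in standard libraries are preferable for large benchmarks.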

RSG-22: Inferring RNA Secondary Structures from Homologous RNA Sequences with Deep Neural Networks
Topic: Deep Learning Convolutional Neural Networks Recurr
  • Steffan Paul, Harvard University, United States
  • Peter Koo, Harvard University, United States
  • Sean Eddy, Harvard University, United States

Short Abstract: Stochastic context-free grammars are powerful computational approaches to model RNA secondary structure from RNA alignments. However, they are limited to finding nested base pairs, which excludes some biologically important non-nested base pairing interactions such as RNA pseudoknots. Here we explore the capability of various deep neural network architectures, including multilayer perceptrons (MLPs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs), which are not limited by such constraints, to learn RNA secondary structures from homologous sequences. Using first-order and second-order in silico mutagenesis analysis to interpret our models, we evaluate how well each network learns known base pair interactions from synthetic sequences with a known structure and in vivo RNA sequences from well-characterized RNA families. We find that MLPs are better suited to learn RNA secondary structures when given an RNA alignment, while CNNs and RNNs learn noisier structures but do not require any alignments.

RSG-23: Comparison of single and module-based methods for modeling gene regulatory networks
Topic: Gene regulatory networks Cancer Variational Bayes
  • Mikel Hernaez, University of Illinois at Urbana-Champaign, United States
  • Olivier Gevaert, Stanford University, United States

Short Abstract: Reverse engineering gene regulatory networks from gene expression data is still a major challenge in computational biology. Many initial methods generate these networks by creating a single graph, where each of the nodes is a gene, and an edge between two genes is drawn if their expression satisfies a given constraint. Once the graph is created, the next step is usually to look for "hub" genes or subnetworks of highly connected genes conveying some biological meaning. An alternative approach is to group genes into modules of co-expressed genes from the initial gene expression profiles and link modules through driver genes. Both single gene and module-based approaches have been used to successfully discover dependencies between known driver genes and tumor dependencies. However, no direct comparison has been made between the two approaches to determine whether module-based approaches outperform single gene approaches. In this work, we propose a novel and scalable algorithm drawing from the module-based methods and compare this approach with single gene networks. Specifically, we first find modules of co-regulated genes whose expression profile can be explained by a combination of a few driver genes. We analyze several methods for uncovering these modules, such as LASSO, thresholded linear regression, and Variational Bayes Spike Regression (VBSR). After the modules are built, we further refine them by creating bipartite graphs from each of them, yielding concise gene regulatory networks, again evaluating LASSO, thresholded linear regression, and VBSR. This process is bootstrapped 100 times to develop robust regulatory networks. First, we show that VBSR achieves significant improvement with respect to previously used methods for module creation on both simulated and TCGA data. Specifically, VBSR produces fewer false positives while maintaining the number of true positives and false negatives.
Once the modules are computed, we also show that LASSO is the better choice for this second task and outperforms VBSR. Thus, we show that generating modules of co-expressed genes which are predicted by a sparse set of regulators using VBSR, and then building a bipartite graph on the generated modules using LASSO, yields more informative networks—as measured by the rate of enriched elements and a thorough network topology assessment—than the single network approaches. Specifically, the proposed method produces networks closer to a scale-free topology, and the modules show up to 10x more enriched elements than when using single gene networks using TCGA-OV and TCGA-HNSC data.

RSG-24: Predictive Modeling of Signal Processing by Hematopoietic Gene Regulatory Networks
Topic: gene regulatory networks hematopoiesis cell differ
  • Joanna Handzlik, University of North Dakota, United States
  • Manu Manu, University of North Dakota, United States

Short Abstract: Cytokines and lineage-specific transcription factors (TFs) are critical molecular components involved in cell-fate choice during hematopoiesis. Networks of TFs are thought to play an instructive role in cell-fate commitment and differentiation, whereas cytokine signaling controls cell fate indirectly through proliferation, survival and maturation. Recent data suggest that cytokines also have an instructive role in cell-fate choice; however, how cytokine signal transduction and transcriptional networks interact to jointly instruct cell fate is not understood. Here we investigate how interactions between cell-extrinsic signals and cell-intrinsic GRNs give rise to context-dependent cell-fate decisions. In order to discover these interactions and how they dictate cell fate, we inferred Gene Regulatory Networks (GRNs), consisting of TFs, signaling effectors, and cytokine receptors, involved in the red-blood and myeloid cell-fate decision. The GRNs were inferred using ARACNe from a time series of genome-wide gene expression acquired during the differentiation of an IL3-dependent multipotent cell line, FDCP-mix, into erythrocytes or neutrophils. The time series data were then used to fit coupled differential equation models using simulated annealing to determine the type (activation or repression) and the strength of interaction between the genes. 100 executions of our GRN on 11 genes resulted in 8 models with accepted Root Mean Square (RMS). The gene circuits were able to recover known interactions between TFs and predict the results of genetic perturbation experiments. The model inferred more general patterns in the topology of the TF network, in which erythrocyte genes repress neutrophil genes and vice versa. These results highlight the importance of interactions between cytokine signaling and transcriptional networks for cell-fate choice and suggest that the process is governed by densely interconnected GRNs.

RSG-25: BindSpace: decoding transcription factor binding signals by large-scale joint embedding
Topic: Functional Genomics Machine Learning Transcription
  • Han Yuan, Memorial Sloan Kettering Cancer Center, United States
  • Meghana Kshirsagar, Memorial Sloan Kettering Cancer Center, United States
  • Lee Zamparo, Memorial Sloan Kettering Cancer Center, United States
  • Yuheng Lu, Memorial Sloan Kettering Cancer Center, United States
  • Christina Leslie, Memorial Sloan Kettering Cancer Center, United States

Short Abstract: Decoding transcription factor (TF) binding signals in genomic DNA is a fundamental problem. Here we present a prediction model called BindSpace that learns to embed DNA sequences and TF class/family labels into the same space. By training on binding data for hundreds of TFs and embedding over 1M DNA sequences, BindSpace achieves state-of-the-art multiclass binding prediction performance, in vitro and in vivo, and can distinguish signals of closely related TFs.

RSG-26: Single cell deconvolution reveals that the cell fractions vary across brain disorders and human aging
Topic: single cell transcriptome deconvolution cell fract
  • Xu Shi, Yale University, United States
  • Daifeng Wang, Stony Brook University, United States
  • Mark Gerstein, Yale University, United States

Short Abstract: Brain tissues are composed of a variety of neuronal and non-neuronal cell types. Gene expression changes observed at the tissue level may be due to changes in the proportions of cell types. However, it remains unknown how these changes in cell proportions can quantitatively contribute to the variation in tissue-level gene expression observed across a population of individuals. To address this question, we used two complementary strategies examining expression across our cohort of 1,866 individuals. First, we used standard pipelines to uniformly process single-cell RNA-seq data from PsychENCODE, in conjunction with other single-cell studies on the brain. Then we assembled profiles of brain cell types, including both excitatory and inhibitory neurons, major non-neuronal types (e.g., microglia and astrocyte), and additional cell types associated with development. Second, we used an unsupervised analysis to identify the primary components of bulk expression variation. We decomposed the bulk gene-expression matrix using non-negative matrix factorization, and determined whether the top components, capturing the majority of covariance, were consistently associated with the single cell signatures. This demonstrates that an unsupervised analysis derived solely from bulk data roughly recapitulates the single-cell signatures, partially corroborating them. We then examined how variation in proportions of basic cell types contributes to variation in bulk expression. To this end, we estimated the relative proportions of various cell types for each tissue sample (i.e., "cell fractions"). In particular, we deconvolved the bulk, tissue-level expression matrix using the single-cell signatures to estimate cell fractions across individuals. Overall, our analyses demonstrated that variation in cell types contributed significantly to bulk variation. 
That is, weighted combinations of single-cell signatures could account for most of the population-level expression variation, with an accuracy of ~89%. We identified cell-fraction changes associated with different traits. For example, the fractions of particular types of excitatory and inhibitory neurons differ between male and female samples. Finally, we observed an association with age. In particular, with increasing age the fractions of Ex3 and Ex4 significantly increased and those of some non-neuronal types decreased. These changes may be associated with differential expression of specific genes, e.g., Somatostatin (SST), known to be associated with aging and neurotransmission. This work is part of the PsychENCODE capstone project on a comprehensive functional genomic resource and integrative model for the human brain (Daifeng Wang, Shuang Liu, Jonathan Warrell, Hyejung Won, Xu Shi, Fabio Navarro, Declan Clarke, Mengting Gu, Prashant Emani,…, Daniel H. Geschwind, James A. Knowles, Mark Gerstein).
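
The deconvolution step described above (estimating cell fractions from a bulk profile and a matrix of single-cell signatures) can be illustrated with a small sketch. The abstract does not name the solver, so the non-negative least squares formulation, the projected-gradient loop, and the function name `estimate_cell_fractions` below are assumptions made for illustration only; the toy numbers are invented.

```python
def estimate_cell_fractions(signatures, bulk, n_iter=5000):
    """Estimate cell-type fractions for one bulk sample (illustrative sketch).

    signatures: list of rows, one per gene; each row holds the expression of
                that gene in every cell type (genes x cell types).
    bulk:       list with the bulk expression of the same genes.
    Returns non-negative weights rescaled to sum to 1.
    """
    n_genes = len(signatures)
    n_types = len(signatures[0])

    # Precompute S^T S and S^T b for the least-squares objective 0.5 * ||S w - b||^2.
    gram = [[sum(signatures[g][i] * signatures[g][j] for g in range(n_genes))
             for j in range(n_types)] for i in range(n_types)]
    rhs = [sum(signatures[g][i] * bulk[g] for g in range(n_genes))
           for i in range(n_types)]

    trace = sum(gram[i][i] for i in range(n_types))
    if trace == 0:
        raise ValueError("signature matrix is all zeros")
    # The trace bounds the largest eigenvalue of S^T S, so 1/trace is a safe step size.
    step = 1.0 / trace

    w = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        grad = [sum(gram[i][j] * w[j] for j in range(n_types)) - rhs[i]
                for i in range(n_types)]
        # Gradient step followed by projection onto the non-negative orthant.
        w = [max(0.0, w[i] - step * grad[i]) for i in range(n_types)]

    total = sum(w)
    if total == 0:
        raise ValueError("no cell type explains the bulk profile")
    return [x / total for x in w]


# Toy example: 4 genes, 2 hypothetical cell types, bulk mixed as 30% / 70%.
toy_signatures = [[10.0, 1.0], [1.0, 10.0], [5.0, 5.0], [2.0, 8.0]]
toy_bulk = [3.7, 7.3, 5.0, 6.2]
toy_fractions = estimate_cell_fractions(toy_signatures, toy_bulk)
```

Rescaling the weights to sum to one is a convenience for reading them as fractions; a dedicated solver would be preferable for real expression matrices.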

RSG-27: A unified genome-wide analysis of dysfunctional T cell states in cancer and chronic viral infection
Topic: T cell dysfunction immunology cancer transcription
  • Yuri Pritykin, Memorial Sloan Kettering Cancer Center, United States
  • Christina Leslie, Memorial Sloan Kettering Cancer Center, United States

Short Abstract: Tumor-specific T cells that have differentiated into a terminal dysfunctional state exist in the tumor microenvironment. A systematic understanding of the requirements for immunotherapeutic rescue of these cells is critically needed to improve clinical results in patients. Mouse models of chronic infection and cancer have been studied to elucidate the biological mechanisms by which persistent antigen stimulation results in T cell dysfunction, or “exhaustion”. Recently, chromatin accessibility imprinting has been associated with T cells falling back into the dysfunctional state after temporary rescue by checkpoint blockade, suggesting that epigenetic mechanisms control T cell dysfunction. However, a comprehensive characterization of T cell dysfunction across models based on their epigenetic and transcriptional profiles is lacking. We collected 106 chromatin accessibility (ATAC-seq) and 87 gene expression (RNA-seq) samples from recent publications. Using generalized linear modeling, we mapped profiles of chromatin accessibility peaks in gene promoters and enhancers from different studies into the same space. We observed that epigenetic profiles of dysfunctional tumor-infiltrating T cells and dysfunctional T cells in chronic viral infection were extremely similar. Contrary to prevailing belief, we observed across mouse models that T cells committed to becoming dysfunctional early after an immune challenge, rather than first mounting and then losing an effector response. Differentially expressed genes whose peaks showed massive differential accessibility during dysfunction development, consistently across models, are candidates for further targeted analysis; these include transcription factors (TFs) well studied in immunity such as Tcf7, Lef1, Satb1, Ikzf2, and Tox. We then associated absolute levels of chromatin accessibility in peaks of each sample with predicted TF binding using regularized negative binomial regression with cross-validation. 
Using TF coefficients, we mapped chromatin accessibility profiles into the TF activity space of much lower dimensionality, preserving relationships between samples. We identified key TFs whose binding was associated with open or closed chromatin in functional and dysfunctional cell states. Strikingly, the strongest association with closing chromatin in dysfunction, consistently across mouse models, was observed for Tcf7/Lef1 binding, further suggesting the role of these TFs in establishing the terminal T cell dysfunctional state. Our results suggest that the coordinated activity of a broad range of TFs (not necessarily binding at the same sites), rather than of one or a handful of TFs, is responsible for establishing and maintaining T cell functional states. This study provides a comprehensive global understanding of the common TF regulatory mechanisms governing T cell functional and dysfunctional states.

RSG-28: Revealing the Spatiotemporal Dynamics of Placental Group B Streptococcus Infection using scRNA-Seq and Spatial Transcriptomics
Topic: host-pathogen interaction Group B Streptococcus sp
  • Felicia Kuperwaser, NYU Langone Health, United States
  • Gal Avital, NYU Langone Health, United States
  • Tara Randis, NYU Langone Health, United States
  • Adam Ratner, NYU Langone Health, United States
  • Itai Yanai, NYU Langone Health, United States

Short Abstract: Felicia Kuperwaser(1,*), Gal Avital(1,*), Tara M. Randis(2,3), Adam J. Ratner(2,3), Itai Yanai(1). (1) Institute for Computational Medicine, NYU School of Medicine, New York, USA; (2) Department of Pediatrics, New York University School of Medicine, New York, USA; (3) Department of Microbiology, New York University School of Medicine, New York, USA. (*) These authors contributed equally to this work. Group B Streptococcus (GBS) is a major cause of infection during pregnancy and in the first weeks of the infant’s life. GBS asymptomatically colonizes about 25% of adults, but during pregnancy, ascending GBS infection can cause chorioamnionitis, which may subsequently induce premature birth or neonatal sepsis. In this context, the placenta acts as the initial interface between pathogen virulence and host defense mechanisms. Specifically, prior work has implicated placental macrophages in the host response to GBS, but the full extent of their role in this process has not been clearly defined. Understanding GBS-host cell interactions in the placenta can provide insight into GBS pathogenesis and its clinical manifestations, with implications for therapeutic intervention. To investigate this infection and the molecular changes that accompany it, we used an ascending model of GBS infection in pregnant mice coupled with transcriptomic approaches. We infected pregnant mice with GBS on day E13 of pregnancy and 48 hours later harvested infected placentas and processed them for both scRNA-Seq and spatial transcriptomics (ST) analysis. In the ST method, a tissue section is mounted on a spatially barcoded microarray and processed to generate spatial transcriptomic libraries, which provide a high-resolution transcriptomic map of the infected placenta. 
Analyzing the differential gene expression between cell types in the placental map, we found enrichment in specific spatial regions for processes involved in immune defense and response to GBS-specific mechanisms of pathogenesis. PCA and clustering analysis on the scRNA-Seq data revealed two distinct macrophage subpopulations, corresponding to M1 and M2 phenotypes, unique to the GBS-infected placenta, suggesting that the cellular landscape in the GBS-infected placenta is altered relative to that of a healthy placenta. Further studies promise to elucidate the effect of these differentially regulated macrophages on other placental cell populations, which may, in turn, affect normal placental function and contribute to the devastating clinical manifestations of GBS infection in pregnancy.

RSG-29: RADAR: Annotation and prioritization of variants in the post-transcriptional regulome of RNA-binding proteins
Topic: RNA binding proteins post-transcriptional regulati
  • Jing Zhang, Yale University, United States
  • Jason Liu, Yale University, United States
  • Donghoon Lee, Yale University, United States
  • Jo-Jo Feng, Yale University, United States
  • Lucas Lochovsky, Yale University, United States
  • Shaoke Lou, Yale University, United States
  • Michael Rutenberg-Schoenberg, Yale University, United States
  • Mark Gerstein, Yale University, United States

Short Abstract: RNA binding proteins (RBPs) have been reported to play essential roles in both co- and post-transcriptional regulation. RBPs bind to transcripts from thousands of genes and regulate them through multiple processes, including splicing, cleavage and polyadenylation, editing, localization, stability, and translation. Recently, scientists have made efforts to complete the post- and co-transcriptional regulome by synthesizing public RBP binding profiles, which has greatly expanded our understanding of RBP regulation. Since 2016, the Encyclopedia of DNA Elements (ENCODE) consortium has been releasing data from various types of assays on matched cell types to map the functional elements in the post-transcriptional regulome. For instance, ENCODE has released large-scale enhanced crosslinking and immunoprecipitation (eCLIP) experiments for hundreds of RBPs. This methodology provides high-quality RBP binding profiles with strict quality control and uniform peak calling to accurately catalog RBP binding sites at single-nucleotide resolution. Simultaneously, ENCODE performed expression quantification by RNA-Seq after knocking down various RBPs. Finally, ENCODE has quantitatively assessed the context and structural binding specificity of many RBPs by Bind-n-Seq experiments. In this study, we aimed to construct a comprehensive RBP regulome and a scoring framework to annotate and prioritize variants within it. We collected the full catalog of 318 eCLIP (for 112 RBPs), 76 Bind-n-Seq, and 472 RNA-Seq experiments after RBP knockdown from ENCODE to construct a comprehensive post-transcriptional regulome. By combining polymorphism data from large sequencing cohorts, such as the 1000 Genomes Project, we demonstrated that RBP binding sites show increased cross-population conservation in both coding and noncoding regions. This strongly indicates purifying selection on the RBP regulome. 
Furthermore, we developed a scoring scheme, named RADAR (RNA BinDing Protein regulome Annotation and pRioritization), to investigate variant impact in such regions. RADAR first combines RBP binding, cross-species and cross-population conservation, network, and motif features with polymorphism data to quantify variant impact described by a universal score. Then, it allows tissue- or disease-specific inputs, such as patient expression, somatic mutation profiles, and gene rank list, to further highlight relevant variants. By applying RADAR to both somatic and germline variants from disease genomes, we demonstrate that it can pinpoint disease-associated variants missed by other methods. In summary, RADAR provides an effective approach to analyze genetic variants in the RBP regulome, and can be leveraged to expand our understanding of post-transcriptional regulation. To this end, we have implemented the RADAR annotation and prioritization scheme into a software package for community use (radar.gersteinlab.org).

RSG-30: Decoding chromatin regulation through data mining and chromatin proteomics
Topic: epigenetics histone code proteomics histone PTMs e
  • Nhuong Nguyen, Friedrich Miescher Institute, Switzerland
  • Saulius Lukauskas, Helmholtz Zentrum München, Germany
  • Peter Faull, The Francis Crick Institute, United Kingdom
  • Helen Flynn, The Francis Crick Institute, United Kingdom
  • Bram Snijders, The Francis Crick Institute, United Kingdom
  • Peter DiMaggio, Imperial College London, United Kingdom
  • Till Bartke, Helmholtz Zentrum München, Germany

Short Abstract: DNA and histone modifications play central roles in the control of gene expression. These modifications form an epigenetic ‘code’ that stores information within chromatin. Despite progress in understanding the targeting of single modifications by epigenetic readers, the regulatory mechanisms decoding combinatorial chromatin modifications are still largely unknown. In an attempt to ‘break’ the epigenetic code, we have developed streamlined and sensitive SILAC nucleosome affinity purification (SNAP) protocols that allow us to isolate nucleosome-binding proteins and identify factors that recognise histone and DNA modification signatures using quantitative mass spectrometry. To recreate defined chromatin modification states, we have used native chemical ligation to generate 55 nucleosomes containing differentially modified histones H3 and H4 that carry various acetylation and methylation marks on lysines in their N-terminal tails. By combining, e.g., H3K4me1 or H3K4me3 with either H3K27me3, the histone variant H2A.Z, or varying degrees of lysine acetylation, we can mimic ‘poised’ or ‘active’ enhancer and promoter states. Combinations of H3K9me3, H3K27me3, H4K20me2/3 and CpG-methylation enable us to generate various forms of heterochromatin. In total we have identified around 2000 factors, linked to processes such as transcription, replication, or DNA repair, and quantified their binding to the different nucleosomes. Computational analyses of our dataset have allowed us to establish an interaction network of such chromatin-associated factors, elucidating both known chromatin complexes and novel interactions based on the similarity of their binding profiles. Further application of factor analysis methods has allowed us to establish the key regulatory drivers of this epigenetic network formation that describe the constituents of the nucleosome code.

RSG-31: CONNECTOR: fitting and clustering analysis of biological growth data
Topic: cancer growth mathematical models intratumor heter
  • Francesca Cordero, Department of Computer Science, University of Torino, Italy
  • Jessica Giordano, Department of Computer Science, University of Torino, Italy
  • Simone Pernice, Department of Computer Science, University of Torino, Italy
  • Roberta Sirovich, Department of Mathematics, University of Torino, Italy
  • Maddalena Arigoni, Molecular Biotechnology Center / University of Torino, Italy
  • Jessica Erriquez, Candiolo Cancer Institute-FPO, IRCCS Candiolo, Torino, Italy, Italy
  • Marco Beccuti, Università degli studi di Torino, Italy
  • Martina Olivero, Candiolo Cancer Institute-FPO, IRCCS Candiolo, Torino, Italy, Italy
  • Maria Flavia Di Renzo, Candiolo Cancer Institute-FPO, IRCCS Candiolo, Torino, Italy, Italy
  • Raffaele Calogero, University of Torino, Italy

Short Abstract: A broad class of mathematical models can be used to fit growth data. Despite many similarities, the long-established models may differ significantly, in particular at the extreme, often unsampled, time points. In modeling tumour growth data, there is no biological or clinical evidence that can drive the choice of the model that best fits the data. We envisioned that proper fitting and subsequent clustering of fitted growth curves can be applied to identify common patterns of growth and to study the underlying cell population dynamics. We developed CONNECTOR, an R package that fits and clusters growth data. To fit the data, the Functional Clustering Model (FCM) is implemented, and an in-depth study of its parameter configuration is performed. In addition, three classical growth models (Malthus, Gompertz, and logistic) are available to fit the data, and an unsupervised clustering algorithm is applied. CONNECTOR provides separation and tightness measures to compare the results. To the best of our knowledge, CONNECTOR is the only methodology able to analyze cancer growth data with the functionalities described above. To assess the usefulness and improvement offered by CONNECTOR, we applied it to a set of 24 patient-derived xenografts (PDXs) propagated from a single ovarian tumor sample. Individual PDXs grew differently, which led us to infer that each PDX contains a mix of cell populations with distinct genotypic and phenotypic characteristics that mimics intratumor heterogeneity. To estimate in an unbiased manner the possibly heterogeneous behavior of the PDXs, we applied CONNECTOR. Based on the clustering performance, the FCM proved to be the best method to fit and cluster the data. The optimal number of clusters was four, and these clusters were verified from a biological point of view. 
We have selected 10/24 PDX growth curves representative of the four clusters for molecular analysis. These PDX samples have been analyzed using whole exome sequencing (WES) and SNP arrays for copy number aberrations (CNA), to extrapolate cell population composition. To analyze WES and CNA we developed a tailored computational pipeline that estimates the subpopulation composition of the PDX samples analyzed. We identified prevailing subpopulations associated with each cluster, with the remarkable expansion of a specific one in the faster growing PDXs. These results demonstrate the correlation between the genetic make-up and the growth patterns highlighted in the clustering obtained with CONNECTOR.
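
To make the point about model divergence at unsampled time points concrete, the three classical growth models named above can be written as closed-form curves. This is a minimal sketch, not CONNECTOR's API (CONNECTOR is an R package); the parameterizations are common textbook forms, and the parameter names `v0`, `a`, `b`, and `k` and the numeric values are assumptions chosen for illustration.

```python
import math


def malthus(t, v0, a):
    """Exponential (Malthusian) growth: V(t) = v0 * exp(a * t)."""
    return v0 * math.exp(a * t)


def gompertz(t, v0, a, b):
    """Gompertz growth: V(t) = v0 * exp((a / b) * (1 - exp(-b * t))).

    a is the initial growth rate and b its exponential decay rate, so the
    curve saturates at v0 * exp(a / b).
    """
    return v0 * math.exp((a / b) * (1.0 - math.exp(-b * t)))


def logistic(t, v0, a, k):
    """Logistic growth with carrying capacity k:
    V(t) = k * v0 / (v0 + (k - v0) * exp(-a * t)).
    """
    return k * v0 / (v0 + (k - v0) * math.exp(-a * t))


# Hypothetical parameters: the three curves start from the same volume and
# stay close at early times, then separate at later time points.
for day in (0, 5, 20, 60):
    print(day,
          round(malthus(day, 2.0, 0.1), 2),
          round(gompertz(day, 2.0, 0.1, 0.02), 2),
          round(logistic(day, 2.0, 0.1, 300.0), 2))
```

With these illustrative values the Malthus curve is unbounded, while the Gompertz and logistic curves level off at different plateaus, which is why the choice of model matters most where no measurements exist.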

RSG-32: CNet: detecting clinically associated, combinatory genomic signatures
Topic: CNet genomic signature network association multi-t
  • Peilin Jia, University of Texas Health Science Center at Houston, United States
  • Guangsheng Pei, University of Texas Health Science Center at Houston, United States
  • Zhongming Zhao, University of Texas Health Science Center at Houston, United States

Short Abstract: Motivation: Genome-wide multi-omics profiling of complex diseases provides valuable resources to discover associations between various measures of genes and diseases. Currently, a pressing challenge is how to effectively detect functional genes associated with or causing phenotypic outcomes. Methods: We developed CNet to identify groups of genomic signatures whose combinatory effect is significantly associated with clinical and phenotypical outcomes. CNet builds on a sequential feedforward method to search for groups of signatures, augmented by a down-sampling bootstrap strategy to reduce random hitchhiking signatures. To incorporate heterogeneous multi-omics profiling, CNet searches across the signature space and selects the best profiling data to represent a specific gene. In addition, we apply a dynamic trimming procedure to remove relatively less informative signatures at every step. Thus, the modules from CNet represent not only the best combination of genes, but also the optimized signatures per gene. Finally, to integrate the various forms of clinical and phenotypical measurements in biological data, we introduced four models to deal with different data types: the glm model for continuous phenotypes, the FET or chisq model for categorical phenotypes, and the km model for survival data. Results: We tested CNet in different scenarios using drug-response data, cancer genomics data, and genome-wide association study data for multiple traits. Using the drug-response data for 24 compounds in ~500 cell lines, CNet identified drug-sensitive modules with the original target genes and their interactors or partners in the same pathways. For several MEK inhibitors (AZD6244, Erlotinib, and PD-0325901), CNet was able to identify module genes from the MEK pathway such as BRAF, NRAS, and RAF1. Using the multi-omics cancer genomics data, CNet effectively identified somatic mutations or copy number variations that were associated with cancer subtypes. 
In addition, we applied CNet to detect disease-causing links involving the trio: signature -> pathway activity -> survival outcome. For example, in adrenocortical carcinoma, samples with mutant TP53 showed a significantly increased activity of the Fanconi Anemia pathway (p = 2.07×10⁻³) and a decreased survival rate (p = 0.01), while this pathway was significantly associated with patient survival outcome (p = 2.26×10⁻¹⁰). Finally, we applied CNet to discover shared genetic variants in multiple traits, e.g., a module including HDAC4 and PTPRG was found in six mental disorders but not in other immune or metabolic phenotypes. Conclusion: CNet presents a useful method to detect disease association and has a wide range of applications. CNet is available at https://github.com/bsml320/CNet.

RSG-33: A framework for supervised enhancer prediction with epigenetic pattern recognition and targeted validation across organisms
Topic: Enhancer prediction Matched Filter Epigenetics Mac
  • Mengting Gu, Yale University, United States
  • Anurag Sethi, Yale University, United States
  • Emrah Gumusgoz, Yale University, United States
  • Joel Rozowsky, Yale University, United States
  • Kevin Yip, The Chinese University of Hong Kong, Hong Kong
  • Richard Sutton, Yale University, United States
  • Mark Gerstein, Yale University, United States

Short Abstract: Enhancers are important noncoding elements, but they have been traditionally hard to characterize experimentally. Only a few mammalian enhancers have been validated, making it difficult to train statistical models for their identification properly. Instead, postulated patterns of genomic features have been used heuristically for identification. The development of massively parallel assays allows for the characterization of large numbers of enhancers for the first time. Here, we developed a framework that uses Drosophila STARR-seq data to create shape-matching filters based on enhancer-associated meta-profiles of epigenetic features. We combined these features with supervised machine learning algorithms to predict enhancers. We demonstrated that our model could be applied to predict enhancers in mammalian species (i.e., mouse and human). We comprehensively validated the predictions using a combination of in vivo and in vitro approaches, including transgenic assays in mouse and transduction-based reporter assays in human cell lines. Overall, the validations involved 153 enhancers in 6 mouse tissues and 4 human cell lines. The results confirmed that our model can accurately predict enhancers in different species without re-parameterization. Finally, we examined the transcription-factor binding patterns at predicted enhancers and promoters in human cell lines. We demonstrated that these patterns enable the construction of a secondary model effectively discriminating between enhancers and promoters.

RSG-35: Regulatory network analyses to identify transcription factor regulators of disease biomarkers
Topic: Biomarker Discovery Transcriptional Regulation gen
  • Mehmet Eren Ahsen, Icahn School of Medicine at Mount Sinai, United States
  • Alexandar Grishen, Icahn School of Medicine at Mount Sinai, United States
  • Yoojin Chun, Icahn School of Medicine at Mount Sinai, United States
  • Galina Grishina, Icahn School of Medicine at Mount Sinai, United States
  • Gustavo Stolovitzky, IBM, United States
  • Gaurav Pandey, Icahn School of Medicine at Mount Sinai, United States
  • Supinda Bunyanovich, Icahn School of Medicine at Mount Sinai, United States

Short Abstract: With rapid advances in genomic technology, several multi-gene expression-based predictive/discriminative biomarkers have been identified for diseases such as breast cancer, stroke and Alzheimer’s disease. Although some of these biomarkers are already used in clinical practice, such as MammaPrint and Oncotype DX for breast cancer prognosis, their biological interpretation beyond examination of their individual constituent genes or enriched Gene Ontology terms/pathway(s) is not commonly undertaken. Here, we describe NetTFactor, a novel method that combines network analyses with RNAseq data to identify transcription factors (TFs) significantly regulating such biomarkers. The identified TFs may shed further light on the underlying biology, and also serve as endo-phenotypes of the target disease. NetTFactor first computationally infers a context-specific gene regulatory network (GRN) from the RNAseq data. It then applies statistical enrichment methods to the structure and components of this GRN to rank potential TFs in terms of their disease activity and likelihood of regulating the biomarker. Finally, NetTFactor uses an innovative LASSO-based optimization approach to determine the minimal set of TFs that most significantly and exclusively regulate the genes in the biomarker. In a proof of concept study, we applied NetTFactor to identify the most significant TF regulators of an accurate nasal brush-based biomarker of asthma that we previously identified. This biomarker, which is based on the expression profiles of 90 genes interpreted through a Logistic Regression function, performed with strong predictive value and sensitivity across eight test sets. The application of NetTFactor to this asthma biomarker and the associated RNAseq data identified ETV4 and PPARG as its most significant TF regulators. 
siRNA-based knockdown of ETV4 and of PPARG in an airway epithelial cell line model each produced a significant reduction in the expression of cytokines relevant to asthma, supporting NetTFactor’s findings. While PPARG has been previously associated with airway inflammation, a phenotype associated with asthma, ETV4 has not yet been implicated in asthma, thus indicating the possibility of novel, disease-relevant discovery by NetTFactor. The application of this novel method to other multi-gene expression-based biomarkers could yield valuable insights into disease-relevant regulatory mechanisms and biological processes, allowing us to gain more from biomarkers beyond their main role as classifiers or predictors.

RSG-36: Small molecules for human embryonic stem cells differentiation
Topic: Human embryonic stem cells Definitive endoderm Sma
  • German Novakovsky, UBC, CMMT, Canada
  • Wyeth Wasserman, UBC , BCCHR, CMMT, Canada
  • Sara Mostafavi, UBC, BCCHR, CMMT, Canada
  • Artem Cherkasov, UBC, Vancouver Prostate centre, Canada
  • Francis Lynn, UBC, CMMT, Canada
  • Paul Pavlidis, UBC, MSL, Canada
  • Nathaniel Lim, UBC, MSL, Canada

Short Abstract: Improving methods for differentiating human embryonic stem cells (hESC) represents a significant challenge in modern regenerative medicine. The current protocols depend upon the use of recombinant proteins and growth factors that are expensive and difficult to administer in therapeutic scenarios. Thus, small molecules that can effectively modulate stem cell signaling pathways would represent a very attractive option. Based on our ongoing research on the differentiation of insulin-producing pancreatic beta cells, we here propose to develop a chem- and bioinformatics pipeline capable of discovering small molecules that mimic the effect of differentiation factor proteins. Specifically, we will develop a small-molecule stem cell modulator that can replace Activin A, a key growth factor regulating early formation of Definitive Endoderm (DE) [1]. Our bioinformatics procedure traverses pathway and expression data resources. First, a set of genes responsive to Activin A is identified from expression profiling experiments. Using this gene set, the Connectivity Map (CMap, [2]) and its updated version, L1000 [3], are queried to identify small molecules that potentially induce transcriptomic profiles consistent with DE differentiation. We identified 12 small molecules for consideration, most of which are mTOR and PI3K pathway inhibitors, consistent with the literature on the roles of these pathways in DE differentiation [4,5]. Second, gene set enrichment analysis (GSEA, [6]) was performed to identify those small molecules from CMap/L1000 whose expression profiles are enriched for DE differentiation-related pathways. This approach identified 13 potentially relevant drugs. Third, we built a binary machine learning classifier that predicts positive or negative expression of DE biomarkers based on the chemical structures of small molecules. Using this model, we predicted potential DE inducers among 1280 drugs from the small-molecule Prestwick library. 
In conclusion, the proposed approach will represent the first reported predictive method for the identification of small-molecule candidates potentially capable of replacing biologics in hESC differentiation protocols. References: 1. Loh, KM et al. Cell Stem Cell. 2014; 14:237–252. 2. Lamb, J et al. Science. 2006; 313:1929–1935. 3. Subramanian, A et al. Cell. 2017; 171:1437–1452. 4. Zhou, J et al. Proc Natl Acad Sci USA. 2009; 106(19):7840–7845. 5. McLean, AB et al. Stem Cells. 2007; 25:29–38. 6. Subramanian, A et al. Proc Natl Acad Sci USA. 2005; 102(43):15545–15550.

RSG-37: GSEA-InContext: Leveraging biological context to extract and prioritize pathway alterations in transcriptomics experiments
Topic: Gene set enrichment analysis GSEA Transcriptomics
  • Rani Powers, University of Colorado Anschutz Medical Campus, United States
  • Harrison Pielke-Lombardo, University of Colorado Anschutz Medical Campus, United States
  • Andrew Goodspeed, University of Colorado Anschutz Medical Campus, United States
  • Aik-Choon Tan, University of Colorado Anschutz Medical Campus, United States
  • James Costello, University of Colorado Anschutz Medical Campus, United States

Short Abstract: Gene Set Enrichment Analysis (GSEA) is one of the most widely used methods in bioinformatics to analyze and interpret coordinate, pathway-level changes in transcriptomics experiments. For an experiment where fewer than seven samples per condition are being compared, the GSEA Preranked algorithm generates a gene set enrichment score representing the degree to which genes in a given pathway are overrepresented at either end of a user-supplied ranked list. The enrichment score is tested for significance using a null distribution of enrichment scores generated from permuted gene sets, wherein genes are randomly selected from the input experiment. Looking across a variety of biological conditions, however, we found that genes are not randomly distributed, with many showing consistent patterns of up- or down-regulation across tissue types and experimental perturbations. We also found that these trends caused a common set of pathways to be enriched under a variety of conditions. For example, we used GSEA Preranked to identify enriched gene sets in hundreds of experiments and found that gene sets related to DNA replication and hypoxia pathways were regularly overrepresented, regardless of the cell type or experimental condition. Based on these observations, we developed a complementary method to GSEA Preranked called GSEA-InContext, which addresses the question: which pathways are uniquely enriched in a single experiment compared to many other, independent experiments? In addition to the ranked list of genes typically provided as input to GSEA Preranked, GSEA-InContext also takes as input a user-defined set of “background experiments” and uses the context-specific patterns from the background set to inform the statistical testing procedure. Specifically, our algorithm generates permuted gene lists based on each gene’s rank distribution estimated from the set of background experiments. 
In doing so, GSEA-InContext tests for gene sets that are significantly more or less enriched in the user’s experiment compared to the other experiments selected for comparison. Although users can provide their own background experiments, we offer a collection of over 1,000 human microarray and RNA-seq experiments that a user can compare their results against. For example, evaluating the GSEA-InContext algorithm on several experiments involving small molecule drugs with known targets, we show that it successfully prioritized gene expression changes that are more likely due to direct drug effects. Through a variety of additional, novel applications, we show that GSEA-InContext offers valuable experimental and biological insight that would be missed by applying the canonical GSEA Preranked method alone.
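
For readers unfamiliar with the baseline that GSEA-InContext modifies, the following sketch shows a running-sum enrichment score and a gene-set permutation null in the spirit of GSEA Preranked. It is an illustrative simplification, not the authors' code: the function names, the weighting exponent `p`, the same-sign p-value convention, and the toy data are assumptions. It does not implement the background-experiment permutation that GSEA-InContext introduces.

```python
import random


def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Weighted running-sum enrichment score over a ranked gene list.

    ranked_genes: gene names sorted from most up- to most down-regulated.
    scores:       ranking metric for each gene, in the same order.
    gene_set:     set of gene names belonging to the pathway.
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_hit = sum(in_set)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the list without covering it")

    weights = [abs(s) ** p for s in scores]
    hit_total = sum(w for w, hit in zip(weights, in_set) if hit)
    if hit_total == 0:
        # All hits have a zero score: fall back to unweighted steps.
        weights = [1.0] * len(scores)
        hit_total = float(n_hit)

    running, best = 0.0, 0.0
    for w, hit in zip(weights, in_set):
        running += w / hit_total if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, scores, gene_set, n_perm=1000, seed=0):
    """Empirical p-value from random gene sets of the same size (the null the
    abstract attributes to GSEA Preranked)."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, scores, gene_set)
    size = sum(1 for g in ranked_genes if g in gene_set)
    null = [enrichment_score(ranked_genes, scores, set(rng.sample(ranked_genes, size)))
            for _ in range(n_perm)]
    # Compare only against null scores with the same sign as the observed score.
    same_sign = [x for x in null if (x >= 0) == (observed >= 0)]
    extreme = sum(1 for x in same_sign if abs(x) >= abs(observed))
    return (extreme + 1) / (len(same_sign) + 1)


# Toy ranked list with invented gene names and scores.
toy_genes = ["a", "b", "c", "d", "e", "f"]
toy_scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
toy_es = enrichment_score(toy_genes, toy_scores, {"a", "b"})
```

Under this framing, the change described in the abstract is confined to how the null is built: permuted gene lists are drawn from each gene's rank distribution across background experiments instead of sampling genes uniformly.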

RSG-38: An Optimal Combination of Computational Systems Biology Predictions: the MOCA algorithm
Topic: Unsupervised Ensemble Classification
  • Robert Vogel, IBM, United States
  • Mehmet Eren Ahsen, Icahn School of Medicine at Mount Sinai, United States
  • Gustavo Stolovitzky, IBM, United States

Short Abstract: Choosing the best method to apply to specific computational biology problems can be challenging. Even though benchmarks of method performance are readily found in some application domains, there are emerging fields at the forefront of translational biomedicine for which data sets are small and where well-benchmarked methods are lacking. To address this challenge in the context of classification problems, we developed the Method for Optimal Classification by Aggregation (MOCA), an unsupervised ensemble method for binary classification. MOCA performs a weighted sum over the rank predictions of an ensemble of base classifiers of unknown performance. We prove that, under the hypothesis of conditionally independent base classifiers, the optimal weight of the i-th base classifier is proportional to the ratio of a function of its area under the receiver operating characteristic curve (AUROC) to the class-conditioned variance of its predictions. These weights are optimal in the sense that, under the assumption of conditionally independent base classifiers, they maximize the information content of the ensemble. We developed a strategy for estimating the MOCA weights in the absence of labeled training data. The AUROC (used in the numerator of the MOCA weight) of the base classifiers can be computed from the second moment of the rank covariance matrix without knowledge of the labels. This is remarkable, as it provides an estimate of the performance of the base classifiers in the absence of knowledge of the solution. The class-conditioned variance of predictions (used in the denominator of the MOCA weights) can be estimated from the third central moment of the predictions without knowledge of the solution. 
We demonstrate the performance of the MOCA ensemble for transfer learning problems, where there is a small amount of labeled data that is insufficient for training a canonical model, and in application to predictions submitted to a variety of collaborative scientific DREAM challenges. Specifically, we apply the MOCA ensemble to the classification of skin lesion images as benign nevus or malignant melanoma when training data are insufficient for learning the weights in deep neural networks. Additionally, using simulations, DREAM predictions of BCL6 transcription factor targets, and the DREAM Prostate Cancer Challenge (predicting whether a patient will be required to discontinue a specific chemotherapy), we show that MOCA performs robustly when labeled data are unavailable.
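
The aggregation step described in this abstract can be illustrated with a minimal sketch. It covers only the weighted sum over rank predictions and assumes the weights are already known; the weight values, scores, and function names below are made up for illustration, and the label-free weight estimation described in the abstract is not reproduced here.

```python
def rank_scores(scores):
    """Return ranks (1 = highest score); ties are broken by input order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def weighted_rank_ensemble(score_table, weights):
    """Combine base classifiers by a weighted sum of their rank predictions.

    score_table: one list of sample scores per base classifier.
    weights: one weight per base classifier (assumed given, not estimated).
    Returns one ensemble value per sample; a lower value is a stronger
    positive-class call because rank 1 is the top prediction.
    """
    n_samples = len(score_table[0])
    ensemble = [0.0] * n_samples
    for scores, weight in zip(score_table, weights):
        for i, rank in enumerate(rank_scores(scores)):
            ensemble[i] += weight * rank
    return ensemble


# Three hypothetical base classifiers scoring four samples.
scores = [
    [0.9, 0.2, 0.6, 0.1],
    [0.8, 0.3, 0.7, 0.4],
    [0.1, 0.9, 0.5, 0.3],
]
weights = [1.0, 0.8, 0.1]  # illustrative values only
combined = weighted_rank_ensemble(scores, weights)
top_sample = min(range(len(combined)), key=lambda i: combined[i])
```

Because the third classifier receives a small weight, its disagreement with the other two barely changes the ordering of the samples.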

RSG-39: Learning genetic determinants of plant epigenome by convolutional neural network
Topic: Histone modifications Convolutional neural network
  • Ngoc Tu Le, Okinawa Institute of Science and Technology, Japan
  • Hidetoshi Saze, Okinawa Institute of Science and Technology, Japan

Short Abstract: Background: Genomes of many plant species, such as Arabidopsis thaliana and Oryza sativa, are non-randomly organized into separate functional domains marked with distinct patterns of modification on DNA sequence and histone proteins. Although these modifications play critical roles in regulating biological processes that act on a DNA substrate, such as transcription and replication, as well as in maintaining genome stability, the mechanism instructing their deposition at the proper genomic targets remains elusive. Methods and Results: In this study we developed a convolutional neural network model to dissect the contribution of genomic sequence to shaping the epigenomic landscape in the plant model Arabidopsis thaliana. Our results show that DNA sequence contributes significantly to determining the targets of various epigenetic modifications in Arabidopsis. Analysis of the trained model reveals the binding signatures of plant transcription factors, suggesting their functional importance in the process. Strikingly, our model correctly predicted the impact of single-nucleotide mutations on epigenetic marks at some specific loci that have already been experimentally validated. Conclusions: Genomic sequence plays an important role in shaping the epigenomic landscape in Arabidopsis. Moreover, our method illustrates the efficiency of a deep learning-based approach for understanding the interaction between genetic and epigenetic players in plant species.

RSG-40: LIPLIKE: LInear Profile LIKElihood parameter identification for inference of high gene regulatory networks from ‘omics data
Topic: Network inference Precision Linear network Profile
  • Rasmus Magnusson, Linköping University, Sweden
  • Andreas Tjärnberg, Linköping University, Sweden
  • Mika Gustafsson, Linköping University, Sweden

Short Abstract: The reverse engineering of gene regulatory networks from data has for years attracted the interest of several research fields. Indeed, in the big data era, algorithms to infer network structures from data such as mRNA expression have been used in a wide array of applications. Most attempts at inferring gene regulation, however, focus on predicting a full network, where false positive predictions of interactions are accepted to achieve a high recall. We hypothesized that accepting high rates of false edge identification is treacherous, with type I errors potentially bringing high risks to applications such as predictive medicine. To counter the low accuracy of regulatory network inference, we propose our new method LiPLike (LInear Profile LIKElihood). The method is inspired by the field of mechanistic modeling of biological systems, where uncertainties in parameter estimates are often assessed by calculating the profile likelihood. In mechanistic modeling the aim is to estimate a confidence interval of values that a parameter can take while retaining model fit to data. In LiPLike, we estimate the divergence in the objective function for certain values of parameters, and thereby obtain a confidence score of edge importance. We observed that this ranking separates edges into two distinct categories: high-confidence and no-confidence edge identifications. Our method can be illustrated by the following example: Consider a system of two transcription factors (TFs), A and B, that can bind to a target gene. If B is highly correlated with a third TF, C, many network identification tools would pick either all three TFs (as in the case of Elastic Net), or A&B or A&C almost at random (as in the case of the LASSO). Our proposed algorithm would return only A as an identified regulator. We tested our models on in silico generated data from GeneSPIDER with different information contents, as well as on the DREAM5 challenge in silico and E. coli data sets. Indeed, in our evaluation of the LiPLike method, we observed higher accuracy in ranking edges than most (in silico data) or all (E. coli data) contestants. These findings prompted us to construct a Python package, which is freely available to the community at https://gitlab.com/rasma87/liplike.
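
The A, B, C example in this abstract can be mimicked with a small sketch. This is not the LiPLike implementation: it assumes a plain least-squares objective and scores each candidate regulator by how much the residual sum of squares grows when that regulator is removed and the remaining ones are refitted. All function names, noise levels, and data are hypothetical.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares of the least-squares fit of y on the columns."""
    k = len(columns)
    gram = [[sum(u * v for u, v in zip(columns[i], columns[j]))
             for j in range(k)] for i in range(k)]
    rhs = [sum(u * v for u, v in zip(columns[i], y)) for i in range(k)]
    beta = solve(gram, rhs)
    fitted = [sum(beta[j] * columns[j][t] for j in range(k))
              for t in range(len(y))]
    return sum((yt - ft) ** 2 for yt, ft in zip(y, fitted))


def edge_scores(regulators, y):
    """Ratio of the RSS without each regulator to the RSS of the full model."""
    full = rss(list(regulators.values()), y)
    scores = {}
    for name in regulators:
        rest = [col for other, col in regulators.items() if other != name]
        scores[name] = rss(rest, y) / full
    return scores


rng = random.Random(0)
n = 200
a = [rng.gauss(0, 1) for _ in range(n)]
b = [rng.gauss(0, 1) for _ in range(n)]
c = [bi + rng.gauss(0, 0.05) for bi in b]  # C is almost a copy of B
y = [ai + bi + rng.gauss(0, 0.1) for ai, bi in zip(a, b)]  # target driven by A and B
scores = edge_scores({"A": a, "B": b, "C": c}, y)
```

In this toy setting, removing A cannot be compensated by the other regulators, so its score is far above 1. Removing B is largely compensated by C, so B and C both score close to 1 and only A stands out.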

RSG-41: ChIPanalyser: A Mechanistic Approach to Transcription Factor Binding prediction in Drosophila
Topic: Transcription Factor Binding BioInformatics Biophy
  • Patrick Martin, University of Essex, United Kingdom
  • Nicolae Radu Zabet, University of Essex, United Kingdom

Short Abstract: At the heart of gene regulation are Transcription Factors (TFs), proteins that bind to DNA in a sequence-specific manner and trigger the activation or repression of gene expression. TFs usually have a high number of binding sites spread throughout the genome but, in many cases, only a small subset of these sites is bound in each cell type. The mechanisms by which TFs bind selectively to only a subset of their binding sites are still unclear. We developed ChIPanalyser, a Bioconductor package that mechanistically models TF binding to DNA using a statistical thermodynamics framework. The model assumes that TF binding can be explained by four main factors: binding energy, DNA accessibility, the number of TF molecules bound to DNA and a scaling factor of TF specificity (weighting how well TFs distinguish between specific and unspecific binding sites). Predicting TF binding is technically challenging, and the methods chosen to assess the quality of the predicted binding profiles play an important role in understanding the true quality of a model. We tested several evaluation methods and grouped them into two classes: (i) similarity methods (e.g. correlation or AUROC), which are suitable for assessing predictions of ChIP peaks, and (ii) dissimilarity methods (e.g. MSE or K-S distance), which are suitable for assessing predictions of the intensity of the ChIP signal. Next, we focused our analysis on three TFs in Drosophila that also act as insulators: CTCF, Su(Hw) and BEAF-32. Our results show that the three TFs exhibit different behaviours. CTCF showed a high preference for accessible DNA, and its binding preference is strongly affected by the number of bound molecules and the specificity to the DNA. CTCF also displayed preferential binding towards hotspots (binding at several specific loci), which is consistent with previous reports.
Secondly, BEAF-32 also bound preferentially to accessible DNA, but the number of molecules and the specificity affected the observed binding profiles to a lesser degree. Finally, Su(Hw) not only preferred binding in seemingly inaccessible DNA, but its binding profile was also more sensitive to the number of bound molecules and specificity in inaccessible DNA. These findings are once again consistent with the large body of work showing the role of Su(Hw) in suppressing open chromatin. To conclude, ChIPanalyser predicts both location and enrichment with a considerable level of accuracy and sheds light on the mechanisms of TF binding.

RSG-42: Predicting cancer drug response in silico for targeting tumor heterogeneity
Topic: tumor heterogeneity single-cell analysis cancer dr
  • Chayaporn Suphavilai, Department of Computer Science, National University of Singapore & Genome Institute of Singapore, Singapore
  • Ankur Sharma, Cancer Therapeutics and Stratified Oncology, Genome Institute of Singapore, Singapore
  • Lorna Tu, Computational and Systems Biology, Genome Institute of Singapore, Singapore
  • Shumei Chia, Cancer Therapeutics and Stratified Oncology, Genome Institute of Singapore, Singapore
  • Ramanuj Dasgupta, Cancer Therapeutics and Stratified Oncology, Genome Institute of Singapore, Singapore
  • Niranjan Nagarajan, Computational and Systems Biology, Genome Institute of Singapore, Singapore

Short Abstract: Tumor heterogeneity is well recognized as an important factor in defining treatment response and clinical outcomes in diverse cancer types [1]. While molecular profiling data has been used to predict drug response in silico, current methods have focused on capturing inter-patient heterogeneity but not intra-tumor heterogeneity. The increased availability of single-cell omics approaches [2] opens up the possibility that tumor heterogeneity can be accounted for in computational models for drug response. Several challenges still exist for doing this, including the ability to make calibrated predictions for drug response across cell types, and the ability to accurately predict drug response in unseen cell types. Based on observations that gene expression biomarkers for drug response can be tissue specific, we address the issue of calibrated response prediction across cell types. Additionally, we have developed a novel extension to a recommender system framework [3] that accurately models bias terms for unseen cell types. We applied this new method, CaDRReS-Sc, to explore heterogeneity in drug response in head and neck cancer. Based on single-cell RNA-seq data we noted substantial intra-tumor and tumor/metastasis heterogeneity across patients. These differences were also reflected in in silico drug response predictions that, when aggregated for a tumor, correlated better with in vitro drug response. The ability to predict intra-patient drug response heterogeneity has important applications for combating drug resistance and metastasis, with ongoing efforts focused on confirming CaDRReS-Sc’s utility for identifying complementary drug combinations. References: 1. Dagogo-Jack, I., & Shaw, A. T. (2018). Tumour heterogeneity and resistance to cancer therapies. Nature Reviews Clinical Oncology, 15(2), 81. 2. Macaulay, I. C., Ponting, C. P., & Voet, T. (2017). Single-cell multiomics: multiple measurements from single cells. Trends in Genetics, 33(2), 155-168. 3. Suphavilai, C., Bertrand, D., & Nagarajan, N. (2018). Predicting Cancer Drug Response using a Recommender System. Bioinformatics, bty452.

RSG-43: Hypoxic adaptation of Mycobacterium tuberculosis: An in silico Investigation of cross-talk between mycobacterial metabolism, gene-regulation and host-pathogen interactions
Topic: Mycobacterium tuberculosis infection Hypoxia Host-
  • Anirban Dutta, TCS-Research, Tata Consultancy Services Ltd., India
  • Tungadri Bose, TCS-Research, Tata Consultancy Services Ltd., India
  • Chandrani Das, TCS-Research, Tata Consultancy Services Ltd., India
  • Sharmila Mande, TCS-Research, Tata Consultancy Services Ltd., India

Short Abstract: Mycobacterium tuberculosis causes persistent infections in humans and is associated with extended periods of latency. The progression of infection is an outcome of several complex interactions between the host and the pathogen. During this process, M. tuberculosis cells also undergo several physiological and metabolic changes. While a number of previous studies have focused on inspecting individual facets of this complex process, the presented research aims at understanding the coordination among different regulatory and metabolic pathways in the pathogen. The presentation encompasses results from our earlier published research (PMID:30053801) in this direction and some extended analyses. To gain deeper insights into the infection process, in silico approaches were adopted in the presented work to investigate three different but connected aspects, namely, (i) host-pathogen interactions (HPIs) between human and M. tuberculosis proteins, (ii) the gene regulatory network controlling hypoxic adaptation of M. tuberculosis and (iii) alterations in M. tuberculosis metabolism under hypoxic conditions. Cross-talk between these components has been probed to identify the gene-regulatory events as well as the HPIs that are likely to drive metabolic changes during the pathogen’s adaptation to the intra-host hypoxic environment. Results include a few newly predicted HPIs, which probably help the pathogen to subvert reactive oxygen/nitrogen intermediate (ROI/RNI) stress inside the host, as well as take part in modulating the host cell cycle and cytoskeleton structure. On a more intriguing note, results also indicate a significantly pronounced effect of HPIs on the hypoxic metabolism of M. tuberculosis.
Shortest-path-based analyses, considering several previously published gene expression profiles pertaining to Mycobacterium infection models, suggest that the HPI network is more intimately connected to hypoxic metabolic adaptation even when compared to the known hypoxic gene regulatory network. Insights from the current study underscore the need to investigate the infection process from a systems-level perspective incorporating multiple facets of the pathogen's survival inside the host cell. In addition, the comprehensive host-pathogen interaction network, the Boolean model of M. tuberculosis (H37Rv) hypoxic gene regulation, and the genome-scale metabolic model of M. tuberculosis, which were reconstructed for this study, are expected to be useful resources for future research on tuberculosis infection.

RSG-44: DEPICTIVE: A strategy for the quantitative discovery of sources of cell-to-cell variability
Topic: Single-cell measurements Variability analysis Bina
  • Pablo Meyer, IBM, United States
  • Robert Vogel, IBM, United States
  • Gustavo Stolovitzky, IBM, United States
  • Luís Santos, Icahn School of Medicine at Mount Sinai, United States
  • Jerry Chipuk, Icahn School of Medicine at Mount Sinai, United States
  • Marc Birtwistle, Clemson University, United States

Short Abstract: Single cell measurements have shown that populations of cells are intrinsically diverse in their bio-molecular compositions, state, and responsiveness to environmental conditions. Surprisingly, genetic variability is not necessary for establishing population diversity. In fact, non-genetic sources of cell-to-cell variability (ngCCV) are a manifestation of the physical properties of the biochemical processes of cells, and consequently represent a general property of life at the single cell level. Of particular interest to the biomedical community is how this ngCCV contributes to pathway regulation and disease. To date, a quantitative framework that specifically attributes population diversity to the observed variability in bio-molecular components is lacking. To this end, we developed a method for DEtermining Parameter Influence on Cell-to-cell variability Through the Inference of Variance Explained, DEPICTIVE for short. Using single cell measurements, DEPICTIVE computes the contribution of each bio-molecular observable to the binary response being studied. We validated our method with both simulated data and experimental measurements of TRAIL-induced apoptosis of Jurkat cells. Our method uncovered mitochondrial abundance as a novel source of ngCCV that tunes the sensitivity of individual cells to TRAIL. Indeed, ngCCV that manifests as diverse sensitivities to therapeutic intervention is an important consideration for precision medicine.

RSG-45: Calculation of half-life using nascent and mature RNAseq identifies that anti-diabetic rosiglitazone regulates mRNA dynamics to regulate PPARg target
Topic: mRNA half life nascent RNA mature RNA rosiglitazon
  • Nha Nguyen, University of Pennsylvania, United States
  • Kyoung Jae Won, University of Copenhagen, Denmark

Short Abstract: We often see that the expression of some drug target genes is not regulated by the drug. By comparing nascent transcripts from global run-on sequencing (GROseq) data with gene expression data from RNAseq, we found a number of genes whose transcription levels increased after rosiglitazone (rosi) treatment while their expression levels did not change. To explain this, we systematically calculated mRNA half-lives. Rosi has been suggested as a remedy as it showed the ability to prevent as well as reverse type 2 diabetes. Rosi is an antidiabetic agent that functions primarily by increasing insulin sensitivity. Even though its notable side effects, including weight gain, edema and potential cardiovascular mortality, have limited its clinical use, it is still used in ambulatory diabetes visits. A better understanding of the mechanism behind rosi treatment will allow the development of potential therapeutic targets for type 2 diabetes. To understand the transcriptional as well as post-transcriptional regulation by rosi, we sequenced nascent RNAs and total RNAs in murine adipocytes before and after rosi treatment (0 min, 10 min, 30 min, 1 h and 3 h after rosi treatment). We calculated mRNA half-life using the transfer function obtained from the Fourier transformation of the nascent as well as mature transcripts. We found that the half-lives of mRNAs are much shorter than previously calculated mRNA half-lives. Furthermore, by calculating the mRNA half-life across time, we found that rosi regulates mRNA half-lives. Many of the genes with a decreased half-life were targets of PPARg. In particular, these genes were associated with lipolysis. We found an RNA-binding protein that regulates mRNA stability upon rosi treatment. Knocking down this protein confirmed the suggested mechanism. Our results use post-transcriptional mechanisms to explain why some drug targets do not respond well. They also potentially explain weight gain, one of the side effects of rosi.
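
The link between a transfer function and a half-life can be made concrete with a short worked derivation. It assumes a first-order model that the abstract does not state: mature mRNA $m(t)$ is produced in proportion to the nascent signal $n(t)$ with a hypothetical gain $\alpha$ and degraded at a constant rate $k$. The authors' actual model may differ.

```latex
% Assumed first-order model (not taken from the abstract):
\frac{dm(t)}{dt} = \alpha\, n(t) - k\, m(t)

% Fourier transform of both sides, with M(\omega) and N(\omega)
% the transforms of m(t) and n(t):
i\omega\, M(\omega) = \alpha\, N(\omega) - k\, M(\omega)

% Transfer function from nascent to mature signal:
H(\omega) = \frac{M(\omega)}{N(\omega)} = \frac{\alpha}{k + i\omega},
\qquad
|H(\omega)|^{2} = \frac{\alpha^{2}}{k^{2} + \omega^{2}}

% Half-life implied by the degradation rate:
t_{1/2} = \frac{\ln 2}{k}
```

Under this assumption, fitting $H(\omega)$ to the ratio of the transformed mature and nascent time courses yields $k$, and a shorter half-life corresponds to a larger $k$.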

RSG-46: High resolution analysis of the selection on local mRNA folding strength in protein-coding sequences across the tree of life
Topic: mRNA Secondary Structure Gene Expression Regulatio
  • Michael Peeri, Department of Biomedical Engineering, Tel-Aviv University, Israel
  • Tamir Tuller, Department of Biomedical Engineering, Tel-Aviv University, Israel

Short Abstract: The strength of local mRNA folding can be increased or decreased at different regions of the CDS (coding sequence) through the choice of synonymous codons. This modulation of local folding strength affects the interaction with the ribosome and is thought to influence many additional aspects of gene expression, including translation initiation, translation elongation, co-translational protein folding, mRNA aggregation and the transcription rate. We performed a genome-wide computational study of local mRNA folding strengths in 515 genomes across the tree of life, comparing folding strengths in native CDSs to a null model which maintains amino acid content, GC-content, and codon distributions. Consequently, most of the deviation in folding strength between native sequences and those sampled from the null model should be attributed to selection acting on mRNA secondary structure. We showed this selection changes direction along the CDS following a characteristic profile present in many bacterial (79%) and archaeal (48%) species, but not in most eukaryotes (15%). This profile, with selection for weak folding within 50nt of the CDS edges, selection for strong folding maintained throughout the rest of the CDS and a peak of strong folding centered 100nt downstream of the start codon, hints at disparate requirements for mRNA folding at different stages of gene expression and specifically during translation. Then, based on a phylogenetic tree of the analyzed organisms, we performed regression analysis while controlling for the expected similarities between traits due to evolutionary relationships, to detect significant interactions with genomic and environmental traits and shed light on the underlying biophysical and evolutionary mechanisms. 
Much of the observed inter-specific variation in selection for local folding can be explained independently by genomic GC-content or codon bias, which is non-trivial as the null model maintains the GC-content and codon bias of the coding sequences. In bacteria, these correlations have opposite directions comparing the CDS edges and inner CDS, again indicating differences between translation initiation and elongation. In eukaryotes, the large variety in measured levels of selection on folding is limited to high-GC species but the relation is non-monotonic and was detected based on the maximal information coefficient. Growth rate, however, only has a linear relation with the selection on folding strength in specific groups of organisms. These results should advance our understanding of the effect of mRNA secondary structure on gene expression and fitness in any species, and thus promote developing novel modeling and engineering approaches for controlling gene expression.
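
One way to build a null model of the kind described above is to permute synonymous codons within a coding sequence. The sketch below assumes the standard genetic code and is not necessarily the authors' procedure: it shuffles codons only among positions that encode the same amino acid, which keeps the protein sequence, the codon counts, and therefore the GC-content unchanged. The function names and the example sequence are hypothetical.

```python
import random
from collections import Counter, defaultdict

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def split_codons(cds):
    """Split a coding sequence into consecutive codons."""
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def translate(cds):
    """Translate a coding sequence into its amino acid string."""
    return "".join(CODON_TABLE[codon] for codon in split_codons(cds))


def shuffle_synonymous(cds, rng):
    """Permute codons among positions that encode the same amino acid."""
    codons = split_codons(cds)
    positions = defaultdict(list)
    for pos, codon in enumerate(codons):
        positions[CODON_TABLE[codon]].append(pos)
    shuffled = codons[:]
    for pos_list in positions.values():
        pool = [codons[p] for p in pos_list]
        rng.shuffle(pool)
        for p, codon in zip(pos_list, pool):
            shuffled[p] = codon
    return "".join(shuffled)


# Hypothetical toy CDS: M A A A A L L L L K K stop
native = "ATGGCTGCCGCAGCGCTGTTATTGCTCAAAAAGTAA"
randomized = shuffle_synonymous(native, random.Random(1))
```

Folding energies of many such randomized sequences could then be compared with the native sequence using any RNA folding tool; that step is outside this sketch.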

RSG-47: Laplace approximation for inferring causal directed acyclic structures in gene regulatory networks
Topic: Markov chain Monte Carlo maximum likelihood causal
  • Andrea Rau, INRA UMR 1313 GABI, Jouy-en-Josas, France
  • Flaminia Zane, IBPS, Sorbonne University, Paris, France
  • Gilles Monneret, LPSM, CNRS 8001, Sorbonne University, Paris, France
  • Pascal Fieth, University of Oldenburg, Germany
  • Alexander Hartmann, Institut für Physik, Universität Oldenburg, Germany
  • Florence Jaffrézic, INRA UMR 1313 GABI, Jouy-en-Josas, France
  • Gregory Nuel, LPSM, CNRS 8001, Sorbonne University, Paris, France

Short Abstract: Inferring the underlying causal relationships between genes from expression data is a task of critical importance in systems biology. In the particular case where a mixture of observational and intervention experiments (e.g., single or multiple knock-out or knock-down experiments) is available, only a few methods currently exist. The first, called Intervention calculus when the Directed graph is Absent (IDA; Maathuis et al., 2009), provides causal bounds for direct and indirect effects once a skeleton graph has been estimated using the PC algorithm (pcalg R package). The second approach instead relies on the notion of a causal ordering of genes, whose posterior distribution is inferred from the data using a probabilistic generative model that allows for single and multiple interventions. This can be conveniently done within an MCMC simulation (Rau et al., 2013), in particular for the case where the posterior distribution is efficiently approximated (Hartmann and Nuel, 2017). In the present work, we extend this second approach by introducing two novelties: 1) we use a Laplace approximation to obtain a fast approximation of the integrated likelihood of the model; and 2) we use parallel tempering combined with the classic MC3 algorithm (Barker et al., 2010) to efficiently explore the directed acyclic graph (DAG) space. This new approach proves to be both faster and more reliable than the previous approach based on causal node orderings. It also has the advantage of providing a collection of DAG structures that can be aggregated to provide robust estimates for each dataset. We also introduce a simple mixture model over the DAG space to help represent this posterior distribution. Finally, we illustrate the method on both simulated and real datasets, where it shows promising results. References: Barker, Hill, and Mukherjee (2010) MC4: a tempering algorithm for large-sample network inference. Pattern Recognition in Bioinformatics, 431-442.
Hartmann and Nuel (2017) Using triplet ordering preferences for estimating causal effects in the analysis of gene expression data. PLOS ONE 12(1): e0170514. Maathuis, Kalisch, Bühlmann, and others (2009) Estimating high-dimensional intervention effects from observational data. The Annals of Statistics, 37(6A):3133-3164. Rau, Jaffrézic, and Nuel (2013) Joint estimation of causal effects from observational and intervention gene expression data. BMC Systems Biology, 7(1):111.
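
For orientation, the two ingredients named in this abstract have standard textbook forms, shown below. The notation is an assumption introduced here, not taken from the abstract: $G$ is a DAG, $D$ the data, $\theta$ the $d$-dimensional model parameters, $\hat\theta$ the posterior mode, $\pi(G)$ the unnormalized target over DAGs, and $\beta_i$, $\beta_j$ the inverse temperatures of two chains. The authors' exact formulation may differ.

```latex
% Laplace approximation of the integrated likelihood of a DAG G:
\log p(D \mid G)
  = \log \int p(D \mid \theta, G)\, p(\theta \mid G)\, d\theta
  \approx \log p(D \mid \hat\theta, G) + \log p(\hat\theta \mid G)
    + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log\det H(\hat\theta)

% where H is the negative Hessian of the log joint density at the mode:
H(\hat\theta) = -\nabla_\theta^{2}
  \Big[\log p(D \mid \theta, G) + \log p(\theta \mid G)\Big]_{\theta = \hat\theta}

% Parallel tempering: probability of accepting a swap of the states
% G_i and G_j held by chains with inverse temperatures \beta_i and \beta_j:
A_{ij} = \min\!\left(1,\;
  \left[\frac{\pi(G_j)}{\pi(G_i)}\right]^{\beta_i - \beta_j}\right)
```

The approximation replaces an integral over parameters with a mode search and a determinant, which is what makes scoring many candidate DAGs inside a sampler affordable.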

RSG-48: Predicting therapeutic targets for cardiovascular disease using logical modeling
Topic: Logical modeling Prior knowledge network Cardiovas
  • Amel Bekkar, Swiss Institute of Bioinformatics SIB and University of Lausanne UNIL, Switzerland
  • Julien Dorier, Swiss Institute of Bioinformatics SIB, Switzerland
  • Isaac Crespo, Swiss Institute of Bioinformatics SIB, Switzerland
  • Cristina Casal, Swiss Institute of Bioinformatics SIB, Switzerland
  • Anne Estreicher, Swiss Institute of Bioinformatics SIB, Switzerland
  • Anne Niknejad, Swiss Institute of Bioinformatics SIB, Switzerland
  • Alan Bridge, Swiss Institute of Bioinformatics SIB, Switzerland
  • Ioannis Xenarios, University of Lausanne UNIL, Switzerland

Short Abstract: Cardiovascular diseases (CVDs) are the leading cause of mortality and morbidity in Europe and worldwide. They are multifactorial, chronic and complex pathologies that cannot be described or explained by a reductionist view. In order to tackle these complex diseases we have developed a logical modeling framework that consists of three steps. The first step is an expert curation of the body of literature into what we call a Prior Knowledge Network (PKN). The PKN is assembled from existing knowledge and experimental evidence to include the components relevant to CVDs as well as the relationships between them (inhibition or activation). In contrast to databases that register facts and summarize them, we have encoded the logical rules of regulation, enabling the use of the PKN for modeling and simulation. The second step simulates the cellular decision process and identifies the phenotypes attained by the regulatory networks. As the PKN is large (729 nodes, 3406 logical rules), a manual optimization would be time consuming and its simulation computationally demanding. This requires the use of approaches such as Optimusqual, an optimization method that uses a genetic algorithm to find the sub-graph of the PKN that reproduces as well as possible a training set built from experimental data such as gene expression data. In the final step we simulate several in silico perturbations, either known or unknown in the field of CVD. This allows us, on the one hand, to evaluate the pertinence of our model and, on the other hand, to make predictions and generate new testable hypotheses about driver nodes able to switch the network from the disease state to the healthy state, and ultimately to find new therapeutic targets.
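
The simulation and perturbation steps described above can be illustrated on a toy Boolean network. This is a minimal sketch under stated assumptions: the three nodes and their rules are invented for illustration and are not part of the PKN, updates are synchronous, and a perturbation is modeled by clamping a node to a fixed value.

```python
def step(state, rules, fixed=None):
    """Apply all logical rules synchronously, then clamp any fixed nodes."""
    new_state = {node: rule(state) for node, rule in rules.items()}
    if fixed:
        new_state.update(fixed)
    return new_state


def find_attractor(initial, rules, fixed=None):
    """Iterate from an initial state until a state repeats; return the cycle."""
    state = dict(initial)
    if fixed:
        state.update(fixed)
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state, rules, fixed)
    return seen[seen.index(state):]


# Hypothetical toy network: stress activates a kinase, which
# activates a disease marker. Node names are illustrative only.
RULES = {
    "stress": lambda s: s["stress"],
    "kinase": lambda s: s["stress"],
    "disease_marker": lambda s: s["kinase"],
}
START = {"stress": True, "kinase": False, "disease_marker": False}

disease = find_attractor(START, RULES)
knockout = find_attractor(START, RULES, fixed={"kinase": False})
```

In this toy case the unperturbed network settles into a fixed point with the disease marker on, while clamping the kinase off keeps the marker off, which is the sense in which a node can act as a driver of the switch between states.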

RSG-49: Using Deep Learning on Histopathology Images to Predict the Presence of BRAF Driver Mutation
Topic: Melanoma BRAF Machine Learning Histopathology
  • Randie Kim, New York University, United States
  • Sofia Nomikou, New York University, United States
  • Zarmeena Dawood, New York University, United States
  • Nicolas Coudray, New York University, United States
  • George Jour, New York University, United States
  • Una Moran, New York University, United States
  • Jeffrey S. Weber, New York University, United States
  • Narges Razavian, New York University, United States
  • Richard Shapiro, New York University, United States
  • Russell Berman, New York University, United States
  • Iman Osman, New York University, United States
  • Aristotelis Tsirigos, New York University, United States

Short Abstract: Precision medicine in melanoma relies on determining the mutational status through DNA molecular assays, which are time-consuming and costly. Here, we developed a fully automated prediction model using the Inception v3 deep convolutional neural network. The network was trained on formalin-fixed paraffin-embedded histopathology images of primary resected melanomas from 182 patients and validated on images from 41 patients. The automated prediction model first annotates the image by selecting tumor-rich areas and subsequently predicts the presence of a BRAF mutation in the selected tumor region. The network discriminated melanomas from non-tumor tissue with an Area Under the Curve (AUC) of 0.98 on an independent test set comprising images from 43 patients. For melanomas thicker than 1.5 mm, an AUC of 0.85 was achieved for predicting mutated BRAF on the independent test set. Identification of actionable mutations is an integral part of targeted cancer treatments. Deep learning for the prediction of driver mutations on histopathology images of melanomas, particularly thicker lesions (> 1.5 mm) harboring a BRAF mutation, can potentially decrease costs associated with ancillary testing and, more importantly, reduce lag time for the initiation of appropriate therapies. Thus, our model can potentially be integrated into melanoma counseling and treatment at any stage of disease.

RSG-50: Genomic Concordance in Omic Disease Modules: key factor for understanding Complex diseases
Topic: Complex Disease Disease Modules ModifieR transcrip
  • Tejaswi Badam, University of Skövde, Sweden
  • Mika Gustafsson, Linköping University, Sweden
  • Zelmina Lubovac, University of Skövde, Sweden
  • Hendrik de Weerd, University of Skövde, Sweden

Short Abstract: A complex disease is a result of perturbations in intracellular and intercellular omic networks rather than a consequence of an abnormality in a single gene. This complexity needs to be addressed for a better understanding of the disease mechanisms. It is well known that genes and proteins involved in the same mechanism, such as a disease, show a high degree of modularity. Network-based approaches exploit this property to identify a disease and its related components in the vicinity of the interactome as disease modules. Identification of disease modules and their linked pathways allows us to explore molecular relationships across different omics. So far, the majority of tools in this area have focused on using topological information of the interactome to predict disease modules, which considers only network properties but not biological or omic changes in the disease. Therefore, we hypothesized that using one omic data type as the weight input of the perturbation, along with the topological information of the interactome, leads to the identification of the disease module shared across the other omics. We propose a method named MODifieR (Module IdentifieR) for constructing disease modules by simultaneous integration of the transcriptome, SNPome and methylome. MODifieR is an R package that combines 10 established methods (8 public and 2 in-house) for predicting disease modules. MODifieR consists of seed-based, clique-based and co-expression-based methods for module identification. We applied a wisdom-of-crowds approach to identify the gold standard methods for each disease and the standard disease module shared across the omics by constructing a consensus module. The disease module is validated for genome-wide significance using the Pathway scoring algorithm (PASCAL), which allows pathway enrichment scoring based on GWAS data.
We have applied our approach across 22 different complex diseases and 10 different cell types to benchmark omic disease modules. For example, in the case of multiple sclerosis, we found a significant overlap of genes among the transcriptome, SNPome and methylome disease modules, with a p-value of less than 1e-05, which in turn suggests that there is genomic concordance that becomes evident using the disease module approach. The MODifieR R package is available at https://gitlab.com/Gustafsson-lab/MODifieR.
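
The consensus step mentioned in this abstract can be sketched as a simple vote across module-detection methods. This is an assumption about how a consensus might be formed, not MODifieR's actual procedure: a gene is kept if at least a chosen number of methods include it. The method names, gene labels, and threshold below are hypothetical.

```python
from collections import Counter


def consensus_module(modules, min_support):
    """Return genes reported by at least min_support methods.

    modules: mapping from method name to an iterable of gene identifiers.
    """
    counts = Counter(
        gene for genes in modules.values() for gene in set(genes)
    )
    return {gene for gene, count in counts.items() if count >= min_support}


# Hypothetical output of three module-detection methods.
modules = {
    "seed_based": ["G1", "G2", "G3"],
    "clique_based": ["G2", "G3", "G4"],
    "coexpression_based": ["G3", "G5"],
}
consensus = consensus_module(modules, min_support=2)
```

Raising the threshold trades module size for agreement: with three methods, a threshold of 3 would keep only genes that every method reports.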

RSG-51: Hierarchical regression model to detect dependencies in single cell RNA-seq and multi-omics read count data
Topic: single cell RNA-seq hierarchical model
  • Jukka Intosalmi, Aalto University, Finland
  • Henrik Mannerstrom, Aalto University, Finland
  • Saara Hiltunen, Aalto University, Finland
  • Harri Lähdesmäki, Aalto University, Finland

Short Abstract: Modern single cell sequencing technologies have made it possible to measure the RNA (scRNA-seq) and epigenomic (e.g. scBS-seq, scATAC-seq) content of individual cells. These data provide detailed information about cellular states, but despite several pioneering efforts, it remains an open research question how regulatory networks can be inferred from these noisy, discrete read count data. We have recently introduced a hierarchical regression model designed for detecting dependencies in scRNA-seq and other count data [1]. We model count data using a Poisson log-normal mixture distribution and, by means of our hierarchical formulation, detect dependencies between genes using a linear regression model for the latent, cell-specific gene expression rate parameters. The hierarchical formulation allows us to model count data directly, without artificial data transformations, and makes it possible to incorporate diverse normalization information into the latent layer of the model. We have also extended the hierarchical model to detect dependencies in scBS-seq and scRNA-seq data for building single cell models of epigenetic gene regulation mechanisms. Importantly, our model accounts for uncertainty in both input and output variables and can be extended in many ways due to its modular design. We have implemented the model in the PyStan inference framework, which uses Stan's probabilistic programming language for Bayesian modeling. This allows for seamless integration with other data analysis tools in the Python ecosystem and for easy implementation and evaluation of model extensions. We evaluate the proposed approach using both simulated and experimental data, including scRNA-seq and scBS-seq measured from the same cells (scM&T-seq). Our results show that the proposed approach performs better than standard regression techniques in both the parameter inference task and the variable selection task. 
An implementation of our method is made available at https://github.com/jeintos/SCHiRM.
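
To make the generative idea concrete, the following Python sketch simulates read counts for one regulator gene and one target gene under a Poisson log-normal model in which the target's latent log-rate depends linearly on the regulator's latent log-rate. It is not the SCHiRM implementation: the parameter values, the uniform cell-specific size factor and the function names are assumptions chosen for illustration, and no inference is performed.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw a Poisson variate with Knuth's multiplication method.

    Adequate for the small rates used in this sketch.
    """
    threshold = math.exp(-rate)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def simulate_counts(n_cells, intercept, slope, noise_sd, seed=0):
    """Simulate (regulator, target) read counts for `n_cells` cells."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_cells):
        # Hypothetical cell-specific normalization (size) factor.
        size = rng.uniform(0.5, 1.5)
        # Latent log expression rate of the regulator (log-normal layer).
        log_rate_x = rng.gauss(1.0, 0.5)
        # Linear regression on the latent layer, with Gaussian noise.
        log_rate_y = intercept + slope * log_rate_x + rng.gauss(0.0, noise_sd)
        # Observed counts are Poisson given the scaled latent rates.
        x = sample_poisson(size * math.exp(log_rate_x), rng)
        y = sample_poisson(size * math.exp(log_rate_y), rng)
        pairs.append((x, y))
    return pairs


counts = simulate_counts(n_cells=200, intercept=0.2, slope=0.8, noise_sd=0.3)
```

Because the regression acts on the latent rates rather than on the counts, the observed counts never need to be log-transformed, which is the property the abstract highlights.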

RSG-52: Revealing the hallmarks of host-pathogen interactions across distinct bacterial infections using scDual-Seq
Topic: single-cell RNA-Seq host-pathogen interaction bact
  • Gal Avital, New York University, United States
  • Felicia Kuperwaser, New York University, United States
  • Itai Yanai, New York University, United States

Short Abstract: During an intracellular bacterial infection, the host cell and the infecting pathogen interact and affect each other through a progressing series of events that may result in myriad distinct outcomes. To study this system, we developed the scDual-Seq method and used it to simultaneously measure the transcriptomes of individual mouse bone marrow-derived macrophages infected with Salmonella typhimurium over time. We identified subpopulations of Salmonella across infected macrophages which led us to propose a model in which these subpopulations represent distinct stages throughout the infection process. To understand the specific strategies invoked across pathogens and the pathways by which our immune system thwarts these attacks, we sought to identify the unique and the core host and pathogen interactions that occur during infection. We thus compared in molecular detail the pathways induced across infection by eight diverse bacterial species that comprise many of the main human pathogens: Staphylococcus aureus, Listeria monocytogenes, Enterococcus faecalis, Group B Streptococcus, Pseudomonas aeruginosa, Yersinia pseudotuberculosis, Shigella flexneri and Salmonella enterica. We infected primary human macrophages from different donors with each species and used scDual-Seq to generate a comprehensive dataset of gene expression profiles of both the host and the pathogen during infection. Examining the expression profiles of the infected macrophages across the pathogens, we discovered a universal dynamic pattern with two main phases throughout infection: corresponding to coherent early and late modules with a sharp transcriptional transition between these. 
Comparing these modules across the pathogens, we found that the early module captures intracellular activity such as lysosomal and metabolic processes, whereas the late module is functionally enriched for signaling pathways and secretion (chemokines and cytokines, interferon gamma, type I interferon, etc.). Our work defines the hallmarks of host-pathogen interactions by identifying recurring properties of infection, which can inform diagnostics and help time treatments more effectively.

RSG-53: Predicting and correcting the sequence effects of DNA tags on massively parallel reporter assays
Topic: massively parallel reporter assay tag sequence bia
  • Dongwon Lee, New York University, United States
  • Ashish Kapoor, University of Texas Health Science Center at Houston, United States
  • Changhee Lee, Harvard University, United States
  • Michael Mudgett, Johns Hopkins University, United States
  • Michael Beer, Johns Hopkins University, United States
  • Aravinda Chakravarti, New York University, United States

Short Abstract: Massively parallel reporter assays (MPRAs) are becoming a standard method for rapidly evaluating the in vitro activities of cis-regulatory elements (CREs) at large scale. However, a major concern is that sequence differences between DNA tags may affect reporter activities; prior studies show reproducible effects of tag sequences on reporter readout. The tactic of averaging multiple tags per CRE does not remove the root problem, because the biases are not random and tags are not uniformly distributed. Therefore, we developed a sequence-based computational framework, MPRA Tag Sequence Analysis (MTSA), for correcting potential tag sequence effects on reporter expression. Using four public MPRA data sets (Melnikov et al. 2012; Kheradpour et al. 2013; Ulirsch et al. 2016; Inoue et al. 2017), we first show that the relative tag expression for each CRE can be predicted from its tag DNA sequence. Support vector regression (SVR) models using gapped k-mers as features show high correlations between observed and SVR-predicted relative expression (r = 0.4 to 0.7) with 5-fold cross-validation. These results strongly imply that a significant fraction (up to 50%) of the expression differences between tags corresponding to a specific CRE is due to DNA sequence differences. We also show that two rounds of training consistently improve the correlations by ~0.05 in all cases: the first round of training and normalization reduces CRE-wide bias (from tags that consistently increase or decrease expression), helping to build more accurate models in subsequent training. Most importantly, MTSA normalization provides enhanced statistical power for identifying causal CRE variants by reducing tag-to-tag expression variation. 
Applying our MTSA normalization to an MPRA data set designed to find human regulatory variants affecting red blood cell traits (Ulirsch et al. 2016), we discovered that variants that became statistically insignificant after MTSA normalization (n=28 of 61) were predicted by deltaSVM to have much smaller impacts on CRE activities than the remainder (n=33 of 61) that stayed significant (Mann-Whitney U test, P=0.001). Lastly, we compared learned sequence feature weights for the four MPRA data sets to demonstrate that vector design, and not cellular context, was a major factor contributing to tag sequence effects (average r=0.42 between the same designs vs. r=-0.02 between different designs). MTSA is, to the best of our knowledge, the first method for predicting and correcting tag effects on reporter activity using their DNA sequence. We expect MTSA to help improve the design and interpretation of MPRAs.
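
As an illustration of the gapped k-mer features mentioned above, the Python sketch below counts, for every window of a fixed word length in a tag sequence, all patterns that keep a chosen number of informative positions and mask the rest. This is one common definition of gapped k-mers and is assumed here for illustration; the word length, the number of informative positions and the use of "N" as the gap symbol are hypothetical choices, and the regression step itself is not shown.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(sequence, word_length, informative):
    """Count gapped k-mers: `informative` kept positions per `word_length` window.

    Masked (gap) positions are written as "N".
    """
    counts = Counter()
    for start in range(len(sequence) - word_length + 1):
        word = sequence[start:start + word_length]
        # Enumerate every choice of informative positions within the window.
        for positions in combinations(range(word_length), informative):
            masked = "".join(
                word[i] if i in positions else "N" for i in range(word_length)
            )
            counts[masked] += 1
    return counts


features = gapped_kmer_counts("ACGT", word_length=3, informative=2)
```

The resulting count vector could serve as the input to a regression model, with one feature per gapped pattern.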

RSG-54: Combining Network Theory and Machine Learning to Identify a Diagnostic and Prognostic Biosignature for Glioblastoma
Topic: network medicine glioblastoma glioma biomarker dis
  • Mackenzie Hastings, InSyBio, United States
  • Konstantinos Theofilatos, InSyBio, United States
  • Seferina Mavroudi, Technological Educational Institute of Western Greece, Greece

Short Abstract: Background. Glioblastoma multiforme (GBM), the most common and aggressive primary adult brain tumor, has a median survival of less than 15 months. This malignant tumor is currently diagnosed through the conventional method of tissue analysis, with MRI used to monitor its universal recurrence [1]. However, these tests are invasive and unreliable, leading recent research to pursue molecular diagnostic and prognostic biomarkers. Yet most existing approaches search for single biomarkers, and only a limited number of approaches exist to identify combined biosignatures for diagnostic purposes [2]. Methods. In the current manuscript we mined from the literature the available transcriptomics datasets related to glioblastoma tissue, using the Gene Expression Omnibus and TCGA data repositories, creating a dataset of 288 samples. This collection of transcriptomics data was integrated and analyzed with a combination of statistical techniques (differential expression analysis), correlation network construction and comparison techniques [3], and a hybrid combination of multi-objective optimization algorithms with SVM [4] to identify the final biomarkers and train predictive models. The predictive analytics method was expanded using the classifier chains approach to allow for GBM diagnosis, prediction of glioma subtypes (astrocytoma, oligodendroglioma, glioblastoma and non-tumor) and prediction of the GBM genetic profile (IDH1, K27, G34 and wild type) using a single set of biomarkers. Results and Conclusions. This study concluded with a signature of 13 transcripts (NELL2, SERPINI1, SMOC1, FGF2, MMRN2, PRSS3, VEGFB, ADAM21, ADAMTSL4, C1QTNF4, CCL3L3, COL4A2, LAMB1) and a combined predictive model that achieved 94.45% accuracy in diagnosing GBM, 78.60% accuracy in predicting the glioma subtypes and 78.57% accuracy in predicting the GBM genetic profile with 5-fold cross-validation. 
The proposed solution improved in each prediction problem the predictive accuracy by at least 10% compared to single biomarker predictors and by at least 4.5% when comparing it with the predictive models trained using the signature proposed in [2] and random forests. Finally, the network analytics component of the solution was able to reconstruct a directed signalling network whose further analysis can provide meaningful information about the mechanisms which lead to the GBM carcinogenesis.

RSG-55: Fine-scale linkage disequilibrium structure of non-coding RNAs, genic and sub-genic sequences
Topic: LD structure LD at fine-scale non-coding RNAs geni
  • Norma Alejandra Vergara Lope Gracia, University of Southampton, United Kingdom

Short Abstract: Introduction: There is a pressing need to understand disease gene characteristics to help interpret voluminous NGS data. The integration of genomic properties may improve filtering of NGS variant lists to determine true disease candidates. The genome-wide pattern of linkage disequilibrium (LD) at a fine-scale level represents a combination of recombination rates, selection and mutation over many generations. These patterns are non-random pair-wise allelic associations indicating that two genes are physically linked. This study aims to analyse LD maps of the autosomal genome at a very fine scale, down to the genic, intergenic and non-coding RNA (ncRNA) level, providing novel insights into the impact of recombination and selection on genome structure and function. Methods: SNP data from whole-genome sequencing (WGS) samples of 454 individuals from the Wellderly study were analysed. LD distances were computed according to the Malecot-Morton model using LDMAP. The extent of LD was defined as Kb/LDU for genic, intergenic and ncRNA regions, where LDU are LD units. The boundaries for all genes were set according to UCSC Genome Browser files. The LDU boundaries of all annotated features were determined by linear interpolation. ncRNA data were clustered where overlapping but were not distinguished from other genomic features. All genes that overlap with other genes were merged into a smaller number of genic regions. Intergenic regions were taken as any areas flanked by, but not overlapped by, genic regions. Results: The extent of LD across ncRNAs, which comprise ~11% of the sequence, showed that LD patterns of ncRNAs are more extensive than those of intergenic regions. This result may reflect increased positive selection, which would align with evidence for the functional significance of ncRNA regions. Genic regions were found to comprise ~40% and intergenic regions ~55% of the sequence. 
The average extent of LD across the genic regions of autosomes is ~44.5 Kb compared to ~37.8 Kb for intergenic regions. Hence LD is ~16% more extensive in genic compared to intergenic regions, presumably reflecting relatively reduced recombination and/or increased selection across genic regions. Conclusion: The findings show the broad characteristics of LD maps with respect to genome function at the sub-genic level, demonstrating differences between LD patterns. The pattern of LD varies across the gene profile, although the functional implications of this are not fully understood. Further analysis of LD structure differences between ncRNA species and functional elements is likely to provide additional insights into the functional significance of these patterns.
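
The Kb/LDU measure and the interpolation of feature boundaries described in the Methods can be sketched as follows. This Python example assumes an LD map given as sorted base-pair positions with cumulative LDU coordinates; the function names and the toy map are hypothetical and do not reproduce LDMAP's file format or output.

```python
from bisect import bisect_right


def interpolate_ldu(position, map_positions, map_ldus):
    """Linearly interpolate the LDU coordinate at a base-pair position."""
    if not map_positions[0] <= position <= map_positions[-1]:
        raise ValueError("position lies outside the LD map")
    i = bisect_right(map_positions, position) - 1
    if i == len(map_positions) - 1:
        return map_ldus[-1]  # position is exactly the last marker
    x0, x1 = map_positions[i], map_positions[i + 1]
    y0, y1 = map_ldus[i], map_ldus[i + 1]
    return y0 + (y1 - y0) * (position - x0) / (x1 - x0)


def kb_per_ldu(start, end, map_positions, map_ldus):
    """Extent of LD for a feature: physical length in Kb per LD unit."""
    ldu_span = (interpolate_ldu(end, map_positions, map_ldus)
                - interpolate_ldu(start, map_positions, map_ldus))
    if ldu_span == 0:
        return float("inf")  # no LD breakdown within the feature
    return (end - start) / 1000.0 / ldu_span


# Toy LD map: marker positions (bp) and cumulative LDU values.
positions = [0, 10000, 30000]
ldus = [0.0, 1.0, 1.5]
extent = kb_per_ldu(5000, 20000, positions, ldus)
```

A larger Kb/LDU value means that more physical distance is covered per unit of LD breakdown, that is, more extensive LD.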

RSG-56: FOXA2 is required for enhancer priming during pancreatic differentiation
Topic: FOXA2 GATA6 stem cell hPSC pioneer factor pancreat
  • Kihyun Lee, Memorial Sloan Kettering Cancer Center, United States
  • Hyunwoo Cho, Memorial Sloan Kettering Cancer Center, United States
  • Robert Rickert, Memorial Sloan Kettering Cancer Center, United States
  • Qing Li, Memorial Sloan Kettering Cancer Center, United States
  • Julian Pulecio, Memorial Sloan Kettering Cancer Center, United States
  • Christina Leslie, Memorial Sloan Kettering Cancer Center, United States
  • Danwei Huangfu, Memorial Sloan Kettering Cancer Center, United States

Short Abstract: Mechanisms of lineage priming in the differentiation of human pluripotent stem cells (hPSCs) are not well understood due to the difficulty of isolating transient progenitor populations. To model gene transcriptional and chromatin regulation programs at specific time points during cell fate commitment in pancreatic lineage, we used directed differentiation combined with gene editing technology to define precise gene requirements for the consecutive chromatin events during cell fate transitions. Using multiple genomewide analyses, we profiled the stages of differentiation of hPSCs into pancreatic progenitor cells at the level of gene expression, open chromatin activity using assay for transposase-accessible chromatin using sequencing (ATAC-seq), binding of key transcription factors in the pancreatic lineage, and histone modifications. Our analysis of chromatin accessibility dynamics combined with computational motif analyses led us to uncover a requirement for FOXA2, known as a pioneer factor, in human pancreas specification not previously shown from mouse knockout studies. Comparison of chromatin accessibility and gene expression profiles in differentiation of FOXA2 wild-type and FOXA2 knockout hPSCs using a two-factor generalized linear model identified genotype-dependent effects at each stage. FOXA2 knockout hPSCs showed impaired recruitment of GATA6 to pancreatic enhancers and formed reduced numbers of pancreatic progenitors. We further found that FOXA2 is associated with nucleosome repositioning by separately analyzing ATAC-seq read pairs corresponding to nucleosomal fragments versus to nucleosome-free regions. We also observed H3K4me1 deposition during enhancer priming prior to activation. This work provides direct evidence for transcription factor-mediated enhancer priming in the establishment of developmental competence.

RSG-57: Fast and accurate reconstruction of cell trajectory and pseudo-time for massive single cell RNA-seq data
Topic: single cell RNA-seq cell trajectory pseudo-time ma
  • Yang Chen, The Jackson Laboratory for Genomic Medicine, United States
  • Yuping Zhang, University of Connecticut, United States
  • Zhengqing Ouyang, The Jackson Laboratory for Genomic Medicine, United States

Short Abstract: Single cell RNA sequencing (scRNA-seq) is poised to revolutionize the study of development and disease processes. It has been widely used to investigate gene expression dynamics, cell type identification, cell state transition, and pseudo-time estimation at the single cell level. Modern scRNA-seq technologies such as Drop-seq/InDrop and 10x Genomics make it possible to profile tens of thousands of cells simultaneously. The analysis of scRNA-seq data has become a central topic in regulatory genomics. One of the most challenging issues in scRNA-seq data analysis is the reconstruction of complex cell trajectories and the pseudo-time of individual cells. On the one hand, the existing literature includes empirical approaches for studying specific cell lineages using known time labels and cell marker genes, which are not readily generalizable to many other complex scRNA-seq datasets. On the other hand, existing stand-alone methods have only been demonstrated on small-scale scRNA-seq datasets, and it is unclear whether these approaches are feasible for analyzing large-scale scRNA-seq data. We have developed a new machine learning approach for estimating the trajectory and global pseudo-time of vast numbers of single cells. Assessments on scRNA-seq datasets with tens of thousands of cells demonstrate improved cell trajectory and pseudo-time reconstruction compared to existing methods. The implementation of our method requires less CPU time and memory, allowing its application to massive single cell data. These computational advances in single cell analysis facilitate the understanding of complex cell lineages underlying development and disease.

RSG-58: High-throughput tissue dissection and cell purification with digital cytometry
Topic: scRNA-seq Digital cytometry Deconvolution Microenv
  • Chloe Steen, Stanford University, United States
  • Chih Long Liu, Stanford University, United States
  • Andrew Gentles, Stanford University, United States
  • Aadel Chaudhuri, Stanford University, United States
  • Florian Scherer, Stanford University, United States
  • Michael Khodadoust, Stanford University, United States
  • Mohammad Esfahani, Stanford University, United States
  • Bogdan Luca, Stanford University, United States
  • David Steiner, Stanford University, United States
  • Maximilian Diehn, Stanford University, United States
  • Ash Alizadeh, Stanford University, United States
  • Aaron Newman, Stanford University, United States

Short Abstract: Tumors are complex ecosystems comprised of distinct cell types that are distinguished by their developmental origins and functional states. Common methods for characterizing tumor cellular composition, such as flow cytometry and immunohistochemistry, generally rely on small combinations of preselected marker genes, limiting the number of cell types that can be simultaneously interrogated. Single cell RNA sequencing (scRNA-Seq) has emerged as a powerful technology for cell type discovery, but is currently impractical for large-scale analyses and cannot be applied to fixed specimens collected as part of routine clinical care. To overcome these challenges, we developed CIBERSORTx, a novel deconvolution framework for inferring cell type abundance and cell type-specific gene expression profiles (GEPs) directly from bulk tissue transcriptomes without physical cell isolation. Unlike previous deconvolution methods, CIBERSORTx was designed to enable cross-platform analyses of complex tissues using cell signatures derived from diverse sources, including single cell reference profiles. In addition, CIBERSORTx includes improved techniques for separating bulk tissue RNA admixtures into cell-type specific GEPs. To investigate the technical performance and clinical utility of CIBERSORTx, we applied it to characterize cellular heterogeneity in resected tumor biopsies profiled by bulk RNA-Seq. Using single cell-derived reference profiles to dissect bulk melanoma tumor GEPs, we identified novel cell type-specific signatures of driver mutations and immunotherapy response. In addition, we discovered striking cell type-specific phenotypic states within the microenvironment of non-small cell lung cancer tumors, a finding that we validated in GEPs of sorted cell types from an independent cohort. 
Given the rapid pace of data generation, methods to broadly apply single cell reference maps will become increasingly important, especially in settings where tissue is limited, fixed, or challenging to disaggregate into intact single cells. Our results demonstrate that CIBERSORTx is a useful approach for deciphering complex tissues, with implications for high resolution cell phenotyping in research and clinical settings.

RSG-59: DNA seqFISH resolves E. coli chromosome structure during cell replication.
Topic: seqFISH E. coli chromatin structure
  • Feng Bao, Harvard University, United States
  • Yandong Zhang, California Institute of Technology, United States
  • Long Cai, California Institute of Technology, United States
  • Guo-Cheng Yuan, Harvard University, United States

Short Abstract: The dynamic changes of E. coli chromosome structure during the cell replication process remain unknown. Traditional DNA FISH experiments can label only a few chromosomal loci and thus lack a global view of chromosome structure. We have recently developed a technique called DNA seqFISH that allows us to image single-cell chromosome structure in a highly multiplexed manner. In this study, we use DNA seqFISH experiments to label the spatial positions of around 100 loci on the E. coli chromosome in 2,386 cells at single-cell resolution. Single-cell chromosome structures from various replication stages are captured and reconstructed at 50 kb resolution. Using clustering analysis, we identified two different types of chromosome structures at higher accuracy, which are in good agreement with the known translationally symmetric configuration in the two daughter cells during chromosome replication. Combined analysis of cell structures from multiple replication stages revealed the progressive, dynamic development of the two structures.

RSG-60: MAST: A MULTI-AGENT BASED SPATIO-TEMPORAL MODEL OF THE INTERACTION BETWEEN IMMUNE SYSTEM AND TUMOR GROWTH
Topic: single cell tumor immunoediting multi agents model
  • Giacomo Baruzzo, University of Padova, Italy
  • Giovanni Finco, University of Padova, Italy
  • Francesco Morandini, University of Padova, Italy
  • Piergiorgio Alotto, University of Padova, Industrial Engineering Department, Italy
  • Barbara Di Camillo, University of Padova, Italy

Short Abstract: In recent years, single-cell technologies, both at imaging and sequencing level, have given the possibility to observe how tissues and organs are spatially and temporally organized as a system of multiple cells, able to communicate and interact with each other. In this context, multi-agent based spatial models could present an interesting approach for developing personalized medicine strategies, for their ability to simulate complex systems of different types of cells, stochastic behavior and interaction among cells. Moreover, if coupled with partial differential equations, dependency of cell behavior on concentration of communication molecules and nutrients can be also modeled. We present a model of the interaction between immune system and tumor growth that includes the description of different cell types such as CD8+ T cells, dendritic cells, natural killer cells, cancer cells, immune resistant cancer cells, suppressed T cells, stroma and necrotic cells. Each type of cell is characterized by different possible actions and interactions with neighbor cells: namely, moving, dividing, dying, attacking, mutating. Starting from an initial state, the system evolves stochastically by selecting a possible action for each cell and consequently updating the communication molecules released in the neighborhood. The probability each cell has to fulfill a specific action depends on the cell type, on the type of surrounding cells and on the communication molecules in the neighborhood. For example, cancer cells can duplicate, mutate or die. 
Duplication depends on the nutrients available in the cell neighborhood; death depends on the lack of survival nutrients in the cell neighborhood (which leads to necrosis) or on immune system attack; (epi)genetic mutations occur at a certain rate and can give rise to new antigens capable of inducing a specific immune response, the ability to block T-cell attack (as with a PD-L1+ mutation), or the ability to release inhibitory molecules that locally repel immune system cells. Partial differential equations are used to model nutrient diffusion from its source (vessels) within the tissue. The choice of model parameters is guided by patient-specific single-cell RNA-sequencing data. This model reproduces key aspects of cancer tissue and of the tumor microenvironment, such as the infiltration of immune cells into tumor tissue and its inverse correlation with tumor growth rate. It can be used to mimic different clusters of patients with a specific mutation rate, tumor immuno-editing, and presence or absence of stroma, which can suggest new, targeted therapeutic strategies.
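
The stochastic action selection described above can be sketched for a single cancer cell as follows. This Python example is only an illustration of the mechanism: every probability, the nutrient threshold, the default mutation rate and the fallback "rest" action are assumptions, not parameters of MAST.

```python
import random

ACTIONS = ("die", "mutate", "divide", "rest")


def choose_cancer_cell_action(nutrient, t_cell_neighbors, rng, mutation_rate=0.01):
    """Pick one action for a cancer cell from its local neighborhood state.

    `nutrient` is a local concentration in [0, 1]; `t_cell_neighbors` is the
    number of adjacent active T cells. All numeric values are illustrative.
    """
    # Death: a baseline rate, plus immune attack, plus necrosis when starved.
    necrosis = 0.5 if nutrient < 0.1 else 0.0
    p_die = min(1.0, 0.05 + 0.2 * t_cell_neighbors + necrosis)
    survive = 1.0 - p_die
    # Surviving cells mutate at a fixed rate, otherwise divide if fed.
    p_mutate = survive * mutation_rate
    p_divide = survive * (1.0 - mutation_rate) * min(1.0, max(0.0, nutrient))
    u = rng.random()
    if u < p_die:
        return "die"
    if u < p_die + p_mutate:
        return "mutate"
    if u < p_die + p_mutate + p_divide:
        return "divide"
    return "rest"  # assumed fallback: the cell does nothing this step


rng = random.Random(0)
action = choose_cancer_cell_action(nutrient=0.8, t_cell_neighbors=1, rng=rng)
```

In a full simulation this function would be called once per cell per time step, after which the nutrient and signaling fields in the neighborhood would be updated.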

RSG-61: High Throughput Cell Microarray for Mapping miRNA:Protein Interactomes
Topic: miRNAs regulatory networks High-throughput mapping
  • Thu Chu, New York University, United States
  • Rui Qin, New York University, United States
  • Jonathan Chung, New York University, United States
  • Deepika Dhawan, New York University, United States
  • Lara Mahal, New York University, United States

Short Abstract: miRNAs (miRs) are small non-coding regulatory RNAs that fine-tune gene expression by binding complementary mRNA and inhibiting either the stability or the translation of the transcript. miRNAs regulate networks of genes that work in concert to control a specific biological process, tightening the expression window for critical genes. By mapping the targets of miRNAs involved in a specific biological process or disease state, we can determine which genes are important in that process or disease; we call this the miRNA Proxy Approach. Only a small number of miRNA-gene interactions have been experimentally identified relative to the vast scale of predicted miRNA:gene databases, and systematic data processing for analyzing the miRNA regulatory network is still lacking. The accurate identification of the impact of miRs on protein expression (miR:protein interactions) is hindered by the low accuracy of prediction (17-66%) and the suboptimal throughput of more direct miR:mRNA validations (e.g. luciferase assay). To overcome these obstacles, our laboratory is developing new microarray-based tools for high-throughput mapping of miR:protein interactions, to generate a large database of miRNA-gene networks and potentially to create a new system for processing and applying this database. The protein classes we target for mapping are the products of glycosylation-related genes and G-protein-coupled receptors.

RSG-62: Dissecting Pathway Disturbances Using Network Topology and Multi-platform Genomics Data
Topic: Data integration Multi-platform genomics Network t
  • Yuping Zhang, University of Connecticut, United States
  • M. Henry Linder, University of Connecticut, United States
  • Ali Shojaie, University of Washington, United States
  • Zhengqing Ouyang, The Jackson Laboratory for Genomic Medicine, United States
  • Ronglai Shen, Memorial Sloan Kettering Cancer Center, United States
  • Keith Baggerly, The University of Texas MD Anderson Cancer Center, United States
  • Veerabhadran Baladandayuthapani, The University of Texas MD Anderson Cancer Center, United States
  • Hongyu Zhao, Yale University, United States

Short Abstract: Complex diseases such as cancers usually result from accumulated disturbance of pathways instead of the disruptions of one or a few major genes. As opposed to single-platform analyses, it is likely that integrating diverse molecular regulatory elements and their interactions can lead to more insights on pathway-level disturbances of biological systems and their potential consequences in disease development and progression. To explore the benefit of pathway-based analysis, we focus on multi-platform genomics, epigenomics, and transcriptomics (-omics, for short) from 11 cancer types collected by The Cancer Genome Atlas project. Specifically, we use a well-studied oncogenic pathway, the BRAF pathway, to investigate the relevant copy number variants (CNVs), methylations, and gene expressions, and quantify their effects on discovering tumor-specific aberrations across multiple tumor lineages. We also perform simulation studies to further investigate the effects of network topology and multiple omics on dissecting pathway disturbances. Our analysis shows that adding molecular regulatory elements such as CNVs and/or methylations to the baseline mRNA molecules can improve our power of discovering tumorous aberrances. Also, incorporating CNVs with the baseline mRNA molecules can be more beneficial than incorporating methylations. Moreover, employing regulatory topologies can improve the discoveries of tumorous aberrances. Finally, our analysis reveals similarities and differences among diverse cancer types based on disturbance of the BRAF pathway.

RSG-63: BGCLUST: Clustering large single-cell RNA-seq datasets
Topic: Single Cell Gene expression Clustering
  • Maziyar Baranpouyan, University of Pittsburgh, United States
  • Abha Bais, University of Pittsburgh, United States
  • Dennis Kostka, University of Pittsburgh, United States

Short Abstract: Background: Rapid progress of single cell technologies is enabling researchers to query RNA expression of large numbers of single cells simultaneously, leading to significant advances in our understanding of cellular heterogeneity and tissue composition. Microfluidics-based sequencing protocols that employ unique molecular identifiers (UMIs) allow for high-throughput processing, and the screening of many thousands of cells (or more) is becoming increasingly commonplace. However, bioinformatics analysis of such data continues to pose challenges, at the most general level in terms of processing time and computational resources. To address this, we designed BGCLUST — a clustering approach designed for large single-cell RNA-seq (scRNA-seq) datasets. Methods: BGCLUST takes a p x n matrix of expression values for p genes and n cells (n can be large) as input. Exploiting gene-gene correlations across cells we derive a correlation graph, and using associated transition probabilities we first construct a smoothed version of the expression matrix. Next, we take advantage of random projection singular value decomposition (SVD) to obtain an approximate Eigen decomposition of the complete underlying n x n cell-cell correlation matrix, while avoiding explicit construction. Keeping only a small fraction of eigenvectors to characterize cells we finally perform hybrid clustering: Initial centers for k-means clustering are derived via hierarchical clustering of a sub-sample of cells, which leads to stable clustering solutions. Results: We evaluated the performance of BGCLUST on ten datasets ranging from about one thousand to approximately 70,000 cells and compared it with state-of-the-art approaches. BGCLUST shows significant improvements over the status quo, on average increasing the adjusted Rand index by more than eight percent with respect to the best competitor (28% maximum, -3% minimum). 
In addition, we demonstrate increased stability of clustering solutions for BGCLUST, as well as improved visualization results when employing dimension reduction techniques. Overall, this constitutes an improvement in quality and computational speed (clustering 50k cells typically takes less than five minutes) of scRNA-seq clustering. All experiments were performed on a laptop with 3.2-GHz CPU and 16 GB of RAM. Conclusion: BGCLUST is a scalable clustering approach for large scRNA-seq datasets. It enables researchers to analyze tens of thousands of cells in a matter of minutes, significantly faster than what is currently the norm.
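
The hybrid clustering step described above (hierarchical clustering of a sub-sample to seed k-means) can be sketched roughly as follows. This is a minimal illustration and not the BGCLUST implementation: the function names are hypothetical, centroid linkage is an assumption (the abstract does not state the linkage), and the sketch works on plain coordinate tuples where BGCLUST would use the retained eigenvector coordinates of each cell.

```python
import random
from math import dist


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def hierarchical_centers(sample, k):
    """Agglomerative clustering (centroid linkage, an assumption) down to k clusters.

    Returns the k cluster centroids, used below as k-means starting centers.
    """
    clusters = [[p] for p in sample]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return [centroid(c) for c in clusters]


def kmeans(points, centers, n_iter=20):
    """Plain Lloyd iterations starting from the supplied centers."""
    labels = []
    for _ in range(n_iter):
        labels = [min(range(len(centers)), key=lambda c: dist(p, centers[c]))
                  for p in points]
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(centroid(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def hybrid_cluster(points, k, sample_size, seed=0):
    """Seed k-means with centers from hierarchical clustering of a sub-sample."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    return kmeans(points, hierarchical_centers(sample, k))
```

Only the sub-sample goes through the quadratic-cost hierarchical step, while every cell is assigned by the cheaper k-means pass, which is presumably why this combination suits large n.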

RSG-64: Metabolite signatures of organ dysfunction in sickle cell disease patients
Topic: Sickle Cell Disease Metabolomics WGCNA Networks GW
  • Yann Ilboudo, Montreal Heart Institute, Canada
  • Melanie Garrett, Center for Human Disease Modeling, Duke University Medical Center, United States
  • Allison Ashley-Koch, Center for Human Disease Modeling, Duke University Medical Center, United States
  • Marilyn Telen, Department of Medicine, Division of Hematology, Duke University Medical Center, United States
  • Guillaume Lettre, Montreal Heart Institute, Canada

Short Abstract: Sickle cell disease (SCD) is a monogenic disease caused by mutations in the β-globin gene. The complications related to the disease are systemic as they impact multiple organ systems. Our goal in this study was to identify metabolome changes contributing to SCD-related severity. Employing both targeted and untargeted approaches, we profiled the plasma of 706 SCD patients using liquid chromatography-tandem mass spectrometry. The cohort included 406 French patients of recent African descent and 300 African Americans from the southeastern US. In total, we measured the levels of 61 known and 2,100 unknown metabolites. We applied weighted gene correlation network analysis (WGCNA) algorithms to account for correlations among metabolites and identify specific metabolomic modules associated with SCD-related complications. Finally, we integrated genetic data from 30 million SNPs with the modules in order to identify the biological pathways implicated by the unknown metabolites. We constructed 66 modules containing at least 7 metabolites per module. Correlating these modules to 15 clinically important phenotypes (6 complications and 11 hematological traits), we found a module strongly associated with increased risk of gall bladder removal. That particular module contained 4 known metabolites (glycocholate, glycodeoxycholate, taurocholate, and taurodeoxycholate) involved in bile acid metabolism, and 20 unknown metabolites. Additionally, we found another module of metabolites strongly correlated with estimated glomerular filtration rate (eGFR). The module contained several carboxylic acid metabolites (citrulline, 4-acetamidobutanoate, symmetric dimethylarginine) as well as metabolites involved in purine and pyrimidine metabolism. Moreover, for eGFR, the same module included 114 unknown metabolites. We performed a GWAS for each of the 39 most robust modules, which resulted in two modules strongly associated with SNPs (FDR < 0.05). 
We found that one of these two modules was significantly associated (P < 8.0 × 10^-10) with multiple SNPs near the gene encoding hepatic triglyceride lipase (LIPC). Using metabolomics, we identified metabolomic signatures of liver, gall bladder, and kidney complications in SCD. Although hemolysis is the key determinant of organ damage in SCD, understanding which specific metabolite or metabolic pathway plays a role in organ dysfunction can be exploited to predict SCD severity. The module we found to be associated with the LIPC variants suggests that triglycerides could play an important role in the progression of SCD. Determining the causal role of the metabolites involved within the module and how they relate to complications may be key to our understanding of the role of blood lipids in SCD.
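
As a rough illustration of how WGCNA-style analysis accounts for correlations among metabolites, the sketch below builds the soft-thresholded adjacency |cor|^β that underlies module detection in an unsigned network. It is not the authors' pipeline: the metabolite names and values are made up, and β = 6 and the unsigned network type are assumptions rather than settings reported in the abstract.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def soft_threshold_adjacency(profiles, beta=6):
    """Unsigned soft-threshold adjacency: a_ij = |cor(x_i, x_j)| ** beta.

    `profiles` maps a metabolite name to its levels across samples.
    beta = 6 is an illustrative choice, not a value from the abstract.
    """
    names = list(profiles)
    return {
        (a, b): abs(pearson(profiles[a], profiles[b])) ** beta
        for a in names
        for b in names
        if a != b
    }


if __name__ == "__main__":
    # Hypothetical toy data: three metabolites measured in four samples.
    toy = {"m1": [1, 2, 3, 4], "m2": [2, 4, 6, 8], "m3": [1, 3, 2, 4]}
    for pair, weight in sorted(soft_threshold_adjacency(toy).items()):
        print(pair, round(weight, 4))
```

Raising the correlation to a power suppresses weak correlations while keeping strong ones close to 1, so modules are driven by tightly co-varying metabolites.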

RSG-65: SPARSim Single Cell: a count data simulator for single cell RNA-seq data
Topic: Single cell RNA-seq Data simulation Compositional
  • Giacomo Baruzzo, University of Padova, Italy
  • Ilaria Patuzzi, University of Padova and Istituto Zooprofilattico Sperimentale delle Venezie, Italy
  • Barbara Di Camillo, University of Padova, Italy

Short Abstract: Single cell RNA sequencing (scRNA-seq) has revolutionized transcriptomic studies, offering new possibilities to address biological and medical questions. scRNA-seq count data differ from bulk RNA-seq data in many ways, making the application of established RNA-seq analysis methods difficult or even inappropriate. Thus, the development of new methods for analyzing scRNA-seq data is currently one of the most active research fields in bioinformatics. In this scenario, the availability of realistic simulated data could be a powerful tool for benchmarking and assessing current and novel bioinformatics methods. Unfortunately, many of the current scRNA-seq simulators are poorly documented and their results do not resemble properties observed in real data (Zappia et al., 2018). To overcome these limitations and provide a useful tool for scRNA-seq researchers, here we present SPARSim, a novel scRNA-seq count data simulator implemented as an R package. Unlike available simulators, SPARSim uses an innovative Gamma+Multivariate Hypergeometric model to generate in silico count tables. In particular, the Multivariate Hypergeometric distribution accurately models the compositional nature of sequencing data, resembling the sampling process without replacement and with fixed sampling size typical of such data. In order to simulate even complex scenarios, SPARSim can simulate the presence of spike-ins, multimodal gene expression and batch effects. Simulation parameters can be specified by users, estimated from real count tables or taken from a parameter database describing more than 20 real datasets. The performance of SPARSim in simulating realistic count table data was assessed in comparison with six real datasets, describing different organisms, cell types and experimental protocols, thus allowing the proposed simulation model to be tested over a wide range of real scenarios. 
SPARSim was assessed in terms of its ability to recreate realistic variability between samples/cells, count intensity distribution within samples and sparsity, both as total count matrix sparsity and as zero values distribution per gene and per cell. The assessment results revealed the ability of SPARSim to create realistic scRNA-seq count tables, achieving comparable or better simulation results than Splat, the most used scRNA-seq count data simulator (Zappia et al., 2018). Notably, compositionality is not properly modelled by the currently adopted Gamma+Poisson (Negative Binomial) or Zero-Inflated Negative Binomial models, which often result in unrealistic sparsity. Overall, our results indicate that SPARSim is a valuable tool for researchers involved in developing and testing robust and reliable scRNA-seq bioinformatics methods, helping to finally unleash the full potential of scRNA-seq data.
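
The two-stage Gamma + Multivariate Hypergeometric idea can be sketched in a few lines. This is only a toy illustration under stated assumptions, not SPARSim's API or parameterization: the function name, the conversion of gamma intensities into an integer molecule pool, and all parameter names are hypothetical.

```python
import random


def simulate_cell(gene_shapes, gene_scales, pool_size, library_size, rng):
    """Toy Gamma + multivariate hypergeometric draw for one cell.

    Step 1: draw a gamma-distributed expression intensity for each gene.
    Step 2: turn the intensities into an integer pool of molecules and draw
            `library_size` molecules from it WITHOUT replacement, which is a
            multivariate hypergeometric sample with a fixed sampling size.
    """
    intensities = [rng.gammavariate(shape, scale)
                   for shape, scale in zip(gene_shapes, gene_scales)]
    total = sum(intensities)
    # Hypothetical discretisation: molecules per gene in the pool.
    pool_counts = [round(pool_size * x / total) for x in intensities]
    # Expand the pool: one entry (the gene index) per molecule.
    pool = [gene for gene, n in enumerate(pool_counts) for _ in range(n)]
    drawn = rng.sample(pool, library_size)  # sampling without replacement
    counts = [0] * len(pool_counts)
    for gene in drawn:
        counts[gene] += 1
    return counts, pool_counts


if __name__ == "__main__":
    rng = random.Random(7)
    counts, pool = simulate_cell([2.0] * 5, [1.0] * 5,
                                 pool_size=1000, library_size=100, rng=rng)
    print("pool:  ", pool)
    print("counts:", counts)
```

Because the draw is without replacement and the library size is fixed, a high count for one gene necessarily leaves fewer reads for the others, which is the compositional effect the abstract refers to.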

RSG-66: FunPipe: a Python library for efficient construction of fungal genomic pipelines
Topic: Pipeline development fungal genomics genomic analy
  • Xiao Li, The Broad Institute, United States
  • Jose Munoz, The Broad Institute, United States
  • Christina Cuomo, The Broad Institute, United States

Short Abstract: Genome sequencing is occurring at increasing scale for the study of fungi and other microbes, providing new insights about population structure, evolution, and variants linked to phenotypes such as drug resistance and virulence. A major challenge in handling this expanding volume of genomic data is to develop computational frameworks for processing and analyzing these data using automated, reproducible and scalable pipelines. To meet this challenge, we developed the FunPipe Python package, containing wrapper functions for widely used methods and tools we developed to process and analyze re-sequencing data compared to a reference genome. Python is a powerful programming language for software development and scientific computing and visualization. Using a modular design, we provide a diverse repository of tools for efficient data processing as well as further tool development (such as unittest and pip for software development, scipy, numpy, pandas for scientific computing, seaborn and matplotlib for graphics). A major component of our package is a re-sequencing pipeline built using the Workflow Description Language (WDL). Additional pipelines could be easily integrated using WDL or other approaches (Common Workflow Language, Snakemake, or Nextflow) for ease of deployment on different computational platforms, including cloud computing infrastructure. Currently, major functionalities of the FunPipe package are built for analysis of re-sequencing data, including alignment-based (BAM) quality control reporting, GATK-based variant calling, quality control metrics for variant calls (VCF), ploidy and copy number analysis, variant annotation, mutational profiling, PLINK-based population genetic analysis, and phylogenetic analysis. We followed software engineering best practices, such as unit testing to ensure the quality of the codebase and continuous integration to test, develop and release the package in a quick and consistent way. 
Docker, Google containers and conda were used to ensure consistent versions of software and the local environment required to run these programs. We will release source code to the public via GitHub to allow the community to use and contribute to this package. We envision that this package will continue to grow and will accelerate large scale genomic analyses by the fungal research community.

RSG-67: Accurate prediction of regulatory maps in Arabidopsis by integrating DAP-seq, ATAC-seq and single cell sequencing data
Topic: Integrative analysis DAP-seq ATAC-seq machine lear
  • Qi Song, Virginia Tech, United States
  • Song Li, Virginia Tech, United States

Short Abstract: Integrating heterogeneous genomic data can provide novel insights into the regulation of condition-specific or single cell-specific gene expression. Recent advances in experimental techniques such as DNA Affinity Purification Sequencing (DAP-seq) and Assay for Transposase-Accessible Chromatin using Sequencing (ATAC-seq) have generated large-scale regulatory genomic datasets for the model plant species Arabidopsis thaliana. Using these genomic datasets, we have constructed a comprehensive and accurate map of condition-specific gene regulation in Arabidopsis. We developed the Condition Specific Regulatory network inference engine (ConSReg), which can infer condition-specific interactions from integrated heterogeneous genomic data using sparse linear model based feature selection. Our results show that ConSReg can accurately predict gene expression with an average area under the curve (AUC) of 0.84 across multiple testing datasets. We found that including ATAC-seq information significantly improves the performance of ConSReg. We applied ConSReg to Arabidopsis single cell RNA-seq data of two root cell types (endodermis and cortex) and successfully identified five key regulators in these cell types. Three out of the five regulators are supported by existing publications. Compared to existing tools, our approach does not require time-series expression data, which provides great convenience when expression data is limited.
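
For readers unfamiliar with the reported metric, the sketch below computes an AUC from prediction scores and binary labels. It assumes the AUC in the abstract is the area under the ROC curve (the abstract does not say which curve), and the function name and toy data are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via its pairwise-ranking interpretation.

    Equals the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as half.
    """
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy example: 1 = gene is differentially expressed, 0 = it is not.
    print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 3 of 4 pairs ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported average of 0.84 indicates that true targets are usually scored above non-targets.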

RSG-69: ManiNetCluster: A manifold learning approach to reveal the functional linkages across multiple gene networks
Topic: manifold learning clustering functional genomics g
  • Nam Nguyen, Stony Brook University, United States
  • Ian Blaby, Brookhaven National Laboratory, United States
  • Daifeng Wang, Stony Brook University, United States

Short Abstract: The coordination of genome-encoded function is a critical and complex process in biological systems, especially across phenotypes or states (e.g., time, disease, organism). Understanding how the complexity of genome-encoded function relates to these states remains a challenge. To address this, we have developed a novel computational method based on manifold learning and comparative analysis, ManiNetCluster (Manifold-Network-Clustering), which simultaneously clusters multiple molecular networks to systematically reveal the functional linkages across multiple datasets. Specifically, ManiNetCluster employs manifold learning to match local and nonlinear structures among the networks of different states, to identify cross-network linkages. For example, by applying ManiNetCluster to the developmental gene expression datasets across model organisms (e.g., worm, fruit fly), we found that our tool aligns orthologous genes significantly better than existing state-of-the-art methods, indicating nonlinear interactions between evolutionary functions in development. Moreover, we applied ManiNetCluster to a series of transcriptomes measured in the model green alga Chlamydomonas reinhardtii, to determine the genomic functional linkages among various metabolic processes across the light and dark periods of a diurnally cycling culture, identifying a number of genes putatively regulating processes across each light regime. We describe how ManiNetCluster constitutes a novel approach for generating genome function inferences. ManiNetCluster is available as an R package together with a tutorial at http://github.com/namtk/ManiNetCluster.

RSG-70: Analysis of tissue specific regulatory programs with chromatin accessibility data and binding motif models
Topic: Regulatory programs Chromatin accessibility Prosta
  • Anastasia Shcherban, University of Tampere, Finland
  • Juha Kesseli, University of Tampere, Finland
  • Matti Nykter, University of Tampere, Finland

Short Abstract: Carcinogenesis is a complex process which is associated with perturbations in gene expression levels. Transcription factors (TFs) are cellular proteins that are key players of gene regulation and hence discovery of their functionality in a genome-wide scale is highly important for better understanding of cancer development and progression. TF binding has been characterized extensively in a number of cell lines. While direct TF binding measurements in tissues remain challenging, recent development has made it feasible to measure chromatin accessibility in tumor tissues. Using chromatin accessibility data from ATAC-seq and transcription factor binding prediction based on motif models we present a computational approach to uncover combinations of TFs activated in prostate cancer progression. Our method is based on implementation of Hidden Markov Models (chromHMM) which allows us to uncover regulatory programs governed by combinations of TFs based on their binding site predictions. By incorporating accessible chromatin data into the model we are able to segregate TFs based on their accessibility and segment genomic regions such as promoters into a sequence of regulatory programs that are active in tissue context. Furthermore, we show that our approach can be extended to other regulatory elements such as enhancers. By applying our method on a dataset that includes tissue samples from different stages of prostate cancer we are able to uncover putative TF-induced regulatory programs that are driving cancer progression. We also demonstrate that these results can further be used for determining potential functions underlying some of the discovered regulatory programs by performing pathway and Gene Ontology enrichment analyses. Overall, our approach will help to elucidate the regulatory programs employed in tissue context without a need for tissue specific TF measurements.

RSG-71: Single-cell ATAC-seq Signal Extraction and Enhancement with SCATE
Topic: scATAC-seq Single cell DNase-seq Genomics Machine
  • Zhicheng Ji, Johns Hopkins University, United States
  • Weiqiang Zhou, Johns Hopkins University, United States
  • Hongkai Ji, Johns Hopkins University, United States

Short Abstract: Single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq) is a new technology for measuring genome-wide regulatory element activities in single cells. With the ability to analyze cells’ distinct behaviors in a heterogeneous cell population, this technology is rapidly transforming biomedical research. Data produced by scATAC-seq are highly sparse and discrete. Existing computational methods typically use these data to analyze regulatory pathway activities in single cells. They cannot accurately measure activities of individual cis-regulatory elements (CREs) due to data sparsity. We present SCATE, a new statistical framework for analyzing scATAC-seq data. SCATE adaptively integrates information from co-activated CREs, similar cells, and publicly available regulome data to substantially increase the accuracy of estimating activities of individual CREs. We show that one can use SCATE to identify cell subpopulations and then accurately reconstruct CRE activities of each subpopulation. The reconstructed signals are accurate even for cell subpopulations consisting of only a few cells, and they significantly improve prediction of transcription factor binding sites. The accurate CRE-level signal reconstruction makes SCATE a unique tool for analyzing the regulatory landscape of a heterogeneous cell population using scATAC-seq data.

RSG-72: Scalable estimation of heritability and genetic correlation for biobank-scale data
Topic: Linear Mixed Model GWAS heritability
  • Yue Wu, University of California, Los Angeles, United States
  • Ali Pazokitoroudi, University of California, Los Angeles, United States
  • Sriram Sankararaman, University of California, Los Angeles, United States

Short Abstract: SNP heritability, i.e. the proportion of phenotypic variance explained by SNPs, and genetic correlation are important parameters in efforts to understand the genetic architecture of complex phenotypes as well as in the design and interpretation of genome-wide association studies. Attempts to accurately estimate SNP heritability as well as genetic correlation have motivated the analysis of large datasets as well as the development of sophisticated computational methods. While linear mixed models (LMMs) provide a coherent statistical model for estimating these parameters, inference in LMMs poses serious computational burdens. We propose a scalable randomized Method-of-Moments (MoM) estimator of SNP heritability and genetic correlations in LMMs. Our method, RHE-reg, leverages the structure of genotype data to estimate these parameters in O(BMN/log_3(max(M,N))) time on a dataset of N individuals, M SNPs and a user-defined parameter B (B<<100). RHE-reg also efficiently computes asymptotic as well as jackknife standard errors. Finally, RHE-reg scales to multi-phenotype datasets where the cost of parameter estimation scales as O((B+P)MN/log_3(max(M,N))), where P is the number of phenotypes. We perform extensive simulations to validate the accuracy and scalability of our method. In general, we find that the accuracy of RHE-reg is comparable to exact method-of-moments estimators, but that RHE-reg is less statistically efficient than restricted maximum likelihood estimators as implemented in GCTA, though the loss in efficiency is small for phenotypes with modest heritability (MSE of RHE-reg relative to GCTA is 1.09 for heritability of 0.2 and 2.03 for heritability of 0.5 in simulated genotypes, and 1.36 and 3.76 for heritability of 0.2 and 0.5, respectively, in the Northern Finland Birth Cohort dataset). 
On the other hand, its computational efficiency enables RHE-reg to compute heritability and genetic correlations on the full release of the UK Biobank dataset consisting of 430,000 individuals and 460,000 SNPs in ~3 hours on a stand-alone computing machine. We used RHE-reg to estimate heritability for all 60 non-disease traits and 283 disease traits in the UK Biobank (e.g., heritability of height has a point estimate of 0.67, and heritability of diastolic blood pressure has a point estimate of 0.17). We also document a decrease in heritability with age for a number of non-disease traits (e.g., height heritability decreases by 1.6% for every 10 years, and heritability for trunk fat percentage decreases by 5% for every 10 years).
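
The randomized method-of-moments idea can be written out as a short derivation. This follows the standard Haseman-Elston-style formulation for a single variance component and is offered as a sketch of the approach, not as the authors' exact estimator; the notation (standardized genotype matrix X, centered phenotype y, random vectors z_b) is assumed here rather than taken from the abstract.

```latex
% Linear mixed model with N individuals and M standardized SNPs
y = X\beta + \varepsilon, \qquad
\beta \sim \mathcal{N}\!\left(0, \tfrac{\sigma_g^2}{M} I_M\right), \qquad
\varepsilon \sim \mathcal{N}\!\left(0, \sigma_e^2 I_N\right)

% Second moment of the phenotype, with K = X X^\top / M
\mathbb{E}\!\left[y y^\top\right] = \sigma_g^2 K + \sigma_e^2 I_N

% Method of moments: least-squares fit of y y^\top gives the normal equations
\begin{pmatrix}
  \operatorname{tr}(K^2) & \operatorname{tr}(K) \\
  \operatorname{tr}(K)   & N
\end{pmatrix}
\begin{pmatrix} \hat\sigma_g^2 \\ \hat\sigma_e^2 \end{pmatrix}
=
\begin{pmatrix} y^\top K y \\ y^\top y \end{pmatrix}

% Randomized trace estimate using B random vectors with E[z_b z_b^\top] = I_N
\operatorname{tr}(K^2) \approx \frac{1}{B} \sum_{b=1}^{B} \lVert K z_b \rVert_2^2,
\qquad K z_b = \tfrac{1}{M}\, X \left( X^\top z_b \right)

% SNP heritability
\hat h^2 = \frac{\hat\sigma_g^2}{\hat\sigma_g^2 + \hat\sigma_e^2}
```

Each product X^T z_b costs O(MN) and never forms the N x N matrix K, so B random vectors cost O(BMN) before any further speed-up from genotype structure, which is in line with the complexity quoted in the abstract.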

RSG-73: Better decoding of TF signals in accessible chromatin with learned embeddings and neural networks
Topic: chromatin accessibility neural networks transcript
  • Lee Zamparo, Memorial Sloan-Kettering Cancer Center, United States
  • Han Yuan, Cornell University, United States
  • Meghana Kshirsagar, Memorial Sloan-Kettering Cancer Center, United States
  • Christina Leslie, Memorial Sloan-Kettering Cancer Center, United States

Short Abstract: Understanding where transcription factors bind is a critical problem in establishing an understanding of gene regulation. However, it is not practical to physically assay binding of all the relevant TFs in a given cell type. With the advent of ATAC-seq, chromatin accessibility is easy and inexpensive to measure, so we can make the imperfect assumption that TF binding profiles largely fall within accessibility regions in a given cell type and try to predict occupancy from DNA sequence signals in accessible elements. This reduces the problem to finding and decoding the sequence signals in these accessible sites, to predict which sites are occupied by different TFs. We have previously developed an algorithm to learn the sequence preferences for over 200 TFs in human, based on HT-SELEX in-vitro binding data. We learn a dictionary of codes for all eight base pair sequences (8-mers), that projects each into a high dimensional vector space. All 8-mers form a basis for the SELEX probes in the data set, and the collection of all enriched probes for each SELEX factor forms a basis for the TF itself. The 8-mers, probes, and TF labels form a hierarchy of objects that are all embedded into the vector space, where proximity between points encodes the binding preferences for over 200 TFs, that we have called BindSpace. In this work, we use the 8-mer encoding from BindSpace to featurize the DNA sequences in ATAC-seq peaks, and learn to predict binding of TFs within the different ATAC-seq peaks using a convolutional neural network. This approach outperforms using the TF label embeddings from BindSpace directly, and is comparable with state of the art gapped k-mer kernel predictors. We believe this work demonstrates the benefits of using a more powerful sequence embedding than the standard one-hot encoding. 
In work in progress, we are learning an attention module of the network in conjunction with BindSpace to interpret the binding signals from our network, obviating the need to interpret motifs in the sequence space.
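
The contrast between one-hot encoding and an 8-mer embedding lookup can be made concrete with a small sketch. The embedding table below is a made-up toy dictionary standing in for BindSpace vectors, the function names are hypothetical, and mapping unseen 8-mers to a zero vector is an assumption made for illustration only.

```python
BASES = "ACGT"


def one_hot(seq):
    """Standard one-hot encoding: one 4-dimensional indicator row per base."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]


def kmers(seq, k=8):
    """All overlapping k-mers of a sequence, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def featurize(seq, kmer_vectors, dim, k=8):
    """Replace each k-mer by its embedding vector (zero vector if unknown).

    `kmer_vectors` is a hypothetical stand-in for a learned embedding table;
    the result is a (positions x dim) matrix of the kind a convolutional
    network could consume instead of a one-hot matrix.
    """
    zero = [0.0] * dim
    return [list(kmer_vectors.get(km, zero)) for km in kmers(seq, k)]


if __name__ == "__main__":
    toy_vectors = {"ACGTACGT": [1.0, 0.0], "CGTACGTA": [0.0, 1.0]}
    peak = "ACGTACGTAC"
    print(kmers(peak))
    print(featurize(peak, toy_vectors, dim=2))
```

Each row of the one-hot matrix describes a single base, whereas each row of the k-mer matrix carries whatever binding information the embedding has learned about an eight-base window.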

RSG-74: Identification of prognostic indicators of healthy ageing with a machine learning based systems biology approach using gut microbiome data
Topic: Machine Learning Network biology Feature extractio
  • Matthew Madgwick, Earlham Institute, United Kingdom
  • Padhmanand Sudhakar, Earlham Institute, United Kingdom
  • Tamas Korcsmaros, Earlham Institute, United Kingdom

Short Abstract: We describe an integrated machine learning based systems biology workflow for identifying prognostic indicators of healthy aging and age-related disorders from microbiome data. The approach focuses on metagenomics and metatranscriptomics data, which capture the functional potential of the microbiota in modulating host processes. It utilises Artificial Neural Networks (NNs) to find and extract discrete underlying structures in the microbiome data, and then predicts dynamic changes in key features during the aging process using a heterogeneous ensemble, chosen to match the complexity of the underlying relations in the datasets. The predicted results will be interpreted using an age-related network model and microbial features from the identified and predicted communities, inferring connections between microbial and host proteins relevant in aging.

RSG-75: Functional Interpretation of Single-Cell Similarity Maps
Topic: single cell transcriptomics cellular signatures of
  • David Detomaso, University of California, Berkeley, United States
  • Matthew Jones, UC San Francisco, United States
  • Tal Ashuach, University of California, Berkeley, United States
  • Meena Subramaniam, UC San Francisco, United States
  • Chun J Ye, UC San Francisco, United States
  • Nir Yosef, University of California, Berkeley, United States

Short Abstract: A key challenge in single cell RNA-seq (scRNA-seq) data analysis is the identification of functional variation between cells in an automated and scalable fashion, especially when meaningful clusters or cellular labels are not available to stratify samples. To address this problem, we have developed VISION: a software package that relies on a new and flexible annotation approach and includes a scalable back-end as well as an interactive and feature rich front-end. VISION operates on manifolds that capture the functional relationships between cells and leverages the concept of transcriptional signatures to interpret the meaning of variability on this manifold. These manifolds can either be internally computed by VISION or provided by an external machine learning algorithm like scVI or SIMLR. Many key features distinguish VISION, including the abilities to: (1) operate both with or without pre-conceived stratifications (e.g. clusters or sample labels) allowing users to identify crucial gradients of functional variation in the data when clusters are not meaningful; (2) fit downstream of any method for manifold learning, clustering, and trajectory inference to provide functional interpretation of their output; (3) enable the exploration of the transcriptional effects of meta-data; (4) scale analysis to hundreds of thousands of cells in less than an hour; and (5) facilitate collaborative projects with its low-latency and interactive web-based output report. To demonstrate the utility of VISION, we first applied it to a new dataset of 67k Peripheral Blood Mononuclear Cells obtained from a cohort of Systemic Lupus Erythematosus (SLE) patients and healthy controls. Specifically, in the B cell compartment, we identify important transcriptional signatures that distinguish healthy from disease samples, as well as signatures that track with the severity of the disease. 
We also illustrate that VISION is able to identify specific biological programs that characterize single cell trajectories, as demonstrated with a published dataset of ~5k Hematopoietic Stem Cells (HSCs). Overall, VISION offers a dynamic and high-throughput method for profiling variation and heterogeneity in scRNA-seq datasets; we anticipate its scalability and flexibility in analysis and visualization will allow it to become a powerful tool as scRNA-seq continues to mature into a widely used technology capable of profiling hundreds of thousands of cells in a single experiment.

RSG-76: A High-Performance Multi-Classifier System Using Hybrid-Overlay Features to Predict miRNA:Protein Interactions
Topic: microRNA target prediction multi-classifier machin
  • Sujeethraj Koppolu, Icahn School of Medicine at Mount Sinai, United States
  • Bin Zhang, Icahn School of Medicine at Mount Sinai, United States

Short Abstract: microRNA (miR) mediated silencing of mRNA molecules is recognized as a major mechanism for the post-transcriptional regulation of gene expression and mRNA translation. These short endogenous noncoding RNAs bind to target mRNAs through complementary base pairing between the seed region of miRs and a matching region on the mRNAs. Although several factors contributing to miR targeting, such as sequence specificity, target site availability and thermodynamic stability, have been well studied, accurate prediction of valid miR targets remains a daunting challenge due to several other unknown factors that seem to influence the process. Existing methods predict that all human genes may be subject to miR regulation, with each mRNA potentially targeted by hundreds of miRs and each miR potentially targeting hundreds or thousands of genes. Despite the intricate, cross-interfering nature of miRs, most computational methods focus only on individual miR-mRNA features. While such an approach provides valuable information regarding the pairing, it is not enough to understand the overall regulatory effects of the miRs. Here, we propose to integrate a multi-network approach in which individual features between miR-mRNA pairs are overlaid with miR-miR and mRNA-mRNA features to form hybrid-overlay features. These features capture the overlapping and compensatory nature of miRNA regulation and improve the efficiency of miRNA target prediction. Using 28 individual features encapsulating sequence-, accessibility- or conservation-based factors for miR-mRNA pairs, we generated 336 hybrid-overlay features by overlaying 12 miR-miR and mRNA-mRNA features, such as sequence homology, conservation and expression correlations, onto the 28 individual features. We used the publicly available TarBase v7.0 and miRTarBase datasets for labeling the validated miR-mRNA targets. 
Comparing 7 different classifier methods including Decision Trees, k-Nearest Neighbors, Random Forests, Neural Networks, Support Vector Machines, DeepBoost and Multilayer Perceptrons, we identified three such methods that significantly improved prediction efficiency with the hybrid-overlay features. Further, integrating the scores from the different classifier methods, we calculated a single overall multi-classifier score that represented the reliability of such predictions across multiple methods.
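
The feature count follows from pairing every individual feature with every overlay feature (28 x 12 = 336). The sketch below only enumerates those combinations; the feature names are placeholders, and how each overlay is actually computed is not described in the abstract, so no formula is assumed.

```python
from itertools import product


def hybrid_overlay_names(n_individual=28, n_overlay=12):
    """Enumerate one hybrid-overlay feature per (individual, overlay) pair.

    Names are hypothetical placeholders; only the combinatorial structure
    (every pair feature crossed with every overlay feature) is illustrated.
    """
    individual = [f"pair_feature_{i:02d}" for i in range(1, n_individual + 1)]
    overlay = [f"overlay_feature_{j:02d}" for j in range(1, n_overlay + 1)]
    return [f"{a}__x__{b}" for a, b in product(individual, overlay)]


if __name__ == "__main__":
    names = hybrid_overlay_names()
    print(len(names), names[0], names[-1])
```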

RSG-77: Topic-models for learning epigenetic profiles jointly from SELEX and ATAC-seq
Topic: epigenomics atac-seq selex topic model transcripti
  • Meghana Kshirsagar, Memorial Sloan Kettering Cancer Center, United States
  • Han Yuan, Cornell University, United States
  • Lee Zamparo, Memorial Sloan Kettering Cancer Center, United States
  • Christina Leslie, Memorial Sloan-Kettering Cancer Center, United States

Short Abstract: Determining the cell-type specific and genome-wide binding locations of transcription factors (TFs) is an important step towards decoding their regulatory behavior in different cell states. Profiling by ATAC-seq (assay for transposase-accessible chromatin using sequencing) reveals open chromatin sites that are potential binding sites for TFs, but does not identify which TFs occupy a given site. As an alternative to resource-intensive assays such as TF ChIP-seq, we present a computational approach that determines the binding sites of specific TFs directly from ATAC-seq. Our approach deconvolves the combined signal of TF binding provided by ATAC-seq into components representing each TF by exploiting data from in vitro experiments such as SELEX (systematic evolution of ligands by exponential enrichment). Like ChIP-seq, these in vitro experiments give us a measure of protein-DNA binding specificity of individual TFs, albeit independent of cell-type specific chromatin organization. We combine the in vitro binding specificity from SELEX with in vivo chromatin accessibility from ATAC-seq to learn models for cell-type specific binding. To deconvolve the ATAC-seq signal, we use probabilistic mixture models to represent the mixture of signals from individual TFs. In particular, we use a hierarchical model called latent Dirichlet allocation (LDA or topic model) to represent DNA sequence as a probabilistic distribution over TFs, and TFs in turn as a probabilistic distribution over k-mers. In topic model terminology, TFs are topics and k-mers are words. We train a joint model on DNA sequence in the form of peaks from ATAC-seq and enriched oligomers from SELEX. We also try to incorporate SELEX data in the form of a hyper-prior to a model trained solely on ATAC-seq peaks. We find that our models automatically learn topics mainly for TFs that are expressed in the specific cell-type. 
We further find that the topic-based representations of TFs are similar for TFs with similar binding specificities (e.g., EOMES, TBX4). We also learn different topic representations for the same TF in different cell types (GM12878 and K562). Finally, we automatically learn a topic corresponding to the cleavage bias of Tn5 transposase, the enzyme used in ATAC-seq. In preliminary experiments, our model’s predictive performance is comparable to that of FIMO, a motif-based model, using position weight matrices derived from the same SELEX-seq data.
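
The "k-mers are words" framing above can be illustrated with a minimal sketch, assuming each ATAC-seq peak or SELEX oligomer is treated as a document whose words are its overlapping k-mers. The function name `kmer_counts`, the choice of k = 3, and the toy sequences are illustrative assumptions, not the authors' implementation; a topic model such as LDA would then be fit to count vectors of this kind.

```python
from collections import Counter

def kmer_counts(sequence, k=3):
    """Count overlapping k-mers in a DNA sequence (the 'words' of one 'document')."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip windows that contain ambiguous bases such as N.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts

# Hypothetical toy sequences standing in for ATAC-seq peaks.
peaks = ["ACGTACGT", "TTTTNACG"]
documents = [kmer_counts(p, k=3) for p in peaks]
```

Each entry of `documents` is a bag-of-words vector; stacking them gives the document-by-word matrix that a topic model takes as input.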

RSG-78: Cross-trial meta- and time-series pharmacodynamic analyses of Apremilast effects
Topic: apremilast psoriasis psoriatic arthritis ankylosin
  • Irina Medvedeva, Celgene, United States
  • Matthew E. Stokes, Celgene, United States
  • Peter Schafer, Celgene, United States
  • Robert Yang, Celgene, United States

Short Abstract: Apremilast, a selective small-molecule inhibitor of phosphodiesterase 4, is an approved oral drug for psoriasis and psoriatic arthritis. In the present study, we sought to derive apremilast-specific effects on pharmacodynamic biomarkers in patients across multiple Phase 3 clinical trials, including ankylosing spondylitis (AS), psoriatic arthritis (PSA), and psoriasis (PSOR). We analyzed the protein expression of 150 analytes measured by the Myriad RBM DiscoveryMAP immunoassay panel. Longitudinal treatment and response effects were modeled separately and combined across all trials. We identified biomarkers linked with disease scores, as well as statistically robust biomarkers reflecting apremilast treatment through meta-analyses. In recognition of patient variability, we further applied time-series analyses using the CoGAPS method to derive distinct patterns composed of different groups of analytes. Overall, apremilast-induced downregulation effects were consistent with those observed in the non-responder versus responder comparisons. Apremilast-induced upregulation, however, appeared in patterns with more complex expression behavior. Our findings provide a foundational knowledge base connecting clinical endpoints, pharmacodynamic biomarkers, and the future design of in vitro studies.

RSG-79: Analysis of ChIP-exo read profiles reveals spatial organizations of protein complexes
Topic: transcription factor binding chip-exo DNA-protein
  • Naomi Yamada, The Pennsylvania State University, United States
  • Nina Farrell, Harvard University, United States
  • B. Franklin Pugh, The Pennsylvania State University, United States
  • Shaun Mahony, The Pennsylvania State University, United States

Short Abstract: Each subunit of a regulatory protein complex uniquely associates with the genome via protein-DNA or protein-protein interactions. The ChIP-exo protocol precisely characterizes protein-DNA crosslinking patterns by combining ChIP with 5’ to 3’ exonuclease digestion. Since the physical distance of a regulatory protein to the DNA affects crosslinking efficiency, analysis of ChIP-exo read enrichment patterns should enable detection of the precise genomic positions that regulatory complexes associate with. Most approaches to analyzing multiple high-resolution assays merely catalog the broad regions that are enriched for sequencing reads. However, analysis of the sequencing read distribution shapes created by the exonuclease can potentially enable greater levels of biological insight by identifying the DNA-protein interaction preferences of proteins or the modes by which they bind. Here, we present a computational pipeline that simultaneously analyzes ChIP-exo read patterns across multiple experiments and infers spatial organizations of the proteins. Identifying protein-DNA crosslinking positions from individual sites is not feasible due to high read count variation at single binding events. To overcome this problem, we use a guide tree to progressively align the strand-separated ChIP-exo read patterns and produce representative ChIP-exo read profiles at a set of binding events. The guide tree potentially enables detection of multiple binding modes of proteins when appropriate. This approach is broadly applicable to many proteins because it does not require motifs or pre-defined references such as TSSs to align binding events. Given a set of aligned ChIP-exo read profiles across multiple proteins, we use a probabilistic mixture model to deconvolve the ChIP-exo read patterns into DNA-protein crosslinking sub-distributions. 
The method allows consistent measurement of the crosslinking strengths of DNA-protein interactions across multiple ChIP-exo experiments. Lastly, we perform multidimensional scaling (MDS) to visualize the crosslinking preferences of the regulatory proteins. We have applied the ChIP-exo analysis methods to a set of proteins that organize transcription pre-initiation complex (PIC) assembly of the yeast tRNA gene. Our results demonstrate that the inferred protein organization closely recapitulates the known organization of the tRNA PIC, thereby confirming that detailed analysis of ChIP-exo reads enables us to understand the precise organization of protein-DNA complexes.

RSG-80: Predictive Metagenomic Analysis of Autoimmune Disease
Topic: human gut microbiome autoimmune disease predictive
  • Angelina Volkova, NYU Medical Center, United States
  • Kelly Ruggles, NYU Medical Center, United States

Short Abstract: Background: The gut microbiome has been implicated in autoimmune disease by multiple studies in the past decade. The question of whether there are common microbial features characterizing general autoimmunity still remains. The goal of our study was to identify taxa and gene pathways that are predictive of general autoimmunity and of distinct autoimmune diseases. To do this, we compared the gut microbiome taxa and pathways between different autoimmune diseases using data from 35 studies from the past 8 years, developing a machine learning-based approach for integrative analysis of 16S and shotgun metagenomics data sequenced on different platforms. Methods: The majority of the identified studies with human fecal samples investigated inflammatory bowel disease, multiple sclerosis, and rheumatoid arthritis, as well as much rarer autoimmune diseases, such as Behçet’s syndrome. Initially, we analyzed only 16S rRNA data (30 studies). Sequences were preprocessed with QIIME 2 (v. 2017.12) and the metagenome functional content of the samples was obtained with PICRUSt (v. 1.0.0). Predictive models of disease status were built separately for taxa and pathway-level features using the caret package in R, based on Lasso, random forest and support vector machine with recursive feature elimination algorithms, as follows. The data were split into training (90%) and test (10%) sets; the training sets were then used to build models with 3-times repeated 9-fold cross-validation, and the test sets were used to validate the results. Results: The highest Area Under the Curve (AUC), a measure of model accuracy, for the prediction of general autoimmunity was 0.871 and 0.744 at the taxa and pathway levels, respectively. For general autoimmunity, the most important taxa for distinguishing between disease and healthy states were the genera Coprococcus, Parabacteroides, Roseburia and Blautia. 
The most predictive pathways included flavonoid biosynthesis, cell division, cell motility and secretion, ion channels, and alpha-linolenic acid metabolism. The features for the distinct autoimmune diseases, and for comparisons of autoimmune diseases with each other, were also obtained, with AUCs above 0.84. Conclusion: Our analysis showed that there are common taxa and pathway signatures of the gut microbiome in autoimmune diseases despite the sequencing platform and population differences between the studies used in the analysis, which was confirmed by three predictive models. To fully understand the microbiome features underlying autoimmunity, shotgun metagenomics studies will be added to the predictive models to further identify predictive features of general autoimmunity and specific disease pathology.
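
The evaluation scheme described above (a 90%/10% train/test split, 3-times repeated 9-fold cross-validation on the training set, and AUC as the metric) can be sketched as follows. This is a standard-library Python stand-in for illustration only; the abstract states that the actual models were built with the caret package in R, and the function names, seeds, and sample count here are assumptions.

```python
import random

def split_train_test(n_samples, test_fraction=0.1, seed=0):
    """Shuffle sample indices and hold out a fraction as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def repeated_kfold(train_indices, k=9, repeats=3, seed=0):
    """Yield (fit, validation) index lists for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = train_indices[:]
        rng.shuffle(shuffled)
        # Deal the shuffled indices into k folds of near-equal size.
        folds = [shuffled[i::k] for i in range(k)]
        for i, validation in enumerate(folds):
            fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield fit, validation

def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical cohort of 100 samples.
train, test = split_train_test(100)
splits = list(repeated_kfold(train))
```

The held-out test indices are never seen during cross-validation, so an AUC computed on them estimates performance on new samples rather than on the data used for model selection.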

RSG-81: Predictive analysis of response to androgen-deprivation in prostate cancer
Topic: Treatment resistance Personalized therapeutic advi
  • Sukanya Panja, Rutgers University, United States
  • Sheida Hayati, Rutgers University, United States
  • Nusrat Epsi, Rutgers University, United States
  • James Scott Parrott, Rutgers University, United States
  • Antonina Mitrofanova, Rutgers University, United States

Short Abstract: Resistance to androgen-deprivation is a central problem in prostate oncology. Since prostate cancer progression and maintenance depend on androgens, androgen-deprivation therapy (ADT) has been a mainstay of treatment for advanced disease. Even though patients initially respond to androgen deprivation, the majority develop resistance and relapse, progressing to castration-resistant disease, which is nearly always metastatic and lethal. Prioritization of patients for androgen-deprivation administration could provide invaluable survival benefits, especially for patients with advanced malignancy. We have developed an integrative machine learning genome-wide computational approach, Epi2GenR, to uncover the interplay between epigenomic (i.e., DNA methylation) and genomic (i.e., mRNA expression) mechanisms that govern resistance to androgen-deprivation and to predict patients' favorable or poor response to androgen-deprivation prior to therapy administration. Our method has uncovered a panel of 5 differentially methylated sites that can explain changes in expression of their site-harboring genes (TTC27, STMN1, FOSB, FKBP6, and CSPG5) and has shown their significant ability to predict primary resistance to androgen-deprivation (hazard ratio=4.6). Furthermore, the 5 site-gene panel was able to accurately predict response to ADT across five independent patient cohorts and was independent of disease aggressiveness, tumor grade at diagnosis, age, and commonly utilized prostate cancer prognostic markers. We have demonstrated that our method is robust to noise (i.e., increased false positive and false negative rates) and has significant predictive ability when compared to predictions at random (p=0.01). Finally, we simulated a situation in which a new incoming patient needs to be assigned a risk of resistance, using leave-one-out cross-validation across five independent cohorts, and demonstrated that our panel can predict risk of resistance to ADT with 90% accuracy. 
We propose that the identified 5 site-gene panel could be utilized to pre-screen patients to prioritize those who would benefit from ADT and patients at risk of developing resistance, who should be offered alternative treatment regimens. Such discovery has a near-term potential to enhance personalized therapeutic advice for patients with advanced malignancy and improve prostate cancer management at large.

RSG-82: Discovery and Validation of a Gut Microbiome that Improves Athletic Performance
Topic: microbiome metagenomics metabolism exercise physio
  • Jonathan Scheiman, Harvard University, United States
  • Jacob Luber, Harvard University, United States
  • Theodore Chavkin, Harvard University, United States
  • Tara MacDonald, Harvard University / JDC, United States
  • Angela Tung, Harvard University, United States
  • Loc-Duyen Pham, Harvard University / JDC, United States
  • Marsha Wibowo, Harvard University, United States
  • Renee Wurth, Harvard University, United States
  • Sukanya Punthambaker, Harvard University, United States
  • Braden Tierney, Harvard University, United States
  • Zhen Yang, University of Waterloo / JDC, Canada
  • Mohammad Hattab, Harvard University, United States
  • Sarah Lessard, Harvard University / JDC, United States
  • George Church, Harvard University, United States
  • Aleksandar Kostic, Harvard University / JDC, United States

Short Abstract: In elite athlete stool samples, the microbe Veillonella increases in abundance post-exercise. Metagenomic analysis of samples from elite athlete cohorts shows that this increase in abundance is correlated with an increase in the microbial methylmalonyl-CoA pathway, which breaks down lactate into the short-chain fatty acid propionate. Gavage of athlete-derived Veillonella or its metabolic end product propionate into mice increases athletic performance and reduces inflammatory response. Thus, we propose a model in which specific gut bacteria directly enhance athletic performance.

RSG-83: Predicting Gain-of-Function and Loss-of-Function Mutations
Topic: Gain-of-function Loss-of-function NLP machine lear
  • Cigdem Sevim Bayrak, Icahn School of Medicine at Mount Sinai, United States
  • Yuval Itan, Icahn School of Medicine at Mount Sinai, United States

Short Abstract: Gain-of-function (GOF) and loss-of-function (LOF) mutations in the same gene may result in different disease phenotypes and hence require different drug treatments. Numerous computational tools are available that can predict the fitness effects of genetic variants (damaging or neutral). However, these methods do not accurately predict whether a variant results in reduced (LOF) or enhanced (GOF) protein activity. To address this need, we aimed to develop a computational method that will accurately differentiate GOF from LOF mutations. We generated the first extensive database of all currently known disease-causing GOF and LOF mutations by using natural language processing (NLP) on the abstracts of the Human Gene Mutation Database (HGMD). We then annotated the GOF and LOF mutations in this dataset with protein-level features including protein domain, secondary structure, solvent accessibility, and flexibility, using sequence-based methods and homology modeling. We further annotated the GOF and LOF mutations with various genomic and population genetic features. We demonstrated that protein-level and genomic features have highly significant power to discriminate between GOF and LOF mutations. Our results systematically describe the differences between GOF and LOF mutations for the first time, and suggest a novel computational method to classify mutations using machine learning techniques based on protein- and genomic-level features.

RSG-84: Finding transcriptional regulators central to rheumatoid arthritis with transcriptomics of IL-17 dose response, time series, and siRNA silencing in stromal cells
Topic: RNA-seq gene regulation inflammation time series d
  • Kamil Slowikowski, Harvard University, United States
  • Hung N Nguyen, Harvard University, United States
  • Gerald F M Watts, Harvard University, United States
  • Fumitaka Mizoguchi, Tokyo Medical and Dental University, Japan
  • Erika H Noss, University of Washington, United States
  • Michael B Brenner, Harvard University, United States
  • Soumya Raychaudhuri, Harvard University, United States

Short Abstract: Rheumatoid arthritis (RA) is characterized by chronic joint inflammation, with a persistent proinflammatory feedback loop between stromal cells of the tissue and infiltrating immune cells. The inflammatory cytokines tumor necrosis factor alpha (TNF) and interleukin 17 (IL-17) are elevated in RA synovial fluid. TNF and IL-17 costimulation synergistically activates synovial fibroblasts to produce many cytokines and chemokines, including the hallmark cytokine IL-6, one of the main targets of biologic therapies for inflammatory diseases. To identify transcriptional mediators of TNF and IL-17 synergy, we designed a large-scale time series and IL-17 dose-response study. We isolated synovial fibroblasts from the joint tissues of patients with osteoarthritis and RA, and used RNA-seq to assay the genome-wide transcriptional response to TNF and different dosages of IL-17 (0, 1, and 10 ng/mL). By modeling the time-series data with basis splines, we identified 409 genes with expression proportional to IL-17 dose at 1% false discovery rate (FDR). Many of these genes have been shown to be regulated by NFKB, but NFKB expression was not responsive to IL-17 dose. Instead, one of its cofactors, NFKB inhibitor zeta (NFKBIZ), was responsive to IL-17 dose both in terms of gene expression and splice isoform abundance. Silencing NFKBIZ by siRNA suppressed the protein levels of IL-6, IL-8 (CXCL8), and MMP3 by 82% after TNF and IL-17 costimulation, but only by 8% after TNF stimulation. In addition, we found other putative mediators of TNF and IL-17 synergy by using differential expression analysis and transcription factor motif enrichment analysis, including CUX1, STAT3, STAT4, LIFR, and ELF3. Next, we used siRNAs to silence these mediators, and we assayed the downstream genome-wide effects with a second RNA-seq time series. 
Differential expression analysis of the siRNA RNA-seq data suggested that CUX1 regulates the CXC chemokines CXCL1, CXCL2, CXCL3 and CXCL8 in synovial fibroblasts. By chromatin immunoprecipitation, we found that CUX1 is recruited to the promoters of these genes after stimulation. Since these chemokines are potent chemoattractants for neutrophils, we hypothesized that CUX1 might be required for neutrophil recruitment. Using a transwell migration assay, we found that CUX1 silencing significantly reduced the number of neutrophils that migrated through transwells. In summary, we used transcriptomics in a time series and dose-response study to characterize the inflammatory response to TNF and IL-17 in synovial fibroblasts, and we identified CUX1 and NFKBIZ as two key regulators of fibroblast-mediated inflammation.

RSG-85: miRNA-based Disease Subtyping by integrating miRNAome and Interactome
Topic: Non-coding RNAs Network analysis Interactome miRNA
  • Marissa Sumathipala, Harvard Medical School Channing Division of Network Medicine, United States
  • Marc Santolini, Center for Research and Interdisciplinarity, France
  • Amitabh Sharma, Harvard Medical School Channing Division of Network Medicine, United States

Short Abstract: Diseases are not driven by a single biological mediator but arise from perturbations in cellular interaction networks. As small non-coding RNAs that regulate the expression of multiple disease-related genes, miRNAs are part of a complex regulatory network in which each miRNA regulates several genes and a gene is regulated by several miRNAs. While gene-disease networks and their emergent properties are well studied, the mapping of miRNA-disease networks is in its infancy. We take a conceptually different approach from prior miRNA-disease network studies. First, we predict a miRNA-disease network model without a priori information about miRNA-disease associations and validate the predictions with experimental data. Second, we explore how diseases and miRNAs are linked at a higher level of association through analysis of the miRNA-disease network’s emergent properties. We first construct a multipartite miRNA-gene-disease network by joining miRNA-gene, gene-gene, and gene-disease bipartite networks. We collapse the multipartite network with a network diffusion algorithm utilizing random walks that scores miRNA candidates by their proximity to disease genes, yielding the final weighted miRNA-disease network (MDN). The MDN is validated with experimental miRNA-disease data from dbDEMC; strong, significant positive correlations (r>0.25, p<0.0001, for six example cancers) between differential miRNA expression and our predicted miRNA-disease edge weights demonstrate biologically accurate miRNA rankings. We uncovered miRNA-based disease subtype classification from disease projections, which have been explored in gene space but not in miRNA space. We threshold the MDN to retain only high-ranked miRNAs per disease, then create a disease projection in miRNA space (DDM), where nodes are diseases and two diseases are connected if they share the same miRNA. Next, we collapse the diseases into broad disease types using ICD-9 medical codes. 
In comparison to a gene space disease projection, the DDM has greater biologically relevant disease clustering by disease type. The DDM has distinguishable, homogeneous clusters, arising from phenotypically similar diseases being regulated by the same miRNAs. In contrast, the gene space projection is densely connected with fewer distinguishable clusters. We quantify this disease clustering by type using mutual information, which measures the alignment between structural communities and disease type. Mutual information in miRNA space is 2.3, compared to only 1.42 in gene space, indicating that diseases tend to agglomerate by type in miRNA space more so than in gene space. Our miRNA-disease network model for disease subtyping enables more accurate and comprehensive disease-disease relationships and prioritization of disease-specific miRNAs critical to a disease subgroup.
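
The mutual information measure mentioned above can be illustrated with a minimal sketch that compares a community assignment with a disease-type labeling of the same nodes. The abstract does not state the logarithm base or any normalization, so the use of bits here is an assumption, as are the function name and the toy labels; this sketch is not expected to reproduce the reported values of 2.3 and 1.42.

```python
from collections import Counter
from math import log2

def mutual_information(communities, disease_types):
    """Mutual information (in bits) between two labelings of the same nodes."""
    n = len(communities)
    joint = Counter(zip(communities, disease_types))
    community_sizes = Counter(communities)
    type_sizes = Counter(disease_types)
    mi = 0.0
    for (c, t), count in joint.items():
        # p(c, t) / (p(c) * p(t)) simplifies to count * n / (|c| * |t|).
        ratio = count * n / (community_sizes[c] * type_sizes[t])
        mi += (count / n) * log2(ratio)
    return mi

# Hypothetical toy example: communities that align perfectly with disease type.
aligned = mutual_information([0, 0, 1, 1],
                             ["cancer", "cancer", "metabolic", "metabolic"])
# Communities that are independent of disease type.
independent = mutual_information([0, 0, 1, 1],
                                 ["cancer", "metabolic", "cancer", "metabolic"])
```

Higher values mean that knowing a disease's network community tells you more about its type, which is the sense in which the abstract compares miRNA space with gene space.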

RSG-86: Co-factor driven landscapes define Hox TF binding in motor neuron subtype specification
Topic: Cell programming TF binding specificity Sequence-c
  • Divyanshi Srivastava, The Pennsylvania State University, United States
  • Milica Bulajić, New York University, United States
  • Shaun Mahony, The Pennsylvania State University, United States
  • Esteban Mazzoni, New York University, United States

Short Abstract: Progenitor motor neurons first emerge in the ventral neural tube, under sonic hedgehog, retinoic acid and FGF gradients. Motor neuron progenitors develop positional (columnar) subtype identities along the anterior-posterior neural tube axis under Hox transcription factor control. Limb-innervating motor neurons (LMCs) develop at the brachial and lumbar levels, concurrent with the expression of Hoxc6 and Hoxc10. Neurons that innervate the sympathetic ganglia (PGCs) emerge at thoracic levels, driven by Hoxc9 expression. Since the Hox TFs bind DNA through a conserved homeodomain, they recognize similar sequence motifs in the genome. Thus, whether inherent Hox TF sequence preferences are sufficient to confer Hox phenotypic specificity remains unclear. In vitro binding assays have shown that the binding of Hox TFs in complex with the TALE family of TFs leads to emergent differences in the sequence specificity of Hox TFs. Further, TALE family proteins are not the only TFs implicated in Hox binding; Foxp1 has been established as an accessory factor critical for Hox-driven motor neuron subtype specification. However, how the TALE co-factors and Foxp1 drive in vivo Hox binding remains unknown. Here, we aim to characterize the combinatorial logic through which TALE, Hox and Foxp1 TFs specify uniquely functional neuronal identities in vivo. Since motor neurons form a rare cell population in vivo, we derive motor neuron subtypes by Hox TF overexpression in ES-derived progenitor motor neurons. The induced neurons show markers consistent with in vivo LMC and PGC identity. Such induced motor neurons allow us to probe the genome-wide binding of the Hox, Meis and Foxp1 TFs. We find that Hox TFs bind both shared and unique sites in the genome. The shared Hox binding sites encode general motor neuron features, whereas unique Hox binding sites exist at subtype-specific genes. We observe that various Hox TFs exhibit variability in their interactions with TALE co-factors. 
We also find that the Hox TFs are constrained differentially by the chromatin accessibility at target genomic loci. In conclusion, our results suggest that interactions with co-factors enable the Hox TFs to bind their specific regulatory targets within a complex sequence-chromatin landscape.

RSG-87: Discovery of biased orientation of DNA motif sequences affecting enhancer-promoter interactions and transcription of genes
Topic: Transcriptional regulation Gene expression Chromat
  • Naoki Osato, Department of Bioinformatic Engineering, Graduate School of Information Science & Technology, Osaka University, Japan

Short Abstract: Chromatin interactions have important roles in enhancer-promoter interactions (EPI) and in regulating the transcription of genes. CTCF and cohesin proteins are located at the anchors of chromatin interactions, forming their loop structures. CTCF has an insulator function, limiting the activity of enhancers to within the loops. The DNA binding sequences of CTCF show an orientation bias at chromatin interaction anchors: forward-reverse (FR) orientation is frequently observed. However, it is still unclear which proteins are associated with chromatin interactions. To find DNA binding motif sequences of transcription factors (TFs), such as CTCF, that affect EPI and the transcription of genes, I developed a novel computational method to estimate the accuracy of EPI prediction based on the expression level of putative transcriptional target genes. I predicted human transcriptional target genes of TFs bound in open chromatin regions in enhancers and promoters in monocytes and other cell types using experimental data in public databases. Transcriptional target genes were predicted based on enhancer-promoter association (EPA). EPA was shortened at the genomic locations of FR or reverse-forward (RF) orientation of DNA binding motifs of TFs, which were supposed to be at chromatin interaction anchors. To examine the effect of EPI, the expression level of target genes predicted based on EPA was compared with that of target genes predicted from promoters only. Some transcription factors showed a significant difference in the distribution of expression levels of their target genes between enhancers and promoters, implying that transcription factors bound in enhancers affected the expression level of their target genes. 
In total, 287 DNA motifs with biased orientation (159 forward-reverse (FR) and 153 reverse-forward (RF) orientation) significantly affected the expression level of putative transcriptional target genes in monocytes of four people in common, and included known TFs associated with chromatin interactions, such as CTCF, cohesin (RAD21 and SMC3), ZNF143 and YY1. DNA motif sequences with biased orientation tend to be co-localized in the same open chromatin regions. Some pairs of biased-orientation DNA motif sequences of TFs, including the same or different pairs of TFs, were enriched upstream and downstream of the same genes. Moreover, EPI predicted using FR or RF orientation of some DNA motifs, including those of CTCF and cohesin, overlapped with chromatin interaction data (Hi-C and HiChIP) more than other EPA. Reference: Osato N, BMC Genomics 2018, https://doi.org/10.1186/s12864-017-4339-5; Osato N, bioRxiv 2018, https://doi.org/10.1101/290825

RSG-88: Discovering structural units of chromosomal organization with graph-regularized non-negative matrix factorization
Topic: high-throughput chromosome conformation capture ch
  • Da-Inn Lee, University of Wisconsin-Madison, United States
  • Sushmita Roy, University of Wisconsin-Madison, United States

Short Abstract: Three-dimensional organization of the genome is emerging as an important determinant of cell-type specific expression and is implicated in many diseases, including cancer (Bouwman & de Laat 2015). Hi-C is a type of high-throughput chromosome conformation capture (3C) assay which can be used to study the three-dimensional organization of chromosomes (Lieberman-Aiden et al. 2009). Analysis of Hi-C data can reconstruct the building blocks that give rise to or result from the organizational principles of the genome: topologically associating domains (TADs), transcriptionally active compartments, chromatin loops, and chromosomal territories (Gibcus & Dekker 2013). Recent studies comparing TAD-finding methods (Forcato et al. 2017, Dali & Blanchette 2017) found the methods to vary significantly in their replicability and stability across sequence depth, sparsity, and resolution of the input data, suggesting the need for more robust methods. Here we present GRiNCH, an approach based on Non-Negative Matrix Factorization (NMF) to identify organizational units of chromosomes from Hi-C data. NMF is a powerful dimensionality-reduction technique that can recover low-dimensional representations of images, texts, and biological data (Lee & Seung 2000). GRiNCH extends the NMF framework by using a graph regularization term (Cai et al. 2011) that encourages nearby genomic regions in a similar chromatin state or with a similar insulator binding pattern to converge to a similar low-dimensional state. Our results show that GRiNCH can recover clusters with TAD-like properties whose boundaries show a significant association with the presence of CTCF binding. Compared to existing TAD-finding methods, GRiNCH clusters are more stable on sparse and low-depth Hi-C datasets. Finally, through a matrix completion process, GRiNCH can impute missing interaction counts and offer a smoothed Hi-C matrix comparable in quality to the smoothing process employed by methods like HiCRep (Yang et al. 2017). Taken together, GRiNCH offers a promising approach to mining biologically meaningful structural domains of the genome.

RSG-89: METCC: METric learning for Confounder Control Making distance matter in high dimensional biological analysis
Topic: Machine Learning Cancer Genomics Metric Learning B
  • Kabir Manghnani, University of Illinois Urbana Champaign, United States
  • Adam Drake, Freenome Inc., United States
  • Nathan Wan, Freenome Inc., United States
  • Imran Haque, Stanford University, United States

Short Abstract: Next-generation sequencing assays that measure genome-wide signals are widely used tools in the biomedical community. The use of cfDNA sequencing in particular is growing in both academia and industry due to its ease of acquisition and utility. However, cfDNA assays are highly sensitive to technical processes, as any biological signal of interest is often obscured by biases from sample collection, processing procedure, sequencing, etc. Because of this, cfDNA data is likely to exhibit numerous, not necessarily independent, latent factors that cannot be modeled as a simple factorizable separation of data into “true expression signal” and “unknown covariates”. We formulate the problem of normalizing out covariate biases as supervised learning of a latent embedding subject to particular constraints. We then propose a novel model to account for these covariates in expression data that is a direct optimization of the objectives of this normalization problem: GENE-ML: GENomic Embedding via Metric-Learning. Specifically, within this framework we learn a low-dimensional representation in which samples are similar (or dissimilar) only with respect to relevant biology; i.e., within the representation, similarities induced by systematic biases are removed while dissimilarities between distinct biological classes are preserved. This is accomplished by learning a distance function D_w: R^k → R that distinguishes inputs that have different labels while enforcing similarity between inputs that have the same label. We then show the effectiveness of GENE-ML by analyzing its embeddings of high-dimensional cfDNA data. We show that they are far less susceptible to confounding biases than embeddings produced by the current standard for genomic normalization (HCP) and low-rank approximation techniques. 
In particular, we analyze 1) the extent to which each of these methods produces embeddings that are still separable by known confounders and 2) the performance of supervised linear models built on top of these embeddings. Additionally, we note the advantage GENE-ML has in terms of practical usage, as it does not require the same high-quality covariate metadata that factorization methods do. Finally, we examine the biological interpretation of these models by investigating the location of the greatest weights of linear versions of the architectures within the genome.
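
The idea of a learned distance that separates differently labeled inputs while pulling same-label inputs together can be sketched with a pairwise contrastive loss over a linear embedding. The abstract does not specify the loss or architecture used by GENE-ML, so the contrastive loss, the linear projection, the margin, and all names below are assumptions chosen for illustration.

```python
import math

def embed(x, weights):
    """Linear embedding: project input vector x with a weight matrix (list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def distance(a, b, weights):
    """Euclidean distance between two inputs in the learned embedding space."""
    ea, eb = embed(a, weights), embed(b, weights)
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(ea, eb)))

def contrastive_loss(a, b, same_label, weights, margin=1.0):
    """Pull same-label pairs together; push different-label pairs at least `margin` apart."""
    d = distance(a, b, weights)
    if same_label:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Hypothetical 2-feature inputs: feature 0 carries biology, feature 1 a batch effect.
# These weights keep feature 0 and discard feature 1.
weights = [[1.0, 0.0]]
same_class_loss = contrastive_loss([0.0, 5.0], [0.0, -5.0], True, weights)
diff_class_loss = contrastive_loss([0.0, 0.0], [0.5, 0.0], False, weights)
```

In the toy example, two same-class samples that differ only in the batch-effect feature incur zero loss, because the embedding ignores that feature; training would adjust `weights` to minimize the summed loss over labeled pairs.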

RSG-90: Bio-Express: Bioinformatics workflow system for massive genomic sequencing data analysis
Topic: Cloud Genome analysis pipeline MapReduce
  • Byungwook Lee, Korean BioInformation Center, South Korea
  • Gunhwan Ko, KOBIC, South Korea
  • Pan-Gyu Kim, KOBIC, KRIBB, South Korea

Short Abstract: While next-generation sequencing (NGS) costs have fallen in recent years, the cost and complexity of computation remain substantial obstacles to the use of NGS in biomedical care and genomic research. The rapidly increasing amounts of data available from the new high-throughput methods have made data processing without automated pipelines infeasible. Integration of data and analytic resources into workflow systems provides a solution to the problem, simplifying the task of data analysis. To address the challenge, we developed a cloud-based workflow management system, Bio-Express, to provide fast and cost-effective analysis of massive genomic data. We implemented complex workflows that make optimal use of high-performance compute clusters. Bio-Express allows users to create multi-step analyses using drag-and-drop functionality and to modify the parameters of pipeline tools. Users can also import Galaxy pipelines into Bio-Express. Bio-Express is a hybrid system, which enables users to combine traditional analysis tools and MapReduce-based big data analysis programs in a single pipeline. Thus, the execution of analytics algorithms can be parallelized, which can speed up the whole process. We also developed a high-speed data transmission solution, KoDS, to transmit large amounts of data at a fast rate. KoDS has a file transfer speed up to 10 times faster than normal FTP and HTTP. The computer hardware for Bio-Express comprises 800 CPU cores and 800 Tb of storage, which enables 500 jobs to run at the same time. Bio-Express is a scalable, cost-effective, and publicly available web service for large-scale genomic data analysis. Bio-Express supports reliable and highly scalable execution of sequencing analysis workflows in a fully automated manner. Bio-Express provides a user-friendly interface that helps genomic scientists select the right results from NGS platform data. 
The Bio-Express cloud server is freely available for use from https://www.bioexpress.re.kr/.

RSG-92: Building a tumor atlas: integrating single-cell RNA-Seq with spatial transcriptomics in pancreatic ductal adenocarcinoma
Topic: Single-cell RNA-Seq Spatial Transcriptomics Cancer
  • Reuben Moncada, Institute for Computational Medicine - NYU Langone Health, United States
  • Florian Wagner, Institute for Computational Medicine - NYU Langone Health, United States
  • Marta Chiodin, Institute for Computational Medicine - NYU Langone Health, United States
  • Joseph Devlin, NYU Langone Health, United States
  • Maayan Baron, Institute for Computational Medicine - NYU Langone Health, United States
  • Cristina Hajdu, NYU Langone Health, United States
  • Diane Simeone, NYU Langone Health, United States
  • Itai Yanai, Institute for Computational Medicine - NYU Langone Health, United States

Short Abstract: To understand tissue architecture it is necessary to understand both which cell types are present and their physical relationships to one another. Single-cell RNA-Seq (scRNA-Seq) has made significant progress towards the unbiased and systematic characterization of cell populations within a tissue by studying thousands of cells in a single experiment. However, the characterization of the spatial organization of individual cells within a tissue has been more elusive. The recently introduced ‘spatial transcriptomics’ (ST) method reveals the spatial pattern of gene expression within a tissue section at a resolution of a thousand 100 µm spots across the tissue, each capturing the transcriptomes of ~20-70 cells. Here, we present an approach for the integration of scRNA-Seq and ST data generated from the same sample of pancreatic cancer tissue and deploy it on primary tumors from two patients. Using markers for cell types identified by scRNA-Seq, we robustly deconvolved the cell type composition of each ST spot to generate a spatial atlas of cell proportions across the tissue. We find that distinct macrophage subpopulations occupy distinct spatial localizations in the tumors. Our results provide a framework for the integration in any tissue of the subpopulation structure defined by scRNA-Seq and tissue architecture revealed by ST. 

RSG-93: Inferring RNA binding specificities from protein sequences by joint matrix factorization
Topic: RNA binding motifs RNA binding proteins joint matr
  • Alexander Sasse, University of Toronto, Canada
  • Quaid Morris, University of Toronto, Canada

Short Abstract: RNA binding proteins (RBPs) are important co- and post-transcriptional regulators of mRNA processing and gene expression. Most RBPs possess preferences towards specific short sequence, or sequence-structure patterns, called motifs. The largest set of sequence preferences has been measured using RNAcompete, an in vitro assay which measures binding strength of the protein of interest to a pool of 250,000 designed RNA sequences (Ray et al. 2013). From these measurements statistically enriched sequence binding preferences can be identified through several computational pipelines (Sasse et al. 2018). About 60% of human RBPs contain one or more RRM domains (Gerstberger, Hafner, and Tuschl 2014). Although RRMs possess a canonical binding cleft between the 2nd and 3rd beta-strands, RNA-protein co-complex structures demonstrate that in the vast majority of RRMs, the specificity determining residues are not in this cleft but are located in terminal or linker regions (Afroz et al. 2015). Previously, 200 RNAcompete measurements were used to infer RNA sequence binding preferences for 30% of the metazoan RBPs using motifs of highly identical sequences of RNA binding domains (Ray et al. 2013). However, many RBPs do not have a clear homolog and inferring specificity by this method for all RBPs would require a prohibitively large set of measured RBPs. Since RRMs don’t possess conserved RNA recognition sites, we sought to design a method, called a recognition map, which would directly map from features from the protein sequence to RBP specificity. We represented protein sequences by the occurrences of possible peptides of length 4 (bags-of-4-mers (Pelossof et al. 2015)) in them and their close homologs. We applied joint matrix factorization (jMF) to enable direct regularized mapping from peptide content to binding specificities. In addition, the jMF enables linear reconstruction of “ideal” bag-of-4-mer vectors for a measured binding preference. 
Mapping the peptide values of this reconstruction back to protein sequences of available co-complex structures showed significant correlations between the predicted scores to residues that are directly involved in RNA-binding. Thus, jMF enabled detection of potential binding sites from RNA binding assays and protein sequence alone without the necessity to learn from RBP-RNA structures.
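The bag-of-4-mers representation mentioned above can be illustrated with a short sketch that counts every overlapping peptide of length 4 in a protein sequence and pools the counts over a protein and its homologs. The function names and the toy sequences in the test are hypothetical; the authors' actual feature construction, weighting of homologs and the joint matrix factorization itself are not described in enough detail to reproduce here.

```python
from collections import Counter


def bag_of_kmers(protein_seq, k=4):
    """Count every overlapping peptide of length k in one protein sequence."""
    seq = protein_seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def pooled_bag(sequences, k=4):
    """Sum the k-mer counts of a protein and its homologs (unweighted).

    Unweighted pooling is an assumption of this sketch.
    """
    total = Counter()
    for seq in sequences:
        total.update(bag_of_kmers(seq, k))
    return total
```

Each protein then becomes a sparse vector indexed by 4-mers, which is the kind of input a factorization model could map to binding specificities.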

RSG-94: Identifying disease-causing mutations, genes and pathways in exomes of Ashkenazi Jewish inflammatory bowel disease patients
Topic: Ashkenazi Jewish IBD whole exome sequencing diseas
  • Yiming Wu, Mount Sinai School of Medicine, United States
  • Aayushee Jain, Mount Sinai School of Medicine, United States
  • Yuval Itan, Mount Sinai School of Medicine, United States

Short Abstract: The Ashkenazi Jewish (AJ) population is predisposed to several diseases, including inflammatory bowel disease (IBD), due to its well-known founder effect. In this study we aim to detect IBD-related variants, genes and pathways within the AJ population. Using whole exome sequencing (WES) data, we genetically identified 4,135 Ashkenazi Jewish samples, comprising 1,314 IBD cases and 2,821 controls. We employed functional impact and population genetics filters to collect credible genetic variants in the IBD and control WES data. We then performed a case-control gene burden analysis to identify variants and genes that are over-represented in the IBD cases. Finally, we identified biological pathways that were significantly enriched in AJ IBD cases. Our results provide insights into high-impact rare variants, the genes harboring them and the associated pathways that contribute to IBD in the AJ population.
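The abstract does not say which statistic was used for the gene burden analysis. One common and simple choice is a one-sided Fisher's exact test on the number of individuals carrying at least one qualifying variant in a gene, sketched below under that assumption. The function name and any carrier counts passed to it are hypothetical; only the cohort sizes (1,314 cases and 2,821 controls) come from the abstract.

```python
from math import comb


def burden_fisher_one_sided(case_carriers, n_cases, control_carriers, n_controls):
    """One-sided Fisher's exact test for carrier enrichment in cases.

    Returns P(X >= case_carriers) under the hypergeometric null, where X is
    the number of carriers that fall among the cases when the carrier total
    is fixed. Assumes a simple carrier/non-carrier collapsing per gene.
    """
    total = n_cases + n_controls
    carriers = case_carriers + control_carriers
    denom = comb(total, carriers)
    upper = min(carriers, n_cases)
    numer = sum(
        comb(n_cases, x) * comb(n_controls, carriers - x)
        for x in range(case_carriers, upper + 1)
    )
    return numer / denom
```

For a cohort of the size reported above, a call would look like `burden_fisher_one_sided(k_cases, 1314, k_controls, 2821)`, with the two carrier counts obtained from the filtered variant calls for one gene.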

RSG-95: Alternative mechanical way to track the path of cancer cells
Topic: cancer tracking viscosity
  • Darya Stepanenko, Okinawa Institute of Science and Technology, Japan
  • Ye Zhang, Okinawa Institute of Science and Technology, Japan

Short Abstract: The life cycle of a cell is strongly connected to the mechanical deformability of its cytoplasm, so changes in mechanical properties indicate the evolution of the cell's state. Instead of a chemical, cell-disruptive approach, we propose tracking the cell's pathway by measuring its mechanical properties in real time. This project implemented a fluorescent particle-tracking technique to monitor the alteration of the actin cytoskeleton in the HeLa cancer cell line induced by a targeted molecular self-assembly compound. We showed that compound treatment increases the viscosity of the cell's cytoplasm to different levels depending on compound concentration and treatment time.

RSG-96: Immunobiochemical Reconstruction of Influenza Lung Infection - Melanoma Skin Cancer Interactions
Topic: incoherent feed-forward loop melanoma cancer infec
  • Evgeni Nikolaev, Rutgers Cancer Institute of New Jersey, New Brunswick, NJ 08903, USA, United States
  • Eduardo Sontag, Laboratory for Systems Pharmacology, Harvard Medical School, Boston, MA 02115, USA, United States

Short Abstract: It was recently reported that acute influenza infection of the lung promoted distal melanoma growth in the dermis of mice. Melanoma-specific CD8+ T cells were shunted to the lung in the presence of the infection, where they expressed high levels of the inflammation-induced cell-activation blocker PD-1 and became incapable of migrating back to the tumor site. At the same time, co-infection virus-specific CD8+ T cells remained functional while the infection was cleared. It was also unexpectedly found that PD-1 blockade immunotherapy reversed this effect. Here, we proceed to ground the experimental observations in a mechanistic immunobiochemical model that incorporates the T cell pathways that control PD-1 expression. A core component of our model is a kinetic motif, which we call a PD-1 Double Incoherent Feed-Forward Loop (DIFFL), and which reflects known interactions between IRF4, Blimp-1, and Bcl-6. The different activity levels of the PD-1 DIFFL components, as a function of the cognate antigen levels and the given inflammation context, manifest themselves in phenotypically distinct outcomes. Collectively, the model allowed us to put forward the following working hypotheses: (i) the melanoma-specific CD8+ T cells re-circulating with the blood flow enter the lung, where they express high levels of the inflammation-induced cell-activation blocker PD-1 in the presence of infection; (ii) when PD-1 receptors interact with abundant PD-L1, constitutively expressed in the lung, T cells lose motility; (iii) at the same time, virus-specific cells adapt to strong stimulation by their cognate antigen by lowering the transiently elevated expression of PD-1, remaining functional and mobile in the inflamed lung while the infection is cleared. The role that T cell receptor (TCR) activation and its feedback play in the underlying processes is also highlighted and discussed. 
We hope that the results reported in our study could potentially contribute to the advancement of immunological approaches to cancer treatment and, as well, to a better understanding of a broader complexity of fundamental interactions between pathogens and tumors.

RSG-97: Pan-cancer analysis of distant metastasis in the context of node-positive and node-negative disease
Topic: Tumor microenvironment Computational biology Pan-c
  • Almudena Espin-Perez, Stanford University, United States
  • Bogdan Luca, Stanford University, United States
  • Aaron Newman, Stanford University, United States
  • Andrew Gentles, Stanford University, United States

Short Abstract: Significant evidence exists that in some cancers, systemic permissiveness for metastasis of malignant cells from primary tumors to distant sites is mediated by interactions occurring in lymph nodes. We conducted a pan-cancer analysis of the association between gene expression levels and time-to-distant metastasis (DMFS), comparing these to associations with overall survival (OS). Expression of MHC genes (particularly class II) showed favorable associations for time-to-metastasis as well as overall survival, with higher expression portending longer survival. However, a significant number of genes showed stronger association with DMFS than with OS, despite the fact that these would be expected to be strongly correlated given the clinical fact that metastasis is a strong driver of death from cancer. By performing these analyses separately in node-positive and node-negative disease, we further identified genes that were differentially related to DMFS. We further explored the influence of specific cell types on DMFS by applying the CIBERSORT algorithm to deconvolve bulk expression profiles. We found that specific immune populations were associated positively or negatively with time to metastasis, based on a previously validated signature matrix of 22 cell types. Notably, modulation of EMT-related genes, and changes in expression of immunosuppressive checkpoint pathways reflected earlier metastasis. Overall, our results provide a map of the relationship between gene expression and cancer metastasis, in the context of lymph node involvement.

RSG-98: TissueTimer: temporal and tissue-ratio estimation from bulk RNA-sequencing samples
Topic: rna sequencing regression data science machine lea
  • Aaron Solomon, The Alan Turing Institute, United Kingdom
  • Daphne Ezer, The Alan Turing Institute, United Kingdom

Short Abstract: Bulk sample differential expression (DE) is a mainstay of modern RNA-sequencing (RNA-seq) pipelines. Whole tissues or organisms are sequenced and compared to elucidate differentially expressed genes associated with disease phenotypes and other biological processes. However, samples comprise numerous subtissues, each of which has its own unique gene expression profile. DE comparisons not only compare subtissues of interest, but also incorporate noise from pooled variation amongst non-corresponding tissues. Ratios of these underlying tissues vary between samples, introducing further variation. Furthermore, compared samples may vary in their developmental stage, which adds further noise to DE comparisons. Failure to account for variation in underlying tissue ratios and developmental time points can lead to artificially high variances and inaccurate differential expression analyses. Computational methods have been developed to estimate tissue ratios from bulk samples, but few provide robust methods to estimate the likelihood of samples originating from the same temporal point and tissue ratio combination. Present methods also consider only short-term temporal variations, not long-term developmental timelines, and few provide routes to re-estimate differential expression after calibrating for estimated tissue ratios. Here, we present TissueTimer, a statistical method for partitioning bulk RNA-seq samples into individual tissue ratios, estimating the developmental timepoint from which the sample originated, and adjusting DE comparisons accordingly. TissueTimer combines multi-dimensional linear regression and optimization procedures with probabilistic kernel density analysis to estimate tissue ratios from a pre-defined reference marker gene set. 
Our method robustly identifies tissues in datasets of A. thaliana gene expression and, in a naive form, generalizes with better than 70% accuracy for tissue labeling tasks on testing data generated by a different laboratory on Arabidopsis grown in different conditions. We demonstrate that TissueTimer is sensitive to long-term developmental time frames as well as short-term (< 24 hour) time changes via analysis of circadian clock RNA-seq data. TissueTimer discerns tissue ratios from bulk samples and is robust against technical noise. Our method estimates probability density functions for the tissues in each estimated sample, enabling between-sample significance testing to assess whether samples share the same origin tissue composition and time. TissueTimer’s regression model naturally extends to partition DE comparisons into real biological variance and undesired latent variation. TissueTimer will therefore enable scientists to reinterpret bulk RNA-seq data by controlling for the detected ratios and time sequences, enabling more accurate differential expression analysis.

RSG-99: Mutation of Mediator subunit CDK8 counteracts the stunted growth in an Arabidopsis MED5 mutant
Topic: plant Mediator transcriptional regulation phenylpr
  • Xiangying Mao, Purdue University, United States
  • Vikki Weake, Purdue University, United States
  • Clint Chapple, Purdue University, United States

Short Abstract: Plant metabolic networks are precisely regulated by the spatial and temporal expression of suites of genes. Among the various transcription (co)factors, the multi-protein Mediator complex has been identified as a hub for transcription regulation. The core Mediator complex, comprising the head, middle and tail modules, functions as a bridge between transcription factors and basal transcription machinery, whereas the CDK8 kinase module plays a repressive regulatory role. It is still unclear, however, how the kinase module represses target genes especially in vivo. Using a forward genetic screen, our lab determined that MED5, an Arabidopsis Mediator tail subunit, is required for maintaining phenylpropanoid homeostasis. A semi-dominant mutant (ref4-3) characterized by a single amino acid substitution in MED5b (G383S) was isolated as a strong suppressor of phenylpropanoid pathway, indicated by decreased soluble phenylpropanoid metabolite accumulation, reduced lignin content and dwarfism. In contrast, loss of MED5a and MED5b results in the hyper-accumulation of phenylpropanoid pathway derivatives. Considering that the CDK8 kinase module is a repressive module in Mediator, we tested the hypothesis that MED5 represses phenylpropanoid pathway by interacting with CDK8. To test this hypothesis, CDK8 knockout lines (cdk8-1) were crossed with ref4-3, and the phenylpropanoid content of the resulting double mutants was evaluated. In ref4-3 cdk8-1 plants, the concentration of sinapate esters and total lignin content are as low as they are in ref4-3, yet the growth defect in ref4-3 is largely rescued. To further determine the genes targeted by MED5 and CDK8 in maintaining proper plant growth, we performed an RNA-seq analysis which showed that a majority of the genes involved in salicylic acid (SA) biosynthesis and signaling are up-regulated in ref4-3 compared to wild type and ref4-3 cdk8-1. 
Consistent with this observation, both free and total SA, both of which have been previously implicated in dwarfing in lignin-modified plants, are accumulated to elevated levels in ref4-3 but not in wild type and ref4-3 cdk8-1. Nevertheless, blocking SA biosynthesis is not sufficient to restore the growth deficiency of ref4-3, suggesting that the hyperaccumulation of SA is more likely to be an effect rather than a cause for its dwarf phenotype. To elucidate how ref4-3 regulates downstream gene targets in a CDK8-dependent manner, we performed RNA polymerase II ChIP-seq analysis in wild type, ref4-3, cdk8-1 and ref4-3 cdk8-1. Together with RNA-seq analysis, we identified that mis-regulation of DJC66, a gene encoding a DNA co-chaperone, is involved in the dwarfism of the med5 mutants.

RSG-100: Alternative splicing links histone modifications to stem cell fate decision
Topic: Alternative splicing Histone modification Cell fat
  • Yungang Xu, University of Texas Health Science Center at Houston, United States
  • Weiling Zhao, University of Texas Health Science Center at Houston, United States
  • Scott D. Olson, University of Texas Health Science Center at Houston, United States
  • Karthik S. Prabhakara, University of Texas Health Science Center at Houston, United States
  • Xiaobo Zhou, University of Texas Health Science Center at Houston, United States

Short Abstract: Background: Understanding the embryonic stem cell (ESC) fate decision between self-renewal and proper differentiation is important for developmental biology and regenerative medicine. Attention has focused on mechanisms involving histone modifications, alternative pre-messenger RNA splicing, and cell-cycle progression. However, their intricate interrelations and joint contributions to ESC fate decision remain unclear. Results: We analyze the transcriptomes and epigenomes of human ESC and five types of differentiated cells. We identify thousands of alternatively spliced exons and reveal their development and lineage-dependent characterizations. Several histone modifications show dynamic changes in alternatively spliced exons and three are strongly associated with 52.8% of alternative splicing events upon hESC differentiation. The histone modification-associated alternatively spliced genes predominantly function in G2/M phases and ATM/ATR-mediated DNA damage response pathway for cell differentiation, whereas other alternatively spliced genes are enriched in the G1 phase and pathways for self-renewal. These results imply a potential epigenetic mechanism by which some histone modifications contribute to ESC fate decision through the regulation of alternative splicing in specific pathways and cell-cycle genes. Supported by experimental validations and extended datasets from Roadmap/ENCODE projects, we exemplify this mechanism by a cell-cycle-related transcription factor, PBX1, which regulates the pluripotency regulatory network by binding to NANOG. We suggest that the isoform switch from PBX1a to PBX1b links H3K36me3 to hESC fate determination through the PSIP1/SRSF1 adaptor, which results in the exon skipping of PBX1. Conclusion: We reveal the mechanism by which alternative splicing links histone modifications to stem cell fate decision.

RSG-101: Tissue-variability of regulatory variant effects and the role of transcription factors
Topic: genetic variation transcription factors transcript
  • Elise Flynn, Columbia University, United States
  • Stephane Castel, Columbia University, United States
  • Pejman Mohammadi, Scripps Research, United States
  • Tuuli Lappalainen, Columbia University, United States

Short Abstract: The GTEx Consortium has identified thousands of genetic regulatory variants (local expression quantitative trait loci [cis-eQTLs]) in the human population that affect gene expression in multiple tissues. Over a third of GTEx eQTLs are active across all or almost all surveyed tissues and may have different effect sizes in different tissues, but the molecular mechanisms of this cross-tissue variability of eQTL effect size are poorly understood. Since eQTLs are enriched in transcription factor binding sites (TFBS), we developed a model of eQTL effect size as a function of transcription factor (TF) activity levels across tissues – low (absence of TF binding) or high (saturation of TF binding) tissue TF level results in reduced eQTL tissue effect size, with maximum eQTL effect at mid-range TF levels. We investigated mechanisms of tissue-variability of eQTL effect size (allelic fold change) in the GTEx v8 data release, which includes 838 individuals’ whole genome sequences and their RNA-seq data from a total of 15,201 samples across 49 tissues. We first examined the correlation of eGene expression with eQTL absolute effect size across tissues and identified 1,864 expression-correlated eQTLs (Spearman corr, 5% BH FDR, N=26,499), with 51% positive and 49% negative correlations. This result shows that expression level and genetic regulatory effects are not independent phenomena, and we found that the expression-correlated eQTLs have distinct genetics, gene features, and functional annotations – importantly, they are enriched for TFBS overlap compared to non-correlated eQTLs. A portion of these TFBS-overlapping eQTLs are also correlated with the relevant TF gene expression level across tissues. This indicates that TF activity levels may indeed modify eQTL effects across tissues, and it allows identification of upstream molecular drivers of eQTL activity. 
We proceeded to find several examples of eQTLs that co-localize with GWAS signals, directly disrupt TFBS, and correlate with TF levels across tissues, thus pinpointing the likely causal variant and the direct molecular regulator of disease associations. Our results provide a basis for identifying regulatory mechanisms of eQTL activity and for understanding eQTL effect size variability across tissues in the context of TF activity. This has major implications for future work investigating context-specific mechanisms of GWAS loci and their effects on human phenotype and disease.
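The expression-correlated eQTLs above were selected at a 5% Benjamini-Hochberg (BH) false discovery rate. For readers unfamiliar with the procedure, a minimal sketch of the BH step-up rule applied to a list of p-values (for example, one Spearman correlation p-value per eQTL) is given below. The function name is illustrative, the p-values in the test are made up, and the authors' actual pipeline is not described in the abstract.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans marking which tests pass BH at level alpha.

    Step-up rule: sort the m p-values, find the largest rank k such that
    p_(k) <= alpha * k / m, and reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            max_k = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected
```

Note that a p-value above its own rank threshold can still be rejected if a larger p-value further down the sorted list meets its threshold, which is what distinguishes the step-up rule from a simple per-test cutoff.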

RSG-102: NicheNet: Modeling intercellular communication by linking ligands to target genes
Topic: intercellular communication single-cell transcript
  • Robin Browaeys, VIB-UGent Center for Inflammation Research, Belgium
  • Wouter Saelens, VIB-UGent Center for Inflammation Research, Belgium
  • Yvan Saeys, VIB-UGent Center for Inflammation Research, Belgium

Short Abstract: Technological advances in spatial transcriptomics and single-cell gene expression profiling hold the potential to study intercellular communication in unprecedented ways. In this regard, sophisticated technologies have been developed to specifically assay the spatial context of a cell, characterize cellular niches or construct physical cell-cell interaction maps. To gain insights into intercellular communication processes from gene expression data of interacting cells, current computational analyses link ligands expressed by sender cells to their corresponding receptors expressed by receiver cells. However, a detailed understanding of a cell-to-cell signaling process requires knowing how individual signals are interpreted, i.e., knowing which signaling pathways are activated and which genes change in expression as a result. To infer the putative effects of ligands expressed by one cell on gene expression in an interacting cell, we have developed NicheNet, a computational method that integrates expression data of interacting cells with prior information on potential links between ligands and target genes. The prior information at the basis of NicheNet was inferred by integrating complementary data sources covering ligand-receptor, signaling and gene regulatory interactions. To validate this prior information on potential ligand-target links, we collected transcriptome data of cells before and after they were stimulated by a ligand in culture. We assessed the ability of the model to predict the transcriptional response, and also to predict the ligand of interest given this response. For the latter prediction task, NicheNet outperforms the upstream regulator analysis of Ingenuity Pathway Analysis® while also being less biased by popularity. Finally, we applied NicheNet to single-cell RNA-seq data of immune cell niches and tumors. 
By inferring putative ligand-target links active in a process of interest, we demonstrate how NicheNet can generate novel hypotheses concerning tumor-microenvironment interactions and the influence of cellular interactions in steering polarization and differentiation of immune cells.

RSG-103: Uncovering alternative splicing cis-regulation in breast cancer risk
Topic: sQTL alternative splicing breast cancer risk GWAS
  • Juliana Machado, University of Algarve, Portugal
  • Ramiro Magno, University of Algarve, Portugal
  • Joana M. Xavier, University of Algarve, Portugal
  • Ana-Teresa Maia, University of Algarve, Portugal

Short Abstract: Introduction: Recent genome-wide association studies (GWAS) have revealed the association of hundreds of single nucleotide polymorphisms (SNPs) with breast cancer (BC) risk. However, they fail to pinpoint the underlying biological mechanism for this risk. Interestingly, most risk-associated SNP loci are located in non-coding regions, suggesting possible regulatory roles, such as altering the binding of transcription or splicing factors, as well as miRNAs. There has been a bias in the functional characterisation of GWAS loci towards the effect of regulatory SNPs (rSNPs) on transcription factor binding. Our aim is to determine the extent of the contribution of rSNPs influencing splicing among known breast cancer susceptibility loci. Material and Methods: We screened genome-wide significant (P ≤ 5×10^-8) breast cancer risk-associated SNPs from published GWAS for association with alternative splicing isoforms. To this end, we optimised existing bioinformatics tools and used RNA-seq expression data from normal breast samples, available from the GTEx project, to identify splicing quantitative trait loci (sQTL). Results and Discussion: We found that rs6456883 is a significant cis-sQTL for the expression of ZNF311 gene isoforms, and that three more SNPs, rs6682326, rs3008282 and rs2906324, are also significant cis-sQTLs for the expression of RPL23AP53 gene isoforms. We are currently performing their functional characterisation to reveal the mechanism by which these variants regulate alternative splicing. Our work is starting to reveal the extent to which alternative splicing plays a role in known breast cancer susceptibility, and also paves the way for further testing of other candidate loci.

RSG-104: Using machine learning algorithms for classification of medulloblastoma subgroups based on gene expression data
Topic: Machine learning Medulloblastoma Pediatric brain t
  • Sivan Gershanov, Molecular Biology, Ariel University, Israel
  • Igor Vainer, Molecular Biology, Ariel University, Israel
  • Albert Pinhasov, Molecular Biology, Ariel University, Israel
  • Helen Toledano, Pediatric Oncology, Schneider Children’s Medical Center of Israel, Israel
  • Nitza Goldenberg-Cohen, Ophthalmology, Bnai Zion Medical Center; Krieger Eye Research Laboratory, FMRC; Rappaport Faculty of Medicine, Technion, Israel
  • Mali Salmon-Divon, Molecular Biology, Ariel University, Israel

Short Abstract: Medulloblastoma (MB), the most common malignant brain tumor in children, is divided into four molecular subgroups: WNT, SHH, Group 3 and Group 4. Clinical practice and treatment design are becoming subgroup-specific. Currently, clinicians use a 22-gene signature set to diagnose the subgroups. While the WNT and SHH subgroups have well-defined biomarkers, differentiating Group 3 from Group 4 is less clear-cut. The aim of this study is to improve the diagnostic process in the clinic by identifying the most efficient list of biomarkers for accurate, fast and cost-effective MB subgroup classification. We tested five machine-learning-based algorithms: four well-known methods and one novel method that we developed. We applied them to a public microarray expression data set and compared their performance to that of the known 22-gene set. Both Decision Tree and Decision Rules resulted in a reduced set (nine and ten, respectively) with similar accuracy to the 22-gene set. The Random Forest and SVM-SMO methods showed improved performance without applying feature selection. When implementing our novel SARC (SVM Attributes Ranking and Combinations) classifier, which allows feature selection, the resulting accuracy was the highest and was better than that obtained using the 22-gene set as input. The number of attributes in the best-performing combinations ranges from 13 to 32, including known MB-related genes such as WIF1, NPR3 and GRM8, along with LOC440173, a long non-coding RNA. To summarize, we identified sets of attributes that have the potential to improve MB subgroup diagnosis. Broad clinical use of this classification may accelerate the design of patient-specific targeted therapies and optimize clinical decision-making.

RSG-105: Characterization of different cell types using Benford law
Topic: single cell RNA sequencing scRNA-seq Benford Law c
  • Sne Morag, Ariel University, Israel
  • Mali Salmon-Divon, Ariel University, Israel

Short Abstract: Separation of living cells and identification of their cell type and origin is essential for revealing their function and importance in medical applications. Approaches for cell-type identification were historically limited to the detection of known markers on the cell surface. With the development of high-throughput sequencing technologies in general, and single-cell sequencing in particular, novel methods for cell-type classification that require no prior knowledge of specific markers could emerge. In this study we developed a novel algorithm based on the Benford law for accurate classification of cell-type identity. The Benford law, a direct entropy derivative, states that within a large numerical data set with maximum entropy, the leading digit’s occurrence probability drops as its value increases. Previously, our lab showed that bulk RNA-seq expression values follow the Benford law and that adherence to the Benford law could indicate a gene’s housekeeping or tissue-specific characteristics. In this study, adherence of single-cell gene expression to the Benford distribution was tested and examined for its ability to identify different cell types. Adherence to the Benford distribution was calculated on single-cell RNA-seq (scRNA-seq) data sets using the mean absolute error (MAE). The Benford-based results were compared to a parallel analysis based on expression only. Dimensionality reduction methods and machine learning approaches were used to quantify the ability of the Benford-based algorithm to classify single cells into their corresponding cell types and tissues. Our initial results show that scRNA-seq data follow the Benford distribution in general and within specific cell-type groups. Additionally, the MAE calculation was found to be useful for separating cell types and was more effective than classification based on expression only.
To summarize, this study may yield a robust in-silico tool for identification and characterization of a cell type. This tool could be used in various medical applications, including identification of metastasis origin, evaluation of a cell’s potency level and identification of rare cells within a population. These could contribute to cancer diagnosis, treatment and regenerative medicine.
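
As an illustration of the adherence measure described in this abstract, the following minimal sketch computes the mean absolute error between the observed leading-digit frequencies of a set of expression values and the Benford distribution, P(d) = log10(1 + 1/d). The function names and the decision to skip zero and negative values are assumptions made for this example; it is not the authors' implementation.

```python
import math
from collections import Counter

# Expected Benford probabilities for leading digits 1..9: P(d) = log10(1 + 1/d)
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """Return the first significant digit of a positive, finite number."""
    # The first non-zero digit in the shortest decimal representation
    # (e.g. '0.032' or '1e-05') is the leading significant digit.
    for ch in repr(float(x)):
        if ch in "123456789":
            return int(ch)
    raise ValueError(f"no leading digit for {x!r}")


def benford_mae(values):
    """MAE between observed leading-digit frequencies and Benford's law.

    Zero and negative values are skipped (an assumption of this sketch).
    """
    digits = [leading_digit(v) for v in values if v > 0]
    if not digits:
        raise ValueError("no positive values to evaluate")
    counts = Counter(digits)
    n = len(digits)
    return sum(abs(counts.get(d, 0) / n - BENFORD[d]) for d in range(1, 10)) / 9
```

A lower value indicates closer adherence to the Benford distribution; in a setting like the one described, one such value could be computed per cell or per gene and then used as a feature for classification.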

RSG-107: Causal Gene Regulatory Network Construction using Single-cell RNA-seq and Single-cell ATAC-seq data
Topic: Single cell Transcription factor Gene regulation s
  • Wenpin Hou, Johns Hopkins University, United States
  • Zhicheng Ji, Johns Hopkins University, United States
  • Dongwon Lee, New York University, United States
  • Suchi Saria, Johns Hopkins University, United States
  • Aravinda Chakravarti, New York University, United States

Short Abstract: The development of single-cell technologies opens the way for accurate quantification of diverse regulatory activities across the genome sequence. A key unanswered question is: what are the sequence specificities of DNA-binding proteins that co-regulate genes? Although co-regulation of genes leads to their correlated co-expression, the reverse is not true. We present here a novel method that couples single-cell gene expression (scRNA-seq) and single-cell ATAC-seq (scATAC-seq) to identify the target genes of any transcription factor (TF). We propose a new pseudotime alignment algorithm that aligns the pseudotime trajectories of scRNA-seq and scATAC-seq data from individual cells. The algorithm aims to find a transformation of pseudotime, from different sets of cells, such that the temporal dynamic pattern of the regulatory activities of a gene studied by scATAC-seq agrees with the pattern of the same gene’s expression assessed by scRNA-seq. For each gene, this agreement is measured by the Pearson correlation coefficient (PCC), and the pseudotime transformation that results in the highest sum of PCCs across all genes is selected. A TF and its target gene are then paired when the dynamic pattern of the TF’s regulatory activities is highly correlated (PCC >= 0.8 for up-regulation and PCC <= -0.8 for down-regulation) with the pattern of the target gene’s expression, allowing for a time lag. This leads to the construction of a causal gene regulatory network (GRN). Bulk ChIP-seq and RNA-seq data can be used to benchmark this method. Specifically, pairs of TFs and target genes are classified into up-regulated, down-regulated and non-regulated genes, and the classification is compared with the one inferred from single-cell data. Our method is being tested using scRNA-seq and scATAC-seq data from human hematopoietic stem cells (HSCs) differentiating into lymphocytes.
Bulk RNA-seq and ChIP-seq data of IKZF1 and CTCF were also collected for HSCs and lymphocytes. A total of 6,864 genes were considered. For IKZF1, 5,793 (83.3%) genes were correctly identified with a mean time lag of 256.3 cells, and for CTCF, 5,429 (79.1%) genes were correctly identified with a mean time lag of 272.975 cells. Thus, our method is capable of identifying TFs and their target genes with significant accuracy. Adding a component of machine learning to these algorithms, and additional datasets, promises further significant improvements.
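
The TF-target pairing step described in this abstract can be sketched as a lagged correlation search. The example below assumes two equal-length profiles already ordered along the aligned pseudotime, scans lags in which the target trails the TF, and applies the PCC thresholds of 0.8 and -0.8 quoted above. The function names, the lag search strategy and the `max_lag` parameter are illustrative assumptions, not the authors' algorithm.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return sxy / math.sqrt(sxx * syy)


def best_lagged_pcc(tf_activity, target_expr, max_lag):
    """Return (lag, pcc) with the largest |pcc| over lags 0..max_lag.

    A lag of k compares tf_activity[i] with target_expr[i + k], i.e. the
    target is allowed to respond k cells later along pseudotime.
    """
    if len(tf_activity) != len(target_expr):
        raise ValueError("profiles must have the same length")
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        x = tf_activity[: len(tf_activity) - lag]
        y = target_expr[lag:]
        if len(x) < 3:
            break  # too few points left for a meaningful correlation
        r = pearson(x, y)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best


def classify(pcc, threshold=0.8):
    """Label a TF-target pair using the thresholds quoted in the abstract."""
    if pcc >= threshold:
        return "up-regulated"
    if pcc <= -threshold:
        return "down-regulated"
    return "non-regulated"
```

Here the lag is expressed as a number of cells along pseudotime, which matches the units in which the abstract reports its mean time lags.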

RSG-108: Exploring the Roles of Drosophila Heat Shock Factor in Stress and Development
Topic: Prediction Modelling Enhancer 3D Conformations Dev
  • Husam Abdulnabi, University of Toronto, Canada
  • Tim Westwood, University of Toronto, Canada

Short Abstract: Heat Shock Factor (HSF) is commonly known as the master regulator of the Heat Shock Response. HSF was one of the first transcription factors to be purified and characterized and has served as a model for studying transcription regulation in part because of its convenient induction method. When (i) heat shock (hs) or other stress is applied, HSF (ii) activates the expression of (iii) chaperone genes by (iv) binding to promoters. This basic model has endured decades of research with genomics and its tools expanding exponentially. However, HSF is also required in (i) non-stress conditions, (ii) repressing the expression of (iii) non-chaperone genes, and (iv) binding outside of promoter regions. To better understand these non-canonical activities of HSF, a multifaceted computational analysis was devised for the Drosophila genome. Raw to mid-processed published HSF data were re-processed, publicly available tools and databases employed, and frontier genomic research explored to propose novel models for HSF mechanics in stress. This includes enhancer activity through 3D interactions and combinatorial activation. As mentioned, HSF is required in normal non-stress growth and development. In HSF mutants, yeast do not grow, vertebrates experience numerous defects, and Drosophila arrest at early developmental stages. At least three theories exist for HSF’s requirement under non-stress conditions: (1) HSF produces chaperones that are required for normal function (as seen in yeast); and/or (2) HSF drives a distinct development program in response to intrinsic stress (as seen in its role in cancer); and/or (3) HSF is regulating the expression of non-chaperone genes. The critical role(s) of HSF in Drosophila development was studied by extending computational analyses with other sources, creating a prediction model for its activity and regulation. 
This model was challenged with physical experiments such as Chromatin Immunoprecipitation followed by high throughput sequencing (ChIP-Nexus), quantitative (q-) and reverse transcription (rt-) PCR, and confocal imaging on Drosophila embryos containing hs gene promoter-GFP constructs in hs and non-stress conditions. This research not only serves to explore HSF’s role and regulation in stress and development but also provides a pipeline for similar research with other transcription factors.

RSG-109: Cell type specific response: LPS time course in co-cultured iPSC derived neurons and microglia
Topic: single-cell transcriptomics neuroinflammation infl
  • Jimena Monzón-Sandoval, Cardiff University, United Kingdom
  • Elena Burlacu, University of Oxford, United Kingdom
  • Zameel Cader, University of Oxford, United Kingdom
  • Sally Cowley, University of Oxford, United Kingdom
  • Caleb Webber, Cardiff University, United Kingdom

Short Abstract: In order to understand neurodegenerative diseases we need human models that recapitulate the processes observed in disease. Neuroinflammation is a common feature observed across neurodegenerative diseases including Alzheimer’s disease, Amyotrophic Lateral Sclerosis, and Parkinson’s disease. This mechanism is thought to be mediated by neuroimmune microglia cells errantly interacting with neurons to precipitate neuronal cell loss. Here we describe a cell-type specific response through time to lipopolysaccharide (LPS, a bacterial endotoxin) by co-culturing microglia and cortical neurons derived from human induced pluripotent stem cells (iPSC). We measured gene expression for 14,000 cells before and after LPS stimulation (0, 6 and 18 hours) by 10X shallow single cell sequencing. We grouped cells according to their expression profiles using a graph clustering approach and identified fifteen clusters ranging in size from 206 to 1819 cells. Each unbiased cluster contained cells at 0, 6 and 18 hours. As expected, the strongest LPS response was in a cell cluster identified by marker genes as microglia, which showed enrichments of genes involved in cell migration, cell motility, programmed cell death and response to LPS. However, none of the non-microglia clusters showed a comparable response, suggesting there was little interaction between the LPS-provoked microglia and the non-microglia populations in co-culture. Nonetheless, two clusters of non-microglia cells did show a milder response. Cell-identity analyses showed that these cells expressed cell markers consistent with an astrocyte identity. This study proposes that the LPS-provoked neuroinflammatory response is mediated by astrocytes. We are currently exploring this single cell expression data to identify what the mechanism underlying response competency is in the non-neuronal population as a potential therapeutic route to mediating this response in disease.

RSG-110: SECAT: High throughput quantitative protein complex profiling
Topic: proteomics protein complexes protein-protein inter
  • George Rosenberger, Columbia University, United States
  • Moritz Heusel, ETH Zurich, Switzerland
  • Ruedi Aebersold, ETH Zurich, Switzerland
  • Andrea Califano, Columbia University, United States

Short Abstract: Proteins catalyze and control nearly all biochemical functions of cells. To mediate their biological function, they often interact with other proteins via specific protein-protein interactions (PPI), forming macromolecular complexes as functional units. Characterizing PPI both qualitatively and quantitatively poses a major challenge to understand the molecular mechanisms underlying biological processes and phenotypes, such as those involved in complex diseases. Recently, approaches that combine biochemical fractionation (e.g. size exclusion chromatography) of native protein complexes followed by mass spectrometry-based proteomics (LC-MS/MS, e.g. SWATH-MS) have been developed to characterize the PPI among thousands of proteins in a single experiment. Exemplified by SEC-SWATH-MS, this enables qualitative and quantitative comparison of the protein complex profiles between different biological conditions at high throughput for the first time. However, the low selectivity of the biochemical fractionation dimension has proved challenging for data analysis and integration and requires the development of novel methodology. Here we present the SEC Algorithmic Toolkit (SECAT), a novel algorithm that combines signal processing, machine learning and data integration by a network-centric approach. SECAT improves the selectivity of detected PPI while providing insights into qualitative and quantitative changes between experimental conditions. Input to SECAT are i) quantitative matrices on peptide level over SEC fractions generated by SEC-SWATH-MS across different conditions and ii) prior information on binary PPI from reference databases which are then used to validate candidate PPI based on the experimental data using quantitative metrics from signal processing. Individual scores are then integrated by a semi-supervised learning approach resulting in context-specific PPI networks. 
Confident interactions are quantified between experimental conditions to identify differential interactions, which are then integrated by a network-centric strategy to identify the most significantly changed protein subunits. To validate SECAT and to assess its performance against established methods, we used a SEC-SWATH-MS data set of a HeLa cell line measured in interphase and mitosis. In comparison to the currently employed strategies, SECAT achieves higher selectivity and higher recovery of confidently detected PPI. Applying SECAT and STRING-DB to the HeLa cell line data set enabled confident (q-value < 0.1) detection of 6145 PPI between 1142 proteins. The PPI of 129 proteins were found to be significantly (q-value < 0.1) altered between the conditions, with many being associated with molecular mechanisms involved in cell cycle regulation. With the increasing throughput of the experimental methods, we believe that SECAT can become a valuable tool to assess PPI by a network-centric approach.

RSG-111: Multiblock integration of transcriptomics and metabolomics to study Type 1 Diabetes progression
Topic: Type 1 Diabetes islet autoimmunity N-way partial l
  • Leandro Balzano Nogueira, University of Florida, United States
  • Ricardo Ramirez, University of Florida, United States
  • Tatyana Zamkovaya, University of Florida, United States
  • Jordan Daley, University of Florida, United States
  • Alexandria Ardissone, University of Florida, United States
  • Srikar Chamala, University of Florida, United States
  • Desmond Schatz, University of Florida, United States
  • Mark Atkinson, University of Florida, United States
  • Michael Haller, University of Florida, United States
  • Patrick Concannon, University of Florida, United States
  • Eric Triplett, University of Florida, United States
  • Ana Conesa, Genomics of Gene Expression Lab, Spain

Short Abstract: Type 1 diabetes (T1D) is an autoimmune disease with increasing incidence rates in many developed countries. The Environmental Determinants of Diabetes in the Young (TEDDY) study is a prospective birth cohort designed to study T1D by following children at high genetic risk. Here, genome-wide gene expression, metabolomics and dietary biomarkers in plasma samples were evaluated through a multiblock multivariate analysis approach to integrate these datasets and study the relationship of these molecular features with progression towards autoimmunity. We applied N-way partial least squares-discriminant analysis (NPLS-DA) coupled with variable importance for the projection (VIP) to identify the best discriminating features between cases and controls over time. This approach allowed us to select 862 genes, 245 metabolites and 3 dietary biomarkers for further integrative analysis. This analysis also revealed molecular profiles changing as early as one year before seroconversion. Integrative interpretation of the mathematical models was done by mapping features to the Paintomics3 tool, which displays transcriptomics and metabolomics data on KEGG pathways. Pathway models were complemented with inter-omics molecular associations identified by Partial Correlation Analysis. These integrative strategies revealed decreased nutrient absorption, lipid metabolism abnormalities and intracellular reactive oxygen species (ROS) accumulation, followed by extracellular matrix remodeling, inflammation, cytotoxicity, angiogenesis, and higher antigen-presenting cell activity. Here, we demonstrated the applicability of multi-omics integrative approaches to determine the relationship between metabolic profiles and autoimmunity in T1D progression long before any autoantibody appears, suggesting that new methods for early diagnosis and intervention are possible.

RSG-112: Epigenetic footprinting in T cells and fibroblasts from reprogrammed inducible pluripotent stem cells
Topic: Immunotherapy induced pluripotent stem cells repro
  • Alyssa Morrow, University of California, Berkeley, United States
  • Dharmeshkumar Patel, University of Minnesota, United States
  • James Kaminski, University of California, Berkeley, United States
  • John W. Hughes, University of California, Berkeley, United States
  • Nick Haining, Harvard University, United States
  • Bruce R. Blazar, University of Minnesota, United States
  • Nir Yosef, University of California, Berkeley, United States

Short Abstract: Our goal is to provide a self-renewable and phenotypically stable source of T stem memory (Tsm) cells to promote chimeric antigen receptor (CAR) T cell persistence and remission rates in adoptive therapy. We reprogram Tsm and T naive (Tn) cells from four donors into induced pluripotent stem cells (iPSCs), which can be re-differentiated back into their original populations, potentially providing a self-renewable source of Tsm cells that maintain the epigenetic and transcriptomic landscape of the original Tsm populations. We also reprogram fibroblasts from three donors into iPSCs, and compare differential footprinting in the reprogrammed T cells (Tn/Tsm-iPSCs) and reprogrammed fibroblasts (FB-iPSCs) to identify fibroblast- and immune-related footprints preserved after reprogramming. We first compare transcriptional and epigenetic signatures between Tn-iPSCs and Tsm-iPSCs and their original T cell populations through evaluation of RNA-sequencing and ATAC-sequencing samples, and find that differential signatures in the iPSCs are largely diminished relative to their original populations. Both Tn-iPSCs and Tsm-iPSCs show loss of T cell-specific signatures and gain of pluripotency-related signatures relative to the original T cell populations. We compare differential expression and accessibility of all reprogrammed T cells (T-iPSCs) and FB-iPSCs and find a subset of fibroblast- and immune-related signatures maintained after reprogramming. While FB-iPSCs are enriched for fibrosis-related epithelial/mesenchymal transition, T-iPSCs are enriched for Notch, TNF-alpha, and TGF-beta signalling. We test the significance of differentially accessible regions with known regulator motifs in T-iPSCs, and find shared enriched motifs between T-iPSCs and their original T cells, including multiple motif enrichments for TCF4, TFAP2A and TFAP4.
Original T cells show motif enrichment for unique factors not enriched in T-iPSCs, including CTCF and EOMES motifs, while being enriched for additional TCF4 and TCF2A motifs not seen in T-iPSC enrichment. Future work includes re-differentiation and expansion of T-iPSCs to their original T cell states to evaluate similarities between the re-differentiated and original T cell populations. Furthermore, we are developing new methods for predicting differential binding of transcriptional regulators based on cell type similarity metrics calculated from chromatin footprinting patterns. This method will be applied to further determine differential binding in T-iPSCs and their original T cell populations.

RSG-113: Inference of spatial cell type patterns in the adult mouse cochlea through single cell RNA-Seq doublet deconvolution
Topic: single-cell analysis transcriptomics doublets comp
  • Gabriela Pregernig, Decibel Therapeutics, United States
  • Matthew Nguyen, Decibel Therapeutics, United States
  • Kathy So, Decibel Therapeutics, United States
  • Joseph Burns, Decibel Therapeutics, United States
  • Adam Palermo, Decibel Therapeutics, United States

Short Abstract: The adult mammalian cochlea is an architecturally complex organ, housing a highly diverse set of cell types spanning a variety of functions. Cochlear cell types have classically been defined based on ultrastructure, physiology, and markers, but to this day we lack a global, unbiased survey of the spatio-transcriptional patterns associated with this diversity. This is particularly true for the nonsensory cell types outside of the organ of Corti, which are essential for auditory function. Using high-throughput droplet microfluidics single cell RNA-Seq, we captured >75,000 cells from dissociated whole cochleae from adult mice. Unbiased clustering revealed over 40 cell types within the dataset. This level of diversity is comparable to what is seen in the mouse brain, despite the cochlea representing only a few millimeters in size. Based on known markers, we assigned identities to 16 of the clusters, encompassing hallmark cell types such as hair cells, responsible for mechanotransduction, and spiral ganglion neurons, which relay sound information to the brain. The remaining cell types fall into broad categories of immune, mesenchymal/fibrocyte, and epithelial cells. While scRNA-Seq provides a survey of all transcriptomes, it lacks spatial acuity. We hypothesized that doublets present in our scRNA-Seq datasets would arise from 2 sources – randomly paired cells while loading samples onto the microfluidics device, as well as biased pairs resulting from incomplete tissue dissociation. Here, we outline a workflow for deconvolution of scRNA-Seq doublets from mixed sources, without prior knowledge of pure cell population profiles. We show that doublet frequencies can be used to infer the spatial patterning of cell types in the cochlea, and validate our findings in situ via RNAscope. 
Altogether, this provides a proof of concept for how scRNA-Seq data which is generally discarded can be leveraged to obtain information about the spatial localization of cell types in previously uncharacterized tissue.
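
One way to picture how doublet frequencies could carry spatial information is to compare observed doublet pair counts with what random pairing would predict. The sketch below assumes that, if cells paired purely at random during loading, a doublet of distinct types a and b would occur with probability 2·p_a·p_b (and p_a² for two cells of the same type), where p is the singlet frequency of each type; pairs observed well above that expectation are candidates for physical adjacency. The function name, inputs and the random-pairing null model are assumptions for illustration and do not reproduce the authors' deconvolution workflow.

```python
from collections import Counter
from itertools import combinations_with_replacement


def doublet_enrichment(type_freqs, doublet_pairs):
    """Observed/expected ratio for each unordered pair of cell types.

    type_freqs    : dict mapping cell type -> singlet frequency (sums to 1,
                    all values assumed to be positive)
    doublet_pairs : list of (type_a, type_b) tuples, one per deconvolved doublet
    """
    n = len(doublet_pairs)
    if n == 0:
        raise ValueError("no doublets supplied")
    # Count pairs irrespective of the order in which the two types are listed
    observed = Counter(tuple(sorted(pair)) for pair in doublet_pairs)
    ratios = {}
    for a, b in combinations_with_replacement(sorted(type_freqs), 2):
        # Random pairing: p_a^2 for same-type pairs, 2 * p_a * p_b otherwise
        expected = type_freqs[a] * type_freqs[b] * (1 if a == b else 2)
        ratios[(a, b)] = (observed.get((a, b), 0) / n) / expected
    return ratios
```

A ratio near 1 is consistent with random co-encapsulation, whereas a ratio well above 1 would suggest that the two cell types tend to remain attached after incomplete dissociation.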

RSG-114: Topic modeling enables identification of regulatory complexes in a comprehensive epigenome
Topic: yeast epigenome computational epigenomics regulato
  • Guray Kuzu, The Pennsylvania State University, United States
  • Matthew J. Rossi, The Pennsylvania State University, United States
  • Naomi Yamada, The Pennsylvania State University, United States
  • Chitvan Mittal, The Pennsylvania State University, United States
  • William K.M. Lai, The Pennsylvania State University, United States
  • Nitika Badjatia, The Pennsylvania State University, United States
  • Gretta Kellogg, The Pennsylvania State University, United States
  • B. Franklin Pugh, The Pennsylvania State University, United States
  • Shaun Mahony, The Pennsylvania State University, United States

Short Abstract: Characterizing the composition and organization of protein complexes that form on DNA is key to understanding gene transcription and regulation. Chromatin immunoprecipitation (ChIP) based techniques have been widely applied to characterize protein-DNA binding across numerous systems. However, insights into the organization of regulatory complexes have been limited by three shortcomings: 1) a lack of comprehensiveness in existing compendia of protein-DNA binding profiles; 2) a lack of positional resolution in most existing genome-wide ChIP experiments; and 3) a lack of computational analysis methods for characterizing regulatory complex organization across large collections of genome-wide ChIP experiments. Under our ongoing Yeast Epigenome Project, we have characterized the genomic occupancy patterns of a comprehensive set of nuclear-localized proteins (~400 proteins) in yeast using the high-resolution ChIP-exo assay. The resulting dataset represents the first comprehensive characterization of any cell type’s genome-wide protein-DNA interaction landscape at a resolution sufficient to define the positional organization of factors. Here, we demonstrate that topic modeling approaches can be used to identify sets of interacting proteins within this regulatory landscape. Our approach, based on the hierarchical Dirichlet process, forms probabilistic topics from co-occurring ChIP-exo signals across the genome. In contrast with state-based approaches (e.g. Hidden Markov Models), topic models allow multiple topics to contribute to the generation of data in each bin, and are therefore more appropriate for modeling the fine-grained organization of protein-DNA complexes from high-resolution ChIP-exo data. We provide evidence that the topics estimated by our approach can be interpreted as functional groups of regulatory proteins; our topics encapsulate both subunits of known complexes and proteins from known interacting complexes. 
Furthermore, profiling the distribution of topics on the genome reveals the spatial organization of protein complexes during gene transcription and regulation.

RSG-115: Per peak strand cross-correlation as a peak call refinement metric
Topic: ChIP-seq peak call algorithms R package
  • Joseph R Boyd, University of Vermont, United States
  • Joshua T Rose, University of Vermont, United States
  • Jonathan Ar Gordon, University of Vermont, United States
  • Seth Frietze, University of Vermont, United States
  • Sayyed K Zaidi, University of Vermont, United States
  • Janet L Stein, University of Vermont, United States
  • Jane B Lian, University of Vermont, United States
  • Gary S Stein, University of Vermont, United States

Short Abstract: Chromatin immunoprecipitation followed by sequencing (ChIP-seq) has become a core tool for understanding epigenomics. Locating peaks where reads are enriched by the pull-down procedure is a central task when analyzing these data, and many peak-callers exist for this task. In practice, peak-calling for high-quality datasets is straightforward, while sub-optimal datasets, which may be technically flawed (sequencing issues, low depth, poor antibody performance) or biologically challenging (few binding sites), are unlikely to produce a high-confidence peak set. To improve peak-calling, we have developed a set of heuristic metrics based on strand cross-correlation (SCC, an established dataset-wide quality control metric) applied to individual peaks and implemented supporting methods in an R package, peakrefine (https://github.com/jrboyd/peakrefine). We demonstrate the effectiveness of employing SCC in this way with a case study of a Runx1 ChIP-seq experiment in which Runx1 binding in asynchronous and G1 MCF10A cells was compared to the small subset of Runx1 binding sites retained during mitotic arrest in MCF10A cells. This experiment was particularly challenging due to antibody limitations, the low number of binding sites and the low amount of Runx1 protein in arrested cells. Our standard peak-call pipeline with MACS2 yielded unsatisfactory results, with many called peaks mapping to known black-list regions, failing upon manual inspection, and not coinciding with known Runx1 binding sites from more successful ChIP-seq experiments. We observed that likely artifact peaks had concordant read pileups between the two DNA strands, whereas likely biological peaks exhibited a strand offset matching the expected fragment size; MACS2’s p-value statistic was not able to discriminate between the two peak sets.
By selecting for peaks that maximize SCC when shifted by the expected fragment size compared to unshifted, we were able to increase retrieval of the Runx1 motif and identify important cell cycle-related genes that are mitotically bookmarked by Runx1. We have since applied this methodology to a wider array of transcription factors and found that, while the benefit is greatest in datasets of marginal quality, we are consistently able to select against certain artifact regions and improve motif retrieval in a manner orthogonal to the MACS2 peak metrics, signal value and p-value.
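
The per-peak comparison described in this abstract can be sketched as follows. The example assumes that, for each peak, per-base counts of read 5' ends are available separately for the plus and minus strands, and it correlates the two profiles both unshifted and with the minus strand shifted by the expected fragment size. The function names and the "gain" score are hypothetical and written in Python for illustration; they are not the API of the peakrefine R package.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no correlation signal
    return sxy / math.sqrt(sxx * syy)


def strand_correlation(plus, minus, shift):
    """Correlate plus[i] with minus[i + shift] over the overlapping positions.

    In a true binding event, minus-strand read starts pile up roughly one
    fragment length downstream of the plus-strand read starts.
    """
    n = len(plus) - shift
    if n < 3:
        raise ValueError("shift is too large for this peak window")
    return pearson(plus[:n], minus[shift:shift + n])


def scc_gain(plus, minus, fragment_size):
    """Shifted minus unshifted strand correlation for a single peak.

    Positive values suggest the expected strand offset (likely biological);
    negative values suggest concordant pileups (likely artifact).
    """
    shifted = strand_correlation(plus, minus, fragment_size)
    unshifted = strand_correlation(plus, minus, 0)
    return shifted - unshifted
```

Ranking or filtering peaks by such a score is one possible way to apply the idea; the thresholds and exact metrics used in practice would need to be taken from the package itself.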

RSG-116: Ada: The Clinical and Translational Data Integration and Analysis Platform with Scalable Machine Learning
Topic: translational medicine web platform visual analyti
  • Peter Banda, Luxembourg Centre For Systems Biomedicine (LCSB), University of Luxembourg, Luxembourg
  • Venkata Pardhasaradhi Satagopam, Luxembourg Centre For Systems Biomedicine (LCSB), University of Luxembourg, Luxembourg
  • Reinhard Schneider, Luxembourg Centre For Systems Biomedicine (LCSB), University of Luxembourg, Luxembourg

Short Abstract: Ada is a performant and highly configurable system for secured integration, visualization, and collaborative analysis of heterogeneous clinical and experimental data sets. Ada's primary role is to provide key infrastructure for the NCER-PD project, which focuses on improving the diagnosis and stratification of Parkinson's disease (PD) by combining detailed clinical and molecular data of patients to develop novel disease biomarker signatures, mainly within Luxembourg. Ada's main features include a convenient web UI for interactive data set exploration and filtering, and configurable views with widgets presenting various statistics, such as distributions, scatters, correlations, independence tests, and box plots. To define a data set's metadata, Ada provides an editable dictionary and a categorical (I2B2) tree with drag-and-drop manipulation. Furthermore, Ada facilitates robust access control through LDAP authentication and in-house user management with fine-grained permissions. The platform currently manages anonymized data sets associated with clinical research streamed daily from the NCER-PD REDCap system, biosample-related information provided by IBBL, and kinetic data from the mPower mobile application and eGaIT shoe sensors. Besides the NCER-PD project, the main Ada instance hosts around 1600 (imported and derived) data sets summing to more than 100 million rows from diverse studies including DeNoPa, PPMI, TREND, and GBA, which will be used for cross-study comparison. The data set import adapters currently support three file formats (CSV, JSON, and tranSMART data and mapping files) and three secured RESTful APIs (REDCap, Synapse, and eGaIT). Any data sets provided from these sources can be added to (or removed from) Ada on the fly as well as scheduled for periodic execution. As such, Ada has the potential to be used beyond the NCER-PD project for other translational medicine or data-driven endeavors.
For post-processing, filtered data can be exported into CSV, JSON, or tranSMART format. For more advanced analysis, well-grounded machine learning and statistical approaches were integrated using the Spark ML library. This covers a wide variety of classification, regression, clustering, feature selection, normalization, and time-series processing routines. We opted for Spark since it is a popular computational grid library for efficient large-scale data processing and analysis. Ada's computational infrastructure, together with a convenient UI, opens advanced analytics and machine learning to a diverse group of researchers, clinicians, and statisticians. The main Ada instance is available at https://ada.parkinson.lu.

RSG-117: Omic Network Modules as tools for Personalized cancer chemotherapy in Non-Small Cell Lung Cancer
Topic: Non-Small Cell Lung Cancer Whole Genome Sequencing
  • Tejaswi Badam, University of Skövde, Sweden
  • Mika Gustafsson, Linköping University, Sweden
  • Zelmina Lubovac, University of Skövde, Sweden

Short Abstract: Non-small cell lung cancer (NSCLC) is the most common type of lung cancer (80%-85%). Although targeted chemotherapies exist, hospital admissions due to adverse drug reactions are on the rise. The individual therapeutic window varies because drug response is highly variable at a given fixed dose. Powerful hypothesis generation from whole-genome sequencing requires analysing single nucleotide variants (SNVs) collectively through their combined effect, which can be done with network and systems biology approaches. SNV data were derived from whole-genome sequencing of 96 lung cancer patients treated with gemcitabine/carboplatin, of whom about 50 each suffered from drug-induced leukopenia (lpk), thrombocytopenia (tpk) and neutropenia (npk). We identified 5019, 4594 and 5066 SNVs for npk, tpk and lpk, respectively. These SNVs were mapped to their closest genes, both upstream and downstream, within a window of 20 kb. From these genes we constructed disease modules, i.e. topologically and functionally interconnected networks of genes, using four different algorithms (DIAMOnD, Cliquesum, ModuleDiscoverer and MCODE) for each trait. We then constructed a consensus module from the four methods for each trait, and a shared module across all traits. Interestingly, 24 genes were shared across at least two modules; these are hereafter referred to as the shared toxicity module. In parallel, we analyzed gene expression data from 300 microarrays of human cells treated with carboplatin or gemcitabine, filtered for bone marrow expression, which yielded 120 carboplatin and 109 gemcitabine genes. We then performed enrichment analysis of the module genes against the expression gene lists. The trait gene lists showed no enrichment, but each of the modules was significantly enriched (Fisher test P < 0.05) on both lists (odds ratio (OR) = 2.5-3.2). 
The shared module, however, showed stronger overlaps than any of the individual traits (OR = 4.1-4.5, P = 2.2-3.7 x 10^-3). The following major pathways were significantly enriched: platinum drug resistance, non-small cell lung cancer (hsa05223), Ras signaling pathway (hsa04014), calcium signaling pathway (hsa04020), ErbB signaling pathway (hsa04012) and estrogen signaling pathway (hsa04915). Most of these signaling pathways are known to be affected in K-RAS-mutated or EGFR-mutated NSCLC. We found NSCLC genes to be highly enriched in the modules, which supports the modules and suggests that the cancer genes interact closely with the toxicity genes.
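
The module-versus-expression-list enrichment described above can be illustrated with a one-sided Fisher exact test on a 2x2 gene overlap table. The following is only a minimal sketch, not the authors' pipeline: the function name fisher_enrichment, the gene identifiers and the choice of background gene set are all hypothetical, and the abstract does not state which background or test sidedness was used.

```python
from math import comb

def fisher_enrichment(module_genes, expr_genes, background):
    """One-sided Fisher exact test for over-representation.

    Returns (overlap size, sample odds ratio, P(overlap >= observed)).
    All three inputs are iterables of gene identifiers; genes outside
    the (assumed) background set are ignored.
    """
    background = set(background)
    module = set(module_genes) & background
    expr = set(expr_genes) & background

    a = len(module & expr)            # in module and in expression list
    b = len(module - expr)            # in module only
    c = len(expr - module)            # in expression list only
    d = len(background) - a - b - c   # in neither

    n = a + b + c + d
    row1, col1 = a + b, a + c
    # Hypergeometric upper tail: probability of an overlap of at least a.
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    p_value = tail / comb(n, col1)
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return a, odds_ratio, p_value

# Hypothetical toy data: 100 background genes, a 10-gene module and a
# 20-gene expression list that share 5 genes.
background = [f"g{i}" for i in range(100)]
module = [f"g{i}" for i in range(10)]
expressed = [f"g{i}" for i in range(5, 25)]
overlap, odds, p = fisher_enrichment(module, expressed, background)
print(overlap, odds, p < 0.05)
```

The odds ratio here is the simple sample odds ratio (a*d)/(b*c); statistical packages sometimes report a conditional maximum-likelihood estimate instead, so values may differ slightly from tool to tool.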

RSG-118: Genetic Diagnosis of Heterogeneous Conditions Applying Targeted Gene Capture and Next-generation Sequencing
Topic: Next-generation Sequencing Gene Capture Diagnostic
  • Sofia Esteban Serna, University College London, United Kingdom

Short Abstract: The genetic heterogeneity of certain conditions poses a medical challenge, as different types of the same disease often differ in severity, prognosis, treatment and inheritance. Differential diagnosis methods have traditionally been used to determine the form of these diseases. However, the efficiency of these differential procedures is limited when diagnosing rare or de novo variants. We aimed (1) to overcome the diagnostic limitations of differential procedures and (2) to detect the genetic and protein sequence variants present in each patient to enable precision medicine. Applying selective genomic enrichment in combination with next-generation sequencing (NGS), we performed a targeted gene capture of the genomic regions accounting for different subtypes of heterogeneous diseases (clinical exome). We present a novel diagnostic methodology that was validated by successfully diagnosing 29 patients with diverse types of cardiopathy, Charcot-Marie-Tooth (CMT) disease and Maturity Onset Diabetes of the Young (MODY). The developed procedure allowed the detection of exonic and intronic variants in heterozygous, homozygous or hemizygous forms, and provided a tool for the identification of 17 new disease-causing mutations which had not previously been described as pathogenic.

RSG-119: The integrative genomics and bioinformatics analyses unveil the genetic contents and evolutionary origin of B chromosomes.
Topic: genomics chromosomes evolution genes next generati
  • Syed Farhan Ahmad, Sao Paulo State University, Brazil
  • Diogo Cabral-de-Mello, Sao Paulo State University, Brazil
  • Patrícia Parise-Maltempi, Sao Paulo State University, Brazil
  • Vladimir Margarido, Western Paraná State University, Brazil
  • Rachel O’Neill, University of Connecticut, United States
  • Guilherme Valente, Sao Paulo State University, Brazil
  • Cesar Martins, Sao Paulo State University, Brazil

Short Abstract: B chromosomes (Bs), a type of supernumerary chromosome, are extra genomic units found in all major clades of eukaryotic species. Unlike the autosomes (A chromosomes), Bs possess a distinctive set of characteristics, including non-Mendelian inheritance and a transmission advantage, making them an ideal example of genomic conflict. For decades, their genetic composition, function and evolution have remained unresolved. Here, we sequenced the complete genomes of samples with (B+) and without (B-) B chromosomes from three model species (the fishes Astyanax mexicanus and Astyanax correntinus, and the grasshopper Abracris flavolineata). We identified the B-localized sequences by comparing B+ and B- genomes based on differential read coverage analysis. This analysis comprised several steps: de novo genome assembly, whole-genome alignment, extraction of B-localized regions, qPCR, FISH mapping, gene annotation and Gene Ontology enrichment. We found that the Bs contain thousands of sequences, representing mostly fragmented genes as well as a few largely intact genes. Our results showed that Bs are highly enriched in genes involved in many important biological functions, including but not limited to an interesting set of genes related to the cell cycle and chromosome formation. We propose that the accumulation of these genes on the B might have played a significant role in its transmission, survival and maintenance inside the cell. The discovery of genes on B chromosomes opens an exciting debate about their possible role in important evolutionary events such as adaptation and sex determination, and about their effect on the host genome.
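
The idea behind differential read coverage analysis can be sketched as a per-window log2 ratio of library-size-normalized read counts between a B+ and a B- sample. This is an illustrative sketch only: the window identifiers, counts, pseudocount and the log2 threshold of 1.0 are hypothetical choices, and the abstract does not specify the normalization or cutoff the authors used.

```python
from math import log2

def log2_coverage_ratio(cov_plus, cov_minus, lib_plus, lib_minus, pseudo=0.5):
    """Per-window log2(B+ / B-) ratio of reads-per-million coverage.

    cov_plus / cov_minus map window id -> read count; lib_plus / lib_minus
    are total mapped reads per sample. The pseudocount (an assumption)
    avoids division by zero in windows with no B- reads.
    """
    ratios = {}
    for window, count_plus in cov_plus.items():
        rpm_plus = (count_plus + pseudo) / lib_plus * 1e6
        rpm_minus = (cov_minus.get(window, 0) + pseudo) / lib_minus * 1e6
        ratios[window] = log2(rpm_plus / rpm_minus)
    return ratios

def candidate_b_windows(ratios, min_log2=1.0):
    """Windows whose B+ coverage exceeds B- coverage by the chosen threshold."""
    return sorted(w for w, r in ratios.items() if r >= min_log2)

# Hypothetical toy counts: window w3 has about 4x coverage in the B+ sample.
plus = {"w1": 100, "w2": 100, "w3": 400}
minus = {"w1": 100, "w2": 100, "w3": 100}
ratios = log2_coverage_ratio(plus, minus, lib_plus=1_000_000, lib_minus=1_000_000)
print(candidate_b_windows(ratios))
```

Windows flagged this way would only be candidates; the abstract lists qPCR and FISH mapping as the experimental steps that follow the computational extraction.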

RSG-120: Interplay between copy number, dosage compensation and expression noise in Drosophila
Topic: Drosophila genomics haploinsufficiency
  • Damian Wojtowicz, National Center for Biotechnology Information, United States
  • Dong-Yeon Cho, National Center for Biotechnology Information, United States
  • Hangnoh Lee, National Institute of Diabetes and Digestive and Kidney Diseases, United States
  • Steven Russell, University of Cambridge, United Kingdom
  • Brian Oliver, National Institute of Diabetes and Digestive and Kidney Diseases, United States
  • Teresa Przytycka, National Center for Biotechnology Information, United States

Short Abstract: Gene copy number variations are associated with many disorders characterized by high phenotypic heterogeneity. Disease penetrance differs even in genetically identical twins. Can such heterogeneity arise, in part, from increased expression variability of one-dose genes? While increased variability in the context of single-cell gene expression is well recognized, our computational simulations indicated that in a multicellular organism intrinsic single-cell-level noise should cancel out, and thus the impact of gene copy reduction on organismal-level expression variability must be due to something else. To systematically examine the impact of gene dose reduction on expression variability in a multicellular organism, we performed experimental gene expression measurements in Drosophila DrosDel autosomal deficiency lines. Genome-wide analysis revealed that autosomal one-dose genes have higher gene expression variability than two-dose genes. In flies, gene dose reduction is often accompanied by dosage compensation at the gene expression level. Surprisingly, the expression noise depended on the level of compensation. This increased, compensation-dependent variability was found to be a property of one-dose autosomal genes but not of X-linked genes in males, despite the fact that they too are dosage compensated, suggesting that sex chromosome dosage compensation also results in noise reduction. Previous studies attributed autosomal dosage compensation to feedback loops in interaction networks. Our results suggest that these feedback loops are not optimized to deliver consistent responses to gene deletion events, and thus gene deletions can lead to heterogeneous responses even in the context of an identical genetic background. Additionally, we show that expression variation associated with a reduced dose of transcription factors propagates through the gene interaction network, impacting a large number of downstream genes. 
These properties of gene deletions could contribute to the phenotypic heterogeneity of diseases associated with haploinsufficiency.
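
The claim that intrinsic single-cell noise should cancel out at the organism level can be illustrated with a small simulation. This is a toy sketch under stated assumptions, not the authors' simulation: each cell's expression is drawn independently from a Gaussian with Poisson-like variance (variance equal to the mean), the per-copy mean of 50 and the cell and replicate counts are hypothetical, and the organism-level value is taken as the mean over cells.

```python
import random
from statistics import mean, pstdev

def organism_cv(dose, n_cells, n_organisms=200, per_copy_mean=50.0, seed=0):
    """Coefficient of variation of organism-level expression.

    Each organism's expression is the average over n_cells independent
    cells; each cell is drawn from a Gaussian with mean dose * per_copy_mean
    and variance equal to that mean (a Poisson-like assumption).
    """
    rng = random.Random(seed)
    mu = dose * per_copy_mean
    sd = mu ** 0.5
    organisms = [
        mean(rng.gauss(mu, sd) for _ in range(n_cells))
        for _ in range(n_organisms)
    ]
    return pstdev(organisms) / mean(organisms)

# One-dose genes are noisier than two-dose genes in a single cell, but the
# difference becomes negligible once expression is averaged over many cells.
for n_cells in (1, 1000):
    cv_one = organism_cv(dose=1, n_cells=n_cells)
    cv_two = organism_cv(dose=2, n_cells=n_cells)
    print(n_cells, round(cv_one, 4), round(cv_two, 4))
```

Under these assumptions the coefficient of variation shrinks roughly with the square root of the number of cells, which is why an independent-noise model alone cannot account for higher organism-level variability of one-dose genes.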