Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

### Track: Regulatory Genomics (RegGenSig)

Session A-376: THiCweed: fast, sensitive detection of sequence patterns by clustering big data sets
COSI: RegGen

Short Abstract: Motivation: Thousands of ChIP-seq (chromatin immunoprecipitation followed by sequencing) datasets are now publicly available that provide genome-wide binding profiles for hundreds of transcription factors in various species and various cell types, with thousands to hundreds of thousands of "peaks" per dataset. Transcription factors commonly bind to regions of DNA that have short conserved sequence patterns, or "motifs". Ab initio motif finding is a well-established problem in computational biology, but such large data sets are challenging for most existing tools. Additionally, it is common for the target proteins to bind indirectly to DNA via co-factors, with the result that the ChIP-seq peaks contain a mixture of motifs. Few tools exist to deal effectively with this problem. Results: We describe a new approach to motif finding that models the problem as one of clustering bound regions based on sequence similarity. We take an iterative "top-down" approach of repeatedly subdividing an initial single large cluster of all input sequences into smaller and smaller clusters, while also exploring shift and reverse-strand matches of sequences to clusters. Our implementation is significantly faster than any other ChIP-seq-oriented motif-finding program we tested, able to process 5,000 sequences of 100bp length in a few minutes, or 30,000 sequences in 1-2 hours, on a desktop computer using a single CPU core. On synthetic data it outperforms all programs except one (MuMoD) on accuracy; compared to MuMoD it is somewhat less accurate but orders of magnitude faster. It is designed to perform well with "window" sizes much larger than the length of a typical binding site (7-15 base pairs), and we commonly run it with window sizes of 50bp or more.
On actual genomic data it successfully recovers literature motifs, but also uncovers highly complex sequence characteristics in flanking DNA, and in many cases recovers secondary motifs (of possible, and sometimes documented, biological significance) even when they occur in less than 5% of the input sequences. We suggest that this is a powerful new approach to the analysis of ChIP-seq data. Availability: The software is open source and available at http://www.imsc.res.in/~rsidd/thicweed/ under the two-clause BSD license.
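
Illustrative sketch (not taken from the THiCweed software): the abstract mentions exploring shift and reverse-strand matches of sequences to clusters. The Python below shows one plausible way to score a sequence against a cluster profile at several shifts and on both strands. The function names, the pseudocount, the `max_shift` parameter, and the uniform background of 0.25 are all assumptions made for illustration; the actual scoring used by THiCweed may differ.

```python
import math

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse-strand reading of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def column_probs(cluster, pseudocount=0.5):
    """Per-position base probabilities for equal-length sequences in a cluster."""
    probs = []
    for i in range(len(cluster[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for seq in cluster:
            counts[seq[i]] += 1
        total = sum(counts.values())
        probs.append({base: c / total for base, c in counts.items()})
    return probs


def best_placement(seq, probs, max_shift=2):
    """Return (log-likelihood, shift, strand) of the best match to the profile.

    Positions that fall outside the profile after shifting are scored with a
    uniform background probability of 0.25 (an assumption of this sketch).
    """
    best = None
    for strand, oriented in (("+", seq), ("-", reverse_complement(seq))):
        for shift in range(-max_shift, max_shift + 1):
            ll = 0.0
            for i, base in enumerate(oriented):
                j = i + shift
                p = probs[j][base] if 0 <= j < len(probs) else 0.25
                ll += math.log(p)
            if best is None or ll > best[0]:
                best = (ll, shift, strand)
    return best


profile = column_probs(["AAACCC", "AAACCC", "AAACCC"])
score, shift, strand = best_placement("GGGTTT", profile)
print(shift, strand)
```

Here the sequence `GGGTTT` only matches the cluster when read on the reverse strand, which is the situation the strand search is meant to catch.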

Session A-378: Differential analysis of regulatory elements based on ChIP-seq data
COSI: RegGen

Short Abstract: Gene expression is regulated by genomic DNA elements referred to as enhancers, which recruit a combination of transcription factors and co-factors to activate transcription from target core promoters. Epigenetic modifiers play a key role in disrupting the interaction between these regulatory elements, which potentially drives dysregulation in gene expression. Multifactorial diseases are not limited to a monogenic basis and are usually the result of malignant gene expression initiated by a complex aetiology including multiple genes and environmental conditions. Because enhancers are central to gene regulation and implicated in complex diseases, the identification of these regulatory elements and the differential analysis of trait-correlated epigenetic marks remain a challenge. Here we present a strategy to detect regulatory elements based on histone modifications (HMs) to locate epigenetic differences between healthy and diseased individuals. A binary random forest classifier based on read counts from public mESC ChIP-seq experiments is trained on high-confidence enhancer regions. This trained classifier is then applied to new observations originating from a complex disease study. To reduce the search space of differentially active regulatory elements between healthy and diseased samples we filter all regions that are in the neighborhood of differentially expressed genes and rank these according to the highest variance between them. With this we aim to prioritize potentially disease-causing candidate regions which are correlated with the dysregulation in gene expression in multifactorial disorders.

Session A-380: HMCan suite for the analysis of histone modifications and ATAC-seq data in cancer samples
COSI: RegGen

Short Abstract: Thousands of research studies use ChIP-seq and ATAC-seq data to profile histone modifications and assess changes of epigenetic profiles during normal development or disease. However, most of the existing analysis methods were developed for normal diploid genomes and do not take into account possible copy number aberrations inherent to cancer samples. Here we propose a suite of computational methods for ChIP-seq/ATAC-seq data analysis specifically developed for cancer samples (http://boevalab.com/tools.html). HMCan [1] detects ChIP-seq and ATAC-seq signal in normal and cancer samples. When two conditions are available, HMCan-diff [2] can compare the profiles. HMCan and HMCan-diff take into account replicates and, when available, paired-end reads; they correct for GC-content, copy number and mappability biases. Most importantly, they are able to correctly analyze the signal coming from amplification regions (i.e. regions amplified dozens of times in cancer samples). Our novel method, LILY, and accompanying scripts allow for further analysis of histone modification profiles. Using LILY, in only 45 seconds, the user can detect super-enhancer regions using H3K27ac data based on the HMCan output. Further, the user can extract valleys in H3K27ac peaks corresponding to open chromatin regions bound by TFs. This application is especially important for motif discovery: it allows identifying TFs driving specific enhancers and super-enhancers. The suite was validated on neuroblastoma cell lines with amplifications of MYCN. We could assess the status of a known MYCN enhancer active within the amplified regions. 1. Ashoor et al. Bioinformatics, 2013, 29(23): 2979-2986. 2. Ashoor et al. Nucleic Acids Research, 2017, 45(8): e58.

Session A-382: Human-specific genes and mechanistic innovations: is the hand that gives also the hand that takes?
COSI: RegGen

Short Abstract: “What makes us human?” There are many ways to analyze this pivotal question, none of which currently offers a thorough answer. Genomic differences between humans and chimpanzees were first described decades ago at the chromosome level. More recently, new technologies and Bioinformatics tools allowed comparisons at the molecular level and many traits have been observed that are ours alone. Both genetic and epigenetic mechanisms drive human evolution and the brain seems to be particularly (mainly positively) affected. Here we describe a myriad of methods by which organisms are able to adapt to change or evade undesired change. We discuss their effects in the human brain and provide many examples of dysregulation, revealing that mechanisms contributing to brain evolution may also result in psychiatric disease. One such mechanism, the generation of novel proteins, seems to be a central element to evolution and therefore we decided to investigate the universe of human-specific genes in greater detail. To accomplish this task we first compiled a dataset of such genes by extensively (although not exhaustively) searching the literature for reported examples. As a result, over 600 human-specific genes were listed, along with their location and functional annotation, comprising the largest set to date. We then analyzed the expression levels of such genes in human neuroblastoma cells and found many to be differentially expressed upon activation with KCl. We postulate that these human-specific genes play important roles in brain function in what may be a delicate balance between accelerated evolution and the emergence of disease.

Session A-384: Integrative analysis of single-cell expression data reveals distinct regulatory states in bidirectional promoters
COSI: RegGen

Short Abstract: Bidirectional promoters (BPs) are prevalent in eukaryotic genomes. However, it is poorly understood how the cell integrates different epigenomic information, such as transcription factor (TF) binding and chromatin marks, to determine directionality of gene expression at BPs. Single-cell sequencing technologies are revolutionizing genetics and this project focuses on the integration of single-cell RNA data with bulk ChIP-seq and other epigenetics data for which single-cell technologies are not yet established. We utilized novel human single-cell RNA-seq data, produced by DEEP, to reveal clusters of BP genes exhibiting various states of directionality across individual cells. For instance, one cluster contains BPs with highly expressed upstream and downstream genes for almost all single cells of the K562 cell line. In contrast, some BP genes are expressed in an alternating manner, where the expression of one gene always dominates the other one, and vice versa, depending on the subpopulation of cells. These differences are recapitulated by analyzing correlation of both genes at a BP across individual cells. We integrated other levels of genomic and epigenomic information to shed light on this previously unrecognized complexity in BP gene regulation. We explored CAGE expression of the upstream and downstream genes in those clusters, as well as stratified TF binding patterns, histone modifications (HM), and DNA methylation. These observations are of interest because, although the clusters are derived from the single-cell data, the bulk TF, HM, and DNA methylation profiles reflected the properties attributed to those states.

Session A-386: Discovery of Candidate Biomarker Using Regulatory Network Analysis for Lung Cancer RNA-seq Data
COSI: RegGen

Short Abstract: Non-Small Cell Lung Cancer (NSCLC) is the most common type of lung cancer. One of the treatment options for patients with NSCLC includes targeted therapies. Regulators, or transcription factors (TFs), affect gene expression, and expression patterns must be associated with drug sensitivity or resistance. In the present study, we used RNA-seq data of 143 NSCLC lines. We clustered the TPM (Transcripts Per Million) values using Cluster 3.0 and identified three distinct expression groups, 10 to 14 cell lines per group. Using iRegulon, a Cytoscape plugin based on motif enrichment, we constructed a regulatory network for the gene list of each group, resulting in 6 to 13 regulons per group. Work is in progress to prune the networks and represent them in hierarchical probabilistic Bayesian forms. The results may facilitate the discovery of master regulators. Mutations in these regulators may serve as predictive biomarkers for chemical compound sensitivity. This study can be helpful in developing precision medicine for lung cancer.

Session A-388: A seed extension approach to identify chromatin accessibility and DNA methylation from NOMe-seq
COSI: RegGen

Short Abstract: Chromatin is a fundamental structure for compactly packaging a genome and reducing its volume in eukaryotic cells, and consists of nucleosomes composed of ~147bp DNA wrapped around core histone proteins. Chromatin accessibility plays a key role in epigenetic regulation of gene activation and silencing. Open chromatin regions allow regulatory elements such as transcription factors and polymerases to bind for gene expression while closed chromatin regions prevent the activity of transcriptional machinery. It is well known that chromatin accessibility is highly correlated with DNA methylation and histone modifications such as methylation, acetylation, and phosphorylation. Recently, nucleosome occupancy and methylome sequencing (NOMe-seq) has been developed for simultaneously profiling chromatin accessibility and DNA methylation on single molecules. To the best of our knowledge, there is no standard method for de novo identification of chromatin accessibility from NOMe-seq data. Therefore, there is a great demand in developing computational methods to identify chromatin accessibility from NOMe-seq. In this paper, we present CAME (Chromatin Accessibility and Methylation), a seed-extension based approach that identifies chromatin accessibility from NOMe-seq. The efficiency and effectiveness of CAME were demonstrated through comparisons with other existing techniques on both simulated and real data, and the results show that our method not only can precisely identify chromatin accessibility but also outperforms other methods.

Session A-390: Integrated analysis of cancer omics and drug response for precision medicine
COSI: RegGen

Short Abstract: Current high-throughput technologies enable simultaneous acquisition of multi-level omics and RNAi/chemical screening data. Integration of these data helps identify associations of cancer targets and biomarkers, thus accelerating their clinical applications and patient stratification. In our previous work, we developed a web-based interactive tool, MACE, to analyze drug response and gene expression on NCI60 cell lines. In this study, we have implemented QProfile, a Java-based standalone software for interpreting drug response and gene silencing in the genotypic classification of cancer cell lines. Chemical screening data across NCI60 cell lines and shRNA screening data obtained from Project Achilles were organized to identify mutation- or lineage-specific chemicals and gene silencing signatures. This software allows users to identify potential associations of chemicals and genes with fully annotated homozygous mutant genes of cancers. QProfile is a valuable tool to predict and optimize the therapeutic window for anticancer agents and related gene targets.

Session A-392: Single-cell enhancer RNA analysis in mouse embryonic stem cells using RamDA-seq
COSI: RegGen

Short Abstract: Identifying sources of cell-to-cell variability in gene expression is important to fully understand and control developmental process and cell differentiation. Enhancers control spatiotemporal and cell type-specific patterns of gene expression, but it remains unclear whether enhancers contribute to gene expression variability observed within the same cell type. A straightforward way to address this hypothesis is to simultaneously measure gene expression and enhancer activity in single cells. Enhancer activity can be inferred by measuring enhancer RNA (eRNA) transcription using total RNA-seq methods in bulk samples. However, eRNA detection in single cells has been difficult with conventional single-cell RNA sequencing (scRNA-seq) methods because eRNAs are largely non-polyadenylated and low in expression level, but most of these methods target polyadenylated RNAs. Here, we propose a single-cell analysis of eRNAs using RamDA-seq, a novel single-cell total RNA sequencing method developed in our lab. First, using 10 pg of diluted RNA, we showed that RamDA-seq could detect the largest number of eRNAs compared with the other tested scRNA-seq methods. Next, we applied RamDA-seq to mouse embryonic stem cells (mESC) undergoing differentiation, and computationally identified eRNA expression. We showed that RamDA-seq could detect eRNAs in a cell-type-specific manner, and found enrichment of DNA-binding motifs of transcription factors involved in self-renewal and pluripotency of mESC. Finally, we revealed some eRNAs showed cell-to-cell variability even in the same cell-cycle phase and correlated with nearby genes. In the presentation, we will discuss the implication of the results and bioinformatic perspectives regarding enhancer identification.

Session A-394: Integrative Bayesian network-based analysis of multiple genomic data sets to detect altered genes in case-control studies
COSI: RegGen

Short Abstract: The pathogenesis of complex diseases can be partially elucidated by examining the gene regulatory landscape of various interdependent epigenomic mechanisms. To identify and better understand disease relevant mechanisms, recent studies have generated epigenome-wide data to complement genetic or transcriptomic data within the same cohort. Here, we suggest a novel statistical method that integrates the analysis of various genomic data types. Briefly, by leveraging functional interaction networks, the method detects genes that are consistently altered by disease across multiple data types. We introduce a generalized correlation coefficient defined as the sum of the standardized differences between a patient sample and control samples observed in different data types. A large coefficient indicates consistent differences for that specific patient sample and gene across several measurements including active or repressive epigenomic marks. The coefficients' distributions are modeled by a hierarchical Bayesian model where the distribution means are regressed on two gene-specific effects. One of these effects is given an intrinsic Gaussian CAR prior to model the functional relationship between genes as defined by a given gene interaction network. Genes are classified as differential based on the posterior distribution of the sum of the gene-specific effects. We applied our model to a data set consisting of 141 Alzheimer's disease samples and 92 control samples. RNA-seq, H3K9Ac ChIP-seq and DNA methylation data was generated from the subjects' prefrontal cortices. Genes classified as differential by our method were enriched for known Alzheimer's disease genes. Simulated data indicated a higher sensitivity compared to non-integrative approaches.
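
Illustrative sketch (not from the authors' software): the abstract defines a generalized coefficient as the sum of standardized differences between a patient sample and the control samples across data types. The Python below shows one plausible reading of that definition, using a z-score against the controls as the standardized difference. The z-score form, the function names, and the optional `signs` argument (for flipping repressive marks) are assumptions made for illustration.

```python
from statistics import mean, stdev


def standardized_difference(patient_value, control_values):
    """Z-score of one patient measurement relative to the control samples."""
    return (patient_value - mean(control_values)) / stdev(control_values)


def generalized_coefficient(patient, controls, signs=None):
    """Sum the standardized differences for one gene across data types.

    patient:  dict mapping data type -> patient measurement for the gene
    controls: dict mapping data type -> list of control measurements
    signs:    optional dict mapping data type -> +1 or -1 (hypothetical hook,
              e.g. -1 for a repressive mark so that effects add up consistently)
    """
    signs = signs or {}
    return sum(
        signs.get(dtype, 1) * standardized_difference(value, controls[dtype])
        for dtype, value in patient.items()
    )


controls = {"rna": [1.0, 2.0, 3.0], "h3k9ac": [2.0, 4.0, 6.0]}
patient = {"rna": 4.0, "h3k9ac": 8.0}
print(generalized_coefficient(patient, controls))
```

A large absolute value arises only when the patient sample deviates in a consistent direction across the data types, which is the behaviour the abstract describes.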

Session A-396: Delineating gene regulatory networks in diatoms using gene expression and de novo motif finding analysis
COSI: RegGen

Short Abstract: Diatoms (Bacillariophyceae) belong to the heterokont algae (Chromalveolates), a group which originated through secondary endosymbiosis when a photosynthetic red alga was engulfed by a heterotrophic host. Among microalgae, diatoms stand out, as they are not only one of the most species-rich phytoplankton classes, but are solely responsible for about 40% of all oceanic carbon fixation, totaling 20% of photosynthesis on Earth. Despite the great interest in diatoms for ecological and economical purposes, there is a severe lack of genes that have been characterized functionally and at the regulatory level. Based on the assumption that co-expressed genes may be co-regulated, co-expression networks were delineated for previously annotated transcription factors and de novo motif finding was performed genome-wide. The resulting motif collection was compared to a database of known transcription factor binding sites (TFBS), derived from several sources, and motifs belonging to the correct transcription factor family were retained. Using this approach a TFBS could be inferred for 117 transcription factors. For a selection of transcription factors, protein binding microarray (PBM) experiments will be performed to confirm the predicted transcription factor binding sites. To infer a gene regulatory network (GRN), which is the collection of regulatory interactions between transcription factors (TFs) and their target genes, target genes were defined by the simple mapping of TF binding sites. Although this can lead to many false-positive regulatory interactions, it has been shown that module-based enrichment analysis as well as filtering for conserved TF binding sites drastically improves the power to detect functional regulatory interactions.

Session A-398: RegulatorTrail: a web service for the identification of key transcriptional regulators
COSI: RegGen

Short Abstract: Transcriptional regulators such as transcription factors, cofactors or chromatin modifiers play an essential role in most biological processes. Alterations in their activities have been observed in many diseases, e.g. cancer. Hence, it is of utmost importance to identify influential regulators that might control natural and pathogenic mechanisms. To this end, we have developed a new web service called RegulatorTrail. RegulatorTrail provides a variety of methods that utilize regulator binding information in combination with transcriptomic or epigenomic datasets to infer the most influential regulators. Our web service can be used in four distinct application scenarios that provide solutions for different input files: gene lists, gene expression data, open chromatin regions or histone marks. In each scenario, the output is a prioritized list of influential transcriptional regulators that can be visualized in the web browser or downloaded in a variety of standard file formats, including CSV, JSON, Excel, and PDF. Additionally, all RegulatorTrail results can be further analyzed using the enrichment or network analysis functionality of the GeneTrail2 web service in order to find common biological functions or shared signaling pathways. Our web service not only provides an intuitive web interface, but also a well-documented RESTful API that allows for an integration into third-party workflows. RegulatorTrail is freely accessible at: https://regulatortrail.bioinf.uni-sb.de/.

Session A-400: Circadian analysis of genomic Glucocorticoid hormone action
COSI: RegGen

Short Abstract: The Glucocorticoid Receptor (GR) is a ligand-dependent transcription factor belonging to the nuclear receptor superfamily. Gene regulation by GR is essential for mammalian physiology (including metabolic homeostasis, circadian rhythms, immune reactions etc.). Impaired GR action is linked to metabolic dysregulation such as obesity and diabetes. Similarly, circadian clocks control metabolism and energy homeostasis, and chronic circadian disruption causes metabolic disease. For example, a prolonged High Fat Diet may alter the mammalian circadian clock. Glucocorticoids are secreted with a prominent circadian rhythm, and GR has been shown to directly regulate core clock genes, yet the circadian distribution of GR binding to chromatin has not yet been studied. Our aim is to elucidate the genome-wide binding profiles of GR using ChIP-Seq in mouse livers collected throughout the day/night cycle, and its alteration induced by High Fat Diet. We track the protein binding changes comparing the high fat and control diets in a time series manner. We used the RNA-Seq data to cross-validate differential binding sites using the gene isoform level analysis and also differential gene expression analysis. JTK circadian analysis is also coupled to measure the transcript rhythmicity. We have identified the circadian pattern of genome-wide GR binding every 4 hrs throughout the day/night cycle. GR peaks in our data show significant overlap per time point with CLOCK factors, highlighting its involvement in the mammalian circadian mechanism of gene regulation. Our results provide further insights into the effect of changes in diet on GR regulation in mammalian genomes.

Session A-402: InMoDe: tools for learning and visualizing intra-motif dependencies of DNA binding sites
COSI: RegGen

Short Abstract: The statistical modeling of transcription factor binding sites and similar short functional nucleotide sequences is one of the core challenges of computational biology. The classical position weight matrix (PWM) model and its visualization by sequence logos used to be the state of the art for motif representation for decades. However, recent studies suggest that its simplifying independence assumption is often not justified and that taking into account dependencies between binding site positions often yields a better motif representation. Challenges for leveraging dependencies are (i) choosing models of appropriate complexity in order to cope with the problem of overfitting and (ii) visualizing motifs with dependencies intuitively. In order to address these challenges, we present InMoDe, a comprehensive suite of tools for learning, leveraging, and visualizing intra-motif dependencies. Central features of InMoDe are (i) a robust model selection from a class of parsimonious models, taking into account dependencies only if justified by the data, and (ii) an intuitive graphical representation of the learned model by conditional sequence logos, an extension of traditional sequence logos. InMoDe contains tools for applying the learned models to sequence scans and classification tasks. To allow a broad community of scientists with different levels of expertise an easy use of InMoDe, it provides a command line interface, a GUI, and an integration into Galaxy workflows.
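
Illustrative sketch (not part of InMoDe): the abstract contrasts the classical position weight matrix (PWM), which treats binding site positions as independent, with models that capture intra-motif dependencies. The Python below builds a log-odds PWM from toy binding sites and shows the limitation. The function names, the pseudocount, and the uniform background of 0.25 are assumptions made for illustration.

```python
import math


def build_pwm(sites, pseudocount=1.0):
    """Build a log2-odds PWM from equal-length binding sites.

    Each column is estimated independently of all other columns, which is
    exactly the independence assumption discussed in the abstract.
    """
    pwm = []
    for i in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / 0.25) for b in "ACGT"})
    return pwm


def pwm_score(pwm, seq):
    """Sum the per-position log-odds scores of a sequence."""
    return sum(column[base] for column, base in zip(pwm, seq))


# Toy sites with a perfect dependency: position 2 always equals position 1.
sites = ["AA", "TT"] * 5
pwm = build_pwm(sites)
print(pwm_score(pwm, "AA"), pwm_score(pwm, "AT"))
```

In this toy example `AT` never occurs among the sites, yet it receives the same score as `AA`, because the PWM cannot represent the dependency between the two positions.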

Session A-404: Gene nucleotide composition accurately predicts expression and is linked to topological chromatin domains
COSI: RegGen

Short Abstract: Gene expression is orchestrated by distinct regulatory regions (e.g. promoters, enhancers, UTRs) to ensure a wide variety of cell types and functions. A challenge is to identify which regulatory regions are active, what are their associated features and how they work together in each cell type. Several approaches have tackled this problem by modeling gene expression based on epigenetic marks (e.g. ChIP-seq, methylation, DNase hypersensitivity), with the ultimate goal of identifying driving genomic regions and mutations that are clinically relevant in particular in precision medicine. However, these models rely on experimental data, which are limited to specific samples and cannot be generated for all regulators and all patients. In addition, we show here that, although these approaches are accurate in predicting gene expression, their biological interpretation can be misleading. We develop here a method for predicting mRNA levels based solely on sequence features collected from distinct regulatory regions, which is as accurate as methods based on experimental data. Our approach confirms the importance of nucleotide composition in predicting gene expression and ranks regulatory regions according to their contribution. It also unveils strong influence of gene body sequence, in particular introns. We further provide evidence that the contribution of nucleotide content can be linked to co-regulations associated with genome 3D architecture and to associations of genes within topologically associated domains. Our study confirmed the existence of sequence-level instructions for gene expression, which lie in genomic regions largely underestimated in regulatory genomics but which appear to be linked to chromatin architecture.

Session A-406: Poly-Enrich: A Count-Based Method for Improved Gene Set Enrichment Testing of Genomic Region Sets
COSI: RegGen

Short Abstract: Gene set enrichment (GSE) testing can enhance the biological interpretation of ChIP-seq data or other large sets of genomic regions. Our group has previously introduced two GSE methods for ChIP-seq data: one for narrow peak regions based on a binary score of whether each gene has at least one peak (ChIP-Enrich), and one for broad genomic regions, such as for histone modifications. The first step in these methods is to assign genomic regions to a target gene based on the nearest transcription start site or other methods of choice. Here, we introduce a method, Poly-Enrich, which models the number of peaks assigned to a gene using a generalized additive model with a negative binomial family to determine gene enrichment, while adjusting for locus length (#bps associated with each gene). Using permutations of 90 ENCODE ChIP-Seq datasets, we validated Type I error and compared performance with ChIP-Enrich using the unpermuted data. As opposed to ChIP-Enrich, Poly-Enrich works well even when almost all genes have a peak. Aside from that, the optimal test depended more on the pathway being regulated than on the transcription factor or other properties of the dataset. We discovered clusters of biologically related GO terms that were consistently more enriched with either the count-based or binary score method. This suggests that the regulation of certain processes is modified by multiple binding sites (count-based), while others require only one (binary). We are currently developing a hybrid test that automatically chooses the optimal method to report, with correct FDR-adjustment.
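
Illustrative sketch (not the Poly-Enrich implementation): the abstract describes assigning genomic regions to genes by nearest transcription start site, then using either a binary score (at least one peak) or a peak count per gene. The Python below shows only this assignment and scoring step for a single chromosome. The function names and toy coordinates are assumptions made for illustration; the generalized additive model and the locus-length adjustment are not reproduced here.

```python
from bisect import bisect_left


def count_peaks_per_gene(peak_midpoints, tss_by_gene):
    """Assign each peak to the gene with the nearest TSS and count peaks.

    tss_by_gene maps gene name -> TSS coordinate on one chromosome.
    """
    sites = sorted((pos, gene) for gene, pos in tss_by_gene.items())
    positions = [pos for pos, _ in sites]
    counts = {gene: 0 for gene in tss_by_gene}
    for mid in peak_midpoints:
        i = bisect_left(positions, mid)
        # Only the TSS just left and just right of the peak can be nearest.
        candidates = sites[max(i - 1, 0): i + 1]
        _, gene = min(candidates, key=lambda site: abs(site[0] - mid))
        counts[gene] += 1
    return counts


def binary_scores(counts):
    """Binary score: 1 if a gene has at least one peak, else 0."""
    return {gene: int(n > 0) for gene, n in counts.items()}


tss = {"g1": 100, "g2": 1000, "g3": 5000}
counts = count_peaks_per_gene([90, 120, 400, 980], tss)
print(counts)
print(binary_scores(counts))
```

The count-based score separates `g1` (three peaks) from `g2` (one peak), whereas the binary score treats them identically; this is the distinction between the two kinds of test.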

Session A-408: Cross-species functional modules identify conserved splicing and immune biomarkers during aging
COSI: RegGen

Short Abstract: Aging is a universal process that results in progressive loss of viability and increase in vulnerability to death, and underlies a spectrum of diseases. Although the rate of aging varies widely across metazoans, it was shown that aging is under genetic regulation in model organisms. However, it is still poorly understood how model organisms can help in understanding the biology of healthy human aging, as well as how well we can transfer functional information of aging from model organisms to human. We took a system-level approach of evolutionary conservation, combining species-specific gene sets and evolutionary orthologous groups (EOG) to find cross-species core aging processes. This allowed us to identify pathways that were not significantly enriched in a single species. Overall, we have conducted a comprehensive comparative transcriptomic analysis of healthy aging and of caloric restriction in four species: human, mouse, fly and worm. Furthermore, we used a probabilistic approach to integrate functional information of healthy aging, borrowing this information from model organisms to human. This led to the discovery of evolutionarily conserved splicing and innate immune system functional modules, with genes enriched in several age-related GWAS studies, such as fasting glucose and rheumatoid arthritis. Our study provides a resource for experimental and comparative aging genomics and gives insights into the evolution of health span.

Session A-410: BayesPI-BAR: predicting non-coding mutation effects on protein-DNA interaction with a new biophysical model
COSI: RegGen

Short Abstract: Sequence variations in regulatory DNA regions are known to cause functionally important consequences for gene expression. DNA sequence variations may have an essential role in determining phenotypes and may be linked to disease; however, their identification through analysis of massive genome-wide sequencing data is a great challenge. In this work, a new computational pipeline, a Bayesian method for protein-DNA interaction with binding affinity ranking (BayesPI-BAR), is proposed for quantifying the effect of sequence variations on protein binding. BayesPI-BAR uses biophysical modeling of protein-DNA interactions to predict single nucleotide polymorphisms (SNPs) that cause significant changes in the binding affinity of a regulatory region for transcription factors (TFs). The method includes two new parameters (TF chemical potentials or protein concentrations and direct TF binding targets) that are neglected by previous methods. The new method is verified on 67 known human regulatory SNPs, of which 47 (70%) have predicted true TFs ranked in the top 10. Importantly, the performance of BayesPI-BAR, which uses principal component analysis to integrate multiple predictions from various TF chemical potentials, is found to be better than that of existing programs, such as sTRAP and is-rSNP, when evaluated on the same SNPs. BayesPI-BAR is a publicly available tool and is able to carry out parallelized computation, which helps to investigate a large number of TFs or SNPs and to detect disease-associated regulatory sequence variations in the sea of genome-wide noncoding regions.

COSI: RegGen

Short Abstract: Many gene regulatory networks appear to contain enhancers with partially overlapping function (i.e., overlapping gene expression patterns). Those enhancers are referred to as shadow or redundant enhancers. While their purpose is a matter of debate, the mechanisms by which they originate have seldom been addressed. It is tacitly assumed that shadow enhancers mainly originate by means of duplication. Here, we investigated whether convergent evolution of shadow enhancers may be more widespread than generally assumed. We examined a set of enhancers from the FANTOM project, assigned target genes to them, and grouped them accordingly to predict shadow enhancers. From initially ~43,000 Phase 1 FANTOM enhancers we identified a set of ~3,000 enhancers with high activity correlation to ~5,600 FANTOM promoters that correspond to ENSEMBL protein-coding transcripts (on average an enhancer is correlated with 2.8 promoters) across roughly 200 tissues. Of those enhancers, ~1,500 overlap with transposons. Moreover, ~350 of them are part of a shadow enhancer group of transposon-overlapping enhancers. Finally, we identified a set of ~200 shadow enhancer pairs with different transposon origin and origination times. Our results provide evidence that convergence is indeed a pronounced mechanism of shadow enhancer evolution.
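
Illustrative sketch (not the authors' pipeline): the abstract describes linking enhancers to promoters by activity correlation across tissues and grouping enhancers by target to predict shadow enhancers. The Python below shows one plausible version of that step using Pearson correlation. The correlation measure, the threshold of 0.9, the function names, and the toy activity values are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length activity profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def shadow_enhancer_groups(enhancers, promoters, threshold=0.9):
    """Group enhancers by the promoter they correlate with.

    A promoter with two or more correlated enhancers yields a candidate
    shadow enhancer group. The threshold is a hypothetical cutoff.
    """
    groups = {}
    for promoter, promoter_activity in promoters.items():
        linked = [
            name
            for name, activity in enhancers.items()
            if pearson(activity, promoter_activity) >= threshold
        ]
        if len(linked) >= 2:
            groups[promoter] = linked
    return groups


promoters = {"p1": [1, 2, 3, 4]}
enhancers = {"e1": [2, 4, 6, 8], "e2": [1, 2, 3, 5], "e3": [4, 3, 2, 1]}
print(shadow_enhancer_groups(enhancers, promoters))
```

In the toy data, `e1` and `e2` both track promoter `p1` across tissues and form a candidate group, while the anti-correlated `e3` is excluded.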

Session A-413: cepip: context-dependent epigenomic weighting for prioritization of regulatory variants and disease-associated genes
COSI: RegGen

Short Abstract: It remains challenging to predict regulatory variants in particular tissues or cell types due to highly context-specific gene regulation. By connecting large-scale epigenomic profiles to expression quantitative trait loci (eQTLs) in a wide range of human tissues/cell types, we identify critical chromatin features that predict variant regulatory potential. We present cepip, a joint likelihood framework, for estimating a variant’s regulatory probability in a context-dependent manner. Our method exhibits significant GWAS signal enrichment and is superior to existing cell type-specific methods. Furthermore, using phenotypically relevant epigenomes to weight the GWAS single nucleotide polymorphisms, we improve the statistical power of the gene-based association test. The software and user manual are available at http://jjwanglab.org/cepip or https://github.com/mulin0424/cepip.

Session A-414: ChARMDiff: Combinatorial Chromatin State Difference in multiple cell types and conditions using Association Rule Mining
COSI: RegGen

Short Abstract: Various chromatin modifications, identified in large-scale epigenomic analyses, are associated with distinct phenotypes of different cells and disease phases. To improve our understanding of these variations, many computational methods have been developed to discover novel sites and cell-specific chromatin modifications. However, the discovery of combinatorial patterns of differential chromatin modifications across tissues, cell types, and disease phases is a non-trivial task and has remained unaddressed. In this regard, we report ChARMDiff, a new computational approach based on association rule mining, which performs de novo discovery of differential chromatin modification patterns and characterizes globally occurring patterns of combinatorial chromatin state difference between multiple cell types and conditions. We apply ChARMDiff to two pairs of epigenomes in normal and cancer cells, i.e., GM12878 and K562 from ENCODE, and normal and hepatocellular carcinoma tissues of hepatitis B virus X-transgenic mice. ChARMDiff provides a scalable framework that can easily be applied to find various levels of combination patterns, which should reflect a range of globally common to locally rare chromatin modifications. Our approach provides new insights into the resolution of the histone code hypothesis to characterize epigenetic variations in distinct phenotypes of different cells, disease phases and experimental conditions.

Session A-415: MethylAger: DNA Methylation Based Age Prediction in RnBeads
COSI: RegGen

Short Abstract: Methylation of CpG dinucleotides in the genome is considered the best-understood epigenetic mark. Studies have linked DNA methylation to X-chromosomal inactivation, genomic imprinting and gene repression as well as to cell differentiation and disease. Recently, DNA methylation patterns have also been shown to change with human age, which finally led to models capable of accurately predicting chronological donor age from methylation signatures. The inferred ‘epigenetic age’ has been associated with diseases such as HIV1 infection and cancer. Current models predict age solely using data derived from the Illumina Infinium array platforms, while genome-wide bisulfite sequencing data becomes increasingly available. Here, we present MethylAger, the first epigenetic age prediction tool that supports data from DNA methylation microarrays and bisulfite sequencing. The tool facilitates usage of epigenetic age prediction within the user-friendly environment of RnBeads, an R package for comprehensive analysis of DNA methylation data. MethylAger can derive age-predictive models from user-provided data sets. Furthermore, the tool provides pre-trained models for Infinium 27/450 and RRBS data. These pre-trained models were created from large data sets and validated on additional independent test data sets. Integration into the RnBeads pipeline enables interpretation of the results with interactive HTML reports. Applications include the augmentation of available sample metadata, improved exploratory and differential analysis and assessment of data quality. In conclusion, MethylAger enables the integration of epigenetic age information into DNA methylation analysis workflows and is a versatile tool that can be used for disease-focused analyses.

Session A-416: ChIP-eat: from raw sequence reads to high quality TFBS prediction
COSI: RegGen

Short Abstract: Chromatin immunoprecipitation followed by sequencing (ChIP-seq) represents the most popular experimental assay to identify the genomic regions, so-called ChIP-seq peaks, where transcription factors (TFs) bind to DNA in vivo. The ever-increasing number of publicly available ChIP-seq data sets provides an unprecedented opportunity to develop and evaluate computational tools designed to infer the precise locations of the TF binding sites (TFBSs) within ChIP-seq peaks by combining both computational and experimental evidence of direct TF-DNA interactions. While TFBSs are traditionally modelled through position weight matrices (PWMs), more advanced computational methods have recently been developed to incorporate nucleotide dependencies, variable spacing and DNA conformation in their models. These methodologies highlight that a one-size-fits-all model for TFBS prediction is not applicable. We have developed ChIP-eat, a uniform ChIP-seq data processing pipeline, from raw data to accurate, TF-specific TFBS prediction. After ChIP-seq peak calling, we assessed four different types of TFBS models by computing the enrichments of predicted TFBSs at the ChIP-seq peak summits (where the highest number of reads mapped). Along with PWMs, we evaluated binding energy models, transcription factor flexible models and DNA shape-based models for each ChIP-seq data set. We applied ChIP-eat using the hg38 version of the human genome on 1,168 ENCODE and 2,099 GEO data sets covering a total of 496 distinct TFs. Our work culminates with the generation of a large, publicly available collection of uniformly processed ChIP-seq data sets from which we obtained ChIP-seq peaks and accurate TFBS predictions derived from the best model per data set.

Session A-417: Efficient Inference for Sparse Latent Variable Models of Transcriptional Regulation
COSI: RegGen

Short Abstract: Motivation: Regulation of gene expression in prokaryotes involves complex co-regulatory mechanisms involving large numbers of transcriptional regulatory proteins and their target genes. Uncovering these genome-scale interactions constitutes a major bottleneck in systems biology. Sparse latent factor models, assuming activity of transcription factors (TFs) as unobserved, provide a biologically interpretable modelling framework, integrating gene expression and genome-wide binding data, but at the same time pose a hard computational inference problem. Existing probabilistic inference methods for such models rely on subjective filtering and suffer from scalability issues, and thus are not well suited for realistic genome-scale applications. Results: We present a fast Bayesian sparse factor model, which takes as input gene expression and binding site data, either from ChIP-seq experiments or motif predictions, and outputs active TF-gene links as well as latent TF activities. Our method employs an efficient variational Bayes scheme for model inference, enabling its application to large datasets, which was not feasible with existing MCMC-based inference methods for such models. We validate our method on synthetic data against a similar model in the literature that employs MCMC for inference, and obtain comparable results in a small fraction of the computational time. We also apply our method to large-scale data from Mycobacterium tuberculosis involving ChIP-seq data on 113 TFs and matched gene expression data for 3863 putative target genes. We evaluate our predictions using an independent transcriptomics experiment involving over-expression of TFs.

Session A-418: pqsfinder: imperfection-tolerant identification of potential quadruplex-forming sequences in R
COSI: RegGen

Short Abstract: Motivation: G-quadruplexes (G4s) are one of the non-B DNA structures easily observed in vitro and suspected to form in vivo. Recent experiments with G4-specific antibodies and G4-unwinding helicase mutants confirm these suspicions. These four-stranded structures have also been shown to influence a range of molecular processes in cells. Because the structures are intensively studied, it is often desirable to screen DNA sequences and pinpoint the precise locations where G4s might form. With Bioconductor being a popular platform for sequence analysis, we were motivated to provide such search capability to its users, building on our previous research on triplex-forming sequence detection. In the newly presented approach, we allow for flexible searches that accommodate possible divergence from the optimal base composition. The existence of such imperfections was one of the main conclusions of the recently published G4-seq experiments. Results: We describe a Bioconductor package for identifying potential quadruplex-forming sequences (PQS) that is easy to use but at the same time flexible and scalable. We demonstrate that the algorithm behind the searches has a 96% accuracy on 392 currently known and experimentally observed G4 structures. We also carried out searches against the recent G4-seq data to verify how well we can identify the structures detected by that technology. The correlation with pqsfinder predictions was 0.619, higher than the correlation obtained with the second-best tool, G4Hunter.
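
For contrast with an imperfection-tolerant search, the following sketch scans for the strict canonical G4 pattern (four runs of at least three guanines separated by loops of 1 to 7 nucleotides) using a regular expression. This is a commonly used baseline, not the pqsfinder algorithm; the function name and the example sequences are illustrative assumptions.

```python
import re

# Strict canonical pattern: G-run, loop, G-run, loop, G-run, loop, G-run.
CANONICAL_G4 = re.compile(
    r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}"
)

def find_canonical_g4(seq):
    """Return (start, end, match) for non-overlapping canonical G4 motifs."""
    return [(m.start(), m.end(), m.group())
            for m in CANONICAL_G4.finditer(seq.upper())]

perfect = "TTAGGGTTAGGGTTAGGGTTAGGGTT"   # four intact GGG runs
imperfect = "GGGTTAGAGTTAGGGTTAGGG"      # one run interrupted by a mismatch

print(find_canonical_g4(perfect))
print(find_canonical_g4(imperfect))
```

The second sequence differs from a perfect motif by a single base in one G-run and is missed entirely by the strict pattern, which is the kind of case an imperfection-tolerant search is meant to recover.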

Session A-419: A comprehensive database of cis-regulatory elements associated with microRNAs
COSI: RegGen

Short Abstract: Background: MicroRNAs (miRNAs) are small non-coding RNAs, which affect the production of proteins from mRNAs through post-transcriptional regulation. Their own expression must be precisely controlled as they ensure key cellular processes and are crucial for cell physiology. Despite the large number of known miRNAs, the grasp on their transcriptional regulation has been limited by the lack of knowledge regarding the location of the promoters driving their transcription. Further, studies have demonstrated that cis-regulatory regions may play an important role in miRNA post-transcriptional biogenesis. However, a comprehensive and interactive database of cis-regulatory regions associated with miRNAs is currently lacking. Results: We are developing such a database of cis-regulatory regions and their associations with miRNAs. First, we have created a catalog of experimentally verified and computationally predicted miRNA transcription start sites (TSS) in a tissue/cell-type-specific fashion through careful literature curation. Second, we are developing a database of chromatin interaction hierarchy by using Hi-C data, which contain information about topologically associating domains (TADs), sub-TADs, and frequently interacting regions. By utilizing the curated TSS information and the chromatin interaction data, we associate cis-regulatory regions with miRNAs; these associations will be publicly available through an interactive web interface with easy browsing, searching, downloading and overlap analysis features. Conclusions: We envision that a wider research community will benefit from these carefully curated resources, which will help to understand and analyze the roles of cis-regulatory regions in the transcriptional gene regulation of miRNAs.

Session A-420: ROSE: A Deep Learning Based Framework for Predicting Ribosome Stalling
COSI: RegGen

Short Abstract: Translation elongation plays a crucial role in multiple aspects of protein biogenesis, e.g., differential expression, cotranslational folding and secretion. However, our current understanding of the regulatory mechanisms underlying translation elongation dynamics and the functional roles of ribosome stalling in protein synthesis remains limited. Here, we present a deep learning based framework, called ROSE, to effectively predict ribosome stalling events in translation elongation from coding sequences. Our validation results on both human and yeast datasets demonstrate superior performance of ROSE over conventional prediction models. With high prediction accuracy and robustness across different datasets, ROSE provides an effective index to estimate the translational pause tendency at codon resolution. We also show that the ribosome stalling score (RSS) output by ROSE correlates with diverse putative regulatory factors of ribosome stalling, e.g., codon usage bias, codon co-occurrence bias, proline codons and N6-methyladenosine (m6A) modification, which validates the physiological relevance of our approach. In addition, our comprehensive genome-wide in silico studies of ribosome stalling based on ROSE recover several notable functional interplays between elongation dynamics and cotranslational events in protein biogenesis, including protein targeting by the signal recognition particle (SRP) and protein secondary structure formation. Furthermore, our intergenic analysis suggests that the enriched ribosome stalling events at the 5’ ends of coding sequences may be involved in the modulation of translation efficiency. These findings indicate that ROSE can provide a useful index to estimate the probability of ribosome stalling and offer a powerful tool to analyze large-scale ribosome profiling data, which will further expand our understanding of translation elongation dynamics.

Session A-421: Impact of Evolutionary Age on Gene Coexpression Network Architecture and Genetic Diversity in Homo sapiens
COSI: RegGen

Short Abstract: In an era of an ever-increasing number of organisms with sequenced genomes, direct estimation of functionally related genes from the genome is a fundamental challenge in computational biology. In this study, we investigated genomic features associated with the functional relationships of Homo sapiens genes by using gene coexpression data in COXPRESdb (http://coxpresdb.jp). Gene coexpression, a similarity of gene expression profiles, provides a genome-wide approximation of functional gene relationships at the level of transcriptional regulation. In a comparative analysis between the similarity of genomic features and the strength of gene coexpression, we found that genes belonging to the same evolutionary age group tend to be strongly coexpressed. Gene ontology enrichment analysis revealed that the evolutionarily older genes possess a central role in cellular activities, whereas the younger genes are likely to participate in lineage- or species-specific phenotypic evolution. We also investigated the effect of evolutionary age on genetic diversity and selective pressure in human populations by comparing allele frequencies between the older gene loci and the younger gene loci. We anticipate our results to be a starting point for understanding the mechanisms underlying cellular systems evolution, and for developing a genome-based gene function prediction method.

Session A-422: Computational methods for detection of allele specific expression and their diagnostic relevance for rare Mendelian disorders
COSI: RegGen

Short Abstract: Background: Several rare hereditary disorders and traits can be explained by allele specific expression controlled by novel and/or inherited genetic mutations within regulatory regions of the genome. However, accurate identification of allelic imbalance using integrated genome and transcriptome sequencing remains a computational challenge. A thorough evaluation of existing bioinformatics approaches is vital to our understanding of limitations associated with detection of complex events such as loss of heterozygosity and genetic imprinting. Description: The study makes use of whole genome and transcriptome sequencing datasets for patients displaying clinical phenotypes of rare genetic diseases. We generated a profile for heterozygous mutations from whole genome data and studied the distribution of these loci within the transcriptome data in terms of reference and alternate allele frequency and sequencing depth. Using these features as a guide, we initiated the assembly of a computational framework by implementing an algorithm for the identification of allelic imbalance at the gene level and validated the presence of several genetically imprinted transcripts. Conclusion: Recent studies have demonstrated the use of several statistical tests, such as the t-test, Fisher’s exact test and beta-binomial models, for inferring allelic bias; however, each test has its own limitations. We thoroughly evaluated existing methods using an integration of genome and transcriptome datasets and, as a first step, designed an algorithm to characterize mutation profiles for genes showing strongly skewed monoallelic expression. The analysis helps guide the design and implementation of improved computational approaches for decoding the genetic bases of rare Mendelian disorders.

Session A-423: Predicting transcription factor binding through ChIP-seq meta-analysis
COSI: RegGen

Short Abstract: Enhancers are “docking stations” for transcription factors (TFs) for which several challenging questions remain unresolved in the field of regulatory genomics. How are the genomic binding sites chosen and can we predict them from the sequence? What are the differences in TF binding between different cell types? To address these questions, we analyzed 83 TF ChIP-seq data sets representing binding sites of 7 different TFs (namely p53, REST, GRHL2, ESR1, HIF1A, SOX2, PPARG). Through a ranking-based meta-analysis we found that for each TF, the binding across different cell types and conditions is strongly preserved. This yielded a core set of bound regions across cell types. From this training data we learned features and trained models to predict TF binding genome-wide. Surprisingly, for REST and TP53 a single motif is already sufficient to predict binding with a high performance (AUROC 0.95 and 0.85, respectively). Predictions for the other TFs range from 0.53 (for HIF1A) to 0.73 (for GRHL2) but these can be significantly improved by including multiple motifs of the same TF, yielding performance increases by 11-16%. In addition, for SOX2, GRHL2, HIF1A and PPARG we find that the CG content is very informative to predict binding, increasing the AUROC by 9-20%. Using a combination of motifs and CG content allowed us to reach an overall performance between 0.717 and 0.989 to predict the binding of a TF genome-wide, working best for REST, TP53 and GRHL2, while co-factors are likely needed for the other factors. In conclusion, meta-analysis of ChIP-seq data yields high-quality training data to learn machine learning classifiers that can accurately predict TF binding from sequence.

Session A-424: Predicting multiple groups of enhancers using high confidence experimentally validated examples
COSI: RegGen

Short Abstract: Enhancers are important regulatory regions located throughout the genome, primarily in the non-coding part. Several experimental approaches have been developed over recent years to identify their location. Computational methods for enhancer prediction are often trained on experimentally identified regions, and therefore rely critically on their correctness. We used reporter assays to test candidate regions for enhancer activity with the goal of achieving a high confidence training set. However, on closer inspection of the enhancers identified in our reporter assay experiments, it is striking that many of them seem to be inactive when mapped back into the cellular context. These observed episomal versus chromosomal differences in enhancer activity underline the remaining challenge in the composition of training sets for computational enhancer prediction methods in general. We tackle this problem by considering multiple groups of enhancers in our Random Forest based classification model. Thereby we reduce the introduction of misleading (or false positive) information to our classifier without dismissing a large and potentially very informative part of our experimental results.

Session A-425: Probabilistic simulation of scRNA-seq data from lineage trees
COSI: RegGen

Short Abstract: Background: Single-cell RNA sequencing (scRNA-seq) is revolutionizing the study of cellular heterogeneity and differentiation. Single-time snapshots often contain cells at intermediate differentiation stages, which should allow retrieval of the underlying lineage trees. These would aid the understanding of dynamic processes like early embryonic development or hematopoiesis. Testing methods that aim to predict lineage trees is challenging as no labeled data with known ground truth are available. To fill this gap and test methods on data from known and complex tree topologies, we developed PROSSTT (PRObabilistic Simulation of ScRNA-seq Tree-like Topologies). Results: PROSSTT simulates scRNA-seq data for arbitrary lineage trees. Average gene expression levels are calculated as the weighted sum of the expressions of the co-regulated gene modules they belong to. We simulate the time course of each of the modules as random walks with a momentum term. We simulate direct transcript counts or amplified mRNA molecule counts by sampling from negative binomial distributions. Users can control several aspects of the differentiation tree, i.e., the number of branches, branch points and modules, branch lengths, and noise levels. We use model parameters typical of values trained on real datasets. Low-dimensional visualizations of simulated data by state-of-the-art algorithms resemble those of real datasets. Conclusions: PROSSTT robustly simulates scRNA-seq data for complex differentiation processes for which the ground truth is given. PROSSTT therefore allows objective comparison and detailed analysis of how trajectory inference algorithms perform on single-cell RNA-seq data. The package can easily be extended to reflect new knowledge about scRNA-seq data.
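
The three ingredients named above (module time courses as random walks with momentum, gene means as weighted sums of modules, negative binomial counts) can be sketched for a single branch as follows. This is not the PROSSTT code: the function names, the default parameters, the exponential link from module values to mean expression, and the choice of one cell per pseudotime step are all assumptions made for illustration.

```python
import math
import random

def momentum_walk(n_steps, rng, momentum=0.8, step_sd=0.05):
    """Random walk whose velocity carries over between steps."""
    position, velocity, path = 0.0, 0.0, []
    for _ in range(n_steps):
        velocity = momentum * velocity + rng.gauss(0.0, step_sd)
        position += velocity
        path.append(position)
    return path

def sample_poisson(rate, rng):
    """Knuth's multiplication method, applied in chunks to avoid underflow."""
    count = 0
    while rate > 0.0:
        chunk = min(rate, 30.0)   # a sum of Poisson draws is again Poisson
        rate -= chunk
        threshold, product = math.exp(-chunk), rng.random()
        while product > threshold:
            count += 1
            product *= rng.random()
    return count

def sample_negative_binomial(mean, dispersion, rng):
    """Gamma-Poisson mixture: variance = mean + dispersion * mean**2."""
    shape = 1.0 / dispersion
    return sample_poisson(rng.gammavariate(shape, mean / shape), rng)

def simulate_branch(n_cells, n_genes, n_modules, seed=0,
                    base_mean=5.0, dispersion=0.2):
    """Counts matrix (cells x genes) for one branch, one cell per time step."""
    rng = random.Random(seed)
    modules = [momentum_walk(n_cells, rng) for _ in range(n_modules)]
    counts = [[0] * n_genes for _ in range(n_cells)]
    for g in range(n_genes):
        raw = [rng.random() for _ in range(n_modules)]
        weights = [w / sum(raw) for w in raw]      # normalised module weights
        for t in range(n_cells):
            log_fold = sum(w * m[t] for w, m in zip(weights, modules))
            mean = base_mean * math.exp(log_fold)  # assumed link function
            counts[t][g] = sample_negative_binomial(mean, dispersion, rng)
    return counts

counts = simulate_branch(n_cells=40, n_genes=6, n_modules=3)
print(counts[0], counts[-1])
```

A full tree would chain such branches so that each child branch starts from the module values at its parent's end point; that step is omitted here.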

Session A-426: Web atlas of neuroendocrine system in Bombyx mori
COSI: RegGen

Short Abstract: The neuroendocrine system of insects is a complex structure formed by specialised cells and organs, which regulate development, behaviour, reproduction and other physiological functions. Crucial signal molecules in the neuroendocrine system are peptides and derivatives of amino acids (biogenic amines) and fats (ecdysteroids and juvenile hormones). The silkworm (Bombyx mori) is a model organism for the study of insect neuroendocrinology. We have designed and implemented a web atlas with the aim of compiling current knowledge about the silkworm’s neuroendocrine system and its individual components. So far it contains records for 60 neuroreceptors and 83 neuropeptides. Apart from basic data (alternative names, sequences, cross-links to other databases), our web atlas provides information on properties, expression, functions and mutual interactions among components of the neuroendocrine system. All information was collected from literature and biological databases. A unique part of the atlas is our experimental results regarding expression of neuroreceptors and neuropeptides across sexes, organs and stages of development. If available, experimental data are also provided (images from immunohistochemistry and in situ hybridisation, datasets from qPCR and RNA-seq). In addition, for each record there is also a list of known homologous proteins from other insect species. The web portal also provides several tools for sequence analysis, sequence alignment and prediction of neuropeptides. This web atlas is a unique information resource in the field of insect neuroendocrinology and is freely accessible to researchers around the world. This work was supported by VEGA grant 2/0164/15.

Session A-427: SeedHam: Seed-driven Learning of Position Weight Matrices from Large Sequence Sets
COSI: RegGen

Short Abstract: We formulate and analyze a novel seed-driven algorithm, SeedHam, for PWM learning. To learn a PWM of length t, the algorithm uses the most frequent t-mer of the training data as a seed, and then restricts the learning to a small Hamming neighbourhood of the seed. The SeedHam method is intended for PWM learning from large sequence sets (up to hundreds of Mbases) containing enriched motif instances. A robust variant of the method is introduced that decreases contamination from artefact instances of a self-similar motif and allows using larger Hamming neighbourhoods. To solve the motif orientation problem in two-stranded DNA we introduce a novel seed finding rule, based on analysis of the palindromic structure of sequences. We report test experiments that illustrate the relative strengths of different variants of our methods and show that our algorithms are fast and give stable and accurate results.
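
The basic seed-and-neighbourhood idea can be sketched in a few lines: pick the most frequent t-mer as the seed, collect every window within a small Hamming distance of it, and turn the column counts into a PWM. This sketch is an illustrative reading of the abstract only; it ignores the reverse strand and the robust variant, assumes sequences contain only A, C, G and T, and the function names, `radius` and `pseudocount` are assumptions.

```python
from collections import Counter

def most_frequent_tmer(seqs, t):
    """Seed: the t-mer occurring most often across all sequences."""
    counts = Counter(s[i:i + t] for s in seqs for i in range(len(s) - t + 1))
    return counts.most_common(1)[0][0]

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def seed_pwm(seqs, t, radius=1, pseudocount=1.0):
    """PWM estimated from all windows within `radius` mismatches of the seed."""
    seed = most_frequent_tmer(seqs, t)
    columns = [dict.fromkeys("ACGT", pseudocount) for _ in range(t)]
    for s in seqs:
        for i in range(len(s) - t + 1):
            window = s[i:i + t]
            if hamming(window, seed) <= radius:
                for j, base in enumerate(window):
                    columns[j][base] += 1
    pwm = [{b: c / sum(col.values()) for b, c in col.items()} for col in columns]
    return seed, pwm

seqs = ["TTACGTAA", "GGACGTCC", "CAACGTTG", "ATACCTGA"]
seed, pwm = seed_pwm(seqs, t=4, radius=1)
print(seed)
for column in pwm:
    print({base: round(p, 3) for base, p in column.items()})
```

In the toy input, three sequences carry the seed exactly and the fourth carries a one-mismatch variant, so the third PWM column spreads its weight over two bases while the others stay concentrated on the seed base.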

Session A-428: Bayesian Markov models consistently outperform PWMs at predicting motifs in nucleotide sequences
COSI: RegGen

Short Abstract: Position weight matrices (PWMs) are the standard model for transcription regulatory motifs. In PWMs, nucleotide probabilities are independent of nucleotides at other positions. Models that account for nucleotide dependencies require many parameters for training and are prone to over-fitting. We have developed a Bayesian approach for motif discovery using Markov models in which conditional probabilities of order k-1 act as priors for those of order k. Bayesian Markov models (BaMMs) automatically adapt model complexity to the amount of available data. We derive an expectation maximization algorithm and Gibbs sampling for de novo discovery of enriched motifs. For evaluating the performance of the models, we define the area under the sensitivity-FDR curve (AUSFC). The AUSFC has the great advantage that it summarizes the performance of the model for the entire range of FDR values that are relevant in practice, without putting undue emphasis on any specific FDR values. For transcription factor binding, BaMMs achieve significantly higher cross-validated AUSFC than PWMs in 446 ChIP-seq ENCODE datasets and improve performance by 55% on average. BaMMs also learn complex multipartite motifs, improving predictions of transcription start sites, polyadenylation sites, bacterial pause sites, and RNA binding sites by 26–101%. BaMMs never perform worse than PWMs, which argues in favor of generally replacing PWMs by BaMMs. Our webserver (http://bammmotif.mpibpc.mpg.de) makes BaMM!motif a convenient web-based tool. We provide de novo motif discovery and motif search tools, a database of higher-order BaMMs from ENCODE ChIP-seq datasets and suggested database motif matches.
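
The statement that order k-1 probabilities act as priors for order k can be made concrete with a small recursive estimator. The sketch below uses pseudo-count interpolation on a homogeneous (position-independent) model with a single strength `alpha` for every order; both simplifications, and all names, are assumptions made for illustration rather than the BaMM!motif implementation.

```python
def count_kmer(seqs, kmer):
    """Number of (overlapping) occurrences of kmer across all sequences."""
    k = len(kmer)
    return sum(s[i:i + k] == kmer for s in seqs for i in range(len(s) - k + 1))

def interpolated_probability(seqs, context, base, alpha=2.0):
    """P(base | context), shrunk towards the estimate with a shorter context."""
    if context == "":
        # Order 0: counts shrunk towards the uniform distribution.
        total = sum(count_kmer(seqs, b) for b in "ACGT")
        return (count_kmer(seqs, base) + alpha * 0.25) / (total + alpha)
    # Prior for order k is the order k-1 estimate (drop the oldest base).
    lower = interpolated_probability(seqs, context[1:], base, alpha)
    n_context = sum(count_kmer(seqs, context + b) for b in "ACGT")
    n_joint = count_kmer(seqs, context + base)
    return (n_joint + alpha * lower) / (n_context + alpha)

seqs = ["ACGTACGTAC", "TTACGGACGT", "ACGAACGTTT"]
for base in "ACGT":
    print(base, round(interpolated_probability(seqs, "ACG", base), 3))
```

When a context is well observed, its own counts dominate; when it is rare or unseen, the estimate falls back smoothly to the lower order, which is how model complexity adapts to the amount of data.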

Session A-429: Predicting gene silencing dynamics during X chromosome inactivation with Random Forests
COSI: RegGen

Short Abstract: In mammals, gene dosage imbalance between the sexes is compensated by a process known as X-chromosome inactivation (XCI). A key player during XCI is the long non-coding RNA Xist that forms a transcriptionally silent compartment (TSC) depleted of RNA Polymerase II and euchromatin marks, into which X-chromosomal genes are drawn as they become silenced. However, some genes are drawn into the TSC faster than others and are thereby silenced faster. The goal of our analysis was to find genomic and epigenetic factors that cooperate in the gene silencing process by Xist, potentially determining the gene silencing rate. We used a time series PRO-Seq experiment on an mESC line to estimate the silencing half-time for X-chromosomal genes and built a classification model for rapidly and slowly silenced genes based on epigenetic and genomic features at gene promoters. The main challenge was to find a classifier that can deal with a low number of target points, high class imbalance, and many correlated predictor variables, while keeping features interpretable. Using a Random Forest model, we could classify the genes with an accuracy of 80% and revealed that the distance to Xist, gene density and factors such as RING1B, TAF1 and H3K27ac are predictive of the silencing half-time.

Session A-430: Pioneering role of EBF1 in early B cell commitment
COSI: RegGen

Short Abstract: During hematopoiesis, the differentiation of the multipotent Common Lymphoid Progenitor (CLP) into the B cell lineage requires a major restructuring of transcriptional and chromatin states. This is established in part by the B cell lineage-specific transcription factor Early B cell factor 1 (EBF1). A targeted deletion of the Ebf1 locus in the mouse showed a complete developmental block at an early stage of B cell development (pre-pro-B cells). In an in vitro culture model, Ebf1-/- pre-pro-B cells were differentiated into pro-B cells by transducing EBF1. We studied the genome-wide molecular events associated with the transition from pre-pro-B cells to pro-B cells. Genome-wide occupancy and co-occupancy analysis of major B cell transcription factors, including EBF1, Pax5, E2A, IRF4, IKAROS, PU.1 and FOXO1, indicated that the majority of DNase I-accessible chromatin regions acquired in pro-B cells are bound by EBF1 and/or Pax5. We also demonstrated the pioneering role of EBF1 on a specific subset of genes by deleting its C-terminal domain. This study revealed the role of the C-terminal domain of EBF1 in shaping the naïve chromatin in a lineage-specific manner by establishing chromatin accessibility and DNA demethylation in regions of closed chromatin that offer limited collaboration with other transcription factors.

Session A-431: Efficient statistical methods for detecting differential methylation
COSI: RegGen

Short Abstract: Addition of a methyl group to the 5-position of a cytosine (5mC) is the most commonly studied epigenetic modification of DNA, and its effects on different diseases and cancer have been widely studied. We have previously developed a hierarchical generative model, LuxGLM, for analysing 5mC and oxidized methylcytosine species (oxi-mC). LuxGLM combines a generative model for sequencing data with a general linear model component to account for confounding effects, and the tool has been shown to provide accurate detection of differential methylation when compared to other state-of-the-art methods. However, detecting differential cytosine methylation for a whole genome is a computationally cumbersome task. In LuxGLM, Bayes factors are calculated for the hypothesis testing to detect differential methylation. Originally, the Savage-Dickey estimate of the Bayes factor was used, with the approximation obtained through the standard Hamiltonian Monte Carlo sampling feature of the probabilistic programming language Stan. To increase the computational efficiency, we propose using the variational inference feature of Stan for the calculation of the Bayes factor. Variational inference can be used for efficient posterior sampling, which is then combined with the Savage-Dickey estimate, or the evidence lower bound can be used directly as an approximation of the Bayes factor. By adjusting the parameters of the variational inference, we can get equally good results with lower computation times. This makes LuxGLM an attractive method even for whole-genome analysis.
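
The Savage-Dickey estimate itself is compact: for a point null nested in the model, the Bayes factor in favour of the null is the posterior density at the null value divided by the prior density there. The sketch below estimates the posterior density from samples with a Gaussian kernel density estimate; the samples are simulated stand-ins for what a sampler or a variational approximation would return, and the function names and the N(0, 1) prior are assumptions, not part of LuxGLM.

```python
import math
import random
import statistics

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def kde_at(point, samples):
    """Gaussian kernel density estimate at one point (rule-of-thumb bandwidth)."""
    n = len(samples)
    bandwidth = 1.06 * statistics.stdev(samples) * n ** (-0.2)
    return sum(normal_pdf(point, s, bandwidth) for s in samples) / n

def savage_dickey_bf01(posterior_samples, prior_density_at_null, null_value=0.0):
    """Bayes factor for the null: posterior density / prior density at the null."""
    return kde_at(null_value, posterior_samples) / prior_density_at_null

rng = random.Random(1)
prior_at_zero = normal_pdf(0.0, 0.0, 1.0)        # assumed N(0, 1) prior on the effect

no_effect = [rng.gauss(0.0, 0.2) for _ in range(2000)]   # posterior centred on 0
effect = [rng.gauss(1.0, 0.2) for _ in range(2000)]      # posterior away from 0

bf01_null = savage_dickey_bf01(no_effect, prior_at_zero)
bf01_alt = savage_dickey_bf01(effect, prior_at_zero)
print(bf01_null, bf01_alt)
```

Because only the density at a single point is needed, cheaper approximate posteriors can be substituted for full sampling, which is the trade-off the abstract explores.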

Session A-432: TF2Network: predicting transcription factor regulators and gene regulatory networks in Arabidopsis using publicly available binding site information
COSI: RegGen

Short Abstract: A gene regulatory network (GRN) is a collection of regulatory interactions between transcription factors (TFs) and their target genes. In plants, GRNs control different types of biological processes like growth, development and (a)biotic stress response and have been instrumental in understanding the organization, complexity and mechanisms of transcriptional gene regulation. Although various experimental methods like yeast-one-hybrid and ChIP-Seq have been used to map GRNs in Arabidopsis thaliana, the limited throughput of these methods combined with the large number of TFs (1,500-1,700) means that for many genes our knowledge about regulators in specific cellular conditions is limited. To improve GRN inference, we introduce TF2Network, a tool that exploits the vast amount of TF binding site information and enables the delineation of GRNs by detecting potential regulators for a set of co-expressed or functionally related input genes. Validation using 24 TF ChIP bound gene sets and differentially expressed gene sets for 23 perturbed TFs reveals that TF2Network predicts the correct regulator in 96% and 78% of these test sets, respectively. Furthermore, we show that our tool is robust to noise in the input gene sets and has a low false discovery rate. Besides predicting TF regulators, TF2Network also predicts the putative target genes with good sensitivity. Comparison of TF2Network with other existing tools such as PlantRegMap and Cistome shows that TF2Network has a better performance. Apart from predicting regulators and target genes, TF2Network is accessible through a web interface where all predictions are shown using a new visualization to intuitively browse complex networks. Furthermore, the networks are annotated with additional functional information based on Gene Ontology (GO), protein-protein interactions and RNA-Seq transcript profiling data, facilitating biological discovery. Finally, we demonstrate how TF2Network can be used to perform systematic regulatory annotations for GO-based regulons.

Session A-433: Detecting Epistasis Using Random Forest
COSI: RegGen

Short Abstract: Epistasis (non-additive genetic interaction) has been proposed as one factor that underlies the ‘missing heritability’ observed in numerous complex biological traits. However, fully uncovering epistatic interactions from genetic association data has been hampered by a lack of methods for the efficient detection of epistasis. Existing methods for the detection of epistasis are limited by prior assumptions on the nature of interactions and/or the distribution of data. Further, many existing methods cannot properly handle situations where a trait is affected by more than two loci. Here we propose novel methods for the detection of epistasis using Random Forest (RF). A Random Forest model consists of an ensemble of decision trees, where the data is split on markers that best explain the phenotypic variance in a sequential manner, without making assumptions about the underlying ‘true’ model. Because the outcome is modeled differently over subgroups defined by previous splits, the structure of the trees, by essence, takes dependencies between markers into account [1]. We have shown that RF outperforms other methods for detecting genetic associations especially when complex (non-additive) interactions between loci are involved [2]. Yet, using RF for detecting such interactions has remained challenging [3]. Thus, we propose three new scores that exploit the structure of RF decision trees for detecting different types of epistatic interactions. The first score, called split asymmetry, evaluates the magnitude of phenotypic differences on two sides of a split and uses that information for detecting significant dependencies between markers. The second score, called selection asymmetry, quantifies the imbalance in the number of times a marker was used on two sides of a previous split. 
And the third score, called paired selection frequency, compares the frequency of two markers being used together in the same decision tree with the frequency of them being selected independent of each other. We have benchmarked these methods separately and as an ensemble on simulated data and showed that the RF-based epistasis detection approach outperformed other frequently used methods. In addition, the performance of the ensemble method was assessed on a real dataset by evaluating the biological relevance of the results. Finally, we are applying this method on large quantitative genetics datasets in order to evaluate the contribution of epistatic effects to phenotypic variance. This work shows that the methods we have developed based on the exploitation of the structure of Random Forest allow the explicit detection of interacting genetic factors. Since Random Forest makes no assumptions about the model complexity it is – in principle – possible to extend this approach from two-way interactions to higher order interactions. [1] Chen, Xi, and Hemant Ishwaran. "Random forests for genomic data analysis." Genomics 99.6 (2012): 323-329. [2] Michaelson, Jacob J., et al. "Data-driven assessment of eQTL mapping methods." BMC genomics 11.1 (2010): 502. [3] Picotti, Paola, et al. "A complete mass-spectrometric map of the yeast proteome applied to quantitative trait analysis." Nature 494.7436 (2013): 266-270.
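The idea behind the third score can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact statistic, so the input format (one set of marker names per tree), the function name `paired_selection_frequency`, and the use of "observed joint frequency minus the product of the marginal frequencies" are all assumptions made for illustration.

```python
from itertools import combinations


def paired_selection_frequency(trees):
    """Compare how often two markers co-occur in a tree with the
    frequency expected if they were selected independently.

    trees: non-empty list of sets, each holding the markers used in one tree
           (hypothetical input format, not taken from the abstract).
    Returns {(marker_a, marker_b): observed_joint - expected_joint}.
    """
    n_trees = len(trees)
    markers = sorted(set().union(*trees))

    # Fraction of trees in which each marker is used at least once.
    marginal = {m: sum(m in t for t in trees) / n_trees for m in markers}

    scores = {}
    for a, b in combinations(markers, 2):
        observed = sum(a in t and b in t for t in trees) / n_trees
        expected = marginal[a] * marginal[b]  # independence assumption
        scores[(a, b)] = observed - expected
    return scores


# Toy forest: m1 and m2 always appear together, m3 appears on its own.
toy_forest = [{"m1", "m2"}, {"m1", "m2"}, {"m3"}, {"m3"}]
scores = paired_selection_frequency(toy_forest)
```

A positive value means the pair is selected together more often than independence would predict; a real analysis would also need a significance test, which is omitted here.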

Session A-434: Network-based integration of systems genetics data reveals pathways associated with lignocellulosic biomass accumulation and processing
COSI: RegGen

Short Abstract: Published in 2017, Proceedings of the National Academy of Sciences 114 (5), 1195-1200. As a consequence of their remarkable adaptability, fast growth and superior wood properties, eucalypt tree plantations have emerged as key renewable feedstocks (over 20 million ha globally) for the production of pulp, paper, bioenergy and other lignocellulosic products. However, most biomass properties such as growth, wood density and wood chemistry are complex traits that are hard to improve in long-lived perennials. Systems genetics, a process of harnessing multiple levels of component trait information (e.g. transcript, protein and metabolite variation) in populations that vary in complex traits, has proven effective for dissecting the genetics and biology of such traits. We have applied a novel network-based data integration (NBDI) method for a systems level analysis of genes, processes and pathways underlying biomass and bioenergy-related traits using a segregating Eucalyptus hybrid population. This NBDI approach is based on a unique network model in which connections between genes reflect interactions derived from either prior molecular interaction information or from eQTL information. In the latter case it is assumed that if two genes share an eQTL, they are connected in the network because of a shared co-regulation mechanism. Even though incidental overlap of eQTLs is possible, for instance through the action of separate polymorphisms in tightly linked but unrelated genes, we assumed that the majority of the overlapping trans-eQTLs can be treated as evidence of a shared regulatory polymorphism, as reflected by the shared functional annotations observed for the associated genes. Gene expression signals are then propagated through the network model to obtain an integrated signal that is used to explain the variation in the external traits. We applied the NBDI approach to study the genomic loci and pathways affecting wood formation in Eucalyptus. 
The experimental setup used (with high linkage disequilibrium (LD) and large effect QTLs segregating in a single family) is complementary to low LD studies (with high resolution, but typically small effect associations) in populations of unrelated individuals. We showed that the integrative approach can link biologically meaningful sets of genes to complex traits, and at the same time reveal the molecular basis of trait variation. Gene sets identified for related woody biomass traits were found to share regulatory loci, cluster in network neighborhoods, and exhibit enrichment for molecular functions such as xylan metabolism and cell wall development. These findings offer a new framework for identifying the molecular underpinnings of complex biomass and bioprocessing-related traits. A more thorough understanding of the molecular basis of plant biomass traits should provide additional opportunities for the establishment of a sustainable bio-based economy.
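The eQTL-derived part of the network model (two genes are connected if they share an eQTL) can be sketched in a few lines of Python. This is only an illustration under assumptions: the input mapping, the locus identifiers and the function name `eqtl_edges` are hypothetical, and the prior-interaction edges and the signal propagation step described in the abstract are not modeled.

```python
from collections import defaultdict
from itertools import combinations


def eqtl_edges(gene_to_eqtls):
    """Connect every pair of genes that share at least one eQTL.

    gene_to_eqtls: dict mapping a gene name to the set of eQTL loci
                   associated with it (hypothetical input format).
    Returns a set of undirected edges as alphabetically sorted gene pairs.
    """
    # Invert the mapping: which genes are associated with each locus?
    genes_by_locus = defaultdict(set)
    for gene, loci in gene_to_eqtls.items():
        for locus in loci:
            genes_by_locus[locus].add(gene)

    # Every pair of genes under the same locus becomes one edge.
    edges = set()
    for genes in genes_by_locus.values():
        for a, b in combinations(sorted(genes), 2):
            edges.add((a, b))
    return edges


toy = {"g1": {"q1"}, "g2": {"q1", "q2"}, "g3": {"q2"}, "g4": {"q3"}}
edges = eqtl_edges(toy)  # g4 stays unconnected: it shares no eQTL
```

Storing edges as sorted pairs keeps the graph undirected and avoids duplicates when two genes share more than one locus.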

Session A-435: TITER: predicting translation initiation sites by deep learning
COSI: RegGen

Short Abstract: Motivation: Translation initiation is a key step in the regulation of gene expression. In addition to the annotated translation initiation sites (TISs), the translation process may also start at multiple alternative TISs (including both AUG and non-AUG codons), which makes it challenging to predict TISs and study the underlying regulatory mechanisms. Meanwhile, the advent of several high-throughput sequencing techniques for profiling initiating ribosomes at single-nucleotide resolution, e.g., GTI-seq and QTI-seq, provides abundant data for systematically studying the general principles of translation initiation and for developing computational methods for TIS identification. Methods: We have developed a deep learning based framework, named TITER, for accurately predicting TISs on a genome-wide scale based on QTI-seq data. TITER extracts the sequence features of translation initiation from the surrounding sequence contexts of TISs using a hybrid neural network and further integrates the prior preference of TIS codon composition into a unified prediction framework. Results: Extensive tests demonstrated that TITER can greatly outperform the state-of-the-art prediction methods in identifying TISs. In addition, TITER was able to identify important sequence signatures for individual types of TIS codons, including a Kozak-sequence-like motif for the AUG start codon. Furthermore, the TITER prediction score can be related to the strength of translation initiation in various biological scenarios, including the repressive effect of the upstream open reading frames (uORFs) on gene expression and the mutational effects influencing translation initiation efficiency.

Session A-436: GeneHancer and VarElect: disease interpretation of whole genome sequence variants
COSI: RegGen

Short Abstract: Simon Fishilevich*, Naomi Rosen, Michal Twik, Rotem Hadar, Tsippi Iny-Stein, Marilyn Safran and Doron Lancet Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 7610001, Israel * - corresponding author, email: simon.fishilevich@weizmann.ac.il The emergence of whole genome sequencing (WGS) poses considerable challenges to variant disease interpretation. A typical WGS of an individual identifies ~5M non-reference variants, a 50 fold increase compared to whole exome sequencing (WES). A considerable proportion of this “variant avalanche” (10-15%) resides within transcription regulatory elements - promoters and enhancers. Promoters are relatively easy to identify due to their stereotyped positioning at the immediate 5’ neighborhood of genes, and their target genes are quite obvious. In contrast, the identification of enhancers constitutes a major undertaking. An equally difficult task is identifying the connections between these distant-acting regulatory elements and their target genes. Enhancers are centrally involved in the spatiotemporal orchestration of gene expression in embryonic development and in cell differentiation, and are implicated in disease. This makes them prime novel targets for annotating non-coding variants in WGS, and interpreting them in the realms of health and diseases. We created GeneHancer, a novel regulatory element database, in the framework of the GeneCards Suite (www.genecards.org). We integrated four enhancer data sources: a) 176,000 enhancer regions from the ENCODE project; b) 213,000 elements from the Ensembl regulatory build; c) 43,000 elements from the FANTOM project, identified via enhancer RNAs (eRNAs); d) 1,700 experimentally-validated elements from the VISTA enhancer browser. 
Subsequently, we consolidated gene-enhancer links obtained by five methodologies: a) GTEx expression quantitative trait loci (eQTLs); b) Capture Hi-C promoter-enhancer long range interactions (PMID 25938943); c) FANTOM expression correlations between eRNAs and candidate target genes; d) Expression correlations between enhancer-targeted transcription factors and genes; e) Enhancer-gene genomic distance. GeneHancer portrays 285,000 integrated non-redundant candidate enhancers (covering 12.4% of the genome), along with annotation-derived confidence scores. In parallel, our database incorporates ~1.02 million integrated and scored gene-enhancer links involving 101,337 genes. Among these, we define a subset of “double elite” enhancer-gene pairs, based on the conjunction of two or more methods for both enhancer identification and enhancer-gene association. This allows WGS variants within enhancers to be interpreted with high confidence, based on high-probability target gene links. These WGS analysis capabilities are being embedded within the GeneCards Suite, among others by modifying VarElect and TGex, its next generation sequencing (NGS) disease interpretation tools [PMID: 27357693]. For WES, VarElect prioritizes a list of variant-containing genes by seeking the relevance of such genes to phenotype/disease/symptom keywords, as inferred from the comprehensive web-mined information within the GeneCards knowledgebase. For WGS, a modified VarElect assigns GeneHancer target genes to variant-containing enhancers. These genes are added to VarElect’s input gene list for performing disease interpretation. Enhancer variants are ranked by a combination of phenotype scores and GeneHancer scores. We show by concrete examples how the combination of GeneHancer and VarElect, along with the power of the GeneCards Suite’s comprehensive gene and disease information, provides a facile route to discovering the genic roots of diseases.

Session A-437: Analysing large-scale epigenomic data with DeepBlue
COSI: RegGen

Short Abstract: While large amounts of epigenomic data are publicly available, their retrieval in a form suitable for downstream analysis is a bottleneck in current research. In a typical analysis, users are required to download huge files that span the entire genome, even if they are only interested in a small subset (e.g. promoter regions) or an aggregation thereof. Moreover, complex operations on genome-level data are not always feasible on a local computer due to resource limitations. The DeepBlue Epigenomic Data Server mitigates this issue by providing a powerful interface and API for filtering, transforming, aggregating and downloading data from several epigenomic consortia, making it the ideal resource for bioinformaticians who seek to integrate up-to-date epigenomics resources into their workflow. We present two projects that utilize the DeepBlue API to enable users not proficient in scripting or programming languages to analyze epigenomic data in a user-friendly way: (i) an R/Bioconductor package that integrates DeepBlue into the R analysis workflow. The extracted data are automatically converted to suitable R data structures for downstream analysis and visualization within the Bioconductor framework. (ii) a web interface that enables users to search, select, filter and download the epigenomic data available in DeepBlue. DeepBlue was well received by the International Human Epigenome Consortium and has already attracted much attention from the epigenomic research community, with currently 90 registered users and more than a million anonymous data requests since the release in 2015. The web interface and the API documentation, including usage examples and use cases, are available at http://deepblue.mpi-inf.mpg.de/. The DeepBlueR package is available at http://deepblue.mpi-inf.mpg.de/R.

Session A-438: Discovery of murine tissue specific regulatory drivers and their impact
COSI: RegGen

Short Abstract: Tissue specific regulatory regions are important for various biomedical applications including manipulating gene expression in single cell types, tissues or organs of interest and for developing conditional mouse models. However, little is known about the relationship between specific regulatory regions and the corresponding phenotypes of genes they control. Using an in silico integrated approach, we present a novel analysis to identify genome wide Tissue Specific Regulatory Elements (TSREs) that may directly or indirectly drive the genes that correspond with mutant mouse phenotypes. Using ChromHMM, a 6 state chromatin map was generated from 3 primary histone marks in 22 mouse tissues. The model achieved a recall (sensitivity) of ~82% for promoters of protein coding genes and ~96% for Vista enhancers. We calculated the tissue specificity index (using Tau) of all strong enhancers and active promoters (posterior probability ≥0.95) across 22 epigenomes, and identified highly tissue specific regulatory elements (Tau≥0.85). Using RNA-seq data we show that genes with strong/active TSREs have significantly higher expression compared to genes with weak or absent TSREs (p<0.0009, permutation test). Interestingly, only ~5% of the tissue specific enhancers and ~21% of tissue specific promoters identified in each epigenome are observed to drive tissue specific expression of their target genes. We integrated mammalian phenotype ontology terms from Mouse Genome Informatics with the putative target genes of TSREs and show significant correlation between mouse phenotypes and corresponding TSREs (enrichments q≤2.87×10^-8). Using a random forest model, we evaluated the capability of TSREs to predict mouse phenotypes; tissue specific active promoters achieved the greatest accuracy of ~76% (AUC=0.76) whilst enhancer and expression profiles achieved ~67% (AUC=0.77) and ~67% (AUC=0.80) respectively. 
In order to validate our predictions, we investigated protein-protein interactions (PPI) among genes with tissue specific elements in each epigenome and observed significant interactions among TSRE genes and with corresponding phenotype associated partners (p-value=0). Simulating these PPI networks by adding random genes also showed that TSRE genes with novel gene-phenotype associations interact more with known phenotype genes compared to randomly added genes (p≤0.02). Finally, we have identified known and novel transcription factors enriched in TSREs which potentially play important regulatory roles in their corresponding epigenomes. Overall, these data may identify specific regulatory elements and novel gene-phenotype associations that can drive specific mouse phenotypes, and serve as a helpful resource for researchers to formulate new hypotheses about gene biological functions.
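For readers unfamiliar with the Tau index used above, the following Python sketch computes it in its commonly used form, Tau = sum over tissues of (1 - x_i / x_max), divided by (N - 1). The abstract does not say which per-tissue activity values were used for each regulatory element, so the input vectors, the element names and the helper `tissue_specific` are assumptions; only the 0.85 cut-off comes from the text.

```python
def tau(values):
    """Tau tissue-specificity index of one element across N tissues.

    values: non-negative activity levels, one per tissue.
    Returns 0.0 for uniform activity and 1.0 for activity in one tissue only.
    """
    n = len(values)
    if n < 2:
        raise ValueError("Tau needs at least two tissues")
    x_max = max(values)
    if x_max <= 0:
        return 0.0  # no activity anywhere: treat as non-specific
    return sum(1 - v / x_max for v in values) / (n - 1)


def tissue_specific(elements, threshold=0.85):
    """Keep the elements whose Tau reaches the threshold (0.85 in the abstract)."""
    return [name for name, vals in elements.items() if tau(vals) >= threshold]


# Hypothetical activity profiles across four tissues.
profiles = {
    "enh_A": [1.0, 0.0, 0.0, 0.0],   # active in a single tissue
    "enh_B": [1.0, 1.0, 1.0, 1.0],   # uniformly active
    "prom_C": [10.0, 5.0, 0.0, 0.0],
}
selected = tissue_specific(profiles)
```

With 22 epigenomes the same function applies unchanged; only the length of each activity vector differs.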

Session A-439: Reducing Noise in Hi-C Interaction Matrices at Restriction Fragment Resolution
COSI: RegGen

Short Abstract: Hi-C, the high-throughput derivative of chromosome conformation capture (3C) technology, allows for the quantification of all DNA-DNA contacts genome-wide that are found within a population of cells. The output of a Hi-C experiment is stored in an interaction frequency (IF) matrix. At the restriction fragment (RF) resolution, most Hi-C IF matrices are sparse due to the required depth and high costs associated with sequencing. A majority of pair-wise RF-interactions receive a raw frequency of zero or one, with most contacts found at relatively short distances (<1 Mb). Typically, IFs are thus analyzed at a fixed resolution (e.g., 50 Kb) to increase their signal over noise ratio. The consequences of this reduction in resolution are that key interactions between fine-scale genomic elements (e.g., eQTL studies, enhancer/promoter interactions, chromatin looping events) may not be observed. A correct interpretation of Hi-C IF matrices relies on representing the observed data at the proper resolution, which involves a trade-off between signal and noise. We describe two adaptive density estimation (ADE) techniques that consider the changing density of RF-interactions across a Hi-C IF matrix when reducing noise while retaining the highest-possible resolution. The first is a novel application of a Markov Random Field (MRF) to Hi-C data. To estimate true IF from Hi-C data, the MRF considers both (i) the immediate neighborhood of RF-interactions and (ii) Topologically Associating Domain boundaries. The second ADE algorithm is a kernel density estimation approach that implements a dynamic bandwidth to consider surrounding RF-interactions. We validate our ADE algorithms by demonstrating that estimated matrices allow for higher accuracy in identifying true positive/negative contacts and provide a lower error when predicting IF across varying sequencing depths, compared to traditional fixed binning approaches. 
True positive interactions are those Hi-C RF-interactions found to be mediated by RNA polymerase II and CTCF (as identified by ChIA-PET – a 3C technology that incorporates chromatin immunoprecipitation). We also show that ADE Hi-C IF matrices correlate better with 5C data (a targeted sequencing 3C technology with high sequencing depth) when observing the same genomic regions.
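As a point of reference for the fixed binning baseline mentioned above, here is a minimal Python sketch that aggregates sparse fragment-level contacts into fixed-size bins (50 Kb by default, as in the abstract's example). It does not implement the authors' MRF or kernel density methods. The input format (genomic positions of two fragments on one chromosome plus a raw count) and the function name `bin_contacts` are assumptions for illustration.

```python
from collections import Counter


def bin_contacts(contacts, bin_size=50_000):
    """Aggregate fragment-level contact counts into fixed-size bins.

    contacts: iterable of (pos_a, pos_b, count) tuples, where pos_a and pos_b
              are positions (bp) of two fragments on the same chromosome
              (hypothetical input format).
    Returns a Counter mapping (bin_i, bin_j), with bin_i <= bin_j, to the
    summed interaction frequency.
    """
    binned = Counter()
    for pos_a, pos_b, count in contacts:
        # Sorting the two bin indices keeps the matrix symmetric.
        i, j = sorted((pos_a // bin_size, pos_b // bin_size))
        binned[(i, j)] += count
    return binned


# Three sparse fragment-level contacts collapse into two 50 Kb bin pairs.
raw = [(1_000, 60_000, 1), (20_000, 70_000, 1), (120_000, 10_000, 2)]
matrix = bin_contacts(raw)
```

Pooling counts this way raises the signal per cell but discards where inside a bin each contact occurred, which is the loss of resolution the adaptive methods aim to avoid.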

Session A-440: Systematic Analysis of Dynamic Roles of TFs and MicroRNAs in Pressure Overload Cardiac Hypertrophy
COSI: RegGen

Short Abstract: Pathological cardiac hypertrophy is a major risk factor for heart failure. Although many studies have been published in the past and a number of regulators and related marker genes have been identified, only a few studies have used time course data. Therefore, how the transcription factors (TFs) dynamically regulate the associated genes and control the morphological and electrophysiological changes during the hypertrophic process is still largely unknown. In this study, we first obtained a set of time-course transcriptomes at five time points in four weeks from murine hearts subjected to a transverse aorta banding surgery that can induce cardiac hypertrophy within days in vivo. Then, we used a series of computational approaches and integrated regulatory information from public domains to analyze the time course transcriptomic data. We have three significant findings: (a) Three major co-regulation modules of TF genes were identified that may regulate the gene expression changes during the development of cardiac hypertrophy in mice. (b) The TF genes in the first module were up-regulated before the occurrence of significant morphological changes and one week later were down-regulated gradually, and then those in the second and third modules took over the regulation as the heart size increased. (c) The TF genes up-regulated at the early stages likely initiated the cascading regulation and most of the well-known cardiac miRNAs were up-regulated at later stages for suppression. In addition, we predicted several new candidate key regulators of cardiovascular-associated genes from the constructed time-dependent regulatory network. In short, this study has revealed some undiscovered regulatory properties in cardiac hypertrophy and provides a different perspective in the study of related research fields.

Session A-441: A discriminatory method for the identification of key transcriptional regulators using epigenetic data
COSI: RegGen

Short Abstract: Deciphering the regulatory mechanisms that control the establishment and maintenance of cellular programs is an essential task in computational biology. Transcription Factors (TFs) are key players in these mechanisms. One established approach to understand how TFs regulate changes in gene expression between tissues, or normal and disease samples, is through TF-ChIP-seq experiments. While accurate, these experiments are laborious and time-consuming, and need a priori knowledge of the relevant TFs. Alternative approaches, which combine exclusively sequence-based TF-binding predictions with TF gene expression measurements, are less accurate in describing both gene expression differences and TF-target relationships. Recently, a number of genome-wide open-chromatin assays, e.g. DNaseI-seq or ATAC-seq, have been utilized to measure chromatin accessibility in a sample of interest. These measurements can be combined with computational TF-binding annotation (1). Here, we present a two-step machine learning approach using only open-chromatin data as input to identify TFs that might be key regulators of gene expression differences between tissues. First, a new statistical approach is developed for the computation of differential TF-binding scores for each gene and TF, using binding predictions computed in open-chromatin regions, incorporating open-chromatin replicate information, and weighted binding in far-away enhancer regions. Second, an interpretable logistic regression classifier is used to prioritize TFs that are best suited to classify gene expression differences. We show that our approach outperforms purely sequence-based approaches, and that our method is less sensitive to class imbalance. Also, our method shows comparable performance when compared to an approach that uses TF-ChIP-seq data (2). 
As part of the DEEP and IHEC consortia, we have applied our model to identify TFs that are key regulators for human CD4+ T cell differentiation from naive to effector memory T cells (TEMs). Our method achieved superior performance to alternative approaches and highlighted several TFs as key regulators of these transitions. One of these predicted factors, FOXP1, was suggested by our method to discriminate naive T cells from TEMs. This prediction was validated experimentally: T cell-specific depletion of FOXP1 protein expression in FOXP1 conditional knock-out mice indeed resulted in loss of the naive CD44low phenotype in T cells (3). Overall, we suggest an accurate and flexible method to identify key regulatory TFs of gene expression between tissues. As only open-chromatin and gene expression data are required, without a priori knowledge of important TFs, comparatively low experimental costs allow our method to be applied to many cellular systems. We have implemented the complete workflow in a user-friendly GUI to make it directly accessible to biologists. The GUI is available on GitHub (4). (1) Combining transcription factor binding affinities with open-chromatin data for accurate gene expression prediction. Schmidt, F. Gasparoni, N. Gasparoni, G. Gianmoena, K. Cadenas, C. et al., Nucleic Acids Research, 2016. (2) Understanding transcriptional regulation by integrative analysis of transcription factor binding data. Cheng, C. Alexander, R. Min, R. Leng J. Yip, K.Y. et al., Genome Research, 2012. (3) Epigenomic profiling of human CD4+ T cells supports a linear differentiation model and highlights molecular regulators of memory development. Durek, P. Nordström, K. Gasparoni, G., Salhab, A. Kressler, C. et al. Immunity, 2016. (4) Transcription Factor Machine Learning (TFML) repository, https://github.com/SchulzLab/TFML.

Session A-442: All Fingers Are Not The Same: Handling Variable-Length Sequences In A Discriminative Setting Using Conformal Multi-Instance Kernels
COSI: RegGen

Short Abstract: Existing string kernels for comparison of genomic sequences are generally tied to using (absolute) positional information of the features in the individual sequences. This poses limitations when comparing variable-length sequences using such string kernels. For example, profiling chromatin interactions by Hi-C or 3C-based experiments results in variable-length genomic sequences (restriction fragments). Here, exact position-wise occurrence of signals in sequences may not be as important as in the analysis of promoter sequences for gene expression prediction, which typically have a transcription start site as reference. The existing string kernels have been shown to be useful for the latter scenario. In this work, we propose a novel approach for sequence comparison that allows larger positional freedom as compared to the existing approaches and identifies a possibly dispersed set of features in comparing variable-length sequences. Our approach, termed CoMIKL for conformal multiple instance kernel learning, casts the sequence comparison problem into a multiple instance learning problem and benefits from the previously developed conformal multi-instance kernels. Specifically, we represent each genomic sequence (the whole) by its segments (the parts). Thus, in a multiple instance learning setting, each sequence is a bag or collection of all its segments as instances. More precisely, for any given sequence, we derive a complementary set of segments, namely non-shifted and shifted segments. Shifted segmentation accounts for any motif that may have been missed if it lay at the non-shifted segment boundaries, thereby providing a complementary view and thus covering the complete sequence. In order to compare any two sequences, we compare all segments (non-shifted and shifted) of one sequence with all segments of the other. 
While the multi-instance kernel (Gärtner et al., 2002) can successfully compare whole sequences by comparing their individual instances, it has the issue that, in averaging, it loses any information related to the contributions of the individual instances. To this end, we use conformal multi-instance kernels (Blaschko and Hofmann, 2006) that enable us to obtain a segment weighting per sequence that denotes the contributions of the individual segments of a sequence towards classification of that sequence. The individual segments are represented using the oligomer distance histograms (ODH) representation and the corresponding ODH kernel (a dot product kernel) (Lingner and Meinicke, 2006) is used to compute the sequence similarity. The ODH kernel is then conformally transformed using a Gaussian such that discriminative regions in the feature space are magnified and the non-discriminative regions are shrunk. Selection of these candidate regions in the feature space is done by clustering the complete set of input instances and choosing the corresponding cluster centres as candidate regions. The resultant set of conformal multi-instance kernels is approximated and posed as a multiple kernel learning problem. We empirically show that the approximation holds for the combination of the ODH kernel and the conformal transformation. We present the results of our experiments on binary classification with two simulated datasets. These demonstrate the efficacy of CoMIKL in identifying not just the features useful towards classification but also the segments (thus, the locations) in the complete sequence that hold these features. We present the early results towards identifying the putative causal segments of a locus and the features that are deemed useful in discerning the interactors of a locus of interest from its non-interactors (4C/virtual 4C perspective). 
Simply put, in the scenario that two genomic loci, say a promoter and an enhancer element, interact over a long range in three-dimensional space, this would make it possible to distinctly identify any new (sequence) features in the intervening chromatin. Furthermore, we show that we are able to efficiently retrieve and interpret the weight vector for the complex setting of multiple multi-instance kernels. We also demonstrate how to interpret the nonlinear classifiers by adopting visualization techniques that were recently introduced (Nikumbh and Pfeifer, 2017) in the more basic setting.
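The non-shifted and shifted segmentation described above can be illustrated with a short Python sketch. It only shows how one sequence becomes a bag of segments; the kernels themselves are not implemented. The segment width, the half-width offset for the shifted view, the handling of the shorter trailing segment, and the function names are assumptions, since the abstract does not specify them.

```python
def segment(seq, width, offset=0):
    """Cut seq into consecutive segments of the given width, starting at offset.

    The last segment may be shorter than width (an assumption of this sketch).
    """
    return [seq[i:i + width] for i in range(offset, len(seq), width)]


def bag_of_segments(seq, width):
    """Represent one sequence (the whole) as a bag of its segments (the parts).

    Returns (non_shifted, shifted). The shifted view starts half a segment
    later, so a motif cut by a non-shifted boundary lies inside a shifted one.
    """
    non_shifted = segment(seq, width)
    shifted = segment(seq, width, offset=width // 2)
    return non_shifted, shifted


non_shifted, shifted = bag_of_segments("ACGTACGTAC", width=4)
# non_shifted -> ['ACGT', 'ACGT', 'AC']
# shifted     -> ['GTAC', 'GTAC']
```

In a multiple instance setting, both lists together would form the bag for that sequence, and each segment would then be mapped to a feature vector before any kernel is evaluated.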

Session A-443: scTree: reconstructing complex cellular lineage trees from single-cell RNA-seq data
COSI: RegGen

Short Abstract: Introduction: Recent advances in single-cell sequencing have made it possible to measure expression profiles for thousands of cells at once. Tissues often contain not only fully committed cells but also cells at intermediate differentiation stages. RNA-seq snapshots should, therefore, allow us to reconstruct the complex cellular lineage trees that explain the emergence of differentiated cell fates from a single progenitor cell population, e.g. in embryonic development or tumor progression. The expression time courses along each edge of the lineage tree could form the basis for dissecting the intricate gene interaction networks that regulate this process. Different techniques have been shown to be successful in finding a low-dimensional manifold in which a cellular differentiation tree can be visualized. However, the problem of identifying correct lineage trees is not solved except for very simple topologies. Results: We present scTree, a tool to reconstruct lineage trees from single-cell RNA-seq data. Once a manifold is calculated by some dimensionality reduction technique, endpoints, branch points, and their connectivity are detected, and cells are ordered according to a pseudotime measure along the branches of the tree. scTree calculates the denoised expression profile for each gene along each branch of the reconstructed lineage tree. By applying different statistical tests, the lineage tree structure can help to reveal genes that are differentially expressed among the different cell types that constitute the tree. While all state-of-the-art methods use a specific manifold space on which they reconstruct lineage tree structures, scTree is able to work in any given space, making it much more flexible. We were able to reconstruct tree lineage topologies composed of up to 3 bifurcations, both from real and simulated examples, and compared the results with reconstructions made by other methods. 
Conclusion: scTree can reconstruct complex tree topologies with multiple branching points robustly and accurately, surpassing the available methods in the field. Pseudotime courses for all genes on the tree are derived. The obtained high pseudotime resolution should enable the derivation of complex gene regulatory networks from snapshots of single-cell expression data. Tools like scTree will be of special importance in the immediate future as more and more complex differentiation processes are being investigated with single-cell RNA-seq techniques.

Session A-444: A Deep Learning Approach to Predict miRNA Targets by Analyzing Whole miRNA & isomiR Transcripts
COSI: RegGen

Short Abstract: MicroRNAs (miRNAs) are a family of small non-coding RNAs that regulate gene expression by binding to partially complementary regions within their target genes. Computational methods play an important role in predicting potential miRNA targets and assume that it is the miRNA seed region (located in the first 8 to 9 nucleotides) that defines the key interactions between a miRNA and its target. However, recent studies indicate that the entire miRNA has a role in the targeting process (1), suggesting that a more flexible methodology is needed. Here we present miRAW, a deep-learning based approach for predicting miRNA targets, which uses the entire miRNA and 3’UTR mRNA target nucleotides as inputs, and automatically learns a set of feature descriptors uninhibited by limits in current knowledge regarding the targeting process. The consideration of the whole miRNA:mRNA transcript not only allows us to assess the impact of mutations in the target site region, but also to predict how miRNA isoform variations can affect the strength and functionality of a target. To build our model we used more than 150,000 experimentally validated Homo sapiens miRNA:gene targets (2,3). To obtain the exact binding sites of the different miRNA targets, we cross-referenced these targets with different CLIP-Seq, CLASH and iPAR-CLIP datasets, which resulted in a dataset of ~20,000 validated miRNA:gene target sites that followed different binding structures. Using this data, we implemented and trained a deep neural network to distinguish positive and negative miRNA targets. To automatically learn the features describing miRNA-mRNA interactions, the network follows the shape of an autoencoder; the learned features are then classified using a feed-forward neural network. 
To obtain the predictive model we trained the network following a 20-fold random sub-sampling cross-validation methodology; then, we used the best resulting network to analyze potential target sites in the 3’UTR of the gene. In a comparison using independent datasets, miRAW consistently outperformed existing prediction methods (mirSVR, TargetScan, microTDS and PITA), obtaining higher accuracy, precision, sensitivity, and F-scores. However, miRAW presented a lower specificity due to a higher number of false positive results; this was addressed by applying a posteriori filtering that removes sites located in inaccessible regions. Regarding the prediction of functional targets related to miRNA isoforms, miRAW showed a large impact of variations affecting the miRNA seed region but also reflected changes in target behavior when variation occurred at the until recently overlooked 3’ end of the miRNA. Our findings show that, when predicting miRNA targets, considering the whole mature microRNA sequence provides better results than just focusing on the miRNA seed region. This supports recent findings, which observe that pairings beyond the seed region can also play an important role in the miRNA targeting process. Our results also highlight the ability of deep learning algorithms to deal with bioinformatics problems and to learn their own feature descriptors without being constrained by human knowledge.
1. Broughton, James P., et al. Pairing beyond the seed supports microRNA targeting specificity. Molecular Cell 64.2 (2016)
2. Vlachos, Ioannis S., et al. DIANA-TarBase v7.0: indexing more than half a million experimentally supported miRNA:mRNA interactions. Nucleic Acids Research 43 (2015)
3. Hsu, Sheng-Da, et al. miRTarBase update 2014: an information resource for experimentally validated miRNA-target interactions. Nucleic Acids Research 42 (2014)
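
The evaluation measures named in the miRAW abstract (accuracy, precision, sensitivity, specificity, F-score) follow from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name `classification_metrics` and any counts passed to it are hypothetical and are not results from the abstract.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    Illustrative helper (not from miRAW). Assumes every denominator is
    non-zero, i.e. there is at least one predicted positive, one actual
    positive and one actual negative.
    """
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)   # drops as false positives increase
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_score": f_score,
    }
```

Because specificity is the only one of these measures whose denominator is built from the negatives alone (tn + fp), extra false positives lower it directly, which is consistent with a method that scores well on sensitivity yet shows lower specificity until false positives are filtered out.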

Session A-445: Enhancers Reprogramming in Mammalian Genomes
COSI: RegGen

Short Abstract: Transcription factor binding site (TFBS) loss, gain, and reshuffling within the sequence of a regulatory element could alter its function. Some of the changes will be detrimental to the fitness of the species and will be gradually removed from the population, while others might be either beneficial or simply part of genetic drift and end up being fixed in a population. This “reprogramming” of regulatory elements (and enhancers, in particular) results in the modification of the gene regulatory landscape during evolution. However, the role of enhancer reprogramming in the evolution of the mammalian gene regulatory landscape is largely unknown. Also unknown is the relative contribution of enhancer reprogramming to gene regulatory changes in comparison to simple enhancer loss and gain events. In this study, we identified enhancer gains (EGs), functionally conserved enhancers (FCEs), and reprogrammed enhancers (RPEs) by comparing the distribution of tissue-specific enhancers in human and mouse. We found that up to 40% of mammalian enhancers could have been reprogrammed after the human-mouse speciation. In 69% of cases, the reprogramming of an enhancer resulted in a quantifiably different expression of a flanking gene. We found that 46% of heart RPEs are located within the loci of genes with four or more enhancers and, as expected, these genes have a relatively higher expression compared to genes with one to three enhancers. Furthermore, the proportion of RPEs within the loci of genes with three or more enhancers reaches about 22%. Our results suggest that the role of enhancer reprogramming is the regulatory reinforcement of genes with established regulatory loci. To study the mechanisms of enhancer reprogramming, we compared the TFBS locations within human regions and their mouse orthologues for each RPE. 
On this basis, sites in human regions were identified as gained sites (Gs), conserved sites (Cs), reshuffled sites (Hs) and reused sites (Rs). We found that in 80 of 110 cases (72%), the density of Gs is higher in RPEs when compared to the density of Gs in FCEs. Our result indicates that RPEs change their regulatory function mainly by the acquisition of gained binding sites.

Session A-446: From transcription factor cooperativity to molecular phenotypes using DNA-shape and machine learning.
COSI: RegGen

Short Abstract: Recent efforts have linked thousands of genetic variants to phenotypic traits; however, very little is known about the molecular mechanisms underlying these associations. To date, just ~20% of histone Quantitative Trait Loci (hQTLs) can be explained by a transcription factor (TF) motif disruption, suggesting the need for a better understanding of protein-DNA recognition. On the other hand, recent findings based on CAP-SELEX data suggest TF combinatorial binding to be more frequent than previously thought. Therefore, we here sought to investigate the extent to which combinatorial TF binding contributes to causing molecular trait QTLs. To do so, we (i) systematically predict combinatorial TF binding, (ii) characterise DNA-dependent effects on TF interactions, and (iii) quantify their implication in phenotypic associations. Our results show that we can bridge the gap between unexplained QTL variants and molecular genotypes via combinatorial binding prediction aided by DNA-shape, which suggests a mechanistic model in which TF pairs cooperate directly through DNA allostery over QTLs. To systematically predict combinatorial TF binding, we devised a machine learning set-up based on models of ‘sequence-only’ and ‘sequence plus DNA-shape’ features. We found that models that include DNA-shape strongly outperformed sequence-only models, supporting the notion that most of these TF pairs are driven via DNA-dependent effects. Briefly, we observe significant improvements for 515 in vitro CAP-SELEX datasets, and 132 in vivo ChIP-seq datasets, suggesting that our models are useful to infer combinatorial binding in vivo. Notably, when adding the DNA-shape component, we observe enrichment of TFs that have previously been described to interact through DNA-shape (e.g. pairs including MAX as one partner). 
A feature importance analysis allowed us to classify each TF as shape or sequence dependent when interacting with others, and to assess the relevance of DNA-shape of a given TF1 when interacting with another TF2. We next devised an approach to explore unbiased predictions using ChIP-seq data by combining different topologies and spacing distances from known PWM pairs. Best performances were recovered based on maximization of performance metrics. As a validation, we report the successful prediction of the topology and spacing distance for the known SOX2-PAX6 motif. From these combinatorial TF pair predictions, we obtained a network of protein-DNA ternary complexes, which were enriched for interactions reported in STRING. Interestingly, when checking DNA-shape features for a fixed TF1 with multiple putative partners we observe periodic DNA-shape feature weights at 11 nucleotides, suggesting DNA allostery to be a driver for the reported predictions. When associating QTLs with combinatorial binding models in lymphoblastoid cell line data for 75 individuals, we found an enrichment of these in different chromatin marks. In particular, we find a ~30% increase in significant QTL-phenotype associations for H3K4me3 when using composite instead of single motifs for this mark. Interestingly, this is two-fold higher than for other histone marks, suggesting that promoters are particularly prone to be regulated by combinatorial TF binding. Finally, we find QTL-SNP events occurring at non-random positions within the combinatorial motifs, suggesting that some QTLs are biased towards one side of the combinatorial motif in case of DNA-dependency of a given TF pair, and equally distributed on both sides when the TF-TF interaction is mediated through protein contacts. 
Our work proposes a biological model in which QTLs associated with a molecular phenotype can be explained by multiple TFs binding both alone and in cooperation through DNA-shape, emphasizing the need to systematically describe these interactions and their relevance in different biological contexts. All in all, our validated and predicted combinatorial binding events increase the number of explanatory mechanisms for QTL-phenotype associations.

Session A-447: Sonic Hedgehog Medulloblastoma: Comparative genetic and epigenetic analysis between age groups
COSI: RegGen

Short Abstract: Background: Medulloblastoma is a highly malignant type of pediatric brain tumor, which can be divided into four clinically relevant subtypes. Sonic hedgehog driven medulloblastomas (SHH MB) are characterized by a bimodal age distribution of infant (<3 years) and adult (>17 years) cases, whereas the other subgroups (WNT, Group 3, Group 4) are more frequently observed in children (3-17 years). Previous genomic studies have revealed that these three age groups differ in their genetic alterations, such as mutations and copy number changes. Importantly, chromatin modifiers are frequently mutated in adult SHH MB tumors. Results: To analyze the consequences of mutated chromatin modifiers in adult SHH MB compared to infant cases, we aimed to deduce the chromatin landscape differences between infant and adult SHH tumors. To this end, we have mapped the active enhancer landscape in 11 SHH MB primary tumors (n=5 infant and n=6 adult cases) by chromatin immunoprecipitation experiments sensitive to H3K27ac followed by sequencing (ChIP-seq). Moreover, we have generated matched DNA methylation and transcriptomic data from the same samples. Our integrative analysis reveals active enhancer elements specific to each of the two age groups, pointing to potentially de-regulated key tumor suppressor genes. Outlook: There are significant differences in enhancer landscapes and their downstream targets in distinct age groups. The mechanisms for tumor development in infants and adults are different. A deeper understanding of these differences, especially by analysing age group specific chromatin state modifications and the chromatin modifier genes associated with these changes, will help to design improved treatment procedures specific for infants and adults.

Session A-448: Analyzing DNA mutations impacting topologically associated domains and enhancers in pediatric brain tumors
COSI: RegGen

Short Abstract: Oncogenes are known to be activated through a wide range of chromosomal alterations, including changes to non-coding elements. In a recent finding, Hnisz et al. (Science, 2016) reported that changes in chromosomal looping structures, called topologically associated domains (TADs), can activate oncogenes. This observation could have extensive implications for cancer diagnostics and treatment. Using three-dimensional conformation maps of the human genome, it was found that essential genes controlling cell identity occur within TADs. These loops are maintained through anchor sites bound by the CTCF protein. Mutations occurring at CTCF anchor sites lead to the loss of TAD boundaries, which in turn leads to the activation of oncogenes by enhancers in the newly created neighborhood. This mechanism was identified in several cancer types, such as liver cancer and esophageal cancer. To test for this mechanism in pediatric brain tumors, we developed a methodology to identify non-coding somatic mutations occurring at TAD boundaries and to implicate enhancers contributing to tumor growth. Our methodology combines gene expression data, recurrence approaches and functional annotations to identify non-coding driver mutations. Our results suggest the absence of the aforementioned mechanism in pediatric brain tumors like Ependymoma and Medulloblastoma, but we see evidence in Glioblastoma. Further, we applied our algorithms to identify non-coding driver mutations at enhancer regions. The results demonstrate the inability of recurrence-based approaches to identify non-coding driver regions and the need to incorporate gene expression data and functional effect measures to estimate the contribution of non-coding mutations in pediatric brain tumors.

Session A-449: Inference and interrogation of a coregulatory network in the context of lipid accumulation in Yarrowia lipolytica
COSI: RegGen

Short Abstract: Complex phenotypes, such as lipid accumulation, result from cooperativity between regulators and the integration of multi-scale information. However, the elucidation of such regulatory programs by experimental approaches may be challenging, particularly in context-specific conditions. In particular, we know very little about the regulators of lipid accumulation in Yarrowia lipolytica, an oleaginous yeast of industrial interest (1). This lack of knowledge limits the development of this yeast as an industrial platform, due to the time-consuming and costly laboratory efforts required to design strains with the desired phenotypes. In this study, we aimed to identify context-specific regulators and mechanisms, to guide explorations of the regulation of lipid accumulation in Y. lipolytica. Using gene regulatory network (GRN) inference (2,3), and considering the expression of 6539 genes over 26 time points from GSE35447 for biolipid production and a list of 151 transcription factors (TFs), we reconstructed a GRN comprising 111 transcription factors, 4451 target genes and 17048 regulatory interactions (YL-GRN-1) supported by evidence of protein-protein interactions. This study, based on network interrogation and wet laboratory validation, (a) highlights the relevance of our proposed measure, the TF influence (4), for identifying phases corresponding to changes in physiological state without prior knowledge, (b) suggests new potential regulators and drivers of lipid accumulation, and (c) experimentally validates the impact of six of the nine regulators identified on lipid accumulation, with variations in lipid content from +43.2% to -31.2% on glucose or glycerol.

Session A-450: How is transcription activated?
COSI: RegGen

Short Abstract: How do transcription activators, once bound at a promoter or enhancer, recruit coactivators to activate transcription? From a handful of co-crystallized structures we know that the activators contain disordered regions harboring short activation motifs of 5 to 15 weakly conserved amino acids that bind to the target domain of the coactivator. Here, we present the first unbiased high-throughput experimental investigation of activation domains and their activation motifs. We transfected a library of S. cerevisiae cells with plasmids expressing a gene composed of the coding region of the DNA-binding domain of transcription factor Gcn4 fused to a randomized stretch coding for a 30-amino-acid activation domain. The plasmid also carries a GFP reporter gene under the control of either a SAGA- or TFIID-dependent promoter with Gcn4 binding sites. We FACS-sorted millions of cells into 5 bins depending on the strength with which the cell’s construct activated reporter transcription, and we sequenced the randomized stretches in each bin to read out the activation domain sequence. We extracted a multitude of residue-wise, windowed and full-length sequence features from the sequences (e.g. secondary structure, solvent accessibility, protein disorder, hydrophobicity), and designed deep neural network architectures to predict their transcriptional activity. The final architecture contains a convolution layer with max pooling, an LSTM layer and fully connected, dense layers (implemented in Keras with TensorFlow). The prediction accuracy reaches up to 94% depending on the cell library used. Finally, we applied the predictor to discover novel transcription factors by their activation domains in human, yeast and fly proteins.

Session A-451: The Role of the TP53-DNMT1 complex in cell apoptosis.
COSI: RegGen

Short Abstract: Aberrant DNA methylation has been proposed to be one of the hallmarks of cancer. Such aberrations have been associated with cancer progression, and DNA methylation patterns can be used as a powerful prognostic predictor. However, the link between aberrant DNA methylation patterns and the biological mechanisms governing such aberrations remains elusive. TP53 is a major tumor suppressor gene known to regulate DNA repair, cell-cycle arrest, senescence and apoptosis. It has been shown that p53 interacts with DNMT1 to drive DNA methylation at the promoters of anti-apoptotic genes, but the effects of this interaction have not been studied in a genome-wide fashion. Using genome-wide approaches we have identified a group of genes that are repressed in a TP53- and DNMT1-dependent manner. Pathway analysis shows that this group of genes is enriched in cell cycle and DNA damage response functions. These results suggest that TP53 and DNMT1 act through a common gene repression mechanism. Furthermore, although TP53 has always been regarded as a transcriptional activator, our data indicate that p53-directed DNA methylation is a novel mechanism of gene repression.

Session A-452: Method for predicting cell-type specific transcription factor co-occurrence in DNase hypersensitive sites reveals differences in regulatory networks in undifferentiated and differentiated ESCs
COSI: RegGen

Short Abstract: Cell-type specific gene expression is regulated by the combinatorial action of transcription factors (TFs). The aim of our study is to unravel TF combinations that jointly regulate their target genes in a cell-type specific manner. We first derive ubiquitous (ubiq-DHSs) and cell-type specific DNase hypersensitive sites (CTS-DHSs) in 64 cell types. Then, we develop a novel statistical method for detecting pairs of TFs co-occurring in a cell-type specific way by contrasting the TF occurrences in CTS-DHSs to those in ubiq-DHSs with a ratio score. We find that many predicted TF pairs are cell-type specific and that overlaps in TF pairs exist in cell types from the same tissue. Furthermore, independently validated co-occurring and directly interacting TFs are significantly enriched in our predictions. Focusing on the network derived from the predicted TF pairs in embryonic stem cells (ESCs), we find that it consists of two regulatory parts with distinct functions: maintenance of pluripotency with OCT4, SOX2 and NANOG, and regulation of early development with KLF4, STAT3, ZIC3 and ZNF148. Conversely, the predicted network in differentiated ESCs shows loss of the essential pluripotency factor OCT4. In summary, our novel method can predict co-occurring TFs in a cell-type specific manner, revealing new insights into regulatory mechanisms.
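
One way to read "contrasting the TF occurrences in CTS-DHSs to ubiq-DHSs with a ratio score" is as a ratio of co-occurrence frequencies. The sketch below is an assumption-laden illustration, not the authors' statistic: the function name `cooccurrence_ratio`, the representation of each DHS as a set of TF names, and the pseudocount are all hypothetical choices.

```python
def cooccurrence_ratio(cts_sites, ubiq_sites, tf_a, tf_b, pseudocount=1.0):
    """Ratio of the co-occurrence frequency of tf_a and tf_b in cell-type
    specific sites to their co-occurrence frequency in ubiquitous sites.

    Hypothetical scoring scheme for illustration only. Each site is a set of
    TF names predicted to bind within that DHS; the pseudocount avoids
    division by zero when a pair never co-occurs in the ubiquitous sites.
    """
    def frequency(sites):
        hits = sum(1 for tfs in sites if tf_a in tfs and tf_b in tfs)
        return (hits + pseudocount) / (len(sites) + pseudocount)

    return frequency(cts_sites) / frequency(ubiq_sites)
```

Under this sketch, a score well above 1 would flag a TF pair that co-occurs more often in cell-type specific sites than in ubiquitous ones; assessing significance would need a proper statistical test, which the abstract does not detail.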

Session A-453: Longer DNA insert lengths for whole exome sequencing guarantee enhanced mutation detection
COSI: RegGen

Short Abstract: With decreasing sequencing costs making whole genome sequencing (WGS) more attractive, one might question whether whole exome sequencing (WES) is still worthwhile for small studies with a low budget. One major drawback of WES is its well-known unevenness of coverage, whereas even coverage is a prerequisite for subsequent mutation detection or copy number variation analyses. Here, we propose a simple technical change in the WES protocol to effectively improve WES coverage uniformity. Typical human WES is conducted with 2x100 paired-end sequencing aiming at a mean coverage of up to 100x for each sample. However, WES is frequently applied to genomic DNA fragmented to <150 bp, thereby producing overlapping paired-end reads and promoting a skewed coverage distribution. Therefore, we prepared six human cell line subclones using Agilent SureSelectXT for WES. After preprocessing and mapping, aligned reads of each sample were divided into two groups: <150 bp (short) and >150 bp (long) inserts. We observed high amplitudes (“mountain-valley” contour) in coverage, particularly at longer exon sites targeted by more than two baits, which resulted in higher unevenness of coverage for short inserts. Comparison of the short and long insert groups revealed clearly more missed mutations in short inserts than in long inserts. Missed mutations for short inserts were characterised by low coverage surrounded by highly covered regions. In conclusion, WES with longer inserts provided enhanced evenness, delivering a more useful basis for mutation detection in coding regions.
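
The grouping of read pairs by insert length and the comparison of coverage evenness can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the coefficient of variation is just one possible evenness measure, and inserts of exactly 150 bp (which the abstract does not assign to either group) are placed in the long group here.

```python
from statistics import mean, pstdev


def split_by_insert(insert_lengths, threshold=150):
    """Partition insert lengths (bp) into short (< threshold) and long groups.

    Assumption: a length equal to the threshold counts as long.
    """
    short = [n for n in insert_lengths if n < threshold]
    long_ = [n for n in insert_lengths if n >= threshold]
    return short, long_


def coverage_cv(depths):
    """Coefficient of variation of per-base depth; lower means more even.

    Assumes a non-empty list with a non-zero mean depth.
    """
    return pstdev(depths) / mean(depths)
```

Computing `coverage_cv` on the per-base depth of each group over the same target region gives a single number per group, so a "mountain-valley" profile shows up as a higher value than a flat one.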

Session A-454: INTEGRATIVE VISUALIZATION OF MULTI-OMICS DATA: THE PAINTOMICS 3 PLATFORM
COSI: RegGen

Session A-509: Potential Regulator and Causal Genes for Boar Taint in Pigs from genome-transcriptome (eQTL) mapping analyses
COSI: RegGen

Short Abstract: Boar taint is an offensive odour from a proportion of non-castrated male pigs due to skatole and androstenone accumulation. Castration is used to prevent this but is an animal welfare issue. This study aimed to identify expression quantitative trait loci (eQTLs) with potential effect on boar taint to aid genomic selection of non-castrated male pigs with low boar taint. Danish Landrace boars with low, medium and high genetic merit of skatole and human nose score were slaughtered at 100 kg. Gene expression profiles were obtained by RNA-Seq and genotype data were obtained by Illumina 60K Porcine SNP-chip. Following quality control and filtering, 10,545 and 12,731 genes from liver and testis, respectively, were included in the eQTL analysis together with 20,827 SNP variants. A total of 281 and 554 single-tissue eQTLs associated with 127 and 257 genes were identified in liver and testis, respectively. The highest densities of eQTLs were found on pig chromosomes SSC6, SSC12 and SSC14. Functional characterisation of eQTLs revealed functions within the androgen receptor signalling pathway, cellular response to estradiol and the sphingolipid metabolic process. By employing a multivariate Bayesian Hierarchical Model, a total of 27 eQTLs were identified as significant multi-tissue eQTLs, associated with the genes ABCC8, ALDH5A1, AP5Z1, ARMC7, ATP6V1H, CAPN5, KLHDC4, LYPLAL1, NOSTRIN, PBX2, PPP1R11 and XRCC2. Of these, 12 multi-tissue eQTLs resided within known QTLs for levels of androstenone, cholesterol, indole and smell intensity and contained candidate genotypes for low merit of boar taint. These results will be useful for genomics-assisted selection.
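
At its simplest, a single-tissue eQTL test asks whether a gene's expression changes with the number of copies of an allele at a SNP. The sketch below shows that idea as an ordinary least-squares slope; it is not the model used in the study (the abstract names only a multivariate Bayesian Hierarchical Model for the multi-tissue step), and the function name `eqtl_slope` and the 0/1/2 dosage coding are assumptions.

```python
def eqtl_slope(dosages, expression):
    """Least-squares slope of expression on allele dosage (0, 1 or 2 copies).

    Illustrative single-SNP, single-gene effect estimate; a real analysis
    would also test significance and correct for covariates and multiple
    testing.
    """
    if len(dosages) != len(expression):
        raise ValueError("dosages and expression must have the same length")
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    return sxy / sxx
```

A slope far from zero, once tested for significance, would mark the SNP-gene pair as a candidate eQTL; repeating this over all SNP-gene pairs is what makes multiple-testing correction necessary.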