SPONSORS:

Silver

Silver Sponsor: Sanofi



General

General Sponsor - IBM Research

General Sponsor - MAGNet

General Sponsor - National Cancer Institute

RECOMB/ISCB RegSysGen 2014 Sponsor - NRNB

Cytoscape Sponsors

RECOMB/ISCB RegSysGen 2014 Sponsor - Agilent Technologies

RECOMB/ISCB RegSysGen 2014 Sponsor - Cytoscape

REGULATORY GENOMICS POSTERS




Updated Nov 6, 2014


RG P01: Whole-genome bisulfite sequencing of multiple individuals reveals complementary roles of promoter and gene body methylation in transcriptional regulation

Shaoke Lou1, Heung-Man Lee1, Hao Qin1, Jing-Woei Li1, Zhibo Gao2, Xin Liu2, Landon L. Chan1, Vincent K. L. Lam1, Wing-Yee So1, Ying Wang1, Si Lok1, Jun Wang2, Ronald C. W. Ma1, Stephen Kwok-Wing Tsui1, Juliana C. N. Chan1, Ting-Fung Chan1, Kevin Y. Yip1

1The Chinese University of Hong Kong, 2Beijing Genomics Institute - Shenzhen

Background: DNA methylation is an important type of epigenetic modification involved in gene regulation. Although strong DNA methylation at promoters is widely recognized to be associated with transcriptional repression, many aspects of DNA methylation remain incompletely understood, including the quantitative relationships between DNA methylation and expression levels, and the individual roles of promoter and gene body methylation.

Results: Here we present an integrated analysis of whole-genome bisulfite sequencing and RNA sequencing data from human samples and cell lines. We find that while promoter methylation inversely correlates with gene expression as generally observed, the repressive effect is clear only on genes with a very high DNA methylation level. By means of statistical modeling, we find that DNA methylation is indicative of the expression class of a gene in general, but gene body methylation is a better indicator than promoter methylation. These findings are general in that a model constructed from a sample or cell line could accurately fit the unseen data from another. We further find that promoter and gene body methylation have minimal redundancy, and either one is sufficient to signify low expression. Finally, we obtain increased modeling power by integrating histone modification data with the DNA methylation data, showing that neither type of information fully subsumes the other.

Conclusion: Our results suggest that DNA methylation outside promoters also plays critical roles in gene regulation. Future studies on gene regulatory mechanisms and disease-associated differential methylation should pay more attention to DNA methylation at gene bodies and other non-promoter regions.

(This paper has been published online by Genome Biology and is available at http://genomebiology.com/2014/15/7/408/.)

................................................................................................................
RG P02: Epigenetic and post-transcriptional crosstalk between key players of the cancer genome

Beatrice Salvatori1, Nenggang Zhang2, Pavel Sumazin2, Andrea Califano1

1Columbia University, 2Baylor College of Medicine

Multilayer regulation of gene expression is the foundation for the evolution of complex phenotypes of higher organisms. MicroRNAs work at the post-transcriptional level, 'orchestrating' RNA expression to sustain almost every cellular process, including aberrant states such as cancer. Very compelling is the discovery that mRNAs, pseudogenes, and long noncoding RNAs compete for the binding of microRNAs, suggesting that microRNA targets can act as modulators of microRNA activity. Reports confirmed the existence of competitive forces between transcripts in multiple cellular contexts (Cesana M et al., 2011; Tay Y et al., 2014; Kumar M et al., 2014) and concomitantly in silico dynamical models were used to identify modulators of microRNA activity (Sumazin P et al., 2011). Recently a pan-cancer study showed that competitive interactions for microRNAs might account for a substantial fraction of the 'missing genomic variability' in tumors. H. Chiu et al. (submitted 2014) identified genetic and epigenetic variants at the loci of microRNA targets as modulators of microRNA activity, leading to dysregulation of hundreds of genes, including most of the established cancer genome. We interrogated some relevant networks that have been found to be broadly implicated in tumorigenesis, including global epigenetic regulators. Our analyses implicated the TET family of proteins (TET1, TET2, TET3) as targets of microRNA modulation by distal genomic variants. This family of proteins was recently discovered to catalyze the conversion of 5-methyl-cytosine to 5-hydroxymethyl-cytosine, thus contributing to DNA de-methylation (Delatte B. et al., 2014). Moreover, TETs are frequently deleted in blood tumors and are commonly down-regulated in solid cancers, including breast, lung, and pancreatic cancer (H Yang et al., 2013). Our analyses suggest that their deregulation is mediated in all subtypes of breast cancer through competition for microRNA regulation. In particular, we found significant microRNA-mediated cross-talk between the TET2 and PTEN mRNAs. Our preliminary results include biochemical validation of this regulation and its functional relevance for tumor progression.

................................................................................................................
RG P03: Somatic mutations modulate ceRNA drivers of tumorigenesis

Jing He1, Hua-Sheng Chiu2, Pavel Sumazin2, Andrea Califano1

1Columbia University, 2Baylor College of Medicine

Pan-cancer studies have shown that competitive endogenous RNA (ceRNA) networks can cooperate with chromosome instability and abnormal DNA methylation in tumors to dysregulate tumor suppressors and oncogenes. However, the cooperative association of ceRNAs with mutations in cancer has not been studied. Integrating data from TCGA and ENCODE, we show that the cooperation between ceRNA interactions and mutations of unknown function contributes to the dysregulation of cancer genes.

We integrated ceRNA networks and mutations in an attempt to mechanistically recover missing genomic variability of cancer genes in TCGA breast cancer biopsies. Genes have missing genomic variability in a tumor dataset when their dysregulation cannot be explained through profiling of their DNA locus. Using a group lasso regression model we showed that ceRNA drivers cooperating with somatic mutations, CNV, and methylation, could account for a large fraction of the missing genomic variability of cancer genes in breast cancer. Moreover, using a greedy-forward optimization algorithm, we identified ceRNA driver mutations that could potentially drive tumorigenesis through the ceRNA mechanism. Furthermore, we showed that driver ceRNA mutations are enriched in known and predicted binding sites of transcription factors and microRNAs.

In summary, our results suggest that somatic mutations, often of unknown function, cooperate with ceRNA regulators to alter the expression of cancer genes in breast cancer tumors.

................................................................................................................
RG P04: Diverse promoter-architectures revealed by decoding heterogeneity in high-throughput sequence data

Leelavati Narlikar1

1National Chemical Laboratory, India

An important question in biology is how different promoter-architectures contribute to the diversity in the regulation of transcription initiation. A major step forward has been the development of technologies like CAGE/RACE that map transcription start sites (TSSs) at high resolution in a genome-wide manner. However, the subsequent step of characterizing promoters and their functions is still largely done on the basis of previously established promoter-elements like the TATA-box in eukaryotes or the -10 box in bacteria. Unfortunately, a majority of promoters and their activities cannot be explained by the presence or absence of these few elements. Motif discovery methods like MEME identify novel overrepresented elements, but these also fail here, because TSS neighborhoods are highly heterogeneous and contain no single overrepresented motif. For example, one set of promoters may be characterized by elements A, B, & C, another by A & D, a third only by D, and a fourth by E & F. In such a scenario, there is little chance that all six elements and four promoter-architectures will be detected by conventional approaches. Things get even more complicated when spacing between elements becomes relevant.

I will present a new unsupervised machine learning-based method designed to explicitly characterize this heterogeneity, while simultaneously unraveling underlying promoter-architectures. The method is generalizable to any organism, identifying previously undetected elements with lengths ranging from a single base in bacteria to 15 bases in certain human promoter-architectures. A striking example is the clear presence of a pyrimidine right before the TSS under very specific circumstances, across five different bacteria, which is likely to play a crucial role for transcription initiation. In tuberculosis, analysis of TSS locations across two environmental conditions provides convincing evidence that the spacing between the -10 box and the TSS is utilized for dynamic regulation of gene-expression by the pathogen. This relationship between the spacing and transcription activity has not been identified before.

In the well-studied Drosophila, the method identifies new variants of the INR motif instrumental during development, along with several novel promoter-architectures. In humans, there appears to be a lot more heterogeneity than reported so far: 20 architectures composed of a few known and many novel elements are identified, with each architecture having distinct evolutionary patterns, cell-type specific activity, and chromatin state.

The applicability of this method extends beyond identifying new promoter-architectures. This new way of looking at high-throughput sequence data allows for the identification of diverse regulatory signals associated with any DNA specified biological event reported at high-resolution.
................................................................................................................
RG P05: Computational identification of protein binding sites on RNAs using high-throughput RNA structure-probing data

Xihao Hu1, Thomas K. F. Wong2, Zhi John Lu3, Ting-Fung Chan1, Terrence Chi-Kong Lau4, Siu-Ming Yiu2, Kevin Yip1

1The Chinese University of Hong Kong, 2The University of Hong Kong, 3Tsinghua University, 4City University of Hong Kong

High-throughput sequencing has been used to probe RNA structures, by treating RNAs with reagents that preferentially cleave or mark certain nucleotides according to their local structures, followed by sequencing of the resulting fragments. The data produced contain valuable information for studying various RNA properties. We developed methods for statistically modeling these structure-probing data and extracting structural features from them. We show that the extracted features can be used to predict RNA "zip codes" in yeast, regions bound by the She complex in asymmetric localization. The prediction accuracy was better than using raw RNA probing data or sequence features. We further demonstrate the use of the extracted features in identifying binding sites of RNA binding proteins from whole-transcriptome gPAR-CLIP data.

................................................................................................................
RG P06: Loregic: a method to characterize the cooperative logic of regulatory factors

Daifeng Wang1, Koon-Kiu Yan1, Cristina Sisu1, Chao Cheng2, Joel Rozowsky1, William Meyerson1, Mark Gerstein1

1Yale University, 2Dartmouth Medical School

The topology of the gene regulatory network has been extensively analyzed. Now, given the large amount of available functional genomic data, it is possible to go beyond this and systematically study regulatory circuits in terms of logic elements. To this end, we present Loregic, a novel computational method that integrates gene expression and regulatory network data to identify and characterize the cooperativity of regulatory elements using logic-circuit models. We describe the basic regulatory triplet, consisting of two regulatory factors (RFs) acting on a common target gene, using a two-input-one-output logic gate model. We use binarized gene expression data to score the agreement between a triplet's cross-sample expression characteristics and the idealized expression pattern of each of the 16 possible logic gates (e.g., AND or XOR). A high score suggests strong cooperativity between the regulatory activities of the two RFs following the corresponding logic gate pattern. To demonstrate Loregic's applicability, we apply it to yeast cell cycle and human cancer datasets. In yeast, we use Loregic to study the regulatory activity of transcription factors (TFs), and validate the results using TF knockout experimental datasets. Next, using human ENCODE ChIP-Seq and TCGA RNA-Seq expression data, we demonstrate how Loregic characterizes complex circuits involving miRNAs and both proximally and distally regulating TFs. We find that in acute myeloid leukemia, oncogenic TFs such as MYC can be modeled as acting independently from other TFs, but antagonistically with miRNAs. Next, we explore the algorithm's applicability to other regulatory features. We use Loregic for the discovery and classification of indirectly bound TFs. We also predict logical operations in feed-forward loops, a special type of regulatory triplet in which one TF regulates both the target gene and the other TF. Finally, we demonstrate that Loregic is able to identify the regulatory pathways to targets that have cascaded logic-circuit operations. In summary, Loregic is a valuable computational method that describes the complex process of gene regulation in terms of the regulatory cooperative logic. The present method can be further extended to analyze cooperativity among arbitrary regulatory elements such as long non-coding RNAs and pathways. We make Loregic freely available as a general-purpose tool via https://github.com/gersteinlab/Loregic.
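
The triplet-to-gate matching described above can be illustrated with a minimal Python sketch. It is not the Loregic implementation: it assumes the score is simply the fraction of samples in which the binarized target agrees with a gate's output, and the function names and toy data are hypothetical.

```python
from itertools import product

# The four input combinations of a two-input gate, in a fixed order.
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def all_gates():
    """Enumerate all 16 two-input-one-output gates as truth tables."""
    return [dict(zip(INPUTS, outputs)) for outputs in product((0, 1), repeat=4)]


def gate_scores(rf1, rf2, target):
    """Return (agreement fraction, truth table) for each of the 16 gates.

    Assumption: agreement is the plain fraction of samples whose binarized
    target value equals the gate output for that sample's RF values.
    """
    n = len(target)
    scores = []
    for table in all_gates():
        hits = sum(table[(a, b)] == t for a, b, t in zip(rf1, rf2, target))
        scores.append((hits / n, table))
    return scores


def best_gate(rf1, rf2, target):
    """Pick the gate with the highest agreement fraction."""
    return max(gate_scores(rf1, rf2, target), key=lambda s: s[0])


# Hypothetical binarized expression across 8 samples: the target is "on"
# only when both regulatory factors are "on" (an AND-like triplet).
rf1 = [0, 0, 1, 1, 1, 0, 1, 1]
rf2 = [0, 1, 0, 1, 1, 1, 0, 1]
target = [a & b for a, b in zip(rf1, rf2)]

score, table = best_gate(rf1, rf2, target)
print(score, table)
```

On real data no gate would usually match every sample, so a threshold on the agreement fraction (also an assumption here) would decide whether a triplet is called gate-consistent.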
................................................................................................................
RG P07: The role of PIWI interacting RNAs in LINE-1 evolutionary dynamics

Leanne Whitmore1, Debjit Ray1, Wenfeng An1, Ping Ye1

1Washington State University

Transposons, segments of DNA that have the ability to move throughout the genome, have been active in mammalian genomes for millions of years, driving evolution by generating genetic and epigenetic changes in the genome. The autonomous retrotransposon LINE-1 (L1) uses an RNA intermediate to move and makes up approximately 17% of the human genome. L1 retrotransposition is potentially detrimental to the organism by causing disruptions in the coding sequences of genes as well as regulatory regions. Throughout evolution the sequence of L1 elements has changed, most notably the 5' untranslated region (5'UTR), which contains an internal promoter and regulates L1 transcription. The biological mechanism driving the evolution of L1 elements is not yet understood. PIWI interacting RNAs (piRNAs), small non-coding RNAs that are generated from L1 transcripts, can repress L1 transcription by targeting DNA methylation to the promoter regions. Further repression of L1 activity occurs during piRNA genesis, in which L1 transcripts are degraded. As the 5'UTR sequence is overrepresented in prenatal piRNAs, we hypothesize that piRNA-mediated repression plays a major role in driving L1 lineage succession. To test this hypothesis we analyze piRNA abundance toward families of mouse L1s. Our preliminary analysis shows that piRNA abundance is highest toward younger and transcriptionally more active L1 families, suggesting that the piRNA system selectively acts against active L1 families, leaving room for new families to emerge.
................................................................................................................
RG P08: Deciphering functional mechanisms for non-coding genetic variants associated with complex traits

Cynthia Kalita1, Gregory Moyerbrailean1, Roger Pique-Regi1, Francesca Luca1

1Wayne State University

Genome-wide association studies (GWAS) have identified thousands of SNPs associated with complex traits. However, each SNP generally has a small effect, and it is very challenging to identify the causal variant and its underlying mechanism. Many GWAS signals are in non-coding regions, so they may disrupt gene regulatory sequences such as transcription factor (TF) binding sites. Here we focus on functional annotations we previously developed by integrating binding sites predicted by a motif model with DNase I footprinting data. Using an empirical Bayesian framework implemented in the fgwas software, we combined GWAS data with these functional annotations. We observed improved posterior probability of association and increased interpretability of each signal, as compared to other annotations (e.g. distance to the TSS). For lipid levels in particular, the majority of the motifs correspond to TFs involved in inflammatory pathways. In contrast, the most enriched motifs for human height correspond to TFs important for development, cell proliferation, and maintenance of stem cell state.

From the enriched motifs, we selected two with GWAS hits for LDL and total cholesterol, respectively, to validate our computational predictions using reporter gene assays. Constructs containing either the reference or alternate allele at each SNP were assayed. We show that genetic variants predicted to disrupt TF binding can drive differential gene expression. Furthermore, the direction of gene expression changes confirms genetic effects from eQTL studies (GTEx Consortium).
................................................................................................................
RG P09: Improving position weight matrix-based prediction of transcription factor binding sites by integrating DNA shape features

Jichen Yang1, Stephen Ramsey1

1Oregon State University

DNA sequence-dependent binding of transcription factors (TFs) to specific sites in the genome is central to gene regulation. Transcription factor binding site (TFBS) sequence patterns are often characterized by a position-nucleotide weight matrix (PWM) because it can be estimated from a small number of representative TFBS sequences. However, because the PWM probability model assumes independence between individual positions within the binding site, the PWMs for some TFs are poor discriminants of TFBS sequences from non-binding-site, noncoding DNA. Since three-dimensional DNA structure (i.e., the "shape" of the double helix) is recognized by TFs and is a determinant of binding specificity that depends on multi-base patterns, we investigated whether DNA shape-derived features could improve PWM-based prediction of TFBS. We defined unique features derived from base pair-level DNA shape parameters and integrated them with PWMs in a classifier for discriminating binding site sequences for 119 human TFs from random noncoding DNA sequences. We found that binding site prediction performance for 1/3 of TFs was significantly better in the shape+PWM method vs. the traditional PWM method, while the two methods had equivalent performance for the remaining TFs. Our findings indicate that DNA shape-derived features can conditionally improve PWM-based detection of TFBS, and that this improvement is in general due to the ability of DNA shape features to capture interdependencies between nucleotide positions that cannot be captured in a PWM.
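
To make the independence assumption concrete, the following Python sketch builds a log-odds PWM from a few aligned sites and scores sequence windows as a sum of per-position terms. It illustrates plain PWM scoring only, not the shape+PWM classifier; the pseudocount of 1, the uniform 0.25 background, the function names, and the toy sites are all assumptions made for the example.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM from equal-length aligned binding sites."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background) for b in BASES})
    return pwm


def score(pwm, seq):
    """Sum per-position log-odds; each position contributes independently."""
    return sum(column[base] for column, base in zip(pwm, seq))


def best_window(pwm, seq):
    """Return (score, offset) of the highest-scoring window in seq."""
    width = len(pwm)
    return max((score(pwm, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))


# Hypothetical aligned sites resembling a -10 box.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT", "TATAAT"]
pwm = build_pwm(sites)
top_score, offset = best_window(pwm, "GGCGTATAATGCC")
print(round(top_score, 2), offset)
```

Because `score` is a sum over positions, it cannot reward or penalize a base at one position depending on the base at another, which is the kind of interdependency the shape-derived features are reported to capture.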
................................................................................................................
RG P10: Inferring differential alternative splicing from paired-end RNA-Seq data

Ruolin Liu1, Karin Dorman1, Julie Dickerson1

1Iowa State University

Alternative splicing (AS) is a post-transcriptional regulation mechanism under which a single gene produces multiple mRNA transcripts, called isoforms. The direct outcome of AS is to expand the diversity of mRNAs produced from the genome and, after translation, the protein diversity. Regulation of isoform abundance can have profound functional effects, including changes in protein-protein binding, protein localization, protein enzymatic properties, and protein-ligand binding, and changes in alternative splicing have been linked to human diseases. More than 95% of human genes are alternatively spliced. RNA sequencing (RNA-Seq) has emerged as a high-throughput technology capable of performing detailed transcript data surveys. As RNA-Seq becomes the standard for studying gene and isoform expression, a key problem is to detect differential alternative splicing. In this study, we propose a new method for detecting differential isoform expression using RNA-Seq. Following Rossell et al. (2014), we take a model-based approach for count data, where multiple, a priori known, isoforms contribute to each count. We extend the model to detect differential splicing, while accounting for overdispersed counts between biological replicates.
................................................................................................................
RG P11: Evaluation of the accuracy of enhancer-target associations

Qin Cao1, Kevin Yip1

1The Chinese University of Hong Kong

Motivation: Enhancers are essential regulatory elements that play critical roles across a wide range of cellular processes. Previous studies have suggested that mutation of enhancers may lead to abnormal gene expression and result in disorders. An important step towards a complete understanding of enhancer roles is to associate the target genes regulated by each enhancer. Since enhancers can regulate gene expression via long-range interactions, experimental approaches such as Hi-C and ChIA-PET could help identify enhancer targets. However, the low resolution, high noise level and limited data availability restrict their current use in finding enhancer targets. As a result, computational approaches have been proposed as an important alternative. These methods consider activity correlations, distance information, and co-evolution signals in identifying potential targets of enhancers. The accuracy of these predictions remains to be evaluated comprehensively.

Methods and results: Here, we reason that these putative enhancer-target associations can be evaluated in silico by checking whether the activities of the involved enhancers can accurately predict the expression of the potential targets, or can significantly improve the predictions of other features. This evaluation is non-trivial because 1) a large number of features have been used in calling targets, which should not be used in the evaluation process; and 2) one enhancer can regulate multiple target genes and multiple enhancers can regulate the same gene, making the relationship between enhancer activity and target gene expression fairly complex. In view of this, we have carefully chosen enhancer features and designed data selection and cross-validation procedures that can avoid various types of bias. Based on the potential enhancer-target associations reported in several previous studies, we found that enhancer features not involved in calling their targets could indeed partially indicate the expression levels of their potential targets, although their predictive power is in general not as strong as that of promoter and gene body features, and the prediction accuracy depends highly on the particular data set and testing configuration. Our results also suggest that enhancers may have stronger effects on their targets in complex multiple-to-multiple enhancer-target relationships. Overall, our study provides an objective evaluation of current potential enhancer-target lists, and suggests ways to improve the calling of enhancer targets.
................................................................................................................
RG P12: GWAS next generation: identifying mechanisms of action in association studies.

Maria Rodriguez Martinez1, Paola Nicoletti2, Damien Arnol1, Andrea Califano2

1IBM, Zurich Research Laboratory, 2Columbia University

In recent years, genome wide association studies (GWAS) have identified a plethora of genetic variants associated with complex phenotypes and disease. However, many of the identified variants map to intergenic regions or lie close to genes with unknown biological connection to the disease, and thus, interpreting their functional role remains a daunting task. To tackle this problem, we have designed gVITaMIN (Genetic Variability IdenTifies Missing INteractions), an algorithm that examines the molecular mechanisms underlying the association between genetic variants and complex phenotypes. Specifically, the algorithm tests whether a genetic variant modulates the expression level of a gene or the transcriptional activity of a transcription factor, by altering the relationship with its targets.

We have applied gVITaMIN to the study of breast cancer, a common complex disease with incompletely characterized genetic predisposition architecture. We have selected 50 SNPs previously associated with breast cancer susceptibility, run gVITaMIN using two different breast cancer expression datasets (TCGA and METABRIC), and compared the results obtained from both cohorts. Interestingly, gVITaMIN links the cancer susceptibility conferred by rs1876206 to dysregulation of TGFβ signaling, a potent growth inhibitor with tumor-suppressing activity.
................................................................................................................
RG P13: Are computationally predicted footprints a result of DNase I cleavage bias?

Eduardo G. Gusmao1, Martin Zenke2, Ivan G. Costa1

1RWTH University Aachen, 2RWTH University Aachen Medical School

DNase I digestion followed by massive sequencing (DNase-seq) has proven to be a powerful technique for identifying active transcription factor (TF) binding sites on a genome-wide scale [1, 2, 3, 4]. Several computational approaches have been proposed to find nucleotide-resolution footprints, regions with 5 to 20 bps within two DNase-seq peaks [2, 4]. Recently, He et al. (2014) demonstrated that the DNase-seq signal has biases reflecting the preference of DNase I to cleave particular sequences. Moreover, they show that the performance of a digital footprint method correlates with the cleavage bias of the underlying TF motif and that footprints are outperformed by simple DNase hypersensitivity sites tag count scoring (DHS-TC). However, these results were based on footprints predicted with a simple version of the digital footprint occupancy score (FOS) from [4], and no attempt was made to correct sequence bias prior to footprint prediction.

To address these questions, we extended our segmentation-based digital footprinting framework (HINT - HMM-based identification of TF footprints) [2] by performing bias correction of DNase-seq signals (HINT-BC). We estimated DNase I cleavage bias as in [3] on ENCODE DNase-seq data sets obtained from the Crawford lab (H1-hESC, HeLa-S3, Huvec and K562) and the Stamatoyannopoulos lab (HepG2, Huvec and K562). We observed that cleavage bias is distinct for each DNase-seq data set and that differences were larger between experiments from distinct labs. We then executed HINT, HINT-BC, DHS-TC, and FOS on these data sets and evaluated predictions with 139 TF ChIP-seq data sets measured on these cell types. The performance of the methods was evaluated by their area under the ROC curve (AUC) at 10% false positive rate. Results indicate that HINT-BC significantly outperforms all compared methods, while FOS was outperformed by all methods (Friedman-Nemenyi hypothesis test at 0.05 significance level). This reinforces our point that the method evaluated in [3] is not a good representative of footprint detection methods and that footprint methods profit from sequence bias correction.
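
As an illustration of the evaluation metric, the sketch below computes the area under the ROC curve restricted to false positive rates up to 10%. The abstract does not state how the partial area was normalized, so dividing by the cutoff (so that a perfect ranking scores 1) is an assumption, as are the function names and the toy labels.

```python
def roc_points(labels, scores):
    """ROC curve as (FPR, TPR) points; tied scores are grouped together."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (s, y) in enumerate(ranked):
        tp += y
        fp += 1 - y
        # Emit a point only after the last item in a run of equal scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != s:
            points.append((fp / neg, tp / pos))
    return points


def partial_auc(labels, scores, max_fpr=0.1):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr].

    Assumption: the area is divided by max_fpr so that a perfect
    ranking scores 1.0.
    """
    area = 0.0
    pts = roc_points(labels, scores)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Interpolate the TPR linearly at the FPR cutoff.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area / max_fpr


# Hypothetical labels (1 = site supported by ChIP-seq) and footprint scores.
perfect = partial_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
reversed_ranking = partial_auc([0, 0, 1, 1], [0.9, 0.8, 0.2, 0.1])
uninformative = partial_auc([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
print(perfect, reversed_ranking, uninformative)
```

Restricting the area to low false positive rates emphasizes the top-ranked predictions, which are the ones a user of a footprinting method would normally act on.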

Next, we measured the correlation between the observed and expected number of DNase cleavage sites around each TF. This statistic measures the potential "bias score" of a TF motif for a given DNase-seq assay [3] (Fig. 6). We observed a high negative correlation between FOS AUC and the "bias score" (-0.41, p-value < 0.00001) for all evaluated motifs, which agrees with the observation that FOS footprints are affected by DNase cleavage bias. HINT and HINT-BC presented negative correlation values of -0.14 and -0.04 (p-values > 0.05). These results show that the impact of DNase-seq cleavage bias is low on robust digital footprinting methods and can be further decreased by correcting the DNase-seq signal.

References

1. Crawford, G.E. et al. (2006). Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Research, 16(1), 123-131.

2. Gusmao, E.G. et al. (2014). Detection of active transcription factor binding sites with the combination of DNase hypersensitivity and histone modifications. Bioinformatics.

3. He, H.H. et al. (2014). Refined DNase-seq protocol and data analysis reveals intrinsic bias in transcription factor footprint identification. Nature Methods, 11(1), 73-78.

4. Neph, S. et al. (2012). An expansive human regulatory lexicon encoded in transcription factor footprints. Nature, 489(7414), 83-90.
................................................................................................................
RG P14: Understanding relapse in multiple myeloma patients by automatic reasoning on an integrated model of transcriptomic data and large-scale signaling pathways

Bertrand Miannay1, Carito Guziolowski1, Stephane Minvielle2, Morgan Magnin1, Florence Magrangeas2

1Ecole Centrale de Nantes, 2Centre de Recherche en Cancérologie Nantes-Angers

Multiple myeloma (MM) is an incurable haematological malignancy; our aim is to better understand the mechanisms of relapse by using expression profiles. In this work, we studied the consistency of a large-scale causal network of signaling and transcriptional events with respect to gene expression profiles from 32 individuals (9 healthy, 11 measured at MM diagnosis and 12 at MM relapse).

First, we automatically built the regulatory network of MM by connecting components of NFkappaB (RelA and NFkappaB1), a protein relevant for this cancer, with genes differentially expressed across all patients. For this step we used the Pathway Interaction Database (NCI-PID). Second, we studied whether the logic given by the causal flow of events in the regulatory network is consistent with the up or down expression shifts observed when comparing MM patients' expression data with healthy profiles.

The confrontation of the network logic with these data was done with a tool that uses a qualitative modeling approach, in which the expression shift of each component of the system must be explained by the shifts of its predecessors, in an exhaustive manner.
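
The consistency rule stated above can be sketched in Python as follows. This is a simplified illustration, not the tool used in the study: it assumes every node has an observed shift, treats a node as explained when at least one predecessor exerts an influence with the same sign as the node's own shift, and defines the inconsistency score as the fraction of regulated nodes that are not explained. The node names and edges are hypothetical.

```python
def inconsistent_nodes(edges, observed):
    """Return regulated nodes whose shift no predecessor can explain.

    edges:    (source, target, sign) triples, sign +1 for activation
              and -1 for inhibition.
    observed: node -> +1 (up) or -1 (down).
    """
    influences = {}
    for src, dst, sign in edges:
        if src in observed and dst in observed:
            # Influence on dst = edge sign times the shift of the source.
            influences.setdefault(dst, []).append(sign * observed[src])
    return sorted(n for n, infl in influences.items() if observed[n] not in infl)


def inconsistency_score(edges, observed):
    """Fraction of regulated, observed nodes that are inconsistent."""
    targets = {dst for src, dst, _ in edges if src in observed and dst in observed}
    if not targets:
        return 0.0
    return len(inconsistent_nodes(edges, observed)) / len(targets)


# Hypothetical signed network and expression shifts (illustration only).
edges = [
    ("tf1", "geneA", +1),
    ("tf1", "geneB", -1),
    ("tf2", "geneB", +1),
    ("geneA", "geneC", +1),
]
observed = {"tf1": +1, "tf2": -1, "geneA": +1, "geneB": +1, "geneC": -1}

bad = inconsistent_nodes(edges, observed)
score = inconsistency_score(edges, observed)
print(bad, round(score, 2))
```

Handling unobserved nodes requires reasoning over all sign assignments compatible with the network, which this sketch deliberately leaves out.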

Our results provide a causal and signed (activations, inhibitions) graph composed of 961 nodes and 1234 edges. This graph provides putative explanations for the expression changes of the variant genes in all patients. The graph causality was found to be much more consistent with data from healthy individuals (35% inconsistency score, IS) than with data from MM patients (45% IS). These results may imply that NCI-PID regulatory information better explains the qualitative logic of healthy cells, whereas it is incomplete in describing the logic of cancer cells. Moreover, the inconsistency of expression profiles measured at diagnosis (49% IS) is slightly higher than that of profiles measured at relapse (42% IS), suggesting that NCI-PID regulatory knowledge explains relapse data better than diagnosis data. This result is supported by the fact that treated cancer cells are subject to Darwinian selection, implying that cancer cell populations become more homogeneous after treatment and that, in one sense, some of the regulatory mechanisms are more canonical (consistent with NCI-PID).

In conclusion, this study provides a consensual measure for healthy and cancer patients, taking into account a global logical reasoning over large-scale regulatory pathways despite the high variability of the patients' expression profiles. Our results can be seen as signatures of cancer stages, and in this context we consider this approach novel and complementary to machine learning approaches. We will pursue this research project by expanding the number of patients, integrating RNA-seq and ChIP-seq data into our models, and linking these genomic profiles to functional pathways such as apoptosis and proliferation.

................................................................................................................
WITHDRAWN
RG P16: Genetic variation and geographical implication of Moringa oleifera accessions in Nigeria using amplified fragment length polymorphism (AFLP)

Jacob Popoola1, Conrad Omonhinmin1

1Covenant University, Nigeria

Moringa oleifera is an underutilized tree crop that exists in varied geographical areas within Sub-Saharan Africa and deserves careful genetic assessment. In this study, AFLP markers were employed to evaluate the intra-specific genetic variation among 40 accessions of M. oleifera collected from different areas in Nigeria that were introduced from different countries. Six AFLP selective primer combinations were screened for their ability to generate polymorphic AFLP bands and, based on the resulting banding patterns, two primer combinations (M-CAC/E-ACC and M-CAG/E-ACA) were selected. Principal coordinate analysis (PCA) and cluster analysis (CA) were employed to analyze the relationships among the 40 accessions. The primer combinations generated a total of 1272 amplification bands (primer M-CAG/E-ACA generated 859 bands while M-CAC/E-ACC generated 413 bands), of which 1252 were polymorphic (98.43%), with sizes ranging from 100 to 1000 bp. High gene diversity (0.973) and polymorphic information content (0.974) were recorded for the accessions. The first two eigenvectors of the PCA accounted for 18.21% of the total variation and separated the 40 accessions into four groups. Similarity coefficients from CA, ranging from 0.73 to 0.94, segregated the 40 accessions into six groups. The analysis of the clusters revealed that some accessions with similar areas of collection and background were widely separated, whereas others clustered along collection lines. Accessions that are far apart based on their genetic similarity coefficient (KnN077, ogN026, oyN003 and edN037) could be selected for future breeding trials.

................................................................................................................
RG P15: Phenotypic and RAPD Intra-specific variability in some accessions of underutilized African yam bean (Sphenostylis stenocarpa, Hochst. ex A. Rich, Harms).

Jacob Popoola1, Mary Adebayo2, Olawale Ezekiel1, Emmanuel Adegbite3

1Covenant University, Nigeria, 2International Institute of Tropical Agriculture (IITA), 3Ondo State University of Science and Technology

An intra-specific variability study was carried out on 10 accessions of African yam bean (AYB) (Sphenostylis stenocarpa, Hochst. ex A. Rich, Harms) collected from the International Institute of Tropical Agriculture (IITA) in Ibadan, Nigeria. Fourteen (14) morpho-metric characters and nine (9) arbitrary RAPD primers were used to evaluate intra-specific genetic variability among the accessions. A total of 410 bands were generated, of which 261 were polymorphic (63.66%). The significant correlations among consistent characters such as days to 50% flowering, pods per peduncle, number of locules per pod, number of seeds per pod, pod length and seed set percentage point to their suitability for breeding and genetic improvement purposes. RAPD cluster analysis using NTSYS-pc with the UPGMA method produced two major clusters and one minor cluster, with similarity ranging from 72% to 93%, while morpho-metric clustering produced three major groups. Two accessions, TSs 56 and TSs 94, had the highest similarity index (93%). The use of both approaches not only enabled the characterization of the accessions; RAPD also eliminated selection errors that may arise from relying on areas of collection or origin. Genetic diversity studies are very important for the selection of good character traits, genetic improvement and conservation.
................................................................................................................
RG P17: STAT4 regulated transcriptional and epigenetic specification of human Th1 cells
Sini Rautio1, Sanna Edelman2, Yuka Kanno3, Jussi Jalonen2, Subhash Tripathi2, John O'Shea3, Harri Lähdesmäki1 and Riitta Lahesmaa2

1Aalto University, 2Turku Centre for Biotechnology, University of Turku and Åbo Akademi University, 3National Institute of Arthritis and Musculoskeletal and Skin Diseases, National Institutes of Health

Signal transducer and activator of transcription 4 (STAT4) is a key factor driving the differentiation program of T helper 1 (Th1) cells. However, genome-wide STAT4 binding and its role in controlling gene expression and the epigenetic landscape have not previously been characterized in human Th1 cells. By performing STAT4 siRNA silencing followed by ChIP-seq, we identified genome-wide STAT4 binding sites in early differentiating and fully differentiated Th1 cells originating from naïve cord blood CD4+ Th cells activated and cultured in vitro under Th1 polarizing conditions. In addition, we identified STAT4-regulated genes, global STAT4-dependent active (H3K4me3) and repressive (H3K27me3) epigenetic promoter modifications, and p300 coactivator and H3K27ac recruitment to enhancers.

Preliminary results show that STAT4 binds over 25,000 loci, with little change in the binding targets over the course of differentiation. We identified genes controlled by distal or proximal STAT4 binding. STAT4 was observed both to enhance and to block p300 recruitment to distal regulatory sites (enhancers), as well as to activate and repress gene expression and to modulate promoter modifications of STAT4-regulated genes. STAT4 regulates many genes important in Th1 cell differentiation, but also genes specific to the Th2 and Th17 subsets. An integrative analysis of STAT4 binding sites and SNPs associated with immune-mediated diseases revealed several shared loci. STAT4 binding sites were found to overlap with SNPs associated with diseases such as Crohn's disease and multiple sclerosis. In summary, the results outline the important role of STAT4 in controlling transcription through both distal and proximal regulation and in shaping the chromatin configuration, and suggest a potential role in disease etiology.
................................................................................................................
RG P18: A Bayesian multi-scale Poisson model for detecting differences in high-throughput sequencing data between multiple groups and its application to small sample sizes

Heejung Shim1, Zhengrong Xing1, Ester Pantaleo1, Matthew Stephens1

1University of Chicago

Identifying differences between multiple groups in molecular and cellular phenotypes measured by high-throughput sequencing assays is a frequent task in genomics applications. For example, common problems include detecting differential gene expression between multiple conditions using RNA-seq and detecting differences in transcription factor binding/chromatin accessibility across tissues using DNase-seq or ChIP-seq. Motivated by WaveQTL, our previous wavelet-based approach to genetic association analysis of functional phenotypes that better exploits high-resolution information from high-throughput sequencing assays, here we present multiseq, a set of statistical methods that model the count nature of the sequence data directly using multi-scale models. Specifically, multiseq considers the data as an inhomogeneous Poisson process and tests for differences in underlying intensities using a Bayesian multi-scale model. We compared multiseq to WaveQTL on simulated data sets with different sample sizes or different library read depths. As expected, multiseq outperforms WaveQTL at smaller sample sizes. Even with larger sample sizes (e.g., 70), multiseq outperforms WaveQTL unless library read depths are very high. In addition, we applied these two multi-scale methods and a window-based approach, DESeq, to ATAC-seq data measured on copper-treated samples and control samples (3 vs 3), and we found that multiseq detected substantially more differences in chromatin accessibility between the two conditions than WaveQTL and DESeq.
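To make the multi-scale idea concrete, the following Python sketch shows a standard recursive dyadic decomposition of a count vector, which is one common way to build multi-scale Poisson models. It is an illustration under assumptions rather than the multiseq implementation: the function name, the power-of-two length requirement, and the toy read counts are all hypothetical. The relevant fact is that for independent Poisson counts, the left-child count given the parent total is binomial, so differences in intensity between groups show up as differences in these binomial proportions at each scale and location.

```python
def multiscale_decompose(counts):
    """Decompose a count vector of length 2**J into its grand total and,
    for each scale, a pair (parent_totals, left_child_counts).

    Scales are returned coarsest first.
    """
    x = list(counts)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")

    levels = []
    while len(x) > 1:
        left = x[0::2]                       # counts in the left halves
        parent = [a + b for a, b in zip(x[0::2], x[1::2])]
        levels.append((parent, left))
        x = parent                           # move up to the coarser scale
    return x[0], levels[::-1]


# Hypothetical read counts in 8 adjacent bins.
total, levels = multiscale_decompose([3, 1, 0, 4, 2, 2, 5, 3])
for parent, left in levels:
    print(parent, left)
```

The grand total carries the overall read depth, while each (parent, left) pair describes how reads split between the two halves of a region at one resolution.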
................................................................................................................
RG P19: Progressive promoter element combinations classify conserved orthogonal plant circadian gene expression modules

Sandra Smieszek1

1Cleveland Clinic

We aimed to test the proposal that progressive combinations of multiple promoter elements acting in concert may be responsible for the full range of phases observed in plant circadian output genes. In order to allow reliable selection of informative phase grouping of genes for our purpose, intrinsic cyclic patterns of expression were identified using a novel, non-biased method for the identification of circadian genes. Our non-biased approach identified two dominant, inherent orthogonal circadian trends underlying publicly available microarray data from plants maintained in constant conditions. Furthermore, these trends were highly conserved across several plant species. Four phase-specific modules of circadian genes were generated by projection onto these trends and, in order to identify potential combinatorial promoter elements that might classify genes into these groups, we used a random forest pipeline which merged data from multiple decision trees to look for the presence of element combinations. We identified a number of regulatory motifs which aggregated into coherent clusters capable of predicting the inclusion of genes within each phase module with very high fidelity and these motif combinations changed in a consistent, progressive manner from one phase module group to the next, providing strong support for our hypothesis.
................................................................................................................
RG P20: An integrated network approach for the identification of functional large intergenic noncoding RNAs

Jiajian Zhou1, Huating Wang1, Hao Sun1

1The Chinese University of Hong Kong

Increasing evidence indicates that large intergenic non-coding RNAs (lincRNAs) are a novel family of gene regulators. With many novel lincRNAs having been identified using high-throughput sequencing approaches, computational exploration of their potential functions is essential to prioritize and shortlist candidate functional lincRNAs for further study. In this work, we used a systems biology approach to build an integrative network by combining regulatory data and gene expression data obtained from ChIP-seq and RNA-seq. We also developed a ranking method to evaluate the importance of key lincRNA nodes in the network. When applying this network to a public mESC dataset, we identified more than 70% of the reported functional lincRNAs as key functional lincRNA nodes in our network. Altogether, our approach has been demonstrated to be useful for lincRNA functional annotation.
................................................................................................................
RG P21: A discriminatory approach for transcription factor binding prediction using DNase I hypersensitivity data

Juhani Kähärä1, Harri Lähdesmäki1

1Aalto University

Transcriptional regulation is largely controlled by DNA binding proteins called transcription factors. Understanding transcription factor binding is integral to understanding gene expression and the function of gene regulatory networks.

Currently, transcription factor binding sites are determined by chromatin immunoprecipitation followed by sequencing, but this method has several limitations. To overcome them, DNase I hypersensitive sites sequencing is increasingly being used for mapping gene regulatory sites. Computational tools are needed to accurately determine transcription factor binding sites from this new type of data.

In this work a novel method is developed for detecting transcription factor binding sites using DNase I hypersensitivity data. The method utilizes feature selection for choosing relevant features from DNase-seq data for optimal discrimination of bound and unbound genomic sites. The procedure is designed to ignore features resulting from the intrinsic sequence bias of the DNase I and to choose only features that improve the binding prediction performance.

The method is applied to 57 different transcription factors in cell type K562. We demonstrate that the prediction performance of the method exceeds the performance of other existing methods. Our results indicate that DNase I hypersensitivity data should be used in multiple resolutions instead of the highest possible resolution. We also show that the binding predictions should be made separately for each transcription factor and that the sequencing depth of currently available data sets is sufficient for binding predictions for most transcription factors. Finally, we show that models built with our method generalize between different cell types, making the method a powerful tool in transcription factor binding predictions using DNase I hypersensitivity data.
................................................................................................................
RG P22: Characterization of enhancer gene interactions using DNaseI and gene expression data across 110 cell types

Pouya Kheradpour1, Manolis Kellis1

1Massachusetts Institute of Technology

Recent efforts to characterize diseases through genome-wide association studies and to annotate the genome using ChIP-seq experiments have led to a dramatic increase in putative functional genomic regions. While most of the implicated loci fall outside coding regions and are thought to be regulatory in nature, efforts to link these regions to their target genes, and thereby better understand their importance, have lagged considerably. Generally, experimental linking techniques only permit the interrogation of a small number of specific regions or produce a genome-wide linking at very low resolution.

We utilize DNaseI hypersensitivity sites (DHS) and expression data from 110 human cell types produced by the ENCODE and Roadmap Epigenomics projects to produce linking confidences between hypersensitive regions and nearby genes. We find that high confidence links are supported by independent datasets such as eQTL annotations and tend to show preserved synteny across mammals.
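As an illustration of activity-based linking, the Python sketch below scores each candidate site-gene pair by the Pearson correlation between DHS signal and gene expression across cell types. This is an assumption made for illustration: the abstract does not specify the statistic behind its linking confidences, and the function names, site and gene identifiers, and toy values are hypothetical. A real analysis would also restrict pairs to sites within some distance of each gene.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no linking information
    return sxy / math.sqrt(sxx * syy)


def link_scores(dhs_signal, expression):
    """Score every (site, gene) pair; each profile has one value per cell type."""
    return {
        (site, gene): pearson(s, e)
        for site, s in dhs_signal.items()
        for gene, e in expression.items()
    }


# Hypothetical profiles across five cell types.
dhs_signal = {"dhs1": [0, 1, 2, 3, 4], "dhs2": [4, 3, 2, 1, 0]}
expression = {"geneX": [1, 3, 5, 7, 9]}
scores = link_scores(dhs_signal, expression)
print(scores)
```

Here dhs1 tracks geneX across cell types and would be a candidate link, while dhs2 is anti-correlated. With many cell types, the distribution of such scores per gene can be compared against a distance-matched background.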

Beyond producing these links, a careful analysis of the distribution of correlations for each gene allows us to address a number of fundamental questions in enhancer biology. We estimate the number of enhancers per gene and where they are distributed with respect to the TSS. We find bulk signal for linking to hypersensitive sites as far as 10 megabases away from a gene's TSS - substantially in excess of the distances at which linking or eQTLs are generally considered.

We find that linking is influenced by the presence and orientation of other nearby genes, and genomic features such as CpG islands. We also examine how these estimates and our ability to identify DHS-gene links change as we vary the number of cell types in our analysis, allowing us to extrapolate what will happen as more data becomes available.

By examining the correlation distribution correcting for distance, we are able to support the biologically established insulating role for CTCF. Applying the same methodology to all ENCODE ChIP-seq datasets and hundreds of regulatory motifs predicts other factors that are also associated with increases or decreases in enhancer linking, suggesting a more complex picture of enhancer targeting.
................................................................................................................
RG P23: Using epigenomics data to predict gene expression in lung cancer

Jeffery Li1, Travers Ching2, Sijia Huang2, Lana Garmire2

1Johns Hopkins University, 2University of Hawaii Cancer Center

Epigenetic alterations are known to be correlated with changes in gene expression. However, quantitative models that accurately predict gene expression are currently lacking. DNA methylation and histone modification are two major mechanisms of epigenetic regulation. Together, these data can accurately predict gene expression in lung cancer.
................................................................................................................
RG P24: Transcriptome profiling of pediatric core binding factor AML

Chih-Hao Hsu1, Cu Nguyen1, Rhonda Ries2, Chunhua Yan1, Qing-Rong Chen1, Ying Hu1, Julia Kuhn2, Emma Geiduschek2, Fabiana Ostronoff2, Derek Stirewalt2, Warren Kibbe1, Daoud Meerzaman1, Soheil Meshinchi2

1National Institutes of Health, 2Fred Hutchinson Cancer Research Center

Acute myeloid leukemia (AML) is a hematopoietic malignancy that leads to dysregulation of critical signal transduction pathways and results in clonal expansion without complete differentiation. Although several adult AML studies have been reported, the pathogenesis of pediatric AML is still poorly understood. In this study, RNA-Seq analysis was performed on 64 pediatric patient samples to study the impact of different cytogenetic abnormalities on the transcript profiles in pediatric AML. Specifically, we focused on the comparison of samples with t(8;21) and inv(16), referred to as core binding factor (CBF) AML, with those with normal karyotype. In our study, all homeobox (HOX) genes in samples with t(8;21) and most HOX genes in samples with inv(16) were down-regulated compared to the samples with normal karyotype, suggesting that dysregulation of HOX genes may perturb normal hematopoiesis. In addition, we applied four different gene fusion detection methods, deFuse, TopHat-Fusion, FusionMap, and SnowShoes-FTD, to identify gene fusion events in the pediatric AML samples. A total of 69 putative fusion events were identified by at least two detection methods or by one method with a ChimerDB hit. Eight of the 69 putative fusion events were found in ChimerDB, and 6 of them have previously been reported as fusion events in AML. Furthermore, PIM3-SCO2, which was identified as a putative fusion event in 3 cases with inv(16), was verified by RT-PCR and Sanger sequencing, suggesting that the combination of gene fusion detection methods and ChimerDB can accurately identify fusion events. Differential splicing events were also identified between different cytogenetic cohorts, indicating the strong influence of cytogenetic abnormalities on the whole transcriptome in pediatric AML. Our studies shed light on novel cytogenetic changes in pediatric AML that might be useful for predicting survival and treatment outcome.
................................................................................................................
RG P25: Unveiling the DNA binding specificities of oncoprotein c-Myc and its antagonist Mxi1 using novel high-throughput data

Ning Shen1, John Horton1, Raluca Gordan1

1Duke University

Transcription factors (TFs) that belong to the same structural family are known to have similar DNA binding specificities. Consequently, TF binding motifs (represented as position weight matrices, PWMs) are often indistinguishable for closely related factors. This is the case for c-Myc, one of the most frequently deregulated TFs in human cancers, and Mxi1, the closest c-Myc paralog and antagonist. Both TFs have a high binding affinity for E-box CAnnTG motifs, and available PWM models cannot differentiate between c-Myc and Mxi1 [1, 2]. Binding of oncoprotein c-Myc to the genome generally leads to gene expression amplification, whereas Mxi1 functions as a transcriptional repressor and tumor suppressor. Thus, Mxi1 appears to be an ideal antagonist for c-Myc. However, previous studies revealed that overexpression of Mxi1 retards but does not stop proliferation of c-Myc-expressing cancer cells, and in vivo TF-DNA binding data (from ChIP-seq experiments) identified many genomic targets bound by c-Myc but not Mxi1. These findings raise the question of whether the DNA binding preferences of these paralogous TFs are really identical, as inferred using consensus motifs and PWM models.

In this study we present the first quantitative, high-throughput analysis of the in vitro and in vivo DNA binding specificities of two closely related TFs, c-Myc and Mxi1. In contrast to what has been reported previously in the literature, by examining the binding specificity of c-Myc comprehensively, we identified a number of non-E-box motifs with much higher binding affinities for c-Myc than many CAnnTG E-box motifs. To better understand the DNA binding specificity of c-Myc and its antagonist Mxi1, we used the genomic-context protein-binding microarray (gcPBM) technology [3] to measure in vitro binding of the two TFs to ~50,000 putative genomic target sites, as identified by ChIP-seq in vivo. The gcPBM results show that c-Myc and Mxi1 have significant differences in their in vitro binding specificities to genomic sequences. Intriguingly, both non-canonical E-box motifs and non-E-box motifs are preferred differently by the two TFs, and these differences are not captured by PWM models. Furthermore, we show that the differences in the intrinsic binding specificities between c-Myc and Mxi1 are relevant for their differential genomic binding in the cell, and are also important for different c-Myc biological functions. Finally, computational models and DNA/protein structure analyses reveal possible mechanisms for the differential binding specificity of c-Myc and Mxi1. Our findings have important implications for the direct competition for DNA binding between c-Myc and Mxi1, and thus for the potential use of tumor suppressor Mxi1 in therapeutic approaches aimed at targeting the oncoprotein c-Myc.

References
1. Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K et al: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic acids research 2006, 34(Database issue):D108-110.

2. Munteanu A, Gordân R: Distinguishing between genomic regions bound by paralogous transcription factors. In: Research in Computational Molecular Biology: 2013: Springer; 2013: 145-157.

3. Gordan R, Shen N, Dror I, Zhou T, Horton J, Rohs R, Bulyk ML: Genomic regions flanking E-box binding sites influence DNA binding specificity of bHLH transcription factors through DNA shape. Cell reports 2013, 3(4):1093-1104.
................................................................................................................
RG P26: DNA methylation dynamics during somatic cell reprogramming

Giancarlo Bonora1, Constantinos Chronis1, Marco Morselli1, Liudmilla Rubbi1, Matteo Pellegrini1, Kathrin Plath1

1University of California, Los Angeles

We used whole-genome bisulfite sequencing (WGBS) to assess global DNA methylation patterns at four different stages of reprogramming of mouse embryonic fibroblasts to an embryonic stem cell-like state, including early and late reprogramming intermediates. To better understand the dynamics of DNA methylation and how it is reset to the pluripotent state during the course of reprogramming, we compared differentially methylated regions (DMRs) with change in other genomic features, including chromatin modifications and transcription factor binding. Furthermore, we investigated the utilization of multi-restriction enzyme digestion of genomic DNA in combination with bisulfite sequencing (MRE-BS) to determine DMRs between two samples at a fraction of cost of WGBS, and found that it compared favorably to results achieved using traditional reduced representation bisulfite sequencing (RRBS). The results of this research will shed light on epigenetic regulation during the reprogramming process.
................................................................................................................
RG P27: Hidden Markov model analysis reveals complex binding modes for the transcription factor Gcn4

Todd Riley1, Cory Colaneri1

1University of Massachusetts Boston

Many new high-throughput technologies have been developed to quantitatively measure both in vivo and in vitro protein-DNA binding. Likewise, many computational algorithms have been developed to determine the binding preferences of these transcription factors (TFs) from the binding data. Most protein-DNA specificity inference methods suffer from a major weakness in that they can model only fixed-length motifs. This is especially problematic since recent studies have revealed that many families of TFs that bind primarily as dimers and tetramers have variable-length sequences (called spacers) separating the two sequence-specific half sites that, all together, delineate the binding motif. Other recent studies have shown examples of proteins that can interface with the DNA via different regions of the protein. These different "binding modes" also introduce variability in the lengths of their respective binding sites. Additionally, some proteins exhibit degenerate DNA specificity and exhibit tolerated insertions or deletions of nucleotides in multiple places within some of their binding sites as compared to their canonical consensus motifs - which again introduces binding site length variability.

With the impetus to correctly model variable-length motifs, many researchers have used hidden Markov models (HMMs), and their generalizations, as a probabilistic framework to capture insertions or deletions (indels) of nucleotides within a binding site. The so-called profile hidden Markov model (pHMM) has a topology that is well suited to modeling observed indels within a sequence and has been successfully applied in many domains. pHMM methods have recently been developed to model variable-length spacers in protein-DNA binding sites. However, there exist many possible HMM topologies for modeling variable-length spacers, and each topology has different modeling characteristics. It has not been determined which HMM topology best captures these spacer dynamics for any given protein.

In order to identify which among the many possible HMM topologies is optimal to model a variable-length spacer within a protein-DNA binding site, we chose to analyze both protein binding microarray (PBM) and "high-throughput sequencing"-"fluorescent ligand interaction profiling" (HiTS-FLIP) binding data for the bZIP protein Gcn4. The Gcn4 protein is ideal for our study since: (1) it is known to contain a variable-length spacer, and (2) the very deep sequencing of the HiTS-FLIP binding data for Gcn4 allows for the inference of very highly accurate affinity models for analysis.

Our analysis reveals complex dependencies between the variable-length spacer and the surrounding half-sites. Furthermore, we show that the simple HMM topologies currently in use to model variable-length spacers in protein-DNA binding sites are not adequate to capture these dependencies. We propose spacer-dependent HMM models that more accurately capture the complex dependencies between the spacers and their surrounding half sites with fewer model parameters as compared to common pHMM topologies. While modeling the two known spacer-lengths of Gcn4, our spacer-dependent HMM with 58 parameters explains 80% of the variance for all measured 12-mer relative affinities of at least 0.01 (1.0 being the maximum). In addition, we show that the typical pHMM topologies currently in use contain too many parameters (≥ 84), fail to correctly capture the spacer dependencies, and generate protein-DNA affinity models with inflated nucleotide insertion and deletion rates as compared to the consensus motif.
................................................................................................................
RG P28: GREAT: Genome REgulatory and Architecture Tools. The GREAT:SCAN software suite

Costas Bouyioukos1, Mohamed Elati1, François Képès1

1Institute of Systems & Synthetic Biology, Université d'Evry

Modern advances in genomics, transcriptomics, and genome structural biology have revealed significant insights on the interdependence between genome expression and layout. Evidence for non-random genome layout [1], defined as relative positioning of co-regulated or co-functional genes, stems from two main insights. Firstly, the analysis of contiguous genome segments across species has highlighted the conservation of gene order along chromosome regions. Secondly, the study of long-range regularities within chromosomes in a given species has emphasized periodic positioning of genes that are co-regulated, co-expressed, evolutionarily correlated, or highly codon-biased [3],[5]. Tools to detect, visualize, systematically analyze, integrate, and exploit gene position regularities along genomes have been developed [2].

Here we present a software suite designed to perform a systematic and integrated analysis of regular patterns along genomes. The suite is based on an algorithm to detect periodicities and it provides an easy-to-use interface to perform complete analyses of regular patterns and to visualize results.

The suite comprises three tools: GREAT:SCAN:patterns, a package for the systematic study of periodic patterns; GREAT:SCAN:integrate, a novel computational process which integrates regularities across multiple transcription factors (TFs) and chromosomes; and GREAT:SCAN:precision, a machine learning tool to predict novel TFBS.

GREAT:SCAN:patterns performs a complete analysis of periodic patterns in three steps. The first step calculates an exact p-value for all predicted periods and returns a rank of them and a periodogram. In the second step a clustering algorithm detects clusters of genes that are "in-phase" on the modulo period coordinates, providing evidence of possible local spatial proximity of genes. In the last step a variable size sliding window performs a more fine-tuned search for regularities on specific domains of the chromosome. In this work, we present a complete analysis of 7 major TFs of E. coli and report preliminary results that regions of periodic arrangement are associated with the macro-domain organization of this bacterial genome.
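The notion of genes being "in-phase" on modulo-period coordinates can be sketched in a few lines of Python. The scoring used below, the mean resultant length of the positions mapped onto a circle, is a simple stand-in chosen for illustration; it is not the exact p-value computation of GREAT:SCAN:patterns, and the function names, positions, and candidate periods are hypothetical.

```python
import cmath
import math


def phase_coherence(positions, period):
    """Mean resultant length of positions folded onto a circle of the given
    period: close to 1 when positions share a phase, close to 0 when spread."""
    if period <= 0 or not positions:
        raise ValueError("need a positive period and at least one position")
    z = sum(cmath.exp(2j * math.pi * (p % period) / period) for p in positions)
    return abs(z) / len(positions)


def best_period(positions, candidates):
    """Return the candidate period with the highest phase coherence."""
    return max(candidates, key=lambda period: phase_coherence(positions, period))


# Hypothetical gene start positions (bp), roughly 50 kb apart with jitter.
positions = [1000, 51200, 100900, 151100, 250800]
candidates = [30000, 40000, 50000, 60000]
print(best_period(positions, candidates))
```

Note that integer divisors of a true period score equally well under this measure, which is one reason a dedicated significance test and the harmonic handling described for the suite are needed in practice.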


GREAT:SCAN:integrate is a computational process which automatically consolidates and integrates analyses of periodic patterns on multiple TFs and/or on multiple chromosomes. It consists of a series of seven steps. Initially, periods are detected on all the groups of co-regulated genes, and then a couple of integration steps at the TF and chromosome levels consolidate periods and extend overlapping extremities. Finally, the process searches for integer-multiple periods and collects them all to form families of harmonics with their periodic extremities extended. The result of the final step is visualized as a set of periodic regions that span chromosomes, and the results of each intermediate step are stored in a database for further analysis and/or visualization.

We will present the formal description of the seven-step process together with initial evidence from an application to the TF network of the yeast Saccharomyces cerevisiae, where we identify common periods, harmonics and a significant degree of overlap between the master transcription regulators of yeast.

These two tools were developed to detect periods in sets of co-regulated genes; however, they can work with any gene set of interest, as well as with any set of genomic positions of interest, including but not limited to ChIP-seq data.

GREAT:SCAN:precision is a novel implementation of a machine learning tool for TFBS prediction [4] which incorporates two inputs in a classifier: a) direct DNA sequence motif readout, and b) genome layout readout from the genomic coordinate. The underlying rationale is based on the emerging observation that co-regulated genes are positioned at periodic intervals along the chromosome. The combined classifier is then obtained with an iterative weight update scheme, using a modified version of the AdaBoost algorithm. We will report on the novel prediction of E. coli TFBS as well as insights on the interplay between sequence motif and position.

References

1. Képès, F. (2004). Periodic transcriptional organization of the E. coli genome. J Mol Biol 340, 957-964.

2. Junier, I., Hérisson, J., and Képès, F. (2010). Efficient detection of periodic patterns within small datasets. Algorithms for Molecular Biology 5, 31.

3. Junier, I., Hérisson, J. and Képès, F. (2012). Genomic Organization of Evolutionarily Correlated Genes in Bacteria: Limits and Strategies. J. Mol. Biol. 419, 369-86.

4. Elati, M., Nicolle, R., Junier, I., Fernandez, D., Fekih, R., Font, J. and Képès, F. (2013). PreCisIon: PREdiction of CIS-regulatory elements improved by gene's positION. Nucleic Acids Res. 41, 1406-15.

5. Képès, F., Jester, B.C., Lepage, T., Rafiei, N., Rosu, B. and Junier, I. (2012). The layout of a bacterial genome. FEBS Lett. 586, 2043-2048.
................................................................................................................
RG P29: A biophysical analysis of transcription factor binding data

Rahul Siddharthan1

1The Institute of Mathematical Sciences

In 2003, Djordjevic, Shraiman, and Sengupta proposed a biophysical approach to the binding of transcription factors, pointing out that rather than the simple statistical description offered by a PWM, a more accurate expression for the probability of a TF binding to a sequence is given by the Boltzmann factor for the binding energy E, exp(-beta (E-mu)) (where beta is the inverse temperature and mu is the chemical potential), divided by the partition function Z. In the simple case that the two options are binding or not-binding, Z = 1+exp(-beta(E-mu)), and the overall probability takes the form of a sigmoidal or Fermi function. Even if we assume that E is additive across nucleotides, this function is not multiplicative, and the PWM description of the probability as a product over individual nucleotides breaks down.

Transcription factor binding sites (TFBS) rarely occur individually; usually several factors are bound to a single sequence. This raises the possibility that what is important is not the "strength" (binding affinity or probability, PWM score, etc.) of a single site, but the expected number of bound factors across a promoter or enhancer region. The thermodynamic probability of this, too, can be calculated efficiently, taking into account competition between factors for binding sites.
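As a rough illustration of the two quantities described above, the sketch below evaluates the Fermi-function binding probability for an additive energy matrix and sums it over all windows of a sequence. The energy matrix, the sequence, and the function names are hypothetical, only one strand is scanned, and sites are treated as independent, so the competition between factors mentioned above is deliberately left out.

```python
import math

def binding_probability(energy, mu, beta=1.0):
    """Fermi function: exp(-beta*(E-mu)) / (1 + exp(-beta*(E-mu)))."""
    return 1.0 / (1.0 + math.exp(beta * (energy - mu)))

def site_energy(site, energy_matrix):
    """Additive binding energy: sum of per-position, per-nucleotide contributions."""
    return sum(energy_matrix[i][base] for i, base in enumerate(site))

def expected_bound(sequence, energy_matrix, mu, beta=1.0):
    """Expected number of bound factors, summed over all windows.
    Simplification (assumption): windows are independent, with no competition."""
    width = len(energy_matrix)
    return sum(
        binding_probability(site_energy(sequence[i:i + width], energy_matrix), mu, beta)
        for i in range(len(sequence) - width + 1)
    )

# Hypothetical 3-bp energy matrix (arbitrary units); the consensus "TGA" has energy 0
energy_matrix = [
    {"A": 2.0, "C": 2.0, "G": 2.0, "T": 0.0},
    {"A": 2.0, "C": 2.0, "G": 0.0, "T": 2.0},
    {"A": 0.0, "C": 2.0, "G": 2.0, "T": 2.0},
]
print(binding_probability(0.0, mu=0.0))
print(expected_bound("ACTGATTGAC", energy_matrix, mu=0.0))
```

Because the probability saturates at 1, doubling the number of favorable nucleotides does not double the probability, which is the sense in which the function is not multiplicative across positions.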

The difficulty, of course, is that biophysical binding energies are available for very few transcription factors, and for very few bound sequences per factor, and vary according to cell type and environmental conditions. Our approach is not to depend on experimental data on binding energy, but to use it as a description of binding whose parameters are to be inferred from experimental data on bound sequences.

Specifically, we use this idea to distinguish between bound and unbound sequences in ChIP-seq data in yeast, from Venters and Pugh. For each of 32 transcription factors for which adequate numbers of bound sequences were available, we divided the bound sequences into training and testing sequences. For each training sequence we constructed a synthetic "negative" sequence from a Markov model constructed from the set of training sequences. We then inferred energy matrices that maximized the difference in binding between training and negative sequences.

For the testing sequences, we constructed two negative sets, each of the same size as the training data: one synthetic as above, and one of actual intergenic sequence in yeast that has not been observed to be bound by the concerned factor. We then calculated the difference in the biophysically calculated number of bound factors among these three sets. We do the same exercise with literature PWMs.

The results, though preliminary, are highly encouraging. In 20 of 32 cases, the biophysical method clearly distinguishes the bound testing regions from synthetic data. In 12 out of 32 cases, the biophysical method also clearly distinguishes the known-bound testing regions from other intergenic regions. The PWM-based approach, with its well-known propensity for false positives, also gives a higher score to the bound testing regions in many cases when literature PWMs are used, but much less significantly.

While there have been several previous attempts to go beyond position weight matrices for TFBS description, including this author's, the biophysical approach holds appeal for its physical intuitiveness and ease of calculation.

This is work in progress. Further refinement of the method and validation on other datasets, including ENCODE ChIP-seq data, are under way.

................................................................................................................
RG P30: Transcription factor cooperativity reveals distinguishing characteristics of the HepG2 cell-line

Konnor La1, Parsa Hosseini1, Yupeng Wang1, Ivan Ovcharenko1

1National Institutes of Health

Transcription regulation is a tightly controlled biological process. One particular aspect of transcription regulation includes transcription factors (TFs) binding to specific DNA sequences known as cis-elements. With advances in high-throughput sequencing technologies, researchers can now quantify TF interplay and improve the high-resolution map of the genomic regulatory landscape.

Enhancers are one class of regulatory elements that are generally found kilobases away from or sometimes within the target gene and possess the ability to regulate transcription. Until lately, genomic repositories of such elements were largely lacking, and initiatives such as the ENCODE Consortium aspire to quantify their systematic interplay.

In this study, building upon ENCODE data for the human liver carcinoma cell-line HepG2, we quantify the HepG2 TF-TF spatial landscape. We identified numerous TF pairs involved in liver cellular respiration. We used statistical models to quantify enrichment of TF binding sites across HepG2 enhancers and found 0.05% of possible transcription factor combinations to be significant pairs. TFs such as FOX, HNF4, HNF3, and NR1H2-RXRA are enriched in both a spatial and a sequential manner. Interestingly, when we compared HepG2-specific TF-TF pairs to eight other ENCODE cell-lines, we found that many HepG2 TF pairs were exclusive to HepG2. Lastly, we show that many HepG2-specific TF-TF pairs exhibit increased conservation when compared to pairs of other cell-lines. Such results could therefore shed light on a liver-specific set of TFs that potentially function collectively to govern liver specificity.

In summary, this study examines the TF-TF spatial landscape across HepG2. Our results reveal a set of liver-specific TFs that are enriched in crucial biological processes and exhibit conservation across placental mammals.
................................................................................................................
RG P31: An evolutionarily biased distribution of miRNA sites toward regulatory genes with high promoter-driven intrinsic transcriptional noise

Hossein Zare1, Arkady Khodursky2, Vittorio Sartorelli1

1National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS), National Institutes of Health (NIH), 2University of Minnesota

miRNAs are a major class of regulators of gene expression in metazoans. By targeting cognate mRNAs, miRNAs are involved in regulating most, if not all, biological processes in different cell and tissue types. To better understand how this regulatory potential is allocated among different target gene sets, we carried out a detailed and systematic analysis of miRNA target sites distribution in the mouse genome.

We used predicted conserved and non-conserved sites for 779 miRNAs in the 3' UTRs of 18,440 genes, downloaded from the TargetScan website. Our analysis reveals that the 3' UTRs of genes encoding regulatory proteins harbor a significantly greater number of miRNA sites than those of non-regulatory (housekeeping and structural) genes. Analysis of miRNA sites in orthologous 3' UTRs in 10 other species indicates that regulatory genes maintained or accrued miRNA sites, while non-regulatory genes gradually shed them, in the course of evolution. Furthermore, we observed that the 3' UTRs of genes with higher expression variability driven by their promoter sequence content are targeted by many more distinct miRNAs than those of genes with low transcriptional noise.

Based on our results, we envision a model, which we dubbed "selective inclusion," whereby non-regulatory genes with low transcriptional noise and stable expression profiles lost their sites, while regulatory genes that endure higher transcriptional noise retained their sites and gained new ones. This adaptation is consistent with the requirement that regulatory genes be tightly controlled so that their protein levels are precise and optimal for proper function.

................................................................................................................
RG P32: Consensus strategy improves microRNA prediction

Bin Xue1

1University of South Florida

microRNAs are short regulatory RNAs of about 22 nucleotides. They are produced through at least two pathways. In the first pathway, primary microRNAs with multiple stem-loop structures are processed by the microprocessor Drosha to produce precursor microRNAs, which have ~80 nucleotides forming a stem-loop structure. The precursor microRNAs are further processed by Dicer to produce mature microRNAs. In the second pathway, short RNA transcripts that have stem-loop structures are processed by Dicer directly to produce mature microRNAs. In both pathways, specific sequence features of the RNA transcripts have been observed. Based on these observations, many computational strategies have been proposed to predict microRNAs from secondary structure, base-pairing free energy, or sequence conservation. These predictors are normally developed using hundreds of known microRNAs. Although very successful, the application of these predictors is still limited by their high false positive rate. We propose a consensus strategy that integrates the strengths of different predictors and tested its performance on a much larger dataset of nearly 2000 microRNAs. Our results show that the consensus strategy improves prediction performance consistently. This strategy may have broad application in the future due to its remarkably reduced false positive rate.
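The abstract does not state how the individual predictors are combined. One common way to build a consensus, shown here purely as an assumption, is a vote threshold: a candidate is accepted only when enough predictors agree. The predictor outputs and candidate names below are hypothetical.

```python
def consensus_predict(votes, min_agree):
    """Call a candidate a microRNA only if at least `min_agree` predictors say so."""
    return sum(votes) >= min_agree

# Hypothetical outputs of three predictors (True = predicted microRNA) per candidate
candidates = {
    "cand_1": [True, True, True],
    "cand_2": [True, False, False],
    "cand_3": [True, True, False],
}
calls = {name: consensus_predict(v, min_agree=2) for name, v in candidates.items()}
print(calls)
```

Raising `min_agree` trades sensitivity for a lower false positive rate, since a false call must then be repeated by several predictors at once.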
................................................................................................................
RG P33: Post-transcriptional regulation mediated by the interplay between RNA-binding proteins and miRNAs

Atefeh Lafzi1, Saber Hafezqorani1, Yesim Aydin Son1, Hilal Kazan1

1Middle East Technical University, Turkey

Post-transcriptional regulation (PTR) is mediated by the interactions of RNA-binding proteins (RBPs) and microRNAs (miRNAs) with cis-regulatory sites in mRNAs. Recent studies have found that each factor binds to hundreds of targets, and each mRNA is occupied by several factors. Also, RBPs and miRNAs have been shown to function in coordination with each other in many cases [1]. However, the majority of previous research has focused on the regulatory effect of individual factors. In this study, we leveraged the recent explosion of PTR-related data to map both RBP and miRNA sites on mRNAs and considered the effect of multiple factors at the same time. We mapped RBP sites by taking into account motifs identified with RNAcompete [2], peaks from gPARCLIP and CLIP data [3-4], and PhastCons conservation scores. To map miRNA sites, we combined PicTar and TargetScan predictions, Ago2 CLIP-identified peaks [4], and experimentally supported targets from the miRTarBase database [6]. We then analyzed the mapped sites concurrently to detect potential interactions. These interactions could be competitive when there is overlap between the sites, or cooperative when sites of two factors are located on each side of a stem (e.g. Pum and miR-221 [1]). For HuR and IGF2BP2, we showed that mRNAs with sites that are in competition with other factors' sites show a distinct stability profile compared to mRNAs with other sites. We also studied the effect of cooperative interactions of HuR with other factors upon HuR knockdown. Lastly, we explained the distinct activities of identical PUM1 sites [7] by considering the accessibility and cooperative/competitive interactions of these sites. These results show that modeling the effect of multiple factors and their interactions concurrently improves our understanding of PTR.

References

1. Kouwenhove MV, Kedde M, Agami R. MicroRNA regulation by RNA-binding proteins and its implications for cancer. Nat Rev Cancer 9 (2011), 644-656.

2. Ray D, Kazan H et al. A compendium of RNA-binding motifs for decoding gene regulation. Nature 499 (2013), 172-177.

3. Baltz AG, Munschauer M et al. The mRNA-bound proteome and its global occupancy. Mol Cell (2012) 46(5):674-690.

4. Anders G, Mackowiak SD et al. doriNA: a database of RNA interactions in post-transcriptional regulation. NAR (2012) D180-D186.

5. Cook KB, Kazan H et al. RBPDB: a database of RNA-binding specificities. Nucleic Acids Res (2011) D301-308.

6. Hsu SD, Lin FM et al. miRTarBase: a database curates experimentally validated miRNA-target interactions. NAR (2011) D163-169.

7. Zhao W, Pollack JL et al. Massively parallel functional annotation of 3'UTRs. Nat Biotechnology (2014) 32(4):387-391.
................................................................................................................
RG P34: Transcriptome-wide identification of cancer-specific splicing events across multiple tumors

Yihsuan S. Tsai1, Daniel Dominguez1, Shawn M. Gomez1, Zefeng Wang1

1University of North Carolina at Chapel Hill

Dysregulation of alternative splicing (AS) is one of the molecular hallmarks of cancer, with splicing alterations of numerous genes observed in cancer patients. However, identification of cancer-specific AS events is complicated by the substantial noise of tissue-specific splicing, which hinders the mechanistic understanding of splicing dysregulation. To determine a signature of cancer-specific splicing, we explored large-scale transcriptome sequencing data from the TCGA project to identify a core set of cancer-specific AS events that are significantly altered across multiple cancer types. Further analyses suggest that these cancer-specific AS events are (1) altered in cancer among different tissue types; (2) highly conserved among vertebrates; (3) more likely to maintain the protein reading frame than control AS events; (4) functionally related to cell cycle and cell adhesion; and (5) able to serve as new molecular biomarkers of cancer. Finally, we identified genes whose expression is closely associated with cancer-specific splicing and discovered that most of these genes are key regulators of the cell cycle. This suggests that the activity of splicing factors may be controlled in a cell cycle-dependent manner, and thus cell cycle proteins can indirectly affect splicing in tumor cells. In summary, our work identifies a common set of cancer-specific AS events dysregulated across different types of cancers and provides mechanistic insight into how splicing might be mis-regulated in cancers.
................................................................................................................
RG P35: Integrative and personalized QSAR analysis in cancer by kernelized Bayesian matrix factorization

Muhammad Ammad-Ud-Din1,2, Elisabeth Georgii1,2, Mehmet Gönen1,2, Tuomo Laitinen3, Olli Kallioniemi4,5, Krister Wennerberg4,5, Antti Poso3,6, Samuel Kaski1,2,5

1Helsinki Institute for Information Technology, 2Aalto University, 3University of Eastern Finland, 4Institute for Molecular Medicine Finland (FIMM), 5University of Helsinki, 6University Hospital Tübingen

We develop in silico models to find drugs with a potential for cancer treatment. Recent large-scale drug sensitivity measurement campaigns offer the opportunity to build and test models that predict responses for more than one hundred anti-cancer drugs against several human cancer cell lines. So far, these data have been used for searching dependencies between genomic features and drug responses, addressing the personalized medicine task of predicting sensitivity of a new cell line to an a priori fixed set of drugs. On the other hand, traditional quantitative structure-activity relationship (QSAR) approaches investigate small molecules in search of structural properties predictive of the biological activity of these molecules, against a single cell line. We extend this line of research in two directions: (1) an integrative QSAR approach, predicting the responses to new drugs for a panel of multiple known cancer cell lines simultaneously, and (2) a personalized QSAR approach, predicting the responses to new drugs for new cancer cell lines. To solve the modeling task, we apply a novel kernelized Bayesian matrix factorization method. For maximum applicability and predictive performance, the method optionally utilizes multiple side-information sources such as genomic features of cell lines and target information of drugs, in addition to chemical drug descriptors. In a case study on 116 anti-cancer drugs and 650 cell lines from the Wellcome Trust Sanger Institute, we studied the usefulness of the method in several relevant prediction scenarios, differing in the amount of available information, and analyzed the importance of various types of drug features for the response prediction. We showed that the use of multiple side-information sources for both drugs and cell lines simultaneously improved the prediction performance. In particular, combining chemical and structural drug properties, target information, and genomic features yielded more powerful drug response predictions than drug descriptors or targets alone. Furthermore, the method achieved high performance (RMSE=0.46, R2=0.78, Rp=0.89) in predicting missing drug responses, allowing us to reconstruct a global map of drug responses, which is then explored to assess treatment potential and treatment range of therapeutically interesting anti-cancer drugs.
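For readers unfamiliar with the three performance figures quoted above, the sketch below shows how such metrics are commonly computed on held-out responses. It assumes that R2 denotes the coefficient of determination and Rp the Pearson correlation; the abstract does not define them, and the toy values are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted responses."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical held-out drug responses and model predictions
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
print(rmse(measured, predicted), r_squared(measured, predicted), pearson(measured, predicted))
```

The toy example has a constant offset, which lowers RMSE-based scores while leaving the correlation untouched; this is why the three figures are usually reported together.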
................................................................................................................
RG P36: MicroRNA portal server for deep sequencing, expression profiling and mRNA targeting

Byungwook Lee1

1Korean Bioinformation Center

In the field of microRNA (miRNA) research, biogenesis and molecular function are two key subjects. Next-generation sequencing has become the principal technique for cataloging the miRNA repertoire and generating expression profiles in an unbiased manner. We developed a web-based database server that compiles publicly available deep sequencing miRNA data and implements several novel tools to facilitate exploration of these massive data. The miR-seq browser lets users examine short read alignments together with secondary structure and read count information in concurrent windows. Features such as sequence editing, sorting, ordering, import, and export of user data should be of great utility for studying iso-miRs, miRNA editing, and modifications. The miRNA-target relation is essential for understanding miRNA function. Coexpression analysis of miRNAs and target mRNAs is visualized in heat-map and network views, where users can investigate the inverse correlation of gene expression and target relations compiled from various databases of predicted and validated targets. By keeping datasets and analytic tools up-to-date, miRGator should continue to serve as an integrated resource for biogenesis and functional investigation of miRNAs.
................................................................................................................
RG P37: Detection of a fusion gene using soft-clipping reads in exome-sequencing data

Nam Jin Gu1, Ji Woong Kim2, Ryan W Kim1

1Korean Bioinformation Center, 2UT Southwestern Medical Center

Gene fusions play an important role in tumor formation and progression because they can produce abnormally active proteins. Using next generation sequencing (NGS) technologies, various fusion genes have been discovered in human cancers. Many computational methods for fusion gene discovery have been developed for RNA-Seq data, and a few for whole-genome sequencing (WGS) data, but Exome-Seq data has not yet been used for this purpose. We developed a new algorithm for detecting fusion genes from Exome-Seq data. In this approach, we first identify candidate fusion regions using soft-clipped reads, and split reads are then used to map the fusion partner. Finally, the frequency of the predicted fusion sequence is estimated from the read alignment. Fusions whose breakpoints lie in introns far from an exon boundary cannot be detected with Exome-Seq data. However, by aligning reads to a pseudo-reference built from the fusion sequence, we were able to predict fusion boundaries within exons that were caused by structural variations of the genome such as chromosomal translocation, deletion, and inversion. In addition, the allele frequency of fusions could be very useful for filtering them and for understanding the biology of the genomic aberration, especially for fusion genes in heterogeneous samples such as tumor tissue.
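The first step, finding candidate fusion regions from soft-clipped reads, can be sketched from the alignment CIGAR strings alone. This is an illustrative outline rather than the authors' algorithm: the function names and the `min_clip` threshold of 20 bases are assumptions, and positions follow the SAM convention of a 1-based leftmost aligned coordinate.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference

def parse_cigar(cigar):
    """Split a CIGAR string such as '60M40S' into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]

def candidate_breakpoints(pos, cigar, min_clip=20):
    """Reference coordinates (1-based) next to a soft clip of at least `min_clip` bases."""
    # Hard clips (H) can sit outside soft clips, so drop them before inspecting the ends
    ops = [o for o in parse_cigar(cigar) if o[1] != "H"]
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    points = []
    if ops and ops[0][1] == "S" and ops[0][0] >= min_clip:
        points.append(pos)                # clip on the left: first aligned base
    if len(ops) > 1 and ops[-1][1] == "S" and ops[-1][0] >= min_clip:
        points.append(pos + ref_len - 1)  # clip on the right: last aligned base
    return points

print(candidate_breakpoints(1000, "60M40S"))  # right-clipped read
print(candidate_breakpoints(5000, "35S65M"))  # left-clipped read
print(candidate_breakpoints(100, "100M"))     # fully aligned read, no candidate
```

Positions supported by many clipped reads would then be the candidates whose clipped portions are realigned to find the fusion partner.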
................................................................................................................
RG P38: Topographical mapping of temporal gene activation

Daniel Morris1

1Loma Linda University

Despite limitations that prevent Pol II ChIP from rigorous quantitative application, the technique's ability to instantaneously measure Pol II density at any position along a gene provides unique transcriptional information. Analysis of acute IEG (immediate early gene) activation following α1a Adrenergic Receptor stimulation identified rate-limiting transcriptional events that control both the speed of mRNA maturation and the extent of transcriptional upregulation. ChIP results were validated by comparison to major transcriptional events assessable by microarray and PCR analysis of precursor and mature mRNA. My data show that initial transcriptional velocity on newly activated mammalian IEGs can be very high and approach maximal transcriptional rates. Despite the limited gene set, recently described mechanisms of co-transcriptional gene regulation were identified, including abrogation of promoter-proximal pausing, internal transcriptional blockade, and polyadenylation-associated pre-mRNA degradation. Importantly, although co-transcriptional regulatory mechanisms were present for most genes, increased recruitment of RNA Pol II was a substantial factor contributing to increased mRNA levels for all genes. As an example, regulation of Nr4a3 involved increased recruitment, delayed abrogation of a strong proximal pause, transcription of short and long isoforms, and an apparent decrease in transcriptional velocity for the long isoform due to the internal polyadenylation site. Significantly, delayed abrogation of promoter-proximal pausing implies that this mechanism functions to delay transcription, probably as a means of ensuring transcriptional fidelity.

Given the generality of multilevel gene regulation, integrated analysis made possible by Pol II ChIP appears necessary to distinguish causative from potential rate limiting mechanisms. Further, the temporal and sometimes transient nature of cotranscriptional mechanisms provides important caveats to nontemporal methods. For example, recent omic analysis of transcriptional velocity used only two time points and would not have detected reduced velocity due to transient pausing within genes. In addition, nontemporal approaches could not identify the delay induced by promoter proximal pausing. Our data suggests temporal omic approaches that can address deficiencies in non-temporal analysis.
................................................................................................................
RG P39: Consensus approach to identify consistent brain gene expression signatures for neurodegenerative diseases

Raymond Yan1, Jie Quan2, Li Xi2, Simon Xi2

1Boston University, 2Pfizer

A large number of neurodegenerative disease-related gene-expression datasets have been deposited in public repositories over the years. These datasets have tremendous value for integrative data mining to uncover dysregulated biological pathways in diseased brains. However, limited sample sizes, differences in brain collection procedures and sample characteristics, and other hidden confounding factors often decrease the confidence in changes observed in individual studies. Here, we carefully curated and reprocessed dozens of independent postmortem brain gene expression datasets of Alzheimer's and Parkinson's diseases from GEO and ArrayExpress. We used a consensus scoring scheme to identify hundreds of genes that are most consistently differentially expressed under disease conditions across studies. Many of these gene expression changes were also observed in mouse models of disease. They provided novel insights into AD- and PD-related pathways and biological processes. Comparisons of these AD and PD signature genes with genes located in GWAS risk loci further suggest potential causal genes in these regions.

................................................................................................................
RG P40: Identification of stage specific functional regulatory elements in Brugia malayi for Lymphatic Filariasis (LF) disease intervention

Rami Al-Ouran1, Elodie Ghedin2, Lonnie Welch1

1Ohio University, 2New York University

Gene transcription initiation and gene regulation are complex biological mechanisms that involve several molecular components working in a precise manner. Transcription factors (TFs) and transcription factor binding sites (TFBSs) are functional elements that control gene regulation. Identifying TFBSs will assist in deciphering the regulation of transcription, and the sites themselves represent potential targets for disease prevention. Lymphatic filariasis (LF), also known as elephantiasis, is a neglected tropical disease that affects over 120 million people worldwide. Brugia malayi is one of the nematode (roundworm) parasites that cause LF. Each stage of the B. malayi lifecycle has a unique gene transcriptional signature, and third-stage filarial larvae (L3) are of particular interest as they represent the infective stage. In this study we aim to discover the putative TFBSs that are unique to genes overexpressed in the L3 phase of the B. malayi lifecycle. Identifying the B. malayi stage-specific regulatory elements could help in developing intervention strategies for the control of LF.
................................................................................................................
RG P41: Design principles of circadian systems

Nandita Damaraju1, Karthik Raman2

1Georgia Institute of Technology, 2Indian Institute of Technology Madras

Circadian rhythms are biological processes that have periods of approximately 24 hours. Circadian networks orchestrate a variety of processes in a diverse set of organisms, from simple cyanobacteria to more complex organisms such as mammals. This raises the interesting question of whether such diverse organisms share common design principles underlying their circadian networks. Previous attempts have derived design principles by studying the preexisting circadian regulatory networks in organisms. Such approaches do not exhaustively search for motifs that could potentially give rise to more robust and sustained oscillations. In this study, a thorough unbiased search is performed across all possible topologies to identify motifs that give rise to circadian oscillations. To identify such features, all two-node and three-node networks were enumerated and their interactions were dynamically modeled. Only a few networks capable of producing oscillations were observed. These favorable topologies were then analyzed to identify common motifs. The motifs obtained were consistent with the existing circadian networks of organisms, thereby successfully identifying the core features responsible for circadian oscillations. This study identifies the design principles of circadian networks in an unbiased fashion and answers questions about the minimum requirements to achieve circadian oscillations, while highlighting the key topological and dynamical features of such networks. The results obtained could be used to gain valuable insights into the circadian mechanisms across varied organisms and could potentially help build more complex systems with custom targeted behaviors.
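The enumeration step can be illustrated with a brief sketch. The encoding is an assumption, since the abstract does not say how interactions were represented: each ordered pair of nodes, self-loops included, is taken to carry activation (+1), repression (-1), or no interaction (0). Isomorphic and disconnected networks are not filtered out here.

```python
from itertools import product

def enumerate_topologies(n_nodes):
    """Yield every signed interaction matrix for n_nodes nodes.
    Entry [i][j] is the effect of node i on node j:
    -1 = repression, 0 = no interaction, +1 = activation."""
    n_edges = n_nodes * n_nodes  # self-loops included (assumption)
    for signs in product((-1, 0, 1), repeat=n_edges):
        yield [list(signs[r * n_nodes:(r + 1) * n_nodes]) for r in range(n_nodes)]

two_node_count = sum(1 for _ in enumerate_topologies(2))
three_node_count = sum(1 for _ in enumerate_topologies(3))
print(two_node_count, three_node_count)  # 3**4 and 3**9 candidate topologies
```

Under this encoding the search space stays small enough (3**9 = 19683 three-node matrices) for each topology to be simulated individually and screened for sustained oscillations.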
................................................................................................................
RG P42: A pooling-based approach to mapping genetic variants associated with DNA methylation

Irene Kaplow1, Sarah Mah2, Julia MacIsaac2, Michael Kobor2, Hunter Fraser1

1Stanford University, 2University of British Columbia

DNA methylation is an epigenetic modification that plays a key role in gene regulation. Previous studies have investigated its genetic basis by mapping genetic variants that are associated with DNA methylation at specific sites, but these have been limited to microarrays that cover less than 2% of the genome and cannot account for allele-specific methylation (ASM). Other studies have performed whole-genome bisulfite sequencing on a few individuals, but these lack statistical power to identify variants associated with methylation. We present a novel approach in which bisulfite-treated DNA from many individuals is sequenced together in a single pool, resulting in a truly genome-wide map of DNA methylation. Compared to methods that do not account for ASM, our approach increases statistical power to detect associations while sharply reducing cost, effort, and experimental variability. As a proof of concept, we generated deep sequencing data from the pooled DNA of 60 human cell lines and identified over 2000 genetic variants associated with DNA methylation. We found that these variants are enriched in tissue-specific transcription factor binding sites and can also be associated with chromatin accessibility and gene expression. In sum, our approach allows genome-wide mapping of genetic variants associated with DNA methylation in any species, without the need for individual-level genotype or methylation data.
................................................................................................................

RG P43: Integrative analysis of haplotype-resolved epigenomes across human tissues

Inkyung Jung1, Danny Leung1, Nisha Rajagopal1, Bing Ren1

1Ludwig Institute of Cancer Research

Allelic differences between the two sets of chromosomes can affect the propensity of inheritance in humans; however, the extent of such differences in the human genome has yet to be fully explored. Here, for the first time, we delineate allelic chromatin modifications and transcriptomes amongst a broad set of human tissues, enabled by a chromosome-spanning haplotype reconstruction strategy. The resulting haplotype-resolved epigenomic maps are the first of their kind and reveal extensive allelic biases in the transcription of human genes, which appear to be primarily driven by genetic variations. Furthermore, allelic resolution of chromatin states allows us to discover cis-regulatory relationships between genes and their control sequences. These maps also uncover intriguing characteristics of cis-regulatory elements and tissue-restricted activities of repetitive elements. The rich datasets described here will enhance our understanding of the mechanisms controlling tissue-specific gene expression programs.
................................................................................................................
RG P44: Comparison of D. melanogaster and C. elegans developmental stages, tissues, and cells by modENCODE RNA-Seq data

Jingyi (Jessica) Li1, Haiyan Huang2, Peter J. Bickel2, Steven Brenner2

1University of California, Los Angeles, 2University of California, Berkeley

We report a statistical study to discover transcriptome similarity of developmental stages from D. melanogaster and C. elegans using modENCODE RNA-seq data. We focus on "stage-associated genes" that capture specific transcriptional activities in each stage and use them to map pairwise stages within and between the two species by a hypergeometric test. Within each species, temporally adjacent stages exhibit high transcriptome similarity, as expected. Additionally, fly female adults and worm adults are mapped with fly and worm embryos, respectively, due to maternal gene expression. Between fly and worm, an unexpected strong collinearity is observed in the time course from early embryos to late larvae. Moreover, a second parallel pattern is found between fly prepupae through adults and worm late embryos through adults, consistent with the second large wave of cell proliferation and differentiation in the fly life cycle. The results indicate a partially duplicated developmental program in fly. Our results constitute the first comprehensive comparison between D. melanogaster and C. elegans developmental time courses and provide new insights into similarities in their development. We use an analogous approach to compare tissues and cells from fly and worm. Findings include strong transcriptome similarity of fly cell lines, clustering of fly adult tissues by origin regardless of sex and age, and clustering of worm tissues and dissected cells by developmental stage. Gene ontology analysis supports our results and gives a detailed functional annotation of different stages, tissues, and cells. Finally, we show that standard correlation analyses could not effectively detect the mappings found by our method.
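
The hypergeometric mapping of stages described above can be illustrated with a minimal sketch. It assumes each stage is summarized by a set of stage-associated genes drawn from a shared gene universe (for a cross-species comparison, a set of ortholog pairs would presumably play that role). The function name and arguments are hypothetical and are not taken from the authors' implementation.

```python
from math import comb


def hypergeom_overlap_pvalue(universe_size, n_genes_a, n_genes_b, n_shared):
    """Upper-tail hypergeometric p-value for the overlap of two gene sets.

    Returns P(X >= n_shared), where X is the number of genes from set A
    found among n_genes_b genes drawn without replacement from a universe
    of universe_size genes that contains n_genes_a genes of set A.
    All names here are illustrative assumptions.
    """
    largest_overlap = min(n_genes_a, n_genes_b)
    total = comb(universe_size, n_genes_b)
    tail = sum(
        comb(n_genes_a, k) * comb(universe_size - n_genes_a, n_genes_b - k)
        for k in range(n_shared, largest_overlap + 1)
    )
    return tail / total


# Example: two stages with 5 associated genes each, out of 10 genes in total,
# sharing all 5 genes. Only one of the 252 possible draws gives this overlap.
p_value = hypergeom_overlap_pvalue(10, 5, 5, 5)
```

A small p-value indicates that two stages share more stage-associated genes than expected by chance; in practice the p-values for all stage pairs would also need a multiple-testing correction.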
................................................................................................................
RG P45: Enhancer RNAs reveal widespread chromatin reorganization in prostate cancer cell lines

Ville Kytölä1, Annika Kovakka1

1University of Tampere

Chromatin conformation determines the gene regulatory programs and enables the diversity of cell types. Characterization of chromatin state across different cell lines has been a central focus of major projects such as ENCODE. These studies have revealed a number of insights into cellular programs in cell differentiation and disease related dysregulation. However, the degree of chromatin variation between individuals is less studied and the diversity of chromatin organization in cancer is not known.

In order to gain insight into diversity of chromatin organization in prostate cancer we characterized 11 prostate and prostate cancer cell lines under different culture conditions using Global Run-On sequencing (GRO-seq). This assay allowed us to identify the active enhancer areas from each cell line through detection of nascent transcription of enhancer RNA (eRNA) molecules. To this end, we developed a new computational algorithm to identify eRNA signals in a genome-wide manner by utilizing the unique bi-directional pattern of nascent transcription. Identified eRNA sites show high consistency with areas of open chromatin from DNase I sequencing (DNase-seq) data as over 80% of the sites are covered by open chromatin signals in LNCaP cells.

We present a comparison of eRNA signals across prostate cancer cell lines. Our analysis reveals extensive variation in enhancer activity between prostate cancer models. On average, approximately 3000 active eRNA loci were identified from each cell line, with the number of detected sites varying from 1300 to 8000. Based on the detection results, the cell lines clustered according to androgen receptor (AR) status. When cultured in the presence of androgens, the number of identified eRNA sites in LNCaP and VCaP cells doubled in comparison to cells cultured without androgens. Overall, we identified nearly 25,000 distinct loci, of which only 33% were shared between more than two cell lines. We find a high number of loci for which eRNA activity correlates with the expression of nearby genes. Interestingly, from among these sites we were able to extract a subset of over a hundred very highly correlated (correlation > 0.9) connections, strongly indicating that these enhancer regions are contributing to the phenotypic diversity of prostate cancer. Taken together, these analyses highlight several new patterns of active enhancer regions that associate with specific prostate cancer subtypes. We are integrating eRNA activities with DNA methylation and transcriptome data from the same cell lines to uncover detailed regulatory programs in prostate cancer.
................................................................................................................
RG P46: Viral and retrotransposon sequences have shaped the preferred contexts for APOBEC-mediated mutagenesis

Jeffrey Chen1, Thomas MacCarthy1

1Stony Brook University

The AID/APOBEC gene family of cytidine deaminases consists of mutagenic enzymes that have evolved roles in innate immunity such as virus restriction and suppression of transposable elements, particularly in mammals. The ancestral APOBEC gene, Activation Induced Deaminase (AID) arose early in vertebrate evolution and plays a key adaptive immunity role (somatic hypermutation of the Immunoglobulin genes) in all jawed vertebrates. Biochemical and in vivo profiling of many APOBECs shows they cause C to T transitions and have evolved a variety of local DNA sequence context preferences. APOBEC3F, for example, has a preference for mutations at TTC sites whereas APOBEC3G has a preference for CCC. We assess the impact of each motif on a set of potential target genes to investigate how individual preferences have been shaped. By specifically examining the impact of replacement mutations we demonstrate that the known APOBEC preferences maximally impact retrotransposons while minimally impacting essential host genes. Furthermore, permutation analysis of several mammalian virus genomes shows these have evolved to avoid the impact of these mutations. Our results also suggest that APOBEC preferences impose restrictions on codon and amino acid usage in their target genes by, for example, heavily disfavoring amino acid pairs that must encode the TTC motif favored by APOBEC3F.
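
As a rough illustration of how the impact of a motif on a target gene might be scored, the sketch below counts the C to T changes at motif occurrences in an in-frame coding sequence and classifies each as a replacement or a synonymous mutation. It is a simplification under stated assumptions: only the coding strand is scanned, the standard genetic code is used, the mutated cytosine is taken to be the last base of the motif (the `target_offset` parameter is hypothetical), and the sequence contains only A, C, G, and T. It does not reproduce the authors' analysis.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def motif_mutation_impact(cds, motif="TTC", target_offset=2):
    """Count (replacement, synonymous) C->T changes at motif hits in a CDS.

    cds is assumed to start in frame; target_offset is the assumed position
    of the deaminated C within the motif (an illustrative parameter).
    """
    if motif[target_offset] != "C":
        raise ValueError("target_offset must point at a C in the motif")
    cds = cds.upper()
    replacement = synonymous = 0
    for start in range(len(cds) - len(motif) + 1):
        if cds[start:start + len(motif)] != motif:
            continue
        pos = start + target_offset          # position of the mutated C
        codon_start = pos - pos % 3          # start of the codon containing it
        codon = cds[codon_start:codon_start + 3]
        if len(codon) < 3:                   # skip a trailing partial codon
            continue
        i = pos - codon_start
        mutated = codon[:i] + "T" + codon[i + 1:]
        if CODON_TABLE[mutated] == CODON_TABLE[codon]:
            synonymous += 1
        else:
            replacement += 1
    return replacement, synonymous


# The TTC hit in ATTCGA turns the codon CGA (Arg) into TGA (stop).
impact = motif_mutation_impact("ATTCGA", motif="TTC")
```

Comparing the replacement counts per motif across gene sets (for example retrotransposons versus essential host genes) would then give a crude measure of how strongly each motif preference affects each class of target.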
................................................................................................................

RG P47: ATAC-seq is predictive of chromatin state

Chuan-Sheng Foo1, Sarah Denny1, Jason Buenrostro1, William Greenleaf1, Anshul Kundaje1

1Stanford University

Distinct combinations of chromatin modifications (chromatin states) have been found to be associated with different types of active and repressed functional elements in the human genome such as promoters, enhancers, and transcribed elements. Previously, multivariate hidden Markov models (e.g. ChromHMM and Segway) have been used to learn combinatorial chromatin states and automatically annotate genomes. However, such methods typically require multiple high-quality chromatin mark datasets as input, thus limiting their applicability in practice. Chromatin ChIP-seq experiments are time-consuming and costly to perform, and more importantly, require large amounts of input material to obtain reliable signal. We (Greenleaf lab) recently developed an assay, ATAC-seq, that accurately profiles genome-wide chromatin accessibility, DNA binding protein footprints, and nucleosome positioning from low amounts of input material based on direct in vitro transposition of sequencing adaptors into native chromatin. We previously showed that loci with different chromatin states (learned from histone modification ChIP-seq datasets) showed distinct distributions of ATAC-seq insert sizes in aggregate.

In this work, we further this connection between chromatin architecture and chromatin states by showing that chromatin architecture is in fact predictive of chromatin state at individual loci. More concretely, we show that a machine learning model trained on various features derived solely from ATAC-seq data is able to accurately predict different classes of regulatory elements in active and repressed chromatin states in cell lines and primary cells. The success of our method suggests that different classes of regulatory elements are associated with distinct open chromatin and nucleosome positioning signatures. We explore the feasibility of cross-cell-line chromatin state prediction and determine the minimum sequencing depth required for good predictive performance by subsampling reads. In conclusion, when applied to ATAC-seq data, our method enables high quality genome-wide chromatin state annotations from low quantities of input material using a single assay, potentially enabling the in vivo dissection of chromatin states from (rare) sorted cell populations in primary tissue.

................................................................................................................
RG P48: Identifying genetic and environmental determinants of gene expression

Roger Pique-Regi1, Christopher Harvey1, Gregory Moyerbrailean1, Omar Davis1, Donovan Watza1, Xiaoquan Wen2, Francesca Luca1

1Wayne State University, 2University of Michigan, Ann Arbor

The effect of genetic variants on a molecular pathway, and ultimately on the individual's phenotype, is likely modulated by "environmental" factors. However, it is generally difficult to determine in which tissues and conditions genetic variants may have a functional impact. We denote the functional genetic variants that show cellular environment-specific effects as GxE expression quantitative trait loci (GxE-eQTLs). Achieving a better understanding of the mechanisms underlying GxE-eQTLs is a critical step in understanding the link between genotype and complex phenotypes.

To identify and characterize GxE-eQTLs we have established a new two-step and cost-effective experimental approach. In the first step, we identify global changes in gene expression using low-coverage sequencing of pools of highly multiplexed samples. In the second step, we select a subset of samples for deep sequencing and allele-specific analysis. For the first step, we generated 960 RNA-seq libraries in pools of 96 spanning 265 cellular environments across 5 cell-types (3 individuals), and 53 different treatments (including hormones, dietary components, environmental contaminants and metal ions). Relevant GO categories were enriched in the observed global gene expression changes (e.g., immune response for Dexamethasone, ion homeostasis for Zinc). We then analyzed allele specific expression (ASE) using a novel method (QuASAR) that allows for joint genotyping and allele specific analysis on RNA-seq data. Across 56 cellular environments we discovered 7738 instances of ASE (FDR<10%), corresponding to 6234 unique ASE genes. Using a Bayesian model across treatments within cell types, we observed that generally >95% ASE signals are shared and their effect sizes are highly concordant (posterior correlation coefficient 0.9). This is highly consistent with previous analysis of condition-specific eQTLs. Nevertheless, out of 112,564 tests we still estimate 2318 loci with a Bayes posterior probability supporting GxE interaction (1273 sites treatment-specific and 1045 sites control-specific, GxE-eQTLs). Genes that are differentially expressed also show a higher enrichment for condition-specific ASE. Our results constitute a first comprehensive catalog of GxE-eQTLs and we anticipate that it will contribute to the discovery and understanding of GxE interactions underlying complex traits.
................................................................................................................

RG P49: MyoD induces active and poised chromatin structures during transdifferentiation

Dinesh Manandhar1, Lingyun Song1, Ami Kabadi1, Charles Gersbach1, Raluca Gordan1, Greg Crawford1

1Duke University

Overexpression of transcription factor (TF) MyoD has been shown to transdifferentiate cells from non-myogenic lineages into cells with muscle-like expression and functional characteristics. However, expression studies show that the transdifferentiated cells have only some myogenic genes upregulated. Chromatin level reprogramming is also incomplete. In this work, we investigate the reasons behind incomplete MyoD-induced transdifferentiation of fibroblasts, including potential MyoD cofactors, DNA methylation, and posttranslational histone modifications. We analyzed high-throughput chromatin accessibility (DNase-seq) data, in vivo MyoD binding (ChIP-seq) data, and global gene expression (RNA-seq) data on primary skin fibroblast cells transduced with inducible MyoD, and compared against the data obtained from starting fibroblast cells and target myoblasts and myotubes. Our study of local chromatin changes genome-wide suggests that the chromatin state of transdifferentiated fibroblasts is intermediary between fibroblast and muscle chromatin states. Importantly, we observed a continuum of chromatin reprogramming in the MyoD-induced fibroblasts, indicating that complete reprogramming is achieved in only a small fraction of the genome. We also see evidence that during MyoD-induced transdifferentiation, chromatin closes more easily than it opens up. Using random forest and support vector machine classifiers, we show that various genetic and epigenetic features dictate the efficiency of chromatin level reprogramming. For instance, fibroblast DNase hypersensitive sites (DHSs) with higher GC content tend to stay open more than DHSs with low GC content. Our analysis of TF motifs and histone modification data suggests that the presence of certain TFs or histone modification marks at or around a genomic site can dictate the efficiency of chromatin reprogramming. Analysis of gene expression data shows that reprogramming of genes correlates well with reprogrammed chromatin state. 
Nonetheless, enriched levels of "poised" or "memory" state chromatin are also observed around such genes. This indicates that MyoD is capable of inducing both active and poised chromatin structures that are similar to primary muscle lineages, and that additional factors - such as Uhrf1, a chromatin remodeler under-expressed in transdifferentiated cells - can potentially help improve the reprogramming efficiency. Interestingly, we also found that although MyoD binding in non-DHSs opens up the chromatin at many genomic loci, a large fraction of MyoD-bound sites remain closed. Most of these closed sites lack MyoD-specific binding sites, which suggests that during transdifferentiation MyoD can also bind non-specifically or via protein cofactors.
................................................................................................................

RG P50: Quantification of DNA cleavage specificity in Hi-C experiments

Dario Meluzzi1, Gaurav Arya1

1University of California, San Diego

Hi-C experiments yield large numbers of DNA sequence read pairs, which are typically analyzed to deduce chromatin interactions across whole genomes. A key step in these experiments is the digestion of cross-linked chromatin with a restriction endonuclease. Although this enzyme is expected to cleave specifically at its recognition sequence, an unknown proportion of cleavages may occur non-specifically, resulting from the enzyme’s star activity or from random DNA breakage. Here we show that Hi-C data sets can be analyzed to quantify such non-specific cleavages. In particular, we describe a computational method to estimate the fractions of cleavages resulting from the putative alternative mechanisms. The method relies on expressing a measured local site distribution near genomic locations of aligned reads as a linear combination of conditional local site distributions. We validated this method using read pairs obtained from computer simulations of Hi-C experiments. We then analyzed a few published Hi-C data sets from murine pre-pro-B and pro-B cells, and found significant variation in cleavage patterns. Knowledge of these patterns may thus enable researchers to optimize Hi-C experimental conditions and fine-tune algorithms for Hi-C data analysis.
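
The idea of expressing a measured distribution as a linear combination of conditional distributions can be sketched for the simplest case of two cleavage mechanisms. The example assumes that the measured local site distribution is a mixture f * specific + (1 - f) * nonspecific over the same bins and estimates f by least squares. The two-component model, the estimator, and all names are illustrative assumptions; the method in the abstract may use more components and a different fitting procedure.

```python
def estimate_specific_fraction(measured, specific, nonspecific):
    """Least-squares estimate of the mixing fraction f in
    measured ~ f * specific + (1 - f) * nonspecific.

    Each argument is a sequence of probabilities over the same bins
    (for example, distance from read start to the nearest restriction site).
    The result is clipped to the interval [0, 1].
    """
    numerator = sum(
        (m - n) * (s - n) for m, s, n in zip(measured, specific, nonspecific)
    )
    denominator = sum((s - n) ** 2 for s, n in zip(specific, nonspecific))
    if denominator == 0:
        raise ValueError("the two conditional distributions are identical")
    return min(1.0, max(0.0, numerator / denominator))


# Toy distributions over three bins (made-up numbers, for illustration only).
specific = [0.7, 0.2, 0.1]
nonspecific = [0.2, 0.3, 0.5]
true_fraction = 0.8
measured = [
    true_fraction * s + (1 - true_fraction) * n
    for s, n in zip(specific, nonspecific)
]
estimated_fraction = estimate_specific_fraction(measured, specific, nonspecific)
```

With noise-free input the estimate recovers the mixing fraction used to build the toy data; with real read counts the conditional distributions would themselves have to be estimated, for instance from simulations.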
................................................................................................................

RG P51: Learning to predict microRNA-mRNA interactions from AGO CLIP-seq and CLASH data

Yuheng Lu1, Steve Lianoglou1, Christina Leslie1

1Memorial Sloan Kettering Cancer Center

MicroRNAs mediate post-transcriptional gene regulation by guiding the binding of RISC to cognate sites in mRNA transcripts and play critical roles in numerous biological processes. Over the last decade, researchers have mainly focused on canonical rules of microRNA targeting – namely, Watson-Crick pairing between the 5’ seed region of the microRNA and complementary sequences in mRNA targets – but have also reported non-canonical microRNA target sites, which are functional but lack perfect seed pairing. Recently developed high-throughput technologies, like AGO CLIP sequencing and CLASH (crosslinking, ligation, and sequencing of microRNA-RNA hybrids), have made it possible to directly identify a large number of microRNA target sites across the transcriptome. These data underscore the prevalence of non-canonical targets and conversely show that exact microRNA seed matches are not always AGO-bound, indicating that microRNA targeting is determined by factors beyond seed matches. Here we present a novel model for microRNA target prediction based on discriminative learning on transcriptome-wide AGO CLIP and CLASH profiles. As the CLASH protocol captures direct interactions between microRNAs and mRNAs by ligation, it provides a partially labeled microRNA-mRNA pairing dataset, along with the AGO binding sites identified by both AGO CLIP and CLASH. We train support vector machine (SVM) classifiers that model the microRNA-mRNA pairing duplexes and both the local and global context of AGO binding. The duplex and context models together outperform existing target prediction approaches when evaluated on AGO binding and microRNA perturbation expression data sets. Our flexible representation of microRNA-mRNA duplex structures also enables the classifier to predict both canonical and non-canonical pairings between microRNA and target sequences. Moreover, interpretation of the learned models has revealed novel duplex and context features of microRNA targeting. Therefore, this study gives a better characterization of more general microRNA targeting principles and improves target prediction by leveraging rich new high-throughput data with discriminative learning.

................................................................................................................

RG P52: Nencki Genomics Portal – a web-based platform for analysis of transcriptional co-regulation and function, starting from (epi-) genomic and expression data

Michal Dabrowski1, Izabella Krystkowiak1, Michal Petas1, Jaroslaw Lukow2, Norbert Dojer2, Bozena Kaminska1

1Nencki Institute of Experimental Biology, 2University of Warsaw

We present Nencki Genomics Portal (NGP), a website that integrates tools for analysis of gene transcriptional co-regulation and function. It is accessible to a broad biological community via a web browser at http://galaxy.nencki-genomics.org. The NGP tools are separated into four categories — genomic, expression, regulation and function — and are closely integrated, so that the output of one tool can be an input for another (or can be stored).

The genomic tools leverage the Nencki Genomics Database, which extends Ensembl funcgen. The portal provides genome-wide refinement of regulatory regions, including mapping them to genes, intersecting them with other types of regions and with TFBS motifs, and visualizing these data for specific genes. This makes public data from funcgen (and thus from ENCODE) immediately available to users alongside their own data.

The expression tools provide a typical workflow of analysis of transcriptomics data, from preprocessed gene expression data (genes x conditions) to identification of differentially expressed genes, clustering, and visualization.

The regulation section provides a specialized version of BNFinder that permits analysis of effects of interactions of several regulatory features on gene expression. This tool uses our mammalian model of cis-regulation, updated to take advantage of experimentally identified gene regulatory regions.

The function tools accept a (ranked) list of genes as inputs and provide a unified interface to established tools, including gProfiler, for analysis of functional annotations, such as Gene Ontology, KEGG, and REACTOME.

In addition to the web browser, the local NGD tools are also accessible programmatically, via the standard SOAP/WSDL interface (webservices.nencki-genomics.org), permitting integration into automated analysis pipelines. The middle layer of NGP is based on Taverna Server, which allows us to seamlessly connect to web services and command line tools, and to rapidly deploy new analysis workflows. The NGP architecture permits future tailoring of the portal to users' needs.

................................................................................................................

RG P53: A validated gene regulatory network and GWAS to identify early transcription factors in T-cell associated diseases

Mika Gustafsson1, Danuta Gawel1, Sandra Hellberg1, Aelita Konstantinell1, Daniel Eklund1, Jan Ernerudh1, Antonio Lentini1, Robert Liljenström1, Johan Mellergård1, Hui Wang2, Colm E. Nestor1, Huan Zhang1, Mikael Benson1

1Linköpings Universitet, 2MD Anderson Cancer Center

The identification of early regulators of disease is important for understanding disease mechanisms, as well as finding candidates for early diagnosis and treatment. Such regulators are difficult to identify because patients generally present when they are symptomatic, after early disease processes. Here, we present an analytical strategy to systematically identify early regulators by combining gene regulatory networks (GRNs) with GWAS. We hypothesized that early regulators of T-cell associated diseases could be found by defining upstream transcription factors (TFs) in T-cell differentiation. Time-series expression profiling identified upstream TFs of T-cell differentiation into Th1/Th2 subsets enriched for disease-associated SNPs identified by GWAS. We constructed a Th1/Th2 GRN based on integration of expression data, DNA methylation profiling, and sequence-based predictions using the LASSO algorithm. The GRN was validated by ChIP-seq and siRNA knockdowns. GATA3, MAF, and MYB were prioritized based on GWAS and the number of GRN-predicted targets. The disease relevance was supported by differential expression of the TFs and their targets in profiling data from six T-cell associated diseases. We tested whether the three TFs or their splice variants changed early in disease by exon profiling of two relapsing diseases, namely multiple sclerosis and seasonal allergic rhinitis. This showed differential expression of splice variants of the TFs during relapse-free asymptomatic stages. Potential targets of the splice variants were validated based on expression profiling and siRNA knockdowns. Those targets changed during symptomatic stages. Our results show that combining construction of GRNs with GWAS can be used to infer early regulators of disease.

................................................................................................................

RG P54: Weak base-pairing in both seed and 3’ regions reduce RNAi off-targets and enhance si/shRNA designs

Shuo Gu1, Yue Zhang1, Lan Jin1, Yong Huang1, Feijie Zhang1, Michael Bassik1, Martin Kampmann2, Mark Kay1

1Stanford University, 2University of California, San Francisco

The use of RNA interference (RNAi) is becoming routine in scientific discovery and treatment of human disease. However, its applications are hampered by unwanted effects, particularly off-targeting through miRNA-like pathways. Recent studies suggest that the efficacy of such off-targeting might be dependent on binding stability. Here, by testing shRNAs and siRNAs of various GC content in different guide strand segments with reporter assays, we establish that weak base-pairing in both seed and 3’ regions is required to achieve minimal off-targeting while maintaining the intended on-target activity. The reduced off-targeting was confirmed by RNA-Seq analyses from mouse liver RNAs expressing various anti-HCV shRNAs. Finally, our protocol was validated on a large scale by analyzing results of a genome-wide shRNA screen. Compared with previously established work, the new algorithm was more effective in reducing off-targeting without jeopardizing on-target potency. These studies provide new rules that should significantly improve siRNA/shRNA design.

