Category E - 'Functional Genomics'
Short Abstract: Allele-specific expression (ASE) is the imbalance in transcription between maternal and paternal alleles at a locus and can be probed in single individuals using massively parallel DNA sequencing technology. Assessing ASE within a single sample provides a static picture of the ASE, but the magnitude of ASE for a given transcript may vary between different biological conditions in an individual. Such condition-dependent ASE could indicate a genetic variation with a functional role in the phenotypic difference. We investigated ASE through RNA-sequencing of primary white blood cells from eight human individuals before and after the controlled induction of an inflammatory response, and detected condition-dependent and static ASE at 211 and 13021 variants, respectively. We developed a method, GeneiASE, to detect genes exhibiting static or condition-dependent ASE in single individuals. GeneiASE performed consistently over a range of read depths and ASE effect sizes, and did not require phasing of variants to estimate haplotypes. We observed condition-dependent ASE related to the inflammatory response in 19 genes, and static ASE in 1389 genes. Allele-specific expression was confirmed by validation of variants through real-time quantitative RT-PCR, with RNA-seq and RT-PCR ASE effect-size correlations r=0.67 and r=0.94 for static and condition-dependent ASE, respectively. GeneiASE is available at https://sourceforge.net/projects/geneiase/.
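As a rough illustration of how allelic imbalance at a single heterozygous variant can be scored from read counts, the sketch below runs an exact two-sided binomial test against a null of equal expression from both alleles. This is not the GeneiASE method, whose statistical model is not described in the abstract; the function name and the 0.5 null proportion are assumptions made for illustration.

```python
from math import comb


def binomial_two_sided_p(ref_reads: int, alt_reads: int, p_null: float = 0.5) -> float:
    """Exact two-sided binomial p-value for allelic imbalance at one variant.

    ref_reads / alt_reads are RNA-seq read counts supporting each allele.
    p_null is the expected reference-allele fraction without ASE (assumed 0.5).
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no reads, no evidence either way
    # Probability of every possible reference-allele count under the null.
    pmf = [comb(n, k) * p_null**k * (1 - p_null) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Sum the outcomes that are at least as unlikely as the observed one.
    total = sum(prob for prob in pmf if prob <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical counts: a balanced variant and a fully imbalanced one.
p_balanced = binomial_two_sided_p(5, 5)
p_imbalanced = binomial_two_sided_p(10, 0)
```

A test for condition-dependent ASE would additionally compare the allelic ratio before and after treatment, which this per-sample sketch does not attempt.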
Short Abstract: Expression Atlas (http://www.ebi.ac.uk/gxa) contains pre-analyzed RNA-sequencing and expression microarray data for querying gene expression across tissues, cell types, developmental stages and many other experimental conditions, in over 35 organisms including metazoans and plants. Queries can be either in a baseline context, e.g. find genes expressed in the macaque brain, or in a differential context, e.g. find genes that are up- or downregulated in response to auxin in Arabidopsis. All datasets are manually curated to a high standard by in-house curators and processed using standardised analysis methods. As of March 2016, Expression Atlas consists of 2723 datasets, including 220 RNA-sequencing experiments. All data in Expression Atlas are free to browse, download and reuse, and are selected from the ArrayExpress archive of functional genomics data at EMBL-EBI. Datasets in Expression Atlas can be searched and downloaded into R for further analysis, and we now provide a RESTful API for access to thousands of pre-analysed RNA-sequencing datasets.
ArrayExpress (http://www.ebi.ac.uk/arrayexpress) experiments are systematically imported from NCBI GEO or directly submitted by scientists through Annotare, our webform-based tool. Accession numbers are generated within 15 minutes of submission, pre-published data sets can be kept private, and submitter's identity can be hidden for double-blind review. As of March 2016, ArrayExpress consists of 64197 freely available datasets studying a wide variety of organisms. Access is possible via the website or programmatically.
Experiments in both databases are annotated to Experimental Factor Ontology (EFO), allowing efficient search via ontology-driven query expansion, and facilitating data integration.
Short Abstract: Background
Enhancers are important regulatory elements that mediate the temporal and spatial control of gene expression. It has been estimated that the human genome contains hundreds of thousands of enhancers. Considerable effort has been put into identifying enhancers from histone modifications or chromatin accessibility. However, it is still challenging to find the optimal combination of histone modifications to predict enhancer activity. Compared with histone modifications, enhancer-derived RNAs (eRNAs) have been demonstrated to be more indicative of enhancer activity.
Global nuclear run-on sequencing (GRO-seq) and precision nuclear run-on sequencing (PRO-seq) provide a sensitive way to identify and quantify eRNAs, making them ‘ideal’ techniques for detecting active enhancers. We analyzed all publicly available GRO/PRO-seq data and identified 358 to 4794 active enhancers across 15 cell lines, of which 80% are novel. The novel enhancers are marked by enrichment of H3K4me1 and H3K27ac signals. In most cell lines, enhancer-associated genes generally have higher expression abundance than those not related to enhancers (p < 0.0001), further supporting the reliability of the enhancer identification. Furthermore, we predicted ubiquitous and tissue-specific enhancers. We found that tissue-specific enhancers are closer to tissue-specific genes than ubiquitous enhancers are, indicating a regulatory role of enhancers in establishing tissue-specific gene expression.
We comprehensively characterized active enhancers across 15 human cell lines. The large-scale discovery of active enhancers provides a valuable resource for studying enhancer function.
Short Abstract: Genetic variation influences phenotype in part by altering the transcriptional regulatory architecture that controls mRNA expression levels. However, there remain fundamental questions about the relationship between genetic, transcriptomic, and organismal phenotypic variation, in part because there are few data sets encompassing transcript and phenotype measurements from the same genotypes. The Drosophila Genetic Reference Panel (DGRP) is a collection of 205 inbred Drosophila melanogaster lines harboring natural genetic variation derived from wild-caught flies. Genetic variation was previously mapped across these lines, and multiple research groups have used this collection to map the genetic architecture of various quantitative traits, including alcohol sensitivity, aggression, feeding behavior, and growth control. We have now performed total RNA-seq on adult whole flies across this collection to map genetic variation in the entire transcriptome, including previously unknown lncRNA and anti-sense transcripts. We are developing novel analysis pipelines to unify transcriptomes derived from distinct genomes, identify novel associations between transcript levels and phenotypes, and map eQTL and genetic correlation networks with unprecedented power and resolution. Ultimately this data set will serve as a powerful community resource complementing the existing inbred lines and genetic variation map for the DGRP.
Short Abstract: Genome-scale expression profiling has become a key tool of functional genomics, critically supporting progress in molecular biology and biomedical research in the post-genomic era. The deduction of gene function remains a major bottleneck in improving our understanding of living systems at the molecular level. Typical applications include the acceleration of unbiased genome-wide screens for candidate genes that are implicated in phenotypes and processes of interest by differential expression calling. The rapid improvement of next generation sequencing (NGS) platforms has triggered a wave of new findings based on whole transcriptome sequencing (RNA-Seq). NGS technology, however, has been shown to suffer from different sources of unwanted variation affecting interpretation of the results.
In the controlled setup of the SEQC benchmark study, we have recently shown that unwanted variation is largely due to library preparation. Appropriate tools for factor analysis like PEER or SVASeq can identify and remove confounding factors. With such corrections for site effects we could improve specificity without any loss of sensitivity. Going beyond comparisons in the original SEQC study, we here present results for a range of realistic effect strengths. Moreover, we demonstrate the benefits that can be gained by analysing novel results in the context of other experiments. In particular, use of a standardized reference sample much improves reliability across labs.
Short Abstract: GenomeSpace (http://www.genomespace.org) is a free, open-source, cloud-based environment that provides interoperability between best of breed computational tools, enabling scientists to easily combine tool capabilities without needing to program. It offers a common space to utilize, contribute, and share an ever-growing range of genomic analysis tools. GenomeSpace provides support for cloud-based data storage and analysis, multi-tool analytic workflows or “recipes”, automatic conversion of data between tools, and ease of connecting new tools to the environment.
We will describe new features including (1) development of support for use from Jupyter (iPython) notebooks; and (2) the newly released, crowd-sourced Recipe Resource. The Resource is a repository for GenomeSpace analysis recipes - short, standalone guides for completing complex bioinformatic analyses. Recipes use GenomeSpace-enabled tools and data resources to demonstrate integrative analysis, walking users step-by-step through the process of obtaining and analyzing data. The Recipe Resource provides the GenomeSpace community the ability to create, share and collaborate on analytic recipes of common value.
Short Abstract: Many studies have linked ambient fine particulate matter (PM) air pollution and diesel exhaust exposures to different cardiopulmonary diseases in the general population. However, the complex mechanisms underlying biodiesel (BD) induced adverse health effects are not yet fully elucidated. In the present study, we aimed to identify genes and pathways that may contribute to pulmonary toxicity induced by a 50% BD blend (BD50), using omics and bioinformatics approaches. To reveal the potential health effects of BD, global gene expression profiles in the lungs of C57BL6 female mice were analyzed 90 days after pharyngeal aspiration exposure to BD50 and ultra-low sulfur diesel (D100). A large number of genes that exhibited a significant change (at least 2-fold up or down, p- and q-value < 0.05) upon BD50 and D100 exposure were involved in inflammatory/immune response, oxidative stress, and response to DNA damage, which might contribute to PM-related cardiopulmonary diseases. While the majority of differentially expressed genes were similar to those found after exposure to D100, integrated analysis of gene signaling networks, pathways, toxic end points and disease-related pathologies revealed differential regulation of biological responses upon exposure to BD50. These analyses revealed novel genes, upstream regulators and pathways highly perturbed upon exposure to BD50, potentially underlying the inflammatory responses. These observations were also supported by the pathological, pulmonary function and biochemical outcomes seen in the lungs, thus highlighting the mechanisms of BD50-induced health effects. This study is the first to characterize global transcriptomic changes in mouse lungs upon exposure to BD50.
Short Abstract: Pseudogenes are fossil relatives of genes. Although pseudogene sequences are highly similar to gene sequences, accumulation of mutations has disabled pseudogenes from producing protein products.
In this study we first group primate pseudogenes and genes into unique families based on sequence similarity. Unlike previous studies, which only allow genes to be parents of pseudogenes, we allow pseudogenes to be parents of genes as well. Then, in each of the gene-pseudogene families, we identify the mutation signatures, including both conservative and non-conservative sites, and infer the mutation processes that have altered the sequences’ ability to code for protein. We propose a point-wise score and an alignment-wise score for measuring conservation, which take into account nucleobase composition and are not sensitive to the number of sequences aligned. With the identified mutation signatures, we infer the mutation processes using a constrained minimum spanning tree.
The study identifies key mutation signatures, which have disabled protein coding ability of genes, or enabled protein coding ability of pseudogenes. One interesting example is the olfactory receptor gene family, which has expanded greatly in primates, but a large fraction of the genes have become nonfunctional pseudogenes in human. Potentially, such analysis can be applied to gene-pseudogene families in cancer evolution, where some pseudogenes (non-coding in normal cell lines) have been found to code proteins in cancer cell lines. Key mutations may contribute to frame-shift events that make the pseudogenes produce protein products again.
Short Abstract: The number of sequenced genomes is growing exponentially, profoundly shifting the bottleneck from data generation to genome interpretation. Traits are often used to characterize and distinguish bacteria, and are likely a driving factor in microbial community composition, yet little is known about the traits of most microbes. We present Traitar, the microbial trait analyzer, a fully automated software package for deriving phenotypes from the genome sequence. Traitar accurately predicts 67 traits related to growth, proteolysis, oxygen requirement, morphology, carbon source utilization, antibiotic susceptibility, amino acid degradation, carboxylic acid use and enzymatic activity.
Traitar uses L1-regularized L2-loss support vector machines for phenotype assignments, trained on protein family annotations of a large number of characterized bacterial species, as well as on their ancestral protein family gains and losses. We demonstrate that Traitar can reliably phenotype bacteria even based on incomplete single-cell genomes and simulated draft genomes. We furthermore showcase its application by characterizing two uncultured phylotypes based on genomes recovered from the metagenomes of commercial biogas reactors, verifying and complementing a manual metabolic reconstruction.
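For readers unfamiliar with the classifier named above, the standard optimisation problem for an L1-regularized, L2-loss (squared hinge) linear support vector machine can be written as follows. The notation is an assumption for illustration: x_i stands for the protein family annotation vector of genome i, y_i for its phenotype label (+1 or -1), w for the learned weights, and C for the penalty parameter; the abstract does not state how Traitar sets any of these.

```latex
\min_{w} \; \lVert w \rVert_{1}
  + C \sum_{i=1}^{n} \bigl( \max(0,\; 1 - y_i \, w^{\top} x_i) \bigr)^{2}
```

The L1 penalty drives many entries of w to exactly zero, so each trait is predicted from a comparatively small set of protein families.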
Traitar enables microbiologists to quickly characterize the rapidly increasing number of bacterial genomes. It could lead to models of microbial interactions in a natural environment and inference of the conditions required to grow microbes in pure culture. Our phenotype prediction framework offers a path to understanding the variation in microbiomes. Traitar is available under the GPL 3.0 license at https://github.com/hzi-bifo/traitar.
Short Abstract: Dormancy in grapevines plays an important role in survival under stress conditions and can be induced by decreasing photoperiod or low temperature. To understand photoperiod signal transduction during dormancy induction, shoot tip transcriptome profiling (RNAseq) was conducted at 7 and 21 days using long (LD, 15h) and short (SD, 13h) day photoperiod treatments in Vitis riparia. After 7 days, 23 and 39 genes were up- and down-regulated (at least 2-fold), respectively, in SD relative to LD. After 21 days there were striking differences, with 1667 and 1855 genes up- and down-regulated, respectively. The greatest difference in gene expression at 7 days of photoperiod treatment was the up-regulation of heat shock genes, whereas after 21 days a greater number of abscisic acid signaling related genes were up-regulated, and cell cycle, flower development and auxin biosynthesis genes were down-regulated, in SD relative to LD. Enriched transcription factor families such as WRKY and NAC were up-regulated while BHLH and GRF were down-regulated in SD relative to LD. Gene Ontology analysis showed that photosynthesis, stress and cell wall, flower development related biological processes were enriched, respectively, for up- and down-regulated genes in SD relative to LD. Comparison of transcriptomes and published shoot tip proteomes showed correlations at the later time point with 26 and 41 abundant proteins in SD and LD, respectively. Therefore, the differential expression of hormone and transcription factor genes, together with protein abundance, highlights potential signal transduction resulting in changes in cell wall and photorespiratory proteins during growth cessation and dormancy induction.
Short Abstract: The presence of micropollutants within wastewater is a growing problem. Micropollutants are a group of substances, typically present at µg·L⁻¹ or lower concentrations, which are capable of having undesirable environmental and toxicological consequences. One example is ethinylestradiol (EE2), a synthetic estrogen that is present in a number of widely prescribed pharmaceuticals such as the combined oral contraceptive pill. EE2 is particularly recalcitrant due to the presence of an ethinyl group. While current treatment methods, such as removal of EE2 in activated sludge plants, are capable of reducing estrogen concentrations in effluent down to the ng·L⁻¹ order of magnitude, this is still sufficient to cause feminisation of fish species in receiving waters. As a consequence, regulation of the use of EE2 was considered in the 2012 European Union’s Water Framework Directive. The Newcastle University Frontiers in Engineering Biology (NUFEB) project’s synthetic biology team is exploring methods for enhancing the metabolism of EE2 in wastewater communities. Currently, there is insufficient knowledge of the pathways involved in estrogen metabolism to enable a synthetic biology approach. Thus, there is a need to develop a framework to support this. This poster presents our progress to date: acquisition and genome sequencing of 15 putative estrogen metabolisers, and the development of methods that integrate chemoinformatics and bioinformatics to identify proteins of interest within putative degraders. The developed pipeline has identified a number of promising targets which are currently being investigated in vivo.
Short Abstract: Eukaryotic mRNA is subject to intensive tailing which critically influences mRNA stability and translatability. To investigate RNA tails at the genomic scale, we previously developed a method called TAIL-seq, but its low sensitivity precluded its application to minute amounts of biological materials. In this study, we report a new version of TAIL-seq (mRNA TAIL-seq or mTAIL-seq), which increases sequencing depth for mRNAs by ~1000 fold compared to the previous version. With the improved method, we here investigate the regulation of poly(A) tail in Drosophila oocytes and embryos. We find that maternal mRNAs are polyadenylated mainly during late oogenesis, prior to fertilization, and further modulated upon egg activation. Wispy, a noncanonical poly(A) polymerase, adenylates most maternal mRNAs with a few intriguing exceptions such as ribosomal protein transcripts. By comparing mTAIL-seq data to ribosome profiling data, we further find a strong coupling between poly(A) tail length and translational efficiency during egg activation. Our data suggest that regulation of poly(A) tail in oocytes shapes the translatomic landscape of embryos, thereby directing the onset of animal development. By virtue of the high sensitivity, low cost, technical robustness, and broad accessibility, mTAIL-seq will be a potent tool to improve our understanding of mRNA tailing.
Short Abstract: The cell composition of the hematopoietic tissue is highly heterogeneous, and characterizing the phenotype and functionality of all the immune cell types has a high impact on the treatment of diseases and on increasing life expectancy. Deconvolution is a promising approach to define gene expression levels and cell proportions of specific subsets from transcriptomic data of a heterogeneous sample. Several deconvolution algorithms have already been proposed; nevertheless, there is still no consensus on the optimal methodology or on which cell types are most suitable for this approach.
Here, transcriptomic and flow cytometry analysis was performed on PBMCs extracted from fresh blood samples of a cohort of young and old Singaporean individuals. We used basic linear regression and support vector machines for the deconvolution of transcriptomic data, and we validated the computed percentages of cell sub-populations by flow cytometry. The prediction of the proportions of NK cells and monocytes was satisfactory for both methods, while the results for B cells and T cells were not in good agreement.
In conclusion, we believe that deconvolving blood expression data is a promising approach even though it is still in its early stages. Further investigation is necessary in order for immunological research to greatly profit from this methodology.
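The linear-regression flavour of deconvolution mentioned in this abstract can be sketched as an ordinary least-squares fit of a mixed expression profile onto per-cell-type signature profiles. Everything below is an illustrative assumption rather than the authors' pipeline: the function names, the clipping of negative coefficients, the renormalisation to fractions, and the made-up numbers.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("signature matrix is rank deficient")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def deconvolve(signatures, mixture):
    """Estimate cell-type fractions from a bulk expression profile.

    signatures: one row per gene, one column per cell type.
    mixture:    one bulk expression value per gene.
    """
    n_types = len(signatures[0])
    # Normal equations of least squares: (S^T S) f = S^T m.
    sts = [[sum(row[i] * row[j] for row in signatures) for j in range(n_types)]
           for i in range(n_types)]
    stm = [sum(row[i] * m for row, m in zip(signatures, mixture))
           for i in range(n_types)]
    coefficients = solve_linear_system(sts, stm)
    # Assumption: negative estimates are set to zero, then rescaled to sum to 1.
    clipped = [max(c, 0.0) for c in coefficients]
    total = sum(clipped)
    if total == 0:
        raise ValueError("no positive coefficients to normalise")
    return [c / total for c in clipped]


# Made-up signatures for two cell types over four genes.
signatures = [[10.0, 1.0], [2.0, 8.0], [5.0, 5.0], [0.0, 6.0]]
# Bulk profile simulated as 30% of the first type plus 70% of the second.
mixture = [0.3 * a + 0.7 * b for a, b in signatures]
fractions = deconvolve(signatures, mixture)
```

In practice the estimated fractions would be compared with flow cytometry percentages for the same samples, as the abstract describes.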
Short Abstract: A wealth of legacy microarray gene expression experiments exists in private and public databases, relevant to a large range of research questions. While RNA-seq data are now relatively inexpensive to generate, discarding the existing corpus of gene expression experiments wastes a valuable opportunity and biological tissues, particularly in the case of rare diseases for which it is hard to obtain samples. We have developed a new method for cross-platform normalization, Training Distribution Matching (TDM), which is specifically focused on machine learning applications. TDM allows researchers to build machine learning models on legacy microarray data that can be applied to RNA-seq data, by transforming the RNA-seq data to have a similar distribution. Using a variety of data and methods, we show that the validation performance of machine-learning-derived models trained on array data and applied to TDM-transformed RNA-seq data is analogous to data generated on comparable platforms. We also provide a TDM package for the R programming language, which is available at: https://github.com/greenelab/TDM (doi:10.5281/zenodo.32852).
Short Abstract: Co-expression networks have been a useful tool for functional genomics, providing important clues about the cellular and biochemical mechanisms that are active in normal and disease processes. With the recent advances in single cell RNA-seq technology, it is now possible to zoom in to identify pathways at single cell resolution. We performed the first major analysis of single cell co-expression, sampling from 31 individual studies comprising 28799 samples from 163 cell-types. Data from 163 bulk RNA-seq experiments were used as an external control. Using neighbor voting in cross-validation, we found that single cell network connectivity is less likely to overlap with known gene ontology functions than co-expression derived from bulk RNA-seq (aggregate sc AUROC=0.68, aggregate bulk AUROC=0.73), which can be attributed to the preferential occurrence of expression drop-outs in single cell data. Strikingly, we discovered that functional variation within celltypes strongly resembles variation occurring across celltypes (rs~0.95). The lack of additional variation within celltypes suggests that current knowledge in GO cannot readily identify functions occurring in a celltype-specific manner, and that systematic mining of single cell data may be required to define novel pathways.
Short Abstract: MicroRNAs (miRNAs) regulate target gene expression at post-transcriptional level. Intense research has been conducted for miRNA identification and the target finding. However, much less is known about the transcriptional regulation of miRNA genes themselves. A fundamental step to study the transcriptional regulation of miRNAs is to identify the transcription start sites (TSSs) of the miRNA genes. Available RNA-Seq data is not useful to identify miRNA TSSs because primary miRNAs are quickly cleaved during maturation process. In this work, we studied the transcriptional regulation of intergenic miRNAs in C. elegans and mouse by finding TSSs of miRNAs using Cap-seq data, which contains reads from capped RNAs. The results showed that the transcriptional mechanisms of miRNAs are much more complicated than we previously thought. Besides canonical ways for producing mature miRNAs, alternative transcriptional mechanisms are identified.
In both species, we have identified a class of special pre-miRNAs whose 5' ends are capped, and are most probably generated directly by transcription. Moreover, we also distinguished another class of special pre-miRNAs that are 5'-capped but are also part of longer primary miRNAs. These pre-miRNAs may have been generated by different transcription mechanisms under different conditions. Multiple cap read peaks within miRNA clusters are located on both arms of pre-miRNAs in C. elegans. We speculated that the miRNAs in a cluster might either be transcribed independently or be re-capped during the microprocessor cleavage process. Our results reveal that miRNA transcription in metazoans is highly regulated and has multiple mechanisms.
Short Abstract: Recent advances in high-throughput small RNA (sRNA) sequencing and the ever-expanding transcriptome have opened incredible opportunities to better understand sRNA gene regulation and the biological roles of extracellular sRNAs. Extracellular sRNAs are transported in circulation by lipoproteins, namely low-density lipoproteins (LDL) and high-density lipoproteins (HDL). These sRNAs include microRNAs (miRNA), tRNA-derived sRNAs (tDR), sRNAs derived from small nuclear RNAs (sndRNA), sRNAs derived from ribosomal RNAs (rDR) and many other classes. To fully characterize lipoprotein sRNA transport and define its link to hepatic and biliary sRNA signatures, high-throughput sRNAseq was used to profile the entire sRNA transcriptome of apolipoprotein B-containing lipoproteins (apoB), HDL, bile, and liver. To analyze these large lipoprotein datasets, we developed a novel sRNAseq data analysis pipeline optimized for extracellular sRNA entitled, “Tools for Intensive Genome alignment of Extracellular small RNA (TIGER).” This pipeline has several advantages, including microRNA variant analyses, non-host genome alignments for microbiome and soil bacteria, data visualization packages and quantitative tools for tDRs, and optimization for lipoprotein extracellular sRNAs. We quantified the impact of Scavenger Receptor BI (SR-BI) deficiency on sRNA signatures in HDL, apoB lipoproteins, liver, and bile in mice. Results suggest that HDL and apoB lipoproteins transport both endogenous (mouse) host and exogenous bacterial sRNAs. Moreover, HDL, and not apoB, likely contributes to biliary sRNA. Most interestingly, SR-BI, HDL’s receptor, likely contributes to this process. Using TIGER, we were able to make critical discoveries in lipoprotein and biliary sRNA changes that would not be quantified by existing pipelines.
Short Abstract: Two novel approaches were recently suggested for genome-wide identification of protein aspects synthesized at a given time. Ribo-Seq is based on sequencing all the ribosome protected mRNA fragments in a cell, while PUNCH-P is based on mass-spectrometric analysis of only newly synthesized proteins. Here we describe the first Ribo-Seq/PUNCH-P comparison via the analysis of mammalian cells during the cell-cycle for detecting relevant differentially expressed genes between G1 and M phase. Our analyses suggest that the two approaches significantly overlap with each other. However, we demonstrate that there are biologically meaningful proteins/genes that can be detected to be post-transcriptionally regulated during the mammalian cell cycle only by each of the approaches, or their consolidation. Such gene sets are enriched with proteins known to be related to intra-cellular signalling pathways such as central cell cycle processes, central gene expression regulation processes, processes related to chromosome segregation, DNA damage, and replication, that are post-transcriptionally regulated during the mammalian cell cycle. Moreover, we show that combining the approaches better predicts steady state changes in protein abundance. The results reported here support the conjecture that for gaining a full post-transcriptional regulation picture one should integrate the two approaches.
Short Abstract: Combining biological data sources at the decision level is viewed as a natural way to manage heterogeneous data sources. We enumerate the advantage of modeling multi-heterogeneous biological data fusion in the gene prioritization task using statistic order and the ordered weighting averaging (OWA) operators.
As a result, we develop several kernel-based gene prioritization frameworks that combine multiple genomic data sources through the late integration. After transforming genomic data matrices into kernel matrices, for each kernel matrix, a one-class support vector machine (SVM) is exploited to perform gene prioritization for each phenotype under investigation. Then, the fusion methods defined by the use of OWA operator weights and ordered statistics weights are applied to aggregate performance rankings.
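The ordered weighted averaging step can be illustrated with a small sketch: an OWA operator sorts the inputs in descending order and then takes a weighted sum, so the weights attach to rank positions rather than to particular data sources. The gene names, per-source scores and weight vectors below are hypothetical, and treating each source's output as a score in [0, 1] is an assumption made for illustration.

```python
def owa(values, weights):
    """Ordered weighted average: weights apply to values sorted descending."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("OWA weights must sum to 1")
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))


def prioritize(scores_by_gene, weights):
    """Rank genes by the OWA aggregate of their per-source scores."""
    aggregated = {gene: owa(scores, weights) for gene, scores in scores_by_gene.items()}
    return sorted(aggregated, key=aggregated.get, reverse=True)


# Hypothetical scores from three data sources (one score per kernel).
scores_by_gene = {
    "geneA": [0.9, 0.2, 0.6],
    "geneB": [0.5, 0.5, 0.5],
}

optimistic = prioritize(scores_by_gene, [0.5, 0.3, 0.2])   # favours best sources
pessimistic = prioritize(scores_by_gene, [0.0, 0.0, 1.0])  # behaves like the minimum
```

Weights of (1, 0, 0) reduce the operator to the maximum, (0, 0, 1) to the minimum, and equal weights to the arithmetic mean, which is why the choice of weight vector controls how conservative the fused ranking is.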
We use several genomic data sources including STRING, literature, gene expression and several annotation-based data sources, as well as sequence-based protein features such as pairwise local sequence alignment and information extracted directly from position-specific frequency matrices (PSFM).
Finally, we assess our models by applying them to the disease data sets on which Endeavour has been benchmarked. We also evaluate our models using an unbiased and prospective benchmark developed on the basis of the OMIM associations. Experimental results on both the Endeavour benchmark and the unbiased benchmark demonstrate that our model can improve on the accuracy of the state-of-the-art gene prioritization model. Moreover, we have already used this model in the second Critical Assessment of Functional Annotation (CAFA2) to predict Human Phenotype Ontology (HPO) terms, where we obtained promising results among the participating groups.
Short Abstract: Advancements in sequencing technologies have highlighted the role of alternative splicing in increasing transcriptome complexity. They have also demonstrated the prevalence of differential splicing patterns among tissues, cell types, and developmental stages, motivating two streams of research, experimental and computational. The first involves RNA-Seq and CLIP-Seq to identify putative targets of splice factors that regulate splicing. The second involves probabilistic models that infer regulatory mechanisms and predict splicing outcome directly from genomic sequence. To date, these streams of research, while useful for the study of gene regulation, development, and disease, have mostly been disconnected.
Here, we propose a computational framework that extends Xiong et al., Bioinformatics 2011, for deriving predictive splicing code models so that it can leverage the vast amounts of experimental data for splice factor knockdowns. Exploiting recent advances in alternative splicing quantification offered by MAJIQ (Jorge et al., eLife 2016), we define a new target function for splicing code quality based on a mixture of posterior beta distributions. We explore several approaches for optimizing this target function efficiently given a large set of putative regulatory features and demonstrate the usefulness of this new modelling framework on several datasets involving RNA-Seq and separate knockdown experiments of CELF1, CELF2 and MBNL1 in mouse heart and knockdown of MBNL1, RBFOX1 and RBFOX2 in mouse brain. Overall, our novel approach offers a scalable solution to extend splicing code modelling for a new type of experimental data that is becoming increasingly common and yet lacks predictive modelling.
Short Abstract: Microbial engineering for bio-remediation requires understanding of the molecular characteristics of strains naturally suited for that purpose. A strain of the bacterium, Pseudomonas aeruginosa, isolated from fuel storage tanks, is able to use n-alkanes (e.g. decane, dodecane) as a sole carbon source, making it an ideal candidate for bio-remediation studies. To elucidate the molecular basis for this capability, we have performed ribosome profiling of this strain and the lab strain PAO1, when grown in alkane or glycerol, producing RPKM values in triplicate for each of more than 5000 genes. The RPKM value is a proxy for ribosome occupancy on a gene’s mRNA transcript, and thus the amount of active translation. To assess differences in ribosome occupancy, we take a novel statistical approach to model the RPKM values based on two observations: 1) RPKMs are positively valued and 2) genes in each strain/condition can be coarsely categorized as either occupied or unoccupied. We cluster the RPKM values for all genes using a mixture of gamma distributions, an approach yet to be applied to these data to the best of our knowledge. We can separate the unoccupied and occupied clusters by constraining the unoccupied mixture (where RPKMs are near zero) to be an exponential distribution, a special case of the gamma distribution. We fit our model in a Bayesian manner using Gibbs sampling and compare the size of the occupied fraction across strains and conditions. We also identify changes in ribosome occupancy of individual genes, uncovering molecular candidates important for alkane metabolism.
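The mixture described in this abstract can be written out explicitly. The form below is a sketch that assumes exactly two components, one unoccupied and one occupied; the symbols (π for the occupied fraction, λ for the exponential rate, α and β for the gamma shape and rate) are notation introduced here, and the abstract does not state the number of components or the priors used in the Gibbs sampler.

```latex
p(x \mid \pi, \lambda, \alpha, \beta)
  = (1 - \pi)\, \lambda e^{-\lambda x}
  + \pi\, \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x},
  \qquad x > 0
```

The first term is a gamma density with shape fixed at 1, which is the exponential special case the abstract refers to; comparing π across strains and conditions corresponds to comparing the size of the occupied fraction.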
Short Abstract: Functional interpretation of sets of genes/proteins obtained from high-throughput analyses helps reveal the biological meaning of the experimental results. Typically, such analysis takes the form of enrichment analysis, where the expert looks for the list of significantly over-represented Gene Ontology (GO) terms describing the obtained gene/protein signature.
However, the number of GO terms in the description is typically huge (over 100), and the expert needs to go through them manually, searching for similar processes that give stronger support to a hypothesis about a particular biological function related to the experiment. Sometimes it is easy to find similar processes simply by comparing their names, but in many cases only visual analysis of the structure of the GO graph can reveal interesting dependencies among them.
In this work we analyse how different GO term similarity measures can be used to cluster GO terms in order to help the expert understand the meaning of the experimental results by (i) searching for statistically significant clustered GO terms and (ii) reducing the number of GO terms presented to the expert by replacing a group of them with the most relevant one. We compare the clusters obtained with different semantic GO similarity measures, such as the Resnik, Jiang-Conrath, Lin, and G-SESAME measures, and with a pathway-based similarity measure.
Finally, we present a Galaxy Server tool for clustering GO terms using different similarity measures. An important part of the process is visualization, which helps in understanding the relations among the GO terms that functionally describe the experimental results.
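To make the information-content measures named above concrete, here is a minimal sketch of the Resnik and Lin similarities on a toy ontology. The term names, the parent relations, and the annotation probabilities are all hypothetical; a real analysis would load the GO graph and annotation frequencies from the actual resources, and this sketch is not the Galaxy tool described in the abstract.

```python
import math

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "signaling": ["root"],
    "lipid_metabolism": ["metabolism"],
    "sugar_metabolism": ["metabolism"],
}

# Hypothetical probability that a gene is annotated to a term or its descendants.
TERM_PROB = {
    "root": 1.0,
    "metabolism": 0.5,
    "signaling": 0.4,
    "lipid_metabolism": 0.1,
    "sugar_metabolism": 0.2,
}


def ancestors(term):
    """Return the term together with all of its ancestors in the graph."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """Information content: rarer terms are more informative."""
    return -math.log(TERM_PROB[term])


def resnik(a, b):
    """Resnik similarity: information content of the most informative common ancestor."""
    return max(information_content(t) for t in ancestors(a) & ancestors(b))


def lin(a, b):
    """Lin similarity: Resnik similarity normalised by the terms' own information content."""
    denom = information_content(a) + information_content(b)
    return 2.0 * resnik(a, b) / denom if denom > 0 else 0.0


if __name__ == "__main__":
    print(resnik("lipid_metabolism", "sugar_metabolism"))
    print(lin("lipid_metabolism", "sugar_metabolism"))
    print(lin("lipid_metabolism", "signaling"))
```

A pairwise matrix of such similarities could then be passed to any standard clustering routine to group related terms.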
Short Abstract: Psychiatric disorders are multigenic diseases with complex etiology contributing significantly to human morbidity and mortality. We compared molecular signatures across brain regions and disorders in the transcriptomes of postmortem human brain samples (Pritzker Neuropsychiatric Disorders Research Consortium). RNA sequencing was performed on tissue from the anterior cingulate cortex, dorsolateral prefrontal cortex, and nucleus accumbens from three groups of 24 patients each diagnosed with schizophrenia, bipolar disorder, or major depressive disorder, and from 24 control subjects. We validated 129 gene expression changes (FDR<0.05) with an independent cohort provided by the Stanley Neuropathology Consortium Integrative Database Array Collection. The most significant disease differences were observed in the anterior cingulate cortex of schizophrenia samples compared to controls, and the biochemical consequences of gene expression changes were assessed with untargeted metabolomic profiling. We also detected significant heterogeneity within schizophrenia and bipolar disorder samples, with a subset of patients exhibiting gene expression profiles significantly different from both the larger subset of schizophrenia and bipolar disorder patients and from the major depressive disorder and control cohorts. We find no evidence that this disease heterogeneity is linked to any known clinical phenotypes or technical variables; however, virtual microdissection reveals high correlation with transcripts previously identified as astrocyte-specific (p-value = 6.90×10⁻⁴) and endothelial-specific (p-value = 4.99×10⁻⁷), and depletion of neuron-specific transcripts (p-value = 3.77×10⁻⁹). We identify NPAS4, IGKC, and IGHG1 as a refined list of putative cell-type independent transcripts that may particularly warrant further investigation.
Short Abstract: Microbial population growth is used to determine effects of genetics, nutrients, and stress on cell survival. Here, we develop a robust modeling framework of microbial population growth through functional ANOVA with Gaussian process priors. The model represents experimental effects on growth, such as genetic modifications and stress, as time-dependent processes. Interactions between the effects are also modeled. With this model, we develop tests to identify significant effects and interactions affecting population growth. We apply our model to microbial growth data of a halophilic extremophile archaeon, and identify novel interactions between stress and genetics. We also apply our method to growth data corresponding to combinations of stressors, and identify significant interactions between conditions. Our model provides a general and robust framework to test functional roles of genomic content and environmental perturbation in controlling microbial population growth.
Short Abstract: Transcription factors tend to function as co-regulators to synergistically induce or inhibit expression of their targets in living organisms. In recent years, several genome-scale gene regulatory networks have been generated for multiple eukaryotic organisms including human, mouse, Drosophila and Arabidopsis. However, existing module-finding algorithms fail to capture transcription co-regulators in these large-scale networks because these algorithms typically search for groups of densely connected genes (nodes) rather than co-regulating genes. In this study, we developed a new algorithm, CoReg, to identify transcription co-regulators in large-scale gene regulatory networks. CoReg groups genes based on regularized similarities between target nodes and regulatory nodes in a network. We applied hierarchical clustering followed by dynamic tree cut to identify co-regulatory modules. To compare the performance of our method with existing module-finding algorithms (Walk Trap, Edge Betweenness and Label Propagation), we conducted network-rewiring simulations and found that CoReg outperformed existing algorithms in identifying true co-regulators in large networks. We tested our algorithm on E. coli, human and Arabidopsis gene regulatory networks. We found that connectivity is high between network modules of co-regulators and their targets. In particular, we found that in the Arabidopsis regulatory network, the expression correlation for more than 60% of the modules detected by CoReg is significantly higher than that of randomly constructed modules. This result strongly supports the conclusion that the co-regulators identified by CoReg are involved in gene co-regulation. Our study provides a new tool for dissecting the architecture of regulatory networks in multiple organisms.
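The contrast between dense connectivity and co-regulation can be shown with a small sketch. The abstract does not give CoReg's similarity formula, so the code below substitutes a plain Jaccard index over shared targets and shared regulators as an assumed stand-in. The function names and the toy edge list are hypothetical.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Index a directed regulator -> target edge list by outgoing and incoming links."""
    targets = defaultdict(set)
    regulators = defaultdict(set)
    for regulator, target in edges:
        targets[regulator].add(target)
        regulators[target].add(regulator)
    return dict(targets), dict(regulators)


def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coregulation_similarity(g1, g2, targets, regulators):
    """Average overlap of two genes' target sets and regulator sets.

    Assumed stand-in for a co-regulation score; not the published CoReg measure.
    """
    shared_targets = jaccard(targets.get(g1, set()), targets.get(g2, set()))
    shared_regulators = jaccard(regulators.get(g1, set()), regulators.get(g2, set()))
    return 0.5 * (shared_targets + shared_regulators)


if __name__ == "__main__":
    # TF1 and TF2 are not linked to each other but regulate the same targets.
    edges = [("TF1", "A"), ("TF1", "B"), ("TF2", "A"), ("TF2", "B"), ("TF3", "C")]
    targets, regulators = build_neighbors(edges)
    print(coregulation_similarity("TF1", "TF2", targets, regulators))
    print(coregulation_similarity("TF1", "TF3", targets, regulators))
```

TF1 and TF2 share no edge, so a density-based module finder has no reason to group them, while a neighbor-overlap score does. A matrix of such scores could then be fed to hierarchical clustering as the abstract describes.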
Short Abstract: The genomes of cultivated plants are often large and highly complex, owing to polyploidy and a strongly increased repeat content. The genome of hexaploid bread wheat (Triticum aestivum) is five times larger than the human genome. Comprised of 17 billion basepairs, and distributed over 21 chromosomes from 3 subgenomes, the wheat genome poses new challenges for data storage and bioinformatics processing. To speed up biological studies, for example on elucidating the resistance against pathogens in crops, comprehensive resources have to be compiled by integrating data from both genetics (e.g. genetic maps) and genomics (e.g. exome capture) research. Similarly, comparative and functional genomics in wheat require rapid, memory efficient and reproducible bioinformatics pipelines and make high demands on the hardware side. For our studies, we utilise cutting edge computing resources, such as the ultra-fast hardware-based data processing capabilities provided by the DRAGEN Bio-IT platform.
In this poster, we highlight our current developments for the integration of large genomics data sets, including a number of wheat genomes and thousands of wheat exomes, as well as genetic and physical maps.
Short Abstract: In human tissues, protein expression is only very weakly correlated with transcript expression due to post-transcriptional layers of regulation that are poorly understood, indicating a need for integrative models that bridge this divide. In this work we aim to develop a quantitative post-transcriptional model that predicts the experimentally measured expression of a clinically important protein, α-1-antitrypsin. α-1-antitrypsin is expressed in a variety of human tissues as a necessary protectant. Deficient levels of α-1-antitrypsin can cause COPD, adult liver disease and infantile cirrhosis. The parent gene, SERPINA1, is transcribed into a total of eleven transcript isoforms that differ only in their 5’ untranslated region (UTR) and therefore all produce the same peptide. However, the transcripts have distinct translational efficiencies, which determine the amount of peptide each transcript produces. Splicing in the 5’ UTR of SERPINA1 determines the inclusion of multiple upstream open reading frames (uORFs), short protein-coding sequences in the 5’ UTR known to inhibit translation, and alters local secondary structure, as modeled with transcript-specific SHAPE-MaP structure probing data. To model translational efficiency we first tested a biophysical model that relies on the uORF Kozak sequence strengths alone, but its predictions neither fit our quantitative data nor adequately predict α-1-antitrypsin protein levels in human primary tissue. Only a version of this model modified to include secondary structure around Kozak sequences accurately predicts SERPINA1 translational efficiencies and improves predictions in primary human tissues. Our data demonstrate the relative importance of RNA structure in a complex post-transcriptional program that ultimately decides α-1-antitrypsin output.
Short Abstract: The premature cleavage and polyadenylation (PCPA) process produces 3’ truncated mRNAs. In general, PCPA is suppressed by splicing events. U1 snRNP (U1 small nuclear ribonucleoprotein) is a well-characterized factor that regulates PCPA and is considered a therapeutic target for the treatment of viral infections and a genetic neuromuscular disease. However, the mechanisms of PCPA regulation are not well studied.
Detecting altered PCPA in target genes provides a novel perspective for characterizing gene expression and explaining possible mechanisms. To our knowledge, a change in PCPA can alter last-exon splicing and cleavage efficiency, which in turn affects the overall stability of the mRNA products and their proteins.
We are developing a PCPA knowledge base that lets researchers analyze their data to find altered PCPA and compare the results with existing examples. Here, we introduce PCPA computational pipelines with their functions and issues. Using the pipelines, we identify changes in PCPA in cancerous cell lines compared to the corresponding normal cell lines. This trend is also observed in many public breast cancer cell lines.
Short Abstract: The Gene Ontology (GO) is widely used to annotate gene function, and is considered a gold standard for genomics data annotation and interpretation. Given its prominence and value as a bioinformatics resource, it is important to understand how changes to GO and annotations over time impact users. While some assessments have been presented, routine assessment by end users remains challenging. Here we present GOTrack, a user-friendly web-based system for analyzing the historical stability of GO annotation data in multiple model organisms. The web site allows exploration of GO annotations at the gene and term level with multiple metrics and visualizations. The newest feature is a “historical enrichment” analysis where over-representation analysis can be run on a gene hit list of interest using GO annotations from any era, with measurements of stability over time provided. Using the data and algorithms presented we investigate the impact of changes to GO on the stability and robustness of gene set enrichment results in a corpus of hit lists. GOTrack will help the research community assess the impact of changes to GO on their research.
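As a rough illustration of what a "historical enrichment" run involves, the sketch below repeats a standard one-sided hypergeometric over-representation test for one GO term against annotation sets from different dates. It is not GOTrack's implementation; the function names, edition labels, and gene identifiers are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, hits, annotated_hits):
    """Upper-tail hypergeometric p-value for over-representation.

    population:     number of genes in the background
    annotated:      background genes annotated with the term
    hits:           size of the hit list
    annotated_hits: hit-list genes annotated with the term
    """
    upper = min(annotated, hits)
    tail = sum(comb(annotated, k) * comb(population - annotated, hits - k)
               for k in range(annotated_hits, upper + 1))
    return tail / comb(population, hits)


def enrichment_over_time(hit_list, background, annotations_by_edition):
    """Run the same test against each dated annotation set for a single term."""
    hits = set(hit_list) & set(background)
    results = {}
    for edition, term_genes in annotations_by_edition.items():
        annotated = set(term_genes) & set(background)
        results[edition] = enrichment_p_value(
            len(background), len(annotated), len(hits), len(annotated & hits))
    return results


if __name__ == "__main__":
    background = {f"g{i}" for i in range(10)}
    hit_list = {"g0", "g1"}
    # Hypothetical annotation sets for one term at two points in time.
    editions = {
        "2010": {"g0", "g5", "g6", "g7", "g8"},
        "2016": {"g0", "g1", "g5", "g6", "g7"},
    }
    for edition, p in enrichment_over_time(hit_list, background, editions).items():
        print(edition, round(p, 4))
```

The spread of p-values across editions gives one simple view of how stable an enrichment result is as annotations change.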
Short Abstract: Autism spectrum disorder (ASD) is a range of neurodevelopmental disabilities with a strong genetic basis. Yet, owing to extensive genetic heterogeneity, multiple modes of inheritance and limited study sizes, sequencing and quantitative genetics approaches have had limited success in characterizing the complex genetics of ASD. Currently, only about 65 genes (out of an estimated several hundred) are known based on strong genetic evidence. Hence, there is a critical need for complementary approaches to further characterize the genetic basis of ASD, enabling development of better screening and therapeutics. Here, we use a human brain-specific functional gene interaction network to present a genome-wide prediction of autism-associated genes, including hundreds of candidate genes for which there is minimal or no prior genetic evidence. Our approach is validated in an independent case-control sequencing study of approximately 2,500 families. We further demonstrate that the large set of ASD genes converges on a smaller number of key cellular pathways and specific developmental stages of the brain. Specifically, integration with spatiotemporal transcriptome expression data implicates early fetal and midfetal stages of the developing human brain in ASD etiology. Likewise, analysis of the connectivity of top autism genes in the brain-specific interaction network reveals the breadth of autism-associated functional modules, processes, and pathways in the brain. Finally, we identify likely pathogenic genes within the most frequent autism-associated copy-number-variants (CNVs) and propose genes and pathways that are likely mediators of autism across multiple CNVs. All the predictions, interactions, and functional insights are available to biomedical researchers at asd.princeton.edu.
Short Abstract: During the maternal-to-zygotic transition (MZT), the maternal transcriptome must be degraded and replaced by zygotic transcripts in a highly coordinated manner. As transcription is silenced in early stages, the transcriptome is regulated by cytoplasmic polyadenylation and RNA decay. We recently developed a new high-throughput sequencing technique, coined TAIL-seq, that profiles poly(A) length and 3′-end modifications. TAIL-seq enables us to monitor the dynamics of adenylation, deadenylation and nucleotide tagging at the 3′-end of maternal RNAs with single-nucleotide resolution at genomic scale. In this study, we apply TAIL-seq to early stage embryos of zebrafish to reveal how RNA regulatory mechanisms influence the transcriptome. Our data confirm that poly(A) tails of most mRNAs elongate shortly after fertilization until the middle of the MZT, while a smaller group of transcripts escapes this regulation. Surprisingly, most maternal RNAs acquire U tails in addition to poly(A) tails during the MZT. This phenomenon is observed in mouse and Xenopus laevis embryos as well. We identify Zcchc6 (also known as TUT7) and Zcchc11 (also known as TUT4) as the enzymes responsible for uridylation with further TAIL-seq experiments using morpholino-mediated knockdown. Further experiments suggest that Zcchc6 and Zcchc11 contribute to selective clearance of molecules with short poly(A) tails. In conclusion, Zcchc6 and Zcchc11 add uridine tails to short poly(A) tails in the vertebrate MZT, which is essential for the precise regulation of maternal factor expression in early embryogenesis.
Short Abstract: In an era when the number of organisms with sequenced genomes is ever increasing, direct estimation of functionally related genes from the genome is a fundamental challenge in bioinformatics. Despite great efforts to search for genomic features associated with the functional relationships of genes, these features are still poorly understood. In this study, we investigated such genomic features in the model plant Arabidopsis thaliana by using gene coexpression data in ATTED-II (http://atted.jp). Gene coexpression, a similarity of gene expression profiles, provides a genome-wide approximation of functional gene relationships at the level of transcriptional regulation. In a comparative analysis between the similarity of genomic features and the strength of gene coexpression, we found that genes having similar CDS lengths tend to be strongly coexpressed. We also demonstrated that the evolutionary age of structural genes is a key factor in dictating CDS length variation: older genes tend to have longer CDSs. These observations suggest that the basic structure of the gene coexpression network is strongly dominated by gene age; namely, it is a multilayered structure consisting of several gene modules created at each evolutionary step. Gene ontology enrichment analysis revealed that evolutionarily older genes play a central role in cellular activities, whereas younger genes are likely to participate in lineage- or species-specific phenotypic evolution. We anticipate our results to be a starting point for understanding the mechanisms underlying the evolution of cellular systems, and for developing a genome-based gene function prediction method.
Short Abstract: Background- In recent years, RNA-seq deep sequencing technology has emerged as a revolutionary tool to precisely profile transcriptomes in eukaryotic genomes. Beyond protein-coding RNAs, long non-coding RNAs (lncRNAs) have become recognized as gene regulators as well as prognostic markers in cancer. In this study, we initiated an in-silico analysis of co-expression of lncRNAs with epigenetically regulated genes (EReg) in TCGA glioma RNA-seq data.
Method- Open-source RNA-seq data sets were integrated to capture highly correlated biomolecules, in our case lncRNAs and ERegs. A set of 12382 differentially regulated lncRNA transcripts, identified across various cancers including glioma samples (372), was derived from Chinnaiyan et al. using the Tuxedo suite. A set of 809 EReg transcripts was obtained from Ceccarelli et al., who categorize 7 distinct glioma subtypes into IDH-mutant (Codel=69, G-CIMP-high=104, G-CIMP-low=8) and IDH-wildtype (Classic-like=54, LGm6-GBM=12, Mesenchymal-like=69, PA-like=15); their expression estimates, generated by MapSplice/RSEM, were downloaded from GDAC. Expression estimates of lncRNAs (FPKM) and EReg (estimated transcript fraction) were converted to transcripts per million (TPM). Following QC on this integrated data, 315 samples and 12991 molecules (lncRNA=12195 and EReg=796 transcripts) were analyzed with weighted correlation network analysis (WGCNA). Following detection of networks, the association of glioma subtypes with these co-expression networks was analyzed with ANOVA.
Results- Twenty-seven lncRNA-EReg gene networks were detected. Among these, 2 networks were significantly associated with 2 glioma subtypes at FDR < 0.05.
Conclusion- This study demonstrates the application of existing bioinformatics algorithms to analyze open-source RNA-seq data to capture gene-lncRNA networks with respect to sample subtypes.
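The unit conversion mentioned in the Method paragraph can be sketched as follows, using the standard relations TPM_i = FPKM_i / sum(FPKM) × 10^6 within a sample, and TPM = transcript fraction × 10^6. The function names are hypothetical, and the sketch assumes each list holds all transcripts quantified in one sample; the abstract does not state how the authors handled subsetting or renormalisation.

```python
def fpkm_to_tpm(fpkm_values):
    """Convert one sample's FPKM values to TPM by rescaling them to sum to one million."""
    total = sum(fpkm_values)
    if total <= 0:
        raise ValueError("FPKM values must have a positive sum")
    return [value / total * 1e6 for value in fpkm_values]


def fraction_to_tpm(transcript_fractions):
    """Convert estimated transcript fractions (summing to 1 per sample) to TPM."""
    return [fraction * 1e6 for fraction in transcript_fractions]


if __name__ == "__main__":
    # Hypothetical values for a two-transcript sample.
    print(fpkm_to_tpm([1.0, 3.0]))
    print(fraction_to_tpm([0.25, 0.75]))
```

Putting both data sources on the TPM scale makes expression values comparable within a sample before the two transcript sets are merged for network analysis.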
Short Abstract: Mammalian genomes contain over 700 zinc finger proteins (ZFPs), but most have unknown functions. One example, PR Domain Containing 9 (PRDM9), regulates the location of meiotic homologous recombination by binding to DNA, causing epigenetic modifications and allowing for a double strand break (DSB); however, its targeting mechanism is not fully understood. Knockouts of PRDM9 in mice have been associated with infertility, which is suspected to be caused by mislocalization of DSBs during meiosis. Over 100 alleles of PRDM9 are known in mice, each of which contains a unique zinc finger array and therefore selects different DNA binding sites for PRDM9. To detect and quantify all binding sites of PRDM9 without interference from additional regulatory effects (protein abundance, chromatin accessibility), we used a novel in vitro assay called Affinity-Seq. We found over 39,000 significant binding sites for the PRDM9Dom2 allele in C57BL/6J mouse DNA. Quantification of the binding frequency at each sequence enabled estimation of binding affinity at each site in addition to standard nucleotide frequencies. To gauge the contribution of each nucleotide, we built a linear regression model that includes additive and interactive effects on the binding preference of PRDM9. We identified a few core nucleotides required for binding and additional bases that quantitatively modify binding affinity. We tested this model by performing Affinity-Seq on the CAST/EiJ genome, which provides various natural polymorphisms for quantitative validation. Our work yields a detailed view of the targeting mechanism of PRDM9 in meiosis and can be broadly applied to any ZFP.