Accepted Posters
If you need assistance please contact submissions@iscb.org and provide your poster title or submission ID.
Category F - 'Genome Organization and Annotation'
Short Abstract: The scientific literature published and indexed in public databases is one of the fastest growing data types now available for computational analysis in the life sciences. For the popular model organism Drosophila, e.g., the total number of publications has grown from fewer than 50,000 ten years ago to over 85,000 today. Moreover, this growth is accelerating, and the number of new scientific publications indexed per year has now doubled. For instance, several thousand papers on Drosophila are now being added to the PubMed repository each year. Tools and databases that extract information from all these records automatically are thus of increasing relevance and potential impact. The question arises to what degree information computed from literature records can reliably complement analyses of high-throughput data from genome-scale molecular profiling or serve as a meaningful benchmark. We have examined this issue using a unique set of differential gene expression profiles collected under several measurement conditions, which have been analysed by information-theoretic metrics in a comparison of signal strength. By design, this set of measurements allows an assessment of the internal consistency of benchmark results when data sources are restricted to avoid circular arguments. Interestingly, we observed surprisingly poor outcomes for a number of seemingly obvious benchmarks relying on differential expression profile meta-analysis. In contrast, benchmarks based on data extracted from literature records by modern methods can be parameterized to yield good concordance with information-theoretic measures, corroborating the use of literature data as a meaningful complementary data source.
Short Abstract: RNA expression probing (microarrays, RNA-Seq) has become sufficiently inexpensive, enabling sampling of many experimental conditions in one project. A common objective is to find genomic expression features (EF) (genes, isoforms, exon inclusion, etc.) that behave similarly in expression. Historically, this type of analysis has employed one-way clustering. With increasing sample numbers, it is practical to assume that some EF interactions diminish across all samples from diverse conditions. In such situations, one-way clustering can be insufficient, likely missing those EF as a cluster. Consequently, biclustering methods have been introduced to find subsets of EF that behave similarly among subsets of conditions. We propose SCCA-BC, a biclustering method based on resampling, partitioning, and sparse canonical correlation analysis (SCCA), which finds numerous types of biclusters and performs well in diverse settings. By resampling and randomly partitioning EF into two groups, SCCA searches for linear group relationships. Following that, we estimate inclusion of experimental conditions. Since SCCA-BC finds correlated biclusters, many existing models are special cases of ours and are also discoverable by SCCA-BC. Through simulation, we show that SCCA-BC performs comparably in common situations, and outperforms other methods in more difficult situations. We then applied SCCA-BC to modENCODE data, identifying genes related to development, consistent with other studies.
Short Abstract: Accurate modeling of genome architecture is crucial for identifying functional interactions among regulatory elements and their target promoters. We describe a method, Fit-Hi-C, that assigns statistical confidence estimates to intrachromosomal contacts by jointly modeling the random polymer looping effect and technical biases in Hi-C data. High-confidence contacts identified by Fit-Hi-C preferentially link expressed gene promoters to active enhancers in human embryonic stem cells (ESCs), capture 77% of enhancer-promoter interactions identified by ChIA-PET in mouse ESCs, confirm previously validated, cell line-specific contacts in mouse, and link loci with similar replication timing in human and mouse ESCs. We also observe that regions containing binding peaks of master pluripotency factors such as NANOG and POU5F1 are enriched in high-confidence contacts for human ESCs. Our results suggest that insulators and heterochromatin regions are hubs for high-confidence contacts, whereas transcription start sites, promoters and strong enhancers are involved in fewer but potentially more targeted contacts.
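As a rough illustration of what assigning a confidence estimate to a contact can look like, the sketch below computes a one-sided binomial p-value for an observed contact count. The binomial null, the function names, and the `prior_prob` input (the expected probability of a contact at that genomic distance) are all assumptions made for this example; the abstract does not specify the model, and this is not presented as the Fit-Hi-C procedure.

```python
import math


def binom_sf(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), summing terms in log space."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def contact_pvalue(observed, total_contacts, prior_prob):
    """P-value for seeing at least `observed` contacts between two loci.

    `prior_prob` is a hypothetical input: the null probability that any one
    contact falls on this locus pair (for example, from a distance-based fit).
    """
    return binom_sf(observed, total_contacts, prior_prob)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the function.
    print(contact_pvalue(observed=8, total_contacts=1000, prior_prob=0.002))
```

In practice the resulting p-values would still need multiple-testing correction across all tested locus pairs.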
Short Abstract: Global investigation of the 3′ extremity of mRNA (3′-terminome), despite its importance in gene regulation, has not been feasible due to technical challenges associated with homopolymeric sequences and the relative paucity of mRNA. Here we develop a method, TAIL-seq, to sequence the very end of mRNA molecules. TAIL-seq allows us to measure poly(A) tail length at the genomic scale. Poly(A) length correlates with mRNA half-life, but not with translational efficiency. Surprisingly, we discover widespread uridylation and guanylation downstream of the poly(A) tail. The U tails are generally attached to short poly(A) tails (<25 nt), while the G tails are found mainly on longer poly(A) tails (>40 nt), implicating their generic roles in mRNA stability control. TAIL-seq is a potent tool to dissect dynamic control of mRNA turnover and translational control, and to discover unforeseen features of RNA cleavage and tailing.
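To make the tail categories concrete, here is a minimal sketch that splits the 3′ end of a read into a poly(A) stretch and any non-A residues added after it. It assumes an error-free read given 5′ to 3′; the function names are hypothetical, and the actual TAIL-seq analysis of homopolymeric signal is not described in the abstract and is certainly more involved.

```python
def split_tail(three_prime_seq):
    """Return (poly_a_length, extra_tail) for a read's 3' end.

    The sequence is read 5'->3'. Trailing non-A residues are treated as an
    additional tail; the run of A immediately upstream is the poly(A) tail.
    """
    seq = three_prime_seq.upper().replace("T", "U")
    # Walk back over residues added after the poly(A) tail.
    i = len(seq)
    while i > 0 and seq[i - 1] != "A":
        i -= 1
    extra_tail = seq[i:]
    # Walk back over the poly(A) run itself.
    j = i
    while j > 0 and seq[j - 1] == "A":
        j -= 1
    return i - j, extra_tail


def tail_type(extra_tail):
    """Label the additional tail as 'U', 'G', 'mixed', or 'none'."""
    if not extra_tail:
        return "none"
    if set(extra_tail) == {"U"}:
        return "U"
    if set(extra_tail) == {"G"}:
        return "G"
    return "mixed"


if __name__ == "__main__":
    for read in ("GGC" + "A" * 20 + "UU", "GGC" + "A" * 45 + "G"):
        length, extra = split_tail(read)
        print(length, extra, tail_type(extra))
```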
Short Abstract: Analysis of alternative splicing is one of the most challenging applications of RNA-Seq. In spite of the large number of available tools, systematic benchmarking efforts showed that splice-junction false discovery rates, annotation usage and multi-mapping reads remain critical issues in spliced-read alignment. We aimed to dissect the problem by assessing splice-site mapping, detection and quantification performance under a variety of experimental designs, alignment strategies and annotation sets. Based on the results, we propose a pipeline to bridge the gap between the superior mapping accuracy of transcriptome-first alignment and the low false-positive rates of intron-centric approaches. Our strategy constrains the detection problem at the post-processing stage by coupling TopHat2 to a novel method, FineSplice, that identifies unreliable gapped alignments and filters out false-positive junctions via semi-supervised logistic regression. We further show how this strategy can benefit RNA-Seq analysis of alternative splicing and improve isoform-level quantification.
Short Abstract: Serum response factor (SRF) is a DNA-binding transcription factor that binds to the 10-base pair transcription factor binding site (TFBS) known as the CArG box. SRF is a master regulatory protein for several developmental processes and has been implicated in the pathogenesis of vascular diseases via disparate programs of gene expression, although convincing evidence to support this remains elusive as the full complement of SRF target genes is incompletely defined. Regulatory SNPs (rSNPs) reside primarily within the non-protein coding genome and are thought to disturb normal patterns of gene expression by altering DNA binding of transcription factors. There are increasing efforts to characterize the functional significance of rSNPs in human disease given the explosive rise in genome-wide association studies. To test the hypothesis that clinically relevant rSNPs residing in the CArG box alter gene expression, we developed a computational algorithm to predict SRF-DNA binding genome-wide by incorporating DNA sequence, epigenetic markers, and SRF binding data from ENCODE. We simultaneously performed ChIP-seq in human coronary artery smooth muscle cells (HCASMCs) and human umbilical vein endothelial cells (HUVECs) to biologically validate SRF binding in vitro in vascular phenotypes. To further discover functional SRF target elements, we integrated ChIP-seq data with RNA-Seq experiments wherein SRF was knocked down in HUVECs and HCASMCs. Finally, we merged this set of SRF cis-targets with clinically relevant SNPs from dbSNP that reside within the associated CArG box, thereby identifying clinically relevant rSNPs theoretically associated with SRF targets that will be functionally validated by wet lab experiments.
Short Abstract: We have conducted one of the first human microbiome studies in a well-described large prospective cohort incorporating taxonomic, metagenomic, and metatranscriptomic profiling at multiple body sites. Systematic comparison of the gut metagenome and metatranscriptome revealed that a substantial fraction of microbial transcripts were not differentially regulated relative to their genomic abundances. Of the remainder, consistently under-expressed pathways included sporulation and amino acid biosynthesis, while upregulated pathways included ribosome biogenesis and methanogenesis. Across subjects, metatranscriptional profiles were significantly more individualized than DNA-level functional profiles, indicative of subject-specific whole-community regulation. This work also identified a subset of abundant oral microbes that routinely survive transit to the gut, but with minimal transcriptional activity there. Together, these results provide a community-wide profile of biomolecular regulatory processes in the gut, as well as validating one of the first protocols appropriate for large-scale functional profiling of the microbiome in human populations.
Short Abstract: We will present two novel algorithms: NPEST (prediction of transcription start sites, TSSs) and cisExpress (genome-wide detection of specific DNA sequence motifs associated with an expression pattern of interest). Accurate identification of TSSs is an important genomics task, since the position of regulatory elements with respect to the TSS can have large effects on gene regulation, and the performance of promoter motif-finding methods depends on correct identification of TSSs. Our probabilistic approach expands recognition capabilities to multiple TSSs per locus and may be a useful tool for improving the understanding of alternative splicing mechanisms.
cisExpress is especially designed for use with large datasets, such as those generated by publicly accessible whole genome and transcriptome projects. cisExpress uses a task farming algorithm to exploit all available computational cores within a shared memory node. We demonstrate the robust nature and validity of the proposed method.
Short Abstract: Functional genomics is a field of molecular biology that attempts to make use of the vast wealth of data produced by genomic projects (such as genome sequencing projects) to describe gene (and protein) functions and interactions. Unlike genomics and proteomics, functional genomics focuses on the dynamic aspects of gene transcription, translation, and protein-protein interactions.
The Functional Genomics Data (FGED) Society, founded in 1999 as the MGED Society, is a registered international society that advocates for open access to genomic data sets and works towards providing concrete solutions to achieve this. FGED’s mission is to be a positive agent of change in the effective sharing and reproducibility of functional genomic data. FGED’s work on defining minimum information specifications for reporting data in functional genomics papers has already enabled large data sets to be used and reused to their greater potential in biological and medical research. The Minimum Information about Microarray Experiments (MIAME) has been cited often and has been influential in this effort (PMID:11726920). The FGED Society seeks to promote mechanisms to improve the reviewing process of functional genomics publications. We also work with other organizations to develop standards for biological research data quality, annotation and exchange. We actively develop methods to facilitate the creation and use of software tools that build on these standards and allow researchers to annotate and share their data easily. We promote scientific discovery that is driven by genome-wide and other biological research efforts in data integration and meta-analysis.
Short Abstract: Background. A fundamental question in neuroscience is how memories are stored and retrieved in the brain. Many psychiatric and neurodevelopmental disorders are associated with cognitive deficits. Characterizing the biological basis of memory storage and retrieval is therefore critical for understanding normal and abnormal brain function. Long-term memory formation requires transcription, translation and epigenetic processes that control gene expression. Thus, characterizing the transcriptional changes that occur after memory acquisition and retrieval is of broad interest and importance.
Results. We evaluated genome-wide transcriptional changes at several points after acquisition and retrieval of memory in the mouse hippocampus, observing the largest changes in gene expression 30 minutes after both. Genes downregulated after acquisition and retrieval represent different functions: chromatin assembly and RNA processing, respectively. Levels of histone 2A variant Hist2h2ab are reduced only following acquisition, a finding we confirmed using quantitative proteomics. On the other hand, splicing factor Rbfox1 and NMDA receptor-dependent microRNA miR-219 are only downregulated after retrieval, accompanied by an increase in protein levels of miR-219 target CAMKIIγ.
Conclusions. This is the first study to characterize coding and non-coding gene expression genome-wide at several time-points after memory acquisition and retrieval. We demonstrate downregulation of histone variants after memory acquisition, and splicing factors and microRNAs after retrieval. Our results provide mechanistic insights into the molecular basis of cognition by highlighting the differential involvement of epigenetic mechanisms, such as histone variants and post-transcriptional RNA regulation, after acquisition and retrieval of memory.
Short Abstract: Genome-wide binding preferences of the key components of eukaryotic preinitiation complex (PIC) have been recently measured at high resolution in Saccharomyces cerevisiae by Rhee and Pugh. However, the rules determining the PIC binding specificity remain poorly understood. In this study, we show that nonconsensus protein-DNA binding significantly influences PIC binding preferences. We estimate that such nonconsensus binding contributes statistically at least 2–3 kcal/mol (on average) of additional attractive free energy per protein per core-promoter region. The predicted attractive effect is particularly strong at repeated poly(dA:dT) and poly(dC:dG) tracts. Overall, the computed free-energy landscape of nonconsensus protein-DNA binding shows strong correlation with the measured genome-wide PIC occupancy. Remarkably, statistical PIC preferences of binding to both TFIID-dominated and SAGA-dominated genes correlate with the nonconsensus free-energy landscape, yet these two groups of genes are distinguishable based on the average free-energy profiles. We suggest that the predicted nonconsensus binding mechanism provides a genome-wide background for specific promoter elements, such as transcription-factor binding sites, TATA-like elements, and specific binding of the PIC components to nucleosomes. We also show that nonconsensus binding has genome-wide influence on transcriptional frequency.
Short Abstract: Cellular functions in microorganisms can be acquired via prevalent horizontal gene transfers. However, if all types of cellular functions can be easily transferred across the entire Bacteria and Archaea domains, it raises the fundamental question of what defines a microbial species and whether the speciation of a microorganism should be considered as a process of descending from an ancestor or combining genes from multiple sources. To shed light on this question, a new protein family database based on UniProt, UniFam, was developed and used to re-annotate ~12,000 genomes in GenBank.
UniFam constructed families on the whole protein level, obtained high family-wide sequence diversity, and derived rich and uniform annotation. Compared to the commonly used InterPro annotation pipeline, UniFam provided comprehensive coverage of protein sequence space, obviating the need to integrate multiple databases, and enabled fast genome annotation. The comprehensive genome annotation allowed metabolic pathway reconstruction with MetaCyc. UniFam is to be updated in sync with the UniProt database to take advantage of increasing sequence coverage in TrEMBL and improving annotation in Swiss-Prot.
Genomes' cellular functions were represented by metabolic pathways and their phylogenetic tree was inferred. Many cellular functions, such as antibiotic resistance, were found dispersed across the phylogenetic tree, suggesting horizontal gene transfer. However, many other cellular functions, including methanogenesis, showed high consistency with the phylogeny. We hypothesize that a microbial species can be defined by a set of cellular functions that are generally only acquired by vertical gene transfer.
Short Abstract: MicroRNAs (miRNAs) are a class of ~22 nt non-coding RNAs that potentially regulate over 60% of human protein-coding genes. MiRNA activity is highly specific, differing between species, cell types, developmental stages and environmental conditions, so the prediction of active miRNAs in a given sample is of significant interest. MiRNA target prediction in animals using only sequence data is difficult and suffers from high false positive rates. Here we present a novel computational approach for analyzing paired sequence and gene expression data, called MixMir. Our method incorporates 3' UTR background sequence similarity between transcripts, which is known to affect miRNA binding efficiency. We demonstrate that after accounting for k-mer sequence similarities in 3' UTRs, just a simple statistical linear model using motif presence/absence can discover active miRNAs in the sample. MixMir utilizes a fast software implementation for solving large systems of mixed linear model equations that is widely used in genome-wide association studies (GWAS). Essentially, we use 3' UTR sequence similarity in place of population cryptic relatedness in the GWAS problem. Compared to similar methods such as miREDUCE, Sylamer and cWords, we found that MixMir performed better at discovering true miRNA motifs in a mouse CD4 T-cell Dicer knockout line. MixMir also performs at least as well as miREDUCE on both protein expression and mRNA expression obtained from miRNA transfection experiments in human cell lines.
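The presence/absence part of such a model can be sketched as follows. For a single binary predictor, the ordinary least-squares slope equals the difference between the mean expression of transcripts that contain the motif and those that do not, which is what the hypothetical `motif_effect` function computes. This sketch deliberately omits the mixed-model term for 3' UTR sequence similarity that the abstract describes, and the gene names and values are invented for illustration.

```python
def motif_effect(utrs, expression, motif):
    """Difference in mean expression between UTRs with and without `motif`.

    `utrs` maps gene -> 3' UTR sequence, `expression` maps gene -> expression
    change. Returns None when one of the two groups is empty.
    """
    motif = motif.upper()
    present, absent = [], []
    for gene, seq in utrs.items():
        group = present if motif in seq.upper() else absent
        group.append(expression[gene])
    if not present or not absent:
        return None
    return sum(present) / len(present) - sum(absent) / len(absent)


if __name__ == "__main__":
    # Toy data: g1 and g3 carry the motif and are more strongly upregulated.
    utrs = {
        "g1": "AAGCACTTAA",
        "g2": "CCCCCCCC",
        "g3": "TTGCACTTGG",
        "g4": "GGGGAAAA",
    }
    expression = {"g1": 2.0, "g2": 0.5, "g3": 1.0, "g4": -0.5}
    print(motif_effect(utrs, expression, "GCACTT"))
```

Ranking all candidate k-mers by this effect (or by a test statistic derived from it) gives a naive baseline against which a similarity-aware model can be compared.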
Short Abstract: Sequence-based target prediction algorithms and anti-correlation profiles have been applied to predict miRNA targets using omics data, but this approach often leads to false positive predictions. Here, we applied joint profiling analysis of mRNA and miRNA expression levels to Tg6799 AD model mice at 4 and 8 months of age using a network topology-based method. We constructed gene regulatory networks and used the PageRank algorithm to predict significant interactions between miRNA and mRNA. In total, 8 cluster modules were predicted from transcriptome data for co-expression networks of AD pathology. In total, 54 miRNAs were identified as being differentially expressed in AD. AD networks were constructed by integrating mRNA and miRNA profiles. Among those, 50 significant miRNA-mRNA interactions were predicted by integrating sequence target prediction, expression analysis, and the PageRank algorithm. We identified a set of miRNA-mRNA interactions that were changed in the hippocampus of Tg6799 mice at both early- and late-symptomatic stages compared to littermate controls. Our results demonstrate AD-specific changes in the miRNA regulatory system as well as a relationship between the levels of miRNAs and those of their targets in the hippocampus of Tg6799 mice. These data further our understanding of the function and mechanism of various miRNAs and their target genes in the molecular pathology of AD.
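For readers unfamiliar with PageRank, the following is a minimal power-iteration sketch on a small directed network. The edge list, node names, and the damping factor of 0.85 are illustrative assumptions; the abstract does not say how the regulatory network was built or how scores were turned into significant interactions.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Return a dict node -> PageRank score for a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for node in nodes:
            if out_links[node]:
                share = damping * rank[node] / len(out_links[node])
                for target in out_links[node]:
                    new_rank[target] += share
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


if __name__ == "__main__":
    # Hypothetical miRNA -> target edges.
    edges = [("miR-1", "geneX"), ("miR-2", "geneX"), ("miR-3", "geneY")]
    for node, score in sorted(pagerank(edges).items(), key=lambda kv: -kv[1]):
        print(node, round(score, 4))
```

Nodes that receive many incoming edges from well-ranked nodes score highest, which is why the measure is often used to prioritise hubs in a regulatory network.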
Short Abstract: During embryogenesis, the heart is the first functional organ to form and this complex process is tightly orchestrated by a gene regulatory network (GRN) encompassing cardiac transcription factors (TFs). A slight change in the expression of any of these cardiac TFs leads to severe congenital heart defects. To date, little is known about the system-level effects of mutations in TFs embedded within the cardiac GRN and their contribution to disease.
We aim to determine the mechanism of action of disease-causing cardiac TFs, in order to unravel the effects of GRN perturbations in abnormal conditions. To this end, we have undertaken an unbiased approach to systematically interrogate, for the first time, the genome-wide targets of mutant cardiac TFs that occur in congenital heart disease using the DamID technology.
We discovered that mutated TFs regulate a specific set of targets that differs from that of the wild type. De novo DNA motif discovery on these new targets showed a change in binding site affinity. Hence, contrary to what was previously thought, we provide evidence that mutant cardiac TFs lead to heart malformations not because they have lost their function, but because they have gained a novel regulatory potential that leads to the misregulation of off-targets, which in turn disturbs the cardiac GRN. These observations were validated in vivo in zebrafish.
This work addresses a significant gap in the comprehensive knowledge of the players and associated cross-interactions that contribute to proper heart development and provides further insight into the mechanism of congenital heart disease at a genome-wide level.
Short Abstract: The glucocorticoid receptor (GR) exerts its main downstream effects via its function as a transcription factor. The novel data presented here investigate, on a genome-wide level, whether variants that alter the immediate transcriptional response to GR activation may alter the risk of suffering from mood and anxiety disorders.
We performed expression quantitative trait locus (eQTL) analysis, combining genome-wide data on GR-stimulated gene expression in peripheral blood cells of 160 male individuals with genome-wide SNP array data, to identify genetic variants that alter GR-induced mRNA induction in a cis window of ±1Mb.
We identified 3,820 eQTLs in which SNPs moderate the GR-induction of gene transcription. These SNPs were highly enriched among SNPs associated with major depressive disorder (MDD), as identified in data of the mega-analysis consortium for MDD with an N of over 9,000 cases and controls, but also among SNPs associated with schizophrenia, bipolar disorder and the variants conferring cross-disorder psychiatric risk (N=33,000 cases and 29,000 controls). The 282 SNPs showing an association with both GR-mediated transcription and MDD regulate 25 distinct transcripts. Pathway analysis suggests that these 25 transcripts are involved in neurite outgrowth/synaptic plasticity and ubiquitination. In mice, over 65% of these 25 transcripts were also regulated following GR agonist stimulation in either hippocampus or frontal cortex.
Genetic variants that moderate the first transcriptional response to stress are thus more likely to be associated with mood disorders, supporting the importance of molecular gene x environment interactions for the understanding of the pathophysiology of these disorders.
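The ±1Mb cis window mentioned above can be illustrated with a small filter that keeps only SNPs lying within the window around a transcript. Measuring the window from the transcript boundaries, the tuple layouts, and all names and coordinates below are assumptions for this sketch; the abstract does not state how the window was anchored.

```python
def in_cis_window(snp, gene, window=1_000_000):
    """True if the SNP lies within `window` bp of the gene on the same chromosome.

    `snp` is (chromosome, position); `gene` is (chromosome, start, end).
    """
    snp_chrom, position = snp
    gene_chrom, start, end = gene
    return snp_chrom == gene_chrom and start - window <= position <= end + window


def cis_snps(snps, gene, window=1_000_000):
    """Return the names of all SNPs inside the cis window of `gene`."""
    return [name for name, snp in snps.items() if in_cis_window(snp, gene, window)]


if __name__ == "__main__":
    # Hypothetical coordinates for illustration only.
    gene = ("chr1", 5_000_000, 5_010_000)
    snps = {
        "snp_a": ("chr1", 4_000_000),  # exactly 1 Mb upstream: inside
        "snp_b": ("chr1", 3_999_999),  # just outside the window
        "snp_c": ("chr2", 5_005_000),  # different chromosome
    }
    print(cis_snps(snps, gene))
```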
Short Abstract: To effectively utilize stem cells in regenerative medicine, the molecular mechanisms underlying their maintenance and proper differentiation must be thoroughly understood. To achieve this goal, we used the Drosophila male germline stem cell (GSC) lineage as a model adult stem cell system and systematically analyzed the transcriptome of normally developing germ cells at discrete but continuous differentiation stages. We first developed a strategy to isolate a single cyst of germ cells (encapsulated by two somatic cells) at each stage from wild-type testes of Drosophila. We then used these single germ cell cysts as the starting material for high-throughput mRNA sequencing (RNA-seq). Our data from germ cell cysts at every stage delineate a high-resolution transcriptional profile across the entire male germline lineage and lead to multiple novel discoveries. (1) We found that spermatogonia share a high degree of similarity in their overall transcriptomes, which explains the plasticity shown by their similar behavior during both dedifferentiation and differentiation. (2) We confirmed a dramatic transcriptome switch from mitotic spermatogonia to early meiotic spermatocytes, and found that many chromatin and transcriptional regulators show a bimodal expression pattern. (3) We inferred that differentiation at the spermatocyte stage has a distinct regulatory mechanism. Many potential target genes are implicated. (4) Finally, we analyzed dosage compensation and found little compensation of X-chromosomal genes in germline cysts at each differentiation stage. In summary, our single cyst-resolution, genome-wide transcriptional profile analyses provide an unprecedented data set for addressing many open questions in stem cell and germ cell biology.
Short Abstract: Genome-wide high-throughput/high-content microscopy screening has matured to become a key tool of functional genomics, being able to provide detailed intracellular information at the single cell level for all genes in an organism. Yeast is an ideal platform for such studies as it combines a simple, rapid life cycle and straightforward methods of genetic manipulation. Studying for the first time three processes at the same time, we performed a high-throughput/high-content microscopy genome-wide screen for genes controlling cell shape, microtubule organisation and cell cycle progression, allowing us to assign new functions to hundreds of genes and providing insight into their workings via classical machine learning tools.
Bayesian networks are graphical models that represent the conditional independence structure of a set of variables. To mine the large feature-based datasets generated by our multi-process screen further, we applied Bayesian network structure inference to a set of extracted cell shape and microtubule features, both at the single cell level and at the gene average level, to investigate the interplay between cell shape and microtubule control. This has allowed us to ask systems-level questions such as whether the link between microtubule length and microtubule number control in cells is solely a consequence of cell shape changes. Our results show the potential of Bayesian structure inference as a tool to study phenotypes in functional genomics, and the related outstanding issues involved in dealing with heterogeneous, noisy, non-normal data.
Short Abstract: Brain atlases depict the parcellation of neural tissue into spatially contiguous regions with distinct attributes. These “maps” traditionally reflect macroscopic anatomical or cytoarchitectural features and are only distantly connected to proteins and molecules expressed in those areas. The Allen Brain Atlas (ABA) is a collection of neuroanatomically-linked transcriptomic data collected with high spatial resolution at genome-scale. Determining the extent to which gene expression profiles provide “signatures” that differentiate the brain areas depicted in classical brain atlases begins to form an important bridge between molecular and anatomical/functional brain organization. Here we studied cortical expression profiles from the ABA adult mouse atlas using smoothed volumetric data registered to a 3D template. Several studies used these data to demonstrate differential and clustered gene expression across brain regions. In addition, differences in laminar cortical gene expression are widely observed due to cell type distinctions, but no studies to date have demonstrated the use of gene expression to classify neocortical mouse brain regions. We used feature selection methods coupled with support vector machines to learn relationships between normalized gene expression profiles and cortical region labels. We show that using very few genes, a sample can be classified as belonging to one of 18 cortical regions with greater than 70% accuracy. Classification performance for the reference atlas consistently outperforms accuracy for a set of random spatial parcellations of the cortex. Our results provide evidence that, while gene expression is relatively homogenous across the cortex, there are consistent transcriptomic differences that may underlie specialization of these regions.
Short Abstract: Human diseases of different etiologies, for example, different acute inflammatory diseases, likely share a common underlying mechanism. Investigating this commonality will help us not only better understand the diseases but also identify new molecular targets for therapy. In our study, we make use of genomic profile data together with clinical information and propose a statistical approach to detect the common gene set that is associated with multiple diseases. In clinical medicine, multiple clinical variables are often recorded for patients with a particular disease, measuring different aspects of disease severity or progress. Our proposed model, paired sparse multiple Canonical Correlation Analysis (psmCCA), utilizes these multiple clinical outcome variables. Using a lasso penalty, psmCCA identifies the sparse linear combination of genes that is highly correlated with the corresponding clinical variables. It maximizes the correlation within diseases and restricts the model such that the selected genes are common to all given diseases. Our simulation shows that psmCCA is able to select a common gene set for multiple diseases. We also demonstrate this new method on data sets of burn and trauma patients and reveal a common host immune response signature between the diseases.
Short Abstract: The majority of protein-coding genes in higher eukaryotes consist of multiple exons. Splicing patterns of these genes are determined by multiple cis-regulatory signals, which are involved in proper recognition of exon-intron boundaries by trans-splicing factors. Genetic variation in splicing signals may lead to serious differences in splicing patterns between different genotypes (allele-specific splicing). We use data from The 1000 Genomes Project to study single-nucleotide polymorphisms (SNPs) which disrupt canonical dinucleotides of splice sites in genomes of healthy individuals. We find that SNP-carrying splice sites represent a biased subset of all splice sites in many respects. Most of our analyses show that the effects of such disruption on the proteome are weaker than would be expected by chance. However, splice site-disrupting mutations represent only a small fraction of the SNPs which may affect splicing patterns of the genes, and predicting the effect of mutations outside of the consensus dinucleotide based only on genomic data is a challenge. To address these questions, we analyzed a panel of coupled genotype/individual transcriptomic data from multiple Drosophila melanogaster inbred lines. We were able to detect many cases of allele-specific splicing patterns associated with proximal SNPs. Previous studies on allele-specific splicing were primarily focused on changes in ratios of known annotated isoforms. Here we applied an annotation-free approach to detect and quantify splicing events. This allowed us to identify qualitative changes in exon-intron structure of genes, including de novo creation and/or activation of cryptic splicing signals.
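The notion of a SNP disrupting a canonical splice-site dinucleotide can be sketched as a simple check on an intron sequence. The sketch assumes the common GT...AG intron type, a sequence given on the transcribed strand, and a 0-based SNP offset within the intron; the function name and example sequence are hypothetical, and the abstract does not describe the authors' actual pipeline.

```python
def disrupted_site(intron_seq, offset, alt_base):
    """Report which canonical dinucleotide a SNP disrupts, if any.

    Returns "donor" if the leading GT is lost, "acceptor" if the trailing AG
    is lost, and None otherwise (including for introns that are not GT...AG).
    """
    seq = intron_seq.upper()
    if not 0 <= offset < len(seq):
        raise ValueError("offset lies outside the intron")
    if len(seq) < 4 or not (seq.startswith("GT") and seq.endswith("AG")):
        return None  # not a canonical GT...AG intron
    mutated = seq[:offset] + alt_base.upper() + seq[offset + 1:]
    if not mutated.startswith("GT"):
        return "donor"
    if not mutated.endswith("AG"):
        return "acceptor"
    return None


if __name__ == "__main__":
    intron = "GTAAGTCCAG"  # toy intron, far shorter than any real one
    print(disrupted_site(intron, 0, "A"))  # changes the donor G
    print(disrupted_site(intron, 9, "T"))  # changes the acceptor G
    print(disrupted_site(intron, 4, "C"))  # interior position
```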
Short Abstract: Genome features are an important component for understanding genomic diversity and gene behavior. With the advent of high-throughput sequencing and the explosion of genome feature discovery, the development of efficient and accurate computational methods, bioinformatics tools, and databases for discovery and annotation of these genome features becomes crucial. The need for flexible bioinformatics tools that can multitask and process these fast-growing datasets in a timely manner is as important as the process of discovering these features.
We present Genome Effects, a Bioinformatics Tool Multiplexer implemented using the C++ programming language. Genome Effects provides a platform for researchers and bioinformatics tool and database developers to select and run the appropriate tool for the task at hand with less hassle. The system can use either public databases (MGI, Ensembl, UCSC, NIH, etc.) or local annotations to extract on-demand annotations or genomes. Genome Effects can be used from the server via web browsers or run as a standalone application. A combination of C++ libraries, web services, and data caching makes it easy for standalone clients to have up-to-date information with little or no overhead.
The current version of Genome Effects includes the following tools:
Annotator (Variations), Validator (SNPs), flankingSequence extractor, fileTypeConverter, featureCoordinates extractor, and featureSequence extractor
Genomic features include transcripts, exons, introns, CDS, ORFs, mRNA, repeats, miRNA, feature junctions, SNPs, TSS, TTS, UTRs, and more.
With the supporting C++ libraries, it will be an easy task to add more functionality to the tool.
Short Abstract: Expression Atlas (http://www.ebi.ac.uk/gxa/) provides information on gene
expression patterns in different organisms, tissues, treatments and biological
conditions. Public gene expression data from ArrayExpress
(http://www.ebi.ac.uk/arrayexpress) is manually curated to a high standard, and
re-processed using in-house analysis pipelines. Expression Atlas consists of
two components with different focuses, enabling users to ask different kinds of
questions about gene expression. The Differential Atlas comprises comparative
microarray and RNA-seq datasets, enabling questions such as "which genes are
upregulated in HeLa cells treated with camptothecin compared with untreated
cells?". The Baseline Atlas is made up solely of RNA-seq datasets, and displays
absolute gene expression levels in healthy or untreated conditions, e.g.
tissues or cell lines. Its focus is on answering questions like "which genes
are expressed in a normal human kidney?". Expression Atlas can be queried
using Ensembl identifiers, gene ontology terms, and various other identifiers
and keywords. Users can search using gene sets, for example genes in a
particular Reactome pathway or matching a gene ontology term. Results of gene set
enrichment analysis (GSEA) for differential expression datasets will be available soon,
as will visualisation of Expression Atlas results on the Ensembl
genome browser. Recent updates include improvements to the Differential Atlas
experiment pages to sort results in a more biologically meaningful way, and the
addition of reports detailing quality assessment of each dataset.
Ontology-driven query expansion is soon to be added to the Expression Atlas
search, using the Experimental Factor Ontology (http://www.ebi.ac.uk/efo/), to
allow more powerful searching.
Short Abstract: A set of online informatics tools has been developed at the Drosophila RNAi Screening Center (DRSC) to help scientists identify genes, select RNAi reagents, analyze high-throughput datasets and validate results.
Mapping of orthologous genes allows researchers to develop hypotheses about gene function based on knowledge in other species. We developed a simple but effective tool, DIOPT (flyrnai.org/diopt), for identification of orthologs among 8 common model systems by integrating 10 existing approaches. DIOPT-DIST (flyrnai.org/diopt-dist) facilitates human disease-relevant studies in model organisms based on disease gene annotation from OMIM and GWAS.
RNAi is a common loss-of-function approach but RNAi results are relevant only if reagents are correctly annotated. UP-TORR (flyrnai.org/up-torr) keeps reagent-gene relationships up-to-date via daily updates of genome and reagent information. This allows users to compare and choose appropriate RNAi reagents from public resources and correctly interpret RNAi results for human, mouse, worm and Drosophila genes.
Analysis of high-throughput data increasingly relies on pathway and functional annotation. To supplement existing annotation tools as well as analyze network dynamics, we developed COMPLEAT (flyrnai.org/compleat), a resource and tool related to protein complexes, which are at the core of network reorganization. COMPLEAT facilitates data mining and visualization of high-throughput data for human, Drosophila and yeast.
Quantitative real-time PCR is used to evaluate RNA levels. We developed and experimentally validated a genome-wide qPCR primer resource for Drosophila, FlyPrimerBank (flyrnai.org/flyprimerbank). It allows researchers to identify and view primer sets, evaluation data, etc. and submit their own validated primers and data to further improve the resource.
Short Abstract: We present the Proteome Quality Index (PQI), a much-needed measure of proteome quality available from a comprehensive database of downloadable proteomes.
The advent of large-scale de novo sequencing technologies has led to the resolution of a vast number of genomes across all domains of life. However, the assembly of a new genome is still a challenging task and the quality of the results is not consistent across all projects. The “sequenced/species Tree Of Life” [1] demonstrates numerous disagreements between taxonomic and molecular classifications of species. As a consequence of this discrepancy, previously published work, especially in the field of comparative genomics, may have reached spurious conclusions, highlighting a need for quality assurance of proteomic data.
Coming up with a good measure of proteome quality is difficult, but with PQI we hope to seed discussion and development towards an adequate solution. PQI is a constantly updated web resource that currently includes over 3,200 annotated proteomes from multiple providers including all entries from NCBI and ENSEMBL. For each proteome we provide information about sequencing technology used, publication count, and numerous scoring metrics based on protein composition and phylogenetic placement.
[1] Fang, Hai, et al. "A daily-updated tree of (sequenced) life as a reference for genome research." Scientific reports 3 (2013).
Short Abstract: RNA-Seq and other sequencing techniques enable biologists to measure changes in the transcriptome that reflect regulated transcription and mRNA processing, notably alternative splicing, using computational tools. The computational pipelines are implemented in three parts to perform read mapping, event generation and hypothesis testing. The problem is that such pipelines are often not sufficiently flexible to satisfy the demands of the biologist. By relaxing the tasks, we show how to handle integration of complementary alternative splicing results from RNA-Seq and RASL-Seq, how to represent events, and utilize an existing statistical framework to test hypotheses.
We have applied this approach to test how RNA polymerase II elongation rate affects skipping and inclusion of alternatively spliced exons and introns. Existing tools such as MISO and MATS directly measure the proportion of included exons, but do not consider variation among replicates. DEXSeq accepts replicates to estimate dispersion and calculate the significance of exon inclusion-exclusion events, but exons need to be segmented into non-overlapping fragments.
We developed the DJ (DoenJang) method to permit users to generate splicing events and to test hypotheses utilizing existing statistical frameworks such as the edgeR package. DJ also allows for testing the discriminative power of intron/exon lengths and sequence motifs affecting alternative splice site choice. We will show analysis of RNA-Seq and RASL-Seq studies of RNA from human cells expressing slow and fast mutants of the RNA pol II large subunit.
We found that manipulating elongation rate causes widespread changes in alternative splicing but, surprisingly, slow and fast elongation often have similar effects.
Short Abstract: The Saccharomyces Genome Database (SGD; www.yeastgenome.org) is a comprehensive resource of curated molecular and genetic information on the genes and proteins of Saccharomyces cerevisiae. The emergence of large-scale, genome-wide technologies such as expression microarrays, RNA-seq and high-throughput sequencing has widened the scope of functional annotation beyond that of individual genes to entire genomes. As such, we have sequenced 25 S. cerevisiae strains, including both common laboratory and commercial strains, and are preparing these sequences to be incorporated into SGD. These new data allow us to compile and process sequence annotations in order to identify shared and divergent features between strains. In addition to new sequence data, we have collected published data from whole genome studies that employ a diverse set of techniques, including tiling arrays, cDNA clone libraries, TIF-seq, single and paired end RNA-seq, and serial analysis of gene expression (SAGE). These divergent techniques target different genomic regions, such as ncRNA, transcription start sites (TSS), transcripts, poly-A sites, antisense RNA, and Xrn1-sensitive unstable transcripts (XUTs), and add to our knowledge of biological processes in yeast. These data are being organized into a structured format, beginning with the compilation of various datasets to produce a complete yeast transcriptome. Here, we discuss computational approaches to process and to integrate these varying data with genomic sequences from different strains to form a more complete understanding of the complex processes of yeast biology.
Short Abstract: Empirical evidence of when, where, and under what circumstances a gene is expressed is invaluable for determining a gene's function. Thus, genome sequences are now commonly supplemented with transcriptomic data. Gene expression assays surveying expression behaviour can be combined to construct a 'gene atlas', a resource for functional annotation that describes the conditions under which genes are activated.
In addition to cataloguing 'present-day' gene function, atlas-derived expression profiles can be combined with gene family information to study the evolution of function. By contrasting the profiles of related sequences, it is possible to glean information about the nature and timing of functional innovations (for example, recruitment from various gene families into a venom gland). These innovations are often coincident with gene duplications, substantial changes to expression behaviour, or both. Phylogenetic trees for gene families can be annotated with this information, providing a diagram that succinctly describes gene functional diversification.
Our software, 'BranchOut', uses the aforementioned process to annotate events associated with functional innovation on a genomic scale by applying it to collections of gene families. Owing to the near comprehensive nature of transcriptomic technology, gene atlases are replete with incidentally collected but relatively unstudied expression information for known transcripts. Through our software, it is possible to identify patterns in functional evolution, both within and across organisms, that might not be apparent from the study of a single gene family alone. The present work shows examples of the use of our software on several recently published gene atlases.
Short Abstract: Stochastic gene expression, or gene expression “noise”, has been studied extensively during the last decade. It is now widely recognized that gene expression noise is a major source of phenotypic variation among isogenic cells grown in the same environment. Because noise can propagate through gene networks, it has been suggested that the transcriptional regulation of genes, as well as regulons, can be derived from the analysis of the transcriptional noise of genes. However, this method has not been extensively examined at a genome scale. In this project, we evaluate gene transcriptional noise under multiple culture conditions in the yeast S. cerevisiae. We sequenced single-cell transcriptomes of yeast cells grown under three different treatments (hypertonic condition, amino acid-depleted condition, normal osmolality condition) using the Illumina HiSeq platform. Transcript abundance was quantified for each transcript of all annotated protein-coding genes. We found that single cells with the same treatment can be clustered together based on their transcriptomes. In addition, we used the transcriptional noise to infer regulons and compared the results between different treatments. This research should lead to a better understanding of how transcriptional noise is related to gene regulation in yeast at a genome scale.
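The abstract does not state which noise statistic was used. One commonly used measure is the squared coefficient of variation of a gene's abundance across single cells, and the sketch below assumes that choice purely for illustration; the function name and the example values are hypothetical.

```python
from statistics import mean, pvariance


def noise_cv2(expression):
    """Squared coefficient of variation (variance / mean^2) across cells.

    `expression` holds one gene's abundance in each single cell.
    Assumption: CV^2 is used as the noise measure; the abstract does not
    specify the statistic.
    """
    mu = mean(expression)
    if mu == 0:
        return float("nan")  # noise is undefined for an unexpressed gene
    return pvariance(expression, mu) / (mu * mu)


if __name__ == "__main__":
    # Hypothetical abundances of one gene in four cells of one treatment.
    print(noise_cv2([12.0, 8.0, 15.0, 5.0]))
```

Computed per gene and per treatment, such values could then be compared between conditions or correlated between genes as one possible input to regulon inference.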
Short Abstract: RNA sequencing (RNA-Seq) enables the study of transcriptomes in an unprecedented way. One of its important applications is transcript abundance quantification. The most common assumption made by existing tools is that the RNA-Seq reads are sequenced uniformly across the transcriptome. However, in real data this assumption is often violated due to various biases introduced in sequencing library preparation steps, such as PCR amplification and reverse transcription. Most available bias correction methods are designed for a specific type of bias and therefore are not general enough. Building off of existing quantification methods, we present a novel method that is robust to sequencing biases. Unlike many other approaches addressing the issue of bias, our approach can handle multi-mapping reads, which is important for RNA-Seq transcript quantification. Furthermore, it does not rely on strong modeling assumptions. Using simulated data sets, we showed that our approach outperforms existing quantification tools significantly when strong sequencing biases are present.
Short Abstract: Gene coexpression is often used as part of methodologies to gain insight into expression
patterns and to predict gene function. We and others have hypothesized that making gene
networks context-specific will improve their utility for gene function prediction. For the
current work we are focusing on the role of tissue as contextual information. Previous
work has suggested that modifying protein interaction networks based on knowledge of the
expression patterns of the genes leads to performance improvements. However, to our
knowledge the potential for using coexpression patterns in which tissue context is used
has not been tested. Therefore, the focus of this project is to construct and characterize
tissue specific coexpression networks and study their unique and common features. To do
so we have been collecting expression profiling data from different types of human
tissues. Each dataset is used to build a coexpression network. Preliminary results
obtained from 46 datasets of four tissues identify tissue-specific and tissue-generic
features of coexpression networks. In the next stage individual networks will be merged
with other coexpression networks in their tissue to capture the robust tissue specific
patterns of the networks. Further characterization will focus on the impact of these
tissue specific coexpression patterns on gene function prediction.
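The construction of per-dataset coexpression networks and the comparison of tissue-specific versus tissue-generic features can be sketched as follows. This is a minimal sketch under stated assumptions: Pearson correlation with a fixed absolute cut-off is assumed as the coexpression criterion, and the function names, the threshold of 0.8, and the data layout are hypothetical rather than taken from the project.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no coexpression signal
    return sxy / sqrt(sxx * syy)


def coexpression_edges(profiles, threshold=0.8):
    """Build a network as a set of gene pairs with |r| >= threshold.

    `profiles` maps gene name -> list of expression values (one dataset).
    """
    edges = set()
    for g1, g2 in combinations(sorted(profiles), 2):
        if abs(pearson(profiles[g1], profiles[g2])) >= threshold:
            edges.add((g1, g2))
    return edges


def shared_and_specific(networks):
    """Split edges into those found in every tissue and those unique to one.

    `networks` maps tissue name -> edge set.
    """
    shared = set.intersection(*networks.values())
    specific = {}
    for tissue, edges in networks.items():
        others = set().union(
            *(e for t, e in networks.items() if t != tissue)
        )
        specific[tissue] = edges - others
    return shared, specific
```

Because gene pairs are always stored in sorted order, edge sets from different datasets can be compared directly with set operations.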
Short Abstract: The eukaryotic transcriptome consists of many RNA species including alternative splicing variants and long non-coding RNAs. To systematically identify cell type-specific coding and non-coding transcripts, we interrogated the transcriptome of Arabidopsis by generating over 1.4 billion 100 base pair, paired-end reads using total RNA. This data set provides comprehensive expression and splicing information for over 40,000 coding and non-coding genes in almost all cell types in the Arabidopsis root. We identified thousands of cell type-specific alternative splicing events and hundreds of novel lincRNAs. Co-expression analysis shows that some lincRNA transcripts form co-expression clusters with protein-coding genes, suggesting a potential cis-regulatory role of lincRNAs. To identify functional alternatively spliced transcripts, we implemented a computational pipeline called “Functional Analysis of Spliced Transcripts Suite (FASTS-Toolkit)”. With this we found that the majority of alternatively spliced genes express one dominant transcript in all cell types. Minor, alternatively spliced transcripts are usually expressed in a tissue-specific manner and encode shorter proteins. We also found evidence for intron definition playing a predominant role in creating alternatively spliced transcripts in plants. By comparing our transcriptome data with a published whole genome alignment of closely related species of Arabidopsis, we identified evolutionarily conserved intron retention events, and showed that tissue-specific expression of spliced isoforms may explain the origin of introns in higher eukaryotes. Finally, we show that over-expression of a minor isoform of a transcription factor induces precocious differentiation. Overall, our analysis demonstrates the rich expression pattern and potential functional roles of alternatively spliced transcripts and long non-coding RNAs.
Short Abstract: In eukaryotes, transcriptional regulation is usually mediated by the interactions of multiple transcription factors (TFs) with their respective specific cis-regulatory elements (CREs) in the so-called cis-regulatory modules (CRMs) in DNA. Although the knowledge of CRMs in a genome is crucial to elucidate gene regulatory networks and understand many important biological phenomena, little is known about the CRMs in most eukaryotic genomes. This is mainly due to the difficulty of characterizing CRMs by either computational predictions or traditional experimental methods. However, the large amount of TF binding location data produced by the recent wide adoption of chromatin immunoprecipitation coupled with microarray hybridization (ChIP-chip) or high-throughput sequencing (ChIP-seq) technologies has provided an unprecedented opportunity to identify CRMs in genomes. Nonetheless, effectively utilizing the large volumes of ChIP data to identify CREs and CRMs is a challenging task. Here we have developed a novel graph-theoretic algorithm, DePCRM, for genome-wide de novo prediction of CRMs using a large number of ChIP datasets. DePCRM predicts CRMs by identifying overrepresented combinatorial motif patterns in multiple ChIP datasets in an effective way. When applied to 170 ChIP datasets of 56 TFs from Drosophila melanogaster, DePCRM identified 804 overrepresented putative CRE motif combinations, and predicted a total of 123,136 putative CRMs in the genome. Moreover, the algorithm was able to recover 79% of the known CRMs located in the datasets.
Short Abstract: The current inability of “target-centric” approaches to efficiently characterize compounds’ mechanisms of action in a high-throughput manner inhibits progress in fields such as therapeutic development, toxicology screening, and biological probe discovery. We have developed an ultra high-throughput yeast chemical genomics assay that allows the prediction of a compound’s gene- and process-level targets across the entire genome, filling a critical gap in the way compounds are screened for bioactivity. This methodology was applied to screen more than 10,000 natural products and derivatives from the RIKEN natural product depository. From these data, we have identified compounds with novel bioactivity as well as the general cellular functions that tend to be disrupted by the compounds in this collection. We obtain high confidence process-level predictions for over 8% of the screened compounds. At the current level of throughput, we can screen 10,000 compounds and generate genome-wide target predictions within a few months’ time, demonstrating that we have developed an efficient, high-throughput method to assess genome-wide bioactivities present in a large compound collection.
Short Abstract: In mammalian genomes, a single gene can be alternatively spliced into multiple isoforms which greatly increase the functional diversity of the genome. In the human, more than 95% of multi-exon genes undergo alternative splicing. It is hard to computationally differentiate the functions for the splice isoforms of the same gene, because they are almost always annotated with the same functions and share similar sequences. In this paper, we developed a generic framework to identify the ‘responsible’ isoform(s) for each function that the gene carries out, and therefore predict functional assignment on the isoform level instead of on the gene level. Within this generic framework, we implemented and evaluated several related algorithms for isoform function prediction. We tested these algorithms through both computational evaluation and experimental validation. Through cross-validation, we demonstrated that our algorithm is effective in assigning functions to genes, especially the ones with multiple isoforms, and robust to gene expression levels and removal of homologous gene pairs. Using proteomics data from mammary tissue, we validated the predictions of ‘responsible’ isoforms for a group of genes. Furthermore, we validated the predicted isoform functional differences for the genes Cdkn2a and Anxa6 with protein structure modeling and experimental evidence. Our generic framework is the first to predict and differentiate functions for alternatively spliced isoforms, instead of genes, using genomic data. It is extendable to any base machine learner and other species with alternatively spliced isoforms, and shifts the current gene-centered function prediction to isoform-level predictions.
Short Abstract: Next generation sequencing (NGS) platforms offer a rapid way in which to sequence gene expression profiles of microbial pathogens, important for detection and pathogenesis studies. This has proven an essential set of tools, important with regard to understanding transmission of communicable pathogens. Harnessing NGS technologies along with development of novel methods to reconstruct metabolic pathways and integrate gene expression data with mass spectrometry (MS) proteomic/metabolite datasets, offers a powerful set of tools to understand the biology in relation to public health.
We have developed novel bioinformatics tools that combine the hypothesis-driven approach of comparative genomics with rapid NGS sequencing data, which are screened against existing metabolic pathway data to reconstruct metabolic pathways. We developed methods for identifying nonsynonymous SNPs, which are compared to protein mass shifts to detect the presence of protein variants between divergent and clonal bacteria implicated in disease outbreaks. We also developed methods to reconstruct metabolic pathways from RNA-Seq gene expression data and to compare and integrate both protein and small-molecule metabolite datasets from mass spectrometry studies. This enables direct correlation with specific environmental conditions/phenotypes, which may be strain/species specific and are also hypothesized from our comparative genomics studies.
Future development of novel bioinformatics methods that use NGS data to compare and reconstruct metabolic pathways, integrating heterogeneous datasets including gene expression data, proteomics/metabolite datasets and new epigenetic data, will enable a better understanding of pathogenesis and provide an important tool set in public health.
Short Abstract: The phenotype of a cell is determined by several factors including genetic background, (micro-) environment, and stochastic fluctuations. In our work we determine the contribution of stochasticity to phenotypic variation. This contribution can be quantified by the pairwise comparison of daughter cells that share the same genetic and environmental background.
Cells were tracked over time and features characterizing the cellular morphology were calculated, resulting in a collection of genealogical trees where each node represents one cell at one time point by a high-dimensional feature vector. Cells were clustered into phenotype classes according to their descriptors. For this purpose we extended the theory of hidden Markov models to hidden Factor Graph models (HFMs), in which the dependence structure of the hidden variables is given by a tree, and derived an efficient analog of the Baum-Welch algorithm for parameter learning.
Daughter cells are aligned according to their phenotype classes and the alignment distance acts as a distance score for the morphological variability of the cells. By comparing these variability scores of daughter cells from a knockdown and from a negative control movie, we can detect and quantify increased or decreased stochasticity in the knockdown relative to wild type. Our algorithm was applied to the Mitocheck database[1], a large compendium of time lapse movies in which single genes were knocked down in human cells by RNA interference. We were able to identify genes that increase or decrease stochasticity in the morphology of daughter cells.
[1] Neumann et al. Nature 464.7289: 721-727, 2010
Short Abstract: Human embryonic stem cells (hESCs) are derived from the inner cell mass of the blastocyst. Despite sharing the common property of pluripotency, hESCs are notably distinct from epiblast cells of the preimplantation blastocyst. Here we use a combination of three small-molecule inhibitors to sustain hESCs in a LIF signaling-dependent hESC state (3iL hESCs) with elevated expression of NANOG and epiblast-enriched genes such as KLF4, DPPA3, and TBX3. Genome-wide transcriptome analysis confirms that the expression signature of 3iL hESCs shares similarities with native preimplantation epiblast cells. We also show that 3iL hESCs have a distinct epigenetic landscape, characterized by derepression of preimplantation epiblast genes. Using genome-wide binding profiles of NANOG and OCT4, we identify enhancers that contribute to rewiring of the regulatory circuitry. In summary, our study identifies a distinct hESC state with defined regulatory circuitry that will facilitate future analysis of human preimplantation embryogenesis and pluripotency.
Short Abstract: Biological investigations are generating large amounts of heterogeneous data that can be difficult to share in a useful way between bioinformaticians and biologists. Current data management systems for interactively sharing data require more expertise to maintain than many labs can afford. To address this issue, we developed the lightweight web-based LABrowser, an experiment browser that is flat-file based and data type agnostic for managing and querying heterogeneous data and analyses. We expect that any laboratory can rapidly deploy this system because it is based on simple, human-readable text files and has no database backend. The underlying code of the experiment browser is primarily PHP and Perl. In LABrowser, data sets are grouped into studies. Each data set consists of a set of samples and queries. Queries compare the outcome of measurements for groups of samples and return information prioritized by significance metrics such as p-values, q-values or fold change. Queries are of the type “return all data sets in which the gene Ptgs2 was upregulated by 2-fold with a p-value or q-value of 0.01.” This type of querying is also available for metabolites. Thus, biologists can identify relevant information quickly across all data in the laboratory without contacting a bioinformatician after the initial analysis has been completed. Graphs and a variety of external resources are integrated into the experiment browser using a loosely coupled strategy. We deployed LABrowser in a large translational laboratory that uses low-throughput and high-throughput technologies and works with a variety of model systems.
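The kind of query described above can be illustrated with a short sketch over a tab-delimited text file. This is not LABrowser code (which the abstract describes as primarily PHP and Perl); the column names `dataset`, `gene`, `fold_change` and `p_value`, the function name, and the example rows are all hypothetical, and the 0.01 cut-off is interpreted here as an upper bound on the p-value.

```python
import csv
import io


def query_upregulated(tsv_text, gene, min_fold=2.0, max_p=0.01):
    """Return data set names in which `gene` passes both cut-offs.

    Results are ordered by p-value, most significant first.
    Assumption: the flat file has a header row with the hypothetical
    columns dataset, gene, fold_change and p_value.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = []
    for row in reader:
        if row["gene"] != gene:
            continue
        fold = float(row["fold_change"])
        p_value = float(row["p_value"])
        if fold >= min_fold and p_value <= max_p:
            hits.append((p_value, row["dataset"]))
    return [name for _, name in sorted(hits)]


if __name__ == "__main__":
    # Hypothetical flat-file content; real files would be read from disk.
    example = (
        "dataset\tgene\tfold_change\tp_value\n"
        "study1_treated\tPtgs2\t3.1\t0.004\n"
        "study2_treated\tPtgs2\t1.4\t0.002\n"
    )
    print(query_upregulated(example, "Ptgs2"))
```

Keeping the data in plain text means such a query needs nothing beyond a standard library parser, which is in the spirit of the no-database design the abstract describes.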
Short Abstract: BACKGROUND: The amoeba D. discoideum is a bacterial predator. Its bacterial response is relevant to infections in humans because it likely evolved from pathways in primitive eukaryotes to defend against bacteria. At present, though, there is little information on which genes in D. discoideum are responsible for coordination of bacterial recognition and resistance. A handful of these genes were recently identified through a screen for mutants that can grow on either Gram-positive or Gram-negative bacteria. Our aim was to extend this list by computationally proposing new gene candidates, considering a large array of available information and data sets.
RESULTS: We developed a data fusion approach based on penalized matrix tri-factorization that can simultaneously integrate data from heterogeneous sources and retain them in the original domain space. We fused over a dozen sources of information, including data from RNA-seq, ChIP-Seq, GO and phenotype annotations and ontologies, and KEGG and Reactome annotations. Data fusion predicted nine genes for which mutants were available in the D. discoideum stock center. We tested these mutants in the wet lab and found that eight predictions were correct. This accuracy is surprisingly high given the setup and a training set with very few mutants with Gram-specific defects.
CONCLUSIONS: Our data fusion approach can consider any number of data sets that can be expressed in a matrix, including those from attribute-based representations, ontologies, associations and networks. Its high accuracy in the bacterial resistance study of D. discoideum shows promise for future applications.
Short Abstract: The fruit fly (Drosophila melanogaster) is one of the most studied model organisms; however, only a few transcriptomics and epigenetics studies using next-generation sequencing technologies are available for specific tissues across different developmental stages. In the current study we aim to obtain RNA-seq profiles of five different wing compartments in three developmental stages, namely third instar larva, white pupa and late pupa. Preliminary RNA-seq data are available for the five sub-compartments of the wing imaginal discs in the third instar larva. We detected more than 60% of the annotated genes in all the compartments, although there is a common core set of around 100 genes with housekeeping functions that captures the bulk of transcriptional activity. Even though the partial overlap of the wing compartments may complicate the detection of tissue-specific genes, we identified two sets of ~10 down- and up-regulated genes in the wing pouch with respect to the other wing parts. The fact that the down-regulated genes have annotated functions unrelated to wing development suggests that they are silenced to promote the formation of the mature wing. A comparison of the splicing patterns shows that differences in splicing are negligible when compared to differences in gene expression. The strand-specificity of the RNA-seq protocol allowed us to identify ~40 sense-antisense gene pairs with a significant level of correlation of expression. GO analysis of the coding genes with an overlapping antisense non-coding transcript reveals an enrichment for development-related terms and suggests a possible regulatory role for at least some of the antisense genes.
Short Abstract: ChIP-Seq, as a mature technology and part of the advances in high-throughput sequencing to investigate protein-DNA interactions, has been developed for broad applications in studying transcription factors, chromatin regulators and histone modifications. We have improved the ChIP-Seq analysis algorithm MACS (Model-based Analysis of ChIP-Seq) to address recent challenges. MACS2 provides the flexibility and speed necessary to process large datasets and offers a range of new features, from supporting broad histone modifications to detecting differentially occupied regions. MACS2 is free and open-source on GitHub and can be easily installed through PyPI (Python Package Index, https://pypi.python.org/pypi/MACS2).
Short Abstract: Genome-wide association studies (GWAS) have revealed numerous type 2 diabetes (T2D) risk loci. However, signals emerging from GWAS have rarely been traced to the disease-causing variants in non-coding regions. We recently developed a computational method, Phylogenetic Module Complexity Analysis (PMCA), to infer cis-regulatory variants in a region of disease association. PMCA tests non-coding variants by analyzing the flanking region for cross-species conserved motif modules, exploiting evolutionary information while allowing for binding site turnover.
We applied PMCA to the T2D ADCY5 risk locus and report multiple lines of evidence supporting the intronic variant rs56371916 (C/T, MAF=0.1356) to be causal: (1) it harbors a functional regulatory SREBP motif based on conservation of an AP1-NKXH-SREBP-EVI1 motif cluster across species; (2) chromatin state segmentation on the Roadmap reference epigenomes revealed rs56371916 to be located in an enhancer chromatin state in adipose tissue; (3) using EMSA, we show allele-specific SREBP binding in adipocytic nuclear extract for rs56371916; (4) luciferase assays revealed cell type-specific enhancer effects for rs56371916 in adipocytes; (5) qPCR showed that the rs56371916 risk allele increases ADCY5 expression in adipocyte patient samples; (6) lastly, SREBP1 knockdown by siRNA showed that the increase in ADCY5 mRNA levels depends on the risk allele and on regulation by SREBP1. We are currently undertaking CRISPR/Cas9 genome engineering in adipose patient samples.
Our general approach for the computational discovery and experimental dissection of disease variants has important implications for studying complex traits, and can help bridge the genotype-phenotype gap between genetic variants, molecular mechanisms, and cellular and organismal phenotypes.
Short Abstract: The CaneRegNet framework provides knowledge base–driven pathway analysis built on a uniform relational and graph database schema. The database minimizes heterogeneous data representation through suitable semantics, aiming to integrate sugarcane data from several analysis tools. The integration of biological data, in association with gene expression, ontologies and metabolic pathways (in response to treatments such as drought and sucrose content), has the potential to reveal new gene functions and provide new knowledge about the development of sugarcane and its adaptation to the environment. This framework aims to associate functions and biological knowledge with individual sugarcane genes, based on high-throughput data analysis methods. The developed methods fall into three classes: 1) over-representation analysis methods, which statistically evaluate the fraction of genes in a particular pathway/ontology found among the set of genes showing changes in expression; 2) functional class scoring methods, in which gene expression levels are aggregated into a single pathway-level statistic; 3) pathway topology-based methods, which use pathway databases and operators analyzing gene expression, providing a systematic way of bridging the gap between data and biological knowledge. Gene expression studies are made available through more than 400 microarray experiments, analyzing the behavior of more than 40,000 differentially expressed genes, encompassing 50% of all sugarcane sense and antisense genes. Through the integration of diverse biological datasets collected under several different experimental conditions, CaneRegNet makes it possible to understand interactions within and between sugarcane metabolic pathways that would otherwise be extremely laborious, if not impossible, to uncover. http://sucest-fun.org/wsapp/
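The over-representation analysis described above is commonly evaluated with a hypergeometric (one-sided Fisher) test. The abstract does not say which statistic CaneRegNet uses, so the sketch below is an assumption: the function name `ora_p_value` and its parameters are hypothetical and only illustrate the general technique.

```python
from math import comb


def ora_p_value(n_universe, n_pathway, n_changed, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_universe: all genes considered
    n_pathway:  genes annotated to the pathway/ontology term
    n_changed:  genes showing changes in expression
    n_overlap:  changed genes that belong to the pathway

    Assumption: a hypergeometric test is one common choice for ORA;
    it is not confirmed as the method used by the framework.
    """
    max_overlap = min(n_pathway, n_changed)
    total = comb(n_universe, n_changed)
    # Sum the probabilities of observing n_overlap or more pathway genes
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_changed - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total
```

In practice one p-value is computed per pathway or ontology term, so a multiple-testing correction is normally applied afterwards.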
Short Abstract: A major promise of RNA-seq is the extension of expression profiling to the discovery and quantification of alternative transcripts. For transcript-specific profiling, however, no large-scale expression data from other technologies are available as an external reference point. We here present results from a multi-site cross-platform comparison of transcript-specific measurements.
We focused on a test set of 782 genes with multiple alternative transcripts of varying complexity, specifically selected to represent the full subset of spliced genes annotated in the AceView database. Covering 5,691 alternative transcripts, this test set allows a first comparison of transcript-specific expression level estimates from RNA-seq and high-resolution transcript-level microarray data. To this end, we combined multiple metrics for a robust characterization of platforms, sites, and data processing options. This is necessary because each metric shows a different and platform-specific response to signal strength. For RNA-seq the response increases with transcript expression level and read depth. The read depth at which average RNA-seq performance meets or exceeds that of another platform thus directly depends on the examined metric and the distribution of expression strength and differential signal in the samples measured.
We found that efficient transcript-specific measurements with good precision on microarrays for quantitative expression profiling could complement the power of RNA-seq in the discovery and identification of new alternative transcripts. In other words, the novel transcripts found by RNA-seq can lead to efficient measurements with good precision on microarrays, which can in turn aid in the confirmation and functional study of new transcript variants.
Short Abstract: In the US FDA-led SEQC (i.e., MAQC-III) project, different sequencing platforms were tested across more than ten sites using well-established reference RNA samples with built-in truths in order to assess the discovery and expression-profiling performances of platforms and analysis pipelines.
Studies on microarrays have shown that results of typical statistical differential expression tests thresholded by p-value need to be filtered and sorted by effect strength (fold-change) in order to attain results that are robust across platforms and sites. We have shown that a similar approach is also required in RNA-seq studies. For RNA-seq, removing small fold-changes as well as excluding low-expression measurements reduced the false discovery rate considerably and, in general, gave an improvement over microarrays at similar sensitivity. These filters also achieved good inter-site agreement of lists of differentially expressed genes, with the performance of several (but not all) RNA-seq pipelines becoming comparable to that of microarrays. Even though a direct comparison of absolute expression levels across platforms was not possible, the filters yielded good agreement of differential expression calls between platforms (for example, A vs B on HiSeq 2000 compared to A vs B on SOLiD), suggesting that differential expression analyses from different platforms could be combined, for example to extend existing studies with additional samples.
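The filtering strategy described above can be sketched as follows. The threshold values, the record layout and the function name `filter_and_rank` are hypothetical choices for illustration; the abstract does not give the cut-offs used in the SEQC analyses.

```python
def filter_and_rank(results, p_max=0.01, min_abs_log2_fc=1.0, min_mean_expr=10.0):
    """Filter differential expression calls, then sort by effect strength.

    Each record is a dict with the keys 'gene', 'p_value', 'log2_fc'
    and 'mean_expr'. All thresholds are illustrative assumptions.
    """
    kept = [
        r for r in results
        if r["p_value"] <= p_max                  # statistical test threshold
        and abs(r["log2_fc"]) >= min_abs_log2_fc  # remove small fold-changes
        and r["mean_expr"] >= min_mean_expr       # exclude low-expression measurements
    ]
    # Rank the remaining calls by absolute fold-change, largest first
    return sorted(kept, key=lambda r: abs(r["log2_fc"]), reverse=True)
```

Applying the same filters at every site or platform before comparing gene lists is what makes the resulting agreement figures comparable.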
Short Abstract: Emerging methodologies such as next-generation sequencing contribute to our understanding of disease and health. Rapid progress over the last few years has moved these technologies from an exploratory to an applied stage, and an increasing amount of data derived from such approaches is received by regulatory agencies as supporting evidence for the safety and efficacy of new medical products. This realization has spawned a number of FDA efforts to utilize these technologies through integrated bioinformatics within inter-center and cross-community collaborations. This presentation discusses how the FDA-led, community-wide MicroArray Quality Control (MAQC) project attempts to address the technical performance issues for these emerging biomarker technologies. Specifically, the third phase of MAQC, also known as the SEquencing Quality Control (SEQC) project, developed a comprehensive plan to assess the power and limitations of NGS, with a substantial effort to compare RNA-Seq with microarrays (a mature transcriptomic technology). The project involved >200 participants from >80 organizations. Importantly, the project generated large RNA-Seq data sets covering a broad range of biological samples (human, rat and reference samples). Many critical issues of applying RNA-Seq in clinical and safety evaluation were evaluated and discussed with these datasets. This presentation will provide an overview and the main conclusions of the SEQC project.
Short Abstract: SEQC Evaluation of the Performance of Microarrays and RNA-seq
Weihong Xu1, Anthony Schweitzer2, Leming Shi3, SEQC consortium, Wenzhong Xiao1,4. 1Stanford Genome Technology Center, 2Affymetrix Inc, 3Fudan University, 4Massachusetts General Hospital
The goal of the SEQC consortium is to assess the technical performance of both platforms by generating benchmark datasets with reference samples and to evaluate advantages and limitations of various bioinformatics strategies in RNA and DNA analyses. Here we utilized a comprehensive RNA-Seq dataset of four titration pools from two human reference RNA samples generated by the SEQC consortium and systematically evaluated RNA-Seq and several commercially available microarrays (Affymetrix Hu133plus2, PrimeView, HuGene2.0) in terms of reproducibility, accuracy and detection power for both gene- and exon-level analyses. We found that different microarrays are comparable to RNA-Seq at different read depths, contingent on the performance metric. While RNA-Seq slightly outperforms microarrays for absolute quantification given sufficient read depth, both platforms are comparable for relative quantification. RNA-Seq shows a stronger dependence on expression level, while microarrays are generally reproducible across their whole dynamic range. Further analyses of the titration order and the linear relationship among mixture samples suggest that microarrays can recover the ground truth correctly. The new exon-junction array HTA2.0 shows competitive strength for exon-level analysis, matching RNA-Seq at 1-2 HiSeq lanes per sample.
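The titration-order and linearity checks mentioned above can be illustrated with a short sketch. It assumes that each pool is a known mixture of the two reference samples and that expression values are on a linear (not log) scale; the mixing fractions in the usage note and the function names are hypothetical, since the abstract does not list them.

```python
def expected_mixture(expr_a, expr_b, frac_a):
    """Expected linear-scale signal of a pool containing frac_a of sample A.

    Assumption: the pool signal is a simple weighted average of the
    signals of the two reference samples.
    """
    return frac_a * expr_a + (1.0 - frac_a) * expr_b


def follows_titration_order(measurements):
    """Check that expression changes monotonically with the fraction of A.

    measurements: list of (fraction_of_A, expression) pairs for one gene.
    """
    ordered = [expr for _, expr in sorted(measurements)]
    pairs = list(zip(ordered, ordered[1:]))
    increasing = all(x <= y for x, y in pairs)
    decreasing = all(x >= y for x, y in pairs)
    return increasing or decreasing
```

For example, a gene measured at 20 in one reference sample and 100 in the other would be expected at 80 in a pool with 75% of the second sample, and the fraction of genes for which `follows_titration_order` holds gives one simple summary of how well a platform recovers the known mixture design.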
Short Abstract: Our aim is to investigate how chronic alcohol-fed liver dynamically adapts to changes in environment mediated by the dynamic shift in NF-κB p65-DNA interactions during liver regeneration after 70% partial hepatectomy (PHx). Genome-wide NF-κB binding targets during the early phase of liver regeneration were detected using the ChIP-chip method. The effect of alcohol alone on NF-κB binding was not reflected in the number of targets, but was evident in their functional consequences in the adapted phase. However, NF-κB binding showed higher sensitivity to the combined stimuli of alcohol and PHx, as evident from the transient surge in NF-κB binding activity at 1 hr post-PHx followed by differential expression of genes at 6 hrs post-PHx. Functional analysis indicated that many of the stress-related pathways involved in ethanol metabolism in regenerating liver, such as cell cycle, apoptosis, mitochondrial pathways and homeostasis, were activated and maintained through the priming phase. Dynamic patterns affecting a large majority of promoters were statistically enriched with binding loci for transcription factor families such as STAT, AP-1, NFAT, GCR, HNF, GATA, CREB and C/EBP, in addition to NF-κB. Our findings suggest that NF-κB could be acting as a pioneer transcription factor whose binding at baseline and in the adapted state influences the response to acute perturbations. We interpret the altered genome-wide distribution of NF-κB binding under an ethanol diet as underlying a defective early gene regulatory network response post-PHx, leading to suppression of regeneration. Research Support: NIH AA016919 and AA018873.