Category B - 'Comparative Genomics'
Short Abstract: High-throughput Next Generation Sequencing (NGS) technologies and reference databases have enhanced our ability to explore diversity at genetic and taxonomic levels. Most off-the-shelf tools for examining genetic diversity implement algorithms that rely on sequence similarity and composition, which can lead to resolution loss in genetic comparisons, particularly at the species/sub-species taxonomic ranks. We present a new version of the Automated Oligonucleotide Design Pipeline (AODP). AODP designs signature oligonucleotides (SO) with specificity and fidelity based on genome or DNA barcode sequence identity, reducing the resolution loss observed with existing approaches. SO designed with AODP highlight regions with taxon or clade-specific polymorphisms that are useful for comparative genomics and provide suitable candidates for the design of primers/probes in diagnostic assays. AODP has several unique features: 1) The AODP algorithm uses a novel packed-Trie data structure, with support for multi-threaded insertion, optimized for DNA nucleotide strings, which scales well to multi-processor architectures; 2) SO can be designed for a large dataset with relatively small memory footprint; 3) Regions of DNA with a single nucleotide polymorphism (SNP) can be optionally ignored to minimize noise caused by sequencing errors during NGS; 4) The specificity of SO can be further validated against large reference databases; 5) SO thermodynamic properties can be calculated for wet-lab experimental conditions; and 6) SO can be directly used for in silico identification of taxa from environmental NGS data.
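The idea of a signature oligonucleotide can be illustrated with a minimal sketch. It is not the AODP implementation: it uses plain Python sets rather than a packed trie, compares forward strands only, and the function names and toy sequences are hypothetical. A k-mer counts as a signature for a taxon here if it occurs in every sequence of that taxon and in no sequence of any other taxon.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of a DNA sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def signature_oligos(records, k):
    """Map each taxon to the k-mers shared by all of its sequences
    and absent from every sequence of every other taxon.

    records: iterable of (taxon, sequence) pairs (hypothetical input format).
    """
    shared = {}                    # taxon -> k-mers present in ALL its sequences
    present = defaultdict(set)     # taxon -> k-mers present in ANY of its sequences
    for taxon, seq in records:
        ks = kmers(seq, k)
        present[taxon] |= ks
        shared[taxon] = shared[taxon] & ks if taxon in shared else set(ks)

    result = {}
    for taxon, candidates in shared.items():
        others = set()
        for other_taxon, ks in present.items():
            if other_taxon != taxon:
                others |= ks
        result[taxon] = candidates - others
    return result


# Toy example with two taxa (made-up sequences).
toy = [("A", "ACGTAC"), ("A", "ACGTAG"), ("B", "TTGTAC")]
signatures = signature_oligos(toy, k=4)
```

Set-based storage like this grows quickly with dataset size, which is the kind of cost a compact trie structure is meant to reduce.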
Short Abstract: Filoviruses belong to the family Filoviridae and cause highly lethal hemorrhagic fever in humans. Little is known about the relationships among filoviruses, yet this matters for human health because viruses infect every type of organism and can evolve quickly. Viruses do not leave a fossil record, so their origin is difficult to determine. There is still no known effective treatment for the diseases caused by filoviruses, and transmission is prevented essentially by applying viral hemorrhagic fever isolation precautions, currently the centerpiece of filovirus control. In this study, we performed comparative genomics of eight filoviruses to determine how closely related they are. We obtained complete genome sequences available from the National Center for Biotechnology Information database, one for each of five Ebola, one Marburg, and two Lloviu virus species, performed phylogenetic analysis, and inferred a clear evolutionary relationship among these species. Our multiple sequence alignment revealed that some regions aligned better than others. Identifying the characteristics of the well-aligned regions, and the benefits that follow from such discoveries, may provide new insight for genomic research. We showed that each pair of viruses shares a common ancestor and that all eight are linked by a common ancestor. We expect this study to improve understanding of the genomics, biodiversity and ecology of filoviruses.
Short Abstract: The exon junction complex (EJC) has a central role in marking splice sites in eukaryotic mRNA transcripts. Previous comparative genomics studies of the four core components of the EJC (Magoh, Y14, eIF4AIII and MLN51) have been performed mainly within the Opisthokonta and Archaeplastida eukaryotic super-groups. Many eukaryotic pathogens, such as trypanosomes and Plasmodium, fall outside the super-groups containing animals, fungi and plants; trypanosomes are in the Excavata super-group and Plasmodium falls in the SAR super-group. I have expanded the analyses to include several trypanosomes and related organisms in the Excavata eukaryotic super-group as well as parasitic apicomplexans and other organisms of the SAR super-group. Identifying unique differences between trypanosomatid and apicomplexan parasites and their free-living relatives would provide insight into the management of their corresponding diseases. I have demonstrated that a core protein of the exon junction complex, eIF4AIII, is conserved in all eukaryotes and was present in the last eukaryotic common ancestor (LECA). Magoh and Y14 were present in the LECA, but were selectively lost in intron-poor species. Y14 has undergone a founder effect within the trypanosome lineage.
I have designed a 3-D model/eukaryotic map illustrating:
1. The distribution of the EJC amongst the six eukaryotic super-groups.
2. The correlation of the core components of the EJC to intron density amongst both parasites and non-parasites.
Short Abstract: Vitis vinifera cultivar ‘Sultanina’ is an important seedless grape variety with a pivotal role in table grape breeding. Vitis vinifera is a highly heterozygous species, and the assembly of a heterozygous genome is an ongoing challenge. This study compares a de novo genome assembly of ‘Sultanina’ produced with the PLATANUS assembler against the published de novo assembly produced with the ALLPATHS-LG assembler (De Genova et al., 2014). Sequence reads were downloaded from NCBI using study accession SRP026420 and trimmed with the same parameters as in the published ALLPATHS-LG assembly. Statistics for both assemblies were obtained using the Assemblathon script, and assembly quality results were obtained using QUAST. The results indicated that our PLATANUS assembly was closer to the V. vinifera reference genome (PN40024) than the ALLPATHS-LG assembly. The PLATANUS assembly had more large contigs and scaffolds, a longer longest scaffold and a larger NG50 than the published assembly. It also retained more short informative scaffolds than the ALLPATHS-LG assembly. Both assemblies were further validated with EST (from NCBI), BAC and mRNA sequences of V. vinifera (PN40024). The PLATANUS assembly showed greater mRNA mapping than the ALLPATHS-LG assembly. These results suggest that PLATANUS is a suitable assembler for de novo genome assembly of the seedless grape and other highly heterozygous species.
Short Abstract: In this study, to identify the genes responsible for anthocyanin enrichment in black rice grains, a 135K Oryza sativa microarray was used to identify genes involved in anthocyanin biosynthesis and metabolism in both black rice and white rice cultivars. We found that 3,728 genes were associated with the production of anthocyanin pigment. Among them, 573 conserved orthologous genes were identified using Clusters of Orthologous Groups (COGs) analyses and were compared with the existing flavonoid biosynthesis pathway-network related genes.
Finally, 53 candidate genes were identified by comparing anthocyanin biosynthesis pathway-expression genes. These genes were anchored to the chromosomes of the rice genome to identify their genetic-map positions and were used, together with their 31 homologous protein sequences from A. thaliana, to construct a phylogenetic tree using the maximum-likelihood method. Our candidate genes seem either to play a regulatory role in anthocyanin biosynthesis or to be related to anthocyanin metabolism.
Short Abstract: The extremely radioresistant eubacterium Deinococcus radiodurans and the phenotypically related prokaryotes whose genomes have been completely sequenced are presently used as model species in several laboratories to study the lethal effects of DNA-damaging and protein-oxidizing agents, particularly the effects of ionizing radiation (IR). Unfortunately, relevant information about radioresistant prokaryotes (RP) is not yet available in a centralized and organized form. In this study, we designed the RadioP1 Web resource (www.radiop.org.tn) to gather information about RP defined by the published literature with specific emphasis on (i) predicted genes that produce and protect against oxidative stress, (ii) predicted proteins involved in DNA repair mechanisms and (iii) potential uses of RP in biotechnology. RadioP1 allows the complete RP proteogenomes to be queried using various patterns in a user-friendly and interactive manner. The output data can be saved in plain text, Excel or HyperText Markup Language (HTML) formats for subsequent analyses. Moreover, RadioP1 provides users with a tool, “START ANALYSIS”, which includes the previously described R packages “drc” and “lethal”, to generate exponential or sigmoid survival curves with D10 and D50 values. Furthermore, when accessible, links to external databases are provided. Supplementary data will be included in the future when the sequences of other RP genomes become available.
Short Abstract: Model Organism Databases (MODs) have a strong history of gathering relevant data from the biomedical literature and from data loads, integrating that heterogeneous data, and providing that data in a variety of forms including web interfaces, APIs, dataMines, ftp files, and more. The Mouse Genome Informatics (MGI) project (www.informatics.jax.org), one such MOD, is the community resource for the laboratory mouse. MGI data includes data from over 220,000 publications for almost 23,000 protein-coding genes with information about 46,000 mutant alleles. There are over 14,000 genes with expression results, and over 24,000 genes with GO annotations.
Recently, MGI completed a comprehensive revision of data summation and presentation to provide comprehensive yet interpretable overviews of information available for mouse genes. Now, Gene Detail pages display more information and provide more ways to view subsets of data and access details. New graphical displays provide a synopsis of a gene's functions, where it is expressed and the phenotypes of mutant alleles. Particular emphasis is on the homology between human genes and diseases, with details about mouse models for these diseases. Links to curated Wikipedia pages for the human genes provide a textual summation of available data.
With the rapidly increasing volume of heterogeneous data, the ability to provide high-level summation of information about entities such as genes, along with access to the deepest data available, is essential for navigation of data resources by the diverse scientific community. MGI provides models that may see general adoption by bioinformatics resources.
Funding: NHGRI HG000330 and NICHD HD062499
Short Abstract: Background
With the development of high-throughput technologies, massive and complex omics data, such as genomic, transcriptomic and proteomic data, are increasingly being generated. Compared with a single type of data, large-scale heterogeneous data can provide a comprehensive view for system-level understanding of complex diseases. Association analysis is one of the most commonly used and well-understood methods to investigate pair-wise relationships between attributes. Identifying biological or clinical associations by traditional methods is easy for a small dataset but challenging for multi-omics data. To address this, we developed a platform that can access thousands of attributes from genomics, proteomics and clinical data, enabling the discovery of novel associations between attributes.
LinkedOmics includes the modules LinkFinder, LinkInterpreter and LinkCompare, as well as 1,620,049 attributes from genomic, epigenomic, transcriptomic, proteomic, and clinical data for the breast, colorectal and ovarian tumors of the TCGA and CPTAC portals. The association between the query attribute and individual attributes can be calculated using a statistical test appropriate to the data types of the two attributes. To help biologists easily explore biological insights from the association results, LinkFinder provides a visualization function that plots individual associations from the results table as a scatter plot, box plot or Kaplan-Meier plot. The LinkCompare module provides Venn diagram and heatmap analysis to allow easy comparison of association results generated by the LinkFinder module. The LinkInterpreter module facilitates GO-based enrichment analysis of significantly associated genes.
LinkedOmics provides a unique platform for biologists to access, analyze and compare high-dimensional omics data.
Short Abstract: Many of the most powerful tools in biology rely on inference of homologs via sequence-based algorithms. However, many loci are invisible to such methods. Those that are short or rapidly evolving, such as orphan genes and small non-coding RNAs, may yield no significant hits, whereas low-complexity or high-copy-number loci may hide in a crowd of false positives. Searching by context bypasses this problem. We present an algorithm for tracing loci between genomes using a synteny map, and test its efficacy by mapping all Arabidopsis thaliana-specific genes to the genomes of eight related species. By reducing the search space and winnowing false positives, we were able to assess the origin of the individual orphan genes with unprecedented resolution. We traced many to their non-genic cousins, identifying the non-genic footprint from which they arose. We linked others to putative genes in related species from which they diverged beyond recognition. Knowing the approximate location of each gene across species also provides a starting point for future studies. Our pipeline can easily be adapted to contextualize elusive elements such as small RNAs and lineage-specific genes in any species for which reliable synteny maps can be built.
Short Abstract: Recent work on transcriptome profiling by RNA-sequencing has shown the feasibility of identifying tissue-specific, cross-species gene signatures in mammals, even when analyzing experiments across different laboratories, using different protocols. The ability to perform such analyses increases the kinds of hypotheses that can be tested in high-throughput using RNA-sequencing via "comparative transcriptomics" by integrating public data with de novo RNA-seq.
We developed a novel meta-analysis framework which accounts for confounding batch effects / bias affecting downstream analyses. We parsed reference transcriptomes to identify genes and alternative splicing patterns with unique human / mouse orthologs which were queried for tissue-restricted expression using information-theoretic approaches. In-silico library selection was performed to simulate consistent library preparation across samples, and quantile / upper-quartile normalizations were performed for between-species / between-tissue normalizations.
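As an illustration of the quantile normalization step mentioned above, the following is a minimal sketch in plain Python. It is not the authors' framework: the function name and toy values are hypothetical, and ties within a sample are broken by input order rather than averaged, which a production implementation would usually handle more carefully.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of samples (each a list of expression values).

    Every sample is forced onto the same distribution: the value at rank r
    in each sample is replaced by the mean of the rank-r values across samples.
    """
    n_genes = len(samples[0])
    if any(len(s) != n_genes for s in samples):
        raise ValueError("all samples must have the same number of genes")

    # Mean of the r-th smallest value across samples, for each rank r.
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [
        sum(s[r] for s in sorted_samples) / len(samples) for r in range(n_genes)
    ]

    normalized = []
    for s in samples:
        # Gene indices ordered from lowest to highest expression in this sample.
        order = sorted(range(n_genes), key=lambda i: s[i])
        out = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            out[gene_index] = rank_means[rank]
        normalized.append(out)
    return normalized


# Two toy samples with three genes each (made-up values).
normalized = quantile_normalize([[5, 2, 3], [4, 1, 6]])
```

After normalization every sample contains exactly the same set of values, so remaining differences between samples reflect only the rank order of genes.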
We performed RNA-seq on human donor-sourced nociceptive tissue Dorsal Root Ganglia (DRG), integrating this with 13 publicly available RNA-seq datasets from related tissues of neuroscientific / pharmacological interest; and corresponding mouse tissue datasets. We identified a set of transcription factors in DRG with evolutionarily conserved expression patterns across human / mouse, including DRG-specific PRDM12, which drives development of nociceptors. We further identified alternative splicing events specific to the DRG, including the evolutionarily conserved inclusion of the HSN2/WNK1 exon in the WNK1 gene in DRG, where mutations cause sensory neuropathy. These findings lend credence to the hypothesis that pain pathways and their regulatory mechanisms are evolutionarily conserved across humans and mouse and open the door to translational research for identifying pharmacological targets in the DRG.
Short Abstract: Transcriptome expression sequencing (RNA-Seq) is a powerful tool that allows the analysis of active genes under different conditions, even in organisms without comprehensive genomic resources. For that reason, RNA-Seq has become a common tool in studies aiming to characterize gene function in biological systems. The growing number of available experiments should allow exploratory analysis of candidate functional genes at different ages, growth conditions and tissues. However, the heterogeneous nature of the studies, the different analysis pipelines employed, and the disparate reference sequences hinder comparison between them.
To make the information from previously published RNA-Seq studies accessible, we have developed expVIP (expression Visualisation and Integration Platform). Gene expression is displayed by high-level factors to allow an initial assessment of candidate genes across the different sets of conditions. To explore a gene further, fine-grained levels can be shown for each factor. All this can be done dynamically in a web browser.
Tools to analyse gene expression in non-model organisms are scarce, as most of the available tools are focused on human or other model organisms. ExpVIP is a novel approach to visualization of expression experiments designed to compare homoeologous and homologous genes. The flexible design can deal with different factors to group the data and with several references seamlessly. The source code of the expVIP component that plots the expression is available on GitHub.
Philippa Borrill, Ricardo Ramirez-Gonzalez, and Cristobal Uauy. expVIP: a Customizable RNA-seq Data Analysis and Visualization Platform. Plant Physiology 2016 170: 2172-2186.
Short Abstract: Alphaviruses, such as chikungunya virus (CHIKV), use a single-stranded positive-sense RNA genome to infect both arthropods and vertebrates. In addition to encoding the viral proteins, RNA virus genomes contain important RNA structural elements that control virus genome transcription and protein synthesis, and also allow the virus to avoid antiviral control by the host immune system. To find conserved and functional structures in alphavirus genomes, we present a new method integrating experimental RNA structural probing, secondary structure prediction by free energy minimization, and covariation models for evolutionary support.
Using experimentally-informed structures, we apply a stochastic context-free grammar to identify divergent but conserved structures across alphavirus genomes in CHIKV, sindbis virus (SINV), and Venezuelan equine encephalitis virus (VEEV). Comparative structural metrics quantify covariation in these candidate structures. We select the most conserved structures to test experimentally by introducing mutations within each region that preserve the protein sequence but disrupt the RNA structure.
Our method identifies 49 potential structured candidate regions across CHIKV, SINV, and VEEV. We verify our method by confirming that previously-known functional elements in alphaviruses are candidate structured regions and disruption of these structures reduces viral yields in our in vitro tests. We also find novel functional structures in CHIKV and SINV. Our new, alternative approach to finding conserved RNA structures combines experimental structure data with bioinformatic approaches to identify and validate functional structural elements in alphavirus genomes. The successful identification of functional structures gives us insight into the biology of alphaviruses and provides a foundation for future vaccine development.
Short Abstract: Overexpression of MyoD (myogenic determination factor) is known to transdifferentiate fibroblasts into muscle-like cells. However, despite phenotypic resemblance and expression of myogenic marker genes in these transdifferentiated cells, our global gene expression data suggests that over a hundred genes, many involved in normal muscle development and function, remain non-reprogrammed. In addition, a large fraction of fibroblasts-specific genes stay expressed. In order to better understand the reasons behind this partial reprogramming efficiency, we additionally obtained and analyzed genome-wide chromatin accessibility, and in vivo MyoD binding profiles on primary skin fibroblast cells transduced with inducible MyoD, and compared against the data obtained from starting fibroblast cells and target myoblasts. Our analysis shows that genome-wide, the chromatin is also incompletely reprogrammed, with thousands of cell-line specific sites not changing their accessibility profiles, or displaying partial changes. Our random forest and elastic net classification analyses between reprogrammed and non-reprogrammed chromatin sites suggest that: (1) the DNA binding specificity of MyoD and its co-factors (e.g. Meis1) are highly predictive of MyoD in vivo binding and chromatin opening, (2) SAND domain-like factors (such as Ski) can potentially rescue missed chromatin opening events and enhance overall efficiency of reprogramming, and (3) active histone modification marks are likely involved in sustaining accessibility at sites that fail to close down and remain non-reprogrammed. Combined analyses of chromatin and gene-expression data also interestingly indicate that reprogrammed chromatin sites surrounding genes that turn on have more efficient reprogramming than those around genes that fail to be activated at their myoblast levels.
Short Abstract: Single-cell RNA-seq (scRNA-seq) technologies enable gene expression measurement of individual cells and allow the discovery of cell population heterogeneity. However, scRNA-seq data sets are noisy with high degree of dropouts, and exhibit higher levels of diversity such as cell input heterogeneity and variation in cell cycle stages. These challenges make it difficult to define cell-to-cell similarity measures based on strict statistical assumptions that have been developed for bulk RNA-seq. Here, we propose a novel similarity-learning framework, SIMLR (single-cell interpretation via multi-kernel learning), which learns an appropriate distance metric from the data for dimension reduction, clustering and visualization.
We profiled >50,000 peripheral blood mononuclear cells (PBMCs) with the Chromium™ system from 10x Genomics, and used SIMLR to provide an unbiased classification of all major subpopulations at expected proportions. In order to evaluate the sensitivity and accuracy of SIMLR, we further analyzed individual purified populations from PBMCs and pooled the data in silico at varying proportions. For all the datasets above, we first show that simple correlation-based or distance-based similarity measures are sensitive to noise, dropouts and outlier effects among the high dimensional data. In contrast, SIMLR can uncover clear similarity block structures by automatically learning appropriate cell-to-cell similarity specific to each dataset. Compared with 8 other popular dimension reduction methods, SIMLR yields a substantially higher clustering accuracy. When applied to visualization, we illustrate SIMLR’s advantage over other methods in projecting the high dimensional data to 2-D and 3-D where the different cell types are automatically projected in spatially distinct clusters.
Short Abstract: Streptomyces aureofaciens is a Gram-positive Actinomycete used for commercial antibiotic production. Although it has been the subject of many biochemical studies, no public genome resource had previously been made available. We sequenced Streptomyces aureofaciens (ATCC 10762) using a combination of sequencing platforms (Illumina MiSeq with 2x150bp paired end as well as Roche 454). Multiple de novo assembly methods have been used including SGA (Illumina only; count/N50 for contigs: 11,319/2.5kb; scaffolds: 5,264/4.8kb), MIRA (Illumina only; contigs: 5,385/8.1kb), Trinity (Illumina only; contigs: 2,559/10.6kb), IDBA (Illumina only; contigs: 1,249/18.5kb; scaffolds: 1,382/26.8 Kb), SPAdes (Illumina only; contigs: 574/59.8kb), velvet (Illumina + 454; contigs: 711/46.6kb; scaffolds: 389/8Mb) and SPAdes (Illumina + 454; contigs: 120/228kb; scaffolds: 60/661kb). Contig sets from all these assemblies were also integrated using CISA (contigs: 209/221kb). The best assembly was the one generated by SPAdes (Illumina + 454; total length / GC-content: 9.47 Mb / 71.15 %). We annotated this assembly using the NCBI Prokaryotic Genome Annotation Pipeline, revealing 8,083 total genes. This includes 7,986 coding sequences, more than any other public S. aureofaciens annotation to date. We discuss the value of integrating multiple assembly approaches including long-read sequences from the Oxford Nanopore MinION platform. We will also discuss the subsequent comparative genomic and phylogenetic analyses that suggest S. aureofaciens ATCC 10762 may be more closely related to the genus Kitasatospora than to neighboring Streptomyces species.
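The count/N50 figures above can be read with the standard definition of N50: the length of the contig (or scaffold) at which the cumulative length, taken from the longest downward, first reaches half of the total assembly length. A minimal sketch follows; the function name and the toy lengths are hypothetical and not taken from any of the assemblies listed.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy assembly of five contigs (made-up lengths, in kb).
toy_n50 = n50([2, 3, 4, 5, 6])
```

Because N50 depends on the total assembly length, it is best compared alongside contig counts and total size, as the abstract does.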
Short Abstract: The exponential growth of next-generation sequencing technologies has made epigenomic analysis a big-data science, which poses challenges for translating the data into knowledge. This has led to the emergence of a new field of comparative analysis called ‘Comparative Epigenomics’. Comparative epigenomics can be divided into three major directions, namely comparison across species, across the time-course of a biological process and across individuals. In this study we focus on comparing epi-modifications across species (particularly human and mouse). Thus, the overall purpose of this research is to compare different epigenomic factors and to determine where and how they concur or differ among species. We used histone modification data from various publicly available human and mouse data sets. We were able to identify epi-modification co-appearance within species and across species. We also found that regions of epigenomic conservation correlated well with quickly evolving sequences and slowly evolving sequences. We have also looked at how selective pressure on protein coding genes varies with respect to various histone modifications.
Short Abstract: Single-cell mRNA sequencing allows the profiling of heterogeneous cell populations, offering exciting possibilities to tackle a variety of biological and medical questions. A range of protocols has recently been developed, making it necessary to systematically evaluate their sensitivity, accuracy and power to quantify gene expression levels.
Here, we have generated data from 447 mouse embryonic stem cells and spike-in controls (ERCCs) using Drop-seq, SCRB-seq, Smart-Seq on the Fluidigm C1 platform and Smart-seq2 and also re-analyzed 35 cells that were prepared using CEL-seq in a previous study. We benchmark these five scRNA-seq methods in two independent replicates each and subsample one million sequencing reads per cell, a depth at which most libraries are sequenced to saturation.
We compare the sensitivity by the number of detected genes and ERCCs, the accuracy by correlating concentrations of ERCCs with expression estimates and the precision by power to detect differential expression based on simulations using the empirical mean-dispersion pairs. We find that only SCRB-seq performs well on all these parameters and that Drop-Seq performs well, when considering its higher throughput. For quantifying full length cDNAs Smart-seq2 performs best. Our dataset provides a solid basis to choose among five prominent RNA-Seq protocols for single cells and provides a basis for benchmarking new experimental and analytical approaches in the future.
Short Abstract: Currently, quantitative RNA-seq methods are pushed to work with increasingly small starting amounts of RNA that require amplification. However, it is unclear how much noise or bias amplification introduces and how this affects precision and accuracy of RNA quantification. To assess the effects of amplification, reads that originated from the same RNA molecule (PCR duplicates) need to be identified. Computationally, read duplicates are defined by their mapping position, which does not distinguish PCR duplicates from natural duplicates, and hence it is unclear how to treat duplicated reads. Here, we generate and analyse RNA-seq data sets prepared using three different protocols (Smart-Seq, TruSeq and UMI-seq). We find that a large fraction of computationally identified read duplicates are not PCR duplicates and can be explained by sampling and fragmentation bias. Consequently, the computational removal of duplicates improves neither accuracy nor precision and can actually worsen the power and the False Discovery Rate (FDR) for differential gene expression. Even when duplicates are experimentally identified by unique molecular identifiers (UMIs), power and FDR are only mildly improved. However, the pooling of samples, as made possible by the early barcoding of the UMI protocol, leads to an appreciable increase in the power to detect differentially expressed genes.
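The difference between computational and UMI-based duplicate identification can be sketched as follows. This is an illustration only, not the analysis pipeline used in the study: the read tuples, field order and function name are hypothetical. Position-based calling treats every read sharing a mapping position as a duplicate, whereas UMI-based calling additionally requires the same molecular identifier.

```python
from collections import Counter


def count_duplicates(reads, use_umi=False):
    """Count reads flagged as duplicates.

    reads: iterable of (chromosome, position, strand, umi) tuples.
    With use_umi=False, reads sharing a mapping position are duplicates.
    With use_umi=True, they must also carry the same UMI.
    """
    groups = Counter()
    for chrom, pos, strand, umi in reads:
        key = (chrom, pos, strand, umi) if use_umi else (chrom, pos, strand)
        groups[key] += 1
    # In each group, one read is kept and the rest are flagged as duplicates.
    return sum(n - 1 for n in groups.values())


# Three reads map to the same position, but only two share a UMI (toy data).
toy_reads = [
    ("chr1", 100, "+", "AAC"),
    ("chr1", 100, "+", "AAC"),
    ("chr1", 100, "+", "GGT"),
    ("chr2", 5, "-", "AAC"),
]
by_position = count_duplicates(toy_reads)
by_umi = count_duplicates(toy_reads, use_umi=True)
```

In the toy data the read with UMI "GGT" is a natural duplicate: it shares a position with two other reads but comes from a different molecule, so only the position-based rule would discard it.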
Short Abstract: There has been a recent explosion in avian genomics. In December 2014 the Beijing Genomics Institute in collaboration with a number of labs worldwide (including Kent) released 48 new de-novo avian genome sequences in a special edition of Science. This has led to a complete re-evaluation of the phylogenetic tree of birds and presents the opportunity to study avian comparative genomics in far more detail than before. Most of these genome sequences however exist only as “scaffolds” i.e. the depth of sequence and length of read produces contiguous fragments of sub-chromosomal size. This impedes insight into overall genome structure, which is particularly challenging, as one of the most interesting biological features of birds is the peculiarity of their karyotype. This project is an on-going effort to map scaffold assemblies to avian chromosomes using a combination of bioinformatics and Fluorescent in situ Hybridization (FISH). This has traditionally been a very time-consuming and costly procedure, however a combination of bioinformatic approaches coupled with novel hardware innovation has deconstructed the FISH protocol and re-invented it as a high throughput, cheaper procedure. Initial work has helped to reconstruct Pigeon and Peregrine Falcon genomes and will ultimately provide insight into various unanswered questions pertaining to avian gross genome rearrangement. These include why the unique overall genomic structure of birds is so evolutionarily conserved, why intra and inter-chromosomal rearrangements happen (e.g. in response to the development of traits such as vocal learning) and what the karyotypes of extinct species such as dinosaurs may have looked like.
Short Abstract: Microorganisms have evolved protein structures, biosynthetic pathways of extremolytes and interactions in a biofilm community to survive under the harsh environments such as a saturated salt condition and an extremely high temperature condition. Especially, functional proteins adapted to extreme environments are useful for the industrial application. With the discovery of the functional proteins from the extremophiles, it has been reported that small hydrophobic residues are preferred in halophilic proteins, and hydrophobic residues are filling the cavity of the interior core in hyperthermophilic proteins. However, these two adaptations were considered individually and were not compared among protein functional categories. Here we show that a surface tension defined by the hydrophobic networks within protein structures is well conserved among the same environment and discriminates not only hyperthermophilic proteins but also halophilic proteins from the others. We also found that this surface tension associates with protein functional categories. The surface tension was reduced in the proteins with nucleic acid binding transcription factor activity, but increased in those with catalytic activity. These results suggest that the optimum surface tension is required for the specific protein function. Our findings will contribute to the designing and engineering of functional proteins working under extreme conditions for industrial purposes.
Short Abstract: Protein phosphorylation is a post-translational modification that is essential for a wide range of eukaryotic physiological processes, such as transcription, cytoskeletal regulation, cell metabolism, and signal transduction. Although more than 200,000 phosphorylation sites have been reported in the human genome, their physiological roles mostly remain unknown. In a previous study we identified 178 phosphomotifs, and we describe the evolutionary conservation patterns of the known phosphosites observed in these human phosphomotifs by comparative genomic analysis. Our comparative genomic analysis was performed using genomic data from nine species that span from yeast to humans. The present data provide an overview of evolutionary patterns in the acquisition of phosphomotifs and of relationships between motif structures. Using these data, we investigated kinase substrates associated with phosphoproteins and the evolutionary conservation of kinase groups, and we also analyzed the fractions of kinase groups from worm to human genomes. The substrates of AGC kinases showed higher conservation than those of the CMGC kinase family. In addition, we show the correlation between evolutionary conservation and the distribution of disease-related nonsynonymous mutations on the phosphomotifs. Our characterizations of phosphorylation motif structures and our assessments of the evolutionary conservation of phosphosites are indicative of the physiological roles of unreported phosphosites. Thus, interactions between protein groups that share motifs would be helpful for inferring kinase–substrate interaction networks. In addition, our computational methods can be used to elucidate relationships between phosphorylation signaling and cellular functions.
Short Abstract: Mapping reads to a reference genome is a frequent starting point for genomic studies. Fully sequenced and annotated genomes, however, are not always available. This is particularly true for prokaryotic species, where diversity and divergence occur at an accelerated pace. Furthermore, due to the high rate of divergence in prokaryotes, significant differences are present between strains of the same species. In this study, we examine how the choice of reference genome impacts the results of differential expression analysis between two significantly distant strains of Vibrio vulnificus. By identifying orthology and mapping RNA-Seq reads to the reference genomes of both strains, we can perform differential expression analysis with the differential factor being the selected reference genome. We find that differential analysis can be reliably performed on the core genomes of two significantly different strains of bacteria without impacting the resulting information. Additionally, we provide a pipeline for identifying core genes between strains, for both de novo assemblies and complete reference genomes, that can be used to process RNA-Seq data of different bacterial strains from raw data to expression levels.
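As an illustration of the core-genome comparison described in the abstract above, the following Python sketch restricts two sets of read counts to orthologous genes and reports how much the choice of reference changes each gene's normalized expression. It is a minimal sketch and not the authors' pipeline: the ortholog table, the gene identifiers, the counts-per-million normalization and the pseudocount are all assumptions made for the example.

```python
import math


def core_pairs(orthologs, counts_a, counts_b):
    """Keep ortholog pairs for which both references have a read count."""
    return [(a, b) for a, b in orthologs if a in counts_a and b in counts_b]


def counts_per_million(counts):
    """Scale raw counts so that they sum to one million."""
    total = sum(counts.values())
    return {gene: c * 1e6 / total for gene, c in counts.items()}


def reference_effect(orthologs, counts_a, counts_b, pseudocount=1.0):
    """log2 ratio of each core gene's expression under reference A vs. B.

    Normalization uses core genes only, so strain-specific genes do not
    change the library size used for either reference.
    """
    core = core_pairs(orthologs, counts_a, counts_b)
    cpm_a = counts_per_million({a: counts_a[a] for a, _ in core})
    cpm_b = counts_per_million({b: counts_b[b] for _, b in core})
    return {
        (a, b): math.log2((cpm_a[a] + pseudocount) / (cpm_b[b] + pseudocount))
        for a, b in core
    }


# Hypothetical inputs: the same RNA-Seq sample mapped to two references.
orthologs = [("a1", "b1"), ("a2", "b2"), ("a3", "b9")]
counts_a = {"a1": 100, "a2": 300, "a3": 5, "aX": 50}
counts_b = {"b1": 100, "b2": 300, "bY": 7}
effect = reference_effect(orthologs, counts_a, counts_b)
```

Values near zero indicate core genes whose estimated expression does not depend on which reference was used.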
Short Abstract: Motivation:
Despite recent progress in genome sequencing and assembly, many of the currently available assembled genomes come in draft form. Such draft genomes consist of a large number of genomic fragments (scaffolds), whose positions and orientations along the chromosomes are unknown. While there exist a number of methods for reconstructing the chromosomes of a genome from its scaffolds using various computational and wet-lab techniques, they can often produce only partial, error-prone scaffold assemblies. It therefore becomes important to compare and merge scaffold assemblies produced by different methods, thus combining their advantages and highlighting potential conflicts for further investigation. These tasks may be labor intensive if performed manually.
We present CAMSA - an online tool for a flexible comparative analysis and merging of two or more given scaffold assemblies. The tool (i) creates an extensive report with several comparative quality metrics; (ii) constructs the least conflicting combined scaffold assembly; and (iii) provides an interactive framework for a visual comparative analysis of the given assemblies.
CAMSA is freely available at https://cblab.org/camsa/
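To make the idea of comparing scaffold assemblies concrete, here is a minimal Python sketch. It is not CAMSA's algorithm or data format: it assumes each assembly is given as chains of (scaffold name, orientation) pairs, reduces every chain to joins between scaffold extremities, and flags joins from two assemblies that reuse the same extremity in different ways. All function and scaffold names are hypothetical.

```python
def extremities(name, orient):
    """Return (left_end, right_end) of a scaffold placed in the given orientation."""
    tail, head = (name, "t"), (name, "h")
    return (tail, head) if orient == "+" else (head, tail)


def adjacencies(assembly):
    """Collect the joins between consecutive scaffolds in every chain.

    Each join is an unordered pair of scaffold extremities, so a chain and
    its reverse complement yield the same set of joins.
    """
    joins = set()
    for chain in assembly:
        for (n1, o1), (n2, o2) in zip(chain, chain[1:]):
            joins.add(frozenset({extremities(n1, o1)[1], extremities(n2, o2)[0]}))
    return joins


def compare(assembly_a, assembly_b):
    """Split joins into shared, assembly-specific, and conflicting ones."""
    a, b = adjacencies(assembly_a), adjacencies(assembly_b)
    only_a, only_b = a - b, b - a
    # Two different joins conflict when they attach to the same extremity.
    conflicts = {(x, y) for x in only_a for y in only_b if x & y}
    return {"shared": a & b, "only_a": only_a, "only_b": only_b, "conflicts": conflicts}


# Hypothetical assemblies of the same scaffolds produced by two methods.
asm_a = [[("s1", "+"), ("s2", "+"), ("s3", "+")]]
asm_b = [[("s3", "-"), ("s2", "-")], [("s1", "+"), ("s4", "+")]]
report = compare(asm_a, asm_b)
```

In this example the s2-s3 join is shared even though the second assembly lists it in reverse, while the two assemblies disagree about what follows the head of s1.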
Short Abstract: Recently PacBio has developed SMRT (single-molecule real-time) sequencing of full-length transcripts and gene isoforms with no assembly required. Here we compare Iso-Seq transcripts with standard transcript assemblies from Illumina short-read RNA-Seq data, based on the same RNA samples, in two wood-decay Agaricomycotina fungi: Cylindrobasidium torrendii and Peniophora cinerea. We describe the strengths and weaknesses of both approaches. Iso-Seq sequencing also allowed a more straightforward and accurate analysis of alternative splicing in these fungi.
Short Abstract: Zika and Ebola are emerging infectious diseases that represent unique challenges by virtue of their potential to spread across geographical boundaries and their devastating pathological features. Both attack the nervous system, the first through infection of neural precursors and neurons and the second indirectly through injury to the brain’s vascular network. We have adapted our EvoPrinter comparative genomics tool for the rapid analysis of their viral genomes. For example, once the user inputs a Flavivirus sequence (as little as 200 bases or up to a full genome), pairwise alignments with over 700 Flavi-genomes, including 54 different Zika isolates, are automatically generated within 20 seconds to identify the origin of the input sequence and its lineage/sublineage relationship with the EvoPrinter database genomes. The Filovirus database includes over 367 Ebola and 66 Marburg genomes. The user has the option of viewing individual alignments or superimposing hundreds of color-coded alignments to highlight multi-genome polymorphisms and ultra-conserved sequences as they exist in the input sequence. EvoPrinter reveals lineage-specific identifiers and sequences that are unique to the input. Readouts provide unique insights into viral evolution and into the identification of viral lineages and/or sublineages that trigger isolated outbreaks or epidemics. For example, comparative analysis of recent Brazil Zika isolates reveals that more than one sublineage is contributing to this spreading South American epidemic. Our comparative analysis has also identified Zika recombination events in all of its lineages, and A-to-I hyper-editing in both Ebola and Marburg genomes. This tool is an important addition to existing methods for the genetic surveillance of these pathogens.
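The idea of superimposing many alignments onto the input sequence, as described in the abstract above, can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not EvoPrinter's implementation: the database genomes are taken to be already aligned to the input, position by position, with "-" marking gaps, and the function names are hypothetical.

```python
def superimpose(input_seq, aligned_genomes):
    """Uppercase input bases that are identical in every aligned genome.

    Bases that differ in at least one genome are returned in lowercase.
    """
    for genome in aligned_genomes:
        if len(genome) != len(input_seq):
            raise ValueError("aligned sequences must have the input's length")
    out = []
    for i, base in enumerate(input_seq.upper()):
        conserved = all(g[i].upper() == base for g in aligned_genomes)
        out.append(base if conserved else base.lower())
    return "".join(out)


def unique_positions(input_seq, aligned_genomes):
    """Positions where the input base differs from every aligned genome."""
    return [
        i
        for i, base in enumerate(input_seq.upper())
        if all(g[i].upper() != base for g in aligned_genomes)
    ]


# Hypothetical toy alignment: one input sequence against two database genomes.
readout = superimpose("ACGTAC", ["ACGTTC", "ACCTGC"])
private = unique_positions("ACGTAC", ["ACGTTC", "ACCTGC"])
```

Here `readout` is "ACgTaC": the third base varies in one genome, and the fifth base is found in neither genome, so `private` is `[4]` and marks a candidate input-specific identifier.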
Short Abstract: Background: A living cell has a complex, hierarchically organized signaling system that encodes and assimilates diverse environmental and intracellular signals, and it further transmits signals that control cellular responses, including a tightly controlled transcriptional program. An important and yet challenging task in systems biology is to reconstruct the cellular signaling system in a data-driven manner. In this study, we investigate the utility of deep hierarchical neural networks in learning and representing the hierarchical organization of yeast transcriptomic machinery. Results: We have designed a sparse autoencoder model consisting of a layer of observed variables and 4 layers of hidden variables. We applied the model to over a thousand yeast microarrays to learn the encoding system of yeast transcriptomic machinery. After model selection, we evaluated whether the trained models captured biologically sensible information. We show that the latent variables in the first hidden layer correctly captured the signals of yeast transcription factors (TFs), obtaining a close to one-to-one mapping between latent variables and TFs. We further show that genes regulated by latent variables at higher hidden layers are often involved in a common biological process, and that the hierarchical relationships between latent variables conform to existing knowledge. Finally, we show that the information captured by the latent variables provides a more abstract and concise representation of each microarray, enabling the identification of better-separated clusters in comparison to a gene-based representation. Conclusions: Contemporary deep hierarchical latent variable models, such as the autoencoder, can be used to partially recover the organization of transcriptomic machinery.
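The architecture named in the abstract above, one observed layer followed by four hidden layers, can be sketched in plain Python. This is a hedged illustration of the forward pass and objective only, with no training loop. The layer sizes, the mirrored decoder, the sigmoid activations and the L1 penalty on hidden activations are assumptions made for the example; the abstract does not specify them.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(layer, inputs, activate=True):
    """Apply one layer; the final decoder layer is left linear."""
    weights, biases = layer
    pre = [sum(w * v for w, v in zip(row, inputs)) + b for row, b in zip(weights, biases)]
    return [sigmoid(p) for p in pre] if activate else pre


def build(sizes, seed=0):
    """sizes = [observed, h1, h2, h3, h4]; the decoder mirrors the encoder."""
    rng = random.Random(seed)
    encoder = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
    mirrored = sizes[::-1]
    decoder = [init_layer(a, b, rng) for a, b in zip(mirrored, mirrored[1:])]
    return encoder, decoder


def loss(model, x, sparsity_weight=0.01):
    """Mean squared reconstruction error plus an L1 penalty on hidden activations."""
    encoder, decoder = model
    hidden, h = [], x
    for layer in encoder:
        h = forward(layer, h)
        hidden.append(h)
    for i, layer in enumerate(decoder):
        h = forward(layer, h, activate=(i < len(decoder) - 1))
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, h)) / len(x)
    penalty = sum(abs(v) for layer_h in hidden for v in layer_h)
    return reconstruction + sparsity_weight * penalty, hidden


# Hypothetical toy sizes: 6 observed genes and hidden layers of 5, 4, 3 and 2 units.
model = build([6, 5, 4, 3, 2])
value, hidden = loss(model, [0.1, 0.9, 0.3, 0.0, 0.5, 0.7])
```

In a real analysis each input vector would be one microarray profile and the weights would be fitted by minimizing this objective over all arrays; the hidden activations would then serve as the latent variables that are inspected.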