Session A: (July 7 and July 8)
Session B: (July 9 and July 10)
Short Abstract: There is an everlasting battle between prokaryotes and prokaryotic viruses. Prokaryotes continually evolve defenses against viruses while viruses evolve weapons to thwart those defenses. A key prokaryotic defense system is CRISPR-Cas, which targets viral nucleic acids. In turn, viruses have evolved counter-defenses to CRISPR-Cas. One such counter-defense is a set of recently discovered proteins that interact with CRISPR-Cas and render it ineffectual. These proteins are termed anti-CRISPR (ACR). ACRs have a biotechnological use: CRISPR-Cas has been appropriated by researchers to edit DNA, and ACRs allow greater control over this process, with the ability to stop it at will. However, detecting novel ACRs is a daunting challenge. We have therefore implemented an approach to detect novel ACRs, based on anomaly detection. Specifically, we created a set of quantifiable protein features and assessed them in known ACRs. Given a candidate protein, we examine its features and calculate the probability that they fit within the known ACR feature distributions. We found that our approach is predictive of ACRs, with ACRs ranked significantly higher than non-ACRs. Thus, we have developed an approach to detect novel ACRs, an elusive protein type that is a potential biotechnological tool and a key component of viral weaponry.
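The scoring idea in the abstract above (fit feature distributions on known ACRs, then ask how well a candidate fits them) can be sketched as follows. This is a minimal illustration, not the authors' method: the feature names (`length`, `charge`), the per-feature Gaussian fit, and the assumption that features are independent are all hypothetical choices made for the example.

```python
from math import log
from statistics import NormalDist, mean, stdev


def fit_feature_model(known_proteins):
    """Fit one Gaussian per feature from known examples (assumed model)."""
    names = list(known_proteins[0].keys())
    model = {}
    for name in names:
        values = [p[name] for p in known_proteins]
        model[name] = NormalDist(mean(values), stdev(values))
    return model


def log_likelihood(candidate, model):
    """Sum of per-feature log densities; higher means a better fit.

    Note: extremely distant values can underflow the density to 0,
    in which case log() raises ValueError.
    """
    return sum(log(model[name].pdf(candidate[name])) for name in model)


# Hypothetical feature values for three known proteins.
known = [
    {"length": 90, "charge": -2.0},
    {"length": 100, "charge": -1.0},
    {"length": 110, "charge": -3.0},
]
model = fit_feature_model(known)
typical = log_likelihood({"length": 100, "charge": -2.0}, model)
atypical = log_likelihood({"length": 400, "charge": 5.0}, model)
print(typical > atypical)
```

Candidates can then be ranked by this score, which mirrors the ranking comparison the abstract reports, although the actual features and distributions used there are not specified.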
Short Abstract: Phylogenetic analysis both informs our view of the divergence of species and develops a framework on which to view and exploit phenotypic information. The UK National Collection of Yeast Cultures (NCYC; http://www.ncyc.co.uk) consists of over 4,000 diverse strains, ideal for the construction of such a framework for yeast. Recent genome sequencing of ~1,000 NCYC strains has provided raw material for next-generation sequencing (NGS)-based tree estimation. Several NGS-based approaches to phylogenetic analysis have emerged in the past few years. One highly popular approach uses feature frequency profiles (Sims et al., 2009), essentially comparing frequency distributions of k-mers between genome pairs as a proxy for evolutionary divergence. This approach is simple to use but has shown some problems with computational efficiency and in taking biological features into account. Here, we present a comparison of multi-locus sequence, whole genome SNP, and NGS-based phylogenetic approaches, focussing on a well-studied set of 40 Saccharomyces complex strains. The success of the various approaches was assessed by computational measures (e.g. Robinson Foulds distances, Mantel tests). Simulation studies were also used to assess the accuracy of the different phylogenetic methods. The results will inform future work aiming to develop new NGS-based approaches that incorporate additional biological knowledge.
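As a rough illustration of the feature frequency profile idea mentioned above, the sketch below counts k-mers in two sequences, normalizes the counts to frequencies, and compares the profiles with the Jensen-Shannon divergence. The function names and the toy sequences are assumptions for the example; real analyses run on whole genomes and choose k carefully.

```python
from collections import Counter
from math import log2


def kmer_profile(seq, k):
    """Relative frequency of every k-mer observed in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse profiles."""
    div = 0.0
    for kmer in set(p) | set(q):
        pi, qi = p.get(kmer, 0.0), q.get(kmer, 0.0)
        mi = (pi + qi) / 2
        if pi > 0:
            div += 0.5 * pi * log2(pi / mi)
        if qi > 0:
            div += 0.5 * qi * log2(qi / mi)
    return div


a = kmer_profile("ACGTACGTACGT", 3)
b = kmer_profile("ACGTTCGTACGA", 3)
print(js_divergence(a, b))
```

A matrix of such pairwise divergences can be passed to a distance-based tree builder such as neighbor joining; that step is omitted here.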
Short Abstract: Horizontal gene transfer (HGT) events can often be inferred by comparative genomics analyses; however, the global processes that facilitate the persistence of HGT are not well understood. In particular, the functional compatibility of regulatory machinery between donors and recipients and its role in HGT have not been explored. Here, we used “synthetic HGT” to analyze the transfer of 36,968 putative natural regulatory sequences derived from microbial genomes and mobile DNA into three diverse bacteria (Bacillus subtilis, Escherichia coli, and Pseudomonas aeruginosa) to characterize their transcriptional profiles. Most regulatory sequences are driven by the sigma70 motif and have nearly universal conservation in their transcriptional start positions (TSS) across recipients. The ability to express a foreign promoter varied by recipient, with P. aeruginosa being the most promiscuous and B. subtilis the least. Also, promoters from Firmicutes showed overall higher activity regardless of the recipient. We provide a mechanism that explains the difference in promiscuity based on the AT-rich sigma70 motif and the recipient GC background. This mechanism may ultimately influence evolutionary adaptation to differing genomic backgrounds. Our work helps to elucidate the role of transcriptional regulation during HGT and suggests that HGT promotes the maintenance of a universal transcriptional architecture during community-wide evolutionary selection.
Short Abstract: Subterranean ecology provides a unique natural experiment of convergent evolution of unrelated taxa due to shared physical conditions and/or stresses. A principal example is the evolution of subterranean mammals in comparison to fossorial animals. In this study, we present a comparative analysis of protein-protein interactions between stress-response genes among 14 subterranean and fossorial animals. We found complete sub-clustering of these animals by ecology based on protein-protein interaction (PPI) network properties. We developed and applied a novel ‘evolution of protein domains’ model of protein evolution, based on ‘mix and merge’ of protein domains. We define an evolutionary event, translocation, as the reciprocal exchange of a protein domain between two groups of orthologues across organisms, and applied this definition to an assembled comprehensive collection of 76 organisms. We found translocations of key protein domains that are involved in pathways for sensing and regulating responses to environmental cues such as light and oxygen. Proteins containing these domains were found to be hubs in their PPI sub-networks. The distribution of hubs in ecology-specific PPI sub-networks accounts for the sub-clustering by ecology.
Short Abstract: To better understand the dynamics of the acquisition of resistance mechanisms, we developed an approach based on experimental evolution and deep sequencing of time series. For the evolution experiments we used a morbidostat device, which provides constant selective pressure via software-controlled adjustment of the antibiotic concentration. With this technique we studied the development of resistance to the widely used antibiotic ciprofloxacin in Escherichia coli and Acinetobacter baumannii. For each organism we evolved 6 parallel cultures until they were able to tolerate 80x the initial minimum inhibitory concentration. A. baumannii cultures achieved this in 70 hours, whereas E. coli took about 120 hours. However, the first rise in resistance was observed for both organisms at the same time, after 20-30 hours. Sequencing showed that resistance was achieved in two stages. In the first stage, competition occurred between different variants of the ciprofloxacin targets, the DNA gyrase genes. In the second stage, the population accumulated defects in transcriptional regulation that lead to overexpression of efflux transporters. To connect mutations and phenotype, individual clones with distinct characteristics were selected from the populations. Genotypes were reconstructed by combining individual clone sequencing on the Oxford Nanopore MinION with the population data.
Short Abstract: Plasmodium parasites were responsible for about 212 million cases of malaria in the most recent WHO report. Rapid evolution and notable diversity in this genus have made effective drug, vaccine and diagnostic design a major hurdle to malaria eradication. Further, this diversity necessitates regional eradication programs tailored to specific malaria populations. Plasmodium vivax (Pv) is the most common malarial pathogen outside of Africa, including in the Greater Mekong Subregion (GMS). Of the 6 GMS countries, Myanmar has the highest documented malaria burden, but very little is known about specific malaria populations in Myanmar, or the spread of Pv to and from nearby countries. In the past, population analyses for Pv in Myanmar and on its Chinese border have been limited by sample availability, among other factors. Using next generation sequencing data from Pv field isolates, we have addressed Pv variation around the China-Myanmar border both at the whole genome level and within individual genes of functional interest. Continuing analysis will expand these results to other parts of the GMS and guide local malaria eradication efforts.
Short Abstract: Complexity is a fundamental attribute of life. Complex systems are made of parts that together perform functions that a single component, or most subsets containing individual components, cannot. Examples of complex molecular systems in bacteria include protein structures such as the F1F0-ATPase, the ribosome, or the flagellar motor: each one of these structures requires most or all of its components to function properly. At the molecular level, operons are a classic example of a complex system. An operon’s genes are co-transcribed under the control of a single promoter to a polycistronic mRNA molecule, with its gene products forming molecular complexes or metabolic pathways. With the large number of complete bacterial genomes available, we now have the opportunity to examine the evolution of operons and identify possible intermediate states. In this work, we develop a simple vertical evolution model of how operons evolve from individual component genes and orthologous gene blocks or orthoblocks. Utilizing this model, we present two algorithms to reconstruct ancestral operon states using maximum parsimony. Having reconstructed ancestral states, we identify intermediate functional forms and possible exaptations in reconstructed ancestors of operons.
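The abstract above does not describe its two reconstruction algorithms, so the following is only a generic sketch of maximum parsimony on a single binary character: the classic Fitch bottom-up pass, applied to the presence (1) or absence (0) of one gene in a gene block. The tree shape, species labels and states are hypothetical.

```python
def fitch(tree, leaf_states):
    """Fitch bottom-up pass on a rooted binary tree.

    tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    leaf_states: dict mapping leaf name -> observed state.
    Returns (candidate ancestral states at this node, minimum changes).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # Children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical example: gene present in A, B, D and absent in C.
tree = (("A", "B"), ("C", "D"))
states = {"A": 1, "B": 1, "C": 0, "D": 1}
root_states, changes = fitch(tree, states)
print(root_states, changes)
```

Running the pass once per gene in a block gives a candidate gene content for each ancestor; gene order and block splits, which an operon model would also need, are not handled here.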
Short Abstract: The 14-3-3 family of proteins is largely conserved across eukaryotes. Several 14-3-3 isoforms, including zeta, have been shown to be antigenic in autoimmune disease and cancer. The presence of anti-14-3-3z autoantibodies in humans has been reported, but what makes this constitutive protein an antigen is not known. To identify antigenic determinants, we performed phylogenetic and bioinformatics analyses using multiple sequence alignments and visualizations with VMD, and compared sequences and structures of 14-3-3z. We identified six variable regions that provide specificity to 14-3-3z. Using several MHC class II servers, we identified antigenic epitopes and selected six that overlap with variable regions of 14-3-3z. Structure simulation of the identified epitopes on HLA-DRB1*04 provided details on epitope binding. A couple of potential epitopes show significant similarity with fungal and hypothetical bacterial 14-3-3 (PLoS One 2017). We previously reported a novel use of 14-3-3 inhibitors as antifungals, owing to the high degree of conservation between human and fungal 14-3-3. Our current study suggests that significant conservation of potential epitopes from bacteria to humans may be responsible for antigenicity and molecular mimicry. Our current and future work includes performing in vitro studies to verify the antigenic behavior of each epitope, and finding a way to block it for therapeutic applications.
Short Abstract: Comparative analyses of complete genomes reveal variations in genomic content, even among closely related organisms. Notably, genome similarity decays with evolutionary distance. Any cluster of genomes is therefore associated with a pangenome and a core-genome, which are the total repertoire of genes in the cluster and the set of genes that are common to all genomes in the cluster, respectively. The distribution of gene frequencies typically follows an asymmetric U shape, which consists of a "cloud" of accessory genes, a "shell" of genes with intermediate frequencies, and the core of essential genes that are present in (almost) all genomes. Here, we study a minimal mathematical model for prokaryotic genome content evolution and analyze 34 clusters of closely related genomes. We relate the decay of genome similarity to the gene frequency distribution, and show that the latter can be reconstructed using the model. We find that selection plays a role in maintaining genes in the core-genome.
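The pangenome, core-genome and gene frequency distribution defined above can be computed directly from gene presence data. The sketch below is a small illustration with hypothetical genome and gene names; it does not implement the mathematical model the abstract refers to.

```python
from collections import Counter


def pan_and_core(genomes):
    """genomes: dict mapping genome name -> iterable of gene names."""
    gene_sets = [set(genes) for genes in genomes.values()]
    pangenome = set().union(*gene_sets)          # genes seen anywhere
    core_genome = set.intersection(*gene_sets)   # genes seen everywhere
    return pangenome, core_genome


def gene_frequency_spectrum(genomes):
    """Map n -> number of genes present in exactly n genomes."""
    occurrences = Counter(g for genes in genomes.values() for g in set(genes))
    return Counter(occurrences.values())


# Hypothetical cluster of three genomes.
cluster = {
    "g1": {"a", "b", "c"},
    "g2": {"a", "b", "d"},
    "g3": {"a", "e"},
}
pan, core = pan_and_core(cluster)
spectrum = gene_frequency_spectrum(cluster)
print(sorted(pan), sorted(core), dict(spectrum))
```

With many genomes, plotting the spectrum against n shows the U shape described above: many rare "cloud" genes at low n, few "shell" genes in between, and a peak of core genes at the maximum n.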
Short Abstract: As the number of reference-quality genome assemblies continues to grow, there is an increasing demand for methods to coalesce the contigs from draft assemblies into pseudochromosomes. Most current methods make use of linkage maps, optical maps, or chromatin interaction data (Hi-C); however, these data are expensive, and analysis methods often fail to accurately order and orient a high percentage of assembly contigs. Other approaches utilize contig alignments to a reference genome for ordering and orienting; however, these tools rely on slow aligners and are not robust to repetitive contigs. Here, we present RaGOO, a reference-guided contig ordering and orientation tool that leverages the speed and sensitivity of minimap2 to accurately construct pseudochromosomes. With the pseudochromosomes constructed, RaGOO also identifies structural variants, including structural variants spanning sequencing gaps that are not reported by alternative methods. We first show that RaGOO accurately and quickly orders and orients contigs from a real and a simulated human genome assembly. We then demonstrate the utility of this speed and accuracy with a pan-genome analysis of over 100 Arabidopsis thaliana accessions by comparing structural variants detected after ordering and orientation.
Short Abstract: Biosynthetic gene clusters produce a wealth of secondary metabolites in a wide variety of organisms, from bacteria to plants and fungi. Yeasts are fungal organisms used extensively in the industrial-scale production of metabolites (e.g. biosurfactants, flavour compounds). As such, they are prime candidates for gene cluster mining, both to further exploit the metabolic potential of these organisms and to gain a greater understanding of the evolutionary processes leading to gene cluster formation. The genomes of 792 strains from the UK National Collection of Yeast Cultures (NCYC; http://www.ncyc.co.uk) were searched for industrially-relevant gene clusters using both established bioinformatics tools (e.g. antiSMASH) and ad hoc methods (e.g. clustering of enzyme-coding genes), enabling pros and cons of the two approaches to be inferred. Two gene clusters producing glycolipid biosurfactants were investigated in detail in a group of basidiomycetous yeasts. Contrasting patterns of homology were discovered between the two gene clusters, providing new insights into the processes by which these genetic elements are formed. This ongoing study is highlighting the biosynthetic diversity of a previously unexplored yeast collection. It sets the scene for future wet-lab research (e.g. gene knockouts, expression studies) that will aim to determine the functionality of the computationally predicted gene clusters.
Short Abstract: Orthology analysis is a pre-requisite for comparative genomic studies. However, when performing large-scale analyses, most tools require high memory and CPU usage typically available only in computational clusters. We developed a new tool, SwiftOrtho, that can analyze thousands of genomes effectively on a standard desktop or laptop. SwiftOrtho is a graph-based method that employs Reciprocal Best Hits (RBH) to infer orthology relationships from homologs. Usually, identification of RBH requires many query operations, and a Relational DataBase Management System (RDBMS) or hash table is often employed to speed up query operations. For large-scale data, an RDBMS is too slow, while a hash table consumes too much memory. SwiftOrtho solves these problems by introducing a binary search method, which finds a trade-off between time and space. In our tests, SwiftOrtho was the only tool that finished orthology inference for 1,760 bacterial genomes on a desktop with only 4 GB RAM. We used Orthobench to evaluate the predictive quality of SwiftOrtho and other conventional graph-based orthology analysis tools. SwiftOrtho performed with a competitively high accuracy. SwiftOrtho is intended for orthology analysis of large amounts of genomes where high-end computational resources are unavailable. The software is available from: https://github.com/Rinoahu/SwiftOrtho
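To illustrate how reciprocal best hits can be found with binary search over sorted records rather than a hash table or database, here is a minimal in-memory sketch. It is not SwiftOrtho's implementation; the record layout `(query, subject, score)` and the function names are assumptions for the example.

```python
from bisect import bisect_left


def sort_hits(hits):
    """Sort (query, subject, score) records by query, best score first."""
    return sorted((q, -score, s) for q, s, score in hits)


def best_hit(sorted_hits, query):
    """Binary-search the first (highest-scoring) record for query."""
    # A 1-tuple sorts just before any longer tuple with the same first item.
    i = bisect_left(sorted_hits, (query,))
    if i < len(sorted_hits) and sorted_hits[i][0] == query:
        return sorted_hits[i][2]
    return None


def reciprocal_best_hits(hits):
    """Return the set of unordered gene pairs that are each other's best hit."""
    sorted_hits = sort_hits(hits)
    pairs = set()
    previous = None
    for q, _, s in sorted_hits:
        if q == previous:
            continue  # only the first record per query is its best hit
        previous = q
        if q != s and best_hit(sorted_hits, s) == q:
            pairs.add(tuple(sorted((q, s))))
    return pairs


# Hypothetical similarity-search hits between two genomes.
hits = [
    ("a1", "b1", 100), ("a1", "b2", 50),
    ("b1", "a1", 90),
    ("b2", "a1", 60), ("b2", "a2", 80),
    ("a2", "b2", 70),
]
print(sorted(reciprocal_best_hits(hits)))
```

Each lookup costs O(log n) comparisons and needs no index beyond the sorted records themselves, which is the time-space trade-off the abstract describes.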
Short Abstract: ORFans are protein-encoding genes with no similar genes found in taxonomically related species. Previous studies of multiple strains of a single bacterial pathogen indicate that a significant fraction of each strain’s genetic content is unique. Many of these unique genes are located in pathogenicity islands and may contribute to the pathogenicity of the organism. A comparison of pathogenic and non-pathogenic species will allow for identification of ORFans unique to pathogenic genomes. ORFanDB is a database containing comparative genomic information on ORFans for pathogenic and non-pathogenic bacterial species. The analysis utilizes publicly available data and a newly introduced ORFan classification scheme. ORFans are identified using ORFanFinder (Ekstrom and Yin, Bioinformatics 32 (13), 2053-2055), a computer program that we developed in 2016. A further analysis of sequence features determines which ORFans are located in pathogenicity islands or phages, or are antibiotic resistance genes or virulence factors. These results will offer a better understanding of the contribution these ORFans make to the overall pathogenicity of the bacterial strain. ORFanDB will also allow users to classify and annotate their own ORFans using ORFanFinder.
Short Abstract: Microbes form the fundamental basis of every ecosystem on Earth. As key survival strategies, some microbes adapt to broad ranges of environments, while others specialize in certain habitats. While the ecological roles and properties of such "generalists" and "specialists" have been examined in individual ecosystems, the general principles that govern their distribution patterns and evolutionary processes have not been characterized. Here, we thoroughly identified microbial generalists and specialists across 61 environments via meta-analysis of community sequencing data sets and reconstructed their evolutionary histories across diverse microbial groups using the Binary-State Speciation and Extinction (BiSSE) model. This revealed that generalist lineages possess 19-fold higher speciation rates and a significant persistence advantage over specialists. Yet, we also detected three-fold more frequent generalist-to-specialist transformations than the reverse transformations. These results support a model of microbial evolution in which generalists play key roles in introducing new species and maintaining taxonomic diversity.
Short Abstract: Gene fusions have been shown to have potential as targets for successful cancer therapies but have often eluded exploitation due to their inconsistent disease penetrance. The currently accepted gold standard for identifying driver fusions in cancer is their rate of recurrence among the patient population. However, many fusions are "private" to individual patients and are thus usually excluded from driver candidacy. With the recent reduction in sequencing costs and the advent of new high-throughput single cell RNA sequencing techniques, it is now possible to trace cancer cell lineage within a single tumor and explore the evolutionary process of mutation and translocation, identifying conserved lesions that are likely to drive cancer growth and drug resistance on a personal level. Here we analyzed a public single cell RNA-Seq dataset of 5 primary glioblastoma (GBM) tumors and two GBM cell lines, with over 8000 gene expression levels profiled from 96 to 192 cells from each tumor, searching for gene fusions harbored by many cell clones within a tumor. Our findings indicate several significantly conserved fusions, including a CDK6 readthrough (cell cycle) harbored in 8 of 192 cells and a CNOT1-MALAT1 fusion (TMZ drug resistance) in 11 of 96 clones.
Short Abstract: Phylogenomics, the use of large datasets to examine phylogeny, has revolutionized the study of evolutionary relationships. However, genome-scale data have been unable to resolve many relationships in the tree of life; this could reflect the poor fit of the models used to analyze heterogeneous datasets. That heterogeneity is likely to have many explanations. However, it seems reasonable to hypothesize that the different patterns of selection on proteins based on their structures might represent a source of heterogeneity. To test that hypothesis, we developed an efficient pipeline to divide phylogenomic datasets into partitions based on secondary structure and relative solvent accessibility, using the deepest branches in animal phylogeny as a model system. Sites located in different structural environments did support distinct tree topologies. The most striking differences in phylogenetic signal appeared to reflect relative solvent accessibility; analyses of sites on the surface of proteins yielded a tree that placed ctenophores sister to all other animals, whereas the sites buried inside proteins yielded a tree with a sponge-ctenophore clade. These results indicate that understanding the constraints due to protein structure is likely to improve phylogenetic estimation.
Short Abstract: Aberrant variations are frequent in the clonal genome evolution of cancers, and such variations often act as drivers of cancer cell growth. The fundamental evolutionary dynamics underlying these variations in tumor metastasis remain understudied owing to their genetic complexity. Recently, whole genome sequencing has made it possible to determine genome variations during short-term evolution of cell populations. This approach has been applied to evolving populations of unicellular organisms, including yeast. Examining sequence changes at such fine-scale resolution represents substantial progress in evolutionary genomics. However, existing statistical tests for analyzing temporal changes in variation across multiple time points are limited in their ability to identify the full spectrum of intermediate changes. We designed a new statistical approach based on the Kolmogorov-Smirnov test and integrated it into a software tool for determining variation patterns at fine-scale temporal resolution in experimental evolution studies. We validated our method using simulated data and analyzed yeast (Saccharomyces cerevisiae) W303 strain genomes from 40 populations at 12 time points using our software tool.
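The abstract above does not say how the Kolmogorov-Smirnov test is adapted, so the following only sketches the standard two-sample KS statistic, the largest gap between two empirical cumulative distributions. Applying it to allele frequencies observed at two time points is an assumption made for illustration, as are the example values.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic (max ECDF difference)."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    largest_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both ECDFs past every observation equal to x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        largest_gap = max(largest_gap, abs(i / len(a) - j / len(b)))
    return largest_gap


# Hypothetical variant allele frequencies at an early and a late time point.
early = [0.01, 0.02, 0.02, 0.05, 0.10]
late = [0.20, 0.35, 0.40, 0.60, 0.90]
print(ks_statistic(early, late))
```

A significance level for the statistic would normally come from the KS distribution or from permutation; that step is left out of this sketch.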
Short Abstract: Transmission and evolution of multi-drug resistant organisms (MDROs) represents a global public health threat. The CDC considers the MDRO carbapenem-resistant Klebsiella pneumoniae (CRKP) an urgent threat due to limited treatment options. We aim to understand how this nosocomial pathogen spreads between hospitals and evolves antibiotic resistance. Whole genome sequences (WGS) of ~400 isolates from 11 long-term acute care hospitals in the Los Angeles area were supplemented with geographically diverse isolates to understand how CRKP entered and spread through the region. Variants associated with antibiotic resistance were overlaid on the transmission network to discern where resistance emerged and how it spread between facilities. Using WGS, we elucidated a regional pathogen transmission network in an endemic setting. Transmission into the area occurred multiple times and although inter-facility transmission is common, intra-facility transmission drives disease prevalence. Resistance to the last-line drug colistin arose and disseminated through single nucleotide variants, indels, and large insertions in known resistance genes. This research demonstrates that WGS of pathogen isolates provides sufficient information to reconstruct transmission networks in endemic areas and thus could guide infection prevention efforts. Additionally, the enhanced understanding of antibiotic resistance evolution provided by WGS may inform antibiotic stewardship to reduce MDRO emergence and prevalence.
Short Abstract: In genome evolution, for simplicity, each gene is usually treated as an evolutionary unit, particularly when understanding and reconstructing the gene families within a set of closely related species, although many genes have mosaic origins. To comprehensively explore the pattern of gene evolution at the subgene level, we used a set of nine closely related, high-quality Drosophila genomes to characterize all putative constituent units, called modules. From the 111,641 annotated proteins, 22,861 modules were identified, and each protein was composed of one or more different modules without any gaps. Based on the module organization, 24,312 different module architectures (MAs) were obtained, and 14,318 connected components were then derived from the module architecture network created using an integrated similarity index. With the known species phylogeny, we will explore the evolutionary patterns of the genes at both the module and the architecture levels. Moreover, we will reconstruct complete gene families in which not only evolutionary events such as gene duplication/loss, but also those that occurred at the subgene level, such as module insertion and gene fusion/fission, are reflected as comprehensively as possible. These results may allow us to better understand the evolution of protein-coding genes.
Short Abstract: Methicillin-resistant Staphylococcus aureus (MRSA) is one of the most common antibiotic-resistant pathogens worldwide. In the United States, molecular type USA300 causes the majority of community-associated infections and has become a major source of infections in hospitalized patients. In addition to causing invasive infection, MRSA can asymptomatically colonize its host for extended periods. However, the factors determining whether a colonized individual will progress to infection are unclear. We sought to understand the role of intra-patient evolution in causing bloodstream infection from colonization. To do so, we performed whole genome sequencing and collected patient metadata for USA300 MRSA colonization isolates and bloodstream infection isolates from Cook County Hospital (Chicago, IL). We inferred variants that accumulated within the individual using a phylogenetic approach. We then conducted a genome-wide association study to identify genetic signatures of intra-patient evolution associated with colonization or infection. We identified a number of metabolic pathways enriched in mutations preferentially in infection isolates. We will next elucidate patient factors that drive these signatures of selection using statistical methodology. Our findings will contribute to the understanding of MRSA pathogenesis at the level of the bacteria and the patient.
Short Abstract: In comparative genomics, phylogenetic analyses of gene expression have great potential for addressing a wide range of biological questions. A notable example is the recently controversial “Ortholog Conjecture” (OC), which assumes that paralogs evolve functionally faster than orthologs. Using pairwise comparisons of tissue specificity, we proposed support for the OC [1]. Using the phylogenetic comparative method (PCM) on time-calibrated empirical gene trees, Dunn et al. [2] found that support for the OC was lost. We performed simulations under different models of evolution overlooked by Dunn et al., notably changes in sequence evolutionary rate and models of functional shift. Although the PCM is more sensitive, concomitant increases in the evolutionary rates of sequences and of traits after duplication could lead to a false rejection of the OC. Interestingly, the pairwise method outperforms the PCM in this scenario for all models of evolution. To revalidate our result empirically, we used fish whole genome duplicate gene trees and transcriptomes from 11 fishes. Contra Dunn et al. [2], we found that the OC holds true for fish ohnologs. References: [1] Kryuchkova-Mostacci N, Robinson-Rechavi M (2016). PLoS Comput Biol. 12: e1005274. [2] Dunn CW, Zapata F, Munro C, Siebert S, Hejnol A (2018). Proc Natl Acad Sci U S A. http://www.pnas.org/cgi/doi/10.1073/pnas.1707515115
Short Abstract: As phylogenetic tools are error prone, it is a common task to correct an initial tree T according to the reconciliation cost with a given species tree. The most natural way is to preserve all well-supported branches, according to a given statistical support, while allowing all weakly supported branches to be modified. The underlying representation is a 0-1 labeling of the edges of T, where 0 indicates low support and 1 high support. In addition, if the tree T contains a set of subtrees whose topologies are trusted, they should be preserved in the corrected tree. We have developed LabelGTC, a general algorithmic framework for correcting a gene tree, which is a combination of polytomy resolution and supertree methods. We tested LabelGTC on vertebrate gene families for which a Gold Standard tree is given in the SwissTree project. For each family, we also considered the EnsemblCompara tree, and the one corrected by LabelGTC. Terminal preserved subtrees were chosen by computing a functional similarity score on branches based on Gene Ontology annotations of genes. According to Gold Standard trees, statistical likelihood, reconciliation cost and biological relevance, we show that LabelGTC is able to appropriately correct EnsemblCompara gene trees.
Short Abstract: Since the discovery of an altered genetic code in vertebrate mitochondria, several deviations from the standard translation code have been reported in both nuclear and organelle genomes across multiple domains of life. While codon reassignment is actively studied in yeast and animal mtDNAs, it remains understudied in green plant mitochondria. We have performed an in-depth study of codon reassignment across 36 green plant mtDNAs. Our results rely on a new conceptual framework for studying genetic code evolution that explicitly accounts for the effect of codon reassignments on the disparity between DNA and protein sequences, as well as for tRNA evolution through duplications, losses and remolding. By applying this framework to green plants, we have identified, for the first time, sense-to-sense codon reassignments in several chlorophyte mtDNAs. Our results show that the emergence of new tRNAs(CCU) through tRNA remolding has allowed arginine AGG codons to be decoded as alanine in several Sphaeropleales mtDNAs. In Chromochloris zofingiensis, we report an unusual reassignment of the arginine AGG codon to methionine, while the synonymous codon CGG is decoded as leucine. Finally, although the underlying mechanism is still unknown, we have found strong sequence-based evidence for the reassignment of isoleucine AUA codons to methionine in Pycnococcus provasolii.
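The practical effect of a sense-to-sense reassignment can be shown by translating the same DNA under the standard code and under a modified code. The sketch below builds the standard table from its usual 64-letter summary string and applies the AGG-to-alanine reassignment reported above as an override; the function name, the `reassignments` parameter and the test sequence are assumptions for illustration, and this is not the detection framework described in the abstract.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order at each position.
STANDARD_AA = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), STANDARD_AA)
}


def translate(dna, reassignments=None):
    """Translate DNA codon by codon, applying optional codon overrides."""
    code = dict(STANDARD_CODE)
    code.update(reassignments or {})
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(code[dna[i:i + 3]] for i in range(0, usable, 3))


seq = "ATGAGGGCTTAA"  # hypothetical coding sequence
print(translate(seq))                 # standard code
print(translate(seq, {"AGG": "A"}))   # AGG read as alanine
```

Comparing the two translations against conserved positions in homologous proteins is one simple way such a reassignment could become visible in sequence data.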
Short Abstract: The phage shock protein (Psp) stress-response system protects bacteria from envelope stress and stabilizes the cell membrane. The key effector protein, PspA, is found in diverse bacterial, archaeal and plant phyla. Despite the prevalence of the functional Psp system, the various genomic contexts of Psp proteins, as well as their evolution across the kingdoms of life, have not been characterized. We developed a computational pipeline for comparative genomics and protein sequence-structure-function analyses to identify sequence homologs, phyletic patterns, domain architectures, gene neighborhoods, sequence conservation and evolution of the proteins and domains of interest across the tree of life (~6000 completed genomes). We first determined PspA-containing species across the tree of life followed by the other known cognate partners of PspA (pspBC, pspMN, liaIGF). Using contextual information from conserved gene neighborhoods and their domain architectures, we delineated the phyletic patterns of all the Psp members. In addition to systematically identifying all possible ‘flavors’ and neighborhoods of the known Psp systems, we could also trace their evolution (e.g., back to LUCA for PspA) leading us to several interesting observations as to their occurrence and co-migration, suggesting their function and role in stress-response systems dependent and independent of PspA that are often lineage-specific.
Short Abstract: Affinity maturation in B cells is an evolutionary process of descent from a germline sequence through somatic hypermutation (SHM), and clonal selection. Because of the similarity between affinity maturation and evolution in natural populations, phylogenetics has a long history of use in studying B cell clonal lineages. Common applications include constructing B cell clonal lineage trees, quantifying SHM and clonal selection, and reconstructing unobserved intermediate sequences and germline alleles. However, affinity maturation presents problems for model-based phylogenetic analysis. SHM is biased by sequence context, violating important assumptions in most substitution models. We previously introduced the HLP17 substitution model which accounted for this by explicitly modelling SHM bias. Unfortunately, this model is highly parametric, making parameter estimation imprecise on small lineages. To address this, we allow certain parameters to be shared among lineages within a repertoire, allowing for far greater power in parameter estimation, and for joint reconstruction of lineage trees using all available sequence information in the repertoire. We show how confidence intervals are estimated for these parameters, and how they may be used to detect processes such as dysregulation of clonal selection and SHM. All of these features are implemented within the program IgPhyML.
Short Abstract: Metastasis is the migration of cancerous cells from a primary tumor to other anatomical sites. While metastasis was long thought to result from monoclonal seeding, or single cellular migrations, recent phylogenetic analyses of metastatic cancers have reported complex patterns of cellular migrations between sites, including polyclonal migrations and reseeding. However, accurate determination of migration patterns from somatic mutation data is complicated by intra-tumor heterogeneity and discordance between clonal lineage and cellular migration. We introduce MACHINA, a multi-objective optimization algorithm that jointly infers clonal lineages and parsimonious migration histories of metastatic cancers from DNA sequencing data. MACHINA analysis of data from multiple cancers reveals that migration patterns are often not uniquely determined from sequencing data alone, and that complicated migration patterns among primary tumors and metastases may be less prevalent than previously reported. MACHINA’s rigorous analysis of migration histories will aid in studies of the drivers of metastasis.
Short Abstract: Since their discovery in yeast in 1996, orphan genes – also known as ORFans or taxonomically restricted genes (TRGs) – have continued to gain in interest within comparative and evolutionary biology. TRGs are defined as DNA coding sequences (open reading frames) found in a single species or genus, which lack orthologous sequences in other groups at corresponding taxonomic ranks. TRGs have been found in every genome thus far sequenced, and often encode functionally important, species-specific traits. Current theory holds that genes in different species result from gene duplication and recombination, and thus the widespread distribution of TRGs in sequenced genomes represents a paradox. To understand this phenomenon, we have developed a standalone, web-based computational algorithm (ORFanID) that can identify orphans from multiple or single gene products of an amino acid sequence in the FASTA format with their NCBI gene ID (GI) from various species, organismal and viral. This software engine can be further restricted to search by taxonomy level of the selected organism or by the organism name. Accuracy of the results can be adjusted based on e-value and other significant parameters and can be reported in tabular and graphical formats. TRGs identified from important model species will be presented.
Short Abstract: Inspired by recent efforts to model cancer evolution with phylogenetic trees, we consider a variant of the problem of building a consensus tree from a set of conflicting phylogenetic trees. This variant considers features present in tumor phylogenies such as labels on internal vertices and allowing multiple mutations to label a single vertex. Our method to solve this problem uses a weighted directed graph where vertices are sets of mutations and edges are weighted by the number of times a parental relationship is observed between their constituent mutations in the set of input trees. We find a maximum weight spanning arborescence in this graph and prove that the resulting tree minimizes the total distance to all input trees for one particular distance metric. We evaluate our method using both simulated and real data. Using a set of phylogenetic trees derived from both whole-genome and deep sequencing data from a Chronic Lymphocytic Leukemia patient we find that our approach identifies a new phylogeny that minimizes multiple distance metrics. Lastly, we show that our approach is applicable to problems outside cancer evolution by applying our method to a set of trees that describe the movement of transposable elements in a genome.
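The graph construction described above can be illustrated with a minimal sketch. It makes several assumptions that are not from the abstract: each vertex carries a single mutation rather than a set, every input tree is given as a hypothetical child-to-parent dictionary over the same mutations and the same root, and the maximum weight spanning arborescence is found by exhaustive search, which is only practical for tiny examples. A polynomial-time method such as the Chu-Liu/Edmonds algorithm would be needed for realistic inputs.

```python
from collections import Counter
from itertools import product


def parent_counts(trees):
    """Count how often each (parent, child) relationship appears across trees."""
    counts = Counter()
    for tree in trees:
        for child, parent in tree.items():
            counts[(parent, child)] += 1
    return counts


def reaches_root(vertex, parent_of, root):
    """Follow parent pointers; return False if a cycle is hit before the root."""
    seen = set()
    while vertex != root:
        if vertex in seen:
            return False
        seen.add(vertex)
        vertex = parent_of[vertex]
    return True


def max_weight_arborescence(trees, root):
    """Exhaustively pick one observed parent per vertex, keeping the heaviest tree."""
    counts = parent_counts(trees)
    vertices = sorted({v for edge in counts for v in edge} - {root})
    # Candidate parents of v are those observed as its parent in some input tree.
    options = [[p for (p, c) in counts if c == v] for v in vertices]
    best, best_weight = None, -1
    for choice in product(*options):
        parent_of = dict(zip(vertices, choice))
        if not all(reaches_root(v, parent_of, root) for v in vertices):
            continue  # contains a cycle, so it is not an arborescence
        weight = sum(counts[(p, v)] for v, p in parent_of.items())
        if weight > best_weight:
            best, best_weight = parent_of, weight
    return best, best_weight


# Three hypothetical, conflicting tumor phylogenies (child -> parent).
input_trees = [
    {"a": "root", "b": "a", "c": "a"},
    {"a": "root", "b": "a", "c": "b"},
    {"a": "root", "b": "root", "c": "b"},
]
consensus, weight = max_weight_arborescence(input_trees, "root")
print(consensus, weight)
```

In this toy input the selected consensus keeps the parental relationships that are observed most often across the three trees.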
Short Abstract: The emergence of new gene editing tools such as CRISPR-Cas and TALENs has heightened the need to develop new and efficient transformation technologies. However, maize transformation is not straightforward. Not all lines are easily transformed, and those that are not cannot be directly subjected to genome editing. The maize genome reference sequence is currently based on the B73 inbred line, which is not readily transformed. Thus, the motivation for this project is to assemble the genome of a maize inbred line that is highly similar to B73 but transformable. B104 meets these criteria. The availability of a B104 genome sequence would allow mapping of genetic variants between B73 and B104 and help illuminate the genetic architecture of the transformability trait. Here we report progress toward a de novo hybrid assembly of the B104 genome using PacBio and Illumina mate-pair and paired-end data. Additionally, we describe our strategy for annotation of B104 gene models using the Maker-P pipeline, informed by RNA-Seq data collected across ten distinct plant tissues. This new assembly of B104 will represent an improvement in both sequence contiguity and completeness of gene annotations relative to the existing draft assembly and will provide basic insights into the genetic basis of plant transformation.
Short Abstract: Computational prediction of gene function is ubiquitously used in newly sequenced genomes, where experimental validation is costly or unfeasible. These approaches often rely on the assumption that orthologs maintain similar functions across species, whereas their duplicated paralogs are relaxed from this constraint and evolve to acquire new functions. Following the growing literature and debate on this topic, we revisit this problem (known as the ortholog conjecture) on two pairs of species with recent experimentally supported functional annotations, implementing an information-theoretic based distance metric over the directed acyclic graphs to calculate functional similarities. In both species pairs evaluated in this study (Human vs. Mouse and S. cerevisiae vs. S. pombe), within-species paralogs remain functionally more similar compared to across-species homologs, even after accounting for several confounding factors. We quantify the contribution for different types of homologous relations in protein function prediction and show that function transfer from within-species homologs is preferable when proteins have both orthologs and inparalogs. Additionally, we quantify the fraction of gene families that benefit from such functional transfer to estimate the potential impact of our findings. Our results, consistent across Biological Process and Molecular Function ontologies, suggest inclusion of paralogs in functional predictions would improve function transfer.
Short Abstract: Alternative splicing presents a challenge for traditional multiple sequence alignment (MSA) tools when aligning protein isoforms. Mirage is a novel tool that accurately aligns isoforms by first mapping proteins to their encoding genomic sequence, and then aligning proteins to one another based on the genomic coordinates of their constitutive codons. Mirage's resulting MSAs display the underlying exonic structures of individual isoforms. Mirage combines original implementations of alignment and graph algorithms with existing software tools to maximize the number of protein sequences that successfully map back to their species' genomes. The memory overhead of Mirage is low enough to run on a standard desktop computer and the runtime of Mirage is competitive with popular MSA tools. The isoform MSAs produced by Mirage are significantly more accurate than those produced by existing MSA tools. Mirage is now being used for sequence alignment in phosphosite.org, a web service for understanding post-translational modification of protein sequences. Mirage alignments can help identify annotation errors and have revealed the ubiquity of "alternative reading frames" (ARFs) in which discrete exons encode multiple open reading frames as overlapping spliced segments of genomic sequence that are frameshifted. Mirage has identified putative ARFs in 7% of human genes.
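The idea of aligning isoforms through the genomic coordinates of their codons can be sketched as follows. This is not Mirage's implementation; it is a simplified illustration that assumes each residue has already been mapped to the genomic start coordinate of its codon, that all isoforms lie on the forward strand of the same gene, and that no codon is split across an exon boundary. The isoform names and coordinates are hypothetical.

```python
def align_by_genomic_coords(isoforms):
    """Build alignment rows by placing each residue in the column of its codon.

    isoforms maps an isoform name to a list of (amino_acid, codon_start) pairs.
    """
    # One alignment column per distinct codon coordinate, in genomic order.
    columns = sorted({coord for residues in isoforms.values() for _, coord in residues})
    aligned = {}
    for name, residues in isoforms.items():
        by_coord = {coord: aa for aa, coord in residues}
        # Coordinates an isoform does not use (e.g. a skipped exon) become gaps.
        aligned[name] = "".join(by_coord.get(c, "-") for c in columns)
    return aligned


# Hypothetical gene with three exons; isoform_2 skips the middle exon.
isoforms = {
    "isoform_1": [("M", 100), ("K", 103), ("T", 200), ("L", 203), ("E", 300), ("R", 303)],
    "isoform_2": [("M", 100), ("K", 103), ("E", 300), ("R", 303)],
}
aligned = align_by_genomic_coords(isoforms)
for name, row in aligned.items():
    print(name, row)
```

Because columns follow genomic position, the gap block in the second row corresponds directly to the skipped exon, which is how such an alignment can expose the exon structure of each isoform.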
Short Abstract: In recent years, networks have emerged as an alternative to phylogenetics for depicting relationships among bacteriophage (phage) genomes and between phages and their hosts. Here, we construct a network of genes shared among RefSeq virus genomes (including eukaryotic and archaeal viruses), as well as metaviromic contigs identified by IMG/VR from the Human Microbiome Project. In this network, each node represents all homologs of a particular gene, and two nodes are connected if the two genes are ever found in the same genome. Overall, this network is dominated by one large connected component, which includes most bacteriophages and archaeal viruses, as well as a subset of eukaryotic viruses. Interestingly, not all archaeal viruses cluster together, and two subclusters contain genes with homologs found in both bacterial and archaeal viruses. Further, genes from metaviromic contigs form a large set of unique groups, emphasizing the need for additional lab isolation. Specific genes also stand out as hubs connecting viruses from diverse hosts. This network approach can be updated readily as new sequences are made available, and metadata for each genome or contig can be associated with each node in the network, enabling analysis of associations between gene sharing and viral ecology.
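A minimal sketch of the network construction described above, under stated assumptions: homologous genes have already been grouped into families, each genome is represented only by the set of family identifiers it contains, and the genome and family names are hypothetical. Nodes are gene families, and two families are linked when they occur together in at least one genome.

```python
from collections import defaultdict
from itertools import combinations


def gene_sharing_network(genomes):
    """Return an adjacency dict linking gene families that co-occur in a genome."""
    adj = defaultdict(set)
    for families in genomes.values():
        for fam in families:
            adj[fam]  # touching the key registers families with no partners
        for a, b in combinations(sorted(families), 2):
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)


def connected_components(adj):
    """Return the connected components of the network, largest first."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        comps.append(comp)
    return sorted(comps, key=len, reverse=True)


# Hypothetical genomes, each reduced to its set of gene families.
genomes = {
    "phage_A": {"capsid", "terminase", "integrase"},
    "phage_B": {"terminase", "portal"},
    "archaeal_virus_C": {"portal", "atpase"},
    "euk_virus_D": {"polymerase", "helicase"},
}
network = gene_sharing_network(genomes)
components = connected_components(network)
print([sorted(c) for c in components])
```

In this toy input the shared terminase and portal families join the two phages and the archaeal virus into one component, while the eukaryotic virus forms its own.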
Short Abstract: How do new gene functions originate? This is one of the most intriguing questions of biology. Ancient genes change their functionality over time by gene duplication and divergence. More recently, it has become apparent that new genes (orphan genes) arise de novo from within the genome. To decipher the origin of new genes and understand how they may function, we conducted an in-depth analysis of the human genome and transcriptome. We compiled a list of putative orphan genes in the human genome using genes from Ensembl (v. 91), orphan genes from the literature, and 10,000 randomly selected ORFs over 200 nt, some of which might be unannotated orphan genes. We created a pipeline to download 8,619 runs of public RNA-seq data (~15 TB) and metadata from the NCBI-SRA/GEO databases and mapped the human transcriptome from Ensembl and our list of putative orphan genes to these data to get a comprehensive expression profile for the human transcripts. Meta-analysis of these data showed that orphan genes and some unannotated ORFs are transcribed at rates similar to those of ancient genes. Our work uses massive amounts of RNA-seq data and metadata to provide a powerful approach to identifying potential new genes and developing hypotheses about their functions.
Short Abstract: Inconsistencies in cell line-based studies jeopardize reproducibility of cancer research. Natural evolution leading to genetic and transcriptional heterogeneity within cancer cell lines may contribute to such inconsistencies. We therefore performed genetic and transcriptomic characterization of 27 strains of the common breast cancer cell line MCF7 and assessed the response of those strains to 321 anti-cancer compounds. We found extensive genomic variation across strains, resulting in disparate drug responses. Genetic variation occurred at all levels – point mutations, rearrangements, and copy number changes – and affected multiple oncogenes and tumor suppressor genes. Similar observations were obtained across 23 strains of the lung cancer cell line A549, as well as in multiple other cell lines, indicating that genomic variation is a general property of cancer cell lines. Genetic changes resulted in substantial differences in gene expression programs, cell morphology and proliferation, and strikingly high variability in drug response. Over 75% of the compounds that exhibited strong activity in some of the strains were completely inactive in others. Genomic analyses of single cell-derived clones showed that ongoing instability quickly translated into cell line heterogeneity, even in cell populations originating from a single cell. These findings have broad practical implications for cell line-based research.
Short Abstract: Nucleic acid binding proteins (NBPs), i.e., transcription factors (TFs), RNA-binding proteins (RBPs), and DNA- and RNA-binding proteins (DRBPs), regulate genes by interacting with DNA, RNA, or both, respectively. They cooperate closely with each other and perform vital functions at every stage of gene regulation. To facilitate our understanding of gene regulation mediated by nucleic acid binding proteins, we constructed the Eukaryotic Nucleic acid binding Protein Database, which holds the TF and RBP repertoires of various organisms, using a newly developed pipeline that analyzes sequenced transcriptomes and genomes. From transcriptomes and genomes available for more than 1600 and 800 eukaryotic species, respectively, 662 NBP families with more than 1.8 million NBPs were predicted for a total of 2211 eukaryotic species, making this the largest NBP database that records TFs, RBPs and DRBPs. This database provides a key resource for evolutionary and functional studies of gene regulation. The evolutionary relationships of TFs, RBPs and DRBPs will also be further investigated using bioinformatic and phylogenetic approaches.
Short Abstract: Mitochondria are sub-cellular organelles in most eukaryotic cells that possess their own genome. However, this genome encodes only a minute portion of mitochondrial proteins, while the vast majority of them are encoded in the nucleus and imported into the organelle. Thus, understanding the mitochondrial proteome requires identification of nuclear-encoded mitochondrial proteins. Characterization of mitochondrial proteomes has so far been limited to several bilaterian species and a few non-metazoan eukaryotes. In this analysis, two bioinformatic approaches (reciprocal best BLAST hit and mitochondrial targeting presequence prediction) were used to identify mitochondrial proteomes in four non-bilaterian phyla: Porifera, Cnidaria, Ctenophora and Placozoa. The inferred mitochondrial proteomes in non-bilaterian animals are diverse with respect to size and content. The calcarean sponges possess the largest inferred mitochondrial proteomes, while the myxozoans possess the smallest. We identified a set of 850 common mitochondrial proteins present in all non-bilaterian phyla as well as mammals and 57 mitochondrial proteins found only in humans, of which several are involved in the apoptotic pathway in mammals. We also analyzed mitochondrial targeting presequences from the non-bilaterian species. Overall, our analysis provides the first step in understanding mitochondrial proteomes in non-bilaterian animals and the evolution of the animal mitochondrial proteome.
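The reciprocal best hit criterion named above can be sketched in a few lines. This is an illustration only: it assumes the BLAST searches in both directions have already been run and parsed into (query, subject, bit score) tuples, and all protein identifiers and scores are hypothetical. A pair is kept when each protein is the other's top-scoring hit.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits is an iterable of (query, subject, bitscore) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Return (query, subject) pairs that are each other's best hit."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)


# Hypothetical parsed results: human mitochondrial proteins vs. a sponge proteome.
human_vs_sponge = [
    ("hs_ATP5A", "sp_001", 850.0),
    ("hs_ATP5A", "sp_002", 300.0),
    ("hs_COX4", "sp_003", 210.0),
    ("hs_BAX", "sp_002", 95.0),
]
sponge_vs_human = [
    ("sp_001", "hs_ATP5A", 845.0),
    ("sp_002", "hs_ATP5A", 310.0),
    ("sp_003", "hs_COX4", 205.0),
]
pairs = reciprocal_best_hits(human_vs_sponge, sponge_vs_human)
print(pairs)
```

Here hs_BAX is dropped because its best sponge hit points back to a different human protein, which is the kind of one-directional match the reciprocal requirement filters out.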
Short Abstract: The mammalian CIP/KIP family proteins p21, p27, and p57 are intrinsically disordered proteins (IDPs) that regulate various cellular processes such as cell cycle progression, apoptosis, transcription, cell migration, and cytoskeletal dynamics. These proteins possess conformational flexibility that enables them to perform numerous functions. However, IDPs are known to generally evolve rapidly and exhibit high rates of insertion and deletion. To understand the functional adaptability of the CIP/KIP ensemble, we predicted the disorder propensity as well as the propensity of O-GlcNAcylation and phosphorylation sites, and calculated the rates of evolution per site in the vertebrate CIP/KIP proteins. The analyses revealed that the cyclin-dependent kinase inhibitor (CDI) domain sequences and some protein modification sites remain highly conserved in all the CIP/KIP proteins. However, order-disorder transitions were observed for p21 and p57 in the CDI domain, indicating a high possibility of functional adaptation. Further, a disordered region located beyond the CDI domain towards the C-terminus exhibits a high rate of evolution, except for some conserved modification sites, though the disorder propensity is noticeably conserved for p21 and p27. These findings thus indicate that the structural flexibility of IDPs is important for the functional versatility of these proteins.
Short Abstract: Pairwise correlations between sites in deep multiple sequence alignments have been used successfully to predict 3D protein structure and protein-protein interactions using statistical physics models called Potts models. We are interested in whether these models can also be employed to improve the sensitivity of remote homology search methods, extending current 1D profile methods to include higher-order sequence correlations. To extend Potts models to homology search, two challenges need to be addressed. First, we want to know that pairwise correlation structures are conserved in remote homologs outside the clade on which the model was trained, which is not necessarily true if pairwise covariation signals result from complex epistatic effects specific to a particular phylogeny. Second, Potts models assume that alignments are given, whereas homology search applications require a model of insertion/deletion processes and an alignment algorithm. To address the first issue, we are performing benchmark experiments on simple, contrived datasets and comparing the performance of Potts models to state-of-the-art homology search tools. For the second, we are using a type of model called a hidden Potts model (HPM) that couples a Potts emission process to a generative probability model of insertions and deletions.
Short Abstract: Most of the vertebrate genome is derived from mostly-ancient replication of transposable elements (TEs). The thorough annotation of TEs is critical to genome annotation pipelines. Historically, the fixation on complete annotation of TEs was due to the havoc they otherwise wreak on genome analysis. However, TE annotation is increasingly valued because of the role it can play in understanding phylogenetic and population dynamics, and due to the role TEs have played in the evolution of gene function and regulation. TEs are organized into families, each corresponding to a collection of replicates derived from a single ancestral sequence. Over evolutionary time, families can experience bursts of replication activity, and the results of these bursts can be annotated as belonging to different subfamilies. Subfamily sequences are often highly similar, complicating annotation. We have found that TE replicates (in segmental duplications, or in human-chimp homology pairs) are annotated as belonging to different subfamilies more than 10% of the time; we call this "discordant classification". We have developed new methods of incorporating confidence estimates into subfamily annotation, which help explain discordant classification. These confidence estimates can be built directly into annotation pipelines, improving genome annotation and downstream analysis.
Short Abstract: Epistasis, the non-additive contribution of single amino acid substitutions to fitness, is one of the most important factors in molecular evolution. Epistasis can be described in different ways; among other types, researchers use the concepts of higher-order and multidimensional epistasis. Here we propose new methods for studying epistasis. First, we present a method for finding all hypercube structures in a huge protein genotype-phenotype map produced by random mutagenesis, which is valuable for studying higher-order epistasis. Second, we extend the framework for working with epistasis by finding hyperrectangles rather than just hypercubes. Finally, we present a new type of multidimensional epistasis, which turned out to be abundant in high-throughput experimental data.
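To make the hypercube notion concrete, here is a minimal brute-force sketch. It is not the authors' method, which is not described in the abstract and would need to scale to far larger maps. The sketch assumes each genotype is a frozenset of substitutions relative to the wild type, that fitness is on an additive scale, and that a hypercube of dimension n is a background genotype plus n substitutions for which all 2^n combinations were measured. The substitution names and fitness values are hypothetical.

```python
from itertools import combinations


def find_hypercubes(measured, dim):
    """Return (background, substitutions) pairs whose 2**dim corners are all measured."""
    cubes = []
    all_mutations = set().union(*measured)
    for base in measured:
        # Only substitutions whose single-step neighbor exists can span a cube.
        candidates = sorted(m for m in all_mutations - base if base | {m} in measured)
        for muts in combinations(candidates, dim):
            corners = [
                base | frozenset(subset)
                for r in range(dim + 1)
                for subset in combinations(muts, r)
            ]
            if all(corner in measured for corner in corners):
                cubes.append((base, muts))
    return cubes


def pairwise_epistasis(measured, base, muts):
    """Second-order epistasis on a 2-cube: f(ab) - f(a) - f(b) + f(background)."""
    a, b = muts
    return (
        measured[base | {a, b}]
        - measured[base | {a}]
        - measured[base | {b}]
        + measured[base]
    )


# Hypothetical genotype-phenotype map (genotype -> fitness).
fitness = {
    frozenset(): 1.0,
    frozenset({"A10V"}): 0.8,
    frozenset({"G25S"}): 0.9,
    frozenset({"A10V", "G25S"}): 1.2,
    frozenset({"L40F"}): 0.5,
}
squares = find_hypercubes(fitness, 2)
background, substitutions = squares[0]
print(substitutions, pairwise_epistasis(fitness, background, substitutions))
```

Incomplete combinations, such as those involving L40F here, are skipped because not every corner was measured; this sparsity is typical of maps produced by random mutagenesis.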
Short Abstract: Sequence alignment is fundamental to modern molecular biology. Our work addresses an important aspect of sequence alignment: avoidance of incorrect sequence annotation due to alignment overextension. This occurs when software correctly identifies that a substring of one sequence is related to (aligns to) a substring of another sequence, but incorrectly claims that flanking regions of the two sequences are also related. The impact of overextension is substantial - for example, in the annotation of transposable elements in the human genome, we have estimated that 2% of the annotated genome is the result of overextension. Current methods used to combat overextension are only somewhat effective, and can have the unintended consequence of reducing search sensitivity and over-trimming the alignment. In our research, we developed two prototype methods for mitigating overextension. One uses a simple hidden Markov model (HMM), and the second employs Convolutional Neural Networks (CNN). We benchmark these techniques using an artificial sequence dataset that mimics transposable elements inserted into simulated genomic sequence. Both approaches yield a dramatic decrease in overextension with a minimal amount of over-trimming.
Short Abstract: Over many replication cycles, bacterial cells accumulate random mutations, which result in the birth of new clones. As a result of this clonal expansion, bacterial samples from different time points differ in their clonal composition. Accurately inferring the genotypes of these clones, the clonal frequencies and the clonal evolutionary history from bacterial samples will help characterize the evolutionary forces acting on sets of mutations rather than on individual mutations. In this study, we investigate the computational problem of inferring this information from variant allele frequencies observed in samples collected from the same population at different time points. The problem of clonal reconstruction from mixture samples without a chronological order is shown to be NP-complete. We formulate the problem as a maximum likelihood inference problem, where the likelihood function is defined under the assumption that the likelihood of a candidate clone being the parent of a new clone is proportional to its relative abundance. We show through simulations that our approach is fast and produces the most accurate solution that is practically plausible. We will also discuss the limitations of the approach. We validate our method using experimental data from a study on long-term evolution of E. coli.
Short Abstract: LTR retrotransposons (LTRs) dominate the genomic landscape of virtually all plants and animals. These retroelements are ubiquitous because of their ability to rapidly amplify their copy number in the genome. Because of their replicative nature, these elements have been implicated as potential catalysts of the phenotypic diversity required for the development of domesticated crops. The primary goal of this project is to gain a better understanding of the role that LTR insertions have played in the evolution and domestication of soybean by characterizing unique retroelement insertions among wild and domesticated varieties using next-generation sequencing techniques. To this end, we have developed a transposon-anchored PCR protocol that amplifies genomic regions flanking target LTRs. To accompany this protocol, we have devised a bioinformatics pipeline that efficiently processes, filters and compares each sequence library and aligns the reads to a reference genome. We have used these bioinformatic techniques to perform a large-scale comparison of insertions across 25 different varieties of soybean. These techniques will provide better insight into the role that retrotransposons have played in accelerating domestication, and may ultimately guide the development of stress-resistant crops.