All times listed are in UTC
- Salvatore Cosentino, The University of Tokyo, Japan
- Wataru Iwasaki, The University of Tokyo, Japan
Presentation Overview: Show
Accurate inference of orthologous genes constitutes a prerequisite for genomic and evolutionary studies. SonicParanoid is one of the fastest methods for orthology inference and is comparably accurate to well-established methods despite being orders of magnitude faster. Nevertheless, its scalability is hampered by the lengthy all-vs-all alignments, and sequence-similarity search alone is not enough to predict very distant orthologs. In this work we tackle these two limitations using machine learning.
We substantially reduced the all-versus-all alignment execution time using an AdaBoost model which exploits the properties of the Bidirectional-Best-Hit and the factors affecting the computational time in local sequence alignment. Evaluation based on multiple datasets showed reductions in execution time up to 50% without negative effects on the accuracy of the orthology inference.
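The Bidirectional-Best-Hit (BBH) criterion mentioned above can be sketched in a few lines. This is a minimal illustration, not SonicParanoid's implementation: the function names, the score dictionaries, and the protein identifiers are all hypothetical, and ties are broken by keeping the first hit seen.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) pairs to alignment scores
    (hypothetical input format chosen for this sketch).
    """
    best = {}
    for (query, subject), score in scores.items():
        # Keep the first hit on ties; replace only on a strictly better score.
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Toy scores for two proteomes A and B (made-up values).
a_vs_b = {("a1", "b1"): 250.0, ("a1", "b2"): 90.0,
          ("a2", "b2"): 180.0, ("a3", "b2"): 60.0}
b_vs_a = {("b1", "a1"): 245.0, ("b2", "a2"): 175.0, ("b2", "a3"): 55.0}

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy input, `a3` has `b2` as its best hit, but `b2` prefers `a2`, so `a3` is not part of any BBH pair.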
To address the second limitation we trained a doc2vec model with domain-architectures extracted from the input proteins, and we used it to infer orthologs based on domain-architecture similarities, which resulted in an increase of one-third in the number of predicted orthologs.
The way we reduced all-vs-all execution time could be used by other graph-based methods, while the domain-based approach could in the future, thanks to its scalability, eliminate the need for all-vs-all alignments in orthology inference.
- Inhyuk Song, Pusan National University, South Korea
- Giltae Song, Pusan National University, South Korea
Presentation Overview: Show
Copy number variations (CNV) are frequent in cancer genomes and cause abnormal changes in chromosome number and chromosome arms. CNV are known to be significantly involved in cancer development and progression, but the role of CNV in the molecular mechanisms of cancer is still unclear. One of the reasons is that determining CNV in cancer genomes is challenging using short-read sequencing. Some cancer genome analysis studies use long-read sequencing to identify CNV more accurately, but it is still quite costly.
In this study, we develop a deep learning approach based on convolutional neural networks (CNN) for detecting CNV in cancer genomes using short-read sequencing only. We collect some CNV data from DGV (Database of Genomic Variants) for cancer genomes using long-read sequencing. We use the CNV results as label data and feed short-read sequence mapping data to CNN as input. We build our CNN model and validate our method using cross validation. We compare the performance of our method with other existing methods and improve the accuracy of CNV detection. We believe that our method can be a useful toolset to reveal unknown roles of CNV in the mechanisms of cancer.
- Sophie Seidel, ETH Zurich, Switzerland
- Ashley Maynard, ETH, Switzerland
- Zhisong He, ETH, Switzerland
- Barbara Treutlein, ETH, Switzerland
- Tanja Stadler, ETH, Switzerland
Presentation Overview: Show
A central goal of developmental biology is to understand the building of a complex tissue from an initial cell. Single-cell lineage tracing and expression data have the potential to elucidate the cell-state transitions during that process. However, lineage tracing data is typically analysed using methods based on parsimony. A computational framework quantifying the transition dynamics by integrating both data sources is lacking.
We derived a novel substitution model that approximates the editing process of frequently used lineage tracing systems (LINNAEUS, ScarTrace). Using this substitution model, we perform Bayesian inference of single-cell lineage trees. Compared to state-of-the-art maximum parsimony methods, this allows us to take recurring editing outcomes and phylogenetic uncertainty into account. Alongside the trees, we estimate each cell state’s growth rate and state transition rate. We implemented our model as a package within the BEAST2 platform and validated it on simulated data. We apply it to cerebral organoid data from multiple time points to investigate the cell-state transitions from neural progenitor cells to neurons.
We provide a framework to estimate cell-state transition rates based on lineage tracing and RNAseq data. This framework will enable a more quantitative understanding of development in health and disease.
- Qiqing Tao, Temple University, United States
- Sudhir Kumar, Temple University, United States
- Jose Barba-Montoya, Temple University, United States
Presentation Overview: Show
Motivation: Precise time calibrations needed to estimate ages of species divergence are not always available due to the incompleteness of the fossil record. Consequently, clock calibrations available for Bayesian analyses can be few and diffused, i.e., phylogenies are calibration-poor, impeding reliable inference of the timetree of life. We examined the role of the speciation tree prior on Bayesian node age estimates in calibration-poor phylogenies and tested the usefulness of an informative, data-driven (dd) tree prior for enhancing the accuracy and precision of estimated times.
Results: We present a simple method to estimate parameters of the birth-death (BD) tree prior from the molecular phylogeny. The use of ddBD tree priors can improve Bayesian node age estimates for calibration-poor phylogenies. We show that the ddBD tree prior, along with only a few well-constrained calibrations, can produce excellent node ages and credibility intervals, whereas the use of a flat tree prior may require more calibrations. Relaxed clock dating with ddBD tree prior also produced better results than a flat tree prior when using diffused node calibrations. We also suggest using ddBD tree priors to improve the detection of outliers and influential calibrations in cross-validation analyses. Our results have practical use because the ddBD tree prior reduces the number of well-constrained calibrations necessary to obtain reliable node age estimates. This would help address key impediments in building the grand timetree of life, revealing the process of speciation, and elucidating the dynamics of biological diversification.
Availability: An R module for computing the ddBD tree prior, simulated datasets, and empirical datasets are available at https://github.com/cathyqqtao/ddBD-tree-prior.
- Yuval Tabach, The Hebrew University-Hadassah Medical School, Israel
- Doron Stupp, The Hebrew University-Hadassah Medical School, Israel
Presentation Overview: Show
Over the next decade, more than a million eukaryotic species are expected to be fully sequenced. These data have the potential to help us better understand genotype and phenotype crosstalk, predict gene function and functional interactions, and solve fundamental evolutionary questions. Here, we develop a machine-learning approach for utilizing phylogenetic profiles across 1154 eukaryotic species. This method integrates the co-evolutionary signal across eukaryotic clades to predict functional interactions between human genes. We benchmarked our approach and found a 14% increase in performance (auROC) as compared to other phylogenetic profiling methods. Utilizing this approach, we predicted novel functional interactions and enabled functional annotation for poorly characterized genes. To study the species and pathway crosstalk, we uncovered major loss events that define pathway evolution and revealed influential clades for different biological processes. Finally, we show that parasitic organisms have loss patterns in pathways and genes that distinguish them from other species and … Overall, our approach enables better annotation of functional interactions and facilitates the understanding of the evolutionary processes underlying pathway co-evolution and how it is linked to species genomes. The manuscript is accompanied by a webserver available at: http://mlpp.cs.huji.ac.il:8000 .
- Chris Papadopoulos, Institute for Integrative Biology of the Cell (Université Paris-Saclay, CEA, CNRS), France
- Isabelle Callebaut, IMPMC (Sorbonne Université, Muséum National d'Histoire Naturelle), France
- Jean-Christophe Gelly, Biologie Intégrée du Globule Rouge (Université de Paris), France
- Isabelle Hatin, Institute for Integrative Biology of the Cell (Université Paris-Saclay, CEA, CNRS), France
- Olivier Namy, Institute for Integrative Biology of the Cell (Université Paris-Saclay, CEA, CNRS), France
- Maxime Renard, Institute for Integrative Biology of the Cell (Université Paris-Saclay, CEA, CNRS), France
- Olivier Lespinet, Institute for Integrative Biology of the Cell (Université Paris-Saclay, CEA, CNRS), France
- Anne Lopes, Institute for Integrative Biology of the Cell (Université Paris-Saclay, CEA, CNRS), France
Presentation Overview: Show
The noncoding genome plays an important role in de novo gene birth and the emergence of genetic novelty. How the properties of noncoding sequences shape the evolution and the structural diversity of proteins remains unclear. Therefore, we characterized the fold potential diversity of the amino acid sequences encoded by all intergenic ORFs of S. cerevisiae in order to estimate the potential of the noncoding genome to produce novel protein bricks that can either give rise to novel genes or be integrated into pre-existing proteins, thus participating in protein structure diversity and evolution. We showed that amino acid sequences encoded by most yeast intergenic ORFs contain the elementary building blocks of protein structures. Moreover, they cover a large structural diversity; strikingly, the majority are predicted to have a foldability potential similar to that of globular proteins. Then, we investigated the early stages of de novo gene birth by identifying, with ribosome profiling, intergenic ORFs with a strong translation signal and reconstructing the ancestral sequences of 70 yeast de novo genes. We showed a strong correlation between the fold potential of de novo proteins and that of their ancestral amino acid sequences, reflecting the relationship between the noncoding genome and the protein structure universe.
- Erin Molloy, University of California, Los Angeles, United States
- Arun Durvasula, University of California, Los Angeles, United States
- Sriram Sankararaman, University of California, Los Angeles, United States
Presentation Overview: Show
Motivation: Admixture, the interbreeding between previously distinct populations, is a pervasive force in evolution. The evolutionary history of populations in the presence of admixture can be modeled by augmenting phylogenetic trees with additional nodes that represent admixture events. While enabling a more faithful representation of evolutionary history, admixture graphs present formidable inferential challenges, and there is an increasing need for methods that are accurate, fully automated, and computationally efficient. One key challenge arises from the size of the space of admixture graphs. Given that exhaustively evaluating all admixture graphs can be prohibitively expensive, heuristics have been developed to enable efficient search over this space. One heuristic, implemented in the popular method TreeMix, consists of adding edges to a starting tree while optimizing a suitable objective function.
Results: Here, we present a demographic model (with one admixed population incident to a leaf) where TreeMix and any other starting-tree-based maximum likelihood heuristic using its likelihood function are guaranteed to get stuck in a local optimum and return an incorrect network topology. To address this issue, we propose a new search strategy that we term maximum likelihood network orientation (MLNO). We augment TreeMix with an exhaustive search for an MLNO, referring to this approach as OrientAGraph. In evaluations including published admixture graphs, OrientAGraph outperformed TreeMix on 4/8 models (there were no differences in the other cases). Overall, OrientAGraph found graphs with higher likelihood scores and topological accuracy while remaining computationally efficient. Lastly, our study reveals several directions for improving ML admixture graph estimation.
- Yi-Fei Huang, Pennsylvania State University, United States
Presentation Overview: Show
In evolutionary genomics, it is fundamentally important to understand how characteristics of genomic sequences, such as the expression level of a gene, determine the rate of adaptive evolution. While numerous statistical methods, such as the McDonald-Kreitman test, are available to examine the association between genomic features and positive selection, we currently lack a statistical approach to disentangle the direct effects of genomic features from the indirect effects mediated by confounding factors. To address this problem, we present a novel statistical model, the MK regression, which augments the McDonald-Kreitman test with a generalized linear model. Analogous to the classical multiple regression model, the MK regression can analyze multiple genomic features simultaneously to distinguish between direct and indirect effects on positive selection. Using the MK regression, we identify numerous genomic features responsible for positive selection in chimpanzees. These features include well-known ones, such as local mutation rate, residue exposure level, tissue specificity, and immune system genes, as well as new features not previously reported, such as gene expression level and metabolic genes. Overall, the MK regression is a powerful approach to elucidate the genomic basis of adaptive evolution.
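For readers unfamiliar with the McDonald-Kreitman test that the MK regression builds on, the classical summary statistics can be computed from four counts. The sketch below covers only the classical 2x2 test quantities, not the MK regression itself; the function names and the example counts are illustrative assumptions.

```python
def mk_alpha(dn, ds, pn, ps):
    """Classical MK estimate of the proportion of adaptive substitutions.

    dn, ds: nonsynonymous / synonymous fixed differences (divergence)
    pn, ps: nonsynonymous / synonymous polymorphisms
    alpha = 1 - (ds * pn) / (dn * ps)
    """
    if dn == 0 or ps == 0:
        raise ValueError("alpha is undefined when dn or ps is zero")
    return 1.0 - (ds * pn) / (dn * ps)


def neutrality_index(dn, ds, pn, ps):
    """Neutrality index NI = (pn / ps) / (dn / ds); NI < 1 suggests positive selection."""
    if dn == 0 or ds == 0 or ps == 0:
        raise ValueError("NI is undefined when dn, ds, or ps is zero")
    return (pn / ps) / (dn / ds)


# Made-up counts for a single gene.
alpha = mk_alpha(dn=20, ds=40, pn=10, ps=40)
ni = neutrality_index(dn=20, ds=40, pn=10, ps=40)
print(alpha, ni)
```

With these counts, `alpha` and `ni` sum to one, as expected from the identity alpha = 1 - NI. A single-gene estimate like this cannot separate the effect of one genomic feature from another, which is the gap the regression framework described above is meant to fill.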
- Samuel Z Chen, Michigan State University, United States
- Lauren M Sosinski, Michigan State University, United States
- Joseph T Burke, Michigan State University, United States
- John B Johnston, Oakland University, United States
- Janani Ravi, Michigan State University, United States
Presentation Overview: Show
Studying how bacterial pathogenic proteins evolve can help identify lineage-specific and pathogen-specific signatures and variants, and consequently, their functions. We have developed a streamlined computational approach for characterizing the molecular evolution and phylogeny of target proteins, widely applicable across proteins and species of interest. Our approach starts with query protein(s) of interest, identifies their homologs, and characterizes each protein by its domain architecture and phyletic spread. We have developed the MolEvolvR webapp to enable biologists to run our entire workflow on their data by simply uploading a list of their proteins of interest. The webapp accepts inputs in multiple formats: protein/domain sequences, multi-protein operons/homologous proteins, or motif/domain scans. Depending on the input, MolEvolvR returns the complete set of homologs/phylogenetic tree, domain architectures, and common partner domains. Users can obtain graphical summaries that include MSA and phylogenetic trees, domain architectures, domain proximity networks, phyletic spreads, co-occurrence patterns, and relative occurrences across lineages. Thus, MolEvolvR provides a powerful, easy-to-use interface for a wide range of protein characterization analyses, including data summarization and dynamic visualization. The webapp can be accessed here: http://jravilab.org/molevolvr. Soon, it will be available as an R-package for use by computational biologists.
- Mohak Sharda, National Centre for Biological Sciences, Bangalore, India
- Anjana Badrinarayanan, National Centre for Biological Sciences, Bangalore, India
- Aswin Sai Narain Seshasayee, National Centre for Biological Sciences, Bangalore, India
Presentation Overview: Show
DNA double-strand breaks (DSBs) are a threat to genome stability. DSBs are repaired either faithfully via homologous recombination or erroneously via non-homologous end joining (NHEJ). Unlike recombination-based repair, NHEJ is only sporadically present in prokaryotes. To understand why many prokaryotes lack it, we used comparative genomics and phylogenetic approaches to show that multiple independent gain and loss events, along with extensive horizontal gene transfers, have shaped the evolutionary history of NHEJ. We also highlight the association of NHEJ with three genome characteristics: GC content, genome size and growth rate. Given the central role these traits play in determining the ability to carry out recombination, it is possible that the evolutionary history of bacterial NHEJ has been shaped by the requirement for efficient DSB repair. Approaches used in our study could be extended to other repair pathways in order to understand how they might have contributed to bacterial evolution or vice versa.
- Hao Zhou, Cornell University, United States
- Juan Beltran, Cornell University, United States
- Ilana Brito, Cornell University, United States
Presentation Overview: Show
Phylogenetic distance, shared ecology and genomic constraints are often cited as key drivers governing horizontal gene transfer (HGT), although their relative contributions are unclear. Here, we apply machine learning algorithms to a curated set of diverse bacterial genomes to tease apart the importance of specific functional traits on recent HGT events. We find that functional content accurately predicts the HGT network (AUROC=0.983), and performance improves further (AUROC=0.990) for transfers involving antibiotic resistance genes (ARGs), highlighting the importance of HGT machinery, niche-specific and metabolic functions. We find that high-probability, not-yet-detected ARG transfer events are almost exclusive to human-associated bacteria. Our approach is robust at predicting the HGT networks of pathogens, including Acinetobacter baumannii and Escherichia coli, as well as within localized environments, such as an individual’s gut microbiome.
- Yoann Anselmetti, Université de Sherbrooke, Canada
- Nadia El-Mabrouk, Université de Montréal, Canada
- Manuel Lafond, Université de Sherbrooke, Canada
- Aïda Ouangraoua, Université de Sherbrooke, Canada
Presentation Overview: Show
It is largely established that all extant mitochondria originated from a unique endosymbiotic event integrating an alpha-proteobacterial genome into a eukaryotic cell. Subsequently, eukaryote evolution has been marked by episodes of gene transfer, mainly from the mitochondria to the nucleus, resulting in a significant reduction of the mitochondrial genome, which has eventually disappeared completely in some lineages. However, in other lineages such as land plants, a high variability in gene repertoire distribution, including genes encoded in both the nuclear and mitochondrial genomes, is an indication of an ongoing process of Endosymbiotic Gene Transfer (EGT). Understanding how both nuclear and mitochondrial genomes have been shaped by gene loss, duplication and transfer is expected to shed light on a number of open questions regarding the evolution of eukaryotes, including rooting of the eukaryotic tree.
We address the problem of inferring the evolution of a gene family through duplication, loss and EGT events, the latter considered as a special case of horizontal gene transfer occurring between the mitochondrial and nuclear genomes of the same species. We present a linear-time algorithm for computing the DEL (Duplication, EGT and Loss) distance, as well as an optimal reconciled tree, for the unitary cost, and a dynamic programming algorithm that outputs all optimal reconciliations for an arbitrary cost of operations. We illustrate the application of our EndoRex software, analyse different cost settings on a plant dataset, and discuss the resulting reconciled trees.
- Magda Markowska, University of Warsaw, Poland
- Tomasz Cąkała, University of Warsaw, Poland
- Błażej Miasojedow, University of Warsaw, Poland
- Dilafruz Juraeva, Merck Healthcare KGaA, Translational Medicine, Oncology Bioinformatics, Germany
- Johanna Mazur, Merck Healthcare KGaA, Translational Medicine, Oncology Bioinformatics, Germany
- Edith Ross, Merck Healthcare KGaA, Translational Medicine, Oncology Bioinformatics, Germany
- Eike Staub, Merck Healthcare KGaA, Translational Medicine, Oncology Bioinformatics, Germany
- Ewa Szczurek, University of Warsaw, Poland
Presentation Overview: Show
Copy number alterations constitute important phenomena in tumor evolution. Whole genome single cell sequencing gives insight into copy number profiles of individual cells, but is highly noisy. Here, we propose a novel approach for Copy Number Event Tree (CONET) inference and copy number calling. CONET fully exploits the signal in scDNA-seq, as it relies directly on both the per-breakpoint and per-bin data. The model jointly infers the structure of an evolutionary tree on copy number events and copy number profiles of the cells, gaining statistical power in both tasks. The nodes of the evolutionary tree are copy number events, which are allowed to overlap. CONET employs an efficient MCMC procedure to search the space of possible model structures and parameters, with a range of model priors and penalties for efficient regularization. Results on simulated data and 260 cells from a xenograft breast cancer sample demonstrate the excellent performance of CONET in inferring both the copy number evolutionary history in cancer tissue and integer copy number profiles for each cell. Taken together, the proposed approach is a step towards a better understanding of copy number evolution in cancer.
CONET implementation is available at https://github.com/tc360950/CONET.
- Conor Walker, EMBL - European Bioinformatics Institute, United Kingdom
- Nicola De Maio, EMBL - European Bioinformatics Institute, United Kingdom
- Nick Goldman, EMBL - European Bioinformatics Institute, United Kingdom
Presentation Overview: Show
Detecting adaptive changes in multiple sequence alignments of protein-coding DNA sequences is typically performed using likelihood-based tests, involving statistical evaluation of the nonsynonymous to synonymous substitution rate ratio (“dN/dS”) at individual sites and/or on individual branches. Likelihood-based tests perform well in idealised scenarios involving perfect alignments, but increasingly poorly as greater levels of divergence cause multiple indels and alignment errors. We show for the first time that convolutional neural networks (CNNs) can learn to accurately detect selection using alignments containing many such errors, demonstrating that test accuracy does not have to be constrained entirely by alignment quality.
Treating this as a binary classification problem, CNNs are tasked with identifying if either positive selection (conventionally dN/dS>1), or no selection, has occurred within alignments of orthologous sequences. We simulate sequence evolution under various realistic conditions, using alignments of these sequences for training and testing our networks. We show that CNNs trained on Clustal alignments can classify selection with high accuracy in the presence of alignment error, performing favourably when compared to best-in-class aligner+likelihood combinations. Using global saliency maps, we show that CNNs learn site-wise information that may permit expanding this approach into evaluating selection at individual sites.
- Chaitanya Aluru, Princeton University, United States
- Mona Singh, Princeton University, United States
Presentation Overview: Show
Motivation: Protein domain duplications are a major contributor to the functional diversification of protein families. These duplications can occur one at a time through single domain duplications, or as tandem duplications where several consecutive domains are duplicated together as part of a single evolutionary event. Existing methods for inferring domain level evolutionary events are based on reconciling domain trees with gene trees. While some formulations consider multiple domain duplications, they do not explicitly model tandem duplications; this leads to inaccurate inference of which domains duplicated together over the course of evolution.
Results: Here, we introduce a reconciliation-based approach that considers the relative positions of domains within extant sequences. We use this information to uncover tandem domain duplications within the evolutionary history of these genes. We devise an integer linear programming (ILP) formulation that solves this problem exactly, and a heuristic approach that works well in practice. We perform extensive simulation studies to demonstrate that our approaches can accurately uncover single and tandem domain duplications, and additionally test our approach on a well-studied orthogroup where lineage-specific domain expansions exhibit varying and complex domain duplication patterns.
- Malay Basu, University of Alabama, Birmingham, United States
Presentation Overview: Show
Background
Proteome complexity has a major role in the progression of cancer. Various drugs have been used to target proteome imbalance. We devised a new method to measure the proteome complexity of cells modifying our published work on the “grammar” of the proteome, which is based on the remarkable similarity of genomes to natural language texts.
Results
Using expression-weighted N-gram modeling, we measured proteome complexity in various human tissues and in 31 cancer types from TCGA. We show that proteome complexity is variable among the cancer types: in some cancers, the gene expression is streamlined and can be considered to have evolved under positive selection, while in others the entropy of the proteome increases, possibly due to dedifferentiation. We provide evidence that the “atavistic” model of cancer, which suggests that cancer cells evolve by retrogressive dedifferentiation into an earlier state of evolution, is not a universal mechanism. We also show that proteome complexity can explain the survivability of cancer patients.
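To make the idea of an expression-weighted N-gram entropy concrete, here is a minimal sketch. The abstract does not specify the tokens or the weighting scheme, so everything here is an assumption: each protein is represented as a list of domain labels, each N-gram is weighted by the expression of the protein it occurs in, and complexity is taken to be the Shannon entropy of the resulting distribution.

```python
import math
from collections import Counter


def weighted_ngram_entropy(proteins, n=2):
    """Shannon entropy (bits) of an expression-weighted n-gram distribution.

    `proteins` is a list of (tokens, expression_weight) pairs, where
    `tokens` is a sequence of labels (assumed here to be domain names).
    """
    weights = Counter()
    for tokens, expression in proteins:
        for i in range(len(tokens) - n + 1):
            # Each n-gram occurrence contributes the protein's expression level.
            weights[tuple(tokens[i:i + n])] += expression
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("no n-grams with positive weight")
    return -sum((w / total) * math.log2(w / total)
                for w in weights.values() if w > 0)


# Two hypothetical proteins with equal expression and distinct bigrams.
balanced = weighted_ngram_entropy([(["A", "B"], 1.0), (["C", "D"], 1.0)])
# Same proteins, but expression is dominated by the first one.
skewed = weighted_ngram_entropy([(["A", "B"], 9.0), (["C", "D"], 1.0)])
print(balanced, skewed)
```

Under these assumptions, a profile dominated by a few highly expressed architectures yields lower entropy than a balanced one, which matches the intuition of "streamlined" expression described above.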
Conclusion
The work describes a novel linguistic method of measuring proteome complexity. Besides providing clues to cancer evolution, the work may help in patient prioritization for therapies targeting proteome imbalance.
- Elise Parey, Institut de Biologie de l'Ecole normale supérieure (IBENS), France
- Alexandra Louis, Institut de Biologie de l'Ecole normale supérieure (IBENS), France
- Jérôme Monfort, INRAE, LPGP, France
- Yann Guiguen, INRAE, LPGP, France
- Hugues Roest Crollius, Institut de Biologie de l'Ecole normale supérieure (IBENS), France
- Camille Berthelot, Institut de Biologie de l'Ecole normale supérieure (IBENS), France
Presentation Overview: Show
Whole-genome duplications (WGDs), or polyploidizations, are major evolutionary events contributing extensively to species diversification processes. However, accurately tracing genes and genome evolution after WGD is challenging as these events are followed by massive gene losses, gene conversions and widespread evolutionary divergence. To address this, we developed a novel gene tree correction method, named SCORPiOs (Synteny-guided CORrection of Paralogies and Orthologies). SCORPiOs integrates information from the genomic organization of genes, or synteny, to complement classical sequence-based gene tree construction methods. We apply SCORPiOs to a large set of gene phylogenies containing 101 vertebrates, including 74 teleost species sharing a common 320 million years old WGD event. By combining these refined gene trees together with a state-of-the-art ancestral teleost karyotype reconstruction, we establish a genomic atlas of WGD-duplicated regions across teleosts. We reveal that gene losses after the WGD have unequally affected duplicated chromosomes, with some genomic regions displaying pronounced retention biases on one of the homeologues. Analyzing strong disagreements between sequence and synteny predictions for gene family evolution, we uncover WGD-duplicated regions likely subjected to homeologous recombination for an extended period of time following polyploidization. Altogether, our results shed light on the contribution of WGDs to the evolution of vertebrate genomes.
- Bastian Pfeifer, Institute for Medical Informatics, Statistics and Documentation. Medical University Graz, Austria
- Durrell D. Kapan, Center for Comparative Genomics, Institute for Biodiversity Science and Sustainability, California Academy of Sciences, United States
Presentation Overview: Show
Introgression (the flow of genes between species) is a major force structuring the evolution of genomes, potentially providing raw material for adaptation. Here, we present a versatile Bayesian model selection approach for the detection and quantification of introgression. The proposed df-BF approach builds upon the recently published distance-based df statistic. Unlike df, df-BF takes into account the number of variant sites within a genomic region. The df-BF method quantifies introgression with the inferred theta parameter, and at the same time enables weighing the strength of evidence for introgression based on Bayes Factors. To ensure fast computation, we make use of conjugate priors, with no need for computationally demanding MCMC iterations. We compare our method with other approaches including df, fd, and Patterson's D using a wide range of coalescent simulations. Furthermore, we showcase the applicability of the df-BF approach using whole genome mosquito data. Finally, we integrate the new method into the powerful genomics R-package PopGenome.
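Of the statistics named above, Patterson's D is the simplest to illustrate. The sketch below counts ABBA and BABA site patterns for four taxa (P1, P2, P3, outgroup) and is not the df-BF method; it assumes one haplotype per taxon, treats the outgroup allele as ancestral, and uses made-up sites.

```python
def pattersons_d(sites):
    """Patterson's D = (ABBA - BABA) / (ABBA + BABA).

    `sites` is an iterable of (p1, p2, p3, outgroup) alleles, one
    haplotype per taxon (a simplifying assumption of this sketch).
    """
    abba = baba = 0
    for p1, p2, p3, out in sites:
        if p3 == out:
            continue  # P3 carries the ancestral allele: uninformative
        if p1 == out and p2 == p3:
            abba += 1  # derived allele shared by P2 and P3
        elif p2 == out and p1 == p3:
            baba += 1  # derived allele shared by P1 and P3
    if abba + baba == 0:
        raise ValueError("no ABBA or BABA sites")
    return (abba - baba) / (abba + baba)


sites = (
    [("A", "G", "G", "A")] * 3   # ABBA
    + [("G", "A", "G", "A")]     # BABA
    + [("A", "A", "G", "A")]     # derived only in P3: ignored
    + [("A", "A", "A", "A")]     # invariant: ignored
)
d = pattersons_d(sites)
print(d)
```

An excess of ABBA over BABA sites (positive D) is conventionally read as evidence of gene flow between P2 and P3.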
- Dana Sherill-Rofe, Hebrew University, Israel
- Idit Bloch, Hebrew University, Israel
- Yuval Tabach, Hebrew University, Israel
Presentation Overview: Show
Despite the sequencing revolution, the majority of genes are poorly annotated. As almost none of the unannotated genes evolved uniquely in humans, their evolution across hundreds of organisms can be the anchor for their functional characterization.
Although identifying uncharacterized pathways represents an extremely difficult challenge, novel pathways were recently identified after extensive analysis of gene groups showing similar phenotypes, interactions, or expression. Given the extensive growth in genomic data, a powerful approach to predict gene function and its interactions is phylogenetic profiling.
Phylogenetic profiling (PP) is an unbiased approach to predict gene function and its interactions. The main assumption is that genes sharing a similar PP are also functionally coupled. Recently, we established that integrating information from different clades can optimize co-evolution signals, and improve gene function discovery. We generated a network of all genes divided into paralogous groups for 12 clades containing almost 2000 species and identified clusters. We found several known pathways, but the annotated clusters corresponded to only 22% of the overall clusters, thus the remaining clusters may represent undiscovered biology. Using data integration and biological validation we intend to identify novel biological pathways. Characterization of even a single novel pathway is of paramount importance.
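The core assumption of phylogenetic profiling, that genes with similar presence/absence patterns across species are functionally coupled, can be sketched with a simple similarity score. The Jaccard index used here is an illustrative choice, not necessarily the measure used in this work, and the gene names and profiles are invented.

```python
def jaccard_similarity(profile_a, profile_b):
    """Jaccard similarity between two binary presence/absence profiles.

    Each profile lists, per species, 1 if an ortholog is present and 0 otherwise.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same species")
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


# Hypothetical profiles across six species.
profiles = {
    "geneX": [1, 1, 0, 0, 1, 0],
    "geneY": [1, 1, 0, 0, 1, 1],
    "geneZ": [0, 0, 1, 1, 0, 1],
}

sim_xy = jaccard_similarity(profiles["geneX"], profiles["geneY"])
sim_xz = jaccard_similarity(profiles["geneX"], profiles["geneZ"])
print(sim_xy, sim_xz)
```

Here `geneX` and `geneY` are gained and lost together in most species and would be candidates for functional coupling, while `geneX` and `geneZ` never co-occur. Computing such scores separately per clade is one simple way to look for clade-specific co-evolution signals.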
- Wei Wang, Michigan State University, United States
- Ahmad Hejasebazzi, Michigan State University, United States
- Julia Zheng, Michigan State University, United States
- Kevin Liu, Michigan State University, United States
Presentation Overview: Show
The standard bootstrap method is used throughout science and engineering to perform general-purpose non-parametric resampling and re-estimation. Among the most widely cited and widely used such applications is the phylogenetic bootstrap method, which Felsenstein proposed in 1985 as a means to place statistical confidence intervals on an estimated phylogeny. A key simplifying assumption of the bootstrap method is that input data are independent and identically distributed (i.i.d.). However, the i.i.d. assumption is an over-simplification for biomolecular sequence analysis, as Felsenstein noted. Special-purpose fully parametric or semi-parametric methods for phylogenetic support estimation have since been introduced, some of which are intended to address this concern.
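As a reminder of what the standard phylogenetic bootstrap does, the sketch below builds one replicate by sampling alignment columns with replacement, which is where the i.i.d. assumption on sites enters. It is not RAWR; the toy alignment and function name are illustrative assumptions, and the tree re-estimation step is omitted.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of a multiple sequence alignment.

    `alignment` maps taxon names to equal-length aligned sequences.
    Columns are drawn independently and uniformly with replacement,
    and the same drawn columns are used for every taxon.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    n_sites = lengths.pop()
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


alignment = {
    "taxon1": "ACGTAC",
    "taxon2": "ACGTTC",
    "taxon3": "AGGTTA",
}
replicate = bootstrap_alignment(alignment, random.Random(0))
print(replicate)
```

In a full analysis, a tree would be re-estimated from each replicate, and the support for a branch would be the fraction of replicate trees that contain it. Because columns are drawn independently, any dependence between neighbouring sites is lost in the replicates.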
In this study, we introduce a new sequence-aware non-parametric resampling technique, which we refer to as RAWR (“RAndom Walk Resampling”). RAWR consists of random walks that synthesize and extend the standard bootstrap method and the “mirrored inputs” idea of Landan and Graur. We apply RAWR to the task of phylogenetic support estimation. RAWR’s performance is compared to the state of the art using synthetic and empirical data that span a range of dataset sizes and evolutionary divergence. We show that RAWR support estimates offer comparable or typically superior type I and type II error compared to phylogenetic bootstrap support. We also conduct a re-analysis of large-scale genomic sequence data from a recent study of Darwin’s finches. Our findings clarify phylogenetic uncertainty in a charismatic clade that serves as an important model for complex adaptive evolution. We conclude with thoughts on future research directions.
- Senbai Kang, University of Warsaw, Poland
- Nico Borgsmüller, ETH Zurich, Switzerland
- Monica Valecha, University of Vigo, Spain
- Jack Kuipers, ETH Zurich, Switzerland
- Niko Beerenwinkel, ETH Zurich, Switzerland
- David Posada, University of Vigo, Spain
- Ewa Szczurek, University of Warsaw, Poland
Presentation Overview:
Understanding intra-tumor heterogeneity is the cornerstone of developing effective cancer treatment and precision medicine. The development of single-cell DNA sequencing technology has remarkably increased the resolution of DNA profiles to the single-cell level. This facilitates the inference of phylogenetic trees with individual tumor cells as leaves, providing an evolutionary model of the mechanism behind intra-tumor heterogeneity. However, most of the methods proposed for tree reconstruction from single-cell data are based on the infinite-sites assumption, which is often violated in reality due to evolutionary events such as loss of heterozygosity.
Here, we develop a novel computational model, called Sieve, to jointly infer tumor phylogeny and call variants from single-cell data under the finite-sites assumption. We propose a novel rate matrix, with states representing genotypes corresponding to heterozygous and homozygous mutations. To properly integrate the noisy single-cell sequencing data, we develop a Dirichlet-Multinomial based probabilistic model of the sequencing coverage and nucleotide read counts. The model accounts for allelic dropouts. To acquire accurate branch lengths, acquisition bias correction is applied. We show that Sieve outperforms existing approaches on simulated data, especially regarding branch lengths and calling homozygous mutations. Sieve is then applied to publicly available real datasets. Sieve is implemented as a package for BEAST 2.
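As background on the Dirichlet-Multinomial distribution mentioned above, the sketch below evaluates its log-probability for a vector of nucleotide read counts. It is the generic textbook distribution only: how Sieve parameterizes it per genotype, and how coverage and allelic dropout enter, are not specified in the abstract, so the concentration values used here are purely hypothetical.

```python
from math import exp, lgamma


def dirichlet_multinomial_logpmf(counts, alphas):
    """Log-probability of `counts` under a Dirichlet-Multinomial distribution.

    counts: non-negative integer read counts per category (e.g. A, C, G, T).
    alphas: positive concentration parameters, one per category.
    """
    n = sum(counts)
    a0 = sum(alphas)
    # log[ n! * Gamma(a0) / Gamma(n + a0) ]
    out = lgamma(n + 1) + lgamma(a0) - lgamma(n + a0)
    # plus sum over categories of log[ Gamma(x + a) / (x! * Gamma(a)) ]
    for x, a in zip(counts, alphas):
        out += lgamma(x + a) - lgamma(x + 1) - lgamma(a)
    return out


# Hypothetical example: 10 reads at one site, mostly supporting the first nucleotide.
read_counts = [8, 1, 1, 0]
concentration = [5.0, 0.5, 0.5, 0.5]  # illustrative values, not taken from Sieve
log_prob = dirichlet_multinomial_logpmf(read_counts, concentration)
```

Compared with a plain multinomial, the extra concentration parameters let the model absorb overdispersion in read counts, which is one common motivation for using this distribution with noisy sequencing data.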
- Maria Chikina, University of Pittsburgh, United States
- Elysia Saputra, University of Pittsburgh, United States
- Nathan Clark, University of Utah, United States
Presentation Overview:
Understanding the genetic underpinnings of organism-level phenotype changes is a major challenge in molecular evolution. Convergent phenotypes, whereby multiple species independently develop similar characteristics, are useful for inferring such genotype-phenotype associations. Several methods have been developed to associate convergent phenotypes with genetic elements, all operating on a defined set of genetic elements. However, for non-coding sequences, the notion of a “genetic element” is not well-defined. We propose a new method, phyloConverge, which uses a maximum likelihood phylogenetic model and phylogeny-aware bias correction to scan entire multiple sequence alignments for convergent evolutionary rate shifts at nucleotide resolution. We evaluate our method using a dataset previously analysed with a competing method to identify conserved non-coding elements (CNEs) convergently diverged in 4 independent subterranean mammal lineages. Because all 4 species have degenerate eyes, significant divergence in eye-related regions is expected. We computed rate acceleration scores for 491,576 CNEs and found that top-accelerated CNEs from phyloConverge overlapped eye-related ATAC-seq regions better than the competing method. Functional enrichment analysis also showed enrichment for functions related to ocular and neuronal development. Finally, through high-resolution scoring of nucleotides within top-accelerated CNEs, we discovered a genome-wide divergence of transcription factor binding sites associated with ocular and neuronal development.
- Mattéo Delabre, University of Montreal, Canada
- Nadia El-Mabrouk, University of Montreal, Canada
Presentation Overview:
During evolution, genes are mutated, duplicated, lost, and passed on to other organisms through speciations and horizontal gene transfers (HGTs) and, over time, form families of homologous genes. Reconciliation is a longstanding model used for inferring the evolutionary history of such families, explaining incongruence between the tree of a given family and the corresponding species tree in terms of evolutionary events. A major drawback of reconciliations is that gene families are considered to evolve independently of one another. This assumption is not suited to explaining the evolution of chromosomal segments of genes that evolved together. The super-reconciliation model was the first attempt to generalize the reconciliation approach to several gene families organized into syntenies. This model was, however, limited to duplications and losses and did not allow for HGTs. In this presentation, I will present new algorithmic results for the super-reconciliation problem, extending it to include HGT events and to compute the DTL-distance. I will show that duplications, HGTs, and full losses cannot be inferred independently of segmental losses if an optimal solution is sought. I will also present an algorithm for the DTL-distance. This extended model can enable future studies on the evolution of HGT-shaped syntenies, such as operons in bacteria.
- Louxin Zhang
- Nadia El-Mabrouk, Université de Montréal, Canada
- Dannie Durand, Carnegie Mellon University, United States