EVOLVE: a toolkit for statistical molecular evolutionary analysis of genomes

Gavin Huttley1, Alex Isaev2, Andrew Butterfield, Edward Lang, Cath Lawrence
1gavin.huttley@anu.edu.au, ANU; 2Alexander.Isaev@maths.anu.edu.au, ANU

The number of genes and species for which DNA sequence data are available is enormous compared with just five years ago. These data present an opportunity for statistical dissection of molecular evolutionary processes. The ability to exploit the data is limited, however, by the poor scalability and extensibility of existing software. As a result, developments in distributed high-performance computing cannot be efficiently exploited. We describe the functionality and performance of EVOLVE, software we have developed in response to these challenges. EVOLVE is an object-oriented toolkit designed both for applying existing methods of molecular evolutionary analysis and for developing new ones. The functional capabilities of EVOLVE are centered on its ability to perform phylogeny-based maximum-likelihood calculations. EVOLVE implements a range of existing and several novel Markov models of substitution (nucleotide, dinucleotide, codon, protein, and models for measuring interactions between sites) that can be used for these calculations. Other features include parameter heterogeneity (per site or across a tree), ancestral sequence reconstruction, sequence simulation for parametric bootstrapping, and a selection of numerical optimization techniques. Example analytical applications of EVOLVE include testing for evidence of selection using codon substitution models, identifying positively selected sites, and performing relative rate tests. The toolkit can also be used to develop novel models of substitution or phylogenetic reconstruction applications. We have implemented EVOLVE as a dynamically loadable module for the popular scripting language Python to facilitate flexible application development. Because the most computationally intensive algorithms are written in C/C++, the single-CPU performance of the software is respectable in comparison with existing applications written strictly in C. Parallelization of portions of the numerical optimizer achieves a significant performance boost, the magnitude of which depends primarily on the number of sequences. EVOLVE will be released under the GPL.
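
To make the central calculation concrete, the following is a minimal sketch of a phylogeny-based likelihood computed with Felsenstein's pruning algorithm under the Jukes-Cantor (JC69) nucleotide model. It is not EVOLVE's interface: the function names, the tree encoding (a leaf is a sequence name, an internal node is a list of (child, branch length) pairs), and the example alignment are all assumptions made for illustration, and the model is the simplest available rather than one of the models listed above.

```python
import math

BASES = "ACGT"


def jc69_prob(i, j, t):
    """JC69 probability of base i becoming base j along a branch of
    length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def partial_likelihoods(node, site):
    """Pruning step: likelihood of the data below `node`, conditional on
    each possible base at `node`. `site` maps leaf name -> observed base."""
    if isinstance(node, str):  # leaf: only the observed base is possible
        return [1.0 if b == site[node] else 0.0 for b in BASES]
    partial = [1.0] * 4
    for child, length in node:
        below = partial_likelihoods(child, site)
        for x in range(4):
            partial[x] *= sum(
                jc69_prob(BASES[x], BASES[y], length) * below[y]
                for y in range(4)
            )
    return partial


def log_likelihood(tree, alignment):
    """Sum of per-site log-likelihoods, assuming independent sites and
    equal base frequencies (0.25 each) at the root."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for k in range(n_sites):
        site = {name: seq[k] for name, seq in alignment.items()}
        root = partial_likelihoods(tree, site)
        total += math.log(sum(0.25 * p for p in root))
    return total


# Hypothetical four-taxon example; names, lengths, and sequences are made up.
example_tree = [
    ("human", 0.1),
    ("chimp", 0.1),
    ([("mouse", 0.2), ("rat", 0.2)], 0.3),
]
example_alignment = {
    "human": "ACGTAC",
    "chimp": "ACGTAT",
    "mouse": "ACTTGC",
    "rat": "ACTTGT",
}
example_lnL = log_likelihood(example_tree, example_alignment)
```

In a maximum-likelihood analysis, a numerical optimizer would repeatedly call a function like log_likelihood while adjusting branch lengths and model parameters. That per-evaluation cost is why the abstract emphasises writing the inner loops in C/C++ and parallelising parts of the optimizer.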