Protein Sequence Comparison and Protein Evolution

William R. Pearson
Department of Biochemistry
Jordan Hall Box 800733
University of Virginia, Charlottesville, VA 22908, USA
FAX: (804) 924-5069; email: wrp@virginia.EDU

Synopsis

The combination of rapid sequence comparison algorithms, powerful computers, and accurate statistical estimates has fundamentally changed the practice of biochemistry and molecular biology. With the possible exceptions of E. coli and yeast, the vast majority of the genes in newly sequenced genomes are characterized by sequence similarity searching. BLAST, FASTA, and Smith-Waterman similarity searches provide the most informative and reliable method for inferring the biological function of an anonymous gene (or the protein that it encodes). Typically 60-80% of eubacterial (and yeast) genes share statistically significant sequence similarity with sequences from another organism. Significant sequence similarity can be used to infer common ancestry and similar three-dimensional structures, and is routinely used to assign functions in metabolic pathways. Even for the first archaebacterial genome (M. jannaschii), similarity-based functional gene assignments could be made for about 50% of the genes; subsequent sequence analyses suggested functions for another 20% of the genes.

This tutorial examines how the information conserved during the evolution of a protein molecule can be used to reliably infer homology, and thus a shared protein fold and possibly a shared function. We will start by reviewing the geological/evolutionary time scale; many homologous proteins can be identified that diverged 1-2 billion years ago. Next we will look at the evolution of several protein families. During the tutorial, these families will be used to demonstrate that homologous protein ancestry can be inferred with confidence. We will then examine the technical aspects of protein sequence comparison. We will survey the statistics of local similarity scores, and how these statistics can be used both to improve the selectivity of a search and to evaluate the significance of a match. Finally, we will examine distantly related members of three protein families: serine proteases, glutathione transferases, and G-protein-coupled receptors (GCRs). Strategies for identifying distant relationships in these families will be examined.

Details

A. Introduction to Protein Evolution

Evolutionary time scales
Modes of Evolution
- Conventional divergence - orthology - cytochrome 'c'
- Gene duplication - paralogy - globins
Sequence similarity and homology, the H+ ATPase family
- Distribution of similarity scores
- Sequence alignments
- The PAM250 matrix
- Moving through the evolutionary tree
Protein families diverge at different rates
Classification of Protein families - Ancient, Middle-aged, Modern
Mosaic proteins
DNA vs Protein comparison

B. Sequence Comparison Algorithms

Dynamic Programming Algorithms - global, local
Dynamic Programming - step-by-step
Heuristic Algorithms - BLAST and FASTA
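
As a rough illustration of the local dynamic programming step listed above, the following minimal sketch computes a Smith-Waterman local similarity score. The function name, the match/mismatch scores, and the linear gap penalty are assumptions chosen for brevity; a real protein search would use a substitution matrix such as PAM250 and usually affine gap penalties.

```python
def local_similarity_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b.

    Illustrative scoring only (assumed values): identical residues
    score `match`, different residues score `mismatch`, and each
    gapped position costs `gap` (linear gap penalty).
    """
    prev = [0] * (len(b) + 1)   # previous row of the score matrix
    best = 0                    # best score seen in any cell
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(local_similarity_score("XXACGTXX", "YYACGTYY"))
```

The zero floor in the recurrence is what makes the alignment local: a poorly scoring prefix is dropped rather than carried forward. A global alignment would remove that floor and initialize the first row and column with accumulated gap penalties.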

C. Statistics of Local Similarity Scores

The extreme value distribution
Scoring matrices re-examined
Estimating statistical parameters
Low complexity regions
Statistical significance - search based or shuffled?
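
The extreme value statistics listed in this section are commonly written in the Karlin-Altschul form, E = K m n exp(-lambda S), with P = 1 - exp(-E). The sketch below evaluates those two expressions; the function names are hypothetical, and the lambda and K values in the usage example are placeholders rather than fitted parameters, which in practice depend on the scoring matrix, gap penalties, and sequence composition.

```python
import math


def expected_matches(score, query_len, db_len, lam, k):
    """Expected number of chance matches scoring >= `score`.

    Uses E = K * m * n * exp(-lambda * S), where m is the query
    length and n is the total library length. `lam` and `k` must be
    supplied by the caller (assumed to be estimated elsewhere).
    """
    return k * query_len * db_len * math.exp(-lam * score)


def match_probability(e_value):
    """Probability of at least one chance match: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # Placeholder parameters for illustration only.
    e = expected_matches(score=100, query_len=300, db_len=10_000_000,
                         lam=0.3, k=0.1)
    print(e, match_probability(e))
```

Because E grows linearly with library size but falls exponentially with score, the same raw score is less significant in a larger library, which is one reason search-based and shuffle-based significance estimates can differ.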