
Protein Sequence Comparison and Protein Evolution

William R. Pearson
Department of Biochemistry
Jordan Hall Box 800733
University of Virginia, Charlottesville, VA 22908, USA
FAX: (804) 924-5069; email: wrp@virginia.EDU


The combination of rapid sequence comparison algorithms, powerful computers, and accurate statistical estimates has fundamentally changed the practice of biochemistry and molecular biology. With the possible exceptions of E. coli and yeast, the vast majority of the genes in newly sequenced genomes are characterized by sequence similarity searching. BLAST, FASTA, and Smith-Waterman similarity searches provide the most informative and reliable method for inferring the biological function of an anonymous gene (or the protein that it encodes). Typically 60-80% of eubacterial (and yeast) genes share statistically significant sequence similarity with sequences from another organism. Significant sequence similarity can be used to infer common ancestry and similar three-dimensional structures, and is routinely used to assign functions in metabolic pathways. Even for the first archaebacterial genome (M. jannaschii), similarity-based functional gene assignments could be made for about 50% of the genes; subsequent sequence analyses suggested functions for another 20% of the genes.

This tutorial examines how the information conserved during the evolution of a protein molecule can be used to reliably infer homology, and thus a shared protein fold and possibly a shared function. We will start by reviewing the geological/evolutionary time scale; many homologous proteins can be identified that diverged 1-2 billion years ago. Next we will look at the evolution of several protein families. During the tutorial, these families will be used to demonstrate that homologous protein ancestry can be inferred with confidence. We will then examine the technical aspects of protein sequence comparison. We will survey the statistics of local similarity scores, and how these statistics can be used both to improve the selectivity of a search and to evaluate the significance of a match. We will then examine distantly related members of three protein families: serine proteases, glutathione transferases, and G-protein-coupled receptors (GCRs). Strategies for identifying distant relationships in these families will be examined.


A. Introduction to Protein Evolution

  1. Evolutionary time scales
  2. Modes of Evolution
  3. Sequence similarity and homology, the H+ ATPase family
  4. Protein families diverge at different rates
  5. Classification of protein families - Ancient, Middle-aged, Modern
  6. Mosaic proteins
  7. DNA vs Protein comparison

B. Sequence Comparison Algorithms

  1. Dynamic Programming Algorithms - global, local
  2. Dynamic Programming - step-by-step
  3. Heuristic Algorithms - BLAST and FASTA

C. Statistics of Local Similarity Scores

  1. The extreme value distribution
  2. Scoring matrices re-examined
  3. Estimating statistical parameters
  4. Low complexity regions
  5. Statistical significance - search based or shuffled?

D. Identifying distantly related protein sequences

  1. Serine proteases
  2. Glutathione transferases
  3. G-protein-coupled receptors

E. Internal duplications in proteins

  1. Internal duplications in calmodulin
  2. Mosaic domains shared by the EGF-precursor and LDL-receptor
  3. Coiled-coil structures share local similarity

F. Summary

G. Suggested Reading