Protein Folding and Protein Structure Prediction

Ram Samudrala, Department of Structural Biology, Stanford University School of Medicine, Stanford, CA 94305


The tutorial will focus on prediction of protein structure from sequence. After a brief introduction outlining the problems involved in predicting protein structure, the three main approaches to structure prediction (comparative modelling, fold recognition, and ab initio) will be described. In each category, different methodologies that have been successful in blind prediction experiments will be explained in detail. Heavy emphasis will be placed on the ab initio methods and the recent results from the blind predictions at the third meeting on the Critical Assessment of Protein Structure Prediction methods (CASP3). An overview of ab initio simulation methods to predict structure, such as molecular dynamics, molecular mechanics, Monte Carlo, simulated annealing, and genetic algorithms, will be provided.




Proteins form the very basis of life. They regulate a variety of activities in all known organisms, from replication of the genetic code to transporting oxygen, and are generally responsible for regulating the cellular machinery and, consequently, the phenotype of an organism. Proteins accomplish their tasks through three-dimensional tertiary and quaternary interactions with various substrates, such as DNA, RNA, and other proteins. Knowing the structure of a protein is therefore a prerequisite for a thorough understanding of its function.

The protein folding problem

Once a protein sequence has been determined, deducing its unique three-dimensional (3D) native structure is a daunting task. Experimental methods for determining detailed protein structure, such as X-ray diffraction studies and nuclear magnetic resonance (NMR) analyses, are highly labour intensive. Since it was discovered that proteins are capable of folding into their unique functional 3D structures without any additional genetic mechanisms, over 25 years of effort have been devoted to predicting 3D structure from sequence. Despite this effort, the protein folding or protein structure prediction problem, as it has come to be known, remains largely unsolved.

Knowing the structure of a protein sequence enables us to probe the function of the protein, understand substrate and ligand binding, devise intelligent mutagenesis and biochemical protein engineering experiments that improve specificity and stability, perform rational drug design, and design novel proteins. Understanding structure has potential applications in the various genome projects being undertaken, such as mapping the functions of proteins in metabolic pathways for whole genomes and deducing evolutionary relationships. The protein folding problem is therefore one of the most fundamental unsolved problems in computational molecular biology today.

Methods for protein structure prediction

There are three major theoretical methods for predicting the structure of proteins: comparative modelling, fold recognition, and ab initio prediction.

Comparative modelling

Comparative modelling exploits the fact that evolutionarily related proteins with similar sequences, as measured by the percentage of identical residues at each position based on an optimal structural superposition, have similar structures. The similarity of structures is very high in the so-called "core regions", which typically consist of a framework of secondary structure elements such as alpha-helices and beta-sheets. Loop regions connect these secondary structures and generally vary even between pairs of homologous structures with a high degree of sequence similarity.
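
As a rough illustration of the similarity measure mentioned above, the following Python sketch computes the percentage of identical residues over the aligned positions of two sequences. It assumes the alignment is already given as two equal-length strings with "-" marking gaps; the function name `percent_identity` and the choice to ignore gapped columns are illustrative assumptions, not a prescribed convention.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of aligned (non-gap) columns that hold identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only the columns where both sequences have a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# Five aligned columns, four of them identical.
identity = percent_identity("ACDE-G", "ACEEFG")
```

Other normalisations (for example, dividing by the length of the shorter sequence) are also in use, so reported identities are only comparable when the same definition is applied.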

The process of building a comparative model is conceptually straightforward. First, an alignment is performed between the sequence whose structure has been determined by experimental methods (the parent) and the sequence to be modelled (the target). This sequence alignment is used to construct an initial model (sometimes referred to as a framework or template) by copying over some main chain and side chain coordinates from the parent structure based on the equivalent residue in the sequence alignment. Side chains must be built for residues in the target that do not correspond to an identity in the alignment, and for residues where the side chain conformation is thought to vary in the target relative to the parent structure. Main chains must be built in the case of insertions, regions surrounding a deletion, and in other regions of suspected main chain variation.
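
The copying step can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions: each parent residue is represented by a single coordinate (for example its C-alpha atom), the alignment is given as two gapped strings, and the name `build_framework` is hypothetical. It only marks which target residues would need side chains or main chain built; it does not build them.

```python
from typing import List, Optional, Tuple

Coord = Tuple[float, float, float]


def build_framework(parent_aln: str, target_aln: str,
                    parent_coords: List[Coord], gap: str = "-"):
    """Copy parent coordinates onto aligned target residues.

    Returns (model, rebuild_side_chains). model[i] is the copied coordinate
    for target residue i, or None where the target has an insertion and the
    main chain must be built. rebuild_side_chains lists target residues
    whose type differs from the aligned parent residue.
    """
    if len(parent_aln) != len(target_aln):
        raise ValueError("aligned sequences must have equal length")
    if len(parent_aln.replace(gap, "")) != len(parent_coords):
        raise ValueError("one coordinate is required per parent residue")
    model: List[Optional[Coord]] = []
    rebuild_side_chains: List[int] = []
    p = 0  # index of the next parent residue
    for pa, ta in zip(parent_aln, target_aln):
        if pa == gap and ta == gap:
            continue
        if ta == gap:          # deletion in the target: skip parent residue
            p += 1
            continue
        if pa == gap:          # insertion in the target: nothing to copy
            model.append(None)
            continue
        model.append(parent_coords[p])
        if pa != ta:           # not an identity: side chain must be built
            rebuild_side_chains.append(len(model) - 1)
        p += 1
    return model, rebuild_side_chains


parent_xyz = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
model, rebuild = build_framework("ACD-F", "AG-EF", parent_xyz)
```

In this toy alignment the second target residue is a substitution (side chain to be built) and the third is an insertion (main chain to be built).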

Fold recognition or "threading"

Threading uses a database of known three-dimensional structures to match sequences of unknown structure with protein folds. This is accomplished with the aid of a scoring function that assesses the fit of a sequence to a given fold. These functions are usually derived from a database of known structures and generally include pairwise atom contact and solvation terms. Threading methods compare a target sequence against a library of structural templates, producing a list of scores. The scores are then ranked, and the fold with the best score is assumed to be the one adopted by the sequence. The methods for fitting a sequence against a library of folds can be computationally elaborate, such as those involving double dynamic programming, dynamic programming with the frozen approximation, Gibbs sampling using a database of "threading" cores, and branch-and-bound heuristics, or as "simple" as using sophisticated sequence alignment methods such as Hidden Markov Models.
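
The score-and-rank step can be illustrated with a deliberately crude Python sketch. Everything here is an assumption made for illustration: the scoring function is a toy solvation-like term (hydrophobic residues are rewarded in buried positions, polar residues in exposed ones), templates are strings of "b" (buried) and "e" (exposed), the fit is gapless, and lower scores are treated as better. Real threading potentials are derived from databases of known structures and are far richer.

```python
from typing import Dict, List, Tuple

# Hypothetical hydrophobic residue set used only by this toy score.
HYDROPHOBIC = set("AVLIMFWC")


def fit_score(sequence: str, environments: str) -> int:
    """Toy fit of a sequence to a fold, position by position (lower is better)."""
    if len(sequence) != len(environments):
        raise ValueError("gapless fit needs equal lengths")
    score = 0
    for residue, env in zip(sequence, environments):
        buried = env == "b"
        hydrophobic = residue in HYDROPHOBIC
        # Reward hydrophobic-buried and polar-exposed, penalise the rest.
        score += -1 if buried == hydrophobic else 1
    return score


def rank_folds(sequence: str, library: Dict[str, str]) -> List[Tuple[int, str]]:
    """Score the sequence against every template and sort, best first."""
    scores = [(fit_score(sequence, envs), name)
              for name, envs in library.items()
              if len(envs) == len(sequence)]
    return sorted(scores)


library = {"fold_a": "bebe", "fold_b": "ebeb", "fold_c": "bbee"}
ranking = rank_folds("LKVE", library)
best_fold = ranking[0][1]
```

The top-ranked template is then taken as the predicted fold, exactly as in the ranking step described above; allowing gaps in the fit is what makes the real problem computationally demanding.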

Ab initio prediction

The ab initio approach is a mixture of science and engineering. The science lies in understanding how the three-dimensional structure of proteins is attained. The engineering lies in deducing the three-dimensional structure given the sequence. The biggest challenge of the folding problem lies in ab initio prediction, which can be broken down into two components: devising a scoring function that can distinguish correct (native or native-like) structures from incorrect (non-native) ones, and devising a search method to explore the conformational space. In many ab initio methods, the two components are coupled such that the search method drives, and is driven by, the scoring function to find native-like structures.

Currently there is no reliable and general scoring function that can always drive a search to a native fold, and there is no reliable and general search method that can sample conformational space adequately enough to guarantee a significant fraction of near-native structures (< 3.0 angstroms RMSD from the experimental structure).
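
The near-native criterion can be made concrete with a short Python sketch of the root-mean-square deviation (RMSD) between two sets of equivalent atoms. It assumes that the two coordinate sets have the same atoms in the same order and have already been optimally superposed; the superposition step itself is not shown. The names `rmsd` and `is_near_native` are illustrative, and the 3.0 angstrom cutoff is the one quoted above.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Coord], coords_b: Sequence[Coord]) -> float:
    """RMSD between equivalent atoms of two pre-superposed structures."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate sets of equal size")
    total = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def is_near_native(model: Sequence[Coord], native: Sequence[Coord],
                   cutoff: float = 3.0) -> bool:
    """True if the model lies within the cutoff (in angstroms) of the native."""
    return rmsd(model, native) < cutoff


native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
close_model = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
far_model = [(3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
```

Here `close_model` deviates by about 1.4 angstroms and counts as near-native, while `far_model` sits exactly at 3.0 angstroms and does not pass the strict inequality.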

Methods for ab initio prediction include: Molecular Dynamics (MD) simulations of proteins and protein-substrate complexes, which provide a detailed and dynamic picture of the nature of inter-atomic interactions with regard to protein structure and function; Monte Carlo (MC) simulations, which do not use forces but rather compare energies via Boltzmann probabilities; Genetic Algorithms, which try to improve on the sampling and convergence of MC approaches; and exhaustive and semi-exhaustive lattice-based studies, which use a crude or approximate fold representation (such as two residues per lattice point) and then explore all or large portions of the conformational space that this representation allows.
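
Two of these ideas are small enough to sketch in Python. The first function shows the Metropolis form of the Boltzmann acceptance test commonly used in Monte Carlo simulations. The second part exhaustively enumerates conformations on a two-dimensional square lattice. The lattice model is an assumption chosen for brevity: it places one residue per lattice point (not two, as in the example above), uses a two-letter hydrophobic/polar (H/P) alphabet, and scores -1 for every pair of H residues that are lattice neighbours but not chain neighbours. All function names are hypothetical.

```python
import math
import random
from typing import Callable, List, Tuple

Point = Tuple[int, int]
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def metropolis_accept(delta_e: float, temperature: float,
                      rand: Callable[[], float] = random.random) -> bool:
    """Accept downhill moves always, uphill moves with probability exp(-dE/T)."""
    if delta_e <= 0:
        return True
    return rand() < math.exp(-delta_e / temperature)


def hp_energy(sequence: str, path: List[Point]) -> int:
    """Toy energy: -1 per non-bonded H-H contact on the lattice."""
    index = {point: i for i, point in enumerate(path)}
    energy = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != "H":
            continue
        for dx, dy in MOVES:
            j = index.get((x + dx, y + dy))
            # j > i + 1 counts each contact once and skips chain neighbours.
            if j is not None and j > i + 1 and sequence[j] == "H":
                energy -= 1
    return energy


def enumerate_folds(sequence: str) -> Tuple[int, int]:
    """Visit every self-avoiding walk; return (lowest energy, walks visited)."""
    best = 0
    count = 0

    def extend(path: List[Point]) -> None:
        nonlocal best, count
        if len(path) == len(sequence):
            count += 1
            best = min(best, hp_energy(sequence, path))
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in path:  # keep the chain self-avoiding
                extend(path + [nxt])

    extend([(0, 0)])
    return best, count


lowest, visited = enumerate_folds("HPPH")
```

The number of walks grows exponentially with chain length, which is why exhaustive enumeration is only feasible for short chains or crude representations, and why stochastic searches such as Monte Carlo are used otherwise.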

What can structure prediction do for us?

Given the large volume of genes being sequenced, the rate at which new protein sequences are determined is growing exponentially relative to the rate at which protein structures are solved by experimental methods. In many situations, even a crude or approximate model can significantly help experimentalists in guiding their experiments. Thus, even though current methods are still in their infancy, predicting structures for all protein sequences of complete genomes in conjunction with experimental work is a realistic goal. Structural analyses on demand of proteins for further mutagenesis, substrate and inhibitor design, and enhanced function and stability are also possible, as is analysis of basic functional behaviour on demand using time-tested methods such as molecular dynamics simulations. These methods can use structural data and methods for structure prediction to probe protein and organismal function and evolution.