Protein class recognition with neural networks

Vadim Valuev1
1valuev@bionet.nsc.ru, Institute of Cytology and Genetics

At present, there are three main structure-functional classifications of proteins: SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/), CATH (http://www.biochem.ucl.ac.uk/bsm/cath/) and FSSP (http://www2.ebi.ac.uk/dali/fssp/fssp.html). SCOP is built on a structure-functional basis, CATH relies on a more automated approach, and FSSP is derived from an all-against-all structural comparison. For all their differences, the classifications agree that at the highest level of the hierarchy the overwhelming majority of proteins fall into a few distinct groups whose members share neither a firmly established evolutionary relationship nor functional or structural similarity, but only their secondary-structure content. Recognition of which of the four classes a protein belongs to has been addressed several times before. Chou and Zhang (1995) and Bahar et al. (1997) used the linear Fisher discriminant. Dubchak et al. (1993) classified proteins with a neural network and used sequence length as an additional parameter.
We also use the amino acid composition of a protein as the criterion, since it is the simplest and the most effective of the simple criteria, and a neural network for classification. Our work differs from previous ones not in its methods but in an attempt to treat the problem more systematically. Instead of forming classes by arbitrary criteria, we applied the method to the SCOP classification, which covers all proteins of known three-dimensional structure, restricting ourselves to the selection of proteins with less than 40% homology (file pdb40d_1.37). The data were randomly divided into training and test sets of approximately equal size.
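The random division into training and test sets of roughly equal size can be sketched as follows. This is only an illustration: the function name `split_half` and the fixed seed are assumptions, and the abstract does not say how the division was actually performed.

```python
import random

def split_half(items, seed=0):
    """Shuffle a copy of `items` and cut it into two near-equal halves.

    Hypothetical helper: the seed only makes the illustration reproducible.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    # The first half serves as the training set, the second as the test set.
    return shuffled[:mid], shuffled[mid:]

# Example with placeholder protein identifiers (not real SCOP entries).
train, test = split_half(["p%d" % i for i in range(11)])
print(len(train), len(test))
```

Applying such a split separately to each class would keep the class proportions similar in both halves; whether this was done here is not stated.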
The features were the frequencies of occurrence of each amino acid in the sequence, that is, 20 numbers in the range from 0 to 1 that sum to 1. For discrimination we used neural networks with one hidden layer: 20 input neurons, 1 output neuron, and from 3 to 20 neurons in the hidden layer. Each network was thus trained to recognise only one class against all the others.
We did not succeed in improving on the performance of previous works, but that was not our ultimate goal (our results for the first three classes are nevertheless no worse than those of others). Unlike other authors, we did not confine ourselves to selections of a few tens of proteins, but applied slightly modified standard methods (a neural network and the linear Fisher discriminant) to analyse a global structural classification of proteins. Some increase in accuracy could probably be achieved by introducing additional features (for example, the chain length, as in Dubchak et al., 1993). The results allow several conclusions. First, a linear discriminating function such as the linear Fisher discriminant cannot, in the general case, distinguish any of the four structural classes from the rest on the basis of amino acid composition. Nevertheless, a non-linear method permits rather effective classification, which indicates that the predominance of one or another type of folding is indeed connected with physical causes stemming from the properties of individual amino acids. At the same time, while the character of recognition (the number of learning cycles and the accuracy on the training and test sets) is virtually the same for three of the classes (alpha, beta and alpha/beta), the fourth class (alpha+beta) resists recognition at the same network parameters, and merging it with the alpha/beta class does not give good results either. This is probably due to the heterogeneity of this class, which is also absent from one of the three main structure-functional classifications, CATH (Orengo et al., 1999).
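The feature vector and network shape described above (20 composition inputs, one hidden layer, one output) can be illustrated with a minimal sketch. The sigmoid activation, the random weight initialisation, the skipping of non-standard residue symbols and all function names are assumptions made for illustration; the abstract does not specify them, and training is not shown.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues

def composition(sequence):
    """Return the 20 residue frequencies (each in [0, 1], summing to 1)."""
    counts = [0] * len(AMINO_ACIDS)
    for residue in sequence.upper():
        index = AMINO_ACIDS.find(residue)
        if index >= 0:  # assumption: non-standard symbols (X, B, Z, ...) are skipped
            counts[index] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [count / total for count in counts]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_network(n_hidden, n_inputs=20, seed=0):
    """Random weights for a one-hidden-layer network; last weight of each row is the bias."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output

def forward(network, features):
    """Score in (0, 1): 'belongs to the class' versus 'all other classes'."""
    hidden, output = network
    # zip stops at the shorter argument, so the trailing bias is added separately.
    activations = [sigmoid(w[-1] + sum(wi * x for wi, x in zip(w, features)))
                   for w in hidden]
    return sigmoid(output[-1] + sum(wo * a for wo, a in zip(output, activations)))

# Untrained network with 3 hidden neurons applied to a made-up sequence.
features = composition("MKTAYIAKQRQISFVKSHFSRQ")
print(forward(init_network(n_hidden=3), features))
```

With one such network per class, four networks in total would cover the one-against-all setting described in the text.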