Non-negative matrix factorization for gene expression and scientific texts analysis

A. D. Pascual-Montano (1), P. Carmona-Saez (2), M. Chagoyen and J. M. Carazo
(1) pascual@cnb.uam.es, National Center of Biotechnology, Madrid, Spain; (2) pcarmona@cnb.uam.es, National Center of Biotechnology, Madrid, Spain

DNA microarrays can produce a large amount of information about gene expression levels across different experimental conditions. One of the most common analyses of these results consists of finding information about the structure of the data, such as similarity between genes and/or experiments. Such information can then be used to help understand the underlying biological processes. Clustering is one of the most widely used techniques in this type of analysis; it aims to find groups of genes with similar behavior patterns. However, clustering focuses on similarities between genes across all experimental conditions at the same time, so more focused local patterns hidden in the data are missed. In this work we have explored the potential use of the Non-negative Matrix Factorization (NMF) technique to find hidden biological information in gene expression data and scientific literature. NMF has recently been proposed for dimensionality reduction [Lee, D.D. and Seung, H.S., Learning the parts of objects by non-negative matrix factorization. Nature, 1999. 401(6755): p. 788-91]. Unlike classical matrix factorization methods such as Principal Component Analysis (PCA) or Factor Analysis, which learn global patterns present in the whole data set, NMF is able to learn a parts-based representation of the data, allowing the extraction of hidden localized patterns. We applied NMF to gene expression data to identify highly correlated genes and experiments that behave in a similar manner in only a subportion of the data. In addition, we have also applied this methodology to free-form text in MEDLINE abstracts for a semantic analysis of scientific articles. Contact: pascual@cnb.uam.es
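To make the idea of a non-negative factorization concrete, the following is a minimal sketch. It is not the authors' implementation. It assumes a genes-by-experiments matrix V and approximates it as V ≈ W H with all entries non-negative, using a widely used multiplicative-update scheme for the squared-error objective. The abstract does not state which NMF variant or objective was used, so that choice is an assumption here. The function name `nmf`, its parameters, and the toy matrix are hypothetical and chosen only for illustration.

```python
import numpy as np


def nmf(V, rank, n_iter=1000, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V as W @ H with W, H >= 0.

    Uses multiplicative updates for the squared-error objective
    (an assumed variant; the abstract does not specify one).
    """
    V = np.asarray(V, dtype=float)
    if np.any(V < 0):
        raise ValueError("NMF requires a non-negative input matrix")
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    # Random positive initialisation keeps every update well defined.
    W = rng.random((n_rows, rank)) + eps
    H = rng.random((rank, n_cols)) + eps
    for _ in range(n_iter):
        # Element-wise updates; non-negativity is preserved because
        # every factor in each ratio is non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Made-up toy data: rows play the role of genes, columns of experiments.
# Genes 0-2 are active only in experiments 0-2, genes 3-5 only in 3-5.
V = np.full((6, 6), 0.1)   # small background level
V[:3, :3] += 5.0
V[3:, 3:] += 3.0

W, H = nmf(V, rank=2)
error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# Scale-invariant contribution of each factor to each gene, used to
# assign every gene to the local pattern that explains most of it.
dominant = (W * H.sum(axis=1)).argmax(axis=1)

print("relative reconstruction error:", error)
print("dominant factor per gene:", dominant)
```

In this toy case each row of H describes a pattern confined to a subset of experiments, and each column of W gives the genes that take part in it, which is the kind of localized structure the abstract refers to. The same sketch could be run on a non-negative term-by-document matrix built from abstracts, although the preprocessing of text is not covered here.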