Developing Analysis and Visualization Tools for Lead Discovery

Dimitri Petrov1, Shumei Jiang2, Andrey Santrosyan, Hayk Asatryan, Kaisheng Chen, Chris Benner, Robert Downs, John Isbell, Yingyao Zhou
1dpetrov@gnf.org; 2sjiang@gnf.org

The Genomics Institute of the Novartis Research Foundation (GNF) is developing data analysis and visualization tools on top of a web-based informatics system to support its lead discovery research. These tools allow biologists and medicinal chemists to query a structure-activity matrix of one million compounds across more than 150 biological assays. The dose-response data fitting and visualization system introduces several features not available in commercial packages: a constraint-based optimization routine, which guarantees a biologically meaningful solution; an automatic outlier detection routine; a Monte Carlo-based error estimation routine; and a fuzzy value determination routine. The compound structural clustering tool uses the fingerprint-based Tanimoto distance metric and the complete-linkage hierarchical clustering algorithm; its results are currently visualized with the TreeView program. The compound diversity analysis tool identifies all ring components in a compound list and compiles statistics on ring usage. From the resulting table, biologists can quickly identify the most common scaffolds in a compound list, and chemists can evaluate the diversity of compound libraries under consideration for purchase. The analytical chemistry system provides interactive query tools for LC-MS data. Data generated by the instruments are encoded in an XML format. The Plate Java applet provides a global view of purity information for arbitrary plate formats; the chromatogram applet displays both the chromatogram and the corresponding mass spectra.
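The clustering step described above can be illustrated with a minimal sketch. It is not the GNF implementation, whose details the abstract does not give. The sketch assumes that each fingerprint is stored as a set of on-bit indices, and that merging stops at a user-chosen distance threshold. The names tanimoto_distance and complete_linkage and the compound labels are hypothetical.

```python
from itertools import combinations


def tanimoto_distance(a: frozenset, b: frozenset) -> float:
    """Return 1 - |a & b| / |a | b| for fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints count as identical
    return 1.0 - len(a & b) / union


def complete_linkage(fingerprints: dict, threshold: float) -> list:
    """Agglomerate compounds until the closest pair of clusters exceeds `threshold`.

    Under complete linkage, the distance between two clusters is the largest
    pairwise distance between their members.
    """
    clusters = [[name] for name in fingerprints]

    def cluster_distance(c1, c2):
        return max(
            tanimoto_distance(fingerprints[x], fingerprints[y])
            for x in c1
            for y in c2
        )

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest complete-linkage distance.
        (i, j), d = min(
            (
                ((i, j), cluster_distance(clusters[i], clusters[j]))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if d > threshold:
            break
        # i < j always holds, so deleting index j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    # Hypothetical fingerprints, for illustration only.
    example = {
        "cpd_A": frozenset({1, 2, 3, 4}),
        "cpd_B": frozenset({1, 2, 3, 5}),
        "cpd_C": frozenset({10, 11, 12}),
        "cpd_D": frozenset({10, 11, 13}),
    }
    print(complete_linkage(example, threshold=0.6))
```

This naive version recomputes every cluster distance on each pass, so it is suitable only for small lists. For a matrix on the scale of one million compounds, a library routine or a pre-filtering step would be needed.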