A technology for integration of databases with common subject domains

Maria Samsonova1, Andrei Pisarev2, Maxim Blagov
1samson@spbcas.ru, SPbSPU; 2pisarev@spbcas.ru, SPbSPU

We present a novel approach to the integration of distributed molecular biology information resources, which is based on processing of natural language queries and application of multiagent technology.

The integrated system consists of agents that access heterogeneous data sources, natural language processing agents and user interface agents. A coordinating agent manages the interaction between agents, distributes queries optimally with regard to the actual load on the system, and keeps the system running without interruption when new agents are added or old ones are removed. The natural language processing agent maps the grammatical and lexical units of a natural language query onto concepts of the subject domain, as described in [1].
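The coordinating role described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names, the mirror names and the use of the number of in-flight queries as the measure of load are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class SourceAgent:
    """Hypothetical wrapper around one data source or database mirror."""
    name: str
    handler: Callable[[str], str]  # executes a query and returns a result
    active: int = 0                # queries currently in flight (assumed load measure)


class Coordinator:
    """Hypothetical coordinating agent: routes each query to the least-loaded source."""

    def __init__(self) -> None:
        self._agents: Dict[str, SourceAgent] = {}

    def register(self, agent: SourceAgent) -> None:
        # Agents can join while the system is running.
        self._agents[agent.name] = agent

    def unregister(self, name: str) -> None:
        # Agents can leave while the system is running; unknown names are ignored.
        self._agents.pop(name, None)

    def pick(self) -> SourceAgent:
        if not self._agents:
            raise RuntimeError("no agents registered")
        # Choose the agent with the fewest queries in flight.
        return min(self._agents.values(), key=lambda a: a.active)

    def dispatch(self, query: str) -> str:
        agent = self.pick()
        agent.active += 1
        try:
            return agent.handler(query)
        finally:
            # Release the slot even if the handler raises.
            agent.active -= 1
```

In this sketch a query goes to whichever registered agent currently has the fewest queries in flight, and removing an agent simply takes it out of the candidate set for later queries.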

To demonstrate the feasibility of the approach, we integrate information from several databases that contain data on the expression of segmentation genes in the fruit fly Drosophila: FlyEx and the FlyEx mirror at the University of New York at Stony Brook, Mooshka, FlyBase and the In situ Database. The integrated system is available at http://urchin.spbcas.ru/NLP/NLP.htm.

Our approach makes it possible to integrate any information resources that share a common subject domain, whether published on the Internet or stored locally. Its benefits include the ability to formulate arbitrary queries in several languages (currently English and Russian), optimal transformation of queries from natural language to SQL, and the ability to present information visually as hyperschemata. Other advantages are simple access to information, simple integration of new databases, adaptivity to changes in the knowledge domain and in the user's views, increased robustness of the system, and optimized distribution of the query load across several database mirrors.
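As a rough illustration of translating a natural language query into SQL through domain concepts, the following Python sketch uses a toy lexicon. It is not the method of [1] and makes no claim to optimality: the lexicon entries, the table name expression_data and the column names are hypothetical and chosen only for this example.

```python
import re
from typing import List, Tuple

# Hypothetical lexicon: surface word (English or Russian) -> domain concept.
LEXICON = {
    "gene": "gene", "ген": "gene", "гена": "gene",
    "stage": "stage", "стадия": "stage", "стадии": "stage",
}

# Hypothetical mapping from domain concept to a column of an assumed table.
COLUMNS = {"gene": "gene_name", "stage": "stage"}


def to_sql(question: str) -> Tuple[str, List[str]]:
    """Turn 'concept value' word pairs into a parameterized SELECT statement."""
    tokens = re.findall(r"\w+", question.lower())
    conditions: List[str] = []
    params: List[str] = []
    # Look at each word together with the word that follows it.
    for word, value in zip(tokens, tokens[1:]):
        concept = LEXICON.get(word)
        if concept is not None and value not in LEXICON:
            conditions.append(f"{COLUMNS[concept]} = ?")
            params.append(value)
    sql = "SELECT * FROM expression_data"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params
```

Because words from both languages map to the same concepts, an English and a Russian question about the same gene produce the same SQL text in this sketch; the values are passed as parameters rather than spliced into the statement.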

Acknowledgements. The support of NIH grant RR07801 is gratefully acknowledged.

1. M. Samsonova, A. Pisarev and M. Blagov. Processing of natural language queries to a relational database (2003) Bioinformatics, 19, Suppl. 1, i241-i249.