An architecture for a modularized gene information retrieval and summarization tool: BioRetrieve

Anton Bergheim1, Sheila Rock2
1anton@cs.wits.ac.za, University of the Witwatersrand; 2sheila@cs.wits.ac.za, University of the Witwatersrand

The greatest difficulty for the biologist today is no longer acquiring information but finding relevant information in the increasing mass of knowledge available. This repository of information, which is becoming known as the "biobibliome", is larger and faster growing than the human genome sequence itself.

The ability to process natural-language information computationally is becoming a necessity for the geneticist. Identification of disease genes is going through a major shift, from the identification of single-gene disorders to those caused through multiple inheritance. Complex disease trait mapping often results in linkage to a number of large regions, each potentially containing hundreds of genes. Analysis of all these genes is in most cases financially and practically infeasible, so the researcher must prioritize the investigation of likely genes according to a number of criteria. The overall goal of BioRetrieve is the construction of an automated system which outputs a ranking of a set of genes as well as summarized information, when known, about each of these genes. We present here the first steps toward the development of this tool. The architecture is designed top-down and follows the principle of modularity; decisions about the actual framework define the architecture itself. Each module has been designed to be as independent as possible in an attempt to eliminate the monolithic nature of most existing information retrieval systems, such as MedMiner and BioSifter (Palakal et al., 2002; Tanabe et al., 1999).

Modules can be broadly classified into the following categories: query input; information retrieval; information extraction; first-order knowledge base builders; output; information ranking; information confidence and conflict.
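One way to picture these categories is as a registry in which independent modules are filed under the category they belong to. The Python sketch below is illustrative only: the names `Category`, `Registry`, `register`, and `modules` are assumptions made for this example and are not taken from BioRetrieve itself.

```python
from enum import Enum
from typing import Any, Callable, Dict, List


class Category(Enum):
    """The module categories listed above (identifiers are hypothetical)."""
    QUERY_INPUT = "query input"
    INFORMATION_RETRIEVAL = "information retrieval"
    INFORMATION_EXTRACTION = "information extraction"
    KNOWLEDGE_BASE_BUILDER = "first-order knowledge base builder"
    OUTPUT = "output"
    INFORMATION_RANKING = "information ranking"
    CONFIDENCE_AND_CONFLICT = "information confidence and conflict"


# Assumption: a module is any callable that takes one input and returns one output.
Module = Callable[[Any], Any]


class Registry:
    """Keeps modules grouped by category so that each can be swapped independently."""

    def __init__(self) -> None:
        self._modules: Dict[Category, List[Module]] = {c: [] for c in Category}

    def register(self, category: Category, module: Module) -> None:
        self._modules[category].append(module)

    def modules(self, category: Category) -> List[Module]:
        # Return a copy so callers cannot mutate the registry by accident.
        return list(self._modules[category])
```

Because each module is registered only under its category, a retrieval module for a new source could be added in this sketch without touching extraction or output code.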

Information is gathered through a user-defined query as well as two sets of user-defined keywords. The first set of keywords contains information relevant to the search itself (the gene region or a list of genes), while the second contains information relevant to the disease, such as phenotype, inheritance, possible functions, possible locations, pathways, etc. The first set of keywords is passed to the information retrieval modules, which gather information from Medline abstracts, journal articles, and web pages. This information is passed to information extraction and representation modules, which parse the language to identify relevant facts and represent them in a knowledge base. Metadata about the source of each fact is stored in a relational database and is used by the confidence and conflict resolution modules when resolving apparent conflicts. When conflicts cannot be resolved, the output modules return the relevant facts to the user in a category known as conflicting. Two output modules, the summarizer and the ranker, group facts according to gene name and a set of user- and system-defined keywords and return relevant facts. Finally, the outputter module returns the results of the summarizer module, broadly grouped under gene names, to the user. We believe that this module-based application is preferable to existing monolithic approaches. Additionally, the computational linguistics modules will provide the missing components for successful analysis of the information. The modular framework also allows for the development and implementation of components in a distributed and flexible manner.
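The flow from extracted facts to grouped, ranked output can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the `Fact` fields, the numeric `confidence` score, the `margin` rule used to decide whether a conflict is resolvable, and ranking genes by the number of keyword-matching facts. The abstract does not specify how BioRetrieve scores confidence or ranks genes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Fact:
    gene: str
    attribute: str     # e.g. "function" or "location"
    value: str
    source: str        # hypothetical source identifier (metadata)
    confidence: float  # hypothetical score in [0, 1] derived from the metadata


def resolve_conflicts(facts: List[Fact], margin: float = 0.2) -> Tuple[List[Fact], List[Fact]]:
    """Split facts into (accepted, conflicting).

    Assumed rule: differing values for the same gene and attribute conflict;
    the top value wins only if it leads the runner-up by at least `margin`.
    """
    groups: Dict[Tuple[str, str], List[Fact]] = defaultdict(list)
    for fact in facts:
        groups[(fact.gene, fact.attribute)].append(fact)

    accepted: List[Fact] = []
    conflicting: List[Fact] = []
    for group in groups.values():
        best: Dict[str, Fact] = {}  # highest-confidence fact per distinct value
        for fact in group:
            if fact.value not in best or fact.confidence > best[fact.value].confidence:
                best[fact.value] = fact
        ranked = sorted(best.values(), key=lambda f: f.confidence, reverse=True)
        if len(ranked) == 1 or ranked[0].confidence - ranked[1].confidence >= margin:
            accepted.append(ranked[0])
        else:
            conflicting.extend(ranked)  # unresolved: reported as "conflicting"
    return accepted, conflicting


def summarize(facts: List[Fact], disease_keywords: List[str]) -> Dict[str, List[Fact]]:
    """Group facts by gene, keeping those whose value mentions a disease keyword."""
    summary: Dict[str, List[Fact]] = defaultdict(list)
    for fact in facts:
        if any(k.lower() in fact.value.lower() for k in disease_keywords):
            summary[fact.gene].append(fact)
    return dict(summary)


def rank(summary: Dict[str, List[Fact]]) -> List[str]:
    """Order genes by number of matching facts (ties broken alphabetically)."""
    return sorted(summary, key=lambda gene: (-len(summary[gene]), gene))
```

In this sketch the conflict step runs before summarization, so only accepted facts influence the ranking, while unresolved facts are kept aside for the user to inspect.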