Approaches to Integrating Biological Data

Kenneth Griffiths, Richard Resnick, NetGenics, Inc.

Synopsis

Academic institutions and pharmaceutical and biotechnology companies face a growing need to integrate proprietary biological data with public data sources across all life science domains. A number of solutions to this data integration problem have been proposed. The tutorial will discuss approaches to biological data integration, presenting the advantages and disadvantages of each. It will then focus on the additional efforts the industry needs to make to achieve effective data integration, including the development of standard ontologies and data models.

Concepts to be discussed

• Federated database approach: object-relational layers, CORBA

• Memory-mapped data structures for integration

• Indexing flat files

• Data warehousing

• The need for standard ontologies and data models to achieve successful integration

• Tools required to effectively browse, access, and visualize integrated data

Details

In the current age of the life sciences, investigators have to interpret many types of information from a variety of sources: lab instruments, public databases, gene expression profiles, raw sequence traces, single nucleotide polymorphisms, chemical screening data, proteomic data, putative metabolic pathway models, and many others.

This breadth is necessary because generating valid leads for new discoveries requires a large body of genetic information. Validating those leads in turn requires the study of gene function, which calls for an even more complex set of information.

To better understand the molecular mechanisms of disease, metabolic and regulatory biochemical pathways must be inferred from this information.

Finally, to stimulate the discovery of breakthrough healthcare products, therapeutics must be developed and tested in a pre-clinical and then a clinical environment. Ongoing long-term clinical research must also feed back into the discovery process. All of this produces yet more unintegrated data.

There is a great deal of data, coming from all kinds of places: some public, some proprietary; some sequence, some clinical; some curated, some raw. Almost none of it is integrated. Researchers in the same organization might be working on the same problem, indeed on the same gene, but if they work in different domains and use different names for that gene, they would never make the connection.

Life science data integration is one of the most challenging problems facing bioinformatics today. There are currently a number of techniques, approaches, and products available to help scientists tackle this increasingly complex issue. They include:

Application-level Integration

The application-level integration approach operates under the paradigm that databases can easily be wrapped using some form of middleware technology (such as CORBA), and that an integrated application can be built on top of this middleware. Middleware is a generic term for a layer of software that sits between one class of application and another. This is a useful concept in software engineering because the middleware provides a layer of abstraction, so that the top layer does not need to know the details of the bottom layer. However, the extra layer adds complexity to the system and slows performance, and it does not address data cleaning and transformation. Using middleware to integrate disparate data sources is usually referred to as a "federated database" approach. Many companies take this approach.
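The wrapper idea can be illustrated with a minimal sketch. It stands in for real middleware such as CORBA with plain Python classes, and every name in it (SourceWrapper, PublicSequenceWrapper, InHouseExpressionWrapper, federated_query, the sample records) is a hypothetical chosen for illustration, not part of any product discussed here.

```python
from abc import ABC, abstractmethod


class SourceWrapper(ABC):
    """Uniform interface that hides how each source stores its data."""

    @abstractmethod
    def find_gene(self, symbol):
        """Return a list of result dicts for the given gene symbol."""


class PublicSequenceWrapper(SourceWrapper):
    # Hypothetical public source, stored as a dict keyed by gene symbol.
    def __init__(self):
        self._records = {"TP53": {"accession": "ACC0001"}}

    def find_gene(self, symbol):
        record = self._records.get(symbol)
        return [{"source": "public_seq", **record}] if record else []


class InHouseExpressionWrapper(SourceWrapper):
    # Hypothetical proprietary source, stored as rows and using a local alias.
    def __init__(self):
        self._rows = [("p53", "liver", 2.4), ("BRCA1", "breast", 0.7)]

    def find_gene(self, symbol):
        return [
            {"source": "inhouse_expr", "tissue": tissue, "ratio": ratio}
            for gene, tissue, ratio in self._rows
            if gene == symbol
        ]


def federated_query(wrappers, symbol):
    """Fan the query out to every source at request time and merge the hits."""
    results = []
    for wrapper in wrappers:
        results.extend(wrapper.find_gene(symbol))
    return results


sources = [PublicSequenceWrapper(), InHouseExpressionWrapper()]
tp53_hits = federated_query(sources, "TP53")
p53_hits = federated_query(sources, "p53")
```

In this sketch the application never sees how either source is stored, which is the abstraction benefit described above. It also shows the limitation: the two sources name the same gene differently ("TP53" and "p53"), so each query reaches only one of them, because nothing in the wrapper layer cleans or reconciles the data.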

Data-level Integration without Semantic Cleaning

The basic idea of this approach is to integrate data at the data layer through indices, database links, and memory-mapped data structures. While these approaches achieve integration at the lowest level, they do not address the need to clean and transform the data to prepare it for complex querying, analysis, and visualization. Also, due to their complexity, they do not scale well and the resulting data is difficult to browse. These approaches include:

• Memory-mapped data structures: In this approach, subsets of data from various sources are collected, normalized, and integrated in memory for quick access. While this approach performs actual data integration and addresses the poor performance of the federated approach, it still requires additional calls to traditional relational databases to bring in descriptive data, and those calls slow query execution. Data cleaning is performed on some of the data sources, but not across all sources and not in one place, which makes it difficult to add new data sources quickly.

• Indexing flat files: In this approach, flat text files are indexed and linked to support fast query performance. Data integration takes place by using the results of one query to link the user to another database. The problem with this approach is that it provides neither a mechanism to integrate in-house relational databases nor a mechanism to perform data cleaning and transformation for complex data mining.
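The flat-file indexing approach can be sketched as follows. The record layout (two-letter line codes, an "XR" cross-reference line, and a "//" terminator), the file names, and the functions build_index and fetch are all assumptions made for illustration; they do not describe any particular indexing product.

```python
import os
import tempfile

# Two hypothetical flat files. "XR" lines cross-reference the other file.
GENES = b"""ID   GENE_A
DE   hypothetical kinase
//
ID   GENE_B
DE   hypothetical transporter
XR   PATHWAY_1
//
"""

PATHWAYS = b"""ID   PATHWAY_1
DE   hypothetical transport pathway
//
"""


def build_index(path):
    """Map each record ID to the byte offset where its record starts."""
    index = {}
    record_start = 0
    with open(path, "rb") as fh:
        while True:
            line = fh.readline()
            if not line:
                break
            if line.startswith(b"ID"):
                index[line.split()[1].decode()] = record_start
            elif line.startswith(b"//"):
                # The next record begins right after the terminator line.
                record_start = fh.tell()
    return index


def fetch(path, index, record_id):
    """Seek straight to a record and read it up to the '//' terminator."""
    lines = []
    with open(path, "rb") as fh:
        fh.seek(index[record_id])
        for line in fh:
            if line.startswith(b"//"):
                break
            lines.append(line.decode().rstrip("\n"))
    return lines


with tempfile.TemporaryDirectory() as tmp:
    gene_path = os.path.join(tmp, "genes.dat")
    pathway_path = os.path.join(tmp, "pathways.dat")
    with open(gene_path, "wb") as fh:
        fh.write(GENES)
    with open(pathway_path, "wb") as fh:
        fh.write(PATHWAYS)

    gene_index = build_index(gene_path)
    pathway_index = build_index(pathway_path)

    # Query one file, then follow its cross-references into the other.
    gene_record = fetch(gene_path, gene_index, "GENE_B")
    xrefs = [line.split()[1] for line in gene_record if line.startswith("XR")]
    linked = [fetch(pathway_path, pathway_index, xref) for xref in xrefs]
```

Lookups are fast because each one is a single seek, and the link from gene to pathway is simply the result of the first query used as the key for the second. The records themselves are returned exactly as stored, which reflects the point above: the index links the data but does nothing to clean or transform it.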

Data-level Integration with Semantic Cleaning

In this approach, data is exported from each source application into a data staging area. Here, all of the data is cleaned up, transformed as necessary, and linked with data from other sources. This staging process is key to the success of the approach, and it can be fully automated once the scientific logic has been properly specified and implemented. (The automation can be updated manually if the source applications increase or decrease in functionality.) When staging is complete, the data is placed in a unified and integrated central presentation database, which is a composition of smaller databases. A generic metadata repository sits on top of the integrated database to provide a layer of abstraction for developers and users. The one major drawback of this approach is the time it takes to extract, clean, transform, and load the data into the warehouse. However, this problem can be addressed by scheduling smaller incremental updates. The overall approach is generally known as data warehousing.
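The extract, clean, transform, and load sequence can be sketched in a few lines using Python's built-in sqlite3 module as a stand-in for the presentation database. The source exports, the synonym table, the table layout, and the function names (clean_symbol, stage, load) are hypothetical, and the metadata repository is left out of this sketch.

```python
import sqlite3

# Hypothetical exports from two source applications. Each source spells
# gene symbols its own way, and one stores numbers as text.
sequence_export = [("tp53 ", "ACC0001"), ("BRCA1", "ACC0002")]
expression_export = [("p53", "liver", "2.4"), ("brca1", "breast", "0.7")]

# Hypothetical synonym table standing in for a shared vocabulary.
SYNONYMS = {"P53": "TP53"}


def clean_symbol(raw):
    """Trim, upper-case, and map known aliases to one canonical symbol."""
    symbol = raw.strip().upper()
    return SYNONYMS.get(symbol, symbol)


def stage(sequence_rows, expression_rows):
    """Staging step: clean and transform the rows from every source."""
    genes = [(clean_symbol(sym), acc) for sym, acc in sequence_rows]
    expression = [
        (clean_symbol(sym), tissue, float(ratio))
        for sym, tissue, ratio in expression_rows
    ]
    return genes, expression


def load(conn, genes, expression):
    """Load the staged rows into the integrated presentation database."""
    conn.execute("CREATE TABLE gene (symbol TEXT PRIMARY KEY, accession TEXT)")
    conn.execute(
        "CREATE TABLE expression ("
        "symbol TEXT REFERENCES gene(symbol), tissue TEXT, ratio REAL)"
    )
    conn.executemany("INSERT INTO gene VALUES (?, ?)", genes)
    conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", expression)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, *stage(sequence_export, expression_export))

# After staging, one query joins data that originally used different names.
rows = conn.execute(
    "SELECT g.symbol, g.accession, e.tissue, e.ratio "
    "FROM gene g JOIN expression e ON e.symbol = g.symbol "
    "ORDER BY g.symbol"
).fetchall()
```

Because the names are reconciled once, during staging, the final join links the "tp53" sequence record to the "p53" expression record without any special handling at query time. The cost is that the whole pipeline must run before new data becomes visible, which is the load-time drawback noted above.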

The Need for Standards Efforts

Currently, there are two major efforts underway to standardize technologies required for data integration:

• Bio-ontology Standards Group: There is currently an effort underway to standardize domain-specific ontologies and vocabularies to support interoperability of data and software components. The past, present, and future of this group will be discussed.

• Data Model Standards Group: There is currently an effort underway to standardize domain-specific analytical data models to help integrate public data with proprietary data across all life science domains in an enterprise. The past, present, and future of this group will be discussed.

Tools Required to Access Integrated Data

In order to effectively use integrated data, a number of tools are required and will be discussed:

• Data Browsers: Required to help users understand what is contained in the integrated data source. The browser should lead users to an intuitive query interface.

• Query Tools: Required to help users ask meaningful biological questions across multiple domains and transfer the integrated data to a visualization tool for complex analysis.

• Visualization Tools: Required to help users sort through large volumes of integrated data, finding patterns and trends that would otherwise go unnoticed.

• Data Mining Tools: Required by advanced users to automatically and intelligently search the integrated database to find ways to understand the data, predict future outcomes from it, and extract knowledge leading to new discoveries.