MineLink: A novel information integration framework for Life Sciences

Tanveer Syeda-Mahmood¹, Bhooshan Kelkar²
¹stf@almaden.ibm.com, IBM Almaden Research; ²bkelkar@us.ibm.com, IBM Life Sciences

Abstract:

Now that the human genome has been sequenced, scientists face a greater challenge: to extract, analyze and integrate the information accumulating in genome databases worldwide for improved diagnosis and cure of diseases. With progress in genomics, scientists have also begun to pose queries that span more than one data source and/or one or more analytic components. To address these needs, an information integration framework is needed that can pull together both life sciences data and analytic applications from disparate sources. MineLink is a novel federated information integration framework designed to address the data and analysis needs of scientists. It specifies a design methodology for automatically integrating individual components, be they data sources, processors, data miners or visualization components, without the need for explicit programming. It addresses both syntactic and semantic aspects of information integration through two mechanisms: a generalized schema, which serves as an abstract data type for communication between components, and a component connectivity graph, which specifies the semantic integration of components and is learned automatically from examples. It combines state-of-the-art service composition techniques of distributed computing with active feedback on ease of use from scientists in Life Sciences. With the emergence of grid computing, platforms such as MineLink will continue to gain importance on grid architectures, and are expected to play a pivotal role in Life Sciences applications.

Introduction

Now that the human genome has been sequenced, scientists face a greater challenge: to extract, analyze and integrate the information accumulating in genome databases worldwide for improved diagnosis and cure of diseases. With advances in sequencing techniques and the advent of Gene Chips, increasingly large amounts of data are becoming available worldwide in a combination of public and private genome databases. In addition, an increasing number of analytic tools are becoming available, including both commercial tools (e.g., ArrayScout, DiscoveryStudio, SpotFire) and public domain bioinformatics tools (e.g., BLAST, HMMER, Clustal-W). In most cases, these tools are meant to be standalone applications or to be deployed over the web. They are often written using a closed architecture, and come with built-in data assembly/access, analytics and visualization components. In addition, each tool uses proprietary data formats, so that scientists often have to do a lot of data preparation before they can use such tools. With progress in genomics, scientists have also begun to pose queries that span more than one data source and/or one or more analytic components. For example, a diagnosis that combines information from gene expression, blood test, and x-ray data may need to access, analyze and combine information in three separate data sources. To satisfactorily address the needs of scientists, therefore, an information integration framework is needed that can pull together both life sciences data and analytic applications from disparate sources.

MineLink is a novel federated information integration framework developed at IBM that is designed to address the data and analysis needs of scientists. It specifies a design methodology for automatically integrating individual components, be they data sources, processors, data miners or visualization components, without the need for explicit programming. The components can range from individual pieces of code developed by researchers to full-fledged commercial applications (Scitegic, SpotFire, Accelrys) and databases (Oracle, DB2, Sybase). Users of MineLink can dynamically configure an integration between components of interest by simply selecting the components and linking them together in a workflow. The MineLink middleware translates these workflow queries, issued by users at client sites, into internally executable workflows by addressing issues of syntactic and semantic connectivity between components. Specifically, the middleware validates that the connections requested by the user are feasible by exploring the physical interfaces offered by the components (parameter matching through the API), and by analyzing whether it makes sense to connect the components semantically (e.g., normalization precedes data mining).

A distinguishing feature of MineLink is its ease of use for both end-users and component developers. End-users avoid programming through drag-and-drop component selection and graphical query formation. For example, automatically generated SQL commands for data selection from commercial databases such as Oracle and DB2 allow lay users to store and retrieve data without learning SQL. Similarly, component developers can integrate their components easily by providing the executable component itself (e.g., a Java class) and a description of the component in a WSDL (Web Services Description Language) document. MineLink avoids the need to learn proprietary languages for data pipelining (e.g., PipelinePilot), as well as the explicit tailoring or custom development otherwise required to integrate entirely new functional components.
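As a rough illustration of automatic SQL generation from a visual selection, the sketch below builds a SELECT statement from a chosen table and list of columns. The class name, method names, and the table and column names are hypothetical and are not taken from MineLink; a real generator would also have to handle joins, predicates and database-specific quoting.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: turns a visual column selection into a SELECT statement.
// This is not MineLink's generator; table and column names below are made up.
class SelectBuilderSketch {
    private static final Pattern IDENT = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    static String select(String table, List<String> columns) {
        requireIdentifier(table);
        if (columns.isEmpty()) {
            return "SELECT * FROM " + table;   // nothing picked: select every column
        }
        for (String c : columns) {
            requireIdentifier(c);
        }
        return "SELECT " + String.join(", ", columns) + " FROM " + table;
    }

    // Accept only plain identifiers so that GUI input cannot inject SQL fragments.
    private static void requireIdentifier(String s) {
        if (!IDENT.matcher(s).matches()) {
            throw new IllegalArgumentException("bad identifier: " + s);
        }
    }

    public static void main(String[] args) {
        System.out.println(select("GENE_EXPRESSION", Arrays.asList("GENE_ID", "SAMPLE_ID", "LEVEL")));
    }
}
```

The identifier check is a deliberate design choice in this sketch: since end-users never see the SQL, anything that is not a plain table or column name is rejected rather than escaped.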

MineLink Architecture

MineLink achieves federated information integration through the web services architecture shown in Figure 1. The data and application components can reside at multiple locations on a network (a federated setting). Before they can be integrated into the MineLink architecture, they must be suitably packaged. Components in MineLink are classified as input components (e.g., data sources), processors (e.g., data mining and pre-processing modules), and output components (e.g., data sinks such as visualization components and data sources). Component packaging is discussed in detail in Section 2.

The MineLink middleware maintains a component connectivity graph (CCG) for all the components registered with MineLink. The CCG is a directed graph that captures data connectivity between components; it is discussed in detail in Section 3. The middleware is invoked by a workflow query that a user generates with the MineLink dynamic workflow client. The information fusion module combines the results of mining. The semantic schema mapping module is the interactive module that establishes semantic connectivity between components. The MineLink middleware is Java code that can in turn be packaged as a web service and served by any application server.

The MineLink client is an Eclipse plug-in that allows users to draw workflow graphs among components known to MineLink. The features offered in this user interface include visual data selection and automatic SQL query generation. The CCG is used as the master template to display available components in the workflow client and also to ensure workflow consistency. One of the data source components is the DiscoveryLink web service, which allows data access from a variety of data formats such as relational tables (DB2, Oracle), ASCII text files, and Excel spreadsheets. Other packaged components include data mining applications, and visualization components such as those offered in SpotFire and Matlab.

Figure 1: MineLink Web Services Architecture.

Figure 2: MineLink middleware details.

The MineLink architecture allows for three classes of users with distinct responsibilities, as follows:

1. End-Users: This class of users (e.g., scientists) composes workflow queries by picking visually displayed components in the MineLink workflow client and drawing workflows in an easy-to-use GUI. Their typical tasks include selecting data sources, selecting items from data sources (e.g., selected columns in selected tables of a database), combining items from multiple data sources (e.g., a visual join of the tables), applying pre-processing modules to the data, applying different kinds of data mining, and selecting a visualization tool to display the results of the analysis.

2. Systems Administrators/Domain Experts: The semantic connectivity between components is indicated by domain experts who are knowledgeable about the functionality represented by the components and can form meaningful workflows. These workflows are developed using the same client interface as before, but instead of executing the workflow, the result is a refinement of the component connectivity graph. In this way, the CCG is learned from examples over time.

3. Component Developers/Programmers: The component developers are responsible for providing the components (packaged as Java classes) and a WSDL document specifying the exposed methods and attributes of the classes as well as the associated metadata. MineLink automatically wraps the component using an I-C-O model (input-component-output wrappers) to render it as a MineLink component. The input wrappers take a generalized schema as input and produce the API needed for the operation of the individual component. The output wrappers do the reverse: they take the output produced by the individual component and turn it back into the generalized schema data type for MineLink. This ensures that MineLink connects components using the standardized generalized schema as an abstract data type.

Section 2: Component generation and packaging: The Generalized Schema

While to a user the workflow outlines the connections between processes, the middleware actually interprets it as a data flow graph in order to make syntactic connections possible. To make syntactic connections between components, and to simplify the compilation of the resulting graph within the middleware, MineLink passes a common abstract data type between components, called the generalized schema (Gs). The data type is designed to be generic enough to carry both the data and the parameters that form part of the API of any component. It can be specified by the following context-free grammar:

Gs -> C* | ε

C -> Na Cty

Cty -> Na Ty Obj

Na -> any string

Ty -> Gs | Javaclass | jbasetype

Javaclass -> ….all classes defined in Java language spec….

Note that the generalized schema is recursively defined to allow arbitrarily complex data types to be represented. Samples of the generalized schema for some common data structures are illustrated in Figure 3.

Figure 3: The generalized schema instantiated for the K-means clustering component. Both the input and output instantiations are shown. Here the objects included are called 'data', 'cluster-obj', and 'Memb-array' respectively.
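One way to picture the recursive definition is as a small Java type in which a schema is a list of named, typed entries and an entry's value may itself be a schema. The sketch below is only one reading of the grammar: the class names (GsEntry, GsSchema), the helper methods, and the 'params' and 'number_clusters' labels are assumptions made for illustration and are not MineLink's actual classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the generalized schema (Gs) as a recursive Java type.
// Class and method names are illustrative, not MineLink's actual API.
final class GsEntry {
    final String name;     // Na: any string label
    final Class<?> type;   // Ty: a Java class, a boxed base type, or GsSchema.class for nesting
    final Object value;    // Obj: the carried object (may itself be a GsSchema)

    GsEntry(String name, Class<?> type, Object value) {
        if (value != null && !type.isInstance(value)) {
            throw new IllegalArgumentException(name + " is not a " + type.getName());
        }
        this.name = name;
        this.type = type;
        this.value = value;
    }
}

final class GsSchema {
    private final List<GsEntry> entries = new ArrayList<>();   // Gs -> C* (an empty list plays the role of ε)

    GsSchema add(String name, Class<?> type, Object value) {
        entries.add(new GsEntry(name, type, value));
        return this;
    }

    List<GsEntry> entries() {
        return Collections.unmodifiableList(entries);
    }

    GsEntry find(String name) {
        for (GsEntry e : entries) {
            if (e.name.equals(name)) return e;
        }
        return null;
    }

    // Nesting depth: 1 for a flat schema, plus 1 for each level of nested schema values.
    int depth() {
        int deepest = 0;
        for (GsEntry e : entries) {
            if (e.value instanceof GsSchema) {
                deepest = Math.max(deepest, ((GsSchema) e.value).depth());
            }
        }
        return 1 + deepest;
    }
}

class GeneralizedSchemaDemo {
    // A made-up input instance for a clustering component: raw data plus a nested parameter schema.
    static GsSchema kMeansInput() {
        GsSchema params = new GsSchema().add("number_clusters", Integer.class, 3);
        return new GsSchema()
            .add("data", double[][].class, new double[][] {{1.0, 2.0}, {3.0, 4.0}})
            .add("params", GsSchema.class, params);
    }

    public static void main(String[] args) {
        GsSchema in = kMeansInput();
        System.out.println(in.entries().size() + " entries, depth " + in.depth());
    }
}
```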

Certain components are designated as input only or output only. They will not have the corresponding Gsin or Gsout components defined.

Generic wrapper code is now available in MineLink. It takes as input a Gsin encapsulating the name of the component and the corresponding arguments generated by a previous component execution in the workflow. The data structures needed for the execution of the specified component are then generated automatically (using class introspection) and given to the component for execution. Any output returned from the component is then packaged back into a generalized schema (again using class introspection on the returned result). The correspondence of arguments with the next module is drawn from the specification in the component connectivity graph.
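The wrapper behaviour described above can be sketched with standard Java reflection. In this illustration an ordered map of name to object stands in for a generalized schema instance, and MeanComponent, IcoWrapperSketch and the "-result" naming convention are all hypothetical; the sketch shows the general technique of introspection-driven invocation, not MineLink's wrapper code.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical component supplied by a developer; any plain Java class would do.
class MeanComponent {
    public double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return values.length == 0 ? 0.0 : sum / values.length;
    }
}

// Sketch of an I-C-O wrapper. An ordered map (name -> object) stands in for the
// generalized schema; the real Gs type is richer than this.
class IcoWrapperSketch {
    static Map<String, Object> invoke(Object component, String methodName, Map<String, Object> gsIn) {
        Object[] args = gsIn.values().toArray();               // input wrapper: Gsin -> Din
        for (Method m : component.getClass().getMethods()) {
            if (m.getName().equals(methodName) && accepts(m, args)) {
                try {
                    Object dOut = m.invoke(component, args);   // run the component C
                    Map<String, Object> gsOut = new LinkedHashMap<>();
                    gsOut.put(methodName + "-result", dOut);   // output wrapper: Dout -> Gsout
                    return gsOut;
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("component call failed", e);
                }
            }
        }
        throw new IllegalArgumentException("no method " + methodName + " taking " + args.length + " argument(s)");
    }

    // Introspection check: argument count and (boxed) types must fit the method signature.
    private static boolean accepts(Method m, Object[] args) {
        Class<?>[] params = m.getParameterTypes();
        if (params.length != args.length) return false;
        for (int i = 0; i < params.length; i++) {
            if (args[i] != null && !box(params[i]).isInstance(args[i])) return false;
        }
        return true;
    }

    // Only a few primitive types are covered in this sketch.
    private static Class<?> box(Class<?> c) {
        if (c == int.class) return Integer.class;
        if (c == double.class) return Double.class;
        if (c == long.class) return Long.class;
        if (c == boolean.class) return Boolean.class;
        return c;
    }

    public static void main(String[] a) {
        Map<String, Object> gsIn = new LinkedHashMap<>();
        gsIn.put("values", new double[] {1.0, 2.0, 3.0});
        System.out.println(invoke(new MeanComponent(), "mean", gsIn));
    }
}
```

Note that this sketch passes arguments in map order; choosing the right order is exactly the correspondence problem that the component connectivity graph is meant to resolve.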

After the components are packaged with an input wrapper preceding the component and an output wrapper following it, each component presents a uniform interface, as indicated in Figure 4.

Figure 4: A component in the MineLink architecture. The role of the input wrapper I is to convert the generalized schema instance Gsin into Din, the input expected by the API of the component C. The role of the output wrapper O is to convert the output produced by the component C, called Dout, back into another generalized schema instance Gsout.

Using the generalized schema as a uniform component interface, the MineLink middleware assembles an internal workflow graph in response to a workflow query by chaining the I-C-O modules of the packaged components and supplying them with the correct form of the instantiated generalized schema. For example, a sample workflow constructed by an end-user using 4 components, indicated in Figure 5a, can be converted into the internal workflow indicated in Figure 5b using the syntactic connectivity provided by the generalized schema.

Figure 5: (a) User-specified workflow from the dynamic workflow client. Here component C4 combines input from components C3 and C2. (b) Internal workflow graph using syntactic connectivity alone, where each path is composed of I-C-O chains. The packaged components (shown in yellow) are supplied by the developers. The schema mapping modules (shown in pink boxes above the components) are supplied by the MineLink middleware. The schema mapping process is defined during component registration, described in Section 3.
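The chaining of I-C-O modules, including a combiner in front of a component that consumes two upstream outputs, can be sketched as follows. Maps again stand in for generalized schema instances. The arithmetic performed by the components, the field names, and the assumption that C1 feeds both C2 and C3 are invented for illustration; only the fact that C4 combines the outputs of C2 and C3 comes from Figure 5.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class WorkflowChainSketch {
    // A packaged component maps Gsin to Gsout; maps stand in for schema instances.
    interface Packaged extends Function<Map<String, Object>, Map<String, Object>> {}

    // Builds an I-C-O chain around a one-argument component.
    static Packaged pack(String inName, Function<Integer, Integer> component, String outName) {
        return gsIn -> {
            Integer dIn = (Integer) gsIn.get(inName);    // I: Gsin -> Din
            Integer dOut = component.apply(dIn);         // C: the developer's code
            Map<String, Object> gsOut = new LinkedHashMap<>();
            gsOut.put(outName, dOut);                    // O: Dout -> Gsout
            return gsOut;
        };
    }

    // Combiner: merges the Gsout of two upstream components into one schema instance.
    static Map<String, Object> combine(Map<String, Object> a, Map<String, Object> b) {
        Map<String, Object> merged = new LinkedHashMap<>(a);
        merged.putAll(b);
        return merged;
    }

    // Assumed wiring: C1 feeds C2 and C3, and C4 consumes the combined output of both.
    static int run(int seed) {
        Packaged c1 = pack("seed", x -> x + 1, "x");
        Packaged c2 = pack("x", x -> x * 2, "doubled");
        Packaged c3 = pack("x", x -> x * x, "squared");
        Map<String, Object> start = new LinkedHashMap<>();
        start.put("seed", seed);
        Map<String, Object> out1 = c1.apply(start);
        Map<String, Object> in4 = combine(c2.apply(out1), c3.apply(out1));
        // C4 takes two arguments, so it reads both fields of the combined schema.
        return (Integer) in4.get("doubled") + (Integer) in4.get("squared");
    }

    public static void main(String[] args) {
        System.out.println(run(2));
    }
}
```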

Section 3: Component Registration: Component Connectivity Graph

While the above method of generating the internal workflow graph ensures syntactic connectivity (i.e., the resulting code will compile), it may not ensure semantic connectivity. For example, if we had a clustering component that takes as input (int A, int B, char *C) and we had a choice of supplying (number_clusters, number_dimen, GeneLabel), permuting the first two arguments would have very different consequences for the resulting clusters. Some of the mapping of labels to types can be determined from their descriptions (hence the choice of Name as a field in the generalized schema), for example, age and years. However, such a semantic mapping need not be unique and, furthermore, may be domain-specific. MineLink therefore uses an interactive process to determine the schema mapping, based on the algorithm in Clio and on help from a systems administrator/domain expert who knows the meaning of some of the labels used in the declaration of the Gsin and Gsout for the components.

The component connectivity graph is assembled over time from sample workflows specified in a learning phase. A workflow graph specifies which components can be connected. For each such pair of components, the schema mapping algorithm is applied to automatically infer the correspondence between arguments (those produced by the previous component and those consumed by the next). The inference makes use of the name, type and structural information present. The resulting correspondences are then shown to the user for correction. This is a one-time process for each pair of specified methods in the components. Thus the mapping of the schemas from the Gsout of selected components to the Gsin of the new component is made semi-automatically, by applying Clio's schema mapping algorithm and showing the result to the user for editing. If a combination of the Gsout of two or more components is needed to form the Gsin of the new component, this is specified through a special edge called the combine-edge. This allows a MineLink-generated combiner component to be inserted between these components in the connectivity graph. The connection from the output of the new component (Gsout) to the inputs of existing components is defined similarly.
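To make the need for expert review concrete, the toy sketch below infers correspondences from type information and exact label matches only, and marks anything ambiguous for the user. It is not Clio's schema mapping algorithm, and the class and method names are hypothetical. Applied to the clustering example from the text, it resolves the string argument but cannot decide between the two integer arguments.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy correspondence inference using type information plus exact label matches.
// This is NOT Clio's algorithm; it only illustrates why an expert must review the result.
class MappingSketch {
    // Returns consumer argument -> producer field; a null value marks "ambiguous, ask the expert".
    static Map<String, String> infer(Map<String, Class<?>> produced, Map<String, Class<?>> consumed) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (Map.Entry<String, Class<?>> arg : consumed.entrySet()) {
            List<String> candidates = new ArrayList<>();
            for (Map.Entry<String, Class<?>> field : produced.entrySet()) {
                if (field.getValue().equals(arg.getValue())) {
                    candidates.add(field.getKey());
                }
            }
            String choice = null;
            if (candidates.size() == 1) {
                choice = candidates.get(0);                  // unique by type
            } else {
                for (String c : candidates) {
                    if (c.equalsIgnoreCase(arg.getKey())) {
                        choice = c;                          // same label on both sides
                    }
                }
            }
            mapping.put(arg.getKey(), choice);
        }
        return mapping;
    }

    // The clustering example from the text: (int A, int B, char *C) fed by
    // (number_clusters, number_dimen, GeneLabel), with String standing in for char *.
    static Map<String, String> clusteringExample() {
        Map<String, Class<?>> produced = new LinkedHashMap<>();
        produced.put("number_clusters", Integer.class);
        produced.put("number_dimen", Integer.class);
        produced.put("GeneLabel", String.class);
        Map<String, Class<?>> consumed = new LinkedHashMap<>();
        consumed.put("A", Integer.class);
        consumed.put("B", Integer.class);
        consumed.put("C", String.class);
        return infer(produced, consumed);
    }

    public static void main(String[] args) {
        System.out.println(clusteringExample());
    }
}
```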

Each such connection between a pair of components is then registered as an edge in the component connectivity graph. The weight associated with an edge points to the argument correspondence information. This could be a lookup table of correspondences in the simple case, or a piece of code for computing the correspondence in the more complicated case. MineLink uses a relational database (DB2) to record all such information.
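The simple case, in which an edge carries a lookup table of argument correspondences, might look like the following sketch. An in-memory map replaces the relational store, and the class name, the "from->to" edge key, and the component names are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a component connectivity graph whose edges carry an argument lookup table.
// An in-memory map replaces the relational store; all names are illustrative.
class CcgSketch {
    // Edge key "from->to" maps to a table of (consumer argument -> producer field).
    private final Map<String, Map<String, String>> edges = new HashMap<>();

    void register(String from, String to, Map<String, String> correspondence) {
        edges.put(from + "->" + to, new LinkedHashMap<>(correspondence));
    }

    boolean connects(String from, String to) {
        return edges.containsKey(from + "->" + to);
    }

    // Rewrites the producer's Gsout into the consumer's Gsin using the edge's lookup table.
    Map<String, Object> translate(String from, String to, Map<String, Object> gsOut) {
        Map<String, String> table = edges.get(from + "->" + to);
        if (table == null) {
            throw new IllegalArgumentException("no edge " + from + "->" + to);
        }
        Map<String, Object> gsIn = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : table.entrySet()) {
            gsIn.put(e.getKey(), gsOut.get(e.getValue()));
        }
        return gsIn;
    }

    // Expert-approved mapping for the clustering example, applied to a made-up output.
    static Map<String, Object> demo() {
        Map<String, String> table = new LinkedHashMap<>();
        table.put("A", "number_clusters");
        table.put("B", "number_dimen");
        table.put("C", "GeneLabel");
        CcgSketch ccg = new CcgSketch();
        ccg.register("settings", "clustering", table);

        Map<String, Object> gsOut = new LinkedHashMap<>();
        gsOut.put("number_dimen", 2);
        gsOut.put("number_clusters", 3);
        gsOut.put("GeneLabel", "BRCA1");
        return ccg.translate("settings", "clustering", gsOut);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the table is keyed by the consumer's argument names, the order in which the producer emits its fields no longer matters, which is what prevents the argument permutation problem described in this section.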

Section 4: Executing User-Specified Workflow Queries

The dynamic workflow client provides the GUI that allows end users to compose workflows in the MineLink client. Simple error checking is enabled in the GUI using the types of the components (e.g., an input component cannot accept incoming connections). Further type checking and semantic connectivity checking is done in the MineLink middleware using the CCG. Using the CCG and the workflow query, an internal workflow graph is constructed (it is a labeled subgraph of the CCG) and executed. To enable the execution of the workflow graph, the WSDL document descriptions supplied by the component developers are used to pass along to each component the schema instances and the fragments of the workflow query that are intended for it. If the user has specified a visualization component in the workflow, the respective component will be launched and the result displayed in its GUI (e.g., the SpotFire GUI). Figure 6 shows a conceptual workflow composed by a user, and Figure 6 shows an actual workflow that can be composed in MineLink client using the Eclipse plug-in.
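The two checks described above, together with an execution order for the resulting graph, can be sketched as follows. The workflow is given as a list of directed edges, the CCG as a set of permitted "from->to" edges, and a standard topological sort (Kahn's algorithm) yields an order in which every component runs after its inputs are available. The representation and all names are assumptions; MineLink's internal workflow graph also carries the edge labels, which this sketch omits.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class WorkflowCheckSketch {
    // Validates a workflow (edges given as {from, to} pairs) against the CCG and
    // returns an execution order in which producers run before their consumers.
    static List<String> plan(Set<String> ccgEdges, Set<String> inputComponents, List<String[]> workflow) {
        Map<String, Integer> indegree = new LinkedHashMap<>();
        Map<String, List<String>> next = new LinkedHashMap<>();
        for (String[] e : workflow) {
            if (inputComponents.contains(e[1])) {
                throw new IllegalArgumentException("input component " + e[1] + " cannot accept incoming connections");
            }
            if (!ccgEdges.contains(e[0] + "->" + e[1])) {
                throw new IllegalArgumentException("edge " + e[0] + "->" + e[1] + " is not in the CCG");
            }
            indegree.putIfAbsent(e[0], 0);
            indegree.merge(e[1], 1, Integer::sum);
            next.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
        }
        // Kahn's algorithm: repeatedly run components whose inputs are all available.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : indegree.entrySet()) {
            if (e.getValue() == 0) ready.add(e.getKey());
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String c = ready.remove();
            order.add(c);
            for (String n : next.getOrDefault(c, Collections.<String>emptyList())) {
                if (indegree.merge(n, -1, Integer::sum) == 0) ready.add(n);
            }
        }
        if (order.size() != indegree.size()) {
            throw new IllegalArgumentException("workflow contains a cycle");
        }
        return order;
    }

    // Assumed four-component workflow in which C4 combines the outputs of C2 and C3.
    static List<String> demoOrder() {
        Set<String> ccg = new HashSet<>(Arrays.asList("C1->C2", "C1->C3", "C2->C4", "C3->C4"));
        Set<String> inputs = new HashSet<>(Arrays.asList("C1"));
        List<String[]> workflow = Arrays.asList(
            new String[] {"C1", "C2"}, new String[] {"C1", "C3"},
            new String[] {"C2", "C4"}, new String[] {"C3", "C4"});
        return plan(ccg, inputs, workflow);
    }

    public static void main(String[] args) {
        System.out.println(demoOrder());
    }
}
```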

Summary:

MineLink is a novel information integration framework that combines state-of-the-art service composition techniques of distributed computing with active feedback on ease of use from scientists in Life Sciences. The success of this model could also lead to a change of strategy on the part of bioinformatics tool companies, which can now expose the APIs of sharable components, leading to increased adoption of their tools for integration. With the emergence of grid computing, platforms such as MineLink will continue to gain importance on grid architectures, and are expected to play a pivotal role in Life Sciences applications.

Figure 6: A workflow graph composed by an end-user to compose analytics and data access.