Performing in silico experiments on the Grid using myGrid

Robert Stevens1, Tom Oinn2, Peter Li
1robert.stevens@cs.man.ac.uk, University of Manchester; 2tmo@ebi.ac.uk, European Bioinformatics Institute

This poster gives an overview of the myGrid project, which aims to provide a semantically rich middleware layer upon which e-Science applications can be written for the bioinformatics domain. myGrid builds on the recent interest in Grid technology, but aims to make the scientist, and how he or she works, the focal point of the Grid infrastructure, hence the project's name “myGrid”. “e-Science” is global collaboration around the sharing of ideas, resources and effort. The “Grid” is the next-generation infrastructure necessary to support and enable this collaboration of people and resources through highly capable compute and data management systems.

myGrid is a UK e-Science pilot project specifically targeted at developing open source, high-level middleware to support personalized, semantically rich in silico experiments in biology. The emphasis of myGrid is on database integration, workflow, personalization and provenance, with a primary focus on semantics-aware services for processing complex data, multiple data sources and complex biological queries.

The architecture is based on services managed by ontology-based metadata. The target users are bioinformaticians, tool builders and service providers.

Many in silico experiments are computationally and data intensive, needing heavyweight simulations and generating large quantities of output. The majority of Grid research to date has focused on technology to streamline the process of bringing multiple resources to bear on such problems. Not all scientific disciplines, however, can be characterized in this manner. Bioinformatics requires support for a scientific process that is more modest in terms of its computational needs, but is significantly more complex semantically. Many bioinformatics tasks can be characterized as in silico experiments: procedures that use computer-based information repositories and computational analysis to test hypotheses or to demonstrate known facts. The components of such an experiment include its objectives, plan, methods, results and notes. Bioinformatics experiments can be regarded as workflows: some data are taken as input to an analytical tool, together with other parameters; the output can then be taken, perhaps after interaction with the user, as input to further tools or database queries. A bioinformatician needs a great deal of knowledge in order to perform these tasks efficiently and effectively.
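The view of an experiment as a workflow can be illustrated with a minimal sketch. It is not myGrid code: the `Step` class, the `run_workflow` function and the two toy "tools" are hypothetical names introduced only to show data and parameters flowing from one tool into the next while a simple record of each step is kept.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    """One workflow step: a named tool plus fixed parameters (hypothetical)."""
    name: str
    tool: Callable[..., str]
    params: Dict[str, object] = field(default_factory=dict)


def run_workflow(steps: List[Step], data: str) -> Tuple[str, List[dict]]:
    """Feed each step's output into the next and keep a simple record of each step."""
    record = []
    for step in steps:
        result = step.tool(data, **step.params)
        record.append({"step": step.name, "input": data,
                       "params": dict(step.params), "output": result})
        data = result
    return data, record


# Toy stand-ins for analytical tools; real services would be remote calls.
def transcribe(dna: str) -> str:
    return dna.upper().replace("T", "U")


def take_prefix(seq: str, length: int) -> str:
    return seq[:length]


workflow = [Step("transcribe", transcribe),
            Step("take_prefix", take_prefix, {"length": 6})]
final, log = run_workflow(workflow, "atggccattgta")
```

Here `final` holds the output of the last tool, and `log` holds one entry per step with its input, parameters and output, which is the kind of information a recorded and re-usable experiment would need.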

myGrid aims to relieve some of this knowledge burden by incorporating specialist knowledge into semantically rich services over the Grid. It aims to offer a middleware layer through which bioinformatics applications can access a variety of semantically rich services that allow such workflows to be built, enacted, recorded, annotated and re-used. Many of the basic tasks needed to support such e-Science activities are currently semantically impoverished. Data are stored, but often without a sufficient description of their structure, let alone their semantics: is this string a protein? What tasks can I perform with these data? In myGrid, both data and services are semantically annotated through ontologies. In order to build, enact and record a workflow, the services offered by myGrid are augmented by rich semantics that describe the data, and the services that act upon those data, in a manner interpretable by both humans and machines. To achieve these goals, myGrid uses services that offer discovery, provenance, notification, workflow composition and enactment in a manner that puts the biologist at the center of the in silico experimental environment.
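The questions "is this string a protein?" and "what tasks can I perform with these data?" can be made concrete with a small sketch of semantic annotation and discovery. Everything here is an assumption made for illustration: the three-concept ontology, the `Service` class, the service names and the `discover` function are hypothetical and do not reflect myGrid's actual ontologies or interfaces.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical toy ontology: each concept maps to its parent (None for a root).
ONTOLOGY: Dict[str, Optional[str]] = {
    "sequence": None,
    "protein_sequence": "sequence",
    "dna_sequence": "sequence",
}


def is_a(concept: str, ancestor: str) -> bool:
    """True if `concept` equals `ancestor` or is a descendant of it."""
    current: Optional[str] = concept
    while current is not None:
        if current == ancestor:
            return True
        current = ONTOLOGY.get(current)
    return False


@dataclass
class Service:
    name: str
    accepts: str   # ontology concept of the input the service expects
    produces: str  # ontology concept of the output


def discover(services: List[Service], data_concept: str) -> List[str]:
    """Return names of services whose input concept subsumes the data's concept."""
    return [s.name for s in services if is_a(data_concept, s.accepts)]


registry = [
    Service("protein_similarity_search", "protein_sequence", "alignment"),
    Service("translate", "dna_sequence", "protein_sequence"),
    Service("compute_length", "sequence", "integer"),
]
matches = discover(registry, "protein_sequence")
```

Because the data item is annotated as a protein sequence rather than stored as a bare string, the lookup returns the services that accept protein sequences or any more general sequence, and excludes the one that expects DNA.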