A dimensional data warehouse for biological data

Tore Eriksson (1), Katsuki Tsuritani (2)
(1) tore.eriksson@po.rd.taisho.co.jp, Taisho Pharmaceuticals, Co., Ltd.; (2) k.tsuritani@po.rd.taisho.co.jp, Taisho Pharmaceuticals, Co., Ltd.

In response to the emergence of increasingly complete data sets for various biological entities -- genes, proteins, structural and functional domains, etc. -- there is a need for a flexible way of storing and analysing them, together with data acquired from a multitude of high-throughput experimental techniques.

This project is an attempt to apply the framework of dimensional modeling, a database design technique used in data warehousing, to biological data. In addition, hierarchical data in the form of directed acyclic graphs (DAGs), as found in e.g. ontologies, is incorporated in a way that allows easy mining of the dimensional model.
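As an illustration only, the following sketch shows one common way to make a DAG-shaped hierarchy queryable from a dimensional model: a closure table that stores every (ancestor, descendant) pair, so that a roll-up becomes a plain join. The abstract does not say which technique the warehouse uses; the closure table, all table and column names, and the toy genes and terms are assumptions made for this example.

```python
import sqlite3


def transitive_closure(edges):
    """Return all (ancestor, descendant) pairs of a DAG, including self-pairs."""
    children = {}
    nodes = set()
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
        nodes.update((parent, child))
    closure = set()
    for start in nodes:
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue  # a node can be reached via several parents in a DAG
            seen.add(node)
            closure.add((start, node))
            stack.extend(children.get(node, ()))
    return closure


def build_warehouse():
    """Create a toy star schema in memory (hypothetical names throughout)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE gene_dim (gene_key INTEGER PRIMARY KEY, symbol TEXT);
        CREATE TABLE term_dim (term_key INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE term_closure (ancestor_key INTEGER, descendant_key INTEGER);
        CREATE TABLE annotation_fact (gene_key INTEGER, term_key INTEGER);
    """)
    con.executemany("INSERT INTO gene_dim VALUES (?, ?)",
                    [(1, "g1"), (2, "g2"), (3, "g3")])
    con.executemany("INSERT INTO term_dim VALUES (?, ?)",
                    [(1, "root"), (2, "A"), (3, "B"), (4, "C")])
    # Term C has two parents (A and B), so this is a DAG rather than a tree.
    edges = [(1, 2), (1, 3), (2, 4), (3, 4)]
    con.executemany("INSERT INTO term_closure VALUES (?, ?)",
                    sorted(transitive_closure(edges)))
    con.executemany("INSERT INTO annotation_fact VALUES (?, ?)",
                    [(1, 4), (2, 2), (3, 3)])
    return con


def genes_under(con, term_name):
    """List genes annotated with a term or with any of its descendants."""
    rows = con.execute("""
        SELECT DISTINCT g.symbol
        FROM term_dim t
        JOIN term_closure c ON c.ancestor_key = t.term_key
        JOIN annotation_fact f ON f.term_key = c.descendant_key
        JOIN gene_dim g ON g.gene_key = f.gene_key
        WHERE t.name = ?
        ORDER BY g.symbol
    """, (term_name,)).fetchall()
    return [row[0] for row in rows]
```

Because the closure is precomputed at load time, a query such as `genes_under(con, "A")` needs no recursion, which is one reason this layout is often paired with star schemas.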

Data is taken from various public sources such as RefSeq, Ensembl, dbEST and Medline. After parsing and transformation to a common format, the data is quality-checked and loaded. In this way, even inherently dirty data such as ESTs can be used effectively.
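The parse, transform, quality-check and load steps might be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the record layout, the input keys `id` and `seq`, and the quality thresholds (`min_length`, `max_n_fraction`) are all assumptions chosen for the example.

```python
from dataclasses import dataclass

VALID_BASES = set("ACGTN")


@dataclass(frozen=True)
class SequenceRecord:
    """Common format shared by all sources (hypothetical layout)."""
    source: str
    accession: str
    sequence: str


def transform(source, raw):
    """Normalise one parsed record into the common format."""
    return SequenceRecord(
        source=source,
        accession=raw["id"].strip(),
        sequence=raw["seq"].strip().upper(),
    )


def passes_quality(record, min_length=20, max_n_fraction=0.1):
    """Reject short sequences, unknown symbols, and N-rich reads (assumed rules)."""
    seq = record.sequence
    if len(seq) < min_length:
        return False
    if not set(seq) <= VALID_BASES:
        return False
    return seq.count("N") / len(seq) <= max_n_fraction


def load(raw_by_source):
    """Split transformed records into accepted and rejected lists."""
    accepted, rejected = [], []
    for source, raws in raw_by_source.items():
        for raw in raws:
            record = transform(source, raw)
            (accepted if passes_quality(record) else rejected).append(record)
    return accepted, rejected
```

Keeping the rejected records, rather than silently dropping them, makes it possible to audit how much of a noisy source such as an EST collection was filtered out.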

Metadata describing the structure and content of all tables is used to administer the warehouse, making it possible to create a single web-based application to query and drill down on all fields.
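To illustrate how table metadata can drive a single generic query layer, here is a small sketch. The `METADATA` dictionary, its layout, and the `gene_dim` table are hypothetical; the abstract does not describe how the warehouse actually stores its metadata.

```python
import sqlite3

# Hypothetical metadata: one entry per table, listing column names and types.
METADATA = {
    "gene_dim": {
        "columns": {"gene_key": "INTEGER", "symbol": "TEXT", "organism": "TEXT"},
    },
}


def create_table_sql(metadata, table):
    """Derive the CREATE TABLE statement from the metadata entry."""
    columns = ", ".join(
        f"{name} {sql_type}"
        for name, sql_type in metadata[table]["columns"].items()
    )
    return f"CREATE TABLE {table} ({columns})"


def build_query(metadata, table, filters):
    """Build a parameterised SELECT; identifiers must appear in the metadata."""
    if table not in metadata:
        raise KeyError(f"unknown table: {table}")
    columns = metadata[table]["columns"]
    for name in filters:
        if name not in columns:
            raise KeyError(f"unknown column: {name}")
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if filters:
        sql += " WHERE " + " AND ".join(f"{name} = ?" for name in filters)
    return sql, list(filters.values())


def run_query(con, metadata, table, filters):
    """Execute the generated query and return rows as dictionaries."""
    sql, params = build_query(metadata, table, filters)
    names = list(metadata[table]["columns"])
    return [dict(zip(names, row)) for row in con.execute(sql, params)]
```

Since table and column names are only accepted when they appear in the metadata, and values are passed as bound parameters, the same function can serve every table a front end exposes without hand-written SQL per table.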