Databases, Information and Knowledge Management

1 - Genomic Hypothesis Creator: An Environment that Assists the Design and Implementation of Computational Experiments for Knowledge Discovery from Genomic Databases
Hideo Bannai, University of Tokyo; Yoshinori Tamada, Tokai University; Osamu Maruyama, Kyushu University; Satoru Miyano, University of Tokyo
We present Genomic Hypothesis Creator: a genome-oriented version of a programming library that enables domain experts to effectively conduct computational knowledge discovery experiments. Several hypothesis generation algorithms implemented with POSIX threads are available. The design goals are to support the creation and seamless integration of new attributes and/or existing attributes accumulated in major genomic databases.

2 - Distributing Bioinformatics Applications with Piper
J. W. Bizzaro, Gary Van Domselaar, Brad Chapman, Jean-Marc Valin, Jarl van Katwijk, Dominic Letourneau, Deanne Taylor, University of Massachusetts Lowell
Piper is an interactive system for creating and managing links between Internet-distributed components such as those used in bioinformatics analyses. Components can reside remotely, on higher performance and capacity computers, while only representations reside locally. Links can depict protocol-independent data flow, procedural steps, and relationships.

3 - PathDB: A Second Generation Metabolic Database
J. L. Blanchard, D. L. Bulmore, A. D. Farmer, M. Gonzales, P. A. Steadman, M. E. Waugh, S. T. Wlodek, P. Mendes, National Center for Genome Resources
PathDB is a relational database that stores detailed metabolic information. The database is coupled to query, visualization, and discovery tools that allow for pathway diagrams to be drawn "on the fly" and for new connections to be made between independently discovered facts, thereby avoiding the rigid confines of "textbook pathways."

4 - DRAGON: Database Referencing of Array Genes ONline
Christopher M. L. S. Bouton, Johns Hopkins School of Medicine and The Kennedy Krieger Institute; Elizabeth Johnson, Johns Hopkins School of Hygiene and Public Health; Carlo Colantuoni, Johns Hopkins School of Medicine and The Kennedy Krieger Institute; Scott Zeger, Johns Hopkins School of Hygiene and Public Health; Jonathan Pevsner, Johns Hopkins School of Medicine and The Kennedy Krieger Institute
We have developed "Database Referencing of Array Genes ONline" (DRAGON). DRAGON is a Web-accessible database that contains information derived from public databases. DRAGON defines the characteristics of genes in microarray data sets. The inclusion of this information during analysis allows for deeper insight into gene expression patterns.

5 - END: the Enzyme Nomenclature Database
Sínead Boyce, Trinity College; Andrej Bugrim, Washington University; Andrew McDonald, Trinity College; Francis Fabrizio, Jakub Slomczynski, Washington University; Keith Tipton, Trinity College; Toni Kazic, Washington University
We are developing END, a database of Enzyme Nomenclature, to be used for updating the enzyme data, amending existing entries, and for on-line queries. We have written a suite of parsers and other software tools to convert various data inputs, substitute terms, bring older nomenclature up to date, and check for data consistency and duplications.

6 - The Agora--an Environment for Distributed Deposit, Review, and Analysis of Biochemical Information
Andrej Bugrim, Washington University; Sínead Boyce, Trinity College; Guang Yao, University of Minnesota, Minneapolis; Francis Fabrizio, Washington University; Andrew McDonald, Trinity College; Jakub Slomczynski, Washington University; Jun Ong, University of Minnesota; Brian Feng, William Wise, Washington University; Keith Tipton, Trinity College; Lynda Ellis, University of Minnesota; Toni Kazic, Washington University
We present The Agora, a distributed computational environment for the deposit, review, and analysis of biochemical information. It provides an interface for sharing curatorial functions and queries among the independent, participating databases. Each database and algorithm preserves its native semantics, data model, and query language, while the scientific community can deposit, review, and query biochemical information.

7 - Drivers for Mutant Databases in the RbDe Web Service: Design Principles and Implementation
Fabien Campagne, Harel Weinstein, Mount Sinai School of Medicine
The Residue-based Diagram editor Web service (RbDe) allows online creation of residue-based diagrams of proteins. The presentation will outline the design of interfaces for the query of a mutant database and illustrate their use in the context of RbDe.

8 - Representation of Sequence Data: A Comparison of Prototypes
Fabien Campagne, Mount Sinai School of Medicine
This presentation compares four representations of sequence data (OMG lifesci/99-04-04, BioPerl, BioJava, crover) to find common design patterns and major differences. I attempt to relate the design choices that underlie these representations to the amount of interoperability they achieve.

9 - A Database of Recently Diverging Paralogous Genes in C. elegans
M. Daniel Caraco, University of Florida; Sridhar Govindarajan, Stephen G. Chamberlin, EraGen Biosciences, Inc.; Steven A. Benner, University of Florida
We categorize all paralogous genes of C. elegans into those that arose recently, and those that were established near or before the creation of the C. elegans developmental biology plan. We then hypothesize sets of genes that are not involved in core developmental biology of C. elegans using a recently developed statistical parametric method to compute NED (Neutral Evolutionary Distance) to clock divergence.

10 - Gene Expression Database (GXD): An Integrated Resource for Mouse Gene Expression Information
John P. Corradi, Dale A. Begley, Geoffrey L. Davis, Janan T. Eppig, David P. Hill, Jim A. Kadin, Ingeborg McCright, Joel E. Richardson, Martin Ringwald, The Jackson Laboratory, ME
The major goal of GXD is to provide for the storage, integration and retrieval of primary gene expression data for the developing and adult laboratory mouse. Gene expression information is placed in a larger biological context via careful curation, the use of controlled vocabularies, and integration with the Mouse Genome Database. Future plans for GXD will be addressed.

11 - A Distributed Annotation System Client
Robin Dowell, Sean R. Eddy, Washington University; Lincoln D. Stein, Cold Spring Harbor Laboratory, NY
The distributed annotation system (DAS) client software is a Java-based, browser-like application that allows a researcher to query one or more disparate annotation servers to retrieve features about a region of interest within a genome. The client displays graphical maps of the data, which are returned in a standard XML format.

12 - The University of Minnesota Biocatalysis/Biodegradation Database: Predicting Biodegradative Metabolism for a Post-genomic World
Lynda B. M. Ellis, C. Douglas Hershberger, Lawrence P. Wackett, University of Minnesota
One goal of the UM-BBD is to cover the wide range of organic functional groups that can be metabolized by microbes. We discuss the latest developments in the UM-BBD and the methodology through which we may be able to use this knowledge to predict biodegradation pathways of novel compounds.

13 - The Lipase Engineering Database
Markus Fischer, Rolf D. Schmid, Jürgen Pleiss, University of Stuttgart
The Lipase Engineering Database (LED) is a WWW-accessible resource on sequence-structure-function relationships of microbial lipases. A set of data mining and data processing tools has been developed to provide multisequence alignments of lipase families and consistently annotated, superposed X-ray structures. The LED has been shown to be a powerful tool for protein engineering.

14 - A Framework for Evaluating Global Strategies for Parallel Experiment Design under Varying Resource Constraints
Vanathi Gopalakrishnan, University of Pittsburgh
In this research, a Parallel Experiment Planning (PEP) framework is developed that: (1) provides a computational representation and set of tools to manage information about parallel experiments (or trials), and (2) can provide intelligent assistance for decision-making by suggesting likely places in search space for new trials and portions of space that are unlikely to yield results so that they could be closed.

15 - Development of Protein Thermodynamic Database and Its Application for Predicting the Stability of Protein Mutants
M. Michael Gromiha, Jianghong An, Motohisa Ootulake, Hiditoshi Kono, Hatsuko Vedairo, Akinore Sarai, RIKEN Tsukuba Institute, Japan; Motohisa Oobatake, Meijo University
We developed a "thermodynamic database for proteins and mutants (ProTherm)" containing important thermodynamic parameters, experimental details, structural, functional and literature information. Hydrophobicity is the major factor for the stability of buried mutants whereas partially buried coil mutations are mainly influenced by entropy.

16 - maxd - A Data Warehouse, Analysis, and Visualisation Environment for Expression Data
David J Hancock, Norman Morrisson, Magnus Rattray, Andy Brass, Michael J. Cornell, University of Manchester
'maxd' is a warehousing and analysis environment specifically for expression data. The database, based on the EBI's ArrayExpress+ model and described in ANSI SQL92 for portability, can store data from a variety of sources, including cDNA and oligonucleotide-based microarrays. A complementary suite of modular, open-source Java tools for data storage, retrieval, analysis and visualisation is described.

17 - Functional Gene Networks: A Case Study of Novel Data Management
S. Heymann, Peter Rieger, Kelman Gesellschaft für Geninformation mbH
Kelman's high-end solution for bioinformatics and functional genome research ensures new levels of data consistency and exploitation. The resulting Gene Network provides an in-depth understanding of gene interplay, involving gene products in all their molecular versions. This is exemplified here by means of a case study for hereditary disease research.

18 - Collecting and Harvesting Biological Data: The NucleaRDB
Florence Horn, Fred H. Cohen, University of California, San Francisco
We have set up a database for nuclear hormone receptors, the NucleaRDB. It already holds sequence information for 500 receptors. Our main aim is to capture and provide heterogeneous experimental data, such as ligand binding constants, mutation and expression data. This data will be automatically extracted from electronically available literature.

19 - Structured Vocabularies in Mouse Genome Informatics
J. A. Kadin, J. A. Blake, J. E. Richardson, M. Ringwald, C. J. Bult, J. T. Eppig, The Jackson Laboratory, ME
MGI is involved in the development of several large structured vocabularies and is using these to annotate mouse genes and expression results. These vocabularies include the Anatomical Dictionary of Mouse Development, representing anatomical structures, and the Gene Ontology (GO), describing molecular functions, biological processes, and cellular locations of gene products.

20 - A Protein Localization Knowledge Base Populated by Text-Extracted Assertions
Kiarri Kershaw, Toby Goldstein, Francisco Pereira, Chris Hauser, Mark Craven, Robert Murphy, Carnegie Mellon University
We have developed a protein localization knowledge base that describes more than 50 subcellular structures and nearly 700 relations among them. Completeness has been confirmed by analyzing the subcellular location field in SWISS-PROT. We are presently populating our knowledge base with instances of protein-location relations extracted automatically from the literature.

21 - AraXDb: The Arabidopsis thaliana Expression Database
Sebastian Kloska, Max-Planck-Institut of Molecular Plant Physiology; André Flöter, University of Potsdam; Bernd Essigmann, Thomas Altmann, Max-Planck-Institut of Molecular Plant Physiology; Torsten Schaub, University of Potsdam
Microarray technologies are promising tools for investigating the molecular physiological status of organisms as a whole. It is clear that in order to manage the massive datasets generated using these approaches, standard data processing systems must give way to specialized programs. The implementation of a workflow system for the storage and analysis of large-scale expression profiling data is reported.

22 - SLAD, a Model Data Warehouse in Molecular Biology
Judice L. Y. Koh, Christian Schönbach, Vladimir Brusic, National University of Singapore
SLAD is a small database of swine leukocyte antigen (SLA) genes. The multi-dimensional data model of SLAD allows for a) quick and easy annotation of data, b) combination of qualitative, quantitative and descriptive data types and c) ease of adding new analyses. SLAD has demonstrated that data warehousing can provide the means for efficient analysis and data mining in molecular biology.

23 - ArrayBank™, a Community Microarray Database and Knowledgebase with Integrated Analysis Tools
John Kokinis, David Jones, Gary Lindstrom, Ron Lundstrom, University of Utah
Our project sets out to combine a relational database of mRNA expression data, sample descriptive data (pathology, tissue type, drug, concentration, etc.), and putative functional information to form a knowledge base of gene expression profiles and an open-ended set of distributed and easy-to-use analysis tools that provide the ability to effectively discover and compare patterns of gene expression between a priori unrelated data.

24 - Computational Linguistics of DNA: A Case in the Knowledge Representation and Pattern Recognition of Escherichia coli Promoters
Siu-Wai Leung, Chris Mellish, Dave Robertson, University of Edinburgh
Basic Gene Grammars (BGGs) were developed to represent the knowledge of E. coli promoters, including a domain theory, consensus sequences, weight matrices, the results of symbolic learning and knowledge-based neural networks. DNA-ChartParser provided bidirectional parsing facilities for BGGs. The knowledge of E. coli promoters was assessed by parsing actual DNA sequences.

25 - mtmDB: A Maize-targeted Mutagenesis Database
Hong Liu, Robert Martienssen, Cold Spring Harbor Laboratory, NY; Mike Freeling, University of California, Berkeley; Lincoln Stein, Cold Spring Harbor Laboratory
mtmDB is a maize-targeted mutagenesis database currently containing information on 43,776 transposon insertion mutants. The database contains phenotype information (including images), pedigrees, and partial DNA sequencing information, as well as other information. Using an online request form, researchers may request mutant strains that affect genes of interest, and are invited to return phenotypic information for incorporation into the database.

26 - A Database and Browser for Genome Analysis and cDNA Assembly
Yuan Liu, Yuhong Wang, Guochun Xie, Yu Lin, Richard Blevins, Merck and Co., Inc.
To extract information from genomic sequence data and to facilitate gene discovery, a genomic data mining system has been developed to provide scientists with the most up-to-date information. All data is stored in an Oracle relational database. A set of interactive visualization tools has been developed to access the Oracle database.

27 - Using a Newly Constructed Virtual Protein Database for Plasmodium in the Search for Virulence Genes on Which Positive Selection May Operate
Ralhston Muller, Winston Hide, University of the Western Cape, South Africa
The first aim was to apply the stack_pack clustering algorithm to 15,468 Plasmodium sequences and to predict protein sequences from the DNA consensus sequences using ESTScan. The second aim was to use a simple method for estimating the number of synonymous and non-synonymous substitutions, and thereby to detect genes on which positive selection may operate.

28 - Bioinformatics Resources for Genome Analysis in Farm Animals
J. Paul Nelson, Alan L. Archibald, Andy S. Law, Roslin Institute, UK
The Bioinformatics group at the Roslin Institute is developing bioinformatics tools and resources for scientists engaged in genome analysis in farmed and domestic animals. The resources developed encompass both the databases and the associated analytical and display tools required for genetic and physical mapping of farm animals.

29 - Bio-calculus: Toward a Generalized Description System for Biology
Shuichi Onami, Kitano Symbiotic Systems Project Japan; Masao Nagasaki, Satoru Miyano, Kitano Symbiotic Systems Project, University of Tokyo; Hiroaki Kitano, Kitano Symbiotic Systems Project, Sony CSL
Bio-calculus is a knowledge description system, trying to describe any kind of biological knowledge using the same description principle. We presented its concept, and syntax and simulation software for molecular interaction. Currently, we are developing bio-calculus for more complicated biological phenomena, such as sub-cellular localization, cell-cell interaction, and cell division.

30 - Relational Databases, Statistics, and Improved Homology Assays
Robert P. Otillar, University of California, San Francisco
This work investigates empirical limits of alignment-based homology assays between genes with very low sequence identity. We show useful statistics and interesting case studies from 1.15 million sequence alignments from the Structural Classification of Proteins, emphasizing gene comparisons with low (sub-'twilight') residue position-by-position similarity scores. We discuss fundamental limitations of Smith-Waterman, Fasta, gapped/PSI BLAST, and hidden Markov assays.

31 - A Resource for Information on Plant Chromatin Remodeling Genes
Ritu Pandey, David Selinger, David Mount, Vicki Chandler, Richard Jorgensen, University of Arizona
Genes identified in the genomes of several model organisms that regulate chromatin structure have been used as probes to find similar genes in Arabidopsis genomic sequence and maize EST sequences. Information on these genes stored in ChromDB will be accessible through the existing Website at

32 - GIMS - Genome Information Management System
Norman W. Paton, Shakeel A. Khan, Andy Hayes, Fouzia Moussouni, Michael J. Cornell, Karen Eilbeck, Andy Brass, Carole A. Goble, University of Manchester; Simon Hubbard, University of Manchester Institute of Science and Technology; Stephen H. Oliver, University of Manchester
Complex data sets provided by genome sequencing projects present challenges in the storage, analysis, and presentation of information. Using UML, we have produced models that describe genome sequences, protein interactions, and transcription data. These have been applied to data from Saccharomyces cerevisiae, a lead organism in functional genomics.

33 - Data Processing of Barley ESTs
K.-P. Pleissner, W. Michalek, U. Willschere, A. Graner, Institute of Plant Genetics and Crop Plant Research, Germany
The data processing of barley ESTs comprises the assignment of putative functions to ESTs and the identification of a unigene set of barley. For the functional assignment, the ESTs were searched with BLAST against SWISSPIRPLUS. A database containing the ESTs and their BLASTX2 results was built using the MySQL DBMS. BLAST searches allowed the identification of more than 3000 barley genes.

34 - XNotify: An Automatic Sequence Database Search and Results Management System
G. B. Quinn, The Burnham Institute; Philip E. Bourne, University of California, San Diego
Based on a Linux OS platform, an automatic sequence database search system has been written in the C programming language that performs daily sequence database searches of NCBI sequence datasets, notifying the user by email only when new or previously unseen matches have been found to a query sequence. Additionally, search data is stored online.
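The "notify only when new or previously unseen matches have been found" behavior can be illustrated with a minimal sketch. This is not the XNotify implementation (which the abstract says is written in C); it is an illustrative Python outline, and the function name find_new_matches, the JSON record file, and the hit identifiers are all hypothetical. The search itself and the email step are left out.

```python
import json
from pathlib import Path


def find_new_matches(current_hits, seen_path):
    """Return hits not recorded by earlier runs, then update the record.

    current_hits: identifiers returned by today's search (hypothetical input).
    seen_path: file holding identifiers already reported (hypothetical format).
    """
    seen_file = Path(seen_path)
    # Load identifiers reported by previous runs, if a record exists.
    seen = set(json.loads(seen_file.read_text())) if seen_file.exists() else set()
    # Keep the input order while dropping anything already seen or repeated.
    new_hits = []
    for hit in current_hits:
        if hit not in seen:
            seen.add(hit)
            new_hits.append(hit)
    # Persist the union so that the next run reports only later additions.
    seen_file.write_text(json.dumps(sorted(seen)))
    return new_hits
```

A daily job would call this once per query sequence and send a notification only when the returned list is non-empty.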

35 - The TRANSPATH Signal Transduction Database--a Knowledge Base on Signal Transduction Networks
F. Schacherer, GBF German Research Centre for Biotechnology; E. Wingender, GBF, Biobase Biological Databases GmbH, Germany
TRANSPATH provides access to the growing amount of signal transduction data, mainly to pathways involved in mammalian transcription regulation. Entries are validated with references to original publications and linked to other databases. The knowledge base goes beyond the approach of traditional protein databases by storing the network of interactions.

36 - ISYS: A Software Platform to Enable the Integration of Heterogeneous Bioinformatics Resources
Adam Siepel, Andrew Tolopko, Andrew Farmer, Peter Steadman, Dawn Perry, Faye Schilkey, William Beavis, National Center for Genome Resources
Heterogeneity of databases and software resources continues to hamper the integration of biological information. We present a highly flexible, bottom-up approach to this problem that uses a generic integration platform to enable the interoperation of diverse software components. Our solution is designed to make maximal use of existing resources.

37 - The Development of GIDS: A Relational Database for High-throughput Genotyping
H. H. M. J. van Bakel, P. L. Pearson, C. Wijmenga, University Medical Center Utrecht
The Genome Information Database System (GIDS) is being developed to store the large amount of genotypical and phenotypical data needed for the study of complex genetic diseases. Besides serving as a central data storage facility, GIDS will also play an active role in regulating the data flow in the laboratory.

38 - PIR-Class: An Object-relational Protein Classification Database for Sequence Annotation and Genome Research
Cathy H. Wu, Chunlin Xiao, Zhenglin Hou, Winona C. Barker, Georgetown University Medical Center
PIR-Class database is designed to provide an integrated platform for describing comprehensive family relationships and structural and functional features of proteins, with summary information of superfamilies/families, domains, and motifs, and rich links to various databases. The database is implemented in Oracle, searchable from, and can support genomic research.

39 - Database of Interacting Proteins: A Benchmarking Tool for Protein-protein Interactions Prediction
Ioannis Xenarios, Edward M. Marcotte, Michael Thompson, Xiaquon Joyce Duan, Lukasz Salwinsky, David Eisenberg, University of California, Los Angeles
The Database of Interacting Proteins (DIP) is a database that contains experimentally determined protein-protein interactions. This database has two main goals: 1) giving an integrative database for browsing and efficiently extracting information about proteins of interest; 2) being a useful tool to benchmark protein-protein prediction methods.

40 - Using a Formal Language to Define Biological Semantics: A Case Study
Guang Yao, University of Minnesota, Minneapolis; Lynda B. M. Ellis, Toni Kazic, Washington University
Glossa is a language that defines the semantics of biological ideas as executable code. We have used the University of Minnesota Biocatalysis/Biodegradation Database as a testbed for developing both Glossa and the underlying machinery for distributed computations (The Agora) among independent databases.

41 - Bioinformatics Needs Analysis by Data Mining
Stuart Yarfitz, Eugene Wan, Joanne West, University of Washington
We have been studying researchers' information-seeking behavior and needs by analyzing bioinformatics services program records. Relational databases are used to organize and query data from Web logs, consultants’ email folders, and software registration and log files. Facet analysis of consultation encounters is being used for ontology and thesaurus development.

42 - GeneX: A Generic Relational Schema for the Storage and Exchange of Gene Expression Data Using Relational Database and XML
Jiaye Zhou, Guanghong Chen, Greg Colello, Andrew Farmer, Harry Mangalam, National Center for Genome Resources; Jason Stewart, Avestha Engraine Technologies, India; Mark Waugh, Jennifer Weller, National Center for Genome Resources
We present the GeneX data storage schema, the XML data exchange format as well as the GeneX software tools implemented at the National Center for Genome Resources. Design and implementation details of the storage and analytical systems and the source code of the software tools can be found at

43 - Tools for Analysis of Groups of E. coli Genes
D. P. Zimmer, S. Kustu, University of California, Berkeley
We have developed a database and tools for analysis of groups of E. coli genes. The tools are designed to facilitate analysis of global expression experiments. The database is implemented in MySQL and programs are written in Perl on Linux/Intel.

44 - An Intelligent Database System for Analysis of Signal Transduction Pathways
Zhuang Zuo, Gary Pestano, Ramprasad Ramakrishna, Kam-Chuen Jim, Physiome Sciences, Inc.
A fully interactive, Web-based intelligent database system, as part of the In Silico Cell™, was developed for modeling signal transduction pathways. A T cell model built within this system successfully mimicked in vivo changes of cytokine secretion when stimulated. This system offers enhanced functions including virtual knockouts and over-expression analyses.

