Tutorial on Graph Data Management for Biology Tutorial on Graph Data Management for Biology
 
 

Tutorial on Graph Data Management for Biology

Frank Olken (LBNL)
Proposed for ISMB 2004

Goals and Motivation

Many types of biological and chemical information can be naturally as graphs, i.e., sets of vertices and edges, examples include biopathways data, protein interaction networks, taxonomies and phylogenetic trees, chemical structure graphs, food webs, laboratory protocols, genetic maps, multiple sequence alignments, etc. Queries against such graphs often include various types of path queries (regular expressions, shortest paths, etc.), or graph pattern matching (subgraph isomorphism, subgraph homomorphism, and subgraph homeomorphisms). Conventional relational databases are cumbersome for answering many types of graph queries.

This tutorial introduce students to graph data management technology for biological applications, with an emphasis on applications to biopathways data, i.e., metabolic pathways, signal transduction pathways, and genetic regulatory networks. The tutorial will cover graph data models, graph queries, graph query languages, implementation strategies, history of graph data management systems, examples of current biopathways database systems, comparison to other types of database management systems such as relational, logic and object oriented database systems, and possibly a brief mention of the potential role graph grammars.

Upon completion of the tutorial, students should

  • have an understanding of the basic concepts, difficulties, and roles of graph data management in biological applications,
  • be prepared to read and understand the literature on graph data management,
  • understand the issues involved in choosing a graph data management technology

Duration

This tutorial will run for a half day, approximately 3.5 hours.

Audience

The intended audience is for graduate students, researchers and practicioners in bioinformatics, data management, and graph algorithms, and computer-savvy biologists interested in data management of graph-based biological data: biopathways data, protein interactions networks, chemical structure graphs, phylogenetic trees, etc.

Prerequisites

We shall assume that the audience is familiar with the biological applications and with relational data management technology. No previous knowledge of graph theory is assumed.

Instructor

The instructor is Frank Olken a computer scientist in the Scientific Data Management Group at Lawrence Berkeley National Laboratory, where he has worked on statistical and scientific databases. He has also worked on metadata standards, including XML and RDF Schema Languages standards committees. He has previously taught data management at UC Berkeley, and XML Schema Language and Workflow Management for UCB Extension. He has also taught an earlier version of this tutorial at the IEEE Computational Biology meeting at Stanford in August of 2003. He also lectures on this topic at the Program in Genomic Applications week-long in-service training courses offered at LBNL several times / year. He is currently working in the area of graph data management systems for biopathways data at LBNL on the Biopathways Graph Data Manager Project website located at http://www.lbl.gov/~olken/graphd/graphdm.htm

Detailed outline

I. Introduction
	A.  What is a graph?  set of nodes + binary edge relation
	B.  Examples of biopathways graphs: aMAZE, KEGG, ...
	C.  Advantages of graph data managers:
		natural data model, concise queries, query efficacy

II. Types of Graphs
	A.  Undirected Graphs 
	B.  Directed Graphs
	C.  Labeled Graphs (nodes, edges)
	D.  Nested graphs
	E.  Hypergraphs 
	F.  Multi-graphs
	G.  DAGS (Directed Acyclic Graphs)
	H.  Cyclic graphs

II.  Biological Graph Data
	A. Biopathways (directed, nested, possibly hypergraphs)
	B. Protein Interaction Networks  (undirected)
	C. Chemical Structure Graphs (undirected, multi-graphs)
	D. Genetic Maps (DAGs)
	E. Bibliographic citations (DAGs)
	F. Taxonomies, phylogenetic trees (DAGs)
	G. Sequence overlap graph for shotgun sequence
		assembly (nearly interval graphs)
	H. Multiple Sequence Alignments (C. Lee at UCLA)
	I. Partonomies for anatomy (DAGs)
	J. Topology relations for imaging and anatomy
	K. Laboratory protocols  (process models)
	L. Database schemas / mappings 
	M. Contact graphs
	N. Atomic matchings (thru reactions)

III.  Types of Graph Queries
	A. Boolean operations: graph intersection, union
	B. Path queries (transitive closure, shortest path, ...)
	C. Graph pattern matching
		1.  subgraph isomorphism
		2.  subgraph homomorphism
		3.  subgraph homeomorphism
	D. Approximate graph matching (Shasha, et al.)

IV. Graph Data Models 
	A.  Binary relational models (NIAM)
	B.  Semantic Networks
	C.  Graphlog, Hylog (Mendelzon, Consens)
	D.  GRAM
	E.  GraphDB (Guting)
	F.  GOOD (Gemis)
	F.  GOQL (Sheng, Ozsoyoglu, Ozsoyoglu)
	G.  Semi-structured Data, Querying the Web
	H.  RDF (Resource Description Framework)
	

V.  Graph Query Languages
	A.  Selection portion
	B.  Construction portion 
	C.  XQuery
	D.  RDF Query Languages

VI. Graph representation
	A. Edge list (relational, sparse representation)
	B. Adjacency Matrix (dense representation)

VI.  Approaches to Query Processing
	A. Depth First Search
	B. Breadth First Search
	C. Bottom up query processing 
	D. Transitive closure query methods
	E. Shortest path queries
	F. Spectral methods

VII. Query Optimization
	A. Selection of query processing strategy
	B. Memory resident vs. external (disk) processing
	C. Use of Graph statistics (avg degree, diameter, ...)
	D. Predicate ordering
	E. Pattern matching - where to start
	F. Subgraph homomorphism:  term expansion vs. graph exploration
	G. Choice of edge list vs. adjacency matrix representations

VI.  Implementation Approaches
	A. Design Criteria: performance, extensibility, scalability,
	development difficulty
	B. Integrated vs. Layered Designs
	C. Backend DBMS: relational, logic, OO, XML, storage server

VIII.  Comparison of Graph Data Mgt to other data models
	A.  Homogeneous nodes vs. Heterogeneous "relations"
	B.  Simple path queries vs. union queries
	C.  Efficiency of query evaluation

IX.  Graph Grammars (GG) 
	A.  What is a graph grammar?
	B.  Graph grammars as query language


Prepared by Frank Olken on Dec. 14, 2003