| |
Tutorial on Graph Data Management for Biology
Frank Olken (LBNL) Proposed for ISMB 2004
Goals and Motivation
Many types of biological and chemical information can be
naturally
as graphs, i.e., sets of vertices and edges,
examples include biopathways data, protein interaction
networks, taxonomies and phylogenetic trees,
chemical structure graphs, food webs,
laboratory protocols, genetic maps, multiple sequence alignments, etc.
Queries against such graphs often include various types
of path queries (regular expressions, shortest paths, etc.),
or graph pattern matching (subgraph isomorphism, subgraph
homomorphism, and subgraph homeomorphisms).
Conventional relational databases are cumbersome for
answering many types of graph queries.
This tutorial introduce students to graph data management technology
for biological applications, with an emphasis on applications
to biopathways data, i.e., metabolic pathways, signal transduction
pathways, and genetic regulatory networks.
The tutorial will cover graph data models,
graph queries, graph query languages,
implementation strategies,
history of graph data management systems,
examples of current biopathways database systems,
comparison to other types of database management
systems such as relational, logic and object oriented
database systems, and possibly a brief mention of the
potential role graph grammars.
Upon completion of the tutorial, students should
- have an understanding of the basic concepts,
difficulties, and roles of graph data management
in biological applications,
- be prepared to read and understand the literature
on graph data management,
- understand the issues involved in choosing a graph
data management technology
Duration
This tutorial will run for a half day, approximately 3.5 hours.
Audience
The intended audience is for
graduate students, researchers and practicioners
in bioinformatics, data management, and graph algorithms,
and computer-savvy biologists interested in data management of
graph-based biological data: biopathways data, protein interactions
networks, chemical structure graphs, phylogenetic trees, etc.
Prerequisites
We shall assume that the audience is familiar
with the biological applications and with
relational data management technology.
No previous knowledge of graph theory is assumed.
Instructor
The instructor is
Frank Olken
a computer scientist
in the Scientific Data Management Group at
Lawrence Berkeley National Laboratory,
where he has
worked on statistical and scientific databases.
He has also worked on metadata
standards, including XML and RDF Schema Languages standards
committees.
He has previously taught data management at UC Berkeley,
and XML Schema Language and Workflow Management for UCB Extension.
He has also taught an earlier version of this tutorial at the
IEEE Computational Biology meeting at Stanford in August of 2003.
He also lectures on this topic at the
Program in Genomic Applications week-long in-service training
courses offered at LBNL several times / year.
He is currently working in the area of graph data management
systems for biopathways data at LBNL on the
Biopathways Graph Data Manager Project
website located at
http://www.lbl.gov/~olken/graphd/graphdm.htm
Detailed outline
I. Introduction
A. What is a graph? set of nodes + binary edge relation
B. Examples of biopathways graphs: aMAZE, KEGG, ...
C. Advantages of graph data managers:
natural data model, concise queries, query efficacy
II. Types of Graphs
A. Undirected Graphs
B. Directed Graphs
C. Labeled Graphs (nodes, edges)
D. Nested graphs
E. Hypergraphs
F. Multi-graphs
G. DAGS (Directed Acyclic Graphs)
H. Cyclic graphs
II. Biological Graph Data
A. Biopathways (directed, nested, possibly hypergraphs)
B. Protein Interaction Networks (undirected)
C. Chemical Structure Graphs (undirected, multi-graphs)
D. Genetic Maps (DAGs)
E. Bibliographic citations (DAGs)
F. Taxonomies, phylogenetic trees (DAGs)
G. Sequence overlap graph for shotgun sequence
assembly (nearly interval graphs)
H. Multiple Sequence Alignments (C. Lee at UCLA)
I. Partonomies for anatomy (DAGs)
J. Topology relations for imaging and anatomy
K. Laboratory protocols (process models)
L. Database schemas / mappings
M. Contact graphs
N. Atomic matchings (thru reactions)
III. Types of Graph Queries
A. Boolean operations: graph intersection, union
B. Path queries (transitive closure, shortest path, ...)
C. Graph pattern matching
1. subgraph isomorphism
2. subgraph homomorphism
3. subgraph homeomorphism
D. Approximate graph matching (Shasha, et al.)
IV. Graph Data Models
A. Binary relational models (NIAM)
B. Semantic Networks
C. Graphlog, Hylog (Mendelzon, Consens)
D. GRAM
E. GraphDB (Guting)
F. GOOD (Gemis)
F. GOQL (Sheng, Ozsoyoglu, Ozsoyoglu)
G. Semi-structured Data, Querying the Web
H. RDF (Resource Description Framework)
V. Graph Query Languages
A. Selection portion
B. Construction portion
C. XQuery
D. RDF Query Languages
VI. Graph representation
A. Edge list (relational, sparse representation)
B. Adjacency Matrix (dense representation)
VI. Approaches to Query Processing
A. Depth First Search
B. Breadth First Search
C. Bottom up query processing
D. Transitive closure query methods
E. Shortest path queries
F. Spectral methods
VII. Query Optimization
A. Selection of query processing strategy
B. Memory resident vs. external (disk) processing
C. Use of Graph statistics (avg degree, diameter, ...)
D. Predicate ordering
E. Pattern matching - where to start
F. Subgraph homomorphism: term expansion vs. graph exploration
G. Choice of edge list vs. adjacency matrix representations
VI. Implementation Approaches
A. Design Criteria: performance, extensibility, scalability,
development difficulty
B. Integrated vs. Layered Designs
C. Backend DBMS: relational, logic, OO, XML, storage server
VIII. Comparison of Graph Data Mgt to other data models
A. Homogeneous nodes vs. Heterogeneous "relations"
B. Simple path queries vs. union queries
C. Efficiency of query evaluation
IX. Graph Grammars (GG)
A. What is a graph grammar?
B. Graph grammars as query language
Prepared by
Frank Olken
on Dec. 14, 2003
|
|