Optimizing Genome Interval Overlap Queries Using an R-Tree Index

Hilmar Lapp1, Chris Mungall2, Scott Cain, Lincoln Stein
1hlapp@gnf.org, GNF; 2cjm@fruitfly.org, University of California, Berkely

Genome interval overlap queries are probably the most common queries issued by genome browsers. Formally, such a query asks for all features that either end in, or start in, or are contained within, or contain a certain interval on a segment of a genome. A genome browser issues this query against a database of an annotated genome in order to extract the annotations to be displayed for a certain window, typically on a chromosome or assembly contig. Annotations located to the genome constitute the features, the window of interest is the interval, and segment represents the chromosome or contig.

Traditionally, this type of query has been supported by a composite B-tree index on the triple of (segment, start, end), where feature locations are stored in one table and start and end represent the respective coordinates of a location. Even though this index can provide sufficient support for queries against smaller genomes, the performance of this approach has been plagued by its inherent dependency on how well the first attributes in the index slice the index tree that needs to be scanned for the other attributes, and therefore needs to be read from disk. This strong dependency leads to a huge variance of the response time with occasional exceptionally poor performance, for instance for queries into the middle of a segment. The fact that genome browsers often are core components of genome portals and hence need to respond to many queries simultaneously amplifies the problem, as does the availability of more and more large eukaryotic genomes. Various attempts have been made to improve the performance, like physically separating locations by segment, or introducing a binning scheme, none of which addresses the fundamental problem in the B-tree index structure for this type of query.

Result. We present a solution that builds on translating the interval overlap query into a two-dimensional point-in-box geometric query supported by an R-tree index. Our solution is based on a theoretical framework developed in the domain of text processing [1], where efficient handling of region-based operations is a common and well-studied problem. In this framework, a region is translated into a two-dimensional point. We then construct a box for each location containing all points that correspond to all intervals with which the feature overlaps. The interval overlap query then reduces to a query for all features the computed boxes of which contain the point that represents the region of interest, which is well supported by an R-tree index on the rectangle data type.

We will present an overview of the underlying theory, together with reference implementations of this approach for Chado (GMOD), BioSQL, and DBGFF (GBrowse) [2], using PostgreSQL and Oracle as RDBMSs. Initial performance comparisons to the traditional approach using benchmark datasets show a 2 to 4-fold speed-up of the average query response time, with the most notable improvement being the very small variance, regardless of feature position on the segment.

References

[1] Robert C. Miller. Lightweight Structure in Text. PhD thesis, Computer Science Department, Carnegie Mellon University, May 2002

[2] Chado, http://www.gmod.org/schema, BioSQL, http://obda.open-bio.org/. GBrowse, http://www.gmod.org/ggb/.