Abstract

Efficient Index Structures for String Databases

by: Tamer Kahveci, Ambuj K. Singh

Abstract:

We consider the problem of substring searching in very large string databases. Typical applications of this problem are genetic data, web data, and event sequences.Since the size of such databases grows exponentially, it becomes impractical to use in-memory algorithms for these problems.In this paper, we propose to map the substrings of the data into an integer space with the help of wavelet coefficients.Later, we index these coefficients using Minimum Bounding Rectangles. We define a distance function which is a lower bound to the actual edit distance between strings.We experiment with both nearest neighbor queries and range queries.The results show that our technique prunes significant amount of the database (typically 50-95\\%), thus reducingboth the disk I/O cost and the CPU cost significantly.

Keywords:

strings, similarity, bioinformatics, indexing, wavelets

Date:

April 2001

Document: 2001-10

Department of Computer Science

University of California, Santa Barbara

Abstract

Efficient Index Structures for String Databases

by: Tamer Kahveci, Ambuj K. Singh