Abstract
Efficient Index Structures for String Databases
by: Tamer Kahveci, Ambuj K. Singh
Abstract:
We consider the problem of substring searching in very large string databases. Typical applications of this problem are genetic data, web data, and event sequences.Since the size of such databases grows exponentially, it becomes impractical to use in-memory algorithms for these problems.In this paper, we propose to map the substrings of the data into an integer space with the help of wavelet coefficients.Later, we index these coefficients using Minimum Bounding Rectangles. We define a distance function which is a lower bound to the actual edit distance between strings.We experiment with both nearest neighbor queries and range queries.The results show that our technique prunes significant amount of the database (typically 50-95\\%), thus reducingboth the disk I/O cost and the CPU cost significantly.
Keywords:
strings, similarity, bioinformatics, indexing, wavelets
Date:
April 2001
Document: 2001-10