- Universidad Diego Portales, Santiago de Chile
Monday, June 5, 2017
Aula C - Rinfresco 14.15, inizio seminario 14.30.
Indexing highly repetitive texts --- such as genomic databases, software repositories and versioned text collections --- has become a hot topic since the turn of the millennium. A simple solution is to use an FM-index based on the run-length compressed Burrows-Wheeler Transform (RLBWT) of the text, which achieves excellent compression and allows us, given a pattern, to count quickly the number of times it appear in the text. In order to be able to locate quickly the positions where it occurs, however, conventional wisdom has been that we must augment the RLBWT with a sample of the suffix array, and that does not compress so well: the product of the size and the query per occurrence is roughly linear, meaning it is either much larger than the RLBWT or very slow. In this talk we demonstrate another way to augment the RLBWT, that is simple, does not increase the total size much, and supports locating in predecessor time. In our preliminary experiments, it is a thousand times faster than a suffix array sample with the same memory footprint.
This is joint work with Gonzalo Navarro and Nicola Prezza.