Matching statistics (MS) computation is at the heart of numerous bioinformatics applications, from read alignment to computing phylogenies of a set of genomes, and even to speeding up the computation of core data structures on collections of genomes. In many of these datasets, the sequences are highly similar to the reference, which itself, however, may not be very repetitive. Some heuristics based on sequence-to-sequence similarity have already been studied in [Lipták et al., Alg. Mol. Biol. 2024], leading to a significant speedup in the computation of the matching statistics. In this paper, we introduce a new heuristic that further speeds up MS computation. The core idea is to take advantage of existing similarities between the input sequences and the reference. We give an implementation that makes use of this heuristic and also allows the use of multiple threads to parallelize MS computation. We give an experimental evaluation of our tool, LRF-ms, comparing it to other MS computation tools on publicly available genomic datasets, and show that it is the fastest when the collection of genomes is highly similar to the reference string, while keeping a comparably low memory footprint.
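For readers unfamiliar with the term, the following is a minimal sketch of what matching statistics are, using the common definition in which entry i is the length of the longest prefix of the i-th suffix of the input that occurs somewhere in the reference. It is a naive quadratic illustration only, not the heuristic or the implementation of LRF-ms; the function name `matching_statistics` and its parameters are hypothetical.

```python
def matching_statistics(pattern: str, reference: str) -> list[int]:
    """Naive matching statistics (lengths only).

    ms[i] is the length of the longest prefix of pattern[i:] that occurs
    as a substring of reference. Illustrative sketch, not the paper's method.
    """
    ms = []
    length = 0
    for i in range(len(pattern)):
        # If pattern[i-1 : i-1+length] occurs in the reference, then so does
        # pattern[i : i-1+length], hence ms[i] >= ms[i-1] - 1.
        length = max(length - 1, 0)
        # Extend the current match one character at a time.
        while (i + length < len(pattern)
               and pattern[i:i + length + 1] in reference):
            length += 1
        ms.append(length)
    return ms


if __name__ == "__main__":
    print(matching_statistics("banana", "ananas"))
```

Practical tools replace the substring test with an index over the reference; the sketch keeps Python's built-in `in` operator purely for readability.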
Product ID:
143408
IRIS handle:
11562/1147690
Last modified:
19 December 2024
Bibliographic citation:
Lipták, Zsuzsanna; Lucà, Martina; Masillo, Francesco; Puglisi, Simon J., Fast matching statistics for sets of long similar strings, in Proc. of the 27th Prague Stringology Conference (PSC 2024), Prague, Czech Republic, August 26-28, Proceedings of "Prague Stringology Conference", Prague, Czech Republic, 26.08.2024-27.08.2024, 2024, pp. 3-15