Indexing highly repetitive string collections, part II: Compressed indexes

G Navarro - ACM Computing Surveys (CSUR), 2021 - dl.acm.org
Two decades ago, a breakthrough in indexing string collections made it possible to
represent them within their compressed space while at the same time offering indexed …

Data compression for sequencing data

S Deorowicz, S Grabowski - Algorithms for Molecular Biology, 2013 - Springer
Post-Sanger sequencing methods produce tons of data, and there is a general agreement
that the challenge to store and process them must be addressed with data compression. In …

Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data

I Birol, A Raymond, SD Jackman, S Pleasance… - …, 2013 - academic.oup.com
White spruce (Picea glauca) is a dominant conifer of the boreal forests of North America, and
providing genomics resources for this commercially valuable tree will help improve forest …

BioBloom tools: fast, accurate and memory-efficient host species sequence screening using bloom filters

J Chu, S Sadeghi, A Raymond, SD Jackman… - …, 2014 - academic.oup.com
Large datasets can be screened for sequences from a specific organism, quickly and with
low memory requirements, by a data structure that supports time- and memory-efficient set …

A learned approach to design compressed rank/select data structures

A Boffa, P Ferragina, G Vinciguerra - ACM Transactions on Algorithms …, 2022 - dl.acm.org
We address the problem of designing, implementing, and experimenting with compressed
data structures that support rank and select queries over a dictionary of integers. We shine a …

Prefix-free parsing for building big BWTs

C Boucher, T Gagie, A Kuhnle, B Langmead… - Algorithms for Molecular …, 2019 - Springer
High-throughput sequencing technologies have led to explosive growth of genomic
databases; one of which will soon reach hundreds of terabytes. For many applications we …

Practical linear-time O(1)-workspace suffix sorting for constant alphabets

G Nong - ACM Transactions on Information Systems (TOIS), 2013 - dl.acm.org
This article presents an O(n)-time algorithm called SACA-K for sorting the suffixes of an
input string T[0, n-1] over an alphabet A[0, K-1]. The problem of sorting the suffixes of T is …

A survey of BWT variants for string collections

D Cenzato, Z Lipták - arXiv preprint arXiv:2202.13235, 2022 - arxiv.org
In recent years, the focus of bioinformatics research has moved from individual sequences to
collections of sequences. Given the fundamental role of the Burrows-Wheeler Transform …

Lightweight data indexing and compression in external memory

P Ferragina, T Gagie, G Manzini - Algorithmica, 2012 - Springer
In this paper we describe algorithms for computing the Burrows-Wheeler Transform (BWT)
and for building (compressed) indexes in external memory. The innovative feature of our …

Sketching and sublinear data structures in genomics

G Marçais, B Solomon, R Patro… - Annual Review of …, 2019 - annualreviews.org
Large-scale genomics demands computational methods that scale sublinearly with the
growth of data. We review several data structures and sketching techniques that have been …