Indexing highly repetitive string collections, part II: Compressed indexes

G Navarro - ACM Computing Surveys (CSUR), 2021 - dl.acm.org
Two decades ago, a breakthrough in indexing string collections made it possible to
represent them within their compressed space while at the same time offering indexed …

Fully functional suffix trees and optimal text searching in BWT-runs bounded space

T Gagie, G Navarro, N Prezza - Journal of the ACM (JACM), 2020 - dl.acm.org
Indexing highly repetitive texts—such as genomic databases, software repositories and
versioned text collections—has become an important problem since the turn of the …

Searching and indexing genomic databases via kernelization

T Gagie, SJ Puglisi - Frontiers in Bioengineering and Biotechnology, 2015 - frontiersin.org
The rapid advance of DNA sequencing technologies has yielded databases of thousands of
genomes. To search and index these databases effectively, it is important that we take …

An upper bound and linear-space queries on the LZ-End parsing

D Kempa, B Saha - Proceedings of the 2022 Annual ACM-SIAM …, 2022 - SIAM
Lempel–Ziv (LZ77) compression is the most commonly used lossless compression
algorithm. The basic idea is to greedily break the input string into blocks (called “phrases”) …

Efficient construction of a complete index for pan-genomics read alignment

A Kuhnle, T Mun, C Boucher, T Gagie… - Journal of …, 2020 - liebertpub.com
Short-read aligners predominantly use the FM-index, which is easily able to index one or a
few human genomes. However, it does not scale well to indexing collections of thousands of …

Towards pan-genome read alignment to improve variation calling

D Valenzuela, T Norri, N Välimäki, E Pitkänen… - BMC genomics, 2018 - Springer
Background: Typical human genome differs from the reference genome at 4-5 million sites.
This diversity is increasingly catalogued in repositories such as ExAC/gnomAD, consisting …

[BOOK] Genome-scale algorithm design: bioinformatics in the era of high-throughput sequencing

V Mäkinen, D Belazzougui, F Cunial, AI Tomescu - 2023 - books.google.com
Presenting the fundamental algorithms and data structures that power bioinformatics
workflows, this book covers a range of topics from the foundations of sequence analysis …

Sublinear time Lempel-Ziv (LZ77) factorization

J Ellert - International Symposium on String Processing and …, 2023 - Springer
Abstract: The Lempel-Ziv (LZ77) factorization of a string is a widely-used algorithmic tool that
plays a central role in data compression and indexing. For a length-n string over integer …

Indexes of large genome collections on a PC

A Danek, S Deorowicz, S Grabowski - PloS one, 2014 - journals.plos.org
The availability of thousands of individual genomes of one species should boost rapid
progress in personalized medicine or understanding of the interaction between genotype …

Founder reconstruction enables scalable and seamless pangenomic analysis

T Norri, B Cazaux, S Dönges, D Valenzuela… - …, 2021 - academic.oup.com
Motivation: Variant calling workflows that utilize a single reference sequence are the de facto
standard elementary genomic analysis routine for resequencing projects. Various ways to …