Indexing highly repetitive string collections, part II: Compressed indexes
G Navarro - ACM Computing Surveys (CSUR), 2021 - dl.acm.org
Two decades ago, a breakthrough in indexing string collections made it possible to
represent them within their compressed space while at the same time offering indexed …
Fully functional suffix trees and optimal text searching in BWT-runs bounded space
Indexing highly repetitive texts—such as genomic databases, software repositories and
versioned text collections—has become an important problem since the turn of the …
Searching and indexing genomic databases via kernelization
T Gagie, SJ Puglisi - Frontiers in Bioengineering and Biotechnology, 2015 - frontiersin.org
The rapid advance of DNA sequencing technologies has yielded databases of thousands of
genomes. To search and index these databases effectively, it is important that we take …
An upper bound and linear-space queries on the LZ-End parsing
Lempel–Ziv (LZ77) compression is the most commonly used lossless compression
algorithm. The basic idea is to greedily break the input string into blocks (called “phrases”) …
Efficient construction of a complete index for pan-genomics read alignment
Short-read aligners predominantly use the FM-index, which is easily able to index one or a
few human genomes. However, it does not scale well to indexing collections of thousands of …
Towards pan-genome read alignment to improve variation calling
Background Typical human genome differs from the reference genome at 4-5 million sites.
This diversity is increasingly catalogued in repositories such as ExAC/gnomAD, consisting …
[Book] Genome-scale algorithm design: bioinformatics in the era of high-throughput sequencing
Presenting the fundamental algorithms and data structures that power bioinformatics
workflows, this book covers a range of topics from the foundations of sequence analysis …
Sublinear time Lempel-Ziv (LZ77) factorization
J Ellert - International Symposium on String Processing and …, 2023 - Springer
Abstract The Lempel-Ziv (LZ77) factorization of a string is a widely-used algorithmic tool that
plays a central role in data compression and indexing. For a length-n string over integer …
Indexes of large genome collections on a PC
A Danek, S Deorowicz, S Grabowski - PloS one, 2014 - journals.plos.org
The availability of thousands of individual genomes of one species should boost rapid
progress in personalized medicine or understanding of the interaction between genotype …
Founder reconstruction enables scalable and seamless pangenomic analysis
Motivation Variant calling workflows that utilize a single reference sequence are the de facto
standard elementary genomic analysis routine for resequencing projects. Various ways to …