Data structures based on k-mers for querying large collections of sequencing data sets

C Marchet, C Boucher, SJ Puglisi, P Medvedev… - Genome …, 2021 - genome.cshlp.org
High-throughput sequencing data sets are usually deposited in public repositories (e.g., the
European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached …

Survey and taxonomy of lossless graph compression and space-efficient graph representations

M Besta, T Hoefler - arXiv preprint arXiv:1806.01799, 2018 - arxiv.org
Various graphs such as web or social networks may contain up to trillions of edges.
Compressing such datasets can accelerate graph processing by reducing the amount of I/O …

Succinct de Bruijn graphs

A Bowe, T Onodera, K Sadakane, T Shibuya - International workshop on …, 2012 - Springer
We propose a new succinct de Bruijn graph representation. If the de Bruijn graph of k-mers
in a DNA sequence of length N has m edges, it can be represented in 4m + o(m) bits. This is …

FMLRC: Hybrid long read error correction using an FM-index

JR Wang, J Holt, L McMillan, CD Jones - BMC bioinformatics, 2018 - Springer
Background: Long read sequencing is changing the landscape of genomic research,
especially de novo assembly. Despite the high error rate inherent to long read technologies …

Accurate self-correction of errors in long reads using de Bruijn graphs

L Salmela, R Walve, E Rivals, E Ukkonen - Bioinformatics, 2017 - academic.oup.com
Motivation: New long read sequencing technologies, like PacBio SMRT and Oxford
NanoPore, can produce sequencing reads up to 50 000 bp long but with an error rate of at …

Representation of k-Mer Sets Using Spectrum-Preserving String Sets

A Rahman, P Medvedev - Journal of Computational Biology, 2021 - liebertpub.com
Given the popularity and elegance of k-mer-based tools, finding a space-efficient way to
represent a set of k-mers is important for improving the scalability of bioinformatics analyses …

Indexing variation graphs

J Sirén - 2017 Proceedings of the nineteenth workshop on …, 2017 - SIAM
Variation graphs, which represent genetic variation within a population, are replacing
sequences as reference genomes. Path indexes are one of the most important tools for …

Data Structures to Represent a Set of k-long DNA Sequences

R Chikhi, J Holub, P Medvedev - ACM Computing Surveys (CSUR), 2021 - dl.acm.org
The analysis of biological sequencing data has been one of the biggest applications of
string algorithms. The approaches used in many such applications are based on the …

BLight: efficient exact associative structure for k-mers

C Marchet, M Kerbiriou, A Limasset - Bioinformatics, 2021 - academic.oup.com
Motivation: A plethora of methods and applications share the fundamental need to associate
information to words for high-throughput sequence analysis. Doing so for billions of k-mers …

Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era

R Rizzi, S Beretta, M Patterson, Y Pirola, M Previtali… - Quantitative …, 2019 - Springer
Background: De novo genome assembly relies on two kinds of graphs: de Bruijn graphs and
overlap graphs. Overlap graphs are the basis for the Celera assembler, while de Bruijn …