Data structures based on k-mers for querying large collections of sequencing data sets
High-throughput sequencing data sets are usually deposited in public repositories (eg, the
European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached …
European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached …
Survey and taxonomy of lossless graph compression and space-efficient graph representations
Various graphs such as web or social networks may contain up to trillions of edges.
Compressing such datasets can accelerate graph processing by reducing the amount of I/O …
Compressing such datasets can accelerate graph processing by reducing the amount of I/O …
Succinct de Bruijn graphs
A Bowe, T Onodera, K Sadakane, T Shibuya - International workshop on …, 2012 - Springer
We propose a new succinct de Bruijn graph representation. If the de Bruijn graph of k-mers
in a DNA sequence of length N has m edges, it can be represented in 4 m+ o (m) bits. This is …
in a DNA sequence of length N has m edges, it can be represented in 4 m+ o (m) bits. This is …
FMLRC: Hybrid long read error correction using an FM-index
Background Long read sequencing is changing the landscape of genomic research,
especially de novo assembly. Despite the high error rate inherent to long read technologies …
especially de novo assembly. Despite the high error rate inherent to long read technologies …
Accurate self-correction of errors in long reads using de Bruijn graphs
Motivation New long read sequencing technologies, like PacBio SMRT and Oxford
NanoPore, can produce sequencing reads up to 50 000 bp long but with an error rate of at …
NanoPore, can produce sequencing reads up to 50 000 bp long but with an error rate of at …
Representation of k-Mer Sets Using Spectrum-Preserving String Sets
A Rahman, P Medevedev - Journal of Computational Biology, 2021 - liebertpub.com
Given the popularity and elegance of k-mer-based tools, finding a space-efficient way to
represent a set of k-mers is important for improving the scalability of bioinformatics analyses …
represent a set of k-mers is important for improving the scalability of bioinformatics analyses …
Indexing variation graphs
J Sirén - 2017 Proceedings of the ninteenth workshop on …, 2017 - SIAM
Variation graphs, which represent genetic variation within a population, are replacing
sequences as reference genomes. Path indexes are one of the most important tools for …
sequences as reference genomes. Path indexes are one of the most important tools for …
Data Structures to Represent a Set of k-long DNA Sequences
R Chikhi, J Holub, P Medvedev - ACM Computing Surveys (CSUR), 2021 - dl.acm.org
The analysis of biological sequencing data has been one of the biggest applications of
string algorithms. The approaches used in many such applications are based on the …
string algorithms. The approaches used in many such applications are based on the …
BLight: efficient exact associative structure for k-mers
C Marchet, M Kerbiriou, A Limasset - Bioinformatics, 2021 - academic.oup.com
Motivation A plethora of methods and applications share the fundamental need to associate
information to words for high-throughput sequence analysis. Doing so for billions of k-mers …
information to words for high-throughput sequence analysis. Doing so for billions of k-mers …
Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era
Background De novo genome assembly relies on two kinds of graphs: de Bruijn graphs and
overlap graphs. Overlap graphs are the basis for the Celera assembler, while de Bruijn …
overlap graphs. Overlap graphs are the basis for the Celera assembler, while de Bruijn …