Data structures based on k-mers for querying large collections of sequencing data sets

C Marchet, C Boucher, SJ Puglisi, P Medvedev… - Genome …, 2021 - genome.cshlp.org
High-throughput sequencing data sets are usually deposited in public repositories (e.g., the
European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached …

Creating and using minimizer sketches in computational genomics

H Zheng, G Marçais, C Kingsford - Journal of Computational …, 2023 - liebertpub.com
Processing large data sets has become an essential part of computational genomics.
Greatly increased availability of sequence data from multiple sources has fueled …

Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs

G Holley, P Melsted - Genome biology, 2020 - Springer
Memory consumption of de Bruijn graphs is often prohibitive. Most de Bruijn graph-based
assemblers reduce the complexity by compacting paths into single vertices, but this is …

Sparse and skew hashing of k-mers

GE Pibiri - Bioinformatics, 2022 - academic.oup.com
Motivation: A dictionary of k-mers is a data structure that stores a set of n distinct k-mers and
supports membership queries. This data structure is at the heart of many important tasks in …

REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets

C Marchet, Z Iqbal, D Gautheret, M Salson… - …, 2020 - academic.oup.com
Motivation: In this work we present REINDEER, a novel computational method that performs
indexing of sequences and records their abundances across a collection of datasets. To the …

Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2

J Khan, M Kokot, S Deorowicz, R Patro - Genome biology, 2022 - Springer
The de Bruijn graph is a key data structure in modern computational genomics, and
construction of its compacted variant resides upstream of many genomic analyses. As the …

Representation of k-Mer Sets Using Spectrum-Preserving String Sets

A Rahman, P Medvedev - Journal of Computational Biology, 2021 - liebertpub.com
Given the popularity and elegance of k-mer-based tools, finding a space-efficient way to
represent a set of k-mers is important for improving the scalability of bioinformatics analyses …

Small Searchable κ-Spectra via Subset Rank Queries on the Spectral Burrows-Wheeler Transform

JN Alanko, SJ Puglisi, J Vuohtoniemi - SIAM Conference on Applied and …, 2023 - SIAM
The κ-spectrum of a string is the set of all distinct substrings of length κ occurring in the
string. This is a lossy but computationally convenient representation of the information in the …

Conway–Bromage–Lyndon (CBL): an exact, dynamic representation of k-mer sets

I Martayan, B Cazaux, A Limasset, C Marchet - Bioinformatics, 2024 - academic.oup.com
In this article, we introduce the Conway–Bromage–Lyndon (CBL) structure, a compressed,
dynamic and exact method for representing k-mer sets. Originating from Conway and …

Efficient minimizer orders for large values of k using minimum decycling sets

D Pellow, L Pu, B Ekim, L Kotlar, B Berger… - Genome …, 2023 - genome.cshlp.org
Minimizers are ubiquitously used in data structures and algorithms for efficient searching,
mapping, and indexing of high-throughput DNA sequencing data. Minimizer schemes select …