Data structures based on k-mers for querying large collections of sequencing data sets

C Marchet, C Boucher, SJ Puglisi, P Medvedev… - Genome …, 2021 - genome.cshlp.org
High-throughput sequencing data sets are usually deposited in public repositories (e.g., the
European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached …

Creating and using minimizer sketches in computational genomics

H Zheng, G Marçais, C Kingsford - Journal of Computational …, 2023 - liebertpub.com
Processing large data sets has become an essential part of computational genomics.
Greatly increased availability of sequence data from multiple sources has fueled …

Effective sequence similarity detection with strobemers

K Sahlin - Genome Research, 2021 - genome.cshlp.org
k-mer-based methods are widely used in bioinformatics for various types of sequence
comparisons. However, a single mutation will mutate k consecutive k-mers and make most k …

Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2

J Khan, M Kokot, S Deorowicz, R Patro - Genome Biology, 2022 - Springer
The de Bruijn graph is a key data structure in modern computational genomics, and
construction of its compacted variant resides upstream of many genomic analyses. As the …

Lossless indexing with counting de Bruijn graphs

M Karasikov, H Mustafa, G Rätsch, A Kahles - Genome Research, 2022 - genome.cshlp.org
Sequencing data are rapidly accumulating in public repositories. Making this resource
accessible for interactive analysis at scale requires efficient approaches for its storage and …

Kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections

T Lemane, P Medvedev, R Chikhi… - Bioinformatics …, 2022 - academic.oup.com
When indexing large collections of short-read sequencing data, a common operation that
has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI) is …

Simplitigs as an efficient and scalable representation of de Bruijn graphs

K Břinda, M Baym, G Kucherov - Genome Biology, 2021 - Springer
Abstract: de Bruijn graphs play an essential role in bioinformatics, yet they lack a universal
scalable representation. Here, we introduce simplitigs as a compact, efficient, and scalable …

BLight: efficient exact associative structure for k-mers

C Marchet, M Kerbiriou, A Limasset - Bioinformatics, 2021 - academic.oup.com
Motivation: A plethora of methods and applications share the fundamental need to associate
information to words for high-throughput sequence analysis. Doing so for billions of k-mers …

Conway–Bromage–Lyndon (CBL): an exact, dynamic representation of k-mer sets

I Martayan, B Cazaux, A Limasset, C Marchet - Bioinformatics, 2024 - academic.oup.com
In this article, we introduce the Conway–Bromage–Lyndon (CBL) structure, a compressed,
dynamic and exact method for representing k-mer sets. Originating from Conway and …

MetaProFi: an ultrafast chunked Bloom filter for storing and querying protein and nucleotide sequence data for accurate identification of functionally relevant genetic …

SK Srikakulam, S Keller, F Dabbaghie, R Bals… - …, 2023 - academic.oup.com
Motivation: Bloom filters are a popular data structure that allows rapid searches in large
sequence datasets. So far, all tools work with nucleotide sequences; however, protein …