Data structures based on k-mers for querying large collections of sequencing data sets
High-throughput sequencing data sets are usually deposited in public repositories (eg, the
European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached …
Creating and using minimizer sketches in computational genomics
Processing large data sets has become an essential part of computational genomics.
Greatly increased availability of sequence data from multiple sources has fueled …
Effective sequence similarity detection with strobemers
K Sahlin - Genome research, 2021 - genome.cshlp.org
k-mer-based methods are widely used in bioinformatics for various types of sequence
comparisons. However, a single mutation will mutate k consecutive k-mers and make most k …
Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2
The de Bruijn graph is a key data structure in modern computational genomics, and
construction of its compacted variant resides upstream of many genomic analyses. As the …
Lossless indexing with counting de Bruijn graphs
Sequencing data are rapidly accumulating in public repositories. Making this resource
accessible for interactive analysis at scale requires efficient approaches for its storage and …
Kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections
When indexing large collections of short-read sequencing data, a common operation that
has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI) is …
Simplitigs as an efficient and scalable representation of de Bruijn graphs
Abstract: de Bruijn graphs play an essential role in bioinformatics, yet they lack a universal
scalable representation. Here, we introduce simplitigs as a compact, efficient, and scalable …
BLight: efficient exact associative structure for k-mers
C Marchet, M Kerbiriou, A Limasset - Bioinformatics, 2021 - academic.oup.com
Motivation: A plethora of methods and applications share the fundamental need to associate
information to words for high-throughput sequence analysis. Doing so for billions of k-mers …
Conway–Bromage–Lyndon (CBL): an exact, dynamic representation of k-mer sets
In this article, we introduce the Conway–Bromage–Lyndon (CBL) structure, a compressed,
dynamic and exact method for representing k-mer sets. Originating from Conway and …
MetaProFi: an ultrafast chunked Bloom filter for storing and querying protein and nucleotide sequence data for accurate identification of functionally relevant genetic …
Motivation: Bloom filters are a popular data structure that allows rapid searches in large
sequence datasets. So far, all tools work with nucleotide sequences; however, protein …