Catwalk: identifying closely related sequences in large microbial sequence databases
Volk, Denis, Yang-Turner, Fan, Didelot, Xavier, Crook, Derrick W. and Wyllie, David (2022) Catwalk: identifying closely related sequences in large microbial sequence databases. Microbial Genomics, 8 (6). 000850. doi:10.1099/mgen.0.000850 ISSN 2057-5858.
PDF: WRAP-Catwalk-identifying-closely-related-sequences-in-large-microbial-sequence-databases-Didelot-22.pdf - Published Version (951Kb). Available under License Creative Commons Attribution 4.0.
Official URL: http://dx.doi.org/10.1099/mgen.0.000850
Abstract
There is a need to identify microbial sequences that may form part of transmission chains, or that may represent importations across national boundaries, amidst large numbers of SARS-CoV-2 and other viral or bacterial sequences. Reference-based compression is a sequence analysis technique that allows both compact storage of sequence data and comparisons between sequences. Published implementations of the approach are being challenged by the large sample collections now being generated. Our aim was to develop fast software for detecting highly similar sequences in large collections of microbial genomes, including millions of SARS-CoV-2 genomes. To do so, we developed Catwalk, a tool that bypasses bottlenecks in the generation, comparison and in-memory storage of microbial genomes generated by reference mapping. It is a compiled solution, coded in Nim to increase performance. It can be accessed via command-line, REST API or web server interfaces. We tested Catwalk using both SARS-CoV-2 and Mycobacterium tuberculosis genomes generated by prospective public-health sequencing programmes. Pairwise sequence comparisons, using clinically relevant similarity cut-offs, took about 0.39 and 0.66 μs, respectively; in 1 s, between 1 and 2 million sequences can be searched. Catwalk operates about 1700 times faster than, and uses about 8% of the RAM of, a Python reference-based compression and comparison tool in current use for outbreak detection. Catwalk can rapidly identify close relatives of a SARS-CoV-2 or M. tuberculosis genome amidst millions of samples.
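The general idea of reference-based compression mentioned in the abstract can be illustrated with a minimal Python sketch. This is not Catwalk's implementation (which is written in Nim) and does not use its interfaces; the function names `compress`, `snp_distance` and `neighbours`, the treatment of `N` as an unknown base, and the toy sequences are all assumptions made for illustration. Each reference-mapped sample is stored as only the positions where it differs from the reference, and two samples are compared by counting positions where their stored differences disagree, skipping positions that are unknown in either sample.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical compressed form: (variant positions -> base, unknown positions).
Compressed = Tuple[Dict[int, str], Set[int]]


def compress(reference: str, sample: str) -> Compressed:
    """Keep only the positions where a reference-mapped sample differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("sample must be the same length as the reference")
    variants: Dict[int, str] = {}
    unknown: Set[int] = set()
    for pos, (ref_base, base) in enumerate(zip(reference, sample)):
        if base == "N":          # assumption: 'N' marks an uncalled base
            unknown.add(pos)
        elif base != ref_base:
            variants[pos] = base
    return variants, unknown


def snp_distance(a: Compressed, b: Compressed) -> int:
    """Count positions where two compressed samples differ, ignoring unknown positions."""
    variants_a, unknown_a = a
    variants_b, unknown_b = b
    ignored = unknown_a | unknown_b
    distance = 0
    # Only positions where at least one sample differs from the reference can contribute.
    for pos in variants_a.keys() | variants_b.keys():
        if pos in ignored:
            continue
        # A missing key means the sample carries the reference base at that position.
        if variants_a.get(pos) != variants_b.get(pos):
            distance += 1
    return distance


def neighbours(query: Compressed, database: Dict[str, Compressed], cutoff: int) -> List[str]:
    """Return the names of stored samples within `cutoff` differences of the query."""
    return sorted(name for name, stored in database.items()
                  if snp_distance(query, stored) <= cutoff)


# Toy data, invented for this example.
reference = "ACGTACGTAC"
database = {
    "s1": compress(reference, "ACGTACGTAC"),
    "s2": compress(reference, "ACCTACGTAC"),
    "s3": compress(reference, "ANCTACGAAC"),
}
close_to_s2 = neighbours(database["s2"], database, cutoff=1)
```

Because closely related genomes differ from the reference at few positions, both the stored data and the work per comparison scale with the number of differences rather than with genome length, which is what makes this representation attractive for large collections.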
Item Type: | Journal Article
---|---
Subjects: | Q Science > QH Natural history; R Medicine > RA Public aspects of medicine
Divisions: | Faculty of Science, Engineering and Medicine > Science > Life Sciences (2010- ); Faculty of Science, Engineering and Medicine > Science > Statistics
Library of Congress Subject Headings (LCSH): | Bacterial genomes, Epidemics -- Transmission -- Computer simulation, Genomes -- Analysis
Journal or Publication Title: | Microbial Genomics
Publisher: | Microbiology Society
ISSN: | 2057-5858
Official Date: | 30 June 2022
Volume: | 8
Number: | 6
Number of Pages: | 6
Article Number: | 000850
DOI: | 10.1099/mgen.0.000850
Status: | Peer Reviewed
Publication Status: | Published
Access rights to Published version: | Open Access (Creative Commons)
Date of first compliant deposit: | 11 July 2022
Date of first compliant Open Access: | 11 July 2022