Indexed reference databases for KMA and CCMetagen

Rossetto Marcelino, Vanessa; Buchmann, Jan; Clausen, Philip

Type

Dataset

Author/s

Rossetto Marcelino, Vanessa
Buchmann, Jan
Clausen, Philip

Abstract

This database was built to identify taxa in metagenome samples using the CCMetagen pipeline. The whole NCBI nt collection allows a complete taxonomic overview, including microbial eukaryotes that may be present in the dataset. The database is already indexed and ready to use with KMA and CCMetagen.

A manual describing how to use this dataset can be found at: https://github.com/vrmarcelino/CCMetagen
A tutorial covering the whole analysis of a set of metatranscriptome samples can be found at: https://github.com/vrmarcelino/CCMetagen/tree/master/tutorial

The database was built as follows. The partially non-redundant nucleotide database was downloaded from the NCBI website (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz) in January 2018. It was then formatted to include taxids in the sequence headers. Indexing was performed with KMA using the command:

    kma_index -i nt_taxid.fas -o ncbi_nt -NI -Sparse TG

Three indexed databases are provided:
1 - NCBI nucleotide collection
2 - RefSeq database of bacterial and fungal genomes
3 - NCBI nucleotide collection without environmental, unclassified or artificial sequences (added in V 2.0, described below)

V 1.0 - Initial upload.
V 2.0 - Addition of a filtered NCBI database. The NCBI nucleotide collection contains many environmental and artificial sequence entries without taxonomic information (e.g. uncultured marine bacteria), so we compiled a database without them. The file ncbi_nt_no_env_11jun2019.zip therefore contains all NCBI nt entries except the descendants of environmental eukaryotes (taxid 61964), environmental prokaryotes (taxid 48479), unclassified sequences (taxid 12908) and artificial sequences (taxid 28384).
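The exclusion rule used for the filtered database can be illustrated with a short Python sketch. It is not the authors' script: it only shows one way to decide whether a taxid falls under one of the four excluded taxids by walking up a child-to-parent map such as the one that can be read from the NCBI taxonomy dump. The function names, the toy parent map and the taxid 900001 are hypothetical, and the sketch assumes the excluded taxids themselves are dropped along with their descendants.

```python
from typing import Dict, Iterable, List, Set

# Taxids named in the abstract: environmental eukaryotes, environmental
# prokaryotes, unclassified sequences and artificial sequences.
EXCLUDED_ROOTS: Set[int] = {61964, 48479, 12908, 28384}


def is_excluded(taxid: int, parent_of: Dict[int, int],
                excluded: Set[int] = EXCLUDED_ROOTS) -> bool:
    """Return True if taxid is, or descends from, an excluded taxid."""
    seen: Set[int] = set()
    current = taxid
    while current not in seen:
        if current in excluded:
            return True
        seen.add(current)
        parent = parent_of.get(current)
        # Stop at the root (which is its own parent) or at an unknown taxid.
        if parent is None or parent == current:
            return False
        current = parent
    return False  # a cycle in the map; treat as not excluded


def keep_taxids(taxids: Iterable[int], parent_of: Dict[int, int]) -> List[int]:
    """Keep only the taxids that do not fall under an excluded taxid."""
    return [t for t in taxids if not is_excluded(t, parent_of)]


# Toy, simplified parent map for illustration only (not real lineage data).
# 900001 is a made-up taxid placed under environmental prokaryotes.
TOY_PARENTS: Dict[int, int] = {
    1: 1,          # root is its own parent
    2: 1,          # Bacteria (intermediate ranks omitted)
    562: 2,        # Escherichia coli (intermediate ranks omitted)
    48479: 1,      # environmental prokaryotes (placement simplified)
    900001: 48479,
    28384: 1,      # artificial sequences (placement simplified)
}

kept = keep_taxids([562, 900001, 28384], TOY_PARENTS)
```

With the toy map, only the E. coli taxid survives the filter. For the real collection the parent map would need to come from the full taxonomy, and each sequence header would need to be parsed for its taxid; neither step is shown here because the header layout is not specified in this record.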

Date

2019-04-30

Publisher

The University of Sydney

Licence

Creative Commons Attribution-NonCommercial-ShareAlike 4.0

Faculty/School

Faculty of Medicine and Health, Sydney Medical School

Department, Discipline or Centre

Marie Bashir Institute for Infectious Diseases and Biosecurity

Subjects

metagenomics
metatranscriptomics
