Indexed reference databases for KMA and CCMetagen
Type
Dataset
Abstract
This database was built to identify taxa in metagenome samples using the CCMetagen pipeline. The whole NCBI nt collection allows a complete taxonomic overview, including microbial eukaryotes that may be present in the dataset. This database is already indexed and ready to use with KMA and CCMetagen.

A manual describing how to use this dataset can be found at: https://github.com/vrmarcelino/CCMetagen
Additionally, a tutorial covering the whole analysis of a set of metatranscriptome samples can be found at: https://github.com/vrmarcelino/CCMetagen/tree/master/tutorial

The database was built as follows: The partially non-redundant nucleotide database was downloaded from the NCBI website (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz) in January 2018. This database was formatted to include taxids in sequence headers. Indexing was then performed with KMA using the command:
kma_index -i nt_taxid.fas -o ncbi_nt -NI -Sparse TG

Three indexed databases are provided:
1 - NCBI nucleotide collection
2 - RefSeq database of bacterial and fungal genomes
3 - NCBI nucleotide collection without environmental or artificial sequences (added in V 2.0, described below)

V 1.0 - Initial upload.
V 2.0 - Addition of NCBI database: The NCBI nucleotide collection contains many environmental and artificial sequence entries without taxonomic information (e.g. uncultured marine bacteria). We therefore compiled a database without those. The file ncbi_nt_no_env_11jun2019.zip therefore contains all NCBI nt entries excluding the descendants of environmental eukaryotes (taxid 61964), environmental prokaryotes (48479), unclassified sequences (12908) and artificial sequences (28384).
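The exclusion rule described for V 2.0 (drop every entry whose taxid descends from one of four root taxids) can be illustrated with a short Python sketch. This is not the script used to build the database, which is not described here. It assumes a child-to-parent taxid mapping is already available, for example one derived from the NCBI taxonomy dump. The names is_excluded, keep_taxids and TOY_PARENT_OF, and the taxids 900001 to 900003, are hypothetical and exist only for illustration.

```python
from typing import Dict, Iterable, List, Set

# Root taxids named in the dataset description as the excluded groups.
EXCLUDED_ROOTS: Set[int] = {61964, 48479, 12908, 28384}


def is_excluded(taxid: int, parent_of: Dict[int, int],
                excluded_roots: Set[int] = EXCLUDED_ROOTS) -> bool:
    """Return True if taxid is, or descends from, one of excluded_roots."""
    seen: Set[int] = set()
    current = taxid
    # Walk up the lineage until an excluded root, an unknown parent,
    # or a node already visited (the root of the tree points to itself).
    while current not in seen:
        if current in excluded_roots:
            return True
        seen.add(current)
        parent = parent_of.get(current)
        if parent is None:
            return False
        current = parent
    return False


def keep_taxids(taxids: Iterable[int], parent_of: Dict[int, int],
                excluded_roots: Set[int] = EXCLUDED_ROOTS) -> List[int]:
    """Keep only taxids that fall outside every excluded lineage."""
    return [t for t in taxids if not is_excluded(t, parent_of, excluded_roots)]


# Toy child -> parent map. This is NOT the real NCBI lineage; the ids
# 900001-900003 are made up for the example.
TOY_PARENT_OF: Dict[int, int] = {
    1: 1,            # root points to itself
    2: 1,
    48479: 2,        # excluded root placed under node 2 in this toy tree
    900001: 48479,   # hypothetical entry below an excluded root
    900002: 2,       # hypothetical entry outside the excluded groups
    900003: 900001,  # hypothetical deeper descendant of an excluded root
}

if __name__ == "__main__":
    print(keep_taxids([900001, 900002, 900003, 48479], TOY_PARENT_OF))
```

In a real run the mapping would cover the full taxonomy, and the same test would be applied to the taxid carried in each sequence header before indexing.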
Date
2019-04-30
Publisher
The University of Sydney
Licence
Creative Commons Attribution-NonCommercial-ShareAlike 4.0
Faculty/School
Faculty of Medicine and Health, Sydney Medical School
Department, Discipline or Centre
Marie Bashir Institute for Infectious Diseases and Biosecurity