Filtered NCBI-nt in FASTA Format

Filtered NT dataset is generated by excluding sequences from the whole nt file provided by NCBI, based on whether they have unwanted taxonomy names or any child taxonomy name of these unwanted ones. These unwanted taxonomy names are listed in the black list generated by two steps: (1) Getting all taxonomy names which contain the strings listed below (Step 3); (2) Getting all possible child taxonomy names of each of the taxonomy names from (1). For example, “other sequences” (taxId: 28384) is excluded with all its child taxonomy names including “artificial sequence”, “vector”, “synthetic”, and so on.

We have chosen to apply the Creative Commons Attribution 3.0 Unsupported License to this version of the software.

Version	Downloadable Files	File Size	Release Notes	NCBI Download Date
Version 6.0	Filtered_NT v6.0	168G	Release Notes v6	July 2018
Version 5.0	Filtered_NT v5.0	131G	Release Notes v5.0	May 2017
Version 4.0	Filtered_NT v4.0	110G	Release Notes v4.0	July 2016

Summary of the protocol:

Step 1. Download the whole nt file

downloaded from: ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ version: 5/21/2017 command: wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz gunzip nt.gz (42,439,338 rows)

Step 2. Download the taxonomy list

downloaded from: ftp://ftp.ncbi.nih.gov/pub/taxonomy/ version: 5/21/2017; 5/30/2017 command: wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/*.gz gunzip *.gz wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz gunzip taxdump.tar.gz |tar -xvf location: /data/projects/targetdbs/downloads/

Step 3. Generate black list

protocol: unwanted taxonomy names (scientific names) from names.dmp and all child taxonomy names of them, include: [‘unclassified’,’unidentified’,’uncultured’,’unspecified’,’unknown’,’phage’,’vector’] [‘environmental sample’,’artificial sequence’,’other sequence’]

There are two steps for generating the black list, first is to get all taxonomy names with the strings above, and then to get all child taxonomy names of them.
script: /projects/targetdbs/scripts/get-parent-taxid-of-blacklist.py /projects/targetdbs/scripts/get-child-taxid-of-blacklist.py
output: /data/projects/targetdbs/generated/blacklist-taxId.1.csv /data/projects/targetdbs/generated/blacklist-taxId.2.csv

After generating blacklist-taxId.2.txt, use command line “sort -u” to delete duplicated records, and store them into: /data/projects/targetdbs/generated/blacklist-taxId.unique.csv
QC script: /projects/targetdbs/scripts/compare-old-new-blacklist.py Compare the newly generated with the older version.

Step 4. Check the completion of taxonomy list (QC)

protocol: First check if all seqAcs in nt file have taxIds from nucl_gb.accession2taxid file, and the ones do not have taxIds are checked in all other ac2taxid files.
script: /projects/targetdbs/scripts/check-ac2taxid-completion-step1.py /projects/targetdbs/scripts/check-ac2taxid-completion-step2.py /projects/targetdbs/scripts/check-ac2taxid-completion-step3.py
output: /data/projects/targetdbs/generated/logfile.step1.txt /data/projects/targetdbs/generated/logfile.step2.txt /data/projects/targetdbs/generated/logfile.step3.txt

This step needs a lot of memory. Suggest to run on large machine. 123 records of PDB accessions have extra characters, fixed that in step3.py. However, 28 records are not in the files, search taxIds manually for them (/data/projects/targetdbs/generated/logfile.step3.manually.added.txt).

Step 5. Get the seqAc-taxonomy list

protocol: Exclude those taxIds in the blacklist. And first get all seqAc-taxIds from nucl_gb.accession2taxid, and all of other ac2taxid files from both version 05/21/2017 and 05/30/2017.
script: /projects/targetdbs/scripts/get-seqac2taxid.py
output: /data/projects/targetdbs/generated/logfile.ac2taxid.list.txt
QC step: All seqAcs in nt files are mapped to at least one taxId. The number of seqAcs in the list matches the one in nt file. SeqAcs with multiple taxIds are listed in: /data/projects/targetdbs/generated/seqAc-with-multiple-taxids.txt

Step 6. Filtering nt file

protocol: Remember to add those manually added ac2taxids.
script: /projects/targetdbs/scripts/filter-nt.py
output: /data/projects/targetdbs/generated/filtered_nt_Jun06-2017.fasta
QC script: /projects/targetdbs/scripts/check-removed-seqacs-count.py