Creating a large custom database from NCBI refseq #10
Comments
Adding the top few lines from the input TSV here for your reference:
Hi @JensUweUlrich, after building the custom database successfully, the search command runs fine but the profile command fails.
The failure log:
@humbleflowers
The search file is 712 MB in size. I can provide the file. How can I share it with you?
Could you upload the file to the following Dropbox folder?:
Hello @JensUweUlrich, Dropbox is restricted in my organisation. I am exploring other possibilities with internal data-sharing software. Could you share your email address?
Hello @JensUweUlrich, it's not possible to share the data using Dropbox. I will have to use my organisation's Microsoft OneDrive. It would be great if you could share an email address to facilitate this. Sorry for the inconvenience.
Thanks for your support. You can use my Gmail address, which is [email protected]
Hello @JensUweUlrich, due to my organization's policies I could only share it with your official email address from the publication, jens-uwe.ulrich[at]hpi.de. Thank you.
Thanks, I successfully downloaded the file and found the issue immediately.
At the very end of that line, you can see the taxids of the full lineage. Here, the number of taxids is greater than the number of taxon strings in the field before. The problem is that taxids for subphylum, clade, subkingdom, etc. are included in the RefSeq accession file, which is used for building the index. Did you use one of the prebuilt indexes I provided, or did you create your own?
Hello @JensUweUlrich, do I need to clean the input TSV file so that it only includes taxids at species level?
Yes, you need to have exactly the same number of taxon ids and taxon names (the last two columns) in the input file describing your reference database. For ease of use, I would stick to the classical ranks: species, genus, family, order, class, phylum, kingdom.
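To find the offending lines before rebuilding, a quick check of the input file can help. The following is only a minimal sketch, not part of taxor: it assumes the file is tab-separated, that the taxon names and taxon ids are the last two fields of each line, and that entries inside those fields are separated by `;`. The function name `find_lineage_mismatches` and the sample rows are hypothetical.

```python
def find_lineage_mismatches(tsv_lines, sep=";"):
    """Return (line_number, n_names, n_taxids) for every line whose last two
    tab-separated fields hold a different number of entries.

    Assumption: taxon names are in the second-to-last field, taxon ids in the
    last field, and entries within each field are separated by `sep`.
    """
    mismatches = []
    for line_number, line in enumerate(tsv_lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Drop empty entries so a trailing separator is not counted.
        names = [entry for entry in fields[-2].split(sep) if entry]
        taxids = [entry for entry in fields[-1].split(sep) if entry]
        if len(names) != len(taxids):
            mismatches.append((line_number, len(names), len(taxids)))
    return mismatches


# Hypothetical rows: the second one carries one taxid more than taxon names.
sample = [
    "acc_1\tk__A;p__B;c__C\t1;2;3",
    "acc_2\tk__A;p__B;c__C\t1;2;9;3",
]
for line_number, n_names, n_taxids in find_lineage_mismatches(sample):
    print(f"line {line_number}: {n_names} taxon names vs {n_taxids} taxids")
```

For a real file, the same function can be given an open file handle, for example `find_lineage_mismatches(open("taxor_input.tsv"))`. Any line it reports would need its extra ranks (subphylum, clade, subkingdom, etc.) removed from the taxid field so that both fields line up again.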
Thank you @JensUweUlrich, I will give it a try and get back to you.
Hello @JensUweUlrich, what is the correct way to add this line to the input file?
In a nutshell, I want to know how to represent a subspecies like the one in the example above.
@humbleflowers
Hello, I am trying to build a database for taxor using NCBI RefSeq sequences from Prokaryotes, Archaea, Bacteria, Viruses and Fungi (almost 50k organisms in total).
taxor build --input-file ../../taxor/taxor_input.tsv --input-sequence-dir . --output-filename refseq-PBFAV-VK --kmer-size 22 --syncmer-size 12 --threads 30
I get this error:
Thank you for the tool, the initial results are really good. @JensUweUlrich