Skip to content

Fetch taxonomic information from Entrez using a list of TaxIDs and visualize user-selected taxonomic ranks with SankeyMATIC.

License

Notifications You must be signed in to change notification settings

justin-tpb/TaxIDs-to-SankeyMATIC

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 

Repository files navigation

TaxIDs to SankeyMATIC

This script fetches taxonomic information from the Entrez database using a list of TaxIDs and outputs user-selected taxonomic ranks to a file. A ready-to-use code of the taxonomic distribution is then generated for visualization with SankeyMATIC.

Features

  • Fetch taxonomic information from Entrez for specified TaxIDs.
  • Output the fetched data with user-selectable taxonomic ranks.
  • Generate code compatible with SankeyMATIC for visualizing the taxonomic distribution of the TaxIDs, sorted by hierarchy and count.
  • Optional grouping of taxa with low counts to reduce clutter in the Sankey diagram.

Prerequisites

Before running this script, it is necessary to have the following installed:

  • Python 3.6 or later
  • pandas and biopython libraries. To install these libraries, execute the following command:
pip install pandas biopython

Usage

python taxids2sankey.py <input_file> [options]

or

./taxids2sankey.py <input_file> [options]

Arguments

  • <input_file>: File containing a column of TaxIDs, one per line.
    • The first line must contain column headers.
    • The delimiter will be automatically detected.

Options

  • -c, --header <name>: Header of the column containing the TaxIDs.
    • Default is #Taxid from the BLAST text output.
  • -t, --tax_ranks <taxonomic_ranks>: Comma-separated list of taxonomic ranks.
    • Default is class,order,genus.
  • -e, --email <address>: Email for identification by Entrez.
    • Will be saved to entrez_config.ini for future use.
    • Entrez will show a warning without an email and might block access in case of excessive usage.
  • -g, --group <threshold>: Group ranks below this threshold for less cluttered Sankey diagrams.
    • Groups will be named <parent_rank> (grouped).
    • Default is no grouping.
  • -s, --skip: Skip the generation of SankeyMATIC compatible code (only output taxonomic information).
  • -h, --help: Show this help message.

Examples

Running the script with the default parameters:

python taxids2sankey.py example.csv -e [email protected]

Running the script after an email was saved to entrez_config.ini:

python taxids2sankey.py example.csv

Specifying a custom column header for the TaxIDs and skipping SankeyMATIC code generation:

python taxids2sankey.py example.csv -c TaxIDs -s

Specifying custom taxonomic ranks and enabling grouping for SankeyMATIC:

python taxids2sankey.py example.csv -t phylum,class,order -g 10

Note

Entrez will sometimes cause an HTTP Error 400 Bad Request error. If this happens, just try again after a few seconds.

Output

  • A <input_filename>.taxonomy.csv file containing the input TaxIDs along with the fetched taxonomic information.
  • If not skipped, a <input_filename>.sankey.txt file with SankeyMATIC code based on the taxonomic distribution.
    • The code will be sorted by taxonomic hierarchy and count.
    • If grouping is enabled, the grouping threshold will be added to the filename.
    • The file content can be copied to the input field of SankeyMATIC to generate a Sankey diagram.

Author

Justin Teixeira Pereira Bassiaridis

License

Distributed under the MIT License. See LICENSE for more information.

About

Fetch taxonomic information from Entrez using a list of TaxIDs and visualize user-selected taxonomic ranks with SankeyMATIC.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages