slideDupIdentify

slideDupIdentify.py is a Python script for identifying and managing duplicate whole-slide image (WSI) files. It selects files by study type and stain, then ranks the duplicates of each sample according to a fixed prioritization. The script writes a CSV report with metadata on the identified duplicates and logs the whole process, so it is clear which files were kept, moved, or skipped.

Options and Arguments:

| Option | Description |
| --- | --- |
| `--image-folder, -i` | Optional. The folder where the input images are located. Defaults to the current directory if not specified. |
| `--study-type, -t` | Required. The study type prefix (e.g., AE). Files must start with this prefix to be processed. |
| `--stain, -s` | Required. The stain name (e.g., CD34). Files must contain this string to be processed. |
| `--out-file, -o` | Required. The output CSV file name (without extension) for saving duplicate information. |
| `--force, -f` | Optional. Overwrite the output file if it already exists. |
| `--dry-run, -d` | Optional. Perform a dry run in which no file operations are carried out. Actions are reported to the terminal. |
| `--debug, -D` | Optional. Print debug information for troubleshooting. |
| `--verbose, -v` | Optional. Show details of all duplicate samples identified. |
| `--help, -h` | Optional. Show help message and usage instructions. |
| `--version, -V` | Optional. Print the script version and exit. |
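
As an illustration of how the options above fit together, the following is a minimal `argparse` sketch that mirrors the documented flags. It is not the script's actual source: the function name `build_parser`, the help texts, and the version string are placeholders chosen for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options (not the real source)."""
    parser = argparse.ArgumentParser(
        prog="slideDupIdentify.py",
        description="Identify and manage duplicate WSI files.",
    )
    # Optional input folder; "." stands for the current directory.
    parser.add_argument("-i", "--image-folder", default=".",
                        help="folder containing the input images")
    # Required selection criteria and output name.
    parser.add_argument("-t", "--study-type", required=True,
                        help="study type prefix, e.g. AE")
    parser.add_argument("-s", "--stain", required=True,
                        help="stain name, e.g. CD34")
    parser.add_argument("-o", "--out-file", required=True,
                        help="output CSV file name without extension")
    # Boolean switches; note that -d/-D and -v/-V are distinct flags.
    parser.add_argument("-f", "--force", action="store_true",
                        help="overwrite the output file if it exists")
    parser.add_argument("-d", "--dry-run", action="store_true",
                        help="report actions without performing them")
    parser.add_argument("-D", "--debug", action="store_true",
                        help="print debug information")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="show details of all duplicates")
    # The version string is a placeholder, not the real version.
    parser.add_argument("-V", "--version", action="version",
                        version="%(prog)s version-placeholder")
    return parser  # -h/--help is added by argparse automatically


args = build_parser().parse_args(
    ["-t", "AE", "-s", "CD34", "-o", "duplicates_report", "--dry-run"]
)
print(args.image_folder, args.study_type, args.stain, args.dry_run)
```

With these arguments, the image folder falls back to the current directory and only the dry-run switch is set.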

Example Usage:

python slideDupIdentify.py --image-folder /path/to/images \
                           --study-type AE \
                           --stain CD34 \
                           --out-file duplicates_report \
                           --verbose

In this example, the script will:

- Search `/path/to/images` for files related to the AE study type and CD34 stain.
- Identify duplicates and save the information to the CSV file `duplicates_report.AE.CD34.metadata.csv`.
- Log details about duplicate identification and prioritization.
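
The file selection and report naming in this example can be sketched as follows. This is an assumption-laden illustration, not the script's code: the helper names `matches` and `report_name` and the sample file names are hypothetical, the match is assumed to be case-sensitive, and the naming pattern is inferred only from the example output name above.

```python
def matches(filename: str, study_type: str, stain: str) -> bool:
    """True if the name starts with the study type prefix and contains the stain."""
    return filename.startswith(study_type) and stain in filename


def report_name(out_file: str, study_type: str, stain: str) -> str:
    """Build the report name; pattern inferred from the documented example."""
    return f"{out_file}.{study_type}.{stain}.metadata.csv"


# Hypothetical file names, for illustration only.
names = ["AE1234.CD34.ndpi", "AE1234.HE.ndpi", "XX0001.CD34.TIF"]
selected = [n for n in names if matches(n, "AE", "CD34")]

print(selected)
print(report_name("duplicates_report", "AE", "CD34"))
```

Only the first name passes both checks: the second lacks the stain and the third lacks the study type prefix.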

Duplicate Identification and Prioritization:

The script uses a structured process to prioritize duplicates based on:

1. Preferred file type: .ndpi files are preferred over .TIF.
2. Creation date: The latest file is preferred.
3. Checksum and size: If files have the same type and creation date, the largest file is preferred.

For each study number and stain, one prioritized file remains, and metadata about duplicates is stored in the output CSV.
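
The prioritization described above can be expressed as a single sort key. The sketch below is one possible reading of those rules, not the script's implementation: the `SlideFile` class, the function names, and the sample data are hypothetical, and the checksum is left out because the text does not say how it enters the decision.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SlideFile:
    name: str          # file name (hypothetical examples below)
    created: datetime  # creation date
    size: int          # size in bytes


def priority_key(f: SlideFile):
    """Tuple compared left to right: file type, then creation date, then size."""
    is_ndpi = f.name.lower().endswith(".ndpi")  # True sorts above False
    return (is_ndpi, f.created, f.size)


def pick_preferred(files):
    """Return the file that ranks highest under the assumed prioritization."""
    return max(files, key=priority_key)


# Duplicates of one sample: a newer TIF and two .ndpi files with the same date.
candidates = [
    SlideFile("AE1234.CD34.TIF", datetime(2021, 5, 2), 900),
    SlideFile("AE1234.CD34.ndpi", datetime(2020, 1, 1), 500),
    SlideFile("AE1234.CD34.v2.ndpi", datetime(2020, 1, 1), 700),
]
print(pick_preferred(candidates).name)
```

In this sample the .TIF file loses despite being newer, because file type is compared first; the two .ndpi files tie on creation date, so the larger one is kept.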

Licence: The MIT License (MIT), http://opensource.org/licenses/MIT.
