slideDupIdentify
`slideDupIdentify.py` is a Python script for identifying and managing duplicate whole-slide image (WSI) files. It selects files by study type and stain and organizes duplicates according to a defined prioritization. The script writes a CSV report with metadata on the identified duplicates and logs the entire process, documenting which files were kept, moved, or skipped.
Option | Description |
---|---|
`--image-folder`, `-i` | Optional. The folder where the input images are located. Defaults to the current directory if not specified. |
`--study-type`, `-t` | Required. The study type prefix (e.g., `AE`). Files must start with this prefix to be processed. |
`--stain`, `-s` | Required. The stain name (e.g., `CD34`). Files must contain this string to be processed. |
`--out-file`, `-o` | Required. The output CSV file name (without extension) for saving duplicate information. |
`--force`, `-f` | Optional. Overwrite the output file if it already exists. |
`--dry-run`, `-d` | Optional. Perform a dry run where no actual file operations are performed. Actions are reported to the terminal. |
`--debug`, `-D` | Optional. Print debug information for troubleshooting. |
`--verbose`, `-v` | Optional. Show details of all duplicate samples identified. |
`--help`, `-h` | Optional. Show help message and usage instructions. |
`--version`, `-V` | Optional. Print the script version and exit. |
python slideDupIdentify.py --image-folder /path/to/images \
--study-type AE \
--stain CD34 \
--out-file duplicates_report \
--verbose
In this example, the script will:
- Search `/path/to/images` for files related to the `AE` study type and `CD34` stain.
- Identify duplicates and save the information to a CSV file, `duplicates_report.AE.CD34.metadata.csv`.
- Log details about duplicate identification and prioritization.
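The file selection and output naming in this example can be sketched roughly as follows. This is an illustration only, not the script's actual code: the helper names `is_candidate`, `candidate_files`, and `report_name` are hypothetical, and case-sensitive matching is an assumption.

```python
from pathlib import Path


def is_candidate(file_name, study_type, stain):
    """True if the name starts with the study type and contains the stain.

    Assumption: matching is case-sensitive; the real script may differ.
    """
    return file_name.startswith(study_type) and stain in file_name


def candidate_files(image_folder, study_type, stain):
    """List files in image_folder that pass the prefix/stain filter."""
    folder = Path(image_folder)
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and is_candidate(p.name, study_type, stain)
    )


def report_name(out_file, study_type, stain):
    """Build the report name following the pattern shown in the example."""
    return f"{out_file}.{study_type}.{stain}.metadata.csv"


print(report_name("duplicates_report", "AE", "CD34"))
```

With the values from the example, `report_name` yields `duplicates_report.AE.CD34.metadata.csv`.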
The script uses a structured process to prioritize duplicates based on:
- Preferred file type: `.ndpi` files are preferred over `.TIF`.
- Creation date: The latest file is preferred.
- Checksum and size: If files have the same type and creation date, the largest file is preferred.
For each study number and stain, one prioritized file remains, and metadata about duplicates is stored in the output CSV.
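The prioritization above can be expressed as a single sort key. The sketch below is a minimal illustration under stated assumptions, not the script's implementation: `SlideFile`, `priority_key`, and `pick_kept` are hypothetical names, the study number is assumed to be already parsed from the file name, and the checksum comparison is left out because the source does not describe how it is used.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SlideFile:
    name: str          # file name, e.g. "AE1.CD34.ndpi"
    study_number: str  # assumed to be parsed from the name elsewhere
    created: float     # creation time as a timestamp
    size: int          # file size in bytes


def priority_key(f):
    """Higher tuples win: .ndpi first, then latest creation, then largest size."""
    is_ndpi = f.name.lower().endswith(".ndpi")
    return (is_ndpi, f.created, f.size)


def pick_kept(files):
    """Map each study number to (kept file, list of its duplicates)."""
    groups = defaultdict(list)
    for f in files:
        groups[f.study_number].append(f)
    result = {}
    for study, group in groups.items():
        kept = max(group, key=priority_key)
        result[study] = (kept, [f for f in group if f is not kept])
    return result


files = [
    SlideFile("AE1.CD34.TIF", "AE1", 200.0, 900),
    SlideFile("AE1.CD34.ndpi", "AE1", 100.0, 500),
    SlideFile("AE1.CD34.v2.ndpi", "AE1", 100.0, 700),
]
kept, duplicates = pick_kept(files)["AE1"]
print(kept.name, [d.name for d in duplicates])
```

Here the newer `.TIF` loses to both `.ndpi` files because file type is compared first, and the two `.ndpi` files with equal creation times are separated by size.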
Licence. The MIT License (MIT): http://opensource.org/licenses/MIT.
Copyright (c) 2014-2024, Bas G.L. Nelissen & Sander W. van der Laan, UMC Utrecht, Utrecht, the Netherlands.