Introduction

Sander W. van der Laan edited this page Jan 8, 2024 · 8 revisions

The slideToolKit workflow

The slideToolKit is a collection of open-source scripts to handle each step from digital whole-slide images (WSI) to the storage of your results. A common slideToolKit workflow consists of four consecutive steps.

In the first step, acquisition, whole slide images are collected and converted to TIFF files. In the second step, preparation, all the required files are created and organized. The third step, tiles, creates multiple manageable tiles to count. The fourth step, analysis, is the actual tissue analysis and saves the results in a meaningful data set.

A set of tools is designed for each step. Instructions on how to use each tool can be found by running it with the --help flag (e.g. slideConvert --help).

Here you can find a graphical workflow for the slideToolKit.

Step 1 - acquisition

Most slide scanners are, in addition to their own proprietary format, capable of storing the digital slides in pyramid TIFF files. If needed, the slideToolKit uses the Bio-Formats library (which supports over 120 different file formats, openmicroscopy.org) to convert other microscopy formats into the compatible pyramid TIFF format. TIFF is a tag-based file format for raster images. A TIFF file can hold multiple images in a single file; this is known as a multi-layered TIFF. The term "pyramid TIFF" describes a multi-layered TIFF file that wraps a sequence of raster images, each representing the same image at a different resolution. The different layers contain, among others, the slide label and multiple enlargements of the tissue on the slide.
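
To make the pyramid idea concrete, here is a minimal Python sketch that lists the pixel dimensions of each layer for a given full-resolution size. The function name, the example dimensions, and the downsample factors (1, 4, 16, 64) are assumptions chosen for illustration; real scanners decide their own layer layout.

```python
def pyramid_levels(width, height, downsamples=(1, 4, 16, 64)):
    """Return (downsample, width, height) for each pyramid layer.

    The downsample factors are hypothetical; they are not read from a file.
    """
    levels = []
    for factor in downsamples:
        # Each layer shows the same image, shrunk by `factor` along both axes.
        levels.append((factor, max(1, width // factor), max(1, height // factor)))
    return levels


levels = pyramid_levels(80000, 40000)
for factor, w, h in levels:
    print(f"downsample {factor:>2}x: {w} x {h} pixels")
```

The smallest layers are cheap to load, which is why thumbnails and masks are typically derived from them and not from the full-resolution layer.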

Some slides do not have proper filenames. Sometimes you want the filename to match the content of the barcode exactly; at other times you want all your slides to start with the project name (e.g. MyProject.original.slide.name.TIF). slideRename makes it easy to rename multiple slides at once.
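
The project-prefix naming scheme can be sketched in a few lines of Python. This is not how slideRename is implemented; `prefixed_name` is a hypothetical helper that only computes the new name and does not touch any file.

```python
from pathlib import PurePath


def prefixed_name(filename, project):
    """Return the filename prefixed with the project name (hypothetical helper)."""
    name = PurePath(filename).name
    if name.startswith(project + "."):
        return name  # already prefixed, leave unchanged
    return f"{project}.{name}"


print(prefixed_name("original.slide.name.TIF", "MyProject"))
```

Making the helper idempotent means a batch rename can be re-run safely on a directory that is already partly renamed.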

To read whole slide images, the open-source libTIFF libraries and the OpenSlide libraries are used. These libraries are also applied to extract metadata (e.g. scan time, magnification and image compression) of the scanned slides. Descriptive information about the slide is stored as metadata and contains, for example, pixels per micrometer, presence of different layers, and scan date.
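
As an illustration of why the pixels-per-micrometer value matters, the sketch below converts a measurement in pixels to micrometers. The function name and the example values are assumptions; in practice the resolution would come from the slide metadata.

```python
def pixels_to_micrometers(length_px, pixels_per_um):
    """Convert a length in pixels to micrometers using the scan resolution."""
    if pixels_per_um <= 0:
        raise ValueError("pixels_per_um must be positive")
    return length_px / pixels_per_um


# Hypothetical example: a 2000-pixel-wide tile scanned at 2 pixels per micrometer.
tile_width_um = pixels_to_micrometers(2000, 2.0)
print(tile_width_um)
```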

ImageMagick and OpenSlide/OpenCV

For image processing we use ImageMagick in the bash versions of some scripts, and OpenSlide and OpenCV in the Python versions. ImageMagick is a command-line image manipulation tool that is fast, highly adjustable and capable of handling big pyramid TIFF files. Generally we recommend using the Python versions, as these are faster than the bash versions and, through the OpenSlide and OpenCV libraries, more flexible in the image formats they can read.

The tools designed for step 1:

  • slideConvert
    • Convert different file types of WSI to TIFF format.
  • slideDirectory
    • Creates staging directories to process images.
  • slideInfo
    • Fetch slide metadata (resolution, dates, magnification, et cetera).
  • slideLookup
    • Lookup a list of virtual slides.
  • slideRename
    • Rename virtual slides; this method supports auto-renaming using barcodes.

Step 2 - preparation

In the following steps, multiple output files are created for each slide. For each digital slide, a staging directory is constructed in which the slide and all output data concerning the slide are stored.

Thumbnails contain a photo of the whole slide, including the label. This makes it easy to identify your slides.

In digital image manipulation, a mask defines what part of the image will be analyzed and what part will be hidden. Usually a mask is defined as black (hidden) or white (not hidden). The slideToolKit creates a mask and a miniature version of the whole slide image using convert (from the ImageMagick library). To create the mask, the image is first blurred, which removes dust and speckles. Next, the white background is identified using a fuzzy, non-stringent selection and replaced with black. Settings for blur and fuzziness can be found and changed in the slideMask tool. Generated masks can be adjusted manually in an image editor of choice (such as the freely available GNU Image Manipulation Program; GIMP). Sometimes this is necessary to remove unwanted areas on the whole slide image (like marker stripes or air bubbles under the coverslip).
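
The blur-then-select idea can be illustrated on a toy grayscale grid in plain Python. This is only a sketch of the principle, not the slideMask implementation, which relies on ImageMagick; the 3x3 mean blur, the `fuzz` value, and the function names are assumptions made for the example.

```python
def box_blur(img):
    """3x3 mean blur on a 2D list of grayscale values (0-255), edges clamped."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            out[y][x] = total // 9
    return out


def make_mask(img, fuzz=30):
    """Blur, then turn near-white pixels black (hidden) and the rest white."""
    blurred = box_blur(img)
    return [[0 if px >= 255 - fuzz else 255 for px in row] for row in blurred]


# Toy slide: white background, a 3x3 block of tissue, and one dust speck.
slide = [[255] * 9 for _ in range(9)]
for y in range(1, 4):
    for x in range(1, 4):
        slide[y][x] = 100
slide[6][6] = 100  # isolated speck

mask = make_mask(slide)
for row in mask:
    print("".join("#" if px else "." for px in row))
```

The isolated speck is averaged away by the blur and ends up in the hidden background, while the tissue block survives; the blur also grows the tissue area slightly at its border.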

The tools designed for step 2:

  • slideDirectory
    • Create a staging directory per slide.
  • slideExtract
    • Extract a slide thumbnail, including label, and scaled macro version of the WSI (in .png-format).
  • slideMacro
    • Create a scaled macro version from a slide (in .png-format).
  • slideNormalize
    • Create a normalized version from a macro version of a given slide (in .png-format).
  • slideThumb
    • Create slide thumbnail, including label from a slide (in .png-format).

Because the former slideMask was sometimes unable to make proper masks, especially when the contrast between tissue and background is very low, we created slideEntropyMasker (Python version) and slideEMask (C++ program).

The following (legacy) masking tools are available:

  • slideEMask
    • Create a scaled masked macro version from a slide (in .png-format) using image entropy.
  • slideEntropyMasker
    • Create a scaled masked macro version from a slide (in .png-format) using image entropy.
  • slideMask
    • Create a scaled masked macro version from a slide (in .png-format) using ImageMagick.
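
The sketch below illustrates the general idea behind entropy-based masking. It assumes that "image entropy" means the Shannon entropy of the pixel-value histogram of a region, which may differ from what slideEMask and slideEntropyMasker actually compute; the threshold and function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy (in bits) of a flat list of pixel values."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def looks_like_tissue(pixels, threshold=1.0):
    """Hypothetical rule: textured regions have entropy above the threshold."""
    return shannon_entropy(pixels) > threshold


background = [255] * 64    # uniform region: a single pixel value
textured = list(range(64))  # varied region: 64 distinct pixel values

print(looks_like_tissue(background), looks_like_tissue(textured))
```

A uniform background has very low entropy regardless of its brightness, so this kind of measure does not depend on a strong color contrast between tissue and background.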

Step 3 - tiles

Image analysis of the memory-intensive, whole 20x representations of the digitized slides is currently impossible due to hardware and software limitations. The goal of this step is to create multiple smaller images (i.e. 'tiles') from the 20x magnified WSI. An upscaled version of the mask is placed over the 20x WSI (in our example this is layer 3 of the multi-layered TIFF). Image manipulation on 20x-sized WSI requires large amounts of RAM. To make it possible for computers without sufficient RAM to handle these files, the slideToolKit uses a memory-mapped disk file of the program memory. Using these disk-mapped memory files (ImageMagick .mpc-files), the slideToolKit can efficiently extract all tiles. When no mask is used, a faster and more memory-efficient method based on the OpenSlide library is applied.
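
The relation between the low-resolution mask and the full-resolution tiles can be sketched as follows. This is an illustration of the bookkeeping only, not the slide2Tiles implementation; the function names, the tile size, and the `scale` parameter (full-resolution pixels per mask pixel) are assumptions.

```python
def tile_origins(width, height, tile_size):
    """Yield the (x, y) top-left corner of every tile covering the image."""
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            yield x, y


def tiles_with_tissue(mask, scale, width, height, tile_size):
    """Keep tiles whose area in the mask contains at least one non-black pixel.

    mask  : 2D list, 0 = hidden, 255 = not hidden
    scale : number of full-resolution pixels per mask pixel (assumed uniform)
    """
    kept = []
    for x, y in tile_origins(width, height, tile_size):
        x0, y0 = x // scale, y // scale
        # Ceiling division so a partially covered mask pixel is still included.
        x1 = min(len(mask[0]), -(-(x + tile_size) // scale))
        y1 = min(len(mask), -(-(y + tile_size) // scale))
        if any(mask[my][mx] for my in range(y0, y1) for mx in range(x0, x1)):
            kept.append((x, y))
    return kept


# Hypothetical 4x4 mask for a 2000x2000 image; tissue only in the top-left corner.
mask = [[0] * 4 for _ in range(4)]
mask[0][0] = 255
kept = tiles_with_tissue(mask, scale=500, width=2000, height=2000, tile_size=1000)
print(kept)
```

Skipping tiles that fall entirely in the hidden area reduces the number of images passed on to the analysis step.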

The tools designed for step 3:

  • slide2Tiles
    • Cut virtual slide into tiles (bash version).

Step 4 - analysis

At this step, multiple tiles containing tissue data have been made, and the different objects in this tissue can now be identified. Although you can use any image analysis program from now on, we prefer CellProfiler. CellProfiler is designed to quantitatively measure phenotypes from thousands of images automatically without training in computer vision or programming. CellProfiler can run using a graphical user interface (GUI) or a command-line interface (CLI). In CellProfiler's GUI, different algorithms for image analysis are available as individual modules that can be modified and placed in sequential order to form a pipeline. Such a pipeline can be used to identify and measure biological objects and features in images. Pipelines can be stored and reused in future projects. The CLI can be used to run the pipeline for the actual image analysis.

An illustrated example of how to create pipelines in CellProfiler is described by Vokes and Carpenter in their manuscript "Using CellProfiler for Automatic Identification and Measurement of Biological Objects in Images".

CellProfiler is able to output its measurements in .gct and/or .csv-format. The .csv files are commonly used data files and can be imported into nearly every statistical program.

The tools designed for step 4:

  • slideAppend
    • Appends the output from CellProfiler, which is per tile, into one .csv file. The .gct-format can be handled in the same way with slideAppendGCT.sh.
  • slideJobChecker
    • Checks the output of the given step from the slideToolKit.
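
Combining per-tile results into one file can be sketched with Python's standard csv module. This is not the slideAppend implementation; the function name and the column names in the example are assumptions, and the sketch expects every per-tile file to share the same header.

```python
import csv
import io


def append_csv(tile_csvs):
    """Concatenate per-tile CSV texts into one text with a single header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for text in tile_csvs:
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            continue  # skip empty outputs
        if header is None:
            header = rows[0]
            writer.writerow(header)
        elif rows[0] != header:
            raise ValueError("tile CSV headers differ")
        writer.writerows(rows[1:])
    return out.getvalue()


# Hypothetical per-tile outputs with made-up column names.
tile_a = "ImageNumber,Count_Nuclei\n1,12\n"
tile_b = "ImageNumber,Count_Nuclei\n1,7\n"
combined = append_csv([tile_a, tile_b])
print(combined)
```

Checking that the headers match before appending catches tiles that were processed with a different pipeline.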

CellProfiler is also able to output .sql-files, where the .sql-file contains the structure of the data and the .csv-file contains the actual measurements. With the (no longer supported) legacy slideSQLheader you can extract the SQL structure and add it as a header row to the .csv-file.

All-in-one workflow

The slideToolKit contains a collection of scripts meant to be used manually and separately as needed, but it can also be used as an all-in-one workflow. For this purpose the above steps are captured in a few separate bash scripts, which can be called locally with slideQuantifyOSX on macOS, or on a SLURM-based Linux server with slideQuantify.

  • slideQuantifyOSX or slideQuantify
    • Main script to run the slideToolKit on a given collection of WSI from steps 1 through 4. Run locally (slideQuantifyOSX) or on a SLURM-based Linux server (slideQuantify). Executes the following sequentially:
      • slideQuantify_1_expresshist_mask.sh / slideQuantify_mask.sh
        • Creates thumbnails and scaled (masked) macro versions of a given set of images.
      • slideQuantify_2_expresshist_tile.sh / slideQuantify_tiling.sh
        • Creates image tiles from the macro-version while masking non-tissue areas using the masked images.
      • slideQuantify_3_tile_normalizing.sh / slideQuantify_normalizing.sh
        • Normalizes tiled images.
      • slideQuantify_4_cellprofiler.sh / slideQuantify_cellprofiler.sh
        • Runs a CellProfiler pipeline on a given set of tiled, masked, and normalized images.
      • slideQuantify_5_wrapup.sh / slideQuantify_wrapup.sh
        • Wraps up the results and produces the final dataset in .csv format.