Performance analysis of PyTorch

Foreword

The PyTorch port to ROCm is under active development, especially with regard to performance. We are focusing our efforts on server-grade accelerators (MI25/MI60/...), but the following applies to all supported AMD hardware.

Performance analysis

We supply a small microbenchmarking script for PyTorch training on ROCm. To use it, download micro_benchmarking_pytorch.py and fp16util.py.

To execute:

python micro_benchmarking_pytorch.py --network <network name> [--batch-size <batch size>] [--iterations <number of iterations>] [--fp16 <0 or 1>] [--dataparallel] [--device_ids <GPU indices>]

The --device_ids option takes a comma-separated list (no spaces) of 0-indexed GPU indices on which to run the dataparallel API.

Possible network names are: alexnet, densenet121, inception_v3, resnet50, resnet101, SqueezeNet, and vgg16.

The defaults are 10 training iterations, fp16 off (i.e., 0), and a batch size of 64.
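
To make the options above concrete, here is a minimal Python sketch that assembles the command line from the documented flags and defaults. The helper name build_benchmark_command is hypothetical and not part of the benchmarking script; it only builds and prints the command and does not run it.

```python
import shlex


def build_benchmark_command(network, batch_size=64, iterations=10, fp16=0,
                            dataparallel=False, device_ids=None):
    """Hypothetical helper: build the argument list for the microbenchmark.

    The defaults mirror the ones documented above (64, 10, fp16 off).
    """
    cmd = [
        "python", "micro_benchmarking_pytorch.py",
        "--network", network,
        "--batch-size", str(batch_size),
        "--iterations", str(iterations),
        "--fp16", str(fp16),
    ]
    if dataparallel:
        cmd.append("--dataparallel")
        if device_ids:
            # Comma-separated, no spaces, 0-indexed GPU indices.
            cmd += ["--device_ids", ",".join(str(i) for i in device_ids)]
    return cmd


# Example: resnet50 with a batch size of 128 on GPUs 0 and 1.
cmd = build_benchmark_command("resnet50", batch_size=128,
                              dataparallel=True, device_ids=[0, 1])
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run directly.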

Performance tuning

If performance on a specific card and/or model is lacking, some gains can typically be made by tuning MIOpen. To do so, export MIOPEN_FIND_ENFORCE=3 before running the model. When untuned configurations are encountered, this takes some time and writes the results to a local performance database. More information can be found in the MIOpen documentation.
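
As an alternative to exporting the variable in the shell, the same setting can be applied from Python when launching the benchmark as a subprocess. The following is a minimal sketch under that assumption; the helper names miopen_tuning_env and run_tuned are hypothetical, and only the variable MIOPEN_FIND_ENFORCE=3 comes from the text above.

```python
import os
import subprocess


def miopen_tuning_env(base_env=None):
    """Hypothetical helper: copy an environment and enable MIOpen tuning."""
    env = dict(os.environ if base_env is None else base_env)
    # Same effect as `export MIOPEN_FIND_ENFORCE=3` before running the model.
    env["MIOPEN_FIND_ENFORCE"] = "3"
    return env


def run_tuned(cmd):
    """Hypothetical helper: run a command (list of arguments) with tuning on.

    Expect the first run to be slow if untuned configurations are found.
    """
    return subprocess.run(cmd, env=miopen_tuning_env(), check=True)
```

For example, run_tuned(["python", "micro_benchmarking_pytorch.py", "--network", "resnet50"]) would run one tuning pass; the parent process's own environment is left unchanged.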
