This file describes how we generate the PyTorch Benchmark Score Version 1. The goal is to help users and developers understand the score and be able to reproduce it.
V1 uses the same hardware environment as V0, but it covers far more models and test configurations.
The V1 benchmark suite uses an experimental JIT feature, optimize_for_inference, introduced on May 22, 2021. Therefore, it can't run on earlier versions of PyTorch.
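As a minimal illustration of the JIT feature the suite depends on, the sketch below scripts a small model and passes it through `torch.jit.optimize_for_inference`. The model here is a stand-in for illustration only; the actual suite applies this to its benchmark models.

```python
import torch
import torch.nn as nn

# A toy eval-mode model (stand-in for a real benchmark model).
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU()).eval()

# Script the model, then apply the experimental optimize_for_inference pass,
# which freezes the module and fuses ops for faster inference.
scripted = torch.jit.script(model)
optimized = torch.jit.optimize_for_inference(scripted)

with torch.no_grad():
    out = optimized(torch.randn(1, 3, 32, 32))
```

This call is only available in PyTorch nightlies from May 22, 2021 onward, which is why the suite cannot run on earlier versions.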
The V1 suite covers 47 models from popular machine learning domains. The complete list of models is as follows:
Model name | Category |
---|---|
BERT_pytorch | NLP |
Background_Matting | COMPUTER VISION |
LearningToPaint | REINFORCEMENT LEARNING |
alexnet | COMPUTER VISION |
attention_is_all_you_need_pytorch | NLP |
demucs | OTHER |
densenet121 | COMPUTER VISION |
dlrm | RECOMMENDATION |
drq | REINFORCEMENT LEARNING |
fastNLP | NLP |
hf_Albert | NLP |
hf_Bert | NLP |
hf_BigBird | NLP |
hf_DistilBert | NLP |
hf_GPT2 | NLP |
hf_Longformer | NLP |
hf_T5 | NLP |
maml | OTHER |
maml_omniglot | OTHER |
mnasnet1_0 | COMPUTER VISION |
mobilenet_v2 | COMPUTER VISION |
mobilenet_v3_large | COMPUTER VISION |
moco | OTHER |
nvidia_deeprecommender | RECOMMENDATION |
opacus_cifar10 | OTHER |
pyhpc_equation_of_state | OTHER |
pyhpc_isoneutral_mixing | OTHER |
pytorch_CycleGAN_and_pix2pix | COMPUTER VISION |
pytorch_stargan | COMPUTER VISION |
pytorch_struct | OTHER |
resnet18 | COMPUTER VISION |
resnet50 | COMPUTER VISION |
resnet50_quantized_qat | COMPUTER VISION |
resnext50_32x4d | COMPUTER VISION |
shufflenet_v2_x1_0 | COMPUTER VISION |
soft_actor_critic | REINFORCEMENT LEARNING |
speech_transformer | SPEECH |
squeezenet1_1 | COMPUTER VISION |
timm_efficientnet | COMPUTER VISION |
timm_nfnet | COMPUTER VISION |
timm_regnet | COMPUTER VISION |
timm_resnest | COMPUTER VISION |
timm_vision_transformer | COMPUTER VISION |
timm_vovnet | COMPUTER VISION |
tts_angular | SPEECH |
vgg16 | COMPUTER VISION |
yolov3 | COMPUTER VISION |
The reference config YAML file is stored here. It was generated
by repeated runs of the same benchmark configuration on pytorch v1.10.0.dev20210612,
torchtext 0.10.0.dev20210612, and torchvision 0.11.0.dev20210612. We chose the
earliest PyTorch nightly version with a stable implementation of the
optimize_for_inference
feature. We then picked a random execution from the
repeated V1 benchmark runs as the reference execution, and summarized its
performance metrics in the reference config YAML.
We have also manually verified that the maximum variance of any single test in the V1 suite is smaller than 7%. In the V1 nightly CI job, we raise a signal if any test's performance metric changes by more than the 7% threshold, or if the overall score changes by more than the 1% threshold.
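The signal logic above can be sketched in a few lines of Python. This is a hedged illustration, not the actual CI implementation; the function name, metric names, and data layout are all hypothetical.

```python
def check_regression(reference, measured, test_threshold=0.07, score_threshold=0.01):
    """Return (name, relative_change) pairs for metrics past their thresholds.

    Per-test metrics use the 7% threshold; the overall "score" entry uses 1%.
    """
    alerts = []
    for name, ref_value in reference.items():
        rel_change = abs(measured[name] - ref_value) / ref_value
        threshold = score_threshold if name == "score" else test_threshold
        if rel_change > threshold:
            alerts.append((name, rel_change))
    return alerts

# Hypothetical metrics: one test regressed by 8%, the score moved by 0.4%.
reference = {"resnet50_train": 1.00, "bert_eval": 2.00, "score": 1000.0}
measured = {"resnet50_train": 1.08, "bert_eval": 2.02, "score": 996.0}
alerts = check_regression(reference, measured)  # flags resnet50_train only
```

Only the 8% per-test change trips its 7% threshold; the 0.4% score change stays under the 1% threshold, so it raises no signal.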
We define the V1 score of the reference execution to be 1000. All other V1 scores are relative to the performance of the reference execution. For example, if another V1 benchmark execution scores 900, its performance is 10% slower than that of the reference execution.
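The relative-score idea can be sketched as follows: pin the reference execution at 1000 and scale other runs by their performance relative to it. The geometric-mean aggregation across tests is an assumption here, not necessarily the exact V1 formula.

```python
from statistics import geometric_mean

def relative_score(ref_metrics, measured_metrics, ref_score=1000.0):
    """Score a run relative to the reference execution (pinned at ref_score).

    Metrics are throughput-like (higher is better), so a per-test ratio
    below 1.0 means the run is slower than the reference on that test.
    Aggregation by geometric mean is an illustrative assumption.
    """
    ratios = [m / r for r, m in zip(ref_metrics, measured_metrics)]
    return ref_score * geometric_mean(ratios)

# A run with uniformly 10% lower throughput than the reference scores 900.
ref = [100.0, 50.0, 200.0]
slower = [x * 0.9 for x in ref]
score = relative_score(ref, slower)
```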