This repository provides an example of how to run a PyTorch training job on OpenShift. The example demonstrates setting up a distributed training job using OpenShift resources and the PyTorchJob API. This repository is used in the walkthrough document "RoCE Multi node AI training on OpenShift".
- Directory info
- Prerequisites
- Setup
- Running the Example
- PyTorch Script Arguments
- Convert Script Arguments
## Directory info

This repository is organized as follows:

- `docker-image-files/`: Contains the Dockerfile and related scripts for building the Docker image used for the PyTorch training job.
  - `Dockerfile`: Defines the environment and dependencies for the PyTorch training container.
  - `entrypoint.sh`: Script that sets up the environment variables and starts the training job.
- `examples/`: Contains example scripts and configurations for running and testing the PyTorch training job.
  - `pytorchjob.yaml`: Defines a basic PyTorchJob resource for running the distributed training job on OpenShift (a minimal sketch of such a manifest follows this list).
  - `pytorch-using-entrypoint.yaml`: Defines a PyTorchJob resource that sets the environment variables in the manifest itself and uses `entrypoint.sh` to launch the training on OpenShift.
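For orientation, a PyTorchJob manifest generally follows the shape sketched below. This is a minimal illustration of the PyTorchJob API, not the actual contents of `examples/pytorchjob.yaml`; the image name, replica count, and job name are placeholders:

```yaml
# Minimal PyTorchJob sketch (illustrative; not the repository's actual manifest).
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch   # the training operator expects this container name
              image: <your-dockerhub-username>/pytorch-training:latest
    Worker:
      replicas: 2             # placeholder worker count
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <your-dockerhub-username>/pytorch-training:latest
```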
## Prerequisites

Before you begin, ensure you have the following prerequisites:

- An OpenShift cluster up and running.
- The `oc` command-line tool installed and configured.
- docker/podman installed for building the container images.
- Basic knowledge of Kubernetes or OpenShift.
## Setup

1. Clone the repository:

   ```bash
   git clone git@github.com:redhat-developer-demos/openshift-distrbuted-resnet-training.git
   cd openshift-distrbuted-resnet-training
   ```

2. Build the Docker image:

   ```bash
   docker build -t <your-dockerhub-username>/pytorch-training:latest .
   ```

3. Push the Docker image to your Docker registry:

   ```bash
   docker push <your-dockerhub-username>/pytorch-training:latest
   ```
## Running the Example

1. Apply the Kubernetes resources:

   ```bash
   oc create -f pytorchjob.yaml
   ```

2. Verify the job is running:

   ```bash
   oc get pods
   ```

3. Check the logs of the training job (for the job that uses `entrypoint.sh`):

   ```bash
   oc logs <pod-name>
   ```
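A few additional standard `oc` commands can help when inspecting the job; `<job-name>` and `<pod-name>` are placeholders, and the `pytorchjobs` resource is available once the training operator's CRD is installed:

```bash
# Inspect the PyTorchJob object and its status conditions.
oc get pytorchjobs
oc describe pytorchjob <job-name>

# Stream the logs of a training pod as it runs.
oc logs -f <pod-name>
```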
## PyTorch Script Arguments

The `main.py` script accepts the following arguments:

- `--backend`: Backend to use for distributed training (default: `nccl`)
- `--batch_size`: Input batch size for training (default: `64`)
- `--data_path`: Path to the dataset (required)
- `--num_train_epochs`: Number of training epochs (default: `1`)
- `--learning_rate`: Learning rate for the optimizer (default: `0.001`)
- `--weight_decay`: Weight decay for the optimizer (default: `0.0`)
- `--gradient_accumulation_steps`: Gradient accumulation steps (default: `1`)
- `--evaluation_strategy`: Evaluation strategy (default: `no`)
- `--save_strategy`: Save strategy (default: `epoch`)
- `--lr_scheduler_type`: Type of learning rate scheduler (default: `cosine`)
- `--pretrained_weights`: Path to pre-trained weights (default: `''`)
- `--num_workers`: Number of DataLoader workers (default: `2`)
- `--max_samples`: Maximum number of samples per epoch (`-1` for the full dataset; only works with a single node) (default: `-1`)
- `--print_interval`: Interval for printing metrics, in batches (default: `10`)
- `--use_syn`: Use synthetic data (default: `False`)
- `--output_dir`: Output directory for saving models (default: `.`)
Example usage:

```bash
torchrun --nproc_per_node=1 --nnodes=3 --node_rank=2 --master_addr=192.168.1.5 --master_port=23456 main.py --backend=nccl --batch_size=128 --data_path=/mnt/storage/dataset/cifar10_imagefolder --num_train_epochs=1 --learning_rate=0.001 --num_workers=5 --print_interval=5 --output_dir /mnt/storage/
```
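These flags follow the standard argparse pattern, and under `torchrun` the process group can be initialized straight from the environment. The snippet below is a minimal sketch of that wiring with a subset of the flags; it illustrates the pattern rather than reproducing the actual `main.py`:

```python
# Minimal sketch of the flag wiring -- illustrative, not the actual main.py.
import argparse

import torch.distributed as dist


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--backend", default="nccl")
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--data_path", required=True)
    parser.add_argument("--num_train_epochs", type=int, default=1)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--use_syn", action="store_true")  # defaults to False
    parser.add_argument("--output_dir", default=".")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    # torchrun exports RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT,
    # so init_process_group can read everything from the environment.
    dist.init_process_group(backend=args.backend)
```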
## Convert Script Arguments

The `convert.py` script accepts the following arguments:

- `--root_dir`: Root directory of the dataset batches (required).
- `--output_dir`: Output directory for the ImageFolder format (required).

Example usage:

```bash
python convert_cifar10_to_imagefolder.py --root_dir /path/to/cifar-10-batches-py --output_dir /path/to/output
```
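For context, converting the pickled CIFAR-10 batches into the `ImageFolder` layout (one subdirectory per class) generally follows the pattern sketched below. This is an illustration of the format, not the repository's actual convert script; the hard-coded paths are placeholders and the real script would loop over all batch files:

```python
# Illustrative sketch of the batches -> ImageFolder conversion pattern.
import os
import pickle

from PIL import Image


def convert_batch(batch_path, output_dir):
    # Each CIFAR-10 batch is a pickled dict with b"data" (N x 3072 uint8
    # rows) and b"labels" (N class indices).
    with open(batch_path, "rb") as f:
        batch = pickle.load(f, encoding="bytes")
    for i, (row, label) in enumerate(zip(batch[b"data"], batch[b"labels"])):
        # Rows are stored channel-first (3 x 32 x 32); PIL wants HWC.
        img = Image.fromarray(row.reshape(3, 32, 32).transpose(1, 2, 0))
        class_dir = os.path.join(output_dir, str(label))  # one dir per class
        os.makedirs(class_dir, exist_ok=True)
        img.save(os.path.join(class_dir, f"{i}.png"))


convert_batch("/path/to/cifar-10-batches-py/data_batch_1", "/path/to/output")
```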