multi-GPU training problem #6

LilySys · 2019-09-25T07:32:19Z

Hi, when I trained the model with multi-GPU training, the model didn't start training even after more than 30 minutes, and I don't know why. Could you give me some suggestions? Thank you!

2019-09-25 14:56:36,708 reid_baseline.train INFO: More than one gpu used, convert model to use SyncBN.
2019-09-25 14:56:40,504 reid_baseline.train INFO: Using pytorch SyncBN implementation
2019-09-25 14:56:40,535 reid_baseline.train INFO: Trainer Built

DTennant · 2019-09-25T08:04:45Z

Can you please provide your config file?

LilySys · 2019-09-26T00:23:20Z

ok.

In config.py, I just modified _C.TEST.VIS = True.

In debug_multi-gpu.yml, the configuration is as follows:
MODEL:
  PRETRAIN_PATH: '/home/wl/.torch/models/resnet50-19c8e357.pth'

INPUT:
  SIZE_TRAIN: [256, 128]
  SIZE_TEST: [256, 128]
  PIXEL_MEAN: [0.485, 0.456, 0.406]
  PIXEL_STD: [0.229, 0.224, 0.225]
  PROB: 0.5 # random horizontal flip
  RE_PROB: 0.5 # random erasing
  PADDING: 10

DATASETS:
  NAMES: 'retrieval'
  DATA_PATH: '/home/wl/.data/retrieval/vehicle'
  TRAIN_PATH: 'train.txt'
  QUERY_PATH: 'query.txt'
  GALLERY_PATH: 'gallery.txt'

DATALOADER:
  SAMPLER: 'softmax_triplet'
  NUM_INSTANCE: 8
  NUM_WORKERS: 4

SOLVER:
  OPTIMIZER_NAME: 'Adam'
  MAX_EPOCHS: 120
  BASE_LR: 0.00035
  BIAS_LR_FACTOR: 1
  WEIGHT_DECAY: 0.0005
  WEIGHT_DECAY_BIAS: 0.0005
  IMS_PER_BATCH: 128

  STEPS: [40,70]
  GAMMA: 0.1

  WARMUP_FACTOR: 0.01
  WARMUP_ITERS: 10
  WARMUP_METHOD: 'linear'

  CHECKPOINT_PERIOD: 10
  LOG_PERIOD: 20
  EVAL_PERIOD: 20

TEST:
  IMS_PER_BATCH: 128
  DEBUG: True
  WEIGHT: "path"
  MULTI_GPU: True

OUTPUT_DIR: "/home/wl/.pytorch_project/person_reid/reid_baseline_with_syncbn-master/outputs/20190925"

DTennant · 2019-09-26T02:01:03Z

TRAIN_PATH, QUERY_PATH and GALLERY_PATH should be the paths to the image folders.

LilySys · 2019-09-26T07:38:38Z

Yes, I know, so I modified data.py according to my requirements. So I think the problem may not be here.

DTennant · 2019-09-26T07:57:00Z

Can you post your data.py?

LilySys · 2019-09-26T08:10:26Z

import torch
import os.path as osp
from PIL import Image
from torch.utils.data import Dataset
import numpy as np
from torchvision import transforms as T
import glob
import re
from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True

def read_image(img_path):
    """Keep reading image until succeed.
    This can avoid IOError incurred by heavy IO process."""
    got_img = False
    if not osp.exists(img_path):
        raise IOError("{} does not exist".format(img_path))
    while not got_img:
        try:
            img_type = 'RGB'
            img = Image.open(img_path).convert(img_type)
            got_img = True
        except IOError:
            print("IOError incurred when reading '{}'. "
                  "Will redo. Don't worry. Just chill.".format(img_path))
            pass
    return img

class ImageDataset(Dataset):
    """Image Person ReID Dataset"""

    def __init__(self, dataset, cfg, transform=None):
        self.dataset = dataset
        self.cfg = cfg
        self.transform = transform

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        img_path, pid, camid = self.dataset[index]
        img = read_image(img_path)

        if self.transform is not None:
            img = self.transform(img)

        return img, pid, camid, img_path

class BaseDataset:
    def __init__(self, root='/home/wl/.data/retrieval',
                 train_dir='', query_dir='', gallery_dir='',
                 verbose=True, **kwargs):
        self.dataset_dir = root
        self.train_dir = osp.join(self.dataset_dir, 'train')
        self.query_dir = osp.join(self.dataset_dir, 'query')
        self.gallery_dir = osp.join(self.dataset_dir, 'gallery')
        self.list_train_path = osp.join(self.dataset_dir, 'train/' + train_dir)
        self.list_query_path = osp.join(self.dataset_dir, 'query/' + query_dir)
        self.list_gallery_path = osp.join(self.dataset_dir, 'gallery/' + gallery_dir)

        self._check_before_run()
        train = self._process_dir(self.train_dir, self.list_train_path)
        query = self._process_dir(self.query_dir, self.list_query_path)
        gallery = self._process_dir(self.gallery_dir, self.list_gallery_path)
        if verbose:
            print("=> retrieval loaded")
            self.print_dataset_statistics(train, query, gallery)

        self.train = train
        self.query = query
        self.gallery = gallery

        self.num_train_pids, self.num_train_imgs, self.num_train_cams = self.get_imagedata_info(self.train)
        self.num_query_pids, self.num_query_imgs, self.num_query_cams = self.get_imagedata_info(self.query)
        self.num_gallery_pids, self.num_gallery_imgs, self.num_gallery_cams = self.get_imagedata_info(self.gallery)

    def get_imagedata_info(self, data):
        pids, cams = [], []
        for _, pid, camid in data:
            pids += [pid]
            cams += [camid]
        pids = set(pids)
        cams = set(cams)
        num_pids = len(pids)
        num_cams = len(cams)
        num_imgs = len(data)
        return num_pids, num_imgs, num_cams

    def print_dataset_statistics(self, train, query, gallery):
        num_train_pids, num_train_imgs, num_train_cams = self.get_imagedata_info(train)
        num_query_pids, num_query_imgs, num_query_cams = self.get_imagedata_info(query)
        num_gallery_pids, num_gallery_imgs, num_gallery_cams = self.get_imagedata_info(gallery)

        print("Dataset statistics:")
        print("  ----------------------------------------")
        print("  subset   | # ids | # images | # cameras")
        print("  ----------------------------------------")
        print("  train    | {:5d} | {:8d} | {:9d}".format(num_train_pids, num_train_imgs, num_train_cams))
        print("  query    | {:5d} | {:8d} | {:9d}".format(num_query_pids, num_query_imgs, num_query_cams))
        print("  gallery  | {:5d} | {:8d} | {:9d}".format(num_gallery_pids, num_gallery_imgs, num_gallery_cams))
        print("  ----------------------------------------")

    def _check_before_run(self):
        """Check if all files are available before going deeper"""
        if not osp.exists(self.dataset_dir):
            raise RuntimeError("'{}' is not available".format(self.dataset_dir))
        if not osp.exists(self.train_dir):
            raise RuntimeError("'{}' is not available".format(self.train_dir))
        if not osp.exists(self.query_dir):
            raise RuntimeError("'{}' is not available".format(self.query_dir))
        if not osp.exists(self.gallery_dir):
            raise RuntimeError("'{}' is not available".format(self.gallery_dir))

    def _process_dir(self, dir_path, list_path):
        # Each line of the list file is "<image path> <pid>".
        with open(list_path, 'r') as txt:
            lines = txt.readlines()

        # First pass: collect the raw pids and map them to contiguous labels.
        pid_container = set()
        for img_idx, img_info in enumerate(lines):
            img_path, pid = img_info.split(' ')
            pid = int(pid)
            assert pid>=0, "pid less than 0"
            pid_container.add(pid)
        pid2label = {pid: label for label, pid in enumerate(pid_container)}

        # Second pass: build (img_path, label, camid) tuples; camid is fixed to 1.
        dataset = []
        for img_idx, img_info in enumerate(lines):
            img_path, pid = img_info.split(' ')
            camid = 1
            pid = int(pid)
            img_path = dir_path + img_path
            pid = pid2label[pid]
            dataset.append((img_path, pid, camid))

        return dataset

def init_dataset(cfg):
    """
    Use path in cfg to init a dataset
    train set and val set should be organized as
    cfg.DATASETS.TRAIN_PATH: the path of train.txt
    cfg.DATASETS.QUERY_PATH: the path of query.txt
    cfg.DATASETS.GALLERY_PATH: the path of gallery.txt
    """
    return BaseDataset(root=cfg.DATASETS.DATA_PATH, train_dir=cfg.DATASETS.TRAIN_PATH,
                       query_dir=cfg.DATASETS.QUERY_PATH, gallery_dir=cfg.DATASETS.GALLERY_PATH)

LilySys · 2019-09-26T08:40:57Z

Hi, when I decreased IMS_PER_BATCH from 128 to 64, the model started training.
2019-09-26 16:34:26,818 reid_baseline.train INFO: More than one gpu used, convert model to use SyncBN.
2019-09-26 16:34:29,920 reid_baseline.train INFO: Using pytorch SyncBN implementation
2019-09-26 16:34:29,936 reid_baseline.train INFO: Trainer Built
2019-09-26 16:35:27,140 reid_baseline.train INFO: Epoch[1] Iteration[20/5582] Loss: 14.669,Acc: 0.000, Base Lr: 1.40e-05
2019-09-26 16:35:43,517 reid_baseline.train INFO: Epoch[1] Iteration[40/5582] Loss: 14.248,Acc: 0.000, Base Lr: 1.40e-05
2019-09-26 16:35:59,911 reid_baseline.train INFO: Epoch[1] Iteration[60/5582] Loss: 14.028,Acc: 0.000, Base Lr: 1.40e-05
2019-09-26 16:36:16,507 reid_baseline.train INFO: Epoch[1] Iteration[80/5582] Loss: 13.833,Acc: 0.000, Base Lr: 1.40e-05
2019-09-26 16:36:32,851 reid_baseline.train INFO: Epoch[1] Iteration[100/5582] Loss: 13.673,Acc: 0.000, Base Lr: 1.40e-05
2019-09-26 16:36:49,295 reid_baseline.train INFO: Epoch[1] Iteration[120/5582] Loss: 13.562,Acc: 0.000, Base Lr: 1.40e-05
2019-09-26 16:37:05,658 reid_baseline.train INFO: Epoch[1] Iteration[140/5582] Loss: 13.454,Acc: 0.000, Base Lr: 1.40e-05
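
For anyone checking their own batch settings, here is a minimal sketch. It assumes the 'softmax_triplet' sampler draws NUM_INSTANCE images for each identity, so one batch holds IMS_PER_BATCH / NUM_INSTANCE identities; that is an assumption about the sampler, not something confirmed in this thread. The helper names `batch_layout` and `images_per_pid` and the sample lines are hypothetical; the "path pid" line format follows `_process_dir` in the data.py posted above. This check does not by itself explain the hang.

```python
from collections import Counter


def batch_layout(ims_per_batch, num_instance):
    """Identities per batch, assuming NUM_INSTANCE images per identity (assumption)."""
    if ims_per_batch % num_instance != 0:
        raise ValueError("IMS_PER_BATCH should be divisible by NUM_INSTANCE")
    return ims_per_batch // num_instance


def images_per_pid(lines):
    """Count images per identity from '<image path> <pid>' lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing on them
        _img_path, pid = line.rsplit(' ', 1)
        counts[int(pid)] += 1
    return counts


# Hypothetical list-file content; in practice use open(list_path).readlines().
sample_lines = ["a/0001.jpg 7\n", "a/0002.jpg 7\n", "b/0003.jpg 9\n"]
counts = images_per_pid(sample_lines)

for ims_per_batch in (128, 64):
    print(ims_per_batch, "images ->", batch_layout(ims_per_batch, 8), "identities per batch")

# Identities with fewer images than NUM_INSTANCE (8 in the config above).
short = [pid for pid, n in counts.items() if n < 8]
print(len(counts), "identities,", len(short), "with fewer than 8 images")
```

With NUM_INSTANCE: 8, a batch of 128 would need 16 identities and a batch of 64 would need 8, under the assumption stated above.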

LilySys · 2019-09-26T09:03:15Z

Hi, I have another question:
Why is multi-GPU training slower than single-GPU training for me?
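
One way to make such a comparison concrete is to compute the seconds per iteration from the log timestamps of each run and compare them at the same total batch size. The sketch below parses lines in the format shown in the log above; the function name `seconds_per_iteration` is hypothetical, and the two sample lines are the first and last iteration lines from that log. Note that SyncBN synchronizes batch statistics across GPUs, which adds communication cost per iteration, so some slowdown per step is plausible.

```python
import re
from datetime import datetime

# Matches e.g. "2019-09-26 16:35:27,140 ... Iteration[20/5582] ..."
LOG_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*Iteration\[(\d+)/\d+\]"
)


def seconds_per_iteration(log_lines):
    """Average wall-clock seconds per iteration between the first and last matching lines."""
    points = []
    for line in log_lines:
        match = LOG_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            points.append((stamp, int(match.group(2))))
    if len(points) < 2:
        raise ValueError("need at least two iteration log lines")
    (t0, i0), (t1, i1) = points[0], points[-1]
    return (t1 - t0).total_seconds() / (i1 - i0)


log_lines = [
    "2019-09-26 16:35:27,140 reid_baseline.train INFO: Epoch[1] Iteration[20/5582] Loss: 14.669,Acc: 0.000, Base Lr: 1.40e-05",
    "2019-09-26 16:37:05,658 reid_baseline.train INFO: Epoch[1] Iteration[140/5582] Loss: 13.454,Acc: 0.000, Base Lr: 1.40e-05",
]

spi = seconds_per_iteration(log_lines)
images_per_second = 64 / spi  # IMS_PER_BATCH was 64 in this run
print("{:.3f} s/iteration, {:.1f} images/s".format(spi, images_per_second))
```

For the log above this gives roughly 0.82 seconds per iteration, or about 78 images per second at a batch size of 64. Running the same calculation on a single-GPU log would show whether the slowdown is per iteration or per image.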

HaoWang1006 · 2020-12-25T03:08:30Z

@LilySys Did reducing the batch size solve this problem? Also, did you see any decrease in test accuracy after training?

YUHANG-Ma · 2021-12-07T07:31:08Z

LilySys wrote: "Hi, when I decreased IMS_PER_BATCH from 128 to 64, the model started training."

Hi, I also ran into this problem. I changed the batch size to 32, but it still doesn't work.
