The converted model does not perform well #307

Open
yakupakkaya opened this issue Jun 20, 2022 · 6 comments
Labels
bug Something isn't working

Comments

@yakupakkaya

I trained a model with the following config, as in the demo code:

def prepare_for_launch():
    runner = GeneralizedRCNNRunner()
    cfg = runner.get_default_cfg()
    cfg.merge_from_file(model_zoo.get_config_file("faster_rcnn_fbnetv3g_fpn.yaml"))
    cfg.MODEL_EMA.ENABLED = False
    cfg.DATASETS.TRAIN = (tr,)
    cfg.DATASETS.TEST = (ts,)
    cfg.DATALOADER.NUM_WORKERS = 2
    cfg.MODEL.WEIGHTS = "/home/exx/workspace/round1_fpn/model_0034999.pth"
    cfg.SOLVER.IMS_PER_BATCH = 2
    cfg.SOLVER.BASE_LR = 0.00025  # pick a good LR
    cfg.SOLVER.MAX_ITER = 52045    # total number of training iterations
    cfg.SOLVER.STEPS = [324800, 365400]        # both steps exceed MAX_ITER, so the learning rate is never decayed
    cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 512   # faster, and good enough for this toy dataset (def$
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 6  # number of classes
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
    cfg.OUTPUT_DIR = "/home/exx/Desktop/yakkaya/d2go/workspace/round1_fpn"
    os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
    return cfg, runner

cfg, runner = prepare_for_launch()

I then converted the trained model to an int8 model.

model = runner.build_model(cfg)

# disable all the warnings
previous_level = logging.root.manager.disable
logging.disable(logging.INFO)

patch_d2_meta_arch()

#DetectionCheckpointer(model).load("/home/exx/workspace/round1_fpn/model_0034999.pth")

checkpointer = runner.build_checkpointer(cfg, model, save_dir=cfg.OUTPUT_DIR)
checkpoint = checkpointer.resume_or_load(cfg.MODEL.WEIGHTS, resume=True)

model.eval()

pytorch_model = model
pytorch_model.cpu()

datasets = cfg.DATASETS.TEST[0]
data_loader = runner.build_detection_test_loader(cfg, datasets)

predictor_path = convert_and_export_predictor(
  cfg,
  copy.deepcopy(pytorch_model),
  "torchscript_int8",
  './new',
  data_loader
)

# recover the logging level
logging.disable(previous_level)

The inference results from the converted model are not even close to those of the original model. It produces only a few detections above a 50% confidence score, and they are irrelevant.

from mobile_cv.predictor.api import create_predictor
predictor_path = "/home/exx/workspace/new/torchscript_int8"
model = create_predictor(predictor_path)

from d2go.utils.demo_predictor import DemoPredictor
predictor = DemoPredictor(model)

meta = MetadataCatalog.get(ts)

dataset_dicts = DatasetCatalog.get(ts)
for i, d in enumerate(random.sample(dataset_dicts, 20)):
    im = cv2.imread(d["file_name"])
    outputs = predictor(im)
    v = Visualizer(im[:, :, ::-1], metadata=meta, scale=0.8)
    v = v.draw_instance_predictions(outputs["instances"].to("cpu"))
    plt.figure(figsize = (14, 10))
    plt.imshow(cv2.cvtColor(v.get_image()[:, :, ::-1], cv2.COLOR_BGR2RGB))
    plt.savefig(f'/home/exx/workspace/inference/inf_{i}')

@nh9k

nh9k commented Jun 29, 2022

@yakupakkaya Do you know what the patch_d2_meta_arch() function in this code does?
I get an import error from "from d2go.export.d2_meta_arch import patch_d2_meta_arch":
-> ModuleNotFoundError: No module named 'd2go.export.d2_meta_arch'
I commented that line out and then converted the model.
I can't find any relevant function called patch_d2_meta_arch.

@wat3rBro
Contributor

@nh9k patch_d2_meta_arch was removed in #305. Ignoring that line should be fine.
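
For a script that has to run on D2Go versions from both before and after that removal, one option is to guard the import instead of deleting the line. This is only a minimal sketch: it assumes nothing beyond the module path quoted in the error message above, and it is not an official D2Go recommendation.

```python
# Guarded import: older D2Go versions ship patch_d2_meta_arch, newer ones do not.
try:
    from d2go.export.d2_meta_arch import patch_d2_meta_arch
except ImportError:  # also covers ModuleNotFoundError
    patch_d2_meta_arch = None

# Call the patch only when the installed version still provides it.
if patch_d2_meta_arch is not None:
    patch_d2_meta_arch()
```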

@wat3rBro
Contributor

wat3rBro commented Jun 30, 2022

@yakupakkaya Could you try exporting the model in fp32 first, i.e. change "torchscript_int8" to "torchscript"? That model should produce exactly the same results as the original PyTorch model, which helps identify whether the issue comes from the pipeline or from quantization. If the issue is in (post-training) quantization, keep in mind that it relies on the data for calibration, so please check the dataset and maybe try increasing cfg.QUANTIZATION.PTQ.CALIBRATION_NUM_IMAGES.
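
One way to make this comparison concrete is to score how many detections of the original model are reproduced by each exported model on the same image. The sketch below is illustrative only: `iou` and `match_rate` are hypothetical helper names, the boxes are made-up sample values, and each detection is assumed to be a `(box, class_id)` pair with the box given as `(x1, y1, x2, y2)`.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_rate(reference, candidate, iou_thresh=0.5):
    """Fraction of reference detections matched by a same-class candidate
    detection with IoU >= iou_thresh (greedy one-to-one matching)."""
    if not reference:
        return 1.0 if not candidate else 0.0
    unused = list(candidate)
    matched = 0
    for ref_box, ref_cls in reference:
        best, best_iou = None, iou_thresh
        for idx, (box, cls) in enumerate(unused):
            overlap = iou(ref_box, box)
            if cls == ref_cls and overlap >= best_iou:
                best, best_iou = idx, overlap
        if best is not None:
            unused.pop(best)  # each candidate may match only one reference
            matched += 1
    return matched / len(reference)


# Hypothetical detections for one image (not taken from the issue).
original = [((10, 10, 50, 50), 0), ((60, 60, 100, 100), 1)]
fp32_export = [((10, 10, 50, 50), 0), ((60, 60, 100, 100), 1)]
int8_export = [((12, 11, 49, 52), 0)]

fp32_rate = match_rate(original, fp32_export)
int8_rate = match_rate(original, int8_export)
print(f"fp32 match rate: {fp32_rate:.2f}, int8 match rate: {int8_rate:.2f}")
```

With Detectron2-style outputs, the pairs could be built from `outputs["instances"].pred_boxes.tensor.tolist()` and `outputs["instances"].pred_classes.tolist()`. A low rate for the fp32 export would point at the export pipeline, while a high fp32 rate combined with a low int8 rate would point at quantization.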

@reganh98

reganh98 commented Apr 1, 2023

Hi. I am facing a similar issue. D2Go does not work as expected when trained on the balloon dataset from the beginner tutorial. The problem occurs before the model has been exported. After 600 training iterations, total_loss still stays at roughly 1.4 or above and does not decrease as it does in the output shown at https://gilberttanner.com/blog/d2go-use-detectron2-on-mobile-devices/

The output log on my computer:

INFO:detectron2.utils.events: eta: 0:30:09  iter: 19  total_loss: 1.487  loss_cls: 0.787  loss_box_reg: 0.5896  loss_rpn_cls: 0.0748  loss_rpn_loc: 0.003866    time: 3.1011  last_time: 3.1276  data_time: 0.1986  last_data_time: 0.0017   lr: 2.4208e-07  
INFO:detectron2.utils.events: eta: 0:29:06  iter: 39  total_loss: 1.648  loss_cls: 0.7914  loss_box_reg: 0.7108  loss_rpn_cls: 0.06734  loss_rpn_loc: 0.005434    time: 3.1216  last_time: 3.3930  data_time: 0.0158  last_data_time: 0.0025   lr: 2.3375e-07  
INFO:detectron2.utils.events: eta: 0:28:12  iter: 59  total_loss: 1.482  loss_cls: 0.7895  loss_box_reg: 0.5614  loss_rpn_cls: 0.06274  loss_rpn_loc: 0.005041    time: 3.1338  last_time: 2.8475  data_time: 0.0236  last_data_time: 0.0006   lr: 2.2542e-07  
INFO:detectron2.utils.events: eta: 0:27:07  iter: 79  total_loss: 1.605  loss_cls: 0.7812  loss_box_reg: 0.6029  loss_rpn_cls: 0.1263  loss_rpn_loc: 0.005607    time: 3.1514  last_time: 3.5055  data_time: 0.0171  last_data_time: 0.0021   lr: 2.1708e-07  
INFO:detectron2.utils.events: eta: 0:26:04  iter: 99  total_loss: 1.489  loss_cls: 0.7837  loss_box_reg: 0.5866  loss_rpn_cls: 0.06772  loss_rpn_loc: 0.002717    time: 3.1517  last_time: 2.9021  data_time: 0.0180  last_data_time: 0.0025   lr: 2.0875e-07  
INFO:detectron2.utils.events: eta: 0:25:02  iter: 119  total_loss: 1.45  loss_cls: 0.7776  loss_box_reg: 0.561  loss_rpn_cls: 0.08126  loss_rpn_loc: 0.002865    time: 3.1560  last_time: 3.4852  data_time: 0.0162  last_data_time: 0.0015   lr: 2.0042e-07  
INFO:detectron2.utils.events: eta: 0:23:59  iter: 139  total_loss: 1.577  loss_cls: 0.7819  loss_box_reg: 0.6894  loss_rpn_cls: 0.07186  loss_rpn_loc: 0.005186    time: 3.1529  last_time: 2.9089  data_time: 0.0217  last_data_time: 0.0028   lr: 1.9208e-07  
INFO:detectron2.utils.events: eta: 0:22:58  iter: 159  total_loss: 1.518  loss_cls: 0.7854  loss_box_reg: 0.587  loss_rpn_cls: 0.097  loss_rpn_loc: 0.00756    time: 3.1715  last_time: 3.5652  data_time: 0.0169  last_data_time: 0.0012   lr: 1.8375e-07  
INFO:detectron2.utils.events: eta: 0:21:54  iter: 179  total_loss: 1.451  loss_cls: 0.7849  loss_box_reg: 0.5646  loss_rpn_cls: 0.08599  loss_rpn_loc: 0.004463    time: 3.1666  last_time: 2.9171  data_time: 0.0018  last_data_time: 0.0015   lr: 1.7542e-07  
INFO:detectron2.utils.events: eta: 0:20:50  iter: 199  total_loss: 1.494  loss_cls: 0.7737  loss_box_reg: 0.6321  loss_rpn_cls: 0.08868  loss_rpn_loc: 0.00437    time: 3.1631  last_time: 3.0934  data_time: 0.0017  last_data_time: 0.0012   lr: 1.6708e-07  
INFO:detectron2.utils.events: eta: 0:19:51  iter: 219  total_loss: 1.498  loss_cls: 0.7779  loss_box_reg: 0.5971  loss_rpn_cls: 0.06831  loss_rpn_loc: 0.005652    time: 3.1658  last_time: 3.2412  data_time: 0.0018  last_data_time: 0.0016   lr: 1.5875e-07  
INFO:detectron2.utils.events: eta: 0:18:48  iter: 239  total_loss: 1.559  loss_cls: 0.7744  loss_box_reg: 0.6321  loss_rpn_cls: 0.08105  loss_rpn_loc: 0.004709    time: 3.1745  last_time: 3.2508  data_time: 0.0018  last_data_time: 0.0019   lr: 1.5042e-07  
INFO:detectron2.utils.events: eta: 0:17:46  iter: 259  total_loss: 1.477  loss_cls: 0.7812  loss_box_reg: 0.5639  loss_rpn_cls: 0.07202  loss_rpn_loc: 0.004677    time: 3.1725  last_time: 3.3343  data_time: 0.0018  last_data_time: 0.0018   lr: 1.4208e-07  
INFO:detectron2.utils.events: eta: 0:16:42  iter: 279  total_loss: 1.441  loss_cls: 0.7773  loss_box_reg: 0.5792  loss_rpn_cls: 0.06777  loss_rpn_loc: 0.004963    time: 3.1688  last_time: 2.8273  data_time: 0.0018  last_data_time: 0.0011   lr: 1.3375e-07  
INFO:detectron2.utils.events: eta: 0:15:40  iter: 299  total_loss: 1.628  loss_cls: 0.7723  loss_box_reg: 0.7296  loss_rpn_cls: 0.08279  loss_rpn_loc: 0.004522    time: 3.1719  last_time: 3.4024  data_time: 0.0018  last_data_time: 0.0012   lr: 1.2542e-07  
INFO:detectron2.utils.events: eta: 0:14:36  iter: 319  total_loss: 1.431  loss_cls: 0.7827  loss_box_reg: 0.5478  loss_rpn_cls: 0.08486  loss_rpn_loc: 0.003578    time: 3.1698  last_time: 3.0613  data_time: 0.0020  last_data_time: 0.0016   lr: 1.1708e-07  
INFO:detectron2.utils.events: eta: 0:13:34  iter: 339  total_loss: 1.565  loss_cls: 0.7731  loss_box_reg: 0.6707  loss_rpn_cls: 0.08853  loss_rpn_loc: 0.003791    time: 3.1709  last_time: 3.2544  data_time: 0.0018  last_data_time: 0.0014   lr: 1.0875e-07  
INFO:detectron2.utils.events: eta: 0:12:32  iter: 359  total_loss: 1.445  loss_cls: 0.7731  loss_box_reg: 0.574  loss_rpn_cls: 0.1074  loss_rpn_loc: 0.004381    time: 3.1693  last_time: 3.4931  data_time: 0.0019  last_data_time: 0.0021   lr: 1.0042e-07  
INFO:detectron2.utils.events: eta: 0:11:29  iter: 379  total_loss: 1.515  loss_cls: 0.7767  loss_box_reg: 0.6044  loss_rpn_cls: 0.07838  loss_rpn_loc: 0.003056    time: 3.1713  last_time: 2.8036  data_time: 0.0018  last_data_time: 0.0013   lr: 9.2083e-08  
INFO:detectron2.utils.events: eta: 0:10:25  iter: 399  total_loss: 1.546  loss_cls: 0.7646  loss_box_reg: 0.7043  loss_rpn_cls: 0.0791  loss_rpn_loc: 0.005606    time: 3.1675  last_time: 3.0942  data_time: 0.0018  last_data_time: 0.0016   lr: 8.375e-08  
INFO:detectron2.utils.events: eta: 0:09:24  iter: 419  total_loss: 1.434  loss_cls: 0.7773  loss_box_reg: 0.5281  loss_rpn_cls: 0.09414  loss_rpn_loc: 0.002616    time: 3.1749  last_time: 2.8658  data_time: 0.0019  last_data_time: 0.0018   lr: 7.5417e-08  
INFO:detectron2.utils.events: eta: 0:08:21  iter: 439  total_loss: 1.492  loss_cls: 0.7694  loss_box_reg: 0.5298  loss_rpn_cls: 0.07042  loss_rpn_loc: 0.008807    time: 3.1730  last_time: 3.2621  data_time: 0.0018  last_data_time: 0.0015   lr: 6.7083e-08  
INFO:detectron2.utils.events: eta: 0:07:18  iter: 459  total_loss: 1.549  loss_cls: 0.7697  loss_box_reg: 0.6829  loss_rpn_cls: 0.07949  loss_rpn_loc: 0.00454    time: 3.1803  last_time: 2.8725  data_time: 0.0017  last_data_time: 0.0013   lr: 5.875e-08  
INFO:detectron2.utils.events: eta: 0:06:16  iter: 479  total_loss: 1.39  loss_cls: 0.7702  loss_box_reg: 0.5234  loss_rpn_cls: 0.08044  loss_rpn_loc: 0.00855    time: 3.1770  last_time: 3.0671  data_time: 0.0019  last_data_time: 0.0021   lr: 5.0417e-08  
INFO:detectron2.utils.events: eta: 0:05:14  iter: 499  total_loss: 1.539  loss_cls: 0.7682  loss_box_reg: 0.6657  loss_rpn_cls: 0.06907  loss_rpn_loc: 0.004616    time: 3.1932  last_time: 3.4245  data_time: 0.0023  last_data_time: 0.0006   lr: 4.2083e-08  
INFO:detectron2.utils.events: eta: 0:04:11  iter: 519  total_loss: 1.467  loss_cls: 0.7677  loss_box_reg: 0.6015  loss_rpn_cls: 0.07386  loss_rpn_loc: 0.004788    time: 3.2006  last_time: 3.0385  data_time: 0.0019  last_data_time: 0.0015   lr: 3.375e-08  
INFO:detectron2.utils.events: eta: 0:03:08  iter: 539  total_loss: 1.474  loss_cls: 0.7756  loss_box_reg: 0.555  loss_rpn_cls: 0.07801  loss_rpn_loc: 0.004378    time: 3.1999  last_time: 3.1426  data_time: 0.0018  last_data_time: 0.0018   lr: 2.5417e-08  
INFO:detectron2.utils.events: eta: 0:02:05  iter: 559  total_loss: 1.396  loss_cls: 0.7705  loss_box_reg: 0.5269  loss_rpn_cls: 0.07221  loss_rpn_loc: 0.003681    time: 3.2091  last_time: 2.9828  data_time: 0.0022  last_data_time: 0.0021   lr: 1.7083e-08  
INFO:detectron2.utils.events: eta: 0:01:03  iter: 579  total_loss: 1.618  loss_cls: 0.7681  loss_box_reg: 0.6831  loss_rpn_cls: 0.0952  loss_rpn_loc: 0.007867    time: 3.2183  last_time: 3.5709  data_time: 0.0019  last_data_time: 0.0019   lr: 8.75e-09  
INFO:fvcore.common.checkpoint:Saving checkpoint to ./output\model_final.pth
INFO:detectron2.utils.events: eta: 0:00:00  iter: 599  total_loss: 1.427  loss_cls: 0.7737  loss_box_reg: 0.6311  loss_rpn_cls: 0.0672  loss_rpn_loc: 0.003856    time: 3.2220  last_time: 3.0360  data_time: 0.0019  last_data_time: 0.0022   lr: 4.1667e-10  
INFO:detectron2.engine.hooks:Overall training speed: 598 iterations in 0:32:06 (3.2220 s / it)
INFO:detectron2.engine.hooks:Total training time: 0:32:07 (0:00:00 on hooks)
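
To inspect a log like this without reading every line, the iteration, total_loss and lr fields can be extracted with a small parser. In the log above, the logged lr falls from about 2.4e-07 to about 4.2e-10 while total_loss stays between roughly 1.39 and 1.65. Whether the small learning rate causes the flat loss is not established in this thread. The sketch below uses a hypothetical helper name, `parse_log`, and two lines copied from the log as sample input.

```python
import re

# Matches the iter, total_loss and lr fields of a detectron2 events log line.
LOG_PATTERN = re.compile(
    r"iter:\s*(\d+)\s+total_loss:\s*([\d.eE+-]+).*?lr:\s*([\d.eE+-]+)"
)


def parse_log(lines):
    """Return a list of (iteration, total_loss, lr) tuples, skipping other lines."""
    rows = []
    for line in lines:
        m = LOG_PATTERN.search(line)
        if m:
            rows.append((int(m.group(1)), float(m.group(2)), float(m.group(3))))
    return rows


# First and last metric lines of the log above.
sample_lines = [
    "INFO:detectron2.utils.events: eta: 0:30:09  iter: 19  total_loss: 1.487  loss_cls: 0.787  loss_box_reg: 0.5896  loss_rpn_cls: 0.0748  loss_rpn_loc: 0.003866    time: 3.1011  last_time: 3.1276  data_time: 0.1986  last_data_time: 0.0017   lr: 2.4208e-07  ",
    "INFO:fvcore.common.checkpoint:Saving checkpoint to ./output\\model_final.pth",
    "INFO:detectron2.utils.events: eta: 0:00:00  iter: 599  total_loss: 1.427  loss_cls: 0.7737  loss_box_reg: 0.6311  loss_rpn_cls: 0.0672  loss_rpn_loc: 0.003856    time: 3.2220  last_time: 3.0360  data_time: 0.0019  last_data_time: 0.0022   lr: 4.1667e-10  ",
]

rows = parse_log(sample_lines)
for iteration, total_loss, lr in rows:
    print(f"iter={iteration:4d}  total_loss={total_loss:.3f}  lr={lr:.3e}")
```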

Tested on:

  • Google Colab
  • torch 1.13.1+cpu, torchvision 0.14.1+cpu on Python 3.9.10, Windows 10

Unexpected vs. Expected output:

The results from an older Jupyter notebook, https://github.com/TannerGilbert/Object-Detection-and-Image-Segmentation-with-Detectron2/blob/592960ddc4243ff34af89a38124452a75309aa1c/D2Go/D2GO_Introduction.ipynb, look good, but the newer one does not work well: https://github.com/TannerGilbert/Object-Detection-and-Image-Segmentation-with-Detectron2/blob/master/D2Go/D2GO_Introduction.ipynb

Unexpected output:

Here are some sample outputs based on the intro tutorial code from @TannerGilbert:
Both balloon and non-balloon objects get similar confidence scores of around 50%.
[4 attached images]

Expected output:

Only balloons are detected with confidence > 80%
[1 attached image]

@yuzhuhua

Hello, could someone please explain how to quantize a model trained on custom data to int8? My attempts keep failing.

@yakupakkaya
Author

yakupakkaya commented Jan 18, 2024 via email
