Segmentation Fault Running Int8 Quantized Model on GPU #1437

Open
wendywangwwt opened this issue Dec 18, 2024 · 1 comment
Comments

@wendywangwwt

Hi! We ran into a segmentation fault when trying to run model inference on GPU. Below is a minimal example adapted from the tutorial (link):

import torch
import time

# define a floating point model where some layers could be statically quantized
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        # manually specify where tensors will be converted from floating
        # point to quantized in the quantized model
        x = self.quant(x)
        x = self.conv(x)
        x = self.relu(x)
        # manually specify where tensors will be converted from quantized
        # to floating point in the quantized model
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval mode for static quantization logic to work
model_fp32.eval()
input_fp32 = torch.randn(4, 1, 1024, 1024)

time_s = time.time()
with torch.no_grad():
    out = model_fp32(input_fp32)
time_e = time.time()

model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'relu']])
model_fp32_prepared = torch.ao.quantization.prepare(model_fp32_fused)

model_fp32_prepared(input_fp32)

model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)

model_int8 = model_int8.to('cuda:0')
input_fp32 = input_fp32.to('cuda:0')

with torch.no_grad():
    out = model_int8(input_fp32)

Output:

Segmentation fault (core dumped)

Inference on CPU works fine for the int8 model. Could someone please advise on the likely cause? Thank you!

@supriyar
Contributor

Hi! It looks like you're trying to use the eager-mode quantization flow from PyTorch core with the fbgemm backend, which currently only runs on x86 server CPUs.
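
For reference, a minimal sketch of running the converted model on CPU (the explicit backend setting is shown only for clarity; fbgemm is typically the default on x86):

import torch

# Eager-mode int8 kernels from the fbgemm backend only exist for CPU,
# so both the converted model and its inputs need to stay on CPU.
torch.backends.quantized.engine = 'fbgemm'

model_int8_cpu = model_int8.to('cpu')       # converted model from the snippet above
input_cpu = torch.randn(4, 1, 1024, 1024)   # CPU tensor, same shape as the repro

with torch.no_grad():
    out = model_int8_cpu(input_cpu)         # int8 conv + relu runs on CPU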

If you're interested in running on GPU, you can check out the torchao usage instructions at https://github.com/pytorch/ao/tree/main/torchao/quantization#a8w8-int8-dynamic-quantization. However, this flow doesn't support quantizing conv layers yet (only Linear is supported).
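
For illustration, here is a minimal sketch of that torchao flow on GPU, using a hypothetical Linear-only model. The API names (quantize_, int8_dynamic_activation_int8_weight) and the bfloat16 dtype are assumptions based on recent torchao releases, so please check the linked README for the exact call in your version:

import torch
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight

# torchao's int8 dynamic flow targets nn.Linear layers, not conv layers,
# so this toy model is built from Linear + ReLU.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
).to(device='cuda', dtype=torch.bfloat16).eval()

# Swap the Linear weights in place for int8 dynamic-activation / int8-weight versions.
quantize_(model, int8_dynamic_activation_int8_weight())

x = torch.randn(4, 1024, device='cuda', dtype=torch.bfloat16)
with torch.no_grad():
    out = model(x)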
