
Quantizing Falcon Instruct Model fails at tgi 0.9.0 #552

Closed
3 of 4 tasks
ChristophRaab opened this issue Jul 5, 2023 · 5 comments

Comments

@ChristophRaab
Contributor

ChristophRaab commented Jul 5, 2023

System Info

text-generation-inference: 0.9.0
Target: x86_64-unknown-linux-gnu
Cargo version: 1.70.0
Commit sha: e28a809
Docker label: sha-e28a809
nvidia-smi:

   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 450.216.04   Driver Version: 450.216.04   CUDA Version: 11.8     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
   | N/A   29C    P0    57W / 400W |      0MiB / 40537MiB |      0%      Default |
   |                               |                      |             Disabled |
   +-------------------------------+----------------------+----------------------+
   |   1  A100-SXM4-40GB      On   | 00000000:87:00.0 Off |                    0 |
   | N/A   33C    P0    54W / 400W |      0MiB / 40537MiB |      0%      Default |
   |                               |                      |             Disabled |
   +-------------------------------+----------------------+----------------------+
                                                                                  
   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   |  No running processes found                                                 |
   +-----------------------------------------------------------------------------+

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Steps to reproduce:

  1. Use text-generation-inference with the official Docker image, version 0.9.0, with the system info as above.
  2. Invoke the following command at the CLI:
text-generation-server quantize OpenAssistant/falcon-40b-sft-top1-560 /data/falcon-40b-oasst-gtpq --trust-remote-code
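
For reference, a minimal sketch of invoking this inside the official container; the image tag, volume mapping, and --shm-size value are assumptions, not taken from this report:

docker run --gpus all --shm-size 1g -v /data:/data \
    --entrypoint text-generation-server \
    ghcr.io/huggingface/text-generation-inference:0.9.0 \
    quantize OpenAssistant/falcon-40b-sft-top1-560 \
    /data/falcon-40b-oasst-gtpq --trust-remote-code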

Error Appearance:

  1. Quantization proceeds until layer 30.
  2. Program fails at layer 30 with: NotImplementedError: Cannot copy out of meta tensor; no data!

Log:

Quantizing layer 26/60..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attention.query_key_val | 35618.367    | -          | -         | 3.209 |
| self_attention.dense         | 1244.327     | -          | -         | 2.956 |
| mlp.dense_h_to_4h            | 226500.344   | -          | -         | 3.245 |
| mlp.dense_4h_to_h           | 12750.321    | -          | -         | 18.166 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 27/60..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attention.query_key_val | 34944.094    | -          | -         | 3.222 |
| self_attention.dense         | 3060.006     | -          | -         | 2.972 |
| mlp.dense_h_to_4h            | 233095.219   | -          | -         | 3.151 |
| mlp.dense_4h_to_h           | 12451.180    | -          | -         | 18.105 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 28/60..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attention.query_key_val | 28128.297    | -          | -         | 3.216 |
| self_attention.dense         | 1319.688     | -          | -         | 2.972 |
| mlp.dense_h_to_4h            | 240008.969   | -          | -         | 3.101 |
| mlp.dense_4h_to_h           | 13451.772    | -          | -         | 18.092 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 29/60..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attention.query_key_val | 40246.789    | -          | -         | 3.254 |
| self_attention.dense         | 2450.354     | -          | -         | 2.945 |
| mlp.dense_h_to_4h            | 246500.750   | -          | -         | 3.112 |
| mlp.dense_4h_to_h           | 13381.992    | -          | -         | 18.210 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 30/60..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 187, in quantize
    quantize(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/quantize.py", line 800, in quantize
    quantizers = sequential(

  File "/opt/conda/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/quantize.py", line 667, in sequential
    layer = layers[i].to(dev)

  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)

  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)

  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)

  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

NotImplementedError: Cannot copy out of meta tensor; no data!

Expected behavior

Quantized model is saved at /data/falcon-40b-oasst-gtpq

@Narsil
Collaborator

Narsil commented Jul 6, 2023

I think 2 A100s will not fit the 40B model in its entirety (and apparently not even in CPU RAM).

The tensor being meta most likely means it was offloaded to disk (everything is handled automatically by accelerate).
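
For illustration, a minimal sketch (not from the TGI codebase) of why a parameter that is still on the meta device cannot be moved with .to():

from torch import nn
from accelerate import init_empty_weights

# Parameters created under init_empty_weights live on the "meta" device:
# they have shapes and dtypes, but no actual storage behind them.
with init_empty_weights():
    layer = nn.Linear(4096, 4096)

print(layer.weight.device)  # meta

try:
    # Copying a meta tensor to a real device (cpu or cuda) needs data that
    # was never allocated, hence the error in the traceback above.
    layer.to("cpu")
except NotImplementedError as err:
    print(err)  # Cannot copy out of meta tensor; no data!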

@ChristophRaab
Contributor Author

ChristophRaab commented Jul 7, 2023

Okay. That would be great and only a minor problem. I will try it out and let you know.

@Narsil
Collaborator

Narsil commented Jul 18, 2023

Can you try the latest version? I just merged an upgrade to the quantization script: #587

It should use barely any VRAM at all now.

@ChristophRaab
Contributor Author

ChristophRaab commented Jul 19, 2023

I can confirm that with #587 and release v0.9.3 the issue is fixed. I was able to quantize the model with one A100 (40GB) and a limit of 128 GB of RAM.

However, there seems to be a bug in the following code snippet: the trust_remote_code variable is not passed to AutoModelForCausalLM, which throws an error if a model with third-party code is loaded.

with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)
model = model.eval()

It should be changed to:

with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16, trust_remote_code=trust_remote_code)
model = model.eval()
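
For context, a rough sketch of the full loading pattern; the load_empty_model helper is illustrative and not the actual quantize.py code:

import torch
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

def load_empty_model(model_id, trust_remote_code=False):
    # trust_remote_code has to reach both the config and the model so that
    # repositories shipping custom modeling code (e.g. Falcon) can be loaded.
    config = AutoConfig.from_pretrained(model_id, trust_remote_code=trust_remote_code)
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(
            config, torch_dtype=torch.float16, trust_remote_code=trust_remote_code
        )
    return model.eval()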

I made an MR; see below.

Narsil pushed a commit that referenced this issue Jul 20, 2023
# What does this PR do?

Fixes a bug that appeared with MR #587, which fixed issue #552.
See the discussion in #552.

With MR #587 the trust_remote_code variable is not passed to AutoModelForCausalLM, although it is present in the function signature. This prevents models like Falcon from being quantized, because trust_remote_code is required. This MR fixes the issue.


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [X] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section?
- [X] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.
@Narsil
@ChristophRaab
Contributor Author

Fixed with #587 and #647.
