
Quantizing Falcon Instruct Model fails at tgi 0.9.0 #552

Closed
3 of 4 tasks
ChristophRaab opened this issue Jul 5, 2023 · 5 comments

Comments

@ChristophRaab
Contributor

ChristophRaab commented Jul 5, 2023

System Info

text-generation-inference: 0.9.0
Target: x86_64-unknown-linux-gnu
Cargo version: 1.70.0
Commit sha: e28a809
Docker label: sha-e28a809
nvidia-smi:

   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 450.216.04   Driver Version: 450.216.04   CUDA Version: 11.8     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
   | N/A   29C    P0    57W / 400W |      0MiB / 40537MiB |      0%      Default |
   |                               |                      |             Disabled |
   +-------------------------------+----------------------+----------------------+
   |   1  A100-SXM4-40GB      On   | 00000000:87:00.0 Off |                    0 |
   | N/A   33C    P0    54W / 400W |      0MiB / 40537MiB |      0%      Default |
   |                               |                      |             Disabled |
   +-------------------------------+----------------------+----------------------+
                                                                                  
   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   |  No running processes found                                                 |
   +-----------------------------------------------------------------------------+

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Steps to reproduce:

  1. Use text-generation-inference with the official Docker image, version 0.9.0, with the system info as above.
  2. Invoke the following command at the CLI:
text-generation-server quantize OpenAssistant/falcon-40b-sft-top1-560 /data/falcon-40b-oasst-gtpq --trust-remote-code
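
For reference, a minimal sketch of invoking this inside the official container; the image tag, volume mapping, and --shm-size value are assumptions, not taken from this report:

docker run --gpus all --shm-size 1g -v /data:/data \
    --entrypoint text-generation-server \
    ghcr.io/huggingface/text-generation-inference:0.9.0 \
    quantize OpenAssistant/falcon-40b-sft-top1-560 \
    /data/falcon-40b-oasst-gtpq --trust-remote-code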

Error Appearance:

  1. Quantization proceeds until layer 30.
  2. Program fails at layer 30 with: NotImplementedError: Cannot copy out of meta tensor; no data!

Log:

Quantizing layer 26/60..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attention.query_key_val | 35618.367    | -          | -         | 3.209 |
| self_attention.dense         | 1244.327     | -          | -         | 2.956 |
| mlp.dense_h_to_4h            | 226500.344   | -          | -         | 3.245 |
| mlp.dense_4h_to_h           | 12750.321    | -          | -         | 18.166 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 27/60..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attention.query_key_val | 34944.094    | -          | -         | 3.222 |
| self_attention.dense         | 3060.006     | -          | -         | 2.972 |
| mlp.dense_h_to_4h            | 233095.219   | -          | -         | 3.151 |
| mlp.dense_4h_to_h           | 12451.180    | -          | -         | 18.105 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 28/60..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attention.query_key_val | 28128.297    | -          | -         | 3.216 |
| self_attention.dense         | 1319.688     | -          | -         | 2.972 |
| mlp.dense_h_to_4h            | 240008.969   | -          | -         | 3.101 |
| mlp.dense_4h_to_h           | 13451.772    | -          | -         | 18.092 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 29/60..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
| self_attention.query_key_val | 40246.789    | -          | -         | 3.254 |
| self_attention.dense         | 2450.354     | -          | -         | 2.945 |
| mlp.dense_h_to_4h            | 246500.750   | -          | -         | 3.112 |
| mlp.dense_4h_to_h           | 13381.992    | -          | -         | 18.210 |
+------------------+--------------+------------+-----------+-------+


Quantizing layer 30/60..
+------------------+--------------+------------+-----------+-------+
|       name       | weight_error | fp_inp_SNR | q_inp_SNR | time  |
+==================+==============+============+===========+=======+
Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 187, in quantize
    quantize(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/quantize.py", line 800, in quantize
    quantizers = sequential(

  File "/opt/conda/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/quantize.py", line 667, in sequential
    layer = layers[i].to(dev)

  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)

  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)

  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)

  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

NotImplementedError: Cannot copy out of meta tensor; no data!

Expected behavior

Quantized model is saved at /data/falcon-40b-oasst-gtpq

@Narsil
Collaborator

Narsil commented Jul 6, 2023

I think 2 A100s will not fit the 40B model in its entirety (and apparently not even in CPU RAM).

The tensor being meta most likely means it was offloaded to disk (everything is handled automatically by accelerate).
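
For illustration, a minimal sketch (not from the TGI codebase) of why a parameter that is still on the meta device cannot be moved with .to():

from torch import nn
from accelerate import init_empty_weights

# Parameters created under init_empty_weights live on the "meta" device:
# they have shapes and dtypes, but no actual storage behind them.
with init_empty_weights():
    layer = nn.Linear(4096, 4096)

print(layer.weight.device)  # meta

try:
    # Copying a meta tensor to a real device (cpu or cuda) needs data that
    # was never allocated, hence the error in the traceback above.
    layer.to("cpu")
except NotImplementedError as err:
    print(err)  # Cannot copy out of meta tensor; no data!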

@ChristophRaab
Contributor Author

ChristophRaab commented Jul 7, 2023

Okay. That would be great and only a minor problem. I will try it out and let you know.

@Narsil
Collaborator

Narsil commented Jul 18, 2023

Can you try the latest version? I just merged an upgrade to the quantization script: #587

It should use barely any VRAM at all now.

@ChristophRaab
Contributor Author

ChristophRaab commented Jul 19, 2023

I can confirm that with #587 and release v0.9.3 the issue is fixed. I was able to quantize the model with one A100 (40GB) and a limit of 128 GB of RAM.

However, there seems to be a bug in the following code snippet: the trust_remote_code variable is not passed to AutoModelForCausalLM, which throws an error if a model with third-party code is loaded.

with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)
model = model.eval()

It should be changed to:

with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16, trust_remote_code=trust_remote_code)
model = model.eval()
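
For context, a rough sketch of the full loading pattern; the load_empty_model helper is illustrative and not the actual quantize.py code:

import torch
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

def load_empty_model(model_id, trust_remote_code=False):
    # trust_remote_code has to reach both the config and the model so that
    # repositories shipping custom modeling code (e.g. Falcon) can be loaded.
    config = AutoConfig.from_pretrained(model_id, trust_remote_code=trust_remote_code)
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(
            config, torch_dtype=torch.float16, trust_remote_code=trust_remote_code
        )
    return model.eval()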

I made an MR; see below.

Narsil pushed a commit that referenced this issue Jul 20, 2023
# What does this PR do?

Fixes a bug that appeared with MR #587, which fixed issue #552.
See the discussion in #552.

With MR #587 the trust_remote_code variable is not passed to AutoModelForCausalLM, although it is present in the function signature. This prevents models like Falcon from being quantized, because trust_remote_code is required. This MR fixes the issue.


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [X] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section?
- [X] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.
@Narsil
@ChristophRaab
Contributor Author

Fixed with #587 and #647.
