Llama-3.1-8B-Instruct-4bit keeps looping at the end. #1059
I ran the same command; below is the output I get, which looks pretty reasonable. Can you share what you are seeing? Also, could you share the machine, the OS, and the version of MLX as well?
Yeah, your output looks correct. MacBook Pro 16" with M3 Max, 64GB. Thanks!
Also, I noticed that your prompt processing speed is 643.576 tokens-per-sec, and mine is 417.707.
I was able to reproduce this with the same long prompt on an M1 Max. I did a bit of bisecting, and it looks like after ml-explore/mlx#1509 I'm getting non-deterministic outputs.
Ok, I'll take a look at that. But it's not in MLX 0.19, so it can't really be the same issue as above.
@chigkim were you building the main branch of MLX from source, or did you install MLX from PyPI?
For MLX, I installed from pip. For MLX-LM, I tried both pip and git.
I've tried to reproduce this on several machines (M1 Max, M2 Ultra, M1 Ultra, and M3 Max) and so far I'm not seeing any issues in the output. Some questions and suggestions:
I just tried mlx==0.19.1 as well as 0.20.0, along with mlx-lm==0.19.3.
I'm really stumped by this one, to be honest. I tried your exact command, which works fine on several machines:
Given that it works for me, I don't think there is a problem with the command or the prompt (unless you've changed it). Without the ability to reproduce, it's very difficult to debug. Anything you can do to fuzz around and see whether there are conditions the looping is sensitive to would be a great help in getting to the bottom of this. Some ideas:
It would be really great if you have time for any of those and can share the results. Thanks!
I deleted my environment and created a fresh one with
That is really curious. There was a bug in one of our qmv kernels that was recently fixed. It might be possible (though I think pretty unlikely) that this would account for the looping behavior you are seeing. Do you only notice looping for the very long prompt, or do you also see it for shorter prompts? Also, you have a few arguments set; I'm wondering, if you turn them off or change them, does the looping go away?
I've tried without specifying --max-kv-size, --temp, --top-p, and --seed, but it looped exactly the same way in each case.
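For reference, a minimal Python-level reproduction with no sampling flags at all could look like the sketch below; it assumes the mlx_lm load/generate API as of mlx-lm 0.19.x, with the model name and input file taken from the commands in this thread.

```python
# Sketch: reproduce the run from Python with no sampling arguments at all,
# relying on the library defaults (temperature 0.0, i.e. greedy decoding,
# in these versions).
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

# The file used in this thread already contains "Summarize the following:"
# plus the article text, so it becomes the whole user message.
user_message = open("../text/portugal.txt").read()
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": user_message}],
    tokenize=False,
    add_generation_prompt=True,
)

text = generate(model, tokenizer, prompt=prompt, max_tokens=1000, verbose=True)
```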
Ok. Let's see if it's fixed after our next release (which includes a fix for quantization in some cases, ml-explore/mlx#1577). If it's not fixed, I will try fuzzing around a bit to see if it can be reproduced on our side.
Do you guys need to requantize and update the model on HF, or can I just pull the main branch, install with pip install -e ., and test? |
No, you can use the same model (no requantization needed). You can test by pulling and building the main branch. It would be great to know whether that works for you or not.
Oh, that was merged four days ago, and I had already pulled the latest main branch when I tested. So the fix didn't work. :(
Ok, so I might have some potentially good news that could lead to something...
Interesting... You can see all the commits between v0.18.1 and v0.19.0 here: ml-explore/mlx@v0.18.1...v0.19.0. The commit in there that seems most likely to have changed something for LLM inference is the fused attention. Can you try building the commits before and after to see if that is the case? So concretely, for including the fused attention:
And for the commit just before:
That would be my first guess as to a related cause, but it would be good to check and see.
Yep, that was it! The commit 50d8bed loops, but 9dd72cd doesn't.
Could you try running with Metal validation enabled to see if that gives us any clues? (Low probability, but when it hits, it hits well):
Also, you can precompute the prompt cache to speed up testing:
Then use that to generate:
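The exact commands aren't shown above, but the same idea at the Python level looks roughly like the sketch below. The helper names (make_prompt_cache, save_prompt_cache) and the chunked-prefill approach are assumptions based on the prompt-caching support in mlx-lm around 0.19/0.20.

```python
# Sketch: fill a KV cache by running the model over the long prompt once,
# then save it so later test runs can skip the ~32k-token prefill.
# The mlx_lm.models.cache helpers are assumed to exist as in mlx-lm ~0.19/0.20.
import mlx.core as mx
from mlx_lm import load
from mlx_lm.models.cache import make_prompt_cache, save_prompt_cache

model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
# In practice you would cache the chat-templated prompt; the raw file is used
# here just to keep the sketch short.
tokens = tokenizer.encode(open("../text/portugal.txt").read())

cache = make_prompt_cache(model)
chunk_size = 2048  # prefill in chunks to keep peak memory bounded
for i in range(0, len(tokens), chunk_size):
    chunk = mx.array(tokens[i : i + chunk_size])[None]
    mx.eval(model(chunk, cache=cache))  # the cache is updated in place

save_prompt_cache("portugal_prompt_cache.safetensors", cache)
```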
@chigkim Adding on to @awni's last message: can you run the following commands and report back the outputs?
The first two should have exactly the same output, looping or not. The next four should all have different outputs.
Hmm, the output doesn't seem to be any different. Am I supposed to look for something?
% METAL_DEVICE_WRAPPER_TYPE=1 METAL_DEBUG_ERROR_MODE=0 mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --max-kv-size 33000 --max-tokens 1000 --temp 0.0 --top-p 0.9 --seed 1000 --prompt -<../text/portugal.txt;say done
2024-11-13 22:46:18.124 python[16522:3704956] Metal API Validation Enabled
Fetching 6 files: 100%|████████████████████████| 6/6 [00:00<00:00, 45507.82it/s]
==========
Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>
Summarize the following:
......
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Portugal is a country located on the Iberian Peninsula in southwestern Europe.
.....
==========
Prompt: 32134 tokens, 414.286 tokens-per-sec
Generation: 1000 tokens, 32.263 tokens-per-sec
Peak memory: 12.535 GB
Nope, it would have been obvious if it had thrown a validation error. Thanks for checking.
Ooo, we're getting somewhere... |
I am pretty sure it is a numerical stability issue. The interesting part is that the fused attention all happens in float32, so it should be more numerically accurate. If you are building from source, could you edit
Sorry, I'm not sure what to edit in order to save the logits. Could you provide a patch?
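Purely as an illustration of the kind of patch being discussed (not the maintainers' actual change): if you can locate where the next-token logits are produced in the generation loop, a helper like this could dump them per step so the two MLX commits can be compared offline. The function name, step argument, and output directory are hypothetical.

```python
# Hypothetical debugging helper: save each step's logits to a .npy file so
# runs built from different MLX commits can be diffed afterwards.
import os
import mlx.core as mx

def dump_logits(logits: mx.array, step: int, out_dir: str = "logit_dumps") -> None:
    os.makedirs(out_dir, exist_ok=True)
    # mx.save writes NumPy-compatible .npy files.
    mx.save(os.path.join(out_dir, f"step_{step:05d}.npy"), logits.astype(mx.float32))
```

The dumps from the two builds can then be loaded with numpy and compared element-wise (for example, the maximum absolute difference per step) to see where the outputs start to diverge.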
Oh, I think I found something. I realized that running 8-bit automatically inserts a system prompt, but running 4-bit doesn't. Running 4-bit without --system-prompt loops as in my usual result. Notice that it starts with the user tag and no system prompt.
mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --max-kv-size 33000 --max-tokens 2000 --temp 0.0 --top-p 0.9 --seed 1000 --prompt -<../text/portugal.txt;say done
Fetching 6 files: 100%|███████████████████████| 6/6 [00:00<00:00, 148910.20it/s]
==========
Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>
Summarize the following:
......<|eot_id|><|start_header_id|>assistant<|end_header_id|>
......
==========
Prompt: 32134 tokens, 422.547 tokens-per-sec
Generation: 2000 tokens, 33.290 tokens-per-sec
Running 8-bit without --system-prompt does not loop. Notice that it automatically inserted a system prompt with the Cutting Knowledge Date and Today Date.
mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-8bit --max-kv-size 33000 --max-tokens 2000 --temp 0.0 --top-p 0.9 --seed 1000 --prompt -<../text/portugal.txt;say done
Fetching 7 files: 100%|███████████████████████| 7/7 [00:00<00:00, 109145.46it/s]
==========
Prompt: <|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024
<|eot_id|><|start_header_id|>user<|end_header_id|>
Summarize the following:
......<|eot_id|><|start_header_id|>assistant<|end_header_id|>
......
==========
Prompt: 32159 tokens, 422.712 tokens-per-sec
Generation: 819 tokens, 25.135 tokens-per-sec
Running 4-bit with --system-prompt does not loop.
mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --max-kv-size 33000 --max-tokens 2000 --temp 0.0 --top-p 0.9 --seed 1000 --system-prompt $'Cutting Knowledge Date: December 2023\nToday Date: 23 July 2024\n\nYou are a helpful assistant' --prompt -<../text/portugal.txt;say done
Fetching 6 files: 100%|███████████████████████| 6/6 [00:00<00:00, 133152.51it/s]
==========
Prompt: <|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 23 July 2024
You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>
Summarize the following:
......<|eot_id|><|start_header_id|>assistant<|end_header_id|>
......
==========
Prompt: 32164 tokens, 434.508 tokens-per-sec
Generation: 1153 tokens, 33.392 tokens-per-sec
Peak memory: 12.420 GB
Running 8-bit with --system-prompt results in a system prompt with the Cutting Knowledge Date and Today Date twice.
mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-8bit --max-kv-size 33000 --max-tokens 2000 --temp 0.0 --top-p 0.9 --seed 1000 --system-prompt $'Cutting Knowledge Date: December 2023\nToday Date: 23 July 2024\n\nYou are a helpful assistant' --prompt -<../text/portugal.txt;say done
Fetching 7 files: 100%|████████████████████████| 7/7 [00:00<00:00, 13855.65it/s]
==========
Prompt: <|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024
Cutting Knowledge Date: December 2023
Today Date: 23 July 2024
You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>
Summarize the following:
......<|eot_id|><|start_header_id|>assistant<|end_header_id|>
......
==========
Prompt: 32184 tokens, 423.930 tokens-per-sec
Generation: 769 tokens, 25.218 tokens-per-sec
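For reference, the rendering difference described above can be checked directly against the tokenizer's chat template; a quick sketch using the Hugging Face tokenizer for the same repo:

```python
# Sketch: print what the chat template renders with and without an explicit
# system message, to see whether the date-bearing system header gets inserted.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

user_only = [{"role": "user", "content": "Summarize the following: ..."}]
with_system = [{"role": "system", "content": "You are a helpful assistant"}] + user_only

print(tok.apply_chat_template(user_only, tokenize=False, add_generation_prompt=True))
print(tok.apply_chat_template(with_system, tokenize=False, add_generation_prompt=True))
```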
Good catch regarding the system prompt! The model tokenizer in the HF repo seems to have been misconfigured for some unknown reason. I've fixed it and it should update automatically the next time you use the model. Regarding the concatenation of the system prompt with the date, this is the expected behavior for the tokenizer:
It always concatenates the system message to the date string. If you want to change the date string, you can do so by passing date_string to apply_chat_template:
messages = [
{"role": "system", "content": "You are a helpful AI assistant"},
{"role": "user", "content": "hi!"},
]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, date_string="1.2.3"))
Getting rid of the cutting knowledge date entirely would require modifying the chat template itself, which we likely won't provide an argument for in the CLI, but you can easily do that in Python directly. I am going to close this issue since it appears to be mostly resolved! Thanks for your help getting to the bottom of this!
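For anyone who wants to do that, a sketch of what the template edit could look like; the exact Jinja fragment to strip depends on the template shipped with the model, so treat the .replace() below as a hypothetical placeholder rather than a drop-in fix.

```python
# Sketch: inspect and modify the tokenizer's chat template in Python.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

# Inspect the Jinja template to find where the date lines are emitted.
print(tok.chat_template)

# Hypothetical edit: remove the fragment that emits the cutting-knowledge line.
tok.chat_template = tok.chat_template.replace("Cutting Knowledge Date: December 2023", "")

messages = [{"role": "user", "content": "hi!"}]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```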
Thanks for fixing the tokenizer. |
I'll reopen this in that case. I tried quite a few seeds and it still did not loop... which I find a bit strange. It seems like looping happens much more frequently in your setup 🤔 |
I'm on mlx-lm v0.19.1.
Running the following command with the 4-bit model produced a bug where it would just generate the full 1000 max tokens and repeat the last two paragraphs over and over.
Here is my full prompt, from a Wikipedia article.
Just to see what happens, I increased --max-kv-size to 34k and --max-tokens to 2000, and it generated all 2000 tokens, still looping.
Running the exact same command with 4bit replaced by 8bit generated the correct text (811 tokens) and stopped at the end without looping.