Llama-3.1-8B-Instruct-4bit keeps looping at the end. #1059

Open

chigkim opened this issue Oct 21, 2024 · 32 comments


chigkim commented Oct 21, 2024

I'm on mlx-lm v0.19.1.

Running the following command with the 4-bit model produced a bug where it generates the full 1000 max tokens, repeating the last two paragraphs over and over.

mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --max-kv-size 33000 --max-tokens 1000 --temp 0.0 --top-p 0.9 --seed 1000 --prompt -<../text/portugal.txt;say done

Here is my full prompt, taken from the Wikipedia article.

Just to see what happens, I increased --max-kv-size to 34k and --max-tokens to 2000, and it generated 2k tokens with the loop.

Running the exact same command replacing 4bit with 8bit generated the correct text with 811 tokens and stopped at the end without the loop.


awni commented Oct 22, 2024

I ran the same command; below is the output I get, which looks pretty reasonable. Can you share what you are seeing? Also, could you share the machine, the OS, and the version of MLX as well?

The text provides a comprehensive overview of Portugal's history, geography, government, economy, demographics, culture, and more. Here are the main points:

**History**

* Portugal is a country located on the Iberian Peninsula, in Southwestern Europe.
* The country has a rich history, with evidence of human presence dating back to prehistoric times.
* The territory was inhabited by various peoples, including the Celts, Iberians, and Romans.
* Portugal gained independence from the Kingdom of León in 1143 and became a major economic and political power in the 15th and 16th centuries.
* The country was a colonial power, with colonies in Africa, Asia, and the Americas.
* Portugal voluntarily entered a dynastic union with Spain in 1580, which lasted until 1640.
* The country experienced a period of decline in the 19th and 20th centuries, but has since recovered and become a developed country.

**Geography**

* Portugal occupies an area on the Iberian Peninsula and two archipelagos in the Atlantic Ocean: Madeira and the Azores.
* The country has a diverse geography, with mountains, plains, and coastlines.
* The Tagus River is the main river in Portugal and flows through the capital city of Lisbon.
* The country has a Mediterranean climate, with warm summers and mild winters.

**Government**

* Portugal is a semi-presidential representative democratic republic.
* The government is divided into four branches: the President, the Government, the Assembly of the Republic, and the Courts.
* The President is elected to a five-year term and has executive powers.
* The Assembly of the Republic is a unicameral body composed of up to 230 deputies.
* The Government is headed by the Prime Minister and includes Ministers and Secretaries of State.

**Economy**

* Portugal is a developed and high-income country.
* The country has a diverse economy, with a strong focus on services, industry, and agriculture.
* The country has a significant fishing industry and is a major producer of cork and carob.
* Portugal has a high level of foreign debt and has received a bailout from the European Union and the International Monetary Fund.
* The country has a strong tradition of trade and commerce, with a significant presence of foreign companies.

**Demographics**

* The population of Portugal is approximately 10.5 million people.
* The country has a relatively homogeneous population, with a single language and a single religion.
* The population is aging, with a high percentage of people over 65.
* The country has a low fertility rate, with an average of 1.5 children per woman.
* The country has a significant number of immigrants, with around 7.5% of the population being foreign-born.

**Culture**

* Portugal has a rich cultural heritage, with a strong tradition of music, dance, and art.
* The country has a significant number of UNESCO World Heritage Sites, including the Jerónimos Monastery and the Tower of Belém.
* The country has a strong tradition of folk music and dance, with a significant number of festivals and events throughout the year.
* The country has a significant number of museums and art galleries, including the National Museum of Ancient Art and the Calouste Gulbenkian Museum.

**Sport**

* Football is the most popular sport in Portugal, with a significant number of fans and a strong national team.
* The country has a significant number of sports clubs, including Benfica, Sporting CP, and FC Porto.
* The country has a strong tradition of athletics, with a significant number of Olympic medals and world championships.
* The country has a significant number of water sports, including surfing, windsurfing, and kitesurfing.

**Visual Arts**

* Portugal has a rich history in painting, with a significant number of famous painters, including Nuno Gonçalves and Vasco Fernandes.
* The country has a strong tradition of modern art, with a significant number of famous artists, including Amadeo de Souza-Cardoso and Almada Negreiros.
* The country has a significant number of contemporary artists, including Helena Almeida, Joana Vasconcelos, and Julião Sarmento.

Overall, the text provides a comprehensive overview of Portugal's history, geography, government, economy, demographics, culture, and more.
==========
Prompt: 32188 tokens, 643.576 tokens-per-sec
Generation: 892 tokens, 31.103 tokens-per-sec
Peak memory: 12.535 GB


chigkim commented Oct 22, 2024

Yeah, your output looks correct.
Here is my full log. It keeps looping after "The health system in Portugal is..."
https://pastebin.com/raw/XzgVh1Zc

MacBook Pro 16" with M3 Max, 64GB
macOS 15.0.1
Python 3.12.7 (main, Oct 1 2024, 02:05:46) [Clang 16.0.0 (clang-1600.0.26.3)] on darwin
Name: mlx
Version: 0.19.0
Name: mlx-lm
Version: 0.19.1

Thanks!


chigkim commented Oct 22, 2024

Also, I noticed that your prompt processing speed is 643.576 tokens-per-sec, and mine is 417.707.
The format of the output looks very different. I saw that style when I ran mlx-community/Llama-3.2-3B-Instruct-4bit.
I wonder if I have older weights with a bug or something. Is there a way to check whether I have the latest weights?

@barronalex

I was able to reproduce this with the same long prompt on M1 Max.

I did a bit of bisecting and it looks like after ml-explore/mlx#1509 I'm getting non-deterministic outputs.
Weirdly it only seems to happen with long prompts.


awni commented Oct 24, 2024

Ok, I'll take a look at that. But it's not in MLX 0.19, so it can't really be the same issue as above.


awni commented Oct 24, 2024

@chigkim were you building the main branch of MLX from source, or did you install MLX from PyPI?


chigkim commented Oct 24, 2024

For MLX, I installed from pip. For MLX-LM, I tried both pip and git.


awni commented Oct 26, 2024

I've tried to reproduce this on several machines (M1 Max, M2 Ultra, M1 Ultra, and M3 Max), but so far I'm not seeing any issues in the output. Some questions / suggestions:

  • If you have a chance could you check mlx==0.19.1 to see if it still reproduces?
  • Is the output always the same looping behavior or is it non-deterministic?


chigkim commented Nov 8, 2024

I just tried mlx==0.19.1 as well as 0.20.0 along with mlx-lm==0.19.3.
The output is always 1000 tokens with the same looping behavior, 100% of the time.
I wonder if there's any mistake in my command or prompt file?


awni commented Nov 8, 2024

I'm really stumped by this one, to be honest. I tried your exact command, which works fine on several machines:

mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --max-kv-size 33000 --max-tokens 1000 --temp 0.0 --top-p 0.9 --seed 1000 --prompt -<../text/portugal.txt;say done

Given that it works for me, I don't think there is a problem with the command or the prompt (unless you've changed it).

Without the ability to reproduce, it's very difficult to debug. Anything you can do to fuzz around and find conditions the looping is sensitive to would really help us get to the bottom of this.

Some ideas:

  • Try rebooting
  • Try a new python env with a fresh install of MLX, MLX LM
  • Try a different model. Maybe a smaller one like mlx-community/llama-3.2-1B-Instruct-4bit
  • Anything else you can think of..

Would be really great if you have time for any of those and can share the results. Thanks!


chigkim commented Nov 9, 2024

I deleted my environment and created a fresh one with python -m venv.
I deleted the model in the Huggingface cache with huggingface-cli.
I deleted mlx-examples from my local machine, and cloned from fresh.
I installed mlx-examples/llms with pip install -e ..
That's the only thing I installed in the new environment.
It still loops at the end. :(
However, mlx-community/Llama-3.2-3B-Instruct-4bit works fine.
Also mlx-community/Meta-Llama-3.1-8B-Instruct-8bit works fine.
Only mlx-community/Meta-Llama-3.1-8B-Instruct-4bit creates a loop at the end.


awni commented Nov 9, 2024

That is really curious..

There was a bug in one of our qmv kernels that was recently fixed. It might be possible (but I think pretty unlikely) that this would account for the looping behavior you are seeing.

Do you only notice looping for the very long prompt or do you also see it for shorter prompts?

Also, you have a few arguments set. I'm wondering whether the looping goes away if you turn them off or change them:

  • --max-kv-size 33000 - don't set that, does it still loop?
  • --temp 0.0 --top-p 0.9 - don't set that, does it still loop?
  • --seed 1000 - change that to something else, does it still loop?


chigkim commented Nov 11, 2024

I've tried without specifying max-kv-size, temp, top_p, and seed, but every run looped exactly the same way.
Interestingly, if I delete the last section about visual art from the prompt, it doesn't loop!
That's a difference of only 295 tokens (32134 - 31839).


awni commented Nov 12, 2024

Ok. Let's see if it's fixed after our next release (which includes a fix for quantization in some cases ml-explore/mlx#1577).

If it's not fixed, I will try fuzzing around a bit to see if it can be reproduced on our side.


chigkim commented Nov 12, 2024

Do you guys need to requantize and update the model on HF, or can I just pull the main branch, install with pip install -e ., and test?


awni commented Nov 12, 2024

No you can use the same model (no requantization needed). You can test by pulling and building the main branch. It would be great to know if that works for you or not.


chigkim commented Nov 13, 2024

Oh that was merged 4 days ago, and I already pulled the latest main branch when I tested. So the fix didn't work. :(


chigkim commented Nov 13, 2024

Ok, so I might have some potentially good news that could lead to something...
It looks like something happened between mlx 0.18.1 and 0.19.0.
It doesn't loop in 0.18.1, but it does loop in 0.19.0.


awni commented Nov 13, 2024

Interesting... You can see all the commits between 0.18.1 and 0.19.0 here: ml-explore/mlx@v0.18.1...v0.19.0

The commit in there that seems most likely to have changed something for LLM inference is the fused attention. Can you try building the commits just before and after it to see if that is the case?

So concretely:

For including the fused attention:

git checkout 50d8bed4688e04d8ba4cc5b8e20a79f22e8e93ce
env CMAKE_BUILD_PARALLEL_LEVEL=10 pip install .

And the commit just before:

git checkout 9dd72cd421260ebc0f30e773f6b35fdf87555806
env CMAKE_BUILD_PARALLEL_LEVEL=10 pip install .

That would be my first guess as to a related cause but it would be good to check and see.


chigkim commented Nov 13, 2024

Yep that was it! The commit 50d8bed loops, but 9dd72cd doesn't.


awni commented Nov 13, 2024

Could you try running with Metal validation enabled to see if that gives us any clues? (Low probability but when it hits it hits well):

METAL_DEVICE_WRAPPER_TYPE=1 METAL_DEBUG_ERROR_MODE=0 mlx_lm.generate ...


awni commented Nov 13, 2024

Also you can precompute the prompt cache to speed testing up:

mlx_lm.cache_prompt --prompt-cache-file prompt.safetensors --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --prompt -<prompt.txt                

Then use that to generate:

mlx_lm.generate --prompt-cache-file prompt.safetensors --max-tokens 1024

@angeloskath

@chigkim Adding on to @awni's last message. Can you run the following commands and report back the outputs?

mlx_lm.cache_prompt --prompt-cache-file prompt.safetensors --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --prompt -<prompt.txt
mlx_lm.generate --prompt-cache-file prompt.safetensors --max-tokens 2048
mlx_lm.generate --prompt-cache-file prompt.safetensors --max-tokens 2048
mlx_lm.generate --prompt-cache-file prompt.safetensors --max-tokens 2048 --temp 0.1
mlx_lm.generate --prompt-cache-file prompt.safetensors --max-tokens 2048 --temp 0.1 --seed 100
mlx_lm.generate --prompt-cache-file prompt.safetensors --max-tokens 2048 --temp 0.2 
mlx_lm.generate --prompt-cache-file prompt.safetensors --max-tokens 2048 --temp 0.2 --seed 100

The first 2 should have exactly the same output, looping or not. The next 4 should all have different outputs.


chigkim commented Nov 14, 2024

Hmm, the output doesn't seem to be any different. Am I supposed to look for something?

% METAL_DEVICE_WRAPPER_TYPE=1 METAL_DEBUG_ERROR_MODE=0 mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --max-kv-size 33000 --max-tokens 1000 --temp 0.0 --top-p 0.9 --seed 1000 --prompt -<../text/portugal.txt;say done
2024-11-13 22:46:18.124 python[16522:3704956] Metal API Validation Enabled
Fetching 6 files: 100%|████████████████████████| 6/6 [00:00<00:00, 45507.82it/s]
==========
Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>

Summarize the following:
......
<|eot_id|><|start_header_id|>assistant<|end_header_id|>


Portugal is a country located on the Iberian Peninsula in southwestern Europe.
.....
==========
Prompt: 32134 tokens, 414.286 tokens-per-sec
Generation: 1000 tokens, 32.263 tokens-per-sec
Peak memory: 12.535 GB


awni commented Nov 14, 2024

Hmm, the output doesn't seem to be any different. Am I supposed to look for something?

Nope it would have been obvious if it threw a validation error. Thanks for checking.


chigkim commented Nov 14, 2024

The first 2 should have exactly the same output, looping or not. The next 4 should all have different outputs.

Ooo, we're getting somewhere...
The first 3 looped, the last 3 didn't.
Terminal Saved Output.txt

@angeloskath

I am pretty sure it is a numerical stability issue. The interesting part is that the fused attention is all happening in float32 so it should be more numerically accurate.

If you are building from source, could you edit utils.py in mlx-lm to save the logits so we can inspect how large the final differences are? If not, let me know and I will provide a patch.


chigkim commented Nov 25, 2024

Sorry, I'm not sure what to edit in order to save the logits. Could you provide a patch?


chigkim commented Nov 26, 2024

Oh, I think I found something. I realized that running 8-bit automatically inserts a system prompt, but running 4-bit doesn't.
If I run 4-bit with --system-prompt, it finally does not loop!!!
But then I found a weird behavior with 8-bit. If I specify --system-prompt, instead of using the system prompt I specified, it appends my system prompt to its own system prompt along with Cutting Knowledge Date and Today Date.
Here are all the logs from the different tests:

Running 4-bit without --system-prompt loops, matching my usual result. Notice that the prompt starts with the user tag and no system prompt.

mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --max-kv-size 33000 --max-tokens 2000 --temp 0.0 --top-p 0.9 --seed 1000 --prompt  -<../text/portugal.txt;say done
Fetching 6 files: 100%|███████████████████████| 6/6 [00:00<00:00, 148910.20it/s]
==========
Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>

Summarize the following:
......<|eot_id|><|start_header_id|>assistant<|end_header_id|>


......
==========
Prompt: 32134 tokens, 422.547 tokens-per-sec
Generation: 2000 tokens, 33.290 tokens-per-sec

Running 8-bit without --system-prompt does not loop. Notice that it automatically inserted a system prompt with Cutting Knowledge Date and Today Date.

mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-8bit --max-kv-size 33000 --max-tokens 2000 --temp 0.0 --top-p 0.9 --seed 1000 --prompt  -<../text/portugal.txt;say done
Fetching 7 files: 100%|███████████████████████| 7/7 [00:00<00:00, 109145.46it/s]
==========
Prompt: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Summarize the following:
......<|eot_id|><|start_header_id|>assistant<|end_header_id|>


......
==========
Prompt: 32159 tokens, 422.712 tokens-per-sec
Generation: 819 tokens, 25.135 tokens-per-sec

Running 4-bit with --system-prompt does not loop.

mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --max-kv-size 33000 --max-tokens 2000 --temp 0.0 --top-p 0.9 --seed 1000 --system-prompt $'Cutting Knowledge Date: December 2023\nToday Date: 23 July 2024\n\nYou are a helpful assistant' --prompt  -<../text/portugal.txt;say done
Fetching 6 files: 100%|███████████████████████| 6/6 [00:00<00:00, 133152.51it/s]
==========
Prompt: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 23 July 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Summarize the following:
......<|eot_id|><|start_header_id|>assistant<|end_header_id|>


......
==========
Prompt: 32164 tokens, 434.508 tokens-per-sec
Generation: 1153 tokens, 33.392 tokens-per-sec
Peak memory: 12.420 GB

Running 8-bit with --system-prompt results in a system prompt containing Cutting Knowledge Date and Today Date twice.

mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-8bit --max-kv-size 33000 --max-tokens 2000 --temp 0.0 --top-p 0.9 --seed 1000 --system-prompt $'Cutting Knowledge Date: December 2023\nToday Date: 23 July 2024\n\nYou are a helpful assistant' --prompt  -<../text/portugal.txt;say done
Fetching 7 files: 100%|████████████████████████| 7/7 [00:00<00:00, 13855.65it/s]
==========
Prompt: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

Cutting Knowledge Date: December 2023
Today Date: 23 July 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Summarize the following:
......<|eot_id|><|start_header_id|>assistant<|end_header_id|>


......
==========
Prompt: 32184 tokens, 423.930 tokens-per-sec
Generation: 769 tokens, 25.218 tokens-per-sec


awni commented Nov 26, 2024

Good catch regarding the system prompt! The model tokenizer in the HF repo seems to have been misconfigured for some unknown reason. I've fixed it and it should update automatically the next time you use the model.

Regarding the concatenation of the system prompt with the date, this is the expected behavior for the tokenizer:

{{- "Cutting Knowledge Date: December 2023\n" }}
{{- "Today Date: " + date_string + "\n\n" }}
{%- if tools is not none and not tools_in_user_message %}
    {{- "You have access to the following functions. To call a function, please respond with JSON for a function call." }}
    {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
    {{- "Do not use variables.\n\n" }}
    {%- for t in tools %}
        {{- t | tojson(indent=4) }}
        {{- "\n\n" }}
    {%- endfor %}
{%- endif %}
{{- system_message }}
{{- "<|eot_id|>" }}

It always concatenates the system message to the date string.

If you want to change the date string you can do so by passing it to tokenizer.apply_chat_template:

messages = [
    {"role": "system", "content": "You are a helpful AI assistant"},
    {"role": "user", "content": "hi!"},
]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, date_string="1.2.3"))

Getting rid of the cutting knowledge date would require modifying the chat template itself, which we likely won't provide an argument for in the CLI, but you can easily do that in Python directly.
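For example, something along these lines should work. This is an untested sketch that loads the tokenizer with Hugging Face transformers and edits the Jinja chat template string in place; adjust the replaced string to match whatever is actually in the repo's template:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

# The chat template is just a Jinja string stored on the tokenizer, so it can
# be edited directly. Here we drop the "Cutting Knowledge Date" statement.
tokenizer.chat_template = tokenizer.chat_template.replace(
    '{{- "Cutting Knowledge Date: December 2023\\n" }}', ""
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant"},
    {"role": "user", "content": "hi!"},
]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))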

I am going to close this issue since it appears that it is mostly resolved!! Thanks for your help getting to the bottom of this!

awni closed this as completed Nov 26, 2024

chigkim commented Nov 27, 2024

Thanks for fixing the tokenizer.
Now the 4-bit model inserts the system prompt with the dates automatically, but it still loops. :(
Having said that, running with --system-prompt 'You are a helpful assistant' doesn't loop.
It seems very sensitive to small changes!
So far from my tests:

  • The commit 50d8bed loops, but 9dd72cd doesn't.
  • Changing --temp from 0.0 to 0.1 loops.
  • Changing --temp from 0.0 to 0.2 doesn't loop.
  • Changing --seed from 1000 to 100 doesn't loop.

We can keep it closed for now, but I'll let you know if I find something else.
Thanks!


awni commented Nov 27, 2024

I'll reopen this in that case.

I tried quite a few seeds and it still did not loop... which I find a bit strange. It seems like looping happens much more frequently in your setup 🤔
