Prompting the 7B Llama model to reply with an integer score only results in subpar evaluations. #449
-
I played around a bit more with this:
Here is the modified generation function.
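Along these lines, a minimal sketch of a query function that lets the judge reply freely instead of demanding an integer-only answer (assuming Ollama's /api/chat endpoint as used in Ch07; the function name and options below are illustrative):

```python
import json
import urllib.request

def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat"):
    # Non-streaming request to Ollama's chat endpoint; deterministic options
    # (fixed seed, temperature 0) keep the judge's replies reproducible.
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"seed": 123, "temperature": 0, "num_ctx": 2048},
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    # Return the full free-form reply instead of demanding
    # "Respond with the integer number only."
    return body["message"]["content"]
```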
In some cases, the responses are ranked correctly when we don't force the model to answer with an integer only, but not otherwise. In other cases, the evaluation is messed up even when we ask for a descriptive response. For example:
-
Hey Sebastian, sorry for the delay in getting back. I did a bunch of experiments. I took three sets of JSON files for evaluation:
The idea behind this is 1 < 2 < 3, where 3 should be a near-perfect response, as that is the desired output we are fine-tuning GPT-2 355M with. I will refer to these as Score 1, Score 2, and Score 3. I'll be varying two things in each run.
- Experiment 1. Prompt: Original (Ch07 of the book)
- Experiment 2. Prompt: Original (Ch07 of the book) + "Do not include an explanation"
- Experiment 3. Prompt: Original (Ch07 of the book)
- Experiment 4. Prompt: Original (Ch07 of the book) + "Do not include an explanation"
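To make the setup concrete, a rough sketch of how each run can be scored and compared; the file names are placeholders, and the build_judge_prompt / query_model / extract_score helpers stand in for the prompt builder, Ollama query, and score-extraction functions sketched elsewhere in this thread:

```python
import json

def average_score(json_path, build_prompt, query_fn, extract_fn):
    # Average the judge's score over one file of model responses.
    with open(json_path, "r") as f:
        entries = json.load(f)
    scores = []
    for entry in entries:
        reply = query_fn(build_prompt(entry))   # free-form judge reply
        score = extract_fn(reply)               # integer score or None
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores) if scores else float("nan")

# Expected ordering if the judge is faithful: Score 1 < Score 2 < Score 3
# for path in ("score_1.json", "score_2.json", "score_3.json"):  # placeholder names
#     print(path, average_score(path, build_judge_prompt, query_model, extract_score))
```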
Having said that, I went down a bit of a rabbit hole regarding how LLMs are being used as evaluators and how to get better, more faithful scores. One thing I realised is done differently in practice than in the book: using a much smaller scale (such as 1-5), aided by a scoring rubric that spells out what each score means. The evaluator produces feedback with the score in it, which is then extracted with a regex (a small sketch of this is below). Some approaches use chain-of-thought prompting too. Here are some resources that you may find useful.
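For example, if the judge is told to end its feedback with a line such as "Total rating: <score>" (prometheus-style prompts use a "[RESULT] <score>" tag instead), the score can be recovered with a small regex; the tag names here are assumptions:

```python
import re

def extract_score(judge_reply, low=1, high=5):
    # Pull the integer after a "Total rating:" (or prometheus-style "[RESULT]")
    # tag out of the judge's free-form feedback; return None if nothing valid.
    match = re.search(r"(?:Total rating|\[RESULT\])\s*:?\s*([0-9]+)", judge_reply)
    if match:
        score = int(match.group(1))
        if low <= score <= high:
            return score
    return None
```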
I also stumbled across this project (https://github.com/prometheus-eval/prometheus-eval), which can be used to evaluate LLMs using an LLM specifically trained for the task. However, I can't run it concurrently with the training script as it fills up my GPU. Inspired by their prompts and by this guide (https://huggingface.co/learn/cookbook/en/llm_judge), I came up with the following prompt for Llama 3.1, which scores well. I have also found that the variance between scores when the prompt is tweaked or the model is retrained is smaller with this approach. Here is the prompt. (I'm still a rookie at this)
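As an illustration only, a rubric-style template along these lines (the rubric wording and field names are placeholders, loosely following the Hugging Face cookbook format, and this is not the exact prompt; the closing "Total rating:" line matches the regex sketch above):

```python
JUDGE_PROMPT = """You will be given an instruction, a reference answer, and a model response.
Rate the response on a scale of 1 to 5 using the rubric below.

Scoring rubric:
1: The response is irrelevant to the instruction or contradicts the reference answer.
2: The response addresses the instruction but is mostly incorrect or incomplete.
3: The response is partially correct but misses important details.
4: The response is mostly correct, with only minor issues.
5: The response is correct and as good as the reference answer.

Instruction:
{instruction}

Reference answer:
{reference}

Model response:
{response}

Write a short feedback explaining your reasoning, then finish with a line in
exactly this format: Total rating: <score>"""


def build_judge_prompt(entry, response_key="model response"):
    # The field names are assumptions matching the Ch07-style JSON layout.
    return JUDGE_PROMPT.format(
        instruction=entry["instruction"],
        reference=entry["output"],
        response=entry[response_key],
    )
```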
We can use
-
Hi Sebastian,
Thanks for writing this book; it has been of great help. However, there is something I would like to bring to your attention.
I was fiddling with the Ch07 notebook and, while looking at the scores generated by the Llama 3 model via Ollama, found some very subpar generations ranked fairly decently. This prompted me to run the `ch07/01_main-chapter-code/ch07.ipynb` notebook with `num_epochs=0` in the cell where the training is done, which basically evaluates the performance of the pretrained foundation model on the task. I was looking to quantify the improvement instruction finetuning brings to the model's performance. The pre-loaded GPT-2 Medium mostly just repeats something from the prompt. However, even these bogus responses are ranked quite high by Llama 3, so much so that the untrained model gets a score of 44.38, whereas the score after 2 epochs of training is 48.30.
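For reference, the change in the training cell amounts to the following (variable name as used in the notebook):

```python
# Setting num_epochs to 0 skips all weight updates, so the Ollama-based scores
# reflect the pre-loaded GPT-2 355M weights rather than a finetuned model.
num_epochs = 0  # the notebook trains for 2 epochs by default
```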
I've forked the repo and pushed the modified notebook with some examples of such poor scoring decisions here: https://github.com/ayooshkathuria/LLMs-from-scratch/blob/ollama_eval_weirdness/ch07/01_main-chapter-code/ch07.ipynb
I'm pasting one of them in this thread, though you can play around on your own.
Is this expected, or am I missing something about the proper use of LLMs here?
EDIT: I can't run Llama 70B on my workstation for now to evaluate this, as there are some memory issues (ollama/ollama#941), so it would be nice if you could check whether using the 70B model alleviates the issue.
System Specs:
Ubuntu 24.04.1 LTS
Ryzen 5950, RTX 3090