Prompting the 7B Llama model to reply with an integer score only results in subpar evaluations. #449
-
I played around a bit more with this:
Here is the modified generation function.
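Along these lines, a minimal sketch of a query function that lets the judge reply freely instead of demanding an integer-only answer (assuming Ollama's /api/chat endpoint as used in Ch07; the function name and options below are illustrative):

```python
import json
import urllib.request

def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat"):
    # Non-streaming request to Ollama's chat endpoint; deterministic options
    # (fixed seed, temperature 0) keep the judge's replies reproducible.
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"seed": 123, "temperature": 0, "num_ctx": 2048},
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    # Return the full free-form reply instead of demanding
    # "Respond with the integer number only."
    return body["message"]["content"]
```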
In some cases, the responses are ranked correctly when we don't force the model to answer with an integer only, but not otherwise. In other cases, the evaluation is messed up even when we ask for a descriptive response. For example:
-
Hey Sebastian, sorry for the delay in getting back. I did a bunch of experiments. I took three sets of JSON files for evaluation:
The idea behind this is 1 < 2 < 3, where 3 should be a near-perfect response, as that is the desired output we are fine-tuning GPT-2 355M with. I will refer to these as Score 1, Score 2, and Score 3. I'll be varying two things in each run.
- Experiment 1. Prompt: Original (Ch07 of the book)
- Experiment 2. Prompt: Original (Ch07 of the book) + "Do not include an explanation"
- Experiment 3. Prompt: Original (Ch07 of the book)
- Experiment 4. Prompt: Original (Ch07 of the book) + "Do not include an explanation"
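To make the setup concrete, a rough sketch of how each run can be scored and compared; the file names are placeholders, and the build_judge_prompt / query_model / extract_score helpers stand in for the prompt builder, Ollama query, and score-extraction functions sketched elsewhere in this thread:

```python
import json

def average_score(json_path, build_prompt, query_fn, extract_fn):
    # Average the judge's score over one file of model responses.
    with open(json_path, "r") as f:
        entries = json.load(f)
    scores = []
    for entry in entries:
        reply = query_fn(build_prompt(entry))   # free-form judge reply
        score = extract_fn(reply)               # integer score or None
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores) if scores else float("nan")

# Expected ordering if the judge is faithful: Score 1 < Score 2 < Score 3
# for path in ("score_1.json", "score_2.json", "score_3.json"):  # placeholder names
#     print(path, average_score(path, build_judge_prompt, query_model, extract_score))
```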
Having said that, I went down a bit of a rabbit hole regarding how LLMs are being used as evaluators and how to get better, more faithful scores. One thing I realised is done differently in practice than in the book: using a much smaller scale (such as 1-5), aided by a scoring rubric that spells out what each score means. The evaluator produces feedback with the score in it, which is then extracted with a regex (a small sketch of this is below). Some approaches use chain-of-thought prompting too. Here are some resources that you may find useful.
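For example, if the judge is told to end its feedback with a line such as "Total rating: <score>" (prometheus-style prompts use a "[RESULT] <score>" tag instead), the score can be recovered with a small regex; the tag names here are assumptions:

```python
import re

def extract_score(judge_reply, low=1, high=5):
    # Pull the integer after a "Total rating:" (or prometheus-style "[RESULT]")
    # tag out of the judge's free-form feedback; return None if nothing valid.
    match = re.search(r"(?:Total rating|\[RESULT\])\s*:?\s*([0-9]+)", judge_reply)
    if match:
        score = int(match.group(1))
        if low <= score <= high:
            return score
    return None
```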
I also stumbled across this project (https://github.com/prometheus-eval/prometheus-eval), which can be used to evaluate LLMs using an LLM specifically trained for the task. However, I can't run it concurrently with the training script as it fills up my GPU. Inspired by their prompts and by this guide (https://huggingface.co/learn/cookbook/en/llm_judge), I came up with the following prompt for Llama 3.1, which scores well. I have also found that the variance between scores when the prompt is tweaked or the model is retrained is smaller with this approach. Here is the prompt. (I'm still a rookie at this)
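As an illustration only, a rubric-style template along these lines (the rubric wording and field names are placeholders, loosely following the Hugging Face cookbook format, and this is not the exact prompt; the closing "Total rating:" line matches the regex sketch above):

```python
JUDGE_PROMPT = """You will be given an instruction, a reference answer, and a model response.
Rate the response on a scale of 1 to 5 using the rubric below.

Scoring rubric:
1: The response is irrelevant to the instruction or contradicts the reference answer.
2: The response addresses the instruction but is mostly incorrect or incomplete.
3: The response is partially correct but misses important details.
4: The response is mostly correct, with only minor issues.
5: The response is correct and as good as the reference answer.

Instruction:
{instruction}

Reference answer:
{reference}

Model response:
{response}

Write a short feedback explaining your reasoning, then finish with a line in
exactly this format: Total rating: <score>"""


def build_judge_prompt(entry, response_key="model response"):
    # The field names are assumptions matching the Ch07-style JSON layout.
    return JUDGE_PROMPT.format(
        instruction=entry["instruction"],
        reference=entry["output"],
        response=entry[response_key],
    )
```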
We can use
-
Hi Sebastian,
Thanks for writing this book; it has been of great help. However, there is something I would like to bring to your attention.
I was fiddling with the Ch07 notebook and, while looking at the scores generated by the Llama 3 model via Ollama, found some very subpar generations ranked fairly decently. This prompted me to run the `ch07/01_main-chapter-code/ch07.ipynb` notebook with `num_epochs=0` in the cell where the training is done, which basically evaluates the performance of the pretrained foundation model on the task. I was looking to quantify the improvement instruction finetuning brings to the model's performance. The pre-loaded GPT-2 Medium mostly just repeats something from the prompt. However, even these bogus responses are ranked quite high by Llama 3, so much so that the untrained model gets a score of 44.38, whereas the score after 2 epochs of training is 48.30.
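For reference, the change in the training cell amounts to the following (variable name as used in the notebook):

```python
# Setting num_epochs to 0 skips all weight updates, so the Ollama-based scores
# reflect the pre-loaded GPT-2 355M weights rather than a finetuned model.
num_epochs = 0  # the notebook trains for 2 epochs by default
```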
I've forked the repo and pushed the modified notebook with some examples of such poor scoring decisions here: https://github.com/ayooshkathuria/LLMs-from-scratch/blob/ollama_eval_weirdness/ch07/01_main-chapter-code/ch07.ipynb
I'm pasting one of them in this thread, though you can play around on your own.
Is this expected, or am I missing something about the proper use of LLMs here?
EDIT: I can't run Llama 70B on my workstation for now to evaluate this, as there are some memory issues (ollama/ollama#941), so it would be nice if you could check whether using the 70B model alleviates the issue.
System Specs:
Ubuntu 24.04.1 LTS
Ryzen 5950, RTX 3090