
extend explanation for prompts in new test cases #107

Open · wants to merge 2 commits into main
Conversation

haesleinhuepf (Owner)

This PR contains:

  • a new test-case for the benchmark
    • I hereby confirm that NO LLM-based technology (such as GitHub Copilot) was used while writing this benchmark
  • new dependencies in requirements.txt
    • The environment.yml file was updated using the command conda env export > environment.yml
  • new generator-functions allowing to sample from other LLMs
  • new samples (sample_....jsonl files)
  • new benchmarking results (..._results.jsonl files)
  • documentation update
  • bug fixes

Related GitHub issue (if relevant): partial solution to #79

Short description:

  • Here I'm modifying how we explain the prompt text style to new contributors

How do you think this will influence the benchmark results?

  • Not directly.
  • Long-term, if all prompts follow a similar style, this may make it easier for LLMs to solve the tasks. Such standardization could potentially lead to over-fitting.

Why do you think it makes sense to merge this PR?

  • Standardization in this sense is important and we should merge this.
  • Long-term, we could modify all prompts and run a second benchmark to measure the impact of the prompt style. For now, this seems out of scope; hence, standardization goes first.

@tischi would you mind reviewing this and potentially modifying the text to make things clearer?

tischi (Collaborator) commented Sep 4, 2024

I am wondering whether we should make the example slightly more complex. I am thinking of:

This function takes a numpy array image as an input and returns the number and sum of all pixels as outputs.

This would serve to explain that (i) the input type should be reasonably well defined and (ii) that, if there are multiple outputs, the order of the output values matters for testing the function. We could explicitly state that the following would be a different prompt:

This function takes a path to an image as an input and returns the sum and number of all pixels as outputs.

In principle, this tests very similar abilities of the LLM, but now the input is a path to an image and the order of the outputs is reversed. That makes it a different test case, because the code to check the validity of this function would be different.
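
To make this concrete, here is a minimal sketch of what the two solutions and their checks could look like. The function names, the use of scikit-image for file I/O, and the pytest-style checks are illustrative assumptions, not code from the benchmark:

```python
import numpy as np
from skimage.io import imread, imsave

# Illustrative solution for the first prompt:
# numpy array in, (number, sum) out.
def count_and_sum_pixels(image: np.ndarray):
    return image.size, image.sum()

# Illustrative solution for the second prompt:
# file path in, (sum, number) out.
def sum_and_count_pixels(image_path: str):
    image = imread(image_path)
    return image.sum(), image.size

def test_count_and_sum():
    # Input is constructed in memory; expected order is (number, sum).
    image = np.full((2, 3), 2, dtype=np.uint8)
    assert count_and_sum_pixels(image) == (6, 12)

def test_sum_and_count(tmp_path):
    # Input must first be written to disk (tmp_path is a pytest fixture);
    # expected order is reversed: (sum, number).
    path = str(tmp_path / "image.tif")
    imsave(path, np.full((2, 3), 2, dtype=np.uint8))
    assert sum_and_count_pixels(path) == (12, 6)
```

Because the second check has to create a file on disk and expects the reversed output order, the two test cases cannot share their validation code.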

See also: #111

haesleinhuepf (Owner, Author)

Agreed! Do you want to update the example or shall I?

tischi (Collaborator) commented Sep 4, 2024

I can do it.

tischi (Collaborator) commented Sep 4, 2024

But I need to wait until we have agreed on something in #111.

tischi (Collaborator) commented Sep 5, 2024

I changed it. Could you please check?
