
extend explanation for prompts in new test cases #107

Open · wants to merge 2 commits into main
Conversation

haesleinhuepf (Owner)

This PR contains:

  • a new test-case for the benchmark
    • I hereby confirm that NO LLM-based technology (such as GitHub Copilot) was used while writing this benchmark
  • new dependencies in requirements.txt
    • The environment.yml file was updated using the command conda env export > environment.yml
  • new generator-functions allowing to sample from other LLMs
  • new samples (sample_....jsonl files)
  • new benchmarking results (..._results.jsonl files)
  • documentation update
  • bug fixes

Related GitHub issue (if relevant): partial solution to #79

Short description:

  • Here I'm modifying how we explain the prompt text style to new contributors

How do you think this will influence the benchmark results?

  • Not directly.
  • Long-term, if all prompts follow a similar style, this may make it easier for LLMs to solve the tasks. Such standardization could potentially lead to over-fitting.

Why do you think it makes sense to merge this PR?

  • Standardization in this sense is important and we should merge this.
  • Long-term, we could modify all prompts and run a second benchmark to measure the impact of the prompt style. For now, this seems out of scope; hence, standardization goes first.

@tischi would you mind reviewing this and potentially modifying the text to make things clearer?

tischi (Collaborator) commented Sep 4, 2024

I am wondering whether we should make the example slightly more complex. I am thinking of:

This function takes a numpy array image as an input and returns the number and sum of all pixels as outputs.

This would serve to explain that (i) the input type should be reasonably well defined and (ii) that, if there are multiple outputs, the order of the output values matters for testing the function. We could explicitly state that the following would be a different prompt:

This function takes a path to an image as an input and returns the sum and number of all pixels as outputs.

In principle, this tests very similar abilities of the LLM, but now the input is a path to an image and the order of the outputs is reversed. That makes it a different test case, because the code to check the validity of this function would be different.
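
To make this concrete, here is a minimal sketch of what the two solutions and their checks could look like. The function names, the use of scikit-image for file I/O, and the pytest-style checks are illustrative assumptions, not code from the benchmark:

```python
import numpy as np
from skimage.io import imread, imsave

# Illustrative solution for the first prompt:
# numpy array in, (number, sum) out.
def count_and_sum_pixels(image: np.ndarray):
    return image.size, image.sum()

# Illustrative solution for the second prompt:
# file path in, (sum, number) out.
def sum_and_count_pixels(image_path: str):
    image = imread(image_path)
    return image.sum(), image.size

def test_count_and_sum():
    # Input is constructed in memory; expected order is (number, sum).
    image = np.full((2, 3), 2, dtype=np.uint8)
    assert count_and_sum_pixels(image) == (6, 12)

def test_sum_and_count(tmp_path):
    # Input must first be written to disk (tmp_path is a pytest fixture);
    # expected order is reversed: (sum, number).
    path = str(tmp_path / "image.tif")
    imsave(path, np.full((2, 3), 2, dtype=np.uint8))
    assert sum_and_count_pixels(path) == (12, 6)
```

Because the second check has to create a file on disk and expects the reversed output order, the two test cases cannot share their validation code.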

See also: #111

haesleinhuepf (Owner, Author)

Agreed! Do you want to update the example or shall I?

tischi (Collaborator) commented Sep 4, 2024

I can do it.

tischi (Collaborator) commented Sep 4, 2024

But I need to wait until we have agreed on something in #111.

tischi (Collaborator) commented Sep 5, 2024

I changed it. Could you please check?
