extend explanation for prompts in new test cases #107

You can add new test cases by adding new notebooks to the `/notebooks/human-eval-bia` directory.
Check out the examples there and make sure to stick to the following rules.

> [!CAUTION]
> Most importantly: When writing new test case notebooks, do not use language models for code generation.
> You would otherwise bias the benchmark towards that model.
> Use human-written code only and/or examples from the documentation of specific libraries.

The notebooks must have the following format:
* Within one cell there must be a function that solves a specific [bio-image analysis] task. An example would be to compute the number and sum of all pixels in an image:
```python
import numpy as np

def compute_image_sum(image):
    """
    Takes an image as a numpy array as an input and returns the number and sum of all pixels as outputs.
    """
    flattened_image = image.flatten()
    num_pixels = flattened_image.size
    sum_pixels = np.sum(flattened_image)
    return num_pixels, sum_pixels
```
* The function must have a meaningful docstring between """ and """, which will serve as the prompt together with the function signature. Ideally, write a short, natural sentence one could hear in a conversation between two humans. It must be specific enough, though, that a language model (or a human) has all the information needed to write the entire function. For example, it is **not** specific enough to just write "Takes an image as an input...", because then the model cannot know whether this is a path to an image, a numpy array, or something else. Also make sure you describe the return values in enough detail; if there is more than one return value, be aware that their order matters: "...returns the number and sum of all pixels..." is different from "...returns the sum and number of all pixels...". [Check out the list of pre-existing prompts](https://github.com/haesleinhuepf/human-eval-bia/blob/main/test_cases/readme.md) to get some inspiration.
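For illustration, here is a hypothetical sketch (not an actual test case from the benchmark) contrasting an under-specified docstring with a sufficiently specific one; a model effectively only sees the function signature and the docstring:
```python
# Hypothetical illustration of prompt specificity -- the model only sees the
# signature and the docstring, not the function body.

# Too vague: the input type and the return values are unclear.
def compute_image_sum(image):
    """
    Takes an image as an input and returns some statistics.
    """

# Specific enough: input type, both return values, and their order are stated.
def compute_image_sum(image):
    """
    Takes an image as a numpy array as an input and returns the number and sum of all pixels as outputs.
    """
```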
* There must be another code cell that starts with `def check(candidate):` and contains test code to test the generated code.
* The test code must use `assert` statements and call the `candidate` function. E.g. if a given function to test is `compute_image_sum`, then a valid test for `compute_image_sum` would be:
```
def check(candidate):
    image = np.array([[1, 2], [3, 4]])
    num_pixels, sum_pixels = candidate(image)
    assert num_pixels == 4
    assert sum_pixels == 10
```
* A third Python code cell in the notebook must call the `check` function with your custom function, e.g. like this, to prove that the code you provided works with the tests you wrote:
```
check(compute_image_sum)
```
* Save the new test case in a notebook that has the same name as the test function, so that others can find it easily. In our case above: `compute_image_sum.ipynb`.
* Optional: You can add as many markdown cells as you like to explain the test case.
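Optionally, before committing the notebook, you can also make sure your `check` function is strict enough to reject a wrong implementation. The following is only a sketch (the name `broken_compute_image_sum` and this extra cell are not part of the required format, and it assumes the cells above have been run so that `np` and `check` are defined); it swaps the two return values and confirms that `check` raises an `AssertionError`:
```python
# Deliberately wrong implementation: returns the sum and the number of pixels
# in swapped order, which the check above should reject.
def broken_compute_image_sum(image):
    """
    Intentionally wrong: returns the sum and the number of pixels in swapped order.
    """
    flattened_image = image.flatten()
    return np.sum(flattened_image), flattened_image.size

try:
    check(broken_compute_image_sum)
    print("The check did not catch the swapped return values -- consider making it stricter.")
except AssertionError:
    print("The check correctly rejects the swapped return values.")
```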

## Adding dependencies