You can see a discussion of the results in the blog post and the details of the experiments here.
TL;DR:
- Claude 3 Haiku is the best model for tool use when only a single function call needs to be generated.
- However, when you need parallel tool use, GPT-4 Turbo is still the best model.
- Notably, GPT-3.5 Turbo appears biased toward generating multiple function calls in parallel, whether or not the task requires it (see the sketch below).
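To make the single vs. parallel distinction concrete, here is a minimal sketch using the OpenAI Python SDK: with parallel tool use, a single model response can carry several entries in `tool_calls`. The tool schema and prompt are illustrative and not taken from the benchmark.

```python
# Illustrative sketch (tool and prompt are made up, not from the benchmark):
# with parallel tool use, one response can contain several tool_calls.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "What's the weather in Paris and in Tokyo?"}],
    tools=tools,
)

# A parallel-capable model may emit two calls here (Paris and Tokyo);
# a single-call question should yield exactly one, which is where the
# GPT-3.5 Turbo bias toward parallel calls shows up.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```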
Following the Gorilla repo, download the data from HuggingFace to the `./data` folder:
```bash
huggingface-cli download gorilla-llm/Berkeley-Function-Calling-Leaderboard --local-dir ./data --repo-type dataset
```
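As a quick sanity check that the download worked, you can count the test cases in one of the category files. The filename below is an assumption based on the leaderboard's naming scheme (the files are JSON Lines, one test case per line); adjust it to whatever actually lands in `./data`.

```python
# Sanity-check sketch: count test cases in one downloaded category file.
# The filename is an assumption based on the BFCL naming scheme.
import json
from pathlib import Path

path = Path("data/gorilla_openfunctions_v1_test_simple.json")  # assumed filename
with path.open() as f:
    cases = [json.loads(line) for line in f if line.strip()]
print(f"{path.name}: {len(cases)} test cases")
```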
Then, manually download the possible answers into `data/possible_answer`.
- Install the requirements:
```bash
pip install -r requirements.txt
```
- Get a Parea API key from here.
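Once you have the key, you can verify it is picked up before running the full benchmark. This is a sketch assuming the `parea-ai` Python SDK, whose client can also wrap the OpenAI client so calls get traced; check the Parea docs if the constructor differs.

```python
# Sketch, assuming the parea-ai SDK: initialize the client and wrap
# the OpenAI client so subsequent calls are logged in Parea.
import os

from openai import OpenAI
from parea import Parea

client = OpenAI()
p = Parea(api_key=os.environ["PAREA_API_KEY"])  # assumes the key is set in the environment
p.wrap_openai_client(client)
```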
- Copy the `.env.example` file to `.env` and fill in the API keys for Parea, OpenAI & Anthropic (an example follows).
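A filled-in `.env` might look like the sketch below. The variable names are assumptions based on the usual convention for these SDKs; `.env.example` has the names the repo actually expects.

```
# Assumed variable names; confirm against .env.example.
PAREA_API_KEY=pk-...
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```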
- Run the experiments:

```bash
python3 experiment.py
```
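Under the hood, a benchmark like this typically scores a generated call against the possible answer by matching the function name and arguments. Below is a minimal sketch of that idea with a simplified answer shape; it is not the repo's actual evaluation code, whose checker is more permissive (e.g., multiple acceptable argument values).

```python
# Minimal scoring sketch: exact-match a generated function call against
# an expected answer. The {'name', 'arguments'} shape is a simplifying
# assumption, not the benchmark's real answer format.
def is_correct(generated: dict, expected: dict) -> bool:
    return (
        generated.get("name") == expected.get("name")
        and generated.get("arguments") == expected.get("arguments")
    )


generated = {"name": "get_weather", "arguments": {"city": "Paris"}}
expected = {"name": "get_weather", "arguments": {"city": "Paris"}}
print(is_correct(generated, expected))  # True
```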