You can see a discussion of the results in the blog post and the details of the experiments here.
TL;DR:
- Claude 3 Haiku is the best model for tool use when only a single function call needs to be generated.
- However, when you need parallel tool use, GPT-4 Turbo is still the best model.
- Notably, GPT-3.5 Turbo appears biased toward generating multiple function calls in parallel, whether or not the task requires it (see the sketch below).
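To make the single vs. parallel distinction concrete, here is a minimal sketch using the OpenAI Python SDK: with parallel tool use, a single model response can carry several entries in `tool_calls`. The tool schema and prompt are illustrative and not taken from the benchmark.

```python
# Illustrative sketch (tool and prompt are made up, not from the benchmark):
# with parallel tool use, one response can contain several tool_calls.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "What's the weather in Paris and in Tokyo?"}],
    tools=tools,
)

# A parallel-capable model may emit two calls here (Paris and Tokyo);
# a single-call question should yield exactly one, which is where the
# GPT-3.5 Turbo bias toward parallel calls shows up.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```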
Following the Gorilla repo, download the data from HuggingFace to the `./data` folder:
```bash
huggingface-cli download gorilla-llm/Berkeley-Function-Calling-Leaderboard --local-dir ./data --repo-type dataset
```
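As a quick sanity check that the download worked, you can count the test cases in one of the category files. The filename below is an assumption based on the leaderboard's naming scheme (the files are JSON Lines, one test case per line); adjust it to whatever actually lands in `./data`.

```python
# Sanity-check sketch: count test cases in one downloaded category file.
# The filename is an assumption based on the BFCL naming scheme.
import json
from pathlib import Path

path = Path("data/gorilla_openfunctions_v1_test_simple.json")  # assumed filename
with path.open() as f:
    cases = [json.loads(line) for line in f if line.strip()]
print(f"{path.name}: {len(cases)} test cases")
```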
Then, manually download the possible answers into `data/possible_answer`.
- Install the requirements:
```bash
pip install -r requirements.txt
```
- Get a Parea API key from here.
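Once you have the key, you can verify it is picked up before running the full benchmark. This is a sketch assuming the `parea-ai` Python SDK, whose client can also wrap the OpenAI client so calls get traced; check the Parea docs if the constructor differs.

```python
# Sketch, assuming the parea-ai SDK: initialize the client and wrap
# the OpenAI client so subsequent calls are logged in Parea.
import os

from openai import OpenAI
from parea import Parea

client = OpenAI()
p = Parea(api_key=os.environ["PAREA_API_KEY"])  # assumes the key is set in the environment
p.wrap_openai_client(client)
```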
- Copy the `.env.example` file to `.env` and fill in the API keys for Parea, OpenAI & Anthropic (an example follows).
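A filled-in `.env` might look like the sketch below. The variable names are assumptions based on the usual convention for these SDKs; `.env.example` has the names the repo actually expects.

```
# Assumed variable names; confirm against .env.example.
PAREA_API_KEY=pk-...
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```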
- Run the experiments:

```bash
python3 experiment.py
```
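Under the hood, a benchmark like this typically scores a generated call against the possible answer by matching the function name and arguments. Below is a minimal sketch of that idea with a simplified answer shape; it is not the repo's actual evaluation code, whose checker is more permissive (e.g., multiple acceptable argument values).

```python
# Minimal scoring sketch: exact-match a generated function call against
# an expected answer. The {'name', 'arguments'} shape is a simplifying
# assumption, not the benchmark's real answer format.
def is_correct(generated: dict, expected: dict) -> bool:
    return (
        generated.get("name") == expected.get("name")
        and generated.get("arguments") == expected.get("arguments")
    )


generated = {"name": "get_weather", "arguments": {"city": "Paris"}}
expected = {"name": "get_weather", "arguments": {"city": "Paris"}}
print(is_correct(generated, expected))  # True
```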