
v5.5.0

@jamesbraza jamesbraza released this 20 Nov 00:23
· 46 commits to main since this release
0b3ef89

Highlights

In all of v5 before this release, we treated the agent loop as done once 1+ generated answers did not contain the substring "cannot answer". However, this heuristic (suboptimally) led the agent loop to terminate early on partial answers like "Based on the sources provided, it appears no one has done x." We realized this and have resolved the issue by:

  • No longer coupling our done condition to the absence of the substring "cannot answer" in 1+ generated answers
  • No longer implicitly depending on clients mentioning this "cannot answer" sentinel in the input QA prompt
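To make the early-termination problem concrete, here is a minimal sketch of the old sentinel-based done check. The function and constant names are illustrative only, not paperqa's actual API:

```python
CANNOT_ANSWER_PHRASE = "cannot answer"  # hypothetical sentinel constant


def is_done_sentinel_based(answers: list[str]) -> bool:
    """Old-style (now removed) check: done once any answer lacks the sentinel.

    This wrongly terminates on hedged partial answers such as
    "Based on the sources provided, it appears no one has done x.",
    since they do not contain the literal substring "cannot answer".
    """
    return any(CANNOT_ANSWER_PHRASE not in answer.lower() for answer in answers)
```

A partial answer like the one above passes this check even though the agent has not actually answered the question, which is exactly why the done condition is no longer coupled to the sentinel.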

We also fixed several (bad) bugs:

  • We support parallel tool calling (2+ ToolCalls in one action: ToolRequestMessage). However, our tools (notably gather_evidence) are not actually concurrency-safe. Although our tool schemas instructed the agent not to call certain tools in parallel, we nonetheless observed agents requesting parallel gather_evidence calls. We now force our tools to execute non-concurrently to work around this race condition
  • When using LitQAEvaluation with the same GradablePaperQAEnvironment 2+ times, we appended the "unsure" option to the target multiple-choice question on every use, degrading performance over time
  • When using a PaperQAEnvironment 2+ times, each reset was not properly wiping the Docs object
  • The reward distribution of LitQAEvaluation was mixing up the "unsure" reward of 0.1 with the "incorrect" reward of -1.0, not properly incentivizing learning

There are a bunch of other minor features, cleanups, and bugfixes here too; see the full list below.

What's Changed

Full Changelog: v5.4.0...v5.5.0