v5.5.0
Highlights
In all of v5 before this release, we defined the agent loop's end as the presence of 1+ answer generations not containing the substring `"cannot answer"`. However, this heuristic (suboptimally) terminated the agent loop early on partial answers like "Based on the sources provided, it appears no one has done x." We realized this and resolved the issue by:

- No longer coupling our done condition to the substring `"cannot answer"` being absent from 1+ generated answers
- No longer implicitly depending on clients mentioning this `"cannot answer"` sentinel in the input `qa` prompt
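
As a rough sketch of the behavior change (the function names and signatures here are illustrative, not paper-qa's actual API):

```python
# Illustrative only: how the done condition changed conceptually.


# Before (earlier v5): the loop ended once any generated answer lacked the
# sentinel substring, so hedged partial answers terminated the loop early.
def is_done_old(generated_answers: list[str]) -> bool:
    return any("cannot answer" not in answer.lower() for answer in generated_answers)


# After (v5.5.0): completion is an explicit signal, e.g. the agent invoking a
# terminal tool (see the new `complete` tool from #684), so the wording of the
# answer text no longer matters.
def is_done_new(complete_tool_was_called: bool) -> bool:
    return complete_tool_was_called
```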
We also fixed several (bad) bugs:

- We support parallel tool calling (2+ `ToolCall`s in one `action: ToolRequestMessage`). However, our tools (notably `gather_evidence`) are not actually concurrency-safe. Our tool schemas instructed the LLM not to call certain tools in parallel, yet we nonetheless observed agents specifying `gather_evidence` to be called in parallel. We now force our tools to execute non-concurrently to work around this race condition, as sketched in the first example after this list.
- When using `LitQAEvaluation` with the same `GradablePaperQAEnvironment` 2+ times, we repeatedly added the "unsure" option to the target multiple-choice question, degrading performance over time.
- When using a `PaperQAEnvironment` 2+ times, each `reset` was not properly wiping the `Docs` object (see the second sketch below).
- The reward distribution of `LitQAEvaluation` was mixing up the "unsure" reward of `0.1` with the "incorrect" reward of `-1.0`, not properly incentivizing learning (see the third sketch below).
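
A minimal sketch of the serial tool execution workaround (the lock and function names are hypothetical, not the actual implementation):

```python
import asyncio

# Hypothetical sketch: even when the agent emits 2+ ToolCalls in a single
# action, execute them one at a time instead of concurrently, so tools that
# are not concurrency-safe (like gather_evidence) cannot race each other.
_tool_lock = asyncio.Lock()


async def execute_tool_calls(tool_calls, run_one_tool):
    observations = []
    for tool_call in tool_calls:  # sequential loop, not asyncio.gather(...)
        async with _tool_lock:
            observations.append(await run_one_tool(tool_call))
    return observations
```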
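
The `reset` fix can be pictured with this minimal sketch (the environment shape is illustrative; only `Docs` is the real paper-qa type):

```python
from paperqa import Docs


# Illustrative environment shape: reset builds a fresh Docs object so state
# such as text hashes from a prior episode can no longer leak into the next.
class EnvironmentSketch:
    def __init__(self) -> None:
        self.docs = Docs()

    async def reset(self) -> None:
        self.docs = Docs()  # fresh store each episode, not a reused stale one
```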
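
And the reward fix amounts to mapping each evaluation outcome to its reward explicitly rather than by fragile indexing. In this sketch the enum is an illustrative stand-in; only the `0.1` and `-1.0` values come from these notes:

```python
from enum import IntEnum, auto


class EvalOutcome(IntEnum):  # illustrative stand-in for LitQAEvaluation
    CORRECT = auto()
    INCORRECT = auto()
    UNSURE = auto()


# An explicit outcome-to-reward mapping avoids off-by-one reward indices.
# The "correct" value of 1.0 is assumed; 0.1 and -1.0 are from this release.
REWARDS: dict[EvalOutcome, float] = {
    EvalOutcome.CORRECT: 1.0,
    EvalOutcome.INCORRECT: -1.0,
    EvalOutcome.UNSURE: 0.1,
}
```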
There are a bunch of other minor features, cleanups, and bugfixes here too; see the full list below.
What's Changed
- Deprecation cycle for `AgentSettings.should_pre_search` by @jamesbraza in #679
- Moved agent prompts to `prompts.py` by @jamesbraza in #681
- Refactor to remove `skip_system` from `LLMModel.run_prompt` by @jamesbraza in #680
- Resolving `evidence_detailed_citations` and `Answer` deprecations by @jamesbraza in #682
- Fixed agent prompt names and contents after #681 mess up by @jamesbraza in #683
- Removed `tool_names` validation for `gen_answer` being present by @jamesbraza in #685
- Fixing `test_evaluation` logic bugs by @jamesbraza in #686
- Removed `GenerateAnswer.FAILED_TO_ANSWER` as it's unnecessary by @jamesbraza in #691
- Allowing serialized `Settings` in `get_settings` by @jamesbraza in #688
- Fixed LDP runner's `TRUNCATED` not calling `gen_answer`, and documented `AgentStatus` by @jamesbraza in #690
- Removed `gen_answer`'s dead argument `question` by @jamesbraza in #689
- Making sure we copy distractors by @sidnarayanan in #694
- Created `complete` tool to allow unsure answers by @jamesbraza in #684
- Added missing `test_from_question` cassette by @jamesbraza in #696
- Moved `fake` agent to LLM-proposed `complete` tool by @jamesbraza in #695
- Default to ordered tool calls, w/ env variable control by @mskarlin in #697
- Lock file maintenance by @renovate in #699
- Refactored `TestGradablePaperQAEnvironment` for DRY code by @jamesbraza in #702
- Fixing `PaperQAEnvironment.reset` respecting `mmr_lambda` and `text_hashes` by @jamesbraza in #703
- Removed `"cannot answer"` literals and added `reset` tool by @jamesbraza in #698
- Update all non-major dependencies by @renovate in #705
- Fixing `LitQAEvaluation` bugs: incorrect reward indices, not using LLM's native knowledge by @jamesbraza in #708
- Adding filters to paper-qa Docs by @whitead in #707
- Fixed mutably defaulted `NumpyVectorStore.texts` by @jamesbraza in #711
Full Changelog: v5.4.0...v5.5.0