v5.5.0
Highlights
In all of v5 before this release, we defined the agent loop's end as the presence of 1+ answer generations not containing the substring `"cannot answer"`. However, this heuristic (suboptimally) terminated the agent loop early on partial answers like "Based on the sources provided, it appears no one has done x." We realized this and resolved the issue by:

- No longer coupling our done condition to the substring `"cannot answer"` being absent from 1+ generated answers
- No longer implicitly depending on clients mentioning this `"cannot answer"` sentinel in the input `qa` prompt
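
As a rough sketch of the behavior change (the function names and signatures here are illustrative, not paper-qa's actual API):

```python
# Illustrative only: how the done condition changed conceptually.


# Before (earlier v5): the loop ended once any generated answer lacked the
# sentinel substring, so hedged partial answers terminated the loop early.
def is_done_old(generated_answers: list[str]) -> bool:
    return any("cannot answer" not in answer.lower() for answer in generated_answers)


# After (v5.5.0): completion is an explicit signal, e.g. the agent invoking a
# terminal tool (see the new `complete` tool from #684), so the wording of the
# answer text no longer matters.
def is_done_new(complete_tool_was_called: bool) -> bool:
    return complete_tool_was_called
```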
We also fixed several (bad) bugs:

- We support parallel tool calling (2+ `ToolCall`s in one `action: ToolRequestMessage`). However, our tools (notably `gather_evidence`) are not actually concurrency-safe. Our tool schemas instructed the LLM not to call certain tools in parallel, yet we nonetheless observed agents specifying `gather_evidence` to be called in parallel. We now force our tools to execute non-concurrently to work around this race condition, as sketched in the first example after this list.
- When using `LitQAEvaluation` with the same `GradablePaperQAEnvironment` 2+ times, we repeatedly added the "unsure" option to the target multiple-choice question, degrading performance over time.
- When using a `PaperQAEnvironment` 2+ times, each `reset` was not properly wiping the `Docs` object (see the second sketch below).
- The reward distribution of `LitQAEvaluation` was mixing up the "unsure" reward of `0.1` with the "incorrect" reward of `-1.0`, not properly incentivizing learning (see the third sketch below).
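
A minimal sketch of the serial tool execution workaround (the lock and function names are hypothetical, not the actual implementation):

```python
import asyncio

# Hypothetical sketch: even when the agent emits 2+ ToolCalls in a single
# action, execute them one at a time instead of concurrently, so tools that
# are not concurrency-safe (like gather_evidence) cannot race each other.
_tool_lock = asyncio.Lock()


async def execute_tool_calls(tool_calls, run_one_tool):
    observations = []
    for tool_call in tool_calls:  # sequential loop, not asyncio.gather(...)
        async with _tool_lock:
            observations.append(await run_one_tool(tool_call))
    return observations
```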
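
The `reset` fix can be pictured with this minimal sketch (the environment shape is illustrative; only `Docs` is the real paper-qa type):

```python
from paperqa import Docs


# Illustrative environment shape: reset builds a fresh Docs object so state
# such as text hashes from a prior episode can no longer leak into the next.
class EnvironmentSketch:
    def __init__(self) -> None:
        self.docs = Docs()

    async def reset(self) -> None:
        self.docs = Docs()  # fresh store each episode, not a reused stale one
```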
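
And the reward fix amounts to mapping each evaluation outcome to its reward explicitly rather than by fragile indexing. In this sketch the enum is an illustrative stand-in; only the `0.1` and `-1.0` values come from these notes:

```python
from enum import IntEnum, auto


class EvalOutcome(IntEnum):  # illustrative stand-in for LitQAEvaluation
    CORRECT = auto()
    INCORRECT = auto()
    UNSURE = auto()


# An explicit outcome-to-reward mapping avoids off-by-one reward indices.
# The "correct" value of 1.0 is assumed; 0.1 and -1.0 are from this release.
REWARDS: dict[EvalOutcome, float] = {
    EvalOutcome.CORRECT: 1.0,
    EvalOutcome.INCORRECT: -1.0,
    EvalOutcome.UNSURE: 0.1,
}
```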
There are a bunch of other minor features, cleanups, and bugfixes here too; see the full list below.
What's Changed
- Deprecation cycle for `AgentSettings.should_pre_search` by @jamesbraza in #679
- Moved agent prompts to `prompts.py` by @jamesbraza in #681
- Refactor to remove `skip_system` from `LLMModel.run_prompt` by @jamesbraza in #680
- Resolving `evidence_detailed_citations` and `Answer` deprecations by @jamesbraza in #682
- Fixed agent prompt names and contents after #681 mess up by @jamesbraza in #683
- Removed `tool_names` validation for `gen_answer` being present by @jamesbraza in #685
- Fixing `test_evaluation` logic bugs by @jamesbraza in #686
- Removed `GenerateAnswer.FAILED_TO_ANSWER` as it's unnecessary by @jamesbraza in #691
- Allowing serialized `Settings` in `get_settings` by @jamesbraza in #688
- Fixed LDP runner's `TRUNCATED` not calling `gen_answer`, and documented `AgentStatus` by @jamesbraza in #690
- Removed `gen_answer`'s dead argument `question` by @jamesbraza in #689
- Making sure we copy distractors by @sidnarayanan in #694
- Created `complete` tool to allow unsure answers by @jamesbraza in #684
- Added missing `test_from_question` cassette by @jamesbraza in #696
- Moved `fake` agent to LLM-proposed `complete` tool by @jamesbraza in #695
- Default to ordered tool calls, w/ env variable control by @mskarlin in #697
- Lock file maintenance by @renovate in #699
- Refactored `TestGradablePaperQAEnvironment` for DRY code by @jamesbraza in #702
- Fixing `PaperQAEnvironment.reset` respecting `mmr_lambda` and `text_hashes` by @jamesbraza in #703
- Removed `"cannot answer"` literals and added `reset` tool by @jamesbraza in #698
- Update all non-major dependencies by @renovate in #705
- Fixing `LitQAEvaluation` bugs: incorrect reward indices, not using LLM's native knowledge by @jamesbraza in #708
- Adding filters to paper-qa Docs by @whitead in #707
- Fixed mutably defaulted `NumpyVectorStore.texts` by @jamesbraza in #711
Full Changelog: v5.4.0...v5.5.0