Fixing LitQAEvaluation bugs: incorrect reward indices, not using LLM's native knowledge #708
Conversation
  "Extract the single letter answer from the following question and answer"
- "\n\n{qa_prompt}"
- "\n\n{qa_answer}"
+ "\n\nQuestion: {qa_prompt}"
+ "\n\nAnswer: {qa_answer}"
+ "\n\nSingle Letter Answer:"
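As a sanity check, here is a minimal sketch of how the revised template could be assembled and rendered, assuming the `{qa_prompt}`/`{qa_answer}` fields shown in the diff (the example question and answer text is invented):

```python
# Sketch of the revised evaluation prompt template; the field names
# qa_prompt/qa_answer come from the diff above, the example inputs are made up.
EVAL_PROMPT_TEMPLATE = (
    "Extract the single letter answer from the following question and answer"
    "\n\nQuestion: {qa_prompt}"
    "\n\nAnswer: {qa_answer}"
    "\n\nSingle Letter Answer:"
)

prompt = EVAL_PROMPT_TEMPLATE.format(
    qa_prompt="Which gene is implicated?\nA) BRCA1\nB) TP53",
    qa_answer="The evidence points to TP53, so the answer is B.",
)
print(prompt)
```

Ending the prompt with "Single Letter Answer:" nudges the model to complete with just the letter, which makes the evaluation output easy to parse.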
I'd suggest something like:
"Given the following question and a proposed answer to the question, return the single-letter choice that matches the proposed answer."
"\n\nQuestion: ..."
"\n\nProposed Answer: ..."
If I didn't know the context of this method, as a human, I'd find the original prompt unclear.
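For illustration, the suggested rewording might render like this once the template fields from the original prompt are substituted (a hypothetical sketch, not the merged wording):

```python
# Hypothetical rendering of the reviewer's suggested wording; the
# qa_prompt/qa_answer field names are borrowed from the original template.
SUGGESTED_TEMPLATE = (
    "Given the following question and a proposed answer to the question,"
    " return the single-letter choice that matches the proposed answer."
    "\n\nQuestion: {qa_prompt}"
    "\n\nProposed Answer: {qa_answer}"
)

example = SUGGESTED_TEMPLATE.format(
    qa_prompt="What is 2 + 2?\nA) 3\nB) 4",
    qa_answer="Four, i.e. option B.",
)
```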
Yeah, it's a nice suggestion, but making this change breaks two of our test cases. Let's save this for another PR.
Incorporated into #724
35b33db to 65c97e3
This PR:
- Fixes the EVAL_PROMPT_TEMPLATE
- Fixes the discounted_returns logic and the incorrect reward indices in discounted_returns and the TaskDataset
- Makes distractors be a Sequence to avoid in-place edits, further robust-ification after Making sure we copy distractors #694

Closes #693
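The reward-indexing and distractor-copying fixes can be sketched generically as follows. This is not the PR's actual code; the function names and signatures are illustrative, and the recursion shown is the standard backward pass G_t = r_t + gamma * G_{t+1}:

```python
from typing import Sequence


def discounted_returns(rewards: Sequence[float], gamma: float = 1.0) -> list[float]:
    # Backward recursion: G_t = r_t + gamma * G_{t+1}.
    # Building the returns in reverse and then flipping them keeps each
    # return aligned with the reward at the same index, avoiding the kind
    # of off-by-one indexing bug this PR addresses (sketch only).
    returns: list[float] = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def make_choices(distractors: Sequence[str], answer: str) -> list[str]:
    # Copy into a new list rather than mutating the caller's sequence,
    # mirroring the "copy distractors" robustification (follow-up to #694).
    choices = list(distractors)
    choices.append(answer)
    return choices
```

Accepting distractors as a Sequence (rather than a concrete list) also lets callers pass immutable containers like tuples, which makes accidental in-place edits impossible by construction.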