-
Notifications
You must be signed in to change notification settings - Fork 666
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed EVAL_PROMPT_TEMPLATE
to handle empty string or multiple match answers
#724
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do the benchmarks move with this change?
@@ -98,9 +98,12 @@ | |||
# Prompt templates for use with LitQA | |||
QA_PROMPT_TEMPLATE = "Q: {question}\n\nOptions:\n{options}" | |||
EVAL_PROMPT_TEMPLATE = ( | |||
"Extract the single letter answer from the following question and answer" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a solid iteration of the eval prompt template to address #709.
But I don't think this addresses Empty session.answer for exhaustive search without an answer #723
We still need the ability to inform the user of what happened at this step -- they just get a blank answer object currently. If the "user" is the eval script (as here) then this takes care of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh so you'd like to see something like the paper-qa
runner script swapping out empty answer for some sentinel value? Can you elaborate where in the stack you'd like to see a fix, beyond LitQAEvaluation
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, exactly -- after the runner finishes, we need some sentinel to be added. The blank isn't intuitive enough for users to understand (especially via command line or UI). Can def be another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In cases like this eval though -- the sentinel will create a need for another prompt, so we could make the sentinel a module level variable and check for it in the eval method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am going to be running this shortly. With the empty answers, previously sometimes they got marked as correct, sometimes incorrect, and sometimes unsure. I am not sure which is more frequent, but that will likely dictate the impact on benchmark performance. Fwiw our mini test suite's benchmark has grown here. |
@sidnarayanan I will circle back with the impact of evaluation, going to merge this for now |
Okay, including all of v5.6, it seems:
So in general, performance drops a bit and variability increases. So the evaluation flaws in v5.0 to v5.5 were pumping up the numbers a bit |
This PR moves the evaluation prompt template to handle