-
-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(issue summary): Score possible cause confidence and novelty #1788
base: main
Are you sure you want to change the base?
Conversation
Looks like in the failing test |
I forgot to VCR the embeddings call in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
q on latency, otherwise lgtm
issue_summary_with_scores = IssueSummaryWithScores( | ||
**issue_summary.model_dump(), scores=score_issue_summary(issue_summary) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you may have mentioned this already and I missed it, but remind me how much latency does this add?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks fine to me, just confused on the confidence score calculation. Also idk how we feel about the latency
|
||
# Round for some tolerance during equality comparison. | ||
# TODO: is decryption and decompression of the VCR causing tiny changes? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's this TODO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the local test I get exact equality. in CI (specifically, this failed run), we get inequality at the 16th decimal place:
> assert raw_result == expected_raw_result
E AssertionError: assert IssueSummaryW...319887938298)) == IssueSummaryW...319887938298))
E Full diff:
E - IssueSummaryWithScores(title='Critical Issue: red-timothy-sandwich Failure', whats_wrong='**red-timothy-sandwich** encountered **exceptions** during execution, impacting functionality. Check **event logs** for details.', session_related_issues='Related issues: **cyan-vincent-banana** and **green-fred-tennis** may indicate broader session instability.', possible_cause='Potential **resource contention** or **dependency failure** affecting multiple components.', scores=SummarizeIssueScores(possible_cause_confidence=0.4749528387220516, possible_cause_novelty=0.6286319887938298))
E ? ^
E + IssueSummaryWithScores(title='Critical Issue: red-timothy-sandwich Failure', whats_wrong='**red-timothy-sandwich** encountered **exceptions** during execution, impacting functionality. Check **event logs** for details.', session_related_issues='Related issues: **cyan-vincent-banana** and **green-fred-tennis** may indicate broader session instability.', possible_cause='Potential **resource contention** or **dependency failure** affecting multiple components.', scores=SummarizeIssueScores(possible_cause_confidence=0.47495283872205174, possible_cause_novelty=0.6286319887938298))
E ?
if the VCR weren't used / we'd actually hit the OpenAI API in the test, we'd see inequality at around the 4th decimal place (they add noise of around 5e-4 so embeddings are not deterministic). so this is probably a weird decompression or decryption error. left it as a TODO since it's not too important
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, idk if we want to discuss latency before merging, but it's probably fine
The last commit gets rid of a mostly useless embedding. Added latency is now 0.3 sec p50, 0.43 sec p90 (see "Measure additional latency" in the notebook). |
Adds
"scores"
to the issue summary response.Costs 0.3 sec in latency
How has this been tested?
Changed unit test to use VCR
Local run which generated a summarization (trace ID: 1ebe8c50-e9ea-4b26-85e5-9fa41a06f8eb) w/ output:
output