task2vec · Research
Which AI model should fix your tickets?
We ran four leading AI models on 65 real Spring Framework bug tickets and scored each
answer against the actual commit that shipped. Here's what we found.
65 tickets · 4 models · Spring Framework 2009–2013 · February 2026
65 real tickets evaluated
+29% Claude's edge over GPT
0× benefit from paying more
Pass rate by model and tier
A ticket "passes" when the model both identified the right file (its response names
the correct class) and reached a Jaccard token overlap of at least 0.15 with the code
added in the real commit. We treat that threshold as the point where a response is
useful as a starting point for a developer.
| Tier | Opus 4.6 | Sonnet 4.6 | GPT-4o | GPT-4o-mini |
|---|---|---|---|---|
| Automate | 100% | 80% | 60% | 20% |
| Assist | 60% | 60% | 67% | 50% |
| Escalate | 67% | 77% | 50% | 50% |
| Overall | 66% | 69% | 58% | 48% |
Code similarity to actual fix
Jaccard overlap between code tokens in the model's answer and the tokens added in the
real commit. A continuous 0–1 measure that captures partial credit and isn't affected
by how the model formats its response.
| Tier | Opus 4.6 | Sonnet 4.6 | GPT-4o | GPT-4o-mini |
|---|---|---|---|---|
| Automate | 0.222 | 0.202 | 0.165 | 0.158 |
| Assist | 0.213 | 0.218 | 0.185 | 0.176 |
| Escalate | 0.239 | 0.240 | 0.168 | 0.180 |
| Overall | 0.226 | 0.227 | 0.176 | 0.177 |
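The overlap metric can be sketched in a few lines of Python. This is a minimal illustration, not the project's scoring code: the identifier regex and the helper names `code_tokens` and `jaccard` are assumptions, and the real tokenizer may treat keywords or literals differently.

```python
import re

# Assumed tokenizer: any Java-style identifier counts as a code token.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def code_tokens(text: str) -> set[str]:
    """Return the set of identifier tokens found in a code snippet."""
    return set(IDENTIFIER.findall(text))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical example: the model's answer vs. the lines added in the commit.
answer = "return cache.get(key);"
added = "return cache.getOrDefault(key, fallback);"
score = jaccard(code_tokens(answer), code_tokens(added))
print(score)  # 3 shared tokens out of 6 distinct tokens -> 0.5
```

Because the comparison works on sets of tokens, whitespace, ordering, and diff formatting in the model's response do not change the score.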
Cost to evaluate 65 tickets
| Model | Cost | Note |
|---|---|---|
| Opus 4.6 | $6.92 | 230× more than mini |
| ★ Sonnet 4.6 | $1.50 | best quality-per-dollar |
| GPT-4o | $0.57 | 19× more than mini |
| GPT-4o-mini | $0.03 | cheapest option |
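The cost multiples follow directly from the totals above. The short sketch below reproduces the arithmetic; the names `COSTS_USD`, `cost_ratio`, and `cost_per_ticket` are illustrative and not taken from the project's code.

```python
# Total cost in USD to evaluate all 65 tickets, as reported above.
COSTS_USD = {
    "Opus 4.6": 6.92,
    "Sonnet 4.6": 1.50,
    "GPT-4o": 0.57,
    "GPT-4o-mini": 0.03,
}
TICKETS = 65


def cost_ratio(model: str, baseline: str) -> float:
    """How many times more expensive `model` is than `baseline`."""
    return COSTS_USD[model] / COSTS_USD[baseline]


def cost_per_ticket(model: str) -> float:
    """Average cost in USD of one ticket for `model`."""
    return COSTS_USD[model] / TICKETS


for model in COSTS_USD:
    ratio = cost_ratio(model, "GPT-4o-mini")
    print(f"{model}: {ratio:.1f}x mini, ${cost_per_ticket(model):.4f} per ticket")
```

The same helper gives the Opus-to-Sonnet multiple: `cost_ratio("Opus 4.6", "Sonnet 4.6")` comes out at roughly 4.6.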
Three findings
Finding 1
Claude models write significantly better code fixes than GPT models
Both Claude models scored ~0.227 on code similarity vs ~0.177 for both GPT models, an
advantage of about 29% overall. Claude leads in every tier, though the margin varies
from roughly 20% on Assist to nearly 40% on Escalate. When given
the same broken source file and the same ticket description, Claude produces answers that
are substantially closer to what a senior engineer actually committed.
Finding 2
Opus and Sonnet are identical in quality — Sonnet is the right choice
Claude Opus (0.226) and Claude Sonnet (0.227) are effectively indistinguishable on code
similarity across 65 tickets. Opus costs 4.6× more for no measurable gain on this task.
On this benchmark, Sonnet is the better-value model for automated ticket resolution.
Finding 3
GPT-4o provides zero benefit over GPT-4o-mini
The two OpenAI models scored 0.177 and 0.176 on code similarity, essentially the same,
yet GPT-4o costs 19× more. On this metric, GPT-4o-mini matches its
larger sibling.
What the tiers predicted
The Automate / Assist / Escalate labels are derived purely from historical Jira resolution
patterns (resolution speed, watcher count, contributor experience), with no AI involvement.
The eval gives those labels partial support as predictors of AI solvability: Automate
tickets had the highest pass rates for both Claude models (100% and 80%), but not for
the GPT models. Code similarity did not fall on Escalate tickets; both Claude models
scored highest there (0.239 and 0.240).
On pass rate, this supports the core premise that a team's own Jira history is a meaningful
signal for where AI can act autonomously and where human oversight is needed.
Automate
Act autonomously
Fast to resolve, low attention, routine change. AI answers are reliable enough to apply directly.
Assist
Draft + human review
AI produces a solid starting point. A developer should review before merging.
Escalate
Route to senior engineer
AI answers look plausible but often miss the real fix. Senior attention required.
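As a rough illustration of how the three tiers could drive ticket routing, the sketch below maps each label to its handling policy. The `Tier` enum, the `ACTIONS` table, and the `route` function are hypothetical names for this example; the page does not describe how routing is actually implemented.

```python
from enum import Enum


class Tier(Enum):
    AUTOMATE = "Automate"
    ASSIST = "Assist"
    ESCALATE = "Escalate"


# Handling policy per tier, paraphrasing the descriptions above.
ACTIONS = {
    Tier.AUTOMATE: "apply the AI fix directly",
    Tier.ASSIST: "open an AI draft for developer review",
    Tier.ESCALATE: "route to a senior engineer",
}


def route(tier_label: str) -> str:
    """Return the handling policy for a tier label such as "Assist"."""
    tier = Tier(tier_label)  # raises ValueError for an unknown label
    return ACTIONS[tier]


print(route("Assist"))
```

Failing loudly on an unknown label is a deliberate choice here: a mislabeled ticket should not silently fall into the autonomous path.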
Methodology
Corpus: 65 Spring Framework Jira tickets with a verified git commit, drawn
from 998 candidate tickets (matched from 11,424 SPR-#### commits in the spring-projects/spring-framework
repository). Commits date from 2009 to 2013.
Prompt: Each model received the ticket summary, description, and the
before-state Java source file(s), with an instruction to produce a minimal code fix as a unified diff.
Scoring: Two checks per response. File hit: did the response mention the correct class name?
Token overlap: Jaccard similarity between code identifier tokens in the model's
answer and the added lines in the real commit diff. Pass threshold: file hit AND overlap ≥ 0.15.
Fairness: All models used identical prompts and the same 65 tickets (seed=42).
No fine-tuning or few-shot examples were used.
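The pass rule described under Scoring can be sketched as follows. This is a simplified reading of the methodology, not the repository's code: the `Result` record, the substring-based `file_hit` check, and the function names are assumptions, and the overlap value is taken as already computed.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 0.15  # minimum Jaccard overlap, from the methodology


@dataclass
class Result:
    response: str       # the model's full answer
    correct_class: str  # class name touched by the real commit
    overlap: float      # Jaccard overlap with the commit's added lines


def file_hit(result: Result) -> bool:
    """Assumed check: the response mentions the correct class name."""
    return result.correct_class in result.response


def passes(result: Result) -> bool:
    """A ticket passes only on a file hit AND overlap >= threshold."""
    return file_hit(result) and result.overlap >= PASS_THRESHOLD


def pass_rate(results: list[Result]) -> float:
    """Fraction of results that pass; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(passes(r) for r in results) / len(results)


# Hypothetical results for three tickets.
sample = [
    Result("Patch for BeanWrapperImpl ...", "BeanWrapperImpl", 0.20),
    Result("Patch for BeanWrapperImpl ...", "BeanWrapperImpl", 0.10),
    Result("Patch for SomeOtherClass ...", "BeanWrapperImpl", 0.30),
]
print(pass_rate(sample))  # only the first result passes
```

Requiring both conditions means a response with high token overlap in the wrong file still fails, as does a correctly located fix that shares too little code with the commit.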
Full code and raw results:
github.com/JussiTu/task2vec
See how your tickets score
The tier labels are built from your own Jira history — no training required.
Paste a ticket and see which tier it lands in.