Which AI model should fix your tickets?

We ran four leading AI models on 65 real Spring Framework bug tickets and scored each answer against the actual commit that shipped. Here's what we found.

65 tickets · 4 models · Spring Framework 2009–2013 · February 2026

65 real tickets evaluated
4 AI models compared
+29% Claude's edge over GPT
benefit from paying more

Pass rate by model and tier

A ticket "passes" when the model both identified the right file and reached a token overlap (Jaccard similarity) of at least 0.15 with the code added in the real commit. We treat that threshold as the point where a response is useful as a starting point for a developer.

Tier       Opus 4.6   Sonnet 4.6 ★   GPT-4o-mini   GPT-4o
Automate   100%       80%            60%           20%
Assist     60%        60%            67%           50%
Escalate   67%        77%            50%           50%
Overall    66%        69%            58%           48%

Code similarity to actual fix

Jaccard overlap between code tokens in the model's answer and the tokens added in the real commit. A continuous 0–1 measure that captures partial credit and isn't affected by how the model formats its response.

Tier       Opus 4.6   Sonnet 4.6 ★   GPT-4o-mini   GPT-4o
Automate   0.222      0.202          0.165         0.158
Assist     0.213      0.218          0.185         0.176
Escalate   0.239      0.240          0.168         0.180
Overall    0.226      0.227          0.176         0.177
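
The overlap score can be sketched roughly as follows. The tokenizer (a regex for Java-style identifiers, which also picks up keywords such as `if`) and the function names are assumptions made for illustration; the scorer in the repository may tokenize differently.

```python
import re

# Assumption: a "code token" is any Java-style identifier (keywords included).
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def code_tokens(text: str) -> set[str]:
    """Return the set of identifier tokens found in a piece of text."""
    return set(IDENTIFIER.findall(text))


def jaccard(answer: str, added_lines: str) -> float:
    """Jaccard similarity |A & B| / |A | B| between the two token sets."""
    a, b = code_tokens(answer), code_tokens(added_lines)
    if not a and not b:
        return 0.0  # nothing to compare; treat as no overlap
    return len(a & b) / len(a | b)


# Hypothetical inputs, not taken from the evaluated tickets.
model_answer = "if (bean == null) { return defaultValue; }"
real_commit = "if (bean == null) { throw new IllegalStateException(name); }"
print(round(jaccard(model_answer, real_commit), 3))
```

Because the score is computed on token sets, whitespace, ordering, and diff formatting in the response do not change it.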

Cost to evaluate 65 tickets

Opus 4.6 · $6.92 · 230× more than mini
★ Sonnet 4.6 · $1.50 · best quality-per-dollar
GPT-4o · $0.57 · 19× more than mini
GPT-4o-mini · $0.03 · cheapest option
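
The cost multiples can be checked with a few lines of arithmetic. This is a small sketch using only the dollar totals shown above; the per-ticket figure is simply the total divided by 65, and the helper name is made up for illustration.

```python
# Total cost in US dollars to evaluate all 65 tickets (figures from the cards above).
COST_USD = {
    "Opus 4.6": 6.92,
    "Sonnet 4.6": 1.50,
    "GPT-4o": 0.57,
    "GPT-4o-mini": 0.03,
}
TICKETS = 65


def cost_ratios(costs: dict[str, float], baseline: str) -> dict[str, float]:
    """Express every model's cost as a multiple of the baseline model's cost."""
    base = costs[baseline]
    return {model: cost / base for model, cost in costs.items()}


ratios = cost_ratios(COST_USD, "GPT-4o-mini")
for model, cost in COST_USD.items():
    print(f"{model:<12} ${cost:>5.2f} total  ${cost / TICKETS:.4f} per ticket  {ratios[model]:>6.1f}x mini")

opus_vs_sonnet = COST_USD["Opus 4.6"] / COST_USD["Sonnet 4.6"]
print(f"Opus vs Sonnet: {opus_vs_sonnet:.1f}x")
```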

Three findings

Finding 1

Claude models produce fixes closer to the real commit than GPT models

Both Claude models scored about 0.227 on code similarity, versus about 0.177 for both GPT models: roughly a 29% advantage overall across the 65 tickets. Claude led in every tier, although the size of the gap varied from tier to tier. Given the same broken source file and the same ticket description, Claude produces answers that are measurably closer to what was actually committed.
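
One way to derive the relative gap from the similarity table is sketched below. Averaging the two models per vendor is an assumption about how the headline figure was computed, not something stated in this write-up. With the rounded table values the overall gap comes out near 28%, so the reported 29% presumably reflects unrounded scores.

```python
from statistics import mean

# Code-similarity scores copied from the table above (rounded to 3 decimals).
SIMILARITY = {
    "Automate": {"Opus 4.6": 0.222, "Sonnet 4.6": 0.202, "GPT-4o-mini": 0.165, "GPT-4o": 0.158},
    "Assist":   {"Opus 4.6": 0.213, "Sonnet 4.6": 0.218, "GPT-4o-mini": 0.185, "GPT-4o": 0.176},
    "Escalate": {"Opus 4.6": 0.239, "Sonnet 4.6": 0.240, "GPT-4o-mini": 0.168, "GPT-4o": 0.180},
    "Overall":  {"Opus 4.6": 0.226, "Sonnet 4.6": 0.227, "GPT-4o-mini": 0.176, "GPT-4o": 0.177},
}
CLAUDE = ("Opus 4.6", "Sonnet 4.6")
GPT = ("GPT-4o-mini", "GPT-4o")


def relative_advantage(scores: dict[str, float]) -> float:
    """Relative gap between the Claude mean and the GPT mean for one table row."""
    claude = mean([scores[m] for m in CLAUDE])
    gpt = mean([scores[m] for m in GPT])
    return (claude - gpt) / gpt


advantage = {tier: relative_advantage(row) for tier, row in SIMILARITY.items()}
for tier, value in advantage.items():
    print(f"{tier:<9} {value:+.1%}")
```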

Finding 2

Opus and Sonnet are effectively tied on quality, so Sonnet is the better choice

Claude Opus (0.226) and Claude Sonnet (0.227) are effectively tied on code similarity across the 65 tickets, and their overall pass rates are close (66% vs 69%). Opus cost 4.6× as much for no measurable gain on this task, which makes Sonnet the better-value model for automated ticket resolution.

Finding 3

GPT-4o shows no benefit over GPT-4o-mini

GPT-4o scored 0.177 on code similarity and GPT-4o-mini scored 0.176, essentially the same, while GPT-4o cost 19× as much. On overall pass rate, GPT-4o-mini was actually ahead (58% vs 48%). For ticket resolution on this corpus, GPT-4o-mini matches or beats its larger sibling.


What the tiers predicted

The Automate / Assist / Escalate labels are derived purely from historical Jira resolution patterns (resolution speed, watcher count, contributor experience), with no AI involvement. In this eval the labels tracked AI solvability most clearly for the Claude models, whose pass rates were highest on Automate tickets (100% for Opus, 80% for Sonnet). The GPT models did not follow the same pattern, and code similarity on Escalate tickets was not lower than in the other tiers.

This supports the core premise, at least for the stronger models: a team's own Jira history is a meaningful signal for where AI can act autonomously vs. where human oversight is needed.

Automate
Act autonomously
Fast to resolve, low attention, routine change. AI answers are reliable enough to apply directly.
Assist
Draft + human review
AI produces a solid starting point. A developer should review before merging.
Escalate
Route to senior engineer
AI answers look plausible but often miss the real fix. Senior attention required.

Methodology

Corpus: 65 Spring Framework Jira tickets, each with a verified git commit, drawn from 998 candidate tickets (matched from 11,424 SPR-#### commits in the spring-projects/spring-framework repository). The commits date from 2009 to 2013.
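
Matching commits to tickets by their SPR-#### key might look like the sketch below. It assumes the key appears literally in the commit message; the function names and the sample commits are hypothetical, and the actual matching script may apply further filters.

```python
import re

# Assumption: ticket keys appear in commit messages as "SPR-" followed by digits.
TICKET_KEY = re.compile(r"\bSPR-\d+\b")


def ticket_keys(commit_message: str) -> list[str]:
    """Return the distinct SPR keys mentioned in a message, in first-seen order."""
    return list(dict.fromkeys(TICKET_KEY.findall(commit_message)))


def match_commits(commits: dict[str, str]) -> dict[str, list[str]]:
    """Map each ticket key to the commit hashes whose message mentions it."""
    by_ticket: dict[str, list[str]] = {}
    for sha, message in commits.items():
        for key in ticket_keys(message):
            by_ticket.setdefault(key, []).append(sha)
    return by_ticket


# Hypothetical commit hashes and messages, for illustration only.
sample = {
    "a1": "Fix NPE in resolver (SPR-1234)",
    "b2": "SPR-1234: follow-up; see also SPR-99",
    "c3": "Polish javadoc",
}
print(match_commits(sample))
```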

Prompt: Each model received the ticket summary, the description, and the before-state Java source file(s), with an instruction to produce a minimal code fix as a unified diff.

Scoring: Each response gets two checks. File hit: did the response mention the correct class name? Token overlap: the Jaccard similarity between code identifier tokens in the model's answer and the added lines in the real commit diff. A ticket passes when it has a file hit AND an overlap of at least 0.15.
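
The pass rule can be written down in a few lines. This sketch takes the overlap score as an already-computed number; the whole-word match used for the file hit and the function names are assumptions, since the text only says the response must mention the correct class name.

```python
import re

PASS_THRESHOLD = 0.15  # minimum Jaccard overlap, from the scoring rule above


def file_hit(response: str, class_name: str) -> bool:
    """True if the response mentions the class name as a whole word (assumed rule)."""
    return re.search(rf"\b{re.escape(class_name)}\b", response) is not None


def passes(response: str, class_name: str, overlap: float) -> bool:
    """A ticket passes only when both the file hit and the overlap threshold hold."""
    return file_hit(response, class_name) and overlap >= PASS_THRESHOLD


def pass_rate(results: list[bool]) -> float:
    """Share of tickets that passed; 0.0 for an empty list."""
    return sum(results) / len(results) if results else 0.0


# Hypothetical response text and scores, for illustration only.
response = "--- a/BeanUtils.java\n+++ b/BeanUtils.java\n+ return copy;"
outcomes = [
    passes(response, "BeanUtils", 0.20),   # file hit and enough overlap
    passes(response, "BeanUtils", 0.10),   # file hit but overlap too low
    passes(response, "BeanWrapper", 0.30), # good overlap but wrong class
]
print(outcomes, pass_rate(outcomes))
```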

Fairness: All models used identical prompts and the same 65 tickets (seed=42). No fine-tuning or few-shot examples were used.
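
A fixed seed is what lets every model see the same tickets. The sketch below assumes the 65 tickets were drawn uniformly at random from the 998 candidates, which the text does not state explicitly; the candidate keys are placeholders, not the real ticket IDs.

```python
import random


def sample_tickets(candidate_keys: list[str], k: int = 65, seed: int = 42) -> list[str]:
    """Draw k tickets reproducibly; sorting first makes the result independent of input order."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return rng.sample(sorted(candidate_keys), k)


# Placeholder keys standing in for the 998 candidate tickets.
candidates = [f"SPR-{n}" for n in range(1, 999)]
first_run = sample_tickets(candidates)
second_run = sample_tickets(list(reversed(candidates)))
print(len(first_run), first_run == second_run)
```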

Full code and raw results: github.com/JussiTu/task2vec


See how your tickets score

The tier labels are built from your own Jira history — no training required. Paste a ticket and see which tier it lands in.
