The primary ticket corpus is the Spring Framework issue tracker, historically hosted at
jira.spring.io under project key SPR. This tracker is publicly
readable without authentication. The full ticket set spans SPR-1 through approximately
SPR-18000, covering the period 2003–2019, after which the project migrated to GitHub Issues.
To reproduce the corpus, query the Jira REST API iteratively:
GET https://jira.spring.io/rest/api/2/search
?jql=project%3DSPR%20AND%20resolution%3DFixed
&fields=key,summary,description,created,resolutiondate,
watches,assignee,status,issuetype
&maxResults=100
&startAt={offset}
Increment startAt by 100 until startAt ≥ total from the
response. In our corpus, 69,156 documents were collected; 61,564 had both a
created and resolutiondate field populated and were retained
for labelling.
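The pagination loop above can be sketched as follows. This is a minimal illustration, not the authors' scraper: the function names `fetch_page` and `iter_issues` are hypothetical, the response is assumed to carry the standard `issues` and `total` keys of the Jira search endpoint, and the sketch does not verify that the endpoint is still reachable after the migration to GitHub Issues.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://jira.spring.io/rest/api/2/search"
FIELDS = ("key,summary,description,created,resolutiondate,"
          "watches,assignee,status,issuetype")


def fetch_page(start_at, page_size=100):
    """Fetch one page of search results as a parsed JSON dict."""
    query = urllib.parse.urlencode({
        "jql": "project=SPR AND resolution=Fixed",
        "fields": FIELDS,
        "maxResults": page_size,
        "startAt": start_at,
    })
    with urllib.request.urlopen(f"{BASE_URL}?{query}") as resp:
        return json.load(resp)


def iter_issues(fetch=fetch_page, page_size=100):
    """Yield every issue, advancing startAt until it reaches `total`."""
    start_at = 0
    while True:
        page = fetch(start_at, page_size)
        yield from page.get("issues", [])
        start_at += page_size
        if start_at >= page.get("total", 0):
            break
```

Passing the page fetcher as an argument keeps the loop testable offline; a rate limit or retry policy would be added inside `fetch_page`.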
The full git history of the Spring Framework is publicly available:
git clone https://github.com/spring-projects/spring-framework.git
The repository is approximately 244 MB (compressed). No authentication is required.
The evaluation uses commits from the full history, including all branches
(the --all flag). In our checkout (February 2026), the repository contained
11,424 commits whose subject lines matched the pattern SPR-\d+.
Three observable signals are extracted per ticket. No natural language processing or LLM inference is used at this stage.
| Signal | Source field | Computation |
|---|---|---|
| days | fields.resolutiondate, fields.created | max(0, (resolutiondate − created).total_seconds() / 86400). Both timestamps parsed as UTC-aware datetimes. |
| watches | fields.watches.watchCount | Integer; missing values treated as 0. |
| assignee_count | fields.assignee.displayName | Count of all tickets (resolved or not) in the corpus assigned to the same displayName. A proxy for contributor experience within this project. |
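The three signals can be computed with the standard library alone. The sketch below is illustrative: the function names are hypothetical, and the timestamp format string is an assumption about how Jira serialises dates (e.g. `2010-05-14T09:12:33.000+0000`), not something the source specifies.

```python
from collections import Counter
from datetime import datetime

# Assumed Jira timestamp layout, e.g. 2010-05-14T09:12:33.000+0000
JIRA_TS = "%Y-%m-%dT%H:%M:%S.%f%z"


def resolution_days(fields):
    """Days from creation to resolution, clamped at zero."""
    created = datetime.strptime(fields["created"], JIRA_TS)
    resolved = datetime.strptime(fields["resolutiondate"], JIRA_TS)
    return max(0.0, (resolved - created).total_seconds() / 86400)


def watch_count(fields):
    """Watcher count; a missing field or value is treated as 0."""
    return int((fields.get("watches") or {}).get("watchCount") or 0)


def assignee_counts(all_fields):
    """Number of tickets per assignee displayName across the corpus."""
    names = ((f.get("assignee") or {}).get("displayName") for f in all_fields)
    return Counter(name for name in names if name)
```

Because `%z` yields offset-aware datetimes, the subtraction is correct even when the two timestamps carry different UTC offsets.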
Thresholds are computed from the empirical distribution of the full 61,564-ticket corpus and are fixed constants — not tuned per experiment:
Each ticket receives an integer score s ∈ {0, 1, 2, 3}:
| Label | Count | % of corpus |
|---|---|---|
| Automate | 1,973 | 3.2% |
| Assist | 23,158 | 37.6% |
| Escalate | 36,433 | 59.2% |
| Total labeled | 61,564 | 100% |
All commits referencing a Jira key are extracted in a single pass:
git log --all --format="%H|%P|%s" --grep="SPR-"
This yields a pipe-delimited stream of (sha, parent_shas, subject) records, one per line. For merge commits, only the first parent is used. In our repository, this produced 11,424 matching commits.
The SPR key is extracted from the commit subject using the regular expression
\bSPR-(\d+)\b (case-insensitive). The resulting key is normalised to
uppercase SPR-{N}. Commits are matched against the set of labeled tickets
from Section 2. Of 11,424 commits, 2,234 referenced a labeled key; after deduplication
(keeping the most recent commit per key), 1,581 unique keys remained.
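A minimal sketch of the key extraction and deduplication, under stated assumptions: the function names are hypothetical, only the first SPR key in each subject is used, and the input lines are assumed to arrive newest first (the default order of git log), so the first commit seen for a key is treated as the most recent one.

```python
import re

SPR_KEY = re.compile(r"\bSPR-(\d+)\b", re.IGNORECASE)


def parse_log_line(line):
    """Split one `%H|%P|%s` record into (sha, first_parent, subject)."""
    # maxsplit=2 keeps any '|' characters inside the subject intact.
    sha, parents, subject = line.rstrip("\n").split("|", 2)
    first_parent = parents.split()[0] if parents else None
    return sha, first_parent, subject


def index_commits(lines, labeled_keys):
    """Map each labeled key to the first commit seen for it."""
    index = {}
    for line in lines:
        sha, parent, subject = parse_log_line(line)
        match = SPR_KEY.search(subject)
        if match is None:
            continue
        key = f"SPR-{match.group(1)}"  # normalised to uppercase
        if key in labeled_keys and key not in index:
            index[key] = {"sha": sha, "parent": parent, "subject": subject}
    return index
```

If strict recency is required regardless of traversal order, the commit date (`%ct`) could be added to the format string and compared explicitly.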
For each matched commit, changed files are retrieved:
git diff-tree --no-commit-id -r --name-only {sha}
A file is retained if and only if its path ends in .java, its path does not contain /test/ or /test-, and its file name does not contain Test (e.g., FooTests.java is excluded).
Commits with zero retained files are discarded. This filter removed 583 commits (mostly documentation-only or test-only changes), leaving 998 tickets in the final git index.
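The retention rule can be written as a single predicate. The sketch below encodes one reading of the rule (a .java extension, no /test/ or /test- in the path, no Test in the file name); the function name `is_retained` is hypothetical and the exact matching semantics in build_git_index.py may differ.

```python
from pathlib import PurePosixPath


def is_retained(path):
    """True for non-test Java sources under the assumed filter rules."""
    p = PurePosixPath(path)
    if p.suffix != ".java":
        return False  # documentation, build files, resources
    if "/test/" in path or "/test-" in path:
        return False  # files under a test source tree
    return "Test" not in p.name  # e.g. FooTests.java is excluded
```

A commit is then kept only if `any(is_retained(f) for f in changed_files)` holds.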
| Label | In git index | % of labeled corpus | Files per commit (median) |
|---|---|---|---|
| Automate | 5 | 0.3% | 1 |
| Assist | 281 | 1.2% | 1 |
| Escalate | 712 | 2.0% | 1 |
| Total | 998 | 1.6% | 1 (mean 3.4) |
From the 998 tickets in the git index, a stratified random sample is drawn:
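import random
rng = random.Random(42)  # seed passed positionally; Random() has no `seed` keyword
sample = {}
for tier in pool:
    # draw n tickets per tier, or the whole pool when it is smaller than n
    sample[tier] = rng.sample(pool[tier], min(n, len(pool[tier])))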
For the primary evaluation reported here, n = 30 for Assist and Escalate,
and n = 5 (full pool) for Automate. All four models were evaluated on the
identical 65 tickets using the same random seed, ensuring that differences in
aggregate scores are attributable to the models rather than ticket selection.
The following system prompt was used verbatim for all models and all tickets:
You are a senior Spring Framework engineer doing a code review. You are given a Jira ticket and the current (unfixed) source file(s). Write the minimal code fix — show exactly which lines change. Format your fix as a unified diff or clearly mark old/new lines. Be specific. Do not ask for more information.
The user message is assembled as follows:
**Ticket:** {key}
**Summary:** {summary}
**Description:**
{description} (truncated to 2,000 characters)
**Source files (before fix):**
--- {file_path_1} ---
```java
{file_content_1} (truncated to 6,000 characters per file)
```
--- {file_path_2} --- (up to 3 files shown)
...
File content is retrieved at the parent commit (i.e., the state immediately before the fix was applied):
git show {parent_sha}:{file_path}
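The template and its truncation limits can be sketched as below. This is an illustration under assumptions: the function name `build_user_message` is hypothetical, and the exact line breaks and whitespace between sections are not specified in the template above.

```python
FENCE = "`" * 3  # built indirectly so the sketch contains no literal fence

MAX_DESC_CHARS = 2000   # description truncation limit
MAX_FILE_CHARS = 6000   # per-file truncation limit
MAX_FILES = 3           # at most three files are shown


def build_user_message(key, summary, description, files):
    """Assemble the user message; `files` is a list of (path, content)
    pairs holding each file as it was at the parent commit."""
    parts = [
        f"**Ticket:** {key}",
        f"**Summary:** {summary}",
        "**Description:**",
        (description or "")[:MAX_DESC_CHARS],
        "**Source files (before fix):**",
    ]
    for path, content in files[:MAX_FILES]:
        parts += [f"--- {path} ---", f"{FENCE}java",
                  content[:MAX_FILE_CHARS], FENCE]
    return "\n".join(parts)
```

Truncation here is by character count, as stated in the template; it may cut a file mid-statement.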
| Parameter | OpenAI models | Anthropic models |
|---|---|---|
| max_tokens | 1,500 | 1,500 |
| temperature | 0.3 | (default, not set) |
| Client library version | openai-python 1.x | anthropic-python 0.84.0 |
| System prompt delivery | messages[0].role="system" | system parameter |
The ground truth for each ticket is the unified diff between the parent commit and the fix commit:
git diff {parent_sha} {fix_sha}
This diff may include test files, documentation, and files outside the 1–3 shown to the model. The full diff is used for scoring regardless of what was shown in the prompt.
Let F be the set of Java class name stems extracted from the diff header lines
matching the pattern b/(.+\.java), e.g., b/…/RedisTemplate.java
yields stem RedisTemplate.
This is a weak positive test: the model receives credit if it mentions any of the correct class names anywhere in its response. It does not require correct usage.
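A minimal sketch of this class-name test follows. The function names are hypothetical, and two details are assumptions rather than facts from the source: stems are read from the `+++ b/...` header lines only, and a mention is detected by plain substring search.

```python
import re

# Assumed: stems come from the '+++ b/<path>.java' header of each file.
JAVA_HEADER = re.compile(r"^\+\+\+ b/(.+\.java)$", re.MULTILINE)


def class_stems(diff_text):
    """Return the set F of class-name stems touched by the diff."""
    stems = set()
    for path in JAVA_HEADER.findall(diff_text):
        file_name = path.rsplit("/", 1)[-1]
        stems.add(file_name[: -len(".java")])
    return stems


def mentions_any_class(response, diff_text):
    """True if the response names at least one stem in F."""
    return any(stem in response for stem in class_stems(diff_text))
```

Substring search means a short stem such as `Bean` would also match inside `BeanFactory`; a word-boundary match would be stricter.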
AbstractAutowireCapableBeanFactory). This artefact inflates the apparent
pass-rate disadvantage of GPT-4o relative to GPT-4o-mini. Token overlap is the more
reliable primary metric.
Let T(·) extract the set of identifier tokens from a text string, where an
identifier token matches the pattern [A-Za-z_][A-Za-z0-9_]{2,} (minimum
length 3 to exclude noise tokens such as if, for).
Let A be the set of added lines from the ground-truth diff (lines beginning
with +, excluding file-header lines beginning with +++).
This is the Jaccard similarity coefficient on identifier token sets. It ranges from 0 (no shared identifiers) to 1 (identical token sets). The metric is symmetric and does not require the model's response to be a valid diff.
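Using the definitions of T(·) and A above, the overlap can be sketched as the Jaccard coefficient between the identifier tokens of the model response and those of the added lines. The function names are hypothetical, and applying T to the full response text is an assumption.

```python
import re

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]{2,}")


def tokens(text):
    """T(.): the set of identifier tokens of length >= 3."""
    return set(IDENT.findall(text))


def added_lines(diff_text):
    """A: added lines of a unified diff, excluding '+++' file headers."""
    return "\n".join(
        line[1:] for line in diff_text.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    )


def token_overlap(response, diff_text):
    """Jaccard similarity |T(R) & T(A)| / |T(R) | T(A)|; 0.0 if both empty."""
    r, a = tokens(response), tokens(added_lines(diff_text))
    union = r | a
    return len(r & a) / len(union) if union else 0.0
```

For example, a response `return cache.lookup(key)` scored against the single added line `return cache.get(key);` shares three of five distinct identifiers, giving an overlap of 0.6.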
The threshold of 0.15 was chosen to exclude responses that name the correct file but produce generic boilerplate code unrelated to the actual fix. Sensitivity analysis at thresholds 0.10 and 0.20 does not change the rank ordering of models.
| Model | Pass rate | Overlap mean | Overlap SD | Tokens in (mean) | Tokens out (mean) |
|---|---|---|---|---|---|
| claude-opus-4-6 | 66% | 0.226 | 0.141 | 2,449 | 930 |
| claude-sonnet-4-6 | 69% | 0.227 | 0.134 | 2,449 | 1,047 |
| gpt-4o | 48% | 0.177 | 0.109 | 1,844 | 420 |
| gpt-4o-mini | 58% | 0.176 | 0.093 | 1,844 | 317 |
| Model | Tier | n | Pass | Overlap mean | Overlap SD |
|---|---|---|---|---|---|
| claude-opus-4-6 | Automate | 5 | 100% | 0.222 | 0.043 |
| claude-opus-4-6 | Assist | 30 | 60% | 0.213 | 0.131 |
| claude-opus-4-6 | Escalate | 30 | 67% | 0.239 | 0.161 |
| claude-sonnet-4-6 | Automate | 5 | 80% | 0.202 | 0.052 |
| claude-sonnet-4-6 | Assist | 30 | 60% | 0.218 | 0.137 |
| claude-sonnet-4-6 | Escalate | 30 | 77% | 0.240 | 0.142 |
| gpt-4o | Automate | 5 | 20% | 0.158 | 0.058 |
| gpt-4o | Assist | 30 | 50% | 0.176 | 0.106 |
| gpt-4o | Escalate | 30 | 50% | 0.180 | 0.121 |
| gpt-4o-mini | Automate | 5 | 60% | 0.165 | 0.045 |
| gpt-4o-mini | Assist | 30 | 67% | 0.185 | 0.087 |
| gpt-4o-mini | Escalate | 30 | 50% | 0.168 | 0.104 |
All four models were run on the identical 65 tickets (same random seed). For each ticket, we compute the range of token overlap scores across the four models:
The high within-ticket variance indicates that ticket difficulty is the dominant source of variance, not model identity. A more powerful evaluation design would increase n rather than add models.
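The per-ticket range, together with a between-ticket spread for comparison, can be computed as in the sketch below. The function names and the nested-dictionary layout of `scores` are assumptions for illustration; they are not taken from eval_compare.py.

```python
from statistics import mean, pstdev


def per_ticket_range(scores):
    """scores: {ticket_key: {model_name: overlap}} -> {ticket_key: max - min}."""
    return {key: max(by_model.values()) - min(by_model.values())
            for key, by_model in scores.items()}


def between_ticket_sd(scores):
    """Population SD of the per-ticket mean overlap across tickets."""
    return pstdev([mean(by_model.values()) for by_model in scores.values()])
```

Comparing the typical per-ticket range with the between-ticket spread gives a rough view of how much variation comes from models and how much from tickets.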
The following steps reproduce the benchmark from scratch using only public data sources. No access to the authors' infrastructure is required.
# 1. Clone the Spring Framework repository
git clone https://github.com/spring-projects/spring-framework.git \
.cache/spring-framework
# 2. Download Spring Jira tickets via the public REST API
# (substitute your own scraper; no auth required)
# Target collection: project=SPR, all resolved tickets
# 3. Install dependencies
pip install anthropic openai pymongo numpy
# 4. Reproduce the tier labels
python build_outcome_cache.py # writes .cache/outcome_signals.json
# 5. Build the git index
python build_git_index.py # writes .cache/git_index.json
# 6. Run the evaluation (choose provider and model)
python run_eval.py \
--provider anthropic \
--model claude-sonnet-4-6 \
--n 30 --seed 42
# 7. Compare results across models
python eval_compare.py
The calibration thresholds in build_outcome_cache.py (P33_DAYS = 1.9,
P75_ASSIGNEE_CNT = 136) are fixed constants derived from the authors' corpus. If a
different Jira corpus is used, these should be recomputed from the empirical distribution.
The scoring rule in Section 2.3 remains valid with any thresholds, provided they are
specified before any labelling is performed.
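For a different corpus, the two named constants could be recomputed along these lines. This is a sketch under assumptions: the function names are hypothetical, the inclusive percentile convention is a choice made here rather than the authors' documented method, and whether the assignee percentile is taken over tickets or over distinct assignees is not stated in the source.

```python
from statistics import quantiles


def percentile(values, p):
    """p-th percentile (1..99), inclusive method; an assumed convention."""
    return quantiles(values, n=100, method="inclusive")[p - 1]


def recompute_thresholds(days, assignee_counts):
    """Return replacement values for the two fixed constants."""
    return {
        "P33_DAYS": percentile(days, 33),
        "P75_ASSIGNEE_CNT": percentile(assignee_counts, 75),
    }
```

The returned values should be written into build_outcome_cache.py before any labelling is run, in line with the pre-specification requirement above.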
All evaluation code is released under the MIT licence:
https://github.com/JussiTu/task2vec
Relevant files:
| File | Purpose |
|---|---|
| build_outcome_cache.py | Signal extraction and tier labelling |
| build_git_index.py | Git commit matching and file extraction |
| run_eval.py | Prompt construction and model evaluation |
| eval_compare.py | Multi-model result aggregation |
| eval_report.py | Single-model report generation |
| .cache/eval_results_*.json | Raw results for all four models |
| .cache/git_index.json | Benchmark ticket–commit mapping (998 entries) |
The raw Jira dump is not redistributed due to size (~2 GB). The
outcome_signals.json file (61,564 labeled tickets, 4.6 MB) is not
included in the repository but can be regenerated from the public Jira API using
build_outcome_cache.py.