Outcome-Derived Tier Labels as Predictors of AI Code-Fix Quality: A Replication Study Using the Spring Framework Git History
task2vec.com research
February 2026 · Version 1.0
Abstract
We present an evaluation methodology and benchmark for measuring the ability of large language models (LLMs) to reproduce real-world Java bug fixes given an issue description and the pre-fix source files. Using 69,156 publicly archived Spring Framework Jira tickets and the complete git history of the spring-projects/spring-framework repository, we construct a ground-truth benchmark of 65 tickets for which both a structured ticket and the corresponding commit diff are available. Tickets are labelled into three AI-readiness tiers (Automate, Assist, Escalate) using a scoring rule derived entirely from observable resolution metadata (resolution time, watcher count, and assignee experience), with no LLM involvement. We evaluate four models (claude-opus-4-6, claude-sonnet-4-6, gpt-4o, gpt-4o-mini) on this benchmark and report two metrics: binary pass rate and continuous token overlap. Claude Sonnet and Opus outperform GPT-4o and GPT-4o-mini by approximately 28% on token overlap (0.227 vs. 0.177). Within each provider family, the larger model shows no improvement in mean token overlap (0.226 vs. 0.227 for Claude; 0.177 vs. 0.176 for GPT). All code and raw results are publicly available.

Data Sources and Availability

1.1 Spring Framework Jira Archive

The primary ticket corpus is the Spring Framework issue tracker, historically hosted at jira.spring.io under project key SPR. This tracker is publicly readable without authentication. The full ticket set spans SPR-1 through approximately SPR-18000, covering the period 2003–2019, after which the project migrated to GitHub Issues.

To reproduce the corpus, query the Jira REST API iteratively:

GET https://jira.spring.io/rest/api/2/search
    ?jql=project%3DSPR%20AND%20resolution%3DFixed
    &fields=key,summary,description,created,resolutiondate,
            watches,assignee,status,issuetype
    &maxResults=100
    &startAt={offset}

Increment startAt by 100 until startAt ≥ total from the response. In our corpus, 69,156 documents were collected; 61,564 had both a created and resolutiondate field populated and were retained for labelling.

Note for replicators: As of early 2026, jira.spring.io redirects most traffic to GitHub Issues for newer tickets, but historical SPR-#### records remain accessible via the REST API. Rate limiting applies; use a delay of ≥ 0.5 s between requests. Estimated download time: 4–6 hours.
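
The pagination loop described above can be sketched as follows. This is a minimal illustration, not the authors' scraper: fetch_page is a hypothetical caller-supplied function that performs the HTTP GET and returns the decoded JSON body, and the sketch assumes the response carries the issues and total fields of the standard Jira search response.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://jira.spring.io/rest/api/2/search"
FIELDS = ("key,summary,description,created,resolutiondate,"
          "watches,assignee,status,issuetype")


def search_url(offset, page_size=100):
    """Build the search URL for the page starting at `offset`."""
    query = {
        "jql": "project=SPR AND resolution=Fixed",
        "fields": FIELDS,
        "maxResults": page_size,
        "startAt": offset,
    }
    return BASE_URL + "?" + urlencode(query)


def fetch_all(fetch_page, page_size=100, delay=0.5):
    """Collect all issues, advancing startAt until it reaches `total`.

    fetch_page(url) is a hypothetical helper: it must return the parsed
    JSON response (a dict with "issues" and "total") for the given URL.
    """
    issues, offset = [], 0
    while True:
        page = fetch_page(search_url(offset, page_size))
        issues.extend(page.get("issues", []))
        offset += page_size
        if offset >= page["total"]:
            return issues
        time.sleep(delay)  # pause between requests to respect rate limiting
```

Passing the HTTP call in as a function keeps the loop independent of any particular HTTP client and makes it easy to exercise against canned responses before starting a multi-hour download.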

1.2 Spring Framework Git Repository

The full git history of the Spring Framework is publicly available:

git clone https://github.com/spring-projects/spring-framework.git

The repository is approximately 244 MB (compressed). No authentication is required. The evaluation uses commits from the full history, including all branches (--all flag). In our checkout (February 2026), the repository contained 11,424 commits whose subject lines matched the pattern SPR-\d+.

Outcome-Based Tier Labelling

2.1 Signal Extraction

Three observable signals are extracted per ticket. No natural language processing or LLM inference is used at this stage.

| Signal | Source field | Computation |
|---|---|---|
| days | fields.resolutiondate, fields.created | max(0, (resolutiondate − created).total_seconds() / 86400). Both timestamps parsed as UTC-aware datetimes. |
| watches | fields.watches.watchCount | Integer; missing values treated as 0. |
| assignee_count | fields.assignee.displayName | Count of all tickets (resolved or not) in the corpus assigned to the same displayName. A proxy for contributor experience within this project. |
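
A minimal sketch of the three extractions, assuming each ticket is a dict shaped like a Jira REST issue (a fields object inside each issue). The timestamp layout in JIRA_TS and the function names are assumptions made for illustration; how unassigned tickets are handled is not stated in the text, so this sketch simply leaves them out of the counts.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout, e.g. "2010-01-05T09:30:00.000+0000"
JIRA_TS = "%Y-%m-%dT%H:%M:%S.%f%z"


def resolution_days(fields):
    """Days from creation to resolution, floored at zero."""
    created = datetime.strptime(fields["created"], JIRA_TS)
    resolved = datetime.strptime(fields["resolutiondate"], JIRA_TS)
    return max(0.0, (resolved - created).total_seconds() / 86400)


def watch_count(fields):
    """Watcher count; a missing or null value is treated as 0."""
    return int((fields.get("watches") or {}).get("watchCount") or 0)


def assignee_counts(issues):
    """Number of tickets per assignee displayName across the corpus."""
    return Counter(
        issue["fields"]["assignee"]["displayName"]
        for issue in issues
        if issue["fields"].get("assignee")  # skip unassigned (assumption)
    )
```

Because both timestamps carry an explicit offset, the subtraction is timezone-safe regardless of the offset each value was recorded in.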

2.2 Calibration Thresholds

Thresholds are computed from the empirical distribution of the full 61,564-ticket corpus and are fixed constants — not tuned per experiment:

p33_days = 1.9 days (33rd percentile of days, resolved tickets)
p67_days = 29.5 days (67th percentile; informational only)
p75_assignee_cnt = 136 tickets (75th percentile of assignee_count)
watch_threshold = 2 (fixed, not percentile-derived)
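
For replicators who need to recompute the percentile thresholds on a different corpus, one possible approach uses the standard library. The interpolation method used by the authors is not stated, so the linear ("inclusive") method here is an assumption, and percentile is a hypothetical helper name.

```python
from statistics import quantiles


def percentile(values, p):
    """p-th percentile (1..99), linearly interpolated between data points."""
    return quantiles(values, n=100, method="inclusive")[p - 1]
```

With this helper, p33_days would be percentile(days_values, 33) and p75_assignee_cnt would be percentile(assignee_count_values, 75), where both input lists are built from the labelled corpus.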

2.3 Scoring Rule

Each ticket receives an integer score s ∈ {0, 1, 2, 3}:

s = I(days ≤ 1.9) + I(watches ≤ 2) + I(assignee_count < 136)

where I(·) is the indicator function (1 if true, 0 if false).

Label:
s = 3 → Automate
s = 2 → Assist
s ≤ 1 → Escalate
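
The rule translates directly into code. The sketch below uses the constant names P33_DAYS and P75_ASSIGNEE_CNT that appear in build_outcome_cache.py; WATCH_THRESHOLD and the two function names are illustrative choices and may not match the released implementation.

```python
P33_DAYS = 1.9
WATCH_THRESHOLD = 2
P75_ASSIGNEE_CNT = 136


def tier_score(days, watches, assignee_count):
    """Sum of the three indicator terms; result is in {0, 1, 2, 3}."""
    return (
        int(days <= P33_DAYS)
        + int(watches <= WATCH_THRESHOLD)
        + int(assignee_count < P75_ASSIGNEE_CNT)
    )


def tier_label(score):
    """Map the integer score to its tier name."""
    if score == 3:
        return "Automate"
    if score == 2:
        return "Assist"
    return "Escalate"
```

Note the asymmetry in the comparisons: the days and watches terms are inclusive (≤), while the assignee term is strict (<), so a ticket whose assignee has exactly 136 tickets does not earn the third point.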

2.4 Label Distribution

| Label | Count | % of corpus |
|---|---|---|
| Automate | 1,973 | 3.2% |
| Assist | 23,158 | 37.6% |
| Escalate | 36,433 | 59.2% |
| Total labeled | 61,564 | 100% |

Git Benchmark Construction

3.1 Commit Extraction

All commits referencing a Jira key are extracted in a single pass:

git log --all --format="%H|%P|%s" --grep="SPR-"

This yields a pipe-separated stream of (sha, parent_shas, subject), one commit per line. For merge commits, only the first parent is used. In our repository, this produced 11,424 matching commits.
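
A short sketch of how one line of the %H|%P|%s output might be parsed. The function name is hypothetical. Splitting at most twice keeps any | characters inside the subject intact, and an empty parent field (a root commit) is mapped to None.

```python
def parse_log_line(line):
    """Split one '%H|%P|%s' line into (sha, first_parent, subject)."""
    sha, parents, subject = line.rstrip("\n").split("|", 2)
    parent_list = parents.split()  # %P lists parent hashes separated by spaces
    first_parent = parent_list[0] if parent_list else None
    return sha, first_parent, subject
```

For a merge commit the %P field holds several hashes; taking the first element implements the first-parent rule stated above.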

3.2 Key Matching

The SPR key is extracted from the commit subject using the regular expression \bSPR-(\d+)\b (case-insensitive). The resulting key is normalised to uppercase SPR-{N}. Commits are matched against the set of labeled tickets from Section 2. Of 11,424 commits, 2,234 referenced a labeled key; after deduplication (keeping the most recent commit per key), 1,581 unique keys remained.
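
The matching and deduplication steps can be sketched as below. Two assumptions are made for illustration: only the first key found in a subject is used, and each commit carries a numeric commit timestamp (not part of the log format shown earlier) that decides which commit is the most recent. The function names are hypothetical.

```python
import re

SPR_KEY = re.compile(r"\bSPR-(\d+)\b", re.IGNORECASE)


def extract_key(subject):
    """Return the normalised key 'SPR-{N}', or None if the subject has none."""
    match = SPR_KEY.search(subject)
    return f"SPR-{match.group(1)}" if match else None


def latest_commit_per_key(commits, labeled_keys):
    """Keep the most recent commit for each labeled key.

    commits: iterable of (sha, timestamp, subject) tuples; the timestamp
    is an assumed extra field, e.g. a Unix commit time.
    """
    best = {}
    for sha, timestamp, subject in commits:
        key = extract_key(subject)
        if key in labeled_keys and (key not in best or timestamp > best[key][1]):
            best[key] = (sha, timestamp)
    return {key: sha for key, (sha, _) in best.items()}
```

The word boundaries in the pattern prevent partial matches such as XSPR-42, while the case-insensitive flag accepts lower-case references like spr-42.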

3.3 File Filtering

For each matched commit, changed files are retrieved:

git diff-tree --no-commit-id -r --name-only {sha}

A file is retained if and only if:

  1. The filename ends in .java
  2. The path does not contain /test/ or /test-
  3. The filename stem does not contain Test (so FooTests.java, for example, is excluded)

Commits with zero retained files are discarded. This filter removed 583 commits (mostly documentation-only or test-only changes), leaving 998 tickets in the final git index.
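
The three retention conditions amount to a single predicate. The sketch below applies them literally as substring checks on a repository-relative path; the function name is an illustrative choice.

```python
from pathlib import PurePosixPath


def is_production_java(path):
    """True if the path is a Java source file outside the test code."""
    stem = PurePosixPath(path).stem  # filename without the .java extension
    return (
        path.endswith(".java")
        and "/test/" not in path
        and "/test-" not in path
        and "Test" not in stem
    )
```

A commit would then be kept only if any(is_production_java(p) for p in changed_paths) holds, where changed_paths is the output of the diff-tree command above.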

3.4 Git Index Distribution

| Label | In git index | % of labeled corpus | Files per commit (median) |
|---|---|---|---|
| Automate | 5 | 0.3% | 1 |
| Assist | 281 | 1.2% | 1 |
| Escalate | 712 | 2.0% | 1 |
| Total | 998 | 1.6% | 1 (mean 3.4) |

Important limitation: The extremely low coverage of the Automate tier (5 tickets, 0.3%) is a structural property of the Spring project, not a data quality problem. Automate tickets (characterised by fast resolution, low watcher count, and less experienced contributors) correspond almost exclusively to documentation, configuration, or sample application changes that produce no production Java source changes. This means the Automate tier cannot be meaningfully evaluated with this benchmark. Results for Automate (n=5) should be treated as illustrative only.

Evaluation Sample Selection

From the 998 tickets in the git index, a stratified random sample is drawn:

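import random

rng = random.Random(42)  # seed passed positionally; Random() does not accept a seed= keyword
sample = {}
for tier in pool:  # pool maps each tier name to its list of tickets in the git index
    sample[tier] = rng.sample(pool[tier], min(n, len(pool[tier])))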

For the primary evaluation reported here, n = 30 for Assist and Escalate, and n = 5 (full pool) for Automate. All four models were evaluated on the identical 65 tickets using the same random seed, ensuring that differences in aggregate scores are attributable to the models rather than ticket selection.

Evaluation Protocol

5.1 System Prompt

The following system prompt was used verbatim for all models and all tickets:

You are a senior Spring Framework engineer doing a code review.
You are given a Jira ticket and the current (unfixed) source file(s).
Write the minimal code fix — show exactly which lines change.
Format your fix as a unified diff or clearly mark old/new lines.
Be specific. Do not ask for more information.

5.2 User Message Construction

The user message is assembled as follows:

**Ticket:** {key}
**Summary:** {summary}
**Description:**
{description}    (truncated to 2,000 characters)

**Source files (before fix):**

--- {file_path_1} ---
```java
{file_content_1}    (truncated to 6,000 characters per file)
```

--- {file_path_2} ---        (up to 3 files shown)
...

File content is retrieved at the parent commit (i.e., the state immediately before the fix was applied):

git show {parent_sha}:{file_path}
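
The assembly and truncation rules above can be sketched as a single function. The function and parameter names are hypothetical, and the exact whitespace between sections in the released run_eval.py may differ from this illustration.

```python
FENCE = "`" * 3  # Markdown code fence, built here to avoid nesting literal fences


def build_user_message(key, summary, description, files,
                       max_desc=2000, max_file=6000, max_files=3):
    """Assemble the prompt body.

    files: list of (path, content) pairs read at the parent commit.
    """
    parts = [
        f"**Ticket:** {key}",
        f"**Summary:** {summary}",
        "**Description:**",
        (description or "")[:max_desc],  # truncate to 2,000 characters
        "",
        "**Source files (before fix):**",
    ]
    for path, content in files[:max_files]:  # at most 3 files are shown
        parts += ["", f"--- {path} ---", f"{FENCE}java",
                  content[:max_file], FENCE]  # 6,000 characters per file
    return "\n".join(parts)
```

Truncation here counts characters, not tokens, which matches the limits stated in the template.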

5.3 Model Parameters

| Parameter | OpenAI models | Anthropic models |
|---|---|---|
| max_tokens | 1,500 | 1,500 |
| temperature | 0.3 | (default, not set) |
| API client version | openai-python 1.x | anthropic-python 0.84.0 |
| System prompt delivery | messages[0].role="system" | system parameter |

5.4 Ground Truth

The ground truth for each ticket is the unified diff between the parent commit and the fix commit:

git diff {parent_sha} {fix_sha}

This diff may include test files, documentation, and files outside the 1–3 shown to the model. The full diff is used for scoring regardless of what was shown in the prompt.

Scoring Metrics

6.1 File Hit (binary)

Let F be the set of Java class name stems extracted from the diff header lines matching the pattern b/(.+\.java), e.g., b/…/RedisTemplate.java yields stem RedisTemplate.

file_hit = |F| > 0 AND ∃f ∈ F : f is a substring of model_answer

This is a weak positive test: the model receives credit if it mentions any of the correct class names anywhere in its response. It does not require correct usage.
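
A minimal sketch of the file_hit check. It assumes the relevant header lines are the "+++ b/..." lines of a unified diff; the text does not say which header lines the released code inspects, and the function names are illustrative.

```python
import re

JAVA_HEADER = re.compile(r"b/(.+\.java)")


def class_stems(diff_text):
    """Class name stems of the Java files touched by the diff."""
    stems = set()
    for line in diff_text.splitlines():
        if line.startswith("+++ "):  # assumed header line form
            match = JAVA_HEADER.search(line)
            if match:
                filename = match.group(1).rsplit("/", 1)[-1]
                stems.add(filename[: -len(".java")])
    return stems


def file_hit(diff_text, answer):
    """True if any touched class name appears anywhere in the answer."""
    return any(stem in answer for stem in class_stems(diff_text))
```

Because any() over an empty set is False, the |F| > 0 condition is satisfied implicitly: a diff that touches no Java files can never produce a hit.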

Known artefact: In 5 of 65 GPT-4o responses, file_hit was False despite non-trivial token overlap (range 0.13–0.73). Inspection showed that GPT-4o described changes without writing the full class name (e.g., "the factory class" rather than AbstractAutowireCapableBeanFactory). This artefact inflates the apparent pass-rate disadvantage of GPT-4o relative to GPT-4o-mini. Token overlap is the more reliable primary metric.

6.2 Token Overlap (continuous)

Let T(·) extract the set of identifier tokens from a text string, where an identifier token matches the pattern [A-Za-z_][A-Za-z0-9_]{2,} (minimum length 3 to exclude short noise tokens such as if and do; three-letter keywords such as for and int are still counted).

Let A be the set of added lines from the ground-truth diff (lines beginning with +, excluding file-header lines beginning with +++).

token_overlap = |T(A) ∩ T(answer)| / |T(A) ∪ T(answer)|

This is the Jaccard similarity coefficient on identifier token sets. It ranges from 0 (no shared identifiers) to 1 (identical token sets). The metric is symmetric and does not require the model's response to be a valid diff.
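
The metric can be sketched in a few lines. The function names are illustrative, and returning 0.0 when both token sets are empty is an assumption, since the text does not define that case.

```python
import re

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]{2,}")


def tokens(text):
    """Set of identifier tokens of length 3 or more."""
    return set(IDENT.findall(text))


def added_lines(diff_text):
    """Added lines of a unified diff, without the leading '+'."""
    return "\n".join(
        line[1:]
        for line in diff_text.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    )


def token_overlap(diff_text, answer):
    """Jaccard similarity between added-line tokens and answer tokens."""
    truth, guess = tokens(added_lines(diff_text)), tokens(answer)
    union = truth | guess
    return len(truth & guess) / len(union) if union else 0.0
```

As a small worked case, if the added lines contain the tokens {int, count, total} and the answer contains {count, total, extra}, the intersection has 2 elements and the union has 4, giving an overlap of 0.5.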

6.3 Pass (binary composite)

pass = file_hit AND token_overlap ≥ 0.15

The threshold of 0.15 was chosen to exclude responses that name the correct file but produce generic boilerplate code unrelated to the actual fix. Sensitivity analysis at thresholds 0.10 and 0.20 does not change the rank ordering of models.
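
A sketch of the composite criterion and of a threshold sweep like the sensitivity analysis mentioned above. The function names and the input layout (one pair per ticket) are assumptions for illustration.

```python
def passed(file_hit, token_overlap, threshold=0.15):
    """Composite pass: correct file named and overlap at or above threshold."""
    return file_hit and token_overlap >= threshold


def pass_rates(results, thresholds=(0.10, 0.15, 0.20)):
    """Pass rate at each threshold.

    results: list of (file_hit, token_overlap) pairs for one model.
    """
    return {
        t: sum(passed(hit, overlap, t) for hit, overlap in results) / len(results)
        for t in thresholds
    }
```

Running pass_rates once per model and comparing the resulting orderings at each threshold is one way to check whether the ranking depends on the 0.15 cut-off.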

Baseline Characteristics

7.1 Per-model summary statistics (n=65)

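| Model | Pass rate | Overlap mean | Overlap SD | Tokens in (mean) | Tokens out (mean) |
|---|---|---|---|---|---|
| claude-opus-4-6 | 66% | 0.226 | 0.141 | 2,449 | 930 |
| claude-sonnet-4-6 | 69% | 0.227 | 0.134 | 2,449 | 1,047 |
| gpt-4o | 48% | 0.177 | 0.109 | 1,844 | 420 |
| gpt-4o-mini | 58% | 0.176 | 0.093 | 1,844 | 317 |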

7.2 Per-model, per-tier statistics

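| Model | Tier | n | Pass | Overlap mean | Overlap SD |
|---|---|---|---|---|---|
| claude-opus-4-6 | Automate | 5 | 100% | 0.222 | 0.043 |
| claude-opus-4-6 | Assist | 30 | 60% | 0.213 | 0.131 |
| claude-opus-4-6 | Escalate | 30 | 67% | 0.239 | 0.161 |
| claude-sonnet-4-6 | Automate | 5 | 80% | 0.202 | 0.052 |
| claude-sonnet-4-6 | Assist | 30 | 60% | 0.218 | 0.137 |
| claude-sonnet-4-6 | Escalate | 30 | 77% | 0.240 | 0.142 |
| gpt-4o | Automate | 5 | 20% | 0.158 | 0.058 |
| gpt-4o | Assist | 30 | 50% | 0.176 | 0.106 |
| gpt-4o | Escalate | 30 | 50% | 0.180 | 0.121 |
| gpt-4o-mini | Automate | 5 | 60% | 0.165 | 0.045 |
| gpt-4o-mini | Assist | 30 | 67% | 0.185 | 0.087 |
| gpt-4o-mini | Escalate | 30 | 50% | 0.168 | 0.104 |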

7.3 Within-ticket variance across models

All four models were run on the identical 65 tickets (same random seed). For each ticket, we compute the range of token overlap scores across the four models:

mean within-ticket range = 0.151
max within-ticket range = 0.618

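The within-ticket spread is large relative to the differences between models: the mean range of 0.151 is about three times the gap between the provider-family means (0.227 vs. 0.177), and the per-model standard deviations across tickets (0.093–0.141) also exceed that gap. Ticket-level variation, rather than a consistent model effect, therefore dominates the variance in scores. A more powerful evaluation design would increase n rather than add models.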
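
The within-ticket range can be computed as sketched below. The input layout (one dict of per-ticket overlap scores for each model) and the function name are assumptions; the released result files may be organised differently.

```python
def within_ticket_ranges(scores):
    """Range (max minus min) of token overlap across models, per ticket.

    scores: dict mapping model name -> dict of ticket key -> token overlap.
    Only tickets scored by every model are included.
    """
    per_model = list(scores.values())
    shared = set.intersection(*(set(model_scores) for model_scores in per_model))
    return {
        key: max(m[key] for m in per_model) - min(m[key] for m in per_model)
        for key in shared
    }
```

The two summary figures are then the mean and the maximum of the returned values, for example sum(r.values()) / len(r) and max(r.values()) for r = within_ticket_ranges(scores).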

Limitations

  1. Automate tier underrepresentation (n=5). No statistically meaningful conclusions about the Automate tier can be drawn. The 5 available tickets are unrepresentative of Automate-labelled work in general (see Section 3.4).
  2. Temporal bias. All git commits in the benchmark date from 2009–2013, corresponding to the Spring 2.x–3.x era. The code style, API patterns, and package layout differ substantially from modern Spring 6.x. Model performance on current Spring code may differ from reported figures.
  3. Low coverage (1.6%). Only 1.6% of labeled tickets are included in the benchmark (998 of 61,564). Tickets without a corresponding code commit are excluded, creating selection bias toward tickets that were fixed by a code change rather than closed as duplicates or won't-fix.
  4. Token Jaccard as a proxy. The metric measures surface-level identifier overlap, not semantic correctness. A model can achieve a high overlap score by producing plausible-but-wrong code that reuses the same variable and method names. Conversely, a correct fix that uses different local variable names will receive a low score. The metric is intended as an inexpensive approximation; human evaluation or compilation-and-test scoring would be more definitive.
  5. File_hit artefact. As described in Section 6.1, the binary file_hit criterion penalises models that describe changes without naming the class explicitly. This artefact affects GPT-4o disproportionately. Pass rate results for GPT-4o should be interpreted with caution; token overlap is the recommended primary metric.
  6. Single project. All tickets and commits are from one project (Spring Framework). Generalisation to other languages, ecosystems, or issue tracker conventions is not established.
  7. max_tokens cap. Responses are capped at 1,500 tokens. Longer fixes may be truncated; this may disproportionately affect models that produce more verbose output (Sonnet and Opus averaged 1,047 and 930 output tokens respectively, approaching the cap on some tickets).

Replication Instructions

The following steps reproduce the benchmark from scratch using only public data sources. No access to the authors' infrastructure is required.

# 1. Clone the Spring Framework repository
git clone https://github.com/spring-projects/spring-framework.git \
          .cache/spring-framework

# 2. Download Spring Jira tickets via the public REST API
#    (substitute your own scraper; no auth required)
#    Target collection: project=SPR, all resolved tickets

# 3. Install dependencies
pip install anthropic openai pymongo numpy

# 4. Reproduce the tier labels
python build_outcome_cache.py   # writes .cache/outcome_signals.json

# 5. Build the git index
python build_git_index.py       # writes .cache/git_index.json

# 6. Run the evaluation (choose provider and model)
python run_eval.py \
    --provider anthropic \
    --model claude-sonnet-4-6 \
    --n 30 --seed 42

# 7. Compare results across models
python eval_compare.py

The calibration thresholds in build_outcome_cache.py (P33_DAYS = 1.9, P75_ASSIGNEE_CNT = 136) are fixed constants derived from the authors' corpus. If a different Jira corpus is used, these should be recomputed from the empirical distribution. The scoring rule in Section 2.3 remains valid with any thresholds, provided they are specified before any labelling is performed.

Code and Data Availability

All evaluation code is released under the MIT licence:

https://github.com/JussiTu/task2vec

Relevant files:

FilePurpose
build_outcome_cache.pySignal extraction and tier labelling
build_git_index.pyGit commit matching and file extraction
run_eval.pyPrompt construction and model evaluation
eval_compare.pyMulti-model result aggregation
eval_report.pySingle-model report generation
.cache/eval_results_*.jsonRaw results for all four models
.cache/git_index.jsonBenchmark ticket–commit mapping (998 entries)

The raw Jira dump is not redistributed due to size (~2 GB). The outcome_signals.json file (61,564 labeled tickets, 4.6 MB) is not included in the repository but can be regenerated from the public Jira API using build_outcome_cache.py.

