Research

Published work

Empirical studies on AI-assisted software engineering, built on roughly 69,000 Spring Framework Jira tickets and the spring-projects/spring-framework git history.

Papers & reports
Evaluation 2026
Can LLMs Fix Spring Tickets? A Four-Model Comparison
We tested GPT-4o, GPT-4o-mini, Claude Sonnet, and Claude Opus on 65 real Spring Framework bug tickets, scoring each against the actual commit that fixed the issue. Claude Sonnet achieved a 69% pass rate and 0.227 token overlap, 29% better than GPT. Includes per-tier breakdown, cost comparison, and key findings.
Methods Replication guide 2026
Methods & Replication: Git-Based Evaluation of LLM Code Generation
Scientific-grade methodology document. Covers corpus construction (exact Jira REST API endpoint), outcome-based tier labelling with indicator function notation, git benchmark extraction algorithm, evaluation protocol with verbatim prompts, Jaccard scoring definition, per-model per-tier statistics with standard deviations, 7 documented limitations, and a shell-script replication guide. Designed for independent replication without access to private data or proprietary models.
Analysis 2025
AI Readiness Distribution Across 69,000 Spring Tickets
Outcome-based tier classification of the full Spring Framework Jira archive using resolution time, watcher count, and assignee experience as ground-truth signals. 22% Automate · 30% Assist · 48% Escalate. Interactive chart showing how the Escalate fraction grows from 36% (2010) to 68% (2020) as easy work is consumed.
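The token overlap reported in the four-model comparison is described in the methods document as a Jaccard score. As a rough illustration only, Jaccard similarity between a model's output and the reference fix could be computed as below. The tokenizer (identifiers and integers) and the function names `tokenize` and `jaccard` are assumptions made for this sketch; the exact definition used in the study is the one in the methods document.

```python
import re


def tokenize(text: str) -> set[str]:
    # Assumption: tokens are identifiers and integer literals.
    # The study's actual tokenization may differ.
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\d+", text))


def jaccard(candidate: str, reference: str) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over the two token sets."""
    a, b = tokenize(candidate), tokenize(reference)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty inputs score 0.
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    generated = "return foo + bar"
    committed = "return foo - baz"
    print(jaccard(generated, committed))
```

Because the score is computed over sets, repeated tokens count once, so a long patch that reuses the same identifiers does not score higher than a short one with the same vocabulary.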
Dataset
69,156
Spring Framework Jira tickets, 2002–2023
998
Tickets matched to git commits (eval subset)
65
Tickets used in model evaluation (seed 42)
4
Models evaluated: GPT-4o, GPT-4o-mini, Sonnet, Opus
11,424
SPR-#### commits scanned in git history
$8.95
Total API cost across all model runs
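The figures above imply two mechanical steps: finding SPR-#### ticket keys in commit messages, and drawing a reproducible 65-ticket sample with seed 42. The sketch below shows one plausible way to do each. The regular expression, the function names, and the choice to sort keys before sampling are assumptions made for illustration; the extraction algorithm and sampling procedure actually used are the ones specified in the methods document.

```python
import random
import re

# Assumption: ticket keys appear in commit messages as "SPR-" plus digits.
SPR_KEY = re.compile(r"\bSPR-(\d+)\b")


def extract_ticket_keys(commit_message: str) -> list[str]:
    """Return every SPR-#### key mentioned in a commit message, in order."""
    return [f"SPR-{number}" for number in SPR_KEY.findall(commit_message)]


def sample_eval_subset(ticket_keys, k=65, seed=42):
    """Draw k ticket keys reproducibly from the matched set."""
    rng = random.Random(seed)
    # Sorting first makes the draw independent of input order.
    return rng.sample(sorted(ticket_keys), k)


if __name__ == "__main__":
    message = "Fix NPE in BeanFactory (SPR-1234, SPR-99)"
    print(extract_ticket_keys(message))

    matched = {f"SPR-{n}" for n in range(1, 999)}  # stand-in for 998 matched tickets
    subset = sample_eval_subset(matched)
    print(len(subset))
```

Using a dedicated `random.Random(seed)` instance rather than the module-level generator keeps the sample stable even if other code draws random numbers first.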
Want this on your data?
Commission a study on your own Jira
Same methodology, your tickets. Numbers your stakeholders will actually believe.
Get a quote →