Evaluation Metrics Guide
boring-gemini V10.25 introduces an LLM-as-a-Judge evaluation system, together with statistical metrics for checking how far the judge's evaluations can be trusted.
📊 Core Metrics Overview
| Metric | What It Measures | Use Case | Range |
|---|---|---|---|
| Cohen's κ (Kappa) | Agreement between two raters | AI vs Human scoring | -1 ~ 1 |
| Spearman's ρ (Rho) | Rank correlation | Are rankings consistent? | -1 ~ 1 |
| F1 Score | Balance of precision and recall | Pass/Fail decisions | 0 ~ 1 |
| Position Consistency | Pairwise comparison stability | Is there position bias? | 0 ~ 1 |
🎯 Detailed Explanations
1️⃣ Cohen's Kappa (Agreement Metric)
Question: "Does AI scoring agree with human experts?"
from boring.judge.metrics import cohens_kappa
human_scores = [4, 3, 5, 2, 4]
ai_scores = [4, 3, 4, 2, 4] # 3rd differs (5 vs 4)
kappa = cohens_kappa(ai_scores, human_scores)
print(f"Kappa: {kappa:.2f}") # 0.71 - Substantial agreement
Interpretation:
| κ Value | Interpretation |
|---|---|
| > 0.8 | Almost perfect agreement |
| 0.6-0.8 | Substantial agreement ✅ |
| 0.4-0.6 | Moderate agreement |
| 0.2-0.4 | Fair agreement |
| < 0.2 | Slight agreement (below 0: worse than chance) |
Purpose: Validate whether AI evaluation can replace human review
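To make the number above less of a black box, here is a minimal sketch of how an unweighted Cohen's kappa can be computed by hand. The function name `cohens_kappa_sketch` is hypothetical and is not part of `boring.judge.metrics`; the library's own implementation may differ (for example, it could use a weighted variant). Returning 1.0 when both raters use a single identical label is an assumption made here, since kappa is formally undefined in that case.

```python
from collections import Counter


def cohens_kappa_sketch(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of labels.

    Hypothetical helper for illustration only.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of labels")
    n = len(rater_a)

    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected chance agreement, from each rater's own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Assumption: both raters used one identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


human_scores = [4, 3, 5, 2, 4]
ai_scores = [4, 3, 4, 2, 4]
kappa = cohens_kappa_sketch(ai_scores, human_scores)
print(f"Kappa: {kappa:.2f}")  # Kappa: 0.71
```

Here the raters agree on 4 of 5 items (0.80 observed), while their label frequencies alone would produce 0.32 agreement by chance, so kappa is (0.80 - 0.32) / (1 - 0.32) ≈ 0.71.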
2️⃣ Spearman's ρ (Correlation Metric)
Question: "Is the AI's ranking order the same as the human ranking?"
from boring.judge.metrics import spearmans_rho
human_ranks = [1, 2, 3, 4, 5]
ai_ranks = [1, 2, 3, 4, 5] # Perfect match
rho, p_value = spearmans_rho(ai_ranks, human_ranks)
print(f"Spearman ρ: {rho:.2f}") # 1.0 - Perfect correlation
Interpretation:
| ρ Value | Interpretation |
|---|---|
| > 0.9 | Strong correlation ✅ |
| 0.7-0.9 | Moderate correlation |
| 0.5-0.7 | Weak correlation |
| < 0.5 | Very weak or no correlation |
Purpose: Verify that the ranking is correct even when absolute scores differ
[!TIP] Spearman is ideal for ordinal data (like 1-5 ratings) because it only considers rank order, not absolute values.
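The point that only rank order matters can be illustrated with a small pure-Python sketch. It ranks both score lists (tied values share the average of their positions) and then takes the Pearson correlation of the ranks. The names `average_ranks` and `spearman_rho_sketch` are hypothetical and not part of `boring.judge.metrics`; unlike the library call shown earlier, this sketch does not compute a p-value.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho_sketch(xs, ys):
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length inputs with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# The AI's absolute scores differ from the human's, but the order is identical.
human_scores = [1, 2, 3, 4, 5]
ai_scores = [2, 3, 3.5, 4.5, 5]
rho_same_order = spearman_rho_sketch(ai_scores, human_scores)
rho_reversed = spearman_rho_sketch(ai_scores[::-1], human_scores)
print(f"Same order: {rho_same_order:.2f}")  # Same order: 1.00
print(f"Reversed:   {rho_reversed:.2f}")    # Reversed:   -1.00
```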
3️⃣ F1 Score (Classification Accuracy)
Question: "Is the AI's pass/fail judgment accurate?"
from boring.judge.metrics import f1_score
actual = [1, 1, 0, 1] # 1=pass, 0=fail
predicted = [1, 0, 0, 1] # AI predictions
f1 = f1_score(predicted, actual)
print(f"F1: {f1:.2f}") # 0.80
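Formula:
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
Purpose: Evaluate the quality of binary pass/fail classification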
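F1 is the harmonic mean of precision and recall, so it helps to see those two intermediate values for the example above. The following is a minimal sketch; `f1_sketch` and its `positive` parameter are hypothetical names, not the library's API, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def f1_sketch(predicted, actual, positive=1):
    """Return (precision, recall, f1) for one positive class.

    Hypothetical helper for illustration only.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)  # true positives
    fp = sum(p == positive and a != positive for p, a in pairs)  # false positives
    fn = sum(p != positive and a == positive for p, a in pairs)  # false negatives

    # Assumed convention: a zero denominator yields 0.0 rather than an error.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


actual = [1, 1, 0, 1]     # 1=pass, 0=fail
predicted = [1, 0, 0, 1]  # AI predictions
precision, recall, f1 = f1_sketch(predicted, actual)
print(f"Precision: {precision:.2f}")  # Precision: 1.00
print(f"Recall: {recall:.2f}")        # Recall: 0.67
print(f"F1: {f1:.2f}")                # F1: 0.80
```

The AI never passes a failing item (precision 1.00) but misses one of the three real passes (recall 0.67), which gives F1 = 0.80.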
4️⃣ Position Consistency
Question: "Is there position bias in pairwise comparison?"
from boring.judge.metrics import pairwise_metrics
comparisons = [
{"winner": "A", "position_consistent": True},
{"winner": "B", "position_consistent": True},
{"winner": "A", "position_consistent": False}, # Inconsistent
]
metrics = pairwise_metrics(comparisons)
print(f"Position Consistency: {metrics.position_consistency:.0%}") # 67%
Purpose: Detect position bias (a preference for whichever option is shown first)
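One plausible way to obtain the `position_consistent` flags is to run each comparison twice with the two answers swapped and check whether the same answer wins both times. The sketch below assumes that approach; `compare_both_orders`, `position_consistency`, and `toy_judge` are hypothetical names, and `toy_judge` is a deterministic stand-in for a real LLM judge, not how boring-gemini's judge works.

```python
def compare_both_orders(judge, answer_a, answer_b):
    """Run a pairwise judge twice, swapping which answer is shown first."""
    first_pass = judge(answer_a, answer_b)   # A shown first
    second_pass = judge(answer_b, answer_a)  # B shown first
    winner_1 = "A" if first_pass == "first" else "B"
    winner_2 = "B" if second_pass == "first" else "A"
    # Assumption: report the first-pass winner, and flag whether it survived the swap.
    return {"winner": winner_1, "position_consistent": winner_1 == winner_2}


def position_consistency(comparisons):
    """Fraction of comparisons whose winner did not change when order was swapped."""
    if not comparisons:
        raise ValueError("need at least one comparison")
    return sum(c["position_consistent"] for c in comparisons) / len(comparisons)


def toy_judge(first, second):
    """Stand-in judge: prefers the longer answer, but ties go to the first slot."""
    return "second" if len(second) > len(first) else "first"


pairs = [
    ("short", "a much longer answer"),  # B wins in both orders
    ("detailed reply", "ok"),           # A wins in both orders
    ("same", "size"),                   # tie: whoever is shown first wins
]
comparisons = [compare_both_orders(toy_judge, a, b) for a, b in pairs]
consistency = position_consistency(comparisons)
print(f"Position Consistency: {consistency:.0%}")  # Position Consistency: 67%
```

The third pair exposes the bias: the two answers are equally long, so the toy judge simply picks whichever one appears first, and the winner flips when the order is swapped.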
📈 When to Use Which Metric?
| Your Evaluation Task | Recommended Metric |
|---|---|
| Rate code 1-5 | Kappa + Spearman |
| Judge code Good/Bad | F1 Score |
| Compare two code snippets | Position Consistency |
| Check for AI bias | Bias Report |