
Evaluation Metrics Guide

boring-gemini V10.25 introduces a comprehensive LLM-as-a-Judge evaluation system with statistical metrics to verify evaluation quality.


📊 Core Metrics Overview

| Metric | What It Measures | Use Case | Range |
|--------|------------------|----------|-------|
| Cohen's κ (Kappa) | Agreement between two raters | AI vs Human scoring | -1 ~ 1 |
| Spearman's ρ (Rho) | Rank correlation | Are rankings consistent? | -1 ~ 1 |
| F1 Score | Classification accuracy | Pass/Fail decisions | 0 ~ 1 |
| Position Consistency | Pairwise comparison stability | Is there position bias? | 0 ~ 1 |

🎯 Detailed Explanations

1️⃣ Cohen's Kappa (Agreement Metric)

Question: "Does AI scoring agree with human experts?"

```python
from boring.judge.metrics import cohens_kappa

human_scores = [4, 3, 5, 2, 4]
ai_scores = [4, 3, 4, 2, 4]  # 3rd item differs (5 vs 4)

kappa = cohens_kappa(ai_scores, human_scores)
print(f"Kappa: {kappa:.2f}")  # 0.71 - Substantial agreement
```

Interpretation:

| κ Value | Interpretation |
|---------|----------------|
| > 0.8 | Almost perfect agreement |
| 0.6-0.8 | Substantial agreement |
| 0.4-0.6 | Moderate agreement |
| 0.2-0.4 | Fair agreement |
| < 0.2 | Slight agreement |

Purpose: Validate whether AI evaluation can replace human review
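
For intuition, here is a minimal, dependency-free sketch of how unweighted Cohen's κ can be computed by hand; it reproduces the 0.71 above (observed agreement 0.80, chance agreement 0.32). This is an illustration, not the boring-gemini implementation.

```python
from collections import Counter

def unweighted_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa, computed from scratch for illustration."""
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters give the same score
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: for each category, multiply the two marginal proportions
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(rater_a) | set(rater_b))
    return (p_o - p_e) / (1 - p_e)

ai_scores = [4, 3, 4, 2, 4]
human_scores = [4, 3, 5, 2, 4]
print(f"{unweighted_kappa(ai_scores, human_scores):.2f}")  # 0.71
```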


2️⃣ Spearman's ρ (Correlation Metric)

Question: "Is the AI's ranking order the same as the human ranking?"

```python
from boring.judge.metrics import spearmans_rho

human_ranks = [1, 2, 3, 4, 5]
ai_ranks = [1, 2, 3, 4, 5]  # Perfect match

rho, p_value = spearmans_rho(ai_ranks, human_ranks)
print(f"Spearman ρ: {rho:.2f}")  # 1.0 - Perfect correlation
```

Interpretation:

| ρ Value | Interpretation |
|---------|----------------|
| > 0.9 | Strong correlation |
| 0.7-0.9 | Moderate correlation |
| 0.5-0.7 | Weak correlation |
| < 0.5 | No significant correlation |

Purpose: Verify that the ranking is correct even if absolute scores differ

> [!TIP]
> Spearman is ideal for ordinal data (like 1-5 ratings) because it only considers rank order, not absolute values.
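
If SciPy happens to be available in your environment (an assumption, not a boring-gemini dependency), the same idea can be cross-checked with scipy.stats.spearmanr, which ranks raw scores internally:

```python
from scipy.stats import spearmanr

# Absolute scores differ, but the rank order is identical,
# so Spearman's rho is still 1.0.
human_scores = [2, 3, 3.5, 4, 5]
ai_scores = [1, 2, 3, 4, 5]

rho, p_value = spearmanr(ai_scores, human_scores)
print(f"rho={rho:.2f}")  # 1.00
```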


3️⃣ F1 Score (Classification Accuracy)

Question: "Is the AI's pass/fail judgment accurate?"

```python
from boring.judge.metrics import f1_score

actual = [1, 1, 0, 1]     # 1 = pass, 0 = fail
predicted = [1, 0, 0, 1]  # AI predictions

f1 = f1_score(predicted, actual)
print(f"F1: {f1:.2f}")  # 0.80
```

Formula:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Purpose: Evaluate binary classification accuracy
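
A minimal sketch of that formula applied to the example above, in plain Python with no boring-gemini APIs, just to make the 0.80 concrete:

```python
def f1_from_labels(predicted, actual):
    """Precision, recall, and F1 for binary labels (1 = pass, 0 = fail)."""
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))  # true positives
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))  # false positives
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_from_labels([1, 0, 0, 1], [1, 1, 0, 1]))  # 0.8 (precision = 1.0, recall ≈ 0.67)
```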


4️⃣ Position Consistency

Question: "Is there position bias in pairwise comparisons?"

```python
from boring.judge.metrics import pairwise_metrics

comparisons = [
    {"winner": "A", "position_consistent": True},
    {"winner": "B", "position_consistent": True},
    {"winner": "A", "position_consistent": False},  # Inconsistent
]

metrics = pairwise_metrics(comparisons)
print(f"Position Consistency: {metrics.position_consistency:.0%}")  # 67%
```

Purpose: Detect position bias (a systematic preference for whichever option is presented first)
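
The score itself is simply the fraction of comparisons whose winner survives a position swap. A minimal sketch (the comparison dicts mirror the ones above; the helper function is illustrative, not part of boring-gemini):

```python
def position_consistency(comparisons):
    """Fraction of pairwise comparisons whose winner stays the same
    when the two candidates are presented in the opposite order."""
    consistent = sum(c["position_consistent"] for c in comparisons)
    return consistent / len(comparisons)

comparisons = [
    {"winner": "A", "position_consistent": True},
    {"winner": "B", "position_consistent": True},
    {"winner": "A", "position_consistent": False},  # judge flipped after the swap
]
print(f"{position_consistency(comparisons):.0%}")  # 67%
```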


📈 When to Use Which Metric?

| Your Evaluation Task | Recommended Metric |
|----------------------|--------------------|
| Rate code 1-5 | Kappa + Spearman |
| Judge code Good/Bad | F1 Score |
| Compare two code snippets | Position Consistency |
| Check for AI bias | Bias Report |
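
For the first row, a hedged sketch of how the two recommended metrics might be combined, reusing the functions shown earlier (the 0.6 thresholds are illustrative, and this assumes spearmans_rho also accepts raw 1-5 scores rather than only explicit ranks):

```python
from boring.judge.metrics import cohens_kappa, spearmans_rho

human_scores = [4, 3, 5, 2, 4]
ai_scores = [4, 3, 4, 2, 4]

kappa = cohens_kappa(ai_scores, human_scores)
rho, p_value = spearmans_rho(ai_scores, human_scores)

# Illustrative acceptance rule: require substantial agreement AND a strong
# rank correlation before trusting the AI judge on 1-5 ratings.
if kappa >= 0.6 and rho >= 0.6:
    print("AI judge looks reliable for 1-5 ratings")
else:
    print("Keep a human reviewer in the loop")
```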

🔧 MCP Tool Usage

View Evaluation Metrics

```
boring_evaluation_metrics
```

View Bias Report

```
boring_bias_report
```

Natural Language Triggers

boring "show evaluation metrics"
boring "評估指標"
boring "show me the bias report"
boring "查看偏見報告"

📚 Further Reading