5-phase onboarding protocol — target: κ ≥ 0.70 in ≤7 days
Rater reads the full rubric documentation including all four dimension definitions and anchor examples. Watches two annotated training videos: one passing AI-generated synopsis (score 3.6) and one failing (score 1.8), with expert voiceover commentary explaining each dimension score and the rationale behind it. Videos are the single highest-leverage onboarding investment — they show what good and bad look like in context, which written anchors alone cannot fully convey.
Exit gate: 10-item comprehension quiz, score ≥ 80% required to advance. Topics: dimension definitions, score interpretation, edge case recognition, safety policy basics.
Rater scores 20 gold-standard items with known correct scores. Immediate feedback after each item: rater sees the correct score and the expert rationale at each dimension. Items are stratified across all four scoring dimensions and deliberately include edge cases — an accurate but tonally flat synopsis, an engaging synopsis with a subtle factual error, a safety-adjacent item that is technically compliant but requires careful reading. The immediate feedback loop is what makes this phase different from just doing more tasks.
Exit gate: Exact agreement ≥ 60% on gold set. Adjacent agreement (±1 point) ≥ 85%. If not met after 20 items, rater completes 10-item remediation set before advancing.
Rater scores 30 items alongside a calibrated senior rater, with no feedback during the session. After completion, rater receives a written debrief report showing: dimension-by-dimension agreement breakdown, all disagreements of 2+ points with the expert score, and systematic pattern identification (e.g., "You scored leniently on tone in 8 of 9 comedy items"). The written debrief gives raters a concrete, personalized account of where their calibration needs work — far more actionable than a single aggregate κ.
Exit gate: Cohen's κ ≥ 0.65 vs. senior rater. If not met, rater receives targeted remediation based on debrief findings and repeats Phase 3 with a new item set.
Rater enters the production queue with 15% sentinel items embedded (higher than steady-state because early production is the highest-variance period). Cohen's κ computed per 10-task rolling window. If κ drops below 0.65 in any window, rater is paused and receives a targeted re-calibration set before continuing. All disagreements in this phase are logged for cohort-level pattern analysis.
Graduation gate: κ ≥ 0.70 sustained over a 50-item window. Rater is then considered fully calibrated and sentinel rate drops to steady-state 10%.
Steady-state sentinel rate: 10%. Monthly cohort recalibration session — group debrief on systematic drift patterns identified across the cohort over the prior month, with updated anchor examples for any edge cases that caused widespread disagreement. Individual drift alert triggers if rolling 20-item κ drops more than 0.05 below established baseline. Drift is addressed before it compounds.
Drift threshold: >0.05 drop over rolling 20-item window triggers immediate re-calibration session. Rater continues producing while session is scheduled (within 48 hours).
Cohort IRR trajectory — 8 calibration sessions
Cohort κ
Target κ = 0.70
Sessions 1–2: Rapid improvement as raters absorb anchor examples and anchored practice feedback.
Session 4 plateau: Corresponds to first cohort calibration debrief; corrections identified in debrief accelerated gains in sessions 5–6.
Session 7 dip: Three new raters added mid-cycle, pulling cohort κ down before their individual onboarding was complete. Corrected by session 8.