Netflix AI Evaluation — Rater Calibration Dashboard

6.2 days

Median speed to IRR

↓ 2.1 days vs cohort 3

0.74

Cohort Cohen's κ

↑ 0.08 vs session 1

87 %

Throughput rate

— on target (≥85%)

Raters need attention

1 drift · 1 below κ · 1 onboarding

Rater health

Onboarding protocol

IRR trajectory

Active rater cohort — 7 of 14 shown

Raterκ scoreTrend (5 sessions)StatusAction

R-004 · Priya S.

0.81

0.61 → 0.69 → 0.74 → 0.78 → 0.81

Calibrated

Producing

R-007 · Marcus L.

0.77

0.58 → 0.63 → 0.71 → 0.74 → 0.77

Calibrated

Producing

R-002 · Wei C.

0.79

0.64 → 0.71 → 0.76 → 0.78 → 0.79

Calibrated

Producing

R-011 · Aiko T.

0.72

0.55 → 0.62 → 0.67 → 0.70 → 0.72

Calibrated

Producing

R-003 · James M.

0.68

0.71 → 0.75 → 0.72 → 0.70 → 0.68 ↓

Drift alert

Re-calibrate ↗

R-009 · Sofia R.

0.61

0.54 → 0.58 → 0.60 → 0.59 → 0.61

Below κ

Review ↗

R-013 · Dana K.

—

Day 3 of onboarding · Phase 2 in progress

Onboarding

Pending calibration

5-phase onboarding protocol — target: κ ≥ 0.70 in ≤7 days

Orientation — Day 1 · ~2 hours

Rater reads the full rubric documentation including all four dimension definitions and anchor examples. Watches two annotated training videos: one passing AI-generated synopsis (score 3.6) and one failing (score 1.8), with expert voiceover commentary explaining each dimension score and the rationale behind it. Videos are the single highest-leverage onboarding investment — they show what good and bad look like in context, which written anchors alone cannot fully convey.

Exit gate: 10-item comprehension quiz, score ≥ 80% required to advance. Topics: dimension definitions, score interpretation, edge case recognition, safety policy basics.

Anchored practice — Days 1–2 · 20 tasks

Rater scores 20 gold-standard items with known correct scores. Immediate feedback after each item: rater sees the correct score and the expert rationale at each dimension. Items are stratified across all four scoring dimensions and deliberately include edge cases — an accurate but tonally flat synopsis, an engaging synopsis with a subtle factual error, a safety-adjacent item that is technically compliant but requires careful reading. The immediate feedback loop is what makes this phase different from just doing more tasks.

Exit gate: Exact agreement ≥ 60% on gold set. Adjacent agreement (±1 point) ≥ 85%. If not met after 20 items, rater completes 10-item remediation set before advancing.

Blind calibration — Days 2–3 · 30 tasks

Rater scores 30 items alongside a calibrated senior rater, with no feedback during the session. After completion, rater receives a written debrief report showing: dimension-by-dimension agreement breakdown, all disagreements of 2+ points with the expert score, and systematic pattern identification (e.g., "You scored leniently on tone in 8 of 9 comedy items"). The written debrief gives raters a concrete, personalized account of where their calibration needs work — far more actionable than a single aggregate κ.

Exit gate: Cohen's κ ≥ 0.65 vs. senior rater. If not met, rater receives targeted remediation based on debrief findings and repeats Phase 3 with a new item set.

Monitored production — Days 3–7 · first 50 tasks

Rater enters the production queue with 15% sentinel items embedded (higher than steady-state because early production is the highest-variance period). Cohen's κ computed per 10-task rolling window. If κ drops below 0.65 in any window, rater is paused and receives a targeted re-calibration set before continuing. All disagreements in this phase are logged for cohort-level pattern analysis.

Graduation gate: κ ≥ 0.70 sustained over a 50-item window. Rater is then considered fully calibrated and sentinel rate drops to steady-state 10%.

Ongoing health monitoring — continuous

Steady-state sentinel rate: 10%. Monthly cohort recalibration session — group debrief on systematic drift patterns identified across the cohort over the prior month, with updated anchor examples for any edge cases that caused widespread disagreement. Individual drift alert triggers if rolling 20-item κ drops more than 0.05 below established baseline. Drift is addressed before it compounds.

Drift threshold: >0.05 drop over rolling 20-item window triggers immediate re-calibration session. Rater continues producing while session is scheduled (within 48 hours).