Netflix AI Evaluation — Ops Health Dashboard

Active task health

Task              Volume  Throughput  κ live  Status      ETA
Synopsis quality  3,200   87%         0.74    On track    Apr 30
Search relevance  5,800   91%         0.71    On track    May 3
Rec explanation   2,100   72%         0.63    Blocked     May 10 ⚠
Metadata tagging  8,400   95%         0.82    On track    Apr 28
Safety review     1,100   84%         0.68    Monitoring  May 6

Active blockers & escalations

Rec explanation — κ stuck below 0.70

Three raters showing systematic leniency on the "helpfulness" dimension. κ has not cleared 0.70 after four calibration sessions. Delivery at risk.

View escalation plan ↗

Safety task — rubric gap on comedic violence

Raters splitting on items involving comedic violence in adult animated content. Rubric has no anchor example for this edge case. Causing ~30% of safety task arbitrations.

Draft anchor examples ↗

APAC vendor onboarding — localization behind

12 new raters expected May 1. Training materials not localized for Korean and Japanese cultural context. At current pace, cohort won't be calibrated until May 10.

Review mitigation ↗
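
For the Rec explanation blocker above, the two quantities in play are the agreement statistic κ and per-rater leniency. The dashboard does not say which κ variant it reports or how leniency was detected, so the following is only a minimal sketch under stated assumptions: pairwise Cohen's kappa on categorical labels, and leniency measured as a rater's mean signed offset from the per-item median score. The function names `cohen_kappa` and `leniency_offsets` and the sample scores are hypothetical; only the 0.70 threshold comes from the dashboard.

```python
from collections import Counter
from statistics import mean, median

KAPPA_TARGET = 0.70  # threshold quoted on the dashboard


def cohen_kappa(a, b):
    """Cohen's kappa for two raters labelling the same items (assumed variant)."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def leniency_offsets(scores_by_rater):
    """Mean signed gap between each rater's score and the per-item median.

    A clearly positive offset suggests the rater scores higher than the panel.
    """
    raters = list(scores_by_rater)
    n_items = len(scores_by_rater[raters[0]])
    medians = [
        median(scores_by_rater[r][i] for r in raters) for i in range(n_items)
    ]
    return {
        r: mean(s - m for s, m in zip(scores_by_rater[r], medians))
        for r in raters
    }


if __name__ == "__main__":
    # Hypothetical "helpfulness" scores on a 1-5 scale, three raters, four items.
    scores = {
        "rater_a": [3, 4, 2, 3],
        "rater_b": [3, 4, 2, 4],
        "rater_c": [5, 5, 4, 5],
    }
    print(leniency_offsets(scores))
    kappa = cohen_kappa(scores["rater_a"], scores["rater_c"])
    print("below target" if kappa < KAPPA_TARGET else "meets target")
```

A signed offset is used rather than an absolute one so that lenient and harsh raters are distinguishable; with more than two raters per item, a multi-rater statistic such as Fleiss' kappa may be the better fit.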

Milestone                                 Owner        Due     Progress                         Status
Rubric v2.2 — edge case expansion         K. Lovelace  Apr 30  Draft → Review → Publish (70%)   In review
  Comedy violence anchors + helpfulness examples
APAC cohort — 12 raters calibrated        Vendor ops   May 5   0 of 12 calibrated (10%)         At risk
  Korean (8) + Japanese (4)
Rec explanation — achieve κ ≥ 0.70        Eval ops     May 10  κ = 0.63 → target 0.70 (40%)     Blocked
  Leniency bias intervention required
Metadata tagging — 8,400 tasks delivered  Eval ops     Apr 28  8,200 / 8,400 complete (98%)     On track
  κ = 0.82 · On time
Onboarding protocol v2 — published        K. Lovelace  Apr 25  Published to all vendors (100%)  Complete
  5-phase protocol + calibration debrief template