Artifacts
Three focused proof points for Netflix: accountable evaluation operations, repeatable scoring design, and rater calibration at scale.
Start here
Eval Ops Health Dashboard
A command-center view of evaluation delivery: active task health, blockers, milestones, time to inter-rater reliability (IRR), throughput, arbitration rate, and ownership (headline metrics sketched in code below).
- Shows project execution and stakeholder visibility
- Models bottleneck escalation and mitigation
- Connects quality metrics to delivery outcomes
Open Eval Ops →
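A minimal sketch of how the dashboard's headline operational metrics could be derived from task records. The `Task` fields, and the idea of dating the IRR gate per task, are illustrative assumptions rather than the dashboard's actual schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Task:
    # Illustrative task record; field names are assumptions, not the real schema.
    opened: date
    irr_passed: date | None  # date the rater pool first cleared the IRR gate
    items_scored: int
    items_arbitrated: int    # items escalated for a second or third judgment

def speed_to_irr_days(task: Task) -> int | None:
    """Days from task kickoff to first passing the IRR gate (None if not yet passed)."""
    if task.irr_passed is None:
        return None
    return (task.irr_passed - task.opened).days

def arbitration_rate(tasks: list[Task]) -> float:
    """Share of scored items that required arbitration, pooled across tasks."""
    scored = sum(t.items_scored for t in tasks)
    arbitrated = sum(t.items_arbitrated for t in tasks)
    return arbitrated / scored if scored else 0.0

tasks = [Task(date(2024, 5, 1), date(2024, 5, 9), items_scored=1200, items_arbitrated=84)]
print(speed_to_irr_days(tasks[0]))  # 8
print(arbitration_rate(tasks))      # 0.07
```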
Quality system
Scoring Rubric
A weighted rubric for AI-generated synopsis quality with dimension-level scoring, thresholds, policy guardrails, and live composite scoring (decision logic sketched in code below).
- Turns subjective quality into repeatable judgment
- Balances creative quality, relevance, and safety
- Defines arbitration and hard-block conditions
Open Rubric →
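A minimal sketch of the decision logic this card describes: weighted dimension scores roll up to a composite, a failing safety score hard-blocks the item regardless of the composite, and composites near the pass threshold route to arbitration. The dimension names, weights, and thresholds here are illustrative assumptions, not the rubric's actual values.

```python
# Illustrative weights and thresholds; the rubric's actual values differ.
WEIGHTS = {"creative_quality": 0.4, "relevance": 0.4, "safety": 0.2}
SAFETY_HARD_BLOCK = 2    # any safety score at or below this blocks the item outright
PASS_THRESHOLD = 3.5     # composite at or above this passes
ARBITRATION_BAND = 0.5   # composites this close to the threshold go to arbitration

def score(dims: dict[str, float]) -> tuple[float, str]:
    """Return (composite, decision) for per-dimension scores on a 1-5 scale."""
    if dims["safety"] <= SAFETY_HARD_BLOCK:
        return 0.0, "blocked"  # policy guardrail: safety failure is non-negotiable
    composite = sum(WEIGHTS[d] * s for d, s in dims.items())
    if abs(composite - PASS_THRESHOLD) < ARBITRATION_BAND:
        return composite, "arbitrate"  # too close to call: route to a second rater
    return composite, "pass" if composite >= PASS_THRESHOLD else "fail"

# A strong but borderline synopsis: composite ≈ 3.8, inside the arbitration band.
print(score({"creative_quality": 4, "relevance": 3, "safety": 5}))
```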
Human judgment
Rater Calibration Dashboard
A rater-health and onboarding workflow showing calibration gates, rolling Cohen's κ, sentinel QA, drift alerts, and remediation paths (κ computation sketched in code below).
- Demonstrates onboarding and calibration protocols
- Tracks cohort and individual rater alignment
- Prevents quality drift before it compounds
Open Calibration →
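Rolling Cohen's κ is the dashboard's core agreement statistic: κ = (p_o - p_e) / (1 - p_e), observed agreement corrected for the agreement two raters would reach by chance. Below is a minimal sketch of computing it over a sliding window of paired labels; the window size and toy labels are assumptions for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: chance-corrected agreement between two raters' labels."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / n**2
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

def rolling_kappa(a: list[str], b: list[str], window: int = 50) -> list[float]:
    """Kappa over each sliding window of recent paired judgments."""
    return [cohens_kappa(a[i:i + window], b[i:i + window])
            for i in range(len(a) - window + 1)]

gold  = ["pass", "fail", "pass", "pass", "fail", "pass"]
rater = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(rolling_kappa(gold, rater, window=4))  # [0.5, 0.5, 0.5]
```

A drift alert then reduces to checking whether the latest window's κ has fallen below the calibration gate.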
Role fit map
A fast way for reviewers to connect portfolio evidence to the core responsibilities in human evaluation and data operations.
| Responsibility | Where to look | Evidence shown |
| --- | --- | --- |
| Lead execution end-to-end: intake, scope, blockers, milestones, delivery status. | Eval Ops Health Dashboard | Task health, sprint tracker, active blockers, owner/due-date model. |
| Develop rubrics and guidelines: consistent scoring protocols for subjective AI output quality. | Scoring Rubric | Weighted dimensions, threshold logic, arbitration trigger, safety hard block. |
| Own rater calibration and QA: onboarding, gold sets, IRR gates, ongoing quality monitoring. | Rater Calibration Dashboard | Five-phase onboarding protocol, rolling κ, sentinel rate, drift remediation. |