Artifacts
Three focused proof points for Netflix: accountable evaluation operations, repeatable scoring design, and rater calibration at scale.
Start here
Eval Ops Health Dashboard
A command-center view of evaluation delivery: active task health, blockers, milestones, time to inter-rater reliability (IRR), throughput, arbitration rate, and ownership (headline metrics sketched in code below).
- Shows project execution and stakeholder visibility
- Models bottleneck escalation and mitigation
- Connects quality metrics to delivery outcomes
Open Eval Ops →
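A minimal sketch of how the dashboard's headline operational metrics could be derived from task records. The `Task` fields, and the idea of dating the IRR gate per task, are illustrative assumptions rather than the dashboard's actual schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Task:
    # Illustrative task record; field names are assumptions, not the real schema.
    opened: date
    irr_passed: date | None  # date the rater pool first cleared the IRR gate
    items_scored: int
    items_arbitrated: int    # items escalated for a second or third judgment

def speed_to_irr_days(task: Task) -> int | None:
    """Days from task kickoff to first passing the IRR gate (None if not yet passed)."""
    if task.irr_passed is None:
        return None
    return (task.irr_passed - task.opened).days

def arbitration_rate(tasks: list[Task]) -> float:
    """Share of scored items that required arbitration, pooled across tasks."""
    scored = sum(t.items_scored for t in tasks)
    arbitrated = sum(t.items_arbitrated for t in tasks)
    return arbitrated / scored if scored else 0.0

tasks = [Task(date(2024, 5, 1), date(2024, 5, 9), items_scored=1200, items_arbitrated=84)]
print(speed_to_irr_days(tasks[0]))  # 8
print(arbitration_rate(tasks))      # 0.07
```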
Quality system
Scoring Rubric
A weighted rubric for AI-generated synopsis quality with dimension-level scoring, thresholds, policy guardrails, and live composite scoring (decision logic sketched in code below).
- Turns subjective quality into repeatable judgment
- Balances creative quality, relevance, and safety
- Defines arbitration and hard-block conditions
Open Rubric →
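A minimal sketch of the decision logic this card describes: weighted dimension scores roll up to a composite, a failing safety score hard-blocks the item regardless of the composite, and composites near the pass threshold route to arbitration. The dimension names, weights, and thresholds here are illustrative assumptions, not the rubric's actual values.

```python
# Illustrative weights and thresholds; the rubric's actual values differ.
WEIGHTS = {"creative_quality": 0.4, "relevance": 0.4, "safety": 0.2}
SAFETY_HARD_BLOCK = 2    # any safety score at or below this blocks the item outright
PASS_THRESHOLD = 3.5     # composite at or above this passes
ARBITRATION_BAND = 0.5   # composites this close to the threshold go to arbitration

def score(dims: dict[str, float]) -> tuple[float, str]:
    """Return (composite, decision) for per-dimension scores on a 1-5 scale."""
    if dims["safety"] <= SAFETY_HARD_BLOCK:
        return 0.0, "blocked"  # policy guardrail: safety failure is non-negotiable
    composite = sum(WEIGHTS[d] * s for d, s in dims.items())
    if abs(composite - PASS_THRESHOLD) < ARBITRATION_BAND:
        return composite, "arbitrate"  # too close to call: route to a second rater
    return composite, "pass" if composite >= PASS_THRESHOLD else "fail"

# A strong but borderline synopsis: composite ≈ 3.8, inside the arbitration band.
print(score({"creative_quality": 4, "relevance": 3, "safety": 5}))
```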
Human judgment
Rater Calibration Dashboard
A rater-health and onboarding workflow showing calibration gates, rolling Cohen's κ, sentinel QA, drift alerts, and remediation paths (κ computation sketched in code below).
- Demonstrates onboarding and calibration protocols
- Tracks cohort and individual rater alignment
- Prevents quality drift before it compounds
Open Calibration →
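Rolling Cohen's κ is the dashboard's core agreement statistic: κ = (p_o - p_e) / (1 - p_e), observed agreement corrected for the agreement two raters would reach by chance. Below is a minimal sketch of computing it over a sliding window of paired labels; the window size and toy labels are assumptions for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: chance-corrected agreement between two raters' labels."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / n**2
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

def rolling_kappa(a: list[str], b: list[str], window: int = 50) -> list[float]:
    """Kappa over each sliding window of recent paired judgments."""
    return [cohens_kappa(a[i:i + window], b[i:i + window])
            for i in range(len(a) - window + 1)]

gold  = ["pass", "fail", "pass", "pass", "fail", "pass"]
rater = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(rolling_kappa(gold, rater, window=4))  # [0.5, 0.5, 0.5]
```

A drift alert then reduces to checking whether the latest window's κ has fallen below the calibration gate.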
Role fit map
A fast way for reviewers to connect portfolio evidence to the core responsibilities in human evaluation and data operations.
| Responsibility | Where to look | Evidence shown |
| --- | --- | --- |
| Lead execution end-to-end: intake, scope, blockers, milestones, delivery status. | Eval Ops Health Dashboard | Task health, sprint tracker, active blockers, owner/due-date model. |
| Develop rubrics and guidelines: consistent scoring protocols for subjective AI output quality. | Scoring Rubric | Weighted dimensions, threshold logic, arbitration trigger, safety hard block. |
| Own rater calibration and QA: onboarding, gold sets, IRR gates, ongoing quality monitoring. | Rater Calibration Dashboard | Five-phase onboarding protocol, rolling κ, sentinel rate, drift remediation. |