How to Facilitate a Calibration Session: Step-by-Step
A tactical playbook for HR leaders running performance review calibration meetings. Includes agenda template, facilitator scripts, and bias checks.
Calibration sessions determine whether your performance reviews are fair or theater. Run poorly, they amplify bias and waste hours. Run well, they align standards across managers and ensure employees are evaluated consistently.
This guide gives you a tactical playbook: the pre-work, the agenda, the questions to ask, and how to handle the hard moments when managers disagree.
Before the Meeting: Pre-Work Checklist
Calibration sessions fail when participants arrive unprepared. The meeting itself should be for discussion and decisions, not discovery. Require managers to complete pre-work 48 hours before the session so you can identify potential flashpoints in advance.
Manager pre-work (due 48 hours before):
- Submit proposed ratings for all direct reports
- Provide 2-3 specific evidence points per employee (project outcomes, metrics, feedback received)
- Flag any ratings they’re uncertain about
- Note employees they want to discuss with the group
Facilitator pre-work (24 hours before):
- Review all submitted ratings for distribution patterns
- Identify outliers: ratings significantly higher or lower than peers in similar roles
- Flag managers whose ratings skew consistently high or low
- Prepare the agenda with specific employees to discuss
- Send participants the anonymized rating distribution
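The facilitator review above (distribution patterns, outliers, manager skew) can be sketched as a short script. This is a minimal illustration, assuming ratings use a 1-5 scale and are exported as (manager, employee, role, rating) rows; the names, data, and flagging thresholds are assumptions for the example, not a prescribed standard.

```python
# Hypothetical sketch: reviewing submitted ratings for manager skew and
# per-role outliers before a calibration session. All names and thresholds
# below are illustrative.
from statistics import mean, stdev

ratings = [
    ("Alice", "E01", "engineer", 4), ("Alice", "E02", "engineer", 5),
    ("Alice", "E03", "engineer", 5), ("Bob",   "E04", "engineer", 3),
    ("Bob",   "E05", "engineer", 2), ("Bob",   "E06", "engineer", 3),
]

overall = mean(r for *_, r in ratings)

# Flag managers whose average rating sits well above or below the group mean.
by_manager = {}
for mgr, *_, r in ratings:
    by_manager.setdefault(mgr, []).append(r)
for mgr, rs in sorted(by_manager.items()):
    avg = mean(rs)
    if abs(avg - overall) >= 0.75:  # illustrative skew threshold
        print(f"{mgr}: avg {avg:.2f} vs group {overall:.2f} -- review before session")

# Flag individual ratings far from peers in the same role.
by_role = {}
for _, emp, role, r in ratings:
    by_role.setdefault(role, []).append(r)
for mgr, emp, role, r in ratings:
    peers = by_role[role]
    if len(peers) > 2 and stdev(peers) > 0 and abs(r - mean(peers)) > 1.5 * stdev(peers):
        print(f"{emp} ({mgr}): rating {r} is an outlier for role '{role}'")
```

A spreadsheet pivot table does the same job; the point is to surface skew and outliers mechanically before the meeting, not to debate them live.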
The 75-Minute Calibration Agenda
Structure prevents calibration sessions from devolving into opinion battles. This agenda keeps discussions evidence-based and time-bound.
| Phase | Time | Owner | Purpose |
|---|---|---|---|
| Opening | 5 min | Facilitator | Ground rules and rating scale review |
| Individual Reviews | 35 min | Managers | Present evidence, discuss flagged cases |
| Bias Check | 10 min | Facilitator | Review decisions for patterns |
| Final Decisions | 10 min | Group | Lock ratings with consensus |
| Action Planning | 15 min | Managers | Document next steps and communication |
Opening (5 minutes): State the purpose, review the rating scale definitions, and establish ground rules. Remind participants that discussions are confidential and focused on performance evidence, not personal opinions.
Individual Reviews (35 minutes): Work through flagged employees. The presenting manager shares their rating and evidence. Other managers ask clarifying questions. Aim for 3-5 minutes per employee.
Bias Check (10 minutes): Step back and review the decisions made. Look for patterns across demographic groups, tenure levels, or teams.
Final Decisions (10 minutes): Confirm or adjust ratings based on discussion. Document the rationale for any changes.
Action Planning (15 minutes): Assign who communicates what to whom. Note development actions for employees whose ratings changed.
Ground Rules to Set at the Start
Ground rules transform calibration from a political exercise into a structured process. State these at the opening and enforce them throughout. Post them visibly if you’re in a conference room or share them in the video call chat.
The non-negotiables:
- Evidence over opinion. Every rating must be supported by specific examples. “I feel like they’re a strong performer” isn’t evidence.
- Discuss performance, not personality. Redirect comments about attitude, likability, or “fit” back to observable behaviors and outcomes.
- Confidentiality. What’s discussed stays in the room. Employees should never learn what specific managers said about them.
- Time limits. Each employee gets 3-5 minutes. If consensus isn’t reached, table the discussion and return to it at the end.
- One conversation. No side discussions. If it’s worth saying, say it to the group.
Facilitator Scripts for Common Situations
The facilitator’s job is to keep discussion on track and surface bias when it appears. These scripts give you language for the moments when calibration goes sideways.
When discussion drifts to personality:
“Let’s bring this back to outcomes. What specific results did this person deliver, and how do those compare to the rating criteria?”
When recency bias appears:
“We’re focusing on Q4. What happened in the first half of the year? Let’s make sure we’re evaluating the full period.”
When a manager is defensive:
“I hear that you feel strongly about this rating. Help the group understand by walking us through the evidence against each criterion in the rubric.”
When ratings cluster in the middle:
“We have six ‘meets expectations’ in a row. What distinguishes someone who truly meets expectations from someone who exceeds? Are we confident these ratings reflect real differences?”
When you suspect bias:
“Before we finalize this rating, let’s check ourselves. Would we describe this behavior the same way for a different employee? Are we applying the criteria consistently?”
Handling Disagreements
Disagreements are the point of calibration. If everyone agreed, you wouldn’t need the meeting. The goal isn’t to avoid conflict but to resolve it productively.
When two managers disagree on a rating:
- Ask each to state their rating and primary evidence point (30 seconds each)
- Identify the specific criterion where they diverge
- Reference the rubric: “According to our ‘exceeds expectations’ definition, does this employee’s performance match?”
- If still stuck after two minutes, note both perspectives and move on. Return at the end with fresh eyes.
When a manager won’t budge:
“I understand your position. For calibration to work, we need to apply consistent standards. If this employee’s performance matches the ‘exceeds’ criteria, we should be able to point to similar ratings for similar performance elsewhere. Can we look at a comparison?”
When the group gangs up on one manager:
Protect dissenting views. Ask: “What would need to be true for this rating to be correct?” Explore before dismissing.
Bias Checks During the Session
Research published in Harvard Business Review found that calibration meetings can actually introduce new biases: in one study, a 2-percentage-point gap between women of color and white men in supervisor ratings ballooned to 34 percentage points after calibration. The antidote is structured bias checks.
Run these checks before finalizing:
- Distribution check: Are ratings distributed similarly across demographic groups? If one group clusters lower, investigate.
- Language check: Are you using different words to describe similar behaviors? (“Assertive” vs. “aggressive,” “detail-oriented” vs. “nitpicky”)
- Advocacy check: Did the most persuasive speakers get better outcomes for their teams? Verbal ability shouldn’t determine ratings.
- Recency check: Are recent events overweighted? An employee’s Q4 stumble shouldn’t erase three strong quarters.
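The distribution check above can be run mechanically before ratings are locked. A minimal sketch, assuming finalized ratings carry an anonymized group label; the group names, data, and 10-point gap threshold are illustrative assumptions, not a recommended cutoff.

```python
# Hypothetical sketch of the distribution check: compare the share of
# top ratings (4 or above on a 1-5 scale) across groups before finalizing.
from collections import defaultdict

# (anonymized group label, final rating) -- illustrative data
finalized = [
    ("group_a", 5), ("group_a", 4), ("group_a", 4), ("group_a", 3),
    ("group_b", 3), ("group_b", 3), ("group_b", 2), ("group_b", 4),
]

counts = defaultdict(list)
for group, rating in finalized:
    counts[group].append(rating)

# Percentage of each group rated 4 or above
top_share = {g: 100 * sum(r >= 4 for r in rs) / len(rs) for g, rs in counts.items()}

gap = max(top_share.values()) - min(top_share.values())
for group, share in sorted(top_share.items()):
    print(f"{group}: {share:.0f}% rated 4 or above")
if gap > 10:  # illustrative threshold, in percentage points
    print(f"Gap of {gap:.0f} points between groups -- investigate before locking ratings")
```

A flagged gap isn’t proof of bias on its own; it’s a prompt for the group to re-examine the evidence behind the affected ratings before finalizing.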
Assign someone as a “bias interrupter” whose job is to flag these patterns in real time.
Documentation Requirements
Document everything. Calibration decisions become legally significant when they affect compensation and promotions.
During the meeting, capture:
- Final rating for each discussed employee
- Key evidence points that supported the rating
- Any rating changes and the rationale
- Action items: who communicates what, development plans, follow-ups
After the meeting:
- Distribute summary to participants within 24 hours
- Store documentation securely (it’s sensitive)
- Track rating change patterns over time to improve future sessions
How Windmill Simplifies Calibration Prep
The hardest part of calibration is preparation: gathering evidence, identifying discrepancies, and creating materials for discussion. Windmill’s calibration feature automates this work.
Windmill generates pre-reads with rating distributions, flags employees where similar performers received different ratings, and surfaces potential bias patterns. Committee members arrive with comprehensive briefs instead of spending the first 30 minutes getting oriented.
The result: teams report 75%+ shorter calibration sessions because they spend time making decisions, not gathering data.
Key Takeaways
- Require 48-hour pre-work: ratings and evidence submitted in advance
- Keep sessions to 75 minutes with a structured agenda
- Set and enforce ground rules focused on evidence over opinion
- Use facilitator scripts to redirect bias and keep discussion productive
- Run explicit bias checks before finalizing any ratings
- Document decisions and rationale for legal and improvement purposes
Calibration is where fair performance management is made or broken. With the right structure, you can run sessions that align standards, surface real performance differences, and build trust in your review process.
Frequently Asked Questions
How long should a calibration session take?
A well-run calibration session takes 60-90 minutes. Allocate 5 minutes for setup, 30-40 minutes for individual reviews, 10 minutes for bias checks, 10 minutes for final decisions, and 15 minutes for action planning. Sessions longer than 90 minutes lose focus and productivity.
Who should facilitate a calibration meeting?
An HR business partner or People Ops leader typically facilitates calibration sessions. The facilitator should be neutral, not a manager with direct reports being discussed. Their role is to manage time, enforce ground rules, and redirect conversations when bias emerges.
What should managers prepare before a calibration session?
Managers should submit proposed ratings with supporting evidence 48 hours before the meeting. Evidence includes specific accomplishments, project outcomes, peer feedback, and goal attainment data. Pre-submitted materials allow the facilitator to identify potential discussion points in advance.
How do you handle disagreements during calibration?
Redirect disagreements to evidence. Ask “What specific outcomes support this rating?” When managers disagree, compare the employees in question against the rating rubric rather than against each other. If consensus isn’t reached after two minutes of discussion, table the decision and return to it at the end.