Projection Bias and Calibration: Identifying Systematic Errors in Models

Projection models are not wrong randomly — they tend to be wrong in patterned, predictable ways. This page examines projection bias and calibration as the formal tools for identifying those patterns, explains the mechanisms that produce systematic error, and walks through the scenarios where miscalibration most frequently surfaces in fantasy sports contexts.

Definition and scope

A projection is calibrated when its predicted values match observed outcomes at the correct rate across a large sample. If a model projects 100 players to score 20 fantasy points and the average actual score is also 20, that's a calibrated model for that range. If the average actual comes in at 17, the model carries a systematic upward bias of 3 points at that scoring threshold.

Bias, in statistical terms, is the expected difference between a model's estimate and the true value — not the error on any single prediction, but the directional lean that persists when errors are averaged across the full population of forecasts. A model can have low bias and high variance (scattered but centered), high bias and low variance (consistently wrong in one direction), or both. Understanding projection confidence intervals alongside raw point totals is one way practitioners track whether the error distribution is centered correctly.
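The directional lean described above can be measured directly as the mean signed error, with the spread of errors tracked separately. A minimal sketch, using hypothetical projection and score values:

```python
from statistics import mean, pstdev

def bias_and_spread(predicted, actual):
    # Signed errors: positive means the model projected too high
    errors = [p - a for p, a in zip(predicted, actual)]
    # Mean of signed errors = directional bias; std dev = variance component
    return mean(errors), pstdev(errors)

# Hypothetical five-game sample of projections vs. realized scores
pred = [20.0, 18.5, 22.0, 15.0, 19.5]
act = [17.0, 16.0, 19.5, 13.5, 16.0]
bias, spread = bias_and_spread(pred, act)
# bias > 0 here: the model leans high on average (high bias, low variance)
```

A model with bias near zero but a large spread would be the "scattered but centered" case; this sample illustrates the opposite pattern.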

The practical scope of calibration analysis in fantasy sports covers positional groups, scoring formats, game contexts, and time horizons. A model that is well-calibrated for standard-scoring running backs in weeks 1–6 may be systematically off for tight ends in PPR formats because the underlying assumptions about target share translate poorly across positions.

How it works

Calibration is measured through a structured comparison between predicted and observed outcomes across binned prediction intervals. The process typically runs as follows:

  1. Collect paired data. For each projection, record the predicted value and the actual outcome after the game or season resolves.
  2. Group by prediction range. Sort projections into buckets — e.g., all players projected for 10–12 points, then 12–14, and so on.
  3. Calculate mean actual per bucket. For each bucket, compute the average realized score.
  4. Plot or compare the two series. A perfectly calibrated model produces a 45-degree line when predicted means are plotted against actual means.
  5. Identify systematic deviations. If the actual values consistently fall below predictions at the high end, the model is positively biased at that range. If actuals exceed predictions at the low end, the model underestimates floor values.
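The five steps above can be sketched as a single binning routine. This is an illustrative implementation, not a reference one; the bin width and sample values are assumptions:

```python
from collections import defaultdict
from statistics import mean

def calibration_table(predicted, actual, bin_width=2.0):
    # Step 1-2: pair each projection with its outcome, bucket by range
    buckets = defaultdict(list)
    for p, a in zip(predicted, actual):
        lo = int(p // bin_width) * bin_width
        buckets[lo].append((p, a))
    # Step 3-4: mean predicted vs. mean actual per bucket
    table = {}
    for lo, pairs in sorted(buckets.items()):
        preds, acts = zip(*pairs)
        table[(lo, lo + bin_width)] = (mean(preds), mean(acts))
    return table

# Hypothetical data: actuals run below predictions in every bucket
t = calibration_table([10.5, 11.0, 12.5, 13.0], [9.0, 10.0, 11.0, 12.0])
```

Step 5 is then a visual or numeric scan of the table: buckets where mean actual sits consistently below mean predicted mark the ranges where the model is positively biased.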

The statistical concept most applicable here is reliability in the sense used by meteorologists: a forecaster who says "70% chance of rain" should see rain on roughly 70 out of 100 such occasions. The same logic applies to fantasy projection ranges. When reviewing backtesting projection accuracy, calibration curves are among the first diagnostic tools applied.
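The meteorologist's reliability test translates directly into code: gather every forecast issued near a given probability and check the realized event frequency. A minimal sketch with made-up forecast data:

```python
def reliability(forecast_prob, outcomes, target=0.7, tol=0.05):
    # Keep only outcomes whose forecast probability sits near the target
    hits = [o for p, o in zip(forecast_prob, outcomes) if abs(p - target) < tol]
    # Event frequency among those forecasts; None if the bucket is empty
    return sum(hits) / len(hits) if hits else None

# Ten hypothetical "70%" forecasts: the event occurred 7 times
probs = [0.7] * 10 + [0.3] * 4
outs = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0] + [0, 0, 1, 0]
freq = reliability(probs, outs, target=0.7)
```

A well-calibrated forecaster sees `freq` land near the stated probability for every target bucket, which is exactly the check a calibration curve performs across ranges at once.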

Bias can emerge from the model inputs, the training data, or the structural assumptions baked into the methodology. Machine learning approaches in fantasy projections can introduce particularly subtle bias because learned features may encode historical conditions that no longer hold.

Common scenarios

Recency and hot-streak inflation. Models that weight the prior 2–3 games heavily tend to over-project players on unusual hot streaks and under-project those coming off quiet games. This is a documented form of projection bias related to regression to the mean, where extreme performances in a small window are treated as representative of true talent.
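One common correction for this failure mode is to shrink a small-sample recent average toward a longer baseline. The sketch below assumes a hypothetical stabilization constant `k`; the specific value is illustrative, not a recommendation:

```python
def shrunk_projection(recent_avg, season_avg, recent_games, k=8.0):
    # Weight on the recent window grows with sample size; k (assumed
    # value) controls how slowly a hot streak earns trust
    w = recent_games / (recent_games + k)
    return w * recent_avg + (1 - w) * season_avg

# A 3-game hot streak of 24 ppg against a 15 ppg season baseline
proj = shrunk_projection(24.0, 15.0, 3)
# The projection lands between the two averages, closer to the baseline
```

A model that skips this step and projects near 24 is treating three games as representative of true talent, which is the bias the scenario describes.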

Positional group asymmetry. Running back projections built primarily on historical workload patterns tend to be positively biased for elite backs because they underweight the degree to which injury and committee backfields redistribute touches. Running back projection methodology frameworks address this by incorporating depth chart probability weights, but the calibration gap between single-back and two-back systems remains measurable.
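The depth chart probability weighting mentioned above amounts to an expected-value sum over backfield scenarios. A minimal sketch with invented probabilities and touch counts:

```python
def expected_touches(scenarios):
    # Each scenario: (probability of that depth-chart state, touches in it)
    return sum(prob * touches for prob, touches in scenarios)

# Hypothetical: 70% chance of a lead-back role (18 touches),
# 30% chance of a committee split (10 touches)
et = expected_touches([(0.7, 18), (0.3, 10)])
```

A model that projects straight off the lead-back workload (18 here) rather than the scenario-weighted figure carries exactly the upward bias for elite backs that the scenario describes.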

Late-season context drift. Models calibrated on full-season data often carry negative bias in weeks 14–17 for teams that have clinched playoff positions. Rest decisions are structurally underweighted in most projection systems. In-season versus preseason projections differ substantially in how they handle these situational corrections.

Sample size compression at the tails. High-end projections, such as receivers projected for 30+ points, have much thinner calibration data because those outcomes are rare. At FantasyProjectionLab.com, reliability at these scoring extremes is one area where confidence intervals carry more practical weight than point estimates.

Decision boundaries

Calibration analysis becomes actionable when it defines where a model should and should not be trusted. Three practical decision boundaries:

Confidence threshold for lineup decisions. If historical calibration shows a model's projections above 25 points are systematically biased upward by 4+ points, those projections should be discounted before applying them to lineup optimization. The raw number is not the decision input — the calibration-adjusted number is.
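Applying a measured bias before lineup optimization can be as simple as a thresholded discount. The threshold and bias values below are the illustrative figures from the example, not universal constants:

```python
def calibration_adjust(projection, threshold=25.0, measured_bias=4.0):
    # Discount only the range where calibration showed a systematic lean
    if projection >= threshold:
        return projection - measured_bias
    return projection

adjusted = calibration_adjust(28.0)  # high-end projection gets discounted
unchanged = calibration_adjust(20.0)  # below threshold, passed through
```

The adjusted value, not the raw projection, is what feeds the optimizer.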

Positional versus cross-positional trust. A model that is well-calibrated for quarterbacks but poorly calibrated for tight ends (a common finding, given the high variance in tight end projection methodology) should be used selectively. Applying quarterback-level calibration confidence to tight end outputs is a concrete, identifiable error, not a vague concern.

Bias type: multiplicative versus additive. Additive bias means the model is off by a roughly constant number across all predictions — easy to correct with a flat adjustment. Multiplicative bias means the error scales with the prediction value — trickier, because it requires a proportional correction. Distinguishing which type is present before applying a fix is a prerequisite for producing better projections rather than just differently wrong ones.
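One way to distinguish the two types is to fit actual = intercept + slope * predicted by least squares: a slope near 1 with a nonzero intercept suggests additive bias, while a slope meaningfully different from 1 suggests multiplicative bias. A sketch under that assumption, with the tolerance chosen for illustration:

```python
from statistics import mean

def diagnose_bias(predicted, actual, tol=0.05):
    # Ordinary least squares for actual ~ predicted
    px, ax = mean(predicted), mean(actual)
    cov = sum((p - px) * (a - ax) for p, a in zip(predicted, actual))
    var = sum((p - px) ** 2 for p in predicted)
    slope = cov / var
    intercept = ax - slope * px
    # slope ~ 1: flat offset (additive); otherwise error scales (multiplicative)
    kind = "additive" if abs(slope - 1.0) < tol else "multiplicative"
    return kind, slope, intercept

# Hypothetical case where actuals run at 90% of projections
kind, slope, intercept = diagnose_bias(
    [10.0, 15.0, 20.0, 25.0], [9.0, 13.5, 18.0, 22.5])
```

An additive finding calls for subtracting the intercept; a multiplicative one calls for dividing by the slope. Applying the flat fix to a scaling error produces the "differently wrong" model the paragraph warns about.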
