Fantasy Projection Accuracy: How to Benchmark and Evaluate Models

Projection accuracy is the difference between a number that helps and a number that just looks authoritative. This page covers how accuracy is defined in fantasy projection systems, the statistical methods used to measure it, where model evaluation gets genuinely contested, and the specific benchmarks that separate rigorous projection work from confident-sounding guesswork.


Definition and scope

A fantasy projection is a point estimate — a single number representing a model's best guess at what a player will score under a specific scoring format. Accuracy, in this context, is not a feeling or a reputation. It is the measurable relationship between that estimate and the outcome that actually occurred.

Scope matters immediately. A projection for a starting quarterback in a 17-game NFL regular season is being evaluated across at most 17 data points per player. Compare that to an MLB starting pitcher projection, where 30 or more starts give the error-measurement process far more to work with. The evaluation frame changes with the sport, the position, and the season structure — which is one reason accuracy claims require careful unpacking before they mean anything.

The field borrows heavily from forecasting science. Philip Tetlock's work on forecaster calibration, detailed in Superforecasting: The Art and Science of Prediction (Crown, 2015), established that calibration — how well confidence intervals match actual outcome frequencies — matters as much as raw directional accuracy. A projection model that consistently says "25 points" when the player scores between 20 and 30 points 90% of the time is well-calibrated, even if every single point estimate is off by 3.


Core mechanics or structure

Three primary error metrics dominate projection benchmarking:

Mean Absolute Error (MAE) — the average of absolute differences between projected and actual scores. If a model projects 18.4 points and the player scores 14.0, the absolute error is 4.4. Average that across a full slate of players and a season's worth of data and the result is MAE. It is interpretable in fantasy-point units, which makes it the most practically useful single number.

Root Mean Squared Error (RMSE) — the square root of the average of squared errors. Because squaring penalizes large errors more heavily than small ones, RMSE is sensitive to blowup events: the running back who gets 2 points when projected for 18, or the wide receiver who erupts for 40 when projected for 10. RMSE will always be equal to or greater than MAE for any given dataset.

Mean Absolute Percentage Error (MAPE) — errors expressed as percentages of the actual outcome. Useful for cross-format comparisons but distorted when actual scores are near zero (a kicker who scores 0 points makes any MAPE calculation break down mathematically).
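The three metrics above take only a few lines to compute. A minimal Python sketch follows; the `floor` guard in the MAPE function is an assumption of this sketch, added to sidestep the near-zero breakdown described above, not part of the standard definition:

```python
import math

def mae(projected, actual):
    """Mean Absolute Error, in fantasy-point units."""
    return sum(abs(p - a) for p, a in zip(projected, actual)) / len(actual)

def rmse(projected, actual):
    """Root Mean Squared Error; squaring penalizes large misses more heavily."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(projected, actual)) / len(actual))

def mape(projected, actual, floor=1.0):
    """Mean Absolute Percentage Error. The `floor` guard against
    near-zero actuals (e.g. a kicker who scores 0) is an assumption
    of this sketch, not a standard part of the metric."""
    return 100 * sum(abs(p - a) / max(abs(a), floor)
                     for p, a in zip(projected, actual)) / len(actual)

projected = [18.4, 10.0, 25.0]
actual = [14.0, 12.5, 24.0]
print(round(mae(projected, actual), 2))                    # 2.63
print(rmse(projected, actual) >= mae(projected, actual))   # True (always holds)
```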

Calibration testing adds a second dimension entirely. A well-calibrated model's stated confidence intervals should contain the actual outcome at the stated frequency — a 70% confidence interval should capture the true result roughly 70 times out of 100. This is covered in more depth at Projection Confidence Intervals.
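A coverage check on stated intervals is straightforward to sketch. The intervals and outcomes below are hypothetical values chosen for illustration:

```python
def interval_coverage(intervals, actuals):
    """Fraction of actual outcomes that fell inside the stated
    (low, high) bounds. A well-calibrated 70% interval should return
    a value near 0.70 over a large sample."""
    hits = sum(lo <= a <= hi for (lo, hi), a in zip(intervals, actuals))
    return hits / len(actuals)

# Hypothetical stated 70% intervals and the observed outcomes:
intervals = [(20, 30), (8, 16), (5, 12), (14, 24)]
actuals = [25.0, 17.5, 9.0, 15.5]
print(interval_coverage(intervals, actuals))  # 0.75
```

Over only four games the gap between 0.75 and 0.70 means nothing; calibration judgments require a large sample of interval-outcome pairs.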

Backtesting projection accuracy operationalizes these metrics: a historical dataset is assembled, projections are generated as if each week's information were the only information available at that time, and errors are computed against verified box-score results.
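A minimal walk-forward backtest might look like the following sketch, which projects each player's week-N score as the running average of weeks 1 through N-1. The projection rule and the `default` week-one value are deliberately naive stand-ins, not any specific vendor's model:

```python
def walk_forward_mae(weekly_scores, default=10.0):
    """Walk-forward backtest: project each player's week-N score using
    only weeks 1..N-1, then accumulate absolute errors against the
    verified actuals. `weekly_scores` maps player -> list of actual
    scores in week order."""
    errors = []
    for player, scores in weekly_scores.items():
        for n, actual in enumerate(scores):
            prior = scores[:n]               # information available pre-kickoff
            proj = sum(prior) / len(prior) if prior else default
            errors.append(abs(proj - actual))
    return sum(errors) / len(errors)

# One hypothetical player's actual scores across three weeks:
print(round(walk_forward_mae({"QB1": [20.0, 10.0, 15.0]}), 2))  # 6.67
```

The essential discipline is in the `scores[:n]` slice: each projection sees only the information that existed at that week's freeze point.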


Causal relationships or drivers

Accuracy degrades or improves for specific, traceable reasons. Understanding the causal structure prevents the common mistake of attributing error to "noise" when the real driver is a model design choice.

Injury and availability uncertainty is the single largest driver of high-error outliers. A wide receiver projected for 14 points who misses the game with a late scratch produces a 14-point absolute error from a single binary event the model could not have known. The page on injury adjustments in projections examines how different model architectures handle late-breaking availability information.

Usage volatility is the second major driver. Target share, snap count, and carry distribution are among the highest-predictive inputs available, but they fluctuate week to week. A model that anchors too heavily on a single recent week of usage will over-react; one that anchors too heavily on season-long averages will under-react to genuine role changes. Usage rate adjustments in projections and snap count and target share data address this balance directly.
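One common way to balance recent usage against season-long anchoring is an exponentially weighted estimate. The sketch below is illustrative only, and the `alpha=0.3` default is not an empirically fitted value:

```python
def blended_target_share(weekly_shares, alpha=0.3):
    """Exponentially weighted target-share estimate. `alpha` controls
    how fast the estimate reacts to the latest week: near 1 it
    over-reacts to a single game, near 0 it under-reacts to genuine
    role changes. The 0.3 default is a hypothetical illustration."""
    estimate = weekly_shares[0]
    for share in weekly_shares[1:]:
        estimate = alpha * share + (1 - alpha) * estimate
    return estimate

# A receiver whose target share jumps from ~18% to 30% in the latest week:
print(round(blended_target_share([0.18, 0.17, 0.19, 0.30]), 3))  # 0.217
```

Note how the estimate moves only part of the way toward the 30% week; tuning `alpha` is exactly the over-react/under-react tradeoff described above.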

Game script and Vegas line drift create systematic error when projections are finalized well before kickoff. A game with a total that drops from 49 to 42.5 between Tuesday and Sunday afternoon represents a meaningful change in expected offensive volume. Models that incorporate late Vegas lines and fantasy projections data at a specific cutoff will outperform those locked to earlier totals.

Sample size constraints affect all position groups but hit some harder than others. A quarterback projection built on 5 starts carries different reliability properties than one built on 50. Sample size and projection reliability examines the empirical thresholds at which projection error rates stabilize.


Classification boundaries

Not all projection errors are equivalent, and the taxonomy matters when evaluating models fairly.

Reducible error is the portion of error that better data, better modeling, or better timing could have eliminated. A model that ignores weather adjustments on an outdoor game in December is producing reducible error — the information existed, the model just didn't use it. Weather impact on fantasy projections quantifies how large this category can be for certain game environments.

Irreducible error is the variance inherent to the outcome — the randomness that no model could have predicted. An interception return that costs a quarterback 6 points is genuinely unpredictable at the individual event level.

The distinction matters because critics sometimes conflate irreducible variance with model failure. A projection that produces an MAE of 7.2 fantasy points is not necessarily worse than one producing 6.8 if the second model was evaluated in games with fewer extreme outcomes.

Benchmark categories also differ by position. The floor and ceiling projections page makes the case that evaluating a kicker projection against the same MAE standard as a wide receiver projection produces a misleading ranking of model quality.


Tradeoffs and tensions

The central tension in accuracy benchmarking is between point-estimate precision and distributional honesty.

A model optimized purely to minimize MAE will converge toward safe, conservative projections clustered near the league-average outcome for a position. This happens because extreme projections — very high or very low — are more likely to produce large absolute errors. The incentive structure, if MAE is the only metric, pushes models toward regression to the mean. This is covered in detail at regression to the mean in fantasy.
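This incentive can be demonstrated directly: over a right-skewed pool of outcomes, the constant projection that minimizes MAE is the median, not the mean, so an MAE-only scorer rewards clustering toward the middle of the distribution. The scores below are hypothetical:

```python
import statistics

# Hypothetical weekly WR scores: right-skewed, with a couple of eruptions.
actuals = [2.1, 4.0, 6.5, 7.2, 8.0, 9.5, 11.0, 13.4, 22.0, 38.0]

def mae_of_constant(c, outcomes):
    """MAE of projecting the same number `c` for every outcome."""
    return sum(abs(c - a) for a in outcomes) / len(outcomes)

median = statistics.median(actuals)   # 8.75
mean = statistics.fmean(actuals)      # 12.17
# The MAE-minimizing constant is the median, so a model graded only on
# MAE is rewarded for conservative projections near the middle of the pool.
print(mae_of_constant(median, actuals) < mae_of_constant(mean, actuals))  # True
```

The mirror image holds for RMSE, which is minimized by the mean; the choice of metric quietly picks the central tendency a model is pushed toward.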

But fantasy decisions are not symmetric. In daily fantasy, lineup construction requires identifying upside outliers — players with high ceiling probabilities — not just accurate point estimates. A model with slightly higher MAE but better ceiling identification may produce better lineup outcomes than the technically more accurate one. See daily fantasy sports projections for how this tradeoff plays out in practice.

The second tension is between transparency and complexity. Machine learning in fantasy projections describes how ensemble methods and neural architectures can improve raw accuracy metrics while becoming increasingly opaque about why a particular projection was generated. A model with a lower RMSE but no interpretable structure is hard to trust when it produces an outlier output, because there is no way to verify whether it found a genuine signal or overfit to historical noise.


Common misconceptions

Misconception: A model is accurate if its top projections hit.
Selective recall is not accuracy measurement. Evaluating a model by cherry-picking its best calls ignores the full distribution of errors. Proper evaluation requires a complete, prospectively assembled dataset — every player projected, every week, compared to every actual result.

Misconception: Accuracy should be measured against the final score only.
Several projection systems are designed to project "expected" performance based on stable underlying inputs — targets, carries, air yards — rather than the volatile actual score. These systems can have higher MAE than a simpler model while providing better decision-relevant information. The right benchmark depends on what decision the projection is meant to support.

Misconception: Industry-leading accuracy means 70%+ hit rate.
The actual accuracy ceiling for weekly fantasy projections is much lower than casual users expect. Whether a player finishes above or below projection, or which of two players outscores the other, is a binary classification problem with a 50% coin-flip floor. Research in sports analytics, including work presented at the MIT Sloan Sports Analytics Conference, consistently shows that even sophisticated NFL models struggle to exceed 55–60% directional accuracy on individual weekly player projections, given the irreducible variance in game outcomes.
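Directional accuracy of the kind cited above can be measured with a simple pairwise check. The function below is a sketch, not a standard library routine, and the player scores are hypothetical:

```python
from itertools import combinations

def directional_accuracy(projections, actuals):
    """Share of pairwise calls the model gets right: for each pair of
    players, did the one projected higher actually score higher?
    Pairs tied on either side are skipped; 0.5 is the coin-flip floor.
    Both arguments map player name -> fantasy points."""
    correct = total = 0
    for a, b in combinations(projections, 2):
        proj_diff = projections[a] - projections[b]
        act_diff = actuals[a] - actuals[b]
        if proj_diff == 0 or act_diff == 0:
            continue
        total += 1
        correct += (proj_diff > 0) == (act_diff > 0)
    return correct / total

projections = {"A": 18.0, "B": 14.0, "C": 10.0}
actuals = {"A": 12.0, "B": 16.0, "C": 8.0}
print(round(directional_accuracy(projections, actuals), 2))  # 0.67
```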

Misconception: Lower MAE always means a better model.
MAE is one dimension. A model with lower MAE but poor calibration — one that is systematically overconfident or underconfident about its uncertainty — can produce worse outcomes for users making probabilistic decisions like lineup optimization with projections.


Checklist or steps

The following sequence outlines a structured projection accuracy evaluation:

  1. Define the evaluation dataset — specify the sport, season, weeks included, and the player pool (all rostered players, starters only, or a scoring-format-specific cutoff like top-100 PPR scorers).
  2. Establish a projection freeze point — record what projections looked like at a fixed time before games, such as 24 hours prior to kickoff. Post-game retroactive changes invalidate the evaluation.
  3. Collect verified actuals — pull official box-score scoring from a named public source (NFL.com, Basketball-Reference, Baseball-Reference) under the exact scoring format being evaluated.
  4. Compute MAE across the full player pool — not just starters, not just top scorers. Full-pool MAE is the honest number.
  5. Compute RMSE — compare it to MAE. A large gap (RMSE well above MAE) signals that the model is producing occasional blowup errors on outlier players.
  6. Segment error by position — QB, RB, WR, TE, and flex positions have different baseline variance profiles. Aggregated accuracy numbers obscure position-level failures.
  7. Test calibration on stated intervals — if the model provides ranges or confidence intervals, check what percentage of actual outcomes fell within stated bounds.
  8. Compare against a baseline — the appropriate baseline is often a simple seasonal average or ADP-implied projection, not silence. A sophisticated model that only marginally outperforms "project everyone to their season average" is a weaker result than its raw MAE suggests.
  9. Document the injury/scratch rate — calculate what percentage of projected players did not play. High scratch rates inflate every error metric, so this context is required when comparing evaluations.
  10. Repeat across at least 2 full seasons — single-season evaluations are susceptible to variance. A model that happens to work well in a low-injury season looks worse in a season with unusual availability disruption.
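Steps 4, 5, and 8, plus the scratch-handling caveat from step 9, can be combined into a minimal evaluation function. All names and the dictionary layout below are hypothetical stand-ins for a real frozen-projection store:

```python
import math

def evaluate_season(frozen, actuals, baseline):
    """Minimal season evaluation: full-pool MAE, RMSE, and the edge over
    a naive baseline projection. `frozen`, `baseline`, and `actuals`
    each map player -> fantasy points. Players with no actual score
    (scratches) are dropped here but should be counted and reported
    separately rather than silently ignored."""
    players = [p for p in frozen if p in actuals and p in baseline]
    errs = [frozen[p] - actuals[p] for p in players]
    base_errs = [baseline[p] - actuals[p] for p in players]
    mae = sum(abs(e) for e in errs) / len(errs)
    rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
    base_mae = sum(abs(e) for e in base_errs) / len(base_errs)
    return {"mae": mae, "rmse": rmse, "edge_vs_baseline": base_mae - mae}
```

A positive `edge_vs_baseline` means the model beat the naive baseline; per step 8, a model that only marginally beats "project everyone to their season average" is a weaker result than its raw MAE suggests.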

The full context for this kind of evaluation process connects to the broader framework at FantasyProjectionLab.com, where projection methodology is documented from input sourcing through output generation.


Reference table or matrix

Projection Accuracy Metrics: Comparison Matrix

Metric                         | What It Measures                 | Unit           | Sensitive to Outliers      | Best Used For
MAE (Mean Absolute Error)      | Average absolute deviation       | Fantasy points | No                         | General accuracy benchmarking
RMSE (Root Mean Squared Error) | Penalized average deviation      | Fantasy points | Yes                        | Identifying blowup-prone models
MAPE (Mean Absolute % Error)   | Proportional deviation           | Percentage     | Yes (near-zero distortion) | Cross-format comparisons
Bias (Mean Error)              | Systematic over/under projection | Fantasy points | No                         | Detecting directional model drift
Calibration score              | Interval accuracy match          | % coverage     | No                         | Probabilistic model evaluation
Directional accuracy           | % of correct rank-order calls    | Percentage     | No                         | Head-to-head and start/sit context

Baseline Accuracy Benchmarks by Position (NFL, full-season PPR)

Position      | Typical MAE Range  | Key Variance Driver
Quarterback   | 5–8 fantasy points | Game script, rushing TD variance
Running Back  | 6–10 fantasy points | Usage volatility, snap share changes
Wide Receiver | 6–11 fantasy points | Target share, explosive play rate
Tight End     | 5–9 fantasy points | Role clarity, red zone usage
Kicker        | 3–6 fantasy points | Game environment, distance distribution

Ranges reflect publicly documented analysis in fantasy analytics literature, including FantasyPros aggregator evaluations and MIT Sloan Sports Analytics Conference presentations. Individual model results vary by season and player pool definition.


References