Data Sources for Fantasy Projections: APIs, Databases, and Feeds
The quality of any fantasy projection is only as good as the data feeding it. This page covers the primary data source types used in projection systems — APIs, curated databases, and live data feeds — explaining how each category works, where its data originates, and what distinguishes a reliable pipeline from a brittle one. Builders of projection models and serious fantasy players who want to understand what sits beneath the numbers will find the technical and practical distinctions here.
Definition and scope
A fantasy projection draws from at least three distinct data layers: play-level tracking data, box score aggregates, and contextual inputs like injury reports or Vegas lines. Each layer typically arrives through a different channel, and the term "data source" covers all of them.
An API (Application Programming Interface) is a structured endpoint that delivers data programmatically — a projection model sends a request, the server returns a JSON or XML object with stats, schedules, or player information. A database is a stored, queryable collection of historical records — often the product of years of scraped or licensed data, normalized into a consistent schema. A feed is a continuous or near-real-time stream, typically used for injury updates, lineup changes, and game-time decisions. The scope of "data sources" in fantasy projection work spans all three.
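The API request-and-parse step can be sketched in a few lines. The payload shape, field names, and player IDs below are hypothetical placeholders, not any provider's actual schema:

```python
import json

# Hypothetical JSON payload shaped like a typical stats API response;
# real providers each use their own field names and nesting.
payload = """
{
  "players": [
    {"id": "p-101", "name": "A. Example", "targets": 9, "snaps": 61},
    {"id": "p-102", "name": "B. Sample", "targets": 4, "snaps": 38}
  ]
}
"""

def parse_player_stats(raw: str) -> dict:
    """Index player records by the provider's own ID for downstream joins."""
    data = json.loads(raw)
    return {p["id"]: p for p in data["players"]}

stats = parse_player_stats(payload)
# stats["p-101"]["targets"] == 9
```

The same parse-and-index pattern applies whether the response arrives over HTTP or is read from a cached file; only the transport differs.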
The statistical inputs for fantasy projections page covers what those inputs are; this page is about where they physically come from and how the pipeline is structured.
How it works
A projection system ingests data at multiple points in its workflow. The breakdown below reflects the standard pipeline architecture:
- Historical play-by-play data — sourced from official league endpoints or public repositories like the NFL's Next Gen Stats or the nflfastR R package (maintained by Ben Baldwin and Sebastian Carl, documented at nflfastr.com). This layer provides the raw substrate for modeling carries, targets, snap counts, and expected value metrics.
- Season-to-date aggregates — refreshed daily or weekly from providers like Sportradar or Stats Perform, both of which license structured box score data directly to media and fantasy platforms. Sportradar's official documentation describes delivery latency as low as 5 seconds post-play for live feeds.
- Injury and availability data — the NFL publishes official injury reports under league operations policy, requiring teams to list all players on a Wednesday-through-Friday schedule during the season. Those reports are machine-readable through third-party parsing layers.
- Vegas lines and implied totals — pulled from regulated sportsbook aggregators. The relationship between game totals and projected scoring touches on Vegas lines and fantasy projections, where that causal chain is examined.
- Advanced tracking metrics — Next Gen Stats (NFL), Statcast (MLB via baseballsavant.com), and Second Spectrum (NBA) provide spatial and velocity data that traditional box scores cannot capture. Statcast, for instance, has tracked exit velocity and launch angle for every batted ball since the 2015 season.
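The layers above can be summarized as a small source registry. The kinds and refresh cadences below are typical patterns sketched for illustration, not guarantees from any specific provider:

```python
from dataclasses import dataclass

@dataclass
class Source:
    name: str
    kind: str      # "database", "api", or "feed"
    cadence: str   # how often the layer refreshes

# Illustrative registry of the five ingestion layers; cadence values
# are common defaults, not contractual SLAs from any vendor.
PIPELINE = [
    Source("play_by_play", "database", "postgame"),
    Source("season_aggregates", "api", "daily"),
    Source("injury_reports", "feed", "intraday"),
    Source("vegas_lines", "api", "intraday"),
    Source("tracking_metrics", "api", "postgame"),
]

# Which layers must stay fresh on game day:
live_inputs = [s.name for s in PIPELINE if s.cadence == "intraday"]
```

Keeping this registry explicit makes it obvious which layers can tolerate a stale refresh and which cannot — the distinction the decision boundaries below turn on.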
The outputs of each ingestion step are cleaned, deduped, and joined against a master player ID system — typically a third-party mapping file, since each data provider uses its own internal player IDs.
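A minimal sketch of that join step, assuming a hand-built crosswalk; the provider names, IDs, and stat fields here are invented for illustration:

```python
# Maps (provider, provider_id) -> master ID. Real pipelines load this
# from a maintained third-party mapping file.
crosswalk = {
    ("providerA", "A-17"): "master-001",
    ("providerB", "b_9042"): "master-001",
}

def to_master_id(provider: str, provider_id: str):
    return crosswalk.get((provider, provider_id))

def join_records(recs_a: dict, recs_b: dict) -> dict:
    """Merge per-provider stat dicts keyed by each provider's own ID."""
    merged = {}
    for provider, recs in (("providerA", recs_a), ("providerB", recs_b)):
        for pid, fields in recs.items():
            mid = to_master_id(provider, pid)
            if mid is None:
                continue  # unmapped players are dropped, not guessed
            merged.setdefault(mid, {}).update(fields)
    return merged

merged = join_records({"A-17": {"targets": 9}}, {"b_9042": {"routes": 31}})
```

Note the failure behavior: records with no crosswalk entry are skipped rather than fuzzily matched on name, which is where most silent join corruption comes from.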
Common scenarios
The API rate limit problem. Free-tier API access from providers like MySportsFeeds or the Sleeper API imposes request caps — MySportsFeeds' developer tier, for example, restricts calls to a defined monthly allotment. A projection model refreshing 500 player profiles daily can exhaust a free tier in under 48 hours. Most serious projection systems operate on paid tiers or cache aggressively.
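Aggressive caching is the usual mitigation. A minimal time-to-live cache sketch, assuming a 24-hour freshness window (a modeling choice, not provider guidance):

```python
import time

# In-memory TTL cache: player_id -> (fetch timestamp, profile).
_cache = {}
TTL_SECONDS = 24 * 3600  # assumed freshness window, tune per use case

def cached_fetch(player_id: str, fetch) -> dict:
    """Return a cached profile if fresh; otherwise spend one API call."""
    now = time.time()
    hit = _cache.get(player_id)
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]           # served from cache: no request spent
    profile = fetch(player_id)  # one request against the monthly allotment
    _cache[player_id] = (now, profile)
    return profile

calls = []
profile = cached_fetch("p-101", lambda pid: calls.append(pid) or {"id": pid})
profile = cached_fetch("p-101", lambda pid: calls.append(pid) or {"id": pid})
# two lookups, but only one underlying request was made
```

With a cache like this, the 500-player daily refresh costs 500 requests per day instead of 500 per run, which is the difference between surviving and exhausting a capped tier.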
Historical database gaps. Pre-2000 play-by-play data in football has significant coverage inconsistencies. Pro Football Reference (a Sports Reference property) maintains one of the most complete public archives, but play-level granularity before 1994 is unreliable. MLB's Retrosheet project (retrosheet.org) provides play-by-play records back to 1921 for baseball, making it an outlier in historical coverage depth.
Feed latency in DFS contexts. In daily fantasy sports, a feed delivering lineup news 90 seconds behind competitors is functionally useless on late-swap deadlines. The daily fantasy sports projections workflow depends on near-zero-latency injury feeds, which is why professional DFS operators pay for dedicated news alert services rather than relying on public APIs.
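The staleness constraint can be made explicit as a gate on incoming news. The 90-second threshold below mirrors the latency gap described above and is an assumed cutoff, not a standard:

```python
from datetime import datetime, timedelta, timezone

# Assumed staleness window: news older than this is treated as
# already priced into the field and not worth a late swap.
MAX_STALENESS = timedelta(seconds=90)

def is_actionable(news_time: datetime, now: datetime) -> bool:
    """True if a lineup-news item is still fresh enough to act on."""
    return now - news_time <= MAX_STALENESS

now = datetime(2024, 11, 3, 12, 58, 30, tzinfo=timezone.utc)
fresh = is_actionable(datetime(2024, 11, 3, 12, 58, 0, tzinfo=timezone.utc), now)
stale = is_actionable(datetime(2024, 11, 3, 12, 56, 0, tzinfo=timezone.utc), now)
```

Timezone-aware timestamps matter here: comparing a feed's UTC timestamps against a naive local clock is a classic source of phantom staleness.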
Decision boundaries
Choosing between a live API, a static database pull, and a streaming feed is not a preference question — it follows from what the model is trying to do.
Static historical models (preseason projections, rest-of-season systems, dynasty applications) can run entirely off database queries. Latency is irrelevant; coverage and normalization quality dominate. Rest-of-season projections lean heavily on this architecture.
In-season weekly models need a hybrid: historical database for baseline rates, API refresh for current-week context (opponent, depth chart), and an injury feed for final availability decisions. The in-season vs preseason projections page details how those model types diverge structurally.
DFS and same-game models require live feeds as a first-class input. No static database refresh cycle is fast enough to handle the information environment in the 90 minutes before a slate locks.
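The three boundaries above reduce to a lookup from model type to required source kinds — a sketch using the categories from this section, with the model-type labels chosen for illustration:

```python
# Which source kinds each model type needs as a first-class input,
# per the decision boundaries described above.
REQUIREMENTS = {
    "preseason": {"database"},
    "weekly": {"database", "api", "feed"},
    "dfs": {"database", "api", "feed"},
}

def missing_sources(model_type: str, available: set) -> set:
    """Source kinds the model needs but the pipeline does not yet provide."""
    return REQUIREMENTS[model_type] - available

gaps = missing_sources("weekly", {"database", "api"})
```

A check like this at pipeline startup turns a silent architectural mismatch (an in-season model running without an injury feed) into a loud configuration error.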
One underappreciated distinction: official league data vs. scraped aggregates. Official sources (NFL Next Gen Stats, Statcast, NBA Stats API at stats.nba.com) carry terms of service restrictions on commercial redistribution, while scraped public data carries no licensing guarantee. The projection models explained page situates these data decisions within the broader modeling framework, and the home resource at fantasyprojectionlab.com provides orientation across all projection topics covered in depth here.
The cleanest projection pipelines are not the ones with the most data sources — they are the ones where each source has a defined role, a validated schema, and a documented failure behavior when the feed goes down.