Predicting high scoring NFL teams 🏈

5 min read · Aug 24, 2020

In this walkthrough, we are going to use historical data provided by https://fantasydata.com/ to create a model to predict high scoring NFL teams.

We’re going to skip a lot of the boring stuff in an effort to get results quickly. If you’re interested in all the gory details, the full Jupyter Notebook is available here.

So let’s start by getting our hands on some data. I’ve been using https://fantasydata.com/ for years and that’s what we’ll be using for this example. After you set up an account for FREE, you’ll need to go to your Subscription settings and copy your API key.

We are using the following endpoints, which are well documented at the link provided below:

https://api.sportsdata.io/v3/nfl/scores/json/TeamSeasonStats/

https://api.sportsdata.io/v3/nfl/scores/json/TeamGameStats/

NFL API Documentation | FantasyData (fantasydata.com)

Paste your API Key into the key variable below, and you should be good to go. In this step, we are gathering all available stats at a team level for the 2017, 2018 and 2019 seasons.
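The full notebook has the author's version of this step. As a rough stand-in, here is a minimal sketch that assumes the season is appended to the TeamSeasonStats endpoint and the key is passed as a `key` query parameter (check the API documentation for both). The helper names `season_url`, `fetch_json` and `collect_team_seasons` are made up for illustration.

```python
import json
from urllib.request import urlopen

# Endpoint from the article; the "<season>?key=<key>" suffix is an assumption.
BASE_URL = "https://api.sportsdata.io/v3/nfl/scores/json/TeamSeasonStats/"
SEASONS = (2017, 2018, 2019)


def season_url(season, key):
    """Build the request URL for one season (hypothetical helper)."""
    return f"{BASE_URL}{season}?key={key}"


def fetch_json(url):
    """Issue an HTTP GET request and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def collect_team_seasons(key, seasons=SEASONS, fetch=fetch_json):
    """Return one dict per team per season, stacked across all seasons."""
    rows = []
    for season in seasons:
        rows.extend(fetch(season_url(season, key)))
    return rows


# Usage (needs a real key and network access, so it is left commented out):
# key = "PASTE_YOUR_API_KEY_HERE"
# rows = collect_team_seasons(key)
# import pandas as pd
# df = pd.DataFrame(rows)
# print(df.info())
```

The `fetch` argument is there so the download step can be swapped for a stub when you want to try the rest of the pipeline without spending API calls.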

If you ran the code above successfully, you should see output similar to the following: 96 rows (one for each of the 32 teams in each of the 3 seasons) and 224 columns.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96 entries, 0 to 31
Columns: 224 entries, SeasonType to GlobalTeamID
dtypes: float64(30), int64(190), object(4)
memory usage: 168.8+ KB
None

The next step is to pare the 224 attributes in our data set down to the ones that really “matter” (a simple form of feature selection). To do this, we’ll simply calculate the correlation coefficient between our target (Score) and each of the remaining numeric attributes.

From here, we isolate the 10 attributes most positively correlated with Score and the 10 most negatively correlated with it. Think of the positive list as attributes of a high scoring team and the negative list as attributes of a low scoring team.

                                         CoEfficient
Touchdowns                               0.958891
Kickoffs                                 0.951570
ExtraPointKickingAttempts                0.935872
ExtraPointKickingConversions             0.927289
RedZoneConversions                       0.865665
OffensiveYards                           0.827228
KickoffsInEndZone                        0.826991
RedZoneAttempts                          0.825684
OffensiveYardsPerPlay                    0.792614
PasserRating                             0.792418

                                         CoEfficient
PointSpread                             -0.752413
Punts                                   -0.706528
PuntYards                               -0.696580
OpponentPuntReturns                     -0.609947
OpponentRushingAttempts                 -0.605628
OpponentReturnYards                     -0.583793
OpponentQuarterbackSacksDifferential    -0.531419
OpponentTurnoverDifferential            -0.512285
OpponentTacklesForLossPercentage        -0.502050
TimesSacked                             -0.493162
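Two lists like the ones above can be produced with pandas along these lines. This is only a sketch on a tiny made-up table, not the author's notebook code; the helper name `score_correlations` and every number in `toy` are invented for illustration.

```python
import pandas as pd


def score_correlations(df, target="Score"):
    """Correlate every numeric column with the target, sorted high to low."""
    numeric = df.select_dtypes(include="number")  # drop text columns such as Team
    correlations = numeric.corr()[target].drop(target)
    return correlations.sort_values(ascending=False)


# Made-up numbers, just to show the shape of the computation.
toy = pd.DataFrame({
    "Team": ["A", "B", "C", "D", "E"],
    "Score": [300, 350, 400, 450, 500],
    "Touchdowns": [30, 36, 41, 47, 52],
    "Punts": [90, 80, 75, 60, 55],
    "Penalties": [100, 90, 105, 95, 100],
})

ranked = score_correlations(toy)
top_positive = ranked.head(1)                # use head(10) on the real data
top_negative = ranked.tail(1).sort_values()  # use tail(10) on the real data
print(top_positive.to_frame("CoEfficient"))
print(top_negative.to_frame("CoEfficient"))
```

Dropping the target from its own correlation column matters: otherwise Score would always top the positive list with a coefficient of exactly 1.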

Among the positive attributes, we see expected measures such as Touchdowns, OffensiveYards and ExtraPointKickingAttempts. Some less obvious measures that correlate positively with Score are KickoffsInEndZone, PasserRating and OffensiveYardsPerPlay.

The attributes that correlate negatively with Score tell an equally informative story. Punts, PuntYards and OpponentPuntReturns are expected, as teams that punt more score less. PointSpread is the most interesting attribute here, as it suggests that teams with higher point spreads (i.e., big underdogs) generally score less. OpponentQuarterbackSacksDifferential, OpponentTacklesForLossPercentage and OpponentTurnoverDifferential are also interesting negative variables we can use to inform our model.

If you are more of a visual learner, here’s the code for the correlation matrix as well as the output.

And now the fun starts! After some very basic exploratory data analysis, we have a subset of features we can use to build a (hopefully) more intelligent model. We’ll stick with a basic linear regression here, since we are trying to predict a numerical Score value per team.

I like to evaluate models early on to see if I’m on the right track. Below we pick some appropriate regression metrics to evaluate our model. Results are included as well …

MAE: -4.909 (0.881)
MSE: -34.435 (14.381)
R^2: 0.991 (0.005)
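The "mean (standard deviation)" layout and the negative error values look like scikit-learn's cross-validation scoring, which negates error metrics so that a higher score is always better. Assuming that is how the numbers were produced, an evaluation of this kind could be sketched as follows. The data here is synthetic, and the 10-fold split is an assumption rather than the notebook's actual setting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the 96 team-seasons and a handful of selected features.
rng = np.random.default_rng(0)
X = rng.normal(size=(96, 5))
true_weights = np.array([40.0, 25.0, -15.0, 10.0, -5.0])
y = 350 + X @ true_weights + rng.normal(scale=5.0, size=96)

model = LinearRegression()
cv = KFold(n_splits=10, shuffle=True, random_state=1)  # assumed CV scheme

metrics = [
    ("MAE", "neg_mean_absolute_error"),
    ("MSE", "neg_mean_squared_error"),
    ("R^2", "r2"),
]

results = {}
for label, scoring in metrics:
    # One score per fold; error metrics come back negated.
    scores = cross_val_score(model, X, y, scoring=scoring, cv=cv)
    results[label] = (scores.mean(), scores.std())
    print(f"{label}: {scores.mean():.3f} ({scores.std():.3f})")
```

Read this way, an MAE of -4.909 means an average absolute error of about 4.9 points, with the sign flipped by the scoring convention.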

While it’s great to have all 3 evaluation metrics, I am going to focus primarily on Mean Absolute Error (MAE). According to our MAE metric, our predictions are off by ~5 points on average when checked against random samples of the training data. Considering most teams score at least 300 points per season, that is an error of under 2%, so I’d say we’re on the right track 🎉

This is where it all comes together. We are going to pull data from Week 1 of 2017 and use our model to predict what a team’s score will be by the end of the regular season. The code block below, which is intentionally verbose, does exactly that:

Retrieve Week 1 data and extrapolate out to a full season
Fit a model with the historical data we have from 2017, 2018, 2019 to predict Score
Use our model to predict end of the year Score totals per team, based on 1 week of 2017.

Results also included below …

Team  Score   prediction
15   KC    672  1191.748336
17  LAR    736  1101.438160
19  MIN    464   994.289224
1   ATL    368   884.302910
10  DET    560   880.811397
20   NE    432   879.313098
24  PHI    480   803.814549
3   BUF    336   788.974414
25  PIT    336   759.355129
14  JAX    464   747.110792
0   ARI    368   743.725100
9   DEN    384   732.156183
18   LV    416   728.827852
8   DAL    304   726.286116
28  TEN    256   700.014437
21   NO    304   694.748705
2   BAL    320   649.793840
4   CAR    368   641.107487
11   GB    272   637.372931
29  WAS    272   627.581081
5   CHI    272   611.125367
16  LAC    336   606.655699
13  IND    144   591.028218
7   CLE    288   515.820934
26  SEA    144   504.385931
23  NYJ    192   460.485052
22  NYG     48   384.406178
12  HOU    112   378.169174
6   CIN      0   373.807815
27   SF     48   344.707935

😰 The bad news is that our predicted values are way off scale. Teams only score 300–500 points per season, so any value > 500 is highly suspect. However, this is easily attributed to the fact that we only used one week of data to fuel our predictions … In a better scenario, we would use multiple games to extrapolate season-long values for a better result.
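To make the extrapolation concrete: the Score column in the table is each team's single-game score multiplied by 16, the number of regular-season games in 2017 (KC's 672 is 42 × 16). Here is a minimal sketch of that scaling, generalized to any number of games played. The function name and the choice to leave rate stats unscaled are assumptions, and the yards-per-play value is made up.

```python
REGULAR_SEASON_GAMES = 16  # each team played 16 regular-season games in 2017


def extrapolate_season(stats, games_played, rate_columns=()):
    """Scale counting stats from games_played games up to a full season.

    Columns named in rate_columns (per-play or percentage stats) and
    non-numeric values are copied through unchanged.
    """
    if games_played <= 0:
        raise ValueError("games_played must be positive")
    factor = REGULAR_SEASON_GAMES / games_played
    projected = {}
    for name, value in stats.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and name not in rate_columns:
            projected[name] = value * factor
        else:
            projected[name] = value
    return projected


# Score of 42 matches the table (672 / 16); the yards-per-play figure is invented.
kc_week1 = {"Team": "KC", "Score": 42, "OffensiveYardsPerPlay": 8.0}
one_week = extrapolate_season(
    kc_week1, games_played=1, rate_columns={"OffensiveYardsPerPlay"}
)
print(one_week["Score"])
```

With more weeks in hand, you would pass cumulative totals and the matching `games_played`, so one unusually good or bad game no longer gets multiplied by 16.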

😃 The good news is that we can use this predicted value to “rank” teams as a way to create a data informed power ranking system that changes as new data is added every week.
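Turning predictions into a ranking only requires a sort. A small sketch, using the seven highest predictions from the table above (the function name `power_rankings` is made up):

```python
def power_rankings(predicted):
    """Return (rank, team, prediction) tuples, best prediction first."""
    ordered = sorted(predicted.items(), key=lambda item: item[1], reverse=True)
    return [(rank, team, value) for rank, (team, value) in enumerate(ordered, start=1)]


# Predictions copied from the results table, deliberately out of order.
predictions = {
    "PHI": 803.814549,
    "DET": 880.811397,
    "KC": 1191.748336,
    "NE": 879.313098,
    "MIN": 994.289224,
    "ATL": 884.302910,
    "LAR": 1101.438160,
}

for rank, team, value in power_rankings(predictions):
    print(f"{rank:>2}  {team:<4} {value:8.1f}")
```

Because only the order matters here, the inflated scale of the predictions does not affect the ranking.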

If we look at the 2017 playoff bracket below, we can see that many of the teams at the top of our “power rankings”, based on our model, made it deep into the playoffs. All of our top 10 teams except Detroit made the playoffs in 2017, and our no. 7 ranked team, Philadelphia, wound up winning the whole thing!

Written by Dave Melillo