Building Betting Systems with Python, Pandas & Streamlit (NHL)
A primer on how to use Python, Pandas and Streamlit to build a living data product.
Table of Contents:
— Introduction 👋
— Extract 🏗
— Transform 🤖
— Deploy 🚀
— Afterword: Why Sports Data? 🏒
TL;DR: Here’s a link to the GitHub repo with all the code mentioned below and a link to the working Streamlit app.
Introduction
Winter is long in the Northeastern United States. It’s cold, it gets dark at 5pm, and the days are grey and dreary. One of the silver linings of winter for me is the start of hockey season. Over time, that (naturally) progressed from cheering for my favorite teams to (legally) wagering on hockey games to pass the time.
Hockey is not as popular as sports like American football, and for that reason you can’t bet on hockey the same way you bet on football and expect to be successful. With a less popular sport like hockey, you have to look for profitable systems, such as rested teams vs. tired teams, as this article describes:
When a favorite is coming off two to four days of rest and the opponent is on the second night of a back-to-back, the rested favorite has gone 1,834–1,133 (62%) since 2005, according to Bet Labs Sports
Armed with this information, I knew I could use my data skills to answer two important questions for any NHL (National Hockey League) matchup:
- Which team is relatively more rested than the other?
- Which team is playing a back-to-back game?
Extract
The first order of business was finding a data source and extracting the data that I needed. Luckily, I found this project, which maintains an API of publicly accessible NHL data. My interaction with the Schedule API endpoint is below:
import requests
import pandas as pd
from datetime import datetime

# get current date (the schedule is pulled from the season start through today)
datex = datetime.today().strftime('%Y-%m-%d')

# request the schedule from the NHL stats API
res = requests.get(f"https://statsapi.web.nhl.com/api/v1/schedule?startDate=2022-10-01&endDate={datex}").json()

# normalize the nested JSON into a flat dataframe (one row per game)
data = pd.json_normalize(res, record_path=['dates', 'games'])
data = data.drop_duplicates()

# get away team
df1 = data[['gameDate', 'teams.away.team.name']]
df1 = df1.rename(columns={"teams.away.team.name": "team"})

# get home team
df2 = data[['gameDate', 'teams.home.team.name']]
df2 = df2.rename(columns={"teams.home.team.name": "team"})

# concat results so each game contributes one row per team
df3 = pd.concat([df1, df2], axis='rows', join='inner')
df3['datex'] = pd.to_datetime(df3['gameDate']).dt.date
df3['datex'] = df3.datex.astype('datetime64[ns]')

# clean and sort results
df3 = df3.drop('gameDate', axis=1)
df3 = df3.drop_duplicates()
df3 = df3.sort_values(by=['datex'])
This allowed me to build a sequential dataframe that includes the team and date of each matchup, regardless of home/away status. All of our metrics can be calculated from this simple dataframe.
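A quick sanity check of that frame (just two columns, one row per team per game date):

# inspect the schedule dataframe
print(df3.head())
print(df3.columns.tolist())  # expect ['team', 'datex']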
Transform
To quantify rested vs. tired, I calculated two metrics:
roll7 — a rolling 7-day count of games per team. This requires (1) creating a dataframe of all possible dates within a range, (2) creating a dataframe of all game dates scheduled for each team, (3) merging the two together, and (4) using a simple .rolling().count() to create the final metric.
back2back — once the scheduled matchups were ordered by date per team, I used the .shift() function to create a row-over-row comparison that flags back-to-back games; a toy illustration of both primitives follows.
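Here is a minimal sketch of those two primitives on a dummy series (illustration only, not the app code; the team abbreviation and five-day window are made up):

import numpy as np
import pandas as pd

# toy example: a dummy team that plays on days 1, 2, and 5
toy = pd.DataFrame({'team': ['BOS', 'BOS', np.nan, np.nan, 'BOS']})
toy['roll7'] = toy['team'].rolling(7, min_periods=0).count()     # games in the trailing window
toy['offset'] = toy['team'].shift(1)                             # yesterday's row
toy['back2back'] = np.where(toy['team'] == toy['offset'], 1, 0)  # played yesterday too?
print(toy)  # roll7 climbs 1, 2, 2, 2, 3; back2back flags only day 2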
This was all done with a loop over the list of NHL teams, appending each team’s results to a combined dataframe. The code for this first transformation is below:
import numpy as np

# get unique list of teams (preserving first-seen order)
teamlist = list(dict.fromkeys(df3['team'].tolist()))

frames = []
for team in teamlist:
    # isolate to one team
    tmpdf1 = df3.loc[df3['team'] == team].copy()
    # format date
    tmpdf1['datex'] = pd.to_datetime(tmpdf1['datex']).dt.date
    tmpdf1['datex'] = tmpdf1.datex.astype('datetime64[ns]')
    # create all-dates dataframe covering the season
    tmpdf2 = pd.DataFrame({'datex': pd.date_range(start="2022-10-01", end=datex)})
    tmpdf2['datex'] = tmpdf2.datex.astype('datetime64[ns]')
    # merge so every calendar day has a row (NaN on non-game days)
    tmpdf3 = tmpdf2.merge(tmpdf1, on='datex', how='left')
    # create stats: games played in the trailing 7 days (min_periods=0 so quiet weeks count as 0)
    tmpdf3['roll7'] = tmpdf3['team'].rolling(7, min_periods=0).count()
    # flag back-to-backs: did the same team also play yesterday?
    tmpdf3['offset'] = tmpdf3.team.shift(1)
    tmpdf3['back2back'] = np.where(tmpdf3.team == tmpdf3.offset, 1, 0)
    # limit columns
    tmpdf3 = tmpdf3[['datex', 'team', 'roll7', 'back2back']]
    frames.append(tmpdf3)

# concat the per-team results into one dataframe
dfmain = pd.concat(frames, axis='rows')
The next step was merging the calculated metrics, roll7 and back2back, back onto the original matchup data. This let me look at each matchup, each day, and see clearly which team (home or away) has the advantage. The code used to complete this is below:
# create new clean dataframe of matchups
dfx = data[['gamePk', 'gameDate', 'teams.away.team.name', 'teams.home.team.name']]
dfx = dfx.drop_duplicates()

# clean date
dfx['datex'] = pd.to_datetime(dfx['gameDate']).dt.date
dfx['datex'] = dfx.datex.astype('datetime64[ns]')

# rename columns
dfx = dfx.rename(columns={"teams.away.team.name": "away_team", "teams.home.team.name": "home_team"})

# drop columns
dfx = dfx.drop('gameDate', axis=1)

# merge away data
dfy = pd.merge(dfx, dfmain, left_on=['away_team', 'datex'], right_on=['team', 'datex'], how='left')

# clean away data
dfy = dfy.rename(columns={"roll7": "away_roll7", "back2back": "away_back2back"})
dfy = dfy.drop('team', axis=1)

# merge home data
dfy = pd.merge(dfy, dfmain, left_on=['home_team', 'datex'], right_on=['team', 'datex'], how='left')

# clean home data
dfy = dfy.rename(columns={"roll7": "home_roll7", "back2back": "home_back2back"})
dfy = dfy.drop('team', axis=1)

# get other stats (score and league record)
dfz = data[['gamePk', 'teams.away.score', 'teams.away.leagueRecord.wins', 'teams.away.leagueRecord.losses',
            'teams.home.score', 'teams.home.leagueRecord.wins', 'teams.home.leagueRecord.losses']]

# merge other stats
dfy = pd.merge(dfy, dfz, on='gamePk', how='left')

# clean up column names
dfy = dfy.rename(columns={"teams.away.score": "away_score", "teams.home.score": "home_score",
                          "teams.away.leagueRecord.wins": "away_wins", "teams.home.leagueRecord.wins": "home_wins",
                          "teams.away.leagueRecord.losses": "away_loss", "teams.home.leagueRecord.losses": "home_loss"})
Although I had the raw metrics, I wanted to take this a step further and make it immediately clear which team in each matchup has the advantage. Therefore, I added a few more fields with the value “home” or “away” based on which team has the advantage in each metric category.
def roll7(row):
    # fewer games in the trailing 7 days = more rest = the advantage
    if row['away_roll7'] > row['home_roll7']:
        val = 'home'
    elif row['away_roll7'] < row['home_roll7']:
        val = 'away'
    else:
        val = ''
    return val

dfy['roll7_adv'] = dfy.apply(roll7, axis=1)

def back2back(row):
    # the team NOT playing a back-to-back gets the advantage
    if row['away_back2back'] > row['home_back2back']:
        val = 'home'
    elif row['away_back2back'] < row['home_back2back']:
        val = 'away'
    else:
        val = ''
    return val

dfy['back2back_adv'] = dfy.apply(back2back, axis=1)
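As a design note, the same labels can be produced without a row-wise apply; here is an equivalent vectorized sketch using np.select (same columns, same output, NaNs fall through to the empty default):

# vectorized alternative to the row-wise apply
dfy['roll7_adv'] = np.select(
    [dfy['away_roll7'] > dfy['home_roll7'], dfy['away_roll7'] < dfy['home_roll7']],
    ['home', 'away'], default='')
dfy['back2back_adv'] = np.select(
    [dfy['away_back2back'] > dfy['home_back2back'], dfy['away_back2back'] < dfy['home_back2back']],
    ['home', 'away'], default='')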
Let’s analyze what this data tells us about the matchups on January 18th, 2023. In the matchup between Minnesota and Washington (index 755), the AWAY team, Minnesota, has the advantage according to both roll7_adv and back2back_adv. This means Minnesota is MORE RESTED than Washington, and that Washington is playing a back-to-back game while Minnesota is not. That should make Minnesota the more favorable pick according to our rest-vs-tired system.
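To pull up that slice for any day, filter the merged frame on the date column (a small sketch using the columns built above):

# list each matchup and its advantage calls for a given day
day = pd.to_datetime('2023-01-18')
cols = ['away_team', 'home_team', 'roll7_adv', 'back2back_adv']
print(dfy.loc[dfy['datex'] == day, cols])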
Deploy
Most of the time, I would be more than happy to stop here. I could run this code in a Jupyter notebook every day and use the resulting dataframe to make informed decisions. However, I wanted to try to build a Streamlit app so I could share this information with the rest of the world!
If you aren’t familiar with Streamlit, it’s a tool that allows you to create simple data applications using Python. It’s also worth mentioning that Streamlit was acquired by Snowflake in 2022, so I very much believe it will become a core piece of the modern stack moving forward.
Below is the Streamlit code I used to create and deploy my app on Streamlit Cloud. The Streamlit-specific code is relatively simple: it displays the final dataframe and adds an interactive date-range filter so the end user can choose different slices of time to analyze.
# start streamlit template
import datetime as dt
from datetime import datetime

import streamlit as st

# Sidebar - title & filters
st.sidebar.markdown('### Data Filters')
start_date, end_date = st.sidebar.date_input(
    'Choose a date range :',
    [datetime.today(), datetime.today() + dt.timedelta(days=7)])

# create filter mask (cast the picked dates to match the datetime64 column)
mask = (dfy['datex'] >= pd.to_datetime(start_date)) & (dfy['datex'] <= pd.to_datetime(end_date))

# apply mask
dfstl = dfy.loc[mask]

# Main
st.title("NHL Systems")

# Commentary
st.markdown('Inspiration: https://www.vsin.com/advantage-nhl-rest-vs-tired-matchups/')

# Detailed commentary
rvt = "Rest vs tired here is expressed as **roll7_adv**. Whichever team is named there has played fewer games on a rolling 7-day basis."
b2b = "We also include a **back2back_adv** metric, which indicates which team is playing a back-to-back game. For instance, if home is the value of back2back_adv, the away team played last night and is coming into a back-to-back contest."
st.markdown(rvt)
st.markdown(b2b)

# Dataframe
st.markdown('### Dataframe')
st.dataframe(dfstl.sort_values('datex', ascending=False).reset_index(drop=True), 1000, 1000)
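To run this locally, install the dependencies (pip install streamlit pandas numpy requests) and launch the script with streamlit run app.py, where app.py is a placeholder for whatever you named the file. Streamlit Cloud then deploys the same script directly from a GitHub repo, as long as the dependencies are listed in a requirements.txt.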
Afterword: Why Sports Data?
I work with sports data in many of my projects, and I wanted to take some time to explain why I use this type of data. The most obvious answer is that I like sports, but my reasons extend far beyond that.
1. Sports data is easy to access.
Whether you’re scraping the internet or using a more sophisticated provider, sports data is easy to get your hands on. It’s also not sensitive by nature and is generally voluminous. Even without sophisticated retrieval methods, you can visit sites like ESPN and manually capture any data you need.
2. Sports data is living data.
In the case of the NHL, games are played nearly every day from October through June. This adds a “living” element to the data: games are won and lost, goals are scored, and all sorts of team and player statistics are generated nightly. This closely mimics transactional systems and other real-world data, where fast-changing dimensions and facts must be accounted for and analyzed.
3. Techniques you use to build sports data products can be applied to any domain.
In this specific case, I used sports data to create new data features powering an interactive dashboard for educated end users. Replace the sports data with finance, healthcare, transportation, customer, or product data and you would get a similar result. You can use sports data to build regression, classification, and deep learning models with no risk. You can use sports data to model and manage your own data warehouse and then take those lessons to the real world. If you can do it with sports data, you can do it with ANY DATA!