NHL Scraping in R

by Matthew Barlowe


OK, I'm going to do a brief tutorial on how to scrape NHL games using Emmanuel Perry's scraper, of Corsica fame. Another big thanks goes to EvolvingWild for setting up the code as it is, to make it easier to run. They both did 99% of the work here; I just did a little rearranging to make it easier for you to use.

Getting Started

The actual code we will be using can be found on my Github here. This tutorial assumes some familiarity with R and RStudio. If you have RStudio installed and running and know how to install packages, you should be able to follow along just fine. That said, I suggest knowing enough R to do some basic data analysis, since the data produced can be very large and you'll want to actually work with it.

If you aren't there yet, don't be discouraged: working with this data taught me a lot about dplyr and the other R tidyverse packages, all the more so because I found the data interesting in and of itself. The first step in getting this code going is to make sure you have the correct packages installed. Near the top of the script you'll see the packages that are loaded when you run it.


## Dependencies
require(RCurl); require(rjson); require(dplyr);
require(lubridate); require(doMC); require(rvest)
      

If you don't have one of these packages, all you need to do is type install.packages('rjson') (substituting the relevant package name) in your RStudio console, and RStudio will install the package for you. If you aren't sure what you have installed, look at the Packages tab in RStudio, which lists every installed package. Here is a picture of where that tab is:

RStudio Packages tab
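
If you'd rather install everything you're missing in one shot, a short snippet like the following works (Windows users, see the update at the bottom of this post about doMC):


## Install any missing dependencies in one go
## (doMC is Unix-only; on Windows see the doParallel update below)
pkgs <- c("RCurl", "rjson", "dplyr", "lubridate", "doMC", "rvest")
install.packages(pkgs[!pkgs %in% installed.packages()[, "Package"]])
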

Once you have the packages installed, the next part of the code we need to look at, and for you the most important part, is the very bottom of the script.


## Scrape a single game and store the results in a list
pbp_list <- ds.compile_games(games = 20701,
                             season = "20162017",
                             pause = 2,
                             try_tolerance = 5,
                             agents = ds.user_agents)

## The first element of the list is the play-by-play dataframe
pbp_df <- pbp_list[[1]]
      

These lines are where the script tells the scraper to go do its stuff, i.e. scrape NHL games and store the info in pbp_list. This is done with the ds.compile_games function. Let's go over the main keywords for this function, which are games and season. games can be either a single number or a vector covering a range of numbers, like c(20001:20100). NHL game numbers before this season ranged from 20001 to 21230, but this season they will max out at 21271 because of the addition of the Vegas Golden Knights.
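
For example, to grab that whole range of games in one call, you'd pass the vector straight to games (with the two-second pause between requests, a pull this size takes a while):


## Scrape the first 100 regular season games of 2016-2017 in one call
pbp_list <- ds.compile_games(games = c(20001:20100),
                             season = "20162017",
                             pause = 2,
                             try_tolerance = 5,
                             agents = ds.user_agents)

pbp_df <- pbp_list[[1]]
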

For the season keyword you'll need to include both years of the season, as is done above. So if you wanted to scrape this season's games you would put '20172018', last year would be '20162017', etc. Once you have set ds.compile_games() to scrape the games you want, the next line pulls the pbp dataframe out of the list the scraper creates and stores it in a separate data frame.

This variable, pbp_df, is where the actual pbp data you want to work with will be stored. If you scraped one game, that dataframe will contain one game; if you scraped 100 games, it will contain all 100. Be careful about scraping a lot of games in one session, i.e. feeding many games to the ds.compile_games function at once, because if the code throws an error partway through, you'll lose all the data you've scraped and have to repeat the time-consuming process.
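
Once a scrape finishes, a few base R commands give you a quick look at what came back (the exact columns will depend on the scraper's output):


## Quick sanity checks on the scraped play-by-play
dim(pbp_df)     # number of events (rows) and pbp fields (columns)
str(pbp_df)     # column names and types
head(pbp_df)    # the first few events
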

Running the Script

Once you have the keyword variables set in the scraping function, all that's left to do is hit Source at the top of RStudio and wait for the scraper to finish its job! What you do with the dataframe it produces is up to you, although I have a couple of tips. First, save whatever you scrape to a text file, so that if you want the data again you can just read the file in instead of scraping it again. Second, accumulate the data with a for loop that feeds the function one game at a time and saves each game to its own text file; that way, if the code does throw an error, you can pick up right where you left off rather than scraping everything all over again.


library('readr')

## Every regular season game number from a 30-team season
games <- c(20001:21230)

## Scrape one game at a time, writing each game's pbp to its own
## pipe-delimited text file named after the game number
for (game in games) {
    pbp_list <- ds.compile_games(games = game,
                                 season = "20162017",
                                 pause = 2,
                                 try_tolerance = 5,
                                 agents = ds.user_agents)

    pbp_df <- pbp_list[[1]]

    write_delim(pbp_df, as.character(game), delim = '|')
}
      

Update:

For Windows users: you will need to replace the doMC package at the top of the file with the snow or doParallel package. I've only used the doParallel package, so I know that one works. To install it, just type install.packages('doParallel'), then change the require(doMC) at the top of the script to require(doParallel), and that should solve any issues you have with the program.
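
Put together, the Windows fix is just:


## One-time install of the Windows-friendly parallel backend
install.packages('doParallel')

## Then, at the top of the scraper script, replace require(doMC) with:
require(doParallel)
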

In the code above I put the process in a loop for you and set it to write the pbp to a |-delimited text file named after the game number. Obviously, if you are scraping more than one season, you'd want to add another differentiator to the file name, since the same game number from the next season would overwrite the earlier file. Also, most people know CSV as the common way to store data in text files, but I avoid it for a few reasons. The first is that commas are very often found in the data itself, which makes reading the file back in difficult for some programs. Since the pipe character, |, is rarely found in most people's data, it keeps those troubles from happening. All the text backups for my NHL database are stored in pipe-delimited files. I swear by it.
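
Reading a saved game back in later is a single read_delim call, and a season prefix on the file name (a naming scheme I'm suggesting here, not something the loop above does for you) avoids the overwrite problem:


library('readr')

## Read a previously scraped game back in instead of re-scraping it
pbp_df <- read_delim("20701", delim = "|")

## When scraping multiple seasons, prefix the file name with the season
## so game numbers don't collide across years (suggested scheme only)
write_delim(pbp_df, paste0("20162017_", "20701"), delim = "|")
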

And that's pretty much all there is to this program. If you'd like to see it in action, check out my xGscrape script, which is the setup I use to scrape the data for my daily game reports and to populate the SQL database the Twitter bot uses to answer its queries.
