Plotting Tape2Tape

Tape 2 Tape Data

Today I’m going to go over a few of the ways you can graph Tape 2 Tape data and show the different types of zone entries and shot assists on a rink diagram. For this exercise you will need R installed along with the tidyverse package. tidyverse is actually a collection of packages, and you probably won’t need all of them for this exercise, but I find it’s easier to load them all at once.

This tutorial is written from the standpoint of someone who is pretty new to R, so if you’ve just started don’t worry. However, you should have some familiarity with running scripts in R or RStudio, and with basic syntax such as <- assignment and function notation. Some familiarity with ggplot2 will also help in understanding the graphs. Ok, let’s get to work.

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.0
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
##     date

The Data

Ok, the first thing to do is import the data into a dataframe. You can think of a dataframe as something like an Excel spreadsheet: there are rows and columns, with a value at the intersection of each row and column. I’m going to do this using the read_csv function from the readr package, which was loaded into the script when we called library(tidyverse) up above.

readr is mainly a package for reading in character-delimited files such as csv files, but also tab-delimited files or, my personal favorite, | delimited files. In this instance we will be reading in a csv file, hence the use of read_csv.

The data comes in two files: one contains the actual play by play that has been tracked, and the other lists each player along with their skater id. We’ll need that second file because, as you’ll see, the play by play file identifies each player by id instead of by name. Ok, let’s get the data in and get to work.

file_name <- c('~/HockeyStuff/Tape2TapeData/Tape2Tape/11_10_plays.csv')
players_file_name <- c('~/HockeyStuff/Tape2TapeData/Tape2Tape/11_10_roster.csv')

pbp <- read_csv(file_name)
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   periodTime = col_time(format = ""),
##   periodTimeRemaining = col_time(format = ""),
##   event = col_character(),
##   eventType = col_character(),
##   eventResult = col_character(),
##   eventTeam = col_character(),
##   eventId = col_character(),
##   pass0result = col_character(),
##   pass0team = col_character(),
##   pass1result = col_character(),
##   pass1team = col_character(),
##   pass2result = col_character(),
##   pass2team = col_character(),
##   pass3result = col_character(),
##   pass3team = col_character(),
##   pass4result = col_character(),
##   pass4team = col_character(),
##   tags = col_character(),
##   linkedEvents = col_character()
## )
## See spec(...) for full column specifications.
players <- read_csv(players_file_name)
## Parsed with column specification:
## cols(
##   playerId = col_integer(),
##   team = col_character(),
##   teamName = col_character(),
##   fullName = col_character(),
##   primaryNumber = col_integer(),
##   positionCode = col_character()
## )
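
The column specification printed above is readr’s guess at the type of each column. If you would rather not rely on guessing, you can state the types yourself. Here is a minimal sketch that writes a tiny stand-in file to a temporary path so it runs on its own; the three column names are borrowed from the play by play file, but the rows and the demo_file and demo names are invented for illustration.

```r
library(readr)

# Write a tiny stand-in file so the example runs on its own.
# The rows are made up; only the column names come from the real data.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("period,eventTeam,x0",
             "1,BOS,25",
             "1,TOR,-40"),
           demo_file)

# Declare the column types up front instead of letting readr guess them.
demo <- read_csv(demo_file,
                 col_types = cols(period    = col_integer(),
                                  eventTeam = col_character(),
                                  x0        = col_integer()))

# spec() reports the column specification that was used for the read.
spec(demo)
```

Passing col_types also stops readr from printing its guessed specification, which keeps the console quieter once you know what the file looks like.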

Ok, what I’ve done here is store the paths to the files as they are on my computer in variables, and then read those files into dataframes I’ve named pbp and players. You can put the file path directly in the read_csv call, but it is good coding practice to store it in a variable and then use that variable throughout the script.

One reason is that it saves you from a lot of typing or cutting and pasting. Another is that if you want to run the same script on another set of data, you only have to change that one variable, instead of editing every place in the script where the file is read.

You can see from the output that the files were read in successfully. You don’t really need to know what the output means; it’s just telling you which type each column was parsed as, for example that the periodTime column was read in as a time. Again, not much you need to worry about. Let’s take a look at the data.

head(pbp)
## # A tibble: 6 x 68
##   period periodTime periodTimeRemaining awayPlayer0Id awayPlayer1Id
##    <int> <time>     <time>                      <int>         <int>
## 1      1 17'00"     19:43                     8478443       8471418
## 2      1 26'00"     19:34                     8478443       8471418
## 3      1 37'00"     19:23                     8478443       8471418
## 4      1 43'00"     19:17                     8474062       8465009
## 5      1 52'00"     19:08                     8474062       8465009
## 6      1 54'00"     19:06                     8474062       8465009
## # ... with 63 more variables: awayPlayer2Id <int>, awayPlayer3Id <int>,
## #   awayPlayer4Id <int>, awayPlayer5Id <int>, homePlayer0Id <int>,
## #   homePlayer1Id <int>, homePlayer2Id <int>, homePlayer3Id <int>,
## #   homePlayer4Id <int>, homePlayer5Id <int>, event <chr>,
## #   eventType <chr>, eventResult <chr>, eventTeam <chr>, eventId <chr>,
## #   x0 <int>, y0 <int>, x1 <int>, y1 <int>, player0Id <int>,
## #   player1Id <int>, pass0x0 <int>, pass0y0 <int>, pass0x1 <int>,
## #   pass0y1 <int>, pass0result <chr>, pass0team <chr>,
## #   pass0player0Id <int>, pass0player1Id <int>, pass1x0 <int>,
## #   pass1y0 <int>, pass1x1 <int>, pass1y1 <int>, pass1result <chr>,
## #   pass1team <chr>, pass1player0Id <int>, pass1player1Id <int>,
## #   pass2x0 <int>, pass2y0 <int>, pass2x1 <int>, pass2y1 <int>,
## #   pass2result <chr>, pass2team <chr>, pass2player0Id <int>,
## #   pass2player1Id <int>, pass3x0 <int>, pass3y0 <int>, pass3x1 <int>,
## #   pass3y1 <int>, pass3result <chr>, pass3team <chr>,
## #   pass3player0Id <int>, pass3player1Id <int>, pass4x0 <int>,
## #   pass4y0 <int>, pass4x1 <int>, pass4y1 <int>, pass4result <chr>,
## #   pass4team <chr>, pass4player0Id <int>, pass4player1Id <int>,
## #   tags <chr>, linkedEvents <chr>

The head() function returns the first six rows of a dataframe. You can see all the different columns here, so let’s briefly break down each one:

period: the period the event took place in
periodTime: the scorekeeper’s time for each event
periodTimeRemaining: what the clock would show if you were watching the game
awayPlayer0Id...awayPlayer5Id: the six away players on the ice for the event,
        including the goalie
homePlayer0Id...homePlayer5Id: same as away but with home players
event: the main events tracked by Tape2Tape, which include Zone Entry, Zone Exit,
        Blocked Shot, Shot, Missed Shot, and Goal
eventType: this column further describes the event column. It gives the type of
            shot, or the type of Zone Exit/Entry as either Uncontrolled, Failed, or
            Controlled
eventResult: Lost or Recovered. This only applies to Uncontrolled Zone Exits
        and Entries
eventTeam: the team that performed the event in the event column
x0, y0: the location of the event on the ice. There will only be a
        corresponding x1, y1 value if the event was a zone exit created by a pass
player0Id, player1Id: player0 is the player that performed the event in the event
                    column. If the event is a pass then player0 is the passer and
                    player1 is the player that received the pass.
pass0x0, pass0y0...: these columns detail the passes that led up to the event in
                    the event column. pass0 is the pass right before the event,
                    pass1 is the pass before that etc.
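
Because the pass columns follow a regular naming pattern, a vector of their names can be generated rather than typed out by hand. A small sketch using base R’s paste0 and rep; the variable names pass_player_cols and event_player_cols are just illustrative.

```r
# Build the player id column names from the pattern described above:
# pass0 through pass4, each with a player0Id and a player1Id.
pass_player_cols <- paste0("pass", rep(0:4, each = 2),
                           "player", rep(0:1, times = 5), "Id")

# Add the two columns for the event itself in front.
event_player_cols <- c("player0Id", "player1Id", pass_player_cols)

event_player_cols
```

Generating the names this way avoids the easy mistake of repeating or skipping one when listing twelve near-identical strings.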

The players dataframe is a lot more straightforward and its columns are self-explanatory, so we won’t go over them. Ok, let’s convert all the player ids in the play by play dataframe to actual player names, to make looking at the data and subsetting it easier.

#convert player ids to actual player names from the player dataframe
convert_ids <- function(column, player_df){
    #match() finds the row of each id in player_df; ids with no match give NA
    player_df$fullName[match(column, player_df$playerId)]
}

#converting playerids to playernames
pbp[4:15] <- pbp %>% select(awayPlayer0Id:homePlayer5Id) %>%
    sapply(convert_ids, player_df = players)

#the event and pass player id columns get converted the same way
player_cols <- c('player0Id', 'player1Id', 'pass0player0Id', 'pass0player1Id',
                 'pass1player0Id', 'pass1player1Id', 'pass2player0Id',
                 'pass2player1Id', 'pass3player0Id', 'pass3player1Id',
                 'pass4player0Id', 'pass4player1Id')

pbp[player_cols] <- pbp[player_cols] %>%
    sapply(convert_ids, player_df = players)

Ok, let’s go over the code a bit. The first part is a function I created that gets passed a column and a dataframe. The dataframe is the players dataframe we created earlier. The function goes along the column we pass to it, matches each id to the playerId column in the players dataframe, and returns the corresponding value from the fullName column.

Once it has done that for every value in the column, it returns those names, and we use them to overwrite the id columns in the pbp dataframe with names instead of ids.
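
To see the lookup idea in isolation, here is a small sketch on made-up data. toy_players stands in for the players dataframe, and the ids and names are invented for illustration.

```r
# A tiny stand-in for the players dataframe (ids and names are invented).
toy_players <- data.frame(playerId = c(101L, 102L, 103L),
                          fullName = c("Player A", "Player B", "Player C"),
                          stringsAsFactors = FALSE)

# A column of ids as it might appear in the play by play data;
# 999 is not in the roster and NA stands for an empty slot.
ids <- c(102L, 101L, 999L, NA)

# match() gives the row position of each id in toy_players$playerId,
# or NA when there is no match.
positions <- match(ids, toy_players$playerId)
positions

# Indexing fullName by those positions turns the ids into names.
toy_players$fullName[positions]
```

Ids that are missing from the roster simply come back as NA, which is usually easier to spot later than a silently wrong name.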

The next part of the code applies that function to all the columns we need in the pbp dataframe using sapply. sapply is similar to a loop in that it repeats the same action over each element, but it handles the bookkeeping for you. R is built around vectorized operations, and loops in R get their reputation for being slow mostly because it is easy to write one that grows its result one element at a time, which forces R to keep reallocating memory. The apply family of functions and the plyr functions set up the result for you, so you avoid that pitfall, even though they are still looping internally.
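
To make the comparison with a loop concrete, here is a small sketch with made-up numbers. It does the same job three ways: a for loop that allocates its result up front, sapply, and a fully vectorized call.

```r
values <- c(4, 9, 16, 25)

# Loop version: allocate the full result first, then fill it in.
roots_loop <- numeric(length(values))
for (i in seq_along(values)) {
    roots_loop[i] <- sqrt(values[i])
}

# sapply version: same work, with the bookkeeping handled for us.
roots_sapply <- sapply(values, sqrt)

# Vectorized version: sqrt() already works on a whole vector at once.
roots_vec <- sqrt(values)

# All three approaches give the same answer.
identical(roots_loop, roots_sapply)
identical(roots_sapply, roots_vec)
```

When a function is already vectorized, as sqrt is, calling it directly on the vector is the simplest option; sapply earns its keep when the function only handles one element or one column at a time.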

But back to our sapply. All I’m doing is moving over all these columns, looking up the player name for each player id, and copying it into the column. This will allow us to break down our graphs by player name without having to consult the players dataframe anymore.

sapply is part of a group of apply functions found in base R. The main difference between sapply and the other apply functions is that sapply will simplify its result to a vector (or a matrix) where it can, unlike lapply, which always returns a list. If you would like to stick with the full tidyverse, the map functions from the purrr package do the same job.
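
A quick sketch of that difference, using a made-up list of two numeric columns:

```r
toy_cols <- list(a = c(1, 2, 3), b = c(10, 20, 30))

# lapply always returns a list, one element per input.
as_list <- lapply(toy_cols, mean)

# sapply tries to simplify that list; here it becomes a named numeric vector.
as_vector <- sapply(toy_cols, mean)

class(as_list)
class(as_vector)
```

In purrr terms, map behaves like lapply and returns a list, while typed variants such as map_dbl return a vector.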

If you are coming from Python the map function will feel familiar: just as map applies a function to a list in Python, map applies a function to a vector in R. If at this point you’re wondering why I used sapply over map, it’s actually quite simple: that’s the first Stack Overflow answer I came across that did what I wanted. Now that’s done, let’s get to graphing.

Creating the Rink

Ok, the first thing we need to do is pull in the code to draw our rink. This code can be found at this link. Once you’ve saved it to the same directory you’ll be writing your graph scripts in, we’ll load the rink function with the source command, then call it and store the result in a variable for ease of use later.

source('~/graphautomation/RinkFunction.R')
rink <- fun.draw_rink() + coord_fixed()

This rink code was written by Prashanath Iyer. Now that we have our rink function loaded, let’s look at graphing some different types of data from the game. We’ll start with the zone entries, and the first thing we’ll do is subset our data by team.

boston_df <- pbp %>% filter(eventTeam == 'BOS')
toronto_df <- pbp %>% filter(eventTeam == 'TOR')

Ok, now that’s done, let’s do a simple graph and look at the zone entries for Boston first:

rink + geom_point(aes(x = x0, y = y0),
                  data = filter(boston_df, event == 'Zone Entry'))

Ok, let’s go deeper and look at just Boston’s uncontrolled entries and whether they were lost or recovered. What I’ll do here is map the color aesthetic to the eventResult column.

rink + geom_point(aes(x = x0, y = y0, color = eventResult),
                  data = filter(boston_df, event == 'Zone Entry',
                                eventType == 'Uncontrolled'))

As we can see here, and as past analysis has shown, uncontrolled zone entries are largely lost. But it seems that when Boston was attacking the left side of the ice, if they dumped it in against the left defense they were able to recover it more often than not. Another thing to note is that since Boston was the away team, those zone entries would have happened in the second period, when teams have the long change.

Does that mean something? It’s hard to say, since when they attacked the other side of the ice they failed miserably at recovering uncontrolled zone entries on the right side. With more data we could see whether this is common among all teams or just the Toronto Maple Leafs this year.
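
If you want numbers to go with the picture, the same filter can feed a count instead of a plot. This is only a sketch on invented rows: toy_entries mimics the columns used in the plot above, and splitting the ice by the sign of x0 is an assumption about the coordinate system, so check it against your own data.

```r
library(dplyr)

# Invented rows that mimic the columns used in the plot above.
toy_entries <- tibble(
    event       = "Zone Entry",
    eventType   = c("Uncontrolled", "Uncontrolled", "Controlled",
                    "Uncontrolled", "Uncontrolled"),
    eventResult = c("Lost", "Recovered", NA, "Lost", "Lost"),
    x0          = c(30, 35, 28, -31, -40)
)

# Label each entry by the sign of x0 (assumed to separate the two ends),
# then count lost vs recovered uncontrolled entries for each end.
entry_counts <- toy_entries %>%
    filter(event == "Zone Entry", eventType == "Uncontrolled") %>%
    mutate(end = if_else(x0 < 0, "negative x", "positive x")) %>%
    count(end, eventResult)

entry_counts
```

Swapping toy_entries for boston_df would run the same count on the tracked game, as long as the coordinate assumption holds.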

We can look at Corsi events as well:

rink + geom_point(aes(x = x0, y = y0, color = event),
                  data = filter(boston_df, event %in% c('Shot', 'Blocked Shot',
                                                        'Missed Shot', 'Goal')))
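
The same %in% filter is handy for a quick tally of shot attempts per team. A minimal sketch on invented rows; toy_pbp, corsi_events, and the attempts column name are illustrative, not part of the Tape2Tape data.

```r
library(dplyr)

# Keep the list of Corsi events in one place so it can be reused.
corsi_events <- c("Shot", "Blocked Shot", "Missed Shot", "Goal")

# Invented play by play rows with just the two columns needed here.
toy_pbp <- tibble(
    eventTeam = c("BOS", "BOS", "TOR", "TOR", "BOS", "TOR"),
    event     = c("Shot", "Zone Entry", "Goal", "Missed Shot",
                  "Blocked Shot", "Zone Exit")
)

# Keep only the Corsi events, then count them for each team.
corsi_counts <- toy_pbp %>%
    filter(event %in% corsi_events) %>%
    count(eventTeam, name = "attempts")

corsi_counts
```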