Home

Projects

Tutorials

Analysis

Basic SQL Commands

by Matthew Barlowe


Ok if you've read the first tutorial on installing Postgres and getting your first database up and running, you're now wondering "So how do I access all this data?" Well that's where this tutorial comes in as we are going to cover some of the basic commands in SQL to get the data we want. Ok let's get started and dive right in to our data.

SELECT

The SELECT command is going to be your workhorse command when working with SQL. It's the command that tells Postgres to pull the data from the table and return it to you so you can use it. The SELECT is always paired with a FROM clause so you can tell Postgres exactly what table you want to select your data. Let's take a look at the command I used at the end of the first tutorial:


SELECT * FROM playerstats LIMIT 1;
      

Ok let's break it down piece by piece. SELECT tells Postgres that we want to pull data. The * tells Postgres that we want the data from every column in the table. You can choose whatever columns you want the data from and it doesn't have to be all of them, but this example we are taking all of them. Next is the FROM clause that tells Postgres we want to pull all the columns from the playerstats table.

If you have multiple tables you would change the name of the table to whatever table you wanted data from. But right now all we have is the playerstats so that's the one we will be using for the rest of this tutorial. The LIMIT clause while technically not needed for a SELECT query is very important here. If we had just entered in the command without using the LIMIT Postgres would have returned every single row in the table and you almost never want that. So I added the LIMIT clause and passed it the number 1 telling Postgres to limit the results to 1 row. Pretty simple eh?

Ok lets do a couple other SELECT queries to get the hang of it. First though lets get a list of the columns in our table with the \d+ playerstats which will produce and output similar to this:


    Column    |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description
--------------+---------+-----------+----------+---------+----------+--------------+-------------
 player       | text    |           |          |         | extended |              |
 team         | text    |           |          |         | extended |              |
 toi          | real    |           |          |         | plain    |              |
 cf           | real    |           |          |         | plain    |              |
 ca           | real    |           |          |         | plain    |              |
 c_plus_minus | real    |           |          |         | plain    |              |
 ff           | real    |           |          |         | plain    |              |
 fa           | real    |           |          |         | plain    |              |
      

So now we have some columns lets adjust our SELECT statement to just include a few of them instead of all of them:


SELECT player, game_date, team, cf, ca, c_plus_minus FROM playerstats LIMIT 10;
      

Which will produce an output similar to this:


       player      | game_date  | team | cf | ca | c_plus_minus
-----------------+------------+------+----+----+--------------
   ANDREJ.MESZAROS | 2014-10-23 | BUF  | 21 | 31 |          -10
   BRIAN.GIONTA    | 2014-10-23 | BUF  | 14 | 29 |          -15
   CHRIS.STEWART   | 2014-10-23 | BUF  | 29 | 23 |            6
   CODY.HODGSON    | 2014-10-23 | BUF  | 24 | 22 |            2
   CODY.MCCORMICK  | 2014-10-23 | BUF  |  5 |  9 |           -4
   DREW.STAFFORD   | 2014-10-23 | BUF  |  5 | 10 |           -5
   JHONAS.ENROTH   | 2014-10-23 | BUF  | 45 | 77 |          -32
   JOSH.GORGES     | 2014-10-23 | BUF  | 10 | 31 |          -21
   MARCUS.FOLIGNO  | 2014-10-23 | BUF  | 21 | 24 |           -3
   MATT.MOULSON    | 2014-10-23 | BUF  | 15 | 29 |          -14
(10 rows)
      

You can play around with the columns as much as you want and see the different outputs. So now we're getting some more info that is something we can work with but nothing we can actually draw insights from. The next clause though will help us narrow our data even more.

WHERE

So above we learned how to limit the columns the SQL query returns so now lets learn how to limit the rows that are returned. This is done in SQL by using the WHERE clause with your SELECT. WHERE gives your an SQL query a condition that it checks every row against in the table and if the condition is met it returns the row and if it isn't then the row is excluded. Let's look at it in action:


SELECT player, game_date, team, cf, ca, c_plus_minus FROM playerstats
WHERE team='CAR' LIMIT 10;
      

Now there is a order in whcih clauses go where in the SQL queries. Generally its SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT. So when playing around with your queries if you don't stick to the order Postgres will throw an error. Obviously you don't know what all of those clauses do, but just keep in mind for now that WHERE clauses will go before Limit clauses. So if you ran command above this should be your output:


          player      | game_date  | team | cf | ca | c_plus_minus
----------------------+------------+------+----+----+--------------
       ANDREJ.SEKERA   | 2014-11-16 | CAR  | 40 | 18 |           22
       ANTON.KHUDOBIN  | 2014-11-16 | CAR  | 84 | 44 |           40
       BRAD.MALONE     | 2014-11-16 | CAR  | 11 |  9 |            2
       BRETT.BELLEMORE | 2014-11-16 | CAR  | 22 | 14 |            8
       CHRIS.TERRY     | 2014-11-16 | CAR  | 23 |  8 |           15
       ELIAS.LINDHOLM  | 2014-11-16 | CAR  | 27 | 16 |           11
       ERIC.STAAL      | 2014-11-16 | CAR  | 37 | 14 |           23
       JAY.MCCLEMENT   | 2014-11-16 | CAR  |  7 |  9 |           -2
       JEFF.SKINNER    | 2014-11-16 | CAR  | 35 | 12 |           23
       JIRI.TLUSTY     | 2014-11-16 | CAR  | 28 | 14 |           14
(10 rows)
      

If you're a Carolina fan some of those names will make you cringe, but as you can see where before Buffalo was the first ten rows returned now the first ten rows returned will only be rows where the team column is equal to CAR. You can also chain these commands as well using logical switches such as AND or OR like so:


SELECT player, game_date, team, cf, ca, c_plus_minus FROM playerstats
WHERE team='CAR' AND game_date > '2018-02-01' LIMIT 10;
      

This will give us all the rows where the team is 'CAR' and the game_date is after Feburary 2nd, 2018. There are many more ways to combine these as well to help select the data that you are looking for but this is the basics. There will be a lot of trial and error at the beginning as you get used to it so don't despair and just keep at it.

Aggregate Functions

Ok so we've learned how to select the appropriate columns and rows that we want from our data, but how do we look at the overall picture? How do we look at say the total goals scored or assists earned over a season or even a certain time frame? This is done with our powerful aggregate functions and the GROUP BY.

As you can see with the data each row is an individual game for an individual player so just using WHEREwon't tell you the whole story. So let's try and see if we can calculate the goals and assists for Jeff Skinner for the 2018 season. We'll do that by using an SQL query that looks like this:


SELECT player, sum(g), sum(a1), sum(a2) FROM playerstats
WHERE player='JEFF.SKINNER' AND game_date > '2017-10-01' GROUP BY player;
      

Just a warning this won't be correct since we are using the data from my data dump that stopped at the end of February this is just to show you the syntax. Which should make this the output:


    player    | sum | sum | sum
--------------+-----+-----+-----
 JEFF.SKINNER |  20 |   9 |   8
(1 row)
      

Well those column names aren't very helpful. However there is a simple way to change that using the AS command in your query:


SELECT player, sum(g) AS goals, sum(a1) AS a1, sum(a2) AS a2
FROM playerstats WHERE player='JEFF.SKINNER' AND game_date > '2017-10-01'
GROUP BY player;
      

And now your output looks much nicer:


    player    | goals | a1 | a2
--------------+-------+----+----
 JEFF.SKINNER |    20 |  9 |  8
(1 row)
      

All AS does is just create an alias for the column. You can do this with any column you want along with tables as well. It just creates another name for something that you can refer to later on in the query. This won't work with WHERE statements because the WHERE clause references the raw data and your alias is created after the fact.

There several aggregate functions in SQL with the most common being sum, avg, min, max, count. Respectively those add the values, average the values, find the minimum value, the maximum value, and last counts the number of values i.e. the of rows in a column.

The next and most important part of an aggregate function is the GROUP BY clause. This tells Postgres how we want to calculate the aggregate functions. If we left out the player column then the functions would just sum the total goals and assists of every player after the date we gave it. GROUP BY though allows us to tell Postgres that we want to group the sums by each player. So Postgres will then group the rows together by each individual player then sum the goals and assists and return those totals for each player creating one row where once there where many. Let's take a look at our old command but this time we won't limit it to Jeff Skinner.


SELECT player, sum(g) AS goals, sum(a1) AS a1, sum(a2) AS a2
FROM playerstats WHERE game_date > '2017-10-01'
GROUP BY player LIMIT 10;
      

And this will be the output:


      player       | goals | a1 | a2
-------------------+-------+----+----
 BRANDON.HALVERSON |     0 |  0 |  0
 TOM.WILSON        |    10 | 10 |  7
 DARREN.HELM       |     7 | 11 |  4
 MATT.NISKANEN     |     4 | 10 |  5
 ANZE.KOPITAR      |    25 | 21 | 18
 BRANDON.DUBINSKY  |     5 |  7 |  3
 TRAVIS.DERMOTT    |     1 |  2 |  5
 CLAUDE.GIROUX     |    20 | 26 | 24
 ANTTI.NIEMI       |     0 |  0 |  1
 KRIS.VERSTEEG     |     3 |  3 |  2
      

As you can see we are now calculating the goal and assist totals for all players in our time frame. I added a LIMIT clause or else it would return every NHL player during that time frame. GROUP BY doesn't just work for players we can use it on the team column as well or any other categorical column in your database. In this one that is just mainly our teams and players, but if you were working with another database with different categorical columns GROUP BY would work just as well on those with calculating aggregate stats. If you have some experinece with pandas or R dataframes GROUP BY in SQL works basically the same as .groupby and group_by respectively in those langauges.

ORDER BY

Ok the last command we'll look at in this tutorial is the ORDER BY clause in our select statements. ORDER BY just tells Postgres by what column we want to order our data when it is returned to us. So using last command above:


SELECT player, sum(g) AS goals, sum(a1) AS a1, sum(a2) AS a2
FROM playerstats WHERE game_date > '2017-10-01'
GROUP BY player ORDER BY goals LIMIT 10;
      

Which produces this output:


      player       | goals | a1 | a2
-------------------+-------+----+----
 CAL.O'REILLY      |     0 |  0 |  0
 MARTIN.MARINCIN   |     0 |  0 |  0
 ZACH.REDMOND      |     0 |  0 |  0
 PAUL.MARTIN       |     0 |  0 |  0
 CHRIS.BIGRAS      |     0 |  0 |  1
 JAMES.REIMER      |     0 |  0 |  0
 BRANDON.HALVERSON |     0 |  0 |  0
 MADISON.BOWEY     |     0 |  4 |  8
 ANTTI.NIEMI       |     0 |  0 |  1
 MIKE.BLUNDEN      |     0 |  0 |  0
(10 rows)
      

Well that's really no fun but is an important lesson. ORDER BY always sorts in ascending order. So it will always sort from the smallest value in the column to the largest unless we tell it otherwise. To do that all you need to do is add the DESC flag to your order by statement like so:


SELECT player, sum(g) AS goals, sum(a1) AS a1, sum(a2) AS a2
FROM playerstats WHERE game_date > '2017-10-01'
GROUP BY player ORDER BY goals DESC LIMIT 10;
      

Which gives us this output:


      player      | goals | a1 | a2
------------------+-------+----+----
 ALEX.OVECHKIN    |    36 | 16 | 15
 EVGENI.MALKIN    |    33 | 19 | 18
 NIKITA.KUCHEROV  |    32 | 22 | 23
 WILLIAM.KARLSSON |    31 |  9 | 12
 ANDERS.LEE       |    30 | 12 |  6
 JOHN.TAVARES     |    30 | 22 | 12
 SEAN.COUTURIER   |    29 | 20 | 10
 TYLER.SEGUIN     |    29 | 16 |  6
 PATRIK.LAINE     |    28 | 11 |  9
 AUSTON.MATTHEWS  |    27 | 12 | 10
(10 rows)
      

And now we have the top ten goal scorers over the time frame we passed our SQL query. And just with a few simple commands you can start to do some actual analysis of hockey stats. Remember these are just the basics of all these commands. With experience you'll be able to slice and dice the data however you want. Stil with the basics learned here you can do a lot of research and start to get the hang of how SQL queries are structured.

Sources

Postgres Aggregate Functions

List Rows after Specific Date in SQL

List All Columns of Table