Ok if you’ve read the first tutorial on installing Postgres and getting your first database up and running, you’re now wondering “So how do I access all this data?” Well that’s where this tutorial comes in as we are going to cover some of the basic commands in SQL to get the data we want. Ok let’s get started and dive right in to our data.

SELECT

The SELECT command is going to be your workhorse command when working with SQL. It’s the command that tells Postgres to pull the data from the table and return it to you so you can use it. The SELECT is always paired with a FROM clause so you can tell Postgres exactly what table you want to select your data. Let’s take a look at the command I used at the end of the first tutorial:


SELECT * FROM appearances LIMIT 10;

Ok let’s break it down piece by piece. SELECT tells Postgres that we want to pull data. The * tells Postgres that we want the data from every column in the table. You can choose whatever columns you want the data from and it doesn’t have to be all of them, but this example we are taking all of them. Next is the FROM clause that tells Postgres we want to pull all the columns from the appearances table.

If you have multiple tables you would change the name of the table to whatever table you wanted data from. In order to get a list of all the tables you can select form type \dt and hit enter and Postgres will list all the tables in the public schema. The LIMIT clause while technically not needed for a SELECT query is very important here. If we had just entered in the command without using the LIMIT Postgres would have returned every single row in the table and you almost never want to do that when just trying to look at the data in a table. So I added the LIMIT clause and passed it the number 1 telling Postgres to limit the results to 10 rows. Pretty simple eh?

Ok lets do a couple other SELECT queries to get the hang of it. First though lets get a list of the columns in our table with the \d+ appearances which will produce and output similar to this:

      Column   |       Type       | Collation | Nullable | Default | Storage  | Stats target | Description
    -----------+------------------+-----------+----------+---------+----------+--------------+-------------
     yearID    | bigint           |           |          |         | plain    |              |
     teamID    | text             |           |          |         | extended |              |
     lgID      | text             |           |          |         | extended |              |
     playerID  | text             |           |          |         | extended |              |
     G_all     | bigint           |           |          |         | plain    |              |
     GS        | double precision |           |          |         | plain    |              |
     G_batting | bigint           |           |          |         | plain    |              |
     G_defense | double precision |           |          |         | plain    |              |
     G_p       | bigint           |           |          |         | plain    |              |
     G_c       | bigint           |           |          |         | plain    |              |
     G_1b      | bigint           |           |          |         | plain    |              |
     G_2b      | bigint           |           |          |         | plain    |              |
     G_3b      | bigint           |           |          |         | plain    |              |
     G_ss      | bigint           |           |          |         | plain    |              |
     G_lf      | bigint           |           |          |         | plain    |              |
     G_cf      | bigint           |           |          |         | plain    |              |
     G_rf      | bigint           |           |          |         | plain    |              |
     G_of      | bigint           |           |          |         | plain    |              |
     G_dh      | double precision |           |          |         | plain    |              |
     G_ph      | double precision |           |          |         | plain    |              |
     G_pr      | double precision |           |          |         | plain    |              |

This table in the database lists the number of appearances at each position for each player, each season and each team. G_1b is games at first base and so on. If you need more info about the data in the database the data dictionary can be found here

So now we have some columns lets adjust our SELECT statement to just include a few of them instead of all of them:

Side note here. When I was creating this database file for you I forgot that Postgres only stores column names in lower case. As you can see above it does recognize upper case when displaying the column. The problem comes when you want to select a column. If you tried to execute SELECT playerID from appearances limit 10; you would get an error saying HINT: Perhaps you meant to reference the column "appearances.playerID". But even if you added appearances to your column selection you would still get the same error. THe solution is to add "" around each column name so that the column is searched for on a case sensitive basis. But that’s annoying so lets fix it.

Logout of your current account with \q and log back in with the superuser account that is the same name as your computer user name and run this code:

\t on
select 'ALTER TABLE '||'"'||table_name||'"'||' RENAME COLUMN '||'"'||column_name||'"'||' TO ' || lower(column_name)||';'
from information_schema.columns
where table_schema = 'public' and lower(column_name) != column_name
\g /tmp/go_to_lower
\i /tmp/go_to_lower

Copy this and paste it into the terminal and hit enter and all your column names will be in lower case. I put this in the tutorial for a very good reason. I could have gone back and fixed the file and changed a couple things and made it look like I’m perfect, but I’m not. Mistakes happen. A lot. Things are going break and a vital skill is to learn how to fix them. In this case I literally googled ‘convert all columnames to lowercase postgres’ and the answer was the second link on google.

Do I know exactly what that code does? Nope but what better place to learn than in a situation with no ramifications for it being wrong. Now knowing what to google comes with experience, but a good place to start is to just cut and paste the error into google and start clicking. This is what every professional programmer does and the ones that say they don’t are lying. Ok back to the tutorial.


SELECT playerid, teamid, g_1b, g_2b, g_cf FROM appearances LIMIT 10;

Which will produce an output similar to this:

playerid  | teamid | g_1b | g_2b | g_cf
-----------+--------+------+------+------
 abercda01 | TRO    |    0 |    0 |    0
 addybo01  | RC1    |    0 |   22 |    0
 allisar01 | CL1    |    0 |    2 |   29
 allisdo01 | WS3    |    0 |    0 |    0
 ansonca01 | RC1    |    1 |    2 |    0
 armstbo01 | FW1    |    0 |    0 |   11
 barkeal01 | RC1    |    0 |    0 |    0
 barnero01 | BS1    |    0 |   16 |    0
 barrebi01 | FW1    |    0 |    0 |    0
 barrofr01 | BS1    |    0 |    1 |    0
(10 rows)

You can play around with the columns as much as you want and see the different outputs. So now we're getting some more info that is something we can work with but nothing we can actually draw insights from. The next clause though will help us narrow our data even more.

WHERE

So above we learned how to limit the columns the SQL query returns so now lets learn how to limit the rows that are returned. This is done in SQL by using the WHERE clause with your SELECT. WHERE gives your an SQL query a condition that it checks every row against in the table and if the condition is met it returns the row and if it isn't then the row is excluded. Let's look at it in action: SELECT playerid, teamid, g_1b, g_2b, g_cf from appearances WHERE yearid = 2016 limit 10;

Now there is a order in which clauses go where in the SQL queries. Generally its SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT. So when playing around with your queries if you don't stick to the order Postgres will throw an error. Obviously you don't know what all of those clauses do, but just keep in mind for now that WHERE clauses will go before `LIMIT` clauses. So if you ran command above this should be your output:

 yearid | playerid  | teamid | g_1b | g_2b | g_cf
--------+-----------+--------+------+------+------
   2016 | abadfe01  | BOS    |    0 |    0 |    0
   2016 | abadfe01  | MIN    |    0 |    0 |    0
   2016 | abreujo02 | CHA    |  152 |    0 |    0
   2016 | achteaj01 | LAA    |    0 |    0 |    0
   2016 | ackledu01 | NYA    |   13 |    1 |    0
   2016 | adamecr01 | COL    |    0 |   11 |    0
   2016 | adamsau01 | CLE    |    0 |    0 |    0
   2016 | adamsma01 | SLN    |   86 |    0 |    0
   2016 | adlemti01 | CIN    |    0 |    0 |    0
   2016 | adriaeh01 | SFN    |    0 |    7 |    0
(10 rows)

This has returned all the players that had an appearance in the 2016 baseball season. You can also chain these commands as well using logical switches such as AND or OR like so:

SELECT playerid, teamid, g_1b, g_2b, g_cf from appearances WHERE yearid = 2016 AND g_1b > 100;

This will give us all the rows where the yearid is 2016 and games played at first base is over 100 games. There are many more ways to combine these as well to help select the data that you are looking for but this is the basics. There will be a lot of trial and error at the beginning as you get used to it so don't despair and just keep at it.

Aggregate Functions

Ok so we've learned how to select the appropriate columns and rows that we want from our data, but how do we look at the overall picture? How do we look at say the total goals scored or assists earned over a season or even a certain time frame? This is done with our powerful aggregate functions and the GROUP BY.

As you can see with the data each row is an individual game for an individual player so just using WHEREwon't tell you the whole story. So let's try and see if we can calculate who had the most hits last season in the MLB. SELECT playerid, sum(h) FROM batting WHERE yearid=2017 GROUP BY playerid ORDER BY sum(h) DESC LIMIT 10; Which gives us this output: playerid | sum -----------+----- blackch02 | 213 altuvjo01 | 204 inciaen01 | 201 gordode01 | 201 hosmeer01 | 192 ozunama01 | 191 andruel01 | 191 abreujo02 | 189 lemahdj01 | 189 arenano01 | 187

Well that column name isn't very helpful. However there is a simple way to change that using the AS command in your query: SELECT playerid, sum(h) AS Hits FROM batting WHERE yearid=2017 GROUP BY playerid ORDER BY sum(h) DESC LIMIT 10;

And now your output looks much nicer:

 playerid  | hits
-----------+------
 blackch02 |  213
 altuvjo01 |  204
 inciaen01 |  201
 gordode01 |  201
 hosmeer01 |  192
 ozunama01 |  191
 andruel01 |  191
 abreujo02 |  189
 lemahdj01 |  189
 arenano01 |  187
(10 rows)

All AS does is just create an alias for the column. You can do this with any column you want along with tables as well. It just creates another name for something that you can refer to later on in the query. In many newer flavors of SQL you can leave out the `AS` and just put a name next to the column, but I wanted to show the `AS` incase something breaks if you do that. This won't work with WHERE statements because the WHERE clause references the raw data and your alias is created after the fact.

A couple other notes on this query. We have to do the GROUP BY because this table has a row for each player for each team they played for in the season. A lot of players do play for the same team all season but a lot don’t. If you didn’t do the group by you would miss out on those other team hits for that player. The next is the ORDER BY this tells the database how to sort the results in this case by hits descending, i.e. from largest to smallest, with the DESC keyword. When you alias the sum(h) column you can just using ORDER BY hits next time. ORDER BY will sort ascending by default unless you tell it otherwise

There several aggregate functions in SQL with the most common being sum, avg, min, max, count. Respectively those add the values, average the values, find the minimum value, the maximum value, and last counts the number of values i.e. the of rows in a column.

The next and most important part of an aggregate function is the GROUP BY clause. This tells Postgres how we want to calculate the aggregate functions. If we left out the player column then the functions would just sum the total goals and assists of every player after the date we gave it. GROUP BY though allows us to tell Postgres that we want to group the sums by each player. GROUP BY doesn't just work for players we can use it on the teamid column as well or any other categorical column in your database. In this one that is just mainly our teams and players, but if you were working with another database with different categorical columns GROUP BY would work just as well on those with calculating aggregate stats. If you have some experinece with pandas or R dataframes GROUP BY in SQL works basically the same as .groupby and group_by respectively in those langauges.

Our next tutorial will cover JOIN so you can finall tell which players are actually accumulating the stats you are calculating in you queries.

Sources

Postgres Aggregate Functions

List Rows after Specific Date in SQL

List All Columns of Table