SQL Basics with Pandas

13 minute read

Recently, I was compelled by my own curiosity to study SQL, a language I have heard about quite a lot but never had a chance to study. At first, SQL sounded difficult and foreign largely because it was a language fundamentally different from other programming languages I had studied, such as Java, Python, or R. However, after watching this fantastic video tutorial on YouTube, and completing a relevant course on DataCamp, I think I now finally have a somewhat concrete understanding of what SQL is and how to use it. Of course, I’m still so far away from being fluent in SQL, and the queries I can write are still pretty basic. Much like the blog post on R, this post will serve as a reference for myself.

Note: This notebook was drafted in January of 2020, but I never had a chance to finish it. While working on some R tutorial notebooks on the dplyr package, I was reminded of this draft and decided to finally publish it. Hopefully the editorial discontinuity does not affect the quality of the writing and content of this article.

SQL with Jupyter

There are many different ways of using and accessing SQL from Jupyter Notebooks. Here, I introduce two simple ways of practicing SQL without much complicated setup.

Magic Command

The first on the list is ipython-sql, which allows us to use magic commands in Jupyter notebooks. To install, simply type the following line in the terminal, assuming that you have activated the conda virtual environment of choice.

conda install ipython-sql

We can now use the load_ext sql magic command in Jupyter to connect to a local database. In my case I had a MySQL database initialized at localhost, and was able to connect to it as a root user. Note that you should replace some_password in the example command below according to your own configuration.

%load_ext sql
%sql mysql+pymysql://root:some_password@localhost:3306/test
'Connected: root@test'

Now that we have successfully connected to the database, we can use SQL commands in Jupyter!

%sql SELECT * FROM employee ORDER BY sex, first_name, last_name LIMIT 5;
 * mysql+pymysql://root:***@localhost:3306/test
5 rows affected.
emp_id first_name last_name birth_day sex salary super_id branch_id
103 Angela Martin 1971-06-25 F 63000 102 2
101 Jan Levinson 1961-05-11 F 110000 100 1
104 Kelly Kapoor 1980-02-05 F 55000 102 2
107 Andy Bernard 1973-07-22 M 65000 106 3
100 David Wallace 1967-11-17 M 250000 None 1

This method works, but it requires that you set up a MySQL server on your local workstation. While that is not particularly difficult, the method is made somewhat less compelling by the fact that it does not work right out of the box. I therefore prefer the method I would like to introduce next.
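For comparison, Python's standard-library sqlite3 module gives you a real SQL database with no server process and no credentials. The sketch below is a minimal, hypothetical stand-in for the employee table from the example above (only a few of its columns and rows are reproduced here):

```python
import sqlite3

# An in-memory SQLite database: no server setup, no connection string.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (emp_id INTEGER, first_name TEXT, sex TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (100, "David", "M", 250000),
        (101, "Jan", "F", 110000),
        (103, "Angela", "F", 63000),
    ],
)

# The same kind of ORDER BY query as in the MySQL example above.
rows = conn.execute(
    "SELECT first_name, salary FROM employee ORDER BY sex, first_name"
).fetchall()
print(rows)  # [('Angela', 63000), ('Jan', 110000), ('David', 250000)]
conn.close()
```

This is handy for quick experiments, though for the rest of this post we will stick with pandasql, which hides this plumbing entirely.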

PandaSQL

pandas is an incredibly widely used Python module for dealing with tabular data. It shares some similarities with SQL in that both deal with tables at the highest level. Of course, the two serve very different purposes: SQL is intended as a backend language, powering huge database servers and allowing developers to quickly query large amounts of data. pandas, on the other hand, is a must-have in the Python data scientist's toolbox, allowing them to extract new insights from organized tabular data.

pandasql is a Python module that allows us to query pandas.DataFrames using SQL syntax. In other words, it is a great way to learn SQL. The benefit of this approach is that no database setup is necessary: as long as there is some tabular data to work with, say some .csv file, we are ready to go. For the purposes of this post, we will thus be using this latter approach.

With all that said, let’s get started.

import pandas as pd
from pandasql import sqldf

Select Syntax

In this section, we will go over some basic core SQL statements to get our feet wet. It would be utterly impossible for me to cover SQL syntax in any level of detail in a single blog post, but this is a start nonetheless. At the minimum, I hope to continue this series as I start learning more SQL. The main references used to write this post were this excellent Medium article and the official documentation on the pandas website.

Let’s begin by loading a sample toy dataset so that we have some data at our disposal.

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Limit

Let’s first see a simple example of SELECT in action, along with LIMIT.

sqldf("SELECT sepal_length, petal_width FROM iris LIMIT 5;")
sepal_length petal_width
0 5.1 0.2
1 4.9 0.2
2 4.7 0.2
3 4.6 0.2
4 5.0 0.2

One of the perks of SQL is that it reads somewhat like plain English rather than complicated computer code. Of course, SQL statements can get quite complex, at which point this rule starts to break down. However, it isn’t too difficult to see what the statement above is doing: it selects the columns sepal_length and petal_width from the iris table we loaded, and shows only the top five results in accordance with the LIMIT.

We can also replicate the output of iris.head() by doing the following.

sqldf("SELECT * FROM iris LIMIT 5;")
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

The * is essentially a wildcard that tells SQL we want information pulled from every column instead of a specified few. This can be handy when we want to take a quick glimpse at the contents of a table.

Distinct

LIMIT is not the only addition we can make to a SELECT statement. For instance, consider the keyword DISTINCT, which does exactly what you think it does:

sqldf("SELECT DISTINCT species FROM iris;")
species
0 setosa
1 versicolor
2 virginica

As you can see, DISTINCT allows us to select only the unique values in a table. Note that pandas offers a similar function, unique, with which we can recreate a similar result.

iris.species.unique()
array(['setosa', 'versicolor', 'virginica'], dtype=object)

Where

Another useful fact to remember is that SELECT most often goes along with WHERE. We can imagine many instances where we would want to retrieve only those data entries that satisfy a certain condition. In the example below, we retrieve only those data entries whose species are labeled as setosa.

sqldf('''SELECT petal_length 
         FROM iris 
         WHERE species = 'setosa' 
         LIMIT 5;''')
petal_length
0 1.4
1 1.4
2 1.3
3 1.5
4 1.4

In pandas speak, we would have done the following:

iris[iris.species == 'setosa'].petal_length.head(5)
0    1.4
1    1.4
2    1.3
3    1.5
4    1.4
Name: petal_length, dtype: float64

The pandas version is not too difficult just yet, but I prefer SQL’s resemblance to plain human language. Just for the sake of it, let’s take a look at a slightly more complicated condition we can build with WHERE, namely by chaining multiple conditions. In this example, we select petal_width and petal_length for only those entries whose species is setosa and whose sepal_width is smaller than 5 (this number is entirely arbitrary).

sqldf('''SELECT petal_width, petal_length
         FROM iris
         WHERE species = 'setosa'
         AND sepal_width < 5
         LIMIT 5;''')
petal_width petal_length
0 0.2 1.4
1 0.2 1.4
2 0.2 1.3
3 0.2 1.5
4 0.2 1.4

All we did there was join the two conditions via the AND keyword. In pandas, this is made slightly more confusing by the fact that we use slicing to make multi-column selections.

iris[(iris.species == 'setosa') & (iris.sepal_width < 5)][['petal_width', 'petal_length']][:5]
petal_width petal_length
0 0.2 1.4
1 0.2 1.4
2 0.2 1.3
3 0.2 1.5
4 0.2 1.4

And by the same token, the OR SQL keyword translates into | in pandas.
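To sketch the SQL side of this (pandasql in fact executes its queries against SQLite behind the scenes), here is a minimal example using Python's built-in sqlite3 module with a few hand-typed rows mimicking the iris data; the 1.0 threshold is an arbitrary choice for illustration:

```python
import sqlite3

# A few hand-typed iris-like rows; a toy sample, not the real dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE iris (sepal_width REAL, petal_width REAL, species TEXT)")
conn.executemany(
    "INSERT INTO iris VALUES (?, ?, ?)",
    [
        (3.5, 0.2, "setosa"),
        (3.2, 1.4, "versicolor"),
        (3.3, 2.5, "virginica"),
    ],
)

# OR keeps a row if either condition holds: here, both non-setosa rows match.
rows = conn.execute(
    "SELECT species FROM iris WHERE species = 'virginica' OR petal_width > 1.0"
).fetchall()
print(rows)  # [('versicolor',), ('virginica',)]
```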

Instead of sticking [:5] at the end, we could have used .head() as we have been doing so far. It isn’t difficult to see that introducing more conditionals can easily result in somewhat longer statements in Python, whereas that is not necessarily the case with SQL. This is not to say that pandas is inferior or poorly optimized; it simply goes to show that the two platforms have their own comparative advantages and that they mainly serve different purposes.

Sort

Often, when sorting through some tabular data, we want to arrange the entries in ascending or descending order along some axis. For example, we might want to rearrange the entries so that the one with the largest petal_width comes first. Let’s see how we can achieve this with SQL.

sqldf('''SELECT *
         FROM iris
         ORDER BY petal_width DESC
         LIMIT 10;''')
sepal_length sepal_width petal_length petal_width species
0 6.3 3.3 6.0 2.5 virginica
1 7.2 3.6 6.1 2.5 virginica
2 6.7 3.3 5.7 2.5 virginica
3 5.8 2.8 5.1 2.4 virginica
4 6.3 3.4 5.6 2.4 virginica
5 6.7 3.1 5.6 2.4 virginica
6 6.4 3.2 5.3 2.3 virginica
7 7.7 2.6 6.9 2.3 virginica
8 6.9 3.2 5.7 2.3 virginica
9 7.7 3.0 6.1 2.3 virginica

By default, the ORDER BY keyword in SQL lists values in ascending order. To reverse this, we can explicitly add the DESC keyword. We see that the entries with the largest petal_width are indeed at the top of the query result.
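To see that default concretely, here is a tiny sqlite3 sketch (toy numbers, not the actual iris values) comparing the implicit ascending order with an explicit DESC:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x REAL)")
conn.executemany("INSERT INTO t VALUES (?)", [(2.4,), (2.5,), (2.3,)])

# ORDER BY sorts ascending unless DESC is given.
asc = [r[0] for r in conn.execute("SELECT x FROM t ORDER BY x")]
desc = [r[0] for r in conn.execute("SELECT x FROM t ORDER BY x DESC")]
print(asc, desc)  # [2.3, 2.4, 2.5] [2.5, 2.4, 2.3]
```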

We can also achieve a similar result in pandas.

iris.sort_values('petal_width', ascending=False).head(10)
sepal_length sepal_width petal_length petal_width species
100 6.3 3.3 6.0 2.5 virginica
109 7.2 3.6 6.1 2.5 virginica
144 6.7 3.3 5.7 2.5 virginica
114 5.8 2.8 5.1 2.4 virginica
140 6.7 3.1 5.6 2.4 virginica
136 6.3 3.4 5.6 2.4 virginica
148 6.2 3.4 5.4 2.3 virginica
115 6.4 3.2 5.3 2.3 virginica
145 6.7 3.0 5.2 2.3 virginica
143 6.8 3.2 5.9 2.3 virginica

In this case, I think pandas also offers a simple, clean interface for accessing data. One point to note is that SQL and pandas share the same default when it comes to ordering or sorting entries: by default, ascending=True. It is also interesting that the SQL result shows no row indices, since we never set them up, whereas pandas automatically keeps track of each row’s original position and displays it in the query result.

Is X

I decided to jam-pack this last section with a bunch of somewhat similar commands: namely, isin(), isna(), and notna(). These commands are loosely related to each other, which is why they are grouped under this section. Speaking of groups, we will continue our discussion of SQL and pandas in another post, starting with things like groupby() and GROUP BY. Anyhow, let’s begin by taking a look at isin().

In

In SQL, we can make selections based on whether an entry falls into a certain category. For instance, we might want to select data points only for setosas and virginicas. In that case, we might use the following SQL statement.

sqldf('''SELECT *
         FROM iris
         WHERE species IN ('virginica', 'setosa');''')
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
95 6.7 3.0 5.2 2.3 virginica
96 6.3 2.5 5.0 1.9 virginica
97 6.5 3.0 5.2 2.0 virginica
98 6.2 3.4 5.4 2.3 virginica
99 5.9 3.0 5.1 1.8 virginica

100 rows × 5 columns

To demonstrate that we have both setosas and virginicas, I decided to avoid using LIMIT. The resulting table has 100 rows and five columns. Let’s see if we can replicate this result in pandas using isin().

iris[iris.species.isin(['virginica', 'setosa'])]
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

100 rows × 5 columns

As expected, we also get a 100-by-5 table containing only setosas and virginicas.

It is worth noting that this was not the smartest way to go about the problem; we could instead have used negative boolean indexing, telling pandas to pull every data point except those pertaining to versicolors. For example,

iris[~(iris.species == 'versicolor')]
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

100 rows × 5 columns

In development settings, we would of course use the negative boolean indexing approach shown immediately above, but for demonstration purposes, it helps to see how isin() can be used to model IN.
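For what it's worth, SQL has its own negation as well: the NOT IN keyword, which plays the same role as the ~ operator did above. Here is a minimal sqlite3 sketch with a few made-up rows (a toy sample, not the real dataset):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE iris (petal_width REAL, species TEXT)")
conn.executemany(
    "INSERT INTO iris VALUES (?, ?)",
    [(0.2, "setosa"), (1.4, "versicolor"), (2.5, "virginica")],
)

# NOT IN excludes the listed categories, like ~ combined with isin() in pandas.
rows = conn.execute(
    "SELECT species FROM iris WHERE species NOT IN ('versicolor')"
).fetchall()
print(rows)  # [('setosa',), ('virginica',)]
```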

N/A

In SQL, empty values are encoded as NULL. We can perform selection based on whether or not there is a NULL entry in a row. This functionality is particularly important to preprocess data, which might be fed into some machine learning model.

First, observe that the current iris data does not have any NULL values.

sum(iris.isna().any())
0

Therefore, let’s introduce a dummy row with a missing value for the purposes of this demonstration. There are many ways to go about this: we could assign a new row by saying iris.loc[-1] = some_row_data, or use pd.concat([iris, dummy_df]). For simplicity, though, let’s just overwrite the last row in place so that its species is missing.

iris.iloc[-1] = [5.9, 3.0, 5.1, 1.8, None]

Now that we have this dummy row, let’s see what we can do with SQL. In fact, the syntax is not so much different from what we’ve been doing so far.

sqldf('''SELECT *
         FROM iris
         WHERE species IS NULL;''')
sepal_length sepal_width petal_length petal_width species
0 5.9 3.0 5.1 1.8 None

The only new part is IS NULL, which checks whether a given attribute or column is NULL. Again, this is one of those instances showing that SQL statements read somewhat like plain English. pandas, on the other hand, obviously doesn’t flow as naturally, but its syntax is not much more complicated either:

iris[iris.species.isna()]
sepal_length sepal_width petal_length petal_width species
149 5.9 3.0 5.1 1.8 None

Again, this shows that boolean indexing is a huge component of sifting through pandas data frames. A lot of the inspiration behind this API comes from R and its treatment of data frames.

The notna() function does the exact opposite of isna(). Without running it, we already know that substituting isna() with notna() will simply give us all the remaining rows of the iris dataset.
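On the SQL side, the counterparts are IS NULL and IS NOT NULL. A minimal sqlite3 sketch with one deliberately missing species (toy rows, not the full dataset):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE iris (petal_width REAL, species TEXT)")
conn.executemany(
    "INSERT INTO iris VALUES (?, ?)",
    [(0.2, "setosa"), (1.8, None)],  # the second row has a missing species
)

# IS NULL selects rows with a missing value; IS NOT NULL selects the rest.
null_rows = conn.execute("SELECT petal_width FROM iris WHERE species IS NULL").fetchall()
ok_rows = conn.execute("SELECT petal_width FROM iris WHERE species IS NOT NULL").fetchall()
print(null_rows, ok_rows)  # [(1.8,)] [(0.2,)]
```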

Conclusion

This post was intended as an introduction to SQL, but it somehow digressed into a comparison of SQL and pandas syntax. Nonetheless, for those already familiar with one framework, reading this cross-comparison will hopefully give you a more intuitive sense of what the other side of the world looks like. As mentioned earlier, SQL and pandas each have their strengths and weaknesses, so it definitely helps to have both tools under the belt. As you might have noticed, my goal is to eventually gain some level of proficiency in Python, SQL, and R; hence the posts on R lately. It’s interesting to see how different tools can be used to approach the same problem. Better illuminated in the process are the philosophies behind each framework: where they borrowed inspiration from, what values or UX aspects they prioritize, and so on.

I hope you enjoyed reading this post. Catch you in the next one!