How to do SQL Select and Where Using Python Pandas

Lets start with movie database that I downloaded from Kaggle

In [9]:
import pandas as pd
In [10]:
df = pd.read_csv("movies_metadata.csv")
/home/anaconda3/envs/condapy37/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3049: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [11]:
df.columns
Out[11]:
Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

How to do SELECT, WHERE in pandas dataframe

lets do simple select first

Select first 2 rows

In [12]:
df.head(2)
Out[12]:
adult belongs_to_collection budget genres homepage id imdb_id original_language original_title overview ... release_date revenue runtime spoken_languages status tagline title video vote_average vote_count
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... http://toystory.disney.com/toy-story 862 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... ... 1995-10-30 373554033.0 81.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Toy Story False 7.7 5415.0
1 False NaN 65000000 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... NaN 8844 tt0113497 en Jumanji When siblings Judy and Peter discover an encha... ... 1995-12-15 262797249.0 104.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released Roll the dice and unleash the excitement! Jumanji False 6.9 2413.0

2 rows × 24 columns

Now lets do the SQL equivalent where clause now. Select all the movies where budget is greater than 30000000

In [14]:
df[df['budget'] > 30000000]

We got following error...

TypeError: '>' not supported between instances of 'str' and 'int'

The error is because we have the wrong data type, lets check the data type of budget

In [15]:
df.budget.dtype
Out[15]:
dtype('O')

its an object, so lets convert it to the number

In [16]:
df = df.astype({"budget": int})

We got following error

ValueError: invalid literal for int() with base 10: '/ff9qCepilowshEtG2GYWwzt2bs4.jpg'

In [17]:
pd.to_numeric(df['budget'],errors='coerce').head(2)
Out[17]:
0    30000000.0
1    65000000.0
Name: budget, dtype: float64

We added coerce to insert nan for the values where pd.to_numeric fails to convert, lets check what we get if run isnull()

In [18]:
df[pd.to_numeric(df['budget'],errors='coerce').isnull()]['budget'].head(2)
Out[18]:
19730    /ff9qCepilowshEtG2GYWwzt2bs4.jpg
29503    /zV8bHuSL6WXoD6FWogP9j4x80bL.jpg
Name: budget, dtype: object

Yes indeed we see that budget column contains values which are not even numbers. lets just go ahead with pd.to_numeric command and lets overwrite the budget column

In [19]:
df['budget'] = pd.to_numeric(df['budget'],errors='coerce')

Lets check what happened to our budget value at index number 19730

In [20]:
df.iloc[19730]['budget']
Out[20]:
nan

Yes indeed the value has changed to nan.

Now lets our run SQL equilvalent where command again on the budget

In [21]:
df[df['budget'] > 30000000].head(2)
Out[21]:
adult belongs_to_collection budget genres homepage id imdb_id original_language original_title overview ... release_date revenue runtime spoken_languages status tagline title video vote_average vote_count
1 False NaN 65000000.0 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... NaN 8844 tt0113497 en Jumanji When siblings Judy and Peter discover an encha... ... 1995-12-15 262797249.0 104.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released Roll the dice and unleash the excitement! Jumanji False 6.9 2413.0
5 False NaN 60000000.0 [{'id': 28, 'name': 'Action'}, {'id': 80, 'nam... NaN 949 tt0113277 en Heat Obsessive master thief, Neil McCauley leads a ... ... 1995-12-15 187436818.0 170.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released A Los Angeles Crime Saga Heat False 7.7 1886.0

2 rows × 24 columns

There you go we got the where clause operation working now.

It takes a while to get used to Pandas commands. If you want to still use SQL commands in Pandas , there is a library to do that as well which is pandasql

How to run SQL commands "select" and "where" using pandasql

Lets import the library pandasql first

In [22]:
import pandasql

I got following error...

ModuleNotFoundError: No module named 'pandasql'

Lets install the library first and import again

In [23]:
!pip install pandasql
In [24]:
import pandasql

Lets try SQL select statement now using pandasql now

In [28]:
pandasql.sqldf("SELECT * FROM df LIMIT 1;", globals())
Out[28]:
adult belongs_to_collection budget genres homepage id imdb_id original_language original_title overview ... release_date revenue runtime spoken_languages status tagline title video vote_average vote_count
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000.0 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... http://toystory.disney.com/toy-story 862 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... ... 1995-10-30 373554033.0 81.0 [{'iso_639_1': 'en', 'name': 'English'}] Released None Toy Story 0 7.7 5415.0

1 rows × 24 columns

Ok, now lets try the SQL statement including "where" clause

In [29]:
pandasql.sqldf("SELECT * FROM df where budget > 30000000 LIMIT 1 ;", globals())
Out[29]:
adult belongs_to_collection budget genres homepage id imdb_id original_language original_title overview ... release_date revenue runtime spoken_languages status tagline title video vote_average vote_count
0 False None 65000000.0 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... None 8844 tt0113497 en Jumanji When siblings Judy and Peter discover an encha... ... 1995-12-15 262797249.0 104.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released Roll the dice and unleash the excitement! Jumanji 0 6.9 2413.0

1 rows × 24 columns

There u go we got the results using SQL exact statement in Python Pandas.