Most Frequently Asked Questions: Python Pandas, Part 1

For this exercise, I am using the College.csv data. You can download it from github.com/jstjohn/IntroToStatisticalLearningR-/blob/master/data/College.csv. I will also create dummy dataframes to explain some of the concepts.

In [2]:
import pandas as pd
In [3]:
df = pd.read_csv('College.csv')
In [4]:
df.head(1)
Out[4]:
Unnamed: 0 Private Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
0 Abilene Christian University Yes 1660 1232 721 23 52 2885 537 7440 3300 450 2200 70 78 18.1 12 7041 60

How to rename column in Python Pandas

Let's check whether a column name is missing in our csv file. We can print the header line using the Unix head command.

In [6]:
!head -1 College.csv

Yes, the header for the first column is missing, which is why pandas labeled it Unnamed: 0. Check out https://www.nbshare.io/notebook/58467897/3-Ways-to-Rename-Columns-in-Pandas-DataFrame/ to rename columns in Python Pandas.
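As a minimal sketch of the rename itself, the example below rebuilds the missing-header situation with a tiny in-memory stand-in for College.csv and then renames the column. The name College is a hypothetical choice made for this illustration, not something defined by the dataset.

```python
import io
import pandas as pd

# Tiny stand-in for College.csv: the first header field is empty,
# so pandas labels that column "Unnamed: 0".
csv_text = ",Private,Apps\nAbilene Christian University,Yes,1660\n"
sample = pd.read_csv(io.StringIO(csv_text))
print(list(sample.columns))

# "College" is a hypothetical name chosen for this example.
renamed = sample.rename(columns={"Unnamed: 0": "College"})
print(list(renamed.columns))
```

rename returns a new dataframe, so assign the result back if you want to keep the new name.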

How to copy dataframe in Python Pandas

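Why would I need to make an explicit copy of a dataframe?

Slicing a dataframe in Python Pandas does not necessarily make a separate copy; the result can be a view that refers to the original data. If you then change the slice, the original dataframe may change as well. (This describes pandas without Copy-on-Write. With Copy-on-Write enabled, which is planned as the default in pandas 3.0, a slice behaves like a copy.) Let's do an example.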

In [39]:
df = pd.DataFrame({'name':['John','Evan']})
In [40]:
dfn = df[0:2]
In [41]:
print(dfn)
   name
0  John
1  Evan
In [42]:
dfn.iloc[0,0] = 'Adam'
In [44]:
df
Out[44]:
name
0 Adam
1 Evan

As we see above, our original dataframe has changed. Therefore the correct way is to make a copy first.

In [45]:
df = pd.DataFrame({'name':['John','Evan']})
dfn = df[0:2].copy()
In [46]:
dfn
Out[46]:
name
0 John
1 Evan
In [47]:
dfn.iloc[0,0] = 'Adam'
In [48]:
df
Out[48]:
name
0 John
1 Evan
In [49]:
dfn
Out[49]:
name
0 Adam
1 Evan

As we see above, our original dataframe df did not change, because we made dfn with the .copy() method.

How to create empty dataframe in Python Pandas

In [89]:
dfe = pd.DataFrame([])

How to add columns to an empty dataframe?

In [95]:
dfe = dfe.assign(col1=None,col2=None)
In [96]:
dfe.head()
Out[96]:
col1 col2
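As an alternative sketch, the column names can be passed when the empty dataframe is created, which avoids the separate assign step. The variable name empty is only for illustration.

```python
import pandas as pd

# Name the columns up front instead of adding them with assign().
empty = pd.DataFrame(columns=["col1", "col2"])

print(list(empty.columns))  # the two column names
print(len(empty))           # number of rows
```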

How to append values to empty dataframe?

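Appending to a dataframe is easy. Older pandas versions had a DataFrame.append method, but it was deprecated in pandas 1.4 and removed in pandas 2.0, so use pd.concat instead.

In [105]:
dfe = pd.concat([dfe, pd.DataFrame([{'col1':1,'col2':2}])], ignore_index=True)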
Out[105]:
col1 col2
0 1 2

Remember that although the above command works, it is not memory efficient: it allocates a new dataframe and copies the data every time we append. Don't append to a dataframe inside a loop. The best way is to build the data in a Python list and then call pd.DataFrame once to create the dataframe, as shown below.

In [108]:
data = []
data.append([3,4])
data.append([5,6])
In [109]:
data
Out[109]:
[[3, 4], [5, 6]]

Now create the dataframe using the above data.

In [110]:
dfe = pd.DataFrame(data,columns=['col1','col2'])
In [111]:
dfe.head()
Out[111]:
col1 col2
0 3 4
1 5 6
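The same pattern works inside a loop. The sketch below collects one dictionary per row and builds the dataframe once at the end; the values are made up for illustration.

```python
import pandas as pd

rows = []
for i in range(3):
    # Collect plain Python dicts; no dataframe is touched inside the loop.
    rows.append({"col1": i, "col2": i * 10})

# Build the dataframe once, after the loop has finished.
built = pd.DataFrame(rows)
print(built)
```

With a list of dictionaries, the column names come from the dictionary keys, so the columns argument is not needed.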

How to convert Pandas dataframe to Numpy array

Let's use our previous dataframe dfe for this.

In [112]:
import numpy as np
In [114]:
dfe.to_numpy()
Out[114]:
array([[3, 4],
       [5, 6]])

We can also do it this way.

In [115]:
np.array(dfe)
Out[115]:
array([[3, 4],
       [5, 6]])
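One detail worth knowing is how the array's dtype is chosen. The sketch below, using small made-up dataframes, shows that mixed column types fall back to an object array and that to_numpy accepts a dtype argument to request a specific type.

```python
import pandas as pd

# Mixed column types: the only common type is object.
mixed = pd.DataFrame({"num": [1, 2], "txt": ["a", "b"]})
arr_mixed = mixed.to_numpy()
print(arr_mixed.dtype)

# All-numeric columns can be converted to a requested dtype.
nums = pd.DataFrame({"col1": [3, 5], "col2": [4, 6]})
arr_float = nums.to_numpy(dtype=float)
print(arr_float.dtype)
```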

How to Concat Pandas Dataframe

pd.concat concatenates dataframes along either rows or columns.

In [117]:
df1 = pd.DataFrame({'A':[1,2],'B':[3,4]})
df2 = pd.DataFrame({'C':[1,2],'D':[3,4]})

Let's concatenate df1 and df2 so that the rows are appended.

In [124]:
pd.concat([df1,df2],sort=False)
Out[124]:
A B C D
0 1.0 3.0 NaN NaN
1 2.0 4.0 NaN NaN
0 NaN NaN 1.0 3.0
1 NaN NaN 2.0 4.0

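We see that the result has all four columns, with NaN wherever a dataframe has no value, because the column names don't match in df1 and df2. The integers were also converted to floats so that NaN can be stored.

How about concatenating the dataframes side by side, so that the columns are combined?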

In [125]:
pd.concat([df1,df2],sort=False,axis=1)
Out[125]:
A B C D
0 1 3 1 3
1 2 4 2 4
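Column-wise concatenation lines rows up by index label, not by position. The sketch below uses two made-up dataframes whose indexes only partly overlap to show the default outer join and the join="inner" alternative.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2]}, index=[0, 1])
right = pd.DataFrame({"C": [10, 20]}, index=[1, 2])

# Default (outer) join: every index label is kept, gaps become NaN.
outer = pd.concat([left, right], axis=1)
print(outer)

# Inner join: only labels present in both dataframes are kept.
inner = pd.concat([left, right], axis=1, join="inner")
print(inner)
```

In the earlier df1 and df2 example both indexes were 0 and 1, which is why no NaN values appeared.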

How about concatenating dataframes with the same headers? Let's create a third dataframe with the same headers as df1.

In [126]:
df3 = pd.DataFrame({'A':[56,57],'B':[100,101]})

Let's concatenate df1 and df3 so that the rows are appended.

In [127]:
pd.concat([df1,df3])
Out[127]:
A B
0 1 3
1 2 4
0 56 100
1 57 101

As we see above, the row indexes from the original dataframes are preserved during concatenation. We can discard them and get a fresh incremental index with the option ignore_index=True.

In [128]:
pd.concat([df1,df3],ignore_index=True)
Out[128]:
A B
0 1 3
1 2 4
2 56 100
3 57 101

With pd.concat, we can also create an outer level of hierarchy in the index by passing keys.

In [132]:
dfc = pd.concat([df1,df3],keys=['s1','s2'])
In [133]:
dfc.head()
Out[133]:
       A    B
s1 0   1    3
   1   2    4
s2 0  56  100
   1  57  101

Now we can access the data using the new index keys s1 and s2.
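A minimal sketch of that access, rebuilding df1, df3 and dfc so it runs on its own: .loc with a key returns the whole block, and .loc with a (key, row) tuple reaches a single row. The names first and cell are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df3 = pd.DataFrame({"A": [56, 57], "B": [100, 101]})
dfc = pd.concat([df1, df3], keys=["s1", "s2"])

# All rows that came from df1.
first = dfc.loc["s1"]
print(first)

# Row 1 of the s2 block, column B.
cell = dfc.loc[("s2", 1), "B"]
print(cell)
```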