Pandas - Intro

  • PANel DAta
  • Library suite that has classes that facilitates data structures like a spreadsheet or a table
  • It's the main structure used in all machine learning and deep learning dataset
  • Pandas provides two important data structures, namely Series and DataFrame

1. How to install Pandas and import?

from shell/command prompt run one of the following commands

1.1 npm

npm install pandas

or

1.2 conda

conda install pandas

In [1]:
import numpy as np
import pandas as pd # pd is the alias used

2. Series

  • Series = One dimensional array
  • Equivalent of a column in an excel sheet
  • Values in the Series are indexed
  • Index values are hashable - meaning their values remain constant (eg. Integer, String etc)

2.1 Create a Series

In [2]:
# Using primitive array

fruits = ['apple', 'orange', 'mango']
sr1 = pd.Series(fruits)
print(sr1)
0     apple
1    orange
2     mango
dtype: object
In [3]:
# Using numpy array

colors = np.array(['Green', 'Orange', 'Pink', 'Blue']) # ndarray
sr2 = pd.Series(colors)
print(sr2)
0     Green
1    Orange
2      Pink
3      Blue
dtype: object

2.2 Series operations

In [4]:
# Get operation
print ('Element at index 1 =', sr2.get(1))
Element at index 1 = Orange
In [5]:
# Searching for a value
sr2[sr2 == 'Blue'].index[0]
Out[5]:
3
In [6]:
# add method - appends the word color behind the exsting array elements
sr2 = sr2.add(' color')
sr2
Out[6]:
0     Green color
1    Orange color
2      Pink color
3      Blue color
dtype: object
In [7]:
sr2.drop(2) # Unless to save it back to sr2 it will not be in effect or use inplace=True
Out[7]:
0     Green color
1    Orange color
3      Blue color
dtype: object
In [8]:
sr2.append( pd.Series(['Yellow']), ignore_index = True)
Out[8]:
0     Green color
1    Orange color
2      Pink color
3      Blue color
4          Yellow
dtype: object

3. DataFrame

  • Similar to a spreadsheet with each column represented by a Series
  • Each column will have a title if provided, else initiated with integers
  • Each row has an index if provided, else initiated with integer

3.1 Creating a DataFrame

In [9]:
# Create a simple DataFrame with one column

names = ['Albert', 'Amir', 'Max', 'Ram']

df1 = pd.DataFrame(names)
print(names)
print('-------------------------------------')
df1
['Albert', 'Amir', 'Max', 'Ram']
-------------------------------------
Out[9]:
0
0 Albert
1 Amir
2 Max
3 Ram
In [10]:
# Create a spreahsheet like DataFrame from dictionary

del names # removes the previous instance of a variable

names     = [ 'Adam', 'Anwar', 'Basheer', 'Priya']
countries = [ 'Nigeria', 'UAE', 'KSA', 'Singapore' ]
major     = [ 'Economics', 'Computer Science', 'Accountancy', 'Mathematics' ]

students = {
    'Student_Name': names,
    'Student_Country': countries,
    'Degree_major': major
}

df_students = pd.DataFrame(students)
df_students
Out[10]:
Student_Name Student_Country Degree_major
0 Adam Nigeria Economics
1 Anwar UAE Computer Science
2 Basheer KSA Accountancy
3 Priya Singapore Mathematics

3.2 DataFrame operations

3.2.1 Column extraction

In [11]:
# Extract a column and its data type

print ( df_students['Student_Name'] )
print ( '-----------------------------------' )
print ( 'Data type of the column: ', type(df_students['Student_Name']) )
0       Adam
1      Anwar
2    Basheer
3      Priya
Name: Student_Name, dtype: object
-----------------------------------
Data type of the column:  <class 'pandas.core.series.Series'>

3.2.2 Index and Column

In [12]:
# Create new index for the spreadsheet

rollno_array = [ '12001', '12002', '12003', '12004' ] # lets create an array of hypothetical roll numbers
column_title = [ 'Name', 'Country', 'SUbject' ]       # lets create an array of column labels

df_students.index = rollno_array
df_students.columns = column_title
df_students
Out[12]:
Name Country SUbject
12001 Adam Nigeria Economics
12002 Anwar UAE Computer Science
12003 Basheer KSA Accountancy
12004 Priya Singapore Mathematics

3.3 DataFrame from excel/csv

In [13]:
# Reading a dataframe from excel or csv

df_iris = pd.read_csv('iris.csv') # to read xls use pd.read_excel()
df_iris.head()
Out[13]:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 6.5 3.2 5.1 2.0 Iris-virginica
1 2 6.1 2.8 4.0 1.3 Iris-versicolor
2 3 5.1 3.5 1.4 0.3 Iris-setosa
3 4 6.4 3.1 5.5 1.8 Iris-virginica
4 5 6.7 3.1 4.7 1.5 Iris-versicolor

3.4 Aggregation and Set operations

Data preparation

In [14]:
# Create a city population dataframe
cities  = [ 'Tokyo', 'Jakarta', 'Delhi', 'Manila', 'Seoul', 'Shanghai', 'Karachi', 'Beijing', 'Mumbai', 'Chongqing' ]
ct_ctry = [ 'JA', 'ID', 'IN', 'PH', 'KR', 'CN', 'PK', 'CN', 'IN', 'CN' ]
pop_mil = [ 37.84, 30.53, 24.99, 24.13, 23.48, 23.41, 22.12, 21.00, 17.7, 15.7 ]

city_df = pd.DataFrame( { 'City': cities, 'Country_Code': ct_ctry, 'Population(Mil)': pop_mil } )

city_df
Out[14]:
City Country_Code Population(Mil)
0 Tokyo JA 37.84
1 Jakarta ID 30.53
2 Delhi IN 24.99
3 Manila PH 24.13
4 Seoul KR 23.48
5 Shanghai CN 23.41
6 Karachi PK 22.12
7 Beijing CN 21.00
8 Mumbai IN 17.70
9 Chongqing CN 15.70
In [15]:
# Create a country master dataframe
country = [ 'Japan', 'Indonesia', 'India', 'Philippines', 'S.Korea', 'China', 'Pakistan', 'Singapore', 'Malaysia' ]
ctry_cd = [ 'JA', 'ID', 'IN', 'PH', 'KR', 'CN', 'PK', 'SG', 'MY' ]

ctry_df = pd.DataFrame( { 'Country_Code': ctry_cd, 'Country_Name': country } )

ctry_df
Out[15]:
Country_Code Country_Name
0 JA Japan
1 ID Indonesia
2 IN India
3 PH Philippines
4 KR S.Korea
5 CN China
6 PK Pakistan
7 SG Singapore
8 MY Malaysia

3.4.1 Group by operation

In [16]:
city_df.groupby(by='Country_Code').sum()
Out[16]:
Population(Mil)
Country_Code
CN 60.11
ID 30.53
IN 42.69
JA 37.84
KR 23.48
PH 24.13
PK 22.12
In [17]:
city_df.groupby(by='Country_Code').count().iloc[:,0]
Out[17]:
Country_Code
CN    3
ID    1
IN    2
JA    1
KR    1
PH    1
PK    1
Name: City, dtype: int64

3.4.2 Join operation

In [18]:
city_df.join(ctry_df, lsuffix='_city', rsuffix='_ctry')  # how='inner', -- Default is outer join
# Appends the columns based on index key
Out[18]:
City Country_Code_city Population(Mil) Country_Code_ctry Country_Name
0 Tokyo JA 37.84 JA Japan
1 Jakarta ID 30.53 ID Indonesia
2 Delhi IN 24.99 IN India
3 Manila PH 24.13 PH Philippines
4 Seoul KR 23.48 KR S.Korea
5 Shanghai CN 23.41 CN China
6 Karachi PK 22.12 PK Pakistan
7 Beijing CN 21.00 SG Singapore
8 Mumbai IN 17.70 MY Malaysia
9 Chongqing CN 15.70 NaN NaN

3.4.3 Merge operation

In [19]:
pd.merge(city_df, ctry_df, on='Country_Code', sort=True, how='outer')
Out[19]:
City Country_Code Population(Mil) Country_Name
0 Shanghai CN 23.41 China
1 Beijing CN 21.00 China
2 Chongqing CN 15.70 China
3 Jakarta ID 30.53 Indonesia
4 Delhi IN 24.99 India
5 Mumbai IN 17.70 India
6 Tokyo JA 37.84 Japan
7 Seoul KR 23.48 S.Korea
8 NaN MY NaN Malaysia
9 Manila PH 24.13 Philippines
10 Karachi PK 22.12 Pakistan
11 NaN SG NaN Singapore