Pandas - Intro¶

PANel DAta
Library suite that has classes that facilitates data structures like a spreadsheet or a table
It's the main structure used in all machine learning and deep learning dataset
Pandas provides two important data structures, namely Series and DataFrame

1. How to install Pandas and import?¶

from shell/command prompt run one of the following commands

1.1 npm¶

npm install pandas

or¶
1.2 conda¶
conda install pandas

import numpy as np
import pandas as pd # pd is the alias used

2. Series¶

Series = One dimensional array
Equivalent of a column in an excel sheet
Values in the Series are indexed
Index values are hashable - meaning their values remain constant (eg. Integer, String etc)

2.1 Create a Series¶

# Using primitive array

fruits = ['apple', 'orange', 'mango']
sr1 = pd.Series(fruits)
print(sr1)

0     apple
1    orange
2     mango
dtype: object

# Using numpy array

colors = np.array(['Green', 'Orange', 'Pink', 'Blue']) # ndarray
sr2 = pd.Series(colors)
print(sr2)

0     Green
1    Orange
2      Pink
3      Blue
dtype: object

2.2 Series operations¶

# Get operation
print ('Element at index 1 =', sr2.get(1))

Element at index 1 = Orange

# Searching for a value
sr2[sr2 == 'Blue'].index[0]

3

# add method - appends the word color behind the exsting array elements
sr2 = sr2.add(' color')
sr2

0     Green color
1    Orange color
2      Pink color
3      Blue color
dtype: object

sr2.drop(2) # Unless to save it back to sr2 it will not be in effect or use inplace=True

0     Green color
1    Orange color
3      Blue color
dtype: object

sr2.append( pd.Series(['Yellow']), ignore_index = True)

0     Green color
1    Orange color
2      Pink color
3      Blue color
4          Yellow
dtype: object

3. DataFrame¶

Similar to a spreadsheet with each column represented by a Series
Each column will have a title if provided, else initiated with integers
Each row has an index if provided, else initiated with integer

3.1 Creating a DataFrame¶

# Create a simple DataFrame with one column

names = ['Albert', 'Amir', 'Max', 'Ram']

df1 = pd.DataFrame(names)
print(names)
print('-------------------------------------')
df1

['Albert', 'Amir', 'Max', 'Ram']
-------------------------------------

# Create a spreahsheet like DataFrame from dictionary

del names # removes the previous instance of a variable

names     = [ 'Adam', 'Anwar', 'Basheer', 'Priya']
countries = [ 'Nigeria', 'UAE', 'KSA', 'Singapore' ]
major     = [ 'Economics', 'Computer Science', 'Accountancy', 'Mathematics' ]

students = {
    'Student_Name': names,
    'Student_Country': countries,
    'Degree_major': major
}

df_students = pd.DataFrame(students)
df_students

3.2 DataFrame operations¶

3.2.1 Column extraction¶

# Extract a column and its data type

print ( df_students['Student_Name'] )
print ( '-----------------------------------' )
print ( 'Data type of the column: ', type(df_students['Student_Name']) )

0       Adam
1      Anwar
2    Basheer
3      Priya
Name: Student_Name, dtype: object
-----------------------------------
Data type of the column:  <class 'pandas.core.series.Series'>

3.2.2 Index and Column¶

# Create new index for the spreadsheet

rollno_array = [ '12001', '12002', '12003', '12004' ] # lets create an array of hypothetical roll numbers
column_title = [ 'Name', 'Country', 'SUbject' ]       # lets create an array of column labels

df_students.index = rollno_array
df_students.columns = column_title
df_students

3.3 DataFrame from excel/csv¶

# Reading a dataframe from excel or csv

df_iris = pd.read_csv('iris.csv') # to read xls use pd.read_excel()
df_iris.head()

3.4 Aggregation and Set operations¶

Data preparation¶

# Create a city population dataframe
cities  = [ 'Tokyo', 'Jakarta', 'Delhi', 'Manila', 'Seoul', 'Shanghai', 'Karachi', 'Beijing', 'Mumbai', 'Chongqing' ]
ct_ctry = [ 'JA', 'ID', 'IN', 'PH', 'KR', 'CN', 'PK', 'CN', 'IN', 'CN' ]
pop_mil = [ 37.84, 30.53, 24.99, 24.13, 23.48, 23.41, 22.12, 21.00, 17.7, 15.7 ]

city_df = pd.DataFrame( { 'City': cities, 'Country_Code': ct_ctry, 'Population(Mil)': pop_mil } )

city_df

# Create a country master dataframe
country = [ 'Japan', 'Indonesia', 'India', 'Philippines', 'S.Korea', 'China', 'Pakistan', 'Singapore', 'Malaysia' ]
ctry_cd = [ 'JA', 'ID', 'IN', 'PH', 'KR', 'CN', 'PK', 'SG', 'MY' ]

ctry_df = pd.DataFrame( { 'Country_Code': ctry_cd, 'Country_Name': country } )

ctry_df

3.4.1 Group by operation¶

city_df.groupby(by='Country_Code').sum()

city_df.groupby(by='Country_Code').count().iloc[:,0]

Country_Code
CN    3
ID    1
IN    2
JA    1
KR    1
PH    1
PK    1
Name: City, dtype: int64

3.4.2 Join operation¶

city_df.join(ctry_df, lsuffix='_city', rsuffix='_ctry')  # how='inner', -- Default is outer join
# Appends the columns based on index key

3.4.3 Merge operation¶

pd.merge(city_df, ctry_df, on='Country_Code', sort=True, how='outer')

	Id	SepalLengthCm	SepalWidthCm	PetalLengthCm	PetalWidthCm	Species
0	1	6.5	3.2	5.1	2.0	Iris-virginica
1	2	6.1	2.8	4.0	1.3	Iris-versicolor
2	3	5.1	3.5	1.4	0.3	Iris-setosa
3	4	6.4	3.1	5.5	1.8	Iris-virginica
4	5	6.7	3.1	4.7	1.5	Iris-versicolor

	Student_Name	Student_Country	Degree_major
0	Adam	Nigeria	Economics
1	Anwar	UAE	Computer Science
2	Basheer	KSA	Accountancy
3	Priya	Singapore	Mathematics

	Name	Country	SUbject
12001	Adam	Nigeria	Economics
12002	Anwar	UAE	Computer Science
12003	Basheer	KSA	Accountancy
12004	Priya	Singapore	Mathematics

	City	Country_Code	Population(Mil)
0	Tokyo	JA	37.84
1	Jakarta	ID	30.53
2	Delhi	IN	24.99
3	Manila	PH	24.13
4	Seoul	KR	23.48
5	Shanghai	CN	23.41
6	Karachi	PK	22.12
7	Beijing	CN	21.00
8	Mumbai	IN	17.70
9	Chongqing	CN	15.70

	Country_Code	Country_Name
0	JA	Japan
1	ID	Indonesia
2	IN	India
3	PH	Philippines
4	KR	S.Korea
5	CN	China
6	PK	Pakistan
7	SG	Singapore
8	MY	Malaysia

	Population(Mil)
Country_Code
CN	60.11
ID	30.53
IN	42.69
JA	37.84
KR	23.48
PH	24.13
PK	22.12

	0
0	Albert
1	Amir
2	Max
3	Ram