Python Data Essentials - Pandas - IMTI

Pandas bring Python a data type equivalent to super-charged spreadsheets. Pandas add two highly expressive data structures to Python, Series and DataFrame. Pandas Series and DataFrames provide a performant analysis and manipulation of “relational” or “labeled” data similar to relational database tables like MySQL or the rows and columns of Excel. Pandas are great for working with time series data as well as arbitrary matrix data, and unlabeled data.

Pandas leverage NumPy and if you are not familiar with this fundamental library for working with numbers, then I suggest you take a look at Python Data Essentials - NumPy to get a decent footing.

Support this blog! Buy my new book:

Advanced Platform Development with Kubernetes

What You'll Learn

Build data pipelines with MQTT, NiFi, Logstash, MinIO, Hive, Presto, Kafka and Elasticsearch
Leverage Serverless ETL with OpenFaaS
Explore Blockchain networking with Ethereum
Support a multi-tenant Data Science platform with JupyterHub, MLflow and Seldon Core
Build a Multi-cloud, Hybrid cluster, securely bridging on-premise and cloud-based Kubernetes nodes

If you want to go beyond this brief overview of Pandas I suggest the following resources:

10 Minutes to pandas - Official
Official Tutorials
Data Analysis with Python and Pandas Tutorial Introduction - Video

Getting Started

In this article, I’ll be working with Pandas version 0.22.0. If you are running a newer version, there is a possibility of interfaces changing or functionality being deprecated or replaced. I these cases a quick review of the official documentation should suffice.

!conda list pandas

# packages in environment at /Users/cjimti/anaconda3:
#
# Name                    Version                   Build  Channel
pandas                    0.22.0           py36h0a44026_0

Series

Series data structures support integer and label based indexing.

One-dimensional ndarray with axis labels (including time series).

Series official documentation

import pandas as pd

fields = ['Name','CPU', 'GHz', 'Cores','Ram','Own']
b1 = ['Orange Pi Plus','ARM Cortex-A7',1.536,4,1046,2]
board = pd.Series(data=b1, index = fields)

print(f'Series: \n{board}\n')
print(f'     Shape: {board.shape}')
print(f'Dimensions: {board.ndim}')
print(f'      Size: {board.size}')

Series: 
Name     Orange Pi Plus
CPU       ARM Cortex-A7
GHz               1.536
Cores                 4
Ram                1046
Own                   2
dtype: object

     Shape: (6,)
Dimensions: 1
      Size: 6

# check for label
print(f'Do we have GPU data? {"GPU" in board}')
print(f'Do we have CPU data? {"CPU" in board}')

Do we have GPU data? False
Do we have CPU data? True

Accessing and Deleting Elements

pandas.DataFrame.iloc - explicit selection by integer-index
pandas.DataFrame.loc - explicit selection by label
pandas.Series.drop - return Series with specified index labels removed.
Indexing and Selecting Data official documentation.

print(f'The {board["Name"]} runs at {board["GHz"]} GHz.')
print(f'The {board[0]} has {board[3]} cores.')
print(f'The {board[0]} has {board[-1]:,} megabytes of ram.')

The Orange Pi Plus runs at 1.536 GHz.
The Orange Pi Plus has 4 cores.
The Orange Pi Plus has 2 megabytes of ram.

# select specific columns
cc = board[["CPU","Cores"]]

print(f'Series: \n{cc}\n')
print(f'     Shape: {cc.shape}')
print(f'Dimensions: {cc.ndim}')
print(f'      Size: {cc.size}')

Series: 
CPU      ARM Cortex-A7
Cores                4
dtype: object

     Shape: (2,)
Dimensions: 1
      Size: 2

# remove a column return or inplace=True
nb = board.drop("Cores")

print(f'Series: \n{nb}\n')
print(f'     Shape: {nb.shape}')
print(f'Dimensions: {nb.ndim}')
print(f'      Size: {nb.size}')

Series: 
Name    Orange Pi Plus
CPU      ARM Cortex-A7
GHz              1.536
Ram               1046
Own                  2
dtype: object

     Shape: (5,)
Dimensions: 1
      Size: 5

inventory = pd.Series([1,3,2],['Orange Pi Plus', 'Raspberry Pi 3', 'Asus Tinker Board'])
print(f'Series: \n{inventory}\n')

Series: 
Orange Pi Plus       1
Raspberry Pi 3       3
Asus Tinker Board    2
dtype: int64

inventory = inventory.add(1)
print(f'Add 1 to all values: \n{inventory}\n')

Add 1 to all values: 
Orange Pi Plus       2
Raspberry Pi 3       4
Asus Tinker Board    3
dtype: int64

NumPy on Series data.

import numpy as np

print(f'Square root of each item: \n{np.sqrt(inventory)}\n')
print(f'Each item to the power of 2: \n{np.power(inventory,2)}\n')

Square root of each item: 
Orange Pi Plus       1.414214
Raspberry Pi 3       2.000000
Asus Tinker Board    1.732051
dtype: float64

Each item to the power of 2: 
Orange Pi Plus        4
Raspberry Pi 3       16
Asus Tinker Board     9
dtype: int64

# Orange Pi Plus and Asus Tinker Boards
inventory[['Orange Pi Plus', 'Asus Tinker Board']] * 2

Orange Pi Plus       4
Asus Tinker Board    6
dtype: int64

Arithmetic on Series Data

containers = ['a','b','c']
items = [1,10,100]

item_containers = pd.Series(index=containers, data=items)

print(f'All: \n{item_containers}\n')
print(f'Greater than 1: \n{item_containers[item_containers > 1]}\n')

All: 
a      1
b     10
c    100
dtype: int64

Greater than 1: 
b     10
c    100
dtype: int64

# add 10 items to a
item_containers = item_containers.add([10,0,0])
print(f'All: \n{item_containers}\n')

All: 
a     11
b     10
c    100
dtype: int64

half_containers = item_containers / 2
print(f'Half: \n{half_containers}\n')

Half: 
a     5.5
b     5.0
c    50.0
dtype: float64

DataFrames

DataFrames are the central feature of Pandas, a dictionary like data object.

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

pandas.DataFrame official documentation.

import pandas as pd

Creating

Pandas can create a DataFrame from a NumPy ndarray (structured or homogeneous), dict, or another DataFrame. A DataFrame with a Python Dict can contain Series, arrays, constants, or list-like objects.

In the example below I create a dictionary that maps three indexes to three varied data sets. The fictional AI machines ‘hal’, ‘johnny 5’ and ‘bender’ all have different attributes with some overlap. Each of the dictionary keys contains a Pandas Series object. However, they may contain any list-like objects.

ai = {'hal': pd.Series(data = [100, 90], index = ['intellect', 'dangerous']),
      'johnny 5': pd.Series(data = [12, 5, 2], index = ['dangerous','humor','bending']),
      'bender': pd.Series(data = [20, 50, 50, 100], index = ['intellect', 'dangerous', 'humor', 'bending'])}

df_ai = pd.DataFrame(ai)
df_ai

	bender	hal	johnny 5
bending	100	NaN	2.0
dangerous	50	90.0	12.0
humor	50	NaN	5.0
intellect	20	100.0	NaN

print(f'     Shape: {df_ai.shape}')
print(f'Dimensions: {df_ai.ndim}')
print(f'      Size: {df_ai.size}')
print(f'Total NaNs: {df_ai.isnull().sum().sum()}')
print(f'NaN Counts: \n{df_ai.isnull().sum()}\n')

     Shape: (4, 3)
Dimensions: 2
      Size: 12
Total NaNs: 3
NaN Counts: 
bender      0
hal         2
johnny 5    1
dtype: int64

print(f'DataFrame Values: \n{df_ai.values}\n')

DataFrame Values: 
[[ 100.   nan    2.]
 [  50.   90.   12.]
 [  50.   nan    5.]
 [  20.  100.   nan]]

Selecting

The methods mask and where are provided by Panda’s Series and DataFrame data types. See the examples below for some simple examples of value selection using basic arithmatic expressions.

# mask out any data greater than 10
df_ai.mask(df_ai > 10)

	bender	hal	johnny 5
bending	NaN	NaN	2.0
dangerous	NaN	NaN	NaN
humor	NaN	NaN	5.0
intellect	NaN	NaN	NaN

# only return data greater than 10, otherwise NaN
df_ai.where(df_ai > 10)

	bender	hal	johnny 5
bending	100	NaN	NaN
dangerous	50	90.0	12.0
humor	50	NaN	NaN
intellect	20	100.0	NaN

# only return data greater than 10, otherwise 0
df_ai.where(df_ai > 10, 0)

	bender	hal	johnny 5
bending	100	0.0	0.0
dangerous	50	90.0	12.0
humor	50	0.0	0.0
intellect	20	100.0	0.0

Modifying

The AI bots ‘hal’, ‘johnny 5’ and ‘bender’ share some common attributes, however where they do not, we get nan (not a number). Running the AI bot data through any math function would be problematic with the existence of non-numbers. Pandas give us quite a few options.

There are many options for cleaning this data. I’ll start with removing any rows that contain nan values. We can make these adjustments with the optional parameter inplace=True if we wanted to modify the DataFrame in place, however for the sake of examples it is better to keep the original in-tact.

# return a frame eliminating rows with NaN values
df_ai_common_rows = df_ai.dropna(axis=0)
df_ai_common_rows

	bender	hal	johnny 5
dangerous	50	90.0	12.0

# return a frame eliminating rows with NaN values
df_ai_common_cols = df_ai.dropna(axis=1)
df_ai_common_cols

	bender
bending	100
dangerous	50
humor	50
intellect	20

Depending on requirements, no data could mean zero in our scale of 0-100. While zero is not a reasonable assumption for our AI bots, it’s an easy data fix:

# fill all NaNs with 0
df_ai.fillna(0)

	bender	hal	johnny 5
bending	100	0.0	2.0
dangerous	50	90.0	12.0
humor	50	0.0	5.0
intellect	20	100.0	0.0

# forward fill rows with previous column (axis=0) data 
df_ai.fillna(method='ffill', axis=0)

	bender	hal	johnny 5
bending	100	NaN	2.0
dangerous	50	90.0	12.0
humor	50	90.0	5.0
intellect	20	100.0	5.0

# forward fill rows with previous column (axis=0) data 
# then back fill
df_ai.fillna(method='ffill', axis=0).fillna(method='bfill', axis=0)

	bender	hal	johnny 5
bending	100	90.0	2.0
dangerous	50	90.0	12.0
humor	50	90.0	5.0
intellect	20	100.0	5.0

Forward (ffill) and backfilling (bfill) have far better uses in time-series data. In this case, hal having a danger rating of 90 should not assume that his bending ability would be 90 as well, but this example clearly illustrates the forward and backfilling capabilities of DataFrame’s fillna method.

If we needed to make assumptions regarding the ability of this team of AI bots we could assume unknown data could start as an average of known data.

# get the mean of data for each attribute row by column (axis=1)
df_ai.mean(axis=1)

bending      51.000000
dangerous    50.666667
humor        27.500000
intellect    60.000000
dtype: float64

pandas.DataFrame.apply method applies the return value of a function along an axis of DataFrame, axis=1 in the example below. The function given to pandas.DataFrame.apply is passed the row or column depending the axis specified, the function below receives rows (because axis=1 is specified) and assigns each row to the variable “x” in which the method “mean” is called and resulting data returned from the function.

We could have defied a named function; however this small opperation x.fillna(x.mean()) is hardly worthy of such attention. Python’s lambdas are one line, anonymous functions, and then used responsibly, can make the code more compact and readable at the same time.

clean_df_ai = df_ai.apply(lambda x: x.fillna(x.mean()),axis=1)
clean_df_ai

	bender	hal	johnny 5
bending	100.0	51.0	2.0
dangerous	50.0	90.0	12.0
humor	50.0	27.5	5.0
intellect	20.0	100.0	60.0

Sorting

# order the columns by ai bot with the highest intellect
hii = clean_df_ai.sort_values(['intellect'], axis=1, ascending=False)
hii

	hal	johnny 5	bender
bending	51.0	2.0	100.0
dangerous	90.0	12.0	50.0
humor	27.5	5.0	50.0
intellect	100.0	60.0	20.0

print(f'The bot with the highest intelligence is {hii.columns[0]}.')
print(f'The bot with the lowest intelligence is {hii.columns[-1]}.')

The bot with the highest intelligence is hal.
The bot with the lowest intelligence is bender.

I doubt that johnny 5 is more intelligent than bender but his data was unknown and therefore derived by using a mean, so the score is mathematically correct.

I won’t attempt even to scratch the surface of sorting functions and their parameters provided by DataFrames. This article is only intended to give you a taste and get you going.

Math

Use NumPy to perform any number of arithmetic operations on the values of a DataFrames. I suggest you take a look at my article Python Data Essentials - Pandas for an overview of this compelling data science library.

Essential Python 3

A lot of data science in done in Jupyter Notbooks and libraries like NumPy make developing reports or documenting numerical processes. However if you a software developer like me, this code needs to run in a script on a server, in Amazon’s Lambda Function Handler or even kubeless in a custom kubernetes cluster.

Check out my article on Essential Python 3 for a clean boilerplate script template to get you going.

This blog post, titled: "Python Data Essentials - Pandas: A data type equivalent to super-charged spreadsheets." by Craig Johnston, is licensed under a Creative Commons Attribution 4.0 International License.

SUPPORT

Order my new Kubernetes book: Advanced Platform Development with Kubernetes: Enabling Data Management, the Internet of Things, Blockchain, and Machine Learning

Python Data Essentials - Pandas

A data type equivalent to super-charged spreadsheets.