Python Data Essentials - Pandas

A data type equivalent to super-charged spreadsheets.

Posted by Craig Johnston on Sunday, June 17, 2018

Pandas brings Python a data type equivalent to super-charged spreadsheets. It adds two highly expressive data structures to Python: Series and DataFrame. Pandas Series and DataFrames provide performant analysis and manipulation of “relational” or “labeled” data, similar to the tables of a relational database like MySQL or the rows and columns of Excel. Pandas is also great for working with time series data, arbitrary matrix data, and unlabeled data.

Pandas leverages NumPy, and if you are not familiar with this fundamental library for working with numbers, I suggest you take a look at Python Data Essentials - NumPy to get a decent footing.

If you want to go beyond this brief overview of Pandas, I suggest starting with the official Pandas documentation.

Getting Started

In this article, I’ll be working with Pandas version 0.22.0. If you are running a newer version, there is a possibility of interfaces changing or functionality being deprecated or replaced. In these cases, a quick review of the official documentation should suffice.

!conda list pandas
# packages in environment at /Users/cjimti/anaconda3:
#
# Name                    Version                   Build  Channel
pandas                    0.22.0           py36h0a44026_0  

Series

Series data structures support integer- and label-based indexing.

One-dimensional ndarray with axis labels (including time series).

  • Series official documentation
import pandas as pd

fields = ['Name','CPU', 'GHz', 'Cores','Ram','Own']
b1 = ['Orange Pi Plus','ARM Cortex-A7',1.536,4,1046,2]
board = pd.Series(data=b1, index = fields)

print(f'Series: \n{board}\n')
print(f'     Shape: {board.shape}')
print(f'Dimensions: {board.ndim}')
print(f'      Size: {board.size}')
Series: 
Name     Orange Pi Plus
CPU       ARM Cortex-A7
GHz               1.536
Cores                 4
Ram                1046
Own                   2
dtype: object

     Shape: (6,)
Dimensions: 1
      Size: 6
# check for label
print(f'Do we have GPU data? {"GPU" in board}')
print(f'Do we have CPU data? {"CPU" in board}')
Do we have GPU data? False
Do we have CPU data? True

Accessing and Deleting Elements

print(f'The {board["Name"]} runs at {board["GHz"]} GHz.')
print(f'The {board[0]} has {board[3]} cores.')
print(f'The {board[0]} has {board[-2]:,} megabytes of ram.')
The Orange Pi Plus runs at 1.536 GHz.
The Orange Pi Plus has 4 cores.
The Orange Pi Plus has 1,046 megabytes of ram.
# select specific elements by label
cc = board[["CPU","Cores"]]

print(f'Series: \n{cc}\n')
print(f'     Shape: {cc.shape}')
print(f'Dimensions: {cc.ndim}')
print(f'      Size: {cc.size}')
Series: 
CPU      ARM Cortex-A7
Cores                4
dtype: object

     Shape: (2,)
Dimensions: 1
      Size: 2
# drop an element by label; returns a new Series (or use inplace=True)
nb = board.drop("Cores")

print(f'Series: \n{nb}\n')
print(f'     Shape: {nb.shape}')
print(f'Dimensions: {nb.ndim}')
print(f'      Size: {nb.size}')
Series: 
Name    Orange Pi Plus
CPU      ARM Cortex-A7
GHz              1.536
Ram               1046
Own                  2
dtype: object

     Shape: (5,)
Dimensions: 1
      Size: 5
inventory = pd.Series([1,3,2],['Orange Pi Plus', 'Raspberry Pi 3', 'Asus Tinker Board'])
print(f'Series: \n{inventory}\n')
Series: 
Orange Pi Plus       1
Raspberry Pi 3       3
Asus Tinker Board    2
dtype: int64
inventory = inventory.add(1)
print(f'Add 1 to all values: \n{inventory}\n')
Add 1 to all values: 
Orange Pi Plus       2
Raspberry Pi 3       4
Asus Tinker Board    3
dtype: int64

NumPy on Series data.

import numpy as np
print(f'Square root of each item: \n{np.sqrt(inventory)}\n')
print(f'Each item to the power of 2: \n{np.power(inventory,2)}\n')
Square root of each item: 
Orange Pi Plus       1.414214
Raspberry Pi 3       2.000000
Asus Tinker Board    1.732051
dtype: float64

Each item to the power of 2: 
Orange Pi Plus        4
Raspberry Pi 3       16
Asus Tinker Board     9
dtype: int64
# double the Orange Pi Plus and Asus Tinker Board counts
inventory[['Orange Pi Plus', 'Asus Tinker Board']] * 2
Orange Pi Plus       4
Asus Tinker Board    6
dtype: int64

Arithmetic on Series Data

containers = ['a','b','c']
items = [1,10,100]

item_containers = pd.Series(index=containers, data=items)

print(f'All: \n{item_containers}\n')
print(f'Greater than 1: \n{item_containers[item_containers > 1]}\n')
All: 
a      1
b     10
c    100
dtype: int64

Greater than 1: 
b     10
c    100
dtype: int64
# add 10 items to a
item_containers = item_containers.add([10,0,0])
print(f'All: \n{item_containers}\n')
All: 
a     11
b     10
c    100
dtype: int64
half_containers = item_containers / 2
print(f'Half: \n{half_containers}\n')
Half: 
a     5.5
b     5.0
c    50.0
dtype: float64

DataFrames

DataFrames are the central feature of Pandas: a dictionary-like, tabular data object.

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

import pandas as pd

Creating

Pandas can create a DataFrame from a NumPy ndarray (structured or homogeneous), a dict, or another DataFrame. When building a DataFrame from a Python dict, the values can be Series, arrays, constants, or list-like objects.

In the example below, I create a dictionary that maps three keys to three varied data sets. The fictional AI machines ‘hal’, ‘johnny 5’ and ‘bender’ all have different attributes, with some overlap. Each dictionary value here is a Pandas Series object; however, any list-like object will do, as sketched after the example below.

ai = {'hal': pd.Series(data = [100, 90], index = ['intellect', 'dangerous']),
      'johnny 5': pd.Series(data = [12, 5, 2], index = ['dangerous','humor','bending']),
      'bender': pd.Series(data = [20, 50, 50, 100], index = ['intellect', 'dangerous', 'humor', 'bending'])}

df_ai = pd.DataFrame(ai)
df_ai

           bender    hal  johnny 5
bending       100    NaN       2.0
dangerous      50   90.0      12.0
humor          50    NaN       5.0
intellect      20  100.0       NaN
print(f'     Shape: {df_ai.shape}')
print(f'Dimensions: {df_ai.ndim}')
print(f'      Size: {df_ai.size}')
print(f'Total NaNs: {df_ai.isnull().sum().sum()}')
print(f'NaN Counts: \n{df_ai.isnull().sum()}\n')

     Shape: (4, 3)
Dimensions: 2
      Size: 12
Total NaNs: 3
NaN Counts: 
bender      0
hal         2
johnny 5    1
dtype: int64
print(f'DataFrame Values: \n{df_ai.values}\n')
DataFrame Values: 
[[ 100.   nan    2.]
 [  50.   90.   12.]
 [  50.   nan    5.]
 [  20.  100.   nan]]
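
As a quick sketch (the boards dict below and its values are invented for illustration), the same constructor happily accepts plain Python lists in place of Series, as long as every column has the same length:

# a DataFrame built from a dict of equal-length lists
boards = pd.DataFrame({'Name': ['Orange Pi Plus', 'Raspberry Pi 3'],
                       'Cores': [4, 4],
                       'Ram': [1024, 1024]})
boards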

Selecting

The mask and where methods are provided by both the Pandas Series and DataFrame data types. See the examples below for some simple value selection using basic arithmetic expressions.

# mask out any data greater than 10
df_ai.mask(df_ai > 10)

           bender  hal  johnny 5
bending       NaN  NaN       2.0
dangerous     NaN  NaN       NaN
humor         NaN  NaN       5.0
intellect     NaN  NaN       NaN
# only return data greater than 10, otherwise NaN
df_ai.where(df_ai > 10)

           bender    hal  johnny 5
bending       100    NaN       NaN
dangerous      50   90.0      12.0
humor          50    NaN       NaN
intellect      20  100.0       NaN
# only return data greater than 10, otherwise 0
df_ai.where(df_ai > 10, 0)

           bender    hal  johnny 5
bending       100    0.0       0.0
dangerous      50   90.0      12.0
humor          50    0.0       0.0
intellect      20  100.0       0.0

Modifying

The AI bots ‘hal’, ‘johnny 5’ and ‘bender’ share some common attributes; however, where they do not, we get NaN (not a number). Running the AI bot data through most math functions would be problematic with non-numbers present. Pandas gives us quite a few options.

There are many options for cleaning this data. I’ll start with removing any rows that contain NaN values. We could make these adjustments with the optional parameter inplace=True if we wanted to modify the DataFrame in place; however, for the sake of the examples it is better to keep the original intact.

# return a frame eliminating rows with NaN values
df_ai_common_rows = df_ai.dropna(axis=0)
df_ai_common_rows

           bender   hal  johnny 5
dangerous      50  90.0      12.0
# return a frame eliminating columns with NaN values
df_ai_common_cols = df_ai.dropna(axis=1)
df_ai_common_cols

           bender
bending       100
dangerous      50
humor          50
intellect      20

Depending on requirements, no data could mean zero in our scale of 0-100. While zero is not a reasonable assumption for our AI bots, it’s an easy data fix:

# fill all NaNs with 0
df_ai.fillna(0)

           bender    hal  johnny 5
bending       100    0.0       2.0
dangerous      50   90.0      12.0
humor          50    0.0       5.0
intellect      20  100.0       0.0
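
The fillna call above returned a new frame and left df_ai untouched. If we actually wanted to mutate the DataFrame, the same methods accept the inplace=True parameter mentioned earlier; here is a minimal sketch on a throwaway copy (df_tmp is just an illustrative name):

# modify a copy in place rather than returning a new frame
df_tmp = df_ai.copy()
df_tmp.fillna(0, inplace=True)
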
# forward fill down each column (axis=0), filling a NaN with the previous row's value
df_ai.fillna(method='ffill', axis=0)

           bender    hal  johnny 5
bending       100    NaN       2.0
dangerous      50   90.0      12.0
humor          50   90.0       5.0
intellect      20  100.0       5.0
# forward fill down each column (axis=0),
# then back fill to catch any leading NaNs
df_ai.fillna(method='ffill', axis=0).fillna(method='bfill', axis=0)

           bender    hal  johnny 5
bending       100   90.0       2.0
dangerous      50   90.0      12.0
humor          50   90.0       5.0
intellect      20  100.0       5.0

Forward filling (ffill) and backfilling (bfill) have far better uses with time-series data. In this case, the fact that hal has a danger rating of 90 should not imply that his bending ability is also 90, but the example clearly illustrates the forward fill and backfill capabilities of the DataFrame’s fillna method.
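
As a hedged illustration (the hourly readings below are made up for this example), forward filling shines when a sensor misses a few samples and the last known value is a reasonable stand-in:

import numpy as np
import pandas as pd

# hypothetical hourly temperature readings with a two-hour gap
times = pd.date_range('2018-06-17', periods=6, freq='H')
temps = pd.Series([21.0, 21.5, np.nan, np.nan, 23.0, 22.5], index=times)

# carry the last known reading forward across the gap
temps.fillna(method='ffill')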

If we needed to make assumptions regarding the ability of this team of AI bots we could assume unknown data could start as an average of known data.

# get the mean across the bot columns (axis=1) for each attribute row
df_ai.mean(axis=1)
bending      51.000000
dangerous    50.666667
humor        27.500000
intellect    60.000000
dtype: float64

The pandas.DataFrame.apply method applies a function along an axis of the DataFrame, axis=1 in the example below. The function passed to pandas.DataFrame.apply receives a row or a column depending on the axis specified; the function below receives rows (because axis=1 is specified), each row is bound to the variable “x”, its NaN values are filled with the row’s mean, and the result is returned.

We could have defined a named function; however, this small operation, x.fillna(x.mean()), is hardly worthy of such attention. Python’s lambdas are one-line, anonymous functions and, when used responsibly, can make the code more compact and more readable at the same time.

clean_df_ai = df_ai.apply(lambda x: x.fillna(x.mean()),axis=1)
clean_df_ai

           bender    hal  johnny 5
bending     100.0   51.0       2.0
dangerous    50.0   90.0      12.0
humor        50.0   27.5       5.0
intellect    20.0  100.0      60.0

Sorting

# order the columns so the bot with the highest intellect comes first
hii = clean_df_ai.sort_values(['intellect'], axis=1, ascending=False)
hii

             hal  johnny 5  bender
bending     51.0       2.0   100.0
dangerous   90.0      12.0    50.0
humor       27.5       5.0    50.0
intellect  100.0      60.0    20.0
print(f'The bot with the highest intelligence is {hii.columns[0]}.')
print(f'The bot with the lowest intelligence is {hii.columns[-1]}.')
The bot with the highest intelligence is hal.
The bot with the lowest intelligence is bender.

I doubt that johnny 5 is more intelligent than bender, but his intellect score was unknown and therefore derived from a mean, so the number is mathematically correct.

I won’t even attempt to scratch the surface of the sorting functions and parameters provided by DataFrames; this article is only intended to give you a taste and get you going.
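
Still, as one more hedged sketch (reusing the clean_df_ai frame from above), rows can be sorted by a single bot’s scores just as easily as columns were sorted by an attribute:

# order the attribute rows by bender's scores, highest first
clean_df_ai.sort_values('bender', axis=0, ascending=False)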

Math

Use NumPy to perform any number of arithmetic operations on the values of a DataFrame. I suggest you take a look at my article Python Data Essentials - NumPy for an overview of that compelling data science library.
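
As a quick sketch (again reusing clean_df_ai from above), NumPy’s universal functions operate element-wise on a DataFrame’s values while preserving its row and column labels:

import numpy as np

# element-wise square root of every score; labels are preserved
np.sqrt(clean_df_ai)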

Essential Python 3

A lot of data science is done in Jupyter Notebooks, and libraries like NumPy make developing reports and documenting numerical processes much easier. However, if you are a software developer like me, this code eventually needs to run in a script on a server, in Amazon’s Lambda Function Handler, or even in kubeless on a custom Kubernetes cluster.

Check out my article on Essential Python 3 for a clean boilerplate script template to get you going.

This blog post, titled: "Python Data Essentials - Pandas: A data type equivalent to super-charged spreadsheets." by Craig Johnston, is licensed under a Creative Commons Attribution 4.0 International License.
