Pandas bring Python a data type equivalent to super-charged spreadsheets. Pandas add two highly expressive data structures to Python, Series and DataFrame. Pandas Series and DataFrames provide a performant analysis and manipulation of “relational” or “labeled” data similar to relational database tables like MySQL or the rows and columns of Excel. Pandas are great for working with time series data as well as arbitrary matrix data, and unlabeled data.
Pandas leverage NumPy and if you are not familiar with this fundamental library for working with numbers, then I suggest you take a look at Python Data Essentials - NumPy to get a decent footing.
Advanced Platform Development with Kubernetes
What You'll Learn
- Build data pipelines with MQTT, NiFi, Logstash, MinIO, Hive, Presto, Kafka and Elasticsearch
- Leverage Serverless ETL with OpenFaaS
- Explore Blockchain networking with Ethereum
- Support a multi-tenant Data Science platform with JupyterHub, MLflow and Seldon Core
- Build a Multi-cloud, Hybrid cluster, securely bridging on-premise and cloud-based Kubernetes nodes
If you want to go beyond this brief overview of Pandas I suggest the following resources:
- 10 Minutes to pandas - Official
- Official Tutorials
- Data Analysis with Python and Pandas Tutorial Introduction - Video
Getting Started
In this article, I’ll be working with Pandas version 0.22.0. If you are running a newer version, there is a possibility of interfaces changing or functionality being deprecated or replaced. I these cases a quick review of the official documentation should suffice.
!conda list pandas
# packages in environment at /Users/cjimti/anaconda3:
#
# Name Version Build Channel
pandas 0.22.0 py36h0a44026_0
Series
Series data structures support integer and label based indexing.
One-dimensional ndarray with axis labels (including time series).
- Series official documentation
import pandas as pd
fields = ['Name','CPU', 'GHz', 'Cores','Ram','Own']
b1 = ['Orange Pi Plus','ARM Cortex-A7',1.536,4,1046,2]
board = pd.Series(data=b1, index = fields)
print(f'Series: \n{board}\n')
print(f' Shape: {board.shape}')
print(f'Dimensions: {board.ndim}')
print(f' Size: {board.size}')
Series:
Name Orange Pi Plus
CPU ARM Cortex-A7
GHz 1.536
Cores 4
Ram 1046
Own 2
dtype: object
Shape: (6,)
Dimensions: 1
Size: 6
# check for label
print(f'Do we have GPU data? {"GPU" in board}')
print(f'Do we have CPU data? {"CPU" in board}')
Do we have GPU data? False
Do we have CPU data? True
Accessing and Deleting Elements
- pandas.DataFrame.iloc - explicit selection by integer-index
- pandas.DataFrame.loc - explicit selection by label
- pandas.Series.drop - return Series with specified index labels removed.
- Indexing and Selecting Data official documentation.
print(f'The {board["Name"]} runs at {board["GHz"]} GHz.')
print(f'The {board[0]} has {board[3]} cores.')
print(f'The {board[0]} has {board[-1]:,} megabytes of ram.')
The Orange Pi Plus runs at 1.536 GHz.
The Orange Pi Plus has 4 cores.
The Orange Pi Plus has 2 megabytes of ram.
# select specific columns
cc = board[["CPU","Cores"]]
print(f'Series: \n{cc}\n')
print(f' Shape: {cc.shape}')
print(f'Dimensions: {cc.ndim}')
print(f' Size: {cc.size}')
Series:
CPU ARM Cortex-A7
Cores 4
dtype: object
Shape: (2,)
Dimensions: 1
Size: 2
# remove a column return or inplace=True
nb = board.drop("Cores")
print(f'Series: \n{nb}\n')
print(f' Shape: {nb.shape}')
print(f'Dimensions: {nb.ndim}')
print(f' Size: {nb.size}')
Series:
Name Orange Pi Plus
CPU ARM Cortex-A7
GHz 1.536
Ram 1046
Own 2
dtype: object
Shape: (5,)
Dimensions: 1
Size: 5
inventory = pd.Series([1,3,2],['Orange Pi Plus', 'Raspberry Pi 3', 'Asus Tinker Board'])
print(f'Series: \n{inventory}\n')
Series:
Orange Pi Plus 1
Raspberry Pi 3 3
Asus Tinker Board 2
dtype: int64
inventory = inventory.add(1)
print(f'Add 1 to all values: \n{inventory}\n')
Add 1 to all values:
Orange Pi Plus 2
Raspberry Pi 3 4
Asus Tinker Board 3
dtype: int64
NumPy on Series data.
import numpy as np
print(f'Square root of each item: \n{np.sqrt(inventory)}\n')
print(f'Each item to the power of 2: \n{np.power(inventory,2)}\n')
Square root of each item:
Orange Pi Plus 1.414214
Raspberry Pi 3 2.000000
Asus Tinker Board 1.732051
dtype: float64
Each item to the power of 2:
Orange Pi Plus 4
Raspberry Pi 3 16
Asus Tinker Board 9
dtype: int64
# Orange Pi Plus and Asus Tinker Boards
inventory[['Orange Pi Plus', 'Asus Tinker Board']] * 2
Orange Pi Plus 4
Asus Tinker Board 6
dtype: int64
Arithmetic on Series Data
containers = ['a','b','c']
items = [1,10,100]
item_containers = pd.Series(index=containers, data=items)
print(f'All: \n{item_containers}\n')
print(f'Greater than 1: \n{item_containers[item_containers > 1]}\n')
All:
a 1
b 10
c 100
dtype: int64
Greater than 1:
b 10
c 100
dtype: int64
# add 10 items to a
item_containers = item_containers.add([10,0,0])
print(f'All: \n{item_containers}\n')
All:
a 11
b 10
c 100
dtype: int64
half_containers = item_containers / 2
print(f'Half: \n{half_containers}\n')
Half:
a 5.5
b 5.0
c 50.0
dtype: float64
DataFrames
DataFrames are the central feature of Pandas, a dictionary like data object.
Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
- pandas.DataFrame official documentation.
import pandas as pd
Creating
Pandas can create a DataFrame from a NumPy ndarray (structured or homogeneous), dict, or another DataFrame. A DataFrame with a Python Dict can contain Series, arrays, constants, or list-like objects.
In the example below I create a dictionary that maps three indexes to three varied data sets. The fictional AI machines ‘hal’, ‘johnny 5’ and ‘bender’ all have different attributes with some overlap. Each of the dictionary keys contains a Pandas Series object. However, they may contain any list-like objects.
ai = {'hal': pd.Series(data = [100, 90], index = ['intellect', 'dangerous']),
'johnny 5': pd.Series(data = [12, 5, 2], index = ['dangerous','humor','bending']),
'bender': pd.Series(data = [20, 50, 50, 100], index = ['intellect', 'dangerous', 'humor', 'bending'])}
df_ai = pd.DataFrame(ai)
df_ai
bender | hal | johnny 5 | |
---|---|---|---|
bending | 100 | NaN | 2.0 |
dangerous | 50 | 90.0 | 12.0 |
humor | 50 | NaN | 5.0 |
intellect | 20 | 100.0 | NaN |
print(f' Shape: {df_ai.shape}')
print(f'Dimensions: {df_ai.ndim}')
print(f' Size: {df_ai.size}')
print(f'Total NaNs: {df_ai.isnull().sum().sum()}')
print(f'NaN Counts: \n{df_ai.isnull().sum()}\n')
Shape: (4, 3)
Dimensions: 2
Size: 12
Total NaNs: 3
NaN Counts:
bender 0
hal 2
johnny 5 1
dtype: int64
print(f'DataFrame Values: \n{df_ai.values}\n')
DataFrame Values:
[[ 100. nan 2.]
[ 50. 90. 12.]
[ 50. nan 5.]
[ 20. 100. nan]]
Selecting
The methods mask and where are provided by Panda’s Series and DataFrame data types. See the examples below for some simple examples of value selection using basic arithmatic expressions.
# mask out any data greater than 10
df_ai.mask(df_ai > 10)
bender | hal | johnny 5 | |
---|---|---|---|
bending | NaN | NaN | 2.0 |
dangerous | NaN | NaN | NaN |
humor | NaN | NaN | 5.0 |
intellect | NaN | NaN | NaN |
# only return data greater than 10, otherwise NaN
df_ai.where(df_ai > 10)
bender | hal | johnny 5 | |
---|---|---|---|
bending | 100 | NaN | NaN |
dangerous | 50 | 90.0 | 12.0 |
humor | 50 | NaN | NaN |
intellect | 20 | 100.0 | NaN |
# only return data greater than 10, otherwise 0
df_ai.where(df_ai > 10, 0)
bender | hal | johnny 5 | |
---|---|---|---|
bending | 100 | 0.0 | 0.0 |
dangerous | 50 | 90.0 | 12.0 |
humor | 50 | 0.0 | 0.0 |
intellect | 20 | 100.0 | 0.0 |
Modifying
The AI bots ‘hal’, ‘johnny 5’ and ‘bender’ share some common attributes, however where they do not, we get nan (not a number). Running the AI bot data through any math function would be problematic with the existence of non-numbers. Pandas give us quite a few options.
There are many options for cleaning this data. I’ll start with removing any rows that contain nan values. We can make these adjustments with the optional parameter inplace=True if we wanted to modify the DataFrame in place, however for the sake of examples it is better to keep the original in-tact.
# return a frame eliminating rows with NaN values
df_ai_common_rows = df_ai.dropna(axis=0)
df_ai_common_rows
bender | hal | johnny 5 | |
---|---|---|---|
dangerous | 50 | 90.0 | 12.0 |
# return a frame eliminating rows with NaN values
df_ai_common_cols = df_ai.dropna(axis=1)
df_ai_common_cols
bender | |
---|---|
bending | 100 |
dangerous | 50 |
humor | 50 |
intellect | 20 |
Depending on requirements, no data could mean zero in our scale of 0-100. While zero is not a reasonable assumption for our AI bots, it’s an easy data fix:
# fill all NaNs with 0
df_ai.fillna(0)
bender | hal | johnny 5 | |
---|---|---|---|
bending | 100 | 0.0 | 2.0 |
dangerous | 50 | 90.0 | 12.0 |
humor | 50 | 0.0 | 5.0 |
intellect | 20 | 100.0 | 0.0 |
# forward fill rows with previous column (axis=0) data
df_ai.fillna(method='ffill', axis=0)
bender | hal | johnny 5 | |
---|---|---|---|
bending | 100 | NaN | 2.0 |
dangerous | 50 | 90.0 | 12.0 |
humor | 50 | 90.0 | 5.0 |
intellect | 20 | 100.0 | 5.0 |
# forward fill rows with previous column (axis=0) data
# then back fill
df_ai.fillna(method='ffill', axis=0).fillna(method='bfill', axis=0)
bender | hal | johnny 5 | |
---|---|---|---|
bending | 100 | 90.0 | 2.0 |
dangerous | 50 | 90.0 | 12.0 |
humor | 50 | 90.0 | 5.0 |
intellect | 20 | 100.0 | 5.0 |
Forward (ffill) and backfilling (bfill) have far better uses in time-series data. In this case, hal
having a danger rating of 90 should not assume that his bending ability would be 90 as well, but this example clearly illustrates the forward and backfilling capabilities of DataFrame’s fillna method.
If we needed to make assumptions regarding the ability of this team of AI bots we could assume unknown data could start as an average of known data.
# get the mean of data for each attribute row by column (axis=1)
df_ai.mean(axis=1)
bending 51.000000
dangerous 50.666667
humor 27.500000
intellect 60.000000
dtype: float64
pandas.DataFrame.apply method applies the return value of a function along an axis of DataFrame, axis=1 in the example below. The function given to pandas.DataFrame.apply is passed the row or column depending the axis specified, the function below receives rows (because axis=1 is specified) and assigns each row to the variable “x” in which the method “mean” is called and resulting data returned from the function.
We could have defied a named function; however this small opperation x.fillna(x.mean()) is hardly worthy of such attention. Python’s lambdas are one line, anonymous functions, and then used responsibly, can make the code more compact and readable at the same time.
clean_df_ai = df_ai.apply(lambda x: x.fillna(x.mean()),axis=1)
clean_df_ai
bender | hal | johnny 5 | |
---|---|---|---|
bending | 100.0 | 51.0 | 2.0 |
dangerous | 50.0 | 90.0 | 12.0 |
humor | 50.0 | 27.5 | 5.0 |
intellect | 20.0 | 100.0 | 60.0 |
Sorting
# order the columns by ai bot with the highest intellect
hii = clean_df_ai.sort_values(['intellect'], axis=1, ascending=False)
hii
hal | johnny 5 | bender | |
---|---|---|---|
bending | 51.0 | 2.0 | 100.0 |
dangerous | 90.0 | 12.0 | 50.0 |
humor | 27.5 | 5.0 | 50.0 |
intellect | 100.0 | 60.0 | 20.0 |
print(f'The bot with the highest intelligence is {hii.columns[0]}.')
print(f'The bot with the lowest intelligence is {hii.columns[-1]}.')
The bot with the highest intelligence is hal.
The bot with the lowest intelligence is bender.
I doubt that johnny 5
is more intelligent than bender
but his data was unknown and therefore derived by using a mean, so the score is mathematically correct.
I won’t attempt even to scratch the surface of sorting functions and their parameters provided by DataFrames. This article is only intended to give you a taste and get you going.
Math
Use NumPy to perform any number of arithmetic operations on the values of a DataFrames. I suggest you take a look at my article Python Data Essentials - Pandas for an overview of this compelling data science library.
Essential Python 3
A lot of data science in done in Jupyter Notbooks and libraries like NumPy make developing reports or documenting numerical processes. However if you a software developer like me, this code needs to run in a script on a server, in Amazon’s Lambda Function Handler or even kubeless in a custom kubernetes cluster.
Check out my article on Essential Python 3 for a clean boilerplate script template to get you going.
This blog post, titled: "Python Data Essentials - Pandas: A data type equivalent to super-charged spreadsheets." by Craig Johnston, is licensed under a Creative Commons Attribution 4.0 International License.
SUPPORT
Order my new Kubernetes book: Advanced Platform Development with Kubernetes: Enabling Data Management, the Internet of Things, Blockchain, and Machine Learning