Pandas bring Python a data type equivalent to super-charged spreadsheets. Pandas add two highly expressive data structures to Python, Series and DataFrame. Pandas Series and DataFrames provide a performant analysis and manipulation of “relational” or “labeled” data similar to relational database tables like MySQL or the rows and columns of Excel. Pandas are great for working with time series data as well as arbitrary matrix data, and unlabeled data.

Pandas leverage NumPy and if you are not familiar with this fundamental library for working with numbers, then I suggest you take a look at Python Data Essentials - NumPy to get a decent footing.

#### Advanced Platform Development with Kubernetes

##### What You'll Learn

- Build data pipelines with
**MQTT, NiFi, Logstash, MinIO, Hive, Presto, Kafka and Elasticsearch** - Leverage
**Serverless**ETL with**OpenFaaS** - Explore
**Blockchain**networking with**Ethereum** - Support a multi-tenant
**Data Science**platform with**JupyterHub, MLflow and Seldon Core** - Build a
**Multi-cloud**,**Hybrid cluster**, securely bridging**on-premise**and cloud-based Kubernetes nodes

If you want to go beyond this brief overview of Pandas I suggest the following resources:

- 10 Minutes to pandas - Official
- Official Tutorials
- Data Analysis with Python and Pandas Tutorial Introduction - Video

## Getting Started

In this article, I’ll be working with Pandas version 0.22.0. If you are running a newer version, there is a possibility of interfaces changing or functionality being deprecated or replaced. I these cases a quick review of the official documentation should suffice.

```
!conda list pandas
```

```
# packages in environment at /Users/cjimti/anaconda3:
#
# Name Version Build Channel
pandas 0.22.0 py36h0a44026_0
```

## Series

Series data structures support integer and label based indexing.

One-dimensional ndarray with axis labels (including time series).

- Series official documentation

```
import pandas as pd
fields = ['Name','CPU', 'GHz', 'Cores','Ram','Own']
b1 = ['Orange Pi Plus','ARM Cortex-A7',1.536,4,1046,2]
board = pd.Series(data=b1, index = fields)
print(f'Series: \n{board}\n')
print(f' Shape: {board.shape}')
print(f'Dimensions: {board.ndim}')
print(f' Size: {board.size}')
```

```
Series:
Name Orange Pi Plus
CPU ARM Cortex-A7
GHz 1.536
Cores 4
Ram 1046
Own 2
dtype: object
Shape: (6,)
Dimensions: 1
Size: 6
```

```
# check for label
print(f'Do we have GPU data? {"GPU" in board}')
print(f'Do we have CPU data? {"CPU" in board}')
```

```
Do we have GPU data? False
Do we have CPU data? True
```

### Accessing and Deleting Elements

- pandas.DataFrame.iloc - explicit selection by integer-index
- pandas.DataFrame.loc - explicit selection by label
- pandas.Series.drop - return Series with specified index labels removed.
- Indexing and Selecting Data official documentation.

```
print(f'The {board["Name"]} runs at {board["GHz"]} GHz.')
print(f'The {board[0]} has {board[3]} cores.')
print(f'The {board[0]} has {board[-1]:,} megabytes of ram.')
```

```
The Orange Pi Plus runs at 1.536 GHz.
The Orange Pi Plus has 4 cores.
The Orange Pi Plus has 2 megabytes of ram.
```

```
# select specific columns
cc = board[["CPU","Cores"]]
print(f'Series: \n{cc}\n')
print(f' Shape: {cc.shape}')
print(f'Dimensions: {cc.ndim}')
print(f' Size: {cc.size}')
```

```
Series:
CPU ARM Cortex-A7
Cores 4
dtype: object
Shape: (2,)
Dimensions: 1
Size: 2
```

```
# remove a column return or inplace=True
nb = board.drop("Cores")
print(f'Series: \n{nb}\n')
print(f' Shape: {nb.shape}')
print(f'Dimensions: {nb.ndim}')
print(f' Size: {nb.size}')
```

```
Series:
Name Orange Pi Plus
CPU ARM Cortex-A7
GHz 1.536
Ram 1046
Own 2
dtype: object
Shape: (5,)
Dimensions: 1
Size: 5
```

```
inventory = pd.Series([1,3,2],['Orange Pi Plus', 'Raspberry Pi 3', 'Asus Tinker Board'])
print(f'Series: \n{inventory}\n')
```

```
Series:
Orange Pi Plus 1
Raspberry Pi 3 3
Asus Tinker Board 2
dtype: int64
```

```
inventory = inventory.add(1)
print(f'Add 1 to all values: \n{inventory}\n')
```

```
Add 1 to all values:
Orange Pi Plus 2
Raspberry Pi 3 4
Asus Tinker Board 3
dtype: int64
```

#### NumPy on Series data.

```
import numpy as np
```

```
print(f'Square root of each item: \n{np.sqrt(inventory)}\n')
print(f'Each item to the power of 2: \n{np.power(inventory,2)}\n')
```

```
Square root of each item:
Orange Pi Plus 1.414214
Raspberry Pi 3 2.000000
Asus Tinker Board 1.732051
dtype: float64
Each item to the power of 2:
Orange Pi Plus 4
Raspberry Pi 3 16
Asus Tinker Board 9
dtype: int64
```

```
# Orange Pi Plus and Asus Tinker Boards
inventory[['Orange Pi Plus', 'Asus Tinker Board']] * 2
```

```
Orange Pi Plus 4
Asus Tinker Board 6
dtype: int64
```

### Arithmetic on Series Data

```
containers = ['a','b','c']
items = [1,10,100]
item_containers = pd.Series(index=containers, data=items)
print(f'All: \n{item_containers}\n')
print(f'Greater than 1: \n{item_containers[item_containers > 1]}\n')
```

```
All:
a 1
b 10
c 100
dtype: int64
Greater than 1:
b 10
c 100
dtype: int64
```

```
# add 10 items to a
item_containers = item_containers.add([10,0,0])
print(f'All: \n{item_containers}\n')
```

```
All:
a 11
b 10
c 100
dtype: int64
```

```
half_containers = item_containers / 2
print(f'Half: \n{half_containers}\n')
```

```
Half:
a 5.5
b 5.0
c 50.0
dtype: float64
```

## DataFrames

DataFrames are the central feature of Pandas, a dictionary like data object.

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

- pandas.DataFrame official documentation.

```
import pandas as pd
```

### Creating

Pandas can create a DataFrame from a NumPy ndarray (structured or homogeneous), dict, or another DataFrame. A DataFrame with a Python Dict can contain Series, arrays, constants, or list-like objects.

In the example below I create a dictionary that maps three indexes to three varied data sets. The fictional AI machines ‘hal’, ‘johnny 5’ and ‘bender’ all have different attributes with some overlap. Each of the dictionary keys contains a Pandas Series object. However, they may contain any list-like objects.

```
ai = {'hal': pd.Series(data = [100, 90], index = ['intellect', 'dangerous']),
'johnny 5': pd.Series(data = [12, 5, 2], index = ['dangerous','humor','bending']),
'bender': pd.Series(data = [20, 50, 50, 100], index = ['intellect', 'dangerous', 'humor', 'bending'])}
df_ai = pd.DataFrame(ai)
df_ai
```

bender | hal | johnny 5 | |
---|---|---|---|

bending | 100 | NaN | 2.0 |

dangerous | 50 | 90.0 | 12.0 |

humor | 50 | NaN | 5.0 |

intellect | 20 | 100.0 | NaN |

```
print(f' Shape: {df_ai.shape}')
print(f'Dimensions: {df_ai.ndim}')
print(f' Size: {df_ai.size}')
print(f'Total NaNs: {df_ai.isnull().sum().sum()}')
print(f'NaN Counts: \n{df_ai.isnull().sum()}\n')
```

```
Shape: (4, 3)
Dimensions: 2
Size: 12
Total NaNs: 3
NaN Counts:
bender 0
hal 2
johnny 5 1
dtype: int64
```

```
print(f'DataFrame Values: \n{df_ai.values}\n')
```

```
DataFrame Values:
[[ 100. nan 2.]
[ 50. 90. 12.]
[ 50. nan 5.]
[ 20. 100. nan]]
```

### Selecting

The methods mask and where are provided by Panda’s Series and DataFrame data types. See the examples below for some simple examples of value selection using basic arithmatic expressions.

```
# mask out any data greater than 10
df_ai.mask(df_ai > 10)
```

bender | hal | johnny 5 | |
---|---|---|---|

bending | NaN | NaN | 2.0 |

dangerous | NaN | NaN | NaN |

humor | NaN | NaN | 5.0 |

intellect | NaN | NaN | NaN |

```
# only return data greater than 10, otherwise NaN
df_ai.where(df_ai > 10)
```

bender | hal | johnny 5 | |
---|---|---|---|

bending | 100 | NaN | NaN |

dangerous | 50 | 90.0 | 12.0 |

humor | 50 | NaN | NaN |

intellect | 20 | 100.0 | NaN |

```
# only return data greater than 10, otherwise 0
df_ai.where(df_ai > 10, 0)
```

bender | hal | johnny 5 | |
---|---|---|---|

bending | 100 | 0.0 | 0.0 |

dangerous | 50 | 90.0 | 12.0 |

humor | 50 | 0.0 | 0.0 |

intellect | 20 | 100.0 | 0.0 |

### Modifying

The AI bots ‘hal’, ‘johnny 5’ and ‘bender’ share some common attributes, however where they do not, we get **nan** (not a number). Running the AI bot data through any math function would be problematic with the existence of non-numbers. Pandas give us quite a few options.

There are many options for cleaning this data. I’ll start with removing any rows that contain **nan** values. We can make these adjustments with the optional parameter **inplace=True** if we wanted to modify the DataFrame in place, however for the sake of examples it is better to keep the original in-tact.

```
# return a frame eliminating rows with NaN values
df_ai_common_rows = df_ai.dropna(axis=0)
df_ai_common_rows
```

bender | hal | johnny 5 | |
---|---|---|---|

dangerous | 50 | 90.0 | 12.0 |

```
# return a frame eliminating rows with NaN values
df_ai_common_cols = df_ai.dropna(axis=1)
df_ai_common_cols
```

bender | |
---|---|

bending | 100 |

dangerous | 50 |

humor | 50 |

intellect | 20 |

Depending on requirements, no data could mean zero in our scale of 0-100. While zero is not a reasonable assumption for our AI bots, it’s an easy data fix:

```
# fill all NaNs with 0
df_ai.fillna(0)
```

bender | hal | johnny 5 | |
---|---|---|---|

bending | 100 | 0.0 | 2.0 |

dangerous | 50 | 90.0 | 12.0 |

humor | 50 | 0.0 | 5.0 |

intellect | 20 | 100.0 | 0.0 |

```
# forward fill rows with previous column (axis=0) data
df_ai.fillna(method='ffill', axis=0)
```

bender | hal | johnny 5 | |
---|---|---|---|

bending | 100 | NaN | 2.0 |

dangerous | 50 | 90.0 | 12.0 |

humor | 50 | 90.0 | 5.0 |

intellect | 20 | 100.0 | 5.0 |

```
# forward fill rows with previous column (axis=0) data
# then back fill
df_ai.fillna(method='ffill', axis=0).fillna(method='bfill', axis=0)
```

bender | hal | johnny 5 | |
---|---|---|---|

bending | 100 | 90.0 | 2.0 |

dangerous | 50 | 90.0 | 12.0 |

humor | 50 | 90.0 | 5.0 |

intellect | 20 | 100.0 | 5.0 |

Forward (ffill) and backfilling (bfill) have far better uses in time-series data. In this case, `hal`

having a danger rating of 90 should not assume that his bending ability would be 90 as well, but this example clearly illustrates the forward and backfilling capabilities of DataFrame’s fillna method.

If we needed to make assumptions regarding the ability of this team of AI bots we could assume unknown data could start as an average of known data.

```
# get the mean of data for each attribute row by column (axis=1)
df_ai.mean(axis=1)
```

```
bending 51.000000
dangerous 50.666667
humor 27.500000
intellect 60.000000
dtype: float64
```

pandas.DataFrame.apply method applies the return value of a function along an axis of DataFrame, axis=1 in the example below. The function given to pandas.DataFrame.apply is passed the row or column depending the axis specified, the function below receives rows (because axis=1 is specified) and assigns each row to the variable “x” in which the method “mean” is called and resulting data returned from the function.

We could have defied a named function; however this small opperation **x.fillna(x.mean())** is hardly worthy of such attention. Python’s lambdas are one line, anonymous functions, and then used responsibly, can make the code more compact and readable at the same time.

```
clean_df_ai = df_ai.apply(lambda x: x.fillna(x.mean()),axis=1)
clean_df_ai
```

bender | hal | johnny 5 | |
---|---|---|---|

bending | 100.0 | 51.0 | 2.0 |

dangerous | 50.0 | 90.0 | 12.0 |

humor | 50.0 | 27.5 | 5.0 |

intellect | 20.0 | 100.0 | 60.0 |

### Sorting

```
# order the columns by ai bot with the highest intellect
hii = clean_df_ai.sort_values(['intellect'], axis=1, ascending=False)
hii
```

hal | johnny 5 | bender | |
---|---|---|---|

bending | 51.0 | 2.0 | 100.0 |

dangerous | 90.0 | 12.0 | 50.0 |

humor | 27.5 | 5.0 | 50.0 |

intellect | 100.0 | 60.0 | 20.0 |

```
print(f'The bot with the highest intelligence is {hii.columns[0]}.')
print(f'The bot with the lowest intelligence is {hii.columns[-1]}.')
```

```
The bot with the highest intelligence is hal.
The bot with the lowest intelligence is bender.
```

I doubt that `johnny 5`

is more intelligent than `bender`

but his data was unknown and therefore derived by using a mean, so the score is mathematically correct.

I won’t attempt even to scratch the surface of sorting functions and their parameters provided by DataFrames. This article is only intended to give you a taste and get you going.

### Math

Use NumPy to perform any number of arithmetic operations on the values of a DataFrames. I suggest you take a look at my article Python Data Essentials - Pandas for an overview of this compelling data science library.

## Essential Python 3

A lot of data science in done in Jupyter Notbooks and libraries like NumPy make developing reports or documenting numerical processes. However if you a software developer like me, this code needs to run in a script on a server, in Amazon’s Lambda Function Handler or even kubeless in a custom kubernetes cluster.

Check out my article on Essential Python 3 for a clean boilerplate script template to get you going.

This blog post, titled: "Python Data Essentials - Pandas: A data type equivalent to super-charged spreadsheets." by Craig Johnston, is licensed under a Creative Commons Attribution 4.0 International License.

##### SUPPORT

Order my new Kubernetes book: Advanced Platform Development with Kubernetes: Enabling Data Management, the Internet of Things, Blockchain, and Machine Learning