Using R-style Data Pipelines in Notebooks
Mutate Pandas data frames simply and elegantly with data pipelines.
import numpy as np
import pandas as pd
The sample data (about OS package deployments) is read into the raw_data
dataframe from a CSV file.
raw_data = pd.read_csv("../assets/data/cmdb-packages.csv", sep=',')
print('♯ of Records: {}\n'.format(len(raw_data)))
for name in raw_data.columns[1:]:
if not name.startswith('Last '):
print(name, '=', list(sorted(set(raw_data[name].fillna('')))))
raw_data.head(3).transpose()
def map_distro(name):
"""Helper to create canonical OS names."""
return (name.split('.', 1)[0]
.replace('Debian 7', 'wheezy')
.replace('Debian 8', 'jessie')
.replace('Debian 9', 'stretch')
.replace('Debian 10', 'buster')
.replace('squeeze', 'Squeeze [6]')
.replace('wheezy', 'Wheezy [7]')
.replace('jessie', 'Jessie [8]')
.replace('stretch', 'Stretch [9]')
.replace('buster', 'Buster [10]')
)
Data Cleaning With Pandas
This code cleans up the imported data using the Pandas API.
To get sensible version statistics, we split off the auxiliary information in the version column (anything after -
), leaving just the upstream part of the version string. The environment classifier is also cleaned up a little, and distributions are mapped to a canonical set of names. Some unused columns are dropped.
Finally, a subset of unique version samples is selected.
data = raw_data
data = data.assign(Version=data['Installed version'].str.split('-', 1, expand=True)[0])
data = data.assign(Environment=data.Environment.fillna('UNDEFINED').str.upper())
data = data.assign(Distribution=data.Distribution.apply(map_distro))
data = data.drop(columns=['CMDB_Id', 'Last seen', 'Last modified', 'Installed version'])
data = data.drop_duplicates(subset='Version', keep='first')
data.transpose()
Data Cleaning With Pipelines
This does the exact same processing as the code above, but is arguably more readable and maintained more easily:
- It has less boilerplate, and makes the use of pipelined processing transparent.
- Each step clearly states what it does to the data.
- When steps are copied into other pipelines, the
X
placeholder ensures you use the data of this pipeline (the code is more DRY)..
from dfply import *
piped = (raw_data
>> mutate(Version=X['Installed version'].str.split('-', 1, expand=True)[0])
>> mutate(Environment=X.Environment.fillna('UNDEFINED').str.upper())
>> mutate(Distribution=X.Distribution.apply(map_distro))
>> drop(X.CMDB_Id, X['Last seen'], X['Last modified'], X['Installed version'])
>> distinct(X.Version)
)
piped.transpose()
The result is identical to the pure Pandas code, as expected.
To learn more about dfply
, read the dplyr-style Data Manipulation with Pipes in Python blog post, which has more examples.