Cover Image

Overview

This post shows how mutating data frames can be written more elegantly (and thus understood more easily) by using data pipelines. R users know this concept from the dplyr package, and Python offers a similar one named dfply.

Setting the Stage

We start off with some global definitions…

import numpy as np
import pandas as pd

The sample data (about OS package deployments) is read into the raw_data dataframe from a CSV file.

raw_data = pd.read_csv("../assets/data/cmdb-packages.csv", sep=',')

print('♯ of Records: {}\n'.format(len(raw_data)))

for name in raw_data.columns[1:]:
    if not name.startswith('Last '):
        print(name, '=', list(sorted(set(raw_data[name].fillna('')))))

raw_data.head(3).transpose()

♯ of Records: 146

Distribution = ['Debian 8.11', 'Debian 8.6', 'Debian 8.9', 'jessie']
Architecture = ['amd64']
Environment = ['', 'Canary', 'DEV', 'LIVE', 'QA']
Team = ['Automation', 'Big Data', 'Email', 'Ops App1', 'Ops Linux', 'Persistence', 'Platform']
Installed version = ['41.15-2(amd64)', '42.28-2(amd64)', '42.44-1(amd64)', '45.11-1(amd64)', '48.33-1(amd64)']

def map_distro(name):
    """Helper to create canonical OS names."""
    return (name.split('.', 1)[0]
        .replace('Debian 7', 'wheezy')
        .replace('Debian 8', 'jessie')
        .replace('Debian 9', 'stretch')
        .replace('Debian 10', 'buster')
        .replace('squeeze', 'Squeeze [6]')
        .replace('wheezy', 'Wheezy [7]')
        .replace('jessie', 'Jessie [8]')
        .replace('stretch', 'Stretch [9]')
        .replace('buster', 'Buster [10]')
    )

Data Cleaning With Pandas

This code cleans up the imported data using the Pandas API.

To get sensible version statistics, we split off the auxiliary information in the version column (anything after -), leaving just the upstream part of the version string. The environment classifier is also cleaned up a little, and distributions are mapped to a canonical set of names. Some unused columns are dropped.

Finally, a subset of unique version samples is selected.

data = raw_data
data = data.assign(Version=data['Installed version'].str.split('-', 1, expand=True)[0])
data = data.assign(Environment=data.Environment.fillna('UNDEFINED').str.upper())
data = data.assign(Distribution=data.Distribution.apply(map_distro))
data = data.drop(columns=['CMDB_Id', 'Last seen', 'Last modified', 'Installed version'])
data = data.drop_duplicates(subset='Version', keep='first')

data.transpose()

Data Cleaning With Pipelines

This does the exact same processing as the code above, but is arguably more readable and maintained more easily:

It has less boilerplate, and makes the use of pipelined processing transparent.
Each step clearly states what it does to the data.
When steps are copied into other pipelines, the X placeholder ensures you use the data of this pipeline (the code is more DRY)..

from dfply import *

piped = (raw_data
    >> mutate(Version=X['Installed version'].str.split('-', 1, expand=True)[0])
    >> mutate(Environment=X.Environment.fillna('UNDEFINED').str.upper())
    >> mutate(Distribution=X.Distribution.apply(map_distro))
    >> drop(X.CMDB_Id, X['Last seen'], X['Last modified'], X['Installed version'])
    >> distinct(X.Version)
)

piped.transpose()

The result is identical to the pure Pandas code, as expected.

To learn more about dfply, read the dplyr-style Data Manipulation with Pipes in Python blog post, which has more examples.

Using R-style Data Pipelines in Notebooks

Overview

Setting the Stage

Data Cleaning With Pandas

Data Cleaning With Pipelines

Reference Links

dfply

Alternatives

	0	1	2
CMDB_Id	274656589	153062618	282201163
Distribution	jessie	jessie	jessie
Architecture	amd64	amd64	amd64
Environment	DEV	DEV	LIVE
Team	Ops App1	Ops App1	Ops App1
Last seen	2019-02-18 11:43	2019-02-18 11:56	2019-02-18 12:04
Last modified	2019-02-18 11:43	2019-02-18 11:56	2019-02-18 12:04
Installed version	42.28-2(amd64)	42.28-2(amd64)	48.33-1(amd64)

	0	2	26	45	62
Distribution	Jessie [8]	Jessie [8]	Jessie [8]	Jessie [8]	Jessie [8]
Architecture	amd64	amd64	amd64	amd64	amd64
Environment	DEV	LIVE	LIVE	UNDEFINED	DEV
Team	Ops App1	Ops App1	Platform	Email	Platform
Version	42.28	48.33	41.15	45.11	42.44