Cover Image

Overview

This post shows how mutating data frames can be written more elegantly (and thus understood more easily) by using data pipelines. R users know this concept from the dplyr package, and Python offers a similar one named dfply.

Setting the Stage

We start off with some global definitions…

import numpy as np
import pandas as pd

The sample data (about OS package deployments) is read into the raw_data dataframe from a CSV file.

raw_data = pd.read_csv("../assets/data/cmdb-packages.csv", sep=',')

print('♯ of Records: {}\n'.format(len(raw_data)))

for name in raw_data.columns[1:]:
    if not name.startswith('Last '):
        print(name, '=', list(sorted(set(raw_data[name].fillna('')))))

raw_data.head(3).transpose()
♯ of Records: 146

Distribution = ['Debian 8.11', 'Debian 8.6', 'Debian 8.9', 'jessie']
Architecture = ['amd64']
Environment = ['', 'Canary', 'DEV', 'LIVE', 'QA']
Team = ['Automation', 'Big Data', 'Email', 'Ops App1', 'Ops Linux', 'Persistence', 'Platform']
Installed version = ['41.15-2(amd64)', '42.28-2(amd64)', '42.44-1(amd64)', '45.11-1(amd64)', '48.33-1(amd64)']
0 1 2
CMDB_Id 274656589 153062618 282201163
Distribution jessie jessie jessie
Architecture amd64 amd64 amd64
Environment DEV DEV LIVE
Team Ops App1 Ops App1 Ops App1
Last seen 2019-02-18 11:43 2019-02-18 11:56 2019-02-18 12:04
Last modified 2019-02-18 11:43 2019-02-18 11:56 2019-02-18 12:04
Installed version 42.28-2(amd64) 42.28-2(amd64) 48.33-1(amd64)
def map_distro(name):
    """Helper to create canonical OS names."""
    return (name.split('.', 1)[0]
        .replace('Debian 7', 'wheezy')
        .replace('Debian 8', 'jessie')
        .replace('Debian 9', 'stretch')
        .replace('Debian 10', 'buster')
        .replace('squeeze', 'Squeeze [6]')
        .replace('wheezy', 'Wheezy [7]')
        .replace('jessie', 'Jessie [8]')
        .replace('stretch', 'Stretch [9]')
        .replace('buster', 'Buster [10]')
    )

Data Cleaning With Pandas

This code cleans up the imported data using the Pandas API.

To get sensible version statistics, we split off the auxiliary information in the version column (anything after -), leaving just the upstream part of the version string. The environment classifier is also cleaned up a little, and distributions are mapped to a canonical set of names. Some unused columns are dropped.

Finally, a subset of unique version samples is selected.

data = raw_data
data = data.assign(Version=data['Installed version'].str.split('-', 1, expand=True)[0])
data = data.assign(Environment=data.Environment.fillna('UNDEFINED').str.upper())
data = data.assign(Distribution=data.Distribution.apply(map_distro))
data = data.drop(columns=['CMDB_Id', 'Last seen', 'Last modified', 'Installed version'])
data = data.drop_duplicates(subset='Version', keep='first')

data.transpose()
0 2 26 45 62
Distribution Jessie [8] Jessie [8] Jessie [8] Jessie [8] Jessie [8]
Architecture amd64 amd64 amd64 amd64 amd64
Environment DEV LIVE LIVE UNDEFINED DEV
Team Ops App1 Ops App1 Platform Email Platform
Version 42.28 48.33 41.15 45.11 42.44

Data Cleaning With Pipelines

This does the exact same processing as the code above, but is arguably more readable and maintained more easily:

  • It has less boilerplate, and makes the use of pipelined processing transparent.
  • Each step clearly states what it does to the data.
  • When steps are copied into other pipelines, the X placeholder ensures you use the data of this pipeline (the code is more DRY)..
from dfply import *

piped = (raw_data
    >> mutate(Version=X['Installed version'].str.split('-', 1, expand=True)[0])
    >> mutate(Environment=X.Environment.fillna('UNDEFINED').str.upper())
    >> mutate(Distribution=X.Distribution.apply(map_distro))
    >> drop(X.CMDB_Id, X['Last seen'], X['Last modified'], X['Installed version'])
    >> distinct(X.Version)
)

piped.transpose()
0 2 26 45 62
Distribution Jessie [8] Jessie [8] Jessie [8] Jessie [8] Jessie [8]
Architecture amd64 amd64 amd64 amd64 amd64
Environment DEV LIVE LIVE UNDEFINED DEV
Team Ops App1 Ops App1 Platform Email Platform
Version 42.28 48.33 41.15 45.11 42.44

The result is identical to the pure Pandas code, as expected.

To learn more about dfply, read the dplyr-style Data Manipulation with Pipes in Python blog post, which has more examples.