By fragiledata Published on Sunday, 7 April 2019

Translating Tidy Data Tip: Key Transformation Functions in Alteryx, R and Python

Feature photo by Brina Blum on Unsplash

There's a lot of data where one observation to a human (e.g. one survey) isn't ideally one observation to the query language of a database system. In Tableau, this is the Tall v Wide debate, where for wide data sets are good for calculations but tall data sets are better for maximising interactivity. In R, tall data is known as Tidy data, advocated by the most popular mini ecosystem of packages, known as the Tidyverse, and very clearly explained by Hadley Wickham here. For serious database designers, this is essentially 3NF (3rd Normal Form).

Alteryx and Python seem to have less of an opinion on which is best but with the exception of the Select tool, the key tools I use to reshape data in Alteryx are the Cross Tab and Transpose tools, generally to move between tidy and ~~messy~~ wide, depending on what I need the data for. Pretty much the only go-er for this in Python is Pandas, but users have tried to create Python implementations of the Tidyverse (e.g. Dplython).

I'm currently reverse engineering a workflow with some R code inside in order to make it work a bit better and potentially reduce the requirement for R at all, and I found a neat explanation of two tools in the Tidyverse: gather( ) and spread( ).

Gather( ) takes 1+ columns and generates key value pairs, mirroring the function of the Transpose tool in Alteryx, or pandas.melt( ) in Python.
Spread( ) does the opposite, taking only two columns, and uses the designated key column to generate multiple columns, the reverse of gather and mirroring the Crosstab tool in Alteryx, or pandas.cast( ) in Python.

Pandas does have some explicit commands about pivoting and "unpivoting" tables and dataframes as well as cast( ) and melt( ).