When you do your laundry, you fold the clean clothes and put them into a specific drawer: folded socks go to the left, folded T-shirts to the right. When it comes to data, however, marketers have gotten the hang of “folding” but struggle with organizing the data “in the drawer.”
The laundry challenge marketers face is parsing and cleaning marketing data, which often arrives in messy formats. The data comes from many sources, is usually unstructured, and frequently contains missing values that are hard to spot at first glance. As customer interactions with brands generate increasingly diverse data, marketers feel pressure to establish a data structure that can reveal contextual insights and accurately defined customer segments.
Enter a data technique called “tidy data.” Tidy data (a term coined by Hadley Wickham) involves mapping a dataset’s structure to its meaning: analysts align rows, columns, and tables (the dataset’s structure) with observations, variables, and types (the dataset’s meaning). The payoff is a clearer view of the variables, because the dataset’s structure is established up front. That clarity supports a range of activities that improve the systems relying on the data, from spotlighting bad records to feeding an analytic model such as machine learning.
To achieve a tidy dataset, analysts arrange data into a dataset based on three key principles:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
The arrangement may require melting data (turning columns into rows) and splitting combined columns, but the end result should be a simplified table that clearly lays out the observations and variables for a single type of observational unit.
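To make the melting step concrete, here is a minimal sketch in R using the tidyr package. The campaign and quarter columns are hypothetical, invented purely for illustration; only the reshaping technique matters.

```r
# A minimal sketch of "melting" a wide table into tidy form with tidyr.
# The campaign/quarter columns are hypothetical, for illustration only.
library(tidyr)

wide <- data.frame(
  campaign = c("Spring Promo", "Fall Launch"),
  q1_leads = c(120, 95),
  q2_leads = c(150, 130)
)

# Melt the quarter columns into one name column and one value column,
# so each row becomes a single observation: one campaign in one quarter.
tidy <- pivot_longer(
  wide,
  cols      = c(q1_leads, q2_leads),
  names_to  = "quarter",
  values_to = "leads"
)

tidy  # one row per campaign-quarter observation
```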
The tidy data structure benefits analysts and computer models alike by making the data easier to view and to build repeatable, reliable processes around. Hadley Wickham has noted that tidy data is especially well suited to programming languages that rely on vectors, such as R (explained in my earlier post here).
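As a small, hypothetical illustration of that point: once every variable sits in its own column, a single vectorized R expression can operate on all observations at once.

```r
# Hypothetical tidy table: one row per campaign, one column per variable.
tidy_campaigns <- data.frame(
  campaign = c("Spring Promo", "Fall Launch", "Holiday Push"),
  spend    = c(5000, 7500, 9000),
  leads    = c(120, 150, 210)
)

# Because each variable is a complete column, vectorized arithmetic applies
# to every observation in one expression -- no row-by-row looping.
tidy_campaigns$leads_per_dollar <- tidy_campaigns$leads / tidy_campaigns$spend
mean(tidy_campaigns$leads_per_dollar)
```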
Here’s a quick example of how tidy data should work. Suppose we had a dataset that contained vehicle make, model, engine specification, and the gender of the vehicle owner. Everything is labeled as we want, but the data still needs a few tweaks.
The listing in Figure 1 sometimes shows a male and a female owning the same vehicle model, and sometimes two owners of the same gender for one model. The dataset also combines two aspects of the engine spec, its size and its type, in a single column. Such overlaps can confuse a query that tries to calculate population percentages or other nuanced figures.
Figure 2 shows a tidy data arrangement. Each row is a single observation, and every variable in the dataset is labeled. Note how the vehicle make and model can repeat, and how the engine size and engine type now sit in separate columns. The end result is a dataset a program can readily access for automated processes.
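Since the original figures are not reproduced here, the sketch below only mirrors the structure described in the text, with made-up vehicles and owners. It uses tidyr’s separate() to split the combined engine spec into its size and type.

```r
# A sketch of the Figure 1 -> Figure 2 step. The values are made up; only
# the structure mirrors the example described in the text.
library(tidyr)

messy <- data.frame(
  make         = c("Ford", "Ford", "Toyota"),
  model        = c("Escape", "Escape", "Camry"),
  engine_spec  = c("1.5L turbo", "2.0L turbo", "2.5L hybrid"),
  owner_gender = c("female", "male", "female")
)

# Split the combined engine spec into two variables, one per column,
# so each row is a single, fully described observation.
tidy <- separate(messy, engine_spec,
                 into = c("engine_size", "engine_type"),
                 sep  = " ")

tidy
```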
In fact, the problems encountered with messy datasets usually arise from how columns and variables are identified and managed. Helping data scientists and analysts agree on how the data structure should be mapped alleviates many of these challenges.
So how can a marketer best use tidy data, even if the marketer is not data savvy?
First and foremost, marketers should not be intimidated by gaps in their own database knowledge; a full technical understanding is not needed. With tidy data, marketers can organize discussions around the kinds of variables expected in the dataset: are they fixed by the design of the table, or are they expected to be captured from an activity?
Fixed variables can usually spur discussions about what model is expected, because the model ultimately depends on the data captured from activity, i.e., the measured variables. Marketers can guide development discussions on the dataset’s details, distinguishing values that are known in advance and repeatedly accessed from values expected to be collected as activity happens. This effort helps frame the fixed and measured variables that give a tidy dataset its meaning.
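A hypothetical example may help anchor that conversation. In the sketch below, campaign and channel are fixed by the design of the collection, while impressions and conversions are measured from the activity itself; the names and numbers are invented for illustration.

```r
# An illustrative (hypothetical) split of fixed vs. measured variables.
# Fixed variables describe the design of the collection and are known before
# any activity happens; measured variables are captured from the activity.
results <- data.frame(
  # fixed by design:
  campaign    = c("Spring Promo", "Spring Promo", "Fall Launch"),
  channel     = c("email", "social", "email"),
  # measured from activity:
  impressions = c(10000, 8500, 12000),
  conversions = c(240, 150, 310)
)

results
```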
Marketers should next look at the analysis activity that depends on the variables just discussed. Decision trees in a machine learning algorithm are a good example. Machine learning algorithms can overanalyze, or overfit, by following a data error too closely. Whiteboarding a decision tree process surfaces these errors by asking whether the model has learned a relationship that holds true in general or has only discovered patterns peculiar to this dataset. The potential to overanalyze always exists because the model learns through induction, reasoning from specific details. A decision tree discussion can lead to better variables that strengthen a machine learning effort.
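As a hedged illustration (not a prescribed model), the sketch below uses R’s rpart package on randomly generated stand-in data shaped like the tidy table in Figure 2. The depth and complexity settings are the usual levers discussed when worrying about a tree that learns patterns peculiar to one dataset.

```r
# A minimal sketch with the rpart package. "vehicles" is randomly generated
# stand-in data shaped like Figure 2; a real effort would use far more rows
# of genuine observations.
library(rpart)

set.seed(42)
vehicles <- data.frame(
  make         = sample(c("Ford", "Toyota", "Honda"), 200, replace = TRUE),
  engine_size  = sample(c("1.5L", "2.0L", "2.5L"), 200, replace = TRUE),
  engine_type  = sample(c("turbo", "hybrid", "gas"), 200, replace = TRUE),
  owner_gender = factor(sample(c("female", "male"), 200, replace = TRUE))
)

# Limiting depth and requiring a minimum improvement per split (cp) are the
# usual guards against a tree that follows noise found only in this sample.
fit <- rpart(owner_gender ~ make + engine_size + engine_type,
             data = vehicles, method = "class",
             control = rpart.control(maxdepth = 3, cp = 0.02))

printcp(fit)  # reports how much each split actually improved the fit
```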
Overall, tidy data is an important data mining methodology that lays the groundwork for analytic models ranging from regression to machine learning. With open source tools available at no cost, marketers can experiment with tidy data and see for themselves how the value they seek from their data is achieved.