Visualizing Data: Why, When, and How
Part 2: When Is Data Visualization a Good Choice?
This is the second part in a three-part series entitled Visualizing Data: Why, When, and How. In Part 1, Data Visualization Throughout the Data Science Workflow (Article 1 and Article 2), we worked through some straightforward, accessible examples of data visualization and looked at where it serves a purpose in the data science workflow. In particular, we examined three aspects of that workflow where data visualization can be useful: using tools such as statistics and modelling, drawing insight from your data, and communicating data insights to others.
Numbers or visualizations?
Data visualization can be a powerful tool for communicating patterns in numbers and making data more interpretable, whether you’re telling a data story to yourself or to someone else. However, sometimes an image obscures the point, and it’s better to use the numbers themselves.
But when is information best communicated by numbers, and when are there too many data points for our brains to interpret patterns from the numbers alone?
In Part 1 of this series, we discussed several examples of how data visualization can enhance your workflow, supporting your understanding of steps throughout the data science process, as well as insight around the data itself. In this second part, we will discuss some factors to help you determine whether or not data visualization is the right choice for communicating information.
Whether or not to use a visualization largely depends on:
- How many data points are involved, and
- Whether the purpose is to communicate individual numbers or overall patterns.
Below, we will show examples of datasets and data stories that vary with regard to those two aspects, and that range from unnecessary to misleading to insightful.
This section focuses on visualization for communication, when you’re telling the story of your data to someone else. However, any consideration for communicating data to others also applies to interpreting data yourself. As you go through the examples below, consider how you may also be impacted by the nuances of data visualization throughout your own analytical process.
When visualization is unnecessary
Sometimes a picture is worth a thousand words, but sometimes a thousand words are overkill. With few enough numbers, a visualization can be unnecessary, redundant, and/or a waste of time and space. For example, if you’re only comparing a few values, a sentence often works fine. In fact, a sentence describing the relationship between the values may be more effective.
For example, take a look at the following plot, showing the fabricated results of a 2-person, 30-second marshmallow-eating contest:
What information is actually interesting here[1], and what do we learn from the plot?
We probably want to know two things:
- Who ate more marshmallows and won the contest?
- How many marshmallows did each contestant eat?
The plot does a fine job of answering the first question. However, it would have been just as effective, and would have taken less space, to write, “Rebecca ate more marshmallows than Joe, and won the contest.”
The second question is not easily answered: we can estimate that Joe ate just less than 5 marshmallows and Rebecca ate between 15 and 20 marshmallows, but we don’t know for sure. Sometimes this is addressed by adding numbers to a plot:
However, now the plot becomes even more unnecessary, as all of the necessary information is contained in the two numbers, which are completely redundant with the bars.
A better approach would be to write a single sentence:
“In the 30-second marshmallow-eating contest, Rebecca won by eating 17 marshmallows, almost 6 times as many as Joe’s 3.”
Visualizing data can often go a long way towards making it more interpretable. However, when there are only a few numbers that need to be communicated, a plot is unnecessary and a sentence does the trick.
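As a rough sketch of how a sentence like this could be produced directly from the results, the Python snippet below computes the ratio and fills in a template. The function name `contest_sentence` and its parameters are illustrative assumptions, not part of the original analysis, and the sketch only handles exactly two contestants with a non-zero losing count.

```python
def contest_sentence(results, duration_s=30):
    """Summarize a two-person contest in a single sentence.

    `results` maps contestant name -> marshmallows eaten (exactly two entries).
    """
    # Sort so the contestant with the higher count comes first.
    (winner, w_count), (loser, l_count) = sorted(
        results.items(), key=lambda kv: kv[1], reverse=True
    )
    ratio = w_count / l_count  # assumes the loser ate at least one
    return (
        f"In the {duration_s}-second marshmallow-eating contest, {winner} won "
        f"by eating {w_count} marshmallows, about {ratio:.1f} times as many "
        f"as {loser}'s {l_count}."
    )


sentence = contest_sentence({"Joe": 3, "Rebecca": 17})
print(sentence)
```

Reporting the ratio to one decimal place (17 / 3 is roughly 5.7) keeps the sentence honest about "almost 6 times" without needing a plot.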
When visualization obstructs usefulness
If you are working with more numbers than can be easily communicated in a sentence, you have two options: a table or a figure. A table might be the right choice if your goal is one (or more) of the following:
- To compare or communicate specific values rather than a trend
- To compare pairs of numbers
- To include both summary and detailed numbers
- To communicate sets of numbers that have different units
For example, let’s say that you were examining exchange rates between Canadian dollars and several other countries’ currencies in 2016 and 2017.
There are several problems with these plots.
First, both plots emphasize differences between different currencies, which isn’t necessarily meaningful and could be misleading. This is because each currency is a different unit, so the fact that you can get more AUD than USD with $1 CAD is only informative in the context of the buying power of each currency. (How much does, say, a loaf of bread cost in Australia vs. the US?) This means that reasoning about differences in this plot may be impeded, thwarted, or misled.
Second, the plots also indicate trends between years. This is also not very meaningful without the context of previous changes. Additionally, the scale of the plot is determined by the range of different currencies, which may make trends look more or less substantial for any given currency. (More on this below.)
Last, if you wanted to use the actual exchange rates, you would not be able to get an accurate number from either plot – you would have to approximate values for each point.
What if the same data were presented in a table?
With a table, the values are clearly communicated and accessible. It is easy to find the exact exchange rate for a country and year, and there are no misleadingly implied relationships. You might like looking at the plot, but the table is much clearer.
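If you want to build such a table programmatically, a minimal sketch in plain Python might look like the following. The helper `format_rate_table` is hypothetical, and the numbers are placeholders for illustration only, not the exchange rates used in this article.

```python
# Placeholder numbers for illustration only; these are NOT real exchange rates.
rates = {
    "USD": {2016: 0.75, 2017: 0.77},
    "AUD": {2016: 1.01, 2017: 1.00},
}


def format_rate_table(rates, years):
    """Return a fixed-width text table with one row per currency."""
    header = f"{'Currency':<10}" + "".join(f"{year:>10}" for year in years)
    lines = [header, "-" * len(header)]
    for currency in sorted(rates):
        # Four decimal places so exact values can be read off directly.
        cells = "".join(f"{rates[currency][year]:>10.4f}" for year in years)
        lines.append(f"{currency:<10}{cells}")
    return "\n".join(lines)


table = format_rate_table(rates, [2016, 2017])
print(table)
```

Each cell shows the exact value, and nothing in the layout suggests that currencies with different units should be compared by size.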
When visualization is effective
So, when is it useful to visualize data? Data visualization is helpful and effective when the purpose is to communicate overall patterns in data that would be more difficult to gather from the numbers alone. In this case, the interesting aspect may not be the specific values, but the data’s overall shape, or the relative positions of different parts of the data.
For example, let’s look at exchange rates again, but this time we will examine two currencies over a monthly time series: British Pounds (GBP) and US Dollars (USD). The data will be presented in two plots with different units, but both plots cover the same time period, and their x-axes are aligned.
With these plots, the important point is not the specific values of each monthly exchange rate, but the relative change over time. As a visualization, this allows you to ask questions like, “Why did the value of the Canadian dollar fall relative to both GBP and USD from 2013 through early 2016?” and “Why did CAD continue to gain value relative to GBP after early 2016, when its value relative to USD remained more stable?” These kinds of questions might lead you to explore what was happening in the UK (for example, Brexit) and what was happening in the US (for example, the presidential election) in 2016, along with Canadian current events.
As simple as it is, this is a successful visualization if it facilitates further understanding and questions about the data and the real-world elements it represents.
Limits to visualization choices
Even when you want to communicate more data than would be appropriate for a sentence or a table, there are limits to what we are able to interpret from a visualization. These limits affect whether visualization is a good choice for communicating your data as is, or if you need to do more analysis, grouping, or filtering first.
For example, humans are limited by the number of elements that we can distinguish, both in terms of categories and in terms of colors. These factors are often intertwined in plots, since multiple categories are often represented by multiple colors. (See here for more discussion and examples.)
This limitation has to do both with our eyes being able to distinguish between different colors, and our brains being able to make multiple comparisons between different categories. Too many bars in a bar plot or lines in a line plot can obscure trends or communication of key points.
For example, the next plots build on the currency exchange plots above, with more currencies shown. This time, however, values are normalized by the same currency’s 2014 values, and thus represent proportional changes since 2014. This means that they all share the same units (unitless proportions) and can reasonably be compared on a plot.
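The article does not spell out the normalization formula, so the sketch below is one plausible reading: each year's rate is expressed as its proportional change from that currency's 2014 rate, so that the 2014 baseline sits at 0. The function name `proportional_change` and the sample numbers are assumptions for illustration, not the data behind the plots.

```python
def proportional_change(rates_by_year, base_year=2014):
    """Express each year's rate as a proportional change from the base year.

    A result of 0.25 means the rate is 25% above its base-year value;
    -0.25 means 25% below. The base year itself maps to 0.0.
    """
    base = rates_by_year[base_year]
    return {year: (rate - base) / base for year, rate in rates_by_year.items()}


# Placeholder rates for a hypothetical currency; not real exchange rates.
example_rates = {2014: 2.0, 2015: 1.5, 2016: 2.5}
changes = proportional_change(example_rates)
print(changes)
```

Because the result is unitless, changes for different currencies can share one axis even though the raw rates cannot.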
Why are these plots so challenging to interpret?
One challenge comes from the number of values we can compare at once. We can visually compare several bars, but there is a limit (potentially 5-7; see here and here for discussion).[2] We can’t, say, easily order all currencies from highest to lowest for 2016, and understand how each changed over time. Thus, to understand the data, we need to methodically go through each currency and compare across years.
Furthermore, in both plots, the many different categories and colors require viewers to continually check back and forth between the legend and the plot, while remembering the color information of previously examined categories. Humans are limited in the number of different elements (individual items or grouped blocks) that we can hold in our working memory at the same time; a 2008 study suggests the limit is around 3-4, lower than had previously been thought.[3] Thus, too many different categories on the same plot make it harder to see patterns and grasp trends.
Additionally, it is difficult to distinguish between similar colors, making it harder to determine which category is which. This is especially notable in the line plot, where lines cross other lines with similar colors and at low angles. While the issue of distinguishing similar colors could be addressed with a different color palette, it will still remain to some extent when so many colors are included. (See here for more discussion on selecting colors for visualizations, especially graduated color scales for continuous values.)
Is this visualization still useful? Ultimately, that comes down to the question that you are trying to answer, or the point that you are trying to communicate. Are you doing exploratory data analysis, or trying to communicate a data story to someone else? Some aspects of plots may be effective for one purpose, but not for the other.
If you are making a plot for exploratory data analysis, a plot like the bar plot above might give you useful first-pass information about the data. For example, this shows you that in this group of currencies, some became stronger relative to the Canadian dollar, whereas others didn’t change or became weaker. There also may be meaningful fluctuations from year to year. You would still want to do further analysis to better understand and visualize trends over time – in fact, you are probably already grouping different categories together in your mind to derive insight from these plots. However, this could be a useful starting point for further analysis.
However, if you are trying to tell a story or communicate information about certain relationships between the Canadian dollar and other currencies in the past few years, this is probably not the most effective visualization. It might be more helpful to group currencies into plots where values are above or below 0, so that there are fewer categories and colors per plot. At a minimum, it could be helpful to order the categories in a meaningful way. For example, in the following two plots, the currencies are split by whether they increased or decreased (relative to their 2014 exchange rate) in 2015, and ordered in ascending or descending order.
By splitting the plot into smaller groups of currencies, and by ordering categories based on the first year shown (2016), it becomes easier to understand which currency each bar represents and to more quickly spot trends.
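The splitting and ordering step can be sketched in a few lines of Python. This is a minimal illustration, assuming the data are already expressed as proportional changes from a baseline; the function `split_and_order`, the currency codes, and the numbers are all hypothetical, and the reference year is left as a parameter.

```python
def split_and_order(changes, year):
    """Split currencies by the sign of their change in `year`, then sort.

    `changes` maps currency code -> {year: proportional change}.
    Returns (increased, decreased): increased is ordered from largest gain
    to smallest, decreased from largest loss to smallest.
    """
    increased = [c for c, by_year in changes.items() if by_year[year] > 0]
    decreased = [c for c, by_year in changes.items() if by_year[year] <= 0]
    increased.sort(key=lambda c: changes[c][year], reverse=True)
    decreased.sort(key=lambda c: changes[c][year])
    return increased, decreased


# Hypothetical currency codes and placeholder values, for illustration only.
sample_changes = {
    "AAA": {2015: 0.10},
    "BBB": {2015: -0.05},
    "CCC": {2015: 0.30},
    "DDD": {2015: -0.20},
}
went_up, went_down = split_and_order(sample_changes, 2015)
print(went_up, went_down)
```

Each resulting list can then be drawn as its own plot, with bars in list order, so that neighbouring bars have similar values and fewer colors are needed per plot.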
Ultimately, if it is difficult to understand your plot once all your data is represented, you probably need to split your data further, do further analysis, or summarize meaningfully.
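To pull the guidance in this article together, here is a rough heuristic written as a small Python function. It is only a sketch: the function name `suggest_format` and both thresholds are assumptions chosen for illustration (the category cap borrows the upper end of the "potentially 5-7" range discussed above), and real decisions will depend on your audience and question.

```python
def suggest_format(n_values, goal, n_categories=1,
                   max_sentence_values=3, max_categories=7):
    """Suggest how to present data, following the article's rough guidance.

    goal: "values" to communicate specific numbers,
          "pattern" to communicate overall shape or trend.
    The two thresholds are illustrative assumptions, not established rules.
    """
    if goal not in ("values", "pattern"):
        raise ValueError("goal must be 'values' or 'pattern'")
    if n_values <= max_sentence_values:
        return "sentence"  # a few numbers fit in prose
    if goal == "values":
        return "table"  # exact values are easiest to look up in a table
    if n_categories > max_categories:
        return "figure after grouping or filtering"
    return "figure"


print(suggest_format(2, "values"))
print(suggest_format(10, "values"))
print(suggest_format(96, "pattern", n_categories=2))
print(suggest_format(60, "pattern", n_categories=20))
```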
This is the second part in a three-part series. The two articles of Part 1, Data Visualization Throughout the Data Science Workflow, can be found here and here. The third part of this series, The Importance of Integrity, is forthcoming.
[1] For the moment, let’s ignore the subjectivity of the question and assume that it’s interesting to know how many marshmallows each contestant ate!
[2] On a related but tangential note, see these links for discussions on how many objects the human brain can recognize at once without explicitly counting and how many moving objects humans can track at the same time.
[3] The original Proceedings of the National Academy of Sciences article can be accessed here.