Visualizing Data: When, Why, and How
Part 3: The Importance of Integrity: How Color Choice Influences Interpretation (Article 2)
This article is part of a multi-part series on data visualization. Parts 1 and 2 focus on using data visualization throughout the data science workflow and determining when visualizing your data is an appropriate approach for communicating information. Part 3 focuses on factors that affect effective and honest communication of a data story. This is the second article of Part 3.
Color is a critical factor affecting the viewer’s interpretation of a visualization. In this article, we will start with some background on color theory, and then discuss two different aspects of color usage that can drive this interpretation:
- Color indicating categories
- Color representing numerical value
Color is integral to the way that we perceive composition and pattern. The visual effects of different colors and color combinations are the subject of the study of color theory, the concepts of which can help explain why different visual compositions are more or less effective. These concepts, used to compose paintings, photographs, and other artwork, are just as relevant to the way that our eyes move through the two-dimensional compositions that data visualizations essentially are. Thus, in this section, we will explore how the elements of color affect our interpretation of plotted data.
There are three elements of color: hue, value, and intensity. Hue is what we usually think of as a color: it is the color name (“red”, “blue”, “green”), with no indication of lightness, darkness, or saturation. Hues can be arranged on a color wheel, such as the one below [1]:
Value is the lightness or darkness of a color. A high-value (light) color reflects more light and appears brighter. For example, a pure yellow is higher in value than a pure purple. The figure below shows a gradient of values.
Last, intensity is a color’s saturation, or how different it is from gray. Pure colors are at their highest intensity, and lower intensity colors begin to look dirtier or dingier. The figure below shows a gradient of intensities for turquoise.
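To make the three elements concrete, the sketch below breaks a few RGB colors into approximate hue, intensity, and value. It rests on assumptions that are not from the original article: hue and saturation come from the HSV model in Python's standard `colorsys` module, value is approximated by sRGB relative luminance, and the example colors are hypothetical.

```python
import colorsys

def srgb_to_linear(channel):
    """Undo the sRGB gamma curve for one channel in [0, 1]."""
    if channel <= 0.04045:
        return channel / 12.92
    return ((channel + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb):
    """Approximate perceived lightness (0 = black, 1 = white)."""
    r, g, b = (srgb_to_linear(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def describe(rgb):
    """Return hue (degrees), saturation, and luminance for an RGB triple."""
    hue, saturation, _ = colorsys.rgb_to_hsv(*rgb)
    return {
        "hue_degrees": round(hue * 360),
        "saturation": round(saturation, 2),
        "luminance": round(relative_luminance(rgb), 3),
    }

# Hypothetical example colors (RGB channels in [0, 1]).
yellow = (1.0, 1.0, 0.0)
purple = (0.5, 0.0, 0.5)
dull_turquoise = (0.45, 0.6, 0.58)

for name, rgb in [("yellow", yellow), ("purple", purple),
                  ("dull turquoise", dull_turquoise)]:
    print(name, describe(rgb))
```

Under these assumptions, pure yellow comes out much lighter than pure purple, and the dull turquoise has a much lower saturation than either pure color.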
Some hues, like red, will carry weight even when present in a small amount. In a two-dimensional composition, an artist might use this effect to intentionally draw the eye of the viewer, or consider balancing a red element with other heavy elements, such as a complex design, larger regions of lower-intensity colors, or small regions of other intense colors [2]. For example, consider the painting below, by Edgar Degas.
In the Degas painting above, the composition and use of color work together to focus the viewer’s eye on the dancer in the red shawl at the front (bottom) left, and to move attention between this dancer and the small patch of orange on the dancer’s skirt in the back (top) right.
We will see shortly that color has a similar effect on the viewer’s attention in a data visualization, in that red points or lines will draw the attention of the viewer, even if they make up a small part of an overall trend or group of data. Areas of high intensity can also act to steal the show, even when small, capturing the viewer’s eye. The third element of color, value, can have as strong an effect, with areas of similar values defining the viewer’s path through a visualization.
These elements are inextricable from your color choices, and may act to enhance or undermine accurate interpretation of data. Be aware of their effects so that you can communicate honestly!
Color indicating categories
Color will affect the way your eyes move through a data visualization just as much as it would with a two-dimensional artistic composition. It’s helpful to be aware of these concepts so that you understand how they will affect interpretation, whether your own or that of the viewers of your visualizations.
The figures below demonstrate how these elements of color can influence where your attention is drawn. What do you notice about the trends in each plot?
All four plots show the same data, despite the fact that the importance of different trends may look different in each plot. Groups 1 and 2 are randomly normally distributed over the two variables – there is no relationship between the variables for these groups. However, Group 3 has a parabolic relationship between the two variables, and Group 4 has a linear relationship.
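A dataset with this structure could be simulated along the following lines. This is only a sketch: the group sizes, means, noise levels, and coefficients are hypothetical, since the article does not give the values behind its plots. The per-group Pearson correlation is included as a quick check that only Group 4 has a strong linear relationship.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate_groups(n=200, seed=0):
    """Return {group name: (x values, y values)} for four groups."""
    rng = random.Random(seed)
    groups = {}
    # Groups 1 and 2: independent normal draws, no relationship.
    for name in ("Group 1", "Group 2"):
        xs = [rng.gauss(5, 2) for _ in range(n)]
        ys = [rng.gauss(5, 2) for _ in range(n)]
        groups[name] = (xs, ys)
    # Group 3: parabolic relationship plus noise (hypothetical coefficients).
    x3 = [rng.uniform(0, 10) for _ in range(n)]
    groups["Group 3"] = (x3, [0.3 * (x - 5) ** 2 + rng.gauss(0, 0.5) for x in x3])
    # Group 4: linear relationship plus noise (hypothetical coefficients).
    x4 = [rng.uniform(0, 10) for _ in range(n)]
    groups["Group 4"] = (x4, [0.8 * x + 1 + rng.gauss(0, 0.5) for x in x4])
    return groups

groups = simulate_groups()
for name, (xs, ys) in groups.items():
    print(name, round(pearson(xs, ys), 2))
```

Note that a linear correlation coefficient will not flag the symmetric parabola in Group 3. That is one reason plotting each group is still worthwhile.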
In plot (a), the red points draw the eye because the color red is a dominant hue and because this particular shade of red is lower in value – it’s darker. These points appear to have a parabolic shape. While it is not clear what is happening with the other points, you might visually extend this relationship to the other groups of points, and start exploring whether this is an overall trend.
In plot (b), the red points are now those of Group 4, which are linearly related. Plotting them in red emphasizes this relationship, which was less obvious when Group 4 was plotted in orange.
In plot (c), the red and orange points still draw the eye, but in this case, these groups (1 and 2) are random and do not appear to have a relationship. You might notice the green or blue points and explore whether there are relationships between variables for these groups, but it is less likely to drive your intuition around the overall relationship. The same is true for plot (d), which is plotted using an entirely different color scale. The low value purple points are still noticeable, but the high value yellow points pull attention away from the other colors, especially in contrast with the low value purple.
What is the “correct” way to show the data? Again, this depends on what you are doing with the data and what you are trying to communicate. If you are doing exploratory data analysis and your purpose is to find groups with interesting relationships between the two variables, it might be helpful to have your eye drawn to the parabolic relationship of the points in red in plot (a) or the linear relationship of the points in plot (b). However, if your purpose is to communicate the general relationship between the two variables, these two plots might drive home the conclusion that there is a significant relationship, when in fact these are exceptions rather than the rule.
This is a relatively simple example, with only two variables and only four groups. Something like this might be relevant if you’re exploring variable relationships in pair plots during exploratory data analysis. However, the same concepts scale up to more complex figures. For example, imagine that you’re looking at a complex network graph where some of the nodes or edges are colors that stand out more than others, whether due to hue, intensity, or value – this would still affect the relationships that you will see first, and thus your developed intuition around the data.
The importance of value contrast is again demonstrated in the plots below. Here, a y variable is plotted against month, a categorical variable. How many of the groups appear to have a consistent time-dependent relationship?
In the plot above, groups 1 and 2 have values that are randomly distributed around 20 for all months in the plot, whereas group three has a parabolic relationship, with a minimum around May. Because group three is plotted in a high value yellow against a low value background, it dominates visually, making the overall trend appear to be parabolic over time.
In the plots below, the same data are plotted with two different color assignments against two different backgrounds. Plots (a) and (c), and plots (b) and (d), have the same colors per group, whereas plots (a) and (b), and (c) and (d), have the same plot color backgrounds. When is each color more dominant? How does this affect what patterns are most noticeable?
The high value yellow is most noticeable against the low value gray background, so the parabolic pattern is more obvious in (a) than in (b). In the second two plots, the red becomes the clearest color, so the parabolic pattern is more obvious in (d) than in (c).
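One rough way to check how strongly a color will stand out against a background is to compare their lightness. The sketch below uses the WCAG contrast ratio, which is built on sRGB relative luminance, as a stand-in for value contrast. This is an assumption, not the method used to build the plots, and the specific yellow, red, dark gray, and white are hypothetical.

```python
def srgb_to_linear(channel):
    """Undo the sRGB gamma curve for one channel in [0, 1]."""
    if channel <= 0.04045:
        return channel / 12.92
    return ((channel + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb):
    """Approximate perceived lightness (0 = black, 1 = white)."""
    r, g, b = (srgb_to_linear(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(color, background):
    """WCAG contrast ratio: 1 (no contrast) up to 21 (black on white)."""
    lum_a = relative_luminance(color)
    lum_b = relative_luminance(background)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)

# Hypothetical plot colors and backgrounds (RGB channels in [0, 1]).
yellow = (1.0, 1.0, 0.0)
red = (1.0, 0.0, 0.0)
dark_gray = (0.2, 0.2, 0.2)
white = (1.0, 1.0, 1.0)

for bg_name, bg in [("dark gray", dark_gray), ("white", white)]:
    for color_name, color in [("yellow", yellow), ("red", red)]:
        print(f"{color_name} on {bg_name}: {contrast_ratio(color, bg):.1f}")
```

With these example colors, yellow has far more contrast against the dark background than against white, while red has somewhat more contrast against white. This is consistent with the dominant color switching when the background changes.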
In none of these cases is it impossible to figure out that one group has a clear pattern and the others are more random. However, if you are looking quickly at a plot for the underlying story, color combinations such as these can affect your initial read on the patterns, driving the direction you take your analysis or affecting the impression of a take-home message.
Again, the best option for the plot depends on what you are trying to accomplish. Many color schemes and approaches exist where colors are selected to be balanced visually, or to be the most visible to those who are color blind (see the viridis palette, discussion on colorblindness, ColorBrewer for maps, and color selection suggestions here). It is important to consider these factors as well as the effect that a chosen palette has on the visibility of trends.
Color similarity, or relatedness, can also affect how people interpret relationships between data in a figure [3]. For example, in the D3 category20 scale, the first two colors are light blue and dark blue; when using all colors in this scale, viewers will understandably assume that data in these colors are related, whether or not this is the case.
This extends to colors that are near each other on the color wheel. For example, when seeing similar colors (e.g., reds and oranges), people will interpret greater relatedness in the underlying data than when seeing complementary colors (e.g., red and green – although, keep color blindness in mind). In visualizations, this can be used to effectively indicate actual relatedness in the data, but beware of unintentionally communicating incorrect conclusions – to yourself or to others!
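A quick way to flag colors that viewers may read as related is to measure how far apart their hues sit on a color wheel. The sketch below uses HSV hue from Python's `colorsys` module as an approximation. Note that this is an RGB-based wheel, not the RYB wheel shown earlier, so the exact angles differ (red and green are not directly opposite here). The example colors are hypothetical.

```python
import colorsys

def hue_degrees(rgb):
    """HSV hue of an RGB triple, in degrees from 0 to 360."""
    hue, _, _ = colorsys.rgb_to_hsv(*rgb)
    return hue * 360

def hue_distance(rgb_a, rgb_b):
    """Smallest angle between two hues, accounting for wrap-around."""
    diff = abs(hue_degrees(rgb_a) - hue_degrees(rgb_b)) % 360
    return min(diff, 360 - diff)

# Hypothetical category colors (RGB channels in [0, 1]).
red = (1.0, 0.0, 0.0)
orange = (1.0, 0.5, 0.0)
green = (0.0, 1.0, 0.0)

print("red vs orange:", round(hue_distance(red, orange)))
print("red vs green:", round(hue_distance(red, green)))
```

Pairs with a small hue distance, like the red and orange here, are the ones to reserve for groups that really are related.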
This can also be relevant for interpreting patterns across groups that should not necessarily be grouped together. For example, take a look at the line plots below, showing changes in values over time. What trends seem more apparent in each plot?
All of the data in these plots were created as random walks, keeping only series that remained within specified bounds. Nonetheless, different patterns are emphasized by different color schemes. In plot (a), the red lines – colors that are similar and also lower value – emphasize the series that trend upward at the end of the time shown, whereas the green lines peak earlier in the time series. This causes the viewer’s mind to group the lines that have similar trends, and can suggest patterns that don’t actually exist in the randomly created dataset. In plot (b), the colors are assigned randomly, and the overall impression is more random, although there is still a tendency to group the red lines.
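Series like these could be generated by rejection sampling: simulate a random walk and keep it only if it never leaves the bounds. The sketch below assumes Gaussian steps, and the number of series, number of steps, bound, and seed are all hypothetical, since the article does not state the values it used.

```python
import random

def bounded_random_walks(n_series=10, n_steps=50, bound=15.0, seed=1):
    """Generate random walks, keeping only those that stay within +/- bound."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n_series:
        walk = [0.0]
        for _ in range(n_steps):
            # Each step adds an independent normal increment.
            walk.append(walk[-1] + rng.gauss(0, 1))
        # Discard any walk that strays outside the bounds.
        if all(abs(value) <= bound for value in walk):
            kept.append(walk)
    return kept

walks = bounded_random_walks()
print(len(walks), "series of", len(walks[0]), "points each")
```

Because every series comes from the same process, any apparent grouping among them comes from chance and from how the lines are colored.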
Plots (c) and (d) have similar effects, with a different color scheme (viridis). In plot (c), the low value purples emphasize the trend towards high values at the end of the time series, whereas plot (d) appears more moderate.
The plots below demonstrate a similar color-grouping effect with scatterplot data.
All four plots include the same data: 9 groups of points normally distributed around points on a grid. In plot (a), it is easy to visually group the red and orange points and spot an increasing linear trend in the data. The same is true of the low value blues and purples in plot (b). Plots (c) and (d) have colors arranged more randomly across the dataset, and it is more obvious that the groups themselves are random, with no relationship between Variables 1 and 2.
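The setup described above can be sketched as nine point clouds centered on a 3 × 3 grid, with two ways of assigning colors to them. Everything specific here is hypothetical: the color names, the spread, the group size, and in particular the guess that the similar warm colors fall on the diagonal groups. The article does not say which groups received which colors.

```python
import random

WARM = ["red", "orangered", "orange"]  # similar hues
OTHER = ["green", "teal", "blue", "purple", "brown", "gray"]

def grid_groups(points_per_group=30, spread=0.3, seed=2):
    """Nine groups of (x, y) points scattered around a 3 x 3 grid."""
    rng = random.Random(seed)
    return [
        [(rng.gauss(col, spread), rng.gauss(row, spread))
         for _ in range(points_per_group)]
        for row in range(3) for col in range(3)
    ]

def diagonal_assignment():
    """Give the three diagonal groups (indices 0, 4, 8) similar warm colors."""
    colors = list(OTHER)
    for index, warm in zip((0, 4, 8), WARM):
        colors.insert(index, warm)
    return colors

def shuffled_assignment(seed=3):
    """Assign the same nine colors to groups in random order."""
    colors = WARM + OTHER
    random.Random(seed).shuffle(colors)
    return colors

groups = grid_groups()
print("diagonal:", diagonal_assignment())
print("shuffled:", shuffled_assignment())
```

With the diagonal assignment, the warm-colored groups line up from bottom left to top right, which is the kind of arrangement that can suggest an increasing trend that the full dataset does not support.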
If groups one, two, and three are related, it might make sense to use similar colors for them – but if not, beware of visual effects that appear because of grouping similar colors!
Color representing numerical value
When working with data with three variables, it is common to plot the data as a heatmap, with color indicating the value of the third variable (the z-axis). In this context, it is just as important to pay attention to how the range of the axis determines interpretation of patterns and values as it is with two-dimensional plots – except that here the range is a color scale, rather than a plot axis.
For example, when there is substantial variation at one end of the range but extreme values on the other, notable patterns can be obscured. In this case, your options are similar to those above: you could consider log-transformation of the variable represented by color, or try using a color scale with more variation.
In plot (a) below, where is your attention drawn, and what might your conclusion be in terms of the dependence of Z on X and Y?
It looks like there might be a bit of variation throughout, but the only clear pattern is a peak in values centred around X = 15 and Y = 15. In contrast, take a look at the same data plotted with a broader range of colors (b), as well as with the data log-transformed (c and d).
In these plots, it becomes clearer that Z has a relationship with X and Y throughout the range of these latter variables. The broader color range in plot (b) helps communicate this by providing more contrast at the low end of the range. However, this new step in color – from purple to white – has a high contrast, and this might indicate to the viewer that this difference is as important as those at the higher end. Whether or not this is appropriate depends on the variables being described and the story being told.
In plots (c) and (d), log-transformation also makes this color range more clear with both color palettes, by emphasizing differences between smaller values and down-playing differences between higher values. While log-transformation might not make sense as part of your later data processing, it is a useful tool that you can use to explore variation at smaller values.
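The effect of log-transformation on a color scale can be seen by computing where each value lands on a 0-to-1 color range under linear and log scaling. This is a minimal sketch with hypothetical values. It assumes all values are strictly positive, since the logarithm is undefined otherwise.

```python
import math

def linear_position(value, vmin, vmax):
    """Position of value on a 0-1 color scale with linear scaling."""
    return (value - vmin) / (vmax - vmin)

def log_position(value, vmin, vmax):
    """Position of value on a 0-1 color scale with log scaling (values > 0)."""
    log_min, log_max = math.log10(vmin), math.log10(vmax)
    return (math.log10(value) - log_min) / (log_max - log_min)

# Hypothetical Z values: small-scale variation plus one extreme value.
values = [1, 2, 4, 8, 1000]
vmin, vmax = min(values), max(values)

for value in values:
    print(value,
          round(linear_position(value, vmin, vmax), 3),
          round(log_position(value, vmin, vmax), 3))
```

Under linear scaling the four small values all sit in the bottom 1% of the color range, so they look nearly identical. Under log scaling they spread across roughly the bottom third.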
The third article of Part 3, Maps – Potentials & Pitfalls, discusses another aspect of the design choices that affect interpretation of data visualizations. This article is forthcoming.
1. Wikimedia Commons color wheel: https://commons.wikimedia.org/wiki/File:Ryb-colorwheel.svg
2. Cooperman, Marcie. 2013. Color: How to Use It. Pearson Education, Inc., publishing as Prentice Hall, Upper Saddle River, New Jersey. 319 pp.
3. See this paper for discussion on how color similarity can also affect how people perceive and remember patterns.