ETX2250/ETF5922

Iterating on your Visualisations

Lecturer: Kate Saunders

Department of Econometrics and Business Statistics


  • etx2250-etf5922.caulfield-x@monash.edu
  • Lecture 7
  • dvac.ss.numbat.space


Theory


So far we have developed a good understanding of the theory.

We have learnt:

  • The best-practice principles in data visualisation

  • What separates a good plot from a bad one

  • Which plots work best for which variables

  • How to create a range of different plots

In practice


Practice is often very different to theory.

The facts:

  • The first plot you create will almost never be the plot you use

  • Visualising data is not a linear process!

  • You’ll take 2 steps forward, 1 step backwards and another sideways before you produce your final plot

  • You’ll also need to create many, many visualisations before you settle on a final one

  • Visualising data is an iterative process

The cycle

Data visualisation is a key part of the analysis cycle.

Figure is from R for Data Science Textbook 2nd Edition.

Learning Objectives



Time to put theory into practice!

Today’s class

  • Perform an exploratory analysis.

  • Practise iterating on your visualisations.

  • Use case studies to practise.

  • Create polished visualisations for a report or publication.

Exploratory Visualisation

Penguins Data

Seems obvious but always look at the data before you start!

library(palmerpenguins)
library(kableExtra)
head(penguins) |> kable() |> kable_styling(font_size = 25)
species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year
Adelie   Torgersen            39.1           18.7                181         3750  male    2007
Adelie   Torgersen            39.5           17.4                186         3800  female  2007
Adelie   Torgersen            40.3           18.0                195         3250  female  2007
Adelie   Torgersen              NA             NA                 NA           NA  NA      2007
Adelie   Torgersen            36.7           19.3                193         3450  female  2007
Adelie   Torgersen            39.3           20.6                190         3650  male    2007

What to look for?

Looking at your data helps you to get your data ready for plotting.

Sanity checklist

  • Did your data read in correctly?

  • Does the data you have read in meet with your expectations?

  • Does the data contain what it should?

  • Does the data look like what you expect?

  • Does your data need any transformations or pre-processing?

For each variable

You might like to ask yourself the following questions.

Understanding your variables

  • What is the range of possible values?

  • What are the most common values?

  • What are the least common values?

  • Are there any missing values coded as NA?

  • Are there any missing values coded as something other than NA?
    (e.g., 99 is a common alternative)

  • Is there anything weird or surprising about the data?
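These questions can be checked directly in R. The following is a minimal sketch, assuming the palmerpenguins package is installed; the object name `na_counts` is only illustrative.

```r
library(palmerpenguins)

# Count the missing values (NA) in each column
na_counts <- colSums(is.na(penguins))
na_counts

# Range of possible values for one numeric variable
range(penguins$body_mass_g, na.rm = TRUE)

# Most and least common values for one categorical variable
sort(table(penguins$island), decreasing = TRUE)
```

Missing values coded as something other than NA (such as 99) will not be counted by `is.na()`, so it is still worth scanning the range and the frequency tables for suspicious values.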

Summary

Summary will help you to answer these questions.

summary(penguins) 
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 

Note: summary shows you what R “thinks” each variable is.
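One way to see what R “thinks” each variable is, without reading it off the summary, is to ask for the class of every column. This is a small sketch; `var_classes` is an illustrative name.

```r
library(palmerpenguins)

# Class of each column: factors are treated as categorical,
# numeric and integer columns as quantitative
var_classes <- sapply(penguins, class)
var_classes
```

If a categorical variable shows up as a number or as plain text, it may need converting (for example with `factor()`) before plotting.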

Learning about the data


Looking at the summary only gets you so far.

We should visualise the data!

Visualisation helps us

  • to explore,
  • to understand, and
  • to answer questions

about our data.

For example with Palmer penguins

What different species are represented in the data set?

Now we have an intuition that there are approximately twice as many Adelie Penguins as Chinstrap.
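A bar chart is one way to answer this question. The sketch below is an assumption about how such a plot could be made, not necessarily the code behind the slide.

```r
library(palmerpenguins)
library(ggplot2)

# Number of penguins of each species
species_counts <- table(penguins$species)
species_counts

# Bar chart of the same counts
p_species <- ggplot(penguins, aes(x = species)) +
  geom_bar()
p_species
```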

For example with Palmer penguins

What does the distribution of different body sizes look like?

Now we know this is a right skewed distribution.
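A histogram is a common choice for looking at a distribution like this. The following is a sketch under the assumption that a default-style histogram was used; the number of bins is illustrative.

```r
library(palmerpenguins)
library(ggplot2)

# Drop the rows with no body mass recorded so ggplot2 does not warn
penguins_mass <- subset(penguins, !is.na(body_mass_g))

# In a right-skewed distribution the mean usually sits above the median
mean_mass <- mean(penguins_mass$body_mass_g)
median_mass <- median(penguins_mass$body_mass_g)
c(mean = mean_mass, median = median_mass)

p_hist <- ggplot(penguins_mass, aes(x = body_mass_g)) +
  geom_histogram(bins = 30)
p_hist
```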

For example with Palmer penguins

Let’s try changing the bin width

This bin width is too small! But that’s okay we learnt something.
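To try this yourself, set `binwidth` in `geom_histogram()`. The value of 10 grams below is an assumed example of a bin width that is too small, not necessarily the one used on the slide.

```r
library(palmerpenguins)
library(ggplot2)

penguins_mass <- subset(penguins, !is.na(body_mass_g))

# A very narrow bin width (in grams) gives a spiky, hard-to-read histogram
p_narrow <- ggplot(penguins_mass, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 10)
p_narrow
```

Re-running with a few different values, such as `binwidth = 100` or `binwidth = 250`, is a quick way to iterate.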

For two or more variables

We also want to understand if variables are related. Questions you might ask are:

Understanding your variables

  • Do these values seem to increase/decrease together?

  • How noisy is this relationship (does it hold for all observations or just some)?

  • Do different groups tend to have similar measurements on a particular variable?

For example with Palmer penguins

Do different penguin species have different body mass?

Although there are about twice as many Adelie penguins as Chinstrap, their body mass distributions are very similar.

Gentoo penguins are much bigger!
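Side-by-side boxplots are one way to make this comparison. This is a minimal sketch; the names `median_mass` and `p_box` are illustrative.

```r
library(palmerpenguins)
library(ggplot2)

penguins_mass <- subset(penguins, !is.na(body_mass_g))

# Median body mass (g) for each species
median_mass <- tapply(penguins_mass$body_mass_g, penguins_mass$species, median)
median_mass

# One boxplot per species
p_box <- ggplot(penguins_mass, aes(x = species, y = body_mass_g)) +
  geom_boxplot()
p_box
```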

For example with Palmer penguins

There are other more informative ways to look at that data.

While the boxplots looked similar, now we have a better understanding of differences in the shape of the distribution.
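A violin plot is one geometry that shows the shape of each distribution. The sketch below assumes a violin plot; the slide may have used a different geometry.

```r
library(palmerpenguins)
library(ggplot2)

penguins_mass <- subset(penguins, !is.na(body_mass_g))

# Violin plots show the estimated density of body mass for each species
p_violin <- ggplot(penguins_mass, aes(x = species, y = body_mass_g)) +
  geom_violin()
p_violin
```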

For example with Palmer penguins

There aren’t too many, so we can also add the data points.

Although this doesn’t add much to the interpretation.
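Points can be layered on top of an existing geometry with `geom_jitter()`. This is a sketch assuming a violin plot underneath; the jitter width and transparency are illustrative choices.

```r
library(palmerpenguins)
library(ggplot2)

penguins_mass <- subset(penguins, !is.na(body_mass_g))

# height = 0 keeps every point at its true body mass;
# only the horizontal position is jittered
p_points <- ggplot(penguins_mass, aes(x = species, y = body_mass_g)) +
  geom_violin() +
  geom_jitter(width = 0.1, height = 0, alpha = 0.5)
p_points
```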

For example with Palmer penguins

But what about if we consider sex? The added points tell us quite a bit!

We learn males are heavier.
We are also able to guess at what the sex of the NA penguins is.
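Mapping sex to the colour of the points is one way to bring this out. The sketch below keeps the penguins with missing sex, which ggplot2 draws in grey by default; the underlying violin geometry is an assumption.

```r
library(palmerpenguins)
library(ggplot2)

penguins_mass <- subset(penguins, !is.na(body_mass_g))

# Mean body mass (g) by sex; penguins with missing sex are not in either group
mean_mass_by_sex <- tapply(penguins_mass$body_mass_g, penguins_mass$sex, mean)
mean_mass_by_sex

# Colour only the points by sex
p_sex <- ggplot(penguins_mass, aes(x = species, y = body_mass_g)) +
  geom_violin() +
  geom_jitter(aes(colour = sex), width = 0.1, height = 0, alpha = 0.6)
p_sex
```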

Exploratory Analysis


Tip

  • What are the key variables you should visualise?

  • Are there any data relationships that you should explore? (space, time etc.)

  • Do you know anything about your data already that might help you?

  • How can you be creative about your data and relationships you visualise?

  • What are you curious about?

Case Study

Reporting on penguin data

Penguins reporting

This exploratory visualisation was done to help us with the following brief.

The Brief

You are working with a government department that is creating summaries of the different species in Antarctica.

Your task is to create a report that communicates the key characteristics of current species.

That way any impacts on species due to climate change and tourism in Antarctica can be tracked over time.

As part of this report you need to create a visualisation showing how the body mass of penguins differs by both species and by sex.

What is important to visualise

What we know

Based on our exploratory analysis, we know:

  • There is variation between the body mass of Gentoo penguins and other types.

  • The body mass varies with the penguin sex. Males are heavier.

  • There is missing data, but it doesn’t affect our understanding of the relationship between body mass, penguin species and penguin sex.

Missing data

It’s not okay to just leave missing data out!

  • If we do leave it out, we need to be transparent about it in our report writing.

  • We also need to check carefully that the missing data would not change our understanding of the key relationships.

library(dplyr)  # provides filter()
# Keep only penguins with both body mass and sex recorded
penguins_plot = penguins |>
  filter(!is.na(body_mass_g) & !is.na(sex))

Remove missing data with great caution

Based on our visualisations, the data does not appear to be missing for a systematic reason, so we are going to leave it out. More on when this is okay later.

Penguin species, sex and body mass

Commonly, we can use facets to display multiple variables together.
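As a sketch of what a faceted version might look like, the code below puts each combination of sex and species in its own panel. The choice of histograms and of `facet_grid()` is an assumption, not necessarily what the slide shows.

```r
library(palmerpenguins)
library(dplyr)
library(ggplot2)

# Same filtering as above: drop rows with missing body mass or sex
penguins_plot <- penguins |>
  filter(!is.na(body_mass_g) & !is.na(sex))

# One panel per combination of sex (rows) and species (columns)
p_facet <- ggplot(penguins_plot, aes(x = body_mass_g)) +
  geom_histogram(bins = 20) +
  facet_grid(sex ~ species)
p_facet
```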

Penguin species, sex and body mass


Not finished yet ..

This plot is just okay

  • It doesn’t violate any principles of graphical excellence

  • But, it is difficult to work out which comparisons we are meant to focus on.

  • We cannot directly tell the differences between body mass for penguins of the same species.

  • It might be better to have everything on one plot (Gestalt Principles)

Penguin species, sex and body mass

This is better.

Are we done? Let’s try to make the comparisons even more direct.
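One way to put everything on a single plot is to keep species on the x-axis and map sex to colour. This is a hedged sketch; the boxplot geometry is an assumption.

```r
library(palmerpenguins)
library(dplyr)
library(ggplot2)

penguins_plot <- penguins |>
  filter(!is.na(body_mass_g) & !is.na(sex))

# Species is the primary grouping (x-axis), sex the secondary (colour)
p_one <- ggplot(penguins_plot, aes(x = species, y = body_mass_g, colour = sex)) +
  geom_boxplot()
p_one
```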

Penguin species, sex and body mass

Let’s try switching the primary and secondary variables on the x-axis, and let’s try a different geometry.
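An unpolished sketch of this version, using the same aesthetic mappings and jitter geometry as the finished code later in the lecture, might look like this:

```r
library(palmerpenguins)
library(dplyr)
library(ggplot2)

penguins_plot <- penguins |>
  filter(!is.na(body_mass_g) & !is.na(sex))

# Sex is now the primary grouping (x-axis), species the secondary (colour)
p_switch <- ggplot(penguins_plot, aes(x = sex, y = body_mass_g, colour = species)) +
  geom_jitter(height = 0)
p_switch
```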

Why this plot works


We communicated the key messages!

  • We can clearly compare the differences in size between species.
    → Colour was used effectively to make this connection.

  • We can also clearly see the body size difference based on penguin sex.
    → We’ve chosen to visualise these groups separately on the x-axis.

But wait there is more …

Finalising the plot

The next step is to tidy up the small details.

Watch out!

Anyone red/green colour blind won’t be able to tell there are three penguin types!

We also need to polish the plot annotations e.g. axis, labels, legends etc.

Finished plot

We created at least 11 different iterations of this plot before this one!

The details

Changes:

  • Colours are accessible and are sharp against the white background.

  • Points were made a little transparent. This is less overwhelming visually.

  • Theme with grid lines supports the main comparison but does not overwhelm the plot.

  • Factor (categorical) variables were given capitals.

  • Axis labels were changed from variable names to descriptors that contain units where required.

  • Legend has been moved to the bottom of the plot. This maximises data ink.

  • The title was also updated.

Code for finished product

library(tidyverse)       # dplyr, forcats and ggplot2
library(palmerpenguins)  # penguins data
library(ggthemes)        # scale_color_colorblind()
penguins |>
  filter(!is.na(body_mass_g) & !is.na(sex)) |>
  mutate(sex = fct_recode(sex, Female = "female",
                          Male = "male")) |>
  ggplot(aes(x = sex, y = body_mass_g, colour = species)) +
  geom_jitter(height = 0, alpha = .7) +
  labs(y = "Penguin Body Mass (g)",
       x = "Penguin Sex",
       colour = "Species") +
  theme_bw() +
  theme(legend.position = "bottom") +
  scale_color_colorblind()

Your turn


  • Run each line of code one at a time.

  • Understand how each line of code changes the plot.

  • Add a comment about what each line of code does.

Describing a plot

What now?

Sharing

Once we have a plot we are happy with the next step is often to share it.

This might be:

  • With your boss
  • In a weekly meeting
  • For a presentation
  • In a report

Describing a plot

For a report:

  • It’s good communication practice to describe plots accurately in text.

  • You want to guide the reader to understand the key information visualised

  • Helping to draw their eye to the important parts

  • Here’s one structure.

Describing a plot

Note

Start with a one sentence summary of the main point the plot is trying to communicate.

“Figure 1. The Gentoo penguin species has a larger body mass than the Adelie and Chinstrap species.”

Describing a plot

Note

Then describe the different features of the plot. Here is another example that directly connects the colour aesthetic.

“After disaggregating for sex (x-axis, points horizontally jittered within sex category), we see that the Gentoo penguins (blue points) tend to have a larger body mass (y-axis) than the Adelie (black) and Chinstrap (yellow) penguins.”

Describing a plot

Note

Next follow with other observations/secondary points.

“On average, male penguins have a larger body mass than female penguins.”

“The female Chinstrap penguins appear to be slightly larger on average than the Adelie penguins…”

Describing a plot

Note

And lastly any constraints or cautions on interpretation

“… but more data would be needed to get a clearer understanding. This data was only obtained on three islands, with not all penguin species present on different islands. Further investigation would be needed to investigate whether the difference in penguin mass is due to physiological species differences and not due to available island resources”.

Remember Last Lecture - Example Data

library(tidyverse)
energy = read_csv("data/energydata.csv")
weather = read_csv("data/weather.csv")
energy_weather = full_join(energy, weather, by = c("Date", "State"))
head(energy_weather)
# A tibble: 6 × 9
  Date       Day   State Price Demand NetExport MaxTemp WindDir WindSpeed
  <date>     <chr> <chr> <dbl>  <dbl>     <dbl>   <dbl> <chr>       <dbl>
1 2018-07-15 Sun   NSW    51.7  7564.   -1231.     18   NW              2
2 2018-07-16 Mon   NSW    87.9  8966.     -18.6    18.5 W               7
3 2018-07-17 Tue   NSW    62.8  8050.    -643.     22.5 WNW             9
4 2018-07-18 Wed   NSW    54.5  7840.    -742.     20.8 SSW             1
5 2018-07-19 Thu   NSW    64.2  8168.     -40.6    20.8 NNW             1
6 2018-07-20 Fri   NSW    60.9  8254.     318.     15.5 W              13

And this was our best plot

Iterations


We made 10 different plots last lecture!

We used the plots to explore the relationships between:

  • How Demand varies by State
  • How MaxTemp influences Demand
  • How Day of the week changes Demand

Your turn


Write a description for our final plot

Use this structure:

  • Write a one sentence summary

  • Then describe the different aesthetic mappings / features

  • Share any other observations or secondary points

Case study

Communicating Climate Change

The Brief

You are a scientist working with the media to help the public understand climate change.

There has been a lot of misinformation and disinformation on this topic, such as:

  • The trend is not significant

  • Any change is within the natural variability of the climate

You need to create a visualisation that can be used to communicate the science that the climate is warming.

This visualisation needs to be robust to manipulations.

Line Graph

The most common way to visualise this sort of data is with a line graph.

Is it clear what the trend is?

Perspective - Stretched


The problem: This visualisation is not robust to distortions.

Stretching the plot makes it look like there is less of a trend and highlights the variability.

Perspective - Squished

Squishing the plot brings out the trend, but is still a distortion of the figure. This is not okay.

Real World

Watch famous science communicator Brian Cox use a visualisation just like this one to try to explain to Senator Malcolm Roberts that the climate is warming (Q and A video, see the 4:00 minute mark).

Do you think this was successful / convincing?

Take 2

Let’s rethink our approach

Important messages

  • Need to communicate the climate is warming.

  • The amount of warming will vary in different places (makes the messaging tricky).

  • The amount of warming does matter, but the public won’t necessarily understand the difference between 0.5°C, 1.0°C, 1.5°C or 2.0°C of warming.

  • #1 Priority is showing there is a warming trend

  • #2 Priority is showing the amount of warming

  • Want to avoid confusing the trend and natural variability

Warming Stripes

Sydney #showyourstripes - The trend is clear.

There is no legend attached but does that matter? We quickly understand what this visualisation shows.

Graphic by Prof. Ed Hawkins: https://showyourstripes.info/
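To get a feel for how a stripes graphic is built, here is a minimal sketch in ggplot2. The temperature anomalies are simulated for illustration only and are NOT real observations, and the code is an assumption about one way to draw stripes, not the method behind the original graphic.

```r
library(ggplot2)

set.seed(1)

# Simulated yearly anomalies (NOT real data): a gentle upward trend plus noise
years <- 1900:2020
stripes <- data.frame(
  year = years,
  anomaly = 0.01 * (years - 1900) - 0.6 + rnorm(length(years), sd = 0.2)
)

# One tile per year; colour carries the anomaly, there is no y variable
p_stripes <- ggplot(stripes, aes(x = year, y = 1, fill = anomaly)) +
  geom_tile() +
  scale_fill_distiller(palette = "RdBu") +  # blue = cooler, red = warmer
  theme_void() +
  theme(legend.position = "none")
p_stripes
```

Because the stripes have no y-axis, stretching or squishing the figure does not change how steep the trend looks.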

Labelled Stripes

Do we need to know the exact amount of temperature change to get the message?

Iterations

Warming Stripes

  • This visualisation is world famous.

  • But this was not Ed’s first iteration.

  • And it wasn’t his last.

Bars

Does the bar size and summary text help convey the amount of change that is happening?

Bars with scale

What are the advantages / disadvantages to making this graphic more scientifically accurate?

Why this plot works


We communicated the key messages!

  • We can clearly see there is a warming trend.
    → Colour was used effectively to make this connection intuitive.

  • The warming stripes are very effective at getting the message across that the climate is warming.

  • The warming bars are more suitable when it is important to communicate the amount of warming.
    → A bar geometry is effective here.

  • The warming stripes and bars are much harder figures to argue with than those line plots!

This plot is also already polished, so it’s ready to go.

Climate Spiral

Interactive version shown at the Rio Olympics in 2016

Does animation help convey the climate is warming?

https://climate.nasa.gov/climate_resources/300/video-climate-spiral-1880-2022/

Wrap Up

Summary

Today we have:

  • Used visualisation to explore and understand our data.

  • Learnt that creating a visualisation is not a linear process. It is an iterative process!

  • Learnt that it takes many visualisations to reach one production-ready visualisation.

  • Practised iterating on a design to communicate the key messages.

  • Polished the final visualisations ready for inclusion in a report or publication.

  • Learnt how to write about our visualisations.