ETX2250/ETF5922

Infographics

Lecturer: Kate Saunders

Department of Econometrics and Business Statistics

etx2250-etf5922.caulfield-x@monash.edu
Lecture 6
dvac.ss.numbat.space

Infographics

What are infographics?

Infographics

Infographics are a powerful form of visual communication that makes complex data accessible, memorable, and engaging. They:

Are visual representations of data that combine graphics, text, and numbers
Go beyond just visualisation
Guide the reader to the key messages
Commonly use a simple, narrative form
Help reduce the cognitive load

And we’ve seen a few examples already

Example

Source

There is a big downward spike in 2020. Adding text here helps the audience understand why.

Infographics vs Visualisation

What is the difference?

They aren’t necessarily different from one another.
The key difference is that infographics usually contain additional text or other graphics, like icons.
Infographics are just a form of data visualisation that uses additional narrative elements.

Chart junk?

Hang on

Isn’t adding extra elements a form of chart junk?

Technically yes …
But Tufte’s principles were written in 1982
So let’s think about this issue flexibly

So when should we use infographics?

Some of the reasons

To deliver the message quickly
To explain a complex process
To alert the audience to an important part of our figure
To summarise a long report or blog succinctly
To create a visual that is easy to share

Another Example

Source

Plastic recycling can be a boring topic
This image is engaging and colourful
The text guides the viewer
Notice the use of human perception

Some Resources

Infographics

Infographics use many of the same principles of good data visualisation!

Here are more details on the key principles
Here is a more general guide from Monash library on infographics
Some more advanced examples of infographics in R
Guidance on creating graphical summaries for an Article or Report

Your turn

Your turn

Let’s take some time to look at some examples of infographics and their design elements:

Note that infographics are common on social media, but less so in technical reports.
Why do you think that is?

How do we create one?

Messages

Messages

What messages are you trying to convey?
Where do you need to draw your reader’s eye?
What situational context can you provide to improve their understanding?
For example, important events like COVID-19 hurt the economy.
Or Apple announces a new product that drives up the stock price.

For example

Adding messages

Text layer

You already know how to create a visualisation.
Now let’s add some extra narrative layers to our plot.
There are many ways to do this in ggplot:
- geom_text()
- geom_label()
- annotate()

geom_text() and geom_label()

These work the same way as geom_point().
But instead of a point, you are adding text.
geom_label() wraps the text inside a rectangle.

Additional

Required aesthetics:

label: the text you want to display

Useful additional inputs:

nudge_x and nudge_y: shift the text along the x and y axes
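As a minimal sketch of how label and nudge_y fit together, the example below labels one car in mtcars and shifts the text slightly above its point. The object names heaviest and name are hypothetical and chosen only for illustration, and the nudge of 1.5 is an arbitrary value.

```r
library(ggplot2)

# Hypothetical helper data: the single heaviest car in mtcars
heaviest <- mtcars[which.max(mtcars$wt), ]
heaviest$name <- rownames(heaviest)

nudge_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # label is the required aesthetic; nudge_y moves the text up by 1.5 mpg
  geom_text(data = heaviest, aes(label = name), nudge_y = 1.5)

nudge_plot
```

Without the nudge, the text would be centred directly on top of the point it describes.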

Starter plot

library(ggplot2)

ggplot(data = mtcars) +
  geom_point(aes(x = wt, y = mpg))

Add some text

The text can be added using a character string

plot <- ggplot(data = mtcars) +
  geom_point(aes(x = wt, y = mpg)) + 
  geom_text(x = 4, y = 30, 
            label = "As the car weight increases \n the fuel economy gets worse", 
            color = "darkred", 
            size = 6)

Add some text

Add some more text

Text can also be added using a data frame

plot <- ggplot(data = mtcars) +
  geom_point(aes(x = wt, y = mpg)) + 
  geom_text(x = 4, y = 30, 
            label = "As the car weight increases \n the fuel economy gets worse", 
            color = "darkred", 
            size = 6) + 
  geom_text(aes(x = wt, y = mpg, label = rownames(mtcars)),
            size = 2.5, 
            alpha = 0.8, 
            hjust = -0.2)

But this is chart junk

Better

Text can also be added as a label

library(dplyr)

# Keep only the row for the car with the highest mpg
most_efficient <- mtcars |>
  arrange(desc(mpg)) |>
  slice(1)

plot <- ggplot(data = mtcars) +
  geom_point(aes(x = wt, y = mpg)) + 
  # Pass the one-row data frame so the label is drawn once, not once per row of mtcars
  geom_label(data = most_efficient,
             aes(x = wt, y = mpg,
                 label = paste(rownames(most_efficient), "is the most efficient car")),
             size = 4, 
             alpha = 0.8, 
             hjust = -0.1) + 
  ggtitle("Lighter cars have better fuel economy") + 
  xlab("Weight (1000 lbs)") + 
  ylab("Miles per Gallon")

Add a label

annotate()

Note

The annotate() function is useful for adding small annotations (such as text labels)
There are many geoms you can use with this function, for example:
- text: adding text
- segment: drawing a line
- rect: drawing a rectangle
- point: highlighting a point

Go to this link for more information.

annotate

mtcars |> 
  ggplot(aes(x = wt, y = mpg)) +
  # Draw the highlight first so the data points sit on top of it
  annotate("point", x = 2.2, y = 32.45, color = "orange", size = 10) +
  geom_point() +
  # Arrow from the text box to the highlighted point (linewidth needs ggplot2 >= 3.4.0)
  annotate("segment", x = 3, y = 32.5, xend = 2.25, yend = 32.5,
           color = "orange", linewidth = 2.5,
           arrow = arrow(length = unit(3, "mm"))) +
  annotate("rect", xmin = 3.1, xmax = 4, ymin = 31, ymax = 34, fill = "blue") +
  annotate("text", x = 3.55, y = 32.5, label = "Here is some text",
           color = "white", size = 5)

An example to recreate

Code

stock <- read_csv("data/big-tech-stock-price.csv") |> 
  mutate(date = ymd(date)) |>
  filter(stock_symbol == "AAPL",
         year(date) > 2016) 

ggplot(stock) +
  geom_line(aes(x = date, y = close)) +
  geom_vline(xintercept = as.numeric(as.Date("2020-01-01")), linetype = "dashed", 
             color = "red", alpha = 0.4) +
  annotate("segment", x = as_date("2018-01-01"), xend = as_date("2019-12-01"), 
           y = 140, yend = 100, color = "red") +
  annotate("label", x = as.Date("2018-01-01"), y = 150, 
           label = "COVID-19 Pandemic", color = "white", fill = "red") +
  labs(x = "Date", y = "Closing price", 
       title = "Apple Inc stock price during the pandemic") +
  theme_bw() +
  theme(
    aspect.ratio = 0.5,
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5)
    )

Your turn

Your turn

Copy the code provided.
Run each line to see what it does - check you understand.
Try it yourself: Add a vertical dashed line, label and segment for the other important events.
- The Apple spring 2020 event on 2020-03-18
- The M1 Chip announcement on 2020-11-10

Visualising Uncertainty

Linda Problem

Question:

Background: Lucy was a math major in college and got top marks on all her exams in probability and statistics. Which do you think is more likely:

That Lucy is a portrait artist, or
That Lucy is a portrait artist who plays poker?

Waffle Plot

Waffle Plot Code

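# Install waffle from GitHub if it is not already available
if (!require(waffle))
  remotes::install_github("hrbrmstr/waffle")

library(ggplot2)
library(waffle)
library(ggthemes)  # provides scale_fill_colorblind()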

people_data = data.frame(
  Person = c("Maths", "Math Artists", "Math Artists that Play Poker"), 
  count = c(20*20 - 10, 7, 3))

ggplot(people_data) + 
  geom_waffle(aes(fill = Person, values = count),
    n_rows = 20, size = 0.5, colour = "white"
  ) + 
  scale_fill_colorblind() +
  coord_fixed() + 
  geom_label(x = 30, y = 18, 
            label = "The number of maths majors \n who are just artists \n is greater than the number \n  that are artists and play poker!", 
            color = "black", size = 4)  +
  theme_minimal() + 
  labs(title = "One Million Majors in Maths") + 
  theme(axis.text = element_blank(),
        legend.position = "bottom",
        plot.title = element_text(size = 20),
        legend.text = element_text(size = 10)) + 
  xlim(c(0,40))

Conjunction Fallacy

Explaining Frequencies

When the problem is presented as before, people get the answer wrong 80% of the time.
However, if the same question is asked in terms of frequencies, this is reversed.
For example: Out of 100 math majors, estimate how many are:

Artists: ___ in 100
Artists who play poker: ___ in 100

Visualising the problem also helps people get the answer correct much more often!

Read more about the Conjunction Fallacy and Linda Problem here.
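The frequency argument can be checked with a few lines of arithmetic. This is a small sketch that reuses the counts from the waffle plot data (400 maths majors, of whom 7 are artists only and 3 are artists who play poker); the variable names are hypothetical.

```r
# Counts taken from the waffle plot data: 390 + 7 + 3 = 400 maths majors
total_majors  <- 400
artists_only  <- 7   # artists who do not play poker
artists_poker <- 3   # artists who also play poker

# Every artist who plays poker is still an artist
artists_all <- artists_only + artists_poker

p_artist       <- artists_all / total_majors
p_artist_poker <- artists_poker / total_majors

# The conjunction can never be more likely than one of its parts
p_artist_poker <= p_artist
```

Whatever counts are chosen, the artists who play poker are a subset of the artists, so the second probability cannot exceed the first.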

People and probabilities

Important

People often prefer certainty and struggle with probabilistic reasoning, preferring definitive answers instead of ranges or likelihoods.
People commonly overestimate rare events (e.g., plane crashes) and underestimate common ones (e.g., car accidents).
How information is presented (e.g., “10% failure” vs. “90% success”) influences interpretation and decision-making.
Understanding probabilities requires numerical literacy, which varies widely among the public.
Probabilities and uncertainties are often abstract when given as numbers alone.

Why do waffles work?

Note

Like a pie chart, you see the data as part of a whole.
Like a bar chart, you are also able to effectively compare the size of categories.
Key difference: Bigger parts are broken down into their individual components instead of being shown in a solid colour.
This blogpost is a great resource for more details and examples.
You can also look at the Waffle plots on data-to-viz.

Uncertainty

Uncertainty

Waffle plots are great for visualising uncertainty
For example, 1 in 10 chance can be easily shown
This is one of the reasons they are so commonly used in infographics
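To illustrate the 1 in 10 case, here is a minimal sketch that builds a waffle-style grid by hand with geom_tile() from ggplot2 rather than with the waffle package. The object names, colours, and labels are assumptions chosen for illustration.

```r
library(ggplot2)

# A 10 x 10 grid of squares: 100 squares in total
grid <- expand.grid(col = 1:10, row = 1:10)

# Flag the first 10 squares (the bottom row) to represent a 1 in 10 chance
grid$outcome <- ifelse(seq_len(nrow(grid)) <= 10,
                       "Event occurs", "Event does not occur")

chance_plot <- ggplot(grid, aes(x = col, y = row, fill = outcome)) +
  geom_tile(colour = "white") +
  coord_fixed() +
  scale_fill_manual(values = c("Event occurs" = "darkred",
                               "Event does not occur" = "gray80")) +
  labs(title = "A 1 in 10 chance", fill = NULL) +
  theme_void() +
  theme(legend.position = "bottom")

chance_plot
```

Each square is one outcome out of 100, so the reader can count the highlighted squares instead of interpreting a percentage.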

Pictograms

Waffles with Pictures

You can also make waffle plots using picture icons in R using geom_pictogram
However, it requires installing additional icons, which is tricky.
This is quite advanced for this unit.

Waffle edits

Your turn

Your turn

Try to recreate my waffle edits
Change the number of rows in the waffle to 10
Change the waffle background colour to “gray90”
Make the waffle size 1
Add a rectangle to show the math majors who are artists
Edit the label text

Missingness

Example: Survivorship Bias

Visualisation from WWII by statistician Abraham Wald Source

This visualisation shows bullet holes.

The pattern of damage shows locations where planes can sustain damage and still return home.
The missing areas show where the plane should be reinforced.

Missingness

Important

Understanding what data is not there and why is very important
Missing data or incomplete data can lead to a wrong conclusion
You should always consider the missing data
Visualising what is missing is important

Missing data points

For small datasets, you can visualise missing data using the R package naniar.

library(ggplot2)
library(naniar)
ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_miss_point()
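Before plotting, it can help to count how much is missing. The sketch below uses only base R on the same airquality data; the object names are hypothetical.

```r
# Number of missing values in each column of airquality
missing_counts <- colSums(is.na(airquality))
missing_counts

# Proportion of rows with at least one missing value
prop_incomplete <- mean(!complete.cases(airquality))
prop_incomplete
```

Columns with no missing values show a count of zero, which makes it easy to see that the missingness is concentrated in Ozone and Solar.R.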

Missingness types

Type 1: Missing completely at random (MCAR)

The cause of missingness is unrelated to both the independent variables and the dependent variables.
Example: A student’s car breaks down and they miss their exam.
Reason: The missingness (the student missing the exam) is due to an unpredictable, unrelated external event (a car breakdown). It is not related to any of the independent variables (like the student’s academic history) or the dependent variable (their potential exam score).
This is the easiest type to deal with: You can ignore the missing values or interpolate them.
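The two options mentioned above, ignoring and interpolating, might look like this in base R. This is a sketch on a made-up vector of daily temperature readings, since linear interpolation only makes sense when the observations have a natural order such as time; the values and names are hypothetical.

```r
# Made-up daily temperature readings with one value missing
temps <- c(18, 20, NA, 24, 25)

# Option 1: ignore the missing value when summarising
mean_ignoring_na <- mean(temps, na.rm = TRUE)

# Option 2: linear interpolation between the neighbouring observed values
day      <- seq_along(temps)
observed <- !is.na(temps)
temps_filled <- approx(x = day[observed], y = temps[observed], xout = day)$y

mean_ignoring_na
temps_filled
```

Both options rely on the MCAR assumption: if the value were missing for a reason related to the temperature itself, neither would be safe.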

Missingness types

Type 2: Missing at random (MAR)

The missingness can be explained by a variable in the dataset.
However, the missingness is not related to the dependent variables.
Example: Students in a group all catch COVID and miss the exam.
Reason: The missingness (students missing the exam) is related to an observed variable (belonging to a specific group). However, it is not directly related to the unobserved variable (their exam scores).
Here we are assuming the groups are not related to academic performance.
Depending on the data, this may require more sophisticated techniques to deal with.

Missingness types

Type 3: Missing not at random (MNAR)

This missingness should not be ignored.
The cause of the missing data is related to the underlying variables.
Example: Students who fail the assignments are more likely to skip the exam.
Reason: The missingness (students who miss the exam) is directly related to the value of the missing data (their exam scores).
The exam scores are more likely to be missing because of the failed assignment grades.
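A small simulation can show why the type of missingness matters. The sketch below is illustrative only: the exam scores are simulated, and the group labels, missingness probabilities, and seed are all assumptions rather than values from a real dataset.

```r
set.seed(2250)  # arbitrary seed so the sketch is reproducible

n     <- 10000
score <- rnorm(n, mean = 60, sd = 15)            # simulated exam scores
group <- sample(c("A", "B"), n, replace = TRUE)  # drawn independently of score

# MCAR: every student has the same 30% chance of missing the exam
miss_mcar <- runif(n) < 0.3

# MAR: missingness depends only on the observed group
miss_mar <- runif(n) < ifelse(group == "A", 0.6, 0.1)

# MNAR: students with lower scores are more likely to be missing
miss_mnar <- runif(n) < plogis((60 - score) / 10)

# Mean of the scores that remain observed under each mechanism
observed_means <- c(
  true = mean(score),
  MCAR = mean(score[!miss_mcar]),
  MAR  = mean(score[!miss_mar]),
  MNAR = mean(score[!miss_mnar])
)
round(observed_means, 1)
```

Under MCAR and MAR (with a group that is unrelated to performance) the observed mean stays close to the true mean, while under MNAR the observed mean is pushed upwards because the low scores are the ones that go missing.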

Your turn

Your turn

What type of missingness is each of the following:

In a tobacco study:

Younger participants report their values less often (regardless of how much they smoke).
A survey participant unintentionally skips a question.
Participants who smoke intentionally withhold details about their smoking habits.

Summary

Wrap Up

Summary

Learnt about infographics
Know to combine text and other narrative elements to improve the communication of the key messages
Also learnt that people aren’t great at understanding probabilities and uncertainty
Uncertainty is a challenge for visualisation and communication
Waffle plots are great at communicating chance
Discussed the importance of visualising missing data

Solutions

ggplot(stock, aes(x = date, y = close)) +
  geom_line() +
  geom_vline(xintercept = as.numeric(as.Date("2020-01-01")), linetype = 2, color = "red", alpha = 0.4) +
  geom_vline(xintercept = as.numeric(as.Date("2020-03-18")), linetype = 2, color = "blue", alpha = 0.4) +
  geom_vline(xintercept = as.numeric(as.Date("2020-11-10")), linetype = 2, color = "blue", alpha = 0.4) +
  # spring
  annotate("rect", xmin = as.Date("2021-02-01"), xmax = as.Date("2022-12-01"), ymin = 20, ymax = 38, fill = "blue") +
  annotate("segment", x = as.Date("2020-03-30"), xend = as.Date("2021-03-01"), y = 40, yend = 30, color = "blue") +
  annotate("text", x = as.Date("2022-01-01"), y = 30, label = "Apple spring 2020 event", color = "white") +
  # m1
  annotate("rect", xmin = as.Date("2021-07-01"), xmax = as.Date("2023-01-01"), ymin = 70, ymax = 88, fill = "blue") +
  annotate("segment", x = as.Date("2020-11-30"), xend = as.Date("2021-08-01"), y = 90, yend = 80, color = "blue") +
  annotate("text", x = as.Date("2022-04-01"), y = 80, label = "M1 announcement", color = "white") +
  # covid
  annotate("rect", xmin = as.Date("2017-03-01"), xmax = as.Date("2018-11-01"), ymin = 140, ymax = 158, fill = "red") +
  annotate("segment", x = as.Date("2018-01-01"), xend = as.Date("2019-12-01"), y = 140, yend = 100, color = "red") +
  annotate("text", x = as.Date("2018-01-01"), y = 150, label = "COVID-19 Pandemic", color = "white") +
  labs(x = "Date", y = "Closing price USD", title = "Apple Inc stock price during the pandemic") +
  theme_bw() +
  theme(aspect.ratio = 0.5,
        plot.title = element_text(size = 16, face = "bold", hjust = 0.5)
        )

Solutions

library(waffle)
library(ggthemes)

people_data = data.frame(
  Person = c("Maths", "Math Artists", "Math Artists that Play Poker"), 
  count = c(20*20 - 10, 7, 3))

ggplot(people_data) + 
  geom_waffle(aes(fill = Person, values = count),
    n_rows = 10, size = 1, colour = "gray90"
  ) + 
  scale_fill_colorblind() +
  coord_fixed() + 
  geom_rect(
    aes(xmin = 39.5, xmax = 40.5),
    col = "black", fill = NULL,
    ymin = 0.5, ymax = 10.5, alpha = 0
  ) +
  geom_label(x = 50, y = 7, 
            label = "The maths majors \n who are artists and play poker \n are also just \n maths majors who are artists", 
             size = 4)  +
  theme_void() + 
  labs(title = "        All the Maths Majors") + 
  theme(axis.text = element_blank(),
        legend.position = "bottom",
        plot.title = element_text(size = 20),
        legend.text = element_text(size = 10)) + 
  xlim(c(0,60))

Solutions

Note

(MAR) In a tobacco study, younger participants report their values less often (regardless of how much they smoke).
(MCAR) A survey participant unintentionally skips a question.
(MNAR) In a tobacco study, participants who smoke intentionally withhold details about their smoking habits.

The other missingness examples can be found here