ETX2250/ETF5922

Data Visualisations in R

Lecturer: Kate Saunders

Department of Econometrics and Business Statistics


  • etx2250-etf5922.caulfield-x@monash.edu
  • Lecture 4
  • dvac.ss.numbat.space


Today’s Lecture

Learning Objectives

  • Introduce you to the grammar of graphics

  • Learn how to create plots in R

  • Learn about plots to show distributions, correlations and relationships

  • Build on the plots showing rankings/comparisons of categorical variables from last lecture

Plotting in R

Base R Plotting

There are basic plotting functions in R that don’t require any packages.

  • Examples of common base R plotting functions:

    • plot() for scatterplots.
    • barplot() for bar charts.
    • hist() for histograms.
    • Check these out using the help menu, e.g. ?plot
  • But, there is limited flexibility for complex or layered plots

  • Base R plots are much harder to customise
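
As a rough sketch of these base R functions in use, the snippet below draws a scatterplot, a bar chart and a histogram. It uses mtcars, a data set built into R, as a stand-in (it is not the lecture data), and the choice of variables is only for illustration.

```r
# mtcars ships with R, so no packages or files are needed.
# It is a stand-in here, not the Boston Celtics data.

# Scatterplot: car weight against fuel efficiency
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Bar chart: number of cars for each cylinder count
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts, xlab = "Cylinders", ylab = "Number of cars")

# Histogram: distribution of fuel efficiency
mpg_hist <- hist(mtcars$mpg, xlab = "Miles per gallon", main = "")
```

Each plot is a single function call, which is convenient for a quick look but leaves little room for layering.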

Enter ggplot2

ggplot2

ggplot2 is part of the tidyverse, a collection of R packages great for data analytics and data science.

  • A powerful and flexible tool for creating layered, customisable plots.

  • The “2” in ggplot2?

    • It’s the second iteration of the ggplot package, created by Hadley Wickham.

    • ggplot2 improved upon the original package with more features and better usability.

The grammar of graphics

Grammar of graphics

Part of what makes ggplot2 so powerful is that it is built on the ideas of The Grammar of Graphics, a text by Leland Wilkinson.

  • The grammar of graphics breaks down visualisations into individual pieces that you layer together

  • This gives a set of rules for building almost any graphic

  • At first using ggplot2 will seem complicated

  • But master the grammar and you’ll create detailed plots easily

Getting started in ggplot2

Shrek

Shrek and ggplot2

Don’t be scared of ggplot2, it’s just like Shrek!

Get to know it before you judge it!

“Ogres have layers. Onions have layers. You get it? We both have layers” - Shrek

The Layers

Key layers include:

  • Data:
    • The dataset you’re visualising.
  • Aesthetic Mappings (aes() for short):
    • Map variables to visual properties like x, y, color, size, etc.
  • Geometries (geom_*):
    • Define the type of plot (e.g., bars, lines, points).
  • Scales:
    • Control how data maps to aesthetics (e.g., axis limits, color gradients).
  • Facets:
    • Split the data into multiple panels (e.g., facet_wrap(), facet_grid()).
  • Themes:
    • Customise the non-data components (e.g., background, grid lines).
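
To see how the layers listed above fit together, here is a minimal sketch that uses every one of them in a single plot. It assumes the built-in mtcars data as a stand-in for the lecture data, and the palette and faceting variable are arbitrary choices for illustration.

```r
library(ggplot2)

# mtcars is a built-in data set used as a stand-in for the lecture data
layered_plot <- ggplot(data = mtcars,                          # data
                       aes(x = wt, y = mpg, colour = hp)) +    # aesthetic mappings
  geom_point() +                                               # geometry
  scale_colour_distiller(palette = "Greens", direction = 1) +  # scale
  facet_wrap(~ cyl) +                                          # facets
  theme_bw()                                                   # theme

layered_plot
```

Try removing one line at a time to see what each layer contributes.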

Base Layer

Start by creating an empty plot on which to add your layers. We’ll add layers to this plot using +.

library(ggplot2)

ggplot()

Common Mistake

To add layers to a plot we use + not |>!
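
The sketch below shows what this mistake looks like in practice. It uses the built-in mtcars data as a stand-in, and wraps the incorrect version in tryCatch() only so that the error can be inspected without stopping the script.

```r
library(ggplot2)

# Correct: layers are added to the plot with +
good_plot <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Incorrect: the pipe passes the plot into geom_point() as its first
# argument, so ggplot2 stops with an error. tryCatch() captures the
# error here so the script keeps running.
pipe_result <- tryCatch(
  ggplot(data = mtcars, aes(x = wt, y = mpg)) |> geom_point(),
  error = function(e) e
)

# Look at the error message that ggplot2 gave
conditionMessage(pipe_result)
```

If you see an error about the mapping when building a plot, check whether a |> has crept in where a + should be.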

Data Layer

What to plot

  • First step is to add our data

  • I’m going to use this data set I prepared earlier

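library(tidyverse)  # loads readr (read_csv), dplyr (filter) and ggplot2
library(here)       # builds file paths relative to the project folder

# Read every season, then keep only the 2025 season
all_years_boston_celtics <- read_csv(here::here("data/boston_celtics.csv"))
boston_celtics <- all_years_boston_celtics |>
  filter(season == 2025)
head(boston_celtics)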
# A tibble: 6 × 57
    game_id season season_type game_date  game_date_time      team_id team_uid  
      <dbl>  <dbl>       <dbl> <date>     <dttm>                <dbl> <chr>     
1 401703390   2025           2 2024-11-19 2024-11-20 00:00:00       2 s:40~l:46…
2 401704796   2025           2 2024-11-16 2024-11-17 01:00:00       2 s:40~l:46…
3 401704784   2025           2 2024-11-13 2024-11-14 00:30:00       2 s:40~l:46…
4 401703370   2025           2 2024-11-12 2024-11-13 00:00:00       2 s:40~l:46…
5 401704768   2025           2 2024-11-10 2024-11-10 20:30:00       2 s:40~l:46…
6 401704753   2025           2 2024-11-08 2024-11-09 00:30:00       2 s:40~l:46…
# ℹ 50 more variables: team_slug <chr>, team_location <chr>, team_name <chr>,
#   team_abbreviation <chr>, team_display_name <chr>,
#   team_short_display_name <chr>, team_color <chr>,
#   team_alternate_color <chr>, team_logo <chr>, team_home_away <chr>,
#   team_score <dbl>, team_winner <lgl>, assists <dbl>, blocks <dbl>,
#   defensive_rebounds <dbl>, fast_break_points <dbl>, field_goal_pct <dbl>,
#   field_goals_made <dbl>, field_goals_attempted <dbl>, …

Reading Data

Checklist

  • Remember you need to tell R where to look for this file by setting your working directory

  • Setting up an R Project helps with this!

  • Check your current working directory with getwd()

  • Check your file is in this directory with list.files()

  • After you read in your data, look at it using View() or head()

  • Make sure it looks like what you expect

  • Also check what type the computer thinks your data is: use str(), class() or summary()

  • If all that seems good - we can add it to our plot!
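
The checklist above can be sketched in code as follows. So that it runs anywhere, the example writes a tiny made-up CSV file to a temporary folder first. The file name and values are hypothetical, and read.csv() is used as the base R counterpart of read_csv().

```r
# A tiny made-up data set, written to a temporary folder so that this
# sketch runs anywhere. In practice your file already exists.
example_dir  <- tempdir()
example_file <- file.path(example_dir, "example_scores.csv")
write.csv(
  data.frame(game_date  = c("2024-11-08", "2024-11-10"),
             team_score = c(108, 113)),
  example_file,
  row.names = FALSE
)

getwd()                   # where is R currently looking?
list.files(example_dir)   # is the file where you think it is?

# read.csv() is the base R counterpart of readr's read_csv()
example_scores <- read.csv(example_file)

head(example_scores)      # does the data look as expected?
str(example_scores)       # what type is each column?
class(example_scores)     # what kind of object is it?
summary(example_scores)   # quick numeric summaries
```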

Add your data layer

It’s still an empty plot because we haven’t told R what to do with the data yet.

ggplot(data = boston_celtics) 

Geometry Layer (geom)

geom

  • Type ?geom_ in your Console and hit Tab to scroll through a list of all the different plot geometries

  • Think of these geometries as being like the Visualisation Pane in Power BI

Our turn

  • Let’s create a bar plot showing the team_score for each game this season

  • Use the geometry layer geom_col

  • It is similar to geom_bar, but geom_col uses values in the data as the bar heights, while geom_bar counts the rows in each category by default
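
To make the difference between geom_col and geom_bar concrete, here is a small sketch on a made-up data frame (toy_games is hypothetical, not the lecture data). layer_data() is used to look at the values ggplot2 computes for each plot.

```r
library(ggplot2)

# Made-up scores for three games, for illustration only
toy_games <- data.frame(
  game   = c("G1", "G2", "G3"),
  score  = c(110, 95, 120),
  result = c("Win", "Loss", "Win")
)

# geom_col(): bar heights are values already in the data (the score)
col_plot <- ggplot(data = toy_games, aes(x = game, y = score)) +
  geom_col()

# geom_bar(): bar heights are counts of rows in each category
bar_plot <- ggplot(data = toy_games, aes(x = result)) +
  geom_bar()

# layer_data() shows the values ggplot2 will draw
layer_data(col_plot)$y      # one height per game
layer_data(bar_plot)$count  # one count per result
```

Since we already have one score per game, geom_col is the natural choice for the plot in this lecture.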

Bar Plot

Add the geom

This is what your code should look like when you add your geom layer

ggplot(data = boston_celtics) + 
  geom_col()

Warning

  • But … this code won’t work yet - we haven’t added our aesthetic layer

  • The aesthetic layer defines how data is mapped to visual properties in your plot

    • e.g. what goes on the x/y axes

Aesthetic Layer

Common Aesthetic Mappings

Use the aes() function to map variables to aesthetics.

The common parts are:

  • x: The variable on the x-axis.

  • y: The variable on the y-axis.

  • color: The color of points, lines, or outlines.

  • fill: The fill color for bars, areas, or shapes.

  • size: The size of points or lines.

  • shape: The shape of points (e.g., circles, triangles).

  • alpha: The transparency level.

Adding the aesthetic layer

ggplot(data = boston_celtics, aes(x = game_date, y = team_score)) + 
  geom_col() 

Subtle Difference

Important

  • When using only one data source you can put the aes() mapping as input into ggplot()

  • For multiple data sources - the data and aesthetics mapping should go in the geom layer directly.

ggplot() +
  geom_col(data = boston_celtics, aes(x = game_date, y = team_score))
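
As a sketch of the multiple data source case, the example below draws bars from one data frame and a reference line from another. Both data frames (game_scores and season_target) and their values are made up for illustration.

```r
library(ggplot2)

# Two small made-up data sets, for illustration only
game_scores <- data.frame(
  game  = c(1, 2, 3, 4),
  score = c(110, 95, 120, 105)
)
season_target <- data.frame(target = 100)

# Each geom is given its own data and its own aesthetic mapping
two_source_plot <- ggplot() +
  geom_col(data = game_scores, aes(x = game, y = score)) +
  geom_hline(data = season_target, aes(yintercept = target),
             linetype = "dashed")

two_source_plot
```

Because ggplot() is left empty, nothing is inherited, so every geom has to state where its data and mappings come from.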

Colours in R


Colours

  • There are 657 named colours in R e.g. “red”, “blue”, “yellow”

  • To see them all run colors()

  • Colours can also be specified using hex codes, e.g. "forestgreen" is "#228B22".
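
You can check these facts yourself with a few lines of base R. This is a small sketch using only built-in functions; the variable names are arbitrary.

```r
# How many named colours does R know about?
length(colors())

# A few of the names
head(colors())

# Convert a named colour to its red, green and blue components ...
forest_rgb <- col2rgb("forestgreen")
forest_rgb

# ... and from there to a hex code
forest_hex <- rgb(forest_rgb[1], forest_rgb[2], forest_rgb[3],
                  maxColorValue = 255)
forest_hex
```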

Colour and Fill

Colour vs Fill

Let’s set the bar colour (the outline) to green

ggplot() + 
  geom_col(data = boston_celtics, aes(x = game_date, y = team_score), col = "forestgreen") 

Colour and Fill

Colour vs Fill

Let’s set the bar fill (the interior) to green

ggplot() + 
  geom_col(data = boston_celtics, aes(x = game_date, y = team_score), fill = "forestgreen") 
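
Colour and fill can also be set together. The sketch below uses a made-up data frame (toy_games is hypothetical) so that it runs without the lecture data. Note that col, as used in the code above, is accepted by ggplot2 as a short form of colour.

```r
library(ggplot2)

# Made-up scores, for illustration only
toy_games <- data.frame(game = c(1, 2, 3), score = c(110, 95, 120))

# colour sets the outline of each bar, fill sets its interior
outline_and_fill <- ggplot() +
  geom_col(data = toy_games, aes(x = game, y = score),
           colour = "black", fill = "forestgreen")

outline_and_fill
```

Both values sit outside aes() because they are fixed for every bar rather than mapped from a variable.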

Scale Layer

Next Layer

Now we are happy with our:

  • data layer
  • aesthetic layer, and
  • geometry layer

we can work on the scale layer, e.g. axis limits and colour scales.

Scales: Colour and Fill

Scale mapping

To map a variable to the colour or fill, we need to specify the mapping in aes().

ggplot(data = boston_celtics) + 
  geom_col(aes(x = game_date, y = team_score, fill = team_winner))

Using In Built Fill/Colour Scales

There are many inbuilt palettes, see the RColorBrewer package

ggplot(data = boston_celtics) + 
  geom_col(aes(x = game_date, y = team_score, fill = team_winner), col = "gray") + 
  scale_fill_brewer(palette = "Greens") 

Manual fill and colour scales

You can also change fill/colour scales manually using scale_colour_manual or scale_fill_manual.

ggplot(data = boston_celtics) + 
  geom_col(aes(x = game_date, y = team_score, fill = team_winner), col = "gray") +
  scale_fill_manual(labels = c("Loss", "Win"), values = c("TRUE" = "forestgreen", "FALSE" = "lightgreen"))

Colour Scales

IMO: Colour scales are one of the hardest parts of learning ggplot2

  • To change colour scale, use scale_colour_*

  • To change fill scale, use scale_fill_*

  • Check out all the different types of scales using the help menu: type ?scale_colour_ in your Console and hit Tab.

  • Note for discrete variables needing distinct colours, such as categorical variables, you can use scale_*_brewer

  • For continuous variables needing a smooth gradient use scale_*_distiller

  • You can also set colours manually using scale_*_manual

Note * here is like a blank space and it means there are multiple things that could be inserted here
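
As a sketch of the discrete versus continuous distinction, the example below colours the same scatterplot in two ways. It assumes the built-in mtcars data as a stand-in, and the palettes are arbitrary choices.

```r
library(ggplot2)

# Discrete variable (number of cylinders): one distinct colour per category
discrete_plot <- ggplot(data = mtcars) +
  geom_point(aes(x = wt, y = mpg, colour = factor(cyl))) +
  scale_colour_brewer(palette = "Dark2", name = "Cylinders")

# Continuous variable (horsepower): a smooth colour gradient
continuous_plot <- ggplot(data = mtcars) +
  geom_point(aes(x = wt, y = mpg, colour = hp)) +
  scale_colour_distiller(palette = "Greens", name = "Horsepower")

discrete_plot
continuous_plot
```

Wrapping cyl in factor() tells ggplot2 to treat the numbers as categories, which is what the brewer scale expects.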

Theme Layer

Themes

  • Here is a list of the themes.

  • My favourite is theme_bw().

Default Theme

Grey Background

  • The default background for ggplot2 is arguably chartjunk.

  • But - There are good reasons for using it.

“We can still see the gridlines to aid in the judgement of position (Cleveland, 1993b), but they have little visual impact and we can easily “tune” them out… Finally, the grey background creates a continuous field of colour which ensures that the plot is perceived as a single visual entity.”
Wickham on the grey background. Source: ggplot2: Elegant Graphics for Data Analysis.

Changing Theme Background

Here I change the theme background to theme_bw().

ggplot(data = boston_celtics) + 
  geom_col(aes(x = game_date, y = team_score, fill = team_winner), col = "gray") +
  scale_fill_manual(labels = c("Loss", "Win"), values = c("TRUE" = "forestgreen", "FALSE" = "lightgreen")) + 
  theme_bw() 

Plot Theme Specifics

  • To tune the more specific aspects of your theme, we use the theme() layer.

  • Look up ?theme there are a lot of options!

Changing Theme Specifics

To improve data density we can move the legend to the bottom and remove the legend title.

ggplot(data = boston_celtics) + 
  geom_col(aes(x = game_date, y = team_score, fill = team_winner), col = "gray") +
  scale_fill_manual(labels = c("Loss", "Win"), values = c("TRUE" = "forestgreen", "FALSE" = "lightgreen")) + 
  theme_bw() + 
  theme(legend.position = "bottom", legend.title = element_blank())

Polishing your plot

The theme() layer is also where you can add specifics about titles, text and axes.

celtics_plot <- ggplot(data = boston_celtics) + 
  geom_col(aes(x = game_date, y = team_score, fill = team_winner), col = "gray") +
  scale_fill_manual(labels = c("Loss", "Win"), values = c("TRUE" = "forestgreen", "FALSE" = "lightgreen")) + 
  theme_bw() + 
  theme(
    legend.position = "bottom", 
    legend.title = element_blank(),
    plot.title = element_text(size = 20), 
    axis.title.x = element_text(size = 15),  
    axis.title.y = element_text(size = 15),  
    axis.text = element_text(size = 12),      
    legend.text = element_text(size = 12)) + 
  xlab("Game Date") + 
  ylab("Team Score") + 
  ggtitle("Boston Celtics 2025 Season so far")

Final plot

celtics_plot

Your turn

Recreate this plot - But experiment with the different parts to see how they work!

  • Change the fixed fill and the colour

  • Change the colour/fill palette

  • Experiment with different ranges for the y-axis (the x-axis is tricky, so don’t worry about that for now)

  • Change the titles and axes labels

  • Change the theme

Do these one at a time so you understand what each piece of code does!

Distribution Plots

Instead of comparing the points scored per game, I might like to look at the distribution of points scored in a game.

  • What’s the highest number of points scored?
  • What’s the lowest number of points scored?
  • What’s the average number of points scored?
  • How much variation is there in the number of points scored each game?

Histogram

[Figures: three histograms, one with too many bins, one with too few bins, and one with a better number of bins]

Density Plot

[Figure: density plot]

Box Plot

[Figure: box plot]

Histogram

When to Use

  • Use to visualise the distribution of a single numeric variable.
  • Good for identifying the shape of data (e.g., normal, skewed, bimodal).
  • Example: Examining the frequency of income ranges.

Watch out for

  • Too few bins: Hides important details
  • Too many bins: Creates noise in the data
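
The effect of the number of bins can be explored with the bins and binwidth arguments of geom_histogram(). The sketch below uses simulated scores as a stand-in for the lecture data. The mean, standard deviation and bin settings are made up for illustration.

```r
library(ggplot2)

# Simulated scores stand in for the lecture data
set.seed(2250)
toy_scores <- data.frame(score = rnorm(200, mean = 115, sd = 12))

# Too many bins: noisy
many_bins <- ggplot(data = toy_scores, aes(x = score)) +
  geom_histogram(bins = 100)

# Too few bins: detail is lost
few_bins <- ggplot(data = toy_scores, aes(x = score)) +
  geom_histogram(bins = 3)

# A middle ground; binwidth = 5 means each bar covers 5 points
better_bins <- ggplot(data = toy_scores, aes(x = score)) +
  geom_histogram(binwidth = 5)

better_bins
```

There is no single correct choice, so it is worth trying a few values and checking that the overall shape is stable.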

Density Plot

When to use

  • Smoothed alternative to a histogram
  • Best for showing the underlying distribution without binning
  • Example: Distribution of test scores.

Watch out for

  • Over-smoothing or under-smoothing
    • Over-smoothing hides important features like multiple peaks.
    • Under-smoothing makes the plot noisy.
  • Inappropriate for small datasets
    • Need sufficient data points to estimate the density curve
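
The amount of smoothing in geom_density() is controlled by the adjust argument. The sketch below uses simulated scores with two peaks as a stand-in for the lecture data. The adjust values are arbitrary and chosen to exaggerate the effect.

```r
library(ggplot2)

# Simulated scores with two peaks stand in for the lecture data
set.seed(5922)
toy_scores <- data.frame(
  score = c(rnorm(100, mean = 100, sd = 5), rnorm(100, mean = 125, sd = 5))
)

# adjust multiplies the default bandwidth:
# below 1 gives a wigglier curve, above 1 gives a smoother curve
under_smoothed <- ggplot(data = toy_scores, aes(x = score)) +
  geom_density(adjust = 0.2)

over_smoothed <- ggplot(data = toy_scores, aes(x = score)) +
  geom_density(adjust = 4)

default_smoothing <- ggplot(data = toy_scores, aes(x = score)) +
  geom_density()

# The same idea in base R: the bandwidth scales with adjust
bandwidth_ratio <- density(toy_scores$score, adjust = 4)$bw /
  density(toy_scores$score)$bw
bandwidth_ratio
```

Print the three plots one after another: the over-smoothed curve tends to merge the two peaks into one, while the under-smoothed curve is noisy.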

Boxplot

When to Use

  • Use to summarize the central tendency and spread of a numeric variable.
  • Provides a 5 number summary
    • min, first quartile (Q1), median, third quartile (Q3), max
  • Can be used to show outliers compared to the main data distribution.
  • Example: Summarising salary bands

Watch out for

  • Outliers are not necessarily errors! Leave them in, unless you have a good reason.

  • R automatically plots points as outliers if they lie more than 1.5 times the interquartile range (IQR) below Q1 or above Q3

  • A boxplot is a summary. If the shape of the underlying distribution is important, try geom_violin()
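
The five number summary and the 1.5 times IQR rule can be worked through by hand in base R. This is a sketch on made-up scores (toy_scores is hypothetical), using the default quantile() method. Plotting functions may compute the quartiles slightly differently, so treat it as an illustration of the rule rather than an exact reproduction of any one plot.

```r
# Made-up scores with one unusually high game, for illustration only
toy_scores <- c(98, 102, 105, 108, 110, 112, 115, 118, 121, 160)

# Five number summary: min, Q1, median, Q3, max
five_numbers <- quantile(toy_scores, probs = c(0, 0.25, 0.5, 0.75, 1))
five_numbers

# The 1.5 x IQR rule: fences sit 1.5 IQRs beyond the quartiles
q1 <- five_numbers[["25%"]]
q3 <- five_numbers[["75%"]]
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Points outside the fences are drawn individually as outliers
flagged <- toy_scores[toy_scores < lower_fence | toy_scores > upper_fence]
flagged
```

Here only the 160 point game falls outside the fences. Whether it is an error or just a great game is a question for you, not for the plot.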

A guide

Choosing the Right Plot

Plot Type    | Best For                     | Features
Histogram    | Raw frequency counts         | Splits the data into bins
Density Plot | Smoothed distribution        | Continuous, no bins
Box plot     | Showing the 5 number summary | Shows outliers as points

Key Takeaways

Pick the appropriate plot for your data and the relationship you want to show

Histogram and Density

Key Takeaways

And sometimes the best plot combines more than one geometry together!

boston_celtics = read_csv("data/boston_celtics.csv")

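ggplot(data = boston_celtics, aes(x = team_score)) +
  # Histogram on the density scale so it is comparable with the density curve
  geom_histogram(aes(y = after_stat(density)),
                 alpha = 0.4,
                 col = "lightgray") +
  # Smoothed density; adjust > 1 smooths more than the default
  geom_density(col = "red",
               adjust = 1.25,
               linewidth = 1.1) +
  # Dotted vertical line at the mean team score
  geom_vline(aes(xintercept = mean(team_score)),
             col = "red",
             linetype = "dotted",
             linewidth = 1.1) +
  theme_bw() +
  ggtitle("Boston Celtics Team Score 2021 - Present")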

Plots showing relationships

Correlation and Relationships

Plots showing correlation and relationships

  • Scatter Plot (geom_point)
  • Bubble Plots (size)
  • Line Plots (geom_line)
  • Heatmaps (geom_hex, geom_bin2d)
  • Contour Plots (geom_density_2d)

And their variations!

Scatter Plot

Good for looking at relationships between two numeric variables

field_goals_scatter_plot <- ggplot(data = all_years_boston_celtics) + 
  geom_point(aes(x = field_goals_attempted, y = field_goals_made)) + 
  theme_bw() + 
  xlab("Field Goals Attempted (2 pts)") + 
  ylab("Field Goals Made (2 pts)") + 
  theme(legend.position = "bottom") + 
  ggtitle("Boston Celtics Field Goals 2021 - Present")

Scatter Plot

field_goals_scatter_plot

What if you want to add more variables?

Say I want to show the total points scored as well, or which games they won?

Back to preattentive processing

  • Could try changing point colour
  • Could try changing point size

Point Colour

field_goals_colour_plot <-
  all_years_boston_celtics |>
  ggplot() + 
  geom_point(aes(x = field_goals_attempted, y = field_goals_made, col = team_score), alpha = 0.4) +
  scale_colour_distiller(palette = "Greens") +
  theme_bw() + 
  xlab("Field Goals Attempted (2 pts)") + 
  ylab("Field Goals Made (2 pts)") + 
  theme(legend.position = "bottom") + 
  ggtitle("Boston Celtics Field Goals 2021 - Present") + 
  coord_fixed()

Points by Colour

field_goals_colour_plot

Point Size: Bubble Plot

field_goals_bubble_plot <-
  all_years_boston_celtics |>
  # filter(season == 2025) |>
  ggplot() + 
  geom_point(aes(x = field_goals_attempted, y = field_goals_made, size = team_score), alpha = 0.4) + 
  scale_size_continuous(
    name = "Team Score",
    breaks = c(110, 115, 120, 125, 130)
  ) +
  theme_bw() + 
  xlab("Field Goals Attempted (2 pts)") + 
  ylab("Field Goals Made (2 pts)") + 
  theme(legend.position = "bottom") + 
  ggtitle("Boston Celtics Field Goals 2021 - Present") + 
  coord_fixed()

Bubble Plot

field_goals_bubble_plot

Bubble Plot ctd

Watch out for

In both scatter and bubble plots overplotting can occur. Changing transparency can help, but sometimes there are too many points!

Solution

In these instances a heatmap can help.

A heatmap is like a histogram, but the data are binned along both the x and y axes.

Heatmap and Density

# geom_hex() requires the hexbin package to be installed
field_goals_heatmap_all <-
  ggplot(data = all_years_boston_celtics) + 
  geom_hex(
    aes(
      x = field_goals_attempted, 
      y = field_goals_made
    )
  ) + 
  scale_fill_distiller(palette = "Greens", direction = 1, name = "Count") +
  geom_density_2d(
    aes(
      x = field_goals_attempted, 
      y = field_goals_made
    ),
    alpha = 0.4, adjust = 1.2, col = "forestgreen"
  ) +
  theme_bw() +
  xlab("Field Goals Attempted (2 pts)") + 
  ylab("Field Goals Made (2 pts)") + 
  ggtitle("Boston Celtics Field Goals 2021 - 2024") + 
  coord_fixed()

Heatmap and Density

field_goals_heatmap_all

Summary

  • Learnt about plotting in R.

    • This included how to use the ggplot2 package in R, and

    • The grammar of graphics

  • Learnt ggplot2 is like Shrek!

    • It is an onion with many layers.
  • Learnt how to create a range of plots in R

    • These include common plots to show correlations and relationships
  • Remember the tips for learning R from Lecture 1 if you get stuck!